Zero-Shot Human-Object Interaction Synthesis
with Multimodal Priors (arXiv 2025)

Yuke Lou1,*        Yiming Wang2,*        Zhen Wu3        Rui Zhao4       
Wenjia Wang1        Mingyi Shi1        Taku Komura1, †        

1The University of Hong Kong      2ETH Zurich     3Stanford University     4Tencent Robotics X    

(*: equal contribution.)

The diverse human-object interactions generated by our method: guitar, barbell, umbrella, microphone, flag, chair, box, water can, and yoga ball.


Abstract

Human-object interaction (HOI) synthesis is important for various applications, ranging from virtual reality to robotics. However, acquiring 3D HOI data is challenging due to its complexity and high cost, which limits existing methods to the narrow range of object types and interaction patterns found in training datasets. This paper proposes a novel zero-shot HOI synthesis framework that does not rely on end-to-end training on the currently limited 3D HOI datasets. The core idea of our method is to leverage the extensive HOI knowledge in pre-trained multimodal models. Given a text description, our system first obtains temporally consistent 2D HOI image sequences using image or video generation models, which are then lifted to 3D HOI milestones of human and object poses. We employ pre-trained human pose estimation models to extract human poses and introduce a generalizable category-level 6-DoF estimation method to obtain the object poses from 2D HOI images. Our estimation method adapts to various object templates obtained from text-to-3D models or online retrieval. Physics-based tracking of the 3D HOI kinematic milestones is then applied to refine both body motions and object poses, yielding more physically plausible HOI generation results. The experimental results demonstrate that our method is capable of generating open-vocabulary HOIs with physical realism and semantic diversity.
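To make the notion of a 6-DoF object pose concrete, the following is a minimal sketch of how an estimated rigid pose (a rotation plus a translation) could place the vertices of an object template in the scene. It is an illustration only, not the estimation method from the paper: the names `Pose6DoF`, `rotation_z`, and `place_template` are hypothetical, and the rotation is restricted to the z-axis for brevity, whereas a general pose has three rotational degrees of freedom.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple

Vec3 = Tuple[float, float, float]
Mat3 = Tuple[Vec3, Vec3, Vec3]


@dataclass
class Pose6DoF:
    """Rigid pose (hypothetical container): 3x3 rotation matrix plus translation."""
    rotation: Mat3
    translation: Vec3

    def apply(self, p: Vec3) -> Vec3:
        """Map a point from the template frame to the scene frame: R @ p + t."""
        r, t = self.rotation, self.translation
        x = r[0][0] * p[0] + r[0][1] * p[1] + r[0][2] * p[2] + t[0]
        y = r[1][0] * p[0] + r[1][1] * p[1] + r[1][2] * p[2] + t[1]
        z = r[2][0] * p[0] + r[2][1] * p[1] + r[2][2] * p[2] + t[2]
        return (x, y, z)


def rotation_z(angle_rad: float) -> Mat3:
    """Rotation about the z-axis (one of the three rotational degrees of freedom)."""
    c, s = math.cos(angle_rad), math.sin(angle_rad)
    return ((c, -s, 0.0), (s, c, 0.0), (0.0, 0.0, 1.0))


def place_template(vertices: List[Vec3], pose: Pose6DoF) -> List[Vec3]:
    """Transform every template vertex by the given pose."""
    return [pose.apply(v) for v in vertices]


if __name__ == "__main__":
    # Rotate a two-vertex "template" by 90 degrees about z, then shift it along x.
    pose = Pose6DoF(rotation_z(math.pi / 2), (1.0, 0.0, 0.0))
    print(place_template([(1.0, 0.0, 0.0), (0.0, 0.0, 2.0)], pose))
```

In this representation, one pose per frame is enough to describe the motion of a rigid object, which is why the same template can be reused across all milestones of a sequence.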

Framework Overview
[Figure: overview of the framework]
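The stages named in the abstract (2D sequence generation, human and object pose estimation, physics-based tracking) can be read as a simple composition of functions. The sketch below shows only that data flow under stated assumptions: `Milestone`, `synthesize_hoi`, and every stage callable are hypothetical names, and the stand-in lambdas in `demo` are placeholders rather than real models or the authors' implementation.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Sequence


@dataclass
class Milestone:
    """One kinematic keyframe: a human pose plus an object pose (hypothetical container)."""
    human_pose: Any
    object_pose: Any


def synthesize_hoi(
    prompt: str,
    object_template: Any,
    generate_frames: Callable[[str], Sequence[Any]],
    estimate_human_pose: Callable[[Any], Any],
    estimate_object_pose: Callable[[Any, Any], Any],
    track_with_physics: Callable[[List[Milestone]], List[Milestone]],
) -> List[Milestone]:
    """Compose the stages: text -> 2D frames -> 3D milestones -> physics refinement."""
    # Stage 1: obtain a 2D HOI image sequence from the text description.
    frames = generate_frames(prompt)
    # Stage 2: lift each frame to a 3D milestone (human pose and object pose).
    milestones = [
        Milestone(estimate_human_pose(f), estimate_object_pose(f, object_template))
        for f in frames
    ]
    # Stage 3: refine the kinematic milestones with a physics-based tracker.
    return track_with_physics(milestones)


def demo() -> List[Milestone]:
    """Run the pipeline with trivial stand-in stages (placeholders, not real models)."""
    return synthesize_hoi(
        prompt="a person lifts a box",
        object_template="box-template",
        generate_frames=lambda text: [f"frame-{i}" for i in range(3)],
        estimate_human_pose=lambda frame: ("human", frame),
        estimate_object_pose=lambda frame, template: (template, frame),
        track_with_physics=lambda ms: list(ms),
    )


if __name__ == "__main__":
    print(len(demo()))
```

Passing the stages in as callables keeps the sketch independent of any particular generation, estimation, or simulation library; each one could be swapped for a real model without changing the overall flow.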

BibTeX

@misc{lou2025zeroshothumanobjectinteractionsynthesis,
    title={Zero-Shot Human-Object Interaction Synthesis with Multimodal Priors}, 
    author={Yuke Lou and Yiming Wang and Zhen Wu and Rui Zhao and Wenjia Wang and Mingyi Shi and Taku Komura},
    year={2025},
    eprint={2503.20118},
    archivePrefix={arXiv},
    primaryClass={cs.GR},
    url={https://arxiv.org/abs/2503.20118}, 
}