ROSA: Harnessing Robot States for Vision-Language and Action Alignment

Yuqing Wen1*,    Kefan Gu2*,    Haoxuan Liu3*,    Yucheng Zhao3†,    Tiancai Wang3,    Haoqiang Fan3,    Xiaoyan Sun1‡

1University of Science and Technology of China,    2Nanjing University,    3Dexmal

*This work was done during an internship at Dexmal.    †Project lead.   ‡Corresponding author.

  Paper   Github Repo   BibTex


We propose a novel generalist policy for robot control, named ROSA, which achieves high success rates across multiple tasks in both RLBench and real-world settings. It also demonstrates strong generalization to unseen tasks.

🎯   Motivation   🎯

The spatial and temporal gaps in adapting VLMs to VLAs. VLMs are pretrained on large-scale VQA datasets to describe the current, high-level semantics of an image, whereas VLAs must predict low-level future actions in 3D space. This spatial-temporal gap complicates the alignment process and leads to data inefficiency when developing VLAs.

🌟   A Novel Training Paradigm   🌟

Expert Action Data + Robot State Estimation Data

Two types of data used by ROSA to train VLA models. (a) Expert action prediction data, which requires human effort to collect. (b) Robot state estimation data, which is collected automatically, without human involvement, by letting the robot move randomly. ROSA performs joint training on these two types of data.
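To make the joint-training recipe concrete, below is a minimal sketch (not the official ROSA code) of how batches could mix the two data sources. The dataset variables, sample format, and mixing ratio are assumptions for illustration only.

```python
# Hypothetical sketch of joint training data mixing; names and fields are assumptions.
import random

def make_joint_batches(expert_samples, state_samples, batch_size=8, state_ratio=0.5):
    """Yield batches mixing expert action-prediction samples with
    automatically collected robot state-estimation samples."""
    n_state = int(batch_size * state_ratio)   # samples asking about the robot's current state
    n_expert = batch_size - n_state           # samples asking which action to take next
    while True:
        batch = random.sample(expert_samples, n_expert) + random.sample(state_samples, n_state)
        random.shuffle(batch)
        yield batch

# Each sample is assumed to be an (image, prompt, target_text) triple, e.g.
# expert: (img, "What action should the robot take to stack the cups?", "<action tokens>")
# state:  (img, "What is the current end-effector pose?",               "<state tokens>")
```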

🤖   Architecture   🤖

Overview of the ROSA architecture. ROSA adopts a classic VLM architecture. Image observations are encoded into image tokens by a vision encoder and a projector. These image tokens are combined with text tokens and fed into an LLM. The model is trained with an autoregressive next-token prediction objective.
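The forward pass described above can be sketched as follows. This is a minimal illustration assuming a HuggingFace-style causal LM interface; the module names, dimensions, and label convention are assumptions, not the released ROSA implementation.

```python
import torch
import torch.nn as nn

class VLAPolicy(nn.Module):
    """Minimal VLM-style policy sketch: vision encoder -> projector -> LLM."""

    def __init__(self, vision_encoder, llm, vision_dim=1024, llm_dim=4096):
        super().__init__()
        self.vision_encoder = vision_encoder             # e.g. a ViT producing (B, N_img, vision_dim) patch features
        self.projector = nn.Linear(vision_dim, llm_dim)  # maps image features into the LLM token space
        self.llm = llm                                   # decoder-only LLM with a HuggingFace-style interface (assumed)

    def forward(self, images, text_token_embeds, labels):
        # Encode image observations and project them into the LLM embedding space.
        image_tokens = self.projector(self.vision_encoder(images))    # (B, N_img, llm_dim)
        # Combine image tokens with text token embeddings.
        inputs = torch.cat([image_tokens, text_token_embeds], dim=1)  # (B, N_img + N_txt, llm_dim)
        # Autoregressive next-token prediction; labels at image positions are assumed to be -100
        # so the loss is computed only on the text/action tokens.
        return self.llm(inputs_embeds=inputs, labels=labels).loss
```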

🔥   Effectiveness Under Different Data Scales   🔥

Performance under varying data scales. ROSA consistently outperforms the baseline across data scales, with particularly significant advantages under limited-data conditions.

😎   Generalization Ability   😎

Performance on unseen tasks. ROSA's superior performance over the baseline demonstrates its strong generalization capability.

🏆   Comparison with Previous Methods   🏆

Comparison with previous methods on RLBench. ROSA outperforms these related methods, highlighting its superior effectiveness.

BibTex

                
@misc{wen2025rosaharnessingrobotstates,
    title={ROSA: Harnessing Robot States for Vision-Language and Action Alignment}, 
    author={Yuqing Wen and Kefan Gu and Haoxuan Liu and Yucheng Zhao and Tiancai Wang and Haoqiang Fan and Xiaoyan Sun},
    year={2025},
    eprint={2506.13679},
    archivePrefix={arXiv},
    primaryClass={cs.RO},
    url={https://arxiv.org/abs/2506.13679}, 
}

Contact

Feel free to contact us at wenyuqing AT mail.ustc.edu.cn or wangtiancai AT megvii.com.
