7. Wearable AI
Mapping the activities of others onto an egocentric perspective is a fundamental skill that humans master from a very early age.
Wearable AI, or egocentric AI, is in essence a robotics application. Devices such as smart glasses, neural wristbands, and AR headsets (Meta Project Aria [180], VisionProTeleop [253]) use AI to perceive the user's surroundings, understand spatial context, and make predictions [218, 304, 323].
Although large amounts of data have been collected from the egocentric perspective (via wearable devices), it is also crucial for AI agents to learn directly from demonstration videos captured from different viewpoints.
Only a few datasets record time-synchronized egocentric and exocentric videos of the same environment. Generalizing embodied action learning requires converting between third-person and first-person perspectives [53, 257].
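Such ego-exo transfer ultimately rests on geometric alignment between calibrated, time-synchronized cameras. The following is only a minimal sketch of that idea (not taken from any cited work; the pose conventions and helper names are assumptions): a 3D point observed in the exocentric camera frame is mapped into the egocentric camera frame via the two world-from-camera extrinsics and then reprojected with the ego intrinsics.

```python
# Minimal sketch of exo-to-ego geometric transfer (illustrative only).
# Assumes calibrated, time-synchronized cameras with 4x4 world-from-camera
# extrinsics T_world_exo and T_world_ego; all names here are hypothetical.
import numpy as np

def exo_point_to_ego(p_exo: np.ndarray,
                     T_world_exo: np.ndarray,
                     T_world_ego: np.ndarray) -> np.ndarray:
    """Map a 3D point from the exocentric camera frame to the egocentric frame."""
    p_h = np.append(p_exo, 1.0)                     # homogeneous coordinates
    p_world = T_world_exo @ p_h                     # exo camera frame -> world
    p_ego = np.linalg.inv(T_world_ego) @ p_world    # world -> ego camera frame
    return p_ego[:3]

def project(p_cam: np.ndarray, K: np.ndarray) -> np.ndarray:
    """Pinhole projection of a camera-frame point with 3x3 intrinsics K."""
    uv = K @ p_cam
    return uv[:2] / uv[2]

# Toy example: ego camera offset 0.5 m from the exo camera, point 2 m ahead.
T_world_exo = np.eye(4)
T_world_ego = np.eye(4); T_world_ego[:3, 3] = [0.5, 0.0, 0.0]
K = np.array([[600.0, 0.0, 320.0], [0.0, 600.0, 240.0], [0.0, 0.0, 1.0]])
p_ego = exo_point_to_ego(np.array([0.0, 0.0, 2.0]), T_world_exo, T_world_ego)
print(project(p_ego, K))    # pixel location of the same point in the ego view
```

In practice, learned ego-exo view translation replaces these explicit extrinsics with models trained on paired views, which is precisely why time-synchronized ego-exo capture matters for datasets.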
8. Dataset Requirements
Based on the analysis above, the requirement list can be stated as follows:
1) The dataset is intended to facilitate research on large-scale embodied learning tasks.
2) The dataset supports generalization to new objects, new environments, new tasks, and even new embodiments.
3) The dataset meets diversity requirements across embodiments, time, locations, viewpoints, objects, and skills.
4) The dataset provides sufficiently accurate ground truth: calibration, synchronization, mapping and localization, and annotation.
5) The dataset complies with privacy and ethical standards: de-identification.
6) The dataset includes both real and simulated data, enabling real2sim and sim2real transfer.
7) The dataset includes exo-ego view data, supporting flexible conversion between exocentric and egocentric views.
8) The dataset follows a unified format standard, convertible across various data formats (see the sketch after this list).
9) The dataset provides evaluation benchmarks for perception, cognition (reflection, reasoning, planning), and action (manipulation).
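As a rough illustration of requirements 4) and 8), the sketch below defines one possible unified per-episode record with synchronized timestamps and calibration metadata, plus a converter to flat per-step records. The schema and all field names are hypothetical assumptions for illustration, not an established dataset standard.

```python
# A minimal sketch of a unified episode record and a format converter.
# The schema and field names are hypothetical, not an existing standard.
from dataclasses import dataclass, field
from typing import Any, Dict, List

@dataclass
class Frame:
    timestamp_ns: int                 # synchronized clock shared across sensors
    images: Dict[str, Any]            # camera name -> image array (ego/exo views)
    proprio: List[float]              # joint positions / end-effector pose
    action: List[float]               # commanded action at this step

@dataclass
class Episode:
    embodiment: str                   # e.g., "franka_arm", "humanoid"
    task: str                         # natural-language task description
    calibration: Dict[str, Any]       # per-camera intrinsics/extrinsics (ground truth)
    frames: List[Frame] = field(default_factory=list)

def to_flat_records(ep: Episode) -> List[Dict[str, Any]]:
    """Flatten an Episode into per-step dicts, e.g., for export to another format."""
    return [{
        "embodiment": ep.embodiment,
        "task": ep.task,
        "t": f.timestamp_ns,
        "obs/images": f.images,
        "obs/proprio": f.proprio,
        "action": f.action,
    } for f in ep.frames]
```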
9. Conclusion
This paper has outlined the evolution from traditional AI to LLMs, VLMs, agents, spatial intelligence, and embodied AI, and analyzed policy training for embodied actions/behaviors, the embodiment flexibility of data-capture platforms, simulation platforms, and egocentric/wearable AI. It then distilled the essential requirements for building datasets.
Finally, it discussed generalization techniques in embodied AI, which provide insights for embodied data capture.
9.1 Generalization Techniques
Policy generalization approaches in embodied AI include the following.
1) Sim-to-real domain transfer in RL [25, 46] (illustrated in the sketch after this list);
2) Data augmentation and generative AI models (e.g., GANs and diffusion policies) [127];
3) Data scale and diversity (Open-X) [195];
4) Intermediate representations (including affordance estimation) [355];
5) Large-scale model architectures (Transformers) [113, 172];
6) Pretrained large foundation models [299];
7) Post-training fine-tuning [200];
8) Inference-time optimization [334].
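Technique 1), for example, is commonly realized through domain randomization: physical and visual simulator parameters are resampled every episode so that the real world looks like just one more sample from the training distribution. The loop below is only a generic sketch of that idea; the parameter ranges and the reset_simulation / run_episode / update_policy hooks are hypothetical, not tied to any particular simulator or algorithm cited above.

```python
# Generic domain-randomization loop for sim-to-real RL (illustrative only).
# Parameter names, ranges, and the callback hooks are hypothetical.
import random

def sample_randomized_params() -> dict:
    return {
        "friction":        random.uniform(0.5, 1.5),   # contact friction scale
        "object_mass_kg":  random.uniform(0.05, 0.5),  # manipulated-object mass
        "latency_steps":   random.randint(0, 3),       # actuation delay
        "light_intensity": random.uniform(0.3, 1.0),   # visual randomization
        "camera_jitter_m": random.uniform(0.0, 0.02),  # camera pose noise
    }

def train(num_episodes: int, reset_simulation, run_episode, update_policy):
    """Resample simulator parameters every episode, then do a standard RL update."""
    for _ in range(num_episodes):
        params = sample_randomized_params()
        env = reset_simulation(**params)   # hypothetical simulator reset hook
        trajectory = run_episode(env)      # roll out the current policy
        update_policy(trajectory)          # any RL algorithm (PPO, SAC, ...)
```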
References
- J Aloimonos, I Weiss, A Bandyopadhyay, “Active vision”, IJCV, vol. 1, Jan. 1987
- R. Bajcsy, “Active Perception”, IEEE Proceedings, Vol 76, No 8, Aug. 1988.
- B. M. Yamauchi, “Packbot: a versatile platform for military robotics” (iRobot), Unmanned ground vehicle technology VI, vol. 5422. SPIE, Sept. 2004.
- N. Koenig and A. Howard, “Design and use paradigms for Gazebo, an open-source multi-robot simulator,” IEEE/RSJ IROS, Oct., 2004.
- P. R. Wurman, R. D’Andrea, and M. Mountz, “Coordinating hundreds of cooperative, autonomous vehicles in warehouses” (Kiva Systems), AI magazine, 29 (1), July, 2008.
- M. Raibert, K. Blankespoor, G. Nelson, and R. Playter, “Bigdog, the rough-terrain quadruped robot,” IFAC Proceedings Volumes, 41(2), July 2008.
- J Deng, W Dong, R Socher, and et al. “ImageNet: A large-scale hierarchical image database”, IEEE CVPR, Aug. 2009
- G. Echeverria, N. Lassabe, A. Degroote and S. Lemaignan, “Modular open robots simulation engine: MORSE,” IEEE ICRA, May, 2011.
- E. Todorov, T. Erez, and Y. Tassa, “MuJoCo: A physics engine for model-based control,” IEEE/RSJ IROS, Oct. 2012
- MIT Quadruped Robot Cheetah, https://spectrum.ieee.org/mit-cheetah-robot-running, IEEE Spectrum, May 2013
- E. Rohmer, S. P. Singh, and M. Freese, “V-Rep: A versatile and scalable robot simulation framework” (CoppeliaSim), IEEE/RSJ IRS, Nov. 2013
- Y Bai and C K Liu. “Dexterous manipulation using both palm and fingers” (DMPF). IEEE ICRA, June, 2014.
- F. Tanaka, K. Isshiki, F. Takahashi, and et al., “Pepper learns together with children: Development of an educational application”, IEEE Int. Conf. on Humanoid Robots, Nov. 2015.
- E. Coumans and Y. Bai, “Pybullet, a python module for physics simulation for games, robotics and machine learning,” https://github.com/bulletphysics/bullet3, 2016
- S. Maniatopoulos, P. Schillinger, V. Pong, D. C. Conner, and H. Kress-Gazit, “Reactive high-level behavior synthesis for an Atlas humanoid robot,” IEEE ICRA, May 2016.
- V. Kumar, A. Gupta, E. Todorov, and S. Levine, “Learning dexterous manipulation policies from experience and imitation” (LPEI), arXiv 1611.05095, 2016.
- S. Shah, D. Dey, C. Lovett, and A. Kapoor, “AirSim: High-fidelity visual and physical simulation for autonomous vehicles,” arXiv 1705.05065, 2017
- A. Chang, A. Dai, T. Funkhouser, and et al., “Matterport3D: Learning from RGB-D data in indoor environments”, arXiv 1709.06158, 2017
- A. Rajeswaran, V. Kumar, A. Gupta, and et al, “Learning complex dexterous manipulation with deep reinforcement learning and demonstrations” (DAPG), RSS’18, arXiv 1709.10087, 2017.
- M Savva, A Chang, A Dosovitskiy, and et al., “MINOS: Multimodal Indoor Simulator for Navigation in Complex Environments”, arXiv 1712.03931, 2017
- E. Kolve, R. Mottaghi, D. Gordon, and et al., “AI2-THOR: An interactive 3d environment for visual AI,” arXiv 1712.05474, 2017
- A Vaswani, N Shazeer, N Parmar, et al. “Attention is All You Need” (Transformer). Advances in Neural Information Processing Systems, 2017.
- A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever, “Improving language understanding by generative pre-training” (GPT-1), https://openai.com/index/language-unsupervised/, June 2018.
- X. Puig, K. Ra, M. Boben, and et al., “Virtualhome: Simulating household activities via programs,” in IEEE/CVF CVPR, Jun 2018
- OpenAI team, “Learning dexterous in-hand manipulation” (LDIM), arXiv 1808.00177, 2018.
- A. Juliani, V-P Berges, E. Teng, and et al., “Unity: A general platform for intelligent agents” (Unity ML-Agents), arXiv 1809.02627, 2018.
- S Li, X Ma, H Liang, and et al. “Vision-based teleoperation of shadow dexterous hand using end-to-end deep neural network” (TeachNet). ICRA, arXiv 1809.06268, 2018.
- A Mandlekar, Y Zhu, A Garg, and et al. “RoboTurk: A crowdsourcing platform for robotic skill learning through imitation”. CoRL, arXiv 1811.02790, 2018.
- A. Radford, J. Wu, R. Child, et al., “Language models are unsupervised multitask learners” (GPT-2), OpenAI blog, 2019.
- Kroemer, O., Niekum, S., & Konidaris, G. “A review of robot learning for manipulation: Challenges, representations, and algorithms”. arXiv 1907.03146, 2019
- M Shoeybi et al., “Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism”, arXiv 1909.08053, 2019
- A. Nagabandi, K. Konolige, S. Levine, and V. Kumar, “Deep dynamics models for learning dexterous manipulation” (PDDM), arXiv 1909.11652, 2019.
- S Rajbhandari, J Rasley, O Ruwase, Y He, “ZeRO: Memory Optimizations Toward Training Trillion Parameter Models”, arXiv 1910.02054, 2019
- A Handa, K Van Wyk, W Yang, and et al. “DexPilot: Vision-based teleoperation of dexterous robotic hand-arm system”. IEEE ICRA, arXiv 1910.03135, 2019.
- S Dasari, F Ebert, S Tian, and et al. “RoboNet: Large-scale multi-robot learning”. CoRL’19, arXiv 1910.11215, 2019
- M Savva, A Kadian, O Maksymets, and et al. “Habitat: A platform for embodied AI research”. IEEE ICCV, 2019.
- Hafner D, Lillicrap T, Ba J, et al. “Dream to control: learning behaviors by latent imagination” (Dreamer v1). arXiv 1912.01603, 2019
- Ravichandar, H., Polydoros, A. S., Chernova, S., & Billard, A. “Recent advances in robot learning from demonstration” (review). Annual Review of Control, Robotics, Auto. Systems, vol.3, 2020
- C C. Kessens, J Fink, A Hurwitz, and et al., “Toward fieldable human-scale mobile manipulation using RoMan”, AI and Machine Learning for Multi-Domain Operations Applications II, Volume 11413, SPIE, April, 2020
- I. Radosavovic, X. Wang, L. Pinto, and J. Malik, “State-only imitation learning for dexterous manipulation” (SOIL), IEEE/RSJ IROS’21. arXiv 2004.04650, 2020.
- M Deitke, W Han, A Herrasti and et al. “RoboTHOR: An open simulation-to-real embodied AI platform”. CVPR’20, arXiv 2004.06799, 2020
- Damen D, Doughty H, Farinella G M, et al. “The EPIC-Kitchens dataset: collection, challenges and baselines”. arXiv 2005.00343, IEEE T-PAMI, 43(11): 4125–4141, 2021
- T. B. Brown, B. Mann, N. Ryder, et al., “Language models are few-shot learners” (GPT-3), arXiv 2005.14165, 2020
- C. Li, S. Zhu, Z. Sun, and J. Rogers, “BAS optimized ELM for KUKA iiwa Robot Learning,” IEEE Transactions on Circuits and Systems II: Express Briefs, 68 (6), Oct. 2020.
- F. Xiang, Y. Qin, K. Mo, and et al., “SAPIEN: A simulated part-based interactive environment,” arXiv 2003.08515, IEEE/CVF CVPR, Jun 2020.
- Zhao, W., Queralta, J. P., and Westerlund, T. “Sim-to-real transfer in deep reinforcement learning for robotics: a survey”. arXiv 2009.13303, 2020.
- Hafner D, Lillicrap T, Norouzi M, et al. “Mastering Atari with discrete world models” (Dreamer v2). arXiv 2010.02193, 2020
- A. Zeng, P. Florence, J. Tompson, and et al., “Transporter networks: Rearranging the visual world for robotic manipulation”. CoRL’20, arXiv 2010.14406, 2020
- Y Ishiguro, T Makabe, Y Nagamatsu, and et al., “Bilateral humanoid teleoperation system using whole-body exoskeleton cockpit TABLIS”, IEEE IROS, Oct. 2020
- B. Shen, F. Xia, C. Li, and et al., “iGibson 1.0: A simulation environment for interactive tasks in large realistic scenes,” arXiv 2012.02924, IEEE/RSJ IROS, 2021
- J Ren, S Rajbhandari, R Y Aminabadi et al., “ZeRO-offload: Democratizing Billion-Scale Model Training”, arXiv 2101.06840, 2021
- S Rajbhandari et al., “ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning”, arXiv 2104.07857, 2021
- Y Li, T Nagarajan, B Xiong, and K Grauman. “Ego-Exo: Transferring visual representations from third-person to first-person videos”. arXiv 2104.07905, CVPR, 2021
- D Kalashnikov, J Varley, Y Chebotar, and et al., “MT-Opt: Continuous Multi-Task Robotic Reinforcement Learning at Scale”, arXiv 2104.08212, 2021
- K. Ehsani, W. Han, A. Herrasti, and et al., “ManipulaTHOR: A framework for visual object manipulation,” arXiv 2104.11213, IEEE/CVF CVPR, 2021.
- M Caron, H Touvron, I Misra, and et al. “Emerging Properties in Self-Supervised Vision Transformers” (Dino v1), arXiv 2104.14294, 2021
- Chen L, Lu K, Rajeswaran A, et al. “Decision transformer: reinforcement learning via sequence modeling”, arXiv 2106.01345, 2021
- Janner M, Li Q, Levine S. “Offline reinforcement learning as one big sequence modeling problem” (Trajectory Transformer), arXiv 2106.02039, 2021
- E Hu et al., “LoRA: Low-Rank Adaptation of Large Language Models”, arXiv 2106.09685, 2021
- A Szot, A Clegg, E Undersander, and et al. “Habitat 2.0: Training Home Assistants to Rearrange their Habitat”, arXiv 2106.14405, 2021
- Mu T Z, Ling Z, Xiang F B, et al. “Maniskill: generalizable manipulation skill benchmark with large-scale demonstrations”, arXiv 2107.14483, 2021
- A. Jaegle, S. Borgeaud, J. B. Alayrac, and et al. “Perceiver IO: A general architecture for structured inputs & outputs”. arXiv 2107.14795, 2021.
- A Radford, J W Kim, C Hallacy, et al. “Learning transferable visual models from natural language supervision” (CLIP). ICML 2021.
- A Ramesh, M Pavlov, G Goh, et al., “Zero-shot text-to-image generation” (DALL-E). ICML. Virtual event, July 2021
- C. Li, F. Xia, R. Martín-Martín, and et al., “iGibson 2.0: Object-centric simulation for robot learning of everyday household tasks,” arXiv 2108.03272, CoRL’21, 2021
- Y Qin, Y-H Wu, S Liu, and et al. “DexMV: Imitation learning for dexterous manipulation from human videos”. ECCV’22, arXiv 2108.05877, 2021.
- Tesla Bot (Optimus), https://spectrum.ieee.org/elon-musk-robot, IEEE Spectrum, Aug., 2021
- S K Ramakrishnan, A Gokaslan, E Wijmans, and et al. “Habitat-Matterport 3D Dataset (HM3D): 1000 Large-scale 3D environments for embodied AI”. arXiv 2109.08238, 2021
- M. Shridhar, L. Manuelli, and D. Fox, “CLIPort: What and where pathways for robotic manipulation,” arXiv 2109.12098, 2021
- F. Ebert, Y. Yang, K. Schmeckpeper, and et al. “Bridge data: Boosting generalization of robotic skills with cross-domain datasets”. arXiv 2109.13396, 2021.
- K. Grauman, A. Westbury, E. Byrne and et al. “Ego4D: Around the world in 3,000 hours of egocentric video”. arXiv 2110.07058, 2021
- Z Bian et al., “Colossal-AI: A Unified Deep Learning System For Large-Scale Parallel Training”, arXiv 2110.14883, 2021
- E Jang, A Irpan, M Khansari, and et al. “BC-Z: Zero-shot task generalization with robotic imitation learning”. CoRL, 2021
- C. Gan, J. Schwartz, S. Alter, and et al., “ThreeDWorld: A platform for interactive multi-modal physical simulation,” arXiv 2007.04954, NeurIPS’21, 2021
- R Rombach, A Blattmann, D Lorenz, P Esser, and B Ommer. “High-resolution image synthesis with latent diffusion models” (Stable Diffusion). arXiv 2112.10752, 2021.
- W. Huang, P. Abbeel, D. Pathak, and I. Mordatch, “Language models as zero-shot planners: Extracting actionable knowledge for embodied agents” (T-LM), arXiv 2201.07207, ICML, 2022.
- P Mandikal and K Grauman. “DexVIP: Learning dexterous grasping with human hand pose priors from video”. CoRL, arXiv 2202.00164, 2022.
- Li J, Li D, Xiong C, et al. “BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation”, arXiv 2201.12086, 2022
- Y. Liu, Y. Liu, C. Jiang, and et al., “HOI4D: A 4D egocentric dataset for category-level human-object interaction”. CVPR’22, arXiv 2203.01577, 2022
- L Ouyang, J Wu, X Jiang et al., “Training language models to follow instructions with human feedback” (GPT-3.5/InstructGPT), arXiv 2203.02155, 2022
- N Hansen, X Wang, H Su, “Temporal Difference Learning for Model Predictive Control” (TD-MPC), arXiv 2203.04955, 2022
- S P Arunachalam, S Silwal, B Evans, and L Pinto. “Dexterous imitation made easy: A learning-based framework for efficient dexterous manipulation” (DIME). arXiv 2203.13251, 2022.
- F Sener, D Chatterjee, D Shelepov, and et al. “Assembly101: A large-scale multi-view video dataset for understanding procedural activities”. CVPR’22, arXiv 2203.14712, 2022
- A. Zeng, M. Attarian, K. M. Choromanski, and et al., “Socratic models: Composing zero-shot multimodal reasoning with language”, arXiv 2204.00598, 2022
- M Ahn, A Brohan, N Brown, and et al., “Do as I Can, Not as I Say: Grounding Language in Robotic Affordances” (SayCan), arXiv 2204.01691, 2022
- A Ramesh, P Dhariwal, A Nichol, and et al. “Hierarchical text-conditional image generation with clip latents” (DALL-E2). arXiv 2204.06125, 2022.
- Y Qin, H Su, and X Wang. “From one hand to multiple hands: Imitation learning for dexterous manipulation from single-camera teleoperation” (IMDM). RA-L, 7(4), arXiv 2204.12490, 2022.
- J-B Alayrac, J Donahue, P Luc, et al., “Flamingo: a visual language model for few-shot learning”. arXiv 2204.14198, 2022
- Reed, S., Zolna, K., Parisotto, E., and et al. “A Generalist Agent” (GATO). arXiv 2205.06175, 2022
- T Dao et al., “FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness”, arXiv 2205.14135, 2022
- S. Haddadin, S. Parusel, L. Johannsmeier, and et al., “The Franka Emika Robot: A reference platform for robotics research and education”, IEEE Robotics & Automation Magazine, 29 (2), June, 2022.
- M. Deitke, E. VanderBilt, A. Herrasti, and et al., “ProcTHOR: Large-Scale Embodied AI Using Procedural Generation”, arXiv 2206.06994, NeurIPS’22, 2022
- N M Shafiullah, Z J Cui, A Altanzaya, L Pinto, “Behavior Transformers: Cloning k modes with one stone”, arXiv 2206.11251, 2022
- P Wu, A Escontrela, D Hafner, P Abbeel, and K Goldberg. “DayDreamer: World models for physical robot learning”. arXiv 2206.14176, 2022
- Y. Seo, D. Hafner, H. Liu, and et al., “Masked world models for visual control” (MWM), arXiv 2206.14244, 2022
- Huang W, Xia F, Xiao T, et al. “Inner monologue: embodied reasoning through planning with language models”. arXiv 2207.05608, 2022
- S Bahl, A Gupta, D Pathak, “Human-to-Robot Imitation in the Wild” (WHIRL), arXiv 2207.09450, July, 2022
- A Bucker, L Figueredo, S Haddadin, and et al., “LATTE: LAnguage Trajectory TransformEr”, arXiv 2208.02918, 2022
- J Liang, W Huang, F Xia, and et al., “Code as Policies: Language Model Programs for Embodied Control” (CaP), arXiv 2209.07753, 2022
- L Yang, Z Zhang, S Hong et al., “Diffusion Models: A Comprehensive Survey of Methods and Applications”, arXiv 2209.00796, 2022
- M. Shridhar, L. Manuelli, and D. Fox, “Perceiver-Actor: A multi-task transformer for robotic manipulation,” arXiv 2209.05451, 2022
- I. Singh, V. Blukis, A. Mousavian, and et al., “ProgPrompt: Generating situated robot task plans using large language models,” arXiv 2209.11302, IEEE ICRA’23, 2022.
- B. Reily, P. Gao, F. Han, H. Wang, and H. Zhang, “Real-time recognition of team behaviors by multisensory graph-embedded robot learning” (Jackal Robot/Clearpath Robotics), IJRA, 41(8), Sep. 2022.
- Z Q Chen, K Van Wyk, Y-W Chao, and et al. “DexTransfer: Real world multi-fingered dexterous grasping with minimal human demonstrations”. arXiv 2209.14284, 2022.
- K. Gao, Y. Gao, H. He, et al., “NeRF: Neural radiance field in 3d vision, a comprehensive review”. arXiv 2210.00379, 2022.
- S Yao, J Zhao, D Yu, and et al., “ReAct: Synergizing Reasoning and Acting in Language Models”, arXiv 2210.03629, 2022
- S P Arunachalam, I Güzey, S Chintala, and L Pinto. “Holo-Dex: Teaching dexterity with immersive mixed reality”. IEEE ICRA’23, arXiv 2210.06463, 2022.
- A Handa, A Allshire, V Makoviychuk, and et al. “Dextreme: Transfer of agile in-hand manipulation from simulation to reality”. arXiv 2210.13702, 2022.
- Mohammed, Q., Kwek, C., Chua, C. and et al. “Review of learning-based robotic manipulation in cluttered environments”. Sensors, vol. 22 (20), 2022.
- T Chen, M Tippur, S Wu, and et al. “Visual dexterity: In-hand dexterous manipulation from depth”. arXiv 2211.11744, 2022.
- C H Song, J Wu, C Washington, and et al., “LLM-Planner: Few-shot grounded planning for embodied agents with large language models”. arXiv 2212.04088, 2022
- K Shaw, S Bahl, and D Pathak. “VideoDex: Learning dexterity from internet videos”. arXiv 2212.04498, 2022.
- A Brohan, N Brown, J Carbajal, and et al. “RT-1: Robotics transformer for real-world control at scale”. arXiv 2212.06817, 2022
- P Liu, W Yuan, J Fu, and et al. “Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing”. ACM Computing Surveys, 55(9):1–35, 2023.
- Q Dong, L Li, D Dai, and et al., “A survey for in-context learning”. arXiv 2301.00234, 2023.
- Hafner D, Pasukonis J, Ba J, et al. “Mastering diverse domains through world models” (Dreamer v3), arXiv 2301.04104, 2023
- M Mittal, C Yu, Q Yu, and et al. “ORBIT: A Unified Simulation Framework for Interactive Robot Learning Environments”, arXiv 2301.04195, 2023
- K. Nottingham, P. Ammanabrolu, A. Suhr, and et al. “Do embodied agents dream of pixelated sheep: Embodied decision making using language guided world modelling” (DEKARD), arXiv 2301.12050, 2023
- Li J, Li D, Savarese S, et al. “BLIP-2: bootstrapping language-image pre-training with frozen image encoders and large language models”. arXiv 2301.12597, 2023
- Y Du, M Yang, B Dai, and et al., “Learning Universal Policies via Text-Guided Video Generation” (UniPi), arXiv 2302.00111, 2023
- Z. Wang, S. Cai, A. Liu, X. Ma, and Y. Liang, “Describe, explain, plan and select: Interactive planning with large language models enables open-world multi-task agents” (DEPS), arXiv 2302.01560, 2023.
- J Gu, F Xiang, X Li, and et al., “ManiSkill2: A Unified Benchmark for Generalizable Manipulation Skills”, arXiv 2302.04659, 2023
- H. Touvron, T. Lavril, G. Izacard, and et al. “LLaMA: Open and efficient foundation language models”. arXiv 2302.13971, 2023.
- D Driess, F Xia, M. Sajjadi, et al., “PaLM-E: An Embodied Multimodal Language Model”, arXiv 2303.03378, 2023
- G Khandate, S Shang, ET Chang, and et al. “Sampling-based Exploration for Reinforcement Learning of Dexterous Manipulation” (M-RRT/G-RRT). RSS’23, arXiv 2303.03486, 2023.
- S Yang, O Nachum, Y Du, and et al., “Foundation models for decision making: Problems, methods, and opportunities” (review). arXiv 2303.04129, 2023
- C Chi, Z Xu, S Feng, and et al., “Diffusion Policy: Visuomotor Policy Learning via Action Diffusion”, arXiv 2303.04137, 2023
- Y Cao, S Li, Y Liu, and et al. “A Comprehensive Survey of AI-Generated Content (AIGC): A History of Generative AI from GAN to ChatGPT”, arXiv 2303.04226, 2023
- J Pitz, L Röstel, L Sievers, and B Bäuml. “Dextrous tactile in-hand manipulation using a modular reinforcement learning architecture” (DTIM). arXiv 2303.04705, 2023.
- J Robine, M Höftmann, T Uelwer, and S Harmeling. “Transformer-based world models are happy with 100k interactions” (TWM). ICLR’23, arXiv 2303.07109, 2023
- C Zhang, C Zhang, M Zhang, I S Kweon, “Text-to-image Diffusion Models in Generative AI: A Survey”, arXiv 2303.07909, 2023
- J. Achiam, S. Adler, S. Agarwal, and et al. “GPT-4 technical report”. arXiv 2303.08774, 2023
- Z-H Yin, B Huang, Y Qin, Q Chen, and X Wang. “Rotating without seeing: Towards in-hand dexterity through touch” (Touch Dexterity). arXiv 2303.10880, 2023.
- Shinn N, Cassano F, Berman E, et al. “Reflexion: language agents with verbal reinforcement learning”, arXiv 2303.11366, 2023
- I Güzey, B Evans, S Chintala, and L Pinto. “Dexterity from touch: Self-supervised pre-training of tactile representations with robotic play” (T-Dex). arXiv 2303.12076, 2023.
- Madaan A, Tandon N, Gupta P, et al. “Self-Refine: iterative refinement with self-feedback”, arXiv 2303.17651, 2023
- W X Zhao, K Zhou, J Li, and et al., “A Survey of Large Language Models”, arXiv 2303.18223, Mar. 2023
- L Zhao, T Yang, Y Yang, and P Yu. “A wearable upper limb exoskeleton for intuitive teleoperation of anthropomorphic manipulators” (WULE). MDPI Machines, 11(4):441, Mar. 2023.
- Figure 01, https://www.fastcompany.com/90859010/the-race-to-build-ai-powered-humanoids-is-heating-up, Mar., 2023
- A. Ugenti, R. Galati, G. Mantriota, and G. Reina, “Analysis of an all-terrain tracked robot with innovative suspension system” (Polibot), Mechanism and Machine Theory, vol. 182, April, 2023.
- A Kirillov, E Mintun, N Ravi, and et al. “Segment Anything” (SAM). arXiv 2304.02643, 2023
- J Park, J O’Brien, C Cai and et al., “Generative Agents: Interactive Simulacra of Human Behavior” (GA), arXiv 2304.03442, 2023
- X Zou, J Yang, H Zhang, et al., “Segment everything everywhere all at once” (SEEM). arXiv 2304.06718, 2023
- M Oquab, T Darcet, T Moutakanni, and et al. “Dinov2: Learning robust visual features without supervision”. arXiv 2304.07193, 2023
- T Z. Zhao, V Kumar, S Levine, C Finn, “Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware” (ALOHA/ACT), arXiv 2304.13705, 2023
- Y Xie, K Kawaguchi, Y Zhao, and et al. “Self-evaluation guided beam search for reasoning”, arXiv 2305.00633, 2023
- M Heo, Y Lee, D Lee, and J. Lim. “FurnitureBench: Reproducible real-world benchmark for long-horizon complex manipulation”. arXiv 2305.12821, 2023.
- S Yao, D Yu, J Zhao, and et al., “Tree of Thoughts: Deliberate Problem Solving with Large Language Models”, arXiv 2305.10601, 2023
- Mu Y, Zhang Q, Hu M, et al. “EmbodiedGPT: vision-language pre-training via embodied chain of thought”. arXiv 2305.15021, 2023
- G Wang, Y Xie, Y Jiang, and et al., “VOYAGER: An Open-Ended Embodied Agent with Large Language Models”, arXiv 2305.16291, 2023
- M Kulkarni, T J. L. Forgaard, K Alexis, “Aerial Gym – Isaac Gym Simulator for Aerial Robots”, arXiv 2305.16510, 2023
- B Y Lin, Y Fu, K Yang, and et al. “SwiftSage: a generative agent with fast and slow thinking for complex interactive tasks”. arXiv 2305.17390, 2023
- Cyberbotics, “Webots: open-source robot simulator”, https://github.com/cyberbotics/webots, 2023
- NVIDIA, “Nvidia Isaac Sim: Robotics simulation and synthetic data,” https://developer.nvidia.com/isaac/sim, 2023
- AKM Shahariar Azad Rabby, C Zhang, “BeyondPixels: A Comprehensive Review of the Evolution of Neural Radiance Fields”, arXiv 2306.03000, 2023
- B Liu, Y Zhu, C Gao, and et al. “LIBERO: Benchmarking knowledge transfer for lifelong robot learning”. arXiv 2306.03310, 2023
- P Ren, K Zhang, H Zheng, and et al. “Surfer: Progressive reasoning with world models for robotic manipulation”, arXiv 2306.11335, 2023
- Microsoft, “Textbooks Are All You Need” (phi-1), arXiv 2306.11644, 2023
- Bousmalis K, Vezzani G, Rao D, et al. “RoboCat: a self-improving generalist agent for robotic manipulation”. arXiv 2306.11706, 2023
- A Goyal, J Xu, Y Guo, and et al. “RVT: Robotic view transformer for 3D object manipulation”. arXiv 2306.14896, 2023
- Vemprala S, Bonatti R, Bucker A, and et al. “ChatGPT for robotics: design principles and model abilities”, arXiv 2306.17582, 2023
- Y Guo, Y-J Wang, L Zha, J Chen, “DoReMi: Grounding Language Model by Detecting and Recovering from Plan-Execution Misalignment”, arXiv 2307.00329, 2023
- X Li, V Belagali, J Shang and M S. Ryoo, “Crossway Diffusion: Improving Diffusion-based Visuomotor Policy via Self-supervised Learning”, arXiv 2307.01849, 2023
- Y Qin, W Yang, B Huang, and et al. “AnyTeleop: A general vision-based dexterous robot arm-hand teleoperation system”. arXiv 2307.04577, 2023.
- Huang W, Wang C, Zhang R, et al. “VoxPoser: Composable 3D value maps for robotic manipulation with language models”. arXiv 2307.05973, 2023
- K. Rana, J. Haviland, S. Garg, and et al. “SayPlan: Grounding large language models using 3d scene graphs for scalable task planning,” arXiv 2307.06135, CoRL’23, 2023.
- Wang, Y., He, Y., Li, Y., and et al. “InternVid: A large-scale video-text dataset for multimodal understanding and generation”. arXiv 2307.06942, 2023.
- T Dao, “FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning”, arXiv 2307.08691, 2023
- H. Touvron, L. Martin, K. Stone, and et al. “Llama 2: Open foundation and fine-tuned chat models”. arXiv 2307.09288, 2023.
- J Gu, Z Han, S Chen, and et al. “A systematic survey of prompt engineering on vision-language foundation models”. arXiv 2307.12980, 2023
- H Ha, P Florence, and S Song. “Scaling up and distilling down: Language-guided robot skill acquisition” (SUDD). CoRL’23, arXiv 2307.14535, 2023
- A Brohan, N Brown, J Carbajal, and et al. “RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control”, arXiv 2307.15818, 2023
- H Fang, H Fang, Z Tang, and et al. “RH20T: A robotic dataset for learning diverse skills in one-shot”. RSS 2023 Workshop on Learning for Task and Motion Planning, arXiv 2307.00595, July 2023
- P. Arm, G. Waibel, J. Preisig, and et al., “Scientific exploration of challenging planetary analog environments with a team of legged robots” (ANYmal C), arXiv 2307.10079, Science robotics, 8 (80), July, 2023.
- Lin J, Du Y, Watkins O, et al. “Learning to model the world with language” (Dynalang). arXiv 2308.01399, 2023
- Jing, Y., Zhu, X., Liu, X., and et al. “Exploring visual pre-training for robot manipulation: Datasets, models and methods” (Vi-PRoM). arXiv 2308.03620, 2023.
- S Zhang, L Dong, X Li and et al., “Instruction Tuning for Large Language Models: A Survey”, arXiv 2308.10792, 2023
- L Wang, C Ma, X Feng, and et al. “A Survey on Large Language Model based Autonomous Agents”, arXiv 2308.11432, 2023
- H. Walke, K. Black, A. Lee, and et al. “Bridgedata v2: A dataset for robot learning at scale”, arXiv 2308.12952, 2023.
- K Somasundaram, J Dong, H Tang, and et al. “Project Aria: A new tool for egocentric multi-modal AI research”. arXiv 2308.13561, 2023.
- L G Foo, H Rahmani, and J Liu, “AIGC for Various Data Modalities: A Survey”, arXiv 2308.14177, Aug. 2023
- H Bharadhwaj, J Vakil, M Sharma, and et al., “RoboAgent: Generalization and Efficiency in Robot Manipulation via Semantic Augmentations and Action Chunking” (MT-ACT/RoboSet), arXiv 2309.01918, 2023
- Microsoft, “Textbooks Are All You Need II: Phi-1.5 technical report”, arXiv 2309.05463, 2023
- W Kwon et al., “Efficient Memory Management for Large Language Model Serving with PagedAttention” (vLLM), arXiv 2309.06180, 2023
- Z Xi, W Chen, X Guo, and et al. “The Rise and Potential of Large Language Model Based Agents: A Survey”, arXiv 2309.07864, 2023
- C Li, Z Gan, Z Yang, and et al. “Multimodal Foundation Models: From Specialists to General-Purpose Assistants” (survey), arXiv 2309.10020, 2023
- P Wu, Y Shentu, Z Yi, X Lin, and P Abbeel. “GELLO: A general, low-cost, and intuitive tele-operation framework for robot manipulators”. arXiv 2309.13037, 2023
- H Fang, H Fang, Y Wang, and et al. “AirExo: Low-cost exoskeletons for learning whole-arm manipulation in the wild”. arXiv 2309.14975, 2023
- T Shen, R Jin, Y Huang, and et al., “Large Language Model Alignment: A Survey”, arXiv 2309.15025, 2023
- Z Chu, J Chen, Q Chen, and et al., “A Survey of Chain of Thought Reasoning: Advances, Frontiers and Future”, arXiv 2309.15402, 2023
- Q. Gu, A. Kuwajerwala, S. Morin, and et al., “ConceptGraphs: Open-vocabulary 3d scene graphs for perception and planning,” arXiv 2309.16650, 2023
- Z Yang, L Li, K Lin, and et al., “The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision)”, arXiv 2309.17421, 2023
- L Wang, Y Ling, Z Yuan, and et al. “GenSim: Generating Robotic Simulation Tasks via Large Language Models”, arXiv 2310.01361, 2023
- A Q. Jiang, A Sablayrolles, A Mensch, and et al., “Mistral 7B”, arXiv 2310.06825, 2023
- A Padalkar, A Pooley, A Jain, and et al. “Open X-Embodiment: Robotic learning datasets and RT-x models”. arXiv 2310.08864, 2023
- Zhang W, Wang G, Sun J, et al. “STORM: efficient stochastic transformer based world models for reinforcement learning”. arXiv 2310.09615, 2023
- Du, Y., Yang, M., Florence, P. R., and et al. “Video language planning” (VLP). arXiv 2310.10625, Oct. 2023
- Y J Ma, W Liang, G Wang, and et al. “EUREKA: Human-Level Reward Design Via Coding Large Language Models”, arXiv 2310.12931, ICLR’24, 2023
- Puig, X., Undersander, E., Szot, A., and et al. “Habitat 3.0: A co-habitat for humans, avatars and robots”. arXiv 2310.13724, 2023
- Y Feng, N Hansen, Z Xiong, and et al., “Fine-tuning Offline World Models in the Real World” (FOWM), arXiv 2310.16029, 2023
- Hansen N, Su H, Wang X. “TD-MPC2: scalable, robust world models for continuous control”. arXiv 2310.16828, 2023
- A Mandlekar, S Nasiriany, B Wen, and et al., “MimicGen: A Data Generation System for Scalable Robot Learning using Human Demonstrations”, arXiv 2310.17596, 2023
- J Betker, G Goh, L Jing, and et al., “Improving Image Generation with Better Captions” (DALL-E3), OpenAI report, Oct., 2023
- Li X, Liu M, Zhang H, et al. “Vision-language foundation models as effective robot imitators” (RoboFlamingo). arXiv 2311.01378, 2023
- Wang Y, Xian Z, Chen F, et al. “RoboGen: towards unleashing infinite data for automated robot learning via generative simulation”. arXiv 2311.01455, 2023
- J Gu, S Kirmani, P Wohlhart, and et al., “RT-Trajectory: Robotic Task Generalization via Hindsight Trajectory Sketches”, arXiv 2311.01977, 2023
- M R Morris, J Sohl-dickstein, N Fiedel, and et al. “Levels of AGI: Operationalizing Progress on the Path to AGI”, arXiv 2311.02462, 2023
- H Peng, C Ding, T Geng, and et al., “Evaluating Emerging AI/ML Accelerators: IPU, RDU, and NVIDIA/AMD GPUs”, arXiv 2311.04417, 2023
- Zeng, F., Gan, W., Wang, Y., and et al. “Large language models for robotics: A survey”. arXiv 2311.07226, 2023
- Y Huang, Y Chen, Z Li, “Applications of Large Scale Foundation Models for Autonomous Driving” (survey), arXiv 2311.12144, 2023
- J. Huang, S. Yong, X. Ma, and et al., “An embodied generalist agent in 3d world” (LEO), arXiv 2311.12871, 2023
- X Xiao, J Liu, Z Wang, and et al., “Robot Learning in the Era of Foundation Models: A Survey”, arXiv 2311.14379, 2023
- Y. Chen, W. Cui, Y. Chen, and et al., “RoboGPT: an intelligent agent of making embodied long-term decisions for daily instruction tasks,” arXiv 2311.15649, 2023.
- N Shafiullah, A Rai, H Etukuru, and et al. “On bringing robots home” (Dobb·E/Stick v1/HoNY). arXiv 2311.16098, 2023.
- Y. Hu, F Lin, T Zhang, L Yi, and Y Gao, “Look before you leap: Unveiling the power of gpt-4v in robotic vision-language planning” (ViLa), arXiv 2311.17842, 2023.
- K Grauman, A Westbury, L Torresani, and et al. “Ego-Exo4D: Understanding skilled human activity from first- and third-person perspectives”, arXiv 2311.18259, 2023
- Javaheripi M, Bubeck S, Abdin M, et al. “Phi-2: the surprising power of small language models”. https://www.microsoft.com/en-us/research/blog/phi-2-the-surprising-power-of-small-language-models/, 2023
- Y Song, E Byrne, T Nagarajan, and et al. “Ego4D goal-step: Toward hierarchical understanding of procedural activities”. NeurIPS, 2023
- I Leal, K Choromanski, D Jain, and et al., “SARA-RT: Scaling up Robotics Transformers with Self-Adaptive Robust Attention”, arXiv 2312.01990, 2023
- R Firoozi, J Tucker, S Tian, and et al., “Foundation Models in Robotics: Applications, Challenges, and the Future” (review), arXiv 2312.07843, 2023
- Y Hu, Q Xie, V Jain, and et al. “Toward General-Purpose Robots via Foundation Models: A Survey and Meta-Analysis”, arXiv 2312.08782, 2023
- P Wang, L Li, Z Shao, and et al. “Math-Shepherd: Verify and reinforce LLMs step-by-step without human annotations”. arXiv 2312.08935, 2023
- Team, G., Anil, R., Borgeaud, S., and et al. “Gemini: a family of highly capable multimodal models”. arXiv 2312.11805, 2023.
- H Wu, Y Jing, C Cheang, and et al., “GR-1: Unleashing Large-Scale Video Generative Pre-Training For Visual Robot Manipulation”, arXiv 2312.13139, 2023
- P Ding, H Zhao, W Song, and et al., “QUAR-VLA: Vision-Language-Action Model for Quadruped Robots”, arXiv 2312.14457, 2023
- Mistral AI, “Mixtral of experts: A high quality Sparse Mixture-of-Experts”, https://mistral.ai/news/mixtral-of-experts/, Dec. 2023
- C Wen, X Lin, J So, and et al., “Any-point Trajectory Modeling for Policy Learning” (ATM), arXiv 2401.00025, 2024
- Z Fu, T Z Zhao, and C Finn, “Mobile ALOHA: Learning Bimanual Mobile Manipulation with Low-Cost Whole-Body Teleoperation”, arXiv 2401.02117, 2024
- Y Cheng, C Zhang, Z Zhang, and et al. “Exploring Large Language Model Based Intelligent Agents: Definitions, Methods, and Prospects” (survey), arXiv 2401.03428, 2024
- G Chen and W Wang, “A Survey on 3D Gaussian Splatting”, arXiv 2401.03890, 2024
- T Cai, Y Li, Z Geng, and et al., “Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads”, arXiv 2401.10774, 2024
- B Chen, Z Xu, S Kirmani, and et al. “SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities”, arXiv 2401.12168, 2024
- M Ahn, D Dwibedi, C Finn, and et al., “AutoRT: Embodied Foundation Models for Large Scale Orchestration of Robotic Agents”, arXiv 2401.12963, 2024
- Ming, R., Huang, Z., Ju, Z., and et al. “A survey on video prediction: From deterministic to generative approaches”. arXiv 2401.14718, 2024.
- LLaMA.cpp, LLM inference in C/C++, https://github.com/ggerganov/llama.cpp, Jan. 2024
- S. Le Cleac’h, T. A. Howell, S. Yang, and et al, “Fast contact-implicit model predictive control” (Unitree Go1), IEEE Transactions on Robotics, Jan. 2024.
- X Yan, J Xu, Y Huo, H Bao, “Neural Rendering and Its Hardware Acceleration: A Review”, arXiv 2402.00028, 2024
- Z Xu, K Wu, J Wen, and et al. “A Survey on Robotics with Foundation Models: toward Embodied AI”, arXiv 2402.02385, 2024
- Z Wang, Y Li, Y Wu, and et al. “Multi-step problem solving through a verifier: An empirical analysis on model-induced process supervision” (MiPS). arXiv 2402.02658, 2024b.
- X Huang, W Liu, X Chen, and et al. “Understanding the planning of LLM agents: A survey”, arXiv 2402.02716, 2024
- G Paolo, J G-Billandon, B Kegl, “A Call for Embodied AI”, arXiv 2402.03824, 2024
- K Kawaharazuka, T Matsushima, A Gambardella, and et al. “Real-World Robot Applications of Foundation Models: A Review”, arXiv 2402.05741, 2024
- S Minaee, T Mikolov, N Nikzad, and et al. “Large Language Models: A Survey”, arXiv 2402.06196, 2024
- C Eze, C Crick. “Learning by watching: A review of video-based learning approaches for robot manipulation”. arXiv 2402.07127, 2024
- B Fei, J Xu, R Zhang, and et al., “3D Gaussian as A New Vision Era: A Survey”, arXiv 2402.07181, 2024
- G Yenduri, Ramalingam M, P Maddikunta, and et al., “Spatial Computing: Concept, Applications, Challenges and Future Directions”, arXiv 2402.07912, 2024
- C Chi, Z Xu, C Pan, and et al., “Universal Manipulation Interface: In-The-Wild Robot Teaching Without In-The-Wild Robots” (UMI), arXiv 2402.10329, 2024
- Z Tan, A Beigi, S Wang, and et al. “Large Language Models for Data Annotation: A Survey”, arXiv 2402.13446, 2024
- P Gao, P Wang, F Gao, et al. “Vision-Language Navigation with Embodied Intelligence: A Survey”, arXiv 2402.14304, 2024
- S Yang, J Walker, J Parker-Holder and et al. “Video as the new language for real-world decision making”, arXiv 2402.17139, 2024
- Y Liu, J Cao, C Liu, and et al., “Datasets for Large Language Models: A Comprehensive Survey”, arXiv 2402.18041, 2024
- OpenAI Sora, “Video generation models as world simulators”, https://openai.com/index/video-generation-models-as-world-simulators/, Feb. 2024
- Y Park and P Agrawal. “Using apple vision pro to train and control robots” (VisionProTeleop), https://github.com/Improbable-AI/VisionProTeleop, 2024
- S. Belkhale, T. Ding, T. Xiao, “RT-H: Action hierarchies using language,” arXiv 2403.01823, Mar. 2024
- S Lee, Y Wang, H Etukuru, and et al., “Behavior Generation with Latent Actions” (VQ-BeT), arXiv 2403.03181, 2024
- Ze Y, Zhang G, Zhang K, et al. “3D Diffusion Policy: Generalizable Visuomotor Policy Learning via Simple 3D Representations”. arXiv 2403.03954, 2024
- M Luo, Z Xue, A Dimakis, K Grauman, “Put Myself in Your Shoes: Lifting the Egocentric Perspective from Exocentric Videos”, arXiv 2403.06351, 2024
- A Iyer, Z Peng, Y Dai, and et al. “Open Teach: A versatile teleoperation system for robotic manipulation”. arXiv 2403.07870, 2024.
- Google Gemma Team, “Gemma: Open Models Based on Gemini Research and Technology”, arXiv 2403.08295, 2024
- Li, C., Zhang, R., Wong, J., and et al. “Behavior-1k: A human-centered, embodied AI benchmark with 1,000 everyday activities and realistic simulation”, arXiv 2403.09227, 2024
- H. Zhen, X Qiu, P Chen, and et al., “3D-VLA: 3d vision-language-action generative world model,” arXiv:2403.09631, 2024.
- T Wu, Y Yuan, L Zhang, and et al. “Recent Advances in 3D Gaussian Splatting” (review), arXiv 2403.11134, 2024
- C Wang, H Shi, W Wang, and et al., “DexCap: Scalable and Portable Mocap Data Collection System for Dexterous Manipulation”, arXiv 2403.07788, 2024
- C. Sferrazza, D.-M. Huang, X. Lin, Y. Lee, and P. Abbeel. “HumanoidBench: Simulated humanoid benchmark for whole-body locomotion and manipulation”. arXiv 2403.10506, 2024.
- J F. Mullen Jr, D Manocha, “LAP, Using Action Feasibility for Improved Uncertainty Alignment of Large Language Model Planners”, arXiv 2403.13198, 2024
- Y Huang, G Chen, J Xu, et al. “EgoExoLearn: A Dataset for Bridging Asynchronous Ego- and Exo-centric View of Procedural Activities in Real World”, arXiv 2403.16182, 2024
- “AI Power: Accurate Models at Blazing Speeds | SambaNova”, https://sambanova.ai/blog/accurate-models-at-blazing-speed, Samba COE v0.2, March, 2024
- Unitree humanoid H1, https://kr-asia.com/unitree-robotics-develops-personal-robot-dogs-that-jog-alongside-you, Mar. 2024
- A. Khazatsky, K. Pertsch, S. Nair, “Droid: A large-scale in-the-wild robot manipulation dataset”, arXiv 2403.12945, 2024
- S Zhou, Y Du, J Chen, and et al. “RoboDreamer: Learning compositional world models for robot imagination”, arXiv 2404.12377, 2024
- ALOHA 2 Team, and et al, “ALOHA 2: An Enhanced Low-Cost Hardware for Bimanual Teleoperation”, arXiv 2405.02292, 2024
- J W Kim, T Z. Zhao, S Schmidgall, and et al., “Surgical Robot Transformer (SRT): Imitation Learning for Surgical Tasks”, arXiv 2407.12998, 2024
- Microsoft, “Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone”, arXiv 2404.14219, 2024
- S. Shin, J. Kim, G.-C. Kang, and et al., “Socratic planner: Inquiry-based zero-shot planning for embodied instruction following,” arXiv 2404.15190, 2024.
- Y Xia, R Wang, X Liu, and et al., “Beyond Chain-of-Thought: A Survey of Chain-of-X Paradigms for LLMs”, arXiv 2404.15676, 2024
- R Xu, S Yang, Y Wang, and et al., “Visual Mamba: A Survey and New Outlooks”, arXiv 2404.18861, 2024
- R McCarthy, D Tan, D Schmidt, and et al. “Towards Generalist Robot Learning from Internet Video: A Survey”, arXiv 2404.19664, 2024
- R Cadene, S Alibert, A Soare, and et al., https://github.com/huggingface/lerobot (LeRobot), May, 2024
- G. Wang, L. Pan, S. Peng, and et al., “NeRF in robotics: A survey,” arXiv 2405.01333, 2024.
- M Dalal, T Chiruvolu, D Chaplot, and R Salakhutdinov. “Plan-Seq-Learn: Language model guided rl for solving long horizon robotics tasks” (PSL), arXiv 2405.01534, 2024
- A Dalal, D Hagen, K Robbersmyr, and et al. “Gaussian Splatting: 3D Reconstruction and Novel View Synthesis, a Review”, arXiv 2405.03417, 2024
- Z Zhu, X Wang, W Zhao, and et al. “Is Sora a World Simulator? A Comprehensive Survey on General World Models and Beyond”, arXiv 2405.03520, 2024
- X Li, K Hsu, J Gu, and et al., “Evaluating Real-World Robot Manipulation Policies in Simulation” (SIMPLER), arXiv 2405.05941, 2024
- K F Gbagbe, M A Cabrera, A Alabbas, and et al., “Bi-VLA: Vision-Language-Action Model-Based System for Bimanual Robotic Dexterous Manipulations”, arXiv 2405.06039, 2024
- Y Huang, “Levels of AI Agents: from Rules to Large Language Models”, arXiv 2405.06643, May, 2024
- R Prabhakar, R Sivaramakrishnan, D Gandhi, and et al., “SambaNova SN40L: Scaling the AI Memory Wall with Dataflow and Composition of Experts”, arXiv 2405.07518, 2024
- Octo Model Team, D. Ghosh, H. Walke, K. Pertsch, and et al. “Octo: An open-source generalist robot policy”, arXiv 2405.12213, 2024
- Y Ma, Z Song, Y Zhuang, and et al. “A Survey on Vision-Language-Action Models for Embodied AI”, arXiv 2405.14093, 2024
- Zhang Y, Yang S, Bai C J, et al. “Towards efficient LLM grounding for embodied multi-agent collaboration” (ReAd). arXiv 2405.14314, 2024
- F Bordes, R Y Pang, A Ajay, and et al. “An Introduction to Vision-Language Modeling”, arXiv 2405.17247, 2024
- T Zhang, D Li, Y Li, and et al., “Empowering Embodied Manipulation: A Bimanual-Mobile Robot Manipulation Dataset for Household Tasks” (BRMData), arXiv 2405.18860, 2024
- Fei-Fei Li, “With Spatial Intelligence, Artificial Intelligence Will Understand the Real World”, https://www.youtube.com/watch?v=y8NtMZ7VGmU, May, 2024
- OpenAI GPT-4o, https://openai.com/index/hello-gpt-4o/, May, 2024
- J. Liu, M. Liu, Z. Wang, and et al., “RoboMamba: Multimodal state space model for efficient robot reasoning and manipulation,” arXiv 2406.04339, 2024
- L Luo, Y Liu, R Liu, and et al. “Improve mathematical reasoning in language models by automated process supervision” (OmegaPRM). arXiv 2406.06592, 2024.
- A. Szot, B Mazoure, H Agrawal, and et al., “Grounding multimodal large language models in actions” (Grounding-RL), arXiv 2406.07904, 2024.
- A Goyal, V Blukis, J Xu, and et al. “RVT-2: Learning Precise Manipulation from Few Demonstrations”. arXiv 2406.08545, 2024
- T He, Z Luo, X He, and et al., “OmniH2O: Universal and Dexterous Human-to-Humanoid Whole-Body Tele-operation and Learning”, arXiv 2406.08858, 2024
- M J Kim, K Pertsch, S Karamcheti, and et al., “OpenVLA: An Open-Source Vision-Language-Action Model”, arXiv 2406.09246, 2024
- W Cai, J Jiang, F Wang, and et al., “A Survey on Mixture of Experts”, arXiv 2407.06204, 2024
- Z Fu, Q Zhao, Q Wu, G Wetzstein, and C Finn. “HumanPlus: Humanoid shadowing and imitation from humans”. arXiv 2406.10454, 2024.
- D Niu, Y Sharma, G Biamby, and et al, “LLARVA: Vision-Action Instruction Tuning Enhances Robot Learning”, arXiv 2406.11815, 2024
- P Mazzaglia, T Verbelen, B Dhoedt, and et al., “GenRL: Multimodal-foundation world models for generalization in embodied agents”, arXiv 2406.18043, 2024
- B Pei, G Chen, J Xu, and et al. “EgoVideo: Exploring Egocentric Foundation Model and Downstream Adaptation”, arXiv 2406.18070, 2024
- Isaac Ong, Amjad Almahairi, V Wu, and et al., “RouteLLM: Learning to Route LLMs with Preference Data”, arXiv 2406.18665, 2024
- X Mai, Z Tao, J Lin, and et al. “From Efficient Multimodal Models to World Models: A Survey”, arXiv 2407.00118, 2024
- X Cheng, J Li, S Yang, G Yang, and X Wang. “Open-Television: Teleoperation with immersive active visual feedback”, arXiv 2407.01512, 2024
- I Georgiev, V Giridhar, N Hansen, A Garg, “PWM: Policy Learning with Large World Models”, arXiv 2407.02466, 2024
- R Ding, Y Qin, J Zhu, and et al, “Bunny-VisionPro: Real-Time Bimanual Dexterous Teleoperation for Imitation Learning”, arXiv 2407.03162, 2024
- Y Liu, W Chen, Y Bai and et al. “Aligning Cyber Space with Physical World: A Comprehensive Survey on Embodied AI”, arXiv 2407.06886, 2024
- L Zheng, F Yan, F Liu, and et al., “RoboCAS: A Benchmark for Robotic Manipulation in Complex Object Arrangement Scenarios”, arXiv 2407.06951, 2024
- N Chernyadev, N Backshall, X Ma, and et al., “BiGym: A Demo-Driven Mobile Bi-Manual Manipulation Benchmark”, arXiv 2407.07788, 2024
- A Lee, I Chuang, L-Y Chen, I Soltani, “InterACT: Inter-dependency Aware Action Chunking with Hierarchical Attention Transformers for Bimanual Manipulation”, arXiv 2409.07914, 2024
- W Wu, H He, Y Wang, and et al., “MetaUrban: A Simulation Platform for Embodied AI in Urban Spaces”, arXiv 2407.08725, 2024
- H Ha, Y Gao, Z Fu, J Tan, and S Song, “UMI on Legs: Making Manipulation Policies Mobile with Manipulation-Centric Whole-body Controllers”, arXiv 2407.10353, 2024
- H Wang, J Chen, W Huang, and et al., “GRUtopia: Dream General Robots in a City at Scale”, arXiv 2407.10943, 2024
- Y Bao, T Ding, J Huo, and et al. “3D Gaussian Splatting: Survey, Technologies, Challenges, and Opportunities”, arXiv 2407.17418, 2024
- Llama Team, Meta AI, “The Llama 3 Herd of Models”, arXiv 2407.21783, 2024
- Y Wu, Z Sun, S Li, S Welleck, Y Yang. “Inference Scaling Laws: An Empirical Analysis of Compute-optimal Inference For LLM Problem-solving” (REBASE), arXiv 2408.00724, 2024
- H Qu, L Ning, R An, and et al., “A Survey of Mamba”, arXiv 2408.01129, 2024
- K Maeda, T Hirasawa, A Hashimoto, and et al. “COM Kitchens: An Unedited Overhead-view Video Dataset as a Vision-Language Benchmark”, arXiv 2408.02272, 2024
- C Snell, J Lee, K Xu, and A Kumar. “Scaling LLM test-time compute optimally can be more effective than scaling model parameters”. arXiv 2408.03314, 2024.
- Z Fang, M Yang, W Zeng, and et al., “Egocentric Vision Language Planning” (EgoPlan), arXiv 2408.05802, 2024
- H Arai, K Miwa, K Sasaki, and et al., “CoVLA: Comprehensive Vision-Language-Action Dataset for Autonomous Driving”, arXiv 2408.10845, 2024
- Z Wang, H Zheng, Y Nie, and et al., “All Robots in One: A New Standard and Unified Dataset for Versatile, General-Purpose Embodied Agents” (ARIO). arXiv 2408.10899, 2024
- Y Zheng, L Yao, Y Su, and et al., “A Survey of Embodied Learning for Object-Centric Robotic Manipulation”, arXiv 2408.11537, 2024
- S Yang, M Liu, Y Qin, and et al. “ACE: A cross-platform visual-exoskeletons system for low-cost dexterous teleoperation”, arXiv 2408.11805, 2024
- R Doshi, H Walke, O Mees, S Dasari, S Levine, “Scaling Cross-Embodied Learning: One Policy for Manipulation, Navigation, Locomotion and Aviation” (CrossFormer), arXiv 2408.11812, 2024
- Figure 02, https://techcrunch.com/2024/08/06/figures-new-humanoid-robot-leverages-openai-for-natural-speech-conversations/, Aug. 2024
- Y. Yang, F.-Y. Sun, L. Weihs, and et al., “Holodeck: Language guided generation of 3d embodied AI environments,” IEEE/CVF CVPR, 2024
- Y. Yang, B. Jia, P. Zhi, and S. Huang, “Physcene: Physically interactable 3d scene synthesis for embodied AI,” IEEE/CVF CVPR, 2024
- H Etukuru, N Naka, Z Hu, and et al., “Robot Utility Models: General Policies for Zero-Shot Deployment in New Environments” (Stick-v2/RUM), arXiv 2409.05865, 2024
- K Li, S M Wagh, N Sharma, and et al., “Haptic-ACT: Bridging Human Intuition with Compliant Robotic Manipulation via Immersive VR”, arXiv 2409.11925, 2024
- A Yang, B Zhang, B Hui, and et al. “Qwen2.5-math technical report: Toward mathematical expert model via self-improvement”. arXiv 2409.12122, 2024.
- J Wen, Y Zhu, J Li, and et al., “TinyVLA: Towards Fast, Data-Efficient Vision-Language-Action Models for Robotic Manipulation”, arXiv 2409.12514, 2024
- A Anwar, J Welsh, J Biswas, S Pouya, Y Chang, “ReMEmbR: Building and Reasoning Over Long-Horizon Spatio-Temporal Memory for Robot Navigation”, arXiv 2409.13682, 2024
- I Chuang, A Lee, D Gao, I Soltani, “Active Vision Might Be All You Need: Exploring Active Vision in Bimanual Robotic Manipulation” (AV-ALOHA), arXiv 2409.17435, 2024
- Z Wu, T Wang, Z Zhuoma, and et al., “Fast-UMI: A Scalable and Hardware-Independent Universal Manipulation Interface”, arXiv 2409.19499, 2024
- OpenAI o1, “Learning to reason with LLMs”. https://openai.com/index/learning-to-reason-with-llms, 2024.
- World Labs, an AI company for spatial intelligence, https://www.worldlabs.ai/, Sep. 2024
- C-L Cheang, G Chen, Y Jing, and et al., “GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation”, ByteDance Research, Tech. Report, arXiv 2410.06158, Oct., 2024
- J Wang, M Fang, Z Wan, and et al., “OpenR: An Open Source Framework for Advanced Reasoning with Large Language Models”, Tech. Report, arXiv 2410.09671, Oct. 2024
- S Tao, F Xiang, A Shukla, and et al., “ManiSkill3: GPU Parallelized Robotics Simulation and Rendering for Generalizable Embodied AI”, arXiv 2410.00425, 2024
- P Hua, M Liu, A Macaluso, and et al., “GenSim2: Scaling Robot Data Generation with Multi-modal and Reasoning LLMs”, arXiv 2410.03645, 2024
- S Liu, L Wu, B Li, and et al.,“RDT-1B: a Diffusion Foundation Model For Bimanual Manipulation”, arXiv 2410.07864, 2024
- S Chen, C Wang, K Nguyen, Li F-F, C. K Liu, “ARCap: Collecting High-quality Human Demonstrations for Robot Learning with Augmented Reality Feedback”, arXiv 2410.08464, 2024
- D Su, S Sukhbaatar, M Rabbat, Y Tian, Q Zheng, “Dualformer: Controllable Fast and Slow Thinking by Learning with Randomized Reasoning Traces”, arXiv 2410.09918, 2024
- S Dasari, O Mees, S Zhao, M K Srirama, S Levine, “The Ingredients for Robotic Diffusion Transformers” (DiT-Block Policy), arXiv 2410.10088, 2024
- Y Ze, Z Chen, W Wang, and et al., “Generalizable Humanoid Manipulation with Improved 3D Diffusion Policies” (iDP3), arXiv 2410.10803, 2024
- T Z. Zhao, J Tompson, D Driess, and et al., “ALOHA Unleashed: A Simple Recipe for Robot Dexterity”, arXiv 2410.13126, 2024
- S Zhu, G Wang, D Kong, H Wang, “3D Gaussian Splatting in Robotics: A Survey”, arXiv 2410.12262, 2024
- B Han, J Kim, and J Jang, “A Dual Process VLA: Efficient Robotic Manipulation Leveraging VLM” (DP-VLA), arXiv 2410.15549, 2024
- Y Zhang, Z Li, M Zhou, S Wu, Jiajun Wu, “The Scene Language: Representing Scenes with Programs, Words, and Embeddings” (SL-DSL), arXiv 2410.16770, 2024
- Y Yue, Y Wang, B Kang, and et al., “DeeR-VLA: Dynamic Inference of Multimodal Large Language Models for Efficient Robot Execution”, arXiv 2411.02359, 2024
- S Nasiriany, S Kirmani, T Ding, and et al., “RT-Affordance: Affordances are Versatile Intermediate Representations for Robot Manipulation”, arXiv 2411.02704, 2024
- Y Chen, C Wang, Y Yang, C. Liu, “Object-Centric Dexterous Manipulation from Human Motion Data” (OCDM), arXiv 2411.04005, 2024
- S Zhao, X Zhu, Y Chen, and et al., “DexH2R: Task-oriented Dexterous Manipulation from Human to Robots”, arXiv 2411.04428, 2024
- Z Zhang, R Chen, J Ye, and et al., “WHALE: Towards Generalizable and Scalable World Models for Embodied Decision-making”, arXiv 2411.05619, 2024
- K Shaw, Y Li, J Yang, and et al., “Bimanual Dexterity for Complex Tasks” (BiDex), arXiv 2411.13677, 2024
- X Wang, L Horrigan, J Pinskier, and et al., “DexGrip: Multi-modal Soft Gripper with Dexterous Grasping and In-hand Manipulation Capacity”, arXiv 2411.17124, 2024
- Z Liang, Y Mu, Y Wang, and et al., “DexDiffuser: Interaction-aware Diffusion Planning for Adaptive Dexterous Manipulation”, arXiv 2411.18562, 2024
- Anthropic, https://www.anthropic.com/news/3-5-models-and-computer-use, Nov. 2024