What Data Does Embodied AI Behavior Learning Need? (Part 6)

7. Wearable AI

Mapping another person's activity onto one's own egocentric viewpoint is a basic skill that humans master from a very young age.

Wearable AI, or egocentric AI, is essentially a robotics application. Devices such as smart glasses, neural wristbands, and AR headsets (Meta Project Aria [180], VisionProTeleop [253]) use AI to perceive the user's environment, understand spatial context, and make predictions [218, 304, 323].

(Figures: Project Aria; Vision Pro)

Although large amounts of data have been collected from the egocentric (wearable-device) viewpoint, it is essential for AI agents to be able to learn directly from demonstration videos captured from different viewpoints.

Only a few datasets record time-synchronized egocentric and exocentric videos of the same environment. Generalizing embodied action learning requires converting between the third-person and first-person viewpoints [53, 257].
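At the geometric level, exo-to-ego conversion is a change of reference frame between a calibrated third-person camera and the head-mounted camera. Below is a minimal sketch in Python, assuming both cameras are time-synchronized and calibrated against a common world frame (as ego-exo rigs such as Ego-Exo4D [216] provide); the function and variable names are illustrative only and not taken from any dataset API.

```python
# Minimal sketch: re-expressing a 3D keypoint observed by an exocentric
# camera in the egocentric (head-mounted) camera frame. Assumes both
# cameras are calibrated and time-synchronized against a common world
# frame; all names here are illustrative.
import numpy as np

def to_homogeneous(R: np.ndarray, t: np.ndarray) -> np.ndarray:
    """Build a 4x4 world-from-camera transform from rotation R and translation t."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T

def exo_to_ego(p_exo: np.ndarray, T_world_exo: np.ndarray, T_world_ego: np.ndarray) -> np.ndarray:
    """Map a 3D point expressed in the exocentric camera frame into the egocentric frame."""
    p_h = np.append(p_exo, 1.0)                   # homogeneous coordinates
    p_world = T_world_exo @ p_h                   # exo camera -> world
    p_ego = np.linalg.inv(T_world_ego) @ p_world  # world -> ego camera
    return p_ego[:3]

# Example: a hand keypoint 1.5 m in front of the exo camera.
T_world_exo = to_homogeneous(np.eye(3), np.array([2.0, 0.0, 1.2]))
T_world_ego = to_homogeneous(np.eye(3), np.array([0.0, 0.0, 1.6]))
print(exo_to_ego(np.array([0.0, 0.0, 1.5]), T_world_exo, T_world_ego))
```

Learned view transfer (e.g., Ego-Exo [53] or exo-to-ego lifting [257]) goes beyond this purely geometric mapping, but synchronized calibration of this kind is the ground truth that makes such learning possible.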

8. Dataset Requirements

Based on the analysis above, the list of requirements can be stated as follows:

1) The dataset is intended to facilitate research on large-scale embodied learning tasks.
2) The dataset supports generalization to new objects, new environments, new tasks, and even new embodiments.
3) The dataset meets diversity requirements for embodiment, time, place, viewpoint, object, and skill.
4) The dataset provides sufficiently accurate ground truth: calibration, synchronization, mapping and localization, and annotation.
5) The dataset complies with privacy and ethical standards: de-identification.
6) The dataset includes both real and simulated data, enabling real2sim and sim2real conversion.
7) The dataset includes exo-ego view data and supports flexible exo-ego view conversion.
8) The dataset follows a unified format standard and can be converted between various data formats (a minimal schema sketch is given after this list).
9) The dataset provides evaluation benchmarks for perception, cognition (reflection, reasoning, planning), and action (manipulation).
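As an illustration of requirement 8, here is a minimal sketch of what a unified, convertible episode record could look like, loosely inspired by community formats such as RLDS or LeRobot [278]; every field name and structure below is an illustrative assumption, not an established standard.

```python
# Minimal sketch of a unified episode record that could be converted to/from
# platform-specific formats. The field names and structure are illustrative
# assumptions, not an existing standard.
from dataclasses import dataclass, field
from typing import Dict, List
import numpy as np

@dataclass
class Step:
    timestamp_ns: int                   # shared clock so all streams stay synchronized
    images: Dict[str, np.ndarray]       # e.g., {"ego_rgb": ..., "exo_rgb": ...}
    proprio: np.ndarray                 # joint positions / end-effector pose
    action: np.ndarray                  # commanded action at this step
    language: str = ""                  # task instruction, if any

@dataclass
class Episode:
    embodiment: str                     # e.g., "franka_arm", "humanoid_h1"
    calibration: Dict[str, np.ndarray]  # camera intrinsics/extrinsics as ground truth
    steps: List[Step] = field(default_factory=list)

def to_flat_dict(ep: Episode) -> Dict[str, list]:
    """Flatten an episode into column arrays, a common step before exporting
    to another dataset format."""
    return {
        "timestamp_ns": [s.timestamp_ns for s in ep.steps],
        "proprio": [s.proprio for s in ep.steps],
        "action": [s.action for s in ep.steps],
        "language": [s.language for s in ep.steps],
    }
```

A schema of this kind makes requirements 3, 4, and 7 concrete: per-step timestamps carry synchronization, the calibration block carries ground truth, and multiple named image streams carry ego/exo views side by side.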

9. Conclusion

This article has outlined the evolution from traditional AI to LLMs, VLMs, agents, spatial intelligence, and embodied AI, and has analyzed policy training for embodied actions/behaviors, the embodiment flexibility of data-capture platforms, simulation platforms, and egocentric/wearable AI. On that basis, it has laid out the essential requirements for building datasets.

Finally, it discusses generalization techniques in embodied AI, which offer insights for embodied data capture.

9.1 Generalization Techniques

具身人工智能中的策略泛化方法可以如下。
1) Sim-to-real domain transfer in RL [25, 46] (a minimal domain-randomization sketch follows this list);
2) Data augmentation and generative AI models (e.g., GANs and diffusion policies) [127];
3) Data scale and diversity (Open-X) [195];
4) Intermediate representations (including affordance estimation) [355];
5) Large-scale model architectures (Transformers) [113, 172];
6) Pre-trained large foundation models [299];
7) Post-training fine-tuning [200];
8) Inference-time optimization [334].
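As a concrete illustration of item 1, the sketch below shows domain randomization, one common ingredient of sim-to-real transfer: physics and visual parameters are re-sampled every episode so the trained policy does not overfit to a single simulator configuration. The parameter ranges and the reset_simulation hook are placeholders, not a specific simulator's API.

```python
# Minimal sketch of domain randomization for sim-to-real transfer.
# Parameter ranges and the reset hook are illustrative placeholders.
import random

def sample_randomized_physics() -> dict:
    """Draw one set of physics/visual parameters per training episode."""
    return {
        "friction": random.uniform(0.5, 1.5),
        "object_mass_kg": random.uniform(0.05, 0.5),
        "camera_jitter_m": random.uniform(0.0, 0.02),
        "light_intensity": random.uniform(0.3, 1.0),
    }

def reset_simulation(params: dict) -> None:
    """Placeholder: apply the sampled parameters to the simulator of choice."""
    print(f"resetting simulator with {params}")

for episode in range(3):
    # A policy trained across many such randomized episodes is more likely
    # to treat the real world as just one more variation it has already seen.
    reset_simulation(sample_randomized_physics())
```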

References

  1. J Aloimonos, I Weiss, A Bandyopadhyay, “Active vision”, IJCV, vol. 1, Jan. 1987
  2. R. Bajcsy, “Active Perception”, IEEE Proceedings, Vol 76, No 8, Aug. 1988.
  3. B. M. Yamauchi, “Packbot: a versatile platform for military robotics” (iRobot), Unmanned ground vehicle technology VI, vol. 5422. SPIE, Sept. 2004.
  4. N. Koenig and A. Howard, “Design and use paradigms for Gazebo, an open-source multi-robot simulator,” IEEE/RSJ IROS, Oct. 2004.
  5. P. R. Wurman, R. D’Andrea, and M. Mountz, “Coordinating hundreds of cooperative, autonomous vehicles in warehouses” (Kiva Systems), AI magazine, 29 (1), July, 2008.
  6. M. Raibert, K. Blankespoor, G. Nelson, and R. Playter, “Bigdog, the rough-terrain quadruped robot,” IFAC Proceedings Volumes, 41(2), July 2008.
  7. J Deng, W Dong, R Socher, and et al. “ImageNet: A large-scale hierarchical image database”, IEEE CVPR, June 2009
  8. G. Echeverria, N. Lassabe, A. Degroote and S. Lemaignan, “Modular open robots simulation engine: MORSE,” IEEE ICRA, May, 2011.
  9. E. Todorov, T. Erez, and Y. Tassa, “MuJoCo: A physics engine for model-based control,” IEEE/RSJ IROS, Oct. 2012
  10. MIT Quadruped Robot Cheetah, https://spectrum.ieee.org/mit-cheetah-robot-running, IEEE Spectrum, May 2013
  11. E. Rohmer, S. P. Singh, and M. Freese, “V-Rep: A versatile and scalable robot simulation framework” (CoppeliaSim), IEEE/RSJ IRS, Nov. 2013
  12. Y Bai and C K Liu. “Dexterous manipulation using both palm and fingers” (DMPF). IEEE ICRA, June, 2014.
  13. F. Tanaka, K. Isshiki, F. Takahashi, and et al., “Pepper learns together with children: Development of an educational application”, IEEE Int. Conf. on Humanoid Robots, Nov. 2015.
  14. E. Coumans and Y. Bai, “Pybullet, a python module for physics simulation for games, robotics and machine learning,” https://github.com/bulletphysics/bullet3, 2016
  15. S. Maniatopoulos, P. Schillinger, V. Pong, D. C. Conner, and H. Kress-Gazit, “Reactive high-level behavior synthesis for an Atlas humanoid robot,” IEEE ICRA, May 2016.
  16. V. Kumar, A. Gupta, E. Todorov, and S. Levine, “Learning dexterous manipulation policies from experience and imitation” (LPEI), arXiv 1611.05095, 2016.
  17. S. Shah, D. Dey, C. Lovett, and A. Kapoor, “AirSim: High-fidelity visual and physical simulation for autonomous vehicles,” arXiv 1705.05065, 2017
  18. A. Chang, A. Dai, T. Funkhouser, and et al., “Matterport3D: Learning from RGB-D data in indoor environments”, arXiv1709.06158, 2017
  19. A. Rajeswaran, V. Kumar, A. Gupta, and et al, “Learning complex dexterous manipulation with deep reinforcement learning and demonstrations” (DAPG), RSS’18, arXiv 1709.10087, 2017.
  20. M Savva, A Chang, A Dosovitskiy, and et al., “MINOS: Multimodal Indoor Simulator for Navigation in Complex Environments”, arXiv 1712.03931, 2017
  21. E. Kolve, R. Mottaghi, D. Gordon, and et al., “AI2-THOR: An interactive 3d environment for visual AI,” arXiv 1712.05474, 2017
  22. A Vaswani, N Shazeer, N Parmar, et al. “Attention is All You Need” (Transformer). Advances in Neural Information Processing Systems, 2017.
  23. A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever, “Improving language understanding by generative pre-training” (GPT-1), https://openai.com/index/language-unsupervised/, June 2018.
  24. X. Puig, K. Ra, M. Boben, and et al., “Virtualhome: Simulating household activities via programs,” in IEEE/CVF CVPR, Jun 2018
  25. OpenAI team, “Learning dexterous in-hand manipulation” (LDIM), arXiv 1808.00177, 2018.
  26. A. Juliani, V-P Berges, E. Teng, and et al., “Unity: A general platform for intelligent agents” (Unity ML-Agents), arXiv 1809.02627, 2018.
  27. S Li, X Ma, H Liang, and et al. “Vision-based teleoperation of shadow dexterous hand using end-to-end deep neural network” (TeachNet). ICRA, arXiv 1809.06268, 2018.
  28. A Mandlekar, Y Zhu, A Garg, and et al. “RoboTurk: A crowdsourcing platform for robotic skill learning through imitation”. CoRL, arXiv 1811.02790, 2018.
  29. A. Radford, J. Wu, R. Child, et al., “Language models are unsupervised multitask learners” (GPT-2), OpenAI blog, 2019.
  30. Kroemer, O., Niekum, S., & Konidaris, G. “A review of robot learning for manipulation: Challenges, representations, and algorithms”. arXiv 1907.03146, 2019
  31. M Shoeybi et al., “Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism”, arXiv 1909.08053, 2019
  32. A. Nagabandi, K. Konoglie, S. Levine, and V. Kumar, “Deep dynamics models for learning dexterous manipulation” (PDDM), arXiv 1909.11652, 2019.
  33. S Rajbhandari, J Rasley, O Ruwase, Y He, “ZeRO: Memory Optimizations Toward Training Trillion Parameter Models”, arXiv 1910.02054, 2019
  34. A Handa, K Van Wyk, W Yang, and et al. “DexPilot: Vision-based teleoperation of dexterous robotic hand-arm system”. IEEE ICRA, arXiv 1910.03135, 2019.
  35. S Dasari, F Ebert, S Tian, and et al. “RoboNet: Large-scale multi-robot learning”. CoRL’19, arXiv 1910.11215, 2019
  36. M Savva, A Kadian, O Maksymets, and et al. “Habitat: A platform for embodied AI research”. IEEE ICCV, 2019.
  37. Hafner D, Lillicrap T, Ba J, et al. “Dream to control: learning behaviors by latent imagination” (Dreamer v1). arXiv 1912.01603, 2019
  38. Ravichandar, H., Polydoros, A. S., Chernova, S., & Billard, A. “Recent advances in robot learning from demonstration” (review). Annual Review of Control, Robotics, Auto. Systems, vol.3, 2020
  39. C C. Kessens, J Fink, A Hurwitz, and et al., “Toward fieldable human-scale mobile manipulation using RoMan”, AI and Machine Learning for Multi-Domain Operations Applications II, Volume 11413, SPIE, April, 2020
  40. I. Radosavovic, X. Wang, L. Pinto, and J. Malik, “State-only imitation learning for dexterous manipulation” (SOIL), IEEE/RSJ IROS’21. arXiv 2004.04650, 2020.
  41. M Deitke, W Han, A Herrasti and et al. “RoboTHOR: An open simulation-to-real embodied AI platform”. CVPR’20, arXiv 2004.06799, 2020
  42. Damen D, Doughty H, Farinella G M, et al. “The EPIC-Kitchens dataset: collection, challenges and baselines”. arXiv 2005.00343, IEEE T-PAMI, 43(11): 4125–4141, 2021
  43. T. B. Brown, B. Mann, N. Ryder, et al., “Language models are few-shot learners” (GPT-3), arXiv 2005.14165, 2020
  44. C. Li, S. Zhu, Z. Sun, and J. Rogers, “BAS optimized ELM for KUKA iiwa Robot Learning,” IEEE Transactions on Circuits and Systems II: Express Briefs, 68 (6), Oct. 2020.
  45. F. Xiang, Y. Qin, K. Mo, and et al., “SAPIEN: A simulated part-based interactive environment,” arXiv 2003.08515, IEEE/CVF CVPR, Jun 2020.
  46. Zhao, W., Queralta, J. P., and Westerlund, T. “Sim-to-real transfer in deep reinforcement learning for robotics: a survey”. arXiv 2009.13303, 2020.
  47. Hafner D, Lillicrap T, Norouzi M, et al. “Mastering Atari with discrete world models” (Dreamer v2). arXiv 2010.02193, 2020
  48. A. Zeng, P. Florence, J. Tompson, and et al., “Transporter networks: Rearranging the visual world for robotic manipulation”. CoRL’20, arXiv 2010.14406, 2020
  49. Y Ishiguro, T Makabe, Y Nagamatsu, and et al., “Bilateral humanoid teleoperation system using whole-body exoskeleton cockpit TABLIS”, IEEE IROS, Oct. 2020
  50. B. Shen, F. Xia, C. Li, and et al., “iGibson 1.0: A simulation environment for interactive tasks in large realistic scenes,” arXiv 2012.02924, IEEE/RSJ IROS, 2021
  51. J Ren, S Rajbhandari, R Y Aminabadi et al., “ZeRO-offload: Democratizing Billion-Scale Model Training”, arXiv 2101.06840, 2021
  52. S Rajbhandari et al., “ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning”, arXiv 2104.07857, 2021
  53. Y Li, T Nagarajan, B Xiong, and K Grauman. “Ego-Exo: Transferring visual representations from third-person to first-person videos”. arXiv 2104.07905, CVPR, 2021
  54. D Kalashnikov, J Varley, Y Chebotar, and et al., “MT-Opt: Continuous Multi-Task Robotic Reinforcement Learning at Scale”, arXiv 2104.08212, 2021
  55. K. Ehsani, W. Han, A. Herrasti, and et al., “ManipulaTHOR: A framework for visual object manipulation,” arXiv 2104.11213, IEEE/CVF CVPR, 2021.
  56. M Caron, H Touvron, I Misra, and et al. “Emerging Properties in Self-Supervised Vision Transformers” (Dino v1), arXiv 2104.14294, 2021
  57. Chen L, Lu K, Rajeswaran A, et al. “Decision transformer: reinforcement learning via sequence modeling”, arXiv 2106.01345, 2021
  58. Janner M, Li Q, Levine S. “Offline reinforcement learning as one big sequence modeling problem” (Trajectory Transformer), arXiv 2106.02039, 2021
  59. E Hu et al., “LORA: Low-Rank Adaptation of Large Language Models”, arXiv 2106.09685, 2021
  60. A Szot, A Clegg, E Undersander, and et al. “Habitat 2.0: Training Home Assistants to Rearrange their Habitat”, arXiv 2106.14405, 2021
  61. Mu T Z, Ling Z, Xiang F B, et al. “Maniskill: generalizable manipulation skill benchmark with large-scale demonstrations”, arXiv 2107.14483, 2021
  62. A. Jaegle, S. Borgeaud, J. B. Alayrac, and et al. “Perceiver IO: A general architecture for structured inputs & outputs”. arXiv 2107.14795, 2021.
  63. A Radford, J W Kim, C Hallacy, et al. “Learning transferable visual models from natural language supervision” (CLIP). ICML 2021.
  64. A Ramesh, M Pavlov, G Goh, et al., “Zero-shot text-to-image generation” (DALL-E). ICML. Virtual event, July 2021
  65. C. Li, F. Xia, R. Martín-Martín, and et al., “iGibson 2.0: Object-centric simulation for robot learning of everyday household tasks,” arXiv 2108.03272, CoRL’21, 2021
  66. Y Qin, Y-H Wu, S Liu, and et-al. “DexMV: Imitation learning for dexterous manipulation from human videos”. ECCV’22, arXiv 2108.05877, 2021.
  67. Tesla Bot (Optimus), https://spectrum.ieee.org/elon-musk-robot, IEEE Spectrum, Aug., 2021
  68. S K Ramakrishnan, A Gokaslan, E Wijmans, and et al. “Habitat-Matterport 3D Dataset (HM3D): 1000 Large-scale 3D environments for embodied AI”. arXiv 2109.08238, 2021
  69. M. Shridhar, L. Manuelli, and D. Fox, “CliPort: What and where pathways for robotic manipulation,” arXiv 2109.12098, 2021
  70. F. Ebert, Y. Yang, K. Schmeckpeper, and et al. “Bridge data: Boosting generalization of robotic skills with cross-domain datasets”. arXiv 2109.13396, 2021.
  71. K. Grauman, A. Westbury, E. Byrne and et al. “Ego4D: Around the world in 3,000 hours of egocentric video”. arXiv 2110.07058, 2021
  72. Z Bian et al., “Colossal-AI: A Unified Deep Learning System For Large-Scale Parallel Training”, arXiv 2110.14883, 2021
  73. E Jang, A Irpan, M Khansari, and et al. “BC-Z: Zero-shot task generalization with robotic imitation learning”. CoRL, 2021
  74. C. Gan, J. Schwartz, S. Alter, and et al., “ThreeDWorld: A platform for interactive multi-modal physical simulation,” arXiv 2007.04954, NeurIPS’21, 2021
  75. R Rombach, A Blattmann, D Lorenz, P Esser, and B Ommer. “High-resolution image synthesis with latent diffusion models” (Stable Diffusion). arXiv 2112.10752, 2021.
  76. W. Huang, P. Abbeel, D. Pathak, and I. Mordatch, “Language models as zero-shot planners: Extracting actionable knowledge for embodied agents” (T-LM), arXiv 2201.07207, ICML, 2022.
  77. P Mandikal and K Grauman. “DexVIP: Learning dexterous grasping with human hand pose priors from video”. CoRL, arXiv 2202.00164, 2022.
  78. Li J, Li D, Xiong C, et al. “BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation”, arXiv 2201.12086, 2022
  79. Y. Liu, Y. Liu, C. Jiang, and et al., “HOI4D: A 4D egocentric dataset for category-level human-object interaction”. CVPR’22, arXiv 2203.01577, 2022
  80. L Ouyang, J Wu, X Jiang et al., “Training language models to follow instructions with human feedback” (GPT-3.5/InstructGPT), arXiv 2203.02155, 2022
  81. N Hansen, X Wang, H Su, “Temporal Difference Learning for Model Predictive Control” (TD-MPC), arXiv 2203.04955, 2022
  82. S P Arunachalam, S Silwal, B Evans, and L Pinto. “Dexterous imitation made easy: A learning-based framework for efficient dexterous manipulation” (DIME). arXiv 2203.13251, 2022.
  83. F Sener, D Chatterjee, D Shelepov, and et al. “Assembly101: A large-scale multi-view video dataset for understanding procedural activities”. CVPR’22, arXiv 2203.14712, 2022
  84. A. Zeng, M. Attarian, K. M. Choromanski, and et al., “Socratic models: Composing zero-shot multimodal reasoning with language”, arXiv 2204.00598, 2022
  85. M Ahn, A Brohan, N Brown, and et al., “Do as I Can, Not as I Say: Grounding Language in Robotic Affordances” (SayCan), arXiv 2204.01691, 2022
  86. A Ramesh, P Dhariwal, A Nichol, and et al. “Hierarchical text-conditional image generation with clip latents” (DALL-E2). arXiv 2204.06125,2022.
  87. Y Qin, H Su, and X Wang. “From one hand to multiple hands: Imitation learning for dexterous manipulation from single-camera teleoperation” (IMDM). RA-L, 7(4), arXiv 2204.12490, 2022.
  88. J-B Alayrac, J Donahue, P Luc, et al., “Flamingo: a visual language model for few-shot learning”. arXiv 2204.14198, 2022
  89. Reed, S., Zolna, K., Parisotto, E., and et al. “A Generalist Agent” (GATO). arXiv 2205.06175, 2022
  90. T Dao et al., “FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness”, arXiv 2205.14135, 2022
  91. S. Haddadin, S. Parusel, L. Johannsmeier, and et al., “The Franka Emika Robot: A reference platform for robotics research and education”, IEEE Robotics & Automation Magazine, 29 (2), June, 2022.
  92. M. Deitke, E. VanderBilt, A. Herrasti, and et al., “ProcTHOR: Large-Scale Embodied AI Using Procedural Generation”, arXiv 2206.06994, NeurIPS’22, 2022
  93. N M Shafiullah, Z J Cui, A Altanzaya, L Pinto, “Behavior Transformers: Cloning k modes with one stone”, arXiv 2206.11251, 2022
  94. P Wu, A Escontrela, D Hafner, P Abbeel, and K Goldberg. “DayDreamer: World models for physical robot learning”. arXiv 2206.14176, 2022
  95. Y. Seo, D. Hafner, H. Liu, and et al., “Masked world models for visual control” (MWM), arXiv 2206.14244, 2022
  96. Huang W, Xia F, Xiao T, et al. “Inner monologue: embodied reasoning through planning with language models”. arXiv 2207.05608, 2022
  97. S Bahl, A Gupta, D Pathak, “Human-to-Robot Imitation in the Wild” (WHIRL), arXiv 2207.09450, July, 2022
  98. A Bucker, L Figueredo, S Haddadin, and et al., “LATTE: LAnguage Trajectory TransformEr”, arXiv 2208.02918, 2022
  99. J Liang, W Huang, F Xia, and et al., “Code as Policies: Language Model Programs for Embodied Control” (CaP), arXiv 2209.07753, 2022
  100. L Yang, Z Zhang, S Hong et al., “Diffusion Models: A Comprehensive Survey of Methods and Applications”, arXiv 2209.00796, 2022
  101. M. Shridhar, L. Manuelli, and D. Fox, “Perceiver-Actor: A multi-task transformer for robotic manipulation,” arXiv 2209.05451, 2022
  102. I. Singh, V. Blukis, A. Mousavian, and et al., “ProgPrompt: Generating situated robot task plans using large language models,” arXiv 2209.11302, IEEE ICRA’23, 2022.
  103. B. Reily, P. Gao, F. Han, H. Wang, and H. Zhang, “Real-time recognition of team behaviors by multisensory graph-embedded robot learning” (Jackal Robot/Clearpath Robotics), IJRA, 41(8), Sep. 2022.
  104. Z Q Chen, K Van Wyk, Y-W Chao, and et-al. “DexTransfer: Real world multi-fingered dexterous grasping with minimal human demonstrations”. arXiv 2209.14284, 2022.
  105. K. Gao, Y. Gao, H. He, et al., “NeRF: Neural radiance field in 3d vision, a comprehensive review”. arXiv 2210.00379, 2022.
  106. S Yao, J Zhao, D Yu, and et al., “ReAct: Synergizing Reasoning and Acting in Language Models”, arXiv 2210.03629, 2022
  107. S P Arunachalam, I Güzey, S Chintala, and Lerrel Pinto. “Holo-Dex: Teaching dexterity with immersive mixed reality”. IEEE ICRA’23, arXiv 2210.06463, 2022.
  108. A Handa, A Allshire, V Makoviychuk, and et al. “Dextreme: Transfer of agile in-hand manipulation from simulation to reality”. arXiv 2210.13702, 2022.
  109. Mohammed, Q., Kwek, C., Chua, C. and et al. “Review of learning-based robotic manipulation in cluttered environments”. Sensors, vol. 22 (20), 2022.
  110. T Chen, M Tippur, S Wu, and et al. “Visual dexterity: In-hand dexterous manipulation from depth”. arXiv 2211.11744, 2022.
  111. C H Song, J Wu, C Washington, and et al., “LLM-Planner: Few-shot grounded planning for embodied agents with large language models”. arXiv 2212.04088, 2022
  112. K Shaw, S Bahl, and D Pathak. “VideoDex: Learning dexterity from internet videos”. arXiv 2212.04498, 2022.
  113. A Brohan, N Brown, J Carbajal, and et al. “RT-1: Robotics transformer for real-world control at scale”. arXiv 2212.06817, 2022
  114. P Liu, W Yuan, J Fu, and et al. “Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing”. ACM Computing Surveys, 55(9):1–35, 2023.
  115. Q Dong, L Li, D Dai, and et al., “A survey for in-context learning”. arXiv 2301.00234, 2023.
  116. Hafner D, Pasukonis J, Ba J, et al. “Mastering diverse domains through world models” (Dreamer v3), arXiv 2301.04104, 2023
  117. M Mittal, C Yu, Q Yu, and et al. “ORBIT: A Unified Simulation Framework for Interactive Robot Learning Environments”, arXiv 2301.04195, 2023
  118. K. Nottingham, P. Ammanabrolu, A. Suhr, and et al. “Do embodied agents dream of pixelated sheep: Embodied decision making using language guided world modelling” (DECKARD), arXiv 2301.12050, 2023
  119. Li J, Li D, Savarese S, et al. “BLIP-2: bootstrapping language-image pre-training with frozen image encoders and large language models”. arXiv 2301.12597, 2023
  120. Y Du, M Yang, B Dai, and et al., “Learning Universal Policies via Text-Guided Video Generation” (UniPi), arXiv 2302.00111, 2023
  121. Z. Wang, S. Cai, A. Liu, X. Ma, and Y. Liang, “Describe, explain, plan and select: Interactive planning with large language models enables open-world multi-task agents” (DEPS), arXiv 2302.01560, 2023.
  122. J Gu, F Xiang, X Li, and et al., “ManiSkill2: A Unified Benchmark for Generalizable Manipulation Skills”, arXiv 2302.04659, 2023
  123. H. Touvron, T. Lavril, G. Izacard, and et al. “LLaMA: Open and efficient foundation language models”. arXiv 2302.13971, 2023.
  124. D Driess, F Xia, M. Sajjadi, et al., “PaLM-E: An Embodied Multimodal Language Model”, arXiv 2303.03378, 2023
  125. G Khandate, S Shang, ET Chang, and et-al. “Sampling- based Exploration for Reinforcement Learning of Dexterous Manipulation” (M-RRT/G-RRT). RSS’23, arXiv 2303.03486, 2023.
  126. S Yang, O Nachum, Y Du, and et al., “Foundation models for decision making: Problems, methods, and opportunities” (review). arXiv 2303.04129, 2023
  127. C Chi, Z Xu, S Feng, and et al., “Diffusion Policy: Visuomotor Policy Learning via Action Diffusion”, arXiv 2303.04137, 2023
  128. Y Cao, S Li, Y Liu, and et al. “A Comprehensive Survey of AI-Generated Content (AIGC): A History of Generative AI from GAN to ChatGPT”, arXiv 2303.04226, 2023
  129. J Pitz, L Röstel, L Sievers, and B Bäuml. “Dextrous tactile in-hand manipulation using a modular reinforcement learning architecture” (DTIM). arXiv 2303.04705, 2023.
  130. J Robine, M Höftmann, T Uelwer, and S Harmeling. “Transformer-based world models are happy with 100k interactions” (TWM). ICLR’23, arXiv 2303.07109, 2023
  131. C Zhang, C Zhang, M Zhang, I S Kweon, “Text-to-image Diffusion Models in Generative AI: A Survey”, arXiv 2303.07909, 2023
  132. J. Achiam, S. Adler, S. Agarwal, and et al. “GPT-4 technical report”. arXiv 2303.08774, 2023
  133. Z-H Yin, B Huang, Y Qin, Q Chen, and X Wang. “Rotating without seeing: Towards in-hand dexterity through touch” (Touch Dexterity). arXiv 2303.10880, 2023.
  134. Shinn N, Cassano F, Berman E, et al. “Reflexion: language agents with verbal reinforcement learning”, arXiv 2303.11366, 2023
  135. I Guzey, B Evans, S Chintala, and L Pinto. “Dexterity from touch: Self-supervised pre- training of tactile representations with robotic play” (T-Dex). arXiv 2303.12076, 2023.
  136. Madaan A, Tandon N, Gupta P, et al. “Self-Refine: iterative refinement with self-feedback”, arXiv 2303.17651, 2023
  137. W X Zhao, K Zhou, J Li, and et al., “A Survey of Large Language Models”, arXiv 2303.18233, Mar. 2023
  138. L Zhao, T Yang, Y Yang, and P Yu. “A wearable upper limb exoskeleton for intuitive teleoperation of anthropomorphic manipulators” (WULE). MDPI Machines, 11(4):441, Mar. 2023.
  139. Figure 01, https://www.fastcompany.com/90859010/the-race-to-build-ai-powered-humanoids-is-heating-up, Mar., 2023
  140. A. Ugenti, R. Galati, G. Mantriota, and G. Reina, “Analysis of an all-terrain tracked robot with innovative suspension system” (Polibot), Mechanism and Machine Theory, vol. 182, April, 2023.
  141. A Kirillov, E Mintun, N Ravi, and et al. “Segment Anything” (SAM). arXiv 2304.02643, 2023
  142. J Park, J Brien, C Cai and et al., “Generative Agents: Interactive Simulacra of Human Behavior” (GA), arXiv 2304.03442, 2023
  143. X Zou, J Yang, H Zhang, et al., “Segment everything everywhere all at once” (SEEM). arXiv 2304.06718, 2023
  144. M Oquab, T Darcet, T Moutakanni, and et al. “Dinov2: Learning robust visual features without supervision”. arXiv 2304.07193, 2023
  145. T Z. Zhao, V Kumar, S Levine, C Finn, “Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware” (ALOHA/ACT), arXiv 2304.13705, 2023
  146. Y Xie, K Kawaguchi K, Y Zhao, and et al. “Self-evaluation guided beam search for reasoning”, arXiv 2305.00633, 2023
  147. M Heo, Y Lee, D Lee, and J. Lim. “FurnitureBench: Reproducible real-world benchmark for long-horizon complex manipulation”. arXiv 2305.12821, 2023.
  148. S Yao, D Yu, J Zhao, and et al., “Tree of Thoughts: Deliberate Problem Solving with Large Language Models”, arXiv 2305.10601, 2023
  149. Mu Y, Zhang Q, Hu M, et al. “EmbodiedGPT: vision-language pre-training via embodied chain of thought”. arXiv 2305.15021, 2023
  150. G Wang, Y Xie, Y Jiang, and et al., “VOYAGER: An Open-Ended Embodied Agent with Large Language Models”, arXiv 2305.16291, 2023
  151. M Kulkarni, T J. L. Forgaard, K Alexis, “Aerial Gym – Isaac Gym Simulator for Aerial Robots”, arXiv 2305.16510, 2023
  152. B Y Lin, Y Fu, K Yang, and et al. “SwiftSage: a generative agent with fast and slow thinking for complex interactive tasks”. arXiv 2305.17390, 2023
  153. Cyberbotics, “Webots: open-source robot simulator”, https://github.com/cyberbotics/webots, 2023
  154. NVIDIA, “Nvidia Isaac Sim: Robotics simulation and synthetic data,” https://developer.nvidia.com/isaac/sim, 2023
  155. AKM Shahariar Azad Rabby, C Zhang, “BeyondPixels: A Comprehensive Review of the Evolution of Neural Radiance Fields”, arXiv 2306.03000, 2023
  156. B Liu, Y Zhu, C Gao, and et al. “LIBERO: Benchmarking knowledge transfer for lifelong robot learning”. arXiv 2306.03310, 2023
  157. P Ren, K Zhang, H Zheng, and et al. “Surfer: Progressive reasoning with world models for robotic manipulation”, arXiv 2306.11335, 2023
  158. Microsoft, “Textbooks Are All You Need” (phi-1), arXiv 2306.11644, 2023
  159. Bousmalis K, Vezzani G, Rao D, et al. “RoboCat: a self-improving generalist agent for robotic manipulation”. arXiv 2306.11706, 2023
  160. A Goyal, J Xu, Y Guo, and et al. “RVT: Robotic view transformer for 3D object manipulation”. arXiv 2306.14896, 2023
  161. Vemprala S, Bonatti R, Bucker A, and et al. “ChatGPT for robotics: design principles and model abilities”, arXiv 2306.17582, 2023
  162. Y Guo, Y-J Wang, L Zha, J Chen, “DoReMi: Grounding Language Model by Detecting and Recovering from Plan-Execution Misalignment”, arXiv 2307.00329, 2023
  163. X Li, V Belagali, J Shang and M S. Ryoo, “Crossway Diffusion: Improving Diffusion-based Visuomotor Policy via Self-supervised Learning”, arXiv 2307.01849, 2023
  164. Y Qin, W Yang, B Huang, and et al. “AnyTeleop: A general vision-based dexterous robot arm-hand teleoperation system”. arXiv 2307.04577, 2023.
  165. Huang W, Wang C, Zhang R, et al. “VoxPoser: Composable 3D value maps for robotic manipulation with language models”. arXiv 2307.05973, 2023
  166. K. Rana, J. Haviland, S. Garg, and et al. “SayPlan: Grounding large language models using 3d scene graphs for scalable task planning,” arXiv 2307.06135, CoRL’23, 2023.
  167. Wang, Y., He, Y., Li, Y., and et al. “InternVid: A large-scale video-text dataset for multimodal understanding and generation”. arXiv 2307.06942, 2023.
  168. T Dao, “FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning”, arXiv 2307.08691, 2023
  169. H. Touvron, L. Martin, K. Stone, and et al. “Llama 2: Open foundation and fine-tuned chat models”. arXiv 2307.09288, 2023.
  170. J Gu, Z Han, S Chen, and et al. “A systematic survey of prompt engineering on vision-language foundation models”. arXiv 2307.12980, 2023
  171. H Ha, P Florence, and S Song. “Scaling up and distilling down: Language-guided robot skill acquisition” (SUDD). CoRL’23, arXiv 2307.14535, 2023
  172. A Brohan, N Brown, J Carbajal, and et al. “RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control”, arXiv 2307.15818, 2023
  173. H Fang, H Fang, Z Tang, and et al. “RH20T: A robotic dataset for learning diverse skills in one-shot”. RSS 2023 Workshop on Learning for Task and Motion Planning, arXiv 2307.00595, July 2023
  174. P. Arm, G. Waibel, J. Preisig, and et al., “Scientific exploration of challenging planetary analog environments with a team of legged robots” (ANYmal C), arXiv 2307.10079, Science robotics, 8 (80), July, 2023.
  175. Lin J, Du Y, Watkins O, et al. “Learning to model the world with language” (Dynalang). arXiv 2308.01399, 2023
  176. Jing, Y., Zhu, X., Liu, X., and et al. “Exploring visual pre-training for robot manipulation: Datasets, models and methods” (Vi-PRoM). arXiv 2308.03620, 2023.
  177. S Zhang, L Dong, X Li and et al., “Instruction Tuning for Large Language Models: A Survey”, arXiv 2308.10792, 2023
  178. L Wang, C Ma, X Feng, and et al. “A Survey on Large Language Model based Autonomous Agents”, arXiv 2308.11432, 2023
  179. H. Walke, K. Black, A. Lee, and et al. “Bridgedata v2: A dataset for robot learning at scale”, arXiv 2308.12952, 2023.
  180. K Somasundaram, J Dong, H Tang, and et al. “Project Aria: A new tool for egocentric multi-modal AI research”. arXiv 2308.13561, 2023.
  181. L G Foo, H Rahmani, and J Liu, “AIGC for Various Data Modalities: A Survey”, arXiv 2308.14177, Aug. 2023
  182. H Bharadhwaj, J Vakil, M Sharma, and et al., “RoboAgent: Generalization and Efficiency in Robot Manipulation via Semantic Augmentations and Action Chunking” (MT-ACT/RoboSet), arXiv 2309.01918, 2023
  183. Microsoft, “Textbooks Are All You Need II: Phi-1.5 technical report”, arXiv 2309.05463, 2023
  184. W Kwon et al., “Efficient Memory Management for Large Language Model Serving with PagedAttention” (vLLM), arXiv 2309.06180, 2023
  185. Z Xi, W Chen, X Guo, and et al. “The Rise and Potential of Large Language Model Based Agents: A Survey”, arXiv 2309.07864, 2023
  186. C Li, Z Gan, Z Yang, and et al. “Multimodal Foundation Models: From Specialists to General-Purpose Assistants” (survey), arXiv 2309.10020, 2023
  187. P Wu, Y Shentu, Z Yi, X Lin, and P Abbeel. “GELLO: A general, low-cost, and intuitive tele-operation framework for robot manipulators”. arXiv 2309.13037, 2023
  188. H Fang, H Fang, Y Wang, and et al. “AirExo: Low-cost exoskeletons for learning whole-arm manipulation in the wild”. arXiv 2309.14975, 2023
  189. T Shen, R Jin, Y Huang, and et al., “Large Language Model Alignment: A Survey”, arXiv 2309.15025, 2023
  190. Z Chu, J Chen, Q Chen, and et al., “A Survey of Chain of Thought Reasoning: Advances, Frontiers and Future”, arXiv 2309.15402, 2023
  191. Q. Gu, A. Kuwajerwala, S. Morin, and et al., “ConceptGraphs: Open-vocabulary 3d scene graphs for perception and planning,” arXiv 2309.16650, 2023
  192. Z Yang, L Li, K Lin, and et al., “The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision)”, arXiv 2309.17421, 2023
  193. L Wang, Y Ling, Z Yuan, and et al. “GenSim: Generating Robotic Simulation Tasks via Large Language Models”, arXiv 2310.01361, 2023
  194. A Q. Jiang, A Sablayrolles, A Mensch, and et al., “Mistral 7B”, arXiv 2310.06825, 2023
  195. A Padalkar, A Pooley, A Jain, and et al. “Open X-Embodiment: Robotic learning datasets and RT-x models”. arXiv 2310.08864, 2023
  196. Zhang W, Wang G, Sun J, et al. “STORM: efficient stochastic transformer based world models for reinforcement learning”. arXiv 2310.09615, 2023
  197. Du, Y., Yang, M., Florence, P. R., and et al. “Video language planning” (VLP). arXiv 2310.10625, Oct. 2023
  198. Y J Ma, W Liang, G Wang, and et al. “EUREKA: Human-Level Reward Design Via Coding Large Language Models”, arXiv 2310.12931, ICLR’24, 2023
  199. Puig, X., Undersander, E., Szot, A., and et al. “Habitat 3.0: A co-habitat for humans, avatars and robots”. arXiv 2310.13724, 2023
  200. Y Feng, N Hansen, Z Xiong, and et al., “Fine-tuning Offline World Models in the Real World” (FOWM), arXiv 2310.16029, 2023
  201. Hansen N, Su H, Wang X. “TD-MPC2: scalable, robust world models for continuous control”. arXiv 2310.16828, 2023
  202. A Mandlekar, S Nasiriany, B Wen, and et al., “MimicGen: A Data Generation System for Scalable Robot Learning using Human Demonstrations”, arXiv 2310.17596, 2023
  203. J Betker, G Goh, L Jing, and et al., “Improving Image Generation with Better Captions” (DALL-E3), OpenAI report, Oct., 2023
  204. Li X, Liu M, Zhang H, et al. “Vision-language foundation models as effective robot imitators” (RoboFlamingo). arXiv 2311.01378, 2023
  205. Wang Y, Xian Z, Chen F, et al. “RoboGen: towards unleashing infinite data for automated robot learning via generative simulation”. arXiv 2311.01455, 2023
  206. J Gu, S Kirmani, P Wohlhart, and et al., “RT-Trajectory: Robotic Task Generalization via Hindsight Trajectory Sketches”, arXiv 2311.01977, 2023
  207. M R Morris, J Sohl-dickstein, N Fiedel, and et al. “Levels of AGI: Operationalizing Progress on the Path to AGI”, arXiv 2311.02462, 2023
  208. H Peng, C Ding, T Geng, and et al., “Evaluating Emerging AI/ML Accelerators: IPU, RDU, and NVIDIA/AMD GPUs”, arXiv 2311.04417, 2023
  209. Zeng, F., Gan, W., Wang, Y., and et al. “Large language models for robotics: A survey”. arXiv 2311.07226, 2023
  210. Y Huang, Y Chen, Z Li, “Applications of Large Scale Foundation Models for Autonomous Driving” (survey), arXiv 2311.12144, 2023
  211. J. Huang, S. Yong, X. Ma, and et al., “An embodied generalist agent in 3d world” (LEO), arXiv 2311.12871, 2023
  212. X Xiao, J Liu, Z Wang, and et al., “Robot Learning in the Era of Foundation Models: A Survey”, arXiv 2311.14379, 2023
  213. Y. Chen, W. Cui, Y. Chen, and et al., “RoboGPT: an intelligent agent of making embodied long-term decisions for daily instruction tasks,” arXiv 2311.15649, 2023.
  214. N Shafiullah, A Rai, H Etukuru, and et al. “On bringing robots home” (Dobb·E/Stick v1/HoNY). arXiv 2311.16098, 2023.
  215. Y. Hu, F Lin, T Zhang, L Yi, and Y Gao, “Look before you leap: Unveiling the power of gpt-4v in robotic vision-language planning” (ViLa), arXiv 2311.17842, 2023.
  216. K Grauman, A Westbury, L Torresani, and et al. “Ego-Exo4D: Understanding skilled human activity from first- and third-person perspectives”, arXiv 2311.18259, 2023
  217. Javaheripi M, Bubeck S, Abdin M, et al. “Phi-2: the surprising power of small language models”. https://www.microsoft.com/en-us/research/blog/phi-2-the-surprising-power-of-small-language-models/, 2023
  218. Y Song, E Byrne, T Nagarajan, and et al. “Ego4D goal-step: Toward hierarchical understanding of procedural activities”. NeurIPS, 2023
  219. I Leal, K Choromanski, D Jain, and et al., “SARA-RT: Scaling up Robotics Transformers with Self-Adaptive Robust Attention”, arXiv 2312.01990, 2023
  220. R Firoozi, J Tucker, S Tian, and et al., “Foundation Models in Robotics: Applications, Challenges, and the Future” (review), arXiv 2312.07843, 2023
  221. Y Hu, Q Xie, V Jain, and et al. “Toward General-Purpose Robots via Foundation Models: A Survey and Meta-Analysis”, arXiv 2312.08782, 2023
  222. P Wang, L Li, Z Shao, and et al. “Math-Shepherd: Verify and reinforce LLMs step-by-step without human annotations”. arXiv 2312.08935, 2023
  223. Team, G., Anil, R., Borgeaud, S., and et al. “Gemini: a family of highly capable multimodal models”. arXiv 2312.11805, 2023.
  224. H Wu, Y Jing, C Cheang, and et al., “GR-1: Unleashing Large-Scale Video Generative Pre-Training For Visual Robot Manipulation”, arXiv 2312.13139, 2023
  225. P Ding, H Zhao, W Song, and et al., “QUAR-VLA: Vision-Language-Action Model for Quadruped Robots”, arXiv 2312.14457, 2023
  226. Mistral AI, “Mixtral of experts: A high quality Sparse Mixture-of-Experts”, https://mistral.ai/news/mixtral-of-experts/, Dec. 2023
  227. C Wen, X Lin, J So, and et al., “Any-point Trajectory Modeling for Policy Learning” (ATM), arXiv 2401.00025, 2024
  228. Z Fu, T Z Zhao, and C Finn, “Mobile ALOHA: Learning Bimanual Mobile Manipulation with Low-Cost Whole-Body Teleoperation”, arXiv 2401.02117, 2024
  229. Y Cheng, C Zhang, Z Zhang, and et al. “Exploring Large Language Model Based Intelligent Agents: Definitions, Methods, and Prospects” (survey), arXiv 2401.03428, 2024
  230. G Chen and W Wang, “A Survey on 3D Gaussian Splatting”, arXiv 2401.03890, 2024
  231. T Cai, Y Li, Z Geng, and et al., “Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads”, arXiv 2401.10774, 2024
  232. B Chen, Z Xu, S Kirmani, and et al. “SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities”, arXiv 2401.12168, 2024
  233. M Ahn, D Dwibedi, C Finn, and et al., “AutoRT: Embodied Foundation Models for Large Scale Orchestration of Robotic Agents”, arXiv 2401.12963, 2024
  234. Ming, R., Huang, Z., Ju, Z., and et al. “A survey on video prediction: From deterministic to generative approaches”. arXiv 2401.14718, 2024.
  235. LLaMA.cpp, LLM inference in C/C++, https://github.com/ggerganov/llama.cpp, Jan. 2024
  236. S. Le Cleac’h, T. A. Howell, S. Yang, and et al, “Fast contact-implicit model predictive control” (Unitree Go1), IEEE Transactions on Robotics, Jan. 2024.
  237. X Yan, J Xu, Y Huo, H Bao, “Neural Rendering and Its Hardware Acceleration: A Review”, arXiv 2402.00028, 2024
  238. Z Xu, K Wu, J Wen, and et al. “A Survey on Robotics with Foundation Models: toward Embodied AI”, arXiv 2402.02385, 2024
  239. Z Wang, Y Li, Y Wu, and et al. “Multi-step problem solving through a verifier: An empirical analysis on model-induced process supervision” (MiPS). arXiv 2402.02658, 2024b.
  240. X Huang, W Liu, X Chen, and et al. “Understanding the planning of LLM agents: A survey”, arXiv 2402.02716, 2024
  241. G Paolo, J G-Billandon, B Kegl, “A Call for Embodied AI”, arXiv 2402.03824, 2024
  242. K Kawaharazuka, T Matsushima, A Gambardella, and et al. “Real-World Robot Applications of Foundation Models: A Review”, arXiv 2402.05741, 2024
  243. S Minaee, T Mikolov, N Nikzad, and et al. “Large Language Models: A Survey”, arXiv 2402.06196, 2024
  244. C Eze, C Crick. “Learning by watching: A review of video-based learning approaches for robot manipulation”. arXiv 2402.07127, 2024
  245. B Fei, J Xu, R Zhang, and et al., “3D Gaussian as A New Vision Era: A Survey”, arXiv 2402.07181, 2024
  246. G Yenduri, Ramalingam M, P Maddikunta, and et al., “Spatial Computing: Concept, Applications, Challenges and Future Directions”, arXiv 2402.07912, 2024
  247. C Chi, Z Xu, C Pan, and et al., “Universal Manipulation Interface: In-The-Wild Robot Teaching Without In-The-Wild Robots” (UMI), arXiv 2402.10329, 2024
  248. Z Tan, A Beigi, S Wang, and et al. “Large Language Models for Data Annotation: A Survey”, arXiv 2402.13446, 2024
  249. P Gao, P Wang, F Gao, et al. “Vision-Language Navigation with Embodied Intelligence: A Survey”, arXiv 2402.14304, 2024
  250. S Yang, J Walker, J Parker-Holder and et al. “Video as the new language for real-world decision making”, arXiv 2402.17139, 2024
  251. Y Liu, J Cao, C Liu, and et al., “Datasets for Large Language Models: A Comprehensive Survey”, arXiv 2402.18041, 2024
  252. OpenAI Sora, “Video generation models as world simulators”, https://openai.com/index/video-generation-models-as-world-simulators/, Feb. 2024
  253. Y Park and P Agrawal. “Using apple vision pro to train and control robots” (VisionProTeleop), https://github.com/Improbable-AI/VisionProTeleop, 2024
  254. S. Belkhale, T. Ding, T. Xiao, “RT-H: Action hierarchies using language,” arXiv 2403.01823, Mar. 2024
  255. S Lee, Y Wang, H Etukuru, and et al., “Behavior Generation with Latent Actions” (VQ-BeT), arXiv 2403.03181, 2024
  256. Ze Y, Zhang G, Zhang K, et al. “3D Diffusion Policy: Generalizable Visuomotor Policy Learning via Simple 3D Representations”. arXiv 2403.03954, 2024
  257. M Luo, Z Xue, A Dimakis, K Grauman, “Put Myself in Your Shoes: Lifting the Egocentric Perspective from Exocentric Videos”, arXiv 2403.06351, 2024
  258. A Iyer, Z Peng, Y Dai, and et al. “Open Teach: A versatile teleoperation system for robotic manipulation”. arXiv 2403.07870, 2024.
  259. Google Gemma Team, “Gemma: Open Models Based on Gemini Research and Technology”, arXiv 2403.08295, 2024
  260. Li, C., Zhang, R., Wong, J., and et al. “Behavior-1k: A human-centered, embodied AI benchmark with 1,000 everyday activities and realistic simulation”, arXiv 2403.09227, 2024
  261. H. Zhen, X Qiu, P Chen, and et al., “3D-VLA: 3d vision-language-action generative world model,” arXiv:2403.09631, 2024.
  262. T Wu, Y Yuan, L Zhang, and et al. “Recent Advances in 3D Gaussian Splatting” (review), arXiv 2403.11134, 2024
  263. C Wang, H Shi, W Wang, and et al., “DexCap: Scalable and Portable Mocap Data Collection System for Dexterous Manipulation”, arXiv 2403.07788, 2024
  264. C. Sferrazza, D.-M. Huang, X. Lin, Y. Lee, and P. Abbeel. “HumanoidBench: Simulated humanoid benchmark for whole-body locomotion and manipulation”. arXiv 2403.10506, 2024.
  265. J F. Mullen Jr, D Manocha, “LAP, Using Action Feasibility for Improved Uncertainty Alignment of Large Language Model Planners”, arXiv 2403.13198, 2024
  266. Y Huang, G Chen, J Xu, et al. “EgoExoLearn: A Dataset for Bridging Asynchronous Ego- and Exo-centric View of Procedural Activities in Real World”, arXiv 2403.16182, 2024
  267. “AI Power: Accurate Models at Blazing Speeds | SambaNova”, https://sambanova.ai/blog/accurate-models-at-blazing-speed, Samba COE v0.2, March, 2024
  268. Unitree humanoid H1, https://kr-asia.com/unitree-robotics-develops-personal-robot-dogs-that-jog-alongside-you, Mar. 2024
  269. A. Khazatsky, K. Pertsch, S. Nair, “Droid: A large-scale in-the-wild robot manipulation dataset”, arXiv 2403.12945, 2024
  270. S Zhou, Y Du, J Chen, and et al. “RoboDreamer: Learning compositional world models for robot imagination”, arXiv 2404.12377, 2024
  271. ALOHA 2 Team, and et al, “ALOHA 2: An Enhanced Low-Cost Hardware for Bimanual Teleoperation”, arXiv 2405.02292, 2024
  272. J W Kim, T Z. Zhao, S Schmidgall, and et al., “Surgical Robot Transformer (SRT): Imitation Learning for Surgical Tasks”, arXiv 2407.12998, 2024
  273. Microsoft, “Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone”, arXiv 2404.14219, 2024
  274. S. Shin, J. Kim, G.-C. Kang, and et al., “Socratic planner: Inquiry-based zero-shot planning for embodied instruction following,” arXiv 2404.15190, 2024.
  275. Y Xia, R Wang, X Liu, and et al., “Beyond Chain-of-Thought: A Survey of Chain-of-X Paradigms for LLMs”, arXiv 2404.15676, 2024
  276. R Xu, S Yang, Y Wang, and et al., “Visual Mamba: A Survey and New Outlooks”, arXiv 2404.18861, 2024
  277. R McCarthy, D Tan, D Schmidt, and et al. “Towards Generalist Robot Learning from Internet Video: A Survey”, arXiv 2404.19664, 2024
  278. R Cadene, S Alibert, A Soare, and et al., https://github.com/huggingface/lerobot (LeRobot), May, 2024
  279. G. Wang, L. Pan, S. Peng, and et al., “NeRF in robotics: A survey,” arXiv 2405.01333, 2024.
  280. M Dalal, T Chiruvolu, D Chaplot, and R Salakhutdinov. “Plan-Seq-Learn: Language model guided rl for solving long horizon robotics tasks” (PSL), arXiv 2405.01534, 2024
  281. A Dalal, D Hagen, K Robbersmyr, and et al. “Gaussian Splatting: 3D Reconstruction and Novel View Synthesis, a Review”, arXiv 2405.03417, 2024
  282. Z Zhu, X Wang, W Zhao, and et al. “Is Sora a World Simulator? A Comprehensive Survey on General World Models and Beyond”, arXiv 2405.03520, 2024
  283. X Li, K Hsu, J Gu, and et al., “Evaluating Real-World Robot Manipulation Policies in Simulation” (SIMPLER), arXiv 2405.05941, 2024
  284. K F Gbagbe, M A Cabrera, A Alabbas, and et al., “Bi-VLA: Vision-Language-Action Model-Based System for Bimanual Robotic Dexterous Manipulations”, arXiv 2405.06039, 2024
  285. Y Huang, “Levels of AI Agents: from Rules to Large Language Models”, arXiv 2405.06643, May, 2024
  286. R Prabhakar, R Sivaramakrishnan, D Gandhi, and et al., “SambaNova SN40L: Scaling the AI Memory Wall with Dataflow and Composition of Experts”, arXiv 2405.07518, 2024
  287. Octo Model Team, D. Ghosh, H. Walke, K. Pertsch, and et al. “Octo: An open-source generalist robot policy”, arXiv 2405.12213, 2024
  288. Y Ma, Z Song, Y Zhuang, and et al. “A Survey on Vision-Language-Action Models for Embodied AI”, arXiv 2405.14093, 2024
  289. Zhang Y, Yang S, Bai C J, et al. “Towards efficient LLM grounding for embodied multi-agent collaboration” (ReAd). arXiv 2405.14314, 2024
  290. F Bordes, R Y Pang, A Ajay, and et al. “An Introduction to Vision-Language Modeling”, arXiv 2405.17247, 2024
  291. T Zhang, D Li, Y Li, and et al., “Empowering Embodied Manipulation: A Bimanual-Mobile Robot Manipulation Dataset for Household Tasks” (BRMData), arXiv 2405.18860, 2024
  292. Fei-Fei Li, “With Spatial Intelligence, Artificial Intelligence Will Understand the Real World”, https://www.youtube.com/watch?v=y8NtMZ7VGmU, May, 2024
  293. OpenAI GPT-4o, https://openai.com/index/hello-gpt-4o/, May, 2024
  294. J. Liu, M. Liu, Z. Wang, and et al., “RoboMamba: Multimodal state space model for efficient robot reasoning and manipulation,” arXiv 2406.04339, 2024
  295. L Luo, Y Liu, R Liu, and et al. “Improve mathematical reasoning in language models by automated process supervision” (OmegaPRM). arXiv 2406.06592, 2024.
  296. A. Szot, B Mazoure, H Agrawal, and et al., “Grounding multimodal large language models in actions” (Grounding-RL), arXiv 2406.07904, 2024.
  297. A Goyal, V Blukis, J Xu, and et al. “RVT-2: Learning Precise Manipulation from Few Demonstrations”. arXiv 2406.08545, 2024
  298. T He, Z Luo, X He, and et al., “OmniH2O: Universal and Dexterous Human-to-Humanoid Whole-Body Tele-operation and Learning”, arXiv 2406.08858, 2024
  299. M J Kim, K Pertsch, S Karamcheti, and et al., “OpenVLA: An Open-Source Vision-Language-Action Model”, arXiv 2406.09246, 2024
  300. W Cai, J Jiang, F Wang, and et al., “A Survey on Mixture of Experts”, arXiv 2407.06204, 2024
  301. Z Fu, Q Zhao, Q Wu, G Wetzstein, and C Finn. “HumanPlus: Humanoid shadowing and imitation from humans”. arXiv 2406.10454, 2024.
  302. D Niu, Y Sharma, G Biamby, and et al, “LLARVA: Vision-Action Instruction Tuning Enhances Robot Learning”, arXiv 2406.11815, 2024
  303. P Mazzaglia, T Verbelen, B Dhoedt, and et al., “GenRL: Multimodal-foundation world models for generalization in embodied agents”, arXiv 2406.18043, 2024
  304. B Pei, G Chen, J Xu, and et al. “EgoVideo: Exploring Egocentric Foundation Model and Downstream Adaptation”, arXiv 2406.18070, 2024
  305. Isaac Ong, Amjad Almahairi, V Wu, and et al., “RouteLLM: Learning to Route LLMs with Preference Data”, arXiv 2406.18665, 2024
  306. X Mai, Z Tao, J Lin, and et al. “From Efficient Multimodal Models to World Models: A Survey”, arXiv 2407.00118, 2024
  307. X Cheng, J Li, S Yang, G Yang, and X Wang. “Open-Television: Teleoperation with immersive active visual feedback”, arXiv 2407.01512, 2024
  308. I Georgiev, V Giridhar, N Hansen, A Garg, “PWM: Policy Learning with Large World Models”, arXiv 2407.02466, 2024
  309. R Ding, Y Qin, J Zhu, and et al, “Bunny-VisionPro: Real-Time Bimanual Dexterous Teleoperation for Imitation Learning”, arXiv 2407.03162, 2024
  310. Y Liu, W Chen, Y Bai and et al. “Aligning Cyber Space with Physical World: A Comprehensive Survey on Embodied AI”, arXiv 2407.06886, 2024
  311. L Zheng, F Yan, F Liu, and et al., “RoboCAS: A Benchmark for Robotic Manipulation in Complex Object Arrangement Scenarios”, arXiv 2407.06951, 2024
  312. N Chernyadev, N Backshall, X Ma, and et al., “BiGym: A Demo-Driven Mobile Bi-Manual Manipulation Benchmark”, arXiv 2407.07788, 2024
  313. A Lee, I Chuang, L-Y Chen, I Soltani, “InterACT: Inter-dependency Aware Action Chunking with Hierarchical Attention Transformers for Bimanual Manipulation”, arXiv 2409.07914, 2024
  314. W Wu, H He, Y Wang, and et al., “MetaUrban: A Simulation Platform for Embodied AI in Urban Spaces”, arXiv 2407.08725, 2024
  315. H Ha, Y Gao, Z Fu, J Tan, and S Song, “UMI on Legs: Making Manipulation Policies Mobile with Manipulation-Centric Whole-body Controllers”, arXiv 2407.10353, 2024
  316. H Wang, J Chen, W Huang, and et al., “GRUtopia: Dream General Robots in a City at Scale”, arXiv 2407.10943, 2024
  317. Y Bao, T Ding, J Huo, and et al. “3D Gaussian Splatting: Survey, Technologies, Challenges, and Opportunities”, arXiv 2407.17418, 2024
  318. Llama Team, Meta AI, “The Llama 3 Herd of Models”, arXiv 2407.21783, 2024
  319. Y Wu, Z Sun, S Li, S Welleck, Y Yang. “Inference Scaling Laws: An Empirical Analysis of Compute-optimal Inference For LLM Problem-solving” (REBASE), arXiv 2408.00724, 2024
  320. H Qu, L Ning, R An, and et al., “A Survey of Mamba”, arXiv 2408.01129, 2024
  321. K Maeda, T Hirasawa, A Hashimoto, and et al. “COM Kitchens: An Unedited Overhead-view Video Dataset as a Vision-Language Benchmark”, arXiv 2408.02272, 2024
  322. C Snell, J Lee, K Xu, and A Kumar. “Scaling LLM test-time compute optimally can be more effective than scaling model parameters”. arXiv 2408.03314, 2024.
  323. Z Fang, M Yang, W Zeng, and et al., “Egocentric Vision Language Planning” (EgoPlan), arXiv 2408.05802, 2024
  324. H Arai, K Miwa, K Sasaki, and et al., “CoVLA: Comprehensive Vision-Language-Action Dataset for Autonomous Driving”, arXiv 2408.10845, 2024
  325. Z Wang, H Zheng, Y Nie, and et al., “All Robots in One: A New Standard and Unified Dataset for Versatile, General-Purpose Embodied Agents” (ARIO). arXiv 2408.10899, 2024
  326. Y Zheng, L Yao, Y Su, and et al., “A Survey of Embodied Learning for Object-Centric Robotic Manipulation”, arXiv 2408.11537, 2024
  327. S Yang, M Liu, Y Qin, and et al. “ACE: A cross-platform visual-exoskeletons system for low-cost dexterous teleoperation”, arXiv 2408.11805, 2024
  328. R Doshi, H Walke, O Mees, S Dasari, S Levine, “Scaling Cross-Embodied Learning: One Policy for Manipulation, Navigation, Locomotion and Aviation” (CrossFormer), arXiv 2408.11812, 2024
  329. Figure 02, https://techcrunch.com/2024/08/06/figures-new-humanoid-robot-leverages-openai-for-natural-speech-conversations/, Aug. 2024
  330. Y. Yang, F.-Y. Sun, L. Weihs, and et al., “Holodeck: Language guided generation of 3d embodied AI environments,” IEEE/CVF CVPR, 2024
  331. Y. Yang, B. Jia, P. Zhi, and S. Huang, “Physcene: Physically interactable 3d scene synthesis for embodied AI,” IEEE/CVF CVPR, 2024
  332. H Etukuru, N Naka, Z Hu, and et al., “Robot Utility Models: General Policies for Zero-Shot Deployment in New Environments” (Stick-v2/RUM), arXiv 2409.05865, 2024
  333. K Li, S M Wagh, N Sharma, and et al., “Haptic-ACT: Bridging Human Intuition with Compliant Robotic Manipulation via Immersive VR”, arXiv 2409.11925, 2024
  334. A Yang, B Zhang, B Hui, and et al. “Qwen2.5-math technical report: Toward mathematical expert model via self-improvement”. arXiv 2409.12122, 2024.
  335. J Wen, Y Zhu, J Li, and et al., “TinyVLA: Towards Fast, Data-Efficient Vision-Language-Action Models for Robotic Manipulation”, arXiv 2409.12514, 2024
  336. A Anwar, J Welsh, J Biswas, S Pouya, Y Chang, “ReMEmbR: Building and Reasoning Over Long-Horizon Spatio-Temporal Memory for Robot Navigation”, arXiv 2409.13682, 2024
  337. I Chuang, A Lee, D Gao, I Soltani, “Active Vision Might Be All You Need: Exploring Active Vision in Bimanual Robotic Manipulation” (AV-ALOHA), arXiv 2409.17435, 2024
  338. Z Wu, T Wang, Z Zhuoma, and et al., “Fast-UMI: A Scalable and Hardware-Independent Universal Manipulation Interface”, arXiv 2409.19499, 2024
  339. OpenAI o1, “Learning to reason with LLMs”. https://openai.com/index/learning-to-reason-with-llms, 2024.
  340. World Labs, an AI company for spatial intelligence, https://www.worldlabs.ai/, Sep. 2024
  341. C-L Cheang, G Chen, Y Jing, and et al., “GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation”, ByteDance Research, Tech. Report, arXiv 2410.06158, Oct., 2024
  342. J Wang, M Fang, Z Wan, and et al., “OpenR: An Open Source Framework for Advanced Reasoning with Large Language Models”, Tech. Report, arXiv 2410.09671, Oct. 2024
  343. S Tao, F Xiang, A Shukla, and et al., “ManiSkill3: GPU Parallelized Robotics Simulation and Rendering for Generalizable Embodied AI”, arXiv 2410.00425, 2024
  344. P Hua, M Liu, A Macaluso, and et al., “GenSim2: Scaling Robot Data Generation with Multi-modal and Reasoning LLMs”, arXiv 2410.03645, 2024
  345. S Liu, L Wu, B Li, and et al.,“RDT-1B: a Diffusion Foundation Model For Bimanual Manipulation”, arXiv 2410.07864, 2024
  346. S Chen, C Wang, K Nguyen, Li F-F, C. K Liu, “ARCap: Collecting High-quality Human Demonstrations for Robot Learning with Augmented Reality Feedback”, arXiv 2410.08464, 2024
  347. D Su, S Sukhbaatar, M Rabbat, Y Tian, Q Zheng, “Dualformer: Controllable Fast and Slow Thinking by Learning with Randomized Reasoning Traces”, arXiv 2410.09918, 2024
  348. S Dasari, O Mees, S Zhao, M K Srirama, S Levine, “The Ingredients for Robotic Diffusion Transformers” (DiT-Block Policy), arXiv 2410.10088, 2024
  349. Y Ze, Z Chen, W Wang, and et al., “Generalizable Humanoid Manipulation with Improved 3D Diffusion Policies” (iDP3), arXiv 2410.10803, 2024
  350. T Z. Zhao, J Tompson, D Driess, and et al., “ALOHA Unleashed: A Simple Recipe for Robot Dexterity”, arXiv 2410.13126, 2024
  351. S Zhu, G Wang, D Kong, H Wang, “3D Gaussian Splatting in Robotics: A Survey”, arXiv 2410.12262, 2024
  352. B Han, J Kim, and J Jang, “A Dual Process VLA: Efficient Robotic Manipulation Leveraging VLM” (DP-VLA), arXiv 2410.15549, 2024
  353. Y Zhang, Z Li, M Zhou, S Wu, Jiajun Wu, “The Scene Language: Representing Scenes with Programs, Words, and Embeddings” (SL-DSL), arXiv 2410.16770, 2024
  354. Y Yue, Y Wang, B Kang, and et al., “DeeR-VLA: Dynamic Inference of Multimodal Large Language Models for Efficient Robot Execution”, arXiv 2411.02359, 2024
  355. S Nasiriany, S Kirmani, T Ding, and et al., “RT-Affordance: Affordances are Versatile Intermediate Representations for Robot Manipulation”, arXiv 2411.02704, 2024
  356. Y Chen, C Wang, Y Yang, C. Liu, “Object-Centric Dexterous Manipulation from Human Motion Data” (OCDM), arXiv 2411.04005, 2024
  357. S Zhao, X Zhu, Y Chen, and et al., “DexH2R: Task-oriented Dexterous Manipulation from Human to Robots”, arXiv 2411.04428, 2024
  358. Z Zhang, R Chen, J Ye, and et al., “WHALE: Towards Generalizable and Scalable World Models for Embodied Decision-making”, arXiv 2411.05619, 2024
  359. K Shaw, Y Li, J Yang, and et al., “Bimanual Dexterity for Complex Tasks” (BiDex), arXiv 2411.13677, 2024
  360. X Wang, L Horrigan, J Pinskier, and et al., “DexGrip: Multi-modal Soft Gripper with Dexterous Grasping and In-hand Manipulation Capacity”, arXiv 2411.17124, 2024
  361. Z Liang, Y Mu, Y Wang, and et al., “DexDiffuser: Interaction-aware Diffusion Planning for Adaptive Dexterous Manipulation”, arXiv 2411.18562, 2024
  362. Anthropic, https://www.anthropic.com/news/3-5-models-and-computer-use, Nov. 2024