论文的创新点有下面四条:
1.
We
introduce
a new family of 3D-based Large Language models (3D-LLMs)
that can take 3D points with features and language prompts as input, and perform a
variety of 3D-related tasks
.
2.
We
devise novel
data collection pipelines that could generate large-scale 3D-language data
. Based on the pipelines, we collect a dataset that has over 300k 3D-language data that cover a diverse set of 3D-related tasks, including but not limited to 3D captioning, dense captioning, 3D question answering, task decomposition, 3D grounding, 3D-assisted dialog, navigation, and so
on.
3.
We
use a
3D feature extractor that extracts meaningful 3D features from rendered multi-view images
. We utilize 2D pretrained VLMs as our backbones for efficient training. We introduce a 3D localization mechanism for training the 3D-LLMs to better capture 3D spatial information.
4. We plan to release our 3D-LLMs, the 3D-language dataset, and language-aligned 3D features of the dataset for future research development.
使用输入3d场景的描述,使用chat gpt 生成描述语言,仔细看这个图很有意思,作者哪一个场景举例子,让gpt后续为其他场景生成描述性的语言
下面是模型的结构:其实也比较简单。