[LLaVA] How does LLaVA filter 595K samples out of CC3M when building its dataset, and why?

Latest recommended article published on 2025-01-13 16:02:45

页页读

Category: Multimodal Models. Tags: multimodal models

Article link: https://blog.csdn.net/u014386899/article/details/136917574

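This article describes how LLaVA uses Spacy to extract noun phrases from the captions of the CC3M dataset, keeps only noun phrases that occur at least 3 times, and caps each noun phrase at 100 randomly sampled captions. This reduces redundancy while preserving diversity, and yields roughly 595K image-text pairs. The selection is designed to keep concept coverage broad.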

Original text: CC3M. We extract noun-phrases using Spacy for each caption over the whole cc3m dataset, and count the frequency of each unique noun-phrase. We skip noun-phrases whose frequency is smaller than 3, as they are usually rare combinations concept and attributes that has already been covered by other captions. We then start from the noun-phrases with lowest remaining frequency, add the captions that contain this noun-phrase to the candidate pool. If the frequency of the noun-phrase is larger than 100, we randomly choose a subset of size 100 out of all its captions. This results in around 595K image-text pairs.

The passage above is quoted from the original LLaVA paper. The processing procedure is explained below.
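To make the steps concrete, here is a minimal Python sketch of the filtering logic as the quoted passage describes it. It is not the authors' actual script. The function name `filter_captions`, its parameters, and the pluggable `extract_noun_phrases` callable are all hypothetical. The sketch also assumes that the "frequency" of a noun phrase means the number of captions containing it, which the paper does not spell out.

```python
import random
from collections import defaultdict


def filter_captions(captions, extract_noun_phrases,
                    min_freq=3, max_per_phrase=100, seed=0):
    """Return sorted indices of the captions kept by the filtering sketch.

    captions: list of caption strings.
    extract_noun_phrases: callable mapping a caption to a list of noun
        phrases (hypothetical hook; the paper uses Spacy for this step).
    """
    rng = random.Random(seed)

    # Map each unique noun phrase to the captions that contain it.
    # Assumption: a phrase is counted once per caption.
    phrase_to_ids = defaultdict(list)
    for idx, caption in enumerate(captions):
        for phrase in set(extract_noun_phrases(caption)):
            phrase_to_ids[phrase].append(idx)

    # Skip noun phrases whose frequency is smaller than min_freq.
    kept = {p: ids for p, ids in phrase_to_ids.items() if len(ids) >= min_freq}

    # Walk the phrases from lowest remaining frequency upward and add
    # their captions to the candidate pool.
    selected = set()
    for phrase in sorted(kept, key=lambda p: (len(kept[p]), p)):
        ids = kept[phrase]
        if len(ids) > max_per_phrase:
            # Frequent phrase: keep a random subset of size max_per_phrase.
            ids = rng.sample(ids, max_per_phrase)
        selected.update(ids)

    return sorted(selected)
```

With Spacy, the extractor could be something like `lambda c: [chunk.text.lower() for chunk in nlp(c).noun_chunks]` after `nlp = spacy.load("en_core_web_sm")`. The choice of model and the lowercasing are assumptions here, not details given in the paper. Because a caption can contain several noun phrases, a caption dropped by the random sampling for one frequent phrase can still enter the pool through another phrase.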