Training BEiT-3 on your own dataset

cv-daily

Last modified 2023-06-09 15:09:29

First published 2023-06-09 10:42:23

Article link: https://blog.csdn.net/weixin_41012399/article/details/131122626

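This project covers downloading the datasets and the tokenizer model, processing the COCO dataset with XLMRobertaTokenizer, and building a caption dataset. It reads dataset_coco.json and converts each image path and its text descriptions into token_ids to prepare the data for training.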

Project: https://github.com/microsoft/unilm
Step 1: Download the datasets
Dataset 1: the COCO 2014 train images and 2014 val images
Dataset 2: the Karpathy split captions (https://cs.stanford.edu/people/karpathy/deepimagesent/caption_datasets.zip)
Arrange the files as follows:

/path/to/your_data/
  train2014/            
    COCO_train2014_000000000009.jpg                
    ...
  val2014/              
    COCO_val2014_000000000042.jpg
    ...       
  dataset_coco.json
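
Before moving on, it can help to confirm that the layout above is in place. The following is only a minimal sketch: the function name `check_coco_layout` is hypothetical and is not part of the unilm repo.

```python
import os


def check_coco_layout(data_path):
    """Return the expected entries that are missing under data_path."""
    # These three names come from the directory layout shown above.
    expected = ["train2014", "val2014", "dataset_coco.json"]
    return [
        name for name in expected
        if not os.path.exists(os.path.join(data_path, name))
    ]


missing = check_coco_layout("/path/to/your_data")
if missing:
    print("Missing entries:", ", ".join(missing))
```

An empty list means all three entries were found; it does not check the images themselves.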

Note: dataset_coco.json is the COCO caption annotation file. To train on your own data, you first need to convert it into a caption dataset in the same format.
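
As a rough illustration of that conversion, the sketch below writes a JSON file containing only the fields that the indexing code later in this post reads: `filepath`, `filename`, `split`, `cocoid`, and `sentences[].raw`. The function name `build_caption_json`, the tuple layout of `samples`, and the demo file names are assumptions made for this example. The real dataset_coco.json carries more fields, and other parts of the pipeline may rely on them.

```python
import json
import os


def build_caption_json(samples, out_path):
    """Write samples using the field names read from dataset_coco.json.

    samples: iterable of (filepath, filename, split, captions) tuples,
    where captions is a list of strings. This tuple layout is hypothetical.
    """
    images = []
    for image_id, (filepath, filename, split, captions) in enumerate(samples):
        images.append({
            "filepath": filepath,    # sub-directory, e.g. "train2014"
            "filename": filename,    # image file name inside filepath
            "split": split,          # e.g. "train" or "restval" for training data
            "cocoid": image_id,      # assumed: any unique integer id
            "sentences": [{"raw": caption} for caption in captions],
        })
    with open(out_path, mode="w", encoding="utf-8") as writer:
        json.dump({"images": images}, writer, ensure_ascii=False)
    return len(images)


if __name__ == "__main__":
    import tempfile

    # Hypothetical sample: one image with two captions.
    demo_samples = [
        ("train2014", "my_image_0001.jpg", "train",
         ["a woman cutting a cake", "a cake on a table"]),
    ]
    out_dir = tempfile.mkdtemp()  # stand-in for /path/to/your_data
    count = build_caption_json(demo_samples, os.path.join(out_dir, "dataset_coco.json"))
    print("wrote %d image entries" % count)
```

In real use, `out_path` would be the dataset_coco.json under your data directory, with the images placed in the matching sub-directories.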
Download the tokenizer model used to process the data: (https://conversationhub.blob.core.windows.net/beit-share-public/beit3/sentencepiece/beit3.spm)
Process the data:


```python
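# Run this from the unilm/beit3 directory: `datasets` here is expected to
# resolve to that repo's own datasets.py, not the Hugging Face `datasets` package.
from datasets import CaptioningDataset
from transformers import XLMRobertaTokenizer

# Load the beit3.spm tokenizer model downloaded in the previous step.
tokenizer = XLMRobertaTokenizer("/your_beit3_model_path/beit3.spm")

# Build the captioning index from dataset_coco.json under data_path.
CaptioningDataset.make_coco_captioning_dataset_index(
    data_path="/path/to/your_data",
    tokenizer=tokenizer,
)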

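```

Processing flow: read dataset_coco.json, then convert each caption into token IDs and collect the results into an index file. The code in detail:

```python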
import json
import os


# _write_data_into_jsonl is a helper from the same repo and is not shown here.
def _make_captioning_coco_karpathy_dataset_index(
        data_path,
        tokenizer,
        split=("train", "restval"),
        split_name="train",
):
    coco_karpathy_split_json_file = os.path.join(data_path, "dataset_coco.json")
    items = []
    image_counter = set()
    print("read %s" % coco_karpathy_split_json_file)
    with open(coco_karpathy_split_json_file, mode="r", encoding="utf-8") as reader:
        data = json.loads(reader.read())
        for item in data["images"]:
            if item["split"] in split:
                image_path = os.path.join(item["filepath"], item["filename"])
                if item["split"] in ["train", "restval"]:
                    # Each image has several captions, e.g. "a woman wearing a
                    # net on her head cutting a cake". Every caption becomes
                    # its own item, paired with the same image.
                    for sent in item["sentences"]:
                        tokens = tokenizer.tokenize(sent["raw"])
                        token_ids = tokenizer.convert_tokens_to_ids(tokens)
                        items.append({
                            "image_path": image_path,
                            "text_segment": token_ids,
                            "image_id": item["cocoid"],
                        })
                else:
                    # Other splits get one item per image, with no text.
                    items.append({
                        "image_path": image_path,
                        "text_segment": None,
                        "image_id": item["cocoid"],
                    })
                # Adding to a set is idempotent, so each image is counted once.
                image_counter.add(image_path)
    print("Find %d images and %d image-text pairs for karpathy dataset %s split !" %
          (len(image_counter), len(items), split_name))
    # The index is written next to the data as coco_captioning.<split_name>.jsonl.
    index_file = os.path.join(data_path, "coco_captioning.%s.jsonl" % split_name)
    _write_data_into_jsonl(items, index_file)
```