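A Detailed Look at the SynthText Text Dataset

Contents

1. Official description of the dataset

2. Data characteristics

2.1 imnames

2.2 wordBB: word level

2.3 charBB: character-level bbox

2.4 txt: text level


1. Official description of the dataset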

SynthText in the Wild Dataset
-----------------------------
Ankush Gupta, Andrea Vedaldi, and Andrew Zisserman
Visual Geometry Group, University of Oxford, 2016


Data format:
------------

SynthText.zip (size = 42074172 bytes (41GB)) contains 858,750 synthetic
scene-image files (.jpg) split into 200 directories, with 
7,266,866 word-instances, and 28,971,487 characters.

Ground-truth annotations are contained in the file "gt.mat" (Matlab format).
The file "gt.mat" contains the following cell-arrays, each of size 1x858750:

  1. imnames :  names of the image files

  2. wordBB  :  word-level bounding-boxes for each image, represented by
                tensors of size 2x4xNWORDS_i, where:
                   - the first dimension is 2 for x and y respectively,
                   - the second dimension corresponds to the 4 points
                     (clockwise, starting from top-left), and
                   -  the third dimension of size NWORDS_i, corresponds to
                      the number of words in the i_th image.

  3. charBB  : character-level bounding-boxes,
               each represented by a tensor of size 2x4xNCHARS_i
               (format is same as wordBB's above)

  4. txt     : text-strings contained in each image (char array).
               
               Words which belong to the same "instance", i.e.,
               those rendered in the same region with the same font, color,
               distortion etc., are grouped together; the instance
               boundaries are demarcated by the line-feed character (ASCII: 10)

               A "word" is any contiguous substring of non-whitespace
               characters.

               A "character" is defined as any non-whitespace character.


For any questions or comments, contact Ankush Gupta at:
removethisifyouarehuman-ankush@robots.ox.ac.uk

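2. Data characteristics

  The files in the dataset are as follows.

(1) The dataset is about 41 GB in total and contains 858,750 synthetic images in JPG format. The images are split into 200 scene directories (scenes differ in their background image; there are actually 202 scenes). It has 7,266,866 words and 28,971,487 characters.

(2) The annotation file is in MAT format. After loading, it contains the fields described below.

2.1 imnames

Stores the relative path of each image file.

2.2 wordBB: word level

Each image has one annotation tensor of size (2, 4, n_word_i): 2 is the x and y coordinates; 4 is the four corner points, clockwise starting from the top-left; n_word_i is the number of words in the i-th image.

A "word" is any contiguous substring of non-whitespace characters.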
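
Most detection pipelines expect one polygon per box rather than separate x and y rows, so the tensor usually has to be transposed first. The following is a minimal sketch under that assumption. The helper name `to_polygons` and the sample coordinates are made up for illustration. It also covers the case where an image with a single word is loaded as a (2, 4) array rather than (2, 4, 1).

```python
import numpy as np


def to_polygons(bb):
    """Convert a SynthText (2, 4, N) box tensor into an (N, 4, 2) array of points."""
    bb = np.asarray(bb, dtype=np.float32)
    if bb.ndim == 2:
        # A single box may come back as (2, 4); add the missing box axis.
        bb = bb[:, :, np.newaxis]
    # Reorder axes from (xy, point, box) to (box, point, xy).
    return bb.transpose(2, 1, 0)


# Illustrative tensor with three boxes; only the first one is filled in.
word_bb = np.zeros((2, 4, 3), dtype=np.float32)
word_bb[0, :, 0] = [10, 50, 50, 10]  # x of the four corners
word_bb[1, :, 0] = [20, 20, 40, 40]  # y of the four corners

polys = to_polygons(word_bb)
print(polys.shape)        # (3, 4, 2)
print(polys[0].tolist())  # [[10.0, 20.0], [50.0, 20.0], [50.0, 40.0], [10.0, 40.0]]
```

The same helper applies to charBB, since both fields share the layout.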

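2.3 charBB: character-level bbox

The size is also (2, 4, n_char_i), with the same meaning as wordBB.

A "character" is any non-whitespace character.

Converting char_bbox to labelme-format JSON annotation files: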

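import json
import os

import cv2
import numpy as np
from scipy.io import loadmat
from tqdm import tqdm


def syntext2json_char_level():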
    data_dir = r"F:\BaiduNetdiskDownload\SynthText800k\detection"
    gt_path = os.path.join(data_dir, "gt.mat")
    img_paths = os.path.join(data_dir, "imgs")

    gt_mat = loadmat(gt_path)

    # word_bboxes = gt_mat['wordBB'][0]
    img_names = gt_mat['imnames'][0]
    char_bboxes = gt_mat['charBB'][0]

    for i in tqdm(range(img_names.size)):
        coco_output = {
            "version": "3.16.7",
            "flags": {},
            # "fillColor": [255, 0, 0, 128],
            # "lineColor": [0, 255, 0, 128],
            "imagePath": {},
            "shapes": [],
            "imageData": {}}
        img_name = img_names[i][0]
        img_full_path = os.path.join(img_paths, img_name)
        coco_output["imagePath"] = os.path.basename(img_full_path)
        coco_output["imageData"] = None

        json_full_path = img_full_path.replace(".jpg", ".json")
        # print(json_full_path)

        cur_img = cv2.imread(img_full_path)
        if cur_img is None:
            continue
        cur_bboxes = char_bboxes[i]  # (2,4,n)
        if len(cur_bboxes.shape) != 3:
            cur_bboxes = np.expand_dims(cur_bboxes, 2)
        # rectify_bboxes = np.zeros((cur_bboxes.shape[2], 4, 2))
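        for j in range(cur_bboxes.shape[2]):  # e.g. (2, 4, 15): one box per character
            bbox = cur_bboxes[:, :, j]  # (2, 4)
            pt_list = [[int(bbox[0][m]), int(bbox[1][m])] for m in range(4)]  # four corners of the current character
            # boundingRect needs int32 or float32 points; the default int64 is rejected on some platforms
            x, y, w, h = cv2.boundingRect(np.array(pt_list, dtype=np.int32))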
            rect = [[x, y], [x + w, y + h]]
            # cv2.rectangle(cur_img, pt_list[0], pt_list[2], (0, 0, 255), 3)
            # cv2.namedWindow("img", cv2.WINDOW_NORMAL), cv2.imshow("img", cur_img), cv2.waitKey()
            shape_info = {'points': rect,
                          'group_id': None,
                          # "fill_color": None,
                          # "line_color": None,
                          "label": "loc",
                          "shape_type": "rectangle",
                          "flags": {}
                          }
            coco_output["shapes"].append(shape_info)
        coco_output["imageHeight"] = cur_img.shape[0]
        coco_output["imageWidth"] = cur_img.shape[1]

        with open(json_full_path, 'w') as output_json_file:
            json.dump(coco_output, output_json_file, indent=4)
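
The rectangle step in the function above can also be done without OpenCV. The sketch below is an alternative, not the method used in the function: it takes the minimum and maximum of the four corners and rounds outward, so the rectangle always encloses the original quadrilateral. The name `quad_to_rect` and the sample coordinates are illustrative assumptions.

```python
import numpy as np


def quad_to_rect(quad):
    """Return [[x_min, y_min], [x_max, y_max]] for a (2, 4) quadrilateral."""
    quad = np.asarray(quad, dtype=np.float64)
    # Row 0 holds x, row 1 holds y; round outward so no corner is cut off.
    x_min, y_min = np.floor(quad.min(axis=1)).astype(int).tolist()
    x_max, y_max = np.ceil(quad.max(axis=1)).astype(int).tolist()
    return [[x_min, y_min], [x_max, y_max]]


quad = np.array([[10.2, 30.7, 29.9, 9.5],   # x of the four corners
                 [5.0, 6.4, 18.1, 17.3]])   # y of the four corners
print(quad_to_rect(quad))  # [[9, 5], [31, 19]]
```

Calling `.tolist()` yields plain Python integers, which `json.dump` can serialize directly. NumPy integer scalars would raise a `TypeError` there.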

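2.4 txt: text level

The text strings contained in each image (char array).

Take the image ballet_106_0.jpg as an example. Its annotation contains 8 text instances. Words that lie in the same region and share the same font, color, distortion and so on are treated as one text instance.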
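
Because txt is grouped by instance while wordBB is per word, the strings usually have to be split before they can be paired with boxes. The sketch below assumes each txt entry has already been read into Python as a string. The helper name `split_words` and the sample strings are made up for illustration and are not taken from ballet_106_0.jpg.

```python
def split_words(txt_entries):
    """Flatten the text instances of one image into a list of words."""
    words = []
    for instance in txt_entries:
        # str.split() with no argument splits on any run of whitespace,
        # including the line-feed that separates instances.
        words.extend(str(instance).split())
    return words


# Illustrative entries: instances separated by "\n", some padded with spaces.
txt_i = ["Lines:\nI lost\nKevin ", "will        ", "line\nand   "]
words = split_words(txt_i)
print(words)  # ['Lines:', 'I', 'lost', 'Kevin', 'will', 'line', 'and']
```

Given the official definition of a word, `len(words)` should equal the number of boxes in that image's wordBB tensor. Checking this before pairing words with boxes is a cheap sanity test.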

 

