MarkupLM Source Code Walkthrough: Data Preparation (Part 2)

coder1479

Last modified 2022-03-04 20:11:40



First published 2022-03-03 19:47:40

Article link: https://blog.csdn.net/m0_48742971/article/details/123261951


Contents

Preface
1. Reading the annotation data
2. Reading the HTML files

Preface

Original article: https://blog.csdn.net/m0_48742971/article/details/123261951

This article analyzes the load_html_and_groundtruth method, which loads the HTML files and the annotation data from the SWDE dataset.

1. Reading the annotation data

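After downloading and extracting the SWDE dataset, take a look at the directory layout. The data lives under the sourceCode directory, and the annotation data lives under the groundtruth directory, which is organized as follows.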

[root@localhost sourceCode]# tree groundtruth
groundtruth
├── auto
│   ├── auto-aol-engine.txt
│   ├── auto-aol-fuel_economy.txt
│   ├── auto-aol-model.txt
│   ├── auto-aol-price.txt
│   ├── auto-autobytel-engine.txt
│   ├── auto-autobytel-fuel_economy.txt
│   ├── auto-autobytel-model.txt
│   ├── auto-autobytel-price.txt
... (omitted) ...
├── book
│   ├── book-abebooks-author.txt
│   ├── book-abebooks-isbn_13.txt
│   ├── book-abebooks-publication_date.txt
│   ├── book-abebooks-publisher.txt
│   ├── book-abebooks-title.txt
... (omitted) ...

As the listing shows, every annotation file follows the same naming pattern:

vertical-website-field.txt
that is: vertical domain, website name, and field name, joined by hyphens, with a .txt extension

The source code extracts vertical, website, and field directly from the file name.

vertical, website, field = truthfile.replace(".txt", "").split("-")
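
To make the one-liner above concrete, here is a minimal standalone sketch of the same split. The helper parse_groundtruth_filename and the GroundtruthName tuple are hypothetical names introduced only for illustration; they are not part of the MarkupLM code. The sketch relies on what the listing shows: field names use underscores (fuel_economy, isbn_13), so splitting on "-" gives exactly three parts. The length check is an added safeguard, not something the original code does.

```python
from typing import NamedTuple


class GroundtruthName(NamedTuple):
    """Hypothetical container for the three parts of a file name."""
    vertical: str
    website: str
    field: str


def parse_groundtruth_filename(filename: str) -> GroundtruthName:
    """Split 'vertical-website-field.txt' into its three parts."""
    # Drop the extension, then split on the hyphens that join the parts.
    stem = filename[:-len(".txt")] if filename.endswith(".txt") else filename
    parts = stem.split("-")
    # Extra check (not in the original code): fail loudly on odd names.
    if len(parts) != 3:
        raise ValueError(f"unexpected groundtruth file name: {filename!r}")
    return GroundtruthName(*parts)


name = parse_groundtruth_filename("auto-aol-fuel_economy.txt")
print(name.vertical, name.website, name.field)  # auto aol fuel_economy
```

Without the check, a file name containing an extra hyphen would make the three-way tuple unpacking in the original line raise a ValueError of its own, just with a less descriptive message.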

Next, the contents of each groundtruth file are read.

with open(os.path.join(gt_path, v, truthfile), "r") as gfo:
    lines = gfo.readlines()
    for line in lines[2:]:
        # Each line should contains more than 3 elements splitted by \t
        # which are: index, number of values, value1, value2, etc.
        item = line.strip().split("\t")
        index = item[0]