Preface
Original article: https://blog.csdn.net/m0_48742971/article/details/123261951
This article analyzes the load_html_and_groundtruth method, which loads the HTML files and the annotation (groundtruth) data of the SWDE dataset.
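1. Reading the annotation data
After downloading and extracting the SWDE dataset, you can inspect its directory structure. The data lives under the sourceCode directory, and the annotations live under its groundtruth subdirectory, laid out as follows.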
[root@localhost sourceCode]# tree groundtruth
groundtruth
├── auto
│   ├── auto-aol-engine.txt
│   ├── auto-aol-fuel_economy.txt
│   ├── auto-aol-model.txt
│   ├── auto-aol-price.txt
│   ├── auto-autobytel-engine.txt
│   ├── auto-autobytel-fuel_economy.txt
│   ├── auto-autobytel-model.txt
│   ├── auto-autobytel-price.txt
...(omitted)...
├── book
│   ├── book-abebooks-author.txt
│   ├── book-abebooks-isbn_13.txt
│   ├── book-abebooks-publication_date.txt
│   ├── book-abebooks-publisher.txt
│   ├── book-abebooks-title.txt
...(omitted)...
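As the listing shows, every annotation file follows the same naming pattern:
vertical-website-field.txt
that is, the vertical (domain), the website name, and the field, joined by hyphens.
The source code extracts vertical, website, and field directly from the file name.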
vertical, website, field = truthfile.replace(".txt", "").split("-")
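To make the naming convention concrete, here is a minimal sketch of the same split wrapped in a small helper. The names parse_truth_filename and TruthFileName are hypothetical and do not appear in the original source. The explicit length check is also an addition: the one-line version above would likewise fail on a name that does not contain exactly three hyphen-separated parts, only with a less descriptive unpacking error.

```python
from typing import NamedTuple


class TruthFileName(NamedTuple):
    """Hypothetical container for the three parts of a groundtruth file name."""
    vertical: str
    website: str
    field: str


def parse_truth_filename(truthfile: str) -> TruthFileName:
    """Split a name such as 'auto-aol-price.txt' into vertical, website, field."""
    # Drop the ".txt" suffix if present (the original code uses str.replace).
    stem = truthfile[:-len(".txt")] if truthfile.endswith(".txt") else truthfile
    parts = stem.split("-")
    # Assumption: a valid name has exactly three hyphen-separated parts.
    if len(parts) != 3:
        raise ValueError(f"unexpected groundtruth file name: {truthfile!r}")
    return TruthFileName(*parts)


print(parse_truth_filename("auto-aol-fuel_economy.txt"))
# TruthFileName(vertical='auto', website='aol', field='fuel_economy')
```

Note that underscores inside a field name, as in fuel_economy or isbn_13, are unaffected because only hyphens act as separators.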
Next, the source reads the contents of each groundtruth file, skipping its first two lines.
with open(os.path.join(gt_path, v, truthfile), "r") as gfo:
    lines = gfo.readlines()
    for line in lines[2:]:
        # Each line should contains more than 3 elements splitted by \t
        # which are: index, number of values, value1, value2, etc.
        item = line.strip().split("\t"