(1) Short text
1) LCSTS (Harbin Institute of Technology)
Loading:
import pandas as pd
from datasets import Dataset

# Every tag line (<doc ...>, <summary>, </short_text>, ...) matches this separator regex,
# so column 0 is NaN on tag lines and holds the text on content lines.
TAG_SEP = '<[/d|/s|do|su|sh][^a].*>'

def read_lcsts(path):
    # A regex separator needs the Python engine. on_bad_lines (pandas >= 1.3) replaces
    # the removed warn_bad_lines / error_bad_lines arguments: warn, then skip the line.
    raw = pd.read_table(path, header=None, sep=TAG_SEP, engine='python',
                        on_bad_lines='warn', encoding='utf-8')
    lines = raw[0].dropna().reset_index(drop=True)
    # Content lines alternate: summary first, then the short text (the document).
    pairs = pd.concat([lines[1::2].reset_index(drop=True),
                       lines[::2].reset_index(drop=True)], axis=1)
    pairs.columns = ['document', 'summary']
    return pairs

# Note: both splits currently read PART_III.txt; change the first path if a
# separate training file is intended.
lcsts_part_1 = read_lcsts(r'D:\softwares\zwj\nlp\project\long-document\datasets\PART_III.txt')
lcsts_part_2 = read_lcsts(r'D:\softwares\zwj\nlp\project\long-document\datasets\PART_III.txt')

dataset_train = Dataset.from_pandas(lcsts_part_1).shuffle(seed=42)
dataset_valid = Dataset.from_pandas(lcsts_part_2).shuffle(seed=42)
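The loading code above relies on pandas treating the markup as a column separator. As a rough alternative, the sketch below pulls the pairs out with the standard `re` module. It assumes each record wraps its text in `<summary>...</summary>` followed by `<short_text>...</short_text>`, which is the layout the separator regex implies. The function name `parse_lcsts` and the sample string are hypothetical.

```python
import re

# Assumed record layout: <summary>...</summary> followed by <short_text>...</short_text>.
PAIR_RE = re.compile(
    r"<summary>\s*(.*?)\s*</summary>\s*<short_text>\s*(.*?)\s*</short_text>",
    re.DOTALL,
)

def parse_lcsts(raw_text):
    """Return a list of {'document', 'summary'} dicts found in raw_text."""
    return [
        {"document": doc, "summary": summ}
        for summ, doc in PAIR_RE.findall(raw_text)
    ]

# Invented sample in the assumed layout; the second record carries an extra tag.
sample = """<doc id=0>
    <summary>
        Title A
    </summary>
    <short_text>
        Body text A
    </short_text>
</doc>
<doc id=1>
    <human_label>5</human_label>
    <summary>
        Title B
    </summary>
    <short_text>
        Body text B
    </short_text>
</doc>"""

pairs = parse_lcsts(sample)
print(pairs)
```

The resulting list can be passed to `pd.DataFrame(pairs)` to obtain the same `document` / `summary` columns as above.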
(2) Medium length
1) TTNews, the single-document news test set from NLPCC 2017
2) CNewSum, released by ByteDance at NLPCC 2021
Conversion script:
# coding=utf-8
import json

import jsonlines
from datasets import load_dataset

json_data_path = r'./test.simple.anno.label.jsonl'
output_path = r'./CNewSum_test_original.json'

data = []
with jsonlines.open(json_data_path) as reader:
    for obj in reader:
        # Concatenate the pieces of each field into a single string.
        data.append({
            'content': ''.join(str(part) for part in obj['article']),
            'title': ''.join(str(part) for part in obj['summary']),
        })

# Write the file and close it before loading, so the content is fully flushed.
with open(output_path, 'w', encoding='UTF-8') as f:
    json.dump(data, f, indent=4, sort_keys=False, ensure_ascii=False)

dataset_train = load_dataset('json', data_files=[output_path])
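To make the effect of the conversion concrete, the sketch below applies the same flattening to a single in-memory record. The sample record is invented for illustration and only assumes that `article` is a list of sentence strings and `summary` is a plain string. `flatten_record` is a hypothetical helper name.

```python
import json

def flatten_record(obj):
    """Join the pieces of 'article' and 'summary' into 'content' and 'title'."""
    return {
        "content": "".join(str(part) for part in obj["article"]),
        "title": "".join(str(part) for part in obj["summary"]),
    }

# Invented sample record; real records may carry more fields, which are dropped.
record = {"article": ["第一句。", "第二句。"], "summary": "摘要", "label": [0]}

flat = flatten_record(record)
# ensure_ascii=False keeps Chinese characters readable instead of \uXXXX escapes.
line = json.dumps(flat, ensure_ascii=False)
print(line)
```

Note that `load_dataset('json', data_files=...)` called without a `split` argument returns a `DatasetDict` whose single split is named `train`, so the records are reached through `dataset_train['train']`.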
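(3) Long text
1) CLTS from NLPCC 2020. The quality of this dataset is poor, however: a large share of the summaries are copied or extracted verbatim from the article body.
2) SFZY2020, the judicial summarization dataset from the CAIL (法研杯) challenge. Two parts are publicly available: one with about 10,000 examples and one with about 4,000.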
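The remark about CLTS summaries being copied from the article can be checked with a rough extractiveness measure. The sketch below is one simple heuristic, not a standard metric: it splits a summary on sentence-ending punctuation and reports the fraction of sentences that appear verbatim in the article. `copied_fraction` and the sample strings are hypothetical.

```python
import re

def copied_fraction(article, summary):
    """Fraction of summary sentences that occur verbatim in the article."""
    # Split on common Chinese and Latin sentence-ending punctuation.
    sentences = [s.strip() for s in re.split(r"[。!?;!?;]", summary) if s.strip()]
    if not sentences:
        return 0.0
    copied = sum(1 for s in sentences if s in article)
    return copied / len(sentences)

# Invented examples: one summary copied from the article, one rewritten.
article = "今天天气很好。我们去公园散步。晚上回家吃饭。"
extractive = "今天天气很好。晚上回家吃饭。"
abstractive = "一家人白天外出,晚上回家。"

print(copied_fraction(article, extractive))
print(copied_fraction(article, abstractive))
```

Averaging this value over a sample of records gives a quick sense of how extractive a dataset is. A value close to 1 would support the observation made about CLTS.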