Three Ways to Process the CNN/DailyMail Dataset

Dataset overview: a text summarization dataset

Basic format: article + summary

Raw files: two .tgz archives; extracting them yields nearly 320,000 *.story files
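As a rough illustration of the article + summary layout, the sketch below splits the text of one raw *.story file into its article and its summary sentences. It assumes the usual layout of these files, where the article body comes first and each summary sentence follows a line containing only `@highlight`; the function name `parse_story` and the sample text are made up for this example.

```python
def parse_story(text):
    """Split the raw text of one .story file into (article, highlights)."""
    article_lines, highlights = [], []
    next_is_highlight = False
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines between paragraphs
        if line == "@highlight":
            # the next non-empty line is one summary sentence
            next_is_highlight = True
        elif next_is_highlight:
            highlights.append(line)
            next_is_highlight = False
        else:
            article_lines.append(line)
    return " ".join(article_lines), highlights


# Made-up sample text, only to show the assumed layout
sample = (
    "First sentence.\n\nSecond sentence.\n\n"
    "@highlight\n\nA summary point\n\n"
    "@highlight\n\nAnother point\n"
)
article, highlights = parse_story(sample)
print(article)
print(highlights)
```

For a real file, read its contents with `open(path, encoding="utf-8").read()` and pass the result to `parse_story`.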

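1. Preprocessing provided by the open-source project PreSumm

Following the steps in the link, the raw archives can be processed into JSON files or torch-format data. The data in the JSON files is already tokenized and split into sentences, with key=['src','tgt']. A single record looks like this: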

{'src': [['by'], ['daily', 'mail', 'reporter'], ['an', 'ad', 'in', 'which', 'tnt', 'post', 'compared', 'itself', 'to', 'royal', 'mail', 'has', 'been', 'banned', 'for', 'implying', 'that', 'it', 'offered', 'the', 'same', 'service', 'and', 'similar', 'prices', 'as', 'its', 'competitor', '.'], ['the', 'leaflet', 'showed', 'a', 'photo', 'of', 'a', 'man', 'in', 'a', 'tnt', 'uniform', 'and', 'said', ':', '`', 'what', 'does', 'my', 'postie', 'look', 'like', '?'], ['like', 'this', '-', 'in', 'a', 'smart', 'uniform', 'which', 'our', 'posties', 'wear', 'at', 'all', 'times', 'on', 'duty', '.'], ['`', 'like', 'royal', 'mail', ',', 'all', 'are', 'crb-checked', 'and', 'fully', 'trained', 'in', 'how', 'to', 'keep', 'mail', 'safe', 'and', 'secure', '.', "'"], ['delivery', 'firm', 'tnt', 'post', 'has', 'had', 'one', 'of', 'its', 'ads', 'banned', 'after', 'implying', 'that', 'it', 'offered', 'the', 'same', 'service', 'and', 'similar', 'prices', 'as', 'the', 'royal', 'mail'], ['it', 'added', ':', '`', 'we', 'operate', 'under', 'exactly', 'the', 'same', 'rules', 'and', 'regs', 'as', 'royal', 'mail', '-', 'authorised', 'by', 'the', 'government', 'to', 'carry', 'mail', 'and', 'watched', 'over', 'by', 'ofcom', '.', "'"], ['a', 'regional', 'press', 'ad', 'said', 'every', 'item', 'was', 'delivered', '`', 'at', 'a', 'competitive', 'price', '-', 'not', 'just', 'within', 'manchester', ',', 'but', 'all', 'over', 'the', 'uk', "'", '.'], ['royal', 'mail', 'complained', 'that', 'the', 'ad', 'was', 'misleading', 'because', 'tnt', 'was', 'not', 'required', 'to', 'deliver', 'to', 'every', 'address', 'in', 'the', 'uk', 'on', 'a', 'next-day', 'basis', 'in', 'the', 'way', 'royal', 'mail', 'was', '.'], ['royal', 'mail', 'also', 'complained', 'that', 'the', 'ad', 'suggested', 'tnt', 'delivered', 'mail', 'all', 'over', 'the', 'uk', 'itself', 'and', 'also', 'that', 'it', 'implied', 'the', 'service', 'provided', 'by', 'tnt', 'was', 'better', 'than', 'the', 'service', 'provided', 'by', 'royal', 'mail', '.'], 
['tnt', 'said', 'the', 'ad', 'was', 'intended', 'to', 'reassure', 'consumers', 'about', 'the', 'security', 'of', 'mail', 'rather', 'than', 'frequency', 'of', 'delivery', 'and', 'where', 'it', 'did', 'not', 'have', 'its', 'own', 'network', ',', 'it', 'subcontracted', 'royal', 'mail', 'to', 'deliver', ',', 'enabling', 'it', 'to', 'provide', 'a', 'service', 'across', 'the', 'uk', '.'], ['the', 'advertising', 'standards', 'authority', '(', 'asa', ')', 'said', 'the', 'ad', 'suggested', 'that', 'tnt', 'was', 'subject', 'to', 'the', 'same', 'service', 'levels', 'as', 'royal', 'mail', 'in', 'all', 'aspects', 'of', 'its', 'service', ',', 'including', 'the', 'obligation', 'to', 'deliver', 'to', 'every', 'uk', 'address', 'every', 'day', ',', 'which', 'was', 'not', 'the', 'case', '.'], ['royal', 'mail', 'complained', 'that', 'the', 'ad', 'suggested', 'tnt', 'delivered', 'mail', 'all', 'over', 'the', 'uk', 'itself'], ['it', 'also', 'said', 'the', 'reference', 'to', 'mail', 'being', '`', 'delivered', '...', 'all', 'over', 'the', 'uk', "'", 'was', 'misleading', 'in', 'the', 'absence', 'of', 'qualifying', 'information', 'that', 'explained', 'that', 'delivery', 'to', 'most', 'areas', 'in', 'the', 'uk', 'could', 'not', 'be', 'carried', 'out', 'by', 'tnt', '.'], ['and', 'it', 'found', 'that', 'information', 'in', 'the', 'ad', 'did', 'not', 'allow', 'readers', 'to', 'identify', 'for', 'themselves', 'how', 'tnt', 'was', 'superior', 'to', 'royal', 'mail', 'in', 'respect', 'of', 'how', 'its', 'prices', 'compared', 'to', 'those', 'offered', 'by', 'royal', 'mail', '.'], ['it', 'ruled', 'that', 'the', 'ads', 'must', 'not', 'appear', 'again', 'in', 'their', 'current', 'form', ',', 'adding', ':', '`', 'we', 'told', 'tnt', 'post', 'uk', 'to', 'ensure', 'future', 'ads', 'did', 'not', 'suggest', 'that', 'they', 'were', 'subject', 'to', 'the', 'same', 'service', 'levels', 'as', 'royal', 'mail', ',', 'that', 'they', 'delivered', 'mail', 'themselves', 'to', 'all', 'parts', 'of', 'the', 'uk', 'or', 
'that', 'their', 'prices', 'were', 'competitive', 'in', 'relation', 'to', 'those', 'offered', 'by', 'royal', 'mail', 'unless', 'they', 'could', 'substantiate', 'that', 'that', 'was', 'the', 'case', '.', "'"], ['a', 'tnt', 'post', 'spokeswoman', 'said', ':', '`', 'whilst', 'we', 'disagree', 'with', 'the', 'findings', ',', 'we', 'will', 'comply', 'with', 'the', 'request', 'for', 'greater', 'clarity', 'in', 'the', 'future', '.', "'"]], 'tgt': [['royal', 'mail', 'complained', 'that', 'the', 'tnt', 'post', 'ad', 'was', 'misleading'], ['it', 'said', 'it', 'implied', 'tnt', "'s", 'service', 'was', 'better', 'than', 'royal', 'mail', "'s"], ['tnt', 'said', 'it', 'was', 'intended', 'to', 'reassure', 'consumers', 'about', 'mail', 'security']]}
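If plain strings are easier to work with than token lists, a record in the format above can be joined back into sentences. This is a minimal sketch: `join_sentences` is a hypothetical helper, and the `record` below is a shortened copy of the sample above, not the output of PreSumm itself.

```python
def join_sentences(tokenized_sentences):
    """Turn a list of token lists into a list of sentence strings."""
    return [" ".join(tokens) for tokens in tokenized_sentences]


# Shortened copy of the sample record shown above
record = {
    "src": [["by"], ["daily", "mail", "reporter"]],
    "tgt": [["royal", "mail", "complained", "that", "the", "tnt",
             "post", "ad", "was", "misleading"]],
}

src_sentences = join_sentences(record["src"])  # article sentences
tgt_sentences = join_sentences(record["tgt"])  # summary sentences
print(src_sentences)
print(tgt_sentences)
```

Joining on a single space keeps the tokenizer's spacing (for example `'s` stays a separate token), so the result is meant for inspection rather than for restoring the original text exactly.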

Download and preprocessing link for the PreSumm version:

Preprocessing the CNN/DailyMail dataset (CSDN blog post by 想念@思恋)

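2. The abisee version

Raw data and dataset link: the link below already provides the processed binary dataset files (*.bin), so I will not go into detail here.

Getting started with Abstractive Summarization: the CNN/DM text summarization dataset (CSDN blog post by 木尧大兄弟)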

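3. The Huggingface version

Personally, I find this the most convenient version; the vast majority of huggingface datasets can be used the same way.

(1) Official page:

cnn_dailymail at main (huggingface.co)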

 

(2) Usage:


from datasets import load_dataset
data = load_dataset('cnn_dailymail', '3.0.0')  # If you can reach huggingface, this is the most convenient approach, though I usually cannot connect

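(3) Loading the cnn_dailymail dataset locally

First download all the files shown in the screenshot, plus "all_train.txt" and the other two split files. These three can be found in the final folder of the abisee version, or copied directly from the web page.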

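	# Line 67 of cnn_dailymail.py:
	
	DL_URLS = {….}
	# Replace every path with a local path
	# Then:
	data = load_dataset('~/cnn_dailymail.py', '3.0.0')  # 3.0.0 can be thought of as the dataset version number
	
	# This pairs well with:
	
	data.save_to_disk('path/to/save_dir')  # saves the dataset wherever you like for later use; 'path/to/save_dir' is a placeholder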

Reading the dataset from local files is fast. After loading, each record looks like this, with key=['article','highlights', 'id']:

{'article': 'LONDON, England (Reuters) -- Harry Potter star Daniel Radcliffe gains access to a reported £20 million ($41.1 million) fortune as he turns 18 on Monday, but he insists the money won\'t cast a spell on him. Daniel Radcliffe as Harry Potter in "Harry Potter and the Order of the Phoenix" To the disappointment of gossip columnists around the world, the young actor says he has no plans to fritter his cash away on fast cars, drink and celebrity parties. "I don\'t plan to be one of those people who, as soon as they turn 18, suddenly buy themselves a massive sports car collection or something similar," he told an Australian interviewer earlier this month. "I don\'t think I\'ll be particularly extravagant. "The things I like buying are things that cost about 10 pounds -- books and CDs and DVDs." At 18, Radcliffe will be able to gamble in a casino, buy a drink in a pub or see the horror film "Hostel: Part II," currently six places below his number one movie on the UK box office chart. Details of how he\'ll mark his landmark birthday are under wraps. His agent and publicist had no comment on his plans. "I\'ll definitely have some sort of party," he said in an interview. "Hopefully none of you will be reading about it." Radcliffe\'s earnings from the first five Potter films have been held in a trust fund which he has not been able to touch. Despite his growing fame and riches, the actor says he is keeping his feet firmly on the ground. "People are always looking to say \'kid star goes off the rails,\'" he told reporters last month. "But I try very hard not to go that way because it would be too easy for them." His latest outing as the boy wizard in "Harry Potter and the Order of the Phoenix" is breaking records on both sides of the Atlantic and he will reprise the role in the last two films.  Watch I-Reporter give her review of Potter\'s latest » . There is life beyond Potter, however. 
The Londoner has filmed a TV movie called "My Boy Jack," about author Rudyard Kipling and his son, due for release later this year. He will also appear in "December Boys," an Australian film about four boys who escape an orphanage. Earlier this year, he made his stage debut playing a tortured teenager in Peter Shaffer\'s "Equus." Meanwhile, he is braced for even closer media scrutiny now that he\'s legally an adult: "I just think I\'m going to be more sort of fair game," he told Reuters. E-mail to a friend . Copyright 2007 Reuters. All rights reserved.This material may not be published, broadcast, rewritten, or redistributed.', 'highlights': "Harry Potter star Daniel Radcliffe gets £20M fortune as he turns 18 Monday .\nYoung actor says he has no plans to fritter his cash away .\nRadcliffe's earnings from first five Potter films have been held in trust fund .", 'id': '42c027e4ff9730fbb3de84c1af0d2c506e41c3e4'}
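In the record above, 'highlights' is a single string whose summary sentences are separated by "\n" and end in " ." (a space before the period). The sketch below splits such a string into a list of sentences; `split_highlights` is a hypothetical helper written for this post, not part of the datasets library, and tidying the trailing " ." is just one possible cleanup choice.

```python
def split_highlights(highlights):
    """Split a 'highlights' string into a list of summary sentences."""
    sentences = []
    for line in highlights.split("\n"):
        line = line.strip()
        if line.endswith(" ."):
            # drop the space the dataset leaves before the final period
            line = line[:-2] + "."
        if line:
            sentences.append(line)
    return sentences


# Shortened copy of the 'highlights' field shown above
example = {
    "highlights": "Young actor says he has no plans to fritter his cash away .\n"
                  "Radcliffe's earnings from first five Potter films have been held in trust fund ."
}
summary_sentences = split_highlights(example["highlights"])
print(summary_sentences)
```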
