Importing news stored in txt files into a csv file

To train a model, I previously scraped a large number of news articles from major news sites and saved the data in txt files. To make the data easier to work with later, I decided to move it into a csv file. My first thought was to modify the scraper's save method, but since that would mean re-scraping foreign news sites and my VPN has been unstable lately, I decided to write a small Python program to convert the data directly.

Approach:

  1. Observe the structure of the news data: each record starts with the publication date, followed by the publication time, then the title and body. If a line starts with '2020-', it is a new news item; if it does not, the line is the continuation of the previous item.
  2. Create a new list and add each line of the txt file to it as an item.
  3. Iterate over the list. If an item starts with '2020-', slice out [0:10] as the publication date, [11:19] as the publication time, and [21:] as the body. Append each extracted part to new lists, one per future csv column.
  4. If an item does not start with '2020-', it is the continuation of the previous item, so append it to the end of the previous item's body.
  5. Create a new csv file with four columns (news id, publication date, publication time, body) and fill it from the lists built above.

Code:

import csv


# add each line in the txt to a list as an item
def process_news(path):
    processed_news = []  # create a list to contain lines from the txt
    with open(path, "r", encoding='utf-8') as f:
        for line in f.readlines():
            line = line.strip()  # drop the trailing newline
            if line:  # skip empty lines
                processed_news.append(line)
    return processed_news
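Steps 3 to 5 above can be sketched as follows. This is a minimal illustration, not the original post's code: the function names `parse_news` and `save_to_csv` are my own, and the slice positions follow the [0:10], [11:19], [21:] layout described in the approach.

```python
import csv


# step 3 and 4: turn the list of lines into [date, time, body] records,
# using the '2020-' prefix to detect the start of a new news item
def parse_news(lines):
    records = []  # each item: [date, time, body]
    for line in lines:
        if line.startswith('2020-'):
            # slice out the date [0:10], the time [11:19], and the body [21:]
            records.append([line[0:10], line[11:19], line[21:]])
        elif records:
            # continuation line: append it to the previous item's body
            records[-1][2] += line
    return records


# step 5: write the records to a csv file with four columns
def save_to_csv(records, out_path):
    with open(out_path, 'w', encoding='utf-8', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(['id', 'date', 'time', 'body'])
        for i, (date, time, body) in enumerate(records, start=1):
            writer.writerow([i, date, time, body])
```

A full run would then just be `save_to_csv(parse_news(process_news('news.txt')), 'news.csv')`. Note that `newline=''` matters when writing csv on Windows, otherwise blank rows appear between records.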
