A Web Crawler for the Military News Channel

This article walks through the process of scraping data from the military news section of People's Daily Online (人民网), covering data acquisition, processing, statistical analysis, and visualization. High-frequency terms are presented with a word cloud, a bar chart, and a pie chart, revealing the hot topics in military news.

Analysis of the People's Daily Online Military News Channel

(Goal: collect and consolidate data from People's Daily Online military news and build a classification model)

 

 

Abstract

This article performs exploratory data analysis on a dataset collected from the People's Daily Online military news channel, analyzing and organizing it through visualization, feature-correlation analysis, and other methods. It also describes the problems encountered during data preparation and the measures taken to solve them. The results are presented as a word cloud, a bar chart, a pie chart, and accompanying text. Finally, reflections on completing the project are given in prose.

 

 

Keywords

Python, web scraping, data processing, analysis

 

 

Table of Contents

Code Implementation
1.1 Importing Required Packages
1.2 Parsing the Web Page
1.3 Extracting News Titles and Content
1.4 Writing to and Reading from a txt File
1.5 Word Segmentation
1.6 Computing Word Frequencies and the Top n Words
1.7 Writing to a csv File
1.8 Drawing the Word Cloud
1.9 Drawing the Bar Chart
1.10 Drawing the Pie Chart
1.11 Main Function and Invocation
1.12 Notes
2. Background
3. Data Analysis Workflow
3.1 Data Acquisition
3.1.1 Crawling the Web Pages
3.2 Data Extraction
3.2.1 Garbled Characters When Writing to the csv File
3.2.2 Writing a Dictionary to the csv File
3.3 Statistical Analysis
3.3.1 Drawing the Bar Chart
3.3.2 Drawing the Pie Chart
3.4 Visualization
3.4.1 Word Cloud
3.4.2 Bar Chart
3.4.3 Pie Chart
3.5 Saving the Results
4. Experimental Conclusions
5. Closing Remarks
6. Course Reflections

 

 

 

Code Implementation

1.1 Importing Required Packages

# -*- coding:utf-8 -*-
import requests as re                  # HTTP requests (note: the alias 're' shadows the standard regex module's name)
from bs4 import BeautifulSoup as BS    # HTML parsing
import jieba                           # Chinese word segmentation
import imageio                         # reading the word-cloud mask image
import wordcloud                       # generating the word cloud
import matplotlib
import matplotlib.pylab as plt         # plotting the bar and pie charts
import string
import csv

 

1.2 Parsing the Web Page

# Fetch a web page
def getHtml(url):         # url: link to the page
    rs = re.get(url)
    rs.encoding = 'gbk'   # decode as 'gbk' (the encoding used by the scraped pages)
    html = rs.text        # page content as text
    return html
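
As a quick sanity check, the function can be called on the channel's front page. This is only a sketch; the URL below is an assumption made for illustration, since the exact pages crawled are not listed in this excerpt.

# Minimal usage sketch; the URL is a hypothetical entry page of the military channel
url = 'http://military.people.com.cn/'
html = getHtml(url)
print(html[:200])   # peek at the start of the page to confirm the decoding looks right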

 

1.3 Extracting News Titles and Content

# Extract news titles
def getComments(html):               # html: page content
    soup = BS(html, 'html.parser')
    p = soup.find_all('h5')          # titles sit in 'h5' tags
    comments = []
    for pi in p:
        pi = pi.string               # text inside each tag
        comments.append(str(pi))     # add it to the list
    return comments                  # list of titles

 

# Extract news content
def getComment(html):                # html: page content
    soup = BS(html, 'html.parser')
    pp = soup.find_all('em')         # content snippets sit in 'em' tags
    com = []
    for pi in pp:
        pi = pi.text                 # text form of each tag's content
        com.append(str(pi))          # add it to the list
    return com                       # list of content strings
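
Putting the two extractors together yields one record per article. The sketch below assumes the title list and content list returned for a page line up one-to-one, and reuses the same hypothetical URL as above.

# Sketch: fetch one page and pair each title with its content snippet
# (assumes the 'h5' and 'em' lists are parallel; URL is hypothetical)
html = getHtml('http://military.people.com.cn/')
titles = getComments(html)
contents = getComment(html)
for t, c in zip(titles, contents):
    print(t, '->', c[:30])    # show each title with the start of its content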

 

1.4 Writing to and Reading from a txt File

# Write titles and contents to a txt file
def wTxt2f(fileName, comments, com):   # file name, title list, content list
    with open(fileName, 'a', encoding='utf-8') as f:
        # the original text is cut off at this loop; a plausible completion:
        for i in range(len(comments)):
            f.write(comments[i] + '\n')          # one title per line
            if i < len(com):
                f.write(com[i] + '\n')           # followed by its content
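
The section heading also mentions reading the txt file back, but that code is cut off in the original. Below is a minimal sketch of a reading counterpart, assuming the layout produced by wTxt2f above; the function name rTxt2f is hypothetical.

# Read the whole txt file back as one string (hypothetical counterpart to wTxt2f)
def rTxt2f(fileName):
    with open(fileName, 'r', encoding='utf-8') as f:
        return f.read()    # raw text, ready to hand to jieba for segmentation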
