Scraping Baidu News headlines with Python, plus some simple data analysis

Latest recommended article published 2023-02-03 19:22:57

土、拨鼠

Category: Notes. Tags: python, data analysis

Article link: https://blog.csdn.net/hell_orld/article/details/105984792

Python scraper: Baidu News headlines and editing information, with some simple data analysis

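Required libraries

My Python version: Python 3.7.4

Libraries for fetching the news information: beautifulsoup4, requests, re;
Library for storing the information (the scraped data is saved to a CSV file): csv;
Libraries for data analysis: numpy, matplotlib;
Library for the interface: tkinter;

Of these, re, csv and tkinter ship with Python; only beautifulsoup4, requests, numpy and matplotlib have to be installed.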

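You need some familiarity with HTML tags

w3cschool is a good place to learn them.
Open the Baidu News site and press F12 to open the developer tools, or right-click and choose "View source", to see the page's source code.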

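Design approach

1. For each news page, fetch the page source with requests, then extract information from it with bs4 (BeautifulSoup).
2. Titles and links are obtained the same way in every news category. The page source shows that each title sits in an a tag inside an li tag, and every such a tag carries the attribute target="_blank", so the titles and links can be retrieved with BeautifulSoup's select().
Figure: the markup pattern in which news titles and links are stored.
3. The home page also lists trending news search terms, and the sports page lists trending sports search terms; their titles and links are stored in the same pattern.
4. Apart from the home page, the article pages in the other categories have nearly identical source. Using the article links collected earlier, each article page is processed as in step 1 to extract the editing information (author, date, time). The page source shows that this information sits in a div with class author-txt (div class="author-txt"): the author's name is in a p tag with class author-name, and the date and time are in span tags with classes date and time. A small share of articles use a different page layout that does not match this pattern; for those, -1 is stored in the lists instead.
5. Each kind of information is stored in its own list.
6. The lists are written to a CSV file with the pandas library.
7. For the per-category analysis and visualization, datetime gives today's and yesterday's dates; a dict plus some arithmetic then yields, for each category, the percentage of articles edited today, yesterday, or at another time, and matplotlib draws the result as a bar chart.
8. tkinter provides the interface: each category is a button, and clicking it displays that category's news; the trending search terms from the home page and the sports page are shown in the bottom-left and bottom-right corners.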
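
Steps 1 to 5 could be sketched roughly as follows. This is only an illustration under stated assumptions, not the author's actual modules: the function names (fetch_html, extract_titles_and_links, extract_edit_info, rows_to_csv), the sample markup and the example.com URLs are all hypothetical, and the CSV step uses the standard csv module rather than pandas. The sample HTML only mimics the pattern described above (li > a with target="_blank", and a div.author-txt block); the real Baidu pages may differ.

```python
import csv
import io

from bs4 import BeautifulSoup

# Hypothetical markup imitating the pattern described in steps 2 and 4.
LIST_HTML = """
<ul>
  <li><a href="https://example.com/news/1" target="_blank">First headline</a></li>
  <li><a href="https://example.com/news/2" target="_blank">Second headline</a></li>
  <li><a href="/internal">Navigation link without target</a></li>
</ul>
"""
ARTICLE_HTML = """
<div class="author-txt">
  <p class="author-name">Some Editor</p>
  <span class="date">05-08</span>
  <span class="time">10:30</span>
</div>
"""


def fetch_html(url):
    """Download a page (step 1). Not called below, so the sketch runs offline."""
    import requests  # imported lazily; only needed for real requests

    response = requests.get(url, timeout=10)
    response.encoding = response.apparent_encoding
    return response.text


def extract_titles_and_links(html):
    """Return (titles, urls) for every <a target="_blank"> inside an <li>."""
    soup = BeautifulSoup(html, "html.parser")
    anchors = soup.select('li a[target="_blank"]')
    titles = [a.get_text(strip=True) for a in anchors]
    urls = [a.get("href", "-1") for a in anchors]
    return titles, urls


def extract_edit_info(html):
    """Return (author, date, time); a field that cannot be found becomes '-1'."""
    soup = BeautifulSoup(html, "html.parser")
    block = soup.select_one("div.author-txt")

    def text_or_default(selector):
        node = block.select_one(selector) if block is not None else None
        return node.get_text(strip=True) if node is not None else "-1"

    return (
        text_or_default("p.author-name"),
        text_or_default("span.date"),
        text_or_default("span.time"),
    )


def rows_to_csv(header, rows):
    """Serialize the rows as CSV text (an in-memory stand-in for a file)."""
    buffer = io.StringIO(newline="")
    writer = csv.writer(buffer)
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


titles, urls = extract_titles_and_links(LIST_HTML)
# The second "article" uses a different layout, so its fields fall back to '-1'.
infos = [extract_edit_info(ARTICLE_HTML), extract_edit_info("<p>other layout</p>")]
rows = [(t, u, *info) for t, u, info in zip(titles, urls, infos)]
csv_text = rows_to_csv(["title", "url", "author", "date", "time"], rows)
```

In a real run, the HTML strings would come from fetch_html(url) for each category page and each collected article link.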

Results

Click a button to show each category's information:
Analysis of the publication-date distribution:
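
A minimal sketch of how such a distribution could be computed and drawn, assuming the scraped dates are 'MM-DD' strings (which matches the str(today)[5:] slicing in main.py) and that unknown dates are stored as '-1'. The function names date_distribution and plot_distribution, and the optional today parameter, are hypothetical and are not taken from the author's modules.

```python
import datetime


def date_distribution(dates, today=None):
    """Return the percentage of dates that fall on today, yesterday, or other."""
    today = today or datetime.date.today()
    yesterday = today - datetime.timedelta(days=1)
    today_key = str(today)[5:]          # 'MM-DD'
    yesterday_key = str(yesterday)[5:]  # 'MM-DD'

    counts = {"today": 0, "yesterday": 0, "other": 0}
    for d in dates:
        if d == today_key:
            counts["today"] += 1
        elif d == yesterday_key:
            counts["yesterday"] += 1
        else:
            counts["other"] += 1  # also catches the '-1' placeholder

    total = len(dates)
    if total == 0:
        return {key: 0.0 for key in counts}
    return {key: 100.0 * n / total for key, n in counts.items()}


def plot_distribution(per_category):
    """Draw grouped bars; per_category maps a category name to a distribution dict."""
    import numpy as np
    import matplotlib.pyplot as plt

    names = list(per_category)
    ind = np.arange(len(names))
    width = 0.25
    for offset, key in enumerate(("today", "yesterday", "other")):
        values = [per_category[name][key] for name in names]
        plt.bar(ind + offset * width, values, width, label=key)
    plt.xticks(ind + width, names)
    plt.ylabel("share of articles (%)")
    plt.legend()
    plt.show()


# Fixed reference date so the example is reproducible.
sample_dates = ["05-08", "05-08", "05-07", "-1"]
dist = date_distribution(sample_dates, today=datetime.date(2020, 5, 8))
```

Calling plot_distribution({"sports": dist}) would open a matplotlib window with one group of three bars; the plotting imports are kept inside the function so the counting logic can be used without matplotlib installed.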

Source code

main.py (main program):

from tkinter import *
import datetime
import numpy as np
import matplotlib.pyplot as plt
from hp import news_title,news_url,hot_title,hot_url
from inte import news_title2,news_url2,news_date,news_time,news_author,li
from mil import news_title3,news_url3,news_date2,news_time2,news_author2,li2
from finance import news_title4,news_url4,news_date3,news_time3,news_author3,li3
from ent import news_title5,news_url5,news_date4,news_time4,news_author4,li4
from sports import news_title6,news_url6,news_date5,news_time5,news_author5,hot_title2,hot_url2,li5
from tech import news_title7,news_url7,news_date6,news_time6,news_author6,li6
from game import news_title8,news_url8,news_date7,news_time7,news_author7,li7
def hp_print():  # show the home-page news
    txt.delete('1.0','end')  # clear the Text widget
    txt.insert(END,'首页新闻标题\t新闻链接\n')  # header row: home-page news title, news link
    for x in range(len(news_title)):
        txt.insert(END,news_title[x])
        txt.insert(END,'\t')
        txt.insert(END,news_url[x])
        txt.insert(END,'\n')
def show_news(title,url,date,time,author):  # show the news of any category other than the home page
    # named show_news rather than print so the built-in print() is not shadowed
    txt.delete('1.0','end')  # clear the Text widget
    txt.insert(END,'新闻标题\t新闻链接\t编辑日期\t编辑时间\t编辑作者\n')  # header row: title, link, date, time, author
    for x in range(len(title)):
        txt.insert(END,title[x])
        txt.insert(END,'\t')
        txt.insert(END,url[x])
        txt.insert(END,'\t')
        txt.insert(END,date[x])
        txt.insert(END,'\t')
        txt.insert(END,time[x])
        txt.insert(END,'\t')
        txt.insert(END,author[x])
        txt.insert(END,'\n')
root=Tk()
root.title('百度新闻-我知道！')  # window title: "Baidu News - I know!"
root.geometry('1024x560')
# label text: "Click a button to get the news in each category (-1 means unknown)"
lb=Label(root,text='点击按钮，获得各类新闻中新闻信息（-1表示不清楚）')
lb.place(relx=0.1,rely=0.01,relwidth=0.8,relheight=0.08)
txt = Text(root)  # output box for the news of the selected category
btn1=Button(root,text='首页',command=hp_print)  # home page
btn1.place(relx=0.005, rely=0.1, relwidth=0.05, relheight=0.05)
btn2=Button(root,text='int',command=lambda:show_news(news_title2,news_url2,news_date,news_time,news_author))
btn2.place(relx=0.08, rely=0.1, relwidth=0.05, relheight=0.05)
btn3=Button(root,text='mil',command=lambda:show_news(news_title3,news_url3,news_date2,news_time2,news_author2))
btn3.place(relx=0.155, rely=0.1, relwidth=0.05, relheight=0.05)
btn4=Button(root,text='财经',command=lambda:show_news(news_title4,news_url4,news_date3,news_time3,news_author3))  # finance
btn4.place(relx=0.23, rely=0.1, relwidth=0.05, relheight=0.05)
btn5=Button(root,text='娱乐',command=lambda:show_news(news_title5,news_url5,news_date4,news_time4,news_author4))  # entertainment
btn5.place(relx=0.305, rely=0.1, relwidth=0.05, relheight=0.05)
btn6=Button(root,text='体育',command=lambda:show_news(news_title6,news_url6,news_date5,news_time5,news_author5))  # sports
btn6.place(relx=0.38, rely=0.1, relwidth=0.05, relheight=0.05)
btn7=Button(root,text='科技',command=lambda:show_news(news_title7,news_url7,news_date6,news_time6,news_author6))  # technology
btn7.place(relx=0.455, rely=0.1, relwidth=0.05, relheight=0.05)
btn8=Button(root,text='游戏',command=lambda:show_news(news_title8,news_url8,news_date7,news_time7,news_author7))  # games
btn8.place(relx=0.53, rely=0.1, relwidth=0.05, relheight=0.05)
txt2=Text(root)  # box for the trending news search terms (bottom-left)
txt2.insert(END,'新闻热搜词\t链接\n')  # header row: trending news term, link
for x in range(len(hot_title)):
    txt2.insert(END,hot_title[x])
    txt2.insert(END,'\t')
    txt2.insert(END,hot_url[x])
    txt2.insert(END,'\n')
txt2.place(rely=0.8, relwidth=0.4, relheight=0.2)
txt3=Text(root)  # box for the trending sports search terms (bottom-right)
txt3.insert(END,'体育热搜词\t链接\n')  # header row: trending sports term, link
for x in range(len(hot_title2)):
    txt3.insert(END,hot_title2[x])
    txt3.insert(END,'\t')
    txt3.insert(END,hot_url2[x])
    txt3.insert(END,'\n')
txt3.place(relx=0.6,rely=0.8, relwidth=0.4, relheight=0.2)
txt.place(rely=0.2, relwidth=1, relheight=0.6)
today=datetime.date.today()
yesterday=today - datetime.timedelta(days=1)
s=str(today)[5:]  # today's date as 'MM-DD'
s2=str(yesterday)[5:]  # yesterday's date as 'MM-DD'
ind=np.arange(7)
l1=[li[0],li2[0],li3[0],li4
