Scraping Sina News

flamingobaby

Category: Python

Article link: https://blog.csdn.net/Naux1/article/details/76646127

Scraping the title, date, and link of each news item

import requests
from bs4 import BeautifulSoup

url='http://news.sina.com.cn/china/'
res=requests.get(url)
soup=BeautifulSoup(res.text,'lxml')
print(soup)

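At this point all the Chinese text in the scraped output is garbled, so add one line right after requests.get() and before reading res.text: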

res.encoding='utf-8'
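
Why the garbling happens can be shown without touching the network. This is a minimal sketch, assuming the server sends UTF-8 bytes without declaring a charset, in which case requests falls back to ISO-8859-1 when building res.text. The sample string is made up for illustration.

```python
# Sample text standing in for the page content (an assumption for illustration).
original = "新浪新闻"

# The bytes a server would send if the page is encoded as UTF-8.
raw = original.encode("utf-8")

# Decoding with the wrong codec yields mojibake instead of Chinese text.
garbled = raw.decode("iso-8859-1")

# Decoding with the right codec, which is what res.encoding='utf-8'
# tells requests to do, recovers the original text.
fixed = raw.decode("utf-8")

print(repr(garbled))
print(fixed)
```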

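The title, time, and link of each story all sit inside elements with class news-item, so we select .news-item and print the matches.
When writing the selector passed to select(), a tag name takes no prefix, a class name is prefixed with a dot (.), and an id is prefixed with #.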
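
The three selector forms can be tried on a small fragment first. This is a rough sketch; the HTML below is made up for illustration and is not taken from the Sina page, and it uses the built-in html.parser so that lxml is not required.

```python
from bs4 import BeautifulSoup

# A made-up HTML fragment for illustration (not taken from the Sina page).
html = """
<div id="main">
  <h2>Headline</h2>
  <div class="news-item">first</div>
  <div class="news-item">second</div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

by_tag = soup.select("h2")            # tag name: no prefix
by_class = soup.select(".news-item")  # class name: leading dot
by_id = soup.select("#main")          # id: leading '#'

print([el.text for el in by_tag])
print([el.text for el in by_class])
print(len(by_id))
```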

import requests
from bs4 import BeautifulSoup

url='http://news.sina.com.cn/china/'
res=requests.get(url)
res.encoding='utf-8'
soup=BeautifulSoup(res.text,'lxml')

for news in soup.select('.news-item'):
    print(news)

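Output:
[Screenshot of the output]
From this output we can see that the title is inside an h2 tag, the time is inside an element with class time, and the link is inside an a tag.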

import requests
from bs4 import BeautifulSoup

url='http://news.sina.com.cn/china/'
res=requests.get(url)
res.encoding='utf-8'
soup=BeautifulSoup(res.text,'lxml')

for news in soup.select('.news-item'):
    if len(news.select('h2'))>0:
        title=news.select('h2')[0].text  # .text returns the text content
        a=news.select('a')[0]['href']   # ['href'] returns the link target
        time=news.select('.time')[0].text  # time is a class, hence .time
        print(time)
        print(title)
        print(a)
        print()  # blank line between items

Output:
[Screenshot of the output]
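
Printing is fine for a quick look, but collecting the fields into a list makes them easier to reuse. The following is a minimal sketch of that idea; the function name parse_news_items and the sample markup are assumptions for illustration, with the markup only mimicking the structure described above rather than copying the live page.

```python
from bs4 import BeautifulSoup

# Made-up markup that mimics the structure described above (an assumption,
# not a copy of the live page): h2 > a for title and link, .time for the date.
html = """
<div class="news-item">
  <h2><a href="http://example.com/a.shtml">First headline</a></h2>
  <div class="time">8月3日 17:40</div>
</div>
<div class="news-item"></div>
<div class="news-item">
  <h2><a href="http://example.com/b.shtml">Second headline</a></h2>
  <div class="time">8月3日 16:05</div>
</div>
"""

def parse_news_items(markup):
    """Return a list of dicts with the title, link, and time of each item."""
    soup = BeautifulSoup(markup, "html.parser")
    items = []
    for news in soup.select(".news-item"):
        heading = news.select("h2")
        if not heading:  # skip items without an h2, like the len(...) > 0 check
            continue
        items.append({
            "title": heading[0].text.strip(),
            "link": news.select("a")[0]["href"],
            "time": news.select(".time")[0].text.strip(),
        })
    return items

for item in parse_news_items(html):
    print(item["time"], item["title"], item["link"])
```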

Open any one of the headlines; next we scrape that article's title, time, and source.
1. Scraping the title

import requests
from bs4 import BeautifulSoup

url='http://news.sina.com.cn/c/nd/2017-08-03/doc-ifyitapp0298122.shtml'   # URL found in the browser's Network tab
res=requests.get(url)
res.encoding='utf-8' # set the encoding
soup=BeautifulSoup(res.text,'html.parser')

title=soup.select('#artibodyTitle')[0].text 
print(title)

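In title=soup.select('#artibodyTitle')[0].text, select() returns a list. The page has only one title, so the list holds a single element, and [0] takes that first element before .text reads its text.
2. Scraping the time and the article source
Both the time and the source sit under class="time-source".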

time=soup.select('.time-source')[0].text   # time-source is a class, so prefix it with .
print(time)

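[Screenshot of the output]
This gives us the time and the source together, both inside the first element of the list. To take the time and the source out separately, look at the markup: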

<span class="time-source" id="navtimeSource">2017年08月03日17:40           <span>
<span data-sudaclick="media_name"><a href="http://app.peopleapp.com/Api/600/DetailApi/shareArticle?type=0&amp;article_id=669164" rel="nofollow" target="_blank">新浪综合</a></span></span>
</span>

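The time is text inside the outer span, while the source is wrapped in a nested span.
Appending .contents to the selected element therefore splits them into two elements of a list: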

time=soup.select('.time-source')[0].contents
print (time)

[Screenshot of the output]
The comma in the printed list shows that it has two elements.
Take the first element, the time:

time=soup.select('.time-source')[0].contents[0]
print (time)

Take the second element, the article source:

source=soup.select('.time-source')[0].contents[1].text
print (source)
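
The time string taken from contents[0] still carries trailing whitespace and is plain text. Below is a minimal sketch of turning it into a datetime object, assuming the date format seen in the markup above; the variable names time_text and published are illustrative.

```python
from datetime import datetime

# The text of contents[0] as it appears in the markup, trailing spaces included.
time_text = "2017年08月03日17:40           "

# strip() removes the padding; the format string mirrors the Chinese date layout.
published = datetime.strptime(time_text.strip(), "%Y年%m月%d日%H:%M")

print(published)
print(published.strftime("%Y-%m-%d"))
```

Once parsed, the value can be reformatted with strftime or compared with other dates.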

3. Scraping the article body

article=soup.select('#artibody p')
print (article)

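The article is under id=artibody, and each paragraph sits between a pair of p tags, so we add p to the selector and get:
[Screenshot of the output]
If we do not want the final line, "责任编辑：张迪" (the editor credit), we only need: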

article=soup.select('#artibody p')[:-1]
print (article)

To print all the text:

article=soup.select('#artibody')[0].text
print(article)
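
The slice and the text extraction can also be combined, so that the editor credit is dropped and the paragraphs come out as clean text. This is a rough sketch on made-up markup that follows the structure described above; it is one possible approach, not the only one.

```python
from bs4 import BeautifulSoup

# Made-up article markup following the structure described above.
html = """
<div id="artibody">
  <p>  First paragraph.</p>
  <p>  Second paragraph.</p>
  <p>责任编辑：张迪</p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Drop the trailing editor credit, strip each paragraph, and join them.
paragraphs = [p.text.strip() for p in soup.select("#artibody p")[:-1]]
article = "\n".join(paragraphs)

print(article)
```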

[Screenshot of the output]
4. Scraping the article's comment count

comment_count=soup.select('#commentCount1')
print(comment_count)

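When we print this, the comment count does not appear.
[Screenshot of the output]
The comments are loaded into the page by JavaScript, so we need to find the corresponding JavaScript resource.

Under the JS tab we find an entry showing 204; expanding it reveals the readers' comments.

Now click Headers and copy the link under URL to fetch the comments.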
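
Responses of this kind are often JavaScript rather than pure JSON. Here is a minimal sketch of pulling a count out of such a response, assuming it has the form var data={...}. The payload and its field names are invented for illustration and are not taken from Sina's actual response; in real use the text would come from requests.get() on the copied URL.

```python
import json

# Invented sample response (an assumption): a JavaScript assignment wrapping JSON.
response_text = 'var data={"result": {"count": {"total": 204}}}'

# Keep everything after the first '=' so that only the JSON object remains.
json_text = response_text.split("=", 1)[1]
data = json.loads(json_text)

# Field names here are hypothetical; inspect the real response to find them.
total = data["result"]["count"]["total"]
print(total)
```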
