Web Scraping in Practice (2): Details of a Sina News Article

千里足行~始于足下

Category: Web Crawlers

Article link: https://blog.csdn.net/weixin_43585712/article/details/108936833

This article shows how to extract the details of a Sina News article: title, time, source, body text, editor, and comment count.

import requests
from bs4 import BeautifulSoup
res=requests.get("https://news.sina.com.cn/s/2020-10-05/doc-iivhvpwz0482504.shtml")
res.encoding='utf-8'
#print(res.text)
soup=BeautifulSoup(res.text,'html.parser')
print(soup.text)

1. Getting the title

soup.select(".main-title")[0].text  # Article title; the right selector varies by page, so check it in the browser's developer tools

2. Time and source

2.1 Getting both together

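# The browser's developer tools show that the time and source sit under the class "date-source"
source = soup.select(".date-source")  # Element holding the article's time and source
print(source)
print('{:*^100}'.format('output'))
# Use the output above to work out how to reach the time; .contents lists an element's children
print(source[0].contents)
time0 = source[0].contents[1]
time = source[0].contents[1].contents[0]
print(time0, "\n", time)
print('{:*^100}'.format('output'))
# The time above is a str; when collecting data we want a datetime object
from datetime import datetime
print(datetime.strptime(time, "%Y年%m月%d日 %H:%M"))  # Convert the string to a datetime
# Get the source
print(source[0].contents[3].text)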

Note: the output boxed in red in the screenshot was printed only to work out how to reach the time and the source.

2.2 Getting them separately

Getting the time and the source directly

date=soup.select(".date")
datetime.strptime(date[0].text, "%Y年%m月%d日 %H:%M")

soup.select(".source")[0].text
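
If the time text ever deviates from the expected format, `strptime` raises `ValueError`. A minimal sketch of a tolerant parser, assuming the same `年/月/日` format; the helper name `parse_sina_time` and the sample strings are hypothetical and not part of the original code:

```python
from datetime import datetime
from typing import Optional


def parse_sina_time(text: str) -> Optional[datetime]:
    """Parse text such as '2020年10月05日 12:30'; return None if it does not match."""
    try:
        return datetime.strptime(text.strip(), "%Y年%m月%d日 %H:%M")
    except ValueError:
        # The text did not follow the expected format
        return None


# Made-up sample values for illustration
print(parse_sina_time("2020年10月05日 12:30"))
print(parse_sina_time("not a date"))
```

Returning `None` instead of raising lets the caller decide whether to skip the article or keep it without a timestamp.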

3. Getting the body text and the editor

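# Merge the paragraphs into one string, dropping the <p> separators. \u3000 is the full-width (ideographic) space; strip() removes it
# "".join([p.text.strip() for p in soup.select("#article p")[:-1]])
article = []
for p in soup.select("#article p")[:-1]:
    article.append(p.text.strip())
print("  ".join(article))  # Paragraphs are separated by spaces; "\n" or "@" would work too
print('{:*^100}'.format('editor'))
# strip('责任编辑：') would strip a set of characters, not a prefix, so remove the label explicitly
print(soup.select("#article p")[-1].text.strip().replace('责任编辑：', '', 1))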
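
One detail worth keeping in mind when cleaning the editor line: `str.strip(chars)` treats its argument as a set of characters to remove from both ends, not as a prefix. A small sketch of the difference; the helper `remove_label` and the sample editor line are made up for illustration:

```python
def remove_label(text: str, label: str = "责任编辑：") -> str:
    """Remove a leading label (exact prefix) and surrounding whitespace."""
    text = text.strip()  # also removes full-width spaces (\u3000)
    if text.startswith(label):
        text = text[len(label):]
    return text.strip()


sample = "责任编辑：张任"  # hypothetical editor line

# Character-set stripping also eats the trailing "任", because it is in the set
print(sample.strip("责任编辑："))
# Prefix removal keeps the full name
print(remove_label(sample))
```

On Python 3.9 and later, `str.removeprefix` does the same job as the slicing above.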

4. Getting the comment count

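Steps for locating the information:
Look under Doc in the developer tools' Network panel first. If the value is not there, it is not loaded synchronously with the page, so look for it under XHR and JS next.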

soup.select(".icon-comment")

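Note: this shows that the comment count is obtained some other way.
Next, go through the files under XHR and JS one by one, searching for the comment count.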

To get the URL that serves the comment count, click Headers for that request.

import requests
# The URL is long, so it is split across lines; a trailing backslash \ joins each line to the next
URL="https://comment.sina.com.cn/page/info?version=1&format=json\
&channel=sh&newsid=comos-ivhvpwz0482504&group=0&compress=0&ie=utf-8&oe=utf-8&page=1\
&page_size=3&t_size=3&h_size=3&thread=1&uid=unlogin_user&callback=jsonp_1601953986658&_=1601953986658"
comments=requests.get(URL)
print(comments.text)
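
As an alternative to a backslash-continued string, the same URL can be assembled from a dictionary of query parameters. This is a sketch only: the parameter names and values are copied from the URL in this article, while `COMMENT_ENDPOINT` and `build_comment_url` are names introduced here for illustration.

```python
from urllib.parse import urlencode

COMMENT_ENDPOINT = "https://comment.sina.com.cn/page/info"


def build_comment_url(newsid: str) -> str:
    """Build the comment-info URL for one news ID (parameters taken from the article's URL)."""
    params = {
        "version": 1, "format": "json", "channel": "sh",
        "newsid": f"comos-{newsid}",
        "group": 0, "compress": 0, "ie": "utf-8", "oe": "utf-8",
        "page": 1, "page_size": 3, "t_size": 3, "h_size": 3,
        "thread": 1, "uid": "unlogin_user",
        "callback": "jsonp_1601953986658", "_": "1601953986658",
    }
    # urlencode keeps the dictionary's insertion order
    return COMMENT_ENDPOINT + "?" + urlencode(params)


print(build_comment_url("ivhvpwz0482504"))
```

Keeping the parameters in a dictionary makes it easier to see which one changes per article (`newsid`).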

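Parse the response as JSON, but first remove the part underlined in red in the screenshot (the JSONP wrapper around the JSON).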

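import json
# The response is JSONP: jsonp_1601953986658({...}). Keep only what lies between the outer parentheses;
# strip('jsonp_...') would strip a set of characters rather than the exact prefix
text = comments.text
jd = json.loads(text[text.index('(') + 1:text.rindex(')')])
jd
# Going back to Chrome's developer tools makes browsing the contents of jd faster
jd["result"]["count"]["total"]  # Comment count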
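
The unwrapping step can be isolated into a small function that does not depend on the exact callback name. A minimal sketch, assuming the response has the shape `callback(...)`; the function name `unwrap_jsonp` and the sample payload (including the count of 12) are made up for illustration:

```python
import json
import re
from typing import Any


def unwrap_jsonp(text: str) -> Any:
    """Return the JSON value wrapped in a JSONP call such as callback({...})."""
    # Capture everything between the first "(" and the last ")"
    match = re.search(r"^[^(]*\((.*)\)[^)]*$", text.strip(), re.DOTALL)
    if match is None:
        raise ValueError("not a JSONP response")
    return json.loads(match.group(1))


sample = 'jsonp_1601953986658({"result": {"count": {"total": 12}}})'
print(unwrap_jsonp(sample)["result"]["count"]["total"])
```

Because the pattern ignores whatever precedes the first parenthesis, it keeps working if the callback name in the URL changes.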

5. Getting the news ID

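# How to get the news ID
# Below is the URL of the news page
newsurl = "https://news.sina.com.cn/c/2018-11-09/doc-ihnprhzw5251381.shtml"
print(newsurl.split("/"))
print('{:*^100}'.format('output'))
# Slice off the fixed prefix and suffix; lstrip/rstrip strip character sets and can remove too much
filename = newsurl.split("/")[-1]
newsid = filename[len("doc-i"):-len(".shtml")]
print("news id:", newsid)
print('{:*^100}'.format('output'))
# Get the news ID with a regular expression
import re
m = re.search(r"doc-i(.+)\.shtml", newsurl)  # Returns a match object for the matched part
print(m.group(1))  # group(0) is the whole match; group(1) is the content of the parentheses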
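
For extracting the ID, the regular expression is the safer of the two routes, because `lstrip` and `rstrip` remove any leading or trailing characters from a set rather than an exact prefix or suffix. A short sketch comparing them on the article URL used at the top of this post; `extract_news_id` is a hypothetical helper name:

```python
import re
from typing import Optional

NEWS_ID_PATTERN = re.compile(r"doc-i(.+)\.shtml")


def extract_news_id(newsurl: str) -> Optional[str]:
    """Return the part between 'doc-i' and '.shtml', or None if the URL does not match."""
    m = NEWS_ID_PATTERN.search(newsurl)
    return m.group(1) if m else None


url = "https://news.sina.com.cn/s/2020-10-05/doc-iivhvpwz0482504.shtml"
print(extract_news_id(url))
# Character-set stripping removes the second "i" as well, losing part of the ID
print(url.split("/")[-1].rstrip(".shtml").lstrip("doc-i"))
```

The regex result matches the `comos-ivhvpwz0482504` value seen in the comment URL, while the stripped version is one character short.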

6. Putting it all together

import json
import re
from datetime import datetime

import requests
from bs4 import BeautifulSoup

# Given a news URL, return its comment count; the comment URLs differ only in the news ID
commentURL = "https://comment.sina.com.cn/page/info?version=1&format=json\
&channel=sh&newsid=comos-{}&group=0&compress=0&ie=utf-8&oe=utf-8&page=1\
&page_size=3&t_size=3&h_size=3&thread=1&uid=unlogin_user&callback=jsonp_1601953986658&_=1601953986658"
def getCommentCounts(newsurl):
    m = re.search(r'doc-i(.+)\.shtml', newsurl)
    newsid = m.group(1)  # News ID
    comments = requests.get(commentURL.format(newsid))
    text = comments.text
    # Keep only the JSON between the parentheses of the JSONP wrapper
    jd = json.loads(text[text.index('(') + 1:text.rindex(')')])
    return jd["result"]["count"]["total"]
 
# Input: URL; output: title, source, time, body text, editor, and comment count
def getNewsDetail(newsurl):
    result = {}
    res = requests.get(newsurl)
    res.encoding = 'utf-8'
    soup = BeautifulSoup(res.text, 'html.parser')
    result['title'] = soup.select(".main-title")[0].text
    result['newssource'] = soup.select(".source")[0].text
    timesource = soup.select(".date")[0].text
    result['dt'] = datetime.strptime(timesource, "%Y年%m月%d日 %H:%M")
    result['article'] = '\n'.join([p.text.strip() for p in soup.select("#article p")[:-1]])
    # Remove the "责任编辑：" label as an exact string; strip() with that argument would strip a character set
    result['editor'] = soup.select("#article p")[-1].text.strip().replace('责任编辑：', '', 1)
    result['comments'] = getCommentCounts(newsurl)
    return result

news = "https://news.sina.com.cn/s/2020-10-05/doc-iivhvpwz0482504.shtml"
print(getNewsDetail(news))
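
To run the same extraction over several articles, one option is a small loop that skips pages that fail instead of stopping. This is only a sketch: `collect_details` and the stand-in `fake_fetch` are hypothetical names, and in practice the fetch function passed in would be `getNewsDetail`.

```python
from typing import Callable, Dict, Iterable, List


def collect_details(urls: Iterable[str], fetch: Callable[[str], Dict]) -> List[Dict]:
    """Call fetch(url) for each URL and keep the results that succeed."""
    results = []
    for url in urls:
        try:
            results.append(fetch(url))
        except Exception as exc:  # e.g. a missing selector or a network error
            print("skipped", url, "->", exc)
    return results


def fake_fetch(url: str) -> Dict:
    """Stand-in for getNewsDetail so the sketch runs without network access."""
    if "bad" in url:
        raise IndexError("list index out of range")
    return {"url": url}


print(collect_details(["https://example.com/ok", "https://example.com/bad"], fake_fetch))
```

With the real function, the call would look like `collect_details([news], getNewsDetail)`.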
