Learning Web Crawlers (Part 3)

阳光下的Smiles

Category: Web Crawlers

Article link: https://blog.csdn.net/liyuqian199695/article/details/65935601

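1. Scraping the article page

(1) Open each link. An article page contains the title, body text, time and source, comments, and editor.

(2) Fetch the article page

Developer tools ---> Inspect ---> Network ---> Reload ---> Doc, then find the request that matches the article link.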

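import requests
from bs4 import BeautifulSoup

# Fetch the article page and decode it as UTF-8
res=requests.get('http://news.sina.com.cn/c/nd/2017-03-25/doc-ifycspxn9729572.shtml')
res.encoding='utf-8'
print(res.text)

# Parse the HTML so that elements can be picked out with CSS selectors
soup=BeautifulSoup(res.text,'html.parser')

The code above downloads the page content and parses it into soup.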

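(3) Scrape the title

Inspector ---> inspect element ---> select the title ---> the title element has the id artibodyTitle.

Scraping command: soup.select('#artibodyTitle')[0].text

This returns the title text of the article.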

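(4) Source and time

Scraping command: soup.select('.time-source')[0]

Processing the source and time

Get the time

	# The timestamp is the first child of .time-source; strip() removes the surrounding whitespace
	timesource=soup.select('.time-source')[0].contents[0].strip()
	timesource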

Get the source

	# The media name is the link text inside the nested span
	medianame=soup.select('.time-source span a')[0].text
	medianame

Converting the time string

	from datetime import datetime

String to datetime: strptime

	dt=datetime.strptime(timesource,'%Y年%m月%d日%H:%M')
	dt

Datetime to string: strftime

	dt.strftime('%Y-%m-%d')
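
The round trip between string and datetime can be tried without fetching the page. This is a minimal sketch, assuming the timestamp has the form 2017年03月25日09:58. The helper name parse_sina_time and the sample value are hypothetical, and the colon normalisation is an added precaution rather than something the page is known to need.

```python
from datetime import datetime


def parse_sina_time(raw):
    """Parse a timestamp such as '2017年03月25日09:58' into a datetime."""
    # Normalise a full-width colon to ASCII so one format string covers both.
    cleaned = raw.strip().replace('：', ':')
    return datetime.strptime(cleaned, '%Y年%m月%d日%H:%M')


sample = '2017年03月25日09:58'  # hypothetical sample, not copied from the page
dt = parse_sina_time(sample)
print(dt.strftime('%Y-%m-%d %H:%M'))  # prints 2017-03-25 09:58
```

strptime raises ValueError when the string does not match the format, so a mismatch shows up immediately rather than producing a wrong date.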

2. Getting the article body

(1) Append each paragraph to a list

	article=[]
	# [:-1] skips the last <p> inside #artibody
	for p in soup.select('#artibody p')[:-1]:
		article.append(p.text.strip())
	' '.join(article)

Shorter form

	' '.join([p.text.strip() for p in soup.select('#artibody p')[:-1]])
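
The joining step can be sketched on plain strings, which makes the effect of [:-1] and strip() easy to see. This is a minimal sketch: join_paragraphs and the sample list are hypothetical, the strings stand in for the p.text values, and skipping blank paragraphs is an addition that the original one-liner does not make.

```python
def join_paragraphs(paragraphs, drop_last=True):
    """Join paragraph strings with single spaces, skipping blank ones."""
    if drop_last:
        # Mirrors the [:-1] slice: leave out the final paragraph.
        paragraphs = paragraphs[:-1]
    cleaned = [p.strip() for p in paragraphs]
    return ' '.join(p for p in cleaned if p)


# Hypothetical texts standing in for [p.text for p in soup.select('#artibody p')]
sample = ['  First paragraph. ', '', 'Second paragraph.', 'Last paragraph.']
print(join_paragraphs(sample))  # prints First paragraph. Second paragraph.
```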

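3. Getting the editor's name

Scraping command:

	# replace() removes the label as a whole; strip('责任编辑：') would remove any of those characters from both ends, including characters of the name
	editor=soup.select('.article-editor')[0].text.replace('责任编辑：','').strip()
	editor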
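
One detail worth checking here is that str.strip with an argument removes any of the given characters from both ends, not a fixed prefix, so an editor name that shares characters with the label can be damaged. This is a minimal sketch of prefix removal, assuming the text has the form 责任编辑：name. The helper remove_label and the sample name are hypothetical.

```python
def remove_label(text, label='责任编辑：'):
    """Remove a leading label from text, leaving the rest untouched."""
    text = text.strip()
    if text.startswith(label):
        return text[len(label):].strip()
    return text


raw = '责任编辑：任辑 '  # hypothetical name made only of characters from the label

print(repr(raw.strip().strip('责任编辑：')))  # prints '' because every character is in the set
print(repr(remove_label(raw)))                # prints '任辑'
```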

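4. Getting the comment count

Scraping command: soup.select('#commentCount1')

This does not return the expected count, which suggests the number is not part of the initial HTML.

Finding where the comments come from

Command (find the comment request in the Network panel and copy its URL from the Headers tab):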

	import json
	# The request URL copied from Headers, kept on a single line
	comments=requests.get('http://comment5.news.sina.com.cn/page/info?version=1&format=js&channel=gn&newsid=comos-fycspxn9729572&group=&compress=0&ie=utf-8&oe=utf-8&page=1&page_size=20&jsvar=loader_1490417383395_65805650')

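Getting the comment count and comment content

	# The response has the form var data={...}; strip() trims those leading characters so that only the JSON object remains
	jd=json.loads(comments.text.strip('var data='))
	jd['result']['count']['total']

The comment count changes in real time, so the number returned will vary between runs.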
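
The parsing step can be tried offline on a made-up payload. This is a minimal sketch, assuming the response is a JavaScript assignment of the form var name={...}. The helper parse_js_assignment and the sample payload are hypothetical, and the count 42 is invented for illustration.

```python
import json


def parse_js_assignment(text):
    """Return the JSON object embedded in text shaped like 'var name={...}'."""
    # Take everything from the first '{' to the last '}' so that the variable
    # name and any trailing semicolon do not matter.
    start = text.index('{')
    end = text.rindex('}') + 1
    return json.loads(text[start:end])


sample = 'var data={"result": {"count": {"total": 42}}}'  # made-up payload
jd = parse_js_assignment(sample)
print(jd['result']['count']['total'])  # prints 42
```

Slicing between the braces avoids depending on the exact variable name in front of the JSON object.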

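Getting the news ID

	newsurl='http://news.sina.com.cn/c/nd/2017-03-25/doc-ifycspxn9729572.shtml'
	# Slice off the 'doc-i' prefix and the '.shtml' suffix; lstrip()/rstrip() take a set of characters and could also remove characters that belong to the ID
	newsid=newsurl.split('/')[-1][len('doc-i'):-len('.shtml')]
	newsid

This keeps only the middle part of the file name, fycspxn9729572.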

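Getting the news ID with a regular expression

	import re
	# Raw string with the dot escaped, so that \.shtml matches the literal extension
	m=re.search(r'doc-i(.+)\.shtml',newsurl)
	print(m.group(1))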
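
re.search returns None when the URL does not match, and calling group on None raises AttributeError. This is a minimal sketch of the same extraction with that case handled. The helper name extract_news_id is hypothetical, and the pattern assumes article URLs end in doc-i<id>.shtml as in the example above.

```python
import re

# Escaped dot so that only a literal '.shtml' matches.
NEWS_ID_PATTERN = re.compile(r'doc-i(.+)\.shtml')


def extract_news_id(url):
    """Return the news ID from an article URL, or None if the URL does not match."""
    match = NEWS_ID_PATTERN.search(url)
    return match.group(1) if match else None


newsurl = 'http://news.sina.com.cn/c/nd/2017-03-25/doc-ifycspxn9729572.shtml'
print(extract_news_id(newsurl))                      # prints fycspxn9729572
print(extract_news_id('http://news.sina.com.cn/'))   # prints None
```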

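5. Wrapping the comment-count logic in a function

commentURL is the comment request URL from step 4 with the news ID after comos- replaced by {}, so that commentURL.format(newsid) can fill it in.

commentURL=

Function definition: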

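def getCommentCount(newsurl):
	# Extract the news ID from the article URL
	m=re.search(r'doc-i(.+)\.shtml',newsurl)
	newsid=m.group(1)
	# Request the comment data for that ID and parse the var data={...} payload
	comments=requests.get(commentURL.format(newsid))
	jd=json.loads(comments.text.strip('var data='))
	return jd['result']['count']['total']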
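
An alternative to a format template is to build the query string from a dictionary. This is a minimal sketch, assuming the endpoint accepts the same parameters that appear in the URL copied in step 4. The names COMMENT_ENDPOINT and build_comment_url are hypothetical. The jsvar parameter is left out, and whether the server still answers in the var data={...} form without it is an assumption that has not been checked.

```python
from urllib.parse import urlencode

COMMENT_ENDPOINT = 'http://comment5.news.sina.com.cn/page/info'


def build_comment_url(newsid, page=1, page_size=20):
    """Build the comment request URL for one news ID."""
    params = {
        'version': 1,
        'format': 'js',
        'channel': 'gn',
        'newsid': 'comos-' + newsid,  # the ID is prefixed with 'comos-' in the copied URL
        'group': '',
        'compress': 0,
        'ie': 'utf-8',
        'oe': 'utf-8',
        'page': page,
        'page_size': page_size,
    }
    return COMMENT_ENDPOINT + '?' + urlencode(params)


print(build_comment_url('fycspxn9729572'))
```

Keeping the parameters in a dictionary makes it straightforward to request another page of comments by changing page.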

6. Wrapping the article scraping logic in a function

import requests
from bs4 import BeautifulSoup
from datetime import datetime

def getNewsDetail(newsurl):
	result={}
	res=requests.get(newsurl)
	res.encoding='utf-8'
	soup=BeautifulSoup(res.text,'html.parser')
	result['title']=soup.select('#artibodyTitle')[0].text
	result['newssource']=soup.select('.time-source span a')[0].text
	timesource=soup.select('.time-source')[0].contents[0].strip()
	result['dt']=datetime.strptime(timesource,'%Y年%m月%d日%H:%M')
	result['article']=' '.join([p.text.strip() for p in soup.select('#artibody p')[:-1]])
	# Remove the label as a whole string, then trim whitespace
	result['editor']=soup.select('.article-editor')[0].text.replace('责任编辑：','').strip()
	# getCommentCount is the function defined in section 5
	result['comments']=getCommentCount(newsurl)
	return result
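
Each soup.select(...)[0] raises IndexError when the selector matches nothing, for example if a page uses a different layout. This is a minimal sketch of a guard, assuming only that matched nodes expose a .text attribute as BeautifulSoup tags do. The helper first_text and the stand-in class FakeNode are hypothetical; FakeNode exists only so the sketch runs without a network request.

```python
def first_text(nodes, default=''):
    """Return the stripped text of the first node, or default if nodes is empty."""
    if not nodes:
        return default
    return nodes[0].text.strip()


class FakeNode:
    """Hypothetical stand-in for a BeautifulSoup tag; only .text is modelled."""

    def __init__(self, text):
        self.text = text


print(first_text([FakeNode('  Some title  ')]))  # a match: the text is stripped
print(first_text([], default='(missing)'))       # no match: the default is returned
```

Inside getNewsDetail, a call such as first_text(soup.select('#artibodyTitle')) would then yield an empty string instead of an exception when the title element is absent.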