Python in Practice: A Web Crawler (Scraping News Article Content)

醍醐三叶

Tags: python

Article link: https://blog.csdn.net/zx870121209/article/details/81698917

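(1) Preparation: open Chrome, go to the domestic news page of Sina News, click into one of the articles, and open the developer tools. Fetch the page with requests, then parse it with BeautifulSoup: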

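import requests
from bs4 import BeautifulSoup

# Fetch the article page and decode it as UTF-8
res = requests.get('http://news.sina.com.cn/c/2018-08-15/doc-ihhtfwqr3419248.shtml')
res.encoding = 'utf-8'
# Name the parser explicitly so bs4 does not have to guess one
soup = BeautifulSoup(res.text, 'html.parser')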

(2) Getting the title: the developer tools show that the title is marked with class='main-title', so:

title = soup.select('.main-title')[0].text

(3) Getting the source: the developer tools show that the source sits in an a tag under the element with class='date-source', so:

source = soup.select('.date-source a')[0].text

(4) Getting the publication time: the developer tools show that the time sits in the element with class='date' under the element with class='date-source', so:

from datetime import datetime

timestr = soup.select('.date')[0].text
dt = datetime.strptime(timestr, '%Y年%m月%d日 %H:%M')
date = dt.strftime('%Y-%m-%d %H:%M')

The time is parsed and then re-formatted into a string that is easier to store.
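To see the conversion in isolation, here is a minimal sketch that runs without fetching any page. The input string is a made-up sample in the same format as the page's date text, not a value taken from the article.

```python
from datetime import datetime

# Hypothetical sample in the '%Y年%m月%d日 %H:%M' format used on the page
raw = '2018年08月15日 10:30'

# strptime parses the text into a datetime object;
# the literal characters 年, 月, 日 must match the input exactly
dt = datetime.strptime(raw, '%Y年%m月%d日 %H:%M')

# strftime turns the datetime back into a storage-friendly string
date = dt.strftime('%Y-%m-%d %H:%M')
print(date)  # 2018-08-15 10:30
```

Keeping the intermediate datetime object around is handy if the value later needs sorting or comparison rather than just storage.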

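(5) Getting the body text: the developer tools show that the body sits in the element with class='article'. Because the text is split across several <p> tags, loop over them with for, take the text of each paragraph (which drops the <p> tags), and join the pieces: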

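paragraphs = []  # avoid the name "list", which would shadow the built-in
# The slice [:-1] leaves out the last <p>
for p in soup.select('.article p')[:-1]:
    paragraphs.append(p.text.strip())
article = ' '.join(paragraphs)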

A list comprehension makes the code more concise: article = ' '.join([p.text.strip() for p in soup.select('.article p')[:-1]])

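(6) Getting the comment count: this one is a little special. Comments are updated dynamically, so look among the JS-type requests in the developer tools and find the item info?version=1&format=json&channel=gn&newsid=comos…ad=1&callback=jsonp_1534300770937&_=1534300770937. Its response contains the comments, including the comment count. Since the format is format=json, the json module is needed to process the data: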

import requests
import json

res = requests.get('http://comment5.news.sina.com.cn/page/info?version=1&format=json&channel=gn&newsid=comos-hhqtawy3741909&group=undefined&compress=0&ie=utf-8&oe=utf-8&page=1&page_size=3&t_size=3&h_size=3&thread=1')
res.encoding = 'utf-8'
# The response body is JSON, so it can be parsed directly without BeautifulSoup
jd = json.loads(res.text)
print(jd['result']['count']['total'])

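Looking closely, newsid=comos-hhqtawy3741909 in this URL shares a part with the article URL http://news.sina.com.cn/w/2018-08-14/doc-ihhqtawy3741909.shtml, namely hhqtawy3741909. That shared part is the news id. Based on this observation, the comment-count code can be written as a function: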

import re

def commentsCount(newsurl):
    # Capture the id between 'doc-i' and '.shtml' (the dot is escaped to match literally)
    m = re.search(r'doc-i(.+)\.shtml', newsurl)
    newsid = m.group(1)
    commentsURL = 'http://comment5.news.sina.com.cn/page/info?version=1&format=json&channel=gn&newsid=comos-{}&group=undefined&compress=0&ie=utf-8&oe=utf-8&page=1&page_size=3&t_size=3&h_size=3&thread=1'
    res = requests.get(commentsURL.format(newsid))
    res.encoding = 'utf-8'
    jd = json.loads(res.text)
    return jd['result']['count']['total']

The re module is used here to extract the needed field with a regular expression; commentsURL.format(newsid) fills the {} in commentsURL with newsid.
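The id extraction and URL filling can be tried offline. The sketch below only builds the comment URL and sends no request; the helper name extract_news_id is an illustrative assumption, while both URLs are the ones quoted in the text.

```python
import re

COMMENTS_URL = ('http://comment5.news.sina.com.cn/page/info?version=1&format=json'
                '&channel=gn&newsid=comos-{}&group=undefined&compress=0&ie=utf-8'
                '&oe=utf-8&page=1&page_size=3&t_size=3&h_size=3&thread=1')

def extract_news_id(news_url):
    """Return the part between 'doc-i' and '.shtml', or None if the URL does not match."""
    m = re.search(r'doc-i(.+)\.shtml', news_url)
    return m.group(1) if m else None

news_id = extract_news_id('http://news.sina.com.cn/w/2018-08-14/doc-ihhqtawy3741909.shtml')
print(news_id)                       # hhqtawy3741909
print(COMMENTS_URL.format(news_id))  # the {} placeholder is replaced by the id

# A URL without the doc-i...shtml pattern yields None instead of raising
print(extract_news_id('http://news.sina.com.cn/'))  # None
```

Returning None for a non-matching URL is one possible design; it lets the caller decide whether to skip the article rather than crash on m.group(1).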

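(7) Wrapping the extraction code in a function: the code so far handles a single article. To make it reusable for other articles, put the extraction steps into a function: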

def getNews(newsurl):
    result = {}
    res = requests.get(newsurl)
    res.encoding = 'utf-8'
    soup = BeautifulSoup(res.text, 'html.parser')
    result['title'] = soup.select('.main-title')[0].text
    result['source'] = soup.select('.date-source a')[0].text
    timestr = soup.select('.date')[0].text
    dt = datetime.strptime(timestr, '%Y年%m月%d日 %H:%M')
    result['date'] = dt.strftime('%Y-%m-%d %H:%M')
    result['article'] = ' '.join([p.text.strip() for p in soup.select('.article p')[:-1]])
    # str.lstrip() strips a set of characters, not a prefix, so drop the label with replace()
    result['author'] = soup.select('.show_author')[0].text.strip().replace('责任编辑：', '', 1)
    result['commentsCount'] = commentsCount(newsurl)
    return result
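One detail is worth checking when removing the '责任编辑：' label from the editor line: str.lstrip treats its argument as a set of characters, not as a prefix, so it can also eat the start of a name. The sketch below illustrates this with a made-up editor string; strip_label is a hypothetical helper, not part of the original code.

```python
LABEL = '责任编辑：'

def strip_label(text, prefix):
    """Remove prefix from text only when text really starts with it."""
    text = text.strip()
    return text[len(prefix):] if text.startswith(prefix) else text

# Made-up sample: the name begins with 任, a character that also occurs in the label
sample = '责任编辑：任某'

print(sample.lstrip(LABEL))        # 某   (the leading 任 of the name is stripped too)
print(strip_label(sample, LABEL))  # 任某 (only the label is removed)
```

On Python 3.9 and later, str.removeprefix does the same job as the helper.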

(8) Fetching multiple articles: finding the URL of each article (step by step)

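1. To limit page size and loading time, developers often load content through dynamic pagination: as the user scrolls down the current page, further content appears dynamically. The links to all the articles should therefore be looked for among the JS-type requests in the developer tools.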

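2. Among the JS requests there is an item whose name starts with zt_list. The URL that returns a response to a GET request is http://api.roll.news.sina.com.cn/zt_list?channel=news&cat_1=gnxw&cat_2==gdxw1||=gatxw||=zs-pl||=mtjj&level==1||=2&show_ext=1&show_all=1&show_num=22&tag=1&format=json&page=2&callback=newsloadercallback&_=1534300543901. Note the parameter page=2, which indicates that this is the second page of content loaded while scrolling.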

3. Get the URLs of all the articles on one page of this API (the request below uses page=1):

newsURL = 'http://api.roll.news.sina.com.cn/zt_list?channel=news&cat_1=gnxw&cat_2==gdxw1||=gatxw||=zs-pl||=mtjj&level==1||=2&show_ext=1&show_all=1&show_num=22&tag=1&format=json&page=1&callback=newsloadercallback&_=1534300543901'
res = requests.get(newsURL)
# The response is JSONP, i.e. newsloadercallback({...});
# keep only the text between the outermost parentheses before parsing it as JSON
text = res.text
jd = json.loads(text[text.find('(') + 1:text.rfind(')')])
for news in jd['result']['data']:
    print(news['url'])
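The JSONP unwrapping step can be checked without calling the API. This is a minimal sketch under the assumption that the response has the shape newsloadercallback({...}); the sample payload and the example.com URL are invented for illustration, and unwrap_jsonp is a hypothetical helper name.

```python
import json

def unwrap_jsonp(text):
    """Parse the JSON found between the first '(' and the last ')' of a JSONP response."""
    start = text.find('(')
    end = text.rfind(')')
    if start == -1 or end <= start:
        raise ValueError('not a JSONP response')
    return json.loads(text[start + 1:end])

# Invented sample with the same nesting as the API response described above
sample = '  newsloadercallback({"result": {"data": [{"url": "http://example.com/a.shtml"}]}});'

jd = unwrap_jsonp(sample)
urls = [news['url'] for news in jd['result']['data']]
print(urls)  # ['http://example.com/a.shtml']
```

Slicing between the parentheses does not depend on the callback name, so the same helper would also work if the callback parameter in the URL changed.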

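4. The domestic news front page spans many pages, so the code above, which gets the URLs of all the articles on one page, can be written as a function: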

def getURL(newsURL):
    res = requests.get(newsURL)
    text = res.text
    jd = json.loads(text[text.find('(') + 1:text.rfind(')')])
    # Collect every URL; a return inside the loop would stop after the first article
    return [news['url'] for news in jd['result']['data']]

5. To crawl the full content of every article through these URLs, modify the function as follows:

def getURL(newsURL):
    news = []
    res = requests.get(newsURL)
    text = res.text
    jd = json.loads(text[text.find('(') + 1:text.rfind(')')])
    for new in jd['result']['data']:
        news.append(getNews(new['url']))
    return news

The list's append method collects the content of every article into one list.

6. Use a for loop to crawl all the articles on a range of pages:

newsURL = 'http://api.roll.news.sina.com.cn/zt_list?channel=news&cat_1=gnxw&cat_2==gdxw1||=gatxw||=zs-pl||=mtjj&level==1||=2&show_ext=1&show_all=1&show_num=22&tag=1&format=json&page={}&callback=newsloadercallback&_=1534300543901'
allNews = []
for i in range(1, 3):  # pages 1 and 2
    allNews.extend(getURL(newsURL.format(i)))
print(allNews)
print(len(allNews))

In the end, all the article content is available as one list.
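The reason the page loop uses extend rather than append can be shown with a stand-in for the page fetcher. In this sketch, fake_get_page and the example.com template are hypothetical placeholders that return invented items; nothing is fetched from Sina.

```python
def fake_get_page(page_url):
    """Hypothetical stand-in for getURL: returns a list of three article dicts per page."""
    page = int(page_url.rsplit('=', 1)[1])
    return [{'title': 'page {} item {}'.format(page, n)} for n in range(3)]

template = 'http://example.com/list?page={}'  # placeholder, not the real API URL

# extend adds the items of each page list, giving one flat list of articles
all_news = []
for i in range(1, 3):  # pages 1 and 2; the upper bound 3 is excluded
    all_news.extend(fake_get_page(template.format(i)))

# append would add each page list as a single element, giving a list of lists
nested = []
for i in range(1, 3):
    nested.append(fake_get_page(template.format(i)))

print(len(all_news))  # 6
print(len(nested))    # 2
```

A flat list of dicts is also the shape that converts most directly into a table with one row per article.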

(9) Fetching multiple articles: finding the URL of each article (complete program)

import requests
from bs4 import BeautifulSoup
import json
import re
from datetime import datetime

def commentsCount(newsurl):
    # Extract the news id from the article URL and query the comment API
    m = re.search(r'doc-i(.+)\.shtml', newsurl)
    newsid = m.group(1)
    commentsURL = 'http://comment5.news.sina.com.cn/page/info?version=1&format=json&channel=gn&newsid=comos-{}&group=undefined&compress=0&ie=utf-8&oe=utf-8&page=1&page_size=3&t_size=3&h_size=3&thread=1'
    res = requests.get(commentsURL.format(newsid))
    res.encoding = 'utf-8'
    jd = json.loads(res.text)
    return jd['result']['count']['total']

def getNews(newsurl):
    # Collect title, source, date, body, editor and comment count of one article
    result = {}
    res = requests.get(newsurl)
    res.encoding = 'utf-8'
    soup = BeautifulSoup(res.text, 'html.parser')
    result['title'] = soup.select('.main-title')[0].text
    try:
        result['source'] = soup.select('.date-source a')[0].text
    except IndexError:
        # No <a> under .date-source: fall back to the second <span>
        result['source'] = soup.select('.date-source span')[1].text
    timestr = soup.select('.date')[0].text
    dt = datetime.strptime(timestr, '%Y年%m月%d日 %H:%M')
    result['date'] = dt.strftime('%Y-%m-%d %H:%M')
    result['article'] = ' '.join([p.text.strip() for p in soup.select('.article p')[:-1]])
    # str.lstrip() strips a set of characters, not a prefix, so drop the label with replace()
    result['author'] = soup.select('.show_author')[0].text.strip().replace('责任编辑：', '', 1)
    result['commentsCount'] = commentsCount(newsurl)
    return result

def getURL(newsURL):
    # Fetch one list page (JSONP) and scrape every article it links to
    news = []
    res = requests.get(newsURL)
    text = res.text
    jd = json.loads(text[text.find('(') + 1:text.rfind(')')])
    for new in jd['result']['data']:
        news.append(getNews(new['url']))
    return news

newsURL = 'http://api.roll.news.sina.com.cn/zt_list?channel=news&cat_1=gnxw&cat_2==gdxw1||=gatxw||=zs-pl||=mtjj&level==1||=2&show_ext=1&show_all=1&show_num=22&tag=1&format=json&page={}&callback=newsloadercallback&_=1534300543901'
allNews = []
for i in range(1, 3):  # pages 1 and 2
    allNews.extend(getURL(newsURL.format(i)))
print(allNews)
print(len(allNews))

(10) Storing the scraped data in structured form in an Excel sheet

import pandas
df = pandas.DataFrame(allNews)
df.head(20)

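This requires the pandas module, a Python data analysis library (the name derives from "panel data" and is also a play on "Python data analysis"). The DataFrame constructor organizes the scraped articles into a table, and df.head(20) shows the first 20 rows.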

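df.to_excel('news.xlsx')

This saves the organized data to an Excel file. to_excel has to be called with an output path; 'news.xlsx' is only an example filename, and writing .xlsx files requires an Excel engine such as openpyxl to be installed.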
