Third-Party Libraries for Python Web Crawling: [notes] Python Web Crawler 1 (Based on the Third-Party Library requests)

Latest recommended article published 2023-06-23 09:30:00

weixin_39882870

Article tags: third-party libraries for Python web crawling

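This code is a Python crawler that collects all news published on the People's Data site (data.people.com.cn) from January 1 to February 29, 2020. It starts from an initial URL and then loops one issue page at a time, extracting the issue date and each article's title, author, and body, and appending them to the file 'crawled news.txt'. The crawler uses the requests library for HTTP requests and regular expressions to parse the HTML.

Summary generated automatically by CSDN.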
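The page-by-page loop described in the summary can be sketched on its own, separate from the parsing. This is a minimal sketch under stated assumptions, not the author's code: `iter_issue_pages`, `NEXT_LINK`, and the `fetch` parameter are hypothetical names, and the markup of the "next issue" link is assumed to be a plain anchor, since the post does not show it.

```python
import re
from typing import Callable, Iterator, Tuple

# Assumption: the "next issue" link is a plain <a href="...">下一期</a> anchor.
# The real markup is not shown in the post, so adjust this to the page source.
NEXT_LINK = re.compile(r'<a href="([^"]+)">下一期</a>')


def iter_issue_pages(
    start_url: str,
    fetch: Callable[[str], str],
    base: str = 'http://data.people.com.cn',
    max_pages: int = 100,
) -> Iterator[Tuple[str, str]]:
    """Yield (url, html) for each issue page, following next-issue links.

    fetch is any function that maps a URL to its HTML text.
    max_pages is a safety cap so a bad link cannot loop forever.
    """
    url = start_url
    for _ in range(max_pages):
        html = fetch(url)
        yield url, html
        match = NEXT_LINK.search(html)
        if match is None:
            return  # no next-issue link: this was the last page
        url = base + match.group(1)


# Offline demonstration with two fake pages instead of real HTTP requests.
fake_pages = {
    'http://data.people.com.cn/rmrb/20200101/1':
        '<a href="/rmrb/20200102/1">下一期</a>',
    'http://data.people.com.cn/rmrb/20200102/1': 'no further link',
}
visited = [url for url, _ in iter_issue_pages(
    'http://data.people.com.cn/rmrb/20200101/1', fake_pages.__getitem__)]
print(visited)
```

For a live crawl, `fetch` could be a small wrapper around `requests.get(url, headers=headers, timeout=30)` that sets the response encoding and returns `response.text`. Passing the fetcher in as a parameter keeps the loop testable without network access.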

Code

import re

import requests

# Crawl all news on the People's Data site (data.people.com.cn)
# from 2020-01-01 through 2020-02-29.
# The start URL points to the 2020-01-01 page.
# Note: the HTML tags inside the regular expressions below did not survive
# in this listing; each affected pattern is marked "tags missing".
url_cur = 'http://data.people.com.cn/rmrb/20200101/1'
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/29.0.1547.59 Safari/537.36"}

while True:
    # Without a browser-like User-Agent header the server answers 403 Forbidden.
    response = requests.get(url_cur, headers=headers, timeout=30)
    response.encoding = 'utf-8'
    html = response.text

    # Date line: the date comes from the URL, the weekday from the page.
    date = url_cur.split('/')[-2]
    weekday = re.findall(r'(.*?)', html)[0]  # tags missing
    date_line = date[:4] + '年' + date[4:6] + '月' + date[6:] + '日' + '，' + weekday

    # Collect the day's article entries; the loop below expects pairs
    # whose second element is the article href.
    article_info = re.findall(r'.*?', html)  # tags missing
    num_article = len(article_info)

    # Look for the "next issue" (下一期) link; None marks the last page.
    try:
        url_next = re.findall(r'下一期', html)[0]  # tags missing
        url_next = ''.join(['http://data.people.com.cn', url_next])
    except IndexError:
        url_next = None

    # Append to the output file; the with block closes it after each day.
    with open('crawled news.txt', 'a', encoding='utf-8') as f:
        f.write(date_line)
        f.write('\n')

        # Visit every article of the day and extract its content.
        for idx, (_, href) in enumerate(article_info):
            article_url = ''.join([url_cur, '/', href.split('/')[-1]])

            # Send the HTTP request for the article page.
            article_response = requests.get(article_url, headers=headers, timeout=30)
            article_response.encoding = 'utf-8'
            article_content = article_response.text

            # Match the title, author, and body.
            title = re.findall(r'(.*?)', article_content)[0]  # tags missing
            try:
                author = re.findall(r'(.*?)', article_content)[0]  # tags missing
            except IndexError:
                author = 'UNKNOWN'
            # Without re.S nothing matches, because '.' does not cross newlines by default.
            body_list = re.findall(r'(.*?)', article_content, re.S)  # tags missing

            # Data cleaning: join the paragraphs (the delimiter strings that
            # wrapped them are missing), strip whitespace, and turn double
            # curly quotes into single curly quotes.
            body = ''.join(body_list[1:])
            for ch in (' ', '\u3000', '\r', '\n', '\t'):
                body = body.replace(ch, '')
            body = body.replace('”', '’')
            body = body.replace('“', '‘')

            f.write('[{}/{}]'.format(idx + 1, num_article))
            f.write(title)
            f.write(author)
            f.write(body)
            f.write('\n')

    # Stop after the last page; otherwise jump to the next day's page.
    if url_next is None:
        break
    url_cur = url_next

print('done!')
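The date line and the data-cleaning step in the listing could be factored into small helpers that are easy to test in isolation. The sketch below is one possible arrangement, not the author's code: `date_from_url` and `clean_body` are hypothetical names, and the cleaning table covers only the whitespace and quote replacements that are legible in the listing.

```python
def date_from_url(url: str) -> str:
    """Build the Chinese date string from an issue URL.

    Assumes the URL ends in .../<YYYYMMDD>/<page>, as the start URL does.
    """
    date = url.rstrip('/').split('/')[-2]
    return '{}年{}月{}日'.format(date[:4], date[4:6], date[6:])


# One translation table replaces the chain of str.replace calls:
# whitespace characters map to None (deleted), and double curly quotes
# map to single curly quotes.
_CLEAN_TABLE = str.maketrans({
    ' ': None,
    '\u3000': None,  # full-width (ideographic) space
    '\r': None,
    '\n': None,
    '\t': None,
    '”': '’',
    '“': '‘',
})


def clean_body(text: str) -> str:
    """Strip whitespace and normalize quotes in an article body."""
    return text.translate(_CLEAN_TABLE)


print(date_from_url('http://data.people.com.cn/rmrb/20200101/1'))
print(clean_body('\u3000“你 好”\r\n\t'))
```

Using `str.translate` walks the text once instead of once per replacement, which matters little for a single article but keeps all cleaning rules in one place.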
