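Hi everyone. I'm a newcomer to the internet industry, and I write this blog only to consolidate what I've learned. My skills are limited, so some mistakes are bound to slip in. If you spot anything wrong, I'd be grateful for your corrections!
Blog homepage: https://blog.csdn.net/weixin_52720197?spm=1018.2118.3001.5343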
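1. Import the packages
# Use the requests library to send the request
import requests
# fake_useragent generates a browser-like User-Agent string
from fake_useragent import UserAgent
# Regular expressions
import re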
2. Analyze the page and write the URL
# The address to request
url = 'https://www.qiushibaike.com/text/'
headers = {"User-Agent": UserAgent().chrome}
# Send the request with url and headers; resp holds the response
resp = requests.get(url, headers=headers)
print(resp.text)
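UserAgent() can raise an error in some environments, for example when the package is not installed or its browser data cannot be loaded. Below is a minimal sketch of one way to guard against that. The helper name build_headers and the fallback User-Agent string are assumptions made for illustration; they are not part of the original code.

```python
# Hypothetical fallback value; any realistic browser User-Agent string would do.
FALLBACK_UA = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
)


def build_headers(fallback_ua=FALLBACK_UA):
    """Return request headers, preferring fake_useragent when it works."""
    try:
        from fake_useragent import UserAgent
        user_agent = UserAgent().chrome
    except Exception:
        # Package missing or its data could not be loaded: use the fixed string.
        user_agent = fallback_ua
    return {"User-Agent": user_agent}


headers = build_headers()
print(headers)
```

The resulting dict can be passed to requests.get(url, headers=headers) exactly as before.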
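3. Extract the information with a regular expression
Right-click the page and choose Inspect.
Each content block contains two span tags, so we capture the first one.
contents = re.findall(r'<div class="content"><span>(.+)</span>', resp.text)
for info in contents:
    print(info)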
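Result:
After checking the data in the response, the pattern is adjusted to allow whitespace between the tags and before the text:
# Regular expression
contents = re.findall(r'<div class="content">\s*<span>\s*(.+)', resp.text)
# Append each match to duanzi.txt, separated by a blank line
with open('duanzi.txt', 'a', encoding='utf-8') as f:
    for info in contents:
        f.write(info + "\n\n")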
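To see why the two patterns above behave differently, here is a small offline sketch. The sample_html string is a hypothetical snippet written for this illustration; the real page markup may differ. It assumes the tags and the text sit on separate lines, which is the case the \s* version is meant to handle.

```python
import re

# Hypothetical markup with line breaks between the tags and the text.
sample_html = """<div class="content">
<span>

First joke text

</span>
</div>
<div class="content">
<span>

Second joke text

</span>
</div>"""

# First pattern: requires <div ...><span> to be adjacent, with the text and </span> on one line.
strict = re.findall(r'<div class="content"><span>(.+)</span>', sample_html)

# Second pattern: \s* skips whitespace and line breaks; (.+) then captures up to the end of that line.
loose = re.findall(r'<div class="content">\s*<span>\s*(.+)', sample_html)

print(strict)  # []
print(loose)   # ['First joke text', 'Second joke text']
```

Because . does not match a line break by default, (.+) stops at the end of the first text line, so any later lines inside the same span are not captured.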
4. Full code
# Use the requests library to send the request
import requests
from fake_useragent import UserAgent
import re
# The address to request
url = 'https://www.qiushibaike.com/text/'
headers = {"User-Agent": UserAgent().chrome}
# Send the request with url and headers; resp holds the response
resp = requests.get(url, headers=headers)
print(resp.text)
# Regular expression
contents = re.findall(r'<div class="content">\s*<span>\s*(.+)', resp.text)
# Append each match to duanzi.txt, separated by a blank line
with open('duanzi.txt', 'a', encoding='utf-8') as f:
    for info in contents:
        f.write(info + "\n\n")
Result: the extracted text is appended to duanzi.txt.
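The extraction and saving steps can also be split into small functions so they can be tried without hitting the site. The following is a sketch under stated assumptions: the function names extract_jokes and save_jokes, the sample HTML, and the <br/> cleanup are all added for illustration and are not in the original script. The cleanup assumes the captured line may contain <br/> tags.

```python
import re
import tempfile
from pathlib import Path

# Same pattern as in the script above.
CONTENT_RE = re.compile(r'<div class="content">\s*<span>\s*(.+)')
# Assumption: captured text may contain <br> or <br/> tags.
BR_RE = re.compile(r'<br\s*/?>')


def extract_jokes(html):
    """Return the captured text of each content block, with <br> tags turned into newlines."""
    return [BR_RE.sub("\n", raw).strip() for raw in CONTENT_RE.findall(html)]


def save_jokes(jokes, path):
    """Append each joke to the file at path, separated by a blank line."""
    with open(path, "a", encoding="utf-8") as f:
        for joke in jokes:
            f.write(joke + "\n\n")


# Hypothetical sample standing in for resp.text.
sample_html = '<div class="content">\n<span>\n\nline one<br/>line two\n\n</span>\n</div>'
jokes = extract_jokes(sample_html)

# Write to a throwaway directory so repeated runs do not accumulate output.
with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = Path(tmp_dir) / "duanzi.txt"
    save_jokes(jokes, out_path)
    saved_text = out_path.read_text(encoding="utf-8")

print(jokes)
print(saved_text)
```

In the real script, extract_jokes(resp.text) would replace the re.findall call, and save_jokes(jokes, 'duanzi.txt') would replace the with block.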