Web Scraping Study Notes, Day 7
Case study: Qiushibaike (糗事百科)
Modules used:
re, requests, fake_useragent
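Approach: fetch the page first, then use a regular expression to pull out the content we need.
The key step is finding where that content sits in the HTML. The pattern is as follows: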
infos = re.findall(r'<div class="content">\s*<span>\s*(.+)\s*</span>',info)
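Note:
\s matches any whitespace character (spaces, tabs, and newlines), and \s* matches zero or more of them, so the pattern skips the line breaks and indentation around <span>. The "\n\n\n" in the storage code is not part of the regex: it writes three newline characters after each entry, which leaves two blank lines between entries in the file.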
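To see what \s* does in practice, here is a minimal sketch that runs the same pattern against a small made-up HTML fragment. The `sample_html` string is a hypothetical stand-in for the real page, written only to imitate the structure the pattern expects.

```python
import re

# Hypothetical HTML fragment (not copied from the real site); it only imitates
# the <div class="content"><span>...</span> structure the pattern looks for.
sample_html = """
<div class="content">
<span>


First joke text.

</span>
</div>
<div class="content">
<span>Second joke text.</span>
</div>
"""

pattern = r'<div class="content">\s*<span>\s*(.+)\s*</span>'

# \s* absorbs the newlines around <span>, so both layouts match.
# (.+) does not cross line breaks, so it captures a single line of text.
matches = re.findall(pattern, sample_html)
print(matches)
```

Because `.` does not match a newline by default, this pattern only captures entries whose text sits on one line; multi-line entries would need a different pattern.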
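I don't fully understand the storage code at the end yet, but it works as-is: it opens duanzi.txt for writing with UTF-8 encoding and writes each match followed by three newlines.
The complete code is as follows: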
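The storage step is easier to follow when tried on its own. Below is a small sketch that writes a hypothetical list `jokes` (standing in for the regex results) in the same way and then reads the file back. It uses a temporary directory, which is my own choice for the demo so that no file is left behind.

```python
import os
import tempfile

# Hypothetical stand-in for the list returned by re.findall.
jokes = ["first joke", "second joke"]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "duanzi.txt")

    # "w" creates the file (or overwrites an existing one);
    # encoding="utf-8" makes sure Chinese text is saved correctly.
    # The with-statement closes the file automatically when the block ends.
    with open(path, "w", encoding="utf-8") as f:
        for joke in jokes:
            # Each entry is followed by three newline characters.
            f.write(joke + "\n\n\n")

    # Read the file back to check what was written.
    with open(path, encoding="utf-8") as f:
        saved = f.read()

print(repr(saved))
```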
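import re
from fake_useragent import UserAgent
import requests

url = "https://www.qiushibaike.com/text/"
headers = {
    # Random User-Agent string so the request looks like it comes from a browser
    "User-Agent": UserAgent().random
}
# Send the request and get the page HTML
response = requests.get(url, headers=headers)
info = response.text
# Capture the text inside each <div class="content"><span>...</span>
infos = re.findall(r'<div class="content">\s*<span>\s*(.+)\s*</span>', info)
# Save every match to duanzi.txt, with blank lines between entries
with open('duanzi.txt', 'w', encoding="utf-8") as f:
    for joke in infos:
        f.write(joke + "\n\n\n")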
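
One possible follow-up, sketched under an assumption: the captured text may still contain inline <br> or <br/> tags, since the pattern grabs whatever sits inside <span>. I have not confirmed that the page actually does this. The helper `clean_joke` below is a hypothetical name of my own, not part of the original script.

```python
import re

def clean_joke(raw):
    """Replace <br> / <br/> tags with newlines and trim surrounding whitespace.

    Hypothetical helper; it assumes the captured text may contain <br> tags.
    """
    text = re.sub(r'<br\s*/?>', '\n', raw)
    return text.strip()

cleaned = clean_joke("  line one<br/>line two<br>line three  ")
print(cleaned)
```

If that assumption holds, it could be applied just before writing, for example `f.write(clean_joke(joke) + "\n\n\n")`.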