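Learning Web Scraping
I reinstalled the OS on my computer today, and the main problem was that I hadn't brought a USB drive. I naively thought that sending my study notes to my alternate QQ account would solve it. I was wrong: everything is gone. So I'm going to try writing a blog instead.
Learning Python's re module: scraping quotes
Quick note: non-greedy mode -> “.*?”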
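To make the non-greedy note concrete, here is a minimal sketch comparing ".*" with ".*?" on a made-up string. The variable names and the sample text are my own assumptions, not something from the course.

```python
import re

# Made-up sample text, for illustration only
html = "<b>first</b> and <b>second</b>"

# Greedy: .* grabs as much as it can, so it runs on to the LAST </b>
greedy = re.findall(r"<b>(.*)</b>", html)

# Non-greedy: .*? stops at the FIRST </b> that lets the pattern match
lazy = re.findall(r"<b>(.*?)</b>", html)

print(greedy)
print(lazy)
```

The greedy version returns one long string that swallows the middle tags, while the non-greedy version returns the two pieces separately, which is usually what you want when scraping.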
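Installation
re is part of Python's standard library, so it needs no installation; only requests has to be installed:
pip install requests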
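Basics
lst = re.findall(pattern, string, flags)  # returns a list (named lst to avoid shadowing the built-in list)
it = re.finditer(pattern, string, flags)  # returns an iterator of match objects
s = re.search(pattern, string, flags)  # returns a match object for the first match only, or None
m = re.match(pattern, string, flags)  # matches only at the start of the string, roughly like prefixing the pattern with ^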
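A small sketch of how the four functions differ on the same input. The sample text and variable names are assumptions chosen for illustration.

```python
import re

text = "tel: 10086, alt: 10010"   # made-up sample
pattern = r"\d+"

all_numbers = re.findall(pattern, text)                  # list of strings
spans = [m.span() for m in re.finditer(pattern, text)]   # match objects carry positions
first = re.search(pattern, text)                         # first match anywhere in the string
at_start = re.match(pattern, text)                       # None: text does not START with a digit
at_start_ok = re.match(pattern, "10086 is a number")     # matches, the digits are at the start

print(all_numbers, spans)
print(first.group(), at_start, at_start_ok.group())
```

Because search and match return None when nothing matches, it is safer to check the result before calling .group() on it.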
Reading content from the iterator
for i in it:  # i is a match object
    print(i.group())
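The scraper further down uses named groups, written (?P<name>...), so here is a minimal sketch of reading them from the match objects that finditer yields. The text and the group names name and age are hypothetical.

```python
import re

text = "Alice=30; Bob=25"   # made-up sample
it = re.finditer(r"(?P<name>\w+)=(?P<age>\d+)", text)

rows = []
for m in it:
    # group() with no argument is the whole match; group("x") is one named group
    rows.append((m.group(), m.group("name"), m.group("age")))

print(rows)
```

m.groupdict() would return all the named groups of a match as a dictionary in one call.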
Precompiling the regex (for efficiency)
obj = re.compile(pattern, flags)  # pattern is the regular expression
ret = obj.findall(string)  # findall can be swapped for any of the matching functions above
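A minimal sketch of reusing one compiled pattern across several strings. The sample lines are made up.

```python
import re

obj = re.compile(r"\d+")   # flags, if any, are given here, e.g. re.compile(r"\d+", re.S)

lines = ["page 1 of 4", "no digits here", "page 12 of 40"]   # made-up sample

# The compiled object exposes the same methods: findall, finditer, search, match
results = [obj.findall(line) for line in lines]
first_hit = obj.search(lines[2])

print(results)
print(first_hit.group())
```

As far as I understand, the re module also caches recently compiled patterns internally, so the speed gain from compiling by hand is often modest; the clearer benefit is that the pattern is defined once and gets a name.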
Code:
import re
import requests

# Compile the pattern once, outside the page loop.
# If the three fields sit on different lines of the HTML, pass re.S as a flag so "." also matches newlines.
obj = re.compile(r'class="xlistju">(?P<s>.*?)</a>.*?class="views-field-field-oriwriter-value">(?P<writer>.*?)</a>.*?title="(?P<article>.*?)"')

headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36',
           'cookie':'SESSfe7dc110f2d82eeddc336c4b4a78ec53=u9gvh7ccpl0le17uvuo7r05kl3; __cfduid=d509e8975c2019c8310fe8dc2f71a96d11619961437; xqrclbr=70572; visited=1; _ga=GA1.2.1453129243.1619961439; _gid=GA1.2.1485613751.1619961439; has_js=1; xqrclm=; xqrcli=MTYxOTk2NTM5MiwxNDQwKjk2MCxXaW4zMixOZXRzY2FwZSw3MDU3MiwqNTI3Kg%3D%3D; Hm_lvt_99062fd40c87113b8be61ebc8113e7c2=1619961438,1619965394; Hm_lpvt_99062fd40c87113b8be61ebc8113e7c2=1619965394; Hm_cv_99062fd40c87113b8be61ebc8113e7c2=1*login*PC-0!1*version*PC',
           'Connection':'close'}

# Open the output file once; the with block closes it, so no f.close() is needed.
with open('rsentence.txt', 'a', encoding='utf-8') as f:
    for i in range(1, 5):  # pages 1 through 4
        url = f"https://www.mingyantong.com/article/%E9%BE%99%E6%97%8F?page={i}"
        resp = requests.get(url, headers=headers)
        print(resp.status_code)
        sen = resp.text
        ret = obj.finditer(sen)
        for se in ret:  # se is a match object with three named groups
            f.write(se.group('s').replace("<br/>", ""))
            f.write(" ")
            f.write(se.group('writer'))
            f.write(" ")
            f.write(se.group('article'))
            f.write("\n\n")
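To try the pattern without hitting the site, here is a sketch that runs it against a small hand-written HTML snippet. The snippet is entirely hypothetical (I am only guessing at the general shape of the page), and it deliberately puts each field on its own line to show what the re.S flag changes.

```python
import re

PATTERN = (r'class="xlistju">(?P<s>.*?)</a>.*?'
           r'class="views-field-field-oriwriter-value">(?P<writer>.*?)</a>.*?'
           r'title="(?P<article>.*?)"')

# Hypothetical markup, for illustration only; the real page may look different
sample = (
    '<a href="/ju/1" class="xlistju">Line one<br/>line two</a>\n'
    '<a href="/writer/1" class="views-field-field-oriwriter-value">Some Writer</a>\n'
    '<a href="/article/1" title="Some Book">Some Book</a>'
)

# Without re.S, "." does not match "\n", so the pattern cannot span the three lines
without_flag = re.search(PATTERN, sample)

# With re.S (DOTALL), "." matches newlines too
with_flag = re.search(PATTERN, sample, re.S)

print(without_flag)
print(with_flag.group('s').replace("<br/>", ""), with_flag.group('writer'), with_flag.group('article'))
```

If the real scraper prints a 200 status code but writes nothing to the file, a missing re.S flag is one thing worth checking.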
The code is rough, so please bear with me; I'm entirely self-taught. The Bilibili course I followed is https://www.bilibili.com/video/BV1i54y1h75W