Final output:
So how do we get this result?
1. Fetch the web page
import requests
import re
url = 'https://movie.douban.com/top250'  # URL to scrape
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.6261.95 Safari/537.36'
}
res = requests.get(url, headers=headers)  # send the request
print(res.text)
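Where do the headers come from?
Right-click the page and choose Inspect, open the Network tab, click any request in the list, and scroll to the bottom of its Headers pane. Your own User-Agent string is listed there.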
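To see what the User-Agent header changes, the sketch below compares the identifier that requests would send by default with the one set in headers. It builds the request without sending it, so nothing touches the network. That the site treats the default identifier differently from a browser one is an assumption here, not something this post verifies.

```python
import requests

url = 'https://movie.douban.com/top250'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/122.0.6261.95 Safari/537.36'
}

# The identifier requests uses when no User-Agent is supplied.
default_ua = requests.utils.default_user_agent()
print(default_ua)

# Build (but do not send) the request so its headers can be inspected.
prepared = requests.Request('GET', url, headers=headers).prepare()
sent_ua = prepared.headers['User-Agent']
print(sent_ua)
```

The first line printed starts with python-requests/, which is easy for a server to recognize as a script. The second is the browser string copied from the developer tools.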
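The response is a large blob of HTML source, which we process with the re module and regular expressions.
2. Process the source
Start with the title. Looking through the source, each movie title is wrapped in <span class="title"> and </span>, so we can write the following regex: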
obj = re.compile(r'<span class="title">(?P<name>.*?)</span>', re.S)
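As a quick check of what this pattern captures, here is a minimal sketch run against a hand-written snippet. The sample HTML is an assumption that imitates the page markup; it is not copied from the live site. It also shows why the full pattern later in this post requires the title to begin with a Chinese character: the page appears to use the same span for alternate titles.

```python
import re

# Hand-written sample imitating the markup (an assumption, not real page data).
sample_html = '''
<div class="hd">
    <span class="title">肖申克的救赎</span>
    <span class="title">&nbsp;/&nbsp;The Shawshank Redemption</span>
</div>
'''

# The simple pattern matches every span with class "title".
obj = re.compile(r'<span class="title">(?P<name>.*?)</span>', re.S)
names = [m.group('name') for m in obj.finditer(sample_html)]
print(names)

# Requiring a leading Chinese character skips the alternate-title span.
zh_obj = re.compile(r'<span class="title">(?P<name>[\u4e00-\u9fa5].*?)</span>', re.S)
zh_names = [m.group('name') for m in zh_obj.finditer(sample_html)]
print(zh_names)
```

The named group (?P<name>...) lets each match be read with m.group('name') instead of a positional index, which stays readable once more fields are added.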
The other fields are written on the same principle, which gives the final script:
import requests
import re
url = 'https://movie.douban.com/top250'  # URL to scrape
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.6261.95 Safari/537.36'
}
res = requests.get(url, headers=headers)  # send the request
obj = re.compile(r'<span class="title">(?P<name>[\u4e00-\u9fa5].*?)</span>.*?<div class="bd">.*?<p class="">(?P<dy>.*?)</p>.*?</div>.*?<span class="rating_num" property="v:average">(?P<percent>.*?)</span>', re.S)
ret = obj.finditer(res.text)
for match in ret:  # named 'match' so the built-in iter() is not shadowed
    info = match.group('name'), match.group('dy'), match.group('percent')
    print(info[0], info[1], f'Rating: {info[2]}')
    print('--------------------------------------------------------------------------------------------------------')
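The dy group captures raw HTML, so the printed text may still carry line breaks, indentation, <br> tags, and &nbsp; entities. The sketch below shows one possible way to flatten it into a single line. clean_text is a hypothetical helper, and the sample string is a hand-written assumption about what the captured fragment looks like.

```python
import html
import re


def clean_text(raw):
    """Hypothetical helper: flatten a captured HTML fragment into one line."""
    text = re.sub(r'<br\s*/?>', ' ', raw)  # replace line-break tags with spaces
    text = html.unescape(text)             # decode entities such as &nbsp;
    text = text.replace('\xa0', ' ')       # non-breaking spaces -> plain spaces
    return ' '.join(text.split())          # collapse runs of whitespace


# Hand-written sample imitating a captured 'dy' value (an assumption).
sample_dy = '''
                            导演: 弗兰克·德拉邦特&nbsp;&nbsp;&nbsp;主演: 蒂姆·罗宾斯<br>
                            1994&nbsp;/&nbsp;美国&nbsp;/&nbsp;犯罪 剧情
                        '''

cleaned = clean_text(sample_dy)
print(cleaned)
```

In the loop above, this would be applied as clean_text(match.group('dy')) before printing.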