Web Scraping Study Notes: Scraping the Maoyan Movie Rankings
1 Analyzing the page
https://maoyan.com/board/4
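Clicking a page number changes the page URL: an offset query parameter appears.
From this we can tentatively infer that offset is an offset into the ranking: offset=0 on the first page, offset=10 on the second page, and so on.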
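The inferred mapping from page number to offset can be sketched as below. This is only an illustration: `page_url`, `BASE_URL`, and `PAGE_SIZE` are hypothetical names, and the page size of 10 is an assumption drawn from the offsets observed above.

```python
BASE_URL = 'https://maoyan.com/board/4'
PAGE_SIZE = 10  # assumption: each page lists 10 movies


def page_url(page):
    """Return the ranking URL for a 1-based page number (hypothetical helper)."""
    offset = (page - 1) * PAGE_SIZE
    return BASE_URL + '?offset=' + str(offset)


# Show the URLs for the first three pages
for page in (1, 2, 3):
    print(page, page_url(page))
```

If the inference holds, page 1 maps to offset=0 and page 2 to offset=10, matching what the browser shows.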
2 Fetching the full page
Code:
import requests


def get_one_page(url):
    # Send a browser-like User-Agent header with the request
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:76.0) Gecko/20100101 Firefox/76.0'
    }
    response = requests.get(url, headers=headers)
    # Return None for any non-200 response so the caller can detect the failure
    if response.status_code != 200:
        return None
    return response.text


print(get_one_page("https://maoyan.com/board/4"))
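3 Extracting with a regular expression
Press F12 to open the developer tools, and view the page source in the Network panel.
Note: do not read the source in the Elements tab, because that source may have been modified by JavaScript and can differ from the original response.
Look at the source of one entry:
A single movie corresponds to one dd node in the source.
Extract the content with the following regular expression: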
<dd>.*?board-index.*?>(.*?)</i>.*?data-src="(.*?)".*?name.*?a.*?>(.*?)</a>.*?star.*?>(.*?)</p>.*?releasetime.*?>(.*?)</p>.*?integer.*?>(.*?)</i>.*?fraction.*?>(.*?)</i>.*?</dd>
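To see which part of a dd node each capture group picks up, the pattern can be tried offline on a small hand-written fragment. The markup below is an assumption made up for illustration (the real page may differ in attributes and whitespace), and the poster URL is a placeholder.

```python
import re

# The same pattern as above, split across lines for readability
PATTERN = re.compile(
    '<dd>.*?board-index.*?>(.*?)</i>.*?data-src="(.*?)".*?name.*?a.*?>(.*?)</a>'
    '.*?star.*?>(.*?)</p>.*?releasetime.*?>(.*?)</p>'
    '.*?integer.*?>(.*?)</i>.*?fraction.*?>(.*?)</i>.*?</dd>',
    re.S,  # let "." match newlines, because one entry spans several lines
)

# Hypothetical markup for one entry, written by hand for this example
SAMPLE_HTML = '''
<dd>
  <i class="board-index board-index-1">1</i>
  <a href="/films/1" title="活着" class="image-link">
    <img data-src="https://example.com/poster.jpg" alt="活着" class="board-img" />
  </a>
  <p class="name"><a href="/films/1" title="活着">活着</a></p>
  <p class="star">
      主演:葛优,巩俐,牛犇
  </p>
  <p class="releasetime">上映时间:1994-05-17(法国)</p>
  <p class="score"><i class="integer">9.</i><i class="fraction">0</i></p>
</dd>
'''

matches = PATTERN.findall(SAMPLE_HTML)
for groups in matches:
    # Groups: rank, image URL, title, cast, release time, score integer, score fraction
    print([g.strip() for g in groups])
```

Every `.*?` is non-greedy, so each group stops at the first closing tag that follows it. This keeps one match from running into the next dd node.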
Code:
import requests
import re


def get_one_page(url):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:76.0) Gecko/20100101 Firefox/76.0'
    }
    response = requests.get(url, headers=headers)
    if response.status_code != 200:
        return None
    return response.text


def parse_one_page(html):
    # re.S lets "." match newlines, because each dd node spans several lines
    pattern = re.compile('<dd>.*?board-index.*?>(.*?)</i>.*?data-src="(.*?)".*?name.*?a.*?>(.*?)</a>.*?star.*?>(.*?)</p>.*?releasetime.*?>(.*?)</p>.*?integer.*?>(.*?)</i>.*?fraction.*?>(.*?)</i>.*?</dd>', re.S)
    items = re.findall(pattern, html)
    print(items)


# Fetch the first page and print the matched tuples
html = get_one_page("https://maoyan.com/board/4")
if html is not None:
    parse_one_page(html)
Output (one of the printed tuples):
('1', 'https://p0.meituan.net/movie/4c41068ef7608c1d4fbfbe6016e589f7204391.jpg@160w_220h_1e_1c', '活着', '\n 主演:葛优,巩俐,牛犇\n ', '上映时间:1994-05-17(法国)', '9.', '0'),
4 Cleaning up the data
import requests
import re
import json


def get_one_page(url):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:76.0) Gecko/20100101 Firefox/76.0'
    }
    response = requests.get(url, headers=headers)
    if response.status_code != 200:
        return None
    return response.text


def parse_one_page(html):
    pattern = re.compile('<dd>.*?board-index.*?>(.*?)</i>.*?data-src="(.*?)".*?name.*?a.*?>(.*?)</a>.*?star.*?>(.*?)</p>.*?releasetime.*?>(.*?)</p>.*?integer.*?>(.*?)</i>.*?fraction.*?>(.*?)</i>.*?</dd>', re.S)
    items = re.findall(pattern, html)
    # Turn each matched tuple into a dict
    for item in items:
        yield {
            'index': item[0],
            'image': item[1],
            'title': item[2].strip(),
            # Drop the 3-character "主演:" prefix
            'actor': item[3].strip()[3:] if len(item[3]) > 3 else '',
            # Drop the 5-character "上映时间:" prefix
            'time': item[4].strip()[5:] if len(item[4]) > 5 else '',
            # Join the integer and fraction parts, e.g. "9." + "0"
            'score': item[5].strip() + item[6].strip()
        }


# Fetch the first page and print each cleaned record as one JSON line
html = get_one_page("https://maoyan.com/board/4")
if html is not None:
    for item in parse_one_page(html):
        print(json.dumps(item, ensure_ascii=False))
Output:
{"index": "1", "image": "https://p0.meituan.net/movie/4c41068ef7608c1d4fbfbe6016e589f7204391.jpg@160w_220h_1e_1c", "title": "活着", "actor": "葛优,巩俐,牛犇", "time": "1994-05-17(法国)", "score": "9.0"}
{"index": "2", "image": "https://p0.meituan.net/movie/bcbe59fc51580317adf94537a61a1a26142090.jpg@160w_220h_1e_1c", "title": "钢琴家", "actor": "艾德里安·布洛迪,艾米莉娅·福克斯,米哈乌·热布罗夫斯基", "time": "2002-05-24(法国)", "score": "8.8"}
{"index": "3", "image": "https://p1.meituan.net/movie/f8e9d5a90224746d15dfdbd53d4fae3d209420.jpg@160w_220h_1e_1c", "title": "勇敢的心", "actor": "梅尔·吉布森,苏菲·玛索,帕特里克·麦高汉", "time": "1995-05-18(美国)", "score": "8.8"}
{"index": "4", "image": "https://p0.meituan.net/movie/85215b28d568ea8e2c97766edd95f890210522.jpg@160w_220h_1e_1c", "title": "阿飞正传", "actor": "张国荣,张曼玉,刘德华", "time": "2018-06-25", "score": "8.8"}
{"index": "5", "image": "https://p0.meituan.net/movie/86c5190ba1d1236093c13f2fe9ed8dd4150050.jpg@160w_220h_1e_1c", "title": "射雕英雄传之东成西就", "actor": "张国荣,梁朝伟,张学友", "time": "1993-02-05(中国香港)", "score": "8.8"}
{"index": "6", "image": "https://p0.meituan.net/movie/de1142a5dceb901eb939eb0bcfc2f88470909.jpg@160w_220h_1e_1c", "title": "爱·回家", "actor": "俞承豪,金艺芬,童孝熙", "time": "2002-04-05(韩国)", "score": "9.0"}
{"index": "7", "image": "https://p1.meituan.net/movie/05bc2f0ccf97aacfa64fcac4f237cf8082385.jpg@160w_220h_1e_1c", "title": "初恋这件小事", "actor": "马里奥·毛瑞尔,平采娜·乐维瑟派布恩,阿查拉那·阿瑞亚卫考", "time": "2012-06-05", "score": "8.8"}
{"index": "8", "image": "https://p1.meituan.net/movie/b607fba7513e7f15eab170aac1e1400d878112.jpg@160w_220h_1e_1c", "title": "泰坦尼克号", "actor": "莱昂纳多·迪卡普里奥,凯特·温丝莱特,比利·赞恩", "time": "1998-04-03", "score": "9.4"}
{"index": "9", "image": "https://p1.meituan.net/movie/a1634f4e49c8517ae0a3e4adcac6b0dc43994.jpg@160w_220h_1e_1c", "title": "迁徙的鸟", "actor": "雅克·贝汉,Philippe Labro", "time": "2001-12-12(法国)", "score": "9.0"}
{"index": "10", "image": "https://p0.meituan.net/movie/09658109acfea0e248a63932337d8e6a4268980.jpg@160w_220h_1e_1c", "title": "蝙蝠侠:黑暗骑士", "actor": "克里斯蒂安·贝尔,希斯·莱杰,阿伦·伊克哈特", "time": "2008-07-14(阿根廷)", "score": "9.3"}
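The cleaning step can be checked offline against the raw tuple shown in section 3. The sketch below is a variant rather than the code above: instead of slicing off a fixed number of characters, it removes a label only when the text actually starts with it. `strip_prefix` and `clean_item` are hypothetical helper names, and the two label strings are copied from the sample output in these notes; if the live page writes the colon differently, they would need adjusting.

```python
def strip_prefix(text, prefix):
    """Strip whitespace, then remove prefix if the text starts with it."""
    text = text.strip()
    return text[len(prefix):] if text.startswith(prefix) else text


def clean_item(item):
    """Convert one 7-field regex match into a dict (hypothetical helper)."""
    return {
        'index': item[0],
        'image': item[1],
        'title': item[2].strip(),
        'actor': strip_prefix(item[3], '主演:'),     # label assumed from the sample output
        'time': strip_prefix(item[4], '上映时间:'),  # label assumed from the sample output
        'score': item[5].strip() + item[6].strip(),
    }


# The raw tuple from the earlier output
raw = ('1', 'https://p0.meituan.net/movie/4c41068ef7608c1d4fbfbe6016e589f7204391.jpg@160w_220h_1e_1c',
       '活着', '\n 主演:葛优,巩俐,牛犇\n ', '上映时间:1994-05-17(法国)', '9.', '0')
print(clean_item(raw))
```

Checking the prefix first means a field without the expected label is kept whole instead of losing its first few characters.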
5 Scraping all pages and writing to a file
Complete code:
import requests
import re
import json
import time


def get_one_page(url):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:76.0) Gecko/20100101 Firefox/76.0'
    }
    response = requests.get(url, headers=headers)
    if response.status_code != 200:
        return None
    return response.text


def parse_one_page(html):
    pattern = re.compile('<dd>.*?board-index.*?>(.*?)</i>.*?data-src="(.*?)".*?name.*?a.*?>(.*?)</a>.*?star.*?>(.*?)</p>.*?releasetime.*?>(.*?)</p>.*?integer.*?>(.*?)</i>.*?fraction.*?>(.*?)</i>.*?</dd>', re.S)
    items = re.findall(pattern, html)
    # Turn each matched tuple into a dict
    for item in items:
        yield {
            'index': item[0],
            'image': item[1],
            'title': item[2].strip(),
            'actor': item[3].strip()[3:] if len(item[3]) > 3 else '',
            'time': item[4].strip()[5:] if len(item[4]) > 5 else '',
            'score': item[5].strip() + item[6].strip()
        }


# Append one record to the file as a single JSON line
def write_to_file(content):
    with open('result.txt', 'a', encoding='utf-8') as f:
        f.write(json.dumps(content, ensure_ascii=False) + '\n')


def main(offset):
    url = 'https://maoyan.com/board/4?offset=' + str(offset)
    html = get_one_page(url)
    # Skip this page if the request failed; re.findall would raise on None
    if html is None:
        return
    for item in parse_one_page(html):
        write_to_file(item)


if __name__ == '__main__':
    # Ten pages: offset 0, 10, ..., 90
    for i in range(10):
        main(offset=i * 10)
        # Pause for one second between requests
        time.sleep(1)
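Since write_to_file stores one JSON object per line, the results can be loaded back by parsing the file line by line. The sketch below shows that round trip on a temporary file. `read_results` is a hypothetical helper, and this version of `write_to_file` takes an extra `path` argument (an assumption made so the example does not touch the real result.txt).

```python
import json
import os
import tempfile


def write_to_file(content, path):
    """Append one record as a single JSON line (same format as above)."""
    with open(path, 'a', encoding='utf-8') as f:
        f.write(json.dumps(content, ensure_ascii=False) + '\n')


def read_results(path):
    """Parse every non-empty line of the file as one JSON object."""
    with open(path, encoding='utf-8') as f:
        return [json.loads(line) for line in f if line.strip()]


# Write two sample records to a temporary file, then read them back
path = os.path.join(tempfile.mkdtemp(), 'result.txt')
write_to_file({'index': '1', 'title': '活着', 'score': '9.0'}, path)
write_to_file({'index': '2', 'title': '钢琴家', 'score': '8.8'}, path)

movies = read_results(path)
print([m['title'] for m in movies])
```

Because the file is opened in append mode ('a'), running the scraper twice adds the same records again, so it may be worth deleting result.txt before a fresh run.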