Web Scraping Study Notes: Scraping the Maoyan Movie Rankings

黑面|书生

Category: Web scraping. Tags: python

Article link: https://blog.csdn.net/weixin_44533129/article/details/106105506


1 Analyzing the page

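The target page is https://maoyan.com/board/4
Clicking a page number changes the URL so that it carries an offset query parameter (the original screenshots of the address bar are not available).
From this we can tentatively infer that offset is a paging offset: offset=0 on the first page, offset=10 on the second page, and so on.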
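
As a minimal sketch of that inference, the page URLs can be generated from a page number. The helper name page_url and the default page_size of 10 are illustrative assumptions based on the offsets observed above, not something the site documents.

```python
BASE_URL = 'https://maoyan.com/board/4'

def page_url(page, page_size=10):
    """Return the board URL for a 1-based page number (hypothetical helper)."""
    # Page 1 maps to offset 0, page 2 to offset 10, and so on.
    offset = (page - 1) * page_size
    return f'{BASE_URL}?offset={offset}'

for page in (1, 2, 3):
    print(page_url(page))
```

If the inference about offset is right, each printed URL should open the corresponding page of the ranking in a browser.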

2 Fetching the full page

Code:

import requests

def get_one_page(url):
    # Send a browser-like User-Agent header with the request.
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:76.0) Gecko/20100101 Firefox/76.0'
    }

    response = requests.get(url, headers=headers)
    # Treat any non-200 response as a failed fetch.
    if response.status_code != 200:
        return None
    return response.text

print(get_one_page("https://maoyan.com/board/4"))

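3 Extracting with regular expressions

Press F12 to open the developer tools, then view the page source in the Network panel.
Note: do not read the source from the Elements tab, because that markup may have been modified by JavaScript and can differ from the original response. (The original screenshot is not available.)
Looking at the source of a single entry,
we can see that each movie corresponds to one dd node.
We extract the fields with a regular expression, shown below: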

<dd>.*?board-index.*?>(.*?)</i>.*?data-src="(.*?)".*?name.*?a.*?>(.*?)</a>.*?star.*?>(.*?)</p>.*?releasetime.*?>(.*?)</p>.*?integer.*?>(.*?)</i>.*?fraction.*?>(.*?)</i>.*?</dd>
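
To see what the seven capture groups pick up, the pattern can be tried on a small hand-written snippet. SAMPLE_HTML below is an assumption: it is modeled only on the class names the pattern refers to and on the first result shown in these notes, and the real page markup may differ.

```python
import re

# Hypothetical markup for one movie entry; only the class names come from the pattern.
SAMPLE_HTML = '''
<dd>
    <i class="board-index board-index-1">1</i>
    <a href="/films/1" class="image-link">
        <img data-src="https://p0.meituan.net/movie/4c41068ef7608c1d4fbfbe6016e589f7204391.jpg@160w_220h_1e_1c" class="board-img" />
    </a>
    <p class="name"><a href="/films/1">活着</a></p>
    <p class="star">
        主演：葛优,巩俐,牛犇
    </p>
    <p class="releasetime">上映时间：1994-05-17(法国)</p>
    <p class="score"><i class="integer">9.</i><i class="fraction">0</i></p>
</dd>
'''

# re.S lets '.' match newlines, so the lazy '.*?' gaps can span several lines.
pattern = re.compile(
    '<dd>.*?board-index.*?>(.*?)</i>.*?data-src="(.*?)".*?name.*?a.*?>(.*?)</a>'
    '.*?star.*?>(.*?)</p>.*?releasetime.*?>(.*?)</p>'
    '.*?integer.*?>(.*?)</i>.*?fraction.*?>(.*?)</i>.*?</dd>',
    re.S,
)

items = re.findall(pattern, SAMPLE_HTML)
for rank, image, title, star, release, integer, fraction in items:
    print(rank, title, star.strip(), release, integer + fraction)
```

Each match is a 7-tuple: rank, poster URL, title, cast line, release line, and the integer and fractional parts of the score. The cast text still carries surrounding whitespace and its label prefix at this stage.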

Code:

import requests
import re

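def get_one_page(url):
    # Send a browser-like User-Agent header with the request.
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:76.0) Gecko/20100101 Firefox/76.0'
    }

    response = requests.get(url, headers=headers)
    # Treat any non-200 response as a failed fetch.
    if response.status_code != 200:
        return None
    return response.text

def parse_one_page(html):
    # re.S lets '.' match newlines, so one pattern can span a whole <dd> block.
    pattern = re.compile('<dd>.*?board-index.*?>(.*?)</i>.*?data-src="(.*?)".*?name.*?a.*?>(.*?)</a>.*?star.*?>(.*?)</p>.*?releasetime.*?>(.*?)</p>.*?integer.*?>(.*?)</i>.*?fraction.*?>(.*?)</i>.*?</dd>', re.S)
    items = re.findall(pattern, html)
    print(items)

html = get_one_page("https://maoyan.com/board/4")
if html is not None:
    parse_one_page(html)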

Output:

('1', 'https://p0.meituan.net/movie/4c41068ef7608c1d4fbfbe6016e589f7204391.jpg@160w_220h_1e_1c', '活着', '\n                主演：葛优,巩俐,牛犇\n        ', '上映时间：1994-05-17(法国)', '9.', '0'),

Cleaning up the data:

import requests
import re
import json
import time

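def get_one_page(url):
    # Send a browser-like User-Agent header with the request.
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:76.0) Gecko/20100101 Firefox/76.0'
    }

    response = requests.get(url, headers=headers)
    # Treat any non-200 response as a failed fetch.
    if response.status_code != 200:
        return None
    return response.text

def parse_one_page(html):
    pattern = re.compile('<dd>.*?board-index.*?>(.*?)</i>.*?data-src="(.*?)".*?name.*?a.*?>(.*?)</a>.*?star.*?>(.*?)</p>.*?releasetime.*?>(.*?)</p>.*?integer.*?>(.*?)</i>.*?fraction.*?>(.*?)</i>.*?</dd>', re.S)
    items = re.findall(pattern, html)
    # Clean up the data: one dict per movie
    for item in items:
        yield {
            'index': item[0],
            'image': item[1],
            'title': item[2].strip(),
            # Drop the 3-character label "主演：" in front of the cast list.
            'actor': item[3].strip()[3:] if len(item[3]) > 3 else '',
            # Drop the 5-character label "上映时间：" in front of the release date.
            'time': item[4].strip()[5:] if len(item[4]) > 5 else '',
            # The score is split into an integer part ("9.") and a fraction ("0").
            'score': item[5].strip() + item[6].strip()
        }

html = get_one_page("https://maoyan.com/board/4")
if html is not None:
    for item in parse_one_page(html):
        print(json.dumps(item, ensure_ascii=False))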

Output:

{"index": "1", "image": "https://p0.meituan.net/movie/4c41068ef7608c1d4fbfbe6016e589f7204391.jpg@160w_220h_1e_1c", "title": "活着", "actor": "葛优,巩俐,牛犇", "time": "1994-05-17(法国)", "score": "9.0"}
{"index": "2", "image": "https://p0.meituan.net/movie/bcbe59fc51580317adf94537a61a1a26142090.jpg@160w_220h_1e_1c", "title": "钢琴家", "actor": "艾德里安·布洛迪,艾米莉娅·福克斯,米哈乌·热布罗夫斯基", "time": "2002-05-24(法国)", "score": "8.8"}
{"index": "3", "image": "https://p1.meituan.net/movie/f8e9d5a90224746d15dfdbd53d4fae3d209420.jpg@160w_220h_1e_1c", "title": "勇敢的心", "actor": "梅尔·吉布森,苏菲·玛索,帕特里克·麦高汉", "time": "1995-05-18(美国)", "score": "8.8"}
{"index": "4", "image": "https://p0.meituan.net/movie/85215b28d568ea8e2c97766edd95f890210522.jpg@160w_220h_1e_1c", "title": "阿飞正传", "actor": "张国荣,张曼玉,刘德华", "time": "2018-06-25", "score": "8.8"}
{"index": "5", "image": "https://p0.meituan.net/movie/86c5190ba1d1236093c13f2fe9ed8dd4150050.jpg@160w_220h_1e_1c", "title": "射雕英雄传之东成西就", "actor": "张国荣,梁朝伟,张学友", "time": "1993-02-05(中国香港)", "score": "8.8"}
{"index": "6", "image": "https://p0.meituan.net/movie/de1142a5dceb901eb939eb0bcfc2f88470909.jpg@160w_220h_1e_1c", "title": "爱·回家", "actor": "俞承豪,金艺芬,童孝熙", "time": "2002-04-05(韩国)", "score": "9.0"}
{"index": "7", "image": "https://p1.meituan.net/movie/05bc2f0ccf97aacfa64fcac4f237cf8082385.jpg@160w_220h_1e_1c", "title": "初恋这件小事", "actor": "马里奥·毛瑞尔,平采娜·乐维瑟派布恩,阿查拉那·阿瑞亚卫考", "time": "2012-06-05", "score": "8.8"}
{"index": "8", "image": "https://p1.meituan.net/movie/b607fba7513e7f15eab170aac1e1400d878112.jpg@160w_220h_1e_1c", "title": "泰坦尼克号", "actor": "莱昂纳多·迪卡普里奥,凯特·温丝莱特,比利·赞恩", "time": "1998-04-03", "score": "9.4"}
{"index": "9", "image": "https://p1.meituan.net/movie/a1634f4e49c8517ae0a3e4adcac6b0dc43994.jpg@160w_220h_1e_1c", "title": "迁徙的鸟", "actor": "雅克·贝汉,Philippe Labro", "time": "2001-12-12(法国)", "score": "9.0"}
{"index": "10", "image": "https://p0.meituan.net/movie/09658109acfea0e248a63932337d8e6a4268980.jpg@160w_220h_1e_1c", "title": "蝙蝠侠：黑暗骑士", "actor": "克里斯蒂安·贝尔,希斯·莱杰,阿伦·伊克哈特", "time": "2008-07-14(阿根廷)", "score": "9.3"}
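
The cleanup step can be checked in isolation on the raw tuple shown earlier. The sketch below is a variant, not the code used in these notes: instead of slicing a fixed number of characters, the hypothetical helper strip_prefix removes the label only when it is actually present, and leaves the text untouched otherwise. The names strip_prefix and clean_item are illustrative.

```python
def strip_prefix(text, prefix):
    """Trim whitespace, then remove prefix only if the text starts with it."""
    text = text.strip()
    return text[len(prefix):] if text.startswith(prefix) else text

def clean_item(item):
    """Turn one raw 7-tuple from the regex into a tidy dict (hypothetical helper)."""
    return {
        'index': item[0],
        'image': item[1],
        'title': item[2].strip(),
        'actor': strip_prefix(item[3], '主演：'),
        'time': strip_prefix(item[4], '上映时间：'),
        # Join the integer part ("9.") and the fraction ("0") into "9.0".
        'score': item[5].strip() + item[6].strip(),
    }

# The raw tuple printed by the regex step for the first movie.
raw = ('1',
       'https://p0.meituan.net/movie/4c41068ef7608c1d4fbfbe6016e589f7204391.jpg@160w_220h_1e_1c',
       '活着',
       '\n                主演：葛优,巩俐,牛犇\n        ',
       '上映时间：1994-05-17(法国)',
       '9.',
       '0')

cleaned = clean_item(raw)
print(cleaned)
```

For this input the result matches the first output line above. The prefix check mainly matters if an entry ever lacks the label, in which case fixed-length slicing would cut off real data.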

5 Crawling all pages and writing to a file

Complete code:

import requests
import re
import json
import time

def get_one_page(url):
    # Send a browser-like User-Agent header with the request.
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:76.0) Gecko/20100101 Firefox/76.0'
    }

    response = requests.get(url, headers=headers)
    # Treat any non-200 response as a failed fetch.
    if response.status_code != 200:
        return None
    return response.text

def parse_one_page(html):
    pattern = re.compile('<dd>.*?board-index.*?>(.*?)</i>.*?data-src="(.*?)".*?name.*?a.*?>(.*?)</a>.*?star.*?>(.*?)</p>.*?releasetime.*?>(.*?)</p>.*?integer.*?>(.*?)</i>.*?fraction.*?>(.*?)</i>.*?</dd>', re.S)
    items = re.findall(pattern, html)
    # Clean up the data: one dict per movie
    for item in items:
        yield {
            'index': item[0],
            'image': item[1],
            'title': item[2].strip(),
            # Drop the 3-character label "主演：" in front of the cast list.
            'actor': item[3].strip()[3:] if len(item[3]) > 3 else '',
            # Drop the 5-character label "上映时间：" in front of the release date.
            'time': item[4].strip()[5:] if len(item[4]) > 5 else '',
            'score': item[5].strip() + item[6].strip()
        }

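# Append one record per line to the output file, as JSON
def write_to_file(content):
    with open('result.txt', 'a', encoding='utf-8') as f:
        # ensure_ascii=False keeps the Chinese text readable in the file.
        f.write(json.dumps(content, ensure_ascii=False) + '\n')

def main(offset):
    url = 'https://maoyan.com/board/4?offset=' + str(offset)
    html = get_one_page(url)
    # Skip this page if the request failed; re.findall cannot search None.
    if html is None:
        return
    for item in parse_one_page(html):
        write_to_file(item)

if __name__ == '__main__':
    # Request offsets 0, 10, ..., 90.
    for i in range(10):
        main(offset=i*10)
        # Pause between requests so the site is not hit too quickly.
        time.sleep(1)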
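
Because the file holds one JSON object per line, it can be loaded back line by line. The following is a small sketch under that assumption; read_results is a hypothetical helper, and the demo writes two sample records to a temporary file in the same one-object-per-line format so that it runs without touching the network or the real result.txt.

```python
import json
import os
import tempfile

def read_results(path):
    """Load one dict per non-empty line from a JSON-lines file (hypothetical helper)."""
    with open(path, encoding='utf-8') as f:
        return [json.loads(line) for line in f if line.strip()]

# Demo data: two abbreviated records, written one JSON object per line.
records = [
    {'index': '1', 'title': '活着', 'score': '9.0'},
    {'index': '2', 'title': '钢琴家', 'score': '8.8'},
]
demo_path = os.path.join(tempfile.mkdtemp(), 'result.txt')
with open(demo_path, 'a', encoding='utf-8') as f:
    for record in records:
        f.write(json.dumps(record, ensure_ascii=False) + '\n')

loaded = read_results(demo_path)
for movie in loaded:
    print(movie['index'], movie['title'], movie['score'])
```

Note that the crawler opens result.txt in append mode, so running it twice leaves duplicate lines in the file; deleting the file before a fresh run avoids that.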
