Web Scraping Study Notes: Scraping the Maoyan Movie Rankings

1 Analyzing the page

https://maoyan.com/board/4
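Clicking a page number on this page changes the URL to, for example:
https://maoyan.com/board/4?offset=10
From this we can infer that offset is a record offset: offset=0 for the first page, offset=10 for the second page, and so on.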
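
The relationship between page number and offset can be written down as a small helper. This is a minimal sketch: the names BASE_URL, PAGE_SIZE and page_url are made up for illustration, and the page size of 10 is an assumption inferred from the offsets observed above.

```python
BASE_URL = "https://maoyan.com/board/4"
PAGE_SIZE = 10  # assumption: ten movies per page, inferred from offset=10 on page 2


def page_url(page):
    """Return the URL for a 1-based page number."""
    if page < 1:
        raise ValueError("page numbers start at 1")
    offset = (page - 1) * PAGE_SIZE
    return f"{BASE_URL}?offset={offset}"


# Print the URLs of the first three pages
for page in (1, 2, 3):
    print(page, page_url(page))
```

Keeping the page size in one constant means only one line changes if the site ever shows a different number of entries per page.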

2 Fetching the full page

Code:

import requests

def get_one_page(url):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:76.0) Gecko/20100101 Firefox/76.0'
    }

    response = requests.get(url, headers=headers)
    # Treat any non-200 status as a failed fetch
    if response.status_code != 200:
        return None
    return response.text

print(get_one_page("https://maoyan.com/board/4"))

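3 Extracting with a regular expression

Press F12 to open the developer tools, then view the page source in the Network panel.
Note: do not read the source from the Elements tab, because the markup shown there may have been modified by JavaScript and can differ from the original response.
Look at the source of one entry:
The source for each movie is a single dd node.
Extract the content with the following regular expression: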

<dd>.*?board-index.*?>(.*?)</i>.*?data-src="(.*?)".*?name.*?a.*?>(.*?)</a>.*?star.*?>(.*?)</p>.*?releasetime.*?>(.*?)</p>.*?integer.*?>(.*?)</i>.*?fraction.*?>(.*?)</i>.*?</dd>
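
To see what each of the seven capture groups picks up, the expression can be tried on a hand-written fragment. The dd markup below is a hypothetical simplification written for this illustration, not markup copied from the site; only the class names that the expression looks for are kept. The re.S flag matters here because it lets . match newlines, so one match can span the whole multi-line dd node.

```python
import re

# Hypothetical, simplified <dd> fragment written for illustration only.
SAMPLE_HTML = """
<dd>
    <i class="board-index board-index-1">1</i>
    <a href="/films/0000" class="image-link">
        <img data-src="https://example.com/poster.jpg" class="board-img" />
    </a>
    <p class="name"><a href="/films/0000">Example Movie</a></p>
    <p class="star">
        主演:Actor A,Actor B
    </p>
    <p class="releasetime">上映时间:2000-01-01</p>
    <p class="score"><i class="integer">9.</i><i class="fraction">5</i></p>
</dd>
"""

# The same expression as above, split into one piece per capture group.
PATTERN = re.compile(
    r'<dd>.*?board-index.*?>(.*?)</i>'   # 1: rank
    r'.*?data-src="(.*?)"'               # 2: poster URL
    r'.*?name.*?a.*?>(.*?)</a>'          # 3: title
    r'.*?star.*?>(.*?)</p>'              # 4: cast, still with label and whitespace
    r'.*?releasetime.*?>(.*?)</p>'       # 5: release date, still with label
    r'.*?integer.*?>(.*?)</i>'           # 6: integer part of the score
    r'.*?fraction.*?>(.*?)</i>'          # 7: fractional part of the score
    r'.*?</dd>',
    re.S,
)

matches = PATTERN.findall(SAMPLE_HTML)
for number, field in enumerate(matches[0], start=1):
    print(number, repr(field.strip()))
```

Because findall returns one tuple per dd node, a full page yields a list of such tuples.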

Code:

import requests
import re

def get_one_page(url):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:76.0) Gecko/20100101 Firefox/76.0'
    }

    response = requests.get(url, headers=headers)
    if response.status_code != 200:
        return None
    return response.text

def parse_one_page(html):
    pattern = re.compile('<dd>.*?board-index.*?>(.*?)</i>.*?data-src="(.*?)".*?name.*?a.*?>(.*?)</a>.*?star.*?>(.*?)</p>.*?releasetime.*?>(.*?)</p>.*?integer.*?>(.*?)</i>.*?fraction.*?>(.*?)</i>.*?</dd>', re.S)
    items = re.findall(pattern, html)
    print(items)

Output:

('1', 'https://p0.meituan.net/movie/4c41068ef7608c1d4fbfbe6016e589f7204391.jpg@160w_220h_1e_1c', '活着', '\n                主演:葛优,巩俐,牛犇\n        ', '上映时间:1994-05-17(法国)', '9.', '0'),

Cleaning up the data:

import requests
import re
import json
import time

def get_one_page(url):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:76.0) Gecko/20100101 Firefox/76.0'
    }

    response = requests.get(url, headers=headers)
    if response.status_code != 200:
        return None
    return response.text

def parse_one_page(html):
    pattern = re.compile('<dd>.*?board-index.*?>(.*?)</i>.*?data-src="(.*?)".*?name.*?a.*?>(.*?)</a>.*?star.*?>(.*?)</p>.*?releasetime.*?>(.*?)</p>.*?integer.*?>(.*?)</i>.*?fraction.*?>(.*?)</i>.*?</dd>', re.S)
    items = re.findall(pattern, html)
    print(items)
    # Clean up the data
    for item in items:
        yield {
            'index': item[0],
            'image': item[1],
            'title': item[2].strip(),
            'actor': item[3].strip()[3:] if len(item[3]) > 3 else '',
            'time': item[4].strip()[5:] if len(item[4]) > 5 else '',
            'score': item[5].strip() + item[6].strip()
        }

Output:

{"index": "1", "image": "https://p0.meituan.net/movie/4c41068ef7608c1d4fbfbe6016e589f7204391.jpg@160w_220h_1e_1c", "title": "活着", "actor": "葛优,巩俐,牛犇", "time": "1994-05-17(法国)", "score": "9.0"}
{"index": "2", "image": "https://p0.meituan.net/movie/bcbe59fc51580317adf94537a61a1a26142090.jpg@160w_220h_1e_1c", "title": "钢琴家", "actor": "艾德里安·布洛迪,艾米莉娅·福克斯,米哈乌·热布罗夫斯基", "time": "2002-05-24(法国)", "score": "8.8"}
{"index": "3", "image": "https://p1.meituan.net/movie/f8e9d5a90224746d15dfdbd53d4fae3d209420.jpg@160w_220h_1e_1c", "title": "勇敢的心", "actor": "梅尔·吉布森,苏菲·玛索,帕特里克·麦高汉", "time": "1995-05-18(美国)", "score": "8.8"}
{"index": "4", "image": "https://p0.meituan.net/movie/85215b28d568ea8e2c97766edd95f890210522.jpg@160w_220h_1e_1c", "title": "阿飞正传", "actor": "张国荣,张曼玉,刘德华", "time": "2018-06-25", "score": "8.8"}
{"index": "5", "image": "https://p0.meituan.net/movie/86c5190ba1d1236093c13f2fe9ed8dd4150050.jpg@160w_220h_1e_1c", "title": "射雕英雄传之东成西就", "actor": "张国荣,梁朝伟,张学友", "time": "1993-02-05(中国香港)", "score": "8.8"}
{"index": "6", "image": "https://p0.meituan.net/movie/de1142a5dceb901eb939eb0bcfc2f88470909.jpg@160w_220h_1e_1c", "title": "爱·回家", "actor": "俞承豪,金艺芬,童孝熙", "time": "2002-04-05(韩国)", "score": "9.0"}
{"index": "7", "image": "https://p1.meituan.net/movie/05bc2f0ccf97aacfa64fcac4f237cf8082385.jpg@160w_220h_1e_1c", "title": "初恋这件小事", "actor": "马里奥·毛瑞尔,平采娜·乐维瑟派布恩,阿查拉那·阿瑞亚卫考", "time": "2012-06-05", "score": "8.8"}
{"index": "8", "image": "https://p1.meituan.net/movie/b607fba7513e7f15eab170aac1e1400d878112.jpg@160w_220h_1e_1c", "title": "泰坦尼克号", "actor": "莱昂纳多·迪卡普里奥,凯特·温丝莱特,比利·赞恩", "time": "1998-04-03", "score": "9.4"}
{"index": "9", "image": "https://p1.meituan.net/movie/a1634f4e49c8517ae0a3e4adcac6b0dc43994.jpg@160w_220h_1e_1c", "title": "迁徙的鸟", "actor": "雅克·贝汉,Philippe Labro", "time": "2001-12-12(法国)", "score": "9.0"}
{"index": "10", "image": "https://p0.meituan.net/movie/09658109acfea0e248a63932337d8e6a4268980.jpg@160w_220h_1e_1c", "title": "蝙蝠侠:黑暗骑士", "actor": "克里斯蒂安·贝尔,希斯·莱杰,阿伦·伊克哈特", "time": "2008-07-14(阿根廷)", "score": "9.3"}
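
The cleanup step can also be tried on its own, without fetching anything. The sketch below is a variant rather than the code above: instead of slicing off a fixed number of characters, it removes the labels 主演: and 上映时间: only when they are actually present. The helper names strip_label and clean_item are made up for illustration, and the input is the raw tuple shown in the earlier output.

```python
def strip_label(text, label):
    """Trim whitespace and drop a leading label such as '主演:' if present."""
    text = text.strip()
    return text[len(label):] if text.startswith(label) else text


def clean_item(item):
    """Turn one 7-field regex match into a dictionary."""
    index, image, title, star, release, integer, fraction = item
    return {
        'index': index,
        'image': image,
        'title': title.strip(),
        'actor': strip_label(star, '主演:'),
        'time': strip_label(release, '上映时间:'),
        # The score arrives in two parts, e.g. '9.' and '0'
        'score': integer.strip() + fraction.strip(),
    }


# Raw tuple taken from the regex output shown earlier
raw = (
    '1',
    'https://p0.meituan.net/movie/4c41068ef7608c1d4fbfbe6016e589f7204391.jpg@160w_220h_1e_1c',
    '活着',
    '\n                主演:葛优,巩俐,牛犇\n        ',
    '上映时间:1994-05-17(法国)',
    '9.',
    '0',
)
print(clean_item(raw))
```

Checking for the label explicitly keeps a field intact if an entry ever omits it, whereas fixed slicing would silently cut off the first characters of the value.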

5 Paginated scraping and writing to a file

Complete code:

import requests
import re
import json
import time

def get_one_page(url):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:76.0) Gecko/20100101 Firefox/76.0'
    }

    response = requests.get(url, headers=headers)
    if response.status_code != 200:
        return None
    return response.text

def parse_one_page(html):
    pattern = re.compile('<dd>.*?board-index.*?>(.*?)</i>.*?data-src="(.*?)".*?name.*?a.*?>(.*?)</a>.*?star.*?>(.*?)</p>.*?releasetime.*?>(.*?)</p>.*?integer.*?>(.*?)</i>.*?fraction.*?>(.*?)</i>.*?</dd>', re.S)
    items = re.findall(pattern, html)
    # Clean up the data
    for item in items:
        yield {
            'index': item[0],
            'image': item[1],
            'title': item[2].strip(),
            'actor': item[3].strip()[3:] if len(item[3]) > 3 else '',
            'time': item[4].strip()[5:] if len(item[4]) > 5 else '',
            'score': item[5].strip() + item[6].strip()
        }

# Append one record to the file as a line of JSON
def write_to_file(content):
    with open('result.txt', 'a', encoding='utf-8') as f:
        f.write(json.dumps(content, ensure_ascii=False) + '\n')

def main(offset):
    url = 'https://maoyan.com/board/4?offset=' + str(offset)
    html = get_one_page(url)
    # get_one_page returns None on a non-200 response; skip that page
    if html is None:
        return
    for item in parse_one_page(html):
        write_to_file(item)

if __name__ == '__main__':
    # print(get_one_page("https://maoyan.com/board/4"));
    for i in range(10):
        main(offset=i*10)
        time.sleep(1)
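
Since write_to_file stores one JSON object per line, the result can be loaded back line by line. The following is a minimal sketch of that, assuming the file layout produced above; read_results is a hypothetical helper name, and the demo writes two sample records to a temporary file so that it does not depend on a real crawl.

```python
import json
import os
import tempfile


def read_results(path):
    """Load a file that holds one JSON object per line."""
    movies = []
    with open(path, encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                movies.append(json.loads(line))
    return movies


# Demo data in a temporary file, in the same format as result.txt
demo_path = os.path.join(tempfile.mkdtemp(), 'result.txt')
with open(demo_path, 'w', encoding='utf-8') as f:
    for record in (
        {'index': '1', 'title': '活着', 'score': '9.0'},
        {'index': '8', 'title': '泰坦尼克号', 'score': '9.4'},
    ):
        f.write(json.dumps(record, ensure_ascii=False) + '\n')

movies = read_results(demo_path)
# Scores are stored as strings, so convert before comparing
best = max(movies, key=lambda m: float(m['score']))
print(best['title'], best['score'])
```

Note that the scraper opens result.txt in append mode, so running it twice leaves duplicate lines in the file; deleting the file before a new run avoids that.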