猫眼电影各年份数据采集

安于长情_

已于 2024-04-10 15:01:50 修改

阅读量864

点赞数 9

分类专栏：网络爬虫学习文章标签：爬虫 python

于 2024-04-09 22:20:12 首次发布

本文链接：https://blog.csdn.net/weixin_64138524/article/details/137568858

版权

网络爬虫学习专栏收录该内容

2 篇文章

订阅专栏

文章讲述了作者如何使用Python的selenium库从马蜂窝电影票务网站抓取2024年的电影列表和详细数据，包括电影名称、上映时间、票房等，并将数据保存为CSV文件。

摘要生成于 C知道，由 DeepSeek-R1 满血版支持，前往体验 >

本文只用作于学习。。。

上面是我采集的数据2019——2024的数据，经过我合并处理后的。（需要可自取）

采集时间为：2024/4/9 下午

一、准备工作

本次数据抓取使用Python+selenium实现。所用浏览器为chorm。

数据源为：https://piaofang.maoyan.com/rankings/year?year=2024

注：本文以2024年数据为例，如需其他年份，修改数据源链接即可。代码于2024/4/9 下午编写。后期使用以下代码，网站可能会有所变动，注意修改。

所用到的库如下：

selenium	3.141.0
urllib3	1.26.2

二、步骤及思路

电影数据采集分为两部分

第一部分：要采集的电影列表。字段如下：

电影名称

上映时间

票房

平均票价

场均人次

详情url

第二部分：采集电影的详细数据。字段如下：

累计票房

首日票房

首周票房

票房预测

类型

国家/时长

猫眼评分

第一部分代码：

# -*- coding: utf-8 -*-
# @Author  : 归燕
# @FileName: selenium_all.py
# @Time    : 2024/4/9 09:30

import csv
import random
from selenium import webdriver
import time
from selenium.webdriver.chrome.options import Options

# 创建 Chrome WebDriver 选项
chrome_options = Options()
# 使用代理ip（自己本机ip被封禁了）
# proxy_ip=""   #换成自己的代理ip
# proxy_port=""           #换成自己的代理端口号
# chrome_options.add_argument('--proxy-server={}:{}'.format(proxy_ip, proxy_port))
# 使用chorm浏览器，注意chorm浏览器驱动路径换成自己的
# 使用Selenium库中的webdriver模块来实例化一个Chrome浏览器的驱动程序
driver = webdriver.Chrome(r'C:/Users/17332/AppData/Local/Google'
                          r'/Chrome/Application/chromedriver-win64/chromedriver.exe')
# 定义年份，访问目标网页（年份根据需要更换）
year=2024
driver.get('https://piaofang.maoyan.com/rankings/year?year={}'.format(year))
# 等待页面加载完成
time.sleep(random.randint(1, 10))
# 电影名称（使用find_elements_by_class_name方法进行定位）
title_elements = driver.find_elements_by_class_name('first-line')
# 将定位元素的值存到列表中
title_list = [element.text for element in title_elements]
# print(title_list)
# 上映时间
time_elements = driver.find_elements_by_class_name('second-line')
time_list = [element.text for element in time_elements]
print(time_list)
# 票房
piaofang_list = []
elements = driver.find_elements_by_class_name('col2')
for element in elements:
    if 'tr' in element.get_attribute('class').split(" "):
        piaofang_list.append(element.text)
print(piaofang_list)
# 平均票价
piaojia_list = []
elements = driver.find_elements_by_class_name('col3')
for element in elements:
    if 'tr' in element.get_attribute('class').split(" "):
        piaojia_list.append(element.text)
print(piaojia_list)

# 场均人次
changjun_list = []
elements = driver.find_elements_by_class_name('col4')
for element in elements:
    if 'tr' in element.get_attribute('class').split(" "):
        changjun_list.append(element.text)
print(changjun_list)

# 详细链接
detail_urls = []
elements = driver.find_elements_by_class_name('row')
# 对详细链接的定位进行循环
for element in elements[1:]:
    data_com = element.get_attribute('data-com')
    url = str(data_com)
    num = url.split(":")[1]
    num = num.replace("'", "")
    # 拼接详情url
    detail_url = "https://piaofang.maoyan.com" + num
    # 将url保存到文件中
    with open('detail_urls.txt', 'a+') as f:
        f.write(detail_url + "\n")
# 将电影名称、上映时间、票房、平均票房、场均人次保存到csv文件中
with open('data1.csv', "a+", encoding='utf-8', newline='') as f:
    csv_write = csv.writer(f)
    for i in range(len(title_list)):
        title_ = title_list[i]
        uptime = time_list[i]
        piaofang_ = piaofang_list[i + 1]
        average_money_ = piaojia_list[i + 1]
        average_people_ = changjun_list[i + 1]
        data_row = [title_, uptime, piaofang_, average_money_, average_people_]
        csv_write.writerow(data_row)

注：上述代码中data1.csv文件中，保存的数据列字段为（电影名称、上映时间、票房、平均票价、场均人次），即下图所示数据。

detail_urls.txt文件中，保存的是2024年电影列表的详情URL。

通过第二步，可以得到每个电影的详细数据。

第二部分代码：

# -*- coding: utf-8 -*-
# @Author  : 归燕
# @FileName: maoyan.py
# @Time    : 2024/4/2 16:09
import csv
import random
from selenium import webdriver
import time

# chrome_options = Options()
# 使用代理ip（自己本机ip被封禁了）
# proxy_ip=""   #换成自己的代理ip
# proxy_port=""           #换成自己的代理端口号
# chrome_options.add_argument('--proxy-server={}:{}'.format(proxy_ip, proxy_port))
# 使用chorm浏览器，注意chorm浏览器驱动路径换成自己的

#使用Selenium库中的webdriver模块来实例化一个Chrome浏览器的驱动程序
driver = webdriver.Chrome(r'C:/Users/17332/AppData/Local/Google/Chrome/Application/chromedriver-win64/chromedriver.exe')
# 打开detail_urls.txt文件
with open('detail_urls.txt', 'r') as file:
    # 按行读取
    for url in file:
        # 请求url
        driver.get(url)
        # 等待页面加载完成
        time.sleep(random.randint(1, 10))
        # 使用try except 避免报错使程序停止运行
        try :
            # 定位评分
            score = driver.find_element_by_xpath('//span[@class="rating-num"]')
            score_=score.text
        except Exception as e:
            score_="暂无"
        print(score_)
        # 定位票房数据
        elements = driver.find_elements_by_class_name('detail-num')
        neidi_piaof = [element.text for element in elements]
        try:
            # 定位类型
            leixing = driver.find_element_by_xpath('//p[@class="info-category"]')
            juqing=leixing.text
        except Exception as e:
            juqing="暂无"
        try:
            # 定位国家
            countary_ = driver.find_element_by_xpath('//p[@class=".ellipsis-1"]')
            countary=countary_.text
        except Exception as e:
            countary="暂无"
        rowdata = []
        rowdata.append(score_)
        rowdata.append(juqing)
        rowdata.append(countary)
        for n in neidi_piaof:
            rowdata.append(n)
        # 将数据写入csv文件中
        with open('data2.csv', 'a+', encoding='utf-8', newline='') as f:
            csv_write = csv.writer(f)
            csv_write.writerow(rowdata)

上述代码中的data2.csv中保存的是每个电影的详情数据。

保存的数据前七列字段为（猫眼评分、电影类型、国家/时长、累计票房、首日票房、首周票房、票房预测）

注：只标明前七列是因为，后面几列数据缺失比较严重，采集的时候顺手保存了，

想要了解的可以根据类名（detail-num）自行查看。

本次的数据采集，得到两个csv文件。即data1.csv和data2.csv,可根据自身需要将两文件进行合并。当然，所有的字段的属性值需要自己加上哦。