Scrapy in Practice: Scraping Maoyan Movie Data (with Basics of a Box-Office Prediction Model)

Crawl Target and Tool Selection

The goal of this walkthrough is to scrape the Maoyan Top100 movie board (fields include title, starring cast, release date, and rating) and to build a simple linear-regression prediction model on the collected data. Scrapy is chosen because its mature pipeline mechanism and asynchronous processing make it well suited to structured data extraction.

# Create the Scrapy project (run from a shell)
scrapy startproject maoyan
cd maoyan
scrapy genspider movie maoyan.com

Project Configuration

Edit settings.py to set a browser User-Agent, a download delay, and the item pipeline, which helps avoid triggering anti-scraping measures:

# Key settings.py options
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
DOWNLOAD_DELAY = 3
ITEM_PIPELINES = {
    'maoyan.pipelines.MaoyanPipeline': 300,
}
FEED_EXPORT_ENCODING = 'utf-8'

Page Parsing Logic

Maoyan relies on dynamic rendering in places, so in general you would analyze the XHR endpoints or render pages through a Splash middleware. Here, a static page-parsing approach is used:

# Core spider code (movie.py)
import scrapy
from maoyan.items import MaoyanItem

class MovieSpider(scrapy.Spider):
    name = 'movie'
    allowed_domains = ['maoyan.com']
    start_urls = ['https://www.maoyan.com/board/4']

    def parse(self, response):
        # Each film entry on the board is a <dd> element
        for dd in response.css('.board-wrapper dd'):
            item = MaoyanItem()
            item['name'] = dd.css('.name a::text').get()
            # The site prefixes these fields with labels such as '主演:' and '上映时间:'
            item['stars'] = dd.css('.star::text').re_first(r'主演:(.*)')
            item['release_time'] = dd.css('.releasetime::text').re_first(r'上映时间:(.*)')
            # The rating is split across two <i> nodes (integer and fraction parts)
            item['score'] = ''.join(dd.css('.score i::text').getall())
            yield item
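
The spider above only covers the first page of the board. The Top100 board paginates through an offset query parameter (10 films per page), so the remaining pages can be generated up front; a minimal sketch, assuming the `?offset=N` convention holds:

```python
# Hypothetical helper: generate the ten board pages, assuming
# Maoyan's Top100 paginates as /board/4?offset=0,10,...,90
def board_urls(base='https://www.maoyan.com/board/4', pages=10, per_page=10):
    return [f'{base}?offset={i * per_page}' for i in range(pages)]

# In the spider these could replace the single-entry start_urls:
#   start_urls = board_urls()
print(board_urls()[0])   # https://www.maoyan.com/board/4?offset=0
```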

Data Storage Pipeline

Define the Item class and implement a MySQL storage pipeline:

# items.py: define the data structure
import scrapy

class MaoyanItem(scrapy.Item):
    name = scrapy.Field()
    stars = scrapy.Field()
    release_time = scrapy.Field()
    score = scrapy.Field()

# pipelines.py: MySQL storage
import pymysql

class MaoyanPipeline:
    def open_spider(self, spider):
        # Open the connection once when the spider starts
        self.conn = pymysql.connect(
            host='localhost',
            user='root',
            password='123456',
            db='scrapy_data',
            charset='utf8mb4'
        )

    def process_item(self, item, spider):
        sql = """INSERT INTO movies
                 (name, stars, release_time, score)
                 VALUES (%s, %s, %s, %s)"""
        with self.conn.cursor() as cursor:
            cursor.execute(sql, (
                item['name'],
                item['stars'],
                item['release_time'],
                item['score']
            ))
        self.conn.commit()
        return item

    def close_spider(self, spider):
        # Release the connection when the spider finishes
        self.conn.close()
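
The pipeline assumes a movies table already exists in scrapy_data. A sketch of a matching schema follows; the column names mirror the Item fields, while the types and lengths are assumptions to adjust as needed:

```python
# Assumed DDL for the table targeted by MaoyanPipeline
CREATE_MOVIES_SQL = """
CREATE TABLE IF NOT EXISTS movies (
    id INT AUTO_INCREMENT PRIMARY KEY,
    name VARCHAR(255),
    stars VARCHAR(255),
    release_time VARCHAR(64),
    score VARCHAR(16)
) CHARACTER SET utf8mb4
"""

# To apply it with the same credentials as the pipeline:
# import pymysql
# conn = pymysql.connect(host='localhost', user='root',
#                        password='123456', db='scrapy_data')
# with conn.cursor() as cur:
#     cur.execute(CREATE_MOVIES_SQL)
# conn.commit()
```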

Box-Office Prediction Model

Building a linear-regression model on the historical data first requires preprocessing. Note that the Top100 board does not expose box-office figures, so the examples below use the rating (score) as the regression target; substitute real box-office data where available:

# Data preprocessing example
import pandas as pd
from sqlalchemy import create_engine
from sklearn.preprocessing import LabelEncoder

# Reuse the MySQL credentials from the pipeline above
engine = create_engine('mysql+pymysql://root:123456@localhost/scrapy_data')
df = pd.read_sql('SELECT * FROM movies', con=engine)

# Extract numeric year/month features (expand=False yields a Series, not a DataFrame)
df['year'] = df['release_time'].str.extract(r'(\d{4})', expand=False).astype(int)
df['month'] = df['release_time'].str.extract(r'-(\d{2})', expand=False).astype(int)

# Encode the first-billed star as an ordinal feature
le = LabelEncoder()
df['stars_encoded'] = le.fit_transform(df['stars'].str.split(',').str[0])
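
For reference, LabelEncoder simply assigns integer codes in the sorted order of the unique labels, which imposes an arbitrary ordering on the categories; for a linear model, one-hot encoding is often the safer choice. The behavior in plain Python:

```python
def label_encode(values):
    # Mimics sklearn's LabelEncoder: codes follow the sorted
    # order of the unique labels
    classes = sorted(set(values))
    index = {c: i for i, c in enumerate(classes)}
    return [index[v] for v in values], classes

codes, classes = label_encode(['c', 'a', 'b', 'a'])
print(codes)     # [2, 0, 1, 0]
print(classes)   # ['a', 'b', 'c']
```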
# Build the prediction model
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

X = df[['year', 'month', 'stars_encoded']]
y = df['score'].astype(float)

# Fix random_state so the split is reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
model = LinearRegression()
model.fit(X_train, y_train)
print(f'Model R² score: {model.score(X_test, y_test):.2f}')
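
As a sanity check on the fit/predict workflow, here is a self-contained toy run on synthetic numbers (not the scraped data): when the target is exactly linear in the features, the model recovers the coefficients and predictions are exact.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic (year, month) features and an exactly linear target
X = np.array([[2010, 1], [2012, 6], [2015, 9], [2018, 12]], dtype=float)
y = 0.1 * X[:, 0] + 0.05 * X[:, 1] - 195.0

model = LinearRegression().fit(X, y)
pred = model.predict(np.array([[2020.0, 3.0]]))[0]
print(round(pred, 2))   # 7.15
```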

Handling Anti-Scraping Measures

To cope with dynamically loaded content and IP rate limits, the crawler can be hardened as follows:

# middlewares.py: proxy middleware
class ProxyMiddleware:
    def process_request(self, request, spider):
        # Replace proxy_ip:port with a real proxy address
        request.meta['proxy'] = 'http://proxy_ip:port'
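
A single hard-coded proxy gets rate-limited quickly too; a common extension (sketched here with placeholder addresses) rotates each request through a pool:

```python
import random

class RandomProxyMiddleware:
    # Placeholder addresses; substitute a real, maintained proxy pool
    PROXIES = [
        'http://proxy1:port',
        'http://proxy2:port',
        'http://proxy3:port',
    ]

    def process_request(self, request, spider):
        # Scrapy's HttpProxyMiddleware honors whatever meta['proxy'] names
        request.meta['proxy'] = random.choice(self.PROXIES)
```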

# Or render via Splash (requires the scrapy-splash package; partial
# settings.py config, see the scrapy-splash README for the full setup)
SPLASH_URL = 'http://localhost:8050'
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'

Visualization

Use Matplotlib to visualize how ratings are distributed across release years:

import matplotlib.pyplot as plt

# Scatter ratings (x) against release year (y)
plt.scatter(df['score'], df['year'])
plt.xlabel('Score')
plt.ylabel('Release year')
plt.title('Score distribution by release year')
plt.show()

Deployment and Scheduling

For production deployment, the Scrapyd service is recommended:

# Install and start the service
pip install scrapyd scrapyd-client
scrapyd &
# 'scrapy deploy' was removed in Scrapy 1.0; use scrapyd-deploy from scrapyd-client
scrapyd-deploy default -p maoyan

Full Project Structure
maoyan/
├── scrapy.cfg
├── maoyan/
│   ├── __init__.py
│   ├── items.py
│   ├── middlewares.py
│   ├── pipelines.py
│   ├── settings.py
│   └── spiders/
│       └── movie.py
└── data_analysis.ipynb

This project covers the full workflow from data collection to predictive modeling, touching Scrapy's core components, data-processing techniques, and a simple machine-learning application. When running it for real, adjust the selectors to the site's current structure and respect the robots.txt policy.