Crawler Goals and Tool Selection
This walkthrough crawls the Maoyan Movies Top100 board (movie title, starring actors, release date, and rating), then builds a simple linear regression model on the collected data. Since the board exposes no box-office figures, the model predicts ratings. Scrapy is chosen for its mature pipeline mechanism and asynchronous processing, both well suited to structured data scraping.
# Create the Scrapy project (run from the command line)
scrapy startproject maoyan
cd maoyan
scrapy genspider movie maoyan.com
Project Structure and Configuration
Editing settings.py
Set a User-Agent and a crawl delay to avoid triggering anti-scraping defenses, and register the item pipeline:
# Key settings.py configuration
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
DOWNLOAD_DELAY = 3
ITEM_PIPELINES = {
'maoyan.pipelines.MaoyanPipeline': 300,
}
FEED_EXPORT_ENCODING = 'utf-8'
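Scrapy does not wait exactly DOWNLOAD_DELAY seconds between requests: with the default RANDOMIZE_DOWNLOAD_DELAY = True it sleeps a uniform random time between 0.5x and 1.5x the configured delay, which makes the request rhythm harder to fingerprint. A quick sketch of the effective range for the value above:

```python
import random

def effective_delay(download_delay: float) -> float:
    """Mimic Scrapy's randomized wait: uniform in [0.5*d, 1.5*d]."""
    return random.uniform(0.5 * download_delay, 1.5 * download_delay)

# With DOWNLOAD_DELAY = 3, every wait falls between 1.5s and 4.5s
samples = [effective_delay(3) for _ in range(1000)]
print(min(samples) >= 1.5 and max(samples) <= 4.5)  # → True
```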
Page Parsing Logic
Much of Maoyan is rendered dynamically, which would require analyzing its XHR endpoints or rendering through the Splash middleware. The Top100 board itself, however, is served as static HTML, so plain page parsing is used here:
# Core spider code in movie.py
import scrapy
from maoyan.items import MaoyanItem

class MovieSpider(scrapy.Spider):
    name = 'movie'
    allowed_domains = ['maoyan.com']
    start_urls = ['https://www.maoyan.com/board/4']

    def parse(self, response):
        # Each ranked movie is a <dd> element under the board wrapper
        for dd in response.css('.board-wrapper dd'):
            item = MaoyanItem()
            item['name'] = dd.css('.name a::text').get()
            item['stars'] = dd.css('.star::text').re_first(r'主演:(.*)')
            item['release_time'] = dd.css('.releasetime::text').re_first(r'上映时间:(.*)')
            item['score'] = dd.css('.score::text').get()
            yield item
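The parse method above covers a single page, while the board spreads its 100 entries over ten pages. Assuming the site keeps its offset-based pagination of 10 films per page (an assumption about the current URL scheme, not part of the original code), the start URLs can be generated up front instead of following "next" links:

```python
# Hypothetical pagination helper: the board pages through an 'offset'
# query parameter in steps of 10 (offset=0, 10, ..., 90).
def board_urls(base='https://www.maoyan.com/board/4', pages=10, per_page=10):
    return [f'{base}?offset={i * per_page}' for i in range(pages)]

# In the spider, replace start_urls with: start_urls = board_urls()
print(board_urls()[0])   # → https://www.maoyan.com/board/4?offset=0
print(board_urls()[-1])  # → https://www.maoyan.com/board/4?offset=90
```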
Data Storage Pipeline
Define the Item class and implement a MySQL storage pipeline:
# Data structure defined in items.py
import scrapy

class MaoyanItem(scrapy.Item):
    name = scrapy.Field()
    stars = scrapy.Field()
    release_time = scrapy.Field()
    score = scrapy.Field()
# Database storage in pipelines.py
import pymysql

class MaoyanPipeline:
    def open_spider(self, spider):
        self.conn = pymysql.connect(
            host='localhost',
            user='root',
            password='123456',
            db='scrapy_data',
            charset='utf8mb4'
        )

    def close_spider(self, spider):
        # Release the connection when the crawl finishes
        self.conn.close()

    def process_item(self, item, spider):
        sql = """INSERT INTO movies
                 (name, stars, release_time, score)
                 VALUES (%s, %s, %s, %s)"""
        # Use a context manager so each cursor is closed after the insert
        with self.conn.cursor() as cursor:
            cursor.execute(sql, (
                item['name'],
                item['stars'],
                item['release_time'],
                item['score']
            ))
        self.conn.commit()
        return item
Rating Prediction Model
Building the linear regression model starts with preprocessing the crawled data:
# Data preprocessing example
import pandas as pd
from sqlalchemy import create_engine
from sklearn.preprocessing import LabelEncoder

# Adjust the connection string to your MySQL credentials
engine = create_engine('mysql+pymysql://root:123456@localhost/scrapy_data')
df = pd.read_sql('SELECT * FROM movies', con=engine)
df['year'] = df['release_time'].str.extract(r'(\d{4})')
df['month'] = df['release_time'].str.extract(r'-(\d{2})')

# Encode the lead (first-billed) actor as a categorical variable
le = LabelEncoder()
df['stars_encoded'] = le.fit_transform(df['stars'].str.split(',').str[0])
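LabelEncoder assigns an arbitrary integer to each distinct lead actor; pandas' factorize does the same job without sklearn and makes the mapping visible. A toy run on sample strings shaped like the spider's output (the data here is illustrative):

```python
import pandas as pd

# Sample 'stars' values in the format the spider extracts (illustrative data)
stars = pd.Series(['张国荣,张丰毅,巩俐', '周星驰,吴孟达', '张国荣,梅艳芳'])
lead = stars.str.split(',').str[0]       # first-billed actor per film
codes, uniques = pd.factorize(lead)      # one integer per distinct actor
print(list(codes))    # → [0, 1, 0]
print(list(uniques))  # → ['张国荣', '周星驰']
```

Note that such integer codes impose an arbitrary ordering that a linear model will treat as meaningful; one-hot encoding is usually the safer choice for nominal categories.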
# Build the prediction model
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# year/month were extracted as strings, so cast them before fitting
X = df[['year', 'month', 'stars_encoded']].astype(int)
y = df['score'].astype(float)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
model = LinearRegression()
model.fit(X_train, y_train)
print(f'Model R2 score: {model.score(X_test, y_test):.2f}')
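Under the hood, LinearRegression solves an ordinary least squares problem. A self-contained numpy illustration on synthetic data (the coefficients 2, -1 and intercept 3 are made up for the demo) shows that the fit recovers the generating weights exactly when the relation is truly linear:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(50, 2))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + 3.0   # exact linear relation

# Append an intercept column and solve min ||A @ b - y||^2
A = np.hstack([X, np.ones((50, 1))])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
print(np.round(coef, 2))  # recovers [2, -1, 3] (last entry is the intercept)
```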
Dealing with Anti-Scraping Measures
To handle dynamic loading and IP rate limits, the crawler can be hardened as follows:
# Proxy middleware in middlewares.py
class ProxyMiddleware:
    def process_request(self, request, spider):
        # Placeholder address; substitute a working proxy
        request.meta['proxy'] = 'http://proxy_ip:port'
# Alternatively, render JavaScript through Splash (settings.py; a complete
# scrapy-splash setup also configures its dupe filter and spider middleware)
SPLASH_URL = 'http://localhost:8050'
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashMiddleware': 725,
}
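A single hard-coded proxy is both a bottleneck and a single point of failure; a common refinement rotates over a pool on every request (the addresses below are placeholders, not working proxies):

```python
import random

# Placeholder pool; fill with working proxy endpoints
PROXY_POOL = [
    'http://127.0.0.1:8001',
    'http://127.0.0.1:8002',
    'http://127.0.0.1:8003',
]

class RotatingProxyMiddleware:
    def process_request(self, request, spider):
        # Attach a randomly chosen proxy to every outgoing request
        request.meta['proxy'] = random.choice(PROXY_POOL)
```

Register it in DOWNLOADER_MIDDLEWARES the same way as the ProxyMiddleware above.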
Visual Analysis
Use Matplotlib to show how ratings are distributed across release years:
import matplotlib.pyplot as plt

plt.scatter(df['score'], df['year'].astype(int))
plt.xlabel('Rating')
plt.ylabel('Release year')
plt.title('Rating distribution across release years')
plt.show()
Deployment and Scheduling
Scrapyd is the recommended way to deploy the crawler in production:
# Install and start the service
pip install scrapyd scrapyd-client
scrapyd &
# The old "scrapy deploy" command has been removed; use scrapyd-deploy
scrapyd-deploy default -p maoyan
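scrapyd-deploy resolves the "default" target from scrapy.cfg; a minimal target definition, assuming Scrapyd runs locally on its default port 6800:

```ini
[deploy:default]
url = http://localhost:6800/
project = maoyan
```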
Complete Project Structure
maoyan/
├── scrapy.cfg
├── maoyan/
│ ├── __init__.py
│ ├── items.py
│ ├── middlewares.py
│ ├── pipelines.py
│ ├── settings.py
│ └── spiders/
│ └── movie.py
└── data_analysis.ipynb
This hands-on project covers the full workflow from data collection to predictive modeling, touching Scrapy's core components, data-processing techniques, and a simple machine-learning application. When running it for real, adjust the selectors to the site's current structure and respect its robots.txt policy.