基于机器学习的电影票房分析与预测系统

最新推荐文章于 2025-02-18 12:28:11 发布

雅致教育

最新推荐文章于 2025-02-18 12:28:11 发布

阅读量728

点赞数 3

分类专栏： python 大数据文章标签：机器学习人工智能

本文链接：https://blog.csdn.net/m0_73484725/article/details/133418330

版权

python 同时被 2 个专栏收录

103 篇文章

订阅专栏

大数据

7 篇文章

订阅专栏

欢迎大家点赞、收藏、关注、评论啦，由于篇幅有限，只展示了部分核心代码。

文章目录

一项目简介

二、数据读取
三、数据探索式分析
3.1 抓取的数据如下图所示：
3.2 电影票房收入的分布情况
3.3 电影发布时间分布情况
3.4 电影发布时间与电影时长和票房收入间的关系
3.5 电影拍摄制作的总预算分布及与票房的关系
3.6 电影时长分布情况
3.7 电影时长与总预算间和票房收入间的关系

四. 电影票房数据集
五. 总结

一项目简介

票房作为衡量电影能否盈利的重要指标受诸多因素共同作用影响且其影响机制较为复杂，电影票房的准确预测是比较有难度的。本项目利用某开源电影数据集构建票房预测模型，首先将影响电影票房的因素如电影类型、上映档期、导演、演员等量化处理并进行可视化分析。采用多元线性回归模型、决策树回归模型、Ridge regression 岭回归模型、Lasso regression 岭回归模型和随机森林回归模型实现票房的预测，并进行以上模型的 model stacking，实现预测误差的进一步降低。

二、数据读取

功能主要包括：

在这里插入图片描述

三、数据探索式分析

3.1 抓取的数据如下图所示：

在这里插入图片描述

3.2 电影票房收入的分布情况

plt.figure(figsize=(16, 8))
plt.subplot(211)
sns.kdeplot(movie_df['Movie_Income'])
plt.title('电影票房收入(美元)的分布情况', fontsize=16, weight='bold', color='black')
 
plt.subplot(212)
sns.kdeplot(np.log1p(movie_df['Movie_Income']))
plt.title('电影票房收入(美元)的分布情况（lop1p转换）', fontsize=16, weight='bold', color='black')
plt.show()

在这里插入图片描述

3.3 电影发布时间分布情况

在这里插入图片描述

3.4 电影发布时间与电影时长和票房收入间的关系

plt.figure(figsize=(20, 8))
sns.boxplot(x="Movie_Year", y="Running_Time", data=movie_df, linewidth=1.5)
plt.title('MPAA 与电影时长间的分布情况', fontsize=16, weight='bold')
plt.show()
 
plt.figure(figsize=(20, 8))
sns.boxplot(x="Movie_Year", y="Movie_Income", data=movie_df, linewidth=1.5)
plt.title('MPAA 与电影票房收入间的分布情况', fontsize=16, weight='bold')
plt.show()

在这里插入图片描述

3.5 电影拍摄制作的总预算分布及与票房的关系

在这里插入图片描述

3.6 电影时长分布情况

在这里插入图片描述

3.7 电影时长与总预算间和票房收入间的关系

plt.figure(figsize=(16, 6))
plt.subplot(121)
plt.scatter(movie_df['Running_Time'], movie_df['Budget'], s=40, c='red')
plt.title('电影时长与电影制作总预算间的关系', fontsize=16, weight='bold')
 
plt.subplot(122)
plt.scatter(movie_df['Running_Time'], movie_df['Movie_Income'], s=40, c='blue')
plt.title('电影时长与电影票房收入间的关系', fontsize=16, weight='bold')
plt.show()

在这里插入图片描述

四. 电影票房数据集

电影票房数据来自于某公司旗下一个系统性计算电影票房的网站，旨在通过分析、评论、采访和最全面的在线票房追踪这种艺术与商业结合的方式来介绍电影的情况。

# 首页
url = 'https://www.xxxxxx.com/chart/top_lifetime_gross/?area=XWW'
# 保存所有的电影信息
all_movie_infos = []
need_break = False
 
while True:
    if need_break:
        break
        
    print('》》》爬取', url)
    headers = {
        'user-agent': util.get_random_user_agent(),
        'accept-language': 'zh-CN,zh;q=0.9',
        'cache-control': 'max-age=0',
        'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8'
    }
    response = requests.get(url, headers=headers)
    response.encoding = 'utf8'
    soup = BeautifulSoup(response.text, 'lxml')
 
    rank_tds = soup.select('td.mojo-field-type-rank')
    movie_tds = soup.select('td.mojo-field-type-title')
    money_tds = soup.select('td.mojo-field-type-money')
    year_tds = soup.select('td.mojo-field-type-year')
 
    # 下一页
    next_page = soup.find('li', class_='a-last')
    if next_page is None:  # 所有页面爬取完成
        break
    
    try:
        url = 'https://www.xxxxxx.com/' + next_page.a['href']
    except:
        need_break = True
    
    for i in tqdm(range(len(rank_tds))):
        try:
            rank_td, movie_td, money_td, year_td = rank_tds[i], movie_tds[i], money_tds[i], year_tds[i]
            movie_info = {}
            movie_rank = int(rank_td.text.strip())
 
            movie_name = movie_td.a.text.strip()
            movie_link = 'https://www.boxofficemojo.com/' + movie_td.a['href']
            movie_income = money_td.text.strip()
            movie_income = float(movie_income.replace(',', '')[1:])
            movie_year = int(year_td.text.strip())
 
            movie_info['movie_name'] = movie_name
            movie_info['movie_link'] = movie_link
            movie_info['movie_income'] = movie_income
            movie_info['movie_year'] = movie_year
 
            # 电影发行的详细信息
            movie_detail = get_movie_detail(movie_link)
            movie_info.update(movie_detail)
            all_movie_infos.append(movie_info)
        except:
            continue
        
    print('总计爬取 {} 条电影数据'.format(len(all_movie_infos)))

————————————————

五. 总结

本项目利用某开源电影数据集构建票房预测模型，首先将影响电影票房的因素如电影类型、上映档期、导演、演员等量化处理并进行可视化分析。采用多元线性回归模型、决策树回归模型、Ridge regression 岭回归模型、Lasso regression 岭回归模型和随机森林回归模型实现票房的预测，并进行以上模型的 model stacking，实现预测误差的进一步降低。