kaggle TMDB 票房预测 -step by step解读代码_kaggle电影票房预测-CSDN博客

本文链接：https://blog.csdn.net/qq_14820609/article/details/88676504

本文详细探讨了如何预测TMDB电影票房，包括数据预处理、特征工程、模型训练和结果分析。通过提取电影类型、制作公司、语言、关键词等信息，构建新特征。使用LightGBM进行建模，并进行交叉验证。最后，分析了关键特征对票房的影响，揭示了预算、上映日期等因素的重要性。

摘要生成于 C知道，由 DeepSeek-R1 满血版支持，前往体验 >

点这传送kaggle原作者
 点这传送数据源&比赛

变量构造

belongs_to_collection

for i, e in enumerate(train['belongs_to_collection'][:5]):
    print(i, e)

0 [{'id': 313576, 'name': 'Hot Tub Time Machine Collection', 'poster_path': '/iEhb00TGPucF0b4joM1ieyY026U.jpg', 'backdrop_path': '/noeTVcgpBiD48fDjFVic1Vz7ope.jpg'}]
1 [{'id': 107674, 'name': 'The Princess Diaries Collection', 'poster_path': '/wt5AMbxPTS4Kfjx7Fgm149qPfZl.jpg', 'backdrop_path': '/zSEtYD77pKRJlUPx34BJgUG9v1c.jpg'}]
2 {}
3 {}
4 {}

enumerate:在可循环对象的每个元素前面加个序号

train['belongs_to_collection'].apply(lambda x: len(x) if x != {
   } else 0).value_counts()

0    2396
1     604
Name: belongs_to_collection, dtype: int64

DataFrame.apply(func, axis=0)对列（对行则axis=1）使用func
lambda x:x+1 if 2==1 else 0 #if 条件为真的时候返回if前面内容，否则返回0

此列中的2396个值为空，604包含有关系列的信息。假设只有集合名称才有用。另一个可能有用的特征是是否属于系列。

train['collection_name'] = train['belongs_to_collection'].apply(lambda x: x[0]['name'] if x != {
   } else 0)
train['has_collection'] = train['belongs_to_collection'].apply(lambda x: len(x) if x != {
   } else 0)

test['collection_name'] = test['belongs_to_collection'].apply(lambda x: x[0]['name'] if x != {
   } else 0)
test['has_collection'] = test['belongs_to_collection'].apply(lambda x: len(x) if x != {
   } else 0)

train = train.drop(['belongs_to_collection'], axis=1)
test = test.drop(['belongs_to_collection'], axis=1)

genres

for i, e in enumerate(train['genres'][:5]):
    print(i, e)
0 [{
   'id': 35, 'name': 'Comedy'}]
1 [{
   'id': 35, 'name': 'Comedy'}, {
   'id': 18, 'name': 'Drama'}, {
   'id': 10751, 'name': 'Family'}, {
   'id': 10749, 'name': 'Romance'}]
2 [{
   'id': 18, 'name': 'Drama'}]
3 [{
   'id': 53, 'name': 'Thriller'}, {
   'id': 18, 'name': 'Drama'}]
4 [{
   'id': 28, 'name': 'Action'}, {
   'id': 53, 'name': 'Thriller'}

print('Number of genres in films')
train['genres'].apply(lambda x: len(x) if x != {
   } else 0).value_counts()
Number of genres in films
2    972
3    900
1    593
4    393
5    111
6     21
0      7
7      3
Name: genres, dtype: int64

类型列包含电影所属类型的命名和ID。大多数电影有2-3种类型，可能有5-6种类型。我认为0和7是异常值。让我们提取类型！我将在电影中创建一个包含所有类型的列，并为每个类型创建单独的列。

但首先让我们来看看这些类型。

list_of_genres = list(train['genres'].apply(lambda x: [i['name'] for i in x] if x != {
   } else []).values)

[[‘Comedy’], [‘Comedy’, ‘Drama’, ‘Family’, ‘Romance’], [‘Drama’],…]

plt.figure(figsize = (12, 8))
text = ' '.join([i for j in list_of_genres for i in j])#把list_of_genres里面的每一个元素拿出来形成一个长list再用空格分隔
wordcloud = WordCloud(max_font_size=None, background_color='white', collocations=False,
                      width=1200, height=1000).generate(text)#collocations=False, #避免重复单词 max_font_size=100,  # 字体最大值
plt.imshow(wordcloud)
plt.title('Top genres')
plt.axis("off")
plt.show()

在这里插入图片描述
剧情，喜剧和惊悚片比较流行。

Counter([i for j in list_of_genres for i in j]).most_common()

Counter是一个无序的容器类型，以字典的键值对形式存储，其中元素作为key，其计数作为value。
Counter.most_common([n])返回一个TopN列表。如果n没有被指定，则返回所有元素。当多个元素计数值相同时，排列是无确定顺序的。

[('Drama', 1531), ('Comedy', 1028), ('Thriller', 789), ('Action', 741), ('Romance', 571), ('Crime', 469), ('Adventure', 439), ('Horror', 301), ('Science Fiction', 290), ('Family', 260), ('Fantasy', 232), ('Mystery', 225), ('Animation', 141), ('History', 132), ('Music', 100), ('War', 100), ('Documentary', 87), ('Western', 43), ('Foreign', 31), ('TV Movie', 1)]

为前15种类型创建单独的列。

train['num_genres'] = train['genres'].apply(lambda x: len(x) if x != {
   } else 0)
train['all_genres'] = train['genres'].apply(lambda x: ' '.join(sorted([i['name'] for i in x])) if x != {
   } else '')
top_genres = [m[0] for m in Counter([i for j in list_of_genres for i in j]).most_common(15)]
for g in top_genres:
    train['genre_' + g] = train['all_genres'].apply(lambda x: 1 if g in x else 0)
    
test['num_genres'] = test['genres'].apply(lambda x: len(x) if x != {
   } else 0)
test['all_genres'] = test['genres'].apply(lambda x: ' '.join(sorted([i['name'] for i in x])) if x != {
   } else '')
for g in top_genres:
    test['genre_' + g] = test['all_genres'].apply(lambda x: 1 if g in x else 0)

train = train.drop(['genres'], axis=1)
test = test.drop(['genres'], axis=1)

production_companies&production_countries&Spoken languages&Keywords

制片公司、制片国、语言、关键词的处理类似类型的处理，保留参与公司（国家、语言种类、关键词）数目，为前30（或其他数字）单独生成变量。

for i, e in enumerate(train['production_companies'][:5]):
    print(i, e)
0 [{
   'name': 'Paramount Pictures', 'id': 4}, {
   'name': 'United Artists', 'id': 60}, {
   'name': 'Metro-Goldwyn-Mayer (MGM)', 'id': 8411}]
1 [{
   'name': 'Walt Disney Pictures', 'id': 2}]
2 [{
   'name': 'Bold Films', 'id': 2266}, {
   'name': 'Blumhouse Productions', 'id': 3172}, {
   'name': 'Right of Way Films', 'id': 32157}]
3 {
   }
4 {
   }
print('Number of production companies in films')
train['production_companies'].apply(lambda x: len(x) if x != {
   } else 0).value_counts()

Number of production companies in films
1     775
2     734
3     582
4     312
5     166
0     156
6     118
7      62
8      42
9      29
11      7
10      7
12      3
16      2
15      2
14      1
13      1
17      1
Name: production_companies, dtype: int64

Most of films have 1-2 production companies, cometimes 3-4. But there are films with 10+ companies! Let’s have a look at some of them.

train[train['production_companies'].apply(lambda x: len(x) if x != {
   } else 0) > 11]

list_of_companies = list(train['production_companies'].apply(lambda x: [i['name'] for i in x] if x != {
   } else []).values)
Counter([i for j in list_of_companies for i in j]).most_common(30)
[('Warner Bros.', 202), ('Universal Pictures', 188), ('Paramount Pictures', 161), ('Twentieth Century Fox Film Corporation', 138), ('Columbia Pictures', 91), ('Metro-Goldwyn-Mayer (MGM)', 84), ('New Line Cinema', 75), ('Touchstone Pictures', 63), ('Walt Disney Pictures', 62), ('Columbia Pictures Corporation', 61), ('TriStar Pictures', 53), ('Relativity Media', 48), ('Canal+', 46), ('United Artists', 44), ('Miramax Films', 40), ('Village Roadshow Pictures', 36), ('Regency Enterprises', 31), ('BBC Films', 30), ('Dune Entertainment', 30), ('Working Title Films', 30), ('Fox Searchlight Pictures', 29), ('StudioCanal', 28), ('Lionsgate', 28), ('DreamWorks SKG', 27), ('Fox 2000 Pictures', 25), ('Summit Entertainment', 24), ('Hollywood Pictures', 24), ('Orion Pictures', 24), ('Amblin Entertainment', 23), ('Dimension Films', 23)]

For now I’m not sure what to do with this data. I’ll simply create binary columns for top-30 films. Maybe later I’ll have a better idea.

train['num_companies'] = train['production_companies'].apply(lambda x: len(x) if x != {
   } else 0)
train['all_production_companies'] = train['production_companies'].apply(lambda x: ' '.join(sorted([i['name'] for i in x])) if x != {
   } else '')
top_companies = [m[0] for m in Counter([i for j in list_of_companies for i in j]).most_common(30)]
for g in top_companies:
    train['production_company_' + g] = train['all_production_companies'].apply(lambda x: 1 if g in x else 0)
    
test['num_companies'] = test['production_companies'].apply(lambda x: len(x) if x != {
   } else 0)
test['all_production_companies'] = test['production_companies'].apply(lambda x: ' '.join(sorted([i['name'] for i in x])) if x != {
   } else '')
for g in top_companies:
    test['production_company_' + g] = test['all_production_companies'].apply(lambda x: 1 if g in x else 0)

train = train.drop(['production_companies', 'all_production_companies'], axis=1)
test = test.drop(['production_companies', 'all_production_companies'], axis=1)

for i, e in enumerate(train['production_countries'][:5]):
    print(i, e)
0 [{
   'iso_3166_1': 'US', 'name': 'United States of America'}]
1 [{
   'iso_3166_1': 'US', 'name': 'United States of America'}]
2 [{
   'iso_3166_1': 'US', 'name': 'United States of America'}]
3 [{
   'iso_3166_1': 'IN', 'name': 'India'}]
4 [{
   'iso_3166_1': 'KR', 'name': 'South Korea'}]
print('Number of production countries in films')
train['production_countries'].apply(lambda x: len(x) if x != {
   } else 0).value_counts()

Number of production countries in films
1    2222
2     525
3     116
4      57
0      55
5      21
6       3
8       1
Name: production_countries, dtype: int64

Normally films are produced by a single country, but there are cases when companies from several countries worked together.

list_of_countries = list(train['production_countries'].apply(lambda x: [i['name'] for i in x] if x != {
   } else []).values)
Counter([i for j in list_of_countries for i in j]).most_common(25)
[('United States of America', 2282), ('United Kingdom', 380), ('France', 222), ('Germany', 167), ('Canada', 120), ('India', 81), ('Italy', 64), ('Japan', 61), ('Australia', 61), ('Russia', 58), ('Spain', 54), ('China', 42), ('Hong Kong', 42), ('Ireland', 23), ('Belgium', 23), ('South Korea', 22), ('Mexico', 19), ('Sweden', 18), ('New Zealand', 17), ('Netherlands', 15), ('Czech Republic', 14), ('Denmark', 13), ('Brazil', 12), ('Luxembourg', 10), ('South Africa', 10)]
train['num_countries'] = train['production_countries'].apply(lambda x: len(x) if x != {
   } else 0)
train['all_countries'] = train['production_countries'].apply(lambda x: ' '.join(sorted([i['name'] for i in x])) if x != {
   } else '')
top_countries = [m[0] for m in Counter([i for j in list_of_countries for i in j]).most_common(25)]
for g in top_countries:
    train['production_country_' + g] = train['all_countries'].apply(lambda x: 1 if g in x else 0)
    
test['num_countries'] = test['production_countries'].apply(lambda x: len(x) if x != {
   } else 0)
test['all_countries'] = test['production_countries'].apply(lambda x: ' '.join(sorted([i['name'] for i in x])) if x != {
   } else '')
for g in top_countries:
    test['production_country_' + g] = test['all_countries'].apply(lambda x: 1 if g in x else 0)

train = train.drop(['production_countries', 'all_countries'], axis=1)
test = test.drop(['production_countries', 'all_countries'], axis=1)

Spoken languages

for i, e in enumerate(train['spoken_languages'][:5]):
    print(i, e)
0