Box Office Data Analysis and Visualization Report
Welcome back to my 100 Days of Data Science Challenge journey. On days 4 and 5, I worked on the TMDB Box Office Prediction dataset available on Kaggle.
I’ll start by importing the libraries we need for this task.
import pandas as pd

# for visualizations
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
plt.style.use('dark_background')
Data Loading and Exploration
Once you have downloaded the data from Kaggle, you will have three files. Because this is a prediction competition, there are train, test, and sample_submission files. For this project, my goal is only to analyze and visualize the data, so I am going to ignore the test.csv and sample_submission.csv files.
Let’s load train.csv into a data frame using pandas.
%time train = pd.read_csv('./data/tmdb-box-office-prediction/train.csv')
# output
CPU times: user 258 ms, sys: 132 ms, total: 389 ms
Wall time: 403 ms
About the dataset:
id: Unique integer id of each movie.
belongs_to_collection: TMDB id, name, movie poster, and backdrop URL of the movie's collection, in JSON format.
budget: Budget of the movie in dollars. Some rows contain 0, which means the budget is unknown.
genres: Names and TMDB ids of all the movie's genres, in JSON format.
homepage: Official URL of the movie.
imdb_id: IMDB id of the movie (string).
original_language: Two-letter code of the language in which the movie was originally made.
original_title: Title of the movie in original_language.
overview: Brief description of the movie.
popularity: Popularity of the movie.
poster_path: Poster path of the movie. You can see the full poster image by appending this path to the base URL https://image.tmdb.org/t/p/original/
production_companies: Names and TMDB ids of all the movie's production companies, in JSON format.
production_countries: Two-letter code and full name of each production country, in JSON format.
release_date: Release date of the movie in mm/dd/yy format.
runtime: Total runtime of the movie in minutes (integer).
spoken_languages: Two-letter code and full name of each spoken language.
status: Whether the movie is released or rumored.
tagline: Tagline of the movie.
title: English title of the movie.
Keywords: TMDB ids and names of all the keywords, in JSON format.
cast: TMDB id, name, character name, and gender (1 = Female, 2 = Male) of each cast member, in JSON format.
crew: Name, TMDB id, and profile path of crew members in various jobs such as Director, Writer, Art, Sound, etc.
revenue: Total revenue earned by the movie in dollars.
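A few of these column conventions need care before any analysis. The following is a minimal sketch on toy rows (not real data from the competition) of one way to handle them: treating a budget of 0 as missing, parsing the mm/dd/yy release dates, and building a full poster URL. The cutoff year, the sample values, and the assumption that poster_path begins with a slash are mine, not part of the dataset documentation.

```python
import numpy as np
import pandas as pd

# Toy rows that mimic the documented column conventions (hypothetical values).
sample = pd.DataFrame({
    "budget": [14000000, 0],
    "release_date": ["2/20/15", "3/9/65"],
    "poster_path": ["/abc.jpg", "/def.jpg"],
})

POSTER_BASE_URL = "https://image.tmdb.org/t/p/original/"

# A budget of 0 means "unknown", so mark it as missing instead of zero dollars.
sample["budget"] = sample["budget"].replace(0, np.nan)

# Parse mm/dd/yy dates. With %y, two-digit years 00-68 become 20xx,
# so an old film such as "3/9/65" is first read as 2065.
dates = pd.to_datetime(sample["release_date"], format="%m/%d/%y")

# Assumption: no movie in the data is released after this year,
# so any later date is really from the previous century.
CUTOFF_YEAR = 2019
in_future = dates.dt.year > CUTOFF_YEAR
sample["release_date"] = dates.where(~in_future, dates - pd.DateOffset(years=100))

# Join base URL and path without producing a double slash.
sample["poster_url"] = (
    POSTER_BASE_URL.rstrip("/") + "/" + sample["poster_path"].str.lstrip("/")
)

print(sample)
```

The date correction matters mostly for yearly trend plots, where a handful of films landing in the 2060s would otherwise distort the axis.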
Let’s have a look at the sample data.
train.head()
As we can see, some features contain dictionaries, so I am dropping all such columns for now, along with the free-text columns tagline, overview, and homepage.
train = train.drop(['belongs_to_collection', 'genres', 'crew', 'cast',
                    'Keywords', 'spoken_languages', 'production_companies',
                    'production_countries', 'tagline', 'overview', 'homepage'],
                   axis=1)
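If the dictionary columns are needed later, one alternative to dropping them is to parse them. The sketch below is illustrative only and is not part of this analysis. It assumes each cell is a string holding a Python-style list of dictionaries with a 'name' key. The helper extract_names and the toy values are hypothetical.

```python
import ast

import pandas as pd


def extract_names(cell):
    """Return the 'name' values from a stringified list of dicts, or [] if missing."""
    if pd.isna(cell):
        return []
    # literal_eval safely evaluates the string as a Python literal (no code execution).
    return [item["name"] for item in ast.literal_eval(cell)]


# Toy column shaped like the assumed format of `genres` (hypothetical values).
toy = pd.DataFrame({
    "genres": [
        "[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'name': 'Drama'}]",
        None,
    ],
})

toy["genre_names"] = toy["genres"].apply(extract_names)
print(toy["genre_names"].tolist())
```

If the cells turn out to be strict JSON (double quotes, null values), json.loads from the standard library would be the better fit.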
Now it’s time to have a look at the statistics of the data.
print("Shape of data is ")
train.shape
# Output
Shape of data is
(3000, 12)
Data frame information.