Python Data Analysis: A Kaggle Movie Dataset Case Study


Zhou Hongyan's blog


Category: Python study notes

Article link: https://blog.csdn.net/qq_24330285/article/details/80453264


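This article analyzes the Kaggle movie dataset to examine release years, the distribution of genres, and how revenue correlates with other metrics. 'Drama', 'Comedy' and 'Action' are the most common genres, and 'revenue' is strongly correlated with 'vote_count', 'budget' and other factors.

Summary generated automatically by CSDN.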

# Import the required packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import json

# Load the data (adjust the paths to wherever the CSV files are stored)
movies=pd.read_csv(r'E:\python\data\tmdb_5000_movies.csv',sep=',')
credit=pd.read_csv(r'E:\python\data\tmdb_5000_credits.csv',sep=',')

# Check that the id and title columns of the two tables really match, row by row
(movies['id']==credit['movie_id']).describe()

Output:
count 4803
unique 1
top True
freq 4803
dtype: object

(movies['title']==credit['title']).describe()

Output:
count 4803
unique 1
top True
freq 4803
Name: title, dtype: object
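The describe() output above shows a single unique value, True, for all 4803 rows. A more compact way to express the same check is to collapse the comparison with .all(). The following is a minimal sketch on toy data; the frames movies_demo and credit_demo and their values are hypothetical stand-ins, not the real tables.

```python
import pandas as pd

# Toy stand-ins for the two tables (hypothetical values, not the real data).
movies_demo = pd.DataFrame({"id": [101, 102, 103],
                            "title": ["Film A", "Film B", "Film C"]})
credit_demo = pd.DataFrame({"movie_id": [101, 102, 103],
                            "title": ["Film A", "Film B", "Film C"]})

# The == comparison is element-wise; .all() collapses it to one boolean.
ids_match = bool((movies_demo["id"] == credit_demo["movie_id"]).all())
titles_match = bool((movies_demo["title"] == credit_demo["title"]).all())

print(ids_match, titles_match)
```

Both flags must be True before it is safe to drop the duplicate columns and join the tables by row position.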

# Drop the redundant columns
del credit['movie_id']
del credit['title']
del movies['homepage']
del movies['spoken_languages']
del movies['original_language']
del movies['original_title']
del movies['overview']
del movies['tagline']
del movies['status']

# Combine the two datasets
full_df=pd.concat([credit,movies],axis=1)  # column-wise concatenation, aligned on the row index
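Column-wise concat is safe here only because the earlier check confirmed that both tables list the same movies in the same order. When that cannot be assumed, joining on the key column is an alternative. The sketch below is not the author's method; it uses hypothetical toy frames and would have to run before movie_id is deleted.

```python
import pandas as pd

# Toy frames (hypothetical values) whose rows are deliberately out of order.
credit_demo = pd.DataFrame({"movie_id": [2, 1], "cast": ["cast B", "cast A"]})
movies_demo = pd.DataFrame({"id": [1, 2], "budget": [100, 200]})

# Join on the key columns; validate raises if a key is duplicated on either side.
merged = movies_demo.merge(
    credit_demo,
    left_on="id",
    right_on="movie_id",
    how="inner",
    validate="one_to_one",
)

# The two key columns now carry the same values, so drop one of them.
merged = merged.drop(columns="movie_id")
print(merged)
```

With a key-based merge the result does not depend on row order, at the cost of an extra key column to drop afterwards.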

# Handle missing values: first locate them, then fill them in
nan_x=full_df['runtime'].isnull()
full_df.loc[nan_x,:]

[Screenshot omitted: the rows of full_df where runtime is missing]

# Look up the correct values online and fill them in
full_df.loc[2656,'runtime']=98
full_df.loc[4140,'runtime']=82
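To confirm that manual fills like these took effect, the missing values can be counted per column before and after. This is a minimal sketch on a toy frame; the name demo, its values and its index labels are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical values; the index labels are arbitrary.
demo = pd.DataFrame(
    {"title": ["A", "B", "C"], "runtime": [98.0, np.nan, 120.0]},
    index=[10, 20, 30],
)

# Count missing values per column to see where the gaps are.
missing_before = demo.isnull().sum()

# Fill one cell by index label, as the article does with .loc.
demo.loc[20, "runtime"] = 82

# Recount to verify that the gap is closed.
missing_after = demo.isnull().sum()
print(missing_before["runtime"], missing_after["runtime"])
```

Running isnull().sum() on the whole frame also reveals gaps in columns that have not been inspected individually.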

# Handle the missing value in release_date the same way
nan_y=full_df['release_date'].isnull()
full_df.loc[nan_y,:]

[Screenshot omitted: the row of full_df where release_date is missing]

# Again, look up the correct value online and fill it in
full_df.loc[4553,'release_date']='2014-06-01'

# Convert release_date to a datetime type
full_df['release_date']=pd.to_datetime(full_df['release_date'],errors='coerce',format='%Y-%m-%d')
full_df.info()

Output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4803 entries, 0 to 4802
Data columns (total 15 columns):
cast                    4803 non-null object
crew                    4803 non-null object
budget                  4803 non-null int64
genres                  4803 non-null object
id                      4803 non-null int64
keywords                4803 non-null object
popularity              4803 non-null float64
production_companies    4803 non-null object
production_countries    4803 non-null object
release_date            4803 non-null datetime64[ns]
revenue                 4803 non-null int64
runtime                 4803 non-null float64
title                   4803 non-null object
vote_average            4803 non-null float64
vote_count              4803 non-null int64
dtypes: datetime64[ns](1), float64(3), int64(4), object(7)
memory usage: 562.9+ KB
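Note that errors='coerce' turns any value that does not match the format into NaT instead of raising an error. The info() output above shows 4803 non-null dates, so nothing was lost in this case, but the check is worth making explicit. A minimal sketch on a hypothetical toy series:

```python
import pandas as pd

# Toy input (hypothetical): one valid date, one malformed string, one missing value.
raw = pd.Series(["2009-12-10", "not a date", None])

# Unparseable entries become NaT rather than raising an exception.
parsed = pd.to_datetime(raw, errors="coerce", format="%Y-%m-%d")

# Entries that were present in the input but failed to parse.
newly_coerced = parsed.isna() & raw.notna()

print(int(parsed.isna().sum()), int(newly_coerced.sum()))
```

Comparing the null mask before and after conversion separates values that were already missing from values that were silently discarded by the parser.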

# After converting to datetime, extract the release year
full_df['release_year']=full_df['release_date'].map(lambda x : x.year)
full_df.loc[:,'release_year'].head()

Output:
0 2009
1 2007
2 2015
3 2012
4 2012
Name: release_year, dtype: int64
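For a datetime column, the .dt accessor gives the same result as the map/lambda approach without calling a Python function per row. A minimal sketch, using illustrative dates rather than rows taken from the dataset:

```python
import pandas as pd

# Illustrative dates only; not read from the dataset.
dates = pd.to_datetime(pd.Series(["2009-12-10", "2007-05-19", "2015-10-26"]))

# Vectorized extraction of the year component.
years = dates.dt.year

print(years.tolist())
```

The integer width of the result can differ between pandas versions, so downstream code should not rely on it being int64.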

# Parse the JSON-formatted columns
# json.loads converts a JSON string into Python objects (here, a list of dicts)
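The original code for this step is not shown, so the following is only a sketch of how such a column could be parsed. It assumes each cell of genres holds a JSON array of objects with a "name" key; the helper extract_names, the frame demo and the new column genre_names are hypothetical names introduced for illustration.

```python
import json

import pandas as pd

# Toy column in the assumed format: a JSON array of {"id": ..., "name": ...} objects.
demo = pd.DataFrame({
    "genres": [
        '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]',
        '[{"id": 18, "name": "Drama"}]',
        '[]',
    ]
})


def extract_names(json_text):
    """Parse a JSON array string and return the list of its 'name' values."""
    return [item["name"] for item in json.loads(json_text)]


# One Python list of genre names per movie.
demo["genre_names"] = demo["genres"].apply(extract_names)

# One row per (movie, genre) pair, then count how often each genre occurs.
# Movies with an empty list produce NaN, which value_counts ignores.
genre_counts = demo["genre_names"].explode().value_counts()
print(genre_counts)
```

The same helper could in principle be applied to other columns with the same structure, such as keywords or production_companies, provided they follow the assumed format.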
