[Case Study] Movie Data Analysis

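Dataset

The dataset (MovieLens 100K, the ml-100k archive) consists of three delimited, CSV-style text files. Together they contain movie titles, release dates, information about the users who rated, and the ratings themselves.

http://grouplens.org/datasets/movielens/ (the dataset used here is listed under "older datasets")

  1. Ratings table (u.data): user id, item id, rating, timestamp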
    196 242 3 881250949
    186 302 3 891717742
    22 377 1 878887116
    244 51 2 880606923
    166 346 1 886397596
    298 474 4 884182806
    115 265 2 881171488
    253 465 5 891628467
    305 451 3 886324817

  2. User information table (u.user): user id | age | gender | occupation | zip code
    1|24|M|technician|85711
    2|53|F|other|94043
    3|23|M|writer|32067
    4|24|M|technician|43537
    5|33|F|other|15213

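  3. Movies table (u.item): movie id | movie title | release date | video release date | IMDb URL | genre flags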
    1|Toy Story (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Toy%20Story%20(1995)|0|0|0|1|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0
    2|GoldenEye (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?GoldenEye%20(1995)|0|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0|1|0|0
    3|Four Rooms (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Four%20Rooms%20(1995)|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|1|0|0
    4|Get Shorty (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Get%20Shorty%20(1995)|0|1|0|0|0|1|0|0|1|0|0|0|0|0|0|0|0|0|0
    5|Copycat (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Copycat%20(1995)|0|0|0|0|0|0|1|0|1|0|0|0|0|0|0|0|1|0|0
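
As a quick check of the file layouts shown above, here is a minimal sketch that parses two rows from each excerpt straight from memory. It assumes the u.data fields are tab-separated (the excerpt shows them with spaces) and reads the zip code as text; the variable names are illustrative only.

```python
import io
import pandas as pd

# Two rows from each excerpt above; u.data fields are assumed to be tab-separated.
USER_SAMPLE = "1|24|M|technician|85711\n2|53|F|other|94043\n"
RATING_SAMPLE = "196\t242\t3\t881250949\n186\t302\t3\t891717742\n"

users_sample = pd.read_csv(
    io.StringIO(USER_SAMPLE), sep="|",
    names=["user_id", "age", "gender", "occupation", "zip_cod"],
    dtype={"zip_cod": str},  # keep zip codes as text so leading zeros survive
)
ratings_sample = pd.read_csv(
    io.StringIO(RATING_SAMPLE), sep="\t",
    names=["user_id", "movie_id", "rating", "unix_timestamp"],
)
print(users_sample.dtypes)
print(ratings_sample)
```

Reading the zip code as a string matters because some zip codes start with 0 and a few are not numeric at all.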

import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
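# Enable Chinese characters in plots
mpl.rcParams['font.family']='Kaiti'
# Render the minus sign as ASCII '-' instead of the Unicode minus; needed once a Chinese font is set
mpl.rcParams['axes.unicode_minus']=False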
%matplotlib inline

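Loading the data

Three things to consider when loading the data:

  1. The DataFrame's header (that is, its column index)
  2. How the fields are delimited
  3. Which key columns to join on when merging the tables

# Load user info: user id | age | gender | occupation | zip code
user_cols=['user_id','age','gender','occupation','zip_cod'] # column names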
users = pd.read_csv('data/ml-100k/u.user',sep='|',names=user_cols,encoding='latin-1')

# Load movie info: movie id | movie title | release date | video release date | IMDb URL
movie_cols=['movie_id','movie_title','release_date','video_release_date','imdb_url'] # column names
movies = pd.read_csv('data/ml-100k/u.item',sep='|',names=movie_cols,usecols=range(5),encoding='latin-1') # usecols=range(5): read only the first five columns

# Load rating info: user id | item id | rating | timestamp
rating_cols=['user_id','movie_id','rating','unix_timestamp'] # column names
ratings = pd.read_csv('data/ml-100k/u.data',sep='\t',names=rating_cols,encoding='latin-1')


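# Merge the three DataFrames to make the later group-by statistics easier
# First merge the users table with the ratings table (joined on the shared user_id column)
user_ratings = pd.merge(users,ratings)
# Then merge that result with the movies table (joined on the shared movie_id column)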
data = pd.merge(user_ratings,movies)
data
       user_id  age  gender  occupation     zip_cod  movie_id  rating  unix_timestamp  movie_title                                release_date  video_release_date  imdb_url
    0        1   24  M       technician     85711          61       4       878542420  Three Colors: White (1994)                 01-Jan-1994   NaN                 http://us.imdb.com/M/title-exact?Trzy%20kolory...
    1       13   47  M       educator       29206          61       4       882140552  Three Colors: White (1994)                 01-Jan-1994   NaN                 http://us.imdb.com/M/title-exact?Trzy%20kolory...
    2       18   35  F       other          37212          61       4       880130803  Three Colors: White (1994)                 01-Jan-1994   NaN                 http://us.imdb.com/M/title-exact?Trzy%20kolory...
    3       58   27  M       programmer     52246          61       5       884305271  Three Colors: White (1994)                 01-Jan-1994   NaN                 http://us.imdb.com/M/title-exact?Trzy%20kolory...
    4       59   49  M       educator       08403          61       4       888204597  Three Colors: White (1994)                 01-Jan-1994   NaN                 http://us.imdb.com/M/title-exact?Trzy%20kolory...
    5       60   50  M       healthcare     06472          61       4       883326652  Three Colors: White (1994)                 01-Jan-1994   NaN                 http://us.imdb.com/M/title-exact?Trzy%20kolory...
    6       76   20  M       student        02215          61       4       875028123  Three Colors: White (1994)                 01-Jan-1994   NaN                 http://us.imdb.com/M/title-exact?Trzy%20kolory...
    7       94   26  M       student        71457          61       5       891720761  Three Colors: White (1994)                 01-Jan-1994   NaN                 http://us.imdb.com/M/title-exact?Trzy%20kolory...
    8      144   53  M       programmer     20910          61       3       888106182  Three Colors: White (1994)                 01-Jan-1994   NaN                 http://us.imdb.com/M/title-exact?Trzy%20kolory...
    9      154   25  M       student        53703          61       4       879138657  Three Colors: White (1994)                 01-Jan-1994   NaN                 http://us.imdb.com/M/title-exact?Trzy%20kolory...
   10      160   27  M       programmer     66215          61       4       876861799  Three Colors: White (1994)                 01-Jan-1994   NaN                 http://us.imdb.com/M/title-exact?Trzy%20kolory...
   11      189   32  M       artist         95014          61       3       893265826  Three Colors: White (1994)                 01-Jan-1994   NaN                 http://us.imdb.com/M/title-exact?Trzy%20kolory...
   12      195   42  M       scientist      93555          61       3       888737277  Three Colors: White (1994)                 01-Jan-1994   NaN                 http://us.imdb.com/M/title-exact?Trzy%20kolory...
   13      201   27  M       writer         E2A4H          61       2       884111986  Three Colors: White (1994)                 01-Jan-1994   NaN                 http://us.imdb.com/M/title-exact?Trzy%20kolory...
   14      257   17  M       student        77005          61       5       879547534  Three Colors: White (1994)                 01-Jan-1994   NaN                 http://us.imdb.com/M/title-exact?Trzy%20kolory...
   15      268   24  M       engineer       19422          61       4       875309282  Three Colors: White (1994)                 01-Jan-1994   NaN                 http://us.imdb.com/M/title-exact?Trzy%20kolory...
   16      279   33  M       programmer     85251          61       4       875306552  Three Colors: White (1994)                 01-Jan-1994   NaN                 http://us.imdb.com/M/title-exact?Trzy%20kolory...
   17      296   43  F       administrator  16803          61       3       884197287  Three Colors: White (1994)                 01-Jan-1994   NaN                 http://us.imdb.com/M/title-exact?Trzy%20kolory...
   18      299   29  M       doctor         63108          61       4       877880648  Three Colors: White (1994)                 01-Jan-1994   NaN                 http://us.imdb.com/M/title-exact?Trzy%20kolory...
   19      305   23  M       programmer     94086          61       4       886323378  Three Colors: White (1994)                 01-Jan-1994   NaN                 http://us.imdb.com/M/title-exact?Trzy%20kolory...
   20      308   60  M       retired        95076          61       3       887739336  Three Colors: White (1994)                 01-Jan-1994   NaN                 http://us.imdb.com/M/title-exact?Trzy%20kolory...
   21      321   49  F       educator       55413          61       5       879441128  Three Colors: White (1994)                 01-Jan-1994   NaN                 http://us.imdb.com/M/title-exact?Trzy%20kolory...
   22      334   32  M       librarian      30002          61       3       891550409  Three Colors: White (1994)                 01-Jan-1994   NaN                 http://us.imdb.com/M/title-exact?Trzy%20kolory...
   23      354   29  F       librarian      48197          61       5       891218091  Three Colors: White (1994)                 01-Jan-1994   NaN                 http://us.imdb.com/M/title-exact?Trzy%20kolory...
   24      380   32  M       engineer       55117          61       4       885478193  Three Colors: White (1994)                 01-Jan-1994   NaN                 http://us.imdb.com/M/title-exact?Trzy%20kolory...
   25      385   36  M       writer         10003          61       2       879441572  Three Colors: White (1994)                 01-Jan-1994   NaN                 http://us.imdb.com/M/title-exact?Trzy%20kolory...
   26      387   33  M       entertainment  37412          61       3       886483565  Three Colors: White (1994)                 01-Jan-1994   NaN                 http://us.imdb.com/M/title-exact?Trzy%20kolory...
   27      391   23  M       student        84604          61       5       877399746  Three Colors: White (1994)                 01-Jan-1994   NaN                 http://us.imdb.com/M/title-exact?Trzy%20kolory...
   28      405   22  F       healthcare     10019          61       1       885549589  Three Colors: White (1994)                 01-Jan-1994   NaN                 http://us.imdb.com/M/title-exact?Trzy%20kolory...
   29      409   48  M       administrator  98225          61       4       881109420  Three Colors: White (1994)                 01-Jan-1994   NaN                 http://us.imdb.com/M/title-exact?Trzy%20kolory...
  ...
99970      894   47  M       educator       74075        1658       4       882404137  Substance of Fire, The (1996)              06-Dec-1996   NaN                 http://us.imdb.com/M/title-exact?Substance%20o...
99971      747   19  M       other          93612        1660       2       888640731  Small Faces (1995)                         09-Aug-1996   NaN                 http://us.imdb.com/M/title-exact?Small%20Faces...
99972      747   19  M       other          93612        1659       1       888733313  Getting Away With Murder (1996)            12-Apr-1996   NaN                 http://us.imdb.com/Title?Getting+Away+With+Mur...
99973      751   24  F       other          90034        1661       1       889299429  New Age, The (1994)                        01-Jan-1994   NaN                 http://us.imdb.com/M/title-exact?New%20Age,%20...
99974      762   32  M       administrator  95050        1662       1       878719324  Rough Magic (1995)                         30-May-1997   NaN                 http://us.imdb.com/M/title-exact?Rough%20Magic...
99975      782   21  F       artist         33205        1662       4       891500110  Rough Magic (1995)                         30-May-1997   NaN                 http://us.imdb.com/M/title-exact?Rough%20Magic...
99976      782   21  F       artist         33205        1669       2       891500150  MURDER and murder (1996)                   20-Jun-1997   NaN                 http://us.imdb.com/M/title-exact?MURDER+and+mu...
99977      782   21  F       artist         33205        1663       2       891499700  Nothing Personal (1995)                    30-Apr-1997   NaN                 http://us.imdb.com/M/title-exact?Nothing%20Per...
99978      782   21  F       artist         33205        1666       2       891500194  Ripe (1996)                                02-May-1997   NaN                 http://us.imdb.com/M/title-exact?Ripe%20%28199...
99979      782   21  F       artist         33205        1668       3       891500067  Wedding Bell Blues (1996)                  13-Jun-1997   NaN                 http://us.imdb.com/M/title-exact?Wedding%20Bel...
99980      782   21  F       artist         33205        1664       4       891499699  8 Heads in a Duffel Bag (1997)             18-Apr-1997   NaN                 http://us.imdb.com/Title?8+Heads+in+a+Duffel+B...
99981      839   38  F       entertainment  90814        1664       1       875752902  8 Heads in a Duffel Bag (1997)             18-Apr-1997   NaN                 http://us.imdb.com/Title?8+Heads+in+a+Duffel+B...
99982      870   22  M       student        65203        1664       4       890057322  8 Heads in a Duffel Bag (1997)             18-Apr-1997   NaN                 http://us.imdb.com/Title?8+Heads+in+a+Duffel+B...
99983      880   13  M       student        83702        1664       4       892958799  8 Heads in a Duffel Bag (1997)             18-Apr-1997   NaN                 http://us.imdb.com/Title?8+Heads+in+a+Duffel+B...
99984      782   21  F       artist         33205        1665       2       891500194  Brother's Kiss, A (1997)                   25-Apr-1997   NaN                 http://us.imdb.com/M/title-exact?Brother%27s%2...
99985      782   21  F       artist         33205        1670       3       891497793  Tainted (1998)                             01-Feb-1998   NaN                 http://us.imdb.com/M/title-exact?Tainted+(1998)
99986      782   21  F       artist         33205        1667       3       891500110  Next Step, The (1995)                      13-Jun-1997   NaN                 http://us.imdb.com/M/title-exact?Next%20Step%2...
99987      787   18  F       student        98620        1671       1       888980193  Further Gesture, A (1996)                  20-Feb-1998   NaN                 http://us.imdb.com/M/title-exact?Further+Gestu...
99988      828   28  M       librarian      85282        1672       2       891037722  Kika (1993)                                01-Jan-1993   NaN                 http://us.imdb.com/M/title-exact?Kika%20(1993)
99989      896   28  M       writer         91505        1672       2       887159554  Kika (1993)                                01-Jan-1993   NaN                 http://us.imdb.com/M/title-exact?Kika%20(1993)
99990      835   44  F       executive      11577        1673       3       891034023  Mirage (1995)                              01-Jan-1995   NaN                 http://us.imdb.com/M/title-exact?Mirage%20(1995)
99991      840   39  M       artist         55406        1674       4       891211682  Mamma Roma (1962)                          01-Jan-1962   NaN                 http://us.imdb.com/M/title-exact?Mamma%20Roma%...
99992      851   18  M       other          29646        1676       2       875731674  War at Home, The (1996)                    01-Jan-1996   NaN                 http://us.imdb.com/M/title-exact?War%20at%20Ho...
99993      851   18  M       other          29646        1675       3       884222085  Sunchaser, The (1996)                      25-Oct-1996   NaN                 http://us.imdb.com/M/title-exact?Sunchaser,%20...
99994      854   29  F       student        55408        1677       3       882814368  Sweet Nothing (1995)                       20-Sep-1996   NaN                 http://us.imdb.com/M/title-exact?Sweet%20Nothi...
99995      863   17  M       student        60089        1679       3       889289491  B. Monkey (1998)                           06-Feb-1998   NaN                 http://us.imdb.com/M/title-exact?B%2E+Monkey+(...
99996      863   17  M       student        60089        1678       1       889289570  Mat' i syn (1997)                          06-Feb-1998   NaN                 http://us.imdb.com/M/title-exact?Mat%27+i+syn+...
99997      863   17  M       student        60089        1680       2       889289570  Sliding Doors (1998)                       01-Jan-1998   NaN                 http://us.imdb.com/Title?Sliding+Doors+(1998)
99998      896   28  M       writer         91505        1681       3       887160722  You So Crazy (1994)                        01-Jan-1994   NaN                 http://us.imdb.com/M/title-exact?You%20So%20Cr...
99999      916   27  M       engineer       N2L5N        1682       3       880845755  Scream of Stone (Schrei aus Stein) (1991)  08-Mar-1996   NaN                 http://us.imdb.com/M/title-exact?Schrei%20aus%...

100000 rows × 12 columns
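
The two `pd.merge` calls rely on pandas picking the shared column names as join keys. A hedged sketch of the same join with the keys spelled out is shown here on tiny made-up tables; the `validate` arguments are an optional safety check, not something the original code uses.

```python
import pandas as pd

# Toy stand-ins for the three tables (ratings and ages are made up for illustration).
users = pd.DataFrame({"user_id": [1, 2], "age": [24, 53]})
ratings = pd.DataFrame({"user_id": [1, 1, 2], "movie_id": [1, 2, 1], "rating": [4, 3, 5]})
movies = pd.DataFrame({"movie_id": [1, 2],
                       "movie_title": ["Toy Story (1995)", "GoldenEye (1995)"]})

# Name the join keys explicitly and let pandas verify the expected cardinality:
# one user has many ratings, and many ratings point at one movie.
merged = (
    users.merge(ratings, on="user_id", how="inner", validate="one_to_many")
         .merge(movies, on="movie_id", how="inner", validate="many_to_one")
)
print(merged)
```

If a key were accidentally duplicated in `users` or `movies`, `validate` would raise an error instead of silently multiplying rows.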

Data exploration and cleaning

# data.describe()
# data['gender'].value_counts()

# Handle missing values
# data.shape  (100000, 12)
# data.isnull().sum()

# Drop video_release_date, the only column that is entirely NaN
data.dropna(axis=1,how='all',inplace=True)
data.isnull().sum()

# Check for duplicate rows
# data.duplicated().any()
user_id            0
age                0
gender             0
occupation         0
zip_cod            0
movie_id           0
rating             0
unix_timestamp     0
movie_title        0
release_date       9
imdb_url          13
dtype: int64
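
To see what `dropna(axis=1, how='all')` does in isolation, here is a small sketch on a made-up frame: a column is dropped only when every value in it is missing, so a partly filled column such as `release_date` survives.

```python
import numpy as np
import pandas as pd

# Made-up rows: one column is entirely missing, another only partly.
toy = pd.DataFrame({
    "movie_title": ["A", "B", "C"],
    "release_date": ["01-Jan-1995", None, "01-Jan-1994"],
    "video_release_date": [np.nan, np.nan, np.nan],
})

# axis=1 works on columns; how="all" drops a column only if all values are NaN.
cleaned = toy.dropna(axis=1, how="all")
missing = cleaned.isnull().sum()      # remaining missing values per column
has_dupes = cleaned.duplicated().any()  # True if any row repeats an earlier one
print(missing)
print(has_dupes)
```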

Most-rated movies

				### Approach 1 ###
# Group by movie title and count the rows in each group, which gives the number of ratings
# g = data.groupby('movie_title')
# count = 0
# for k,v in g:
#     if count==3:
#         break
#     display(k,v)
#     count+=1

# data.groupby('movie_title').size().sort_values(ascending=False).head(20)

				### Approach 2 ###
data['movie_title'].value_counts().head(20).plot(kind='bar')

[Figure: bar chart of the 20 movies with the most ratings]
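
The two approaches give the same counts. A minimal check on made-up titles:

```python
import pandas as pd

# Made-up titles: "A" is rated three times, "B" twice, "C" once.
toy = pd.DataFrame({"movie_title": ["A", "B", "A", "C", "A", "B"]})

# Approach 1: group by title, count rows per group, sort descending.
by_group = toy.groupby("movie_title").size().sort_values(ascending=False)
# Approach 2: value_counts does the grouping, counting and sorting in one call.
by_counts = toy["movie_title"].value_counts()

print(by_group)
print(by_counts)
```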

Highest-rated movies

# Group by movie title and take the mean rating of each group; the higher the mean, the higher-rated the movie
movie_scores = data.groupby('movie_title').agg({'rating':['size','mean']})
# movie_scores
# Filter: keep only movies with at least 100 ratings
size_more_100 = movie_scores['rating']['size']>=100
d=movie_scores[size_more_100].sort_values([('rating','mean')],ascending=False).head(10)

d['rating']['mean'].plot(kind='bar',title='Highest-rated movies')
<matplotlib.axes._subplots.AxesSubplot at 0x89e1b38>

[Figure: bar chart of the 10 highest-rated movies among those with at least 100 ratings]
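
The dict form of `agg` produces two-level column labels such as `('rating','mean')`. As an alternative sketch (not the approach used above), named aggregation gives flat column names; the data, the names `n_ratings` and `mean_rating`, and the threshold of 2 are made up for illustration.

```python
import pandas as pd

# Made-up ratings: A has 3 ratings, B has 2, C has only 1.
toy = pd.DataFrame({
    "movie_title": ["A", "A", "A", "B", "B", "C"],
    "rating": [5, 4, 3, 2, 4, 5],
})

# Named aggregation: each keyword becomes a flat output column.
scores = toy.groupby("movie_title").agg(
    n_ratings=("rating", "count"),
    mean_rating=("rating", "mean"),
)

MIN_RATINGS = 2  # hypothetical threshold; the analysis above uses 100
top = scores[scores["n_ratings"] >= MIN_RATINGS].sort_values("mean_rating", ascending=False)
print(top)
```

The minimum-count filter is what keeps a movie with a single 5-star rating (C here) from topping the list.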

The 100 movies with the most ratings

data['movie_title'].value_counts().head(100)
Star Wars (1977)                                583
Contact (1997)                                  509
Fargo (1996)                                    508
Return of the Jedi (1983)                       507
Liar Liar (1997)                                485
English Patient, The (1996)                     481
Scream (1996)                                   478
Toy Story (1995)                                452
Air Force One (1997)                            431
Independence Day (ID4) (1996)                   429
Raiders of the Lost Ark (1981)                  420
Godfather, The (1972)                           413
Pulp Fiction (1994)                             394
Twelve Monkeys (1995)                           392
Silence of the Lambs, The (1991)                390
Jerry Maguire (1996)                            384
Chasing Amy (1997)                              379
Rock, The (1996)                                378
Empire Strikes Back, The (1980)                 367
Star Trek: First Contact (1996)                 365
Back to the Future (1985)                       350
Titanic (1997)                                  350
Mission: Impossible (1996)                      344
Fugitive, The (1993)                            336
Indiana Jones and the Last Crusade (1989)       331
Willy Wonka and the Chocolate Factory (1971)    326
Princess Bride, The (1987)                      324
Forrest Gump (1994)                             321
Monty Python and the Holy Grail (1974)          316
Saint, The (1997)                               316
                                               ... 
Wizard of Oz, The (1939)                        246
Phenomenon (1996)                               244
Star Trek: The Wrath of Khan (1982)             244
Casablanca (1942)                               243
Die Hard (1988)                                 243
Sting, The (1973)                               241
Devil's Own, The (1997)                         240
Dante's Peak (1997)                             240
Psycho (1960)                                   239
Graduate, The (1967)                            239
Seven (Se7en) (1995)                            236
Time to Kill, A (1996)                          232
It's a Wonderful Life (1946)                    231
Speed (1994)                                    230
In & Out (1997)                                 230
Stand by Me (1986)                              227
Hunt for Red October, The (1990)                227
GoodFellas (1990)                               226
Heat (1995)                                     223
Sound of Music, The (1965)                      222
Apocalypse Now (1979)                           221
Clockwork Orange, A (1971)                      221
Courage Under Fire (1996)                       221
Top Gun (1986)                                  220
Lion King, The (1994)                           220
To Kill a Mockingbird (1962)                    219
Volcano (1997)                                  219
Babe (1995)                                     219
Aladdin (1992)                                  219
Murder at 1600 (1997)                           218
Name: movie_title, Length: 100, dtype: int64

Ratings versus age

# Quick look at the age distribution
# data['age'].describe()
# data['age'].plot(kind='hist',bins=30,figsize=(12,6))

# Define custom age bins and look at how each age group rates movies overall
# 0-9  10-19 20-29 ...70-79
# np.arange(0,81,10)
labels = ['0-9','10-19','20-29','30-39','40-49','50-59','60-69','70-79']
data['age_group'] = pd.cut(data['age'],np.arange(0,81,10),right=False,labels=labels)
data.groupby('age_group').agg({'rating':['size','mean']})
# In the result, older age groups tend to give higher ratings

[Figure: table of rating count and mean rating per age group]
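
The `right=False` argument decides which bin a boundary age such as 10 or 20 lands in. A small sketch with made-up ages shows the difference; the labels are built the same way as the hand-written list above.

```python
import numpy as np
import pandas as pd

ages = pd.Series([7, 10, 19, 20, 73])       # made-up ages, including bin boundaries
bins = np.arange(0, 81, 10)                 # edges 0, 10, ..., 80
labels = [f"{lo}-{lo + 9}" for lo in bins[:-1]]  # '0-9', '10-19', ..., '70-79'

# right=False gives left-closed bins [0, 10), [10, 20), ... so 10 falls in '10-19'.
left_closed = pd.cut(ages, bins, right=False, labels=labels)
# The default right=True gives (0, 10], (10, 20], ... so 10 would be labeled '0-9'.
right_closed = pd.cut(ages, bins, labels=labels)

# An age equal to the last edge is outside every left-closed bin and becomes NaN.
edge_case = pd.cut(pd.Series([80]), bins, right=False, labels=labels)

print(left_closed.tolist())
print(right_closed.tolist())
```

With the default right-closed bins the labels would no longer match their contents, which is why the analysis passes `right=False`.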

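# Ratings of individual movies by age group
# Only the 100 most-rated movies are analyzed, so first: how do we get those 100 movies?
# Get the ids of the 100 most-rated movies, set the movie_id column as the row index, then use that array of ids as an indexer to pull out their rows
# How do we get one movie's rating for each age group?
#                0-9  10-19  ....
# Air Force One  4.1   4.2   ....
# Unstack the innermost row index level into column labels
# outer row index  inner row index
# Air Force One    0-9        4.1
#                  10-19      4.2
# Group by movie_title and age_group, then take the mean of rating within each group

# Get the 100 most-rated movies
# Get the ids of the 100 most-rated movies
most_100 = data['movie_id'].value_counts().head(100)
# most_100
# Set the movie_id column as the row index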
data.set_index('movie_id',inplace=True)
by_age_group = data.loc[most_100.index].groupby(['movie_title','age_group'])
by_age_group['rating'].mean().unstack(1).fillna(0).head(10)

[Figure: mean rating per age group for the first 10 of the 100 most-rated movies]
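
The reshaping step can be tried on a handful of made-up rows. This sketch selects the most-rated movies with `isin` rather than by setting `movie_id` as the index (an alternative that leaves the DataFrame unmodified), then groups and unstacks as described in the comments above.

```python
import pandas as pd

# Made-up ratings: movie 1 has 3 ratings, movie 2 has 2, movie 3 has 1.
toy = pd.DataFrame({
    "movie_id": [1, 1, 1, 2, 2, 3],
    "movie_title": ["A", "A", "A", "B", "B", "C"],
    "age_group": ["20-29", "20-29", "30-39", "20-29", "30-39", "20-29"],
    "rating": [4, 2, 5, 3, 1, 5],
})

# Ids of the 2 most-rated movies (the analysis above takes the top 100).
top_ids = toy["movie_id"].value_counts().head(2).index
subset = toy[toy["movie_id"].isin(top_ids)]

# Mean rating per (title, age group), then move age_group from the rows to the columns.
table = subset.groupby(["movie_title", "age_group"])["rating"].mean().unstack("age_group")
print(table)
```

Note that `fillna(0)` in the original turns "no ratings from this age group" into a mean of 0; leaving those cells as NaN keeps the two cases distinguishable.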
