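Dataset
The dataset consists of three CSV-style text files containing movie titles, release dates, information about the users who rated, the ratings themselves, and related fields.
http://grouplens.org/datasets/movielens/ (the dataset used is listed under "older datasets")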
- Ratings table (u.data)
196 242 3 881250949
186 302 3 891717742
22 377 1 878887116
244 51 2 880606923
166 346 1 886397596
298 474 4 884182806
115 265 2 881171488
253 465 5 891628467
305 451 3 886324817
- User information table (u.user)
1|24|M|technician|85711
2|53|F|other|94043
3|23|M|writer|32067
4|24|M|technician|43537
5|33|F|other|15213
- Movie table (u.item)
1|Toy Story (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Toy%20Story%20(1995)|0|0|0|1|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0
2|GoldenEye (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?GoldenEye%20(1995)|0|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0|1|0|0
3|Four Rooms (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Four%20Rooms%20(1995)|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|1|0|0
4|Get Shorty (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Get%20Shorty%20(1995)|0|1|0|0|0|1|0|0|1|0|0|0|0|0|0|0|0|0|0
5|Copycat (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Copycat%20(1995)|0|0|0|0|0|0|1|0|1|0|0|0|0|0|0|0|1|0|0
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
# Enable Chinese characters in plots
mpl.rcParams['font.family']='Kaiti'
# Use the ASCII minus sign instead of the Unicode one; required when a Chinese font is set
mpl.rcParams['axes.unicode_minus']=False
%matplotlib inline
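Data loading
Three things to consider when loading the data:
- The DataFrame headers (that is, the column index)
- How the data is delimited
- Which condition (key) to merge on when combining the data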
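The three points above can be sketched on a few in-memory rows copied from the samples earlier in this document. Reading from `io.StringIO` instead of the real files is an assumption made only so the snippet runs without the download, and the `sample_*` names and the `zip_code` spelling are illustrative. Passing `dtype` for the zip code is an optional extra that keeps the values as strings.

```python
import io
import pandas as pd

# A few rows copied from the u.user and u.data samples shown above.
user_text = "1|24|M|technician|85711\n2|53|F|other|94043\n"
rating_text = "196\t242\t3\t881250949\n186\t302\t3\t891717742\n"

# 1) The files have no header row, so column names are supplied via `names`.
# 2) u.user is pipe-delimited and u.data is tab-delimited, hence the two `sep` values.
sample_users = pd.read_csv(
    io.StringIO(user_text), sep='|',
    names=['user_id', 'age', 'gender', 'occupation', 'zip_code'],
    dtype={'zip_code': str},  # optional: keep zip codes as text
)
sample_ratings = pd.read_csv(
    io.StringIO(rating_text), sep='\t',
    names=['user_id', 'movie_id', 'rating', 'unix_timestamp'],
)

# 3) The merge condition is the column that both tables share.
shared = sorted(set(sample_users.columns) & set(sample_ratings.columns))
print(shared)
```

Checking the shared columns before merging is a cheap way to confirm which key `pd.merge` would pick up by default.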
# Load user information: user id | age | gender | occupation | zip code
user_cols=['user_id','age','gender','occupation','zip_cod'] # column names
users = pd.read_csv('data/ml-100k/u.user',sep='|',names=user_cols,encoding='latin-1')
# Load movie information: movie id | movie title | release date | video release date | IMDb URL
movie_cols=['movie_id','movie_title','release_date','video_release_date','imdb_url'] # column names
movies = pd.read_csv('data/ml-100k/u.item',sep='|',names=movie_cols,usecols=range(5),encoding='latin-1') # usecols=range(5): read only the first five columns
# Load rating information: user id | item id | rating | timestamp
rating_cols=['user_id','movie_id','rating','unix_timestamp'] # column names
ratings = pd.read_csv('data/ml-100k/u.data',sep='\t',names=rating_cols,encoding='latin-1')
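# Merge the three DataFrames to make the later group-by statistics easier
# First merge the users table with the ratings table on their shared user_id column
user_ratings = pd.merge(users,ratings,on='user_id')
# Then merge that result with the movies table on movie_id
data = pd.merge(user_ratings,movies,on='movie_id')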
data
| | user_id | age | gender | occupation | zip_cod | movie_id | rating | unix_timestamp | movie_title | release_date | video_release_date | imdb_url |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 24 | M | technician | 85711 | 61 | 4 | 878542420 | Three Colors: White (1994) | 01-Jan-1994 | NaN | http://us.imdb.com/M/title-exact?Trzy%20kolory... |
1 | 13 | 47 | M | educator | 29206 | 61 | 4 | 882140552 | Three Colors: White (1994) | 01-Jan-1994 | NaN | http://us.imdb.com/M/title-exact?Trzy%20kolory... |
2 | 18 | 35 | F | other | 37212 | 61 | 4 | 880130803 | Three Colors: White (1994) | 01-Jan-1994 | NaN | http://us.imdb.com/M/title-exact?Trzy%20kolory... |
3 | 58 | 27 | M | programmer | 52246 | 61 | 5 | 884305271 | Three Colors: White (1994) | 01-Jan-1994 | NaN | http://us.imdb.com/M/title-exact?Trzy%20kolory... |
4 | 59 | 49 | M | educator | 08403 | 61 | 4 | 888204597 | Three Colors: White (1994) | 01-Jan-1994 | NaN | http://us.imdb.com/M/title-exact?Trzy%20kolory... |
5 | 60 | 50 | M | healthcare | 06472 | 61 | 4 | 883326652 | Three Colors: White (1994) | 01-Jan-1994 | NaN | http://us.imdb.com/M/title-exact?Trzy%20kolory... |
6 | 76 | 20 | M | student | 02215 | 61 | 4 | 875028123 | Three Colors: White (1994) | 01-Jan-1994 | NaN | http://us.imdb.com/M/title-exact?Trzy%20kolory... |
7 | 94 | 26 | M | student | 71457 | 61 | 5 | 891720761 | Three Colors: White (1994) | 01-Jan-1994 | NaN | http://us.imdb.com/M/title-exact?Trzy%20kolory... |
8 | 144 | 53 | M | programmer | 20910 | 61 | 3 | 888106182 | Three Colors: White (1994) | 01-Jan-1994 | NaN | http://us.imdb.com/M/title-exact?Trzy%20kolory... |
9 | 154 | 25 | M | student | 53703 | 61 | 4 | 879138657 | Three Colors: White (1994) | 01-Jan-1994 | NaN | http://us.imdb.com/M/title-exact?Trzy%20kolory... |
10 | 160 | 27 | M | programmer | 66215 | 61 | 4 | 876861799 | Three Colors: White (1994) | 01-Jan-1994 | NaN | http://us.imdb.com/M/title-exact?Trzy%20kolory... |
11 | 189 | 32 | M | artist | 95014 | 61 | 3 | 893265826 | Three Colors: White (1994) | 01-Jan-1994 | NaN | http://us.imdb.com/M/title-exact?Trzy%20kolory... |
12 | 195 | 42 | M | scientist | 93555 | 61 | 3 | 888737277 | Three Colors: White (1994) | 01-Jan-1994 | NaN | http://us.imdb.com/M/title-exact?Trzy%20kolory... |
13 | 201 | 27 | M | writer | E2A4H | 61 | 2 | 884111986 | Three Colors: White (1994) | 01-Jan-1994 | NaN | http://us.imdb.com/M/title-exact?Trzy%20kolory... |
14 | 257 | 17 | M | student | 77005 | 61 | 5 | 879547534 | Three Colors: White (1994) | 01-Jan-1994 | NaN | http://us.imdb.com/M/title-exact?Trzy%20kolory... |
15 | 268 | 24 | M | engineer | 19422 | 61 | 4 | 875309282 | Three Colors: White (1994) | 01-Jan-1994 | NaN | http://us.imdb.com/M/title-exact?Trzy%20kolory... |
16 | 279 | 33 | M | programmer | 85251 | 61 | 4 | 875306552 | Three Colors: White (1994) | 01-Jan-1994 | NaN | http://us.imdb.com/M/title-exact?Trzy%20kolory... |
17 | 296 | 43 | F | administrator | 16803 | 61 | 3 | 884197287 | Three Colors: White (1994) | 01-Jan-1994 | NaN | http://us.imdb.com/M/title-exact?Trzy%20kolory... |
18 | 299 | 29 | M | doctor | 63108 | 61 | 4 | 877880648 | Three Colors: White (1994) | 01-Jan-1994 | NaN | http://us.imdb.com/M/title-exact?Trzy%20kolory... |
19 | 305 | 23 | M | programmer | 94086 | 61 | 4 | 886323378 | Three Colors: White (1994) | 01-Jan-1994 | NaN | http://us.imdb.com/M/title-exact?Trzy%20kolory... |
20 | 308 | 60 | M | retired | 95076 | 61 | 3 | 887739336 | Three Colors: White (1994) | 01-Jan-1994 | NaN | http://us.imdb.com/M/title-exact?Trzy%20kolory... |
21 | 321 | 49 | F | educator | 55413 | 61 | 5 | 879441128 | Three Colors: White (1994) | 01-Jan-1994 | NaN | http://us.imdb.com/M/title-exact?Trzy%20kolory... |
22 | 334 | 32 | M | librarian | 30002 | 61 | 3 | 891550409 | Three Colors: White (1994) | 01-Jan-1994 | NaN | http://us.imdb.com/M/title-exact?Trzy%20kolory... |
23 | 354 | 29 | F | librarian | 48197 | 61 | 5 | 891218091 | Three Colors: White (1994) | 01-Jan-1994 | NaN | http://us.imdb.com/M/title-exact?Trzy%20kolory... |
24 | 380 | 32 | M | engineer | 55117 | 61 | 4 | 885478193 | Three Colors: White (1994) | 01-Jan-1994 | NaN | http://us.imdb.com/M/title-exact?Trzy%20kolory... |
25 | 385 | 36 | M | writer | 10003 | 61 | 2 | 879441572 | Three Colors: White (1994) | 01-Jan-1994 | NaN | http://us.imdb.com/M/title-exact?Trzy%20kolory... |
26 | 387 | 33 | M | entertainment | 37412 | 61 | 3 | 886483565 | Three Colors: White (1994) | 01-Jan-1994 | NaN | http://us.imdb.com/M/title-exact?Trzy%20kolory... |
27 | 391 | 23 | M | student | 84604 | 61 | 5 | 877399746 | Three Colors: White (1994) | 01-Jan-1994 | NaN | http://us.imdb.com/M/title-exact?Trzy%20kolory... |
28 | 405 | 22 | F | healthcare | 10019 | 61 | 1 | 885549589 | Three Colors: White (1994) | 01-Jan-1994 | NaN | http://us.imdb.com/M/title-exact?Trzy%20kolory... |
29 | 409 | 48 | M | administrator | 98225 | 61 | 4 | 881109420 | Three Colors: White (1994) | 01-Jan-1994 | NaN | http://us.imdb.com/M/title-exact?Trzy%20kolory... |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
99970 | 894 | 47 | M | educator | 74075 | 1658 | 4 | 882404137 | Substance of Fire, The (1996) | 06-Dec-1996 | NaN | http://us.imdb.com/M/title-exact?Substance%20o... |
99971 | 747 | 19 | M | other | 93612 | 1660 | 2 | 888640731 | Small Faces (1995) | 09-Aug-1996 | NaN | http://us.imdb.com/M/title-exact?Small%20Faces... |
99972 | 747 | 19 | M | other | 93612 | 1659 | 1 | 888733313 | Getting Away With Murder (1996) | 12-Apr-1996 | NaN | http://us.imdb.com/Title?Getting+Away+With+Mur... |
99973 | 751 | 24 | F | other | 90034 | 1661 | 1 | 889299429 | New Age, The (1994) | 01-Jan-1994 | NaN | http://us.imdb.com/M/title-exact?New%20Age,%20... |
99974 | 762 | 32 | M | administrator | 95050 | 1662 | 1 | 878719324 | Rough Magic (1995) | 30-May-1997 | NaN | http://us.imdb.com/M/title-exact?Rough%20Magic... |
99975 | 782 | 21 | F | artist | 33205 | 1662 | 4 | 891500110 | Rough Magic (1995) | 30-May-1997 | NaN | http://us.imdb.com/M/title-exact?Rough%20Magic... |
99976 | 782 | 21 | F | artist | 33205 | 1669 | 2 | 891500150 | MURDER and murder (1996) | 20-Jun-1997 | NaN | http://us.imdb.com/M/title-exact?MURDER+and+mu... |
99977 | 782 | 21 | F | artist | 33205 | 1663 | 2 | 891499700 | Nothing Personal (1995) | 30-Apr-1997 | NaN | http://us.imdb.com/M/title-exact?Nothing%20Per... |
99978 | 782 | 21 | F | artist | 33205 | 1666 | 2 | 891500194 | Ripe (1996) | 02-May-1997 | NaN | http://us.imdb.com/M/title-exact?Ripe%20%28199... |
99979 | 782 | 21 | F | artist | 33205 | 1668 | 3 | 891500067 | Wedding Bell Blues (1996) | 13-Jun-1997 | NaN | http://us.imdb.com/M/title-exact?Wedding%20Bel... |
99980 | 782 | 21 | F | artist | 33205 | 1664 | 4 | 891499699 | 8 Heads in a Duffel Bag (1997) | 18-Apr-1997 | NaN | http://us.imdb.com/Title?8+Heads+in+a+Duffel+B... |
99981 | 839 | 38 | F | entertainment | 90814 | 1664 | 1 | 875752902 | 8 Heads in a Duffel Bag (1997) | 18-Apr-1997 | NaN | http://us.imdb.com/Title?8+Heads+in+a+Duffel+B... |
99982 | 870 | 22 | M | student | 65203 | 1664 | 4 | 890057322 | 8 Heads in a Duffel Bag (1997) | 18-Apr-1997 | NaN | http://us.imdb.com/Title?8+Heads+in+a+Duffel+B... |
99983 | 880 | 13 | M | student | 83702 | 1664 | 4 | 892958799 | 8 Heads in a Duffel Bag (1997) | 18-Apr-1997 | NaN | http://us.imdb.com/Title?8+Heads+in+a+Duffel+B... |
99984 | 782 | 21 | F | artist | 33205 | 1665 | 2 | 891500194 | Brother's Kiss, A (1997) | 25-Apr-1997 | NaN | http://us.imdb.com/M/title-exact?Brother%27s%2... |
99985 | 782 | 21 | F | artist | 33205 | 1670 | 3 | 891497793 | Tainted (1998) | 01-Feb-1998 | NaN | http://us.imdb.com/M/title-exact?Tainted+(1998) |
99986 | 782 | 21 | F | artist | 33205 | 1667 | 3 | 891500110 | Next Step, The (1995) | 13-Jun-1997 | NaN | http://us.imdb.com/M/title-exact?Next%20Step%2... |
99987 | 787 | 18 | F | student | 98620 | 1671 | 1 | 888980193 | Further Gesture, A (1996) | 20-Feb-1998 | NaN | http://us.imdb.com/M/title-exact?Further+Gestu... |
99988 | 828 | 28 | M | librarian | 85282 | 1672 | 2 | 891037722 | Kika (1993) | 01-Jan-1993 | NaN | http://us.imdb.com/M/title-exact?Kika%20(1993) |
99989 | 896 | 28 | M | writer | 91505 | 1672 | 2 | 887159554 | Kika (1993) | 01-Jan-1993 | NaN | http://us.imdb.com/M/title-exact?Kika%20(1993) |
99990 | 835 | 44 | F | executive | 11577 | 1673 | 3 | 891034023 | Mirage (1995) | 01-Jan-1995 | NaN | http://us.imdb.com/M/title-exact?Mirage%20(1995) |
99991 | 840 | 39 | M | artist | 55406 | 1674 | 4 | 891211682 | Mamma Roma (1962) | 01-Jan-1962 | NaN | http://us.imdb.com/M/title-exact?Mamma%20Roma%... |
99992 | 851 | 18 | M | other | 29646 | 1676 | 2 | 875731674 | War at Home, The (1996) | 01-Jan-1996 | NaN | http://us.imdb.com/M/title-exact?War%20at%20Ho... |
99993 | 851 | 18 | M | other | 29646 | 1675 | 3 | 884222085 | Sunchaser, The (1996) | 25-Oct-1996 | NaN | http://us.imdb.com/M/title-exact?Sunchaser,%20... |
99994 | 854 | 29 | F | student | 55408 | 1677 | 3 | 882814368 | Sweet Nothing (1995) | 20-Sep-1996 | NaN | http://us.imdb.com/M/title-exact?Sweet%20Nothi... |
99995 | 863 | 17 | M | student | 60089 | 1679 | 3 | 889289491 | B. Monkey (1998) | 06-Feb-1998 | NaN | http://us.imdb.com/M/title-exact?B%2E+Monkey+(... |
99996 | 863 | 17 | M | student | 60089 | 1678 | 1 | 889289570 | Mat' i syn (1997) | 06-Feb-1998 | NaN | http://us.imdb.com/M/title-exact?Mat%27+i+syn+... |
99997 | 863 | 17 | M | student | 60089 | 1680 | 2 | 889289570 | Sliding Doors (1998) | 01-Jan-1998 | NaN | http://us.imdb.com/Title?Sliding+Doors+(1998) |
99998 | 896 | 28 | M | writer | 91505 | 1681 | 3 | 887160722 | You So Crazy (1994) | 01-Jan-1994 | NaN | http://us.imdb.com/M/title-exact?You%20So%20Cr... |
99999 | 916 | 27 | M | engineer | N2L5N | 1682 | 3 | 880845755 | Scream of Stone (Schrei aus Stein) (1991) | 08-Mar-1996 | NaN | http://us.imdb.com/M/title-exact?Schrei%20aus%... |
100000 rows × 12 columns
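The two-step merge can be illustrated on made-up toy frames. Everything in this sketch (the `*_demo` names, the ids, the titles "Movie A" and "Movie B") is hypothetical and not taken from the dataset. The `validate` argument is an optional safeguard that raises an error if the key relationship is not what you expect.

```python
import pandas as pd

# Toy stand-ins for the users, ratings and movies tables (made-up values).
users_demo = pd.DataFrame({'user_id': [1, 2], 'age': [24, 53]})
ratings_demo = pd.DataFrame({'user_id': [1, 1, 2],
                             'movie_id': [10, 20, 10],
                             'rating': [4, 3, 5]})
movies_demo = pd.DataFrame({'movie_id': [10, 20],
                            'movie_title': ['Movie A', 'Movie B']})

# Each user has many ratings: one-to-many on user_id.
user_ratings_demo = pd.merge(users_demo, ratings_demo,
                             on='user_id', validate='one_to_many')
# Many ratings point at each movie: many-to-one on movie_id.
merged_demo = pd.merge(user_ratings_demo, movies_demo,
                       on='movie_id', validate='many_to_one')
print(merged_demo)
```

Because every rating matches exactly one user and one movie, the merged frame has one row per rating, which is why the real result above has 100000 rows.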
Data exploration and cleaning
# data.describe()
# data['gender'].value_counts()
# Handle missing values
# data.shape (100000, 12)
# data.isnull().sum()
# Drop video_release_date, the column in which every value is NaN
data.dropna(axis=1,how='all',inplace=True)
data.isnull().sum()
# Check for duplicate rows
# data.duplicated().any()
user_id 0
age 0
gender 0
occupation 0
zip_cod 0
movie_id 0
rating 0
unix_timestamp 0
movie_title 0
release_date 9
imdb_url 13
dtype: int64
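As a small illustration of what `dropna(axis=1, how='all')` does, the sketch below uses a made-up three-row frame (the `demo` data is hypothetical). Only the column that is entirely missing is removed, while a column with some missing values is kept.

```python
import numpy as np
import pandas as pd

# Toy frame: one partly missing column and one fully missing column.
demo = pd.DataFrame({
    'movie_title': ['A', 'B', 'C'],
    'release_date': ['01-Jan-1995', None, '01-Jan-1994'],
    'video_release_date': [np.nan, np.nan, np.nan],
})

# axis=1 works on columns; how='all' drops a column only if every value is missing.
cleaned = demo.dropna(axis=1, how='all')

# Remaining missing values per column.
missing = cleaned.isnull().sum()
print(missing)
```

This mirrors the result above, where `release_date` and `imdb_url` still contain a few missing values after the all-NaN column is dropped.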
Most-rated movies
### Approach 1 ###
# Group by movie title and count the rows in each group, which gives the number of ratings
# g = data.groupby('movie_title')
# count = 0
# for k,v in g:
#     if count==3:
#         break
#     display(k,v)
#     count+=1
# data.groupby('movie_title').size().sort_values(ascending=False).head(20)
### Approach 2 ###
data['movie_title'].value_counts().head(20).plot(kind='bar')
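The two approaches give the same counts. A minimal sketch on made-up titles (the `demo` frame is hypothetical) shows the equivalence.

```python
import pandas as pd

# Toy data: one row per rating, identified only by movie title.
demo = pd.DataFrame({'movie_title': ['A', 'B', 'A', 'C', 'A', 'B']})

# Approach 1: group, count rows per group, sort descending.
by_group = demo.groupby('movie_title').size().sort_values(ascending=False)

# Approach 2: value_counts counts and sorts in one call.
by_counts = demo['movie_title'].value_counts()

print(by_group.to_dict())
print(by_counts.to_dict())
```

`value_counts` is the shorter spelling, while the `groupby` form is easier to extend when more statistics per movie are needed.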
Highest-rated movies
# Group by movie title and take the mean rating of each group; the higher the mean, the higher the movie is rated
movie_scores = data.groupby('movie_title').agg({'rating':['size','mean']})
# movie_scores
# Filter: keep only movies with at least 100 ratings
size_more_100 = movie_scores['rating']['size']>=100
d=movie_scores[size_more_100].sort_values([('rating','mean')],ascending=False).head(10)
d['rating']['mean'].plot(kind='bar',title='Highest-rated movies')
[Figure: bar chart of the mean rating of the ten highest-rated movies with at least 100 ratings]
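The same count-then-filter idea can also be written with named aggregation, which produces flat column names instead of a two-level column index. This is an alternative sketch, not the code used above. The toy data, the names `n_ratings` and `mean_rating`, and the threshold of 2 (standing in for 100) are all illustrative assumptions.

```python
import pandas as pd

# Toy ratings: A has 3 ratings, B has 2, C has only 1.
demo = pd.DataFrame({
    'movie_title': ['A', 'A', 'A', 'B', 'B', 'C'],
    'rating':      [5,   4,   3,   5,   5,   1],
})

# Named aggregation: output_column=(input_column, function).
stats = demo.groupby('movie_title').agg(
    n_ratings=('rating', 'size'),
    mean_rating=('rating', 'mean'),
)

# Keep movies with enough ratings, then rank by mean rating.
min_ratings = 2  # stands in for the threshold of 100 used on the real data
top = stats[stats['n_ratings'] >= min_ratings].sort_values('mean_rating', ascending=False)
print(top)
```

The minimum-count filter matters because a movie with a single 5-star rating would otherwise outrank widely rated films.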
The 100 movies with the most ratings
data['movie_title'].value_counts().head(100)
Star Wars (1977) 583
Contact (1997) 509
Fargo (1996) 508
Return of the Jedi (1983) 507
Liar Liar (1997) 485
English Patient, The (1996) 481
Scream (1996) 478
Toy Story (1995) 452
Air Force One (1997) 431
Independence Day (ID4) (1996) 429
Raiders of the Lost Ark (1981) 420
Godfather, The (1972) 413
Pulp Fiction (1994) 394
Twelve Monkeys (1995) 392
Silence of the Lambs, The (1991) 390
Jerry Maguire (1996) 384
Chasing Amy (1997) 379
Rock, The (1996) 378
Empire Strikes Back, The (1980) 367
Star Trek: First Contact (1996) 365
Back to the Future (1985) 350
Titanic (1997) 350
Mission: Impossible (1996) 344
Fugitive, The (1993) 336
Indiana Jones and the Last Crusade (1989) 331
Willy Wonka and the Chocolate Factory (1971) 326
Princess Bride, The (1987) 324
Forrest Gump (1994) 321
Monty Python and the Holy Grail (1974) 316
Saint, The (1997) 316
...
Wizard of Oz, The (1939) 246
Phenomenon (1996) 244
Star Trek: The Wrath of Khan (1982) 244
Casablanca (1942) 243
Die Hard (1988) 243
Sting, The (1973) 241
Devil's Own, The (1997) 240
Dante's Peak (1997) 240
Psycho (1960) 239
Graduate, The (1967) 239
Seven (Se7en) (1995) 236
Time to Kill, A (1996) 232
It's a Wonderful Life (1946) 231
Speed (1994) 230
In & Out (1997) 230
Stand by Me (1986) 227
Hunt for Red October, The (1990) 227
GoodFellas (1990) 226
Heat (1995) 223
Sound of Music, The (1965) 222
Apocalypse Now (1979) 221
Clockwork Orange, A (1971) 221
Courage Under Fire (1996) 221
Top Gun (1986) 220
Lion King, The (1994) 220
To Kill a Mockingbird (1962) 219
Volcano (1997) 219
Babe (1995) 219
Aladdin (1992) 219
Murder at 1600 (1997) 218
Name: movie_title, Length: 100, dtype: int64
Relationship between rating and age
# Quick look at the age distribution
# data['age'].describe()
# data['age'].plot(kind='hist',bins=30,figsize=(12,6))
# Define custom age bins and look at the overall ratings given by each age group
# 0-9 10-19 20-29 ...70-79
# np.arange(0,81,10)
labels = ['0-9','10-19','20-29','30-39','40-49','50-59','60-69','70-79']
data['age_group'] = pd.cut(data['age'],np.arange(0,81,10),right=False,labels=labels)
data.groupby('age_group').agg({'rating':['size','mean']})
# Observation: older age groups generally give higher ratings
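To make the binning concrete, here is a minimal sketch of `pd.cut` with `right=False` on a handful of made-up ages (the `ages` values are illustrative). With `right=False` each bin is closed on the left and open on the right, so 10 falls in "10-19" rather than "0-9".

```python
import numpy as np
import pandas as pd

ages = pd.Series([7, 10, 19, 20, 73])
bins = np.arange(0, 81, 10)  # edges 0, 10, ..., 80
labels = ['0-9', '10-19', '20-29', '30-39', '40-49', '50-59', '60-69', '70-79']

# right=False gives intervals [0, 10), [10, 20), ..., [70, 80).
groups = pd.cut(ages, bins, right=False, labels=labels)
print(groups.tolist())

# A value on the last edge (80) is outside every bin and becomes NaN.
edge_case = pd.cut(pd.Series([80]), bins, right=False, labels=labels)
print(edge_case.isna().all())
```

If the data could contain ages of 80 or more, the bin edges would need to be extended, otherwise those rows silently drop out of the grouped result.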
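# Ratings of a given movie by different age groups
# Only analyze the 100 most-rated movies --> how do we get the 100 most-rated movies?
# Get the ids of the 100 most-rated movies, set the movie_id column as the row index, then use an index array to select the rows for those movies
# How do we get, for one movie, the rating from each age group?
#                 0-9   10-19 ....
# Air Force One   4.1   4.2   ....
# Unstack the innermost row-index level into the column index
# outer row index   inner row index
# Air Force One     0-9     4.1
#                   10-19   4.2
# Group by movie_title and age_group, then take the mean of rating in each group
# Get the ids of the 100 most-rated movies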
most_100 = data['movie_id'].value_counts().head(100)
# most_100
# Set the movie_id column as the row index
data.set_index('movie_id',inplace=True)
by_age_group = data.loc[most_100.index].groupby(['movie_title','age_group'])
by_age_group['rating'].mean().unstack(1).fillna(0).head(10)
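The group-then-unstack step can be seen on a tiny made-up example (titles, age groups and ratings below are hypothetical). Grouping by two keys gives a Series with a two-level row index, and `unstack` moves the inner level into the columns.

```python
import pandas as pd

# Toy ratings for two movies across two age groups.
demo = pd.DataFrame({
    'movie_title': ['A', 'A', 'A', 'B', 'B'],
    'age_group':   ['10-19', '20-29', '20-29', '10-19', '10-19'],
    'rating':      [4, 3, 5, 2, 4],
})

# Long form: one mean per (movie_title, age_group) pair.
long_form = demo.groupby(['movie_title', 'age_group'])['rating'].mean()

# Wide form: age groups become columns; missing pairs become NaN.
wide_form = long_form.unstack('age_group')
print(wide_form)
```

Movie B has no ratings from the "20-29" group, so that cell is NaN. Filling such cells with 0, as `fillna(0)` does above, makes the table easier to read, but a 0 there means "no ratings", not an average rating of zero.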