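A small pandas exercise: a series of operations on a public movie-rating dataset. The data has three parts:
- movies
- users
- users' ratings of movies
Download the data: ml-1m.zip
Data description: README.txt
Import the pandas package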
import pandas as pd
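Define the column names and build the table (a multi-character separator such as '::' requires the Python parser engine)
unames = ['user_id', 'gender', 'age', 'occupation', 'zip']
users = pd.read_table('ml-1m/users.dat', sep='::', header=None, names=unames, engine='python')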
Check that the users data was imported correctly
print(len(users))
users.head(5)
6040
user_id gender age occupation zip
0 1 F 1 10 48067
1 2 M 56 16 70072
2 3 M 25 15 55117
3 4 M 45 7 02460
4 5 M 25 20 55455
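The loading step can be tried without downloading anything. The sketch below is a minimal illustration, not the original notebook's code: it reads three rows, copied from the output above, out of an in-memory string. The name `users_sample` and the `dtype={'zip': str}` argument are assumptions added for illustration. Without that dtype, an all-numeric sample like this one would be parsed as integers and 02460 would lose its leading zero.

```python
import io
import pandas as pd

# Three rows in the same "::"-separated layout as users.dat,
# copied from the users.head(5) output shown above.
sample = io.StringIO(
    "1::F::1::10::48067\n"
    "2::M::56::16::70072\n"
    "4::M::45::7::02460\n"
)

unames = ['user_id', 'gender', 'age', 'occupation', 'zip']

# A separator longer than one character is interpreted as a regular
# expression, which only the Python parsing engine supports. Naming the
# engine explicitly avoids the fallback warning from the C engine.
# dtype={'zip': str} (an illustrative addition) keeps ZIP codes as text.
users_sample = pd.read_csv(sample, sep='::', header=None, names=unames,
                           engine='python', dtype={'zip': str})

print(users_sample.dtypes)
print(users_sample)
```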
Import the ratings and movies data the same way
rating_names = ['user_id', 'movie_id', 'rating', 'timestamp']
ratings = pd.read_table('ml-1m/ratings.dat', sep='::', header=None, names=rating_names, engine='python')
movie_names = ['movie_id', 'title', 'genres']
movies = pd.read_table('ml-1m/movies.dat', sep='::', header=None, names=movie_names, engine='python')
print(len(ratings))
ratings.head(5)
1000209
user_id movie_id rating timestamp
0 1 1193 5 978300760
1 1 661 3 978302109
2 1 914 3 978301968
3 1 3408 4 978300275
4 1 2355 5 978824291
print(len(movies))
movies.head(5)
3883
movie_id title genres
0 1 Toy Story (1995) Animation|Children's|Comedy
1 2 Jumanji (1995) Adventure|Children's|Fantasy
2 3 Grumpier Old Men (1995) Comedy|Romance
3 4 Waiting to Exhale (1995) Comedy|Drama
4 5 Father of the Bride Part II (1995) Comedy
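Merge the three tables; the operation is similar to an inner join in SQL. With no keys given, pd.merge joins on the columns the two tables have in common (user_id, then movie_id).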
data = pd.merge(pd.merge(users,ratings),movies)
print(len(data))
data.head(10)
1000209
user_id gender age occupation zip movie_id rating timestamp title genres
0 1 F 1 10 48067 1193 5 978300760 One Flew Over the Cuckoo's Nest (1975) Drama
1 2 M 56 16 70072 1193 5 978298413 One Flew Over the Cuckoo's Nest (1975) Drama
2 12 M 25 12 32793 1193 4 978220179 One Flew Over the Cuckoo's Nest (1975) Drama
3 15 M 25 7 22903 1193 4 978199279 One Flew Over the Cuckoo's Nest (1975) Drama
4 17 M 50 1 95350 1193 5 978158471 One Flew Over the Cuckoo's Nest (1975) Drama
5 18 F 18 3 95825 1193 4 978156168 One Flew Over the Cuckoo's Nest (1975) Drama
6 19 M 1 10 48073 1193 5 982730936 One Flew Over the Cuckoo's Nest (1975) Drama
7 24 F 25 7 10023 1193 5 978136709 One Flew Over the Cuckoo's Nest (1975) Drama
8 28 F 25 1 14607 1193 3 978125194 One Flew Over the Cuckoo's Nest (1975) Drama
9 33 M 45 3 55421 1193 5 978557765 One Flew Over the Cuckoo's Nest (1975) Drama
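To make the inner-join behaviour concrete, here is a minimal sketch on three tiny hand-made tables (all names and values below are invented for illustration). It spells out the join keys with `on=` and `how='inner'` rather than relying on the shared column names being detected automatically.

```python
import pandas as pd

# Toy stand-ins for the users, ratings and movies tables (made-up values).
users_s = pd.DataFrame({'user_id': [1, 2, 3], 'gender': ['F', 'M', 'M']})
ratings_s = pd.DataFrame({'user_id': [1, 1, 2, 4],
                          'movie_id': [10, 20, 10, 20],
                          'rating': [5, 3, 4, 2]})
movies_s = pd.DataFrame({'movie_id': [10, 20, 30],
                         'title': ['Movie A', 'Movie B', 'Movie C']})

# Inner joins keep only rows whose key appears on both sides:
# user 3 has no ratings, user 4 is not in users_s, Movie C was never rated.
merged = (users_s
          .merge(ratings_s, on='user_id', how='inner')
          .merge(movies_s, on='movie_id', how='inner'))

print(merged)
```

Because every rating in ml-1m refers to an existing user and movie, the merged table above has exactly as many rows as ratings (1000209).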
Check the type of data
print(type(data))
<class 'pandas.core.frame.DataFrame'>
Select all rows where user_id == 1
data[data.user_id == 1]
user_id gender age occupation zip movie_id rating timestamp title genres
0 1 F 1 10 48067 1193 5 978300760 One Flew Over the Cuckoo's Nest (1975) Drama
1725 1 F 1 10 48067 661 3 978302109 James and the Giant Peach (1996) Animation|Children's|Musical
2250 1 F 1 10 48067 914 3 978301968 My Fair Lady (1964) Musical|Romance
2886 1 F 1 10 48067 3408 4 978300275 Erin Brockovich (2000) Drama
4201 1 F 1 10 48067 2355 5 978824291 Bug's Life, A (1998) Animation|Children's|Comedy
5904 1 F 1 10 48067 1197 3 978302268 Princess Bride, The (1987) Action|Adventure|Comedy|Romance
8222 1 F 1 10 48067 1287 5 978302039 Ben-Hur (1959) Action|Adventure|Drama
8926 1 F 1 10 48067 2804 5 978300719 Christmas Story, A (1983) Comedy|Drama
10278 1 F 1 10 48067 594 4 978302268 Snow White and the Seven Dwarfs (1937) Animation|Children's|Musical
11041 1 F 1 10 48067 919 4 978301368 Wizard of Oz, The (1939) Adventure|Children's|Drama|Musical
12759 1 F 1 10 48067 595 5 978824268 Beauty and the Beast (1991) Animation|Children's|Musical
13819 1 F 1 10 48067 938 4 978301752 Gigi (1958) Musical
14006 1 F 1 10 48067 2398 4 978302281 Miracle on 34th Street (1947) Drama
14386 1 F 1 10 48067 2918 4 978302124 Ferris Bueller's Day Off (1986) Comedy
15859 1 F 1 10 48067 1035 5 978301753 Sound of Music, The (1965) Musical
16741 1 F 1 10 48067 2791 4 978302188 Airplane! (1980) Comedy
18472 1 F 1 10 48067 2687 3 978824268 Tarzan (1999) Animation|Children's
18914 1 F 1 10 48067 2018 4 978301777 Bambi (1942) Animation|Children's
19503 1 F 1 10 48067 3105 5 978301713 Awakenings (1990) Drama
20183 1 F 1 10 48067 2797 4 978302039 Big (1988) Comedy|Fantasy
21674 1 F 1 10 48067 2321 3 978302205 Pleasantville (1998) Comedy
22832 1 F 1 10 48067 720 3 978300760 Wallace & Gromit: The Best of Aardman Animatio... Animation
23270 1 F 1 10 48067 1270 5 978300055 Back to the Future (1985) Comedy|Sci-Fi
25853 1 F 1 10 48067 527 5 978824195 Schindler's List (1993) Drama|War
28157 1 F 1 10 48067 2340 3 978300103 Meet Joe Black (1998) Romance
28501 1 F 1 10 48067 48 5 978824351 Pocahontas (1995) Animation|Children's|Musical|Romance
28883 1 F 1 10 48067 1097 4 978301953 E.T. the Extra-Terrestrial (1982) Children's|Drama|Fantasy|Sci-Fi
31152 1 F 1 10 48067 1721 4 978300055 Titanic (1997) Drama|Romance
32698 1 F 1 10 48067 1545 4 978824139 Ponette (1996) Drama
32771 1 F 1 10 48067 745 3 978824268 Close Shave, A (1995) Animation|Comedy|Thriller
33428 1 F 1 10 48067 2294 4 978824291 Antz (1998) Animation|Children's
34073 1 F 1 10 48067 3186 4 978300019 Girl, Interrupted (1999) Drama
34504 1 F 1 10 48067 1566 4 978824330 Hercules (1997) Adventure|Animation|Children's|Comedy|Musical
34973 1 F 1 10 48067 588 4 978824268 Aladdin (1992) Animation|Children's|Comedy|Musical
36324 1 F 1 10 48067 1907 4 978824330 Mulan (1998) Animation|Children's
36814 1 F 1 10 48067 783 4 978824291 Hunchback of Notre Dame, The (1996) Animation|Children's|Musical
37204 1 F 1 10 48067 1836 5 978300172 Last Days of Disco, The (1998) Drama
37339 1 F 1 10 48067 1022 5 978300055 Cinderella (1950) Animation|Children's|Musical
37916 1 F 1 10 48067 2762 4 978302091 Sixth Sense, The (1999) Thriller
40375 1 F 1 10 48067 150 5 978301777 Apollo 13 (1995) Drama
41626 1 F 1 10 48067 1 5 978824268 Toy Story (1995) Animation|Children's|Comedy
43703 1 F 1 10 48067 1961 5 978301590 Rain Man (1988) Drama
45033 1 F 1 10 48067 1962 4 978301753 Driving Miss Daisy (1989) Drama
45685 1 F 1 10 48067 2692 4 978301570 Run Lola Run (Lola rennt) (1998) Action|Crime|Romance
46757 1 F 1 10 48067 260 4 978300760 Star Wars: Episode IV - A New Hope (1977) Action|Adventure|Fantasy|Sci-Fi
49748 1 F 1 10 48067 1028 5 978301777 Mary Poppins (1964) Children's|Comedy|Musical
50759 1 F 1 10 48067 1029 5 978302205 Dumbo (1941) Animation|Children's|Musical
51327 1 F 1 10 48067 1207 4 978300719 To Kill a Mockingbird (1962) Drama
52255 1 F 1 10 48067 2028 5 978301619 Saving Private Ryan (1998) Action|Drama|War
54908 1 F 1 10 48067 531 4 978302149 Secret Garden, The (1993) Children's|Drama
55246 1 F 1 10 48067 3114 4 978302174 Toy Story 2 (1999) Animation|Children's|Comedy
56831 1 F 1 10 48067 608 4 978301398 Fargo (1996) Crime|Drama|Thriller
59344 1 F 1 10 48067 1246 4 978302091 Dead Poets Society (1989) Drama
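The selection above is boolean indexing: the comparison yields a True/False Series, and indexing with it keeps the True rows. A short sketch on made-up data (the frame `df` and its values are illustrative only) shows the same idea, how to combine two conditions, and the equivalent `DataFrame.query` form.

```python
import pandas as pd

# Made-up ratings for illustration.
df = pd.DataFrame({'user_id': [1, 1, 2, 3],
                   'rating': [5, 3, 4, 5],
                   'title': ['Movie A', 'Movie B', 'Movie A', 'Movie C']})

# The comparison produces a boolean Series aligned with df's index.
mask = df['user_id'] == 1
by_mask = df[mask]

# Combine conditions with & (and) or | (or); each side needs parentheses.
by_both = df[(df['user_id'] == 1) & (df['rating'] >= 4)]

# query() expresses the same filter as a string expression.
by_query = df.query('user_id == 1')

print(by_mask)
print(by_both)
```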
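Build a table of the mean rating of every movie, split by gender
values = 'rating': the column to aggregate
index = 'title': rows are indexed by movie title
columns = 'gender': columns are indexed by gender
aggfunc = 'mean': aggregate by taking the mean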
ratings_by_gender = data.pivot_table(values = 'rating', index = 'title', columns = 'gender',aggfunc = 'mean' )
ratings_by_gender.head(10)
gender F M
title
$1,000,000 Duck (1971) 3.375000 2.761905
'Night Mother (1986) 3.388889 3.352941
'Til There Was You (1997) 2.675676 2.733333
'burbs, The (1989) 2.793478 2.962085
...And Justice for All (1979) 3.828571 3.689024
1-900 (1994) 2.000000 3.000000
10 Things I Hate About You (1999) 3.646552 3.311966
101 Dalmatians (1961) 3.791444 3.500000
101 Dalmatians (1996) 3.240000 2.911215
12 Angry Men (1957) 4.184397 4.328421
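A hand-checkable version of the pivot may help. The sketch below uses five made-up ratings (the frame `toy` is an illustration, not part of the dataset) and also shows a groupby formulation that should yield the same table, which is one way to think about what `pivot_table` does.

```python
import pandas as pd

# Five made-up ratings: Movie A rated by two women and one man,
# Movie B rated by two men only.
toy = pd.DataFrame({
    'title':  ['Movie A', 'Movie A', 'Movie A', 'Movie B', 'Movie B'],
    'gender': ['F', 'F', 'M', 'M', 'M'],
    'rating': [4, 2, 5, 3, 1],
})

# One row per title, one column per gender, each cell the mean rating.
by_gender = toy.pivot_table(values='rating', index='title',
                            columns='gender', aggfunc='mean')

# The same table built by grouping on both keys and moving gender
# from the row index into the columns.
alt = toy.groupby(['title', 'gender'])['rating'].mean().unstack('gender')

print(by_gender)
```

Movie B has no female ratings, so its F cell is NaN. The same thing happens in the real data for movies rated by only one gender.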
Add a new column diff to ratings_by_gender: the female mean rating minus the male mean rating
ratings_by_gender['diff'] = ratings_by_gender['F'] - ratings_by_gender.M
ratings_by_gender.head(10)
gender F M diff
title
$1,000,000 Duck (1971) 3.375000 2.761905 0.613095
'Night Mother (1986) 3.388889 3.352941 0.035948
'Til There Was You (1997) 2.675676 2.733333 -0.057658
'burbs, The (1989) 2.793478 2.962085 -0.168607
...And Justice for All (1979) 3.828571 3.689024 0.139547
1-900 (1994) 2.000000 3.000000 -1.000000
10 Things I Hate About You (1999) 3.646552 3.311966 0.334586
101 Dalmatians (1961) 3.791444 3.500000 0.291444
101 Dalmatians (1996) 3.240000 2.911215 0.328785
12 Angry Men (1957) 4.184397 4.328421 -0.144024
Sort ratings_by_gender by diff in ascending order, so the movies that men rated much higher than women come first
ratings_by_gender.sort_values(by='diff',ascending=True).head(10)
gender F M diff
title
Tigrero: A Film That Was Ne... 1.0 4.333333 -3.333333
Neon Bible, The (1995) 1.0 4.000000 -3.000000
Enfer, L' (1994) 1.0 3.750000 -2.750000
Stalingrad (1993) 1.0 3.593750 -2.593750
Killer: A Journal of Murder (1995) 1.0 3.428571 -2.428571
Dangerous Ground (1997) 1.0 3.333333 -2.333333
In God's Hands (1998) 1.0 3.333333 -2.333333
Rosie (1998) 1.0 3.333333 -2.333333
Flying Saucer, The (1950) 1.0 3.300000 -2.300000
Jamaica Inn (1939) 1.0 3.142857 -2.142857
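Since diff is F minus M, the sort direction decides which group's favourites appear first. A minimal sketch with three made-up movies (values invented for illustration):

```python
import pandas as pd

# Made-up mean ratings per gender for three movies.
scores = pd.DataFrame({'F': [4.0, 2.0, 3.5], 'M': [3.0, 4.5, 3.5]},
                      index=pd.Index(['Movie A', 'Movie B', 'Movie C'],
                                     name='title'))
scores['diff'] = scores['F'] - scores['M']

# ascending=True: most negative diff first (rated higher by men).
men_first = scores.sort_values(by='diff', ascending=True)

# ascending=False: most positive diff first (rated higher by women).
women_first = scores.sort_values(by='diff', ascending=False)

print(men_first)
print(women_first)
```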
To find popular movies (those with many ratings), count the ratings per title
ratings_by_title = data.groupby('title').size()
ratings_by_title.head(10)
title
$1,000,000 Duck (1971) 37
'Night Mother (1986) 70
'Til There Was You (1997) 52
'burbs, The (1989) 303
...And Justice for All (1979) 199
1-900 (1994) 2
10 Things I Hate About You (1999) 700
101 Dalmatians (1961) 565
101 Dalmatians (1996) 364
12 Angry Men (1957) 616
dtype: int64
Sort by rating count, in descending order
ratings_by_title.sort_values(ascending=False).head(10)
title
American Beauty (1999) 3428
Star Wars: Episode IV - A New Hope (1977) 2991
Star Wars: Episode V - The Empire Strikes Back (1980) 2990
Star Wars: Episode VI - Return of the Jedi (1983) 2883
Jurassic Park (1993) 2672
Saving Private Ryan (1998) 2653
Terminator 2: Judgment Day (1991) 2649
Matrix, The (1999) 2590
Back to the Future (1985) 2583
Silence of the Lambs, The (1991) 2578
dtype: int64
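The count-then-sort pattern can be checked on a toy column. The sketch below (made-up titles) also shows `Series.value_counts`, which counts and sorts in descending order in one call; whether to use it here is a matter of taste, since `groupby(...).size()` keeps the titles in alphabetical order for later indexing.

```python
import pandas as pd

# Six made-up ratings rows; only the title matters for counting.
toy = pd.DataFrame({'title': ['Movie A', 'Movie B', 'Movie A',
                              'Movie C', 'Movie A', 'Movie B']})

# Number of rows per title, indexed by title.
counts = toy.groupby('title').size()

# Most-rated titles first.
ranked = counts.sort_values(ascending=False)

# value_counts() gives the same counts, already sorted descending.
vc = toy['title'].value_counts()

print(ranked)
```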
Mean rating of each movie
mean_ratings = data.pivot_table(values='rating',index='title',aggfunc='mean')
mean_ratings.head(10)
rating
title
$1,000,000 Duck (1971) 3.027027
'Night Mother (1986) 3.371429
'Til There Was You (1997) 2.692308
'burbs, The (1989) 2.910891
...And Justice for All (1979) 3.713568
1-900 (1994) 2.500000
10 Things I Hate About You (1999) 3.422857
101 Dalmatians (1961) 3.596460
101 Dalmatians (1996) 3.046703
12 Angry Men (1957) 4.295455
Mean ratings of the 10 most-rated movies
top_10_hot = ratings_by_title.sort_values(ascending=False).head(10)
mean_ratings.rating[top_10_hot.index]
title
American Beauty (1999) 4.317386
Star Wars: Episode IV - A New Hope (1977) 4.453694
Star Wars: Episode V - The Empire Strikes Back (1980) 4.292977
Star Wars: Episode VI - Return of the Jedi (1983) 4.022893
Jurassic Park (1993) 3.763847
Saving Private Ryan (1998) 4.337354
Terminator 2: Judgment Day (1991) 4.058513
Matrix, The (1999) 4.315830
Back to the Future (1985) 3.990321
Silence of the Lambs, The (1991) 4.351823
Name: rating, dtype: float64
Rating counts of the top 20 movies by mean rating (many of them have only a single rating, so their scores are unreliable)
top_20_score = mean_ratings.sort_values(by='rating',ascending=False).head(20)
ratings_by_title[top_20_score.index]
title
Ulysses (Ulisse) (1954) 1
Lured (1947) 1
Follow the Bitch (1998) 1
Bittersweet Motel (2000) 1
Song of Freedom (1936) 1
One Little Indian (1973) 1
Smashing Time (1967) 2
Schlafes Bruder (Brother of Sleep) (1995) 1
Gate of Heavenly Peace, The (1995) 3
Baby, The (1973) 1
I Am Cuba (Soy Cuba/Ya Kuba) (1964) 5
Lamerica (1994) 8
Apple, The (Sib) (1998) 9
Sanjuro (1962) 69
Seven Samurai (The Magnificent Seven) (Shichinin no samurai) (1954) 628
Shawshank Redemption, The (1994) 2227
Godfather, The (1972) 2223
Close Shave, A (1995) 657
Usual Suspects, The (1995) 1783
Schindler's List (1993) 2304
dtype: int64
Find the popular movies (more than 1000 ratings)
hot_movies = ratings_by_title[ratings_by_title > 1000]
hot_movies.head(10)
title
2001: A Space Odyssey (1968) 1716
Abyss, The (1989) 1715
African Queen, The (1951) 1057
Air Force One (1997) 1076
Airplane! (1980) 1731
Aladdin (1992) 1351
Alien (1979) 2024
Aliens (1986) 1820
Amadeus (1984) 1382
American Beauty (1999) 3428
dtype: int64
Select the popular movies from the mean-rating table
hot_movies_rating = mean_ratings.rating[hot_movies.index]
Top 10 popular movies by mean rating
hot_movies_rating.sort_values(ascending=False).head(10)
title
Shawshank Redemption, The (1994) 4.554558
Godfather, The (1972) 4.524966
Usual Suspects, The (1995) 4.517106
Schindler's List (1993) 4.510417
Raiders of the Lost Ark (1981) 4.477725
Rear Window (1954) 4.476190
Star Wars: Episode IV - A New Hope (1977) 4.453694
Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb (1963) 4.449890
Casablanca (1942) 4.412822
Sixth Sense, The (1999) 4.406263
Name: rating, dtype: float64
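The count and the mean can also be computed side by side, which makes the effect of the popularity filter easy to see. This is a sketch of an alternative formulation, not the code used above. The column names `n_ratings` and `mean_rating` and the tiny threshold `min_ratings` are assumptions for illustration (the text above uses more than 1000 ratings).

```python
import pandas as pd

# Made-up ratings: Movie B has a perfect score from a single rating.
toy = pd.DataFrame({
    'title':  ['Movie A'] * 3 + ['Movie B'] + ['Movie C'] * 2,
    'rating': [4, 5, 3, 5, 4, 4],
})

# Named aggregation: one row per title with its count and mean.
stats = toy.groupby('title')['rating'].agg(n_ratings='count',
                                           mean_rating='mean')

# Illustrative threshold; stands in for the "> 1000" used on the real data.
min_ratings = 1
popular = stats[stats['n_ratings'] > min_ratings]

# Rank only the sufficiently rated movies by mean rating.
top_popular = popular.sort_values(by='mean_rating', ascending=False)

print(stats)
print(top_popular)
```

Without the filter, Movie B would top the ranking on the strength of one rating, which is the same distortion seen in the top-20 list earlier.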
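Recall the gender rating differences analyzed at the start: many of those movies also had too few ratings. So we compute the per-gender difference again and then restrict it to the popular movies.
rating_by_gender = data.pivot_table(index='title', columns='gender', values='rating', aggfunc='mean')
Add the diff column, then drop rows containing NaN (a movie may have ratings from only one gender) and sort by diff
rating_by_gender['diff'] = rating_by_gender.F - rating_by_gender.M
rating_by_gender = rating_by_gender.dropna().sort_values(by='diff')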
rating_by_gender
gender F M diff
title
Tigrero: A Film That Was Never Made (1994) 1.000000 4.333333 -3.333333
Neon Bible, The (1995) 1.000000 4.000000 -3.000000
Enfer, L' (1994) 1.000000 3.750000 -2.750000
Stalingrad (1993) 1.000000 3.593750 -2.593750
Killer: A Journal of Murder (1995) 1.000000 3.428571 -2.428571
... ... ... ...
Woman of Paris, A (1923) 5.000000 2.428571 2.571429
Babyfever (1994) 3.666667 1.000000 2.666667
Country Life (1994) 5.000000 2.000000 3.000000
James Dean Story, The (1957) 4.000000 1.000000 3.000000
Spiders, The (Die Spinnen, 1. Teil: Der Goldene See) (1919) 4.000000 1.000000 3.000000
Filter by the index of the popular movies first, then sort by diff in descending order
rating_by_gender['diff'][hot_movies.index].sort_values(ascending=False)
title
Rocky Horror Picture Show, The (1975) 0.512885
Mary Poppins (1964) 0.467147
Gone with the Wind (1939) 0.440471
Full Monty, The (1997) 0.352481
Little Mermaid, The (1989) 0.343561
...
Predator (1987) -0.406793
Airplane! (1980) -0.407854
South Park: Bigger, Longer and Uncut (1999) -0.424206
Reservoir Dogs (1992) -0.444642
Animal House (1978) -0.538286
Name: diff, Length: 207, dtype: float64
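Indexing one Series with another's index assumes every popular title is still present after `dropna()`. That holds in the run above, but it is not guaranteed in general. The sketch below, on made-up values, restricts the lookup to titles present in both objects with `Index.intersection`; this defensive step is a suggestion, not something the steps above require.

```python
import pandas as pd

# Made-up gender differences; Movie D is absent, as if dropna() removed it.
diff = pd.Series({'Movie A': 0.5, 'Movie B': -3.0, 'Movie C': -0.2},
                 name='diff')

# Made-up rating counts per title.
counts = pd.Series({'Movie A': 1500, 'Movie B': 2,
                    'Movie C': 1200, 'Movie D': 1100})

# Popular titles: more than 1000 ratings.
hot = counts[counts > 1000]

# Keep only titles that exist in both indexes, so .loc cannot fail
# on a popular title that has no diff value.
common = diff.index.intersection(hot.index)
hot_diff = diff.loc[common].sort_values(ascending=False)

print(hot_diff)
```

The extreme value for Movie B (two ratings) disappears once the comparison is limited to popular titles, mirroring the much narrower range of diff in the output above.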