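Study notes on "Python for Data Analysis", ch02-2 (2)

Preface:
Things that come up in this part (I still can't tell which are built-in modules, algorithms, classes, or functions):
pandas.read_table
pandas DataFrame
Python slice syntax
the pandas merge function
pivot_table
size()
mean rating
sorting
variance
standard deviation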


How to slice and dice data to meet practical needs

MovieLens 1M dataset

Use pandas.read_table to load each table into its own pandas DataFrame object

import pandas as pd
unames = ['user_id','gender','age','occupation','zip']
users = pd.read_table('C:\\pytest\\ch02\\movielens\\users.dat',sep='::',header=None,names=unames)

C:\Anaconda3\lib\site-packages\ipykernel\__main__.py:1: ParserWarning: Falling back to the 'python' engine because the 'c' engine does not support regex separators (separators > 1 char and different from '\s+' are interpreted as regex); you can avoid this warning by specifying engine='python'.
  if __name__ == '__main__':  # warning: the two-character separator '::' is treated as a regex, which the C engine does not support; passing engine='python' to read_table avoids it

rnames = ['user_id','movie_id','rating','timestamp']
ratings = pd.read_table('C:\\pytest\\ch02\\movielens\\ratings.dat',sep='::',header=None,names=rnames)
C:\Anaconda3\lib\site-packages\ipykernel\__main__.py:1: ParserWarning: Falling back to the 'python' engine because the 'c' engine does not support regex separators (separators > 1 char and different from '\s+' are interpreted as regex); you can avoid this warning by specifying engine='python'.
  if __name__ == '__main__':  # same ParserWarning (multi-character separator)
mnames = ['movie_id','title','genres']
movies = pd.read_table('C:\\pytest\\ch02\\movielens\\movies.dat',sep='::',header=None,names=mnames)
C:\Anaconda3\lib\site-packages\ipykernel\__main__.py:1: ParserWarning: Falling back to the 'python' engine because the 'c' engine does not support regex separators (separators > 1 char and different from '\s+' are interpreted as regex); you can avoid this warning by specifying engine='python'.
  if __name__ == '__main__':  # same ParserWarning (multi-character separator)
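
The ParserWarning above comes from the two-character separator. A minimal sketch of the same kind of read with the parser engine named explicitly, as the warning message itself suggests. It reads a small in-memory sample in the users.dat layout instead of the real file, and the names sample and users_sample are only for illustration:

```python
import io
import pandas as pd

# Two rows in the users.dat layout: user_id::gender::age::occupation::zip
sample = io.StringIO("1::F::1::10::48067\n2::M::56::16::70072\n")

unames = ['user_id', 'gender', 'age', 'occupation', 'zip']

# A separator longer than one character is treated as a regex, which only
# the Python engine supports; naming the engine avoids the fallback warning.
users_sample = pd.read_table(sample, sep='::', header=None,
                             names=unames, engine='python')
print(users_sample)
```

With the real files, the same engine='python' argument can be added to each of the three read_table calls.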

Using Python's slice syntax to look at the first few rows of each DataFrame is an easy way to check that the data loaded correctly

users[:5]
   user_id gender  age  occupation    zip
0        1      F    1          10  48067
1        2      M   56          16  70072
2        3      M   25          15  55117
3        4      M   45           7  02460
4        5      M   25          20  55455
ratings[:5]
   user_id  movie_id  rating  timestamp
0        1      1193       5  978300760
1        1       661       3  978302109
2        1       914       3  978301968
3        1      3408       4  978300275
4        1      2355       5  978824291
movies[:5]
   movie_id  title                               genres
0         1  Toy Story (1995)                    Animation|Children's|Comedy
1         2  Jumanji (1995)                      Adventure|Children's|Fantasy
2         3  Grumpier Old Men (1995)             Comedy|Romance
3         4  Waiting to Exhale (1995)            Comedy|Drama
4         5  Father of the Bride Part II (1995)  Comedy
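
A small sketch of what the slice is doing, on a made-up DataFrame (df and its values are hypothetical). As far as I understand it, slicing a DataFrame with [:n] selects rows by position, which gives the same result as head(n) or iloc[:n]:

```python
import pandas as pd

# Hypothetical toy frame, only to illustrate row slicing.
df = pd.DataFrame({'user_id': [1, 2, 3, 4], 'age': [1, 56, 25, 45]})

first_two = df[:2]          # plain Python slice: rows at positions 0 and 1
same_by_head = df.head(2)   # head(n) returns the first n rows
same_by_iloc = df.iloc[:2]  # explicit position-based slicing

print(first_two)
```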
ratings
         user_id  movie_id  rating  timestamp
0              1      1193       5  978300760
1              1       661       3  978302109
2              1       914       3  978301968
3              1      3408       4  978300275
4              1      2355       5  978824291
5              1      1197       3  978302268
6              1      1287       5  978302039
7              1      2804       5  978300719
8              1       594       4  978302268
9              1       919       4  978301368
10             1       595       5  978824268
11             1       938       4  978301752
12             1      2398       4  978302281
13             1      2918       4  978302124
14             1      1035       5  978301753
15             1      2791       4  978302188
16             1      2687       3  978824268
17             1      2018       4  978301777
18             1      3105       5  978301713
19             1      2797       4  978302039
20             1      2321       3  978302205
21             1       720       3  978300760
22             1      1270       5  978300055
23             1       527       5  978824195
24             1      2340       3  978300103
25             1        48       5  978824351
26             1      1097       4  978301953
27             1      1721       4  978300055
28             1      1545       4  978824139
29             1       745       3  978824268
...          ...       ...     ...        ...
1000179     6040      2762       4  956704584
1000180     6040      1036       3  956715455
1000181     6040       508       4  956704972
1000182     6040      1041       4  957717678
1000183     6040      3735       4  960971654
1000184     6040      2791       4  956715569
1000185     6040      2794       1  956716438
1000186     6040       527       5  956704219
1000187     6040      2003       1  956716294
1000188     6040       535       4  964828734
1000189     6040      2010       5  957716795
1000190     6040      2011       4  956716113
1000191     6040      3751       4  964828782
1000192     6040      2019       5  956703977
1000193     6040       541       4  956715288
1000194     6040      1077       5  964828799
1000195     6040      1079       2  956715648
1000196     6040       549       4  956704746
1000197     6040      2020       3  956715288
1000198     6040      2021       3  956716374
1000199     6040      2022       5  956716207
1000200     6040      2028       5  956704519
1000201     6040      1080       4  957717322
1000202     6040      1089       4  956704996
1000203     6040      1090       3  956715518
1000204     6040      1091       1  956716541
1000205     6040      1094       5  956704887
1000206     6040       562       5  956704746
1000207     6040      1096       4  956715648
1000208     6040      1097       4  956715569

1000209 rows × 4 columns

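First use the pandas merge function to merge ratings with users, then merge movies into the result. pandas infers which columns to use as the merge (or join) keys from the column names the two tables have in common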

data = pd.merge(pd.merge(ratings,users),movies)
data
         user_id  movie_id  rating   timestamp  gender  age  occupation    zip  title                                               genres
0              1      1193       5   978300760       F    1          10  48067  One Flew Over the Cuckoo's Nest (1975)              Drama
1              2      1193       5   978298413       M   56          16  70072  One Flew Over the Cuckoo's Nest (1975)              Drama
2             12      1193       4   978220179       M   25          12  32793  One Flew Over the Cuckoo's Nest (1975)              Drama
3             15      1193       4   978199279       M   25           7  22903  One Flew Over the Cuckoo's Nest (1975)              Drama
4             17      1193       5   978158471       M   50           1  95350  One Flew Over the Cuckoo's Nest (1975)              Drama
5             18      1193       4   978156168       F   18           3  95825  One Flew Over the Cuckoo's Nest (1975)              Drama
6             19      1193       5   982730936       M    1          10  48073  One Flew Over the Cuckoo's Nest (1975)              Drama
7             24      1193       5   978136709       F   25           7  10023  One Flew Over the Cuckoo's Nest (1975)              Drama
8             28      1193       3   978125194       F   25           1  14607  One Flew Over the Cuckoo's Nest (1975)              Drama
9             33      1193       5   978557765       M   45           3  55421  One Flew Over the Cuckoo's Nest (1975)              Drama
10            39      1193       5   978043535       M   18           4  61820  One Flew Over the Cuckoo's Nest (1975)              Drama
11            42      1193       3   978038981       M   25           8  24502  One Flew Over the Cuckoo's Nest (1975)              Drama
12            44      1193       4   978018995       M   45          17  98052  One Flew Over the Cuckoo's Nest (1975)              Drama
13            47      1193       4   977978345       M   18           4  94305  One Flew Over the Cuckoo's Nest (1975)              Drama
14            48      1193       4   977975061       M   25           4  92107  One Flew Over the Cuckoo's Nest (1975)              Drama
15            49      1193       4   978813972       M   18          12  77084  One Flew Over the Cuckoo's Nest (1975)              Drama
16            53      1193       5   977946400       M   25           0  96931  One Flew Over the Cuckoo's Nest (1975)              Drama
17            54      1193       5   977944039       M   50           1  56723  One Flew Over the Cuckoo's Nest (1975)              Drama
18            58      1193       5   977933866       M   25           2  30303  One Flew Over the Cuckoo's Nest (1975)              Drama
19            59      1193       4   977934292       F   50           1  55413  One Flew Over the Cuckoo's Nest (1975)              Drama
20            62      1193       4   977968584       F   35           3  98105  One Flew Over the Cuckoo's Nest (1975)              Drama
21            80      1193       4   977786172       M   56           1  49327  One Flew Over the Cuckoo's Nest (1975)              Drama
22            81      1193       5   977785864       F   25           0  60640  One Flew Over the Cuckoo's Nest (1975)              Drama
23            88      1193       5   977694161       F   45           1  02476  One Flew Over the Cuckoo's Nest (1975)              Drama
24            89      1193       5   977683596       F   56           9  85749  One Flew Over the Cuckoo's Nest (1975)              Drama
25            95      1193       5   977626632       M   45           0  98201  One Flew Over the Cuckoo's Nest (1975)              Drama
26            96      1193       3   977621789       F   25          16  78028  One Flew Over the Cuckoo's Nest (1975)              Drama
27            99      1193       2   982791053       F    1          10  19390  One Flew Over the Cuckoo's Nest (1975)              Drama
28           102      1193       5  1040737607       M   35          19  20871  One Flew Over the Cuckoo's Nest (1975)              Drama
29           104      1193       2   977546620       M   25          12  00926  One Flew Over the Cuckoo's Nest (1975)              Drama
...          ...       ...     ...         ...     ...  ...         ...    ...  ...                                                 ...
1000179     4933      3084       3   962757020       M   25          15  94040  Home Page (1999)                                    Documentary
1000180     4802      2218       2  1014866656       M   56           1  40601  Juno and Paycock (1930)                             Drama
1000181     4812      2308       2   962932391       M   18          14  25301  Detroit 9000 (1973)                                 Action|Crime
1000182     4874       624       4   962781918       F   25           4  70808  Condition Red (1995)                                Action|Drama|Thriller
1000183     5059      1434       4   962484364       M   45          16  22652  Stranger, The (1994)                                Action
1000184     5947      1434       4   957190428       F   45          16  97215  Stranger, The (1994)                                Action
1000185     5077      1868       3   962417299       M   25           2  20037  Truce, The (1996)                                   Drama|War
1000186     5944      1868       1   957197520       F   18          10  27606  Truce, The (1996)                                   Drama|War
1000187     5105       404       3   962337582       M   50           7  18977  Brother Minister: The Assassination of Malcolm...   Documentary
1000188     5185       404       4   963402617       F   35           4  44485  Brother Minister: The Assassination of Malcolm...   Documentary
1000189     5532       404       5   959619841       M   25          17  27408  Brother Minister: The Assassination of Malcolm...   Documentary
1000190     5543       404       3   960127592       M   25          17  97401  Brother Minister: The Assassination of Malcolm...   Documentary
1000191     5220      2543       3   961546137       M   25           7  91436  Six Ways to Sunday (1997)                           Comedy
1000192     5754      2543       4   958272316       F   18           1  60640  Six Ways to Sunday (1997)                           Comedy
1000193     5227       591       3   961475931       M   18          10  64050  Tough and Deadly (1995)                             Action|Drama|Thriller
1000194     5795       591       1   958145253       M   25           1  92688  Tough and Deadly (1995)                             Action|Drama|Thriller
1000195     5313      3656       5   960920392       M   56           0  55406  Lured (1947)                                        Crime
1000196     5328      2438       4   960838075       F   25           4  91740  Outside Ozona (1998)                                Drama|Thriller
1000197     5334      3323       3   960796159       F   56          13  46140  Chain of Fools (2000)                               Comedy|Crime
1000198     5334       127       1   960795494       F   56          13  46140  Silence of the Palace, The (Saimt el Qusur) (1...   Drama
1000199     5334      3382       5   960796159       F   56          13  46140  Song of Freedom (1936)                              Drama
1000200     5420      1843       3   960156505       F    1          19  14850  Slappy and the Stinkers (1998)                      Children's|Comedy
1000201     5433       286       3   960240881       F   35          17  45014  Nemesis 2: Nebula (1995)                            Action|Sci-Fi|Thriller
1000202     5494      3530       4   959816296       F   35          17  94306  Smoking/No Smoking (1993)                           Comedy
1000203     5556      2198       3   959445515       M   45           6  92103  Modulations (1998)                                  Documentary
1000204     5949      2198       5   958846401       M   18          17  47901  Modulations (1998)                                  Documentary
1000205     5675      2703       3   976029116       M   35          14  30030  Broken Vessels (1998)                               Drama
1000206     5780      2845       1   958153068       M   18          17  92886  White Boys (1999)                                   Drama
1000207     5851      3607       5   957756608       F   18          20  55410  One Little Indian (1973)                            Comedy|Drama|Western
1000208     5938      2909       4   957273353       M   25           1  35401  Five Wives, Three Secretaries and Me (1998)         Documentary

1000209 rows × 10 columns
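
To see how the key inference works, here is a minimal sketch on three tiny made-up tables (ratings_s, users_s, movies_s and their values are hypothetical, not taken from the dataset). It compares the implicit merge with one where the keys are spelled out through the on parameter:

```python
import pandas as pd

# Hypothetical miniature versions of the three tables.
ratings_s = pd.DataFrame({'user_id': [1, 1, 2],
                          'movie_id': [10, 20, 10],
                          'rating': [5, 3, 4]})
users_s = pd.DataFrame({'user_id': [1, 2], 'gender': ['F', 'M']})
movies_s = pd.DataFrame({'movie_id': [10, 20],
                         'title': ['Movie A', 'Movie B']})

# Implicit keys: pandas joins on the column names both tables share.
implicit = pd.merge(pd.merge(ratings_s, users_s), movies_s)

# Explicit keys: the same joins with the key columns named.
explicit = ratings_s.merge(users_s, on='user_id').merge(movies_s, on='movie_id')

print(implicit)
```

Naming the keys is a bit more typing, but it guards against an accidental join on some other column that happens to share a name.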

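Once you are a little familiar with pandas, it is easy to aggregate the ratings by any user or movie attribute. To compute the mean rating of each movie by gender, we can use the pivot_table method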

mean_ratings = data.pivot_table('rating',index='title',columns='gender',aggfunc='mean')
mean_ratings[:5]
gender                                F         M
title
$1,000,000 Duck (1971)         3.375000  2.761905
'Night Mother (1986)           3.388889  3.352941
'Til There Was You (1997)      2.675676  2.733333
'burbs, The (1989)             2.793478  2.962085
...And Justice for All (1979)  3.828571  3.689024
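
A small sketch of what pivot_table computes here, on made-up ratings (data_s and its values are hypothetical). As far as I can tell, the result matches grouping by both columns, taking the mean, and unstacking gender into columns:

```python
import pandas as pd

# Hypothetical ratings: two movies rated by female and male viewers.
data_s = pd.DataFrame({
    'title': ['Movie A', 'Movie A', 'Movie A', 'Movie B', 'Movie B'],
    'gender': ['F', 'F', 'M', 'F', 'M'],
    'rating': [4, 5, 3, 2, 4],
})

# One row per title, one column per gender, each cell the mean rating.
mean_s = data_s.pivot_table(values='rating', index='title',
                            columns='gender', aggfunc='mean')

# Same numbers via groupby: mean per (title, gender), gender moved to columns.
alt = data_s.groupby(['title', 'gender'])['rating'].mean().unstack()

print(mean_s)
```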

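This produces another DataFrame containing the mean ratings, with movie titles as the row labels and gender as the column labels. Next, filter out movies that have fewer than 250 ratings. To do that, group the data by title, then use size() to get a Series holding the group size for each title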

ratings_by_title = data.groupby('title').size()
ratings_by_title[:10]
title
$1,000,000 Duck (1971)                37
'Night Mother (1986)                  70
'Til There Was You (1997)             52
'burbs, The (1989)                   303
...And Justice for All (1979)        199
1-900 (1994)                           2
10 Things I Hate About You (1999)    700
101 Dalmatians (1961)                565
101 Dalmatians (1996)                364
12 Angry Men (1957)                  616
dtype: int64
active_titles = ratings_by_title.index[ratings_by_title>=250]
active_titles
Index([''burbs, The (1989)', '10 Things I Hate About You (1999)',
       '101 Dalmatians (1961)', '101 Dalmatians (1996)', '12 Angry Men (1957)',
       '13th Warrior, The (1999)', '2 Days in the Valley (1996)',
       '20,000 Leagues Under the Sea (1954)', '2001: A Space Odyssey (1968)',
       '2010 (1984)',
       ...
       'X-Men (2000)', 'Year of Living Dangerously (1982)',
       'Yellow Submarine (1968)', 'You've Got Mail (1998)',
       'Young Frankenstein (1974)', 'Young Guns (1988)',
       'Young Guns II (1990)', 'Young Sherlock Holmes (1985)',
       'Zero Effect (1998)', 'eXistenZ (1999)'],
      dtype='object', name='title', length=1216)
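
The count-then-filter step in miniature, on made-up data (data_s, counts, and the threshold min_ratings are hypothetical names and values):

```python
import pandas as pd

# Hypothetical ratings: 'A' has 3 ratings, 'B' has 1, 'C' has 2.
data_s = pd.DataFrame({'title': ['A', 'A', 'A', 'B', 'C', 'C'],
                       'rating': [5, 4, 3, 2, 4, 4]})

# size() counts the rows in each group, giving a Series indexed by title.
counts = data_s.groupby('title').size()

# A boolean mask over that Series picks out the index labels to keep.
min_ratings = 2  # hypothetical threshold standing in for 250
active = counts.index[counts >= min_ratings]

print(active)
```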

This index holds the titles of movies with at least 250 ratings; it can then be used to select the needed rows from the mean_ratings computed earlier

mean_ratings = mean_ratings.loc[active_titles]  # .loc replaces .ix, which was removed in pandas 1.0
mean_ratings
gender                                                                     F         M
title
'burbs, The (1989)                                                  2.793478  2.962085
10 Things I Hate About You (1999)                                   3.646552  3.311966
101 Dalmatians (1961)                                               3.791444  3.500000
101 Dalmatians (1996)                                               3.240000  2.911215
12 Angry Men (1957)                                                 4.184397  4.328421
13th Warrior, The (1999)                                            3.112000  3.168000
2 Days in the Valley (1996)                                         3.488889  3.244813
20,000 Leagues Under the Sea (1954)                                 3.670103  3.709205
2001: A Space Odyssey (1968)                                        3.825581  4.129738
2010 (1984)                                                         3.446809  3.413712
28 Days (2000)                                                      3.209424  2.977707
39 Steps, The (1935)                                                3.965517  4.107692
54 (1998)                                                           2.701754  2.782178
7th Voyage of Sinbad, The (1958)                                    3.409091  3.658879
8MM (1999)                                                          2.906250  2.850962
About Last Night... (1986)                                          3.188679  3.140909
Absent Minded Professor, The (1961)                                 3.469388  3.446809
Absolute Power (1997)                                               3.469136  3.327759
Abyss, The (1989)                                                   3.659236  3.689507
Ace Ventura: Pet Detective (1994)                                   3.000000  3.197917
Ace Ventura: When Nature Calls (1995)                               2.269663  2.543333
Addams Family Values (1993)                                         3.000000  2.878531
Addams Family, The (1991)                                           3.186170  3.163498
Adventures in Babysitting (1987)                                    3.455782  3.208122
Adventures of Buckaroo Bonzai Across the 8th Dimension, The (1984)  3.308511  3.402321
Adventures of Priscilla, Queen of the Desert, The (1994)            3.989071  3.688811
Adventures of Robin Hood, The (1938)                                4.166667  3.918367
African Queen, The (1951)                                           4.324232  4.223822
Age of Innocence, The (1993)                                        3.827068  3.339506
Agnes of God (1985)                                                 3.534884  3.244898
...                                                                      ...       ...
White Men Can't Jump (1992)                                         3.028777  3.231061
Who Framed Roger Rabbit? (1988)                                     3.569378  3.713251
Who's Afraid of Virginia Woolf? (1966)                              4.029703  4.096939
Whole Nine Yards, The (2000)                                        3.296552  3.404814
Wild Bunch, The (1969)                                              3.636364  4.128099
Wild Things (1998)                                                  3.392000  3.459082
Wild Wild West (1999)                                               2.275449  2.131973
William Shakespeare's Romeo and Juliet (1996)                       3.532609  3.318644
Willow (1988)                                                       3.658683  3.453543
Willy Wonka and the Chocolate Factory (1971)                        4.063953  3.789474
Witness (1985)                                                      4.115854  3.941504
Wizard of Oz, The (1939)                                            4.355030  4.203138
Wolf (1994)                                                         3.074074  2.899083
Women on the Verge of a Nervous Breakdown (1988)                    3.934307  3.865741
Wonder Boys (2000)                                                  4.043796  3.913649
Working Girl (1988)                                                 3.606742  3.312500
World Is Not Enough, The (1999)                                     3.337500  3.388889
Wrong Trousers, The (1993)                                          4.588235  4.478261
Wyatt Earp (1994)                                                   3.147059  3.283898
X-Files: Fight the Future, The (1998)                               3.489474  3.493797
X-Men (2000)                                                        3.682310  3.851702
Year of Living Dangerously (1982)                                   3.951220  3.869403
Yellow Submarine (1968)                                             3.714286  3.689286
You've Got Mail (1998)                                              3.542424  3.275591
Young Frankenstein (1974)                                           4.289963  4.239177
Young Guns (1988)                                                   3.371795  3.425620
Young Guns II (1990)                                                2.934783  2.904025
Young Sherlock Holmes (1985)                                        3.514706  3.363344
Zero Effect (1998)                                                  3.864407  3.723140
eXistenZ (1999)                                                     3.098592  3.289086

1216 rows × 2 columns
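
Selecting rows by a set of index labels, sketched with .loc on a made-up table (mean_s, keep, and the values are hypothetical). The .ix indexer was deprecated in pandas 0.20 and removed in 1.0, so .loc is the label-based way to do this on current versions:

```python
import pandas as pd

# Hypothetical mean ratings indexed by title.
mean_s = pd.DataFrame(
    {'F': [4.5, 2.0, 3.0], 'M': [3.0, 4.0, 3.5]},
    index=pd.Index(['Movie A', 'Movie B', 'Movie C'], name='title'))

# Labels to keep, e.g. the titles that passed a rating-count filter.
keep = pd.Index(['Movie A', 'Movie C'], name='title')

# .loc with an Index of labels returns just those rows, in that order.
selected = mean_s.loc[keep]

print(selected)
```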

To see which movies female viewers liked most, sort by the F column in descending order:

top_female_ratings = mean_ratings.sort_values(by='F',ascending=False)  # sort_values replaces the deprecated sort_index(by=...)
top_female_ratings[:10]
gender                                                         F         M
title
Close Shave, A (1995)                                   4.644444  4.473795
Wrong Trousers, The (1993)                              4.588235  4.478261
Sunset Blvd. (a.k.a. Sunset Boulevard) (1950)           4.572650  4.464589
Wallace & Gromit: The Best of Aardman Animation (1996)  4.563107  4.385075
Schindler's List (1993)                                 4.562602  4.491415
Shawshank Redemption, The (1994)                        4.539075  4.560625
Grand Day Out, A (1992)                                 4.537879  4.293255
To Kill a Mockingbird (1962)                            4.536667  4.372611
Creature Comforts (1990)                                4.513889  4.272277
Usual Suspects, The (1995)                              4.513317  4.518248
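
Sorting by one column in descending order, sketched with sort_values on a made-up table (mean_s and its values are hypothetical):

```python
import pandas as pd

# Hypothetical mean ratings by gender.
mean_s = pd.DataFrame(
    {'F': [4.5, 2.0, 3.0], 'M': [3.0, 4.0, 3.5]},
    index=pd.Index(['Movie A', 'Movie B', 'Movie C'], name='title'))

# Highest female mean rating first.
top_f = mean_s.sort_values(by='F', ascending=False)

print(top_f[:2])
```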

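Measuring rating disagreement

Suppose we want to find the movies that male and female viewers disagree on most. One way is to add a column to mean_ratings holding the difference between the mean ratings, then sort by it: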

mean_ratings['diff'] = mean_ratings['M']-mean_ratings['F']
sorted_by_diff = mean_ratings.sort_values(by='diff')  # sort_values replaces the deprecated sort_index(by=...)
sorted_by_diff[:15]
gender                                        F         M       diff
title
Dirty Dancing (1987)                   3.790378  2.959596  -0.830782
Jumpin' Jack Flash (1986)              3.254717  2.578358  -0.676359
Grease (1978)                          3.975265  3.367041  -0.608224
Little Women (1994)                    3.870588  3.321739  -0.548849
Steel Magnolias (1989)                 3.901734  3.365957  -0.535777
Anastasia (1997)                       3.800000  3.281609  -0.518391
Rocky Horror Picture Show, The (1975)  3.673016  3.160131  -0.512885
Color Purple, The (1985)               4.158192  3.659341  -0.498851
Age of Innocence, The (1993)           3.827068  3.339506  -0.487561
Free Willy (1993)                      2.921348  2.438776  -0.482573
French Kiss (1995)                     3.535714  3.056962  -0.478752
Little Shop of Horrors, The (1960)     3.650000  3.179688  -0.470312
Guys and Dolls (1955)                  4.051724  3.583333  -0.468391
Mary Poppins (1964)                    4.197740  3.730594  -0.467147
Patch Adams (1998)                     3.473282  3.008746  -0.464536

Reversing the sorted result and taking the first 15 rows gives the movies that male viewers rated higher than female viewers did

sorted_by_diff[::-1][:15]
gender                                         F         M      diff
title
Good, The Bad and The Ugly, The (1966)  3.494949  4.221300  0.726351
Kentucky Fried Movie, The (1977)        2.878788  3.555147  0.676359
Dumb & Dumber (1994)                    2.697987  3.336595  0.638608
Longest Day, The (1962)                 3.411765  4.031447  0.619682
Cable Guy, The (1996)                   2.250000  2.863787  0.613787
Evil Dead II (Dead By Dawn) (1987)      3.297297  3.909283  0.611985
Hidden, The (1987)                      3.137931  3.745098  0.607167
Rocky III (1982)                        2.361702  2.943503  0.581801
Caddyshack (1980)                       3.396135  3.969737  0.573602
For a Few Dollars More (1965)           3.409091  3.953795  0.544704
Porky's (1981)                          2.296875  2.836364  0.539489
Animal House (1978)                     3.628906  4.167192  0.538286
Exorcist, The (1973)                    3.537634  4.067239  0.529605
Fright Night (1985)                     2.973684  3.500000  0.526316
Barb Wire (1996)                        1.585366  2.100386  0.515020
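
The difference column and the reversal trick in miniature, on a made-up table (mean_s and its values are hypothetical). Since diff is M minus F, the most negative values are movies women rated higher, and the reversed order puts the movies men rated higher on top:

```python
import pandas as pd

# Hypothetical mean ratings by gender.
mean_s = pd.DataFrame(
    {'F': [4.5, 2.0, 3.0], 'M': [3.0, 4.0, 3.5]},
    index=pd.Index(['Movie A', 'Movie B', 'Movie C'], name='title'))

# Positive diff: men rated it higher; negative diff: women rated it higher.
mean_s['diff'] = mean_s['M'] - mean_s['F']

by_diff = mean_s.sort_values(by='diff')  # ascending: women's favorites first
reversed_diff = by_diff[::-1]            # reversed: men's favorites first

print(by_diff)
print(reversed_diff)
```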

If you only want the movies that viewers disagree on most, regardless of gender, compute the variance or standard deviation of the ratings

rating_std_by_title = data.groupby('title')['rating'].std()
rating_std_by_title = rating_std_by_title.loc[active_titles]  # .loc replaces .ix, which was removed in pandas 1.0
rating_std_by_title.sort_values(ascending=False)[:10]  # sort_values replaces the deprecated Series.order


title
Dumb & Dumber (1994)                     1.321333
Blair Witch Project, The (1999)          1.316368
Natural Born Killers (1994)              1.307198
Tank Girl (1995)                         1.277695
Rocky Horror Picture Show, The (1975)    1.260177
Eyes Wide Shut (1999)                    1.259624
Evita (1996)                             1.253631
Billy Madison (1995)                     1.249970
Fear and Loathing in Las Vegas (1998)    1.246408
Bicentennial Man (1999)                  1.245533
Name: rating, dtype: float64
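
A tiny worked example of the per-title standard deviation, on made-up ratings (data_s and its values are hypothetical). Note that std() on a groupby uses the sample standard deviation (ddof=1) by default, so for Movie A with ratings 1 and 5 the result is sqrt(((1-3)^2 + (5-3)^2) / (2-1)) = sqrt(8):

```python
import pandas as pd

# Hypothetical ratings: Movie A splits opinion, Movie B does not.
data_s = pd.DataFrame({
    'title': ['Movie A', 'Movie A', 'Movie B', 'Movie B'],
    'rating': [1, 5, 3, 3],
})

# Sample standard deviation (ddof=1) of the ratings within each title.
std_s = data_s.groupby('title')['rating'].std()

# Largest spread first: the most divisive title comes out on top.
most_divisive = std_s.sort_values(ascending=False)

print(most_divisive)
```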