Python for Data Analysis, ch02 (continued)

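# MovieLens 1M dataset

I have skimmed through the data analysis book; to finish, I am typing out the Chapter 2 examples once more, otherwise I keep forgetting them.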
import pandas as pd
import numpy as np
from pandas import DataFrame
unames = ['user_id', 'gender', 'age', 'occupation', 'zip']
# sep='::' is longer than one character, so pandas treats it as a regex; passing
# engine='python' avoids the ParserWarning about falling back from the C engine
users = pd.read_table('D:\\pytest\\pydata-book-master\\ch02\\movielens\\users.dat', sep='::', header=None, names=unames, engine='python')
users[:5]
   user_id gender  age  occupation    zip
0        1      F    1          10  48067
1        2      M   56          16  70072
2        3      M   25          15  55117
3        4      M   45           7  02460
4        5      M   25          20  55455
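
A separator longer than one character, such as `::`, is interpreted by pandas as a regular expression, which only the Python parsing engine supports. The sketch below shows this on a made-up two-row sample instead of the real users.dat; the sample rows and the name `users_demo` are illustrative assumptions, not part of the book's code.

```python
import io
import pandas as pd

# Hypothetical two-row sample in the same "::"-delimited layout as users.dat
sample = io.StringIO("1::F::1::10::48067\n2::M::56::16::70072\n")

unames = ['user_id', 'gender', 'age', 'occupation', 'zip']

# engine='python' is requested explicitly because the multi-character
# separator is handled as a regex, which the C engine cannot parse
users_demo = pd.read_csv(sample, sep='::', header=None, names=unames, engine='python')
print(users_demo)
```

The same `engine='python'` argument can be passed to `pd.read_table`, which is what the calls in this post use.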
rnames = ['user_id', 'movie_id', 'rating', 'timestamp']
ratings = pd.read_table('D:\\pytest\\pydata-book-master\\ch02\\movielens\\ratings.dat', sep='::', header=None, names=rnames, engine='python')
ratings[:5]
   user_id  movie_id  rating  timestamp
0        1      1193       5  978300760
1        1       661       3  978302109
2        1       914       3  978301968
3        1      3408       4  978300275
4        1      2355       5  978824291
mnames = ['movie_id', 'title', 'genres']
movies = pd.read_table('D:\\pytest\\pydata-book-master\\ch02\\movielens\\movies.dat', sep='::', header=None, names=mnames, engine='python')
movies[:5]
   movie_id                               title                        genres
0         1                    Toy Story (1995)   Animation|Children's|Comedy
1         2                      Jumanji (1995)  Adventure|Children's|Fantasy
2         3             Grumpier Old Men (1995)                Comedy|Romance
3         4            Waiting to Exhale (1995)                  Comedy|Drama
4         5  Father of the Bride Part II (1995)                        Comedy
ratings.info()
users.info()
movies.info()
data = pd.merge(pd.merge(ratings, movies), users)
data[:5]
   user_id  movie_id  rating  timestamp                                   title                        genres  gender  age  occupation    zip
0        1      1193       5  978300760  One Flew Over the Cuckoo's Nest (1975)                         Drama       F    1          10  48067
1        1       661       3  978302109        James and the Giant Peach (1996)  Animation|Children's|Musical       F    1          10  48067
2        1       914       3  978301968                     My Fair Lady (1964)               Musical|Romance       F    1          10  48067
3        1      3408       4  978300275                  Erin Brockovich (2000)                         Drama       F    1          10  48067
4        1      2355       5  978824291                    Bug's Life, A (1998)   Animation|Children's|Comedy       F    1          10  48067
data.info()
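
The nested `pd.merge` call works without naming any keys because merge joins on the columns the two frames have in common. A small sketch of that behavior on made-up frames (the `*_demo` names and their values are illustrative assumptions):

```python
import pandas as pd

# Tiny made-up frames that mirror the column layout of ratings, movies and users
ratings_demo = pd.DataFrame({'user_id': [1, 2], 'movie_id': [10, 10], 'rating': [5, 3]})
movies_demo = pd.DataFrame({'movie_id': [10], 'title': ['Example Movie (2000)']})
users_demo = pd.DataFrame({'user_id': [1, 2], 'gender': ['F', 'M']})

# With no `on=` argument, merge joins on the shared column names:
# movie_id for the inner merge, user_id for the outer one
merged = pd.merge(pd.merge(ratings_demo, movies_demo), users_demo)

# Spelling the keys out gives the same result and documents the intent
explicit = ratings_demo.merge(movies_demo, on='movie_id').merge(users_demo, on='user_id')
print(merged)
```

Naming the keys explicitly is a bit more typing, but it guards against an accidental join on a column that merely happens to share a name.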
data.iloc[1]  # iloc (by position) and loc (by label) select rows; ix is deprecated and no longer used
user_id                                      1
movie_id                                   661
rating                                       3
timestamp                            978302109
title         James and the Giant Peach (1996)
genres            Animation|Children's|Musical
gender                                       F
age                                          1
occupation                                  10
zip                                      48067
Name: 1, dtype: object

Next, compute the mean rating of each movie by gender with the pivot_table method, which is covered in detail on p. 288 of the book.

mean_ratings = data.pivot_table('rating', index = 'title', columns = 'gender', aggfunc = 'mean')
mean_ratings[:5]
gender                                F         M
title
$1,000,000 Duck (1971)         3.375000  2.761905
'Night Mother (1986)           3.388889  3.352941
'Til There Was You (1997)      2.675676  2.733333
'burbs, The (1989)             2.793478  2.962085
...And Justice for All (1979)  3.828571  3.689024
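
To see what `pivot_table` is doing here, it helps to run it on a handful of rows where the means can be checked by hand. The sketch below uses made-up ratings (the frame `demo` and its values are assumptions for illustration) and compares the result with an equivalent groupby.

```python
import pandas as pd

# Made-up ratings: two titles, rated by women (F) and men (M)
demo = pd.DataFrame({
    'title': ['A', 'A', 'A', 'B', 'B'],
    'gender': ['F', 'F', 'M', 'F', 'M'],
    'rating': [4, 2, 5, 3, 1],
})

# One row per title, one column per gender, each cell the mean rating
by_pivot = demo.pivot_table('rating', index='title', columns='gender', aggfunc='mean')

# The same table built from groupby + unstack
by_groupby = demo.groupby(['title', 'gender'])['rating'].mean().unstack()
print(by_pivot)
```

For title A, the two F ratings (4 and 2) average to 3.0 and the single M rating stays 5.0, which is what both versions return.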

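Group all of data by title. For example, '$1,000,000 Duck (1971)' has 37 rows, which means 37 people rated it. The plan is to drop every movie with fewer than 250 ratings, so the first step is to find the index labels in rating_by_title of the movies that have at least 250.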

rating_by_title = data.groupby('title').size()
rating_by_title[:5]
title
$1,000,000 Duck (1971)            37
'Night Mother (1986)              70
'Til There Was You (1997)         52
'burbs, The (1989)               303
...And Justice for All (1979)    199
dtype: int64

Select the movies with at least 250 ratings:

active_titles = rating_by_title.index[rating_by_title >= 250]
active_titles
Index([''burbs, The (1989)', '10 Things I Hate About You (1999)',
       '101 Dalmatians (1961)', '101 Dalmatians (1996)', '12 Angry Men (1957)',
       '13th Warrior, The (1999)', '2 Days in the Valley (1996)',
       '20,000 Leagues Under the Sea (1954)', '2001: A Space Odyssey (1968)',
       '2010 (1984)',
       ...
       'X-Men (2000)', 'Year of Living Dangerously (1982)',
       'Yellow Submarine (1968)', 'You've Got Mail (1998)',
       'Young Frankenstein (1974)', 'Young Guns (1988)',
       'Young Guns II (1990)', 'Young Sherlock Holmes (1985)',
       'Zero Effect (1998)', 'eXistenZ (1999)'],
      dtype='object', name='title', length=1216)
mean_ratings = mean_ratings.loc[active_titles]
mean_ratings[:10]
gender                                      F         M
title
'burbs, The (1989)                   2.793478  2.962085
10 Things I Hate About You (1999)    3.646552  3.311966
101 Dalmatians (1961)                3.791444  3.500000
101 Dalmatians (1996)                3.240000  2.911215
12 Angry Men (1957)                  4.184397  4.328421
13th Warrior, The (1999)             3.112000  3.168000
2 Days in the Valley (1996)          3.488889  3.244813
20,000 Leagues Under the Sea (1954)  3.670103  3.709205
2001: A Space Odyssey (1968)         3.825581  4.129738
2010 (1984)                          3.446809  3.413712
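
The filtering step above (count ratings per title, keep the titles at or above the threshold, then select those rows with `.loc`) can be sketched on a tiny made-up frame. The names `demo`, `min_ratings`, `counts`, `active`, and `filtered` are illustrative, and the threshold of 2 stands in for the 250 used in this post.

```python
import pandas as pd

# Made-up data: title 'A' has 3 ratings, 'B' has only 1
demo = pd.DataFrame({'title': ['A', 'A', 'A', 'B'], 'rating': [4, 5, 3, 2]})
min_ratings = 2  # stand-in for the 250 threshold

counts = demo.groupby('title').size()          # number of ratings per title
active = counts.index[counts >= min_ratings]   # index labels that meet the threshold
means = demo.groupby('title')['rating'].mean()
filtered = means.loc[active]                   # keep only the active titles
print(filtered)
```

Because `active` is an Index of titles, it can be reused with `.loc` on any Series or DataFrame that is indexed by title.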
mean_ratings.info()
top_female_ratings = mean_ratings.sort_values(by = 'F', ascending = False)  # ascending sets the sort direction; False means descending
top_female_ratings[:10]
gender                                                         F         M
title
Close Shave, A (1995)                                   4.644444  4.473795
Wrong Trousers, The (1993)                              4.588235  4.478261
Sunset Blvd. (a.k.a. Sunset Boulevard) (1950)           4.572650  4.464589
Wallace & Gromit: The Best of Aardman Animation (1996)  4.563107  4.385075
Schindler's List (1993)                                 4.562602  4.491415
Shawshank Redemption, The (1994)                        4.539075  4.560625
Grand Day Out, A (1992)                                 4.537879  4.293255
To Kill a Mockingbird (1962)                            4.536667  4.372611
Creature Comforts (1990)                                4.513889  4.272277
Usual Suspects, The (1995)                              4.513317  4.518248
### Measuring rating disagreement
mean_ratings['diff'] = mean_ratings['F'] - mean_ratings['M']
mean_ratings[:15]
gender                                      F         M       diff
title
'burbs, The (1989)                   2.793478  2.962085  -0.168607
10 Things I Hate About You (1999)    3.646552  3.311966   0.334586
101 Dalmatians (1961)                3.791444  3.500000   0.291444
101 Dalmatians (1996)                3.240000  2.911215   0.328785
12 Angry Men (1957)                  4.184397  4.328421  -0.144024
13th Warrior, The (1999)             3.112000  3.168000  -0.056000
2 Days in the Valley (1996)          3.488889  3.244813   0.244076
20,000 Leagues Under the Sea (1954)  3.670103  3.709205  -0.039102
2001: A Space Odyssey (1968)         3.825581  4.129738  -0.304156
2010 (1984)                          3.446809  3.413712   0.033097
28 Days (2000)                       3.209424  2.977707   0.231717
39 Steps, The (1935)                 3.965517  4.107692  -0.142175
54 (1998)                            2.701754  2.782178  -0.080424
7th Voyage of Sinbad, The (1958)     3.409091  3.658879  -0.249788
8MM (1999)                           2.906250  2.850962   0.055288

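Sorting by diff in descending order puts the movies that women rated higher than men at the top. The first 15 rows are the movies with the largest gap in that direction.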

diff_ratings = mean_ratings.sort_values(by = 'diff', ascending = False)
diff_ratings[:15]
gender                                        F         M      diff
title
Dirty Dancing (1987)                   3.790378  2.959596  0.830782
Jumpin' Jack Flash (1986)              3.254717  2.578358  0.676359
Grease (1978)                          3.975265  3.367041  0.608224
Little Women (1994)                    3.870588  3.321739  0.548849
Steel Magnolias (1989)                 3.901734  3.365957  0.535777
Anastasia (1997)                       3.800000  3.281609  0.518391
Rocky Horror Picture Show, The (1975)  3.673016  3.160131  0.512885
Color Purple, The (1985)               4.158192  3.659341  0.498851
Age of Innocence, The (1993)           3.827068  3.339506  0.487561
Free Willy (1993)                      2.921348  2.438776  0.482573
French Kiss (1995)                     3.535714  3.056962  0.478752
Little Shop of Horrors, The (1960)     3.650000  3.179688  0.470312
Guys and Dolls (1955)                  4.051724  3.583333  0.468391
Mary Poppins (1964)                    4.197740  3.730594  0.467147
Patch Adams (1998)                     3.473282  3.008746  0.464536

Reversing the row order and taking the first 15 rows gives the movies that men preferred and women rated lower.

diff_ratings[::-1][:15]
gender                                         F         M       diff
title
Good, The Bad and The Ugly, The (1966)  3.494949  4.221300  -0.726351
Kentucky Fried Movie, The (1977)        2.878788  3.555147  -0.676359
Dumb & Dumber (1994)                    2.697987  3.336595  -0.638608
Longest Day, The (1962)                 3.411765  4.031447  -0.619682
Cable Guy, The (1996)                   2.250000  2.863787  -0.613787
Evil Dead II (Dead By Dawn) (1987)      3.297297  3.909283  -0.611985
Hidden, The (1987)                      3.137931  3.745098  -0.607167
Rocky III (1982)                        2.361702  2.943503  -0.581801
Caddyshack (1980)                       3.396135  3.969737  -0.573602
For a Few Dollars More (1965)           3.409091  3.953795  -0.544704
Porky's (1981)                          2.296875  2.836364  -0.539489
Animal House (1978)                     3.628906  4.167192  -0.538286
Exorcist, The (1973)                    3.537634  4.067239  -0.529605
Fright Night (1985)                     2.973684  3.500000  -0.526316
Barb Wire (1996)                        1.585366  2.100386  -0.515020
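
The diff-and-sort pattern used for the two tables above is easy to check on three made-up titles. In this sketch the frame `demo` and its values are assumptions; the point is only that slicing the sorted frame from the top gives the largest positive gaps, while `[::-1]` flips it so the most negative gaps come first.

```python
import pandas as pd

# Made-up mean ratings by gender for three titles
demo = pd.DataFrame({'F': [4.0, 3.0, 2.5], 'M': [3.0, 3.5, 2.5]},
                    index=pd.Index(['A', 'B', 'C'], name='title'))
demo['diff'] = demo['F'] - demo['M']   # positive: women rated it higher

by_diff = demo.sort_values(by='diff', ascending=False)
preferred_by_women = by_diff[:1]       # largest positive diff first
preferred_by_men = by_diff[::-1][:1]   # reversed: most negative diff first
print(by_diff)
```

Sorting once and reversing with `[::-1]` avoids a second sort; calling `sort_values(by='diff')` with the default ascending order would give the same first row here.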

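To find the movies with the most disagreement regardless of gender, use the variance or standard deviation of the ratings.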

Group the ratings by title and compute the standard deviation for each movie.

rating_std_by_title = data.groupby('title')['rating'].std()
rating_std_by_title[:15]
title
$1,000,000 Duck (1971)                 1.092563
'Night Mother (1986)                   1.118636
'Til There Was You (1997)              1.020159
'burbs, The (1989)                     1.107760
...And Justice for All (1979)          0.878110
1-900 (1994)                           0.707107
10 Things I Hate About You (1999)      0.989815
101 Dalmatians (1961)                  0.982103
101 Dalmatians (1996)                  1.098717
12 Angry Men (1957)                    0.812731
13th Warrior, The (1999)               1.140421
187 (1997)                             1.057919
2 Days in the Valley (1996)            0.921592
20 Dates (1998)                        1.151943
20,000 Leagues Under the Sea (1954)    0.869685
Name: rating, dtype: float64

Then filter out the movies with fewer than 250 ratings.

rating_std_by_title = rating_std_by_title.loc[active_titles]

Finally, sort in descending order to get the movies with the most disagreement.

rating_std_by_title.sort_values(ascending = False)[:15]
title
Dumb & Dumber (1994)                           1.321333
Blair Witch Project, The (1999)                1.316368
Natural Born Killers (1994)                    1.307198
Tank Girl (1995)                               1.277695
Rocky Horror Picture Show, The (1975)          1.260177
Eyes Wide Shut (1999)                          1.259624
Evita (1996)                                   1.253631
Billy Madison (1995)                           1.249970
Fear and Loathing in Las Vegas (1998)          1.246408
Bicentennial Man (1999)                        1.245533
Hellraiser (1987)                              1.243046
Babe: Pig in the City (1998)                   1.239379
Wes Craven's New Nightmare (1994)              1.237630
South Park: Bigger, Longer and Uncut (1999)    1.235380
Deuce Bigalow: Male Gigolo (1999)              1.226337
Name: rating, dtype: float64
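
As a sanity check on what the standard deviation measures here, the following sketch uses two made-up titles: one with ratings spread across the scale and one where every viewer agreed. The frame `demo` and its values are assumptions. Note that pandas `std()` computes the sample standard deviation (`ddof=1`) by default.

```python
import pandas as pd

# Made-up ratings: 'A' splits viewers, 'B' gets the same score from everyone
demo = pd.DataFrame({
    'title': ['A', 'A', 'A', 'B', 'B', 'B'],
    'rating': [1, 3, 5, 3, 3, 3],
})

std_by_title = demo.groupby('title')['rating'].std()       # sample std, ddof=1
most_divisive = std_by_title.sort_values(ascending=False)  # highest spread first
print(most_divisive)
```

Both titles have the same mean rating of 3, so the mean alone would not tell them apart; only the spread does.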
rating_std_by_title.order(ascending = False)
---------------------------------------------------------------------------

AttributeError                            Traceback (most recent call last)

<ipython-input-89-e2202e0e8762> in <module>()
----> 1 rating_std_by_title.order(ascending = False)


C:\software\Anaconda3\lib\site-packages\pandas\core\generic.py in __getattr__(self, name)
   2968             if name in self._info_axis:
   2969                 return self[name]
-> 2970             return object.__getattribute__(self, name)
   2971 
   2972     def __setattr__(self, name, value):


AttributeError: 'Series' object has no attribute 'order'

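About sorting: the book uses the Series order method, while DataFrame has sort_values. Why does order no longer work? As far as I can tell, Series.order was deprecated in pandas 0.17 in favor of sort_values and removed in a later release, so sort_values is now the method to use on both Series and DataFrame.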

For sorting by index rather than by value, there is also the sort_index method.
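
A short sketch of the two sorting methods side by side, on a made-up Series (the name `s` and its values are illustrative): `sort_values` orders by the data and is the replacement for the old `order` call, while `sort_index` orders by the labels.

```python
import pandas as pd

# Made-up Series whose labels and values are in different orders
s = pd.Series([1.2, 1.5, 0.7], index=['b', 'c', 'a'])

by_value = s.sort_values(ascending=False)  # use this where older code called s.order()
by_index = s.sort_index()                  # sorts by the index labels instead
print(by_value)
print(by_index)
```

Both methods return a new object and leave `s` unchanged unless `inplace=True` is passed.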
