利用python进行数据分析数据集_《利用Python进行数据分析》终章·数据分析案例·学习笔记(二)...

一、第14章 数据分析案例

本书正文的最后一章,我们来看一些真实世界的数据集。对于每个数据集,我们会用之前介绍的方法,从原始数据中提取有意义的内容。展示的方法适用于其它数据集,也包括你的。本章包含了一些各种各样的案例数据集,可以用来练习。

下载本书代码:https://github.com/wesm/pydata-book(建议把代码下载下来之后,安装好Anaconda 3.6,在目录文件夹中用Jupyter notebook打开)

二、实例

2.4 MovieLens 1M数据集

GroupLens Research(

MovieLens 1M数据集含有来自6000名用户对4000部电影的100万条评分数据。它分为三个表:评分、用户信息和电影信息。将该数据从zip文件中解压出来之后,可以通过pandas.read_table将各个表分别读到一个pandas DataFrame对象中:

import pandas as pd

pd.options.display.max_rows = 10 #限制显示数量

unames = ['user_id', 'gender', 'age', 'occupation', 'zip']

users = pd.read_table('datasets/movielens/users.dat', sep='::',

header=None, names=unames)

rnames = ['user_id', 'movie_id', 'rating', 'timestamp']

ratings = pd.read_table('datasets/movielens/ratings.dat', sep='::',

header=None, names=rnames)

mnames = ['movie_id', 'title', 'genres']

movies = pd.read_table('datasets/movielens/movies.dat', sep='::',

header=None, names=mnames)

利用Python的切片语法,通过查看每个DataFrame的前几行即可验证数据加载工作是否一切顺利:

In [69]: users[:5]

Out[69]:

user_id gender age occupation zip

0 1 F 1 10 48067

1 2 M 56 16 70072

2 3 M 25 15 55117

3 4 M 45 7 02460

4 5 M 25 20 55455

In [70]: ratings[:5]

Out[70]:

user_id movie_id rating timestamp

0 1 1193 5 978300760

1 1 661 3 978302109

2 1 914 3 978301968

3 1 3408 4 978300275

4 1 2355 5 978824291

In [71]: movies[:5]

Out[71]:

movie_id title genres

0 1 Toy Story (1995) Animation|Children's|Comedy

1 2 Jumanji (1995) Adventure|Children's|Fantasy

2 3 Grumpier Old Men (1995) Comedy|Romance

3 4 Waiting to Exhale (1995) Comedy|Drama

4 5 Father of the Bride Part II (1995) Comedy

In [72]: ratings

Out[72]:

user_id movie_id rating timestamp

0 1 1193 5 978300760

1 1 661 3 978302109

2 1 914 3 978301968

3 1 3408 4 978300275

4 1 2355 5 978824291

... ... ... ... ...

1000204 6040 1091 1 956716541

1000205 6040 1094 5 956704887

1000206 6040 562 5 956704746

1000207 6040 1096 4 956715648

1000208 6040 1097 4 956715569

注意,其中的年龄和职业是以编码形式给出的,它们的具体含义请参考该数据集的README文件。分析散布在三个表中的数据可不是一件轻松的事情。假设我们想要根据性别和年龄计算某部电影的平均得分,如果将所有数据都合并到一个表中的话问题就简单多了。我们先用pandas的merge函数将ratings跟users合并到一起,然后再将movies也合并进去。pandas会根据列名的重叠情况推断出哪些列是合并(或连接)键:

In [73]: data = pd.merge(pd.merge(ratings, users), movies)

In [74]: data

Out[74]:

user_id ... genres

0 1 ... Drama

1 2 ... Drama

2 12 ... Drama

3 15 ... Drama

4 17 ... Drama

... ... ... ...

1000204 5949 ... Documentary

1000205 5675 ... Drama

1000206 5780 ... Drama

1000207 5851 ... Comedy|Drama|Western

1000208 5938 ... Documentary

[1000209 rows x 10 columns]

In [75]: data.iloc[0]

Out[75]:

user_id 1

movie_id 1193

rating 5

timestamp 978300760

gender F

age 1

occupation 10

zip 48067

title One Flew Over the Cuckoo's Nest (1975)

genres Drama

Name: 0, dtype: object

为了按性别计算每部电影的平均得分,我们可以使用pivot_table方法:

In [76]: mean_ratings = data.pivot_table('rating', index='title',

....: columns='gender', aggfunc='mean')

In [77]: mean_ratings[:5]

Out[77]:

gender F M

title

$1,000,000 Duck (1971) 3.375000 2.761905

'Night Mother (1986) 3.388889 3.352941

'Til There Was You (1997) 2.675676 2.733333

'burbs, The (1989) 2.793478 2.962085

...And Justice for All (1979) 3.828571 3.689024

该操作产生了另一个DataFrame,其内容为电影平均得分,行标为电影名称,列标为性别。现在,我打算过滤掉评分数据不够250条的电影。为了达到这个目的,我先对title进行分组,然后利用size()得到一个含有各电影分组大小的Series对象:

In [78]: ratings_by_title = data.groupby('title').size()

In [79]: ratings_by_title[:10]

Out[79]:

title

$1,000,000 Duck (1971) 37

'Night Mother (1986) 70

'Til There Was You (1997) 52

'burbs, The (1989) 303

...And Justice for All (1979) 199

1-900 (1994) 2

10 Things I Hate About You (1999) 700

101 Dalmatians (1961) 565

101 Dalmatians (1996) 364

12 Angry Men (1957) 616

dtype: int64

In [80]: active_titles = ratings_by_title.index[ratings_by_title >= 250]

In [81]: active_titles

Out[81]:

Index([''burbs, The (1989)', '10 Things I Hate About You (1999)',

'101 Dalmatians (1961)', '101 Dalmatians (1996)', '12 Angry Men (1957)',

'13th Warrior, The (1999)', '2 Days in the Valley (1996)',

'20,000 Leagues Under the Sea (1954)', '2001: A Space Odyssey (1968)',

'2010 (1984)',

...

'X-Men (2000)', 'Year of Living Dangerously (1982)',

'Yellow Submarine (1968)', 'You've Got Mail (1998)',

'Young Frankenstein (1974)', 'Young Guns (1988)',

'Young Guns II (1990)', 'Young Sherlock Holmes (1985)',

'Zero Effect (1998)', 'eXistenZ (1999)'],

dtype='object', name='title', length=1216)

标题索引中含有评分数据大于250条的电影名称,然后我们就可以据此从前面的mean_ratings中选取所需的行了:

In [82]: mean_ratings = mean_ratings.loc[active_titles]

In [83]: mean_ratings

Out[83]:

gender F M

title

'burbs, The (1989) 2.793478 2.962085

10 Things I Hate About You (1999) 3.646552 3.311966

101 Dalmatians (1961) 3.791444 3.500000

101 Dalmatians (1996) 3.240000 2.911215

12 Angry Men (1957) 4.184397 4.328421

... ... ...

Young Guns (1988) 3.371795 3.425620

Young Guns II (1990) 2.934783 2.904025

Young Sherlock Holmes (1985) 3.514706 3.363344

Zero Effect (1998) 3.864407 3.723140

eXistenZ (1999) 3.098592 3.289086

[1216 rows x 2 columns]

为了了解女性观众最喜欢的电影,我们可以对F列降序排列:

In [85]: top_female_ratings = mean_ratings.sort_values(by='F', ascending=False)

In [86]: top_female_ratings[:10]

Out[86]:

gender F M

title

Close Shave, A (1995) 4.644444 4.473795

Wrong Trousers, The (1993) 4.588235 4.478261

Sunset Blvd. (a.k.a. Sunset Boulevard) (1950) 4.572650 4.464589

Wallace & Gromit: The Best of Aardman Animation... 4.563107 4.385075

Schindler's List (1993) 4.562602 4.491415

Shawshank Redemption, The (1994) 4.539075 4.560625

Grand Day Out, A (1992) 4.537879 4.293255

To Kill a Mockingbird (1962) 4.536667 4.372611

Creature Comforts (1990) 4.513889 4.272277

Usual Suspects, The (1995) 4.513317 4.518248

2.5计算评分分歧

假设我们想要找出男性和女性观众分歧最大的电影。一个办法是给mean_ratings加上一个用于存放平均得分之差的列,并对其进行排序:

In [87]: mean_ratings['diff'] = mean_ratings['M'] - mean_ratings['F']

按"diff"排序即可得到分歧最大且女性观众更喜欢的电影:

In [88]: sorted_by_diff = mean_ratings.sort_values(by='diff')

In [89]: sorted_by_diff[:10]

Out[89]:

gender F M diff

title

Dirty Dancing (1987) 3.790378 2.959596 -0.830782

Jumpin' Jack Flash (1986) 3.254717 2.578358 -0.676359

Grease (1978) 3.975265 3.367041 -0.608224

Little Women (1994) 3.870588 3.321739 -0.548849

Steel Magnolias (1989) 3.901734 3.365957 -0.535777

Anastasia (1997) 3.800000 3.281609 -0.518391

Rocky Horror Picture Show, The (1975) 3.673016 3.160131 -0.512885

Color Purple, The (1985) 4.158192 3.659341 -0.498851

Age of Innocence, The (1993) 3.827068 3.339506 -0.487561

Free Willy (1993) 2.921348 2.438776 -0.482573

对排序结果反序并取出前10行,得到的则是男性观众更喜欢的电影:

In [90]: sorted_by_diff[::-1][:10]#倒序

Out[90]:

gender F M diff

title

Good, The Bad and The Ugly, The (1966) 3.494949 4.221300 0.726351

Kentucky Fried Movie, The (1977) 2.878788 3.555147 0.676359

Dumb & Dumber (1994) 2.697987 3.336595 0.638608

Longest Day, The (1962) 3.411765 4.031447 0.619682

Cable Guy, The (1996) 2.250000 2.863787 0.613787

Evil Dead II (Dead By Dawn) (1987) 3.297297 3.909283 0.611985

Hidden, The (1987) 3.137931 3.745098 0.607167

Rocky III (1982) 2.361702 2.943503 0.581801

Caddyshack (1980) 3.396135 3.969737 0.573602

For a Few Dollars More (1965) 3.409091 3.953795 0.544704

如果只是想要找出分歧最大的电影(不考虑性别因素),则可以计算得分数据的方差或标准差:

In [91]: rating_std_by_title = data.groupby('title')['rating'].std()#标准差

In [92]: rating_std_by_title = rating_std_by_title.loc[active_titles]#选择列

In [93]: rating_std_by_title.sort_values(ascending=False)[:10]#排序

Out[93]:

title

Dumb & Dumber (1994) 1.321333

Blair Witch Project, The (1999) 1.316368

Natural Born Killers (1994) 1.307198

Tank Girl (1995) 1.277695

Rocky Horror Picture Show, The (1975) 1.260177

Eyes Wide Shut (1999) 1.259624

Evita (1996) 1.253631

Billy Madison (1995) 1.249970

Fear and Loathing in Las Vegas (1998) 1.246408

Bicentennial Man (1999) 1.245533

Name: rating, dtype: float64

可能你已经注意到了,电影分类是以竖线(|)分隔的字符串形式给出的。如果想对电影分类进行分析的话,就需要先将其转换成更有用的形式才行。

三、小结

pd.options.display.max_rows = 10 #限制显示数量

data = pd.merge(pd.merge(ratings, users), movies)

data.iloc[0]

mean_ratings = data.pivot_table('rating', index='title',

....: columns='gender', aggfunc='mean')

ratings_by_title = data.groupby('title').size()

active_titles = ratings_by_title.index[ratings_by_title >= 250]

mean_ratings = mean_ratings.loc[active_titles]

top_female_ratings = mean_ratings.sort_values(by='F', ascending=False)

mean_ratings['diff'] = mean_ratings['M'] - mean_ratings['F']

sorted_by_diff = mean_ratings.sort_values(by='diff')

sorted_by_diff[::-1][:10] #倒序

rating_std_by_title = data.groupby('title')['rating'].std()#标准差

rating_std_by_title = rating_std_by_title.loc[active_titles]#选择列

rating_std_by_title.sort_values(ascending=False)[:10]#排序

·后记

最后一章,查缺补漏。本节涉及了表的读取、合并、选择、透视、分组、排序、筛选、统计,功能相对齐全。

欢迎点赞·评论·分享·收藏·浇筑树苗

(o・ェ・o)ノ r (苗·lv.1576)

  • 0
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 0
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值