Finding the second-largest value in Python: Python data analysis case study (Part 2)

Password: j1gl

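Part 1: Find the five movies most popular with women

Approach: load the data with read_table, join the three tables with merge, filter out movies with 250 or fewer ratings, then compute each movie's mean rating by gender.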

step1

import pandas as pd

movies=pd.read_table(r'C:\Users\Administrator\Downloads\pydata-book-2nd-edition\datasets\movielens\movies.dat',sep='::',
                     engine='python',header=None,names=['movie_id','title','genres'])

users=pd.read_table(r'C:\Users\Administrator\Downloads\pydata-book-2nd-edition\datasets\movielens\users.dat',sep='::',
                    engine='python',header=None,names=['user_id','gender','age','occupation','zip'])

ratings=pd.read_table(r'C:\Users\Administrator\Downloads\pydata-book-2nd-edition\datasets\movielens\ratings.dat',sep='::',
                      engine='python',header=None,names=['user_id','movie_id','rating','timestamp'])

# The source .dat files use '::' as the separator and have no header row, so they are loaded with read_table() and the column names are passed in explicitly.

# Each table must be read from its own file (movies.dat, users.dat, ratings.dat). engine='python' avoids the ParserWarning that pandas emits when it falls back from the C parser for a multi-character separator.

step2

movies.info()

RangeIndex: 3883 entries, 0 to 3882
Data columns (total 3 columns):
movie_id    3883 non-null int64
title       3883 non-null object
genres      3883 non-null object
dtypes: int64(1), object(2)
memory usage: 91.1+ KB

users.info()

RangeIndex: 6040 entries, 0 to 6039
Data columns (total 5 columns):
user_id       6040 non-null int64
gender        6040 non-null object
age           6040 non-null int64
occupation    6040 non-null int64
zip           6040 non-null object
dtypes: int64(3), object(2)
memory usage: 236.0+ KB

ratings.info()

RangeIndex: 1000209 entries, 0 to 1000208
Data columns (total 4 columns):
user_id      1000209 non-null int64
movie_id     1000209 non-null int64
rating       1000209 non-null int64
timestamp    1000209 non-null int64
dtypes: int64(4)
memory usage: 30.5 MB

# Basic information about each table
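To try the loading step without downloading the dataset, the sketch below parses two lines in the same '::'-separated layout from an in-memory string. The two sample rows are the first two users as they appear in the merged output of this article; the name users_demo is hypothetical, and engine='python' is passed explicitly because the default C parser only handles single-character separators.

```python
import io

import pandas as pd

# Two rows in the users.dat layout: user_id::gender::age::occupation::zip
sample = "1::F::1::10::48067\n2::M::56::16::70072\n"

users_demo = pd.read_csv(
    io.StringIO(sample),
    sep="::",          # multi-character separator
    engine="python",   # the C parser only supports single-character separators
    header=None,       # the data has no header row
    names=["user_id", "gender", "age", "occupation", "zip"],
)

# Inspect column types and row count, as info() does for the full tables
print(users_demo.dtypes)
print(users_demo.shape)
```

Checking the row count and the dtypes right after loading is a quick way to catch a table that was read from the wrong file or split on the wrong separator.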

step3

data=pd.merge(pd.merge(users,ratings,how='inner'),movies,how='inner')

data[:5]

   user_id  gender  age  occupation    zip  movie_id  rating  timestamp  title                                   genres
0        1       F    1          10  48067      1193       5  978300760  One Flew Over the Cuckoo's Nest (1975)  Drama
1        2       M   56          16  70072      1193       5  978298413  One Flew Over the Cuckoo's Nest (1975)  Drama
2       12       M   25          12  32793      1193       4  978220179  One Flew Over the Cuckoo's Nest (1975)  Drama
3       15       M   25           7  22903      1193       4  978199279  One Flew Over the Cuckoo's Nest (1975)  Drama
4       17       M   50           1  95350      1193       5  978158471  One Flew Over the Cuckoo's Nest (1975)  Drama
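The nested pd.merge call works without an on= argument because pandas uses the column names the two tables share as join keys (user_id for users and ratings, then movie_id for the result and movies). The sketch below illustrates this on three tiny made-up tables; all *_demo names and values are hypothetical. It also shows the same join with the keys spelled out, which may be easier to read and safer if the tables ever gain another shared column.

```python
import pandas as pd

# Tiny made-up tables with the same key columns as the real ones
users_demo = pd.DataFrame({"user_id": [1, 2], "gender": ["F", "M"]})
ratings_demo = pd.DataFrame(
    {"user_id": [1, 2, 2], "movie_id": [10, 10, 20], "rating": [5, 4, 3]}
)
movies_demo = pd.DataFrame({"movie_id": [10, 20], "title": ["Movie A", "Movie B"]})

# Implicit keys: pandas joins on the overlapping column names
implicit = pd.merge(pd.merge(users_demo, ratings_demo, how="inner"),
                    movies_demo, how="inner")

# Explicit keys: the same join with the key columns named
explicit = (
    users_demo.merge(ratings_demo, on="user_id", how="inner")
              .merge(movies_demo, on="movie_id", how="inner")
)

print(implicit)
print(implicit.equals(explicit))
```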

step4

rating_counts=data['title'].value_counts()

# Count the number of ratings for each movie

index=rating_counts.index[rating_counts>250]

# Keep only the movies with more than 250 ratings

index[:5]

Index(['American Beauty (1999)', 'Star Wars: Episode IV - A New Hope (1977)',

'Star Wars: Episode V - The Empire Strikes Back (1980)',

'Star Wars: Episode VI - Return of the Jedi (1983)',

'Jurassic Park (1993)'])
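The filter in this step combines value_counts() with a boolean mask on the resulting index. A minimal sketch on made-up titles follows; the threshold of 2 is a hypothetical stand-in for the 250 used in the article. Note that the comparison is strict, so a title with exactly the threshold number of ratings is dropped.

```python
import pandas as pd

# One entry per rating; made-up titles
titles = pd.Series(["A", "A", "A", "B", "B", "C"], name="title")

# Number of ratings per title, sorted from most to least rated
counts = titles.value_counts()

threshold = 2  # hypothetical; the article uses 250
# Boolean mask on the index: keep titles with strictly more ratings than the threshold
popular = counts.index[counts > threshold]

print(counts)
print(list(popular))
```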

step5

table_rating=data.pivot_table('rating',index='title',columns='gender',aggfunc='mean').loc[index]

# Use title as the row index (dropping movies with 250 or fewer ratings) and gender as the columns, then compute the mean rating

rating_order=table_rating.sort_values(by='F',ascending=False)

# Sort by women's mean rating, from highest to lowest

rating_order.head()

gender                                                         F         M
Close Shave, A (1995)                                   4.644444  4.473795
Wrong Trousers, The (1993)                              4.588235  4.478261
Sunset Blvd. (a.k.a. Sunset Boulevard) (1950)           4.572650  4.464589
Wallace & Gromit: The Best of Aardman Animation (1996)  4.563107  4.385075
Schindler's List (1993)                                 4.562602  4.491415

# These are the five movies most popular with women
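To see what the pivot_table call produces, here is a small sketch on made-up ratings (data_demo and its values are hypothetical). Each cell of the result is the mean rating for one title and one gender, and sorting on the 'F' column ranks titles by women's mean rating.

```python
import pandas as pd

# Made-up long-format ratings: one row per (title, gender, rating)
data_demo = pd.DataFrame({
    "title":  ["A", "A", "A", "B", "B", "B"],
    "gender": ["F", "F", "M", "F", "M", "M"],
    "rating": [5, 4, 3, 2, 4, 5],
})

# Rows are titles, columns are genders, cells are mean ratings
table = data_demo.pivot_table(values="rating", index="title",
                              columns="gender", aggfunc="mean")

# Rank by women's mean rating, highest first
ranked = table.sort_values(by="F", ascending=False)

print(ranked)
```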

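Part 2: Find the movies that divide men and women the most and are preferred by men

Approach: add a new column 'diff' holding, for each movie, the men's mean rating minus the women's mean rating; the larger the value, the more men prefer the movie relative to women.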

step1

rating_order['diff']=rating_order['M']-rating_order['F']

rating_order.head()

gender                                                         F         M       diff
Close Shave, A (1995)                                   4.644444  4.473795  -0.170650
Wrong Trousers, The (1993)                              4.588235  4.478261  -0.109974
Sunset Blvd. (a.k.a. Sunset Boulevard) (1950)           4.572650  4.464589  -0.108060
Wallace & Gromit: The Best of Aardman Animation (1996)  4.563107  4.385075  -0.178032
Schindler's List (1993)                                 4.562602  4.491415  -0.071187

step2

rating_order.sort_values(by='diff',ascending=False).head()

gender                                         F         M      diff
Good, The Bad and The Ugly, The (1966)  3.494949  4.221300  0.726351
Kentucky Fried Movie, The (1977)        2.878788  3.555147  0.676359
Dumb & Dumber (1994)                    2.697987  3.336595  0.638608
Longest Day, The (1962)                 3.411765  4.031447  0.619682
Cable Guy, The (1996)                   2.250000  2.863787  0.613787

# The larger the difference, the greater the disagreement, and the more the movie is preferred by men
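Because diff is defined as M minus F, its sign says which group rates a movie higher: positive means men, negative means women. The sketch below uses a made-up table (all values hypothetical) to show the descending sort used here, the ascending sort that would surface movies preferred by women, and, as an assumption beyond the article, a sort on the absolute value for disagreement in either direction.

```python
import pandas as pd

# Made-up mean ratings by gender
table = pd.DataFrame(
    {"F": [4.5, 2.0, 3.0], "M": [3.0, 4.5, 3.1]},
    index=["A", "B", "C"],
)

# Positive diff: men rate the movie higher; negative: women do
table["diff"] = table["M"] - table["F"]

# Largest positive gap first: most preferred by men
prefer_m = table.sort_values(by="diff", ascending=False)

# Most negative gap first: most preferred by women
prefer_f = table.sort_values(by="diff")

# Largest gap in either direction (not part of the original steps)
most_divisive = table["diff"].abs().sort_values(ascending=False)

print(prefer_m)
print(prefer_f)
print(most_divisive)
```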

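Part 3: Find the movies with the least disagreement (regardless of gender)

Approach: use std() to compute the standard deviation of each movie's ratings, sort the result, and take the movies with the smallest standard deviation.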

step1

rating_std=data.groupby('title')['rating'].std().loc[index]

# Compute the standard deviation of the ratings for each movie (dropping movies with 250 or fewer ratings)

step2

rating_std.sort_values()[:5]

title
Close Shave, A (1995)               0.667143
Rear Window (1954)                  0.688946
Great Escape, The (1963)            0.692585
Shawshank Redemption, The (1994)    0.700443
Wrong Trousers, The (1993)          0.708666

# These are the five movies with the least disagreement
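A small sketch of the same groupby and std() pattern on made-up ratings may help show why a low standard deviation means agreement (data_demo and its values are hypothetical). Movie A gets nearly identical scores, while movie B splits viewers between 1 and 5. Note that pandas std() computes the sample standard deviation (ddof=1) by default.

```python
import pandas as pd

# Made-up ratings: A is rated consistently, B splits the audience
data_demo = pd.DataFrame({
    "title":  ["A"] * 4 + ["B"] * 4,
    "rating": [4, 4, 4, 5, 1, 5, 1, 5],
})

# Sample standard deviation (ddof=1) of the ratings per title
std_by_title = data_demo.groupby("title")["rating"].std()

# Smallest spread first: the titles viewers agree on most
least_disagreement = std_by_title.sort_values()

print(least_disagreement)
```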
