Python Data Analysis, Chapter 2-2

墨岚❤️

Categories: python, Data Processing and Analysis. Tags: python, data analysis

Article link: https://blog.csdn.net/LY_ysys629/article/details/73555315

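1 Analyzing movie rating data with pandas

The data are movie ratings contributed by MovieLens users from the late 1990s to the early 2000s. They include the ratings themselves, movie metadata (genres and release year), and demographic data about the users (age, zip code, gender, occupation, and so on). The dataset contains 1 million ratings of about 4,000 movies from about 6,000 users. It is split into three tables: ratings, user information, and movie information.

1.1 Load and display the raw data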

import pandas as pd

# Read the users table and name its columns.
# Raw strings (r'...') keep the backslashes in the Windows paths literal.
# sep='::' is a multi-character separator, which only the 'python' parser engine
# supports; passing engine='python' explicitly avoids the ParserWarning that
# pandas emits when it has to fall back from the 'c' engine.
userColumnsNames = ['user_id','gender','age','occupation','zip']
user = pd.read_table(r'E:\python\pythonDataAnalysis\pydata-book-master\ch02\movielens\users.dat',sep='::',header=None,names=userColumnsNames,engine='python')

# Read the ratings table and name its columns
rNames = ['user_id','movie_id','rating','timestamp']
ratings = pd.read_table(r'E:\python\pythonDataAnalysis\pydata-book-master\ch02\movielens\ratings.dat',sep='::',header=None,names=rNames,engine='python')

# Read the movies table and name its columns
moviesNames = ['movie_id','title','genres']
movies = pd.read_table(r'E:\python\pythonDataAnalysis\pydata-book-master\ch02\movielens\movies.dat',sep='::',header=None,names=moviesNames,engine='python')

user[:5]

	user_id	gender	age	occupation	zip
0	1	F	1	10	48067
1	2	M	56	16	70072
2	3	M	25	15	55117
3	4	M	45	7	02460
4	5	M	25	20	55455

ratings[:5]

	user_id	movie_id	rating	timestamp
0	1	1193	5	978300760
1	1	661	3	978302109
2	1	914	3	978301968
3	1	3408	4	978300275
4	1	2355	5	978824291

movies[:5]

	movie_id	title	genres
0	1	Toy Story (1995)	Animation|Children's|Comedy
1	2	Jumanji (1995)	Adventure|Children's|Fantasy
2	3	Grumpier Old Men (1995)	Comedy\|Romance
3	4	Waiting to Exhale (1995)	Comedy\|Drama
4	5	Father of the Bride Part II (1995)	Comedy
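To try the '::' parsing without the data files on disk, here is a minimal sketch that reads three of the user rows shown above from an in-memory string. The names `sample` and `users_demo` are made up for this illustration, and `dtype={'zip': str}` is an extra assumption added here so that zip codes keep their leading zeros even in a tiny sample.

```python
import io
import pandas as pd

# Three rows copied from the users table above, in the raw '::' layout
sample = "1::F::1::10::48067\n2::M::56::16::70072\n4::M::45::7::02460\n"

user_cols = ['user_id', 'gender', 'age', 'occupation', 'zip']
users_demo = pd.read_csv(
    io.StringIO(sample),
    sep='::',            # multi-character separator, interpreted as a regex
    engine='python',     # the C engine does not support such separators
    header=None,
    names=user_cols,
    dtype={'zip': str},  # assumption for this demo: keep '02460' as text
)
print(users_demo)
```

Reading the zip code as text matters because a numeric column would turn 02460 into 2460.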

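1.2 Compute each movie's average rating by gender

The user table holds gender and age, the movies table holds the titles, and the ratings table holds the scores, so the three tables have to be merged into one before this can be computed.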

data = pd.merge(pd.merge(ratings,user),movies)
data.head(5)

	user_id	movie_id	rating	timestamp	gender	age	occupation	zip	title	genres
0	1	1193	5	978300760	F	1	10	48067	One Flew Over the Cuckoo's Nest (1975)	Drama
1	2	1193	5	978298413	M	56	16	70072	One Flew Over the Cuckoo's Nest (1975)	Drama
2	12	1193	4	978220179	M	25	12	32793	One Flew Over the Cuckoo's Nest (1975)	Drama
3	15	1193	4	978199279	M	25	7	22903	One Flew Over the Cuckoo's Nest (1975)	Drama
4	17	1193	5	978158471	M	50	1	95350	One Flew Over the Cuckoo's Nest (1975)	Drama
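pd.merge works here without any key arguments because it joins on the column names the two tables share (user_id, then movie_id). A small sketch with the keys spelled out may make that easier to see. The three tiny tables and the titles 'Movie A' and 'Movie B' are invented for the illustration and are not part of the MovieLens data.

```python
import pandas as pd

# Toy stand-ins for the three tables (made-up values)
ratings_demo = pd.DataFrame({'user_id': [1, 1, 2],
                             'movie_id': [10, 20, 10],
                             'rating': [5, 3, 4]})
users_demo = pd.DataFrame({'user_id': [1, 2], 'gender': ['F', 'M']})
movies_demo = pd.DataFrame({'movie_id': [10, 20],
                            'title': ['Movie A', 'Movie B']})

# Same idea as pd.merge(pd.merge(ratings, user), movies),
# but with the join keys and join type written explicitly
merged = (ratings_demo
          .merge(users_demo, on='user_id', how='inner')
          .merge(movies_demo, on='movie_id', how='inner'))
print(merged)
```

Naming the keys is a little more typing, but it guards against an accidental join on some other column that happens to have the same name in both tables.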

# Average rating of each movie by gender
mean_ratings = pd.pivot_table(data,values = 'rating',index = ['title'],columns = ['gender'],aggfunc = 'mean')
mean_ratings[:5]

gender	F	M
title
$1,000,000 Duck (1971)	3.375000	2.761905
'Night Mother (1986)	3.388889	3.352941
'Til There Was You (1997)	2.675676	2.733333
'burbs, The (1989)	2.793478	2.962085
...And Justice for All (1979)	3.828571	3.689024
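The pivot table above is one way to get a title-by-gender grid of means. As a rough sketch of what it does, the same numbers can be produced with groupby, mean and unstack. The toy frame `demo` below is invented for the illustration.

```python
import pandas as pd

# Made-up ratings: two movies, rated by women (F) and men (M)
demo = pd.DataFrame({
    'title':  ['Movie A', 'Movie A', 'Movie A', 'Movie B', 'Movie B'],
    'gender': ['F', 'M', 'M', 'F', 'M'],
    'rating': [5, 3, 4, 2, 4],
})

# Route 1: pivot_table, as in the article
by_pivot = demo.pivot_table(values='rating', index='title',
                            columns='gender', aggfunc='mean')

# Route 2: group by both keys, average, then move gender into the columns
by_group = demo.groupby(['title', 'gender'])['rating'].mean().unstack()

print(by_pivot)
print(by_group)
```

Both routes give one row per title and one column per gender. The groupby form is the same pattern the article uses next to count ratings, with size() in place of mean().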

# Number of ratings each movie received from each gender
data.groupby(['title','gender']).size().unstack()[:10]

gender	F	M
title
$1,000,000 Duck (1971)	16.0	21.0
'Night Mother (1986)	36.0	34.0
'Til There Was You (1997)	37.0	15.0
'burbs, The (1989)	92.0	211.0
...And Justice for All (1979)	35.0	164.0
1-900 (1994)	1.0	1.0
10 Things I Hate About You (1999)	232.0	468.0
101 Dalmatians (1961)	187.0	378.0
101 Dalmatians (1996)	150.0	214.0
12 Angry Men (1957)	141.0	475.0

# Keep only movies with at least 250 ratings
numComm_by_title = data.groupby(['title']).size()
numComm_by_title[:5]  # title is the index

    title
    $1,000,000 Duck (1971)            37
    'Night Mother (1986)              70
    'Til There Was You (1997)         52
    'burbs, The (1989)               303
    ...And Justice for All (1979)    199
    dtype: int64

active_titles = numComm_by_title.index[numComm_by_title >= 250]
print(active_titles.dtype)
active_titles.size

object   
1216

active_titles[:5]

    Index([u"'burbs, The (1989)", u'10 Things I Hate About You (1999)',
           u'101 Dalmatians (1961)', u'101 Dalmatians (1996)',
           u'12 Angry Men (1957)'],
          dtype='object', name=u'title')

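# .loc selects rows (and, optionally, columns) by label; the older .ix indexer, which accepted both labels and integer positions, was removed in pandas 1.0
mean_ratings = mean_ratings.loc[active_titles]
# In IPython, mean_ratings? shows details about the object; help(mean_ratings) also works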
mean_ratings[:5]

gender	F	M
title
'burbs, The (1989)	2.793478	2.962085
10 Things I Hate About You (1999)	3.646552	3.311966
101 Dalmatians (1961)	3.791444	3.500000
101 Dalmatians (1996)	3.240000	2.911215
12 Angry Men (1957)	4.184397	4.328421
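The filtering step has three parts: count the ratings per title, keep the titles that reach the threshold, and select those rows from the table of means. Here is a self-contained sketch of that sequence on made-up data. The frame `demo` and the threshold `min_ratings = 2` are assumptions for the illustration (the article uses 250), and the row selection is written with label-based `.loc`.

```python
import pandas as pd

# Made-up ratings: Movie B has a single rating and should be dropped
demo = pd.DataFrame({
    'title':  ['Movie A', 'Movie A', 'Movie A', 'Movie B', 'Movie C', 'Movie C'],
    'gender': ['F', 'M', 'M', 'F', 'F', 'M'],
    'rating': [5, 3, 4, 2, 4, 1],
})

min_ratings = 2  # stand-in for the 250 used in the article

# 1. number of ratings per title
counts = demo.groupby('title').size()
# 2. titles that reach the threshold (boolean mask applied to the index)
popular = counts.index[counts >= min_ratings]
# 3. keep only those rows of the mean-rating table
means = demo.pivot_table(values='rating', index='title',
                         columns='gender', aggfunc='mean')
means_popular = means.loc[popular]
print(means_popular)
```

Filtering first keeps a movie with only a handful of ratings from showing up at the top of a ranking purely by chance.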

1.3 Which movies do women like best?

top_female_ratings = mean_ratings.sort_values(by='F',ascending = False)
top_female_ratings[:10]

gender	F	M
title
Close Shave, A (1995)	4.644444	4.473795
Wrong Trousers, The (1993)	4.588235	4.478261
Sunset Blvd. (a.k.a. Sunset Boulevard) (1950)	4.572650	4.464589
Wallace & Gromit: The Best of Aardman Animation (1996)	4.563107	4.385075
Schindler's List (1993)	4.562602	4.491415
Shawshank Redemption, The (1994)	4.539075	4.560625
Grand Day Out, A (1992)	4.537879	4.293255
To Kill a Mockingbird (1962)	4.536667	4.372611
Creature Comforts (1990)	4.513889	4.272277
Usual Suspects, The (1995)	4.513317	4.518248
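In current pandas, sorting a DataFrame by one of its columns is done with sort_values. When only the top few rows are wanted, DataFrame.nlargest is a possible shortcut. The sketch below compares the two on an invented table of means; `demo_means` and its values are made up for the illustration.

```python
import pandas as pd

# Made-up mean ratings by gender for three movies
demo_means = pd.DataFrame(
    {'F': [4.6, 3.2, 4.5], 'M': [4.4, 3.9, 4.6]},
    index=pd.Index(['Movie A', 'Movie B', 'Movie C'], name='title'),
)

# Sort by the F column in descending order, then take the first two rows
top_sorted = demo_means.sort_values(by='F', ascending=False)[:2]

# Same selection in one call
top_nlargest = demo_means.nlargest(2, 'F')

print(top_sorted)
print(top_nlargest)
```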

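1.3 Find the movies with the largest rating gap between men and women

Which movies best reflect the difference between men and women? Not the ones rated highest or lowest, but the ones where the two groups' average ratings are furthest apart. The code below finds them.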

# Add a column to the pivot table holding the male mean rating minus the female mean rating
mean_ratings['diff'] = mean_ratings['M']-mean_ratings['F']
mean_ratings['diff'][:10]

    title
    'burbs, The (1989)                     0.168607
    10 Things I Hate About You (1999)     -0.334586
    101 Dalmatians (1961)                 -0.291444
    101 Dalmatians (1996)                 -0.328785
    12 Angry Men (1957)                    0.144024
    13th Warrior, The (1999)               0.056000
    2 Days in the Valley (1996)           -0.244076
    20,000 Leagues Under the Sea (1954)    0.039102
    2001: A Space Odyssey (1968)           0.304156
    2010 (1984)                           -0.033097
    Name: diff, dtype: float64

mean_ratings_M = mean_ratings.sort_values(by='diff',ascending = False)

mean_ratings_M[:5]  # movies that men prefer

gender	F	M	diff
title
Good, The Bad and The Ugly, The (1966)	3.494949	4.221300	0.726351
Kentucky Fried Movie, The (1977)	2.878788	3.555147	0.676359
Dumb & Dumber (1994)	2.697987	3.336595	0.638608
Longest Day, The (1962)	3.411765	4.031447	0.619682
Cable Guy, The (1996)	2.250000	2.863787	0.613787

# Movies that women prefer: reverse the row order
mean_ratings_F = mean_ratings_M[::-1]
mean_ratings_F[:10]

gender	F	M	diff
title
Dirty Dancing (1987)	3.790378	2.959596	-0.830782
Jumpin' Jack Flash (1986)	3.254717	2.578358	-0.676359
Grease (1978)	3.975265	3.367041	-0.608224
Little Women (1994)	3.870588	3.321739	-0.548849
Steel Magnolias (1989)	3.901734	3.365957	-0.535777
Anastasia (1997)	3.800000	3.281609	-0.518391
Rocky Horror Picture Show, The (1975)	3.673016	3.160131	-0.512885
Color Purple, The (1985)	4.158192	3.659341	-0.498851
Age of Innocence, The (1993)	3.827068	3.339506	-0.487561
Free Willy (1993)	2.921348	2.438776	-0.482573
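The two lists above rank by the signed difference, so the men's favorites and the women's favorites come out separately. If a single ranking by the size of the gap is wanted, regardless of direction, one option is to sort by the absolute value of the difference. This is a sketch on made-up numbers; `demo_means` and its values are invented for the illustration.

```python
import pandas as pd

# Made-up mean ratings by gender
demo_means = pd.DataFrame(
    {'F': [3.5, 3.8, 4.0], 'M': [4.2, 3.0, 4.1]},
    index=pd.Index(['Movie A', 'Movie B', 'Movie C'], name='title'),
)

# Signed gap, as in the article: positive means men rated it higher
demo_means['diff'] = demo_means['M'] - demo_means['F']

# Order the titles by the magnitude of the gap, largest first
order = demo_means['diff'].abs().sort_values(ascending=False).index
by_gap = demo_means.loc[order]
print(by_gap)
```

The sign of diff is still there in the result, so each row shows which group liked the movie more.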

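1.4 Find the most divisive movies

To find the most divisive movies from the ratings alone, independent of gender, compute the variance or standard deviation of each movie's ratings.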

# Standard deviation of the ratings for each movie
ratings_title_std = data.groupby(['title'])['rating'].std()
print(type(ratings_title_std))
ratings_title_std[:5]

<class 'pandas.core.series.Series'>

title
$1,000,000 Duck (1971)           1.092563
'Night Mother (1986)             1.118636
'Til There Was You (1997)        1.020159
'burbs, The (1989)               1.107760
...And Justice for All (1979)    0.878110
Name: rating, dtype: float64

# Sort the Series by value with sort_values (Series.order, used in older pandas, has been removed); sort_index sorts by the index labels instead
ratings_title_std_sort  = ratings_title_std.sort_values(ascending = False)
ratings_title_std_sort[:5]

title
Foreign Student (1994)                                             2.828427
Criminal Lovers (Les Amants Criminels) (1999)                      2.309401
Identification of a Woman (Identificazione di una donna) (1982)    2.121320
Sunset Park (1996)                                                 2.121320
Eaten Alive (1976)                                                 2.121320
Name: rating, dtype: float64
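The ranking above is computed over every title in data, including movies with very few ratings. A sample standard deviation of 2.828427, as shown for Foreign Student (1994), is exactly what two ratings of 1 and 5 would give, so the top of this list is probably dominated by rarely rated movies. One possible refinement, sketched below on made-up data, is to restrict the ranking to titles with enough ratings, in the same way active_titles was used in section 1.2. The frame `demo` and the threshold `min_ratings = 3` are assumptions for the illustration.

```python
import pandas as pd

# Made-up ratings: Movie A has only two ratings (1 and 5), Movie B has four
demo = pd.DataFrame({
    'title':  ['Movie A'] * 2 + ['Movie B'] * 4,
    'rating': [1, 5, 2, 3, 4, 5],
})

min_ratings = 3  # stand-in for the 250 used earlier in the article

# Titles with enough ratings
counts = demo.groupby('title').size()
popular = counts.index[counts >= min_ratings]

# Standard deviation per title, for all titles and for the popular ones only
std_all = demo.groupby('title')['rating'].std()
std_popular = std_all.loc[popular].sort_values(ascending=False)

print(std_all)
print(std_popular)
```

In the toy data Movie A has the larger standard deviation, but it rests on two ratings only and drops out once the threshold is applied.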