大家好,给大家分享一下python数据分析案例 药店销售数据分析,很多人还不知道这一点。下面详细解释一下。现在让我们来看看!
Source code download: 本文相关源码
大家好,小编为大家解答利用python进行数据分析案例的问题。很多人还不知道python能进行数据分析的案例,现在让我们一起来看看吧!
点此下载数据集
1.MoviesLens 1M数据集
GroupLens实验室提供了一些从MoviesLens用户那里收集的20世纪90年代末到21世纪初的电影评分数据的集合python自动化运维应用。浙西额数据提供了电影的评分、流派、年份和观众数据(年龄、邮编、性别、职业)毕业论文查重降重方法。 MovisLens1M数据集包含6000个用户对4000部电影的100万个评分。数据分布在三个表格之中:分别包含评分、用户信息和电影信息。
import pandas as pd
#文件皆位于根目录下
# 读取"users.dat"文件,它将列名设置为unames,并将分隔符设置为::。header=None参数表示文件中没有标题行,因此应使用unames中提供的列名。
unames = ["user_id", "gender", "age", "occupation", "zip"]
users = pd.read_table("datasets/movielens/users.dat", sep="::", header=None, names=unames, engine="python")
# 读取"ratings.dat"文件
rnames = ["user_id", "movie_id", "rating", "timestamp"]
ratings = pd.read_table("datasets/movielens/ratings.dat", sep="::", header=None, names=rnames, engine="python")
# 读取"movies.dat"文件
mnames = ["movie_id", "title", "genres"]
movies = pd.read_table("datasets/movielens/movies.dat", sep="::", header=None, names=mnames, engine="python")
#这段代码片段读取了三个文件("users.dat"、"ratings.dat"和"movies.dat"),
#并将它们分别存储在不同的pandas DataFrame中(users、ratings和movies),以便进一步分析或处理。
users.head(5)#输出用户DataFrame的前5行数据
ratings.head(5)# 输出评分DataFrame的前5行数据
movies.head(5)# 输出电影DataFrame的前5行数据
ratings# 输出完整的ratings DataFrame
user_id | movie_id | rating | timestamp | |
---|---|---|---|---|
0 | 1 | 1193 | 5 | 978300760 |
1 | 1 | 661 | 3 | 978302109 |
2 | 1 | 914 | 3 | 978301968 |
3 | 1 | 3408 | 4 | 978300275 |
4 | 1 | 2355 | 5 | 978824291 |
... | ... | ... | ... | ... |
1000204 | 6040 | 1091 | 1 | 956716541 |
1000205 | 6040 | 1094 | 5 | 956704887 |
1000206 | 6040 | 562 | 5 | 956704746 |
1000207 | 6040 | 1096 | 4 | 956715648 |
1000208 | 6040 | 1097 | 4 | 956715569 |
1000209 rows × 4 columns
#使用pd.merge()函数将ratings、users和movies三个DataFrame合并为一个名为data的新DataFrame。
#这个新的data DataFrame将包含了三个原始DataFrame中的所有列和行,基于它们之间的共同列进行合并。
data = pd.merge(pd.merge(ratings, users), movies)
data#输出data DataFrame合并后的所有数据
data.iloc[0]#输出data DataFrame中的第一行数据
user_id 1 movie_id 1193 rating 5 timestamp 978300760 gender F age 1 occupation 10 zip 48067 title One Flew Over the Cuckoo's Nest (1975) genres Drama Name: 0, dtype: object
#使用pivot_table()函数基于data DataFrame创建了一个名为mean_ratings的新DataFrame,
#该DataFrame计算了每个电影标题(title)在不同性别(gender)下的平均评分。
mean_ratings = data.pivot_table("rating", index="title",columns="gender", aggfunc="mean")
mean_ratings.head(5)#mean_ratings DataFrame 的前5行输出
gender | F | M |
---|---|---|
title | ||
$1,000,000 Duck (1971) | 3.375000 | 2.761905 |
'Night Mother (1986) | 3.388889 | 3.352941 |
'Til There Was You (1997) | 2.675676 | 2.733333 |
'burbs, The (1989) | 2.793478 | 2.962085 |
...And Justice for All (1979) | 3.828571 | 3.689024 |
#使用groupby()函数对data DataFrame按电影标题(title)进行分组,并计算每个电影标题对应的评分数量。
ratings_by_title = data.groupby("title").size()
ratings_by_title.head()#显示每个电影标题对应的评分数量。
active_titles = ratings_by_title.index[ratings_by_title >= 250]#从评分数量中筛选出了评分数大于等于250的活跃电影标题
active_titles
Index([''burbs, The (1989)', '10 Things I Hate About You (1999)', '101 Dalmatians (1961)', '101 Dalmatians (1996)', '12 Angry Men (1957)', '13th Warrior, The (1999)', '2 Days in the Valley (1996)', '20,000 Leagues Under the Sea (1954)', '2001: A Space Odyssey (1968)', '2010 (1984)', ... 'X-Men (2000)', 'Year of Living Dangerously (1982)', 'Yellow Submarine (1968)', 'You've Got Mail (1998)', 'Young Frankenstein (1974)', 'Young Guns (1988)', 'Young Guns II (1990)', 'Young Sherlock Holmes (1985)', 'Zero Effect (1998)', 'eXistenZ (1999)'], dtype='object', name='title', length=1216)
#使用loc索引器从mean_ratings DataFrame 中选择评分数大于等于250的活跃电影标题
mean_ratings = mean_ratings.loc[active_titles]
mean_ratings
gender | F | M |
---|---|---|
title | ||
'burbs, The (1989) | 2.793478 | 2.962085 |
10 Things I Hate About You (1999) | 3.646552 | 3.311966 |
101 Dalmatians (1961) | 3.791444 | 3.500000 |
101 Dalmatians (1996) | 3.240000 | 2.911215 |
12 Angry Men (1957) | 4.184397 | 4.328421 |
... | ... | ... |
Young Guns (1988) | 3.371795 | 3.425620 |
Young Guns II (1990) | 2.934783 | 2.904025 |
Young Sherlock Holmes (1985) | 3.514706 | 3.363344 |
Zero Effect (1998) | 3.864407 | 3.723140 |
eXistenZ (1999) | 3.098592 | 3.289086 |
1216 rows × 2 columns
#使用rename()函数将mean_ratings DataFrame 中的索引标签从"Seven Samurai (The Magnificent Seven) (Shichinin no samurai) (1954)"更改为
#"Seven Samurai (Shichinin no samurai) (1954)"。
mean_ratings = mean_ratings.rename(index={"Seven Samurai (The Magnificent Seven) (Shichinin no samurai) (1954)":
"Seven Samurai (Shichinin no samurai) (1954)"})
#使用sort_values()函数根据女性观众的平均评分("F"列)对mean_ratings DataFrame 进行降序排序
top_female_ratings = mean_ratings.sort_values("F", ascending=False)
top_female_ratings.head()
gender | F | M | diff |
---|---|---|---|
title | |||
Close Shave, A (1995) | 4.644444 | 4.473795 | -0.170650 |
Wrong Trousers, The (1993) | 4.588235 | 4.478261 | -0.109974 |
Sunset Blvd. (a.k.a. Sunset Boulevard) (1950) | 4.572650 | 4.464589 | -0.108060 |
Wallace & Gromit: The Best of Aardman Animation (1996) | 4.563107 | 4.385075 | -0.178032 |
Schindler's List (1993) | 4.562602 | 4.491415 | -0.071187 |
#通过计算mean_ratings DataFrame 中男性观众平均评分("M"列)与女性观众平均评分("F"列)的差异,将差异值添加为名为"diff"的新列。
mean_ratings["diff"] = mean_ratings["M"] - mean_ratings["F"]
mean_ratings
gender | F | M | diff |
---|---|---|---|
title | |||
'burbs, The (1989) | 2.793478 | 2.962085 | 0.168607 |
10 Things I Hate About You (1999) | 3.646552 | 3.311966 | -0.334586 |
101 Dalmatians (1961) | 3.791444 | 3.500000 | -0.291444 |
101 Dalmatians (1996) | 3.240000 | 2.911215 | -0.328785 |
12 Angry Men (1957) | 4.184397 | 4.328421 | 0.144024 |
... | ... | ... | ... |
Young Guns (1988) | 3.371795 | 3.425620 | 0.053825 |
Young Guns II (1990) | 2.934783 | 2.904025 | -0.030758 |
Young Sherlock Holmes (1985) | 3.514706 | 3.363344 | -0.151362 |
Zero Effect (1998) | 3.864407 | 3.723140 | -0.141266 |
eXistenZ (1999) | 3.098592 | 3.289086 | 0.190494 |
1216 rows × 3 columns
#使用sort_values()函数根据差异值("diff"列)对mean_ratings DataFrame 进行升序排序
sorted_by_diff = mean_ratings.sort_values("diff")
sorted_by_diff.head()
gender | F | M | diff |
---|---|---|---|
title | |||
Dirty Dancing (1987) | 3.790378 | 2.959596 | -0.830782 |
Jumpin' Jack Flash (1986) | 3.254717 | 2.578358 | -0.676359 |
Grease (1978) | 3.975265 | 3.367041 | -0.608224 |
Little Women (1994) | 3.870588 | 3.321739 | -0.548849 |
Steel Magnolias (1989) | 3.901734 | 3.365957 | -0.535777 |
#通过使用切片操作[::-1]对sorted_by_diff DataFrame 进行逆序排序,并返回前5行内容。
sorted_by_diff[::-1].head(5)
gender | F | M | diff |
---|---|---|---|
title | |||
Good, The Bad and The Ugly, The (1966) | 3.494949 | 4.221300 | 0.726351 |
Kentucky Fried Movie, The (1977) | 2.878788 | 3.555147 | 0.676359 |
Dumb & Dumber (1994) | 2.697987 | 3.336595 | 0.638608 |
Longest Day, The (1962) | 3.411765 | 4.031447 | 0.619682 |
Cable Guy, The (1996) | 2.250000 | 2.863787 | 0.613787 |
#使用groupby()函数按电影标题(title)分组,并计算每个电影标题对应的评分标准差。
rating_std_by_title = data.groupby("title")["rating"].std()
#从标准差结果中筛选出评分数大于等于250的活跃电影标题,
rating_std_by_title = rating_std_by_title.loc[active_titles]
rating_std_by_title.head()
title 'burbs, The (1989) 1.107760 10 Things I Hate About You (1999) 0.989815 101 Dalmatians (1961) 0.982103 101 Dalmatians (1996) 1.098717 12 Angry Men (1957) 0.812731 Name: rating, dtype: float64
#使用sort_values()函数将rating_std_by_title Series 对象按评分标准差进行降序排序,并返回前10个最大的标准差值。
rating_std_by_title.sort_values(ascending=False)[:10]
title Dumb & Dumber (1994) 1.321333 Blair Witch Project, The (1999) 1.316368 Natural Born Killers (1994) 1.307198 Tank Girl (1995) 1.277695 Rocky Horror Picture Show, The (1975) 1.260177 Eyes Wide Shut (1999) 1.259624 Evita (1996) 1.253631 Billy Madison (1995) 1.249970 Fear and Loathing in Las Vegas (1998) 1.246408 Bicentennial Man (1999) 1.245533 Name: rating, dtype: float64
#展示"genres"列的前5行数据
movies["genres"].head()
#将每个电影的"genres"列按竖线字符分割成一个列表
movies["genres"].head().str.split("|")
#将分割后的列表赋值给了一个名为"genre"的新列,并删除了原始的"genres"列。这样,"genre"列包含了每个电影的类型列表。
movies["genre"] = movies.pop("genres").str.split("|")
movies.head()
movie_id | title | genre | |
---|---|---|---|
0 | 1 | Toy Story (1995) | [Animation, Children's, Comedy] |
1 | 2 | Jumanji (1995) | [Adventure, |