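1 Introduction
This is a popular-movie analysis I wrote while following my instructor, right after finishing the pandas basics. The dataset is the movie analysis dataset linked from this post; the code below reads its files from an ml-1m/ directory.
2 Process
1. Import the data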
import pandas as pd

# "::" is a multi-character separator, which only the Python parsing engine supports;
# passing engine="python" explicitly avoids a ParserWarning.
unames = ["user_id", "gender", "age", "occupation", "zip"]
users = pd.read_table("ml-1m/users.dat", sep="::", header=None, names=unames, engine="python")
rating_names = ["user_id", "movie_id", "rating", "timestamp"]
ratings = pd.read_table("ml-1m/ratings.dat", names=rating_names, sep="::", header=None, engine="python")
ratings.head(5)
movies_names = ["movie_id", "title", "genres"]
movies = pd.read_table("ml-1m/movies.dat", names=movies_names, sep="::", header=None, engine="python")
movies.head(5)
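- names: the list of column names to assign, since the files do not contain them
- sep: the field delimiter, here "::"
- header=None: the file has no header row, so the first line is read as data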
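To see how these three parameters work together without downloading anything, here is a minimal sketch that parses two made-up rows in the same "::"-delimited layout. The sample rows and the name sample_users are assumptions for illustration only, not part of the dataset.

```python
import io

import pandas as pd

# Hypothetical sample rows in the same "::"-delimited layout as users.dat
raw = "1::F::1::10::48067\n2::M::56::16::70072\n"

unames = ["user_id", "gender", "age", "occupation", "zip"]
sample_users = pd.read_csv(
    io.StringIO(raw),
    sep="::",         # multi-character field delimiter
    header=None,      # the data has no header row
    names=unames,     # so the column names are supplied explicitly
    engine="python",  # the C engine does not support multi-character separators
)
print(sample_users)
```

Without header=None, the first data row would be consumed as column labels and one record would be lost.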
2. Joining the data with merge
data = pd.merge(pd.merge(users, ratings), movies)
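- Two merges are chained. With no on argument, merge joins on the columns the two frames have in common (user_id first, then movie_id), not on the row index.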
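The implicit join keys can be made explicit. The sketch below uses three tiny made-up frames (the values are assumptions for illustration) and checks that naming the keys with on= gives the same result as relying on the shared column names.

```python
import pandas as pd

# Tiny hypothetical frames with the same key columns as the real tables
users = pd.DataFrame({"user_id": [1, 2], "gender": ["F", "M"]})
ratings = pd.DataFrame(
    {"user_id": [1, 1, 2], "movie_id": [10, 20, 10], "rating": [5, 3, 4]}
)
movies = pd.DataFrame({"movie_id": [10, 20], "title": ["Movie A", "Movie B"]})

# Implicit: merge picks the columns the two frames share
implicit = pd.merge(pd.merge(users, ratings), movies)

# Explicit: the same joins with the keys spelled out
explicit = users.merge(ratings, on="user_id").merge(movies, on="movie_id")

print(implicit.equals(explicit))
```

Spelling out on= is arguably safer in real code, because an unrelated column that happens to share a name in both frames would otherwise silently become part of the join key.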
3. Pivot table for aggregating the data
ratings_by_gender = data.pivot_table(values="rating", index="title", columns="gender", aggfunc="mean")
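- values: the column to aggregate
- index: the column whose values become the row labels
- columns: the column whose values become the column labels
- aggfunc: the aggregation function applied to each group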
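A small worked example may make the four parameters easier to picture. The frame below is made up for illustration; it follows the column names of the merged data but none of the values come from the dataset.

```python
import pandas as pd

# Hypothetical ratings: Movie A has two female raters and one male rater,
# Movie B has a single male rater
data = pd.DataFrame(
    {
        "title": ["Movie A", "Movie A", "Movie A", "Movie B"],
        "gender": ["F", "F", "M", "M"],
        "rating": [4, 2, 5, 3],
    }
)

by_gender = data.pivot_table(
    values="rating",   # what to aggregate
    index="title",     # one row per title
    columns="gender",  # one column per gender
    aggfunc="mean",    # average the ratings in each (title, gender) cell
)
print(by_gender)
```

Movie B has no female ratings here, so its F cell is NaN; the same happens in the real data for any title rated by only one gender.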
4. Finding popular movies
1. Find the movies with the most ratings using groupby
ratings_by_title = data.groupby("title").size()
hot_movies = ratings_by_title[ratings_by_title > 1000]
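The filtering step is a boolean mask over the Series that size() returns. Here is a minimal sketch on made-up data; the threshold of 2 is an assumption chosen to fit the toy sample, where the analysis above uses 1000.

```python
import pandas as pd

# Hypothetical merged data: "A" has 3 ratings, "B" has 1, "C" has 2
data = pd.DataFrame(
    {"title": ["A", "A", "A", "B", "C", "C"], "rating": [5, 4, 3, 2, 4, 5]}
)

# size() counts the rows in each group and returns a Series indexed by title
ratings_by_title = data.groupby("title").size()

threshold = 2  # hypothetical cutoff for this toy sample
mask = ratings_by_title > threshold       # boolean Series, one entry per title
hot_movies = ratings_by_title[mask]       # keep only the titles above the cutoff
print(hot_movies)
```

The index of hot_movies holds the titles that passed the cutoff, which is what the next step uses to look up their average ratings.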
2. Compute the average rating
mean_ratings = data.pivot_table(index="title", values="rating", aggfunc="mean")
mean_ratings_series = mean_ratings["rating"]  # I got stuck here: pivot_table returns a DataFrame, so select the "rating" column to get a Series before indexing it by title
hot_movies_rating = mean_ratings_series[hot_movies.index]
top_10_movies = hot_movies_rating.sort_values(ascending=False).head(10)
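As an alternative to the steps above (not what the original walkthrough does), the count and the mean can be computed in one groupby pass, which avoids the DataFrame-to-Series step. This is a sketch on made-up data; min_ratings and the sample values are assumptions for illustration.

```python
import pandas as pd

# Hypothetical merged data: "A" has 3 ratings, "B" has 2, "C" has 1
data = pd.DataFrame(
    {
        "title": ["A", "A", "A", "B", "B", "C"],
        "rating": [5, 4, 3, 5, 5, 5],
    }
)

min_ratings = 1  # hypothetical cutoff; the analysis above uses 1000

# One pass: number of ratings and mean rating per title
stats = data.groupby("title")["rating"].agg(["count", "mean"])

# Keep the frequently rated titles, then rank them by mean rating
top = (
    stats[stats["count"] > min_ratings]
    .sort_values("mean", ascending=False)
    .head(10)
)
print(top)
```

Title "C" has a perfect mean but only one rating, so it is filtered out; this is the reason for applying the count cutoff before ranking by average.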