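A while ago I read the book "Python for Data Analysis" (《利用python进行数据分析》) and thought its movie-rating example was excellent. Admiring someone else's code is one thing, though; when I actually tried it myself, my ambition outran my ability. Today I worked up the courage and did the analysis on my own.
It shows that there is still a big gap between the code of us ordinary people and that of the experts.
Even so, I achieved what I set out to do: compare how men and women rate different movies.
What stuck with me most was the movie "Dumb and Dumber" (《阿呆和阿瓜》): men rate it highly, while women rate it rather low, and the gap between the two is large. I have seen the movie, so I found this amusing and tried to reproduce it today.
The result looks like this:
The code is poor and I will keep studying; I am pasting it as is. I wrote it for myself, so it does not have to be great. It is good enough for my own reference.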
One thing worth pointing out: the movie-title table must be read with encoding="ISO-8859-1". Otherwise pandas raises an encoding error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 3114: invalid continuation byte
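To see why the encoding matters, here is a minimal sketch on a made-up two-line sample in the movies.dat layout (the movieId values and the choice of titles are assumptions for illustration, not taken from the real file). The accented é is stored as the single byte 0xe9, which is valid ISO-8859-1 but not valid UTF-8 on its own:

```python
import io
import pandas as pd

# Hypothetical sample rows; encoded as ISO-8859-1 so that "é" becomes byte 0xe9.
sample_text = (
    "1::Toy Story (1995)::Animation|Children's|Comedy\n"
    "2::Misérables, Les (1995)::Drama|Musical\n"
)
raw = sample_text.encode("ISO-8859-1")

# Decoding the same bytes as UTF-8 fails, just like the error above.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Reading with the right encoding works; engine="python" is needed
# because "::" is a multi-character separator.
df_sample = pd.read_table(
    io.BytesIO(raw),
    sep="::",
    engine="python",
    encoding="ISO-8859-1",
    header=None,
    names=["movieId", "title", "genres"],
)

print(utf8_ok)
print(df_sample)
```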
With that solved, the rest is merging the tables, building a pivot table, and some basic descriptive statistics.
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import os
# Raw string, so the backslash in the path is not treated as an escape sequence
data_dir_of_movie = r"2022——数据分析\movielens"
movies = os.path.join(data_dir_of_movie, "movies.dat")
# A multi-character separator such as "::" needs the Python parsing engine
df_movie = pd.read_table(movies, sep = "::", engine = "python", encoding = "ISO-8859-1", header = None, names = ['movieId', 'title', 'genres'])
df_movie.head()
df_movie.shape
users = os.path.join(data_dir_of_movie, "users.dat")
df_users = pd.read_table(users, sep = "::", engine = "python", header = None, names = ['userId', 'gender', 'age', 'occupation', 'zip'])
df_users.shape
3883 * 6040
ratings = os.path.join(data_dir_of_movie, "ratings.dat")
df_ratings = pd.read_table(ratings, sep = "::", engine = "python", header = None, names = ['userId', 'movieId', 'rating', 'timestamp'])
df_ratings.shape
df_ratings.head()
df_users.head()
df_ratings = df_ratings[['userId', 'movieId', 'rating']]
df_users = df_users[['userId', 'gender']]
df_movie.head()
df_ratings.shape
df_ratings.size
df_ratings.head()
df_ratings['movieId'].describe()
df_ratings.groupby('movieId').count()
df_ratings
df_numberOfRating = df_ratings.groupby('movieId').count()
# DataFrame.sort_values() requires the column to sort by
df_numberOfRating.sort_values(by = 'rating')
df_numberOfRating.shape
# Keep only movies with more than 250 ratings (after count(), 'rating' holds the number of ratings)
df_numberOfRating2 = df_numberOfRating[df_numberOfRating['rating'] > 250]
df_numberOfRating2.shape
# reset_index(inplace = True) returns None, so assign the returned copy instead;
# movieId is the index until this point and only becomes a column here
df_ratingLargerThan250 = df_numberOfRating2.reset_index()
df_ratingLargerThan250.head()
movieId2 = df_ratingLargerThan250['movieId']
df_users.head()
df_ratings.head()
df_movie.head()
movieId2
m_250 = pd.DataFrame(movieId2, columns = ['movieId'])
df_movie.shape
movie2 = pd.merge(left = m_250, right = df_movie, how='inner', left_on='movieId', right_on='movieId')
movie2
movie2.shape
df_users.head()
df_ratings.head()
user_rating = pd.merge(left = df_users, right = df_ratings, left_on="userId", right_on="userId", how='outer')
user_rating.shape
user_rating.head()
user_rating_movie = pd.merge(left = user_rating, right = movie2, left_on='movieId', right_on='movieId', how='outer')
user_rating_movie.shape
user_rating_movie.head()
user_rating_movie.sort_values(by = 'userId', inplace = True)
user_rating_movie.iloc[0, ]
sex_rating = pd.pivot_table(data = user_rating_movie, index = 'title', columns = 'gender', values = 'rating')
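# sort_values() needs a column to sort by; 'F' is picked here only as an example
sex_rating.sort_values(by = 'F', ascending = False)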
sex_rating['F'].nlargest(5)
sex_rating['M'].nlargest(5)
sex_rating['F'].nsmallest(5)
sex_rating['M'].nsmallest(5)
sex_rating['gap'] = sex_rating['F'] - sex_rating['M']
sex_rating['gap'].nlargest(10)
sex_rating['gap'].nsmallest(10)
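The steps above (count ratings per movie, keep the popular ones, merge, pivot by gender, take the difference) can be condensed into a few lines. The following is only a sketch on made-up toy data; the table contents and the threshold `min_ratings` are assumptions chosen so the result is easy to check by hand, whereas the post uses 250 on the real data:

```python
import pandas as pd

# Made-up toy tables standing in for users.dat, movies.dat and ratings.dat.
users = pd.DataFrame({"userId": [1, 2, 3, 4], "gender": ["F", "F", "M", "M"]})
movies = pd.DataFrame({"movieId": [10, 20], "title": ["Movie A", "Movie B"]})
ratings = pd.DataFrame({
    "userId":  [1, 2, 3, 4, 1, 3],
    "movieId": [10, 10, 10, 10, 20, 20],
    "rating":  [5, 4, 2, 3, 4, 4],
})

min_ratings = 3  # hypothetical threshold for the toy data

# Number of ratings per movie, then the ids of movies above the threshold.
counts = ratings.groupby("movieId")["rating"].count()
popular_ids = counts[counts > min_ratings].index

# Keep only the popular movies and attach gender and title.
merged = (
    ratings[ratings["movieId"].isin(popular_ids)]
    .merge(users, on="userId")
    .merge(movies, on="movieId")
)

# Mean rating per title and gender, plus the female-minus-male gap.
by_gender = merged.pivot_table(index="title", columns="gender", values="rating", aggfunc="mean")
by_gender["gap"] = by_gender["F"] - by_gender["M"]
print(by_gender)
```

With gap defined as F minus M, `nlargest` lists the movies women rate higher than men and `nsmallest` the ones men rate higher.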
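Next, a look at how men and women differ in their ratings by movie genre.
You can see that men rate Action films higher, while women rate Romance higher. Men also rate Horror higher, as well as Sci-Fi and Thriller.
Overall, the low-scoring genres are Crime, Film-Noir, and Western; the high-scoring ones are Comedy, Children's, Drama, and Documentary.
I only used a sample of 10,000 records; the full 1,000,000 were too much to process. Also, when extracting the genres I did not use get_indexer but wrote my own loop instead. I will tidy that up later and compare the running time of the two approaches.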
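For reference, the get_indexer approach mentioned here might look roughly like the sketch below. The three-row `df_demo` frame is made up for illustration; `Index.get_indexer` and `Series.str.get_dummies` are real pandas methods, and the second one is shown only as a vectorised alternative to cross-check the result:

```python
import numpy as np
import pandas as pd

# Made-up sample with the same "a|b|c" genre format as the MovieLens data.
df_demo = pd.DataFrame({"genres": ["Action|Sci-Fi", "Comedy", "Action|Comedy|Romance"]})

# Collect the distinct genres in first-seen order.
all_genres = []
for cell in df_demo["genres"]:
    for g in cell.split("|"):
        if g not in all_genres:
            all_genres.append(g)

# Start from an all-zero indicator table, one column per genre.
dummies = pd.DataFrame(
    np.zeros((len(df_demo), len(all_genres)), dtype=int),
    columns=all_genres,
)

# get_indexer maps the genre names of one row to column positions,
# so the whole row can be filled with a single positional assignment.
for i, cell in enumerate(df_demo["genres"]):
    col_positions = dummies.columns.get_indexer(cell.split("|"))
    dummies.iloc[i, col_positions] = 1

# Vectorised alternative; its columns come back in sorted order.
dummies_vec = df_demo["genres"].str.get_dummies(sep="|")
same = bool((dummies[dummies_vec.columns].to_numpy() == dummies_vec.to_numpy()).all())

print(dummies)
print(same)
```

To compare running times, either version could be wrapped in `timeit` on the real data; no timing figures are claimed here.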
# data_dir is not defined anywhere in this post; point it at the folder that holds movie_1w.xlsx
df = pd.read_excel(os.path.join(data_dir, "movie_1w.xlsx"))
df.shape
df.keys()[1:]
df = df[df.keys()[1:]]
df['genres']
###
# Collect the distinct genre names in first-seen order
gen_set = []
genres = df['genres']
for ge in genres:
    split_list = ge.split('|')
    for spl in split_list:
        if spl not in gen_set:
            gen_set.append(spl)
len(gen_set)
gen_set
# One zero-filled column per genre (18 in this data set)
df_type = pd.DataFrame(np.zeros((df.shape[0], len(gen_set))))
df_type.columns = gen_set
df_type
# DataFrame.join() has no axis argument; it joins column-wise on the index by default
df.join(df_type)
pd.concat([df, df_type], axis = 1)
df2 = pd.concat([df, df_type], axis = 1)
# Set the indicator column to 1 for every genre listed on each row
for i in range(df2.shape[0]):
    gi = df2.loc[df2.index[i], 'genres']
    gi_list = gi.split("|")
    for g in gi_list:
        df2.loc[df2.index[i], g] = 1
df2.iloc[3,]
df2.to_excel(os.path.join(data_dir, "movie_1w_18type.xlsx"), index = False)
df = df2
df.keys()
df.groupby('gender')[['War', 'Thriller']].mean()
gen_set
# The genre columns are 0/1 indicators, so their mean per gender is the share of
# that gender's ratings that fall in each genre. Build the pivot once and reuse it.
genre_by_gender = pd.pivot_table(data = df, index = 'gender', values = gen_set)
genre_by_gender.unstack()
female_rating = genre_by_gender.loc['F']
male_rating = genre_by_gender.loc['M']
plt.plot(female_rating, label = "female")
plt.plot(male_rating , label = "male")
# plt.gca().set_xticklabel(rotation = 45)
plt.xticks(rotation = 45)
plt.legend()
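One caveat about the plot: the pivot above averages the 0/1 genre indicator columns, so each point is the share of a gender's ratings that fall in that genre, not the average score. If the mean rating per genre is what is wanted, one possible sketch (on made-up data, assuming columns named gender, rating and genres as in the tables above) is to explode the genre list into one row per genre and pivot on the rating itself:

```python
import pandas as pd

# Made-up sample; the real data would be the merged ratings table.
df_demo = pd.DataFrame({
    "gender": ["F", "F", "M", "M"],
    "rating": [5, 3, 4, 2],
    "genres": ["Romance|Drama", "Action", "Action|Drama", "Romance"],
})

# One row per (rating, genre) pair.
long_df = df_demo.assign(genre=df_demo["genres"].str.split("|")).explode("genre")

# Mean rating for every gender and genre combination.
mean_by_genre = long_df.pivot_table(index="gender", columns="genre", values="rating", aggfunc="mean")
print(mean_by_genre)
```

The rows of `mean_by_genre` can then be passed to `plt.plot` in the same way as `female_rating` and `male_rating`.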