Movie data analysis

import time

import pandas as pd

start = time.perf_counter()
# The .dat files use '::' as the field separator and have no header row.
unames = ['user_id', 'gender', 'age', 'occupation', 'zip']
users = pd.read_table('users.dat', sep='::', header=None, names=unames, engine='python')
rnames = ['user_id', 'movie_id', 'rating', 'timestamp']
ratings = pd.read_table('ratings.dat', sep='::', header=None, names=rnames, engine='python')
mnames = ['movie_id', 'title', 'genres']
movies = pd.read_table('movies.dat', sep='::', header=None, names=mnames, engine='python')

Slice the first few rows of each table to check that loading worked

print(users[:5])

print(ratings[:5])

print(movies[:5])
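
To try the loading step without the data files, the sketch below parses a couple of rows from an in-memory string. The sample rows and the name `sample_users` are assumptions for illustration only; they simply follow the `'::'`-separated layout used above.

```python
import io

import pandas as pd

# Illustrative rows in the same '::'-separated layout as users.dat.
sample = io.StringIO(
    "1::F::1::10::48067\n"
    "2::M::56::16::70072\n"
)

unames = ['user_id', 'gender', 'age', 'occupation', 'zip']
# A separator longer than one character is interpreted as a regular
# expression, which is why the Python parsing engine is requested.
sample_users = pd.read_csv(sample, sep='::', header=None, names=unames, engine='python')

print(sample_users)
print(sample_users.dtypes)
```

Without `engine='python'`, pandas falls back to the Python engine anyway for a multi-character separator, but it emits a warning, so passing the argument explicitly keeps the output clean.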

Merge the three tables into one

data = pd.merge(pd.merge(ratings, users, on='user_id'), movies, on='movie_id')

print(data[:2])
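
The two-step merge can be seen on a miniature example. This is only a sketch: the `toy_*` tables and their values are made up, and the chained `.merge` calls are just another way of writing the nested `pd.merge` above.

```python
import pandas as pd

# Hypothetical miniature tables; the values are invented for illustration.
toy_ratings = pd.DataFrame({
    'user_id': [1, 1, 2],
    'movie_id': [10, 20, 10],
    'rating': [5, 3, 4],
})
toy_users = pd.DataFrame({'user_id': [1, 2], 'gender': ['F', 'M']})
toy_movies = pd.DataFrame({'movie_id': [10, 20], 'title': ['Movie A', 'Movie B']})

# Each merge is an inner join on the shared key column, so every rating row
# picks up the matching user columns and then the matching movie columns.
toy_data = toy_ratings.merge(toy_users, on='user_id').merge(toy_movies, on='movie_id')

print(toy_data)
```

Because the default join is an inner join, a rating whose `user_id` or `movie_id` has no match in the other tables would be dropped from the result.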

Compute the mean rating of each movie, split by gender

mean_ratings = data.pivot_table('rating', index='title', columns='gender', aggfunc='mean')

print(mean_ratings[:5])
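
As a rough illustration of what the pivot table produces, the sketch below runs the same call on a made-up frame (`toy` and its values are assumptions) and compares it with an equivalent `groupby` followed by `unstack`.

```python
import pandas as pd

# Hypothetical ratings; the values are invented for illustration.
toy = pd.DataFrame({
    'title': ['Movie A', 'Movie A', 'Movie A', 'Movie B', 'Movie B'],
    'gender': ['F', 'M', 'M', 'F', 'M'],
    'rating': [5, 3, 4, 2, 4],
})

# One row per title, one column per gender, each cell the mean rating.
toy_means = toy.pivot_table('rating', index='title', columns='gender', aggfunc='mean')

# The same table built step by step: group, average, then pivot gender to columns.
toy_means_alt = toy.groupby(['title', 'gender'])['rating'].mean().unstack()

print(toy_means)
```

Here `Movie A` has one rating from a woman (5) and two from men (3 and 4), so its row holds 5.0 under `F` and 3.5 under `M`.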

Group by movie title and count the ratings each movie received

ratings_by_tittle = data.groupby('title').size()

print(ratings_by_tittle[:3])

Keep only the movies that received at least 250 ratings

active_titles = ratings_by_tittle.index[ratings_by_tittle >= 250]
mean_ratings = mean_ratings.loc[active_titles]
top_female_ratings = mean_ratings.sort_values(by='F', ascending=False)
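
The filtering step works by indexing the title index with a boolean mask. Here is a small sketch of that idea; the frame `toy` and the threshold `min_ratings = 2` are assumptions chosen so that the effect is visible on six rows, whereas the script above uses 250.

```python
import pandas as pd

# Hypothetical ratings; the values are invented for illustration.
toy = pd.DataFrame({
    'title': ['A', 'A', 'A', 'B', 'C', 'C'],
    'rating': [5, 4, 3, 2, 4, 5],
})

counts = toy.groupby('title').size()          # number of ratings per title
min_ratings = 2                               # assumed threshold for this toy data
active = counts.index[counts >= min_ratings]  # titles that pass the threshold

means = toy.groupby('title')['rating'].mean()
filtered = means.loc[active]                  # drop the sparsely rated titles
ranked = filtered.sort_values(ascending=False)

print(ranked)
```

Title `B` has a single rating and is removed, which is the point of the filter: a mean computed from very few ratings says little about a movie.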

Compute the rating disagreement between male and female viewers

mean_ratings['diff'] = mean_ratings['M'] - mean_ratings['F']
sorted_by_diff = mean_ratings.sort_values(by='diff')
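
To read the sorted result: since `diff` is `M` minus `F`, an ascending sort puts the movies women rated higher first and the movies men rated higher last. The sketch below shows this on made-up means (`toy_means` and its values are assumptions).

```python
import pandas as pd

# Hypothetical per-gender mean ratings; the values are invented for illustration.
toy_means = pd.DataFrame(
    {'F': [4.5, 3.0, 4.0], 'M': [3.5, 4.2, 4.0]},
    index=pd.Index(['Movie A', 'Movie B', 'Movie C'], name='title'),
)

toy_means['diff'] = toy_means['M'] - toy_means['F']
by_diff = toy_means.sort_values(by='diff')

preferred_by_women = by_diff.index[0]   # most negative diff
preferred_by_men = by_diff.index[-1]    # most positive diff

print(by_diff)
print(preferred_by_women, preferred_by_men)
```

Reversing the order with `by_diff[::-1]` lists the movies preferred by men first.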

Standard deviation of the ratings, grouped by movie title

ratings_by_tittle = data.groupby('title')['rating'].std()

Filter down to active_titles

ratings_by_tittle = ratings_by_tittle.loc[active_titles]

Sort the Series by value in descending order and show the top 10

print(ratings_by_tittle.sort_values(ascending=False)[:10])
elapsed = (time.perf_counter() - start)
print(elapsed)
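
A high standard deviation marks a movie on which viewers disagree, regardless of gender. The sketch below, on made-up ratings (`toy` is an assumption), shows how the grouped standard deviation behaves, including the case of a title with a single rating.

```python
import math

import pandas as pd

# Hypothetical ratings; the values are invented for illustration.
toy = pd.DataFrame({
    'title': ['A', 'A', 'B', 'B', 'C'],
    'rating': [1, 5, 3, 3, 4],
})

# GroupBy.std uses the sample standard deviation (ddof=1) by default.
toy_std = toy.groupby('title')['rating'].std()

# sort_values places NaN entries last by default.
most_divisive = toy_std.sort_values(ascending=False)

print(most_divisive)
```

Title `A` (ratings 1 and 5) comes first, `B` has a deviation of zero, and `C` yields `NaN` because a sample standard deviation is undefined for a single value. This is one more reason to filter by a minimum number of ratings before ranking.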
