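A Statistical Analysis Example Using pandas
Dataset download: https://github.com/wesm/pydata-book/tree/2nd-edition/datasets/movielens
Statistical analysis and visualization of movie rating data
1. Data preparation (obtaining and loading the data)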
The movie rating dataset was obtained online. It contains 1,000,209 ratings of 3,900 movies made by 6,040 users. The dataset consists of three files, described below.

(1) "ratings.dat": UserID, MovieID, Rating, Timestamp
- UserID: user ID (1-6040)
- MovieID: movie ID (1-3952)
- Rating: rating (1-5), made on a 5-star scale (whole-star ratings only)
- Timestamp: Unix timestamp, the number of seconds elapsed since 1970-01-01 00:00:00 GMT (1970-01-01 08:00:00 Beijing time)

Each user has at least 20 ratings.

(2) "users.dat": UserID, Gender, Age, Occupation, Zip-code
- Gender: "M" (male) or "F" (female)
- Age: age group, encoded as one of the following numbers
  * 1: "Under 18"
  * 18: "18-24"
  * 25: "25-34"
  * 35: "35-44"
  * 45: "45-49"
  * 50: "50-55"
  * 56: "56+"
- Occupation: occupation, encoded as one of the following numbers
  * 0: "other" or not specified
  * 1: "academic/educator"
  * 2: "artist"
  * 3: "clerical/admin"
  * 4: "college/grad student"
  * 5: "customer service"
  * 6: "doctor/health care"
  * 7: "executive/managerial"
  * 8: "farmer"
  * 9: "homemaker"
  * 10: "K-12 student"
  * 11: "lawyer"
  * 12: "programmer"
  * 13: "retired"
  * 14: "sales/marketing"
  * 15: "scientist"
  * 16: "self-employed"
  * 17: "technician/engineer"
  * 18: "tradesman/craftsman"
  * 19: "unemployed"
  * 20: "writer"
- Zip-code: postal code

(3) "movies.dat": MovieID, Title, Genres
- Title: movie title
- Genres: genres, drawn from the following list
  * Action
  * Adventure
  * Animation
  * Children's
  * Comedy
  * Crime
  * Documentary
  * Drama
  * Fantasy
  * Film-Noir
  * Horror
  * Musical
  * Mystery
  * Romance
  * Sci-Fi
  * Thriller
  * War
  * Western
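The encoded fields described above (Unix timestamps and numeric age codes) are easier to read once decoded. The following is a minimal sketch of one way to do that, not part of the original analysis. The frames `sample_ratings` and `sample_users`, the dictionary `AGE_LABELS`, and the new column names `rated_at` and `age_group` are hypothetical names introduced here, and the sample rows are made up for illustration rather than taken from the real files.

```python
import pandas as pd

# Age codes and their labels, copied from the dataset description.
AGE_LABELS = {
    1: "Under 18", 18: "18-24", 25: "25-34", 35: "35-44",
    45: "45-49", 50: "50-55", 56: "56+",
}

# Made-up sample rows that mimic the column layout of ratings.dat and users.dat.
sample_ratings = pd.DataFrame({
    "user_id": [1, 1, 2],
    "movie_id": [10, 20, 10],
    "rating": [5, 3, 4],
    "timestamp": [0, 86400, 978300760],
})
sample_users = pd.DataFrame({
    "user_id": [1, 2],
    "gender": ["F", "M"],
    "age": [1, 56],
})

# Timestamps are seconds since 1970-01-01 00:00:00 UTC, so unit="s" applies.
sample_ratings["rated_at"] = pd.to_datetime(
    sample_ratings["timestamp"], unit="s", utc=True
)

# Replace each numeric age code with its human-readable label.
sample_users["age_group"] = sample_users["age"].map(AGE_LABELS)

print(sample_ratings[["timestamp", "rated_at"]])
print(sample_users[["age", "age_group"]])
```

The same two calls could be applied to the real `ratings` and `users` frames once they are loaded. The occupation codes could be decoded in the same way with a second dictionary.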
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
# Read the users dataset
unames = ['user_id', 'gender', 'age', 'occupation', 'zip']
users = pd.read_table('users.dat', sep='::', header=None, names=unames, engine='python')
# Read the ratings dataset
rnames = ['user_id', 'movie_id', 'rating', 'timestamp']
ratings = pd.read_table('ratings.dat', sep='::', header=None, names=rnames, engine='python')
# Read the movies dataset
mnames = ['movie_id', 'title', 'genres']
movies