Step 1: Collect and clean the data
Data link: https://grouplens.org/datasets/movielens/
Download file: ml-latest-small
import pandas as pd
import numpy as np
import tensorflow as tf
Load the ratings.csv file
ratings_df = pd.read_csv('./ml-latest-small/ratings.csv')
ratings_df.tail()
# tail() returns the last rows of the DataFrame; by default it returns the last 5 rows.
Result:
        userId  movieId  rating   timestamp
99999      671     6268     2.5  1065579370
100000     671     6269     4.0  1065149201
100001     671     6365     4.0  1070940363
100002     671     6385     2.5  1070979663
100003     671     6565     3.5  1074784724
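The behavior of `tail()` can be checked on a small frame without downloading anything. This is a minimal sketch under stated assumptions: `demo_df` and its values are made up for illustration and are not part of the MovieLens data. It also shows why the parentheses matter, since `demo_df.tail` on its own is only a reference to the method.

```python
import pandas as pd

# Hypothetical toy data standing in for ratings.csv (values are made up).
demo_df = pd.DataFrame({
    "userId": [1, 1, 2, 2, 3, 3, 4],
    "movieId": [10, 20, 10, 30, 20, 40, 50],
    "rating": [4.0, 3.5, 5.0, 2.0, 4.5, 3.0, 1.5],
})

last_five = demo_df.tail()    # default n=5: the last five rows
last_two = demo_df.tail(2)    # explicit n: the last two rows
method_ref = demo_df.tail     # no call: a bound method, not data

print(last_five)
print(last_two)
print(callable(method_ref))
```

In a notebook, evaluating the method reference prints the method's repr rather than a five-row table, so calling `tail()` with parentheses is the safer habit.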
Load the movies.csv file
movies_df = pd.read_csv('./ml-latest-small/movies.csv')
movies_df.tail()
Result:
      movieId  title                                              genres
9120  162672   Mohenjo Daro (2016)                                Adventure|Drama|Romance
9121  163056   Shin Godzilla (2016)                               Action|Adventure|Fantasy|Sci-Fi
9122  163949   The Beatles: Eight Days a Week - The Touring Y...  Documentary
9123  164977   The Gay Desperado (1936)                           Comedy
9124  164979   Women of '69, Unboxed                              Documentary
Replace the movieId in movies_df with the row number
movies_df['movieRow'] = movies_df.index
# Create a 'movieRow' column equal to the index value
movies_df.tail()
Result:
      movieId  title                                              genres                           movieRow
9120  162672   Mohenjo Daro (2016)                                Adventure|Drama|Romance          9120
9121  163056   Shin Godzilla (2016)                               Action|Adventure|Fantasy|Sci-Fi  9121
9122  163949   The Beatles: Eight Days a Week - The Touring Y...  Documentary                      9122
9123  164977   The Gay Desperado (1936)                           Comedy                           9123
9124  164979   Women of '69, Unboxed                              Documentary                      9124
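The output shows the gap this step closes: the last movieId is 164979, while the last row number is only 9124. A dense 0-based row number is presumably easier to use as a position later on, though that motivation is an assumption here. The sketch below uses a hypothetical `toy_movies` frame with made-up titles, and the `id_to_row` lookup is an illustrative addition, not something from the original walkthrough.

```python
import pandas as pd

# Hypothetical toy stand-in for movies.csv; the IDs are sparse on purpose.
toy_movies = pd.DataFrame({
    "movieId": [1, 5, 42, 163056],
    "title": ["A", "B", "C", "D"],
})

# The default RangeIndex runs 0..n-1, so copying it gives a dense row number.
toy_movies["movieRow"] = toy_movies.index

# Illustrative lookup from the sparse movieId to the dense movieRow.
id_to_row = dict(zip(toy_movies["movieId"], toy_movies["movieRow"]))
print(toy_movies)
print(id_to_row)
```

Note that this relies on the frame still having its default index. After filtering or sorting, `reset_index(drop=True)` would be needed first to get contiguous row numbers.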
Select the features to keep in movies_df
movies_df = movies_df[['movieRow', 'movieId', 'title']]
# Keep only these three columns
movies_df.to_csv('./ml-latest-small/moviesProcessed.csv', index=False, header=True, encoding='utf-8')
# Write the result to a new file, moviesProcessed.csv
movies_df.tail()
Result:
      movieRow  movieId  title
9120  9120      162672   Mohenjo Daro (2016)
9121  9121      163056   Shin Godzilla (2016)
9122  9122      163949   The Beatles: Eight Days a Week - The Touring Y...
9123  9123      164977   The Gay Desperado (1936)
9124  9124      164979   Women of '69, Unboxed
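Column selection takes a list of names separated by commas. The sketch below, again on a hypothetical `toy_movies` frame with made-up values, selects and reorders three columns, shows what Python does with adjacent string literals when the commas are left out, and round-trips the result through CSV. It writes to an in-memory buffer rather than to disk, which is a choice made only to keep the example self-contained.

```python
import io
import pandas as pd

# Hypothetical toy stand-in for the movies table (values are made up).
toy_movies = pd.DataFrame({
    "movieId": [1, 5],
    "title": ["A", "B"],
    "genres": ["Comedy", "Drama"],
    "movieRow": [0, 1],
})

# A list of comma-separated names selects and reorders the columns.
subset = toy_movies[["movieRow", "movieId", "title"]]

# Adjacent string literals are concatenated by Python, so leaving out the
# commas would ask for one column with this fused name instead.
fused_name = 'movieRow' 'movieId' 'title'

# Write to an in-memory buffer, then read it back to check the layout.
buffer = io.StringIO()
subset.to_csv(buffer, index=False, header=True)
buffer.seek(0)
round_trip = pd.read_csv(buffer)
print(fused_name)
print(round_trip.columns.tolist())
```

Passing `index=False` keeps the row index out of the file, so the CSV holds exactly the three selected columns.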
Based on movieId, merge ratings_df and movie