Personalized Recommendation System: 2. Data Preprocessing

Data Preprocessing

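  • ID fields are left unchanged: UserID, OccupationID (named JobID in the code), MovieID
  • Categorical fields are converted to numbers through dictionaries: Gender, Age, Genres
  • Title is converted to a fixed-length list of word indices (the code below builds a word-to-integer dictionary; it does not run word2vec itself)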
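
As a rough illustration of the dictionary rule for categorical fields, here is a minimal sketch on toy values (the lists `genders` and `ages` are made up for this example and are not read from the dataset). It uses `sorted()` so the mapping is reproducible, whereas `load_data` below enumerates a plain `set`.

```python
# Toy values standing in for the Gender and Age columns (hypothetical data).
genders = ['F', 'M', 'M', 'F']
ages = [1, 56, 25, 45, 25]

# Fixed dictionary for a two-valued category.
gender_map = {'F': 0, 'M': 1}
gender_ids = [gender_map[g] for g in genders]

# Dictionary built from the distinct values; sorted() keeps the order stable.
age_map = {age: idx for idx, age in enumerate(sorted(set(ages)))}
age_ids = [age_map[a] for a in ages]

print(gender_ids)  # [0, 1, 1, 0]
print(age_ids)     # [0, 3, 1, 2, 1]
```

The integer assigned to each age bucket is arbitrary; what matters is that every distinct value gets its own index.
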
import pandas as pd
import numpy as np
import pickle
import re
def load_data():
    # Read the user data
    users_title=['UserID','Gender','Age','JobID','Zip-code']
    users=pd.read_csv('./ml-1m/users.dat',sep='::',header=None,names=users_title,engine='python')
    users=users.filter(regex='UserID|Gender|Age|JobID')  # drop Zip-code
    # DataFrame.filter(regex=...) keeps only the columns whose names match the regular expression; it selects by column label and does not look at the cell values.
    users_orig=users.values
    # Convert the categorical fields Gender and Age to integers
    gender_map={'F':0,'M':1}
    users['Gender']=users['Gender'].map(gender_map)
    
    age_map={val:ii for ii,val in enumerate(set(users['Age']))}
    users['Age']=users['Age'].map(age_map)
    
    # Read the movie data
    movies_title = ['MovieID', 'Title', 'Genres']
    movies = pd.read_csv('./ml-1m/movies.dat', sep='::', header=None, names=movies_title, engine = 'python')
    movies_orig = movies.values
    # Strip the trailing "(year)" from each Title
    pattern = re.compile(r'^(.*)\((\d+)\)$')

    title_map = {val:pattern.match(val).group(1) for ii,val in enumerate(set(movies['Title']))}
    movies['Title'] = movies['Title'].map(title_map)
    
    # Build a dictionary that maps each movie genre to an integer
    genres_set=set()
    for val in movies['Genres'].str.split('|'):
        genres_set.update(val)
        
    genres_set.add('<PAD>')  # '<PAD>' is the padding token that brings every genre list to the same length, which is convenient for training the neural network
    genres2int={val:ii for ii,val in enumerate(genres_set)}
    
    # Convert each movie's genres to a fixed-length list of integers; the length is 18
    genres_map = {val:[genres2int[row] for row in val.split('|')] for ii,val in enumerate(set(movies['Genres']))}

    # Right-pad every genre list with the '<PAD>' index up to the common length
    pad_len = max(genres2int.values())
    for key in genres_map:
        genres_map[key] += [genres2int['<PAD>']] * (pad_len - len(genres_map[key]))
    
    movies['Genres'] = movies['Genres'].map(genres_map)

    # Build a dictionary that maps each word in the movie titles to an integer
    title_set = set()
    for val in movies['Title'].str.split():
        title_set.update(val)
    
    title_set.add('<PAD>')
    title2int = {val:ii for ii, val in enumerate(title_set)}

    # Convert each movie Title to a fixed-length list of integers; the length is 15
    title_count = 15
    title_map = {val:[title2int[row] for row in val.split()] for ii,val in enumerate(set(movies['Title']))}
    
    # Right-pad every title with the '<PAD>' index up to title_count entries
    for key in title_map:
        title_map[key] += [title2int['<PAD>']] * (title_count - len(title_map[key]))
    
    movies['Title'] = movies['Title'].map(title_map)

    # Read the ratings data
    ratings_title = ['UserID','MovieID', 'ratings', 'timestamps']
    ratings = pd.read_csv('./ml-1m/ratings.dat', sep='::', header=None, names=ratings_title, engine = 'python')
    ratings = ratings.filter(regex='UserID|MovieID|ratings')

    # Merge the three tables
    data = pd.merge(pd.merge(ratings, users), movies)
    
    # Split the data into two tables, X (features) and y (targets)
    target_fields = ['ratings']
    features_pd, targets_pd = data.drop(target_fields, axis=1), data[target_fields]
    
    features = features_pd.values
    targets_values = targets_pd.values
    
    return title_count, title_set, genres2int, features, targets_values, ratings, users, movies, data, movies_orig, users_orig
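
The Title handling inside `load_data` (strip the year, split into words, map words to integers, pad to a fixed length) can be sketched in isolation as follows. This is only an illustration under stated assumptions: the helpers `strip_year` and `encode_and_pad` are hypothetical names, the two titles are toy inputs, and the truncation to `length` is an extra safeguard that `load_data` itself does not apply.

```python
import re

YEAR_PATTERN = re.compile(r'^(.*)\((\d+)\)$')  # same pattern as in load_data

def strip_year(title):
    """Remove a trailing '(YYYY)' from a title; return it unchanged if absent."""
    match = YEAR_PATTERN.match(title)
    return match.group(1).strip() if match else title

def encode_and_pad(tokens, vocab, length, pad_token='<PAD>'):
    """Map tokens to integers, then right-pad (or truncate) to `length`."""
    ids = [vocab[t] for t in tokens][:length]
    return ids + [vocab[pad_token]] * (length - len(ids))

titles = ['Toy Story (1995)', 'Jumanji (1995)']  # toy inputs
words = sorted({w for t in titles for w in strip_year(t).split()})
vocab = {w: i for i, w in enumerate(['<PAD>'] + words)}
encoded = [encode_and_pad(strip_year(t).split(), vocab, 5) for t in titles]

print(vocab)    # {'<PAD>': 0, 'Jumanji': 1, 'Story': 2, 'Toy': 3}
print(encoded)  # [[3, 2, 0, 0, 0], [1, 0, 0, 0, 0]]
```

Building the vocabulary from a sorted list keeps the indices identical across runs. Enumerating a `set` of strings, as `load_data` does, can give a different numbering in each Python process, which is one more reason to reuse the pickled result instead of recomputing it.
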

Load the Data and Save It Locally

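  • title_count: length of the Title field (15)
  • title_set: the set of words appearing in the Title text
  • genres2int: dictionary that maps movie genres to integers
  • features: the input X
  • targets_values: the learning target y
  • ratings: Pandas object holding the ratings data
  • users: Pandas object holding the user data
  • movies: Pandas object holding the movie data
  • data: Pandas object combining the three datasets
  • movies_orig: the original movie data before any processing
  • users_orig: the original user data before any processing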
title_count, title_set, genres2int, features, targets_values, ratings, users, movies, data, movies_orig, users_orig = load_data()
# Saving with pickle lets later steps reload the preprocessed data quickly
with open('preprocess.p', 'wb') as f:
    pickle.dump((title_count, title_set, genres2int, features, targets_values, ratings, users, movies, data, movies_orig, users_orig), f)
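
Reading the file back is the mirror image of the dump: the tuple is unpacked in the same order in which it was written. The sketch below shows the round trip on a small stand-in tuple and a temporary file (both are assumptions made for this example so that it runs without the dataset); with the real data the same pattern applies to `preprocess.p` and all eleven values.

```python
import os
import pickle
import tempfile

# Small stand-in for the tuple saved above (hypothetical values).
payload = (15, {'<PAD>', 'Toy', 'Story'}, {'<PAD>': 0, 'Comedy': 1})

# Write to a temporary file so the example does not touch preprocess.p.
path = os.path.join(tempfile.mkdtemp(), 'preprocess_demo.p')
with open(path, 'wb') as f:
    pickle.dump(payload, f)

# Load and unpack in the same order as the dump.
with open(path, 'rb') as f:
    title_count, title_set, genres2int = pickle.load(f)

print(title_count)  # 15
```
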

The Preprocessed Data

users.head()
   UserID  Gender  Age  JobID
0       1       0    0     10
1       2       1    5     16
2       3       1    6     15
3       4       1    2      7
4       5       1    6     20
movies.head()
   MovieID                                              Title                                             Genres
0        1  [2194, 4563, 2402, 2402, 2402, 2402, 2402, 240...  [16, 18, 5, 13, 13, 13, 13, 13, 13, 13, 13, 13...
1        2  [2558, 2402, 2402, 2402, 2402, 2402, 2402, 240...  [10, 18, 1, 13, 13, 13, 13, 13, 13, 13, 13, 13...
2        3  [1335, 4290, 3288, 2402, 2402, 2402, 2402, 240...  [5, 12, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13...
3        4  [2423, 5164, 3171, 2402, 2402, 2402, 2402, 240...  [5, 17, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13...
4        5  [4573, 2552, 1568, 2808, 2806, 1319, 2402, 240...  [5, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13...
movies.values[0]
array([1,
       list([2194, 4563, 2402, 2402, 2402, 2402, 2402, 2402, 2402, 2402, 2402, 2402, 2402, 2402, 2402]),
       list([16, 18, 5, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13])],
      dtype=object)
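
In the output above, the repeated value 2402 in Title and 13 in Genres are the `<PAD>` indices for that particular run. To check an encoded row by eye, one option is to invert the word dictionary and drop the padding. The sketch below assumes a toy `title2int`; note that `load_data` as written returns `title_set` but not `title2int`, so the real dictionary would have to be returned or saved as well before this could be applied to the actual data. The helper name `decode` is hypothetical.

```python
# Toy dictionary standing in for title2int (hypothetical values).
title2int = {'<PAD>': 0, 'Toy': 1, 'Story': 2}

# Invert the mapping: integer index -> word.
int2title = {idx: word for word, idx in title2int.items()}

def decode(ids, int2word, pad_token='<PAD>'):
    """Turn a padded list of indices back into words, skipping the padding."""
    return [int2word[i] for i in ids if int2word[i] != pad_token]

decoded = decode([1, 2, 0, 0, 0], int2title)
print(decoded)  # ['Toy', 'Story']
```
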