Renumbering user IDs and item IDs in the Amazon dataset


Update (2022-12-12): the quickest method

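import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Read the ratings file into a DataFrame (columns: item ID, user ID, rating, Unix timestamp).
# The ID columns are read as strings so that leading zeros in the item IDs are preserved.
data = pd.read_csv('Digital_Music.csv', names=['iid', 'uid', 'rating', 'time'],
                   dtype={'iid': str, 'uid': str})
# Fit one encoder per ID column; each maps the original strings to integers 0..n-1
data['uid'] = LabelEncoder().fit_transform(data['uid'])
data['iid'] = LabelEncoder().fit_transform(data['iid'])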
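
If scikit-learn is not installed, `pandas.factorize` can do the same per-column renumbering. The following is a minimal sketch on a hypothetical three-row frame (the rows and the names `toy`, `user_ids`, `item_ids` are made up for illustration); with `sort=True` the IDs are numbered in sorted order, which is also how `LabelEncoder` orders its classes.

```python
import pandas as pd

# Hypothetical toy ratings in the same column layout as the CSV
toy = pd.DataFrame({
    'iid': ['0001388703', '0001388703', 'B000000001'],
    'uid': ['A1ZCPG3D3HGRSS', 'AC2PL52NKPL29', 'A1ZCPG3D3HGRSS'],
    'rating': [5.0, 5.0, 1.0],
})

# factorize returns (codes, uniques); sort=True numbers the IDs in sorted order
toy['userID'], user_ids = pd.factorize(toy['uid'], sort=True)
toy['itemID'], item_ids = pd.factorize(toy['iid'], sort=True)

# uniques[code] gives back the original ID, so the mapping can be inverted
recovered = user_ids.take(toy['userID'].to_numpy())
print(toy[['userID', 'itemID', 'rating']])
```

Keeping `user_ids` and `item_ids` around is useful when predictions need to be mapped back to the original Amazon identifiers.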

——————————————————————————————————
Research and coursework often rely on the Amazon review dataset. Each record in it has the following structure:

{
  "reviewerID": "A2SUAM1J3GNN3B",
  "asin": "0000013714",
  "reviewerName": "J. McDonald",
  "vote": 5,
  "style": {
    "Format:": "Hardcover"
  },
  "reviewText": "I bought this for my husband who plays the piano.  He is having a wonderful time playing these old hymns.  The music  is at times hard to read because we think the book was published for singing from more than playing from.  Great purchase though!",
  "overall": 5.0,
  "summary": "Heavenly Highway Hymns",
  "unixReviewTime": 1252800000,
  "reviewTime": "09 13, 2009"
}
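
The four fields that matter for rating data are `asin`, `reviewerID`, `overall`, and `unixReviewTime`. As a rough sketch, assuming the reviews are stored one JSON object per line (an assumption about the file layout, as is the helper name `to_ratings_frame`), they could be extracted like this:

```python
import json
import pandas as pd

# Two abbreviated records, one JSON object per line (illustrative sample)
sample_lines = [
    '{"reviewerID": "A2SUAM1J3GNN3B", "asin": "0000013714", '
    '"overall": 5.0, "unixReviewTime": 1252800000}',
    '{"reviewerID": "A1ZCPG3D3HGRSS", "asin": "0001388703", '
    '"overall": 5.0, "unixReviewTime": 1387670400}',
]

def to_ratings_frame(lines):
    """Keep only item ID, user ID, rating and timestamp from each review."""
    rows = []
    for line in lines:
        record = json.loads(line)
        rows.append((record['asin'], record['reviewerID'],
                     record['overall'], record['unixReviewTime']))
    return pd.DataFrame(rows, columns=['iid', 'uid', 'rating', 'time'])

ratings = to_ratings_frame(sample_lines)
print(ratings)
```

For a real file, the list of lines would be replaced by iterating over the opened file object.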

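However, the user IDs and item IDs are alphanumeric strings rather than the consecutive integers we want, so they need to be re-encoded. In the ratings CSV shown below, the columns are item ID, user ID, rating, and Unix timestamp: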

0001388703,A1ZCPG3D3HGRSS,5.0,1387670400
0001388703,AC2PL52NKPL29,5.0,1378857600
0001388703,A1SUZXBDZSDQ3A,5.0,1362182400
0001388703,A3A0W7FZXM0IZW,5.0,1354406400
0001388703,A12R54MKO17TW0,5.0,1325894400
0001388703,A25ZT87OMIPLNX,5.0,1247011200
0001388703,A3NVGWKHLULDHR,1.0,1242259200
0001388703,AT7OB43GHKIUA,5.0,1209859200
0001388703,A1H3X1TW6Y7HD8,5.0,1442534400
0001388703,AZ3T21W6CW0MW,1.0,1431648000
0001388703,A2W6V65OFOZ12M,5.0,1426204800
0001388703,A1DOF5GHOWGMW6,5.0,1415059200
0001388703,A4V08BR7LZ6D9,5.0,1413072000
0001388703,AJO3UG6FR5C7R,5.0,1411430400

The code for encoding the user IDs and item IDs is as follows:

import pandas as pd
import numpy as np

def data_recode():
    # Read the ratings file; columns are item ID, user ID, rating, Unix timestamp
    df = pd.read_csv('Digital_Music.csv', names=['iid', 'uid', 'rating', 'time'])
    df = df[['uid', 'iid', 'rating', 'time']]  # put the user column first

    # value_counts(): the index holds the distinct IDs (840372 users),
    # the values hold the number of interactions per ID, most frequent first
    customers = df['uid'].value_counts()
    products = df['iid'].value_counts()
    customers = customers[customers >= 5]  # keep users with at least 5 interactions
    # products = products[products >= 10]  # optional: also filter rare items

    # Inner merges drop the rows whose user (or item) was filtered out
    reduced_df = df.merge(pd.DataFrame({'uid': customers.index})) \
                   .merge(pd.DataFrame({'iid': products.index}))

    # Number the IDs 0..n-1 in order of decreasing interaction count
    customer_index = pd.DataFrame({'uid': customers.index,
                                   'userID': np.arange(customers.shape[0])})
    product_index = pd.DataFrame({'iid': products.index,
                                  'itemID': np.arange(products.shape[0])})

    reduced_df = reduced_df.merge(customer_index).merge(product_index)
    reduced_df = reduced_df[['userID', 'itemID', 'rating', 'time']]
    print(reduced_df)

    # Write tab-separated values without a header row or index column
    reduced_df.to_csv('Digital_Music.dat', header=False, index=False, sep='\t')

if __name__ == '__main__':
    data_recode()

After processing:

        userID  itemID  rating        time
0          747    6742     5.0  1325894400
1        31942    6742     5.0  1413072000
2        45713    6742     5.0  1411430400
3        14752    6742     5.0  1397520000
4          747    3135     5.0  1326758400
...        ...     ...     ...         ...
511162   12284  171649     5.0  1495324800
511163   12284  317396     5.0  1495324800
511164   12284  369414     5.0  1495324800
511165   12284  126294     5.0  1495324800
511166   12284  121341     5.0  1495324800
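
Note that in the output above the item IDs are numbered over all items, including those whose interactions were removed by the user filter, so the remaining item IDs are not necessarily contiguous. If contiguous IDs are needed, one possible variant (a sketch only, not the code that produced the output above; `filter_and_reindex` and the toy rows are hypothetical) is to renumber after filtering:

```python
import pandas as pd

def filter_and_reindex(df, min_user_interactions=5):
    """Drop infrequent users, then number the remaining users and items 0..n-1."""
    counts = df['uid'].value_counts()
    keep = counts[counts >= min_user_interactions].index
    kept = df[df['uid'].isin(keep)].copy()
    # factorize numbers IDs by order of first appearance in the filtered data
    kept['userID'] = pd.factorize(kept['uid'])[0]
    kept['itemID'] = pd.factorize(kept['iid'])[0]
    return kept[['userID', 'itemID', 'rating', 'time']].reset_index(drop=True)

# Hypothetical toy data: only user u1 has at least 2 interactions
toy = pd.DataFrame({
    'iid': ['i1', 'i2', 'i3', 'i1', 'i4'],
    'uid': ['u1', 'u1', 'u2', 'u3', 'u1'],
    'rating': [5.0, 4.0, 3.0, 5.0, 1.0],
    'time': [1, 2, 3, 4, 5],
})
result = filter_and_reindex(toy, min_user_interactions=2)
print(result)
```

Unlike the code above, this numbers IDs by first appearance rather than by interaction count, so the two approaches assign different integers to the same user or item.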

Reference
https://aws.amazon.com/cn/blogs/china/using-amazon-sagemaker-to-build-a-recommendation-system-based-on-gluon/
