Renumbering User IDs and Item IDs in the Amazon Dataset

狗娃子和翔娃子

Last modified 2022-12-12 18:51:41

First published 2022-02-06 17:51:52

Article link: https://blog.csdn.net/QAQ6QAQ/article/details/122799916

Update 2022.12.12: a faster method

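import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Read the data into a DataFrame. The file's column order is item, user, rating, time;
# reading the IDs as strings keeps leading zeros in ASINs such as 0001388703.
data = pd.read_csv('Digital_Music.csv', names=['iid', 'uid', 'rating', 'time'],
                   dtype={'iid': str, 'uid': str})
# Use one encoder per ID column; fit_transform maps each distinct ID to an integer 0..n-1
user_encoder = LabelEncoder()
item_encoder = LabelEncoder()
data['uid'] = user_encoder.fit_transform(data['uid'])
data['iid'] = item_encoder.fit_transform(data['iid'])
# The two ID columns are now re-encoded as integers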
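
As a small illustration of what `LabelEncoder` does to an ID column, the sketch below encodes a few user IDs taken from the sample rows in this post and then maps the integers back to the original strings. The tiny in-memory DataFrame is an assumption made for the demo; it stands in for the full CSV.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy data (assumed for illustration): three interactions by two users
demo = pd.DataFrame({
    'uid': ['A1ZCPG3D3HGRSS', 'AC2PL52NKPL29', 'A1ZCPG3D3HGRSS'],
    'rating': [5.0, 5.0, 1.0],
})

encoder = LabelEncoder()
# fit_transform learns the sorted set of distinct IDs and returns their integer codes
demo['uid_code'] = encoder.fit_transform(demo['uid'])

# inverse_transform recovers the original ID for each code
original_ids = encoder.inverse_transform(demo['uid_code'])
print(demo)
print(list(encoder.classes_))
```

Keeping the fitted encoder around is what makes it possible to translate model output back to the original Amazon IDs later.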

——————————————————————————————————
The Amazon review dataset comes up often in study and research work. Each record has the following structure:

{
  "reviewerID": "A2SUAM1J3GNN3B",
  "asin": "0000013714",
  "reviewerName": "J. McDonald",
  "vote": 5,
  "style": {
    "Format:": "Hardcover"
  },
  "reviewText": "I bought this for my husband who plays the piano.  He is having a wonderful time playing these old hymns.  The music  is at times hard to read because we think the book was published for singing from more than playing from.  Great purchase though!",
  "overall": 5.0,
  "summary": "Heavenly Highway Hymns",
  "unixReviewTime": 1252800000,
  "reviewTime": "09 13, 2009"
}

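However, its user IDs and item IDs are not in the form we want (they are strings rather than integer indices), so the IDs need to be re-encoded. The raw rows, in the order item ID, user ID, rating, timestamp, look like this: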

0001388703,A1ZCPG3D3HGRSS,5.0,1387670400
0001388703,AC2PL52NKPL29,5.0,1378857600
0001388703,A1SUZXBDZSDQ3A,5.0,1362182400
0001388703,A3A0W7FZXM0IZW,5.0,1354406400
0001388703,A12R54MKO17TW0,5.0,1325894400
0001388703,A25ZT87OMIPLNX,5.0,1247011200
0001388703,A3NVGWKHLULDHR,1.0,1242259200
0001388703,AT7OB43GHKIUA,5.0,1209859200
0001388703,A1H3X1TW6Y7HD8,5.0,1442534400
0001388703,AZ3T21W6CW0MW,1.0,1431648000
0001388703,A2W6V65OFOZ12M,5.0,1426204800
0001388703,A1DOF5GHOWGMW6,5.0,1415059200
0001388703,A4V08BR7LZ6D9,5.0,1413072000
0001388703,AJO3UG6FR5C7R,5.0,1411430400
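
For readers who only have the raw JSON, here is a minimal sketch of how four-column rows like these could be produced from records like the one shown earlier. It keeps the `asin`, `reviewerID`, `overall`, and `unixReviewTime` fields. It assumes the raw file stores one JSON object per line; the function name `reviews_to_rows` and the in-memory sample are hypothetical, used only for illustration.

```python
import csv
import io
import json


def reviews_to_rows(lines):
    """Yield (asin, reviewerID, overall, unixReviewTime) for each JSON line."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        yield (record['asin'], record['reviewerID'],
               record['overall'], record['unixReviewTime'])


# Hypothetical sample: one record, trimmed to the fields used here
sample = ['{"reviewerID": "A2SUAM1J3GNN3B", "asin": "0000013714", '
          '"overall": 5.0, "unixReviewTime": 1252800000}']

buffer = io.StringIO()  # stands in for an output file such as Digital_Music.csv
csv.writer(buffer).writerows(reviews_to_rows(sample))
csv_text = buffer.getvalue()
print(csv_text)
```

In real use, `lines` would be an open file handle and `buffer` an output file opened with `newline=''`.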

The code for encoding the user IDs and item IDs is as follows:

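import pandas as pd
import numpy as np


def data_recode():
    # Read the ratings CSV; the file's columns are item, user, rating, timestamp.
    # Reading the IDs as strings keeps leading zeros in ASINs such as 0001388703.
    df = pd.read_csv('Digital_Music.csv', names=['iid', 'uid', 'rating', 'time'],
                     dtype={'iid': str, 'uid': str})
    df = df[['uid', 'iid', 'rating', 'time']]  # reorder columns: user, item, rating, time

    # value_counts() returns a Series whose index holds the distinct IDs
    # (840372 users here) and whose values count each ID's interactions,
    # sorted from most to least frequent.
    customers = df['uid'].value_counts()
    products = df['iid'].value_counts()

    # Keep only users with at least 5 interactions; items are not filtered.
    customers = customers[customers >= 5]
    # products = products[products >= 10]

    # The inner joins drop the rows of users that were filtered out.
    reduced_df = (df.merge(pd.DataFrame({'uid': customers.index}))
                    .merge(pd.DataFrame({'iid': products.index})))

    # Assign integer IDs by frequency rank (0 = most frequent).
    # Items are ranked over all items, including those rated only by
    # filtered-out users, so the item IDs in the output may have gaps.
    customer_index = pd.DataFrame({'uid': customers.index,
                                   'userID': np.arange(customers.shape[0])})
    product_index = pd.DataFrame({'iid': products.index,
                                  'itemID': np.arange(products.shape[0])})

    reduced_df = reduced_df.merge(customer_index).merge(product_index)
    reduced_df = reduced_df[['userID', 'itemID', 'rating', 'time']]
    print(reduced_df)

    # Write tab-separated output without a header row or index column.
    reduced_df.to_csv('Digital_Music.dat', header=False, index=False, sep='\t')


if __name__ == '__main__':
    data_recode()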
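
To make the renumbering rule easier to see, here is a pure-Python sketch of the same idea on a handful of made-up rows: count interactions per ID, drop users below a threshold, and number the surviving IDs by frequency rank. The helper `rank_ids`, the toy rows, and the threshold of 2 are assumptions for the demo; unlike the pandas code above, it does not read or write files, and ties between equally frequent IDs may be ordered differently than in pandas.

```python
from collections import Counter


def rank_ids(ids, min_count=1):
    """Map each ID seen at least min_count times to its frequency rank (0 = most frequent)."""
    counts = Counter(ids)
    kept = [key for key, n in counts.most_common() if n >= min_count]
    return {key: rank for rank, key in enumerate(kept)}


# Made-up (iid, uid, rating, time) rows in the same column order as the CSV
rows = [
    ('i1', 'uA', 5.0, 10),
    ('i2', 'uA', 4.0, 11),
    ('i1', 'uB', 3.0, 12),
    ('i1', 'uC', 5.0, 13),
    ('i3', 'uC', 1.0, 14),
    ('i3', 'uC', 2.0, 15),
]

# Users need at least 2 interactions to be kept; items are not filtered
user_index = rank_ids((uid for _, uid, _, _ in rows), min_count=2)
item_index = rank_ids(iid for iid, _, _, _ in rows)

# Keep only rows whose user survived the filter, then swap in the integer IDs
recoded = [(user_index[uid], item_index[iid], rating, time)
           for iid, uid, rating, time in rows if uid in user_index]
print(recoded)
```

Here user `uB` has a single interaction and is dropped, while item `i2` keeps its rank even though only one row refers to it.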

After processing:

        userID  itemID  rating        time
0          747    6742     5.0  1325894400
1        31942    6742     5.0  1413072000
2        45713    6742     5.0  1411430400
3        14752    6742     5.0  1397520000
4          747    3135     5.0  1326758400
...        ...     ...     ...         ...
511162   12284  171649     5.0  1495324800
511163   12284  317396     5.0  1495324800
511164   12284  369414     5.0  1495324800
511165   12284  126294     5.0  1495324800
511166   12284  121341     5.0  1495324800

Reference link
https://aws.amazon.com/cn/blogs/china/using-amazon-sagemaker-to-build-a-recommendation-system-based-on-gluon/