Re-numbering user IDs and item IDs in the Amazon dataset
Update (2022.12.12): a faster method
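import pandas as pd
from sklearn.preprocessing import LabelEncoder
# Read the data into a DataFrame (CSV column order: item, user, rating, time)
data = pd.read_csv('Digital_Music.csv', names=['iid','uid','rating','time'])
# Fit one encoder per ID column, on the column values rather than the column names
uid_encoder = LabelEncoder()
iid_encoder = LabelEncoder()
data['uid'] = uid_encoder.fit_transform(data['uid'])
data['iid'] = iid_encoder.fit_transform(data['iid'])
# Each ID column is now re-encoded as integers 0..n-1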
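If scikit-learn is not available, pandas alone can do the same kind of re-encoding. The following is a minimal sketch, assuming the same column names as above; the toy IDs (`u_a`, `i_x`, and so on) are made up for illustration. `pd.factorize` returns the integer codes together with the array of unique values, so the original string IDs can be recovered later.

```python
import pandas as pd

# Made-up toy rows using the same column names as the snippet above.
toy = pd.DataFrame({
    "iid": ["i_x", "i_x", "i_y"],
    "uid": ["u_b", "u_a", "u_b"],
    "rating": [5.0, 5.0, 4.0],
    "time": [1387670400, 1378857600, 1362182400],
})

# factorize returns (codes, uniques). With sort=True the uniques are sorted
# first, so codes follow the sorted order of the original IDs.
toy["userID"], user_labels = pd.factorize(toy["uid"], sort=True)
toy["itemID"], item_labels = pd.factorize(toy["iid"], sort=True)

# The uniques map an integer code back to the original string ID.
original_uid = user_labels[toy["userID"].to_numpy()]
print(toy[["userID", "itemID", "rating", "time"]])
```

Without `sort=True`, codes are assigned in order of first appearance instead, which is also a valid numbering but differs from the sorted one.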
----------------------------------
The Amazon review dataset is often needed when studying recommender systems. Each record in the dataset has the following structure:
{
  "reviewerID": "A2SUAM1J3GNN3B",
  "asin": "0000013714",
  "reviewerName": "J. McDonald",
  "vote": 5,
  "style": {
    "Format:": "Hardcover"
  },
  "reviewText": "I bought this for my husband who plays the piano. He is having a wonderful time playing these old hymns. The music is at times hard to read because we think the book was published for singing from more than playing from. Great purchase though!",
  "overall": 5.0,
  "summary": "Heavenly Highway Hymns",
  "unixReviewTime": 1252800000,
  "reviewTime": "09 13, 2009"
}
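For the re-encoding step only four of these fields matter: `asin`, `reviewerID`, `overall` and `unixReviewTime`. The sketch below shows one way to pull them out into CSV rows. It assumes the review file stores one JSON object per line; the function name `reviews_to_rows` is hypothetical, and an in-memory string stands in for the real file.

```python
import csv
import io
import json

def reviews_to_rows(lines):
    """Yield (asin, reviewerID, overall, unixReviewTime) for each JSON line."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        yield (record["asin"], record["reviewerID"],
               record["overall"], record["unixReviewTime"])

# One in-memory record (fields taken from the example above) replaces the file.
sample = io.StringIO(
    '{"reviewerID": "A2SUAM1J3GNN3B", "asin": "0000013714", '
    '"overall": 5.0, "unixReviewTime": 1252800000}\n'
)

out = io.StringIO()
csv.writer(out, lineterminator="\n").writerows(reviews_to_rows(sample))
csv_text = out.getvalue()
print(csv_text)
```

With a real file, `sample` would be replaced by an open file handle and `out` by a file opened for writing.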
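However, the user IDs and item IDs are alphanumeric strings, which is not the form we want, so they need to be re-encoded. The rating rows (item ID, user ID, rating, timestamp) look like this: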
0001388703,A1ZCPG3D3HGRSS,5.0,1387670400
0001388703,AC2PL52NKPL29,5.0,1378857600
0001388703,A1SUZXBDZSDQ3A,5.0,1362182400
0001388703,A3A0W7FZXM0IZW,5.0,1354406400
0001388703,A12R54MKO17TW0,5.0,1325894400
0001388703,A25ZT87OMIPLNX,5.0,1247011200
0001388703,A3NVGWKHLULDHR,1.0,1242259200
0001388703,AT7OB43GHKIUA,5.0,1209859200
0001388703,A1H3X1TW6Y7HD8,5.0,1442534400
0001388703,AZ3T21W6CW0MW,1.0,1431648000
0001388703,A2W6V65OFOZ12M,5.0,1426204800
0001388703,A1DOF5GHOWGMW6,5.0,1415059200
0001388703,A4V08BR7LZ6D9,5.0,1413072000
0001388703,AJO3UG6FR5C7R,5.0,1411430400
The code for encoding the user IDs and item IDs is as follows:
import pandas as pd
import numpy as np

def data_recode():
    # CSV column order: item, user, rating, time
    data = pd.read_csv('Digital_Music.csv', names=['iid','uid','rating','time'])
    df = data[['uid','iid','rating','time']]  # put the user column first
    # value_counts: the index holds all distinct users (840372),
    # the values hold the number of interactions of each user
    customers = df['uid'].value_counts()
    products = df['iid'].value_counts()
    # Keep only users with at least 5 interactions
    customers = customers[customers >= 5]
    #products = products[products >= 10]
    # Inner merges drop the rows whose user (or item) was filtered out
    reduced_df = (df.merge(pd.DataFrame({'uid': customers.index}))
                    .merge(pd.DataFrame({'iid': products.index})))
    # Map each original ID to a consecutive integer
    customer_index = pd.DataFrame({'uid': customers.index,
                                   'userID': np.arange(customers.shape[0])})
    product_index = pd.DataFrame({'iid': products.index,
                                  'itemID': np.arange(products.shape[0])})
    reduced_df = reduced_df.merge(customer_index).merge(product_index)
    reduced_df = reduced_df[['userID','itemID','rating','time']]
    print(reduced_df)
    # Write tab-separated values without header or index
    reduced_df.to_csv('Digital_Music.dat', header=False, index=False, sep='\t')

if __name__ == '__main__':
    data_recode()
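Note that the function above numbers items over every item in the file, including items that were rated only by users who are later removed, so the resulting itemID values may contain gaps. If dense IDs are needed for both columns, one alternative is to filter first and encode afterwards. This is a sketch under that assumption and not the method used above; the function name `filter_and_encode` and the toy data are made up for illustration.

```python
import pandas as pd

def filter_and_encode(df, min_user_count=5):
    """Keep users with at least min_user_count rows, then assign dense IDs."""
    # Per-row count of how many interactions that row's user has.
    counts = df.groupby("uid")["uid"].transform("size")
    kept = df[counts >= min_user_count].copy()
    # Category codes are computed on the surviving rows only,
    # so both ID columns cover 0..n-1 without gaps.
    kept["userID"] = kept["uid"].astype("category").cat.codes
    kept["itemID"] = kept["iid"].astype("category").cat.codes
    return kept[["userID", "itemID", "rating", "time"]].reset_index(drop=True)

# Made-up toy data: user u2 has a single interaction and is filtered out,
# which also removes item z from the result.
toy = pd.DataFrame({
    "uid": ["u1", "u1", "u2", "u1", "u3", "u3"],
    "iid": ["a", "b", "z", "c", "c", "d"],
    "rating": [5.0, 4.0, 3.0, 5.0, 1.0, 2.0],
    "time": [1, 2, 3, 4, 5, 6],
})
encoded = filter_and_encode(toy, min_user_count=2)
print(encoded)
```

The trade-off is that IDs are assigned in sorted order of the original strings rather than by interaction count, so the numbering differs from the output of `data_recode`.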
After processing:
        userID  itemID  rating        time
0          747    6742     5.0  1325894400
1        31942    6742     5.0  1413072000
2        45713    6742     5.0  1411430400
3        14752    6742     5.0  1397520000
4          747    3135     5.0  1326758400
...        ...     ...     ...         ...
511162   12284  171649     5.0  1495324800
511163   12284  317396     5.0  1495324800
511164   12284  369414     5.0  1495324800
511165   12284  126294     5.0  1495324800
511166   12284  121341     5.0  1495324800
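The written file is tab-separated and has no header row, so the column names have to be supplied again when it is loaded. A minimal sketch of reading it back follows, with two rows from the output above held in memory in place of `Digital_Music.dat`; the extra `date` column is an illustrative addition.

```python
import io
import pandas as pd

# Two rows in the tab-separated, header-less layout of the output file.
sample = io.StringIO(
    "747\t6742\t5.0\t1325894400\n"
    "31942\t6742\t5.0\t1413072000\n"
)

ratings = pd.read_csv(sample, sep="\t",
                      names=["userID", "itemID", "rating", "time"])
# The time column holds Unix timestamps in seconds.
ratings["date"] = pd.to_datetime(ratings["time"], unit="s")
print(ratings)
```

With the real file, `sample` would be replaced by the path `'Digital_Music.dat'`.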
Reference link
https://aws.amazon.com/cn/blogs/china/using-amazon-sagemaker-to-build-a-recommendation-system-based-on-gluon/