Introduction to the Amazon Review Dataset

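The Amazon Review Dataset records user reviews of products sold on Amazon and is a classic dataset for recommender systems. It has been updated several times, and by release date it can be divided into three versions.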

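The dataset can also be split by product category into sub-datasets such as Books, Electronics, Movies and TV, and CDs and Vinyl. Each sub-dataset contains two kinds of information, shown here for the 2014 version: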

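  1. Product metadata

    asin: product ID
    title: product name
    price: price
    imUrl: URL of the product image
    related: related products (also bought, also viewed, bought together)
    salesRank: sales rank information
    brand: brand
    categories: list of category paths the product belongs to

    Official example: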

    {
    "asin": "0000031852",
    "title": "Girls Ballet Tutu Zebra Hot Pink",
    "price": 3.17,
    "imUrl": "http://ecx.images-amazon.com/images/I/51fAmVkTbyL._SY300_.jpg",
    "related":
    {
     "also_bought": ["B00JHONN1S", "B002BZX8Z6"],
     "also_viewed": ["B002BZX8Z6", "B00JHONN1S"],
     "bought_together": ["B002BZX8Z6"]
    },
    "salesRank": {"Toys & Games": 211836},
    "brand": "Coxlures",
    "categories": [["Sports & Outdoors", "Other Sports", "Dance"]]
    }
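The nested fields are the part of a metadata record that usually needs some handling. The following is a minimal sketch, assuming a record arrives as one line of text; it parses an abbreviated copy of the example above and reads `related` and `categories`. The helper name `flatten_categories` is hypothetical and not part of the dataset or of any library.

```python
import ast

# Abbreviated copy of the example record above, as a single line of text.
line = ('{"asin": "0000031852", "price": 3.17, '
        '"related": {"also_bought": ["B00JHONN1S", "B002BZX8Z6"], '
        '"bought_together": ["B002BZX8Z6"]}, '
        '"salesRank": {"Toys & Games": 211836}, '
        '"categories": [["Sports & Outdoors", "Other Sports", "Dance"]]}')


def flatten_categories(categories):
    """Join each category path into one string (hypothetical helper)."""
    return [" > ".join(path) for path in categories]


record = ast.literal_eval(line)

# Not every record carries every field, so fall back to empty defaults.
also_bought = record.get("related", {}).get("also_bought", [])
category_paths = flatten_categories(record.get("categories", []))

print(also_bought)
print(category_paths)
```

Using `.get` with a default keeps the code from failing on records where a field such as `related` is missing.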
    
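  2. User review and rating records

    reviewerID: user ID
    asin: product ID
    reviewerName: user name
    helpful: helpfulness rating of the review, e.g. [2, 3] means 2 of 3 votes found it helpful
    reviewText: text of the review
    overall: rating of the product
    summary: summary of the review
    unixReviewTime: time of the review (Unix timestamp)
    reviewTime: time of the review (raw string)

    Example:
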
    {
      "reviewerID": "A2SUAM1J3GNN3B",
      "asin": "0000013714",
      "reviewerName": "J. McDonald",
      "helpful": [2, 3],
      "reviewText": "I bought this for my husband who plays the piano.  He is having a wonderful time playing these old hymns.  The music  is at times hard to read because we think the book was published for singing from more than playing from.  Great purchase though!",
      "overall": 5.0,
      "summary": "Heavenly Highway Hymns",
      "unixReviewTime": 1252800000,
      "reviewTime": "09 13, 2009"
    }
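Two of these fields are often converted before use: `helpful` into a ratio and `unixReviewTime` into a date. The following is a minimal sketch using values from the example review; the function name `helpful_ratio` is hypothetical, and interpreting the timestamp as UTC is an assumption.

```python
from datetime import datetime, timezone

# Selected fields copied from the example review above.
review = {
    "helpful": [2, 3],
    "overall": 5.0,
    "unixReviewTime": 1252800000,
    "reviewTime": "09 13, 2009",
}


def helpful_ratio(helpful):
    """Return helpful votes / total votes, or None when nobody voted."""
    up, total = helpful
    return up / total if total else None


ratio = helpful_ratio(review["helpful"])

# Assumption: the timestamp is interpreted as UTC.
review_date = datetime.fromtimestamp(
    review["unixReviewTime"], tz=timezone.utc
).date()

print(ratio)
print(review_date.isoformat())
```

Returning None for reviews without votes avoids a division by zero and keeps them distinguishable from reviews that were voted unhelpful.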
    

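Reading the Amazon dataset:

The downloaded data comes as JSON files with one record per line, which are awkward to work with directly. This section shows how to convert them to CSV files, using the 2014 Amazon Electronics dataset as an example:

Reading the product metadata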

import ast

import pandas as pd

file_path = 'meta_Electronics.json'
useless_col = ['imUrl', 'salesRank', 'related', 'title', 'description']  # fields to drop

rows = []
with open(file_path, 'r') as fin:
    for line in fin:
        # Lines may be Python-style dict literals rather than strict JSON,
        # so parse them with ast.literal_eval, a safer replacement for eval.
        d = ast.literal_eval(line)
        for s in useless_col:
            d.pop(s, None)  # drop the field if it is present
        rows.append(d)

df = pd.DataFrame(rows)
df.to_csv('meta_Electronics.csv', index=False)

Reading the user review and rating records

import ast

import pandas as pd

file_path = 'Electronics_10.json'
useless_col = ['reviewerName', 'reviewText', 'unixReviewTime', 'summary']  # fields to drop

rows = []
with open(file_path, 'r') as fin:
    for line in fin:
        # ast.literal_eval parses the record without executing arbitrary code.
        d = ast.literal_eval(line)
        for s in useless_col:
            d.pop(s, None)  # drop the field if it is present
        rows.append(d)

df = pd.DataFrame(rows)
df.to_csv('Electronics_10.csv', index=False)
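
When the resulting CSV is read back, two details are easy to miss: list-valued columns such as `helpful` come back as strings, and an `asin` with leading zeros is parsed as a number unless its type is pinned. The following is a minimal sketch of the round trip, using an in-memory buffer and a single row from the example review in place of the real file.

```python
import ast
import io

import pandas as pd

# One row shaped like the converted review records (sample values from above).
rows = [{
    "reviewerID": "A2SUAM1J3GNN3B",
    "asin": "0000013714",
    "helpful": [2, 3],
    "overall": 5.0,
    "reviewTime": "09 13, 2009",
}]

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
pd.DataFrame(rows).to_csv(buffer, index=False)
buffer.seek(0)

# Read asin as text so that leading zeros are preserved.
raw = pd.read_csv(buffer, dtype={"asin": str})

# The helpful column is now the string "[2, 3]"; turn it back into a list.
parsed = raw.copy()
parsed["helpful"] = parsed["helpful"].apply(ast.literal_eval)

print(raw.loc[0, "helpful"], type(raw.loc[0, "helpful"]))
print(parsed.loc[0, "helpful"], type(parsed.loc[0, "helpful"]))
```

The same `dtype` and `ast.literal_eval` steps apply to the `categories` column of the product metadata CSV.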