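Kaggle project page: https://www.kaggle.com/nicapotato/womens-ecommerce-clothing-reviews
Analysis approach:
A first look at the data shows that it covers women's clothing sales and reviews. The variables are the product ID, the three category levels the garment belongs to, customer age, rating, review title, and review text.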
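Building the analysis model:
We group these variables into three classes: product variables, customer variables, and feedback variables.
Product variables: clothing id, division name, department name, and class name
Customer variables (only one): age
Feedback variables: rating, title, review, positive feedback count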
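Analysis purpose
The goal is to find how each variable is distributed and how the variables relate to one another, in order to predict future sales trends and review patterns for each category.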
Analysis tools
Excel pivot tables (and pivot charts)
Python packages: pandas, jieba, nltk, matplotlib, wordcloud.
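Analysis content
The analysis covers the following aspects:
Distribution of each variable on its own
Relationships between two variables
Relationships among three variables
Outlier analysis
Text analysis of the reviews (word cloud)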
Nicknames:
To make the relationships among the variables clear and the names easier to write and recognize, I rename the original columns as follows:
Id>>>index
Clothing id>>>sku
Age>>>age
Title>>>title (review title)
Review text>>>review (review text)
Rating>>>rating (score)
Recommended IND>>>recons (whether the sku is recommended)
Positive feedback count>>>posfeeds (number of upvotes the review received)
Division name>>>large division (first-level category)
Department name>>>bigcato (second-level category)
Class name>>>smallcato (fine-grained category)
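The renaming above could be sketched with pandas `DataFrame.rename`. The original column labels in the mapping (for example "Unnamed: 0" and "Clothing ID") are assumptions about how the Kaggle CSV is labeled, `division` is used here as a shorter stand-in for "large division", and the two-row frame is invented toy data so that the snippet runs on its own.

```python
import pandas as pd

# Assumed original column labels -> nicknames used in this write-up.
NICKNAMES = {
    "Unnamed: 0": "index",
    "Clothing ID": "sku",
    "Age": "age",
    "Title": "title",
    "Review Text": "review",
    "Rating": "rating",
    "Recommended IND": "recons",
    "Positive Feedback Count": "posfeeds",
    "Division Name": "division",   # shortened from "large division"
    "Department Name": "bigcato",
    "Class Name": "smallcato",
}

# Toy rows standing in for the real CSV (all values are invented).
raw = pd.DataFrame(
    [
        [0, 101, 33, "Nice", "Very comfortable", 4, 1, 0, "General", "Tops", "Knits"],
        [1, 102, 45, "Too small", "Runs small", 2, 0, 3, "General", "Dresses", "Dresses"],
    ],
    columns=list(NICKNAMES),
)

# rename() returns a new frame; columns missing from the mapping are left as-is.
renamed = raw.rename(columns=NICKNAMES)
print(list(renamed.columns))
```

With the real file, the same mapping would be applied to the frame returned by `pd.read_csv`, provided the labels match.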
Analysis Goal:
We will look at three broad aspects:
How each variable is distributed
How the three classes of variables affect each other
How the combination of two classes affects the remaining one
We will answer the following questions:
(single-variable analysis)
Which skus sold the most, which categories do they belong to, and who bought them?
Which categories sold the most?
Which age group bought the most?
How are the ratings distributed?
How are the recommendations distributed?
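Single-variable analysis, in short:
sku: how sales are distributed across skus
How sales are distributed across each category level
How customer ages are distributed
How ratings are distributed
How recommendations are distributed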
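The single-variable questions above mostly reduce to frequency counts. Below is a minimal sketch, assuming the nicknamed columns and treating the number of rows per sku as the sales count; the toy frame and the 10-year age bins are illustrative assumptions, not values from the dataset.

```python
import pandas as pd

# Toy data (invented values) using the nicknames defined earlier.
toy = pd.DataFrame({
    "sku": [101, 101, 102, 101, 102, 103],
    "age": [34, 41, 29, 52, 41, 38],
    "rating": [5, 4, 5, 2, 5, 3],
    "recons": [1, 1, 1, 0, 1, 0],
    "smallcato": ["Dresses", "Dresses", "Knits", "Dresses", "Knits", "Dresses"],
})

# Rows per sku, most frequent first (used here as a proxy for sales).
sku_counts = toy["sku"].value_counts()

# Rows per fine-grained category.
cat_counts = toy["smallcato"].value_counts()

# Age distribution in 10-year bins: [20, 30), [30, 40), ...
age_bins = (
    pd.cut(toy["age"], bins=[20, 30, 40, 50, 60], right=False)
    .value_counts()
    .sort_index()
)

# Share of each rating value, and overall share of recommended rows.
rating_share = toy["rating"].value_counts(normalize=True).sort_index()
recon_share = toy["recons"].mean()

print(sku_counts, cat_counts, age_bins, rating_share, recon_share, sep="\n\n")
```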
(two-variable correlation analysis)
Who bought each category? (smallcato VS age, bigcato VS age, division VS age)
Which categories are recommended most? (smallcato VS recons, bigcato VS recons, division VS recons)
Who makes the most recommendations? (age VS recons)
Who gave high or low ratings? (age VS rating)
Which categories get the most high/low ratings? (smallcato VS rating, bigcato VS rating, division VS rating)
Which ratings receive the most upvotes? (rating VS posfeeds)
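Two-variable analysis, in short:
Category (each level) VS age
Category (each level) VS recommendation
Age VS recommendation
Age VS rating
Category VS rating
Rating VS upvotes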
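One possible way to approach the two-variable questions above is with `groupby` aggregates and `pd.crosstab`. This is a sketch on invented toy data; the `age_group` column and its two bins are assumptions introduced only for illustration.

```python
import pandas as pd

# Toy data (invented values) using the nicknames defined earlier.
toy = pd.DataFrame({
    "age": [25, 34, 41, 52, 29, 47],
    "rating": [5, 4, 2, 5, 3, 1],
    "recons": [1, 1, 0, 1, 1, 0],
    "posfeeds": [0, 2, 7, 1, 0, 4],
    "smallcato": ["Dresses", "Knits", "Dresses", "Knits", "Dresses", "Knits"],
})

# Hypothetical age bands; converted to plain strings for simple indexing.
toy["age_group"] = pd.cut(
    toy["age"], bins=[20, 40, 60], right=False, labels=["20-39", "40-59"]
).astype(str)

# Category VS age: median buyer age per category.
age_by_cat = toy.groupby("smallcato")["age"].median()

# Category VS recommendation: share of recommended rows per category.
recon_by_cat = toy.groupby("smallcato")["recons"].mean()

# Age VS rating: count of rows for each (age band, rating) pair.
age_rating = pd.crosstab(toy["age_group"], toy["rating"])

# Rating VS upvotes: mean upvotes received by reviews at each rating.
posfeeds_by_rating = toy.groupby("rating")["posfeeds"].mean()

print(age_by_cat, recon_by_cat, age_rating, posfeeds_by_rating, sep="\n\n")
```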
(three-variable correlation analysis)
Who bought what, and what was their feedback? (smallcato VS age VS rating)
How do recommendations affect category sales? (recons VS smallcato VS sku (qty))
How do ratings affect category sales? (rating VS smallcato VS sku (qty))
How does positive feedback affect category sales? (posfeeds VS smallcato VS sku (qty))
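Three-variable analysis, in short:
Category VS age VS rating
Recommendation VS category VS sku sales
Rating VS category VS sku sales
Review upvotes VS category VS sku sales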
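The three-variable combinations above can be laid out as two-way tables in which a third variable fills the cells. A minimal sketch with `pivot_table` and a grouped count follows; the toy rows and the `age_group` labels are invented for illustration.

```python
import pandas as pd

# Toy data (invented values) using the nicknames defined earlier.
toy = pd.DataFrame({
    "sku": [101, 101, 102, 103, 103, 103],
    "age_group": ["20-39", "40-59", "20-39", "40-59", "20-39", "40-59"],
    "rating": [5, 3, 4, 2, 4, 1],
    "recons": [1, 0, 1, 0, 1, 0],
    "smallcato": ["Dresses", "Dresses", "Dresses", "Knits", "Knits", "Knits"],
})

# Category VS age VS rating: mean rating for each (category, age band) cell.
rating_by_cat_age = toy.pivot_table(
    index="smallcato", columns="age_group", values="rating", aggfunc="mean"
)

# Recommendation VS category VS sku quantity: rows per (category, recons) cell.
rows_by_cat_recon = (
    toy.groupby(["smallcato", "recons"])["sku"].count().unstack(fill_value=0)
)

print(rating_by_cat_age, rows_by_cat_recon, sep="\n\n")
```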
Which effect is larger? (correlation analysis)
Of rating, upvotes, and recommendation, which has the largest effect on sales?
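One way to read this question is to aggregate the data per sku and compare how strongly each feedback measure correlates with the sku's row count. The sketch below assumes that reading and uses invented toy data. The aggregate names (`qty`, `mean_rating`, `recon_rate`, `mean_posfeeds`) are hypothetical, and a correlation coefficient only indicates association, not a causal effect.

```python
import pandas as pd

# Toy data (invented values): one row per review.
toy = pd.DataFrame({
    "sku": [101, 101, 101, 101, 102, 102, 103],
    "rating": [5, 5, 4, 5, 4, 3, 2],
    "recons": [1, 1, 1, 1, 1, 0, 0],
    "posfeeds": [3, 0, 2, 1, 0, 1, 4],
})

# Collapse to one row per sku: quantity plus average feedback measures.
per_sku = toy.groupby("sku").agg(
    qty=("rating", "size"),
    mean_rating=("rating", "mean"),
    recon_rate=("recons", "mean"),
    mean_posfeeds=("posfeeds", "mean"),
)

# Pearson correlation of each feedback measure with quantity.
corr_with_qty = per_sku.corr()["qty"].drop("qty")

# Rank the measures by absolute correlation, strongest first.
ranking = corr_with_qty.abs().sort_values(ascending=False)
print(corr_with_qty, ranking, sep="\n\n")
```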
The questions above concern numeric variables. Now we turn to the text variables:
Word frequency analysis of title
Word frequency analysis of review
Negative feedback review analysis (rating below 3)
Positive feedback review analysis (rating above 3)
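Text analysis, in short:
title word frequency
Overall review word frequency
Word frequency for positive, neutral, and negative reviews overall
Negative-review word frequency within each category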
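A word frequency count of this kind can be sketched with only the standard library. The stopword set and the sample reviews below are invented for illustration; in practice a fuller stopword list (for example from nltk) would likely be used, and the resulting counts could feed a word cloud.

```python
import re
from collections import Counter

# Small illustrative stopword set (an assumption, not a standard list).
STOPWORDS = {"the", "a", "and", "is", "it", "i", "this", "but", "too", "was", "so"}

def word_freq(texts, top_n=5):
    """Return the top_n most common non-stopword tokens across texts."""
    counts = Counter()
    for text in texts:
        if not isinstance(text, str):
            continue  # skip missing reviews such as NaN
        words = re.findall(r"[a-z']+", text.lower())
        counts.update(w for w in words if w not in STOPWORDS)
    return counts.most_common(top_n)

# Invented (review, rating) pairs.
reviews = [
    ("The fabric is soft and the fit is perfect", 5),
    ("Fit was too small, fabric felt cheap", 2),
    ("Cheap fabric and poor fit", 1),
]

# Negative reviews: rating below 3, as defined above.
negative = [text for text, rating in reviews if rating < 3]
top_negative = word_freq(negative, top_n=3)
print(top_negative)
```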
Outlier Analysis
Low ratings that still come with a recommendation
High ratings that come without a recommendation
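Outlier analysis, in short:
High rating but not recommended: word frequency analysis of review
Low rating but recommended: word frequency analysis of review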
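The two outlier groups can be isolated with boolean masks before any text analysis. In this sketch the thresholds (rating of 4 or more counts as high, 2 or less as low) are assumptions, and the rows are invented toy data.

```python
import pandas as pd

# Toy data (invented values) using the nicknames defined earlier.
toy = pd.DataFrame({
    "rating": [5, 1, 4, 2, 5, 3],
    "recons": [1, 1, 0, 0, 0, 1],
    "review": [
        "love it",
        "poor quality but a nice gift",
        "nice but pricey",
        "too tight",
        "great, though not for everyone",
        "okay",
    ],
})

# Assumed thresholds: high = rating >= 4, low = rating <= 2.
high_not_recommended = toy[(toy["rating"] >= 4) & (toy["recons"] == 0)]
low_but_recommended = toy[(toy["rating"] <= 2) & (toy["recons"] == 1)]

print(high_not_recommended["review"].tolist())
print(low_but_recommended["review"].tolist())
```

The `review` column of each subset can then go through the same word frequency step used for the other text questions.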
First, we will get an overview of the dataset.
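import pandas as pd
import matplotlib.pyplot as plt  # plotting functions live in the pyplot submodule

# Print large frames in full instead of truncating them
pd.set_option('display.max_rows', 25000)
pd.set_option('display.max_columns', 30)

# wc.csv is presumably the dataset saved with the nicknamed columns
data = pd.read_csv('wc.csv')
print('dataset overview:', data.describe(), sep='\n')

# Number of rows per sku, most frequent first
groupedsku = data.groupby('sku').sku.count().sort_values(ascending=False)
print('sku overview:', groupedsku.describe(), sep='\n')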
dataset overview:
              index           sku           age      posfeeds        rating  \
count  23486.000000  23486.000000  23486.000000  23486.000000  23486.000000
mean   11742.500000    918.118709     43.198544      2.535936      4.196032
std     6779.968547    203.298980     12.279544      5.702202      1.110031
min        0.000000      0.000000     18.000000      0.000000      1.000000
25%     5871.250000    861.000000     34.000000      0.000000      4.000000
50%    11742.500000    936.000000     41.000000      1.000000      5.000000
75% 17613.750