Kaggle Women's Clothing project

Kaggle project URL: https://www.kaggle.com/nicapotato/womens-ecommerce-clothing-reviews

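Analysis approach:
A first look at the data shows that it records women's clothing sales and reviews. The variables are the product ID, the three category levels the garment belongs to, customer age, rating, review title, and review text.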

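Building the analysis model:
We group these variables into three classes: product variables, customer variables, and feedback variables.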

Product variables: clothing id, division name, department name, and class name
Customer variables: only one, age
Feedback variables: rating, title, review, positive feedback count

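Analysis purpose
The goal is to find how each variable is distributed and how the variables relate to one another, in order to anticipate future sales trends and review outcomes for each category.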

Analysis tools
Excel pivot tables (and pivot charts)
Python packages: pandas, jieba, nltk, matplotlib, wordcloud

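Analysis content
The analysis covers the following aspects:
Distribution of each variable on its own
Relationships between two variables
Relationships among three variables
Outlier analysis
Text analysis of the reviews (word cloud)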

Nicknames:
To make the relationships among the variables clear and the names easy to write and recognize, I rename the original columns as follows:
Id>>>index
Clothing id>>>sku
Age>>>age
Title>>>title (review title)
Review text>>>review (review text)
Rating>>>rating (score)
Recommended IND>>>recon (whether the sku is recommended)
Positive feedback count>>>posfeeds (number of upvotes the review received)
Division name>>>large division (first-level category)
Department name>>>bigcato (second-level category)
Class name>>>smallcato (fine-grained category)
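
The renaming above can be sketched with pandas `DataFrame.rename`. This is only an illustration: the exact spelling of the original Kaggle headers in the mapping is an assumption and should be checked against the CSV, and the unnamed first column (Id) is left out. The name `recon` follows the column shown in the `describe()` output further down.

```python
import pandas as pd

# Mapping from the original column names to the nicknames listed above.
# The exact original headers are an assumption; check them against your CSV.
NICKNAMES = {
    "Clothing ID": "sku",
    "Age": "age",
    "Title": "title",
    "Review Text": "review",
    "Rating": "rating",
    "Recommended IND": "recon",
    "Positive Feedback Count": "posfeeds",
    "Division Name": "large division",
    "Department Name": "bigcato",
    "Class Name": "smallcato",
}


def apply_nicknames(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with known columns renamed; other columns are kept."""
    return df.rename(columns=NICKNAMES)


# Tiny made-up frame, only to show the call (not real rows).
raw = pd.DataFrame({"Clothing ID": [767, 1080], "Age": [33, 34], "Rating": [4, 5]})
renamed = apply_nicknames(raw)
print(list(renamed.columns))
```

Keeping the mapping in one dictionary means the nicknames are defined in a single place and can be reused when reloading the file.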

Analysis Goal:
We will look at three broad aspects:
How is each variable distributed?
How do the three classes of variables affect each other?
How does the combination of two classes affect the remaining one?

We will answer the following questions:


(single-variable analysis)

Which skus sell the most, which categories do they belong to, and who buys them (distribution of sales across skus)?
Which categories sell the most, at each category level?
Which ages buy the most (customer age distribution)?
How are the ratings distributed?
How are the recommendations distributed?
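
The single-variable questions mostly come down to counting values. A minimal sketch with `value_counts` and `pd.cut`, run on a few made-up rows in the renamed layout (the values are illustrative, not from the dataset, and the 10-year age bins are an arbitrary choice):

```python
import pandas as pd

# Made-up rows in the renamed layout; not taken from the real dataset.
data = pd.DataFrame({
    "sku": [1078, 1078, 862, 1078, 862, 1094],
    "smallcato": ["Dresses", "Dresses", "Knits", "Dresses", "Knits", "Dresses"],
    "age": [25, 41, 38, 52, 67, 33],
    "rating": [5, 4, 5, 2, 5, 3],
    "recon": [1, 1, 1, 0, 1, 0],
})

# Rows per sku, most frequent first (row counts stand in for sales here).
sku_counts = data["sku"].value_counts()

# Rows per fine-grained category.
cat_counts = data["smallcato"].value_counts()

# Share of each rating and of each recommendation flag.
rating_share = data["rating"].value_counts(normalize=True).sort_index()
recon_share = data["recon"].value_counts(normalize=True)

# Age distribution in 10-year bins: [20, 30), [30, 40), ...
age_bins = pd.cut(data["age"], bins=list(range(20, 80, 10)), right=False)
age_counts = age_bins.value_counts().sort_index()

print(sku_counts)
print(rating_share)
print(age_counts)
```

The same calls apply to `bigcato` and the division column for the other category levels.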

(two-variable correlation analysis)
Who buys each category (smallcato VS age, bigcato VS age, division VS age)?
Which categories are recommended most (smallcato VS recon, bigcato VS recon, division VS recon)?
Which ages recommend most (age VS recon)?
Who gives high or low ratings (age VS rating)?
Which categories get the most high or low ratings (smallcato VS rating, bigcato VS rating, division VS rating)?
Which ratings receive the most upvotes (rating VS posfeeds)?
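
Most of the two-variable questions can be approached with a `groupby` or a cross-tabulation. The sketch below uses made-up rows and the nickname columns; it shows one possible way to lay out the comparisons, not necessarily the approach taken later in the project.

```python
import pandas as pd

# Made-up rows in the renamed layout; not taken from the real dataset.
data = pd.DataFrame({
    "smallcato": ["Dresses", "Dresses", "Knits", "Knits", "Pants", "Pants"],
    "age": [25, 45, 38, 62, 30, 50],
    "rating": [5, 2, 4, 5, 3, 4],
    "recon": [1, 0, 1, 1, 0, 1],
    "posfeeds": [0, 7, 1, 2, 4, 0],
})

# Category VS age: average buyer age per category.
age_by_cat = data.groupby("smallcato")["age"].mean()

# Category VS recommendation: recommendation rate per category
# (the mean of a 0/1 flag is the share of ones).
recon_by_cat = data.groupby("smallcato")["recon"].mean()

# Category VS rating: how many times each rating occurs in each category.
rating_table = pd.crosstab(data["smallcato"], data["rating"])

# Rating VS upvotes: average upvotes received at each rating level.
posfeeds_by_rating = data.groupby("rating")["posfeeds"].mean()

print(recon_by_cat)
print(rating_table)
```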

(three-variable correlation analysis)

Who buys what, and what feedback do they give (smallcato VS age VS rating)?
How do recommendations affect category sales (recon VS smallcato VS sku (qty))?
How do ratings affect category sales (rating VS smallcato VS sku (qty))?
How do review upvotes affect category sales (posfeeds VS smallcato VS sku (qty))?

Which effect is larger (correlation analysis)?
Of rating, upvotes, and recommendation, which has the greatest effect on sales?
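
One possible way to set up the three-variable comparisons is a pivot table for category VS age VS rating, plus a per-sku summary whose columns can be correlated with the row count per sku. Everything in this sketch is an assumption for illustration: the rows are made up, the age split at 40 is arbitrary, and row counts are used as a proxy for sales. A correlation computed this way describes association only; it does not show that ratings or recommendations cause sales.

```python
import pandas as pd

# Made-up rows in the renamed layout; not taken from the real dataset.
data = pd.DataFrame({
    "sku": [1, 1, 1, 2, 2, 3, 4, 4],
    "smallcato": ["Dresses", "Dresses", "Dresses", "Knits", "Knits",
                  "Pants", "Dresses", "Dresses"],
    "age": [25, 45, 38, 62, 30, 50, 29, 55],
    "rating": [5, 4, 5, 3, 4, 2, 5, 3],
    "recon": [1, 1, 1, 0, 1, 0, 1, 1],
    "posfeeds": [0, 7, 1, 2, 4, 0, 3, 1],
})

# Category VS age VS rating: mean rating per category and age group.
data["age_group"] = data["age"].apply(
    lambda a: "under 40" if a < 40 else "40 and over"
)
rating_pivot = data.pivot_table(
    index="smallcato", columns="age_group", values="rating", aggfunc="mean"
)

# Per-sku summary: row count (qty) next to the three feedback measures.
per_sku = data.groupby("sku").agg(
    qty=("rating", "size"),
    mean_rating=("rating", "mean"),
    recon_rate=("recon", "mean"),
    mean_posfeeds=("posfeeds", "mean"),
)

# Pearson correlation of each feedback measure with qty.
corr_with_qty = per_sku.corr()["qty"].drop("qty")

print(rating_pivot)
print(corr_with_qty)
```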


The questions above concern the numeric variables. Next come the text variables:

Word frequency analysis of title
Word frequency analysis of review overall
Word frequency of positive, neutral, and negative reviews overall
Negative review analysis (rating below 3), including word frequency by category
Positive review analysis (rating above 3)
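
Word frequency counting can be sketched with the standard library alone. Note the substitutions: this uses a simple regular-expression tokenizer and a tiny hand-written stop-word list instead of nltk, the sample reviews are made up, and treating ratings above 3 as positive is an assumption.

```python
import re
from collections import Counter

# A tiny hand-written stop-word list for illustration;
# a fuller list (for example from nltk) would be used in practice.
STOP_WORDS = {"the", "a", "an", "and", "is", "it", "i", "this",
              "but", "to", "was", "so", "too"}


def word_freq(texts, top_n=5):
    """Count lower-cased words across texts, skipping stop words."""
    counts = Counter()
    for text in texts:
        words = re.findall(r"[a-z']+", text.lower())
        counts.update(w for w in words if w not in STOP_WORDS)
    return counts.most_common(top_n)


# Made-up (review, rating) pairs; not taken from the real dataset.
reviews = [
    ("Love this dress, the fabric is so soft", 5),
    ("The fabric is thin and the fit was too small", 2),
    ("Runs small, fabric feels cheap", 1),
]

negative = [text for text, rating in reviews if rating < 3]
positive = [text for text, rating in reviews if rating > 3]

print(word_freq(negative))
print(word_freq(positive))
```

The resulting (word, count) pairs can be passed on to a word cloud or a bar chart.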


Outlier Analysis

Low ratings that are still recommended: word frequency analysis of the review text
High ratings that are not recommended: word frequency analysis of the review text
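
The two outlier groups can be selected with boolean masks before running the word frequency analysis on their reviews. In this sketch the rows are made up and the cut-offs (4 and above counts as high, 2 and below as low) are assumptions, since the text does not fix them.

```python
import pandas as pd

# Made-up rows in the renamed layout; not taken from the real dataset.
data = pd.DataFrame({
    "rating": [5, 5, 1, 2, 4, 3],
    "recon": [1, 0, 0, 1, 1, 0],
    "review": [
        "Perfect fit",
        "Pretty, but not for my shape",
        "Fell apart after one wash",
        "Poor quality, though my sister liked it",
        "Nice colour",
        "Just okay",
    ],
})

HIGH, LOW = 4, 2  # assumed cut-offs for "high" and "low" ratings

# High rating but not recommended.
high_not_recommended = data[(data["rating"] >= HIGH) & (data["recon"] == 0)]

# Low rating but recommended anyway.
low_recommended = data[(data["rating"] <= LOW) & (data["recon"] == 1)]

print(high_not_recommended["review"].tolist())
print(low_recommended["review"].tolist())
```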

First, we take an overall look at the dataset.

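import pandas as pd
import matplotlib.pyplot as plt  # pyplot is the plotting interface, kept for later charts

# Show long outputs in full instead of letting pandas truncate them
pd.set_option('display.max_rows', 25000)
pd.set_option('display.max_columns', 30)

# wc.csv is assumed to be the Kaggle file with the columns renamed to the nicknames
data = pd.read_csv('wc.csv')
print('dataset overview:', data.describe(), sep='\n')

# Number of rows per sku, largest first
groupedsku = data.groupby('sku').sku.count().sort_values(ascending=False)
print('sku overview:', groupedsku.describe(), sep='\n')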



**dataset overview**

            index           sku           age      posfeeds        rating  \
count  23486.000000  23486.000000  23486.000000  23486.000000  23486.000000   
mean   11742.500000    918.118709     43.198544      2.535936      4.196032   
std     6779.968547    203.298980     12.279544      5.702202      1.110031   
min        0.000000      0.000000     18.000000      0.000000      1.000000   
25%     5871.250000    861.000000     34.000000      0.000000      4.000000   
50%    11742.500000    936.000000     41.000000      1.000000      5.000000   
75%    17613.750000   1078.000000     52.000000      3.000000      5.000000   
max    23485.000000   1205.000000     99.000000    122.000000      5.000000   

              recon  
count  23486.000000  
mean       0.822362  
std        0.382216  
min        0.000000  
25%        1.000000  
50%        1.000000  
75%        1.000000  
max        1.000000  

sku overview
count    1206.000000
mean       19.474295
std        69.009764
min         1.000000
25%         1.000000
50%         2.000000
75%         6.750000
max      1024.000000
Name: sku, dtype: float64

Dataset overview:
This is an overview of the numeric variables. The dataset has 23,486 rows.
sku: there are 1,206 unique skus, with ids running from 0 to 1205. These are sku ids, not units sold. The number of rows per sku is very uneven: the median is 2, while the most frequent sku has 1,024 rows.
age: the mean age is about 43.2 and the median is 41; the minimum is 18 and the maximum is 99. 75% of the customers are between 18 and 52 years old.
positive feedback count: each review gets about 2.5 upvotes on average. Half of the reviews get at most 1 upvote, a quarter get between 1 and 3, and the remaining quarter get 3 or more. With a maximum of 122, the mean is clearly pulled up by large outliers.
rating: the average rating is about 4.2, and at least 75% of the ratings are 4 or higher.
recommend: the mean is 0.82, meaning that about 82% of the recorded reviews recommend the product.
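
The remark about outliers can be checked by comparing the mean with the median and the upper quartile. A minimal sketch on made-up upvote counts (not the real `posfeeds` column), where a single large value pulls the mean far above what most rows look like:

```python
import pandas as pd

# Made-up upvote counts with one large outlier; not the real posfeeds column.
posfeeds = pd.Series([0, 0, 0, 0, 1, 1, 2, 3, 4, 122])

mean_all = posfeeds.mean()
median_all = posfeeds.median()
q75 = posfeeds.quantile(0.75)

# The same mean with the single largest value left out.
mean_trimmed = posfeeds.sort_values().iloc[:-1].mean()

print("mean:", mean_all)
print("median:", median_all)
print("75th percentile:", q75)
print("mean without the largest value:", mean_trimmed)
```

A mean that sits above the 75th percentile, as it does here, is a sign that a few extreme values dominate the average.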

Single variable analysis:

Distribution of sku qty
