Kaggle Women's Clothing Project

tsing_9521

Category: python introductory data analysis. Tags: data analysis, python, excel, visualization

Article link: https://blog.csdn.net/weixin_44595372/article/details/88924957

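This article analyzes women's clothing sales data from Kaggle across several dimensions, including product sales, customer age, ratings, and feedback. Using pivot tables and Python tools such as pandas, it studies the relationships among the variables, such as how product category and customer age relate to rating, and how feedback affects sales. The analysis finds that women aged 30 to 54 are the main customer group, that high ratings tend to come with more recommendations, and that low ratings are concentrated in particular product categories. The outlier analysis also surfaces highly rated items that were not recommended and low-rated items that were, along with the keyword distribution of negative reviews.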

Kaggle project page: https://www.kaggle.com/nicapotato/womens-ecommerce-clothing-reviews

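Analysis approach:
Inspecting the data shows that it covers women's clothing sales and reviews. The variables are the product ID, the three-level category hierarchy the garment belongs to, customer age, rating, review title, and review text.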

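Building the analysis model:
We group these variables into three classes: product variables, customer variables, and feedback variables.

Product variables: clothing id, division name, class name, and department name
Customer variables: only one, age
Feedback variables: rating, title, review, positive feedback count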

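Analysis objective
The goal is to find how each variable is distributed and how the variables relate to one another, in order to anticipate future sales trends and review patterns for each category.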

Analysis tools
Excel pivot tables (and pivot charts)
Python packages: pandas, jieba, nltk, matplotlib, wordcloud.

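Analysis content
It covers the following aspects:
Distribution analysis of each variable on its own
Relationship analysis between two variables
Relationship analysis among three variables
Outlier analysis
Text analysis of the reviews (word cloud)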

Nicknames:
To keep the relationships among the variables clear and the names easy to type and recognize, I rename the original columns as follows:
Id>>>index
Clothing id>>>sku
Age>>>age
Title>>>title (review title)
Review text>>>review (review text)
Rating>>>rating (score)
Recommended IND>>>recons (whether the sku was recommended)
Positive feedback count>>>posfeeds (number of upvotes the review received)
Division name>>>large division (top-level category)
Department name>>>bigcato (second-level category)
Class name>>>smallcato (fine-grained category)
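
The renaming above can be sketched with pandas `DataFrame.rename`. This is only an illustration: the original header spellings in `RENAME_MAP` are assumed from the Kaggle CSV and should be checked against your copy, the one-row `raw` frame is made up, and `division` is used as a shorter stand-in for "large division", matching the later question lists.

```python
import pandas as pd

# Assumed original headers of the Kaggle CSV; verify against your file.
RENAME_MAP = {
    "Clothing ID": "sku",
    "Age": "age",
    "Title": "title",
    "Review Text": "review",
    "Rating": "rating",
    "Recommended IND": "recons",
    "Positive Feedback Count": "posfeeds",
    "Division Name": "division",   # shorthand for "large division"
    "Department Name": "bigcato",
    "Class Name": "smallcato",
}

# The three variable groups described earlier, using the nicknames.
VARIABLE_GROUPS = {
    "product": ["sku", "division", "bigcato", "smallcato"],
    "customer": ["age"],
    "feedback": ["rating", "title", "review", "posfeeds"],
}

# One made-up row standing in for the real data.
raw = pd.DataFrame({
    "Clothing ID": [1078],
    "Age": [33],
    "Title": ["Lovely dress"],
    "Review Text": ["Soft fabric and a flattering fit."],
    "Rating": [5],
    "Recommended IND": [1],
    "Positive Feedback Count": [2],
    "Division Name": ["General"],
    "Department Name": ["Dresses"],
    "Class Name": ["Dresses"],
})

renamed = raw.rename(columns=RENAME_MAP)
print(list(renamed.columns))
```

Keeping the mapping in one dictionary makes it easy to apply the same nicknames in both the Excel and the pandas parts of the analysis.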

Analysis Goal:
We will look at three broad aspects:
How each variable is distributed
How the three groups of variables affect each other
How the combination of two groups affects the remaining one

We will answer the following questions:

(single-variable analysis)

Which skus sell the most, which categories they belong to, and who bought them (distribution of sales across skus)
Which categories sell the most (sales distribution at each category level)
Which ages buy the most (customer age distribution)
How the ratings are distributed
How the recommendations are distributed
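
As a rough sketch of the single-variable questions above, the distributions can be computed with `value_counts` and `pd.cut`. The rows below are invented, the age bands are an arbitrary choice, and the number of reviews per sku is used as a proxy for sales, since the dataset contains reviews rather than orders.

```python
import pandas as pd

# Made-up sample using the nicknames; not real data.
df = pd.DataFrame({
    "sku": [1078, 1078, 862, 1094, 862, 1078],
    "smallcato": ["Dresses", "Dresses", "Knits", "Dresses", "Knits", "Dresses"],
    "age": [33, 41, 52, 28, 60, 39],
    "rating": [5, 4, 3, 5, 2, 5],
    "recons": [1, 1, 0, 1, 0, 1],
})

# Reviews per sku and per category, most frequent first.
sku_counts = df["sku"].value_counts()
class_counts = df["smallcato"].value_counts()

# Rating distribution ordered from 1 to 5.
rating_dist = df["rating"].value_counts().sort_index()

# Share of reviews that recommend the product.
recon_share = df["recons"].mean()

# Age distribution in hypothetical 10-year bands.
age_bands = pd.cut(
    df["age"],
    bins=[17, 29, 39, 49, 59, 99],
    labels=["18-29", "30-39", "40-49", "50-59", "60+"],
)
age_dist = age_bands.value_counts().sort_index()

print(sku_counts, class_counts, rating_dist, age_dist, sep="\n\n")
print("recommended share:", round(recon_share, 3))
```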

(two-variable correlation analysis)
Who bought each category (smallcato VS age, bigcato VS age, division VS age)
Which categories are recommended most (smallcato VS recons, bigcato VS recons, division VS recons)
Who makes the most recommendations (age VS recons)
Who gave high or low ratings (age VS rating)
Which categories get the most high/low ratings (smallcato VS rating, bigcato VS rating, division VS rating)
Which ratings receive the most upvotes (rating VS posfeeds)
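
One possible way to approach the two-variable questions is a cross-tabulation for counts and a grouped mean for rates. The sample rows and the three age bands here are assumptions chosen for illustration only.

```python
import pandas as pd

# Made-up sample using the nicknames; not real data.
df = pd.DataFrame({
    "smallcato": ["Dresses", "Dresses", "Knits", "Knits", "Pants", "Pants"],
    "age": [25, 45, 35, 52, 38, 61],
    "rating": [5, 2, 4, 5, 3, 1],
    "recons": [1, 0, 1, 1, 1, 0],
    "posfeeds": [0, 7, 1, 2, 0, 4],
})

# Hypothetical age bands, stored as plain strings.
df["age_group"] = pd.cut(
    df["age"], bins=[17, 29, 54, 99], labels=["18-29", "30-54", "55+"]
).astype(str)

# smallcato VS age: review counts per category and age band.
class_by_age = pd.crosstab(df["smallcato"], df["age_group"])

# smallcato VS recons: share of recommending reviews per category.
recon_rate_by_class = df.groupby("smallcato")["recons"].mean()

# age VS rating: mean rating per age band.
mean_rating_by_age = df.groupby("age_group")["rating"].mean()

# rating VS posfeeds: mean upvotes per rating level.
posfeeds_by_rating = df.groupby("rating")["posfeeds"].mean()

print(class_by_age, recon_rate_by_class, mean_rating_by_age,
      posfeeds_by_rating, sep="\n\n")
```

The same pattern applies to bigcato and division by swapping the grouping column.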

(three-variable correlation analysis)

Who bought what, and what their feedback was (smallcato VS age VS rating)
How recommendations affect category sales (recons VS smallcato VS sku (qty))
How ratings affect category sales (rating VS smallcato VS sku (qty))
How review upvotes affect category sales (posfeeds VS smallcato VS sku (qty))
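
The three-variable comparisons map naturally onto a pivot table, much like the Excel pivot tables mentioned under the tools. The sketch below assumes made-up rows and age bands, and again counts reviews as a stand-in for sku quantity.

```python
import pandas as pd

# Made-up sample using the nicknames; not real data.
df = pd.DataFrame({
    "sku": [1, 1, 2, 3, 3, 3],
    "smallcato": ["Dresses", "Dresses", "Dresses", "Knits", "Knits", "Knits"],
    "age": [25, 45, 50, 35, 40, 28],
    "rating": [5, 3, 4, 2, 4, 5],
    "recons": [1, 0, 1, 0, 1, 1],
})

# Hypothetical age bands, stored as plain strings.
df["age_group"] = pd.cut(
    df["age"], bins=[17, 29, 54, 99], labels=["18-29", "30-54", "55+"]
).astype(str)

# smallcato VS age VS rating: mean rating per category and age band.
rating_by_class_age = df.pivot_table(
    index="smallcato", columns="age_group", values="rating", aggfunc="mean"
)

# recons VS smallcato VS sku (qty): review counts per category,
# split by whether the reviewer recommended the product.
qty_by_class_recon = (
    df.groupby(["smallcato", "recons"])["sku"].count().unstack(fill_value=0)
)

print(rating_by_class_age, qty_by_class_recon, sep="\n\n")
```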

Which has the largest effect? (correlation analysis)
Among rating, upvotes, and recommendation, which one affects sales the most?
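
A minimal sketch of one way to compare the three effects: aggregate to one row per sku, then correlate each feedback measure with the review count. The numbers are invented, review count stands in for sales, and Pearson correlation is an assumed choice rather than the method confirmed by the article.

```python
import pandas as pd

# Made-up sample using the nicknames; not real data.
df = pd.DataFrame({
    "sku": [1, 1, 1, 1, 2, 2, 2, 3, 3, 4],
    "rating": [5, 5, 4, 5, 4, 4, 5, 3, 4, 2],
    "recons": [1, 1, 1, 1, 1, 1, 1, 0, 1, 0],
    "posfeeds": [2, 0, 1, 3, 0, 1, 0, 5, 0, 6],
})

# One row per sku: review count plus average feedback measures.
per_sku = df.groupby("sku").agg(
    qty=("rating", "count"),
    mean_rating=("rating", "mean"),
    recon_rate=("recons", "mean"),
    mean_posfeeds=("posfeeds", "mean"),
)

# Pearson correlation of each feedback measure with the review count.
corr_with_qty = per_sku.corr()["qty"].drop("qty")
print(corr_with_qty.sort_values(ascending=False))
```

A correlation computed this way only indicates association; it does not show that a higher rating causes more sales.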

The questions above concern the numeric variables. Next we answer the questions about the text variables:

Word frequency analysis of title
Word frequency analysis of review (overall)
Word frequency of positive, neutral, and negative reviews overall
Negative review analysis (rating below 3), including word frequency per category
Positive review analysis (rating above 3)
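
The word-frequency step can be sketched with the standard library alone. The article lists nltk and wordcloud as tools; this example swaps in a plain regex tokenizer and `collections.Counter` so it runs without extra packages. The stopword set and the sample reviews are made up.

```python
import re
from collections import Counter

# Tiny hypothetical stopword list; a real run would use a fuller one.
STOPWORDS = {"the", "a", "and", "is", "it", "i", "this", "but",
             "too", "was", "on", "me", "so", "in", "very"}

def word_freq(texts, top_n=10):
    """Return the top_n most common non-stopword tokens in texts."""
    counts = Counter()
    for text in texts:
        tokens = re.findall(r"[a-z']+", text.lower())
        counts.update(t for t in tokens if t not in STOPWORDS)
    return counts.most_common(top_n)

# (rating, review) pairs, invented for illustration.
reviews = [
    (5, "Love this dress, the fabric is soft and the fit is perfect"),
    (1, "The fabric is cheap and the fit was too small"),
    (2, "Too small, cheap fabric, returned it"),
]

# Negative reviews are those rated below 3, as in the list above.
negative = [text for rating, text in reviews if rating < 3]
neg_counts = dict(word_freq(negative))
print(neg_counts)
```

The resulting counts could then be passed to a word cloud library for plotting.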

Outlier Analysis

Low ratings that still got a recommendation (word frequency analysis of the reviews)
High ratings that got no recommendation (word frequency analysis of the reviews)
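
The two outlier groups can be selected with boolean masks before running the word-frequency analysis on their reviews. The cutoffs below (4 and above counts as high, 2 and below as low) are assumptions, and the rows are invented.

```python
import pandas as pd

# Made-up sample using the nicknames; not real data.
df = pd.DataFrame({
    "rating": [5, 5, 1, 2, 4, 3],
    "recons": [1, 0, 1, 0, 0, 1],
    "review": [
        "Perfect fit",
        "Nice but pricey, would not suggest it",
        "Runs small, but my sister loves it",
        "Poor quality",
        "Pretty, though the zipper broke",
        "It is fine",
    ],
})

# Assumed cutoffs: rating >= 4 is high, rating <= 2 is low.
high_not_recommended = df[(df["rating"] >= 4) & (df["recons"] == 0)]
low_but_recommended = df[(df["rating"] <= 2) & (df["recons"] == 1)]

print(high_not_recommended["review"].tolist())
print(low_but_recommended["review"].tolist())
```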

First, let's get an overview of the dataset.

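import pandas as pd
import matplotlib.pyplot as plt

# Print large frames in full instead of truncating them
pd.set_option('display.max_rows', 25000)
pd.set_option('display.max_columns', 30)

# wc.csv holds the dataset with the renamed columns
data = pd.read_csv('wc.csv')
print('dataset overview:', data.describe(), sep='\n')
# Number of reviews per sku, most-reviewed first
groupedsku = data.groupby('sku').sku.count().sort_values(ascending=False)
print('sku overview:', groupedsku.describe(), sep='\n')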



**dataset overview**

            index           sku           age      posfeeds        rating  \
count  23486.000000  23486.000000  23486.000000  23486.000000  23486.000000   
mean   11742.500000    918.118709     43.198544      2.535936      4.196032   
std     6779.968547    203.298980     12.279544      5.702202      1.110031   
min        0.000000      0.000000     18.000000      0.000000      1.000000   
25%     5871.250000    861.000000     34.000000      0.000000      4.000000   
50%    11742.500000    936.000000     41.000000      1.000000      5.000000   
75%    17613.750
