Analysis of a Women's Clothing E-Commerce Review Dataset

Exploring the Women's Clothing E-Commerce Dataset

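Background
This is a women's clothing e-commerce dataset built around customer reviews. Excluding the row-index column, it has 10 feature columns, so the reviews can be analyzed along several dimensions.
Because this is real commercial data, it has been anonymized: references to the company in the review text and body were replaced with "retailer".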

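I. Questions to Ask Before Exploring

  • What share of customers falls at each age?
  • How are product sales distributed across age groups?
  • Which products have the best ratings and reviews?
  • How is the product recommendation rate distributed?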

II. Importing and Inspecting the Data

# Suppress warnings so they do not clutter the notebook output
import warnings
warnings.filterwarnings('ignore')
import csv
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import jieba
import PIL.Image as Image
from jieba import analyse
from wordcloud import WordCloud
from snownlp import SnowNLP

%matplotlib inline
df = pd.read_csv("Womens Clothing E-Commerce Reviews.csv")
df.head()
   Unnamed: 0  Clothing ID  Age  Title                    Review Text                                        Rating  Recommended IND  Positive Feedback Count  Division Name   Department Name  Class Name
0  0           767          33   NaN                      Absolutely wonderful - silky and sexy and comf...  4       1                0                        Initmates       Intimate         Intimates
1  1           1080         34   NaN                      Love this dress! it's sooo pretty. i happene...    5       1                4                        General         Dresses          Dresses
2  2           1077         60   Some major design flaws  I had such high hopes for this dress and reall...  3       0                0                        General         Dresses          Dresses
3  3           1049         50   My favorite buy!         I love, love, love this jumpsuit. it's fun, fl...  5       1                0                        General Petite  Bottoms          Pants
4  4           847          47   Flattering shirt         This shirt is very flattering to all due to th...  5       1                6                        General         Tops             Blouses
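
The `Unnamed: 0` column in the preview looks like a row index that was saved into the CSV. A minimal sketch of one way to avoid loading it as data, assuming the file's first header cell is empty (the two-row `csv_text` below is a hypothetical stand-in for the real file):

```python
import io
import pandas as pd

# Hypothetical excerpt standing in for the real CSV; the first header
# cell is empty, which is why pandas names the column "Unnamed: 0".
csv_text = (
    ",Clothing ID,Age,Rating\n"
    "0,767,33,4\n"
    "1,1080,34,5\n"
)

# Default read: the saved index shows up as an ordinary data column.
raw = pd.read_csv(io.StringIO(csv_text))

# index_col=0 uses that first column as the row index instead.
sample = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(raw.columns.tolist())
print(sample.columns.tolist())
```

Applied to the real file, the same argument would go into the `pd.read_csv` call shown earlier; whether that is desirable depends on whether later steps rely on the `Unnamed: 0` column.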

Inspect the overall structure of the data

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23486 entries, 0 to 23485
Data columns (total 11 columns):
Unnamed: 0                 23486 non-null int64
Clothing ID                23486 non-null int64
Age                        23486 non-null int64
Title                      19676 non-null object
Review Text                22641 non-null object
Rating                     23486 non-null int64
Recommended IND            23486 non-null int64
Positive Feedback Count    23486 non-null int64
Division Name              23472 non-null object
Department Name            23472 non-null object
Class Name                 23472 non-null object
dtypes: int64(6), object(5)
memory usage: 2.0+ MB

Check for missing values

df.isnull().sum()
Unnamed: 0                    0
Clothing ID                   0
Age                           0
Title                      3810
Review Text                 845
Rating                        0
Recommended IND               0
Positive Feedback Count       0
Division Name                14
Department Name              14
Class Name                   14
dtype: int64



  • Values are missing in Title, Review Text, Division Name, Department Name, and Class Name.
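
The source does not say how these gaps are handled afterwards. As a hedged sketch of one common approach (the `toy` frame and the "Unknown" label are illustrative assumptions, not the author's choices): drop rows without review text, since text analysis needs it, and fill the remaining gaps with explicit placeholders.

```python
import numpy as np
import pandas as pd

# Hypothetical three-row frame using the dataset's column names.
toy = pd.DataFrame({
    "Title": [np.nan, "My favorite buy!", "Flattering shirt"],
    "Review Text": ["Absolutely wonderful", np.nan, "This shirt is very flattering"],
    "Division Name": ["Initmates", "General", np.nan],
})

# Keep only rows that actually contain review text.
with_text = toy.dropna(subset=["Review Text"]).copy()

# A missing title is harmless for text analysis; use an empty string.
with_text["Title"] = with_text["Title"].fillna("")

# Label missing categories explicitly instead of dropping the rows.
with_text["Division Name"] = with_text["Division Name"].fillna("Unknown")

print(with_text.isnull().sum())
```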

Check for duplicate rows

df.duplicated().sum()
0
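
A caveat worth checking: `df.duplicated()` compares all columns, and `Unnamed: 0` is different in every row, so a result of 0 is guaranteed regardless of the review content. The sketch below, on a hypothetical `toy` frame, shows how excluding that column can surface repeated rows.

```python
import pandas as pd

# Hypothetical frame: rows 0 and 1 are identical except for the index column.
toy = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2],
    "Clothing ID": [767, 767, 1080],
    "Age": [33, 33, 34],
    "Rating": [4, 4, 5],
})

# Comparing every column: the unique index column hides the repeat.
dup_all = toy.duplicated().sum()

# Comparing only the content columns reveals it.
content_cols = [c for c in toy.columns if c != "Unnamed: 0"]
dup_content = toy.duplicated(subset=content_cols).sum()

print(dup_all, dup_content)
```

Whether content-level repeats in the real data are true duplicates or just customers with identical answers is a judgment call that this check alone cannot settle.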




Inspect the distributions

df.describe()
       Unnamed: 0    Clothing ID   Age           Rating        Recommended IND  Positive Feedback Count
count  23486.000000  23486.000000  23486.000000  23486.000000  23486.000000     23486.000000
mean   11742.500000  918.118709    43.198544     4.196032      0.822362         2.535936
std    6779.968547   203.298980    12.279544     1.110031      0.382216         5.702202
min    0.000000      0.000000      18.000000     1.000000      0.000000         0.000000
25%    5871.250000   861.000000    34.000000     4.000000      1.000000         0.000000
50%    11742.500000  936.000000    41.000000     5.000000      1.000000         1.000000
75%    17613.750000  1078.000000   52.000000     5.000000      1.000000         3.000000
max    23485.000000  1205.000000   99.000000     5.000000      1.000000         122.000000
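
One way to read these summary statistics: Recommended IND is a 0/1 indicator, so its mean (0.822362 in the `df.describe()` output) is the share of reviews that recommend the product, roughly 82%. The sketch below illustrates the idea on a hypothetical `toy` frame and extends it to a per-department rate with `groupby`, which is one possible starting point for the recommendation-rate question.

```python
import pandas as pd

# Hypothetical data using the dataset's column names.
toy = pd.DataFrame({
    "Department Name": ["Dresses", "Dresses", "Tops", "Tops", "Tops"],
    "Recommended IND": [1, 0, 1, 1, 1],
})

# The mean of a 0/1 column is the fraction of ones.
overall_rate = toy["Recommended IND"].mean()

# The same idea per group gives a recommendation rate per department.
rate_by_department = toy.groupby("Department Name")["Recommended IND"].mean()

print(overall_rate)
print(rate_by_department)
```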
pd.plotting.scatter_matrix(df, alpha = 0.3, figsize = (32,16), diagonal = 'kde');

[Figure: scatter matrix of the numeric columns, with kernel density estimates on the diagonal]


III. Exploring the Questions

What share of customers falls at each age?

## Ten most common customer ages
plt.figure(figsize=(16, 6))
sns.set(style="whitegrid")
df["Age"].value_counts()[:10].plot(kind = "bar");

[Figure: bar chart of the ten most common customer ages]
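
The bar chart plots raw counts, while the question asks about shares. A minimal sketch of how the same `value_counts` call can return proportions instead, using a hypothetical `ages` series in place of `df["Age"]`:

```python
import pandas as pd

# Hypothetical ages standing in for df["Age"].
ages = pd.Series([39, 39, 39, 35, 35, 36, 34, 34, 34, 34], name="Age")

# normalize=True divides each count by the total, so the values sum to 1.
share = ages.value_counts(normalize=True)

# The most common ages and their share of all rows.
top_two = share.head(2)
print(top_two)
```

On the real data, `df["Age"].value_counts(normalize=True)[:10]` would give the ten largest shares and could be plotted the same way as the counts.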


How are product sales distributed across age groups?

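## Number of customers in each age group
bins = np.arange(18, 104, 5)

# include_lowest=True keeps age 18 (the dataset minimum) in the first bin;
# by default pd.cut leaves the left edge of the first interval open
Age_cut = pd.cut(df["Age"], bins, include_lowest=True)
age_num = Age_cut.value_counts()  # sorted by count, largest first
# Merge everything after the seven largest bins into a single "other" slice
other = pd.Series({"other": age_num.iloc[7:].sum()})
age_num.index = age_num.index.astype(str)
pie_data = pd.concat([age_num.iloc[:7], other])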
plt.figure(figsize=(10, 10))
labels = pie_data.index
# Offset every wedge slightly from the center
explode = np.full(pie_data.shape[0], 0.05)

patches, l_text, p_text = plt.pie(pie_data, explode=explode, labels=labels,
                                  labeldistance=1.1, autopct='%3.1f%%', shadow=False,
                                  startangle=90, pctdistance=0.6)
plt.axis('equal')  # equal aspect ratio so the pie is drawn as a circle
plt.legend(loc='upper left', facecolor='white')
plt.show()

[Figure: pie chart of customer share by age group]

  • Conclusion: customers aged 33 to 43 buy the most.
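
A note on how the age bins treat their edges: by default `pd.cut` builds intervals that are open on the left, so with bins starting at 18 the first interval is (18, 23] and an age of exactly 18 falls outside every bin. The sketch below demonstrates this on a hypothetical `ages` series; `include_lowest=True` is one way to keep the minimum age.

```python
import numpy as np
import pandas as pd

# Hypothetical ages, including the boundary values 18 and 23.
ages = pd.Series([18, 19, 23, 24, 99])
bins = np.arange(18, 104, 5)  # 18, 23, 28, ..., 103

# Default: right-closed intervals (18, 23], (23, 28], ... so 18 gets NaN.
default_cut = pd.cut(ages, bins)

# include_lowest=True closes the left edge of the first interval.
inclusive_cut = pd.cut(ages, bins, include_lowest=True)

print(default_cut.isna().sum())
print(inclusive_cut.isna().sum())
```

Rows that receive NaN are silently left out of `value_counts()`, so they would also be missing from any chart built on the binned counts.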
plt.figure(figsize=(16, 6))
sns.set(style="whitegrid")