Visualizing Product Information and Text-Processing Results in Python (2)

qq_42007099

Tags: information visualization

Article link: https://blog.csdn.net/qq_42007099/article/details/128205766

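Categorizing the products

The category_name column contains 1,287 distinct categories. Taken as whole strings these labels are too coarse to work with, so we break them down for a finer-grained analysis.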

train.category_name.nunique() 

train.category_name.value_counts()

Split each category on "/" by defining a helper function:

def split_cat(text):
    try:
        return text.split("/")    # split the category_name string on "/"
    except AttributeError:        # missing values (NaN floats) have no .split
        return ("No Label", "No Label", "No Label")

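try and except are statements rather than functions. Here they behave much like an if/else: if text.split fails (a missing category is NaN, a float with no split method), the except branch returns the placeholder labels instead. We then use apply to run split_cat() on every row of the column. A lambda is an anonymous function that needs no return statement; the value of its expression is the return value.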

train['general_cat'],train['subcat_1'],train['subcat_2']=zip(*train['category_name'].apply(lambda x:split_cat(x)))
train.head()
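
To see why zip(*...) produces three columns, here is a minimal plain-Python sketch. The sample strings are made up for illustration, and rows stands in for the result of train['category_name'].apply(split_cat); it is not the author's data.

```python
def split_cat(text):
    try:
        return text.split("/")
    except AttributeError:  # e.g. float('nan') for a missing category
        return ("No Label", "No Label", "No Label")

# Hypothetical sample values; the last one mimics a missing category (NaN)
samples = ["Men/Tops/T-shirts", "Women/Jewelry/Necklaces", float("nan")]
rows = [split_cat(s) for s in samples]

# zip(*rows) transposes the list of rows into one tuple per category level
general_cat, subcat_1, subcat_2 = zip(*rows)
print(general_cat)
print(subcat_1)
print(subcat_2)
```

Note that zip stops at the shortest row, so a category string with more than three levels is cut down to three as long as at least one row has exactly three parts.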

Check the number of distinct values at each level after the split:

train['general_cat'].nunique()  #11
train['subcat_1'].nunique()     #114
train.subcat_2.nunique()         #871

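In total there are 11 main categories, 114 first-level subcategories, and 871 second-level subcategories. Next, look at how the items are distributed across the categories.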

x=train['general_cat'].value_counts().index.values.astype('str')   # category names
y=train['general_cat'].value_counts().values                       # item count per category
pct = [("%.2f"%(v*100))+"%" for v in (y/len(train))]   # share of all items

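For x (the keys): .value_counts() returns each category together with its number of occurrences; .index holds the category labels; .values extracts them as an array, and .astype('str') converts them to strings.

For y (the values): call value_counts() and take .values to get the counts as an array.

Print x and y to inspect them.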
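
As a rough illustration of what x, y, and pct contain, the sketch below reproduces the same computation in plain Python. It uses collections.Counter in place of pandas value_counts(), and the labels are hypothetical stand-ins for train['general_cat'].

```python
from collections import Counter

# Hypothetical category labels standing in for train['general_cat']
labels = ["Women", "Men", "Women", "Kids", "Women", "Men"]

counts = Counter(labels).most_common()   # (label, count) pairs, most frequent first
x = [name for name, _ in counts]         # category names
y = [n for _, n in counts]               # item count per category
pct = ["%.2f" % (n / len(labels) * 100) + "%" for n in y]  # share of all items

print(x)
print(y)
print(pct)
```

Like value_counts(), most_common() orders the categories from most to least frequent, which is why the bars in the chart appear in descending order.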

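The output lists the categories in general_cat and the number of items in each. Next, draw the chart with go.Bar from the plotly package. The text argument sets the label shown on each bar; here it is the percentage.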

trace1 = go.Bar(x=x, y=y, text=pct)
layout = dict(title= 'Number of Items by Main Category',
              yaxis = dict(title='Count'),
              xaxis = dict(title='Category'))
fig=dict(data=[trace1], layout=layout)
py.iplot(fig)

The bar chart for subcat_1 is built the same way. Because subcat_1 has 114 distinct values, only the 20 most frequent are shown.

x=train['subcat_1'].value_counts().index.values.astype('str')[:20]        # top 20 subcategory names
y=train['subcat_1'].value_counts().values[:20]                             # matching counts, sliced to the same length as x
pct = [("%.2f"%(v*100))+"%" for v in (y/len(train))]

trace1 = go.Bar(x=x, y=y, text=pct,
                marker=dict(
                color = y,colorscale='Portland',showscale=True,
                reversescale = False
                )
               )
layout = dict(title= 'Number of Items by Subcategory 1 (Top 20)',
              yaxis = dict(title='Count'),
              xaxis = dict(title='Subcategory'))
fig=dict(data=[trace1], layout=layout)
py.iplot(fig)

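The only difference from the previous chart is the marker argument added to trace1, which colors each bar according to its count:

marker=dict(
    color = y,colorscale='Portland',showscale=True,
    reversescale = False
)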

Next, look at how price varies within each of the 11 general_cat categories. First, extract the 11 category names.

general_cats = train['general_cat'].unique()
print(general_cats)

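Then select the prices that belong to each category with .loc. The box plots are drawn with go.Box: x is the log-transformed price, log(price + 1), and name is one of the 11 entries of general_cats. A loop (written here as a list comprehension) passes each x and name pair in.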

general_cats = train['general_cat'].unique()
print(general_cats)
x = [train.loc[train.general_cat==cat, 'price'] for cat in general_cats]
data = [go.Box(x=np.log(x[i]+1),name=general_cats[i]) for i in range (len(general_cats))]
layout = dict(title="Price Distribution by General Category",
              yaxis = dict(title='Category'),
              xaxis = dict(title='Log(Price + 1)'))
fig = dict(data=data, layout=layout)
py.iplot(fig)
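
The box plot uses log(price + 1) rather than the raw price. A small sketch of why, using made-up prices and assuming some prices may be 0: the +1 keeps a price of 0 defined (the logarithm of 0 is not), and the logarithm compresses the long tail of expensive items so the boxes stay readable.

```python
import math

prices = [0, 3, 10, 100, 2000]   # hypothetical prices, including a free item

# Same transform as np.log(x + 1) in the plotting code, one value at a time
logged = [math.log(p + 1) for p in prices]

for p, v in zip(prices, logged):
    print(p, round(v, 3))
```

For arrays, np.log1p(x) computes the same quantity as np.log(x + 1).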

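Item descriptions vary widely in length. To explore how description length relates to price, we count the words in each description and compare the counts against price. First, re.compile builds the matching pattern; then .sub replaces every character that matches the pattern with a space. Finally, a list comprehension splits the text on spaces and keeps only the words that are not stop words and are longer than 3 characters.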

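def wordCount(text):
    try:
        text = text.lower()         # lowercase; the result must be assigned back

        # strip punctuation, digits, and whitespace control characters
        regex = re.compile('[' + re.escape(string.punctuation) + '0-9\\r\\t\\n]')
        txt = regex.sub(" ", text)

        # keep words that are not stop words and are longer than 3 characters
        words = [w for w in txt.split(" ")
                 if w not in stop_words.ENGLISH_STOP_WORDS and len(w) > 3]
        return len(words)
    except AttributeError:          # missing descriptions (NaN) have no .lower
        return 0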

train['dese_len'] = train['item_description'].apply(lambda x: wordCount(x))
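
The following self-contained sketch shows the counting logic on a single sentence. It is an illustration only: STOP_WORDS is a tiny hypothetical stand-in for the much larger stop-word list used above, and word_count is a renamed copy of the function so that it runs without any third-party imports.

```python
import re
import string

# Tiny hypothetical stop-word set; the article relies on a full English list
STOP_WORDS = {"this", "that", "with", "have", "from"}

def word_count(text):
    try:
        text = text.lower()
        # punctuation, digits, and \r \t \n are all replaced by a space
        regex = re.compile("[" + re.escape(string.punctuation) + "0-9\\r\\t\\n]")
        txt = regex.sub(" ", text)
        words = [w for w in txt.split(" ")
                 if w not in STOP_WORDS and len(w) > 3]
        return len(words)
    except AttributeError:  # non-string input such as NaN
        return 0

print(word_count("This jacket comes with 2 pockets, brand-new!"))  # counts jacket, comes, pockets, brand
print(word_count(float("nan")))
```

In the first call, "this" and "with" are dropped as stop words and "new" is dropped for being only 3 characters long, leaving 4 words.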

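The new dese_len column holds the number of meaningful words left in each description once special characters and stop words are removed. Many items share the same length, so we average their prices: the code groups the rows by 'dese_len' and takes the mean of ['price'] within each group. Finally, plot the result.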

df = train.groupby('dese_len')['price'].mean().reset_index()
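
For readers less familiar with groupby, this plain-Python sketch computes the same kind of per-length average. The (dese_len, price) pairs are invented for illustration, and the dictionary stands in for the two-column DataFrame that reset_index() returns.

```python
from collections import defaultdict

# Hypothetical (dese_len, price) pairs
items = [(3, 10.0), (3, 20.0), (5, 8.0), (5, 12.0), (5, 16.0), (9, 40.0)]

# Collect all prices that share the same description length
prices_by_len = defaultdict(list)
for length, price in items:
    prices_by_len[length].append(price)

# One average price per description length, ordered by length
mean_price = {length: sum(p) / len(p)
              for length, p in sorted(prices_by_len.items())}
print(mean_price)
```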

trace1 = go.Scatter(
    x = df['dese_len'],
    y = np.log(df['price']+1),
    mode = 'lines+markers',
    name = 'lines+markers')
layout = dict(title= 'Average Log(Price) by Description Length',
              yaxis = dict(title='Average Log(Price)'),
              xaxis = dict(title='Description Length'))
fig=dict(data=[trace1], layout=layout)