Price suggestion(EDA)上--简单数据挖掘

https://www.kaggle.com/thykhuely/mercari-interactive-eda-topic-modelling

该题为的目的在于通过商家给予的商品的信息,建立一个商品的估价模型。

所给数据的大致分析:

列含义的分析:
name:商品名

item_condition_id:卖方提供的物品的状况(不是很懂这个数据,可能是商品好坏状况)

category_name:每个商品有相应的三个标签

brand_name:品牌名

price:价格(即我们要预测的值)

shipping:1为运费为卖方支付,0为运费为买方支付

item_description:对物品的描述

 

1.首先对price我们的目标进行一个分析

train.price.describe()

可以清楚的看出数据的一些特性

 

然后观察price的分布情况

plt.subplot(1, 2, 1)
(train['price']).plot.hist(bins=50, figsize=(20,10), edgecolor='white',range=[0,250])
plt.xlabel('price+', fontsize=17)
plt.ylabel('frequency', fontsize=17)
plt.tick_params(labelsize=15)
plt.title('Price Distribution - Training Set', fontsize=17)

plt.subplot(1, 2, 2)
np.log(train['price']+1).plot.hist(bins=50, figsize=(20,10), edgecolor='white')
plt.xlabel('log(price+1)', fontsize=17)
plt.ylabel('frequency', fontsize=17)
plt.tick_params(labelsize=15)
plt.title('Log(Price) Distribution - Training Set', fontsize=17)
plt.show()

此处使用log(price+1)的方法使数据更加接近正态分布。也可以尝试Box-Cox.

 

2.对shipping运费进行分析

统计两种运费方式的分布情况

train.shipping.value_counts()/len(train)

 

对于两种运费方式进行price的分析

prc_shipBySeller = train.loc[train.shipping==1, 'price']
prc_shipByBuyer = train.loc[train.shipping==0, 'price']

fig, ax = plt.subplots(figsize=(20,10))
ax.hist(np.log(prc_shipBySeller+1), color='#8CB4E1', alpha=1.0, bins=50,
       label='Price when Seller pays Shipping')
ax.hist(np.log(prc_shipByBuyer+1), color='#007D00', alpha=0.7, bins=50,
       label='Price when Buyer pays Shipping')
ax.set(title='Histogram Comparison', ylabel='% of Dataset in Bin')
plt.xlabel('log(price+1)', fontsize=17)
plt.ylabel('frequency', fontsize=17)
plt.title('Price Distribution by Shipping Type', fontsize=17)
plt.tick_params(labelsize=15)
plt.show()

 

3.对item_categroy(标签)的分析

对所有标签种类的分析

print("There are %d unique values in the category column." % train['category_name'].nunique())

显示总数量前5的标签,以及标签的数量

# TOP 5 RAW CATEGORIES
train['category_name'].value_counts()[:5]

查看空标签数量

1 # missing categories
2 print("There are %d items that do not have a label." % train['category_name'].isnull().sum())

对标签的拆分

# reference: BuryBuryZymon at https://www.kaggle.com/maheshdadhich/i-will-sell-everything-for-free-0-55
def split_cat(text):
    try: return text.split("/")
    except: return ("No Label", "No Label", "No Label")

train['general_cat'], train['subcat_1'], train['subcat_2'] = \
zip(*train['category_name'].apply(lambda x: split_cat(x)))
train.head()

对每个小标签进行分析

print("There are %d unique first sub-categories." % train['subcat_1'].nunique())

print("There are %d unique second sub-categories." % train['subcat_2'].nunique())

 

对general_cat进行分析

x = train['general_cat'].value_counts().index.values.astype('str')
y = train['general_cat'].value_counts().values
pct = [("%.2f"%(v*100))+"%"for v in (y/len(train))]
trace1 = go.Bar(x=x, y=y, text=pct)
layout = dict(title= 'Number of Items by Main Category',
              yaxis = dict(title='Count'),
              xaxis = dict(title='Category'))
fig=dict(data=[trace1], layout=layout)
py.iplot(fig)

x = train['subcat_1'].value_counts().index.values.astype('str')[:15]
y = train['subcat_1'].value_counts().values[:15]
pct = [("%.2f"%(v*100))+"%"for v in (y/len(train))][:15]
trace1 = go.Bar(x=x, y=y, text=pct,
                marker=dict(
                color = y,colorscale='Portland',showscale=True,
                reversescale = False
                ))
layout = dict(title= 'Number of Items by Sub Category (Top 15)',
              yaxis = dict(title='Count'),
              xaxis = dict(title='SubCategory'))
fig=dict(data=[trace1], layout=layout)
py.iplot(fig)

使用箱型图对标签数据进行分析

general_cats = train['general_cat'].unique()
x = [train.loc[train['general_cat']==cat, 'price'] for cat in general_cats]

data = [go.Box(x=np.log(x[i]+1), name=general_cats[i]) for i in range(len(general_cats))]

layout = dict(title="Price Distribution by General Category",
              yaxis = dict(title='Frequency'),
              xaxis = dict(title='Category'))
fig = dict(data=data, layout=layout)
py.iplot(fig)

对brand_name(品牌名)的分析

print("There are %d unique brand names in the training dataset." % train['brand_name'].nunique())

x = train['brand_name'].value_counts().index.values.astype('str')[:10]
y = train['brand_name'].value_counts().values[:10]

# trace1 = go.Bar(x=x, y=y, 
#                 marker=dict(
#                 color = y,colorscale='Portland',showscale=True,
#                 reversescale = False
#                 ))
# layout = dict(title= 'Top 10 Brand by Number of Items',
#               yaxis = dict(title='Brand Name'),
#               xaxis = dict(title='Count'))
# fig=dict(data=[trace1], layout=layout)
# py.iplot(fig)

 

转载于:https://www.cnblogs.com/zhengzhe/p/8983730.html

CSDN海神之光上传的代码均可运行,亲测可用,直接替换数据即可,适合小白; 1、代码压缩包内容 主函数:main.m; 调用函数:其他m文件;无需运行 运行结果效果图; 2、代码运行版本 Matlab 2019b或2023b;若运行有误,根据提示修改;若不会,私信博主; 3、运行操作步骤 步骤一:将所有文件放到Matlab的当前文件夹中; 步骤二:双击打开main.m文件; 步骤三:点击运行,等程序运行完得到结果; 4、仿真咨询 如需其他服务,可私信博主或扫描博客文章底部QQ名片; 4.1 博客或资源的完整代码提供 4.2 期刊或参考文献复现 4.3 Matlab程序定制 4.4 科研合作 功率谱估计: 故障诊断分析: 雷达通信:雷达LFM、MIMO、成像、定位、干扰、检测、信号分析、脉冲压缩 滤波估计:SOC估计 目标定位:WSN定位、滤波跟踪、目标定位 生物电信号:肌电信号EMG、脑电信号EEG、心电信号ECG 通信系统:DOA估计、编码译码、变分模态分解、管道泄漏、滤波器、数字信号处理+传输+分析+去噪(CEEMDAN)、数字信号调制、误码率、信号估计、DTMF、信号检测识别融合、LEACH协议、信号检测、水声通信 1. EMD(经验模态分解,Empirical Mode Decomposition) 2. TVF-EMD(时变滤波的经验模态分解,Time-Varying Filtered Empirical Mode Decomposition) 3. EEMD(集成经验模态分解,Ensemble Empirical Mode Decomposition) 4. VMD(变分模态分解,Variational Mode Decomposition) 5. CEEMDAN(完全自适应噪声集合经验模态分解,Complementary Ensemble Empirical Mode Decomposition with Adaptive Noise) 6. LMD(局部均值分解,Local Mean Decomposition) 7. RLMD(鲁棒局部均值分解, Robust Local Mean Decomposition) 8. ITD(固有时间尺度分解,Intrinsic Time Decomposition) 9. SVMD(逐次变分模态分解,Sequential Variational Mode Decomposition) 10. ICEEMDAN(改进的完全自适应噪声集合经验模态分解,Improved Complementary Ensemble Empirical Mode Decomposition with Adaptive Noise) 11. FMD(特征模式分解,Feature Mode Decomposition) 12. REMD(鲁棒经验模态分解,Robust Empirical Mode Decomposition) 13. SGMD(辛几何模态分解,Spectral-Grouping-based Mode Decomposition) 14. RLMD(鲁棒局部均值分解,Robust Intrinsic Time Decomposition) 15. ESMD(极点对称模态分解, extreme-point symmetric mode decomposition) 16. CEEMD(互补集合经验模态分解,Complementary Ensemble Empirical Mode Decomposition) 17. SSA(奇异谱分析,Singular Spectrum Analysis) 18. SWD(群分解,Swarm Decomposition) 19. RPSEMD(再生相移正弦辅助经验模态分解,Regenerated Phase-shifted Sinusoids assisted Empirical Mode Decomposition) 20. EWT(经验小波变换,Empirical Wavelet Transform) 21. DWT(离散小波变换,Discraete wavelet transform) 22. TDD(时域分解,Time Domain Decomposition) 23. MODWT(最大重叠离散小波变换,Maximal Overlap Discrete Wavelet Transform) 24. MEMD(多元经验模态分解,Multivariate Empirical Mode Decomposition) 25. MVMD(多元变分模态分解,Multivariate Variational Mode Decomposition)
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值