Otto Product Classification (Part 1): Data Exploration & Feature Engineering

Latest recommended article published on 2024-07-11 16:43:10

fly_Xiaoma

Category column: ML-Demo

Article link: https://blog.csdn.net/weixin_38664232/article/details/86905669

Training Data Exploration

1. Import the packages

# First import the required modules
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sn
%matplotlib inline

2. Read the data

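# Read the data
# path to where the data lies
# (a relative path)
dpath='./data/'
train=pd.read_csv(dpath+'Otto_train.csv')
# Print the first 5 rows
print(train.head())

# Overall information about the data (info() prints its report itself)
train.info()

# Summary statistics of each attribute: mean std min 25% 50% 75% max
print(train.describe())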

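3. Label distribution

Check whether the classes are balanced. For classification tasks, cross-validation in scikit-learn uses StratifiedKFold by default, which keeps each class's proportion in every fold.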

# Target distribution: check whether the classes are balanced
sn.countplot(x='target',data=train)
plt.xlabel('target')
plt.ylabel('Number of occurrences')
plt.show()
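
To make the stratified sampling concrete, here is a minimal sketch on made-up labels. `y_demo` and `X_demo` are hypothetical stand-ins, not the Otto data; the point is only that every validation fold keeps roughly the same class ratio as the full set.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical imbalanced labels: 80 samples of Class_A, 20 of Class_B
y_demo = np.array(['Class_A'] * 80 + ['Class_B'] * 20)
X_demo = np.zeros((len(y_demo), 1))  # the feature values do not affect the split

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_ratios = []
for train_idx, valid_idx in skf.split(X_demo, y_demo):
    # Fraction of Class_B in the validation part of this fold
    fold_ratios.append(np.mean(y_demo[valid_idx] == 'Class_B'))

print(fold_ratios)
```

Each fold should show a Class_B share close to the overall 20%, which is why an unbalanced target is less of a problem for validation than it might first appear.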

3.1 Distribution of each feature

sn.countplot(x='feat_1',data=train)
plt.xlabel('feat_1')
plt.ylabel('Number of occurrences')
plt.show()

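Most feature values are 0 (sparse) and the distribution is long-tailed, so a log(x+1) transform is worth considering to reduce the influence of the large values in the tail.

The features are sparse (90% of the values are 0).

The features appear to be counts, so TF-IDF is another option for feature engineering.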
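
Both observations can be checked numerically rather than by eye. The sketch below uses a small made-up frame (`demo`, with hypothetical columns `feat_a` and `feat_b`) instead of the real training data: it measures the fraction of zeros per column and compares the skewness before and after log(x+1).

```python
import numpy as np
import pandas as pd

# Hypothetical count features standing in for feat_1 ... feat_93
demo = pd.DataFrame({
    'feat_a': [0, 0, 0, 0, 0, 0, 0, 1, 2, 30],
    'feat_b': [0, 0, 0, 0, 0, 0, 0, 0, 1, 12],
})

# Fraction of zero entries in each column (a measure of sparsity)
zero_fraction = (demo == 0).mean()

# Skewness before and after log(x+1); a smaller value means a lighter tail
skew_raw = demo.skew()
skew_log = np.log1p(demo).skew()

print(zero_fraction)
print(skew_raw)
print(skew_log)
```

On the real data the same three lines could be applied to `train.drop(['id','target'],axis=1)`.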

4. Correlation coefficients between features

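# Calculates the Pearson coefficient for all column pairs (absolute value);
# a correlation coefficient above 0.5 is usually considered strong.
# The string-valued target column is excluded because corr() needs numeric data.
feat_corr=train.drop(['target'],axis=1).corr().abs()

# get the names of the columns in the correlation matrix
cols=feat_corr.columns

plt.subplots(figsize=(13,9))
sn.heatmap(feat_corr,annot=True)

# Mask unimportant features: hide every cell whose correlation is below 1
sn.heatmap(feat_corr,mask=feat_corr<1,cbar=False)
plt.show()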

# set the threshold to select only highly correlated attributes
threshold=0.5
# List of pairs along with correlation above threshold
corr_list=[]
# size=data.shape[1]
size=feat_corr.shape[0]

# search for the highly correlated pairs
for i in range(0,size): # for 'size' features
    for j in range(i+1,size): # avoid repetition
        # feat_corr holds absolute values, so a single range check is enough
        if threshold <= feat_corr.iloc[i,j] < 1:
            # store correlation and column indices
            corr_list.append([feat_corr.iloc[i,j],i,j])

# sort to show higher ones first
s_corr_list=sorted(corr_list,key=lambda x:-abs(x[0]))

# print correlations and column names
for v,i,j in s_corr_list:
    print('%s and %s = %.2f' %(cols[i],cols[j],v))
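
The nested loop is easy to follow but slow on wide tables. As an alternative sketch (not the approach used above), the same pairs can be pulled out of the upper triangle of the correlation matrix in a vectorized way. The data here is synthetic: `feat_b` is deliberately built from `feat_a` and `feat_c` is independent, and all names are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical data: feat_b is built from feat_a, feat_c is independent noise
rng = np.random.default_rng(0)
a = rng.poisson(2.0, size=500)
demo = pd.DataFrame({
    'feat_a': a,
    'feat_b': a + rng.poisson(0.5, size=500),
    'feat_c': rng.poisson(2.0, size=500),
})

corr = demo.corr().abs()
threshold = 0.5

# Keep only the strict upper triangle so that each pair appears once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack().dropna()  # Series indexed by (column_i, column_j)
strong = pairs[pairs >= threshold].sort_values(ascending=False)

for (col_i, col_j), v in strong.items():
    print('%s and %s = %.2f' % (col_i, col_j, v))
```

Only the `feat_a`/`feat_b` pair should be reported, since it is the only pair constructed to be correlated.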

The correlations between features are moderate; consider adding a regularization term.

Training Data Feature Engineering

Feature transformation is laborious, hands-on work.

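Take the logarithm with log1p (important for linear models; as a monotonic transformation it has little effect on tree models)
TF-IDF
Combinations of the original features (add, subtract, multiply, divide. For count features, multiplication expresses 'and' and is more meaningful, as in FM; alternatively, GBDT can be used for feature encoding to obtain higher-order combinations; if the original feature dimensionality is too high, first use a base model to get feature importances, then combine only the important features)
Features after t-SNE or PCA dimensionality reduction (covered in the dimensionality reduction part)
Statistical features such as sum of the row, number of non-zero, max of the row, x-mean; I personally feel these mean little for this dataset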
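
To illustrate the "multiplication means 'and'" idea from the list above, here is a small sketch that builds pairwise products for a chosen subset of columns. The frame `demo` and the list `important` are hypothetical; in practice the subset would come from a base model's feature importances.

```python
from itertools import combinations
import pandas as pd

# Hypothetical count features
demo = pd.DataFrame({
    'feat_a': [0, 2, 3],
    'feat_b': [1, 0, 4],
    'feat_c': [5, 1, 0],
})

# Assumption: a base model ranked these two columns as the most important
important = ['feat_a', 'feat_b']

products = {}
for left, right in combinations(important, 2):
    # The product is non-zero only when both counts are non-zero ('and')
    products[left + '_x_' + right] = demo[left] * demo[right]

demo_combined = pd.concat([demo, pd.DataFrame(products)], axis=1)
print(demo_combined)
```

Restricting the combinations to a short list keeps the number of new columns manageable; all pairs of 93 features would already add more than four thousand columns.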

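1. Separate features and labels

# Labels
y_train=train['target'] # values have the form Class_x

# Keep the id for later; it is not actually useful as a feature
train_id=train['id']
# drop ids and targets
X_train=train.drop(['id','target'],axis=1)

# Save the feature names
columns_org=X_train.columns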

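1.1 Feature encoding: log(x+1)

The original features feat_x look like counts. Taking the logarithm is closer to how people perceive magnitudes and suits linear models better. It also reduces the influence of large values in a long-tailed distribution and makes the tail less heavy.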

X_train_log=np.log1p(X_train)

# Rebuild the DataFrame with suffixed column names
feat_names=X_train.columns+"_log"
X_train_log=pd.DataFrame(columns=feat_names,data=X_train_log.values)

print(X_train_log.head())
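
A quick sketch on a few hand-picked counts shows why log(x+1) suits sparse count data: zeros stay zero, the ordering is preserved, and large values are compressed. The array `counts` is made up for illustration.

```python
import numpy as np

counts = np.array([0, 1, 9, 99, 999])  # hypothetical count values
logged = np.log1p(counts)              # log(1 + x), well defined at x = 0
restored = np.expm1(logged)            # inverse transform, exp(y) - 1

print(logged)
print(restored)
```

The gap between 99 and 999 shrinks to roughly the same size as the gap between 9 and 99, which is the compression of the tail that helps linear models.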

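1.2 Feature encoding: TF-IDF

The original features feat_x look like counts, similar to term-frequency features in text analysis. TF-IDF can highlight low-frequency terms that contribute to particular classes.

Here the original features are already counts, so TfidfTransformer can be called directly to turn them into TF-IDF values.

If the input were raw text, the counting step (TF) and the IDF step would need to be combined, using TfidfVectorizer.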

# transform counts to TFIDF features
from sklearn.feature_extraction.text import TfidfTransformer
tfidf=TfidfTransformer()

# fit_transform returns a sparse matrix; toarray() converts it to a dense array
X_train_tfidf=tfidf.fit_transform(X_train).toarray()

# Rebuild the DataFrame, for easier inspection
feat_names=X_train.columns+'_tfidf'
X_train_tfidf=pd.DataFrame(columns=feat_names,data=X_train_tfidf)

print(X_train_tfidf.head())
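
The following sketch shows what TfidfTransformer does on a tiny made-up count matrix, and how the fitted transformer would later be applied to unseen rows. `train_counts` and `test_counts` are hypothetical; the transformer is used with its default settings (smoothed IDF, L2 row normalization).

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer

# Hypothetical counts: column 0 is non-zero in every row, columns 1 and 2 are rare
train_counts = np.array([[3, 0, 1],
                         [2, 0, 0],
                         [3, 0, 0],
                         [4, 0, 0],
                         [3, 2, 0],
                         [3, 0, 2]])
test_counts = np.array([[1, 1, 0]])

tfidf_demo = TfidfTransformer()
train_tfidf = tfidf_demo.fit_transform(train_counts).toarray()  # learn IDF on training rows
test_tfidf = tfidf_demo.transform(test_counts).toarray()        # reuse the learned IDF

print(tfidf_demo.idf_)
print(test_tfidf)
```

The rarer a column is across the training rows, the larger its IDF weight, so in the test row the rare column ends up with a higher value than the common one even though both raw counts are 1.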

Other feature engineering

Add each row's maximum, sum, and number of non-zero elements to the original features.

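# Kept commented out; compute every statistic first so later ones do not include the new columns
#row_max=X_train.max(axis=1)
#row_sum=X_train.sum(axis=1)
#row_zero_count=(X_train==0).sum(axis=1) # zeros per row; non-zero count = number of features minus this
#X_train['feat_max']=row_max
#X_train['feat_sum']=row_sum
#X_train['feat_zero_count']=row_zero_count
#print(X_train.head())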
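
For reference, a runnable sketch of these row statistics on a made-up frame is shown below. The column list is frozen before any new column is added, so the sum and the non-zero count are computed from the original features only. `demo` and the new column names are hypothetical.

```python
import pandas as pd

# Hypothetical count features
demo = pd.DataFrame({
    'feat_1': [0, 2, 0],
    'feat_2': [1, 0, 0],
    'feat_3': [5, 3, 0],
})

feat_cols = list(demo.columns)  # freeze the original columns first

row_stats = pd.DataFrame({
    'feat_max': demo[feat_cols].max(axis=1),
    'feat_sum': demo[feat_cols].sum(axis=1),
    'feat_nonzero_count': (demo[feat_cols] != 0).sum(axis=1),
})
demo_with_stats = pd.concat([demo, row_stats], axis=1)
print(demo_with_stats)
```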

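1. Data preprocessing

Because the data is extremely sparse, scale it with MinMaxScaler so that the transformed data stays sparse (as long as each feature's minimum is 0, which is typical for count features, zeros remain zeros).

If the features are treated like term frequencies, scaling can also be skipped.

Another option is to normalize each sample by its vector norm.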

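# Scale the original data
from sklearn.preprocessing import MinMaxScaler
# Build the scaler for the input features
ms_org=MinMaxScaler()

# Save the feature names, used when writing the result to csv
feat_names_org=X_train.columns

# Fit the scaler: fit
# and scale the features: transform
X_train=ms_org.fit_transform(X_train)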



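# Scale the log data with its own scaler, so ms_org keeps the fit for the original features
ms_log=MinMaxScaler()
X_train_log=ms_log.fit_transform(X_train_log)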



# Scale the tf-idf data, again with a separate scaler
ms_tfidf=MinMaxScaler()
X_train_tfidf=ms_tfidf.fit_transform(X_train_tfidf)
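
Since the fitted scalers are needed again when the test data is encoded, it helps to see the fit, save, reload, and transform cycle in one place. This is only a sketch on made-up arrays (`train_demo`, `test_demo`), and using `pickle` for persistence is an assumption here, not something the steps above prescribe.

```python
import pickle
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical train/test count matrices
train_demo = np.array([[0.0, 2.0], [4.0, 0.0], [8.0, 10.0]])
test_demo = np.array([[2.0, 5.0]])

scaler = MinMaxScaler()
train_scaled = scaler.fit_transform(train_demo)  # learn min/max on the training data only

# Serialize the fitted scaler; writing these bytes to a file would persist it
blob = pickle.dumps(scaler)
restored_scaler = pickle.loads(blob)

# The restored scaler applies the training min/max to the test data
test_scaled = restored_scaler.transform(test_demo)
print(train_scaled)
print(test_scaled)
```

Because every column's training minimum is 0, the zero entries of `train_demo` are still zero after scaling, which is the sparsity-preserving property mentioned above.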



# Save the original features
y=pd.Series(data=y_train,name='target')
train_org=pd.concat([train_id,pd.DataFrame(columns=feat_names_org,data=X_train),y],axis=1)
train_org.to_csv('Otto_FE_train_org.csv',index=False,header=True)



# Save the log-transformed features
y=pd.Series(data=y_train,name='target')
feat_names_log=feat_names_org+'_log'
train_log=pd.concat([train_id,pd.DataFrame(columns=feat_names_log,data=X_train_log),y],axis=1)
train_log.to_csv('Otto_FE_train_log.csv',index=False,header=True)



# Save the tf-idf-transformed features
y=pd.Series(data=y_train,name='target')
feat_names_tfidf=feat_names_org+'_tfidf'
train_tfidf=pd.concat([train_id,pd.DataFrame(columns=feat_names_tfidf,data=X_train_tfidf),y],axis=1)
train_tfidf.to_csv('Otto_FE_train_tfidf.csv',index=False,header=True)