Kaggle Competition in Practice, Part 2

爱学习的uu

Tags: artificial intelligence

Article link: https://blog.csdn.net/m0_60792028/article/details/139203842

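Continuing from the previous post, this one preprocesses the merchant and transaction datasets: handling missing values and inf values, and converting object-typed columns to integer codes (sorted-order label encoding). The full code follows: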

# In[5]:

import os
import numpy as np
import pandas as pd

# In[6]:

pd.read_excel('d:/Data_Dictionary.xlsx',header=2,sheet_name='train')  # read the data dictionary; header=2 skips the first two (empty) rows; take a first look at the data

# In[7]:

import gc  # garbage collector, used for memory management

# In[8]:

train=pd.read_csv('d:/train.csv')

# In[9]:

test=pd.read_csv('d:/test.csv')

# In[10]:

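# Data quality analysis: check whether the training and test sets come from the same population.
# This decides whether to rely on feature engineering or on tricks; if the distributions differ,
# a model fitted on the training set overfits easily.
# First check that card_id is unique within the dataset
train['card_id'].nunique()==train.shape[0]  # nunique counts the distinct ids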

# In[11]:

test['card_id'].nunique()==test.shape[0]

# In[12]:

train['card_id'].nunique()+test['card_id'].nunique()==len(set(train['card_id']).union(set(test['card_id'])))  # True means train and test share no card_id
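
The three checks above can be tried on toy data. This is a minimal sketch, assuming two small hypothetical frames `toy_train` and `toy_test` that are not taken from the competition files. It uses a set intersection, which states the "no shared id" condition more directly than comparing counts.

```python
import pandas as pd

# Hypothetical toy frames standing in for train/test
toy_train = pd.DataFrame({"card_id": ["C1", "C2", "C3"]})
toy_test = pd.DataFrame({"card_id": ["C4", "C5"]})

# card_id is unique within each frame
train_unique = toy_train["card_id"].is_unique
test_unique = toy_test["card_id"].is_unique

# ids that appear in both frames (empty set means no overlap)
overlap = set(toy_train["card_id"]) & set(toy_test["card_id"])
print(train_unique, test_unique, len(overlap))
```

If `overlap` is not empty, printing it shows exactly which ids are shared, which the count comparison cannot do.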

# In[13]:

train.isnull().sum()  # missing values per column

# In[14]:

test.isnull().sum()

# In[15]:

statistics=train['target'].describe()  # summary statistics, to look for outliers

# In[16]:

statistics

# In[17]:

# Inspect a continuous variable with a histogram
import seaborn as sns

# In[18]:

import matplotlib.pyplot as plt

# In[19]:

sns.set()

# In[20]:

sns.histplot(train['target'])  # histogram of target, to look for outliers

# In[21]:

# Count the outliers; they may mark a special group of users, so they should not simply be dropped
(train['target']<-30).sum()

# In[22]:

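# Outliers can also be identified with the three-sigma rule (three standard deviations from the mean); this is the lower bound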
statistics.loc['mean']-3*statistics.loc['std']
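
The cell above computes only the lower bound. The sketch below applies both bounds of the three-sigma rule to synthetic data; `toy_target` is hypothetical and only loosely imitates a target with a few extreme negative values.

```python
import numpy as np
import pandas as pd

# Synthetic target: 1000 standard-normal values plus three extreme values
rng = np.random.default_rng(0)
toy_target = pd.Series(
    np.concatenate([rng.normal(0, 1, 1000), [-33.0, -33.0, -33.0]])
)

stats = toy_target.describe()
lower = stats.loc["mean"] - 3 * stats.loc["std"]
upper = stats.loc["mean"] + 3 * stats.loc["std"]

# Flag values outside [lower, upper]
is_outlier = (toy_target < lower) | (toy_target > upper)
print(lower, upper, is_outlier.sum())
```

Note that the extreme values themselves inflate the standard deviation, so the rule is best treated as a rough screen.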

# In[23]:

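# Consistency analysis: do the two sets follow the same distribution?
# Start with single variables: compare the share of samples at each value in train and test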
features=['first_active_month','feature_1','feature_2','feature_3']
train_count=train.shape[0]
test_count=test.shape[0]

# In[24]:

for feature in features:
    (train[feature].value_counts().sort_index()/train_count).plot()
    (test[feature].value_counts().sort_index()/test_count).plot()
    plt.legend(['train','test'])  # legend
    plt.xlabel(feature)
    plt.ylabel('ratio')
    plt.show()
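
The same comparison can be made numerically instead of by eye. A minimal sketch, assuming two hypothetical toy frames; `value_counts(normalize=True)` returns the ratios directly, so no division by the row count is needed.

```python
import pandas as pd

# Hypothetical toy frames with one discrete feature
toy_train = pd.DataFrame({"feature_1": [1, 1, 2, 3, 3, 3]})
toy_test = pd.DataFrame({"feature_1": [1, 2, 2, 3, 3, 3]})

# Side-by-side share of each value in train and test
ratios = pd.concat(
    [
        toy_train["feature_1"].value_counts(normalize=True).sort_index(),
        toy_test["feature_1"].value_counts(normalize=True).sort_index(),
    ],
    axis=1,
    keys=["train", "test"],
).fillna(0)  # a value missing from one set has share 0 there

ratios["abs_diff"] = (ratios["train"] - ratios["test"]).abs()
print(ratios)
```

A large `abs_diff` for some value points to a feature whose distribution differs between the two sets.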

# In[25]:

merchant=pd.read_csv('d:/merchants.csv',header=0)  # now look at the merchant table

# In[26]:

print(merchant.shape,merchant['merchant_id'].nunique())  # check whether any merchant has more than one record; some do

# In[27]:

merchant.isnull().sum()  # missing values are few; it looks like 13 merchants lack values in these three columns

# In[28]:

# Start preprocessing: first label the discrete and the continuous columns
category_cols=['merchant_id','merchant_group_id','merchant_category_id','subsector_id','category_1','most_recent_sales_range',
               'most_recent_purchases_range','category_4','city_id','state_id','category_2']

# In[29]:

numeric_cols=['numerical_1','numerical_2','avg_sales_lag3','avg_purchases_lag3','active_months_lag3',
'avg_sales_lag6','avg_purchases_lag6','active_months_lag6','avg_sales_lag12','avg_purchases_lag12','active_months_lag12']

# In[30]:

assert len(category_cols)+len(numeric_cols)==merchant.shape[1]  # check that every column has been labelled
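
The assert only compares counts, so it does not say which column was missed. A small sketch with set differences, using a hypothetical toy frame and hypothetical lists `toy_category_cols` and `toy_numeric_cols`:

```python
import pandas as pd

# Hypothetical toy frame with four columns
toy = pd.DataFrame(
    {"merchant_id": ["M1"], "city_id": [3], "numerical_1": [0.5], "numerical_2": [0.1]}
)
toy_category_cols = ["merchant_id", "city_id"]
toy_numeric_cols = ["numerical_1"]  # numerical_2 deliberately left out

labelled = set(toy_category_cols) | set(toy_numeric_cols)
missing = sorted(set(toy.columns) - labelled)   # in the frame but not labelled
unknown = sorted(labelled - set(toy.columns))   # labelled but not in the frame
print(missing, unknown)
```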

# In[31]:

merchant[category_cols].dtypes  # object-typed columns need to be handled later

# In[32]:

# Fill missing values with -1
merchant['category_2']=merchant['category_2'].fillna(-1)

# In[33]:

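# Variables fall into three kinds: continuous, nominal (discrete, with no order among the values) and ordinal.
# The object-typed discrete columns are mapped to integer codes here (sorted-order encoding, not one-hot).
def change_object_cols(se):
    value=se.unique().tolist()  # collect the distinct values into a list
    value.sort()  # sort them
    # build a Series whose index is the sorted values and whose data is 0..n-1, then look each value up in it
    return se.map(pd.Series(range(len(value)),index=value)).values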

# In[34]:

change_object_cols(merchant['category_1'])  # encode category_1 as integer codes
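
To make clear what `change_object_cols` returns, the sketch below re-implements the same idea under the hypothetical name `encode_sorted` and compares it with `pd.get_dummies`, which is what one-hot encoding in pandas looks like. The toy Series is made up.

```python
import pandas as pd

def encode_sorted(se):
    # Same idea as change_object_cols: sort the distinct values,
    # then map each value to its position in the sorted list
    values = sorted(se.unique().tolist())
    return se.map(pd.Series(range(len(values)), index=values)).values

toy = pd.Series(["N", "Y", "N", "Y", "N"])

codes = encode_sorted(toy)      # a single integer column
one_hot = pd.get_dummies(toy)   # one indicator column per distinct value

print(codes)
print(one_hot.shape)
```

The integer codes keep the table narrow, which suits tree models; one-hot columns avoid implying an order among nominal values.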

# In[35]:

for col in ['category_1','most_recent_sales_range','most_recent_purchases_range','category_4']:  # convert every object-typed column
    merchant[col]=change_object_cols(merchant[col])  # assign the result back, otherwise the column stays unchanged

# In[36]:

merchant[numeric_cols].dtypes  # now the continuous variables

# In[37]:

merchant[numeric_cols].describe()  # shows that some values are infinite

# In[38]:

inf_cols=['avg_purchases_lag3','avg_purchases_lag6','avg_purchases_lag12']  # replace the infinite values in these columns with the maximum

# In[39]:

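finite_max=merchant[inf_cols].replace(np.inf,np.nan).max().max()  # largest finite value in these columns (the plain max would itself be inf)
merchant[inf_cols]=merchant[inf_cols].replace(np.inf,finite_max)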
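
Replacing inf with "the maximum" needs some care, because the maximum of a column that contains inf is inf itself. A minimal sketch on a hypothetical toy frame, masking inf as NaN before taking the maximum; using one global maximum across the columns is an assumption here, and a per-column maximum would work the same way.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with infinite values
toy = pd.DataFrame(
    {
        "avg_purchases_lag3": [1.0, np.inf, 2.5],
        "avg_purchases_lag6": [0.5, 4.0, np.inf],
    }
)

# Mask inf as NaN first; max() skips NaN by default
finite_max = toy.replace(np.inf, np.nan).max().max()
cleaned = toy.replace(np.inf, finite_max)

print(finite_max)
print(cleaned)
```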

# In[40]:

merchant[numeric_cols].describe()

# In[41]:

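# Missing values are few, so fill them with the column mean
for col in numeric_cols:
    merchant[col]=merchant[col].fillna(merchant[col].mean())  # call mean(); passing the bare method does not fill with a number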
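
The loop can also be written as a single call, since `DataFrame.fillna` accepts a Series and matches it to the columns by name. A small sketch on a hypothetical toy frame:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing value per column
toy = pd.DataFrame(
    {
        "avg_sales_lag3": [1.0, np.nan, 3.0],
        "avg_sales_lag6": [2.0, 4.0, np.nan],
    }
)

# toy.mean() is a Series indexed by column name,
# so each column is filled with its own mean
filled = toy.fillna(toy.mean())
print(filled)
```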

# In[42]:

history_transaction=pd.read_csv('d:/historical_transactions.csv',nrows=1000000)  # the file is too large, so read the first million rows as a sample

# In[43]:

history_transaction.head(5)

# In[44]:

new_transaction=pd.read_csv('d:/new_merchant_transactions.csv',nrows=1000000)  # the most recent transactions

# In[45]:

# Everything is merged into one wide table later, so first find the columns the two tables share
duplicate_cols=[]
for col in merchant.columns:
    if col in new_transaction.columns:
        duplicate_cols.append(col)
print(duplicate_cols)
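
The loop above can be shortened. A sketch with two hypothetical empty frames that carry only column names, showing a list comprehension and `Index.intersection`:

```python
import pandas as pd

# Hypothetical frames; only the column names matter here
toy_merchant = pd.DataFrame(columns=["merchant_id", "city_id", "numerical_1"])
toy_transaction = pd.DataFrame(
    columns=["card_id", "city_id", "merchant_id", "purchase_amount"]
)

# Keeps the merchant table's column order, like the loop
shared = [c for c in toy_merchant.columns if c in toy_transaction.columns]

# Index.intersection returns the same set of names
shared_idx = toy_merchant.columns.intersection(toy_transaction.columns)
print(shared, list(shared_idx))
```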

# In[48]:

new_transaction[duplicate_cols].drop_duplicates().shape  # take the columns shared with the merchant table and drop duplicate rows, so that a later join does not produce repeated rows

# In[49]:

new_transaction['merchant_id'].nunique()  # number of distinct merchant ids, for comparison with the row count above
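
Why duplicates matter for the join: when a key appears more than once on the right side, a left merge repeats the matching left rows. A minimal sketch with hypothetical toy tables:

```python
import pandas as pd

toy_transaction = pd.DataFrame(
    {"card_id": ["C1", "C2"], "merchant_id": ["M1", "M2"]}
)
# M1 appears twice in the toy merchant table
toy_merchant = pd.DataFrame(
    {"merchant_id": ["M1", "M1", "M2"], "numerical_1": [0.1, 0.1, 0.7]}
)

# Joining as-is duplicates the C1 transaction
naive = toy_transaction.merge(toy_merchant, on="merchant_id", how="left")

# Keeping one row per merchant_id preserves the transaction count
deduped = toy_transaction.merge(
    toy_merchant.drop_duplicates(subset="merchant_id"),
    on="merchant_id",
    how="left",
)
print(len(naive), len(deduped))
```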

# In[52]:

# As before, label the discrete and continuous columns and handle missing values
new_transaction.head(5)

# In[55]:

numeric_cols=['installments','month_lag','purchase_amount']

# In[58]:

category_cols=['authorized_flag','card_id','city_id','category_1','category_3','merchant_category_id','merchant_id','category_2','state_id','subsector_id']

# In[61]:

time_cols=['purchase_date']  # time-series feature

# In[60]:

new_transaction[category_cols].isnull().sum()

# In[63]:

new_transaction[category_cols].dtypes

# In[85]:

for col in ['authorized_flag','category_1']:  # convert types; category_3 fails here, most likely because it contains NaN, which cannot be sorted together with strings
    new_transaction[col]=change_object_cols(new_transaction[col])
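
One likely reason `category_3` cannot be converted, assuming it holds string labels mixed with missing values: sorting a list that contains both `str` and a float NaN raises `TypeError`. The sketch below reproduces this on a hypothetical toy Series and shows that filling the missing values first makes the encoding work; the placeholder label `"missing"` is an arbitrary choice.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column: string labels plus one missing value
toy = pd.Series(["A", "B", np.nan, "C", "A"], dtype=object)

# Sorting strings together with a float NaN fails
try:
    sorted(toy.unique().tolist())
    sort_failed = False
except TypeError:
    sort_failed = True

# Fill the missing value with a string label, then encode as before
filled = toy.fillna("missing")
values = sorted(filled.unique().tolist())
codes = filled.map(pd.Series(range(len(values)), index=values)).values

print(sort_failed, values, codes)
```

If the missing values should end up as -1 rather than as the largest code, they can be remapped after encoding.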

# In[75]:

for col in ['authorized_flag','category_1']:
    new_transaction[col]=new_transaction[col].fillna(-1)

# In[76]:

new_transaction[category_cols]=new_transaction[category_cols].fillna(-1)
