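Handling imbalanced data, using the Kaggle credit card fraud case as an example
A simple classification problem
I. Data preprocessing
1. Handling missing values (to be expanded when I run into a real case; there is a plot that shows the missing values of the whole dataset at a glance, which I will add once I find it again)
(1) Fill with the mean or by linear interpolation (2) Delete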
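Until I dig up that plot, the two options above can be sketched with plain pandas. This is only a minimal sketch: the frame `toy` and its values are made up for illustration, and the overview is a table rather than a chart.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real training set (hypothetical data).
toy = pd.DataFrame({
    "Amount": [10.0, np.nan, 30.0, 40.0],
    "V1": [0.5, 0.1, np.nan, -0.3],
})

# Overview: number and share of missing values per column.
missing = pd.DataFrame({
    "n_missing": toy.isnull().sum(),
    "pct_missing": toy.isnull().mean() * 100,
})
print(missing)

# (1a) Replace missing values with the column mean.
filled_mean = toy.fillna(toy.mean())

# (1b) Replace missing values by linear interpolation between neighbours.
filled_linear = toy.interpolate(method="linear")

# (2) Drop every row that contains a missing value.
dropped = toy.dropna()
```

Which option fits depends on the data: interpolation only makes sense when the row order is meaningful (a time series, for example), and deleting rows is safest when few rows are affected.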
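2. Handling class imbalance in the training data
Here is a plotting template so I don't have to look it up every time (a chart is simply easier to read).
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Check the share of each class; this can also be shown as a chart
# The classes are heavily skewed; we need to solve this issue later.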
print('No Frauds', round(df['Class'].value_counts()[0]/len(df) * 100,2), '% of the dataset')
print('Frauds', round(df['Class'].value_counts()[1]/len(df) * 100,2), '% of the dataset')
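# Bar chart (binary here; multi-class works the same way. It counts rows per class, so the template can be reused as-is)
colors = ["#0101DF", "#DF0101"]
sns.countplot(x='Class', data=df, palette=colors)
plt.title('Class Distributions \n (0: No Fraud || 1: Fraud)', fontsize=14)
plt.show()
# Pie chart (labels come from the value_counts index, so they always match the slices)
df["Class"].value_counts().plot.pie(autopct='%.2f%%', fontsize=20, figsize=(6, 6))
# df is the name I gave the training set; change it if yours is different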
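Undersampling or oversampling (covered separately later, at the point where the training data is selected)
3. Distribution of each feature
Another template. Looking at the distributions is a way to examine the relevant features. In this problem the features do not call for exploratory analysis, which makes things much simpler.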
fig, ax = plt.subplots(1, 2, figsize=(18,4))
amount_val = df['Amount'].values
time_val = df['Time'].values
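# sns.distplot is deprecated in recent seaborn; histplot with kde=True gives a comparable plot
sns.histplot(amount_val, kde=True, stat='density', ax=ax[0], color='r')
ax[0].set_title('Distribution of Transaction Amount', fontsize=14)
ax[0].set_xlim([min(amount_val), max(amount_val)])
sns.histplot(time_val, kde=True, stat='density', ax=ax[1], color='b')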
ax[1].set_title('Distribution of Transaction Time', fontsize=14)
ax[1].set_xlim([min(time_val), max(time_val)])
plt.show()
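4. Normalizing the data:
When the features we need differ greatly in magnitude, they should be rescaled so that typical values end up in a small range around zero (roughly (-1, 1)). This keeps the model from "favoring" features with larger magnitudes.
# Two options: StandardScaler (standardization) or RobustScaler (median/IQR scaling); pick either. I keep forgetting the full class names, so they are written out here.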
from sklearn.preprocessing import StandardScaler, RobustScaler
# RobustScaler is less prone to outliers.
std_scaler = StandardScaler()
rob_scaler = RobustScaler()
df['scaled_amount'] = rob_scaler.fit_transform(df['Amount'].values.reshape(-1,1))
df['scaled_time'] = rob_scaler.fit_transform(df['Time'].values.reshape(-1,1))
df.drop(['Time','Amount'], axis=1, inplace=True)
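To see why RobustScaler is described as less prone to outliers, here is a small sketch on made-up amounts (the array `amounts` is hypothetical). It reproduces RobustScaler by hand as (x - median) / IQR and compares it with StandardScaler. Note that neither scaler clips values, so an extreme outlier still lands far outside (-1, 1).

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

# Hypothetical transaction amounts with one extreme outlier.
amounts = np.array([1.0, 2.0, 3.0, 4.0, 1000.0]).reshape(-1, 1)

# RobustScaler: (x - median) / IQR, computed per column.
median = np.median(amounts)
q25, q75 = np.percentile(amounts, [25, 75])
manual_robust = (amounts - median) / (q75 - q25)
robust = RobustScaler().fit_transform(amounts)

# StandardScaler: (x - mean) / std, so the outlier shifts both mean and std.
standard = StandardScaler().fit_transform(amounts)

print(robust.ravel())
print(standard.ravel())
```

With the robust version the four ordinary values stay nicely spread around zero, while with standardization they are squeezed together because the outlier inflates the standard deviation.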
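5. Feature correlation analysis.
With this many features we can start with a correlation analysis. Features that stand out as uncorrelated can be dropped, or we can look into the reason and handle them accordingly. When several features are highly correlated with each other, we can keep just one of them or fuse them. Feature fusion and constructing new features will be summarized separately later.
A heatmap is the tool for this.
f,ax = plt.subplots(figsize=(15,15))
# Input: a DataFrame (df is the training set here)
coor = df.corr()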
ax = sns.heatmap(coor, annot=True, cmap = 'viridis', linewidths = .1, linecolor = 'grey', fmt=".2f")
ax.set_title("Correlation")
plt.show()
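Keeping only one of several highly correlated features can also be done programmatically instead of by eye. The following is a minimal sketch on synthetic data; the frame `toy`, the column names and the 0.95 threshold are all assumptions for illustration, not values taken from the credit card dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "a_copy": a * 2.0 + rng.normal(scale=0.01, size=200),  # near-duplicate of "a"
    "b": rng.normal(size=200),
})

THRESHOLD = 0.95  # hypothetical cut-off for "highly correlated"

corr = toy.corr().abs()
# Keep only the upper triangle so each pair is checked once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
# A column is dropped if it is highly correlated with any earlier column.
to_drop = [col for col in upper.columns if (upper[col] > THRESHOLD).any()]
reduced = toy.drop(columns=to_drop)
print(to_drop)
```

The rule always keeps the first column of a correlated pair, which is arbitrary; in a real project it is worth checking which of the two is more useful before dropping.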
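6. Outlier analysis. For this dataset we cannot know in advance how important or how correlated each feature is, so we look at the feature correlations first and then analyze specific features.
(1) Plot the distribution and remove values outside the normal range: anything below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR, where IQR = Q3 - Q1 is the interquartile range.
# This draws three subplots; adapt it as needed when reusing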
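from scipy.stats import norm
# new_df is the randomly undersampled frame (see the undersampling section)
f, (ax1, ax2, ax3) = plt.subplots(1,3, figsize=(20, 6))
v14_fraud_dist = new_df['V14'].loc[new_df['Class'] == 1].values
# Histogram plus a fitted normal curve (replaces the deprecated sns.distplot(..., fit=norm))
sns.histplot(v14_fraud_dist, kde=True, stat='density', ax=ax1, color='#FB8861')
mu, sigma = norm.fit(v14_fraud_dist)
xs = np.linspace(v14_fraud_dist.min(), v14_fraud_dist.max(), 200)
ax1.plot(xs, norm.pdf(xs, mu, sigma), color='k')
ax1.set_title('V14 Distribution \n (Fraud Transactions)', fontsize=14)
# V14: removing outliers (feature with the highest negative correlation with the label)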
v14_fraud = new_df['V14'].loc[new_df['Class'] == 1].values
q25, q75 = np.percentile(v14_fraud, 25), np.percentile(v14_fraud, 75)
print('Quartile 25: {} | Quartile 75: {}'.format(q25, q75))
v14_iqr = q75 - q25
print('iqr: {}'.format(v14_iqr))
v14_cut_off = v14_iqr * 1.5
v14_lower, v14_upper = q25 - v14_cut_off, q75 + v14_cut_off
print('Cut Off: {}'.format(v14_cut_off))
print('V14 Lower: {}'.format(v14_lower))
print('V14 Upper: {}'.format(v14_upper))
outliers = [x for x in v14_fraud if x < v14_lower or x > v14_upper]
print('Feature V14 Outliers for Fraud Cases: {}'.format(len(outliers)))
print('V14 outliers: {}'.format(outliers))
new_df = new_df.drop(new_df[(new_df['V14'] > v14_upper) | (new_df['V14'] < v14_lower)].index)
print('----' * 44)
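Since the same cut-off logic gets repeated for every feature, it can be wrapped in a small helper. This is just a sketch: `iqr_bounds` is a name I made up, and `sample` is toy data rather than a column from the dataset.

```python
import numpy as np

def iqr_bounds(values, k=1.5):
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    q25, q75 = np.percentile(values, [25, 75])
    iqr = q75 - q25
    return q25 - k * iqr, q75 + k * iqr

# Toy data: nine ordinary values and one obvious outlier.
sample = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 100.0])
lower, upper = iqr_bounds(sample)
outliers = sample[(sample < lower) | (sample > upper)]
print(lower, upper, outliers)
```

Here Q1 is 3.25 and Q3 is 7.75, so the IQR is 4.5 and the fences sit at -3.5 and 14.5; only the value 100 falls outside. Raising `k` makes the rule more tolerant, which matters when removing rows from an already small fraud class.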
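(2) Box plots (using (1) and (2) together sometimes works well)
# Box plot for spotting outliers; to look at several features, put them in a list and loop over it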
f, axes = plt.subplots(ncols=4, figsize=(20,4))
colors = ['r','b']
# Negative Correlations with our Class (The lower our feature value the more likely it will be a fraud transaction)
sns.boxplot(x="Class", y="V17", data=new_df, palette=colors, ax=axes[0])
axes[0].set_title('V17 vs Class Negative Correlation')
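By contrast, with exploratory data (for example, the relationship between house price and house age), outliers can be analyzed directly from plots.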
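II. Undersampling (in practice, to avoid piling up separate preprocessing steps, this is done directly in the data-processing stage)
One interesting point: the author analyzes how the amount of data affects training and finds that even blind random undersampling has little effect on the training score. This is worth remembering. If a small dataset gives results that differ a lot from those on the full dataset, the sampling needs to be reconsidered.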
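For reference, random undersampling itself takes only a few lines of pandas. The sketch below uses a made-up frame `toy` with 20 fraud rows out of 1000; the column names mirror the dataset, but the numbers are assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
# Hypothetical imbalanced frame: 1000 rows, 20 of them fraud (Class == 1).
toy = pd.DataFrame({
    "V1": rng.normal(size=1000),
    "Class": np.r_[np.ones(20, dtype=int), np.zeros(980, dtype=int)],
})

fraud = toy[toy["Class"] == 1]
# Sample as many non-fraud rows as there are fraud rows.
non_fraud = toy[toy["Class"] == 0].sample(n=len(fraud), random_state=42)

# Concatenate and shuffle so the two classes are interleaved.
balanced = pd.concat([fraud, non_fraud]).sample(frac=1, random_state=42)
print(balanced["Class"].value_counts())
```

A common recommendation is to resample only the training portion and to evaluate on untouched data, since a balanced test set no longer reflects the real class ratio.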
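III. Feature fusion
There are too many variations to cover here; I will add more once I run into a case with real feature processing.
One thing to mention: this credit card dataset calls for dimensionality reduction.
When the data has too many dimensions, it needs to be reduced. Some common dimensionality reduction algorithms are listed below; pick whichever suits the case.
1. t-SNE: builds a probability distribution over the low-dimensional data to fit the distribution of the high-dimensional data, and learns an embedding that brings the two distributions close together.
2. PCA: the main idea is to project high-dimensional data onto a lower-dimensional space (the data should be normalized beforehand).
3. SVD: factorizes a matrix (the matrix does not need to be square).
Choose among the following depending on the situation; the data used for final training must be the dimension-reduced data.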
import time
from sklearn.manifold import TSNE
from sklearn.decomposition import PCA, TruncatedSVD
# new_df is the randomly undersampled data (fewer instances)
X = new_df.drop('Class', axis=1)
y = new_df['Class']
# T-SNE Implementation
t0 = time.time()
X_reduced_tsne = TSNE(n_components=2, random_state=42).fit_transform(X.values)
t1 = time.time()
print("T-SNE took {:.2f} s".format(t1 - t0))
# PCA Implementation
t0 = time.time()
X_reduced_pca = PCA(n_components=2, random_state=42).fit_transform(X.values)
t1 = time.time()
print("PCA took {:.2f} s".format(t1 - t0))
# TruncatedSVD
t0 = time.time()
X_reduced_svd = TruncatedSVD(n_components=2, algorithm='randomized', random_state=42).fit_transform(X.values)
t1 = time.time()
print("Truncated SVD took {:.2f} s".format(t1 - t0))
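The note above that PCA wants normalized input can be made explicit by chaining the scaler and PCA in one pipeline. A minimal sketch on synthetic data follows; `make_classification` stands in for the real features, and checking the explained variance ratio is my own addition, not something from the original kernel.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature matrix (hypothetical data).
X, _ = make_classification(n_samples=300, n_features=20, random_state=42)

# Scale first, then project onto the two leading principal components.
pipeline = make_pipeline(StandardScaler(), PCA(n_components=2, random_state=42))
X_2d = pipeline.fit_transform(X)

# Share of the total variance captured by each retained component.
ratios = pipeline.named_steps["pca"].explained_variance_ratio_
print(X_2d.shape, ratios.sum())
```

If the two components capture only a small share of the variance, the 2D picture is useful for visualization but probably too lossy to train on.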
Template for plotting the data distribution after dimensionality reduction
import matplotlib.patches as mpatches
f, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(24,6))
# labels = ['No Fraud', 'Fraud']
f.suptitle('Clusters using Dimensionality Reduction', fontsize=14)
blue_patch = mpatches.Patch(color='#0A0AFF', label='No Fraud')
red_patch = mpatches.Patch(color='#AF0000', label='Fraud')
# t-SNE scatter plot
ax1.scatter(X_reduced_tsne[:,0], X_reduced_tsne[:,1], c=(y == 0), cmap='coolwarm', label='No Fraud', linewidths=2)
ax1.scatter(X_reduced_tsne[:,0], X_reduced_tsne[:,1], c=(y == 1), cmap='coolwarm', label='Fraud', linewidths=2)
ax1.set_title('t-SNE', fontsize=14)
ax1.grid(True)
ax1.legend(handles=[blue_patch, red_patch])
# All of this code needs to be adapted to the actual situation
# PCA scatter plot
ax2.scatter(X_reduced_pca[:,0], X_reduced_pca[:,1], c=(y == 0), cmap='coolwarm', label='No Fraud', linewidths=2)
ax2.scatter(X_reduced_pca[:,0], X_reduced_pca[:,1], c=(y == 1), cmap='coolwarm', label='Fraud', linewidths=2)
ax2.set_title('PCA', fontsize=14)
ax2.grid(True)
ax2.legend(handles=[blue_patch, red_patch])
# TruncatedSVD scatter plot
ax3.scatter(X_reduced_svd[:,0], X_reduced_svd[:,1], c=(y == 0), cmap='coolwarm', label='No Fraud', linewidths=2)
ax3.scatter(X_reduced_svd[:,0], X_reduced_svd[:,1], c=(y == 1), cmap='coolwarm', label='Fraud', linewidths=2)
ax3.set_title('Truncated SVD', fontsize=14)
ax3.grid(True)
ax3.legend(handles=[blue_patch, red_patch])
plt.show()
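IV. Model selection and hyperparameter tuning
1. Grid search for the best parameters (very expensive in memory and compute)
# Use GridSearchCV to find the best parameters.
# X_train and y_train are the training features and labels prepared earlier.
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
# Logistic Regression (the liblinear solver supports both the l1 and l2 penalties)
log_reg_params = {"penalty": ['l1', 'l2'], 'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000]}
grid_log_reg = GridSearchCV(LogisticRegression(solver='liblinear'), log_reg_params)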
grid_log_reg.fit(X_train, y_train)
# We automatically get the logistic regression with the best parameters.
log_reg = grid_log_reg.best_estimator_
knears_params = {"n_neighbors": list(range(2,5,1)), 'algorithm': ['auto', 'ball_tree', 'kd_tree', 'brute']}
grid_knears = GridSearchCV(KNeighborsClassifier(), knears_params)
grid_knears.fit(X_train, y_train)
# KNears best estimator
knears_neighbors = grid_knears.best_estimator_
# Support Vector Classifier
svc_params = {'C': [0.5, 0.7, 0.9, 1], 'kernel': ['rbf', 'poly', 'sigmoid', 'linear']}
grid_svc = GridSearchCV(SVC(), svc_params)
grid_svc.fit(X_train, y_train)
# SVC best estimator
svc = grid_svc.best_estimator_
# DecisionTree Classifier
tree_params = {"criterion": ["gini", "entropy"], "max_depth": list(range(2,4,1)),
"min_samples_leaf": list(range(5,7,1))}
grid_tree = GridSearchCV(DecisionTreeClassifier(), tree_params)
grid_tree.fit(X_train, y_train)
# tree best estimator
tree_clf = grid_tree.best_estimator_
# The models used here are all simple ones; adding other classifiers follows much the same template
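By default GridSearchCV scores classifiers by accuracy, which says little on imbalanced data. Below is a sketch of the same template with an explicit metric and fold count. The synthetic data, the choice of `roc_auc` and the reduced grid are my assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the prepared training data (hypothetical).
X, y = make_classification(n_samples=400, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

param_grid = {"C": [0.01, 0.1, 1, 10]}
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid,
    scoring="roc_auc",  # assumed metric; accuracy is the default for classifiers
    cv=5,               # 5-fold cross-validation for every candidate
)
search.fit(X_train, y_train)

print(search.best_params_)
print(round(search.best_score_, 3))
# score() reuses the chosen scoring, so this is ROC AUC on the held-out split.
test_auc = search.score(X_test, y_test)
```

The cost grows with (number of parameter combinations) x (number of folds), which is why the grids are kept small.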
ROC curve
import scikitplot as skplt
# ROC curve; xgb_base_sk is a classifier that has already been fitted elsewhere
vali_proba_df = pd.DataFrame(xgb_base_sk.predict_proba(X_test))
skplt.metrics.plot_roc(y_test, vali_proba_df,
plot_micro=False, figsize=(6,6),
plot_macro=False)
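If scikitplot is not installed, the numbers behind the curve can be obtained from scikit-learn directly. A minimal sketch follows, with synthetic data and a logistic regression standing in for the fitted model (both are assumptions, not the XGBoost model used above).

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data: roughly 10% positives (hypothetical).
X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
proba_pos = clf.predict_proba(X_test)[:, 1]  # probability of class 1

# False positive rate and true positive rate at each decision threshold.
fpr, tpr, thresholds = roc_curve(y_test, proba_pos)
auc = roc_auc_score(y_test, proba_pos)
print(round(auc, 3))
```

Plotting `fpr` against `tpr` with `plt.plot(fpr, tpr)` then gives the ROC curve itself.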
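Confusion matrix heatmap
from sklearn.metrics import confusion_matrix
# gbdt_base is a classifier that has already been fitted elsewhere
y_predict_gbd = gbdt_base.predict(X_test)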
labels = [0, 1]
sns.set()
cm = confusion_matrix(y_test, y_predict_gbd, labels=labels)
print("Confusion matrix:\n{0}".format(cm))
cm_normalized = cm/cm.sum(axis=1)[:, np.newaxis]
sns.heatmap(cm_normalized,annot=True)
plt.xlabel('predict label')
plt.ylabel('true label')
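For fraud detection the interesting numbers in the confusion matrix are precision and recall of the fraud class. Here is a small worked sketch with made-up labels (`y_true` and `y_pred` are hypothetical), showing how they fall out of the four cells and how the row-normalized matrix relates to recall.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up labels: 6 legitimate transactions and 4 frauds.
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 0])

cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
tn, fp, fn, tp = cm.ravel()

# Row-normalize: each row sums to 1, so the diagonal holds per-class recall.
cm_normalized = cm / cm.sum(axis=1, keepdims=True)

precision = tp / (tp + fp)  # of the predicted frauds, how many are real
recall = tp / (tp + fn)     # of the real frauds, how many were caught
print(cm)
print(precision, recall)
```

In this toy case tn=4, fp=2, fn=1, tp=3, so precision is 3/5 = 0.6 and recall is 3/4 = 0.75, and the bottom-right cell of the normalized matrix equals the recall.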
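To be continued -------
(This project is fairly simple, and the feature processing in particular is straightforward; more will be added later.)
This is just a summary from my personal study-diary blog.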