The dataset for this classification task is the financial dataset from the previous exercise; the goal is to predict whether a loan applicant will default. Several classifiers from sklearn are applied to the preprocessed data.
1. Logistic Regression
Despite the word "regression" in its name, Logistic Regression is a classification method, mainly used for binary problems. It defines a loss function, iteratively solves for the optimal model parameters with an optimization method, and finally evaluates the fitted model on held-out data.
The key calls are logistic = linear_model.LogisticRegression() and pre_lr = logistic.fit(train_data, train_label).score(test_data, test_label).
import pandas as pd
import numpy as np
from sklearn import linear_model

def LR_classify(train_data, train_label, test_data, test_label):
    logistic = linear_model.LogisticRegression()
    pre_lr = logistic.fit(train_data, train_label).score(test_data, test_label)
    print(pre_lr)
    return pre_lr
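The loss that logistic regression minimizes is the cross-entropy (log) loss over sigmoid-transformed linear scores. A minimal pure-Python sketch of that loss, with hypothetical scores and labels chosen only for illustration:

```python
import math

def sigmoid(z):
    # map a raw linear score to a probability in (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

def log_loss(y_true, p_pred):
    # average cross-entropy loss minimized by logistic regression
    eps = 1e-12
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # guard against log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

# confident correct predictions cost little; confident wrong ones cost a lot
probs = [sigmoid(z) for z in [2.0, -1.5, 0.3]]
loss = log_loss([1, 0, 1], probs)
```

Gradient-based optimizers inside sklearn drive exactly this quantity down over the training set.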
2. Support Vector Machines
A Support Vector Machine is a widely used classifier whose core idea is to construct a hyperplane that separates the different classes.
from sklearn import svm

def SVM_classify(train_data, train_label, test_data, test_label):
    clf = svm.SVC(C=0.6, kernel='rbf', gamma=20, decision_function_shape='ovr')
    clf.fit(train_data, train_label)
    acc_test = clf.score(test_data, test_label)  # accuracy on the test split
    return acc_test
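The separating-hyperplane idea can be seen with a hand-written decision function w·x + b; the sign picks the class. The weights below are a hypothetical hyperplane for illustration, not anything the SVM optimizer produced:

```python
def decision(w, b, x):
    # signed score: sign gives the class, magnitude relates to the margin
    return sum(wi * xi for wi, xi in zip(w, x)) + b

w, b = [1.0, -1.0], 0.5   # hypothetical hyperplane x1 - x2 + 0.5 = 0
points = [[2.0, 1.0], [0.0, 3.0]]
labels = [1 if decision(w, b, x) >= 0 else -1 for x in points]
```

The SVM training problem chooses w and b to maximize the margin between the two classes (with the rbf kernel, in an implicit feature space).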
3. Decision Tree
A decision tree is a basic method for both classification and regression. For classification it splits the samples on feature values and can be viewed as a collection of if-then rules. The three mainstream algorithms are ID3 (information gain), C4.5 (information gain ratio), and CART (Gini index).
from sklearn.tree import DecisionTreeClassifier

def DT_classify(train_data, train_label, test_data, test_label):
    clf = DecisionTreeClassifier()
    acc = clf.fit(train_data, train_label).score(test_data, test_label)
    return acc
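The impurity measures behind those three algorithms are short formulas; a pure-Python sketch of the Gini index (CART) and entropy (ID3/C4.5), on toy label lists:

```python
import math
from collections import Counter

def gini(labels):
    # CART impurity: 1 - sum_k p_k^2, zero for a pure node
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy(labels):
    # ID3/C4.5 impurity: -sum_k p_k * log2(p_k)
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

pure, mixed = [0, 0, 0, 0], [0, 0, 1, 1]
g_pure, g_mixed = gini(pure), gini(mixed)   # 0.0 and 0.5
e_mixed = entropy(mixed)                    # 1.0 bit for a 50/50 split
```

A split is chosen to reduce these quantities as much as possible (information gain is the entropy drop from parent to children).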
4. XGBoost
XGBoost is an ensemble learning method. Its objective is to build K regression trees so that the ensemble's predictions are as close as possible to the true values (accuracy) while keeping generalization ability as high as possible (the more essential goal).
from xgboost.sklearn import XGBClassifier

def XGB_classify(train_data, train_label, test_data, test_label):
    # num_class is unnecessary for a binary objective and is dropped here
    clf = XGBClassifier(n_estimators=100, learning_rate=0.3, max_depth=6,
                        subsample=1, gamma=0, seed=1000)
    acc_xgb = clf.fit(train_data, train_label).score(test_data, test_label)
    return acc_xgb
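The "K trees whose summed output approximates the target" idea reduces to additive prediction: each tree contributes a small correction, shrunk by the learning rate. A toy sketch with two hypothetical stumps (illustrative values only, not xgboost internals):

```python
def boosted_score(x, trees, learning_rate=0.3, base=0.0):
    # the raw score is the base value plus the shrunken sum of tree outputs
    score = base
    for tree in trees:
        score += learning_rate * tree(x)
    return score

# two hypothetical decision stumps splitting on the first feature
trees = [
    lambda x: 1.0 if x[0] > 0.5 else -1.0,
    lambda x: 0.5 if x[0] > 0.2 else -0.5,
]
raw = boosted_score([0.8], trees)   # 0.3*1.0 + 0.3*0.5 = 0.45
```

In real gradient boosting each new tree is fit to the gradient of the loss at the current score, which is what steers the sum toward the true values.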
5. Random Forest
Random Forest is another ensemble algorithm, belonging to the bagging (bootstrap aggregating) family: the data is resampled into several different datasets, a model is trained on each, and the models then vote on the final prediction.
from sklearn.ensemble import RandomForestClassifier

def RF_classify(train_data, train_label, test_data, test_label):
    forest = RandomForestClassifier(n_estimators=15, random_state=0, n_jobs=-1)
    forest.fit(train_data, train_label)  # training
    acc_test = forest.score(test_data, test_label)  # accuracy on the test split
    return acc_test
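The two halves of bagging, bootstrap resampling and majority voting, are simple enough to sketch in pure Python (toy data, illustrative only):

```python
import random
from collections import Counter

def bootstrap_sample(data, rng):
    # sample with replacement, same size as the original set; duplicates expected
    return [rng.choice(data) for _ in data]

def majority_vote(predictions):
    # bagging aggregates classifiers by taking the most common predicted label
    return Counter(predictions).most_common(1)[0][0]

rng = random.Random(0)
data = list(range(10))
sample = bootstrap_sample(data, rng)
final = majority_vote([1, 0, 1, 1, 0])   # three votes for 1, two for 0
```

A random forest adds one more source of diversity on top of bagging: each split considers only a random subset of the features.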
6. LightGBM
Besides the methods above there is also LightGBM, introduced in the NIPS 2017 paper (the conference has since been renamed NeurIPS) "LightGBM: A Highly Efficient Gradient Boosting Decision Tree", which improves efficiency over GBDT, XGBoost, and similar methods.
import lightgbm as lgb

def lightgbm_classify(train_data, train_label, test_data, test_label):
    # the original call passed both num_trees and n_estimators, which conflict;
    # keep only n_estimators, and drop num_class, which is for multiclass only
    params = {'num_leaves': 31, 'objective': 'binary'}
    lgbm2 = lgb.LGBMClassifier(n_estimators=100, random_state=0, **params)
    lgbm2.fit(train_data, train_label)
    test_predprob = lgbm2.predict_proba(test_data)  # per-class probabilities on the test split
    return test_predprob
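Unlike the other functions, this one returns per-class probabilities rather than an accuracy. Converting such probabilities to hard labels and scoring them can be sketched in pure Python (toy probability rows, no lightgbm needed):

```python
def proba_to_labels(proba):
    # pick the index of the largest per-class probability in each row
    return [max(range(len(row)), key=row.__getitem__) for row in proba]

def accuracy(y_true, y_pred):
    # fraction of predictions that match the true labels
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

proba = [[0.9, 0.1], [0.3, 0.7], [0.6, 0.4]]
labels = proba_to_labels(proba)     # argmax per row: [0, 1, 0]
acc = accuracy([0, 1, 1], labels)   # two of three correct
```

This is the same argmax-then-compare computation that score() performs internally for a classifier.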
The full classification script for the financial data follows. The file it reads, "data00.csv", is the already-preprocessed dataset; the definitions of the classifier functions above are omitted from the listing.
import pandas as pd
import numpy as np
from sklearn import linear_model
from sklearn import svm
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from xgboost.sklearn import XGBClassifier
from sklearn.model_selection import StratifiedKFold
data = pd.read_csv('data00.csv')
label = data['status']  # extract the target label
data.drop('status', axis=1, inplace=True)
names = data.columns
"""definitions of the classifier functions go here"""
skf = StratifiedKFold(n_splits=5)
LR = []
SVM = []
DT = []
XGB = []
RF = []
for train_index, test_index in skf.split(data, label):
    x_train, x_test = data.iloc[train_index, :], data.iloc[test_index, :]
    y_train, y_test = label.iloc[train_index], label.iloc[test_index]
    acc_lr = LR_classify(x_train, y_train, x_test, y_test)
    acc_svm = SVM_classify(x_train, y_train, x_test, y_test)
    acc_dt = DT_classify(x_train, y_train, x_test, y_test)
    acc_xgb = XGB_classify(x_train, y_train, x_test, y_test)
    acc_rf = RF_classify(x_train, y_train, x_test, y_test)
    LR.append(acc_lr)
    SVM.append(acc_svm)
    DT.append(acc_dt)
    XGB.append(acc_xgb)
    RF.append(acc_rf)
print('Logistic Regression classification accuracy : {:.4}'.format(np.mean(LR)))
print('SVM classification accuracy : {:.4}'.format(np.mean(SVM)))
print('Decision Tree classification accuracy : {:.4}'.format(np.mean(DT)))
print('XGB classification accuracy : {:.4}'.format(np.mean(XGB)))
print('Random Forest classification accuracy : {:.4}'.format(np.mean(RF)))
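Reporting only the mean across folds hides how much the accuracy varies from fold to fold. A small pure-Python helper for mean and sample standard deviation, shown on hypothetical fold accuracies:

```python
import math

def summarize(scores):
    # mean and sample standard deviation across CV folds
    mean = sum(scores) / len(scores)
    var = sum((s - mean) ** 2 for s in scores) / (len(scores) - 1)
    return mean, math.sqrt(var)

fold_acc = [0.78, 0.80, 0.77, 0.79, 0.81]   # hypothetical fold accuracies
mean, std = summarize(fold_acc)
```

A large standard deviation relative to the gap between two models is a sign that the ranking may not be stable across splits.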