Predicting credit risk with a decision tree in Python and sklearn

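import numpy as np
import pandas as pd
# Column names for the German credit data; the last column ("lable") is the target.
names=("Balance,Duration,History,Purpose,Credit amount,Savings,Employment,instPercent,sexMarried,Guarantors,Residence duration,Assets,Age,concCredit,Apartment,Credits,Occupation,Dependents,hasPhone,Foreign,lable").split(',')
# The file is whitespace-delimited; the raw string keeps \s from being read as a string escape.
data=pd.read_csv("Desktop/sunshengyun/data/german/german.data",sep=r'\s+',names=names)
data.head()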
Balance Duration History Purpose Credit amount Savings Employment instPercent sexMarried Guarantors Assets Age concCredit Apartment Credits Occupation Dependents hasPhone Foreign lable
0 A11 6 A34 A43 1169 A65 A75 4 A93 A101 A121 67 A143 A152 2 A173 1 A192 A201 1
1 A12 48 A32 A43 5951 A61 A73 2 A92 A101 A121 22 A143 A152 1 A173 1 A191 A201 2
2 A14 12 A34 A46 2096 A61 A74 2 A93 A101 A121 49 A143 A152 1 A172 2 A191 A201 1
3 A11 42 A32 A42 7882 A61 A74 2 A93 A103 A122 45 A143 A153 1 A173 2 A191 A201 1
4 A11 24 A33 A40 4870 A61 A73 3 A93 A101 A124 53 A143 A153 2 A173 2 A191 A201 2

5 rows × 21 columns

data.Balance.unique()
array(['A11', 'A12', 'A14', 'A13'], dtype=object)
data.count()
Balance               1000
Duration              1000
History               1000
Purpose               1000
Credit amount         1000
Savings               1000
Employment            1000
instPercent           1000
sexMarried            1000
Guarantors            1000
Residence duration    1000
Assets                1000
Age                   1000
concCredit            1000
Apartment             1000
Credits               1000
Occupation            1000
Dependents            1000
hasPhone              1000
Foreign               1000
lable                 1000
dtype: int64
# Descriptive statistics for the numeric variables
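Every column reports a count of 1000, which matches the number of rows, so the table has no missing values. A minimal sketch of the same check on a made-up toy frame (the `toy` frame and its values are assumptions for illustration, not the real data):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for `data`; the values are made up.
toy = pd.DataFrame({
    "Balance": ["A11", "A12", None, "A14"],
    "Duration": [6, 48, 12, np.nan],
    "lable": [1, 2, 1, 1],
})

# count() gives the number of non-null entries per column ...
non_null = toy.count()
# ... and isna().sum() gives the number of missing entries per column.
missing = toy.isna().sum()

print(non_null)
print(missing)
```

On the real table, `data.isna().sum()` should come out as zero for every column if the counts above hold.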
data.describe()
Duration Credit amount instPercent Residence duration Age Credits Dependents lable
count 1000.000000 1000.000000 1000.000000 1000.000000 1000.000000 1000.000000 1000.000000 1000.000000
mean 20.903000 3271.258000 2.973000 2.845000 35.546000 1.407000 1.155000 1.300000
std 12.058814 2822.736876 1.118715 1.103718 11.375469 0.577654 0.362086 0.458487
min 4.000000 250.000000 1.000000 1.000000 19.000000 1.000000 1.000000 1.000000
25% 12.000000 1365.500000 2.000000 2.000000 27.000000 1.000000 1.000000 1.000000
50% 18.000000 2319.500000 3.000000 3.000000 33.000000 1.000000 1.000000 1.000000
75% 24.000000 3972.250000 4.000000 4.000000 42.000000 2.000000 1.000000 2.000000
max 72.000000 18424.000000 4.000000 4.000000 75.000000 4.000000 2.000000 2.000000
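The `lable` column only takes the values 1 and 2 (min 1, max 2), so its mean of 1.3 fixes the class split: mean = 1 + (share of rows labelled 2), which gives 30% of rows with label 2. In the UCI description of this dataset, 1 marks a good credit and 2 a bad one. A small sketch of that arithmetic, using a made-up ten-row label column with the same 70/30 split (the `lable` Series here is a stand-in, not the real column):

```python
import pandas as pd

# Made-up stand-in for data["lable"] with a 70/30 split.
lable = pd.Series([1] * 7 + [2] * 3, name="lable")

# For labels restricted to {1, 2}: mean = 1 + share_of_2.
share_of_2 = lable.mean() - 1

# value_counts(normalize=True) reports the same shares directly.
shares = lable.value_counts(normalize=True)
print(share_of_2)
print(shares)
```

The imbalance matters later: a model that always predicts label 1 would already be right about 70% of the time, so accuracy alone can be misleading.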
data.Duration.unique()
array([ 6, 48, 12, 42, 24, 36, 30, 15, 9, 10, 7, 60, 18, 45, 11, 27, 8, 54, 20, 14, 33, 21, 16, 4, 47, 13, 22, 39, 28, 5, 26, 72, 40], dtype=int64)
data.History.unique()
array(['A34', 'A32', 'A33', 'A30', 'A31'], dtype=object)
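Columns such as `Balance` and `History` hold string codes like 'A11' and 'A34', which scikit-learn's decision tree cannot consume directly. The sketch below shows one possible route from such a table to the decision tree named in the title: one-hot encode the categorical columns with `pd.get_dummies`, split the rows, and fit a `DecisionTreeClassifier`. The toy frame, the choice of one-hot encoding, `max_depth=3`, and the 75/25 split are all assumptions for illustration, not the author's settings.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Made-up miniature of the credit table: two categorical columns,
# one numeric column, and the target column "lable" (1 or 2).
toy = pd.DataFrame({
    "Balance": ["A11", "A12", "A14", "A11", "A13", "A14", "A12", "A11"] * 5,
    "History": ["A34", "A32", "A34", "A32", "A33", "A30", "A31", "A32"] * 5,
    "Duration": [6, 48, 12, 42, 24, 36, 30, 15] * 5,
    "lable": [1, 2, 1, 1, 2, 1, 2, 1] * 5,
})

# One-hot encode the string columns; numeric columns pass through unchanged.
X = pd.get_dummies(toy.drop(columns="lable"), dtype=int)
y = toy["lable"]

# Hold out a quarter of the rows, keeping the class ratio in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# A shallow tree; max_depth=3 is an arbitrary illustrative value.
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(X_train, y_train)
accuracy = tree.score(X_test, y_test)
print(accuracy)
```

On the real table the same pattern would apply to `data` in place of `toy`, with `lable` as the target.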
data.groupby('Balance').size().sort_values(ascending=
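The call above is cut off in the source. In current pandas, sorting a Series by value is done with `Series.sort_values` (very old releases spelled it `Series.order`, which has since been removed). A minimal sketch on a made-up column; `ascending=False` is an assumption, since the original argument is missing:

```python
import pandas as pd

# Made-up stand-in for data["Balance"].
toy = pd.DataFrame({"Balance": ["A11", "A14", "A12", "A14", "A11", "A14"]})

# Count rows per Balance code, then sort the counts.
# ascending=False is assumed; the source line is truncated at this argument.
sizes = toy.groupby("Balance").size().sort_values(ascending=False)
print(sizes)
```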
Stacking is an ensemble learning method that combines the predictions of several models and can often predict better than any single one of them. The example below builds a stacked model in Python from scikit-learn estimators, using the StackingClassifier from the mlxtend library, in the following steps:

1. Import the required libraries and load the dataset.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score, KFold
from sklearn.model_selection import GridSearchCV
from mlxtend.classifier import StackingClassifier

iris = load_iris()
# Use only two of the four iris features (columns 1 and 2).
X, y = iris.data[:, 1:3], iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
```

2. Define the base models and the meta-model.

```python
clf1 = KNeighborsClassifier(n_neighbors=3)
clf2 = DecisionTreeClassifier()
clf3 = RandomForestClassifier(n_estimators=100)
clf4 = SVC(kernel='linear', probability=True)
lr = LogisticRegression()
```

3. Define the stacking model and cross-validate it alongside the base models.

```python
sclf = StackingClassifier(classifiers=[clf1, clf2, clf3, clf4], meta_classifier=lr)
kfold = KFold(n_splits=10, shuffle=True, random_state=42)
for clf, label in zip([clf1, clf2, clf3, clf4, sclf],
                      ['KNN', 'Decision Tree', 'Random Forest', 'SVM', 'StackingClassifier']):
    scores = cross_val_score(clf, X, y, cv=kfold, scoring='accuracy')
    print("Accuracy: %0.2f (+/- %0.2f) [%s]" % (scores.mean(), scores.std(), label))
```

4. Tune the stacking model's hyperparameters.

```python
# Base-model keys are the lowercased class names. Recent mlxtend releases
# address the meta-model as 'meta_classifier'; older releases used
# 'meta-logisticregression__C' instead.
params = {'kneighborsclassifier__n_neighbors': [1, 3, 5],
          'decisiontreeclassifier__max_depth': [1, 2],
          'randomforestclassifier__max_depth': [1, 2],
          'meta_classifier__C': [0.1, 1.0, 10.0]}
grid = GridSearchCV(estimator=sclf, param_grid=params, cv=kfold, refit=True)
grid.fit(X_train, y_train)
print("Best parameters set found on development set:")
print(grid.best_params_)
print("Grid scores on development set:")
means = grid.cv_results_['mean_test_score']
stds = grid.cv_results_['std_test_score']
# Use a separate loop variable so the `params` grid above is not overwritten.
for mean, std, p in zip(means, stds, grid.cv_results_['params']):
    print("%0.3f (+/-%0.03f) for %r" % (mean, std * 2, p))
```

5. Compute the stacking model's accuracy on the test set.

```python
y_pred = grid.predict(X_test)
print('Accuracy: %.2f' % accuracy_score(y_test, y_pred))
```

With these steps, a stacked ensemble of scikit-learn models can be built and evaluated in Python.
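The walkthrough above imports `StackingClassifier` from mlxtend. scikit-learn also ships its own `sklearn.ensemble.StackingClassifier` (added in version 0.22), which takes named `estimators` and a `final_estimator` instead. The following is a minimal sketch of that alternative, not a drop-in replacement: the estimator names ("knn", "tree", "rf"), `cv=5`, and the use of all four iris features are illustrative assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Base models are passed as (name, estimator) pairs; the names are arbitrary.
stack = StackingClassifier(
    estimators=[
        ("knn", KNeighborsClassifier(n_neighbors=3)),
        ("tree", DecisionTreeClassifier(random_state=42)),
        ("rf", RandomForestClassifier(n_estimators=100, random_state=42)),
    ],
    # The meta-model is trained on out-of-fold predictions of the base models.
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,
)
stack.fit(X_train, y_train)
accuracy = stack.score(X_test, y_test)
print("Accuracy: %.2f" % accuracy)
```

One design difference worth noting: the scikit-learn version fits the meta-model on cross-validated predictions by default, which reduces the risk of the meta-model overfitting to base-model outputs on the training rows.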
