Workflow

断了线的风筝-呀比

Category: Model Analysis

Article link: https://blog.csdn.net/qq55220011/article/details/82788886

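1. Data extraction

import pandas as pd
import pymysql

# Connect to MySQL; replace 'database_name' with the actual database name
db = pymysql.connect(host='192.168.10.136', port=3316, user='root', password='1234', database='database_name', charset='gbk')
sql = 'select * from student'
# Load the query result into a DataFrame
data = pd.read_sql(sql, db)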

2. Check missing values

# Fraction of missing values in each column, sorted from highest to lowest
check_null = data.isnull().sum(axis=0).sort_values(ascending=False) / float(len(data))

3. Count columns by data type

# Avoid the name `type`, which would shadow the Python builtin
dtype_counts = data.dtypes.value_counts()

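4. Drop variables that have only one distinct value

# Keep only the columns whose number of unique values is not 1
data = data.loc[:, data.apply(pd.Series.nunique) != 1]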

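5. Missing value handling

# Fill missing values with the column mean
# (Imputer was removed from sklearn.preprocessing in scikit-learn 0.22; SimpleImputer replaces it)
import numpy as np
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
# loansdata is the working DataFrame and numcolumns its numeric column names, both defined elsewhere
loansdata[numcolumns] = imputer.fit_transform(loansdata[numcolumns])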

6. Data filtering

7. Feature engineering

a. Feature derivation

b. Feature abstraction

from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
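To make feature abstraction concrete, here is a minimal pure-Python sketch of what label encoding and one-hot encoding do to a categorical column. The `grades` column and the helper names are made up for illustration; in a real pipeline the scikit-learn encoders would do this work.

```python
def label_encode(values):
    """Map each distinct category to an integer code, using sorted category order."""
    classes = sorted(set(values))
    index = {c: i for i, c in enumerate(classes)}
    return classes, [index[v] for v in values]


def one_hot_encode(values):
    """Turn each value into a 0/1 vector with a single 1 at its category's position."""
    classes, codes = label_encode(values)
    rows = [[1 if i == code else 0 for i in range(len(classes))] for code in codes]
    return classes, rows


grades = ["B", "A", "C", "A"]  # hypothetical categorical column
classes, codes = label_encode(grades)
_, one_hot = one_hot_encode(grades)

print(classes)  # ['A', 'B', 'C']
print(codes)    # [1, 0, 2, 0]
print(one_hot)  # [[0, 1, 0], [1, 0, 0], [0, 0, 1], [1, 0, 0]]
```

Label encoding implies an ordering between categories, so one-hot encoding is usually the safer choice for unordered categories fed to linear models.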

c. Feature scaling

from sklearn.preprocessing import StandardScaler
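As a rough illustration of what standardization does, the sketch below computes z-scores by hand: subtract the column mean, then divide by the population standard deviation. The `incomes` values are invented, and the sketch assumes the column is not constant (a zero standard deviation would cause a division by zero).

```python
import math


def standardize(column):
    """Return z-scores (x - mean) / std, using the population std (divide by n)."""
    n = len(column)
    mean = sum(column) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in column) / n)
    return [(x - mean) / std for x in column]


incomes = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # hypothetical feature values
scaled = standardize(incomes)
print(scaled)  # [-1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0]
```

After scaling, the column has mean 0 and standard deviation 1, which keeps features with large raw ranges from dominating distance-based or regularized models.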

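d. Feature selection

Filter approach: selects features based on the association among the independent variables, or between the independent variables and the target variable. The recursive feature elimination (RFE) code that follows refits a model repeatedly, so it is usually classed as a wrapper technique rather than a filter.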

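from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Build a logistic regression classifier
model = LogisticRegression()

# Build the recursive feature elimination selector and keep 30 features
rfe = RFE(model, n_features_to_select=30)
rfe = rfe.fit(x_val, y_val)

# Print the selection result
print(rfe.support_)
print(rfe.ranking_)  # a ranking of 1 means the feature was selected; any other value means it was not

col_filter = x_val.columns[rfe.support_]  # use the boolean mask to keep the variables that survive this first reduction
print(col_filter)  # the variables selected by recursive feature elimination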

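Embedded approach: the learner itself selects features automatically as part of training. (Here, redundant features are identified from a Pearson correlation heatmap and removed; a correlation screen of this kind is commonly classed as a filter technique.)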

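import matplotlib.pyplot as plt
import seaborn as sns

colormap = plt.cm.viridis
plt.figure(figsize=(12, 12))
plt.title('Pearson Correlation of Features', y=1.05, size=15)
sns.heatmap(loans_ml_df[col_filter].corr(), linewidths=0.1, vmax=1.0, square=True, cmap=colormap, linecolor='white', annot=True)
# drop_col is the list of redundant features picked out from the heatmap
col_new = col_filter.drop(drop_col)  # remove the redundant features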
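Reading redundant features off a heatmap can also be done programmatically. The sketch below computes pairwise Pearson correlations in plain Python and lists the pairs whose absolute correlation reaches a threshold. The feature names, the values, and the 0.9 cutoff are all assumptions chosen for illustration, not values from the original analysis.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def redundant_pairs(features, threshold=0.9):
    """Return (name_a, name_b, r) for every pair with |r| >= threshold."""
    names = list(features)
    pairs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = pearson(features[a], features[b])
            if abs(r) >= threshold:
                pairs.append((a, b, r))
    return pairs


# Hypothetical columns: installment is nearly proportional to loan_amnt
features = {
    "loan_amnt": [1.0, 2.0, 3.0, 4.0, 5.0],
    "installment": [2.1, 3.9, 6.2, 8.0, 9.9],
    "age": [40.0, 25.0, 33.0, 52.0, 29.0],
}
pairs = redundant_pairs(features)
for a, b, r in pairs:
    print(a, b, round(r, 3))
```

For each flagged pair, one of the two features would then be a candidate for the drop list.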

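Wrapper approach: uses an objective function (AUC/MSE) to decide whether to add a variable. The random forest feature importances computed below come from the learner itself, which is the kind of signal embedded methods rely on.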

from sklearn.ensemble import RandomForestClassifier
names = loans_ml_df[col_new].columns
clf = RandomForestClassifier(n_estimators=10, random_state=123)  # build a random forest classifier
clf.fit(x_val[col_new], y_val)  # fit the independent variables against the target
# Print each feature together with its importance
for feature in zip(names, clf.feature_importances_):
    print(feature)

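8. Model training

a. Handling imbalanced data

(1) Undersampling: remove some negative (majority-class) samples so that the numbers of positive and negative samples are close, then train.

(2) Oversampling: add positive (minority-class) samples so that the numbers of positive and negative samples are close, then train.

The basic idea of SMOTE is as follows: using a nearest-neighbor algorithm, find the K nearest neighbors of each minority-class sample, randomly pick N of those K neighbors, and interpolate linearly at a random point between the sample and each picked neighbor to construct new minority-class samples. The new samples are then merged with the original data to produce a new training set.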

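from imblearn.over_sampling import SMOTE  # import the SMOTE implementation
# Handle the imbalanced data
sm = SMOTE(random_state=42)  # oversampling method
X, y = sm.fit_resample(X, y)  # fit_sample was renamed fit_resample in newer imbalanced-learn releases
print('After balancing positive and negative samples with SMOTE')
n_sample = y.shape[0]
n_pos_sample = y[y == 0].shape[0]  # label 0 is counted as the positive class here
n_neg_sample = y[y == 1].shape[0]  # label 1 is counted as the negative class here
print('Number of samples: {}; positive: {:.2%}; negative: {:.2%}'.format(n_sample,
                                                   n_pos_sample / n_sample,
                                                   n_neg_sample / n_sample))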
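To make the interpolation step of SMOTE concrete, here is a simplified pure-Python sketch. It is not the imbalanced-learn implementation: the function name, the parameters, and the toy points are assumptions, and each neighbor is drawn at random with replacement rather than by any particular sampling scheme.

```python
import random


def smote_sketch(minority, k=3, n_new_per_sample=2, seed=42):
    """Create synthetic minority samples by interpolating toward random nearest neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for i, x in enumerate(minority):
        # K nearest minority neighbours of x by squared Euclidean distance, excluding x itself
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(
            others, key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p))
        )[:k]
        for _ in range(n_new_per_sample):
            nb = rng.choice(neighbours)  # pick one neighbour at random
            gap = rng.random()           # random position on the segment from x to nb
            synthetic.append([a + gap * (b - a) for a, b in zip(x, nb)])
    return synthetic


minority = [[1.0, 1.0], [1.5, 2.0], [2.0, 1.0], [3.0, 3.0]]  # hypothetical minority-class points
new_points = smote_sketch(minority)
print(len(new_points))  # 8 synthetic points: 2 for each of the 4 originals
```

Because every synthetic point lies on a segment between two existing minority samples, the new data stays inside the region the minority class already occupies.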

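b. Build a classifier and train it

9. Model evaluation

a. Check the accuracy of the predictions

from sklearn.metrics import accuracy_score
# accuracy_score expects the true labels first, then the predictions
print("Test set accuracy score: {:.5f}".format(accuracy_score(y, predicted1)))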

b. Compute precision, recall, and f1-score

from sklearn.metrics import classification_report
print(classification_report(y, predicted1))
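For reference, the three figures in that report can be computed by hand from the counts of true positives, false positives, and false negatives. The sketch below does this for a binary problem; the labels are invented, and the function name is an assumption rather than a scikit-learn API.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall and F1 for the given positive label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


y_true = [1, 0, 1, 1, 0, 0, 1, 0]  # hypothetical true labels
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]  # hypothetical predictions
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(precision, recall, f1)  # 0.75 0.75 0.75
```

Here there are 3 true positives, 1 false positive, and 1 false negative, so precision and recall are both 3/4 and F1, their harmonic mean, is also 0.75.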

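from sklearn.metrics import roc_auc_score
# Area under the ROC curve; passing predicted probabilities instead of hard labels usually gives a more informative value
roc_auc1 = roc_auc_score(y, predicted1)
print("Area under the ROC curve : %f" % roc_auc1)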

10. Model optimization

There are three ways to split a dataset into a training set and a test set:
1. Hold-out
2. Cross-validation
3. Bootstrapping
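The three splitting strategies can be sketched as index generators in plain Python. This is only an illustration: the function names and defaults are assumptions, and the k-fold version assigns samples to folds by position without shuffling.

```python
import random


def hold_out(n, test_size=0.3, seed=0):
    """Shuffle the indices once and cut off a fixed share as the test set."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_test = int(round(n * test_size))
    return idx[n_test:], idx[:n_test]


def k_fold(n, k=5):
    """Yield (train, test) index lists; each sample is in the test fold exactly once."""
    idx = list(range(n))
    for fold in range(k):
        test = idx[fold::k]
        test_set = set(test)
        yield [i for i in idx if i not in test_set], test


def bootstrap(n, seed=0):
    """Draw n training indices with replacement; samples never drawn form the test set."""
    rng = random.Random(seed)
    train = [rng.randrange(n) for _ in range(n)]
    chosen = set(train)
    return train, [i for i in range(n) if i not in chosen]


train, test = hold_out(10)
print(len(train), len(test))  # 7 3
folds = list(k_fold(10, k=5))
print(len(folds))  # 5
boot_train, boot_test = bootstrap(10)
print(len(boot_train))  # 10
```

Hold-out is the cheapest, cross-validation uses every sample for testing once, and bootstrapping keeps the training set at full size at the cost of repeated samples.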

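The model learns on the training set, has its parameters tuned on the validation set, and is finally evaluated on the test set.

For tuning we use grid search: we build a set of candidate values for each parameter, the grid search exhaustively tries every combination of them, and the combination that scores best under the chosen scoring scheme is kept.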
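The exhaustive enumeration behind grid search can be sketched in a few lines of plain Python. The `toy_score` function is a hypothetical stand-in for a cross-validated score, made up so the example runs without a model; the helper names are likewise assumptions.

```python
import math
from itertools import product


def iter_grid(param_grid):
    """Yield every combination of the candidate values as a dict."""
    keys = sorted(param_grid)
    for values in product(*(param_grid[k] for k in keys)):
        yield dict(zip(keys, values))


def grid_search(param_grid, score_fn):
    """Score every combination and return the best one with its score."""
    best_params, best_score = None, float("-inf")
    for params in iter_grid(param_grid):
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_score(params):
    # Hypothetical score: highest at C=1, with a small bonus for the l2 penalty
    bonus = 0.1 if params["penalty"] == "l2" else 0.0
    return -abs(math.log10(params["C"])) + bonus


param_grid = {"C": [0.01, 0.1, 1, 10, 100, 1000], "penalty": ["l1", "l2"]}
n_combinations = len(list(iter_grid(param_grid)))
best_params, best_score = grid_search(param_grid, toy_score)
print(n_combinations)  # 12
print(best_params)     # {'C': 1, 'penalty': 'l2'}
```

With 6 values of `C` and 2 penalties there are 12 combinations, and with 10-fold cross-validation each one would be fitted 10 times, which is why grid search gets expensive as the grid grows.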

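import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split  # sklearn.cross_validation was removed in scikit-learn 0.20
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)  # random_state=0 gives the same split on every run
# Build the parameter grid
param_grid = {'C': [0.01, 0.1, 1, 10, 100, 1000],
              'penalty': ['l1', 'l2']}
# Model: LogisticRegression; parameter grid: param_grid; cv=10 means 10-fold cross-validation
# solver='liblinear' is used because it supports both the l1 and l2 penalties
grid_search = GridSearchCV(LogisticRegression(solver='liblinear'), param_grid, cv=10)
grid_search.fit(X_train, y_train)  # learn on the training set

results = pd.DataFrame(grid_search.cv_results_)
print(results)
print(results.columns)
best = np.argmax(results.mean_test_score.values)  # row index of the best mean cross-validation score
print(best)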

Model performance evaluation

print("Best parameters: {}".format(grid_search.best_params_))
print("Best cross-validation score: {:.5f}".format(grid_search.best_score_))
