Machine Learning Notes 1

Miraclemie

Category: Machine Learning Notes

Article link: https://blog.csdn.net/nonem101/article/details/107811492

Data preprocessing:

1. Filling missing values: use dropna and fillna from the pandas library;
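
As a minimal sketch of the two pandas options, the snippet below drops incomplete rows and, alternatively, fills gaps with the column mean. The DataFrame and its column names are hypothetical and only serve as an illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the column names are made up for illustration.
df = pd.DataFrame({
    "rooms": [3.0, np.nan, 5.0, 4.0],
    "price": [100.0, 150.0, np.nan, 130.0],
})

# Option 1: drop every row that contains at least one missing value.
dropped = df.dropna()

# Option 2: fill each missing value with the mean of its column.
filled = df.fillna(df.mean())

print(dropped)
print(filled)
```

Dropping rows is simplest but discards data; filling keeps every row at the cost of introducing estimated values.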

With scikit-learn:

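from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy='mean')
# xx is the original DataFrame; xxx is a copy without the text column,
# because a mean can only be computed for numeric columns
xxx = xx.drop("ocean_proximity", axis=1)
# Fit the imputer to the training set with fit()
imputer.fit(xxx)
# X is a NumPy array
X = imputer.transform(xxx)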
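
For a version that runs end to end, here is a small sketch on a made-up DataFrame. The name `housing` and its columns are assumptions chosen to match the `ocean_proximity` column above; they are not from the original data.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical data standing in for the real training set.
housing = pd.DataFrame({
    "total_rooms": [10.0, np.nan, 30.0],
    "median_income": [2.0, 4.0, np.nan],
    "ocean_proximity": ["INLAND", "NEAR BAY", "INLAND"],
})

# Keep only the numeric columns before imputing.
housing_num = housing.drop("ocean_proximity", axis=1)

imputer = SimpleImputer(strategy="mean")
X = imputer.fit_transform(housing_num)  # NumPy array with the gaps filled

# statistics_ holds the per-column means learned during fit.
print(imputer.statistics_)

# Wrap the array back into a DataFrame to keep the column names.
housing_filled = pd.DataFrame(X, columns=housing_num.columns, index=housing_num.index)
print(housing_filled)
```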

2. Converting text labels to numbers: you can write your own if-else statements to do the conversion;

With scikit-learn:

from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
# xx is one column of a DataFrame; xxx is the encoded result as a NumPy array
xxx = encoder.fit_transform(xx)
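
To make the mapping visible, the sketch below encodes a small hypothetical column and then decodes it again. LabelEncoder assigns integer codes in sorted order of the distinct labels. The scikit-learn documentation describes it as intended for target labels, so for input features OrdinalEncoder or a one-hot encoder may be the better fit.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical text column.
colors = pd.Series(["red", "green", "blue", "green"])

encoder = LabelEncoder()
codes = encoder.fit_transform(colors)

print(encoder.classes_)                  # distinct labels in sorted order
print(codes)                             # integer code of each row
print(encoder.inverse_transform(codes))  # back to the original text
```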

The numeric labels can then be binarized:

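from sklearn import preprocessing
lb = preprocessing.LabelBinarizer()
# Learn the set of distinct labels
lb.fit([1, 2, 6, 4, 2, 4, 7, 8, 9])
# Each label becomes one row with a single 1 in the column of its class
lb.transform([1, 2, 4])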
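
The same calls, written out so the result can be inspected: the fitted binarizer keeps one column per distinct label, so three input labels produce a matrix with three rows and seven columns.

```python
from sklearn.preprocessing import LabelBinarizer

lb = LabelBinarizer()
lb.fit([1, 2, 6, 4, 2, 4, 7, 8, 9])

binary = lb.transform([1, 2, 4])

print(lb.classes_)   # the distinct labels, sorted
print(binary.shape)  # (number of inputs, number of classes)
print(binary)
```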

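Special case: scikit-learn also provides a one-hot encoder. It is mainly useful for data with many text categories, such as supermarket product data. Note that the one-hot encoder's fit_transform expects two-dimensional input.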

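from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder()
# The encoder must be fitted before it can transform, and it expects 2D input
xxx_one_hot = encoder.fit_transform(xxx.reshape(-1, 1))
# Convert the sparse result to a dense 2D array
xxx_one_hot.toarray()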
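
A self-contained sketch of the one-hot step, using a hypothetical array of product categories. The encoder also accepts string categories directly, so a prior LabelEncoder pass is optional.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical product categories.
categories = np.array(["fruit", "dairy", "fruit", "bakery"])

encoder = OneHotEncoder()
# reshape(-1, 1) turns the 1D array into a single-column 2D array.
one_hot = encoder.fit_transform(categories.reshape(-1, 1))

# The default result is a SciPy sparse matrix; toarray() makes it dense.
dense = one_hot.toarray()

print(encoder.categories_)
print(dense)
```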

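You can also write a custom transformer by using BaseEstimator as a base class (usually together with TransformerMixin, which supplies fit_transform). Feature scaling covers both standardization and normalization. Normalization can be done with MinMaxScaler, whose feature_range parameter sets the target range, while standardization is done with StandardScaler.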
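
The sketch below illustrates both ideas under stated assumptions: `RatioAdder` is a hypothetical transformer invented for this example, and the input array is made up. It is one possible shape for a custom transformer, not a prescribed one.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import MinMaxScaler, StandardScaler


class RatioAdder(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: appends column[numerator] / column[denominator]."""

    def __init__(self, numerator=0, denominator=1):
        self.numerator = numerator
        self.denominator = denominator

    def fit(self, X, y=None):
        # Nothing to learn; return self so calls can be chained.
        return self

    def transform(self, X):
        X = np.asarray(X, dtype=float)
        ratio = X[:, self.numerator] / X[:, self.denominator]
        return np.c_[X, ratio]


X = np.array([[2.0, 4.0], [6.0, 3.0], [10.0, 5.0]])

# TransformerMixin provides fit_transform on top of fit and transform.
X_ratio = RatioAdder().fit_transform(X)

# Normalization: rescale every column into feature_range.
X_minmax = MinMaxScaler(feature_range=(0, 1)).fit_transform(X)

# Standardization: zero mean and unit variance per column.
X_std = StandardScaler().fit_transform(X)

print(X_ratio)
print(X_minmax)
print(X_std)
```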

A more convenient, all-in-one option is a pipeline, which runs its steps in the order they are listed:

import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
num_pipeline = Pipeline([('imputer', SimpleImputer(strategy="median")),
                         ('min_max_scaler', MinMaxScaler(feature_range=(2, 3)))])
num_pipeline.fit_transform([[12, np.nan, 32], [23, 23., 33]])

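# Combining two pipelines
# DataFrameSelector is not part of scikit-learn; it comes from the separate sklearn_features package
from sklearn_features.transformers import DataFrameSelector
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import StandardScaler
# num_attribs and cat_attribs are lists of numeric and categorical column names
num_pipeline = Pipeline([
    ('selector', DataFrameSelector(num_attribs)),
    ('imputer', SimpleImputer(strategy="median")),
    ('std_scaler', StandardScaler()),
])
cat_pipeline = Pipeline([
    ('selector', DataFrameSelector(cat_attribs)),
])
full_pipeline = FeatureUnion(transformer_list=[
    ("num_pipeline", num_pipeline),
    ("cat_pipeline", cat_pipeline),
])
full_pipeline.fit_transform(data)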
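
As an alternative to the selector-plus-FeatureUnion approach in these notes, scikit-learn's own ColumnTransformer can route columns to different steps. The sketch below is a swapped-in technique, not the method used above; the DataFrame, its column names, and the choice of OneHotEncoder for the text column are all assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical data with two numeric columns and one text column.
data = pd.DataFrame({
    "rooms": [3.0, np.nan, 5.0, 4.0],
    "income": [2.5, 3.5, np.nan, 4.0],
    "region": ["north", "south", "north", "east"],
})
num_attribs = ["rooms", "income"]
cat_attribs = ["region"]

# Numeric columns: fill gaps with the median, then standardize.
num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("std_scaler", StandardScaler()),
])

# sparse_threshold=0 asks for a dense NumPy array as the combined output.
full_pipeline = ColumnTransformer(
    [
        ("num", num_pipeline, num_attribs),
        ("cat", OneHotEncoder(), cat_attribs),
    ],
    sparse_threshold=0,
)

prepared = full_pipeline.fit_transform(data)
print(prepared.shape)  # 2 numeric columns + 3 one-hot columns
```

Selecting columns by name inside the transformer removes the need for a separate selector step.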

3. Splitting the data into training and test sets

from sklearn.model_selection import train_test_split
# 60% of the rows go to the training set; the rest form the test set
x_train, x_test, y_train, y_test = train_test_split(
    df.drop('Label', axis=1), df['Label'], train_size=0.6)
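
A runnable sketch of the split on a made-up DataFrame. The `random_state` and `stratify` arguments are additions not used in the notes: the first makes the split reproducible, the second keeps the class proportions the same in both parts.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical data: two features and a balanced binary label.
df = pd.DataFrame({
    "f1": range(10),
    "f2": range(10, 20),
    "Label": [0, 1] * 5,
})

x_train, x_test, y_train, y_test = train_test_split(
    df.drop("Label", axis=1),
    df["Label"],
    train_size=0.6,
    random_state=42,       # reproducible shuffling
    stratify=df["Label"],  # preserve the class balance
)

print(len(x_train), len(x_test))
print(y_train.value_counts())
```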

4. Feature engineering: dimensionality reduction and feature selection

Principal component analysis (PCA):

from sklearn.decomposition import PCA
pca = PCA(n_components=8)  # n_components is the number of features kept after reduction
train_x = pca.fit_transform(x_train)
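
A short sketch on random data, assuming ten input features. It also shows two points the snippet above leaves implicit: the test set should be transformed with the PCA fitted on the training set, and `explained_variance_ratio_` reports how much variance the kept components retain.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical data: 10 features, random values.
rng = np.random.default_rng(0)
x_train = rng.normal(size=(100, 10))
x_test = rng.normal(size=(20, 10))

pca = PCA(n_components=8)
train_x = pca.fit_transform(x_train)  # learn the components on the training set
test_x = pca.transform(x_test)        # reuse them on the test set, without refitting

# Fraction of the total variance kept by the 8 components.
explained = pca.explained_variance_ratio_.sum()
print(train_x.shape, test_x.shape, explained)
```

Passing a float between 0 and 1 as `n_components` instead asks PCA to keep however many components are needed to reach that fraction of explained variance.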

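Linear Discriminant Analysis (LDA) is a supervised dimensionality-reduction method. It should not be confused with Latent Dirichlet Allocation, which shares the abbreviation LDA and is the topic model applied to word-frequency counts. Neither is covered here.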

5. Model selection and training

Decision tree (DT): handles outliers fairly well but overfits easily:

from sklearn.tree import DecisionTreeClassifier
dt = DecisionTreeClassifier()
dt.fit(train_x,y_train)

Logistic regression (LR):

from sklearn.linear_model import LogisticRegression
LR = LogisticRegression()
LR.fit(x_train,y_train)

Random forest (RandomForestClassifier): built on decision trees, it is an ensemble of many trees:

# Random forest
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier()
rfc.fit(x_train,y_train)
rfc.score(x_test,y_test)
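
To compare the three models side by side, the sketch below trains each on the same synthetic dataset (generated with make_classification, an assumption made so the example can run on its own) and prints training and test accuracy. A large gap between the two is the usual sign of overfitting.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the real data.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
x_train, x_test, y_train, y_test = train_test_split(
    X, y, train_size=0.6, random_state=0
)

models = {
    "decision tree": DecisionTreeClassifier(random_state=0),
    "logistic regression": LogisticRegression(max_iter=1000),
    "random forest": RandomForestClassifier(random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(x_train, y_train)
    # Store (training accuracy, test accuracy) for each model.
    scores[name] = (model.score(x_train, y_train), model.score(x_test, y_test))
    print(name, scores[name])
```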

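6. Prediction and model evaluation

You can use the model's built-in score() method to estimate its accuracy on the test set, and predict() to make predictions;

You can also evaluate a model with scikit-learn's scoring parameters and metric functions.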
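
A minimal sketch of both evaluation routes, assuming a synthetic binary dataset and a logistic regression model chosen only for illustration. Metric functions take true and predicted labels; the scoring parameter is a string passed to tools such as cross_val_score.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in for the real data.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
x_train, x_test, y_train, y_test = train_test_split(
    X, y, train_size=0.6, random_state=0
)

model = LogisticRegression(max_iter=1000).fit(x_train, y_train)
y_pred = model.predict(x_test)

# Metric functions: compare true labels with predictions.
acc = accuracy_score(y_test, y_pred)  # same value as model.score(x_test, y_test)
cm = confusion_matrix(y_test, y_pred)
print(acc)
print(cm)
print(classification_report(y_test, y_pred))

# Scoring parameter: choose the metric by name during cross-validation.
f1_scores = cross_val_score(model, x_train, y_train, cv=5, scoring="f1")
print(f1_scores.mean())
```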

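7. Fine-tuning the model and, if needed, choosing a different one

When a model's metrics or scores are unsatisfactory, it needs fine-tuning. This is mainly done with grid search, which adjusts the hyperparameters to arrive at a better model;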

# Example: tuning the hyperparameters of a random forest
from sklearn.model_selection import GridSearchCV
parms = [{'bootstrap': [True], 'n_estimators': [100, 200, 300], 'max_features': [4, 8, 12]},
         {'bootstrap': [False], 'n_estimators': [25, 50, 100], 'max_features': [3, 6, 9]}]
grid_search = GridSearchCV(rfc, param_grid=parms, cv=5, scoring='accuracy')
# The search must be fitted before its results can be read
grid_search.fit(x_train, y_train)
# Best hyperparameter combination found
grid_search.best_params_
# Evaluation score of every combination
cvres = grid_search.cv_results_
for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
    print(mean_score, params)
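
A scaled-down sketch that runs on its own, assuming synthetic data with 12 features and a smaller grid so the search finishes quickly. The `max_features` values should not exceed the number of input features. After fitting, `best_estimator_` is the model retrained on the full training data with the best combination.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the real training data.
X, y = make_classification(n_samples=200, n_features=12, random_state=0)

# Reduced grid: 4 combinations with bootstrap, 2 without.
param_grid = [
    {"bootstrap": [True], "n_estimators": [10, 20], "max_features": [4, 8]},
    {"bootstrap": [False], "n_estimators": [10], "max_features": [3, 6]},
]

grid_search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid=param_grid,
    cv=3,
    scoring="accuracy",
)
grid_search.fit(X, y)

print(grid_search.best_params_)
print(grid_search.best_score_)

# The refitted model with the best hyperparameters, ready for predict().
best_model = grid_search.best_estimator_
```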

If that still does not work, choose a different model and train again.
