Chapter 4: Data Preprocessing for Machine Learning

热爱学习的小鲁同学

Column: python机器学习笔记; tags: machine learning, python

Article link: https://blog.csdn.net/m0_45055763/article/details/124314800

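4.1 Handling missing data

Common forms of missing data:

Blank cells or placeholders in a data table, such as NaN
Unknown-value indicators in a database, such as NULL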

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

import sys
from io import StringIO

csv_data = \
'''A,B,C,D
1.0,2.0,3.0,4.0
5.0,6.0,,8.0
10.0,11.0,12.0,'''

df=pd.read_csv(StringIO(csv_data))
df

	A	B	C	D
0	1.0	2.0	3.0	4.0
1	5.0	6.0	NaN	8.0
2	10.0	11.0	12.0	NaN

# Find missing values
df.isnull()

	A	B	C	D
0	False	False	False	False
1	False	False	True	False
2	False	False	False	True

df.isnull().sum()

A    0
B    0
C    1
D    1
dtype: int64

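Dropping missing data

Signature: dropna(axis=0, how='any', thresh=None, subset=None, inplace=False)

Parameters:

axis: the axis to drop along. 0 or 'index' drops rows; 1 or 'columns' drops columns.

how: the filter rule. 'any' drops a row/column that contains at least one missing value; 'all' drops a row/column only if all of its values are missing.

thresh: the minimum number of non-missing values. An int, default None. A row/column with fewer non-missing values than this is dropped.

subset: the labels to consider, given as a list of labels along the other axis. With axis=0 or 'index', subset holds column labels; with axis=1 or 'columns', subset holds row labels. Only the region restricted by subset is checked when deciding whether to drop a row/column.

inplace: whether to modify the DataFrame in place. A bool, default False. If True, the operation is applied to the original DataFrame and the return value is None.

Copyright notice: the parameter description above is an original article by CSDN blogger 「shangyj17」, licensed under CC 4.0 BY-SA; please include the original source link and this notice when reposting.
Original link: https://blog.csdn.net/qq_17753903/article/details/89817371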
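The thresh and subset parameters described above can be tried on the same small table. This is a minimal sketch; the names df_demo, rows_kept and cols_kept are made up for illustration.

```python
from io import StringIO

import pandas as pd

csv_data = """A,B,C,D
1.0,2.0,3.0,4.0
5.0,6.0,,8.0
10.0,11.0,12.0,"""
df_demo = pd.read_csv(StringIO(csv_data))

# thresh=3: keep rows with at least 3 non-missing values (every row qualifies here)
rows_kept = df_demo.dropna(thresh=3)

# axis=1 with subset=[1]: drop the columns that are NaN in the row labelled 1
cols_kept = df_demo.dropna(axis=1, subset=[1])

print(rows_kept.shape)
print(list(cols_kept.columns))
```

With axis=1 the subset list holds row labels, so only row 1 is inspected and only column C is dropped.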

# Drop rows that contain missing values
df.dropna(axis=0)

	A	B	C	D
0	1.0	2.0	3.0	4.0

# Drop columns that contain missing values
df.dropna(axis=1)

	A	B
0	1.0	2.0
1	5.0	6.0
2	10.0	11.0

df.dropna(how='all')  # drop a row only if all of its values are NaN

	A	B	C	D
0	1.0	2.0	3.0	4.0
1	5.0	6.0	NaN	8.0
2	10.0	11.0	12.0	NaN

df.dropna(thresh=4)  # keep only rows with at least 4 non-missing values

	A	B	C	D
0	1.0	2.0	3.0	4.0

df.dropna(subset=['C'])  # drop rows where column C is NaN

	A	B	C	D
0	1.0	2.0	3.0	4.0
2	10.0	11.0	12.0	NaN

# Convert the DataFrame values to a NumPy array before passing them to sklearn
df.values

array([[ 1.,  2.,  3.,  4.],
       [ 5.,  6., nan,  8.],
       [10., 11., 12., nan]])

Imputing missing values

# Mean imputation
from sklearn.impute import SimpleImputer
simpl=SimpleImputer(missing_values=np.nan,strategy='mean')
simpl=simpl.fit(df.values)
simpl_data=simpl.transform(df.values)
simpl_data

array([[ 1. ,  2. ,  3. ,  4. ],
       [ 5. ,  6. ,  7.5,  8. ],
       [10. , 11. , 12. ,  6. ]])

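class sklearn.impute.SimpleImputer(*, missing_values=nan, strategy='mean', fill_value=None, verbose=0, copy=True, add_indicator=False)

(The verbose parameter shown in this signature was deprecated in scikit-learn 1.1 and removed in 1.3.)

missing_values: the placeholder for missing values, such as nan

strategy:

mean: the column mean
median: the column median
most_frequent: the most frequent value (mode) of the column
constant: a user-specified value

fill_value: when strategy='constant', fill_value specifies the value to fill in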
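To compare the four strategies listed above side by side, here is a small sketch on a made-up array (X_demo and filled_last_row are illustrative names, not from the original notes). The last row is entirely missing, so it shows what each strategy fills in.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up data: the last row is missing in both columns
X_demo = np.array([[1.0, 2.0],
                   [1.0, 2.0],
                   [3.0, 4.0],
                   [7.0, 8.0],
                   [np.nan, np.nan]])

filled_last_row = {}
for strategy in ("mean", "median", "most_frequent"):
    imputer = SimpleImputer(missing_values=np.nan, strategy=strategy)
    filled_last_row[strategy] = imputer.fit_transform(X_demo)[-1]

# strategy='constant' uses fill_value instead of a column statistic
constant_imputer = SimpleImputer(strategy="constant", fill_value=-1.0)
filled_last_row["constant"] = constant_imputer.fit_transform(X_demo)[-1]

for name, row in filled_last_row.items():
    print(name, row)
```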

# Fill with pandas
df.fillna(df.mean())

	A	B	C	D
0	1.0	2.0	3.0	4.0
1	5.0	6.0	7.5	8.0
2	10.0	11.0	12.0	6.0

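The scikit-learn estimator API

Transformers such as SimpleImputer have two basic methods:

fit: learns parameters from the training data
transform: uses those parameters to transform data

Estimators used for prediction (for example classifiers) additionally provide a predict method that makes predictions on data.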
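The fit/transform split can be seen in a small sketch: the parameters (here the column means) are learned from the training data only and are then reused on new data. The arrays X_train_demo and X_new_demo are made up for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

X_train_demo = np.array([[1.0, 2.0],
                         [3.0, np.nan],
                         [5.0, 6.0]])
X_new_demo = np.array([[np.nan, np.nan]])

imputer = SimpleImputer(strategy="mean")
imputer.fit(X_train_demo)                      # learns the column means from the training data
X_new_filled = imputer.transform(X_new_demo)   # reuses the learned means on new data

print(imputer.statistics_)
print(X_new_filled)
```

The new row is filled with the training means, not with statistics computed from the new data itself.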

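4.2 Handling categorical data

Nominal and ordinal features

Ordinal features: can be ordered, such as T-shirt size, XL > L > M
Nominal features: have no order, such as T-shirt color

# Create example data with a nominal feature (color), an ordinal feature (size), and a numeric feature (price)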
import pandas as pd

df = pd.DataFrame([['green', 'M', 10.1, 'class1'],
                   ['red', 'L', 13.5, 'class2'],
                   ['blue', 'XL', 15.3, 'class1']])

df.columns = ['color', 'size', 'price', 'classlabel']
df

	color	size	price	classlabel
0	green	M	10.1	class1
1	red	L	13.5	class2
2	blue	XL	15.3	class1

Mapping ordinal features

# Map the ordinal feature to integers
# Assumed relationship: XL = L + 1 = M + 2
size_mapping={'XL':3,
             'L':2,
             'M':1}

df['size']=df['size'].map(size_mapping)
df

	color	size	price	classlabel
0	green	1	10.1	class1
1	red	2	13.5	class2
2	blue	3	15.3	class1

# Inverse mapping: integer -> string
inv_size_mapping={v:k for k,v in size_mapping.items()}
df['size'].map(inv_size_mapping)

0     M
1     L
2    XL
Name: size, dtype: object

Encoding class labels

import numpy as np
class_mapping={label:idx for idx,label in
              enumerate(np.unique(df['classlabel']))}
class_mapping

{'class1': 0, 'class2': 1}

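Ternary expression

Format: true_return if condition else false_return
If the condition after if holds, true_return is returned; otherwise false_return is returned.
A ternary expression is recommended when the logic is a simple choice between two values.

List comprehension

Definition: an expression inside square brackets, followed by a for clause, then zero or more for or if clauses. The expression can be anything, so the list can hold objects of any type. The result is a new list.
Format: [expression for variable in iterable if condition]

Dictionary comprehension

Format: {key: value for variable in iterable if condition}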
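The three constructs above, shown on a made-up list of sizes (all names here are illustrative only):

```python
sizes = ["M", "L", "XL", "L"]

# Ternary expression: pick one of two values per element
labels = ["large" if s == "XL" else "regular" for s in sizes]

# List comprehension with a filter condition
not_medium = [s for s in sizes if s != "M"]

# Dictionary comprehension: map each distinct size to an integer index
size_to_idx = {s: i for i, s in enumerate(sorted(set(sizes)))}

print(labels)       # ['regular', 'regular', 'large', 'regular']
print(not_medium)   # ['L', 'XL', 'L']
print(size_to_idx)  # {'L': 0, 'M': 1, 'XL': 2}
```

The dictionary comprehension has the same shape as the class_mapping built with enumerate(np.unique(...)) in this section.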

# Convert the class labels to integers
df['classlabel']=df['classlabel'].map(class_mapping)
df

	color	size	price	classlabel
0	green	1	10.1	0
1	red	2	13.5	1
2	blue	3	15.3	0

# Inverse mapping: convert the integers back to class labels
inv_class_mapping={v:k for k,v in class_mapping.items()}
df['classlabel']=df['classlabel'].map(inv_class_mapping)
df

	color	size	price	classlabel
0	green	1	10.1	class1
1	red	2	13.5	class2
2	blue	3	15.3	class1

# Use LabelEncoder to convert the labels to integers
from sklearn.preprocessing import LabelEncoder
class_le=LabelEncoder()
y=class_le.fit_transform(df['classlabel'].values)  # calls fit and transform in one step
y

array([0, 1, 0])

# inverse_transform converts the integers back to the string labels
class_le.inverse_transform(y)

array(['class1', 'class2', 'class1'], dtype=object)

One-hot encoding nominal features

X=df[['color','size','price']].values  # convert the DataFrame values to a NumPy array
X

array([['green', 1, 10.1],
       ['red', 2, 13.5],
       ['blue', 3, 15.3]], dtype=object)

color_le=LabelEncoder()
X[:,0]=color_le.fit_transform(X[:,0])
X

array([[1, 1, 10.1],
       [2, 2, 13.5],
       [0, 3, 15.3]], dtype=object)

Now green=1, red=2, blue=0. The integers imply a magnitude and an order, but colors have no inherent order.

One-hot encoding solves this:

from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

ct = ColumnTransformer(
    [('one_hot_encoder', OneHotEncoder(categories='auto'), [0])],   # The column numbers to be transformed (here is [0] but can be [0, 1, 3])
    remainder='passthrough'                                         # Leave the rest of the columns untouched
)
ct.fit_transform(X)

array([[0.0, 1.0, 0.0, 1, 10.1],
       [0.0, 0.0, 1.0, 2, 13.5],
       [1.0, 0.0, 0.0, 3, 15.3]], dtype=object)

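# pandas get_dummies: converts only the string columns and leaves the other columns unchanged (recent pandas versions return True/False instead of 1/0 unless dtype=int is passed)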
pd.get_dummies(df[['price','color','size']])

	price	size	color_blue	color_green	color_red
0	10.1	1	0	1	0
1	13.5	2	0	0	1
2	15.3	3	1	0	0

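The one-hot columns are highly correlated (any one of them is determined by the others), which introduces multicollinearity. Dropping one column removes the redundancy without losing information.

# Drop the first dummy column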
pd.get_dummies(df[['price','color','size']],drop_first=True)

	price	size	color_green	color_red
0	10.1	1	1	0
1	13.5	2	0	1
2	15.3	3	0	0
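The same idea is available in scikit-learn through the drop parameter of OneHotEncoder. A minimal sketch on a made-up color column (colors_demo is an illustrative name):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

colors_demo = np.array([["green"], ["red"], ["blue"]])

# drop='first' removes the first category (alphabetically 'blue') from the output
ohe = OneHotEncoder(categories="auto", drop="first")
encoded = ohe.fit_transform(colors_demo).toarray()  # the default output is sparse

print(ohe.categories_)
print(encoded)
```

The remaining two columns stand for green and red; a row of zeros means blue.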

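4.3 Splitting a dataset into training and test sets

The code below keeps the Chinese column names of the original. In order, they correspond to: class label, alcohol, malic acid, ash, alcalinity of ash, magnesium, total phenols, flavanoids, nonflavanoid phenols, proanthocyanins, color intensity, hue, OD280/OD315 of diluted wines, proline.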

df=pd.read_csv('wine.data',names=['分类标签','酒精','苹果酸',
                                  '灰','灰的碱度','镁','总酚','黄酮类化合物',
                                  '非黄烷类酚类','原花青素','色彩强度',
                                  '色调','稀释酒','脯氨酸'])
df

	分类标签	酒精	苹果酸	灰	灰的碱度	镁	总酚	黄酮类化合物	非黄烷类酚类	原花青素	色彩强度	色调	稀释酒	脯氨酸
0	1	14.23	1.71	2.43	15.6	127	2.80	3.06	0.28	2.29	5.64	1.04	3.92	1065
1	1	13.20	1.78	2.14	11.2	100	2.65	2.76	0.26	1.28	4.38	1.05	3.40	1050
2	1	13.16	2.36	2.67	18.6	101	2.80	3.24	0.30	2.81	5.68	1.03	3.17	1185
3	1	14.37	1.95	2.50	16.8	113	3.85	3.49	0.24	2.18	7.80	0.86	3.45	1480
4	1	13.24	2.59	2.87	21.0	118	2.80	2.69	0.39	1.82	4.32	1.04	2.93	735
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
173	3	13.71	5.65	2.45	20.5	95	1.68	0.61	0.52	1.06	7.70	0.64	1.74	740
174	3	13.40	3.91	2.48	23.0	102	1.80	0.75	0.43	1.41	7.30	0.70	1.56	750
175	3	13.27	4.28	2.26	20.0	120	1.59	0.69	0.43	1.35	10.20	0.59	1.56	835
176	3	13.17	2.59	2.37	20.0	120	1.65	0.68	0.53	1.46	9.30	0.60	1.62	840
177	3	14.13	4.10	2.74	24.5	96	2.05	0.76	0.56	1.35	9.20	0.61	1.60	560

178 rows × 14 columns

# Use train_test_split to split the data into training and test sets
from sklearn.model_selection import train_test_split
X,y=df.iloc[:,1:].values,df.iloc[:,0].values

X_train,X_test,y_train,y_test=\
train_test_split(X,y,test_size=0.3,random_state=0,stratify=y)  # stratify=y keeps the class proportions of y in both splits

Common train:test split ratios:

60:40
70:30
80:20
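To see what a stratified 70:30 split does to the class proportions, here is a sketch that uses the copy of the Wine data bundled with scikit-learn (load_wine) instead of the local wine.data file; that substitution is an assumption made so the example runs on its own.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split

X_wine, y_wine = load_wine(return_X_y=True)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_wine, y_wine, test_size=0.3, random_state=0, stratify=y_wine)

# Class proportions in the full data, the training part and the test part
full_ratio = np.bincount(y_wine) / len(y_wine)
train_ratio = np.bincount(y_tr) / len(y_tr)
test_ratio = np.bincount(y_te) / len(y_te)

print(X_tr.shape, X_te.shape)
print(full_ratio.round(3), train_ratio.round(3), test_ratio.round(3))
```

The three proportion vectors should be nearly identical, which is the point of passing stratify.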

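4.4 Bringing features onto the same scale

Decision trees and random forests do not need feature scaling; most other algorithms are affected by the scale of the features.

Normalization: rescale the features to the range [0, 1].

The normalized value of sample $x^{(i)}$ is:

$x_{norm}^{(i)}=\frac{x^{(i)}-x_{min}}{x_{max}-x_{min}}$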

from sklearn.preprocessing import MinMaxScaler
mms=MinMaxScaler()
X_train_norm=mms.fit_transform(X_train)
X_test_norm=mms.transform(X_test)

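fit computes the statistics (mean, variance, minimum, maximum, and so on) from the training data only; transform then applies those same statistics to the remaining data (the test data), so the training and test sets are processed in the same way.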
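A small sketch of what this means for min-max scaling: the minimum and maximum come from the training data, so test values outside the training range land outside [0, 1]. The arrays are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

train_demo = np.array([[1.0], [3.0], [5.0]])
test_demo = np.array([[0.0], [7.0]])

# Manual version: min and max are taken from the training data only
x_min = train_demo.min(axis=0)
x_max = train_demo.max(axis=0)
test_manual = (test_demo - x_min) / (x_max - x_min)

# sklearn version: fit on training data, transform the test data
scaler = MinMaxScaler().fit(train_demo)
test_sklearn = scaler.transform(test_demo)

print(test_manual.ravel())
print(test_sklearn.ravel())
```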

Standardization
$x_{std}^{(i)}=\frac{x^{(i)}-\mu_{x}}{\sigma_{x}}$

$\mu_{x}$ is the sample mean of the feature

$\sigma_{x}$ is its standard deviation

from sklearn.preprocessing import StandardScaler
stdsc=StandardScaler()
X_train_std=stdsc.fit_transform(X_train)
X_test_std=stdsc.transform(X_test)

Fit the StandardScaler once on the training data, then use the fitted parameters to transform the test set or any new data point.
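The standardization formula can be checked by hand against StandardScaler. This sketch uses a made-up column; note that StandardScaler uses the population standard deviation, which matches NumPy's default ddof=0.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

train_demo = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
new_point = np.array([[6.0]])

# Manual version: mean and standard deviation from the training data
mu = train_demo.mean(axis=0)
sigma = train_demo.std(axis=0)  # ddof=0, the population standard deviation
train_manual = (train_demo - mu) / sigma
new_manual = (new_point - mu) / sigma

# sklearn version: fit once on the training data, reuse for the new point
scaler = StandardScaler().fit(train_demo)
train_sklearn = scaler.transform(train_demo)
new_sklearn = scaler.transform(new_point)

print(train_manual.ravel())
print(new_manual.ravel(), new_sklearn.ravel())
```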

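4.5 Selecting meaningful features

If a model performs much better on the training set than on the test set, it is overfitting.

Ways to reduce overfitting:

Collect more training data
Introduce a penalty for complexity via regularization
Choose a simpler model with fewer parameters
Reduce the dimensionality of the data

# L1 regularization (fitting with penalty='l1' requires a solver that supports it, such as 'liblinear' or 'saga')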
from sklearn.linear_model import LogisticRegression
LogisticRegression(penalty='l1')

LogisticRegression(penalty='l1')

# Apply L1-regularized logistic regression to the wine data
from sklearn.linear_model import LogisticRegression
lr=LogisticRegression(penalty='l1',C=1.0,solver='liblinear')
lr.fit(X_train_std,y_train)

LogisticRegression(penalty='l1', solver='liblinear')

print('Training accuracy:',lr.score(X_train_std,y_train))

Training accuracy: 1.0

print('Test accuracy:',lr.score(X_test_std,y_test))

Test accuracy: 1.0

# Access the intercept terms
lr.intercept_

array([-1.26338692, -1.21578571, -2.37011064])

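With solver='liblinear', the three-class problem is fitted as three one-vs-rest models.

The first intercept belongs to the model that separates class 1 from classes 2 and 3.

The second intercept belongs to the model that separates class 2 from classes 1 and 3.

The third intercept belongs to the model that separates class 3 from classes 1 and 2.

lr.coef_  # weights of each feature in the three one-vs-rest models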

array([[ 1.2452797 ,  0.18116072,  0.74205733, -1.15963069,  0.        ,
         0.        ,  1.17618838,  0.        ,  0.        ,  0.        ,
         0.        ,  0.54171705,  2.51150349],
       [-1.53874544, -0.38585049, -0.995439  ,  0.36354152, -0.05881598,
         0.        ,  0.66716563,  0.        ,  0.        , -1.93190493,
         1.23790642,  0.        , -2.23370484],
       [ 0.13556432,  0.16853476,  0.3573147 ,  0.        ,  0.        ,
         0.        , -2.43745972,  0.        ,  0.        ,  1.56396548,
        -0.81863775, -0.49269584,  0.        ]])

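$z=w_{0}x_{0}+\dots+w_{13}x_{13}=\sum_{j=0}^{13}w_{j}x_{j}=\mathbf{w}^{T}\mathbf{x}$

$w_{0}$ is the intercept term, with $x_{0}=1$; $w_{1},\dots,w_{13}$ are the weights of the 13 features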
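The net input can be computed either as intercept plus weighted sum, or as a single dot product once x_0 = 1 is prepended. A sketch with made-up numbers and three features instead of thirteen:

```python
import numpy as np

intercept = -1.0                       # w_0
weights = np.array([0.5, -2.0, 1.5])   # w_1 .. w_3 (made-up values)
features = np.array([2.0, 0.5, 1.0])   # x_1 .. x_3 (made-up values)

# Form 1: intercept plus the weighted sum of the features
z_explicit = intercept + weights @ features

# Form 2: prepend x_0 = 1 so that z = w^T x includes the intercept
w = np.concatenate(([intercept], weights))
x = np.concatenate(([1.0], features))
z_vector = w @ x

print(z_explicit, z_vector)
```

In scikit-learn terms, intercept plays the role of one entry of lr.intercept_ and weights of the matching row of lr.coef_.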

lr.coef_.shape

(3, 13)

(lr.coef_== 0).sum(axis=1)

array([6, 4, 6])
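The zero counts above come from the sparsity that L1 regularization induces. The sketch below counts zero weights for a few values of C. It makes two assumptions for the sake of a self-contained example: it uses scikit-learn's bundled load_wine data rather than wine.data, and it wraps the model in OneVsRestClassifier to make the one-vs-rest scheme explicit. The scaler is fitted on all the data because no test set is involved here.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import StandardScaler

X_wine, y_wine = load_wine(return_X_y=True)
X_wine_std = StandardScaler().fit_transform(X_wine)

zero_counts = {}
for C in (0.001, 0.1, 10.0):
    # Smaller C means stronger regularization
    ovr = OneVsRestClassifier(
        LogisticRegression(penalty="l1", C=C, solver="liblinear"))
    ovr.fit(X_wine_std, y_wine)
    # Stack the weights of the three one-vs-rest models: shape (3, 13)
    coefs = np.vstack([est.coef_ for est in ovr.estimators_])
    zero_counts[C] = int((coefs == 0).sum())

print(zero_counts)
```

The stronger the regularization (the smaller C), the more weights are expected to be exactly zero.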

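# coding: utf-8
import matplotlib.pyplot as plt
plt.rcParams['font.sans-serif']=['SimHei']  # display Chinese labels correctly
plt.rcParams['axes.unicode_minus']=False  # display the minus sign correctly
# In Python 2, strings containing Chinese need the u'...' prefix; Python 3 does not need it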




# Weight coefficients of the features under different regularization strengths
import matplotlib.pyplot as plt

# Create the figure
fig=plt.figure()
ax=plt.subplot(1,1,1)

colors=['blue','green','red','cyan','magenta',
       'yellow','black','pink','lightgreen','lightblue',
      'gray','indigo','orange']

weights,params=[],[]

for c in np.arange(-4.0,6.0):
    lr=LogisticRegression(penalty='l1',C=10.0**c,
                         random_state=0,
                         solver='liblinear')
    lr.fit(X_train_std,y_train)
    weights.append(lr.coef_[1])  # weights of the second one-vs-rest model (class 2 vs. the rest)
    params.append(10.0**c)

weights=np.array(weights)  # shape: (number of C values, number of features)


for column,color in zip(range(weights.shape[1]),colors):
    plt.plot(params,weights[:,column],
             label=df.columns[column+1],
            color=color)  # plot how each feature's weight changes with C

plt.axhline(0,color='black',linestyle='--',linewidth=3)
plt.xlim([10**(-5.0),10**5])
plt.ylabel('weight coefficient')
plt.xlabel('C')

plt.xscale('log')

ax.legend(loc='upper left',bbox_to_anchor=(1.38,1.03),
         ncol=1,fancybox=True)
plt.show()

Font 'default' does not have a glyph for '-' [U+2212], substituting with a dummy symbol.

[Figure: weight coefficient of each feature versus C (log scale) for the L1-regularized model]

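Sequential feature selection algorithms

Dimensionality reduction:

Feature extraction: derive information from the feature set to construct a new feature subspace
Feature selection: select a subset of the original features

# SBS (Sequential Backward Selection) implementation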
from sklearn.base import clone
from itertools import combinations
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

class SBS():
    def __init__(self,estimator,k_features,
                scoring=accuracy_score,
                test_size=0.25,random_state=1):
        self.scoring=scoring
        self.estimator=clone(estimator)
        self.k_features=k_features  # desired number of features to return
        self.test_size=test_size
        self.random_state=random_state

    def fit(self,X,y):
        # Split the data into an internal training part and a validation part
        X_train,X_test,y_train,y_test=train_test_split(X,y,
                                                       test_size=self.test_size,
                                                      random_state=self.random_state)

        dim=X_train.shape[1]

        # Start with all features
        self.indices_=tuple(range(dim))  # tuple of the current feature indices
        self.subsets_=[self.indices_]  # best subset found at each step
        score=self._calc_score(X_train,y_train,
                              X_test,y_test,self.indices_)
        self.scores_=[score]

        # Remove one feature at a time
        while dim> self.k_features:
            scores=[]
            subsets=[]

            for p in combinations(self.indices_,r=dim-1):  # all subsets with one feature removed
                score=self._calc_score(X_train,y_train,X_test,y_test,p)

                scores.append(score)
                subsets.append(p)

            best=np.argmax(scores)  # index of the best score in this round
            self.indices_=subsets[best]  # best feature subset in this round
            self.subsets_.append(self.indices_)  # record the best subset of each round
            dim=dim-1

            self.scores_.append(scores[best])  # record the best score of each round
        self.k_score_=self.scores_[-1]  # score of the final subset with k_features features

        return self

    # Return the data reduced to the selected feature columns
    def transform(self,X):
        return X[:,self.indices_]  # self.indices_ holds the column indices of the selected subset

    # Compute the score (accuracy by default) on the validation part
    def _calc_score(self,X_train,y_train,X_test,y_test,indices):
        self.estimator.fit(X_train[:,indices],y_train)
        y_pred=self.estimator.predict(X_test[:,indices])
        score=self.scoring(y_test,y_pred)
        return score

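The combinations(iterable, r) function in Python's itertools module creates an iterator that returns all subsequences of length r from iterable; the items in each returned subsequence keep the order they have in the input iterable.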
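A quick illustration of how SBS uses it: from four feature indices, combinations with r one less than the length yields every subset that leaves exactly one feature out.

```python
from itertools import combinations

indices = (0, 1, 2, 3)

# All subsets obtained by removing exactly one index
candidates = list(combinations(indices, r=len(indices) - 1))
print(candidates)
# [(0, 1, 2), (0, 1, 3), (0, 2, 3), (1, 2, 3)]
```

SBS scores each of these candidates and keeps the best one, then repeats with one feature fewer.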

# Run SBS with a KNN classifier
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier

knn=KNeighborsClassifier(n_neighbors=5)
sbs=SBS(knn,k_features=1)
sbs.fit(X_train_std,y_train)

<__main__.SBS at 0x2018b29d0d0>

# Plot the KNN classification accuracy on SBS's internal validation split
k_feat=[len(k) for k in sbs.subsets_]

plt.plot(k_feat,sbs.scores_,marker='o',color='red')
plt.ylim([0.7,1.01])
plt.ylabel('accuracy')
plt.xlabel('number of features')
plt.grid()  # show grid lines
plt.show()

[Figure: KNN accuracy versus number of features selected by SBS]

k_feat

[13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1]

sbs.subsets_

[(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12),
 (0, 1, 2, 3, 4, 5, 6, 7, 9, 10, 11, 12),
 (0, 1, 2, 3, 4, 5, 6, 7, 9, 10, 11),
 (0, 1, 2, 3, 4, 5, 6, 7, 9, 11),
 (0, 1, 2, 3, 4, 5, 7, 9, 11),
 (0, 1, 2, 3, 5, 7, 9, 11),
 (0, 1, 2, 3, 5, 7, 11),
 (0, 1, 2, 3, 5, 11),
 (0, 1, 2, 3, 11),
 (0, 1, 2, 11),
 (0, 1, 11),
 (0, 11),
 (0,)]

# Which features make up the k=3 subset?
k3=list(sbs.subsets_[-3])
print(df.columns[1:][k3])

Index(['酒精', '苹果酸', '稀释酒'], dtype='object')

# Use the full feature set
knn.fit(X_train_std,y_train)
print('training accuracy:',knn.score(X_train_std,y_train))  # score returns the mean accuracy for a classifier

training accuracy: 0.967741935483871

print('test accuracy:',knn.score(X_test_std,y_test))

test accuracy: 0.9629629629629629

# Use the dataset reduced to the 3 selected features
knn.fit(X_train_std[:,k3],y_train)
print('training accuracy:',knn.score(X_train_std[:,k3],y_train))
print('test accuracy:',knn.score(X_test_std[:,k3],y_test))

training accuracy: 0.9516129032258065
test accuracy: 0.9259259259259259

Other feature selection methods:
https://scikit-learn.org/stable/modules/feature_selection.html
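One of the methods in that module is SequentialFeatureSelector, which can run a backward search similar in spirit to the SBS class above but scores candidates with cross-validation instead of a single hold-out split. A minimal sketch, assuming scikit-learn's bundled load_wine data and a KNN pipeline chosen only for illustration:

```python
from sklearn.datasets import load_wine
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_wine, y_wine = load_wine(return_X_y=True)

# Scale inside the pipeline so each cross-validation fold is scaled on its own training part
estimator = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))

sfs = SequentialFeatureSelector(
    estimator, n_features_to_select=3, direction="backward", cv=5)
sfs.fit(X_wine, y_wine)

selected = sfs.get_support(indices=True)  # column indices of the kept features
X_reduced = sfs.transform(X_wine)

print(selected)
print(X_reduced.shape)
```

The selected indices need not match the SBS result, since the data split and scoring procedure differ.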

4.6 Assessing feature importance with a random forest

from sklearn.ensemble import RandomForestClassifier

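feat_labels=df.columns[1:]  # feature names (all columns except the class label)
forest=RandomForestClassifier(n_estimators=500,
                             random_state=0)

# Train the model
forest.fit(X_train,y_train)  # random forests do not need standardized or normalized features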

importances=forest.feature_importances_

importances

array([0.11411629, 0.02696095, 0.01196206, 0.02005486, 0.03374741,
       0.05124025, 0.16276593, 0.01424631, 0.0253044 , 0.16765709,
       0.06108374, 0.13805384, 0.17280686])

indices=np.argsort(importances)  # indices that sort the importances in ascending order
indices=indices[::-1]  # reverse to get descending order

for f in range(X_train.shape[1]):
    print("%2d) %-*s %f" % (f + 1, 30, 
                            feat_labels[indices[f]], 
                            importances[indices[f]]))

# Plot once, after the loop, with tick labels in the same sorted order as the bars
plt.title('Feature Importance')
plt.bar(range(X_train.shape[1]),importances[indices],align='center',
        color='blue')
plt.xticks(range(X_train.shape[1]),feat_labels[indices],rotation=90)
plt.show()

 1) 脯氨酸                            0.172807
 2) 色彩强度                           0.167657
 3) 黄酮类化合物                         0.162766
 4) 稀释酒                            0.138054
 5) 酒精                             0.114116
 6) 色调                             0.061084
 7) 总酚                             0.051240
 8) 镁                              0.033747
 9) 苹果酸                            0.026961
10) 原花青素                           0.025304
11) 灰的碱度                           0.020055
12) 非黄烷类酚类                         0.014246
13) 灰                              0.011962

[Figure: bar chart of the feature importances in descending order]

from sklearn.feature_selection import SelectFromModel

sfm=SelectFromModel(forest,threshold=0.1,prefit=True)  # prefit=True: the forest is already fitted, so it is not refitted
X_selected=sfm.transform(X_train)
print('Number of features that meet this criterion:', 
      X_selected.shape[1])  # features with importance >= 0.1

Number of features that meet this criterion: 5

for f in range(X_selected.shape[1]):
    print('{0} {1} {2}'.format(f+1,feat_labels[indices[f]],importances[indices[f]]))

1 脯氨酸 0.1728068577804985
2 色彩强度 0.167657093481789
3 黄酮类化合物 0.16276593266606412
4 稀释酒 0.1380538429143578
5 酒精 0.11411629222540887
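The loop above relies on the selected features being exactly the top entries of the sorted indices. An alternative that reads the selection directly is get_support, which returns a boolean mask over the original columns. This sketch assumes scikit-learn's bundled load_wine data and lets SelectFromModel fit the forest itself instead of using prefit=True; the smaller n_estimators is only to keep it quick.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

wine = load_wine()
X_wine, y_wine = wine.data, wine.target
names = np.array(wine.feature_names)

sfm_demo = SelectFromModel(
    RandomForestClassifier(n_estimators=100, random_state=0),
    threshold=0.1)
sfm_demo.fit(X_wine, y_wine)  # fits the forest, then applies the threshold

mask = sfm_demo.get_support()  # True for features with importance >= threshold
importances_demo = sfm_demo.estimator_.feature_importances_

for name, imp in zip(names[mask], importances_demo[mask]):
    print(name, round(float(imp), 4))
```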
