Personal notes on preprocessing machine-learning data with sklearn and numpy.
Rescaling (min-max scaling)
x_i = \frac{x_i-\min(x)}{\max(x)-\min(x)}
import numpy as np
from sklearn import preprocessing

feature = np.array([[-500.5],
                    [-100.1],
                    [0],
                    [100.1],
                    [500.5]])
minmax_scale = preprocessing.MinMaxScaler(feature_range=(0, 1))  # target range [0, 1]
scaled_feature = minmax_scale.fit_transform(feature)
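A quick sanity check (my own snippet, using the same sample data) that `MinMaxScaler` matches the formula above:

```python
import numpy as np
from sklearn import preprocessing

feature = np.array([[-500.5], [-100.1], [0.], [100.1], [500.5]])

# Apply (x - min) / (max - min) directly with numpy
manual = (feature - feature.min()) / (feature.max() - feature.min())

scaled = preprocessing.MinMaxScaler(feature_range=(0, 1)).fit_transform(feature)
print(np.allclose(manual, scaled))  # the two results agree
```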
Standardization
x_i=\frac{x_i-\bar{x}}{\sigma}
scaler = preprocessing.StandardScaler()
standardized = scaler.fit_transform(feature)
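After standardization the column should have zero mean and unit standard deviation; a small check on the same sample data:

```python
import numpy as np
from sklearn import preprocessing

feature = np.array([[-500.5], [-100.1], [0.], [100.1], [500.5]])
standardized = preprocessing.StandardScaler().fit_transform(feature)

# StandardScaler subtracts the mean and divides by the (population) std,
# so the transformed column has mean 0 and std 1
print(standardized.mean(), standardized.std())
```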
Normalization
Mostly used when features are treated as equivalent. norm='l2' is the default; with norm='l1' the absolute values of each vector sum to 1.
\Vert x\Vert_1=\sum_{i=1}^n\lvert x_i\rvert
\Vert x\Vert_2=\sqrt{x_1^2+x_2^2+\dots+x_n^2}
from sklearn.preprocessing import Normalizer
normalizer = Normalizer(norm='l2')
normalizer.transform(feature)
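A small example of the two norms on hand-picked rows (my own sample values):

```python
import numpy as np
from sklearn.preprocessing import Normalizer

features = np.array([[3., 4.],
                     [1., 1.]])

# With norm='l2' every row is rescaled to unit Euclidean length
l2 = Normalizer(norm='l2').transform(features)
print(l2[0])  # [0.6 0.8], since the l2 norm of [3, 4] is 5

# With norm='l1' the absolute values of each row sum to 1
l1 = Normalizer(norm='l1').transform(features)
print(l1[1])  # [0.5 0.5]
```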
Generating polynomial features
from sklearn.preprocessing import PolynomialFeatures
features = np.array([[2, 3],
[2, 3],
[2, 3]])
polynomial_interaction = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
print(polynomial_interaction.fit_transform(features))
degree is the highest polynomial degree, interaction_only=True keeps only the interaction (cross) terms, and include_bias controls the constant column of 1s (excluded here).
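To make the parameters concrete, here is the full degree-2 expansion next to the interaction-only one for a single row (my own example):

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

features = np.array([[2, 3]])

# Full degree-2 expansion: x1, x2, x1^2, x1*x2, x2^2
full = PolynomialFeatures(degree=2, include_bias=False).fit_transform(features)
print(full)         # [[2. 3. 4. 6. 9.]]

# interaction_only=True drops the pure powers x1^2 and x2^2
interaction = PolynomialFeatures(degree=2, interaction_only=True,
                                 include_bias=False).fit_transform(features)
print(interaction)  # [[2. 3. 6.]]
```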
Feature transformation
Use FunctionTransformer from sklearn: define the function first, then apply it.
from sklearn.preprocessing import FunctionTransformer

def add_ten(x):
    return x + 10

ten_transformer = FunctionTransformer(add_ten)
print(ten_transformer.transform(features))
The same transformation can be done with pandas:
import pandas as pd

df = pd.DataFrame(features, columns=["feature_1", "feature_2"])
df.apply(add_ten)
Detecting outliers
sklearn's covariance.EllipticEnvelope fits an ellipse around the data so that outliers fall outside it.
from sklearn.covariance import EllipticEnvelope
from sklearn.datasets import make_blobs
# Generate a small clustered dataset
features, _ = make_blobs(n_samples=10,  # number of samples
                         n_features=2,  # number of features per sample
                         centers=1,     # number of cluster centers
                         random_state=1)  # seed; the same data is generated on every run
# Plant an outlier
features[0, 0] = 10000
features[0, 1] = 10000
# Create an object that detects outliers in Gaussian-distributed data
outlier_detector = EllipticEnvelope(contamination=.1)  # expected proportion of outliers
outlier_detector.fit(features)  # fit to the data
outlier_detector.predict(features)  # return predictions after fitting
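`predict` returns -1 for outliers and 1 for inliers, so the planted point can be recovered with a boolean mask (a self-contained sketch of the same setup):

```python
import numpy as np
from sklearn.covariance import EllipticEnvelope
from sklearn.datasets import make_blobs

features, _ = make_blobs(n_samples=10, n_features=2, centers=1, random_state=1)
features[0] = [10000, 10000]  # plant one outlier

detector = EllipticEnvelope(contamination=.1)
labels = detector.fit(features).predict(features)

# -1 marks outliers, 1 marks inliers; the planted point is flagged
print(labels[0])
print(features[labels == -1])
```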
Handling outliers
Two questions to consider: first, why is the value an outlier; second, how to handle it depends on the goal of the machine-learning task.
Filtering with pandas
import pandas as pd
houses = pd.DataFrame()
houses['Price'] = [534433, 392333, 293222, 4322032]
houses['Bathroom'] = [2, 3.5, 2, 116]
houses['Square_Feet'] = [1500, 2500, 1500, 48000]
print(houses[houses['Bathroom'] < 20])
Flagging with numpy
houses["Outlier"] = np.where(houses['Bathroom'] < 20, 0, 1)  # 0 if within range, 1 otherwise
print(houses)
Dampening the effect: taking the logarithm
houses["Log_Of_Square_Feet"] = [np.log(x) for x in houses["Square_Feet"]]
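Since np.log is already vectorized, the list comprehension can be replaced by a direct call on the column (same data as above):

```python
import numpy as np
import pandas as pd

houses = pd.DataFrame({"Square_Feet": [1500, 2500, 1500, 48000]})

# np.log applies elementwise to the whole Series at once
houses["Log_Of_Square_Feet"] = np.log(houses["Square_Feet"])
print(houses["Log_Of_Square_Feet"].round(2).tolist())  # [7.31, 7.82, 7.31, 10.78]
```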
Discretization
If the data has missing values, drop them first with pandas:
dataframe.dropna()
Binarization first:
from sklearn.preprocessing import Binarizer

age = np.array([[6], [12], [20], [36], [65]])  # example data
binarizer = Binarizer(threshold=18)  # set the threshold
print(binarizer.fit_transform(age))
Discretization into multiple bins:
print(np.digitize(age, bins=[20, 30, 64]))
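np.digitize assigns each value the index of the bin it falls into; by default the bins are left-closed, i.e. [20, 30), [30, 64), [64, ∞). A self-contained example with hypothetical ages:

```python
import numpy as np

age = np.array([[6], [12], [20], [36], [65]])  # hypothetical ages

# Values below 20 get bin 0; 20 itself falls into bin 1 (left-closed bins)
print(np.digitize(age, bins=[20, 30, 64]).ravel())  # [0 0 1 2 3]
```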
Clustering
import pandas as pd
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
# Create a dataset
features, _ = make_blobs(n_samples=50,
                         n_features=2,
                         centers=3,
                         random_state=1)
dataframe = pd.DataFrame(features, columns=["feature_1", "feature_2"])
clusterer = KMeans(n_clusters=3, random_state=0)  # three clusters
clusterer.fit(features)
dataframe["group"] = clusterer.predict(features)
print(dataframe.head(5))
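Since the blobs are well separated, the fitted labels should form exactly three groups; a compact check using fit_predict (same setup, my own verification snippet):

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

features, _ = make_blobs(n_samples=50, n_features=2, centers=3, random_state=1)

# n_init is set explicitly so the snippet behaves the same across sklearn versions
clusterer = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = clusterer.fit_predict(features)  # fit + predict in one call

# Each of the 50 samples gets one of three group labels
print(sorted(pd.Series(labels).value_counts().tolist()))
```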