Machine Learning
I. Overview
1. What is machine learning
- Artificial intelligence: solving, or approximately solving, problems that would otherwise require human intelligence by artificial means.
- Machine learning: a computer program gains experience E from performing task T, and the effect of that experience can be measured by P; if, as T accumulates, the performance E measured by P keeps improving, the program is called a machine learning system.
- Self-improving, self-correcting, self-reinforcing.
2. Why machine learning is needed
- It simplifies or replaces manual pattern recognition, making systems easier to develop, maintain, and upgrade.
- For problems whose algorithms are too complex, or that have no explicit solution, machine learning systems have a natural advantage.
- The learning process can also be run in reverse to infer the rules hidden behind business data: this is data mining.
3. Types of machine learning
- Supervised, unsupervised, semi-supervised, and reinforcement learning
- Batch learning and incremental learning
- Instance-based learning and model-based learning
4. The machine learning workflow
- Data collection
- Data cleaning
- Data preprocessing
(these three steps produce the data)
- Model selection
- Model training
- Model validation
(these three steps produce the machine learning model)
- Model usage
- Maintenance and upgrades
(these two steps serve the business)
II. Data Preprocessing
import sklearn.preprocessing as sp
The sample matrix: one row per sample, one column per feature.
            ------- features -------
            height  weight  age  gender      output
sample 1     1.7     60      25  male     -> 8000
sample 2     1.5     50      20  female   -> 6000
...
The feature columns are the input data; the values after the arrow are the output data.
- Mean removal (standardization)
If feature A varies as 10±5 while feature B varies as 10000±5000, feature A is swamped by feature B.
Adjust each column (feature) of the sample matrix so that its mean becomes 0 and its standard deviation becomes 1. All features then make a roughly equal contribution to the model's predictions, and the model treats every feature evenly.
For one column [a, b, c]:
m = (a+b+c)/3,  s = sqrt(((a-m)^2 + (b-m)^2 + (c-m)^2)/3)
Subtract the mean: [a', b', c'], where
a' = a-m, b' = b-m, c' = c-m
New mean:
m' = (a'+b'+c')/3
   = ((a-m)+(b-m)+(c-m))/3
   = (a+b+c)/3 - m
   = m - m
   = 0
Divide by the standard deviation: [a", b", c"], where
a" = a'/s, b" = b'/s, c" = c'/s
New mean: m" = 0
New standard deviation:
s" = sqrt((a"^2 + b"^2 + c"^2)/3)
   = sqrt((a'^2 + b'^2 + c'^2)/(3s^2))
   = sqrt(((a-m)^2 + (b-m)^2 + (c-m)^2)/(3s^2))
   = sqrt(3s^2/(3s^2))
   = 1
sp.scale(raw sample matrix) -> sample matrix after mean removal
Code: std.py
# -*- coding: utf-8 -*-
from __future__ import unicode_literals
import numpy as np
import sklearn.preprocessing as sp

raw_samples = np.array([
    [3, -1.5, 2, -5.4],
    [0, 4, -0.3, 2.1],
    [1, 3.3, -1.9, -4.3]])
print(raw_samples)
print(raw_samples.mean(axis=0))
print(raw_samples.std(axis=0))
# Manual standardization: shift each column to mean 0, scale to std 1
std_samples = raw_samples.copy()
for col in std_samples.T:
    col_mean = col.mean()
    col_std = col.std()
    col -= col_mean
    col /= col_std
print(std_samples)
print(std_samples.mean(axis=0))
print(std_samples.std(axis=0))
# The same thing with sklearn
std_samples = sp.scale(raw_samples)
print(std_samples)
print(std_samples.mean(axis=0))
print(std_samples.std(axis=0))
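An aside beyond the notes: sp.scale standardizes in one shot, while sklearn.preprocessing also provides StandardScaler, which learns the column statistics in fit() so the identical transform can be reapplied to new data. A minimal sketch, reusing raw_samples from above:

import numpy as np
import sklearn.preprocessing as sp

raw_samples = np.array([
    [3, -1.5, 2, -5.4],
    [0, 4, -0.3, 2.1],
    [1, 3.3, -1.9, -4.3]])
# StandardScaler remembers each column's mean and std from fit(),
# so the same shift and scale can later be applied to test data
scaler = sp.StandardScaler()
std_samples = scaler.fit_transform(raw_samples)
print(std_samples.mean(axis=0))  # ~0 for every column
print(std_samples.std(axis=0))   # ~1 for every column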
- Range scaling
Example: scores of 90/150, 80/100, and 5/5 live on different scales and only become comparable after being mapped onto a common range.
Apply a linear transform to each column of the sample matrix so that the elements of every column fall within the same target interval.
Find k and b in the linear transform y = kx + b from:
k * col_min + b = min
k * col_max + b = max
In matrix form:
/ col_min  1 \     / k \     / min \
\ col_max  1 /  x  \ b /  =  \ max /
------a------     --x--     ---b---
x = np.linalg.solve(a, b)
  = np.linalg.lstsq(a, b)[0]
scaler = sp.MinMaxScaler(feature_range=(min, max))
scaler.fit_transform(raw sample matrix)
-> sample matrix after range scaling
Range scaling onto the target interval [0, 1] is sometimes also called "normalization".
Code: mms.py
# -*- coding: utf-8 -*-
from __future__ import unicode_literals
import numpy as np
import sklearn.preprocessing as sp

raw_samples = np.array([
    [3, -1.5, 2, -5.4],
    [0, 4, -0.3, 2.1],
    [1, 3.3, -1.9, -4.3]])
print(raw_samples)
# Manual range scaling: per column, solve k*min+b=0, k*max+b=1
mms_samples = raw_samples.copy()
for col in mms_samples.T:
    col_min = col.min()
    col_max = col.max()
    a = np.array([
        [col_min, 1],
        [col_max, 1]])
    b = np.array([0, 1])
    x = np.linalg.solve(a, b)
    col *= x[0]
    col += x[1]
print(mms_samples)
# The same thing with sklearn
mms = sp.MinMaxScaler(feature_range=(0, 1))
mms_samples = mms.fit_transform(raw_samples)
print(mms_samples)
- Normalization
        Python  C/C++  Java  PHP
2016      20     30     40    10    (out of 100)
2017      30     20     30    10    (out of 90)
2018      10      5      1     0    (out of 16)
The yearly totals differ, so the raw counts are only comparable as proportions.
Divide each feature value of a sample by the sum of the absolute values of all that sample's feature values, expressing every feature as a proportion.
sp.normalize(raw sample matrix, norm='l1')
-> sample matrix after normalization
l1 - the L1 norm: the sum of the absolute values of the vector's elements
l2 - the L2 norm: the square root of the sum of the squares of the elements
...
ln - the Ln norm: the nth root of the sum of the nth powers of the absolute values of the elements
Code: nor.py
# -*- coding: utf-8 -*-
from __future__ import unicode_literals
import numpy as np
import sklearn.preprocessing as sp

raw_samples = np.array([
    [3, -1.5, 2, -5.4],
    [0, 4, -0.3, 2.1],
    [1, 3.3, -1.9, -4.3]])
print(raw_samples)
# Manual L1 normalization: divide each row by its absolute sum
nor_samples = raw_samples.copy()
for row in nor_samples:
    row_absum = abs(row).sum()
    row /= row_absum
print(nor_samples)
print(abs(nor_samples).sum(axis=1))  # each row now sums to 1
# The same thing with sklearn
nor_samples = sp.normalize(raw_samples, norm='l1')
print(nor_samples)
print(abs(nor_samples).sum(axis=1))
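The notes' code demonstrates 'l1' only; the same call also accepts norm='l2'. A minimal sketch, reusing raw_samples from above, verifying that each row's squared elements then sum to 1:

import numpy as np
import sklearn.preprocessing as sp

raw_samples = np.array([
    [3, -1.5, 2, -5.4],
    [0, 4, -0.3, 2.1],
    [1, 3.3, -1.9, -4.3]])
# L2 normalization divides each row by its Euclidean length
l2_samples = sp.normalize(raw_samples, norm='l2')
print(l2_samples)
print((l2_samples ** 2).sum(axis=1))  # each row: ~1.0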
- Binarization
Given a threshold, set every element of the sample matrix above the threshold to 1 and every other element to 0, yielding a matrix made up entirely of 1s and 0s.
binarizer = sp.Binarizer(threshold=threshold)
binarizer.transform(raw sample matrix)
-> sample matrix after binarization
Code: bin.py
# -*- coding: utf-8 -*-
from __future__ import unicode_literals
import numpy as np
import sklearn.preprocessing as sp

raw_samples = np.array([
    [3, -1.5, 2, -5.4],
    [0, 4, -0.3, 2.1],
    [1, 3.3, -1.9, -4.3]])
print(raw_samples)
# Manual binarization against the threshold 1.4
bin_samples = raw_samples.copy()
bin_samples[bin_samples <= 1.4] = 0
bin_samples[bin_samples > 1.4] = 1
print(bin_samples)
# The same thing with sklearn
bin = sp.Binarizer(threshold=1.4)
bin_samples = bin.transform(raw_samples)
print(bin_samples)
- One-hot encoding
Encode each feature value as a sequence containing exactly one 1 and some number of 0s. This keeps every detail of the sample matrix while producing a sparse matrix of only 1s and 0s, which improves the model's robustness and saves memory.
Raw sample matrix (4 samples, 3 features):
1 3 2
7 5 4
1 8 6
7 3 9
Per-column code tables, one code per distinct value:
column 1: 1 -> 10, 7 -> 01
column 2: 3 -> 100, 5 -> 010, 8 -> 001
column 3: 2 -> 1000, 4 -> 0100, 6 -> 0010, 9 -> 0001
Encoded sample matrix (codes concatenated per row):
101001000
010100100
100010010
011000001
encoder = sp.OneHotEncoder(
    sparse=whether to return a sparse matrix (default True), dtype=element type)
encoder.fit_transform(raw sample matrix)
-> sample matrix after one-hot encoding
Code: ohe.py
# -*- coding: utf-8 -*-
from __future__ import unicode_literals
import numpy as np
import sklearn.preprocessing as sp

raw_samples = np.array([
    [1, 3, 2],
    [7, 5, 4],
    [1, 8, 6],
    [7, 3, 9]])
print(raw_samples)
# Build the list of code tables, one per column
code_tables = []
for col in raw_samples.T:
    # Code table for a single column
    code_table = {}
    for val in col:
        code_table[val] = None
    code_tables.append(code_table)
# Fill in the code for every value in every code table
for code_table in code_tables:
    size = len(code_table)
    for one, key in enumerate(sorted(code_table.keys())):
        code_table[key] = np.zeros(shape=size, dtype=int)
        code_table[key][one] = 1
# One-hot encode the raw sample matrix using the code tables
ohe_samples = []
for raw_sample in raw_samples:
    ohe_sample = np.array([], dtype=int)
    for i, key in enumerate(raw_sample):
        ohe_sample = np.hstack(
            (ohe_sample, code_tables[i][key]))
    ohe_samples.append(ohe_sample)
ohe_samples = np.array(ohe_samples)
print(ohe_samples)
# The same encoding with sklearn
ohe = sp.OneHotEncoder(sparse=False, dtype=int)
ohe_samples = ohe.fit_transform(raw_samples)
print(ohe_samples)
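A point the manual version leaves implicit: the code tables learned in fit() can be reused on new samples, provided every value was seen during fitting. A small sketch under that assumption (new_sample is illustrative, not from the notes):

import numpy as np
import sklearn.preprocessing as sp

raw_samples = np.array([
    [1, 3, 2],
    [7, 5, 4],
    [1, 8, 6],
    [7, 3, 9]])
ohe = sp.OneHotEncoder(sparse=False, dtype=int)
ohe.fit(raw_samples)                 # learn the per-column code tables
new_sample = np.array([[7, 5, 2]])   # every value appeared during fit
print(ohe.transform(new_sample))     # -> [[0 1 0 1 0 1 0 0 0]]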
- Label encoding
Text-valued features -> numeric features.
The codes come from the dictionary (sorted) order of the label strings and carry no meaning of their own, e.g. for a car column: audi -> 0, bmw -> 1, ford -> 2, toyota -> 3.
Example text features:
position    car
staff       toyota
team lead   ford
manager     audi
boss        bmw
encoder = sp.LabelEncoder()
encoder.fit_transform(raw sample array)
-> label-encoded sample array
encoder.inverse_transform(label-encoded sample array)
-> raw sample array
Code: lab.py
# -*- coding: utf-8 -*-
from __future__ import unicode_literals
import numpy as np
import sklearn.preprocessing as sp

raw_samples = np.array([
    'audi', 'ford', 'audi', 'toyota',
    'ford', 'bmw', 'toyota', 'bmw'])
print(raw_samples)
lbe = sp.LabelEncoder()
# Codes follow the sorted order of the distinct labels:
# audi=0, bmw=1, ford=2, toyota=3
lbe_samples = lbe.fit_transform(raw_samples)
print(lbe_samples)
# Map the integer codes back to the original labels
raw_samples = lbe.inverse_transform(lbe_samples)
print(raw_samples)
III. Basic Problems in Machine Learning
- Regression: given inputs and outputs known to be distributed over a continuous domain, train a model repeatedly to find the relation between them. That relation can usually be formalized as a function, e.g. y = w0 + w1x + w2x^2 + ...; when an input with unknown output arrives, the function predicts the corresponding continuous output.
- Classification: a regression problem whose output domain is discrete rather than continuous.
- Clustering: look for some pattern, such as similarity, among the known inputs, partition the inputs into clusters according to that pattern, and assign new inputs to clusters by the same rule.
- Dimensionality reduction: from a large number of features, select the few most critical to the model's predictions, lowering the dimensionality of the input samples and improving model performance.
IV. Univariate Linear Regression
- Prediction function
input  output
  0      1
  1      3
  2      5
  3      7
  4      9
 ...
y = 1 + 2x, so for input 10 the prediction is 21.
In general: y = w0 + w1 x
The task is to find the model parameters w0 and w1 that capture the relation between input and output.
- Single-sample error
x -> [y' = w0 + w1 x] -> y'; given the true output y, e = 1/2 (y - y')^2
- Total sample error
E = Σ 1/2 (y - y')^2
- Loss function
Loss(w0, w1) = Σ 1/2 (y - (w0 + w1 x))^2
The task becomes finding the model parameters w0 and w1 that minimize the loss function.
- Gradient descent
1. Pick a random initial pair of model parameters w0, w1.
2. Compute the gradient of the loss at the current parameters:
   [∂Loss/∂w0, ∂Loss/∂w1]
3. Form a correction step against the gradient direction:
   [-η ∂Loss/∂w0, -η ∂Loss/∂w1]
4. Compute the next pair of parameters:
   w0 = w0 - η ∂Loss/∂w0
   w1 = w1 - η ∂Loss/∂w1
5. Repeat from step 2 until a termination condition is met:
   enough iterations have run,
   the loss is small enough, or
   the loss has stopped decreasing noticeably.
Loss = Σ 1/2 (y - y')^2, where y' = w0 + w1 x
∂Loss/∂w0
= Σ ∂(1/2 (y - y')^2)/∂w0
= Σ (y - y') ∂(y - y')/∂w0
= Σ (y - y') (∂y/∂w0 - ∂y'/∂w0)
= -Σ (y - y') ∂y'/∂w0
= -Σ (y - y')
∂Loss/∂w1
= Σ ∂(1/2 (y - y')^2)/∂w1
...
= -Σ (y - y') ∂y'/∂w1
= -Σ (y - y') x
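As a quick sanity check on the derivation (not part of the original notes), the analytic gradient can be compared against central finite differences on the same training data used in gd.py below:

import numpy as np

train_x = np.array([0.5, 0.6, 0.8, 1.1, 1.4])
train_y = np.array([5.0, 5.5, 6.0, 6.8, 7.0])

def loss(w0, w1):
    # Σ 1/2 (y - y')^2 with y' = w0 + w1 x
    return (((train_y - (w0 + w1 * train_x)) ** 2) / 2).sum()

w0, w1, eps = 1.0, 1.0, 1e-6
# Analytic gradient from the derivation above
d0 = -(train_y - (w0 + w1 * train_x)).sum()
d1 = -((train_y - (w0 + w1 * train_x)) * train_x).sum()
# Central differences should agree to several decimal places
n0 = (loss(w0 + eps, w1) - loss(w0 - eps, w1)) / (2 * eps)
n1 = (loss(w0, w1 + eps) - loss(w0, w1 - eps)) / (2 * eps)
print(d0, n0)
print(d1, n1)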
Code: gd.py
# -*- coding: utf-8 -*-
from __future__ import unicode_literals
import numpy as np
import matplotlib.pyplot as mp
from mpl_toolkits.mplot3d import axes3d

train_x = np.array([0.5, 0.6, 0.8, 1.1, 1.4])
train_y = np.array([5.0, 5.5, 6.0, 6.8, 7.0])
n_epoches = 1000
lrate = 0.01
epoches, losses = [], []
w0, w1 = [1], [1]
# Batch gradient descent on Loss = Σ 1/2 (y - y')^2
for epoch in range(1, n_epoches + 1):
    epoches.append(epoch)
    losses.append(((train_y - (
        w0[-1] + w1[-1] * train_x)) ** 2 / 2).sum())
    print('{:4}> w0={:.8f}, w1={:.8f}, loss={:.8f}'.format(
        epoches[-1], w0[-1], w1[-1], losses[-1]))
    d0 = -(train_y - (w0[-1] + w1[-1] * train_x)).sum()
    d1 = -((train_y - (w0[-1] + w1[-1] * train_x)) * train_x).sum()
    w0.append(w0[-1] - lrate * d0)
    w1.append(w1[-1] - lrate * d1)
w0 = np.array(w0[:-1])
w1 = np.array(w1[:-1])
sorted_indices = train_x.argsort()
test_x = train_x[sorted_indices]
test_y = train_y[sorted_indices]
pred_test_y = w0[-1] + w1[-1] * test_x
# Loss surface over a grid of (w0, w1) for the 3D and contour plots
grid_w0, grid_w1 = np.meshgrid(
    np.linspace(0, 9, 500),
    np.linspace(0, 3.5, 500))
flat_w0, flat_w1 = grid_w0.ravel(), grid_w1.ravel()
flat_loss = (((flat_w0 + np.outer(
    train_x, flat_w1)) - train_y.reshape(
    -1, 1)) ** 2).sum(axis=0) / 2
grid_loss = flat_loss.reshape(grid_w0.shape)
mp.figure('Linear Regression', facecolor='lightgray')
mp.title('Linear Regression', fontsize=20)
mp.xlabel('x', fontsize=14)
mp.ylabel('y', fontsize=14)
mp.tick_params(labelsize=10)
mp.grid(linestyle=':')
mp.scatter(train_x, train_y, marker='s', c='dodgerblue',
           alpha=0.5, s=80, label='Training')
mp.scatter(test_x, test_y, marker='D', c='orangered',
           alpha=0.5, s=60, label='Testing')
mp.scatter(test_x, pred_test_y, c='orangered',
           alpha=0.5, s=60, label='Predicted')
for x, y, pred_y in zip(test_x, test_y, pred_test_y):
    mp.plot([x, x], [y, pred_y], c='orangered',
            alpha=0.5, linewidth=1)
mp.plot(test_x, pred_test_y, '--', c='limegreen',
        label='Regression', linewidth=1)
mp.legend()
mp.figure('Training Progress', facecolor='lightgray')
mp.subplot(311)
mp.title('Training Progress', fontsize=20)
mp.ylabel('w0', fontsize=14)
mp.gca().xaxis.set_major_locator(mp.MultipleLocator(100))
mp.tick_params(labelsize=10)
mp.grid(linestyle=':')
mp.plot(epoches, w0, c='dodgerblue', label='w0')
mp.legend()
mp.subplot(312)
mp.ylabel('w1', fontsize=14)
mp.gca().xaxis.set_major_locator(mp.MultipleLocator(100))
mp.tick_params(labelsize=10)
mp.grid(linestyle=':')
mp.plot(epoches, w1, c='limegreen', label='w1')
mp.legend()
mp.subplot(313)
mp.xlabel('epoch', fontsize=14)
mp.ylabel('loss', fontsize=14)
mp.gca().xaxis.set_major_locator(mp.MultipleLocator(100))
mp.tick_params(labelsize=10)
mp.grid(linestyle=':')
mp.plot(epoches, losses, c='orangered', label='loss')
mp.legend()
mp.tight_layout()
mp.figure('Loss Function')
ax = mp.gca(projection='3d')
mp.title('Loss Function', fontsize=20)
ax.set_xlabel('w0', fontsize=14)
ax.set_ylabel('w1', fontsize=14)
ax.set_zlabel('loss', fontsize=14)
mp.tick_params(labelsize=10)
ax.plot_surface(grid_w0, grid_w1, grid_loss,
                rstride=10, cstride=10, cmap='jet')
ax.plot(w0, w1, losses, 'o-', c='orangered', label='BGD')
mp.legend()
mp.figure('Batch Gradient Descent', facecolor='lightgray')
mp.title('Batch Gradient Descent', fontsize=20)
mp.xlabel('w0', fontsize=14)
mp.ylabel('w1', fontsize=14)
mp.tick_params(labelsize=10)
mp.grid(linestyle=':')
mp.contourf(grid_w0, grid_w1, grid_loss, 1000, cmap='jet')
cntr = mp.contour(grid_w0, grid_w1, grid_loss, 10,
                  colors='black', linewidths=0.5)
mp.clabel(cntr, inline_spacing=0.1, fmt='%.2f', fontsize=8)
mp.plot(w0, w1, 'o-', c='orangered', label='BGD')
mp.legend()
mp.show()
import sklearn.linear_model as lm
model = lm.LinearRegression()
model.fit(known inputs, known outputs)  # compute the model parameters
model.predict(new inputs) -> new outputs
Code: line.py
# -*- coding: utf-8 -*-
from __future__ import unicode_literals
import numpy as np
import sklearn.linear_model as lm
import sklearn.metrics as sm
import matplotlib.pyplot as mp

# Each line of single.txt holds a comma-separated "x,y" pair
x, y = [], []
with open('../../data/single.txt', 'r') as f:
    for line in f.readlines():
        data = [float(substr) for substr in line.split(',')]
        x.append(data[:-1])
        y.append(data[-1])
x = np.array(x)
y = np.array(y)
model = lm.LinearRegression()
model.fit(x, y)
pred_y = model.predict(x)
# R2 score: the closer to 1, the better the fit
print(sm.r2_score(y, pred_y))
mp.figure('Linear Regression', facecolor='lightgray')
mp.title('Linear Regression', fontsize=20)
mp.xlabel('x', fontsize=14)
mp.ylabel('y', fontsize=14)
mp.tick_params(labelsize=10)
mp.grid(linestyle=':')
mp.scatter(x, y, c='dodgerblue', alpha=0.75,
           s=60, label='Sample')
sorted_indices = x.ravel().argsort()
mp.plot(x[sorted_indices], pred_y[sorted_indices],
        c='orangered', label='Regression')
mp.legend()
mp.show()
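An aside, not from the notes: the line-by-line parsing above can be shortened with np.loadtxt, assuming single.txt holds comma-separated rows as described:

import numpy as np

# Equivalent, shorter way to read the "x,y" pairs used above
data = np.loadtxt('../../data/single.txt', delimiter=',')
x = data[:, :-1]  # every column except the last is input
y = data[:, -1]   # the last column is the output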
Dumping and loading models: pickle
Code: dump.py
# -*- coding: utf-8 -*-
from __future__ import unicode_literals
import pickle
import numpy as np
import sklearn.linear_model as lm
import sklearn.metrics as sm

x, y = [], []
with open('../../data/single.txt', 'r') as f:
    for line in f.readlines():
        data = [float(substr) for substr in line.split(',')]
        x.append(data[:-1])
        y.append(data[-1])
x = np.array(x)
y = np.array(y)
model = lm.LinearRegression()
model.fit(x, y)
pred_y = model.predict(x)
# R2 score: the closer to 1, the better the fit
print(sm.r2_score(y, pred_y))
# Serialize the trained model to disk
with open('../../data/linear.pkl', 'wb') as f:
    pickle.dump(model, f)
# (plotting code identical to line.py above, omitted here)
Code: load.py
# -*- coding: utf-8 -*-
from __future__ import unicode_literals
import pickle
import numpy as np
import sklearn.metrics as sm

x, y = [], []
with open('../../data/single.txt', 'r') as f:
    for line in f.readlines():
        data = [float(substr) for substr in line.split(',')]
        x.append(data[:-1])
        y.append(data[-1])
x = np.array(x)
y = np.array(y)
# Deserialize the model trained and dumped by dump.py
with open('../../data/linear.pkl', 'rb') as f:
    model = pickle.load(f)
pred_y = model.predict(x)
# R2 score: the closer to 1, the better the fit
print(sm.r2_score(y, pred_y))
# (plotting code identical to line.py above, omitted here)
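A labeled aside beyond the notes: besides pickle, scikit-learn's documentation also suggests joblib for persisting models that carry large NumPy arrays. A minimal self-contained sketch, assuming joblib is installed; the data and file name are illustrative only:

import joblib
import numpy as np
import sklearn.linear_model as lm

model = lm.LinearRegression()
model.fit(np.array([[0.0], [1.0], [2.0]]),
          np.array([1.0, 3.0, 5.0]))
# Same dump/load idea as pickle, often faster for models
# holding large NumPy arrays (file name is illustrative)
joblib.dump(model, 'linear_joblib.pkl')
model = joblib.load('linear_joblib.pkl')
print(model.predict(np.array([[3.0]])))  # ~[7.]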
V. Ridge Regression
- Loss(w0, w1) = Σ 1/2 (y - (w0 + w1 x))^2 + regularization_strength * f(w0, w1)
- Regularization adds a penalty term to the loss function, weakening how tightly the model parameters match the training data, so that the few abnormal samples that deviate clearly from the normal range cannot distort the regression result.
Code: ridge.py
# -*- coding: utf-8 -*-
from __future__ import unicode_literals
import numpy as np
import sklearn.linear_model as lm
import matplotlib.pyplot as mp

x, y = [], []
with open('../../data/abnormal.txt', 'r') as f:
    for line in f.readlines():
        data = [float(substr) for substr in line.split(',')]
        x.append(data[:-1])
        y.append(data[-1])
x = np.array(x)
y = np.array(y)
# Ordinary linear regression, pulled toward the outliers
model1 = lm.LinearRegression()
model1.fit(x, y)
pred_y1 = model1.predict(x)
# Ridge regression with regularization strength 300
model2 = lm.Ridge(300, fit_intercept=True)
model2.fit(x, y)
pred_y2 = model2.predict(x)
mp.figure('Linear & Ridge Regression', facecolor='lightgray')
mp.title('Linear & Ridge Regression', fontsize=20)
mp.xlabel('x', fontsize=14)
mp.ylabel('y', fontsize=14)
mp.tick_params(labelsize=10)
mp.grid(linestyle=':')
mp.scatter(x, y, c='dodgerblue', alpha=0.75,
           s=60, label='Sample')
sorted_indices = x.ravel().argsort()
mp.plot(x[sorted_indices], pred_y1[sorted_indices],
        c='orangered', label='Linear')
mp.plot(x[sorted_indices], pred_y2[sorted_indices],
        c='limegreen', label='Ridge')
mp.legend()
mp.show()
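To see what the regularization strength does, a small sketch on made-up data (x, y, and the alpha values are illustrative only, not from the course files): larger alpha shrinks the slope, so a single outlier moves the fitted line less:

import numpy as np
import sklearn.linear_model as lm

# Made-up data: the last point is an outlier
x = np.array([[1], [2], [3], [4], [5]], dtype=float)
y = np.array([1.0, 2.0, 3.0, 4.0, 12.0])
for alpha in (0, 10, 100):
    model = lm.Ridge(alpha, fit_intercept=True)
    model.fit(x, y)
    # Larger alpha shrinks the slope, reducing how far
    # the outlier can pull the fit
    print(alpha, model.coef_, model.intercept_)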
VI. Polynomial Regression
- Multivariate linear: y = w0 + w1 x1 + w2 x2 + w3 x3 + ... + wn xn
Substituting x1 = x^1, x2 = x^2, ..., xn = x^n turns it into a
univariate polynomial: y = w0 + w1 x + w2 x^2 + w3 x^3 + ... + wn x^n
x -> polynomial feature expander -x1...xn-> linear regressor -> w0...wn
\______________________________________________________________/
                           pipeline
Code: poly.py
# -*- coding: utf-8 -*-
from __future__ import unicode_literals
import numpy as np
import sklearn.pipeline as pl
import sklearn.preprocessing as sp
import sklearn.linear_model as lm
import sklearn.metrics as sm
import matplotlib.pyplot as mp

train_x, train_y = [], []
with open('../../data/single.txt', 'r') as f:
    for line in f.readlines():
        data = [float(substr) for substr in line.split(',')]
        train_x.append(data[:-1])
        train_y.append(data[-1])
train_x = np.array(train_x)
train_y = np.array(train_y)
# Pipeline: expand x into degree-10 polynomial features,
# then fit a linear regressor on the expanded features
model = pl.make_pipeline(sp.PolynomialFeatures(10),
                         lm.LinearRegression())
model.fit(train_x, train_y)
pred_train_y = model.predict(train_x)
print(sm.r2_score(train_y, pred_train_y))
test_x = np.linspace(train_x.min(), train_x.max(),
                     1000).reshape(-1, 1)
pred_test_y = model.predict(test_x)
mp.figure('Polynomial Regression', facecolor='lightgray')
mp.title('Polynomial Regression', fontsize=20)
mp.xlabel('x', fontsize=14)
mp.ylabel('y', fontsize=14)
mp.tick_params(labelsize=10)
mp.grid(linestyle=':')
mp.scatter(train_x, train_y, c='dodgerblue',
           alpha=0.75, s=60, label='Sample')
mp.plot(test_x, pred_test_y, c='orangered',
        label='Regression')
mp.legend()
mp.show()
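To make the expansion step concrete, PolynomialFeatures can be inspected on its own; by default it prepends a bias column of ones:

import numpy as np
import sklearn.preprocessing as sp

# Degree-3 expansion of a single input x = 2:
# a bias column, then x^1, x^2, x^3
pf = sp.PolynomialFeatures(3)
print(pf.fit_transform(np.array([[2.0]])))  # [[1. 2. 4. 8.]]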