Machine Learning Study Notes 2

L3 Features

Encode data 

Convert features into real numbers.

What is a feature? Any data we can obtain (other than the label).

old features: x   new features: \phi(x)

Encode categorical data

Idea: encode each category as a unique binary number (rarely used).

e.g. 

                 \phi_d   \phi_{d+1}   \phi_{d+2}
nurse              0         0            0
admin              0         0            1
pharmacist         0         1            0
doctor             0         1            1
social worker      1         0            1
Idea: encode each category as a unique 0-1 vector ("one-hot encoding"); suitable when the categories have no obvious ordering or relationship.

e.g.

                 \phi_d   \phi_{d+1}   \phi_{d+2}   \phi_{d+3}   \phi_{d+4}
nurse              1         0            0            0            0
admin              0         1            0            0            0
pharmacist         0         0            1            0            0
doctor             0         0            0            1            0
social worker      0         0            0            0            1
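
A minimal NumPy sketch of one-hot encoding, assuming the five job categories above (the helper name one_hot is illustrative, not from the notes):

import numpy as np

categories = ['nurse', 'admin', 'pharmacist', 'doctor', 'social worker']
index = {c: i for i, c in enumerate(categories)}  # category -> column position

def one_hot(value):
    # 0-1 vector with a single 1 at the category's position
    phi = np.zeros(len(categories))
    phi[index[value]] = 1.0
    return phi

print(one_hot('doctor'))   # [0. 0. 0. 1. 0.]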

Idea: factored encoding (a single data point reflects several independent factors).

e.g.

                  \phi_d   \phi_{d+1}
pain                1         0
pain & blockers     1         1
blockers            0         1
no medications      0         0

Encode ordinal data

Idea: unary/thermometer code (suitable when the ordered values differ substantially from one another).

e.g. Likert scale

Strongly disagree   1,0,0,0,0
Disagree            1,1,0,0,0
Neutral             1,1,1,0,0
Agree               1,1,1,1,0
Strongly agree      1,1,1,1,1
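
A minimal Python sketch of thermometer encoding for this scale (the function name thermometer is illustrative):

import numpy as np

levels = ['Strongly disagree', 'Disagree', 'Neutral', 'Agree', 'Strongly agree']

def thermometer(value):
    # the first (rank + 1) positions are 1, the rest are 0
    rank = levels.index(value)
    phi = np.zeros(len(levels))
    phi[:rank + 1] = 1.0
    return phi

print(thermometer('Neutral'))   # [1. 1. 1. 0. 0.]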

Encode numerical data

Idea: standardization

for the d-th feature:   \phi_{d}^{(k)}=\frac{x_{d}^{(k)}-\text{mean}_d}{\text{stddev}_d}
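
A minimal NumPy sketch of this standardization, assuming a small illustrative data matrix X:

import numpy as np

X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])  # rows = points, columns = features

mean = X.mean(axis=0)      # per-feature mean_d
stddev = X.std(axis=0)     # per-feature stddev_d
X_std = (X - mean) / stddev

print(X_std.mean(axis=0))  # ~0 in every column
print(X_std.std(axis=0))   # 1 in every column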

Nonlinear boundaries (data that is not linearly separable)

Idea: approximate a smooth function with an order-k Taylor-style polynomial basis.
order (k)   terms when d=1            terms for general d
0           [1]                       [1]
1           [1, x_1]                  [1, x_1, ..., x_d]
2           [1, x_1, x_1^2]           [1, x_1, ..., x_d, x_1^2, x_1 x_2, ..., x_{d-1} x_d, x_d^2]
3           [1, x_1, x_1^2, x_1^3]    [1, x_1, ..., x_d, x_1^2, x_1 x_2, ..., x_{d-1} x_d, x_d^2, x_1^3, x_1^2 x_2, x_1 x_2 x_3, ..., x_d^3]
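
scikit-learn's PolynomialFeatures builds this kind of polynomial basis; a short sketch with an illustrative point (d = 2, k = 2):

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[2.0, 3.0]])           # one point with d = 2 features
poly = PolynomialFeatures(degree=2)  # order k = 2
print(poly.fit_transform(X))         # [[1. 2. 3. 4. 6. 9.]] = [1, x_1, x_2, x_1^2, x_1 x_2, x_2^2]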

Evaluation of a learning algorithm

Idea: train on all of the data, then measure the training error.
Idea: hold out some of the data for testing.
  • More training data: closer to training on the full data set
  • More testing data: less noisy estimate of performance
  • Only one classifier might not be representative
  • Good idea to shuffle the order of the data

Pseudocode:

Cross-validate(D_n, k)   # cross-validation

        Divide D_n into k chunks D_{n,1},...,D_{n,k} (of roughly equal size)

        for i = 1 to k

                train h_i on D_n \setminus D_{n,i} (i.e. on everything except chunk i)

                compute "test" error \varepsilon(h_i, D_{n,i}) of h_i on D_{n,i}

        Return   \frac{1}{k}\sum_{i=1}^k \varepsilon(h_i, D_{n,i})
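
A direct Python translation of this pseudocode, as a minimal sketch: train and error are placeholders standing in for any learning algorithm and error measure.

import numpy as np

def cross_validate(X, y, k, train, error):
    # shuffle, then split the data into k roughly equal chunks
    idx = np.random.permutation(len(X))
    chunks = np.array_split(idx, k)
    errors = []
    for i in range(k):
        test_idx = chunks[i]
        train_idx = np.concatenate([chunks[j] for j in range(k) if j != i])
        h = train(X[train_idx], y[train_idx])              # train on all chunks except chunk i
        errors.append(error(h, X[test_idx], y[test_idx]))  # "test" error on chunk i
    return np.mean(errors)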

A scikit-learn example (K-nearest neighbors, using 5-fold cross-validation to choose K):

import matplotlib.pyplot as plt
from sklearn import neighbors
import numpy as np
import pandas as pd
from sklearn import model_selection
df1 = pd.read_csv(r'iris.csv')
print(df1.head())  # show the first five rows
predictors = df1.columns[:-1]
x_train,x_test,y_train,y_test=model_selection.train_test_split(
    df1[predictors],df1.Species,
    test_size=0.5,
    random_state = 1234
)
# candidate K values to test: 1 .. ceil(log2(n))
K = np.arange(1,np.ceil(np.log2(df1.shape[0])))
print(K)
# empty list to store the mean accuracy for each K
accuracy = []
# 5-fold cross-validation
for k in K:
    cv_result = model_selection.cross_val_score \
        (neighbors.KNeighborsClassifier(n_neighbors=int(k),
                                        weights='distance'),
         x_train, y_train, cv=5, scoring='accuracy')
    accuracy.append(cv_result.mean())

# pick the K with the highest mean accuracy
arg_max = np.array(accuracy).argmax()
# plot accuracy against K, marking the best value
plt.plot(K,accuracy)
plt.scatter(K,accuracy)
plt.text(K[arg_max],accuracy[arg_max],'best K = %s' % int(K[arg_max]))
plt.show()

L4 Logistic regression

Recap: the perceptron has trouble with data that is not linearly separable.

Sigmoid/logistic function

\sigma(z)=\frac{1}{1+\exp(-z)}

g(x)=\sigma(\theta^{\top}x+\theta_0)=\frac{1}{1+\exp\{-(\theta^{\top}x+\theta_0)\}}
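
A tiny NumPy sketch of \sigma and the classifier output g (the parameter values are illustrative):

import numpy as np

def sigmoid(z):
    # logistic function: maps any real number into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

theta = np.array([1.0, -2.0])   # illustrative parameters
theta_0 = 0.5

def g(x):
    # predicted probability that the label is +1
    return sigmoid(theta @ x + theta_0)

print(g(np.array([3.0, 1.0])))  # sigmoid(1.5) ~ 0.818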

Linear logistic classification

How do we obtain a classifier? Learn \theta, \theta_0.

Probability(data)

=\prod_{i=1}^n \mathrm{Probability}(\text{data point } i)\\ =\prod_{i=1}^n \begin{cases} g^{(i)} & \text{if } y^{(i)}=+1 \\ 1-g^{(i)} & \text{otherwise} \end{cases}\\ =\prod_{i=1}^n (g^{(i)})^{1\{y^{(i)}=+1\}}(1-g^{(i)})^{1\{y^{(i)}\neq +1\}}

Loss(data) = -\frac{1}{n}\log \mathrm{Probability}(data)

=\frac{1}{n}\sum_{i=1}^n -\left(1\{y^{(i)}=+1\}\log g^{(i)}+1\{y^{(i)}\neq+1\}\log(1-g^{(i)})\right)

Negative log likelihood loss:

L_{nll}(g,a) = -\left(1\{a=+1\}\log g+1\{a\neq+1\}\log(1-g)\right)

Learn \theta, \theta_0 (i.e., to minimize the average loss):

J_{lr}(\Theta )=J_{lr}(\theta,\theta_0)=\frac{1}{n}\sum_{i=1}^nL_{nll}(\sigma(\theta^{\top}x^{(i)}+\theta_0),y^{(i)})
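
A minimal NumPy sketch of evaluating J_{lr} (variable names are illustrative; labels are assumed to be in \{-1,+1\}):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def J_lr(theta, theta_0, X, y):
    # X: n x d data matrix; y: labels in {-1, +1}
    g = sigmoid(X @ theta + theta_0)   # predicted P(y = +1) for each point
    p = np.where(y == 1, g, 1 - g)     # probability assigned to the true label
    return -np.mean(np.log(p))         # average negative log likelihood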

Gradient descent

Gradient: \nabla_{\Theta}f=\left[\frac{\partial f}{\partial \Theta_1},\dots,\frac{\partial f}{\partial \Theta_m}\right]^{\top}

Pseudocode:

Gradient-Descent(\Theta_{init},\eta,f,\nabla_\Theta f,\epsilon)

        Initialize \Theta^{(0)}=\Theta_{init}

        Initialize t = 0

        repeat

                t = t + 1

                \Theta^{(t)}=\Theta^{(t-1)}-\eta\nabla_\Theta f(\Theta^{(t-1)})

        until   |f(\Theta^{(t)})-f(\Theta^{(t-1)})|<\epsilon

        Return \Theta^{(t)}

Other possible stopping criteria (a NumPy sketch of the loop follows this list):

  1. A maximum number of iterations T is reached
  2. |\Theta^{(t)}-\Theta^{(t-1)}|<\epsilon
  3. ||\nabla_{\Theta}f(\Theta^{(t)})||<\epsilon
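
A minimal NumPy sketch of the pseudocode above, combining stopping criterion 1 (a cap on iterations) with the objective-change test; f and grad_f are placeholders for any differentiable objective and its gradient:

import numpy as np

def gradient_descent(theta_init, eta, f, grad_f, eps, max_iter=10000):
    theta = theta_init
    for t in range(max_iter):                       # criterion 1: iteration cap
        theta_new = theta - eta * grad_f(theta)     # step against the gradient
        if abs(f(theta_new) - f(theta)) < eps:      # stop when the objective barely changes
            return theta_new
        theta = theta_new
    return theta

# illustrative use: minimize f(x) = (x - 3)^2, converges to ~3
print(gradient_descent(0.0, 0.1, lambda x: (x - 3)**2, lambda x: 2*(x - 3), 1e-10))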

Logistic regression learning algorithm

(The \lambda\theta^{(t-1)} term in the \theta update below is the gradient of a ridge regularization term \frac{\lambda}{2}\|\theta\|^2 added to J_{lr}.)

LR-Gradient-Descent(\theta_{init},\theta_{0,init},\eta,\lambda,\epsilon)

        Initialize \theta^{(0)}=\theta_{init}

        Initialize \theta_0^{(0)}=\theta_{0,init}

        Initialize t = 0

        repeat

                t = t + 1

                \theta^{(t)}=\theta^{(t-1)}-\eta\{ \frac{1}{n}\sum_{i=1}^n[\sigma(\theta^{(t-1)\top}x^{(i)}+\theta_0^{(t-1)})-y^{(i)}]x^{(i)}+\lambda\theta^{(t-1)} \}

                \theta_0^{(t)}=\theta_0^{(t-1)}-\eta\{ \frac{1}{n}\sum_{i=1}^n[\sigma(\theta^{(t-1)\top}x^{(i)}+\theta_0^{(t-1)})-y^{(i)}] \}

        until   |J_{lr}(\theta^{(t)},\theta_0^{(t)})-J_{lr}(\theta^{(t-1)},\theta_0^{(t-1)})|<\epsilon

        Return \theta^{(t)},\theta_0^{(t)}
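
A minimal NumPy sketch of LR-Gradient-Descent, using a fixed iteration cap instead of the \epsilon test for brevity; labels are assumed to be in \{0,1\} so that [\sigma(\cdot)-y^{(i)}] matches the indicator 1\{y^{(i)}=+1\} (map ±1 labels to 0/1 first):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lr_gradient_descent(X, y, eta=0.1, lam=0.0, max_iter=10000):
    # X: n x d data matrix; y: labels in {0, 1}
    n, d = X.shape
    theta = np.zeros(d)
    theta_0 = 0.0
    for t in range(max_iter):
        err = sigmoid(X @ theta + theta_0) - y               # sigma(theta^T x^(i) + theta_0) - y^(i)
        theta = theta - eta * (X.T @ err / n + lam * theta)  # theta update with ridge term
        theta_0 = theta_0 - eta * np.mean(err)               # theta_0 update
    return theta, theta_0

The scikit-learn example below fits logistic regression to the iris data and draws its decision boundary: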

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from matplotlib.colors import ListedColormap

iris = load_iris()
X = iris.data[:,:2] # keep only the first two features so the boundary can be visualized
y = iris.target

# split into training and test sets
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=42)

# train the logistic regression model
lr = LogisticRegression()
lr.fit(X_train,y_train)

# predict on the test set
y_pred = lr.predict(X_test)

# compute the accuracy
accuracy = accuracy_score(y_test,y_pred)
print('Accuracy:',accuracy)

# evaluate the model on a grid to draw the decision boundary
x_min,x_max = X[:,0].min() - 0.5,X[:,0].max() + 0.5
y_min,y_max = X[:,1].min() - 0.5,X[:,1].max() + 0.5
xx,yy = np.meshgrid(np.arange(x_min,x_max,0.01),np.arange(y_min,y_max,0.01))
Z = lr.predict(np.c_[xx.ravel(),yy.ravel()])
Z = Z.reshape(xx.shape)

# plot the predicted regions and the data points
plt.figure(figsize=(10,6))
plt.contourf(xx,yy,Z,alpha=0.8,cmap=ListedColormap(('red','green','blue')))
plt.scatter(X[:,0],X[:,1],c=y,cmap=ListedColormap(('red','green','blue')))
plt.xlabel('Sepal Length')
plt.ylabel('Sepal Width')
plt.title('LR decision boundary')
plt.show()

Output:

Accuracy: 0.9


 
