3-线性模型部分题解

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import time

# This is a bit of magic to make matplotlib figures appear inline in the notebook
# rather than in a new window.
%matplotlib inline
plt.rcParams['figure.figsize'] = (10.0, 8.0) # set default size of plots
plt.rcParams['image.interpolation'] = 'nearest'
plt.rcParams['image.cmap'] = 'gray'
plt.rcParams["font.sans-serif"]=["SimHei"] #设置字体
plt.rcParams["axes.unicode_minus"]=False #该语句解决图像中的“-”负号的乱码问题



# Some more magic so that the notebook will reload external python modules;
# see http://stackoverflow.com/questions/1907993/autoreload-of-modules-in-ipython
%load_ext autoreload
%autoreload 2

3.1 偏置项

什么情况下式(3.2)中不需要考虑偏置项

  • 已经知道 y y y x x x 的正比例函数(经过原点

3.2 证明凸性

证明参数 w w w,对率回归的目标函数非凸,但对数似然函数是凸的

对率回归:
y = 1 1 + e − ( w T x + b ) (3.18) y = \frac{1}{1+e^{-(w^Tx+b)}} \tag{3.18} y=1+e(wTx+b)1(3.18)

对数似然函数
ℓ ( β ) = ∑ i = 1 m ( − y i β T x ^ i + ln ⁡ ( 1 + e β T x ^ i ) ) (3.27) \ell(\boldsymbol{\beta})=\sum_{i=1}^{m}\left(-y_{i} \boldsymbol{\beta}^{\mathrm{T}} \hat{\boldsymbol{x}}_{i}+\ln \left(1+e^{\boldsymbol{\beta}^{\mathrm{T}} \hat{\boldsymbol{x}}_{i}}\right)\right) \tag{3.27} (β)=i=1m(yiβTx^i+ln(1+eβTx^i))(3.27)

对实数集上的函数,可通过求二阶导数来在判别,若二阶导数在区间中非负,则称为凸函数,若二阶导数在区间上恒大于 0,则称为严格凸函数.(p54)

现在对对率回归的 w w w 求二阶偏导:
∂ ∂ w ( 1 1 + e − w T x + b ) e − w T x + b x ( 1 + e − w T x + b ) 2 \frac{\partial}{\partial w}\left(\frac{1}{1+e^{-w^T x+b}}\right) \frac{e^{-w^Tx+b_{x}}}{\left(1+e^{-w^T x +b}\right)^{2}} w(1+ewTx+b1)(1+ewTx+b)2ewTx+bx
二阶偏导小于0,因此非凸
∂ ∂ w 2 = − x 2 e − w T x + b × [ 1 + e − ( w T + b ) ] 2 [ 1 − e − ( w T + b ) ] 4 < 0 \frac{\partial}{\partial w^2} = \frac{-x^2 e^{-w^T x+b}\times [1+e^{-(w^T+b)}]^2}{[1-e^{-(w^T+b)}]^4} < 0 w2=[1e(wT+b)]4x2ewTx+b×[1+e(wT+b)]2<0

对对数似然函数的 w w w 求二阶偏导

书上(3.30)和(3.31写了)
β = ( w ; b ) \beta = (w;b) β=(w;b)
image.png

  1. p 1 ( x ^ i ; β ) p_1(\hat{x}_i;\beta) p1(x^i;β)是标量
  2. 原式化成 ∑ i = 1 m x ^ i x ^ i T ∗ f ( x ^ i ) = ∑ i = 1 m x ^ i ∗ f ( x ^ i ) ∗ x ^ i T \sum_{i=1}^{m}{\hat{x}_i\hat{x}_i^T*f(\hat{x}_i)} = \sum_{i=1}^{m}{\hat{x}_i*f(\hat{x}_i)*\hat{x}_i^T} i=1mx^ix^iTf(x^i)=i=1mx^if(x^i)x^iT
  3. 换成矩阵形式: Q = X P X T Q = XPX^T Q=XPXT
  4. 概率都大于0小于1, P P P是一个半正定对角矩阵,Q和P合同,那么Q也半正定。得证
  5. 证明4: a T Q a = ( a T X T ) P ( X a ) = ( b ) T P ( b ) a^TQa = (a^TX^T)P(Xa) = (b)^TP(b) aTQa=(aTXT)P(Xa)=(b)TP(b)

3.3 对率回归

编程实现对率回归,并给出西瓜数据集 3.0 α 3.0 \alpha 3.0α 上的结果.

# 下载西瓜3.0数据集
!curl -o "../datasets/data3.0.csv" "https://gist.githubusercontent.com/zhenghaoz/abb86b21797b6274b0fcb1dfc96274b2/raw/552f4ec4c6269a8db34ec32486e355508247991b/data3.0.csv"
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  1098  100  1098    0     0   1839      0 --:--:-- --:--:-- --:--:--  1842
# 本地加载数据集
def load_data():
    # 加载数据
    # 数据来自:https://gist.githubusercontent.com/zhenghaoz/abb86b21797b6274b0fcb1dfc96274b2
    data = pd.read_csv("../datasets/data3.0.csv")
    # 清理缺失值
    data = data.dropna()
    data = data.drop(columns=['编号']) # 编号是不是好瓜没关系,但是好瓜都排在前面了
    data = data.drop(columns=['色泽', '根蒂', '敲声', '纹理', '脐部', '触感']) # 不是数值类型,比较难写
    # 将类别数据转成数值数据
    # data = pd.get_dummies(data, columns=[])
    # 随机打乱数据
    data = data.sample(frac=1, random_state=11408)
    X = data.loc[:, data.columns!= '好瓜']
    y = data.loc[:, ['好瓜']]
    y['好瓜'] = y['好瓜'].apply(lambda x: 1 if x == '是' else 0)
    X['偏置'] = 1
    return X, y

牛顿法

# 使用牛顿法实现逻辑回归
X, y = load_data()
X = X.values
y = y.values.reshape(-1, 1)
# print(X.shape, y.shape)


def sigmoid(x):return 1 / (1 + np.exp(-x))

num_iter = 100
history = []
theta = np.zeros(X.shape[1])
print(theta.shape)
# 一阶导
for i in range(num_iter):
    h = sigmoid(X @ theta) # 计算预测值
    # 一阶导
    d1 = -np.sum(X.T @ (y-h), axis=1)
    # 二阶导
    d2 = X.T @ np.diag(h * (1-h)) @ X
    theta = theta - np.linalg.inv(d2) @ d1
    history.append(theta.copy())
    pass
history

结果出现了奇异矩阵,只能用拟牛顿法了😿

纯numpy实现(效果很差)

X, y = load_data()
X.head(3)
密度含糖率偏置
60.4810.1491
10.7740.3761
90.2430.2671
y.head(3)
好瓜
61
11
90
# 可视化X,y
data = pd.read_csv('./data3.0.csv')

yes = data[data['好瓜'].isin(['是'])]
no = data[data['好瓜'].isin(['否'])]
fig, ax = plt.subplots(figsize=(12, 8))
ax.scatter(yes['密度'], yes['含糖率'], marker='o', c='b', label='好')
ax.scatter(no['密度'], no['含糖率'], marker='x', c='r', label='坏')
ax.legend()
ax.set_xlabel('密度')
ax.set_ylabel('糖度')
plt.show()


png

X, y = load_data()
X = X.values
y = y.values

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def J_beta(beta, X, y):
    m = len(y)
    h = sigmoid(X.dot(beta))
    # J = -1/m * (y.T.dot(np.log(h)) + (1-y).T.dot(np.log(1-h)))
    J = -y.T.dot(h) + np.log(1+np.exp(h)).sum()
    J /= m
    return J

def gradient_descent(theta, X, y, alpha, num_iters):
    m = len(y)
    J_history = np.zeros(num_iters)
    for i in range(num_iters):
        h = sigmoid(X.dot(theta))
        grad = (1/m) * X.T.dot(h - y)
        theta -= alpha * grad
        J_history[i] = J_beta(theta, X, y)
    return theta, J_history

alpha = 0.01
num_iters = 1000

# initialize parameters
theta = np.ones(X.shape[1])
theta = theta[:, np.newaxis]
# print(theta.shape)
theta, loss_history = gradient_descent(theta, X, y, alpha=alpha, num_iters=num_iters)
theta
    array([[ 0.39798959],
           [ 0.97555173],
           [-0.42082941]])
for i in range(0, len(loss_history), 50): print(loss_history[i])
    0.7993369746992927
    0.7914154932178077
    0.7835123605293576
    0.7759488329836969
    0.7689676608126914
    0.7627121701621775
    0.7572335864662013
    0.7525143317634295
    0.7484945689951914
    0.7450942978572639
    0.742228531516716
    0.7398161877287649
    0.7377844231503892
    0.7360701109799301
    0.7346197353108638
    0.7333885260020683
    0.7323393143356833
    0.7314413639369537
    0.7306692965348054
    0.7300021574303889

# 进行预测
def predict(theta, X):
    prob = sigmoid(np.dot(X, theta))
    label = [1 if p >= 0.5 else 0 for p in prob]
    return label

y_pred = predict(theta, X)
print("预测结果:", y_pred)

# 计算准确率
accuracy = np.mean(y_pred == y)
print("准确率:", accuracy)
    预测结果: [0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0]
    准确率: 0.5051903114186851

不知道哪里有问题,先质疑一下数据集😠😠😠

sklearn实现

太强了sklearn 😭😭😭

# 使用sklearn求解

# 使用sklearn的LogisticRegression模型进行二分类

from sklearn.linear_model import LogisticRegression

model = LogisticRegression()

X, y = load_data()
model.fit(X.values, y.values)
# 模型参数
print(model.coef_, model.intercept_)

print("score:", model.score(X, y))
    [[2.89072129e-01 4.94602388e-01 2.63708520e-07]] [-0.37719608]
    score: 0.6470588235294118

3.4 交叉验证 & 留出法

选择两个 UCI 数据集,比较 10 折交叉验证法和留法所估计出的对率回归的错误率.

数据集地址:UCI Machine Learning Repository: Iris Data Set

下载数据集

# 下载鸢尾花数据集
!curl -o "../datasets/iris.csv" "https://gist.githubusercontent.com/netj/8836201/raw/6f9306ad21398ea43cba4f7d537619d0e07d5ae3/iris.csv"
!curl -o "../datasets/Dry_Bean_Dataset.csv" "https://raw.githubusercontent.com/jiisa-k/Dry-Bean-Dataset/main/Dry_Bean_Dataset.csv"
      % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                     Dload  Upload   Total   Spent    Left  Speed
    100  3975  100  3975    0     0   5850      0 --:--:-- --:--:-- --:--:--  5845
    [proxychains] config file found: /etc/proxychains.conf
    [proxychains] preloading /usr/lib/libproxychains4.so
    [proxychains] DLL init: proxychains-ng 4.16
      % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                     Dload  Upload   Total   Spent    Left  Speed
      0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0[proxychains] Strict chain  ...  127.0.0.1:20170  ...  raw.githubusercontent.com:443  ...  OK
    100 2426k  100 2426k    0     0  1370k      0  0:00:01  0:00:01 --:--:-- 1370k

sklearn解法

iris = pd.read_csv("../datasets/iris.csv")
iris = iris.sample(frac=1, random_state=11408)
iris.head()
sepal.lengthsepal.widthpetal.lengthpetal.widthvariety
315.43.41.50.4Setosa
1346.12.65.61.4Virginica
55.43.91.70.4Setosa
344.93.11.50.2Setosa
856.03.44.51.6Versicolor
from sklearn.linear_model import LinearRegression
from sklearn import preprocessing
le = preprocessing.LabelEncoder()

iris['variety'] = le.fit_transform(iris['variety'])
X = iris.iloc[:, :-1]
y = iris.iloc[:, -1]


model = LinearRegression()
model.fit(X.values, y.values)
print("coefficients:", model.coef_)
print("score:", model.score(X, y))
    coefficients: [-0.11190585 -0.04007949  0.22864503  0.60925205]
    score: 0.9303939218549564

# 10折交叉验证
num_folds = 10

X['偏置'] = 1
X_train_folds = np.array_split(X.values, num_folds)
y_train_folds = np.array_split(y.values, num_folds)

thetas = []
acc_history = []

for i in range(num_folds):
    X_train = np.concatenate(X_train_folds[:i] + X_train_folds[i+1:])
    y_train = np.concatenate(y_train_folds[:i] + y_train_folds[i+1:])
    X_val = X_train_folds[i]
    y_val = y_train_folds[i]
    
    # 训练模型
    model = LinearRegression()
    model.fit(X_train, y_train)
    
    # 预测验证集
    y_pred = model.predict(X_val)
    thetas.append(np.array(model.coef_))
    # 计算准确率
    acc = np.sum(np.round(y_pred) == y_val) / len(y_val)
    acc_history.append(acc)
    print(f"Fold {i} - Accuracy: {acc}")

# 计算平均参数
mean_theta = np.mean(thetas, axis=0)
print(f"Mean theta: {mean_theta}")
# k折的准确率平均值
mean_acc = np.mean(acc_history)
print(f"Mean Accuracy: {mean_acc}")
# 计算平均参数拟合的准确率
y_pred = np.dot(X.values, mean_theta)
acc = np.sum(np.round(y_pred) == y.values) / len(y)
print(f"Mean theta Accuracy: {acc}")
    Fold 0 - Accuracy: 0.9333333333333333
    Fold 1 - Accuracy: 1.0
    Fold 2 - Accuracy: 0.9333333333333333
    Fold 3 - Accuracy: 1.0
    Fold 4 - Accuracy: 0.9333333333333333
    Fold 5 - Accuracy: 1.0
    Fold 6 - Accuracy: 0.9333333333333333
    Fold 7 - Accuracy: 0.9333333333333333
    Fold 8 - Accuracy: 1.0
    Fold 9 - Accuracy: 0.9333333333333333
    Mean theta: [-0.11122036 -0.04091899  0.2275572   0.6106891   0.        ]
    Mean Accuracy: 0.96
    Mean theta Accuracy: 0.9266666666666666

纯numpy解法

# TODO

3.5 线性判别分析(LDA)

编辑实现线性判别分析,并给出西瓜数据集 3.0 α 3.0α 3.0α 上的结果.

data = pd.read_csv("../datasets/data3.0.csv")
# 清理缺失值
data = data.dropna()
data = data.drop(columns=['编号']) # 编号是不是好瓜没关系,但是好瓜都排在前面了
data = data.drop(columns=['色泽', '根蒂', '敲声', '纹理', '脐部', '触感']) # 不是数值类型,比较难写
data = data.sample(frac=1, random_state=11408)
data['好瓜'] = data['好瓜'].apply(lambda x: 1 if x == '是' else 0)
X0 = np.array(data[data['好瓜'].isin([0])].iloc[:, 0:2], dtype='float')
X1 = np.array(data[data['好瓜'].isin([1])].iloc[:, 0:2], dtype='float')
data.head()
密度含糖率好瓜
60.4810.1491
10.7740.3761
90.2430.2670
130.6570.1980
80.6660.0910
# 线性判别分析(Linear Discriminant Analysis,LDA)
def LDA(X0, X1):
    mu_0 = X0.mean(axis=0, keepdims=True).T
    mu_1 = X1.mean(axis=0, keepdims=True).T
    sigma_0 = (X0 - mu_0.T).T @ (X0 - mu_0.T)
    sigma_1 = (X1 - mu_1.T).T @ (X1 - mu_1.T)
    S_w = sigma_0 + sigma_1
    w = np.linalg.inv(S_w) @ (mu_0 - mu_1)
    return w

w = LDA(X0, X1)
w
    array([[-0.14650982],
           [-0.73871557]])
# 可视化

# 计算投影位置
map_pos = []
def get_map_pos(w, X):
    # X是数据
    # w是投影方向
    # 计算投影位置
    projx = np.dot(X, w) # x在w方向上的投影距离
    ans = w * projx[:, np.newaxis]  # 使用广播将 w 乘以每个投影距离
    return ans.squeeze() # 去掉维度为1的轴

# 绘制投影点
p0 = get_map_pos(w, X0)
plt.scatter(p0[:, 0], p0[:, 1],c='r')
p1 = get_map_pos(w, X1)
plt.scatter(p1[:, 0], p1[:, 1],c='b')

# 绘制原始数据点
plt.scatter(X0[:, 0], X0[:, 1],c='r')
plt.scatter(X1[:, 0], X1[:, 1],c='b')

# # ax.xlabel('X1')
# # ax.ylabel('X2')
# # ax.title('Projection of X onto w')
plt.show()


png

3.6 线性判别分析-非线性可分

线性判别分析仅在线性可分数据上能获得理想结果?试设计一个改进方法,使其能较好地周于非线性可分数据

# 这里想到 广义线性模型
from sklearn import datasets
import matplotlib.pyplot as plt

# 创建非线性数据集
X, y = datasets.make_circles(n_samples=100, factor=0.5, noise=0.1, random_state=42)

# plt.contourf(xx, yy, Z, cmap=plt.cm.RdBu, alpha=0.8)
plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.RdBu)
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('SVM with RBF Kernel')
plt.show()


png

# X是一个 (100, 2) 的矩阵,每一行代表一个点的坐标
r = np.sqrt(X[:,0]**2 + X[:,1]**2)
# 构造新特征矩阵 Z
Z = np.hstack((X[:, 0].reshape(-1,1), r.reshape(-1,1)))

# LDA判别
Z1 = Z[y==1]
Z0 = Z[y==0]

w = LDA(Z0, Z1)
w
    array([[-0.00970806],
           [ 0.42604301]])
# 画图

# 绘制投影点
p0 = get_map_pos(w, Z0)
plt.scatter(p0[:, 0], p0[:, 1],c='r')
p1 = get_map_pos(w, Z1)
plt.scatter(p1[:, 0], p1[:, 1],c='b')

# 绘制原始数据点
plt.scatter(Z0[:, 0], Z0[:, 1],c='r')
plt.scatter(Z1[:, 0], Z1[:, 1],c='b')

# # ax.xlabel('X1')
# # ax.ylabel('X2')
# # ax.title('Projection of X onto w')
plt.show()


png
​ 又不知道哪不对😵‍💫,但是还是能看出来分类好了

3.7 海明距离与 ECOC二元码井证明

令码长为 9,类别数为 4,试给出海明距离意义下理论最优的 ECOC二元码井证明之.

  1. 最大程度地提高类别之间的Hamming距离
  2. 同时具有良好的纠错性能。
# 算了很久还没算出来 TAT python单跑完2**36就要一小时了
""" 
构造了一个每个类之间hamming距离都是6的
1111.1111.1
1110.0000.0
0000.1010.1
0001.0101.0
# 可能存在距离都是7的情况
"""
def has_duplicate_samples(X):
    # 判断是否有重复样本
    _, unique_indices = np.unique(X, axis=0, return_index=True)
    return X.shape[0] != unique_indices.shape[0]

def hamming_distance(X):
    if has_duplicate_samples(X):
        return 0  # 存在重复样本,返回0
    
    # 计算两两样本之间的汉明距离矩阵
    diff_matrix = X[:, np.newaxis, :] != X[np.newaxis, :, :]
    
    # 计算非零元素的数量,即不同的位数
    dist = np.sum(diff_matrix, axis=2)
    
    # 返回总的汉明距离
    return np.sum(dist)


            
ans = 0
ans_x = None
# 穷举所有可能的x
for i in range(2**(4*9)):
        # 构造x
        bstr = format(i, f'0{4*9}b')
        x = np.array(list(map(int, bstr)))
        # 将x转成 9*4的矩阵
        x = x.reshape(4,9)
        # 计算此时x的hamming距离
        dist = hamming_distance(x) #hamming距离
        if dist > ans:
            # print(x, "\n", hamming_distance(x))
            print(dist, end=" ")
            ans = dist
            ans_x = x.copy()
print(ans, "\n", ans_x)

3.8*

ECOC 编码能起到理想纠错作用的重要条件是:在每一位编码上出错的概率相当且独立.试析多分类任务经 ECOC 编码后产生的二类分类器满足该条件的可能性及由此产生的影响.

3.9 OvR 和 MvM 无需处理类别不平衡性

使用 OvR 和 MvM 将多分类任务分解为二分类任务求解时,试述为何无需专门针对类别不平衡性进行处理.

  1. OvO不行:最终结果由投票产生,将预测最多的结果作为最终结果
  2. OvR:将每个类别的样本都作为一个类别,训练出一个分类器。但是会选择置信度高的。
    • 找到个反例:A占95个、B占3个、C占2个。
    • 如果训练出3个分类器,每个都会预测出现次数多的,那么这时OvR就不能消除不平衡问题了
  3. MvM:分解成若干个子问题,每个子问题都是一个二分类问题。最后有一步汇总,返回距离最小

--p66

3.10* 多分类代价敏感学习再缩放条件

试推导出多分类代价敏感学习(仅考虑基于类别的误分类代价)使用"再缩放"能获得理论最优解的条件.

  • 12
    点赞
  • 7
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值