3 - Linear Models: Solutions to Selected Exercises

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import time

# This is a bit of magic to make matplotlib figures appear inline in the notebook
# rather than in a new window.
%matplotlib inline
plt.rcParams['figure.figsize'] = (10.0, 8.0) # set default size of plots
plt.rcParams['image.interpolation'] = 'nearest'
plt.rcParams['image.cmap'] = 'gray'
plt.rcParams["font.sans-serif"]=["SimHei"] #设置字体
plt.rcParams["axes.unicode_minus"]=False #该语句解决图像中的“-”负号的乱码问题



# Some more magic so that the notebook will reload external python modules;
# see http://stackoverflow.com/questions/1907993/autoreload-of-modules-in-ipython
%load_ext autoreload
%autoreload 2

3.1 The bias term

Under what circumstances does the bias term $b$ in Eq. (3.2) not need to be considered?

  • When $y$ is already known to be directly proportional to $x$ (the line passes through the origin), so the intercept is 0.

3.2 Proving convexity

Prove that, with respect to the parameter $w$, the objective function (3.18) of logistic regression is non-convex, while the log-likelihood function (3.27) is convex.

Logistic regression:

$$y = \frac{1}{1+e^{-(w^{\mathrm{T}}x+b)}} \tag{3.18}$$

Log-likelihood function:

$$\ell(\boldsymbol{\beta})=\sum_{i=1}^{m}\left(-y_{i} \boldsymbol{\beta}^{\mathrm{T}} \hat{\boldsymbol{x}}_{i}+\ln \left(1+e^{\boldsymbol{\beta}^{\mathrm{T}} \hat{\boldsymbol{x}}_{i}}\right)\right) \tag{3.27}$$

For a real-valued function, convexity can be judged from the second derivative: if the second derivative is non-negative over an interval, the function is convex there; if it is strictly positive, the function is strictly convex (p. 54).

Now take the partial derivatives of the logistic regression output with respect to $w$ (treating $x$ as fixed). Writing $z = w^{\mathrm{T}}x+b$:

$$\frac{\partial y}{\partial w}=\frac{e^{-z}}{\left(1+e^{-z}\right)^{2}}\,x = y(1-y)\,x$$

$$\frac{\partial^{2} y}{\partial w\,\partial w^{\mathrm{T}}} = y(1-y)(1-2y)\,xx^{\mathrm{T}}$$

The factor $1-2y$ is positive when $z<0$ and negative when $z>0$, so the Hessian is not positive semi-definite everywhere; hence (3.18) is non-convex in $w$ (it is neither convex nor concave). See the numeric check below.
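A quick numerical check of this sign change (my sketch; scalar case, i.e. $x=1$, $b=0$, so the sign of the Hessian is the sign of $\sigma''(z)$):

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def d2_sigmoid(z):
    # second derivative of sigmoid(z): s(1-s)(1-2s)
    s = sigmoid(z)
    return s * (1 - s) * (1 - 2 * s)

# The second derivative takes both signs, so sigmoid(w*x+b) is neither convex nor concave in w
print(d2_sigmoid(-2.0))  # > 0
print(d2_sigmoid(+2.0))  # < 0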

Now take the second derivative of the log-likelihood with respect to $\beta$, where $\beta = (w; b)$. The book's Eqs. (3.30) and (3.31) give

$$\frac{\partial^{2} \ell(\boldsymbol{\beta})}{\partial \boldsymbol{\beta} \partial \boldsymbol{\beta}^{\mathrm{T}}}=\sum_{i=1}^{m} \hat{\boldsymbol{x}}_{i} \hat{\boldsymbol{x}}_{i}^{\mathrm{T}} p_{1}\left(\hat{\boldsymbol{x}}_{i} ; \boldsymbol{\beta}\right)\left(1-p_{1}\left(\hat{\boldsymbol{x}}_{i} ; \boldsymbol{\beta}\right)\right) \tag{3.31}$$

  1. $p_1(\hat{x}_i;\beta)$ is a scalar; write $f(\hat{x}_i) = p_1(\hat{x}_i;\beta)\left(1-p_1(\hat{x}_i;\beta)\right)$.
  2. The sum can be rearranged as $\sum_{i=1}^{m}\hat{x}_i\hat{x}_i^{\mathrm{T}} f(\hat{x}_i) = \sum_{i=1}^{m}\hat{x}_i f(\hat{x}_i) \hat{x}_i^{\mathrm{T}}$.
  3. In matrix form: $Q = XPX^{\mathrm{T}}$, where the columns of $X$ are the $\hat{x}_i$ and $P=\operatorname{diag}\left(f(\hat{x}_1),\dots,f(\hat{x}_m)\right)$.
  4. Each probability lies strictly between 0 and 1, so $P$ is a positive-semi-definite diagonal matrix; $Q$ is congruent to $P$, hence $Q$ is positive semi-definite as well, and $\ell(\beta)$ is convex. QED. (Numeric check below.)
  5. Proof of step 4: for any vector $a$, $a^{\mathrm{T}}Qa = (a^{\mathrm{T}}X)P(X^{\mathrm{T}}a) = b^{\mathrm{T}}Pb \ge 0$ with $b = X^{\mathrm{T}}a$.
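A quick numerical sanity check of the congruence argument in steps 3-5 (my sketch with random data; the columns of X play the role of the $\hat{x}_i$):

import numpy as np

rng = np.random.default_rng(0)
m, d = 17, 3
X = rng.standard_normal((d, m))                 # columns are the x-hat_i
p = 1 / (1 + np.exp(-rng.standard_normal(m)))   # probabilities p1 in (0, 1)
P = np.diag(p * (1 - p))                        # positive diagonal matrix
Q = X @ P @ X.T                                 # the Hessian form from step 3
print(np.linalg.eigvalsh(Q).min() >= -1e-12)    # True: Q is positive semi-definite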

3.3 Logistic regression

Implement logistic regression and report its results on watermelon dataset 3.0α.

# Download watermelon dataset 3.0
!curl -o "../datasets/data3.0.csv" "https://gist.githubusercontent.com/zhenghaoz/abb86b21797b6274b0fcb1dfc96274b2/raw/552f4ec4c6269a8db34ec32486e355508247991b/data3.0.csv"
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  1098  100  1098    0     0   1839      0 --:--:-- --:--:-- --:--:--  1842
# Load the dataset from local disk
def load_data():
    # Data source: https://gist.githubusercontent.com/zhenghaoz/abb86b21797b6274b0fcb1dfc96274b2
    data = pd.read_csv("../datasets/data3.0.csv")
    # Drop missing values
    data = data.dropna()
    data = data.drop(columns=['编号']) # the ID says nothing about the label, but the good melons all come first in the file
    data = data.drop(columns=['色泽', '根蒂', '敲声', '纹理', '脐部', '触感']) # non-numeric columns, dropped for simplicity
    # (the categorical columns could instead be one-hot encoded, e.g. with pd.get_dummies)
    # Shuffle the rows
    data = data.sample(frac=1, random_state=11408)
    X = data.loc[:, data.columns!= '好瓜']
    y = data.loc[:, ['好瓜']]
    y['好瓜'] = y['好瓜'].apply(lambda x: 1 if x == '是' else 0)
    X['偏置'] = 1 # bias column
    return X, y

Newton's method

# Logistic regression via Newton's method
X, y = load_data()
X = X.values
y = y.values.ravel() # flatten to shape (m,) so that y - h below stays a vector
# print(X.shape, y.shape)


def sigmoid(x): return 1 / (1 + np.exp(-x))

num_iter = 100
history = []
theta = np.zeros(X.shape[1])
print(theta.shape)
for i in range(num_iter):
    h = sigmoid(X @ theta) # predicted probabilities
    # first derivative (gradient), cf. Eq. (3.30)
    d1 = -X.T @ (y - h)
    # second derivative (Hessian), cf. Eq. (3.31)
    d2 = X.T @ np.diag(h * (1 - h)) @ X
    theta = theta - np.linalg.inv(d2) @ d1
    history.append(theta.copy())
history

This run produced a singular matrix 😿. (Worth noting: in the original code `y` was an $(m,1)$ column, so `y - h` silently broadcast to an $m \times m$ matrix; even with the shapes fixed as above, the Hessian can become singular once the probabilities saturate, in which case a quasi-Newton method is the fallback.)
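Following up on the quasi-Newton idea: a minimal sketch using scipy's BFGS (assuming scipy is available; BFGS is a quasi-Newton method, so no explicit Hessian inversion is needed). It minimizes Eq. (3.27) directly:

from scipy.optimize import minimize

def neg_log_likelihood(beta, X, y):
    # Eq. (3.27): sum_i ( -y_i * beta^T x_i + ln(1 + exp(beta^T x_i)) )
    z = X @ beta
    return -(y * z).sum() + np.logaddexp(0, z).sum()  # logaddexp(0, z) = ln(1 + e^z), numerically stable

X, y = load_data()
X, y = X.values, y.values.ravel()
res = minimize(neg_log_likelihood, np.zeros(X.shape[1]), args=(X, y), method='BFGS')
print(res.x)  # fitted beta = (w; b)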

Pure NumPy implementation (poor results)

X, y = load_data()
X.head(3)
   密度    含糖率  偏置
6  0.481  0.149     1
1  0.774  0.376     1
9  0.243  0.267     1
y.head(3)
   好瓜
6    1
1    1
9    0
# Visualize X, y
data = pd.read_csv('../datasets/data3.0.csv')

yes = data[data['好瓜'].isin(['是'])]
no = data[data['好瓜'].isin(['否'])]
fig, ax = plt.subplots(figsize=(12, 8))
ax.scatter(yes['密度'], yes['含糖率'], marker='o', c='b', label='好')
ax.scatter(no['密度'], no['含糖率'], marker='x', c='r', label='坏')
ax.legend()
ax.set_xlabel('密度')
ax.set_ylabel('含糖率')
plt.show()


(figure: scatter plot of 密度 vs 含糖率, good melons vs bad melons)

X, y = load_data()
X = X.values
y = y.values

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def J_beta(beta, X, y):
    # Eq. (3.27), averaged over the m samples.
    # Note: the formula uses z = beta^T x directly, not sigmoid(z).
    m = len(y)
    z = X.dot(beta)
    J = -y.T.dot(z) + np.log(1 + np.exp(z)).sum()
    return (J / m).item()

def gradient_descent(theta, X, y, alpha, num_iters):
    m = len(y)
    J_history = np.zeros(num_iters)
    for i in range(num_iters):
        h = sigmoid(X.dot(theta))
        grad = (1/m) * X.T.dot(h - y)
        theta -= alpha * grad
        J_history[i] = J_beta(theta, X, y)
    return theta, J_history

alpha = 0.01
num_iters = 1000

# initialize parameters
theta = np.ones(X.shape[1])
theta = theta[:, np.newaxis]
# print(theta.shape)
theta, loss_history = gradient_descent(theta, X, y, alpha=alpha, num_iters=num_iters)
theta
    array([[ 0.39798959],
           [ 0.97555173],
           [-0.42082941]])
for i in range(0, len(loss_history), 50): print(loss_history[i])
    0.7993369746992927
    0.7914154932178077
    0.7835123605293576
    0.7759488329836969
    0.7689676608126914
    0.7627121701621775
    0.7572335864662013
    0.7525143317634295
    0.7484945689951914
    0.7450942978572639
    0.742228531516716
    0.7398161877287649
    0.7377844231503892
    0.7360701109799301
    0.7346197353108638
    0.7333885260020683
    0.7323393143356833
    0.7314413639369537
    0.7306692965348054
    0.7300021574303889

# Make predictions
def predict(theta, X):
    prob = sigmoid(np.dot(X, theta))
    label = [1 if p >= 0.5 else 0 for p in prob]
    return label

y_pred = predict(theta, X)
print("预测结果:", y_pred)

# Compute accuracy (y_pred is a Python list, y is an (m, 1) column; align the shapes first)
accuracy = np.mean(np.array(y_pred) == y.ravel())
print("准确率:", accuracy)
    预测结果: [0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0]
    准确率: 0.5051903114186851

Something is wrong somewhere; my first instinct was to blame the dataset 😠😠😠 — but the real culprit is the accuracy line: comparing a length-$m$ Python list against an $(m,1)$ column vector broadcasts to an $m \times m$ matrix, whose mean is the meaningless 0.505 printed above (fixed in the code; a small demo follows).
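A minimal demonstration of that broadcasting pitfall:

import numpy as np

y_col = np.array([[1], [0], [1]]) # shape (3, 1), like y.values
preds = [1, 0, 0]                 # plain Python list, like predict(...)
print((preds == y_col).shape)     # (3, 3): every prediction compared with every label
print(np.mean(preds == y_col))    # not an accuracy at all
print(np.mean(np.array(preds) == y_col.ravel())) # the intended accuracy: 2/3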

sklearn implementation

sklearn is just too good 😭😭😭

# Solve with sklearn

# Binary classification with sklearn's LogisticRegression

from sklearn.linear_model import LogisticRegression

model = LogisticRegression()

X, y = load_data()
model.fit(X.values, y.values.ravel()) # ravel avoids sklearn's column-vector warning
# Model parameters
print(model.coef_, model.intercept_)

print("score:", model.score(X, y))
    [[2.89072129e-01 4.94602388e-01 2.63708520e-07]] [-0.37719608]
    score: 0.6470588235294118

3.4 Cross-validation & leave-one-out

Choose two UCI datasets and compare the error rates of logistic regression as estimated by 10-fold cross-validation and by leave-one-out.

Dataset: UCI Machine Learning Repository: Iris Data Set

Download the datasets

# Download the iris dataset
!curl -o "../datasets/iris.csv" "https://gist.githubusercontent.com/netj/8836201/raw/6f9306ad21398ea43cba4f7d537619d0e07d5ae3/iris.csv"
!curl -o "../datasets/Dry_Bean_Dataset.csv" "https://raw.githubusercontent.com/jiisa-k/Dry-Bean-Dataset/main/Dry_Bean_Dataset.csv"
      % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                     Dload  Upload   Total   Spent    Left  Speed
    100  3975  100  3975    0     0   5850      0 --:--:-- --:--:-- --:--:--  5845
      % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                     Dload  Upload   Total   Spent    Left  Speed
    100 2426k  100 2426k    0     0  1370k      0  0:00:01  0:00:01 --:--:-- 1370k

sklearn solution

iris = pd.read_csv("../datasets/iris.csv")
iris = iris.sample(frac=1, random_state=11408)
iris.head()
     sepal.length  sepal.width  petal.length  petal.width     variety
31            5.4          3.4           1.5          0.4      Setosa
134           6.1          2.6           5.6          1.4   Virginica
5             5.4          3.9           1.7          0.4      Setosa
34            4.9          3.1           1.5          0.2      Setosa
85            6.0          3.4           4.5          1.6  Versicolor
from sklearn.linear_model import LinearRegression
from sklearn import preprocessing
le = preprocessing.LabelEncoder()

iris['variety'] = le.fit_transform(iris['variety'])
X = iris.iloc[:, :-1]
y = iris.iloc[:, -1]


# NB: this fits ordinary least squares to the label-encoded classes
model = LinearRegression()
model.fit(X.values, y.values)
print("coefficients:", model.coef_)
print("score:", model.score(X, y))
    coefficients: [-0.11190585 -0.04007949  0.22864503  0.60925205]
    score: 0.9303939218549564
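Since the exercise asks for logistic regression rather than a linear fit to encoded labels, a drop-in alternative would be (a sketch, not run against the outputs above):

from sklearn.linear_model import LogisticRegression

logit = LogisticRegression(max_iter=500)
logit.fit(X.values, y.values)
print("score:", logit.score(X.values, y.values))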

# 10-fold cross-validation
num_folds = 10

X['偏置'] = 1
X_train_folds = np.array_split(X.values, num_folds)
y_train_folds = np.array_split(y.values, num_folds)

thetas = []
intercepts = []
acc_history = []

for i in range(num_folds):
    X_train = np.concatenate(X_train_folds[:i] + X_train_folds[i+1:])
    y_train = np.concatenate(y_train_folds[:i] + y_train_folds[i+1:])
    X_val = X_train_folds[i]
    y_val = y_train_folds[i]
    
    # Train the model
    model = LinearRegression()
    model.fit(X_train, y_train)
    
    # Predict on the validation fold
    y_pred = model.predict(X_val)
    thetas.append(np.array(model.coef_))
    intercepts.append(model.intercept_) # needed below: coef_ alone omits the intercept
    # Compute fold accuracy
    acc = np.sum(np.round(y_pred) == y_val) / len(y_val)
    acc_history.append(acc)
    print(f"Fold {i} - Accuracy: {acc}")

# Average the parameters
mean_theta = np.mean(thetas, axis=0)
print(f"Mean theta: {mean_theta}")
# Mean accuracy over the k folds
mean_acc = np.mean(acc_history)
print(f"Mean Accuracy: {mean_acc}")
# Accuracy of the model with averaged parameters (intercept included)
y_pred = np.dot(X.values, mean_theta) + np.mean(intercepts)
acc = np.sum(np.round(y_pred) == y.values) / len(y)
print(f"Mean theta Accuracy: {acc}")
    Fold 0 - Accuracy: 0.9333333333333333
    Fold 1 - Accuracy: 1.0
    Fold 2 - Accuracy: 0.9333333333333333
    Fold 3 - Accuracy: 1.0
    Fold 4 - Accuracy: 0.9333333333333333
    Fold 5 - Accuracy: 1.0
    Fold 6 - Accuracy: 0.9333333333333333
    Fold 7 - Accuracy: 0.9333333333333333
    Fold 8 - Accuracy: 1.0
    Fold 9 - Accuracy: 0.9333333333333333
    Mean theta: [-0.11122036 -0.04091899  0.2275572   0.6106891   0.        ]
    Mean Accuracy: 0.96
    Mean theta Accuracy: 0.9266666666666666
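The exercise also asks for leave-one-out. A compact comparison using sklearn's cross-validation utilities and an actual logistic regression (a sketch, not run against the outputs above):

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, LeaveOneOut, cross_val_score

X_iris = iris.iloc[:, :-1].values
y_iris = iris.iloc[:, -1].values

clf = LogisticRegression(max_iter=500)
acc_10fold = cross_val_score(clf, X_iris, y_iris, cv=KFold(n_splits=10)).mean()
acc_loo = cross_val_score(clf, X_iris, y_iris, cv=LeaveOneOut()).mean()
print(f"10-fold error: {1 - acc_10fold:.4f}, LOO error: {1 - acc_loo:.4f}")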

Pure NumPy solution

# TODO
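Until the TODO is filled in, a minimal sketch of what it could look like, reusing `sigmoid`/`gradient_descent` from 3.3 on the iris data (my sketch, untested; it assumes LabelEncoder mapped Setosa to 0, which holds because the encoder sorts labels, and it reduces to the binary task "Setosa vs. rest" since the 3.3 code is binary):

# 10-fold CV with the pure-NumPy logistic regression from 3.3
X_np = iris.iloc[:, :-1].values
y_np = (iris.iloc[:, -1].values == 0).astype(float).reshape(-1, 1) # Setosa vs rest
X_np = np.hstack([X_np, np.ones((len(X_np), 1))]) # bias column

accs = []
for i in range(10):
    val = np.arange(len(X_np)) % 10 == i # every 10th sample -> validation fold (rows already shuffled)
    theta = np.zeros((X_np.shape[1], 1))
    theta, _ = gradient_descent(theta, X_np[~val], y_np[~val], alpha=0.01, num_iters=2000)
    pred = (sigmoid(X_np[val] @ theta) >= 0.5).astype(float)
    accs.append((pred == y_np[val]).mean())
print("10-fold mean error:", 1 - np.mean(accs))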

3.5 Linear discriminant analysis (LDA)

Implement linear discriminant analysis and report its results on watermelon dataset 3.0α.

data = pd.read_csv("../datasets/data3.0.csv")
# Drop missing values
data = data.dropna()
data = data.drop(columns=['编号']) # the ID says nothing about the label
data = data.drop(columns=['色泽', '根蒂', '敲声', '纹理', '脐部', '触感']) # non-numeric columns, dropped for simplicity
data = data.sample(frac=1, random_state=11408)
data['好瓜'] = data['好瓜'].apply(lambda x: 1 if x == '是' else 0)
X0 = np.array(data[data['好瓜'].isin([0])].iloc[:, 0:2], dtype='float')
X1 = np.array(data[data['好瓜'].isin([1])].iloc[:, 0:2], dtype='float')
data.head()
    密度    含糖率  好瓜
6   0.481  0.149     1
1   0.774  0.376     1
9   0.243  0.267     0
13  0.657  0.198     0
8   0.666  0.091     0
# Linear Discriminant Analysis (LDA)
def LDA(X0, X1):
    # Class means as column vectors
    mu_0 = X0.mean(axis=0, keepdims=True).T
    mu_1 = X1.mean(axis=0, keepdims=True).T
    # Within-class scatter matrices
    sigma_0 = (X0 - mu_0.T).T @ (X0 - mu_0.T)
    sigma_1 = (X1 - mu_1.T).T @ (X1 - mu_1.T)
    S_w = sigma_0 + sigma_1
    # Projection direction, cf. Eq. (3.39): w = S_w^{-1}(mu_0 - mu_1)
    w = np.linalg.inv(S_w) @ (mu_0 - mu_1)
    return w

w = LDA(X0, X1)
w
    array([[-0.14650982],
           [-0.73871557]])
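As a quick sanity check (my addition), sklearn's LDA should produce a direction proportional to this `w`, up to sign and scale:

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X_all = np.vstack([X0, X1])
y_all = np.array([0] * len(X0) + [1] * len(X1))
skl = LinearDiscriminantAnalysis().fit(X_all, y_all)
print(skl.coef_) # proportional to S_w^{-1}(mu_1 - mu_0), i.e. -w up to scale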
# Visualization

# Compute projected positions
def get_map_pos(w, X):
    # X: data points (rows); w: projection direction (column vector)
    # Orthogonal projection of each x onto the line spanned by w: (x.w / ||w||^2) * w
    projx = np.dot(X, w) / np.dot(w.T, w) # scalar coefficient per sample
    ans = w * projx[:, np.newaxis]        # broadcast: scale w by each coefficient
    return ans.squeeze()                  # drop singleton axes -> (n, 2)

# Plot the projected points
p0 = get_map_pos(w, X0)
plt.scatter(p0[:, 0], p0[:, 1], c='r')
p1 = get_map_pos(w, X1)
plt.scatter(p1[:, 0], p1[:, 1], c='b')

# Plot the original data points
plt.scatter(X0[:, 0], X0[:, 1], c='r')
plt.scatter(X1[:, 0], X1[:, 1], c='b')

plt.xlabel('密度')
plt.ylabel('含糖率')
plt.show()


(figure: original data points and their projections onto w)

3.6 LDA on non-linearly separable data

Linear discriminant analysis only achieves ideal results on linearly separable data. Design an improvement so that it works reasonably well on non-linearly separable data.

# Idea: map the data into a new feature space first (in the spirit of generalized linear models)
from sklearn import datasets
import matplotlib.pyplot as plt

# Create a non-linearly separable dataset
X, y = datasets.make_circles(n_samples=100, factor=0.5, noise=0.1, random_state=42)

plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.RdBu)
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('make_circles dataset')
plt.show()


(figure: make_circles data, two concentric rings colored by class)

# X is a (100, 2) matrix; each row is a point
r = np.sqrt(X[:, 0]**2 + X[:, 1]**2) # radius feature
# Build the new feature matrix Z = (x1, r)
Z = np.hstack((X[:, 0].reshape(-1, 1), r.reshape(-1, 1)))

# LDA in the new feature space
Z1 = Z[y == 1]
Z0 = Z[y == 0]

w = LDA(Z0, Z1)
w
    array([[-0.00970806],
           [ 0.42604301]])
# Plot

# Projected points
p0 = get_map_pos(w, Z0)
plt.scatter(p0[:, 0], p0[:, 1], c='r')
p1 = get_map_pos(w, Z1)
plt.scatter(p1[:, 0], p1[:, 1], c='b')

# Mapped data points
plt.scatter(Z0[:, 0], Z0[:, 1], c='r')
plt.scatter(Z1[:, 0], Z1[:, 1], c='b')

plt.xlabel('Feature 1')
plt.ylabel('r')
plt.show()


(figure: mapped points $(x_1, r)$ and their projections onto w)

Again something looks slightly off 😵‍💫, but after the mapping the two classes are clearly separable.

3.7 Hamming distance and an optimal ECOC binary code

With code length 9 and 4 classes, give the ECOC binary code that is theoretically optimal in the Hamming-distance sense, and prove its optimality.

  1. Maximize the Hamming distance between every pair of classes,
  2. which in turn provides good error-correcting ability.
# Ran for a long time without finishing TAT -- merely enumerating the 2**36 candidates in Python takes an hour+
""" 
Constructed a code in which every pair of classes is at Hamming distance 6:
1111.1111.1
1110.0000.0
0000.1010.1
0001.0101.0
# could all pairwise distances be 7? (settled below: no)
"""
def has_duplicate_samples(X):
    # Check whether any two rows (codewords) coincide
    _, unique_indices = np.unique(X, axis=0, return_index=True)
    return X.shape[0] != unique_indices.shape[0]

def hamming_distance(X):
    if has_duplicate_samples(X):
        return 0  # duplicate codewords: return 0
    
    # Pairwise Hamming-distance matrix between rows
    diff_matrix = X[:, np.newaxis, :] != X[np.newaxis, :, :]
    
    # Count the differing bits per pair
    dist = np.sum(diff_matrix, axis=2)
    
    # Total Hamming distance (each unordered pair counted twice)
    return np.sum(dist)


            
ans = 0
ans_x = None
# Enumerate every candidate code: 2**(4*9) = 2**36 of them.
# (By XOR symmetry one could fix the first row to all zeros and shrink the
# search by a factor of 2**9, but brute force is slow either way.)
for i in range(2**(4*9)):
    # Build the candidate
    bstr = format(i, f'0{4*9}b')
    x = np.array(list(map(int, bstr)))
    # Reshape into a 4x9 code matrix
    x = x.reshape(4, 9)
    # Total Hamming distance of this candidate
    dist = hamming_distance(x)
    if dist > ans:
        # print(x, "\n", hamming_distance(x))
        print(dist, end=" ")
        ans = dist
        ans_x = x.copy()
print(ans, "\n", ans_x)
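The brute force is avoidable, though. A short counting argument (my addition; a Plotkin-style bound) proves the code in the docstring above is already optimal: in each of the 9 bit positions, if $k$ of the 4 codewords equal 1, that position contributes $k(4-k) \le 4$ to the sum of all pairwise distances, so

$$\sum_{i<j} d(c_i, c_j) \le 9 \cdot 4 = 36 \quad\Longrightarrow\quad \min_{i<j} d(c_i, c_j) \le \frac{36}{\binom{4}{2}} = 6.$$

The constructed code attains distance 6 on every pair, so it is optimal in the Hamming-distance sense, and a code with all pairwise distances equal to 7 cannot exist.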

3.8* ECOC error-independence assumption

An important condition for ECOC codes to provide ideal error correction is that the probability of error is comparable and independent across code bits. Discuss how likely the binary classifiers produced by ECOC encoding of a multi-class task are to satisfy this condition, and what effects follow from it.

3.9 No class-imbalance handling needed for OvR and MvM

When using OvR and MvM to decompose a multi-class task into binary tasks, explain why no dedicated handling of class imbalance is needed.

  1. OvO (for contrast): its final result is produced by voting, taking the most frequently predicted class.
  2. OvR: one classifier is trained per class, treating that class as positive and all the rest as negative, and the prediction with the highest confidence wins. Because every class receives the same treatment, the imbalance introduced by the decomposition is similar across the classifiers, and its effects largely cancel out in the aggregation.
    • A caveat in extreme cases: say class A has 95 samples, B has 3, C has 2.
    • If all 3 classifiers then simply favor the majority class, OvR by itself cannot remove the imbalance problem.
  3. MvM: the task is decomposed into several binary subproblems; a final aggregation step returns the class whose codeword is nearest.

(see p. 66)

3.10* Conditions for rescaling in multi-class cost-sensitive learning

Derive the condition under which "rescaling" attains the theoretically optimal solution in multi-class cost-sensitive learning (considering only class-dependent misclassification costs).
