[Machine Learning] COVID-19 CT Image Recognition Based on Logistic Regression

Latest recommended article published on 2023-04-18 20:22:28

JMU-HZH

Tags: algorithm, python, opencv

Article link: https://blog.csdn.net/qq_45603919/article/details/121534823

This post implements COVID-19 CT image recognition with Logistic Regression, walking through both the concepts and the code behind the project.

1. Linear Models and Regression

$f(x)=w_{1} x_{1}+w_{2} x_{2}+\ldots+w_{d} x_{d}+b$
where $x=(x_1,x_2,\ldots,x_d)$ is a sample described by $d$ attributes. In vector form: $f(x)=w^{T} x+b$
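
As a minimal sketch of the linear model above, the snippet below evaluates the weighted sum of the attributes plus the bias for a single sample. The function name `linear_model` and the sample values are hypothetical, chosen only for illustration; plain Python is used so the arithmetic stays visible.

```python
def linear_model(x, w, b):
    """Return f(x) = w^T x + b for one sample x with d attributes."""
    if len(x) != len(w):
        raise ValueError("x and w must have the same length")
    return sum(w_j * x_j for w_j, x_j in zip(w, x)) + b


# Hypothetical sample with d = 3 attributes
x = [1.0, 2.0, 3.0]
w = [0.5, -1.0, 2.0]
b = 0.25
y_hat = linear_model(x, w, b)  # 0.5 - 2.0 + 6.0 + 0.25
print(y_hat)
```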

2. Solving for the Parameters with Least Squares

Goal of linear regression: minimize the error between the predicted values and the true values.
$\begin{aligned} \left(w^{*}, b^{*}\right) &=\underset{(w, b)}{\arg \min } \sum_{i=1}^{m}\left(f\left(x_{i}\right)-y_{i}\right)^{2} \\ &=\underset{(w, b)}{\arg \min } \sum_{i=1}^{m}\left(y_{i}-w x_{i}-b\right)^{2} \end{aligned}$
We therefore take the partial derivatives of the squared error $E_{(w, b)}=\sum_{i=1}^{m}\left(y_{i}-w x_{i}-b\right)^{2}$ with respect to $w$ and $b$ and set them to zero:
$\left.\begin{array}{c} \frac{\partial E_{(w, b)}}{\partial w}=2\left(w \sum_{i=1}^{m} x_{i}^{2}-\sum_{i=1}^{m}\left(y_{i}-b\right) x_{i}\right)=0 \\ \frac{\partial E_{(w, b)}}{\partial b}=2\left(m b-\sum_{i=1}^{m}\left(y_{i}-w x_{i}\right)\right)=0 \end{array}\right\}$
Solving gives:
$\begin{gathered} w=\frac{\sum_{i=1}^{m} y_{i}\left(x_{i}-\bar{x}\right)}{\sum_{i=1}^{m} x_{i}^2-\frac{1}{m}\left(\sum_{i=1}^{m} x_{i}\right)^2} \\ b=\frac{1}{m} \sum_{i=1}^{m}\left(y_{i}-w x_{i}\right) \end{gathered}$
where $\bar{x}=\frac{1}{m} \sum_{i=1}^{m} x_{i}$
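
The closed-form solution above can be checked numerically. The following is a small sketch for the single-feature case; the function name `fit_least_squares` and the toy data (points lying exactly on $y = 2x + 1$) are assumptions made for illustration.

```python
def fit_least_squares(xs, ys):
    """Closed-form least squares for y = w*x + b with a single feature."""
    m = len(xs)
    x_bar = sum(xs) / m
    numerator = sum(y * (x - x_bar) for x, y in zip(xs, ys))
    denominator = sum(x * x for x in xs) - sum(xs) ** 2 / m
    if denominator == 0:
        raise ValueError("all x values are identical, so w is undefined")
    w = numerator / denominator
    b = sum(y - w * x for x, y in zip(xs, ys)) / m
    return w, b


# Toy data generated from y = 2x + 1 (hypothetical values)
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
w, b = fit_least_squares(xs, ys)
print(w, b)
```

Because the toy points lie exactly on a line, the recovered `w` and `b` should match the slope and intercept used to generate them, up to floating-point error.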

3. Log-Linear Regression

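Purpose of log-linear regression: use a linear model to fit a nonlinear target by passing the linear output through the exponential function.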
$f(x)=wx+b\\ g(x)=e^x\\ g(f(x))=e^{wx+b}$
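
Since $y=e^{wx+b}$ is equivalent to $\ln y = wx+b$, one way to fit such a model is to run ordinary least squares on the log-transformed targets. The sketch below assumes all targets are positive and uses the centered form of the least-squares solution, which is algebraically equivalent to the formula in section 2; the names `fit_log_linear` and `predict` are hypothetical.

```python
import math


def fit_log_linear(xs, ys):
    """Fit ln(y) = w*x + b by least squares on the log-transformed targets."""
    if any(y <= 0 for y in ys):
        raise ValueError("log-linear regression needs strictly positive targets")
    zs = [math.log(y) for y in ys]
    m = len(xs)
    x_bar = sum(xs) / m
    z_bar = sum(zs) / m
    denominator = sum((x - x_bar) ** 2 for x in xs)
    if denominator == 0:
        raise ValueError("all x values are identical, so w is undefined")
    w = sum((x - x_bar) * (z - z_bar) for x, z in zip(xs, zs)) / denominator
    b = z_bar - w * x_bar
    return w, b


def predict(x, w, b):
    """Map the linear output back through g(x) = e^x."""
    return math.exp(w * x + b)


# Toy data generated from y = exp(0.5x + 1) (hypothetical values)
xs = [0.0, 1.0, 2.0, 3.0]
ys = [math.exp(0.5 * x + 1.0) for x in xs]
w, b = fit_log_linear(xs, ys)
print(w, b, predict(4.0, w, b))
```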

4. Logistic Regression

Purpose of Logistic Regression: despite the word "regression" in its name, it is essentially a classification algorithm.

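Nature of Logistic Regression: Logistic Regression is a discriminative model. It builds on linear regression and uses the sigmoid function to squash the linear output into the interval (0, 1), so that the output can be interpreted as a probability.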

Sigmoid function: yields the class probability
$h_{\theta}(x)=g\left(\theta^{T} x\right)=\frac{1}{1+e^{-\theta^{T} x}}$
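
A minimal sketch of the hypothesis $h_{\theta}(x)$ follows. The two-branch form of `sigmoid` is an implementation choice (not part of the formula) that avoids overflow in `math.exp` for large negative inputs; the helper name `hypothesis` is hypothetical.

```python
import math


def sigmoid(z):
    """Numerically stable sigmoid: 1 / (1 + e^(-z))."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # For negative z, rewrite as e^z / (1 + e^z) so exp never overflows
    e = math.exp(z)
    return e / (1.0 + e)


def hypothesis(theta, x):
    """h_theta(x) = sigmoid(theta^T x); x[0] is expected to be the constant 1."""
    return sigmoid(sum(t * x_j for t, x_j in zip(theta, x)))


print(sigmoid(0.0))                          # midpoint of the curve
print(hypothesis([0.0, 1.0], [1.0, 3.0]))    # sigmoid(3.0)
```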

Loss function of Logistic Regression: once the sigmoid function has produced the predicted class probability, a loss function is used to optimize the logistic model.

Let the training set be $\left\{\left(x^{1}, y^{1}\right),\left(x^{2}, y^{2}\right), \ldots,\left(x^{m}, y^{m}\right)\right\}$.
Let $x=\left[x_{0}, x_{1}, \ldots, x_{n}\right]^{T}$ with $x_{0}=1$, i.e. each sample has $n$ features, and $y \in\{0,1\}$.
The loss function is defined as:
$\begin{aligned} &J(\theta)=\frac{1}{m} \sum_{i=1}^{m} \operatorname{cost}\left(h_{\theta}\left(x^{i}\right), y^{i}\right) \\ &\operatorname{cost}\left(h_{\theta}(x), y\right)= \begin{cases}-\log \left(h_{\theta}(x)\right) & \text { if } y=1 \\ -\log \left(1-h_{\theta}(x)\right) & \text { if } y=0\end{cases} \end{aligned}$
Combining the two cases into one expression:
$\operatorname{cost}\left(h_{\theta}(x), y\right)=-y \log \left(h_{\theta}(x)\right)-(1-y) \log \left(1-h_{\theta}(x)\right)$
Substituting into $J(\theta)$ gives the cost function:
$J(\theta)=-\frac{1}{m} \sum_{i=1}^{m}\left[y^{i} \log \left(h_{\theta}\left(x^{i}\right)\right)+\left(1-y^{i}\right) \log \left(1-h_{\theta}\left(x^{i}\right)\right)\right]$
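
The cost function can be sketched directly from the formulas above. The clipping constant `eps` is an assumption added to avoid `log(0)`, and the probabilities and labels are hypothetical values for illustration.

```python
import math


def cost(h, y, eps=1e-12):
    """Per-sample loss: -y*log(h) - (1-y)*log(1-h), with h clipped away from 0 and 1."""
    h = min(max(h, eps), 1.0 - eps)
    return -y * math.log(h) - (1 - y) * math.log(1.0 - h)


def cost_function(probs, labels):
    """J(theta): the mean per-sample loss over m samples."""
    m = len(labels)
    return sum(cost(h, y) for h, y in zip(probs, labels)) / m


# Hypothetical predicted probabilities h_theta(x^i) and true labels y^i
probs = [0.9, 0.2, 0.6]
labels = [1, 0, 1]
loss = cost_function(probs, labels)
print(loss)
```

A confident correct prediction (such as 0.9 for a positive sample) contributes little to the loss, while a confident wrong prediction is penalized heavily because the logarithm grows without bound near zero.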

Gradient descent for Logistic Regression: gradient descent is used to find the parameters that minimize the cost function.
$\begin{aligned} \theta_{j} &=\theta_{j}-\alpha \frac{\partial J(\theta)}{\partial \theta_{j}} \\ &=\theta_{j}-\alpha \frac{1}{m} \sum_{i=1}^{m}\left(h_{\theta}\left(x^{i}\right)-y^{i}\right) x_{j}^{i} \end{aligned}$
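
The update rule above can be sketched as a batch gradient descent loop. Everything named here (`gradient_step`, the toy data, the learning rate 0.5, and the 500 iterations) is an assumption for illustration; each row of `X` starts with the constant $x_0 = 1$.

```python
import math


def sigmoid(z):
    """Numerically stable sigmoid."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def gradient_step(theta, X, y, alpha):
    """One update: theta_j <- theta_j - alpha * (1/m) * sum_i (h(x^i) - y^i) * x_j^i."""
    m = len(X)
    preds = [sigmoid(sum(t * x_j for t, x_j in zip(theta, row))) for row in X]
    new_theta = []
    for j in range(len(theta)):
        grad_j = sum((preds[i] - y[i]) * X[i][j] for i in range(m)) / m
        new_theta.append(theta[j] - alpha * grad_j)
    return new_theta


# Toy, linearly separable data: negative feature -> class 0, positive -> class 1
X = [[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]]
y = [0, 0, 1, 1]

theta = [0.0, 0.0]
for _ in range(500):
    theta = gradient_step(theta, X, y, alpha=0.5)

predictions = [sigmoid(sum(t * x_j for t, x_j in zip(theta, row))) for row in X]
print(theta, predictions)
```

All components of `theta` are updated simultaneously from the same set of predictions, which is why the new values are collected in a separate list before being returned.
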
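Extension of Logistic Regression (Softmax Regression):

Logistic Regression solves binary classification problems. For multi-class problems, softmax regression is commonly used instead; it generalizes Logistic Regression to more than two classes.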
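
To make the relationship concrete, the sketch below implements softmax and checks that, with two classes, the probability of the first class equals the sigmoid of the difference between the two logits. The logit values are hypothetical, and subtracting the maximum logit is a standard stability trick that does not change the result.

```python
import math


def softmax(logits):
    """Convert a list of logits into probabilities that sum to 1."""
    z_max = max(logits)  # subtract the max so exp never overflows
    exps = [math.exp(z - z_max) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


# Three-class example with hypothetical logits
probs = softmax([2.0, 1.0, 0.1])
print(probs, sum(probs))

# Two-class case: softmax reduces to a sigmoid of the logit difference
two_class = softmax([1.5, -0.5])
print(two_class[0], sigmoid(1.5 - (-0.5)))
```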

5. COVID-19 CT Image Recognition

Dataset and code:

Link: https://pan.baidu.com/s/1Ay4Y3Cr-0i–dlDl0zt1xg
Extraction code: 7irt

import os

import cv2
import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf
from sklearn.model_selection import train_test_split

# Two classes: normal (non-COVID) and COVID
num_classes = 2
# Each image is resized to 128 * 128 pixels
num_features = 16384
# Keep the learning rate small
learning_rate = 0.0001
# Basic hyperparameters
training_steps = 1000
batch_size = 32
display_step = 200

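# Build the dataset: one sub-folder per class under root_path
root_path = "./CT/"
label_map = {'CT_COVID': 0, 'CT_NonCOVID': 1}
imgs = []
labels = []

for folder in os.listdir(root_path):
    # Skip anything that is not one of the two class folders,
    # so that imgs and labels always stay the same length
    if folder not in label_map:
        continue
    path = os.path.join(root_path, folder)
    for img_name in os.listdir(path):
        img_path = os.path.join(path, img_name)
        # Flag 0 loads the image as grayscale
        img = cv2.imread(img_path, 0)
        if img is None:
            # Skip files that OpenCV cannot decode
            continue
        img = cv2.resize(img, (128, 128))
        imgs.append(img)
        labels.append(label_map[folder])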

# Split into training and test sets (80% / 20%)
x_train, x_test, y_train, y_test = train_test_split(imgs, labels, test_size=0.2)
print(y_test)

# Convert to float32
x_train, x_test = np.array(x_train, np.float32), np.array(x_test, np.float32)

print(x_train.shape)
print(x_test.shape)

# Flatten each image into a 1-D vector of 16384 features (128 * 128)
x_train, x_test = x_train.reshape([-1, num_features]), x_test.reshape([-1, num_features])

# Normalize pixel values from [0, 255] to [0, 1]
x_train, x_test = x_train / 255, x_test / 255

# Shuffle and batch the training data
train_data = tf.data.Dataset.from_tensor_slices((x_train, y_train))
train_data = train_data.repeat().shuffle(5000).batch(batch_size).prefetch(1)

# Weight matrix of shape [16384, 2]: features per 128 * 128 image by number of classes
W = tf.Variable(tf.ones([num_features, num_classes]), name="weight")
# Bias of shape [2]: number of classes
b = tf.Variable(tf.zeros([num_classes]), name="bias")

# Logistic regression; softmax is used here, with the sigmoid variant left commented out
def logistic_regression(x):
    return tf.nn.softmax(tf.matmul(x, W) + b)
    # return tf.nn.sigmoid(tf.matmul(x, W) + b)

# Cross-entropy loss
def cross_entropy(y_pred, y_true):
    # Encode the labels as one-hot vectors
    y_true = tf.one_hot(y_true, depth=num_classes)
    # Clip the predictions to avoid log(0)
    y_pred = tf.clip_by_value(y_pred, 1e-9, 1.)
    # Sum over the classes of each sample, then average over the batch
    return tf.reduce_mean(-tf.reduce_sum(y_true * tf.math.log(y_pred), axis=1))

# Accuracy
def accuracy(y_pred, y_true):
    correct_prediction = tf.equal(tf.argmax(y_pred, 1), tf.cast(y_true, tf.int64))
    return tf.reduce_mean(tf.cast(correct_prediction, tf.float32))

# Optimizer: Adam is used here
optimizer = tf.optimizers.Adam(learning_rate)

# Optimization step
def run_optimization(x, y):
    # Wrap the computation in a GradientTape for automatic differentiation
    with tf.GradientTape() as g:
        pred = logistic_regression(x)
        loss = cross_entropy(pred, y)

    # Compute the gradients
    gradients = g.gradient(loss, [W, b])

    # Update W and b using the gradients
    optimizer.apply_gradients(zip(gradients, [W, b]))

# Training loop
for step, (batch_x, batch_y) in enumerate(train_data.take(training_steps), 1):
    # Update W and b
    run_optimization(batch_x, batch_y)
    if step % display_step == 0:
        pred = logistic_regression(batch_x)
        loss = cross_entropy(pred, batch_y)
        acc = accuracy(pred, batch_y)
        print("step: %i, loss: %f, accuracy: %f" % (step, loss, acc))

# Evaluate the model on the test set
pred = logistic_regression(x_test)
print("Test Accuracy: %f" % accuracy(pred, y_test))

# Visualize predictions
n_images = 5

test_images = x_test[:n_images]

font = {'color': 'red',
        'size': 20,
        'family': 'Times New Roman',
        'style': 'italic'}

predictions = logistic_regression(test_images).numpy()

for i in range(n_images):
    # Class 0 is COVID and class 1 is non-COVID, matching the labels assigned when loading the data
    pred_class = np.argmax(predictions[i])
    print(pred_class)
    label = "COVID" if pred_class == 0 else "Normal"
    plt.imshow(np.reshape(test_images[i], [128, 128]))
    plt.text(28, 0.1, "Prediction : " + label, fontdict=font)
    plt.show()