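[Machine Learning] COVID-19 CT Image Recognition Based on Logistic Regression
This post uses Logistic Regression to recognize COVID-19 in CT images. We walk through the implementation step by step, pairing each concept with the code that realizes it.
1. Linear Models and Regression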
$$
f(x)=w_{1} x_{1}+w_{2} x_{2}+\ldots+w_{d} x_{d}+b
$$
where $x=(x_1,x_2,\ldots,x_d)$ is a sample described by $d$ attributes. In vectorized form:
$$
f(x)=w^{T} x+b
$$
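As a quick illustration of the two forms above, here is a minimal NumPy sketch. The weights, bias and sample values are made up for the example and do not come from the CT dataset.

```python
import numpy as np

def linear_model(x, w, b):
    """Return f(x) = w^T x + b for one sample (shape [d]) or a batch (shape [m, d])."""
    return np.asarray(x) @ np.asarray(w) + b

# Hypothetical parameters and sample with d = 3 attributes
w = np.array([2.0, -1.0, 0.5])
b = 1.0
x = np.array([1.0, 2.0, 4.0])

# Explicit sum w_1 x_1 + ... + w_d x_d + b
explicit = sum(w_j * x_j for w_j, x_j in zip(w, x)) + b
# Vectorized form w^T x + b
vectorized = linear_model(x, w, b)

print(explicit, vectorized)
```

Both expressions compute the same value; the vectorized form is simply the one that scales to batches of samples.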
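2. Solving for the Parameters with Least Squares
Goal of linear regression: minimize the error between the predicted values and the true values.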
$$
\begin{aligned} \left(w^{*}, b^{*}\right) &=\underset{(w, b)}{\arg \min } \sum_{i=1}^{m}\left(f\left(x_{i}\right)-y_{i}\right)^{2} \\ &=\underset{(w, b)}{\arg \min } \sum_{i=1}^{m}\left(y_{i}-w x_{i}-b\right)^{2} \end{aligned}
$$
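Writing the sum of squared errors as $E_{(w, b)}=\sum_{i=1}^{m}\left(y_{i}-w x_{i}-b\right)^{2}$, we take the partial derivatives with respect to w and b and set them to zero to find the minimum.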
$$
\left.\begin{array}{c} \frac{\partial E_{(w, b)}}{\partial w}=2\left(w \sum_{i=1}^{m} x_{i}^{2}-\sum_{i=1}^{m}\left(y_{i}-b\right) x_{i}\right)=0 \\ \frac{\partial E_{(w, b)}}{\partial b}=2\left(m b-\sum_{i=1}^{m}\left(y_{i}-w x_{i}\right)\right)=0 \end{array}\right\}
$$
Solving these two equations gives:
$$
\begin{gathered} w=\frac{\sum_{i=1}^{m} y_{i}\left(x_{i}-\bar{x}\right)}{\sum_{i=1}^{m} x_{i}^{2}-\frac{1}{m}\left(\sum_{i=1}^{m} x_{i}\right)^{2}} \\ b=\frac{1}{m} \sum_{i=1}^{m}\left(y_{i}-w x_{i}\right) \end{gathered}
$$
where $\bar{x}=\frac{1}{m} \sum_{i=1}^{m} x_{i}$.
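The closed-form solution can be checked numerically. The sketch below implements the two formulas for w and b directly and compares them against `np.polyfit` on synthetic data; the data, the noise level and the helper name `fit_least_squares` are assumptions made for this illustration.

```python
import numpy as np

def fit_least_squares(x, y):
    """Closed-form least squares for the univariate model f(x) = w * x + b."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    m = x.size
    x_bar = x.mean()
    # w = sum(y_i (x_i - x_bar)) / (sum(x_i^2) - (sum(x_i))^2 / m)
    w = np.sum(y * (x - x_bar)) / (np.sum(x ** 2) - np.sum(x) ** 2 / m)
    # b = (1/m) sum(y_i - w x_i)
    b = np.mean(y - w * x)
    return w, b

# Synthetic data: y = 3x + 2 plus Gaussian noise (values chosen for the example)
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 3.0 * x + 2.0 + rng.normal(scale=0.5, size=x.size)

w, b = fit_least_squares(x, y)
w_ref, b_ref = np.polyfit(x, y, deg=1)  # reference least-squares fit
print(w, b)
print(w_ref, b_ref)
```

The two pairs should agree up to floating-point error, since `np.polyfit` with `deg=1` minimizes the same squared error.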
3. Log-Linear Regression
Goal of log-linear regression: use a linear model to predict a nonlinear, more complex function.
$$
\begin{gathered} f(x)=wx+b \\ g(x)=e^{x} \\ g(f(x))=e^{wx+b} \end{gathered}
$$
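One way to see this in practice: if the target follows y = e^{wx+b}, then ln y = wx + b is linear in x, so an ordinary linear fit on ln y recovers w and b. The sketch below assumes noise-free synthetic data with w = 1.5 and b = 0.3, values picked only for the example.

```python
import numpy as np

# Synthetic exponential target y = exp(1.5 * x + 0.3) (assumed values)
x = np.linspace(0, 2, 20)
y = np.exp(1.5 * x + 0.3)

# Fit a straight line to ln(y): ln y = w * x + b
w, b = np.polyfit(x, np.log(y), deg=1)

# Map the linear prediction back through g(x) = e^x
y_hat = np.exp(w * x + b)
print(w, b)
```

The model stays linear in its parameters; only the target is transformed, which is what lets a linear fit capture the exponential curve.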
4. Logistic Regression
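Purpose of Logistic Regression: despite the word "regression" in its name, it is in essence a classification algorithm.
Nature of Logistic Regression: it is a discriminative model. Building on linear regression, it applies the sigmoid function to squash the linear model's output into the interval (0, 1), which gives that output a probabilistic interpretation.
Sigmoid function: produces the class probability.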
$$
h_{\theta}(x)=g\left(\theta^{T} x\right)=\frac{1}{1+e^{-\theta^{T} x}}
$$
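A small sketch of the sigmoid hypothesis follows. It uses a common numerically stable formulation that avoids overflow for large negative inputs; this is an implementation choice for the example, not something the post prescribes, and the `theta` and `x` values are made up.

```python
import numpy as np

def sigmoid(z):
    """Numerically stable g(z) = 1 / (1 + e^{-z}); returns an array of the same length."""
    z = np.atleast_1d(np.asarray(z, dtype=float))
    out = np.empty_like(z)
    pos = z >= 0
    # For z >= 0, e^{-z} <= 1, so the direct formula cannot overflow
    out[pos] = 1.0 / (1.0 + np.exp(-z[pos]))
    # For z < 0, use the equivalent form e^{z} / (1 + e^{z})
    exp_z = np.exp(z[~pos])
    out[~pos] = exp_z / (1.0 + exp_z)
    return out

# Hypothetical parameters; x[0] = 1 plays the role of the bias feature x_0
theta = np.array([0.5, -1.0, 2.0])
x = np.array([1.0, 2.0, 1.0])

h = sigmoid(theta @ x)  # h_theta(x) = g(theta^T x)
print(h)
```

The output is read as the estimated probability that the sample belongs to class 1.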
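Loss function of Logistic Regression: once the sigmoid function has produced the predicted class probability, a loss function drives the optimization of the logistic model.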
Suppose the training set is
$$
\left\{\left(x^{1}, y^{1}\right),\left(x^{2}, y^{2}\right), \ldots,\left(x^{m}, y^{m}\right)\right\}
$$
Let $x=\left[x_{0}, x_{1}, \ldots, x_{n}\right]^{T}$ with $x_{0}=1$, i.e. each sample has $n$ features, and $y \in\{0,1\}$.
The loss function is defined as:
$$
\begin{aligned} &J(\theta)=\frac{1}{m} \sum_{i=1}^{m} \operatorname{cost}\left(h_{\theta}\left(x^{i}\right), y^{i}\right) \\ &\operatorname{cost}\left(h_{\theta}(x), y\right)= \begin{cases}-\log \left(h_{\theta}(x)\right) & \text { if } y=1 \\ -\log \left(1-h_{\theta}(x)\right) & \text { if } y=0\end{cases} \end{aligned}
$$
The two cases combine into a single expression:
$$
\operatorname{cost}\left(h_{\theta}(x), y\right)=-y \log \left(h_{\theta}(x)\right)-(1-y) \log \left(1-h_{\theta}(x)\right)
$$
Substituting this into $J(\theta)$ gives the cost function:
$$
J(\theta)=-\frac{1}{m} \sum_{i=1}^{m}\left[y^{i} \log \left(h_{\theta}\left(x^{i}\right)\right)+\left(1-y^{i}\right) \log \left(1-h_{\theta}\left(x^{i}\right)\right)\right]
$$
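To make the cost function concrete, the sketch below evaluates J(θ) for a handful of predicted probabilities and checks that the combined expression matches the piecewise definition. The clipping constant `eps` is an assumption added to avoid log(0), and the sample values are invented.

```python
import numpy as np

def cross_entropy_cost(h, y, eps=1e-12):
    """J = -(1/m) * sum(y log h + (1 - y) log(1 - h)) over m samples."""
    h = np.clip(np.asarray(h, dtype=float), eps, 1.0 - eps)  # guard against log(0)
    y = np.asarray(y, dtype=float)
    return -np.mean(y * np.log(h) + (1.0 - y) * np.log(1.0 - h))

def piecewise_cost(h, y):
    """Per-sample cost: -log(h) if y == 1, -log(1 - h) if y == 0."""
    return -np.log(h) if y == 1 else -np.log(1.0 - h)

# Hypothetical predicted probabilities h_theta(x^i) and true labels y^i
h = np.array([0.9, 0.2, 0.6, 0.4])
y = np.array([1, 0, 1, 0])

combined = cross_entropy_cost(h, y)
piecewise = np.mean([piecewise_cost(h_i, y_i) for h_i, y_i in zip(h, y)])
print(combined, piecewise)
```

Confident correct predictions (0.9 for a positive sample) contribute little to the cost, while hesitant ones (0.6) contribute more.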
Gradient descent for Logistic Regression: gradient descent is used to find the parameters that minimize the cost function.
$$
\begin{aligned} \theta_{j} &=\theta_{j}-\alpha \frac{\partial J(\theta)}{\partial \theta_{j}} \\ &=\theta_{j}-\alpha \frac{1}{m} \sum_{i=1}^{m}\left(h_{\theta}\left(x^{i}\right)-y^{i}\right) x_{j}^{i} \end{aligned}
$$
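The update rule can be sketched in a few lines of NumPy. This is a toy illustration on synthetic one-feature data, not the training code used for the CT images; the learning rate, step count and data are assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, X, y, eps=1e-12):
    """Cross-entropy cost J(theta); eps guards against log(0)."""
    h = np.clip(sigmoid(X @ theta), eps, 1.0 - eps)
    return -np.mean(y * np.log(h) + (1.0 - y) * np.log(1.0 - h))

def gradient_step(theta, X, y, alpha):
    """theta_j = theta_j - alpha * (1/m) * sum((h(x^i) - y^i) * x_j^i), for all j at once."""
    m = X.shape[0]
    h = sigmoid(X @ theta)
    grad = X.T @ (h - y) / m
    return theta - alpha * grad

# Synthetic data: label is 1 when the single feature is positive
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
X = np.column_stack([np.ones_like(x1), x1])  # x_0 = 1 is the bias feature
y = (x1 > 0).astype(float)

theta = np.zeros(2)
costs = []
for _ in range(500):
    costs.append(cost(theta, X, y))
    theta = gradient_step(theta, X, y, alpha=0.5)

accuracy = np.mean((sigmoid(X @ theta) >= 0.5) == (y == 1))
print(costs[0], costs[-1], accuracy)
```

Starting from θ = 0 every prediction is 0.5, so the initial cost is ln 2; each step then moves θ against the gradient and the cost falls.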
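Generalizing Logistic Regression (Softmax Regression):
Logistic Regression solves binary classification problems. For multi-class problems we usually turn to softmax regression, which generalizes Logistic Regression to more than two classes.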
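The relationship between the two can be checked numerically: with exactly two classes, the softmax probability of class 1 equals the sigmoid of the difference between the two logits. The sketch below uses invented logits and subtracts the row maximum before exponentiating, a standard stability trick added for this example.

```python
import numpy as np

def softmax(z):
    """Row-wise softmax; subtracting the row maximum avoids overflow in exp."""
    z = np.asarray(z, dtype=float)
    shifted = z - z.max(axis=-1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=-1, keepdims=True)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical logits for two samples and two classes
logits = np.array([[2.0, 0.5],
                   [-1.0, 3.0]])

probs = softmax(logits)
# Two-class case: P(class 1) = sigmoid(z_1 - z_0)
p1_via_sigmoid = sigmoid(logits[:, 1] - logits[:, 0])
print(probs[:, 1], p1_via_sigmoid)
```

With more than two classes the same `softmax` function applies unchanged, which is why it is the natural multi-class extension.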
5. COVID-19 CT Image Recognition
Dataset and code:
Link: https://pan.baidu.com/s/1Ay4Y3Cr-0i–dlDl0zt1xg
Extraction code: 7irt
import os

import cv2
import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf
from sklearn.model_selection import train_test_split

# Two classes to recognize: normal and COVID-19
num_classes = 2
# Each image is resized to 128 * 128 = 16384 pixels
num_features = 16384
# Keep the learning rate small
learning_rate = 0.0001
# Basic hyperparameters
training_steps = 1000
batch_size = 32
display_step = 200

# Build the dataset: one sub-folder per class under root_path
root_path = "./CT/"
label_map = {"CT_COVID": 0, "CT_NonCOVID": 1}
imgs = []
labels = []
for folder in os.listdir(root_path):
    # Skip anything that is not one of the two class folders,
    # so that imgs and labels always stay the same length
    if folder not in label_map:
        continue
    path = os.path.join(root_path, folder)
    for img_name in os.listdir(path):
        img_path = os.path.join(path, img_name)
        # Read as grayscale; imread returns None for unreadable files
        img = cv2.imread(img_path, 0)
        if img is None:
            continue
        img = cv2.resize(img, (128, 128))
        imgs.append(img)
        labels.append(label_map[folder])

# Split into training and test sets
x_train, x_test, y_train, y_test = train_test_split(imgs, labels, test_size=0.2)
print(y_test)
# Convert images to float32 and labels to int32 arrays
x_train, x_test = np.array(x_train, np.float32), np.array(x_test, np.float32)
y_train, y_test = np.array(y_train, np.int32), np.array(y_test, np.int32)
print(x_train.shape)
print(x_test.shape)
# Flatten each image into a 1-D vector of 16384 features (128 * 128)
x_train, x_test = x_train.reshape([-1, num_features]), x_test.reshape([-1, num_features])
# Normalize pixel values from [0, 255] to [0, 1]
x_train, x_test = x_train / 255, x_test / 255

# Shuffle and batch the training data
train_data = tf.data.Dataset.from_tensor_slices((x_train, y_train))
train_data = train_data.repeat().shuffle(5000).batch(batch_size).prefetch(1)

# Weight matrix of shape [16384, 2]: number of pixel features x number of classes
W = tf.Variable(tf.ones([num_features, num_classes]), name="weight")
# Bias of shape [2]: one entry per class
b = tf.Variable(tf.zeros([num_classes]), name="bias")

# Logistic regression; softmax is used here, with the sigmoid variant kept as a comment
def logistic_regression(x):
    return tf.nn.softmax(tf.matmul(x, W) + b)
    # return tf.nn.sigmoid(tf.matmul(x, W) + b)

# Cross-entropy loss
def cross_entropy(y_pred, y_true):
    # Encode the labels as one-hot vectors
    y_true = tf.one_hot(y_true, depth=num_classes)
    # Clip the predictions to avoid log(0)
    y_pred = tf.clip_by_value(y_pred, 1e-9, 1.)
    # Sum over classes for each sample, then average over the batch
    return tf.reduce_mean(-tf.reduce_sum(y_true * tf.math.log(y_pred), axis=1))

# Accuracy
def accuracy(y_pred, y_true):
    correct_prediction = tf.equal(tf.argmax(y_pred, 1), tf.cast(y_true, tf.int64))
    return tf.reduce_mean(tf.cast(correct_prediction, tf.float32))

# Optimizer; Adam is used here
optimizer = tf.optimizers.Adam(learning_rate)

# One optimization step
def run_optimization(x, y):
    # Wrap the computation in a GradientTape for automatic differentiation
    with tf.GradientTape() as g:
        pred = logistic_regression(x)
        loss = cross_entropy(pred, y)
    # Compute the gradients
    gradients = g.gradient(loss, [W, b])
    # Update W and b from the gradients
    optimizer.apply_gradients(zip(gradients, [W, b]))

# Training loop
for step, (batch_x, batch_y) in enumerate(train_data.take(training_steps), 1):
    # Update W and b
    run_optimization(batch_x, batch_y)
    if step % display_step == 0:
        pred = logistic_regression(batch_x)
        loss = cross_entropy(pred, batch_y)
        acc = accuracy(pred, batch_y)
        print("step: %i, loss: %f, accuracy: %f" % (step, loss, acc))

# Evaluate the model on the held-out test set
pred = logistic_regression(x_test)
print("Test Accuracy: %f" % accuracy(pred, y_test))

# Visualize a few predictions
n_images = 5
test_images = x_test[:n_images]
font = {'color': 'red',
        'size': 20,
        'family': 'Times New Roman',
        'style': 'italic'}
predictions = logistic_regression(test_images)
for i in range(n_images):
    predicted_class = np.argmax(predictions.numpy()[i])
    print(predicted_class)
    # Class 0 is COVID and class 1 is non-COVID (see label_map)
    label_text = "COVID" if predicted_class == 0 else "Normal"
    plt.imshow(np.reshape(test_images[i], [128, 128]))
    plt.text(28, 0.1, "Prediction : " + label_text, fontdict=font)
    plt.show()
Ablation experiment on the COVID-19 CT images:
Method | Accuracy |
---|---|
Logistic Regression | 0.4800 |
Softmax Regression | 0.7133 |
Quantitative analysis: