Lung Nodule Detection (1): Dataset Introduction and Processing

Five-42

Last modified: 2023-03-10 16:40:56

First published: 2022-10-15 22:09:10

Original link: https://blog.csdn.net/qq_22152499/article/details/89913821

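I. Introduction to the LUNA16 Dataset

1. Overview

The LUNA16 dataset contains 888 low-dose lung CT scans in mhd format. Each scan is a three-dimensional image made up of a varying number of two-dimensional axial slices of the chest.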

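2. Contents

subset0-subset9: these 10 zip files contain all the CT images. Each CT scan is given by a .mhd file together with a .raw file: the .mhd file holds basic information about the CT image, and the .raw file stores the actual CT data.

annotations.csv: this csv file contains the annotations used as the reference standard for the lung nodule detection challenge. It lists 1186 nodules in total and gives, for each CT scan, the world coordinates of the nodule center and the nodule diameter.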

II. Data Format and Processing

1. Parsing the mhd file

ObjectType = Image
NDims = 3
BinaryData = True
BinaryDataByteOrderMSB = False
CompressedData = False
TransformMatrix = 1 0 0 0 1 0 0 0 1
Offset = -161.699997 -169.5 -316.46499599999999
CenterOfRotation = 0 0 0
AnatomicalOrientation = RAI
ElementSpacing = 0.6640620231628418 0.6640620231628418 0.625
DimSize = 512 512 471
ElementType = MET_SHORT
ElementDataFile = 1.3.6.1.4.1.14519.5.2.1.6279.6001.100530488926682752765845212286.raw

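ObjectType: the type of object stored in the file.
NDims: the number of dimensions of the raw data.
TransformMatrix: the orientation matrix, used here as a flag for whether the image is flipped. (In practice patients are not always scanned in the same lying position, prone versus supine for example, so the image can come out flipped.)
Offset: the world coordinates of the image origin, that is, of the first voxel rather than of the image center. (This coordinate looks odd at first, but recall the nodule coordinates in annotations.csv: both use the same world-coordinate form.) DICOM files use the same representation.
ElementSpacing: the voxel spacing along x, y, and z, in millimeters.
DimSize: the three numbers give the size along x, y, and z.
ElementDataFile: the raw file that belongs to this mhd file. The raw file stores the matrix of CT values, that is, all the scanned CT slices.
The position of a nodule in the matrix is computed as follows: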

x= (x_ano-x_offset)/x_ElementSpacing 
y= (y_ano-y_offset)/y_ElementSpacing 
z= (z_ano-z_offset)/z_ElementSpacing

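Here, x is the resulting coordinate in the three-dimensional matrix.
x_ano is the lung nodule coordinate from annotations.csv.
x_offset is the origin coordinate (the Offset field of the mhd file).
x_ElementSpacing is the spacing along the x axis, that is, the real-world length covered by one pixel.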
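
The formula above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the challenge's official tooling: the function name `world_to_voxel` and the nodule coordinate are made up for the example, while the Offset and ElementSpacing values are taken from the header shown earlier. It assumes an identity TransformMatrix.

```python
import numpy as np

def world_to_voxel(world_xyz, offset_xyz, spacing_xyz):
    """Convert a world coordinate (mm) to a voxel index, all in (x, y, z) order."""
    world = np.asarray(world_xyz, dtype=float)
    offset = np.asarray(offset_xyz, dtype=float)
    spacing = np.asarray(spacing_xyz, dtype=float)
    # (x_ano - x_offset) / x_ElementSpacing, rounded to the nearest voxel
    return np.rint((world - offset) / spacing).astype(int)

# Offset and ElementSpacing from the mhd header shown earlier
offset = (-161.699997, -169.5, -316.464996)
spacing = (0.6640620231628418, 0.6640620231628418, 0.625)

# Hypothetical annotation coordinate, chosen only for illustration
nodule_world = (-100.0, 50.0, -200.0)
voxel_xyz = world_to_voxel(nodule_world, offset, spacing)

# Arrays returned by SimpleITK are indexed (z, y, x), so reverse the order
voxel_zyx = voxel_xyz[::-1]
```

Note the index order: the formula produces (x, y, z), but an array obtained through SimpleITK is indexed as (z, y, x).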

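2. Reading the CT value matrix

CT is scanned slice by slice, and the slices are stacked to form a three-dimensional matrix. To read the data, first read the mhd file and use TransformMatrix to decide whether the three-dimensional CT matrix needs to be flipped in the x and y directions, then return the three-dimensional CT matrix.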

import numpy as np
import SimpleITK as sitk

def load_itk_image(filename):
    # Read the TransformMatrix entry from the mhd header
    with open(filename) as f:
        contents = f.readlines()
    line = [k for k in contents if k.startswith('TransformMatrix')][0]
    transform = np.array(line.split(' = ')[1].split()).astype('float')
    transform = np.round(transform)
    # Any deviation from the identity matrix means the scan is flipped
    isflip = bool(np.any(transform != np.array([1, 0, 0, 0, 1, 0, 0, 0, 1])))
    itkimage = sitk.ReadImage(filename)
    numpyimage = sitk.GetArrayFromImage(itkimage)  # shape (z, y, x)
    if isflip:
        # Reverse the y and x axes to undo the flip
        numpyimage = numpyimage[:, ::-1, ::-1]
    return numpyimage
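
The flip check can also be tried without any image file. The sketch below parses header text into a dictionary and applies the same identity test. The helper names `parse_mhd_header` and `needs_flip` are hypothetical, and the sample text is a shortened copy of the header shown earlier.

```python
import numpy as np

def parse_mhd_header(text):
    """Parse 'Key = Value' lines of an mhd header into a dict of strings."""
    header = {}
    for line in text.splitlines():
        if '=' in line:
            key, value = line.split('=', 1)
            header[key.strip()] = value.strip()
    return header

def needs_flip(header):
    """Return True when TransformMatrix is not the 3x3 identity."""
    matrix = np.array(header['TransformMatrix'].split(), dtype=float)
    return bool(np.any(np.round(matrix) != np.eye(3).ravel()))

sample = """NDims = 3
TransformMatrix = 1 0 0 0 1 0 0 0 1
ElementSpacing = 0.6640620231628418 0.6640620231628418 0.625
DimSize = 512 512 471"""

header = parse_mhd_header(sample)
flip = needs_flip(header)
dim_size = [int(v) for v in header['DimSize'].split()]
```

Keeping the header as a dictionary also gives direct access to Offset and ElementSpacing when converting annotation coordinates.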

3. Preprocessing and normalizing the CT value matrix

(1) CT imaging principles

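The values acquired by CT are X-ray attenuation values, expressed in Hounsfield units (HU). Water has an HU value of 0 and air has an HU value of -1000; for other materials the HU value is computed as

$$HU = 1000 \times \frac{\mu - \mu_{water}}{\mu_{water} - \mu_{air}}$$

where $\mu$ is the linear attenuation coefficient of the material, which depends on the X-ray energy, and $\mu_{water}$ and $\mu_{air}$ are the coefficients of water and air. HU values are mapped to image pixel values by a linear transform. Because devices use different transform conventions, the stored pixel values differ somewhat between devices; under the same X-ray conditions, however, the HU values obtained for human tissue are the same, as shown in Table 1.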
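
As a quick check of the two fixed points (water at 0 HU, air at -1000 HU), the standard Hounsfield definition can be sketched as below. The function name and the value of `MU_WATER` are assumptions made for illustration only; real attenuation coefficients depend on the beam energy.

```python
def hounsfield_unit(mu, mu_water, mu_air=0.0):
    """Standard Hounsfield definition: 1000 * (mu - mu_water) / (mu_water - mu_air)."""
    return 1000.0 * (mu - mu_water) / (mu_water - mu_air)

# Assumed, purely illustrative attenuation coefficient of water (cm^-1)
MU_WATER = 0.19

hu_water = hounsfield_unit(MU_WATER, MU_WATER)  # water maps to the zero point
hu_air = hounsfield_unit(0.0, MU_WATER)         # air is approximated by mu = 0
```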

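When processing lung CT images, since lung tissue has an HU value of about -500, the usual practice is to keep the regions whose HU values lie in [-1000, +400] (from air to bone) and to discard everything outside this range as irrelevant.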

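(2) Preprocessing: first, set CT values that are too large or too small to 0.

def truncate_hu(image_array):
    # Zero out values outside [-1000, 400] HU; modifies the array in place
    image_array[image_array > 400] = 0
    image_array[image_array < -1000] = 0
    return image_array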
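
Setting out-of-range values to 0 maps them to the HU value of water. A common alternative, which is not the method used in this post, is to clip values to the nearest bound instead. A minimal sketch, with the hypothetical name `clip_hu`:

```python
import numpy as np

def clip_hu(image_array, lower=-1000, upper=400):
    """Return a copy with HU values limited to [lower, upper]."""
    return np.clip(image_array, lower, upper)

sample = np.array([-2000, -1000, -500, 0, 400, 1500], dtype=np.int16)
clipped = clip_hu(sample)
```

Unlike `truncate_hu`, this version leaves the input array unchanged and keeps very dense or very sparse voxels at the edge of the window rather than in the middle of it.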

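(3) Normalization

def normalization(image_array):
    # Scale to [0, 1], then subtract the mean so the result is zero-centered
    max_val = image_array.max()
    min_val = image_array.min()
    image_array = (image_array - min_val) / (max_val - min_val)
    avg = image_array.mean()
    image_array = image_array - avg
    return image_array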
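
To see what this normalization produces, the sketch below repeats the same two steps under a hypothetical name (`min_max_zero_mean`) and checks the result on a small array: the output has zero mean and spans a range of exactly 1. It assumes the array is not constant, since a constant array would cause a division by zero.

```python
import numpy as np

def min_max_zero_mean(image_array):
    """Min-max scale to [0, 1], then subtract the mean of the scaled values."""
    image_array = image_array.astype(np.float64)
    low, high = image_array.min(), image_array.max()
    scaled = (image_array - low) / (high - low)
    return scaled - scaled.mean()

sample = np.array([[-1000.0, -500.0], [0.0, 400.0]])
normalized = min_max_zero_mean(sample)
value_range = normalized.max() - normalized.min()
```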

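(4) Single-slice visualization: pick one slice to display and mark the nodule position.

import matplotlib.pyplot as plt

a = image_array.transpose(1, 2, 0)[:, :, 0]  # transpose turns the (z, y, x) array into (y, x, z); take the first slice
plt.imshow(a, cmap='gray')
# Draw a box around the nodule (the coordinates are specific to this example)
plt.gca().add_patch(plt.Rectangle((147, 297), 24, 24, fill=False, edgecolor='r', linewidth=3))
plt.show()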
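
The rectangle in the snippet above is hard-coded. One way to derive it from an annotation is sketched below: given the nodule center in pixel coordinates and its size in millimeters, compute the box in pixels. The helper `nodule_box`, its `margin` parameter, and the sample numbers are assumptions made for illustration.

```python
def nodule_box(center_x, center_y, diameter_mm, spacing_x, spacing_y, margin=2):
    """Return (left, top, width, height) in pixels for a box around a nodule."""
    # Half of the nodule extent in pixels, plus a small margin on each side
    half_w = diameter_mm / spacing_x / 2 + margin
    half_h = diameter_mm / spacing_y / 2 + margin
    left = int(round(center_x - half_w))
    top = int(round(center_y - half_h))
    return left, top, int(round(2 * half_w)), int(round(2 * half_h))

# Hypothetical nodule: center at pixel (100, 200), 8 mm across, 0.5 mm pixels
box = nodule_box(100, 200, diameter_mm=8.0, spacing_x=0.5, spacing_y=0.5)
```

The result can be passed to matplotlib as `plt.Rectangle((left, top), width, height, fill=False)`.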

III. LUNA16 Data Processing

Source: "LUNA16数据集介绍" (Introduction to the LUNA16 Dataset), pursuit_zhangyu's blog on CSDN

# -*- coding:utf-8 -*-
'''
This script performs basic lung preprocessing, taken from the Data Science Bowl 2017.
'''
import SimpleITK as sitk
from skimage.morphology import disk, binary_erosion, binary_closing
from skimage.measure import label, regionprops
from skimage.filters import roberts
from skimage.segmentation import clear_border
from scipy import ndimage as ndi
import matplotlib.pyplot as plt
 
 
# numpyImage[numpyImage > -600] = 1
# numpyImage[numpyImage <= -600] = 0
 
def get_segmented_lungs(im, plot=False):
    '''
    This function segments the lungs from the given 2D slice.
    '''
    if plot == True:
        f, plots = plt.subplots(8, 1, figsize=(5, 40))
    '''
    Step 1: Convert into a binary image. 
    '''
    binary = im < -600
    if plot == True:
        plots[0].axis('off')
        plots[0].imshow(binary, cmap=plt.cm.bone)
    '''
    Step 2: Remove the blobs connected to the border of the image.
    '''
    cleared = clear_border(binary)
    if plot == True:
        plots[1].axis('off')
        plots[1].imshow(cleared, cmap=plt.cm.bone)
    '''
    Step 3: Label the image.
    '''
    label_image = label(cleared)
    if plot == True:
        plots[2].axis('off')
        plots[2].imshow(label_image, cmap=plt.cm.bone)
    '''
    Step 4: Keep the labels with 2 largest areas.
    '''
    areas = [r.area for r in regionprops(label_image)]
    areas.sort()
    if len(areas) > 2:
        for region in regionprops(label_image):
            if region.area < areas[-2]:
                for coordinates in region.coords:
                    label_image[coordinates[0], coordinates[1]] = 0
    binary = label_image > 0
    if plot == True:
        plots[3].axis('off')
        plots[3].imshow(binary, cmap=plt.cm.bone)
    '''
    Step 5: Erosion operation with a disk of radius 2. This operation
    separates the lung nodules attached to the blood vessels.
    '''
    selem = disk(2)
    binary = binary_erosion(binary, selem)
    if plot == True:
        plots[4].axis('off')
        plots[4].imshow(binary, cmap=plt.cm.bone)
    '''
    Step 6: Closing operation with a disk of radius 10. This operation
    keeps nodules attached to the lung wall.
    '''
    selem = disk(10)
    binary = binary_closing(binary, selem)
    if plot == True:
        plots[5].axis('off')
        plots[5].imshow(binary, cmap=plt.cm.bone)
    '''
    Step 7: Fill in the small holes inside the binary mask of lungs.
    '''
    edges = roberts(binary)
    binary = ndi.binary_fill_holes(edges)
    if plot == True:
        plots[6].axis('off')
        plots[6].imshow(binary, cmap=plt.cm.bone)
    '''
    Step 8: Superimpose the binary mask on the input image.
    '''
    get_high_vals = binary == 0
    im[get_high_vals] = 0
    if plot == True:
        plots[7].axis('off')
        plots[7].imshow(im, cmap=plt.cm.bone)
 
    if plot:
        plt.show()
 
    return im
 
 
if __name__ == '__main__':
    filename = './raw_data/1.3.6.1.4.1.14519.5.2.1.6279.6001.108197895896446896160048741492.mhd'
    itkimage = sitk.ReadImage(filename)  # read the .mhd file
    numpyImage = sitk.GetArrayFromImage(itkimage)  # get the data, read automatically from the matching .raw file
    data = numpyImage[50]
    plt.figure(50)
    plt.imshow(data, cmap='gray')
    im = get_segmented_lungs(data, plot=True)
    plt.figure(200)
    plt.imshow(im, cmap='gray')
    plt.show()
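
The script above processes a single slice. To cover a whole scan, the same 2D function can be applied slice by slice, as sketched below. The wrapper `segment_volume` is hypothetical, and `toy_segment` is only a stand-in so the example runs without scikit-image; in practice `get_segmented_lungs` would be passed instead. Each slice is copied first because `get_segmented_lungs` writes into its input array.

```python
import numpy as np

def segment_volume(volume, segment_slice):
    """Apply a 2D segmentation function to every axial slice of a (z, y, x) volume."""
    result = np.empty_like(volume)
    for z in range(volume.shape[0]):
        # Pass a copy so that in-place edits do not touch the original volume
        result[z] = segment_slice(volume[z].copy())
    return result

def toy_segment(im):
    """Stand-in for get_segmented_lungs: zero every pixel at or above -600 HU."""
    im[im >= -600] = 0
    return im

volume = np.array([[[-800, 100], [-700, -200]],
                   [[-900, -650], [50, -1000]]], dtype=np.int16)
masked = segment_volume(volume, toy_segment)
```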