Kaggle Lung Cancer Detection: Full Preprocessing Tutorial (with translation)

This post walks through the complete preprocessing of the lung CT scans from the LUNA16 competition: loading the DICOM files, converting pixel values to Hounsfield units, resampling, 3D visualization, lung segmentation, normalization, and zero centering. The preprocessing aims to prepare the data for CNNs and other machine-learning methods.

Original article: https://www.kaggle.com/gzuidhof/full-preprocessing-tutorial/notebook

Introduction

Working with these files can be a challenge, especially given their heterogeneous nature. Some preprocessing is required before they are ready for consumption by your CNN.

Fortunately, I participated in the LUNA16 competition as part of a university course on computer aided diagnosis, so I have some experience working with these files. At this moment we top the leaderboard there :)

This tutorial aims to provide a comprehensive overview of useful steps to take before the data hits your ConvNet/other ML method.

What we will cover:

  • Loading the DICOM files, and adding missing metadata
  • Converting the pixel values to Hounsfield Units (HU), and what tissue these unit values correspond to
  • Resampling to an isotropic resolution to remove variance in scanner resolution.
  • 3D plotting; visualization is very useful to see what we are doing.
  • Lung segmentation
  • Normalization that makes sense.
  • Zero centering the scans.
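As a preview, the last two steps in this list (normalization and zero centering) can be sketched as follows. The HU bounds of -1000 and 400 and the pixel mean of 0.25 are typical choices for lung CT, used here for illustration rather than taken from this dataset:

```python
import numpy as np

MIN_BOUND = -1000.0  # roughly air, in HU (illustrative bound)
MAX_BOUND = 400.0    # roughly bone, in HU (illustrative bound)
PIXEL_MEAN = 0.25    # assumed dataset mean after normalization

def normalize(image):
    """Clip to [MIN_BOUND, MAX_BOUND] and linearly scale to [0, 1]."""
    image = (image - MIN_BOUND) / (MAX_BOUND - MIN_BOUND)
    image[image > 1] = 1.
    image[image < 0] = 0.
    return image

def zero_center(image):
    """Subtract the (pre-computed) dataset mean so values center around 0."""
    return image - PIXEL_MEAN

# Padding (-2000), air, water, bone-ish, and an over-range value:
scan = np.array([-2000., -1000., 0., 400., 1000.])
print(zero_center(normalize(scan)))
```

Everything below air is clipped to 0 and everything above bone to 1 before centering, so extreme out-of-range voxels do not dominate training.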

Before we start, let's import some packages and determine the available patients.

Translator's note: my English is not great, so this is a loose translation rather than a word-for-word one. Apologies, and please point out any mistakes.

%matplotlib inline

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import dicom
import os
import scipy.ndimage
import matplotlib.pyplot as plt

from skimage import measure, morphology
from mpl_toolkits.mplot3d.art3d import Poly3DCollection

# Some constants 
INPUT_FOLDER = '../input/sample_images/'
patients = os.listdir(INPUT_FOLDER)
patients.sort()

Loading the files

DICOM is the de-facto file standard in medical imaging. This is my first time working with it, but it seems to be fairly straightforward. These files contain a lot of metadata (such as the pixel size, i.e. how long one pixel is in each dimension in the real world).

This pixel size/coarseness of the scan differs from scan to scan (e.g. the distance between slices may differ), which can hurt performance of CNN approaches. We can deal with this by isotropic resampling, which we will do later.
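The resampling step can be sketched with `scipy.ndimage.zoom`; the function name and the spacing values below are illustrative, not taken from the dataset:

```python
import numpy as np
import scipy.ndimage

def resample_to_isotropic(image, spacing, new_spacing=(1.0, 1.0, 1.0)):
    """Resample a 3D volume so each voxel is new_spacing mm on a side.

    Because the output shape must be integral, the spacing actually
    achieved differs slightly from new_spacing; we return it as well.
    """
    spacing = np.array(spacing, dtype=np.float64)
    resize_factor = spacing / np.array(new_spacing)
    new_shape = np.round(np.array(image.shape) * resize_factor)
    real_resize_factor = new_shape / np.array(image.shape)
    real_spacing = spacing / real_resize_factor
    image = scipy.ndimage.zoom(image, real_resize_factor, mode='nearest')
    return image, real_spacing

# A fake 10-slice scan with 2.5 mm slice spacing and 0.7 mm in-plane pixels
volume = np.zeros((10, 64, 64))
resampled, achieved = resample_to_isotropic(volume, (2.5, 0.7, 0.7))
print(resampled.shape)  # (25, 45, 45)
```

Rounding the target shape first and recomputing the zoom factor keeps the reported spacing honest instead of silently assuming exactly 1 mm voxels.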

Below is code to load a scan, which consists of multiple slices, which we simply save in a Python list. Every folder in the dataset is one scan (so one patient). One metadata field is missing, the pixel size in the Z direction, which is the slice thickness. Fortunately we can infer this, and we add this to the metadata.


# Load the scans in given folder path
def load_scan(path):
    slices = [dicom.read_file(path + '/' + s) for s in os.listdir(path)]
    slices.sort(key = lambda x: float(x.ImagePositionPatient[2]))
    try:
        slice_thickness = np.abs(slices[0].ImagePositionPatient[2] - slices[1].ImagePositionPatient[2])
    except:
        # Fall back to SliceLocation when ImagePositionPatient is unusable
        slice_thickness = np.abs(slices[0].SliceLocation - slices[1].SliceLocation)

    # Add the inferred slice thickness to the metadata of every slice
    for s in slices:
        s.SliceThickness = slice_thickness

    return slices
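The post breaks off here, but the next step promised in the overview, converting raw pixel values to Hounsfield units, is just the linear rescale defined by the standard DICOM attributes RescaleSlope and RescaleIntercept. A sketch on a plain NumPy array, so it runs without DICOM files (the default slope and intercept below are common CT values, assumed for illustration; real code should read them from each slice):

```python
import numpy as np

def pixels_to_hu(pixel_array, intercept=-1024.0, slope=1.0):
    """Apply the DICOM rescale: HU = slope * stored_value + intercept.

    Out-of-scan voxels are often stored as -2000; set them to 0 first,
    which the intercept then maps to roughly air (about -1000 HU).
    """
    image = pixel_array.astype(np.int16)
    image[image == -2000] = 0          # out-of-bounds padding -> air
    if slope != 1.0:
        image = (slope * image.astype(np.float64)).astype(np.int16)
    image += np.int16(intercept)
    return image

raw = np.array([[-2000, 0], [1024, 1424]], dtype=np.int16)
print(pixels_to_hu(raw))  # padding -> -1024, water -> 0, bone-ish -> 400
```

After this conversion the values correspond directly to tissue types (air around -1000 HU, water at 0 HU, bone from roughly +400 HU upward), which is what makes the later thresholding and segmentation steps possible.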