人工智能学习笔记（1）-从nnu-net源码中学习数据预处理

之前做研究从来没有写笔记的习惯，现在也开始创作一些自己的东西了，就写一写笔记记录一下。

u-net是一种能很好的描述图之间的转换细节的网络结构，能很好的应用于医疗影像数据的分割任务里面，nnu-net在论文《nnU-Net: a self-configuring method for deep learning-based biomedical image segmentation》中被提出，最大的特色就是提供了一整套数据预处理的流程。今天让我们来分析一下作者的github源代码，从中学习一些关于数据预处理的经验。

spacing and normalization

normalization（标准化）

参考

我们先看一下整个项目的目录树：

├─dataset_conversion
├─evaluation
│  └─model_selection
├─experiment_planning
│  ├─alternative_experiment_planning
│  │  ├─normalization
│  │  ├─patch_size
│  │  ├─pooling_and_convs
│  │  └─target_spacing
│  └─old
├─inference
│  └─pretrained_models
├─network_architecture
│  └─custom_modules
├─postprocessing
├─preprocessing
│  └─custom_preprocessors
├─run
├─training
└─utilities

接下来我们按照数据处理的流程来看每个文件都在干什么。

数据格式预转换

（注意：这一部分并不在整个nnu-net框架中，是需要用户自己来做的）

首先在nnunet目录下有一个叫做path.py的文件，它从环境变量中导入nnUNet_raw_data_base、nnUNet_preprocessed、RESULTS_FOLDER的路径，我特地指出这一点是因为我的环境变量设置出了一些问题，所以我直接就修改这里的源代码来设置路径。

对于训练的数据集，首先要将原始数据的目录重命名为如下格式：
├── Task005_Prostate
│   ├── dataset.json
│   ├── imagesTr
│   ├── imagesTs
│   └── labelsTr

# Data format for Inference 

The data format for inference must match the one used for the raw data (specifically, the images must be in exactly 
the same format as in the imagesTr folder). As before, the filenames must start with a
unique identifier, followed by a 4-digit modality identifier. Here is an example for two different datasets:

1) Task005_Prostate:

    This task has 2 modalities, so the files in the input folder must look like this:

        input_folder
        ├── prostate_03_0000.nii.gz
        ├── prostate_03_0001.nii.gz
        ├── prostate_05_0000.nii.gz
        ├── prostate_05_0001.nii.gz
        ├── prostate_08_0000.nii.gz
        ├── prostate_08_0001.nii.gz
        ├── ...

    _0000 is always the T2 image and _0001 is always the ADC image (as specified by 'modality' in the dataset.json)

2) Task002_Heart:

        imagesTs
        ├── la_001_0000.nii.gz
        ├── la_002_0000.nii.gz
        ├── la_006_0000.nii.gz
        ├── ...
    
    Task002 only has one modality, so each case only has one _0000.nii.gz file.
  

The segmentations in the output folder will be named INDENTIFIER.nii.gz (omitting the modality identifier).

（上面的document来自documentation\data_format_inference.md，大家可以去看看作者的document，说的已经很清楚了）

对于每一个病例的数据，nnUNet中的数据格式包括一个四维numpy数组以及与之对应的pickle文件。后缀分别为.npz与.pkl。在四维数组data=array(CXYZ)中第一个维度的最后一块array[-1,:,:,:]存放的是分割的label，前面存放不同模态的数据，可以理解为图片的rgb等多个通道，医疗图像数据据也有多个通道，比如MRI数据中有FLAIR, T1w, t1gd, T2w等四种模态。pickle文件里面则存放其它的一些信息，以字典的格式存储。

整个预处理的代码在nnunet\experiment_planning\nnUNet_plan_and_preprocess.py中：

crop（裁剪）

# we need raw data
    tasks = []
    for i in task_ids:
        i = int(i)

        task_name = convert_id_to_task_name(i)

        if args.verify_dataset_integrity:
            verify_dataset_integrity(join(nnUNet_raw_data, task_name))

        crop(task_name, False, tf)

        tasks.append(task_name)

    search_in = join(nnunet.__path__[0], "experiment_planning")

首先使用convert_id_to_task_name获得对应所要做的任务目录下的文件名。

verify_dataset_integrity做的是数据完整性检查，也就是检查有没有缺失的数据。

然后就是crop函数crop(task_name, False, tf)

crop主要处理了以下问题：在医疗影像数据中，有很大一部分都是空的，也就是全为0的值，这一部分的数据是不包含任何信息的，因此可以被压缩掉。

crop的核心代码在nnunet\preprocessing\cropping.py：

def crop_to_nonzero(data, seg=None, nonzero_label=-1):
    """

    :param data:
    :param seg:
    :param nonzero_label: this will be written into the segmentation map
    :return:
    """
    nonzero_mask = create_nonzero_mask(data)
    bbox = get_bbox_from_mask(nonzero_mask, 0)

    cropped_data = []
    for c in range(data.shape[0]):
        cropped = crop_to_bbox(data[c], bbox)
        cropped_data.append(cropped[None])
    data = np.vstack(cropped_data)

    if seg is not None:
        cropped_seg = []
        for c in range(seg.shape[0]):
            cropped = crop_to_bbox(seg[c], bbox)
            cropped_seg.append(cropped[None])
        seg = np.vstack(cropped_seg)

    nonzero_mask = crop_to_bbox(nonzero_mask, bbox)[None]
    if seg is not None:
        seg[(seg == 0) & (nonzero_mask == 0)] = nonzero_label
    else:
        nonzero_mask = nonzero_mask.astype(int)
        nonzero_mask[nonzero_mask == 0] = nonzero_label
        nonzero_mask[nonzero_mask > 0] = 0
        seg = nonzero_mask
    return data, seg, bbox

nonzero_mask = create_nonzero_mask(data)这行代码根据data(CXYZ)获得原始数据的非零区域的mask，如果有多个特征图则是各个mask的并集。

bbox = get_bbox_from_mask(nonzero_mask, 0)这行代码柑橘上一步获得的nonzero_mask计算其在x，y，z轴上的非零区域的最大最小值。我们可以从这一步看出，crop就是找到一个能装得下原始数据的最小长方体盒子。

后面五行：

cropped_data = []
    for c in range(data.shape[0]):
        cropped = crop_to_bbox(data[c], bbox)
        cropped_data.append(cropped[None])
    data = np.vstack(cropped_data)

就是根据上一步获得的包围盒对源数据的每一个通道做裁剪，然后再拼接。

同样的，如果seg标注不为空，这一部分也需要进行crop，将数据中被crop掉的区域标注为nonzero_label，默认为-1。这样就可以根据标签快速判断哪些区域被裁剪，哪些没有。

seg[(seg == 0) & (nonzero_mask == 0)] = nonzero_label

data analysis

我们看到这两行代码：

dataset_analyzer = DatasetAnalyzer(cropped_out_dir, overwrite=False, num_processes=tf)  # this class creates the fingerprint
        _ = dataset_analyzer.analyze_dataset(collect_intensityproperties)  # this will write output files that will be used by the ExperimentPlanner

这里构造了一个DatasetAnalyzer来对原有数据的特征进行分析，也就是作者所说的提取“data fingertips”，然后将提取出来的数据指纹保存在dataset_properties.pkl文件里。（实际上仔细研究后发现，所谓的数据指纹是在dataset.json文件里面保存的...所以本质上这些数据都是用户自定义的），我们可以看出数据的一些性质都在这里：all_sizes、all_spacings、all_classes...这些信息为后面的数据预处理做了准备。

def analyze_dataset(self, collect_intensityproperties=True):
        # get all spacings and sizes
        sizes, spacings = self.get_sizes_and_spacings_after_cropping()

        # get all classes and what classes are in what patients
        # class min size
        # region size per class
        classes = self.get_classes()
        all_classes = [int(i) for i in classes.keys() if int(i) > 0]

        # modalities
        modalities = self.get_modalities()

        # collect intensity information
        if collect_intensityproperties:
            intensityproperties = self.collect_intensity_properties(len(modalities))
        else:
            intensityproperties = None

        # size reduction by cropping
        size_reductions = self.get_size_reduction_by_cropping()

        dataset_properties = dict()
        dataset_properties['all_sizes'] = sizes
        dataset_properties['all_spacings'] = spacings
        dataset_properties['all_classes'] = all_classes
        dataset_properties['modalities'] = modalities  # {idx: modality name}
        dataset_properties['intensityproperties'] = intensityproperties
        dataset_properties['size_reductions'] = size_reductions  # {patient_id: size_reduction}

        save_pickle(dataset_properties, join(self.folder_with_cropped_data, "dataset_properties.pkl"))
        return dataset_properties

planner

然后就是预处理的最后一部分：

if planner_3d is not None:
            if args.overwrite_plans is not None:
                assert args.overwrite_plans_identifier is not None, "You need to specify -overwrite_plans_identifier"
                exp_planner = planner_3d(cropped_out_dir, preprocessing_output_dir_this_task, args.overwrite_plans,
                                         args.overwrite_plans_identifier)
            else:
                exp_planner = planner_3d(cropped_out_dir, preprocessing_output_dir_this_task)
            exp_planner.plan_experiment()
            if not dont_run_preprocessing:  # double negative, yooo
                exp_planner.run_preprocessing(threads)
        if planner_2d is not None:
            exp_planner = planner_2d(cropped_out_dir, preprocessing_output_dir_this_task)
            exp_planner.plan_experiment()
            if not dont_run_preprocessing:  # double negative, yooo
                exp_planner.run_preprocessing(threads)

这里针对3d数据和2d数据给出了两种不同的exp_planner，在前面一段的代码recursive_find_python_class中获得：

if planner_name3d is not None:
        planner_3d = recursive_find_python_class([search_in], planner_name3d, current_module="nnunet.experiment_planning")
        if planner_3d is None:
            raise RuntimeError("Could not find the Planner class %s. Make sure it is located somewhere in "
                               "nnunet.experiment_planning" % planner_name3d)
    else:
        planner_3d = None

    if planner_name2d is not None:
        planner_2d = recursive_find_python_class([search_in], planner_name2d, current_module="nnunet.experiment_planning")
        if planner_2d is None:
            raise RuntimeError("Could not find the Planner class %s. Make sure it is located somewhere in "
                               "nnunet.experiment_planning" % planner_name2d)
    else:
        planner_2d = None

所有的planner都在nnunet\experiment_planning中被定义，这部分决定了数据的预处理的方式。

我们就nnunet\experiment_planning\experiment_planner_baseline_3DUNet_v21.py来分析一下planner是怎么做preprocess的。

ExperimentPlanner3D_v21是ExperimentPlanner的子类，也就继承了ExperimentPlanner的大部分性质。

我们先看ExperimentPlanner的plan_experiment()的最后几行

def plan_experiment(self):

       ................

        # these are independent of the stage
        plans = {'num_stages': len(list(self.plans_per_stage.keys())), 'num_modalities': num_modalities,
                 'modalities': modalities, 'normalization_schemes': normalization_schemes,
                 'dataset_properties': self.dataset_properties, 'list_of_npz_files': self.list_of_cropped_npz_files,
                 'original_spacings': spacings, 'original_sizes': sizes,
                 'preprocessed_data_folder': self.preprocessed_output_folder, 'num_classes': len(all_classes),
                 'all_classes': all_classes, 'base_num_features': self.unet_base_num_features,
                 'use_mask_for_norm': use_nonzero_mask_for_normalization,
                 'keep_only_largest_region': only_keep_largest_connected_component,
                 'min_region_size_per_class': min_region_size_per_class, 'min_size_per_class': min_size_per_class,
                 'transpose_forward': self.transpose_forward, 'transpose_backward': self.transpose_backward,
                 'data_identifier': self.data_identifier, 'plans_per_stage': self.plans_per_stage,
                 'preprocessor_name': self.preprocessor_name,
                 'conv_per_stage': self.conv_per_stage,
                 }

        self.plans = plans
        self.save_my_plans()

上面这部分代码做了非常多的分析，获得了很多plans,也就是后面模型所需要使用的hyperparameter，比如模型的输出种类，正则化方法等。这部分代码量实在是太大了，我就只讲一讲比较重要的几个预处理的手段，说明一下默认的方法是怎么获得这些plans的。

def get_properties_for_stage(self, current_spacing, original_spacing, original_shape, num_cases,
                                 num_modalities, num_classes):
        """
        ExperimentPlanner configures pooling so that we pool late. Meaning that if the number of pooling per axis is
        (2, 3, 3), then the first pooling operation will always pool axes 1 and 2 and not 0, irrespective of spacing.
        This can cause a larger memory footprint, so it can be beneficial to revise this.

        Here we are pooling based on the spacing of the data.

        """
        ......................

        plan = {
            'batch_size': batch_size,
            'num_pool_per_axis': network_num_pool_per_axis,
            'patch_size': input_patch_size,
            'median_patient_size_in_voxels': new_median_shape,
            'current_spacing': current_spacing,
            'original_spacing': original_spacing,
            'do_dummy_2D_data_aug': do_dummy_2D_data_aug,
            'pool_op_kernel_sizes': pool_op_kernel_sizes,
            'conv_kernel_sizes': conv_kernel_sizes,
        }
        return plan

spacing and normalization

对应部分的代码：

 def resample_and_normalize(self, data, target_spacing, properties, seg=None, force_separate_z=None):
        """
        data and seg must already have been transposed by transpose_forward. properties are the un-transposed values
        (spacing etc)
        :param data:
        :param target_spacing:
        :param properties:
        :param seg:
        :param force_separate_z:
        :return:
        """

        # target_spacing is already transposed, properties["original_spacing"] is not so we need to transpose it!
        # data, seg are already transposed. Double check this using the properties
        original_spacing_transposed = np.array(properties["original_spacing"])[self.transpose_forward]
        before = {
            'spacing': properties["original_spacing"],
            'spacing_transposed': original_spacing_transposed,
            'data.shape (data is transposed)': data.shape
        }

        # remove nans
        data[np.isnan(data)] = 0

        data, seg = resample_patient(data, seg, np.array(original_spacing_transposed), target_spacing, 3, 1,
                                     force_separate_z=force_separate_z, order_z_data=0, order_z_seg=0,
                                     separate_z_anisotropy_threshold=self.resample_separate_z_anisotropy_threshold)
        after = {
            'spacing': target_spacing,
            'data.shape (data is resampled)': data.shape
        }
        print("before:", before, "\nafter: ", after, "\n")

        if seg is not None:  # hippocampus 243 has one voxel with -2 as label. wtf?
            seg[seg < -1] = 0

        properties["size_after_resampling"] = data[0].shape
        properties["spacing_after_resampling"] = target_spacing
        use_nonzero_mask = self.use_nonzero_mask

        assert len(self.normalization_scheme_per_modality) == len(data), "self.normalization_scheme_per_modality " \
                                                                         "must have as many entries as data has " \
                                                                         "modalities"
        assert len(self.use_nonzero_mask) == len(data), "self.use_nonzero_mask must have as many entries as data" \
                                                        " has modalities"

        for c in range(len(data)):
            scheme = self.normalization_scheme_per_modality[c]
            if scheme == "CT":
                # clip to lb and ub from train data foreground and use foreground mn and sd from training data
                assert self.intensityproperties is not None, "ERROR: if there is a CT then we need intensity properties"
                mean_intensity = self.intensityproperties[c]['mean']
                std_intensity = self.intensityproperties[c]['sd']
                lower_bound = self.intensityproperties[c]['percentile_00_5']
                upper_bound = self.intensityproperties[c]['percentile_99_5']
                data[c] = np.clip(data[c], lower_bound, upper_bound)
                data[c] = (data[c] - mean_intensity) / std_intensity
                if use_nonzero_mask[c]:
                    data[c][seg[-1] < 0] = 0
            elif scheme == "CT2":
                # clip to lb and ub from train data foreground, use mn and sd form each case for normalization
                assert self.intensityproperties is not None, "ERROR: if there is a CT then we need intensity properties"
                lower_bound = self.intensityproperties[c]['percentile_00_5']
                upper_bound = self.intensityproperties[c]['percentile_99_5']
                mask = (data[c] > lower_bound) & (data[c] < upper_bound)
                data[c] = np.clip(data[c], lower_bound, upper_bound)
                mn = data[c][mask].mean()
                sd = data[c][mask].std()
                data[c] = (data[c] - mn) / sd
                if use_nonzero_mask[c]:
                    data[c][seg[-1] < 0] = 0
            elif scheme == 'noNorm':
                pass
            else:
                if use_nonzero_mask[c]:
                    mask = seg[-1] >= 0
                    data[c][mask] = (data[c][mask] - data[c][mask].mean()) / (data[c][mask].std() + 1e-8)
                    data[c][mask == 0] = 0
                else:
                    mn = data[c].mean()
                    std = data[c].std()
                    # print(data[c].shape, data[c].dtype, mn, std)
                    data[c] = (data[c] - mn) / (std + 1e-8)
        return data, seg, properties

在很多医疗影像中，图片数据是各向异性的。各向异性也就是说单个voxel所代表的空间的spacing在各个方向不一样。在u-net里面会忽视掉这种voxel之间的差异性，因此我们需要做resample，让每一个voxel在各个轴上代表的实际物理距离相等。

 def get_target_spacing(self):
        """
        per default we use the 50th percentile=median for the target spacing. Higher spacing results in smaller data
        and thus faster and easier training. Smaller spacing results in larger data and thus longer and harder training

        For some datasets the median is not a good choice. Those are the datasets where the spacing is very anisotropic
        (for example ACDC with (10, 1.5, 1.5)). These datasets still have examples with a spacing of 5 or 6 mm in the low
        resolution axis. Choosing the median here will result in bad interpolation artifacts that can substantially
        impact performance (due to the low number of slices).
        """

作者在上面代码的注释里面也说明了，在默认情况下，target_spacing是数据集各个图像中spacing的中值：

self.target_spacing_percentile = 50
self.anisotropy_threshold = 3

anisotropy_threshold翻译为各向异性门槛，也就是说最大坐标上的spacing÷最小坐标上的spacing>3的数据集被认为是各向异性数据集：

worst_spacing_axis = np.argmax(target)
other_axes = [i for i in range(len(target)) if i != worst_spacing_axis]
other_spacings = [target[i] for i in other_axes]

has_aniso_spacing = target[worst_spacing_axis] > (anisotropy_threshold * min(other_spacings))
has_aniso_voxels = target_size[worst_spacing_axis] * self.anisotropy_threshold < min(other_sizes)

在这样的数据集上，取数据集10%分位点的spacing值作为spacing最大坐标的目标空间大小：

if has_aniso_spacing and has_aniso_voxels:
            spacings_of_that_axis = np.vstack(spacings)[:, worst_spacing_axis]
            target_spacing_of_that_axis = np.percentile(spacings_of_that_axis, 10)
            # don't let the spacing of that axis get higher than the other axes
            if target_spacing_of_that_axis < max(other_spacings):
                target_spacing_of_that_axis = max(max(other_spacings), target_spacing_of_that_axis) + 1e-5
            target[worst_spacing_axis] = target_spacing_of_that_axis

获得了target_spacing，我们也就获得了目标空间下的图片的尺寸shape，从而对原始图片进行resize，注意planner只负责计算超参数，这一部分的代码nnunet\preprocessing\preprocessing.py

计算方法如下：

new_shape = np.round(((np.array(original_spacing) / np.array(target_spacing)).astype(float) * shape)).astype(int)

reshape的代码在如下函数中实现：

def resample_patient(data, seg, original_spacing, target_spacing, order_data=3, order_seg=0, force_separate_z=False,
                     order_z_data=0, order_z_seg=0,
                     separate_z_anisotropy_threshold=RESAMPLING_SEPARATE_Z_ANISO_THRESHOLD):
    """
    :param data:
    :param seg:
    :param original_spacing:
    :param target_spacing:
    :param order_data:
    :param order_seg:
    :param force_separate_z: if None then we dynamically decide how to resample along z, if True/False then always
    /never resample along z separately
    :param order_z_seg: only applies if do_separate_z is True
    :param order_z_data: only applies if do_separate_z is True
    :param separate_z_anisotropy_threshold: if max_spacing > separate_z_anisotropy_threshold * min_spacing (per axis)
    then resample along lowres axis with order_z_data/order_z_seg instead of order_data/order_seg

    :return:
    """

normalization（标准化）

非CT图像的标准化很简单：

data[c][mask] = (data[c][mask] - data[c][mask].mean()) / (data[c][mask].std() + 1e-8)

每张图片减去均值除以方差就是标准化后的结果。

对于CT图像由于voxel的绝对值也包含了相应的信息，所以要采用一种更为复杂的策略：

class ExperimentPlannerCT2(ExperimentPlanner):
    """
    preprocesses CT data with the "CT2" normalization.

    (clip range comes from training set and is the 0.5 and 99.5 percentile of intensities in foreground)
    CT = clip to range, then normalize with global mn and sd (computed on foreground in training set)
    CT2 = clip to range, normalize each case separately with its own mn and std (computed within the area that was in clip_range)
    """
    def __init__(self, folder_with_cropped_data, preprocessed_output_folder):
        super(ExperimentPlannerCT2, self).__init__(folder_with_cropped_data, preprocessed_output_folder)
        self.data_identifier = "nnUNet_CT2"
        self.plans_fname = join(self.preprocessed_output_folder, "nnUNetPlans" + "CT2_plans_3D.pkl")

    def determine_normalization_scheme(self):
        schemes = OrderedDict()
        modalities = self.dataset_properties['modalities']
        num_modalities = len(list(modalities.keys()))

        for i in range(num_modalities):
            if modalities[i] == "CT":
                schemes[i] = "CT2"
            else:
                schemes[i] = "nonCT"
        return schemes

这部分的策略在这里：

if scheme == "CT":
                # clip to lb and ub from train data foreground and use foreground mn and sd from training data
                assert self.intensityproperties is not None, "ERROR: if there is a CT then we need intensity properties"
                mean_intensity = self.intensityproperties[c]['mean']
                std_intensity = self.intensityproperties[c]['sd']
                lower_bound = self.intensityproperties[c]['percentile_00_5']
                upper_bound = self.intensityproperties[c]['percentile_99_5']
                data[c] = np.clip(data[c], lower_bound, upper_bound)
                data[c] = (data[c] - mean_intensity) / std_intensity
                if use_nonzero_mask[c]:
                    data[c][seg[-1] < 0] = 0
            elif scheme == "CT2":
                # clip to lb and ub from train data foreground, use mn and sd form each case for normalization
                assert self.intensityproperties is not None, "ERROR: if there is a CT then we need intensity properties"
                lower_bound = self.intensityproperties[c]['percentile_00_5']
                upper_bound = self.intensityproperties[c]['percentile_99_5']
                mask = (data[c] > lower_bound) & (data[c] < upper_bound)
                data[c] = np.clip(data[c], lower_bound, upper_bound)
                mn = data[c][mask].mean()
                sd = data[c][mask].std()
                data[c] = (data[c] - mn) / sd
                if use_nonzero_mask[c]:
                    data[c][seg[-1] < 0] = 0

参考

[1]F. Isensee, P. F. Jaeger, S. A. A. Kohl, J. Petersen, and K. H. Maier-Hein, “nnU-Net: a self-configuring method for deep learning-based biomedical image segmentation,” Nat Methods, Dec. 2020, doi: 10.1038/s41592-020-01008-z.

[2]如何针对三维医学图像分割任务进行通用数据预处理：nnUNet中预处理流程总结及代码分析

[3]nnUNet（代码）-预处理