[Machine Learning] Medical Image Processing: Downloading the ABIDE Dataset

Article link: https://blog.csdn.net/weixin_42163563/article/details/121384051

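As AI and deep learning mature, artificial intelligence and big-data technologies have begun to enter the medical field. They streamline existing workflows, improving efficiency, accuracy, and user experience, while also easing the pressure on medical resources and the problem of insufficient precision.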

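Intelligent healthcare is developing in many directions, such as medical image processing, diagnostic prediction, disease control, health management, rehabilitation robots, and speech-recognition-based digitization of medical records. Medical imaging, one of the current focal points of AI, is highly valuable for disease prediction and automated diagnosis. This post describes how to download the ABIDE dataset.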

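The ABIDE dataset was released in 2013. It is a large-scale dataset for evaluating the intrinsic brain architecture in autism, containing functional MRI images from 539 individuals with ASD and 573 typical controls.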

The Python download script, download_abide_preproc.py:

# download_abide_preproc.py
#
# Updated to python 3 and to support downloading by DX, Cameron Craddock, 2019

"""
This script downloads data from the Preprocessed Connectomes Project's
ABIDE Preprocessed data release and stores the files in a local
directory; users specify derivative, pipeline, strategy, and optionally
age ranges, sex, site of interest
Usage:
    python download_abide_preproc.py -d <derivative> -p <pipeline>
                                     -s <strategy> -o <out_dir>
                                     [-lt <less_than>] [-gt <greater_than>]
                                     [-x <sex>] [-t <site>]
"""


# Main collect and download function
def collect_and_download(derivative, pipeline, strategy, out_dir, less_than, greater_than, site, sex, diagnosis):
    """
    Function to collect and download images from the ABIDE preprocessed
    directory on FCP-INDI's S3 bucket
    Parameters
    ----------
    derivative : string
        derivative or measure of interest
    pipeline : string
        pipeline used to process data of interest
    strategy : string
        noise removal strategy used to process data of interest
    out_dir : string
        filepath to a local directory to save files to
    less_than : float
        upper age (years) threshold for participants of interest
    greater_than : float
        lower age (years) threshold for participants of interest
    site : string
        acquisition site of interest
    sex : string
        'M' or 'F' to indicate whether to download male or female data
    diagnosis : string
        'asd', 'tdc', or 'both' corresponding to the diagnosis of the
        participants for whom data should be downloaded
    Returns
    -------
    None
        this function does not return a value; it downloads data from
        S3 to a local directory
    :param derivative:
    :param pipeline:
    :param strategy:
    :param out_dir:
    :param less_than:
    :param greater_than:
    :param site:
    :param sex:
    :param diagnosis:
    :return:
    """

    # Import packages
    import os
    import urllib.request as request

    # Init variables
    mean_fd_thresh = 0.2
    s3_prefix = 'https://s3.amazonaws.com/fcp-indi/data/Projects/'\
                'ABIDE_Initiative'
    s3_pheno_path = '/'.join([s3_prefix, 'Phenotypic_V1_0b_preprocessed1.csv'])

    # Format input arguments to be lower case, if not already
    derivative = derivative.lower()
    pipeline = pipeline.lower()
    strategy = strategy.lower()

    # Check derivative for extension
    if 'roi' in derivative:
        extension = '.1D'
    else:
        extension = '.nii.gz'

    # If output path doesn't exist, create it
    if not os.path.exists(out_dir):
        print('Could not find {0}, creating now...'.format(out_dir))
        os.makedirs(out_dir)

    # Load the phenotype file from S3
    s3_pheno_file = request.urlopen(s3_pheno_path)
    pheno_list = s3_pheno_file.readlines()
    print(pheno_list[0])

    # Get header indices
    header = pheno_list[0].decode().split(',')
    try:
        site_idx = header.index('SITE_ID')
        file_idx = header.index('FILE_ID')
        age_idx = header.index('AGE_AT_SCAN')
        sex_idx = header.index('SEX')
        dx_idx = header.index('DX_GROUP')
        mean_fd_idx = header.index('func_mean_fd')
    except Exception as exc:
        err_msg = 'Unable to extract header information from the pheno file: {0}\nHeader should have pheno info:' \
                  ' {1}\nError: {2}'.format(s3_pheno_path, str(header), exc)
        raise Exception(err_msg)

    # Go through pheno file and build download paths
    print('Collecting images of interest...')
    s3_paths = []
    for pheno_row in pheno_list[1:]:

        # Comma separate the row
        cs_row = pheno_row.decode().split(',')

        try:
            # See if it was preprocessed
            row_file_id = cs_row[file_idx]
            # Read in participant info
            row_site = cs_row[site_idx]
            row_age = float(cs_row[age_idx])
            row_sex = cs_row[sex_idx]
            row_dx = cs_row[dx_idx]
            row_mean_fd = float(cs_row[mean_fd_idx])
        except Exception as e:
            err_msg = 'Error extracting info from phenotypic file, skipping...'
            print(err_msg)
            continue

        # If the filename isn't specified, skip
        if row_file_id == 'no_filename':
            continue
        # If mean fd is too large, skip
        if row_mean_fd >= mean_fd_thresh:
            continue

        # Test phenotypic criteria (three if's looks cleaner than one long if)
        # Test sex
        if (sex == 'M' and row_sex != '1') or (sex == 'F' and row_sex != '2'):
            continue

        if (diagnosis == 'asd' and row_dx != '1') or (diagnosis == 'tdc' and row_dx != '2'):
            continue

        # Test site
        if site is not None and site.lower() != row_site.lower():
            continue
        # Test age range
        if greater_than < row_age < less_than:
            filename = row_file_id + '_' + derivative + extension
            s3_path = '/'.join([s3_prefix, 'Outputs', pipeline, strategy, derivative, filename])
            print('Adding {0} to download queue...'.format(s3_path))
            s3_paths.append(s3_path)
        else:
            continue

    # And download the items
    total_num_files = len(s3_paths)
    for path_idx, s3_path in enumerate(s3_paths):
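        # Slice off the S3 prefix to get the relative path; str.lstrip would
        # strip a set of characters rather than the prefix itself
        rel_path = s3_path[len(s3_prefix):].lstrip('/')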
        download_file = os.path.join(out_dir, rel_path)
        download_dir = os.path.dirname(download_file)
        if not os.path.exists(download_dir):
            os.makedirs(download_dir)
        try:
            if not os.path.exists(download_file):
                print('Retrieving: {0}'.format(download_file))
                request.urlretrieve(s3_path, download_file)
                print('{0:.3f}% complete'.format(100*(float(path_idx+1)/total_num_files)))
            else:
                print('File {0} already exists, skipping...'.format(download_file))
        except Exception as exc:
            print('There was a problem downloading {0}.\n Check input arguments and try again.\n Error: {1}'.format(s3_path, exc))

    # Print all done
    print('Done!')


# Make module executable
if __name__ == '__main__':

    # Import packages
    import argparse
    import os
    import sys

    # Init argument parser
    parser = argparse.ArgumentParser(description=__doc__)

    # Diagnosis flags (optional), followed by the required arguments
    parser.add_argument('-a', '--asd', required=False, default=False, action='store_true',
                        help='Only download data for participants with ASD.'
                             ' Specifying neither or both -a and -c will download data from all participants.')
    parser.add_argument('-c', '--tdc', required=False, default=False, action='store_true',
                        help='Only download data for participants who are typically developing controls.'
                             ' Specifying neither or both -a and -c will download data from all participants.')
    parser.add_argument('-d', '--derivative', nargs=1, required=True, type=str,
                        help='Derivative of interest (e.g. \'reho\')')
    parser.add_argument('-p', '--pipeline', nargs=1, required=True, type=str,
                        help='Pipeline used to preprocess the data (e.g. \'cpac\')')
    parser.add_argument('-s', '--strategy', nargs=1, required=True, type=str,
                        help='Noise-removal strategy used during preprocessing (e.g. \'nofilt_noglobal\')')
    parser.add_argument('-o', '--out_dir', nargs=1, required=True, type=str,
                        help='Path to local folder to download files to')

    # Optional arguments
    parser.add_argument('-lt', '--less_than', nargs=1, required=False,
                        type=float, help='Upper age threshold (in years) of participants to download (e.g. for '
                                         'subjects 30 or younger, \'-lt 31\')')
    parser.add_argument('-gt', '--greater_than', nargs=1, required=False,
                        type=int, help='Lower age threshold (in years) of participants to download (e.g. for '
                                       'subjects 31 or older, \'-gt 30\')')
    parser.add_argument('-t', '--site', nargs=1, required=False, type=str,
                        help='Site of interest to download from (e.g. \'Caltech\')')
    parser.add_argument('-x', '--sex', nargs=1, required=False, type=str,
                        help='Participant sex of interest to download only (e.g. \'M\' or \'F\')')

    # Parse and gather arguments
    args = parser.parse_args()

    # Init variables
    desired_derivative = args.derivative[0].lower()
    desired_pipeline = args.pipeline[0].lower()
    desired_strategy = args.strategy[0].lower()
    download_data_dir = os.path.abspath(args.out_dir[0])

    # Try and init optional arguments

    # for diagnosis if both ASD and TDC flags are set to true or false, we download both
    desired_diagnosis = ''
    if args.tdc == args.asd:
        desired_diagnosis = 'both'
        print('Downloading data for ASD and TDC participants')
    elif args.tdc:
        desired_diagnosis = 'tdc'
        print('Downloading data for TDC participants')
    elif args.asd:
        desired_diagnosis = 'asd'
        print('Downloading data for ASD participants')

    try:
        desired_age_max = args.less_than[0]
        # less_than is parsed as a float, so the integer format code 'd' would raise ValueError
        print('Using upper age threshold of {0}...'.format(desired_age_max))
    except TypeError:
        desired_age_max = 200.0
        print('No upper age threshold specified')

    try:
        desired_age_min = args.greater_than[0]
        print('Using lower age threshold of {0:d}...'.format(desired_age_min))
    except TypeError:
        desired_age_min = -1.0
        print('No lower age threshold specified')

    try:
        desired_site = args.site[0]
    except TypeError:
        desired_site = None
        print('No site specified, using all sites...')

    try:
        desired_sex = args.sex[0].upper()
        if desired_sex == 'M':
            print('Downloading only male subjects...')
        elif desired_sex == 'F':
            print('Downloading only female subjects...')
        else:
            print('Please specify \'M\' or \'F\' for sex and try again')
            sys.exit()
    except TypeError:
        desired_sex = None
        print('No sex specified, using all sexes...')

    # Call the collect and download routine
    collect_and_download(desired_derivative, desired_pipeline, desired_strategy, download_data_dir, desired_age_max,
                         desired_age_min, desired_site, desired_sex, desired_diagnosis)
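
A note on deriving the local relative path from an S3 URL: `str.lstrip` treats its argument as a set of characters, not as a prefix, so it can remove more than intended. The sketch below shows the pitfall and one prefix-safe alternative. The helper name `relative_path` and the file name in the usage note are hypothetical, introduced only for illustration.

```python
S3_PREFIX = 'https://s3.amazonaws.com/fcp-indi/data/Projects/ABIDE_Initiative'


def relative_path(s3_path, prefix=S3_PREFIX):
    """Return the part of s3_path that follows prefix, without a leading slash."""
    if not s3_path.startswith(prefix):
        raise ValueError('{0} does not start with {1}'.format(s3_path, prefix))
    # Slice by length so that only the exact prefix is removed
    return s3_path[len(prefix):].lstrip('/')


# Pitfall: lstrip('data/') strips any leading run of the characters d, a, t, /
print('data/dparsf/x.1D'.lstrip('data/'))  # -> parsf/x.1D

# Prefix-safe version on a hypothetical file name
print(relative_path(S3_PREFIX + '/Outputs/cpac/Example_0000001_reho.nii.gz'))
# -> Outputs/cpac/Example_0000001_reho.nii.gz
```

The returned path has no leading slash, so `os.path.join(out_dir, rel_path)` keeps `out_dir` as the parent directory.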

How to use the script:

The download_abide_preproc.py script allows any user to download outputs from the ABIDE preprocessed data release. The user specifies the desired derivative, pipeline, and noise removal strategy of interest, and the script finds the data on FCP-INDI's S3 bucket, hosted by Amazon Web Services, and downloads the data to a local directory. The script also allows for phenotypic specifications for targeting only the participants whose information meets the desired criteria; these specifications include: diagnosis (either ASD, TDC, or both), an age range (e.g. participants between 2 and 30 years of age), sex (male or female), and site (location where the images were acquired). * Note the script only downloads images where the functional image's mean framewise displacement is less than 0.2.

At a minimum, the script needs a specific derivative, pipeline, and strategy to search for.
Acceptable derivatives include:
- alff (Amplitude of low frequency fluctuations)
- degree_binarize (Degree centrality with binarized weighting)
- degree_weighted (Degree centrality with correlation weighting)
- eigenvector_binarize (Eigenvector centrality with binarized weighting)
- eigenvector_weighted (Eigenvector centrality with correlation weighting)
- falff (Fractional ALFF)
- func_mask (Functional data mask)
- func_mean (Mean preprocessed functional image)
- func_preproc (Preprocessed functional image)
- lfcd (Local functional connectivity density)
- reho (Regional homogeneity)
- rois_aal (Timeseries extracted from the Automated Anatomical Labeling atlas)
- rois_cc200 (Timeseries extracted from Cameron Craddock's 200 ROI parcellation atlas)
- rois_cc400 (Timeseries extracted from Cameron Craddock's 400 ROI parcellation atlas)
- rois_dosenbach160 (Timeseries extracted from the Dosenbach160 atlas)
- rois_ez (Timeseries extracted from the Eickhoff-Zilles atlas)
- rois_ho (Timeseries extracted from the Harvard-Oxford atlas)
- rois_tt (Timeseries extracted from the Talairach and Tournoux atlas)
- vmhc (Voxel-mirrored homotopic connectivity)

Acceptable pipelines include:
- ccs
- cpac
- dparsf
- niak

Acceptable strategies include:
- filt_global (band-pass filtering and global signal regression)
- filt_noglobal (band-pass filtering only)
- nofilt_global (global signal regression only)
- nofilt_noglobal (neither)
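
To make the relationship between these options and the downloaded files concrete, here is a minimal sketch of how a download URL is assembled from a file ID, derivative, pipeline, and strategy, following the path layout used in the script above. The function name `build_s3_url`, the validation against the pipeline and strategy lists (the script itself does not validate), and the file ID `Example_0000001` are assumptions made for illustration.

```python
S3_PREFIX = 'https://s3.amazonaws.com/fcp-indi/data/Projects/ABIDE_Initiative'

# Values taken from the lists above
PIPELINES = {'ccs', 'cpac', 'dparsf', 'niak'}
STRATEGIES = {'filt_global', 'filt_noglobal', 'nofilt_global', 'nofilt_noglobal'}


def build_s3_url(file_id, derivative, pipeline, strategy):
    """Assemble the S3 URL for one participant's derivative file."""
    derivative = derivative.lower()
    pipeline = pipeline.lower()
    strategy = strategy.lower()
    # Extra validation added for this sketch; not part of the original script
    if pipeline not in PIPELINES:
        raise ValueError('Unknown pipeline: {0}'.format(pipeline))
    if strategy not in STRATEGIES:
        raise ValueError('Unknown strategy: {0}'.format(strategy))
    # ROI timeseries are stored as .1D text files, everything else as NIfTI
    extension = '.1D' if 'roi' in derivative else '.nii.gz'
    filename = file_id + '_' + derivative + extension
    return '/'.join([S3_PREFIX, 'Outputs', pipeline, strategy, derivative, filename])


# Hypothetical file ID, for illustration only
print(build_s3_url('Example_0000001', 'rois_aal', 'cpac', 'nofilt_noglobal'))
```

With these inputs the URL ends in `Outputs/cpac/nofilt_noglobal/rois_aal/Example_0000001_rois_aal.1D`; a non-ROI derivative such as `reho` would end in `.nii.gz` instead.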

For example, to download ReHo images for all participants across all sites, processed using C-PAC without any frequency filtering or global signal regression:
    python download_abide_preproc.py -d reho -p cpac -s nofilt_noglobal -o /path/to/local/download/dir

The script will then search for and download the data to the local directory specified with the -o flag.

Participants can also be selected based on phenotypic information. For example, to download the same outputs as in the previous example, but only for male ASD participants scanned at Caltech between the ages of 2 and 30 years:
    python download_abide_preproc.py -a -d reho -p cpac -s nofilt_noglobal -o /path/to/local/download/dir -gt 2 -lt 30 -x M -t Caltech
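
The selection rules behind these flags can be sketched as a small standalone filter. This is an illustration under stated assumptions, not the script's own code: it uses `csv.DictReader` instead of the script's manual comma splitting, the function name `select_file_ids` is hypothetical, and the rows in `SAMPLE_CSV` are made up rather than real ABIDE participants. The column names and the codes (SEX 1 = male, 2 = female; DX_GROUP 1 = ASD, 2 = TDC) follow the script above.

```python
import csv
import io

MEAN_FD_THRESH = 0.2

# Made-up rows for illustration only; not real ABIDE participants
SAMPLE_CSV = """SITE_ID,FILE_ID,AGE_AT_SCAN,SEX,DX_GROUP,func_mean_fd
SiteA,SiteA_0000001,12.5,1,1,0.05
SiteA,SiteA_0000002,30.0,1,1,0.05
SiteA,SiteA_0000003,14.0,2,1,0.05
SiteA,SiteA_0000004,14.0,1,2,0.05
SiteA,SiteA_0000005,14.0,1,1,0.35
SiteB,no_filename,14.0,1,1,0.05
"""


def select_file_ids(csv_text, sex=None, diagnosis='both', site=None,
                    greater_than=-1.0, less_than=200.0):
    """Return the FILE_IDs of rows that pass every phenotypic criterion."""
    selected = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Skip participants without preprocessed files
        if row['FILE_ID'] == 'no_filename':
            continue
        # Skip scans with too much head motion
        if float(row['func_mean_fd']) >= MEAN_FD_THRESH:
            continue
        if (sex == 'M' and row['SEX'] != '1') or (sex == 'F' and row['SEX'] != '2'):
            continue
        if (diagnosis == 'asd' and row['DX_GROUP'] != '1') or \
                (diagnosis == 'tdc' and row['DX_GROUP'] != '2'):
            continue
        if site is not None and site.lower() != row['SITE_ID'].lower():
            continue
        # Both age bounds are exclusive
        if greater_than < float(row['AGE_AT_SCAN']) < less_than:
            selected.append(row['FILE_ID'])
    return selected


print(select_file_ids(SAMPLE_CSV, sex='M', diagnosis='asd', site='sitea',
                      greater_than=2, less_than=30))
# -> ['SiteA_0000001']
```

Because both age comparisons are strict, the participant aged exactly 30.0 is excluded by `-lt 30`; to include 30-year-olds, pass `-lt 31`, as the script's help text suggests.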

For more information on the ABIDE preprocessed initiative, please check out http://preprocessed-connectomes-project.github.io/abide

Note: in most cases, the following command is all you need.

python download_abide_preproc.py -d rois_aal -p cpac -s nofilt_noglobal -o /aaa/bbb
Note: replace the path after -o with your own directory.