Kaggle competition: RSNA Pneumonia Detection Challenge [Summary of a simple neural-network approach]
See this link for details about the competition.
The approach shown in this post uses three separate neural network models.
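- Feed in the data, train each model, and finally evaluate it with K-fold cross-validation.
- However, this approach does not meet the competition's requirements: it does not produce the predicted bounding boxes the competition asks for, and the evaluation metric it uses differs from the competition's.
- Strengths: simple to implement and easy to understand. During evaluation, the models show high prediction accuracy.
- Weaknesses: when run on the resources Kaggle provides, it fails with an out-of-memory error.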
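The K-fold evaluation mentioned in the list above can be sketched roughly as follows. This is a minimal illustration, not the post's actual pipeline: the data is synthetic, `LogisticRegression` stands in for the neural networks, and scoring each fold with ROC AUC is an assumption (the post imports `roc_auc_score`, but the helper name `kfold_auc` is hypothetical).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import KFold


def kfold_auc(X, y, n_splits=5, seed=0):
    """Return the ROC AUC measured on each held-out fold."""
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = []
    for train_idx, val_idx in kf.split(X):
        # Stand-in model (assumption); the post trains neural networks instead.
        model = LogisticRegression(max_iter=1000)
        model.fit(X[train_idx], y[train_idx])
        proba = model.predict_proba(X[val_idx])[:, 1]
        scores.append(roc_auc_score(y[val_idx], proba))
    return scores


# Synthetic data, used only for illustration (not competition data).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

fold_scores = kfold_auc(X, y)
print(f"AUC per fold: {np.round(fold_scores, 3)}")
print(f"Mean AUC: {np.mean(fold_scores):.3f}")
```

Each sample is used for validation exactly once, so the mean over folds is a less noisy estimate than a single train/validation split.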
Building neural networks to predict pneumonia
Part 0: Preparation
See this link for the data source.
First, import the required packages:
# Imports
import os
import cv2
import glob
import time
import warnings
import pydicom
import skimage
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.patches as patches
from skimage import feature, filters
from functools import partial
from collections import defaultdict
from joblib import Parallel, delayed
from lightgbm import LGBMClassifier
from tqdm import tqdm
# Tensorflow / Keras
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.layers import *
from tensorflow.keras import Model
from tensorflow.keras.applications.vgg16 import VGG16
from keras import models
from keras import layers
# sklearn
from sklearn.model_selection import KFold
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import RandomizedSearchCV
%matplotlib inline
sns.set_style('whitegrid')
# Suppress warnings that do not affect execution
# (np.warnings was removed in NumPy 1.24, so the standard warnings module is used)
warnings.filterwarnings('ignore')
Next, set up the data paths:
# Paths to the data
trainImagesPath = "../input/rsna-pneumonia-detection-challenge/stage_2_train_images"
testImagesPath = "../input/rsna-pneumonia-detection-challenge/stage_2_test_images"
labelsPath = "../input/rsna-pneumonia-detection-challenge/stage_2_train_labels.csv"
classInfoPath = "../input/rsna-pneumonia-detection-challenge/stage_2_detailed_class_info.csv"
# Read the labels and the class information
labels = pd.read_csv(labelsPath)
details = pd.read_csv(classInfoPath)
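The labels file lists one row per bounding box, so a patient with several boxes appears several times. A minimal sketch of collapsing such a table to one label per patient is shown here. The toy DataFrame is made up and only mimics the column layout of stage_2_train_labels.csv (`patientId`, `x`, `y`, `width`, `height`, `Target`); the names `toy_labels` and `per_patient` are hypothetical.

```python
import pandas as pd

# Toy stand-in for stage_2_train_labels.csv (all values are made up).
toy_labels = pd.DataFrame({
    "patientId": ["a", "b", "b", "c"],
    "x": [None, 100.0, 400.0, None],
    "y": [None, 150.0, 180.0, None],
    "width": [None, 200.0, 220.0, None],
    "height": [None, 300.0, 310.0, None],
    "Target": [0, 1, 1, 0],
})

# One entry per patient: Target is 1 if any of the patient's rows is 1.
per_patient = toy_labels.groupby("patientId")["Target"].max()

print(per_patient)
print(f"{len(toy_labels)} label rows, {per_patient.size} patients")
```

Counting rows and counting patients therefore give different totals whenever a patient has more than one box.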
Part 1: Getting the training and test data into a suitable format
"""
@Description: Reads a list of DICOM file paths and returns the datasets after they have been read
@Inputs: A list of file paths for the images
@Output: Returns a list of the datasets after they have been read (metadata only; pixel data is skipped)
"""
def readDicomData(data):
    res = []
    for filePath in tqdm(data):  # Loop over data
        # We use stop_before_pixels to avoid reading the image (Saves on speed/memory)
        # pydicom.read_file is deprecated; dcmread is its replacement
        f = pydicom.dcmread(filePath, stop_before_pixels=True)
        res.append(f)
    return res
# Collect the train and test file paths
trainFilepaths = glob.glob(f"{trainImagesPath}/*.dcm")
testFilepaths = glob.glob(f"{testImagesPath}/*.dcm")
# Read the data into lists (only the first 5000 training files)
trainImages = readDicomData(trainFilepaths[:5000])
testImages = readDicomData(testFilepaths)
100%|██████████| 5000/5000 [00:46<00:00, 107.50it/s]
100%|██████████| 3000/3000 [00:27<00:00, 110.21it/s]
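Only the first 5000 training files are read above. One way to cover more files while keeping memory bounded is to walk the path list in fixed-size batches and process each batch before loading the next. This is a minimal sketch under that assumption; `iter_batches` and the example file names are hypothetical and not part of the original notebook.

```python
from typing import Iterator, List, Sequence


def iter_batches(paths: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `paths`, each at most `batch_size` long."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(paths), batch_size):
        yield list(paths[start:start + batch_size])


# Hypothetical file names, used only to show the batching behaviour.
example_paths = [f"img_{i}.dcm" for i in range(10)]
batches = list(iter_batches(example_paths, batch_size=4))
print([len(b) for b in batches])  # [4, 4, 2]
```

In the notebook, each batch could be passed to readDicomData in turn, keeping only the extracted fields rather than every dataset object.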
Part 2: Balancing the data
COUNT_NORMAL = len(labels.loc[labels['Target'] == 0])  # Label rows without pneumonia (Target == 0)
COUNT_PNE = len(labels.loc[labels['Target'] == 1])  # Label rows with pneumonia (Target == 1)
TRAIN_IMG_COUNT = len(trainFilepaths)  # Total number of training images
# Compute the weight for each class
weight_for_0 = (1 / COUNT_NORMAL) * TRAIN_IMG_COUNT / 2.0
weight_for_1 = (1 / COUNT_PNE) * TRAIN_IMG_COUNT / 2.0
classWeight = {0: weight_for_0, 1: weight_for_1}
print(f"Weights: {classWeight}")
Weights: {0: 0.6454140866873065, 1: 1.3963369963369963}
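The weighting used above is total / (number of classes × class count), so the rarer class gets the larger weight. A minimal generic sketch of the same formula follows; the function name `balanced_class_weights` and the toy counts are hypothetical. Passing `total` explicitly reproduces the notebook's choice of using the number of image files as the total.

```python
def balanced_class_weights(counts, total=None):
    """Return {label: total / (n_classes * count)} for each class."""
    n_classes = len(counts)
    if total is None:
        # Default (assumption): the total is the sum of the class counts.
        total = sum(counts.values())
    return {label: total / (n_classes * n) for label, n in counts.items()}


# Toy counts (made up): 75 negative and 25 positive samples.
weights = balanced_class_weights({0: 75, 1: 25})
print(weights)
```

A dictionary of this shape can be passed to Keras through the `class_weight` argument of `Model.fit`, which scales each sample's loss by the weight of its class.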
Part 3: Getting train_y & test_y
"""
@Description: Parses the metadata contained in a medical (DICOM) image
@Inputs: A DICOM dataset after it has been read
@Output: Returns the unpacked data and a mapping from (group, element) to keyword
"""
def parseMetadata(dcm):
    unpackedData = {}
    groupElemToKeywords = {}
    for d in dcm:  # Iterate here to force conversion from lazy RawDataElement to DataElement
        pass
    # Un-pack Data
    for tag, elem in dcm.items():
        tagGroup = tag.group
        tagElem = tag.elem
        keyword = elem.keyword
        groupElemToKeywords[(tagGroup, tagElem)] = keyword
        value = elem.value
        unpackedData[keyword] = value
    return unpackedData, groupElemToKeywords
# Parse the metadata into dictionaries
trainMetaDicts, trainKeyword = zip(*[parseMetadata(x) for x in tqdm(trainImages)])
testMetaDicts, testKeyword = zip(*[parseMetadata(x) for x in tqdm(testImages)])
100%|██████████| 5000/5000 [00:04<00:00, 1123.70it/s]
100%|██████████| 3000/3000 [00:02<00:00, 1279.92it/s]
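The parsed metadata is a sequence of dictionaries, one per image. A minimal sketch of turning such dictionaries into a pandas DataFrame, which is the input format the next function expects, might look like this. The toy dictionaries are made up (the keys are standard DICOM keywords), and `toy_meta` and `meta_df` are hypothetical names.

```python
import pandas as pd

# Toy metadata dicts shaped like the first return value of parseMetadata
# (keys are DICOM keywords; the values are made up).
toy_meta = [
    {"PatientID": "a", "PatientSex": "F", "ViewPosition": "PA"},
    {"PatientID": "b", "PatientSex": "M", "ViewPosition": "AP"},
]

# One row per image, one column per metadata keyword.
meta_df = pd.DataFrame.from_records(toy_meta)
print(meta_df.shape)
print(meta_df.columns.tolist())
```

With the real data, the same call could be applied to trainMetaDicts and testMetaDicts.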
"""
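@Description: Goes through the DICOM image information and returns 1 or 0 (depending on whether the image contains pneumonia or not)
@Inputs: A DataFrame containing the metadata
@Output: Returns Y (i.e. the labels for our training and test data)
"""
def createY(df):
    y = (df['SeriesDescription'] == 'view: PA')
    Y = np.zeros(len(y)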