Dataset Overview
The dataset for this hands-on exercise comes from Kaggle's Skin Cancer MNIST: HAM10000. The official description is as follows:
Description
Overview
Another more interesting than digit classification dataset to use to get biology and medicine students more excited about machine learning and image processing.
Original Data Source
- Original Challenge: https://challenge2018.isic-archive.com
- https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/DBW86T
[1] Noel Codella, Veronica Rotemberg, Philipp Tschandl, M. Emre Celebi, Stephen Dusza, David Gutman, Brian Helba, Aadi Kalloo, Konstantinos Liopyris, Michael Marchetti, Harald Kittler, Allan Halpern: “Skin Lesion Analysis Toward Melanoma Detection 2018: A Challenge Hosted by the International Skin Imaging Collaboration (ISIC)”, 2018; https://arxiv.org/abs/1902.03368
[2] Tschandl, P., Rosendahl, C. & Kittler, H. The HAM10000 dataset, a large collection of multi-source dermatoscopic images of common pigmented skin lesions. Sci. Data 5, 180161, doi:10.1038/sdata.2018.161 (2018).
From Authors
Training of neural networks for automated diagnosis of pigmented skin lesions is hampered by the small size and lack of diversity of available dataset of dermatoscopic images. We tackle this problem by releasing the HAM10000 (“Human Against Machine with 10000 training images”) dataset. We collected dermatoscopic images from different populations, acquired and stored by different modalities. The final dataset consists of 10015 dermatoscopic images which can serve as a training set for academic machine learning purposes. Cases include a representative collection of all important diagnostic categories in the realm of pigmented lesions: Actinic keratoses and intraepithelial carcinoma / Bowen’s disease (akiec), basal cell carcinoma (bcc), benign keratosis-like lesions (solar lentigines / seborrheic keratoses and lichen-planus like keratoses, bkl), dermatofibroma (df), melanoma (mel), melanocytic nevi (nv) and vascular lesions (angiomas, angiokeratomas, pyogenic granulomas and hemorrhage, vasc).
More than 50% of lesions are confirmed through histopathology (histo), the ground truth for the rest of the cases is either follow-up examination (followup), expert consensus (consensus), or confirmation by in-vivo confocal microscopy (confocal). The dataset includes lesions with multiple images, which can be tracked by the lesionid-column within the HAM10000_metadata file.
The test set is not public, but the evaluation server remains running (see the challenge website). Any publications written using the HAM10000 data should be evaluated on the official test set hosted there, so that methods can be fairly compared.
The dataset I used for training is hmnist_28_28_RGB.csv, which contains 10016 rows (including the header row) and 2353 columns. The first 2352 columns hold the per-channel pixel values of each image (28×28×3), and the last column is label, the image's class. In this dataset, label takes values in [0, 1, 2, 3, 4, 5, 6].
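Loading this CSV can be sketched as follows. The pixel column names and the H×W×C storage order are my assumptions, and a tiny synthetic frame stands in for the real file so the snippet runs standalone; in practice you would call pd.read_csv("hmnist_28_28_RGB.csv") instead.

```python
import numpy as np
import pandas as pd

# Fabricate two rows with the same layout as hmnist_28_28_RGB.csv:
# 2352 pixel columns (28 * 28 * 3) followed by a "label" column.
rng = np.random.default_rng(0)
data = rng.integers(0, 256, size=(2, 28 * 28 * 3))
df = pd.DataFrame(data, columns=[f"pixel{i:04d}" for i in range(2352)])
df["label"] = [4, 2]

# In practice: df = pd.read_csv("hmnist_28_28_RGB.csv")
labels = df["label"].to_numpy()                  # values in 0..6
pixels = df.drop(columns=["label"]).to_numpy()   # shape (N, 2352)

# Assuming each row stores the image in H x W x C order, reshape and
# move channels first for PyTorch: (N, 3, 28, 28).
images = pixels.reshape(-1, 28, 28, 3).transpose(0, 3, 1, 2)
print(images.shape)  # (2, 3, 28, 28)
```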
First Attempt (CNN)
After examining the dataset, my first thought was to reuse the model I had previously trained on MNIST, since MNIST images are also 28×28. The biggest difference between the two datasets is that MNIST images are grayscale, while HAM10000 images are 3-channel RGB color images (Kaggle also provides a grayscale version of HAM10000). In addition, MNIST is fundamentally a ten-class problem, classifying each image as one of [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], whereas HAM10000 is a seven-class problem.
Based on this idea, we can modify the old MNIST model in two places: change the in_channels parameter of the first convolutional layer from 1 to 3, and change the last nn.Linear from (128, 10) to (128, 7).
Note: the original CNN model used to train MNIST is shown below. It previously reached a test accuracy of 0.99875 on the MNIST dataset.
CNN(
  (conv_unit1): Sequential(
    (0): Conv2d(1, 16, kernel_size=(3, 3), stride=(1, 1))
    (1): BatchNorm2d(16, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (2): ReLU(inplace=True)
  )
  (conv_unit2): Sequential(
    (0): Conv2d(16, 32, kernel_size=(3, 3), stride=(1, 1))
    (1): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (2): ReLU(inplace=True)
    (3): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  )
  (conv_unit3): Sequential(
    (0): Conv2d(32, 64, kernel_size=(3, 3), stride=(1, 1))
    (1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (2): ReLU(inplace=True)
  )
  (conv_unit4): Sequential(
    (0): Conv2d(64, 128, kernel_size=(3, 3), stride=(1, 1))
    (1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (2): ReLU(inplace=True)
    (3): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  )
  (fc_unit): Sequential(
    (0): Linear(in_features=2048, out_features=1024, bias=True)
    (1): ReLU(inplace=True)
    (2): Dropout(p=0.5, inplace=False)
    (3): Linear(in_features=1024, out_features=128, bias=True)
    (4): ReLU(inplace=True)
    (5): Dropout(p=0.5, inplace=False)
    (6): Linear(in_features=128, out_features=10, bias=True)
  )
)
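The in_features=2048 of the first fully connected layer follows from tracing the spatial size through the conv stack: each valid 3×3 convolution shrinks the side length by 2, and each 2×2 max-pool halves it. A quick arithmetic check:

```python
# Trace the feature-map side length through the four conv units,
# starting from a 28x28 input.
size = 28
size = size - 2           # conv_unit1: 3x3 conv -> 26
size = (size - 2) // 2    # conv_unit2: conv -> 24, 2x2 pool -> 12
size = size - 2           # conv_unit3: conv -> 10
size = (size - 2) // 2    # conv_unit4: conv -> 8, 2x2 pool -> 4
print(128 * size * size)  # 128 channels * 4 * 4 = 2048
```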
The modified train.py is as follows:
import torch.nn as nn


class CNN(nn.Module):
    def __init__(self):
        super(CNN, self).__init__()
        self.conv_unit1 = nn.Sequential(
            # in_channels changed from 1 to 3 for RGB input
            nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3),
            nn.BatchNorm2d(16),
            nn.ReLU(inplace=True)
        )
        self.conv_unit2 = nn.Sequential(
            nn.Conv2d(in_channels=16, out_channels=32, kernel_size=3),
            nn.BatchNorm2d(32),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2, stride=2)
        )
        self.conv_unit3 = nn.Sequential(
            nn.Conv2d(in_channels=32, out_channels=64, kernel_size=3),
            nn.BatchNorm2d(64),
            nn.ReLU(inplace=True)
        )
        self.conv_unit4 = nn.Sequential(
            nn.Conv2d(in_channels=64, out_channels=128, kernel_size=3),
            nn.BatchNorm2d(128),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2, stride=2)
        )
        self.fc_unit = nn.Sequential(
            nn.Linear(2048, 1024),
            nn.ReLU(inplace=True),
            nn.Dropout(p=0.5),
            nn.Linear(1024, 128),
            nn.ReLU(inplace=True),
            nn.Dropout(p=0.5),
            # out_features changed from 10 to 7 for the seven classes
            nn.Linear(128, 7)
        )

    def forward(self, x):
        x = self.conv_unit1(x)
        x = self.conv_unit2(x)
        x = self.conv_unit3(x)
        x = self.conv_unit4(x)
        x = x.view(x.size(0), -1)  # flatten to (batch, 2048)
        return self.fc_unit(x)
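As a quick sanity check, here is a standalone sketch that rebuilds the same layer stack as one nn.Sequential (using nn.Flatten in place of the manual view) and feeds it a random batch; the modified network should accept 3-channel 28×28 input and output 7 logits per image:

```python
import torch
import torch.nn as nn

# Standalone rebuild of the modified architecture, only to verify
# its input/output shapes.
model = nn.Sequential(
    nn.Conv2d(3, 16, 3), nn.BatchNorm2d(16), nn.ReLU(),
    nn.Conv2d(16, 32, 3), nn.BatchNorm2d(32), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(32, 64, 3), nn.BatchNorm2d(64), nn.ReLU(),
    nn.Conv2d(64, 128, 3), nn.BatchNorm2d(128), nn.ReLU(), nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(2048, 1024), nn.ReLU(), nn.Dropout(0.5),
    nn.Linear(1024, 128), nn.ReLU(), nn.Dropout(0.5),
    nn.Linear(128, 7),
)
model.eval()  # use BatchNorm running stats; disable Dropout
with torch.no_grad():
    out = model(torch.randn(4, 3, 28, 28))
print(out.shape)  # torch.Size([4, 7])
```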