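Skin cancer is a cancer produced by the abnormal growth of skin cells. It is one of the most common cancers and can be fatal. If it is caught early, however, your dermatologist can treat it and eliminate it entirely.
Using deep learning and neural networks, we can classify benign and malignant skin diseases, which may help doctors diagnose cancer at an early stage. In this tutorial, we will build a skin disease classifier that tries to distinguish benign (nevus and seborrheic keratosis) from malignant (melanoma) skin diseases from images alone, using the TensorFlow framework in Python.
Let's go through it step by step.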
▊ Installing the required libraries:
pip3 install tensorflow tensorflow_hub matplotlib seaborn numpy pandas scikit-learn imbalanced-learn
Open a new notebook (or bfwstudio) and import the necessary modules:
import tensorflow as tf
import tensorflow_hub as hub
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from tensorflow.keras.utils import get_file
from sklearn.metrics import roc_curve, auc, confusion_matrix
from imblearn.metrics import sensitivity_score, specificity_score
import os
import glob
import zipfile
import random

# to get consistent results after multiple runs
tf.random.set_seed(7)
np.random.seed(7)
random.seed(7)

# 0 for benign, 1 for malignant
class_names = ["benign", "malignant"]
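▊ Preparing the dataset
In this tutorial, we will use only a small part of the ISIC archive dataset. The following function downloads the dataset and extracts it into a new data folder: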
def download_and_extract_dataset():
    # dataset from https://github.com/udacity/dermatologist-ai
    # 5.3GB
    train_url = "https://s3-us-west-1.amazonaws.com/udacity-dlnfd/datasets/skin-cancer/train.zip"
    # 824.5MB
    valid_url = "https://s3-us-west-1.amazonaws.com/udacity-dlnfd/datasets/skin-cancer/valid.zip"
    # 5.1GB
    test_url = "https://s3-us-west-1.amazonaws.com/udacity-dlnfd/datasets/skin-cancer/test.zip"
    for i, download_link in enumerate([valid_url, train_url, test_url]):
        temp_file = f"temp{i}.zip"
        data_dir = get_file(origin=download_link, fname=os.path.join(os.getcwd(), temp_file))
        print("Extracting", download_link)
        with zipfile.ZipFile(data_dir, "r") as z:
            z.extractall("data")
        # remove the temp file
        os.remove(temp_file)

# comment the below line if you already downloaded the dataset
download_and_extract_dataset()
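This will take a few minutes, depending on your connection speed. Afterwards, a data folder will appear containing the training, validation and test sets. Each set is a folder holding images of three classes of skin disease (nevus, seborrheic keratosis and melanoma).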
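To check that the extraction produced the layout described above, a quick count of files per class folder can help. The following is only a minimal sketch: `count_images` is a hypothetical helper, not part of the tutorial's code, and it assumes the layout is `data/<set>/<class>/<image files>`.

```python
import os
from collections import Counter

def count_images(root="data"):
    """Count files in every <root>/<set>/<class> folder (assumed layout)."""
    counts = Counter()
    if not os.path.isdir(root):
        # nothing extracted yet, return an empty counter
        return counts
    for split in sorted(os.listdir(root)):
        split_dir = os.path.join(root, split)
        if not os.path.isdir(split_dir):
            continue
        for class_name in sorted(os.listdir(split_dir)):
            class_dir = os.path.join(split_dir, class_name)
            if os.path.isdir(class_dir):
                counts[(split, class_name)] = sum(
                    os.path.isfile(os.path.join(class_dir, name))
                    for name in os.listdir(class_dir)
                )
    return counts

# prints one line per set/class folder found under "data"
for (split, class_name), n in count_images("data").items():
    print(f"{split}/{class_name}: {n} files")
```

If the counts differ a lot between classes, that is worth keeping in mind when the classifier is evaluated later.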
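Note: on a slow connection, it may be difficult to download the dataset with the Python function above. In that case, download the archives yourself and extract them manually into a data folder in the current directory.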
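If you downloaded the archives by hand, the extraction step can still be done in Python. The sketch below is one possible way to do it, using only the standard library; `extract_archives` and the local file names `train.zip`, `valid.zip` and `test.zip` are assumptions for illustration, so adjust them to whatever names your downloads were saved under.

```python
import os
import zipfile

def extract_archives(zip_paths, target_dir="data"):
    """Extract each already-downloaded zip archive into target_dir.

    Returns the list of archives that were actually extracted.
    """
    extracted = []
    for zip_path in zip_paths:
        if not os.path.isfile(zip_path):
            print("Skipping missing archive:", zip_path)
            continue
        print("Extracting", zip_path)
        with zipfile.ZipFile(zip_path, "r") as z:
            z.extractall(target_dir)
        extracted.append(zip_path)
    return extracted

# hypothetical local file names, assumed to sit in the current directory
extract_archives(["train.zip", "valid.zip", "test.zip"])
```

Unlike the download function, this sketch leaves the zip files in place, so you can delete them yourself once you have confirmed the data folder looks right.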
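Now that the dataset is on our machine, we need a way to label these images. Remember that we are only classifying benign versus malignant skin diseases, so nevus and seborrheic keratosis are labeled 0 and melanoma is labeled 1.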
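The labeling rule can be written down as a small lookup from class folder name to integer label. This is just an illustrative sketch: `LABEL2INT` and `label_from_path` are hypothetical names, the folder names are assumed to be `nevus`, `seborrheic_keratosis` and `melanoma`, and `example.jpg` is a made-up file name.

```python
import os

# assumed class folder names; 0 = benign, 1 = malignant
LABEL2INT = {"nevus": 0, "seborrheic_keratosis": 0, "melanoma": 1}

def label_from_path(filepath):
    """Return the binary label of an image from its parent folder name."""
    class_folder = os.path.basename(os.path.dirname(filepath))
    return LABEL2INT[class_folder]

# "example.jpg" is a hypothetical file name used only for illustration
print(label_from_path(os.path.join("data", "train", "melanoma", "example.jpg")))  # prints 1
```

Collapsing two benign classes into label 0 turns the three-folder dataset into a binary problem, which is what the rest of the tutorial works with.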
The cell below generates a metadata CSV file for each set. Each row in the CSV file holds the path of an image and its label (0 or 1):
# preparing data
# generate CSV metadata file to read img paths and labels from it
def generate_csv(folder, label2int):
    # label2int is a dict mapping each class folder name to its integer label
    folder_name = os.path.basename(folder)
    labels = list(label2int)
    # generate CSV file
    df = pd.DataFrame(columns=["filepath", "label"])
    i = 0
    for label in labels:
        print("Reading", os.path.join(folder, label, "*"))
        for filepath in glob.glob(os.path.join(folder, label, "*")):
            df.loc[i] = [filepath, label2int[label]]
            i += 1
    output_file = f"{folder_name}.csv"
    print("Saving", output_file)
    df.to_csv(output_file)

# generate CSV files for all data portions, labeling nevus and seborrheic keratosis
# as 0 (benign), and melanoma as 1 (malignant)
# you should replace "data" path to your extracted dataset path
# don't replace if you used download_and_extract_dataset() function
generate_csv("data/train", {"nevus": 0, "seborrheic_keratosis": 0, "melanoma": 1})
generate_csv("data/valid", {"nevus": 0, "seborrheic_keratosis": 0, "melanoma": 1})
generate_csv("data/test", {"nevus": 0, "seborrheic_keratosis": 0, "melanoma": 1})
The generate_csv() function takes 2 arguments. The first is the path to the set, for example, if you have downloaded and extracted the dataset in