Using deep learning to identify melanomas from skin images and patient meta-data
Kaggle, SIIM, and ISIC hosted the SIIM-ISIC Melanoma Classification competition on May 27, 2020. The goal was to use image data from skin lesions, together with patient meta-data, to predict whether a skin image showed a melanoma. Here is a short introduction to the task from the hosts:
Skin cancer is the most prevalent type of cancer. Melanoma, specifically, is responsible for 75% of skin cancer deaths, despite being the least common skin cancer. The American Cancer Society estimates over 100,000 new melanoma cases will be diagnosed in 2020. It’s also expected that almost 7,000 people will die from the disease. As with other cancers, early and accurate detection — potentially aided by data science — can make treatment more effective.
Currently, dermatologists evaluate every one of a patient’s moles to identify outlier lesions or “ugly ducklings” that are most likely to be melanoma. Existing AI approaches have not adequately considered this clinical frame of reference. Dermatologists could enhance their diagnostic accuracy if detection algorithms take into account “contextual” images within the same patient to determine which images represent a melanoma. If successful, classifiers would be more accurate and could better support dermatological clinic work.
I took part in the competition and, after about 2 months and about 200 experiments, earned a bronze medal, finishing 241st among 3314 teams (top 8%). During the competition I also published two kernels: one on visualizing data augmentations and another on using SHAP to explain model predictions.
About the data
Between images, TFRecords, and CSV files, the complete dataset was about 108GB (33126 samples in the training set and 10982 in the test set). Most of the images were high resolution, so handling all of this was a challenge on its own. On the image side, 584 images were melanomas and 32542 were not, which means under 2% of the training set was positive. Here is an example: