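- Research background: Breast cancer is one of the most serious health threats women face. Its incidence ranks among the highest of all cancers in women, and the trend has become more pronounced with social development, population ageing, and changing lifestyles. Research on breast cancer risk and recurrence prediction is therefore essential. It requires analysing the causes of the disease, predicting the likelihood of recurrence from historical data, and assessing risk comprehensively using existing clinical data. Many factors influence breast cancer prediction, and they fall roughly into three groups: first, patient characteristics such as age at onset and menopausal status; second, disease characteristics such as tumor location, tumor size, and histological grade; and third, treatment, such as chemotherapy and immunotherapy. To date, a growing number of models and algorithms have been applied to breast cancer data to identify further factors that affect patients.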
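Figure 1.1 Top ten cancer types by number of new cases among women worldwide, 2020
Figure 1.2 Top ten cancer types by number of deaths among women worldwide, 2020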
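- Research objective: To address the challenges of analysing and predicting breast cancer data, this paper draws on domestic and international research and uses data visualization and model training to build a breast cancer recurrence risk prediction model. First, the data are preprocessed in Python to improve data quality and reduce error; in this stage, features are extracted and cleaned, and irrelevant features and noise are removed to improve model performance. Second, risk factors closely related to breast cancer onset and recurrence are screened through bioinformatics analysis so that the feature set is simplified, which strengthens predictive power while lowering the learning difficulty. Bar charts and pie charts are then used to identify the key factors associated with recurrence. Finally, five machine learning algorithms (decision tree, KNN, random forest, logistic regression, and support vector machine) are trained on the data, and the best model is selected to provide precise guidance for clinical practice.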
Figure 2.1 Flowchart of the research content
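- Data source and preprocessing: The breast cancer dataset comes from the work of M. Zwitter and M. Soklic at the Institute of Oncology, University Medical Centre, Ljubljana, Yugoslavia, and was later added to the UCI repository at the University of California, Irvine, where it is available to researchers worldwide. The UCI repository is a widely used resource in machine learning that offers many datasets and benchmark classification problems. The breast cancer dataset helps researchers develop more accurate classification algorithms in support of early diagnosis and treatment. The dataset is available at https://archive.ics.uci.edu/ml/datasets/breast+cancer and is shown in Figure 3.1.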
Figure 3.1 Breast cancer dataset
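- Of the 286 samples, 201 are non-recurrence cases and 85 are recurrence cases. Each sample has 9 descriptive attributes and 1 outcome attribute, 10 attributes in total: class, age, menopause, tumorsize, invnodes, nodecaps, degmalig, breast, breastquad, and irradiat. Figure 3.2 describes each attribute.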
- Figure 3.2 Meaning of the feature attributes
- Import the required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib as mpl
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
%matplotlib inline
# Use the SimHei font so that Chinese characters in plots render correctly
mpl.rcParams['font.sans-serif'] = ['simhei']
mpl.rcParams['font.serif'] = ['simhei']
plt.rc("font", family="SimHei", size=14)
sns.set_style("darkgrid")
- Data preprocessing
# Load the data and assign readable column names
cancer = pd.read_csv('breast-cancer.csv')
cancer.columns = ['class', 'age', 'menopause', 'tumorsize', 'invnodes', 'nodecaps', 'degmalig', 'breast', 'breastquad', 'irradiat']
cancer.head()
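The breast attribute (left or right breast) duplicates information already carried by breastquad (the quadrant in which the tumor lies), so the breast column is dropped.
cancer = cancer.drop(['breast'], axis=1)
cancer.head()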
cancer.isnull()
cancer.info()
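Fields containing ? are the missing values in this dataset; only rows without missing values are kept.
# .copy() gives an independent frame, so later column assignments do not trigger SettingWithCopyWarning
cancer1 = cancer[(cancer["nodecaps"] != "?") & (cancer["breastquad"] != "?")].copy()
len(cancer1)
cancer1.isnull().any()
cancer1.isnull().sum()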
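Because the placeholder is the literal string "?", `isnull()` does not count it as missing unless it is converted first. A minimal sketch of that conversion, assuming a tiny made-up frame (the rows below are invented for illustration and are not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Toy rows for illustration only; the values are invented.
toy = pd.DataFrame({
    "nodecaps": ["no", "?", "yes", "no"],
    "breastquad": ["left_up", "left_low", "?", "central"],
})

# Turn the "?" placeholder into a real missing value so pandas can see it.
toy_nan = toy.replace("?", np.nan)
missing_per_column = toy_nan.isnull().sum()

# Keep only complete rows, which mirrors the boolean filter used in the text.
toy_clean = toy_nan.dropna().reset_index(drop=True)

print(missing_per_column)
print(len(toy_clean))
```

Passing `na_values="?"` to `pd.read_csv` would have a similar effect at load time.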
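1. Data visualization
# Count the patients in each age group
age = cancer1["age"].value_counts().reset_index()
age.columns = ["Age group", "Count"]
age
fig = px.bar(age, x="Age group", y="Count", text="Count")
fig.show()
The chart shows that the 40-59 age range contains more breast cancer patients than the other age groups.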
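menopause = cancer1["menopause"].value_counts().reset_index()
# Name the columns explicitly; the defaults produced by reset_index() differ between pandas versions
menopause.columns = ["menopause", "count"]
menopause
fig = px.pie(menopause, names="menopause", values="count")
fig.update_traces(
    textposition='inside',
    textinfo='percent+label')
fig.show()
The chart shows that premenopausal patients and patients whose menopause occurred at age 40 or later account for the larger shares of breast cancer cases.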
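# Use the cleaned frame cancer1, consistent with the other charts
tumor_size = cancer1["tumorsize"].value_counts().reset_index()
tumor_size.columns = ["tumorsize", "count"]
tumor_size
fig = px.bar(tumor_size,
             x="tumorsize",
             y="count",
             color="count",
             text="count")
fig.show()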
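breastquad = cancer1["breastquad"].value_counts().reset_index()
# Name the columns explicitly; the defaults produced by reset_index() differ between pandas versions
breastquad.columns = ["breastquad", "count"]
breastquad
fig = px.pie(breastquad, names="breastquad", values="count")
fig.update_traces(
    textposition='inside',
    textinfo='percent+label')
fig.show()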
cancer1["class"].value_counts()
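# Pass the column by keyword; recent seaborn versions no longer accept it positionally
sns.countplot(x='class', data=cancer1)
plt.show()
There are 81 recurrence cases and 196 non-recurrence cases.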
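Convert string values to numeric values. For example:
# Note: with this mapping, 1 means no recurrence and 0 means recurrence
dic = {"no-recurrence-events": 1, "recurrence-events": 0}
cancer1["class"] = cancer1["class"].map(dic)
cancer1
The other features are converted in the same way.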
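The remaining columns could be converted along the same lines. The sketch below assumes an ordinal encoding for the range-valued age column and an explicit dictionary for menopause; the integer codes are an illustrative assumption, not the mapping used in the original analysis, and the sample rows are invented.

```python
import pandas as pd

# Invented sample values that follow the dataset's category labels.
toy = pd.DataFrame({
    "age": ["30-39", "40-49", "60-69", "50-59"],
    "menopause": ["premeno", "ge40", "lt40", "premeno"],
})

# Ordered ranges: sort by the numeric lower bound and use the rank as the code.
# Sorting on the integer bound also works for columns such as tumorsize,
# where "5-9" would otherwise sort after "10-14" as a string.
age_levels = sorted(toy["age"].unique(), key=lambda s: int(s.split("-")[0]))
age_codes = {label: rank for rank, label in enumerate(age_levels)}

# Unordered labels: an explicit dictionary keeps the mapping readable (assumed codes).
menopause_codes = {"premeno": 0, "lt40": 1, "ge40": 2}

encoded = toy.assign(
    age=toy["age"].map(age_codes),
    menopause=toy["menopause"].map(menopause_codes),
)
print(encoded)
```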
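# Pearson correlation matrix (assumes every column has already been converted to numeric)
co = cancer1.corr()
plt.subplots(figsize=(8, 5))
# Plot the correlation matrix itself; calling .corr() on it again would correlate the correlations
sns.heatmap(co.round(2), annot=True)
plt.show()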
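The heatmap shows that age and menopausal status are very highly correlated (0.92), which indicates strong multicollinearity. Feature engineering is therefore needed, and one of the two variables should be considered for removal to avoid the overfitting that multicollinearity can cause. Recurrence is fairly strongly correlated with tumor size (tumorsize), number of involved lymph nodes (invnodes), presence of node caps (nodecaps), degree of malignancy (degmalig), and radiotherapy (irradiat), with correlation coefficients above 0.4. Recurrence shows some correlation with age group (age) and tumor quadrant (breastquad), and only a weak correlation with menopausal status (menopause).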
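One way to act on the multicollinearity observation is to list every feature pair whose absolute correlation exceeds a chosen threshold. The sketch below runs on synthetic data; the 0.9 threshold and the helper name `high_corr_pairs` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def high_corr_pairs(df, threshold=0.9):
    """Return (col_a, col_b, r) for column pairs with |Pearson r| above threshold."""
    corr = df.corr()
    cols = corr.columns
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):   # upper triangle only, skipping the diagonal
            r = corr.iloc[i, j]
            if abs(r) > threshold:
                pairs.append((cols[i], cols[j], round(float(r), 2)))
    return pairs

# Synthetic example: "b" is almost a copy of "a", while "c" is independent noise.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
demo = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.1, size=200),
    "c": rng.normal(size=200),
})
flagged = high_corr_pairs(demo)
print(flagged)
```

For each flagged pair, one of the two columns is a candidate for removal before modelling.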
fig,axes = plt.subplots(2,2,figsize=(18,13))
sns.barplot(x='age',y='class',data=cancer1,ax=axes[0,0])
sns.barplot(x='menopause',y='class',data=cancer1,ax=axes[0,1])
sns.countplot(x='menopause',hue='class',data=cancer1,ax=axes[1,0])
sns.barplot(x='irradiat',y='class',hue='breastquad',data=cancer1,ax=axes[1,1])
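The top-left panel (1) shows that the probability of recurrence has little relationship with age, although the 30-39 group has a comparatively high recurrence rate, possibly because of menopausal status. The bottom-left panel (3) was drawn to check this and suggests that patients aged 30-39 are mostly premenopausal. The top-right panel (2) indicates that menopausal status has some effect on the recurrence rate, but not a large one. The bottom-right panel (4) indicates that radiotherapy is associated with a considerably higher probability of recurrence. Because class was encoded with 1 for no recurrence, the bar heights in the bar plots are the share of non-recurrence cases, so a lower bar corresponds to a higher recurrence rate.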
fig,(ax1,ax2,ax3) = plt.subplots(1,3,figsize=(26,5))
sns.barplot(x='tumorsize',y='class',data=cancer1,ax=ax1)
sns.barplot(x='invnodes',y='class',data=cancer1,ax=ax2)
sns.barplot(x='nodecaps',y='class',data=cancer1,ax=ax3)
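The plots above show that tumor size, the number of involved lymph nodes, and the presence of node caps all affect the recurrence rate.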
sns.barplot(x='degmalig',y='class',data=cancer1)
The higher the degree of malignancy, the higher the recurrence rate.
2. Machine learning
Five models are used for prediction: decision tree, KNN, random forest, logistic regression, and SVM.
- One-hot encoding
tumorsize_dummies = pd.get_dummies(cancer1['tumorsize'], prefix='tumorsize')
invnodes_dummies = pd.get_dummies(cancer1['invnodes'], prefix='invnodes')
nodecaps_dummies = pd.get_dummies(cancer1['nodecaps'], prefix='nodecaps')
degmalig_dummies = pd.get_dummies(cancer1['degmalig'], prefix='degmalig')
irradiat_dummies = pd.get_dummies(cancer1['irradiat'], prefix='irradiat')
breast_new = cancer1.drop(['tumorsize', 'invnodes', 'nodecaps', 'degmalig', 'irradiat'], axis=1)
breast_new = pd.concat([breast_new, tumorsize_dummies, invnodes_dummies, nodecaps_dummies, degmalig_dummies, irradiat_dummies], axis=1)
breast_new.head()
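The same encoding could be written more compactly by passing the column list to `pd.get_dummies`, which drops the original columns and appends the indicator columns in one call. A sketch on invented rows (the values are placeholders, not dataset records):

```python
import pandas as pd

# Invented rows for illustration.
toy = pd.DataFrame({
    "class": [1, 0, 1],
    "degmalig": [1, 3, 2],
    "irradiat": ["no", "yes", "no"],
})

# `columns=` selects what to encode; each column name becomes the prefix automatically.
toy_encoded = pd.get_dummies(toy, columns=["degmalig", "irradiat"])
print(toy_encoded.columns.tolist())
```

The resulting column order differs from the manual `pd.concat` version, which does not matter for the models used here.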
- Splitting into training and test sets
x = breast_new.drop(['class'], axis=1)
y = breast_new['class']
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=20)
# Fit the scaler on the training set only, then apply the same transform to the test set
sc = StandardScaler()
x_train = sc.fit_transform(x_train)
x_test = sc.transform(x_test)
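Since the two classes are imbalanced (196 versus 81 after cleaning), a stratified split keeps the class ratio similar in both subsets. This is a variation on the split shown in the text, not what the original code does; the labels below are synthetic but use the same class counts, and the single feature is a placeholder.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with the same class counts as the cleaned data: 196 ones, 81 zeros.
y_demo = np.array([1] * 196 + [0] * 81)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)   # placeholder feature

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=20, stratify=y_demo
)

# Both subsets should show roughly the same share of class 1.
print(round(y_tr.mean(), 3), round(y_te.mean(), 3))
```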
- Decision tree
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn import metrics
from sklearn.metrics import classification_report
# Grid-search tree depth and split/leaf sizes with 10-fold cross-validation
max_depth = range(4, 10, 1)
min_samples_split = range(2, 12, 1)
min_samples_leaf = range(2, 12, 1)
parameters_dtc = {'max_depth': max_depth, 'min_samples_split': min_samples_split, 'min_samples_leaf': min_samples_leaf}
grid_search = GridSearchCV(estimator=DecisionTreeClassifier(), param_grid=parameters_dtc, cv=10, n_jobs=-1)
grid_search.fit(x_train, y_train)
grid_search.best_params_
# Refit with the selected parameters and evaluate on the test set
dtc = DecisionTreeClassifier(max_depth=4, min_samples_leaf=3, min_samples_split=4)
dtc.fit(x_train, y_train)
y_predict = dtc.predict(x_test)
score1 = dtc.score(x_test, y_test.astype('int'))
print("Accuracy:", score1)
report1 = classification_report(y_test.astype('int'), y_predict)
print(report1)
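The tuned values are copied by hand into a new DecisionTreeClassifier in the text. An alternative sketch, on synthetic data, reads the refitted model from `GridSearchCV`'s `best_estimator_` attribute and wraps the scaling step in a `Pipeline` so that the scaler is refit inside every cross-validation fold. The synthetic data, the step names `scale` and `tree`, and the reduced grid are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the encoded feature matrix (same number of rows as the cleaned data).
X, y = make_classification(n_samples=277, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=20)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("tree", DecisionTreeClassifier(random_state=0)),
])

# Parameters of a pipeline step are addressed as "<step name>__<parameter>".
param_grid = {"tree__max_depth": [3, 4, 5], "tree__min_samples_leaf": [2, 4]}
search = GridSearchCV(pipe, param_grid=param_grid, cv=5)
search.fit(X_tr, y_tr)

# best_estimator_ is already refit on the whole training set with the best parameters.
best_model = search.best_estimator_
test_accuracy = best_model.score(X_te, y_te)
print(search.best_params_, test_accuracy)
```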
- KNN
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV
knn = KNeighborsClassifier()
# Two sub-grids: uniform weights, and distance weights with different Minkowski powers p
param_test = [
    {'n_neighbors': [i for i in range(1, 10)], 'weights': ['uniform'], 'algorithm': ['auto']},
    {'n_neighbors': [i for i in range(1, 10)], 'weights': ['distance'], 'p': [i for i in range(1, 6)]}
]
knn_gv = GridSearchCV(estimator=knn, param_grid=param_test, cv=5)
knn_gv.fit(x_train, y_train)
print("Best parameters:", knn_gv.best_params_)
# Refit with the selected parameters and evaluate on the test set
knn = KNeighborsClassifier(n_neighbors=6, algorithm='auto', weights='uniform')
knn.fit(x_train, y_train.astype('int'))
y_pred2 = knn.predict(x_test)
score2 = knn.score(x_test, y_test.astype('int'))
print("Accuracy:", score2)
report2 = classification_report(y_test.astype('int'), y_pred2)
print(report2)
- Random forest
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier(random_state=1)
param_grid = {
    'n_estimators': [100, 300, 500],
    'max_leaf_nodes': [10, 12, 14, 16],
}
grid_search = GridSearchCV(estimator=rfc, param_grid=param_grid, cv=5)
grid_search.fit(x_train, y_train)
print("Best parameters:", grid_search.best_params_)
# Refit with the selected parameters and evaluate on the test set
rfc = RandomForestClassifier(max_leaf_nodes=14, n_estimators=500)
rfc.fit(x_train, y_train.astype('int'))
y_pred3 = rfc.predict(x_test)
score3 = rfc.score(x_test, y_test.astype('int'))
print("Accuracy:\n", score3)
report3 = classification_report(y_test.astype('int'), y_pred3)
print(report3)
- Logistic regression
from sklearn.linear_model import LogisticRegression
log = LogisticRegression()
param_test = {'penalty': ['l2'],
              'C': [0.01, 0.1, 1.0, 10, 100],
              'tol': [1e-4, 1e-5, 1e-6, 1e-7, 1e-8]}
log_gv = GridSearchCV(estimator=log, param_grid=param_test, cv=5)
log_gv.fit(x_train, y_train)
print("Best parameters:", log_gv.best_params_)
# Refit with the selected parameters and evaluate on the test set
log = LogisticRegression(C=1.0, penalty='l2', tol=0.0001)
log.fit(x_train, y_train.astype('int'))
y_pred4 = log.predict(x_test)
score4 = log.score(x_test, y_test.astype('int'))
print("Accuracy:\n", score4)
report4 = classification_report(y_test.astype('int'), y_pred4)
print(report4)
- SVM
from sklearn.svm import SVC
model = SVC(random_state=1)
param_grid = {
    'C': [0.001, 0.01, 0.1, 1, 10, 100],
    'gamma': [0.001, 0.01, 0.1, 1, 10, 100],
}
grid_search = GridSearchCV(estimator=model, param_grid=param_grid, cv=5)
grid_search.fit(x_train, y_train)
print("Best parameters:", grid_search.best_params_)
# probability=True enables predict_proba, which the ROC curves rely on
svr = SVC(C=10, gamma=0.01, probability=True)
svr.fit(x_train, y_train.astype('int'))
y_pred5 = svr.predict(x_test)
score5 = svr.score(x_test, y_test.astype('int'))
print("Accuracy:\n", score5)
report5 = classification_report(y_test.astype('int'), y_pred5)
print(report5)
import matplotlib.pyplot as plt
import seaborn as sns
# Compare test-set accuracy across the five models
model = ['Decision tree', 'KNN', 'Random forest', 'Logistic regression', 'SVM']
score = [score1, score2, score3, score4, score5]
plt.figure(figsize=(15, 10))
sns.barplot(x=score, y=model)
plt.show()
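When reading the accuracy comparison, it helps to keep the majority-class baseline in mind: with 196 non-recurrence and 81 recurrence samples, a classifier that always predicts no recurrence already scores about 0.71 on the full cleaned data. The exact baseline on the 30% test split may differ slightly. The arithmetic, as a short sketch:

```python
# Class counts reported after removing rows with missing values.
n_no_recurrence = 196
n_recurrence = 81
n_total = n_no_recurrence + n_recurrence

# Accuracy of always predicting the majority class (no recurrence).
baseline_accuracy = n_no_recurrence / n_total
print(n_total, round(baseline_accuracy, 4))   # 277 0.7076
```

A model is only informative to the extent that it beats this figure, which is why the per-class precision and recall in the classification reports matter alongside accuracy.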
- Plot the confusion matrices for the decision tree, KNN, random forest, logistic regression, and SVM
from sklearn.metrics import confusion_matrix
i = 1
fig1 = plt.figure(figsize=(3*6, 2*5))
estimator_dict = {'Logistic Regression': log, 'KNN': knn, 'Decision Tree': dtc, 'RandomForestClassifier': rfc, 'SupportVectorMachines': svr}
for key, estimator in estimator_dict.items():
    # Plot the confusion matrix (rows: true class, columns: predicted class)
    pred_y = estimator.predict(x_test)
    matrix = pd.DataFrame(confusion_matrix(y_test, pred_y))
    ax1 = fig1.add_subplot(2, 3, i)
    sns.heatmap(matrix, annot=True, cmap='OrRd')
    plt.title('Confusion Matrix -- %s ' % key)
    i += 1
plt.show()
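Each heatmap is a 2x2 matrix in scikit-learn's layout, with rows as true labels and columns as predicted labels. The sketch below, using invented counts, shows how accuracy, precision, and recall for class 1 follow from those four cells.

```python
import numpy as np

# Invented counts in scikit-learn's layout: rows = true class, columns = predicted class.
#                pred 0  pred 1
cm = np.array([[10,     14],    # true 0
               [6,      54]])   # true 1

# ravel() flattens row by row, giving the cells in this order.
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)   # of the samples predicted as class 1, the share that truly are class 1
recall = tp / (tp + fn)      # of the true class-1 samples, the share that were found

print(round(accuracy, 3), round(precision, 3), round(recall, 3))
```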
- Plot each model's ROC curve and compute the AUC
from sklearn.metrics import classification_report, roc_curve, auc, roc_auc_score
# Predicted probability of class 1 for each model, in the same order as estimator_dict
log_score = log.predict_proba(x_test)[:, 1]
knn_score = knn.predict_proba(x_test)[:, 1]
tree_score = dtc.predict_proba(x_test)[:, 1]
rfc_score = rfc.predict_proba(x_test)[:, 1]
svr_score = svr.predict_proba(x_test)[:, 1]
proba_scores = [log_score, knn_score, tree_score, rfc_score, svr_score]
fig2 = plt.figure(figsize=(4*5, 3*4))
for j, (key, proba) in enumerate(zip(estimator_dict, proba_scores), start=1):
    ax2 = fig2.add_subplot(2, 3, j)
    fprs, tprs, thresholds = roc_curve(y_test, proba)
    plt.plot(fprs, tprs)
    plt.plot([0, 1], [0, 1], linestyle='--')
    # Compute the AUC from the probabilities rather than from hard 0/1 predictions
    area = roc_auc_score(y_test, proba)
    plt.xlabel('FP rate\n %s_AUC:%f' % (key, area), fontsize=12)
    plt.ylabel('TP rate', fontsize=12)
    plt.title('ROC curve of %s' % key, fontsize=13)
    # Add grid lines
    plt.grid(True)
plt.show()
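AUC is defined over a continuous score, so it is normally computed from `predict_proba` output rather than from hard 0/1 predictions, which collapse the ROC curve to a single operating point. A small sketch with invented labels and scores shows that the two values can differ:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Invented labels and predicted probabilities for class 1.
y_true = np.array([0, 0, 0, 1, 1, 1])
proba = np.array([0.10, 0.40, 0.55, 0.45, 0.60, 0.90])

# Thresholding at 0.5 throws away the ranking information inside each predicted class.
hard = (proba >= 0.5).astype(int)

auc_from_proba = roc_auc_score(y_true, proba)   # 8 of 9 positive/negative pairs ranked correctly
auc_from_hard = roc_auc_score(y_true, hard)     # ties between equal labels count as half
print(auc_from_proba, auc_from_hard)
```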
from sklearn import metrics
# Overlay the ROC curves of all five models, using predicted probabilities of class 1
models = [('Decision Tree', dtc, 'b'),
          ('KNN', knn, 'r'),
          ('RandomForestClassifier', rfc, 'y'),
          ('Logistic Regression', log, 'g'),
          ('SupportVectorMachines', svr, 'm')]
plt.figure(figsize=(4*5, 3*4))
plt.title('Validation ROC')
for name, estimator, color in models:
    proba = estimator.predict_proba(x_test)[:, 1]
    fpr, tpr, threshold = metrics.roc_curve(y_test, proba)
    roc_auc = metrics.auc(fpr, tpr)
    plt.plot(fpr, tpr, color, label='%s AUC = %0.3f' % (name, roc_auc))
# Diagonal reference line for a random classifier
plt.plot([0, 1], [0, 1], 'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.grid(True)
plt.legend(loc='lower right')
plt.show()
The comparison of evaluation metrics above shows that the random forest model performs best.