I. Data Cleaning
A few notes up front. The dataset's columns are:
preg: Number of times pregnant
plas: Plasma glucose concentration at 2 hours in an oral glucose tolerance test
pres: Diastolic blood pressure (mm Hg)
skin: Triceps skin fold thickness (mm)
insu: 2-Hour serum insulin (mu U/ml)
mass: Body mass index (weight in kg / (height in m)^2)
pedi: Diabetes pedigree function
age: Age (years)
class: Class variable (0 or 1)
1. Data import
As before, we load the data with pandas:
diabetes_data = pd.read_csv(r"C:\Users\86137\PycharmProjects\pythonProject\venv\糖尿病检测\diabetes.csv")
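All the snippets in this post assume a common set of imports. The sketch below collects everything the later code uses (module and class names match the calls in this post; adjust to the versions you have installed):

```python
# Imports assumed by the snippets in this post (a sketch, not exhaustive).
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from collections import Counter

from sklearn.model_selection import (StratifiedKFold, cross_val_score,
                                     GridSearchCV, learning_curve)
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (AdaBoostClassifier, RandomForestClassifier,
                              ExtraTreesClassifier, GradientBoostingClassifier)
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
```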
2. Inspecting the data
Step 1
diabetes_data.head()  # first five rows
Step 2
diabetes_data  # display the whole frame
From the output we can see there are 768 rows in total.
Step 3
diabetes_data.info()  # dtypes and non-null counts per column
In this dataset a value of 0 in several columns actually encodes a missing measurement, so we treat those zeros as missing:
# Replace every 0 in these columns with NaN (working on a deep copy)
diabetes_data_copy = diabetes_data.copy(deep = True)
diabetes_data_copy[['plas','pres','skin','insu','mass']] = diabetes_data_copy[['plas','pres','skin','insu','mass']].replace(0, np.nan)
print(diabetes_data_copy.isnull().sum())
This produces the missing-value counts below.
Analysis: plas and mass are missing only a few values, pres somewhat more, while skin and insu are missing far too many.
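The 0 to NaN trick can be checked on a tiny fabricated frame (the column names mirror the dataset, the values are made up):

```python
import numpy as np
import pandas as pd

# Fabricated frame: zeros stand for missing measurements
toy = pd.DataFrame({"plas": [148, 0, 183], "pres": [72, 0, 0]})
toy[["plas", "pres"]] = toy[["plas", "pres"]].replace(0, np.nan)
print(toy.isnull().sum())  # plas: 1 missing, pres: 2 missing
```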
Step 4
diabetes_data.describe()  # summary statistics
3. Outlier detection
We reuse the same outlier-detection code as in the Titanic project:
# Outlier detection function (Tukey's IQR rule)
from collections import Counter

def detect_outliers(df, n, features):
    """
    Takes a dataframe df of features and returns a list of the indices
    corresponding to the observations containing more than n outliers according
    to the Tukey method.
    """
    outlier_indices = []
    # iterate over features (columns)
    for col in features:
        # 1st quartile (25%)
        Q1 = np.percentile(df[col], 25)
        # 3rd quartile (75%)
        Q3 = np.percentile(df[col], 75)
        # Interquartile range (IQR)
        IQR = Q3 - Q1
        # outlier step
        outlier_step = 1.5 * IQR
        # Determine a list of indices of outliers for feature col
        outlier_list_col = df[(df[col] < Q1 - outlier_step) | (df[col] > Q3 + outlier_step)].index
        # append the found outlier indices for col to the list of outlier indices
        outlier_indices.extend(outlier_list_col)
    # select observations containing more than n outliers
    outlier_indices = Counter(outlier_indices)
    multiple_outliers = list(k for k, v in outlier_indices.items() if v > n)
    return multiple_outliers
Outliers_to_drop = detect_outliers(diabetes_data, 4, ["preg", "plas", "pres", "skin", "insu", "mass", "pedi", "age"])
print(diabetes_data.loc[Outliers_to_drop])  # show the outlier rows
# No row contains more than four outlying features, so no rows are dropped.
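To make the Tukey rule concrete, here is a toy run on fabricated data, using a compact restatement of the same detect_outliers logic; with n=0 a single outlying feature is enough to flag a row:

```python
import numpy as np
import pandas as pd
from collections import Counter

def detect_outliers(df, n, features):
    """Indices of rows with more than n Tukey outliers (same logic as above)."""
    idx = []
    for col in features:
        Q1, Q3 = np.percentile(df[col], [25, 75])
        step = 1.5 * (Q3 - Q1)
        idx.extend(df[(df[col] < Q1 - step) | (df[col] > Q3 + step)].index)
    return [k for k, v in Counter(idx).items() if v > n]

# Fabricated data: row 4 has an extreme value in column "a"
toy = pd.DataFrame({"a": [1, 2, 3, 2, 100], "b": [5, 6, 5, 6, 5]})
print(detect_outliers(toy, 0, ["a", "b"]))  # [4]
```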
For more on handling missing values, see the CSDN post 《3000字详解四种常用的缺失值处理方法》 by 一行玩python.
4. Converting non-numeric columns
diabetes_data['OutCome'] = diabetes_data['class'].map({"b'tested_positive'": 1, "b'tested_negative'": 0})
diabetes_data.drop(labels = ["class"], axis = 1, inplace = True)
# Create a new OutCome column mapping b'tested_positive' to 1 and b'tested_negative' to 0, then drop the original class column.
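The conversion itself is just Series.map with a dict; a toy check (the labels are copied from the dataset):

```python
import pandas as pd

# Map the byte-string class labels to 0/1
s = pd.Series(["b'tested_positive'", "b'tested_negative'", "b'tested_positive'"])
mapped = s.map({"b'tested_positive'": 1, "b'tested_negative'": 0})
print(mapped.tolist())  # [1, 0, 1]
```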
Then we check the data types:
print("Column dtypes")
diabetes_data.dtypes
Result: every column is now numeric.
We re-check the first five rows:
diabetes_data.head()
5. Analyzing the feature columns
(1) Correlation matrix
print("Correlation matrix")
# Correlation matrix between the numeric features and OutCome
g = sns.heatmap(diabetes_data[["preg", "plas", "pres", "skin","insu","mass","pedi","age","OutCome"]].corr(),annot=True, fmt = ".2f", cmap = "coolwarm")
plt.show()
Analysis: plas, mass, age, and preg correlate most strongly with OutCome; the columns with many missing values show weaker correlations.
(2) plas vs OutCome
From the missing-value counts above, plas has only 5 missing values, so we impute with the median (assuming the 0-to-NaN replacement above has been applied to diabetes_data as well):
diabetes_data['plas'] = diabetes_data['plas'].fillna(diabetes_data['plas'].median())  # fillna returns a copy; assign it back
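Note that fillna returns a new Series unless the result is assigned back; a minimal illustration on fabricated values:

```python
import numpy as np
import pandas as pd

# Median imputation on made-up plas-like values
s = pd.Series([148.0, np.nan, 183.0, 150.0])
filled = s.fillna(s.median())  # median of the non-missing values is 150.0
print(filled.tolist())  # [148.0, 150.0, 183.0, 150.0]
```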
A bar plot of this feature looks like the following:
# bar plot
print("plas vs OutCome")
# Explore plas feature vs OutCome
g = sns.catplot(x="plas",y="OutCome",data=diabetes_data,kind="bar", height = 6 ,
palette = "muted")
g.despine(left=True)
g = g.set_ylabels("OutCome")
plt.show()
The bar plot is hard to read for a continuous feature, so we use histograms instead:
g = sns.FacetGrid(diabetes_data, col='OutCome')
g = g.map(sns.histplot, "plas")
plt.show()
Analysis: the two histograms differ clearly, with positive cases concentrated at higher plas values, so plas has a strong influence on the outcome.
(3) mass vs OutCome
From the missing-value counts above, mass has only 11 missing values, so we again impute with the median:
print("mass vs OutCome")
diabetes_data['mass'] = diabetes_data['mass'].fillna(diabetes_data['mass'].median())  # fillna returns a copy; assign it back
g = sns.FacetGrid(diabetes_data, col='OutCome')
g = g.map(sns.histplot, "mass")
plt.show()
Analysis: the counts for positive cases are lower than for negative cases, and mass stays below 60 in both groups.
(4) age vs OutCome
print("age vs OutCome")
# Explore age vs OutCome
g = sns.FacetGrid(diabetes_data, col='OutCome')
g = g.map(sns.histplot, "age")
plt.show()
Analysis: positive cases are spread fairly evenly across ages 20 to 60, while cases above 60 are mostly negative.
(5) preg vs OutCome
print("preg vs OutCome")
# Explore preg feature vs OutCome
g = sns.catplot(x="preg",y="OutCome",data=diabetes_data,kind="bar", height = 6 ,
palette = "muted")
g.despine(left=True)
g = g.set_ylabels("OutCome")
plt.show()
Analysis: the higher preg is, the more likely a positive test.
(6) age distribution
print("age distribution")
# Explore the age distribution for each outcome
g = sns.kdeplot(diabetes_data["age"][(diabetes_data["OutCome"] == 0) & (diabetes_data["age"].notnull())], color="Red", fill=True)  # recent seaborn uses fill= instead of the removed shade=
g = sns.kdeplot(diabetes_data["age"][(diabetes_data["OutCome"] == 1) & (diabetes_data["age"].notnull())], ax=g, color="Blue", fill=True)
g.set_xlabel("age")
g.set_ylabel("Frequency")
g = g.legend(["0","1"])
plt.show()
Analysis: a positive test is most likely between ages 30 and 60.
(7) age vs plas, mass, and preg
print("age vs plas, mass, and preg")
# Explore age vs plas, mass, and preg
g = sns.catplot(y="age",x="plas",hue="OutCome", data=diabetes_data,kind="box")
g = sns.catplot(y="age",x="mass", hue="OutCome",data=diabetes_data,kind="box")
g = sns.catplot(y="age",x="preg",hue="OutCome",data=diabetes_data,kind="box")
plt.show()
II. Model Training
Cross-validating the models
# Cross validate models with stratified k-fold cross-validation
kfold = StratifiedKFold(n_splits=10)
# Modeling step: test different algorithms
random_state = 2
classifiers = []
classifiers.append(SVC(random_state=random_state))
classifiers.append(DecisionTreeClassifier(random_state=random_state))
classifiers.append(AdaBoostClassifier(DecisionTreeClassifier(random_state=random_state),random_state=random_state,learning_rate=0.1))
classifiers.append(RandomForestClassifier(random_state=random_state))
classifiers.append(ExtraTreesClassifier(random_state=random_state))
classifiers.append(GradientBoostingClassifier(random_state=random_state))
classifiers.append(MLPClassifier(random_state=random_state))
classifiers.append(KNeighborsClassifier())
classifiers.append(LogisticRegression(random_state = random_state))
classifiers.append(LinearDiscriminantAnalysis())
diabetes_data["OutCome"] = diabetes_data["OutCome"].astype(int)
Y_diabetes_data = diabetes_data["OutCome"]
X_diabetes_data = diabetes_data.drop(labels = ["OutCome"],axis = 1)
cv_results = []
for classifier in classifiers:
    cv_results.append(cross_val_score(classifier, X_diabetes_data, y = Y_diabetes_data, scoring = "accuracy", cv = kfold, n_jobs=4))

cv_means = []
cv_std = []
for cv_result in cv_results:
    cv_means.append(cv_result.mean())
    cv_std.append(cv_result.std())

cv_res = pd.DataFrame({"CrossValMeans": cv_means, "CrossValerrors": cv_std,
                       "Algorithm": ["SVC", "DecisionTree", "AdaBoost", "RandomForest", "ExtraTrees",
                                     "GradientBoosting", "MultipleLayerPerceptron", "KNeighboors",
                                     "LogisticRegression", "LinearDiscriminantAnalysis"]})

g = sns.barplot(x="CrossValMeans", y="Algorithm", data=cv_res, palette="Set3", orient="h", **{'xerr': cv_std})
g.set_xlabel("Mean Accuracy")
g = g.set_title("Cross validation scores")
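Each entry of cv_results comes from one cross_val_score call. Here is a self-contained miniature of that loop for a single model, with synthetic data and a stand-in classifier replacing the real dataset:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Stand-in data and model; the real loop runs this once per classifier.
X, y = make_classification(n_samples=200, n_features=8, random_state=2)
kfold = StratifiedKFold(n_splits=10)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         scoring="accuracy", cv=kfold)
print(scores.mean(), scores.std())  # one CrossValMeans / CrossValerrors pair
```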
Hyperparameter tuning for the best models
### META MODELING WITH ADABOOST, RF, EXTRATREES and GRADIENTBOOSTING
# Adaboost
DTC = DecisionTreeClassifier()
adaDTC = AdaBoostClassifier(DTC, random_state=7)
ada_param_grid = {"base_estimator__criterion": ["gini", "entropy"],  # scikit-learn >= 1.2 renamed this prefix to "estimator__"
                  "base_estimator__splitter": ["best", "random"],
                  "algorithm": ["SAMME", "SAMME.R"],
                  "n_estimators": [1, 2],
                  "learning_rate": [0.0001, 0.001, 0.01, 0.1, 0.2, 0.3, 1.5]}
gsadaDTC = GridSearchCV(adaDTC,param_grid = ada_param_grid, cv=kfold, scoring="accuracy", n_jobs= 4, verbose = 1)
gsadaDTC.fit(X_diabetes_data,Y_diabetes_data)
ada_best = gsadaDTC.best_estimator_
gsadaDTC.best_score_
#ExtraTrees
ExtC = ExtraTreesClassifier()
## Search grid for optimal parameters
ex_param_grid = {"max_depth": [None],
"max_features": [1, 3, 10],
"min_samples_split": [2, 3, 10],
"min_samples_leaf": [1, 3, 10],
"bootstrap": [False],
"n_estimators" :[100,300],
"criterion": ["gini"]}
gsExtC = GridSearchCV(ExtC,param_grid = ex_param_grid, cv=kfold, scoring="accuracy", n_jobs= 4, verbose = 1)
gsExtC.fit(X_diabetes_data,Y_diabetes_data)
ExtC_best = gsExtC.best_estimator_
# Best score
gsExtC.best_score_
# RFC parameter tuning
RFC = RandomForestClassifier()
## Search grid for optimal parameters
rf_param_grid = {"max_depth": [None],
"max_features": [1, 3, 10],
"min_samples_split": [2, 3, 10],
"min_samples_leaf": [1, 3, 10],
"bootstrap": [False],
"n_estimators" :[100,300],
"criterion": ["gini"]}
gsRFC = GridSearchCV(RFC,param_grid = rf_param_grid, cv=kfold, scoring="accuracy", n_jobs= 4, verbose = 1)
gsRFC.fit(X_diabetes_data,Y_diabetes_data)
RFC_best = gsRFC.best_estimator_
# Best score
gsRFC.best_score_
# Gradient boosting tuning
GBC = GradientBoostingClassifier()
gb_param_grid = {'loss' : ["deviance"],  # renamed to "log_loss" in scikit-learn >= 1.1
'n_estimators' : [100,200,300],
'learning_rate': [0.1, 0.05, 0.01],
'max_depth': [4, 8],
'min_samples_leaf': [100,150],
'max_features': [0.3, 0.1]
}
gsGBC = GridSearchCV(GBC,param_grid = gb_param_grid, cv=kfold, scoring="accuracy", n_jobs= 4, verbose = 1)
gsGBC.fit(X_diabetes_data,Y_diabetes_data)
GBC_best = gsGBC.best_estimator_
# Best score
gsGBC.best_score_
### SVC classifier
SVMC = SVC(probability=True)
svc_param_grid = {'kernel': ['rbf'],
'gamma': [ 0.001, 0.01, 0.1, 1],
'C': [1, 10, 50, 100,200,300, 1000]}
gsSVMC = GridSearchCV(SVMC,param_grid = svc_param_grid, cv=kfold, scoring="accuracy", n_jobs= 4, verbose = 1)
gsSVMC.fit(X_diabetes_data,Y_diabetes_data)
SVMC_best = gsSVMC.best_estimator_
# Best score
gsSVMC.best_score_
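All five searches follow the same pattern: fit, then read best_params_, best_score_, and best_estimator_. A self-contained miniature on synthetic data with a deliberately tiny grid (the data and grid here are placeholders):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.svm import SVC

# Placeholder data and a deliberately tiny grid
X, y = make_classification(n_samples=120, n_features=8, random_state=2)
gs = GridSearchCV(SVC(), {"C": [1, 10], "gamma": [0.01, 0.1]},
                  cv=StratifiedKFold(n_splits=5), scoring="accuracy")
gs.fit(X, y)
print(gs.best_params_)            # winning parameter combination
print(gs.best_score_)             # its mean cross-validated accuracy
best_model = gs.best_estimator_   # refit on all the data, ready to predict
```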
Plot learning curves
def plot_learning_curve(estimator, title, X, y, ylim=None, cv=None,
                        n_jobs=-1, train_sizes=np.linspace(.1, 1.0, 5)):
    """Generate a simple plot of the test and training learning curve"""
    plt.figure()
    plt.title(title)
    if ylim is not None:
        plt.ylim(*ylim)
    plt.xlabel("Training examples")
    plt.ylabel("Score")
    train_sizes, train_scores, test_scores = learning_curve(
        estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes)
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    plt.grid()
    plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std, alpha=0.1,
                     color="r")
    plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
                     test_scores_mean + test_scores_std, alpha=0.1, color="g")
    plt.plot(train_sizes, train_scores_mean, 'o-', color="r",
             label="Training score")
    plt.plot(train_sizes, test_scores_mean, 'o-', color="g",
             label="Cross-validation score")
    plt.legend(loc="best")
    return plt
g = plot_learning_curve(gsRFC.best_estimator_,"RF learning curves",X_diabetes_data,Y_diabetes_data,cv=kfold)
g = plot_learning_curve(gsExtC.best_estimator_,"ExtraTrees learning curves",X_diabetes_data,Y_diabetes_data,cv=kfold)
g = plot_learning_curve(gsSVMC.best_estimator_,"SVC learning curves",X_diabetes_data,Y_diabetes_data,cv=kfold)
g = plot_learning_curve(gsadaDTC.best_estimator_,"AdaBoost learning curves",X_diabetes_data,Y_diabetes_data,cv=kfold)
g = plot_learning_curve(gsGBC.best_estimator_,"GradientBoosting learning curves",X_diabetes_data,Y_diabetes_data,cv=kfold)
Feature importance of tree-based classifiers
nrows = ncols = 2
fig, axes = plt.subplots(nrows = nrows, ncols = ncols, sharex="all", figsize=(15,15))
names_classifiers = [("AdaBoosting", ada_best),("ExtraTrees",ExtC_best),("RandomForest",RFC_best),("GradientBoosting",GBC_best)]
nclassifier = 0
for row in range(nrows):
    for col in range(ncols):
        name = names_classifiers[nclassifier][0]
        classifier = names_classifiers[nclassifier][1]
        indices = np.argsort(classifier.feature_importances_)[::-1][:40]
        g = sns.barplot(y=X_diabetes_data.columns[indices][:40], x=classifier.feature_importances_[indices][:40], orient='h', ax=axes[row][col])
        g.set_xlabel("Relative importance", fontsize=12)
        g.set_ylabel("Features", fontsize=12)
        g.tick_params(labelsize=9)
        g.set_title(name + " feature importance")
        nclassifier += 1
plt.show()
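The index juggling in the loop boils down to sorting the feature importances in descending order; a toy check with fabricated importances:

```python
import numpy as np

# argsort gives ascending order; [::-1] reverses it to descending
importances = np.array([0.10, 0.40, 0.25, 0.25])
indices = np.argsort(importances)[::-1]
print(indices)  # [1 3 2 0]: feature 1 is the most important
```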
I have not yet worked out how to carry the remaining modeling steps over to this dataset.
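For what it's worth, the Titanic kernel this post follows ends by combining its tuned models in a soft VotingClassifier; a minimal sketch of that final step, with untuned placeholder models standing in for RFC_best, ExtC_best, and SVMC_best, and synthetic data standing in for the diabetes features:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (ExtraTreesClassifier, RandomForestClassifier,
                              VotingClassifier)
from sklearn.svm import SVC

# Placeholder data and untuned models standing in for the gs*.best_estimator_
# results above; soft voting averages the predicted class probabilities.
X, y = make_classification(n_samples=150, n_features=8, random_state=2)
voting = VotingClassifier(
    estimators=[("rfc", RandomForestClassifier(random_state=2)),
                ("extc", ExtraTreesClassifier(random_state=2)),
                ("svc", SVC(probability=True, random_state=2))],
    voting="soft", n_jobs=1)
voting.fit(X, y)
print(voting.predict(X[:5]))
```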