Ohio Clinic Supply-and-Demand Classification Problem


我不拽世界怎麼精彩


Article link: https://blog.csdn.net/weixin_44034053/article/details/95918794


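Detailed Analysis and Classification Process

I. Background and Objectives

(1) Background

A clinic in Ohio is losing money. It has the best doctors and its appointment book is well filled, yet it still runs at a loss, and it needs to find out why. The attributes extracted from its appointment data are listed in the table below.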

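Attribute	Description
Age	Patient's age
Gender	Patient's gender
AppointmentRegistration	Date the appointment was booked
ApointmentData	Date of the appointment (visit)
DayOfTheWeek	Day of the week of the appointment
Status	Whether the patient showed up for the appointment
Diabetes	Whether the patient has diabetes
Alcoolism	Whether the patient suffers from alcoholism
HiperTension	Whether the patient has hypertension
Handcap	Whether the patient has a disability
Smokes	Whether the patient smokes
Tuberculosis	Whether the patient has tuberculosis
Scholarship	Whether the patient receives a medical subsidy
Sms_Reminder	Whether an SMS reminder was sent
AwaitingTime	Waiting time = booking date - appointment date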

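(2) Objectives

Data processing: handle the outliers, then analyze the attributes.
Classification: apply classification algorithms to the data.

II. Analysis Approach

[Figure: overview of the analysis approach]
The models used in this project are decision tree, SGD, random forest, and GradientBoosting. Working through the project gave me a basic understanding of these classification algorithms.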

Code Walkthrough

(1) Data Exploration

1) First import the libraries, packages, and functions the project may need, and set the plotting parameters.

import pandas as pd
import numpy as np
from time import time
import matplotlib.pyplot as plt
from IPython.display import Image

from sklearn import metrics
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in scikit-learn 0.20
from sklearn.decomposition import PCA
from sklearn import kernel_approximation
from sklearn.linear_model import SGDClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.kernel_approximation import (RBFSampler,Nystroem)
from sklearn.ensemble import GradientBoostingClassifier,RandomForestClassifier
%matplotlib inline

plt.rcParams['figure.figsize']=15,5  # default figure size
plt.rcParams['font.sans-serif'] = ['SimHei']  # font that can render Chinese labels
plt.rcParams['axes.unicode_minus']=False  # render the minus sign correctly

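2) Read the data and take a first look.

df=pd.read_csv('No-show-Issue-Comma-300k.csv')  # load the data
df.shape  # dimensions of the data

[Figure: output of df.shape]
The output shows that the data has 300,000 rows and 15 attributes.

df.info()  # column names, dtypes, and non-null counts

[Figure: output of df.info()]
The output shows that there are no null values and that 5 attributes are not integer-typed; these must be converted or dropped before modeling.
3) Count the unique values of each attribute

for i in df.columns:  # iterate over the column names
    print(i,'\t',df[i].nunique())  # number of distinct values in the column

[Figure: unique-value count for each attribute]
The result gives the number of distinct values in each attribute. For example, Gender contains only M and F, so its count is 2; DayOfTheWeek has seven distinct values, Monday through Sunday, so its count is 7.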
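The per-column loop above can also be written as a single call. The following is a minimal sketch on a toy DataFrame; the values are made up for illustration and are not taken from the clinic dataset.

```python
import pandas as pd

# Toy data (hypothetical values, not from the clinic dataset)
toy = pd.DataFrame({
    "Gender": ["M", "F", "F", "M", "F"],
    "DayOfTheWeek": ["Monday", "Tuesday", "Monday", "Friday", "Friday"],
    "Age": [34, 51, 34, 8, 70],
})

# DataFrame.nunique() returns the number of distinct values per column
unique_counts = toy.nunique()
print(unique_counts)
```

Columns with very few distinct values (such as Gender) are natural candidates for integer encoding, which is what the processing steps that follow do.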

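(2) Data Processing

1) Age. An age cannot be negative, so first check Age for such values.

df.loc[df.Age<0,'Age'].count()  # number of records with Age below 0

There are 6 records with a negative age, so these 6 records should be removed.

df=df.loc[df.Age>=0].copy()  # drop the invalid ages; copy() avoids chained-assignment warnings later

2) Converting DayOfTheWeek
To simplify modeling later, the string values are converted to integers.

c={'Monday':1,"Tuesday":2,'Wednesday':3,'Thursday':4,'Friday':5,'Saturday':6,"Sunday":7}
df.DayOfTheWeek=df.DayOfTheWeek.map(c)  # map() applies the dictionary to every value
df.DayOfTheWeek.head()

[Figure: output of df.DayOfTheWeek.head()]
The column has been converted from object to int64.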
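One thing worth checking after a dictionary-based map() is whether every value was covered. The sketch below uses a made-up Series with a deliberate typo ("Funday") to show what happens to a value that is missing from the dictionary; the sample data is an assumption for illustration only.

```python
import pandas as pd

day_to_int = {"Monday": 1, "Tuesday": 2, "Wednesday": 3, "Thursday": 4,
              "Friday": 5, "Saturday": 6, "Sunday": 7}

# Hypothetical sample; "Funday" is a deliberate typo to show the failure mode
days = pd.Series(["Monday", "Sunday", "Funday"])
mapped = days.map(day_to_int)

# Values missing from the dictionary become NaN, which also turns the
# column into float64 instead of int64
n_unmapped = int(mapped.isna().sum())
print(mapped)
print("unmapped values:", n_unmapped)
```

If the mapped column comes back as int64, as it does in the post, no value was left unmapped.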

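3) Gender and Status

for b in ['Gender','Status']:
    df[b]=pd.Categorical(df[b]).codes  # encode as integer codes; Categorical.from_array was removed in later pandas versions
df.head()

[Figure: output of df.head()]
The result shows that M and F in Gender become 1 and 0, and that Show-Up and No-Show in Status become 1 and 0.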
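The reason M becomes 1 and F becomes 0 is that pandas sorts the categories of a string column before numbering them. A small sketch on made-up Status values shows how to recover the code-to-label mapping instead of reading it off the table by eye:

```python
import pandas as pd

# Hypothetical sample of the Status column
status = pd.Series(["Show-Up", "No-Show", "Show-Up", "Show-Up"])

cat = pd.Categorical(status)
codes = cat.codes                               # integer code for every row
label_of_code = dict(enumerate(cat.categories))  # code -> original label

print(list(codes))
print(label_of_code)
```

Keeping `label_of_code` around makes it easy to label plots and reports with the original names later on.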

4) AppointmentRegistration and ApointmentData. These two attributes are needed later to build new features, so they are left unchanged for now.

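(3) Feature Analysis

Single-feature analysis

1) Age

plt.figure(figsize=(8,4))
plt.hist(df.Age,len(df.Age.unique()),color='gray')  # one bin per distinct age
plt.title('Age Analysis')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.xlim(0,100)
plt.show()

[Figure: histogram of Age]
The histogram shows how many records there are for each age. The most frequent age is 0, followed by 56.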

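2) AwaitingTime

plt.figure(figsize=(8,4))
plt.hist(df.AwaitingTime,len(df.AwaitingTime.unique()),color='red')
plt.show()

[Figure: histogram of AwaitingTime before conversion]
Because AwaitingTime is the booking date minus the appointment date, its values are all negative, as the figure makes clear. They are now converted to positive values (that is, appointment date minus booking date).

df.AwaitingTime=df['AwaitingTime'].abs()  # take the absolute value of the negative waiting times
plt.figure(figsize=(8,4))
plt.hist(df.AwaitingTime,len(df.AwaitingTime.unique()))
plt.show()

[Figure: histogram of AwaitingTime after conversion]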
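To make the sign convention concrete, the sketch below recomputes a waiting time from two timestamp columns. The timestamps are made up, and the exact rule the dataset used to fill AwaitingTime (for example how same-day bookings are counted) is not stated in the post, so treat this as an assumption-laden illustration rather than a reconstruction of the original column.

```python
import pandas as pd

# Hypothetical timestamps in the ISO-like format that the post's code parses
toy = pd.DataFrame({
    "AppointmentRegistration": ["2014-12-16T14:46:25Z", "2015-01-05T09:00:00Z"],
    "ApointmentData": ["2015-01-14T00:00:00Z", "2015-01-05T00:00:00Z"],
})

# Parse and drop the time of day so that only whole dates are compared
registered = pd.to_datetime(toy["AppointmentRegistration"]).dt.normalize()
appointment = pd.to_datetime(toy["ApointmentData"]).dt.normalize()

# Booking minus appointment is <= 0 whenever the booking precedes the visit
signed_days = (registered - appointment).dt.days
# Appointment minus booking gives the waiting time as a non-negative number
waiting_days = (appointment - registered).dt.days

print(list(signed_days), list(waiting_days))
```

Taking the absolute value, as the post does, gives the same numbers as `waiting_days` as long as no appointment is dated before its booking.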

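3) Other features
The remaining features are visualized with bar charts.

x=['Gender','Status','DayOfTheWeek','Diabetes','Alcoolism',
'HiperTension','Handcap','Smokes','Scholarship','Tuberculosis',
'Sms_Reminder']   # features to visualize
for a in x:  # print the distinct values of each feature
    y=df[a].unique()
    print(y)

[Figure: distinct values of each feature]
The output shows that all of the features to be visualized are integer-typed.

def others_plot(x):  # draw one bar chart per feature
    plt.figure(figsize=(15,20))  # figure size
    for i,j in enumerate(x):  # i is the 0-based index, j the feature name
        plt.subplot(6,2,i+1)  # position i+1 in a grid of 6 rows and 2 columns
        df[j].value_counts().plot(kind='bar',title=j)
        plt.ylabel('Frequency')
others_plot(x)

[Figure: bar charts of the remaining features]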

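Two-feature analysis

1) AwaitingTime versus Age
A scatter plot of the two features:

plt.scatter(df['Age'],df['AwaitingTime'],s=0.5)
plt.title('AwaitingTime vs. Age')
plt.xlabel('Age(year)')
plt.ylabel('AwaitingTime(day)')
plt.xlim(0,120)
plt.ylim(0,120)
plt.show()

[Figure: scatter plot of AwaitingTime against Age]
The plot shows no apparent correlation at first, and some negative correlation after age 90 (waiting time falls as age rises). This is not enough to characterize the relationship, so the correlation matrix of the two features is computed next.

pd.set_option('display.width',100)  # show at most 100 characters per line
pd.set_option('display.precision',3)  # show 3 decimal places
correlations=df[['Age','AwaitingTime']].corr(method='pearson')  # Pearson correlation matrix of Age and AwaitingTime
print(correlations)

[Figure: correlation matrix of Age and AwaitingTime]
As a rule of thumb, an absolute correlation coefficient below 0.3 indicates no linear correlation and one above 0.3 indicates linear correlation: 0.3-0.5 is weak, 0.5-0.8 is marked (moderate), and above 0.8 is strong. The sign gives the direction: with a positive correlation (0 to 1) both variables increase together, and with a negative correlation (-1 to 0) one decreases as the other increases. The larger the absolute value, the stronger the relationship. For classification, features with low mutual correlation are usually preferred, because a low correlation means the two variables have little influence on each other.
Here the coefficient between AwaitingTime and Age is very small, so the two are only weakly related.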
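For readers who want to see what corr(method='pearson') computes, here is a short sketch that evaluates the Pearson coefficient by hand and compares it with pandas. The six data points are invented for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical values chosen only for illustration
toy = pd.DataFrame({
    "Age": [5, 20, 35, 50, 65, 80],
    "AwaitingTime": [12, 3, 30, 8, 21, 10],
})

x = toy["Age"].to_numpy(dtype=float)
y = toy["AwaitingTime"].to_numpy(dtype=float)

# Pearson r = sum of products of deviations / sqrt(product of squared deviations)
x_c = x - x.mean()
y_c = y - y.mean()
r_manual = (x_c * y_c).sum() / np.sqrt((x_c ** 2).sum() * (y_c ** 2).sum())

# The same quantity as computed by pandas
r_pandas = toy["Age"].corr(toy["AwaitingTime"], method="pearson")

print(round(r_manual, 3), round(r_pandas, 3))
```

Note that Pearson's coefficient only measures linear association; a value near zero does not rule out a nonlinear relationship.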

2) Status versus Sms_Reminder

g=df.groupby(['Sms_Reminder','Status'])['Sms_Reminder'].count().unstack()  # counts grouped by Sms_Reminder and Status
g[[0,1]].plot(kind='bar',stacked=True)  # stacked bar chart
plt.title('Sms_Reminder vs. Status')
plt.ylabel('Frequency')
plt.show()

[Figure: stacked bar chart of Status by Sms_Reminder]
The chart shows that Sms_Reminder = 1 is the most common value, followed by 0. For both values, Status = 1 accounts for the majority.
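Raw counts make it hard to compare groups of different sizes, so it can help to look at show-up rates within each Sms_Reminder group as well. The sketch below does this on ten invented rows; the numbers say nothing about the real dataset.

```python
import pandas as pd

# Hypothetical rows: Status 1 = showed up, 0 = no-show
toy = pd.DataFrame({
    "Sms_Reminder": [0, 0, 0, 0, 1, 1, 1, 1, 1, 1],
    "Status":       [1, 0, 1, 1, 1, 1, 0, 0, 1, 1],
})

# Counts per (Sms_Reminder, Status) pair, the same shape as the table plotted above
counts = toy.groupby(["Sms_Reminder", "Status"]).size().unstack(fill_value=0)

# Row-normalized version: share of each Status within each Sms_Reminder group
rates = pd.crosstab(toy["Sms_Reminder"], toy["Status"], normalize="index")

print(counts)
print(rates)
```

Each row of `rates` sums to 1, so the Status = 1 column can be read directly as the show-up rate for that reminder group.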

3) DayOfTheWeek and Age versus Status

g1=df.groupby(['DayOfTheWeek','Status'])['Status'].count().unstack()
g1[[0,1]].plot(kind='bar')  # grouped bar chart
plt.title('DayOfTheWeek vs. Status')
plt.ylabel('Frequency')
plt.show()

[Figure: bar chart of Status by DayOfTheWeek]
For both Status 0 and Status 1, the highest frequencies fall on weekdays; the frequencies on the two weekend days are very low.
A box plot is then used to visualize Age against Status:

df.boxplot(column=['Age'],return_type='axes',by='Status')
plt.ylabel('Age(year)')
plt.xlabel('Status')
plt.title('Age vs. Status')
plt.show()

[Figure: box plot of Age by Status]
The box plot clearly shows the 25%, 50%, and 75% quantiles of Age; each of them is slightly lower for Status 0 than for Status 1.

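4) Gender, Age, and Status

plt.figure(figsize=(15,3))
for i,j in enumerate(['no show ups','show ups']):  # i matches the Status code (0 or 1)
    df_show=df[df['Status']==i]
    plt.subplot(1,2,i+1)
    
    for k in [0,1]:  # Gender code: 0 = female, 1 = male
        df_gender=df_show[df_show['Gender']==k]
        fre_age=df_gender['Age'].value_counts().sort_index()
        fre_age.plot()
    plt.title('Age wise frequency of patient %s for both genders'%j)
    plt.xlabel('Age')
    plt.ylabel('Frequency')
    plt.legend(['Female','Male'],loc='upper left')
plt.show()

[Figure: age-wise patient counts by gender, for no-shows (left) and show-ups (right)]
The left panel shows the relationship between age and the number of male and female patients for Status 'no show ups': the counts of both men and women fall as age increases.
The right panel shows the same relationship for Status 'show ups': the counts of both men and women rise with age up to a certain point and then fall.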

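(4) Model Building

1) New features
The booking time and the appointment date need to be processed; the results are stored as new features.

for c in ['AppointmentRegistration','ApointmentData']:  # loop over the two timestamp columns
    for index,name in enumerate(['Year','Month','Day']):  # index selects the date part
        df['{}_{}'.format(c,name)]=df[c].apply(lambda x: int(x.split('T')[0].split('-')[index]))   # create the new feature
        # The feature name combines c and name; the value is obtained by splitting on 'T', taking element 0 (the date), then splitting on '-' and taking element index

for k,h in enumerate(['Hour','Minute','Second']):
    df['{}_{}'.format('AppointmentRegistration',h)]=df['AppointmentRegistration'].apply(lambda x:int(x.split('T')[1][:-1].split(':')[k]))
    # The feature name combines 'AppointmentRegistration' and h; the value is obtained by splitting on 'T', taking element 1 without the trailing 'Z', then splitting on ':' and taking element k
    
df.info()  # inspect the data

[Figure: output of df.info() after adding the new features]
The df.info() output confirms that the new features have been created. The original AppointmentRegistration (booking time) and ApointmentData (appointment time) attributes are now of no further use for modeling.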
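The string splitting above works, but the same features can also be obtained by parsing the column once and reading the parts through the `.dt` accessor. This is an alternative sketch, not the author's method; the two timestamps are invented and only mimic the 'YYYY-MM-DDTHH:MM:SSZ' format implied by the splitting code.

```python
import pandas as pd

# Hypothetical timestamps in the format implied by the split('T') / [:-1] logic
toy = pd.DataFrame({
    "AppointmentRegistration": ["2014-12-16T14:46:25Z", "2015-03-02T08:05:09Z"],
})

# Parse once; the trailing 'Z' is understood as UTC
ts = pd.to_datetime(toy["AppointmentRegistration"])

# Build the same AppointmentRegistration_<Part> columns as the post
for part in ["year", "month", "day", "hour", "minute", "second"]:
    toy["AppointmentRegistration_" + part.capitalize()] = getattr(ts.dt, part)

print(toy.iloc[0])
```

Parsing also fails loudly on malformed timestamps, whereas manual splitting can silently produce wrong parts if the format varies.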

2) Splitting the dataset
The attributes are divided into the features used for modeling and the label, which is Status (whether the patient showed up). The features and the label are then split into training and test sets.

# attributes that are useful for modeling
feature=['Age','Gender','DayOfTheWeek','Diabetes','Alcoolism','HiperTension','Handcap','Smokes','Tuberculosis','Scholarship',
        'Sms_Reminder','AwaitingTime','AppointmentRegistration_Year','AppointmentRegistration_Month',
        'AppointmentRegistration_Day','ApointmentData_Year','ApointmentData_Month','ApointmentData_Day','AppointmentRegistration_Hour',
       'AppointmentRegistration_Minute', 'AppointmentRegistration_Second']

x=np.array(df[feature])  # features as an array
y=np.array(df['Status'])  # label as an array

x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.3,random_state=1)  # random split with 30% held out as the test set; an integer seed makes it reproducible
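When the two classes are not equally frequent, a plain random split can leave the training and test sets with slightly different class proportions. The sketch below shows the `stratify` option of train_test_split on synthetic data; the 70/30 class ratio is an assumption chosen for illustration and is not a figure from the post.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic features and an imbalanced label (assumed 70% ones, 30% zeros)
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))
y = np.array([1] * 700 + [0] * 300)

# stratify=y keeps the class proportions the same in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y
)

train_rate = y_train.mean()  # share of class 1 in the training set
test_rate = y_test.mean()    # share of class 1 in the test set
print(X_train.shape, X_test.shape, train_rate, test_rate)
```

With matching proportions, accuracy on the test set is easier to compare against the share of the majority class.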

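3) Calling the models
First define a function that takes a model name and produces that model's classification results; it can also compare all of the models so that the best one can be chosen.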

def Classifier(name,x_train,y_train,x_test,y_test):
    if name=='DecisionTree':
        clf=DecisionTreeClassifier()
        clf.fit(x_train,y_train)
        y_pred=clf.predict(x_test)
        
        false_positive_rate1,true_positive_rate1,thresholds1=metrics.roc_curve(y_test,y_pred)
        plt.plot(false_positive_rate1,true_positive_rate1,label='DecisionTreeClassifier')
        plt.plot([0,1],[0,1],'r--')
        plt.legend(loc='lower right')
        plt.show()
    elif name=='SGD':
        clf=SGDClassifier()
        clf.fit(x_train,y_train)
        y_pred=clf.predict(x_test)
        
        false_positive_rate1,true_positive_rate1,thresholds1=metrics.roc_curve(y_test,y_pred)
        plt.plot(false_positive_rate1,true_positive_rate1,label='SGDClassifier')
        plt.plot([0,1],[0,1],'r--')
        plt.legend(loc='lower right')
        plt.show()
    elif name=='RandomForest':
        clf=RandomForestClassifier()
        clf.fit(x_train,y_train)
        y_pred=clf.predict(x_test)
        
        false_positive_rate1,true_positive_rate1,thresholds1=metrics.roc_curve(y_test,y_pred)
        plt.plot(false_positive_rate1,true_positive_rate1,label='RandomForestClassifier')
        plt.plot([0,1],[0,1],'r--')
        plt.legend(loc='lower right')
        plt.show()
    elif name=='GradientBoosting':
        clf=GradientBoostingClassifier(random_state=10,learning_rate=0.1,n_estimators=200,max_depth=5,max_features=10)
        clf.fit(x_train,y_train)
        y_pred=clf.predict(x_test)
        
        false_positive_rate1,true_positive_rate1,thresholds1=metrics.roc_curve(y_test,y_pred)
        plt.plot(false_positive_rate1,true_positive_rate1,label='GradientBoostingClassifier')
        plt.plot([0,1],[0,1],'r--')
        plt.legend(loc='lower right')
        plt.show()
    elif name=='Contrast':
        clf1=DecisionTreeClassifier()
        clf1.fit(x_train,y_train)
        
        clf2=SGDClassifier()
        clf2.fit(x_train,y_train)
        
        clf3=RandomForestClassifier()
        clf3.fit(x_train,y_train)
        
        clf4=GradientBoostingClassifier(random_state=10,learning_rate=0.1,n_estimators=200,max_depth=5,max_features=10)
        clf4.fit(x_train,y_train)
        
        y_pred1=clf1.predict(x_test)
        y_pred2=clf2.predict(x_test)
        y_pred3=clf3.predict(x_test)
        y_pred4=clf4.predict(x_test)
        
        y=[y_pred1,y_pred2,y_pred3,y_pred4]
        for y_pred in y:
            false_positive_rate1,true_positive_rate1,thresholds=metrics.roc_curve(y_test,y_pred)
            plt.plot(false_positive_rate1,true_positive_rate1)
        # one legend entry per curve, in plotting order
        plt.legend(['DecisionTreeClassifier','SGDClassifier','RandomForestClassifier','GradientBoostingClassifier'],loc='lower right')
        plt.show()
        return
    print('Test accuracy (Accuracy Score): {}'.format(metrics.accuracy_score(y_test,y_pred)))  # accuracy of the predictions on the test set
    print('Training accuracy: {}'.format(clf.score(x_train,y_train)))
    fpr,tpr,thresholds=metrics.roc_curve(y_test,y_pred)
    print('Area under the ROC curve (AUC): {}'.format(metrics.auc(fpr,tpr)))

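The function covers four classification models: decision tree, SGD, random forest, and GradientBoosting. Passing a different model name produces the corresponding results.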
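The function passes the hard 0/1 predictions to roc_curve, which yields a curve with at most three points. A common variation is to pass a continuous score such as the predicted probability of class 1. The sketch below contrasts the two on synthetic data from make_classification; it is not a rerun of the post's experiment, and the model settings are arbitrary assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the clinic data
X, y = make_classification(n_samples=2000, n_features=10, n_informative=4,
                           flip_y=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1
)

clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_train, y_train)

labels = clf.predict(X_test)              # hard 0/1 predictions
scores = clf.predict_proba(X_test)[:, 1]  # predicted probability of class 1

# ROC from hard labels has only a single operating point between the corners
fpr_l, tpr_l, _ = roc_curve(y_test, labels)
# ROC from probabilities sweeps over many thresholds
fpr_s, tpr_s, _ = roc_curve(y_test, scores)

auc_labels = roc_auc_score(y_test, labels)
auc_scores = roc_auc_score(y_test, scores)
print(len(fpr_l), len(fpr_s), auc_labels, auc_scores)
```

SGDClassifier with its default hinge loss has no predict_proba; its decision_function output can serve as the score instead.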

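Classifier('DecisionTree',x_train,y_train,x_test,y_test)
# Besides DecisionTree, the first argument can be SGD, RandomForest, GradientBoosting, or Contrast

The output for each of these arguments is shown in the figures below.

a) Decision tree results
[Figure: decision tree ROC curve and scores]

b) SGD results

[Figure: SGD ROC curve and scores]

c) Random forest results
[Figure: random forest ROC curve and scores]

d) GradientBoosting results
[Figure: GradientBoosting ROC curve and scores]

e) Model comparison results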

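4) Interpreting the results
Accuracy Score: the percentage of all samples that are classified correctly.
Training accuracy: the accuracy of the trained model on the data it was trained on. A value close to 1 means the model fits the training data well.
Area under the ROC curve (AUC): an ROC curve often does not show clearly which classifier performs better, so the AUC value is commonly used as the criterion; the larger the better.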
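Training accuracy is most informative when read next to test accuracy. The sketch below fits an unrestricted decision tree and a depth-limited one on noisy synthetic data; the dataset and the max_depth value are assumptions for illustration, not settings from the post.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with 20% label noise, so a perfect fit must memorize noise
X, y = make_classification(n_samples=2000, n_features=10, n_informative=4,
                           flip_y=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1
)

deep = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
shallow = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_train, y_train)

deep_train, deep_test = deep.score(X_train, y_train), deep.score(X_test, y_test)
shallow_train, shallow_test = shallow.score(X_train, y_train), shallow.score(X_test, y_test)

print("deep tree    train/test:", deep_train, deep_test)
print("shallow tree train/test:", shallow_train, shallow_test)
```

A large gap between training and test accuracy, as the unrestricted tree shows here, is a sign of overfitting rather than of a better model.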

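The results give the following rankings:
Accuracy Score: GradientBoosting > SGD > random forest > decision tree
Training accuracy: decision tree > random forest > GradientBoosting > SGD
Area under the ROC curve (AUC): random forest > decision tree > GradientBoosting > SGD

Judged by AUC, random forest therefore gives the best classification results in this project.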
