Machine Learning in Practice: Predicting the Death of Sick Horses from Colic Data with Logistic Regression

1. Data processing
First, we load the data and take a look at it:

file = open('F:/MachineLearning/data/horse-colic.txt')
print(file.read())

The data looks like this:

1 1 530170 38.10 88 24 3 3 4 1 5 4 3 2 1 ? 3 4 41.00 4.60 ? ? 2 1 02209 00000 00000 2 
1 1 527709 38.00 108 60 2 3 4 1 4 3 3 2 ? ? 3 4 ? ? 3 ? 1 1 02205 00000 00000 2 
2 1 528169 38.20 48 ? 2 ? 1 2 3 3 1 2 1 ? ? 2 34.00 6.60 ? ? 1 2 03111 00000 00000 2 
......
1 1 529386 37.50 72 30 4 3 4 1 4 4 3 2 1 ? 3 5 60.00 6.80 ? ? 2 1 03205 00000 00000 2
1 1 530612 36.50 100 24 3 3 3 1 3 3 3 3 1 ? 4 4 50.00 6.00 3 3.40 1 1 02208 00000 00000 1
1 1 534618 37.2 40 20 ? ? ? ? ? ? ? ? ? ? 4 1 36 62 1 1 3 2 06112 00000 00000 2 
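
Since ? marks a missing value in the listing above, one quick way to see how much of the file is missing is to count the ? tokens. The sketch below is only an illustration: missing_fraction is a hypothetical helper, not part of the original code, and it counts over every whitespace-separated token, including the hospital number and the outcome column.

```python
def missing_fraction(text, missing_token='?'):
    """Share of whitespace-separated tokens equal to the missing-value marker."""
    tokens = text.split()
    if not tokens:
        return 0.0
    return sum(tok == missing_token for tok in tokens) / len(tokens)


# Tiny made-up sample (not real records): 3 of the 8 tokens are missing.
sample = "1 1 ? 38.1\n2 1 ? ?"
print(missing_fraction(sample))
```

On the real file the same helper could be applied to the string returned by file.read().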

Next, we look at what these features mean:

names_data = open('F:/MachineLearning/data/horse-colic_names.txt')
names=names_data.read()
print(names)

This prints:

1. TItle: Horse Colic database

2. Source Information
   -- Creators: Mary McLeish & Matt Cecile
                Department of Computer Science
                University of Guelph
                Guelph, Ontario, Canada N1G 2W1
                mdmcleish@water.waterloo.edu
   -- Donor:    Will Taylor (taylor@pluto.arc.nasa.gov)
   -- Date:     8/6/89

3. Past Usage:
   -- Unknown

4. Relevant Information:

   -- 2 data files 
      -- horse-colic.data: 300 training instances
      -- horse-colic.test: 68 test instances
   -- Possible class attributes: 24 (whether lesion is surgical)
     -- others include: 23, 25, 26, and 27
   -- Many Data types: (continuous, discrete, and nominal)

5. Number of Instances: 368 (300 for training, 68 for testing)

6. Number of attributes: 28

7. Attribute Information:

  1:  surgery?
          1 = Yes, it had surgery
          2 = It was treated without surgery

  2:  Age 
          1 = Adult horse
          2 = Young (< 6 months)

  3:  Hospital Number 
          - numeric id
          - the case number assigned to the horse
            (may not be unique if the horse is treated > 1 time)

  4:  rectal temperature
          - linear
          - in degrees celsius.
          - An elevated temp may occur due to infection.
          - temperature may be reduced when the animal is in late shock
          - normal temp is 37.8
          - this parameter will usually change as the problem progresses
               eg. may start out normal, then become elevated because of
                   the lesion, passing back through the normal range as the
                   horse goes into shock
  5:  pulse 
          - linear
          - the heart rate in beats per minute
          - is a reflection of the heart condition: 30 -40 is normal for adults
          - rare to have a lower than normal rate although athletic horses
            may have a rate of 20-25
          - animals with painful lesions or suffering from circulatory shock
            may have an elevated heart rate

  6:  respiratory rate
          - linear
          - normal rate is 8 to 10
          - usefulness is doubtful due to the great fluctuations

  7:  temperature of extremities
          - a subjective indication of peripheral circulation
          - possible values:
               1 = Normal
               2 = Warm
               3 = Cool
               4 = Cold
          - cool to cold extremities indicate possible shock
          - hot extremities should correlate with an elevated rectal temp.

  8:  peripheral pulse
          - subjective
          - possible values are:
               1 = normal
               2 = increased
               3 = reduced
               4 = absent
          - normal or increased p.p. are indicative of adequate circulation
            while reduced or absent indicate poor perfusion

  9:  mucous membranes
          - a subjective measurement of colour
          - possible values are:
               1 = normal pink
               2 = bright pink
               3 = pale pink
               4 = pale cyanotic
               5 = bright red / injected
               6 = dark cyanotic
          - 1 and 2 probably indicate a normal or slightly increased
            circulation
          - 3 may occur in early shock
          - 4 and 6 are indicative of serious circulatory compromise
          - 5 is more indicative of a septicemia

 10: capillary refill time
          - a clinical judgement. The longer the refill, the poorer the
            circulation
          - possible values
               1 = < 3 seconds
               2 = >= 3 seconds

 11: pain - a subjective judgement of the horse's pain level
          - possible values:
               1 = alert, no pain
               2 = depressed
               3 = intermittent mild pain
               4 = intermittent severe pain
               5 = continuous severe pain
          - should NOT be treated as a ordered or discrete variable!
          - In general, the more painful, the more likely it is to require
            surgery
          - prior treatment of pain may mask the pain level to some extent

 12: peristalsis                              
          - an indication of the activity in the horse's gut. As the gut
            becomes more distended or the horse becomes more toxic, the
            activity decreases
          - possible values:
               1 = hypermotile
               2 = normal
               3 = hypomotile
               4 = absent

 13: abdominal distension
          - An IMPORTANT parameter.
          - possible values
               1 = none
               2 = slight
               3 = moderate
               4 = severe
          - an animal with abdominal distension is likely to be painful and
            have reduced gut motility.
          - a horse with severe abdominal distension is likely to require
            surgery just tio relieve the pressure

 14: nasogastric tube
          - this refers to any gas coming out of the tube
          - possible values:
               1 = none
               2 = slight
               3 = significant
          - a large gas cap in the stomach is likely to give the horse
            discomfort

 15: nasogastric reflux
          - possible values
               1 = none
               2 = > 1 liter
               3 = < 1 liter
          - the greater amount of reflux, the more likelihood that there is
            some serious obstruction to the fluid passage from the rest of
            the intestine

 16: nasogastric reflux PH
          - linear
          - scale is from 0 to 14 with 7 being neutral
          - normal values are in the 3 to 4 range

 17: rectal examination - feces
          - possible values
               1 = normal
               2 = increased
               3 = decreased
               4 = absent
          - absent feces probably indicates an obstruction

 18: abdomen
          - possible values
               1 = normal
               2 = other
               3 = firm feces in the large intestine
               4 = distended small intestine
               5 = distended large intestine
          - 3 is probably an obstruction caused by a mechanical impaction
            and is normally treated medically
          - 4 and 5 indicate a surgical lesion

 19: packed cell volume
          - linear
          - the # of red cells by volume in the blood
          - normal range is 30 to 50. The level rises as the circulation
            becomes compromised or as the animal becomes dehydrated.

 20: total protein
          - linear
          - normal values lie in the 6-7.5 (gms/dL) range
          - the higher the value the greater the dehydration

 21: abdominocentesis appearance
          - a needle is put in the horse's abdomen and fluid is obtained from
            the abdominal cavity
          - possible values:
               1 = clear
               2 = cloudy
               3 = serosanguinous
          - normal fluid is clear while cloudy or serosanguinous indicates
            a compromised gut

 22: abdomcentesis total protein
          - linear
          - the higher the level of protein the more likely it is to have a
            compromised gut. Values are in gms/dL

 23: outcome
          - what eventually happened to the horse?
          - possible values:
               1 = lived
               2 = died
               3 = was euthanized

 24: surgical lesion?
          - retrospectively, was the problem (lesion) surgical?
          - all cases are either operated upon or autopsied so that
            this value and the lesion type are always known
          - possible values:
               1 = Yes
               2 = No

 25, 26, 27: type of lesion
          - first number is site of lesion
               1 = gastric
               2 = sm intestine
               3 = lg colon
               4 = lg colon and cecum
               5 = cecum
               6 = transverse colon
               7 = retum/descending colon
               8 = uterus
               9 = bladder
               11 = all intestinal sites
               00 = none
          - second number is type
               1 = simple
               2 = strangulation
               3 = inflammation
               4 = other
          - third number is subtype
               1 = mechanical
               2 = paralytic
               0 = n/a
          - fourth number is specific code
               1 = obturation
               2 = intrinsic
               3 = extrinsic
               4 = adynamic
               5 = volvulus/torsion
               6 = intussuption
               7 = thromboembolic
               8 = hernia
               9 = lipoma/slenic incarceration
               10 = displacement
               0 = n/a
 28: cp_data
          - is pathology data present for this case?
               1 = Yes
               2 = No
          - this variable is of no significance since pathology data
            is not included or collected for these cases

8. Missing values: 30% of the values are missing

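There are 28 attributes. Going through what each one means tells us which ones to drop and which ones to transform:
① Attribute 3 is the hospital number. It is only an identifier and has no statistical meaning, so we delete it.
② Attribute 23 records whether the horse lived, died, or was euthanized. This is exactly the outcome we want to predict, so it is the label of our data rather than a feature. We merge "died" and "was euthanized" into a single death class (the positive class, encoded as 1) and treat "lived" as the negative class (encoded as 0), and we separate this column from the training features.
③ Attribute 25 is a five-digit code whose digits encode different pieces of information (the names file lists site, type, subtype, and specific code of the lesion), so we split it into 5 features, one per digit.
④ Attributes 26 and 27 are 00000 on every row shown above, so they carry little or no information about survival in this training set. We delete both.
⑤ Attribute 28, the last one, says whether pathology data is present for the case. The names file itself states that this variable is of no significance, so we delete it as well.
⑥ A question mark (?) is a missing value. We could replace it with the mean of that feature or with 0. For convenience we use 0. The justification is that a feature value of 0 contributes nothing to the gradient component of its weight for that sample, so the missing entry does not affect the update.
⑦ Finally, do not forget to add a constant column to the training set.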
The note at the end of the names file also tells us that 30% of the values in this data set are missing.
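
Mean imputation, the alternative to zero filling mentioned above, could look roughly like the sketch below. It assumes the missing entries have been parsed as NaN instead of 0, and the helper names column_means and fill_missing are hypothetical. For the nominal attributes (coded categories) a mean is not very meaningful, so this fits the linear attributes best.

```python
import numpy as np


def column_means(X):
    """Mean of each column over observed (non-NaN) entries; 0 for an all-missing column."""
    observed = ~np.isnan(X)
    counts = observed.sum(axis=0)
    sums = np.where(observed, X, 0.0).sum(axis=0)
    return np.divide(sums, counts, out=np.zeros_like(sums), where=counts > 0)


def fill_missing(X, means):
    """Replace every NaN with the mean of its column."""
    return np.where(np.isnan(X), means, X)


# Made-up 3x2 example: column means are 2 and 6.
X_train_demo = np.array([[1.0, np.nan], [3.0, 4.0], [np.nan, 8.0]])
means = column_means(X_train_demo)
filled = fill_missing(X_train_demo, means)
print(means)
print(filled)
```

The means would normally be computed on the training set only and then reused to fill the test set, so that no test information leaks into training.
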
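The data-processing function is below. It returns the feature matrix, with the constant column already added, together with the labels: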

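import numpy as np

def prepared_data(path):
    """Parse a horse-colic file into (X, y); X has a leading constant column."""
    data_list = []
    y = []
    with open(path) as file:
        for line in file:
            data = line.split()  # whitespace-separated; also drops trailing spaces
            if not data:
                continue  # skip blank lines
            data_25 = data[24]   # attribute 25: five-digit lesion code
            y.append(data[22])   # attribute 23: outcome (the label)
            # Delete from the highest index down so the lower indices stay valid.
            del data[-1]   # attribute 28: cp_data
            del data[26]   # attribute 27
            del data[25]   # attribute 26
            del data[24]   # attribute 25 (re-added below, one digit per feature)
            del data[22]   # attribute 23
            del data[2]    # attribute 3: hospital number
            data.extend(data_25)
            # Missing values ('?') become 0.
            data = [0.0 if elem == '?' else float(elem) for elem in data]
            data_list.append(data)
    # died ('2') or euthanized ('3') -> 1; everything else, including a missing outcome, -> 0
    y = np.array([1 if label in ('2', '3') else 0 for label in y])
    X = np.array(data_list)
    X = np.insert(X, 0, 1, axis=1)  # constant column
    return X, y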

We process the training set and the test set with this function:

X,y = prepared_data('F:/MachineLearning/data/horse-colic.txt')
X_test,y_test=prepared_data('F:/MachineLearning/data/horse-colic-test.txt')
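
To see what prepared_data does to a single record, the sketch below applies the same per-row steps to the first row of the listing in Section 1. transform_row is a hypothetical helper written for this illustration; it leaves out the constant column, which prepared_data adds afterwards.

```python
def transform_row(line):
    """Apply the per-row steps of prepared_data to one record (illustration only)."""
    tokens = line.split()
    lesion_code = tokens[24]          # attribute 25
    outcome = tokens[22]              # attribute 23
    drop = {2, 22, 24, 25, 26, 27}    # zero-based positions of attributes 3, 23, 25-28
    kept = [tok for i, tok in enumerate(tokens) if i not in drop]
    kept.extend(lesion_code)          # one token per digit of the lesion code
    features = [0.0 if tok == '?' else float(tok) for tok in kept]
    label = 1 if outcome in ('2', '3') else 0
    return features, label


row = ("1 1 530170 38.10 88 24 3 3 4 1 5 4 3 2 1 ? 3 4 "
       "41.00 4.60 ? ? 2 1 02209 00000 00000 2")
features, label = transform_row(row)
print(len(features), label)   # 27 1
print(features[-5:])          # [0.0, 2.0, 2.0, 0.0, 9.0]
```

The 28 raw attributes lose 6 columns and gain 5 digit features, giving 27 features per horse, or 28 columns once the constant is added.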

2. The sigmoid function, the cost function, and the gradient function

def sigmoid(z):
    return 1 / (1 + np.exp(-z))


def cost(param, X, y):
    # Unregularized cross-entropy cost, averaged over the m samples.
    m = X.shape[0]
    h = sigmoid(X @ param)
    return -(y @ np.log(h) + (1 - y) @ np.log(1 - h)) / m


def regularized_cost(param, X, y, regularized_param):
    m = X.shape[0]
    # The constant term param[0] is left out of the penalty so that the cost
    # agrees with regularized_gradient, which does not penalize it either.
    regularized_term = (param[1:] @ param[1:]) * (regularized_param / (2 * m))
    return cost(param, X, y) + regularized_term


def regularized_gradient(param, X, y, regularized_param):
    m = X.shape[0]
    h = sigmoid(X @ param)
    gradient = (X.T @ (h - y)) / m
    regularized_term = (regularized_param / m) * param  # new array; param is untouched
    regularized_term[0] = 0  # do not penalize the constant term
    return gradient + regularized_term
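
Written as formulas, these functions compute roughly the following, with θ standing for param, λ for regularized_param, m for the number of rows of X, and x_0 for the constant column. The penalty is written with θ_0 left out, which is what regularized_gradient does when it sets regularized_term[0] to 0.

```latex
h_\theta(x) = \frac{1}{1 + e^{-\theta^{T} x}}

J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \Big[ y^{(i)} \log h_\theta(x^{(i)})
            + (1 - y^{(i)}) \log\big(1 - h_\theta(x^{(i)})\big) \Big]
            + \frac{\lambda}{2m} \sum_{j \ge 1} \theta_j^{2}

\frac{\partial J}{\partial \theta_0} = \frac{1}{m} \sum_{i=1}^{m}
            \big( h_\theta(x^{(i)}) - y^{(i)} \big) x_0^{(i)}

\frac{\partial J}{\partial \theta_j} = \frac{1}{m} \sum_{i=1}^{m}
            \big( h_\theta(x^{(i)}) - y^{(i)} \big) x_j^{(i)}
            + \frac{\lambda}{m} \theta_j \qquad (j \ge 1)
```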

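Both the gradient function and regularized_cost include the regularization term. We also write cost, a version without the regularization term, so that models trained with different regularization coefficients can be compared on the test set.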
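
Because the optimizer is handed the cost and the gradient as two separate functions, it is worth confirming that they describe the same objective. A finite-difference check is one way to do that. The sketch below is self-contained and uses made-up random data. reg_cost, reg_grad, and numerical_gradient are hypothetical names, written under the assumption that the constant term is not penalized.

```python
import numpy as np


def sigmoid(z):
    return 1 / (1 + np.exp(-z))


def reg_cost(theta, X, y, lam):
    """Cross-entropy cost plus an L2 penalty that skips theta[0]."""
    m = X.shape[0]
    h = sigmoid(X @ theta)
    data_term = -(y @ np.log(h) + (1 - y) @ np.log(1 - h)) / m
    return data_term + lam / (2 * m) * (theta[1:] @ theta[1:])


def reg_grad(theta, X, y, lam):
    """Analytic gradient of reg_cost."""
    m = X.shape[0]
    grad = X.T @ (sigmoid(X @ theta) - y) / m
    grad[1:] += lam / m * theta[1:]
    return grad


def numerical_gradient(f, theta, eps=1e-6):
    """Central-difference approximation of the gradient of f at theta."""
    grad = np.zeros_like(theta)
    for j in range(theta.size):
        step = np.zeros_like(theta)
        step[j] = eps
        grad[j] = (f(theta + step) - f(theta - step)) / (2 * eps)
    return grad


# Made-up data: 20 samples, 3 features plus a constant column.
rng = np.random.default_rng(0)
X_demo = np.insert(rng.normal(size=(20, 3)), 0, 1, axis=1)
y_demo = rng.integers(0, 2, size=20).astype(float)
theta_demo = rng.normal(scale=0.1, size=4)
lam_demo = 3.0

analytic = reg_grad(theta_demo, X_demo, y_demo, lam_demo)
numeric = numerical_gradient(
    lambda t: reg_cost(t, X_demo, y_demo, lam_demo), theta_demo)
max_gap = np.max(np.abs(analytic - numeric))
print(max_gap)  # should be very small if cost and gradient agree
```

A gap that grows with the regularization coefficient usually points at a mismatch in how the penalty is applied, for example penalizing the constant term in one function but not in the other.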

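3. Choosing the regularization coefficient
Here we pick the regularization coefficient on the test set. This is not a careful procedure: the sound approach is to choose the coefficient on a cross-validation set and then report the model's accuracy on the test set. Because our data is limited, we take the less careful route. The function below tries each coefficient and prints the resulting test error rate: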

from scipy.optimize import minimize

def select_regularized_param(regularized_param_list):
    # Uses the module-level X, y (training) and X_test, y_test (test) arrays.
    param = np.zeros(X.shape[1])
    for regularized_param in regularized_param_list:
        fmin = minimize(fun=regularized_cost, x0=param,
                        args=(X, y, regularized_param),
                        method='TNC', jac=regularized_gradient)
        theta = fmin.x
        # Predict death (1) when the estimated probability is at least 0.5.
        predictions = (sigmoid(X_test @ theta) >= 0.5).astype(int)
        error_rate = float(np.mean(predictions != y_test))
        print(error_rate, regularized_param)

We set up a list of candidate regularization coefficients and pick the best one from it:

L=[0,0.001,0.003,0.01,0.03,0.1,0.3,1,3,10,30,100]
select_regularized_param(L)

This prints the test error rate followed by the coefficient:

0.27941176470588236 0
0.27941176470588236 0.001
0.27941176470588236 0.003
0.27941176470588236 0.01
0.27941176470588236 0.03
0.27941176470588236 0.1
0.27941176470588236 0.3
0.2647058823529412 1
0.2647058823529412 3
0.2647058823529412 10
0.2647058823529412 30
0.29411764705882354 100

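We can pick any of the coefficients that reach the lowest error rate, namely 1, 3, 10, or 30. The lowest error rate is 0.2647058823529412, about 26.5%, which corresponds to 18 of the 68 test instances being misclassified. This is on the high side, but given that 30% of the values are missing it is a reasonable result.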
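
Section 3 notes that the careful procedure is to choose the coefficient on validation data and keep the test set for the final error rate. One way to do that with limited data is k-fold cross-validation on the training set alone. The sketch below is only an outline under stated assumptions: fit_logistic, cv_error, and select_lambda are hypothetical names, the data is synthetic instead of the horse-colic files, and scipy.special.expit is used as the sigmoid.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit  # numerically stable sigmoid


def fit_logistic(X, y, lam):
    """Fit L2-regularized logistic regression; the constant term is not penalized."""
    m, d = X.shape

    def cost_and_grad(theta):
        h = expit(X @ theta)
        h_safe = np.clip(h, 1e-12, 1 - 1e-12)  # avoid log(0)
        cost = -(y @ np.log(h_safe) + (1 - y) @ np.log(1 - h_safe)) / m
        cost += lam / (2 * m) * (theta[1:] @ theta[1:])
        grad = X.T @ (h - y) / m
        grad[1:] += lam / m * theta[1:]
        return cost, grad

    return minimize(cost_and_grad, np.zeros(d), jac=True, method='TNC').x


def cv_error(X, y, lam, k=5, seed=0):
    """Mean validation error rate over k folds of the training data."""
    folds = np.array_split(np.random.default_rng(seed).permutation(X.shape[0]), k)
    errors = []
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        theta = fit_logistic(X[train_idx], y[train_idx], lam)
        pred = (expit(X[val_idx] @ theta) >= 0.5).astype(int)
        errors.append(np.mean(pred != y[val_idx]))
    return float(np.mean(errors))


def select_lambda(X, y, candidates, k=5):
    """Return the candidate with the lowest cross-validated error, plus all scores."""
    scores = {lam: cv_error(X, y, lam, k) for lam in candidates}
    return min(scores, key=scores.get), scores


# Synthetic stand-in data: 120 samples, 4 features plus a constant column.
rng = np.random.default_rng(1)
X_demo = np.insert(rng.normal(size=(120, 4)), 0, 1, axis=1)
true_theta = np.array([0.5, 2.0, -1.0, 0.0, 0.0])
y_demo = (rng.random(120) < expit(X_demo @ true_theta)).astype(int)

best_lam, cv_scores = select_lambda(X_demo, y_demo, [0, 0.1, 1, 10, 100])
print(best_lam, cv_scores)
```

With the real data, X and y from prepared_data would take the place of X_demo and y_demo, and the test set would be used once, after the coefficient has been chosen, to report the error rate.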
