Network Programming Course Summary

meishuguo

Article link: https://blog.csdn.net/u010454438/article/details/53983250

Project Overview
My Pull Requests
Project Demo Implementation
Course Takeaways

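Project Overview

The course project applies machine learning to the analysis of a routine blood test report, mainly to predict the patient's sex and age. NP2016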

My Pull Requests

1. OCR: pytesseract calling tesseract for OCR [merged manually]
2. Data processing using numpy, organized into an easy-to-use format [accepted]
3. LR and model learning curves [not yet handled]

Project Demo Implementation

Repository

Runtime Environment

# Install numpy
sudo apt-get install python-numpy # http://www.numpy.org/
# Install opencv
sudo apt-get install python-opencv # http://opencv.org/

# Install OCR and preprocessing dependencies
sudo apt-get install tesseract-ocr
sudo pip install pytesseract
sudo apt-get install python-tk
sudo pip install pillow

# Install the Flask framework and mongo
sudo pip install Flask
sudo apt-get install mongodb # if the package is not found, run sudo apt-get update first
sudo service mongodb start
sudo pip install pymongo

# Install pandas and scikit-learn
sudo apt-get install python-numpy cython python-scipy python-matplotlib
pip install -U scikit-learn # prepend sudo if this fails
pip install pandas

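Project Notes

Image Processing

A2 is mainly about turning a photographed blood test report into readable training data, which is an image processing task.

There are two approaches to image processing. One is the traditional approach, which uses the color, geometric, and frequency-domain information of the image itself and applies transforms to extract information, for example a Fourier transform to extract frequency-domain information for denoising, or a mean filter for smoothing. The other is machine learning; the recently popular imitation of Van Gogh's painting style is done this way. In my view the two reinforce each other: deep learning can discover better operators, and traditional methods can make deep learning more effective.
Whichever approach is used, you have to understand what the image means. Deep learning for image processing only extracts more information from the image; without understanding the image itself, blindly tuning parameters is unlikely to give good results.

This part is quite challenging, and the whole project was stuck here for several days.

I did little of this part myself, so here I only record what I learned from my classmates' approaches.

First, preprocess the whole image.
(Image: blood test report)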

# Convert to grayscale
img_gray = cv2.cvtColor(self.img, cv2.COLOR_BGR2GRAY)
# Gaussian smoothing
img_gb = cv2.GaussianBlur(img_gray, (gb_param,gb_param), 0)
# Morphological closing
closed = cv2.morphologyEx(img_gb, cv2.MORPH_CLOSE, kernel)
# Morphological opening
opened = cv2.morphologyEx(closed, cv2.MORPH_OPEN, kernel)
# Canny edge detection
edges = cv2.Canny(opened, canny_param_lower , canny_param_upper)

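Because the whole image mixes English, Chinese, digits, and symbols, recognizing it directly is very difficult. So the image first has to be cropped into individual cells, and locating the table within the image is the key step.

The black lines are used as the features for locating it.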

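# Call findContours to extract the contours
contours, hierarchy = cv2.findContours(edges, cv2.RETR_TREE, cv2.CHAIN_APPROX_SIMPLE)
# Get the minimum-area bounding rectangle of contour i
def getbox(i):
    rect = cv2.minAreaRect(contours[i])
    box = cv2.cv.BoxPoints(rect)  # OpenCV 2.4 API; in OpenCV 3 and later this is cv2.boxPoints(rect)
    box = np.int0(box)
    return box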

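Compare the lengths of two adjacent sides of the minimum-area rectangle.
Use the midpoints of the two short sides as the two endpoints of the line.
Compare all the lines pairwise to filter them.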
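The first two steps above can be sketched in plain Python. This is a minimal illustration rather than the project's code: `box_to_line` is a hypothetical helper name, and it assumes the four corners arrive in order around the rectangle.

```python
import math

def box_to_line(box):
    """Collapse a thin rectangle into the segment joining the midpoints of its two short sides."""
    p0, p1, p2, p3 = box  # four (x, y) corners, in order around the rectangle

    def length(a, b):
        return math.hypot(b[0] - a[0], b[1] - a[1])

    def midpoint(a, b):
        return ((a[0] + b[0]) / 2.0, (a[1] + b[1]) / 2.0)

    # p0-p1 and p1-p2 are adjacent sides; opposite sides have equal length.
    if length(p0, p1) < length(p1, p2):
        return midpoint(p0, p1), midpoint(p2, p3)  # p0-p1 and p2-p3 are the short sides
    return midpoint(p1, p2), midpoint(p3, p0)      # p1-p2 and p3-p0 are the short sides

# A 100 x 4 rectangle, i.e. a thick horizontal line (made-up coordinates)
box = [(0, 0), (100, 0), (100, 4), (0, 4)]
print(box_to_line(box))
```

Reducing each rectangle to a segment makes the later pairwise comparison a matter of comparing lengths and directions of line segments.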

        # Three lines determine the positions of the table header and the table footer
        line_upper, line_lower = findhead(line[2],line[1],line[0])

        # The header and footer determine the position of the target region

        # Use the anti-commutativity of the cross product to determine the starting point
        total_width = line_upper[1]-line_upper[0]
        total_hight = line_lower[0]-line_upper[0]
        cross_prod = cross(total_width, total_hight)
        if cross_prod <0:
            temp = line_upper[1]
            line_upper[1] = line_upper[0]
            line_upper[0] = temp
            temp = line_lower[1]
            line_lower[1] = line_lower[0]
            line_lower[0] = temp
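The orientation check relies on the sign of the 2D cross product, which flips when the two vectors are swapped. Here is a minimal sketch of that idea. The helper `orient` and the sample coordinates are made up for illustration, and `cross` is only assumed to behave like the project's function of the same name.

```python
def cross(u, v):
    """z-component of the cross product of two 2D vectors."""
    return u[0] * v[1] - u[1] * v[0]

def orient(line_upper, line_lower):
    """Order the endpoints of both lines so that cross(width, height) is not negative."""
    width = (line_upper[1][0] - line_upper[0][0], line_upper[1][1] - line_upper[0][1])
    height = (line_lower[0][0] - line_upper[0][0], line_lower[0][1] - line_upper[0][1])
    if cross(width, height) < 0:
        # Swap the endpoints of both lines so they run in the same direction.
        line_upper = [line_upper[1], line_upper[0]]
        line_lower = [line_lower[1], line_lower[0]]
    return line_upper, line_lower

# Both lines were detected right-to-left; orient() flips them to left-to-right.
upper, lower = orient([(10, 0), (0, 0)], [(10, 5), (0, 5)])
print(upper, lower)
```

Building new lists instead of swapping in place avoids surprises if the endpoints are stored in a NumPy array, where row slices are views.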

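Because a photo can never be taken perfectly straight, a perspective transform is applied.

# Use a perspective transform to map the table region to a 1000*760 image
PerspectiveMatrix = cv2.getPerspectiveTransform(points,standard)

self.PerspectiveImg = cv2.warpPerspective(self.img,PerspectiveMatrix, (1000, 760))

The result is always an image of the same 1000*760 size, so it can be cropped into individual cells. Each cell is then passed to the OCR library, which yields the training data.
Because the character set of a blood test report is limited, you can specify your own dictionary when recognizing with Tesseract, which makes recognition faster and helps accuracy.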
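One way to restrict what Tesseract may output is a character whitelist passed through pytesseract's `config` argument. The sketch below is an assumption about how this could be done, not necessarily what the project did: `whitelist_config` and `ocr_numeric_cell` are hypothetical helpers, and whitelist support depends on the Tesseract version and engine in use.

```python
def whitelist_config(chars):
    """Build a Tesseract config string that restricts recognition to `chars`."""
    return "-c tessedit_char_whitelist=" + chars

def ocr_numeric_cell(cell_image):
    """OCR one cropped cell that is expected to contain a number."""
    import pytesseract  # imported lazily so whitelist_config works without it installed
    text = pytesseract.image_to_string(cell_image, config=whitelist_config("0123456789."))
    return text.strip()

print(whitelist_config("0123456789."))
```

A numeric cell needs only digits and the decimal point, so anything else Tesseract might guess is ruled out up front.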

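Deep Learning and Machine Learning

This is the data processing part of the project. My work here mainly consisted of:
- Cleaning and organizing the training data
- Studying the principles of deep learning (although I still cannot explain what each layer represents) and using keras to implement a 3-layer neural network that predicts sex and age
- Logistic regression (predicting age and sex with scikit-learn).

Data Preparation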

# Table 1 (男 = male, 女 = female)
sex age checkdate   shelfID
男   1   5/7/2013    1
男   1   12/11/2013  1
男   1   12/13/2013  1
男   1   1/15/2014   1

# Table 2
itemid  value1  checkdate   shelfid
1       5.2     1/11/2013   91
2       4.53    1/11/2013   91
3       138     1/11/2013   91
4      0.402    1/11/2013   91
5         89    1/11/2013   91

# Drop items with missing or malformed data, keeping only the 26 items
for row in csv_file2_object:
    if len(row[1])<10 and int(row[0])<=26:
        data_2.append(row)
data2=np.array(data_2)
# Sort the rows by the itemid column
col=0
data2=data2[np.argsort(data2[:,col])]
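One thing worth checking in the sort above: rows read from a CSV file hold strings, and sorting strings is lexicographic, so "10" comes before "2". Whether this matters for the project depends on how the columns are consumed later, which the excerpt does not show. The sketch below uses made-up rows to show the difference in plain Python.

```python
# Made-up (itemid, value) rows, as strings, the way a CSV reader returns them
rows = [["10", "7.6"], ["2", "4.53"], ["1", "5.2"]]

as_text = sorted(rows, key=lambda r: r[0])         # lexicographic: "1" < "10" < "2"
as_number = sorted(rows, key=lambda r: int(r[0]))  # numeric: 1 < 2 < 10

print([r[0] for r in as_text])
print([r[0] for r in as_number])
```

Converting the key to an integer before sorting keeps item 2 ahead of item 10.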

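# Join the two tables and drop records that do not have the full set of items.
# i is the running record id and must be initialised before the loop.
for row in csv_file1_object:
    # Values of all items with the same checkdate and shelfid as this person
    right_only_stats = data2[(data2[:,2]==row[2]) & (data2[:,3]==row[3]), 1]
    # Prepend id, sex and age
    right_only_stats = np.insert(right_only_stats, 0, values=i, axis=None)
    i = i + 1
    right_only_stats = np.insert(right_only_stats, 1, values=row[0], axis=None)
    right_only_stats = np.insert(right_only_stats, 2, values=row[1], axis=None)
    # 3 leading columns + 26 items = 29
    if len(right_only_stats)==29:
        csv_file3_object.writerow(right_only_stats)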
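The same join can be sketched without NumPy by indexing the item table on (checkdate, shelfid). This is an illustrative alternative, not the project's code: `join_reports` is a hypothetical name, and unlike the loop above it numbers only the rows it keeps.

```python
from collections import defaultdict

def join_reports(people, items, n_items=26):
    """people: (sex, age, checkdate, shelfid) rows; items: (itemid, value, checkdate, shelfid) rows."""
    by_visit = defaultdict(list)
    for itemid, value, checkdate, shelfid in items:
        by_visit[(checkdate, shelfid)].append((int(itemid), value))
    joined = []
    for sex, age, checkdate, shelfid in people:
        # Item values for this visit, ordered by numeric item id
        values = [v for _, v in sorted(by_visit.get((checkdate, shelfid), []))]
        if len(values) == n_items:  # keep only complete reports
            joined.append([len(joined) + 1, sex, age] + values)
    return joined

# Tiny made-up example with 2 items per report instead of 26
people = [("F", "6", "1/11/2013", "91"), ("M", "9", "1/12/2013", "92")]
items = [("2", "4.53", "1/11/2013", "91"), ("1", "5.2", "1/11/2013", "91"),
         ("1", "6.0", "1/12/2013", "92")]
print(join_reports(people, items, n_items=2))
```

Building the index once avoids scanning the whole item table for every person.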

# Data format (男 = male, 女 = female)
id  sex age WBC RBC HGB HCT MCV MCH MCHC    RDW PLT MPV PCT PDW LYM LYM%    MON MON%    NEU NEU%    EOS EOS%    BAS BAS%    ALY ALY%    LIC LIC%
1   女   6   5.2 7.6 0.176   12.2    2.79    53.6    0.7 13.5    1.41    27.8    0.05    4.93    0.1 0.08    1.6 0.11    2.2 0.06    1.2 138 0.409   83  28  337 11.8    233
2   女   8   11.2    7.7 0.235   12.2    2.47    22.1    1.1 9.8 7.47    66.7    0.08    4.62    0.7 0.08    0.7 0.09    0.8 0.23    2.1 127 0.376   81  27.5    338 11.6    306
3   男   9   15  6.8 0.292   9.5 5.15    34.4    1.29    8.6 8.11    54.2    0.27    4.41    1.8 0.15    1   0.17    1.1 0.36    2.4 121 0.348   79  27.5    348 8.5 431
4   女   9   8.9 7.2 0.225   9.2 2.84    31.8    1.09    12.2    4.88    54.7    0.06    4.12    0.7 0.05    0.6 0.06    0.7 0.18    2.1 121 0.355   86  29.3    340 10  314
5   女   10  3.7 7.3 0.271   11  1.47    39.5    0.34    9.1 1.78    47.8    0.11    5.06    3   0.02    0.6 0.03    0.7 0.02    0.6 139 0.417   82  27.6    335 12.4    371
6   男   20  10.4    8.1 0.267   13  3.07    29.6    0.75    7.2 6.05    58.3    0.42    4.91    4   0.09    0.9 0.09    0.8 0.14    1.4 144 0.417   85  29.4    346 10.2    331

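Logistic Regression

Logistic regression solves classification problems. In its basic form it handles only two classes; an N-class problem needs N classifiers. The details are not covered here.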
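To make the "N classifiers" idea concrete, here is a minimal one-vs-rest sketch in plain Python: each class gets its own binary logistic model, and the class whose model is most confident wins. The labels, weights and biases are invented for illustration and have nothing to do with the trained models in this project.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(weights, bias, x):
    """Probability of the positive class under one binary logistic model."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)

def predict_one_vs_rest(models, x):
    """models maps label -> (weights, bias); return the label with the highest probability."""
    return max(models, key=lambda label: predict_proba(*models[label], x=x))

# Three made-up binary classifiers over a single feature
models = {"child": ([-1.0], 0.0), "adult": ([1.0], 0.0), "senior": ([2.0], -3.0)}
print(predict_one_vs_rest(models, [1.0]))
print(predict_one_vs_rest(models, [5.0]))
```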
LR_age.py and LR_sex.py
The logistic regression models for age and sex prediction are implemented with the scikit-learn Python library.

    with open('dataset/train.csv', 'rb') as myFile1:
        lines1 = cv.reader(myFile1)
        head1 = lines1.next()
        del head1[0:3]
        # print type(head1)
        # print head1;
        for line in lines1:
            # For the training data the first three columns are dropped; the second column (sex) is the label: 0 = female, 1 = male
            if line[1].decode("gbk").encode("utf-8") == '女':
                y_train.append(0)
            else:
                y_train.append(1)
            x_train.append(line)
        x_train = np.array(x_train)
        x_train = np.delete(x_train, np.s_[0:3], 1)
        y_train = np.array(y_train)
    with open('dataset/predict.csv', 'rb') as myFile2:
        lines2 = cv.reader(myFile2)
        head2 = lines2.next()
        for line in lines2:
            if line[1].decode("gbk").encode("utf-8") == '女':
                y_test.append(0)
            else:
                y_test.append(1)
            x_test.append(line)
        x_test = np.array(x_test)
        x_test = np.delete(x_test, np.s_[0:3], 1)
        y_test = np.array(y_test)
    # Normalization
    # min_max_scaler = preprocessing.MaxAbsScaler()
    min_max_scaler = preprocessing.MinMaxScaler()
    # Note: each column is re-fit separately on the test set and on the training set,
    # so the two sets are scaled with different ranges
    for i in range(26):
        x_test[0:, i] = min_max_scaler.fit_transform(x_test[0:, i])
        x_train[0:, i] = min_max_scaler.fit_transform(x_train[0:, i])
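The normalization loop fits the scaler on the test set and on the training set independently. A common alternative is to learn the range from the training data only and reuse it for the test data, so both are scaled the same way. The sketch below shows that idea in plain Python with made-up numbers; `fit_min_max` and `apply_min_max` are hypothetical helpers, not scikit-learn functions.

```python
def fit_min_max(column):
    """Learn the scaling range from the training values only."""
    return min(column), max(column)

def apply_min_max(column, lo, hi):
    """Scale values relative to the learned range; test values may fall outside [0, 1]."""
    span = (hi - lo) or 1.0  # guard against a constant column
    return [(v - lo) / span for v in column]

train = [2.0, 4.0, 6.0]
test = [3.0, 8.0]
lo, hi = fit_min_max(train)
print(apply_min_max(train, lo, hi))  # [0.0, 0.5, 1.0]
print(apply_min_max(test, lo, hi))   # [0.25, 1.5]
```

With scikit-learn the equivalent would be to call `fit` on the training matrix and `transform` on both matrices.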

The data is arranged into the required format; the input is a 26-dimensional numpy.array. L1 regularization is used to prevent overfitting. The number of iterations is set to 200, and the model is trained with linear_model.LogisticRegression.fit.

    clf = linear_model.LogisticRegression(C=1.0, penalty='l1', tol=1e-6, max_iter=200)
    x_test = x_test.astype(np.float)
    x_train = x_train.astype(np.float)
    y_train = y_train.astype(np.int)
    clf.fit(x_train, y_train)
    print pd.DataFrame({"columns": head1, "coef": list(clf.coef_.T)})
    predictions = clf.predict(x_test)
    correct_pairs = [(x, y) for (x, y) in zip(y_test, predictions) if x == y]
    # Despite its name, this is the accuracy: the fraction of correct predictions
    precision = float(len(correct_pairs)) / len(x_test)
    print precision

The age test is done the same way, except that when judging accuracy a prediction counts as correct if it differs from the true age by less than 5.

    delta=[x1 - x2 for (x1, x2) in zip (y_test, predictions)]
    correct_indices = [x for x in delta if abs(x)<5]
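The tolerance-based accuracy for age can be wrapped in a small function. This is a minimal sketch with made-up ages; `accuracy_within` is a hypothetical name, and it uses the same strict "less than 5" comparison as the snippet above.

```python
def accuracy_within(y_true, y_pred, tol):
    """Fraction of predictions whose absolute error is strictly below tol."""
    hits = [1 for t, p in zip(y_true, y_pred) if abs(t - p) < tol]
    return float(len(hits)) / len(y_true)

# Absolute errors are 3, 6, 4 and 0, so three of the four predictions count as correct
print(accuracy_within([20, 35, 60, 8], [23, 41, 56, 8], tol=5))  # 0.75
```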

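Logistic regression is a shallow learning method. Compared with deep learning, its advantage is that we can see clearly how each feature relates to the final result, which quickly suggests ideas for feature engineering. In deep learning it is hard to see how the individual items relate to each other, and tuning is difficult. In the sex test, I obtained the influence of each of the 26 features on sex.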

                 coef columns
0               [0.0]     WBC
1   [-0.201537934071]     RBC
2    [-1.56609191156]     HGB
3   [-0.966680870315]     HCT
4               [0.0]     MCV
5    [-1.34449278351]     MCH
6               [0.0]    MCHC
7     [2.67399030546]     RDW
8    [0.554843411227]     PLT
9    [-1.01595391084]     MPV
10   [0.980749596616]     PCT
11   [0.106646248831]     PDW
12              [0.0]     LYM
13              [0.0]    LYM%
14              [0.0]     MON
15              [0.0]    MON%
16    [1.64386143402]     NEU
17              [0.0]    NEU%
18              [0.0]     EOS
19    [7.53530847513]    EOS%
20              [0.0]     BAS
21    [1.30166526719]    BAS%
22              [0.0]     ALY
23              [0.0]    ALY%
24  [-0.185934731798]     LIC
25              [0.0]    LIC%

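These feature weights show that some items have almost no influence on sex, while others have a positive influence. For example, [7.53530847513] EOS%: the larger this feature, the more likely the patient is male. This also matches our common sense.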
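To read the table more easily, the non-zero coefficients can be ranked by absolute value. The sketch below copies the 13 non-zero weights from the table, rounded to four decimals; the other 13 are exactly zero, which is the sparsity that L1 regularization tends to produce.

```python
# Non-zero coefficients from the table above, rounded to 4 decimals
coef = {"RBC": -0.2015, "HGB": -1.5661, "HCT": -0.9667, "MCH": -1.3445, "RDW": 2.6740,
        "PLT": 0.5548, "MPV": -1.0160, "PCT": 0.9807, "PDW": 0.1066, "NEU": 1.6439,
        "EOS%": 7.5353, "BAS%": 1.3017, "LIC": -0.1859}

# Sort by magnitude: the sign gives the direction, the size gives the strength
ranked = sorted(coef.items(), key=lambda kv: abs(kv[1]), reverse=True)
for name, weight in ranked[:5]:
    print(name, weight)
```

Because the features were min-max scaled before training, the magnitudes are roughly comparable across features.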

To judge the state of the model, I use sklearn's learning_curve to get training_score and cv_score, and matplotlib to plot the learning curve.

# Use sklearn's learning_curve to get training_score and cv_score, and matplotlib to plot the learning curve
def plot_learning_curve(estimator, title, X, y, ylim=None, cv=None, n_jobs=1,
                        train_sizes=np.linspace(.05, 1., 20), verbose=0, plot=True):
    """
    Plot the learning curve of the data on a given model.
    Parameters
    ----------
    estimator : the classifier to use.
    title : title of the plot.
    X : input features, numpy array
    y : input target vector
    ylim : tuple (ymin, ymax) setting the lowest and highest points of the y axis
    cv : number of folds for cross-validation; one fold is the cv set and the other n-1 folds are the training set (default 3)
    n_jobs : number of parallel jobs (default 1)
    """
    train_sizes, train_scores, test_scores = learning_curve(estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes, verbose=verbose)
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    if plot:
        plt.figure()
        plt.title(title)
        if ylim is not None:
            plt.ylim(*ylim)
        plt.xlabel(u"number of train-set")
        plt.ylabel(u"score")
        plt.gca().invert_yaxis()
        plt.grid()
        plt.fill_between(train_sizes, train_scores_mean - train_scores_std, train_scores_mean + train_scores_std,
                         alpha=0.1, color="b")
        plt.fill_between(train_sizes, test_scores_mean - test_scores_std, test_scores_mean + test_scores_std,
                         alpha=0.1, color="r")
        plt.plot(train_sizes, train_scores_mean, 'o-', color="b", label=u"train-set score")
        plt.plot(train_sizes, test_scores_mean, 'o-', color="r", label=u"cv-set score")
        plt.legend(loc="best")
        plt.draw()
        plt.gca().invert_yaxis()
        plt.show()
    midpoint = ((train_scores_mean[-1] + train_scores_std[-1]) + (test_scores_mean[-1] - test_scores_std[-1])) / 2
    diff = (train_scores_mean[-1] + train_scores_std[-1]) - (test_scores_mean[-1] - test_scores_std[-1])
    return midpoint, diff

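Learning curve of the sex model

Learning curve of the age model

The x axis of the curves is the training set size and the y axis is accuracy. As the training set grows, the cross-validation and training accuracies move closer together and end up differing little, which indicates that the model is neither overfitting nor underfitting. For age prediction, the probability of being exactly right is about 0.05; for sex prediction it is about 0.73.
On the 200-sample test set, counting an age prediction as correct when it is within 5 of the true age gives an accuracy of 0.15, while the sex accuracy actually dropped to 0.65. I find this strange: the accuracy on the test set is lower than the cross-validated accuracy.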

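Deep Learning

Age prediction and sex prediction use a three-layer fully connected neural network implemented with keras, the Python interface, on top of Theano. The input layer has 26 dimensions. The first hidden layer has 100 nodes with tanh activation and a dropout rate of 0.4 (to prevent overfitting, 40 percent of the nodes are randomly masked in each training step). The second hidden layer has 100 nodes with the same activation and dropout as the first. The output layer is a 2-class softmax. After 100 epochs with a batch_size of 128 (the weights are updated once per 128 samples), accuracy on the 200-sample test set is around 0.7.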

from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation

batch_size = 128  # values taken from the description above
nb_epoch = 100

model = Sequential()
model.add(Dense(100, input_dim=26,init='uniform',activation='tanh'))
model.add(Dropout(0.4))
model.add(Dense(100,init='uniform',activation='tanh'))
model.add(Dropout(0.4))
# Output layer: 100 units as in the original snippet; the 2-class model described above would need Dense(2) here
model.add(Dense(100,init='uniform'))
model.add(Activation('softmax'))
model.summary()
model.compile(loss='categorical_crossentropy',
              optimizer='Adam',
              metrics=['accuracy'])
model.fit(X_train, Y_train,batch_size=batch_size, nb_epoch=nb_epoch,verbose=1, validation_data=(X_test, Y_test),shuffle=True)
score = model.evaluate(X_test, Y_test, verbose=0)
print('Test score:', score[0])
print('Test accuracy:', score[1])
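Two building blocks of the network above, dropout and softmax, are easy to sketch in plain Python. This is only an illustration of what the layers compute during training, not how keras implements them; the function names and sample inputs are made up.

```python
import math
import random

def dropout(activations, rate, rng):
    """Zero each activation with probability `rate`; scale the survivors so the expected sum is unchanged."""
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

rng = random.Random(0)
masked = dropout([1.0] * 1000, 0.4, rng)
print(masked.count(0.0) / 1000.0)  # close to 0.4
print(softmax([0.0, 0.0]))         # [0.5, 0.5]
```

At prediction time dropout is switched off, so the network uses all of its nodes.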

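The age prediction model is the same as the sex model; with the same network, the result matches logistic regression at 0.15. The sex test gives 0.7.
This shows that the shallow model already fits the data fairly well; when the dataset is small, deep learning performs about the same as shallow learning.

Project Integration

Saving the model

joblib.dump(model, 'model/LR_sex.pkl')

Loading the models for prediction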

import numpy as np
from sklearn.externals import joblib
def predict(arr):
    #arr=list(arr)
    print arr
    # Pad the 22 OCR values with four zeros to reach the 26 features the models expect
    for i in range (4):
        arr.append(0)
    arr=np.array(arr)
    print arr
    print arr.shape
    arr=arr.astype(np.float)
    arr = np.reshape(arr, [1, 26])
    # Load the saved age and sex models and predict on the single sample
    clf_age=joblib.load('model/LR_age.pkl')
    clf_sex = joblib.load('model/LR_sex.pkl')
    age=clf_age.predict(arr)
    sex=clf_sex.predict(arr)
    return age[0],sex[0]

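Note that OCR recognizes only 22 items while the models take 26, so we simply fill the missing items with 0. The idea is that a 0 marks the item as absent and does not introduce noise.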
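The padding step can be isolated in a small helper that also guards against receiving too many values. This is a sketch under the same assumption as the code above, namely that the missing features are the last four columns; `pad_features` is a hypothetical name.

```python
def pad_features(values, n_features=26, fill=0.0):
    """Right-pad a list of OCR values with `fill` up to the model's input width."""
    if len(values) > n_features:
        raise ValueError("got %d values, model expects %d" % (len(values), n_features))
    return [float(v) for v in values] + [fill] * (n_features - len(values))

padded = pad_features(["5.2"] * 22)  # 22 made-up OCR strings
print(len(padded), padded[-4:])
```

Unlike appending inside the prediction function, this leaves the caller's list untouched.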

Project demo

Main page

Uploading a report and the recognition result

Prediction result

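Course Takeaways

At first, when I was asked to start coding straight away, I resisted. Before taking this course I believed you need a solid theoretical foundation before you can do the work well. I now see that this is not the case: combining theory with practice and learning by doing is the most effective way to learn. I had never experienced this in my earlier studies.

Across the whole project I did almost no work on model integration, and for the web system I simply used other classmates' code. This is what I need to learn next.

Machine learning is probably the subject I have studied in which theory and practice are most tightly bound. It may look mostly theoretical, but theory alone is not enough to build a highly accurate model. Theory cannot reveal the relationships hidden in the data; you only gain something by digging into the data itself. When you have an idea, implement it and verify whether it really improves the model. Human learning is a repeated, upward spiral, and so is model training: no model is perfect on the first attempt.

A few things I took away from the course.
First, on implementation: as a novice programmer building a feature, do not get hung up on code quality. Get it working by whatever means; only once it works can you talk about optimization.
Second, act immediately. A good programmer working in a team never runs short of commits. When a new requirement comes up, start on it right away rather than leaving it for tomorrow.
Third, maintain your code. Always write comments; do not explain how the code works, only say what it does. And use git properly: several times when the project got into a mess I deleted my fork and forked again, so manage the repository well.

This course is only the beginning; the road ahead in machine learning is still very long.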