Automatic grading system for human tear films

Abstract

Dry eye syndrome is a prevalent disease which affects a wide range of the population and has a negative impact on their daily activities, such as driving or working with computers. Its diagnosis and monitoring require a battery of tests which measure different physiological characteristics. One of these clinical tests consists of capturing the appearance of the tear film using the Doane interferometer. Once acquired, the interferometry images are classified into one of the five categories considered in this research. The variability in appearance makes the use of a computer-based analysis system highly desirable. For this reason, a general methodology for the automatic analysis and categorization of interferometry images is proposed. The development of this methodology included a deep study based on several techniques for image texture analysis, three color spaces and different machine learning algorithms. The adequacy of this methodology was demonstrated, achieving classification rates over 93 %. Also, it provides unbiased results and allows important time savings for experts.


1 Introduction
Dry eye, resulting from an inadequate tear film, is a common and frequently distressing condition. It affects a relatively large proportion of the population: over 14 % of the over-65 age group in one US study [1], and over 30 % of the same age group in a population of Chinese subjects [2]. Many sufferers will require treatment and the potential cost is significant [3]. Monitoring the effect of the different treatments is, therefore, of great importance in ensuring the maximum benefit to each individual.

Dry eye disease is a group of diseases with different etiologies and different modes of expression [5–7]. It cannot, therefore, be characterized using one single measure, but rather requires a battery of tests, each designed to determine different aspects of the underlying disease.

The tear film is a multi-layer structure, consisting of an outer lipid layer, an intermediate aqueous layer and an inner mucous layer [9, 10]. One aspect of tear film assessment is the evaluation of the lipid layer [11–13]. This layer plays an important role in the retention of the tear film by retarding evaporation [14]. Although it is transparent, interference fringes are created when light rays, reflecting off the anterior surface, interfere with rays which have been reflected by the posterior surface [15]. These fringes may be colored if the incident light is white, or monochromatic if the incident light is monochromatic [11, 13, 16, 17]. With monochromatic light, the resultant image consists of dark bands of destructive interference and light bands of constructive interference. As with contour lines on a map, the number and spacing of these bands depend on the thickness of the lipid layer and the rate at which it changes. With white light, a rainbow effect occurs: constructive interference commensurate with different layer thicknesses results in bright bands of single wavelengths disguising the dark bands of other wavelengths.

The tear film interferometer, developed by Doane [11], originally consisted of a light source and an observation system which captured the appearance of the tear film using a video-based system. This arrangement allowed the dynamic changes, which occur over time, to be recorded.

The instrumentation used for this report utilized a digital PC-attached CMEX-1301 camera [18]. Initially, the method of categorizing the images followed that of Thai et al. [13]. However, the change to a digital camera made a modification of the categories desirable, due to the detail seen in the digital images. A large selection of images was viewed and the different features identified. The most obvious image characteristics, related to the appearance of contour lines, suggest variation in the lipid layer thickness. These were either gray or made up of colored fringes. The presence or absence of disturbances in the tear film was also noted. Taken together, these features suggested groups described in terms of "strong" fringes (colored) and "fine" or "faint" fringes (with very little color differentiation). Both of these categories were sub-divided by whether the pattern was regular across the image or broken up. A further category was termed "debris" (indicating small to medium disturbances within the underlying pattern).

It was immediately evident that the subjective nature of such analysis made it difficult to be consistent in interpreting the images. Although some images, from an individual subject, conformed to a single pattern, it was more common for them to be made up of a combination of different patterns. Furthermore, in the interval between blinks, the thickness of both the lipid and the aqueous elements of the tear film can change fairly rapidly, this being particularly true of the dry eye. This resulted in successive images that differed dramatically from each other. This variability in appearance resulted in major intra- as well as inter-observer variations and unreliable grading, and made the use of a computer-based analysis system highly desirable. To the best knowledge of the authors, there are no other attempts in the literature to develop such an automatic categorization system.

The aim of this paper is to describe the development of an automated analysis system which would accurately and consistently categorize interferometry images. As is usual with other grading scales used in a clinical setting, five categories were chosen for this investigation: strong fringes, coalescing strong fringes, fine fringes, coalescing fine fringes and debris.

2 Research methodology
The methodology presented in this section consists of four stages (see Fig. 1). The first step entails the acquisition of the input image. Then, its region of interest (ROI) is extracted. After that, the texture and color information of the ROI are analyzed and a descriptor of the image is obtained. Finally, the image is assigned to the most appropriate category, as suggested by the resultant descriptor.


2.1 Image acquisition
Input image acquisition was carried out with the Doane interferometer [11] and a digital PC-attached CMEX-1301 camera [18]. The program ImageFocus [19] was used for image capture and images were stored at a spatial resolution of 1280 × 1024 pixels in the RGB color space. Other settings used in the study for the image acquisition include: exposure 4.00, gamma correction 2.00, low sensitivity to avoid bleaching the image, and white light to observe the colored fringes. Multiple still images were taken (approximately 4 per second at the settings used) for up to one minute (in excess of 200 images per subject). For ease of analysis, these could be converted, using the software ImageToAvi [20], to a video format allowing a review of the dynamics of the tear film in approximately 7 seconds for 1 minute of recording time.

Due to the various artifacts associated with image capture (e.g., blinking and ocular movement), many images were unsuitable for analysis (see Fig. 2). It was, therefore, necessary to select an image that was an appropriate representation of the tear film status during the image collection. In this sense, the images were analyzed by an optometrist, who selected those taken shortly after blinking, generally within a couple of seconds of the blink, and at a time when the eye was fully open. Note that it was important to avoid images taken immediately after, or just prior to, the lids meeting during a blink, when the lipid layer is thickened due to being squashed between the two lids. Under these conditions, the lipid layer status would not be representative of its purpose when most needed, i.e., that of protecting the tears from evaporation. Notice also that the images selected by an optometrist are the same ones that specialists analyze by hand.

2.2 Extraction of the region of interest
The input images, as depicted in Fig. 3, include an external area that does not contain relevant information for the classification. Furthermore, due to the shallow depth of field, the most useful part of the image is the central part of the yellowish or greenish area, formed by the anterior surface of the tear film covering the cornea. This forces a preprocessing step aimed at extracting the ROI.
 

The acquisition procedure guarantees that the relevant part of the image is characterized by green or yellow tonalities. Since the input image is acquired in the RGB color space, only a single channel (in this case, green was selected) needs to be considered when identifying the background region. The background is determined by finding those pixels whose gray level is less (i.e., darker) than a calculated threshold, and eliminating these regions from further analysis. This threshold is calculated as:

threshold = mean − p · stdDev    (1)

where mean is the mean value of the gray levels of the image, stdDev is its standard deviation and p is a weight factor. Note that the weight factor was empirically determined to obtain a preliminary ROI mostly free of irrelevant artifacts (p = 0.1).
 

Once this preliminary ROI is identified, its central part has to be located. Some images include other regions that do not contain relevant information for the classification, but are brighter than the threshold value, for example, eyelashes or shadows cast by them. To eliminate these irrelevant regions from further analysis, the morphological operator erosion [21] is applied using an ellipse as the structuring element. Notice that the size of the structuring element was empirically set to ten pixels. Next, the rectangle of maximum area inside the preliminary ROI is located through a completely automatic process. This rectangle is then reduced by a pre-determined percentage.
 

Due to the steps described above, this region is likely to be free of irrelevant regions. However, the size of the eyelashes, in some images, can be big enough to force a last step, which consists of an iterative process to reduce the size of the ROI until no areas of the background remain. Notice that the size of the input images is 1280 × 1024 pixels, and the size of the final ROI is 547 × 578 pixels on average.
 

Figure 3 shows the extraction of the ROI from two input images step by step. The left side of the figure depicts an example where there are no irrelevant parts or shadows within the image, whereas the right side shows an example where the eyelashes have spoiled the image, therefore making the additional erosion step necessary.
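As an illustration, the thresholding and erosion steps above can be sketched in Python with OpenCV and NumPy (an assumption; the original system was implemented in C++, and the exact maximum-area rectangle search is simplified here to the bounding box of the largest foreground component, shrunk by a fixed percentage):

    import cv2
    import numpy as np

    def extract_roi(image_bgr, p=0.1, shrink=0.1):
        # Only the green channel is needed to separate the background.
        green = image_bgr[:, :, 1].astype(np.float64)
        # Eq. (1): pixels darker than mean - p * stdDev are background.
        threshold = green.mean() - p * green.std()
        mask = (green >= threshold).astype(np.uint8)
        # Erode with a ten-pixel elliptical structuring element to
        # suppress eyelashes and the shadows they cast.
        kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (10, 10))
        mask = cv2.erode(mask, kernel)
        # Simplification: bounding box of the largest component instead
        # of the exact maximum-area inscribed rectangle.
        _, _, stats, _ = cv2.connectedComponentsWithStats(mask)
        largest = 1 + int(np.argmax(stats[1:, cv2.CC_STAT_AREA]))
        x, y, w, h = stats[largest, :4]
        dx, dy = int(w * shrink / 2), int(h * shrink / 2)
        return image_bgr[y + dy:y + h - dy, x + dx:x + w - dx]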
 

2.3 Color analysis
Some categories show distinctive color characteristics (see Fig. 4) and, for this reason, images are analyzed in three different ways [22], these being grayscale, the L*a*b* color space, and the RGB color space making use of the opponent color theory. Next, these three options for color analysis are explained.
 

2.3.1 Grayscale
A grayscale image is one in which the only color is gray, represented by different levels from black to white. In this case, less information needs to be provided since it is only necessary to specify a single intensity value for each pixel.
 

To analyze the texture over grayscale images, the three channels of the ROI in RGB have to be converted into only one gray channel (Gr). Next, the gray component is analyzed and the descriptor obtained (see Fig. 5).
 

2.3.2 The L*a*b* color space
The CIE 1976 L*a*b* color space [23] describes all the colors that the human eye can perceive. L*a*b* is a 3D model whose three components represent: (i) the luminance of the color, L*; (ii) its position between magenta and green, a*; and (iii) its position between yellow and blue, b*. This color space is perceptually uniform, which means that a change of the same amount in a color value produces a change of the same visual importance. This characteristic is really important since the specialists' perception is being imitated.
 

Using the L*a*b* color space to extract texture information entails converting the three channels of the ROI in RGB to the three components of L*a*b*. Each component is then analyzed separately and the final descriptor is the concatenation of these descriptors (see Fig. 6).
 

2.3.3 The RGB color space: opponent colors
The RGB color space [24] (RGB) is an additive color space based on the physiology of the eye. It is defined by three chromatic components: the red channel R, the green channel G, and the blue channel B. Despite being one of the most frequently used color spaces for image processing, it is not perceptually uniform. Therefore, the opponent process theory of human color vision, proposed by Hering [25] in the 1800s, is considered. This theory states that the human visual system interprets information about color by processing three opponent channels: red vs. green (RG), green vs. red (GR) and blue vs. yellow (BY). The three opponent channels are defined as [26]:

RG = R − p(G)
GR = G − p(R)
BY = B − p((R + G) / 2)

where p is a Butterworth low-pass filter.


To analyze the texture using opponent colors, the three opponent channels have to be calculated from the ROI in RGB. Next, each channel is analyzed separately and the final descriptor is the concatenation of the individual descriptors (see Fig. 7).
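A minimal sketch of the three color analyses, assuming Python with scikit-image and NumPy (the original implementation was in C++, and the cutoff and order of the low-pass filter p are assumptions):

    import numpy as np
    from skimage.color import rgb2gray, rgb2lab

    def butterworth_lowpass(channel, cutoff=0.1, order=2):
        # Frequency-domain Butterworth low-pass filter; the cutoff is a
        # fraction of the sampling frequency (assumed values).
        u = np.fft.fftfreq(channel.shape[0])[:, None]
        v = np.fft.fftfreq(channel.shape[1])[None, :]
        d = np.hypot(u, v)
        h = 1.0 / (1.0 + (d / cutoff) ** (2 * order))
        return np.real(np.fft.ifft2(np.fft.fft2(channel) * h))

    def color_representations(roi_rgb):
        # roi_rgb is an RGB ROI scaled to [0, 1].
        gray = rgb2gray(roi_rgb)                # Sect. 2.3.1: one channel
        lab = rgb2lab(roi_rgb)                  # Sect. 2.3.2: L*, a*, b*
        r, g, b = (roi_rgb[..., i] for i in range(3))
        p = butterworth_lowpass                 # low-pass operator p
        opponent = np.dstack([r - p(g), g - p(r), b - p((r + g) / 2)])
        return gray, lab, opponent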

 

2.4 Texture analysis
Texture is used to characterize the interference patterns of the five categories. Several techniques for texture analysis could be applied, and so five popular methods were selected based on previous studies [22]. Three of these are signal processing methods: Butterworth filters, Gabor filters and the discrete wavelet transform. The other two methods include a model-based method, Markov random fields, and a statistical method using co-occurrence features.
 

2.4.1 Butterworth filters
Butterworth band-pass filters [21] are frequency domain filters that have an approximately flat response in the pass band, which gradually decays in the stopband. A Butterworth filter is defined as:

B(ω) = 1 / (1 + ((ω − ω_c) / ω_0)^(2n))

where ω is the angular frequency, ω_0 the cutoff frequency, ω_c the center frequency and n the order of the filter.

A bank of Butterworth filters composed of nine second-order filters is used, with band-pass frequencies covering the whole frequency spectrum. The filter bank maps each input image into nine filtered images, one per frequency band. Each filtered image is then normalized separately and its uniform histogram with non-equidistant bins [27] is calculated to obtain the final descriptor.
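The filter bank and histogram descriptor might be sketched as follows (Python/NumPy; the nine center frequencies, the cutoff and the bin count are assumptions, and quantile-based bin edges are one plausible reading of the non-equidistant bins of [27]):

    import numpy as np

    def butterworth_bandpass(image, center, cutoff, order=2):
        # Frequency-domain band-pass response, flat around the center
        # frequency and decaying in the stopband (cf. the equation above).
        u = np.fft.fftfreq(image.shape[0])[:, None]
        v = np.fft.fftfreq(image.shape[1])[None, :]
        d = np.hypot(u, v)
        response = 1.0 / (1.0 + ((d - center) / cutoff) ** (2 * order))
        return np.real(np.fft.ifft2(np.fft.fft2(image) * response))

    def butterworth_descriptor(image, bins=16):
        descriptor = []
        for center in np.linspace(0.05, 0.45, 9):   # nine bands (assumed)
            filtered = butterworth_bandpass(image, center, cutoff=0.03)
            filtered = (filtered - filtered.min()) / np.ptp(filtered)
            # Non-equidistant bin edges: quantiles of the filtered image.
            edges = np.unique(np.quantile(filtered,
                                          np.linspace(0, 1, bins + 1)))
            hist, _ = np.histogram(filtered, bins=edges)
            descriptor.extend(hist / filtered.size)
        return np.asarray(descriptor)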
 

2.4.2 Gabor filters
Gabor filters are complex exponential signals modulated by Gaussians [28]. A 2D Gabor filter [29], using Cartesian coordinates in the spatial domain and polar coordinates in the frequency domain, is defined as:

g(x, y) = exp(−π[a²(x − x₀)_r² + b²(y − y₀)_r²]) · exp(j2πf₀(x cos θ₀ + y sin θ₀))

where (·)_r denotes rotation by the angle θ₀; a and b model the shape of the filter, while (x₀, y₀) and (f₀, θ₀) represent the locations in the spatial and frequency domains, respectively.
 

A bank of 16 Gabor filters, centered at 4 frequencies and 4 orientations, is used. The filter bank maps each input image to 16 filtered images, one per frequency–orientation pair. Using the same idea as in the Butterworth filters, each filtered image is normalized and then its uniform histogram with non-equidistant bins [27] is computed to generate the final descriptor.
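A sketch of the Gabor bank using scikit-image (the concrete frequencies are assumptions; scikit-image's gabor returns the real and imaginary responses, whose magnitude is used here):

    import numpy as np
    from skimage.filters import gabor

    def gabor_descriptor(image, bins=16):
        descriptor = []
        for frequency in (0.05, 0.1, 0.2, 0.4):          # 4 frequencies
            for theta in np.arange(4) * np.pi / 4:       # 4 orientations
                real, imag = gabor(image, frequency=frequency, theta=theta)
                magnitude = np.hypot(real, imag)
                magnitude = (magnitude - magnitude.min()) / np.ptp(magnitude)
                edges = np.unique(np.quantile(magnitude,
                                              np.linspace(0, 1, bins + 1)))
                hist, _ = np.histogram(magnitude, bins=edges)
                descriptor.extend(hist / magnitude.size)
        return np.asarray(descriptor)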
 

2.4.3 The discrete wavelet transform
The discrete wavelet transform [30] generates a set of wavelets by scaling and translating a mother wavelet ψ, which can be represented in 2D as [31]:

ψ_{a,b}(x, y) = (1 / √(a_x a_y)) ψ((x − b_x) / a_x, (y − b_y) / a_y)

where a = (a_x, a_y) governs the scale and b = (b_x, b_y) the translation of the function. The values of a and b control the band-pass of the filter, generating high-pass (H) or low-pass (L) filters.
 

The wavelet decomposition of an image consists of applying these wavelets horizontally and vertically, generating 4 subimages (LL, LH, HL, HH) which are then subsampled by a factor of 2. After the decomposition of the input image, the process is repeated n − 1 times over the LL subimage, where n is the number of scales.
 

The descriptor of an input image is constructed by computing the mean and the absolute average deviation of the input and LL images, and the energy of the LH, HL and HH images [32]. Notice that the Haar and Daubechies [33] wavelets were chosen as mother wavelets for this paper.
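With PyWavelets, this descriptor can be sketched as follows (the energy normalization by subimage size is an assumption):

    import numpy as np
    import pywt

    def wavelet_descriptor(image, wavelet="haar", scales=4):
        def mean_aad(x):
            # Mean and absolute average deviation.
            return [np.mean(x), np.mean(np.abs(x - np.mean(x)))]

        features = mean_aad(image)
        ll = image
        for _ in range(scales):
            ll, (lh, hl, hh) = pywt.dwt2(ll, wavelet)
            # Energy of each detail subimage.
            features += [np.sum(s ** 2) / s.size for s in (lh, hl, hh)]
        features += mean_aad(ll)
        return np.asarray(features)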
 

2.4.4 Markov random fields
Markov random fields (MRF) generate a texture model by expressing the gray values of each pixel in an image as a function of the gray values in a neighborhood of the pixel. The Markovian process for textures was modeled using a Gaussian Markov random field (GMRF) defined as [34]:

X(c) = Σ_m β_{c,m} [X(c + m) + X(c − m)] + e_c

where X(c) is the gray value of a pixel c on an N × M image with c = 1, 2, ..., N × M; m is an offset from the center cell c; β_{c,m} are the parameters that weight a pair of symmetric neighbors of the center cell; and e_c is zero-mean Gaussian distributed noise. The β coefficients describe the Markovian properties of the texture and the spatial interactions among pixels, and can be estimated using a least squares fitting [32].
 

To generate the descriptor of an input image, the directional variances proposed by Çesmeli and Wang [35] are calculated from the β coefficients. Note that the neighborhood of a pixel is defined as the set of pixels within a Chebyshev distance d.
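A least-squares estimate of the β coefficients can be sketched as follows (Python/NumPy; the offsets shown correspond to a Chebyshev distance d = 1 and are an assumption):

    import numpy as np

    def gmrf_coefficients(image, offsets=((0, 1), (1, 0), (1, 1), (1, -1))):
        pad = max(max(abs(di), abs(dj)) for di, dj in offsets)
        target = image[pad:-pad, pad:-pad].ravel()
        columns = []
        for di, dj in offsets:
            # X(c + m) + X(c - m) for the symmetric pair of neighbors m.
            plus = np.roll(image, (-di, -dj), axis=(0, 1))
            minus = np.roll(image, (di, dj), axis=(0, 1))
            columns.append((plus + minus)[pad:-pad, pad:-pad].ravel())
        design = np.column_stack(columns)
        beta, *_ = np.linalg.lstsq(design, target, rcond=None)
        return beta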
 

2.4.5 Co-occurrence features
Co-occurrence features analysis was introduced by Haralick et al. [36], and is based on the computation of the conditional joint probabilities of all pairwise combinations of gray levels. For a given distance d and an orientation θ, this method generates a set of gray level co-occurrence matrices and extracts several statistics from their elements P_{θ,d}(i, j). Notice that a total of four orientations must be considered for a distance d = 1 (0°, 45°, 90° and 135°), and so four matrices are generated. For a distance d > 1, the number of orientations increases and, therefore, so does the number of matrices. Concretely, the number of orientations for a distance d is 4d. As an example, Fig. 8 depicts the orientations considered for two representative distances.
 

From each co-occurrence matrix, a set of 14 statistics proposed by Haralick et al. [36] is computed. These statistical measures represent features such as homogeneity or contrast. The descriptor of an input image consists of two properties of these statistics, their mean and their range across matrices. Thus, a total of 28 features per distance is obtained. As in the Markov random fields method, the Chebyshev distance is used.
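Gray level co-occurrence matrices are available in scikit-image, which implements a subset of the 14 Haralick statistics; a sketch of the mean/range descriptor for d = 1:

    import numpy as np
    from skimage.feature import graycomatrix, graycoprops

    def cooccurrence_descriptor(image_u8, distance=1):
        # Four orientations for d = 1, as in Fig. 8.
        angles = [0, np.pi / 4, np.pi / 2, 3 * np.pi / 4]
        glcm = graycomatrix(image_u8, distances=[distance], angles=angles,
                            levels=256, symmetric=True, normed=True)
        features = []
        for prop in ("contrast", "homogeneity", "energy", "correlation"):
            values = graycoprops(glcm, prop).ravel()  # one value per angle
            # Mean and range across the co-occurrence matrices.
            features += [values.mean(), np.ptp(values)]
        return np.asarray(features)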
 

2.5 Classification
Supervised machine learning is one of the tasks most frequently carried out by so-called intelligent systems. Thus, a large number of techniques have been developed based on artificial intelligence and statistics. The goal of supervised learning is to construct a classifier that can correctly predict the classes of new samples, given training samples of old objects.
 

The next sections describe the four very popular machine learning algorithms considered in this research, which are needed to carry out the final step of the methodology. These methods were selected based on [38] to provide different approaches to the learning process.
 

2.5.1 Naive Bayes
Naive Bayes (NB) [39] is a statistical learning algorithm, based on Bayes' theorem and the maximum a posteriori hypothesis, that can predict class membership probabilities. During the training process, the a posteriori probabilities of each class are calculated according to Bayes' theorem:

P(c_j | X) = P(X | c_j) P(c_j) / P(X)

where c_j is a class and X is a sample. P(a | b) is the a posteriori probability of a conditioned on b, and P(a) is the a priori probability of a.
 

Given a sample X, the trained classifier will predict that X belongs to the class which has the highest a posteriori probability conditioned on X. That is, X is predicted to belong to the class c_i if and only if:

P(c_i | X) > P(c_j | X) for all j ≠ i

where the class c_i is called the maximum a posteriori hypothesis.
 

This classifier greatly simplifies learning by assuming that features are independent for a given class. Although independence cannot generally be assumed, in practice this algorithm competes well with more sophisticated classifiers [40]. Thus, its main advantages are that it is simple and fast, but its weakness lies in that it cannot learn interactions between features.
 

2.5.2 Random tree
Random tree (RT) [41] is a tree randomly constructed from a set of possible trees having K random features at each node. In this context, "at random" means that each tree in the set of trees has an equal chance of being sampled.
 

To construct a random tree, all its nodes are associated with rectangular cells such that, at each step of the construction, the collection of cells associated with the leaves forms a partition of [0, 1]^d. The root of the tree is [0, 1]^d itself. The following procedure is then repeated ⌈log₂ k_n⌉ times, where log₂ is the base-2 logarithm, ⌈·⌉ is the ceiling function, and k_n ≥ 2 a deterministic parameter. The procedure is as follows: at each node, a coordinate of X = (X^(1), ..., X^(d)) is selected, with the j-th feature having the probability p_{nj} ∈ (0, 1) of being selected; next, the split is at the midpoint of the chosen side.
 

Notice that a randomized tree r_n(X, Θ), where Θ is a randomizing variable, outputs the average over all Y_i for which the corresponding vectors X_i fall in the same cell of the random partition as X. Note also that the main advantage of random trees is that they can be generated efficiently.
 

2.5.3 Random forest
Random forest (RF) [42] is an effective tool for predictive tasks, formed by a combination of tree predictors. Formally, it can be defined as a classifier which consists of a collection of tree-structured classifiers {h(x, Θ_k), k = 1, ...}, where the {Θ_k} are independent identically distributed random vectors and each tree casts a unit vote for the most popular class at input x.
 

Given an ensemble of classifiers h₁(x), h₂(x), ..., h_k(x), and with the training set randomly drawn from the distribution of the random vector Y, X, the margin function can be defined as follows:

mg(X, Y) = av_k I(h_k(X) = Y) − max_{j≠Y} av_k I(h_k(X) = j)

where I(·) is the indicator function and av_k denotes the average over k. The margin measures the extent to which the average number of votes at X, Y for the right class exceeds the average vote for any other class. Thus, the larger the margin, the more confidence in the classification.
 

The generalization error is given by:

PE* = P_{X,Y}(mg(X, Y) < 0)    (11)

where the subscripts X, Y indicate that the probability is over the X, Y space.
 

The main advantage of this method is that, according to the strong law of large numbers and the tree structure, it does not overfit as more trees are added; instead, it produces a limiting value of the generalization error [42].
 

2.5.4 Support vector machine
Support vector machine (SVM) [43] is based on statistical learning theory and revolves around the notion of a "margin" on either side of a hyperplane that separates two classes. If the training data are linearly separable, then a hyperplane that separates the two classes can be defined as:

w · x + b = 0

where x are the samples, w is the normal to the hyperplane and b/||w|| is the perpendicular distance from the hyperplane to the origin. The aim of SVMs is to orientate this hyperplane in such a way as to be as far as possible from the closest members of both classes, which means selecting the variables w and b so that the training data can be described by:

x_i · w + b ≥ +1 for y_i = +1
x_i · w + b ≤ −1 for y_i = −1

where x_i is the i-th sample and y_i its class. From all the possible hyperplanes, SVMs try to find the one that maximizes the margin.
 

Most real world problems involve non-separable data. Consequently, there is no hyperplane that successfully separates the two classes. In this case, the idea is to map the input data onto a higher dimensional space and define a separating hyperplane there. This higher dimensional space is called the transformed feature space, and it is obtained using kernel functions.
 

SVM training necessarily reaches a global minimum and avoids ending in a local minimum, which may happen with other algorithms. SVMs avoid problems of overfitting and, with an appropriate kernel, they can work well even if the data are not linearly separable.
 

3 Materials and methods
The aim of this work is to present a methodology for tear film classification based on color texture analysis, using the Doane interferometer as the image acquisition instrument. The materials and methods used in this work are described in this section.
 

3.1 Dataset
The grading scale used in this research is composed of five lipid layer categories: strong fringes, coalescing strong fringes, fine fringes, coalescing fine fringes and debris. See the representative images in Table 1 as examples of the five categories considered, together with a description of each of them.
 

In this research, a clinical dataset [44] provided by experts is used to test the proposed methodology. All images in this dataset have been annotated by two different optometrists from the Department of Life Sciences, Glasgow Caledonian University (Glasgow, UK). The acquisition of these images was carried out according to the description presented in Sect. 2.1, using the Doane interferometer and a digital PC-attached CMEX-1301 camera.
 

The dataset contains examples of real interferometric images and is used to compute the performance of the algorithms. It is composed of 106 images from patients with average age 55 ± 16, and includes samples from all of the categories: 11 strong fringes, 25 coalescing strong fringes, 30 fine fringes, 26 coalescing fine fringes and 14 debris images. Computer images were selected, based on quality and available detail, from a large database of images of dry eye subjects' tear films. No personally identifying characteristics were saved with the images used for the development of the computer analysis system.
 

3.2 Experimental procedure
The experimental procedure is detailed as follows:
1. Apply the three color analysis techniques and the five texture analysis methods to the dataset, using different configurations of parameters.
2. Train the four machine learning algorithms per combination of texture extraction method, color space and a subset of parameter configurations.
3. Statistically analyze the classifiers and select the most competitive one.
4. Evaluate the effectiveness of the whole set of parameter configurations for each pair of texture–color analysis, in terms of the accuracy of the most competitive classifier.
 

Experimentation was performed on an Intel Core i5 760 CPU @ 2.80 GHz with 4 GB of RAM. C++ was the programming language used to implement the proposed methodology and to perform the experimental procedure. SVMs with a radial basis kernel and automatic parameter estimation were considered [45]. Regarding the parameters of the other three classifiers, the configurations provided by Weka [46] were used: normal distribution for the numerical attributes when using NB, and no maximum depth of the trees when using both RT and RF. Notice that a tenfold cross-validation [47] was used, so the average error across all 10 trials is computed for each case.
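The original experiments were implemented in C++ with Weka-style classifier settings; purely as an illustration, the evaluation loop could be approximated in Python with scikit-learn (an assumption, with DecisionTreeClassifier(splitter="random") standing in for Weka's Random Tree):

    from sklearn.model_selection import cross_val_score
    from sklearn.naive_bayes import GaussianNB
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    classifiers = {
        "NB": GaussianNB(),
        "RT": DecisionTreeClassifier(splitter="random"),
        "RF": RandomForestClassifier(n_estimators=100),
        # gamma="scale" loosely plays the role of automatic parameter
        # estimation for the RBF kernel.
        "SVM": make_pipeline(StandardScaler(),
                             SVC(kernel="rbf", gamma="scale")),
    }

    def evaluate(X, y):
        # Tenfold cross-validated accuracy, averaged over the 10 trials,
        # for each classifier (X: texture descriptors, y: categories).
        return {name: cross_val_score(clf, X, y, cv=10).mean()
                for name, clf in classifiers.items()}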

 

4 Results and discussion
In this section, the results obtained with each texture analysis method, each color space and each machine learning algorithm will be analyzed in terms of percentage accuracy. This measure represents the percentage of images that are correctly classified according to their category. Next, the experiments performed to select the most competitive machine learning algorithm, and the best methods for color texture analysis, are explained.
 

4.1 Machine learning algorithms
The objective here is to determine which machine learning algorithm performs best for the problem at hand. One of the possible methods to deal with this issue is to compare the performance of the four machine learning algorithms by performing a statistical analysis to find significant differences. For the sake of simplicity, only a subset of the parameter configurations of the texture analysis techniques was considered when analyzing the behavior of the different classifiers. After selecting the most competitive one, all the combinations of texture extraction methods and color spaces will be analyzed in depth.


If there are only two classifiers to compare, the mean accuracy can be compared by means of the paired t test [48] or the Wilcoxon test [49]. However, if the number of algorithms is three or more, it is not appropriate to compare each pair of models using these tests. The reason is that the likelihood of incorrectly detecting a significant difference increases with the number of comparisons. In this sense, the ANOVA test [50] is one of the most common statistical methods for testing the differences between more than two related sample means. Its main problem is that it is based on assumptions which are not always met, for example, normal distribution of the data. Instead of the ANOVA test, the use of the Friedman test is proposed in this research, following Demšar [51].
 

The Friedman test [52] is a non-parametric equivalent of the repeated-measures ANOVA, which ranks the algorithms for each dataset separately. If it rejects the null hypothesis, that all population means are equal, then a post hoc test has to be applied to determine which ones are significantly different. For this second task, the Nemenyi test [53], a multiple comparison procedure that tests all means pairwise, is used to compare all the classifiers to each other.
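For illustration, the Friedman test is available in SciPy, and a Nemenyi post hoc test in the third-party scikit-posthocs package (whose use here is an assumption; the data below are random stand-ins, not results from the paper):

    import numpy as np
    from scipy.stats import friedmanchisquare
    import scikit_posthocs as sp

    # Rows: parameter configurations; columns: NB, RT, RF, SVM.
    rng = np.random.default_rng(0)
    accuracy = rng.uniform(0.7, 0.9, size=(9, 4))  # toy stand-in data

    stat, p_value = friedmanchisquare(*accuracy.T)
    if p_value < 0.05:
        # Pairwise p values for the Nemenyi comparisons.
        print(sp.posthoc_nemenyi_friedman(accuracy))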
 

The results of all the experiments performed are presented in tables in terms of percentage accuracy. From top to bottom, each cell shows the results obtained in grayscale, L*a*b* and opponent colors. The best results for each color space are highlighted.
 

The first experiment was performed using Butterworth filters in order to analyze the 9 frequency bands separately (see Table 2). The Friedman test rejected the null hypothesis, which means that there are significant differences between some classifiers. In grayscale, the Nemenyi test concluded that there are significant differences between SVM and NB, but not among SVM, RT and RF; so SVM, RT and RF are the best classifiers in this case. Regarding L*a*b* and opponent colors, the multiple comparison test concluded that there are significant differences among SVM, NB and RT, but not between SVM and RF. Thus, SVM and RF are statistically better than the other algorithms in these two color spaces.
 

The second experiment analyzes Gabor filters using nine different histogram sizes, from 3 to 19 bins (see Table 3). In this case, the Friedman test also rejected the null hypothesis, which means that there are significant differences between some classifiers. Independently of the color space considered, the multiple comparison test concluded that there are significant differences among SVM, NB and RT, but not between SVM and RF. Thus, SVM and RF are statistically better than the other algorithms when using Gabor filters.
 

The third experiment analyzes the discrete wavelet transform using 8 scales and the Haar algorithm as the mother wavelet (see Table 4). In the three color spaces, the Friedman test rejected the null hypothesis and the Nemenyi test concluded exactly the same: SVM and RF are significantly better than the other two classifiers considered.
 

The fourth experiment consisted in analyzing the Markov random fields method for 10 different neighborhoods or distances (see Table 5). According to the Friedman test, there are significant differences among the four classifiers. Once again, the Nemenyi test concluded that there are significant differences among SVM, NB and RT, but not between SVM and RF. Thus, SVM and RF are statistically better than the other algorithms.
 

Finally, the last experiment analyzes the co-occurrence features method and considers 17 distances separately. For clarity, Table 6 only shows the intermediate distances, since they perform better in the problem at hand. The Friedman test rejected the null hypothesis that all population means are equal, and so the Nemenyi test was performed in the three color spaces. Regarding grayscale images and the L*a*b* color space, the Nemenyi test concludes that there are significant differences among SVM, RT and RF, but not between SVM and NB. Thus, SVM and NB are statistically better than the other algorithms in these two color spaces. For opponent colors, the Nemenyi test concluded that there are significant differences among SVM, NB and RT, but not between SVM and RF. Thus, SVM and RF are statistically better than the other two classifiers in this last color space.
 

As a summary, Table 7 shows the most competitive classifiers for each texture extraction method in the three color spaces, according to the experiments previously presented. Analyzing these results, it can be concluded that the SVM is the only classifier which appears among the most competitive methods in all the combinations of texture and color analysis techniques. Furthermore, it outperforms the other three classifiers and produces the highest accuracies, independently of the texture extraction method and the color space. This outperformance occurs because the SVM fits the boundaries between classes better.
 

4.2 Color and texture analysis
Once the machine learning algorithms were analyzed, and the SVM was chosen as the most competitive method for the problem at hand, the color and texture analysis techniques were evaluated in depth using the whole set of parameter configurations. For this task, the metric known as F-measure is also considered [54]. Notice that this metric represents the harmonic mean of precision and recall, and will be used when summarizing the best parameter configurations of each texture analysis technique. It should be highlighted that the F-measure is a good metric when there is some imbalance among the classes, as in this case.
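The F-measure can be computed with scikit-learn; the weighted averaging over the five classes below is an assumption, since the paper does not state how the per-class values are combined:

    from sklearn.metrics import f1_score

    # Toy labels for illustration only: 0-4 encode the five categories.
    y_true = [0, 1, 2, 3, 4, 2, 3, 1]
    y_pred = [0, 1, 2, 3, 3, 2, 3, 1]
    # Weighted averaging accounts for the class imbalance.
    print(f1_score(y_true, y_pred, average="weighted"))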
 

The experiment using Butterworth filters was performed on grayscale, L*a*b* and opponent colors, and analyzes each frequency band separately as well as in combination. To combine the adjacent frequency bands, their individual descriptors were concatenated. This experiment helps to decide which color space and frequency bands are more appropriate for the problem at hand. Figure 9 shows the results obtained, in terms of percentage accuracy, for all the frequency band concatenations. As can be seen, the lower frequencies are more discriminative than the intermediate and higher ones. In grayscale, the results achieved are over 70 % correct classifications in most cases, whilst the best combinations provide classification rates higher than 85 %. Regarding the L*a*b* color space, it does not outperform grayscale, producing the same best results of over 85 %. Finally, it can be seen how color information, using opponent colors, improves the accuracy of the method compared to grayscale. In this case, the accuracy is almost 90 % for the best combination of frequency bands. In general terms, the results are quite stable, since there is a wide range of frequency band combinations where the accuracy rates are over 80 %. As a summary, Table 8 shows the best parameter configuration for each color space, and its performance.
 

The next experiment focused on Gabor filters and consisted of using a different number of bins to create the uniform histogram which defines the descriptor. In this case, histograms from 3 to 19 bins were tested, the analysis being the same as that used to compare the different machine learning algorithms. Figure 10 shows the results in grayscale, L*a*b* and opponent colors using only the SVM classifier. As can be seen, the number of bins which produces better results depends on the color space considered. Opponent colors provide their maximum accuracy, over 83 %, using only 3 bins. However, this accuracy is improved using a greater number of bins, not only with L*a*b* but also with grayscale, which does not consider color information. Grayscale produces accuracies over 85 % when the larger numbers of bins are considered. Finally, the L*a*b* color space outperforms any other result obtained with this method, since it produces classification rates of over 90 %. As a summary, Table 9 shows the best parameter configuration for each color space, and its performance.
 

The experiment performed with the discrete wavelet transform aimed at analyzing not only the behavior of each mother wavelet but also the number of scales. The mother wavelets analyzed are Haar, Daub4, Daub6 and Daub8, where Daubi is an orthonormal wavelet whose number of vanishing moments is i/2. Note that the Haar wavelet is equivalent to Daub2. Figure 11 shows the results of this experiment, where scales from 1 to 8 were considered. As can be seen, the intermediate scales are more discriminative than the smaller and larger ones. Also, in general terms, the Haar wavelet provides the highest accuracy rates. Regarding grayscale, the use of the Haar wavelet achieves results of over 85 % correct classifications at several scales, exactly the same accuracy as Daub4 and Daub6, but these two in only one case each. On the other hand, the use of color information improves the results both in L*a*b* and opponent colors. Concerning opponent colors, the accuracy is almost 89 % using the Haar wavelet. However, the best result is obtained with the combination of the L*a*b* color space and Daub4, which is closely followed by L*a*b* and Haar with no significant differences. In general terms, the results are quite stable, since there is a wide range of scales where the accuracies are over 80 %, independently of the mother wavelet. As a summary, Table 10 shows the best parameter configuration for each color space, and its performance.
 

The target of the experiment performed using Markov random fields was to compare different distances, and their adjacent combinations, in the three color spaces. In this particular case, distances from 1 to 10 were considered. Figure 12 shows the percentage accuracies obtained in this experiment. As can be seen, the shorter and intermediate distances outperform the larger ones. Regarding grayscale and opponent colors, their behavior is quite similar, since they produce better results in the same range of distances and their highest accuracy is also the same: over 83 %. However, the use of the L*a*b* color space provides the best results of the method, with a maximum accuracy over 90 %. As a summary, Table 11 shows the best parameter configuration for each color space, and its performance.
 

The last experiment focused on co-occurrence features and consisted of analyzing each distance, from 1 to 17, separately, as well as the combination of adjacent distances. This combination was achieved through the concatenation of the individual descriptors. Figure 13 shows the results in terms of percentage accuracy for the most relevant combinations in grayscale, L*a*b* and opponent colors. The intermediate and larger distances are more discriminative than the shorter ones, providing over 90 % correct classifications using grayscale images. The best combinations of distances provide classification rates over 92 %. Opponent colors do not outperform grayscale, since the best result with this method is less than 90 %. In contrast, L*a*b* demonstrates how color information can improve the obtained accuracy rates, providing the best results with this texture extraction method. Furthermore, almost all the distance combinations obtain over 90 % accuracy and some of them around 93 %. As a summary, Table 12 shows the best parameter configuration for each color space, and its performance.
 

4.2.1 Performance on the different classes
To analyze the performance of the system on the different categories, Table 13 depicts the confusion matrix for the best parameter configuration. As can be seen, the number of misclassified samples is really low. In addition, when a sample is misclassified, it is always assigned to one of the adjacent classes, i.e., to the most similar classes. There are no misclassifications between the classes with more different features, for example between fine fringes and debris. Regarding the behavior of the system on the classes, coalescing strong fringes seems to be the easiest one to classify, since all the samples which belong to this class are correctly classified. In contrast, strong fringes, with 2 out of 11 samples misclassified, and debris, with 2 out of 12, are the most difficult ones. The reason in this case may be the under-representation of these two categories in the dataset.
 

5 Conclusions
This research presents a study of different methods of classifying human tear films, based on the detection of a region of interest and the analysis of its low-level features through five texture extraction techniques and three color spaces. For each combination of texture and color analysis, four machine learning algorithms and different parameter configurations were considered.
 

In general terms, the SVM classifier produces the best results compared with the three other machine learning algorithms, achieving over 80 % accuracy independently of the texture extraction method and the color space. Also, it is significantly different from the other classifiers in most cases. Regarding color analysis, the use of color information generally improves the results achieved when compared to grayscale analysis, because some categories contain not only morphological features, but also color features. All the texture extraction methods perform quite well and provide results with over 90 % accuracy in some cases, but co-occurrence features generate the best results, closely followed by Gabor filters. Despite the fact that Markov random fields use neighborhood information, as does co-occurrence features analysis, the Markov technique does not perform as well, because less information is included in the final descriptor. In essence, the combination of co-occurrence features and the L*a*b* color space produces the best classification rates, with an accuracy over 93 %. Despite the fact that there is some imbalance among the five classes in the dataset considered, the F-measure confirms the reliability of the results presented in terms of percentage accuracy, since there are no significant differences between these two popular metrics.
 

In clinical terms, the utility of any treatment depends on identifying whether the treatment used actually affected the disease process or not. This depends on having a reliable means of measuring the condition under examination [6, 55–57]. In the case of dry eye treatment, one factor that can change is the spread of the lipid layer across the anterior surface of the tears. Unfortunately, the interferometric appearance of this layer can vary hugely, even within one image, so subjective examination can be uncertain. Introducing an impartial method of categorizing the lipid layer appearance is, therefore, tremendously helpful. And therein lies the importance of the proposed methodology, whose results demonstrate that the manual process performed by experts can be automated. This results in a system that is both faster than the manual process and unaffected by subjective factors.
 

Most images are made up of a combination of different patterns. This heterogeneity of the tear film lipid layer makes its classification into a single category impossible. Consequently, future research will involve performing local analysis and classification in order to detect multiple categories in each individual subject. In addition to further analysis of individual images, investigation of the dynamic changes seen in the tear film during the inter-blink time interval would help in identifying those subjects with poor tear film stability.
 

 

 
