Alvaro Fuentes 1 ID , Sook Yoon 2,3, Sang Cheol Kim 4 and Dong Sun Park 5,*
1 Department of Electronics Engineering, Chonbuk National University, Jeonbuk 54896, Korea;
afuentes@jbnu.ac.kr
2 Research Institute of Realistic Media and Technology, Mokpo National University, Jeonnam 534-729, Korea;
syoon@mokpo.ac.kr
3 Department of Computer Engineering, Mokpo National University, Jeonnam 534-729, Korea
4 National Institute of Agricultural Sciences, Suwon 441-707, Korea; sckim@rda.go.kr
5 IT Convergence Research Center, Chonbuk National University, Jeonbuk 54896, Korea
* Correspondence: dspark@jbnu.ac.kr; Tel.: +82-63-270-2475
Received: 10 July 2017; Accepted: 28 August 2017; Published: 4 September 2017
Abstract: Plant diseases and pests are a major challenge in the agriculture sector. Accurate and faster detection of diseases and pests in plants could help to develop early treatment techniques while substantially reducing economic losses. Recent developments in Deep Neural Networks have
allowed researchers to drastically improve the accuracy of object detection and recognition systems.
In this paper, we present a deep-learning-based approach to detect diseases and pests in tomato plants
using images captured in-place by camera devices with various resolutions. Our goal is to find the most suitable deep-learning architecture for our task. Therefore, we consider three main families of
detectors: Faster Region-based Convolutional Neural Network (Faster R-CNN), Region-based Fully
Convolutional Network (R-FCN), and Single Shot Multibox Detector (SSD), which for the purpose of
this work are called “deep learning meta-architectures”. We combine each of these meta-architectures
with “deep feature extractors” such as VGG net and Residual Network (ResNet). We demonstrate the
performance of deep meta-architectures and feature extractors, and additionally propose a method
for local and global class annotation and data augmentation to increase the accuracy and reduce the
number of false positives during training. We train and test our systems end-to-end on our large
Tomato Diseases and Pests Dataset, which contains challenging images with diseases and pests,
including several inter- and extra-class variations, such as infection status and location in the plant.
Experimental results show that our proposed system can effectively recognize nine different types of
diseases and pests, with the ability to deal with complex scenarios from a plant’s surrounding area.
Keywords: plant disease; pest; deep convolutional neural networks; real-time processing; detection
1. Introduction
Crops are affected by a wide variety of diseases and pests, especially in tropical, subtropical, and
temperate regions of the world [1]. Plant diseases involve complex interactions between the host plant,
the virus, and its vector [2]. The context of this problem is sometimes related to the effects of climate change and how it alters an ecosystem. Climate change affects regional climate variables, such as humidity, temperature, and precipitation, which in turn act as vectors through which pathogens, viruses, and pests can destroy a crop, thus causing direct economic, health, and livelihood impacts on the population [3].
Diseases in plants have been largely studied in the scientific area, mainly focusing on the
biological characteristics of diseases [4]. For instance, studies on potato [5] and tomato [6,7] show
how susceptible a plant is to be affected by diseases. The problem of plant diseases is a worldwide
issue also related to food security [8]. Regardless of frontiers, media, or technology, the effects of
diseases in plants cause significant losses to farmers [9]. Early identification of disease nowadays remains a challenging task and needs to be treated with special attention [10].
In our approach, we focus on the identification and recognition of diseases and pests that affect
tomato plants. Tomato is economically the most important vegetable crop worldwide, and its production has substantially increased over the years [11]. The worldwide cultivation of
tomato exposes the crop to a wide range of new pathogens. Many pathogens have found this crop to
be highly susceptible and essentially defenseless [6]. Moreover, viruses infecting tomato have been
described, while new viral diseases keep emerging [12].
Several techniques have recently been applied to identify plant diseases [13]. These
include using direct methods closely related to the chemical analysis of the infected area of the plant
[14–16], and indirect methods employing physical techniques, such as imaging and spectroscopy
[17,18], to determine plant properties and stress-based disease detection. However, the advantages
of our approach compared to most of the traditionally used techniques are based on the following
facts:
• Our system uses images of plant diseases and pests taken in-place, thus we avoid the process of
collecting samples and analyzing them in the laboratory.
• It considers the possibility that a plant can be simultaneously affected by more than one disease
or pest in the same sample.
• Our approach uses input images captured by different camera devices with various resolutions,
such as cell phones and other digital cameras.
• It can efficiently deal with variations in illumination, object size, background, etc., contained in the surrounding area of the plant.
• It provides a practical real-time application that can be used in the field without employing any
expensive and complex technology.
Plant diseases visibly show a variety of shapes, forms, colors, etc. [10]. Understanding the host-pathogen interaction is essential to design more robust control strategies to reduce crop damage [2]. Moreover,
the challenging part of our approach lies not only in identifying a disease but also in estimating how precise the identification is and what infection status the plant presents. At this point, it is necessary to clarify the difference between the notions of image classification and object detection. Classification estimates whether an image contains any instance of an object class (what), whereas detection determines both the class and the location of every instance of a particular object in the image (what and where). As shown in Figure 1,
our system is able to estimate the class based on the probability of a disease and its location in the
image shown as a bounding box containing the infected area of the plant.
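To make the what/where distinction concrete, the following minimal sketch (illustrative only; the class names, scores, and coordinates are hypothetical) contrasts the two kinds of output for the same tomato image:

```python
# Image classification answers "what": one label distribution per image.
classification_output = {"leaf_mold": 0.91, "healthy": 0.06, "gray_mold": 0.03}

# Object detection answers "what and where": a class, a confidence score, and
# a bounding box (x_min, y_min, x_max, y_max) for each infected region found.
detection_output = [
    {"class": "leaf_mold", "score": 0.88, "box": (112, 40, 310, 215)},
    {"class": "whitefly",  "score": 0.72, "box": (405, 180, 470, 240)},
]
```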
Recent advances in hardware technology have allowed the evolution of Deep Convolutional
Neural Networks and their large number of applications, including complex tasks such as object
recognition and image classification. Since the success of AlexNet [19] in the ImageNet Large Scale
Visual Recognition Challenge [20] 2012 (ILSVRC), deeper and deeper networks [21–26] have been
proposed and achieved state-of-the-art performance on ImageNet and other benchmark datasets [27].
Thus, these results evidence the importance of network depth and width, as deeper and wider networks tend to generate better results [28].
In this paper, we address disease and pest identification by introducing the application of deep
meta-architectures [29] and feature extractors. Instead of using traditionally employed methods, we
basically develop a system that successfully recognizes different diseases and pests in images collected
in real scenarios. Furthermore, our system is able to deal with complex tasks, such as infection status
(e.g., early, late), location in the plant (e.g., leaves, stem), sides of leaves (e.g., front, back), and
different background conditions, among others.
Following previous approaches [30–32], we aim to use meta-architectures based on deep detectors
to identify Regions of Interest (ROI) in the image, which correspond to infected areas of the plant.
Each ROI is then classified as containing or not containing a disease or pest by comparison with the ground-truth annotated data. Using deep feature extractors, our meta-architecture can efficiently
learn complex variations among diseases and pests found in different parts of the plant and deal with
different sizes of candidates in the image.
The contributions of this paper are as follows: we propose a robust deep-learning-based detector
for real-time tomato diseases and pests recognition. The system introduces a practical and applicable
solution for detecting the class and location of diseases in tomato plants, which represents a key difference from traditional methods for plant disease classification. Our detector uses
images captured in-place by various camera devices that are processed by a real-time hardware
and software system using graphical processing units (GPUs), rather than using the process of
collecting physical samples (leaves, plants) and analyzing them in the laboratory. Furthermore,
it can efficiently deal with different task complexities, such as illumination conditions, the size of
objects, and background variations contained in the surrounding area of the plant. A detailed review
of traditional methods for anomaly detection in plants and deep-learning techniques is presented
in Section 2. Our proposed deep-learning-based system and the process for detecting diseases and
pests is detailed in Section 3. In Section 4, we show the experimental results to demonstrate how our
detector is able to successfully recognize nine different diseases and pests and their location in the
images while providing robust real-time results. Moreover, we found that our data annotation and augmentation techniques result in better performance. In the last section, we study
some of the detection failures and conclude that, although the system shows outstanding performance
when dealing with all complex scenarios, there is still room for prediction improvements as our dataset
becomes larger and includes more classes.
2. Related Works
2.1. Anomaly Detection in Plants
Plant disease identification is a critical topic that has been studied through the years, motivated by the need to produce healthy food. Desirable properties of any solution include cost-effectiveness, user-friendliness, sensitivity, and accuracy [33]. In the last decade, several works have proposed nondestructive techniques that address these requirements. In [34],
hyperspectral proximal sensing techniques were used to evaluate plant stress caused by environmental conditions. Optical technologies are practical tools for monitoring plant health; for
example, in [35], thermal and fluorescence imaging methods were introduced for estimating plant
stress produced mainly by increased gases, radiation, water status, and insect attack, among others.
Another important area includes the study of plant defense in response to the presence of pathogens. To that end, in [36], chemical elements were applied to leaves in order to estimate their defense capabilities against pathogens. To study plant robustness with respect to nutrition, in [37], potato plants were cultivated in the presence of several nutritional elements to evaluate their effects on the crop.
As mentioned earlier, the area of plant anomaly detection has been addressed through various means.
Although previous methods show outstanding performance in the evaluated scenarios, they do not
yet provide a highly accurate solution for estimating diseases and pests in real time. Instead,
their experiments are mainly conducted in a laboratory or using expensive techniques. Therefore,
our approach focuses on a cost-effective technique that uses images collected in situ, including in-place variations of the scene, as our source of information.
Before Deep Learning became popular
in the Computer Vision field, several handcrafted feature-based methods had been widely applied
specifically for image recognition. A handcrafted method is so called because of the human knowledge involved in designing the algorithm itself and the complex parameters included in the process. Disadvantages of these methods include high computational cost and time consumption due to the separate preprocessing, feature extraction, and classification stages. Some of
the best-known handcrafted feature methods are the Histogram of Oriented Gradients (HOG) [38]
and Scale-Invariant Feature Transform (SIFT) [39], which are usually combined with classifiers such as
Adaptive Boosting (AdaBoost) [40] or Support Vector Machines (SVM) [41].
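For concreteness, the following is a minimal sketch of such a handcrafted pipeline (HOG features fed to an SVM), assuming scikit-image and scikit-learn; `X_train` and `y_train` are hypothetical grayscale leaf patches and their labels:

```python
import numpy as np
from skimage.feature import hog  # hand-designed descriptor
from sklearn.svm import SVC      # separate classifier stage

def extract_hog(image):
    # Histograms of gradient orientations computed over fixed cells.
    return hog(image, orientations=9, pixels_per_cell=(8, 8),
               cells_per_block=(2, 2))

# Feature extraction and classification are two disjoint stages, unlike
# end-to-end deep learning, where both are learned jointly.
features = np.array([extract_hog(img) for img in X_train])
classifier = SVC(kernel="rbf").fit(features, y_train)
```

Every choice here (cell sizes, orientation bins, kernel) is tuned by hand, which is precisely the engineering burden that end-to-end training removes.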
Deep Learning has allowed researchers to design systems that can be trained and tested end-to-end (all stages included in the same process), unlike handcrafted methods, which rely on separate processes. Due to the outstanding performance of Convolutional Neural Networks (CNNs) as feature extractors in image recognition tasks, the idea has also been extended to different areas, such as agriculture, automation, and robotics. Some of the applications for agriculture utilize Computer
Vision and CNNs to solve complex tasks, such as plant recognition. For instance, in [42], it is shown how
a CNN-based method outperforms local feature descriptors and bag of visual words techniques when
recognizing 10 types of plants. In [43], the authors found that using a fusion of deep representations and
handcrafted features leads to a higher accuracy of leaf plant classification. They applied a CNN for leaf
segmentation, extracted handcrafted features with image processing techniques, trained an SVM with the feature vectors, and combined the SVM with a CNN to identify species among 57 varieties of trees.
Subsequently, due to recent advances in Machine Learning, the principles of CNNs have been applied to plant disease recognition in different crops; for example, [44] used a LeNet-based CNN and image processing to distinguish two leaf diseases from healthy leaves. In [45], an image processing
and statistical inference approach was introduced to identify three types of leaf diseases in wheat.
In [46], the authors developed a method to discriminate images of cucumber leaves containing seven types of diseases from images of healthy leaves. To that end, they used an image-processing technique and a four-layer CNN, which achieved an average accuracy of 82.3% under a 4-fold cross-validation strategy. Another approach for cucumber leaf diseases, [47], used a three-layer CNN trained on images containing two diseases as well as healthy leaves. To support the application
of machine learning, [48] proposed to use a method called Color and Oriented FAST and Rotated
BRIEF (ORB) to extract features and three classifiers (Linear Support Vector Classifier (SVC), K-Nearest
Neighbor, Extremely Randomized Trees) to recognize four types of diseases in cassava. As a result, they presented a smartphone-based system that uses the learned classification model for real-time prediction of the state of health of a farmer’s garden.
Other works that use deep convolutional neural networks for disease recognition have also been proposed, showing good performance on different crops. For instance, [49] developed a CNN-based system to identify 13 types of diseases (as distinct from healthy leaves) in five crops using images downloaded from the internet. That approach achieved a top-1 success rate of 96.3% and a top-5 success rate of 99.99%. In [50], the authors evaluated two CNN approaches, based on AlexNet [19] and GoogleNet [23],
to distinguish 26 diseases across 14 crops using the Plant Village Dataset [51]. Another work on the same dataset reports a test accuracy of 90.4% using a VGG-16 model trained with transfer learning [52].
However, the Plant Village Dataset contains only images of leaves that are previously cropped in the
field and captured by a camera in the laboratory. This is unlike the images in our Tomato Diseases and
Pests Dataset, which are directly taken in-place by different cameras with various resolutions, including
not only leaves infected by specific pathogens at different infection stages but also other infected parts
of the plant, such as fruits and stems. Furthermore, the challenging part of our dataset is to deal with
background variations mainly caused by the surrounding areas or the place itself (greenhouse).
Although the works mentioned above show outstanding performance on leaf disease recognition,
the challenges, such as pattern variation, infection status, different diseases or pests and their location
in the image, and surrounding objects, among others, are still difficult to overcome. Therefore,
we consider a technique that not only recognizes the disease in the image but also identifies its location, enabling the subsequent development of a real-time system.
2.2. Deep Meta-Architectures for Object Detection
Convolutional Neural Networks are nowadays considered the leading method for object detection. As hardware technology has improved through the years, deeper networks with better performance have also been proposed. Among them, we mention some state-of-the-art methods for
object recognition and classification. In our paper, we focus principally on three recent architectures:
Faster Region-Based Convolutional Neural Network (Faster R-CNN) [30], Single Shot Multibox Detector (SSD) [31], and Region-based Fully Convolutional Networks (R-FCN) [32]. Although these meta-architectures were initially proposed with a particular feature extractor (VGG, Residual Network (ResNet), etc.), following [29] we apply different feature extractors to each architecture. Thus, each architecture should be able to be merged with any feature extractor depending on the application or need.
2.2.1. Faster Region-based Convolutional Neural Network (Faster R-CNN)
In Faster R-CNN, the detection process is carried out in two stages. In the first stage, a Region
Proposal Network (RPN) takes an image as input and processes it by a feature extractor [30]. Features at
an intermediate level are used to predict object proposals, each with a score. For training the RPNs,
the system labels each anchor as containing an object or not, based on the Intersection-over-Union (IoU) between the anchor and the ground-truth boxes (see the sketch below). In the second stage, the box proposals
previously generated are used to crop features from the same feature map. Those cropped features
are subsequently fed into the remaining layers of the feature extractor in order to predict the class
probability and bounding box for each region proposal. The entire process happens on a single
unified network, which allows the system to share full-image convolutional features with the detection
network, thus enabling nearly cost-free region proposals.
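The following simplified sketch illustrates the anchor-labeling rule used to train the RPN; the 0.7/0.3 thresholds follow the original Faster R-CNN paper [30], and boxes are assumed to be (x1, y1, x2, y2) tuples:

```python
def iou(a, b):
    # Intersection-over-Union of two boxes given as (x1, y1, x2, y2).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / float(area_a + area_b - inter)

def label_anchor(anchor, gt_boxes, pos_thr=0.7, neg_thr=0.3):
    best = max(iou(anchor, gt) for gt in gt_boxes)
    if best >= pos_thr:
        return 1    # positive: likely contains an object (an infected region)
    if best < neg_thr:
        return 0    # negative: background
    return -1       # intermediate overlap: ignored during RPN training
```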
Since the Faster R-CNN was proposed, it has influenced several applications due to its outstanding
performance on complex object recognition and classification.
2.2.2. Single Shot Detector (SSD)
The SSD meta-architecture [31] handles the problem of object recognition by using a feed-forward
convolutional network that produces a fixed-size collection of bounding boxes and scores for the
presence of an object class in each box. This network is able to deal with objects of various sizes
by combining predictions from multiple feature maps with different resolutions. Furthermore,
SSD encapsulates the process into a single network, avoiding proposal generation and thus saving
computational time.
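As a rough illustration of this multi-resolution design, the sketch below counts the default boxes produced by an SSD300-like configuration; the feature-map sizes and boxes-per-cell values are those of the original SSD paper [31], not of this work:

```python
# (height, width, default boxes per cell) for each prediction feature map.
feature_maps = [(38, 38, 4), (19, 19, 6), (10, 10, 6),
                (5, 5, 6), (3, 3, 4), (1, 1, 4)]

total_boxes = sum(h * w * k for h, w, k in feature_maps)
print(total_boxes)  # 8732 boxes, each scored for every class in a single pass
```

Coarse maps (e.g., 1 x 1) catch large objects, while fine maps (38 x 38) catch small ones, which is how a single network covers objects of various sizes.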
2.2.3. Region-based Fully Convolutional Network (R-FCN)
The R-FCN framework [32] proposes to use position-sensitive maps to address the problem of
translation invariance. This method is similar to Faster R-CNN, but instead of cropping features from
the same layer where region proposals are predicted, features (regions with a higher probability of
containing an object or being part of it) are cropped from the last layer of features prior to prediction [29].
By applying this technique, the method minimizes the amount of memory utilized in per-region computation. In the original paper [32], the authors show that using ResNet-101 as the feature extractor can yield performance competitive with Faster R-CNN.
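The following simplified numpy sketch (an illustration of the idea, not the paper's implementation) shows position-sensitive pooling for a single ROI: each cell of a k x k grid over the ROI pools only from its own dedicated score map, so the pooled score encodes relative position within the object (e.g., "top-left part"):

```python
import numpy as np

k, C = 3, 10                     # grid size and class count (illustrative)
H, W = 60, 60                    # spatial size of the final feature maps
# Stand-in for the k*k groups of per-class score maps produced by convolution.
score_maps = np.random.rand(k * k, C, H, W)

def ps_roi_pool(roi):
    x1, y1, x2, y2 = roi
    bin_h, bin_w = (y2 - y1) / k, (x2 - x1) / k
    votes = np.zeros(C)
    for i in range(k):           # grid row
        for j in range(k):       # grid column
            ys = slice(int(y1 + i * bin_h), int(y1 + (i + 1) * bin_h))
            xs = slice(int(x1 + j * bin_w), int(x1 + (j + 1) * bin_w))
            # Cell (i, j) reads only score map i*k + j: position sensitivity.
            votes += score_maps[i * k + j, :, ys, xs].mean(axis=(1, 2))
    return votes / (k * k)       # averaged votes -> per-class scores

scores = ps_roi_pool((10, 12, 46, 48))  # per-class scores for one ROI
```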
2.3. Feature Extractors
In each meta-architecture, the main part of the system is the “feature extractor” or deep
architecture. As mentioned in the previous section, different deep architectures have been proposed year after year, and their applicability depends strongly on the complexity of the problem itself. Several conditions should be taken into consideration when choosing a deep architecture, such as the type and number of layers, since a higher number of parameters increases the complexity of the system and directly influences its memory consumption, speed, and results.
Although each network has been designed with specific characteristics, all share the same goal,
which is to increase accuracy while reducing computational complexity. In Table 1, some of the feature
extractors used in this work are mentioned, including their number of parameters and performance
achieved in the ImageNet Challenge. We select some of the recent deep architectures because of their
outstanding performance and applicability to our system.
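As a quick way to inspect the kind of parameter counts Table 1 refers to, one can query reference implementations; a sketch assuming torchvision is installed (the paper's exact model variants may differ):

```python
import torchvision.models as models

# Untrained reference models, used here only to count parameters.
for name, net in [("VGG-16", models.vgg16()), ("ResNet-50", models.resnet50())]:
    n_params = sum(p.numel() for p in net.parameters())
    print(f"{name}: {n_params / 1e6:.1f}M parameters")
# VGG-16 has roughly 138M parameters versus about 25.6M for ResNet-50,
# which is why the backbone choice dominates memory use and speed.
```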
Table 1. Properties of the deep feature extractors used in this work and their performance on the
ImageNet Challenge.
As shown in Figure 2, our system treats the deep meta-architecture as an open system in which different feature extractors can be adapted to perform our task. The input image, captured by a camera device at any of various resolutions and scales, is fed into our system; after processing by our deep network (feature extractor and classifier), the system outputs the class and localization of the infected area of the plant in the image. Thus, we can provide a nondestructive, localized solution only where the damage is present, and therefore prevent the disease from spreading to the whole crop and reduce the excessive use of chemical treatments.
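To summarize the flow of Figure 2 in code, the sketch below uses torchvision's detection API as a stand-in (the file name is hypothetical, and the pretrained COCO weights shown here would be replaced by a model fine-tuned on the Tomato Diseases and Pests Dataset):

```python
import torch
import torchvision
from torchvision.transforms.functional import to_tensor
from PIL import Image

# One meta-architecture (Faster R-CNN) paired with one feature extractor
# (ResNet-50); the pairing is interchangeable, as this work proposes.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

image = to_tensor(Image.open("tomato_plant.jpg").convert("RGB"))
with torch.no_grad():
    output = model([image])[0]   # dict with "boxes", "labels", "scores"

for box, label, score in zip(output["boxes"], output["labels"], output["scores"]):
    if score > 0.5:              # keep confident detections only
        print(label.item(), round(score.item(), 2), box.tolist())
```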