Application of Feature Selection for Identification of Cucumber Leaf Diseases (Cucumis sativa L.)

According to data from BPS Kabupaten Jember, cucumber production fluctuated from 2013 to 2017. The literature attributes part of this fluctuation to disease attacks on the plants. Most cucumber diseases appear on the leaves, such as downy mildew and powdery mildew, both of which are caused by fungi. Farmers currently check for cucumber diseases manually, which limits the accuracy of the diagnosis. A computer vision system that identifies cucumber diseases automatically can improve the speed and accuracy of disease handling. This research used 90 training images, consisting of 30 healthy leaves, 30 powdery mildew leaves and 30 downy mildew leaves, and 30 test images, with 10 images in each class. A feature selection process was carried out on the color and texture features, and the selected features were the red color feature and the texture features contrast, Inverse Difference Moment (IDM) and correlation. The K-Nearest Neighbor classification method was able to classify diseases on cucumber leaves (Cucumis sativa L.) with a training accuracy of 90% and a test accuracy of 76.67% at K = 7.

I. INTRODUCTION
Cucumber (Cucumis sativa L.) is one of the most widely consumed fruit vegetables, and its growing conditions are flexible: it can grow in both highlands and lowlands [1]. In Indonesia, cucumber production was unstable from 2013 to 2017. Production was 9.97 tons/ha in 2013, decreased to 9.84 tons/ha in 2014 and increased to 10.27 tons/ha in 2015. In 2016 it decreased again to 10.19 tons/ha, and in 2017 it increased to 10.67 tons/ha [2]. One cause of this fluctuation is pest and disease attacks on the plants. Disease symptoms can be seen on parts of the plant such as the leaves, fruit, stems and roots. Most cucumber diseases found on the leaves, such as anthracnose lesions [3], downy mildew and powdery mildew, are caused by fungi [4].
So far, farmers have checked for cucumber plant diseases manually, relying on their own experience. This approach has weaknesses, one of which is limited accuracy in identifying the disease. To help farmers, a cucumber leaf disease identification system was built using computer vision. Computer vision is a branch of science in which a system applies digital image processing techniques and analyzes the results using artificial intelligence.
Several studies serve as references for this research. Cucumber diseases have been diagnosed by extracting three characteristic values, namely shape, texture and color [5] [3]. That work was later extended with image smoothing and edge detection to improve the segmentation results [6]. Feature extraction using first-order statistical features and second-order statistical features such as the Gray Level Co-Occurrence Matrix (GLCM) achieved an accuracy of 80.45% [7], and cucumber disease recognition using sparse representation classification achieved an accuracy of 85.7% [8]. The classification method used in this study is K-Nearest Neighbor (KNN). The KNN method has also been applied to other objects: classification of platelets on peripheral blood smears with an accuracy of 83.67% [9], classification of tuberculosis bacteria with an accuracy of 94.92% [10], classification of white blood cell abnormalities based on shape features (area, perimeter, metric and compactness) with an accuracy of 94.3% [11], and classification of bacteria that cause ARI with an accuracy of 91.67% using K = 3, 5 and 7 [12].
To classify data, the KNN method uses the distance between a new object and the stored training data, which is why it is often described as lazy learning: no model is built in advance. The basic principle of KNN is to choose a value K, the number of nearest neighbors that determine the classification result, and to measure closeness using the Euclidean distance. Based on this description, the K-Nearest Neighbor (KNN) method is expected to classify cucumber leaf diseases with a good level of accuracy.
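The KNN principle described above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in this research: the function name knn_predict, the toy feature values and the two-feature layout are assumptions made for the example.

```python
import numpy as np
from collections import Counter

def knn_predict(train_x, train_y, query, k=7):
    """Classify one feature vector by majority vote among its k nearest training samples."""
    train_x = np.asarray(train_x, dtype=float)
    query = np.asarray(query, dtype=float)
    # Euclidean distance from the query to every training sample.
    distances = np.sqrt(np.sum((train_x - query) ** 2, axis=1))
    # Indices of the k smallest distances (stable sort keeps input order on ties).
    nearest = np.argsort(distances, kind="stable")[:k]
    # Majority vote over the labels of the k nearest samples.
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Toy data with two features per sample, invented for illustration only.
train_x = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [1.0, 1.0], [0.9, 1.1], [1.1, 0.9]]
train_y = ["healthy", "healthy", "healthy", "powdery", "powdery", "powdery"]
label = knn_predict(train_x, train_y, [0.15, 0.15], k=3)
print(label)
```

Because nothing is fitted beforehand, all the work happens at prediction time, which is the sense in which the method is called lazy.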

II. RESEARCH METHODOLOGY
This research was conducted in a cucumber field in Lumajang Regency, East Java. Images were collected in direct sunlight so that the resulting images are bright and free of shadows (noise), since shadows would affect the image processing results. The system flow starts from the cucumber leaf image, which is processed using digital image processing techniques; the extracted features then become the input of the KNN classification method, which assigns the image to one of 3 classes, as shown in Figure 1.

A. Cucumber Leaf Image Data
Image data was taken using a Canon EOS 1100D camera with a resolution of 12 MP, a tripod and a studio light box. The data is divided into 3 classes, namely healthy leaf images, downy mildew leaf images and powdery mildew leaf images, as shown in Figure 2. This research used 90 training images, consisting of 30 healthy leaves, 30 powdery mildew leaves and 30 downy mildew leaves, and 30 test images, with 10 images in each class.

B. Image Processing
Pre-processing is the initial stage of image processing and aims to improve image quality and normalize the data. It begins with cropping, which produces a smaller image and so reduces the computational load [13]. The initial image of 4272 x 2848 pixels is cropped to 1001 x 1001 pixels, as shown in Figure 3.
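The cropping step can be sketched as follows. The text does not state where the crop window is placed, so a centered window is assumed here; the function name center_crop and the blank stand-in image are also assumptions for illustration.

```python
import numpy as np

def center_crop(image, size=1001):
    """Return a size x size window taken from the middle of the image (assumed placement)."""
    rows, cols = image.shape[:2]
    if rows < size or cols < size:
        raise ValueError("image is smaller than the requested crop size")
    top = (rows - size) // 2
    left = (cols - size) // 2
    return image[top:top + size, left:left + size]

# Stand-in for a 4272 x 2848 photo: arrays are indexed (rows, columns, channels).
frame = np.zeros((2848, 4272, 3), dtype=np.uint8)
cropped = center_crop(frame)
print(cropped.shape)
```

The crop reduces the pixel count from about 12.2 million to about 1 million, which is where the saving in computational load comes from.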

C. Feature Extraction
Feature extraction aims to extract the distinguishing characteristics of the cucumber leaf image. The original image is in the RGB color space, so before feature extraction it is split into its red, green and blue components, as shown in Figure 4.
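A possible sketch of the channel split is given below. It assumes the image array stores its channels in red, green, blue order (some libraries load images as blue, green, red), and it assumes that the color feature of a channel is its mean intensity, which the text does not specify. The function names are hypothetical.

```python
import numpy as np

def split_rgb(image):
    """Split an (rows, cols, 3) array into red, green and blue planes (RGB order assumed)."""
    image = np.asarray(image)
    return image[..., 0], image[..., 1], image[..., 2]

def mean_color_features(image):
    """Mean intensity per channel, used here as an assumed color feature."""
    red, green, blue = split_rgb(image)
    return {
        "red": float(red.mean()),
        "green": float(green.mean()),
        "blue": float(blue.mean()),
    }

# Synthetic 4 x 4 image with a constant color, for illustration only.
demo = np.zeros((4, 4, 3), dtype=np.uint8)
demo[..., 0] = 200
demo[..., 1] = 120
demo[..., 2] = 40
features = mean_color_features(demo)
print(features)
```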
In this study, two kinds of features were used: color and texture. The color features are the red, green and blue components, while the texture features are derived from the gray level co-occurrence matrix (GLCM). The GLCM is a matrix of gray levels that records how often two pixels with given intensities occur at a distance d and an angle θ from each other, so the matrix changes with the distance and direction between the pixels [14]. The angles used are 0°, 30°, 45°, 90° and 135°. The formula for the correlation feature is as follows [15]:

$$\mathrm{Correlation} = \sum_{i}\sum_{j} \frac{(i - \mu_i)(j - \mu_j)\,\mathrm{GLCM}(i, j)}{\sigma_i \sigma_j}$$

where μi, μj and σi, σj are the means and standard deviations of the row and column marginals of the GLCM.

The KNN method works on the principle of choosing a constant value K, where K is the number of nearest neighbors that determine the classification result. The distance is calculated using the Euclidean distance [12]:

$$d(x, y) = \sqrt{\sum_{k=1}^{n} (x_k - y_k)^2}$$

where x and y are two feature vectors with n features each.
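The GLCM texture features can be sketched with NumPy alone. This is an illustrative sketch under stated assumptions, not the code used in this research: the function names are hypothetical, the offsets cover only 0°, 45°, 90° and 135° at d = 1 (the directions that map to whole-pixel steps), entropy uses a base-2 logarithm, and the ASM, contrast, IDM and entropy formulas are the commonly used Haralick-style definitions.

```python
import numpy as np

# Pixel offsets (dx, dy) for distance d = 1; image rows grow downward.
OFFSETS = {0: (1, 0), 45: (1, -1), 90: (0, -1), 135: (-1, -1)}

def glcm(image, dx, dy, levels):
    """Normalized co-occurrence matrix for the pixel offset (dx, dy)."""
    img = np.asarray(image, dtype=np.intp)
    rows, cols = img.shape
    counts = np.zeros((levels, levels), dtype=np.float64)
    # Keep only reference pixels whose neighbor lies inside the image.
    r0, r1 = max(0, -dy), min(rows, rows - dy)
    c0, c1 = max(0, -dx), min(cols, cols - dx)
    ref = img[r0:r1, c0:c1]
    nbr = img[r0 + dy:r1 + dy, c0 + dx:c1 + dx]
    # Count every (reference level, neighbor level) pair.
    np.add.at(counts, (ref.ravel(), nbr.ravel()), 1)
    return counts / counts.sum()

def glcm_features(p):
    """ASM, contrast, IDM, entropy and correlation of a normalized GLCM."""
    i, j = np.indices(p.shape)
    nonzero = p[p > 0]
    mu_i, mu_j = np.sum(i * p), np.sum(j * p)
    sigma_i = np.sqrt(np.sum((i - mu_i) ** 2 * p))
    sigma_j = np.sqrt(np.sum((j - mu_j) ** 2 * p))
    denom = sigma_i * sigma_j
    # Correlation is undefined when either marginal has zero spread.
    corr = np.sum((i - mu_i) * (j - mu_j) * p) / denom if denom > 0 else float("nan")
    return {
        "asm": float(np.sum(p ** 2)),
        "contrast": float(np.sum((i - j) ** 2 * p)),
        "idm": float(np.sum(p / (1.0 + (i - j) ** 2))),
        "entropy": float(-np.sum(nonzero * np.log2(nonzero))),
        "correlation": float(corr),
    }

# Tiny two-level image, invented for illustration.
tiny = [[0, 0], [1, 1]]
dx, dy = OFFSETS[0]
tiny_features = glcm_features(glcm(tiny, dx, dy, levels=2))
print(tiny_features)
```

For a real 8-bit red channel, levels would be 256 unless the image is quantized first; averaging the features over the directions is one common way to reduce sensitivity to leaf orientation.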

III. RESULTS AND DISCUSSION
In this research, color features are taken from each component of the RGB color space of the cropped image, as shown in Figure 4. The blue and green images show no significant difference between the healthy, downy mildew and powdery mildew classes. The red image, on the other hand, shows a significant difference, especially for the powdery mildew class, so the red image best represents the textures of the three classes. In the red component the gray levels on the downy mildew and powdery mildew leaves are clearly visible, so the red image is used as the input for texture feature extraction based on the Gray Level Co-Occurrence Matrix (GLCM).

A. Color Feature Extraction
The color features obtained for each class are shown in Table 1. For the blue feature, the values of the healthy leaf class and the downy mildew leaf class are close to each other. For the green feature, the values of the healthy leaf class and the powdery mildew leaf class are also close. For this reason, the color feature used is the red feature.

B. Texture Feature Extraction
This research also extracts texture features using the GLCM. The red image is used as the input image because, as explained above, it best represents the texture of the three classes. The average values of the texture features are shown in Table 3; the features are Angular Second Moment (ASM), Contrast, Inverse Difference Moment (IDM), Entropy and Correlation. Table 3 shows that the ASM values of the three classes are close to each other, and that the entropy values of the downy mildew and powdery mildew classes are also close.

C. Feature Selection
The discussion of color and texture feature extraction shows that some features have close values across classes, and this closeness causes errors in the classification process. A feature selection process was therefore carried out, and the features retained are the red color feature and the texture features contrast, IDM and correlation. An example for the red feature is shown in Figure 5, where the blue color describes the distribution of healthy leaf data, the red color describes the distribution of downy mildew data and the green color describes the distribution of powdery mildew data. The graphs for the green and blue color features look different, as shown in Figure 6: many data points fall in the value range shared by the three classes, so the graphs of the three classes intersect.

For the GLCM features, Figure 7 plots the value of each feature in the three classes. For (a) ASM and (d) entropy, many data points have values that touch each other across the three classes. This lowers the accuracy of the system, so these two features reduce its ability to separate the three classes. For (b) contrast, (c) IDM and (e) correlation, the values differ between classes, although some values still intersect between healthy leaf data and downy mildew data, or between healthy leaf data and powdery mildew data. Based on the results of feature selection, this study used 4 features consisting of color and texture features, namely the red color feature and the texture features contrast, IDM and correlation, as shown in Table 3.
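The selection above is based on visual inspection of the class distributions. As a numerical stand-in for that inspection, a Fisher-style separability score can be computed per feature; this is a swapped-in technique for illustration, not the procedure used in this research, and the function name and toy values are assumptions. A high score means the class means are far apart relative to the spread inside each class.

```python
import numpy as np

def fisher_scores(features, labels):
    """Between-class over within-class variance for each feature column."""
    features = np.asarray(features, dtype=float)
    labels = np.asarray(labels)
    overall_mean = features.mean(axis=0)
    between = np.zeros(features.shape[1])
    within = np.zeros(features.shape[1])
    for cls in np.unique(labels):
        group = features[labels == cls]
        between += len(group) * (group.mean(axis=0) - overall_mean) ** 2
        within += len(group) * group.var(axis=0)
    # Guard against division by zero when a feature is constant inside every class.
    return between / np.maximum(within, 1e-12)

# Toy data: column 0 separates the classes, column 1 overlaps heavily.
toy_features = [[1.0, 5.0], [1.2, 3.0], [3.0, 4.9], [3.2, 3.1], [5.0, 5.1], [5.2, 2.9]]
toy_labels = ["healthy", "healthy", "downy", "downy", "powdery", "powdery"]
scores = fisher_scores(toy_features, toy_labels)
print(scores)
```

In this toy case the first column scores far higher than the second, which mirrors the way overlapping features such as ASM and entropy would rank below features whose class distributions are well separated.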

D. K-Nearest Neighbor Classification
The training data consist of 90 cucumber leaf images: 30 healthy leaves, 30 powdery mildew leaves and 30 downy mildew leaves. The test data consist of 30 images, with 10 images in each class. The accuracy results of the KNN classification method are shown in Table 4. Based on Table 4, the highest training accuracy is 100% at K = 1, but the test accuracy at that value is only 66.67%. The highest test accuracy is 76.67%, with a training accuracy of 90%, at K = 7. System accuracy is calculated from ROC analysis with a confusion matrix, as shown in Table 5 for the training process and Table 6 for the testing process. These results show a significant gap between training and testing accuracy: 90% against 76.67%. This can happen because of the small amount of data used, or because the feature values of the training data differ significantly from those of the testing data.
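Accuracy follows from a confusion matrix as the number of correct predictions (the diagonal) divided by the total number of samples. The sketch below uses a hypothetical 3 x 3 matrix: its individual cells are invented and are not the values of Table 6. Only its totals, 23 correct out of 30, are chosen to match the reported test accuracy of 76.67%.

```python
import numpy as np

def accuracy_from_confusion(matrix):
    """Share of samples on the diagonal of a confusion matrix."""
    matrix = np.asarray(matrix, dtype=float)
    return float(np.trace(matrix) / matrix.sum())

# Hypothetical matrix (rows: true class, columns: predicted class), 10 samples per row.
hypothetical_test_matrix = [
    [9, 1, 0],
    [2, 7, 1],
    [1, 2, 7],
]
acc = accuracy_from_confusion(hypothetical_test_matrix)
print(f"{acc * 100:.2f}%")
```

By the same arithmetic, the 90% training accuracy corresponds to 81 of the 90 training images being classified correctly.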

IV. CONCLUSION
The K-Nearest Neighbor (KNN) classification method is able to classify diseases on cucumber leaves (Cucumis sativa L.) with a training accuracy of 90% and a test accuracy of 76.67% at K = 7. The small amount of data also affects the classification results, because it limits the ability of the system to recognize the patterns of the healthy, downy mildew and powdery mildew classes. In addition, future research should compare other classification methods.