Application of Information Gain to Select Attributes in Improving Naive Bayes Accuracy in Predicting Customer's Payment Capability

The customer is the main factor in the running of PT. XYZ. A good understanding of customers is very important for predicting their capability to pay. Credit collectibility is used to determine the quality of customer credit, one measure of which is the customer's capability to pay interest and principal on time. Done manually, it is very difficult to accurately predict customers' credit payment capability. The data mining technique with the Naïve Bayes algorithm was chosen to classify customers in order to find patterns, analyze, and predict, because it performs well and is efficient and simple. The Naïve Bayes algorithm has a weakness: it is sensitive to a large number of attributes, which lowers its accuracy. To address this problem, this study applies the Information Gain method to select the attributes that most influence the label in order to increase the accuracy of the Naïve Bayes algorithm. This research produces a new dataset, based on the Information Gain method, with seven attributes: TENOR, SALARY, DOWN PAYMENT, INSTALLMENT, APPROVAL, OTR CLASS, and AGE, with Label: Status and Id: Id number. The dataset comparison process with 995 data records showed an increase in accuracy, precision, and AUC using the new dataset compared to the old dataset, but in the t-Test with an alpha value of 0.05 the difference is not significant. In the evaluation process, performance experienced a significant increase with the new dataset, with the following percentages of performance improvement: accuracy = 8%, precision = 18.42%, recall = 17.65%, and AUC = 0.057%. The study obtained an AUC of 0.876, accuracy of 87.88%, precision of 61.90%, and recall of 76.47%, which is classified as good classification.


I. INTRODUCTION
In the present era, business strategy has developed very significantly, and customers occupy a very important position in it. Customers are also a major factor in the running of PT. XYZ. A good understanding of customers is very important for predicting their capability to pay in the future. Credit collectibility is used to determine the quality of customer credit [1], one measure of which is the customer's capability to pay interest and principal on time as mutually agreed [2]. The problem currently being faced is that it is very difficult to predict customers' credit payment capability accurately by hand from the available dataset [3], and many companies have difficulty identifying customers who are able to pay on time [4].
Data mining with a customer classification approach is widely used to find patterns, analyze, and predict [5]. With a customer classification approach based on existing datasets, a customer's payment capability can be predicted [6], which is very difficult to do accurately by hand.
Many data mining classification algorithms have been used to find patterns, analyze, and predict customer behavior, such as the Decision Tree algorithm [7], Neural Network [8], Support Vector Machine [9], Naïve Bayes [10], and K-Nearest Neighbor [11]. The Naïve Bayes algorithm is the most widely used classification algorithm and was chosen to classify customers because it performs well and is efficient and simple in finding patterns, analyzing, and predicting [12]. The Naïve Bayes algorithm has one drawback, namely that it is sensitive to a large number of attributes, which lowers its accuracy. Selecting the attributes that have the most influence on the label is therefore very important for increasing the accuracy of the Naïve Bayes algorithm [13].
One of the methods for selecting the attributes that have the most influence on the label is the Information Gain method. The Information Gain method is superior to other methods because it measures how much the presence or absence of an attribute contributes to making correct classification decisions for any class or label. It is one of the successful attribute selection approaches in classification [14].
In this study, the authors apply the Information Gain method to select the attributes that most influence the label and use them to form a new dataset, in an effort to improve the accuracy of the Naïve Bayes algorithm in predicting the payment capability of customers at PT. XYZ.

Collectibility of Credit
Credit collectibility is a credit quality status that is taken from a person's score or track record in the banking world [15]. This quality is based on 3 main standards, one of which is the customer's capability to pay principal and interest on time as mutually agreed [16]. Low collectibility, or bad loans, can affect the economic condition of a business and worsen the trickle-down effect on the overall economy, which in turn affects the company's future growth and income [17]. There are several important elements in the provision of a credit facility, namely Trust, Agreement, Term, Risk, and Rewards [18]. To provide credit to customers on the basis of good credit quality, it is necessary to accurately predict customers' credit payment capability as a reference for management in making decisions to improve credit quality and collectibility [19].

Data Mining
Data mining is a process of using existing or past data, then processing it so that it finds patterns, meaningful relationships, and trends by examining a set of stored data using statistical and mathematical techniques [20]. Data mining became popular in the 1990s as a solution for extracting previously unknown patterns and information based on a set of data [21]. Data mining can complete several jobs and is divided into four groups, namely prediction modeling, cluster analysis, association analysis, and anomaly detection [22]. However, data mining techniques can also be applied to other data representations, such as spatial, text-based, and multimedia (image) data domains [23]. Data mining can also be defined as the process of extracting information from large data sets through the use of algorithms with techniques taken from the field of statistics and Database Management Systems [24].
The flow of the research process can be explained as follows:
a. Business Understanding: at this stage, the researcher understands the problem in the research object and then looks for solutions and goals to solve it.
b. Data Understanding: at this stage, the researcher determines and collects what data is needed and then defines it according to the solution and research objectives.
c. Data Preparation: at this stage, the researcher cleans the data to obtain a clean dataset that will be used for the classification model.
d. Information Gain: at this stage, the researcher chooses the best or most influential attributes on the label and then makes them a new dataset.
e. Dataset Comparison: at this stage, the researcher compares the old dataset and the new dataset on the Naïve Bayes algorithm by means of a difference test using the t-Test.
f. Modeling: at this stage, the researcher models the data based on the old dataset and the new dataset with a data split of 90% (Training) : 10% (Testing).
g. Evaluation: at this stage, the researcher compares the results of measuring accuracy, precision, recall, and AUC on the old dataset and the new dataset.
h. Development: at this stage, the researcher builds a prototype that will be used to predict the customer's payment capability.

Research Formulas
a. Naïve Bayes Algorithm
The Naïve Bayes algorithm is a simple probabilistic classification algorithm for prediction, named after the English scientist Thomas Bayes. It applies Bayes' theorem with a strong assumption of independence between features, meaning that a feature in a data record is not related to the presence or absence of other features in the same record [26].
The advantage of the Naïve Bayes classification approach is that it requires only a small amount of training data for the classification process [27]. It has also been proven to be applicable in real and complex situations [27].
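As a rough illustration of the independence assumption described above, the following minimal sketch trains a Naïve Bayes classifier on categorical attributes by counting frequencies. It is not the implementation used in this study (which relies on RapidMiner); the function names, the toy records, and the Laplace smoothing are assumptions made only for this example.

```python
from collections import Counter, defaultdict


def train_naive_bayes(rows, labels):
    """Count class frequencies and per-class attribute-value frequencies."""
    class_counts = Counter(labels)
    # value_counts[(attribute_index, class_label)][value] -> frequency
    value_counts = defaultdict(Counter)
    # Distinct values seen for each attribute, needed for smoothing.
    distinct_values = defaultdict(set)
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            value_counts[(i, label)][value] += 1
            distinct_values[i].add(value)
    return class_counts, value_counts, distinct_values


def predict(model, row):
    """Return the class with the highest (unnormalised) posterior score."""
    class_counts, value_counts, distinct_values = model
    total = sum(class_counts.values())
    best_label, best_score = None, -1.0
    for label, count in class_counts.items():
        score = count / total  # prior P(class)
        for i, value in enumerate(row):
            # Laplace-smoothed likelihood P(value | class). The smoothing is
            # an assumption of this sketch, not something the paper specifies.
            numerator = value_counts[(i, label)][value] + 1
            denominator = count + len(distinct_values[i])
            score *= numerator / denominator  # independence: multiply per attribute
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy records: (tenor, salary) -> payment status.
rows = [("short", "high"), ("short", "low"), ("long", "low"), ("long", "low")]
labels = ["good", "good", "bad", "bad"]
model = train_naive_bayes(rows, labels)
print(predict(model, ("short", "high")))
print(predict(model, ("long", "low")))
```

Because every attribute contributes one more factor to the product, each irrelevant attribute adds noise to the score, which is the sensitivity to many attributes that motivates attribute selection.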
The Naïve Bayes classification approach can be defined as follows [28], written here in the standard form of Bayes' theorem:
$$P(H \mid X) = \frac{P(X \mid H)\, P(H)}{P(X)}$$
where X is a data record whose class is unknown, H is the hypothesis that X belongs to a given class, P(H | X) is the posterior probability of H given X, P(H) is the prior probability of H, P(X | H) is the probability of X given H, and P(X) is the probability of X.
b. Information Gain Method
The Information Gain method is the simplest method of attribute selection, since it works only by ranking the attributes. It is widely used in text categorization, microarray data analysis, and image data analysis [29]. In the selection of dataset attributes, the pattern classification approach plays a very important role [30]. The Information Gain method can help reduce noise caused by irrelevant attributes [31]. The first step is to calculate the entropy value, which is needed to determine the best attribute. Entropy uses the probability of certain events or attributes to measure class uncertainty [32]. Only after the entropy value has been calculated can the Information Gain be calculated [33].
Entropy is calculated as follows [33]:
$$\mathrm{Entropy}(S) = \sum_{i=1}^{c} -P_i \log_2 P_i$$
where c is the number of values on the classification label and P_i is the proportion of samples in S that belong to class i.
The Information Gain is defined as follows [23]:
$$\mathrm{Gain}(S, A) = \mathrm{Entropy}(S) - \sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|}\, \mathrm{Entropy}(S_v)$$
where A is an attribute, v is a possible value of attribute A, Values(A) is the set of possible values of A, |S_v| is the number of samples with the value v, |S| is the total number of data samples, and Entropy(S_v) is the entropy of the samples that have the value v.
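The entropy and Information Gain definitions above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names and the toy columns are hypothetical, and the study itself computes the weights in RapidMiner.

```python
import math
from collections import Counter


def entropy(labels):
    """Entropy(S): sum of -p_i * log2(p_i) over the classes present in S."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(values, labels):
    """Gain(S, A): Entropy(S) minus the weighted entropy of each subset S_v."""
    total = len(labels)
    subsets = {}
    for value, label in zip(values, labels):
        subsets.setdefault(value, []).append(label)
    remainder = sum(len(subset) / total * entropy(subset)
                    for subset in subsets.values())
    return entropy(labels) - remainder


def rank_attributes(columns, labels):
    """Sort (attribute name, gain) pairs from highest to lowest gain."""
    gains = {name: information_gain(col, labels)
             for name, col in columns.items()}
    return sorted(gains.items(), key=lambda item: item[1], reverse=True)


# Hypothetical toy data: TENOR separates the classes perfectly, AGE not at all.
labels = ["paid", "paid", "late", "late"]
columns = {
    "TENOR": ["short", "short", "long", "long"],
    "AGE": ["adult", "teen", "adult", "teen"],
}
for name, gain in rank_attributes(columns, labels):
    print(name, round(gain, 3))
```

In this toy case the label entropy is 1 bit; splitting on TENOR leaves pure subsets (gain 1), while splitting on AGE leaves both subsets as mixed as the whole (gain 0), so TENOR is ranked first.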

Prototype Development Design
The flowchart implemented in the payment capability prediction prototype proceeds as follows. The user opens the application, which directs the user to the login page. The user enters a username and password to reach the home page. To predict new data, the user selects the import data menu, uploads the data to be predicted, and clicks Import to see the prediction results. After that, the user can choose the preview result menu to see the percentages of the predicted results. To exit, the user logs out.

III. RESULTS AND DISCUSSION
This research begins with the business understanding stage, where the researchers identified two problems: it is very difficult to accurately predict customers' credit payment capability manually from the available dataset, and the Naïve Bayes algorithm is sensitive to a large number of attributes. The next stage is data understanding. At this stage, the researcher combines data from 4 tables, namely active management summary contracts, order management application hiders, order management applications, and acctmgmt.ar contracts. This process yields 995 data records from 2019-2020. The data are contract data from 3 branch offices, namely the Serang, Bandung, and Tasik branches, with a total of 27 attributes and 1 label. Next comes data preparation. At this stage, the researcher performs data reduction, namely selecting the attributes that are relevant to the target to be achieved. The selected attributes are expected to be determinants of the information to be processed. Data reduction leaves 13 attributes, 1 ID, and 1 label. The data transformation stage is then used to obtain a representation suited to the specific task to be performed. For example, an age value of 17-25 becomes the new value "late teenager", as can be seen in Table 1.
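The kind of transformation described above can be sketched as a simple binning function. Only the 17-25 range and its "late teenager" label come from the text; the function name and all other ranges and labels below are hypothetical placeholders.

```python
def age_category(age):
    """Map a numeric age to a categorical label.

    Only the 17-25 -> "late teenager" rule is taken from the text; the
    other ranges and labels are hypothetical placeholders.
    """
    if 17 <= age <= 25:
        return "late teenager"
    if 26 <= age <= 45:
        return "adult"    # hypothetical range and label
    if age > 45:
        return "elderly"  # hypothetical range and label
    return "other"        # hypothetical fallback for ages below 17


print([age_category(a) for a in (17, 25, 30, 60)])
```

Turning numeric values into a small set of categories in this way gives a frequency-based classifier such as Naïve Bayes enough records per value to estimate probabilities.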

Table 1. Data Transformation
The data cleaning stage is then carried out by filling in blank data with the average value and replacing data values that do not match. Together, the stages above produce the clean dataset, or old dataset, which can be seen in Table 2.
Table 2. Dataset Clean or Old Dataset

The experimental process
The researchers conducted the experimental process using RapidMiner 9.9 to perform Data Cleaning, Information Gain, Dataset Comparison, Modeling, and Evaluation. RapidMiner was selected because it is considered suitable for research and prototyping and supports all steps of the data mining process, such as data preparation, result visualization, validation, and optimization [34]. The experimental process can be seen in Figure 3. The researcher uses the attributes that have the most influence on the label. Based on the results of the Information Gain calculation, these attributes become the attributes of the new dataset. The new attributes, with their Information Gain weights, can be seen in Table 3.

Dataset Comparison
In this process, the researcher compares the old dataset and the new dataset on the Naïve Bayes algorithm using 10-fold cross-validation: the dataset is divided into 10 parts, of which 9 are used as training data and the remaining 1 is used as testing data. A difference test using the t-Test was then performed to determine whether the difference between the datasets used with the Naïve Bayes algorithm is significant. The results can be seen in Tables 4 and 5.
Table 4. Comparison of 10-fold cross-validation
Table 5. Different Test (t-Test)
Based on the results of the 10-fold cross-validation test, accuracy, precision, and AUC increase for the Naïve Bayes algorithm when using the new dataset compared to the old dataset, but in the t-Test with an alpha value of 0.05 the difference is not significant.
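The comparison procedure can be sketched as follows: split the record indices into 10 folds, score each dataset fold by fold, and test the per-fold differences. The sketch assumes a paired t-test over folds; the text does not state which t-test variant RapidMiner applied, and the function names are hypothetical.

```python
import math
import random


def k_fold_indices(n_records, k=10, seed=0):
    """Yield (train_indices, test_indices) for each of k folds."""
    indices = list(range(n_records))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def paired_t_statistic(scores_a, scores_b):
    """t statistic of the mean per-fold difference between two score lists.

    Assumes the differences are not all identical (non-zero variance).
    """
    diffs = [b - a for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean = sum(diffs) / n
    variance = sum((d - mean) ** 2 for d in diffs) / (n - 1)
    return mean / math.sqrt(variance / n)


# Two-tailed critical value of Student's t for alpha = 0.05 and
# 9 degrees of freedom (10 folds - 1).
T_CRITICAL_DF9 = 2.262


def is_significant(scores_a, scores_b, critical=T_CRITICAL_DF9):
    """True when the absolute t statistic exceeds the critical value."""
    return abs(paired_t_statistic(scores_a, scores_b)) > critical


for train, test in k_fold_indices(995):
    print(len(train), len(test))
```

With 995 records the folds hold 99 or 100 records each. An average improvement across folds can still fail this test when the per-fold differences vary widely, which is how an increase in accuracy can coexist with a non-significant t-Test result.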

Modeling
The researcher models the old dataset and the new dataset using the Naïve Bayes algorithm, with the split data operator dividing the data into random subsets at a ratio of 90% (Training) : 10% (Testing), as shown in figure 3.
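A random 90% : 10% split of this kind can be sketched as below. The function name, the fixed seed, and rounding the test size down are assumptions of this example; the text does not state how RapidMiner's split data operator rounds.

```python
import random


def split_data(records, test_percent=10, seed=42):
    """Shuffle the records and return (training, testing) subsets."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    # Integer arithmetic avoids floating-point surprises; the test size is
    # rounded down (an assumption of this sketch).
    n_test = len(shuffled) * test_percent // 100
    return shuffled[n_test:], shuffled[:n_test]


training, testing = split_data(range(995))
print(len(training), len(testing))  # 896 99
```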

Evaluation
This process will compare the model testing of the old dataset and the new dataset that have been determined in the modeling process by measuring their performance. The results of the performance comparison can be seen in table 6.

The following are the results of model testing based on performance measurements and AUC on the old dataset. The results can be seen in figures 5 and 6. Based on the performance and AUC results above, the performance comparison is as follows:
Table 6. Result Comparison Performance
The performance of the model was measured with the split data operator on random subsets at a ratio of 90% (Training) : 10% (Testing), as can be seen in Table 6. The results show that the Naïve Bayes algorithm performs much better with the new dataset, built from the Information Gain calculation, than with the old dataset in terms of Accuracy, Precision, Recall, and AUC. The percentage increases in performance using the new dataset are as follows: Accuracy = 8%, Precision = 18.42%, Recall = 17.65%, and AUC = 0.057%.
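For reference, accuracy, precision, and recall follow directly from confusion-matrix counts, as the sketch below shows. The counts are hypothetical: the confusion matrix is not given in the text, and these values were chosen only because, for a 99-record test set, they reproduce the final figures reported for this study (accuracy 87.88%, precision 61.90%, recall 76.47%).

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, and recall from confusion-matrix counts."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,   # correct predictions / all predictions
        "precision": tp / (tp + fp),     # true positives / predicted positives
        "recall": tp / (tp + fn),        # true positives / actual positives
    }


# Hypothetical counts, not taken from the paper (see the note above).
metrics = classification_metrics(tp=13, fp=8, fn=4, tn=74)
for name, value in metrics.items():
    print(f"{name}: {value * 100:.2f}%")
```

Precision is noticeably lower than accuracy in this example because the positive class is small, so a handful of false positives weighs heavily on it.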

Development
This process is based on the results of the dataset comparison and the evaluation of the model testing above. The Naïve Bayes algorithm achieves a good level of accuracy and performance with the new dataset based on the Information Gain calculation, so the rules generated by the Naïve Bayes algorithm can be used as the rules for building the prototype. The researcher hopes that this prototype can make it easier for PT. XYZ to predict customers' capability to pay. The flowchart of the prototype can be seen in figure 3.
The prototype used in this study is desktop-based, written in the Delphi 7.0 programming language with a MySQL database. The main forms of the prototype's Graphical User Interface (GUI) for predicting customers' credit payment capability can be seen in the images below.

Figure 9. Login Form
When the application runs, the first form displayed is the login form. For access security, the user must have a username and password. The import menu is used to make predictions on new data: the user uploads the data to be predicted and clicks Import to see the prediction results. The user can then select the preview result menu to see the percentages of the predicted results.

Figure 12. Prediction Result Percentage Form
Testing is carried out to determine whether the application works according to the expected functionality. Testing of the file upload is divided into two parts, namely selecting the file and uploading the file. In testing the results, the user simply clicks on the results.
IV. CONCLUSION
Based on the Research Process Flow carried out by the researchers, it can be concluded that:
1. The data preparation process produces 13 attributes, 1 ID and 1 label with a total of 995 data records from 2019-2020.
2. The process of calculating the Information Gain method on the old dataset produces a new dataset with seven attributes: TENOR, SALARY, DOWN PAYMENT, INSTALLMENT, APPROVAL, OTR CLASS, AGE and Label: Status and Id: Id_number.
3. In the dataset comparison process, there is an increase in accuracy, precision, and AUC using the new dataset compared to the old dataset, but in the t-Test with an alpha value of 0.05 the difference is not significant.
4. In the evaluation process, performance experienced a significant increase in the use of new datasets with the following percentage increases in performance: Accuracy = 8%, Precision = 18.42%, Recall = 17.65% and AUC = 0.057%.
5. In the development process, based on Black Box testing, the application that was built runs well and as expected.
The results of this study obtained an AUC of 0.876, accuracy of 87.88%, precision of 61.90%, and recall of 76.47%, which is classified as good classification. This study has not been able to provide good t-Test results because the t-Test shows no significant difference. Based on these findings, it is suggested that future researchers use chi-square, log likelihood ratio, or other methods to select the most influential attributes in order to obtain good t-Test results for the Naïve Bayes algorithm.