Implementation of K-Means Clustering Algorithm in Mapping the Groups of Graduated or Dropped-out Students in the Management Department of the National University

The dropout rate at the National University is still high, and the university must make efforts to anticipate this rate and increase the number of graduates. This study aims to determine the characteristics of students who are likely to graduate or drop out (DO) in the management department of the National University, Jakarta. The study was conducted by implementing the K-Means algorithm, in which each data point is assigned to the cluster whose centroid is closest. Assignment to cluster C1 (graduate) or C2 (drop out) is based on the attributes student status (active, leave, out and non-active), educational status (graduated or DO), GPA, total credits taken and length of study. To facilitate the clustering process, the Orange tool, which provides a K-Means feature, was used. The input data covered 1,988 students from various classes. As a result, a pattern or mapping of graduated and DO students was found based on these attributes. Testing the clusters with the silhouette method, which measures the distances between cluster members in both C1 and C2, showed a good silhouette value, reaching 85%, which indicates that this clustering method can be applied as an effort to overcome the high dropout rate. The management department of the National University can use the results of this study to predict the graduation of its students. However, along the way, not a few students have dropped out (DO) in the management department of the National University. This has an impact on the ups and downs of credibility and of the


INTRODUCTION
The management department of the National University is part of the Faculty of Economics, which was founded in 1964; its status was registered and recognized in 1985 by the Ministry of Education and Culture. Since then, the department has graduated thousands of students, who have gone on to work in companies, government agencies and universities, and as entrepreneurs. The National University has been recognized as one of the higher education institutions in Indonesia [1].
The National University has documented its student data well in its system. However, the data are still in raw form and therefore difficult to read. To make better use of these data, the National University should apply data mining to them. Data mining methods are trusted and have been widely used by other institutions to extract more value from their data. Therefore, in this study, the student data of the Management Department of the National University were processed and grouped using clustering. As other researchers have done, this study uses the K-Means clustering method. The data set processed in this study comprised 1,988 records. With this method, a pattern and mapping are generated that can be used to see and predict the likelihood that students will graduate or drop out.

II. RESEARCH METHOD
The research methodology in the clustering process using the K-Means method to map the groups of graduated or dropped-out students at the Management Department of the National University includes several stages: literature review, determination of the data set, data pre-processing (which consists of data validation, data transformation and data reduction), and data exploration using the Orange application. To test the results of this study, the silhouette method was applied. The results show whether or not this clustering method can be used as an effort to overcome the high dropout rate at the National University. The flow and stages of the research carried out for this study are shown in Figure 1.

A. Literature Review
To support this research, the authors use several trusted scientific journals, proceedings, e-books and websites. These sources discuss clustering, such as the Application of the K-Means Method for Student Clustering Based on Academic Values with a Weka Interface Case Study at the Department of Informatics, UMM Magelang [12], Prediction of Students Academic Execution Using K-Means and K-Medoids Clustering Technique [18], K-Means Cluster Analysis in the Grouping of Students' Capabilities [17] and the Implementation of the K-Means Method in Mapping Student Groups through Lecture Activity Data [15].
Based on the literature review, the K-Means clustering method has been widely used in education, especially for grouping students, both by learning activities and for predicting academic scores. The reported grouping percentages show good results, so the method can be applied to solve problems at the university. The focus of this research is to cluster students who are likely to graduate or drop out (DO) using the K-Means method.

B. Dataset Determination
The data set in this study was obtained from the Department of Management, the National University, Jakarta, and was then passed to the data pre-processing stage. The data set comprised 1,988 records (rows) containing detailed academic data on students majoring in Management at the National University from various generations, from 1990 to 2016. Of the students in the data, 56% were male and 44% were female. By class, 48% or 960 students were from the classes of 1990-1995, 17% or 343 students from 1996-2000, 20% or 390 students from 2001-2005, 3% or 65 students from 2006-2010, and 12% or 230 students from 2010-2016. Several attributes were used as the basis for creating the student clusters, pass or drop out: the GPA score, the number of credits taken, length of study, initial status and final status of students. From these attributes, it is hoped that clusters will be formed that can help the National University predict students' graduation or drop-out.
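As a quick arithmetic check, the cohort counts quoted above can be summed and converted to percentages. This is only a sketch: the dictionary name and the labels are illustrative, and the counts are copied from the text.

```python
# Cohort sizes as reported in the text (entry-year range -> number of students).
cohorts = {
    "1990-1995": 960,
    "1996-2000": 343,
    "2001-2005": 390,
    "2006-2010": 65,
    "2010-2016": 230,
}

total = sum(cohorts.values())

# Percentage share of each cohort, rounded to the nearest whole percent.
shares = {label: round(100 * count / total) for label, count in cohorts.items()}

for label, count in cohorts.items():
    print(f"{label}: {count} students ({shares[label]}%)")
print("total:", total)
```

The counts add up to the 1,988 records stated in the text, and the rounded shares match the quoted percentages.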

C. Data Pre-processing
This stage includes three main parts: data validation, data transformation, and data reduction. Several features of the Orange application are used to carry out this stage. The details of the three parts are:

Data validation
The data validation stage in this study was carried out to ensure that the training data set was in good condition and contained no missing values. Missing values here arise from incomplete data, outliers (abnormal data) or data with inconsistent values. The validation stage yields a data set with no missing values, in normal condition for the next stage.

Data transformation
The next stage is data transformation, which is performed to maintain and ensure the accuracy of the training data set. For this stage, the outlier technique provided by the Orange application, namely the Outlier Detection Method feature, is applied. This data transformation process yields more accurate data for the next stage.

Data reduction
Data reduction is done to draw a sample from the existing training set, so that the selected data are ready to be grouped. Of the total 1,988 data records, filtering was carried out until about 60% of the data, or 1,173 records, remained. From this stage, sample data are generated that are ready to be processed in the next stage.
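The reduction step can be sketched as simple random sampling without replacement. This is a minimal illustration, not the Orange "Data Sampler" itself: the function name, the seed and the rounding rule are assumptions. Rounding 60% of the 1,954 validated records (the count reported in the results section) upwards reproduces the 1,173 records quoted above.

```python
import math
import random


def sample_records(records, fraction=0.6, seed=0):
    """Return a simple random sample (without replacement) of the records."""
    # Rounding up is an assumption; it reproduces the 1,173 records in the text.
    k = math.ceil(fraction * len(records))
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    return rng.sample(records, k)


# Placeholder IDs standing in for the 1,954 validated student records.
validated = list(range(1954))
sample = sample_records(validated)
print(len(sample))
```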

E. Clustering Result Analysis
At this stage, the results of the grouping obtained from calculations using the Orange application are analyzed and reviewed. From this stage, patterns may be found as the results of K-Means clustering.

A. Data Set
The first stage carried out in this study was the collection and definition of the dataset, as shown in Figure 2. The data cover students of the Management Department of the National University from various generations between 1990 and 2016, and 1,988 data records are available. The dataset is then uploaded to Orange to facilitate data processing, including data preparation. In this set, there were still missing values amounting to 0.3% of the data, affecting around 34 rows or data records. The attributes include:
c. Duration of study, namely the length of study that students have taken until they graduate or drop out. It is a numerical attribute.
d. Initial Status, namely student status during the lecture process. The authors have classified this data and made it a numerical attribute. The value is 1 for "active" status, 2 for "inactive" status, 3 for "leave" status and 4 for "out" status.
e. Final Status, namely the final academic status of students, graduated or dropped out (DO). The authors made it a numerical attribute with a value of 1 for "pass" status and 2 for "Drop Out" status.

B. Data Pre-processing
This section contains the data preparation stage, which consists of data validation, data transformation and data reduction. The authors use the Orange application in all stages of this preparation to make the data processing easy and effective.
a. Data validation
As previously explained, there are missing values in the dataset used in this study, so the data must be pre-processed first. This also improves the validity of the existing data. The Preprocess feature in Orange is used for this purpose, and the Data Table feature can be used to view the data, including the validation results. As a result, the missing values in the dataset were eliminated, so that the number of records became 1,954 and processing could continue to the next step. These results are shown in Figure 3.
Data processing is one of the important stages in the clustering process, because this is where the data are actually processed: the values are calculated and the results are analyzed. In this study, data processing and calculation were carried out using the features of the Orange application. Several features are used, such as the Outlier feature, K-Means, Select Rows and a scatter plot to see the clustering results.
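The status encoding and the removal of records with missing values described above can be sketched as follows. The numeric codes come from the attribute descriptions. The field names and the sample rows are hypothetical and are not the authors' schema, and the plain-Python filter only stands in for Orange's Preprocess feature.

```python
# Numeric codes taken from the attribute descriptions in the text.
INITIAL_STATUS = {"active": 1, "inactive": 2, "leave": 3, "out": 4}
FINAL_STATUS = {"pass": 1, "drop out": 2}


def encode(record):
    """Return a copy of the record with status labels replaced by numeric codes."""
    encoded = dict(record)
    encoded["initial_status"] = INITIAL_STATUS.get(record["initial_status"])
    encoded["final_status"] = FINAL_STATUS.get(record["final_status"])
    return encoded


def drop_missing(records):
    """Keep only records in which every attribute has a value."""
    return [r for r in records if all(v is not None for v in r.values())]


# Hypothetical rows; the second one has a missing GPA and is removed.
raw = [
    {"credits": 144, "gpa": 3.2, "years": 4.0,
     "initial_status": "active", "final_status": "pass"},
    {"credits": 60, "gpa": None, "years": 7.0,
     "initial_status": "inactive", "final_status": "drop out"},
    {"credits": 90, "gpa": 2.5, "years": 6.5,
     "initial_status": "leave", "final_status": "drop out"},
]

clean = drop_missing([encode(r) for r in raw])
print(len(clean))
```

Applied to the real data, a filter of this kind would correspond to the reduction from 1,988 to 1,954 records reported above.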

III. RESULTS AND DISCUSSION
Research on clustering students majoring in Management at the National University starts from collecting training data sets, then designing and implementing data mining. A pattern is then generated that can be used to predict the possibility of students graduating or dropping out. These stages include identification of training sets, data preparation, data exploration using Orange and analysis of the results.
b. Data transformation
To perform the data transformation process, the Outlier feature in Orange is used. An outlier is an unnatural (anomalous) record in the data set, which must be corrected before the next stage. The results are then displayed in the form of Data Inliers, meaning the normal data, that is, all data other than the outliers. The Outlier feature as a transformation process in Orange is shown in Figure 4. As additional information, before the dataset was entered into the Orange application, some data preparation was carried out: several attributes deemed necessary were added, and some attributes containing categorical data were converted into numeric data. This aims to facilitate the grouping process with the K-Means feature. These attributes include initial status, final status and length of study, as described in the dataset section of this paper.
c. Data reduction
This data reduction is basically a sampling of the existing data, needed because there are quite a lot of rows. Besides streamlining the algorithm, sampling makes the processing lighter and faster. Although several rows of data are removed at this stage, the quality of the resulting data remains the same, so it still meets the research requirements.
In the Orange application, this stage can be performed with the "Data Sampler" feature, which is linked from the Preprocess feature. As previously explained, the data used at this stage are about 60% of the 1,954 validated records, or 1,173 records. The technique used is random sampling, in which the data are selected at random. The Outlier feature is then applied to the "Data Sampler" output to obtain the inlier data, namely all data other than the outliers. After this further processing, a total of 592 records are obtained from all existing data. The inlier data obtained as output are shown in Figure 5 and will be processed further at a later stage. At this point, the data are ready to be processed and explored further.
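The split into inliers and outliers can be illustrated with a simple rule. The text does not say which detection method the Orange Outlier feature was configured with, so the per-attribute z-score rule below is a stand-in chosen for brevity, not the authors' method. The function name, the threshold and the sample rows are all hypothetical.

```python
from statistics import mean, pstdev


def split_inliers(rows, threshold=3.0):
    """Split rows into (inliers, outliers) using per-attribute z-scores.

    A row counts as an outlier if any attribute lies more than `threshold`
    standard deviations from that attribute's mean.
    """
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    stds = [pstdev(col) for col in columns]

    inliers, outliers = [], []
    for row in rows:
        is_outlier = any(
            std > 0 and abs(value - mu) / std > threshold
            for value, mu, std in zip(row, means, stds)
        )
        (outliers if is_outlier else inliers).append(row)
    return inliers, outliers


# Hypothetical (credits, GPA, years of study) rows; the last one is extreme.
rows = [(140 + i % 5, 3.0 + 0.01 * i, 4.0 + 0.1 * (i % 3)) for i in range(20)]
rows.append((10, 0.5, 14.0))

inliers, outliers = split_inliers(rows)
print(len(inliers), len(outliers))
```

Only the inliers would be passed on to the clustering step, mirroring the flow from "Data Sampler" to the Outlier feature described above.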

Figure 5. Data sampling inliers

d. Data exploration
In processing and exploring the dataset using Orange, the emphasis is on the features. In addition, the intensity of the relationship between attributes is identified. At this stage, a trial with the Univariate Analysis technique was also carried out. Testing could also be done with the Bivariate Analysis and Multivariate Analysis techniques; however, due to limitations, only Univariate Analysis is used, in which the properties of each attribute are investigated. The technique is applied through the "Feature Statistics" feature provided by Orange, as shown in Figure 6. From Figure 6, the results of data processing can generally be read. The data center for each attribute is SKS (credits) = 113.43, GPA = 2.87, length of study = 5.75 years, initial status = 1.57 and final status = 1.35.
Next is data processing using the K-Means method, with the aim of grouping the students' data into 2 clusters. To do clustering in Orange, use the "K-Means" feature and then select the attribute row that becomes the center of the cluster with the "Select Row" feature. In the resulting clusters, cluster C1 shows students who successfully passed, while cluster C2 shows students who drop out. In theory, according to [15], the calculation steps of the K-Means method are:
a) Determine the number of clusters.
b) Allocate the data to clusters randomly.
c) Calculate the centroid (average) of the data in each cluster.
d) Allocate each data item to the nearest centroid (average).
e) Return to step c) if any data have still moved to a different cluster, if the change in the centroid value is above the specified threshold, or if the change in the value of the objective function is above the specified threshold.
One distance measure used in the application of K-Means is the Euclidean distance (L2 norm).
The distance between two points is formulated as follows:

$$d(x, y) = \lVert x - y \rVert_2 = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$

Information: d = distance (Euclidean distance), x = the center of the cluster, y = data, n = amount of data, i = index of the data.

Furthermore, [15] describes using the shortest distance between the centroid and the document to determine the cluster position of a document. For example, if document A has a shorter distance to centroid 1 than to the others, then document A is included in group 1. The position of each new centroid (Ci.j) is then recalculated by taking the average of the documents that entered the initial cluster (Gi..j). Iteration is carried out continuously until the group positions no longer change. The following is the formula for determining the centroid:

$$C = \frac{x_1 + x_2 + \dots + x_n}{\sum x}$$

Information: x_1 = the value of the 1st data record, x_2 = the value of the 2nd data record, Σx = number of data records.

Orange makes it quite easy to visualize the results of the clustering with the "Scatter Plot" feature. The cluster graph was then analyzed and tested. For more details, one of the cluster graphs is shown in Figure 8.
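The steps a) to e), together with the Euclidean distance and the centroid average, can be sketched in plain Python as below. This is a minimal illustration under stated assumptions, not Orange's K-Means implementation: the function names, the seed, the handling of an emptied cluster and the sample (years of study, credits) points are hypothetical.

```python
import math
import random


def euclidean(x, y):
    """Euclidean (L2) distance between two equal-length points."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))


def centroid(points):
    """Component-wise mean of a non-empty list of points."""
    return tuple(sum(values) / len(points) for values in zip(*points))


def k_means(points, k=2, seed=0, max_iter=100):
    """Cluster points following steps a) to e); returns (labels, centroids)."""
    rng = random.Random(seed)
    # b) allocate the data to clusters randomly, keeping every cluster non-empty
    labels = [i % k for i in range(len(points))]
    rng.shuffle(labels)
    for _ in range(max_iter):
        # c) centroid of each cluster (an emptied cluster is reseeded at random)
        centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            centroids.append(centroid(members) if members else rng.choice(points))
        # d) move every point to its nearest centroid
        new_labels = [
            min(range(k), key=lambda c: euclidean(p, centroids[c])) for p in points
        ]
        # e) stop once no point changes cluster
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centroids


# Hypothetical (years of study, credits) pairs: four graduates, four drop-outs.
students = [
    (4.0, 144), (4.5, 144), (5.0, 140), (5.5, 144),
    (7.0, 60), (8.0, 80), (7.5, 40), (9.0, 90),
]

labels, centroids = k_means(students, k=2)
print(sorted(centroids))
```

Which group receives label 0 or 1 depends on the random start, so the sketch compares centroids rather than label numbers. Because credits and years are on different scales, the credits dominate the distance in this toy example.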

C. Analysis of Classification Results
The cluster graph shown in Figure 8 illustrates the grouping of students in the Department of Management of the National University who have passed or dropped out, based on the number of credits taken and the length of study. Cluster C1, marked in blue, contains the group of students who have passed. Cluster C2, marked in red, is the group of students who drop out. In cluster C2, the length of study for DO students is over 6 years, with the number of credits taken below 100. In cluster C1, the students completed their education in 6 years or less with the full 144 credits. This graph shows the irregular academic condition of students who drop out: they have been enrolled and studying for a long time, but still attend lectures with a small number of credits. This is an anomalous condition in which the length of study is inversely proportional to the number of credits taken.
Figure 9 shows the clustering of students who graduated or dropped out (DO) based on the GPA obtained and the length of study. Cluster C2, marked in red, shows the group of students who drop out, while cluster C1, marked in blue, is the group of students who have completed their studies. From this graph, it can be seen that, on average, students in cluster C1, who graduate and complete their studies, have a GPA >= 3 with a length of study <= 6 years. Meanwhile, the group of students who dropped out, cluster C2, had a GPA < 3 with a length of study >= 6 years. This is quite reasonable, because students who drop out usually have a GPA below the average, especially if they have taken courses for more than 4 years.
The cluster graph in Figure 10 emphasizes the relationship between the number of credits a student has taken, or the student's GPA, and the length of study. The graph shows students who finally passed, cluster C1, in blue, and the group of students who drop out in red. In line with the patterns in Figure 8 and Figure 9, this graph shows that the group of students who graduated, cluster C1, has a study period of under 6 years. Meanwhile, cluster C2, students who drop out (DO), has a study period of more than 6 years.
This study shows that preventive measures are needed for students who have studied beyond the normal duration of 4 years but are still recorded as taking a small number of credits, so that they do not drop out. The university should take anticipatory steps so that these students can complete their studies. Furthermore, students who have a GPA below 3 and have studied beyond the normal duration of 4 years need to be given special attention as a preventive measure.
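The two warning patterns above can be written as a simple screening rule. This is a hedged sketch: the function and parameter names are hypothetical, and the text does not define "a small number of credits", so the default of 100 is borrowed from the description of Figure 8 (drop-out students with fewer than 100 credits) as an assumption.

```python
def needs_attention(gpa, credits, years,
                    normal_years=4, gpa_cutoff=3.0, low_credits=100):
    """Return True if a student matches either warning pattern.

    Pattern 1: studying beyond the normal duration with few credits.
    Pattern 2: studying beyond the normal duration with a GPA below the cutoff.
    The low_credits default is an assumption, not a value fixed by the study.
    """
    over_time = years > normal_years
    few_credits = over_time and credits < low_credits
    low_gpa = over_time and gpa < gpa_cutoff
    return few_credits or low_gpa


# Hypothetical students given as (GPA, credits, years of study).
print(needs_attention(3.4, 144, 4))  # on time, full credits
print(needs_attention(3.2, 60, 6))   # over time, few credits
print(needs_attention(2.5, 130, 5))  # over time, GPA below 3
```

A rule of this kind only flags students for follow-up; the thresholds would need to be set by the department.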
There are several other insights that can be taken from this research, which applies clustering with the K-Means method, based on Figure 7, including:
a) Students who have taken SKS >= 90 but have GPA >= 2 need to be given special attention. These students need to be supported in order to increase their achievement index, especially if they have studied beyond the normal duration of 4 years.
b) Students whose status was initially inactive and who have only taken SKS <= 90 need special attention. Their situation should not be allowed to drag on and end in a drop out (DO) as the final status, especially if they have studied beyond the normal duration of 4 years. The university needs to be proactive in communicating with these students.
c) Students who have GPA >= 3 and have taken SKS >= 90, but whose length of study is above the normal 4 years, need to be given special attention, especially if they have been recorded as inactive or on leave.
To test the clusters produced in this study, the Silhouette Plot feature, which is also included in the Orange application, is used. According to [19], the Silhouette Coefficient is used to assess the quality and strength of a cluster, that is, how well an object is placed in a cluster. This method is a combination of the cohesion and separation methods. The stages of calculating the Silhouette Coefficient are as follows:
a) Calculate the average distance from a document, say i, to all other documents in the same cluster:

$$a(i) = \frac{1}{|A| - 1} \sum_{j \in A,\, j \neq i} d(i, j)$$

where j is another document in the same cluster A and d(i, j) is the distance between documents i and j.
b) Calculate the average distance from document i to all documents in each other cluster, and take the smallest value:

$$b(i) = \min_{C \neq A} d(i, C)$$

where d(i, C) is the average distance from document i to all objects in another cluster C, with A ≠ C.
c) The Silhouette Coefficient value is:

$$s(i) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}}$$

The Silhouette Coefficient results shown by Orange in this study are presented in Figure 11. The average silhouette value is used to estimate the quality of the clusters formed: the higher the average value, the better. Based on the graph in Figure 11, the optimal cluster was formed at C1, k = 1, with an average Silhouette value of more than 0.85 or 85%. All members of cluster C1, the graduated students, showed values around the average. In cluster C2, k = 2, the Silhouette values vary between 0.5 and 0.8, which is still quite good. This shows that the clusters formed in this study have been tested and can be used by the Management Department of the National University to predict the likelihood that students will graduate or drop out (DO).
This study used the K-Means method to classify students who graduate or drop out (DO). The results of the clustering are used as predictions of whether students' academic conditions will lead them to pass or to drop out. Insights obtained from these results can be used by the Department of Management of the National University to take preventive steps and ensure that students do not drop out. The K-Means method is easily implemented in Orange because it has proven features, so it is quite fast and accurate. The results of the silhouette test in this study reached 85%, which indicates that the clustering method for determining the characteristics of students who are likely to graduate or drop out (DO) in the management department of the National University, Jakarta, can be applied as an effort to overcome the high dropout rate. Clustering results like these, for example, make it easier for universities to control and observe the academic conditions of their students.
Finally, the campus can improve academic services and graduate students with good academic conditions.
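The silhouette test used in this study can be illustrated with a short sketch. It uses the standard definition: a(i) is the mean distance from point i to the other members of its own cluster, b(i) is the smallest mean distance from i to the members of any other cluster, and s(i) = (b(i) - a(i)) / max(a(i), b(i)). The function name and the sample points are hypothetical, and the code does not reproduce Orange's Silhouette Plot.

```python
import math


def silhouette_values(points, labels):
    """Silhouette coefficient s(i) of every point, given its cluster label."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)

    values = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            values.append(0.0)  # common convention for singleton clusters
            continue
        # a(i): mean distance to the other members of the same cluster
        # (the distance from p to itself is 0, so it adds nothing to the sum)
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b(i): smallest mean distance to the members of any other cluster
        b = min(
            sum(math.dist(p, q) for q in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denominator = max(a, b)
        values.append((b - a) / denominator if denominator else 0.0)
    return values


# Hypothetical, well-separated points in two clusters.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
labels = [0, 0, 1, 1]

scores = silhouette_values(points, labels)
mean_score = sum(scores) / len(scores)
print(round(mean_score, 3))
```

Values close to 1 indicate compact, well-separated clusters, which is how the average of about 0.85 reported for cluster C1 is read.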