Music Genre Recommendations Based on Spectrogram Analysis Using Convolutional Neural Network Algorithm with RESNET-50 and VGG-16 Architecture

Abstract: Recommendations are a useful tool in many industries. Compared with ordinary search, they present the user with the items most likely to match what the user wants, which improves satisfaction. In the music industry, recommendations are used to suggest songs that are similar in genre or theme. Music spans many genres, including pop, classical, reggae and others, and songs from different genres can usually be told apart by ear. Genre can also be analyzed through the spectrogram of a song. The Convolutional Neural Network (CNN) is a neural network algorithm commonly used to recognize and classify image data. In this study, spectrogram images are used as the input features for a CNN, which classifies the songs and provides song recommendations according to what the user wants. Two CNN architectures, VGG-16 and RESNET-50, were tested. The best accuracy, 60%, was obtained by the VGG-16 model trained for 20 epochs, compared with the RESNET-50 model trained for more than 20 epochs. The recommendations generated on the test data also had better similarity values for VGG-16 than for RESNET-50.


I. INTRODUCTION
Music is an inseparable part of people's lives. It often accompanies people in their activities, and listening to music can affect the listener's mood. People usually listen to music that matches their feelings at the time, so music plays an important role in managing people's psychological state [1].
Listening to the right song can influence a listener's feelings. However, because of the large amount of music available on the internet and through music service applications, it is difficult for people to choose the songs they want. Music also varies widely in genre, speed, tempo and theme [2]. Western songs, for example, are divided into genres such as Hip-Hop, International, Electronic, Folk, Experimental, Rock, Pop and Instrumental. This variety makes it difficult for music lovers to choose the right song.
Music lovers usually find the music they want manually, for example by asking friends for recommendations or by listening to music shows [3]. Often the song they end up listening to does not match their mood, or belongs to a genre they do not enjoy. Recommendations are implemented in various online music player platforms to improve the listening experience. A recommendation system is able to predict the music a user is likely to prefer. Besides benefiting users, recommendations are also useful for music service providers, because they can increase user satisfaction with the service.
Deep learning is a branch of machine learning based on artificial neural networks. With deep learning, a computer can classify and recommend data in the form of images or sounds [4]. One of the methods commonly used for classification and recommendation is the Convolutional Neural Network (CNN). CNN is an extension of the Multilayer Perceptron and learns from images using supervised learning, in which the network is trained against target outputs from labelled examples. Several architectures can be used with CNN to improve classification results, including VGG, MobileNet and ResNet. ResNet, short for Residual Network, is a classic neural network architecture [5]. It won the ImageNet challenge in 2015, is easier to optimize, and can gain accuracy from greatly increased depth. ResNet-50 was reported as the best-performing CNN architecture in the research by Talo on the classification of brain diseases from MRI images [6].
The spectrogram is a visual representation of the frequency spectrum of a signal and is formed using the Fourier transform. To build a spectrogram with the FFT (Fast Fourier Transform), the time-domain data is first broken into several segments, and a Fourier transform is then applied to each segment to calculate the magnitude of its frequency spectrum.
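The segment-by-segment procedure described above can be illustrated with a minimal sketch. It uses only the Python standard library and a direct discrete Fourier transform rather than an optimized FFT, so it is suitable only for short signals. The function name, the frame size, the hop size and the Hann window are assumptions chosen for illustration and are not taken from this study, which relies on Librosa.

```python
import cmath
import math


def magnitude_spectrogram(signal, frame_size=64, hop=32):
    """Split `signal` into overlapping frames and return one magnitude
    spectrum (a list of floats) per frame, using a direct DFT."""
    # A Hann window (an assumption of this sketch) reduces leakage at frame edges.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / (frame_size - 1))
              for n in range(frame_size)]
    frames = []
    for start in range(0, len(signal) - frame_size + 1, hop):
        frame = [signal[start + n] * window[n] for n in range(frame_size)]
        spectrum = []
        # For real input only the first frame_size // 2 + 1 bins are unique.
        for k in range(frame_size // 2 + 1):
            acc = sum(frame[n] * cmath.exp(-2j * math.pi * k * n / frame_size)
                      for n in range(frame_size))
            spectrum.append(abs(acc))
        frames.append(spectrum)
    return frames


# Toy input: a pure tone that completes 8 cycles in every 64-sample frame.
tone = [math.sin(2 * math.pi * 8 * n / 64) for n in range(256)]
spec = magnitude_spectrogram(tone)

# Index of the strongest frequency bin in each frame.
peak_bins = [max(range(len(s)), key=s.__getitem__) for s in spec]
```

Each inner list of `spec` corresponds to one column of the spectrogram image. A mel spectrogram additionally maps these frequency bins onto the mel scale, a step that is omitted here.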
The spectrogram is very useful for analyzing sound, because it shows the magnitude of each frequency at a given time. Music recommendations can also be made based on the mood of the user. In the research by Amala George et al., the system analyzes the user's mood from his or her face using the CNN algorithm [2]. Recommendations are then given by a recommendation module based on the result of this mood classification. The reported accuracy is 98%.
Music genre classification with the CNN algorithm has also been studied by analyzing spectrogram images. Spectrogram images are generated from the music and then used to train a CNN model. That research found that training for 35 epochs gave the best accuracy, 81.33%, and that CNN produced better accuracy than the KNN method [1]. Other research on music genre classification using CNN and mel spectrograms found that the number of datasets, the number of training iterations and the computer specifications greatly affect the accuracy and the duration of modeling. The resulting accuracy in classifying music genres was very high: 99% with the ReLU activation function and 95% with ELU [7].
Genre-based music recommendation has also been carried out using the Convolutional Recurrent Neural Network (CRNN). That study also used spectrograms and compared CRNN with CNN for classifying music genres. The results show that CRNN, which takes into account both frequency features and time-sequence features, performs better than CNN [8]. Research on next-song recommendation has also been carried out, in which neural networks performed well in all types of tests. That study concluded that the NN-based next-song recommenders, CNN-rec, NN-rec and Word2Vec, outperform the non-NN-based ones [9], and demonstrated that NN-based next-song recommenders, which combine users' general preferences with sequential listening patterns, have the highest performance.
Music recommendation using deep content-based methods was also studied by Aäron van den Oord et al. [10]. Their research showed that recent advances in deep learning translate very well to the music recommendation setting, with deep convolutional neural networks significantly outperforming a more traditional approach that uses bag-of-words representations of audio signals. Other research on music recommendation used user behaviour [11]. That approach considered genre, recording year, freshness, favor and time pattern as factors for recommending songs, and the evaluation results demonstrated that the approach is effective.
Research on music recommendation by genre has also compared several machine learning algorithms, namely KNN, RF, NB, DT and SVM [12]. According to the results summarized in that research, SVM achieved better classification results than the other methods, and changing the window size and window type caused only very small changes in performance. Jonseol Dee et al. studied music recommendation by comparing similarity based on a decided genre value with similarity based on feature vector distance. Their paper proposed a recommendation system based on preference classification using real-time user brainwaves and genre feature classification. The proposed user preference classifier achieved an overall accuracy of 81.07% [13].
Based on this previous research, this study carries out a music genre recommendation process using the GTZAN dataset, which is composed of 10 genres. The music data is first converted into spectrograms, which are then classified using the CNN algorithm with the RESNET50 and VGG16 architectures. The resulting recommendations are tested to determine whether they match the song desired by the user.

II. RESEARCH METHODOLOGY
The method used in this research consists of dataset preparation, pre-processing, spectrogram generation, classification, and calculation of similarity using cosine similarity.

A. Dataset
This research uses a dataset in the form of spectrogram images taken from the GTZAN dataset. To simplify the classification of music data with a neural network, the music data must be converted into mel spectrograms before being processed by the network. GTZAN consists of music files and the mel spectrograms generated from those files. It is a public dataset that is widely used for evaluating Music Genre Recognition (MGR). GTZAN is a collection of music gathered in 2000-2001 from various sources such as CDs, radio and microphone recordings. The dataset consists of 10 genres, namely blues, classical, country, disco, hip-hop, jazz, metal, pop, reggae and rock. Each genre contains 100 music files, and each file is 30 seconds long. The dataset used in this study is divided into three parts: training data, validation data and test data, with the details of each part as follows:

B. Spectrogram
The spectrogram is a visual representation of the frequency spectrum of a signal [14]. In the GTZAN dataset, spectrograms have already been generated and stored in folders by class. Before being fed into the CNN, this data is divided into training data, validation data and test data. Each spectrogram image is loaded into an array and labelled according to the index of its folder, and the labelled data is appended to an array to make it easier to pass to the network. A spectrogram image displays many values and features of the music file. The following is an example of the spectrogram of each class in this study. Based on the picture above, it can be seen that the spectrograms of the genres differ. The image on the right shows the spectrograms for the hip-hop and rock genres, in which the frequency content is visibly denser than in the spectrogram on the left. The spectrogram image represents the audio in the time, frequency and magnitude domains. To generate spectrograms for each music genre and use them with an artificial neural network, this study used the Librosa library. With Librosa, important features of a music file can be retrieved, such as tempo, chroma and spectrogram.
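The labelling and splitting step described in this section can be sketched as follows. This is only an illustration: the function name, the file names, the 80/10/10 proportions and the fixed random seed are assumptions, and the split actually used in this study may differ.

```python
import random


def label_and_split(files_by_genre, val_fraction=0.1, test_fraction=0.1, seed=0):
    """Give each genre folder an integer label and split its files into
    training, validation and test lists of (path, label) pairs."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    splits = {"train": [], "val": [], "test": []}
    # The label is the index of the genre folder in sorted order.
    for label, genre in enumerate(sorted(files_by_genre)):
        paths = list(files_by_genre[genre])
        rng.shuffle(paths)
        n_val = round(len(paths) * val_fraction)
        n_test = round(len(paths) * test_fraction)
        splits["val"] += [(p, label) for p in paths[:n_val]]
        splits["test"] += [(p, label) for p in paths[n_val:n_val + n_test]]
        splits["train"] += [(p, label) for p in paths[n_val + n_test:]]
    return splits


# Hypothetical file names standing in for the spectrogram images.
genres = ["blues", "classical", "country", "disco", "hiphop",
          "jazz", "metal", "pop", "reggae", "rock"]
files = {g: [f"{g}/{g}{i:05d}.png" for i in range(100)] for g in genres}
splits = label_and_split(files)
```

Splitting inside each genre keeps the classes balanced across the three parts, which matters for a dataset with only 100 files per class.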

C. Convolutional Neural Network
A Convolutional Neural Network is an artificial neural network that is widely used for image classification. In this study, the audio signal is represented as a spectrogram, which is a 2D image, and CNN is used to classify music genres from these spectrograms. The spectrogram image of each music genre reveals the pattern of its audio signal, so the spectrograms of each genre can be used as input to the CNN.
In this study, two CNN architectures were used, namely RESNET-50 and VGG-16. RESNET stands for Residual Network, an artificial neural network innovation for image recognition that won the 2015 ILSVRC classification competition with a top-5 error rate of only 3.57% [15]. RESNET-50 is a deeper network than VGG-16 and adds residual (skip) connections; the number at the end of each architecture's name represents the number of layers in the network. VGG stands for Visual Geometry Group, and 16 is the number of layers. VGG-16 is also a well-known model; it participated in the 2014 ILSVRC and obtained an accuracy of 92.7%. VGG-16 is widely used in image classification and is very popular because of its ease of implementation. Figure 3 compares the RESNET and VGG architectures.
The research flow used in this study can be outlined as follows. The first step is to collect the dataset, in the form of spectrogram images from the GTZAN dataset. Next, the required libraries are used, such as ImageDataGenerator from the Keras library, to manage the training, validation and test data. The MFCC image data from the 10 music genres is then grouped by category. The next step is to build a CNN model with Keras; two separate processes are carried out, one with the VGG-16 architecture and one with the RESNET-50 architecture. After the models are obtained from training with the two architectures, the similarity between feature vectors is calculated using cosine similarity. The application displays a recommendation of 5 songs that match those in the validation data. The recommendation is made by calculating the similarity of features between one piece of music and another. The first step is to choose music from each genre that will be used as the basis for the recommendation system.
Then the prediction for the base music is calculated using the artificial neural network. The cosine similarity value is calculated from the two feature vectors being compared. To calculate the similarity of two pieces of music with N features, where the first has the feature vector x = [x1, x2, x3, …, xN] and the second has the feature vector y = [y1, y2, y3, …, yN], the following formula is used:

$$\text{similarity}(x, y) = \cos\theta = \frac{\sum_{i=1}^{N} x_i y_i}{\sqrt{\sum_{i=1}^{N} x_i^2}\,\sqrt{\sum_{i=1}^{N} y_i^2}}$$

The next process is transfer learning with the two architectures, VGG16 and RESNET50. Transfer learning is the process of reusing an existing model for a different problem, and it is expected to improve the training results. The parameter required in this transfer learning process is lastfourtrainable: if its value is false, only the last fully connected layer is trained; if it is true, the last 4 layers of the model that have parameters are trained. Both architectures use the "adam" optimizer. Each model was trained for 20 epochs. The training produces a model that is used in the testing process.
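The effect of the lastfourtrainable parameter can be illustrated with a framework-free sketch. The `StubLayer` class, the function name, the layer names and the parameter counts below are hypothetical stand-ins; in Keras, layers expose a comparable `trainable` attribute and `count_params()` method, so the same selection logic could in principle be applied to the layers of a pretrained model. This is one reading of the description above, not the code used in this study.

```python
class StubLayer:
    """Hypothetical stand-in for a framework layer: a name, a parameter
    count and a `trainable` flag."""

    def __init__(self, name, n_params):
        self.name = name
        self.n_params = n_params
        self.trainable = True

    def count_params(self):
        return self.n_params


def set_trainable(layers, last_four_trainable):
    """Freeze every layer, then unfreeze either the final parameterised
    layer or the last four parameterised layers."""
    for layer in layers:
        layer.trainable = False
    # Pooling and flatten layers have no parameters and are skipped.
    with_params = [layer for layer in layers if layer.count_params() > 0]
    keep = 4 if last_four_trainable else 1
    for layer in with_params[-keep:]:
        layer.trainable = True
    return [layer.name for layer in layers if layer.trainable]


# Toy network; the names and parameter counts are made up for illustration.
layers = [StubLayer("conv1", 1792), StubLayer("pool1", 0),
          StubLayer("conv2", 36928), StubLayer("pool2", 0),
          StubLayer("conv3", 73856), StubLayer("flatten", 0),
          StubLayer("fc1", 4096), StubLayer("fc2", 410)]
only_head = set_trainable(layers, last_four_trainable=False)
last_four = set_trainable(layers, last_four_trainable=True)
```

Freezing the earlier layers keeps the pretrained feature detectors intact, while the unfrozen layers adapt to the spectrogram images.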

A. Result analysis
After the training process, the precision, recall and f1-score values were obtained for each music class. The following are the precision, recall, f1-score and accuracy values for the VGG16 model. After the models were built with VGG16 and RESNET50 transfer learning, feature extraction was carried out. For the VGG16 model, for example, the model saved during training is loaded and the output of the layer just before the classification layer is taken. From this model, the feature vectors for the training and validation datasets are derived. These feature vectors are then compared using cosine similarity.
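The similarity search over feature vectors can be sketched as below. The three-dimensional vectors and track names are toy values invented for illustration; the real feature vectors come from the layer before the classification layer and are much longer. The function names are likewise assumptions.

```python
import math


def cosine_similarity(x, y):
    """Cosine of the angle between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # convention of this sketch for an all-zero vector
    return dot / (norm_x * norm_y)


def recommend(query, library, top_k=5):
    """Return the `top_k` (name, similarity) pairs most similar to `query`."""
    scored = [(cosine_similarity(query, vec), name)
              for name, vec in library.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [(name, score) for score, name in scored[:top_k]]


# Toy feature vectors; names and values are hypothetical.
library = {
    "blues_01": [0.9, 0.1, 0.0],
    "blues_02": [0.8, 0.2, 0.1],
    "rock_01": [0.1, 0.9, 0.2],
    "pop_01": [0.0, 0.2, 0.9],
    "jazz_01": [0.5, 0.5, 0.1],
    "metal_01": [0.2, 0.8, 0.5],
}
query = [0.9, 0.1, 0.0]
top_five = recommend(query, library)
```

Because cosine similarity depends only on the direction of the vectors, two tracks whose features differ only in overall scale are treated as identical.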
In Figure 9, 5 music recommendations are obtained based on the spectrogram image of the music file desired by the user, where "test image" is the music spectrogram from the testing data. As seen with VGG16, the recommended spectrograms have almost the same shape as the test spectrogram. With a test image from the Blues class, the recommendations also come from the Blues class, with a similarity level of 1. RESNET50 follows the same testing process as VGG16. In the experiment, training with RESNET50 required more epochs to obtain better accuracy; in this study, fairly good accuracy was obtained at 30 epochs for RESNET50. The picture below shows the precision, recall, f1-score and accuracy for the RESNET50 model. The resulting accuracy is slightly lower than that of the VGG16 model, despite the larger number of epochs. The confusion matrix for the CNN-RESNET50 model is shown in Figure 11; its results are still lower than those of the VGG16 model in classifying the classes of the music dataset used.
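The relationship between a confusion matrix and the reported precision, recall, f1-score and accuracy can be made concrete with a short sketch. The three-class matrix below contains made-up counts for illustration only and does not reproduce the results of this study; the function name is an assumption.

```python
def per_class_metrics(confusion, labels):
    """Compute per-class precision, recall and f1, plus overall accuracy.

    confusion[i][j] is the number of samples whose true class is i and
    whose predicted class is j.
    """
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(len(labels)))
    report = {}
    for i, label in enumerate(labels):
        tp = confusion[i][i]
        predicted = sum(row[i] for row in confusion)  # column sum
        actual = sum(confusion[i])                    # row sum
        precision = tp / predicted if predicted else 0.0
        recall = tp / actual if actual else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        report[label] = {"precision": precision, "recall": recall, "f1": f1}
    accuracy = correct / total if total else 0.0
    return report, accuracy


# Toy counts for three genres (illustrative, not the study's results).
labels = ["blues", "pop", "rock"]
confusion = [
    [8, 1, 1],   # true blues
    [0, 9, 1],   # true pop
    [3, 2, 5],   # true rock
]
report, accuracy = per_class_metrics(confusion, labels)
```

In this toy matrix the rock row has the lowest recall, which is the kind of pattern a confusion matrix makes visible when one class is frequently mistaken for others.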
Figure 10. RESNET50 Accuracy value

The best classification is obtained for the reggae and pop classes. For the rock class, the RESNET50 model is not able to give good classification results.

Figure 11. RESNET50 Confusion matrix

For the testing process, the steps are the same as for the VGG16 model: features are extracted from the test image and their cosine similarity with the features extracted from the training dataset is calculated. In this way, the music spectrogram recommendations are obtained for the testing dataset used. Figure 12 shows the 5 most similar images for the tested data. It can be seen that the spectrogram recommendations are quite good, although the level of similarity is lower than with the VGG16 model.

CONCLUSION
This research was implemented using Python, Google Colab, and the TensorFlow and Keras libraries. The input shape of the CNN model in this research is 224x224 pixels, the filter size is 3x3, the number of epochs is 20 and 30, and there are 799 training data and 100 validation data. CNN is the most widely used method for image data. For research with audio data, the data is first processed by spectrogram analysis into Cartesian coordinates with the amplitude of the music as the y-axis. In this study, the spectrogram results become the input for CNN with the VGG16 and RESNET50 architectures.
Based on the design of the prediction system using the CNN method, the VGG16 model has a training accuracy of 0.8408, a training loss of 0.4827, a test accuracy of 0.6094 and a test loss of 1.2762. The RESNET50 model has a training accuracy of 0.6286, a training loss of 1.0383, a test accuracy of 0.3438 and a test loss of 1.8529. From these results it can be concluded that both models are still underfitting. This is because the dataset still needs to be larger and more varied, and many songs have characteristics that are similar across classes.
The best accuracy was obtained by the VGG16 model with 20 epochs, compared with the RESNET50 model with more than 20 epochs. The recommendations generated on the test data also had better similarity values for VGG16 than for RESNET50. For future research, it is suggested that the dataset be enlarged so that the accuracy obtained is better, because in this study the songs in the dataset do not have clear boundaries between one genre and another. In addition, the number of epochs in the training process can be increased so that the accuracy of each CNN model improves further.