Automating the Extraction of Words and Topics in Indonesian Using the Term Frequency-Inverse Document Frequency Algorithm and Latent Dirichlet Allocation

Lalu Mutawalli; Mohammad Taufan Asri Zaen; Muhammad Fauzi Zulkarnaen

doi:10.31326/jisa.v7i1.2012

Abstract

Keyword extraction and topic modeling are important tools for analyzing Indonesian-language Gojek user reviews. Keyword extraction reveals user preferences and needs, and topic modeling groups reviews into distinct topics, so stakeholders can use the results to improve services. This research applies TF-IDF and LDA to text data from Gojek user reviews and feedback. The data spans November 5, 2021, to January 2, 2024, and totals 225,002 rows. Each row contains a username, review content, time, and app version; the analysis focuses on the review content. Reviews average 8 words in length, with a maximum of 104 words and a minimum of a few words. This variability indicates that review length is not normally distributed. Preprocessing is applied to preserve the accuracy of the topic analysis. TF-IDF is used to extract relevant keywords, and LDA is used to model the topics in the reviews. The topic analysis reveals six patterns in Gojek user reviews. The first topic covers experience, services, and affordable pricing. The second emphasizes app usability and benefits. The third relates to promos, discounts, and vouchers. The fourth reflects positive evaluations of service quality. The fifth, in contrast, highlights high costs and app issues. The sixth underscores overall user satisfaction and service convenience. Evaluation of the topic model yielded a coherence score of 0.509, indicating that the model's topics show a good level of consistency in capturing relevant themes from the Gojek review data. Combining TF-IDF and LDA for Indonesian text analysis, particularly of Gojek user reviews, is an important step toward better understanding and use of text data to improve the overall user experience.
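As a rough illustration of the two techniques named in the abstract, the following Python sketch computes TF-IDF weights and fits a small LDA model on a handful of made-up tokenized reviews. It is not the authors' pipeline: the toy reviews, the function names `tfidf` and `lda_gibbs`, the hyperparameters `alpha` and `beta`, and the choice of collapsed Gibbs sampling for inference are all assumptions made for this example, and the study's own preprocessing and tooling are not specified in the abstract.

```python
import math
import random
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per tokenized document.

    Weight = (term count / document length) * log(N / document frequency).
    """
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        scores.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


def lda_gibbs(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling (illustrative, not optimized).

    Returns per-document topic counts and per-topic term counts.
    """
    rng = random.Random(seed)
    vocab_size = len({term for doc in docs for term in doc})
    doc_topic = [[0] * n_topics for _ in docs]
    topic_term = [Counter() for _ in range(n_topics)]
    topic_total = [0] * n_topics
    assignments = []

    def update(d, term, k, delta):
        # Add or remove one token of `term` in document d from topic k.
        doc_topic[d][k] += delta
        topic_term[k][term] += delta
        topic_total[k] += delta

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        assignments.append([rng.randrange(n_topics) for _ in doc])
        for term, k in zip(doc, assignments[d]):
            update(d, term, k, 1)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, term in enumerate(doc):
                # Remove the token, then resample its topic from the
                # conditional distribution given all other assignments.
                update(d, term, assignments[d][i], -1)
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_term[k][term] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(n_topics)
                ]
                k_new = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = k_new
                update(d, term, k_new, 1)
    return doc_topic, topic_term


# Hypothetical, already-preprocessed reviews (not taken from the dataset).
reviews = [
    ["promo", "diskon", "voucher"],
    ["aplikasi", "mudah", "membantu"],
    ["promo", "voucher", "murah"],
    ["aplikasi", "error", "mahal"],
]

keyword_scores = tfidf(reviews)
doc_topic, topic_term = lda_gibbs(reviews, n_topics=2)

for k, counts in enumerate(topic_term):
    print(f"topic {k}:", counts.most_common(3))
```

In this sketch, terms that occur in fewer reviews receive higher TF-IDF weights, and each topic is summarized by its most frequently assigned terms. On a corpus of the size reported here, an established library implementation would likely be preferable to this hand-written sampler.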

Keywords

word extraction; topic modeling; preferences; TF-IDF; LDA

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

JOURNAL IDENTITY

Journal Name: JISA (Jurnal Informatika dan Sains)
e-ISSN: 2614-8404, p-ISSN: 2776-3234
Publisher: Program Studi Teknik Informatika Universitas Trilogi
Publication Schedule: June and December
Language: English
APC: The Journal Charges Fees for Publishing
Indexing: EBSCO, DOAJ, Google Scholar, Arsip Relawan Jurnal Indonesia, Directory of Research Journals Indexing, Index Copernicus International, PKP Index, Science and Technology Index (SINTA, S4), Garuda Index
OAI address: http://trilogi.ac.id/journal/ks/index.php/JISA/oai
Contact: jisa@trilogi.ac.id
Sponsored by: DOI – Digital Object Identifier Crossref, Universitas Trilogi

In Collaboration With: Indonesian Artificial Intelligent Ecosystem (IAIE), Relawan Jurnal Indonesia, Jurnal Teknologi dan Sistem Komputer (JTSiskom)

JISA (Jurnal Informatika dan Sains) is Published by Program Studi Teknik Informatika, Universitas Trilogi under Creative Commons Attribution-ShareAlike 4.0 International License.
