Hybrid Feature Combination of TF-IDF and BERT for Enhanced Information Retrieval Accuracy

Pajri Aprilio, Michael Felix, Putu Surya Nugraha, Hasanul Fahmi

Abstract


Text representation is a critical component of Natural Language Processing tasks such as information retrieval and text classification. Traditional approaches such as Term Frequency-Inverse Document Frequency (TF-IDF) provide a simple and efficient way to represent term importance but cannot capture semantic meaning. In contrast, deep learning models such as Bidirectional Encoder Representations from Transformers (BERT) produce context-aware embeddings that improve semantic understanding but may overlook exact term relevance. This study proposes a hybrid approach that combines TF-IDF and BERT through a weighted feature-level fusion strategy. The TF-IDF vectors are reduced in dimensionality with Truncated Singular Value Decomposition so that they align with the BERT embeddings, and the combined representation is used to train a fully connected neural network for binary classification of document relevance. The model was evaluated on the CISI benchmark dataset and compared with standalone TF-IDF and BERT models. Experimental results show that the hybrid model achieved a training accuracy of 97.43 percent and a test accuracy of 80.02 percent, the highest among the compared models. These findings indicate that combining lexical and contextual features can improve classification accuracy and generalization, providing a more robust solution for real-world information retrieval systems where both term specificity and contextual relevance matter.
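As a rough illustration of the pipeline summarized above, the Python sketch below wires together the three stages described in the abstract: TF-IDF vectors compressed with Truncated SVD, BERT embeddings, and a weighted concatenation fed to a fully connected classifier. The [CLS] pooling, fusion weight, SVD dimension, and layer sizes are assumptions made for this sketch, not the authors' exact configuration.

```python
# Minimal sketch of weighted TF-IDF + BERT feature-level fusion.
# Pooling choice, fusion weight, SVD dimension, and network sizes are
# illustrative assumptions, not the configuration reported in the paper.
import numpy as np
import torch
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.neural_network import MLPClassifier
from transformers import AutoModel, AutoTokenizer

# Toy corpus with binary relevance labels (1 = relevant, 0 = not relevant).
docs = [
    "term frequency weighting ranks documents by exact word overlap",
    "inverse document frequency highlights rare discriminative terms",
    "transformer embeddings capture contextual meaning of sentences",
    "bidirectional encoders produce context aware representations",
]
labels = [1, 1, 0, 0]

# Lexical features: TF-IDF reduced with Truncated SVD. On the full CISI corpus
# n_components could be set to 768 to align with the BERT dimension; the tiny
# toy vocabulary here forces a much smaller value.
tfidf = TfidfVectorizer().fit_transform(docs)
lexical = TruncatedSVD(n_components=2, random_state=0).fit_transform(tfidf)

# Contextual features: BERT [CLS] embeddings (one common pooling choice).
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")
with torch.no_grad():
    encoded = tokenizer(docs, padding=True, truncation=True, return_tensors="pt")
    contextual = bert(**encoded).last_hidden_state[:, 0, :].numpy()

# Weighted feature-level fusion; alpha balances lexical vs. contextual evidence.
alpha = 0.5
fused = np.hstack([alpha * lexical, (1.0 - alpha) * contextual])

# Fully connected network for binary relevance classification.
clf = MLPClassifier(hidden_layer_sizes=(256, 64), max_iter=500, random_state=0)
clf.fit(fused, labels)
print(clf.predict(fused))
```

On the actual CISI collection, the lexical and contextual vectors would be computed for each query-document pair and the classifier trained against the relevance judgments; the fusion weight alpha then becomes a tunable hyperparameter that controls how much exact term matching contributes relative to contextual similarity.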


Keywords


TF-IDF; BERT; Text Classification; Information Retrieval; Hybrid Model; Semantic Embedding; Neural Network






DOI: https://doi.org/10.31326/jisa.v8i1.2179



Copyright (c) 2025 Pajri Aprilio, Michael Felix, Putu Surya Nugraha, Hasanul Fahmi

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

