Hybrid Feature Combination of TF-IDF and BERT for Enhanced Information Retrieval Accuracy
Abstract
Text representation is a critical component of Natural Language Processing tasks such as information retrieval and text classification. Traditional approaches such as Term Frequency-Inverse Document Frequency (TF-IDF) offer a simple and efficient way to represent term importance but cannot capture semantic meaning. Deep learning models such as Bidirectional Encoder Representations from Transformers (BERT), on the other hand, produce context-aware embeddings that improve semantic understanding but may overlook exact term relevance. This study proposes a hybrid approach that combines TF-IDF and BERT through a weighted feature-level fusion strategy. The TF-IDF vectors are reduced in dimensionality with Truncated Singular Value Decomposition and aligned with the BERT embeddings, and the combined representation is used to train a fully connected neural network for binary classification of document relevance. The model was evaluated on the CISI benchmark dataset and compared with standalone TF-IDF and BERT models. Experimental results show that the hybrid model achieved a training accuracy of 97.43% and the highest test accuracy of 80.02%, outperforming the individual methods. These findings confirm that combining lexical and contextual features can improve classification accuracy and generalization, providing a more robust solution for real-world information retrieval systems where both term specificity and contextual relevance matter.
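For a concrete picture of the pipeline summarized above, the Python sketch below reproduces its main steps on a toy corpus: TF-IDF vectors reduced with Truncated SVD, mean-pooled BERT embeddings, weighted feature-level fusion by scaled concatenation, and a small fully connected classifier. The corpus, the fusion weight, the SVD dimensionality, and the layer sizes are illustrative assumptions rather than the settings used in the paper, which trains on the CISI collection and aligns the reduced TF-IDF vectors with BERT's 768-dimensional embeddings.

import torch
import torch.nn as nn
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from transformers import AutoTokenizer, AutoModel

# Toy corpus with binary relevance labels, standing in for CISI query-document pairs.
texts = [
    "information retrieval with term weighting",
    "contextual embeddings capture semantic meaning",
    "tf-idf measures term importance in a corpus",
    "transformers model word context bidirectionally",
    "cooking recipes for a quick weeknight dinner",
    "travel tips for a relaxing weekend holiday",
]
labels = torch.tensor([[1.0], [1.0], [1.0], [1.0], [0.0], [0.0]])

# Lexical branch: TF-IDF reduced with Truncated SVD. The paper aligns this with
# BERT's 768 dimensions; the toy corpus only supports a handful of components.
tfidf = TfidfVectorizer().fit_transform(texts)
svd = TruncatedSVD(n_components=min(4, tfidf.shape[1] - 1), random_state=0)
lexical = torch.tensor(svd.fit_transform(tfidf), dtype=torch.float32)

# Contextual branch: masked mean pooling over BERT token embeddings (768-dim).
tok = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")
with torch.no_grad():
    enc = tok(texts, padding=True, truncation=True, return_tensors="pt")
    hidden = bert(**enc).last_hidden_state
    mask = enc["attention_mask"].unsqueeze(-1).float()
    contextual = (hidden * mask).sum(dim=1) / mask.sum(dim=1)

# Weighted feature-level fusion: scale each feature block, then concatenate.
alpha = 0.4  # assumed fusion weight; the published value is not stated in the abstract
features = torch.cat([alpha * lexical, (1.0 - alpha) * contextual], dim=1)

# Fully connected classifier for binary document relevance.
clf = nn.Sequential(
    nn.Linear(features.shape[1], 128),
    nn.ReLU(),
    nn.Dropout(0.2),
    nn.Linear(128, 1),
)
optimizer = torch.optim.Adam(clf.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()
for _ in range(100):
    optimizer.zero_grad()
    loss = loss_fn(clf(features), labels)
    loss.backward()
    optimizer.step()

clf.eval()
print("predicted relevance:", torch.sigmoid(clf(features)).squeeze(1).tolist())

In this sketch the fusion weight is applied by scaling each feature block before concatenation; other weighted-fusion variants (for example, a learned gate over dimension-aligned lexical and contextual vectors) fit the same abstract-level description.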
Keywords
DOI: https://doi.org/10.31326/jisa.v8i1.2179
Copyright (c) 2025 Pajri Aprilio, Michael Felix, Putu Surya Nugraha, Hasanul Fahmi

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
JOURNAL IDENTITY
Journal Name: JISA (Jurnal Informatika dan Sains)
e-ISSN: 2614-8404, p-ISSN: 2776-3234
Publisher: Program Studi Teknik Informatika Universitas Trilogi
Publication Schedule: June and December
Language: English
APC: The journal charges fees for publishing
Indexing: EBSCO, DOAJ, Google Scholar, Arsip Relawan Jurnal Indonesia, Directory of Research Journals Indexing, Index Copernicus International, PKP Index, Science and Technology Index (SINTA S4), Garuda Index
OAI address: http://trilogi.ac.id/journal/ks/index.php/JISA/oai
Contact: jisa@trilogi.ac.id
Sponsored by: Crossref (Digital Object Identifier, DOI), Universitas Trilogi
In Collaboration With: Indonesian Artificial Intelligent Ecosystem (IAIE), Relawan Jurnal Indonesia, Jurnal Teknologi dan Sistem Komputer (JTSiskom)
JISA (Jurnal Informatika dan Sains) is Published by Program Studi Teknik Informatika, Universitas Trilogi under Creative Commons Attribution-ShareAlike 4.0 International License.