Data Pipeline Architecture with Near Real-Time Streaming Multiple Source Indonesian Online News Data Lake

Angelina Pramana Thenata

Abstract


The rapid growth of information has increased demand for online news. Online news attracts readers by delivering stories from many fields conveniently and quickly. However, the large amount of online news (volume) that spreads in a short time (velocity), together with the public's need to consume news from varied sources (variety), can affect people's lives. The government, as regulator, and news agencies therefore need to monitor the online news in circulation. To address this problem, the researcher proposes a data lake architecture that suits online news and can run in real time. A data lake can address the main Big Data challenges of volume, velocity, and variety. To develop the architecture, the researcher conducted a literature study and analyzed how the data lake flow should be adapted to online news. The researcher will then use the architecture to combine and standardize the structure of online news data from several online news channels and stream it in real time into the data lake. The resulting data are stored in MongoDB, which serves as the database for all data in both the short and long term. Finally, the data lake will serve as a means to store, explore, and analyze circulating online news data.
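The combine-and-standardize step described in the abstract can be illustrated with a minimal sketch. It is not the author's implementation: the source names, the field mappings, the unified schema, and the `InMemorySink` class are all hypothetical and chosen only for illustration. The sink stands in for a MongoDB collection; a pymongo collection also exposes an `insert_one` method, so it could plausibly be swapped in, but that is an assumption about how the paper's pipeline is wired.

```python
from datetime import datetime, timezone
from typing import Any, Dict, Iterable, Iterator, List, Tuple

# Hypothetical mappings: unified field name -> field name used by each source.
FIELD_MAPS: Dict[str, Dict[str, str]] = {
    "source_a": {"title": "headline", "body": "content",
                 "url": "link", "published_at": "pubDate"},
    "source_b": {"title": "judul", "body": "isi",
                 "url": "url", "published_at": "tanggal"},
}


def normalize(source: str, raw: Dict[str, Any]) -> Dict[str, Any]:
    """Map one raw article from a given source onto the unified schema."""
    mapping = FIELD_MAPS[source]
    doc = {field: raw.get(src_key) for field, src_key in mapping.items()}
    doc["source"] = source  # keep provenance for later analysis
    doc["ingested_at"] = datetime.now(timezone.utc).isoformat()
    return doc


def stream(records: Iterable[Tuple[str, Dict[str, Any]]]) -> Iterator[Dict[str, Any]]:
    """Yield unified documents one at a time as raw records arrive."""
    for source, raw in records:
        yield normalize(source, raw)


class InMemorySink:
    """Stand-in for the data lake store; collects documents in a list."""

    def __init__(self) -> None:
        self.docs: List[Dict[str, Any]] = []

    def insert_one(self, doc: Dict[str, Any]) -> None:
        self.docs.append(doc)


def run_pipeline(records: Iterable[Tuple[str, Dict[str, Any]]], sink: Any) -> int:
    """Write each unified document to the sink and return the count written."""
    count = 0
    for doc in stream(records):
        sink.insert_one(doc)
        count += 1
    return count
```

Because `stream` is a generator, each article is written to the sink as soon as it is normalized rather than after the whole batch has been collected, which is the property a near real-time feed relies on.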


Keywords


Data Lake; Online News; Real-Time






DOI: https://doi.org/10.31326/jisa.v3i1.657


Copyright (c) 2020 Angelina Pramana Thenata

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.


JOURNAL IDENTITY

Journal Name: JISA (Jurnal Informatika dan Sains)
e-ISSN: 2614-8404, p-ISSN: 2776-3234
Publisher: Program Studi Teknik Informatika Universitas Trilogi