Designing resilient, low-latency data pipelines for streaming big data analytics using Apache Kafka and Spark ecosystems

Uju Ugonna Uzoagu

doi:10.30574/wjarr.2025.27.3.3369

Department of Computer Science, College of Computing and Software Engineering, Kennesaw State University, USA.

Review Article

World Journal of Advanced Research and Reviews, 2025, 27(03), 1856-1873

DOI url: https://doi.org/10.30574/wjarr.2025.27.3.3369

Publication history

Received on 21 August 2025; revised on 26 September 2025; accepted on 29 September 2025

Abstract

The exponential growth of real-time data streams from digital platforms, Internet of Things (IoT) devices, and enterprise applications has redefined the requirements for big data analytics. Traditional batch-processing architectures, while robust for historical analysis, are increasingly insufficient in addressing the need for low-latency decision-making in sectors such as finance, healthcare, telecommunications, and e-commerce. Consequently, resilient streaming data pipelines have become critical in supporting fault-tolerant, scalable, and high-throughput analytics. This study explores the design and implementation of resilient, low-latency data pipelines for streaming big data analytics by leveraging the Apache Kafka and Apache Spark ecosystems. Kafka, a distributed publish-subscribe messaging system, provides durable, fault-tolerant ingestion capabilities with strong scalability properties, while Spark Structured Streaming delivers near real-time analytical processing and advanced machine learning integration. Together, these technologies form a complementary foundation for constructing streaming pipelines capable of handling large volumes of high-velocity data. The paper discusses architectural design principles, including partitioning strategies, replication for fault tolerance, stateful stream processing, and backpressure handling. It further evaluates techniques for ensuring end-to-end resilience, such as exactly-once semantics, checkpointing, and integration with containerized environments like Kubernetes for deployment scalability. Case study insights highlight latency benchmarks and system performance under varying workloads, demonstrating how the Kafka-Spark integration supports enterprise-grade analytics. By uniting resilience, scalability, and analytical depth, the proposed pipeline framework enables organizations to harness real-time insights while ensuring reliability under fluctuating conditions. 
The findings contribute practical guidelines for architects, engineers, and decision-makers seeking to operationalize streaming analytics infrastructures that meet the growing demands of modern data-driven enterprises.

Keywords

Streaming data pipelines; Apache Kafka; Apache Spark; Big data analytics; Low-latency processing; Resilient architectures

Download Article PDF

https://journalwjarr.com/sites/default/files/fulltext_pdf/WJARR-2025-3369.pdf

How to cite this article

Uju Ugonna Uzoagu. Designing resilient, low-latency data pipelines for streaming big data analytics using Apache Kafka and Spark ecosystems. World Journal of Advanced Research and Reviews, 2025, 27(03), 1856-1873. Article DOI: https://doi.org/10.30574/wjarr.2025.27.3.3369.
