This repository provides a real-time sales data pipeline that ingests, processes, and stores sales data using Kafka, Spark, Cassandra, and Redash. It offers a comprehensive solution for streaming data analysis and visualization.
Follow these steps to set up and run the Real-Time Sales Data Pipeline:
```bash
git clone https://github.com/saadkh1/Real-Time_Sales_Data_Pipeline_Kafa_Spark_Cassandra_Redash
cd Real-Time_Sales_Data_Pipeline_Kafa_Spark_Cassandra_Redash
```
- Windows:

  ```bash
  run.bat
  ```

- Linux:

  ```bash
  ./run.sh
  ```
This command uses Docker Compose to start all the necessary containers, including Kafka, Spark, Cassandra, and the FastAPI service. It also creates the Kafka topic and sets up the Cassandra keyspace and table.
The `api-pos` directory contains Python scripts (`api_pos_1.py` and `api_pos_2.py`) that simulate sales data. These scripts send data to the FastAPI service via HTTP POST requests, and the service then forwards it to the Kafka topic.
```bash
cd api-pos
python api_pos_1.py &  # Run in background
python api_pos_2.py &  # Run in background (optional, for more data)
```
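As a rough illustration of what a simulator script like `api_pos_1.py` might do, here is a minimal standard-library sketch. The endpoint URL and the record's field names are assumptions for this example; check the actual scripts for the real values.

```python
import json
import random
import time
import urllib.request

# Hypothetical endpoint; the real URL is defined in the api-pos scripts.
API_URL = "http://localhost:8000/sales"

def make_sale():
    """Build one simulated sales record (field names are illustrative)."""
    return {
        "store_id": random.randint(1, 10),
        "product_id": random.randint(100, 199),
        "quantity": random.randint(1, 5),
        "unit_price": round(random.uniform(1.0, 50.0), 2),
        "timestamp": time.time(),
    }

def post_sale(sale, url=API_URL):
    """POST one record as JSON to the FastAPI service."""
    req = urllib.request.Request(
        url,
        data=json.dumps(sale).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status
```

A simulator would then call `post_sale(make_sale())` in a loop, sleeping briefly between requests to mimic a steady stream of sales.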
- The generated data is sent to the FastAPI service using POST requests.
- The FastAPI service receives the data and sends it to the Kafka topic.
- Kafka acts as a real-time message broker, streaming the sales data to the Spark application.
- A Spark job continuously reads data from the Kafka topic.
- The data is processed (transformed) before being written to Cassandra.
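The actual job performs this step with Spark Structured Streaming DataFrame operations; as a hedged plain-Python illustration of the per-record transformation, assuming hypothetical field names (`quantity`, `unit_price`):

```python
import json

def transform(raw_message: bytes) -> dict:
    """Illustrative transformation: parse one Kafka message and derive a total.

    Field names are assumptions for this sketch; the real Spark job applies
    equivalent logic across the stream with DataFrame transformations.
    """
    record = json.loads(raw_message)
    record["total"] = round(record["quantity"] * record["unit_price"], 2)
    return record

# Example: one message as it might arrive from the Kafka topic
msg = b'{"store_id": 3, "product_id": 120, "quantity": 2, "unit_price": 9.99}'
row = transform(msg)
```

Each transformed row would then be written to the Cassandra table created during setup.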