
For Hire Vehicle Data Analysis: Data Engineering Project

This repo contains the final project for the Data Engineering Zoomcamp course.

Introduction

The aim of this project is to analyse the for-hire taxi data of New York City for 2022. The project analyses FHV (For-Hire Vehicle) data to answer the following questions:

  1. Which providers offer for-hire taxi services, and what is each one's market share?
  2. How were taxi hires distributed across the months of 2022, drilled down by service provider?

Dataset

The NYC Taxi For-Hire Vehicle (FHV) dataset is used. This dataset is updated monthly.

Details include the time, location, and descriptive categorizations of the trip records in the high-volume FHV data. To learn more about the dataset, click here.
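
For a quick look at the schema before wiring up the pipeline, one month can be pulled down and inspected with pandas. This is only an illustrative snippet; the URL follows the public TLC CloudFront pattern and may differ from the endpoint the project actually uses:

```python
# Illustrative peek at one month of FHVHV data. The CloudFront URL is an
# assumption based on the public TLC trip-record page, not project code.
import io

import pandas as pd
import requests

url = "https://d37ci6vzurychx.cloudfront.net/trip-data/fhvhv_tripdata_2022-01.parquet"
df = pd.read_parquet(io.BytesIO(requests.get(url, timeout=120).content))

print(df.shape)
print(df[["hvfhs_license_num", "pickup_datetime"]].head())
```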

Tools

The following components were utilized to implement the required solution:

  • Data Ingestion: data extracted with the Python requests module via the NYC Taxi data internal API
  • Infrastructure as Code: Terraform
  • Workflow orchestration: Airflow
  • Data Lake: Google Cloud Storage
  • Data Warehouse: Google BigQuery
  • Data Pipeline: Spark batch processing
  • Data Transformation: Spark via Google Dataproc
  • Reporting: Google Looker Studio
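
As a concrete illustration of the ingestion component, the sketch below pulls the twelve monthly FHVHV files for 2022 with requests. The URL pattern is assumed from the public TLC page; the project's actual endpoint and file layout may differ:

```python
# Sketch of the ingestion component: fetch the twelve 2022 FHVHV files.
# The URL pattern is an assumption, not the project's actual endpoint.
from pathlib import Path

import requests

BASE_URL = "https://d37ci6vzurychx.cloudfront.net/trip-data"

def download_month(year: int, month: int, out_dir: Path = Path("data")) -> Path:
    """Download one month of FHVHV trip data and save it locally."""
    out_dir.mkdir(parents=True, exist_ok=True)
    fname = f"fhvhv_tripdata_{year}-{month:02d}.parquet"
    out_path = out_dir / fname
    resp = requests.get(f"{BASE_URL}/{fname}", timeout=300)
    resp.raise_for_status()  # fail fast on a missing month
    out_path.write_bytes(resp.content)
    return out_path

if __name__ == "__main__":
    for m in range(1, 13):
        print("saved", download_month(2022, m))
```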

Architecture

Reproduce

Local setup

Cloud setup

  • In GCP, create a service account with the following roles:

    • BigQuery Admin
    • Storage Admin
    • Storage Object Admin
    • Dataproc Admin
  • Download the service account key file and save it as $HOME/.google/credentials/google_credentials_project.json.

  • Ensure that the following APIs are enabled:

    • Compute Engine API
    • Cloud Dataproc API
    • Cloud Dataproc Control API
    • BigQuery API
    • BigQuery Storage API
    • Identity and Access Management (IAM) API
    • IAM Service Account Credentials API
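
Before moving on, it is worth confirming that the key file authenticates. A minimal check, assuming the google-cloud-storage package is installed, is to point GOOGLE_APPLICATION_CREDENTIALS at the file and list buckets:

```python
# Quick credentials check: list GCS buckets with the downloaded key.
# Assumes the google-cloud-storage package is installed.
import os

from google.cloud import storage

os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = os.path.expanduser(
    "~/.google/credentials/google_credentials_project.json"
)

client = storage.Client()  # reads the key file via the env var
for bucket in client.list_buckets():
    print(bucket.name)
```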

Initializing Infrastructure (Terraform)

  • Perform the following to set up the required cloud infrastructure:
cd terraform
terraform init
terraform plan
terraform apply

cd ..

Data Ingestion

  • Set up Airflow to perform data ingestion:
cd airflow

docker-compose build
docker-compose up airflow-init
docker-compose up -d
  • Go to the Airflow UI at localhost:8080 and enable the FHVHV_DATA_ETL DAG.
  • This DAG ingests the month-wise FHVHV data for 2022 and uploads it to the data lake (GCS), following the pattern sketched below.
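
The real DAG lives in the airflow/ folder; the sketch below only illustrates the general pattern. The bucket name, task id, and URL pattern are assumptions, not the project's actual values:

```python
# Illustrative month-wise ingest-and-upload DAG. The bucket name, task id,
# and URL pattern are assumptions, not the project's actual values.
from datetime import datetime

import requests
from airflow import DAG
from airflow.operators.python import PythonOperator
from google.cloud import storage

BUCKET = "fhvhv-data-lake"  # hypothetical bucket name
URL = "https://d37ci6vzurychx.cloudfront.net/trip-data/fhvhv_tripdata_{month}.parquet"

def ingest_month(ds: str) -> None:
    """Download one month of FHVHV data and push it to GCS."""
    month = ds[:7]  # logical date 'YYYY-MM-DD' -> 'YYYY-MM'
    resp = requests.get(URL.format(month=month), timeout=300)
    resp.raise_for_status()
    blob = storage.Client().bucket(BUCKET).blob(f"raw/fhvhv_{month}.parquet")
    blob.upload_from_string(resp.content)

with DAG(
    dag_id="FHVHV_DATA_ETL",
    start_date=datetime(2022, 1, 1),
    end_date=datetime(2022, 12, 31),
    schedule_interval="@monthly",
    catchup=True,  # backfills every month of 2022
) as dag:
    PythonOperator(
        task_id="ingest_and_upload",
        python_callable=ingest_month,
        op_kwargs={"ds": "{{ ds }}"},
    )
```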

Data Transformation

  • Install and set up Spark by following this.

  • Enable and run the Spark_FHVHV_ETL DAG.

  • This DAG triggers the following steps (a rough sketch of the Spark job follows this list):

    • Create a Dataproc cluster.
    • Upload the Spark code to GCS.
    • Submit a Spark job to Dataproc for transformation and analysis, and save the processed data to GCS after partitioning.
    • Load the processed data from GCS into BigQuery, clustered by month.
    • Delete the Dataproc cluster.
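
The Spark job itself roughly takes the shape below. The bucket paths are illustrative; the column names (hvfhs_license_num, pickup_datetime) come from the published FHVHV schema:

```python
# Rough shape of the Dataproc Spark job. Bucket paths are illustrative;
# the column names come from the published FHVHV schema.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("fhvhv-transform").getOrCreate()

df = spark.read.parquet("gs://fhvhv-data-lake/raw/*.parquet")

# Trips per provider per month: feeds both the market-share and the
# monthly-distribution questions from the introduction.
monthly = (
    df.withColumn("month", F.month("pickup_datetime"))
      .groupBy("month", "hvfhs_license_num")
      .agg(F.count("*").alias("trips"))
)

# Partition by month before writing back to GCS; the partitioned output
# is what later gets loaded into BigQuery clustered on month.
monthly.write.mode("overwrite").partitionBy("month").parquet(
    "gs://fhvhv-data-lake/processed/monthly_trips"
)
```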

Dashboard

The dashboard can also be viewed at this link.
