# Smart cities big data services
This repository contains my code submission for the Big Data course project, SCPD, 2024 - 2025 cohort.
# Contact Information
Andrei-Bogdan Plesoianu - SCPD1-A - andrei.plesoianu@stud.acs.upb.ro
# Data
The data used for this project can be downloaded, in CSV form, from [here](https://data.london.gov.uk/dataset/smartmeter-energy-use-data-in-london-households).
# Project structure
I've structured the project as a series of interconnected microservices, so starting the project means starting each individual component.
Because the architecture is meant to let each component scale independently, with the components distributed across machines, I decided to use docker-compose to encapsulate each service. This makes it easier to deploy the services in a highly scalable environment, such as the cloud.
# Components
The following components are available:
- <strong>influxdb</strong>: An InfluxDB database service that ingests processed data and offers a dashboard for BI-style analytics;
- <strong>ingestion-server</strong>: A Node.js server that acts as the entry point of the dataflow, i.e. the ingestion layer of the ETL pipeline. The server exposes the ETL to the outside world through a REST API and sends the data it receives to the next layer in the architecture, the Kafka broker (a minimal sketch of such an endpoint is shown after this list);
- <strong>kafka</strong>: A Kafka cluster, made up of several Docker images packaged in a docker-compose file. The cluster consists of a ZooKeeper node for metadata and coordination, a broker, kafka-UI, which offers a browser interface for inspecting the cluster, and a script-based job that creates some topics at cluster initialization;
- <strong>kafka-influxdb-connector</strong>: A Node.js script that acts as a sink for the Kafka cluster. It reads processed data from the "resampled" topic and inserts it into the InfluxDB database (see the sink sketch after this list);
- <strong>kafka-postgres-connector</strong>: Another Node.js script that acts as a sink, this time pulling unprocessed data from the "topic" input of the Kafka stream and inserting it into the Postgres database;
- <strong>kafka-resample-stream</strong>: A Java program that runs a Kafka Streams job. It processes the data ingested by the cluster in a streaming fashion, one record at a time, as it arrives. It mainly does three things: it filters out missing measurements in the incoming data, it detects faulty measurements (records containing "*" characters) and routes them to a separate topic for further processing, and finally it resamples the data from a 30-minute resolution to one hour and places the result into the "resampled" topic;
- <strong>postgres</strong>: A Postgres database, packaged as a Docker container service. It acts as the main database of the ETL pipeline, storing minimally processed data, and serves as the primary data source for ad hoc analytics and machine learning tasks.
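To make the dataflow more concrete, here is a minimal sketch of what the ingestion server's entry point could look like. It assumes Express and the kafkajs client; the route name, topic name and payload shape are illustrative placeholders, not the exact code in <strong>ingestion-server</strong>.
```
// Minimal sketch of an ingestion endpoint (assumes express + kafkajs).
// Route, topic name and payload shape are illustrative placeholders.
const express = require('express');
const { Kafka } = require('kafkajs');

const kafka = new Kafka({ clientId: 'ingestion-server', brokers: ['localhost:9092'] });
const producer = kafka.producer();

const app = express();
app.use(express.json());

// Receive raw smart meter readings and forward them to the Kafka broker.
app.post('/ingest', async (req, res) => {
  try {
    await producer.send({
      topic: 'topic', // hypothetical name of the input topic
      messages: [{ value: JSON.stringify(req.body) }],
    });
    res.status(202).send('queued');
  } catch (err) {
    res.status(500).send(err.message);
  }
});

producer.connect().then(() => {
  app.listen(3000, () => console.log('ingestion server listening on :3000'));
});
```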
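The sink connectors follow a simple consume-and-write loop. Below is a sketch of the Kafka-to-InfluxDB sink, assuming kafkajs and the official @influxdata/influxdb-client package; the connection details, org, bucket, measurement and field names are placeholders. The Postgres sink follows the same pattern, with an INSERT via a Postgres client instead of the InfluxDB write.
```
// Sketch of the Kafka -> InfluxDB sink (assumes kafkajs + @influxdata/influxdb-client).
// Connection details, org/bucket and field names are placeholders.
const { Kafka } = require('kafkajs');
const { InfluxDB, Point } = require('@influxdata/influxdb-client');

const kafka = new Kafka({ clientId: 'influxdb-sink', brokers: ['localhost:9092'] });
const consumer = kafka.consumer({ groupId: 'influxdb-sink' });

const influx = new InfluxDB({ url: 'http://localhost:8086', token: process.env.INFLUX_TOKEN });
const writeApi = influx.getWriteApi('my-org', 'energy'); // placeholder org and bucket

async function run() {
  await consumer.connect();
  await consumer.subscribe({ topic: 'resampled', fromBeginning: true });

  await consumer.run({
    eachMessage: async ({ message }) => {
      const reading = JSON.parse(message.value.toString());
      // Measurement, tag and field names below are illustrative.
      const point = new Point('energy_consumption')
        .tag('household', reading.householdId)
        .floatField('kwh', reading.kwh)
        .timestamp(new Date(reading.timestamp));
      writeApi.writePoint(point);
    },
  });
}

run().catch(console.error);
```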
# Tooling used
I've used the following tools:
- Node.js, version 18.16.1;
- Java 17;
- Docker, version 20.10.23;
- Intellij 2024.3;
- Datagrip 2024.3 (optional, used to manage data in the postgres database);
- IPython, version 8.27.0;
- ipykernel, version 6.28.0;
# Running the project
In order to run the project, each individual component needs to be started. As such, you should issue the following commands:
```
# start InfluxDB
cd influxdb; docker-compose up
# start the ingestion server
cd ingestion-server; node src/server.js
# start the kafka cluster
cd kafka; docker-compose up
# start the kafka to influxdb sink connector
cd kafka-influxdb-connector; node src/index.js
# start the kafka to postgres sink connector
cd kafka-postgres-connector; node src/index.js
```
To start the Kafka Streams job I've used IntelliJ, but you can also build the JAR from the CLI (e.g. with `mvn package`). In that case you need Maven installed, with a version compatible with the Java SDK that sits in your PATH.
# Ad hoc analytics
One of the main purposes of this architecture is to allow data scientists, data analysts and machine learning engineers to use the ingested data in different analytics scenarios.
For this reason, I've created a POC demonstrating how easily such workflows can be run. I've used Jupyter Notebooks for this example, but any tool that can connect to a Postgres database would do just fine.
The <strong>jupyter</strong> folder contains the <strong>Ad Hoc EDA.ipynb</strong> notebook, which runs some exploratory data analysis on the ingested data.
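As an illustration that any Postgres-capable client works, not just a notebook, the same kind of ad hoc query could be issued from Node.js with the `pg` package. The connection string, table and column names below are placeholders, not the actual schema used by the pipeline.
```
// Quick ad hoc query against the ETL's Postgres database (assumes the `pg` package).
// Connection string, table and column names are placeholders.
const { Client } = require('pg');

async function main() {
  const client = new Client({ connectionString: 'postgres://postgres:postgres@localhost:5432/postgres' });
  await client.connect();

  // Example: average consumption per household (illustrative schema).
  const res = await client.query(
    'SELECT household_id, AVG(kwh) AS avg_kwh FROM readings GROUP BY household_id LIMIT 10'
  );
  console.table(res.rows);

  await client.end();
}

main().catch(console.error);
```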