Project 03 / Cloud Data Engineering

Relay

Raw sales data goes in. Clean, queryable insights come out — on the same pipeline, every time.

Role
Solo build — data engineering
Stack
AWS Glue · RDS · Terraform · Jenkins · Python
[Figure: architecture diagram. Extract: raw sales data → S3 (CSV / JSON). Transform: AWS Glue + Python (clean · derive · cast). Load: PostgreSQL on RDS (structured tables). Report: Excel reports, auto-generated with OpenPyXL. Orchestration + infra: Jenkins CI/CD orchestrates every run; Terraform + CloudFormation provisions RDS · S3 · IAM; Lambda event triggers.]

FIG. 01: Architecture. A linear ETL flow with Jenkins driving every run and Terraform owning the cloud underneath.

Plenty of teams sit on messy CSVs in someone's email and call it data. Querying any of it means an analyst manually cleaning the same file every week, until that analyst leaves and the whole workflow breaks.

Relay turns that mess into a real pipeline: raw sales data lands in S3, AWS Glue cleans and reshapes it, and the result loads into PostgreSQL on Amazon RDS where it's queryable, joinable, and trustworthy.

The whole stack is infrastructure-as-code. Terraform and CloudFormation provision RDS, S3, and IAM. Jenkins runs the pipeline. No manual clicking through the AWS console, no "it worked on my machine," no lost data. Just a button that turns raw input into reports.

Pipeline stages

Four steps, all automated.

/ 01
Ingest S3 · Lambda

Raw sales CSVs land in an S3 bucket as the single source of truth. Lambda trigger functions sit alongside the bucket to kick off the downstream job when a refresh is needed, so the pipeline can be wired to fire on events instead of polling.
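To make the event wiring concrete, here is a minimal sketch of what such a trigger function could look like. It assumes the bucket sends standard S3 event notifications to Lambda; the job name `relay-sales-transform` and the `--source_bucket` / `--source_key` arguments are hypothetical, not taken from the project. The Glue client is injectable so the handler can be exercised without AWS credentials.

```python
import urllib.parse

# Hypothetical Glue job name; the real one is not given in this write-up.
GLUE_JOB_NAME = "relay-sales-transform"


def handler(event, context, glue_client=None):
    """Start one Glue job run per raw data file named in an S3 event."""
    if glue_client is None:
        import boto3  # imported lazily so the handler can be tested offline
        glue_client = boto3.client("glue")

    run_ids = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # S3 URL-encodes object keys in event payloads (spaces arrive as '+').
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        if not key.lower().endswith((".csv", ".json")):
            continue  # ignore anything that is not raw sales data
        response = glue_client.start_job_run(
            JobName=GLUE_JOB_NAME,
            # Hypothetical argument names the Glue script would read.
            Arguments={"--source_bucket": bucket, "--source_key": key},
        )
        run_ids.append(response["JobRunId"])
    return {"started": run_ids}
```

Passing the object key through as a job argument is what would let the same job serve both fresh uploads and backfills.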

/ 02
Transform AWS Glue · Python

Glue jobs run Python scripts that clean nulls, normalize columns, cast types, and compute derived fields like tax. Heavy lifting happens here so the database stays fast and downstream queries stay simple.
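A rough sketch of that clean, normalize, cast, derive sequence, written with only the standard library so it runs anywhere. The column names (`order_id`, `quantity`, `unit_price`) and the flat 8% tax rate are assumptions for illustration; the actual Glue scripts may use different fields, rules, and libraries.

```python
import csv
import io
from decimal import Decimal, ROUND_HALF_UP

# Assumed flat tax rate, for illustration only.
TAX_RATE = Decimal("0.08")
# Assumed required columns; rows missing any of them are dropped.
REQUIRED = ("order_id", "quantity", "unit_price")


def normalize(name):
    """Turn a header like ' Unit Price ' into 'unit_price'."""
    return "_".join(name.strip().lower().split())


def transform(raw_csv):
    """Clean, cast, and enrich raw sales rows; returns a list of dicts."""
    reader = csv.DictReader(io.StringIO(raw_csv))
    cleaned = []
    for raw in reader:
        # Normalize headers and trim values; short rows yield None values.
        row = {normalize(k): (v or "").strip()
               for k, v in raw.items() if k is not None}
        if not all(row.get(col) for col in REQUIRED):
            continue  # null handling: skip incomplete rows
        quantity = int(row["quantity"])           # cast to integer
        unit_price = Decimal(row["unit_price"])   # cast to exact decimal
        subtotal = quantity * unit_price
        tax = (subtotal * TAX_RATE).quantize(
            Decimal("0.01"), rounding=ROUND_HALF_UP)  # derived field
        cleaned.append({
            "order_id": row["order_id"],
            "quantity": quantity,
            "unit_price": unit_price,
            "subtotal": subtotal,
            "tax": tax,
            "total": subtotal + tax,
        })
    return cleaned
```

`Decimal` is used instead of `float` so that money values round predictably. A non-numeric quantity or price raises here rather than being silently coerced, which is one possible policy among several.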

/ 03
Load PostgreSQL · RDS

Transformed rows write into PostgreSQL on Amazon RDS, where the data is structured, queryable, and ready for joins or reports. No more raw CSVs floating around.
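The load step could be as small as the sketch below. It works against any DB-API connection; with PostgreSQL that would typically be a `psycopg2` connection, whose placeholder style is `%s` (an assumption, since the write-up does not name the driver). The `sales` table and its columns are hypothetical.

```python
# Hypothetical target table; the real schema is not shown in this write-up.
CREATE_TABLE = """
CREATE TABLE IF NOT EXISTS sales (
    order_id   TEXT PRIMARY KEY,
    quantity   INTEGER NOT NULL,
    unit_price NUMERIC(12, 2) NOT NULL,
    tax        NUMERIC(12, 2) NOT NULL,
    total      NUMERIC(12, 2) NOT NULL
)
"""

COLUMNS = ("order_id", "quantity", "unit_price", "tax", "total")


def load_rows(conn, rows, placeholder="%s"):
    """Insert transformed rows through a DB-API connection, all or nothing."""
    marks = ", ".join([placeholder] * len(COLUMNS))
    insert = f"INSERT INTO sales ({', '.join(COLUMNS)}) VALUES ({marks})"
    # Values are passed as parameters, never formatted into the SQL string.
    params = [tuple(row[col] for col in COLUMNS) for row in rows]
    cur = conn.cursor()
    try:
        cur.execute(CREATE_TABLE)
        cur.executemany(insert, params)
        conn.commit()
    except Exception:
        conn.rollback()  # undo partial inserts if any row fails
        raise
    finally:
        cur.close()
    return len(params)
```

Committing once per batch means a failed run leaves no half-loaded data behind, so rerunning the same job is safe.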

/ 04
Report Python · OpenPyXL

A Python step queries the loaded data and writes formatted Excel reports — the format business stakeholders actually open. Generated fresh every run so the numbers stay current.
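As an illustration of the reporting step, the sketch below aggregates query results and writes them to a formatted workbook with OpenPyXL. Grouping by region, the sheet title, and the column widths are assumptions; the real report layout is not described here. `rows` stands in for whatever the SQL query returns.

```python
from collections import defaultdict


def summarize(rows):
    """Total sales per region from (region, total) pairs, sorted by region."""
    totals = defaultdict(float)
    for region, total in rows:
        totals[region] += float(total)
    return sorted(totals.items())


def write_report(summary, path):
    """Write the summary to a formatted .xlsx workbook."""
    from openpyxl import Workbook  # third-party: pip install openpyxl
    from openpyxl.styles import Font

    wb = Workbook()
    ws = wb.active
    ws.title = "Sales summary"
    ws.append(["Region", "Total sales"])
    for cell in ws[1]:
        cell.font = Font(bold=True)        # bold header row
    for region, total in summary:
        ws.append([region, round(total, 2)])
    for cell in ws["B"][1:]:
        cell.number_format = "#,##0.00"    # thousands separator, two decimals
    ws.column_dimensions["A"].width = 18
    ws.column_dimensions["B"].width = 14
    ws.freeze_panes = "A2"                 # keep the header visible on scroll
    wb.save(path)
```

A call might look like `write_report(summarize(rows), "sales_report.xlsx")`, where the file name is again only a placeholder.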

Why it holds up

Designed for boring reliability.

/ 01

Infrastructure as code

Terraform + CloudFormation provision RDS, S3, and IAM. The entire cloud setup is one repo and one command — reproducible, reviewable, no console clicking.

/ 02

Repeatable Jenkins runs

Pipeline-as-Code in Jenkins runs the same way every time. Failed stage? You see exactly where it broke. Need to backfill? Same job, different input.

/ 03

More than just moving rows

The Glue layer doesn't just shuffle data — it computes derived fields like tax inline, so the loaded tables are already analysis-ready, not raw dumps with work left to do.
