<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Open Forem: Oliver Samuel</title>
    <description>The latest articles on Open Forem by Oliver Samuel (@oliver_samuel_028c6f65ad6).</description>
    <link>https://open.forem.com/oliver_samuel_028c6f65ad6</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3393962%2F0252b000-257e-4383-9c4d-badf238400b9.jpg</url>
      <title>Open Forem: Oliver Samuel</title>
      <link>https://open.forem.com/oliver_samuel_028c6f65ad6</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://open.forem.com/feed/oliver_samuel_028c6f65ad6"/>
    <language>en</language>
    <item>
      <title>Building a Supermarket Data Pipeline</title>
      <dc:creator>Oliver Samuel</dc:creator>
      <pubDate>Thu, 05 Feb 2026 23:37:53 +0000</pubDate>
      <link>https://open.forem.com/oliver_samuel_028c6f65ad6/building-a-supermarket-data-pipeline-3pfg</link>
      <guid>https://open.forem.com/oliver_samuel_028c6f65ad6/building-a-supermarket-data-pipeline-3pfg</guid>
      <description>&lt;h2&gt;
  
  
  How I Built an Automated System That Turns Messy Sales Data Into Business Gold
&lt;/h2&gt;




&lt;p&gt;&lt;em&gt;Ever wonder how your favorite supermarket knows exactly when to restock the shelves, which products are flying off the racks, or why they always seem to have your favorite snacks in stock? The secret lies in data pipelines, and I built one from scratch.&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The Problem: Data Drowning
&lt;/h2&gt;

&lt;p&gt;Imagine you're the manager of a busy supermarket (e.g., Naivas). Every single day, thousands of transactions flow through your registers: customers buying milk, bread, snacks, cleaning supplies. Each transaction generates a line of data: &lt;em&gt;who bought what, how much they paid, and how they paid&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Now here's the challenge: &lt;strong&gt;all this data is sitting in a messy Google spreadsheet&lt;/strong&gt;, updated by cashiers in real time. It's like having a river of gold nuggets flowing past you, but no way to catch them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The questions that keep you up at night:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Which products are selling the most?&lt;/li&gt;
&lt;li&gt;What payment methods do customers prefer?&lt;/li&gt;
&lt;li&gt;Are there duplicate transactions messing up your accounting?&lt;/li&gt;
&lt;li&gt;How can you make this data useful for reports AND for your mobile app?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is exactly the problem I solved with the &lt;strong&gt;Supermarket ETL Pipeline&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Solution: An Automated Data Factory
&lt;/h2&gt;

&lt;p&gt;Think of my solution like a &lt;strong&gt;water treatment plant&lt;/strong&gt; for data:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Stage&lt;/th&gt;
&lt;th&gt;Water Plant Analogy&lt;/th&gt;
&lt;th&gt;What My Pipeline Does&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Extract&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Pumping water from the river&lt;/td&gt;
&lt;td&gt;Pulling raw sales data from Google Sheets&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Transform&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Filtering out dirt and impurities&lt;/td&gt;
&lt;td&gt;Cleaning duplicates, fixing missing values&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Load&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Storing clean water in tanks&lt;/td&gt;
&lt;td&gt;Saving clean data to PostgreSQL &amp;amp; MongoDB&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
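The three stages above can be sketched end to end in a few lines. This is a simplified illustration using an in-memory pandas DataFrame as a stand-in for the Google Sheet; the function names here are illustrative, not the project's actual API:

```python
import pandas as pd

def extract():
    # Stand-in for pulling raw sales data from Google Sheets
    return pd.DataFrame({
        "id": [1, 2, 2, None],  # note the duplicate and the missing id
        "product_name": ["Milk", "Bread", "Bread", "Soap"],
        "total_amount": [120.0, 55.0, 55.0, 80.0],
    })

def transform(df):
    # Filter out the "impurities": duplicate ids and rows with no id
    return df.drop_duplicates(subset=["id"]).dropna(subset=["id"])

def load(df):
    # Stand-in for writing to PostgreSQL / MongoDB
    return df.to_dict("records")

records = load(transform(extract()))
print(len(records))  # 2 clean rows survive
```

Four raw rows go in, two clean rows come out: exactly the water-treatment flow from the table.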

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;The Google Sheet&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzka2rxhq89pgigm6okc4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzka2rxhq89pgigm6okc4.png" alt="The Google Sheet" width="800" height="420"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;The source Google Sheet: raw transaction data with columns like id, quantity, product_name, total_amount, payment_method, and customer_type, including a few messy and duplicate rows.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  How It Works
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Step 1: Extraction: "Fishing for Data"
&lt;/h3&gt;

&lt;p&gt;My pipeline starts by reaching out to Google Sheets; think of it like casting a fishing net into a lake. The spreadsheet contains raw transaction records: every purchase, every customer, every payment.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;The Pipeline says: "Hey Google, give me all the sales data!"
Google responds: "Here's 1,000 rows of transactions!"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why Google Sheets?&lt;/strong&gt; Because it's where real businesses often keep their data: it's accessible, shareable, and doesn't require expensive software.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Terminal showing extraction logs&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyo8o6h18ciba3hwio2vw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyo8o6h18ciba3hwio2vw.png" alt="Extraction logs" width="800" height="325"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Terminal output with the "Starting extraction from Google Sheets" and "Extracted X rows" messages.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h3&gt;
  
  
  Step 2: Transformation: "The Car Wash for Data"
&lt;/h3&gt;

&lt;p&gt;Raw data is messy. Imagine every car arriving at a car wash covered in mud, leaves, and bird droppings. The transformation stage is my car wash: it takes dirty data and makes it sparkle.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What gets cleaned:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Problem&lt;/th&gt;
&lt;th&gt;Solution&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Duplicate transactions (same ID twice)&lt;/td&gt;
&lt;td&gt;Removed automatically&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Missing transaction IDs&lt;/td&gt;
&lt;td&gt;Rows dropped&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Unnecessary columns&lt;/td&gt;
&lt;td&gt;Only essential fields kept&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The pipeline keeps only what matters:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;id&lt;/code&gt; — Unique transaction identifier&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;quantity&lt;/code&gt; — How many items purchased&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;product_name&lt;/code&gt; — What was bought&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;total_amount&lt;/code&gt; — How much was paid&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;payment_method&lt;/code&gt; — Cash, card, or digital&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;customer_type&lt;/code&gt; — Member or regular customer&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Transformation Logs&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fps6hcv9lwxj2ijjslmz2.png" alt="Transform logs" width="800" height="410"&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 3: Loading: "Two Warehouses, Two Purposes"
&lt;/h3&gt;

&lt;p&gt;Here's where it gets interesting. Instead of storing data in just one place, I built a &lt;strong&gt;dual-database strategy&lt;/strong&gt;. Think of it like having two different storage facilities:&lt;/p&gt;

&lt;h4&gt;
  
  
  PostgreSQL: The Library
&lt;/h4&gt;

&lt;p&gt;PostgreSQL is like a &lt;strong&gt;meticulously organized library&lt;/strong&gt;. Every book (data record) has its place, follows strict rules, and can be cross-referenced with other books easily.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Financial reports ("How much revenue did we make last month?")&lt;/li&gt;
&lt;li&gt;Accounting audits (data integrity is guaranteed)&lt;/li&gt;
&lt;li&gt;Complex queries ("Show me all cash transactions over $100 from member customers")&lt;/li&gt;
&lt;/ul&gt;
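That last question maps directly onto a SQL query. Here's a hedged sketch using SQLite in place of PostgreSQL so it runs anywhere; the table and column names follow the pipeline's schema, but the rows are made up:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE transactions ("
    "id INTEGER, product_name TEXT, total_amount REAL, "
    "payment_method TEXT, customer_type TEXT)"
)
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?, ?, ?, ?)",
    [
        (1, "Milk", 120.0, "cash", "member"),
        (2, "Bread", 55.0, "card", "regular"),
        (3, "Rice", 210.0, "cash", "member"),
    ],
)

# "Show me all cash transactions over $100 from member customers"
rows = conn.execute(
    "SELECT id, product_name, total_amount FROM transactions "
    "WHERE payment_method = 'cash' AND total_amount > 100 "
    "AND customer_type = 'member'"
).fetchall()
print(rows)  # [(1, 'Milk', 120.0), (3, 'Rice', 210.0)]
```

In the library analogy: because every record follows the same strict schema, cross-referencing like this is one declarative query.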

&lt;h4&gt;
  
  
  MongoDB: The Flexible Warehouse
&lt;/h4&gt;

&lt;p&gt;MongoDB is like a &lt;strong&gt;modern warehouse with adjustable shelving&lt;/strong&gt;. You can store items of different shapes and sizes without reorganizing everything.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Mobile app backends (JSON-friendly)&lt;/li&gt;
&lt;li&gt;Rapid prototyping ("Let's quickly add a new field!")&lt;/li&gt;
&lt;li&gt;Analytics dashboards (flexible data exploration)&lt;/li&gt;
&lt;/ul&gt;
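The JSON-friendliness is easy to see without a live MongoDB. The pipeline converts the DataFrame with the same `to_dict("records")` call shown later before `insert_many()`; this sketch stops short of the database and just shows why the shape suits an app backend (the `loyalty_points` field is a hypothetical addition):

```python
import json
import pandas as pd

df = pd.DataFrame([
    {"id": 1, "product_name": "Milk", "total_amount": 120.0},
    {"id": 2, "product_name": "Bread", "total_amount": 55.0},
])

# Same conversion the pipeline uses before insert_many()
docs = df.to_dict("records")

# Documents are already JSON-serializable, ready for a mobile app API
payload = json.dumps(docs)

# "Let's quickly add a new field!" -- no schema migration required,
# because documents in a collection don't have to share a shape
docs[0]["loyalty_points"] = 12
print(docs[0])
```

That adjustable-shelving quality is the trade-off: faster iteration, fewer guarantees than PostgreSQL's strict schema.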

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Docker containers running&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fadsulj0k4aodqqneyuu5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fadsulj0k4aodqqneyuu5.png" alt="Containers Running" width="800" height="410"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;PostgreSQL data view&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftped70e3nuxk1bg8ienc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftped70e3nuxk1bg8ienc.png" alt="PostgreSQL data view" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;MongoDB data view&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fig59g2e8evd0mui6qope.png" alt="Mongo Data View" width="800" height="405"&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  How It Works (The Technical Deep-Dive)
&lt;/h2&gt;

&lt;p&gt;For my fellow engineers, let's pop the hood and look at the engine.&lt;/p&gt;

&lt;h3&gt;
  
  
  Architecture Overview
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│  Google Sheets  │────▶│  Python ETL     │────▶│  PostgreSQL     │
│  (Data Source)  │     │  (Container)    │     │  (Relational)   │
└─────────────────┘     │                 │     └─────────────────┘
                        │  • Extract      │
                        │  • Transform    │     ┌─────────────────┐
                        │  • Load         │────▶│  MongoDB        │
                        └─────────────────┘     │  (Document)     │
                                                └─────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Project folder structure&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcy27d4tpaf48qjtqxfa7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcy27d4tpaf48qjtqxfa7.png" alt="Folder Structure" width="800" height="401"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  The Modular Design Philosophy
&lt;/h3&gt;

&lt;p&gt;Instead of one giant script, I split the pipeline into &lt;strong&gt;specialized modules&lt;/strong&gt;, like having different specialists in a hospital:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;File&lt;/th&gt;
&lt;th&gt;Role&lt;/th&gt;
&lt;th&gt;Hospital Analogy&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;config.py&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Configuration management&lt;/td&gt;
&lt;td&gt;Hospital administrator&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;extract.py&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Data extraction&lt;/td&gt;
&lt;td&gt;Ambulance driver&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;transform.py&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Data cleaning&lt;/td&gt;
&lt;td&gt;Surgeon&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;load_postgres.py&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;PostgreSQL loading&lt;/td&gt;
&lt;td&gt;Recovery ward nurse&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;load_mongo.py&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;MongoDB loading&lt;/td&gt;
&lt;td&gt;Rehabilitation specialist&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;main.py&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Orchestration&lt;/td&gt;
&lt;td&gt;Chief of Medicine&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Why this matters:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Testability&lt;/strong&gt;: I can test the transformation logic without needing a database connection&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Maintainability&lt;/strong&gt;: Changing the data source doesn't break the loading logic&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scalability&lt;/strong&gt;: Adding a new destination (like Snowflake) is just adding one new file&lt;/li&gt;
&lt;/ul&gt;
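To make the testability point concrete: the cleaning logic can be exercised with a plain in-memory DataFrame and simple assertions, with no database container running. This sketch reimplements the cleaning rules described above for illustration rather than importing the project's own module:

```python
import pandas as pd

def transform_data(df):
    # Same cleaning rules as the pipeline: dedupe on id, drop missing ids
    return df.drop_duplicates(subset=["id"]).dropna(subset=["id"])

def test_transform_removes_duplicates_and_missing_ids():
    raw = pd.DataFrame({
        "id": [1, 1, None, 2],
        "product_name": ["Milk", "Milk", "Soap", "Bread"],
    })
    clean = transform_data(raw)
    assert list(clean["id"]) == [1, 2]

test_transform_removes_duplicates_and_missing_ids()
print("transform tests passed")
```

No mocks, no containers, sub-second feedback: that's the payoff of keeping `transform.py` free of database imports.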

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;main.py code&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;etl_pipeline.config&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Config&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;etl_pipeline.extract&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;extract_data&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;etl_pipeline.transform&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;transform_data&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;etl_pipeline.load_postgres&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;load_to_postgres&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;etl_pipeline.load_mongo&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;load_to_mongo&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;sys&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;logging&lt;/span&gt;

&lt;span class="c1"&gt;# Configure logging to stdout
&lt;/span&gt;&lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;basicConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;level&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;INFO&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nb"&gt;format&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;%(asctime)s - %(name)s - %(levelname)s - %(message)s&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;handlers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;StreamHandler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stdout&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ETL Application pipeline initialized.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# 1. Extract
&lt;/span&gt;    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;Config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DATA_SOURCE_TYPE&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sheets&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Starting extraction from Google Sheets (ID: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;Config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;GOOGLE_SHEET_ID&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

            &lt;span class="c1"&gt;# Extract data
&lt;/span&gt;            &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;extract_data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;source_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sheets&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;sheet_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;GOOGLE_SHEET_ID&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Unknown data source: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;Config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DATA_SOURCE_TYPE&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt;

        &lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Extracted &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; rows.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


        &lt;span class="c1"&gt;# 2. Transform
&lt;/span&gt;        &lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Step 2: Transform&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;transformed_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;transform_data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Transformed Data Shape: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;transformed_data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 

        &lt;span class="c1"&gt;# 3. Load to PostgreSQL
&lt;/span&gt;        &lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Step 3: Load to PostgreSQL&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;load_to_postgres&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;transformed_data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;POSTGRES_URL&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Load to MongoDB
&lt;/span&gt;        &lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Step 4: Load to MongoDB&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;load_to_mongo&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;transformed_data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;Config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;MONGO_URI&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;Config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;MONGO_DB&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;ETL pipeline completed successfully.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;critical&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ETL failed: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  The Code Walkthrough
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Extraction: Pandas Does the Heavy Lifting
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;extract_from_public_sheet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sheet_id&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;export_url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://docs.google.com/spreadsheets/d/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;sheet_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/export?format=csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;export_url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The magic&lt;/strong&gt;: Google Sheets can export any public sheet as CSV, and pandas reads it directly from the URL; no authentication is needed for public sheets.&lt;/p&gt;

&lt;h4&gt;
  
  
  Transformation: Clean Data or Bust
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;transform_data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;required_columns&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;quantity&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;product_name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
                        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;total_amount&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;payment_method&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;customer_type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="n"&gt;df_transformed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;required_columns&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;copy&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;df_transformed&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;drop_duplicates&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;subset&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;inplace&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;df_transformed&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dropna&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;subset&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;inplace&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;df_transformed&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Key decisions:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Only keep essential columns (data minimization)&lt;/li&gt;
&lt;li&gt;Remove duplicates by transaction ID (data integrity)&lt;/li&gt;
&lt;li&gt;Drop rows with missing IDs (no orphan records)&lt;/li&gt;
&lt;/ul&gt;
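A quick before-and-after on a toy DataFrame shows all three decisions at once (illustrative data, not the real sheet; `cashier_note` is a made-up column standing in for the fields that get dropped):

```python
import pandas as pd

raw = pd.DataFrame({
    "id": [101, 101, 102, None],
    "product_name": ["Milk", "Milk", "Bread", "Soap"],
    "total_amount": [120.0, 120.0, 55.0, 80.0],
    "cashier_note": ["ok", "ok", "rush", "?"],  # an unnecessary column
})

required_columns = ["id", "product_name", "total_amount"]
clean = raw[required_columns].copy()          # data minimization
clean.drop_duplicates(subset=["id"], inplace=True)  # data integrity
clean.dropna(subset=["id"], inplace=True)           # no orphan records

print(clean.shape)  # (2, 3): two clean rows, three essential columns
```

Four messy rows and four columns become two trustworthy rows and three essential columns.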

&lt;h4&gt;
  
  
  Loading: Two Paths, One Pipeline
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;PostgreSQL with SQLAlchemy:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;load_to_postgres&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;db_url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;table_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;transactions&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;engine&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;create_engine&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;db_url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_sql&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;table_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;engine&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;if_exists&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;replace&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;MongoDB with PyMongo:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;load_to_mongo&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;mongo_uri&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;db_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;collection_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;transactions&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;MongoClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mongo_uri&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;collection&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;db_name&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="n"&gt;collection_name&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;records&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_dict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;records&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;collection&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;insert_many&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;records&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Successful ETL run&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsd2iomstocgertwbydly.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsd2iomstocgertwbydly.png" alt="Successful ETL run" width="800" height="406"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Docker: The "It Works on My Machine" Killer
&lt;/h2&gt;

&lt;p&gt;One of the biggest headaches in software is environment setup. "It works on my machine!" is the developer's equivalent of "the dog ate my homework."&lt;/p&gt;

&lt;p&gt;Docker solves this by &lt;strong&gt;containerizing everything&lt;/strong&gt;. My entire stack (Python app, PostgreSQL, and MongoDB) runs in isolated containers that work identically on any machine.&lt;/p&gt;

&lt;h3&gt;
  
  
  The docker-compose.yml Magic
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;services&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;postgres&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;postgres:15&lt;/span&gt;
    &lt;span class="c1"&gt;# PostgreSQL runs in its own container&lt;/span&gt;

  &lt;span class="na"&gt;mongo&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;mongo:6&lt;/span&gt;
    &lt;span class="c1"&gt;# MongoDB runs in its own container&lt;/span&gt;

  &lt;span class="na"&gt;etl-app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;build&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;.&lt;/span&gt;
    &lt;span class="na"&gt;depends_on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;postgres&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;mongo&lt;/span&gt;
    &lt;span class="c1"&gt;# My Python app waits for databases to be ready&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
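&lt;p&gt;One caveat worth knowing: &lt;code&gt;depends_on&lt;/code&gt; only waits for the database &lt;em&gt;containers&lt;/em&gt; to start, not for the databases inside them to accept connections. A common workaround is a small retry loop at app startup; here's a minimal sketch (the &lt;code&gt;connect&lt;/code&gt; callable stands in for your real &lt;code&gt;create_engine&lt;/code&gt; or &lt;code&gt;MongoClient&lt;/code&gt; call):&lt;/p&gt;

```python
import time

def wait_for(connect, attempts=10, delay=2.0):
    """Call `connect` until it succeeds; raise if it never does."""
    last_error = None
    for _ in range(attempts):
        try:
            return connect()
        except Exception as exc:   # in real code, catch the driver's specific error
            last_error = exc
            time.sleep(delay)
    raise RuntimeError(f"database not reachable after {attempts} attempts") from last_error
```

&lt;p&gt;Alternatively, Compose health checks with the &lt;code&gt;depends_on: condition: service_healthy&lt;/code&gt; long syntax achieve the same thing declaratively.&lt;/p&gt;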



&lt;p&gt;&lt;strong&gt;To run the entire system:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker compose up &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="nt"&gt;--build&lt;/span&gt;
docker compose &lt;span class="nb"&gt;exec &lt;/span&gt;etl-app python main.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Key Lessons &amp;amp; Design Decisions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Why Two Databases?
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Use Case&lt;/th&gt;
&lt;th&gt;Best Database&lt;/th&gt;
&lt;th&gt;Reason&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Financial reports&lt;/td&gt;
&lt;td&gt;PostgreSQL&lt;/td&gt;
&lt;td&gt;ACID compliance, SQL support&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Mobile app API&lt;/td&gt;
&lt;td&gt;MongoDB&lt;/td&gt;
&lt;td&gt;JSON-native, flexible schema&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Complex joins&lt;/td&gt;
&lt;td&gt;PostgreSQL&lt;/td&gt;
&lt;td&gt;Relational model excels&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Rapid prototyping&lt;/td&gt;
&lt;td&gt;MongoDB&lt;/td&gt;
&lt;td&gt;No schema migrations needed&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Why Python?
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Pandas&lt;/strong&gt;: Industry-standard for data manipulation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SQLAlchemy&lt;/strong&gt;: ORM whose parameterized queries guard against SQL injection&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PyMongo&lt;/strong&gt;: Lightweight MongoDB driver&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rich ecosystem&lt;/strong&gt;: Libraries for everything&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Why Modular Design?
&lt;/h3&gt;

&lt;p&gt;Think of it like LEGO blocks. Each module is a self-contained piece that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Can be tested independently&lt;/li&gt;
&lt;li&gt;Can be replaced without breaking others&lt;/li&gt;
&lt;li&gt;Makes debugging a breeze&lt;/li&gt;
&lt;/ul&gt;
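&lt;p&gt;In code, the LEGO idea is just plain functions with narrow contracts, composed at the end. A toy sketch (the function bodies are illustrative, not the project's actual code):&lt;/p&gt;

```python
def extract(rows):
    """Stand-in for reading a CSV or API; each stage is swappable."""
    return list(rows)

def transform(rows):
    """Drop rows without an ID and normalize amounts."""
    return [
        {**r, "amount": float(r["amount"])}
        for r in rows
        if r.get("id") is not None
    ]

def load(rows, sink):
    """Append to any sink with an .extend() method (a list here, a DB adapter in production)."""
    sink.extend(rows)
    return len(rows)

def run_pipeline(source, sink):
    """Compose the stages; each can be tested or replaced independently."""
    return load(transform(extract(source)), sink)
```

&lt;p&gt;Because each stage only sees plain data in and plain data out, a unit test can target any block without spinning up a database.&lt;/p&gt;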




&lt;h2&gt;
  
  
  Future Enhancements
&lt;/h2&gt;

&lt;p&gt;This pipeline is production-ready, but here's what could come next:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Scheduling&lt;/strong&gt;: Run automatically every hour with Apache Airflow or cron&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Message Queues&lt;/strong&gt;: Use Kafka/RabbitMQ for async processing at scale&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data Validation&lt;/strong&gt;: Add Great Expectations for data quality checks&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Monitoring&lt;/strong&gt;: Add Prometheus/Grafana for pipeline observability&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;More Sources&lt;/strong&gt;: Extend to pull from APIs, S3, or other databases&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Building this ETL pipeline taught me that &lt;strong&gt;good data engineering is invisible&lt;/strong&gt;. When it works, nobody notices: the reports are accurate, the app loads fast, and decisions get made with confidence.&lt;/p&gt;

&lt;p&gt;But behind that invisibility is careful architecture: modular code, dual-database strategy, containerized deployment, and clean data transformations.&lt;/p&gt;

&lt;p&gt;Whether you're a business analyst who just wants clean data, or an engineer looking to build your own pipeline, I hope this walkthrough demystified the magic behind turning chaotic spreadsheets into business intelligence gold.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The supermarket never runs out of your favorite snacks because somewhere, a data pipeline is quietly doing its job.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If you're interested in the code, check out the repository here: &lt;a href="https://github.com/25thOliver/ProjectSupermarketAnalysis" rel="noopener noreferrer"&gt;GitHub Repo&lt;/a&gt;&lt;/p&gt;




</description>
      <category>automation</category>
      <category>data</category>
      <category>dataengineering</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Building Scalable Data Pipelines with Airflow, Docker, and Python: A SightSearch Case Study</title>
      <dc:creator>Oliver Samuel</dc:creator>
      <pubDate>Fri, 30 Jan 2026 00:26:52 +0000</pubDate>
      <link>https://open.forem.com/oliver_samuel_028c6f65ad6/building-scalable-data-pipelines-with-airflow-docker-and-python-a-sightsearch-case-study-e9</link>
      <guid>https://open.forem.com/oliver_samuel_028c6f65ad6/building-scalable-data-pipelines-with-airflow-docker-and-python-a-sightsearch-case-study-e9</guid>
      <description>&lt;p&gt;&lt;em&gt;Data is the new oil, but a raw oil field isn't useful until you build a pipeline to refine it.&lt;/em&gt; In this article, I'll take you through the journey of building &lt;strong&gt;SightSearch&lt;/strong&gt;, a robust data ingestion orchestration pipeline. Whether you're a seasoned data engineer or a product manager curious about how data moves from a website to a database, you're in the right place.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Problem: Why Orchestration Matters
&lt;/h2&gt;

&lt;p&gt;Imagine you need to scrape thousands of product images and details daily. You write a script. It works fine on day one. But then:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The script crashes halfway through&lt;/li&gt;
&lt;li&gt;You run out of disk space&lt;/li&gt;
&lt;li&gt;You forget to run it on Sunday&lt;/li&gt;
&lt;li&gt;The website layout changes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A simple script isn't enough. You need &lt;strong&gt;orchestration&lt;/strong&gt;, a system that manages, schedules, monitors, and retries your tasks automatically.&lt;/p&gt;
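&lt;p&gt;At its core, an orchestrator runs tasks in dependency order and keeps track of what's done. Stripped of scheduling, retries, and the UI, the idea can be sketched in a few lines of plain Python (a toy model for intuition, not how Airflow is actually implemented):&lt;/p&gt;

```python
def run_dag(tasks, deps):
    """Run tasks in dependency order.

    tasks: name -> zero-arg callable; deps: name -> list of upstream names.
    Returns the order in which tasks actually ran.
    """
    done, order = set(), []

    def run(name):
        if name in done:
            return
        for upstream in deps.get(name, []):
            run(upstream)           # make sure prerequisites ran first
        tasks[name]()
        done.add(name)
        order.append(name)

    for name in tasks:
        run(name)
    return order
```

&lt;p&gt;On top of this core, Airflow layers scheduling, automatic retries, logging, and a UI, which is exactly what a crashing cron script lacks.&lt;/p&gt;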




&lt;h2&gt;
  
  
  Tech Stack
&lt;/h2&gt;

&lt;p&gt;I entered the workshop with a clear goal: build something scalable and reliable. Here are the tools I chose:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Apache Airflow&lt;/strong&gt;: The industry standard for orchestrating complex workflows (DAGs)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Docker &amp;amp; Docker Compose&lt;/strong&gt;: To ensure our code runs the same way on my laptop as it does in production&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Python&lt;/strong&gt;: For the heavy lifting (scraping, image processing)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MongoDB&lt;/strong&gt;: NoSQL storage for our flexible product data&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PostgreSQL&lt;/strong&gt;: Relational storage for Airflow's internal metadata&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Architecture Overview
&lt;/h2&gt;

&lt;p&gt;The pipeline is split into independent, reusable "tasks." This modularity is key: if the scraping works but the database is down, we don't lose the data; we just retry the storage step later.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F36dldnh33um9h2kogvxl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F36dldnh33um9h2kogvxl.png" alt="A high-level diagram of the architecture" width="800" height="267"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;A high-level diagram of the architecture&lt;/em&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  The Pipeline in Action
&lt;/h2&gt;

&lt;p&gt;Let's look at the heart of our project: the &lt;strong&gt;Airflow DAG&lt;/strong&gt; (Directed Acyclic Graph). It defines the order of operations.&lt;/p&gt;
&lt;h3&gt;
  
  
  1. The Scrape Task
&lt;/h3&gt;

&lt;p&gt;First, we hit the target website to gather raw product titles and image URLs. We use smart logic to handle pagination and rate limiting.&lt;/p&gt;
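&lt;p&gt;The loop shape is roughly this (a sketch with the page fetcher injected so it can be tested offline; in the real task, &lt;code&gt;fetch_page&lt;/code&gt; wraps an HTTP request):&lt;/p&gt;

```python
import time

def scrape_all(fetch_page, delay=0.0):
    """Walk numbered pages until one comes back empty, pausing between requests.

    fetch_page(page_number) -> list of items.
    """
    items, page = [], 1
    while True:
        batch = fetch_page(page)
        if not batch:
            break                  # an empty page means we've run out of results
        items.extend(batch)
        page += 1
        time.sleep(delay)          # crude rate limiting; be polite to the target site
    return items
```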
&lt;h3&gt;
  
  
  2. Image Processing
&lt;/h3&gt;

&lt;p&gt;Raw images are heavy. We download them, calculate their hash (pHash) for deduplication, and extract metadata like dimensions and file size.&lt;/p&gt;
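&lt;p&gt;Here's the dedup idea in miniature. Note that the real pipeline uses a perceptual hash (pHash), which also collapses &lt;em&gt;near&lt;/em&gt;-duplicates; the plain SHA-256 used below only catches byte-for-byte copies:&lt;/p&gt;

```python
import hashlib

def dedupe_images(images):
    """Keep one record per unique content hash, plus simple metadata.

    images: iterable of (name, bytes) pairs.
    """
    seen, kept = set(), []
    for name, data in images:
        digest = hashlib.sha256(data).hexdigest()
        if digest in seen:
            continue               # already stored an identical image
        seen.add(digest)
        kept.append({"name": name, "hash": digest, "size_bytes": len(data)})
    return kept
```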
&lt;h3&gt;
  
  
  3. Validation and Storage
&lt;/h3&gt;

&lt;p&gt;Data quality is paramount. We validate every record. Good data goes to &lt;strong&gt;MongoDB&lt;/strong&gt;; bad data is logged for review.&lt;/p&gt;
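&lt;p&gt;A minimal sketch of that split (the required fields are assumptions for illustration):&lt;/p&gt;

```python
def validate(records, required=("title", "image_url")):
    """Split records into good (destined for MongoDB) and bad (logged for review)."""
    good, bad = [], []
    for rec in records:
        missing = [f for f in required if not rec.get(f)]
        (bad if missing else good).append(rec)
    return good, bad
```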


&lt;h2&gt;
  
  
  Step-by-Step Walkthrough
&lt;/h2&gt;

&lt;p&gt;Here's how we bring this system to life.&lt;/p&gt;
&lt;h3&gt;
  
  
  Phase 1: The Setup
&lt;/h3&gt;

&lt;p&gt;We use &lt;code&gt;docker-compose&lt;/code&gt; to spin up our entire infrastructure with one command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker compose &lt;span class="nt"&gt;-f&lt;/span&gt; docker/docker-compose.yml up &lt;span class="nt"&gt;-d&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx6krznow1q45hykfeka9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx6krznow1q45hykfeka9.png" alt="Terminal showing docker containers starting up successfully" width="800" height="575"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Terminal showing Docker containers starting up successfully&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Phase 2: The Airflow UI
&lt;/h3&gt;

&lt;p&gt;Once running, we log into the Airflow webserver. This is our command center.&lt;/p&gt;

&lt;p&gt;We unpause our &lt;code&gt;sightsearch_ingestion_pipeline&lt;/code&gt; and trigger a run.&lt;/p&gt;

&lt;h3&gt;
  
  
  Phase 3: Monitoring Execution
&lt;/h3&gt;

&lt;p&gt;As the pipeline runs, we can watch each task succeed in real time. This visual feedback is incredibly satisfying and useful for debugging.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvpb3iupgtu1etlwq83h0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvpb3iupgtu1etlwq83h0.png" alt="Airflow UI" width="800" height="432"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Airflow UI showing specific tasks turning dark green, indicating success&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Phase 4: Verifying the Data
&lt;/h3&gt;

&lt;p&gt;Finally, the moment of truth. We check our database to ensure the data actually arrived.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd5laufzizggv24zzjjx0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd5laufzizggv24zzjjx0.png" alt="A terminal or GUI view of MongoDB showing a query" width="800" height="437"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;MongoDB query &lt;code&gt;db.products.findOne()&lt;/code&gt; returning a structured product document with title, price, and image_metadata&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Challenges and Best Practices
&lt;/h2&gt;

&lt;p&gt;It wasn't all smooth sailing. Here are critical lessons I learned:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Handling Secrets Securely
&lt;/h3&gt;

&lt;p&gt;Initially, I hardcoded database passwords in &lt;code&gt;docker-compose.yml&lt;/code&gt;. This is a huge security risk!&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution&lt;/strong&gt;: I refactored to use a &lt;code&gt;.env&lt;/code&gt; file, keeping my credentials out of version control.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Module-Level Connections
&lt;/h3&gt;

&lt;p&gt;I initially opened a database connection at the top of the scraping script. This caused Airflow to try to connect to the DB while merely &lt;em&gt;parsing&lt;/em&gt; the file, leading to timeouts.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution&lt;/strong&gt;: I moved the connection logic inside the execution functions. Always initialize resources lazily!&lt;/p&gt;
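&lt;p&gt;The pattern in miniature (&lt;code&gt;make_engine&lt;/code&gt; stands in for something like &lt;code&gt;sqlalchemy.create_engine(DB_URL)&lt;/code&gt;):&lt;/p&gt;

```python
_engine = None  # nothing is created at import/parse time

def get_engine(make_engine):
    """Create the connection on first use, then reuse it (lazy initialization)."""
    global _engine
    if _engine is None:
        _engine = make_engine()
    return _engine
```

&lt;p&gt;Because nothing runs at module import, Airflow can parse the DAG file instantly, and the connection is only opened inside the task that needs it.&lt;/p&gt;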




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;SightSearch demonstrates that with the right tools, even complex data ingestion can be made reliable and transparent. Airflow gives us control, Docker gives us consistency, and Python gives us power.&lt;/p&gt;

&lt;p&gt;If you're interested in the code, check out the repository here: &lt;strong&gt;&lt;a href="https://github.com/25thOliver/SightSearch" rel="noopener noreferrer"&gt;GitHub Repo&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;




</description>
      <category>automation</category>
      <category>dataengineering</category>
      <category>docker</category>
      <category>python</category>
    </item>
    <item>
      <title>CHW Monthly Activity Aggregation: Turning Visit Logs into Insight</title>
      <dc:creator>Oliver Samuel</dc:creator>
      <pubDate>Tue, 02 Dec 2025 18:33:21 +0000</pubDate>
      <link>https://open.forem.com/oliver_samuel_028c6f65ad6/chw-monthly-activity-aggregation-turning-visit-logs-into-insight-4jm5</link>
      <guid>https://open.forem.com/oliver_samuel_028c6f65ad6/chw-monthly-activity-aggregation-turning-visit-logs-into-insight-4jm5</guid>
      <description>&lt;p&gt;Community Health Workers (CHWs) generate a huge amount of visit-level data: every household visit, assessment, and follow-up. On its own, this raw data is hard to use. Our CHW Monthly Activity Aggregation project turns those raw records into a clean monthly summary that's ready for dashboards and performance reviews.&lt;/p&gt;

&lt;p&gt;At a high level, this is a dbt (data build tool) project that reads a fact table of CHW activities and produces a single, analytics-ready table: &lt;code&gt;public.chw_activity_monthly&lt;/code&gt;, with one row per CHW per reporting month.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Problem Does This Project Solve?
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Today's challenge&lt;/strong&gt;: Managers often get raw logs (“John visited household 123 at 10:15”) instead of understandable summaries (“John visited 28 households in March, with 6 pregnancy visits”).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Our solution&lt;/strong&gt;: We automatically group all visits into monthly summaries per CHW, applying agreed-upon business rules, so decision-makers see consistent, comparable metrics over time.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Source: &lt;code&gt;public.fct_chw_activity&lt;/code&gt; (via dbt source &lt;code&gt;marts.fct_chv_activity&lt;/code&gt;)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Output: &lt;code&gt;public.chw_activity_monthly&lt;/code&gt;, incremental model keyed by &lt;code&gt;['chv_id', 'report_month']&lt;/code&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Tech stack: dbt + Postgres, orchestrated via Docker&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flxh256qhkvbfgjruqt6y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flxh256qhkvbfgjruqt6y.png" alt="Project layout" width="800" height="795"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Project layout&lt;/em&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  How the Data Flows
&lt;/h2&gt;

&lt;p&gt;The project follows a simple story:&lt;/p&gt;
&lt;h3&gt;
  
  
  &lt;strong&gt;1. Start with raw visit data&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Each row in &lt;code&gt;fct_chv_activity&lt;/code&gt; is a single CHW activity: who visited, when, what type of visit, and where.&lt;/p&gt;
&lt;h3&gt;
  
  
  &lt;strong&gt;2. Clean and filter&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;We:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Drop records with missing CHW IDs&lt;/li&gt;
&lt;li&gt;Drop records with missing activity dates&lt;/li&gt;
&lt;li&gt;Exclude visits marked as deleted&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  &lt;strong&gt;3. Assign a “reporting month”&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Instead of just using calendar months, we use a &lt;strong&gt;26th-of-the-month rule&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Visits on the &lt;strong&gt;1st to 25th&lt;/strong&gt; belong to that month&lt;/li&gt;
&lt;li&gt;Visits on the &lt;strong&gt;26th or later&lt;/strong&gt; are counted in the &lt;strong&gt;next month&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This matches the business reporting cycle CHW programs often use.&lt;/p&gt;
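&lt;p&gt;The same rule expressed in Python, for illustration (the project implements it as a dbt macro in SQL):&lt;/p&gt;

```python
from datetime import date

def report_month(activity_date: date) -> date:
    """Apply the 26th-of-the-month rule: days 26 and later roll into the next month."""
    year, month = activity_date.year, activity_date.month
    if activity_date.day >= 26:
        month += 1
        if month == 13:            # December 26+ rolls into January of the next year
            year, month = year + 1, 1
    return date(year, month, 1)    # the month is represented by its first day
```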
&lt;h3&gt;
  
  
  &lt;strong&gt;4. Aggregate to monthly metrics&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;For each CHW and reporting month, we calculate:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;total_activities&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;unique_households_visited&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;unique_patients_served&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;pregnancy_visits&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;child_assessments&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;family_planning_visits&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
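&lt;p&gt;For illustration, the same aggregation in pandas (the project does this in SQL inside the dbt model; the &lt;code&gt;visit_type&lt;/code&gt; values here are assumptions):&lt;/p&gt;

```python
import pandas as pd

def monthly_metrics(df: pd.DataFrame) -> pd.DataFrame:
    """One row per CHW per report_month, mirroring the dbt aggregation."""
    return (
        df.groupby(["chv_id", "report_month"], as_index=False)
          .agg(
              total_activities=("visit_type", "size"),
              unique_households_visited=("household_id", "nunique"),
              unique_patients_served=("patient_id", "nunique"),
              # conditional count; child_assessments and family_planning_visits
              # follow the same pattern with their own visit_type values
              pregnancy_visits=("visit_type", lambda s: int((s == "pregnancy").sum())),
          )
    )
```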
&lt;h3&gt;
  
  
  &lt;strong&gt;5. Store the result as a reusable table&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The final table &lt;code&gt;public.chw_activity_monthly&lt;/code&gt; is what reporting tools (e.g., Power BI, Looker) will connect to.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpau9ikzgen81p7mpxvlo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpau9ikzgen81p7mpxvlo.png" alt="Data Flow" width="800" height="514"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;A simple diagram showing the flow from &lt;code&gt;fct_chv_activity → cleaning/filtering → month assignment → chw_activity_monthly&lt;/code&gt; with the key metrics.&lt;/em&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Key Components
&lt;/h2&gt;
&lt;h3&gt;
  
  
  &lt;strong&gt;Macro&lt;/strong&gt;: &lt;code&gt;month_assignment(date_column)&lt;/code&gt; in &lt;code&gt;macros/month_assignment.sql&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;Implements the 26th-day rule using SQL &lt;code&gt;case&lt;/code&gt; + &lt;code&gt;date_trunc&lt;/code&gt; and a &lt;code&gt;+ interval '1 month'&lt;/code&gt; when the day ≥ 26.&lt;/p&gt;
&lt;h3&gt;
  
  
  &lt;strong&gt;Model&lt;/strong&gt;: &lt;code&gt;models/starter_code/chw_activity_monthly.sql&lt;/code&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;materialized = 'incremental'&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;incremental_strategy = 'delete+insert'&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;unique_key = ['chv_id', 'report_month']&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Within the model:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;CTE &lt;code&gt;raw&lt;/code&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Selects from &lt;code&gt;{{ source('marts', 'fct_chv_activity') }}&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Filters out null &lt;code&gt;activity_date&lt;/code&gt;, null &lt;code&gt;chv_id&lt;/code&gt;, and &lt;code&gt;is_deleted = true&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;For incremental runs, only processes months at or after the earliest &lt;code&gt;report_month&lt;/code&gt; currently present&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;CTE &lt;code&gt;assigned&lt;/code&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Calls &lt;code&gt;{{ month_assignment('activity_date') }}&lt;/code&gt; to compute &lt;code&gt;report_month&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;CTE &lt;code&gt;aggregated&lt;/code&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Groups by &lt;code&gt;chv_id, report_month&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Computes counts and conditional sums for activity types&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvwhxiqgrfiiflt35wbln.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvwhxiqgrfiiflt35wbln.png" alt="Model" width="800" height="813"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;chw_activity_monthly.sql model with the CTE structure visible and the config block at the top.&lt;/em&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Business Rules in More Detail
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Only valid, non-deleted records are included (no missing CHW, no missing date, no deleted events).&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Report month&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A visit on March 24 counts in March.&lt;/li&gt;
&lt;li&gt;A visit on March 27 counts in April.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Metrics per CHW per month:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Count total activities&lt;/li&gt;
&lt;li&gt;Count unique households and unique patients&lt;/li&gt;
&lt;li&gt;Break out program categories: pregnancy visits, child assessments, family planning visits&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If the organization later changes how months are defined or what counts as, say, a “pregnancy visit,” we can adjust these rules centrally in the dbt code, and all downstream dashboards will automatically align.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F076qjjj3kqcfx4a8cjpa.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F076qjjj3kqcfx4a8cjpa.png" alt="fct\_chv\_activity" width="800" height="780"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;A small, anonymized excerpt of fct_chv_activity and the corresponding rows in chw_activity_monthly, to illustrate how several visits roll up into one summary row.&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; 
       &lt;span class="n"&gt;chv_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;report_month&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;total_activities&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;unique_households_visited&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;unique_patients_served&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;pregnancy_visits&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;child_assessments&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;family_planning_visits&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="k"&gt;public&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chw_activity_monthly&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;chv_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;report_month&lt;/span&gt;
&lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffmxri4dcs40t0xba8pe0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffmxri4dcs40t0xba8pe0.png" alt="query\_output" width="800" height="433"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Testing and Data Quality
&lt;/h2&gt;

&lt;p&gt;The project uses dbt’s testing framework to keep the model trustworthy:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Not-null tests&lt;/strong&gt; on key fields like &lt;code&gt;chv_id&lt;/code&gt; and &lt;code&gt;report_month&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Uniqueness test&lt;/strong&gt; that enforces each (&lt;code&gt;chv_id&lt;/code&gt;, &lt;code&gt;report_month&lt;/code&gt;) pair appears only once (via &lt;code&gt;dbt_utils.unique_combination_of_columns&lt;/code&gt;)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This ensures you do not accidentally have duplicate rows or missing identifiers in your analytical table.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1c8qn3eduumoj2gwf0vc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1c8qn3eduumoj2gwf0vc.png" alt="dbt test" width="800" height="620"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Terminal output from &lt;code&gt;dbt test --select chw_activity_monthly&lt;/code&gt; showing all tests passing.&lt;/em&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  How to Run the Project
&lt;/h2&gt;

&lt;p&gt;These steps assume you have Docker installed and are running in the project root folder.&lt;/p&gt;
&lt;h3&gt;
  
  
  &lt;strong&gt;1. Start the Docker services&lt;/strong&gt;
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker compose up &lt;span class="nt"&gt;-d&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;This brings up:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A dbt runner container&lt;/li&gt;
&lt;li&gt;A Postgres database with the CHW activity data loaded&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpl4545qeiexkjp2hso9g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpl4545qeiexkjp2hso9g.png" alt="Docker run" width="800" height="593"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Terminal view showing &lt;code&gt;docker compose up -d&lt;/code&gt; completing successfully, with containers listed via &lt;code&gt;docker ps&lt;/code&gt;.&lt;/em&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  &lt;strong&gt;2. Open a shell in the dbt runner&lt;/strong&gt;
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker &lt;span class="nb"&gt;exec&lt;/span&gt; &lt;span class="nt"&gt;-it&lt;/span&gt; dbt_runner bash
&lt;span class="nb"&gt;cd &lt;/span&gt;chw_project/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Inside this container, you're now in the &lt;code&gt;dbt/chw_project&lt;/code&gt; directory, where the dbt project is configured (via &lt;code&gt;dbt_project.yml&lt;/code&gt; and &lt;code&gt;profiles.yml&lt;/code&gt;).&lt;/p&gt;
&lt;h3&gt;
  
  
  &lt;strong&gt;3. Build the monthly activity model&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;dbt run
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;dbt will:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Compile the SQL&lt;/li&gt;
&lt;li&gt;Apply the filters and the month assignment&lt;/li&gt;
&lt;li&gt;Materialize or update the &lt;code&gt;public.chw_activity_monthly&lt;/code&gt; table&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftotdqtghazsbxw5rrszd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftotdqtghazsbxw5rrszd.png" alt="A successful run" width="800" height="897"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Terminal output of &lt;code&gt;dbt run&lt;/code&gt;, showing &lt;code&gt;chw_activity_monthly&lt;/code&gt; as OK or SUCCESS with timing.&lt;/em&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  &lt;strong&gt;4. Run tests for the model&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Still inside the dbt runner container:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;dbt &lt;span class="nb"&gt;test&lt;/span&gt; &lt;span class="nt"&gt;--select&lt;/span&gt; chw_activity_monthly
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You should see the tests for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Not-null constraints&lt;/li&gt;
&lt;li&gt;Unique combination of &lt;code&gt;chv_id&lt;/code&gt; + &lt;code&gt;report_month&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2dw94ie6rm5yzk08nrj2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2dw94ie6rm5yzk08nrj2.png" alt="Summary output" width="800" height="620"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Test summary output, highlighting that all tests are passing.&lt;/em&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  &lt;strong&gt;5. Inspect the final table in Postgres&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;From your host machine:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker &lt;span class="nb"&gt;exec&lt;/span&gt; &lt;span class="nt"&gt;-it&lt;/span&gt; dbt_postgres bash
psql &lt;span class="nt"&gt;-U&lt;/span&gt; dbt_user &lt;span class="nt"&gt;-d&lt;/span&gt; analytics
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the psql prompt:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="k"&gt;public&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chw_activity_monthly&lt;/span&gt;
&lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This returns sample rows, each representing a single CHW’s activity in one reporting month.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffbgf9fu3mm13v290vx0u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffbgf9fu3mm13v290vx0u.png" alt="psql result" width="800" height="416"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Recommendations
&lt;/h2&gt;

&lt;p&gt;Once the table exists and has passed tests, analytics or program teams can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Point their BI tool to &lt;code&gt;public.chw_activity_monthly&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Build dashboards like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;CHW monthly productivity (activities per month)&lt;/li&gt;
&lt;li&gt;Coverage (households visited by CHW and region)&lt;/li&gt;
&lt;li&gt;Program mix (share of pregnancy vs. child vs. family-planning visits)&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;Track trends over time and identify:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Under-served areas (few households reached)&lt;/li&gt;
&lt;li&gt;High-performing CHWs&lt;/li&gt;
&lt;li&gt;Seasonal patterns in demand&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;Because all these dashboards share the same centrally defined table and business rules, reports across teams will be consistent and comparable.&lt;/p&gt;
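
&lt;p&gt;As a starting point, a monthly-productivity panel could be fed by a query along these lines. Only &lt;code&gt;chv_id&lt;/code&gt; and &lt;code&gt;report_month&lt;/code&gt; are confirmed by this post; &lt;code&gt;total_activities&lt;/code&gt; stands in for whatever measure columns the table actually carries:&lt;/p&gt;

```sql
-- Sketch of a BI query over the summary table (column names partly assumed).
SELECT
    report_month,
    chv_id,
    total_activities                       -- assumed measure column
FROM public.chw_activity_monthly
ORDER BY report_month, total_activities DESC;
```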

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;This project turns complex CHW visit logs into a simple summary table, so you can clearly see who is doing what, where, and when.&lt;br&gt;
It’s a dbt incremental model over &lt;code&gt;fct_chv_activity&lt;/code&gt;, using a macro for the 26th-day reporting rule, strong record-level filtering, and dbt tests for key constraints.&lt;/p&gt;

&lt;p&gt;The result is a reliable foundation for CHW performance analytics and program decision-making, with transparent and maintainable logic behind every number.&lt;/p&gt;




</description>
      <category>analytics</category>
      <category>sql</category>
      <category>dataengineering</category>
      <category>tooling</category>
    </item>
    <item>
      <title>Outerwear Performance Analysis: A Data-Driven Investigation</title>
      <dc:creator>Oliver Samuel</dc:creator>
      <pubDate>Tue, 18 Nov 2025 16:59:08 +0000</pubDate>
      <link>https://open.forem.com/oliver_samuel_028c6f65ad6/outerwear-performance-analysis-a-data-driven-investigation-4k4f</link>
      <guid>https://open.forem.com/oliver_samuel_028c6f65ad6/outerwear-performance-analysis-a-data-driven-investigation-4k4f</guid>
      <description>&lt;h2&gt;
  
  
  &lt;strong&gt;1. Problem Statement&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The &lt;strong&gt;Outerwear category&lt;/strong&gt; has shown persistent underperformance across multiple business dimensions: revenue, margin, and customer engagement. Despite moderate sales spikes in peak seasons (Fall and Winter), total Outerwear revenue ($18.5K) lags far behind other categories, indicating structural weaknesses in demand generation and retention.&lt;/p&gt;

&lt;p&gt;High &lt;strong&gt;discount penetration (44.4%)&lt;/strong&gt; suggests dependency on promotions to move stock, compressing margins and signaling that customers perceive inadequate value at full price. Meanwhile, ratings are only moderate (3.75 overall) and decline further in Fall (3.64), implying inconsistent product quality or unmet customer expectations during the peak sales window.&lt;/p&gt;

&lt;p&gt;Seasonal dependency, discount-driven sales, and stagnant customer retention lead to a vicious cycle: deep discounting drives one-time purchases but suppresses long-term profitability. Our analysis aims to diagnose these issues and design actionable solutions for merchandising, marketing, and financial optimization.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;2. Primary Hypothesis&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;H₀ (Null Hypothesis):&lt;/strong&gt;&lt;br&gt;
Outerwear performance aligns with normal seasonal apparel patterns and observed sales fluctuations reflect natural demand variability.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;H₁ (Alternative Hypothesis):&lt;/strong&gt;&lt;br&gt;
Outerwear underperforms due to addressable factors — &lt;em&gt;seasonal dependency, discount addiction, quality decline, and poor retention&lt;/em&gt; — which can be mitigated via assortment diversification, pricing strategy, and loyalty optimization.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;3. Sub-Hypotheses and Analytical Approach&lt;/strong&gt;
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;ID&lt;/th&gt;
&lt;th&gt;Sub-Hypothesis&lt;/th&gt;
&lt;th&gt;Key Tests&lt;/th&gt;
&lt;th&gt;Expected Outcome&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;H1&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;em&gt;Seasonality Hypothesis:&lt;/em&gt; Outerwear revenue is overly concentrated in Fall–Winter seasons.&lt;/td&gt;
&lt;td&gt;Seasonality Index; Chi-square for uniformity; MoM trendline&lt;/td&gt;
&lt;td&gt;Revenue &amp;gt;50% from Fall/Winter → confirms dependency.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;H2&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;em&gt;Discount Dependency Hypothesis:&lt;/em&gt; High discount penetration artificially sustains volume.&lt;/td&gt;
&lt;td&gt;T-test on AOV (discounted vs non-discounted); Repeat purchase rate&lt;/td&gt;
&lt;td&gt;Discounts → higher one-time buyers, lower loyalty.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;H3&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;em&gt;Quality/Ratings Hypothesis:&lt;/em&gt; Outerwear ratings dropping in Fall correlate with reduced repurchase rates.&lt;/td&gt;
&lt;td&gt;ANOVA: ratings vs season; correlation (rating vs repurchase)&lt;/td&gt;
&lt;td&gt;Declining Quality → reduced retention, especially in Fall.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;H4&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;em&gt;Retention Hypothesis:&lt;/em&gt; Outerwear attracts one-time, non-subscriber customers.&lt;/td&gt;
&lt;td&gt;Chi-square: loyalty distribution; segment comparison (new vs loyal buyers)&lt;/td&gt;
&lt;td&gt;Majority transactions from non-subscribers → poor loyalty.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;H5&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;em&gt;Assortment Mismatch Hypothesis:&lt;/em&gt; SKU concentration in specific sizes/colors limits appeal.&lt;/td&gt;
&lt;td&gt;Herfindahl Index on SKU diversity; distribution plots (size/color)&lt;/td&gt;
&lt;td&gt;Over-indexed in M/Cyan → missing revenue from underserved segments.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
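
&lt;p&gt;To make the H1 test concrete, here is a minimal chi-square goodness-of-fit sketch in Python. It uses the seasonal Outerwear transaction counts reported later in this analysis and asks whether they depart from a uniform spread; the same pattern applies to seasonal revenue shares:&lt;/p&gt;

```python
# Chi-square goodness-of-fit: are Outerwear transactions uniform across seasons?
# Counts come from the seasonal breakdown reported in this post.
observed = {"Spring": 169, "Summer": 134, "Fall": 166, "Winter": 170}

expected = sum(observed.values()) / len(observed)  # uniform expectation per season
chi_sq = sum((count - expected) ** 2 / expected for count in observed.values())

# 7.815 is the 5% critical value for a chi-square with 3 degrees of freedom.
print(f"chi-square statistic: {chi_sq:.2f}")
print(f"reject uniformity at the 5% level: {chi_sq > 7.815}")
```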




&lt;h2&gt;
  
  
  &lt;strong&gt;4. PromptBI Dashboard Link&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://app.promptbi.ai/public/chat/14ccfba2-f2df-4373-8bd2-ca78a62d0208" rel="noopener noreferrer"&gt;Dashboard&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyrk2ymr7s5scok8en1hb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyrk2ymr7s5scok8en1hb.png" alt="Dashboard Deep dive" width="800" height="382"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Outerwear Category Analytics and Deep Dive&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Summary Insight:&lt;/strong&gt; Outerwear revenue and transaction counts swing strongly with the seasons, and the category leans on discounts more heavily than any other.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Metrics:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Total Outerwear Revenue: $36,753.00&lt;/li&gt;
&lt;li&gt;Average Outerwear Rating: 3.75&lt;/li&gt;
&lt;li&gt;Outerwear Transaction Count: 639&lt;/li&gt;
&lt;li&gt;Most Common Outerwear Size: M&lt;/li&gt;
&lt;li&gt;Most Common Outerwear Color: Cyan&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Supporting Metrics and Trends:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Revenue by Category: Clothing leads with the highest total revenue, while Outerwear trails every other category.&lt;/li&gt;
&lt;li&gt;Average Order Value (AOV): Footwear has the highest AOV, indicating higher-value purchases.&lt;/li&gt;
&lt;li&gt;Transaction Count: Clothing is the most transacted category.&lt;/li&gt;
&lt;li&gt;Units Sold: Clothing also leads in units sold.&lt;/li&gt;
&lt;li&gt;Rating Distribution: The most frequent rating bin is 3.25-3.5.&lt;/li&gt;
&lt;li&gt;Discount Penetration: Outerwear has the highest discount penetration among all categories.&lt;/li&gt;
&lt;li&gt;Customer Loyalty Metrics: Loyal customers show a higher transaction count in the Outerwear category.&lt;/li&gt;
&lt;li&gt;Seasonal Trends:

&lt;ul&gt;
&lt;li&gt;Revenue Peak: Fall season&lt;/li&gt;
&lt;li&gt;Transaction Peak: Winter season&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Price Distribution: The most frequent price bin for Outerwear is $20.00-$30.00.&lt;/li&gt;

&lt;/ul&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;5. Visuals Section — PromptBI Chart Placement Guide&lt;/strong&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;1. Revenue by Category Comparison&lt;/strong&gt; &lt;em&gt;(Bar Chart)&lt;/em&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frqczt85rcuz4brgv7d42.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frqczt85rcuz4brgv7d42.jpeg" alt="Revenue by Category Comparison" width="531" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Insights:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The chart reveals the total revenue generated by different product categories: Accessories, Clothing, Footwear, and Outerwear.&lt;/li&gt;
&lt;li&gt;The highest revenue is from Clothing with $104,264, followed by Accessories with $74,200.&lt;/li&gt;
&lt;li&gt;Footwear generates $36,093 in revenue, while Outerwear has the lowest revenue at $18,524.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Trends and Implications:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Clothing is the top revenue generator, indicating strong customer demand and potentially higher profit margins.&lt;/li&gt;
&lt;li&gt;Outerwear, with the lowest revenue, suggests either lower demand, higher competition, or pricing issues.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Customer and Business Impact:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The disparity in revenue between categories highlights the need for targeted strategies to boost underperforming segments like Outerwear.&lt;/li&gt;
&lt;li&gt;Understanding why Outerwear has the lowest revenue could uncover market gaps or customer pain points.&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;2. Average Order Value by Category&lt;/strong&gt; &lt;em&gt;(Bar Chart)&lt;/em&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg57zvpxcwwkn095zqmbv.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg57zvpxcwwkn095zqmbv.jpeg" alt="Average Order Value by Category" width="531" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Insights:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The Outerwear category has the lowest Average Order Value (AOV) at $57.17, roughly 5% below the category leader.&lt;/li&gt;
&lt;li&gt;In contrast, Footwear leads with the highest AOV at $60.26.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Business Implications:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The lower AOV in Outerwear suggests potential areas for improvement in customer engagement or product offerings within this category.&lt;/li&gt;
&lt;li&gt;Consider analyzing customer feedback and sales data specific to Outerwear to identify pain points or opportunities for upselling.&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;3. Transaction Count by Category&lt;/strong&gt; &lt;em&gt;(Bar Chart)&lt;/em&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3ndems27q5m0zx8qzp5d.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3ndems27q5m0zx8qzp5d.jpeg" alt="Transaction Count by Category" width="531" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Insights:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Outerwear transactions are 54% lower than Accessories and 81% lower than Clothing.&lt;/li&gt;
&lt;li&gt;This indicates potential underperformance or lower customer demand in the Outerwear segment.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Business Implications:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Consider reviewing the Outerwear product lineup for relevance and appeal.&lt;/li&gt;
&lt;li&gt;Evaluate marketing efforts targeted at this category to identify gaps.&lt;/li&gt;
&lt;li&gt;Explore seasonal trends or external factors affecting Outerwear sales.&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;4. Units Sold Comparison&lt;/strong&gt; &lt;em&gt;(Bar Chart)&lt;/em&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fskj8ee00ysnizsvxtcc8.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fskj8ee00ysnizsvxtcc8.jpeg" alt="Units Sold Comparison" width="531" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Insights:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Lowest Sales Volume: Outerwear has the fewest units sold at 324, significantly lower than other categories.&lt;/li&gt;
&lt;li&gt;Sales Gap: There's a notable 1,413 units difference between Outerwear and the highest-selling category, Clothing.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Business Implications:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The low sales in Outerwear may indicate market saturation, competitor dominance, or customer disinterest.&lt;/li&gt;
&lt;li&gt;Consider a market analysis to understand the underlying causes and explore strategies to boost Outerwear sales.&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;5. Rating Distribution&lt;/strong&gt; &lt;em&gt;(Histogram)&lt;/em&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhoweecl7ntmr9pfjgm25.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhoweecl7ntmr9pfjgm25.jpeg" alt="Rating Distribution" width="531" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The histogram reveals the distribution of customer ratings for the Outerwear category, which is crucial for understanding customer satisfaction and identifying areas for improvement.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Trends:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The most frequent rating bin is 3.25-3.5, indicating a central tendency towards average satisfaction.&lt;/li&gt;
&lt;li&gt;Ratings between 3.25-3.5 and 3.75-4.0 have the highest number of reviews, suggesting a significant portion of customers are moderately satisfied.&lt;/li&gt;
&lt;li&gt;There is a noticeable drop in the number of reviews for ratings below 3.0, implying fewer customers are highly dissatisfied.&lt;/li&gt;
&lt;li&gt;The lowest rating bin (2.5-2.75) still has a considerable number of reviews (379), indicating room for improvement in product quality or customer experience.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Business Implications:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Focus on enhancing products or services to move the average ratings upwards, especially targeting the 2.5-2.75 range.&lt;/li&gt;
&lt;li&gt;Investigate the causes behind the moderate satisfaction levels in the 3.25-3.5 range to identify specific pain points or areas for enhancement.&lt;/li&gt;
&lt;li&gt;Leverage the high number of reviews in the 3.75-4.0 range to gather insights on what customers appreciate most, and amplify those aspects in marketing and product development.&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;6. Discount Penetration by Category&lt;/strong&gt; &lt;em&gt;(Bar Chart)&lt;/em&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd0fuo4azbo6jleuey4vy.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd0fuo4azbo6jleuey4vy.jpeg" alt="Discount Penetration by Category" width="531" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Outerwear leads in discount penetration, which is consistent with the discount-dependency concern raised in the problem statement.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Insights:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Outerwear has the highest discount penetration at 44.44%, indicating heavy reliance on discounts to move stock in this category.&lt;/li&gt;
&lt;li&gt;Accessories and Footwear follow with discount penetrations of 43.79% and 43.24% respectively, showing consistent customer interest.&lt;/li&gt;
&lt;li&gt;Clothing has the lowest penetration at 42.08%, suggesting potential room for optimization in discount strategies.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Business Implications:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Reassess the depth and frequency of Outerwear discounts: high penetration combined with the category's lowest revenue points to margin compression rather than healthy engagement.&lt;/li&gt;
&lt;li&gt;Investigate why Clothing has lower discount penetration and explore opportunities to increase its appeal through targeted promotions or product improvements.&lt;/li&gt;
&lt;li&gt;Monitor trends in Accessories and Footwear to ensure continued customer interest and adjust strategies as needed.&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;7. Customer Loyalty Metrics by Category&lt;/strong&gt; &lt;em&gt;(Stacked Bar)&lt;/em&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs3u846hfekvjqznhasor.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs3u846hfekvjqznhasor.jpeg" alt="Customer Loyalty Metrics by Category" width="531" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Transaction Insights:&lt;/strong&gt;&lt;br&gt;
The Outerwear category shows a total of 324 transactions (233 non-subscribers + 91 subscribers).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Trends:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Non-subscribers dominate with 233 transactions, significantly higher than subscribers.&lt;/li&gt;
&lt;li&gt;Subscribers contribute 91 transactions, indicating a smaller but present loyal customer base.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Business Implications:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The lower transaction count for subscribers suggests potential growth in customer loyalty for Outerwear.&lt;/li&gt;
&lt;li&gt;Focus on converting non-subscribers to subscribers could increase overall transaction volume.&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;8. Outerwear Revenue by Season&lt;/strong&gt; &lt;em&gt;(Line Chart)&lt;/em&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxnd3wp1j8cmv2swfhv72.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxnd3wp1j8cmv2swfhv72.jpeg" alt="Outerwear Revenue by Season" width="531" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why It Matters:&lt;/strong&gt;&lt;br&gt;
Understanding seasonal revenue trends in the Outerwear category is crucial for aligning inventory, marketing efforts, and customer engagement strategies.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Trends:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Spring: High revenue at $9,749, indicating strong demand.&lt;/li&gt;
&lt;li&gt;Summer: Lowest revenue at $7,449, suggesting reduced need for outerwear.&lt;/li&gt;
&lt;li&gt;Fall: Peak season with revenue at $9,778, the highest point.&lt;/li&gt;
&lt;li&gt;Winter: Slight drop from Fall, revenue at $9,777.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Business Implications:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Focus marketing and promotions in Fall to capitalize on peak demand.&lt;/li&gt;
&lt;li&gt;Consider inventory adjustments for Summer to align with lower demand.&lt;/li&gt;
&lt;li&gt;Analyze customer behavior in Spring to replicate successful strategies in other seasons.&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;9. Outerwear Transactions by Season&lt;/strong&gt; &lt;em&gt;(Line Chart)&lt;/em&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2wf4eu13tmek7rw9xjz7.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2wf4eu13tmek7rw9xjz7.jpeg" alt="Outerwear Transactions by Season" width="531" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why It Matters:&lt;/strong&gt;&lt;br&gt;
Understanding seasonal trends in outerwear transactions helps in aligning inventory, marketing efforts, and customer engagement strategies to maximize sales and customer satisfaction.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Trends:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Winter Peak: Outerwear transactions peak in Winter with 170 transactions, indicating high demand during this season.&lt;/li&gt;
&lt;li&gt;Spring High: Spring follows closely with 169 transactions, suggesting strong seasonal demand.&lt;/li&gt;
&lt;li&gt;Lowest in Summer: Summer sees the lowest transaction count at 134, reflecting reduced need for outerwear.&lt;/li&gt;
&lt;li&gt;Fall Resurgence: Fall shows a resurgence with 166 transactions, close to Spring levels.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Business Implications:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Focus marketing and promotional efforts on Winter and Spring to capitalize on high transaction periods.&lt;/li&gt;
&lt;li&gt;Consider inventory adjustments to ensure sufficient stock during peak seasons while managing lower demand in Summer.&lt;/li&gt;
&lt;li&gt;Explore strategies to boost sales in Summer, such as promoting lightweight outerwear or transitional pieces.&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;10. Discount Impact on Outerwear Revenue&lt;/strong&gt; &lt;em&gt;(Scatter Plot)&lt;/em&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fedhrg3gw452v2k9r6uyn.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fedhrg3gw452v2k9r6uyn.jpeg" alt="Discount Impact on Outerwear Revenue" width="531" height="521"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why This Matters:&lt;/strong&gt;&lt;br&gt;
Understanding how discounts affect outerwear revenue is crucial for optimizing sales strategies and maximizing profit margins.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Trends:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Fall shows the highest revenue at $9,778 with a discount penetration of 45.18%.&lt;/li&gt;
&lt;li&gt;Summer has the highest discount penetration at 50% but the lowest revenue at $7,449.&lt;/li&gt;
&lt;li&gt;Spring and Winter show similar revenue figures around $9,750 with discount penetrations of 40.24% and 47.65%, respectively.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Trendline Analysis:&lt;/strong&gt;&lt;br&gt;
The trendline equation y = -185.25x + 17666.75 indicates a negative correlation between discount penetration and revenue. As discount penetration increases, revenue tends to decrease.&lt;/p&gt;
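
&lt;p&gt;The reported trendline can be reproduced with an ordinary least-squares fit over the four seasonal points above, for example with a short numpy sketch:&lt;/p&gt;

```python
# Least-squares line through the four (discount penetration %, revenue $) points
# quoted above; polyfit recovers the dashboard's y = -185.25x + 17666.75.
import numpy as np

discount_pct = np.array([40.24, 50.00, 45.18, 47.65])  # Spring, Summer, Fall, Winter
revenue = np.array([9749.0, 7449.0, 9778.0, 9777.0])

slope, intercept = np.polyfit(discount_pct, revenue, deg=1)
print(f"y = {slope:.2f}x + {intercept:.2f}")  # y = -185.25x + 17666.75
```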

&lt;p&gt;&lt;strong&gt;Business Implications:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Higher discounts do not necessarily lead to higher revenue, as seen in Summer.&lt;/li&gt;
&lt;li&gt;Moderate discount levels in Fall and Winter correlate with peak revenue.&lt;/li&gt;
&lt;li&gt;Consider reducing discount levels in Summer to potentially increase revenue.&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;11. Customer Segmentation by Season&lt;/strong&gt; &lt;em&gt;(Clustered Bar)&lt;/em&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fttntdqf0bmc3bsk8qso4.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fttntdqf0bmc3bsk8qso4.jpeg" alt="Customer Segmentation by Season" width="531" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why This Matters:&lt;/strong&gt;&lt;br&gt;
Understanding customer behavior across different seasons helps tailor marketing strategies and inventory planning for the Outerwear category.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Trends:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Loyal Customers Drive Transactions: The 'Loyal' segment consistently shows higher transaction volumes compared to the 'Active' segment across all seasons.&lt;/li&gt;
&lt;li&gt;Seasonal Variance:

&lt;ul&gt;
&lt;li&gt;'Loyal' customers peak in Spring with 133 transactions and maintain high levels in Fall (129) and Winter (132).&lt;/li&gt;
&lt;li&gt;'Active' customers show less variance, ranging from 31 transactions in Summer to 38 in Winter.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Summer Low for Loyal Customers: The 'Loyal' segment drops to 103 transactions in Summer, indicating a potential area for engagement strategies.&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Business Implications:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Focus retention strategies on the 'Loyal' segment to maintain high transaction volumes.&lt;/li&gt;
&lt;li&gt;Investigate why 'Loyal' customers drop in Summer and develop targeted campaigns to re-engage them during this period.&lt;/li&gt;
&lt;li&gt;Consider inventory adjustments to align with the high demand from 'Loyal' customers in Fall, Spring, and Winter.&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;12. Price Distribution&lt;/strong&gt; &lt;em&gt;(Histogram)&lt;/em&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd8hukc93imf9y3qty4jb.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd8hukc93imf9y3qty4jb.jpeg" alt="Price Distribution" width="531" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Customers prefer lower price ranges.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Trends:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The most frequent price bin is 20.0-30.0 USD, with 112 transactions. This indicates a strong preference for lower-priced outerwear.&lt;/li&gt;
&lt;li&gt;The number of transactions decreases as the price increases from 20.0-30.0 USD to 70.0-80.0 USD.&lt;/li&gt;
&lt;li&gt;There is a slight uptick in transactions in the 80.0-90.0 USD range, suggesting a segment of customers willing to pay a bit more.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Business Implications:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Focus marketing efforts on the 20.0-30.0 USD range to capture the largest customer segment.&lt;/li&gt;
&lt;li&gt;Consider promotional strategies for the 80.0-90.0 USD range to leverage the observed interest.&lt;/li&gt;
&lt;li&gt;Analyze the 70.0-80.0 USD range to understand the drop-off and adjust pricing or product offerings accordingly.&lt;/li&gt;
&lt;/ul&gt;
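The $10-wide price bins behind this histogram are simple to recompute from raw transaction prices. A hedged sketch in plain Python; the `prices` list is illustrative, not the article's dataset:

```python
from collections import Counter

# Illustrative transaction prices in USD -- not the article's actual data.
prices = [24.99, 27.50, 22.00, 45.00, 83.10, 29.95, 61.25, 88.00, 23.40, 72.80]

# Bucket each price into a $10-wide bin, e.g. 24.99 -> "20.0-30.0".
def price_bin(p: float, width: float = 10.0) -> str:
    lo = (p // width) * width
    return f"{lo:.1f}-{lo + width:.1f}"

counts = Counter(price_bin(p) for p in prices)
for label, n in sorted(counts.items(), key=lambda kv: float(kv[0].split("-")[0])):
    print(f"{label:>11} USD: {n}")
```

The same binning applied to the full Outerwear transaction table would reproduce the 112-transaction peak in the 20.0-30.0 USD bin described above.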




&lt;h3&gt;
  
  
  &lt;strong&gt;13. Size and Color Distribution&lt;/strong&gt; &lt;em&gt;(Stacked Bar)&lt;/em&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwlvtru1x23qs3ug3wwfj.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwlvtru1x23qs3ug3wwfj.jpeg" alt="Size and Color Distribution" width="531" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why This Matters:&lt;/strong&gt;&lt;br&gt;
Understanding the distribution of outerwear sizes and colors helps tailor inventory and marketing strategies to meet customer preferences effectively.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Trends:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Medium Size Dominance: Size M has the highest counts across almost all colors, indicating a strong preference for this size.&lt;/li&gt;
&lt;li&gt;Popular Colors: Beige, Blue, Brown, and Gray are consistently popular across all sizes, with notable peaks in Size M.&lt;/li&gt;
&lt;li&gt;Size L Insights: Size L shows a varied distribution with Cyan and Brown leading, suggesting a niche market within larger sizes.&lt;/li&gt;
&lt;li&gt;Size S Observations: Size S has lower overall counts, with Beige and Olive standing out, hinting at specific customer segments.&lt;/li&gt;
&lt;li&gt;Size XL Trends: Size XL has the lowest counts, with Cyan and Lavender showing slight preference, indicating limited demand.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Business Implications:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Focus inventory replenishment on Size M, especially in popular colors like Beige, Blue, and Gray.&lt;/li&gt;
&lt;li&gt;Consider targeted marketing campaigns for Size L, highlighting popular colors like Cyan and Brown.&lt;/li&gt;
&lt;li&gt;Evaluate the need for Size S and XL, possibly reducing stock for less popular colors to optimize inventory.&lt;/li&gt;
&lt;li&gt;Explore customer feedback for Size S and XL to understand specific needs and preferences.&lt;/li&gt;
&lt;/ul&gt;
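Concentration of the kind shown in this chart (demand piling into Size M) can be quantified with a Herfindahl-Hirschman index over SKU shares, the same measure named in the analytical plan of the companion investigation. A sketch with illustrative size-level counts, not the chart's exact values:

```python
# Herfindahl-Hirschman index (HHI): sum of squared shares.
# Ranges from 1/k (perfectly even over k SKUs) up to 1.0 (all demand in one SKU).
# Counts below are illustrative size-level counts, not the chart's exact data.
sku_counts = {"S": 40, "M": 160, "L": 90, "XL": 30}

total = sum(sku_counts.values())
shares = {sku: n / total for sku, n in sku_counts.items()}
hhi = sum(s ** 2 for s in shares.values())

even = 1 / len(sku_counts)  # HHI if demand were spread evenly across sizes
print(f"HHI = {hhi:.3f} (evenly spread would be {even:.3f})")
```

An HHI well above the even-spread baseline is the quantitative signal behind the "Medium Size Dominance" observation and supports the SKU-diversification recommendation below.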




&lt;h2&gt;
  
  
  &lt;strong&gt;6. Short Conclusion &amp;amp; Prioritized Recommendations&lt;/strong&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Key Takeaways&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Outerwear's &lt;strong&gt;low profitability&lt;/strong&gt; stems from high discount usage, narrow SKU range, and weak customer retention.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Seasonality is confirmed&lt;/strong&gt; but not optimized — demand surges in Fall/Winter, yet margins erode.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Moderate but declining ratings&lt;/strong&gt; suggest emerging product or expectation alignment issues.&lt;/li&gt;
&lt;li&gt;Data indicates &lt;strong&gt;assortment imbalance&lt;/strong&gt; (overindexing in size M and Cyan color), limiting growth potential.&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;Recommendations by Function&lt;/strong&gt;
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Function&lt;/th&gt;
&lt;th&gt;Action Priority&lt;/th&gt;
&lt;th&gt;Recommendation&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Merchandising&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;🔹 High&lt;/td&gt;
&lt;td&gt;Diversify Outerwear SKUs — introduce extended sizes (S, XL), rebalance color range, and add transitional products for off-seasons.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Marketing&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;🔹 Medium&lt;/td&gt;
&lt;td&gt;Shift communication from discount-heavy promos to "durability and design" value messaging; launch a Summer lightweight outerwear campaign.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Finance&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;🔹 High&lt;/td&gt;
&lt;td&gt;Optimize discount policy — cap average discount &amp;lt;35%, test bundle promos instead of direct markdowns to protect margin.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;CRM / Loyalty&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;🔹 Medium&lt;/td&gt;
&lt;td&gt;Introduce a loyalty reward or seasonal bundle subscription for outerwear customers to improve repeat purchase rates.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;One-Line Executive Summary&lt;/strong&gt;
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;"Outerwear's performance problem is not demand shortage but &lt;em&gt;value perception and assortment imbalance&lt;/em&gt; — by optimizing variety, discount structure, and retention strategy, the category can regain profitability and year-round engagement."&lt;/p&gt;
&lt;/blockquote&gt;

</description>
    </item>
    <item>
      <title>Outerwear Performance Analysis: A Data-Driven Investigation</title>
      <dc:creator>Oliver Samuel</dc:creator>
      <pubDate>Tue, 18 Nov 2025 12:27:07 +0000</pubDate>
      <link>https://open.forem.com/oliver_samuel_028c6f65ad6/outerwear-performance-analysis-a-data-driven-investigation-23b9</link>
      <guid>https://open.forem.com/oliver_samuel_028c6f65ad6/outerwear-performance-analysis-a-data-driven-investigation-23b9</guid>
      <description>&lt;h2&gt;
  
  
  &lt;strong&gt;1. Problem Statement&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The &lt;strong&gt;Outerwear category&lt;/strong&gt; has shown persistent underperformance across multiple business dimensions: revenue, margin, and customer engagement. Despite moderate sales spikes in peak seasons (Fall and Winter), total Outerwear revenue ($18.5K) lags far behind other categories, indicating structural weaknesses in demand generation and retention.&lt;/p&gt;

&lt;p&gt;High &lt;strong&gt;discount penetration (44.4%)&lt;/strong&gt; suggests dependency on promotions to move stock, compressing margins and signaling that customers perceive inadequate value at full price. Meanwhile, ratings are only moderate (3.75 overall) and decline further in Fall (3.64), implying inconsistent product quality or unmet customer expectations during the peak sales window.&lt;/p&gt;

&lt;p&gt;Seasonal dependency, discount-driven sales, and stagnant customer retention lead to a vicious cycle: deep discounting drives one-time purchases but suppresses long-term profitability. Our analysis aims to diagnose these issues and design actionable solutions for merchandising, marketing, and financial optimization.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;2. Primary Hypothesis&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;H₀ (Null Hypothesis):&lt;/strong&gt;&lt;br&gt;
Outerwear performance aligns with normal seasonal apparel patterns, and observed sales fluctuations reflect natural demand variability.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;H₁ (Alternative Hypothesis):&lt;/strong&gt;&lt;br&gt;
Outerwear underperforms due to addressable factors — &lt;em&gt;seasonal dependency, discount addiction, quality decline, and poor retention&lt;/em&gt; — which can be mitigated via assortment diversification, pricing strategy, and loyalty optimization.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;3. Sub-Hypotheses and Analytical Approach&lt;/strong&gt;
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;ID&lt;/th&gt;
&lt;th&gt;Sub-Hypothesis&lt;/th&gt;
&lt;th&gt;Key Tests&lt;/th&gt;
&lt;th&gt;Expected Outcome&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;H1&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;em&gt;Seasonality Hypothesis:&lt;/em&gt; Outerwear revenue is overly concentrated in Fall–Winter seasons.&lt;/td&gt;
&lt;td&gt;Seasonality Index; Chi-square for uniformity; MoM trendline&lt;/td&gt;
&lt;td&gt;Revenue &amp;gt;50% from Fall/Winter → confirms dependency.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;H2&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;em&gt;Discount Dependency Hypothesis:&lt;/em&gt; High discount penetration artificially sustains volume.&lt;/td&gt;
&lt;td&gt;T-test on AOV (discounted vs non-discounted); Repeat purchase rate&lt;/td&gt;
&lt;td&gt;Discounts → higher one-time buyers, lower loyalty.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;H3&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;em&gt;Quality/Ratings Hypothesis:&lt;/em&gt; Outerwear ratings dropping in Fall correlate with reduced repurchase rates.&lt;/td&gt;
&lt;td&gt;ANOVA: ratings vs season; correlation (rating vs repurchase)&lt;/td&gt;
&lt;td&gt;Declining quality → reduced retention, especially in Fall.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;H4&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;em&gt;Retention Hypothesis:&lt;/em&gt; Outerwear attracts one-time, non-subscriber customers.&lt;/td&gt;
&lt;td&gt;Chi-square: loyalty distribution; segment comparison (new vs loyal buyers)&lt;/td&gt;
&lt;td&gt;Majority transactions from non-subscribers → poor loyalty.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;H5&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;em&gt;Assortment Mismatch Hypothesis:&lt;/em&gt; SKU concentration in specific sizes/colors limits appeal.&lt;/td&gt;
&lt;td&gt;Herfindahl Index on SKU diversity; distribution plots (size/color)&lt;/td&gt;
&lt;td&gt;Over-indexed in M/Cyan → missing revenue from underserved segments.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
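The chi-square-for-uniformity test listed under H1 is easy to run by hand. A sketch in plain Python using the seasonal transaction counts reported in chart 9 of this analysis; note that transaction counts alone sit close to uniform, and the article's H1 concerns revenue concentration, which would use revenue-weighted figures instead (7.815 is the standard χ² critical value for df = 3 at α = 0.05):

```python
# Chi-square goodness-of-fit test against a uniform seasonal distribution.
# Observed Outerwear transactions per season (Winter, Spring, Summer, Fall),
# as reported in the seasonal transactions chart.
observed = [170, 169, 134, 166]

expected = sum(observed) / len(observed)  # uniform expectation: 159.75 per season
chi2 = sum((o - expected) ** 2 / expected for o in observed)

CRITICAL_95_DF3 = 7.815  # chi-square critical value, df = 3, alpha = 0.05
print(f"chi2 = {chi2:.2f}, reject uniformity: {chi2 > CRITICAL_95_DF3}")
```

The same machinery applies to H4's loyalty-distribution test by swapping in subscriber vs non-subscriber counts and their expected split.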




&lt;h2&gt;
  
  
  &lt;strong&gt;4. PromptBI Dashboard Link&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://app.promptbi.ai/public/chat/14ccfba2-f2df-4373-8bd2-ca78a62d0208" rel="noopener noreferrer"&gt;Dashboard&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyrk2ymr7s5scok8en1hb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyrk2ymr7s5scok8en1hb.png" alt="Dashboard Deep dive" width="800" height="382"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Outerwear Category Analytics and Deep Dive&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Summary Insight:&lt;/strong&gt; The Outerwear category has shown significant trends in revenue, transaction counts, and customer preferences, with notable seasonal variations and discount impacts.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Metrics:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Total Outerwear Revenue: $36,753.00&lt;/li&gt;
&lt;li&gt;Average Outerwear Rating: 3.75&lt;/li&gt;
&lt;li&gt;Outerwear Transaction Count: 639&lt;/li&gt;
&lt;li&gt;Most Common Outerwear Size: M&lt;/li&gt;
&lt;li&gt;Most Common Outerwear Color: Cyan&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Supporting Metrics and Trends:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Revenue by Category: Clothing leads with the highest total revenue, while Outerwear trails all categories.&lt;/li&gt;
&lt;li&gt;Average Order Value (AOV): Footwear has the highest AOV, indicating higher-value purchases.&lt;/li&gt;
&lt;li&gt;Transaction Count: Clothing is the most transacted category.&lt;/li&gt;
&lt;li&gt;Units Sold: Clothing also leads in units sold.&lt;/li&gt;
&lt;li&gt;Rating Distribution: The most frequent rating bin is 3.25-3.5.&lt;/li&gt;
&lt;li&gt;Discount Penetration: Outerwear has the highest discount penetration among all categories.&lt;/li&gt;
&lt;li&gt;Customer Loyalty Metrics: Loyal customers show a higher transaction count in the Outerwear category.&lt;/li&gt;
&lt;li&gt;Seasonal Trends:

&lt;ul&gt;
&lt;li&gt;Revenue Peak: Fall season&lt;/li&gt;
&lt;li&gt;Transaction Peak: Winter season&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Price Distribution: The most frequent price bin for Outerwear is $20.00-$30.00.&lt;/li&gt;

&lt;/ul&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;5. Visuals Section — PromptBI Chart Placement Guide&lt;/strong&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;1. Revenue by Category Comparison&lt;/strong&gt; &lt;em&gt;(Bar Chart)&lt;/em&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frqczt85rcuz4brgv7d42.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frqczt85rcuz4brgv7d42.jpeg" alt="Revenue by Category Comparison" width="531" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Insights:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The chart reveals the total revenue generated by different product categories: Accessories, Clothing, Footwear, and Outerwear.&lt;/li&gt;
&lt;li&gt;The highest revenue is from Clothing with $104,264, followed by Accessories with $74,200.&lt;/li&gt;
&lt;li&gt;Footwear generates $36,093 in revenue, while Outerwear has the lowest revenue at $18,524.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Trends and Implications:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Clothing is the top revenue generator, indicating strong customer demand and potentially higher profit margins.&lt;/li&gt;
&lt;li&gt;Outerwear, with the lowest revenue, suggests weaker demand, stronger competition, or pricing issues.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Customer and Business Impact:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The disparity in revenue between categories highlights the need for targeted strategies to boost underperforming segments like Outerwear.&lt;/li&gt;
&lt;li&gt;Understanding why Outerwear has the lowest revenue could uncover market gaps or customer pain points.&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;2. Average Order Value by Category&lt;/strong&gt; &lt;em&gt;(Bar Chart)&lt;/em&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg57zvpxcwwkn095zqmbv.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg57zvpxcwwkn095zqmbv.jpeg" alt="Average Order Value by Category" width="531" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Insights:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The Outerwear category has the lowest Average Order Value (AOV) at $57.17, significantly below the other categories.&lt;/li&gt;
&lt;li&gt;In contrast, Footwear leads with the highest AOV at $60.26.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Business Implications:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The lower AOV in Outerwear suggests potential areas for improvement in customer engagement or product offerings within this category.&lt;/li&gt;
&lt;li&gt;Consider analyzing customer feedback and sales data specific to Outerwear to identify pain points or opportunities for upselling.&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;3. Transaction Count by Category&lt;/strong&gt; &lt;em&gt;(Bar Chart)&lt;/em&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3ndems27q5m0zx8qzp5d.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3ndems27q5m0zx8qzp5d.jpeg" alt="Transaction Count by Category" width="531" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Insights:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Outerwear transactions are 54% lower than Accessories and 81% lower than Clothing.&lt;/li&gt;
&lt;li&gt;This indicates potential underperformance or lower customer demand in the Outerwear segment.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Business Implications:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Consider reviewing the Outerwear product lineup for relevance and appeal.&lt;/li&gt;
&lt;li&gt;Evaluate marketing efforts targeted at this category to identify gaps.&lt;/li&gt;
&lt;li&gt;Explore seasonal trends or external factors affecting Outerwear sales.&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;4. Units Sold Comparison&lt;/strong&gt; &lt;em&gt;(Bar Chart)&lt;/em&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fskj8ee00ysnizsvxtcc8.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fskj8ee00ysnizsvxtcc8.jpeg" alt="Units Sold Comparison" width="531" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Insights:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Lowest Sales Volume: Outerwear has the fewest units sold at 324, significantly lower than other categories.&lt;/li&gt;
&lt;li&gt;Sales Gap: There's a notable 1,413-unit gap between Outerwear and the highest-selling category, Clothing.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Business Implications:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The low sales in Outerwear may indicate market saturation, competitor dominance, or customer disinterest.&lt;/li&gt;
&lt;li&gt;Consider a market analysis to understand the underlying causes and explore strategies to boost Outerwear sales.&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;5. Rating Distribution&lt;/strong&gt; &lt;em&gt;(Histogram)&lt;/em&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhoweecl7ntmr9pfjgm25.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhoweecl7ntmr9pfjgm25.jpeg" alt="Rating Distribution" width="531" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The histogram reveals the distribution of customer ratings for the Outerwear category, which is crucial for understanding customer satisfaction and identifying areas for improvement.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Trends:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The most frequent rating bin is 3.25-3.5, indicating a central tendency towards average satisfaction.&lt;/li&gt;
&lt;li&gt;Ratings between 3.25-3.5 and 3.75-4.0 have the highest number of reviews, suggesting a significant portion of customers are moderately satisfied.&lt;/li&gt;
&lt;li&gt;There is a noticeable drop in the number of reviews for ratings below 3.0, implying fewer customers are highly dissatisfied.&lt;/li&gt;
&lt;li&gt;The lowest rating bin (2.5-2.75) still has a considerable number of reviews (379), indicating room for improvement in product quality or customer experience.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Business Implications:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Focus on enhancing products or services to move the average ratings upwards, especially targeting the 2.5-2.75 range.&lt;/li&gt;
&lt;li&gt;Investigate the causes behind the moderate satisfaction levels in the 3.25-3.5 range to identify specific pain points or areas for enhancement.&lt;/li&gt;
&lt;li&gt;Leverage the high number of reviews in the 3.75-4.0 range to gather insights on what customers appreciate most, and amplify those aspects in marketing and product development.&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;6. Discount Penetration by Category&lt;/strong&gt; &lt;em&gt;(Bar Chart)&lt;/em&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd0fuo4azbo6jleuey4vy.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd0fuo4azbo6jleuey4vy.jpeg" alt="Discount Penetration by Category" width="531" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Outerwear leads all categories in discount penetration, reflecting strong customer response to promotions but also a heavy reliance on markdowns.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Insights:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Outerwear has the highest discount penetration at 44.44%, indicating a robust customer response to discounts in this category.&lt;/li&gt;
&lt;li&gt;Accessories and Footwear follow with discount penetrations of 43.79% and 43.24% respectively, showing consistent customer interest.&lt;/li&gt;
&lt;li&gt;Clothing has the lowest penetration at 42.08%, suggesting potential room for optimization in discount strategies.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Business Implications:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Focus on maintaining and enhancing discount strategies for Outerwear to capitalize on high customer engagement.&lt;/li&gt;
&lt;li&gt;Investigate why Clothing has lower discount penetration and explore opportunities to increase its appeal through targeted promotions or product improvements.&lt;/li&gt;
&lt;li&gt;Monitor trends in Accessories and Footwear to ensure continued customer interest and adjust strategies as needed.&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;7. Customer Loyalty Metrics by Category&lt;/strong&gt; &lt;em&gt;(Stacked Bar)&lt;/em&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs3u846hfekvjqznhasor.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs3u846hfekvjqznhasor.jpeg" alt="Customer Loyalty Metrics by Category" width="531" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Transaction Insights:&lt;/strong&gt;&lt;br&gt;
The Outerwear category shows a total of 324 transactions (233 non-subscribers + 91 subscribers).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Trends:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Non-subscribers dominate with 233 transactions, significantly higher than subscribers.&lt;/li&gt;
&lt;li&gt;Subscribers contribute 91 transactions, indicating a smaller but present loyal customer base.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Business Implications:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The lower transaction count among subscribers points to untapped potential for growing customer loyalty in Outerwear.&lt;/li&gt;
&lt;li&gt;Focus on converting non-subscribers to subscribers could increase overall transaction volume.&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;8. Outerwear Revenue by Season&lt;/strong&gt; &lt;em&gt;(Line Chart)&lt;/em&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxnd3wp1j8cmv2swfhv72.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxnd3wp1j8cmv2swfhv72.jpeg" alt="Outerwear Revenue by Season" width="531" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why It Matters:&lt;/strong&gt;&lt;br&gt;
Understanding seasonal revenue trends in the Outerwear category is crucial for aligning inventory, marketing efforts, and customer engagement strategies.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Trends:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Spring: High revenue at $9,749, indicating strong demand.&lt;/li&gt;
&lt;li&gt;Summer: Lowest revenue at $7,449, suggesting reduced need for outerwear.&lt;/li&gt;
&lt;li&gt;Fall: Peak season with revenue at $9,778, the highest point.&lt;/li&gt;
&lt;li&gt;Winter: Slight drop from Fall, revenue at $9,777.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Business Implications:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Focus marketing and promotions in Fall to capitalize on peak demand.&lt;/li&gt;
&lt;li&gt;Consider inventory adjustments for Summer to align with lower demand.&lt;/li&gt;
&lt;li&gt;Analyze customer behavior in Spring to replicate successful strategies in other seasons.&lt;/li&gt;
&lt;/ul&gt;
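A simple seasonality index (each season's revenue divided by the four-season average) makes the Summer dip and the Fall peak explicit. A sketch built from the revenue figures above:

```python
# Seasonality index: season revenue / average season revenue.
# An index above 1.0 means the season over-performs the yearly average.
revenue = {"Spring": 9749, "Summer": 7449, "Fall": 9778, "Winter": 9777}

avg = sum(revenue.values()) / len(revenue)  # 36,753 / 4 = 9,188.25
index = {season: r / avg for season, r in revenue.items()}

for season, idx in index.items():
    print(f"{season:<7} {idx:.3f}")
```

Spring, Fall, and Winter all land just above 1.0 while Summer falls well below, which is the quantitative form of the seasonal-dependency claim in H1.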




&lt;h3&gt;
  
  
  &lt;strong&gt;9. Outerwear Transactions by Season&lt;/strong&gt; &lt;em&gt;(Line Chart)&lt;/em&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2wf4eu13tmek7rw9xjz7.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2wf4eu13tmek7rw9xjz7.jpeg" alt="Outerwear Transactions by Season" width="531" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why It Matters:&lt;/strong&gt;&lt;br&gt;
Understanding seasonal trends in outerwear transactions helps in aligning inventory, marketing efforts, and customer engagement strategies to maximize sales and customer satisfaction.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Trends:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Winter Peak: Outerwear transactions peak in Winter with 170 transactions, indicating high demand during this season.&lt;/li&gt;
&lt;li&gt;Spring High: Spring follows closely with 169 transactions, suggesting strong seasonal demand.&lt;/li&gt;
&lt;li&gt;Lowest in Summer: Summer sees the lowest transaction count at 134, reflecting reduced need for outerwear.&lt;/li&gt;
&lt;li&gt;Fall Resurgence: Fall shows a resurgence with 166 transactions, close to Spring levels.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Business Implications:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Focus marketing and promotional efforts on Winter and Spring to capitalize on high transaction periods.&lt;/li&gt;
&lt;li&gt;Consider inventory adjustments to ensure sufficient stock during peak seasons while managing lower demand in Summer.&lt;/li&gt;
&lt;li&gt;Explore strategies to boost sales in Summer, such as promoting lightweight outerwear or transitional pieces.&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;10. Discount Impact on Outerwear Revenue&lt;/strong&gt; &lt;em&gt;(Scatter Plot)&lt;/em&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fedhrg3gw452v2k9r6uyn.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fedhrg3gw452v2k9r6uyn.jpeg" alt="Discount Impact on Outerwear Revenue" width="531" height="521"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why This Matters:&lt;/strong&gt;&lt;br&gt;
Understanding how discounts affect outerwear revenue is crucial for optimizing sales strategies and maximizing profit margins.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Trends:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Fall shows the highest revenue at $9,778 with a discount penetration of 45.18%.&lt;/li&gt;
&lt;li&gt;Summer has the highest discount penetration at 50% but the lowest revenue at $7,449.&lt;/li&gt;
&lt;li&gt;Spring and Winter show similar revenue figures around $9,750 with discount penetrations of 40.24% and 47.65%, respectively.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Trendline Analysis:&lt;/strong&gt;&lt;br&gt;
The trendline equation y = -185.25x + 17666.75 indicates a negative correlation between discount penetration and revenue. As discount penetration increases, revenue tends to decrease.&lt;/p&gt;
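&lt;p&gt;As a quick sanity check, the trendline can be evaluated directly. A minimal Python sketch using only the coefficients quoted above (illustrative, not part of the reporting pipeline):&lt;/p&gt;

```python
# Trendline from the scatter plot: y = -185.25x + 17666.75,
# where x is discount penetration (%) and y is predicted revenue (USD).
def predict_revenue(penetration_pct):
    return -185.25 * penetration_pct + 17666.75

# Each extra point of discount penetration costs about 185 USD of revenue.
for x in (40.24, 45.18, 47.65, 50.0):
    print(x, "percent penetration:", round(predict_revenue(x), 2), "USD")
```

&lt;p&gt;At Summer's 50% penetration the line predicts roughly 8,404 USD, the lowest fitted value of the four seasons - consistent with Summer posting the lowest observed revenue.&lt;/p&gt;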

&lt;p&gt;&lt;strong&gt;Business Implications:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Higher discounts do not necessarily lead to higher revenue, as seen in Summer.&lt;/li&gt;
&lt;li&gt;Moderate discount levels in Fall and Winter correlate with peak revenue.&lt;/li&gt;
&lt;li&gt;Consider reducing discount levels in Summer to potentially increase revenue.&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;11. Customer Segmentation by Season&lt;/strong&gt; &lt;em&gt;(Clustered Bar)&lt;/em&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fttntdqf0bmc3bsk8qso4.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fttntdqf0bmc3bsk8qso4.jpeg" alt="Customer Segmentation by Season" width="531" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why This Matters:&lt;/strong&gt;&lt;br&gt;
Understanding customer behavior across different seasons helps tailor marketing strategies and inventory planning for the Outerwear category.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Trends:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Loyal Customers Drive Transactions: The 'Loyal' segment consistently shows higher transaction volumes compared to the 'Active' segment across all seasons.&lt;/li&gt;
&lt;li&gt;Seasonal Variance:

&lt;ul&gt;
&lt;li&gt;'Loyal' customers peak in Spring with 133 transactions and maintain high levels in Fall (129) and Winter (132).&lt;/li&gt;
&lt;li&gt;'Active' customers show less variance, ranging from 31 transactions in Summer to 38 in Winter.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Summer Low for Loyal Customers: The 'Loyal' segment drops to 103 transactions in Summer, indicating a potential area for engagement strategies.&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Business Implications:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Focus retention strategies on the 'Loyal' segment to maintain high transaction volumes.&lt;/li&gt;
&lt;li&gt;Investigate why 'Loyal' customers drop in Summer and develop targeted campaigns to re-engage them during this period.&lt;/li&gt;
&lt;li&gt;Consider inventory adjustments to align with the high demand from 'Loyal' customers in Fall, Spring, and Winter.&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;12. Price Distribution&lt;/strong&gt; &lt;em&gt;(Histogram)&lt;/em&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd8hukc93imf9y3qty4jb.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd8hukc93imf9y3qty4jb.jpeg" alt="Price Distribution" width="531" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why This Matters:&lt;/strong&gt;&lt;br&gt;
The price histogram shows that customers strongly prefer lower price ranges, which should inform pricing and assortment decisions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Trends:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The most frequent price bin is 20.0-30.0 USD, with 112 transactions. This indicates a strong preference for lower-priced outerwear.&lt;/li&gt;
&lt;li&gt;The number of transactions decreases as the price increases from 20.0-30.0 USD to 70.0-80.0 USD.&lt;/li&gt;
&lt;li&gt;There is a slight uptick in transactions in the 80.0-90.0 USD range, suggesting a segment of customers willing to pay a bit more.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Business Implications:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Focus marketing efforts on the 20.0-30.0 USD range to capture the largest customer segment.&lt;/li&gt;
&lt;li&gt;Consider promotional strategies for the 80.0-90.0 USD range to leverage the observed interest.&lt;/li&gt;
&lt;li&gt;Analyze the 70.0-80.0 USD range to understand the drop-off and adjust pricing or product offerings accordingly.&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;13. Size and Color Distribution&lt;/strong&gt; &lt;em&gt;(Stacked Bar)&lt;/em&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwlvtru1x23qs3ug3wwfj.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwlvtru1x23qs3ug3wwfj.jpeg" alt="Size and Color Distribution" width="531" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why This Matters:&lt;/strong&gt;&lt;br&gt;
Understanding the distribution of outerwear sizes and colors helps tailor inventory and marketing strategies to meet customer preferences effectively.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Trends:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Medium Size Dominance: Size M has the highest counts across almost all colors, indicating a strong preference for this size.&lt;/li&gt;
&lt;li&gt;Popular Colors: Beige, Blue, Brown, and Gray are consistently popular across all sizes, with notable peaks in Size M.&lt;/li&gt;
&lt;li&gt;Size L Insights: Size L shows a varied distribution with Cyan and Brown leading, suggesting a niche market within larger sizes.&lt;/li&gt;
&lt;li&gt;Size S Observations: Size S has lower overall counts, with Beige and Olive standing out, hinting at specific customer segments.&lt;/li&gt;
&lt;li&gt;Size XL Trends: Size XL has the lowest counts, with Cyan and Lavender showing slight preference, indicating limited demand.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Business Implications:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Focus inventory replenishment on Size M, especially in popular colors like Beige, Blue, and Gray.&lt;/li&gt;
&lt;li&gt;Consider targeted marketing campaigns for Size L, highlighting popular colors like Cyan and Brown.&lt;/li&gt;
&lt;li&gt;Evaluate the need for Size S and XL, possibly reducing stock for less popular colors to optimize inventory.&lt;/li&gt;
&lt;li&gt;Explore customer feedback for Size S and XL to understand specific needs and preferences.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;6. Short Conclusion &amp;amp; Prioritized Recommendations&lt;/strong&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Key Takeaways&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Outerwear's &lt;strong&gt;low profitability&lt;/strong&gt; stems from high discount usage, narrow SKU range, and weak customer retention.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Seasonality is confirmed&lt;/strong&gt; but not optimized — demand surges in Fall/Winter, yet margins erode.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Moderate but declining ratings&lt;/strong&gt; suggest emerging product or expectation alignment issues.&lt;/li&gt;
&lt;li&gt;Data indicates &lt;strong&gt;assortment imbalance&lt;/strong&gt; (overindexing in size M and Cyan color), limiting growth potential.&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;Recommendations by Function&lt;/strong&gt;
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Function&lt;/th&gt;
&lt;th&gt;Action Priority&lt;/th&gt;
&lt;th&gt;Recommendation&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Merchandising&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;🔹 High&lt;/td&gt;
&lt;td&gt;Diversify Outerwear SKUs — introduce extended sizes (S, XL), rebalance color range, and add transitional products for off-seasons.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Marketing&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;🔹 Medium&lt;/td&gt;
&lt;td&gt;Shift communication from discount-heavy promos to "durability and design" value messaging; launch a Summer lightweight outerwear campaign.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Finance&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;🔹 High&lt;/td&gt;
&lt;td&gt;Optimize discount policy — cap average discount &amp;lt;35%, test bundle promos instead of direct markdowns to protect margin.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;CRM / Loyalty&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;🔹 Medium&lt;/td&gt;
&lt;td&gt;Introduce a loyalty reward or seasonal bundle subscription for outerwear customers to improve repeat purchase rates.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;One-Line Executive Summary&lt;/strong&gt;
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;"Outerwear's performance problem is not demand shortage but &lt;em&gt;value perception and assortment imbalance&lt;/em&gt; — by optimizing variety, discount structure, and retention strategy, the category can regain profitability and year-round engagement."&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>datascience</category>
      <category>management</category>
      <category>showcase</category>
    </item>
    <item>
      <title>Synthetic Data Generator</title>
      <dc:creator>Oliver Samuel</dc:creator>
      <pubDate>Mon, 10 Nov 2025 10:55:35 +0000</pubDate>
      <link>https://open.forem.com/oliver_samuel_028c6f65ad6/building-a-synthetic-data-generator-from-concept-to-reality-63m</link>
      <guid>https://open.forem.com/oliver_samuel_028c6f65ad6/building-a-synthetic-data-generator-from-concept-to-reality-63m</guid>
      <description>&lt;p&gt;&lt;em&gt;By Oliver | November 7, 2025&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The Problem: Why We Need Fake Data That Feels Real
&lt;/h2&gt;

&lt;p&gt;Imagine you're building a new mobile app for a bank. Before launching it to real customers, you need to test it thoroughly. But here's the catch: you can't use real customer data for testing - that would be a privacy nightmare and potentially illegal. You also can't just make up random numbers and names because your app needs to handle realistic scenarios.&lt;/p&gt;

&lt;p&gt;This is where &lt;strong&gt;synthetic data&lt;/strong&gt; comes in. It's like having a movie set instead of a real location. Everything looks authentic, but it's all carefully constructed and completely safe to use.&lt;/p&gt;

&lt;p&gt;That's exactly what I built: &lt;strong&gt;DataGen&lt;/strong&gt; - a Python library that creates realistic synthetic datasets at the click of a button.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is DataGen?
&lt;/h2&gt;

&lt;p&gt;Think of DataGen as a digital factory for fake-but-realistic data. Just like a toy factory can produce thousands of identical toys, DataGen can generate thousands of realistic user profiles, salary records, regional information, and vehicle data. All completely synthetic but statistically accurate.&lt;/p&gt;

&lt;p&gt;Here's another analogy: If you've ever used a flight simulator to practice flying without risking a real plane, DataGen does the same thing for data. It gives you realistic practice data without any privacy concerns or legal complications.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Four Data Generators: My Digital Assembly Lines
&lt;/h2&gt;

&lt;p&gt;DataGen consists of four specialized "assembly lines", each producing a different type of data:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Profile Generator: Creating Digital People&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The Profile Generator creates realistic user profiles - complete with names, emails, addresses, and even geographic coordinates.&lt;br&gt;
It's like having a character generator for a video game, but instead of fantasy characters, you get realistic Kenyan citizens.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What it generates:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Full names (first and last)&lt;/li&gt;
&lt;li&gt;Email addresses and usernames&lt;/li&gt;
&lt;li&gt;Phone numbers&lt;/li&gt;
&lt;li&gt;Complete addresses (street, city, postal code)&lt;/li&gt;
&lt;li&gt;Age and date of birth&lt;/li&gt;
&lt;li&gt;Gender identity&lt;/li&gt;
&lt;li&gt;Geographic coordinates (latitude and longitude)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Real-world use case:&lt;/strong&gt; A fintech startup testing their loan application system can generate 10,000 realistic customer profiles in seconds, ensuring their system handles Kenyan names, addresses, and phone formats correctly.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuk61hz45q8u9mn8cmpfg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuk61hz45q8u9mn8cmpfg.png" alt="Profile Generation Output" width="800" height="412"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvbt7ei7o5tswy8bzq1wb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvbt7ei7o5tswy8bzq1wb.png" alt="Profile Generation Output" width="800" height="412"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Profile Generation Output: a table of generated profiles with names, emails, cities, and ages&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Salary Generator: Modeling Compensation Data&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The Salary Generator creates realistic employment and compensation records across different industries and experience levels. Think of it as a salary survey simulator that understands how compensation works in the real world.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What it generates:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Job titles across 8 departments (Engineering, Product, Data, Marketing, Sales, Operations, Finance, HR)&lt;/li&gt;
&lt;li&gt;Experience levels (from Junior to C-Level executives)&lt;/li&gt;
&lt;li&gt;Base salary, bonuses, and total compensation&lt;/li&gt;
&lt;li&gt;Years of experience aligned with job level&lt;/li&gt;
&lt;li&gt;Currency support (Kenyan Shillings and US Dollars)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The intelligence behind it:&lt;/strong&gt; The generator knows that a Senior Software Engineer should earn more than a Junior one, and that C-Level executives typically have 20+ years of experience. It's not just random numbers - it's statistically realistic.&lt;/p&gt;
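&lt;p&gt;That level-aware logic can be sketched roughly as follows. This is an illustrative toy, not DataGen's actual implementation - the levels, year ranges, and pay bands in LEVEL_RULES are my own assumptions:&lt;/p&gt;

```python
import random

# Hypothetical rules: level maps to (min_years, max_years, (min_pay, max_pay)).
# All numbers below are illustrative assumptions, not DataGen's real tables.
LEVEL_RULES = {
    "Junior":  (0, 3,  (30_000, 50_000)),
    "Senior":  (5, 10, (70_000, 120_000)),
    "C-Level": (20, 35, (180_000, 300_000)),
}

def sample_salary(level, rng):
    min_years, max_years, (min_pay, max_pay) = LEVEL_RULES[level]
    years = rng.randint(min_years, max_years)      # experience fits the level
    base = rng.randint(min_pay, max_pay)           # pay band fits the level
    bonus = round(base * rng.uniform(0.05, 0.25))  # bonus scales with base
    return {"level": level, "years": years,
            "base": base, "total_compensation": base + bonus}

print(sample_salary("Senior", random.Random(42)))
```

&lt;p&gt;Sampling within level-specific bounds is what keeps a Senior record from showing 2 (or 30) years of experience.&lt;/p&gt;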

&lt;p&gt;&lt;strong&gt;Real-world use case:&lt;/strong&gt; An HR analytics platform can test their salary benchmarking features with realistic compensation data across different industries and experience levels.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2jzvx95xze6kb2q3uamp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2jzvx95xze6kb2q3uamp.png" alt="Salary Analysis" width="800" height="417"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Salary Analysis: salary distribution by department and level, with summary statistics&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Region Generator: Mapping the World&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The Region Generator creates global organizational data - perfect for companies with international operations. It's like having a world atlas combined with an organizational chart.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What it generates:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Six major global regions (North America, South America, Europe, Middle East, Africa, Asia Pacific)&lt;/li&gt;
&lt;li&gt;Countries within each region&lt;/li&gt;
&lt;li&gt;Time zones&lt;/li&gt;
&lt;li&gt;Regional headquarters locations&lt;/li&gt;
&lt;li&gt;Regional managers with contact information&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Real-world use case:&lt;/strong&gt; A multinational company testing their global CRM system can simulate operations across all continents with realistic regional structures.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz8yg76kfsd0jc53dj89a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz8yg76kfsd0jc53dj89a.png" alt="Region Data" width="800" height="417"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Region Data Table: all regions with their headquarters and country counts&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Car Generator: Building a Virtual Showroom&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The Car Generator creates vehicle inventory data focused on the Kenyan automotive market. It's like having a digital car dealership that understands local market preferences.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What it generates:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Popular makes and models in Kenya (Toyota, Nissan, Mazda, etc.)&lt;/li&gt;
&lt;li&gt;Manufacturing years (2008-2025)&lt;/li&gt;
&lt;li&gt;Colors, transmission types, and fuel types&lt;/li&gt;
&lt;li&gt;Realistic pricing in Kenyan Shillings&lt;/li&gt;
&lt;li&gt;Dealer locations across major Kenyan cities&lt;/li&gt;
&lt;li&gt;Age-based depreciation modeling&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The smart part:&lt;/strong&gt; The generator knows that a 2025 Toyota Corolla should cost more than a 2010 model, and it applies realistic depreciation curves.&lt;/p&gt;
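&lt;p&gt;The depreciation idea can be captured with a simple exponential curve. A sketch (the 15% yearly rate and the base price are my assumptions, not DataGen's internal numbers):&lt;/p&gt;

```python
# Illustrative age-based depreciation: value drops about 15% per year.
DEPRECIATION_RATE = 0.15

def estimate_price(new_price_kes, year, current_year=2025):
    age = max(0, current_year - year)
    return round(new_price_kes * (1 - DEPRECIATION_RATE) ** age)

# A newer model always prices above an older one of the same make.
for year in (2025, 2020, 2010):
    print(year, estimate_price(3_000_000, year), "KES")
```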

&lt;p&gt;&lt;strong&gt;Real-world use case:&lt;/strong&gt; An automotive marketplace app can test its search, filtering, and pricing features with thousands of realistic vehicle listings.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjzpr6pze6tg9jvcfkhvf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjzpr6pze6tg9jvcfkhvf.png" alt="Car Inventory" width="800" height="417"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Car Inventory: a sample of generated cars with makes, models, years, and prices&lt;/em&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  The Magic Ingredient: Reproducibility
&lt;/h2&gt;

&lt;p&gt;Here's something crucial that makes DataGen special: &lt;strong&gt;reproducibility&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Imagine you're baking cookies. If you follow the exact same recipe with the exact same measurements, you'll get identical cookies every time. DataGen works the same way through something called a "seed".&lt;/p&gt;

&lt;p&gt;When you set a seed (say, seed=42), DataGen will generate the exact same data every single time. This is incredibly important for:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;- Testing:&lt;/strong&gt; Developers can reproduce bugs by using the same seed&lt;br&gt;
&lt;strong&gt;- Collaboration:&lt;/strong&gt; Team members can work with identical datasets&lt;br&gt;
&lt;strong&gt;- Validation:&lt;/strong&gt; You can verify that your system produces consistent results&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Analogy&lt;/strong&gt;: Think of the seed as a recipe number. Recipe #42 always makes chocolate chip cookies, Recipe #106 always makes oatmeal cookies. The same recipe number = same cookies, every time.&lt;/p&gt;
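&lt;p&gt;The mechanism is the same one Python's standard random module uses. A tiny self-contained sketch of the idea (toy name lists, not DataGen's code):&lt;/p&gt;

```python
import random

def make_names(n, seed):
    # The same seed produces the same sequence of picks, every time.
    rng = random.Random(seed)
    first = ["Sharon", "Kennedy", "Brian", "Faith"]
    last = ["Mohamed", "Atieno", "Mwangi", "Wanjiku"]
    return [rng.choice(first) + " " + rng.choice(last) for _ in range(n)]

print(make_names(3, seed=42))
print(make_names(3, seed=42) == make_names(3, seed=42))  # True: same recipe
print(make_names(3, seed=42) == make_names(3, seed=7))   # almost surely False
```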
&lt;h2&gt;
  
  
  From Code to Package: The Publishing Journey
&lt;/h2&gt;

&lt;p&gt;Creating the generators was just the first step. To make DataGen useful to the world, I had to package it and publish it to PyPI (the Python Package Index) - think of it as the &lt;strong&gt;App Store for Python libraries&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Now, anyone in the world can install DataGen with a few standard commands:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Go to a new folder (like /Applications)&lt;/span&gt;
&lt;span class="nb"&gt;cd &lt;/span&gt;Applications

&lt;span class="c"&gt;# Create a brand new, clean environment&lt;/span&gt;
python3 &lt;span class="nt"&gt;-m&lt;/span&gt; venv datagen_venv

&lt;span class="c"&gt;# Activate the virtual environment&lt;/span&gt;
&lt;span class="nb"&gt;source &lt;/span&gt;datagen_venv/bin/activate

&lt;span class="c"&gt;# Run the standard install command&lt;/span&gt;
pip &lt;span class="nb"&gt;install &lt;/span&gt;sami-datagen

&lt;span class="c"&gt;# A sample try&lt;/span&gt;
python &lt;span class="nt"&gt;-c&lt;/span&gt; &lt;span class="s2"&gt;"from datagen import generate_profiles
profiles = generate_profiles(n=10, seed=42)
print(profiles)"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It's like making your homemade recipe available in every grocery store worldwide.&lt;/p&gt;

&lt;h2&gt;
  
  
  Real-world Impact: Who Benefits?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Software Developers&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Testing applications without risking real user data. It's like having crash test dummies instead of real people for car safety tests.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Data Scientists&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Training machine learning models on synthetic data before deploying to production. Think of it as practicing surgery on cadavers before operating on real patients.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Business Analysts&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Creating demo dashboards and presentations without exposing sensitive company data. Like using a model home to show buyers what their houses could look like.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Students and Educators&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Learning data analysis and database design with realistic datasets. It's like using a flight simulator in pilot training - safe, repeatable, and realistic.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Startups&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Building and demonstrating MVPs (Minimum Viable Products) without collecting real user data. Like creating a movie trailer before filming the entire movie.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Technical Foundation
&lt;/h2&gt;

&lt;p&gt;For those curious about how it works under the hood:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;DataGen uses:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;- Faker library:&lt;/strong&gt; Generates realistic names, addresses, and contact information&lt;br&gt;
&lt;strong&gt;- Pandas:&lt;/strong&gt; Organizes data into structured tables (like Excel spreadsheets)&lt;br&gt;
&lt;strong&gt;- Statistical modeling:&lt;/strong&gt; Ensures salary ranges, age distributions, and pricing follow realistic patterns&lt;br&gt;
&lt;strong&gt;- Localization:&lt;/strong&gt; Understands Kenyan naming conventions, cities, and market preferences &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Analogy:&lt;/strong&gt; If DataGen were a restaurant, Faker would be the ingredient supplier, Pandas would be the kitchen organization system, and statistical modeling would be the chef's knowledge of how flavors work together.&lt;/p&gt;
&lt;h2&gt;
  
  
  Practical Examples: See It In Action
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Example 1: Generate 100 user profiles&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;datagen&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;generate_profiles&lt;/span&gt;

&lt;span class="n"&gt;profiles&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;generate_profiles&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;seed&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;profiles&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;head&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Output&lt;/strong&gt;: A table with 100 realistic Kenyan user profiles, complete with names like “Sharon Mohamed” from Nairobi and “Kennedy Atieno” from Mombasa, each with unique emails, addresses, and coordinates.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flo9nbu8e21fm3c66qrs4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flo9nbu8e21fm3c66qrs4.png" alt="Code Example Output" width="800" height="412"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F27ionc3qxyn7ggkrcfpw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F27ionc3qxyn7ggkrcfpw.png" alt="Terminal Output" width="800" height="598"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Code Example Output: the actual output from running this code&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example 2: Analyze Salary Distribution&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;datagen&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;generate_salaries&lt;/span&gt;

&lt;span class="n"&gt;salaries&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;generate_salaries&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;avg_by_dept&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;salaries&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;groupby&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;department&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;total_compensation&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;avg_by_dept&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Output&lt;/strong&gt;: Average compensation by department, showing that Engineering and Data departments typically have higher compensation than Operations or HR.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb4xf6a7row796y5dc9fu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb4xf6a7row796y5dc9fu.png" alt="Salary Analysis Results" width="800" height="598"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Salary Analysis Results: the grouped statistics by department&lt;/em&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Beyond the Code: Docker Support
&lt;/h2&gt;

&lt;p&gt;For those who want to use DataGen without installing anything on their computer, I included &lt;strong&gt;Docker support&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What's Docker?&lt;/strong&gt; Think of it as a &lt;strong&gt;portable computer inside your computer&lt;/strong&gt;. It's like having a fully equipped kitchen (with all the tools and ingredients) that you can set up anywhere in seconds.&lt;/p&gt;

&lt;p&gt;With Docker you can:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Download the DataGen container&lt;/li&gt;
&lt;li&gt;Start it with one command&lt;/li&gt;
&lt;li&gt;Generate data immediately - no installation, no configuration&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhsrs8fu8oqszr49f2ff8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhsrs8fu8oqszr49f2ff8.png" alt="Docker setup" width="800" height="594"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Docker Setup: the docker-compose command and the running container&lt;/em&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  The Documentation Journey
&lt;/h2&gt;

&lt;p&gt;Creating the library was only three-quarters of the battle. Making it usable required comprehensive documentation:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;README.md:&lt;/strong&gt; A guide covering installation, usage, and examples&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Example Scripts:&lt;/strong&gt; Five Python scripts demonstrating each generator&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Inline Documentation:&lt;/strong&gt; Every function has detailed explanations&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;API Reference:&lt;/strong&gt; Complete parameter descriptions and return types&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Analogy:&lt;/strong&gt; It's like buying furniture from IKEA - the product is great, but without clear instructions (with pictures), it's just a pile of wood and screws.&lt;/p&gt;
&lt;h2&gt;
  
  
  Challenges and Solutions
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Challenge 1: Making Data Feel "Real"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution:&lt;/strong&gt; Instead of purely random generation, I implemented statistical models. For example, Senior Engineers have 5-10 years of experience, not 2 years or 30 years.&lt;/p&gt;
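&lt;p&gt;The constraint behind this fix can be expressed as a small validity check (the bounds are illustrative assumptions, mirroring the 5-10 year example above):&lt;/p&gt;

```python
# Experience must fall inside the band for its level (illustrative bounds).
EXPERIENCE_BOUNDS = {"Junior": (0, 3), "Senior": (5, 10), "C-Level": (20, 40)}

def is_plausible(level, years):
    low, high = EXPERIENCE_BOUNDS[level]
    return years in range(low, high + 1)

print(is_plausible("Senior", 7))   # True
print(is_plausible("Senior", 2))   # False
```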

&lt;p&gt;&lt;strong&gt;Challenge 2: Kenyan Localization&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution:&lt;/strong&gt; Researched and included actual Kenyan cities, realistic coordinate boundaries, and local naming patterns. The data doesn't just look real - it looks &lt;em&gt;Kenyan&lt;/em&gt; real.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Challenge 3: Reproducibility&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution:&lt;/strong&gt; Implemented seed-based generation, ensuring that seed=42 always produces identical results, making debugging and testing possible.&lt;/p&gt;
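&lt;p&gt;As a minimal sketch of the idea (the function name and parameters here are hypothetical, not DataGen's actual API), seeding a dedicated random generator makes every run reproducible:&lt;/p&gt;

```python
import random

def generate_ages(n, seed=42, lo=25, hi=35):
    """Hypothetical generator: the same seed always yields the same ages."""
    rng = random.Random(seed)  # isolated, seeded RNG - no global state
    return [rng.randint(lo, hi) for _ in range(n)]

run1 = generate_ages(5, seed=42)
run2 = generate_ages(5, seed=42)
assert run1 == run2  # seed=42 always produces identical results
```

This is what makes debugging practical: a bug report can include the seed, and you can regenerate the exact dataset that triggered it.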
&lt;h2&gt;
  
  
  The Results: By The Numbers
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;4 specialized generators covering different data types&lt;/li&gt;
&lt;li&gt;60+ job titles across 8 departments&lt;/li&gt;
&lt;li&gt;10 experience levels from Junior to C-Level&lt;/li&gt;
&lt;li&gt;6 global regions covering 36 countries&lt;/li&gt;
&lt;li&gt;10 popular car makes with realistic pricing&lt;/li&gt;
&lt;li&gt;100% reproducibility with seed control&lt;/li&gt;
&lt;li&gt;Published on PyPI - accessible worldwide&lt;/li&gt;
&lt;li&gt;Docker support for zero-installation usage&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxrbtdhr7mm0dr4irza2k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxrbtdhr7mm0dr4irza2k.png" alt="Complete Demo Output" width="800" height="405"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fho3a6tqxphz22aazjpv5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fho3a6tqxphz22aazjpv5.png" alt="Complete Demo Output" width="800" height="405"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Complete Demo Output - Show the final output from running complete_demo.py with all statistics&lt;/em&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  What's Next?
&lt;/h2&gt;

&lt;p&gt;DataGen is just the beginning. Future enhancements could include:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;- More data types:&lt;/strong&gt; Transaction records, event logs, social media posts&lt;br&gt;
&lt;strong&gt;- Relationship modeling:&lt;/strong&gt; Connecting profiles to their salaries and purchases&lt;br&gt;
&lt;strong&gt;- Time-series data:&lt;/strong&gt; Stock prices, sensor readings, website traffic&lt;br&gt;
&lt;strong&gt;- Custom templates:&lt;/strong&gt; Industry-specific data patterns&lt;br&gt;
&lt;strong&gt;- Web interface:&lt;/strong&gt; Generate data without writing code&lt;/p&gt;
&lt;h2&gt;
  
  
  Try it Yourself
&lt;/h2&gt;

&lt;p&gt;Want to explore DataGen? Here's how:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For technical users:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;sami-datagen
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;For Everyone Else:&lt;/strong&gt; Visit the GitHub repository at &lt;a href="https://github.com/25thOliver/Datagen" rel="noopener noreferrer"&gt;GitHub repo&lt;/a&gt; where you'll find:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Complete installation instructions&lt;/li&gt;
&lt;li&gt;Step-by-step tutorials&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Building DataGen taught me that great tools aren't just about functionality - they're about &lt;strong&gt;accessibility&lt;/strong&gt;. The best technology is technology that anyone can use, understand, and benefit from.&lt;/p&gt;

&lt;p&gt;Whether you're a developer testing an app, a student learning data science, or a business professional creating a demo, DataGen provides the realistic data you need, when you need it, without compromise.&lt;/p&gt;

&lt;p&gt;The code is open source, the documentation is comprehensive, and the possibilities are endless.&lt;/p&gt;

&lt;h2&gt;
  
  
  About the Author
&lt;/h2&gt;

&lt;p&gt;Oliver is a data engineer who's passionate about building tools that make technology more accessible. This project was completed as part of the LuxDevHQ Data Engineering Internship program.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Connect:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;GitHub: &lt;a href="https://github.com/25thOliver" rel="noopener noreferrer"&gt;@25thOliver&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;LinkedIn: &lt;a href="https://www.linkedin.com/in/samwel-oliver/" rel="noopener noreferrer"&gt;Samwel Oliver&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>privacy</category>
      <category>testing</category>
      <category>tooling</category>
    </item>
    <item>
      <title>Understanding Kafka Lag: Why It Happens and How to Fix It</title>
      <dc:creator>Oliver Samuel</dc:creator>
      <pubDate>Mon, 10 Nov 2025 05:01:40 +0000</pubDate>
      <link>https://open.forem.com/oliver_samuel_028c6f65ad6/understanding-kafka-lag-why-it-happens-and-how-to-fix-it-5a6k</link>
      <guid>https://open.forem.com/oliver_samuel_028c6f65ad6/understanding-kafka-lag-why-it-happens-and-how-to-fix-it-5a6k</guid>
      <description>&lt;p&gt;In today's fast-paced digital world, data is constantly being created. We stream movies, make online purchases, and track shipments, in real-time. Behind the scenes, a powerful technology called &lt;strong&gt;Apache Kafka&lt;/strong&gt; often acts as the central nervous system, managing this massive flow of information.&lt;/p&gt;

&lt;p&gt;Imagine a busy restaurant kitchen during dinner rush. Orders are coming in faster than the chefs can prepare them. The tickets start piling up on the counter, and customers wait longer for their meals. This growing backlog of uncooked orders is essentially what we call "Kafka Lag" in the world of data streaming.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is Kafka Lag?
&lt;/h2&gt;

&lt;p&gt;Kafka is a messaging system that helps different parts of software applications communicate with each other. Think of it as a sophisticated postal service for digital information. When one part of your system (the producer) sends messages faster than another part (the consumer) can process them, a backlog forms. This backlog is called "lag."&lt;/p&gt;

&lt;p&gt;In simple terms: &lt;em&gt;Kafka Lag is the difference between how many messages have been sent and how many have been successfully processed.&lt;/em&gt;&lt;/p&gt;
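&lt;p&gt;To make that definition concrete, lag is computed per partition as the gap between two offsets. A minimal sketch (the offset values are illustrative):&lt;/p&gt;

```python
def consumer_lag(latest_offsets, committed_offsets):
    """Lag per partition = messages produced minus messages consumed."""
    return {p: latest_offsets[p] - committed_offsets.get(p, 0)
            for p in latest_offsets}

latest = {0: 1500, 1: 1200}     # last offset written by producers, per partition
committed = {0: 1400, 1: 1200}  # last offset processed by the consumer group
lag = consumer_lag(latest, committed)
# partition 0 is 100 messages behind; partition 1 is fully caught up
```

Tools like `kafka-consumer-groups.sh` report exactly these numbers per partition.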

&lt;h2&gt;
  
  
  Why Does Kafka Lag Happen?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. The Speed Mismatch Problem&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Picture a factory assembly line where bottles are being filled. If the filling station produces 100 bottles per minute but the capping station can only cap 70 bottles per minute, you'll have 30 uncapped bottles piling up every minute. Similarly, when your data producers send messages faster than consumers can handle them, lag accumulates.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Processing Complexity&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Not all tasks are created equal. Imagine reading a children's book versus analyzing a legal contract: one takes seconds, the other takes hours. If your consumer needs to perform complex calculations, database lookups, or call external services for each message, it naturally slows down, creating lag.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Resource Constraints&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Think of your consumer as a worker with limited tools. If that worker doesn't have enough memory (like trying to juggle too many tasks at once), sufficient processing power (like using a bicycle to deliver packages across a city), or good network connectivity (like having a slow internet connection), they simply can't keep up with the workload.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Sudden Traffic Spikes&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Consider a ticketing website when a popular concert goes on sale. Normally, the site comfortably handles a few hundred visitors per minute. Suddenly, 50,000 people flood in simultaneously and the system gets overwhelmed. Similarly, unexpected surges in data, such as during a flash sale or a viral social media event, can cause temporary lag.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Consumer Downtime&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If your consumer application crashes, needs maintenance, or gets redeployed, it's like a cashier taking a lunch break, messages pile up while no one's processing them. When the consumer comes back online, it faces a mountain of unprocessed messages.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;6. Inefficient Message Processing&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Imagine sorting mail by reading every single letter in full before deciding where it goes, versus just glancing at the address. Poor coding practices, unnecessary operations, or inefficient algorithms can dramatically slow down message processing.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to Reduce or Eliminate Kafka Lag
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Add More Workers (Increase Consumer Instances)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The most straightforward solution: if one cashier can't handle the line, open more registers. By running multiple consumer instances in parallel, you can process more messages simultaneously. Kafka automatically distributes the workload among them through partitioning.&lt;br&gt;
Look at it this way: instead of one person answering customer emails, have a team of five, each handling a portion of the inbox.&lt;/p&gt;
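&lt;p&gt;Kafka's built-in assignors (range, round-robin, sticky) handle this distribution for you; the toy sketch below only illustrates the effect of adding instances:&lt;/p&gt;

```python
def assign_round_robin(partitions, consumers):
    """Distribute partitions across consumer instances, round-robin."""
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

# 6 partitions shared by 3 consumer instances -> 2 partitions each,
# so each instance processes a third of the traffic
assignment = assign_round_robin(list(range(6)), ["c1", "c2", "c3"])
```

Note the ceiling this implies: more consumer instances than partitions leaves the extra instances idle.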

&lt;p&gt;&lt;strong&gt;2. Optimize the Processing Logic&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Make your consumers faster and smarter. Remove unnecessary steps, cache frequently accessed data, and streamline your code. It's like teaching your workers to use keyboard shortcuts instead of clicking through menus, same result, much faster.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key strategies:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Eliminate redundant operations&lt;/li&gt;
&lt;li&gt;Use batch processing where possible&lt;/li&gt;
&lt;li&gt;Avoid blocking operations&lt;/li&gt;
&lt;li&gt;Implement efficient data structures&lt;/li&gt;
&lt;/ul&gt;
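&lt;p&gt;As one hedged illustration of "cache frequently accessed data" (the lookup function is hypothetical), &lt;code&gt;functools.lru_cache&lt;/code&gt; lets repeated messages for the same key skip the slow path entirely:&lt;/p&gt;

```python
from functools import lru_cache

calls = 0  # counts how many times the slow path actually runs

@lru_cache(maxsize=1024)
def lookup_product(product_id):
    """Stand-in for a slow database or external API lookup."""
    global calls
    calls += 1
    return {"id": product_id, "name": f"product-{product_id}"}

# 5 incoming messages, but only 2 distinct product IDs
for product_id in [1, 2, 1, 1, 2]:
    lookup_product(product_id)
# only 2 slow lookups were performed for the 5 messages
```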

&lt;p&gt;&lt;strong&gt;3. Increase Partition Count&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Kafka divides message streams into partitions. Think of them as multiple conveyor belts instead of one. More partitions mean more parallel processing opportunities. However, this is like adding more lanes to a highway; it only helps if you have enough cars (consumers) to use them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Batch Processing&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Instead of processing messages one at a time (like making individual trips to deliver each package), group them together (like loading a truck with multiple packages for one delivery run). This reduces overhead and improves throughput significantly.&lt;/p&gt;
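&lt;p&gt;A minimal sketch of the grouping step, independent of any Kafka client library:&lt;/p&gt;

```python
def batches(messages, size):
    """Group messages into fixed-size batches to amortize per-call overhead."""
    for i in range(0, len(messages), size):
        yield messages[i:i + size]

msgs = list(range(10))
grouped = list(batches(msgs, 4))
# one database write per batch instead of one per message
```

Real consumers do the same thing with settings like a poll's `max_poll_records` plus a single bulk insert per returned batch.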

&lt;p&gt;&lt;strong&gt;5. Upgrade Resources&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Sometimes you simply need better tools: allocate more memory, faster CPUs, or more network bandwidth to your consumers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;6. Implement Asynchronous Processing&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Don't wait for one task to finish before starting the next. By processing messages asynchronously, you maximize resource utilization.&lt;/p&gt;
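&lt;p&gt;For I/O-bound consumers, this can be as simple as &lt;code&gt;asyncio&lt;/code&gt;: while one message waits on an external call, others make progress. A small sketch (the handler is hypothetical):&lt;/p&gt;

```python
import asyncio

async def handle(msg):
    """Simulate I/O-bound work per message (e.g., an external API call)."""
    await asyncio.sleep(0.01)
    return msg.upper()

async def process_all(messages):
    # Launch all handlers concurrently instead of awaiting each one in turn
    return await asyncio.gather(*(handle(m) for m in messages))

results = asyncio.run(process_all(["a", "b", "c"]))
```

Three 10 ms waits overlap here, so the batch finishes in roughly 10 ms rather than 30 ms.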

&lt;p&gt;&lt;strong&gt;7. Use Consumer Groups Wisely&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Organize your consumers into groups where each handles specific types of messages. This is like having specialized teams, one for returns, one for new orders, one for inquiries, rather than everyone handling everything.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;8. Monitor and Alert&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You can't fix what you don't know is broken. Set up monitoring to track lag metrics and alert you when thresholds are exceeded. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;9. Implement Backpressure Mechanisms&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Sometimes the solution is to slow down the producers temporarily. While not always ideal, it prevents system overload.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;10. Prioritize Critical Messages&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Not all messages are equally important. Implement priority queues so urgent messages get processed first.&lt;/p&gt;
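&lt;p&gt;A minimal priority-queue sketch using Python's &lt;code&gt;heapq&lt;/code&gt; (the message names are invented for illustration):&lt;/p&gt;

```python
import heapq

queue = []
# (priority, message): a lower number means more urgent
heapq.heappush(queue, (2, "routine stock update"))
heapq.heappush(queue, (0, "payment failure alert"))
heapq.heappush(queue, (1, "order placed"))

# Pop in priority order: urgent messages come out first
order = [heapq.heappop(queue)[1] for _ in range(len(queue))]
```

Kafka itself has no per-message priorities, so in practice this usually means routing urgent messages to a separate topic that a dedicated consumer drains first.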

&lt;h3&gt;
  
  
  Finding the Right Balance
&lt;/h3&gt;

&lt;p&gt;Eliminating Kafka lag isn't always about processing everything instantly. Sometimes, a small amount of lag is acceptable and even expected. The goal is to keep lag within acceptable boundaries for your business needs. &lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Kafka lag is a natural consequence of distributed systems handling real-time data. It happens when consumption can't keep pace with production. By understanding the root causes, whether it's speed mismatches, resource constraints, or inefficient processing, you can apply the right solutions.&lt;/p&gt;

&lt;p&gt;The key is to monitor continuously, optimize intelligently, and scale appropriately. With the right combination of additional consumers, optimized code, proper resource allocation, and smart architecture decisions, you can keep your Kafka lag minimal and your data flowing smoothly.&lt;/p&gt;

&lt;p&gt;Remember: managing Kafka lag is not a one-time fix but an ongoing process of monitoring, measuring, and adjusting as your system evolves and grows.&lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>monitoring</category>
      <category>performance</category>
    </item>
    <item>
      <title>Real-Time Earthquake CDC Pipeline</title>
      <dc:creator>Oliver Samuel</dc:creator>
      <pubDate>Sat, 01 Nov 2025 11:23:46 +0000</pubDate>
      <link>https://open.forem.com/oliver_samuel_028c6f65ad6/real-time-earthquake-cdc-pipeline-km3</link>
      <guid>https://open.forem.com/oliver_samuel_028c6f65ad6/real-time-earthquake-cdc-pipeline-km3</guid>
      <description>&lt;p&gt;Bringing live seismic data to life from API to dashboards, in seconds. This project builds a real-time Change Data Capture(CDC) pipeline that streams live earthquake data from the &lt;a href="https://earthquake.usgs.gov/fdsnws/event/1/" rel="noopener noreferrer"&gt;USGS FDSN API&lt;/a&gt; into &lt;strong&gt;MySQL&lt;/strong&gt;, mirrors every change through &lt;strong&gt;Kafka + Debezium&lt;/strong&gt;, lands it to &lt;strong&gt;PostgreSQL&lt;/strong&gt;, and visualizes global seismic trends &lt;strong&gt;Grafana&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Project Overview
&lt;/h2&gt;

&lt;p&gt;Earthquakes happen without warning, and understanding their patterns requires timely data. Traditional earthquake monitoring systems often have delays between when an earthquake occurs and when the data becomes available for analysis. This project eliminates that gap.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Real-world impact:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Emergency responders can see global seismic activity as it happens&lt;/li&gt;
&lt;li&gt;Researchers can analyze earthquake patterns in real-time&lt;/li&gt;
&lt;li&gt;The public can track seismic events in their region instantly&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Think of it like the difference between reading yesterday's newspaper vs. watching live news, except for earthquakes happening anywhere on Earth.&lt;/p&gt;

&lt;p&gt;Every minute, the U.S. Geological Survey(USGS) publishes new earthquake events around the world.&lt;br&gt;
In this project, we built a pipeline that:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Fetches&lt;/strong&gt; new quakes every minute from the USGS API&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Upserts&lt;/strong&gt; events into MySQL&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Captures&lt;/strong&gt; changes in real time via Debezium &amp;amp; Kafka&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Streams&lt;/strong&gt; them into PostgreSQL&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Visualizes&lt;/strong&gt; live quakes and metrics in Grafana dashboards&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmso0gs9bf7zsj9qlhtfc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmso0gs9bf7zsj9qlhtfc.png" alt="System Architecture" width="800" height="217"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Overall system architecture diagram&lt;/em&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Architecture
&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;USGS API → MySQL → Debezium → Kafka → JDBC Sink → PostgreSQL → Grafana

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Each component plays a critical role:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;- MySQL&lt;/strong&gt; - Primary database storing fresh quake data&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;- Adminer UI&lt;/strong&gt; - Visualizes our data in the primary MySQL database after API ingestion&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;- Debezium&lt;/strong&gt; - Captures every insert/update via CDC&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;- Kafka&lt;/strong&gt; - Streams events through topics&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;- PostgreSQL&lt;/strong&gt; - Sink database for analytics&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;- Grafana&lt;/strong&gt; - Visualization layer for insights&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;- Kafka UI&lt;/strong&gt; - Monitors topics and connectors visually&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7zqunsgyvf3gfpqgoihz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7zqunsgyvf3gfpqgoihz.png" alt="Kafka Topics" width="800" height="264"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Kafka UI showing topics&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdwj2yflu2lajiixkhmzk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdwj2yflu2lajiixkhmzk.png" alt="Sink and Source Connectors" width="800" height="264"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Sink and Source Connectors&lt;/em&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Phases of the Build
&lt;/h2&gt;
&lt;h3&gt;
  
  
  Phase 1: USGS API Integration
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What's happening here:&lt;/strong&gt; The United States Geological Survey (USGS) maintains a public API that reports every earthquake detected globally. We poll (ask) this API every 60s: "What earthquakes happened in the last minute?"&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why every minute?&lt;/strong&gt; Earthquakes don't wait, and neither should our data. By checking every minute, we ensure our dashboard shows the most current picture of global seismic activity.&lt;/p&gt;

&lt;p&gt;A Python script polls the API:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;https://earthquake.usgs.gov/fdsnws/event/1/query?format&lt;span class="o"&gt;=&lt;/span&gt;geojson&amp;amp;starttime&lt;span class="o"&gt;={&lt;/span&gt;NOW-1min&lt;span class="o"&gt;}&lt;/span&gt;&amp;amp;endtime&lt;span class="o"&gt;={&lt;/span&gt;NOW&lt;span class="o"&gt;}&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;New events are upserted into the &lt;code&gt;earthquake_minute&lt;/code&gt; table in MySQL.&lt;/p&gt;
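&lt;p&gt;The post doesn't show the polling script itself, so here is a hedged sketch of just the URL-building step for the one-minute query window (the function name is invented):&lt;/p&gt;

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import urlencode

def usgs_query_url(now=None, window_s=60):
    """Build the rolling one-minute window query for the USGS FDSN event API."""
    now = now or datetime.now(timezone.utc)
    params = {
        "format": "geojson",
        "starttime": (now - timedelta(seconds=window_s)).strftime("%Y-%m-%dT%H:%M:%S"),
        "endtime": now.strftime("%Y-%m-%dT%H:%M:%S"),
    }
    return "https://earthquake.usgs.gov/fdsnws/event/1/query?" + urlencode(params)

# e.g. fetch with: requests.get(usgs_query_url(), timeout=10).json()
```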

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4e56f28kgwrt8nvim57i.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4e56f28kgwrt8nvim57i.png" alt="Sample MySQL table rows after API ingestion" width="800" height="397"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Earthquake MySQL table rows after API ingestion in Adminer UI&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Phase 2: Change Data Capture (CDC)
&lt;/h3&gt;

&lt;p&gt;Imagine MySQL is a busy restaurant kitchen, and the binlog is a camera recording everything the chefs do. Debezium is like a food critic watching that recording in real-time, narrating every dish that gets plated, modified, or sent back.&lt;/p&gt;

&lt;p&gt;Without CDC, we'd have to repeatedly ask MySQL "What's new?" every few seconds, which is inefficient and slow. With CDC, MySQL tells us the moment something changes. It's the difference between spam-refreshing your email vs. getting instant push notifications.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;MySQL binary logging enabled (&lt;code&gt;binlog_format=ROW&lt;/code&gt;)&lt;br&gt;
&lt;em&gt;Understanding the Binlog&lt;/em&gt;&lt;br&gt;
At the core of this project lies &lt;strong&gt;MySQL's Binary log(binlog)&lt;/strong&gt;, a special journal that records every change made to the database: inserts, updates, and deletes.&lt;/p&gt;

&lt;p&gt;By enabling it in &lt;strong&gt;ROW format&lt;/strong&gt;, MySQL doesn't just log that "something changed". It records &lt;strong&gt;exactly what changed&lt;/strong&gt; in each row. This is what allows tools like &lt;strong&gt;Debezium&lt;/strong&gt; to reconstruct the full story for every database mutation in real-time.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
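&lt;p&gt;With ROW format enabled, each Debezium change event can carry full before/after row images. A simplified sketch of that payload shape (values invented, fields trimmed for illustration):&lt;/p&gt;

```python
# Simplified shape of a Debezium change-event payload (fields trimmed)
change_event = {
    "op": "u",  # c = create, u = update, d = delete
    "before": {"id": "us7000abcd", "mag": 4.5},  # row state before the change
    "after":  {"id": "us7000abcd", "mag": 4.7},  # row state after the change
    "source": {"connector": "mysql", "table": "earthquake_minute"},
}

# ROW-format binlogging is what makes "before" and "after" recoverable at all:
# a downstream consumer can see a magnitude was revised from 4.5 to 4.7
assert change_event["before"]["mag"] != change_event["after"]["mag"]
```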

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2ox72iilu2d3nuy4dzhl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2ox72iilu2d3nuy4dzhl.png" alt="Binlog Settings" width="800" height="885"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;**Why it Matters**
- `log_bin = ON` - Enables binary logging
- `binlog_format = ROW` - Captures row-level detail for CDC
- `server_id` - Provides a unique identifier for the MySQL instance(required by Debezium)

Once the binlog is active, Debezium can tap into it via Kafka Connect, continuously streaming every change into Kafka topics—turning your database into a real-time data source.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;ul&gt;
&lt;li&gt;Debezium MySQL connector listens for changes&lt;/li&gt;
&lt;li&gt;Kafka topics carry those changes&lt;/li&gt;
&lt;li&gt;JDBC Sink connector writes them to PostgreSQL&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu69mg9yjdk7j70ur3zbp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu69mg9yjdk7j70ur3zbp.png" alt="Debezium connector configuration (Kafka Connect UI)" width="800" height="397"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Debezium connector configuration (Kafka Connect UI)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvmn4xs3ytoginuzypq6i.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvmn4xs3ytoginuzypq6i.png" alt="Kafka UI → Topics → Messages view" width="800" height="397"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Kafka UI → Topics → Messages view&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Phase 3: Grafana Visualization
&lt;/h3&gt;

&lt;p&gt;Grafana connects to PostgreSQL and brings seismic data to life through four panels:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Real-Time World Map&lt;/strong&gt; — Global quake visualization&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;Why it's useful:&lt;/em&gt; Instantly see WHERE earthquakes are clustering. Notice the "Ring of Fire" pattern around the Pacific?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;2. Quakes Per Hour&lt;/strong&gt; — Time-series trend of activity&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;Why it's useful:&lt;/em&gt; Spot unusual spikes that might indicate aftershock sequences or increased regional activity&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;3. Top 5 Hotspot Regions&lt;/strong&gt; — Aggregated regional summary&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;Why it's useful:&lt;/em&gt; Quantify which areas are most seismically active over time&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;4. Quakes in Last Hour (Gauge)&lt;/strong&gt; — Real-time activity level&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;Why it's useful:&lt;/em&gt; A quick "pulse check" showing if Earth is currently rumbling more than usual&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0x9tgvo8d132ju9v60kf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0x9tgvo8d132ju9v60kf.png" alt="Grafana dashboard" width="800" height="397"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Grafana dashboard (full view)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcu9reslb9ew6fr5ob3f8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcu9reslb9ew6fr5ob3f8.png" alt="Close-up of world map panel" width="800" height="397"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Close-up of world map panel&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;This project demonstrates the power of streaming data from a global earthquake API to actionable visualizations in real time.&lt;br&gt;
By combining open-source tools like &lt;strong&gt;Debezium&lt;/strong&gt;, &lt;strong&gt;Kafka&lt;/strong&gt;, and &lt;strong&gt;Grafana&lt;/strong&gt;, we built a pipeline that's not just functional but alive, constantly evolving with Earth's tremors.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What we proved:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Real-time data pipelines can be built with free, open-source tools&lt;/li&gt;
&lt;li&gt;Complex infrastructure can be orchestrated with Docker Compose&lt;/li&gt;
&lt;li&gt;CDC is the key to keeping distributed systems in sync without manual intervention&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Real-world applications of this architecture:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;IoT sensor networks (replace earthquakes with temperature/pressure readings)&lt;/li&gt;
&lt;li&gt;E-commerce inventory systems (track stock changes across warehouses)&lt;/li&gt;
&lt;li&gt;Financial fraud detection (monitor transactions in real-time)&lt;/li&gt;
&lt;li&gt;Healthcare patient monitoring (stream vital signs to alert systems)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The principles here scale from earthquake monitoring to any domain where &lt;strong&gt;seeing changes as they happen&lt;/strong&gt; creates value.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8mjqtsqx4vbbpj2pbm1l.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8mjqtsqx4vbbpj2pbm1l.png" alt="Grafana + Kafka UI side-by-side for the closing shot" width="800" height="397"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Quick Start
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Start all services&lt;/span&gt;
docker compose up &lt;span class="nt"&gt;-d&lt;/span&gt;

&lt;span class="c"&gt;# Access components&lt;/span&gt;
MySQL        → localhost:3306
Kafka UI     → http://localhost:8082
Grafana      → http://localhost:3000
PostgreSQL   → localhost:5435
Adminer UI   → http://localhost:8080
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Curious to dig deeper?&lt;/strong&gt;&lt;br&gt;
All the source code, configurations, and dashboards for this real-time earthquake streaming pipeline are open-source here:&lt;br&gt;
&lt;a href="https://github.com/25thOliver/Real-Time-Earthquake-CDC" rel="noopener noreferrer"&gt;GitHub Repo&lt;/a&gt;&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>database</category>
      <category>dataengineering</category>
    </item>
    <item>
      <title>Real-Time Crypto Data Pipeline</title>
      <dc:creator>Oliver Samuel</dc:creator>
      <pubDate>Mon, 27 Oct 2025 18:01:50 +0000</pubDate>
      <link>https://open.forem.com/oliver_samuel_028c6f65ad6/real-time-crypto-data-pipeline-e8f</link>
      <guid>https://open.forem.com/oliver_samuel_028c6f65ad6/real-time-crypto-data-pipeline-e8f</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Ever wondered how trading platforms display live crypto prices? In this article, I'll show you how I built a fully automated real-time data pipeline that streams cryptocurrency data from Binance and visualizes it like a Bloomberg Terminal - completely open source!&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What you'll learn:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Setting up Change Data Capture (CDC) with Debezium&lt;/li&gt;
&lt;li&gt;Building event-driven architectures with Kafka&lt;/li&gt;
&lt;li&gt;Handling time-series data at scale with Cassandra&lt;/li&gt;
&lt;li&gt;Creating real-time dashboards with Grafana&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Tech Stack:&lt;/strong&gt; Python | PostgreSQL | Debezium | Apache Kafka | Cassandra | Grafana | Docker&lt;/p&gt;

&lt;h2&gt;
  
  
  Why I Built This
&lt;/h2&gt;

&lt;p&gt;I wanted to understand how major trading platforms handle real-time data at scale. Instead of just reading about it, I decided to build a production-grade pipeline that could:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Handle thousands of price updates per minute&lt;/li&gt;
&lt;li&gt;Never lose data even if services crash&lt;/li&gt;
&lt;li&gt;Provide instant insights through dashboards&lt;/li&gt;
&lt;li&gt;Scale horizontally as data grows&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This project taught me more about distributed systems in one month than a year of tutorials.&lt;/p&gt;

&lt;h2&gt;
  
  
  Challenges I Faced
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Challenge 1: Database Polling Overhead&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Initially, I was polling PostgreSQL every second. CPU usage was 80%+!&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Solution:&lt;/strong&gt; Implemented Debezium CDC using PostgreSQL's replication log. CPU dropped to 5%.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Challenge 2: Data Loss During Failures&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
When Cassandra went down, data disappeared.&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Solution:&lt;/strong&gt; Kafka acts as a durable buffer - it stores events until consumers catch up.&lt;/p&gt;
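&lt;p&gt;The "durable buffer" behavior comes down to offset bookkeeping: Kafka retains events, and each consumer tracks how far it has read. A minimal, stdlib-only sketch (with hypothetical offset numbers) of how that backlog is measured:&lt;/p&gt;

```python
# Kafka keeps every event until retention expires; a consumer's "lag" is
# how far behind the end of the log it is. If the sink (Cassandra) goes
# down, lag grows but nothing is lost: the consumer resumes at its offset.

def consumer_lag(end_offsets, committed_offsets):
    """Per-partition lag: events written but not yet consumed."""
    return {
        partition: end_offsets[partition] - committed_offsets.get(partition, 0)
        for partition in end_offsets
    }

# Hypothetical state while Cassandra is down: producers keep writing,
# the sink's committed offsets stand still.
end = {0: 1500, 1: 1480}        # latest offsets in the topic
committed = {0: 1200, 1: 1480}  # where the Cassandra sink stopped

print(consumer_lag(end, committed))  # {0: 300, 1: 0}
```

When the sink comes back, it simply drains those 300 buffered events; nothing in the pipeline has to be replayed by hand.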

&lt;p&gt;&lt;strong&gt;Challenge 3: Time-Series Query Performance&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
PostgreSQL struggled with millions of time-series records.&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Solution:&lt;/strong&gt; Moved analytics workload to Cassandra, optimized for time-series data.&lt;/p&gt;
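&lt;p&gt;Why does Cassandra cope where PostgreSQL struggled? Time-series tables are typically partitioned so that one query touches one partition. A small sketch of that idea (the bucketing scheme here is illustrative, not the project's exact schema):&lt;/p&gt;

```python
from datetime import datetime, timezone

# Cassandra reads are fast when a query hits a single partition. A common
# time-series model buckets rows by (symbol, day), so one coin's prices
# for one day live together and range scans stay cheap.

def partition_key(symbol: str, ts: datetime):
    """Bucket a reading by trading pair and UTC day (hypothetical schema)."""
    return (symbol, ts.astimezone(timezone.utc).strftime("%Y-%m-%d"))

ts = datetime(2025, 10, 27, 18, 1, 50, tzinfo=timezone.utc)
print(partition_key("BTCUSDT", ts))  # ('BTCUSDT', '2025-10-27')
```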
&lt;h2&gt;
  
  
  What We've Achieved
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Real-Time Data Collection&lt;/strong&gt;: Automatically fetches live crypto market data from Binance every hour (3,600 seconds)&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Automated Data Pipeline&lt;/strong&gt;: Data flows seamlessly from Binance → PostgreSQL → Debezium CDC → Kafka → Cassandra without manual intervention&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Change Data Capture (CDC)&lt;/strong&gt;: Lets the system detect new data in PostgreSQL in real time without polling. Instead of repeatedly querying the database, Debezium listens for changes directly through PostgreSQL's replication log, ensuring near-zero latency and minimal load.&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Scalable Architecture&lt;/strong&gt;: Built with enterprise-grade technologies (Debezium, Kafka, Cassandra) that can handle millions of records&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Beautiful Visualizations&lt;/strong&gt;: Ready-to-use Grafana dashboards for monitoring crypto markets&lt;/p&gt;
&lt;h2&gt;
  
  
  Architecture Overview
&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Binance API → PostgreSQL → Debezium CDC → Kafka → Cassandra → Grafana
    ↓             ↓              ↓           ↓          ↓          ↓
  Prices      Primary      Change       Message    Fast      Beautiful
  Stats       Storage      Detection    Queue      Storage   Dashboards
                ↓                        ↓
          Every INSERT              Stream Changes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe1wbcxa6ll3yzwwformy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe1wbcxa6ll3yzwwformy.png" alt="Pipeline Architecture" width="800" height="532"&gt;&lt;/a&gt; &lt;br&gt;
&lt;em&gt;End-to-end pipeline from data ingestion to visualization.&lt;/em&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Components Breakdown
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Binance Data Collector&lt;/strong&gt; (Python)&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Fetches 5 types of market data: prices, 24hr stats, order books, recent trades, and candlestick data&lt;/li&gt;
&lt;li&gt;Writes data to PostgreSQL every hour (3,600 seconds)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;PostgreSQL Database&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Primary storage for all crypto market data&lt;/li&gt;
&lt;li&gt;Stores historical data with timestamps&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Change Data Capture (CDC) enabled&lt;/strong&gt; via logical replication&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Debezium Change Data Capture&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Automatically detects and captures database changes in real-time&lt;/li&gt;
&lt;li&gt;Monitors PostgreSQL for INSERT, UPDATE, DELETE operations&lt;/li&gt;
&lt;li&gt;Converts database changes into Kafka messages&lt;/li&gt;
&lt;li&gt;Minimal impact on database performance, since it reads the replication log instead of querying tables&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Apache Kafka&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Kafka acts as a real-time buffer between Debezium and Cassandra, ensuring data reliability. If Cassandra goes down, no data is lost. Kafka stores all change events until Cassandra comes back online.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Cassandra Sink Connector&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The Cassandra Sink Connector (DataStax) continuously listens to Kafka topics and mirrors every change into Cassandra tables that match the PostgreSQL schema.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Apache Cassandra&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Fast, distributed database optimized for time-series data&lt;/li&gt;
&lt;li&gt;Powers our real-time dashboards&lt;/li&gt;
&lt;li&gt;Stores denormalized data for quick reads&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Grafana Dashboards&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Visual interface for exploring crypto market data&lt;/li&gt;
&lt;li&gt;Live charts and analytics&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
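&lt;p&gt;The collector's core job in step 1 is simple: pull a JSON payload from Binance and turn it into rows to INSERT. A hedged sketch of that transformation, using the public &lt;code&gt;/api/v3/ticker/price&lt;/code&gt; response shape (the row field names here are illustrative, not the project's exact columns):&lt;/p&gt;

```python
import json
from datetime import datetime, timezone

# Binance's /api/v3/ticker/price returns a JSON list of
# {"symbol": ..., "price": ...}. Before inserting into PostgreSQL,
# each entry is typed and stamped with the fetch time.

def to_rows(ticker_json: str, fetched_at: datetime):
    rows = []
    for entry in json.loads(ticker_json):
        rows.append({
            "symbol": entry["symbol"],
            "price": float(entry["price"]),       # API sends prices as strings
            "fetched_at": fetched_at.isoformat(),
        })
    return rows

sample = '[{"symbol": "BTCUSDT", "price": "67321.50"}]'
now = datetime(2025, 10, 27, tzinfo=timezone.utc)
print(to_rows(sample, now))
```

Every INSERT these rows produce is what Debezium later picks up from the replication log.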
&lt;h2&gt;
  
  
  Data We Collect
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Data Type&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;th&gt;Update Frequency&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Prices&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Latest price for all trading pairs&lt;/td&gt;
&lt;td&gt;Every hour (3,600 s)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;24hr Stats&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Price changes, volumes, and market movements&lt;/td&gt;
&lt;td&gt;Every hour (3,600 s)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Order Books&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Current buy/sell orders&lt;/td&gt;
&lt;td&gt;Every hour (3,600 s)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Recent Trades&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Latest market transactions&lt;/td&gt;
&lt;td&gt;Every hour (3,600 s)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Candlesticks&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Historical price patterns (OHLCV)&lt;/td&gt;
&lt;td&gt;Every 6,000 s (100 minutes)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd6tufsvyf55cm8ih1k0r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd6tufsvyf55cm8ih1k0r.png" alt="Live crypto prices" width="800" height="403"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Live dashboard displaying top-performing cryptocurrencies by 24h change.&lt;/em&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Getting Started
&lt;/h2&gt;
&lt;h3&gt;
  
  
  Prerequisites
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Docker and Docker Compose installed on your computer&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  Quick Start (3 Steps)
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Create environment file&lt;/strong&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;   &lt;span class="c"&gt;# Create a .env file with these contents:&lt;/span&gt;
   &lt;span class="nv"&gt;POSTGRES_USER&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;crypto_user
   &lt;span class="nv"&gt;POSTGRES_PASSWORD&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;crypto_pass
   &lt;span class="nv"&gt;POSTGRES_DB&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;crypto_db
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;ol start="2"&gt;
&lt;li&gt;
&lt;strong&gt;Start everything&lt;/strong&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;   docker compose up &lt;span class="nt"&gt;--build&lt;/span&gt; &lt;span class="nt"&gt;-d&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fasbz67bjbyk2suhpwpmy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fasbz67bjbyk2suhpwpmy.png" alt="Docker ps" width="800" height="804"&gt;&lt;/a&gt; &lt;br&gt;
   &lt;em&gt;docker ps confirming all pipeline containers are up and running&lt;/em&gt;&lt;/p&gt;

&lt;ol start="3"&gt;
&lt;li&gt;
&lt;strong&gt;View your dashboards&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Open &lt;code&gt;http://localhost:3000&lt;/code&gt; in your browser&lt;/li&gt;
&lt;li&gt;Login: admin / admin&lt;/li&gt;
&lt;li&gt;Explore the crypto market data!&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzkw0o8bmk7185or2bh0d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzkw0o8bmk7185or2bh0d.png" alt="Home page after login" width="800" height="590"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Project Screenshots
&lt;/h2&gt;
&lt;h3&gt;
  
  
  Kafka UI - Monitoring Topics &amp;amp; Connectors
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdf3mv6d1s7vqbwsuu4q6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdf3mv6d1s7vqbwsuu4q6.png" alt="Kafka UI Overview" width="800" height="743"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwcrxwqfewzokz2c7ycml.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwcrxwqfewzokz2c7ycml.png" alt="Kafka UI crypto_prices topic" width="800" height="743"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Kafka UI providing a real-time view of all Kafka topics, internal connector states, and message traffic.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The Kafka UI interface offers an intuitive dashboard for monitoring the Kafka ecosystem:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Topics View&lt;/strong&gt; – Displays all internal and user-created topics such as &lt;code&gt;crypto_prices&lt;/code&gt;, &lt;code&gt;crypto_order_book&lt;/code&gt;, and more.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Consumers View&lt;/strong&gt; – Shows active sink connectors and other consumers reading from Kafka topics (e.g., &lt;code&gt;cassandra-sink&lt;/code&gt;).
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cluster Health&lt;/strong&gt; – Visualizes broker status, topic replication, and partition metrics.
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This provides a richer, more interactive way to inspect data flow across Kafka.&lt;/p&gt;
&lt;h3&gt;
  
  
  Architecture &amp;amp; Data Flow
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6upahm8flsswrrlts48h.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6upahm8flsswrrlts48h.png" alt="Active Topics" width="800" height="514"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Current Active Topics&lt;/em&gt; &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnyz1rki27edc6m1x4se6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnyz1rki27edc6m1x4se6.png" alt="PostgreSQL Sample Data" width="800" height="602"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;A sample query against the primary PostgreSQL database&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp94fyj9f979uinevxbxw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp94fyj9f979uinevxbxw.png" alt="Cassandra Sample Query" width="800" height="602"&gt;&lt;/a&gt; &lt;br&gt;
&lt;em&gt;A sample query against Cassandra, our analytics database&lt;/em&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Configuration Files
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;docker-compose.yml&lt;/code&gt; - Orchestrates all services&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;connectors/cassandra-sink.json&lt;/code&gt; - Cassandra data sink configuration&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;connectors/postgres-source-temp.json&lt;/code&gt; - PostgreSQL change data capture configuration&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;scripts/binance_ingestor.py&lt;/code&gt; - Main data collection script&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  Current Data Statistics
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Total Records Collected&lt;/strong&gt;: Over 1.8 million rows&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Active Tables&lt;/strong&gt;: 5 (prices, stats, order books, trades, candlesticks)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Update Frequency&lt;/strong&gt;: Every 60 seconds&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data Sources&lt;/strong&gt;: Binance REST API&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Storage&lt;/strong&gt;: PostgreSQL (primary) + Cassandra (analytics)&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  Grafana Dashboards
&lt;/h2&gt;

&lt;p&gt;Our dashboards provide:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Real-time price monitoring&lt;/strong&gt; across all trading pairs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;24-hour market analysis&lt;/strong&gt; with price changes and volumes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Order book depth&lt;/strong&gt; visualization&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Trade history&lt;/strong&gt; with buy/sell indicators&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Candlestick charts&lt;/strong&gt; for technical analysis&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  Troubleshooting
&lt;/h2&gt;
&lt;h3&gt;
  
  
  Check if services are running
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker ps
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  View data in PostgreSQL
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker &lt;span class="nb"&gt;exec &lt;/span&gt;postgres psql &lt;span class="nt"&gt;-U&lt;/span&gt; crypto_user &lt;span class="nt"&gt;-d&lt;/span&gt; crypto_db &lt;span class="nt"&gt;-c&lt;/span&gt; &lt;span class="s2"&gt;"SELECT * FROM crypto_prices LIMIT 10;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  View data in Cassandra
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker &lt;span class="nb"&gt;exec &lt;/span&gt;cassandra cqlsh &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="s2"&gt;"SELECT * FROM crypto_keyspace.crypto_prices LIMIT 10;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  Check connector status
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-sS&lt;/span&gt; http://localhost:8083/connectors | jq
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
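&lt;p&gt;The JSON that &lt;code&gt;curl&lt;/code&gt; returns can also be checked programmatically. Kafka Connect's REST API reports a connector-level state plus one state per task, so a connector is only healthy when all of them are &lt;code&gt;RUNNING&lt;/code&gt;. A small sketch of that check (the status dict below mimics the API's response shape; the task values are made up):&lt;/p&gt;

```python
# Kafka Connect reports a connector state and a per-task state; a single
# FAILED task silently stops part of the data flow, so check all of them.

def is_healthy(status: dict) -> bool:
    states = [status["connector"]["state"]]
    states += [task["state"] for task in status["tasks"]]
    return all(state == "RUNNING" for state in states)

# Example response shape from GET /connectors/cassandra-sink/status
status = {
    "name": "cassandra-sink",
    "connector": {"state": "RUNNING"},
    "tasks": [{"id": 0, "state": "RUNNING"}, {"id": 1, "state": "FAILED"}],
}
print(is_healthy(status))  # False -- task 1 needs a restart
```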


&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F74n6do9iidentpjwl0x7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F74n6do9iidentpjwl0x7.png" alt="CDC Status" width="800" height="804"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg47dzy78912rrwttfc2u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg47dzy78912rrwttfc2u.png" alt="CDC Config" width="800" height="378"&gt;&lt;/a&gt;&lt;br&gt;&lt;br&gt;
&lt;em&gt;REST response of CDC pipeline configuration&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fla3vtncvyevj9dfce8ig.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fla3vtncvyevj9dfce8ig.png" alt="crypto_prices topic streams" width="800" height="770"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Messages streaming through the crypto_prices topic&lt;/em&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Key Features
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Fully Automated&lt;/strong&gt; - Set it and forget it: data collection runs on its own&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Near Real-Time&lt;/strong&gt; - Fresh data every hour (3,600 seconds)&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Rich Visualizations&lt;/strong&gt; - Beautiful Grafana dashboards out of the box&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Reliable&lt;/strong&gt; - Built on proven enterprise technologies&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Scalable&lt;/strong&gt; - Designed to handle millions of records&lt;/p&gt;
&lt;h2&gt;
  
  
  Learn More
&lt;/h2&gt;

&lt;p&gt;This project demonstrates:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Change Data Capture (CDC)&lt;/strong&gt; with Debezium - automatically captures database changes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Real-time data streaming&lt;/strong&gt; with Apache Kafka - reliable message queuing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Time-series data storage&lt;/strong&gt; with Cassandra - optimized for analytics&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data visualization&lt;/strong&gt; with Grafana - beautiful dashboards&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Microservices architecture&lt;/strong&gt; with Docker - containerized services&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  How Change Data Capture Works
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Python script inserts data into PostgreSQL every hour (3,600 seconds)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Debezium connector&lt;/strong&gt; watches PostgreSQL for changes using logical replication&lt;/li&gt;
&lt;li&gt;When new rows are inserted, Debezium captures them automatically&lt;/li&gt;
&lt;li&gt;Changes are converted to JSON messages and sent to Kafka topics&lt;/li&gt;
&lt;li&gt;Cassandra sink connector consumes these messages and writes to Cassandra&lt;/li&gt;
&lt;li&gt;Result: &lt;strong&gt;Zero manual intervention&lt;/strong&gt; - data flows automatically!&lt;/li&gt;
&lt;/ol&gt;
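&lt;p&gt;Step 5 above largely boils down to unwrapping the Debezium envelope. Each change event carries an &lt;code&gt;op&lt;/code&gt; field (&lt;code&gt;c&lt;/code&gt; = create, &lt;code&gt;u&lt;/code&gt; = update, &lt;code&gt;d&lt;/code&gt; = delete) plus &lt;code&gt;before&lt;/code&gt; and &lt;code&gt;after&lt;/code&gt; row images; the sink only needs the new row. A sketch of that unwrapping (the sample row values are made up):&lt;/p&gt;

```python
# A Debezium change event wraps the row in an envelope: "op" says what
# kind of change happened and "after" holds the new row image. The sink
# writes "after" for creates/updates and handles deletes separately.

def unwrap(event: dict):
    """Return the row to write, or None for deletes (illustrative)."""
    if event["op"] == "d":
        return None
    return event["after"]

event = {
    "op": "c",
    "before": None,
    "after": {"symbol": "BTCUSDT", "price": 67321.5},
    "ts_ms": 1761588110000,
}
print(unwrap(event))  # {'symbol': 'BTCUSDT', 'price': 67321.5}
```

In the real pipeline the DataStax sink connector does this mapping declaratively via its topic-to-table configuration; the function above just makes the envelope visible.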
&lt;h2&gt;
  
  
  Support
&lt;/h2&gt;

&lt;p&gt;For questions or issues, please check the logs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker logs binance_ingestor
docker logs debezium-connect
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Explore the Full Project
&lt;/h3&gt;

&lt;p&gt;You can find the complete source code, Docker setup, and connector configurations on GitHub:  &lt;/p&gt;

&lt;h2&gt;
  
  
  👉 &lt;a href="https://github.com/25thOliver/Crypto-Data-Pipeline" rel="noopener noreferrer"&gt;https://github.com/25thOliver/Crypto-Data-Pipeline&lt;/a&gt;
&lt;/h2&gt;

</description>
      <category>systemdesign</category>
      <category>cryptocurrency</category>
      <category>dataengineering</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>From smog to streams: how data engineering helps us breathe easier.</title>
      <dc:creator>Oliver Samuel</dc:creator>
      <pubDate>Mon, 20 Oct 2025 07:51:46 +0000</pubDate>
      <link>https://open.forem.com/oliver_samuel_028c6f65ad6/from-smog-to-streams-how-data-engineering-helps-us-breathe-easier-4190</link>
      <guid>https://open.forem.com/oliver_samuel_028c6f65ad6/from-smog-to-streams-how-data-engineering-helps-us-breathe-easier-4190</guid>
      <description>&lt;h2&gt;
  
  
  Building a Real-Time Air Quality Data Pipeline for Mombasa &amp;amp; Nairobi
&lt;/h2&gt;




&lt;h2&gt;
  
  
  The Invisible Problem We Breathe
&lt;/h2&gt;

&lt;p&gt;If you’ve ever driven through &lt;strong&gt;Nairobi&lt;/strong&gt; at rush hour or felt the coastal haze in &lt;strong&gt;Mombasa&lt;/strong&gt;, you’ve likely wondered:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“What exactly am I breathing right now?”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Air pollution often hides in plain sight, invisible but deadly. As a data engineer passionate about real-world impact, I decided to build a system that could &lt;strong&gt;listen to the air&lt;/strong&gt; and &lt;strong&gt;tell us the truth in real time&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;That journey became the &lt;strong&gt;Real-Time Air Quality Pipeline&lt;/strong&gt;:&lt;br&gt;
a streaming data architecture that fetches hourly pollutant readings, processes them instantly, and makes them queryable within seconds — all built with open-source tools.&lt;/p&gt;


&lt;h2&gt;
  
  
  Project Overview
&lt;/h2&gt;

&lt;p&gt;This pipeline fetches air quality data (PM2.5, PM10, CO, NO₂, SO₂, Ozone, UV Index) from the &lt;strong&gt;Open-Meteo API&lt;/strong&gt; for Nairobi and Mombasa, then streams it through a real-time pipeline using &lt;strong&gt;Kafka&lt;/strong&gt;, &lt;strong&gt;MongoDB&lt;/strong&gt;, &lt;strong&gt;Debezium&lt;/strong&gt;, and &lt;strong&gt;Cassandra&lt;/strong&gt;.&lt;/p&gt;
&lt;h3&gt;
  
  
  What It Does
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Fetches data hourly&lt;/li&gt;
&lt;li&gt;Streams via &lt;strong&gt;Kafka&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Stores raw data in &lt;strong&gt;MongoDB&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Uses &lt;strong&gt;Debezium&lt;/strong&gt; for CDC (Change Data Capture)&lt;/li&gt;
&lt;li&gt;Writes processed data to &lt;strong&gt;Cassandra&lt;/strong&gt; for analytics&lt;/li&gt;
&lt;li&gt;Fully containerized using &lt;strong&gt;Docker Compose&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;
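&lt;p&gt;The fetch step works with Open-Meteo's parallel hourly arrays: the response carries a &lt;code&gt;hourly&lt;/code&gt; object with a &lt;code&gt;time&lt;/code&gt; array alongside one array per pollutant. Zipping them yields one record per hour that the producer can publish to Kafka. A sketch of that reshaping (field names follow the air-quality API; the sample values are made up):&lt;/p&gt;

```python
# Open-Meteo returns {"hourly": {"time": [...], "pm2_5": [...], ...}}.
# Each index i across the arrays is one hourly reading, so we pivot the
# columns into per-hour records tagged with the city.

def hourly_records(city: str, payload: dict):
    hourly = payload["hourly"]
    fields = [name for name in hourly if name != "time"]
    records = []
    for i, ts in enumerate(hourly["time"]):
        record = {"city": city, "time": ts}
        for name in fields:
            record[name] = hourly[name][i]
        records.append(record)
    return records

payload = {"hourly": {"time": ["2025-10-20T07:00"], "pm2_5": [18.4], "pm10": [31.0]}}
print(hourly_records("Nairobi", payload))
```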

&lt;p&gt;&lt;strong&gt;End Result:&lt;/strong&gt; Live, queryable data on Kenya’s air quality — updated every hour.&lt;/p&gt;


&lt;h2&gt;
  
  
  Architecture Overview
&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="o"&gt;(&lt;/span&gt;Open-Meteo API&lt;span class="o"&gt;)&lt;/span&gt;
       ↓
  &lt;span class="o"&gt;[&lt;/span&gt;Producer - Python]
       ↓
  &lt;span class="o"&gt;[&lt;/span&gt;Kafka Topic - air_quality_data]
       ↓
  &lt;span class="o"&gt;[&lt;/span&gt;Consumer - MongoDB Writer]
       ↓
  &lt;span class="o"&gt;[&lt;/span&gt;MongoDB - Raw Data Storage]
       ↓
  &lt;span class="o"&gt;[&lt;/span&gt;Debezium CDC Connector]
       ↓
  &lt;span class="o"&gt;[&lt;/span&gt;Kafka CDC Topic]
       ↓
  &lt;span class="o"&gt;[&lt;/span&gt;Cassandra Consumer]
       ↓
  &lt;span class="o"&gt;[&lt;/span&gt;Cassandra - Analytics Storage]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Each block is a service — communicating in real time via Kafka topics.&lt;br&gt;
Together, they form a &lt;strong&gt;streaming ecosystem&lt;/strong&gt; that can handle continuous data without breaking a sweat.&lt;/p&gt;


&lt;h2&gt;
  
  
  Setting the Pipeline in Motion
&lt;/h2&gt;
&lt;h3&gt;
  
  
  1. Start the System
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker-compose up &lt;span class="nt"&gt;-d&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffh2bq6v5hu7fuha3umgg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffh2bq6v5hu7fuha3umgg.png" alt="Docker Compose starting all services" width="800" height="479"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After a few minutes, you’ll see all &lt;strong&gt;9 containers running&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;mongo&lt;/code&gt;, &lt;code&gt;zookeeper&lt;/code&gt;, &lt;code&gt;kafka&lt;/code&gt;, &lt;code&gt;kafka-ui&lt;/code&gt;, &lt;code&gt;mongo-connector&lt;/code&gt;,
&lt;code&gt;producer&lt;/code&gt;, &lt;code&gt;consumer&lt;/code&gt;, &lt;code&gt;cassandra&lt;/code&gt;, and &lt;code&gt;cassandra-consumer&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fglf3kdfjv80ube0c7ai3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fglf3kdfjv80ube0c7ai3.png" alt="All services live in Docker" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;


&lt;h3&gt;
  
  
  2. Initialize Databases
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;MongoDB Replica Set Setup&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;bash storage/init-replica-set.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F79vhwb5p5tp5gf1yhbr4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F79vhwb5p5tp5gf1yhbr4.png" alt="MongoDB replica set initialized successfully" width="800" height="479"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cassandra Schema Initialization&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker &lt;span class="nb"&gt;exec&lt;/span&gt; &lt;span class="nt"&gt;-i&lt;/span&gt; cassandra cqlsh &amp;lt; storage/cassandra_setup.cql
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h3&gt;
  
  
  3. Register Debezium CDC Connector
&lt;/h3&gt;

&lt;p&gt;Debezium monitors MongoDB for new data, captures changes, and streams them out.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;bash streaming/register-connector.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs52m5ch6ix9z0fkjy2c6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs52m5ch6ix9z0fkjy2c6.png" alt="Debezium connector registered and active" width="800" height="891"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once registered, every new air quality record inserted in MongoDB automatically triggers a CDC event.&lt;/p&gt;




&lt;h3&gt;
  
  
  4. System Health Check
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;bash health-check.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3ppwa7rqlr1uxezniqat.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3ppwa7rqlr1uxezniqat.png" alt="All services healthy and connected" width="800" height="867"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When all checks pass — the real-time pipeline is alive!&lt;/p&gt;




&lt;h2&gt;
  
  
  Deep Dive — The Data Flow in Action
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Step 1: The Producer (Python)
&lt;/h3&gt;

&lt;p&gt;Fetches data from Open-Meteo every hour, ensuring we only publish &lt;em&gt;complete&lt;/em&gt; readings.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhtwl4qsz083pwwyjtqvp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhtwl4qsz083pwwyjtqvp.png" alt="Producer fetching and publishing new air quality data" width="800" height="479"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  Step 2: The Consumer (MongoDB Writer)
&lt;/h3&gt;

&lt;p&gt;Consumes messages from Kafka and writes them as raw JSON into MongoDB.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fklskj0dirgi1msa9iox5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fklskj0dirgi1msa9iox5.png" alt="MongoDB consumer writing data" width="800" height="479"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Each entry contains pollutant levels, timestamps, and metadata for each city.&lt;/p&gt;




&lt;h3&gt;
  
  
  Step 3: Debezium CDC Connector
&lt;/h3&gt;

&lt;p&gt;Debezium detects new inserts in MongoDB and publishes “change events” to a Kafka CDC topic.&lt;/p&gt;
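&lt;p&gt;&lt;em&gt;Registering the connector is one POST to the Kafka Connect REST API. The connector class is Debezium's real MongoDB connector; the hostnames, database, and collection names below are assumptions for illustration:&lt;/em&gt;&lt;/p&gt;

```python
# Hedged example of a Debezium MongoDB connector registration payload.
# Connection string, prefix, and include lists are illustrative.
import json

connector_config = {
    "name": "mongo-air-quality-cdc",
    "config": {
        "connector.class": "io.debezium.connector.mongodb.MongoDbConnector",
        "mongodb.connection.string": "mongodb://mongo:27017/?replicaSet=rs0",
        "topic.prefix": "cdc",
        "database.include.list": "air_quality",
        "collection.include.list": "air_quality.air_quality_raw",
    },
}

payload = json.dumps(connector_config)
# Registration (not executed here) would be roughly:
#   requests.post("http://connect:8083/connectors",
#                 headers={"Content-Type": "application/json"}, data=payload)
```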

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqm1ktutlscwf2eqex85g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqm1ktutlscwf2eqex85g.png" alt="Debezium CDC connector running" width="800" height="891"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  Step 4: Cassandra Consumer
&lt;/h3&gt;

&lt;p&gt;Reads CDC events, cleans the data, skips incomplete values, and inserts time-series records into Cassandra.&lt;/p&gt;
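&lt;p&gt;&lt;em&gt;The cleaning step amounts to unwrapping the Debezium envelope and rejecting incomplete documents. A sketch under the assumption that the event's &lt;code&gt;payload.after&lt;/code&gt; holds the inserted document as a JSON string (field names are illustrative):&lt;/em&gt;&lt;/p&gt;

```python
# Hypothetical core of the Cassandra consumer: unwrap a change event and
# keep only complete readings; everything else is skipped.
import json

def extract_reading(event: bytes):
    """Return the cleaned reading dict, or None if the event should be skipped."""
    payload = json.loads(event.decode("utf-8")).get("payload", {})
    after = payload.get("after")
    if after is None:
        return None  # not an insert event
    doc = json.loads(after)
    if any(doc.get(field) is None for field in ("city", "timestamp", "pm2_5")):
        return None  # incomplete reading: skip it
    return doc

# A surviving reading would then go into Cassandra via a prepared statement:
#   session.execute(insert_stmt, (doc["city"], doc["timestamp"], doc["pm2_5"]))
```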

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnjidcmq0krjmtpc8sxwy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnjidcmq0krjmtpc8sxwy.png" alt="Cassandra consumer logs showing inserted readings" width="800" height="867"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Monitoring &amp;amp; Dashboards
&lt;/h2&gt;

&lt;p&gt;With &lt;strong&gt;Kafka UI&lt;/strong&gt;, you can see your streaming data live.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp5rwmhev5v9zj2ekwkrv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp5rwmhev5v9zj2ekwkrv.png" alt="Kafka UI topics overview" width="800" height="819"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Kafka UI displaying all active topics.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqce4eadq34lvl5k22yfd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqce4eadq34lvl5k22yfd.png" alt="Live messages in air\_quality\_data topic" width="800" height="819"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Real-time message flow for each city.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxs4b67tk0b9ub1hrm9p9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxs4b67tk0b9ub1hrm9p9.png" alt="Kafka consumer group statuses" width="800" height="819"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Consumers processing messages without lag.&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Querying the Data
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Raw Data in MongoDB
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;db.air_quality_raw.find&lt;span class="o"&gt;()&lt;/span&gt;.sort&lt;span class="o"&gt;({&lt;/span&gt;_id: &lt;span class="nt"&gt;-1&lt;/span&gt;&lt;span class="o"&gt;})&lt;/span&gt;.limit&lt;span class="o"&gt;(&lt;/span&gt;5&lt;span class="o"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh8me9aufgnlaghisctay.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh8me9aufgnlaghisctay.png" alt="MongoDB query showing pollutant values" width="800" height="836"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Raw data including PM2.5, ozone, and NO₂ readings.&lt;/em&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  Analytics Data in Cassandra
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;SELECT city, timestamp, pm2_5, pm10, ozone 
FROM air_quality_analytics.air_quality_readings 
WHERE &lt;span class="nv"&gt;city&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'Nairobi'&lt;/span&gt; LIMIT 5&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fymeti8tb9qje8q3l45nn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fymeti8tb9qje8q3l45nn.png" alt="Cassandra analytics results" width="800" height="836"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Structured air quality readings optimized for analysis.&lt;/em&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  Timestamp Insight
&lt;/h3&gt;

&lt;p&gt;Each record has two timestamps:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;timestamp&lt;/code&gt;: when the reading was captured&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;inserted_at&lt;/code&gt;: when it entered the pipeline&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This lets you track latency and data freshness — crucial for real-time systems.&lt;/p&gt;
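&lt;p&gt;&lt;em&gt;Concretely, subtracting the two timestamps gives end-to-end pipeline latency per record (the example record below is illustrative):&lt;/em&gt;&lt;/p&gt;

```python
# How the two timestamps enable a freshness check: seconds from capture
# to storage. Field names follow the article; the record is illustrative.
from datetime import datetime

def pipeline_latency_seconds(record: dict) -> float:
    """Seconds between when the reading was captured and when it was stored."""
    captured = datetime.fromisoformat(record["timestamp"])
    stored = datetime.fromisoformat(record["inserted_at"])
    return (stored - captured).total_seconds()

record = {
    "timestamp": "2026-02-05T23:00:00+00:00",
    "inserted_at": "2026-02-05T23:00:42+00:00",
}
# pipeline_latency_seconds(record) == 42.0
```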




&lt;h2&gt;
  
  
  What You’ll Learn
&lt;/h2&gt;

&lt;p&gt;Building this pipeline teaches core &lt;strong&gt;data engineering concepts&lt;/strong&gt;:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Concept&lt;/th&gt;
&lt;th&gt;What You’ll Learn&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Streaming Systems&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;How to build and manage real-time Kafka pipelines&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;CDC (Change Data Capture)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Tracking database changes with Debezium&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Multi-Database Architecture&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Choosing MongoDB for raw data, Cassandra for analytics&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Distributed Systems&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Managing replication and eventual consistency&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Containerization&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Deploying complex pipelines with Docker Compose&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Future Enhancements
&lt;/h2&gt;

&lt;p&gt;This is just the beginning — imagine expanding this into a &lt;strong&gt;nationwide environmental dashboard&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Next Steps:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Add more cities (Kisumu, Eldoret, Nakuru)&lt;/li&gt;
&lt;li&gt;Create Grafana dashboards for AQI visualization&lt;/li&gt;
&lt;li&gt;Add SMS or Slack alerts for dangerous readings&lt;/li&gt;
&lt;li&gt;Integrate ML for forecasting and anomaly detection&lt;/li&gt;
&lt;li&gt;Build an API or GraphQL endpoint for app developers&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Why It Matters
&lt;/h2&gt;

&lt;p&gt;Data shouldn’t live in spreadsheets — it should live in &lt;em&gt;motion&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;By streaming real-time air quality data, we can give cities, developers, and citizens &lt;strong&gt;live awareness of environmental health&lt;/strong&gt;.&lt;br&gt;
Projects like this can inform policy, support research, and raise awareness about what’s really in our air.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Data is the new air — you can’t see it, but everything depends on it.”&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Learn More &amp;amp; Contribute
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Author:&lt;/strong&gt; &lt;a href="https://www.linkedin.com/in/samwel-oliver/" rel="noopener noreferrer"&gt;Samwel Oliver&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/25thOliver" rel="noopener noreferrer"&gt;@25thOliver&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;Email:&lt;/strong&gt; &lt;a href="mailto:oliversamwel33@gmail.com"&gt;oliversamwel33@gmail.com&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/25thOliver/Air-Quality-Pipeline" rel="noopener noreferrer"&gt;Explore the Full Project on GitHub&lt;/a&gt;&lt;/p&gt;




</description>
      <category>architecture</category>
      <category>dataengineering</category>
      <category>opensource</category>
    </item>
    <item>
<title>Containerization for Data Engineering: A Practical Guide with Docker and Docker Compose</title>
      <dc:creator>Oliver Samuel</dc:creator>
      <pubDate>Fri, 10 Oct 2025 06:14:00 +0000</pubDate>
      <link>https://open.forem.com/oliver_samuel_028c6f65ad6/containerization-for-data-engineering-a-practical-guide-with-docker-and-docker-compose-1bon</link>
      <guid>https://open.forem.com/oliver_samuel_028c6f65ad6/containerization-for-data-engineering-a-practical-guide-with-docker-and-docker-compose-1bon</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;For many aspiring data engineers, Docker sounds intimidating—complex containers, YAML files, and endless &lt;code&gt;docker&lt;/code&gt; commands. But here's the truth: Docker isn't just for backend developers. It's your best friend when managing complex data pipelines with multiple moving parts: databases, schedulers, dashboards, and storage systems.&lt;/p&gt;

&lt;p&gt;In this guide, I'll demonstrate how I containerized a full YouTube analytics pipeline using &lt;strong&gt;Docker&lt;/strong&gt; and &lt;strong&gt;Docker Compose&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The goal? To automate data extraction, transformation, storage, and visualization—all running seamlessly across containers.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Containerize Data Pipelines?
&lt;/h2&gt;

&lt;p&gt;Without containers, setting up tools like &lt;strong&gt;Airflow&lt;/strong&gt;, &lt;strong&gt;Spark&lt;/strong&gt;, &lt;strong&gt;PostgreSQL&lt;/strong&gt;, &lt;strong&gt;Grafana&lt;/strong&gt;, and &lt;strong&gt;MinIO&lt;/strong&gt; locally would take hours, each requiring its own dependencies and configurations.&lt;/p&gt;

&lt;p&gt;With Docker Compose, all these services run together with a single command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker compose up &lt;span class="nt"&gt;-d&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Docker creates isolated environments for each service, ensuring &lt;strong&gt;portability&lt;/strong&gt;, &lt;strong&gt;consistency&lt;/strong&gt;, and &lt;strong&gt;easy scaling&lt;/strong&gt;.&lt;/p&gt;
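&lt;p&gt;&lt;em&gt;As a hedged illustration (service names, images, and ports are typical defaults, not the project's exact file), a few of those services in a single compose file might look like:&lt;/em&gt;&lt;/p&gt;

```yaml
# Sketch of part of the stack; images and ports are assumptions.
services:
  postgres:
    image: postgres:16
    environment:
      POSTGRES_PASSWORD: airflow
  minio:
    image: minio/minio
    command: server /data --console-address ":9001"
    ports:
      - "9000:9000"
      - "9001:9001"
  grafana:
    image: grafana/grafana
    ports:
      - "3000:3000"
```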




&lt;h2&gt;
  
  
  The Engine-Cartridge Architecture
&lt;/h2&gt;

&lt;p&gt;A key design pattern I used in this project was splitting the setup into two distinct layers:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. &lt;code&gt;airflow-docker/&lt;/code&gt; → The Engine
&lt;/h3&gt;

&lt;p&gt;This is the core infrastructure. It defines all containers, networks, environment variables, and Airflow services.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Responsibilities:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Defines the Docker Compose stack (Airflow + PostgreSQL + Grafana + MinIO + Spark)&lt;/li&gt;
&lt;li&gt;Acts as the "orchestration engine"&lt;/li&gt;
&lt;li&gt;Mounts DAGs and pipeline code dynamically&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  2. &lt;code&gt;airflow-youtube-analytics/&lt;/code&gt; → The Cartridge
&lt;/h3&gt;

&lt;p&gt;This is the plug-and-play ETL project, which lives &lt;em&gt;outside&lt;/em&gt; the engine but connects seamlessly to it.&lt;/p&gt;

&lt;p&gt;Think of it like a "cartridge" you can load into the Airflow engine.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Responsibilities:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Contains all DAGs and ETL scripts (&lt;code&gt;extract.py&lt;/code&gt;, &lt;code&gt;transform.py&lt;/code&gt;, &lt;code&gt;load.py&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Handles API calls, data transformations, and loading logic&lt;/li&gt;
&lt;li&gt;Can be swapped or extended without touching the engine&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Relationship Diagram:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;+-----------------------+
|  airflow-docker/      |   ---&amp;gt; Engine (Airflow + Services)
|  ├── docker-compose.yml|
|  ├── .env             |
|  └── dags/ &amp;lt;mount&amp;gt; ---┼──&amp;gt; Mounts DAGs from cartridge
+-----------------------+

        ⬇

+-----------------------------+
| airflow-youtube-analytics/  |  ---&amp;gt; Cartridge (ETL logic)
| ├── pipelines/youtube/      |
| │    ├── extract.py         |
| │    ├── transform.py       |
| │    └── load.py            |
| └── dags/youtube_pipeline.py|
+-----------------------------+
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With this modular setup:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;I can add new "cartridges" (projects) like &lt;code&gt;airflow-nasa-apod/&lt;/code&gt; or &lt;code&gt;airflow-weather-analytics/&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;airflow-docker/&lt;/code&gt; engine never changes—it simply mounts the new DAGs and runs them&lt;/li&gt;
&lt;li&gt;This makes the system scalable and reusable across multiple ETL projects&lt;/li&gt;
&lt;/ul&gt;
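&lt;p&gt;&lt;em&gt;The connection between engine and cartridge is essentially a single bind mount in the engine's compose file. A sketch (the paths are illustrative):&lt;/em&gt;&lt;/p&gt;

```yaml
# airflow-docker/docker-compose.yml (sketch): the scheduler sees the
# cartridge's DAGs because its dags folder is mounted from the other repo.
services:
  airflow-scheduler:
    volumes:
      - ../airflow-youtube-analytics/dags:/opt/airflow/dags
```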




&lt;h2&gt;
  
  
  Project Setup Overview
&lt;/h2&gt;

&lt;p&gt;Our pipeline components:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Layer&lt;/th&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Orchestration&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Apache Airflow&lt;/td&gt;
&lt;td&gt;Automates ETL workflow&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Data Storage&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;MinIO&lt;/td&gt;
&lt;td&gt;Acts as local S3 data lake&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Transformation&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;PySpark / Pandas&lt;/td&gt;
&lt;td&gt;Cleans and processes raw data&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Warehouse&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;PostgreSQL&lt;/td&gt;
&lt;td&gt;Stores transformed metrics&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Visualization&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Grafana&lt;/td&gt;
&lt;td&gt;Visualizes channel performance&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Architecture Diagram:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4ty8nsleromrikiart58.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4ty8nsleromrikiart58.png" alt="Architecture Diagram" width="710" height="901"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Containerized pipeline architecture.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Each service runs as a Docker container defined in the &lt;code&gt;docker-compose.yaml&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;This approach allowed me to test and run everything from extraction to Grafana visualization on my local machine.&lt;/p&gt;




&lt;h2&gt;
  
  
  Container Orchestration in Action
&lt;/h2&gt;

&lt;p&gt;Here's a sample of how services are spun together:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd&lt;/span&gt; ~/airflow-docker
docker compose up &lt;span class="nt"&gt;-d&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To sync environment variables between project and containers:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;./sync_env.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Result:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Airflow runs the DAG (Extract &amp;gt;&amp;gt; Transform &amp;gt;&amp;gt; Load)&lt;/li&gt;
&lt;li&gt;Spark handles transformations&lt;/li&gt;
&lt;li&gt;Data is stored in PostgreSQL and visualized in Grafana&lt;/li&gt;
&lt;li&gt;All communication happens inside containers through a shared Docker network&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Running Containers:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe9fnnnt54zvdm8nop06i.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe9fnnnt54zvdm8nop06i.png" alt="Running Containers" width="800" height="566"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;All containers running simultaneously via Docker Compose.&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Docker simplifies multi-service setup for data engineering projects&lt;/li&gt;
&lt;li&gt;Containerized Airflow pipelines are reproducible and portable&lt;/li&gt;
&lt;li&gt;Local MinIO + PostgreSQL simulates a full-scale cloud environment&lt;/li&gt;
&lt;li&gt;With Docker Compose, you can spin up a production-grade analytics stack in minutes&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Containerization removes the friction between development and deployment. Instead of juggling tool installations, Docker lets you focus on what matters: &lt;strong&gt;data flow&lt;/strong&gt;, &lt;strong&gt;not setup&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;If you've ever been scared to touch Docker, this is your sign:&lt;/p&gt;

&lt;p&gt;Start with one project, one &lt;code&gt;docker-compose.yaml&lt;/code&gt;, and build from there.&lt;/p&gt;

&lt;p&gt;By the end, you'll realize containers don't complicate data pipelines—they &lt;strong&gt;liberate&lt;/strong&gt; them.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;You can explore the complete codebase and pipeline setup in my GitHub repository.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/25thOliver/Airflow-Youtube-Analytics" rel="noopener noreferrer"&gt;GitHub Repo&lt;/a&gt;&lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>tutorial</category>
      <category>devops</category>
      <category>docker</category>
    </item>
    <item>
      <title>Building an Automated YouTube Analytics Dashboard with Airflow, PySpark, MinIO, PostgreSQL &amp; Grafana</title>
      <dc:creator>Oliver Samuel</dc:creator>
      <pubDate>Tue, 07 Oct 2025 06:00:45 +0000</pubDate>
      <link>https://open.forem.com/oliver_samuel_028c6f65ad6/building-an-automated-youtube-analytics-dashboard-with-airflow-pyspark-minio-postgresql-grafana-26ef</link>
      <guid>https://open.forem.com/oliver_samuel_028c6f65ad6/building-an-automated-youtube-analytics-dashboard-with-airflow-pyspark-minio-postgresql-grafana-26ef</guid>
      <description>&lt;p&gt;&lt;strong&gt;Author:&lt;/strong&gt; &lt;em&gt;Oliver Samuel&lt;/em&gt;&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Date:&lt;/strong&gt; &lt;em&gt;October 2025&lt;/em&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;This project explores the digital footprint of &lt;strong&gt;Raye&lt;/strong&gt;, the UK chart-topping artist known for her soulful pop sound and breakout hits like Escapism. Using a custom-built &lt;strong&gt;YouTube Analytics Pipeline&lt;/strong&gt; powered by &lt;strong&gt;Apache Airflow&lt;/strong&gt;, &lt;strong&gt;PySpark&lt;/strong&gt;, &lt;strong&gt;MinIO&lt;/strong&gt;, &lt;strong&gt;PostgreSQL&lt;/strong&gt; and &lt;strong&gt;Grafana&lt;/strong&gt;, we analyzed Raye's channel performance — from engagement trends to audience distribution.&lt;/p&gt;

&lt;p&gt;The goal was to design a scalable data workflow capable of extracting, transforming, and visualizing YouTube channel insights in real time. Beyond technical architecture, this analysis reveals how content release patterns, audience geography, and engagement rates evolve alongside Raye's career milestones.&lt;/p&gt;


&lt;h2&gt;
  
  
  Overview
&lt;/h2&gt;

&lt;p&gt;This project demonstrates how to design, containerize, and automate an &lt;strong&gt;end-to-end data engineering pipeline&lt;/strong&gt; for YouTube analytics using &lt;strong&gt;Apache Airflow&lt;/strong&gt;, &lt;strong&gt;PySpark&lt;/strong&gt;, &lt;strong&gt;MinIO&lt;/strong&gt;, &lt;strong&gt;PostgreSQL&lt;/strong&gt;, and &lt;strong&gt;Grafana&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;It automatically fetches YouTube channel data, performs transformations in Spark, loads the results into a PostgreSQL warehouse, and visualizes insights in Grafana — all orchestrated by Airflow.&lt;/p&gt;

&lt;p&gt;By the end, you'll have a live dashboard showing:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Total videos, views, and subscribers&lt;/li&gt;
&lt;li&gt;Average engagement rates&lt;/li&gt;
&lt;li&gt;Country-level view distribution&lt;/li&gt;
&lt;li&gt;Growth trends and publishing cadence&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi4v1foc3mvf55anys8pn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi4v1foc3mvf55anys8pn.png" alt="Final Grafana dashboard overview" width="800" height="420"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbuz12z26a4hdnvezaghj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbuz12z26a4hdnvezaghj.png" alt="Final Grafana dashboard overview" width="800" height="420"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Final Grafana dashboard overview&lt;/em&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  Architecture Overview
&lt;/h2&gt;

&lt;p&gt;Here's the end-to-end data flow:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;YouTube API → Raw JSON → MinIO (Data Lake)
         ↓
     PySpark Transform
         ↓
 PostgreSQL Warehouse
         ↓
 Grafana Dashboard (Visualization)
         ↓
 Airflow DAG (Automation &amp;amp; Scheduling)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgwscdf60n0ck00va1xdz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgwscdf60n0ck00va1xdz.png" alt="Architecture Diagram" width="710" height="901"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Architecture Diagram&lt;/em&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  Step 1: Automated Extraction with Airflow
&lt;/h2&gt;

&lt;p&gt;The first DAG task — &lt;code&gt;extract_youtube_data&lt;/code&gt; — uses the &lt;strong&gt;YouTube Data API v3&lt;/strong&gt; to fetch metadata and statistics for each target channel.&lt;/p&gt;

&lt;p&gt;The extracted JSON files are stored in &lt;strong&gt;MinIO&lt;/strong&gt;, a local S3-compatible data lake.&lt;/p&gt;
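&lt;p&gt;&lt;em&gt;A hedged sketch of the staging step. The API calls in comments are the real YouTube Data API v3 and boto3 interfaces, but the bucket name, key layout, and endpoint are assumptions:&lt;/em&gt;&lt;/p&gt;

```python
# Illustrative extract-and-stage step: pull channel statistics, then write
# the raw JSON to a date-partitioned key in MinIO's S3-compatible API.
import json

def build_object_key(channel_id: str, run_date: str) -> str:
    """Raw-zone object key, partitioned by extraction date."""
    return f"raw/youtube/{run_date}/{channel_id}.json"

# Real calls (not executed here) would be roughly:
#   youtube = googleapiclient.discovery.build("youtube", "v3", developerKey=API_KEY)
#   resp = youtube.channels().list(part="snippet,statistics", id=channel_id).execute()
#   s3 = boto3.client("s3", endpoint_url="http://minio:9000")
#   s3.put_object(Bucket="datalake",
#                 Key=build_object_key(channel_id, run_date),
#                 Body=json.dumps(resp).encode("utf-8"))
```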

&lt;p&gt;Sample record:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"channel_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"UC123456..."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"channel_title"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Raye"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"statistics"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"viewCount"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"10402000"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"subscriberCount"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"251000"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"videoCount"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"159"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"likeCount"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"359000"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"commentCount"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"50382"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"country"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"GB"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"publishedAt"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2014-06-22T10:05:00Z"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmesrqeqkdsfk65oeyjzz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmesrqeqkdsfk65oeyjzz.png" alt="Raw data in MinIO browser" width="800" height="762"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Raw data in MinIO&lt;/em&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  Step 2: Data Transformation with PySpark
&lt;/h2&gt;

&lt;p&gt;Next, Airflow triggers the &lt;strong&gt;transform task&lt;/strong&gt;, which runs &lt;code&gt;transform_youtube_data()&lt;/code&gt; inside the same containerized environment.&lt;/p&gt;

&lt;p&gt;It loads the raw files from MinIO using the S3A connector, casts numeric types, fills missing values, and computes engagement metrics like &lt;code&gt;views_per_video&lt;/code&gt;, &lt;code&gt;like_ratio&lt;/code&gt;, and &lt;code&gt;engagement_rate&lt;/code&gt;.&lt;/p&gt;
&lt;h3&gt;
  
  
  Key Transformations
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;transformed_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;transformed_df&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;withColumn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;like_count&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;lit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;total_likes&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;withColumn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;comment_count&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;lit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;total_comments&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;withColumn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;views_per_video&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;lit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;total_views&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="nf"&gt;when&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;video_count&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;otherwise&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;video_count&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;withColumn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;subs_per_video&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;subscriber_count&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="nf"&gt;when&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;video_count&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;otherwise&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;video_count&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;withColumn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;like_ratio&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="nf"&gt;when&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;lit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;total_views&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;lit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;total_likes&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="nf"&gt;lit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;total_views&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;)).&lt;/span&gt;&lt;span class="nf"&gt;otherwise&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;lit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;withColumn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;comment_ratio&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="nf"&gt;when&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;lit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;total_views&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;lit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;total_comments&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="nf"&gt;lit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;total_views&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;)).&lt;/span&gt;&lt;span class="nf"&gt;otherwise&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;lit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;withColumn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;engagement_rate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="nf"&gt;when&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;lit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;total_views&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nf"&gt;lit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;total_likes&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nf"&gt;lit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;total_comments&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="nf"&gt;lit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;total_views&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;)).&lt;/span&gt;&lt;span class="nf"&gt;otherwise&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;lit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
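&lt;p&gt;Every ratio column above repeats the same guarded-division pattern. As a plain-Python sanity check of that logic (a hypothetical helper for illustration, not part of the pipeline code), the rules are: substitute 1 when &lt;code&gt;video_count&lt;/code&gt; is zero, and return 0 for the view-based ratios when there are no views:&lt;/p&gt;

```python
def derived_metrics(total_views, total_likes, total_comments,
                    subscriber_count, video_count):
    """Mirror the Spark when/otherwise guards: treat a zero video_count
    as 1, and fall back to 0 for ratios when there are no views."""
    videos = video_count if video_count != 0 else 1
    return {
        "views_per_video": total_views / videos,
        "subs_per_video": subscriber_count / videos,
        "like_ratio": round(total_likes / total_views, 4) if total_views > 0 else 0,
        "comment_ratio": round(total_comments / total_views, 4) if total_views > 0 else 0,
        "engagement_rate": round((total_likes + total_comments) / total_views, 4)
                           if total_views > 0 else 0,
    }

m = derived_metrics(total_views=1_000_000, total_likes=50_000,
                    total_comments=5_000, subscriber_count=200_000, video_count=40)
print(m["engagement_rate"])  # (50_000 + 5_000) / 1_000_000 = 0.055
```

&lt;p&gt;The zero guards matter: a brand-new channel with no uploads or views would otherwise crash the whole Spark job with a division error.&lt;/p&gt;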

&lt;h3&gt;
  
  
  Output Format
&lt;/h3&gt;

&lt;p&gt;The cleaned dataset is stored back to MinIO as Parquet for optimized reads:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;transformed_df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;write&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;mode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;overwrite&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;parquet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3a://rayes-youtube/transformed/channel_stats_transformed.parquet&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
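&lt;p&gt;Writing to &lt;code&gt;s3a://&lt;/code&gt; paths only works once the Spark session knows where MinIO lives. A typical set of settings looks like the fragment below; the values are placeholders to be filled from the environment, and the exact keys can vary with the Spark/Hadoop versions in use:&lt;/p&gt;

```
# spark-defaults style settings for an S3A-compatible MinIO endpoint (placeholder values)
spark.hadoop.fs.s3a.endpoint           http://172.17.0.1:9000
spark.hadoop.fs.s3a.access.key         ${MINIO_ACCESS_KEY}
spark.hadoop.fs.s3a.secret.key         ${MINIO_SECRET_KEY}
spark.hadoop.fs.s3a.path.style.access  true
spark.hadoop.fs.s3a.impl               org.apache.hadoop.fs.s3a.S3AFileSystem
```

&lt;p&gt;Path-style access is the important one for MinIO: without it the S3A client tries bucket-subdomain URLs, which a local MinIO container does not serve.&lt;/p&gt;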



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbqst0tw10eh5tsz3jydm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbqst0tw10eh5tsz3jydm.png" alt="Spark job logs showing transformation success in Airflow" width="800" height="400"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Spark job logs showing transformation success in Airflow&lt;/em&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  Step 3: Load into PostgreSQL
&lt;/h2&gt;

&lt;p&gt;Airflow's final task — &lt;code&gt;load_to_postgres&lt;/code&gt; — transfers the transformed Parquet data into PostgreSQL using a JDBC connector or pandas-based loader.&lt;/p&gt;
&lt;h3&gt;
  
  
  Schema Alignment
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;PySpark Column&lt;/th&gt;
&lt;th&gt;PostgreSQL Column&lt;/th&gt;
&lt;th&gt;Type&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;channel_id&lt;/td&gt;
&lt;td&gt;channel_id&lt;/td&gt;
&lt;td&gt;text&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;channel_title&lt;/td&gt;
&lt;td&gt;channel_name&lt;/td&gt;
&lt;td&gt;text&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;published_at&lt;/td&gt;
&lt;td&gt;published_at&lt;/td&gt;
&lt;td&gt;timestamp&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;view_count&lt;/td&gt;
&lt;td&gt;view_count&lt;/td&gt;
&lt;td&gt;bigint&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;subscriber_count&lt;/td&gt;
&lt;td&gt;subscriber_count&lt;/td&gt;
&lt;td&gt;bigint&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;video_count&lt;/td&gt;
&lt;td&gt;video_count&lt;/td&gt;
&lt;td&gt;bigint&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;like_count&lt;/td&gt;
&lt;td&gt;like_count&lt;/td&gt;
&lt;td&gt;bigint&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;comment_count&lt;/td&gt;
&lt;td&gt;comment_count&lt;/td&gt;
&lt;td&gt;bigint&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;like_ratio&lt;/td&gt;
&lt;td&gt;like_ratio&lt;/td&gt;
&lt;td&gt;double precision&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;comment_ratio&lt;/td&gt;
&lt;td&gt;comment_ratio&lt;/td&gt;
&lt;td&gt;double precision&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;engagement_rate&lt;/td&gt;
&lt;td&gt;engagement_rate&lt;/td&gt;
&lt;td&gt;double precision&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;views_per_video&lt;/td&gt;
&lt;td&gt;views_per_video&lt;/td&gt;
&lt;td&gt;double precision&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;channel_age_days&lt;/td&gt;
&lt;td&gt;channel_age_days&lt;/td&gt;
&lt;td&gt;bigint&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;daily_view_growth&lt;/td&gt;
&lt;td&gt;daily_view_growth&lt;/td&gt;
&lt;td&gt;double precision&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;daily_sub_growth&lt;/td&gt;
&lt;td&gt;daily_sub_growth&lt;/td&gt;
&lt;td&gt;double precision&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;country&lt;/td&gt;
&lt;td&gt;country&lt;/td&gt;
&lt;td&gt;text&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; &lt;code&gt;channel_title&lt;/code&gt; in Spark maps to &lt;code&gt;channel_name&lt;/code&gt; in PostgreSQL — the only column renamed during loading.&lt;/p&gt;
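&lt;p&gt;Conceptually the load step is "rename, then insert". The snippet below is illustrative only: it uses Python's built-in &lt;code&gt;sqlite3&lt;/code&gt; as a stand-in for PostgreSQL so it stays self-contained, whereas the real task would point a JDBC writer or pandas &lt;code&gt;to_sql&lt;/code&gt; at the warehouse.&lt;/p&gt;

```python
import sqlite3

# channel_title is the only column renamed on the way into the warehouse.
RENAMES = {"channel_title": "channel_name"}

def load_rows(conn, rows):
    """Apply the Spark-to-warehouse rename and insert each record."""
    for row in rows:
        record = {RENAMES.get(k, k): v for k, v in row.items()}
        cols = ", ".join(record)
        params = ", ".join("?" for _ in record)
        conn.execute(f"INSERT INTO channel_stats ({cols}) VALUES ({params})",
                     tuple(record.values()))

conn = sqlite3.connect(":memory:")  # stand-in for the PostgreSQL connection
conn.execute("CREATE TABLE channel_stats (channel_id TEXT, channel_name TEXT, view_count INTEGER)")
load_rows(conn, [{"channel_id": "UC123", "channel_title": "Raye", "view_count": 1000000}])
print(conn.execute("SELECT channel_name FROM channel_stats").fetchone()[0])  # Raye
```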

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fad4qdj8bhkvzn9vljko1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fad4qdj8bhkvzn9vljko1.png" alt="Sample query results from PostgreSQL" width="800" height="649"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Sample query results from PostgreSQL&lt;/em&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  Step 4: Airflow Integration and Reusability
&lt;/h2&gt;

&lt;p&gt;This pipeline is designed as a &lt;strong&gt;modular Airflow project&lt;/strong&gt; that plugs into a reusable local engine (&lt;code&gt;~/airflow-docker&lt;/code&gt;).&lt;/p&gt;
&lt;h3&gt;
  
  
  Run Instructions
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Copy and configure environment variables&lt;/span&gt;
&lt;span class="nb"&gt;cp&lt;/span&gt; .env
&lt;span class="nb"&gt;chmod&lt;/span&gt; +x sync_env.sh
./sync_env.sh

&lt;span class="c"&gt;# Start the Airflow engine&lt;/span&gt;
&lt;span class="nb"&gt;cd&lt;/span&gt; ~/airflow-docker
docker compose down
docker compose up &lt;span class="nt"&gt;-d&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1j4ax2o8eibcgybvyb7e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1j4ax2o8eibcgybvyb7e.png" alt="Running Docker Containers" width="800" height="566"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Running Docker Containers&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;To manually test a DAG task inside Airflow:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker &lt;span class="nb"&gt;exec&lt;/span&gt; &lt;span class="nt"&gt;-it&lt;/span&gt; airflow-docker-airflow-scheduler-1 &lt;span class="se"&gt;\&lt;/span&gt;
  python /opt/airflow/dags/pipelines/youtube/extract.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Environment Variables
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Variable&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;YOUTUBE_API_KEY&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;YouTube Data API v3 key&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;MINIO_ACCESS_KEY&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;MinIO access key&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;MINIO_SECRET_KEY&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;MinIO secret key&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;MINIO_ENDPOINT&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;MinIO endpoint (default: &lt;code&gt;http://172.17.0.1:9000&lt;/code&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
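&lt;p&gt;These variables are read at runtime by the pipeline tasks. A small helper keeps the lookups and the documented default endpoint in one place (the helper itself is a sketch, not pipeline code):&lt;/p&gt;

```python
import os

def minio_settings(env=os.environ):
    """Collect MinIO connection settings, falling back to the
    documented default endpoint when MINIO_ENDPOINT is unset."""
    return {
        "endpoint": env.get("MINIO_ENDPOINT", "http://172.17.0.1:9000"),
        "access_key": env.get("MINIO_ACCESS_KEY"),
        "secret_key": env.get("MINIO_SECRET_KEY"),
    }

print(minio_settings({})["endpoint"])  # http://172.17.0.1:9000
```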

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu7zd7pbm8z8856s4pyy9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu7zd7pbm8z8856s4pyy9.png" alt="Airflow DAG graph showing extract → transform → load tasks" width="800" height="833"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Airflow DAG graph showing extract → transform → load tasks&lt;/em&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  Step 5: Visualization with Grafana
&lt;/h2&gt;

&lt;p&gt;Grafana connects directly to PostgreSQL to visualize key metrics.&lt;/p&gt;
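&lt;p&gt;One way to wire that connection is Grafana's datasource provisioning. The fragment below is a sketch with placeholder names and credentials; adapt the fields to your Grafana version and Postgres setup:&lt;/p&gt;

```
# provisioning/datasources/postgres.yaml (placeholder values)
apiVersion: 1
datasources:
  - name: YouTubeAnalytics
    type: postgres
    url: postgres:5432
    user: grafana_reader
    secureJsonData:
      password: ${POSTGRES_PASSWORD}
    jsonData:
      database: airflow
      sslmode: disable
```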
&lt;h3&gt;
  
  
  Example Queries
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;1. Overview Metrics&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; 
  &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;view_count&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;total_views&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;subscriber_count&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;total_subscribers&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;DISTINCT&lt;/span&gt; &lt;span class="n"&gt;channel_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;total_channels&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="nv"&gt;"raye_youtube_channel_stats"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;2. Engagement Rate&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; 
  &lt;span class="n"&gt;ROUND&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;AVG&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;like_count&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;comment_count&lt;/span&gt;&lt;span class="p"&gt;)::&lt;/span&gt;&lt;span class="nb"&gt;numeric&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="k"&gt;NULLIF&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;view_count&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;avg_engagement_rate&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;like_count&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;total_likes&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;comment_count&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;total_comments&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="nv"&gt;"raye_youtube_channel_stats"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;3. Country Breakdown&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; 
    &lt;span class="n"&gt;published_at&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="nb"&gt;time&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;      
    &lt;span class="n"&gt;country&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
    &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;view_count&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;total_views&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="nv"&gt;"raye_youtube_channel_stats"&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;published_at&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;country&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="nb"&gt;time&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;total_views&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flmpb38yce8jgaeslt1r0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flmpb38yce8jgaeslt1r0.png" alt="Country Metrics" width="800" height="332"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Grafana panels showing engagement and country metrics&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Insights
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Engagement Peaks:&lt;/strong&gt; Engagement rates spike around high-visibility video releases&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;View Concentration:&lt;/strong&gt; Most traffic originates from English-speaking regions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Content Rhythm:&lt;/strong&gt; Publishing trends show periodic releases tied to album cycles&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcugsyiesuxpsbg1ssx68.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcugsyiesuxpsbg1ssx68.png" alt="More on engagements" width="800" height="584"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Chart highlighting peak engagement days&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Tech Stack Summary
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Layer&lt;/th&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Orchestration&lt;/td&gt;
&lt;td&gt;Apache Airflow&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Data Storage&lt;/td&gt;
&lt;td&gt;MinIO (S3-compatible)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Transformation&lt;/td&gt;
&lt;td&gt;PySpark&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Warehouse&lt;/td&gt;
&lt;td&gt;PostgreSQL&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Visualization&lt;/td&gt;
&lt;td&gt;Grafana&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Containerization&lt;/td&gt;
&lt;td&gt;Docker Compose&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Automation Summary
&lt;/h2&gt;

&lt;p&gt;Each Airflow DAG run performs the full cycle:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Extract:&lt;/strong&gt; Fetch YouTube channel data&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Transform:&lt;/strong&gt; Clean and compute new metrics via PySpark&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Load:&lt;/strong&gt; Write clean results into PostgreSQL&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Visualize:&lt;/strong&gt; Grafana auto-refreshes metrics in near real time&lt;/li&gt;
&lt;/ol&gt;
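&lt;p&gt;Stripped of Airflow specifics, the cycle above is three functions run in strict order. A toy runner makes the dependency chain explicit (the function bodies are stand-ins, not the real tasks):&lt;/p&gt;

```python
def extract():
    # stand-in for the YouTube Data API call
    return [{"channel_id": "UC123", "view_count": "1000000", "video_count": "40"}]

def transform(rows):
    # stand-in for the PySpark job: cast types, derive views_per_video
    out = []
    for r in rows:
        views, videos = int(r["view_count"]), int(r["video_count"]) or 1
        out.append({**r, "view_count": views, "views_per_video": views / videos})
    return out

def load(rows, warehouse):
    # stand-in for the PostgreSQL load; returns rows written
    warehouse.extend(rows)
    return len(rows)

warehouse = []
loaded = load(transform(extract()), warehouse)
print(loaded, warehouse[0]["views_per_video"])  # 1 25000.0
```

&lt;p&gt;Airflow adds what this toy runner lacks: scheduling, retries, logging, and isolation between the steps.&lt;/p&gt;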




&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;PySpark and MinIO enable scalable, cloud-like ETL locally&lt;/li&gt;
&lt;li&gt;Airflow provides robust scheduling and retry mechanisms&lt;/li&gt;
&lt;li&gt;Grafana and PostgreSQL make analytics exploration seamless&lt;/li&gt;
&lt;li&gt;Modular design allows reuse across multiple data sources or APIs&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;This project went beyond dashboards and data pipelines — it told a story about how an artist's digital rhythm mirrors their creative journey. By building a robust analytics workflow for Raye's YouTube channel, we connected raw engagement metrics to real-world momentum — from viral singles to album releases.&lt;/p&gt;

&lt;p&gt;The pipeline's architecture, powered by &lt;strong&gt;Apache Airflow&lt;/strong&gt;, &lt;strong&gt;PySpark&lt;/strong&gt;, &lt;strong&gt;MinIO&lt;/strong&gt;, &lt;strong&gt;PostgreSQL&lt;/strong&gt;, and &lt;strong&gt;Grafana&lt;/strong&gt;, proved not just scalable but insightful — offering a live pulse on fan interactions, audience geography, and engagement surges tied to content drops.&lt;/p&gt;

&lt;p&gt;As a next step, the same framework can be extended to analyze cross-platform trends (Spotify, Instagram, TikTok) and measure how each channel amplifies an artist's reach in the streaming era.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2fs3bx6sn8oz16mb4fni.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2fs3bx6sn8oz16mb4fni.png" alt="Dashboard shot" width="800" height="392"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Final dashboard hero shot&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Data meets artistry — and every like, view, and comment becomes a note in the bigger symphony of audience connection.&lt;/em&gt;&lt;/p&gt;




</description>
      <category>dataengineering</category>
      <category>analytics</category>
      <category>architecture</category>
      <category>python</category>
    </item>
  </channel>
</rss>
