<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Open Forem: Oliver Samuel</title>
    <description>The latest articles on Open Forem by Oliver Samuel (@oliver_samuel_028c6f65ad6).</description>
    <link>https://open.forem.com/oliver_samuel_028c6f65ad6</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3393962%2F0252b000-257e-4383-9c4d-badf238400b9.jpg</url>
      <title>Open Forem: Oliver Samuel</title>
      <link>https://open.forem.com/oliver_samuel_028c6f65ad6</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://open.forem.com/feed/oliver_samuel_028c6f65ad6"/>
    <language>en</language>
    <item>
      <title>Building a Supermarket Data Pipeline</title>
      <dc:creator>Oliver Samuel</dc:creator>
      <pubDate>Thu, 05 Feb 2026 23:37:53 +0000</pubDate>
      <link>https://open.forem.com/oliver_samuel_028c6f65ad6/building-a-supermarket-data-pipeline-3pfg</link>
      <guid>https://open.forem.com/oliver_samuel_028c6f65ad6/building-a-supermarket-data-pipeline-3pfg</guid>
      <description>&lt;h2&gt;
  
  
  How I Built an Automated System That Turns Messy Sales Data Into Business Gold
&lt;/h2&gt;




&lt;p&gt;&lt;em&gt;Ever wonder how your favorite supermarket knows exactly when to restock the shelves, which products are flying off the racks, or why they always seem to have your favorite snacks in stock? The secret lies in data pipelines, and I built one from scratch.&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The Problem: Data Drowning
&lt;/h2&gt;

&lt;p&gt;Imagine you're the manager of a busy supermarket (e.g., Naivas). Every single day, thousands of transactions flow through your registers: customers buying milk, bread, snacks, cleaning supplies. Each transaction generates a line of data: &lt;em&gt;who bought what, how much they paid, and how they paid&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Now here's the challenge: &lt;strong&gt;all this data is sitting in a messy Google spreadsheet&lt;/strong&gt;, updated by cashiers in real time. It's like having a river of gold nuggets flowing past you, but no way to catch them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The questions that keep you up at night:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Which products are selling the most?&lt;/li&gt;
&lt;li&gt;What payment methods do customers prefer?&lt;/li&gt;
&lt;li&gt;Are there duplicate transactions messing up your accounting?&lt;/li&gt;
&lt;li&gt;How can you make this data useful for reports AND for your mobile app?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is exactly the problem I solved with the &lt;strong&gt;Supermarket ETL Pipeline&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Solution: An Automated Data Factory
&lt;/h2&gt;

&lt;p&gt;Think of my solution like a &lt;strong&gt;water treatment plant&lt;/strong&gt; for data:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Stage&lt;/th&gt;
&lt;th&gt;Water Plant Analogy&lt;/th&gt;
&lt;th&gt;What My Pipeline Does&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Extract&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Pumping water from the river&lt;/td&gt;
&lt;td&gt;Pulling raw sales data from Google Sheets&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Transform&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Filtering out dirt and impurities&lt;/td&gt;
&lt;td&gt;Cleaning duplicates, fixing missing values&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Load&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Storing clean water in tanks&lt;/td&gt;
&lt;td&gt;Saving clean data to PostgreSQL &amp;amp; MongoDB&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
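The three stages above can be sketched end to end in a few lines. This is a simplified illustration using an in-memory pandas DataFrame as a stand-in for the Google Sheet; the function names here are illustrative, not the project's actual API:

```python
import pandas as pd

def extract():
    # Stand-in for pulling raw sales data from Google Sheets
    return pd.DataFrame({
        "id": [1, 2, 2, None],  # note the duplicate and the missing id
        "product_name": ["Milk", "Bread", "Bread", "Soap"],
        "total_amount": [120.0, 55.0, 55.0, 80.0],
    })

def transform(df):
    # Filter out the "impurities": duplicate ids and rows with no id
    return df.drop_duplicates(subset=["id"]).dropna(subset=["id"])

def load(df):
    # Stand-in for writing to PostgreSQL / MongoDB
    return df.to_dict("records")

records = load(transform(extract()))
print(len(records))  # 2 clean rows survive
```

Four raw rows go in, two clean rows come out: exactly the water-treatment flow from the table.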

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;The Google Sheet&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzka2rxhq89pgigm6okc4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzka2rxhq89pgigm6okc4.png" alt="The Google Sheet" width="800" height="420"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;The source Google Sheet: raw transaction data with columns like id, quantity, product_name, total_amount, payment_method, and customer_type, including a few messy and duplicate rows.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  How It Works
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Step 1: Extraction: "Fishing for Data"
&lt;/h3&gt;

&lt;p&gt;My pipeline starts by reaching out to Google Sheets; think of it like casting a fishing net into a lake. The spreadsheet contains raw transaction records: every purchase, every customer, every payment.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;The Pipeline says: "Hey Google, give me all the sales data!"
Google responds: "Here's 1,000 rows of transactions!"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why Google Sheets?&lt;/strong&gt; Because it's where real businesses often keep their data: it's accessible, shareable, and doesn't require expensive software.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Terminal showing extraction logs&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyo8o6h18ciba3hwio2vw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyo8o6h18ciba3hwio2vw.png" alt="Extraction logs" width="800" height="325"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Terminal output with the "Starting extraction from Google Sheets" and "Extracted X rows" messages.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h3&gt;
  
  
  Step 2: Transformation: "The Car Wash for Data"
&lt;/h3&gt;

&lt;p&gt;Raw data is messy. Imagine every car arriving at a car wash covered in mud, leaves, and bird droppings. The transformation stage is my car wash: it takes dirty data and makes it sparkle.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What gets cleaned:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Problem&lt;/th&gt;
&lt;th&gt;Solution&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Duplicate transactions (same ID twice)&lt;/td&gt;
&lt;td&gt;Removed automatically&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Missing transaction IDs&lt;/td&gt;
&lt;td&gt;Rows dropped&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Unnecessary columns&lt;/td&gt;
&lt;td&gt;Only essential fields kept&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The pipeline keeps only what matters:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;id&lt;/code&gt; — Unique transaction identifier&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;quantity&lt;/code&gt; — How many items purchased&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;product_name&lt;/code&gt; — What was bought&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;total_amount&lt;/code&gt; — How much was paid&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;payment_method&lt;/code&gt; — Cash, card, or digital&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;customer_type&lt;/code&gt; — Member or regular customer&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Transformation Logs&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fps6hcv9lwxj2ijjslmz2.png" alt="Transform logs" width="800" height="410"&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 3: Loading: "Two Warehouses, Two Purposes"
&lt;/h3&gt;

&lt;p&gt;Here's where it gets interesting. Instead of storing data in just one place, I built a &lt;strong&gt;dual-database strategy&lt;/strong&gt;. Think of it like having two different storage facilities:&lt;/p&gt;

&lt;h4&gt;
  
  
  PostgreSQL: The Library
&lt;/h4&gt;

&lt;p&gt;PostgreSQL is like a &lt;strong&gt;meticulously organized library&lt;/strong&gt;. Every book (data record) has its place, follows strict rules, and can be cross-referenced with other books easily.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Financial reports ("How much revenue did we make last month?")&lt;/li&gt;
&lt;li&gt;Accounting audits (data integrity is guaranteed)&lt;/li&gt;
&lt;li&gt;Complex queries ("Show me all cash transactions over $100 from member customers")&lt;/li&gt;
&lt;/ul&gt;
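That last question maps directly onto a SQL query. Here's a hedged sketch using SQLite in place of PostgreSQL so it runs anywhere; the table and column names follow the pipeline's schema, but the rows are made up:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE transactions ("
    "id INTEGER, product_name TEXT, total_amount REAL, "
    "payment_method TEXT, customer_type TEXT)"
)
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?, ?, ?, ?)",
    [
        (1, "Milk", 120.0, "cash", "member"),
        (2, "Bread", 55.0, "card", "regular"),
        (3, "Rice", 210.0, "cash", "member"),
    ],
)

# "Show me all cash transactions over $100 from member customers"
rows = conn.execute(
    "SELECT id, product_name, total_amount FROM transactions "
    "WHERE payment_method = 'cash' AND total_amount > 100 "
    "AND customer_type = 'member'"
).fetchall()
print(rows)  # [(1, 'Milk', 120.0), (3, 'Rice', 210.0)]
```

In the library analogy: because every record follows the same strict schema, cross-referencing like this is one declarative query.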

&lt;h4&gt;
  
  
  MongoDB: The Flexible Warehouse
&lt;/h4&gt;

&lt;p&gt;MongoDB is like a &lt;strong&gt;modern warehouse with adjustable shelving&lt;/strong&gt;. You can store items of different shapes and sizes without reorganizing everything.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Mobile app backends (JSON-friendly)&lt;/li&gt;
&lt;li&gt;Rapid prototyping ("Let's quickly add a new field!")&lt;/li&gt;
&lt;li&gt;Analytics dashboards (flexible data exploration)&lt;/li&gt;
&lt;/ul&gt;
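The JSON-friendliness is easy to see without a live MongoDB. The pipeline converts the DataFrame with the same `to_dict("records")` call shown later before `insert_many()`; this sketch stops short of the database and just shows why the shape suits an app backend (the `loyalty_points` field is a hypothetical addition):

```python
import json
import pandas as pd

df = pd.DataFrame([
    {"id": 1, "product_name": "Milk", "total_amount": 120.0},
    {"id": 2, "product_name": "Bread", "total_amount": 55.0},
])

# Same conversion the pipeline uses before insert_many()
docs = df.to_dict("records")

# Documents are already JSON-serializable, ready for a mobile app API
payload = json.dumps(docs)

# "Let's quickly add a new field!" -- no schema migration required,
# because documents in a collection don't have to share a shape
docs[0]["loyalty_points"] = 12
print(docs[0])
```

That adjustable-shelving quality is the trade-off: faster iteration, fewer guarantees than PostgreSQL's strict schema.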

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Docker containers running&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fadsulj0k4aodqqneyuu5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fadsulj0k4aodqqneyuu5.png" alt="Containers Running" width="800" height="410"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;PostgreSQL data view&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftped70e3nuxk1bg8ienc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftped70e3nuxk1bg8ienc.png" alt="PostgreSQL data view" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;MongoDB data view&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fig59g2e8evd0mui6qope.png" alt="Mongo Data View" width="800" height="405"&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  How It Works (The Technical Deep-Dive)
&lt;/h2&gt;

&lt;p&gt;For my fellow engineers, let's pop the hood and look at the engine.&lt;/p&gt;

&lt;h3&gt;
  
  
  Architecture Overview
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│  Google Sheets  │────▶│  Python ETL     │────▶│  PostgreSQL     │
│  (Data Source)  │     │  (Container)    │     │  (Relational)   │
└─────────────────┘     │                 │     └─────────────────┘
                        │  • Extract      │
                        │  • Transform    │     ┌─────────────────┐
                        │  • Load         │────▶│  MongoDB        │
                        └─────────────────┘     │  (Document)     │
                                                └─────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Project folder structure&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcy27d4tpaf48qjtqxfa7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcy27d4tpaf48qjtqxfa7.png" alt="Folder Structure" width="800" height="401"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  The Modular Design Philosophy
&lt;/h3&gt;

&lt;p&gt;Instead of one giant script, I split the pipeline into &lt;strong&gt;specialized modules&lt;/strong&gt;, like having different specialists in a hospital:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;File&lt;/th&gt;
&lt;th&gt;Role&lt;/th&gt;
&lt;th&gt;Hospital Analogy&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;config.py&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Configuration management&lt;/td&gt;
&lt;td&gt;Hospital administrator&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;extract.py&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Data extraction&lt;/td&gt;
&lt;td&gt;Ambulance driver&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;transform.py&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Data cleaning&lt;/td&gt;
&lt;td&gt;Surgeon&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;load_postgres.py&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;PostgreSQL loading&lt;/td&gt;
&lt;td&gt;Recovery ward nurse&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;load_mongo.py&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;MongoDB loading&lt;/td&gt;
&lt;td&gt;Rehabilitation specialist&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;main.py&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Orchestration&lt;/td&gt;
&lt;td&gt;Chief of Medicine&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Why this matters:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Testability&lt;/strong&gt;: I can test the transformation logic without needing a database connection&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Maintainability&lt;/strong&gt;: Changing the data source doesn't break the loading logic&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scalability&lt;/strong&gt;: Adding a new destination (like Snowflake) is just adding one new file&lt;/li&gt;
&lt;/ul&gt;
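To make the testability point concrete: the cleaning logic can be exercised with a plain in-memory DataFrame and simple assertions, with no database container running. This sketch reimplements the cleaning rules described above for illustration rather than importing the project's own module:

```python
import pandas as pd

def transform_data(df):
    # Same cleaning rules as the pipeline: dedupe on id, drop missing ids
    return df.drop_duplicates(subset=["id"]).dropna(subset=["id"])

def test_transform_removes_duplicates_and_missing_ids():
    raw = pd.DataFrame({
        "id": [1, 1, None, 2],
        "product_name": ["Milk", "Milk", "Soap", "Bread"],
    })
    clean = transform_data(raw)
    assert list(clean["id"]) == [1, 2]

test_transform_removes_duplicates_and_missing_ids()
print("transform tests passed")
```

No mocks, no containers, sub-second feedback: that's the payoff of keeping `transform.py` free of database imports.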

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;main.py code&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;etl_pipeline.config&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Config&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;etl_pipeline.extract&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;extract_data&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;etl_pipeline.transform&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;transform_data&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;etl_pipeline.load_postgres&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;load_to_postgres&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;etl_pipeline.load_mongo&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;load_to_mongo&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;sys&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;logging&lt;/span&gt;

&lt;span class="c1"&gt;# Configure logging to stdout
&lt;/span&gt;&lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;basicConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;level&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;INFO&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nb"&gt;format&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;%(asctime)s - %(name)s - %(levelname)s - %(message)s&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;handlers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;StreamHandler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stdout&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ETL Application pipeline initialized.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# 1. Extract
&lt;/span&gt;    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;Config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DATA_SOURCE_TYPE&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sheets&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Starting extraction from Google Sheets (ID: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;Config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;GOOGLE_SHEET_ID&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

            &lt;span class="c1"&gt;# Extract data
&lt;/span&gt;            &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;extract_data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;source_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sheets&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;sheet_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;GOOGLE_SHEET_ID&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Unknown data source: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;Config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DATA_SOURCE_TYPE&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt;

        &lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Extracted &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; rows.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


        &lt;span class="c1"&gt;# 2. Transform
&lt;/span&gt;        &lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Step 2: Transform&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;transformed_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;transform_data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Transformed Data Shape: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;transformed_data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 

        &lt;span class="c1"&gt;# 3. Load to PostgreSQL
&lt;/span&gt;        &lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Step 3: Load to PostgreSQL&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;load_to_postgres&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;transformed_data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;POSTGRES_URL&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Load to MongoDB
&lt;/span&gt;        &lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Step 4: Load to MongoDB&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;load_to_mongo&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;transformed_data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;Config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;MONGO_URI&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;Config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;MONGO_DB&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;ETL pipeline completed successfully.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;critical&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ETL failed: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  The Code Walkthrough
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Extraction: Pandas Does the Heavy Lifting
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;extract_from_public_sheet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sheet_id&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;export_url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://docs.google.com/spreadsheets/d/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;sheet_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/export?format=csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;export_url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The magic&lt;/strong&gt;: Google Sheets can export any public sheet as CSV, and pandas reads it directly from the URL; no authentication is needed for public sheets.&lt;/p&gt;

&lt;h4&gt;
  
  
  Transformation: Clean Data or Bust
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;transform_data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;required_columns&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;quantity&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;product_name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
                        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;total_amount&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;payment_method&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;customer_type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="n"&gt;df_transformed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;required_columns&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;copy&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;df_transformed&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;drop_duplicates&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;subset&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;inplace&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;df_transformed&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dropna&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;subset&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;inplace&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;df_transformed&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Key decisions:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Only keep essential columns (data minimization)&lt;/li&gt;
&lt;li&gt;Remove duplicates by transaction ID (data integrity)&lt;/li&gt;
&lt;li&gt;Drop rows with missing IDs (no orphan records)&lt;/li&gt;
&lt;/ul&gt;
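A quick before-and-after on a toy DataFrame shows all three decisions at once (illustrative data, not the real sheet; `cashier_note` is a made-up column standing in for the fields that get dropped):

```python
import pandas as pd

raw = pd.DataFrame({
    "id": [101, 101, 102, None],
    "product_name": ["Milk", "Milk", "Bread", "Soap"],
    "total_amount": [120.0, 120.0, 55.0, 80.0],
    "cashier_note": ["ok", "ok", "rush", "?"],  # an unnecessary column
})

required_columns = ["id", "product_name", "total_amount"]
clean = raw[required_columns].copy()          # data minimization
clean.drop_duplicates(subset=["id"], inplace=True)  # data integrity
clean.dropna(subset=["id"], inplace=True)           # no orphan records

print(clean.shape)  # (2, 3): two clean rows, three essential columns
```

Four messy rows and four columns become two trustworthy rows and three essential columns.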

&lt;h4&gt;
  
  
  Loading: Two Paths, One Pipeline
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;PostgreSQL with SQLAlchemy:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;load_to_postgres&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;db_url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;table_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;transactions&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;engine&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;create_engine&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;db_url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_sql&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;table_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;engine&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;if_exists&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;replace&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;MongoDB with PyMongo:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;load_to_mongo&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;mongo_uri&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;db_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;collection_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;transactions&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;MongoClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mongo_uri&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;collection&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;db_name&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="n"&gt;collection_name&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;records&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_dict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;records&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;collection&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;insert_many&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;records&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Successful ETL run&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsd2iomstocgertwbydly.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsd2iomstocgertwbydly.png" alt="Successful ETL run" width="800" height="406"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Docker: The "It Works on My Machine" Killer
&lt;/h2&gt;

&lt;p&gt;One of the biggest headaches in software is environment setup. "It works on my machine!" is the developer's equivalent of "the dog ate my homework."&lt;/p&gt;

&lt;p&gt;Docker solves this by &lt;strong&gt;containerizing everything&lt;/strong&gt;. My entire stack (Python app, PostgreSQL, and MongoDB) runs in isolated containers that work identically on any machine.&lt;/p&gt;

&lt;h3&gt;
  
  
  The docker-compose.yml Magic
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;services&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;postgres&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;postgres:15&lt;/span&gt;
    &lt;span class="c1"&gt;# PostgreSQL runs in its own container&lt;/span&gt;

  &lt;span class="na"&gt;mongo&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;mongo:6&lt;/span&gt;
    &lt;span class="c1"&gt;# MongoDB runs in its own container&lt;/span&gt;

  &lt;span class="na"&gt;etl-app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;build&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;.&lt;/span&gt;
    &lt;span class="na"&gt;depends_on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;postgres&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;mongo&lt;/span&gt;
    &lt;span class="c1"&gt;# My Python app waits for databases to be ready&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
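&lt;p&gt;One caveat worth knowing: &lt;code&gt;depends_on&lt;/code&gt; only waits for the database &lt;em&gt;containers&lt;/em&gt; to start, not for the databases inside them to accept connections. A common workaround is a small retry loop at app startup; here's a minimal sketch (the &lt;code&gt;connect&lt;/code&gt; callable stands in for your real &lt;code&gt;create_engine&lt;/code&gt; or &lt;code&gt;MongoClient&lt;/code&gt; call):&lt;/p&gt;

```python
import time

def wait_for(connect, attempts=10, delay=2.0):
    """Call `connect` until it succeeds; raise if it never does."""
    last_error = None
    for _ in range(attempts):
        try:
            return connect()
        except Exception as exc:   # in real code, catch the driver's specific error
            last_error = exc
            time.sleep(delay)
    raise RuntimeError(f"database not reachable after {attempts} attempts") from last_error
```

&lt;p&gt;Alternatively, Compose health checks with the &lt;code&gt;depends_on: condition: service_healthy&lt;/code&gt; long syntax achieve the same thing declaratively.&lt;/p&gt;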



&lt;p&gt;&lt;strong&gt;To run the entire system:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker compose up &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="nt"&gt;--build&lt;/span&gt;
docker compose &lt;span class="nb"&gt;exec &lt;/span&gt;etl-app python main.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Key Lessons &amp;amp; Design Decisions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Why Two Databases?
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Use Case&lt;/th&gt;
&lt;th&gt;Best Database&lt;/th&gt;
&lt;th&gt;Reason&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Financial reports&lt;/td&gt;
&lt;td&gt;PostgreSQL&lt;/td&gt;
&lt;td&gt;ACID compliance, SQL support&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Mobile app API&lt;/td&gt;
&lt;td&gt;MongoDB&lt;/td&gt;
&lt;td&gt;JSON-native, flexible schema&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Complex joins&lt;/td&gt;
&lt;td&gt;PostgreSQL&lt;/td&gt;
&lt;td&gt;Relational model excels&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Rapid prototyping&lt;/td&gt;
&lt;td&gt;MongoDB&lt;/td&gt;
&lt;td&gt;No schema migrations needed&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Why Python?
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Pandas&lt;/strong&gt;: Industry-standard for data manipulation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SQLAlchemy&lt;/strong&gt;: ORM whose parameterized queries guard against SQL injection&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PyMongo&lt;/strong&gt;: Lightweight MongoDB driver&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rich ecosystem&lt;/strong&gt;: Libraries for everything&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Why Modular Design?
&lt;/h3&gt;

&lt;p&gt;Think of it like LEGO blocks. Each module is a self-contained piece that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Can be tested independently&lt;/li&gt;
&lt;li&gt;Can be replaced without breaking others&lt;/li&gt;
&lt;li&gt;Makes debugging a breeze&lt;/li&gt;
&lt;/ul&gt;
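&lt;p&gt;In code, the LEGO idea is just plain functions with narrow contracts, composed at the end. A toy sketch (the function bodies are illustrative, not the project's actual code):&lt;/p&gt;

```python
def extract(rows):
    """Stand-in for reading a CSV or API; each stage is swappable."""
    return list(rows)

def transform(rows):
    """Drop rows without an ID and normalize amounts."""
    return [
        {**r, "amount": float(r["amount"])}
        for r in rows
        if r.get("id") is not None
    ]

def load(rows, sink):
    """Append to any sink with an .extend() method (a list here, a DB adapter in production)."""
    sink.extend(rows)
    return len(rows)

def run_pipeline(source, sink):
    """Compose the stages; each can be tested or replaced independently."""
    return load(transform(extract(source)), sink)
```

&lt;p&gt;Because each stage only sees plain data in and plain data out, a unit test can target any block without spinning up a database.&lt;/p&gt;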




&lt;h2&gt;
  
  
  Future Enhancements
&lt;/h2&gt;

&lt;p&gt;This pipeline is production-ready, but here's what could come next:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Scheduling&lt;/strong&gt;: Run automatically every hour with Apache Airflow or cron&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Message Queues&lt;/strong&gt;: Use Kafka/RabbitMQ for async processing at scale&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data Validation&lt;/strong&gt;: Add Great Expectations for data quality checks&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Monitoring&lt;/strong&gt;: Add Prometheus/Grafana for pipeline observability&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;More Sources&lt;/strong&gt;: Extend to pull from APIs, S3, or other databases&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Building this ETL pipeline taught me that &lt;strong&gt;good data engineering is invisible&lt;/strong&gt;. When it works, nobody notices: the reports are accurate, the app loads fast, and decisions get made with confidence.&lt;/p&gt;

&lt;p&gt;But behind that invisibility is careful architecture: modular code, dual-database strategy, containerized deployment, and clean data transformations.&lt;/p&gt;

&lt;p&gt;Whether you're a business analyst who just wants clean data, or an engineer looking to build your own pipeline, I hope this walkthrough demystified the magic behind turning chaotic spreadsheets into business intelligence gold.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The supermarket never runs out of your favorite snacks because somewhere, a data pipeline is quietly doing its job.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If you're interested in the code, check out the repository here: &lt;a href="https://github.com/25thOliver/ProjectSupermarketAnalysis" rel="noopener noreferrer"&gt;GitHub Repo&lt;/a&gt;&lt;/p&gt;




</description>
      <category>automation</category>
      <category>data</category>
      <category>dataengineering</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Building Scalable Data Pipelines with Airflow, Docker, and Python: A SightSearch Case Study</title>
      <dc:creator>Oliver Samuel</dc:creator>
      <pubDate>Fri, 30 Jan 2026 00:26:52 +0000</pubDate>
      <link>https://open.forem.com/oliver_samuel_028c6f65ad6/building-scalable-data-pipelines-with-airflow-docker-and-python-a-sightsearch-case-study-e9</link>
      <guid>https://open.forem.com/oliver_samuel_028c6f65ad6/building-scalable-data-pipelines-with-airflow-docker-and-python-a-sightsearch-case-study-e9</guid>
      <description>&lt;p&gt;&lt;em&gt;Data is the new oil, but a raw oil field isn't useful until you build a pipeline to refine it.&lt;/em&gt; In this article, I'll take you through the journey of building &lt;strong&gt;SightSearch&lt;/strong&gt;, a robust data ingestion orchestration pipeline. Whether you're a seasoned data engineer or a product manager curious about how data moves from a website to a database, you're in the right place.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Problem: Why Orchestration Matters
&lt;/h2&gt;

&lt;p&gt;Imagine you need to scrape thousands of product images and details daily. You write a script. It works fine on day one. But then:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The script crashes halfway through&lt;/li&gt;
&lt;li&gt;You run out of disk space&lt;/li&gt;
&lt;li&gt;You forget to run it on Sunday&lt;/li&gt;
&lt;li&gt;The website layout changes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A simple script isn't enough. You need &lt;strong&gt;orchestration&lt;/strong&gt;, a system that manages, schedules, monitors, and retries your tasks automatically.&lt;/p&gt;
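&lt;p&gt;At its core, an orchestrator runs tasks in dependency order and keeps track of what's done. Stripped of scheduling, retries, and the UI, the idea can be sketched in a few lines of plain Python (a toy model for intuition, not how Airflow is actually implemented):&lt;/p&gt;

```python
def run_dag(tasks, deps):
    """Run tasks in dependency order.

    tasks: name -> zero-arg callable; deps: name -> list of upstream names.
    Returns the order in which tasks actually ran.
    """
    done, order = set(), []

    def run(name):
        if name in done:
            return
        for upstream in deps.get(name, []):
            run(upstream)           # make sure prerequisites ran first
        tasks[name]()
        done.add(name)
        order.append(name)

    for name in tasks:
        run(name)
    return order
```

&lt;p&gt;On top of this core, Airflow layers scheduling, automatic retries, logging, and a UI, which is exactly what a crashing cron script lacks.&lt;/p&gt;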




&lt;h2&gt;
  
  
  Tech Stack
&lt;/h2&gt;

&lt;p&gt;I entered the workshop with a clear goal: build something scalable and reliable. Here are the tools I chose:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Apache Airflow&lt;/strong&gt;: The industry standard for orchestrating complex workflows (DAGs)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Docker &amp;amp; Docker Compose&lt;/strong&gt;: To ensure our code runs the same way on my laptop as it does in production&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Python&lt;/strong&gt;: For the heavy lifting (scraping, image processing)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MongoDB&lt;/strong&gt;: NoSQL storage for our flexible product data&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PostgreSQL&lt;/strong&gt;: Relational storage for Airflow's internal metadata&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Architecture Overview
&lt;/h2&gt;

&lt;p&gt;The pipeline is split into independent, reusable "tasks." This modularity is key: if the scraping works but the database is down, we don't lose the data; we just retry the storage step later.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F36dldnh33um9h2kogvxl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F36dldnh33um9h2kogvxl.png" alt="A high-level diagram of the architecture" width="800" height="267"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;A high-level diagram of the architecture&lt;/em&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  The Pipeline in Action
&lt;/h2&gt;

&lt;p&gt;Let's look at the heart of our project: the &lt;strong&gt;Airflow DAG&lt;/strong&gt; (Directed Acyclic Graph). It defines the order of operations.&lt;/p&gt;
&lt;h3&gt;
  
  
  1. The Scrape Task
&lt;/h3&gt;

&lt;p&gt;First, we hit the target website to gather raw product titles and image URLs. We use smart logic to handle pagination and rate limiting.&lt;/p&gt;
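&lt;p&gt;The loop shape is roughly this (a sketch with the page fetcher injected so it can be tested offline; in the real task, &lt;code&gt;fetch_page&lt;/code&gt; wraps an HTTP request):&lt;/p&gt;

```python
import time

def scrape_all(fetch_page, delay=0.0):
    """Walk numbered pages until one comes back empty, pausing between requests.

    fetch_page(page_number) -> list of items.
    """
    items, page = [], 1
    while True:
        batch = fetch_page(page)
        if not batch:
            break                  # an empty page means we've run out of results
        items.extend(batch)
        page += 1
        time.sleep(delay)          # crude rate limiting; be polite to the target site
    return items
```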
&lt;h3&gt;
  
  
  2. Image Processing
&lt;/h3&gt;

&lt;p&gt;Raw images are heavy. We download them, calculate their hash (pHash) for deduplication, and extract metadata like dimensions and file size.&lt;/p&gt;
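&lt;p&gt;Here's the dedup idea in miniature. Note that the real pipeline uses a perceptual hash (pHash), which also collapses &lt;em&gt;near&lt;/em&gt;-duplicates; the plain SHA-256 used below only catches byte-for-byte copies:&lt;/p&gt;

```python
import hashlib

def dedupe_images(images):
    """Keep one record per unique content hash, plus simple metadata.

    images: iterable of (name, bytes) pairs.
    """
    seen, kept = set(), []
    for name, data in images:
        digest = hashlib.sha256(data).hexdigest()
        if digest in seen:
            continue               # already stored an identical image
        seen.add(digest)
        kept.append({"name": name, "hash": digest, "size_bytes": len(data)})
    return kept
```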
&lt;h3&gt;
  
  
  3. Validation and Storage
&lt;/h3&gt;

&lt;p&gt;Data quality is paramount. We validate every record. Good data goes to &lt;strong&gt;MongoDB&lt;/strong&gt;; bad data is logged for review.&lt;/p&gt;
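&lt;p&gt;A minimal sketch of that split (the required fields are assumptions for illustration):&lt;/p&gt;

```python
def validate(records, required=("title", "image_url")):
    """Split records into good (destined for MongoDB) and bad (logged for review)."""
    good, bad = [], []
    for rec in records:
        missing = [f for f in required if not rec.get(f)]
        (bad if missing else good).append(rec)
    return good, bad
```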


&lt;h2&gt;
  
  
  Step-by-Step Walkthrough
&lt;/h2&gt;

&lt;p&gt;Here's how we bring this system to life.&lt;/p&gt;
&lt;h3&gt;
  
  
  Phase 1: The Setup
&lt;/h3&gt;

&lt;p&gt;We use &lt;code&gt;docker-compose&lt;/code&gt; to spin up our entire infrastructure with one command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker compose &lt;span class="nt"&gt;-f&lt;/span&gt; docker/docker-compose.yml up &lt;span class="nt"&gt;-d&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx6krznow1q45hykfeka9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx6krznow1q45hykfeka9.png" alt="Terminal showing docker containers starting up successfully" width="800" height="575"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Terminal showing Docker containers starting up successfully&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Phase 2: The Airflow UI
&lt;/h3&gt;

&lt;p&gt;Once running, we log into the Airflow webserver. This is our command center.&lt;/p&gt;

&lt;p&gt;We unpause our &lt;code&gt;sightsearch_ingestion_pipeline&lt;/code&gt; and trigger a run.&lt;/p&gt;

&lt;h3&gt;
  
  
  Phase 3: Monitoring Execution
&lt;/h3&gt;

&lt;p&gt;As the pipeline runs, we can watch each task succeed in real time. This visual feedback is incredibly satisfying and useful for debugging.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvpb3iupgtu1etlwq83h0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvpb3iupgtu1etlwq83h0.png" alt="Airflow UI" width="800" height="432"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Airflow UI showing specific tasks turning dark green, indicating success&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Phase 4: Verifying the Data
&lt;/h3&gt;

&lt;p&gt;Finally, the moment of truth. We check our database to ensure the data actually arrived.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd5laufzizggv24zzjjx0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd5laufzizggv24zzjjx0.png" alt="A terminal or GUI view of MongoDB showing a query" width="800" height="437"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;MongoDB query &lt;code&gt;db.products.findOne()&lt;/code&gt; returning a structured product document with title, price, and image_metadata&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Challenges and Best Practices
&lt;/h2&gt;

&lt;p&gt;It wasn't all smooth sailing. Here are critical lessons I learned:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Handling Secrets Securely
&lt;/h3&gt;

&lt;p&gt;Initially, I hardcoded database passwords in &lt;code&gt;docker-compose.yml&lt;/code&gt;. This is a huge security risk!&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution&lt;/strong&gt;: I refactored to use a &lt;code&gt;.env&lt;/code&gt; file, keeping my credentials out of version control.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Module-Level Connections
&lt;/h3&gt;

&lt;p&gt;I initially opened a database connection at the top of the scraping script. This caused Airflow to try to connect to the DB while merely &lt;em&gt;parsing&lt;/em&gt; the file, leading to timeouts.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution&lt;/strong&gt;: I moved the connection logic inside the execution functions. Always initialize resources lazily!&lt;/p&gt;
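&lt;p&gt;The pattern in miniature (&lt;code&gt;make_engine&lt;/code&gt; stands in for something like &lt;code&gt;sqlalchemy.create_engine(DB_URL)&lt;/code&gt;):&lt;/p&gt;

```python
_engine = None  # nothing is created at import/parse time

def get_engine(make_engine):
    """Create the connection on first use, then reuse it (lazy initialization)."""
    global _engine
    if _engine is None:
        _engine = make_engine()
    return _engine
```

&lt;p&gt;Because nothing runs at module import, Airflow can parse the DAG file instantly, and the connection is only opened inside the task that needs it.&lt;/p&gt;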




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;SightSearch demonstrates that with the right tools, even complex data ingestion can be made reliable and transparent. Airflow gives us control, Docker gives us consistency, and Python gives us power.&lt;/p&gt;

&lt;p&gt;If you're interested in the code, check out the repository here: &lt;strong&gt;&lt;a href="https://github.com/25thOliver/SightSearch" rel="noopener noreferrer"&gt;GitHub Repo&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;




</description>
      <category>automation</category>
      <category>dataengineering</category>
      <category>docker</category>
      <category>python</category>
    </item>
    <item>
      <title>CHW Monthly Activity Aggregation: Turning Visit Logs into Insight</title>
      <dc:creator>Oliver Samuel</dc:creator>
      <pubDate>Tue, 02 Dec 2025 18:33:21 +0000</pubDate>
      <link>https://open.forem.com/oliver_samuel_028c6f65ad6/chw-monthly-activity-aggregation-turning-visit-logs-into-insight-4jm5</link>
      <guid>https://open.forem.com/oliver_samuel_028c6f65ad6/chw-monthly-activity-aggregation-turning-visit-logs-into-insight-4jm5</guid>
      <description>&lt;p&gt;Community Health Workers (CHWs) generate a huge amount of visit-level data: every household visit, assessment, and follow-up. On its own, this raw data is hard to use. Our CHW Monthly Activity Aggregation project turns those raw records into a clean monthly summary that's ready for dashboards and performance reviews.&lt;/p&gt;

&lt;p&gt;At a high level, this is a dbt (data build tool) project that reads a fact table of CHW activities and produces a single, analytics-ready table: &lt;code&gt;public.chw_activity_monthly&lt;/code&gt;, with one row per CHW per reporting month.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Problem Does This Project Solve?
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Today's challenge&lt;/strong&gt;: Managers often get raw logs (“John visited household 123 at 10:15”) instead of understandable summaries (“John visited 28 households in March, with 6 pregnancy visits”).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Our solution&lt;/strong&gt;: We automatically group all visits into monthly summaries per CHW, applying agreed-upon business rules, so decision-makers see consistent, comparable metrics over time.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Source: &lt;code&gt;public.fct_chw_activity&lt;/code&gt; (via dbt source &lt;code&gt;marts.fct_chv_activity&lt;/code&gt;)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Output: &lt;code&gt;public.chw_activity_monthly&lt;/code&gt;, incremental model keyed by &lt;code&gt;['chv_id', 'report_month']&lt;/code&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Tech stack: dbt + Postgres, orchestrated via Docker&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flxh256qhkvbfgjruqt6y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flxh256qhkvbfgjruqt6y.png" alt="Project layout" width="800" height="795"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Project layout&lt;/em&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  How the Data Flows
&lt;/h2&gt;

&lt;p&gt;The project follows a simple story:&lt;/p&gt;
&lt;h3&gt;
  
  
  &lt;strong&gt;1. Start with raw visit data&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Each row in &lt;code&gt;fct_chv_activity&lt;/code&gt; is a single CHW activity: who visited, when, what type of visit, and where.&lt;/p&gt;
&lt;h3&gt;
  
  
  &lt;strong&gt;2. Clean and filter&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;We:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Drop records with missing CHW IDs&lt;/li&gt;
&lt;li&gt;Drop records with missing activity dates&lt;/li&gt;
&lt;li&gt;Exclude visits marked as deleted&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  &lt;strong&gt;3. Assign a “reporting month”&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Instead of just using calendar months, we use a &lt;strong&gt;26th-of-the-month rule&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Visits on the &lt;strong&gt;1st to 25th&lt;/strong&gt; belong to that month&lt;/li&gt;
&lt;li&gt;Visits on the &lt;strong&gt;26th or later&lt;/strong&gt; are counted in the &lt;strong&gt;next month&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This matches the business reporting cycle CHW programs often use.&lt;/p&gt;
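&lt;p&gt;The same rule expressed in Python, for illustration (the project implements it as a dbt macro in SQL):&lt;/p&gt;

```python
from datetime import date

def report_month(activity_date: date) -> date:
    """Apply the 26th-of-the-month rule: days 26 and later roll into the next month."""
    year, month = activity_date.year, activity_date.month
    if activity_date.day >= 26:
        month += 1
        if month == 13:            # December 26+ rolls into January of the next year
            year, month = year + 1, 1
    return date(year, month, 1)    # the month is represented by its first day
```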
&lt;h3&gt;
  
  
  &lt;strong&gt;4. Aggregate to monthly metrics&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;For each CHW and reporting month, we calculate:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;total_activities&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;unique_households_visited&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;unique_patients_served&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;pregnancy_visits&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;child_assessments&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;family_planning_visits&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
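&lt;p&gt;For illustration, the same aggregation in pandas (the project does this in SQL inside the dbt model; the &lt;code&gt;visit_type&lt;/code&gt; values here are assumptions):&lt;/p&gt;

```python
import pandas as pd

def monthly_metrics(df: pd.DataFrame) -> pd.DataFrame:
    """One row per CHW per report_month, mirroring the dbt aggregation."""
    return (
        df.groupby(["chv_id", "report_month"], as_index=False)
          .agg(
              total_activities=("visit_type", "size"),
              unique_households_visited=("household_id", "nunique"),
              unique_patients_served=("patient_id", "nunique"),
              # conditional count; child_assessments and family_planning_visits
              # follow the same pattern with their own visit_type values
              pregnancy_visits=("visit_type", lambda s: int((s == "pregnancy").sum())),
          )
    )
```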
&lt;h3&gt;
  
  
  &lt;strong&gt;5. Store the result as a reusable table&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The final table &lt;code&gt;public.chw_activity_monthly&lt;/code&gt; is what reporting tools (e.g., Power BI, Looker) will connect to.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpau9ikzgen81p7mpxvlo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpau9ikzgen81p7mpxvlo.png" alt="Data Flow" width="800" height="514"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;A simple diagram showing the flow from &lt;code&gt;fct_chv_activity → cleaning/filtering → month assignment → chw_activity_monthly&lt;/code&gt; with the key metrics.&lt;/em&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Key Components
&lt;/h2&gt;
&lt;h3&gt;
  
  
  &lt;strong&gt;Macro&lt;/strong&gt;: &lt;code&gt;month_assignment(date_column)&lt;/code&gt; in &lt;code&gt;macros/month_assignment.sql&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;Implements the 26th-day rule using SQL &lt;code&gt;case&lt;/code&gt; + &lt;code&gt;date_trunc&lt;/code&gt; and a &lt;code&gt;+ interval '1 month'&lt;/code&gt; when the day ≥ 26.&lt;/p&gt;
&lt;h3&gt;
  
  
  &lt;strong&gt;Model&lt;/strong&gt;: &lt;code&gt;models/starter_code/chw_activity_monthly.sql&lt;/code&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;materialized = 'incremental'&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;incremental_strategy = 'delete+insert'&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;unique_key = ['chv_id', 'report_month']&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Within the model:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;CTE &lt;code&gt;raw&lt;/code&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Selects from &lt;code&gt;{{ source('marts', 'fct_chv_activity') }}&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Filters out null &lt;code&gt;activity_date&lt;/code&gt;, null &lt;code&gt;chv_id&lt;/code&gt;, and &lt;code&gt;is_deleted = true&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;For incremental runs, only processes months at or after the earliest &lt;code&gt;report_month&lt;/code&gt; currently present&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;CTE &lt;code&gt;assigned&lt;/code&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Calls &lt;code&gt;{{ month_assignment('activity_date') }}&lt;/code&gt; to compute &lt;code&gt;report_month&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;CTE &lt;code&gt;aggregated&lt;/code&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Groups by &lt;code&gt;chv_id, report_month&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Computes counts and conditional sums for activity types&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvwhxiqgrfiiflt35wbln.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvwhxiqgrfiiflt35wbln.png" alt="Model" width="800" height="813"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;chw_activity_monthly.sql model with the CTE structure visible and the config block at the top.&lt;/em&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Business Rules in More Detail
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Only valid, non-deleted records are included (no missing CHW, no missing date, no deleted events).&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Report month&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A visit on March 24 counts in March.&lt;/li&gt;
&lt;li&gt;A visit on March 27 counts in April.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Metrics per CHW per month:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Count total activities&lt;/li&gt;
&lt;li&gt;Count unique households and unique patients&lt;/li&gt;
&lt;li&gt;Break out program categories: pregnancy visits, child assessments, family planning visits&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If the organization later changes how months are defined or what counts as, say, a “pregnancy visit,” we can adjust these rules centrally in the dbt code, and all downstream dashboards will automatically align.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F076qjjj3kqcfx4a8cjpa.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F076qjjj3kqcfx4a8cjpa.png" alt="fct\_chv\_activity" width="800" height="780"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;A small, anonymized excerpt of fct_chv_activity and the corresponding rows in chw_activity_monthly, to illustrate how several visits roll up into one summary row.&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; 
       &lt;span class="n"&gt;chv_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;report_month&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;total_activities&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;unique_households_visited&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;unique_patients_served&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;pregnancy_visits&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;child_assessments&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;family_planning_visits&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="k"&gt;public&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chw_activity_monthly&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;chv_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;report_month&lt;/span&gt;
&lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffmxri4dcs40t0xba8pe0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffmxri4dcs40t0xba8pe0.png" alt="query\_output" width="800" height="433"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Testing and Data Quality
&lt;/h2&gt;

&lt;p&gt;The project uses dbt’s testing framework to keep the model trustworthy:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Not-null tests&lt;/strong&gt; on key fields like &lt;code&gt;chv_id&lt;/code&gt; and &lt;code&gt;report_month&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Uniqueness test&lt;/strong&gt; that enforces each (&lt;code&gt;chv_id&lt;/code&gt;, &lt;code&gt;report_month&lt;/code&gt;) pair appears only once (via &lt;code&gt;dbt_utils.unique_combination_of_columns&lt;/code&gt;)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This ensures you do not accidentally have duplicate rows or missing identifiers in your analytical table.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1c8qn3eduumoj2gwf0vc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1c8qn3eduumoj2gwf0vc.png" alt="dbt test" width="800" height="620"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Terminal output from &lt;code&gt;dbt test --select chw_activity_monthly&lt;/code&gt; showing all tests passing.&lt;/em&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  How to Run the Project
&lt;/h2&gt;

&lt;p&gt;These steps assume you have Docker installed and are running in the project root folder.&lt;/p&gt;
&lt;h3&gt;
  
  
  &lt;strong&gt;1. Start the Docker services&lt;/strong&gt;
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker compose up &lt;span class="nt"&gt;-d&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;This brings up:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A dbt runner container&lt;/li&gt;
&lt;li&gt;A Postgres database with the CHW activity data loaded&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpl4545qeiexkjp2hso9g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpl4545qeiexkjp2hso9g.png" alt="Docker run" width="800" height="593"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Terminal view showing &lt;code&gt;docker compose up -d&lt;/code&gt; completing successfully, with containers listed via &lt;code&gt;docker ps&lt;/code&gt;.&lt;/em&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  &lt;strong&gt;2. Open a shell in the dbt runner&lt;/strong&gt;
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker &lt;span class="nb"&gt;exec&lt;/span&gt; &lt;span class="nt"&gt;-it&lt;/span&gt; dbt_runner bash
&lt;span class="nb"&gt;cd &lt;/span&gt;chw_project/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Inside this container, you're now in the &lt;code&gt;dbt/chw_project&lt;/code&gt; directory, where the dbt project is configured (via &lt;code&gt;dbt_project.yml&lt;/code&gt; and &lt;code&gt;profiles.yml&lt;/code&gt;).&lt;/p&gt;
&lt;h3&gt;
  
  
  &lt;strong&gt;3. Build the monthly activity model&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;dbt run
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;dbt will:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Compile the SQL&lt;/li&gt;
&lt;li&gt;Apply the filters and the month assignment&lt;/li&gt;
&lt;li&gt;Materialize or update the &lt;code&gt;public.chw_activity_monthly&lt;/code&gt; table&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftotdqtghazsbxw5rrszd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftotdqtghazsbxw5rrszd.png" alt="A successful run" width="800" height="897"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Terminal output of &lt;code&gt;dbt run&lt;/code&gt;, showing &lt;code&gt;chw_activity_monthly&lt;/code&gt; as OK or SUCCESS with timing.&lt;/em&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  &lt;strong&gt;4. Run tests for the model&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Still inside the dbt runner container:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;dbt &lt;span class="nb"&gt;test&lt;/span&gt; &lt;span class="nt"&gt;--select&lt;/span&gt; chw_activity_monthly
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You should see the tests for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Not-null constraints&lt;/li&gt;
&lt;li&gt;Unique combination of &lt;code&gt;chv_id&lt;/code&gt; + &lt;code&gt;report_month&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2dw94ie6rm5yzk08nrj2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2dw94ie6rm5yzk08nrj2.png" alt="Summary output" width="800" height="620"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Test summary output, highlighting that all tests are passing.&lt;/em&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  &lt;strong&gt;5. Inspect the final table in Postgres&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;From your host machine:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker &lt;span class="nb"&gt;exec&lt;/span&gt; &lt;span class="nt"&gt;-it&lt;/span&gt; dbt_postgres bash
psql &lt;span class="nt"&gt;-U&lt;/span&gt; dbt_user &lt;span class="nt"&gt;-d&lt;/span&gt; analytics
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the psql prompt:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="k"&gt;public&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chw_activity_monthly&lt;/span&gt;
&lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This returns sample rows, each representing a single CHW’s activity in one reporting month.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffbgf9fu3mm13v290vx0u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffbgf9fu3mm13v290vx0u.png" alt="psql result" width="800" height="416"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Recommendations
&lt;/h2&gt;

&lt;p&gt;Once the table exists and has passed tests, analytics or program teams can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Point their BI tool to &lt;code&gt;public.chw_activity_monthly&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Build dashboards like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;CHW monthly productivity (activities per month)&lt;/li&gt;
&lt;li&gt;Coverage (households visited by CHW and region)&lt;/li&gt;
&lt;li&gt;Program mix (share of pregnancy vs. child vs. family-planning visits)&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;Track trends over time and identify:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Under-served areas (few households reached)&lt;/li&gt;
&lt;li&gt;High-performing CHWs&lt;/li&gt;
&lt;li&gt;Seasonal patterns in demand&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;Because all these dashboards share the same centrally defined table and business rules, reports across teams will be consistent and comparable.&lt;/p&gt;
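
&lt;p&gt;As a starting point, a monthly-productivity panel could be fed by a query along these lines. Only &lt;code&gt;chv_id&lt;/code&gt; and &lt;code&gt;report_month&lt;/code&gt; are confirmed by this post; &lt;code&gt;total_activities&lt;/code&gt; stands in for whatever measure columns the table actually carries:&lt;/p&gt;

```sql
-- Sketch of a BI query over the summary table (column names partly assumed).
SELECT
    report_month,
    chv_id,
    total_activities                       -- assumed measure column
FROM public.chw_activity_monthly
ORDER BY report_month, total_activities DESC;
```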

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;This project turns complex CHW visit logs into a simple summary table, so you can clearly see who is doing what, where, and when.&lt;br&gt;
It’s a dbt incremental model over &lt;code&gt;fct_chv_activity&lt;/code&gt;, using a macro for the 26th-day reporting rule, strong record-level filtering, and dbt tests for key constraints.&lt;/p&gt;

&lt;p&gt;The result is a reliable foundation for CHW performance analytics and program decision-making, with transparent and maintainable logic behind every number.&lt;/p&gt;




</description>
      <category>analytics</category>
      <category>sql</category>
      <category>dataengineering</category>
      <category>tooling</category>
    </item>
    <item>
      <title>Outerwear Performance Analysis: A Data-Driven Investigation</title>
      <dc:creator>Oliver Samuel</dc:creator>
      <pubDate>Tue, 18 Nov 2025 16:59:08 +0000</pubDate>
      <link>https://open.forem.com/oliver_samuel_028c6f65ad6/outerwear-performance-analysis-a-data-driven-investigation-4k4f</link>
      <guid>https://open.forem.com/oliver_samuel_028c6f65ad6/outerwear-performance-analysis-a-data-driven-investigation-4k4f</guid>
      <description>&lt;h2&gt;
  
  
  &lt;strong&gt;1. Problem Statement&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The &lt;strong&gt;Outerwear category&lt;/strong&gt; has shown persistent underperformance across multiple business dimensions: revenue, margin, and customer engagement. Despite moderate sales spikes in peak seasons (Fall and Winter), total Outerwear revenue ($18.5K) lags far behind other categories, indicating structural weaknesses in demand generation and retention.&lt;/p&gt;

&lt;p&gt;High &lt;strong&gt;discount penetration (44.4%)&lt;/strong&gt; suggests dependency on promotions to move stock, compressing margins and signaling that customers perceive inadequate value at full price. Meanwhile, ratings are only moderate (3.75 overall) and decline further in Fall (3.64), implying inconsistent product quality or unmet customer expectations during the peak sales window.&lt;/p&gt;

&lt;p&gt;Seasonal dependency, discount-driven sales, and stagnant customer retention lead to a vicious cycle: deep discounting drives one-time purchases but suppresses long-term profitability. Our analysis aims to diagnose these issues and design actionable solutions for merchandising, marketing, and financial optimization.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;2. Primary Hypothesis&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;H₀ (Null Hypothesis):&lt;/strong&gt;&lt;br&gt;
Outerwear performance aligns with normal seasonal apparel patterns and observed sales fluctuations reflect natural demand variability.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;H₁ (Alternative Hypothesis):&lt;/strong&gt;&lt;br&gt;
Outerwear underperforms due to addressable factors — &lt;em&gt;seasonal dependency, discount addiction, quality decline, and poor retention&lt;/em&gt; — which can be mitigated via assortment diversification, pricing strategy, and loyalty optimization.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;3. Sub-Hypotheses and Analytical Approach&lt;/strong&gt;
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;ID&lt;/th&gt;
&lt;th&gt;Sub-Hypothesis&lt;/th&gt;
&lt;th&gt;Key Tests&lt;/th&gt;
&lt;th&gt;Expected Outcome&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;H1&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;em&gt;Seasonality Hypothesis:&lt;/em&gt; Outerwear revenue is overly concentrated in Fall–Winter seasons.&lt;/td&gt;
&lt;td&gt;Seasonality Index; Chi-square for uniformity; MoM trendline&lt;/td&gt;
&lt;td&gt;Revenue &amp;gt;50% from Fall/Winter → confirms dependency.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;H2&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;em&gt;Discount Dependency Hypothesis:&lt;/em&gt; High discount penetration artificially sustains volume.&lt;/td&gt;
&lt;td&gt;T-test on AOV (discounted vs non-discounted); Repeat purchase rate&lt;/td&gt;
&lt;td&gt;Discounts → higher one-time buyers, lower loyalty.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;H3&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;em&gt;Quality/Ratings Hypothesis:&lt;/em&gt; Outerwear ratings dropping in Fall correlate with reduced repurchase rates.&lt;/td&gt;
&lt;td&gt;ANOVA: ratings vs season; correlation (rating vs repurchase)&lt;/td&gt;
&lt;td&gt;Declining Quality → reduced retention, especially in Fall.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;H4&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;em&gt;Retention Hypothesis:&lt;/em&gt; Outerwear attracts one-time, non-subscriber customers.&lt;/td&gt;
&lt;td&gt;Chi-square: loyalty distribution; segment comparison (new vs loyal buyers)&lt;/td&gt;
&lt;td&gt;Majority transactions from non-subscribers → poor loyalty.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;H5&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;em&gt;Assortment Mismatch Hypothesis:&lt;/em&gt; SKU concentration in specific sizes/colors limits appeal.&lt;/td&gt;
&lt;td&gt;Herfindahl Index on SKU diversity; distribution plots (size/color)&lt;/td&gt;
&lt;td&gt;Over-indexed in M/Cyan → missing revenue from underserved segments.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
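
&lt;p&gt;To make the H1 test concrete, here is a minimal chi-square goodness-of-fit sketch in Python. It uses the seasonal Outerwear transaction counts reported later in this analysis and asks whether they depart from a uniform spread; the same pattern applies to seasonal revenue shares:&lt;/p&gt;

```python
# Chi-square goodness-of-fit: are Outerwear transactions uniform across seasons?
# Counts come from the seasonal breakdown reported in this post.
observed = {"Spring": 169, "Summer": 134, "Fall": 166, "Winter": 170}

expected = sum(observed.values()) / len(observed)  # uniform expectation per season
chi_sq = sum((count - expected) ** 2 / expected for count in observed.values())

# 7.815 is the 5% critical value for a chi-square with 3 degrees of freedom.
print(f"chi-square statistic: {chi_sq:.2f}")
print(f"reject uniformity at the 5% level: {chi_sq > 7.815}")
```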




&lt;h2&gt;
  
  
  &lt;strong&gt;4. PromptBI Dashboard Link&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://app.promptbi.ai/public/chat/14ccfba2-f2df-4373-8bd2-ca78a62d0208" rel="noopener noreferrer"&gt;Dashboard&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyrk2ymr7s5scok8en1hb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyrk2ymr7s5scok8en1hb.png" alt="Dashboard Deep dive" width="800" height="382"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Outerwear Category Analytics and Deep Dive&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Summary Insight:&lt;/strong&gt; Outerwear revenue and transaction counts swing strongly with the seasons, and the category leans on discounts more heavily than any other.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Metrics:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Total Outerwear Revenue: $36,753.00&lt;/li&gt;
&lt;li&gt;Average Outerwear Rating: 3.75&lt;/li&gt;
&lt;li&gt;Outerwear Transaction Count: 639&lt;/li&gt;
&lt;li&gt;Most Common Outerwear Size: M&lt;/li&gt;
&lt;li&gt;Most Common Outerwear Color: Cyan&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Supporting Metrics and Trends:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Revenue by Category: Clothing leads with the highest total revenue, while Outerwear trails every other category.&lt;/li&gt;
&lt;li&gt;Average Order Value (AOV): Footwear has the highest AOV, indicating higher-value purchases.&lt;/li&gt;
&lt;li&gt;Transaction Count: Clothing is the most transacted category.&lt;/li&gt;
&lt;li&gt;Units Sold: Clothing also leads in units sold.&lt;/li&gt;
&lt;li&gt;Rating Distribution: The most frequent rating bin is 3.25-3.5.&lt;/li&gt;
&lt;li&gt;Discount Penetration: Outerwear has the highest discount penetration among all categories.&lt;/li&gt;
&lt;li&gt;Customer Loyalty Metrics: Loyal customers show a higher transaction count in the Outerwear category.&lt;/li&gt;
&lt;li&gt;Seasonal Trends:

&lt;ul&gt;
&lt;li&gt;Revenue Peak: Fall season&lt;/li&gt;
&lt;li&gt;Transaction Peak: Winter season&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Price Distribution: The most frequent price bin for Outerwear is $20.00-$30.00.&lt;/li&gt;

&lt;/ul&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;5. Visuals Section — PromptBI Chart Placement Guide&lt;/strong&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;1. Revenue by Category Comparison&lt;/strong&gt; &lt;em&gt;(Bar Chart)&lt;/em&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frqczt85rcuz4brgv7d42.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frqczt85rcuz4brgv7d42.jpeg" alt="Revenue by Category Comparison" width="531" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Insights:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The chart reveals the total revenue generated by different product categories: Accessories, Clothing, Footwear, and Outerwear.&lt;/li&gt;
&lt;li&gt;The highest revenue is from Clothing with $104,264, followed by Accessories with $74,200.&lt;/li&gt;
&lt;li&gt;Footwear generates $36,093 in revenue, while Outerwear has the lowest revenue at $18,524.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Trends and Implications:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Clothing is the top revenue generator, indicating strong customer demand and potentially higher profit margins.&lt;/li&gt;
&lt;li&gt;Outerwear, with the lowest revenue, suggests either lower demand, higher competition, or pricing issues.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Customer and Business Impact:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The disparity in revenue between categories highlights the need for targeted strategies to boost underperforming segments like Outerwear.&lt;/li&gt;
&lt;li&gt;Understanding why Outerwear has the lowest revenue could uncover market gaps or customer pain points.&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;2. Average Order Value by Category&lt;/strong&gt; &lt;em&gt;(Bar Chart)&lt;/em&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg57zvpxcwwkn095zqmbv.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg57zvpxcwwkn095zqmbv.jpeg" alt="Average Order Value by Category" width="531" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Insights:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The Outerwear category has the lowest Average Order Value (AOV) at $57.17, roughly 5% below the category leader.&lt;/li&gt;
&lt;li&gt;In contrast, Footwear leads with the highest AOV at $60.26.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Business Implications:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The lower AOV in Outerwear suggests potential areas for improvement in customer engagement or product offerings within this category.&lt;/li&gt;
&lt;li&gt;Consider analyzing customer feedback and sales data specific to Outerwear to identify pain points or opportunities for upselling.&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;3. Transaction Count by Category&lt;/strong&gt; &lt;em&gt;(Bar Chart)&lt;/em&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3ndems27q5m0zx8qzp5d.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3ndems27q5m0zx8qzp5d.jpeg" alt="Transaction Count by Category" width="531" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Insights:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Outerwear transactions are 54% lower than Accessories and 81% lower than Clothing.&lt;/li&gt;
&lt;li&gt;This indicates potential underperformance or lower customer demand in the Outerwear segment.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Business Implications:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Consider reviewing the Outerwear product lineup for relevance and appeal.&lt;/li&gt;
&lt;li&gt;Evaluate marketing efforts targeted at this category to identify gaps.&lt;/li&gt;
&lt;li&gt;Explore seasonal trends or external factors affecting Outerwear sales.&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;4. Units Sold Comparison&lt;/strong&gt; &lt;em&gt;(Bar Chart)&lt;/em&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fskj8ee00ysnizsvxtcc8.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fskj8ee00ysnizsvxtcc8.jpeg" alt="Units Sold Comparison" width="531" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Insights:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Lowest Sales Volume: Outerwear has the fewest units sold at 324, significantly lower than other categories.&lt;/li&gt;
&lt;li&gt;Sales Gap: There's a notable 1,413 units difference between Outerwear and the highest-selling category, Clothing.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Business Implications:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The low sales in Outerwear may indicate market saturation, competitor dominance, or customer disinterest.&lt;/li&gt;
&lt;li&gt;Consider a market analysis to understand the underlying causes and explore strategies to boost Outerwear sales.&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;5. Rating Distribution&lt;/strong&gt; &lt;em&gt;(Histogram)&lt;/em&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhoweecl7ntmr9pfjgm25.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhoweecl7ntmr9pfjgm25.jpeg" alt="Rating Distribution" width="531" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The histogram reveals the distribution of customer ratings for the Outerwear category, which is crucial for understanding customer satisfaction and identifying areas for improvement.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Trends:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The most frequent rating bin is 3.25-3.5, indicating a central tendency towards average satisfaction.&lt;/li&gt;
&lt;li&gt;Ratings between 3.25-3.5 and 3.75-4.0 have the highest number of reviews, suggesting a significant portion of customers are moderately satisfied.&lt;/li&gt;
&lt;li&gt;There is a noticeable drop in the number of reviews for ratings below 3.0, implying fewer customers are highly dissatisfied.&lt;/li&gt;
&lt;li&gt;The lowest rating bin (2.5-2.75) still has a considerable number of reviews (379), indicating room for improvement in product quality or customer experience.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Business Implications:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Focus on enhancing products or services to move the average ratings upwards, especially targeting the 2.5-2.75 range.&lt;/li&gt;
&lt;li&gt;Investigate the causes behind the moderate satisfaction levels in the 3.25-3.5 range to identify specific pain points or areas for enhancement.&lt;/li&gt;
&lt;li&gt;Leverage the high number of reviews in the 3.75-4.0 range to gather insights on what customers appreciate most, and amplify those aspects in marketing and product development.&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;6. Discount Penetration by Category&lt;/strong&gt; &lt;em&gt;(Bar Chart)&lt;/em&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd0fuo4azbo6jleuey4vy.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd0fuo4azbo6jleuey4vy.jpeg" alt="Discount Penetration by Category" width="531" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Outerwear leads in discount penetration, which is consistent with the discount-dependency concern raised in the problem statement.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Insights:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Outerwear has the highest discount penetration at 44.44%, indicating heavy reliance on discounts to move stock in this category.&lt;/li&gt;
&lt;li&gt;Accessories and Footwear follow with discount penetrations of 43.79% and 43.24% respectively, showing consistent customer interest.&lt;/li&gt;
&lt;li&gt;Clothing has the lowest penetration at 42.08%, suggesting potential room for optimization in discount strategies.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Business Implications:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Reassess the depth and frequency of Outerwear discounts: high penetration combined with the category's lowest revenue points to margin compression rather than healthy engagement.&lt;/li&gt;
&lt;li&gt;Investigate why Clothing has lower discount penetration and explore opportunities to increase its appeal through targeted promotions or product improvements.&lt;/li&gt;
&lt;li&gt;Monitor trends in Accessories and Footwear to ensure continued customer interest and adjust strategies as needed.&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;7. Customer Loyalty Metrics by Category&lt;/strong&gt; &lt;em&gt;(Stacked Bar)&lt;/em&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs3u846hfekvjqznhasor.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs3u846hfekvjqznhasor.jpeg" alt="Customer Loyalty Metrics by Category" width="531" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Transaction Insights:&lt;/strong&gt;&lt;br&gt;
The Outerwear category shows a total of 324 transactions (233 non-subscribers + 91 subscribers).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Trends:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Non-subscribers dominate with 233 transactions, significantly higher than subscribers.&lt;/li&gt;
&lt;li&gt;Subscribers contribute 91 transactions, indicating a smaller but present loyal customer base.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Business Implications:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The lower transaction count for subscribers suggests potential growth in customer loyalty for Outerwear.&lt;/li&gt;
&lt;li&gt;Focus on converting non-subscribers to subscribers could increase overall transaction volume.&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;8. Outerwear Revenue by Season&lt;/strong&gt; &lt;em&gt;(Line Chart)&lt;/em&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxnd3wp1j8cmv2swfhv72.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxnd3wp1j8cmv2swfhv72.jpeg" alt="Outerwear Revenue by Season" width="531" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why It Matters:&lt;/strong&gt;&lt;br&gt;
Understanding seasonal revenue trends in the Outerwear category is crucial for aligning inventory, marketing efforts, and customer engagement strategies.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Trends:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Spring: High revenue at $9,749, indicating strong demand.&lt;/li&gt;
&lt;li&gt;Summer: Lowest revenue at $7,449, suggesting reduced need for outerwear.&lt;/li&gt;
&lt;li&gt;Fall: Peak season with revenue at $9,778, the highest point.&lt;/li&gt;
&lt;li&gt;Winter: Slight drop from Fall, revenue at $9,777.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Business Implications:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Focus marketing and promotions in Fall to capitalize on peak demand.&lt;/li&gt;
&lt;li&gt;Consider inventory adjustments for Summer to align with lower demand.&lt;/li&gt;
&lt;li&gt;Analyze customer behavior in Spring to replicate successful strategies in other seasons.&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;9. Outerwear Transactions by Season&lt;/strong&gt; &lt;em&gt;(Line Chart)&lt;/em&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2wf4eu13tmek7rw9xjz7.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2wf4eu13tmek7rw9xjz7.jpeg" alt="Outerwear Transactions by Season" width="531" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why It Matters:&lt;/strong&gt;&lt;br&gt;
Understanding seasonal trends in outerwear transactions helps in aligning inventory, marketing efforts, and customer engagement strategies to maximize sales and customer satisfaction.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Trends:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Winter Peak: Outerwear transactions peak in Winter with 170 transactions, indicating high demand during this season.&lt;/li&gt;
&lt;li&gt;Spring High: Spring follows closely with 169 transactions, suggesting strong seasonal demand.&lt;/li&gt;
&lt;li&gt;Lowest in Summer: Summer sees the lowest transaction count at 134, reflecting reduced need for outerwear.&lt;/li&gt;
&lt;li&gt;Fall Resurgence: Fall shows a resurgence with 166 transactions, close to Spring levels.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Business Implications:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Focus marketing and promotional efforts on Winter and Spring to capitalize on high transaction periods.&lt;/li&gt;
&lt;li&gt;Consider inventory adjustments to ensure sufficient stock during peak seasons while managing lower demand in Summer.&lt;/li&gt;
&lt;li&gt;Explore strategies to boost sales in Summer, such as promoting lightweight outerwear or transitional pieces.&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;10. Discount Impact on Outerwear Revenue&lt;/strong&gt; &lt;em&gt;(Scatter Plot)&lt;/em&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fedhrg3gw452v2k9r6uyn.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fedhrg3gw452v2k9r6uyn.jpeg" alt="Discount Impact on Outerwear Revenue" width="531" height="521"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why This Matters:&lt;/strong&gt;&lt;br&gt;
Understanding how discounts affect outerwear revenue is crucial for optimizing sales strategies and maximizing profit margins.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Trends:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Fall shows the highest revenue at $9,778 with a discount penetration of 45.18%.&lt;/li&gt;
&lt;li&gt;Summer has the highest discount penetration at 50% but the lowest revenue at $7,449.&lt;/li&gt;
&lt;li&gt;Spring and Winter show similar revenue figures around $9,750 with discount penetrations of 40.24% and 47.65%, respectively.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Trendline Analysis:&lt;/strong&gt;&lt;br&gt;
The trendline equation y = -185.25x + 17666.75 indicates a negative correlation between discount penetration and revenue. As discount penetration increases, revenue tends to decrease.&lt;/p&gt;
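
&lt;p&gt;The reported trendline can be reproduced with an ordinary least-squares fit over the four seasonal points above, for example with a short numpy sketch:&lt;/p&gt;

```python
# Least-squares line through the four (discount penetration %, revenue $) points
# quoted above; polyfit recovers the dashboard's y = -185.25x + 17666.75.
import numpy as np

discount_pct = np.array([40.24, 50.00, 45.18, 47.65])  # Spring, Summer, Fall, Winter
revenue = np.array([9749.0, 7449.0, 9778.0, 9777.0])

slope, intercept = np.polyfit(discount_pct, revenue, deg=1)
print(f"y = {slope:.2f}x + {intercept:.2f}")  # y = -185.25x + 17666.75
```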

&lt;p&gt;&lt;strong&gt;Business Implications:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Higher discounts do not necessarily lead to higher revenue, as seen in Summer.&lt;/li&gt;
&lt;li&gt;Moderate discount levels in Fall and Winter correlate with peak revenue.&lt;/li&gt;
&lt;li&gt;Consider reducing discount levels in Summer to potentially increase revenue.&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;11. Customer Segmentation by Season&lt;/strong&gt; &lt;em&gt;(Clustered Bar)&lt;/em&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fttntdqf0bmc3bsk8qso4.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fttntdqf0bmc3bsk8qso4.jpeg" alt="Customer Segmentation by Season" width="531" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why This Matters:&lt;/strong&gt;&lt;br&gt;
Understanding customer behavior across different seasons helps tailor marketing strategies and inventory planning for the Outerwear category.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Trends:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Loyal Customers Drive Transactions: The 'Loyal' segment consistently shows higher transaction volumes compared to the 'Active' segment across all seasons.&lt;/li&gt;
&lt;li&gt;Seasonal Variance:

&lt;ul&gt;
&lt;li&gt;'Loyal' customers peak in Spring with 133 transactions and maintain high levels in Fall (129) and Winter (132).&lt;/li&gt;
&lt;li&gt;'Active' customers show less variance, ranging from 31 transactions in Summer to 38 in Winter.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Summer Low for Loyal Customers: The 'Loyal' segment drops to 103 transactions in Summer, indicating a potential area for engagement strategies.&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Business Implications:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Focus retention strategies on the 'Loyal' segment to maintain high transaction volumes.&lt;/li&gt;
&lt;li&gt;Investigate why 'Loyal' customers drop in Summer and develop targeted campaigns to re-engage them during this period.&lt;/li&gt;
&lt;li&gt;Consider inventory adjustments to align with the high demand from 'Loyal' customers in Fall, Spring, and Winter.&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;12. Price Distribution&lt;/strong&gt; &lt;em&gt;(Histogram)&lt;/em&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd8hukc93imf9y3qty4jb.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd8hukc93imf9y3qty4jb.jpeg" alt="Price Distribution" width="531" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Customers prefer lower price ranges.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Trends:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The most frequent price bin is 20.0-30.0 USD, with 112 transactions. This indicates a strong preference for lower-priced outerwear.&lt;/li&gt;
&lt;li&gt;The number of transactions decreases as the price increases from 20.0-30.0 USD to 70.0-80.0 USD.&lt;/li&gt;
&lt;li&gt;There is a slight uptick in transactions in the 80.0-90.0 USD range, suggesting a segment of customers willing to pay a bit more.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Business Implications:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Focus marketing efforts on the 20.0-30.0 USD range to capture the largest customer segment.&lt;/li&gt;
&lt;li&gt;Consider promotional strategies for the 80.0-90.0 USD range to leverage the observed interest.&lt;/li&gt;
&lt;li&gt;Analyze the 70.0-80.0 USD range to understand the drop-off and adjust pricing or product offerings accordingly.&lt;/li&gt;
&lt;/ul&gt;
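The $10-wide price bins behind this histogram are simple to recompute from raw transaction prices. A hedged sketch in plain Python; the `prices` list is illustrative, not the article's dataset:

```python
from collections import Counter

# Illustrative transaction prices in USD -- not the article's actual data.
prices = [24.99, 27.50, 22.00, 45.00, 83.10, 29.95, 61.25, 88.00, 23.40, 72.80]

# Bucket each price into a $10-wide bin, e.g. 24.99 -> "20.0-30.0".
def price_bin(p: float, width: float = 10.0) -> str:
    lo = (p // width) * width
    return f"{lo:.1f}-{lo + width:.1f}"

counts = Counter(price_bin(p) for p in prices)
for label, n in sorted(counts.items(), key=lambda kv: float(kv[0].split("-")[0])):
    print(f"{label:>11} USD: {n}")
```

The same binning applied to the full Outerwear transaction table would reproduce the 112-transaction peak in the 20.0-30.0 USD bin described above.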




&lt;h3&gt;
  
  
  &lt;strong&gt;13. Size and Color Distribution&lt;/strong&gt; &lt;em&gt;(Stacked Bar)&lt;/em&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwlvtru1x23qs3ug3wwfj.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwlvtru1x23qs3ug3wwfj.jpeg" alt="Size and Color Distribution" width="531" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why This Matters:&lt;/strong&gt;&lt;br&gt;
Understanding the distribution of outerwear sizes and colors helps tailor inventory and marketing strategies to meet customer preferences effectively.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Trends:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Medium Size Dominance: Size M has the highest counts across almost all colors, indicating a strong preference for this size.&lt;/li&gt;
&lt;li&gt;Popular Colors: Beige, Blue, Brown, and Gray are consistently popular across all sizes, with notable peaks in Size M.&lt;/li&gt;
&lt;li&gt;Size L Insights: Size L shows a varied distribution with Cyan and Brown leading, suggesting a niche market within larger sizes.&lt;/li&gt;
&lt;li&gt;Size S Observations: Size S has lower overall counts, with Beige and Olive standing out, hinting at specific customer segments.&lt;/li&gt;
&lt;li&gt;Size XL Trends: Size XL has the lowest counts, with Cyan and Lavender showing slight preference, indicating limited demand.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Business Implications:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Focus inventory replenishment on Size M, especially in popular colors like Beige, Blue, and Gray.&lt;/li&gt;
&lt;li&gt;Consider targeted marketing campaigns for Size L, highlighting popular colors like Cyan and Brown.&lt;/li&gt;
&lt;li&gt;Evaluate the need for Size S and XL, possibly reducing stock for less popular colors to optimize inventory.&lt;/li&gt;
&lt;li&gt;Explore customer feedback for Size S and XL to understand specific needs and preferences.&lt;/li&gt;
&lt;/ul&gt;
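Concentration of the kind shown in this chart (demand piling into Size M) can be quantified with a Herfindahl-Hirschman index over SKU shares, the same measure named in the analytical plan of the companion investigation. A sketch with illustrative size-level counts, not the chart's exact values:

```python
# Herfindahl-Hirschman index (HHI): sum of squared shares.
# Ranges from 1/k (perfectly even over k SKUs) up to 1.0 (all demand in one SKU).
# Counts below are illustrative size-level counts, not the chart's exact data.
sku_counts = {"S": 40, "M": 160, "L": 90, "XL": 30}

total = sum(sku_counts.values())
shares = {sku: n / total for sku, n in sku_counts.items()}
hhi = sum(s ** 2 for s in shares.values())

even = 1 / len(sku_counts)  # HHI if demand were spread evenly across sizes
print(f"HHI = {hhi:.3f} (evenly spread would be {even:.3f})")
```

An HHI well above the even-spread baseline is the quantitative signal behind the "Medium Size Dominance" observation and supports the SKU-diversification recommendation below.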




&lt;h2&gt;
  
  
  &lt;strong&gt;6. Short Conclusion &amp;amp; Prioritized Recommendations&lt;/strong&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Key Takeaways&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Outerwear's &lt;strong&gt;low profitability&lt;/strong&gt; stems from high discount usage, narrow SKU range, and weak customer retention.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Seasonality is confirmed&lt;/strong&gt; but not optimized — demand surges in Fall/Winter, yet margins erode.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Moderate but declining ratings&lt;/strong&gt; suggest emerging product or expectation alignment issues.&lt;/li&gt;
&lt;li&gt;Data indicates &lt;strong&gt;assortment imbalance&lt;/strong&gt; (overindexing in size M and Cyan color), limiting growth potential.&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;Recommendations by Function&lt;/strong&gt;
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Function&lt;/th&gt;
&lt;th&gt;Action Priority&lt;/th&gt;
&lt;th&gt;Recommendation&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Merchandising&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;🔹 High&lt;/td&gt;
&lt;td&gt;Diversify Outerwear SKUs — introduce extended sizes (S, XL), rebalance color range, and add transitional products for off-seasons.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Marketing&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;🔹 Medium&lt;/td&gt;
&lt;td&gt;Shift communication from discount-heavy promos to "durability and design" value messaging; launch a Summer lightweight outerwear campaign.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Finance&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;🔹 High&lt;/td&gt;
&lt;td&gt;Optimize discount policy — cap average discount &amp;lt;35%, test bundle promos instead of direct markdowns to protect margin.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;CRM / Loyalty&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;🔹 Medium&lt;/td&gt;
&lt;td&gt;Introduce a loyalty reward or seasonal bundle subscription for outerwear customers to improve repeat purchase rates.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;One-Line Executive Summary&lt;/strong&gt;
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;"Outerwear's performance problem is not demand shortage but &lt;em&gt;value perception and assortment imbalance&lt;/em&gt; — by optimizing variety, discount structure, and retention strategy, the category can regain profitability and year-round engagement."&lt;/p&gt;
&lt;/blockquote&gt;

</description>
    </item>
    <item>
      <title>Outerwear Performance Analysis: A Data-Driven Investigation</title>
      <dc:creator>Oliver Samuel</dc:creator>
      <pubDate>Tue, 18 Nov 2025 12:27:07 +0000</pubDate>
      <link>https://open.forem.com/oliver_samuel_028c6f65ad6/outerwear-performance-analysis-a-data-driven-investigation-23b9</link>
      <guid>https://open.forem.com/oliver_samuel_028c6f65ad6/outerwear-performance-analysis-a-data-driven-investigation-23b9</guid>
      <description>&lt;h2&gt;
  
  
  &lt;strong&gt;1. Problem Statement&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The &lt;strong&gt;Outerwear category&lt;/strong&gt; has shown persistent underperformance across multiple business dimensions: revenue, margin, and customer engagement. Despite moderate sales spikes in peak seasons (Fall and Winter), total Outerwear revenue ($18.5K) lags far behind other categories, indicating structural weaknesses in demand generation and retention.&lt;/p&gt;

&lt;p&gt;High &lt;strong&gt;discount penetration (44.4%)&lt;/strong&gt; suggests dependency on promotions to move stock, compressing margins and signaling that customers perceive inadequate value at full price. Meanwhile, ratings are only moderate (3.75 overall) and decline further in Fall (3.64), implying inconsistent product quality or unmet customer expectations during the peak sales window.&lt;/p&gt;

&lt;p&gt;Seasonal dependency, discount-driven sales, and stagnant customer retention lead to a vicious cycle: deep discounting drives one-time purchases but suppresses long-term profitability. Our analysis aims to diagnose these issues and design actionable solutions for merchandising, marketing, and financial optimization.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;2. Primary Hypothesis&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;H₀ (Null Hypothesis):&lt;/strong&gt;&lt;br&gt;
Outerwear performance aligns with normal seasonal apparel patterns, and observed sales fluctuations reflect natural demand variability.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;H₁ (Alternative Hypothesis):&lt;/strong&gt;&lt;br&gt;
Outerwear underperforms due to addressable factors — &lt;em&gt;seasonal dependency, discount addiction, quality decline, and poor retention&lt;/em&gt; — which can be mitigated via assortment diversification, pricing strategy, and loyalty optimization.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;3. Sub-Hypotheses and Analytical Approach&lt;/strong&gt;
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;ID&lt;/th&gt;
&lt;th&gt;Sub-Hypothesis&lt;/th&gt;
&lt;th&gt;Key Tests&lt;/th&gt;
&lt;th&gt;Expected Outcome&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;H1&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;em&gt;Seasonality Hypothesis:&lt;/em&gt; Outerwear revenue is overly concentrated in Fall–Winter seasons.&lt;/td&gt;
&lt;td&gt;Seasonality Index; Chi-square for uniformity; MoM trendline&lt;/td&gt;
&lt;td&gt;Revenue &amp;gt;50% from Fall/Winter → confirms dependency.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;H2&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;em&gt;Discount Dependency Hypothesis:&lt;/em&gt; High discount penetration artificially sustains volume.&lt;/td&gt;
&lt;td&gt;T-test on AOV (discounted vs non-discounted); Repeat purchase rate&lt;/td&gt;
&lt;td&gt;Discounts → higher one-time buyers, lower loyalty.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;H3&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;em&gt;Quality/Ratings Hypothesis:&lt;/em&gt; Outerwear ratings dropping in Fall correlate with reduced repurchase rates.&lt;/td&gt;
&lt;td&gt;ANOVA: ratings vs season; correlation (rating vs repurchase)&lt;/td&gt;
&lt;td&gt;Declining quality → reduced retention, especially in Fall.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;H4&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;em&gt;Retention Hypothesis:&lt;/em&gt; Outerwear attracts one-time, non-subscriber customers.&lt;/td&gt;
&lt;td&gt;Chi-square: loyalty distribution; segment comparison (new vs loyal buyers)&lt;/td&gt;
&lt;td&gt;Majority transactions from non-subscribers → poor loyalty.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;H5&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;em&gt;Assortment Mismatch Hypothesis:&lt;/em&gt; SKU concentration in specific sizes/colors limits appeal.&lt;/td&gt;
&lt;td&gt;Herfindahl Index on SKU diversity; distribution plots (size/color)&lt;/td&gt;
&lt;td&gt;Over-indexed in M/Cyan → missing revenue from underserved segments.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
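The chi-square-for-uniformity test listed under H1 is easy to run by hand. A sketch in plain Python using the seasonal transaction counts reported in chart 9 of this analysis; note that transaction counts alone sit close to uniform, and the article's H1 concerns revenue concentration, which would use revenue-weighted figures instead (7.815 is the standard χ² critical value for df = 3 at α = 0.05):

```python
# Chi-square goodness-of-fit test against a uniform seasonal distribution.
# Observed Outerwear transactions per season (Winter, Spring, Summer, Fall),
# as reported in the seasonal transactions chart.
observed = [170, 169, 134, 166]

expected = sum(observed) / len(observed)  # uniform expectation: 159.75 per season
chi2 = sum((o - expected) ** 2 / expected for o in observed)

CRITICAL_95_DF3 = 7.815  # chi-square critical value, df = 3, alpha = 0.05
print(f"chi2 = {chi2:.2f}, reject uniformity: {chi2 > CRITICAL_95_DF3}")
```

The same machinery applies to H4's loyalty-distribution test by swapping in subscriber vs non-subscriber counts and their expected split.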




&lt;h2&gt;
  
  
  &lt;strong&gt;4. PromptBI Dashboard Link&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://app.promptbi.ai/public/chat/14ccfba2-f2df-4373-8bd2-ca78a62d0208" rel="noopener noreferrer"&gt;Dashboard&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyrk2ymr7s5scok8en1hb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyrk2ymr7s5scok8en1hb.png" alt="Dashboard Deep dive" width="800" height="382"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Outerwear Category Analytics and Deep Dive&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Summary Insight:&lt;/strong&gt; The Outerwear category has shown significant trends in revenue, transaction counts, and customer preferences, with notable seasonal variations and discount impacts.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Metrics:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Total Outerwear Revenue: $36,753.00&lt;/li&gt;
&lt;li&gt;Average Outerwear Rating: 3.75&lt;/li&gt;
&lt;li&gt;Outerwear Transaction Count: 639&lt;/li&gt;
&lt;li&gt;Most Common Outerwear Size: M&lt;/li&gt;
&lt;li&gt;Most Common Outerwear Color: Cyan&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Supporting Metrics and Trends:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Revenue by Category: Clothing leads with the highest total revenue, while Outerwear trails all categories.&lt;/li&gt;
&lt;li&gt;Average Order Value (AOV): Footwear has the highest AOV, indicating higher-value purchases.&lt;/li&gt;
&lt;li&gt;Transaction Count: Clothing is the most transacted category.&lt;/li&gt;
&lt;li&gt;Units Sold: Clothing also leads in units sold.&lt;/li&gt;
&lt;li&gt;Rating Distribution: The most frequent rating bin is 3.25-3.5.&lt;/li&gt;
&lt;li&gt;Discount Penetration: Outerwear has the highest discount penetration among all categories.&lt;/li&gt;
&lt;li&gt;Customer Loyalty Metrics: Loyal customers show a higher transaction count in the Outerwear category.&lt;/li&gt;
&lt;li&gt;Seasonal Trends:

&lt;ul&gt;
&lt;li&gt;Revenue Peak: Fall season&lt;/li&gt;
&lt;li&gt;Transaction Peak: Winter season&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Price Distribution: The most frequent price bin for Outerwear is $20.00-$30.00.&lt;/li&gt;

&lt;/ul&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;5. Visuals Section — PromptBI Chart Placement Guide&lt;/strong&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;1. Revenue by Category Comparison&lt;/strong&gt; &lt;em&gt;(Bar Chart)&lt;/em&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frqczt85rcuz4brgv7d42.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frqczt85rcuz4brgv7d42.jpeg" alt="Revenue by Category Comparison" width="531" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Insights:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The chart reveals the total revenue generated by different product categories: Accessories, Clothing, Footwear, and Outerwear.&lt;/li&gt;
&lt;li&gt;The highest revenue is from Clothing with $104,264, followed by Accessories with $74,200.&lt;/li&gt;
&lt;li&gt;Footwear generates $36,093 in revenue, while Outerwear has the lowest revenue at $18,524.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Trends and Implications:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Clothing is the top revenue generator, indicating strong customer demand and potentially higher profit margins.&lt;/li&gt;
&lt;li&gt;Outerwear, with the lowest revenue, suggests weaker demand, stronger competition, or pricing issues.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Customer and Business Impact:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The disparity in revenue between categories highlights the need for targeted strategies to boost underperforming segments like Outerwear.&lt;/li&gt;
&lt;li&gt;Understanding why Outerwear has the lowest revenue could uncover market gaps or customer pain points.&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;2. Average Order Value by Category&lt;/strong&gt; &lt;em&gt;(Bar Chart)&lt;/em&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg57zvpxcwwkn095zqmbv.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg57zvpxcwwkn095zqmbv.jpeg" alt="Average Order Value by Category" width="531" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Insights:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The Outerwear category has the lowest Average Order Value (AOV) at $57.17, significantly below the other categories.&lt;/li&gt;
&lt;li&gt;In contrast, Footwear leads with the highest AOV at $60.26.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Business Implications:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The lower AOV in Outerwear suggests potential areas for improvement in customer engagement or product offerings within this category.&lt;/li&gt;
&lt;li&gt;Consider analyzing customer feedback and sales data specific to Outerwear to identify pain points or opportunities for upselling.&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;3. Transaction Count by Category&lt;/strong&gt; &lt;em&gt;(Bar Chart)&lt;/em&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3ndems27q5m0zx8qzp5d.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3ndems27q5m0zx8qzp5d.jpeg" alt="Transaction Count by Category" width="531" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Insights:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Outerwear transactions are 54% lower than Accessories and 81% lower than Clothing.&lt;/li&gt;
&lt;li&gt;This indicates potential underperformance or lower customer demand in the Outerwear segment.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Business Implications:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Consider reviewing the Outerwear product lineup for relevance and appeal.&lt;/li&gt;
&lt;li&gt;Evaluate marketing efforts targeted at this category to identify gaps.&lt;/li&gt;
&lt;li&gt;Explore seasonal trends or external factors affecting Outerwear sales.&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;4. Units Sold Comparison&lt;/strong&gt; &lt;em&gt;(Bar Chart)&lt;/em&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fskj8ee00ysnizsvxtcc8.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fskj8ee00ysnizsvxtcc8.jpeg" alt="Units Sold Comparison" width="531" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Insights:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Lowest Sales Volume: Outerwear has the fewest units sold at 324, significantly lower than other categories.&lt;/li&gt;
&lt;li&gt;Sales Gap: There's a notable 1,413-unit gap between Outerwear and the highest-selling category, Clothing.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Business Implications:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The low sales in Outerwear may indicate market saturation, competitor dominance, or customer disinterest.&lt;/li&gt;
&lt;li&gt;Consider a market analysis to understand the underlying causes and explore strategies to boost Outerwear sales.&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;5. Rating Distribution&lt;/strong&gt; &lt;em&gt;(Histogram)&lt;/em&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhoweecl7ntmr9pfjgm25.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhoweecl7ntmr9pfjgm25.jpeg" alt="Rating Distribution" width="531" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The histogram reveals the distribution of customer ratings for the Outerwear category, which is crucial for understanding customer satisfaction and identifying areas for improvement.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Trends:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The most frequent rating bin is 3.25-3.5, indicating a central tendency towards average satisfaction.&lt;/li&gt;
&lt;li&gt;Ratings between 3.25-3.5 and 3.75-4.0 have the highest number of reviews, suggesting a significant portion of customers are moderately satisfied.&lt;/li&gt;
&lt;li&gt;There is a noticeable drop in the number of reviews for ratings below 3.0, implying fewer customers are highly dissatisfied.&lt;/li&gt;
&lt;li&gt;The lowest rating bin (2.5-2.75) still has a considerable number of reviews (379), indicating room for improvement in product quality or customer experience.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Business Implications:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Focus on enhancing products or services to move the average ratings upwards, especially targeting the 2.5-2.75 range.&lt;/li&gt;
&lt;li&gt;Investigate the causes behind the moderate satisfaction levels in the 3.25-3.5 range to identify specific pain points or areas for enhancement.&lt;/li&gt;
&lt;li&gt;Leverage the high number of reviews in the 3.75-4.0 range to gather insights on what customers appreciate most, and amplify those aspects in marketing and product development.&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;6. Discount Penetration by Category&lt;/strong&gt; &lt;em&gt;(Bar Chart)&lt;/em&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd0fuo4azbo6jleuey4vy.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd0fuo4azbo6jleuey4vy.jpeg" alt="Discount Penetration by Category" width="531" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Outerwear leads all categories in discount penetration, reflecting strong customer response to promotions but also a heavy reliance on markdowns.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Insights:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Outerwear has the highest discount penetration at 44.44%, indicating a robust customer response to discounts in this category.&lt;/li&gt;
&lt;li&gt;Accessories and Footwear follow with discount penetrations of 43.79% and 43.24% respectively, showing consistent customer interest.&lt;/li&gt;
&lt;li&gt;Clothing has the lowest penetration at 42.08%, suggesting potential room for optimization in discount strategies.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Business Implications:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Focus on maintaining and enhancing discount strategies for Outerwear to capitalize on high customer engagement.&lt;/li&gt;
&lt;li&gt;Investigate why Clothing has lower discount penetration and explore opportunities to increase its appeal through targeted promotions or product improvements.&lt;/li&gt;
&lt;li&gt;Monitor trends in Accessories and Footwear to ensure continued customer interest and adjust strategies as needed.&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;7. Customer Loyalty Metrics by Category&lt;/strong&gt; &lt;em&gt;(Stacked Bar)&lt;/em&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs3u846hfekvjqznhasor.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs3u846hfekvjqznhasor.jpeg" alt="Customer Loyalty Metrics by Category" width="531" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Transaction Insights:&lt;/strong&gt;&lt;br&gt;
The Outerwear category shows a total of 324 transactions (233 non-subscribers + 91 subscribers).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Trends:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Non-subscribers dominate with 233 transactions, significantly higher than subscribers.&lt;/li&gt;
&lt;li&gt;Subscribers contribute 91 transactions, indicating a smaller but present loyal customer base.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Business Implications:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The lower transaction count among subscribers points to untapped potential for growing customer loyalty in Outerwear.&lt;/li&gt;
&lt;li&gt;Focus on converting non-subscribers to subscribers could increase overall transaction volume.&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;8. Outerwear Revenue by Season&lt;/strong&gt; &lt;em&gt;(Line Chart)&lt;/em&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxnd3wp1j8cmv2swfhv72.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxnd3wp1j8cmv2swfhv72.jpeg" alt="Outerwear Revenue by Season" width="531" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why It Matters:&lt;/strong&gt;&lt;br&gt;
Understanding seasonal revenue trends in the Outerwear category is crucial for aligning inventory, marketing efforts, and customer engagement strategies.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Trends:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Spring: High revenue at $9,749, indicating strong demand.&lt;/li&gt;
&lt;li&gt;Summer: Lowest revenue at $7,449, suggesting reduced need for outerwear.&lt;/li&gt;
&lt;li&gt;Fall: Peak season with revenue at $9,778, the highest point.&lt;/li&gt;
&lt;li&gt;Winter: Slight drop from Fall, revenue at $9,777.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Business Implications:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Focus marketing and promotions in Fall to capitalize on peak demand.&lt;/li&gt;
&lt;li&gt;Consider inventory adjustments for Summer to align with lower demand.&lt;/li&gt;
&lt;li&gt;Analyze customer behavior in Spring to replicate successful strategies in other seasons.&lt;/li&gt;
&lt;/ul&gt;
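A simple seasonality index (each season's revenue divided by the four-season average) makes the Summer dip and the Fall peak explicit. A sketch built from the revenue figures above:

```python
# Seasonality index: season revenue / average season revenue.
# An index above 1.0 means the season over-performs the yearly average.
revenue = {"Spring": 9749, "Summer": 7449, "Fall": 9778, "Winter": 9777}

avg = sum(revenue.values()) / len(revenue)  # 36,753 / 4 = 9,188.25
index = {season: r / avg for season, r in revenue.items()}

for season, idx in index.items():
    print(f"{season:<7} {idx:.3f}")
```

Spring, Fall, and Winter all land just above 1.0 while Summer falls well below, which is the quantitative form of the seasonal-dependency claim in H1.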




&lt;h3&gt;
  
  
  &lt;strong&gt;9. Outerwear Transactions by Season&lt;/strong&gt; &lt;em&gt;(Line Chart)&lt;/em&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2wf4eu13tmek7rw9xjz7.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2wf4eu13tmek7rw9xjz7.jpeg" alt="Outerwear Transactions by Season" width="531" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why It Matters:&lt;/strong&gt;&lt;br&gt;
Understanding seasonal trends in outerwear transactions helps in aligning inventory, marketing efforts, and customer engagement strategies to maximize sales and customer satisfaction.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Trends:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Winter Peak: Outerwear transactions peak in Winter with 170 transactions, indicating high demand during this season.&lt;/li&gt;
&lt;li&gt;Spring High: Spring follows closely with 169 transactions, suggesting strong seasonal demand.&lt;/li&gt;
&lt;li&gt;Lowest in Summer: Summer sees the lowest transaction count at 134, reflecting reduced need for outerwear.&lt;/li&gt;
&lt;li&gt;Fall Resurgence: Fall shows a resurgence with 166 transactions, close to Spring levels.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Business Implications:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Focus marketing and promotional efforts on Winter and Spring to capitalize on high transaction periods.&lt;/li&gt;
&lt;li&gt;Consider inventory adjustments to ensure sufficient stock during peak seasons while managing lower demand in Summer.&lt;/li&gt;
&lt;li&gt;Explore strategies to boost sales in Summer, such as promoting lightweight outerwear or transitional pieces.&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;10. Discount Impact on Outerwear Revenue&lt;/strong&gt; &lt;em&gt;(Scatter Plot)&lt;/em&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fedhrg3gw452v2k9r6uyn.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fedhrg3gw452v2k9r6uyn.jpeg" alt="Discount Impact on Outerwear Revenue" width="531" height="521"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why This Matters:&lt;/strong&gt;&lt;br&gt;
Understanding how discounts affect outerwear revenue is crucial for optimizing sales strategies and maximizing profit margins.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Trends:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Fall shows the highest revenue at $9,778 with a discount penetration of 45.18%.&lt;/li&gt;
&lt;li&gt;Summer has the highest discount penetration at 50% but the lowest revenue at $7,449.&lt;/li&gt;
&lt;li&gt;Spring and Winter show similar revenue figures around $9,750 with discount penetrations of 40.24% and 47.65%, respectively.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Trendline Analysis:&lt;/strong&gt;&lt;br&gt;
The trendline equation y = -185.25x + 17666.75 indicates a negative correlation between discount penetration and revenue. As discount penetration increases, revenue tends to decrease.&lt;/p&gt;
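&lt;p&gt;As a quick sanity check, the trendline can be evaluated directly. A minimal Python sketch using only the coefficients quoted above (illustrative, not part of the reporting pipeline):&lt;/p&gt;

```python
# Trendline from the scatter plot: y = -185.25x + 17666.75,
# where x is discount penetration (%) and y is predicted revenue (USD).
def predict_revenue(penetration_pct):
    return -185.25 * penetration_pct + 17666.75

# Each extra point of discount penetration costs about 185 USD of revenue.
for x in (40.24, 45.18, 47.65, 50.0):
    print(x, "percent penetration:", round(predict_revenue(x), 2), "USD")
```

&lt;p&gt;At Summer's 50% penetration the line predicts roughly 8,404 USD, the lowest fitted value of the four seasons - consistent with Summer posting the lowest observed revenue.&lt;/p&gt;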

&lt;p&gt;&lt;strong&gt;Business Implications:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Higher discounts do not necessarily lead to higher revenue, as seen in Summer.&lt;/li&gt;
&lt;li&gt;Moderate discount levels in Fall and Winter correlate with peak revenue.&lt;/li&gt;
&lt;li&gt;Consider reducing discount levels in Summer to potentially increase revenue.&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;11. Customer Segmentation by Season&lt;/strong&gt; &lt;em&gt;(Clustered Bar)&lt;/em&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fttntdqf0bmc3bsk8qso4.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fttntdqf0bmc3bsk8qso4.jpeg" alt="Customer Segmentation by Season" width="531" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why This Matters:&lt;/strong&gt;&lt;br&gt;
Understanding customer behavior across different seasons helps tailor marketing strategies and inventory planning for the Outerwear category.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Trends:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Loyal Customers Drive Transactions: The 'Loyal' segment consistently shows higher transaction volumes compared to the 'Active' segment across all seasons.&lt;/li&gt;
&lt;li&gt;Seasonal Variance:

&lt;ul&gt;
&lt;li&gt;'Loyal' customers peak in Spring with 133 transactions and maintain high levels in Fall (129) and Winter (132).&lt;/li&gt;
&lt;li&gt;'Active' customers show less variance, ranging from 31 transactions in Summer to 38 in Winter.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Summer Low for Loyal Customers: The 'Loyal' segment drops to 103 transactions in Summer, indicating a potential area for engagement strategies.&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Business Implications:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Focus retention strategies on the 'Loyal' segment to maintain high transaction volumes.&lt;/li&gt;
&lt;li&gt;Investigate why 'Loyal' customers drop in Summer and develop targeted campaigns to re-engage them during this period.&lt;/li&gt;
&lt;li&gt;Consider inventory adjustments to align with the high demand from 'Loyal' customers in Fall, Spring, and Winter.&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;12. Price Distribution&lt;/strong&gt; &lt;em&gt;(Histogram)&lt;/em&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd8hukc93imf9y3qty4jb.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd8hukc93imf9y3qty4jb.jpeg" alt="Price Distribution" width="531" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why This Matters:&lt;/strong&gt;&lt;br&gt;
The price histogram shows that customers strongly prefer lower price ranges, which should inform pricing and assortment decisions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Trends:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The most frequent price bin is 20.0-30.0 USD, with 112 transactions. This indicates a strong preference for lower-priced outerwear.&lt;/li&gt;
&lt;li&gt;The number of transactions decreases as the price increases from 20.0-30.0 USD to 70.0-80.0 USD.&lt;/li&gt;
&lt;li&gt;There is a slight uptick in transactions in the 80.0-90.0 USD range, suggesting a segment of customers willing to pay a bit more.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Business Implications:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Focus marketing efforts on the 20.0-30.0 USD range to capture the largest customer segment.&lt;/li&gt;
&lt;li&gt;Consider promotional strategies for the 80.0-90.0 USD range to leverage the observed interest.&lt;/li&gt;
&lt;li&gt;Analyze the 70.0-80.0 USD range to understand the drop-off and adjust pricing or product offerings accordingly.&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;13. Size and Color Distribution&lt;/strong&gt; &lt;em&gt;(Stacked Bar)&lt;/em&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwlvtru1x23qs3ug3wwfj.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwlvtru1x23qs3ug3wwfj.jpeg" alt="Size and Color Distribution" width="531" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why This Matters:&lt;/strong&gt;&lt;br&gt;
Understanding the distribution of outerwear sizes and colors helps tailor inventory and marketing strategies to meet customer preferences effectively.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Trends:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Medium Size Dominance: Size M has the highest counts across almost all colors, indicating a strong preference for this size.&lt;/li&gt;
&lt;li&gt;Popular Colors: Beige, Blue, Brown, and Gray are consistently popular across all sizes, with notable peaks in Size M.&lt;/li&gt;
&lt;li&gt;Size L Insights: Size L shows a varied distribution with Cyan and Brown leading, suggesting a niche market within larger sizes.&lt;/li&gt;
&lt;li&gt;Size S Observations: Size S has lower overall counts, with Beige and Olive standing out, hinting at specific customer segments.&lt;/li&gt;
&lt;li&gt;Size XL Trends: Size XL has the lowest counts, with Cyan and Lavender showing slight preference, indicating limited demand.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Business Implications:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Focus inventory replenishment on Size M, especially in popular colors like Beige, Blue, and Gray.&lt;/li&gt;
&lt;li&gt;Consider targeted marketing campaigns for Size L, highlighting popular colors like Cyan and Brown.&lt;/li&gt;
&lt;li&gt;Evaluate the need for Size S and XL, possibly reducing stock for less popular colors to optimize inventory.&lt;/li&gt;
&lt;li&gt;Explore customer feedback for Size S and XL to understand specific needs and preferences.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;6. Short Conclusion &amp;amp; Prioritized Recommendations&lt;/strong&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Key Takeaways&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Outerwear's &lt;strong&gt;low profitability&lt;/strong&gt; stems from high discount usage, narrow SKU range, and weak customer retention.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Seasonality is confirmed&lt;/strong&gt; but not optimized — demand surges in Fall/Winter, yet margins erode.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Moderate but declining ratings&lt;/strong&gt; suggest emerging product or expectation alignment issues.&lt;/li&gt;
&lt;li&gt;Data indicates &lt;strong&gt;assortment imbalance&lt;/strong&gt; (overindexing in size M and Cyan color), limiting growth potential.&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;Recommendations by Function&lt;/strong&gt;
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Function&lt;/th&gt;
&lt;th&gt;Action Priority&lt;/th&gt;
&lt;th&gt;Recommendation&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Merchandising&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;🔹 High&lt;/td&gt;
&lt;td&gt;Diversify Outerwear SKUs — introduce extended sizes (S, XL), rebalance color range, and add transitional products for off-seasons.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Marketing&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;🔹 Medium&lt;/td&gt;
&lt;td&gt;Shift communication from discount-heavy promos to "durability and design" value messaging; launch a Summer lightweight outerwear campaign.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Finance&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;🔹 High&lt;/td&gt;
&lt;td&gt;Optimize discount policy — cap average discount &amp;lt;35%, test bundle promos instead of direct markdowns to protect margin.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;CRM / Loyalty&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;🔹 Medium&lt;/td&gt;
&lt;td&gt;Introduce a loyalty reward or seasonal bundle subscription for outerwear customers to improve repeat purchase rates.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;One-Line Executive Summary&lt;/strong&gt;
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;"Outerwear's performance problem is not demand shortage but &lt;em&gt;value perception and assortment imbalance&lt;/em&gt; — by optimizing variety, discount structure, and retention strategy, the category can regain profitability and year-round engagement."&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>datascience</category>
      <category>management</category>
      <category>showcase</category>
    </item>
    <item>
      <title>Synthetic Data Generator</title>
      <dc:creator>Oliver Samuel</dc:creator>
      <pubDate>Mon, 10 Nov 2025 10:55:35 +0000</pubDate>
      <link>https://open.forem.com/oliver_samuel_028c6f65ad6/building-a-synthetic-data-generator-from-concept-to-reality-63m</link>
      <guid>https://open.forem.com/oliver_samuel_028c6f65ad6/building-a-synthetic-data-generator-from-concept-to-reality-63m</guid>
      <description>&lt;p&gt;&lt;em&gt;By Oliver | November 7, 2025&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The Problem: Why We Need Fake Data That Feels Real
&lt;/h2&gt;

&lt;p&gt;Imagine you're building a new mobile app for a bank. Before launching it to real customers, you need to test it thoroughly. But here's the catch: you can't use real customer data for testing - that would be a privacy nightmare and potentially illegal. You also can't just make up random numbers and names because your app needs to handle realistic scenarios.&lt;/p&gt;

&lt;p&gt;This is where &lt;strong&gt;synthetic data&lt;/strong&gt; comes in. It's like having a movie set instead of a real location. Everything looks authentic, but it's all carefully constructed and completely safe to use.&lt;/p&gt;

&lt;p&gt;That's exactly what I built: &lt;strong&gt;DataGen&lt;/strong&gt; - a Python library that creates realistic synthetic datasets at the click of a button.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is DataGen?
&lt;/h2&gt;

&lt;p&gt;Think of DataGen as a digital factory for fake-but-realistic data. Just like a toy factory can produce thousands of identical toys, DataGen can generate thousands of realistic user profiles, salary records, regional information, and vehicle data. All completely synthetic but statistically accurate.&lt;/p&gt;

&lt;p&gt;Here's another analogy: If you've ever used a flight simulator to practice flying without risking a real plane, DataGen does the same thing for data. It gives you realistic practice data without any privacy concerns or legal complications.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Four Data Generators: My Digital Assembly Lines
&lt;/h2&gt;

&lt;p&gt;DataGen consists of four specialized "assembly lines", each producing a different type of data:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Profile Generator: Creating Digital People&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The Profile Generator creates realistic user profiles - complete with names, emails, addresses, and even geographic coordinates.&lt;br&gt;
It's like having a character generator for a video game, but instead of fantasy characters, you get realistic Kenyan citizens.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What it generates:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Full names (first and last)&lt;/li&gt;
&lt;li&gt;Email addresses and usernames&lt;/li&gt;
&lt;li&gt;Phone numbers&lt;/li&gt;
&lt;li&gt;Complete addresses (street, city, postal code)&lt;/li&gt;
&lt;li&gt;Age and date of birth&lt;/li&gt;
&lt;li&gt;Gender identity&lt;/li&gt;
&lt;li&gt;Geographic coordinates (latitude and longitude)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Real-world use case:&lt;/strong&gt; A fintech startup testing their loan application system can generate 10,000 realistic customer profiles in seconds, ensuring their system handles Kenyan names, addresses, and phone formats correctly.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuk61hz45q8u9mn8cmpfg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuk61hz45q8u9mn8cmpfg.png" alt="Profile Generation Output" width="800" height="412"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvbt7ei7o5tswy8bzq1wb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvbt7ei7o5tswy8bzq1wb.png" alt="Profile Generation Output" width="800" height="412"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Profile Generation Output: a table of generated profiles with names, emails, cities, and ages&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Salary Generator: Modeling Compensation Data&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The Salary Generator creates realistic employment and compensation records across different industries and experience levels. Think of it as a salary survey simulator that understands how compensation works in the real world.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What it generates:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Job titles across 8 departments (Engineering, Product, Data, Marketing, Sales, Operations, Finance, HR)&lt;/li&gt;
&lt;li&gt;Experience levels (from Junior to C-Level executives)&lt;/li&gt;
&lt;li&gt;Base salary, bonuses, and total compensation&lt;/li&gt;
&lt;li&gt;Years of experience aligned with job level&lt;/li&gt;
&lt;li&gt;Currency support (Kenyan Shillings and US Dollars)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The intelligence behind it:&lt;/strong&gt; The generator knows that a Senior Software Engineer should earn more than a Junior one, and that C-Level executives typically have 20+ years of experience. It's not just random numbers - it's statistically realistic.&lt;/p&gt;
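&lt;p&gt;That level-aware logic can be sketched roughly as follows. This is an illustrative toy, not DataGen's actual implementation - the levels, year ranges, and pay bands in LEVEL_RULES are my own assumptions:&lt;/p&gt;

```python
import random

# Hypothetical rules: level maps to (min_years, max_years, (min_pay, max_pay)).
# All numbers below are illustrative assumptions, not DataGen's real tables.
LEVEL_RULES = {
    "Junior":  (0, 3,  (30_000, 50_000)),
    "Senior":  (5, 10, (70_000, 120_000)),
    "C-Level": (20, 35, (180_000, 300_000)),
}

def sample_salary(level, rng):
    min_years, max_years, (min_pay, max_pay) = LEVEL_RULES[level]
    years = rng.randint(min_years, max_years)      # experience fits the level
    base = rng.randint(min_pay, max_pay)           # pay band fits the level
    bonus = round(base * rng.uniform(0.05, 0.25))  # bonus scales with base
    return {"level": level, "years": years,
            "base": base, "total_compensation": base + bonus}

print(sample_salary("Senior", random.Random(42)))
```

&lt;p&gt;Sampling within level-specific bounds is what keeps a Senior record from showing 2 (or 30) years of experience.&lt;/p&gt;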

&lt;p&gt;&lt;strong&gt;Real-world use case:&lt;/strong&gt; An HR analytics platform can test their salary benchmarking features with realistic compensation data across different industries and experience levels.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2jzvx95xze6kb2q3uamp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2jzvx95xze6kb2q3uamp.png" alt="Salary Analysis" width="800" height="417"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Salary Analysis: salary distribution by department and level, with summary statistics&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Region Generator: Mapping the World&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The Region Generator creates global organizational data - perfect for companies with international operations. It's like having a world atlas combined with an organizational chart.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What it generates:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Six major global regions (North America, South America, Europe, Middle East, Africa, Asia Pacific)&lt;/li&gt;
&lt;li&gt;Countries within each region&lt;/li&gt;
&lt;li&gt;Time zones&lt;/li&gt;
&lt;li&gt;Regional headquarters locations&lt;/li&gt;
&lt;li&gt;Regional managers with contact information&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Real-world use case:&lt;/strong&gt; A multinational company testing their global CRM system can simulate operations across all continents with realistic regional structures.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz8yg76kfsd0jc53dj89a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz8yg76kfsd0jc53dj89a.png" alt="Region Data" width="800" height="417"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Region Data Table: all regions with their headquarters and country counts&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Car Generator: Building a Virtual Showroom&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The Car Generator creates vehicle inventory data focused on the Kenyan automotive market. It's like having a digital car dealership that understands local market preferences.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What it generates:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Popular makes and models in Kenya (Toyota, Nissan, Mazda, etc.)&lt;/li&gt;
&lt;li&gt;Manufacturing years (2008-2025)&lt;/li&gt;
&lt;li&gt;Colors, transmission types, and fuel types&lt;/li&gt;
&lt;li&gt;Realistic pricing in Kenyan Shillings&lt;/li&gt;
&lt;li&gt;Dealer locations across major Kenyan cities&lt;/li&gt;
&lt;li&gt;Age-based depreciation modeling&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The smart part:&lt;/strong&gt; The generator knows that a 2025 Toyota Corolla should cost more than a 2010 model, and it applies realistic depreciation curves.&lt;/p&gt;
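&lt;p&gt;The depreciation idea can be captured with a simple exponential curve. A sketch (the 15% yearly rate and the base price are my assumptions, not DataGen's internal numbers):&lt;/p&gt;

```python
# Illustrative age-based depreciation: value drops about 15% per year.
DEPRECIATION_RATE = 0.15

def estimate_price(new_price_kes, year, current_year=2025):
    age = max(0, current_year - year)
    return round(new_price_kes * (1 - DEPRECIATION_RATE) ** age)

# A newer model always prices above an older one of the same make.
for year in (2025, 2020, 2010):
    print(year, estimate_price(3_000_000, year), "KES")
```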

&lt;p&gt;&lt;strong&gt;Real-world use case:&lt;/strong&gt; An automotive marketplace app can test its search, filtering, and pricing features with thousands of realistic vehicle listings.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjzpr6pze6tg9jvcfkhvf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjzpr6pze6tg9jvcfkhvf.png" alt="Car Inventory" width="800" height="417"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Car Inventory: a sample of generated cars with makes, models, years, and prices&lt;/em&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  The Magic Ingredient: Reproducibility
&lt;/h2&gt;

&lt;p&gt;Here's something crucial that makes DataGen special: &lt;strong&gt;reproducibility&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Imagine you're baking cookies. If you follow the exact same recipe with the exact same measurements, you'll get identical cookies every time. DataGen works the same way through something called a "seed".&lt;/p&gt;

&lt;p&gt;When you set a seed (say, seed=42), DataGen will generate the exact same data every single time. This is incredibly important for:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;- Testing:&lt;/strong&gt; Developers can reproduce bugs by using the same seed&lt;br&gt;
&lt;strong&gt;- Collaboration:&lt;/strong&gt; Team members can work with identical datasets&lt;br&gt;
&lt;strong&gt;- Validation:&lt;/strong&gt; You can verify that your system produces consistent results&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Analogy&lt;/strong&gt;: Think of the seed as a recipe number. Recipe #42 always makes chocolate chip cookies, Recipe #106 always makes oatmeal cookies. The same recipe number = same cookies, every time.&lt;/p&gt;
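&lt;p&gt;The mechanism is the same one Python's standard random module uses. A tiny self-contained sketch of the idea (toy name lists, not DataGen's code):&lt;/p&gt;

```python
import random

def make_names(n, seed):
    # The same seed produces the same sequence of picks, every time.
    rng = random.Random(seed)
    first = ["Sharon", "Kennedy", "Brian", "Faith"]
    last = ["Mohamed", "Atieno", "Mwangi", "Wanjiku"]
    return [rng.choice(first) + " " + rng.choice(last) for _ in range(n)]

print(make_names(3, seed=42))
print(make_names(3, seed=42) == make_names(3, seed=42))  # True: same recipe
print(make_names(3, seed=42) == make_names(3, seed=7))   # almost surely False
```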
&lt;h2&gt;
  
  
  From Code to Package: The Publishing Journey
&lt;/h2&gt;

&lt;p&gt;Creating the generators was just the first step. To make DataGen useful to the world, I had to package it and publish it to PyPI (the Python Package Index) - think of it as the &lt;strong&gt;App Store for Python libraries&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Now, anyone in the world can install DataGen with a few standard commands:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Go to a new folder (like /Applications)&lt;/span&gt;
&lt;span class="nb"&gt;cd &lt;/span&gt;Applications

&lt;span class="c"&gt;# Create a brand new, clean environment&lt;/span&gt;
python3 &lt;span class="nt"&gt;-m&lt;/span&gt; venv datagen_venv

&lt;span class="c"&gt;# Activate the virtual environment&lt;/span&gt;
&lt;span class="nb"&gt;source &lt;/span&gt;datagen_venv/bin/activate

&lt;span class="c"&gt;# Run the standard install command&lt;/span&gt;
pip &lt;span class="nb"&gt;install &lt;/span&gt;sami-datagen

&lt;span class="c"&gt;# A sample try&lt;/span&gt;
python &lt;span class="nt"&gt;-c&lt;/span&gt; &lt;span class="s2"&gt;"from datagen import generate_profiles
profiles = generate_profiles(n=10, seed=42)
print(profiles)"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It's like making your homemade recipe available in every grocery store worldwide.&lt;/p&gt;

&lt;h2&gt;
  
  
  Real-world Impact: Who Benefits?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Software Developers&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Testing applications without risking real user data. It's like having crash test dummies instead of real people for car safety tests.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Data Scientists&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Training machine learning models on synthetic data before deploying to production. Think of it as practicing surgery on cadavers before operating on real patients.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Business Analysts&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Creating demo dashboards and presentations without exposing sensitive company data. Like using a model home to show buyers what their houses could look like.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Students and Educators&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Learning data analysis and database design with realistic datasets. It's like using a flight simulator in pilot training - safe, repeatable, and realistic.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Startups&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Building and demonstrating MVPs (Minimum Viable Products) without collecting real user data. Like creating a movie trailer before filming the entire movie.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Technical Foundation
&lt;/h2&gt;

&lt;p&gt;For those curious about how it works under the hood:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;DataGen uses:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;- Faker library:&lt;/strong&gt; Generates realistic names, addresses, and contact information&lt;br&gt;
&lt;strong&gt;- Pandas:&lt;/strong&gt; Organizes data into structured tables (like Excel spreadsheets)&lt;br&gt;
&lt;strong&gt;- Statistical modeling:&lt;/strong&gt; Ensures salary ranges, age distributions, and pricing follow realistic patterns&lt;br&gt;
&lt;strong&gt;- Localization:&lt;/strong&gt; Understands Kenyan naming conventions, cities, and market preferences &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Analogy:&lt;/strong&gt; If DataGen were a restaurant, Faker would be the ingredient supplier, Pandas would be the kitchen organization system, and statistical modeling would be the chef's knowledge of how flavors work together.&lt;/p&gt;
&lt;h2&gt;
  
  
  Practical Examples: See It In Action
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Example 1: Generate 100 user profiles&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;datagen&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;generate_profiles&lt;/span&gt;

&lt;span class="n"&gt;profiles&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;generate_profiles&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;seed&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;profiles&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;head&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Output&lt;/strong&gt;: A table with 100 realistic Kenyan user profiles, complete with names like “Sharon Mohamed” from Nairobi and “Kennedy Atieno” from Mombasa, each with unique emails, addresses, and coordinates.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flo9nbu8e21fm3c66qrs4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flo9nbu8e21fm3c66qrs4.png" alt="Code Example Output" width="800" height="412"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F27ionc3qxyn7ggkrcfpw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F27ionc3qxyn7ggkrcfpw.png" alt="Terminal Output" width="800" height="598"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Code Example Output: the actual output from running this code&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example 2: Analyze Salary Distribution&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;datagen&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;generate_salaries&lt;/span&gt;

&lt;span class="n"&gt;salaries&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;generate_salaries&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;avg_by_dept&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;salaries&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;groupby&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;department&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;total_compensation&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;avg_by_dept&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Output&lt;/strong&gt;: Average compensation by department, showing that Engineering and Data departments typically have higher compensation than Operations or HR.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb4xf6a7row796y5dc9fu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb4xf6a7row796y5dc9fu.png" alt="Salary Analysis Results" width="800" height="598"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Salary Analysis Results: the grouped statistics by department&lt;/em&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Beyond the Code: Docker Support
&lt;/h2&gt;

&lt;p&gt;For those who want to use DataGen without installing anything on their computer, I included &lt;strong&gt;Docker support&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What's Docker?&lt;/strong&gt; Think of it as a &lt;strong&gt;portable computer inside your computer&lt;/strong&gt;. It's like having a fully equipped kitchen (with all the tools and ingredients) that you can set up anywhere in seconds.&lt;/p&gt;

&lt;p&gt;With Docker you can:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Download the DataGen container&lt;/li&gt;
&lt;li&gt;Start it with one command&lt;/li&gt;
&lt;li&gt;Generate data immediately - no installation, no configuration&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhsrs8fu8oqszr49f2ff8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhsrs8fu8oqszr49f2ff8.png" alt="Docker setup" width="800" height="594"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Docker Setup: the docker-compose command and the running container&lt;/em&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  The Documentation Journey
&lt;/h2&gt;

&lt;p&gt;Creating the library was only three-quarters of the battle. Making it usable required comprehensive documentation:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;README.md:&lt;/strong&gt; A guide covering installation, usage, and examples&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Example Scripts:&lt;/strong&gt; Five Python scripts demonstrating each generator&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Inline Documentation:&lt;/strong&gt; Every function has detailed explanations&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;API Reference:&lt;/strong&gt; Complete parameter descriptions and return types&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Analogy:&lt;/strong&gt; It's like buying furniture from IKEA - the product is great, but without clear instructions (with pictures), it's just a pile of wood and screws.&lt;/p&gt;
&lt;h2&gt;
  
  
  Challenges and Solutions
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Challenge 1: Making Data Feel "Real"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution:&lt;/strong&gt; Instead of purely random generation, I implemented statistical models. For example, Senior Engineers have 5-10 years of experience, not 2 years or 30 years.&lt;/p&gt;
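&lt;p&gt;The constraint behind this fix can be expressed as a small validity check (the bounds are illustrative assumptions, mirroring the 5-10 year example above):&lt;/p&gt;

```python
# Experience must fall inside the band for its level (illustrative bounds).
EXPERIENCE_BOUNDS = {"Junior": (0, 3), "Senior": (5, 10), "C-Level": (20, 40)}

def is_plausible(level, years):
    low, high = EXPERIENCE_BOUNDS[level]
    return years in range(low, high + 1)

print(is_plausible("Senior", 7))   # True
print(is_plausible("Senior", 2))   # False
```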

&lt;p&gt;&lt;strong&gt;Challenge 2: Kenyan Localization&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution:&lt;/strong&gt; Researched and included actual Kenyan cities, realistic coordinate boundaries, and local naming patterns. The data doesn't just look real - it looks &lt;em&gt;Kenyan&lt;/em&gt; real.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Challenge 3: Reproducibility&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution:&lt;/strong&gt; Implemented seed-based generation, ensuring that seed=42 always produces identical results, making debugging and testing possible.&lt;/p&gt;
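&lt;p&gt;As a minimal sketch of the idea (the function name and parameters here are hypothetical, not DataGen's actual API), seeding a dedicated random generator makes every run reproducible:&lt;/p&gt;

```python
import random

def generate_ages(n, seed=42, lo=25, hi=35):
    """Hypothetical generator: the same seed always yields the same ages."""
    rng = random.Random(seed)  # isolated, seeded RNG - no global state
    return [rng.randint(lo, hi) for _ in range(n)]

run1 = generate_ages(5, seed=42)
run2 = generate_ages(5, seed=42)
assert run1 == run2  # seed=42 always produces identical results
```

This is what makes debugging practical: a bug report can include the seed, and you can regenerate the exact dataset that triggered it.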
&lt;h2&gt;
  
  
  The Results: By The Numbers
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;4 specialized generators covering different data types&lt;/li&gt;
&lt;li&gt;60+ job titles across 8 departments&lt;/li&gt;
&lt;li&gt;10 experience levels from Junior to C-Level&lt;/li&gt;
&lt;li&gt;6 global regions covering 36 countries&lt;/li&gt;
&lt;li&gt;10 popular car makes with realistic pricing&lt;/li&gt;
&lt;li&gt;100% reproducibility with seed control&lt;/li&gt;
&lt;li&gt;Published on PyPI - accessible worldwide&lt;/li&gt;
&lt;li&gt;Docker support for zero-installation usage&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxrbtdhr7mm0dr4irza2k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxrbtdhr7mm0dr4irza2k.png" alt="Complete Demo Output" width="800" height="405"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fho3a6tqxphz22aazjpv5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fho3a6tqxphz22aazjpv5.png" alt="Complete Demo Output" width="800" height="405"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Complete Demo Output - Show the final output from running complete_demo.py with all statistics&lt;/em&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  What's Next?
&lt;/h2&gt;

&lt;p&gt;DataGen is just the beginning. Future enhancements could include:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;- More data types:&lt;/strong&gt; Transaction records, event logs, social media posts&lt;br&gt;
&lt;strong&gt;- Relationship modeling:&lt;/strong&gt; Connecting profiles to their salaries and purchases&lt;br&gt;
&lt;strong&gt;- Time-series data:&lt;/strong&gt; Stock prices, sensor readings, website traffic&lt;br&gt;
&lt;strong&gt;- Custom templates:&lt;/strong&gt; Industry-specific data patterns&lt;br&gt;
&lt;strong&gt;- Web interface:&lt;/strong&gt; Generate data without writing code&lt;/p&gt;
&lt;h2&gt;
  
  
  Try it Yourself
&lt;/h2&gt;

&lt;p&gt;Want to explore DataGen? Here's how:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For technical users:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;sami-datagen
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;For Everyone Else:&lt;/strong&gt; Visit the GitHub repository at &lt;a href="https://github.com/25thOliver/Datagen" rel="noopener noreferrer"&gt;GitHub repo&lt;/a&gt; where you'll find:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Complete installation instructions&lt;/li&gt;
&lt;li&gt;Step-by-step tutorials&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Building DataGen taught me that great tools aren't just about functionality - they're about &lt;strong&gt;accessibility&lt;/strong&gt;. The best technology is technology that anyone can use, understand, and benefit from.&lt;/p&gt;

&lt;p&gt;Whether you're a developer testing an app, a student learning data science, or a business professional creating a demo, DataGen provides the realistic data you need, when you need it, without compromise.&lt;/p&gt;

&lt;p&gt;The code is open source, the documentation is comprehensive, and the possibilities are endless.&lt;/p&gt;

&lt;h2&gt;
  
  
  About the Author
&lt;/h2&gt;

&lt;p&gt;Oliver is a data engineer who's passionate about building tools that make technology more accessible. This project was completed as part of the LuxDevHQ Data Engineering Internship program.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Connect:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;GitHub: &lt;a href="https://github.com/25thOliver" rel="noopener noreferrer"&gt;@25thOliver&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;LinkedIn: &lt;a href="https://www.linkedin.com/in/samwel-oliver/" rel="noopener noreferrer"&gt;Samwel Oliver&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>privacy</category>
      <category>testing</category>
      <category>tooling</category>
    </item>
    <item>
      <title>Understanding Kafka Lag: Why It Happens and How to Fix It</title>
      <dc:creator>Oliver Samuel</dc:creator>
      <pubDate>Mon, 10 Nov 2025 05:01:40 +0000</pubDate>
      <link>https://open.forem.com/oliver_samuel_028c6f65ad6/understanding-kafka-lag-why-it-happens-and-how-to-fix-it-5a6k</link>
      <guid>https://open.forem.com/oliver_samuel_028c6f65ad6/understanding-kafka-lag-why-it-happens-and-how-to-fix-it-5a6k</guid>
      <description>&lt;p&gt;In today's fast-paced digital world, data is constantly being created. We stream movies, make online purchases, and track shipments, in real-time. Behind the scenes, a powerful technology called &lt;strong&gt;Apache Kafka&lt;/strong&gt; often acts as the central nervous system, managing this massive flow of information.&lt;/p&gt;

&lt;p&gt;Imagine a busy restaurant kitchen during dinner rush. Orders are coming in faster than the chefs can prepare them. The tickets start piling up on the counter, and customers wait longer for their meals. This growing backlog of uncooked orders is essentially what we call "Kafka Lag" in the world of data streaming.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is Kafka Lag?
&lt;/h2&gt;

&lt;p&gt;Kafka is a messaging system that helps different parts of software applications communicate with each other. Think of it as a sophisticated postal service for digital information. When one part of your system (the producer) sends messages faster than another part (the consumer) can process them, a backlog forms. This backlog is called "lag."&lt;/p&gt;

&lt;p&gt;In simple terms: &lt;em&gt;Kafka Lag is the difference between how many messages have been sent and how many have been successfully processed.&lt;/em&gt;&lt;/p&gt;
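&lt;p&gt;To make that definition concrete, lag is computed per partition as the gap between two offsets. A minimal sketch (the offset values are illustrative):&lt;/p&gt;

```python
def consumer_lag(latest_offsets, committed_offsets):
    """Lag per partition = messages produced minus messages consumed."""
    return {p: latest_offsets[p] - committed_offsets.get(p, 0)
            for p in latest_offsets}

latest = {0: 1500, 1: 1200}     # last offset written by producers, per partition
committed = {0: 1400, 1: 1200}  # last offset processed by the consumer group
lag = consumer_lag(latest, committed)
# partition 0 is 100 messages behind; partition 1 is fully caught up
```

Tools like `kafka-consumer-groups.sh` report exactly these numbers per partition.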

&lt;h2&gt;
  
  
  Why Does Kafka Lag Happen?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. The Speed Mismatch Problem&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Picture a factory assembly line where bottles are being filled. If the filling station produces 100 bottles per minute but the capping station can only cap 70 bottles per minute, you'll have 30 uncapped bottles piling up every minute. Similarly, when your data producers send messages faster than consumers can handle them, lag accumulates.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Processing Complexity&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Not all tasks are created equal. Imagine reading a children's book versus analyzing a legal contract: one takes seconds, the other takes hours. If your consumer needs to perform complex calculations, database lookups, or call external services for each message, it naturally slows down, creating lag.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Resource Constraints&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Think of your consumer as a worker with limited tools. If that worker doesn't have enough memory (like trying to juggle too many tasks at once), sufficient processing power (like using a bicycle to deliver packages across a city), or good network connectivity (like having a slow internet connection), they simply can't keep up with the workload.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Sudden Traffic Spikes&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Consider a ticketing website when a popular concert goes on sale. Normally, the site comfortably handles a few hundred visitors per minute. Suddenly, 50,000 people flood in simultaneously and the system gets overwhelmed. Similarly, unexpected surges in data, such as during a flash sale or a viral social media event, can cause temporary lag.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Consumer Downtime&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If your consumer application crashes, needs maintenance, or gets redeployed, it's like a cashier taking a lunch break, messages pile up while no one's processing them. When the consumer comes back online, it faces a mountain of unprocessed messages.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;6. Inefficient Message Processing&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Imagine sorting mail by reading every single letter in full before deciding where it goes, versus just glancing at the address. Poor coding practices, unnecessary operations, or inefficient algorithms can dramatically slow down message processing.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to Reduce or Eliminate Kafka Lag
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Add More Workers (Increase Consumer Instances)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The most straightforward solution: if one cashier can't handle the line, open more registers. By running multiple consumer instances in parallel, you can process more messages simultaneously. Kafka automatically distributes the workload among them through partitioning.&lt;br&gt;
Look at it this way: instead of one person answering customer emails, have a team of five, each handling a portion of the inbox.&lt;/p&gt;
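&lt;p&gt;Kafka's built-in assignors (range, round-robin, sticky) handle this distribution for you; the toy sketch below only illustrates the effect of adding instances:&lt;/p&gt;

```python
def assign_round_robin(partitions, consumers):
    """Distribute partitions across consumer instances, round-robin."""
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

# 6 partitions shared by 3 consumer instances -> 2 partitions each,
# so each instance processes a third of the traffic
assignment = assign_round_robin(list(range(6)), ["c1", "c2", "c3"])
```

Note the ceiling this implies: more consumer instances than partitions leaves the extra instances idle.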

&lt;p&gt;&lt;strong&gt;2. Optimize the Processing Logic&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Make your consumers faster and smarter. Remove unnecessary steps, cache frequently accessed data, and streamline your code. It's like teaching your workers to use keyboard shortcuts instead of clicking through menus, same result, much faster.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key strategies:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Eliminate redundant operations&lt;/li&gt;
&lt;li&gt;Use batch processing where possible&lt;/li&gt;
&lt;li&gt;Avoid blocking operations&lt;/li&gt;
&lt;li&gt;Implement efficient data structures&lt;/li&gt;
&lt;/ul&gt;
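&lt;p&gt;As one hedged illustration of "cache frequently accessed data" (the lookup function is hypothetical), &lt;code&gt;functools.lru_cache&lt;/code&gt; lets repeated messages for the same key skip the slow path entirely:&lt;/p&gt;

```python
from functools import lru_cache

calls = 0  # counts how many times the slow path actually runs

@lru_cache(maxsize=1024)
def lookup_product(product_id):
    """Stand-in for a slow database or external API lookup."""
    global calls
    calls += 1
    return {"id": product_id, "name": f"product-{product_id}"}

# 5 incoming messages, but only 2 distinct product IDs
for product_id in [1, 2, 1, 1, 2]:
    lookup_product(product_id)
# only 2 slow lookups were performed for the 5 messages
```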

&lt;p&gt;&lt;strong&gt;3. Increase Partition Count&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Kafka divides message streams into partitions. Think of them as multiple conveyor belts instead of one. More partitions mean more parallel processing opportunities. However, this is like adding more lanes to a highway; it only helps if you have enough cars (consumers) to use them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Batch Processing&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Instead of processing messages one at a time (like making individual trips to deliver each package), group them together (like loading a truck with multiple packages for one delivery run). This reduces overhead and improves throughput significantly.&lt;/p&gt;
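&lt;p&gt;A minimal sketch of the grouping step, independent of any Kafka client library:&lt;/p&gt;

```python
def batches(messages, size):
    """Group messages into fixed-size batches to amortize per-call overhead."""
    for i in range(0, len(messages), size):
        yield messages[i:i + size]

msgs = list(range(10))
grouped = list(batches(msgs, 4))
# one database write per batch instead of one per message
```

Real consumers do the same thing with settings like a poll's `max_poll_records` plus a single bulk insert per returned batch.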

&lt;p&gt;&lt;strong&gt;5. Upgrade Resources&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Sometimes you simply need better tools: allocate more memory, faster CPUs, or more network bandwidth to your consumers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;6. Implement Asynchronous Processing&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Don't wait for one task to finish before starting the next. By processing messages asynchronously, you maximize resource utilization.&lt;/p&gt;
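&lt;p&gt;For I/O-bound consumers, this can be as simple as &lt;code&gt;asyncio&lt;/code&gt;: while one message waits on an external call, others make progress. A small sketch (the handler is hypothetical):&lt;/p&gt;

```python
import asyncio

async def handle(msg):
    """Simulate I/O-bound work per message (e.g., an external API call)."""
    await asyncio.sleep(0.01)
    return msg.upper()

async def process_all(messages):
    # Launch all handlers concurrently instead of awaiting each one in turn
    return await asyncio.gather(*(handle(m) for m in messages))

results = asyncio.run(process_all(["a", "b", "c"]))
```

Three 10 ms waits overlap here, so the batch finishes in roughly 10 ms rather than 30 ms.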

&lt;p&gt;&lt;strong&gt;7. Use Consumer Groups Wisely&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Organize your consumers into groups where each handles specific types of messages. This is like having specialized teams, one for returns, one for new orders, one for inquiries, rather than everyone handling everything.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;8. Monitor and Alert&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You can't fix what you don't know is broken. Set up monitoring to track lag metrics and alert you when thresholds are exceeded. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;9. Implement Backpressure Mechanisms&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Sometimes the solution is to slow down the producers temporarily. While not always ideal, it prevents system overload.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;10. Prioritize Critical Messages&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Not all messages are equally important. Implement priority queues so urgent messages get processed first.&lt;/p&gt;
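&lt;p&gt;A minimal priority-queue sketch using Python's &lt;code&gt;heapq&lt;/code&gt; (the message names are invented for illustration):&lt;/p&gt;

```python
import heapq

queue = []
# (priority, message): a lower number means more urgent
heapq.heappush(queue, (2, "routine stock update"))
heapq.heappush(queue, (0, "payment failure alert"))
heapq.heappush(queue, (1, "order placed"))

# Pop in priority order: urgent messages come out first
order = [heapq.heappop(queue)[1] for _ in range(len(queue))]
```

Kafka itself has no per-message priorities, so in practice this usually means routing urgent messages to a separate topic that a dedicated consumer drains first.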

&lt;h3&gt;
  
  
  Finding the Right Balance
&lt;/h3&gt;

&lt;p&gt;Eliminating Kafka lag isn't always about processing everything instantly. Sometimes, a small amount of lag is acceptable and even expected. The goal is to keep lag within acceptable boundaries for your business needs. &lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Kafka lag is a natural consequence of distributed systems handling real-time data. It happens when consumption can't keep pace with production. By understanding the root causes, whether it's speed mismatches, resource constraints, or inefficient processing, you can apply the right solutions.&lt;/p&gt;

&lt;p&gt;The key is to monitor continuously, optimize intelligently, and scale appropriately. With the right combination of additional consumers, optimized code, proper resource allocation, and smart architecture decisions, you can keep your Kafka lag minimal and your data flowing smoothly.&lt;/p&gt;

&lt;p&gt;Remember: managing Kafka lag is not a one-time fix but an ongoing process of monitoring, measuring, and adjusting as your system evolves and grows.&lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>monitoring</category>
      <category>performance</category>
    </item>
    <item>
      <title>Real-Time Earthquake CDC Pipeline</title>
      <dc:creator>Oliver Samuel</dc:creator>
      <pubDate>Sat, 01 Nov 2025 11:23:46 +0000</pubDate>
      <link>https://open.forem.com/oliver_samuel_028c6f65ad6/real-time-earthquake-cdc-pipeline-km3</link>
      <guid>https://open.forem.com/oliver_samuel_028c6f65ad6/real-time-earthquake-cdc-pipeline-km3</guid>
      <description>&lt;p&gt;Bringing live seismic data to life from API to dashboards, in seconds. This project builds a real-time Change Data Capture(CDC) pipeline that streams live earthquake data from the &lt;a href="https://earthquake.usgs.gov/fdsnws/event/1/" rel="noopener noreferrer"&gt;USGS FDSN API&lt;/a&gt; into &lt;strong&gt;MySQL&lt;/strong&gt;, mirrors every change through &lt;strong&gt;Kafka + Debezium&lt;/strong&gt;, lands it to &lt;strong&gt;PostgreSQL&lt;/strong&gt;, and visualizes global seismic trends &lt;strong&gt;Grafana&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Project Overview
&lt;/h2&gt;

&lt;p&gt;Earthquakes happen without warning, and understanding their patterns requires timely data. Traditional earthquake monitoring systems often have delays between when an earthquake occurs and when the data becomes available for analysis. This project eliminates that gap.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Real-world impact:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Emergency responders can see global seismic activity as it happens&lt;/li&gt;
&lt;li&gt;Researchers can analyze earthquake patterns in real-time&lt;/li&gt;
&lt;li&gt;The public can track seismic events in their region instantly&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Think of it like the difference between reading yesterday's newspaper vs. watching live news, except for earthquakes happening anywhere on Earth.&lt;/p&gt;

&lt;p&gt;Every minute, the U.S. Geological Survey(USGS) publishes new earthquake events around the world.&lt;br&gt;
In this project, we built a pipeline that:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Fetches&lt;/strong&gt; new quakes every minute from the USGS API&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Upserts&lt;/strong&gt; events into MySQL&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Captures&lt;/strong&gt; changes in real time via Debezium &amp;amp; Kafka&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Streams&lt;/strong&gt; them into PostgreSQL&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Visualizes&lt;/strong&gt; live quakes and metrics in Grafana dashboards&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmso0gs9bf7zsj9qlhtfc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmso0gs9bf7zsj9qlhtfc.png" alt="System Architecture" width="800" height="217"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Overall system architecture diagram&lt;/em&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Architecture
&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;USGS API → MySQL → Debezium → Kafka → JDBC Sink → PostgreSQL → Grafana

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Each component plays a critical role:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;- MySQL&lt;/strong&gt; - Primary database storing fresh quake data&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;- Adminer UI&lt;/strong&gt; - Visualizes our data in the primary MySQL database after API ingestion&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;- Debezium&lt;/strong&gt; - Captures every insert/update via CDC&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;- Kafka&lt;/strong&gt; - Streams events through topics&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;- PostgreSQL&lt;/strong&gt; - Sink database for analytics&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;- Grafana&lt;/strong&gt; - Visualization layer for insights&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;- Kafka UI&lt;/strong&gt; - Monitors topics and connectors visually&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7zqunsgyvf3gfpqgoihz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7zqunsgyvf3gfpqgoihz.png" alt="Kafka Topics" width="800" height="264"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Kafka UI showing topics&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdwj2yflu2lajiixkhmzk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdwj2yflu2lajiixkhmzk.png" alt="Sink and Source Connectors" width="800" height="264"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Sink and Source Connectors&lt;/em&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Phases of the Build
&lt;/h2&gt;
&lt;h3&gt;
  
  
  Phase 1: USGS API Integration
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What's happening here:&lt;/strong&gt; The United States Geological Survey (USGS) maintains a public API that reports every earthquake detected globally. We poll (ask) this API every 60s: "What earthquakes happened in the last minute?"&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why every minute?&lt;/strong&gt; Earthquakes don't wait, and neither should our data. By checking every minute, we ensure our dashboard shows the most current picture of global seismic activity.&lt;/p&gt;

&lt;p&gt;A Python script polls the API:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;https://earthquake.usgs.gov/fdsnws/event/1/query?format&lt;span class="o"&gt;=&lt;/span&gt;geojson&amp;amp;starttime&lt;span class="o"&gt;={&lt;/span&gt;NOW-1min&lt;span class="o"&gt;}&lt;/span&gt;&amp;amp;endtime&lt;span class="o"&gt;={&lt;/span&gt;NOW&lt;span class="o"&gt;}&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;New events are upserted into the &lt;code&gt;earthquake_minute&lt;/code&gt; table in MySQL.&lt;/p&gt;
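&lt;p&gt;The post doesn't show the polling script itself, so here is a hedged sketch of just the URL-building step for the one-minute query window (the function name is invented):&lt;/p&gt;

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import urlencode

def usgs_query_url(now=None, window_s=60):
    """Build the rolling one-minute window query for the USGS FDSN event API."""
    now = now or datetime.now(timezone.utc)
    params = {
        "format": "geojson",
        "starttime": (now - timedelta(seconds=window_s)).strftime("%Y-%m-%dT%H:%M:%S"),
        "endtime": now.strftime("%Y-%m-%dT%H:%M:%S"),
    }
    return "https://earthquake.usgs.gov/fdsnws/event/1/query?" + urlencode(params)

# e.g. fetch with: requests.get(usgs_query_url(), timeout=10).json()
```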

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4e56f28kgwrt8nvim57i.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4e56f28kgwrt8nvim57i.png" alt="Sample MySQL table rows after API ingestion" width="800" height="397"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Earthquake MySQL table rows after API ingestion in Adminer UI&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Phase 2: Change Data Capture (CDC)
&lt;/h3&gt;

&lt;p&gt;Imagine MySQL is a busy restaurant kitchen, and the binlog is a camera recording everything the chefs do. Debezium is like a food critic watching that recording in real-time, narrating every dish that gets plated, modified, or sent back.&lt;/p&gt;

&lt;p&gt;Without CDC, we'd have to repeatedly ask MySQL "What's new?" every few seconds, which is inefficient and slow. With CDC, MySQL tells us the moment something changes. It's the difference between spam-refreshing your email vs. getting instant push notifications.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;MySQL binary logging enabled (&lt;code&gt;binlog_format=ROW&lt;/code&gt;)&lt;br&gt;
&lt;em&gt;Understanding the Binlog&lt;/em&gt;&lt;br&gt;
At the core of this project lies &lt;strong&gt;MySQL's Binary log(binlog)&lt;/strong&gt;, a special journal that records every change made to the database: inserts, updates, and deletes.&lt;/p&gt;

&lt;p&gt;By enabling it in &lt;strong&gt;ROW format&lt;/strong&gt;, MySQL doesn't just log that "something changed". It records &lt;strong&gt;exactly what changed&lt;/strong&gt; in each row. This is what allows tools like &lt;strong&gt;Debezium&lt;/strong&gt; to reconstruct the full story for every database mutation in real-time.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
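&lt;p&gt;With ROW format enabled, each Debezium change event can carry full before/after row images. A simplified sketch of that payload shape (values invented, fields trimmed for illustration):&lt;/p&gt;

```python
# Simplified shape of a Debezium change-event payload (fields trimmed)
change_event = {
    "op": "u",  # c = create, u = update, d = delete
    "before": {"id": "us7000abcd", "mag": 4.5},  # row state before the change
    "after":  {"id": "us7000abcd", "mag": 4.7},  # row state after the change
    "source": {"connector": "mysql", "table": "earthquake_minute"},
}

# ROW-format binlogging is what makes "before" and "after" recoverable at all:
# a downstream consumer can see a magnitude was revised from 4.5 to 4.7
assert change_event["before"]["mag"] != change_event["after"]["mag"]
```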

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2ox72iilu2d3nuy4dzhl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2ox72iilu2d3nuy4dzhl.png" alt="Binlog Settings" width="800" height="885"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;**Why it Matters**
- `log_bin = ON` - Enables binary logging
- `binlog_format = ROW` - Captures row-level detail for CDC
- `server_id` - Provides a unique identifier for the MySQL instance(required by Debezium)

Once the binlog is active, Debezium can tap into it via Kafka Connect, continuously streaming every change into Kafka topics—turning your database into a real-time data source.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;ul&gt;
&lt;li&gt;Debezium MySQL connector listens for changes&lt;/li&gt;
&lt;li&gt;Kafka topics carry those changes&lt;/li&gt;
&lt;li&gt;JDBC Sink connector writes them to PostgreSQL&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu69mg9yjdk7j70ur3zbp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu69mg9yjdk7j70ur3zbp.png" alt="Debezium connector configuration (Kafka Connect UI)" width="800" height="397"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Debezium connector configuration (Kafka Connect UI)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvmn4xs3ytoginuzypq6i.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvmn4xs3ytoginuzypq6i.png" alt="Kafka UI → Topics → Messages view" width="800" height="397"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Kafka UI → Topics → Messages view&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Phase 3: Grafana Visualization
&lt;/h3&gt;

&lt;p&gt;Grafana connects to PostgreSQL and brings seismic data to life through four panels:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Real-Time World Map&lt;/strong&gt; — Global quake visualization&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;Why it's useful:&lt;/em&gt; Instantly see WHERE earthquakes are clustering. Notice the "Ring of Fire" pattern around the Pacific?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;2. Quakes Per Hour&lt;/strong&gt; — Time-series trend of activity&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;Why it's useful:&lt;/em&gt; Spot unusual spikes that might indicate aftershock sequences or increased regional activity&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;3. Top 5 Hotspot Regions&lt;/strong&gt; — Aggregated regional summary&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;Why it's useful:&lt;/em&gt; Quantify which areas are most seismically active over time&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;4. Quakes in Last Hour (Gauge)&lt;/strong&gt; — Real-time activity level&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;Why it's useful:&lt;/em&gt; A quick "pulse check" showing if Earth is currently rumbling more than usual&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0x9tgvo8d132ju9v60kf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0x9tgvo8d132ju9v60kf.png" alt="Grafana dashboard" width="800" height="397"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Grafana dashboard (full view)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcu9reslb9ew6fr5ob3f8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcu9reslb9ew6fr5ob3f8.png" alt="Close-up of world map panel" width="800" height="397"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Close-up of world map panel&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;This project demonstrates the power of streaming data from a global earthquake API to actionable visualizations in real time.&lt;br&gt;
By combining open-source tools like &lt;strong&gt;Debezium&lt;/strong&gt;, &lt;strong&gt;Kafka&lt;/strong&gt;, and &lt;strong&gt;Grafana&lt;/strong&gt;, we built a pipeline that's not just functional but alive, constantly evolving with Earth's tremors.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What we proved:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Real-time data pipelines can be built with free, open-source tools&lt;/li&gt;
&lt;li&gt;Complex infrastructure can be orchestrated with Docker Compose&lt;/li&gt;
&lt;li&gt;CDC is the key to keeping distributed systems in sync without manual intervention&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Real-world applications of this architecture:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;IoT sensor networks (replace earthquakes with temperature/pressure readings)&lt;/li&gt;
&lt;li&gt;E-commerce inventory systems (track stock changes across warehouses)&lt;/li&gt;
&lt;li&gt;Financial fraud detection (monitor transactions in real-time)&lt;/li&gt;
&lt;li&gt;Healthcare patient monitoring (stream vital signs to alert systems)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The principles here scale from earthquake monitoring to any domain where &lt;strong&gt;seeing changes as they happen&lt;/strong&gt; creates value.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8mjqtsqx4vbbpj2pbm1l.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8mjqtsqx4vbbpj2pbm1l.png" alt="Grafana + Kafka UI side-by-side for the closing shot" width="800" height="397"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Quick Start
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Start all services&lt;/span&gt;
docker compose up &lt;span class="nt"&gt;-d&lt;/span&gt;

&lt;span class="c"&gt;# Access components&lt;/span&gt;
MySQL        → localhost:3306
Kafka UI     → http://localhost:8082
Grafana      → http://localhost:3000
PostgreSQL   → localhost:5435
Adminer UI   → http://localhost:8080
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Curious to dig deeper?&lt;/strong&gt;&lt;br&gt;
All the source code, configurations, and dashboards for this real-time earthquake streaming pipeline are open-source here:&lt;br&gt;
&lt;a href="https://github.com/25thOliver/Real-Time-Earthquake-CDC" rel="noopener noreferrer"&gt;GitHub Repo&lt;/a&gt;&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>database</category>
      <category>dataengineering</category>
    </item>
    <item>
      <title>Real-Time Crypto Data Pipeline</title>
      <dc:creator>Oliver Samuel</dc:creator>
      <pubDate>Mon, 27 Oct 2025 18:01:50 +0000</pubDate>
      <link>https://open.forem.com/oliver_samuel_028c6f65ad6/real-time-crypto-data-pipeline-e8f</link>
      <guid>https://open.forem.com/oliver_samuel_028c6f65ad6/real-time-crypto-data-pipeline-e8f</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Ever wondered how trading platforms display live crypto prices? In this article, I'll show you how I built a fully automated real-time data pipeline that streams cryptocurrency data from Binance and visualizes it like a Bloomberg Terminal - completely open source!&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What you'll learn:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Setting up Change Data Capture (CDC) with Debezium&lt;/li&gt;
&lt;li&gt;Building event-driven architectures with Kafka&lt;/li&gt;
&lt;li&gt;Handling time-series data at scale with Cassandra&lt;/li&gt;
&lt;li&gt;Creating real-time dashboards with Grafana&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Tech Stack:&lt;/strong&gt; Python | PostgreSQL | Debezium | Apache Kafka | Cassandra | Grafana | Docker&lt;/p&gt;

&lt;h2&gt;
  
  
  Why I Built This
&lt;/h2&gt;

&lt;p&gt;I wanted to understand how major trading platforms handle real-time data at scale. Instead of just reading about it, I decided to build a production-grade pipeline that could:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Handle thousands of price updates per minute&lt;/li&gt;
&lt;li&gt;Never lose data even if services crash&lt;/li&gt;
&lt;li&gt;Provide instant insights through dashboards&lt;/li&gt;
&lt;li&gt;Scale horizontally as data grows&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This project taught me more about distributed systems in one month than a year of tutorials.&lt;/p&gt;

&lt;h2&gt;
  
  
  Challenges I Faced
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Challenge 1: Database Polling Overhead&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Initially, I was polling PostgreSQL every second. CPU usage was 80%+!&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Solution:&lt;/strong&gt; Implemented Debezium CDC using PostgreSQL's replication log. CPU dropped to 5%.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Challenge 2: Data Loss During Failures&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
When Cassandra went down, data disappeared.&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Solution:&lt;/strong&gt; Kafka acts as a durable buffer - it stores events until consumers catch up.&lt;/p&gt;
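&lt;p&gt;The "durable buffer" behavior comes down to offset bookkeeping: Kafka retains events, and each consumer tracks how far it has read. A minimal, stdlib-only sketch (with hypothetical offset numbers) of how that backlog is measured:&lt;/p&gt;

```python
# Kafka keeps every event until retention expires; a consumer's "lag" is
# how far behind the end of the log it is. If the sink (Cassandra) goes
# down, lag grows but nothing is lost: the consumer resumes at its offset.

def consumer_lag(end_offsets, committed_offsets):
    """Per-partition lag: events written but not yet consumed."""
    return {
        partition: end_offsets[partition] - committed_offsets.get(partition, 0)
        for partition in end_offsets
    }

# Hypothetical state while Cassandra is down: producers keep writing,
# the sink's committed offsets stand still.
end = {0: 1500, 1: 1480}        # latest offsets in the topic
committed = {0: 1200, 1: 1480}  # where the Cassandra sink stopped

print(consumer_lag(end, committed))  # {0: 300, 1: 0}
```

When the sink comes back, it simply drains those 300 buffered events; nothing in the pipeline has to be replayed by hand.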

&lt;p&gt;&lt;strong&gt;Challenge 3: Time-Series Query Performance&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
PostgreSQL struggled with millions of time-series records.&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Solution:&lt;/strong&gt; Moved analytics workload to Cassandra, optimized for time-series data.&lt;/p&gt;
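&lt;p&gt;Why does Cassandra cope where PostgreSQL struggled? Time-series tables are typically partitioned so that one query touches one partition. A small sketch of that idea (the bucketing scheme here is illustrative, not the project's exact schema):&lt;/p&gt;

```python
from datetime import datetime, timezone

# Cassandra reads are fast when a query hits a single partition. A common
# time-series model buckets rows by (symbol, day), so one coin's prices
# for one day live together and range scans stay cheap.

def partition_key(symbol: str, ts: datetime):
    """Bucket a reading by trading pair and UTC day (hypothetical schema)."""
    return (symbol, ts.astimezone(timezone.utc).strftime("%Y-%m-%d"))

ts = datetime(2025, 10, 27, 18, 1, 50, tzinfo=timezone.utc)
print(partition_key("BTCUSDT", ts))  # ('BTCUSDT', '2025-10-27')
```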
&lt;h2&gt;
  
  
  What We've Achieved
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Real-Time Data Collection&lt;/strong&gt;: Automatically fetches live crypto market data from Binance every hour (3,600 seconds)&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Automated Data Pipeline&lt;/strong&gt;: Data flows seamlessly from Binance → PostgreSQL → Debezium CDC → Kafka → Cassandra without manual intervention&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Change Data Capture (CDC)&lt;/strong&gt;: Lets the system detect new data in PostgreSQL in real time without polling. Instead of repeatedly querying the database, Debezium listens for changes directly through PostgreSQL's replication log, ensuring near-zero latency and minimal load.&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Scalable Architecture&lt;/strong&gt;: Built with enterprise-grade technologies (Debezium, Kafka, Cassandra) that can handle millions of records&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Beautiful Visualizations&lt;/strong&gt;: Ready-to-use Grafana dashboards for monitoring crypto markets&lt;/p&gt;
&lt;h2&gt;
  
  
  Architecture Overview
&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Binance API → PostgreSQL → Debezium CDC → Kafka → Cassandra → Grafana
    ↓             ↓              ↓           ↓          ↓          ↓
  Prices      Primary      Change       Message    Fast      Beautiful
  Stats       Storage      Detection    Queue      Storage   Dashboards
                ↓                        ↓
          Every INSERT              Stream Changes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe1wbcxa6ll3yzwwformy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe1wbcxa6ll3yzwwformy.png" alt="Pipeline Architecture" width="800" height="532"&gt;&lt;/a&gt; &lt;br&gt;
&lt;em&gt;End-to-end pipeline from data ingestion to visualization.&lt;/em&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Components Breakdown
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Binance Data Collector&lt;/strong&gt; (Python)&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Fetches 5 types of market data: prices, 24hr stats, order books, recent trades, and candlestick data&lt;/li&gt;
&lt;li&gt;Writes data to PostgreSQL every hour (3,600 seconds)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;PostgreSQL Database&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Primary storage for all crypto market data&lt;/li&gt;
&lt;li&gt;Stores historical data with timestamps&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Change Data Capture (CDC) enabled&lt;/strong&gt; via logical replication&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Debezium Change Data Capture&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Automatically detects and captures database changes in real-time&lt;/li&gt;
&lt;li&gt;Monitors PostgreSQL for INSERT, UPDATE, DELETE operations&lt;/li&gt;
&lt;li&gt;Converts database changes into Kafka messages&lt;/li&gt;
&lt;li&gt;Minimal impact on database performance, since it reads the replication log instead of querying tables&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Apache Kafka&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Kafka acts as a real-time buffer between Debezium and Cassandra, ensuring data reliability. If Cassandra goes down, no data is lost. Kafka stores all change events until Cassandra comes back online.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Cassandra Sink Connector&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The Cassandra Sink Connector (DataStax) continuously listens to Kafka topics and mirrors every change into Cassandra tables that match the PostgreSQL schema.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Apache Cassandra&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Fast, distributed database optimized for time-series data&lt;/li&gt;
&lt;li&gt;Powers our real-time dashboards&lt;/li&gt;
&lt;li&gt;Stores denormalized data for quick reads&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Grafana Dashboards&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Visual interface for exploring crypto market data&lt;/li&gt;
&lt;li&gt;Live charts and analytics&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
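&lt;p&gt;The collector's core job in step 1 is simple: pull a JSON payload from Binance and turn it into rows to INSERT. A hedged sketch of that transformation, using the public &lt;code&gt;/api/v3/ticker/price&lt;/code&gt; response shape (the row field names here are illustrative, not the project's exact columns):&lt;/p&gt;

```python
import json
from datetime import datetime, timezone

# Binance's /api/v3/ticker/price returns a JSON list of
# {"symbol": ..., "price": ...}. Before inserting into PostgreSQL,
# each entry is typed and stamped with the fetch time.

def to_rows(ticker_json: str, fetched_at: datetime):
    rows = []
    for entry in json.loads(ticker_json):
        rows.append({
            "symbol": entry["symbol"],
            "price": float(entry["price"]),       # API sends prices as strings
            "fetched_at": fetched_at.isoformat(),
        })
    return rows

sample = '[{"symbol": "BTCUSDT", "price": "67321.50"}]'
now = datetime(2025, 10, 27, tzinfo=timezone.utc)
print(to_rows(sample, now))
```

Every INSERT these rows produce is what Debezium later picks up from the replication log.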
&lt;h2&gt;
  
  
  Data We Collect
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Data Type&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;th&gt;Update Frequency&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Prices&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Latest price for all trading pairs&lt;/td&gt;
&lt;td&gt;Every hour (3,600 s)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;24hr Stats&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Price changes, volumes, and market movements&lt;/td&gt;
&lt;td&gt;Every hour (3,600 s)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Order Books&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Current buy/sell orders&lt;/td&gt;
&lt;td&gt;Every hour (3,600 s)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Recent Trades&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Latest market transactions&lt;/td&gt;
&lt;td&gt;Every hour (3,600 s)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Candlesticks&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Historical price patterns (OHLCV)&lt;/td&gt;
&lt;td&gt;Every 6,000 s (100 minutes)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd6tufsvyf55cm8ih1k0r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd6tufsvyf55cm8ih1k0r.png" alt="Live crypto prices" width="800" height="403"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Live dashboard displaying top-performing cryptocurrencies by 24h change.&lt;/em&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Getting Started
&lt;/h2&gt;
&lt;h3&gt;
  
  
  Prerequisites
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Docker and Docker Compose installed on your computer&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  Quick Start (3 Steps)
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Create environment file&lt;/strong&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;   &lt;span class="c"&gt;# Create a .env file with these contents:&lt;/span&gt;
   &lt;span class="nv"&gt;POSTGRES_USER&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;crypto_user
   &lt;span class="nv"&gt;POSTGRES_PASSWORD&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;crypto_pass
   &lt;span class="nv"&gt;POSTGRES_DB&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;crypto_db
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;ol start="2"&gt;
&lt;li&gt;
&lt;strong&gt;Start everything&lt;/strong&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;   docker compose up &lt;span class="nt"&gt;--build&lt;/span&gt; &lt;span class="nt"&gt;-d&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fasbz67bjbyk2suhpwpmy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fasbz67bjbyk2suhpwpmy.png" alt="Docker ps" width="800" height="804"&gt;&lt;/a&gt; &lt;br&gt;
   &lt;em&gt;docker ps confirming all pipeline containers are up and running&lt;/em&gt;&lt;/p&gt;

&lt;ol start="3"&gt;
&lt;li&gt;
&lt;strong&gt;View your dashboards&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Open &lt;code&gt;http://localhost:3000&lt;/code&gt; in your browser&lt;/li&gt;
&lt;li&gt;Login: admin / admin&lt;/li&gt;
&lt;li&gt;Explore the crypto market data!&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzkw0o8bmk7185or2bh0d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzkw0o8bmk7185or2bh0d.png" alt="Home page after login" width="800" height="590"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Project Screenshots
&lt;/h2&gt;
&lt;h3&gt;
  
  
  Kafka UI - Monitoring Topics &amp;amp; Connectors
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdf3mv6d1s7vqbwsuu4q6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdf3mv6d1s7vqbwsuu4q6.png" alt="Kafka UI Overview" width="800" height="743"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwcrxwqfewzokz2c7ycml.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwcrxwqfewzokz2c7ycml.png" alt="Kafka UI crypto_prices topic" width="800" height="743"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Kafka UI providing a real-time view of all Kafka topics, internal connector states, and message traffic.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The Kafka UI interface offers an intuitive dashboard for monitoring the Kafka ecosystem:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Topics View&lt;/strong&gt; – Displays all internal and user-created topics such as &lt;code&gt;crypto_prices&lt;/code&gt;, &lt;code&gt;crypto_order_book&lt;/code&gt;, and more.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Consumers View&lt;/strong&gt; – Shows active sink connectors and other consumers reading from Kafka topics (e.g., &lt;code&gt;cassandra-sink&lt;/code&gt;).
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cluster Health&lt;/strong&gt; – Visualizes broker status, topic replication, and partition metrics.
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This provides a richer, more interactive way to inspect data flow across Kafka.&lt;/p&gt;
&lt;h3&gt;
  
  
  Architecture &amp;amp; Data Flow
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6upahm8flsswrrlts48h.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6upahm8flsswrrlts48h.png" alt="Active Topics" width="800" height="514"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Current Active Topics&lt;/em&gt; &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnyz1rki27edc6m1x4se6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnyz1rki27edc6m1x4se6.png" alt="PostgreSQL Sample Data" width="800" height="602"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;A sample query against the primary PostgreSQL database&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp94fyj9f979uinevxbxw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp94fyj9f979uinevxbxw.png" alt="Cassandra Sample Query" width="800" height="602"&gt;&lt;/a&gt; &lt;br&gt;
&lt;em&gt;A sample query against Cassandra, our analytics database&lt;/em&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Configuration Files
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;docker-compose.yml&lt;/code&gt; - Orchestrates all services&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;connectors/cassandra-sink.json&lt;/code&gt; - Cassandra data sink configuration&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;connectors/postgres-source-temp.json&lt;/code&gt; - PostgreSQL change data capture configuration&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;scripts/binance_ingestor.py&lt;/code&gt; - Main data collection script&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  Current Data Statistics
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Total Records Collected&lt;/strong&gt;: Over 1.8 million rows&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Active Tables&lt;/strong&gt;: 5 (prices, stats, order books, trades, candlesticks)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Update Frequency&lt;/strong&gt;: Every 60 seconds&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data Sources&lt;/strong&gt;: Binance REST API&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Storage&lt;/strong&gt;: PostgreSQL (primary) + Cassandra (analytics)&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  Grafana Dashboards
&lt;/h2&gt;

&lt;p&gt;Our dashboards provide:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Real-time price monitoring&lt;/strong&gt; across all trading pairs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;24-hour market analysis&lt;/strong&gt; with price changes and volumes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Order book depth&lt;/strong&gt; visualization&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Trade history&lt;/strong&gt; with buy/sell indicators&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Candlestick charts&lt;/strong&gt; for technical analysis&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  Troubleshooting
&lt;/h2&gt;
&lt;h3&gt;
  
  
  Check if services are running
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker ps
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  View data in PostgreSQL
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker &lt;span class="nb"&gt;exec &lt;/span&gt;postgres psql &lt;span class="nt"&gt;-U&lt;/span&gt; crypto_user &lt;span class="nt"&gt;-d&lt;/span&gt; crypto_db &lt;span class="nt"&gt;-c&lt;/span&gt; &lt;span class="s2"&gt;"SELECT * FROM crypto_prices LIMIT 10;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  View data in Cassandra
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker &lt;span class="nb"&gt;exec &lt;/span&gt;cassandra cqlsh &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="s2"&gt;"SELECT * FROM crypto_keyspace.crypto_prices LIMIT 10;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  Check connector status
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-sS&lt;/span&gt; http://localhost:8083/connectors | jq
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
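&lt;p&gt;The JSON that &lt;code&gt;curl&lt;/code&gt; returns can also be checked programmatically. Kafka Connect's REST API reports a connector-level state plus one state per task, so a connector is only healthy when all of them are &lt;code&gt;RUNNING&lt;/code&gt;. A small sketch of that check (the status dict below mimics the API's response shape; the task values are made up):&lt;/p&gt;

```python
# Kafka Connect reports a connector state and a per-task state; a single
# FAILED task silently stops part of the data flow, so check all of them.

def is_healthy(status: dict) -> bool:
    states = [status["connector"]["state"]]
    states += [task["state"] for task in status["tasks"]]
    return all(state == "RUNNING" for state in states)

# Example response shape from GET /connectors/cassandra-sink/status
status = {
    "name": "cassandra-sink",
    "connector": {"state": "RUNNING"},
    "tasks": [{"id": 0, "state": "RUNNING"}, {"id": 1, "state": "FAILED"}],
}
print(is_healthy(status))  # False -- task 1 needs a restart
```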


&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F74n6do9iidentpjwl0x7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F74n6do9iidentpjwl0x7.png" alt="CDC Status" width="800" height="804"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg47dzy78912rrwttfc2u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg47dzy78912rrwttfc2u.png" alt="CDC Config" width="800" height="378"&gt;&lt;/a&gt;&lt;br&gt;&lt;br&gt;
&lt;em&gt;REST response of CDC pipeline configuration&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fla3vtncvyevj9dfce8ig.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fla3vtncvyevj9dfce8ig.png" alt="crypto_prices topic streams" width="800" height="770"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Messages streaming through the crypto_prices topic&lt;/em&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Key Features
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Fully Automated&lt;/strong&gt; - Set it and forget it: data collection runs on its own&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Near Real-Time&lt;/strong&gt; - Fresh data every hour (3,600 seconds)&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Rich Visualizations&lt;/strong&gt; - Beautiful Grafana dashboards out of the box&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Reliable&lt;/strong&gt; - Built on proven enterprise technologies&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Scalable&lt;/strong&gt; - Designed to handle millions of records&lt;/p&gt;
&lt;h2&gt;
  
  
  Learn More
&lt;/h2&gt;

&lt;p&gt;This project demonstrates:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Change Data Capture (CDC)&lt;/strong&gt; with Debezium - automatically captures database changes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Real-time data streaming&lt;/strong&gt; with Apache Kafka - reliable message queuing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Time-series data storage&lt;/strong&gt; with Cassandra - optimized for analytics&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data visualization&lt;/strong&gt; with Grafana - beautiful dashboards&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Microservices architecture&lt;/strong&gt; with Docker - containerized services&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  How Change Data Capture Works
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Python script inserts data into PostgreSQL every hour (3,600 seconds)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Debezium connector&lt;/strong&gt; watches PostgreSQL for changes using logical replication&lt;/li&gt;
&lt;li&gt;When new rows are inserted, Debezium captures them automatically&lt;/li&gt;
&lt;li&gt;Changes are converted to JSON messages and sent to Kafka topics&lt;/li&gt;
&lt;li&gt;Cassandra sink connector consumes these messages and writes to Cassandra&lt;/li&gt;
&lt;li&gt;Result: &lt;strong&gt;Zero manual intervention&lt;/strong&gt; - data flows automatically!&lt;/li&gt;
&lt;/ol&gt;
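&lt;p&gt;Step 5 above largely boils down to unwrapping the Debezium envelope. Each change event carries an &lt;code&gt;op&lt;/code&gt; field (&lt;code&gt;c&lt;/code&gt; = create, &lt;code&gt;u&lt;/code&gt; = update, &lt;code&gt;d&lt;/code&gt; = delete) plus &lt;code&gt;before&lt;/code&gt; and &lt;code&gt;after&lt;/code&gt; row images; the sink only needs the new row. A sketch of that unwrapping (the sample row values are made up):&lt;/p&gt;

```python
# A Debezium change event wraps the row in an envelope: "op" says what
# kind of change happened and "after" holds the new row image. The sink
# writes "after" for creates/updates and handles deletes separately.

def unwrap(event: dict):
    """Return the row to write, or None for deletes (illustrative)."""
    if event["op"] == "d":
        return None
    return event["after"]

event = {
    "op": "c",
    "before": None,
    "after": {"symbol": "BTCUSDT", "price": 67321.5},
    "ts_ms": 1761588110000,
}
print(unwrap(event))  # {'symbol': 'BTCUSDT', 'price': 67321.5}
```

In the real pipeline the DataStax sink connector does this mapping declaratively via its topic-to-table configuration; the function above just makes the envelope visible.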
&lt;h2&gt;
  
  
  Support
&lt;/h2&gt;

&lt;p&gt;For questions or issues, please check the logs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker logs binance_ingestor
docker logs debezium-connect
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Explore the Full Project
&lt;/h3&gt;

&lt;p&gt;You can find the complete source code, Docker setup, and connector configurations on GitHub:  &lt;/p&gt;

&lt;h2&gt;
  
  
  👉 &lt;a href="https://github.com/25thOliver/Crypto-Data-Pipeline" rel="noopener noreferrer"&gt;https://github.com/25thOliver/Crypto-Data-Pipeline&lt;/a&gt;
&lt;/h2&gt;

</description>
      <category>systemdesign</category>
      <category>cryptocurrency</category>
      <category>dataengineering</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>From smog to streams: how data engineering helps us breathe easier.</title>
      <dc:creator>Oliver Samuel</dc:creator>
      <pubDate>Mon, 20 Oct 2025 07:51:46 +0000</pubDate>
      <link>https://open.forem.com/oliver_samuel_028c6f65ad6/from-smog-to-streams-how-data-engineering-helps-us-breathe-easier-4190</link>
      <guid>https://open.forem.com/oliver_samuel_028c6f65ad6/from-smog-to-streams-how-data-engineering-helps-us-breathe-easier-4190</guid>
      <description>&lt;h2&gt;
  
  
  Building a Real-Time Air Quality Data Pipeline for Mombasa &amp;amp; Nairobi
&lt;/h2&gt;




&lt;h2&gt;
  
  
  The Invisible Problem We Breathe
&lt;/h2&gt;

&lt;p&gt;If you’ve ever driven through &lt;strong&gt;Nairobi&lt;/strong&gt; at rush hour or felt the coastal haze in &lt;strong&gt;Mombasa&lt;/strong&gt;, you’ve likely wondered:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“What exactly am I breathing right now?”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Air pollution often hides in plain sight, invisible but deadly. As a data engineer passionate about real-world impact, I decided to build a system that could &lt;strong&gt;listen to the air&lt;/strong&gt; and &lt;strong&gt;tell us the truth in real time&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;That journey became the &lt;strong&gt;Real-Time Air Quality Pipeline&lt;/strong&gt;:&lt;br&gt;
a streaming data architecture that fetches hourly pollutant readings, processes them instantly, and makes them queryable within seconds — all built with open-source tools.&lt;/p&gt;


&lt;h2&gt;
  
  
  Project Overview
&lt;/h2&gt;

&lt;p&gt;This pipeline fetches air quality data (PM2.5, PM10, CO, NO₂, SO₂, Ozone, UV Index) from the &lt;strong&gt;Open-Meteo API&lt;/strong&gt; for Nairobi and Mombasa, then streams it through a real-time pipeline using &lt;strong&gt;Kafka&lt;/strong&gt;, &lt;strong&gt;MongoDB&lt;/strong&gt;, &lt;strong&gt;Debezium&lt;/strong&gt;, and &lt;strong&gt;Cassandra&lt;/strong&gt;.&lt;/p&gt;
&lt;h3&gt;
  
  
  What It Does
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Fetches data hourly&lt;/li&gt;
&lt;li&gt;Streams via &lt;strong&gt;Kafka&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Stores raw data in &lt;strong&gt;MongoDB&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Uses &lt;strong&gt;Debezium&lt;/strong&gt; for CDC (Change Data Capture)&lt;/li&gt;
&lt;li&gt;Writes processed data to &lt;strong&gt;Cassandra&lt;/strong&gt; for analytics&lt;/li&gt;
&lt;li&gt;Fully containerized using &lt;strong&gt;Docker Compose&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;
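&lt;p&gt;The fetch step works with Open-Meteo's parallel hourly arrays: the response carries a &lt;code&gt;hourly&lt;/code&gt; object with a &lt;code&gt;time&lt;/code&gt; array alongside one array per pollutant. Zipping them yields one record per hour that the producer can publish to Kafka. A sketch of that reshaping (field names follow the air-quality API; the sample values are made up):&lt;/p&gt;

```python
# Open-Meteo returns {"hourly": {"time": [...], "pm2_5": [...], ...}}.
# Each index i across the arrays is one hourly reading, so we pivot the
# columns into per-hour records tagged with the city.

def hourly_records(city: str, payload: dict):
    hourly = payload["hourly"]
    fields = [name for name in hourly if name != "time"]
    records = []
    for i, ts in enumerate(hourly["time"]):
        record = {"city": city, "time": ts}
        for name in fields:
            record[name] = hourly[name][i]
        records.append(record)
    return records

payload = {"hourly": {"time": ["2025-10-20T07:00"], "pm2_5": [18.4], "pm10": [31.0]}}
print(hourly_records("Nairobi", payload))
```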

&lt;p&gt;&lt;strong&gt;End Result:&lt;/strong&gt; Live, queryable data on Kenya’s air quality — updated every hour.&lt;/p&gt;


&lt;h2&gt;
  
  
  Architecture Overview
&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="o"&gt;(&lt;/span&gt;Open-Meteo API&lt;span class="o"&gt;)&lt;/span&gt;
       ↓
  &lt;span class="o"&gt;[&lt;/span&gt;Producer - Python]
       ↓
  &lt;span class="o"&gt;[&lt;/span&gt;Kafka Topic - air_quality_data]
       ↓
  &lt;span class="o"&gt;[&lt;/span&gt;Consumer - MongoDB Writer]
       ↓
  &lt;span class="o"&gt;[&lt;/span&gt;MongoDB - Raw Data Storage]
       ↓
  &lt;span class="o"&gt;[&lt;/span&gt;Debezium CDC Connector]
       ↓
  &lt;span class="o"&gt;[&lt;/span&gt;Kafka CDC Topic]
       ↓
  &lt;span class="o"&gt;[&lt;/span&gt;Cassandra Consumer]
       ↓
  &lt;span class="o"&gt;[&lt;/span&gt;Cassandra - Analytics Storage]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Each block is a service — communicating in real time via Kafka topics.&lt;br&gt;
Together, they form a &lt;strong&gt;streaming ecosystem&lt;/strong&gt; that can handle continuous data without breaking a sweat.&lt;/p&gt;


&lt;h2&gt;
  
  
  Setting the Pipeline in Motion
&lt;/h2&gt;
&lt;h3&gt;
  
  
  1. Start the System
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker-compose up &lt;span class="nt"&gt;-d&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffh2bq6v5hu7fuha3umgg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffh2bq6v5hu7fuha3umgg.png" alt="Docker Compose starting all services" width="800" height="479"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After a few minutes, you’ll see all &lt;strong&gt;9 containers running&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;mongo&lt;/code&gt;, &lt;code&gt;zookeeper&lt;/code&gt;, &lt;code&gt;kafka&lt;/code&gt;, &lt;code&gt;kafka-ui&lt;/code&gt;, &lt;code&gt;mongo-connector&lt;/code&gt;,
&lt;code&gt;producer&lt;/code&gt;, &lt;code&gt;consumer&lt;/code&gt;, &lt;code&gt;cassandra&lt;/code&gt;, and &lt;code&gt;cassandra-consumer&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fglf3kdfjv80ube0c7ai3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fglf3kdfjv80ube0c7ai3.png" alt="All services live in Docker" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;


&lt;h3&gt;
  
  
  2. Initialize Databases
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;MongoDB Replica Set Setup&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;bash storage/init-replica-set.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F79vhwb5p5tp5gf1yhbr4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F79vhwb5p5tp5gf1yhbr4.png" alt="MongoDB replica set initialized successfully" width="800" height="479"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cassandra Schema Initialization&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker &lt;span class="nb"&gt;exec&lt;/span&gt; &lt;span class="nt"&gt;-i&lt;/span&gt; cassandra cqlsh &amp;lt; storage/cassandra_setup.cql
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h3&gt;
  
  
  3. Register Debezium CDC Connector
&lt;/h3&gt;

&lt;p&gt;Debezium monitors MongoDB for new data, captures changes, and streams them out.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;bash streaming/register-connector.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs52m5ch6ix9z0fkjy2c6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs52m5ch6ix9z0fkjy2c6.png" alt="Debezium connector registered and active" width="800" height="891"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once registered, every new air quality record inserted in MongoDB automatically triggers a CDC event.&lt;/p&gt;




&lt;h3&gt;
  
  
  4. System Health Check
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;bash health-check.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3ppwa7rqlr1uxezniqat.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3ppwa7rqlr1uxezniqat.png" alt="All services healthy and connected" width="800" height="867"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When all checks pass — the real-time pipeline is alive!&lt;/p&gt;




&lt;h2&gt;
  
  
  Deep Dive — The Data Flow in Action
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Step 1: The Producer (Python)
&lt;/h3&gt;

&lt;p&gt;Fetches data from Open-Meteo every hour, ensuring we only publish &lt;em&gt;complete&lt;/em&gt; readings.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhtwl4qsz083pwwyjtqvp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhtwl4qsz083pwwyjtqvp.png" alt="Producer fetching and publishing new air quality data" width="800" height="479"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  Step 2: The Consumer (MongoDB Writer)
&lt;/h3&gt;

&lt;p&gt;Consumes messages from Kafka and writes them as raw JSON into MongoDB.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fklskj0dirgi1msa9iox5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fklskj0dirgi1msa9iox5.png" alt="MongoDB consumer writing data" width="800" height="479"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Each entry contains pollutant levels, timestamps, and metadata for each city.&lt;/p&gt;




&lt;h3&gt;
  
  
  Step 3: Debezium CDC Connector
&lt;/h3&gt;

&lt;p&gt;Debezium detects new inserts in MongoDB and publishes “change events” to a Kafka CDC topic.&lt;/p&gt;
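&lt;p&gt;&lt;em&gt;Registering the connector is one POST to the Kafka Connect REST API. The connector class is Debezium's real MongoDB connector; the hostnames, database, and collection names below are assumptions for illustration:&lt;/em&gt;&lt;/p&gt;

```python
# Hedged example of a Debezium MongoDB connector registration payload.
# Connection string, prefix, and include lists are illustrative.
import json

connector_config = {
    "name": "mongo-air-quality-cdc",
    "config": {
        "connector.class": "io.debezium.connector.mongodb.MongoDbConnector",
        "mongodb.connection.string": "mongodb://mongo:27017/?replicaSet=rs0",
        "topic.prefix": "cdc",
        "database.include.list": "air_quality",
        "collection.include.list": "air_quality.air_quality_raw",
    },
}

payload = json.dumps(connector_config)
# Registration (not executed here) would be roughly:
#   requests.post("http://connect:8083/connectors",
#                 headers={"Content-Type": "application/json"}, data=payload)
```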

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqm1ktutlscwf2eqex85g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqm1ktutlscwf2eqex85g.png" alt="Debezium CDC connector running" width="800" height="891"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  Step 4: Cassandra Consumer
&lt;/h3&gt;

&lt;p&gt;Reads CDC events, cleans the data, skips incomplete values, and inserts time-series records into Cassandra.&lt;/p&gt;
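&lt;p&gt;&lt;em&gt;The cleaning step amounts to unwrapping the Debezium envelope and rejecting incomplete documents. A sketch under the assumption that the event's &lt;code&gt;payload.after&lt;/code&gt; holds the inserted document as a JSON string (field names are illustrative):&lt;/em&gt;&lt;/p&gt;

```python
# Hypothetical core of the Cassandra consumer: unwrap a change event and
# keep only complete readings; everything else is skipped.
import json

def extract_reading(event: bytes):
    """Return the cleaned reading dict, or None if the event should be skipped."""
    payload = json.loads(event.decode("utf-8")).get("payload", {})
    after = payload.get("after")
    if after is None:
        return None  # not an insert event
    doc = json.loads(after)
    if any(doc.get(field) is None for field in ("city", "timestamp", "pm2_5")):
        return None  # incomplete reading: skip it
    return doc

# A surviving reading would then go into Cassandra via a prepared statement:
#   session.execute(insert_stmt, (doc["city"], doc["timestamp"], doc["pm2_5"]))
```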

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnjidcmq0krjmtpc8sxwy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnjidcmq0krjmtpc8sxwy.png" alt="Cassandra consumer logs showing inserted readings" width="800" height="867"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Monitoring &amp;amp; Dashboards
&lt;/h2&gt;

&lt;p&gt;With &lt;strong&gt;Kafka UI&lt;/strong&gt;, you can see your streaming data live.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp5rwmhev5v9zj2ekwkrv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp5rwmhev5v9zj2ekwkrv.png" alt="Kafka UI topics overview" width="800" height="819"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Kafka UI displaying all active topics.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqce4eadq34lvl5k22yfd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqce4eadq34lvl5k22yfd.png" alt="Live messages in air\_quality\_data topic" width="800" height="819"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Real-time message flow for each city.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxs4b67tk0b9ub1hrm9p9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxs4b67tk0b9ub1hrm9p9.png" alt="Kafka consumer group statuses" width="800" height="819"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Consumers processing messages without lag.&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Querying the Data
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Raw Data in MongoDB
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;db.air_quality_raw.find&lt;span class="o"&gt;()&lt;/span&gt;.sort&lt;span class="o"&gt;({&lt;/span&gt;_id: &lt;span class="nt"&gt;-1&lt;/span&gt;&lt;span class="o"&gt;})&lt;/span&gt;.limit&lt;span class="o"&gt;(&lt;/span&gt;5&lt;span class="o"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh8me9aufgnlaghisctay.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh8me9aufgnlaghisctay.png" alt="MongoDB query showing pollutant values" width="800" height="836"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Raw data including PM2.5, ozone, and NO₂ readings.&lt;/em&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  Analytics Data in Cassandra
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;SELECT city, timestamp, pm2_5, pm10, ozone 
FROM air_quality_analytics.air_quality_readings 
WHERE &lt;span class="nv"&gt;city&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'Nairobi'&lt;/span&gt; LIMIT 5&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fymeti8tb9qje8q3l45nn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fymeti8tb9qje8q3l45nn.png" alt="Cassandra analytics results" width="800" height="836"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Structured air quality readings optimized for analysis.&lt;/em&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  Timestamp Insight
&lt;/h3&gt;

&lt;p&gt;Each record has two timestamps:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;timestamp&lt;/code&gt;: when the reading was captured&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;inserted_at&lt;/code&gt;: when it entered the pipeline&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This lets you track latency and data freshness — crucial for real-time systems.&lt;/p&gt;
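&lt;p&gt;&lt;em&gt;Concretely, subtracting the two timestamps gives end-to-end pipeline latency per record (the example record below is illustrative):&lt;/em&gt;&lt;/p&gt;

```python
# How the two timestamps enable a freshness check: seconds from capture
# to storage. Field names follow the article; the record is illustrative.
from datetime import datetime

def pipeline_latency_seconds(record: dict) -> float:
    """Seconds between when the reading was captured and when it was stored."""
    captured = datetime.fromisoformat(record["timestamp"])
    stored = datetime.fromisoformat(record["inserted_at"])
    return (stored - captured).total_seconds()

record = {
    "timestamp": "2026-02-05T23:00:00+00:00",
    "inserted_at": "2026-02-05T23:00:42+00:00",
}
# pipeline_latency_seconds(record) == 42.0
```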




&lt;h2&gt;
  
  
  What You’ll Learn
&lt;/h2&gt;

&lt;p&gt;Building this pipeline teaches core &lt;strong&gt;data engineering concepts&lt;/strong&gt;:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Concept&lt;/th&gt;
&lt;th&gt;What You’ll Learn&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Streaming Systems&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;How to build and manage real-time Kafka pipelines&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;CDC (Change Data Capture)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Tracking database changes with Debezium&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Multi-Database Architecture&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Choosing MongoDB for raw data, Cassandra for analytics&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Distributed Systems&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Managing replication and eventual consistency&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Containerization&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Deploying complex pipelines with Docker Compose&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Future Enhancements
&lt;/h2&gt;

&lt;p&gt;This is just the beginning — imagine expanding this into a &lt;strong&gt;nationwide environmental dashboard&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Next Steps:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Add more cities (Kisumu, Eldoret, Nakuru)&lt;/li&gt;
&lt;li&gt;Create Grafana dashboards for AQI visualization&lt;/li&gt;
&lt;li&gt;Add SMS or Slack alerts for dangerous readings&lt;/li&gt;
&lt;li&gt;Integrate ML for forecasting and anomaly detection&lt;/li&gt;
&lt;li&gt;Build an API or GraphQL endpoint for app developers&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Why It Matters
&lt;/h2&gt;

&lt;p&gt;Data shouldn’t live in spreadsheets — it should live in &lt;em&gt;motion&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;By streaming real-time air quality data, we can give cities, developers, and citizens &lt;strong&gt;live awareness of environmental health&lt;/strong&gt;.&lt;br&gt;
Projects like this can inform policy, support research, and raise awareness about what’s really in our air.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Data is the new air — you can’t see it, but everything depends on it.”&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Learn More &amp;amp; Contribute
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Author:&lt;/strong&gt; &lt;a href="https://www.linkedin.com/in/samwel-oliver/" rel="noopener noreferrer"&gt;Samwel Oliver&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/25thOliver" rel="noopener noreferrer"&gt;@25thOliver&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;Email:&lt;/strong&gt; &lt;a href="mailto:oliversamwel33@gmail.com"&gt;oliversamwel33@gmail.com&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/25thOliver/Air-Quality-Pipeline" rel="noopener noreferrer"&gt;Explore the Full Project on GitHub&lt;/a&gt;&lt;/p&gt;




</description>
      <category>architecture</category>
      <category>dataengineering</category>
      <category>opensource</category>
    </item>
    <item>
<title>Containerization for Data Engineering: A Practical Guide with Docker and Docker Compose</title>
      <dc:creator>Oliver Samuel</dc:creator>
      <pubDate>Fri, 10 Oct 2025 06:14:00 +0000</pubDate>
      <link>https://open.forem.com/oliver_samuel_028c6f65ad6/containerization-for-data-engineering-a-practical-guide-with-docker-and-docker-compose-1bon</link>
      <guid>https://open.forem.com/oliver_samuel_028c6f65ad6/containerization-for-data-engineering-a-practical-guide-with-docker-and-docker-compose-1bon</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;For many aspiring data engineers, Docker sounds intimidating—complex containers, YAML files, and endless &lt;code&gt;docker&lt;/code&gt; commands. But here's the truth: Docker isn't just for backend developers. It's your best friend when managing complex data pipelines with multiple moving parts: databases, schedulers, dashboards, and storage systems.&lt;/p&gt;

&lt;p&gt;In this guide, I'll demonstrate how I containerized a full YouTube analytics pipeline using &lt;strong&gt;Docker&lt;/strong&gt; and &lt;strong&gt;Docker Compose&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The goal? To automate data extraction, transformation, storage, and visualization—all running seamlessly across containers.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Containerize Data Pipelines?
&lt;/h2&gt;

&lt;p&gt;Without containers, setting up tools like &lt;strong&gt;Airflow&lt;/strong&gt;, &lt;strong&gt;Spark&lt;/strong&gt;, &lt;strong&gt;PostgreSQL&lt;/strong&gt;, &lt;strong&gt;Grafana&lt;/strong&gt;, and &lt;strong&gt;MinIO&lt;/strong&gt; locally would take hours, each requiring its own dependencies and configurations.&lt;/p&gt;

&lt;p&gt;With Docker Compose, all these services run together with a single command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker compose up &lt;span class="nt"&gt;-d&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Docker creates isolated environments for each service, ensuring &lt;strong&gt;portability&lt;/strong&gt;, &lt;strong&gt;consistency&lt;/strong&gt;, and &lt;strong&gt;easy scaling&lt;/strong&gt;.&lt;/p&gt;
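&lt;p&gt;&lt;em&gt;As a hedged illustration (service names, images, and ports are typical defaults, not the project's exact file), a few of those services in a single compose file might look like:&lt;/em&gt;&lt;/p&gt;

```yaml
# Sketch of part of the stack; images and ports are assumptions.
services:
  postgres:
    image: postgres:16
    environment:
      POSTGRES_PASSWORD: airflow
  minio:
    image: minio/minio
    command: server /data --console-address ":9001"
    ports:
      - "9000:9000"
      - "9001:9001"
  grafana:
    image: grafana/grafana
    ports:
      - "3000:3000"
```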




&lt;h2&gt;
  
  
  The Engine-Cartridge Architecture
&lt;/h2&gt;

&lt;p&gt;A key design pattern I used in this project was splitting the setup into two distinct layers:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. &lt;code&gt;airflow-docker/&lt;/code&gt; → The Engine
&lt;/h3&gt;

&lt;p&gt;This is the core infrastructure. It defines all containers, networks, environment variables, and Airflow services.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Responsibilities:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Defines the Docker Compose stack (Airflow + PostgreSQL + Grafana + MinIO + Spark)&lt;/li&gt;
&lt;li&gt;Acts as the "orchestration engine"&lt;/li&gt;
&lt;li&gt;Mounts DAGs and pipeline code dynamically&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  2. &lt;code&gt;airflow-youtube-analytics/&lt;/code&gt; → The Cartridge
&lt;/h3&gt;

&lt;p&gt;This is the plug-and-play ETL project, which lives &lt;em&gt;outside&lt;/em&gt; the engine but connects seamlessly to it.&lt;/p&gt;

&lt;p&gt;Think of it like a "cartridge" you can load into the Airflow engine.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Responsibilities:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Contains all DAGs and ETL scripts (&lt;code&gt;extract.py&lt;/code&gt;, &lt;code&gt;transform.py&lt;/code&gt;, &lt;code&gt;load.py&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Handles API calls, data transformations, and loading logic&lt;/li&gt;
&lt;li&gt;Can be swapped or extended without touching the engine&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Relationship Diagram:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;+-----------------------+
|  airflow-docker/      |   ---&amp;gt; Engine (Airflow + Services)
|  ├── docker-compose.yml|
|  ├── .env             |
|  └── dags/ &amp;lt;mount&amp;gt; ---┼──&amp;gt; Mounts DAGs from cartridge
+-----------------------+

        ⬇

+-----------------------------+
| airflow-youtube-analytics/  |  ---&amp;gt; Cartridge (ETL logic)
| ├── pipelines/youtube/      |
| │    ├── extract.py         |
| │    ├── transform.py       |
| │    └── load.py            |
| └── dags/youtube_pipeline.py|
+-----------------------------+
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With this modular setup:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;I can add new "cartridges" (projects) like &lt;code&gt;airflow-nasa-apod/&lt;/code&gt; or &lt;code&gt;airflow-weather-analytics/&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;airflow-docker/&lt;/code&gt; engine never changes—it simply mounts the new DAGs and runs them&lt;/li&gt;
&lt;li&gt;This makes the system scalable and reusable across multiple ETL projects&lt;/li&gt;
&lt;/ul&gt;
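&lt;p&gt;&lt;em&gt;The connection between engine and cartridge is essentially a single bind mount in the engine's compose file. A sketch (the paths are illustrative):&lt;/em&gt;&lt;/p&gt;

```yaml
# airflow-docker/docker-compose.yml (sketch): the scheduler sees the
# cartridge's DAGs because its dags folder is mounted from the other repo.
services:
  airflow-scheduler:
    volumes:
      - ../airflow-youtube-analytics/dags:/opt/airflow/dags
```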




&lt;h2&gt;
  
  
  Project Setup Overview
&lt;/h2&gt;

&lt;p&gt;Our pipeline components:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Layer&lt;/th&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Orchestration&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Apache Airflow&lt;/td&gt;
&lt;td&gt;Automates ETL workflow&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Data Storage&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;MinIO&lt;/td&gt;
&lt;td&gt;Acts as local S3 data lake&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Transformation&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;PySpark / Pandas&lt;/td&gt;
&lt;td&gt;Cleans and processes raw data&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Warehouse&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;PostgreSQL&lt;/td&gt;
&lt;td&gt;Stores transformed metrics&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Visualization&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Grafana&lt;/td&gt;
&lt;td&gt;Visualizes channel performance&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Architecture Diagram:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4ty8nsleromrikiart58.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4ty8nsleromrikiart58.png" alt="Architecture Diagram" width="710" height="901"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Containerized pipeline architecture.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Each service runs as a Docker container defined in the &lt;code&gt;docker-compose.yaml&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;This approach allowed me to test and run everything from extraction to Grafana visualization on my local machine.&lt;/p&gt;




&lt;h2&gt;
  
  
  Container Orchestration in Action
&lt;/h2&gt;

&lt;p&gt;Here's a sample of how services are spun together:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd&lt;/span&gt; ~/airflow-docker
docker compose up &lt;span class="nt"&gt;-d&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To sync environment variables between project and containers:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;./sync_env.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Result:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Airflow runs the DAG (Extract &amp;gt;&amp;gt; Transform &amp;gt;&amp;gt; Load)&lt;/li&gt;
&lt;li&gt;Spark handles transformations&lt;/li&gt;
&lt;li&gt;Data is stored in PostgreSQL and visualized in Grafana&lt;/li&gt;
&lt;li&gt;All communication happens inside containers through a shared Docker network&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Running Containers:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe9fnnnt54zvdm8nop06i.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe9fnnnt54zvdm8nop06i.png" alt="Running Containers" width="800" height="566"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;All containers running simultaneously via Docker Compose.&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Docker simplifies multi-service setup for data engineering projects&lt;/li&gt;
&lt;li&gt;Containerized Airflow pipelines are reproducible and portable&lt;/li&gt;
&lt;li&gt;Local MinIO + PostgreSQL simulates a full-scale cloud environment&lt;/li&gt;
&lt;li&gt;With Docker Compose, you can spin up a production-grade analytics stack in minutes&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Containerization removes the friction between development and deployment. Instead of juggling tool installations, Docker lets you focus on what matters: &lt;strong&gt;data flow&lt;/strong&gt;, &lt;strong&gt;not setup&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;If you've ever been scared to touch Docker, this is your sign:&lt;/p&gt;

&lt;p&gt;Start with one project, one &lt;code&gt;docker-compose.yaml&lt;/code&gt;, and build from there.&lt;/p&gt;

&lt;p&gt;By the end, you'll realize containers don't complicate data pipelines—they &lt;strong&gt;liberate&lt;/strong&gt; them.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;You can explore the complete codebase and pipeline setup in my GitHub repository.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/25thOliver/Airflow-Youtube-Analytics" rel="noopener noreferrer"&gt;GitHub Repo&lt;/a&gt;&lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>tutorial</category>
      <category>devops</category>
      <category>docker</category>
    </item>
    <item>
      <title>Building an Automated YouTube Analytics Dashboard with Airflow, PySpark, MinIO, PostgreSQL &amp; Grafana</title>
      <dc:creator>Oliver Samuel</dc:creator>
      <pubDate>Tue, 07 Oct 2025 06:00:45 +0000</pubDate>
      <link>https://open.forem.com/oliver_samuel_028c6f65ad6/building-an-automated-youtube-analytics-dashboard-with-airflow-pyspark-minio-postgresql-grafana-26ef</link>
      <guid>https://open.forem.com/oliver_samuel_028c6f65ad6/building-an-automated-youtube-analytics-dashboard-with-airflow-pyspark-minio-postgresql-grafana-26ef</guid>
      <description>&lt;p&gt;&lt;strong&gt;Author:&lt;/strong&gt; &lt;em&gt;Oliver Samuel&lt;/em&gt;&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Date:&lt;/strong&gt; &lt;em&gt;October 2025&lt;/em&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;This project explores the digital footprint of &lt;strong&gt;Raye&lt;/strong&gt;, the UK chart-topping artist known for her soulful pop sound and breakout hits like Escapism. Using a custom-built &lt;strong&gt;YouTube Analytics Pipeline&lt;/strong&gt; powered by &lt;strong&gt;Apache Airflow&lt;/strong&gt;, &lt;strong&gt;PySpark&lt;/strong&gt;, &lt;strong&gt;MinIO&lt;/strong&gt;, &lt;strong&gt;PostgreSQL&lt;/strong&gt; and &lt;strong&gt;Grafana&lt;/strong&gt;, we analyzed Raye's channel performance — from engagement trends to audience distribution.&lt;/p&gt;

&lt;p&gt;The goal was to design a scalable data workflow capable of extracting, transforming, and visualizing YouTube channel insights in real time. Beyond technical architecture, this analysis reveals how content release patterns, audience geography, and engagement rates evolve alongside Raye's career milestones.&lt;/p&gt;


&lt;h2&gt;
  
  
  Overview
&lt;/h2&gt;

&lt;p&gt;This project demonstrates how to design, containerize, and automate an &lt;strong&gt;end-to-end data engineering pipeline&lt;/strong&gt; for YouTube analytics using &lt;strong&gt;Apache Airflow&lt;/strong&gt;, &lt;strong&gt;PySpark&lt;/strong&gt;, &lt;strong&gt;MinIO&lt;/strong&gt;, &lt;strong&gt;PostgreSQL&lt;/strong&gt;, and &lt;strong&gt;Grafana&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;It automatically fetches YouTube channel data, performs transformations in Spark, loads the results into a PostgreSQL warehouse, and visualizes insights in Grafana — all orchestrated by Airflow.&lt;/p&gt;

&lt;p&gt;By the end, you'll have a live dashboard showing:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Total videos, views, and subscribers&lt;/li&gt;
&lt;li&gt;Average engagement rates&lt;/li&gt;
&lt;li&gt;Country-level view distribution&lt;/li&gt;
&lt;li&gt;Growth trends and publishing cadence&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi4v1foc3mvf55anys8pn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi4v1foc3mvf55anys8pn.png" alt="Final Grafana dashboard overview" width="800" height="420"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbuz12z26a4hdnvezaghj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbuz12z26a4hdnvezaghj.png" alt="Final Grafana dashboard overview" width="800" height="420"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Final Grafana dashboard overview&lt;/em&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  Architecture Overview
&lt;/h2&gt;

&lt;p&gt;Here's the end-to-end data flow:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;YouTube API → Raw JSON → MinIO (Data Lake)
         ↓
     PySpark Transform
         ↓
 PostgreSQL Warehouse
         ↓
 Grafana Dashboard (Visualization)
         ↓
 Airflow DAG (Automation &amp;amp; Scheduling)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgwscdf60n0ck00va1xdz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgwscdf60n0ck00va1xdz.png" alt="Architecture Diagram" width="710" height="901"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Architecture Diagram&lt;/em&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  Step 1: Automated Extraction with Airflow
&lt;/h2&gt;

&lt;p&gt;The first DAG task — &lt;code&gt;extract_youtube_data&lt;/code&gt; — uses the &lt;strong&gt;YouTube Data API v3&lt;/strong&gt; to fetch metadata and statistics for each target channel.&lt;/p&gt;

&lt;p&gt;The extracted JSON files are stored in &lt;strong&gt;MinIO&lt;/strong&gt;, a local S3-compatible data lake.&lt;/p&gt;
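&lt;p&gt;&lt;em&gt;A hedged sketch of the staging step. The API calls in comments are the real YouTube Data API v3 and boto3 interfaces, but the bucket name, key layout, and endpoint are assumptions:&lt;/em&gt;&lt;/p&gt;

```python
# Illustrative extract-and-stage step: pull channel statistics, then write
# the raw JSON to a date-partitioned key in MinIO's S3-compatible API.
import json

def build_object_key(channel_id: str, run_date: str) -> str:
    """Raw-zone object key, partitioned by extraction date."""
    return f"raw/youtube/{run_date}/{channel_id}.json"

# Real calls (not executed here) would be roughly:
#   youtube = googleapiclient.discovery.build("youtube", "v3", developerKey=API_KEY)
#   resp = youtube.channels().list(part="snippet,statistics", id=channel_id).execute()
#   s3 = boto3.client("s3", endpoint_url="http://minio:9000")
#   s3.put_object(Bucket="datalake",
#                 Key=build_object_key(channel_id, run_date),
#                 Body=json.dumps(resp).encode("utf-8"))
```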

&lt;p&gt;Sample record:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"channel_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"UC123456..."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"channel_title"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Raye"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"statistics"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"viewCount"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"10402000"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"subscriberCount"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"251000"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"videoCount"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"159"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"likeCount"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"359000"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"commentCount"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"50382"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"country"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"GB"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"publishedAt"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2014-06-22T10:05:00Z"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmesrqeqkdsfk65oeyjzz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmesrqeqkdsfk65oeyjzz.png" alt="Raw data in MinIO browser" width="800" height="762"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Raw data in MinIO&lt;/em&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  Step 2: Data Transformation with PySpark
&lt;/h2&gt;

&lt;p&gt;Next, Airflow triggers the &lt;strong&gt;transform task&lt;/strong&gt;, which runs &lt;code&gt;transform_youtube_data()&lt;/code&gt; inside the same containerized environment.&lt;/p&gt;

&lt;p&gt;It loads the raw files from MinIO using the S3A connector, casts numeric types, fills missing values, and computes engagement metrics like &lt;code&gt;views_per_video&lt;/code&gt;, &lt;code&gt;like_ratio&lt;/code&gt;, and &lt;code&gt;engagement_rate&lt;/code&gt;.&lt;/p&gt;
&lt;h3&gt;
  
  
  Key Transformations
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;transformed_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;transformed_df&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;withColumn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;like_count&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;lit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;total_likes&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;withColumn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;comment_count&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;lit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;total_comments&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;withColumn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;views_per_video&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;lit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;total_views&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="nf"&gt;when&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;video_count&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;otherwise&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;video_count&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;withColumn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;subs_per_video&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;subscriber_count&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="nf"&gt;when&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;video_count&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;otherwise&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;video_count&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;withColumn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;like_ratio&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="nf"&gt;when&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;lit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;total_views&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;lit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;total_likes&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="nf"&gt;lit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;total_views&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;)).&lt;/span&gt;&lt;span class="nf"&gt;otherwise&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;lit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;withColumn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;comment_ratio&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="nf"&gt;when&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;lit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;total_views&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;lit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;total_comments&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="nf"&gt;lit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;total_views&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;)).&lt;/span&gt;&lt;span class="nf"&gt;otherwise&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;lit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;withColumn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;engagement_rate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="nf"&gt;when&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;lit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;total_views&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nf"&gt;lit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;total_likes&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nf"&gt;lit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;total_comments&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="nf"&gt;lit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;total_views&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;)).&lt;/span&gt;&lt;span class="nf"&gt;otherwise&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;lit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
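&lt;p&gt;Every ratio column above repeats the same guarded-division pattern. As a plain-Python sanity check of that logic (a hypothetical helper for illustration, not part of the pipeline code), the rules are: substitute 1 when &lt;code&gt;video_count&lt;/code&gt; is zero, and return 0 for the view-based ratios when there are no views:&lt;/p&gt;

```python
def derived_metrics(total_views, total_likes, total_comments,
                    subscriber_count, video_count):
    """Mirror the Spark when/otherwise guards: treat a zero video_count
    as 1, and fall back to 0 for ratios when there are no views."""
    videos = video_count if video_count != 0 else 1
    return {
        "views_per_video": total_views / videos,
        "subs_per_video": subscriber_count / videos,
        "like_ratio": round(total_likes / total_views, 4) if total_views > 0 else 0,
        "comment_ratio": round(total_comments / total_views, 4) if total_views > 0 else 0,
        "engagement_rate": round((total_likes + total_comments) / total_views, 4)
                           if total_views > 0 else 0,
    }

m = derived_metrics(total_views=1_000_000, total_likes=50_000,
                    total_comments=5_000, subscriber_count=200_000, video_count=40)
print(m["engagement_rate"])  # (50_000 + 5_000) / 1_000_000 = 0.055
```

&lt;p&gt;The zero guards matter: a brand-new channel with no uploads or views would otherwise crash the whole Spark job with a division error.&lt;/p&gt;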

&lt;h3&gt;
  
  
  Output Format
&lt;/h3&gt;

&lt;p&gt;The cleaned dataset is stored back to MinIO as Parquet for optimized reads:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;transformed_df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;write&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;mode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;overwrite&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;parquet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3a://rayes-youtube/transformed/channel_stats_transformed.parquet&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
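&lt;p&gt;Writing to &lt;code&gt;s3a://&lt;/code&gt; paths only works once the Spark session knows where MinIO lives. A typical set of settings looks like the fragment below; the values are placeholders to be filled from the environment, and the exact keys can vary with the Spark/Hadoop versions in use:&lt;/p&gt;

```
# spark-defaults style settings for an S3A-compatible MinIO endpoint (placeholder values)
spark.hadoop.fs.s3a.endpoint           http://172.17.0.1:9000
spark.hadoop.fs.s3a.access.key         ${MINIO_ACCESS_KEY}
spark.hadoop.fs.s3a.secret.key         ${MINIO_SECRET_KEY}
spark.hadoop.fs.s3a.path.style.access  true
spark.hadoop.fs.s3a.impl               org.apache.hadoop.fs.s3a.S3AFileSystem
```

&lt;p&gt;Path-style access is the important one for MinIO: without it the S3A client tries bucket-subdomain URLs, which a local MinIO container does not serve.&lt;/p&gt;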



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbqst0tw10eh5tsz3jydm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbqst0tw10eh5tsz3jydm.png" alt="Spark job logs showing transformation success in Airflow" width="800" height="400"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Spark job logs showing transformation success in Airflow&lt;/em&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  Step 3: Load into PostgreSQL
&lt;/h2&gt;

&lt;p&gt;Airflow's final task — &lt;code&gt;load_to_postgres&lt;/code&gt; — transfers the transformed Parquet data into PostgreSQL using a JDBC connector or pandas-based loader.&lt;/p&gt;
&lt;h3&gt;
  
  
  Schema Alignment
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;PySpark Column&lt;/th&gt;
&lt;th&gt;PostgreSQL Column&lt;/th&gt;
&lt;th&gt;Type&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;channel_id&lt;/td&gt;
&lt;td&gt;channel_id&lt;/td&gt;
&lt;td&gt;text&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;channel_title&lt;/td&gt;
&lt;td&gt;channel_name&lt;/td&gt;
&lt;td&gt;text&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;published_at&lt;/td&gt;
&lt;td&gt;published_at&lt;/td&gt;
&lt;td&gt;timestamp&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;view_count&lt;/td&gt;
&lt;td&gt;view_count&lt;/td&gt;
&lt;td&gt;bigint&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;subscriber_count&lt;/td&gt;
&lt;td&gt;subscriber_count&lt;/td&gt;
&lt;td&gt;bigint&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;video_count&lt;/td&gt;
&lt;td&gt;video_count&lt;/td&gt;
&lt;td&gt;bigint&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;like_count&lt;/td&gt;
&lt;td&gt;like_count&lt;/td&gt;
&lt;td&gt;bigint&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;comment_count&lt;/td&gt;
&lt;td&gt;comment_count&lt;/td&gt;
&lt;td&gt;bigint&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;like_ratio&lt;/td&gt;
&lt;td&gt;like_ratio&lt;/td&gt;
&lt;td&gt;double precision&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;comment_ratio&lt;/td&gt;
&lt;td&gt;comment_ratio&lt;/td&gt;
&lt;td&gt;double precision&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;engagement_rate&lt;/td&gt;
&lt;td&gt;engagement_rate&lt;/td&gt;
&lt;td&gt;double precision&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;views_per_video&lt;/td&gt;
&lt;td&gt;views_per_video&lt;/td&gt;
&lt;td&gt;double precision&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;channel_age_days&lt;/td&gt;
&lt;td&gt;channel_age_days&lt;/td&gt;
&lt;td&gt;bigint&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;daily_view_growth&lt;/td&gt;
&lt;td&gt;daily_view_growth&lt;/td&gt;
&lt;td&gt;double precision&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;daily_sub_growth&lt;/td&gt;
&lt;td&gt;daily_sub_growth&lt;/td&gt;
&lt;td&gt;double precision&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;country&lt;/td&gt;
&lt;td&gt;country&lt;/td&gt;
&lt;td&gt;text&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; &lt;code&gt;channel_title&lt;/code&gt; in Spark maps to &lt;code&gt;channel_name&lt;/code&gt; in PostgreSQL — the only column renamed during loading.&lt;/p&gt;
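&lt;p&gt;Conceptually the load step is "rename, then insert". The snippet below is illustrative only: it uses Python's built-in &lt;code&gt;sqlite3&lt;/code&gt; as a stand-in for PostgreSQL so it stays self-contained, whereas the real task would point a JDBC writer or pandas &lt;code&gt;to_sql&lt;/code&gt; at the warehouse.&lt;/p&gt;

```python
import sqlite3

# channel_title is the only column renamed on the way into the warehouse.
RENAMES = {"channel_title": "channel_name"}

def load_rows(conn, rows):
    """Apply the Spark-to-warehouse rename and insert each record."""
    for row in rows:
        record = {RENAMES.get(k, k): v for k, v in row.items()}
        cols = ", ".join(record)
        params = ", ".join("?" for _ in record)
        conn.execute(f"INSERT INTO channel_stats ({cols}) VALUES ({params})",
                     tuple(record.values()))

conn = sqlite3.connect(":memory:")  # stand-in for the PostgreSQL connection
conn.execute("CREATE TABLE channel_stats (channel_id TEXT, channel_name TEXT, view_count INTEGER)")
load_rows(conn, [{"channel_id": "UC123", "channel_title": "Raye", "view_count": 1000000}])
print(conn.execute("SELECT channel_name FROM channel_stats").fetchone()[0])  # Raye
```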

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fad4qdj8bhkvzn9vljko1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fad4qdj8bhkvzn9vljko1.png" alt="Sample query results from PostgreSQL" width="800" height="649"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Sample query results from PostgreSQL&lt;/em&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  Step 4: Airflow Integration and Reusability
&lt;/h2&gt;

&lt;p&gt;This pipeline is designed as a &lt;strong&gt;modular Airflow project&lt;/strong&gt; that plugs into a reusable local engine (&lt;code&gt;~/airflow-docker&lt;/code&gt;).&lt;/p&gt;
&lt;h3&gt;
  
  
  Run Instructions
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Copy and configure environment variables&lt;/span&gt;
&lt;span class="nb"&gt;cp&lt;/span&gt; .env
&lt;span class="nb"&gt;chmod&lt;/span&gt; +x sync_env.sh
./sync_env.sh

&lt;span class="c"&gt;# Start the Airflow engine&lt;/span&gt;
&lt;span class="nb"&gt;cd&lt;/span&gt; ~/airflow-docker
docker compose down
docker compose up &lt;span class="nt"&gt;-d&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1j4ax2o8eibcgybvyb7e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1j4ax2o8eibcgybvyb7e.png" alt="Running Docker Containers" width="800" height="566"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Running Docker Containers&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;To manually test a DAG task inside Airflow:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker &lt;span class="nb"&gt;exec&lt;/span&gt; &lt;span class="nt"&gt;-it&lt;/span&gt; airflow-docker-airflow-scheduler-1 &lt;span class="se"&gt;\&lt;/span&gt;
  python /opt/airflow/dags/pipelines/youtube/extract.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Environment Variables
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Variable&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;YOUTUBE_API_KEY&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;YouTube Data API v3 key&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;MINIO_ACCESS_KEY&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;MinIO access key&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;MINIO_SECRET_KEY&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;MinIO secret key&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;MINIO_ENDPOINT&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;MinIO endpoint (default: &lt;code&gt;http://172.17.0.1:9000&lt;/code&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
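&lt;p&gt;These variables are read at runtime by the pipeline tasks. A small helper keeps the lookups and the documented default endpoint in one place (the helper itself is a sketch, not pipeline code):&lt;/p&gt;

```python
import os

def minio_settings(env=os.environ):
    """Collect MinIO connection settings, falling back to the
    documented default endpoint when MINIO_ENDPOINT is unset."""
    return {
        "endpoint": env.get("MINIO_ENDPOINT", "http://172.17.0.1:9000"),
        "access_key": env.get("MINIO_ACCESS_KEY"),
        "secret_key": env.get("MINIO_SECRET_KEY"),
    }

print(minio_settings({})["endpoint"])  # http://172.17.0.1:9000
```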

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu7zd7pbm8z8856s4pyy9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu7zd7pbm8z8856s4pyy9.png" alt="Airflow DAG graph showing extract → transform → load tasks" width="800" height="833"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Airflow DAG graph showing extract → transform → load tasks&lt;/em&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  Step 5: Visualization with Grafana
&lt;/h2&gt;

&lt;p&gt;Grafana connects directly to PostgreSQL to visualize key metrics.&lt;/p&gt;
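&lt;p&gt;One way to wire that connection is Grafana's datasource provisioning. The fragment below is a sketch with placeholder names and credentials; adapt the fields to your Grafana version and Postgres setup:&lt;/p&gt;

```
# provisioning/datasources/postgres.yaml (placeholder values)
apiVersion: 1
datasources:
  - name: YouTubeAnalytics
    type: postgres
    url: postgres:5432
    user: grafana_reader
    secureJsonData:
      password: ${POSTGRES_PASSWORD}
    jsonData:
      database: airflow
      sslmode: disable
```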
&lt;h3&gt;
  
  
  Example Queries
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;1. Overview Metrics&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; 
  &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;view_count&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;total_views&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;subscriber_count&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;total_subscribers&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;DISTINCT&lt;/span&gt; &lt;span class="n"&gt;channel_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;total_channels&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="nv"&gt;"raye_youtube_channel_stats"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;2. Engagement Rate&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; 
  &lt;span class="n"&gt;ROUND&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;AVG&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;like_count&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;comment_count&lt;/span&gt;&lt;span class="p"&gt;)::&lt;/span&gt;&lt;span class="nb"&gt;numeric&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="k"&gt;NULLIF&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;view_count&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;avg_engagement_rate&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;like_count&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;total_likes&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;comment_count&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;total_comments&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="nv"&gt;"raye_youtube_channel_stats"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;3. Country Breakdown&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; 
    &lt;span class="n"&gt;published_at&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="nb"&gt;time&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;      
    &lt;span class="n"&gt;country&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
    &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;view_count&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;total_views&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="nv"&gt;"raye_youtube_channel_stats"&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;published_at&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;country&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="nb"&gt;time&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;total_views&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flmpb38yce8jgaeslt1r0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flmpb38yce8jgaeslt1r0.png" alt="Country Metrics" width="800" height="332"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Grafana panels showing engagement and country metrics&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Insights
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Engagement Peaks:&lt;/strong&gt; Engagement rates spike around high-visibility video releases&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;View Concentration:&lt;/strong&gt; Most traffic originates from English-speaking regions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Content Rhythm:&lt;/strong&gt; Publishing trends show periodic releases tied to album cycles&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcugsyiesuxpsbg1ssx68.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcugsyiesuxpsbg1ssx68.png" alt="More on engagements" width="800" height="584"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Chart highlighting peak engagement days&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Tech Stack Summary
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Layer&lt;/th&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Orchestration&lt;/td&gt;
&lt;td&gt;Apache Airflow&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Data Storage&lt;/td&gt;
&lt;td&gt;MinIO (S3-compatible)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Transformation&lt;/td&gt;
&lt;td&gt;PySpark&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Warehouse&lt;/td&gt;
&lt;td&gt;PostgreSQL&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Visualization&lt;/td&gt;
&lt;td&gt;Grafana&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Containerization&lt;/td&gt;
&lt;td&gt;Docker Compose&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Automation Summary
&lt;/h2&gt;

&lt;p&gt;Each Airflow DAG run performs the full cycle:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Extract:&lt;/strong&gt; Fetch YouTube channel data&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Transform:&lt;/strong&gt; Clean and compute new metrics via PySpark&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Load:&lt;/strong&gt; Write clean results into PostgreSQL&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Visualize:&lt;/strong&gt; Grafana auto-refreshes metrics in near real time&lt;/li&gt;
&lt;/ol&gt;
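&lt;p&gt;Stripped of Airflow specifics, the cycle above is three functions run in strict order. A toy runner makes the dependency chain explicit (the function bodies are stand-ins, not the real tasks):&lt;/p&gt;

```python
def extract():
    # stand-in for the YouTube Data API call
    return [{"channel_id": "UC123", "view_count": "1000000", "video_count": "40"}]

def transform(rows):
    # stand-in for the PySpark job: cast types, derive views_per_video
    out = []
    for r in rows:
        views, videos = int(r["view_count"]), int(r["video_count"]) or 1
        out.append({**r, "view_count": views, "views_per_video": views / videos})
    return out

def load(rows, warehouse):
    # stand-in for the PostgreSQL load; returns rows written
    warehouse.extend(rows)
    return len(rows)

warehouse = []
loaded = load(transform(extract()), warehouse)
print(loaded, warehouse[0]["views_per_video"])  # 1 25000.0
```

&lt;p&gt;Airflow adds what this toy runner lacks: scheduling, retries, logging, and isolation between the steps.&lt;/p&gt;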




&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;PySpark and MinIO enable scalable, cloud-like ETL locally&lt;/li&gt;
&lt;li&gt;Airflow provides robust scheduling and retry mechanisms&lt;/li&gt;
&lt;li&gt;Grafana and PostgreSQL make analytics exploration seamless&lt;/li&gt;
&lt;li&gt;Modular design allows reuse across multiple data sources or APIs&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;This project went beyond dashboards and data pipelines — it told a story about how an artist's digital rhythm mirrors their creative journey. By building a robust analytics workflow for Raye's YouTube channel, we connected raw engagement metrics to real-world momentum — from viral singles to album releases.&lt;/p&gt;

&lt;p&gt;The pipeline's architecture, powered by &lt;strong&gt;Apache Airflow&lt;/strong&gt;, &lt;strong&gt;PySpark&lt;/strong&gt;, &lt;strong&gt;MinIO&lt;/strong&gt;, &lt;strong&gt;PostgreSQL&lt;/strong&gt;, and &lt;strong&gt;Grafana&lt;/strong&gt;, proved not just scalable but insightful — offering a live pulse on fan interactions, audience geography, and engagement surges tied to content drops.&lt;/p&gt;

&lt;p&gt;As a next step, the same framework can be extended to analyze cross-platform trends (Spotify, Instagram, TikTok) and measure how each channel amplifies an artist's reach in the streaming era.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2fs3bx6sn8oz16mb4fni.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2fs3bx6sn8oz16mb4fni.png" alt="Dashboard shot" width="800" height="392"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Final dashboard hero shot&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Data meets artistry — and every like, view, and comment becomes a note in the bigger symphony of audience connection.&lt;/em&gt;&lt;/p&gt;




</description>
      <category>dataengineering</category>
      <category>analytics</category>
      <category>architecture</category>
      <category>python</category>
    </item>
  </channel>
</rss>
