Completing DataTalksClub’s Data Engineering Zoomcamp: My Module 1 Journey with Docker and Terraform


I just wrapped up Module 1 of DataTalksClub’s Data Engineering Zoomcamp, and it’s been an exciting dive into the fundamentals of modern data engineering! This module focused on Docker, Postgres, and Terraform – tools every data engineer should know. Here’s a breakdown of my experience, key takeaways, and the code I built along the way.


What I Learned

  1. Docker Basics:

    • Containerizing applications (like Postgres) to avoid "it works on my machine" chaos.

    • Writing a Dockerfile and a docker-compose.yaml for multi-service setups.

  2. Data Ingestion with Python:

    • Using pandas to load NYC Taxi data into Postgres and pgcli to inspect the result (quick sanity-check sketch after this list).

    • Writing reusable scripts for dataset validation and pipeline automation.

  3. Infrastructure as Code (Terraform):

    • Provisioning Google Cloud Platform (GCP) resources (BigQuery, GCS) via Terraform.

    • Managing state files and modular configurations.
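
As a quick sanity check after ingestion, pgcli makes it easy to poke at the loaded table. A minimal sketch, assuming the credentials and database from the Docker setup shown below; the table name yellow_taxi_data is just illustrative, use whatever your ingestion script creates:

```bash
# Connect to the containerized Postgres
# (credentials match the docker run command in the next section)
pgcli -h localhost -p 5432 -u root -d ny_taxi

# Inside pgcli, validate the load:
#   \dt                                      -- list tables
#   SELECT COUNT(1) FROM yellow_taxi_data;   -- illustrative table name
```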


My Code & Setup

Check out my GitHub repo for Module 1 here:
👉 DTC_dataEngg/module1-hw

Highlights:

  • Docker Workflow (image build sketched after this list):

    ```bash
    # Spin up Postgres + pgAdmin
    docker-compose up -d

    # Run ingestion script
    docker run -it --network=hw_default taxi_ingest:v1 \
      --user=root --password=root --host=pgdatabase --port=5432 --db=ny_taxi
    ```

  • Terraform for GCP:
    Defined modules for BigQuery datasets and GCS buckets to ensure reproducibility (CLI workflow sketched after this list).
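
One step the Docker snippet above glosses over: the taxi_ingest:v1 image has to exist before docker run can start it. A minimal build sketch, assuming the ingestion Dockerfile sits in the current directory:

```bash
# Build the ingestion image (assumes the Dockerfile is in the
# current directory; adjust the path if yours lives elsewhere)
docker build -t taxi_ingest:v1 .

# docker-compose names its default network <project-folder>_default,
# which is where hw_default comes from; verify with:
docker network ls
```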
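
On the Terraform side, the day-to-day loop is the standard init/plan/apply cycle. A sketch of the commands, assuming you have already authenticated to GCP and set your project variables:

```bash
terraform init      # download the Google provider and initialize state
terraform plan      # preview the BigQuery dataset and GCS bucket changes
terraform apply     # provision the resources in GCP
terraform destroy   # tear everything down to stay inside the free tier
```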


Challenges & Wins

  • Docker Networking: Debugging container communication (e.g., Python script → Postgres) was tricky at first (debugging commands sketched below).

  • Terraform State: Learned to manage .tfstate files properly to avoid config drift.

  • BigQuery Schema Auto-Detection: Tweaked my Python script to handle datatype mismatches.

Win: Successfully orchestrated a local-to-cloud pipeline using free-tier tools!
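
If you hit the same networking wall, these two commands help narrow things down. Assuming the compose service is named pgdatabase, as in the run command above:

```bash
# See which containers are attached to the compose network
# (the ingestion container and Postgres must share it)
docker network inspect hw_default

# Follow the Postgres service logs for refused or failed connections
docker-compose logs -f pgdatabase
```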


Why This Matters

Module 1 taught me how containerization and IaC solve critical problems in data engineering:

  • Reproducibility: Docker ensures pipelines run identically across environments.

  • Scalability: Terraform automates cloud resource provisioning, saving hours of manual setup.

Join the Discussion!

Are you also doing the Zoomcamp? How did your Module 1 go? Let's connect!

Tags: #DataEngineering #Docker #Terraform #Zoomcamp #DataPipeline #GCP