Free Hands-On Databricks Labs
5 open-source projects you can run on Databricks Free Edition — from streaming pipelines to certification prep
Practice on real projects, not toy examples. These are open-source hands-on labs designed to run on Databricks Free Edition. Each one builds a realistic pipeline from scratch and teaches production patterns you’ll actually use at work.
I’m building these labs in the open. If you have feedback or want to see a new lab topic, drop a comment below!
What’s Inside:
Certification Prep Labs — aligned to Associate and Spark Developer exams
Production Pattern Labs — streaming, optimization, real-time monitoring
1. Certification Prep Labs
Prepare for Databricks certifications with exam-aligned, hands-on practice.
Data Engineer Associate Certification Prep ⭐
🛠️ GitHub: databricks_data_engineer_associate_cert_prep
Build an end-to-end data pipeline covering all five exam domains: platform fundamentals, ingestion (Auto Loader, COPY INTO), medallion architecture transformations, job orchestration, and Unity Catalog governance. Includes numbered notebooks with guided exercises, TODO sections, and solutions for self-verification.
You’ll learn: Auto Loader, COPY INTO, medallion architecture, SCD Type 2, multi-task workflows, Unity Catalog access control, schema evolution, data quality validation.
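One concept from that list worth internalizing before the lab is SCD Type 2: instead of overwriting a changed dimension row, you expire the old version and append a new current one. Here is a minimal plain-Python sketch of that logic (the lab itself implements it with Delta Lake MERGE; the function and column names below are illustrative assumptions, not the lab's code):

```python
from datetime import date

def scd2_upsert(dim_rows, incoming, key, tracked, today):
    """Apply an SCD Type 2 update: close the current row for a changed
    key and append a new current row, preserving full history."""
    for new in incoming:
        current = next(
            (r for r in dim_rows if r[key] == new[key] and r["is_current"]),
            None,
        )
        if current and current[tracked] != new[tracked]:
            current["is_current"] = False   # expire the old version
            current["end_date"] = today
        if current is None or current[tracked] != new[tracked]:
            dim_rows.append({**new, "start_date": today,
                             "end_date": None, "is_current": True})
    return dim_rows

# Customer 1 upgrades from bronze to gold tier:
dim = [{"customer_id": 1, "tier": "bronze",
        "start_date": date(2025, 1, 1), "end_date": None, "is_current": True}]
dim = scd2_upsert(dim, [{"customer_id": 1, "tier": "gold"}],
                  "customer_id", "tier", date(2025, 6, 1))
# dim now holds two rows for customer 1: an expired bronze version
# and a current gold version
```

The same expire-then-append shape is what a Delta MERGE with `whenMatchedUpdate` plus an insert branch expresses declaratively.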
PySpark Certification Zenith
🛠️ GitHub: databricks_pyspark_cert_zenith
Build an e-commerce analytics pipeline for “Zenith Online” — processing user events, customer profiles, and product data into business-ready aggregations. Covers the full Associate Developer for Apache Spark exam scope through a realistic scenario with intentional data quality challenges.
You’ll learn: DataFrame API, Spark SQL, Structured Streaming, broadcast joins, salting for skewed data, watermark-based deduplication, Pandas UDFs, window functions, partitioning strategies.
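Salting, from that list, is the trick that trips up the most exam takers. The idea: when one join key is so hot that all its rows land on one executor, you append a random suffix on the fact side and replicate the dimension side once per suffix, so the hot key fans out across partitions. A plain-Python sketch of the key manipulation (the lab does this on PySpark DataFrames; `NUM_SALTS` is a tuning assumption):

```python
import random

NUM_SALTS = 8  # sub-partitions per hot key (tuning assumption)

def salt_fact_key(key):
    """Fact side: append a random salt so a single hot key spreads
    across NUM_SALTS partitions instead of landing on one executor."""
    return f"{key}#{random.randrange(NUM_SALTS)}"

def explode_dim_key(key):
    """Dimension side: replicate each key once per salt value so
    every salted fact key still finds its match in the join."""
    return [f"{key}#{i}" for i in range(NUM_SALTS)]

random.seed(42)
salted = [salt_fact_key("hot_product") for _ in range(1000)]
buckets = {k: salted.count(k) for k in set(salted)}
# 1000 rows for one hot key are now spread roughly evenly over
# NUM_SALTS buckets, each of which joins against its replicated
# dimension key from explode_dim_key("hot_product")
```

The cost is an 8x blow-up of the dimension side, which is why salting pairs naturally with a broadcast of the (small) exploded dimension.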
2. Production Pattern Labs
Apply production-grade patterns: streaming architectures, performance optimization, and real-time monitoring.
DLT Apparel Streaming Pipeline
🛠️ GitHub: databricks_apparel_streaming
Build a complete Delta Live Tables streaming pipeline for a simulated apparel company. Ingest synthetic retail transactions, progressively clean and transform through Bronze/Silver/Gold layers, and produce business-ready datasets with data quality expectations enforced at every stage.
You’ll learn: Delta Live Tables, streaming ingestion, medallion architecture, data quality expectations, SCD Type 2 for slowly changing dimensions, Unity Catalog integration.
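Data quality expectations are the distinctive DLT feature here: each table declares named predicates, failing rows are dropped (or flagged, or fail the pipeline), and DLT records a failure count per expectation. A plain-Python analogue of the drop-and-count behavior, to show the shape of what `@dlt.expect_or_drop` gives you (the function, rows, and expectation names below are invented for illustration):

```python
def apply_expectations(rows, expectations):
    """Return (passing_rows, metrics): rows failing any named
    expectation are dropped, and metrics counts failures per name,
    mirroring the per-expectation quality metrics DLT records."""
    metrics = {name: 0 for name in expectations}
    passing = []
    for row in rows:
        failed = [n for n, pred in expectations.items() if not pred(row)]
        for n in failed:
            metrics[n] += 1
        if not failed:
            passing.append(row)
    return passing, metrics

silver, stats = apply_expectations(
    [{"order_id": 1, "qty": 2},
     {"order_id": None, "qty": 1},    # fails valid_order_id
     {"order_id": 3, "qty": -5}],     # fails positive_qty
    {"valid_order_id": lambda r: r["order_id"] is not None,
     "positive_qty": lambda r: r["qty"] > 0},
)
# silver keeps only the first row; stats shows one failure per expectation
```

In the lab, the same predicates live as SQL expressions in decorators, and the metrics surface in the DLT event log rather than a returned dict.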
Delta Lake Optimization Techniques
🛠️ GitHub: databricks_optimization_techniques
Generate a 50-million-row synthetic dataset, then systematically apply six optimization techniques while measuring the impact on files scanned, bytes read, and query runtime. Observation over theory: you’ll build an empirical measurement log to justify each optimization decision.
You’ll learn: Partitioning, Z-Ordering, manual compaction (OPTIMIZE), Auto Optimize, Liquid Clustering, VACUUM, DESCRIBE DETAIL diagnostics, Spark UI metrics.
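The thread connecting partitioning, Z-Ordering, and Liquid Clustering is data skipping: each Parquet file in a Delta table carries min/max statistics per column, and a query only reads files whose range can contain the predicate value. A toy model of that pruning decision (the file ranges below are invented numbers for illustration; the lab measures the real effect via `DESCRIBE DETAIL` and the Spark UI):

```python
# Each tuple is a file's (min, max) for an event_id column.
files_unclustered = [   # before clustering: every file spans almost
    (5, 980), (12, 995), (3, 970), (20, 999),   # the whole key range
]
files_clustered = [     # after Z-Ordering / clustering on event_id:
    (0, 249), (250, 499), (500, 749), (750, 999),  # tight, disjoint ranges
]

def files_scanned(files, value):
    """Count files whose [min, max] range could contain `value` --
    the files a point-lookup query must actually read."""
    return sum(1 for lo, hi in files if lo <= value <= hi)

before = files_scanned(files_unclustered, 42)  # overlapping ranges: all 4
after = files_scanned(files_clustered, 42)     # tight ranges: just 1
```

Clustering doesn't make any single file faster to read; it makes the min/max ranges narrow and disjoint so most files can be skipped entirely, which is exactly the "files scanned" metric the lab has you log.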
Real-Time Fintech Monitoring Pipeline
🛠️ GitHub: databricks_fintech_monitoring
Build a real-time fraud detection system for a payment processor handling 500K+ daily transactions. Ingest streaming JSON with Auto Loader, deduplicate payment retries with watermarks, enrich events via stream-static joins, and detect suspicious patterns using a windowed rules engine.
You’ll learn: Auto Loader with rescued data column, watermarked deduplication, stream-static joins, tumbling and sliding window aggregations, Liquid Clustering, medallion architecture for dual SLAs.
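Watermarked deduplication is worth seeing in slow motion before the lab. The watermark bounds how late a payment retry may arrive; duplicate IDs inside that horizon are dropped, and state for IDs older than the horizon is evicted so memory stays bounded. A plain-Python sketch of that state machine (Spark does this declaratively with `withWatermark` plus `dropDuplicates`; the 10-minute horizon and event values are assumptions):

```python
from datetime import datetime, timedelta

WATERMARK = timedelta(minutes=10)  # how late a retry may arrive (assumption)

def dedupe_stream(events):
    """Drop duplicate payment_ids seen within the watermark horizon.
    State older than max_event_time - WATERMARK is evicted, bounding
    memory the way Spark's withWatermark + dropDuplicates does."""
    seen, out, max_ts = {}, [], datetime.min
    for ts, payment_id in events:
        max_ts = max(max_ts, ts)
        horizon = max_ts - WATERMARK
        seen = {p: t for p, t in seen.items() if t >= horizon}  # evict old state
        if ts >= horizon and payment_id not in seen:
            seen[payment_id] = ts
            out.append((ts, payment_id))
    return out

t0 = datetime(2026, 3, 8, 12, 0)
deduped = dedupe_stream([
    (t0, "p1"),
    (t0 + timedelta(minutes=1), "p1"),    # retry inside horizon: dropped
    (t0 + timedelta(minutes=30), "p2"),   # advances the watermark
    (t0 + timedelta(minutes=5), "p1"),    # behind the watermark: dropped
])
# deduped keeps only the first "p1" and "p2"
```

The trade-off the lab makes concrete: a longer watermark catches later retries but holds more state; a shorter one is cheaper but can let a very late duplicate through.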
📬 Get more Databricks career resources: certification guides, learning paths, and career tips delivered to your inbox.
Last updated: March 8, 2026

