Free Hands-On Databricks Labs
5 open-source projects you can run on Databricks Free Edition — from streaming pipelines to certification prep
Practice on real projects, not toy examples. These are open-source hands-on labs designed to run on Databricks Free Edition. Each one builds a realistic pipeline from scratch and teaches production patterns you’ll actually use at work.
I’m building these labs in the open. If you have feedback or want to see a new lab topic, drop a comment below!
What’s Inside:
Certification Prep Labs — aligned to Associate and Spark Developer exams
Production Pattern Labs — streaming, optimization, real-time monitoring
1. Certification Prep Labs
Prepare for Databricks certifications with exam-aligned, hands-on practice.
Data Engineer Associate Certification Prep ⭐
🛠️ GitHub: databricks_data_engineer_associate_cert_prep
Build an end-to-end data pipeline covering all five exam domains: platform fundamentals, ingestion (Auto Loader, COPY INTO), medallion architecture transformations, job orchestration, and Unity Catalog governance. Includes numbered notebooks with guided exercises, TODO sections, and solutions for self-verification.
You’ll learn: Auto Loader, COPY INTO, medallion architecture, SCD Type 2, multi-task workflows, Unity Catalog access control, schema evolution, data quality validation.
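One concept from that list worth internalizing before the lab is SCD Type 2: instead of overwriting a changed dimension row, you expire the old version and append a new current one. Here is a minimal plain-Python sketch of that logic (the lab itself implements it with Delta Lake MERGE; the function and column names below are illustrative assumptions, not the lab's code):

```python
from datetime import date

def scd2_upsert(dim_rows, incoming, key, tracked, today):
    """Apply an SCD Type 2 update: close the current row for a changed
    key and append a new current row, preserving full history."""
    for new in incoming:
        current = next(
            (r for r in dim_rows if r[key] == new[key] and r["is_current"]),
            None,
        )
        if current and current[tracked] != new[tracked]:
            current["is_current"] = False   # expire the old version
            current["end_date"] = today
        if current is None or current[tracked] != new[tracked]:
            dim_rows.append({**new, "start_date": today,
                             "end_date": None, "is_current": True})
    return dim_rows

# Customer 1 upgrades from bronze to gold tier:
dim = [{"customer_id": 1, "tier": "bronze",
        "start_date": date(2025, 1, 1), "end_date": None, "is_current": True}]
dim = scd2_upsert(dim, [{"customer_id": 1, "tier": "gold"}],
                  "customer_id", "tier", date(2025, 6, 1))
# dim now holds two rows for customer 1: an expired bronze version
# and a current gold version
```

The same expire-then-append shape is what a Delta MERGE with `whenMatchedUpdate` plus an insert branch expresses declaratively.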
PySpark Certification Zenith
🛠️ GitHub: databricks_pyspark_cert_zenith
Build an e-commerce analytics pipeline for “Zenith Online” — processing user events, customer profiles, and product data into business-ready aggregations. Covers the full Associate Developer for Apache Spark exam scope through a realistic scenario with intentional data quality challenges.
You’ll learn: DataFrame API, Spark SQL, Structured Streaming, broadcast joins, salting for skewed data, watermark-based deduplication, Pandas UDFs, window functions, partitioning strategies.
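Salting, from that list, is the trick that trips up the most exam takers. The idea: when one join key is so hot that all its rows land on one executor, you append a random suffix on the fact side and replicate the dimension side once per suffix, so the hot key fans out across partitions. A plain-Python sketch of the key manipulation (the lab does this on PySpark DataFrames; `NUM_SALTS` is a tuning assumption):

```python
import random

NUM_SALTS = 8  # sub-partitions per hot key (tuning assumption)

def salt_fact_key(key):
    """Fact side: append a random salt so a single hot key spreads
    across NUM_SALTS partitions instead of landing on one executor."""
    return f"{key}#{random.randrange(NUM_SALTS)}"

def explode_dim_key(key):
    """Dimension side: replicate each key once per salt value so
    every salted fact key still finds its match in the join."""
    return [f"{key}#{i}" for i in range(NUM_SALTS)]

random.seed(42)
salted = [salt_fact_key("hot_product") for _ in range(1000)]
buckets = {k: salted.count(k) for k in set(salted)}
# 1000 rows for one hot key are now spread roughly evenly over
# NUM_SALTS buckets, each of which joins against its replicated
# dimension key from explode_dim_key("hot_product")
```

The cost is an 8x blow-up of the dimension side, which is why salting pairs naturally with a broadcast of the (small) exploded dimension.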
2. Production Pattern Labs
Apply production-grade patterns: streaming architectures, performance optimization, and real-time monitoring.
DLT Apparel Streaming Pipeline
🛠️ GitHub: databricks_apparel_streaming
Build a complete Delta Live Tables streaming pipeline for a simulated apparel company. Ingest synthetic retail transactions, progressively clean and transform through Bronze/Silver/Gold layers, and produce business-ready datasets with data quality expectations enforced at every stage.
You’ll learn: Delta Live Tables, streaming ingestion, medallion architecture, data quality expectations, SCD Type 2 for slowly changing dimensions, Unity Catalog integration.
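Data quality expectations are the distinctive DLT feature here: each table declares named predicates, failing rows are dropped (or flagged, or fail the pipeline), and DLT records a failure count per expectation. A plain-Python analogue of the drop-and-count behavior, to show the shape of what `@dlt.expect_or_drop` gives you (the function, rows, and expectation names below are invented for illustration):

```python
def apply_expectations(rows, expectations):
    """Return (passing_rows, metrics): rows failing any named
    expectation are dropped, and metrics counts failures per name,
    mirroring the per-expectation quality metrics DLT records."""
    metrics = {name: 0 for name in expectations}
    passing = []
    for row in rows:
        failed = [n for n, pred in expectations.items() if not pred(row)]
        for n in failed:
            metrics[n] += 1
        if not failed:
            passing.append(row)
    return passing, metrics

silver, stats = apply_expectations(
    [{"order_id": 1, "qty": 2},
     {"order_id": None, "qty": 1},    # fails valid_order_id
     {"order_id": 3, "qty": -5}],     # fails positive_qty
    {"valid_order_id": lambda r: r["order_id"] is not None,
     "positive_qty": lambda r: r["qty"] > 0},
)
# silver keeps only the first row; stats shows one failure per expectation
```

In the lab, the same predicates live as SQL expressions in decorators, and the metrics surface in the DLT event log rather than a returned dict.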
Delta Lake Optimization Techniques
🛠️ GitHub: databricks_optimization_techniques
Generate a 50-million-row synthetic dataset, then systematically apply six optimization techniques while measuring the impact on files scanned, bytes read, and query runtime. Observation over theory: you’ll build an empirical measurement log to justify each optimization decision.
You’ll learn: Partitioning, Z-Ordering, manual compaction (OPTIMIZE), Auto Optimize, Liquid Clustering, VACUUM, DESCRIBE DETAIL diagnostics, Spark UI metrics.
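The thread connecting partitioning, Z-Ordering, and Liquid Clustering is data skipping: each Parquet file in a Delta table carries min/max statistics per column, and a query only reads files whose range can contain the predicate value. A toy model of that pruning decision (the file ranges below are invented numbers for illustration; the lab measures the real effect via `DESCRIBE DETAIL` and the Spark UI):

```python
# Each tuple is a file's (min, max) for an event_id column.
files_unclustered = [   # before clustering: every file spans almost
    (5, 980), (12, 995), (3, 970), (20, 999),   # the whole key range
]
files_clustered = [     # after Z-Ordering / clustering on event_id:
    (0, 249), (250, 499), (500, 749), (750, 999),  # tight, disjoint ranges
]

def files_scanned(files, value):
    """Count files whose [min, max] range could contain `value` --
    the files a point-lookup query must actually read."""
    return sum(1 for lo, hi in files if lo <= value <= hi)

before = files_scanned(files_unclustered, 42)  # overlapping ranges: all 4
after = files_scanned(files_clustered, 42)     # tight ranges: just 1
```

Clustering doesn't make any single file faster to read; it makes the min/max ranges narrow and disjoint so most files can be skipped entirely, which is exactly the "files scanned" metric the lab has you log.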
Real-Time Fintech Monitoring Pipeline
🛠️ GitHub: databricks_fintech_monitoring
Build a real-time fraud detection system for a payment processor handling 500K+ daily transactions. Ingest streaming JSON with Auto Loader, deduplicate payment retries with watermarks, enrich events via stream-static joins, and detect suspicious patterns using a windowed rules engine.
You’ll learn: Auto Loader with rescued data column, watermarked deduplication, stream-static joins, tumbling and sliding window aggregations, Liquid Clustering, medallion architecture for dual SLAs.
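Watermarked deduplication is worth seeing in slow motion before the lab. The watermark bounds how late a payment retry may arrive; duplicate IDs inside that horizon are dropped, and state for IDs older than the horizon is evicted so memory stays bounded. A plain-Python sketch of that state machine (Spark does this declaratively with `withWatermark` plus `dropDuplicates`; the 10-minute horizon and event values are assumptions):

```python
from datetime import datetime, timedelta

WATERMARK = timedelta(minutes=10)  # how late a retry may arrive (assumption)

def dedupe_stream(events):
    """Drop duplicate payment_ids seen within the watermark horizon.
    State older than max_event_time - WATERMARK is evicted, bounding
    memory the way Spark's withWatermark + dropDuplicates does."""
    seen, out, max_ts = {}, [], datetime.min
    for ts, payment_id in events:
        max_ts = max(max_ts, ts)
        horizon = max_ts - WATERMARK
        seen = {p: t for p, t in seen.items() if t >= horizon}  # evict old state
        if ts >= horizon and payment_id not in seen:
            seen[payment_id] = ts
            out.append((ts, payment_id))
    return out

t0 = datetime(2026, 3, 8, 12, 0)
deduped = dedupe_stream([
    (t0, "p1"),
    (t0 + timedelta(minutes=1), "p1"),    # retry inside horizon: dropped
    (t0 + timedelta(minutes=30), "p2"),   # advances the watermark
    (t0 + timedelta(minutes=5), "p1"),    # behind the watermark: dropped
])
# deduped keeps only the first "p1" and "p2"
```

The trade-off the lab makes concrete: a longer watermark catches later retries but holds more state; a shorter one is cheaper but can let a very late duplicate through.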
📬 Get more Databricks career resources: certification guides, learning paths, and career tips delivered to your inbox.
Last updated: March 8, 2026

