Mastering the Art of Data Engineering and Analytics
← Back to Main Page
Peace, I hope you are well. Thanks for coming to my page! I created this dedicated space to showcase all the concepts, ideas, definitions, acronyms, visuals, technologies, and programs I'm mastering as I dive deeper into Databricks Engineering.
This journey isn't just personal growth; it's strategic. My company needs significant help in this area, so I'm committed to learning as much as possible to support both their success and my career development as a Databricks Engineer.
Below you'll find documentation and definitions to accelerate your own Databricks Engineer journey, with enough repetition to ensure these crucial concepts stick in your mind for the long haul.
Python: Created by Guido van Rossum in 1991; widely used in Databricks for data analysis, ML, and scripting.
SQL: Developed in the 1970s at IBM for relational databases; core for querying structured data in Databricks.
R: Developed by Ross Ihaka and Robert Gentleman in the early 1990s for statistical computing; used in Databricks for analytics.
Scala: Created by Martin Odersky in 2003; a JVM language, used in Databricks for Spark jobs.
Apache Spark: Developed at UC Berkeley in 2009; the distributed computing engine for big data processing at the core of Databricks.
Data Lake: Concept emerged in the 2010s; centralized storage for raw structured/unstructured data, foundational for Databricks.
Delta Lake: Open-source storage layer from Databricks (2019) adding ACID transactions to data lakes.
ACID: Atomicity, Consistency, Isolation, Durability; ensures reliable transactions in databases and Delta Lake.
ETL (Extract, Transform, Load): Traditional data pipeline; transform data before loading it into the warehouse.
ELT (Extract, Load, Transform): Modern pipeline (common in Databricks); load raw data first, then transform in place (often using Delta Lake).
Medallion Architecture: Databricks ETL/ELT pattern: Bronze (raw), Silver (cleaned), Gold (aggregated insights). See the pipeline sketch after this list.
Unity Catalog: Databricks' unified governance layer (2021+) for data and AI assets (tables, files, ML models).
Notebooks: Interactive coding environment (like Jupyter) integrated in Databricks for multi-language workflows.
Magic Commands: Shorthand in notebooks (e.g., %fs, %sh) to interact with files and the shell; examples below.
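To make the Delta Lake, ELT, and medallion ideas concrete, here's a minimal sketch of a Bronze/Silver/Gold pipeline in PySpark. It assumes a Databricks notebook where `spark` is already defined; the schema, table, and path names are hypothetical placeholders.

```python
# Minimal medallion-style ELT sketch for a Databricks notebook,
# where `spark` (a SparkSession) is predefined.
# All paths and table names below are hypothetical.
from pyspark.sql import functions as F

# Bronze: land the raw data as-is in a Delta table (ELT: load first).
raw = spark.read.json("/Volumes/demo/raw/events/")  # hypothetical source path
raw.write.format("delta").mode("append").saveAsTable("demo.bronze_events")

# Silver: clean and deduplicate. Delta's ACID guarantees mean readers
# never see a half-written table, even if this job fails midway.
bronze = spark.read.table("demo.bronze_events")
silver = (bronze
          .dropDuplicates(["event_id"])
          .filter(F.col("event_ts").isNotNull()))
silver.write.format("delta").mode("overwrite").saveAsTable("demo.silver_events")

# Gold: aggregate into analysis-ready insights.
gold = (spark.read.table("demo.silver_events")
        .groupBy("event_date")
        .agg(F.count("*").alias("event_count")))
gold.write.format("delta").mode("overwrite").saveAsTable("demo.gold_daily_events")
```

Because every write here is a Delta transaction, a failed job leaves the previous table version intact rather than a half-written mess.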
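And here's what those notebook magics look like in practice. Each line below would live in its own notebook cell:

```
%fs ls /databricks-datasets
%sh echo "this runs as a shell command on the driver node"
%sql SELECT current_date()
%md This cell renders as Markdown.
```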
Out of all the Databricks books I've sampled on Kindle, the Databricks Certified Data Engineer Associate study guide (with the bird cover) and Databricks Lakehouse Platform (with the beaver cover) have been the most comprehensive and in-depth.

I got some APIs and data from CMS; here are the links for reference.
Marketplace API Key Request
CMS Public APIs Overview
Here are some of the key public APIs available:
Marketplace API
Powers HealthCare.gov with plan, provider, and coverage data (see the request sketch after this list).
Procedure Price Lookup (PPL) API
Provides cost data for ~3,900 medical procedures.
Blue Button 2.0 API
Lets Medicare beneficiaries share claims data with apps and services.
Beneficiary Claims Data API (BCDA)
For Accountable Care Organizations (ACOs) and other care organizations to access Medicare claims data.
AB2D API
Gives Medicare Part D plan sponsors bulk access to Medicare Parts A and B claims data.
Data at the Point of Care API
Enables providers to access claims data for patients under active care.
Finder API
Helps users find private health plans outside the Marketplace.
Quality Payment Program (QPP) Submissions API
For submitting QPP data and receiving performance feedback.
Provider Directory API
Offers access to provider and facility data.
Coverage Inspector & JSON Validator Tools
Machine-readable tools for validating coverage and schema formats.
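As a quick illustration of working with one of these, below is a minimal sketch of a Marketplace API call in Python. The endpoint path, the `apikey` query parameter, and the response fields are my assumptions from reading the public docs; verify everything against the CMS developer portal before relying on it.

```python
# Minimal sketch: query the CMS Marketplace API for counties by ZIP code.
# The base URL, endpoint path, `apikey` parameter, and response fields are
# assumptions drawn from the public docs; check the CMS developer portal.
import os
import requests

API_KEY = os.environ["MARKETPLACE_API_KEY"]  # key from the request form linked below
BASE_URL = "https://marketplace.api.healthcare.gov/api/v1"

resp = requests.get(
    f"{BASE_URL}/counties/by/zip/27360",  # hypothetical example ZIP code
    params={"apikey": API_KEY},
    timeout=30,
)
resp.raise_for_status()
for county in resp.json().get("counties", []):  # assumed response shape
    print(county.get("name"), county.get("fips"))
```

The key itself comes from the Marketplace API key request form in the links below.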
My GitHub
CMS Marketplace API: CMS Marketplace API Key Request
CMS General Developer / API Docs: CMS Developer Portal
CMS Data (data.cms.gov): CMS Data Search
Healthcare.gov / Marketplace Data: Healthcare.gov Public Use Files
Medicaid.gov Data: Medicaid Datasets
Open Payments (CMS): Open Payments Data Downloads
ResDAC Data Request: ResDAC Research Data Request Form
External / Training / Personal: Databricks Lakehouse Architecture Training; CMS Marketplace API