Databricks fused and seamlessly integrated all of these technologies into their Databricks Lakehouse Platform and made it run equally well on Amazon, Azure and Google cloud platforms. They received more than $3B (!) in funding. They have more than 3000 employees and more than 7000 customers (a couple more exclamation marks).
So, what gives? Why are they so popular in the data analytics world? And more importantly: why should you trust them with your data?
We give you 7 awesome reasons.
1. Creators of Spark
The founders of Databricks are also the original creators of Apache Spark, a general-purpose, distributed data processing engine. Spark has been a huge success. Today, most of the world’s companies working with data are probably also using Spark in some capacity. Spark offers Scala, Java, Python and R APIs for writing data manipulation programs, and its API components comprise machine learning, graph processing, streaming, and SQL functions. It integrates well with many (all?) open-source and proprietary tools and systems and is even embedded in many of them, such as AWS Glue and H2O.
Spark has been smoothly integrated into the Databricks platform. Almost everything you ask the platform to do gets executed in Spark, and the platform offers great insight into Spark’s internals: the nitty-gritty details of your job’s or query’s performance and execution are available at your fingertips.
Spark cluster management is cleverly handled by the platform itself.
2. They keep inventing: The Lakehouse concept
Databricks seems to have a knack for creating wave-making inventions. They came up with another one that has already gone mainstream: the Lakehouse concept. In short, the Lakehouse marries data warehouses and data lakes.
Data warehouses (DWH) have been successfully used since the ’80s for data analytics and data mining in many industries. They are built for efficient BI (business intelligence) and reporting. Although there have been advances on this front, they still have their set of shortcomings:
- they have poor support for unstructured and semi-structured data such as images, videos or JSON
- exploratory data analysis, so vital for data science, is not easily done with data warehouses because they optimize only specific types of access and aggregation patterns needed for specific reports and clients
- traditional data warehouses are not built for streaming data ingestion and processing
- they use closed formats: you cannot simply take a file used by a DWH and load it into another system
- it is not easy to scale DWH systems
On the other hand, you have data lakes, which started appearing in the 2010s. They run on cheap, commodity hardware; are able to store any kind of data, structured and unstructured; and allow for streaming use cases and for exploratory data analysis. However, data lakes come with their own set of shortcomings:
- they are not optimized for BI workloads
- they are complex to set up – they consist of a large number of moving parts, as anybody who tried to set one up can testify
- they often devolve into data swamps (and sometimes even “data dumps”) because data governance wasn’t a focus when they first appeared and is still treated as an optional “add-on”, often avoided in practice
Lakehouse retains the best of both worlds by employing a set of technologies that provide ACID transactions and schema enforcement while organizing data in a cloud data lake.
3. Great performance: Delta Lake and Photon
One pillar of the Databricks Lakehouse’s technological stack is Apache Spark, as we already said. Another one is Delta Lake. Delta Lake is a data storage framework built on top of Parquet, a columnar file format. It works by maintaining a transaction log, stored separately from the data files, that records each change to a table. Parquet files are immutable, so changes to them (updates and deletes) are only possible by creating new versions of the files. Because Parquet files can contain large amounts of data, these changes can take a lot of time.
So, what happens with clients that are using a file while the new change is being written? With Delta Lake, that’s easy: until the change is written (committed) to the transaction log, the clients keep using the old version of the file. That is the (simple) mechanism that enables transaction atomicity, consistency, isolation and durability (ACID properties).
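To make the mechanism concrete, here is a minimal Python sketch of a commit log. It is a toy model and not the Delta Lake API: the ToyTable class, its methods and the file names are all made up for illustration, and each log entry simply holds the full set of data files that make up one version.

```python
# Toy model of a Delta-style transaction log (illustrative only, not the Delta Lake API).
from dataclasses import dataclass, field


@dataclass
class ToyTable:
    # Each committed log entry is the complete set of data files for one version.
    log: list = field(default_factory=list)

    def snapshot(self):
        """Return the files of the latest committed version (empty if none)."""
        return self.log[-1] if self.log else frozenset()

    def commit(self, add=(), remove=()):
        """Publish a new version with a single append to the log."""
        new_files = (self.snapshot() - frozenset(remove)) | frozenset(add)
        self.log.append(new_files)
        return len(self.log) - 1  # version number of the new commit


table = ToyTable()
table.commit(add=["part-0.parquet"])  # version 0

reader_view = table.snapshot()  # a reader pins the version it started with

# A writer replaces the file. Readers cannot see the new file until the
# commit is appended to the log, and pinned readers never see it at all.
new_version = table.commit(add=["part-1.parquet"], remove=["part-0.parquet"])

print(sorted(reader_view))       # the pinned reader still works with part-0.parquet
print(sorted(table.snapshot()))  # readers that start now get part-1.parquet
```

The point of the sketch is that a version becomes visible in one step (the append to the log), so a reader sees either the whole change or none of it.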
On top of ACID transactions, the transaction log also enables time travel: going back to (and using) previous versions of a table (as long as the physical files belonging to old versions are still available) with commands such as:
SELECT * FROM table VERSION AS OF 0
(by version), or:
SELECT * FROM table TIMESTAMP AS OF '2020-03-22 06:24:15'
(by timestamp).
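How a lookup by version or by timestamp might be resolved against the log can be sketched in a few lines of Python. This is a simplified illustration, not Delta Lake code: the history list, its contents and both function names are hypothetical, and the sketch assumes that a timestamp resolves to the latest version committed at or before it.

```python
from bisect import bisect_right
from datetime import datetime

# Hypothetical commit history: (commit time, table contents) per version.
# The position in the list is the version number.
history = [
    (datetime(2020, 3, 21, 9, 0, 0), ["a"]),
    (datetime(2020, 3, 22, 6, 0, 0), ["a", "b"]),
    (datetime(2020, 3, 23, 12, 0, 0), ["b"]),
]


def version_as_of(version):
    """Return the table contents as they were at the given version number."""
    return history[version][1]


def timestamp_as_of(ts):
    """Return the contents of the latest version committed at or before ts."""
    commit_times = [committed_at for committed_at, _ in history]
    idx = bisect_right(commit_times, ts) - 1
    if idx < 0:
        raise ValueError("timestamp is earlier than the first commit")
    return history[idx][1]


print(version_as_of(0))
print(timestamp_as_of(datetime(2020, 3, 22, 6, 24, 15)))
```

In this toy history, the timestamp from the query above falls between the second and third commits, so the lookup lands on version 1.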
Furthermore, Delta Lake has (optional) schema enforcement, so that a new version of a file with a different schema cannot be committed to the transaction log. Table and transaction log metadata are also handled with Spark, which means that Delta Lake can handle many tables with lots of columns easily and in a scalable way.
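The idea behind schema enforcement can be shown with a small Python sketch. It is a simplification under stated assumptions, not Delta Lake’s actual rules or API: schemas are plain dictionaries of column name to type name, a write is rejected if it brings extra columns or different types, and the function and exception names are invented for this example.

```python
class SchemaMismatch(Exception):
    """Raised when an incoming batch does not fit the table schema."""


def enforce_schema(table_schema, batch_schema):
    """Reject batches with unknown columns or with columns of a different type."""
    extra = sorted(set(batch_schema) - set(table_schema))
    wrong = sorted(
        col for col in batch_schema
        if col in table_schema and batch_schema[col] != table_schema[col]
    )
    if extra or wrong:
        raise SchemaMismatch(f"extra columns: {extra}, type mismatches: {wrong}")


def commit(log, table_schema, batch_schema, rows):
    """Validate first; only a batch that passes is appended to the log."""
    enforce_schema(table_schema, batch_schema)
    log.append(rows)
    return len(log) - 1  # version number of the new commit


table_schema = {"id": "long", "name": "string"}
log = []

commit(log, table_schema, {"id": "long", "name": "string"}, [(1, "Ann")])

try:
    commit(log, table_schema, {"id": "string", "email": "string"}, [("2", "b@x.io")])
except SchemaMismatch as err:
    print("rejected:", err)

print(len(log))  # the rejected batch left no trace in the log
```

Because the check runs before the append, a rejected write never becomes a version of the table.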
Databricks provides the proprietary Photon execution engine, on top of open source Apache Spark, that significantly speeds up certain workloads by vectorizing processing operations.
4. Jupyter notebooks
Jupyter notebooks are open-source, web-based and interactive programming environments that are used for writing code in Python, Scala, R or SQL (these four are supported in Databricks). You can combine all of these languages in Jupyter notebooks, along with textual Markdown cells that can be used for writing documentation and comments, adding images, etc. Whole notebooks, with code, documentation and execution results, can be exported as PDF or HTML files. Notebooks in Databricks can take in arguments through “widgets”, GUI elements whose values can also be set through the API.
Databricks can organize notebooks in workspaces and also in Repos: Git repositories linked to external Git providers (such as GitHub, Bitbucket and others) that also enable CI/CD workflows. You can run any notebook from other notebooks (if you have permissions to access and run it).
All notebooks in Databricks need a cluster in order to be executed. A cluster is a set of virtual machines running Apache Spark components (driver and workers). You will have to pay for the resources used by your clusters, while Databricks provides the resources needed for running the web GUI, repos, task scheduling, orchestration, etc. While you can execute notebooks manually, you can also organize them in jobs and pipelines that can be managed and scheduled.
Figure 1 shows the insight that Databricks Jupyter notebooks provide into the underlying Spark execution engine, right from the GUI.
5. Delta Live Tables
Delta Live Tables is a framework for building data processing pipelines using special LIVE tables. You begin by writing a notebook defining your LIVE tables. The SQL queries defining them can only effectively be executed from within a Delta Live Tables (DLT) pipeline. The DLT pipeline will analyze your notebooks, find all LIVE table definitions and the relationships between them, create and validate a DAG (directed acyclic graph) of those definitions and display the result in a clear and easy-to-understand view. You can then execute or schedule the pipeline and the platform will manage:
- Orchestration – DLT will make sure data processing tasks start in the most efficient order
- Cluster management – DLT will automatically start clusters and stop them once they’re no longer needed
- Monitoring – DLT will collect logs, send alerts, and so on
- Data quality – you can create “expectations” when defining a LIVE table and DLT will monitor and report on them
- Error handling – DLT will retry tasks when certain exceptions happen
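The DAG validation and ordering step can be illustrated with Python’s standard graphlib module. This is only a sketch of the general technique (topological sorting), not how DLT is implemented; the table names and the execution_order helper are hypothetical.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical LIVE tables, each mapped to the set of tables it reads from.
dependencies = {
    "raw_orders": set(),
    "raw_customers": set(),
    "clean_orders": {"raw_orders"},
    "orders_by_customer": {"clean_orders", "raw_customers"},
}


def execution_order(deps):
    """Return a valid run order, or raise ValueError if the graph has a cycle."""
    try:
        return list(TopologicalSorter(deps).static_order())
    except CycleError as err:
        # err.args[1] holds the list of nodes that form the cycle.
        raise ValueError(f"pipeline is not a DAG: {err.args[1]}") from err


order = execution_order(dependencies)
print(order)  # every table appears after all of the tables it reads from
```

A definition that depends on itself, directly or through other tables, fails validation before anything runs, which is the behavior you want from a pipeline planner.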
While your DLT pipeline is executing, you can watch it live as the data flows through it (see the example in Figure 2). There are two main execution modes: triggered and continuous. In continuous mode, pipelines process data as soon as it arrives. This is, of course, more expensive than executing them periodically because the clusters need to be up and active at all times.
One more cool thing that you can do with DLT pipelines is implement change data capture (CDC). When you execute a command such as this one (simplified here; the full syntax also requires KEYS and SEQUENCE BY clauses):
APPLY CHANGES INTO table2 FROM table1
table2 will start receiving each updated, deleted or inserted row in table1 as a new row containing the original data, plus columns with timestamps and the type of operation (update, delete, insert). This is also known as SCD2 (slowly changing dimension type 2), and table2 is sometimes called a “history table” or “audit table”.
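The history-table behavior described above can be mimicked in plain Python. This is a toy model, not DLT: the apply_changes and current_state functions, the event format and the _operation and _changed_at column names are all assumptions made for the example.

```python
from datetime import datetime


def apply_changes(history, changes):
    """Append one history row per change, tagged with operation and timestamp."""
    for change in changes:
        history.append({
            **change["row"],
            "_operation": change["op"],
            "_changed_at": change["ts"],
        })
    return history


def current_state(history, key="id"):
    """Replay the history in time order to rebuild the latest row per key."""
    latest = {}
    for row in sorted(history, key=lambda r: r["_changed_at"]):
        if row["_operation"] == "delete":
            latest.pop(row[key], None)
        else:
            latest[row[key]] = {k: v for k, v in row.items() if not k.startswith("_")}
    return latest


changes = [
    {"op": "insert", "ts": datetime(2023, 1, 1, 8, 0), "row": {"id": 1, "name": "Ann"}},
    {"op": "insert", "ts": datetime(2023, 1, 1, 9, 0), "row": {"id": 2, "name": "Bob"}},
    {"op": "update", "ts": datetime(2023, 1, 2, 8, 0), "row": {"id": 1, "name": "Anna"}},
    {"op": "delete", "ts": datetime(2023, 1, 3, 8, 0), "row": {"id": 2, "name": "Bob"}},
]

history = apply_changes([], changes)
print(len(history))            # one history row per change event
print(current_state(history))  # only the surviving, most recent rows
```

Nothing is ever overwritten in the history table, which is why the state of the source table at any point in time can be rebuilt from it.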
6. Databricks SQL
The Databricks platform is organized into three “personas”, or separate areas with specific functionalities: Data Science & Engineering, Machine Learning, and SQL. SQL is available on premium accounts only and allows for more efficient DWH-specialized workloads. Users of Databricks SQL work with Data Explorer, SQL Editor, Queries, Alerts and Dashboards, all of which require SQL Warehouses (formerly known as “SQL Endpoints”): special clusters optimized for BI and analytic workloads.
Data Explorer and SQL Editor offer the standard things you’re probably used to when using similar SQL tools: listing tables, running SQL queries, setting permissions. But here you can also examine history for Delta tables and Spark execution details: amount of memory and time required in each step of the executed plan. Check out a query profile in Figure 3 for an example.
Right from the query editor you can create visualizations (and filtering or parameter widgets) that are automatically registered in your workspace and can be used in dashboards. You can see an example in Figure 4.
7. MLflow and Feature Store
MLflow is an open-source framework for managing machine learning lifecycles, well integrated into the Databricks platform. It consists of four main components:
- MLflow Tracking – tracks experiments, runs, parameters, and metrics
- MLflow Models – storage format for describing models of different “flavors” (e.g. scikit-learn, Keras, XGBoost etc.)
- MLflow Projects – package code in a format to reproduce runs on different platforms
- MLflow Model Registry – manage models in a central repository
MLflow really simplifies and brings order into training and using machine learning models.
In addition to MLflow, Machine Learning Engineers in Databricks also have access to the Feature Store that catalogs special tables containing features used for training machine learning algorithms. In this way, you can employ a more structured and controlled approach to handling your training and testing data in your machine learning workflows.
Databricks is a modern, mature, and complete data platform. We’ve covered a lot of ground here and we still haven’t mentioned Unity Catalog, data ingestion capabilities, integrations with other tools and systems, Delta Sharing, security, compliance, etc., etc. That will have to wait for a future edition of our blog.
Author: Petar Zečević, Senior Principal Consultant at Poslovna inteligencija/Bird Consulting