A standard for storing big data? Apache Spark creators release open-source Delta Lake

In theory, data lakes sound like a good idea: One big repository to store all the data your organization needs to process, unifying myriad data sources. In practice, most data lakes are a mess in one way or another, earning them the “data swamp” moniker. Databricks says part of the reason is a lack of transactional support, and it has just open-sourced Delta Lake to address this.

Historically, data lakes have been a euphemism for Hadoop. Historical Hadoop, that is: On-premises, using HDFS as the storage layer. The reason is simple. HDFS offers cost-efficient, reliable storage for data of all shapes and sizes, and Hadoop’s ecosystem offers an array of processing options for that data.

The data times are a-changin’ though, and data lakes follow. The main idea of having one big data store for everything remains, but that’s not necessarily on-premises anymore, and not necessarily Hadoop either. Cloud storage is becoming the de facto data lake, and Hadoop itself is evolving to utilize cloud storage and work in the cloud.

A layer on top of your storage system, wherever it may be

Databricks is the company founded by the creators of Apache Spark. Spark has complemented, and to a large extent superseded, traditional Hadoop, thanks to the higher abstraction of its APIs and its faster, in-memory processing. Databricks itself offers a managed version of open-source Spark in the cloud, with a number of proprietary extensions, called Delta. Delta is cloud-only and is used by a number of big clients worldwide.

We spoke with Matei Zaharia, Apache Spark co-creator and Databricks CTO. Zaharia noted that sometimes Spark users migrate to the Databricks platform, while other times it’s line-of-business requirements that dictate a cloud-first approach. It seems that having to deal with data lakes that span on-premises and cloud storage prompted Databricks to address one of their main issues: Reliability.

The creators of Apache Spark work with data lakes a lot, which inspired them to take on some of their issues

“Today nearly every company has a data lake they are trying to gain insights from, but data lakes have proven to lack data reliability. Delta Lake has eliminated these challenges for hundreds of enterprises. By making Delta Lake open source, developers will be able to easily build reliable data lakes and turn them into ‘Delta Lakes’,” said Ali Ghodsi, cofounder and CEO at Databricks.

Knowing where this is coming from, we had to wonder what exactly this means, and what kinds of data storage Delta Lake supports.

“Delta Lake sits on top of your storage system[s], it does not replace them. Delta Lake is a transactional storage layer that works both on top of HDFS and cloud storage like S3, Azure blob storage. Users can download open-source Delta Lake and use it on-prem with HDFS. Users can read from any storage system that supports Apache Spark’s data sources and write to Delta Lake, which stores data in Parquet format,” Ghodsi told ZDNet.

Apache Parquet is the format of choice for Databricks. Parquet is an open-source columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework. So Delta Lake acts as a transactional layer on top of the supported storage systems, with Parquet as its on-disk format.
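
To make that concrete, here is a minimal PySpark sketch of the pattern Ghodsi describes: read from any source Spark understands, then write the result out in Delta format, which lands on disk as Parquet files plus a transaction log. The paths and dataset are hypothetical, and it assumes the open-source delta-core package has been made available to Spark (for example via --packages io.delta:delta-core_2.11:0.1.0).

```python
from pyspark.sql import SparkSession

# Hypothetical session; in practice the delta-core package must be on the classpath.
spark = SparkSession.builder.appName("delta-lake-sketch").getOrCreate()

# Read from any storage system and format Spark's data sources support
# (JSON, CSV, JDBC, and so on). The S3 path is an illustrative placeholder.
events = spark.read.json("s3a://my-bucket/raw/events/")

# Write the data as a Delta table: Parquet files plus a transaction log
# that provides the ACID guarantees the data lake otherwise lacks.
events.write.format("delta").mode("overwrite").save("s3a://my-bucket/delta/events/")

# Read it back through the same transactional layer.
delta_events = spark.read.format("delta").load("s3a://my-bucket/delta/events/")
delta_events.show(5)
```

The only Delta-specific piece is the format("delta") switch on reads and writes; the rest is standard Spark, which is what lets the layer sit on top of HDFS or cloud storage without replacing either.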
