The future of the future: Spark, big data insights, streaming and deep learning in the cloud

Spark: The big data tool du jour is getting automation

You probably did not hear it here first. Spark has been making waves in big data for a while now, and 2017 has not disappointed anyone who bet on its meteoric rise. That was a pretty safe bet, actually, as interpreting market signals, speaking with pundits and monitoring data all pointed in the same direction.

Spark adoption is booming. Its community is growing, and all major big data platforms make a point of interoperating with Spark. If you look at its core contributors and project management committee (PMC) you will see Hadoop heavyweights Cloudera and Hortonworks, and all-round powerhouses such as IBM, Facebook and Microsoft.

You will also see a name you may not recognize, but one that dominates Spark’s current development and future direction: Databricks. Databricks is a startup founded by Spark’s inventors, Ali Ghodsi and Matei Zaharia. Ghodsi and Zaharia, who started out as fellow researchers and friends in their Berkeley days, are the CEO and CTO of Databricks.

Last week the Spark Summit Europe event attracted more than 1,000 attendees in Dublin. Ghodsi and Zaharia were both there to share news and engage with the community. ZDNet was also there, and the topics we discussed with them covered a wide spectrum, ranging from strategic to hard-core technical.

Meet Delta, your smart cache layer in the cloud

Dublin set the stage for the latest addition to Databricks’ arsenal: Delta. In a way, Delta represents the direction and philosophy of Databricks and its founders perfectly. It can be summarized as a smart cache layer on top of AWS S3 storage that lets you do all your data processing at scale and at high throughput in the cloud, with Azure and Google Cloud soon following suit.

It sounds evolutionary rather than revolutionary, in the sense that processing data in the cloud is something that has been going on for a while. Databricks has been moving in that direction too, so the obvious question to open the conversation with Ghodsi was: great, but what’s new there exactly?

Databricks pitches Delta as a platform that combines streaming and batch processing, data warehousing, collaboration and machine learning (ML) all in one, while running in the cloud to offer scale and elasticity. Ghodsi explains that product development was customer-driven, not just in the sense of responding to customer needs but also in making customers part of the development loop.

But why try to shape Spark into a data warehouse, and how would that work?

The reason is that data warehouses do have advantages in terms of performance and governance. Hearing from customers how they kept moving data back and forth between their data lakes and data warehouses inspired Databricks to take action. Data lakes complement data warehouses by offering cheap storage and separation of compute and storage, so the idea was to get the best of both worlds.