Data 2022 outlook, part one: Will data clouds get easier? Will streaming get off its own island?

Written by Tony Baer (dbInsight), Contributing Writer

Posted in Big on Data on January 3, 2022 | Topic: Big Data

With the pandemic nearing its two-year anniversary, cloud adoption has continued to accelerate. Although dated last March, the most recent State of the Cloud Report from Flexera shows significant acceleration in cloud spending for large enterprises, with the proportion shelling out over $1 million/month doubling over the previous year.

As reported by Larry Dignan last summer, a backlash to cloud migration may be starting to brew based on growing expenses. We’ve heard anecdotes from technology providers like Vertica that some of their largest clients were actually repatriating workloads from the cloud back to their own data center or colocation facilities. 

So what’s on tap for this year? We’re dividing our 2022 outlook over two posts. Here, we’ll focus on trends with cloud data platforms; tomorrow, we’ll share our thoughts on what will happen with data mesh in the coming year.

Looking back on 2021

Last year saw some of the last on-premises database holdouts, such as Vertica and Couchbase, unveil their own cloud managed services. This reflects the reality that, while not all customers are going to deploy in the public cloud, offering an as-a-service option is now a required addition to the portfolio.

Despite the growth in cloud adoption, the database and analytics world did not see dramatic product or cloud service introductions. Instead, it saw a rounding out of portfolios with the addition of serverless options for analytics, and it moved toward pushdown processing in the database or storage tier. Excluding HPE, which unveiled a significant expansion of its GreenLake hybrid cloud platform in midyear, the same was largely true on the hybrid cloud front.

With most providers having planted their stakes in the cloud, the past year was about cloud providers building bridges to make it easier to lift and shift or lift and transform on-premises database deployments. For lift and shift, Microsoft already offered Azure SQL Database Managed Instance to SQL Server customers, and it added Azure Managed Instance for Apache Cassandra in 2021.

Meanwhile, AWS introduced its answer to Managed Instance: a new RDS Custom option for SQL Server and Oracle customers requiring special configurations that wouldn’t otherwise be supported in RDS. This could be especially useful for instances that support, for example, legacy ERP applications. 

What if you want to continue using your existing SQL skills on a new target? Last year, AWS released Babelfish, an open source utility that can automatically convert most SQL Server T-SQL calls into PostgreSQL's PL/pgSQL dialect. And then there's Datometry, which takes the approach of virtualizing the database.

Also in the spirit of lift and shift, last year saw each of the major clouds adding or expanding database migration services designed to make the process simpler. AWS and Azure already had services that provided guided approaches to migrating from Oracle or SQL Server to MySQL or PostgreSQL. Meanwhile, Google introduced a database migration service that makes the transfer of on-premises MySQL or PostgreSQL to Cloud SQL into an almost fully-automated process.

The burden is currently on the customer

Streaming will start converging with analytics and operational databases

A long elusive goal for operational systems and analytics is unifying data in motion (streaming) with data at rest (data sitting in a database or data lake).

In the coming year, we expect to see streaming and operational systems come closer together. The benefit would be to improve operational decision support by embedding some lightweight analytics or predictive capability. There would be clear benefits for use cases as diverse as Customer 360 and Supply Chain Optimization; Maintenance, Repair, and Overhaul (MRO); capital markets trading; and smart grid balancing. It could also provide real-time feedback loops for ML models. In a world where business is getting digitized, having that predictive loop to support data-driven operational decisions is morphing from luxury to necessity.

The idea of bringing streaming and data at rest together is hardly new; it was spelled out years ago as the Kappa architecture, and there have been isolated implementations on big data platforms — the former MapR’s “converged platform” (now HPE Ezmeral Unified Analytics) comes to mind.

Streaming workloads traditionally run on their own dedicated platforms because of their extreme resource demands. The show stopper keeping streaming on its own island of infrastructure is resource contention.

Streaming applications (such as parsing real-time capital market feeds, detecting anomalies in the flow of data from physical machines, troubleshooting the operation of networks, or monitoring clinical data) have typically operated standalone. And because of the need to maintain a light footprint, analytics and queries tend to be simpler than what you could run in a data warehouse or data lake. Specifically, streaming analytics often involves filtering, parsing, and, increasingly, predictive trending.

When there is a handoff to data warehouses or data lakes, in most cases, the data is limited to result sets. For instance, you can run an SQL query on Amazon Kinesis Data Analytics that identifies outliers, persist the results to Redshift, and then perform a query on the combined data for more complex analytics. But it’s a multistep operation involving two services, and it’s not strictly real-time.
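The two-step pattern described above can be sketched in miniature. This is only an illustration under stated assumptions, not how Kinesis Data Analytics or Redshift work internally: a trailing-window filter stands in for the streaming outlier query, an in-memory SQLite database stands in for the warehouse, and the names detect_outliers, persist_and_join, and the sensors/outliers tables are hypothetical.

```python
import sqlite3
from collections import deque
from statistics import mean, pstdev


def detect_outliers(readings, window=5, threshold=3.0):
    """Step 1 (stream side): yield (position, value) for readings that sit
    more than `threshold` standard deviations from the trailing-window mean."""
    recent = deque(maxlen=window)
    for position, value in enumerate(readings):
        if len(recent) == window:
            mu = mean(recent)
            sigma = pstdev(recent)
            if sigma > 0 and abs(value - mu) > threshold * sigma:
                yield position, value
        recent.append(value)


def persist_and_join(outliers):
    """Step 2 (warehouse side): persist only the result set, then query it
    together with data already at rest. SQLite is a stand-in here."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sensors (sensor_id INTEGER PRIMARY KEY, site TEXT)")
    conn.execute("CREATE TABLE outliers (seq INTEGER, sensor_id INTEGER, value REAL)")
    conn.execute("INSERT INTO sensors VALUES (1, 'plant-a')")
    # Hypothetical: every reading in this sketch comes from sensor 1.
    conn.executemany("INSERT INTO outliers VALUES (?, 1, ?)", outliers)
    rows = conn.execute(
        "SELECT s.site, o.seq, o.value "
        "FROM outliers o JOIN sensors s ON s.sensor_id = o.sensor_id "
        "ORDER BY o.seq"
    ).fetchall()
    conn.close()
    return rows


readings = [10.0, 10.2, 9.9, 10.1, 10.0, 10.1, 25.0, 10.0, 9.8]
flagged = list(detect_outliers(readings))
combined = persist_and_join(flagged)
print(flagged)
print(combined)
```

Only the flagged readings cross the boundary between the two steps, which is why the richer query on the warehouse side never sees the raw stream.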

Admittedly, with in-memory operational databases like Redis, you can support the near-instant persistence of streaming data with append-only log data formats, but that is not the same as adding a predictive feedback loop to operational applications.

Over the past couple of years, we've seen some hints that streaming is about to become part of operational and analytic data clouds. Confluent kicked open the doors when it released ksqlDB on Confluent Cloud back in 2020. Last year, DataStax introduced the beta for Astra Streaming, built on Apache Pulsar (not Kafka); it's currently a separate service, but we expect that it will be blended in with Astra DB over time. In the Spark universe, Delta Lake can act as a streaming source or sink for Spark Structured Streaming.

The game changer is cloud-native architecture. The elasticity of the cloud eliminates issues of resource contention, while microservices provide more resilient alternatives to classic design patterns involving a central orchestrator or state machine. In turn, Kubernetes (K8s) enables analytic platforms to support elasticity without having to reinvent the wheel for orchestrating compute resources. Converged streaming and operational or analytic systems can run on distributed clusters, which can be partitioned and orchestrated for performing real-time stream analytics, merging results, and correlating with complex operational models.

Such convergence won't replace dedicated streaming services, but there are clear opportunities for cloud incumbents. Amazon Kinesis Data Analytics paired with Redshift or DynamoDB, Azure Stream Analytics with Cosmos DB or Synapse Analytics, and Google Cloud Dataflow with BigQuery or Firestore all come to mind.

But there are also opportunities for real-time in-memory data stores. We’re talking to you, Redis, not to mention any of the dozens of time series databases out there.

Data share and share alike

In hindsight, this looks like a no-brainer. With cloud storage being the de facto data lake, promoting wider access to data should be a win-win for everybody: data providers get more mileage (and potentially, monetization) out of their data; data customers gain access to more diverse data sets; cloud platform providers can sell more utilization (storage and compute); and cloud data warehouses can transform themselves into data destinations. 

From that perspective, it’s surprising that it’s taken each of the major cloud providers almost five years to catch on to an idea that Snowflake hatched.

Snowflake and AWS have been the most active in promoting data exchanges, although they approached it from opposite directions. Snowflake began with a data-sharing capability aimed across internal departments and later opened a data exchange for third parties. AWS went in reverse order: it opened a data exchange on AWS Marketplace a couple of years back, but only over the past year has it been adding capabilities for internal sharing of data for Redshift customers (which required AWS to develop the RA3 instance that finally separated Redshift data into its own pool).

Snowflake has taken the added step of opening vertical industry sections of its marketplace, making it easier for customers to connect to the right data sets. On the other hand, AWS beat Snowflake to the punch in commercializing its data marketplace by utilizing the existing AWS Marketplace mechanism.

Google followed suit with Analytics Hub for sharing BigQuery data sets, a capability that it will subsequently extend to other assets such as Looker Blocks and Connected Sheets. Microsoft Azure has also gotten into the act.

Over the next year, we expect each of the cloud providers to flesh out their internal and external data exchanges and marketplaces, especially when it comes to commercialization.

Database platforms turn to ML to run themselves

This is the flip side of in-database ML, which we predicted would become a checkbox item in 2021 for cloud data warehouses and data lakes. What we’re talking about here is the use of ML under the covers to help run or optimize a database.

Oracle fired the first shot with the Autonomous Database, going full-bore with ML by designing a database that literally runs itself. That's only possible with the breadth of database automation that is largely unique to Oracle Database. But for Oracle's rivals, we're taking a more modest view: applying ML to assist, not replace, the DBA in optimizing specific database operations.

As any experienced DBA will testify, running a database involves lots of figurative “knobs.” Examples include physical data placement and storage tiering, the sequence of joins in a complex query, and identifying the right indexes. In the cloud, that could also encompass identifying the optimal hardware instances. Typically, configurations are set by formal rules or based on the DBA’s informal knowledge.
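To make one of those knobs concrete, here is a toy sketch of why the sequence of joins matters. It is an illustration only, not any vendor's optimizer: the table names, row counts, and join divisors are made up, and the cost model (the sum of intermediate result sizes for a left-deep plan) is a deliberate simplification.

```python
from itertools import permutations

# Hypothetical row counts for three tables.
ROWS = {"customers": 1_000, "orders": 100_000, "line_items": 1_000_000}

# Hypothetical join predicates. Joining two tables yields
# left_rows * right_rows // divisor rows; pairs with no predicate
# fall back to a full cross product (divisor of 1).
JOIN_DIVISOR = {
    frozenset({"customers", "orders"}): 1_000,
    frozenset({"orders", "line_items"}): 100_000,
}


def plan_cost(order):
    """Sum the sizes of every intermediate result in a left-deep join plan."""
    joined = [order[0]]
    size = ROWS[order[0]]
    cost = 0
    for table in order[1:]:
        size *= ROWS[table]
        for other in joined:
            size //= JOIN_DIVISOR.get(frozenset({other, table}), 1)
        joined.append(table)
        cost += size
    return cost


costs = {order: plan_cost(order) for order in permutations(ROWS)}
best = min(costs, key=costs.get)
worst = max(costs, key=costs.get)
print(best, costs[best])
print(worst, costs[worst])
```

Every ordering returns the same final result, but in this made-up example the cheapest plan touches roughly a thousand times fewer intermediate rows than the most expensive one. A rule, a DBA, or a learned model is in effect choosing among orderings like these.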

Optimizing a database is well-suited for ML. The processes are data rich, as databases generate huge troves of log data. The problem is also well-bounded, as the features are well-defined. And there is significant potential for cost savings, especially when it comes to factoring how to best lay out data or design a query. Cloud DBaaS providers are well-situated to apply ML to optimize the running of their database services, as they control the infrastructure and have rich pools of anonymized operational data on which to build and continually improve models.

We’ve been surprised, however, that there have been few takers for Oracle’s challenge. Just about the only formally productized use of ML (aside from Oracle) is with Azure SQL Database and SQL Managed Instance, where Microsoft offers autotuning of indexes and queries. Indexing is a classic problem of trade-offs: the faster speed of retrieval with an index vs. the cost and overhead of writes when you have too many indexes. Azure’s automated tuning can create indexes when it senses query hot spots, drop indexes that go unused after 90 days, and reinstate previous versions of query plans if newer ones prove slower.
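The trade-off behind those three actions can be expressed as a few simple rules. The sketch below is hypothetical and is not Azure's implementation: the IndexStats fields, the recommend and choose_plan functions, and the write_cost weight are all assumptions made for illustration, and a production service would learn such estimates from workload telemetry instead of taking them as inputs.

```python
from dataclasses import dataclass


@dataclass
class IndexStats:
    name: str
    exists: bool
    reads_saved_per_day: float  # estimated reads the index avoids (assumed input)
    writes_per_day: float       # writes that must also maintain the index
    days_since_last_use: int


def recommend(stats, write_cost=1.0, unused_days=90):
    """Return 'create', 'skip', 'drop', or 'keep' for one candidate index."""
    if stats.exists:
        # An existing index that nobody reads is pure write overhead.
        return "drop" if stats.days_since_last_use > unused_days else "keep"
    # A new index pays off only if the reads saved outweigh the write overhead.
    net_benefit = stats.reads_saved_per_day - write_cost * stats.writes_per_day
    return "create" if net_benefit > 0 else "skip"


def choose_plan(previous_ms, current_ms):
    """Fall back to the previous query plan if the newer one proves slower."""
    return "previous" if current_ms > previous_ms else "current"


hot_spot = IndexStats("ix_orders_customer", False, 50_000.0, 2_000.0, 0)
stale = IndexStats("ix_orders_legacy", True, 0.0, 2_000.0, 120)
print(recommend(hot_spot))      # read-heavy hot spot with modest write volume
print(recommend(stale))         # existing index unused for 120 days
print(choose_plan(12.0, 30.0))  # newer plan is slower than the old one
```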

Over the coming year, we expect to see more cloud DBaaS services introduce options that incorporate ML to optimize the database, and to promote them to enterprises as a way to save money.

Disclosure: AWS, DataStax, Google Cloud, HPE, IBM, and Oracle are dbInsight clients.
