The state of AI in 2021: Machine learning in production, MLOps and data-centric AI


By George Anadiotis for Big on Data | October 14, 2021 | Topic: Big Data Analytics

It’s that time of year again: reports on the state of AI for 2021 are out. A few days back, it was the Machine Learning, Artificial Intelligence and Data report by Matt Turck, which ZDNet Big on Data colleague Tony Baer covered. This week, it’s the State of AI 2021 report, by Nathan Benaich and Ian Hogarth.

After releasing what was probably the most comprehensive report on the state of AI in 2020, Air Street Capital and RAAIS founder Nathan Benaich and AI angel investor and UCL IIPP visiting professor Ian Hogarth are back for more.

In what is becoming a valued yearly tradition, we caught up with Benaich and Hogarth to discuss topics that stood out for us in the report.

MLOps, machine learning in production

First off, there is overlap with the topics that Turck covered and Baer reported on, and for good reason. As Baer pointed out, the wave of IPOs and proliferation of unicorns is turning this market into its own sector, and that is impossible to ignore. For an overview of market trends, we encourage readers to have a look at Baer’s coverage.

That said, our feeling is that the State of AI 2021 report covers more ground: the latest developments in AI research, industry, talent, and politics, while also venturing into predictions. In fact, Benaich and Hogarth keep track of their predictions, and they are doing pretty well. For example, in 2020 they correctly predicted the obstacles to Nvidia’s acquisition of Arm, as well as AI- and biotech-related IPOs.

As Benaich noted, by virtue of being investors in machine learning companies at different, mostly early, stages, they have access to major AI labs, academic groups, up-and-coming startups, bigger companies, as well as people who work in government. So they try to synthesize all those different angles in a public-good product that is open source and aims to holistically inform all stakeholders.

We picked some overarching themes that stood out for us in the report, as we have also identified them throughout the year. The first one is MLOps — the art and science of bringing machine learning to production. In operationalizing AI, the emphasis is shifting from shiny new models to perhaps more mundane, but practical aspects.

With the increasing power and availability of machine learning models, gains from model improvements have become marginal. In this context, the machine learning community is growing increasingly aware of the importance of better data practices, and more generally better MLOps, to build reliable machine learning products. (Image: Hazy Research, Stanford)

Benaich noted that they thought it important to highlight renewed attention to more industry-minded academic work around data quality, and the various issues that can reside within data and ultimately propagate to ML models, determining whether models predict well or not:

“A lot of academia was focused on competing on static benchmarks, showing model performance offline on these benchmarks, and then moving into industry. So generation one was a lot about — let’s just get a model that works for a specific problem, and then deal with any issues or any changes whenever they happen.

“Google researchers define data cascades as ‘compounding events causing negative, downstream effects from data issues’. Supported by a survey of 53 practitioners from the US, India, and East and West African countries, they warn that current practices undervalue data quality and result in data cascades.

“It’s a fairly intuitive idea — the domino effect. If you have a problem at the start, it’s likely going to come down by the time you get to the last domino. What’s notable is that the overwhelming majority of data scientists report having experienced one of these issues.

“When trying to attribute why these issues happened, it was mostly due to a lack of recognition of the importance of data within the context of their work in AI, a lack of training in the domain, or not getting access to enough specialized data for the particular problem they were solving.”

What that points to is that in the world of machine learning there is more nuance than “good data” and “bad data”. As datasets are multi-faceted, with different subsets used in different contexts and different versions evolving, context is key in defining data quality. The insights from machine learning in production prompt a shift of focus from model-centric to data-centric AI.
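The domino intuition can be made concrete with a short sketch: a few pre-training checks (missing values, duplicate rows, label imbalance) that surface data issues before they propagate into a model. The function name, the specific checks, and the 90% imbalance threshold are illustrative assumptions, not taken from the report or from any particular MLOps tool.

```python
# A minimal sketch of catching data issues before they cascade into
# model problems: simple pre-training checks on a toy tabular dataset
# represented as a list of dict rows. Checks and thresholds here are
# illustrative assumptions.

def check_dataset(rows, label_key="label"):
    """Return a list of data-quality warnings for a list of dict rows."""
    if not rows:
        return ["dataset is empty"]
    warnings = []

    # Missing values: any field that is None in some row
    for key in rows[0].keys():
        missing = sum(1 for r in rows if r.get(key) is None)
        if missing:
            warnings.append(f"{missing} row(s) missing '{key}'")

    # Exact duplicate rows often signal collection or joining errors
    seen, dupes = set(), 0
    for r in rows:
        sig = tuple(sorted(r.items()))
        if sig in seen:
            dupes += 1
        seen.add(sig)
    if dupes:
        warnings.append(f"{dupes} duplicate row(s)")

    # Label imbalance: warn if one class dominates (threshold is an assumption)
    counts = {}
    for r in rows:
        counts[r.get(label_key)] = counts.get(r.get(label_key), 0) + 1
    majority = max(counts.values()) / len(rows)
    if majority > 0.9:
        warnings.append(f"label imbalance: majority class is {majority:.0%} of rows")
    return warnings
```

Each warning here corresponds to one of the causes the survey respondents cited; in practice such checks would run on every data version, since, as noted above, quality is context- and version-dependent.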

Data-centric AI is a notion developed at Hazy Research, Chris Ré’s research group at Stanford. As noted, the importance of data is not new: there are well-established mathematical, algorithmic, and systems techniques for working with data, which have been developed over decades.

What is new is how to build on and re-examine these techniques in light of modern AI models and methods. Just a few years ago, we did not have long-lived AI systems or the current breed of powerful deep models.

Join us next week as we continue the conversation with Benaich and Hogarth, covering topics such as language models, AI commercialization, and AI-powered biotechnology.

