Cerebras CEO talks about the big implications for machine learning of the company’s big chip


You may have heard that, on Monday, Silicon Valley startup Cerebras Systems unveiled the world’s biggest chip, called the WSE, or “wafer-scale engine,” pronounced “wise.” It is going to be built into complete computing systems sold by Cerebras. 

What you may not know is that the WSE and the systems it makes possible have some fascinating implications for deep learning forms of AI, beyond merely speeding up computations. 

Cerebras co-founder and chief executive Andrew Feldman talked with ZDNet a bit about what changes become possible in deep learning. 

There are three immediate implications that can be seen in what we know of the WSE so far. First, an important aspect of deep networks, known as “normalization,” may get an overhaul. Second, the concept of “sparsity,” of dealing with individual data points rather than a group or “batch,” may take a more central role in deep learning. And third, as people start to develop with the WSE system in mind, forms of parallel processing more interesting than those pursued up until now may become a focus.

All this represents what Feldman says is the hardware freeing up design choices and experimentation in deep learning. 

Cerebras’s “wafer-scale engine,” left, compared to a top-of-the-line graphics processing unit from Nvidia, the “V100,” popular in deep learning training. (Image: Cerebras Systems)

“We are proud that we can vastly accelerate the existing, pioneering models of Hinton and Bengio and LeCun,” says Feldman, referring to the three deep learning pioneers who won this year’s ACM Turing award for their work in deep learning, Geoffrey Hinton, Yoshua Bengio, and Yann LeCun. 

“But what’s most interesting are the new models yet to be developed,” he adds.

“The size of the universe of models that can be trained is very large,” observes Feldman, “but the sub-set that work well on a GPU is very small, and that’s where things have been focused so far,” referring to the graphics processing chips of Nvidia that are the main compute device for deep learning training. 

The first sign that something very interesting was happening with Cerebras came in a paper posted on the arXiv pre-print server in May by Vitaliy Chiley and colleagues at Cerebras, titled “Online Normalization for Training Neural Networks.” In that paper, the authors propose a change to a standard ingredient of machine learning networks known as normalization.
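To see why normalization is tied up with batching in the first place, it helps to recall that the standard approach, batch normalization, computes its statistics over the mini-batch itself. Below is a minimal NumPy sketch contrasting that with a running, one-sample-at-a-time estimate in the spirit of the paper’s title; the decay constant, shapes, and update rule here are illustrative assumptions, not the paper’s actual formulation.

```python
import numpy as np

def batch_norm_stats(x):
    # Standard batch normalization: mean and variance are computed over the
    # mini-batch axis, so the output for one sample depends on which other
    # samples happen to share its batch.
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    return (x - mu) / np.sqrt(var + 1e-5)

def online_norm_step(x_t, mu, var, alpha=0.01):
    # A running-estimate alternative (illustrative only, not the paper's
    # method): statistics are updated one sample at a time, so normalization
    # no longer needs a batch at all.
    mu = (1 - alpha) * mu + alpha * x_t
    var = (1 - alpha) * var + alpha * (x_t - mu) ** 2
    return (x_t - mu) / np.sqrt(var + 1e-5), mu, var

# Example: normalize a stream of single samples with running statistics.
rng = np.random.default_rng(0)
mu, var = np.zeros(16), np.ones(16)
for _ in range(100):
    y, mu, var = online_norm_step(rng.standard_normal(16), mu, var)
```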


“The ways in which problems have always been attacked have gathered around them a whole set of sealing wax and string and little ways to correct for weaknesses,” observes Feldman. “They seem practically to require that you do work the way a GPU makes you do work.”

Feldman points out batches are an artifact of GPUs’ form of parallel processing. “Think about why large batches came about in the first place,” he says. “The fundamental math in neural networking is a vector times a matrix.” However, “if you do that it leaves a GPU at very low utilization, like, a few percent utilized, and that’s really bad.”


So, batching was proposed to fill up the GPU’s pipeline of operations. “What they did is they stacked vectors on top of each other to make a matrix-by-matrix multiply, and the stacking of those vectors is what’s called a mini-batch.”
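Here is a minimal NumPy sketch of the arithmetic Feldman is describing; the layer and batch sizes are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# One layer's weights: 1,024 inputs -> 4,096 outputs (sizes chosen for illustration).
W = rng.standard_normal((1024, 4096)).astype(np.float32)

# A single sample is a vector; the core operation is a vector-matrix product.
x = rng.standard_normal(1024).astype(np.float32)
single = x @ W                    # shape (4096,) -- little work per weight fetched

# A mini-batch stacks many such vectors into a matrix, turning the same step
# into one matrix-matrix product that keeps wide hardware busy.
batch = rng.standard_normal((256, 1024)).astype(np.float32)
many = batch @ W                  # shape (256, 4096) -- each weight is reused 256 times
```

The per-sample answers are identical either way; the only difference is how much work is handed to the hardware at once, which is exactly the utilization argument.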

All this means that batches are “not driven by machine learning theory; they are driven by the need to achieve some utilization of a GPU. It is a case of us bending our neural net thinking to the needs of a very particular hardware architecture, but that’s backward.”

“One of the things we are most excited about is that WSE allows you to do deep learning the way deep learning wants to be done, not shoehorned into a particular architecture,” declares Feldman.

The WSE is intended for what’s called small batch size, or really, “a batch size of one.” Instead of jamming lots of samples through every available circuit, the WSE has hard-wired circuitry that only begins to compute when it detects a single sample that has non-zero values. 

Cerebras Systems co-founder and CEO Andrew Feldman. (Image: Tiernan Ray)

The focus on sparse signals is a rebuke to the “data parallelism” of running multiple samples, which, again, is an anachronism of the GPU, contends Feldman. “Data parallelism means your individual instructions will be applied to multiple pieces of data at the same time, including if they are zeros, which is perfect if they are never zeros, like in graphics.

“But when up to 80% is zero, as in a neural network, it’s not smart at all — it’s not wise.” He notes that in the average neural network, the “ReLU,” the most common kind of activation unit for an artificial neuron, has “80% zeros as an output.”
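A quick NumPy sketch of the kind of sparsity being described; the layer size and the symmetric input distribution are illustrative assumptions, and they yield roughly half zeros here, whereas Feldman cites about 80% in trained networks.

```python
import numpy as np

rng = np.random.default_rng(0)

# Pre-activations of one layer for a single sample (illustrative size/distribution).
pre_activations = rng.standard_normal(4096).astype(np.float32)

# ReLU zeroes out every negative value; with roughly symmetric inputs,
# about half the outputs are exact zeros here, and the fraction is often
# far higher in trained networks.
activations = np.maximum(pre_activations, 0.0)
zero_fraction = np.mean(activations == 0.0)
print(f"fraction of zero activations: {zero_fraction:.0%}")

# Hardware that skips zeros only needs to touch the non-zero entries.
nonzero_idx = np.flatnonzero(activations)
```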

Being able to handle sparse signals looks to be an important direction for deep learning. In a speech in February to the International Solid-State Circuits Conference, a chip conference, Facebook’s head of AI research, Yann LeCun, noted that “As the size of DL systems grows, the modules’ activations will likely become increasingly sparse, with only a subset of variables of a subset of modules being activated at any one time.”

That’s closer to how the brain works, contends LeCun. “Unfortunately, with current hardware, batching is what allows us to reduce most low-level neural network operations to matrix products, and thereby reduce the memory access-to-computation ratio,” he said, echoing Feldman. 

“Thus, we will need new hardware architectures that can function efficiently with a batch size of one.”

If the traditional data parallelism of GPUs is less than optimal, Feldman contends, the WSE makes possible a kind of renaissance of parallel processing. In particular, the other kind of parallelism can be explored, called “model parallelism,” where separate parts of the network graph of deep learning are apportioned to different areas of the chip and run in parallel.
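In rough terms, the distinction looks like this. The sketch below is purely conceptual NumPy; the two-way split and the two-layer network are assumptions made for illustration, not Cerebras’s programming model.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.standard_normal((512, 512)).astype(np.float32)
W2 = rng.standard_normal((512, 512)).astype(np.float32)
batch = rng.standard_normal((8, 512)).astype(np.float32)

# Data parallelism: every worker holds the whole model and runs its own
# slice of the batch through it.
shards = np.array_split(batch, 2)   # pretend each shard goes to a different worker
out_data_parallel = np.concatenate([np.maximum(s @ W1, 0) @ W2 for s in shards])

# Model parallelism: the model itself is split, and different parts of the
# network run on different regions of the chip, even for a single sample.
x = batch[:1]
h = np.maximum(x @ W1, 0)           # imagine W1 living on one region of the wafer
out_model_parallel = h @ W2         # ...and W2 on another, fed the intermediate result
```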
