Tushar Krishna - DNN-Dataflow-Hardware Co-Design for Enabling Pervasive General-Purpose AI

The development of supervised-learning-based DL solutions today is mostly open loop. A typical DL model is created by a team of experts hand-tuning the neural network (NN) topology over multiple iterations, often by trial and error, and is then trained on gargantuan amounts of labeled data for weeks at a time to obtain a set of weights. The trained model is then deployed in the cloud or at the edge on inference accelerators (such as GPUs, FPGAs, or ASICs). This form of DL breaks down in the absence of labeled data, and/or if the model for the task at hand is unknown, and/or if the problem keeps changing. An AI system for continuous learning needs the ability to constantly interact with the environment and to add and remove connections within the NN autonomously, just like our brains do.

In this talk, we will briefly present our research efforts towards enabling general-purpose AI.

First, we will present GeneSys, a HW-SW prototype of an Evolutionary Algorithm (EA)-based learning system that comprises a closed-loop learning engine called EvE and an inference engine called ADAM. EvE is a genetic algorithm accelerator that can "evolve" the topology and weights of NNs completely in hardware for the task at hand, without requiring hand-optimization or backpropagation training. ADAM continuously interacts with the environment and is optimized for efficiently running the irregular NNs generated by EvE, which today's suite of DL accelerators and GPUs is not optimized to handle.
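To make the idea of evolving both topology and weights without backpropagation concrete, here is a minimal software sketch. It is not EvE's algorithm or hardware organization: the genome encoding, the mutation rates, the elitist selection scheme, and the XOR task are all assumptions chosen for illustration.

```python
import math
import random

N_IN = 3   # two inputs plus a constant bias input (node ids 0, 1, 2)
OUT = 3    # id of the single output node; hidden nodes get ids >= 4
XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]


def forward(genome, x):
    """Evaluate a feedforward genome {(src, dst): weight} on input x."""
    values = {0: x[0], 1: x[1], 2: 1.0}
    hidden = sorted({d for (_, d) in genome if d > OUT})
    for node in hidden + [OUT]:            # hidden nodes in id order, output last
        total = sum(w * values.get(s, 0.0)  # a node with no inputs contributes 0
                    for (s, d), w in genome.items() if d == node)
        total = max(-60.0, min(60.0, total))  # keep exp() in range
        values[node] = 1.0 / (1.0 + math.exp(-total))
    return values[OUT]


def fitness(genome):
    """Negative squared error on XOR; higher is better."""
    return -sum((forward(genome, x) - y) ** 2 for x, y in XOR)


def mutate(genome, rng):
    """Return a child with one random weight or topology change."""
    child = dict(genome)
    roll = rng.random()
    if roll < 0.7 and child:                # perturb one weight
        key = rng.choice(sorted(child))
        child[key] += rng.gauss(0.0, 0.5)
    elif roll < 0.85:                       # add a connection (feedforward only)
        nodes = sorted({0, 1, 2, OUT} | {n for k in child for n in k})
        src, dst = rng.choice(nodes), rng.choice(nodes)
        if dst >= OUT and src != OUT and (dst == OUT or src < dst):
            child.setdefault((src, dst), rng.gauss(0.0, 1.0))
    elif roll < 0.95:                       # add a hidden node on a link into OUT
        to_out = sorted(k for k in child if k[1] == OUT)
        if to_out:
            src, _ = rng.choice(to_out)
            new = max([OUT] + [n for k in child for n in k]) + 1
            weight = child.pop((src, OUT))
            child[(src, new)] = 1.0
            child[(new, OUT)] = weight
    elif len(child) > 1:                    # remove a connection
        del child[rng.choice(sorted(child))]
    return child


def evolve(generations=200, pop_size=50, seed=0):
    """Elitist mutation-only evolution; returns (best genome, best-fitness history)."""
    rng = random.Random(seed)
    population = [{(i, OUT): rng.gauss(0.0, 1.0) for i in range(N_IN)}
                  for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        history.append(fitness(population[0]))
        parents = population[:pop_size // 5]   # the top 20% survive unchanged
        population = parents + [mutate(rng.choice(parents), rng)
                                for _ in range(pop_size - len(parents))]
    return max(population, key=fitness), history


if __name__ == "__main__":
    best, history = evolve()
    print("connections:", len(best), "best fitness:", fitness(best))
```

Because connections are added and removed freely, the evolved networks are sparse and irregular rather than layered, which is the kind of structure the abstract says conventional DL accelerators are not optimized for.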

Next, we focus on the challenge of mapping a DNN model (developed via supervised or EA-based methods) efficiently onto an accelerator (ASIC/GPU/FPGA). DNNs are essentially multi-dimensional loops, with millions of parameters and billions of computations. They can be partitioned in myriad ways to map onto the compute array. Each unique mapping, or "dataflow", provides different trade-offs in terms of throughput and energy efficiency, as it determines overall utilization and data reuse. Moreover, the right dataflow for a DNN depends heavily on the layer type, the input-activation-to-weight ratio, the accelerator microarchitecture, and its memory hierarchy. We will present an analytical tool called MAESTRO, which we have been developing in collaboration with NVIDIA, for formally characterizing the performance and energy impact of dataflows in DNNs today. MAESTRO can be used at design time, to provide quick first-order metrics when hardware resources (buffers and interconnects) are being allocated on-chip, and at compile time, when different layers need to be optimally mapped for high utilization and energy efficiency.
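The link between loop order and data reuse can be illustrated with a toy count. The sketch below is not MAESTRO's cost model; it assumes a simplified 1-D convolution written as a four-deep loop nest, a single processing element that holds one weight, one input, and one partial sum at a time, and it counts a buffer fetch whenever the needed address differs from the one currently held. The labels "weight-stationary" and "output-stationary" are commonly used names for these two loop orders, not terms taken from the abstract.

```python
from itertools import product


def count_fetches(dims, order):
    """Count buffer fetches per operand for one loop order.

    dims  : loop bounds, e.g. {"k": 4, "c": 3, "x": 8, "r": 3}
            (output channels, input channels, output width, filter width)
    order : loop names from outermost to innermost
    """
    held = {"weight": None, "input": None, "output": None}
    fetches = dict.fromkeys(held, 0)
    # product() walks the loop nest with the last name varying fastest.
    for idx in product(*(range(dims[name]) for name in order)):
        i = dict(zip(order, idx))
        # Addresses touched by: out[k][x] += w[k][c][r] * inp[c][x + r]
        addr = {
            "weight": (i["k"], i["c"], i["r"]),
            "input": (i["c"], i["x"] + i["r"]),
            "output": (i["k"], i["x"]),
        }
        for operand, a in addr.items():
            if held[operand] != a:      # operand not already held: fetch it
                held[operand] = a
                fetches[operand] += 1
    return fetches


if __name__ == "__main__":
    dims = {"k": 4, "c": 3, "x": 8, "r": 3}
    print("weight-stationary:", count_fetches(dims, ("k", "c", "r", "x")))
    print("output-stationary:", count_fetches(dims, ("k", "x", "c", "r")))
```

Both orders perform the same multiply-accumulates, but the first fetches each weight once and the second fetches each output once, so which order is cheaper depends on the relative sizes of the operands and on what the memory hierarchy can hold.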

Finally, we will present the microarchitecture of an open-source DNN accelerator called MAERI that is equipped to adaptively change the dataflow depending on the DNN layer currently being mapped by leveraging a runtime-reconfigurable interconnection fabric.
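Why per-layer adaptation can help is easy to see with a first-order utilization estimate. The sketch below does not model MAERI's interconnect or its mapping procedure. It assumes a hypothetical array of 64 processing elements, a dataflow that spreads exactly one loop dimension across them, and two made-up layer shapes.

```python
import math


def utilization(dim_size, num_pes):
    """Fraction of PEs doing useful work when one dimension is spread across them."""
    passes = math.ceil(dim_size / num_pes)
    return dim_size / (passes * num_pes)


def best_dataflow(layer, num_pes):
    """Pick the dimension whose spatial mapping keeps the most PEs busy."""
    return max(layer, key=lambda dim: utilization(layer[dim], num_pes))


if __name__ == "__main__":
    num_pes = 64
    # Hypothetical layers: k = output channels, c = input channels,
    # xy = output pixels.
    layers = {
        "early": {"k": 16, "c": 3, "xy": 224 * 224},
        "late": {"k": 512, "c": 512, "xy": 7 * 7},
    }
    for name, layer in layers.items():
        fixed = utilization(layer["k"], num_pes)   # always parallelize over k
        choice = best_dataflow(layer, num_pes)
        adaptive = utilization(layer[choice], num_pes)
        print(f"{name}: fixed-k {fixed:.2f}, adaptive ({choice}) {adaptive:.2f}")
```

In this toy model a mapping fixed on output channels leaves three quarters of the array idle on the early layer, while choosing the dimension per layer keeps it fully occupied on both.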