1 Efficient Processing of Neural Networks

  • Speaker: Vivienne Sze
  • processing at the edge instead of the cloud
  • ex. autonomous vehicles generate roughly 6 GB of data every three seconds
  • existing processors consume too much power
  • Given the slowdown of Moore's law and Dennard scaling, we need specialized hardware.

1.1 Points of Talk

  • What are the key metrics?
  • What are the challenges in achieving these metrics?
  • What are the design considerations and tradeoffs?

1.2 DNNs

  • Key operation is the multiply-and-accumulate (MAC), which accounts for roughly 90% of the computation
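
  A rough sketch of where those MACs come from: the inner loops of a fully
  connected layer are nothing but repeated multiply-and-accumulates. Layer
  sizes below are arbitrary placeholders, not numbers from the talk.

    # Sketch: a fully connected layer written as explicit MACs.
    import numpy as np

    M, N = 128, 256               # output / input dimensions (placeholders)
    W = np.random.randn(M, N)     # weights
    x = np.random.randn(N)        # input activations

    y = np.zeros(M)
    for m in range(M):
        acc = 0.0
        for n in range(N):
            acc += W[m, n] * x[n]   # one multiply-and-accumulate (MAC)
        y[m] = acc
    # Total MACs = M * N; loops like this dominate DNN computation.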

1.3 Metrics

  • Accuracy
    • Consider quality of result
  • Throughput
    • important for real time performance
  • Latency
    • autonomous driving
  • Energy and Power consumption
    • embedded devices have limited battery capacity
  • Hardware Cost
  • Flexibility
    • range of DNN models and tasks that can be supported
    • ability to support future models
  • Scalability
    • performance should scale with more resources

1.4 Design objectives of a NN processor

1.4.1 Reduce the time per MAC

  • reduce instruction overhead
  • increase clock frequency

1.4.2 Avoid unnecessary MACs

1.4.3 Increase parallelism

  • Perform MACs in parallel

1.4.4 Increase PE utilization

  • distribute the workload across the PEs
  • balance the workload (weakest-link phenomenon)
  • provide enough memory bandwidth to deliver the workload to the PEs

1.4.5 Evaluation: Eyexam

  • graph of MAC/cycle vs MAC/data
  • the sloped region at the beginning is memory-bound; once enough data is delivered and reused, the problem becomes compute-bound and performance plateaus
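
  A hedged sketch of the roofline-style curve behind that graph: attainable
  MACs/cycle is the minimum of a bandwidth-limited slope and a flat compute
  roof. The peak and bandwidth numbers below are illustrative placeholders,
  not values from Eyexam.

    # Roofline-style model: performance vs. data reuse (MACs per data fetched).
    def attainable_macs_per_cycle(macs_per_data,
                                  peak_macs_per_cycle=256.0,   # compute roof
                                  data_per_cycle=16.0):        # memory bandwidth
        memory_bound = data_per_cycle * macs_per_data   # sloped, memory-bound region
        return min(memory_bound, peak_macs_per_cycle)   # flat, compute-bound roof

    for reuse in [1, 4, 16, 64]:
        print(reuse, attainable_macs_per_cycle(reuse))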

1.5 Power Consumption of a NN processor

  • The MACs themselves are not what consumes the most power; data movement (reading the data) is.
  • a DRAM access costs orders of magnitude more energy than a 16-bit floating-point multiply
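
  A back-of-the-envelope sketch of why this matters: estimate energy as the
  number of accesses weighted by a relative cost per memory level. The relative
  costs below are illustrative orders of magnitude, not exact numbers from the
  talk.

    # Sketch: energy ~ sum over (accesses x relative cost per access).
    RELATIVE_COST = {"mac": 1.0, "sram": 6.0, "dram": 200.0}   # illustrative

    def estimate_energy(num_macs, accesses):
        """accesses: dict mapping memory level -> number of accesses."""
        energy = num_macs * RELATIVE_COST["mac"]
        for level, count in accesses.items():
            energy += count * RELATIVE_COST[level]
        return energy

    # Even a modest number of DRAM accesses can outweigh the MACs themselves.
    print(estimate_energy(num_macs=1_000_000,
                          accesses={"sram": 2_000_000, "dram": 50_000}))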

1.5.1 To reduce power usage

  • Reduce data movement
  • Reduce energy per MAC
  • Reduce unnecessary MACs

1.6 Specifications to evaluate metrics

1.6.1 Accuracy

  • difficulty of dataset and task should be considered

1.6.2 Throughput

  • Number of PEs along with utilization statistics

1.6.3 Latency

  • batch size used in evaluation

1.6.4 Energy and Power

  • it is not sufficient to report only on-chip power consumption; off-chip memory access (DRAM) power must be reported as well
  • without DRAM estimates one could claim low power consumption but fail drastically at evaluation time

1.6.5 Hardware Cost

  • on-chip storage, # of PEs, chip area

1.6.6 Flexibility

  • number of models supported without customization

1.7 Reduce ops in Matrix Multiply

  • FFT: fewer ops than direct convolution, but increases storage requirements
  • Strassen: slightly faster (7 multiplies instead of 8 per 2x2 block) but can lead to numerical instability; a small sketch follows this list
  • Winograd: reduces multiplications for small filter sizes (e.g., 3x3)
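
  For reference, a minimal sketch of Strassen's trick on a single 2x2 block:
  7 multiplications instead of 8, at the cost of extra additions (which is
  where the numerical-instability concern comes from).

    # Strassen's algorithm on a 2x2 block: 7 multiplies instead of 8.
    def strassen_2x2(A, B):
        (a11, a12), (a21, a22) = A
        (b11, b12), (b21, b22) = B
        m1 = (a11 + a22) * (b11 + b22)
        m2 = (a21 + a22) * b11
        m3 = a11 * (b12 - b22)
        m4 = a22 * (b21 - b11)
        m5 = (a11 + a12) * b22
        m6 = (a21 - a11) * (b11 + b12)
        m7 = (a12 - a22) * (b21 + b22)
        return [[m1 + m4 - m5 + m7, m3 + m5],
                [m2 + m4,           m1 - m2 + m3 + m6]]

    print(strassen_2x2([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # [[19, 22], [43, 50]]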

1.8 Reduce Instruction Overhead

  • Perform more MACs per instruction
    • GPU: NVIDIA HMMA instruction performs 64 MACs
    • CPU: specialized vector neural network instructions (e.g., Intel VNNI)

1.9 Properties we can leverage

  • Throughput: DRAM accesses are the bottleneck, around 200x more than MACs for AlexNet
  • Input data reuse (a small reuse-counting sketch follows this list)
    • filter reuse
    • convolutional reuse
  • Spatial architectures (efficient dataflow)
    • small on-PE memories with inter-PE communication
    • allow weights, activations, and partial sums to be kept close to the compute rather than in off-chip memory
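
  A small sketch of counting the reuse available in a convolutional layer:
  each filter weight is used at every output position, and (stride 1, ignoring
  borders) each input activation is used by every filter and several output
  positions. The shapes below are arbitrary placeholders.

    # Sketch: reuse factors in a conv layer (arbitrary example shapes).
    H_out, W_out = 56, 56      # output feature map height/width
    M = 64                     # number of filters (output channels)
    C, R, S = 32, 3, 3         # input channels, filter height/width

    filter_reuse = H_out * W_out   # each weight reused at every output position
    input_reuse = M * R * S        # each input activation reused across filters
                                   # and output positions (stride 1, interior)
    macs = M * C * R * S * H_out * W_out   # total MACs in the layer

    print(filter_reuse, input_reuse, macs)
    # Keeping this reused data in small on-PE memories avoids repeated DRAM reads.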

2 Co-Design

2.1 Quantization

  • reduce the precision to reduce latency in training and inference
  • methods
    • linear quantization (a small sketch follows this list)
    • log quantization
    • non-linear quantization
    • 8-bit training with stochastic rounding
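
  A minimal sketch of linear (uniform) quantization to 8 bits with simple
  min/max calibration. This is just one possible instance of the linear method
  listed above, not the exact scheme from the talk.

    import numpy as np

    def linear_quantize(x, num_bits=8):
        # Uniform asymmetric quantization with min/max calibration.
        qmin, qmax = 0, 2 ** num_bits - 1
        scale = (x.max() - x.min()) / (qmax - qmin)
        zero_point = np.round(qmin - x.min() / scale)
        q = np.clip(np.round(x / scale + zero_point), qmin, qmax).astype(np.uint8)
        return q, scale, zero_point

    def dequantize(q, scale, zero_point):
        return scale * (q.astype(np.float32) - zero_point)

    x = np.random.randn(1000).astype(np.float32)
    q, s, z = linear_quantize(x)
    print(np.abs(x - dequantize(q, s, z)).max())   # worst-case quantization error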

2.2 Design considerations for reduced precision

  • impact on accuracy
  • does hardware cost exceed benefits?
  • evaluation
    • 8-bit for inference and 16-bit for training (the standard baseline precisions)

2.3 Sparsity

  • activation functions like ReLU set many of the activations to zero
  • Gate Operation (reduce power consumption)
  • Skip operations (increase throughput)
  • Compression to reduce data movement
  • Pruning (optimal brain damage, love that term)
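
  A hedged sketch of simple magnitude-based pruning (zero out small weights)
  and of skipping the resulting zero operands. Real pruning methods such as
  optimal brain damage use second-order information rather than plain weight
  magnitudes; this is only a stand-in to show the idea.

    import numpy as np

    def magnitude_prune(W, sparsity=0.5):
        # Zero out the smallest-magnitude weights (simple stand-in for pruning).
        threshold = np.quantile(np.abs(W), sparsity)
        return np.where(np.abs(W) < threshold, 0.0, W)

    W = magnitude_prune(np.random.randn(64, 64), sparsity=0.7)
    x = np.random.randn(64)

    # "Skip operations": only nonzero weights contribute MACs.
    y = np.zeros(64)
    for m, n in zip(*np.nonzero(W)):
        y[m] += W[m, n] * x[n]
    print(np.count_nonzero(W) / W.size)   # fraction of MACs that remain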

2.4 Design considerations for Sparsity

  • similar to reduced precision
  • impact on accuracy
  • Do you need extra hardware to identify sparsity?

2.5 Neural Architecture Search (NAS)

  • search cost (complexity) = number of samples × time per sample

2.5.1 Three main components

  • search space (what is the set of all samples)
  • optimization (where to sample)
  • performance evaluation (how to evaluate)
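
  A skeletal sketch of how the three components fit together, using plain
  random search as the optimizer. The search space, sample_architecture, and
  evaluate below are hypothetical placeholders; total cost is roughly
  number of samples × time per sample, as noted above.

    import random

    SEARCH_SPACE = {                      # 1. search space (hypothetical knobs)
        "depth": [10, 20, 50],
        "width": [32, 64, 128],
        "kernel": [3, 5, 7],
    }

    def sample_architecture():            # 2. optimization: here, random search
        return {k: random.choice(v) for k, v in SEARCH_SPACE.items()}

    def evaluate(arch):                   # 3. performance evaluation (proxy only)
        # Placeholder score; a real NAS trains or estimates accuracy/latency here.
        return -abs(arch["depth"] * arch["width"] - 2000)

    num_samples = 20                      # cost ~ num_samples * time_per_sample
    best = max((sample_architecture() for _ in range(num_samples)), key=evaluate)
    print(best)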

2.5.2 Design considerations

  • the optimization algorithm may limit the search space
  • probability of convergence to a good model is a commonly overlooked property

3 Tools

  • NetAdapt
    • platform-aware DNN adaptation
    • available on GitHub

Author: Sam Partee

Created: 2019-12-09 Mon 13:00
