Table of Contents
- 1. Efficient Processing of Neural Networks
- 2. Co-Design
- 3. Tools
1 Efficient Processing of Neural Networks
- Speaker: Vivienne Sze
- processing at the edge instead of the cloud
- ex. autonomous vehicles generate ~6 GB of data every three seconds
- existing processors consume too much power
- Given the slowdown of Moore's law and Dennard scaling, we need specialized hardware.
1.1 Points of Talk
- What are the key metrics?
- What are the challenges to achieving these metrics?
- What are the design considerations and tradeoffs?
1.2 DNNs
- Key operation is the multiply-and-accumulate (MAC): ~90% of computation
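A MAC is just acc += weight * activation; a minimal sketch of a dot product expressed as repeated MACs (illustrative only, not from the talk):

```python
def dot_product(weights, activations):
    """A dot product as a chain of multiply-and-accumulate (MAC) operations."""
    acc = 0.0
    for w, a in zip(weights, activations):
        acc += w * a  # one MAC: multiply, then accumulate
    return acc

# Every fully connected and convolutional layer reduces to MACs like this.
print(dot_product([0.5, -1.0, 2.0], [1.0, 2.0, 3.0]))  # 4.5
```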
1.3 Metrics
- Accuracy
- Consider quality of result
- Throughput
- important for real time performance
- Latency
- autonomous driving
- Energy and Power consumption
- embedded devices have limited battery capacity
- Hardware Cost
- Flexibility
- Range of DNN models and tasks supported
- ability to support future models
- Scalability
- performance should scale with more resources
1.4 Design objectives of a NN processor
1.4.1 Reduce the time per MAC
- reduce instruction overhead
- increase clock frequency
1.4.2 Avoid unnecessary MACs
1.4.3 increase parallelism
- Perform MACs in parallel
1.4.4 increase PE (processing element) utilization
- distribute workload
- balanced workload (weakest-link phenomenon)
- memory bandwidth to get workload to the PE
1.4.5 evaluation: Eyexam
- graph of MAC/cycle vs MAC/data
- initial slope is memory-bound compute; once data is on-chip, the problem becomes compute-bound and the curve flattens
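A minimal roofline-style sketch of this bound; the parameter names and hardware numbers are assumptions for illustration, not Eyexam's actual interface:

```python
def attainable_macs_per_cycle(macs_per_data, peak_macs_per_cycle, data_per_cycle):
    """Roofline-style bound: rising (memory-bound) slope, then a flat
    (compute-bound) roof, as in the MAC/cycle vs MAC/data plot."""
    memory_bound = data_per_cycle * macs_per_data  # the initial slope
    return min(memory_bound, peak_macs_per_cycle)  # the flat roof

# Assumed hardware: 256 MACs/cycle peak, 16 data words/cycle of bandwidth.
for intensity in [1, 4, 16, 64]:  # MACs per data word fetched
    print(intensity, attainable_macs_per_cycle(intensity, 256, 16))
# Throughput climbs (16, 64, 256), then saturates at the 256 MAC/cycle roof.
```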
1.5 Power Consumption of a NN processor
- MACs are not actually what consumes the power; reading the data is
- a DRAM access costs orders of magnitude more energy than a 16b FP multiply
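A back-of-the-envelope sketch of why data movement dominates; the energy numbers are assumptions chosen to be order-of-magnitude plausible, not figures from the talk:

```python
# Assumed, illustrative energy costs (order of magnitude only).
MAC_FP16_PJ = 1.0       # ~1 pJ per 16b FP multiply-accumulate (assumed)
DRAM_ACCESS_PJ = 640.0  # hundreds of pJ per DRAM word access (assumed)

def layer_energy_pj(num_macs, num_dram_accesses):
    """Total energy estimate: compute energy plus data-movement energy."""
    return num_macs * MAC_FP16_PJ + num_dram_accesses * DRAM_ACCESS_PJ

# Even with 100x fewer DRAM accesses than MACs, memory still dominates:
print(layer_energy_pj(num_macs=1_000_000, num_dram_accesses=10_000))
# compute: 1.0e6 pJ, DRAM: 6.4e6 pJ -> data movement is ~86% of the total
```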
1.5.1 To reduce power usage
- Reduce data movement
- Reduce energy per MAC
- Reduce unnecessary MACs
1.6 Specifications to evaluate metrics
1.6.1 Accuracy
- difficulty of dataset and task should be considered
1.6.2 Throughput
- Number of PEs along with utilization stats
1.6.3 Latency
- batch size used in evaluation
1.6.4 Energy and Power
- not sufficient to report only on-chip power consumption; off-chip memory access power must be reported as well
- without DRAM estimates, one could claim low power consumption but fail drastically at evaluation time
1.6.5 Hardware Cost
- on chip storage, # of PEs, chip area
1.6.6 Flexibility
- number of models supported without customization
1.7 Reduce ops in Matrix Multiply
- FFT: fewer ops than direct convolution, but increases storage requirements
- Strassen: slightly faster (7 multiplies instead of 8 per 2x2 block) but can lead to numerical instability; see the sketch after this list
- Winograd: fewer multiplies for small convolutions (e.g., 3x3 filters)
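A minimal sketch of one Strassen level on a 2x2 block, showing the 7-multiply trick (the extra additions are where instability can creep in):

```python
import numpy as np

def strassen_2x2(A, B):
    """One Strassen level: 7 multiplies instead of 8 for a 2x2 matmul."""
    a, b, c, d = A[0, 0], A[0, 1], A[1, 0], A[1, 1]
    e, f, g, h = B[0, 0], B[0, 1], B[1, 0], B[1, 1]
    m1 = (a + d) * (e + h)
    m2 = (c + d) * e
    m3 = a * (f - h)
    m4 = d * (g - e)
    m5 = (a + b) * h
    m6 = (c - a) * (e + f)
    m7 = (b - d) * (g + h)
    # More additions than the naive method: the source of instability.
    return np.array([[m1 + m4 - m5 + m7, m3 + m5],
                     [m2 + m4,           m1 - m2 + m3 + m6]])

A = np.array([[1., 2.], [3., 4.]])
B = np.array([[5., 6.], [7., 8.]])
assert np.allclose(strassen_2x2(A, B), A @ B)
```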
1.8 Reduce Instruction Overhead
- Perform more MACs per instruction
- GPU: NVIDIA HMMA instruction performs 64 MACs
- CPU: specialized vector neural network instructions (VNNI)
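A loose Python/NumPy analogy for amortizing instruction overhead: one call dispatches many MACs at once, similar in spirit to HMMA's 64 MACs per instruction (analogy only, not actual ISA behavior):

```python
import numpy as np

w = np.random.rand(1024).astype(np.float32)
x = np.random.rand(1024).astype(np.float32)

# High overhead: one MAC per interpreted "instruction" (loop iteration).
acc = 0.0
for i in range(len(w)):
    acc += w[i] * x[i]

# Amortized overhead: many MACs issued by a single call.
acc_vec = np.dot(w, x)
assert np.isclose(acc, acc_vec, rtol=1e-3)
```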
1.9 Properties we can leverage
- Throughput: DRAM accesses are the bottleneck, around 200x more than MACs for AlexNet
- Input data reuse
- filter reuse
- convolutional reuse
- Spatial architecture (efficient dataflow)
- small on-node memory with inter-node communication
- allows weights, activations, and partial sums to stay on-chip, close to the PEs (sketch of reuse counts below)
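A toy sketch counting how often each value could be reused in a conv layer if kept on-chip; all shapes are assumed for illustration:

```python
# Assumed toy conv-layer shapes (not from the talk).
N, H, W = 4, 32, 32          # batch size, input height/width
M, R, S = 8, 3, 3            # number of filters, filter height/width
E, F = H - R + 1, W - S + 1  # output height/width (stride 1, no padding)

# Each filter weight is used at every output position of every image:
filter_reuse = N * E * F
# Each input activation is used by every filter (input data reuse)...
input_reuse = M
# ...and by up to R*S overlapping filter windows (convolutional reuse).
conv_reuse = R * S

print(filter_reuse, input_reuse, conv_reuse)  # 3600 8 9
```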
2 Co-Design
2.1 Quantization
- reduce the precision to reduce latency in training and inference
- methods
- linear quantization
- log quantization
- non-linear quantization
- 8-bit training with stochastic rounding
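A minimal linear (uniform) quantization sketch with optional stochastic rounding; the symmetric per-tensor scaling scheme is an assumption, one of several possibilities:

```python
import numpy as np

def linear_quantize(x, num_bits=8, stochastic=False):
    """Uniform quantization; stochastic rounding helps low-bit training."""
    qmax = 2 ** (num_bits - 1) - 1
    scale = np.max(np.abs(x)) / qmax  # symmetric per-tensor scale (assumed)
    scaled = x / scale
    if stochastic:
        # Round up with probability equal to the fractional part, so the
        # rounding error is zero in expectation.
        floor = np.floor(scaled)
        q = floor + (np.random.rand(*x.shape) < (scaled - floor))
    else:
        q = np.round(scaled)
    return np.clip(q, -qmax - 1, qmax) * scale  # dequantized for comparison

x = np.random.randn(5).astype(np.float32)
print(x)
print(linear_quantize(x))                   # deterministic rounding
print(linear_quantize(x, stochastic=True))  # stochastic rounding
```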
2.2 Design considerations for reduced precision
- impact on accuracy
- does hardware cost exceed benefits?
- evaluation
- 8-bit for inference and 16-bit for training (standard baseline)
2.3 Sparsity
- when using activations like ReLU, many activations are set to 0
- Gate operations (reduce power consumption)
- Skip operations (increase throughput)
- Compression to reduce data movement
- Pruning (Optimal Brain Damage, love that term)
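A minimal sketch of zero-skipping and compression on a ReLU output; the function names are hypothetical:

```python
import numpy as np

def sparse_dot(weights, activations):
    """Skip MACs where the activation is zero (the hardware analogue
    gates or skips the operation to save power or cycles)."""
    acc = 0.0
    for w, a in zip(weights, activations):
        if a != 0.0:  # gate/skip: no multiply, no accumulate
            acc += w * a
    return acc

def compress(activations):
    """Toy compression for data movement: store only nonzeros + indices."""
    idx = np.flatnonzero(activations)
    return idx, activations[idx]

a = np.maximum(np.random.randn(8), 0)  # ReLU output: roughly half zeros
w = np.random.randn(8)
print(sparse_dot(w, a), float(np.dot(w, a)))  # same result, fewer MACs
print(compress(a))                            # less data to move
```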
2.4 Design considerations for Sparsity
- similar to reduced precision
- impact on accuracy
- Do you need extra hardware to identify sparsity?
2.5 Neural Architecture Search (NAS)
- complexity = (number of samples) × (time per sample); see the sketch after the component list below
2.5.1 three main components
- search space (what is the set of all samples)
- optimization (where to sample)
- performance evaluation (how to evaluate)
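A minimal random-search NAS sketch mapping onto the three components above; the search space, proxy evaluator, and sample budget are all hypothetical:

```python
import random

SEARCH_SPACE = {             # 1. search space: the set of all candidates
    "depth": [8, 14, 20],
    "width": [16, 32, 64],
    "kernel": [3, 5, 7],
}

def sample(space):           # 2. optimization: here, uniform random sampling
    return {k: random.choice(v) for k, v in space.items()}

def evaluate(arch):          # 3. performance evaluation: a stand-in proxy
    # Placeholder score; a real NAS trains or cheaply estimates accuracy,
    # so total cost = (number of samples) x (time per sample).
    return -abs(arch["depth"] - 14) - abs(arch["width"] - 32)

num_samples = 20
best = max((sample(SEARCH_SPACE) for _ in range(num_samples)), key=evaluate)
print(best)
```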
2.5.2 design considerations
- optimization algorithm may limit the search space
- probability of convergence to a good model is a commonly overlooked property
3 Tools
- NetAdapt
- platform-aware DNN adaptation
- available on GitHub