Making Neural Networks Performant

  • 06 Sep, 2020
  • Mark Strefford
  • AI

I’ve been having conversations with a good friend of mine about taking the latest research papers and implementing them in real-life situations.

What’s evident is that many papers are written with accuracy in mind, and although accuracy is important, waiting many seconds for a prediction isn’t suitable for real-life scenarios.

This article aims to highlight some approaches that can be taken to improve the real-time performance of deep neural networks. Take it as a work in progress; it’s not meant to be deeply technical, as there are already plenty of up-to-date articles on many of the subjects below.

As I work extensively with computer vision algorithms, I’ll predominantly reference them here, although many of these approaches should translate well to NLP, etc.

If we take the scenario of wanting to predict an outcome from a real-time video feed, and the straight-from-GitHub algorithm isn’t performing, here are some possible approaches. They are not necessarily in order of preference, as each use case has different requirements in terms of accuracy, latency, and available hardware.

These first few options relate to scaling and parallelism:

1. Throw hardware at the problem

OK, so if you’re running on a CPU, can you use a GPU? If you already have a GPU, can you use a bigger or faster one? Can you run a number of GPUs in parallel so that predictions run on multiple frames simultaneously?

Obviously, scaling hardware to this extent only works if you have a data centre close at hand; it doesn’t, for example, work for self-driving cars or other edge-computing scenarios.
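
If you are able to scale like this, a minimal sketch of multi-GPU inference, assuming PyTorch and torchvision (the ResNet model and random batch are just stand-ins for your own network and frames), looks something like:

```python
# Hedged sketch: move inference to a GPU and, if several are available,
# spread each batch of frames across them with DataParallel.
import torch
from torchvision import models

model = models.resnet50().eval()                  # stand-in for your own model

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
if torch.cuda.device_count() > 1:
    model = torch.nn.DataParallel(model)          # split each batch across all GPUs
model = model.to(device)

batch = torch.randn(8, 3, 224, 224).to(device)    # stand-in for a batch of frames
with torch.no_grad():
    predictions = model(batch)
```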

2. Split the process and scale each part individually

Typically, a machine learning pipeline has data gathering, pre-processing, prediction (sometimes itself made up of multiple parts), and then likely some post-processing before you get a final prediction you can use.

Each of these parts of the process will have different compute requirements. Pre-processing and post-processing are very likely to require less compute power than the NN-based prediction code.

If you run a single-threaded process, you have to wait for each frame to be processed completely before the next one starts. By splitting the pipeline into stages, you can run each part of the process separately and scale the individual parts, which means multiple frames are processed in parallel.
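
A minimal sketch of this staging, using Python’s multiprocessing queues to connect the stages (the stage bodies below are just timed stand-ins for real capture, pre-processing, inference, and post-processing):

```python
# Each pipeline stage runs in its own process; queues pass frames between them,
# so several frames are in flight at once.
import time
from multiprocessing import Process, Queue

def capture(out_q, n_frames=100):
    for i in range(n_frames):
        time.sleep(0.01)                 # stand-in for reading a camera frame
        out_q.put(i)
    out_q.put(None)                      # sentinel: end of stream

def stage(in_q, out_q, work_seconds):
    while True:
        item = in_q.get()
        if item is None:
            out_q.put(None)
            break
        time.sleep(work_seconds)         # stand-in for real work on this frame
        out_q.put(item)

if __name__ == "__main__":
    q1, q2, q3, results = (Queue(maxsize=8) for _ in range(4))
    procs = [
        Process(target=capture, args=(q1,)),
        Process(target=stage, args=(q1, q2, 0.005)),        # pre-processing (cheap)
        Process(target=stage, args=(q2, q3, 0.050)),        # NN prediction (expensive)
        Process(target=stage, args=(q3, results, 0.005)),   # post-processing (cheap)
    ]
    for p in procs:
        p.start()
    while results.get() is not None:
        pass                             # consume final predictions here
    for p in procs:
        p.join()
```

Because the queues decouple the stages, the expensive prediction stage can also be scaled on its own, for example by running several copies of it against the same input queue (with the sentinel handling adjusted accordingly).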

These next options relate to optimising the neural network itself:

3. Reduce the image resolution

This is fairly self-explanatory: lower-resolution images mean fewer computations. For a typical convolutional network, compute scales roughly with the number of input pixels, so halving each dimension can cut the work by around 4x, provided the network still performs acceptably at the lower resolution.
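
As a trivial sketch, assuming OpenCV and NumPy (the 256x256 target size is illustrative and should match what the network expects):

```python
# Downscale each frame before running inference on it.
import cv2
import numpy as np

frame = np.random.randint(0, 256, (1080, 1920, 3), dtype=np.uint8)  # stand-in for a 1080p frame
small = cv2.resize(frame, (256, 256), interpolation=cv2.INTER_AREA)
# Run the network on `small` rather than the full-resolution frame.
```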

4. Reduce Network Parameters

With a research aim of getting optimum accuracy on benchmarks, papers often base their code on very deep NN backbones. This accuracy comes at a cost: for example, using a ResNet backbone will introduce tens of millions (and potentially over 100 million) parameters before you even look at the custom layers for your particular use case.

Replacing even ResNet50 with MobileNet v2 will reduce that figure by more than 10x depending on the resolution and alpha settings.
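
A quick way to sanity-check backbone sizes, assuming PyTorch and torchvision (the counts are for the stock ImageNet classifiers, before any task-specific layers):

```python
# Compare parameter counts of a ResNet50 backbone against MobileNetV2.
from torchvision import models

def count_params(model):
    return sum(p.numel() for p in model.parameters())

resnet = models.resnet50()
mobilenet = models.mobilenet_v2()

print(f"ResNet50:    {count_params(resnet) / 1e6:.1f}M parameters")
print(f"MobileNetV2: {count_params(mobilenet) / 1e6:.1f}M parameters")
# Roughly 25M vs 3.5M for the stock classifiers, before any custom head.
```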

5. Quantizing

Can you reduce computational overheads by reducing precision from float32 to float16?
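
A minimal float16 sketch, assuming PyTorch on a CUDA GPU (where half precision actually pays off; on CPU the gains are usually small), with a stand-in model and input:

```python
# Cast the model weights and the inputs to float16 for inference.
import torch
from torchvision import models

model = models.resnet50().eval().cuda().half()       # weights to float16
frame = torch.randn(1, 3, 224, 224).cuda().half()    # inputs must match the dtype

with torch.no_grad():
    prediction = model(frame)
```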

6. Reduce the Size of Output Tensors

As an example, Microsoft’s Simple Baselines for Human Pose Estimation outputs a 64x64 tensor for each keypoint, and post-processing needs to perform an argmax to identify the location of each keypoint. Google’s BlazePose algorithm regresses the location directly, reducing post-processing.
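
To make the difference concrete, here’s a sketch of that heatmap decoding step (NumPy, with illustrative shapes of 17 keypoints at 64x64):

```python
# Heatmap decoding: one argmax per keypoint over an HxW tensor.
import numpy as np

heatmaps = np.random.rand(17, 64, 64)        # stand-in for 17 keypoint heatmaps

keypoints = []
for hm in heatmaps:
    y, x = np.unravel_index(np.argmax(hm), hm.shape)   # hottest pixel per heatmap
    keypoints.append((x, y))
# A direct-regression model such as BlazePose outputs the (x, y) pairs themselves,
# so this decoding step disappears from post-processing.
```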