Question about inference, latency, and performance in production environments

What tricks have you previously used to reduce inference latency without negatively impacting prediction performance?

Would the technique you use change if you were running in real time versus batch? How so?

Any tips here @kausthubk @OBaratz @richmond @lu.riera @chris @msminhas?


If we’re talking about the king of all architectures, Transformers -

Methods like quantization and graph optimization, using tools like ONNX, are usually ways you can get latency down without sacrificing too much prediction performance.

  • Quantization (including INT8 calibration)
  • Quantization-aware training (QAT)
  • Parallel / concurrent inference
  • Finding the right batch size for your pipeline.
  • Finding the sweet spot for your inference server / software:
    • data size (input and output bytes) <-> batch size <-> threads <-> no. of processes <-> no. of GPUs.
  • Using shared memory instead of HTTP or gRPC for localhost inference.
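To make the quantization idea concrete, here's a minimal pure-Python sketch of symmetric INT8 post-training quantization. In practice you'd use tooling like ONNX Runtime's quantization utilities or your framework's QAT support rather than rolling your own; this just shows the core trick: store weights as 8-bit integers plus a scale, and dequantize on the fly. The example weights are made up.

```python
# Minimal symmetric per-tensor INT8 quantization sketch (no framework).

def quantize_int8(weights):
    """Map float weights to int8 values plus one shared scale factor."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs else 1.0
    # Clamp to the int8 range after rounding.
    q = [max(-128, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize_int8(q, scale):
    """Recover approximate float weights from int8 values + scale."""
    return [x * scale for x in q]

weights = [0.52, -1.37, 0.004, 0.91, -0.33]   # toy example weights
q, scale = quantize_int8(weights)
restored = dequantize_int8(q, scale)

# Rounding error per weight is bounded by half the scale step.
max_err = max(abs(w - r) for w, r in zip(weights, restored))
```

The win comes from storing and moving 1 byte per weight instead of 4, and from INT8 matmul kernels being faster on most hardware; the cost is the bounded rounding error above, which INT8 calibration (picking scales from representative data) keeps small.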

I think a great thing to add to this discussion is profiling.

This can really give you an idea of where the bottlenecks are - and potentially where you can improve the efficiency of your system’s performance.
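For example, wrapping a single end-to-end inference call in `cProfile` and sorting by cumulative time quickly shows which stage dominates. The stage functions below are hypothetical stand-ins for a real pipeline:

```python
import cProfile
import io
import pstats

# Hypothetical pipeline stages - replace with your real calls.
def preprocess(batch):
    return [x * 0.5 for x in batch]

def model_forward(batch):
    # Stand-in for the actual model invocation.
    return [sum(batch)] * len(batch)

def postprocess(outputs):
    return [round(o, 3) for o in outputs]

def infer(batch):
    return postprocess(model_forward(preprocess(batch)))

profiler = cProfile.Profile()
profiler.enable()
infer(list(range(1000)))
profiler.disable()

# Print the top entries sorted by cumulative time.
buf = io.StringIO()
pstats.Stats(profiler, stream=buf).sort_stats("cumulative").print_stats(5)
report = buf.getvalue()
print(report)
```

For GPU workloads you'd reach for a framework-aware profiler instead (e.g. the PyTorch profiler or NVIDIA Nsight), since Python-level profiling can't see inside asynchronous kernel launches.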

As mentioned by avi, shared memory can definitely be very helpful in reducing the time spent copying data between the different components of your system.
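A minimal sketch of that idea using only the standard library (`multiprocessing.shared_memory`, Python 3.8+): the producer writes a payload into a named shared-memory segment, and a consumer attaches by name and reads it, with no serialization over a socket. Real inference servers typically expose this behind their own shared-memory API; the payload here is made up.

```python
import struct
from multiprocessing import shared_memory

payload = [1.0, 2.0, 3.0, 4.0]          # toy "tensor" to hand off
nbytes = len(payload) * 8               # 8 bytes per float64

# Producer: write floats into a named shared-memory segment.
shm = shared_memory.SharedMemory(create=True, size=nbytes)
struct.pack_into(f"{len(payload)}d", shm.buf, 0, *payload)

# Consumer (could be a separate process): attach by name and read.
# Nothing crosses a socket and nothing is serialized.
peer = shared_memory.SharedMemory(name=shm.name)
received = list(struct.unpack_from(f"{len(payload)}d", peer.buf, 0))

peer.close()
shm.close()
shm.unlink()
```

For same-host inference this removes the HTTP/gRPC serialization and copy overhead entirely; the client only needs to pass the segment name to the server.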

Using a preallocated memory buffer to transfer data from the host’s memory to GPU memory is one of the techniques I’ve seen used to improve inference time. It reduces the number of round trips between host and GPU memory, which can often be a bottleneck.
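The reuse pattern in miniature: allocate one staging buffer up front and copy each request into it, instead of allocating per request. With a GPU you'd make this a pinned (page-locked) buffer, e.g. `torch.empty(..., pin_memory=True)`, so host-to-device copies are faster and can overlap compute; this stdlib-only sketch (all names hypothetical) just shows the reuse structure.

```python
MAX_BATCH_BYTES = 4096
# Allocated once at startup; with a GPU this would be a pinned buffer.
staging = bytearray(MAX_BATCH_BYTES)

def stage_request(payload: bytes) -> memoryview:
    """Copy a request into the preallocated staging buffer.

    No new buffer is allocated per call; the returned view is only
    valid until the next call overwrites the staging buffer.
    """
    n = len(payload)
    if n > MAX_BATCH_BYTES:
        raise ValueError("payload exceeds staging buffer")
    staging[:n] = payload
    return memoryview(staging)[:n]

view = stage_request(b"batch-0 tensor bytes")
```

The same shape applies on the device side: preallocate the GPU-resident output buffer too, so steady-state inference does zero allocations per request.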



Yeah, definitely want to second profiling as the way to find your bottlenecks and direct your optimizations!