Question about inference, latency, and performance in production environments

What tricks have you previously used to reduce inference latency without negatively impacting prediction performance?

Would the technique you use change if you were running in real time versus batch? How so?

Any tips here @kausthubk @OBaratz @richmond @lu.riera @chris @msminhas?


If we’re talking about the king of all architectures, Transformers -

Methods like quantization and graph optimization, using tools like ONNX, are usually ways you can get latency down without sacrificing too much prediction performance.

  • Quantization (including INT8 calibration)
  • Quantization-aware training (QAT)
  • Parallel / concurrent inference
  • Finding the right batch size for your pipeline.
  • Finding the sweet spot for your inference server / software:
    • data size (input and output bytes) <-> batch size <-> threads <-> no. of processes <-> no. of GPUs.
  • Using shared memory instead of HTTP or gRPC for localhost inference.
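To make the quantization idea concrete, here's a minimal pure-Python sketch of symmetric INT8 post-training quantization. In practice you'd use tooling like ONNX Runtime's quantization utilities or your framework's QAT support rather than rolling your own; this just shows the core trick: store weights as 8-bit integers plus a scale, and dequantize on the fly. The example weights are made up.

```python
# Minimal symmetric per-tensor INT8 quantization sketch (no framework).

def quantize_int8(weights):
    """Map float weights to int8 values plus one shared scale factor."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs else 1.0
    # Clamp to the int8 range after rounding.
    q = [max(-128, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize_int8(q, scale):
    """Recover approximate float weights from int8 values + scale."""
    return [x * scale for x in q]

weights = [0.52, -1.37, 0.004, 0.91, -0.33]   # toy example weights
q, scale = quantize_int8(weights)
restored = dequantize_int8(q, scale)

# Rounding error per weight is bounded by half the scale step.
max_err = max(abs(w - r) for w, r in zip(weights, restored))
```

The win comes from storing and moving 1 byte per weight instead of 4, and from INT8 matmul kernels being faster on most hardware; the cost is the bounded rounding error above, which INT8 calibration (picking scales from representative data) keeps small.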

I think a great thing to add to this discussion is profiling.

This can really give you an idea of where the bottlenecks are - and potentially where you can improve the efficiency of your system’s performance.
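For example, wrapping a single end-to-end inference call in `cProfile` and sorting by cumulative time quickly shows which stage dominates. The stage functions below are hypothetical stand-ins for a real pipeline:

```python
import cProfile
import io
import pstats

# Hypothetical pipeline stages - replace with your real calls.
def preprocess(batch):
    return [x * 0.5 for x in batch]

def model_forward(batch):
    # Stand-in for the actual model invocation.
    return [sum(batch)] * len(batch)

def postprocess(outputs):
    return [round(o, 3) for o in outputs]

def infer(batch):
    return postprocess(model_forward(preprocess(batch)))

profiler = cProfile.Profile()
profiler.enable()
infer(list(range(1000)))
profiler.disable()

# Print the top entries sorted by cumulative time.
buf = io.StringIO()
pstats.Stats(profiler, stream=buf).sort_stats("cumulative").print_stats(5)
report = buf.getvalue()
print(report)
```

For GPU workloads you'd reach for a framework-aware profiler instead (e.g. the PyTorch profiler or NVIDIA Nsight), since Python-level profiling can't see inside asynchronous kernel launches.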

As mentioned by avi, shared memory can definitely be very helpful in reducing the time spent copying data between the different components of your system.
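A minimal sketch of that idea using only the standard library (`multiprocessing.shared_memory`, Python 3.8+): the producer writes a payload into a named shared-memory segment, and a consumer attaches by name and reads it, with no serialization over a socket. Real inference servers typically expose this behind their own shared-memory API; the payload here is made up.

```python
import struct
from multiprocessing import shared_memory

payload = [1.0, 2.0, 3.0, 4.0]          # toy "tensor" to hand off
nbytes = len(payload) * 8               # 8 bytes per float64

# Producer: write floats into a named shared-memory segment.
shm = shared_memory.SharedMemory(create=True, size=nbytes)
struct.pack_into(f"{len(payload)}d", shm.buf, 0, *payload)

# Consumer (could be a separate process): attach by name and read.
# Nothing crosses a socket and nothing is serialized.
peer = shared_memory.SharedMemory(name=shm.name)
received = list(struct.unpack_from(f"{len(payload)}d", peer.buf, 0))

peer.close()
shm.close()
shm.unlink()
```

For same-host inference this removes the HTTP/gRPC serialization and copy overhead entirely; the client only needs to pass the segment name to the server.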

Using a preallocated memory buffer to transfer data from the host’s memory to GPU memory is one of the techniques I’ve seen used to improve inference time. It reduces the number of round trips between host and GPU memory, which can often be a bottleneck.
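The reuse pattern in miniature: allocate one staging buffer up front and copy each request into it, instead of allocating per request. With a GPU you'd make this a pinned (page-locked) buffer, e.g. `torch.empty(..., pin_memory=True)`, so host-to-device copies are faster and can overlap compute; this stdlib-only sketch (all names hypothetical) just shows the reuse structure.

```python
MAX_BATCH_BYTES = 4096
# Allocated once at startup; with a GPU this would be a pinned buffer.
staging = bytearray(MAX_BATCH_BYTES)

def stage_request(payload: bytes) -> memoryview:
    """Copy a request into the preallocated staging buffer.

    No new buffer is allocated per call; the returned view is only
    valid until the next call overwrites the staging buffer.
    """
    n = len(payload)
    if n > MAX_BATCH_BYTES:
        raise ValueError("payload exceeds staging buffer")
    staging[:n] = payload
    return memoryview(staging)[:n]

view = stage_request(b"batch-0 tensor bytes")
```

The same shape applies on the device side: preallocate the GPU-resident output buffer too, so steady-state inference does zero allocations per request.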



Yeah, definitely want to second profiling as the way to find your bottlenecks and direct your optimizations!