Best way to improve inference throughput

I see multiple options on the internet to optimize inference, and i don’t know which would be the best fit for me. My goal is to maximize throughput on GPU, and preferably reduce GPU memory usage.

I have a reinforcement learning project, where i have multiple cpu processes generating input data in batches and sending them over to a single GPU for inference. Each process loads the same resnet model with two different weight configurations at a time. The weights used get updated about every 30 minutes and get distributed between the processes. I use Python and Tensorflow 2.7 on Windows(don’t judge) and the only optimization is use right now is the built-in XLA optimizations. My GPU does not support FP-16.

I have seen TensorRT being suggested to optimize inference, i have also seen TensorflowLite, Intel has an optimization tool too, and then there is Tensorflow Serve. What option do you think would fit my needs best?

submitted by /u/LikvidJozsi
[visit reddit] [comments]

Leave a Reply Cancel reply