Running ML predictions is no small feat. Most ML applications today demand serious computing power, more than a CPU can reasonably deliver. That's why complex tasks like computer vision, text-to-speech, and other generative workloads usually run on GPUs. The downside is operational: running workloads on GPUs is considerably more complex than running them on CPUs. We're here to help your company sidestep that complexity entirely - see below for what's possible with MLnative.

Sample LLM Load Test

We’d like to share the results of internal load testing we’ve performed over at MLnative. For the test, we decided to run Flan-T5, a small LLM from Google. All tests were performed on an NVIDIA A100 GPU with a total of 80 GB of memory.
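The serving stack itself isn't the point of this post, but for readers who want to reproduce a comparable setup, here's a minimal sketch of loading a Flan-T5 checkpoint with the Hugging Face transformers library. The google/flan-t5-base variant and the generate helper below are illustrative assumptions; the exact checkpoint and serving code we used may differ.

```python
# Minimal sketch: loading a Flan-T5 checkpoint with Hugging Face transformers.
# Assumes the publicly available google/flan-t5-base variant.
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

MODEL_ID = "google/flan-t5-base"  # assumption: any Flan-T5 size loads the same way

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_ID).to("cuda").eval()

def generate(prompt: str, max_new_tokens: int = 64) -> str:
    """Run a single prompt through the model and decode the response."""
    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
    with torch.no_grad():
        output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

print(generate("Translate English to German: How old are you?"))
```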

The test consists of the following phases (a load-generation sketch follows the list):

  • a 3-minute ramp-up to a total of 20 concurrent users
  • a 5-minute load at a constant 20 concurrent users
  • a 2-minute scale down
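The load generator is nothing MLnative-specific - any HTTP load-testing tool can reproduce this profile. As a rough sketch, here's how the staged profile above could be expressed with Locust; the /predict endpoint and payload are illustrative assumptions, not part of MLnative's API.

```python
# Sketch of the staged load profile using Locust (locust.io).
# Run with: locust -f loadtest.py --host=<model-endpoint-url>
from locust import HttpUser, LoadTestShape, between, task


class ModelUser(HttpUser):
    wait_time = between(0.5, 1.5)

    @task
    def predict(self):
        # Hypothetical inference endpoint and payload, for illustration only.
        self.client.post("/predict", json={"text": "Summarize: GPU sharing cuts costs."})


class StagedShape(LoadTestShape):
    """3-minute ramp-up to 20 users, 5 minutes of steady load, 2-minute scale down."""

    def tick(self):
        t = self.get_run_time()
        if t < 180:                              # ramp up linearly to 20 users
            return int(20 * t / 180) + 1, 1
        if t < 480:                              # hold 20 concurrent users
            return 20, 1
        if t < 600:                              # scale back down to 0
            return int(20 * (600 - t) / 120), 1
        return None                              # stop the test
```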

Single replica, 20 GB

In the initial run, the Flan-T5 model was allocated 20 GB of GPU memory.
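MLnative enforces the memory limit for each replica itself, so there's nothing to configure on the model side. Purely to illustrate what a 20 GB cap on an 80 GB card means for a single process (and not how MLnative implements it), here's the equivalent restriction expressed in raw PyTorch:

```python
# Illustration only: capping one process at ~20 GB of an 80 GB A100 in raw
# PyTorch. MLnative enforces its own per-replica limits; this is not its API.
import torch

DEVICE = torch.device("cuda:0")

# Allow this process to allocate at most 25% of device memory (20 GB of 80 GB).
torch.cuda.set_per_process_memory_fraction(20 / 80, device=DEVICE)

total = torch.cuda.get_device_properties(DEVICE).total_memory
print(f"Device total: {total / 1024**3:.0f} GiB, cap: {0.25 * total / 1024**3:.0f} GiB")
```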

We can see that after the initial spike in throughput, the value dropped slightly due to the increase in response time, finally stabilizing at around 280 requests per minute. Throughout the test, the median response time remained roughly stable at around 3,220 ms, with P90 and P95 not far off the median.

Four replicas, 5 GB each

For comparison, we’ve deployed four replicas of the exact same model, with 5 GB of GPU memory allocated to each.

We can see on the charts that we managed to achieve a throughput of over 1,000 requests per minute, while also maintaining a significantly lower median response time of 122 ms.
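Putting the two runs side by side using the medians reported above, the improvement works out roughly as follows:

```python
# Quick comparison of the two runs using the numbers reported above.
# Throughput was reported as "over 1,000 rpm"; 1,000 is used as a lower bound.
single = {"throughput_rpm": 280, "median_latency_ms": 3220}   # 1 replica x 20 GB
shared = {"throughput_rpm": 1000, "median_latency_ms": 122}   # 4 replicas x 5 GB

throughput_gain = shared["throughput_rpm"] / single["throughput_rpm"]
latency_gain = single["median_latency_ms"] / shared["median_latency_ms"]

print(f"Throughput: ~{throughput_gain:.1f}x higher")   # ~3.6x, close to 4x
print(f"Median latency: ~{latency_gain:.0f}x lower")   # ~26x
```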

Thanks to MLnative’s built-in GPU autoscaling, the number of model replicas was automatically adjusted to accommodate the increase in traffic. By utilizing multiple model replicas, we achieved roughly 4x higher throughput and over 26x lower median latency with the same amount of GPU memory. This example clearly demonstrates the power of GPU sharing for scaling inference workloads. All of this is provided out of the box when you run your models with MLnative. On top of that, we provide API quota controls, request prioritization, rate limiting, and much more to make your production ML operations as smooth as possible.

Want to know more? Reach out at contact@mlnative.com and find out how we can help you scale your ML operations!