MLOps Community
May 21, 2024

Exploring LLMs Speed Benchmarks

Tags: LLMs, Speed Benchmarks, Inferless

Independent Analysis - Part 2

Rajdeep Borgohain
Aishwarya Goel


In this blog post, we continue our series on benchmarking large language models (LLMs) by examining five LLMs ranging from 10B to 34B parameters across six different inference libraries. Our analysis covers key performance indicators such as Time to First Token (TTFT), tokens per second, and total inference time. These tests were conducted independently on an A100 GPU hosted on Azure rather than on Inferless, ensuring a fair and unbiased comparison of how each model performs.

This comprehensive benchmarking is intended to guide developers, researchers, and AI enthusiasts in choosing the most appropriate LLM for their specific tasks, enhancing decision-making for production environments and application strategies. This post follows up on our previous analysis, where we evaluated three 7B models: Llama 2, Mistral, and Gemma, across the same libraries. Check out part 1 of this series here.

Key Findings

In our benchmarking of five LLMs, here are the key results:

1. Llama-2-13b, using TensorRT-LLM, recorded the highest tokens per second at 52.60 with 20 input tokens and 500 output tokens, outperforming vLLM by about 6.92%. Despite TensorRT-LLM's impressive performance, vLLM was far more user-friendly. If you're still interested in TensorRT-LLM, we have a tutorial available for you to read.


2. SOLAR-10.7B, paired with vLLM reached a peak of 57.86 tokens per second, surpassing Triton with vLLM backend by 3.89%.


3. Qwen1.5-14B, with vLLM achieved 46.84 tokens per second, slightly ahead of Triton with vLLM by 1.79%.


4. Mpt-30b, using Text Generation Inference, hit 35.43 tokens per second with 100 input tokens and 200 output tokens, showing a 36.23% increase over TensorRT-LLM.


5. Yi-34B, the largest model tested, reached 21.26 tokens per second with TensorRT-LLM, surpassing vLLM by 1.58%.


Overall, SOLAR-10.7B demonstrated the highest throughput, at 57.86 tokens per second when optimized with vLLM. Each model showed unique strengths across different conditions and libraries. For a detailed comparison of the libraries in terms of simplicity, documentation, and setup time, refer to our previous blog post: Exploring LLMs' Speed Benchmarks – An Independent Analysis.
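The percentage gaps quoted above are simple relative-throughput ratios. A minimal sketch of that arithmetic (the 55.69 tok/s baseline below is back-computed from the quoted 3.89% gap, not a measured figure):

```python
def pct_speedup(fast: float, slow: float) -> float:
    """Relative throughput gain of `fast` over `slow`, in percent."""
    return (fast / slow - 1.0) * 100.0

# SOLAR-10.7B on vLLM (57.86 tok/s) vs. an assumed Triton-vLLM
# baseline of ~55.69 tok/s implied by the quoted 3.89% gap.
print(round(pct_speedup(57.86, 55.69), 2))
```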

Impact on TTFT

Time to First Token (TTFT) is a performance metric that measures the responsiveness of the model. A faster TTFT indicates a more interactive experience.

To assess TTFT, we deploy the model within a Docker container and stream the tokens generated by the LLM. We then submit a prompt query to the LLM and measure the time taken for the first token to appear.
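The timing logic behind this measurement can be sketched as follows. This is a minimal illustration, not our actual harness: `fake_llm_stream` is a hypothetical stand-in for a real streaming endpoint, with an artificial prefill delay that dominates TTFT.

```python
import time

def measure_ttft(token_stream):
    """Return (ttft_seconds, tokens) for a streaming generator.

    TTFT is the wall-clock time from issuing the request until the
    first token arrives; the remaining tokens are then drained.
    """
    start = time.perf_counter()
    ttft = None
    tokens = []
    for tok in token_stream:
        if ttft is None:
            ttft = time.perf_counter() - start
        tokens.append(tok)
    return ttft, tokens

def fake_llm_stream(prefill_delay=0.05, n_tokens=5, per_token=0.01):
    # Stand-in for a real streaming endpoint: the prefill delay
    # dominates TTFT, then tokens arrive at a steady decode rate.
    time.sleep(prefill_delay)
    for i in range(n_tokens):
        yield f"tok{i}"
        time.sleep(per_token)

ttft, toks = measure_ttft(fake_llm_stream())
print(f"TTFT: {ttft:.3f}s over {len(toks)} tokens")
```

Against a real deployment, the generator would be replaced by the library's streaming API; the timing logic stays the same.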

Key Findings from the TTFT Analysis Across Libraries and Input Sizes:

  1. Performance variation by library:
     - Triton-vLLM and vLLM generally had lower TTFTs at smaller input sizes, but TTFT increased substantially as input size grew.
     - CTranslate2 and DeepSpeed-MII also saw a rise in TTFT with larger token counts, indicating potential scalability issues under heavier loads.
     - Libraries designed for efficiency, like vLLM, maintained lower TTFTs more consistently, suggesting better optimization for real-time use.
  2. Scalability across libraries:
     - Libraries such as DeepSpeed-MII, vLLM, TGI, and Triton-vLLM managed scaling to higher token counts relatively well, though TTFT still increased notably at maximum token counts.
     - Performance generally deteriorated with the increase in input tokens across all libraries, with a notable spike in TTFT at the highest token counts, suggesting that managing larger inputs is a challenge.
  3. Average performance:
     - Average TTFTs across libraries tended to rise with an increase in input tokens. While performance was acceptable at 100 tokens, it declined sharply beyond 500 tokens.
     - The shortest TTFTs were observed around 20 input tokens, with times varying from about 0.025 to 0.060 seconds depending on the library.


How did we test them?

To evaluate leading LLM libraries, we developed a precise and systematic testing strategy. Here's a brief outline of our approach, emphasizing our commitment to accuracy and consistency:

  1. Testing Platform: All benchmarks were conducted on A100 GPUs provided by Azure, ensuring an unbiased setting for these tests.
  2. Environment Setup: We utilized Docker containers for vLLM, CTranslate2, and DeepSpeed-MII, alongside official containers for the other libraries. This setup guaranteed a uniform testing environment across all libraries.
  3. Configuration: Each test was standardized with temperature set to 0.5 and top_p to 1, allowing us to focus on the libraries' performance without external variables.
  4. Prompts and Token Ranges: Our test suite included six unique prompts with input lengths from 20 to 5,000 tokens. We explored three generation lengths (100, 200, and 500 tokens) to assess each library's adaptability to varied task complexities.
  5. Models and Libraries Tested: The evaluation featured Llama-2-13b-chat-hf, SOLAR-10.7B-Instruct-v1.0, Qwen1.5-14B-Chat, Mpt-30b-instruct, and Yi-34B-Chat, using the libraries Text Generation Inference (TGI), vLLM, DeepSpeed-MII, CTranslate2, Triton with vLLM backend, and TensorRT-LLM.

Note on Missing Data: Some data points are absent due to constraints: only Qwen1.5-14B-Chat and Mpt-30b-instruct support a context length of 5,000 tokens, and only vLLM, TGI, and Triton with vLLM backend were compatible with all model architectures tested. Large models like Yi-34B-Chat also ran into out-of-memory issues at higher token counts, a common challenge with substantial models.
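The test grid described above can be sketched as a simple cross product of prompt lengths, generation lengths, and the fixed sampling settings. Only the 20, 100, and 5,000-token prompt lengths are stated in this post; the intermediate values below are placeholders, not our actual prompt sizes.

```python
from itertools import product

# Six prompt lengths (20 and 5,000 are from the post; the rest are
# illustrative placeholders) crossed with three generation lengths.
input_sizes = [20, 100, 500, 1000, 2000, 5000]
output_sizes = [100, 200, 500]
sampling = {"temperature": 0.5, "top_p": 1.0}  # fixed across all runs

cases = [
    {"input_tokens": i, "max_new_tokens": o, **sampling}
    for i, o in product(input_sizes, output_sizes)
]
print(len(cases))  # 6 prompts x 3 generation lengths = 18 runs per model/library
```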

Detailed Benchmarks

Now let’s dive into the benchmarks. The header meanings are provided below:

Model Name: The designated model used in the benchmarking.

Library: The inference library used for the benchmarking.

TTFT: The time taken to generate and deliver the first token after the input prompt is provided.

Token_count: The total number of tokens generated by the model.

Latency (second): The duration taken to receive a response.

Tokens/second: The rate at which tokens are generated by the model per second.

Output_tokens: The maximum number of tokens requested for the response.

Input_tokens: The overall count of input tokens in the prompt.

Question: The specific question asked to the model.

Answer: The response generated for the given question.

1. Llama-2-13b

2. SOLAR-10.7B




Note: We appreciate and look forward to your thoughts or insights to help us refine our benchmarks. Our objective is to empower decisions with data, not to discount any service.

