Exploring LLMs Speed Benchmarks
Independent Analysis
June 24, 2024
Introduction
This post takes a closer look at how three 7-billion-parameter large language models (LLMs), Llama 2 7Bn, Mistral 7Bn, and Gemma 7Bn, perform. We tested them with different inference libraries to measure how fast they process text, which matters to anyone running these models. The tests were carried out independently on A100 GPU hardware hosted on Azure, not on Inferless, to give a fair and unbiased view of how each model handles different amounts of text.
Our findings are aimed at developers, researchers, and AI enthusiasts who need to pick the right language model for their work and want speed measurements to inform that choice. By sharing how Llama 2, Mistral, and Gemma did in our tests, we hope to help you make better choices for your production workloads and stay current on model performance.
Key Findings
In our benchmarking of three LLMs, the results are as follows:
- Mistral 7Bn, in conjunction with TensorRT-LLM, achieved the highest performance, reaching a maximum of 93.63 tokens/sec with 20 input tokens and 200 output tokens. This surpassed vLLM by approximately 5.10% in tokens per second. Although vLLM was slightly slower, it was very easy to set up and use. The CTranslate2 library produced the lowest results.
Figure: Mistral 7Bn
- Llama 2 7Bn, using TensorRT-LLM, outperformed vLLM by reaching a maximum of 92.18 tokens/sec with 20 input tokens and 200 output tokens, a slight improvement of 2.80%. Given the minimal difference and the time-consuming setup of TensorRT-LLM, which lacks proper documentation, vLLM is the recommended choice. However, if you're still interested in TensorRT-LLM, we have a tutorial available for you to read.
Figure: Llama 2 7Bn
- Gemma 7Bn, using Text Generation Inference, reached approximately 65.86 tokens/sec with 20 input tokens and 100 output tokens, a slight improvement of approximately 3.28% in tokens per second over vLLM. However, compared with Llama 2 and Mistral, the Gemma 7Bn model shows the lowest tokens/sec performance regardless of the library used.
Figure: Gemma 7Bn
Overall, Mistral achieved the highest tokens per second at 93.63 when optimized with TensorRT-LLM, highlighting its efficiency. However, each model displayed unique strengths depending on the conditions or libraries used, emphasizing the absence of a universal solution.
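The percentage gains quoted above can be turned back into an approximate vLLM throughput. The sketch below is only a back-of-the-envelope check: it assumes each percentage is measured relative to vLLM's tokens/sec at the same input and output lengths, which the post does not state explicitly. The helper name `baseline_from_gain` is made up for this illustration, and the resulting numbers are derived estimates, not measurements.

```python
def baseline_from_gain(best_tps: float, gain_pct: float) -> float:
    """Back out the baseline throughput, assuming gain_pct is relative to the baseline."""
    return best_tps / (1 + gain_pct / 100)


# (best tokens/sec, reported gain over vLLM in percent), taken from the key findings
reported = {
    "Mistral 7Bn (TensorRT-LLM)": (93.63, 5.10),
    "Llama 2 7Bn (TensorRT-LLM)": (92.18, 2.80),
    "Gemma 7Bn (TGI)": (65.86, 3.28),
}

# Implied vLLM throughput under the stated assumption
implied_vllm = {
    name: round(baseline_from_gain(tps, pct), 2)
    for name, (tps, pct) in reported.items()
}

for name, tps in implied_vllm.items():
    print(f"{name}: implied vLLM throughput of about {tps} tokens/sec")
```

Under that assumption, the implied vLLM figures for Mistral and Llama 2 land within a token per second of each other, which fits the post's point that the gap between the top two libraries is small.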
More details about these different libraries are mentioned below:
Library | Ease of Use | Time Required | Documentation | GitHub URL | GitHub Stars |
vLLM | Easy | < 30 minutes | Good | vLLM Project | 16.2k |
TGI | Easy | < 30 minutes | Good | Text Generation Inference by Hugging Face | 7.3k |
DeepSpeed Mii | Easy | < 30 minutes | Good | DeepSpeed-MII by Microsoft | 1.6k |
CTranslate2 | Easy | < 1 Hour | Good | CTranslate2 by OpenNMT | 2.6k |
Triton+vLLM Backend | Moderate | < 1 Hour | - | Triton+vLLM Backend by Inferless | Triton: 7k, vLLM Backend: 95 |
TensorRT-LLM | Moderate | < 3 Hours | - | TensorRT-LLM by Inferless | 5.9k |
How we tested them
To evaluate leading LLM libraries, we developed a precise and systematic testing strategy. Here's a brief outline of our approach, emphasizing our commitment to accuracy and consistency:
- Testing Platform: All benchmarks were conducted on A100 GPUs provided by Azure, ensuring an unbiased and independent view.
- Environment Setup: We utilized Docker containers for vLLM, CTranslate2, and DeepSpeed Mii, alongside official containers for other libraries. This setup guaranteed a uniform testing environment across all libraries.
- Configuration: Each test was standardized with temperature set to 0.5 and top_p to 1, allowing us to focus on the libraries' performance without external variables.
- Prompts and Token Ranges: Our test suite included six unique prompts with input lengths from 20 to 5,000 tokens. We explored three generation lengths (100, 200, and 500 tokens) to assess each library's adaptability to varied task complexities.
- Models and Libraries Tested: The evaluation featured Gemma 7B, Llama-2 7B, and Mistral 7B models, using libraries such as Text Generation Inference, vLLM, DeepSpeed Mii, CTranslate2, Triton with vLLM Backend, and TensorRT-LLM.
Our approach allowed for an in-depth analysis of each library's handling of diverse text generation tasks, emphasizing efficiency and adaptability.
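To make the setup above concrete, here is a minimal sketch of how such a test matrix and timing loop could be laid out. It is not the harness used for these benchmarks. The `generate` callable, `fake_generate`, and the dictionary keys are hypothetical stand-ins, the individual prompt lengths between 20 and 5,000 tokens are not listed in the post so prompts are only indexed, and running every model with every library is an assumption.

```python
import itertools
import time
from typing import Callable, Dict, List

MODELS = ["Gemma 7B", "Llama-2 7B", "Mistral 7B"]
LIBRARIES = [
    "Text Generation Inference", "vLLM", "DeepSpeed Mii",
    "CTranslate2", "Triton with vLLM Backend", "TensorRT-LLM",
]
NUM_PROMPTS = 6                      # six prompts, 20 to 5,000 input tokens
OUTPUT_LENGTHS = [100, 200, 500]     # generation lengths from the post
SAMPLING = {"temperature": 0.5, "top_p": 1.0}


def build_matrix() -> List[tuple]:
    """Every (model, library, prompt index, output length) combination."""
    return list(itertools.product(MODELS, LIBRARIES, range(NUM_PROMPTS), OUTPUT_LENGTHS))


def time_generation(generate: Callable[[str, int, Dict], List[int]],
                    prompt: str, max_tokens: int) -> Dict[str, float]:
    """Time one generation call and derive tokens per second from it."""
    start = time.perf_counter()
    tokens = generate(prompt, max_tokens, SAMPLING)
    elapsed = time.perf_counter() - start
    rate = len(tokens) / elapsed if elapsed > 0 else 0.0
    return {"token_count": len(tokens), "time_s": elapsed, "tokens_per_s": rate}


def fake_generate(prompt: str, max_tokens: int, sampling: Dict) -> List[int]:
    """Hypothetical backend: sleeps briefly and returns max_tokens dummy token ids."""
    time.sleep(0.01)
    return list(range(max_tokens))


matrix = build_matrix()
result = time_generation(fake_generate, "Hello", 100)
print(len(matrix), "configurations;", result["token_count"], "tokens generated")
```

In a real run, `fake_generate` would be replaced by a thin wrapper around each library's client, so the timing code stays identical across libraries and only the backend changes.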
Detailed Benchmarks
Now let’s dive into the benchmarks. The column headers are defined below:
Model Name: The designated model used in the benchmarking.
Library: The inference library used for the benchmarking.
Tokens_second: The rate at which tokens are generated by the model per second.
Input_tokens: The overall count of input tokens in the prompt.
Output_tokens: The maximum anticipated number of tokens in the response.
Time (second): The duration taken to receive a response.
Token_count: The total number of tokens generated by the model.
Question: The specific question asked to the model.
Answer: The response generated for the given question.
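One way to picture a single row of the benchmark tables is as a small record type. The sketch below mirrors the column names above. It assumes that Tokens_second is simply Token_count divided by Time (second), which the definitions suggest but do not state. The class name `BenchmarkRow` and the sample values are invented for illustration and are not measured results.

```python
from dataclasses import dataclass


@dataclass
class BenchmarkRow:
    model_name: str
    library: str
    input_tokens: int     # tokens in the prompt
    output_tokens: int    # maximum tokens requested in the response
    time_second: float    # time taken to receive the response
    token_count: int      # tokens actually generated
    question: str
    answer: str

    @property
    def tokens_second(self) -> float:
        """Assumed relation: generated tokens divided by response time."""
        return self.token_count / self.time_second


# Illustrative values only, not taken from the benchmark tables
row = BenchmarkRow(
    model_name="Mistral 7B",
    library="vLLM",
    input_tokens=20,
    output_tokens=200,
    time_second=2.0,
    token_count=200,
    question="(example question)",
    answer="(example answer)",
)
print(f"{row.model_name} / {row.library}: {row.tokens_second:.2f} tokens/sec")
```

Note that Token_count can be lower than Output_tokens when the model stops early, which is why the rate is derived from the tokens actually generated rather than from the requested maximum.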
1. Mistral 7Bn
2. Llama 2 7Bn
3. Gemma 7Bn
Note: We appreciate and look forward to your thoughts or insights to help refine our benchmarks better. Our objective is to empower decisions with data, not to discount any service.