In this blog post, we continue our series on benchmarking large language models (LLMs) by examining five LLMs ranging from 10B to 34B parameters across six different inference libraries. Our analysis covers key performance indicators such as Time to First Token, Tokens per Second, and total inference time. These tests were independently conducted on an A100 GPU hosted on Azure, rather than Inferless, ensuring a fair and unbiased comparison of how each model performs.
This comprehensive benchmarking is intended to guide developers, researchers, and AI enthusiasts in choosing the most appropriate LLM for their specific tasks, enhancing decision-making for production environments and application strategies. This post follows up on our previous analysis, where we evaluated three 7B models: LLama2, Mistral, and Gemma, across the same libraries. Check out part 1 of this series here.
In our benchmarking of five LLMs, here are the key results:
1. LLama-2-13b, using TensorRT-LLM, recorded the highest tokens per second at 52.60 with 20 input tokens and 500 output tokens, outperforming vLLM by about 6.92%. While TensorRT-LLM delivered the best throughput, vLLM was incredibly user-friendly by comparison. If you're still interested in TensorRT-LLM, we have a tutorial available for you to read.
LLama-2-13b
2. SOLAR-10.7B, paired with vLLM, reached a peak of 57.86 tokens per second, surpassing Triton with the vLLM backend by 3.89%.
SOLAR-10.7B
3. Qwen1.5-14B with vLLM achieved 46.84 tokens per second, slightly ahead of Triton with the vLLM backend by 1.79%.
Qwen1.5-14B
4. Mpt-30b, using Text Generation Inference, hit 35.43 tokens per second with 100 input tokens and 200 output tokens, showing a 36.23% increase over TensorRT-LLM.
Mpt-30b
5. Yi-34B, the largest model tested, reached 21.26 tokens per second with TensorRT-LLM, surpassing vLLM by 1.58%.
Yi-34B
Overall, SOLAR-10.7B demonstrated the highest tokens per second at 57.86 when optimized with vLLM. Each model showed unique strengths across different conditions and libraries. For a detailed comparison of the different libraries in terms of simplicity, documentation, and setup time, refer to our previous blog post: Exploring LLMs' Speed Benchmarks – An Independent Analysis.
Time to First Token (TTFT) is a performance metric that measures the responsiveness of the model. A faster TTFT indicates a more interactive experience.
To measure TTFT, we deploy the model within a Docker container and stream the tokens it generates: we submit a prompt query to the LLM and time how long the first token takes to appear. A minimal sketch of this measurement follows.
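The sketch below illustrates the idea, assuming an OpenAI-compatible streaming endpoint (for example, a vLLM server listening on localhost:8000); the URL, model name, and prompt are illustrative placeholders, not our exact harness.

```python
# TTFT measurement sketch: stream a completion and time the first chunk.
import time

import requests

def measure_ttft(prompt: str, model: str = "meta-llama/Llama-2-13b-chat-hf") -> float:
    payload = {
        "model": model,
        "prompt": prompt,
        "max_tokens": 500,
        "stream": True,  # stream so we can time the very first token
    }
    start = time.perf_counter()
    with requests.post(
        "http://localhost:8000/v1/completions", json=payload, stream=True
    ) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines():
            if line:  # first non-empty SSE chunk carries the first token
                return time.perf_counter() - start
    raise RuntimeError("stream ended before any token arrived")

print(f"TTFT: {measure_ttft('What is benchmarking?'):.3f}s")
```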
Key Findings from the TTFT Analysis Across Libraries and Input Sizes:
TTFT
To evaluate leading LLM libraries, we developed a precise and systematic testing strategy. Here's a brief outline of our approach, emphasizing our commitment to accuracy and consistency:
Note on Missing Data: Some data points are absent due to constraints: only Qwen1.5-14B-Chat and Mpt-30b-instruct support a context length of 5000 tokens, and only vLLM, TGI, and Triton with the vLLM backend were compatible with all model architectures tested. Large models like Yi-34B-Chat also ran out of memory at higher token counts, a common challenge with models of this size. A simplified sketch of the sweep we ran appears below.
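As an illustration of this approach (not our exact harness), each (input_tokens, output_tokens) pair is run per model and library, and unsupported combinations are skipped, which is why some cells are missing from the tables; the configuration pairs and the `run_benchmark` helper below are hypothetical.

```python
# Illustrative benchmark sweep: skip combinations a model cannot support.
CONFIGS = [(20, 500), (100, 200), (1000, 200), (5000, 200)]  # example pairs
LONG_CONTEXT_MODELS = {"Qwen1.5-14B-Chat", "Mpt-30b-instruct"}

def supported(model: str, input_tokens: int) -> bool:
    # Only some models accept a 5000-token context (see the note above).
    return input_tokens < 5000 or model in LONG_CONTEXT_MODELS

for model in ["LLama-2-13b", "SOLAR-10.7B", "Qwen1.5-14B-Chat",
              "Mpt-30b-instruct", "Yi-34B-Chat"]:
    for input_tokens, output_tokens in CONFIGS:
        if not supported(model, input_tokens):
            continue  # recorded as a missing data point
        # run_benchmark(model, input_tokens, output_tokens)  # hypothetical helper
```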
Now let’s dive into the benchmarks. The table headers are explained below, followed by a short sketch showing how these metrics relate:
Model Name: The designated model used in the benchmarking.
Library: The inference library used for the benchmarking.
TTFT: Time taken to generate and deliver the first token after the input prompt is provided.
Token_count: The total number of tokens generated by the model.
Latency (second): The duration taken to receive a response.
Tokens/second: The rate at which tokens are generated by the model per second.
Output_tokens: The maximum anticipated number of tokens in the response.
Input_tokens: The overall count of input tokens in the prompt.
Question: The specific question asked of the model.
Answer: The response generated for the given question.
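For clarity, here is a minimal sketch of how these reported metrics fit together, assuming timestamps captured while streaming; the variable names mirror the headers above, and the example timings are made up.

```python
# Derive the reported metrics from three timestamps and a token count.
def summarize(start: float, first_token_at: float, end: float, token_count: int) -> dict:
    latency = end - start                 # Latency (second)
    return {
        "TTFT": first_token_at - start,   # Time to First Token
        "Latency (second)": latency,
        "Tokens/second": token_count / latency,
        "Token_count": token_count,
    }

# Example with made-up timings: 500 tokens in 9.5s gives ~52.6 tokens/sec,
# the same order of magnitude reported for LLama-2-13b above.
print(summarize(start=0.0, first_token_at=0.35, end=9.5, token_count=500))
```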
1. LLama-2-13b
2. SOLAR-10.7B
3. Qwen1.5-14B
4. Mpt-30b
5. Yi-34B
Note: We welcome your thoughts and insights to help us refine these benchmarks. Our objective is to empower decisions with data, not to discount any service.
Originally posted at: