Speed Benchmark

This document describes the speed benchmarking process for the Qwen2.5 series models (both original and quantized). For detailed reports, please refer to the Qwen2.5 Speed Benchmark.

1. Model Collections

For models hosted on HuggingFace, refer to Qwen2.5 Collections-HuggingFace.

For models hosted on ModelScope, refer to Qwen2.5 Collections-ModelScope.

2. Environment Setup

For inference using HuggingFace transformers:

conda create -n qwen_perf_transformers python=3.10
conda activate qwen_perf_transformers

pip install torch==2.3.1
pip install git+https://github.com/AutoGPTQ/AutoGPTQ.git@v0.7.1
pip install git+https://github.com/Dao-AILab/flash-attention.git@v2.5.8
pip install -r requirements-perf-transformers.txt

[!Important]

  • For flash-attention, you can use the prebuilt wheels from GitHub Releases or install from source, which requires a compatible CUDA compiler.
    • You don't actually need to install flash-attention; it has been integrated into torch as a backend of sdpa (see the sketch below).
  • For auto_gptq to use efficient kernels, you need to install it from source, because the prebuilt wheels require incompatible torch versions. Installing from source also requires a compatible CUDA compiler.
  • For autoawq to use efficient kernels, you need autoawq-kernels, which should be installed automatically. If not, run pip install autoawq-kernels.
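The sdpa note above can be checked with a minimal sketch that loads Qwen2.5-0.5B-Instruct via transformers with an explicit attn_implementation and generates a few tokens. This is only an illustration, not part of the benchmark scripts; the prompt, dtype, and device placement are assumptions.

```python
# Minimal sketch: load Qwen2.5 with an explicit attention implementation.
# Assumes the qwen_perf_transformers environment above and a CUDA GPU;
# the prompt, dtype, and device placement are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2.5-0.5B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    attn_implementation="sdpa",  # or "flash_attention_2" / "eager"
    device_map="auto",
)

inputs = tokenizer("Hello, Qwen!", return_tensors="pt").to(model.device)
with torch.inference_mode():
    out = model.generate(**inputs, max_new_tokens=16)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```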

For inference using vLLM:

conda create -n qwen_perf_vllm python=3.10
conda activate qwen_perf_vllm

pip install -r requirements-perf-vllm.txt

3. Execute Tests

Below are two methods for executing tests: using the Speed Benchmark tool or using a script.

Method 1: Testing with Speed Benchmark Tool

Use the Speed Benchmark tool developed by EvalScope, which supports automatic model downloads from ModelScope and outputs test results. It also allows testing by specifying the model service URL. For details, please refer to the 📖 User Guide.

Install Dependencies

pip install 'evalscope[perf]' -U

HuggingFace Transformers Inference

Execute the command as follows:

CUDA_VISIBLE_DEVICES=0 evalscope perf \
 --parallel 1 \
 --model Qwen/Qwen2.5-0.5B-Instruct \
 --attn-implementation flash_attention_2 \
 --log-every-n-query 5 \
 --connect-timeout 6000 \
 --read-timeout 6000 \
 --max-tokens 2048 \
 --min-tokens 2048 \
 --api local \
 --dataset speed_benchmark 

vLLM Inference

CUDA_VISIBLE_DEVICES=0 evalscope perf \
 --parallel 1 \
 --model Qwen/Qwen2.5-0.5B-Instruct \
 --log-every-n-query 1 \
 --connect-timeout 60000 \
 --read-timeout 60000 \
 --max-tokens 2048 \
 --min-tokens 2048 \
 --api local_vllm \
 --dataset speed_benchmark

Parameter Explanation

  • --parallel sets the number of worker threads for concurrent requests; it should be fixed at 1.
  • --model specifies the model file path or model ID, supporting automatic downloads from ModelScope, e.g., Qwen/Qwen2.5-0.5B-Instruct.
  • --attn-implementation sets the attention implementation method, with optional values: flash_attention_2|eager|sdpa.
  • --log-every-n-query: logs results every n requests.
  • --connect-timeout: sets the connection timeout in seconds.
  • --read-timeout: sets the read timeout in seconds.
  • --max-tokens: sets the maximum output length in tokens.
  • --min-tokens: sets the minimum output length in tokens; setting both parameters to 2048 means the model will output a fixed length of 2048 tokens.
  • --api: sets the inference interface; local inference options are local|local_vllm.
  • --dataset: sets the test dataset; options are speed_benchmark|speed_benchmark_long.

Test Results

Test results can be found in the outputs/{model_name}/{timestamp}/speed_benchmark.json file, which contains all request results and test parameters.
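For programmatic post-processing, a minimal sketch that loads the result file and prints its top-level structure is shown below. It assumes only that the file is valid JSON; the exact field names depend on the EvalScope version, and the model name and timestamp in the path are placeholders.

```python
# Minimal sketch: inspect a speed_benchmark.json result file.
# Assumes only that the file is valid JSON; field names depend on the
# EvalScope version. The model name and timestamp are placeholders.
import json
from pathlib import Path

result_path = Path("outputs") / "Qwen2.5-0.5B-Instruct" / "<timestamp>" / "speed_benchmark.json"
results = json.loads(result_path.read_text())

# Print the top-level structure without assuming specific field names.
if isinstance(results, dict):
    for key, value in results.items():
        print(f"{key}: {type(value).__name__}")
else:
    print(f"Loaded {len(results)} records")
```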

Method 2: Testing with Scripts

HuggingFace Transformers Inference

  • Using HuggingFace Hub
python speed_benchmark_transformers.py --model_id_or_path Qwen/Qwen2.5-0.5B-Instruct --context_length 1 --gpus 0 --outputs_dir outputs/transformers
  • Using ModelScope Hub
python speed_benchmark_transformers.py --model_id_or_path Qwen/Qwen2.5-0.5B-Instruct --context_length 1 --gpus 0 --use_modelscope --outputs_dir outputs/transformers

Parameter Explanation:

  • `--model_id_or_path`: Model ID or local path; for available models, refer to the `Model Collections` section above.
  • `--context_length`: Input length in tokens; optional values are 1, 6144, 14336, 30720, 63488, 129024; refer to the `Qwen2.5 Model Efficiency Evaluation Report` for specifics (a sweep sketch follows this list).
  • `--generate_length`: Number of tokens to generate; default is 2048.
  • `--gpus`: Equivalent to the environment variable CUDA_VISIBLE_DEVICES, e.g., `0,1,2,3` or `4,5`.
  • `--use_modelscope`: If set, uses ModelScope to load the model; otherwise, uses HuggingFace.
  • `--outputs_dir`: Output directory; default is `outputs/transformers`.
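To run the script across all supported input lengths in one go, a minimal sweep sketch is shown below. It reuses only the flags from the commands above; the model, GPU index, and output directory are illustrative.

```python
# Minimal sketch: sweep all supported context lengths with the
# transformers benchmark script. Flags are taken from the commands above;
# the model, GPU index, and output directory are illustrative.
import subprocess

CONTEXT_LENGTHS = [1, 6144, 14336, 30720, 63488, 129024]

for length in CONTEXT_LENGTHS:
    subprocess.run(
        [
            "python", "speed_benchmark_transformers.py",
            "--model_id_or_path", "Qwen/Qwen2.5-0.5B-Instruct",
            "--context_length", str(length),
            "--gpus", "0",
            "--outputs_dir", "outputs/transformers",
        ],
        check=True,
    )
```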

vLLM Inference

  • Using HuggingFace Hub
python speed_benchmark_vllm.py --model_id_or_path Qwen/Qwen2.5-0.5B-Instruct --context_length 1 --max_model_len 32768 --gpus 0 --gpu_memory_utilization 0.9 --outputs_dir outputs/vllm
  • Using ModelScope Hub
python speed_benchmark_vllm.py --model_id_or_path Qwen/Qwen2.5-0.5B-Instruct --context_length 1 --max_model_len 32768 --gpus 0 --use_modelscope --gpu_memory_utilization 0.9 --outputs_dir outputs/vllm

Parameter Explanation: