# Speed Benchmark
This document describes the speed benchmark testing process for the Qwen2.5 series models (original and quantized). For detailed reports, please refer to the Qwen2.5 Speed Benchmark.
## 1. Model Collections
- For models hosted on HuggingFace, refer to Qwen2.5 Collections-HuggingFace.
- For models hosted on ModelScope, refer to Qwen2.5 Collections-ModelScope.
## 2. Environment Setup
For inference using HuggingFace transformers:
```bash
conda create -n qwen_perf_transformers python=3.10
conda activate qwen_perf_transformers
pip install torch==2.3.1
pip install git+https://github.com/AutoGPTQ/AutoGPTQ.git@v0.7.1
pip install git+https://github.com/Dao-AILab/flash-attention.git@v2.5.8
pip install -r requirements-perf-transformers.txt
```
> [!IMPORTANT]
> - For `flash-attention`, you can use the prebuilt wheels from GitHub Releases or install from source, which requires a compatible CUDA compiler.
>   - You don't actually need to install `flash-attention`. It has been integrated into `torch` as a backend of `sdpa`.
> - For `auto_gptq` to use efficient kernels, you need to install it from source, because the prebuilt wheels require incompatible `torch` versions. Installing from source also requires a compatible CUDA compiler.
> - For `autoawq` to use efficient kernels, you need `autoawq-kernels`, which should be installed automatically. If not, run `pip install autoawq-kernels`.
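For illustration, here is how the attention backend is selected when loading a Qwen2.5 model with transformers. This is a minimal sketch assuming a recent transformers version: `sdpa` needs no extra package, while `flash_attention_2` requires flash-attn to be installed.

```python
import torch
from transformers import AutoModelForCausalLM

# "sdpa" uses torch's built-in scaled_dot_product_attention (no flash-attn install needed);
# "flash_attention_2" requires the flash-attn package; "eager" is the plain PyTorch fallback.
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-0.5B-Instruct",
    torch_dtype=torch.bfloat16,
    attn_implementation="sdpa",  # or "flash_attention_2" / "eager"
)
```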
For inference using vLLM:
```bash
conda create -n qwen_perf_vllm python=3.10
conda activate qwen_perf_vllm
pip install -r requirements-perf-vllm.txt
```
## 3. Execute Tests
Below are two methods for executing the tests: using the Speed Benchmark tool or using a script.
### Method 1: Testing with Speed Benchmark Tool
Use the Speed Benchmark tool developed by EvalScope, which supports automatic model downloads from ModelScope and outputs test results. It also supports testing a deployed model service by specifying its URL. For details, please refer to the EvalScope documentation.
#### Install Dependencies
```bash
pip install 'evalscope[perf]' -U
```
#### HuggingFace Transformers Inference
Execute the command as follows:
```bash
CUDA_VISIBLE_DEVICES=0 evalscope perf \
  --parallel 1 \
  --model Qwen/Qwen2.5-0.5B-Instruct \
  --attn-implementation flash_attention_2 \
  --log-every-n-query 5 \
  --connect-timeout 6000 \
  --read-timeout 6000 \
  --max-tokens 2048 \
  --min-tokens 2048 \
  --api local \
  --dataset speed_benchmark
```
#### vLLM Inference
```bash
CUDA_VISIBLE_DEVICES=0 evalscope perf \
  --parallel 1 \
  --model Qwen/Qwen2.5-0.5B-Instruct \
  --log-every-n-query 1 \
  --connect-timeout 60000 \
  --read-timeout 60000 \
  --max-tokens 2048 \
  --min-tokens 2048 \
  --api local_vllm \
  --dataset speed_benchmark
```
#### Parameter Explanation
- `--parallel`: the number of worker threads for concurrent requests; should be fixed at 1.
- `--model`: the model file path or model ID; supports automatic download from ModelScope, e.g., Qwen/Qwen2.5-0.5B-Instruct.
- `--attn-implementation`: the attention implementation; optional values are flash_attention_2|eager|sdpa.
- `--log-every-n-query`: log every n requests.
- `--connect-timeout`: the connection timeout in seconds.
- `--read-timeout`: the read timeout in seconds.
- `--max-tokens`: the maximum output length in tokens.
- `--min-tokens`: the minimum output length in tokens; setting both parameters to 2048 means the model outputs a fixed length of 2048 tokens.
- `--api`: the inference interface; local inference options are local|local_vllm.
- `--dataset`: the test dataset; options are speed_benchmark|speed_benchmark_long.
#### Test Results
Test results can be found in the `outputs/{model_name}/{timestamp}/speed_benchmark.json` file, which contains all request results and test parameters.
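For a quick sanity check of a run, you can simply load and inspect that file. The sketch below is illustrative: the path must be replaced with the model name and timestamp of your own run, and no particular JSON schema is assumed.

```python
import json
from pathlib import Path

# Illustrative path: substitute the model name and timestamp of your own run.
result_file = Path("outputs/Qwen2.5-0.5B-Instruct/20250101_000000/speed_benchmark.json")

with result_file.open(encoding="utf-8") as f:
    results = json.load(f)

# Print the raw content without assuming a particular schema.
print(json.dumps(results, indent=2, ensure_ascii=False))
```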
### Method 2: Testing with Scripts
#### HuggingFace Transformers Inference
- Using HuggingFace Hub

  ```bash
  python speed_benchmark_transformers.py --model_id_or_path Qwen/Qwen2.5-0.5B-Instruct --context_length 1 --gpus 0 --outputs_dir outputs/transformers
  ```

- Using ModelScope Hub

  ```bash
  python speed_benchmark_transformers.py --model_id_or_path Qwen/Qwen2.5-0.5B-Instruct --context_length 1 --gpus 0 --use_modelscope --outputs_dir outputs/transformers
  ```
Parameter Explanation:
- `--model_id_or_path`: Model ID or local path; for available values, refer to the `Model Resources` section.
- `--context_length`: Input length in tokens; optional values are 1, 6144, 14336, 30720, 63488, 129024; refer to the `Qwen2.5 Model Efficiency Evaluation Report` for specifics.
- `--generate_length`: Number of tokens to generate; default is 2048.
- `--gpus`: Equivalent to the environment variable CUDA_VISIBLE_DEVICES, e.g., `0,1,2,3`, `4,5`.
- `--use_modelscope`: If set, uses ModelScope to load the model; otherwise, uses HuggingFace.
- `--outputs_dir`: Output directory; default is `outputs/transformers`.
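For reference, here is a rough sketch of the kind of timing such a run performs, assuming a fixed-length greedy `generate` call with transformers. This is an illustration only, not the benchmark script itself, which handles additional details and result logging.

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2.5-0.5B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="cuda:0"
)

context_length = 1      # corresponds to --context_length
generate_length = 2048  # corresponds to --generate_length

# Dummy input of the requested length; real prompts are not needed for a speed measurement.
input_ids = torch.full(
    (1, context_length), tokenizer.eos_token_id, dtype=torch.long, device=model.device
)

torch.cuda.synchronize()
start = time.perf_counter()
model.generate(
    input_ids,
    max_new_tokens=generate_length,
    min_new_tokens=generate_length,  # force a fixed output length
    do_sample=False,
)
torch.cuda.synchronize()
elapsed = time.perf_counter() - start
print(f"~{generate_length / elapsed:.2f} tokens/s")
```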
#### vLLM Inference
- Using HuggingFace Hub

  ```bash
  python speed_benchmark_vllm.py --model_id_or_path Qwen/Qwen2.5-0.5B-Instruct --context_length 1 --max_model_len 32768 --gpus 0 --gpu_memory_utilization 0.9 --outputs_dir outputs/vllm
  ```

- Using ModelScope Hub

  ```bash
  python speed_benchmark_vllm.py --model_id_or_path Qwen/Qwen2.5-0.5B-Instruct --context_length 1 --max_model_len 32768 --gpus 0 --use_modelscope --gpu_memory_utilization 0.9 --outputs_dir outputs/vllm
  ```
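For reference, a rough sketch of a comparable measurement with the vLLM Python API is shown below. It is illustrative only and not the benchmark script; the values simply mirror the command-line flags above.

```python
import time
from vllm import LLM, SamplingParams

# Values mirror the flags used above; adjust for your hardware.
llm = LLM(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    max_model_len=32768,
    gpu_memory_utilization=0.9,
)

# Force a fixed 2048-token output so the timing is comparable across runs.
sampling = SamplingParams(
    temperature=0.0, max_tokens=2048, min_tokens=2048, ignore_eos=True
)

prompt = "Hello"  # a near-minimal context, roughly matching --context_length 1

start = time.perf_counter()
outputs = llm.generate([prompt], sampling)
elapsed = time.perf_counter() - start

n_generated = len(outputs[0].outputs[0].token_ids)
print(f"~{n_generated / elapsed:.2f} tokens/s")
```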
Parameter Explanation: