# Speed Benchmark

This document introduces the speed benchmark testing process for the Qwen2.5 series models (original and quantized models). For detailed reports, please refer to the [Qwen2.5 Speed Benchmark](https://qwen.readthedocs.io/en/latest/benchmark/speed_benchmark.html).
## 1. Model Collections

For models hosted on HuggingFace, refer to [Qwen2.5 Model - HuggingFace](https://huggingface.co/collections/Qwen/qwen25-66e81a666513e518adb90d9e).

For models hosted on ModelScope, refer to [Qwen2.5 Model - ModelScope](https://modelscope.cn/collections/Qwen25-dbc4d30adb768).

## 2. Environment Setup

For inference using HuggingFace transformers:
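A minimal sketch of the transformers environment setup, assuming a dedicated conda environment and a `requirements-perf-transformers.txt` file analogous to the vLLM one (the environment name, Python version, and requirements file name here are assumptions for illustration):

```shell
# Create and activate a dedicated environment for the transformers backend
# (environment name and Python version are assumed, not taken from this document)
conda create -n qwen_perf_transformers python=3.10 -y
conda activate qwen_perf_transformers

# Install the benchmark dependencies
# (file name assumed by analogy with requirements-perf-vllm.txt)
pip install -r requirements-perf-transformers.txt
```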
For inference using vLLM:

```shell
conda activate qwen_perf_vllm
pip install -r requirements-perf-vllm.txt
```
## 3. Execute Tests
Below are two ways to run the tests: with the Speed Benchmark tool (Method 1) or with the provided scripts (Method 2).

### Method 1: Testing with the Speed Benchmark Tool

Use the Speed Benchmark tool developed by [EvalScope](https://github.com/modelscope/evalscope). It supports automatic downloading of models from ModelScope and reports the test results; refer to the [📖 User Guide](https://evalscope.readthedocs.io/en/latest/user_guides/stress_test/speed_benchmark.html) for details.
**Install Dependencies**
```shell
pip install 'evalscope[perf]' -U
```
#### HuggingFace Transformers Inference

Execute the command as follows:

```shell
CUDA_VISIBLE_DEVICES=0 evalscope perf \
--parallel 1 \
--model Qwen/Qwen2.5-0.5B-Instruct \
--attn-implementation flash_attention_2 \
--log-every-n-query 5 \
--connect-timeout 6000 \
--read-timeout 6000 \
--max-tokens 2048 \
--min-tokens 2048 \
--api local \
--dataset speed_benchmark
```
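Since the benchmark also covers quantized models, the same command can point at a quantized checkpoint simply by changing the model ID. A sketch follows; the GPTQ-Int4 model ID is an assumed example based on the Qwen2.5 collection's naming, not a value taken from this document:

```shell
# Same local transformers backend, but benchmarking a quantized checkpoint
# (the exact model ID below is an assumed example)
CUDA_VISIBLE_DEVICES=0 evalscope perf \
 --parallel 1 \
 --model Qwen/Qwen2.5-0.5B-Instruct-GPTQ-Int4 \
 --attn-implementation flash_attention_2 \
 --log-every-n-query 5 \
 --connect-timeout 6000 \
 --read-timeout 6000 \
 --max-tokens 2048 \
 --min-tokens 2048 \
 --api local \
 --dataset speed_benchmark
```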
#### vLLM Inference
```shell
CUDA_VISIBLE_DEVICES=0 evalscope perf \
--parallel 1 \
--model Qwen/Qwen2.5-0.5B-Instruct \
--log-every-n-query 1 \
--connect-timeout 60000 \
--read-timeout 60000 \
--max-tokens 2048 \
--min-tokens 2048 \
--api local_vllm \
--dataset speed_benchmark
```

#### Parameter Explanation
- `--parallel`: number of worker threads for concurrent requests; keep this fixed at 1.
- `--model`: model file path or model ID; models can be downloaded automatically from ModelScope, e.g. `Qwen/Qwen2.5-0.5B-Instruct`.
- `--attn-implementation`: attention implementation; one of `flash_attention_2`, `eager`, or `sdpa`.
- `--log-every-n-query`: log progress every n requests.
- `--connect-timeout`: connection timeout in seconds.
- `--read-timeout`: read timeout in seconds.
- `--max-tokens`: maximum output length in tokens.
- `--min-tokens`: minimum output length in tokens; setting both to 2048 forces a fixed output length of 2048 tokens.
- `--api`: inference interface; for local inference, choose `local` (transformers) or `local_vllm` (vLLM).
- `--dataset`: test dataset; choose `speed_benchmark` or `speed_benchmark_long` (see the long-context example after this list).
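For reference, a long-context run only swaps the dataset. A minimal sketch, reusing the transformers command above with the `speed_benchmark_long` dataset named in the parameter list:

```shell
# Same local transformers backend as above, but with the long-context dataset
CUDA_VISIBLE_DEVICES=0 evalscope perf \
 --parallel 1 \
 --model Qwen/Qwen2.5-0.5B-Instruct \
 --attn-implementation flash_attention_2 \
 --log-every-n-query 5 \
 --connect-timeout 6000 \
 --read-timeout 6000 \
 --max-tokens 2048 \
 --min-tokens 2048 \
 --api local \
 --dataset speed_benchmark_long
```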
#### Test Results

Test results can be found in the `outputs/{model_name}/{timestamp}/speed_benchmark.json` file, which contains all request results and test parameters.
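To take a quick look at a result file, something like the following works; the timestamped path below is illustrative, so substitute the directory that the run actually created:

```shell
# List result files produced by benchmark runs
ls outputs/*/*/speed_benchmark.json

# Pretty-print one result file for inspection (path is an assumed example)
python -m json.tool outputs/Qwen2.5-0.5B-Instruct/20240101_000000/speed_benchmark.json
```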
### Method 2: Testing with Scripts

#### HuggingFace Transformers Inference

- Using HuggingFace Hub

```shell
python speed_benchmark_transformers.py --model_id_or_path Qwen/Qwen2.5-0.5B-Instruct --context_length 1 --gpus 0 --outputs_dir outputs/transformers

# Specify HF_ENDPOINT
HF_ENDPOINT=https://hf-mirror.com python speed_benchmark_transformers.py --model_id_or_path Qwen/Qwen2.5-0.5B-Instruct --context_length 1 --gpus 0 --outputs_dir outputs/transformers
```

- Using ModelScope Hub

```shell
python speed_benchmark_transformers.py --model_id_or_path Qwen/Qwen2.5-0.5B-Instruct --context_length 1 --gpus 0 --use_modelscope --outputs_dir outputs/transformers
```
Parameter Explanation (a combined example follows the list):

- `--model_id_or_path`: Model ID or local path; for available models, see the `Model Collections` section.
- `--context_length`: Input length in tokens; valid values are 1, 6144, 14336, 30720, 63488, and 129024; refer to the `Qwen2.5 Speed Benchmark` report linked above for details.
- `--generate_length`: Number of tokens to generate; default is 2048.
- `--gpus`: Equivalent to the environment variable `CUDA_VISIBLE_DEVICES`, e.g. `0,1,2,3` or `4,5`.
- `--use_modelscope`: If set, loads the model from ModelScope; otherwise, from HuggingFace.
- `--outputs_dir`: Output directory; default is `outputs/transformers`.
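As an illustration of combining several of these flags, a longer-context, multi-GPU run might look like the sketch below; the model choice, GPU count, and context length are example values, not recommendations from this document:

```shell
# 6144-token input, 2048-token output, ModelScope download, two GPUs
# (model ID and GPU count are illustrative assumptions)
python speed_benchmark_transformers.py \
  --model_id_or_path Qwen/Qwen2.5-7B-Instruct \
  --context_length 6144 \
  --generate_length 2048 \
  --gpus 0,1 \
  --use_modelscope \
  --outputs_dir outputs/transformers
```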
#### vLLM Inference

- Using HuggingFace Hub
```shell
python speed_benchmark_vllm.py --model_id_or_path Qwen/Qwen2.5-0.5B-Instruct --context_length 1 --max_model_len 32768 --gpus 0 --gpu_memory_utilization 0.9 --outputs_dir outputs/vllm
```
- Using ModelScope Hub
```shell
python speed_benchmark_vllm.py --model_id_or_path Qwen/Qwen2.5-0.5B-Instruct --context_length 1 --max_model_len 32768 --gpus 0 --use_modelscope --gpu_memory_utilization 0.9 --outputs_dir outputs/vllm
```
Parameter Explanation (a combined example follows the list):

- `--model_id_or_path`: Model ID or local path; for available models, see the `Model Collections` section.
- `--context_length`: Input length in tokens; valid values are 1, 6144, 14336, 30720, 63488, and 129024; refer to the `Qwen2.5 Speed Benchmark` report linked above for details.
- `--generate_length`: Number of tokens to generate; default is 2048.
- `--max_model_len`: Maximum model context length in tokens; default is 32768; typical values are 4096, 8192, 32768, 65536, and 131072.
- `--gpus`: Equivalent to the environment variable `CUDA_VISIBLE_DEVICES`, e.g. `0,1,2,3` or `4,5`.
- `--use_modelscope`: If set, loads the model from ModelScope; otherwise, from HuggingFace.
- `--gpu_memory_utilization`: GPU memory utilization, in the range (0, 1]; default is 0.9.
- `--outputs_dir`: Output directory; default is `outputs/vllm`.
- `--enforce_eager`: Whether to enforce eager mode; default is False.
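By way of illustration, a long-context vLLM run combining several of these flags might look like the sketch below; the model choice, two-GPU setup, and 30720-token input are example values, not prescribed by this document:

```shell
# 30720-token input with a 65536-token context window on two GPUs
# (model ID, GPU count, and max_model_len choice are illustrative assumptions)
python speed_benchmark_vllm.py \
  --model_id_or_path Qwen/Qwen2.5-7B-Instruct \
  --context_length 30720 \
  --generate_length 2048 \
  --max_model_len 65536 \
  --gpus 0,1 \
  --gpu_memory_utilization 0.9 \
  --outputs_dir outputs/vllm
```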
#### Test Results

Test results can be found in the `outputs` directory, which by default contains two folders, `transformers` and `vllm`, storing the test results for HuggingFace transformers and vLLM respectively.

## Notes

1. Run each test multiple times and take the average; three runs is a typical choice (see the sketch below).
2. Ensure the GPU is idle before testing to avoid interference from other tasks.
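A small sketch of how the repetition in note 1 might be automated, assuming the Method 2 transformers script and an idle GPU (checked with `nvidia-smi`):

```shell
# Confirm the GPU is idle before starting (note 2)
nvidia-smi

# Repeat the same configuration three times; average the reported numbers afterwards
for i in 1 2 3; do
  python speed_benchmark_transformers.py \
    --model_id_or_path Qwen/Qwen2.5-0.5B-Instruct \
    --context_length 1 \
    --gpus 0 \
    --outputs_dir outputs/transformers
done
```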