diff --git a/examples/speed-benchmark/README.md b/examples/speed-benchmark/README.md
index 84987b7b2a85988622394fcf0179b84034008c9d..3c521ffc04033ef431675d70c0d67d4a57996386 100644
--- a/examples/speed-benchmark/README.md
+++ b/examples/speed-benchmark/README.md
@@ -1,14 +1,14 @@
-## Speed Benchmark
+# Speed Benchmark
 
 This document introduces the speed benchmark testing process for the Qwen2.5 series models (original and quantized models). For detailed reports, please refer to the [Qwen2.5 Speed Benchmark](https://qwen.readthedocs.io/en/latest/benchmark/speed_benchmark.html).
 
-### 1. Model Collections
+## 1. Model Collections
 
-For models hosted on HuggingFace, please refer to [Qwen2.5 Collections-HuggingFace](https://huggingface.co/collections/Qwen/qwen25-66e81a666513e518adb90d9e).
+For models hosted on HuggingFace, refer to [Qwen2.5 Model - HuggingFace](https://huggingface.co/collections/Qwen/qwen25-66e81a666513e518adb90d9e).
 
-For models hosted on ModelScope, please refer to [Qwen2.5 Collections-ModelScope](https://modelscope.cn/collections/Qwen25-dbc4d30adb768).
+For models hosted on ModelScope, refer to [Qwen2.5 Model - ModelScope](https://modelscope.cn/collections/Qwen25-dbc4d30adb768).
 
-### 2. Environment Installation
+## 2. Environment Setup
 
 For inference using HuggingFace transformers:
 
@@ -38,69 +38,126 @@ conda activate qwen_perf_vllm
 pip install -r requirements-perf-vllm.txt
 ```
 
+## 3. Execute Tests
-### 3. Run Experiments
+
+Below are two ways to run the tests: with the Speed Benchmark tool or with the provided scripts.
 
-#### 3.1 Inference using HuggingFace Transformers
+### Method 1: Testing with the Speed Benchmark Tool
 
-- Use HuggingFace hub
+Use the Speed Benchmark tool developed by [EvalScope](https://github.com/modelscope/evalscope). It automatically downloads models from ModelScope and outputs the test results; see the [📖 User Guide](https://evalscope.readthedocs.io/en/latest/user_guides/stress_test/speed_benchmark.html).
 
+**Install Dependencies**
 ```shell
-python speed_benchmark_transformers.py --model_id_or_path Qwen/Qwen2.5-0.5B-Instruct --context_length 1 --gpus 0 --outputs_dir outputs/transformers
+pip install 'evalscope[perf]' -U
 ```
 
-- Use ModelScope hub
+#### HuggingFace Transformers Inference
 
+Run the following command:
 ```shell
-python speed_benchmark_transformers.py --model_id_or_path Qwen/Qwen2.5-0.5B-Instruct --context_length 1 --gpus 0 --use_modelscope --outputs_dir outputs/transformers
+CUDA_VISIBLE_DEVICES=0 evalscope perf \
+  --parallel 1 \
+  --model Qwen/Qwen2.5-0.5B-Instruct \
+  --attn-implementation flash_attention_2 \
+  --log-every-n-query 5 \
+  --connect-timeout 6000 \
+  --read-timeout 6000 \
+  --max-tokens 2048 \
+  --min-tokens 2048 \
+  --api local \
+  --dataset speed_benchmark
+```
+
+#### vLLM Inference
+
+```shell
+CUDA_VISIBLE_DEVICES=0 evalscope perf \
+  --parallel 1 \
+  --model Qwen/Qwen2.5-0.5B-Instruct \
+  --log-every-n-query 1 \
+  --connect-timeout 60000 \
+  --read-timeout 60000 \
+  --max-tokens 2048 \
+  --min-tokens 2048 \
+  --api local_vllm \
+  --dataset speed_benchmark
 ```
 
-Parameters:
+#### Parameter Explanation
+- `--parallel`: number of worker threads for concurrent requests; keep this fixed at 1.
+- `--model`: model file path, or a model ID such as Qwen/Qwen2.5-0.5B-Instruct (downloaded automatically from ModelScope).
+- `--attn-implementation`: attention implementation; optional values: flash_attention_2|eager|sdpa.
+- `--log-every-n-query`: log once every n requests.
+- `--connect-timeout`: connection timeout in seconds.
+- `--read-timeout`: read timeout in seconds.
+- `--max-tokens`: maximum output length in tokens.
+- `--min-tokens`: minimum output length in tokens; setting both to 2048 forces the model to output exactly 2048 tokens.
+- `--api`: inference interface; for local inference the options are local|local_vllm.
+- `--dataset`: test dataset; options are speed_benchmark|speed_benchmark_long.
- `--model_id_or_path`: The model path or id on ModelScope or HuggingFace hub
- `--context_length`: Input length in tokens; optional values are 1, 6144, 14336, 30720, 63488, 129024; Refer to the `Qwen2.5 SpeedBenchmark`.
- `--generate_length`: Output length in tokens; default is 2048.
- `--gpus`: Equivalent to the environment variable CUDA_VISIBLE_DEVICES. e.g. `0,1,2,3`, `4,5`
- `--use_modelscope`: Use ModelScope when set this flag. Otherwise, use HuggingFace.
- `--outputs_dir`: Output directory; default is outputs/transformers.
+
+#### Test Results
+Test results can be found in the `outputs/{model_name}/{timestamp}/speed_benchmark.json` file, which contains all request results and test parameters.
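+
+For example, to locate and pretty-print the most recent result file, something like the following works (a sketch; it assumes the default output layout described above):
+
+```shell
+# Pick the newest speed_benchmark.json under outputs/ and pretty-print it
+latest=$(ls -t outputs/*/*/speed_benchmark.json | head -n 1)
+python -m json.tool "$latest"
+```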
-#### 3.2 Inference using vLLM
+### Method 2: Testing with Scripts
 
-- Use HuggingFace hub
+#### HuggingFace Transformers Inference
+
+- Using HuggingFace Hub
 
 ```shell
-python speed_benchmark_vllm.py --model_id_or_path Qwen/Qwen2.5-0.5B-Instruct --context_length 1 --max_model_len 32768 --gpus 0 --gpu_memory_utilization 0.9 --outputs_dir outputs/vllm
-```
+python speed_benchmark_transformers.py --model_id_or_path Qwen/Qwen2.5-0.5B-Instruct --context_length 1 --gpus 0 --outputs_dir outputs/transformers
+# Optionally specify HF_ENDPOINT to download via a mirror
+HF_ENDPOINT=https://hf-mirror.com python speed_benchmark_transformers.py --model_id_or_path Qwen/Qwen2.5-0.5B-Instruct --context_length 1 --gpus 0 --outputs_dir outputs/transformers
+```
 
-- Use ModelScope hub
+- Using ModelScope Hub
 
 ```shell
-python speed_benchmark_vllm.py --model_id_or_path Qwen/Qwen2.5-0.5B-Instruct --context_length 1 --max_model_len 32768 --gpus 0 --use_modelscope --gpu_memory_utilization 0.9 --outputs_dir outputs/vllm
+python speed_benchmark_transformers.py --model_id_or_path Qwen/Qwen2.5-0.5B-Instruct --context_length 1 --gpus 0 --use_modelscope --outputs_dir outputs/transformers
 ```
 
+Parameter Explanation:
+
+- `--model_id_or_path`: Model ID or local path; see the `Model Collections` section for available models
+- `--context_length`: Input length in tokens; optional values are 1, 6144, 14336, 30720, 63488, 129024 (see the sweep example below); refer to the `Qwen2.5 Speed Benchmark` report for details
+- `--generate_length`: Number of tokens to generate; default is 2048
+- `--gpus`: Equivalent to the environment variable CUDA_VISIBLE_DEVICES, e.g., `0,1,2,3`, `4,5`
+- `--use_modelscope`: If set, loads the model from ModelScope; otherwise, from HuggingFace
+- `--outputs_dir`: Output directory; default is `outputs/transformers`
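+
+For example, to sweep several of the supported input lengths in one go, the script can be run in a loop (a sketch; adjust the model, GPU, and lengths to your setup):
+
+```shell
+# Benchmark the same model at several input lengths; each run writes its own results
+for len in 1 6144 14336 30720; do
+    python speed_benchmark_transformers.py --model_id_or_path Qwen/Qwen2.5-0.5B-Instruct --context_length "$len" --gpus 0 --outputs_dir outputs/transformers
+done
+```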
-Parameters:
+#### vLLM Inference
 
- `--model_id_or_path`: The model id on ModelScope or HuggingFace hub.
- `--context_length`: Input length in tokens; optional values are 1, 6144, 14336, 30720, 63488, 129024; Refer to the `Qwen2.5 SpeedBenchmark`.
- `--generate_length`: Output length in tokens; default is 2048.
- `--max_model_len`: Maximum model length in tokens; default is 32768. Optional values are 4096, 8192, 32768, 65536, 131072.
- `--gpus`: Equivalent to the environment variable CUDA_VISIBLE_DEVICES. e.g. `0,1,2,3`, `4,5`
- `--use_modelscope`: Use ModelScope when set this flag. Otherwise, use HuggingFace.
- `--gpu_memory_utilization`: GPU memory utilization; range is (0, 1]; default is 0.9.
- `--outputs_dir`: Output directory; default is outputs/vllm.
- `--enforce_eager`: Whether to enforce eager mode; default is False.
+- Using HuggingFace Hub
+```shell
+python speed_benchmark_vllm.py --model_id_or_path Qwen/Qwen2.5-0.5B-Instruct --context_length 1 --max_model_len 32768 --gpus 0 --gpu_memory_utilization 0.9 --outputs_dir outputs/vllm
+```
+
+- Using ModelScope Hub
+
+```shell
+python speed_benchmark_vllm.py --model_id_or_path Qwen/Qwen2.5-0.5B-Instruct --context_length 1 --max_model_len 32768 --gpus 0 --use_modelscope --gpu_memory_utilization 0.9 --outputs_dir outputs/vllm
+```
+
+Parameter Explanation:
-#### 3.3 Tips
+
+- `--model_id_or_path`: Model ID or local path; see the `Model Collections` section for available models
+- `--context_length`: Input length in tokens; optional values are 1, 6144, 14336, 30720, 63488, 129024; refer to the `Qwen2.5 Speed Benchmark` report for details
+- `--generate_length`: Number of tokens to generate; default is 2048
+- `--max_model_len`: Maximum model length in tokens; default is 32768
+- `--gpus`: Equivalent to the environment variable CUDA_VISIBLE_DEVICES, e.g., `0,1,2,3`, `4,5`
+- `--use_modelscope`: If set, loads the model from ModelScope; otherwise, from HuggingFace
+- `--gpu_memory_utilization`: GPU memory utilization, range (0, 1]; default is 0.9
+- `--outputs_dir`: Output directory; default is `outputs/vllm`
+- `--enforce_eager`: Whether to enforce eager mode; default is False
-- Run multiple experiments and compute the average result; a typical number is 3 times.
-- Make sure the GPU is idle before running experiments.
 
+#### Test Results
-### 4. Results
+Test results can be found in the `outputs` directory, which by default contains two subdirectories, `transformers` and `vllm`, holding the results for HuggingFace transformers and vLLM respectively.
-Please check the `outputs` directory, which includes two directories by default: `transformers` and `vllm`, containing the experiments results for HuggingFace transformers and vLLM, respectively.
+
+## Notes
+
+1. Run each test multiple times and take the average; three runs is a typical choice (see the example below).
+2. Ensure the GPU is idle before testing to avoid interference from other tasks.
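+
+For example, note 1 can be followed by wrapping the script-based benchmark in a small loop and averaging the reported speeds afterwards (a sketch; adjust the arguments to your setup):
+
+```shell
+# Repeat the same configuration three times; each run writes its results under outputs/transformers
+for run in 1 2 3; do
+    python speed_benchmark_transformers.py --model_id_or_path Qwen/Qwen2.5-0.5B-Instruct --context_length 1 --gpus 0 --outputs_dir outputs/transformers
+done
+```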
diff --git a/examples/speed-benchmark/README_zh.md b/examples/speed-benchmark/README_zh.md
index 4e966876c1dee7ff2c2d3db1967241bfdaf3e05e..da43266e7e7a864c912a5f91841101aae889f6f2 100644
--- a/examples/speed-benchmark/README_zh.md
+++ b/examples/speed-benchmark/README_zh.md
@@ -1,16 +1,15 @@
-## Efficiency Evaluation
+# Efficiency Evaluation
 
 This document describes the speed test procedure for the Qwen2.5 series models (original and quantized). For the detailed report, see the [Qwen2.5 Model Efficiency Evaluation Report](https://qwen.readthedocs.io/en/latest/benchmark/speed_benchmark.html).
 
-### 1. Model Resources
+## 1. Model Resources
 
 For models hosted on HuggingFace, see [Qwen2.5 Models - HuggingFace](https://huggingface.co/collections/Qwen/qwen25-66e81a666513e518adb90d9e).
 
 For models hosted on ModelScope, see [Qwen2.5 Models - ModelScope](https://modelscope.cn/collections/Qwen25-dbc4d30adb768).
 
-### 2. Environment Setup
-
+## 2. Environment Setup
 
 For inference with HuggingFace transformers, set up the environment as follows:
@@ -41,9 +40,70 @@ pip install -r requirements-perf-vllm.txt
 ```
 
-### 3. Execute Tests
+## 3. Execute Tests
+
+Below are two ways to run the tests: with the Speed Benchmark tool or with the provided scripts.
+
+### Method 1: Testing with the Speed Benchmark Tool
+
+Use the Speed Benchmark tool developed by [EvalScope](https://github.com/modelscope/evalscope). It automatically downloads models from ModelScope and outputs the test results; see the [📖 User Guide](https://evalscope.readthedocs.io/zh-cn/latest/user_guides/stress_test/speed_benchmark.html).
+
+**Install Dependencies**
+```shell
+pip install 'evalscope[perf]' -U
+```
+
+#### HuggingFace Transformers Inference
+
+Run the following command:
+```shell
+CUDA_VISIBLE_DEVICES=0 evalscope perf \
+  --parallel 1 \
+  --model Qwen/Qwen2.5-0.5B-Instruct \
+  --attn-implementation flash_attention_2 \
+  --log-every-n-query 5 \
+  --connect-timeout 6000 \
+  --read-timeout 6000 \
+  --max-tokens 2048 \
+  --min-tokens 2048 \
+  --api local \
+  --dataset speed_benchmark
+```
+
+#### vLLM Inference
+
+```shell
+CUDA_VISIBLE_DEVICES=0 evalscope perf \
+  --parallel 1 \
+  --model Qwen/Qwen2.5-0.5B-Instruct \
+  --log-every-n-query 1 \
+  --connect-timeout 60000 \
+  --read-timeout 60000 \
+  --max-tokens 2048 \
+  --min-tokens 2048 \
+  --api local_vllm \
+  --dataset speed_benchmark
+```
+
+#### Parameter Explanation
+- `--parallel`: number of worker threads for concurrent requests; keep this fixed at 1.
+- `--model`: model file path, or a model ID such as Qwen/Qwen2.5-0.5B-Instruct (downloaded automatically from ModelScope).
+- `--attn-implementation`: attention implementation; optional values: flash_attention_2|eager|sdpa.
+- `--log-every-n-query`: log once every n requests.
+- `--connect-timeout`: connection timeout in seconds.
+- `--read-timeout`: read timeout in seconds.
+- `--max-tokens`: maximum output length in tokens.
+- `--min-tokens`: minimum output length in tokens; setting both to 2048 forces the model to output exactly 2048 tokens.
+- `--api`: inference interface; for local inference the options are local|local_vllm.
+- `--dataset`: test dataset; options are speed_benchmark|speed_benchmark_long.
+
+#### Test Results
-#### 3.1 Inference with HuggingFace transformers
+Test results can be found in the `outputs/{model_name}/{timestamp}/speed_benchmark.json` file, which contains all request results and test parameters.
+
+### Method 2: Testing with Scripts
+
+#### HuggingFace Transformers Inference
 
 - Use the HuggingFace hub
@@ -70,7 +130,7 @@ python speed_benchmark_transformers.py --model_id_or_path Qwen/Qwen2.5-0.5B-Inst
 `--outputs_dir`: output directory; default is `outputs/transformers`
 
-#### 3.2 Inference with vLLM
+#### vLLM Inference
 
 - Use the HuggingFace hub
@@ -99,12 +159,13 @@ python speed_benchmark_vllm.py --model_id_or_path Qwen/Qwen2.5-0.5B-Instruct --c
 `--outputs_dir`: output directory; default is `outputs/vllm`
 `--enforce_eager`: whether to enforce eager mode; default is False
 
+#### Test Results
-#### 3.3 Notes
+Test results can be found in the `outputs` directory, which by default contains two subdirectories, `transformers` and `vllm`, holding the results for HuggingFace transformers and vLLM respectively.
+
+## Notes
 
 1. Run each test multiple times and take the average; three runs is a typical choice.
 2. Make sure the GPU is idle before testing so that other tasks do not affect the results.
 
-### 4. Test Results
-Test results can be found in the `outputs` directory, which by default contains two subdirectories, `transformers` and `vllm`, holding the results for HuggingFace transformers and vLLM respectively.
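+
+For note 2 above, a quick way to confirm that the GPU is idle before starting a run is to check utilization and memory with `nvidia-smi` (a sketch; requires the NVIDIA driver tools to be installed):
+
+```shell
+# Both utilization and memory use should be near zero before benchmarking
+nvidia-smi --query-gpu=utilization.gpu,memory.used --format=csv
+```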