From c7871179500a0ea688c97b0f6656dd6e8a28b555 Mon Sep 17 00:00:00 2001
From: Ren Xuancheng <jklj077@users.noreply.github.com>
Date: Wed, 25 Sep 2024 20:24:28 +0800
Subject: [PATCH] update doc (#981)

Co-authored-by: Ren Xuancheng <17811943+jklj077@users.noreply.github.com>
---
 docs/source/quantization/gptq.md | 33 ++++++++++++++++++++++++++++++++
 1 file changed, 33 insertions(+)

diff --git a/docs/source/quantization/gptq.md b/docs/source/quantization/gptq.md
index 4faa3a9..a3b050d 100644
--- a/docs/source/quantization/gptq.md
+++ b/docs/source/quantization/gptq.md
@@ -5,6 +5,16 @@ In this document, we show you how to use the quantized model with Hugging Face `
 
 ## Usage of GPTQ Models with Hugging Face transformers
 
+:::{note}
+
+To use the official Qwen2.5 GPTQ models with `transformers`, please ensure that `optimum>=1.20.0` and compatible versions of `transformers` and `auto_gptq` are installed.
+
+You can do that with:
+```bash
+pip install -U "optimum>=1.20.0"
+```
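+
+To verify the installed versions, a quick check like the following can help (a minimal sketch using only the Python standard library; `auto-gptq` is the PyPI distribution name of the `auto_gptq` package):
+
+```python
+# Print the installed versions of the packages the GPTQ workflow depends on.
+from importlib.metadata import version, PackageNotFoundError
+
+for pkg in ("optimum", "transformers", "auto-gptq"):
+    try:
+        print(pkg, version(pkg))
+    except PackageNotFoundError:
+        print(pkg, "is not installed")
+```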
+:::
+
 Now, `transformers` has officially supported AutoGPTQ, which means that you can directly use the quantized model with `transformers`. 
 For each size of Qwen2.5, we provide both Int4 and Int8 GPTQ quantized models.
 The following is a very simple code snippet showing how to run `Qwen2.5-7B-Instruct-GPTQ-Int4`:
@@ -204,6 +214,29 @@ For sharding, you need to load the model and use `save_pretrained` from transfor
 Except for this, everything is so simple. 
 Enjoy!
 
+
+## Known Issues
+
+### Qwen2.5-72B-Instruct-GPTQ-Int4 cannot stop generation properly
+
+:Model: Qwen2.5-72B-Instruct-GPTQ-Int4
+:Framework: vLLM, AutoGPTQ (including Hugging Face transformers)
+:Description: Generation does not stop properly. The model keeps generating past the point where it should stop and then produces repeated text, which may be a single character, a phrase, or whole paragraphs.
+:Workaround: The following workarounds could be considered (see the sketch after this list):
+    1. Using the original model in 16-bit floating point
+    2. Using the AWQ variants or llama.cpp-based models, which are less prone to abnormal generation
+
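+A minimal sketch of these workarounds with Hugging Face `transformers` (llama.cpp-based usage is not shown; the model names assume the official Qwen2.5 naming on the Hugging Face Hub):
+
+```python
+from transformers import AutoModelForCausalLM, AutoTokenizer
+
+# Workaround 1: the original model in 16-bit floating point
+model_name = "Qwen/Qwen2.5-72B-Instruct"
+# Workaround 2: the AWQ variant instead of GPTQ-Int4
+# model_name = "Qwen/Qwen2.5-72B-Instruct-AWQ"
+
+model = AutoModelForCausalLM.from_pretrained(
+    model_name,
+    torch_dtype="auto",   # use the 16-bit dtype stored in the checkpoint
+    device_map="auto",
+)
+tokenizer = AutoTokenizer.from_pretrained(model_name)
+```
+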
+### Qwen2.5-32B-Instruct-GPTQ-Int4 broken with vLLM on multiple GPUs
+
+:Model: Qwen2.5-32B-Instruct-GPTQ-Int4
+:Framework: vLLM
+:Description: When deployed on multiple GPUs, the model generates only garbled text like `!!!!!!!!!!!!!!!!!!`.
+:Workaround: Any of the following workarounds could be considered (see the sketch after this list):
+    1. Using the AWQ or GPTQ-Int8 variants
+    2. Using a single GPU
+    3. Using Hugging Face `transformers` if latency and throughput are not major concerns
+
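+A minimal sketch of the first two workarounds with the vLLM offline inference API (`vllm.LLM`; the model names assume the official Qwen2.5 naming on the Hugging Face Hub):
+
+```python
+from vllm import LLM, SamplingParams
+
+# Workaround 1: switch to the GPTQ-Int8 (or AWQ) variant on multiple GPUs
+llm = LLM(model="Qwen/Qwen2.5-32B-Instruct-GPTQ-Int8", tensor_parallel_size=4)
+
+# Workaround 2: keep GPTQ-Int4 but restrict the deployment to a single GPU
+# llm = LLM(model="Qwen/Qwen2.5-32B-Instruct-GPTQ-Int4", tensor_parallel_size=1)
+
+outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
+print(outputs[0].outputs[0].text)
+```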
+
 ## Troubleshooting
 
 :::{dropdown} With `transformers` and `auto_gptq`, the logs suggest `CUDA extension not installed.` and the inference is slow.
-- 
GitLab