Qwen-72B-Chat
🤗 Hugging Face | 🤖 ModelScope | 📑 Paper | 🖥️ Demo
WeChat (微信) | Discord | API
Introduction
Qwen-72B is the 72B-parameter version of the large language model series, Qwen (abbr. Tongyi Qianwen), proposed by Alibaba Cloud. Qwen-72B is a Transformer-based large language model, which is pretrained on a large volume of data, including web texts, books, code, etc. Additionally, based on the pretrained Qwen-72B, we release Qwen-72B-Chat, a large-model-based AI assistant, which is trained with alignment techniques. This repository is for Qwen-72B-Chat.
The features of Qwen-72B include:
- Large-scale high-quality training corpora: It is pretrained on over 3 trillion tokens, including Chinese, English, multilingual texts, code, and mathematics, covering general and professional fields. The distribution of the pre-training corpus has been optimized through a large number of ablation experiments.
- Competitive performance: It significantly surpasses existing open-source models on multiple Chinese and English downstream evaluation tasks (including commonsense, reasoning, code, mathematics, etc.). See below for specific evaluation results.
- More comprehensive vocabulary coverage: Compared with other open-source models based on Chinese and English vocabularies, Qwen-72B uses a vocabulary of over 150K tokens. This vocabulary is more friendly to multiple languages, enabling users to directly further enhance the capability for certain languages without expanding the vocabulary.
- Longer context support: Qwen-72B supports 32k context length.
- System prompt: Qwen-72B-Chat can realize role playing, language style transfer, task setting, and behavior setting via the system prompt.
For more details about the open-source model of Qwen-72B, please refer to the GitHub code repository.
Requirements
- python 3.8 and above
- pytorch 1.12 and above; 2.0 and above is recommended
- CUDA 11.4 and above is recommended (this is for GPU users, flash-attention users, etc.)
- To run Qwen-72B-Chat in bf16/fp16, at least 144GB of GPU memory is required (e.g., 2xA100-80G or 5xV100-32G). To run it in int4, at least 48GB of GPU memory is required (e.g., 1xA100-80G or 2xV100-32G).
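Before installing anything, a quick sanity check of the environment against these requirements can save time; the snippet below is a minimal sketch that only reads versions:

# Minimal environment check against the requirements above (illustrative only).
import sys
import torch

print(sys.version)                 # expect Python >= 3.8
print(torch.__version__)           # expect PyTorch >= 1.12 (2.0+ recommended)
print(torch.version.cuda)          # expect CUDA >= 11.4 for GPU / flash-attention users
print(torch.cuda.device_count())   # number of visible GPUs available for multi-GPU loading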
Dependencies
Inference with HuggingFace
To run Qwen-72B-Chat, please make sure you meet the above requirements, and then execute the following pip commands to install the dependent libraries.
pip install "transformers>=4.32.0" accelerate tiktoken einops scipy transformers_stream_generator==0.0.4 peft deepspeed
In addition, it is recommended to install the flash-attention library (flash attention 2 is now supported) for higher efficiency and lower memory usage.
git clone https://github.com/Dao-AILab/flash-attention
cd flash-attention && pip install .
# Below are optional. Installing them might be slow.
# pip install csrc/layer_norm
# If the version of flash-attn is higher than 2.1.1, the following is not needed.
# pip install csrc/rotary
Inference with vLLM
Using vLLM for inference can support longer context lengths and achieve at least a 2x generation speedup. You need to meet the following requirements:
- pytorch >= 2.0
- CUDA 11.8 or 12.1
If you use CUDA 12.1 and PyTorch 2.1, you can directly use the following commands to install vLLM.
# pip install vllm # This line is faster but it does not support quantization models.
# The lines below support int4 quantization (int8 will be supported soon). The installation is slower (~10 minutes).
git clone https://github.com/QwenLM/vllm-gptq
cd vllm-gptq
pip install -e .
Otherwise, please refer to the official vLLM installation instructions, or to our vLLM repo for GPTQ quantization.
Quickstart
Inference with HuggingFace Transformers
We show an example of a multi-turn interaction with Qwen-72B-Chat in the following code:
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.generation import GenerationConfig
# Note: The default behavior now has injection attack prevention off.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-72B-Chat", trust_remote_code=True)
# use bf16
# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-72B-Chat", device_map="auto", trust_remote_code=True, bf16=True).eval()
# use fp16
# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-72B-Chat", device_map="auto", trust_remote_code=True, fp16=True).eval()
# use cpu only
# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-72B-Chat", device_map="cpu", trust_remote_code=True).eval()
# use auto mode, automatically select precision based on the device.
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-72B-Chat", device_map="auto", trust_remote_code=True).eval()
# NOTE: The above line would require at least 144GB memory in total
# Specify hyperparameters for generation. But if you use transformers>=4.32.0, there is no need to do this.
# model.generation_config = GenerationConfig.from_pretrained("Qwen/Qwen-72B-Chat", trust_remote_code=True)  # You can specify different generation lengths, top_p and other related hyperparameters here.
# 第一轮对话 1st dialogue turn
response, history = model.chat(tokenizer, "你好", history=None)
print(response)
# 你好！很高兴为你提供帮助。

# 第二轮对话 2nd dialogue turn
response, history = model.chat(tokenizer, "给我讲一个年轻人奋斗创业最终获得成功的故事。", history=history)
print(response)
# 这是一个关于一个年轻人奋斗创业最终获得成功的故事。
# 故事的主人公叫李明，他来自一个普通的家庭，父母都是普通的工人。从小，李明就立下了一个目标：要成为一名成功的企业家。
# 为了实现这个目标，李明勤奋学习，考上了大学。在大学期间，他积极参加各种创业比赛，获得了不少奖项。他还利用课余时间去实习，积累了宝贵的经验。
# 毕业后，李明决定开始自己的创业之路。他开始寻找投资机会，但多次都被拒绝了。然而，他并没有放弃。他继续努力，不断改进自己的创业计划，并寻找新的投资机会。
# 最终，李明成功地获得了一笔投资，开始了自己的创业之路。他成立了一家科技公司，专注于开发新型软件。在他的领导下，公司迅速发展起来，成为了一家成功的科技企业。
# 李明的成功并不是偶然的。他勤奋、坚韧、勇于冒险，不断学习和改进自己。他的成功也证明了，只要努力奋斗，任何人都有可能取得成功。

# 第三轮对话 3rd dialogue turn
response, history = model.chat(tokenizer, "给这个故事起一个标题", history=history)
print(response)
# 《奋斗创业：一个年轻人的成功之路》
# Qwen-72B-Chat现在可以通过调整系统指令（System Prompt），实现角色扮演，语言风格迁移，任务设定，行为设定等能力。
# Qwen-72B-Chat can realize role playing, language style transfer, task setting, and behavior setting via the system prompt.
response, _ = model.chat(tokenizer, "你好呀", history=None, system="请用二次元可爱语气和我说话")
print(response)
# 哎呀，你好哇！是怎么找到人家的呢？是不是被人家的魅力吸引过来的呀~(≧▽≦)/~
response, _ = model.chat(tokenizer, "My colleague works diligently", history=None, system="You will write beautiful compliments according to needs")
print(response)
# Your colleague is a shining example of dedication and hard work. Their commitment to their job is truly commendable, and it shows in the quality of their work.
# They are an asset to the team, and their efforts do not go unnoticed. Keep up the great work!
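For token-by-token streaming output, the official Qwen examples also use a chat_stream() generator provided by the remote code; the following is a hedged sketch assuming that interface is available in the loaded model:

# Streaming sketch (assumes the chat_stream() helper exposed by the Qwen remote code).
# Each yielded string is the cumulative response generated so far.
for partial_response in model.chat_stream(tokenizer, "给我讲一个年轻人奋斗创业最终获得成功的故事。", history=None):
    print(partial_response)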
Inference with vLLM and Transformers-like APIs
After installing vLLM according to the Dependencies section above, you can download the wrapper code and execute the following commands for multi-turn dialogue interaction. (Note: this method currently only supports the model.chat() interface.)
from vllm_wrapper import vLLMWrapper
model = vLLMWrapper('Qwen/Qwen-72B-Chat', tensor_parallel_size=2)
# model = vLLMWrapper('Qwen/Qwen-72B-Chat-Int4', tensor_parallel_size=1, dtype="float16")  # run the int4 model
response, history = model.chat(query="你好", history=None)
print(response)
response, history = model.chat(query="给我讲一个年轻人奋斗创业最终获得成功的故事。", history=history)
print(response)
response, history = model.chat(query="给这个故事起一个标题", history=history)
print(response)
Inference with vLLM and OpenAI-like API
Please refer to the introduction of vLLM deployment and OpenAI-style API usage in our GitHub repo.
If deploying with 2xA100-80G, you can run the following commands:
python -m fastchat.serve.controller
python -m fastchat.serve.vllm_worker --model-path Qwen/Qwen-72B-Chat --trust-remote-code --tensor-parallel-size 2 --gpu-memory-utilization 0.98 --dtype bfloat16
# python -m fastchat.serve.vllm_worker --model-path Qwen/Qwen-72B-Chat-Int4 --trust-remote-code --dtype float16  # run the int4 model
python -m fastchat.serve.openai_api_server --host localhost --port 8000
Note that the --gpu-memory-utilization 0.98 argument is required to avoid OOM problems.
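Once the OpenAI-compatible server above is running, it can be queried with the standard OpenAI client. The sketch below assumes the server is listening on localhost:8000 and that the worker registered the model under the name Qwen-72B-Chat; both are illustrative and depend on your deployment:

# Client-side sketch for the OpenAI-compatible endpoint started above (openai<1.0 style API).
import openai

openai.api_base = "http://localhost:8000/v1"
openai.api_key = "none"  # the local server does not verify the key

resp = openai.ChatCompletion.create(
    model="Qwen-72B-Chat",  # illustrative; must match the name registered by the worker
    messages=[{"role": "user", "content": "你好"}],
    temperature=0.7,
)
print(resp.choices[0].message.content)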
For more usage instructions, please refer to our GitHub repo.
Quantization
Usage
Here we demonstrate how to use our provided quantized models for inference. Before you start, make sure you meet the requirements of auto-gptq (e.g., torch 2.0 and above, transformers 4.32.0 and above, etc.) and install the required packages:
pip install auto-gptq optimum
If you meet problems installing auto-gptq, we advise you to check out the official repo to find a pre-built wheel.
Note: The pre-compiled auto-gptq packages strongly depend on the version of torch and its CUDA version. Moreover, due to recent updates, you may also encounter unsupported version errors from transformers, optimum, or peft. We recommend using the latest versions meeting one of the following requirements:
- torch==2.1 auto-gptq>=0.5.1 transformers>=4.35.0 optimum>=1.14.0 peft>=0.6.1
- torch>=2.0,<2.1 auto-gptq<0.5.0 transformers<4.35.0 optimum<1.14.0 peft>=0.5.0,<0.6.0
Then you can load the quantized model easily and run inference the same as usual:
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the tokenizer and the Int4 quantized model.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-72B-Chat-Int4", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-72B-Chat-Int4",
    device_map="auto",
    trust_remote_code=True
).eval()
response, history = model.chat(tokenizer, "你好", history=None)
Note: To run quantized models with vLLM, you need to install our vLLM repo (https://github.com/qwenlm/vllm-gptq). The int8 model is not supported for the time being, and we will add support soon.
Performance
We illustrate the zero-shot performance of the BF16, Int8, and Int4 models on the benchmarks. Results are shown below:
| Quantization | MMLU | CEval (val) | GSM8K | HumanEval |
|---|---|---|---|---|
| BF16 | 74.4 | 80.1 | 76.4 | 64.6 |
| Int8 | 73.5 | 80.1 | 73.5 | 62.2 |
| Int4 | 73.4 | 80.1 | 75.3 | 61.6 |
Inference Speed & GPU Memory Usage
We measured the average inference speed and GPU memory usage of generating 2048 tokens across several settings, including input lengths, quantization levels, versions of flash-attention, and whether vLLM is used.
| Quantization | Setting | # of A100-80G GPUs | Context Length | Generation Length | Speed (Tokens/s) | Total GPU Memory Usage |
|---|---|---|---|---|---|---|
| BF16 | HF + FlashAttn-v2 | 2 | 1 | 2048 | 8.48 | 144.69GB |
| BF16 | HF + FlashAttn-v1 | 2 | 1 | 2048 | 8.31 | 144.69GB |
| BF16 | HF + No FlashAttn | 2 | 1 | 2048 | 7.89 | 144.69GB |
| BF16 | vLLM | 2 | 1 | 2048 | 17.60 | Pre-Allocated* |
| BF16 | vLLM | 4 | 1 | 2048 | 26.16 | Pre-Allocated* |
| BF16 | HF + FlashAttn-v2 | 4 | 6144 | 2048 | 5.37 | 181.47GB |
| BF16 | HF + FlashAttn-v1 | 4 | 6144 | 2048 | 4.96 | 181.47GB |
| BF16 | HF + No FlashAttn | 4 | 6144 | 2048 | 4.72 | 202.74GB |
| BF16 | vLLM | 4 | 6144 | 2048 | 24.41 | Pre-Allocated* |
| BF16 | vLLM | 4 | 14336 | 2048 | 21.24 | Pre-Allocated* |
| BF16 | vLLM | 4 | 30720 | 2048 | 17.55 | Pre-Allocated* |
| Int8 | HF + FlashAttn-v2 | 2 | 1 | 2048 | 9.05 | 81.27GB |
| Int8 | HF + FlashAttn-v1 | 2 | 1 | 2048 | 8.97 | 81.27GB |
| Int8 | HF + No FlashAttn | 2 | 1 | 2048 | 8.32 | 81.27GB |
| Int8 | HF + FlashAttn-v2 | 3 | 6144 | 2048 | 5.76 | 118.06GB |
| Int8 | HF + FlashAttn-v1 | 3 | 6144 | 2048 | 5.72 | 118.06GB |
| Int8 | HF + No FlashAttn | 2 | 6144 | 2048 | 4.50 | 129.83GB |
| Int8 | HF + FlashAttn-v2 | 4 | 14336 | 2048 | 3.44 | 180.44GB |
| Int8 | HF + FlashAttn-v1 | 4 | 14336 | 2048 | 3.19 | 180.44GB |
| Int8 | HF + No FlashAttn | 4 | 14336 | 2048 | OOM | OOM |
| Int4 | HF + FlashAttn-v2 | 1 | 1 | 2048 | 11.67 | 48.86GB |
| Int4 | HF + FlashAttn-v1 | 1 | 1 | 2048 | 11.27 | 48.86GB |
| Int4 | HF + No FlashAttn | 1 | 1 | 2048 | 11.32 | 48.86GB |
| Int4 | vLLM | 1 | 1 | 2048 | 14.63 | Pre-Allocated* |
| Int4 | vLLM | 2 | 1 | 2048 | 20.76 | Pre-Allocated* |
| Int4 | vLLM | 4 | 1 | 2048 | 27.19 | Pre-Allocated* |
| Int4 | HF + FlashAttn-v2 | 2 | 6144 | 2048 | 6.75 | 85.99GB |
| Int4 | HF + FlashAttn-v1 | 2 | 6144 | 2048 | 6.32 | 85.99GB |
| Int4 | HF + No FlashAttn | 2 | 6144 | 2048 | 5.97 | 88.30GB |
| Int4 | vLLM | 2 | 6144 | 2048 | 18.07 | Pre-Allocated* |
| Int4 | vLLM | 4 | 6144 | 2048 | 24.56 | Pre-Allocated* |
| Int4 | HF + FlashAttn-v2 | 3 | 14336 | 2048 | 4.18 | 148.73GB |
| Int4 | HF + FlashAttn-v1 | 3 | 14336 | 2048 | 3.72 | 148.73GB |
| Int4 | HF + No FlashAttn | 3 | 14336 | 2048 | OOM | OOM |
| Int4 | vLLM | 2 | 14336 | 2048 | 14.51 | Pre-Allocated* |
| Int4 | vLLM | 4 | 14336 | 2048 | 19.28 | Pre-Allocated* |
| Int4 | vLLM | 4 | 30720 | 2048 | 16.93 | Pre-Allocated* |
* vLLM pre-allocates GPU memory, so we cannot detect the maximum usage. HF refers to using the Huggingface Transformers library for inference.
The speed and memory profiling of HuggingFace Transformers are conducted using this script. The profiling runs on A100-SXM4-80G GPUs with PyTorch 2.0.1 (for Huggingface Transformers) / PyTorch 2.1.0 (for vLLM) and CUDA 11.8.
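For reference, decoding speed in a Transformers setup can be measured along the same lines as the table above; the snippet below is a rough sketch (not the profiling script referenced above) and assumes model and tokenizer are loaded as in the Quickstart section:

# Rough sketch of measuring generation speed with HuggingFace Transformers (illustrative).
import time
import torch

inputs = tokenizer("你好", return_tensors="pt").to(model.device)
torch.cuda.synchronize()
start = time.time()
out = model.generate(**inputs, max_new_tokens=2048)
torch.cuda.synchronize()
elapsed = time.time() - start
new_tokens = out.shape[1] - inputs["input_ids"].shape[1]
print(f"{new_tokens / elapsed:.2f} tokens/s")
print(f"peak memory on device 0: {torch.cuda.max_memory_allocated() / 2**30:.1f} GiB")  # per-GPU, not total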
Model
The details of the model architecture of Qwen-72B-Chat, which is the same as that of the Qwen-72B pretrained model, are listed as follows:
| Hyperparameter | Value |
|---|---|
| n_layers | 80 |
| n_heads | 64 |
| d_model | 8192 |
| vocab size | 151851 |
| sequence length | 32768 |
For position encoding, FFN activation function, and normalization calculation methods, we adopt the prevalent practices, i.e., RoPE relative position encoding, SwiGLU for activation function, and RMSNorm for normalization (optional installation of flash-attention for acceleration).
For tokenization, compared to the current mainstream open-source models based on Chinese and English vocabularies, Qwen-72B-Chat uses a vocabulary of over 150K tokens.
Built on the cl100k_base BPE vocabulary used by GPT-4, it prioritizes efficient encoding of Chinese, English, and code data, and is also friendlier to other languages, enabling users to directly enhance the capability for certain languages without expanding the vocabulary.
It splits numbers into single digits and uses the efficient tiktoken library for tokenization.
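As a small, hedged illustration of the tokenizer behavior described above (exact token ids and counts depend on the released tokenizer and are not guaranteed):

# Illustrative look at the tiktoken-based tokenizer (results may differ slightly).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-72B-Chat", trust_remote_code=True)
print(len(tokenizer))                 # vocabulary size, roughly 150K tokens
ids = tokenizer.encode("GDP grew by 12345%")
print(len(ids), ids)                  # digits such as "12345" are expected to split into single-digit tokens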
Evaluation
For Qwen-72B-Chat, we evaluate the model on authoritative tasks such as C-Eval (Chinese understanding), MMLU (English understanding), HumanEval (code), and GSM8K (mathematics), as well as on benchmarks for long-context understanding. Since the aligned Qwen-72B-Chat has gained stronger abilities for calling external systems, we also evaluate its tool-use capability.
Note: Due to rounding errors caused by hardware and framework differences, variations in reproduced results are possible.
Chinese Evaluation
C-Eval
We demonstrate the 0-shot & 5-shot accuracy of Qwen-72B-Chat on the C-Eval validation set:
| Model | Avg. Acc. |
|---|---|
| LLaMA2-7B-Chat | 31.9 |
| LLaMA2-13B-Chat | 36.2 |
| LLaMA2-70B-Chat | 44.3 |
| ChatGPT3.5 | 52.5 |
| ChatGPT4 | 69.9 |
| Yi-34B-Chat (0-shot) | 77.0 |
| Yi-34B-Chat (5-shot) | 78.5 |
| Qwen-7B-Chat (original) (0-shot) | 54.2 |
| Qwen-7B-Chat (0-shot) | 59.7 |
| Qwen-7B-Chat (5-shot) | 59.3 |
| Qwen-14B-Chat (0-shot) | 69.8 |
| Qwen-14B-Chat (5-shot) | 71.7 |
| Qwen-72B-Chat (0-shot) | 80.1 |
| Qwen-72B-Chat (5-shot) | 82.9 |
The zero-shot accuracy of Qwen-72B-Chat on the C-Eval test set is provided below:
| Model | Avg. | STEM | Social Sciences | Humanities | Others |
|---|---|---|---|---|---|
| Qwen-7B-Chat (original) | 54.6 | 47.8 | 67.6 | 59.3 | 50.6 |
| Qwen-7B-Chat | 58.6 | 53.3 | 72.1 | 62.8 | 52.0 |
| Qwen-14B-Chat | 69.1 | 65.1 | 80.9 | 71.2 | 63.4 |
| Qwen-72B-Chat | 79.5 | 74.5 | 89.1 | 81.2 | 78.1 |
English Evaluation
MMLU
The 0-shot & 5-shot accuracy of Qwen-72B-Chat on MMLU is provided below. Qwen-72B-Chat remains among the top human-aligned models of comparable size.
| Model | Avg. Acc. |
|---|---|
| LLaMA2-7B-Chat | 46.2 |
| LLaMA2-13B-Chat | 54.6 |
| LLaMA2-70B-Chat | 63.8 |
| Yi-34B-Chat (0-shot) | 67.6 |
| Yi-34B-Chat (5-shot) | 73.4 |
| ChatGPT3.5 | 69.1 |
| ChatGPT4 | 83.0 |
| Qwen-7B-Chat (original) (0-shot) | 53.9 |
| Qwen-7B-Chat (0-shot) | 55.8 |
| Qwen-7B-Chat (5-shot) | 57.0 |
| Qwen-14B-Chat (0-shot) | 64.6 |
| Qwen-14B-Chat (5-shot) | 66.5 |
| Qwen-72B-Chat (0-shot) | 74.3 |
| Qwen-72B-Chat (5-shot) | 75.0 |
Coding Evaluation
The zero-shot Pass@1 of Qwen-72B-Chat on HumanEval is demonstrated below:
| Model | Pass@1 |
|---|---|
| LLaMA2-7B-Chat | 12.2 |
| LLaMA2-13B-Chat | 18.9 |
| LLaMA2-70B-Chat | 32.3 |
| Yi-34B-Chat | 33.5 |
| ChatGPT3.5 | 73.2 |
| ChatGPT4 | 86.6 |
| Qwen-7B-Chat (original) | 24.4 |
| Qwen-7B-Chat | 37.2 |
| Qwen-14B-Chat | 43.9 |
| Qwen-72B-Chat | 64.6 |
Mathematics Evaluation
The accuracy of Qwen-72B-Chat on GSM8K is shown below:
| Model | Acc. |
|---|---|
| LLaMA2-7B-Chat | 26.3 |
| LLaMA2-13B-Chat | 37.1 |
| LLaMA2-70B-Chat | 59.3 |
| Yi-34B-Chat | 71.6 |
| ChatGPT3.5 | 73.2 |
| ChatGPT4 | 91.4 |
| Qwen-7B-Chat (original) (0-shot) | 41.1 |
| Qwen-7B-Chat (0-shot) | 50.3 |
| Qwen-7B-Chat (8-shot) | 54.1 |
| Qwen-14B-Chat (0-shot) | 60.1 |
| Qwen-14B-Chat (8-shot) | 59.3 |
| Qwen-72B-Chat (0-shot) | 76.4 |
| Qwen-72B-Chat (8-shot) | 75.7 |
Long-Context Understanding
Qwen-72B-Chat supports context lengths of up to 32k. The scores on the L-Eval closed-ended tasks are as follows:
| Model | Average | Coursera | GSM | QuALITY | TOEFL | CodeU | SFiction |
|---|---|---|---|---|---|---|---|
| ChatGPT-3.5-16k | 60.73 | 63.51 | 84.00 | 61.38 | 78.43 | 12.22 | 64.84 |
| Qwen-72B-Chat | 62.30 | 58.13 | 76.00 | 77.22 | 86.24 | 6.66 | 69.53 |
We conducted the "needle in a haystack" experiment (the idea comes from @Greg Kamradt) to test whether the model can retrieve information placed at different positions in inputs of different lengths. The results are as follows:
The above results show that Qwen-72B-Chat can accurately retrieve information placed at various positions within inputs up to 32k in length, demonstrating its excellent long-text understanding capabilities.
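For readers who want to reproduce a similar probe, the sketch below outlines one way to run a needle-in-a-haystack style test; the filler text, needle, and prompt wording are illustrative, and it assumes model and tokenizer are loaded as in the Quickstart section:

# Hedged sketch of a needle-in-a-haystack probe (idea credited to @Greg Kamradt); wording is illustrative.
needle = "小明的幸运数字是42。"                   # the fact to hide
filler = "这是一段与问题无关的填充文本。" * 2000    # adjust repetitions to approach the 32k context limit
depth = 0.5                                      # relative position of the needle (0.0 = start, 1.0 = end)
cut = int(len(filler) * depth)
context = filler[:cut] + needle + filler[cut:]
question = "小明的幸运数字是多少？请只根据上文回答。"
response, _ = model.chat(tokenizer, context + "\n\n" + question, history=None)
print(response)  # a model with good long-context retrieval should recover "42" regardless of depth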
FAQ
If you meet problems, please refer to the FAQ and existing issues to look for a solution before opening a new issue.
Citation
If you find our work helpful, feel free to cite it:
@article{qwen,
title={Qwen Technical Report},
author={Jinze Bai and Shuai Bai and Yunfei Chu and Zeyu Cui and Kai Dang and Xiaodong Deng and Yang Fan and Wenbin Ge and Yu Han and Fei Huang and Binyuan Hui and Luo Ji and Mei Li and Junyang Lin and Runji Lin and Dayiheng Liu and Gao Liu and Chengqiang Lu and Keming Lu and Jianxin Ma and Rui Men and Xingzhang Ren and Xuancheng Ren and Chuanqi Tan and Sinan Tan and Jianhong Tu and Peng Wang and Shijie Wang and Wei Wang and Shengguang Wu and Benfeng Xu and Jin Xu and An Yang and Hao Yang and Jian Yang and Shusheng Yang and Yang Yao and Bowen Yu and Hongyi Yuan and Zheng Yuan and Jianwei Zhang and Xingxuan Zhang and Yichang Zhang and Zhenru Zhang and Chang Zhou and Jingren Zhou and Xiaohuan Zhou and Tianhang Zhu},
journal={arXiv preprint arXiv:2309.16609},
year={2023}
}
License Agreement
Our code and model checkpoints are openly available for research purposes and also permit commercial use. Check the LICENSE for more details. For commercial use, please fill out the form to apply.
Contact Us
If you would like to leave a message for our research or product teams, join our Discord or WeChat groups! You can also reach us by email at qianwen_opensource@alibabacloud.com.