
Qwen-72B-Chat


๐Ÿค— Hugging Face   |   ๐Ÿค– ModelScope   |   ๐Ÿ“‘ Paper   |   ๐Ÿ–ฅ๏ธ Demo
WeChat (ๅพฎไฟก)   |   Discord   |   API


ไป‹็ป๏ผˆIntroduction๏ผ‰

้€šไน‰ๅƒ้—ฎ-72B๏ผˆQwen-72B๏ผ‰ๆ˜ฏ้˜ฟ้‡Œไบ‘็ ”ๅ‘็š„้€šไน‰ๅƒ้—ฎๅคงๆจกๅž‹็ณปๅˆ—็š„720ไบฟๅ‚ๆ•ฐ่ง„ๆจก็š„ๆจกๅž‹ใ€‚Qwen-72Bๆ˜ฏๅŸบไบŽTransformer็š„ๅคง่ฏญ่จ€ๆจกๅž‹, ๅœจ่ถ…ๅคง่ง„ๆจก็š„้ข„่ฎญ็ปƒๆ•ฐๆฎไธŠ่ฟ›่กŒ่ฎญ็ปƒๅพ—ๅˆฐใ€‚้ข„่ฎญ็ปƒๆ•ฐๆฎ็ฑปๅž‹ๅคšๆ ท๏ผŒ่ฆ†็›–ๅนฟๆณ›๏ผŒๅŒ…ๆ‹ฌๅคง้‡็ฝ‘็ปœๆ–‡ๆœฌใ€ไธ“ไธšไนฆ็ฑใ€ไปฃ็ ็ญ‰ใ€‚ๅŒๆ—ถ๏ผŒๅœจQwen-72B็š„ๅŸบ็ก€ไธŠ๏ผŒๆˆ‘ไปฌไฝฟ็”จๅฏน้ฝๆœบๅˆถๆ‰“้€ ไบ†ๅŸบไบŽๅคง่ฏญ่จ€ๆจกๅž‹็š„AIๅŠฉๆ‰‹Qwen-72B-Chatใ€‚ๆœฌไป“ๅบ“ไธบQwen-72B-Chat็š„ไป“ๅบ“ใ€‚

้€šไน‰ๅƒ้—ฎ-72B๏ผˆQwen-72B๏ผ‰ไธป่ฆๆœ‰ไปฅไธ‹็‰น็‚น๏ผš

  1. ๅคง่ง„ๆจก้ซ˜่ดจ้‡่ฎญ็ปƒ่ฏญๆ–™๏ผšไฝฟ็”จ่ถ…่ฟ‡3ไธ‡ไบฟtokens็š„ๆ•ฐๆฎ่ฟ›่กŒ้ข„่ฎญ็ปƒ๏ผŒๅŒ…ๅซ้ซ˜่ดจ้‡ไธญใ€่‹ฑใ€ๅคš่ฏญ่จ€ใ€ไปฃ็ ใ€ๆ•ฐๅญฆ็ญ‰ๆ•ฐๆฎ๏ผŒๆถต็›–้€š็”จๅŠไธ“ไธš้ข†ๅŸŸ็š„่ฎญ็ปƒ่ฏญๆ–™ใ€‚้€š่ฟ‡ๅคง้‡ๅฏนๆฏ”ๅฎž้ชŒๅฏน้ข„่ฎญ็ปƒ่ฏญๆ–™ๅˆ†ๅธƒ่ฟ›่กŒไบ†ไผ˜ๅŒ–ใ€‚
  2. ๅผบๅคง็š„ๆ€ง่ƒฝ๏ผšQwen-72Bๅœจๅคšไธชไธญ่‹ฑๆ–‡ไธ‹ๆธธ่ฏ„ๆต‹ไปปๅŠกไธŠ๏ผˆๆถต็›–ๅธธ่ฏ†ๆŽจ็†ใ€ไปฃ็ ใ€ๆ•ฐๅญฆใ€็ฟป่ฏ‘็ญ‰๏ผ‰๏ผŒๆ•ˆๆžœๆ˜พ่‘—่ถ…่ถŠ็Žฐๆœ‰็š„ๅผ€ๆบๆจกๅž‹ใ€‚ๅ…ทไฝ“่ฏ„ๆต‹็ป“ๆžœ่ฏท่ฏฆ่งไธ‹ๆ–‡ใ€‚
  3. ่ฆ†็›–ๆ›ดๅ…จ้ข็š„่ฏ่กจ๏ผš็›ธๆฏ”็›ฎๅ‰ไปฅไธญ่‹ฑ่ฏ่กจไธบไธป็š„ๅผ€ๆบๆจกๅž‹๏ผŒQwen-72Bไฝฟ็”จไบ†็บฆ15ไธ‡ๅคงๅฐ็š„่ฏ่กจใ€‚่ฏฅ่ฏ่กจๅฏนๅคš่ฏญ่จ€ๆ›ดๅŠ ๅ‹ๅฅฝ๏ผŒๆ–นไพฟ็”จๆˆทๅœจไธๆ‰ฉๅฑ•่ฏ่กจ็š„ๆƒ…ๅ†ตไธ‹ๅฏน้ƒจๅˆ†่ฏญ็ง่ฟ›่กŒ่ƒฝๅŠ›ๅขžๅผบๅ’Œๆ‰ฉๅฑ•ใ€‚
  4. ๆ›ด้•ฟ็š„ไธŠไธ‹ๆ–‡ๆ”ฏๆŒ๏ผšQwen-72Bๆ”ฏๆŒ32k็š„ไธŠไธ‹ๆ–‡้•ฟๅบฆใ€‚
  5. ็ณป็ปŸๆŒ‡ไปค่ทŸ้š๏ผšQwen-72B-Chatๅฏไปฅ้€š่ฟ‡่ฐƒๆ•ด็ณป็ปŸๆŒ‡ไปค๏ผŒๅฎž็Žฐ่ง’่‰ฒๆ‰ฎๆผ”๏ผŒ่ฏญ่จ€้ฃŽๆ ผ่ฟ็งป๏ผŒไปปๅŠก่ฎพๅฎš๏ผŒๅ’Œ่กŒไธบ่ฎพๅฎš็ญ‰่ƒฝๅŠ›ใ€‚

ๅฆ‚ๆžœๆ‚จๆƒณไบ†่งฃๆ›ดๅคšๅ…ณไบŽ้€šไน‰ๅƒ้—ฎ72Bๅผ€ๆบๆจกๅž‹็š„็ป†่Š‚๏ผŒๆˆ‘ไปฌๅปบ่ฎฎๆ‚จๅ‚้˜…GitHubไปฃ็ ๅบ“ใ€‚

Qwen-72B is the 72B-parameter version of the large language model series, Qwen (abbr. Tongyi Qianwen), proposed by Alibaba Cloud. Qwen-72B is a Transformer-based large language model, which is pretrained on a large volume of data, including web texts, books, code, etc. Additionally, based on the pretrained Qwen-72B, we release Qwen-72B-Chat, a large-model-based AI assistant, which is trained with alignment techniques. This repository contains Qwen-72B-Chat.

The features of Qwen-72B include:

  1. Large-scale, high-quality training corpora: It is pretrained on over 3 trillion tokens, including high-quality Chinese, English, and multilingual texts, code, and mathematics, covering general and professional fields. The distribution of the pretraining corpus has been optimized through a large number of ablation experiments.
  2. Competitive performance: It significantly surpasses existing open-source models on multiple Chinese and English downstream evaluation tasks (covering commonsense reasoning, code, mathematics, translation, etc.). See below for detailed evaluation results.
  3. More comprehensive vocabulary coverage: Compared with other open-source models based mainly on Chinese and English vocabularies, Qwen-72B uses a vocabulary of over 150K tokens. This vocabulary is friendlier to multiple languages, enabling users to directly enhance the capability for certain languages without expanding the vocabulary.
  4. Longer context support: Qwen-72B supports a context length of 32k tokens.
  5. System prompts: Qwen-72B-Chat can realize role playing, language style transfer, task setting, and behavior setting via the system prompt.

For more details about the open-source model of Qwen-72B, please refer to the GitHub code repository.

่ฆๆฑ‚๏ผˆRequirements๏ผ‰

  • python 3.8ๅŠไปฅไธŠ็‰ˆๆœฌ
  • pytorch 1.12ๅŠไปฅไธŠ็‰ˆๆœฌ๏ผŒๆŽจ่2.0ๅŠไปฅไธŠ็‰ˆๆœฌ
  • ๅปบ่ฎฎไฝฟ็”จCUDA 11.4ๅŠไปฅไธŠ๏ผˆGPU็”จๆˆทใ€flash-attention็”จๆˆท็ญ‰้œ€่€ƒ่™‘ๆญค้€‰้กน๏ผ‰
  • ่ฟ่กŒBF16ๆˆ–FP16ๆจกๅž‹้œ€่ฆๅคšๅก่‡ณๅฐ‘144GBๆ˜พๅญ˜๏ผˆไพ‹ๅฆ‚2xA100-80Gๆˆ–5xV100-32G๏ผ‰๏ผ›่ฟ่กŒInt4ๆจกๅž‹่‡ณๅฐ‘้œ€่ฆ48GBๆ˜พๅญ˜๏ผˆไพ‹ๅฆ‚1xA100-80Gๆˆ–2xV100-32G๏ผ‰
  • python 3.8 and above
  • pytorch 1.12 and above; 2.0 and above is recommended
  • CUDA 11.4 and above is recommended (this applies to GPU users, flash-attention users, etc.)
  • To run Qwen-72B-Chat in bf16/fp16, at least 144GB of total GPU memory across multiple cards is required (e.g., 2xA100-80G or 5xV100-32G). To run it in int4, at least 48GB of GPU memory is required (e.g., 1xA100-80G or 2xV100-32G)
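
ไธ‹้ข็š„่„šๆœฌๅฏ็”จไบŽๅฟซ้€Ÿๆฃ€ๆŸฅ่ฟ่กŒ็Žฏๅขƒๆ˜ฏๅฆๆปก่ถณไธŠ่ฟฐ่ฆๆฑ‚ใ€‚

The following snippet is a minimal sketch for checking whether your environment satisfies the above requirements (it assumes only the standard torch package):

import sys

import torch

# Check Python and PyTorch versions against the requirements above.
print(f"Python:  {sys.version_info.major}.{sys.version_info.minor}")  # want >= 3.8
print(f"PyTorch: {torch.__version__}")                                # want >= 1.12, ideally >= 2.0
print(f"CUDA:    {torch.version.cuda}")                               # want >= 11.4 for GPU users

# Sum memory across visible GPUs: bf16/fp16 needs ~144GB in total, int4 needs ~48GB.
if torch.cuda.is_available():
    total_gb = sum(
        torch.cuda.get_device_properties(i).total_memory
        for i in range(torch.cuda.device_count())
    ) / 1024**3
    print(f"Total GPU memory: {total_gb:.1f}GB")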

ไพ่ต–้กน๏ผˆDependency๏ผ‰

ไฝฟ็”จHuggingFace่ฟ›่กŒๆŽจ็†

่ฟ่กŒQwen-72B-Chat๏ผŒ่ฏท็กฎไฟๆปก่ถณไธŠ่ฟฐ่ฆๆฑ‚๏ผŒๅ†ๆ‰ง่กŒไปฅไธ‹pipๅ‘ฝไปคๅฎ‰่ฃ…ไพ่ต–ๅบ“

To run Qwen-72B-Chat, please make sure you meet the above requirements, and then execute the following pip commands to install the dependent libraries.

pip install "transformers>=4.32.0" accelerate tiktoken einops scipy transformers_stream_generator==0.0.4 peft deepspeed

ๅฆๅค–๏ผŒๆŽจ่ๅฎ‰่ฃ…flash-attentionๅบ“๏ผˆๅฝ“ๅ‰ๅทฒๆ”ฏๆŒflash attention 2๏ผ‰๏ผŒไปฅๅฎž็Žฐๆ›ด้ซ˜็š„ๆ•ˆ็އๅ’Œๆ›ดไฝŽ็š„ๆ˜พๅญ˜ๅ ็”จใ€‚

In addition, it is recommended to install the flash-attention library (flash attention 2 is now supported) for higher efficiency and lower memory usage.

git clone https://github.com/Dao-AILab/flash-attention
cd flash-attention && pip install .
# ไธ‹ๆ–นๅฎ‰่ฃ…ๅฏ้€‰๏ผŒๅฎ‰่ฃ…ๅฏ่ƒฝๆฏ”่พƒ็ผ“ๆ…ขใ€‚
# Below are optional. Installing them might be slow.
# pip install csrc/layer_norm
# ๅฆ‚ๆžœไฝ ็š„flash-attn็‰ˆๆœฌ้ซ˜ไบŽ2.1.1๏ผŒไธ‹ๆ–นไธ้œ€่ฆๅฎ‰่ฃ…ใ€‚
# If the version of flash-attn is higher than 2.1.1, the following is not needed.
# pip install csrc/rotary
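
ๅฎ‰่ฃ…ๅฎŒๆˆๅŽ๏ผŒๅฏ็”จไปฅไธ‹ไปฃ็ ็กฎ่ฎคflash-attentionๅฏๆญฃๅธธๅฏผๅ…ฅๅนถๆŸฅ็œ‹ๅ…ถ็‰ˆๆœฌใ€‚

After installation, a quick sanity check (a minimal sketch; flash_attn is the module name the library installs) confirms that flash-attention is importable and shows which version was built:

# Verify the flash-attention build; a 2.x version means flash attention 2 is available.
import flash_attn
print(flash_attn.__version__)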

ไฝฟ็”จvLLM่ฟ›่กŒๆŽจ็†

ไฝฟ็”จvLLM่ฟ›่กŒๆŽจ็†ๅฏไปฅๆ”ฏๆŒๆ›ด้•ฟ็š„ไธŠไธ‹ๆ–‡้•ฟๅบฆๅนถ่Žทๅพ—่‡ณๅฐ‘ไธคๅ€็š„็”ŸๆˆๅŠ ้€Ÿใ€‚ไฝ ้œ€่ฆๆปก่ถณไปฅไธ‹่ฆๆฑ‚๏ผš

Using vLLM for inference can support longer context lengths and achieve at least a 2x generation speedup. You need to meet the following requirements:

  • pytorch >= 2.0
  • cuda 11.8 or 12.1

ๅฆ‚ๆžœไฝ ไฝฟ็”จcuda12.1ๅ’Œpytorch2.1๏ผŒๅฏไปฅ็›ดๆŽฅไฝฟ็”จไปฅไธ‹ๅ‘ฝไปคๅฎ‰่ฃ…vLLMใ€‚

If you use cuda 12.1 and pytorch 2.1, you can directly use the following command to install vLLM.

# pip install vllm  # This installs faster but does not support quantized models.

# The lines below support int4 quantization (int8 will be supported soon). The installation is slower (~10 minutes).
git clone https://github.com/QwenLM/vllm-gptq
cd vllm-gptq
pip install -e .

ๅฆๅˆ™่ฏทๅ‚่€ƒvLLMๅฎ˜ๆ–น็š„ๅฎ‰่ฃ…่ฏดๆ˜Ž๏ผŒๆˆ–่€…ๆˆ‘ไปฌvLLMๅˆ†ๆ”ฏไป“ๅบ“๏ผˆๆ”ฏๆŒ้‡ๅŒ–ๆจกๅž‹๏ผ‰ใ€‚

Otherwise, please refer to the official vLLM Installation Instructions, or our vLLM repo for GPTQ quantization.

ๅฟซ้€Ÿไฝฟ็”จ๏ผˆQuickstart๏ผ‰

ไฝฟ็”จHuggingFace Transformers่ฟ›่กŒๆŽจ็†๏ผˆInference with Huggingface Transformers๏ผ‰

ไธ‹้ขๆˆ‘ไปฌๅฑ•็คบไบ†ไธ€ไธชไฝฟ็”จQwen-72B-Chatๆจกๅž‹๏ผŒ่ฟ›่กŒๅคš่ฝฎๅฏน่ฏไบคไบ’็š„ๆ ทไพ‹๏ผš

We show an example of multi-turn interaction with Qwen-72B-Chat in the following code:

from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.generation import GenerationConfig

# Note: The default behavior now has injection attack prevention off.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-72B-Chat", trust_remote_code=True)

# use bf16
# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-72B-Chat", device_map="auto", trust_remote_code=True, bf16=True).eval()
# use fp16
# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-72B-Chat", device_map="auto", trust_remote_code=True, fp16=True).eval()
# use cpu only
# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-72B-Chat", device_map="cpu", trust_remote_code=True).eval()
# use auto mode, automatically select precision based on the device.
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-72B-Chat", device_map="auto", trust_remote_code=True).eval()
# NOTE: The above line would require at least 144GB memory in total

# Specify hyperparameters for generation. But if you use transformers>=4.32.0, there is no need to do this.
# model.generation_config = GenerationConfig.from_pretrained("Qwen/Qwen-72B-Chat", trust_remote_code=True) # You can specify different generation lengths, top_p, and other related hyperparameters.

# ็ฌฌไธ€่ฝฎๅฏน่ฏ 1st dialogue turn
response, history = model.chat(tokenizer, "ไฝ ๅฅฝ", history=None)
print(response)
# ไฝ ๅฅฝ๏ผๅพˆ้ซ˜ๅ…ดไธบไฝ ๆไพ›ๅธฎๅŠฉใ€‚

# ็ฌฌไบŒ่ฝฎๅฏน่ฏ 2nd dialogue turn
response, history = model.chat(tokenizer, "็ป™ๆˆ‘่ฎฒไธ€ไธชๅนด่ฝปไบบๅฅ‹ๆ–—ๅˆ›ไธšๆœ€็ปˆๅ–ๅพ—ๆˆๅŠŸ็š„ๆ•…ไบ‹ใ€‚", history=history)
print(response)
# ่ฟ™ๆ˜ฏไธ€ไธชๅ…ณไบŽไธ€ไธชๅนด่ฝปไบบๅฅ‹ๆ–—ๅˆ›ไธšๆœ€็ปˆๅ–ๅพ—ๆˆๅŠŸ็š„ๆ•…ไบ‹ใ€‚
# ๆ•…ไบ‹็š„ไธปไบบๅ…ฌๅซๆŽๆ˜Ž๏ผŒไป–ๆฅ่‡ชไธ€ไธชๆ™ฎ้€š็š„ๅฎถๅบญ๏ผŒ็ˆถๆฏ้ƒฝๆ˜ฏๆ™ฎ้€š็š„ๅทฅไบบใ€‚ไปŽๅฐ๏ผŒๆŽๆ˜Žๅฐฑ็ซ‹ไธ‹ไบ†ไธ€ไธช็›ฎๆ ‡๏ผš่ฆๆˆไธบไธ€ๅๆˆๅŠŸ็š„ไผไธšๅฎถใ€‚
# ไธบไบ†ๅฎž็Žฐ่ฟ™ไธช็›ฎๆ ‡๏ผŒๆŽๆ˜Žๅ‹คๅฅ‹ๅญฆไน ๏ผŒ่€ƒไธŠไบ†ๅคงๅญฆใ€‚ๅœจๅคงๅญฆๆœŸ้—ด๏ผŒไป–็งฏๆžๅ‚ๅŠ ๅ„็งๅˆ›ไธšๆฏ”่ต›๏ผŒ่Žทๅพ—ไบ†ไธๅฐ‘ๅฅ–้กนใ€‚ไป–่ฟ˜ๅˆฉ็”จ่ฏพไฝ™ๆ—ถ้—ดๅŽปๅฎžไน ๏ผŒ็งฏ็ดฏไบ†ๅฎ่ดต็š„็ป้ชŒใ€‚
# ๆฏ•ไธšๅŽ๏ผŒๆŽๆ˜Žๅ†ณๅฎšๅผ€ๅง‹่‡ชๅทฑ็š„ๅˆ›ไธšไน‹่ทฏใ€‚ไป–ๅผ€ๅง‹ๅฏปๆ‰พๆŠ•่ต„ๆœบไผš๏ผŒไฝ†ๅคšๆฌก้ƒฝ่ขซๆ‹’็ปไบ†ใ€‚็„ถ่€Œ๏ผŒไป–ๅนถๆฒกๆœ‰ๆ”พๅผƒใ€‚ไป–็ปง็ปญๅŠชๅŠ›๏ผŒไธๆ–ญๆ”น่ฟ›่‡ชๅทฑ็š„ๅˆ›ไธš่ฎกๅˆ’๏ผŒๅนถๅฏปๆ‰พๆ–ฐ็š„ๆŠ•่ต„ๆœบไผšใ€‚
# ๆœ€็ปˆ๏ผŒๆŽๆ˜ŽๆˆๅŠŸๅœฐ่Žทๅพ—ไบ†ไธ€็ฌ”ๆŠ•่ต„๏ผŒๅผ€ๅง‹ไบ†่‡ชๅทฑ็š„ๅˆ›ไธšไน‹่ทฏใ€‚ไป–ๆˆ็ซ‹ไบ†ไธ€ๅฎถ็ง‘ๆŠ€ๅ…ฌๅธ๏ผŒไธ“ๆณจไบŽๅผ€ๅ‘ๆ–ฐๅž‹่ฝฏไปถใ€‚ๅœจไป–็š„้ข†ๅฏผไธ‹๏ผŒๅ…ฌๅธ่ฟ…้€Ÿๅ‘ๅฑ•่ตทๆฅ๏ผŒๆˆไธบไบ†ไธ€ๅฎถๆˆๅŠŸ็š„็ง‘ๆŠ€ไผไธšใ€‚
# ๆŽๆ˜Ž็š„ๆˆๅŠŸๅนถไธๆ˜ฏๅถ็„ถ็š„ใ€‚ไป–ๅ‹คๅฅ‹ใ€ๅš้Ÿงใ€ๅ‹‡ไบŽๅ†’้™ฉ๏ผŒไธๆ–ญๅญฆไน ๅ’Œๆ”น่ฟ›่‡ชๅทฑใ€‚ไป–็š„ๆˆๅŠŸไนŸ่ฏๆ˜Žไบ†๏ผŒๅช่ฆๅŠชๅŠ›ๅฅ‹ๆ–—๏ผŒไปปไฝ•ไบบ้ƒฝๆœ‰ๅฏ่ƒฝๅ–ๅพ—ๆˆๅŠŸใ€‚

# ็ฌฌไธ‰่ฝฎๅฏน่ฏ 3rd dialogue turn
response, history = model.chat(tokenizer, "็ป™่ฟ™ไธชๆ•…ไบ‹่ตทไธ€ไธชๆ ‡้ข˜", history=history)
print(response)
# ใ€Šๅฅ‹ๆ–—ๅˆ›ไธš๏ผšไธ€ไธชๅนด่ฝปไบบ็š„ๆˆๅŠŸไน‹่ทฏใ€‹

# Qwen-72B-Chat็Žฐๅœจๅฏไปฅ้€š่ฟ‡่ฐƒๆ•ด็ณป็ปŸๆŒ‡ไปค๏ผˆSystem Prompt๏ผ‰๏ผŒๅฎž็Žฐ่ง’่‰ฒๆ‰ฎๆผ”๏ผŒ่ฏญ่จ€้ฃŽๆ ผ่ฟ็งป๏ผŒไปปๅŠก่ฎพๅฎš๏ผŒ่กŒไธบ่ฎพๅฎš็ญ‰่ƒฝๅŠ›ใ€‚
# Qwen-72B-Chat can realize role playing, language style transfer, task setting, and behavior setting via the system prompt.
response, _ = model.chat(tokenizer, "ไฝ ๅฅฝๅ‘€", history=None, system="่ฏท็”จไบŒๆฌกๅ…ƒๅฏ็ˆฑ่ฏญๆฐ”ๅ’Œๆˆ‘่ฏด่ฏ")
print(response)
# ๅ“Žๅ‘€๏ผŒไฝ ๅฅฝๅ“‡๏ผๆ˜ฏๆ€Žไนˆๆ‰พๅˆฐไบบๅฎถ็š„ๅ‘ข๏ผŸๆ˜ฏไธๆ˜ฏ่ขซไบบๅฎถ็š„้ญ…ๅŠ›ๅธๅผ•่ฟ‡ๆฅ็š„ๅ‘€~(โ‰งโ–ฝโ‰ฆ)/~

response, _ = model.chat(tokenizer, "My colleague works diligently", history=None, system="You will write beautiful compliments according to needs")
print(response)
# Your colleague is a shining example of dedication and hard work. Their commitment to their job is truly commendable, and it shows in the quality of their work. 
# They are an asset to the team, and their efforts do not go unnoticed. Keep up the great work!
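
ๅฆ‚้œ€ๆตๅผ่พ“ๅ‡บ๏ผŒQwen็š„่ฟœ็จ‹ไปฃ็ ่ฟ˜ๆไพ›ไบ†ไธŽmodel.chat้ฃŽๆ ผไธ€่‡ด็š„chat_streamๆŽฅๅฃใ€‚

For streaming output, the remote code shipped with Qwen also provides a chat_stream interface in the same style as model.chat. A minimal sketch, assuming the model and tokenizer loaded above (chat_stream yields progressively longer partial responses):

# Stream the reply instead of waiting for the full response.
for partial_response in model.chat_stream(tokenizer, "็ป™ๆˆ‘่ฎฒไธ€ไธช็ฌ‘่ฏ", history=None):
    print(partial_response)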

ไฝฟ็”จvLLMๅ’Œ็ฑปTransformersๆŽฅๅฃ่ฟ›่กŒๆŽจ็†๏ผˆInference with vLLM and Transformers-like APIs๏ผ‰

ๅœจๆ นๆฎไธŠๆ–นไพ่ต–ๆ€ง้ƒจๅˆ†็š„่ฏดๆ˜Žๅฎ‰่ฃ…vLLMๅŽ๏ผŒๅฏไปฅไธ‹่ฝฝๆŽฅๅฃๅฐ่ฃ…ไปฃ็ ๅˆฐๅฝ“ๅ‰ๆ–‡ไปถๅคน๏ผŒๅนถๆ‰ง่กŒไปฅไธ‹ๅ‘ฝไปค่ฟ›่กŒๅคš่ฝฎๅฏน่ฏไบคไบ’ใ€‚๏ผˆๆณจๆ„๏ผš่ฏฅๆ–นๆณ•ๅฝ“ๅ‰ๅชๆ”ฏๆŒmodel.chat()ๆŽฅๅฃใ€‚๏ผ‰

After installing vLLM according to the dependency section above, you can download the wrapper code to the current folder and execute the following code for multi-turn dialogue interaction. (Note: It currently only supports the model.chat() method.)

from vllm_wrapper import vLLMWrapper

model = vLLMWrapper('Qwen/Qwen-72B-Chat', tensor_parallel_size=2)
# model = vLLMWrapper('Qwen/Qwen-72B-Chat-Int4', tensor_parallel_size=1, dtype="float16")  # ่ฟ่กŒint4ๆจกๅž‹ใ€‚ run int4 model.

response, history = model.chat(query="ไฝ ๅฅฝ", history=None)
print(response)
response, history = model.chat(query="็ป™ๆˆ‘่ฎฒไธ€ไธชๅนด่ฝปไบบๅฅ‹ๆ–—ๅˆ›ไธšๆœ€็ปˆๅ–ๅพ—ๆˆๅŠŸ็š„ๆ•…ไบ‹ใ€‚", history=history)
print(response)
response, history = model.chat(query="็ป™่ฟ™ไธชๆ•…ไบ‹่ตทไธ€ไธชๆ ‡้ข˜", history=history)
print(response)

ไฝฟ็”จvLLMๅ’Œ็ฑปOpenAIๆŽฅๅฃ่ฟ›่กŒๆŽจ็†๏ผˆInference with vLLM and OpenAI-like API๏ผ‰

่ฏทๅ‚่€ƒๆˆ‘ไปฌGitHub repoไธญvLLM้ƒจ็ฝฒๅ’ŒOpenAIๆŽฅๅฃไฝฟ็”จไธคไธช้ƒจๅˆ†็š„ไป‹็ปใ€‚

Please refer to the introduction of vLLM deployment and OpenAI interface usage in our GitHub repo.

ๅฆ‚ๆžœไฝฟ็”จ2xA100-80G่ฟ›่กŒ้ƒจ็ฝฒ๏ผŒๅฏไปฅ่ฟ่กŒไปฅไธ‹ไปฃ็ ๏ผš

If deploying with 2xA100-80G, you can run the following code:

python -m fastchat.serve.controller
python -m fastchat.serve.vllm_worker --model-path Qwen/Qwen-72B-Chat --trust-remote-code --tensor-parallel-size 2 --gpu-memory-utilization 0.98 --dtype bfloat16
# python -m fastchat.serve.vllm_worker --model-path Qwen/Qwen-72B-Chat-Int4 --trust-remote-code --dtype float16  # ่ฟ่กŒint4ๆจกๅž‹ใ€‚ run int4 model.
python -m fastchat.serve.openai_api_server --host localhost --port 8000

ๆณจๆ„้œ€่ฆ--gpu-memory-utilization 0.98ๅ‚ๆ•ฐ้ฟๅ…OOM้—ฎ้ข˜ใ€‚

Note that the --gpu-memory-utilization 0.98 parameter is required to avoid OOM problems.
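
ๆœๅŠกๅฏๅŠจๅŽ๏ผŒๅณๅฏ้€š่ฟ‡OpenAIๅ…ผๅฎนๆŽฅๅฃ่ฐƒ็”จๆจกๅž‹ใ€‚

Once the three services are up, the endpoint speaks the OpenAI chat-completions protocol, so you can query it with the openai Python package. A minimal sketch, assuming the legacy openai<1.0 client and that FastChat registers the model under the name Qwen-72B-Chat:

import openai

# Point the client at the local FastChat OpenAI-compatible server started above.
openai.api_base = "http://localhost:8000/v1"
openai.api_key = "EMPTY"  # the local server does not check API keys

response = openai.ChatCompletion.create(
    model="Qwen-72B-Chat",
    messages=[{"role": "user", "content": "ไฝ ๅฅฝ"}],
)
print(response.choices[0].message.content)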


ๅ…ณไบŽๆ›ดๅคš็š„ไฝฟ็”จ่ฏดๆ˜Ž๏ผŒ่ฏทๅ‚่€ƒๆˆ‘ไปฌ็š„GitHub repo่Žทๅ–ๆ›ดๅคšไฟกๆฏใ€‚

For more usage instructions, please refer to our GitHub repo.

้‡ๅŒ– (Quantization)

็”จๆณ• (Usage)

ไปฅไธ‹ๆˆ‘ไปฌๆไพ›็คบไพ‹่ฏดๆ˜Žๅฆ‚ไฝ•ไฝฟ็”จInt4/Int8้‡ๅŒ–ๆจกๅž‹ใ€‚ๅœจๅผ€ๅง‹ไฝฟ็”จๅ‰๏ผŒ่ฏทๅ…ˆไฟ่ฏๆปก่ถณ่ฆๆฑ‚๏ผˆๅฆ‚torch 2.0ๅŠไปฅไธŠ๏ผŒtransformers็‰ˆๆœฌไธบ4.32.0ๅŠไปฅไธŠ๏ผŒ็ญ‰็ญ‰๏ผ‰๏ผŒๅนถๅฎ‰่ฃ…ๆ‰€้œ€ๅฎ‰่ฃ…ๅŒ…๏ผš

Here we demonstrate how to use our provided quantized models for inference. Before you start, make sure you meet the requirements of auto-gptq (e.g., torch 2.0 and above, transformers 4.32.0 and above, etc.) and install the required packages:

pip install auto-gptq optimum

ๅฆ‚ๅฎ‰่ฃ…auto-gptq้‡ๅˆฐ้—ฎ้ข˜๏ผŒๆˆ‘ไปฌๅปบ่ฎฎๆ‚จๅˆฐๅฎ˜ๆ–นrepoๆœ็ดขๅˆ้€‚็š„้ข„็ผ–่ฏ‘wheelใ€‚

If you meet problems installing auto-gptq, we advise you to check out the official repo to find a pre-built wheel.

ๆณจๆ„๏ผš้ข„็ผ–่ฏ‘็š„auto-gptq็‰ˆๆœฌๅฏนtorch็‰ˆๆœฌๅŠๅ…ถCUDA็‰ˆๆœฌ่ฆๆฑ‚ไธฅๆ ผใ€‚ๅŒๆ—ถ๏ผŒ็”ฑไบŽ ๅ…ถ่ฟ‘ๆœŸๆ›ดๆ–ฐ๏ผŒไฝ ๅฏ่ƒฝไผš้‡ๅˆฐtransformersใ€optimumๆˆ–peftๆŠ›ๅ‡บ็š„็‰ˆๆœฌ้”™่ฏฏใ€‚ ๆˆ‘ไปฌๅปบ่ฎฎไฝฟ็”จ็ฌฆๅˆไปฅไธ‹่ฆๆฑ‚็š„ๆœ€ๆ–ฐ็‰ˆๆœฌ๏ผš

  • torch==2.1 auto-gptq>=0.5.1 transformers>=4.35.0 optimum>=1.14.0 peft>=0.6.1
  • torch>=2.0,<2.1 auto-gptq<0.5.0 transformers<4.35.0 optimum<1.14.0 peft>=0.5.0,<0.6.0 Note: The pre-compiled auto-gptq packages strongly depend on the version of torch and its CUDA version. Moreover, due to recent update, you may also encounter unsupported version errors from transformers, optimum, or peft. We recommend using the latest versions meeting the following requirements :
  • torch==2.1 auto-gptq>=0.5.1 transformers>=4.35.0 optimum>=1.14.0 peft>=0.6.1
  • torch>=2.0,<2.1 auto-gptq<0.5.0 transformers<4.35.0 optimum<1.14.0 peft>=0.5.0,<0.6.0

้šๅŽๅณๅฏไฝฟ็”จๅ’ŒไธŠ่ฟฐไธ€่‡ด็š„็”จๆณ•่ฐƒ็”จ้‡ๅŒ–ๆจกๅž‹๏ผš

Then you can load the quantized model easily and run inference in the same way as shown above:

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-72B-Chat-Int4", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-72B-Chat-Int4",
    device_map="auto",
    trust_remote_code=True
).eval()
response, history = model.chat(tokenizer, "ไฝ ๅฅฝ", history=None)

ๆณจๆ„๏ผšไฝฟ็”จvLLM่ฟ่กŒ้‡ๅŒ–ๆจกๅž‹้œ€ๅฎ‰่ฃ…ๆˆ‘ไปฌvLLMๅˆ†ๆ”ฏไป“ๅบ“ใ€‚ๆš‚ไธๆ”ฏๆŒint8ๆจกๅž‹๏ผŒ่ฟ‘ๆœŸๅฐ†ๆ›ดๆ–ฐใ€‚

Note: To run quantized models with vLLM, you need to install our vLLM fork (https://github.com/qwenlm/vllm-gptq). The int8 model is not supported for the time being; we will add support soon.

ๆ•ˆๆžœ่ฏ„ๆต‹๏ผˆPerformance๏ผ‰

ๆˆ‘ไปฌๅฏนBF16๏ผŒInt8ๅ’ŒInt4ๆจกๅž‹ๅœจๅŸบๅ‡†่ฏ„ๆต‹ไธŠๅšไบ†ๆต‹่ฏ•๏ผˆไฝฟ็”จzero-shot่ฎพ็ฝฎ๏ผ‰๏ผŒ็ป“ๆžœๅฆ‚ไธ‹ๆ‰€็คบ๏ผš

We illustrate the zero-shot performance of the BF16, Int8, and Int4 models on the benchmarks. Results are shown below:

| Quantization | MMLU | C-Eval (val) | GSM8K | HumanEval |
| --- | --- | --- | --- | --- |
| BF16 | 74.4 | 80.1 | 76.4 | 64.6 |
| Int8 | 73.5 | 80.1 | 73.5 | 62.2 |
| Int4 | 73.4 | 80.1 | 75.3 | 61.6 |

ๆŽจ็†้€ŸๅบฆๅŠๆ˜พๅญ˜ไฝฟ็”จ (Inference Speed & GPU Memory Usage)

ๆˆ‘ไปฌๆต‹็ฎ—ไบ†ไธๅŒ็ฒพๅบฆๆจกๅž‹ใ€ไธๅŒFlashAttnๅบ“็‰ˆๆœฌใ€ไปฅๅŠๆ˜ฏๅฆไฝฟ็”จvLLM็š„ๆƒ…ๅ†ตไธ‹๏ผŒๆจกๅž‹ๅœจไธๅŒ่พ“ๅ…ฅ้•ฟๅบฆไธ‹็”Ÿๆˆ2048่ฏ็š„ๅนณๅ‡ๆŽจ็†้€ŸๅบฆไปฅๅŠๆ˜พๅญ˜ไฝฟ็”จใ€‚

We measured the average inference speed and GPU memory usage of generating 2048 tokens across several settings, including input lengths, quantization levels, versions of flash-attention, and whether vLLM is used.

| Quantization | Setting | # of A100-80G GPUs | Context Length | Generation Length | Speed (Tokens/s) | Total GPU Memory Usage |
| --- | --- | --- | --- | --- | --- | --- |
| BF16 | HF + FlashAttn-v2 | 2 | 1 | 2048 | 8.48 | 144.69GB |
| BF16 | HF + FlashAttn-v1 | 2 | 1 | 2048 | 8.31 | 144.69GB |
| BF16 | HF + No FlashAttn | 2 | 1 | 2048 | 7.89 | 144.69GB |
| BF16 | vLLM | 2 | 1 | 2048 | 17.60 | Pre-Allocated* |
| BF16 | vLLM | 4 | 1 | 2048 | 26.16 | Pre-Allocated* |
| BF16 | HF + FlashAttn-v2 | 4 | 6144 | 2048 | 5.37 | 181.47GB |
| BF16 | HF + FlashAttn-v1 | 4 | 6144 | 2048 | 4.96 | 181.47GB |
| BF16 | HF + No FlashAttn | 4 | 6144 | 2048 | 4.72 | 202.74GB |
| BF16 | vLLM | 4 | 6144 | 2048 | 24.41 | Pre-Allocated* |
| BF16 | vLLM | 4 | 14336 | 2048 | 21.24 | Pre-Allocated* |
| BF16 | vLLM | 4 | 30720 | 2048 | 17.55 | Pre-Allocated* |
| Int8 | HF + FlashAttn-v2 | 2 | 1 | 2048 | 9.05 | 81.27GB |
| Int8 | HF + FlashAttn-v1 | 2 | 1 | 2048 | 8.97 | 81.27GB |
| Int8 | HF + No FlashAttn | 2 | 1 | 2048 | 8.32 | 81.27GB |
| Int8 | HF + FlashAttn-v2 | 3 | 6144 | 2048 | 5.76 | 118.06GB |
| Int8 | HF + FlashAttn-v1 | 3 | 6144 | 2048 | 5.72 | 118.06GB |
| Int8 | HF + No FlashAttn | 2 | 6144 | 2048 | 4.50 | 129.83GB |
| Int8 | HF + FlashAttn-v2 | 4 | 14336 | 2048 | 3.44 | 180.44GB |
| Int8 | HF + FlashAttn-v1 | 4 | 14336 | 2048 | 3.19 | 180.44GB |
| Int8 | HF + No FlashAttn | 4 | 14336 | 2048 | OOM | OOM |
| Int4 | HF + FlashAttn-v2 | 1 | 1 | 2048 | 11.67 | 48.86GB |
| Int4 | HF + FlashAttn-v1 | 1 | 1 | 2048 | 11.27 | 48.86GB |
| Int4 | HF + No FlashAttn | 1 | 1 | 2048 | 11.32 | 48.86GB |
| Int4 | vLLM | 1 | 1 | 2048 | 14.63 | Pre-Allocated* |
| Int4 | vLLM | 2 | 1 | 2048 | 20.76 | Pre-Allocated* |
| Int4 | vLLM | 4 | 1 | 2048 | 27.19 | Pre-Allocated* |
| Int4 | HF + FlashAttn-v2 | 2 | 6144 | 2048 | 6.75 | 85.99GB |
| Int4 | HF + FlashAttn-v1 | 2 | 6144 | 2048 | 6.32 | 85.99GB |
| Int4 | HF + No FlashAttn | 2 | 6144 | 2048 | 5.97 | 88.30GB |
| Int4 | vLLM | 2 | 6144 | 2048 | 18.07 | Pre-Allocated* |
| Int4 | vLLM | 4 | 6144 | 2048 | 24.56 | Pre-Allocated* |
| Int4 | HF + FlashAttn-v2 | 3 | 14336 | 2048 | 4.18 | 148.73GB |
| Int4 | HF + FlashAttn-v1 | 3 | 14336 | 2048 | 3.72 | 148.73GB |
| Int4 | HF + No FlashAttn | 3 | 14336 | 2048 | OOM | OOM |
| Int4 | vLLM | 2 | 14336 | 2048 | 14.51 | Pre-Allocated* |
| Int4 | vLLM | 4 | 14336 | 2048 | 19.28 | Pre-Allocated* |
| Int4 | vLLM | 4 | 30720 | 2048 | 16.93 | Pre-Allocated* |

* vLLMไผšๆๅ‰้ข„ๅˆ†้…ๆ˜พๅญ˜๏ผŒๅ› ๆญคๆ— ๆณ•ๆŽขๆต‹ๆœ€ๅคงๆ˜พๅญ˜ไฝฟ็”จๆƒ…ๅ†ตใ€‚HFๆ˜ฏๆŒ‡ไฝฟ็”จHuggingface Transformersๅบ“่ฟ›่กŒๆŽจ็†ใ€‚

* vLLM pre-allocates GPU memory, so we cannot detect the maximum usage. HF refers to using the Huggingface Transformers library for inference.

HuggingFace Transformers็š„ๆ€ง่ƒฝๆต‹็ฎ—ไฝฟ็”จๆญค่„šๆœฌๅฎŒๆˆใ€‚่ฏ„ๆต‹ไฝฟ็”จA100-SXM4-80G GPU๏ผŒไฝฟ็”จPyTorch 2.0.1 (Huggingface Transformers) / PyTorch 2.1.0 (vLLM)ๅ’ŒCUDA 11.8ใ€‚

The speed and memory profiling of HuggingFace Transformers are conducted using this script. The profiling runs on A100-SXM4-80G GPUs with PyTorch 2.0.1 (for Huggingface Transformers) / PyTorch 2.1.0 (for vLLM) and CUDA 11.8.

ๆจกๅž‹็ป†่Š‚๏ผˆModel๏ผ‰

ไธŽQwen-72B้ข„่ฎญ็ปƒๆจกๅž‹็›ธๅŒ๏ผŒQwen-72B-Chatๆจกๅž‹่ง„ๆจกๅŸบๆœฌๆƒ…ๅ†ตๅฆ‚ไธ‹ๆ‰€็คบ

The details of the model architecture of Qwen-72B-Chat are listed as follows

| Hyperparameter | Value |
| --- | --- |
| n_layers | 80 |
| n_heads | 64 |
| d_model | 8192 |
| vocab size | 151851 |
| sequence length | 32768 |

ๅœจไฝ็ฝฎ็ผ–็ ใ€FFNๆฟ€ๆดปๅ‡ฝๆ•ฐๅ’Œnormalization็š„ๅฎž็Žฐๆ–นๅผไธŠ๏ผŒๆˆ‘ไปฌไนŸ้‡‡็”จไบ†็›ฎๅ‰ๆœ€ๆต่กŒ็š„ๅšๆณ•๏ผŒ ๅณRoPE็›ธๅฏนไฝ็ฝฎ็ผ–็ ใ€SwiGLUๆฟ€ๆดปๅ‡ฝๆ•ฐใ€RMSNorm๏ผˆๅฏ้€‰ๅฎ‰่ฃ…flash-attentionๅŠ ้€Ÿ๏ผ‰ใ€‚

ๅœจๅˆ†่ฏๅ™จๆ–น้ข๏ผŒ็›ธๆฏ”็›ฎๅ‰ไธปๆตๅผ€ๆบๆจกๅž‹ไปฅไธญ่‹ฑ่ฏ่กจไธบไธป๏ผŒQwen-72B-Chatไฝฟ็”จไบ†็บฆ15ไธ‡tokenๅคงๅฐ็š„่ฏ่กจใ€‚ ่ฏฅ่ฏ่กจๅœจGPT-4ไฝฟ็”จ็š„BPE่ฏ่กจcl100k_baseๅŸบ็ก€ไธŠ๏ผŒๅฏนไธญๆ–‡ใ€ๅคš่ฏญ่จ€่ฟ›่กŒไบ†ไผ˜ๅŒ–๏ผŒๅœจๅฏนไธญใ€่‹ฑใ€ไปฃ็ ๆ•ฐๆฎ็š„้ซ˜ๆ•ˆ็ผ–่งฃ็ ็š„ๅŸบ็ก€ไธŠ๏ผŒๅฏน้ƒจๅˆ†ๅคš่ฏญ่จ€ๆ›ดๅŠ ๅ‹ๅฅฝ๏ผŒๆ–นไพฟ็”จๆˆทๅœจไธๆ‰ฉๅฑ•่ฏ่กจ็š„ๆƒ…ๅ†ตไธ‹ๅฏน้ƒจๅˆ†่ฏญ็ง่ฟ›่กŒ่ƒฝๅŠ›ๅขžๅผบใ€‚ ่ฏ่กจๅฏนๆ•ฐๅญ—ๆŒ‰ๅ•ไธชๆ•ฐๅญ—ไฝๅˆ‡ๅˆ†ใ€‚่ฐƒ็”จ่พƒไธบ้ซ˜ๆ•ˆ็š„tiktokenๅˆ†่ฏๅบ“่ฟ›่กŒๅˆ†่ฏใ€‚

For position encoding, FFN activation function, and normalization calculation methods, we adopt the prevalent practices, i.e., RoPE relative position encoding, SwiGLU for activation function, and RMSNorm for normalization (optional installation of flash-attention for acceleration).
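
ไธ‹้ขๆ˜ฏไธ€ไธชไป…ไพ›็คบๆ„็š„PyTorchๅฎž็Žฐ่‰ๅ›พ๏ผŒๅฑ•็คบRMSNormๅ’ŒSwiGLUๅ‰้ฆˆๅฑ‚็š„่ฎก็ฎ—ๆ–นๅผ๏ผˆ้žๆจกๅž‹ๅฎž้™…ไปฃ็ ๏ผ‰ใ€‚

To make these choices concrete, below is an illustrative PyTorch sketch of RMSNorm and a SwiGLU feed-forward block. This is a toy re-implementation for exposition only, not the model's actual code, and the dimensions are placeholders:

import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """Normalize by the root mean square of the activations (no mean subtraction)."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * x * rms

class SwiGLU(nn.Module):
    """Gated FFN: down-project SiLU(x W_gate) * (x W_up)."""
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_ff, bias=False)
        self.w_up = nn.Linear(d_model, d_ff, bias=False)
        self.w_down = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))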

For tokenization, compared to the current mainstream open-source models based on Chinese and English vocabularies, Qwen-72B-Chat uses a vocabulary of over 150K tokens. Building on the cl100k_base BPE vocabulary used by GPT-4, it is optimized for Chinese and multilingual text: in addition to encoding Chinese, English, and code data efficiently, it is also friendlier to many other languages, enabling users to directly enhance the capability for some languages without expanding the vocabulary. It segments numbers into single digits and calls the efficient tiktoken library for tokenization.
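
ๅฏไปฅ็›ดๆŽฅ็”จๅˆ†่ฏๅ™จ่ง‚ๅฏŸๆ•ฐๅญ—ๆŒ‰ไฝๅˆ‡ๅˆ†ใ€ไธญ่‹ฑๆ–‡ๆททๅˆ็ผ–็ ็ญ‰่กŒไธบใ€‚

The digit-by-digit splitting and the shared multilingual vocabulary are easy to observe directly. A small demo, loading the tokenizer as in the Quickstart (the example strings are arbitrary):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-72B-Chat", trust_remote_code=True)

# Numbers are segmented into single digits.
print(tokenizer.tokenize("12345"))

# One ~150K-token vocabulary covers Chinese, English, and code.
print(tokenizer.tokenize("้€šไน‰ๅƒ้—ฎ is a large language model."))
print(len(tokenizer))  # vocabulary size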

่ฏ„ๆต‹ๆ•ˆๆžœ๏ผˆEvaluation๏ผ‰

ๅฏนไบŽQwen-72B-Chatๆจกๅž‹๏ผŒๆˆ‘ไปฌๅŒๆ ท่ฏ„ๆต‹ไบ†ๅธธ่ง„็š„ไธญๆ–‡็†่งฃ๏ผˆC-Eval๏ผ‰ใ€่‹ฑๆ–‡็†่งฃ๏ผˆMMLU๏ผ‰ใ€ไปฃ็ ๏ผˆHumanEval๏ผ‰ๅ’Œๆ•ฐๅญฆ๏ผˆGSM8K๏ผ‰็ญ‰ๆƒๅจไปปๅŠก๏ผŒๅŒๆ—ถๅŒ…ๅซไบ†้•ฟๅบๅˆ—ไปปๅŠก็š„่ฏ„ๆต‹็ป“ๆžœใ€‚็”ฑไบŽQwen-72B-Chatๆจกๅž‹็ป่ฟ‡ๅฏน้ฝๅŽ๏ผŒๆฟ€ๅ‘ไบ†่พƒๅผบ็š„ๅค–้ƒจ็ณป็ปŸ่ฐƒ็”จ่ƒฝๅŠ›๏ผŒๆˆ‘ไปฌ่ฟ˜่ฟ›่กŒไบ†ๅทฅๅ…ทไฝฟ็”จ่ƒฝๅŠ›ๆ–น้ข็š„่ฏ„ๆต‹ใ€‚

ๆ็คบ๏ผš็”ฑไบŽ็กฌไปถๅ’Œๆก†ๆžถ้€ ๆˆ็š„่ˆๅ…ฅ่ฏฏๅทฎ๏ผŒๅค็Žฐ็ป“ๆžœๅฆ‚ๆœ‰ๆณขๅŠจๅฑžไบŽๆญฃๅธธ็Žฐ่ฑกใ€‚

For Qwen-72B-Chat, we also evaluate the model on standard benchmarks, including Chinese understanding (C-Eval), English understanding (MMLU), code (HumanEval), and mathematics (GSM8K), as well as benchmarks for long-context understanding and tool usage.

Note: Due to rounding errors caused by hardware and framework, differences in reproduced results are possible.

ไธญๆ–‡่ฏ„ๆต‹๏ผˆChinese Evaluation๏ผ‰

C-Eval

ๅœจC-Eval้ชŒ่ฏ้›†ไธŠ๏ผŒๆˆ‘ไปฌ่ฏ„ไปทไบ†Qwen-72B-Chatๆจกๅž‹็š„0-shot & 5-shotๅ‡†็กฎ็އ

We demonstrate the 0-shot & 5-shot accuracy of Qwen-72B-Chat on C-Eval validation set

| Model | Avg. Acc. |
| --- | --- |
| LLaMA2-7B-Chat | 31.9 |
| LLaMA2-13B-Chat | 36.2 |
| LLaMA2-70B-Chat | 44.3 |
| ChatGPT3.5 | 52.5 |
| ChatGPT4 | 69.9 |
| Yi-34B-Chat (0-shot) | 77.0 |
| Yi-34B-Chat (5-shot) | 78.5 |
| Qwen-7B-Chat (original) (0-shot) | 54.2 |
| Qwen-7B-Chat (0-shot) | 59.7 |
| Qwen-7B-Chat (5-shot) | 59.3 |
| Qwen-14B-Chat (0-shot) | 69.8 |
| Qwen-14B-Chat (5-shot) | 71.7 |
| Qwen-72B-Chat (0-shot) | 80.1 |
| Qwen-72B-Chat (5-shot) | 82.9 |

C-Evalๆต‹่ฏ•้›†ไธŠ๏ผŒQwen-72B-Chatๆจกๅž‹็š„zero-shotๅ‡†็กฎ็އ็ป“ๆžœๅฆ‚ไธ‹๏ผš

The zero-shot accuracy of Qwen-72B-Chat on C-Eval testing set is provided below:

| Model | Avg. | STEM | Social Sciences | Humanities | Others |
| --- | --- | --- | --- | --- | --- |
| Qwen-7B-Chat (original) | 54.6 | 47.8 | 67.6 | 59.3 | 50.6 |
| Qwen-7B-Chat | 58.6 | 53.3 | 72.1 | 62.8 | 52.0 |
| Qwen-14B-Chat | 69.1 | 65.1 | 80.9 | 71.2 | 63.4 |
| Qwen-72B-Chat | 79.5 | 74.5 | 89.1 | 81.2 | 78.1 |

่‹ฑๆ–‡่ฏ„ๆต‹๏ผˆEnglish Evaluation๏ผ‰

MMLU

MMLU่ฏ„ๆต‹้›†ไธŠ๏ผŒQwen-72B-Chatๆจกๅž‹็š„ 0-shot & 5-shot ๅ‡†็กฎ็އๅฆ‚ไธ‹๏ผŒๆ•ˆๆžœๅŒๆ ทๅœจๅŒ็ฑปๅฏน้ฝๆจกๅž‹ไธญ่กจ็Žฐ่พƒไผ˜ใ€‚

The 0-shot & 5-shot accuracy of Qwen-72B-Chat on MMLU is provided below. Qwen-72B-Chat again ranks at the top among human-aligned models of comparable size.

| Model | Avg. Acc. |
| --- | --- |
| LLaMA2-7B-Chat | 46.2 |
| LLaMA2-13B-Chat | 54.6 |
| LLaMA2-70B-Chat | 63.8 |
| Yi-34B-Chat (0-shot) | 67.6 |
| Yi-34B-Chat (5-shot) | 73.4 |
| ChatGPT3.5 | 69.1 |
| ChatGPT4 | 83.0 |
| Qwen-7B-Chat (original) (0-shot) | 53.9 |
| Qwen-7B-Chat (0-shot) | 55.8 |
| Qwen-7B-Chat (5-shot) | 57.0 |
| Qwen-14B-Chat (0-shot) | 64.6 |
| Qwen-14B-Chat (5-shot) | 66.5 |
| Qwen-72B-Chat (0-shot) | 74.3 |
| Qwen-72B-Chat (5-shot) | 75.0 |

ไปฃ็ ่ฏ„ๆต‹๏ผˆCoding Evaluation๏ผ‰

Qwen-72B-ChatๅœจHumanEval็š„zero-shot Pass@1ๆ•ˆๆžœๅฆ‚ไธ‹

The zero-shot Pass@1 of Qwen-72B-Chat on HumanEval is demonstrated below

| Model | Pass@1 |
| --- | --- |
| LLaMA2-7B-Chat | 12.2 |
| LLaMA2-13B-Chat | 18.9 |
| LLaMA2-70B-Chat | 32.3 |
| Yi-34B-Chat | 33.5 |
| ChatGPT3.5 | 73.2 |
| ChatGPT4 | 86.6 |
| Qwen-7B-Chat (original) | 24.4 |
| Qwen-7B-Chat | 37.2 |
| Qwen-14B-Chat | 43.9 |
| Qwen-72B-Chat | 64.6 |

ๆ•ฐๅญฆ่ฏ„ๆต‹๏ผˆMathematics Evaluation๏ผ‰

ๅœจ่ฏ„ๆต‹ๆ•ฐๅญฆ่ƒฝๅŠ›็š„GSM8KไธŠ๏ผŒQwen-72B-Chat็š„ๅ‡†็กฎ็އ็ป“ๆžœๅฆ‚ไธ‹

The accuracy of Qwen-72B-Chat on GSM8K is shown below

| Model | Acc. |
| --- | --- |
| LLaMA2-7B-Chat | 26.3 |
| LLaMA2-13B-Chat | 37.1 |
| LLaMA2-70B-Chat | 59.3 |
| Yi-34B-Chat | 71.6 |
| ChatGPT3.5 | 73.2 |
| ChatGPT4 | 91.4 |
| Qwen-7B-Chat (original) (0-shot) | 41.1 |
| Qwen-7B-Chat (0-shot) | 50.3 |
| Qwen-7B-Chat (8-shot) | 54.1 |
| Qwen-14B-Chat (0-shot) | 60.1 |
| Qwen-14B-Chat (8-shot) | 59.3 |
| Qwen-72B-Chat (0-shot) | 76.4 |
| Qwen-72B-Chat (8-shot) | 75.7 |

้•ฟๅบๅˆ—่ฏ„ๆต‹๏ผˆLong-Context Understanding๏ผ‰

Qwen-72B-Chatๆ”ฏๆŒๆœ€้•ฟ32k็š„ไธŠไธ‹ๆ–‡้•ฟๅบฆ๏ผŒๅœจL-Evalๅฎข่ง‚้ข˜็š„่ฏ„ๅˆ†็ป“ๆžœๅฆ‚ไธ‹๏ผš

Qwen-72B-Chat supports context lengths of up to 32k. The scores of L-Eval (closed-ended tasks) are as follows:

| Model | Average | Coursera | GSM | QuALITY | TOEFL | CodeU | SFiction |
| --- | --- | --- | --- | --- | --- | --- | --- |
| ChatGPT-3.5-16k | 60.73 | 63.51 | 84.00 | 61.38 | 78.43 | 12.22 | 64.84 |
| Qwen-72B-Chat | 62.30 | 58.13 | 76.00 | 77.22 | 86.24 | 6.66 | 69.53 |

ๆˆ‘ไปฌ่ฟ›ไธ€ๆญฅ่ฟ›่กŒไบ†โ€œๅคงๆตทๆž้’ˆโ€ๅฎž้ชŒ๏ผˆๆƒณๆณ•ๆฅ่‡ชไบŽ@Greg Kamradt๏ผ‰๏ผŒๆต‹่ฏ•ๆจกๅž‹ๅœจไธๅŒ้•ฟๅบฆ็š„่พ“ๅ…ฅไธ‹๏ผŒๆ˜ฏๅฆ่ƒฝๆฃ€็ดขๅˆฐๆ–‡็ซ ไธๅŒไฝ็ฝฎ็š„ไฟกๆฏ๏ผŒ็ป“ๆžœๅฆ‚ไธ‹๏ผš

We further conducted the "needle in a haystack" experiment (the idea comes from @Greg Kamradt) to test whether the model can retrieve information placed at different positions within inputs of different lengths. The results are as follows:

ไปฅไธŠ็ป“ๆžœ่ฏดๆ˜Ž๏ผŒQwen-72B-Chat่ƒฝๅคŸๅ‡†็กฎๆฃ€็ดขๅˆฐ32kไปฅๅ†…็š„่พ“ๅ…ฅ้•ฟๅบฆไธญๆ”พๅœจๅ„็งไฝ็ฝฎ็š„ไฟกๆฏ๏ผŒ่ฏๆ˜Žไบ†ๅ…ถๅ…ทๆœ‰ไผ˜็ง€็š„้•ฟๆ–‡ๆœฌๅค„็†่ƒฝๅŠ›ใ€‚

The above results show that Qwen-72B-Chat can accurately retrieve information placed in various positions within an input length of 32k, proving its excellent long text understanding capabilities.

FAQ

ๅฆ‚้‡ๅˆฐ้—ฎ้ข˜๏ผŒๆ•ฌ่ฏทๆŸฅ้˜…FAQไปฅๅŠissueๅŒบ๏ผŒๅฆ‚ไปๆ— ๆณ•่งฃๅ†ณๅ†ๆไบคissueใ€‚

If you meet problems, please refer to the FAQ and search existing issues for a solution before opening a new issue.

ๅผ•็”จ (Citation)

ๅฆ‚ๆžœไฝ ่ง‰ๅพ—ๆˆ‘ไปฌ็š„ๅทฅไฝœๅฏนไฝ ๆœ‰ๅธฎๅŠฉ๏ผŒๆฌข่ฟŽๅผ•็”จ๏ผ

If you find our work helpful, feel free to cite it!

@article{qwen,
  title={Qwen Technical Report},
  author={Jinze Bai and Shuai Bai and Yunfei Chu and Zeyu Cui and Kai Dang and Xiaodong Deng and Yang Fan and Wenbin Ge and Yu Han and Fei Huang and Binyuan Hui and Luo Ji and Mei Li and Junyang Lin and Runji Lin and Dayiheng Liu and Gao Liu and Chengqiang Lu and Keming Lu and Jianxin Ma and Rui Men and Xingzhang Ren and Xuancheng Ren and Chuanqi Tan and Sinan Tan and Jianhong Tu and Peng Wang and Shijie Wang and Wei Wang and Shengguang Wu and Benfeng Xu and Jin Xu and An Yang and Hao Yang and Jian Yang and Shusheng Yang and Yang Yao and Bowen Yu and Hongyi Yuan and Zheng Yuan and Jianwei Zhang and Xingxuan Zhang and Yichang Zhang and Zhenru Zhang and Chang Zhou and Jingren Zhou and Xiaohuan Zhou and Tianhang Zhu},
  journal={arXiv preprint arXiv:2309.16609},
  year={2023}
}

ไฝฟ็”จๅ่ฎฎ๏ผˆLicense Agreement๏ผ‰

ๆˆ‘ไปฌ็š„ไปฃ็ ๅ’Œๆจกๅž‹ๆƒ้‡ๅฏนๅญฆๆœฏ็ ”็ฉถๅฎŒๅ…จๅผ€ๆ”พ๏ผŒๅนถๆ”ฏๆŒๅ•†็”จใ€‚่ฏทๆŸฅ็œ‹LICENSEไบ†่งฃๅ…ทไฝ“็š„ๅผ€ๆบๅ่ฎฎ็ป†่Š‚ใ€‚ๅฆ‚้œ€ๅ•†็”จ๏ผŒๆฌข่ฟŽๅกซๅ†™้—ฎๅท็”ณ่ฏทใ€‚

Our code and model weights are fully open for academic research and also support commercial use. Check LICENSE for more details about the license. If you intend to use the model commercially, please fill out the form to apply.

่”็ณปๆˆ‘ไปฌ๏ผˆContact Us๏ผ‰

ๅฆ‚ๆžœไฝ ๆƒณ็ป™ๆˆ‘ไปฌ็š„็ ”ๅ‘ๅ›ข้˜Ÿๅ’Œไบงๅ“ๅ›ข้˜Ÿ็•™่จ€๏ผŒๆฌข่ฟŽๅŠ ๅ…ฅๆˆ‘ไปฌ็š„ๅพฎไฟก็พคใ€้’‰้’‰็พคไปฅๅŠDiscord๏ผๅŒๆ—ถ๏ผŒไนŸๆฌข่ฟŽ้€š่ฟ‡้‚ฎไปถ๏ผˆqianwen_opensource@alibabacloud.com๏ผ‰่”็ณปๆˆ‘ไปฌใ€‚

If you would like to leave a message for our research or product teams, join our Discord or WeChat groups! You are also welcome to contact us by email at qianwen_opensource@alibabacloud.com.
