InternVL3_5-1B_GPTQ_INT4
This version of InternVL3_5-1B_GPTQ_INT4 has been converted to run on the Axera NPU using w4a16 quantization.
Compatible with Pulsar2 version: 5.1-patch1.
Please note that the model's context length is 2K tokens and the maximum prefill length is 1K tokens.
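As a quick sanity check before sending long prompts, you can count tokens with the tokenizer shipped in this repo. This is a minimal sketch assuming `transformers` is installed and the `internvl3-5_tokenizer/` directory from this repo is available; note that image embeddings also consume prefill tokens (the single-image example below uses 284 input tokens).

```python
from transformers import AutoTokenizer

# Rough pre-check that a text prompt fits the 1K-token prefill budget.
# Image embeddings added for vision inputs also count toward prefill,
# so leave headroom when an image is attached.
tokenizer = AutoTokenizer.from_pretrained("internvl3-5_tokenizer/")

prompt = "่ฏทไฝ ่ฏฆ็ปๆ่ฟฐไธ้ข่ฟๅน
ๅพ"
num_tokens = len(tokenizer(prompt)["input_ids"])
print(f"prompt tokens: {num_tokens} (prefill limit: 1024)")
```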
Conversion tool links:
If you are interested in model conversion, you can try exporting the axmodel yourself from the original repo:
https://huggingface.co/OpenGVLab/InternVL3_5-1B
How to Convert LLM from Huggingface to axmodel
Support Platform
AX650
- AX650N DEMO Board
- M4N-Dock(็ฑ่ฏๆดพPro)
- M.2 Accelerator card
AX620E
- AX620E DEMO Board
| Chip | Image encoder (448×448) | TTFT | Decode speed (w4a16) |
|---|---|---|---|
| AX650 | 364.412 ms | 883.458 ms | 28.09 tokens/sec |
| AX620E | 2358.956 ms | 3136.54 ms | 7.33 tokens/sec |
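For a rough feel of end-to-end latency, the numbers above can be combined directly. The sketch below assumes image encoding, prefill (TTFT) and decoding run strictly one after another, which is an upper-bound simplification.

```python
# Rough end-to-end estimate on AX650 for one image plus a 128-token reply,
# assuming encode, prefill and decode run back to back (upper bound).
image_encode_s = 364.412 / 1000   # image encoder 448x448, AX650
ttft_s = 883.458 / 1000           # time to first token
decode_rate = 28.09               # tokens/sec at w4a16

new_tokens = 128
total_s = image_encode_s + ttft_s + new_tokens / decode_rate
print(f"~{total_s:.1f} s for {new_tokens} generated tokens")  # ~5.8 s
```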
How to use
Download all files from this repository to the device
$ tree -L 1
.
โโโ assets
โโโ config.json
โโโ examples
โโโ gradio_demo.py
โโโ infer_axmodel.py
โโโ infer_torch.py
โโโ internvl3-5_axmodel
โโโ internvl3-5_tokenizer
โโโ README.md
โโโ utils
โโโ vit-models
6 directories, 5 files
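If you prefer to script the download instead of fetching files by hand, something like the following works. This is a sketch assuming `huggingface_hub` is installed; you can also run it on a host machine and copy the directory to the board.

```python
from huggingface_hub import snapshot_download

# Fetch every file in this repo into a local directory.
snapshot_download(
    repo_id="AXERA-TECH/InternVL3_5-1B_GPTQ_INT4",
    local_dir="InternVL3_5-1B_GPTQ_INT4",
)
```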
Install transformers
pip install transformers==4.57.1
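A quick check that the pinned version is the one actually in use:

```python
import transformers

# The demos below were validated against transformers 4.57.1.
print(transformers.__version__)
```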
Inference on an AX650 host, such as the M4N-Dock(็ฑ่ฏๆดพPro) or the AX650 DEMO Board
Interactive conversations using the C++ Gradio Demo (Updated: 2026.01.26):
Start the backend service:
./run_internvl_3-5_1b_448_ax650_api.sh
Reference log output:
root@ax650 ~/yongqiang/push_hugging_face/InternVL3_5-1B_GPTQ_INT4 # ./run_internvl_3-5_1b_448_ax650_api.sh
[I][ Init][ 135]: LLM init start
[I][ Init][ 137]: Total CMM:7915 MB
tokenizer_type = 3
3% | โโ | 1 / 31 [0.75s<23.19s, 1.34 count/s] tokenizer init ok[I][ Init][ 26]: LLaMaEmbedSelector use mmap
6% | โโโ | 2 / 31 [0.75s<11.69s, 2.65 count/s] embed_selector init ok[I][ Init][ 182]: attr.axmodel_num:28
41% | โโโโโโโโโโโโโโ | 13 / 31 [3.42s<8.15s, 3.80 count/s] init 10 axmodel ok,remain_cmm(7596 MB 45% | โโโโโโโโโโโโโโโ | 14 / 31 [3.68s<8.15s, 3.80 count/s] init 11 axmodel ok,remain_cmm(7567 MB 48% | โโโโโโโโโโโโโโโโ | 15 / 31 [3.95s<8.16s, 3.80 count/s] init 12 axmodel ok,remain_cmm(7538 MB 51% | โโโโโโโโโโโโโโโโโ | 16 / 31 [4.19s<8.13s, 3.81 count/s] init 13 axmodel ok,remain_cmm(7509 MB 54% | โโโโโโโโโโโโโโโโโโ | 17 / 31 [4.45s<8.12s, 3.82 count/s] init 14 axmodel ok,remain_cmm(7480 MB 58% | โโโโโโโโโโโโโโโโโโโ | 18 / 31 [4.70s<8.10s, 3.83 count/s] init 15 axmodel ok,remain_cmm(7451 MB 61% | โโโโโโโโโโโโโโโโโโโโ | 19 / 31 [5.05s<8.25s, 3.76 count/s] init 16 axmodel ok,remain_cmm(7422 MB 64% | โโโโโโโโโโโโโโโโโโโโโ | 20 / 31 [5.30s<8.22s, 3.77 count/s] init 17 axmodel ok,remain_cmm(7393 MB 67% | โโโโโโโโโโโโโโโโโโโโโโ | 21 / 31 [5.56s<8.21s, 3.78 count/s] init 18 axmodel ok,remain_cmm(7364 MB 70% | โโโโโโโโโโโโโโโโโโโโโโโ | 22 / 31 [5.81s<8.19s, 3.79 count/s] init 19 axmodel ok,remain_cmm(7335 MB 74% | โโโโโโโโโโโโโโโโโโโโโโโโ | 23 / 31 [6.06s<8.17s, 3.79 count/s] init 20 axmodel ok,remain_cmm(7306 MB 77% | โโโโโโโโโโโโโโโโโโโโโโโโโ | 24 / 31 [6.32s<8.16s, 3.80 count/s] init 21 axmodel ok,remain_cmm(7277 MB 80% | โโโโโโโโโโโโโโโโโโโโโโโโโโ | 25 / 31 [6.59s<8.17s, 3.79 count/s] init 22 axmodel ok,remain_cmm(7248 MB 83% | โโโโโโโโโโโโโโโโโโโโโโโโโโโ | 26 / 31 [6.86s<8.18s, 3.79 count/s] init 23 axmodel ok,remain_cmm(7219 MB 87% | โโโโโโโโโโโโโโโโโโโโโโโโโโโโ | 27 / 31 [7.13s<8.18s, 3.79 count/s] init 24 axmodel ok,remain_cmm(7190 MB 90% | โโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | 28 / 31 [7.39s<8.18s, 3.79 count/s] init 25 axmodel ok,remain_cmm(7161 MB 93% | โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | 29 / 31 [7.67s<8.19s, 3.78 count/s] init 26 axmodel ok,remain_cmm(7132 MB 96% | โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | 30 / 31 [7.93s<8.19s, 3.78 count/s] init 27 axmodel ok,remain_cmm(7103 MB100% | โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | 31 / 31 [9.86s<9.86s, 3.14 count/s] init post axmodel ok,remain_cmm(6940 MB)[I][ Init][ 240]: image encoder feature outputs:0
103% | โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | 32 / 31 [13.95s<13.52s, 2.29 count/s] init vpm axmodel ok,remain_cmm(6588 MB)[I][ Init][ 280]: image encoder input nhwc@uint8
[I][ Init][ 305]: image encoder output float32
[I][ Init][ 335]: max_token_len : 2047
[I][ Init][ 340]: kv_cache_size : 1024, kv_cache_num: 2047
[I][ Init][ 348]: prefill_token_num : 128
[I][ Init][ 352]: grp: 1, prefill_max_token_num : 1
[I][ Init][ 352]: grp: 2, prefill_max_token_num : 128
[I][ Init][ 352]: grp: 3, prefill_max_token_num : 256
[I][ Init][ 352]: grp: 4, prefill_max_token_num : 384
[I][ Init][ 352]: grp: 5, prefill_max_token_num : 512
[I][ Init][ 352]: grp: 6, prefill_max_token_num : 640
[I][ Init][ 352]: grp: 7, prefill_max_token_num : 768
[I][ Init][ 352]: grp: 8, prefill_max_token_num : 896
[I][ Init][ 352]: grp: 9, prefill_max_token_num : 1024
[I][ Init][ 356]: prefill_max_token_num : 1024
[I][ load_config][ 281]: load config:
{
"enable_repetition_penalty": true,
"enable_temperature": true,
"enable_top_k_sampling": true,
"enable_top_p_sampling": false,
"penalty_window": 30,
"repetition_penalty": 1.2,
"temperature": 0.7,
"top_k": 10,
"top_p": 0.9
}
[I][ Init][ 373]: LLM init ok
[I][ Init][ 375]: Left CMM:6588 MB
Server running on port 8000...
Run the Gradio frontend:
python3 gradio_demo_cpp_backend.py
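Before launching the frontend, you can confirm the backend is actually listening. This is a minimal check assuming the default port 8000 printed in the log above and that the frontend runs on the same board as the backend.

```python
import socket

# Probe the C++ backend started by run_internvl_3-5_1b_448_ax650_api.sh.
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
    s.settimeout(2.0)
    ok = s.connect_ex(("127.0.0.1", 8000)) == 0
print("backend reachable" if ok else "backend not reachable on port 8000")
```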
Interactive conversations using the C++ Demo:
./run_internvl_3-5_1b_448_ax650.sh
The log information is as follows:
root@ax650 ~/yongqiang/push_hugging_face/InternVL3_5-1B_GPTQ_INT4 # ./run_internvl_3-5_1b_448_ax650.sh
[I][ Init][ 135]: LLM init start
[I][ Init][ 137]: Total CMM:7915 MB
tokenizer_type = 3
3% | โโ | 1 / 31 [0.71s<21.92s, 1.41 count/s] tokenizer init ok[I][ Init][ 26]: LLaMaEmbedSelector use mmap
6% | โโโ | 2 / 31 [0.71s<11.05s, 2.81 count/s] embed_selector init ok[I][ Init][ 182]: attr.axmodel_num:28
100% | โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | 31 / 31 [2.06s<2.06s, 15.03 count/s] init post axmodel ok,remain_cmm(6940 MB)[I][ Init][ 240]: image encoder feature outputs:0
103% | โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | 32 / 31 [2.32s<2.25s, 13.79 count/s] init vpm axmodel ok,remain_cmm(6588 MB)[I][ Init][ 280]: image encoder input nhwc@uint8
[I][ Init][ 305]: image encoder output float32
[I][ Init][ 335]: max_token_len : 2047
[I][ Init][ 340]: kv_cache_size : 1024, kv_cache_num: 2047
[I][ Init][ 348]: prefill_token_num : 128
[I][ Init][ 352]: grp: 1, prefill_max_token_num : 1
[I][ Init][ 352]: grp: 2, prefill_max_token_num : 128
[I][ Init][ 352]: grp: 3, prefill_max_token_num : 256
[I][ Init][ 352]: grp: 4, prefill_max_token_num : 384
[I][ Init][ 352]: grp: 5, prefill_max_token_num : 512
[I][ Init][ 352]: grp: 6, prefill_max_token_num : 640
[I][ Init][ 352]: grp: 7, prefill_max_token_num : 768
[I][ Init][ 352]: grp: 8, prefill_max_token_num : 896
[I][ Init][ 352]: grp: 9, prefill_max_token_num : 1024
[I][ Init][ 356]: prefill_max_token_num : 1024
[I][ load_config][ 281]: load config:
{
"enable_repetition_penalty": true,
"enable_temperature": true,
"enable_top_k_sampling": true,
"enable_top_p_sampling": false,
"penalty_window": 30,
"repetition_penalty": 1.2,
"temperature": 0.7,
"top_k": 10,
"top_p": 0.9
}
[I][ Init][ 373]: LLM init ok
[I][ Init][ 375]: Left CMM:6588 MB
Type "q" to exit, Ctrl+c to stop current running
prompt(่พๅ
ฅq้ๅบ) >> ไป็ปไธไธไฝ ่ชๅทฑ
image(ๅ่ฝฆ้ฎ่ทณ่ฟ) >>
[I][ Run][ 713]: input token num : 21, prefill_split_num : 1
[I][ Run][ 747]: input_num_token:21
[I][ Run][ 976]: ttft: 83.79 ms
ๆ่ขซ็งฐไธบ"่ฏญ่จๆจกๅ-1.0"๏ผๆฅ่ชไธๆตทไบบๅทฅๆบ่ฝๅฎ้ชๅฎคใๆ็ๅผๅๅข้่ดๅไบไธบ็จๆทๆไพ้ซๆใๅ็กฎๅไธชๆงๅ็AIๆๅกใไฝไธบไธๆฌพๅ
่ฟ็่ช็ถ่ฏญ่จๅค็๏ผNLP๏ผๆจกๅ๏ผๆๆจๅจๅธฎๅฉ็จๆท่งฃๅณๅ็ง่ฏญ่จ็ธๅ
ณ้ฎ้ข๏ผๅนถๆไพๆ็จ็ไฟกๆฏๅๅปบ่ฎฎใๆ็่ฎพ่ฎก็ฎๆ ๆฏ่ฝๅคไปฅ่ช็ถๆต็
็ๆนๅผไธไบบ็ฑป่ฟ่กไบคไบ๏ผๆ ่ฎบๆฏๅ็ญ้ฎ้ขใๆไพๅปบ่ฎฎ่ฟๆฏๆง่กไปปๅกใ
[N][ Run][1102]: hit eos,avg 19.79 token/s
prompt(่พๅ
ฅq้ๅบ) >> ่ฏทไฝ ่ฏฆ็ปๆ่ฟฐไธ้ข่ฟๅน
ๅพ
image(ๅ่ฝฆ้ฎ่ทณ่ฟ) >> assets/image_1.jpg
[I][ EncodeImage][ 481]: image encode time : 408.467987 ms, size : 1
[I][ Encode][ 636]: input_ids size:284
[I][ Encode][ 644]: offset 15
[I][ Encode][ 673]: img_embed.size:1, 262144
[I][ Encode][ 689]: out_embed size:290816
[I][ Encode][ 690]: input_ids size 284
[I][ Encode][ 692]: position_ids size:284
[I][ Run][ 713]: input token num : 284, prefill_split_num : 3
[I][ Run][ 747]: input_num_token:128
[I][ Run][ 747]: input_num_token:128
[I][ Run][ 747]: input_num_token:28
[I][ Run][ 976]: ttft: 270.76 ms
่ฟๆฏไธๅน
็ๅจ็ๅพ็๏ผๅฑ็คบไบไธๅชๅคง็็ซๆญฃๅจ่ช็ถ็ฏๅขไธญ่ง
้ฃ็ๆ
ๆฏใ็ป้ขไธญ๏ผๅคง็็ซๆญฃไฝๅคดๅจๆค็ฉไธไธญๅฏปๆพ้ฃ็ฉใๅฎ็ๆฏๅๅ็ฝ่ฒ๏ผ่้จๅ่
น้จๆ้ป่ฒๆ็นใๅจๅด็ปฟๆ็็ถ๏ผๅ็ง็ๆจๅๆค็ฉ็ฏ็ป็ๅฎ๏ผๆพๅพ็ๆบๅๅใ่ๆฏ็ๆจ่ดจ็ปๆๅฏ่ฝๆฏไธๆ็ซน็ซฟๆ้ฟๆค
๏ผ่ฟไธๆญฅๆ็คบ่ฟๅฏ่ฝๆฏๅจ็ฉๅญๆ้็ๅจ็ฉไฟๆคๅบใๆดไธชๅบๆฏๅ
ๆปกไบ่ช็ถ็ๆฐๆฏ๏ผ่ฎฉไบบๆๅๅฐๅคง่ช็ถ็ๅฏ็ฑไธ็ๆบใ
[N][ Run][1102]: hit eos,avg 19.86 token/s
prompt(่พๅ
ฅq้ๅบ) >>
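The load_config block in the log above shows the decoding settings used by the runtime: top-k 10, temperature 0.7, repetition penalty 1.2 over a 30-token window, with top-p sampling disabled. The sketch below illustrates the standard formulation of those steps; it is only an illustration, not the runtime's actual C++ code.

```python
import numpy as np

def sample_next_token(logits, recent_tokens,
                      top_k=10, temperature=0.7, repetition_penalty=1.2,
                      rng=None):
    """Illustrative decoding step matching the config above (top-p disabled)."""
    if rng is None:
        rng = np.random.default_rng()
    logits = np.asarray(logits, dtype=np.float64).copy()

    # Repetition penalty over the recent window (penalty_window = 30):
    # push down logits of tokens that were generated recently.
    for t in set(recent_tokens[-30:]):
        logits[t] = logits[t] / repetition_penalty if logits[t] > 0 \
            else logits[t] * repetition_penalty

    # Temperature scaling, then keep only the top-k candidates and sample.
    logits /= temperature
    top = np.argpartition(logits, -top_k)[-top_k:]
    probs = np.exp(logits[top] - logits[top].max())
    probs /= probs.sum()
    return int(rng.choice(top, p=probs))
```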
Interactive conversations using the Python Gradio demo:
$ python3 gradio_demo.py --hf_model internvl3-5_tokenizer/ --axmodel_path internvl3-5_axmodel/ --vit_model vit-models/internvl_vit_model_1x3x448x448.axmodel
Plain text dialogue:
Image understanding:
Run the following command on the Axera board to start a chat conversation:
$ python3 infer_axmodel.py --hf_model internvl3-5_tokenizer/ --axmodel_path internvl3-5_axmodel/ --question "่ฏท่ฎก็ฎๅฝๆฐ[y=2x^2+2]็ๅฏผๆฐ, ๅนถๆไพ markdown ๆ ผๅผ็ๆจ็่ฟ็จ"
output:
[INFO] Using provider: AxEngineExecutionProvider
[INFO] Model type: 2 (triple core)
[INFO] Compiler version: 5.1-dirty 0fdbfe15-dirty
Model loaded successfully!
slice_indices: [0]
Slice prefill done: 0
answer >> ๅฝๆฐ \( y = 2x^2 + 2 \) ็ๅฏผๆฐๅฏไปฅ้่ฟๆฑๅฏผๆณๅๆฅ่ฎก็ฎใ้ฆๅ
๏ผๆไปฌๅฏนๅฝๆฐไธญ็ๆฏไธ้กนๅๅซๆฑๅฏผ๏ผ
1. ๅฏนไบ \( 2x^2 \)๏ผไฝฟ็จๅนๆณๅๆฑๅฏผ๏ผ
\[
\frac{d}{dx}(2x^2) = 2 \cdot 2x = 4x
\]
2. ๅฏนไบๅธธๆฐ้กน \( 2 \)๏ผๅ
ถๅฏผๆฐไธบ 0๏ผๅ ไธบๅธธๆฐ็ๅฏผๆฐไธบ 0ใ
ๅฐ่ฟไธค้จๅ็็ปๆ็ธๅ ๏ผๅพๅฐๅฝๆฐ \( y \) ็ๅฏผๆฐ๏ผ
\[
y' = 4x
\]
ๅ ๆญค๏ผๅฝๆฐ \( y = 2x^2 + 2 \) ็ๅฏผๆฐไธบ \( y' = 4x \)ใ
Enter the following command to perform the single-image understanding task:
$ python3 infer_axmodel.py --hf_model internvl3-5_tokenizer/ --axmodel_path internvl3-5_axmodel/ --question "่ฏทๆ่ฟฐ่ฟๅน
ๅพ" -i examples/image_0.jpg --vit_model vit-models/internvl_vit_model_1x3x448x448.axmodel
output:
[INFO] Model type: 2 (triple core)
[INFO] Compiler version: 5.1-dirty 0fdbfe15-dirty
Model loaded successfully!
slice_indices: [0, 1, 2]
Slice prefill done: 0
Slice prefill done: 1
Slice prefill done: 2
answer >> ่ฟๆฏไธๅผ ็บข็็ซ็็
ง็ใ็บข็็ซๆฏไธ็ง็บขๆฃ่ฒ็ๅบไนณๅจ็ฉ๏ผ้ๅธธ็ๆดปๅจไบๆดฒ็ๆฃฎๆไธญใๅฎไปฌไปฅๆ้ฃๆ่ซๅๅฐๅๆ ่ๆคๅจ็ฉไธบ็ใๅพ็ไธญ๏ผ็บข็็ซๆญฃๅๅจไธไธชๆจๅถ็ๅนณๅฐไธ๏ผ่ๆฏๆฏ็ปฟ่ฒ็ๆ ๆจๅๆค่ขซ๏ผๆพๅพ้ๅธธ่ช็ถๅ็ๅจใ็บข็็ซ็่กจๆ
็่ตทๆฅๅพๅๅฅฝ๏ผไผผไนๅจ่งๅฏๆ็ญๅพ
ไปไนใ
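For reference, the runtime log reports "image encoder input nhwc@uint8" and the ViT model file is vit-models/internvl_vit_model_1x3x448x448.axmodel, i.e. the encoder consumes a single 448×448 RGB image. infer_axmodel.py handles this internally; the sketch below (assuming Pillow and numpy) only shows what that input tensor looks like, and the exact resize/crop strategy may differ.

```python
import numpy as np
from PIL import Image

# Build a 1x448x448x3 uint8 tensor of the kind the image encoder expects.
img = Image.open("examples/image_0.jpg").convert("RGB").resize((448, 448))
pixels = np.asarray(img, dtype=np.uint8)[np.newaxis, ...]  # NHWC layout
print(pixels.shape, pixels.dtype)  # (1, 448, 448, 3) uint8
```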