InternVL3_5-1B_GPTQ_INT4
This version of InternVL3_5-1B_GPTQ_INT4 has been converted to run on the Axera NPU using w4a16 quantization.
Compatible with Pulsar2 version: 5.1-patch1.
Please note that the model's context length is 2K tokens and the maximum prefill length is 1K tokens.
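As a quick sanity check before sending long prompts, you can count tokens with the tokenizer shipped in this repo. This is a minimal sketch assuming `transformers` is installed and the `internvl3-5_tokenizer/` directory from this repo is available; note that image embeddings also consume prefill tokens (the single-image example below uses 284 input tokens).

```python
from transformers import AutoTokenizer

# Rough pre-check that a text prompt fits the 1K-token prefill budget.
# Image embeddings added for vision inputs also count toward prefill,
# so leave headroom when an image is attached.
tokenizer = AutoTokenizer.from_pretrained("internvl3-5_tokenizer/")

prompt = "่ฏทไฝ ่ฏฆ็ปๆ่ฟฐไธ้ข่ฟๅน
ๅพ"
num_tokens = len(tokenizer(prompt)["input_ids"])
print(f"prompt tokens: {num_tokens} (prefill limit: 1024)")
```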
Conversion tool links:
If you are interested in model conversion, you can try exporting the axmodel yourself from the original repo:
https://huggingface.co/OpenGVLab/InternVL3_5-1B
How to Convert LLM from Huggingface to axmodel
Support Platform
AX650
- AX650N DEMO Board
- M4N-Dock(็ฑ่ฏๆดพPro)
- M.2 Accelerator card
AX620E
- AX620E DEMO Board
| Chip | Image encoder (448×448) | TTFT | Decode speed (w4a16) |
|---|---|---|---|
| AX650 | 364.412 ms | 883.458 ms | 28.09 tokens/sec |
| AX620E | 2358.956 ms | 3136.54 ms | 7.33 tokens/sec |
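For a rough feel of end-to-end latency, the numbers above can be combined directly. The sketch below assumes image encoding, prefill (TTFT) and decoding run strictly one after another, which is an upper-bound simplification.

```python
# Rough end-to-end estimate on AX650 for one image plus a 128-token reply,
# assuming encode, prefill and decode run back to back (upper bound).
image_encode_s = 364.412 / 1000   # image encoder 448x448, AX650
ttft_s = 883.458 / 1000           # time to first token
decode_rate = 28.09               # tokens/sec at w4a16

new_tokens = 128
total_s = image_encode_s + ttft_s + new_tokens / decode_rate
print(f"~{total_s:.1f} s for {new_tokens} generated tokens")  # ~5.8 s
```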
How to use
Download all files from this repository to the device
$ tree -L 1
.
โโโ assets
โโโ config.json
โโโ examples
โโโ gradio_demo.py
โโโ infer_axmodel.py
โโโ infer_torch.py
โโโ internvl3-5_axmodel
โโโ internvl3-5_tokenizer
โโโ README.md
โโโ utils
โโโ vit-models
6 directories, 5 files
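If you prefer to script the download instead of fetching files by hand, something like the following works. This is a sketch assuming `huggingface_hub` is installed; you can also run it on a host machine and copy the directory to the board.

```python
from huggingface_hub import snapshot_download

# Fetch every file in this repo into a local directory.
snapshot_download(
    repo_id="AXERA-TECH/InternVL3_5-1B_GPTQ_INT4",
    local_dir="InternVL3_5-1B_GPTQ_INT4",
)
```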
Install transformers
pip install transformers==4.57.1
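A quick check that the pinned version is the one actually in use:

```python
import transformers

# The demos below were validated against transformers 4.57.1.
print(transformers.__version__)
```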
Inference on an AX650 host, such as the M4N-Dock(็ฑ่ฏๆดพPro) or the AX650 DEMO Board
Interactive conversations using the C++ Gradio Demo (Updated: 2026.01.26):
Start the backend service:
./run_internvl_3-5_1b_448_ax650_api.sh
Reference log output:
root@ax650 ~/yongqiang/push_hugging_face/InternVL3_5-1B_GPTQ_INT4 # ./run_internvl_3-5_1b_448_ax650_api.sh
[I][ Init][ 135]: LLM init start
[I][ Init][ 137]: Total CMM:7915 MB
tokenizer_type = 3
3% | โโ | 1 / 31 [0.75s<23.19s, 1.34 count/s] tokenizer init ok[I][ Init][ 26]: LLaMaEmbedSelector use mmap
6% | โโโ | 2 / 31 [0.75s<11.69s, 2.65 count/s] embed_selector init ok[I][ Init][ 182]: attr.axmodel_num:28
41% | โโโโโโโโโโโโโโ | 13 / 31 [3.42s<8.15s, 3.80 count/s] init 10 axmodel ok,remain_cmm(7596 MB 45% | โโโโโโโโโโโโโโโ | 14 / 31 [3.68s<8.15s, 3.80 count/s] init 11 axmodel ok,remain_cmm(7567 MB 48% | โโโโโโโโโโโโโโโโ | 15 / 31 [3.95s<8.16s, 3.80 count/s] init 12 axmodel ok,remain_cmm(7538 MB 51% | โโโโโโโโโโโโโโโโโ | 16 / 31 [4.19s<8.13s, 3.81 count/s] init 13 axmodel ok,remain_cmm(7509 MB 54% | โโโโโโโโโโโโโโโโโโ | 17 / 31 [4.45s<8.12s, 3.82 count/s] init 14 axmodel ok,remain_cmm(7480 MB 58% | โโโโโโโโโโโโโโโโโโโ | 18 / 31 [4.70s<8.10s, 3.83 count/s] init 15 axmodel ok,remain_cmm(7451 MB 61% | โโโโโโโโโโโโโโโโโโโโ | 19 / 31 [5.05s<8.25s, 3.76 count/s] init 16 axmodel ok,remain_cmm(7422 MB 64% | โโโโโโโโโโโโโโโโโโโโโ | 20 / 31 [5.30s<8.22s, 3.77 count/s] init 17 axmodel ok,remain_cmm(7393 MB 67% | โโโโโโโโโโโโโโโโโโโโโโ | 21 / 31 [5.56s<8.21s, 3.78 count/s] init 18 axmodel ok,remain_cmm(7364 MB 70% | โโโโโโโโโโโโโโโโโโโโโโโ | 22 / 31 [5.81s<8.19s, 3.79 count/s] init 19 axmodel ok,remain_cmm(7335 MB 74% | โโโโโโโโโโโโโโโโโโโโโโโโ | 23 / 31 [6.06s<8.17s, 3.79 count/s] init 20 axmodel ok,remain_cmm(7306 MB 77% | โโโโโโโโโโโโโโโโโโโโโโโโโ | 24 / 31 [6.32s<8.16s, 3.80 count/s] init 21 axmodel ok,remain_cmm(7277 MB 80% | โโโโโโโโโโโโโโโโโโโโโโโโโโ | 25 / 31 [6.59s<8.17s, 3.79 count/s] init 22 axmodel ok,remain_cmm(7248 MB 83% | โโโโโโโโโโโโโโโโโโโโโโโโโโโ | 26 / 31 [6.86s<8.18s, 3.79 count/s] init 23 axmodel ok,remain_cmm(7219 MB 87% | โโโโโโโโโโโโโโโโโโโโโโโโโโโโ | 27 / 31 [7.13s<8.18s, 3.79 count/s] init 24 axmodel ok,remain_cmm(7190 MB 90% | โโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | 28 / 31 [7.39s<8.18s, 3.79 count/s] init 25 axmodel ok,remain_cmm(7161 MB 93% | โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | 29 / 31 [7.67s<8.19s, 3.78 count/s] init 26 axmodel ok,remain_cmm(7132 MB 96% | โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | 30 / 31 [7.93s<8.19s, 3.78 count/s] init 27 axmodel ok,remain_cmm(7103 MB100% | โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | 31 / 31 [9.86s<9.86s, 3.14 count/s] init post axmodel ok,remain_cmm(6940 MB)[I][ Init][ 240]: image encoder feature outputs:0
103% | โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | 32 / 31 [13.95s<13.52s, 2.29 count/s] init vpm axmodel ok,remain_cmm(6588 MB)[I][ Init][ 280]: image encoder input nhwc@uint8
[I][ Init][ 305]: image encoder output float32
[I][ Init][ 335]: max_token_len : 2047
[I][ Init][ 340]: kv_cache_size : 1024, kv_cache_num: 2047
[I][ Init][ 348]: prefill_token_num : 128
[I][ Init][ 352]: grp: 1, prefill_max_token_num : 1
[I][ Init][ 352]: grp: 2, prefill_max_token_num : 128
[I][ Init][ 352]: grp: 3, prefill_max_token_num : 256
[I][ Init][ 352]: grp: 4, prefill_max_token_num : 384
[I][ Init][ 352]: grp: 5, prefill_max_token_num : 512
[I][ Init][ 352]: grp: 6, prefill_max_token_num : 640
[I][ Init][ 352]: grp: 7, prefill_max_token_num : 768
[I][ Init][ 352]: grp: 8, prefill_max_token_num : 896
[I][ Init][ 352]: grp: 9, prefill_max_token_num : 1024
[I][ Init][ 356]: prefill_max_token_num : 1024
[I][ load_config][ 281]: load config:
{
"enable_repetition_penalty": true,
"enable_temperature": true,
"enable_top_k_sampling": true,
"enable_top_p_sampling": false,
"penalty_window": 30,
"repetition_penalty": 1.2,
"temperature": 0.7,
"top_k": 10,
"top_p": 0.9
}
[I][ Init][ 373]: LLM init ok
[I][ Init][ 375]: Left CMM:6588 MB
Server running on port 8000...
Run the Gradio frontend:
python3 gradio_demo_cpp_backend.py
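Before launching the frontend, you can confirm the backend is actually listening. This is a minimal check assuming the default port 8000 printed in the log above and that the frontend runs on the same board as the backend.

```python
import socket

# Probe the C++ backend started by run_internvl_3-5_1b_448_ax650_api.sh.
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
    s.settimeout(2.0)
    ok = s.connect_ex(("127.0.0.1", 8000)) == 0
print("backend reachable" if ok else "backend not reachable on port 8000")
```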
Interactive conversations using the C++ Demo:
./run_internvl_3-5_1b_448_ax650.sh
The log information is as follows:
root@ax650 ~/yongqiang/push_hugging_face/InternVL3_5-1B_GPTQ_INT4 # ./run_internvl_3-5_1b_448_ax650.sh
[I][ Init][ 135]: LLM init start
[I][ Init][ 137]: Total CMM:7915 MB
tokenizer_type = 3
3% | โโ | 1 / 31 [0.71s<21.92s, 1.41 count/s] tokenizer init ok[I][ Init][ 26]: LLaMaEmbedSelector use mmap
6% | โโโ | 2 / 31 [0.71s<11.05s, 2.81 count/s] embed_selector init ok[I][ Init][ 182]: attr.axmodel_num:28
100% | โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | 31 / 31 [2.06s<2.06s, 15.03 count/s] init post axmodel ok,remain_cmm(6940 MB)[I][ Init][ 240]: image encoder feature outputs:0
103% | โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | 32 / 31 [2.32s<2.25s, 13.79 count/s] init vpm axmodel ok,remain_cmm(6588 MB)[I][ Init][ 280]: image encoder input nhwc@uint8
[I][ Init][ 305]: image encoder output float32
[I][ Init][ 335]: max_token_len : 2047
[I][ Init][ 340]: kv_cache_size : 1024, kv_cache_num: 2047
[I][ Init][ 348]: prefill_token_num : 128
[I][ Init][ 352]: grp: 1, prefill_max_token_num : 1
[I][ Init][ 352]: grp: 2, prefill_max_token_num : 128
[I][ Init][ 352]: grp: 3, prefill_max_token_num : 256
[I][ Init][ 352]: grp: 4, prefill_max_token_num : 384
[I][ Init][ 352]: grp: 5, prefill_max_token_num : 512
[I][ Init][ 352]: grp: 6, prefill_max_token_num : 640
[I][ Init][ 352]: grp: 7, prefill_max_token_num : 768
[I][ Init][ 352]: grp: 8, prefill_max_token_num : 896
[I][ Init][ 352]: grp: 9, prefill_max_token_num : 1024
[I][ Init][ 356]: prefill_max_token_num : 1024
[I][ load_config][ 281]: load config:
{
"enable_repetition_penalty": true,
"enable_temperature": true,
"enable_top_k_sampling": true,
"enable_top_p_sampling": false,
"penalty_window": 30,
"repetition_penalty": 1.2,
"temperature": 0.7,
"top_k": 10,
"top_p": 0.9
}
[I][ Init][ 373]: LLM init ok
[I][ Init][ 375]: Left CMM:6588 MB
Type "q" to exit, Ctrl+c to stop current running
prompt(่พๅ
ฅq้ๅบ) >> ไป็ปไธไธไฝ ่ชๅทฑ
image(ๅ่ฝฆ้ฎ่ทณ่ฟ) >>
[I][ Run][ 713]: input token num : 21, prefill_split_num : 1
[I][ Run][ 747]: input_num_token:21
[I][ Run][ 976]: ttft: 83.79 ms
ๆ่ขซ็งฐไธบ"่ฏญ่จๆจกๅ-1.0"๏ผๆฅ่ชไธๆตทไบบๅทฅๆบ่ฝๅฎ้ชๅฎคใๆ็ๅผๅๅข้่ดๅไบไธบ็จๆทๆไพ้ซๆใๅ็กฎๅไธชๆงๅ็AIๆๅกใไฝไธบไธๆฌพๅ
่ฟ็่ช็ถ่ฏญ่จๅค็๏ผNLP๏ผๆจกๅ๏ผๆๆจๅจๅธฎๅฉ็จๆท่งฃๅณๅ็ง่ฏญ่จ็ธๅ
ณ้ฎ้ข๏ผๅนถๆไพๆ็จ็ไฟกๆฏๅๅปบ่ฎฎใๆ็่ฎพ่ฎก็ฎๆ ๆฏ่ฝๅคไปฅ่ช็ถๆต็
็ๆนๅผไธไบบ็ฑป่ฟ่กไบคไบ๏ผๆ ่ฎบๆฏๅ็ญ้ฎ้ขใๆไพๅปบ่ฎฎ่ฟๆฏๆง่กไปปๅกใ
[N][ Run][1102]: hit eos,avg 19.79 token/s
prompt(่พๅ
ฅq้ๅบ) >> ่ฏทไฝ ่ฏฆ็ปๆ่ฟฐไธ้ข่ฟๅน
ๅพ
image(ๅ่ฝฆ้ฎ่ทณ่ฟ) >> assets/image_1.jpg
[I][ EncodeImage][ 481]: image encode time : 408.467987 ms, size : 1
[I][ Encode][ 636]: input_ids size:284
[I][ Encode][ 644]: offset 15
[I][ Encode][ 673]: img_embed.size:1, 262144
[I][ Encode][ 689]: out_embed size:290816
[I][ Encode][ 690]: input_ids size 284
[I][ Encode][ 692]: position_ids size:284
[I][ Run][ 713]: input token num : 284, prefill_split_num : 3
[I][ Run][ 747]: input_num_token:128
[I][ Run][ 747]: input_num_token:128
[I][ Run][ 747]: input_num_token:28
[I][ Run][ 976]: ttft: 270.76 ms
่ฟๆฏไธๅน
็ๅจ็ๅพ็๏ผๅฑ็คบไบไธๅชๅคง็็ซๆญฃๅจ่ช็ถ็ฏๅขไธญ่ง
้ฃ็ๆ
ๆฏใ็ป้ขไธญ๏ผๅคง็็ซๆญฃไฝๅคดๅจๆค็ฉไธไธญๅฏปๆพ้ฃ็ฉใๅฎ็ๆฏๅๅ็ฝ่ฒ๏ผ่้จๅ่
น้จๆ้ป่ฒๆ็นใๅจๅด็ปฟๆ็็ถ๏ผๅ็ง็ๆจๅๆค็ฉ็ฏ็ป็ๅฎ๏ผๆพๅพ็ๆบๅๅใ่ๆฏ็ๆจ่ดจ็ปๆๅฏ่ฝๆฏไธๆ็ซน็ซฟๆ้ฟๆค
๏ผ่ฟไธๆญฅๆ็คบ่ฟๅฏ่ฝๆฏๅจ็ฉๅญๆ้็ๅจ็ฉไฟๆคๅบใๆดไธชๅบๆฏๅ
ๆปกไบ่ช็ถ็ๆฐๆฏ๏ผ่ฎฉไบบๆๅๅฐๅคง่ช็ถ็ๅฏ็ฑไธ็ๆบใ
[N][ Run][1102]: hit eos,avg 19.86 token/s
prompt(่พๅ
ฅq้ๅบ) >>
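The load_config block in the log above shows the decoding settings used by the runtime: top-k 10, temperature 0.7, repetition penalty 1.2 over a 30-token window, with top-p sampling disabled. The sketch below illustrates the standard formulation of those steps; it is only an illustration, not the runtime's actual C++ code.

```python
import numpy as np

def sample_next_token(logits, recent_tokens,
                      top_k=10, temperature=0.7, repetition_penalty=1.2,
                      rng=None):
    """Illustrative decoding step matching the config above (top-p disabled)."""
    if rng is None:
        rng = np.random.default_rng()
    logits = np.asarray(logits, dtype=np.float64).copy()

    # Repetition penalty over the recent window (penalty_window = 30):
    # push down logits of tokens that were generated recently.
    for t in set(recent_tokens[-30:]):
        logits[t] = logits[t] / repetition_penalty if logits[t] > 0 \
            else logits[t] * repetition_penalty

    # Temperature scaling, then keep only the top-k candidates and sample.
    logits /= temperature
    top = np.argpartition(logits, -top_k)[-top_k:]
    probs = np.exp(logits[top] - logits[top].max())
    probs /= probs.sum()
    return int(rng.choice(top, p=probs))
```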
Interactive conversations using the Python Gradio demo:
$ python3 gradio_demo.py --hf_model internvl3-5_tokenizer/ --axmodel_path internvl3-5_axmodel/ --vit_model vit-models/internvl_vit_model_1x3x448x448.axmodel
Plain text dialogue:
Image understanding:
Run the following command on the Axera board to start a chat conversation:
$ python3 infer_axmodel.py --hf_model internvl3-5_tokenizer/ --axmodel_path internvl3-5_axmodel/ --question "่ฏท่ฎก็ฎๅฝๆฐ[y=2x^2+2]็ๅฏผๆฐ, ๅนถๆไพ markdown ๆ ผๅผ็ๆจ็่ฟ็จ"
output:
[INFO] Using provider: AxEngineExecutionProvider
[INFO] Model type: 2 (triple core)
[INFO] Compiler version: 5.1-dirty 0fdbfe15-dirty
Model loaded successfully!
slice_indices: [0]
Slice prefill done: 0
answer >> ๅฝๆฐ \( y = 2x^2 + 2 \) ็ๅฏผๆฐๅฏไปฅ้่ฟๆฑๅฏผๆณๅๆฅ่ฎก็ฎใ้ฆๅ
๏ผๆไปฌๅฏนๅฝๆฐไธญ็ๆฏไธ้กนๅๅซๆฑๅฏผ๏ผ
1. ๅฏนไบ \( 2x^2 \)๏ผไฝฟ็จๅนๆณๅๆฑๅฏผ๏ผ
\[
\frac{d}{dx}(2x^2) = 2 \cdot 2x = 4x
\]
2. ๅฏนไบๅธธๆฐ้กน \( 2 \)๏ผๅ
ถๅฏผๆฐไธบ 0๏ผๅ ไธบๅธธๆฐ็ๅฏผๆฐไธบ 0ใ
ๅฐ่ฟไธค้จๅ็็ปๆ็ธๅ ๏ผๅพๅฐๅฝๆฐ \( y \) ็ๅฏผๆฐ๏ผ
\[
y' = 4x
\]
ๅ ๆญค๏ผๅฝๆฐ \( y = 2x^2 + 2 \) ็ๅฏผๆฐไธบ \( y' = 4x \)ใ
Enter the following command to perform the single-image understanding task:
$ python3 infer_axmodel.py --hf_model internvl3-5_tokenizer/ --axmodel_path internvl3-5_axmodel/ --question "่ฏทๆ่ฟฐ่ฟๅน
ๅพ" -i examples/image_0.jpg --vit_model vit-models/internvl_vit_model_1x3x448x448.axmodel
output:
[INFO] Model type: 2 (triple core)
[INFO] Compiler version: 5.1-dirty 0fdbfe15-dirty
Model loaded successfully!
slice_indices: [0, 1, 2]
Slice prefill done: 0
Slice prefill done: 1
Slice prefill done: 2
answer >> ่ฟๆฏไธๅผ ็บข็็ซ็็
ง็ใ็บข็็ซๆฏไธ็ง็บขๆฃ่ฒ็ๅบไนณๅจ็ฉ๏ผ้ๅธธ็ๆดปๅจไบๆดฒ็ๆฃฎๆไธญใๅฎไปฌไปฅๆ้ฃๆ่ซๅๅฐๅๆ ่ๆคๅจ็ฉไธบ็ใๅพ็ไธญ๏ผ็บข็็ซๆญฃๅๅจไธไธชๆจๅถ็ๅนณๅฐไธ๏ผ่ๆฏๆฏ็ปฟ่ฒ็ๆ ๆจๅๆค่ขซ๏ผๆพๅพ้ๅธธ่ช็ถๅ็ๅจใ็บข็็ซ็่กจๆ
็่ตทๆฅๅพๅๅฅฝ๏ผไผผไนๅจ่งๅฏๆ็ญๅพ
ไปไนใ
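For reference, the runtime log reports "image encoder input nhwc@uint8" and the ViT model file is vit-models/internvl_vit_model_1x3x448x448.axmodel, i.e. the encoder consumes a single 448×448 RGB image. infer_axmodel.py handles this internally; the sketch below (assuming Pillow and numpy) only shows what that input tensor looks like, and the exact resize/crop strategy may differ.

```python
import numpy as np
from PIL import Image

# Build a 1x448x448x3 uint8 tensor of the kind the image encoder expects.
img = Image.open("examples/image_0.jpg").convert("RGB").resize((448, 448))
pixels = np.asarray(img, dtype=np.uint8)[np.newaxis, ...]  # NHWC layout
print(pixels.shape, pixels.dtype)  # (1, 448, 448, 3) uint8
```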