GGUF Update (17 Feb)

#17
by allkhor - opened

I noticed that the GGUF files have been recently updated, and I’d like to sincerely thank the team for the excellent work and ongoing efforts!
Could you kindly clarify the reason for this update? Specifically, is it recommended to re-download the files, or are the changes minor?

Thanks again for your dedication and support!

I'm just downloading it now and will test it myself. llama.cpp recently merged a couple of updates and I'm not sure what they changed in the GGUFs. I'll let you know if I see better results.

Update:

Just tested it. Responses are somewhat improved, but I'm seeing slightly reduced tokens/s.

Unsloth AI org

There was an issue where LM Studio wasn't detecting the Q6 or Q8 quants, so we re-uploaded them. Unfortunately, the issue still isn't solved.

The new uploads are slightly larger and more accurate.

I guess now I've learned the downside of loading models with the -hf argument rather than --model... re-downloading the model every day or two because a new version is posted :\

Hey! 👋
Thanks for the update — really appreciate all the work you’re doing.

Could you please add a super short changelog with future GGUF updates? Even just a line like “fixed LM Studio detection” or “refined quant weights” would be super helpful — helps users know if they really need to redownload, or just why things changed.
Totally understand if it’s extra work — just thought it might save a few back-and-forth threads like this one. 🙏
Thanks again!

ExecStart=/home/wer/src/llama.cpp/build/bin/llama-server -hf unsloth/Qwen3-Coder-Next-GGUF:Q6_K --temp 1.0 --repeat-penalty 1.0 --top-p .95 -ngl 99 --ctx-size 200000 --host 0.0.0.0 --port 39281 --threads 16 --alias qwen-30b -v --jinja --chat-template-kwargs '{"enable_thinking": false}'

This runs very slowly at larger contexts, and I saw 500 errors at one point. Compared to what I had previously, it feels really bad: before templating kicks in I'm at around 65k context, and I know I was comfortably much higher the day before this update downloaded. My download was accidental :/ and I have done this to myself twice now. I don't know if my week-old build of llama.cpp is the issue; I'm just dropping an opinion here, I don't have metrics. I have learned my lesson and will never let llama.cpp download models automatically again (is the lie I tell myself).

Feb 19 22:04:17 blender llama-server[6647]: srv log_server_r: response: {"error":{"code":500,"message":"Context size has been exceeded.","type":"server_>
Feb 19 22:04:17 blender llama-server[6647]: srv send_error: task id = 15786, error: Context size has been exceeded.
Feb 19 22:04:17 blender llama-server[6647]: res send: sending result for task id = 15786
Feb 19 22:04:17 blender llama-server[6647]: res send: task id = 15786 pushed to result queue
Feb 19 22:04:17 blender llama-server[6647]: slot release: id 3 | task 15786 | stop processing: n_tokens = 100861, truncated = 0

did I do it wrong? or am I lying? or is my hardware melting :)

You can enable context quantization by adding these flags: -ctk q8_0 -ctv q8_0
This will reduce VRAM and RAM usage. You can also remove the -ngl flag and replace it with --fit-ctx 200000
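Putting those suggestions together, the earlier ExecStart command would look roughly like this. This is a sketch only: the -ctk/-ctv quantization flags and --fit-ctx depend on how recent your llama.cpp build is, so verify each flag against llama-server --help before relying on it.

```shell
# Sketch of the suggested changes to the earlier invocation; flag
# availability varies across llama.cpp builds, check --help first.
llama-server \
  -hf unsloth/Qwen3-Coder-Next-GGUF:Q6_K \
  --temp 1.0 --repeat-penalty 1.0 --top-p 0.95 \
  --ctx-size 200000 \
  --fit-ctx 200000 \
  -ctk q8_0 -ctv q8_0 \
  --host 0.0.0.0 --port 39281 --threads 16 \
  --jinja --chat-template-kwargs '{"enable_thinking": false}'
```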

You can enable context quantization by adding these flags: -ctk q8_0 -ctv q8_0
This will reduce VRAM and RAM usage. You can also remove the -ngl flag and replace it with --fit-ctx 200000
Thanks WinPooh32, I'll give that a try. I never actually ran out of VRAM, FWIW, but my KV cache was somehow being invalidated more often than expected, so perhaps this will help.

Cache invalidation can be an issue in llama.cpp, so try updating to the latest version. I have no cache issues on build 8123 (f75c4e8bf): even when the last checkpoint is reached, no cache invalidation happens.

example log after fixes:

...
slot update_slots: id  0 | task 45523 | erasing old context checkpoint (pos_min = 125959, pos_max = 125959, size = 75.376 MiB)
slot update_slots: id  0 | task 45523 | created context checkpoint 4 of 4 (pos_min = 127409, pos_max = 127409, size = 75.376 MiB)
...
slot update_slots: id  0 | task 45580 | erasing old context checkpoint (pos_min = 126550, pos_max = 126550, size = 75.376 MiB)
slot update_slots: id  0 | task 45580 | created context checkpoint 4 of 4 (pos_min = 129009, pos_max = 129009, size = 75.376 MiB)
...

Yeah, I pulled last night. And I always chase my own code first, thinking I did something weird to my context and it's my fault.

I don't really get it. It's blowing away my KV cache often once I get to around 100k context, so responses are slow. But not always. This started happening after I accidentally downloaded this latest model, after a restart.

Anyway, thanks for the suggestions. A recompile of the latest build plus your suggestions hasn't changed performance for me compared to my perceived performance of the old model. I may grab the old one and see if I'm stable again.

I'm adding --ctx-checkpoints 32 --cache-ram -1 --cache-reuse 256
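For context, those flags would slot into the server invocation like this. This is a hedged sketch: these are recent llama.cpp options and the exact names and defaults may differ on your build, so confirm with llama-server --help.

```shell
# Hypothetical placement of the cache-related flags mentioned above.
# --ctx-checkpoints : max context checkpoints kept per slot
# --cache-ram       : host-RAM cache limit (-1 = unlimited, per the post above)
# --cache-reuse     : min chunk size (tokens) eligible for prefix-cache reuse
llama-server \
  -hf unsloth/Qwen3-Coder-Next-GGUF:Q6_K \
  --ctx-size 200000 \
  --ctx-checkpoints 32 \
  --cache-ram -1 \
  --cache-reuse 256
```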

I didn't see any checkpoints like you had. Maybe that's my issue, i.e. I don't know what I'm doing :)

The cache only works when the prefix part of the prompt is unchanged between requests. If a client changes something in the middle of the prompt, it causes a cache miss.
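A toy illustration of that rule (not llama.cpp internals, which match at token granularity rather than character granularity): cache reuse is bounded by the longest unchanged leading prefix, so an edit early in the prompt discards everything cached after that point, while appending to the end keeps the whole cache valid.

```shell
# Prints how many leading characters two strings share; stands in for the
# "longest unchanged prefix" that prompt caching can reuse.
common_prefix_len() {
  local a="$1" b="$2" i=0
  while [ "$i" -lt "${#a}" ] && [ "$i" -lt "${#b}" ] \
        && [ "${a:$i:1}" = "${b:$i:1}" ]; do
    i=$((i + 1))
  done
  echo "$i"
}

# Appending at the end: almost the whole previous prompt is reusable.
common_prefix_len "SYSTEM: be terse. USER: hi" "SYSTEM: be terse. USER: hello"  # 25

# Editing in the middle: only the short prefix before the edit is reusable.
common_prefix_len "SYSTEM: be terse. USER: hi" "SYSTEM: be brief. USER: hi"     # 11
```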

Yeah, I manage that prefix/context myself, which is why I reached out when I noticed the cache invalidation. It IS still happening for me with the latest llama.cpp on this model. I don't go incredibly deep into the guts of llama.cpp or GGUF, but I'm pretty certain my context is stable up until I start trimming or getting fancier. I may revert to the previous push of this model for science; I've liked it.

https://github.com/ggml-org/llama.cpp/issues/19901

This cleared it up. I had an even worse issue where the cache was blown after ~16k tokens on that new Qwen3.5 122B model, and manually applying this unconfirmed fix totally cleared it up. Super fast now.
Also of note (on that 3.5 model): I could no longer do prompt reinforcement? But that's another thing, and I just removed it, I guess :)

Anyway, that fix seems to have fixed this model too, so I'm mentioning it here.
