grimjim (Jim Lai)

replied to their post 12 days ago

Thanks! Fixed.

posted an update 13 days ago

Post

2845

I wanted to call attention to Arli Ai's success in applying my recent modifications to refusal ablation to a MoE model successfully. Nice work, @OwenArli !
ArliAI/GLM-4.5-Air-Derestricted
Ablation on a MoE model is no small thing; I expect preserving norms/magnitudes during intervention better respects routing compared to naive refusal ablation.

(I would have tagged their org earlier, but that feature seemed to be broken via "@")

ArliAI

4 replies

·

posted an update 19 days ago

Post

3261

Going forward, I will be adopting the term Magnitude-Preserving Orthogonal Ablation (MPOA) for my recent work in mitigating model damage from abliteration. The technique potentially unlocks reasoning capacity previously occupied with safety refusal processing.

For details, start here: https://huggingface.co/blog/grimjim/norm-preserving-biprojected-abliteration

Showcase results: grimjim/gemma-3-12b-it-norm-preserved-biprojected-abliterated (outperforms base instruct on UGI Leaderboard NatInt)

(The existing name, while technically accurate, was a bit of a mouthful.)

2 replies

·

replied to their post 19 days ago

The hope is that nudging probabilities breaks up longer spans of literal repetition.

posted an update 21 days ago

Post

5011

Implemented a proof of concept sampler in pure PyTorch and transformers.

Max P consists of a dynamic token filter which applies Winsorization to cap the probabilties of top tokens. Specifically, a base probability in the range of [0,1] is used to cap individual token probability; the sampler then redistributes excess proportionally.

https://github.com/jim-plus/maxp-sampler-poc

Combined with Temperature and Min P, this could represent a more intuitive way of reducing repetition in text generation.

2 replies

·

replied to their post 25 days ago

Code always keeps changing, but it appears in August transformers will try to autoconvert parts to 64-bit floating point, which can blow past a colab VRAM budget. My GPU doesn't support mxfp4, so I can't provide a definitive answer from experience.

replied to their post 25 days ago

That I inherited from the codebase I forked from for my experiments.

posted an update 2 months ago

Post

785

I've uploaded abliteration code with support for sparsification of the refusal vector. It's poorly documented, but the code should be straightforward.
https://github.com/jim-plus/llm-abliteration
The code is built atop a fork that enabled abliteration to be performed on models loaded in 4-bit or 8-bit bitsandbytes quantization. TransformerLens is not required, just plain Transformers. For those previously unaware, this opens up abliteration experimentation to more people with local VRAM limitations.

Since performing abliteration on a quant involves precision and perplexity loss, it stands to reason that a small amount of magnitude sparsification could filter out some noise and possibly even reduce the damage inflicted on latent space via ablation of the refusal vector.

There's a small but real acceleration of ablation of the refusal vector by reducing outer product operations from O(d²×n) to O(d×n), and then by pushing said computation layerwise to GPU. The code is hardcoded for CUDA acceleration currently. Normalization of the refusal vector was deferred in order to allow sparsification. In principle other behavior vector interventions could also be added and explored.

4 replies

·

posted an update 8 months ago

Post

2353

I recently have been looking at a paper titled "Why Warmup the Learning Rate? Underlying Mechanisms and Improvements", by Dayal Singh Kalra and Maissam Barkeshli, and was struck by "warmup" being analogous to simulated annealing.
https://arxiv.org/abs/2406.09405
Taking the physical analogy further, the "warmup" is a stochastic process to knock the system out of current local minima, allowing easier transition toward newer minima. It works because it reduces "fit" and therefore "friction".

posted an update 10 months ago

Post

2439

This recent paper points to an explanation for the unreasonable effectiveness of Frankenmerges: Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach (2502.05171)

Specifically, the duplication of layers in Frankenmerges serves a purpose similar to what occurs in their recurrent-depth architecture. Successful frankenmerges that operate without additional fine-tuning are able to recover or "heal" from any damage due to abrupt transitions between layer blocks. Operational replicated layer blocks can provide functional benefits grounded in latent reasoning. Frankenmerges can also result in hybrid reasoning, by splicing together the latent reasoning of different models.

Back in April 2024, I was able to duplicate a few layers in the Llama 3 8B model, turning it into a 9B model, without harming benchmarks significantly, despite any transition damage.
grimjim/llama-3-experiment-v1-9B
My informal experimentation suggested that latent reasoning circuits could occupy continguous stacks of 2-4 layers, though the result was highly sensitive to the choice of transition location between layers.

1 reply

·

posted an update 10 months ago

Post

2630

I've made yet another merge of reasoning models with incremental gains on the current Open LLM leaderboard.
open-llm-leaderboard/open_llm_leaderboard

Merging in DeepSeek R1 distillation to Llama 3.1 8B (at 10% task arithmetic weight, using the Llama 3.1 8B base model as the case rather than the instruct model) with a prior best merge resulted in a slightly lower IFEval, but a higher result in every other benchmark save for MMLU-PRO, which went down only marginally. MATH Lvl5 and GPQA went up palpably.
grimjim/DeepSauerHuatuoSkywork-R1-o1-Llama-3.1-8B

This result is currently my best Llama 3.1 8B merge result to date. The actual R1 distillation itself scored quite badly, so this would seem to be another case of unexpected formatting (reflected in IFEval) hurting the evaluation results, obscuring the strength of a model.

It is also possible to use the text generation feature of this model to generate roleplay completions. Based on informal testing, this model's bias toward problem-solving will subtly impact narration.

posted an update 10 months ago

Post

1967

A recent merge has provided another interesting result on the current Open LLM leaderboard.
open-llm-leaderboard/open_llm_leaderboard

Combining an o1 reasoning merge with VAGOsolutions's Llama-3.1 SauerkrautLM 8B Instruct model resulted in a lower IFEval, but a higher result in every other benchmark. This result is currently my best Llama 3.1 8B merge result to date.
grimjim/SauerHuatuoSkywork-o1-Llama-3.1-8B
The results suggest that defects in output format and/or output parsing may be limiting benchmark performance of various o1 models.

posted an update 11 months ago

Post

1740

I've arrived at an interesting result on the current Open LLM leaderboard.
open-llm-leaderboard/open_llm_leaderboard
After I narrowed down the filter of models to be between 8-9B parameters, my recent merge of o1 reasoning models achieved the highest MATH eval result of any Llama 3.x 8B model currently on the board, hitting 33.99%, placing 973/2795.
grimjim/HuatuoSkywork-o1-Llama-3.1-8B

Unfortunately, I need more information to evaluate the parent models used in the merge.
The Skywork/Skywork-o1-Open-Llama-3.1-8B model scored 0% on the MATH eval, which I suspect was due to output formatting that was baked too hard into the model, and placed 2168/2795; the merge achieved a significant uplift in every benchmark across the board.
Unfortunately, FreedomIntelligence/HuatuoGPT-o1-8B was not currently benched as of this post, so I am unable to assess relative benchmarks. Nevertheless, it is intriguing that an ostensibly medical o1 model appears to have resulted in a sizable MATH boost.

replied to their post 11 months ago

Example of a fixed model.
https://huggingface.co/grimjim/lemon07r_Gemma-2-Ataraxy-v4c-9B_fixed

posted an update 11 months ago

Post

2856

I'm (finally) releasing a Python script that trims excess weights in Gemma2 full-weight models that bloated by ~1B parameters due to an early mergekit bug.
https://github.com/jim-plus/Gemma2-mergekit-remediation

I'd noticed something was off when merges of Gemma2 9B models ended up having ~10B parameters. The current mergekit package is fine, but there are still bloated models on HF that could stand to be fixed.

The script assumes that it will be run from the same directory as the model weights, and will trim the unnecessary lm_head.weight tensor and corresponding index entry.

2 replies

·

posted an update 12 months ago

Post

1471

A reminder that literal base models are valid choices for base model in task arithmetic mergers. Each Instruct or fine-tuned model then becomes a vector against the base model. Example merge formula used can be found via this model page.
grimjim/Magnolia-v3-12B

posted an update about 1 year ago

Post

1169

Speculative decoding only requires that the tokenizers for the two LLMs used line up; the model architectures do not have to be otherwise compatible. As proof of concept, I used exllamav2 to run Llama 3.2 1B Instruct (at 6bpw, for speed) as the draft model to accelerate the target model of a Llama 3 8B merge of Instruct models (at 8bpw, for accuracy). The difference between tokenizers was minor enough to allow this. With 8k context length allocated for each model, both fit in under 13GB VRAM.
https://github.com/turboderp/exllamav2
meta-llama/Llama-3.2-1B-Instruct
grimjim/llama-3-Nephilim-v3-8B

The proof-of-concept Python script compared a zero-shot creative task of writing a story limited to 500 tokens. Speculative decoding improved performance by approximately one third (e.g., increasing from 31 tokens/sec to 46 tokens/sec) over conventional decoding, and was consistent over a few runs. While not statistically significant, this implies that smaller models aimed at edge computing can serve effectively as draft models in the general case.

It is straightforward to consult literature to affirm that fine-tuning draft models can be a way of inducing behavioral change in target models, in a manner not unlike how samplers can be used to induce changes. I speculate that the impact of a fine-tuned draft model would be on part with a LoRA (Low-Rank Adaptation), as the target model retains veto power. The small size of draft model candidates means that more people can perform local full fine-tuning.

It is intuitively obvious that a distilled model can be used as a draft model for the larger teacher model so long as tokenizers line up; e.g., a distilled 8B model can draft for a 70B teacher model. Perhaps Llama-3.1-SuperNova-Lite 8B could effectively draft for the original Llama-3.1-405B-Instruct model.
arcee-ai/Llama-3.1-SuperNova-Lite
meta-llama/Llama-3.1-405B-Instruct

posted an update about 1 year ago

Post

2189

To demonstrate that it was possible, I performed a "trapezoid" gradient merge of a Llama 3 8B model onto Llama 3.1 8B Instruct, favoring the L3.1 model at the ends in order to preserve coherence and limiting the influence of the L3 model to at most 0.1 weight. Tested to 16k context length.
grimjim/Llama-Nephilim-Metamorphosis-v2-8B

replied to their post about 1 year ago

Those look like prefills. Unless you want to train for prefill-specific outputs, it makes sense to remove them.

replied to their post about 1 year ago

For DPO, I'd stick with what HF recommends, which in their example does not have prompt repetition.
https://huggingface.co/docs/trl/main/en/dpo_trainer

Offhand, for multi-turn data, I'd go with what the LLM "sees" in practice, so prior turns are probably part of the prompt, and "chosen" and "rejected" guide what text generation occurs.

Jim Lai

AI & ML interests

Recent Activity

Organizations

Jim Lai

AI & ML interests

Recent Activity

Organizations

grimjim's activity