xllsd16 v1
A combination of the SDXL VAE, LongCLIP-L (248 tokens!), and the SD1.5 base, trained entirely in BF16.
Note that this is "v1". Ideally there will be more finetuning on this model, both for improved image quality and for improved prompt following, since it uses LongCLIP.
WARNING: Untested model
I am not sure of the state of this model; it was just uploaded for inter-project testing. This version was created May 19, 2025, with OneTrainer: LION, b24a10 (batch size 24, gradient accumulation 10), LR 1e-5. I am unsure how many epochs it ran.
Scope
This model was trained specifically with realism in mind. Ideally, you should be able to use it to finetune decent-quality real-world models without needing negative prompts.
Additionally, in theory, if your program supports it, you should be able to use prompts up to 248 tokens long. However, at this point that is mostly just an architectural capability; more training is required to take full advantage of it.
Usage
Use it like any other SD1.5 model. You should be able to specify "opendiffusionai/xllsd16-v1" as the model id in programs or pipelines that support loading "diffusers"-format models from Hugging Face.
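For example, a minimal diffusers sketch (this assumes a standard diffusers install; how much of the 248-token prompt length you actually get depends on your front end, since the stock pipeline simply uses whatever maximum length the bundled tokenizer reports):

```python
# Minimal sketch, assuming a standard diffusers install.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "opendiffusionai/xllsd16-v1",
    torch_dtype=torch.bfloat16,  # the model was trained entirely in bf16
).to("cuda")

image = pipe(
    "a candid photograph of a woman reading in a sunlit cafe",
    num_inference_steps=30,
).images[0]
image.save("xllsd16-test.png")
```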
Creation history
This was trained on a single RTX 4090, using the OneTrainer program.
As noted in the metadata here, the starting point was a raw merge of the three components: the SDXL VAE, LongCLIP-L, and the SD1.5 base.
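For reference, a component swap like that looks roughly like the sketch below in diffusers. This is only a sketch, not the exact merge script used here; the LongCLIP path is a placeholder, and a real merge also needs the tokenizer reconfigured for 248 tokens (which is what this repo already ships with).

```python
# Rough sketch of the raw merge, not the exact script used for this repo.
import torch
from diffusers import AutoencoderKL, StableDiffusionPipeline
from transformers import CLIPTextModel

# Start from a stock SD1.5 base (substitute whichever SD1.5 repo you use).
pipe = StableDiffusionPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5", torch_dtype=torch.bfloat16
)

# Swap in the SDXL VAE (it ships its own config, including scaling_factor).
pipe.vae = AutoencoderKL.from_pretrained(
    "stabilityai/sdxl-vae", torch_dtype=torch.bfloat16
)

# Swap in LongCLIP-L as the text encoder. Placeholder path: the checkpoint must
# already be in transformers format, with 248 position embeddings.
pipe.text_encoder = CLIPTextModel.from_pretrained(
    "path/to/longclip-l-text-encoder", torch_dtype=torch.bfloat16
)

pipe.save_pretrained("xllsd-raw-merge")
```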
The loss function used was "Debiased Estimation".
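"Debiased Estimation" weights the per-sample MSE by roughly 1/sqrt(SNR) of the sampled timestep. A minimal sketch of that weighting (this follows the commonly used formulation; OneTrainer's exact implementation may differ):

```python
import torch

def debiased_estimation_weights(timesteps, alphas_cumprod, max_snr=1000.0):
    """Per-sample loss weight ~ 1 / sqrt(SNR_t), with the SNR clipped for stability."""
    a = alphas_cumprod[timesteps]
    snr_t = a / (1.0 - a)                    # signal-to-noise ratio at timestep t
    snr_t = torch.clamp(snr_t, max=max_snr)  # SNR blows up at low-noise timesteps; clip to keep the weight finite
    return 1.0 / torch.sqrt(snr_t)

# usage: loss = (debiased_estimation_weights(t, alphas_cumprod) * per_sample_mse).mean()
```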
Phase 1 (25 hours)
LION, LR 4.5e-05, Linear scheduler with 10 epochs, and bottom limit of 4.5e-06
Batch size 32, accum 8
Dataset size 200k (CC12M, 2mp, cleaned, with WD tagging)
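In plain-PyTorch terms, that phase-1 schedule looks roughly like the sketch below. OneTrainer wires all of this up through its own config, so the lion-pytorch optimizer and the stand-in module here are assumptions, not what was literally run.

```python
# Rough plain-PyTorch equivalent of the phase-1 settings (sketch only).
import torch
from torch import nn
from lion_pytorch import Lion   # stand-in for OneTrainer's LION implementation

unet = nn.Linear(8, 8)                   # stand-in for the real UNet
steps_per_epoch = 200_000 // (32 * 8)    # 200k images / (batch 32 * accum 8)

optimizer = Lion(unet.parameters(), lr=4.5e-5)

# Linear decay from 4.5e-05 down to the 4.5e-06 floor over 10 epochs.
scheduler = torch.optim.lr_scheduler.LinearLR(
    optimizer, start_factor=1.0, end_factor=0.1, total_iters=10 * steps_per_epoch
)
```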
Phase 2 (11 hours)
LION, LR 3e-06, CONST scheduler for 60 epochs, but I picked the checkpoint from epoch 49
Batch size 64, accum 1
Dataset size 22k (CC12M, 2mp, "woman" subset of above, WD tagging)
Phase 3 (?? hours)
LION, LR 4e-06, CONST scheduler with 10 epochs (stopped at 16,000 steps)
Batch size 32, accum 8
Dataset size 140k
How were LR values picked?
I did many, many, many runs at other values. A key strategy I used was to set up "validation" set graphs.
(I used 144 images from a completely separate dataset; specifically, some 1mp images from CC12M.)
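The idea, sketched below, is just to measure diffusion loss on a fixed held-out set with fixed noise and timesteps, so curves from different runs stay directly comparable. The `val_latents` and `text_emb` names are placeholders for pre-encoded validation data, and however OneTrainer actually computes its validation graphs may differ.

```python
# Sketch of a fixed validation-loss measurement; val_latents and text_emb are
# assumed to be pre-encoded VAE latents / text embeddings for the held-out images.
import torch
import torch.nn.functional as F

@torch.no_grad()
def validation_loss(unet, scheduler, val_latents, text_emb, seed=0):
    g = torch.Generator(device=val_latents.device).manual_seed(seed)
    noise = torch.randn(val_latents.shape, generator=g, device=val_latents.device)
    timesteps = torch.randint(
        0, scheduler.config.num_train_timesteps,
        (val_latents.shape[0],), generator=g, device=val_latents.device,
    )
    noisy = scheduler.add_noise(val_latents, noise, timesteps)
    pred = unet(noisy, timesteps, encoder_hidden_states=text_emb).sample
    return F.mse_loss(pred, noise).item()   # plot this per epoch to get the curve
```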
For phase 1:
It would always reach a general floor after a few epochs. The specific floor value got lower and lower as I raised the LR, until it hit a particular magic number; past that point, raising the LR only increased the rate at which it reached the floor. So I experimented with setting the initial LR to the lowest value that reached the floor, then used the linear scheduler to back off slowly. Initially I was up in the Xe-05 range, adjusting the value in whole steps; then I started adjusting by 0.1, i.e. ±1e-06, until I decided I liked 4.5e-05 best.
One interesting note: among the LR values that converged on the "floor", the smaller ones got there more slowly, but in my short test runs they seemed to reach a very slightly better value on the validation curve.
For phase 2:
I wanted to do a cleanup run with the same dataset at low LR. So, I just experimented with a few "lowish" LR values to find the one I liked the most. (I also checked validation graphs.)
For phase 3:
This was a mess.
Initially I went back to comparing validation graphs, this time with a new dataset. The interesting thing here is that I started too low at first, and the validation curve got WORSE the longer it ran. But starting at 4e-05 was too high, and that also got worse over time at first. I found it worked at 1e-05 and started tweaking from there, eventually finding 9e-05 to work best (with linear decay?).
However, there was another factor at play. I found that the "step 0" sample actually looked BETTER to my eye than the samples from the next epoch, so I really wanted to preserve the good qualities of that checkpoint through the next round. I tried using EMA, and that was something of an improvement, but not enough. I tried using warmup, even though I don't usually have to do that with LION, and that was also an improvement.
But then I decided to do a bit more examination of the dataset, and discovered that this one had not been trimmed nearly as much as the first dataset had been. After I pruned it some, I could then use LION with CONST on it and get better results than my prior attempts. One interesting thing is that I then took the LR down to 4e-06, and was getting nice facial details that way.
Comparison with FP32 training
Generally speaking, training in FP32 precision takes 2-4 times as long as bf16. Additionally, it is possible to fit some things in bf16 that will not fit in fp32: you cannot do batch size 64 in fp32 on a 4090.
It is just barely possible, however, to do b32a8 with LION in bf16; not with fp32. (The closest clean combination I have found so far is b24a10.)
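As a toy sketch of what the "bNaM" shorthand means in practice (micro-batches of N, gradients accumulated over M of them before each optimizer step); the model, data, and optimizer below are stand-ins:

```python
# Toy sketch of gradient accumulation ("b32a8": micro-batch 32, accumulate 8).
import torch
from torch import nn

model = nn.Linear(8, 1)                            # stand-in for the UNet
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
loader = [torch.randn(32, 8) for _ in range(16)]   # micro-batches of 32

accum = 8
for step, batch in enumerate(loader):
    loss = model(batch).pow(2).mean() / accum      # scale so 8 micro-batches act like one batch of 256
    loss.backward()
    if (step + 1) % accum == 0:                    # one optimizer step per 32 * 8 = 256 samples
        optimizer.step()
        optimizer.zero_grad()
```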
Generally speaking, my FP32 experiments started producing human-looking results a lot faster than bf16, perhaps in only half the epochs. However, I found it harder to get to where I really wanted to go, and I lost patience.
After perhaps some more finetuning with bf16, I may return once more to fp32 for attempted maximum quality.