Same dataset as v1, with roughly the worst 25% of images removed.
Training notes:
Kohya's trainer
optimizer ADOPT
optimizer_args "betas=(0.9, 0.9999)" "eps=1e-7"
weight_decay was actively harming DoRA so I trained without it
DoRA LoCon
frozen text encoder (this increased training time, but I prefer not to touch the TE if possible)
lr 5e-4
dim 16
alpha 8
conv_dim 8
conv_alpha 4
batch 8
GA 16
3k images
flip
bucket
resolution 1024
tag dropout 0.1
dropout 0.2
caption_dropout 0.1
scale_weight_norms 6
ip_noise_gamma 0.02
min_snr_gamma 2
zsnr
v_pred
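For reference, the effective batch size implied by the batch and GA values above can be worked out as below. This is only a rough sketch: "3k images" is treated as exactly 3000, and bucketing may shift the real step count slightly.

```python
import math

batch_size = 8      # "batch 8"
grad_accum = 16     # "GA 16"
num_images = 3000   # assumption: "3k images" taken as exactly 3000

# Samples that contribute to a single optimizer update.
effective_batch = batch_size * grad_accum

# Approximate optimizer updates per pass over the dataset.
steps_per_epoch = math.ceil(num_images / effective_batch)

print(effective_batch, steps_per_epoch)
```

So each optimizer step sees about 128 images, and one epoch is only a couple dozen updates.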
Extras:
soft_min_snr instead of the default formula
learned timestep loss weights: a small network learns the loss scale for each timestep (similar goal to min_snr)
important tags are moved to the front and sorted separately from the other tags
tag implication dropout for all common implied tags (~40% drop rate)
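To illustrate the difference between the default min_snr weighting and a "soft" variant, here is a minimal sketch for v-prediction. The hard version follows the usual min(SNR, gamma) / (SNR + 1) form. The soft version uses a smooth minimum, SNR * gamma / (SNR + gamma), which is an assumption for illustration only; the exact soft_min_snr formula used for this training run is not stated here. The function names are hypothetical.

```python
def min_snr_weight(snr: float, gamma: float = 2.0) -> float:
    """Default min-SNR loss weight for v-prediction: min(snr, gamma) / (snr + 1)."""
    return min(snr, gamma) / (snr + 1.0)


def soft_min_snr_weight(snr: float, gamma: float = 2.0) -> float:
    """Smooth variant (assumed form): replaces the hard min with
    snr * gamma / (snr + gamma), which never exceeds min(snr, gamma)."""
    soft_min = snr * gamma / (snr + gamma)
    return soft_min / (snr + 1.0)


# Compare both weights across a few SNR values.
# SNR 0 is the final timestep under zero terminal SNR (zsnr).
for snr in (0.0, 0.5, 2.0, 10.0, 100.0):
    print(snr, round(min_snr_weight(snr), 4), round(soft_min_snr_weight(snr), 4))
```

The soft curve has no kink at SNR = gamma, so the weighting changes gradually between noisy and clean timesteps instead of switching abruptly.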

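The tag handling described above could look roughly like the sketch below. Everything named here is hypothetical: the `IMPLICATIONS` and `IMPORTANT` tables are made-up examples, alphabetical order stands in for whatever ordering is actually used within each group, and exempting important tags from regular tag dropout is an assumption.

```python
import random

# Hypothetical examples: tag -> tags it implies.
IMPLICATIONS = {"cat ears": {"animal ears"}, "thighhighs": {"legwear"}}
# Hypothetical set of tags treated as important.
IMPORTANT = {"1girl", "solo"}


def process_tags(tags, rng, implied_drop=0.4, tag_drop=0.1):
    """Drop implied tags with probability implied_drop, apply regular tag
    dropout to non-important tags, then put important tags first."""
    present = set(tags)

    # Tags that are implied by some other tag present in this caption.
    implied = set()
    for tag in tags:
        implied |= IMPLICATIONS.get(tag, set()) & present

    # Tag implication dropout: the implied tag is redundant, so drop it sometimes.
    kept = [t for t in tags if not (t in implied and rng.random() < implied_drop)]

    # Split into the two groups; only the non-important group gets tag dropout.
    important = sorted(t for t in kept if t in IMPORTANT)
    others = sorted(t for t in kept if t not in IMPORTANT and rng.random() >= tag_drop)

    return important + others


rng = random.Random(0)
print(process_tags(["cat ears", "animal ears", "solo", "1girl"], rng))
```

With `implied_drop=1.0` the redundant "animal ears" tag is always removed when "cat ears" is present; at the ~40% rate used here the model still sees both forms of the caption.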
