SimpleTuner 💹

⚠️ Warning: The scripts in this repository have the potential to damage your training data. Always maintain backups before proceeding.

SimpleTuner is geared towards simplicity, with a focus on making the code easily understood. This codebase serves as a shared academic exercise, and contributions are welcome.

Design Philosophy
Tutorial
Features
Hardware Requirements
Scripts
Toolkit
Setup
Troubleshooting

Design Philosophy

Simplicity: Aiming to have good default settings for most use cases, so less tinkering is required.
Versatility: Designed to handle a wide range of image quantities - from small datasets to extensive collections.
Cutting-Edge Features: Only incorporates features that have proven efficacy, avoiding the addition of untested options.

Tutorial

Please fully explore this README before embarking on the tutorial, as it contains vital information that you might need to know first.

For a quick start without reading the full documentation, you can use the Quick Start guide.

For memory-constrained systems, see the DeepSpeed document which explains how to use 🤗Accelerate to configure Microsoft's DeepSpeed for optimiser state offload.

For multi-node distributed training, this guide helps adapt the configurations from the INSTALL and Quickstart guides for multi-node training, and optimise them for image datasets numbering in the billions of samples.

Features

Multi-GPU training
Image and caption features (embeds) are cached to the hard drive in advance, so that training runs faster and with less memory consumption
Aspect bucketing: support for a variety of image sizes and aspect ratios, enabling widescreen and portrait training.
Refiner LoRA or full u-net training for SDXL
Most models are trainable on a 24G GPU, or even down to 16G at lower base resolutions.
- LoRA/LyCORIS training for PixArt, SDXL, SD3, and SD 2.x that uses less than 16G VRAM
DeepSpeed integration allowing for training SDXL's full u-net on 12G of VRAM, albeit very slowly.
Quantised NF4/INT8/FP8 LoRA training, using low-precision base model to reduce VRAM consumption.
Optional EMA (Exponential moving average) weight network to counteract model overfitting and improve training stability.
Train directly from an S3-compatible storage provider, eliminating the requirement for expensive local storage. (Tested with Cloudflare R2 and Wasabi S3)
Full ControlNet model training (not ControlLoRA or ControlLite), for SDXL and SD 1.x/2.x only
Training Mixture of Experts for lightweight, high-quality diffusion models
Masked loss training for superior convergence and reduced overfitting on any model
Strong prior regularisation training support for LyCORIS models
Webhook support for posting your training progress, validations, and errors to e.g. Discord channels
Integration with the Hugging Face Hub for seamless model upload and nice, automatically generated model cards.
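
To illustrate the aspect bucketing feature listed above, here is a minimal sketch of the general idea: images are grouped by rounded aspect ratio so that each batch can be drawn from images of a similar shape. This is not SimpleTuner's implementation; the function names aspect_bucket and build_buckets, the two-decimal rounding, and the sample file names are all assumptions made for illustration.

```python
from collections import defaultdict


def aspect_bucket(width: int, height: int, precision: int = 2) -> float:
    """Return a bucket key: the aspect ratio rounded to `precision` decimals."""
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    return round(width / height, precision)


def build_buckets(images, precision: int = 2):
    """Group (name, width, height) tuples by their rounded aspect ratio."""
    buckets = defaultdict(list)
    for name, width, height in images:
        buckets[aspect_bucket(width, height, precision)].append(name)
    return dict(buckets)


# Hypothetical dataset: one square, two widescreen, and one portrait image.
images = [
    ("square.png", 1024, 1024),
    ("wide.png", 1920, 1080),
    ("wide2.png", 1280, 720),
    ("portrait.png", 768, 1152),
]
buckets = build_buckets(images)
for ratio, names in sorted(buckets.items()):
    print(ratio, names)
```

Both widescreen images share a bucket even though their pixel dimensions differ, which is what lets a trainer resize them to a common shape and batch them together.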

Flux.1

Full training support for Flux.1 is included:

Classifier-free guidance training
- Leave it disabled and preserve the dev model's distillation qualities
- Or, reintroduce CFG to the model and improve its creativity at the cost of inference speed and training time.
(optional) T5 attention masked training for superior fine details and generalisation capabilities
LoRA or full tuning via DeepSpeed ZeRO on a single GPU
Quantise the base model by setting --base_model_precision to int8-quanto or fp8-quanto for major memory savings
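
As a rough illustration of why quantising the base model saves memory, the sketch below estimates the size of the model weights alone at different precisions. It assumes roughly 12 billion parameters for the Flux.1 transformer, a figure that is not stated in this README, and it ignores activations, gradients, optimiser state, and the text encoders, so real usage will be higher. The function name weight_memory_gib is hypothetical.

```python
def weight_memory_gib(num_params: float, bits_per_param: int) -> float:
    """Estimate weight storage in GiB for a given parameter count and precision."""
    bytes_total = num_params * bits_per_param / 8
    return bytes_total / 2**30


# Assumption: about 12 billion parameters in the Flux.1 transformer.
FLUX_PARAMS = 12e9

for label, bits in (("bf16", 16), ("int8 / fp8", 8), ("nf4", 4)):
    print(f"{label}: {weight_memory_gib(FLUX_PARAMS, bits):.1f} GiB")
```

Halving the bits per parameter halves the weight footprint, which is the main reason a quantised base model fits on smaller cards.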

See hardware requirements or the quickstart guide.

PixArt Sigma

SimpleTuner has extensive training integration with PixArt Sigma - both the 600M & 900M models load without modification.

Text encoder training is not supported.
LyCORIS and full tuning both work as expected
ControlNet training is not yet supported
Two-stage PixArt training support (see: MIXTURE_OF_EXPERTS)

See the PixArt Quickstart guide to start training.

NVLabs Sana

SimpleTuner has preliminary training integration with NVLabs Sana.

This is a lightweight, fun, and fast model that makes getting into model training highly accessible to a wider audience.

LyCORIS and full tuning both work as expected.
Text encoder training is not supported.
PEFT Standard LoRA is not supported.
ControlNet training is not yet supported

See the NVLabs Sana Quickstart guide to start training.

Stable Diffusion 3

LoRA and full finetuning are supported as usual.
ControlNet is not yet implemented.
Certain features such as segmented timestep selection and Compel long prompt weighting are not yet supported.
Parameters have been optimised to get the best results, validated through from-scratch training of SD3 models

See the Stable Diffusion 3 Quickstart to get going.

Kwai Kolors

An SDXL-based model with ChatGLM (General Language Model) 6B as its text encoder, doubling the hidden dimension size and substantially increasing the level of local detail included in the prompt embeds.

Kolors support is almost as deep as SDXL, minus ControlNet training support.

Legacy Stable Diffusion models

RunwayML's SD 1.5 and StabilityAI's SD 2.x are both trainable under the legacy designation.

Hardware Requirements

NVIDIA

Pretty much anything 3080 and up is a safe bet. YMMV.

AMD

LoRA and full-rank tuning are verified working on a 7900 XTX 24GB and MI300X.

Because xformers is unavailable, AMD hardware will use more memory than equivalent NVIDIA hardware.

Apple

LoRA and full-rank tuning are tested to work on an M3 Max with 128G memory, taking about 12G of "Wired" memory and 4G of system memory for SDXL.

You likely need a 24G or greater machine for machine learning with M-series hardware due to the lack of memory-efficient attention.
Subscribing to PyTorch issues for MPS is probably a good idea, as random bugs will make training stop working.

Flux.1 [dev, schnell]

A100-80G (Full tune with DeepSpeed)
A100-40G (LoRA, LoKr)
3090 24G (LoRA, LoKr)
4060 Ti 16G, 4070 Ti 16G, 3080 16G (int8, LoRA, LoKr)
4070 Super 12G, 3080 10G, 3060 12G (nf4, LoRA, LoKr)

Flux prefers being trained with multiple large GPUs, but a single 16G card should be able to do it with quantisation of the transformer and text encoders.

SDXL, 1024px

A100-80G (EMA, large batches, LoRA @ insane batch sizes)
A6000-48G (EMA@768px, no EMA@1024px, LoRA @ high batch sizes)
A100-40G (EMA@1024px, EMA@768px, EMA@512px, LoRA @ high batch sizes)
4090-24G (EMA@1024px, batch size 1-4, LoRA @ medium-high batch sizes)
4080-12G (LoRA @ low-medium batch sizes)

Stable Diffusion 2.x, 768px

16G or better

Toolkit

For more information about the associated toolkit distributed with SimpleTuner, refer to the toolkit documentation.

Setup

Detailed setup information is available in the installation documentation.

Troubleshooting

For more detailed insight, enable debug logs by adding export SIMPLETUNER_LOG_LEVEL=DEBUG to your environment file (config/config.env).

For performance analysis of the training loop, setting SIMPLETUNER_TRAINING_LOOP_LOG_LEVEL=DEBUG adds timestamps to the logs that highlight any issues in your configuration.
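
For readers curious how an environment variable like these can control verbosity, here is a minimal sketch using Python's standard logging module. It is not taken from SimpleTuner's code: the helper name resolve_log_level and the INFO fallback are assumptions for illustration only.

```python
import logging
import os

# Level names accepted by this sketch, mapped to the standard logging constants.
_LEVELS = {
    "DEBUG": logging.DEBUG,
    "INFO": logging.INFO,
    "WARNING": logging.WARNING,
    "ERROR": logging.ERROR,
    "CRITICAL": logging.CRITICAL,
}


def resolve_log_level(var_name: str, default: str = "INFO") -> int:
    """Read a level name from the environment; fall back to `default` if unset or unknown."""
    name = os.environ.get(var_name, default).upper()
    return _LEVELS.get(name, _LEVELS[default])


# Assumed usage: configure the root logger from SIMPLETUNER_LOG_LEVEL.
logging.basicConfig(level=resolve_log_level("SIMPLETUNER_LOG_LEVEL"))
logging.getLogger("example").debug("Shown only when the variable is set to DEBUG.")
```

The same helper could be pointed at SIMPLETUNER_TRAINING_LOOP_LOG_LEVEL to give the training loop its own, separately controlled logger.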

For a comprehensive list of options available, consult this documentation.

Discord

For more help or to discuss training with like-minded folks, join our Discord server.