Stable Diffusion has been a cornerstone of open-source diffusion models since the day it was released. Its strong performance, coupled with the modest resources required to run it locally, has inspired thousands of fine-tunes specialized in generating anything you can imagine. But recently, a new model has captured the public’s attention, looking to dethrone the king of open-source diffusion models.
Flux.1 was released on August 1, 2024 by Black Forest Labs and serves as the diffusion model behind the image generator in Elon Musk’s Grok. The model was trained with few safeguards in mind, allowing users to generate almost anything that comes to mind. Black Forest Labs has $31 million in seed funding, and its co-founders previously worked at Stability AI on the original Stable Diffusion model.
The research lab has big plans for the future, with its “What’s Next” page simply stating “State-of-the-Art Text to Video for all,” and it is clearly committed to the idea of open-source models. Elon Musk, for his part, has directly stated that “The danger of training AI to be woke – in other words, lie – is deadly”.
The Flux Model Family
The Flux.1 model comes in three variants of varying capability: Flux.1 Schnell, Flux.1 Dev, and Flux.1 Pro. All three have 12 billion parameters. For reference, the previous state-of-the-art open-source diffusion model, the SDXL series, has only about 3.5 billion parameters.
The Flux.1 Schnell model produces the lowest image quality of the three but is optimized for speed and efficiency. It was released under the Apache 2.0 license, meaning it can be used for personal, scientific, and, most importantly, commercial purposes. Flux.1 Schnell was trained using latent adversarial diffusion distillation, allowing it to generate images in a fraction of the time Flux.1 Dev needs.
The Flux.1 Dev model produces higher-quality, more accurate images than Schnell, but it is released under a more restrictive license that does not allow commercial use.
Finally, the Flux.1 Pro model is the best of the three, with superior visual quality, prompt adherence, and output diversity. However, it is not open source and is only accessible through Black Forest Labs’ API.
How to Run Flux Locally
The best way to run both Flux Dev and Flux Schnell locally is with ComfyUI. ComfyUI offers a balance between ease of use and versatility for those who want to customize the model and build workflows specific to their needs. Similar platforms exist, such as Automatic1111 and Forge, but this article focuses on running Flux with ComfyUI.
Running Flux on ComfyUI is as easy as following the installation instructions and performing the steps detailed on the Flux Examples page. You download a couple of files and place them in their respective folders, start the server with python main.py, and then simply drag the image provided on the example page into the workflow space.
The original model requires a GPU with more than 32 GB of VRAM to run smoothly, whether on an Apple silicon machine or an NVIDIA graphics card. An alternative is the fp8 version, which roughly halves the memory requirements.
However, even the fp8 version requires more than 16 GB of VRAM, which still isn’t feasible for many. Additionally, because fp8 is not supported on Apple silicon, that version cannot run on the GPU there, which significantly slows down generation.
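To see where these numbers come from, a quick back-of-the-envelope estimate helps: the 12 billion transformer weights dominate the footprint, and the text encoders, VAE, and activations add several more gigabytes on top, which is why even fp8 still wants a card with more than 16 GB. The figures below are rough weight-only estimates, not measurements.

```python
# Rough weight-only VRAM estimate for the 12B-parameter Flux transformer.
# The CLIP-L and T5-XXL text encoders, the VAE, and activations add several
# more GB on top of this during generation.
params = 12e9
print(f"fp16 weights: ~{params * 2 / 1e9:.0f} GB")  # 2 bytes per weight -> ~24 GB
print(f"fp8 weights:  ~{params * 1 / 1e9:.0f} GB")  # 1 byte per weight  -> ~12 GB
```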
A better solution is to run a GGUF-quantized version of Flux, with the smallest quantization barely 4 GB in size. GGUF is a binary file format based on GGML, designed specifically for fast loading and saving of machine learning models. It was first adopted for large language models and subsequently for diffusion models.
Additionally, the largest quant of Flux, Q8_0, shows almost no quality degradation compared to the original fp16 model while taking less than half the memory. All images in this article were generated with the Q8_0 version of Flux Dev.
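If you prefer loading a GGUF quant from a script rather than from ComfyUI, recent versions of the diffusers library can also read GGUF checkpoints. The sketch below is only a rough alternative path: it assumes a recent diffusers release with GGUF support, the gguf Python package, and city96’s FLUX.1-dev GGUF conversions on Hugging Face (repo name and filename are assumptions). The ComfyUI route, which the rest of this section follows, is described next.

```python
import torch
from diffusers import FluxPipeline, FluxTransformer2DModel, GGUFQuantizationConfig

# Q8_0 quant of the Flux Dev transformer (assumed repo/filename)
ckpt = "https://huggingface.co/city96/FLUX.1-dev-gguf/blob/main/flux1-dev-Q8_0.gguf"

transformer = FluxTransformer2DModel.from_single_file(
    ckpt,
    quantization_config=GGUFQuantizationConfig(compute_dtype=torch.bfloat16),
    torch_dtype=torch.bfloat16,
)
pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",  # supplies the text encoders and VAE
    transformer=transformer,
    torch_dtype=torch.bfloat16,
)
pipe.enable_model_cpu_offload()  # keeps peak VRAM usage down

image = pipe(
    "a watercolor fox in a misty forest",  # placeholder prompt
    num_inference_steps=20,
    guidance_scale=3.5,
).images[0]
image.save("flux_q8.png")
```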
To run the GGUF models, a couple of extra steps and installations are involved.
To load GGUF models, we need to install a custom node. Simply follow the instructions on its GitHub page: clone the repository into the ComfyUI/custom_nodes directory and install the required dependencies.
Three additional models are needed besides the main GGUF model (the resulting folder layout is sketched after this list):
ae.safetensors: download it from the Black Forest Labs HF repository and move it into ComfyUI/models/vae
clip_l.safetensors: download it from the Comfyanonymous HF repository and move it into ComfyUI/models/clip
t5xxl_fp8_e4m3fn.safetensors: download it from the Comfyanonymous HF repository and move it into ComfyUI/models/clip
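With everything in place, the relevant part of the ComfyUI folder should look roughly like this. The unet folder and the exact GGUF filename are assumptions based on the custom node's typical setup, so treat them as placeholders and follow the node's README if it says otherwise.

```
ComfyUI/
├── custom_nodes/
│   └── (the GGUF loader custom node)
└── models/
    ├── unet/   flux1-dev-Q8_0.gguf          (or whichever quant you downloaded)
    ├── vae/    ae.safetensors
    └── clip/   clip_l.safetensors
                t5xxl_fp8_e4m3fn.safetensors
```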
Then download the workflows here and drag the “model_Q8_CLIP_FP_16_FLUX_DEV.json” file onto the ComfyUI workspace. With the Q2 and Q3 models, GPUs with as little as 6 GB of VRAM can run Flux.
Other than the basic text-to-image workflow, the downloaded folder also includes workflows for incorporating LoRAs and for image-to-image generation; simply drag the corresponding json file in and try it yourself.
Finally, download a version of Flux Dev or Flux Schnell that suits your needs. Remember, a smaller model is faster to run but comes at the cost of lower quality. Click the “Queue Prompt” button on the sidebar to process the current prompt.
When running the models, Flux Dev requires 20 or more steps to generate an image of decent quality (you can adjust the number of steps under the BasicScheduler node), while Schnell can work with as few as 1 to 4 steps.
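If you would rather script generations than click “Queue Prompt” each time, ComfyUI also exposes a small HTTP API on its default port, 8188. The sketch below assumes you have exported your Flux workflow with “Save (API Format)” (enable the dev mode option in ComfyUI’s settings to see that button); the filename and node id are placeholders specific to your own workflow.

```python
import json
from urllib import request

# Load a workflow exported from ComfyUI with "Save (API Format)"
with open("flux_dev_api.json") as f:  # placeholder filename
    workflow = json.load(f)

# Optionally tweak inputs before queueing, e.g. the step count on the
# BasicScheduler node ("17" is a placeholder node id, check your own export):
# workflow["17"]["inputs"]["steps"] = 20

payload = json.dumps({"prompt": workflow}).encode("utf-8")
req = request.Request(
    "http://127.0.0.1:8188/prompt",
    data=payload,
    headers={"Content-Type": "application/json"},
)
print(request.urlopen(req).read().decode())
```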
How to Get the Best Results: Tips and Tricks
Although Flux is designed to be usable by anyone, with or without technical knowledge of diffusion models, there are still tips and tricks that can improve generation quality for specific use cases.
Understanding The Flux Guidance Parameter
One of the most important parameters determining the style and quality of a generated image is the Flux Guidance parameter, a special form of guidance similar to the classifier-free guidance (CFG) scale used by other diffusion models.
Generally, the higher the guidance, the more closely the model adheres to the prompt, while a lower guidance value lets the model be more “creative” at the cost of lower adherence. Although this rule of thumb is good to know, there are many more intricacies to the guidance value.
As a rule of thumb, longer and more detailed prompts tend to work better with lower guidance values, while shorter, simpler prompts often require higher guidance values to produce satisfactory results.
Optimizing Guidance Values for Different Styles
Typically, for a photo or other realistic generation, a lower guidance produces a better-quality image than a higher one.
Although the difference is subtle in this generation, the higher-guidance image has more saturation and more “vibrant” color, which is somewhat typical of AI-generated images. Additionally, the background appears more “smoothed out” at the higher guidance value, possibly because the prompt does not describe the background.
A guidance value of around 2 produces the best photographic generations, while a higher value, such as 6 or above, is better suited to cartoon- and anime-styled images. For paintings with distinctive brush strokes, a guidance value of 1 to 1.5 is recommended.
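To get a feel for how the guidance value shifts the style, it is easy to sweep a few values over the same prompt and seed and compare the results side by side. This minimal sketch uses the diffusers library rather than ComfyUI; its guidance_scale argument plays the role of the Flux Guidance value discussed here, and the prompt is just a placeholder.

```python
import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
)
pipe.enable_model_cpu_offload()  # helps on cards with limited VRAM

prompt = "a candid photo of an elderly fisherman mending a net at dawn"  # placeholder
for guidance in (1.5, 2.0, 3.5, 6.0):
    image = pipe(
        prompt,
        guidance_scale=guidance,
        num_inference_steps=20,
        height=1024,
        width=1024,
        generator=torch.manual_seed(0),  # fixed seed so only the guidance changes
    ).images[0]
    image.save(f"fisherman_guidance_{guidance}.png")
```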
The Impact of Capitalization in Prompts
On a side note, one Reddit user pointed out that Flux’s text encoder also encodes capitalization, which means phrases like “vincent van gogh’s style” may produce drastically different results than “Vincent Van Gogh’s style”. This applies not only to artists but to any well-known person.
Generating Accurate Text and Letters
To produce accurate letters and text, regardless of the medium, whether a billboard, a poster, or people’s clothes, a higher guidance value is more likely to yield a correct generation than a lower one.
Both attempts failed to generate all of the text in the prompt correctly, but the higher-guidance image came much, much closer to the desired result than the lower-guidance one.
To achieve better text accuracy while keeping the guidance low, increasing the resolution can sometimes help. Both images above were generated at 512x512. Bumping the resolution up to 1024x1024, even an image with a guidance of 2.0 renders text more accurately than the smaller image generated with a higher guidance, at the cost of taking more than twice as long.
Leveraging Natural Language in Prompts
Flux’s text encoder handles natural language with precision, and prompting should feel more like talking to ChatGPT than trying to describe every detail of a scene you imagined in your head.
For example, to generate an image with a certain mood, instead of describing the color gradients and the atmosphere in detail, simply tell the model the mood you want and Flux will understand it with ease.
Specifying Text Placement in Prompts
Although equipped with excellent language understanding, Flux seems to fail at interpreting words like “label”, “caption”, or similar terms that describe where text belongs, especially when generating text on a poster or a product.
To ensure the best adherence and the presence of the text described in the prompt, instead of using words such as “label”, state it directly, for example “small words on the bottom of the product that read: blah blah blah”.
Was Flux Overhyped?
The entire Flux family of models boasts impressive performance metrics, with both the Dev and Pro models surpassing every other diffusion model, open source or not. But that is not to say the model has no flaws. In fact, it stumbles on some of the most common pitfalls for text-to-image models.
Although most articles and posts make Flux sound perfect at generating hands and text, it doesn’t take long to discover that it isn’t as reliable as it seems. Flux is decent at producing close-up shots of hands that aren’t making complex gestures, but it fails very frequently when prompted with a scene containing multiple people, or hands that are far from the camera (just try asking Flux to generate two people holding hands; even a close-up shot is rarely correct).
The same appears to be true of text: in a relatively simple scene, Flux can generate text, even multiple sentences, with astounding accuracy. But when faced with complex scenes or text at unusual angles, such as vertical text, the generation often contains unintelligible gibberish.
Regardless, the Flux family of models still marks an incredible leap forward for diffusion models in general, not just within the open-source space. With more time, people will create better fine-tunes suited to specific use cases, and we can expect quality to improve as these fine-tuned models appear, just as we have seen great variants in the Stable Diffusion family (such as Juggernaut XL).