Fooocus: OSS for image generation by ControlNet author (github.com/lllyasviel)
186 points by dvrp on Aug 12, 2023 | 54 comments


"Native refiner swap inside one single k-sampler. The advantage is that now the refiner model can reuse the base model's momentum (or ODE's history parameters) collected from k-sampling to achieve more coherent sampling. In Automatic1111's high-res fix and ComfyUI's node system, the base model and refiner use two independent k-samplers, which means the momentum is largely wasted, and the sampling continuity is broken. Fooocus uses its own advanced k-diffusion sampling that ensures seamless, native, and continuous swap in a refiner setup."

This is so interesting and seems obvious in retrospect, but super impressive! The code is simple too, going to hack around with this over the weekend :)


As a frontend developer, this reads to me as technobabble you'd find in entertainment media. In general, I learn about things not directly related to my sphere of interests by osmosis, but this is on another level. Reminds me of the time when I started my computing journey. I wonder if I'll be able to understand this eventually just by reading a relevant comment or blog here and there.


The image is produced by sampling the model, which is the most computationally intensive process in the pipeline. Stable Diffusion XL consists of several parts which have to be sampled separately, losing the context in the process. This samples both parts in a way that no context is lost, improving the result.


What does "sampling" mean here? Is it like sampling a probability distribution, or like sampling a continuous image by taking the value of specific pixels?


More the former.


I'm also a frontend developer, and this feels like when I tried to grok React by osmosis and it made no sense. Then I sat down and read the docs, got the meaning of the twenty or so new terms down, and suddenly I could follow all the React stuff.

But anyway, it's the same here, except that AI and machine learning are of course a far vaster topic than React, so there are going to be more than twenty terms to learn.


Here's my attempt at an explanation without jargon. You can just read the last paragraph; the first 4 are just context.

These image models are trained on 1000 steps of noise, where at 0 no noise is added to the training image and at 1000 the image is pure noise. The model's goal is to denoise the image, and it does this knowing how much noise the image has. That lets the model learn how much it should change the image: at high noise it changes a lot of pixels and starts building the overall "structure" of the image, and at low noise it changes fewer pixels and focuses on adding details.
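As a toy illustration of that idea (not the actual noise schedule real diffusion models use, which involves more elaborate coefficients), you can blend a clean value with random noise according to the step number:

```python
import random

# Toy illustration of "1000 levels of noise": blend a clean value with
# gaussian noise, with the blend controlled by the step t in [0, 1000].
# Real diffusion schedules are more involved; this only shows the shape.
def add_noise(x, t, total=1000):
    frac = t / total                     # 0 = clean, 1 = pure noise
    return (1 - frac) * x + frac * random.gauss(0, 1)
```

At t=0 the input comes back untouched; at t=1000 nothing of the original survives, which is the regime the model is trained to handle at every level in between.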

To use the model you start with pure noise, then the model iteratively denoises it until a clean image shows up. A naive approach would take 1000 steps: you run the model 1000 times, each time feeding in the previous result and telling the model that the noise decreased by 1, until it reaches 0 noise. This takes a long time, up to 15 minutes to generate an image on a mid-range consumer GPU.
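The naive loop above can be sketched like this, with a stub standing in for the real model (all names here are illustrative):

```python
# Sketch of the naive approach: call the model once per noise level,
# 1000 times total. `denoise_step` is a stand-in for a real model call.
def denoise_step(image, noise_level):
    # A real model predicts and removes noise; this stub just shrinks
    # every value, more aggressively as the noise level approaches 0.
    return [x * (1 - 1 / max(noise_level, 1)) for x in image]

def naive_sample(image, start_level=1000):
    calls = 0
    for level in range(start_level, 0, -1):
        image = denoise_step(image, level)
        calls += 1
    return image, calls          # one model call per level: 1000 calls
```

The point of the sketch is just the cost: one full model invocation per noise level, which is what the samplers below are designed to avoid.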

Turns out that when you give the model pure noise and tell it there are 1000 steps of noise, the result is not an image with 999 steps of noise but one that looks like it has much less. This means you can probably skip 50-100 steps of denoising per iteration and still get a very good picture. The issue is: which steps do you pick? You could again take a naive approach and just skip ahead 50 steps at a time for a total of 20 steps, but it turns out there are better ways.
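The naive "every 50th step" schedule mentioned above is just an evenly spaced countdown:

```python
# Evenly spaced schedule: 20 of the 1000 noise levels, counting down
# from 1000 to 50 in strides of 50.
naive_schedule = list(range(1000, 0, -50))
```

Better schedules spend those 20 steps unevenly, which is part of what samplers decide.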

This is where samplers come in. Essentially, a sampler takes the number of steps you want to use to denoise an image (usually ~20) and, among other things, picks which noise levels to visit at each iteration. The most popular samplers are the ones in the k-diffusion repo[1], or k-samplers for short. Do note that samplers do much more than just pick the steps: they are responsible for the denoising process itself, and some of them even add a small amount of noise back after each denoising step, among other things.
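As a concrete example of a non-uniform schedule, here is a plain-Python sketch of the "Karras" noise schedule idea from the k-diffusion repo (the sigma bounds below are illustrative; real pipelines take them from the model, and the repo's version differs in small details):

```python
# The "Karras" schedule spaces noise levels (sigmas) non-uniformly, so
# more of the ~20 steps land at low noise, where details get added.
# rho=7 matches the k-diffusion default; the sigma bounds are made up.
def karras_sigmas(n, sigma_min=0.03, sigma_max=14.6, rho=7.0):
    ramp = [i / (n - 1) for i in range(n)]
    lo, hi = sigma_min ** (1 / rho), sigma_max ** (1 / rho)
    return [(hi + t * (lo - hi)) ** rho for t in ramp]
```

Interpolating in sigma^(1/rho) space instead of linearly is what clusters the steps toward the low-noise end.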

The newest open source model, SDXL, is actually 2 models: a base model that can generate images as normal, and a refiner model that is specialized in adding details to images. A typical workflow is to ask the base model for 25 steps of denoising but only run the first 20, then use the refiner model to do the rest. According to the OP, this was being done without keeping the state of the sampler; that is, tools were running 2 samplers separately, one for the base model and then a fresh one for the refiner model. Since the samplers use historical data for optimization, the end result was not ideal.
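A minimal sketch of the difference (illustrative only, not Fooocus's actual code): swapping the model inside one sampler loop keeps the accumulated history, while starting a second sampler throws it away.

```python
# `history` stands in for the momentum / ODE history a k-sampler
# accumulates across steps; the strings stand in for model calls.
def continuous_swap(base, refiner, steps=25, swap_at=20):
    history = []
    for i in range(steps):
        model = base if i < swap_at else refiner
        history.append(model)    # refiner steps inherit the base's history
    return history

def independent_samplers(base, refiner, steps=25, swap_at=20):
    first = [base for _ in range(swap_at)]
    second = [refiner for _ in range(steps - swap_at)]  # starts from empty
    return first, second
```

In the first version there is one 25-entry history spanning both models; in the second, the refiner's 5 steps begin with no memory of the 20 that came before, which is the "wasted momentum" the OP describes.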

[1] https://github.com/crowsonkb/k-diffusion


Definitely interesting. It seems that I accidentally implemented that in Draw Things.


Nice! I love that app. It’s wild that I can generate images natively on my iPhone.

I’ve used some of your ideas for inspiration for ArtBot (mobile friendly web front-end for the AI Horde distributed Stable Diffusion project).

I love seeing updates to it. Keep it up! (SDXL is so much fun)


> Linux and Mac

> Coming soon ...

Ah well. Hopefully it is soon. Also, on behalf of all Apple Silicon Mac users, would be nice if the author looked into implementing Metal FlashAttention [1].

1. https://github.com/philipturner/metal-flash-attention


I can get this running on my M1 Mac Pro (CPU mode) with a few tweaks in the cv2win32.py file. I'll submit a change request.


For those who don't know, ControlNet is often used in conjunction with Stable Diffusion. It lets you add extra conditions to guide what is being generated. There are extensions for Automatic1111's stable diffusion webui that can make use of ControlNet. Some examples I've seen are copying the pose of a person/animal in an image and outputting a different person with the same pose (and extending to videos). Also taking line art drawings and filling them in with style.

https://stable-diffusion-art.com/controlnet/


Not to forget crazy creative QR codes that still scan!

https://stable-diffusion-art.com/qr-code/


I've been playing with that. It's really hard to get right. My end goal is to have a nice picture I can frame and put on the wall in my home, so that if anyone asks for the wifi password I can tell them to take a photo of it with their phone.


What a great idea, I'm going to do that too!


I feel like if I can get it working reliably I might have fun setting up a website to make them for $1 each or something.

Right now though the process I have is to make 16 at a time with various settings and seeds and then run my phone over them all and see which ones the camera can read as QR codes. Finding one that is both nice to look at and reads reliably is rare.


> Learned from Midjourney, the manual tweaking is not needed, and users only need to focus on the prompts and images

Except prompt-based tweaking doesn’t work very well in MJ; certainly not as well as manually-directed in-painting and out-painting. It’s virtually impossible in MJ to hold one part of the image constant while adding to/modifying the remainder.


Those commits are something else.


Oh wow.

I am not sure what I would have expected upon reading this comment, but I was not prepared.


i


Interesting, and I look forward to using it, but I wish the distribution had kept the folder-name conventions of AUTOMATIC1111, so that we could more easily have used symbolic links for folders of LoRAs and checkpoints etc. that we'd rather not duplicate.


Apparently it uses the folder structure of ComfyUI - I just symlinked the models folder from that and it worked with no issues. (I also reused my ComfyUI venv, just had to do a pip install pygit2 to make it work)
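For anyone trying the same thing, the command is just a symlink from the existing ComfyUI models folder into the Fooocus checkout. The paths below are hypothetical examples; substitute wherever your own checkouts live:

```shell
# Hypothetical paths: adjust both sides to your own checkouts.
# ln -s <existing ComfyUI models folder> <where Fooocus looks for models>
ln -s "$HOME/ComfyUI/models" "$HOME/Fooocus/models"
```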


Can't you symlink individual files? More effort but only a quick bit of scripting away.

(I've occasionally used a duplicate file eliminator that finds dups over a certain size and replaces them with symlinks. You can run it on an entire subtree or drive)


I suppose so, but it's a bit elaborate for the purpose. I was hoping to discover that the system allows nested folders, but couldn't get that far. If it did, you could just put the folder symlink inside.


i’m sure there are ways around it no?


Still installing. If it can read nested folders, yes.


In the end, it wouldn't run. May revisit a few releases down the road.


The names given to the commits are... peculiar.


SCM as Save Button.


i


Tbh I’m (loosely) following commit message best practices in all of my projects out of an irrational fear of being viewed as unprofessional. But I’ve never needed that effing prose in my workflow; maybe a keyword from time to time. I use code, not messages, to navigate history, and only on rare occasions. If all my messages turned into “i” I’d lose nothing, because all the rationales and essentials are in code comments. I’d rather see dates (and related grouping) in the log by default, and look for a commit by grepping patch contents rather than messages.


On several occasions I've found that the act of composing the commit message brings something to the front of my mind that I really ought to get into the patch but haven't done yet.

Otherwise the main use case I have for a commit message is that it fills out the GitHub PR for me with something useful.


Commit comments are how you find code diffs.

Focus on what/why in your commit messages.

If you can’t articulate that in your commit messages, I can almost guarantee you’re not thinking deeply enough about your code changes.


I appreciate the advice, but I believe this one misses the point I expressed, as it only reiterates a premise I already follow but find mostly useless irl.


I’ve been writing software for 30 years, and have often worked on OS code bases with revision history that extends back to the 1970s.

Your style of commit message is a crime against the future. It makes it impossible for a future developer to understand why you did something or what you were thinking.


The commit contents can tell you what changed, but they can never tell you why it was changed, what the author's thoughts were, what corner cases they took into account, whether it was a temporary workaround for something that's no longer a problem...

(unless of course the author is a stickler for inline comments, which I also approve of)


I mean… I try to include some intent in my commits, and inline comments.


(Just an anecdote about my personal projects)

I often switch between different ideas/features/projects in the same repo, and I like rebasing between different chains of thought. For me, the messages are more like a memorable phrase. I also prepend them with a keyword for the overall project so I don't need a long dev branch, which has surprisingly been useful once or twice.

Otherwise, not really useful for me. I mostly just `git add -u && git commit --amend --no-edit`.


Definitely the smoothest install process I've come across, and relatively snappy on my local Windows machine. I do hope to see some ControlNet integrations, as that's become a key part of my workflow for exploring new images.


Would you be willing to share a bit how you use controlnet in your exploration workflow?

My biggest discovery so far is using shuffle to guide the output style (and curating a folder of great style guide images).


Are there ways to run such apps with a remote GPU over network? I want to run the UI on my laptop, but use my homeserver GPU from the local network.

Anything better than X forwarding?


This project returns results over HTTP. You get what you want by running the server binary on the server, then accessing it with your web browser.


To flesh out this response with a common way to do this: the free tier of ngrok will likely help someone accomplish this, by running it on the server to get a routable endpoint to your ssh port (it wouldn't hurt to further lock things down with any firewalling available ngrok-side; I haven't used ngrok in a while).

For example, to run the Stable Diffusion webui (which defaults to port 7860), it would be something not too far off from:

$ ssh -p1234 -L7860:127.0.0.1:7860 assigned.dns.at.startup.ngrok.io

(IIRC ngrok free-tier will allocate you a random port on the public side every time you start the service)

Then you can just browse localhost:7860 from the remote field machine.

I got bit by the AI bug a few days ago, and I already rent some lightsail instances, one of which has a static IP reserved and one of my domain names pointing to it. So, I set up something a little more convoluted and perhaps unnecessarily complex, turning that lightsail VM into a jumpbox/bastion host. Anywho:

AutoSSH from server to my lightsail instance, with two remote port forwards (not local port forwards): SSH and SD webui. Then, connect from the field machine anywhere in the world to the jumpbox with matching local port forwards. (I set up both ports, so I can shell back into the original machine, but this isn't strictly required.)

Then, fire up localhost:7860 in the web browser. Make sure this isn't being served on 0.0.0.0 or un-firewalled.

(e: I re-read original question after posting and realized GP was merely asking for over the LAN, but hey, now they know how to do this over the 'net :)


Even easier is to use Tailscale. Run tailscale on your laptop and server, and then access the web UI by using <tailnet_server_ip>:<port>


Ooh, perfect. Now just have to wait for a Linux/docker release.


It was easy for me to set up on Ubuntu with CUDA 11.7. I cloned the repo, created a new python venv, and installed the requirements in requirements_versions.txt, but looking at launch.py, that wasn't even necessary. Just create a new venv for the project, activate it, and run `python launch.py`.
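The steps above, as commands (assuming git and a Python 3 with the venv module; the repo URL is from the post title):

```shell
# Clone and run; launch.py installs what it needs on first run, or you
# can pre-install requirements_versions.txt yourself as described above.
git clone https://github.com/lllyasviel/Fooocus.git
cd Fooocus
python3 -m venv venv
source venv/bin/activate
pip install -r requirements_versions.txt   # optional
python launch.py
```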


Just like I expected, I get this error when trying to run it on my AMD GPU...

"RuntimeError: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx"

Maybe it can get modified to use DirectML? Although it looks like it's using PyTorch 2.0, and I think torch-directml only supports 1.13. Why are ML and GPGPU such a dependency mess?


The sample image on the github page doesn't look great. Major problems with the eyes, something both SD and MJ have solved, for the most part.


Without something like Adetailer or the ComfyUI equivalent it’s kind of useless to do anything with relatively small faces.

For those that don’t know, the Adetailer extension for Auto1111 does a second pass on detected faces at a higher resolution, then inpaints them back into the image.
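A rough sketch of that idea with stand-in types (lists for images, index ranges for detected regions; none of this is Adetailer's real API):

```python
# Toy version of the "second pass on detected regions" idea: find
# regions, re-process each one, and paste the result back in place.
def detail_regions(image, find_regions, enhance):
    out = list(image)
    for start, end in find_regions(image):
        out[start:end] = enhance(image[start:end])  # higher-quality redo
    return out

# Stand-ins: the "detector" flags one region, the "enhancer" rounds it.
result = detail_regions([0.2, 0.7, 0.4],
                        lambda img: [(1, 3)],
                        lambda seg: [round(v) for v in seg])
```

The real extension does the same shape of work with face detection, an upscaled diffusion pass, and inpainting in place of these toy lambdas.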


Great steps. I would still like to see something offline that can blend two disparate images into one generated scene, like artbreeder has.


i know! it’s hard to make offline because of gpu requirements



Made a Discord bot with this. Check it out here http://fooocus.ai


I wonder if some of this can be ported to HF diffusers.

Lots of the changes just... make sense.



