Transfusion: Predict the next token and diffuse images with one multimodal model
by fzliu
This is such a natural extension to LLMs. I’m shocked it hasn’t been tried before.
When I ask a diffusion model to generate a chessboard, I’d expect the pieces to be placed randomly. We are getting closer to image generators that not only know what chess pieces look like but also where to place them.
You can talk to the authors directly on alphaXiv! https://www.alphaxiv.org/abs/2408.11039v1
It doesn't look like they are active there.
Stupid question: is their 7B model available? Is there public inference code that we could run? Or do they not usually release them along with these kinds of papers?
Doesn't appear to be any weights uploaded anywhere that I can find.
There are the starts of two (non-original-author) public implementations available on Github, but again -- doesn't appear to be any pretrained weights in either.
I’d also like to know this.
Hmm. I wonder if this is similar to Diffusion Transformers?
This is somewhat similar, but diffusion transformers typically use a pre-trained text model for text conditioning, whereas in this case the text and image components are integrated and trained together multimodally.
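For a rough sense of what "trained together" means here, below is a minimal PyTorch sketch (not the authors' code) of the kind of combined objective the paper describes: one shared transformer, a next-token loss on text positions, and a diffusion (noise-prediction) loss on image patch positions. All module names, dimensions, the noising scheme, and the loss weight are illustrative assumptions.

```python
# Minimal sketch of a joint next-token + diffusion objective on one transformer.
# Everything here (names, sizes, noising, lambda) is an assumption for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F

d_model, vocab_size, patch_dim = 512, 32000, 256

shared = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True),
    num_layers=4,
)
text_embed = nn.Embedding(vocab_size, d_model)
patch_in = nn.Linear(patch_dim, d_model)    # noisy image patches -> model space
lm_head = nn.Linear(d_model, vocab_size)    # next-token prediction for text
noise_head = nn.Linear(d_model, patch_dim)  # noise prediction for image patches

# Toy batch: a text prefix followed by noisy image patch latents.
text_tokens = torch.randint(0, vocab_size, (2, 16))    # (batch, text_len)
clean_patches = torch.randn(2, 8, patch_dim)           # (batch, n_patches, patch_dim)
noise = torch.randn_like(clean_patches)
t = torch.rand(2, 1, 1)                                # diffusion timestep in [0, 1]
noisy_patches = (1 - t) * clean_patches + t * noise    # simple linear noising (assumption)

seq = torch.cat([text_embed(text_tokens), patch_in(noisy_patches)], dim=1)
hidden = shared(seq)
text_hidden, image_hidden = hidden[:, :16], hidden[:, 16:]

# Next-token loss on the text positions (targets shifted by one).
lm_loss = F.cross_entropy(
    lm_head(text_hidden[:, :-1]).reshape(-1, vocab_size),
    text_tokens[:, 1:].reshape(-1),
)
# Diffusion loss: predict the noise that was added to the image patches.
diff_loss = F.mse_loss(noise_head(image_hidden), noise)

lambda_diff = 5.0                                      # weighting is an assumption
loss = lm_loss + lambda_diff * diff_loss
loss.backward()
```

The point is that both losses backpropagate through the same transformer, so the text and image representations are learned jointly rather than conditioning on a frozen text encoder as in a typical DiT setup.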
Would such a model be able to give more accurate descriptions of images as well?
I think so, especially with fine-tuning.