Transfusion: Predict the next token and diffuse images with one multimodal model
by fzliu
This is such a natural extension to LLMs. I’m shocked it hasn’t been tried before.
When I ask a diffusion model to generate a chessboard, I’d expect the pieces to be placed randomly. We are getting closer to image generators that not only know what chess pieces look like but also where to place them.
You can talk to the authors directly on alphaXiv! https://www.alphaxiv.org/abs/2408.11039v1
It doesn't look like they are active there.
Stupid question: is their 7B model available? Is there public inference code that we could run? Or do they not usually release them along with these kinds of papers?
Doesn't appear to be any weights uploaded anywhere that I can find.
There are the starts of two (non-original-author) public implementations available on Github, but again -- doesn't appear to be any pretrained weights in either.
I’d also like to know this.
Hmm. I wonder if this is similar to Diffusion Transformers?
This is somewhat similar, but diffusion transformers typically use a pre-trained text model for text conditioning, whereas in this case the text and image components are integrated and trained together multimodally.
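For a rough sense of what "trained together" means here, below is a minimal PyTorch sketch (not the authors' code) of the kind of combined objective the paper describes: one shared transformer, a next-token loss on text positions, and a diffusion (noise-prediction) loss on image patch positions. All module names, dimensions, the noising scheme, and the loss weight are illustrative assumptions.

```python
# Minimal sketch of a joint next-token + diffusion objective on one transformer.
# Everything here (names, sizes, noising, lambda) is an assumption for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F

d_model, vocab_size, patch_dim = 512, 32000, 256

shared = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True),
    num_layers=4,
)
text_embed = nn.Embedding(vocab_size, d_model)
patch_in = nn.Linear(patch_dim, d_model)    # noisy image patches -> model space
lm_head = nn.Linear(d_model, vocab_size)    # next-token prediction for text
noise_head = nn.Linear(d_model, patch_dim)  # noise prediction for image patches

# Toy batch: a text prefix followed by noisy image patch latents.
text_tokens = torch.randint(0, vocab_size, (2, 16))    # (batch, text_len)
clean_patches = torch.randn(2, 8, patch_dim)           # (batch, n_patches, patch_dim)
noise = torch.randn_like(clean_patches)
t = torch.rand(2, 1, 1)                                # diffusion timestep in [0, 1]
noisy_patches = (1 - t) * clean_patches + t * noise    # simple linear noising (assumption)

seq = torch.cat([text_embed(text_tokens), patch_in(noisy_patches)], dim=1)
hidden = shared(seq)
text_hidden, image_hidden = hidden[:, :16], hidden[:, 16:]

# Next-token loss on the text positions (targets shifted by one).
lm_loss = F.cross_entropy(
    lm_head(text_hidden[:, :-1]).reshape(-1, vocab_size),
    text_tokens[:, 1:].reshape(-1),
)
# Diffusion loss: predict the noise that was added to the image patches.
diff_loss = F.mse_loss(noise_head(image_hidden), noise)

lambda_diff = 5.0                                      # weighting is an assumption
loss = lm_loss + lambda_diff * diff_loss
loss.backward()
```

The point is that both losses backpropagate through the same transformer, so the text and image representations are learned jointly rather than conditioning on a frozen text encoder as in a typical DiT setup.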
Would such a model be able to give more accurate descriptions of images as well?
I think so, especially with fine-tuning.