Curious about the multimodal model's architecture. But alas, when I try to request access:
> Llama 3.2 Multimodal is not available in your region.
It sounds like they feed the continuous output of an image encoder into a transformer, similar to Transfusion [0]? Does anyone know where to find more details?
Edit:
> Regarding the licensing terms, Llama 3.2 comes with a very similar license to Llama 3.1, with one key difference in the acceptable use policy: any individual domiciled in, or a company with a principal place of business in, the European Union is not being granted the license rights to use multimodal models included in Llama 3.2. [1]
What a bummer.

0. https://www.arxiv.org/abs/2408.11039
1. https://huggingface.co/blog/llama32#llama-32-license-changes...
If you are still curious about the architecture, from the blog:
> To add image input support, we trained a set of adapter weights that integrate the pre-trained image encoder into the pre-trained language model. The adapter consists of a series of cross-attention layers that feed image encoder representations into the language model. We trained the adapter on text-image pairs to align the image representations with the language representations. During adapter training, we also updated the parameters of the image encoder, but intentionally did not update the language-model parameters. By doing that, we keep all the text-only capabilities intact, providing developers a drop-in replacement for Llama 3.1 models.
What this crudely means is that they extended the base Llama 3.1 model to include image-based weights and inference. You can do that if you freeze the existing weights and add new ones, which are then updated during the training runs (adapter training). Then they did SFT and RLHF runs on the composite model (for lack of a better word). This is a little-known technique, and very effective. I just had a paper accepted about a similar technique; I'll share a blog post once it's published, if you are interested (though it's not on this scale, and probably not as effective). Side note: that is also why the parameter sizes are 11B and 90B, additions on top of the text-only models.
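To make that concrete, here's a minimal PyTorch sketch of the setup (my own illustration, not Meta's actual code; module sizes and names are invented):

    import torch
    import torch.nn as nn

    class CrossAttentionAdapter(nn.Module):
        """Adapter block: text hidden states attend to image features."""
        def __init__(self, d_model, n_heads):
            super().__init__()
            self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            self.norm = nn.LayerNorm(d_model)

        def forward(self, text_hidden, image_feats):
            # Queries come from the language model; keys/values from the image encoder.
            attended, _ = self.attn(text_hidden, image_feats, image_feats)
            # Residual connection keeps the text-only path intact when no image is present.
            return self.norm(text_hidden + attended)

    # Stand-ins for the real pre-trained modules.
    language_model = nn.TransformerEncoder(
        nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True), num_layers=2
    )
    image_encoder = nn.Linear(1024, 512)  # a real system would use a ViT here
    adapter = CrossAttentionAdapter(d_model=512, n_heads=8)

    # Freeze the language model; train the adapter and (per the blog) the image encoder.
    for p in language_model.parameters():
        p.requires_grad = False
    optimizer = torch.optim.AdamW(
        list(adapter.parameters()) + list(image_encoder.parameters()), lr=1e-4
    )

Since the new cross-attention layers and the image encoder sit beside the frozen base weights rather than replacing them, the total parameter count grows, which is presumably how you get 11B and 90B on top of the text-only 8B and 70B.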
Thanks for the info; I also found the model card now. So it seems like they went the way of grafting models together, which I find less interesting, tbh.
In the Transfusion paper, they use both discrete (text tokens) and continuous (image) signals to train a single transformer. To do this, they use a VAE to create a latent representation of the images (split into patches), which is fed into the transformer in one linear sequence alongside the text tokens. They trained the whole model from scratch (the largest being a 7B model trained on 2T tokens with a 1:1 text:image split). The loss they trained on was a combination of the normal language-modeling loss (cross-entropy on tokens) and the DDPM diffusion loss on the images.
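So, very roughly, the objective looks like this (a toy sketch; the weighting coefficient is a hyperparameter and the value here is arbitrary, not from the paper):

    import torch
    import torch.nn.functional as F

    def transfusion_loss(text_logits, text_targets, noise_pred, noise, lam=1.0):
        # Standard LM objective: cross-entropy over the discrete text tokens.
        lm_loss = F.cross_entropy(text_logits.flatten(0, 1), text_targets.flatten())
        # DDPM objective: predict the noise added to the continuous image latents.
        ddpm_loss = F.mse_loss(noise_pred, noise)
        # One set of weights is trained on the weighted sum of both.
        return lm_loss + lam * ddpm_loss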
There was some prior art on this, but models like Chameleon discretized the images into a token codebook of a fixed size, so there were special tokens representing the images. However, this incurs severe information loss, which Transfusion claims to alleviate by using continuous latent vectors for the images.
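A toy contrast between the two routes (shapes and codebook size invented for illustration):

    import torch
    import torch.nn as nn

    codebook = torch.randn(8192, 256)   # made-up size: 8192 learned image tokens
    patch = torch.randn(1, 256)         # one image patch embedding

    # Chameleon-style: snap each patch to its nearest codebook entry.
    # Everything the codebook can't represent is lost at this step.
    token_id = torch.cdist(patch, codebook).argmin()

    # Transfusion-style: keep the continuous VAE latent and simply project it
    # into the transformer's embedding space, so there is no quantization bottleneck.
    to_model_dim = nn.Linear(256, 512)
    embedding = to_model_dim(patch)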
Training a single set of weights (shared weights) on different modalities seems more interesting going forward, in particular for emergent phenomena, imo.
Some of the authors of the Transfusion paper work at Meta, so I was hoping they had trained a larger-scale model, or released any Transfusion-based weights at all.
> With respect to any multimodal models included in Llama 3.2, the rights granted under Section 1(a) of the Llama 3.2 Community License Agreement are not being granted to you if you are an individual domiciled in, or a company with a principal place of business in, the European Union. This restriction does not apply to end users of a product or service that incorporates any such multimodal models.
Between this and Apple's policies, big tech corporations really seem to be putting the screws to the EU as much as they can.
"See, consumers? Look at how bad your regulation is, that you're missing out on all these cool things we're working on. Talk to your politicians!"
Regardless of your political opinion on the subject, you've got to admit, at the very least, it will be educational to see how this develops over the next 5-10 years of tech progress, as the EU gets excluded from more and more things.
Or, again, they are just deciding the EU market isn't worth the cost.
(or not worth prioritizing upfront or ....)
When we had numerous discussions on HN as these rules were being implemented, this is precisely what the Europeans said should happen.
So why does it now have to be some concerted effort to "put the screws to EU"?
I otherwise agree it will be interesting, but mostly in the sense that I watched people swear up and down that this was just about protecting EU citizens, and that they were fine with these companies doing nothing in the EU, or not prioritizing the EU, if they decided it wasn't worth the cost.
We'll see if that's true or not, I guess, or if they really wanted it to be "you have to do it, but on our terms" or whatever.
> Between this and Apple's policies, big tech corporations really seem to be putting the screws to the EU as much as they can.
Funny, I see that the other way around, actually. The EU is forcing Big Tech to be transparent and not exploit their users. It's the companies that must choose to comply, or take their business elsewhere. Let's not forget that Apple users in the EU can use 3rd-party stores, and it was EU regulations that forced Apple to switch to USB-C. All of these are a win for consumers.
The reason Meta is not making their models available in the EU is because they can't or won't comply with the recent AI regulations. This only means that the law is working as intended.
> it will be educational to see how this develops over the next 5-10 years of tech progress, as the EU gets excluded from more and more things.
I don't think we're missing much that Big Tech has to offer, and we'll probably be better off for it. I'm actually in favor of even stricter regulations, particularly around AI, but what was recently enacted is a good start.
> The reason Meta is not making their models available in the EU is because they can't or won't comply with the recent AI regulations. This only means that the law is working as intended.
It isn't clear at all, and in fact, given how light-handed the European Commission is when dealing with infringement cases (no fines before lots of warnings, and even clarification meetings about how to comply with the law), Meta would take no risk at all releasing something now, even if they needed to roll it back later.
They are definitely trying to put pressure on the European Commission, leveraging the fact that Thierry Breton was dismissed.
This makes it sound like some kind of retaliation, rather than Meta attempting to comply with the very regulations you're talking about. Maybe Llama 3.2 would violate the existing face-recognition database policies?