Curious about the multimodal model's architecture. But alas, when I try to request access:
> Llama 3.2 Multimodal is not available in your region.
It sounds like they feed the continuous output of an image encoder into a transformer, similar to Transfusion [0]? Does anyone know where to find more details?
Edit:
> Regarding the licensing terms, Llama 3.2 comes with a very similar license to Llama 3.1, with one key difference in the acceptable use policy: any individual domiciled in, or a company with a principal place of business in, the European Union is not being granted the license rights to use multimodal models included in Llama 3.2. [1]
What a bummer.

0. https://www.arxiv.org/abs/2408.11039
1. https://huggingface.co/blog/llama32#llama-32-license-changes...
If you are still curious about the architecture, from the blog:
> To add image input support, we trained a set of adapter weights that integrate the pre-trained image encoder into the pre-trained language model. The adapter consists of a series of cross-attention layers that feed image encoder representations into the language model. We trained the adapter on text-image pairs to align the image representations with the language representations. During adapter training, we also updated the parameters of the image encoder, but intentionally did not update the language-model parameters. By doing that, we keep all the text-only capabilities intact, providing developers a drop-in replacement for Llama 3.1 models.
What this crudely means is that they extended the base Llama 3.1 model to include image-based weights and inference. You can do that if you freeze the existing weights and add new ones, which are then updated during the training runs (adapter training). Then they did SFT and RLHF runs on the composite model (for lack of a better word). This is a little-known technique, and very effective. I just had a paper accepted about a similar technique; I'll share a blog post once it's published, if you are interested (though it's not on this scale, and probably not as effective). Side note: that is also why the parameter sizes are 11B and 90B, additions on top of the text-only models.
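To make that concrete, here's a minimal PyTorch sketch of the setup (my own illustration, not Meta's actual code; module sizes and names are invented):

    import torch
    import torch.nn as nn

    class CrossAttentionAdapter(nn.Module):
        """Adapter block: text hidden states attend to image features."""
        def __init__(self, d_model, n_heads):
            super().__init__()
            self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            self.norm = nn.LayerNorm(d_model)

        def forward(self, text_hidden, image_feats):
            # Queries come from the language model; keys/values from the image encoder.
            attended, _ = self.attn(text_hidden, image_feats, image_feats)
            # Residual connection keeps the text-only path intact when no image is present.
            return self.norm(text_hidden + attended)

    # Stand-ins for the real pre-trained modules.
    language_model = nn.TransformerEncoder(
        nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True), num_layers=2
    )
    image_encoder = nn.Linear(1024, 512)  # a real system would use a ViT here
    adapter = CrossAttentionAdapter(d_model=512, n_heads=8)

    # Freeze the language model; train the adapter and (per the blog) the image encoder.
    for p in language_model.parameters():
        p.requires_grad = False
    optimizer = torch.optim.AdamW(
        list(adapter.parameters()) + list(image_encoder.parameters()), lr=1e-4
    )

Since the new cross-attention layers and the image encoder sit beside the frozen base weights rather than replacing them, the total parameter count grows, which is presumably how you get 11B and 90B on top of the text-only 8B and 70B.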
Thanks for the info; I also found the model card now. So it seems like they went the way of grafting models together, which I find less interesting, tbh.
In the Transfusion paper, they use both discrete (text tokens) and continuous (image) signals to train a single transformer. To do this, they use a VAE to create a latent representation of the images (split into patches), which is fed into the transformer in one linear sequence alongside the text tokens. They trained the whole model from scratch (the largest being a 7B model trained on 2T tokens with a 1:1 text:image split). The loss they trained on was a combination of the normal language-modeling loss (cross-entropy on tokens) and the DDPM diffusion loss on the images.
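So, very roughly, the objective looks like this (a toy sketch; the weighting coefficient is a hyperparameter and the value here is arbitrary, not from the paper):

    import torch
    import torch.nn.functional as F

    def transfusion_loss(text_logits, text_targets, noise_pred, noise, lam=1.0):
        # Standard LM objective: cross-entropy over the discrete text tokens.
        lm_loss = F.cross_entropy(text_logits.flatten(0, 1), text_targets.flatten())
        # DDPM objective: predict the noise added to the continuous image latents.
        ddpm_loss = F.mse_loss(noise_pred, noise)
        # One set of weights is trained on the weighted sum of both.
        return lm_loss + lam * ddpm_loss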
There was some prior art on this, but models like Chameleon discretized the images into a token codebook of a fixed size, so there were special tokens representing the images. However, this incurs severe information loss, which Transfusion claims to alleviate by using continuous latent vectors for the images.
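A toy contrast between the two routes (shapes and codebook size invented for illustration):

    import torch
    import torch.nn as nn

    codebook = torch.randn(8192, 256)   # made-up size: 8192 learned image tokens
    patch = torch.randn(1, 256)         # one image patch embedding

    # Chameleon-style: snap each patch to its nearest codebook entry.
    # Everything the codebook can't represent is lost at this step.
    token_id = torch.cdist(patch, codebook).argmin()

    # Transfusion-style: keep the continuous VAE latent and simply project it
    # into the transformer's embedding space, so there is no quantization bottleneck.
    to_model_dim = nn.Linear(256, 512)
    embedding = to_model_dim(patch)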
Training a single set of weights (shared weights) on different modalities seems more interesting going forward, in particular for emergent phenomena, imo.
Some of the authors of the Transfusion paper work at Meta, so I was hoping they had trained a larger-scale model, or released any Transfusion-based weights at all.
> With respect to any multimodal models included in Llama 3.2, the rights granted under Section 1(a) of the Llama 3.2 Community License Agreement are not being granted to you if you are an individual domiciled in, or a company with a principal place of business in, the European Union. This restriction does not apply to end users of a product or service that incorporates any such multimodal models.
Between this and Apple's policies, big tech corporations really seem to be putting the screws to the EU as much as they can.
"See, consumers? Look at how bad your regulation is, that you're missing out on all these cool things we're working on. Talk to your politicians!"
Regardless of your political opinion on the subject, you've got to admit, at the very least, it will be educational to see how this develops over the next 5-10 years of tech progress, as the EU gets excluded from more and more things.
Or, again, they are just deciding the EU market isn't worth the cost.
(or not worth prioritizing upfront or ....)
When we had numerous discussions on HN as these rules were being implemented, this is precisely what the Europeans said should happen.
So why does it now have to be some concerted effort to "put the screws to EU"?
I otherwise agree it will be interesting, but mostly in the sense that I watched people swear up and down that this was just about protecting EU citizens, and that they were fine with these companies doing nothing in the EU, or not prioritizing the EU, if they decided it wasn't worth the cost.
We'll see if that's true or not, I guess, or if they really wanted it to be "you have to do it, but on our terms" or whatever.
> Between this and Apple's policies, big tech corporations really seem to be putting the screws to the EU as much as they can.
Funny, I see that the other way around, actually. The EU is forcing Big Tech to be transparent and not exploit their users. It's the companies that must choose to comply, or take their business elsewhere. Let's not forget that Apple users in the EU can use 3rd-party stores, and it was EU regulations that forced Apple to switch to USB-C. All of these are a win for consumers.
The reason Meta is not making their models available in the EU is because they can't or won't comply with the recent AI regulations. This only means that the law is working as intended.
> it will be educational to see how this develops over the next 5-10 years of tech progress, as the EU gets excluded from more and more things.
I don't think we're missing much that Big Tech has to offer, and we'll probably be better off for it. I'm actually in favor of even stricter regulations, particularly around AI, but what was recently enacted is a good start.
> The reason Meta is not making their models available in the EU is because they can't or won't comply with the recent AI regulations. This only means that the law is working as intended.
It isn't clear at all, and in fact, given how light-handed the European Commission is when dealing with infringement cases (no fines before lots of warnings, and even clarification meetings about how to comply with the law), Meta would take no risk at all releasing something now, even if they needed to roll it back later.
They are definitely trying to put pressure on the European Commission, leveraging the fact that Thierry Breton was dismissed.
This makes it sound like some kind of retaliation, rather than Meta attempting to comply with the very regulations you're talking about. Maybe Llama 3.2 would violate the existing face-recognition database policies?