J&J Talk AI Season 2 Episode 05: Practical Applications of Multimodal Vision Models

Johannes Dienst
October 10, 2023

Transcript


JD (Johannes Dienst): Welcome to J&J Talk AI, where we explore the world of AI and machine learning. This season we are covering advanced techniques for machine learning, applications of computer vision, and its challenges. This time we are talking about the new hot shit: multimodal vision models and their applications. With me again is Johannes.

JH (Johannes Haux): Hi.

JD: So, multimodal vision models, Johannes. That sounds really futuristic. What are multimodal vision models?

JH: I think first of all we should understand what multimodal means. Basically multimodal says I have different modalities, right, multiple modalities and a modality in our case if we pair it with a vision model means I have image data for example that's the vision part and I combine this somehow with textual data, with audio data with any kind of modality that I want. So multimodal vision models are a number of modalities combined with the vision modality.

JD: I read today about images, for example, not images in the sense we know them, but thermal images, like infrared and that kind of stuff.

JH: I think the distinction "multimodal vision models" comes from the fact that we are, I assume, very strongly primed on the vision part, right? That's how we get information into our system: through our eyes, through images. That's why we make the distinction.

But basically, anything where I combine data like thermal data, as you said, or any kind of stock data, I don't know, anything you can think of, and I combine those, then I have a multimodal model.

JD: So when you think about multimodal, the basic assumption for me as an outsider would be: okay, do we have different models, one for text, one for audio, one for video or images, and then they get translated into the model and these models talk to each other somehow? Or how is that built?

JH: I think you already got the very gist of it. You said talk to each other somehow and the somehow is the crucial part, right? If I have multiple or different modalities like images and texts, the question is how do I compare those, right?

If I want to combine these two types of data, how do I represent them in a way that I can actually understand them in the context of each other? That's the important part. You could have a number of models, like expert models for certain tasks, come together and work together, or you could have this one huge model that can do anything. It doesn't really matter. The important part is that I have this representation that works for any of those modalities.

JD: So, let's keep it simple for me: we break down text into a representation, and we break down the image into the same representation, and then we are able to reason about them.

JH: Exactly.

JD: Maybe you could give an example of how you could do this. I think that makes it a little easier to grasp.

JH: The most famous example of a multimodal model is the CLIP model. They were basically the first to do this really well, where they paired textual descriptions with the corresponding images. What they did is they trained the model, which basically consists of two parts, a text encoder and an image encoder, to map the information in the text and the information in the image into a shared space.

So you can think of it as a vector space. If I have, for example, the image of a dog and the description "this is a black dog running on the grass", then what I would want is that the encoder for the text part and the encoder for the image part produce points in that vector space that are close to each other.

JD: This way, I can basically then compare the text part and the image part.

JH: It's a bit more tricky. So to achieve this, you need to make sure that only similar points are close to each other and not points that shouldn't be close to each other. So they have a few training tricks in there to make this happen. But basically that's the gist of it.
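
For readers who want to see what those training tricks boil down to, here is a minimal sketch of a CLIP-style symmetric contrastive loss. The function name, temperature value, and tensor shapes are illustrative choices, not something taken from the episode.

```python
# Sketch of a CLIP-style symmetric contrastive loss: matching text/image pairs
# are pulled together in the shared space, mismatched pairs are pushed apart.
import torch
import torch.nn.functional as F

def clip_style_loss(image_emb: torch.Tensor, text_emb: torch.Tensor,
                    temperature: float = 0.07) -> torch.Tensor:
    # Normalize so that dot products become cosine similarities.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Similarity of every image in the batch to every caption in the batch.
    logits = image_emb @ text_emb.t() / temperature

    # The i-th image belongs to the i-th caption, so the diagonal is the target.
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_images = F.cross_entropy(logits, targets)      # image -> text direction
    loss_texts = F.cross_entropy(logits.t(), targets)   # text -> image direction
    return (loss_images + loss_texts) / 2
```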

I "simply", in quotation marks, project these two modalities into the same vector space and ensure that the information I want to be close to each other is close to each other.
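
Once the two encoders project into that shared space, comparing a caption to an image is just a distance measurement. Here is a minimal sketch using the openly released CLIP weights via Hugging Face transformers; the image path and caption are illustrative.

```python
# Sketch: embed a caption and an image into CLIP's shared vector space and
# measure how close they are. Assumes `pip install transformers torch pillow`.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("dog.jpg")  # any local image works here
caption = "a black dog running on the grass"

inputs = processor(text=[caption], images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    text_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                       attention_mask=inputs["attention_mask"])
    image_emb = model.get_image_features(pixel_values=inputs["pixel_values"])

# Cosine similarity: higher values mean the text and the image match better.
similarity = torch.nn.functional.cosine_similarity(text_emb, image_emb)
print(similarity.item())
```

Using cosine similarity here mirrors how CLIP itself compares embeddings after normalization.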

JD: That sounds interesting. When I think about it, this opens up a lot of different use cases, in the real world but also in a more general sense. So what are the general use cases you could tackle with these models?

JH: Well first of all, image search, right? If I can basically compare concepts, so to say, from text to image space, that's an easy way to find matching images. Or I can compare descriptions of images to find the best matching one. That would be, I think, a very simple, straightforward example for this. But I also can make use of these representations in different tasks.

So if I say, hey, I've got this text encoder, it's fixed, I trained it, and I get those representations out of it, I can make use of this in other models and use it as conditioning, for example. So another famous example of that would be the Stable Diffusion model, which is used for image generation conditioned on textual prompts. I think Midjourney works pretty much the same. So yeah, there's a lot you can do with that.
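
As a rough illustration of such text conditioning, here is a sketch with the diffusers library. The model checkpoint and prompt are illustrative choices, and the example assumes a GPU is available.

```python
# Sketch: generate an image conditioned on a text prompt with Stable Diffusion.
# Assumes `pip install diffusers transformers torch` and enough GPU memory.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
)
pipe = pipe.to("cuda")

prompt = "a black dog running on the grass, photorealistic"
image = pipe(prompt).images[0]  # the text prompt conditions the denoising process
image.save("dog_on_grass.png")
```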

JD: Can I also ask my model something? The famous visual question answering question would be: is the girl walking the bike, or are there only a bike and a girl in the picture?

JH: I mean, you cannot use CLIP for that. If we want to stick to that example, you would need a visual question answering model that is trained to work in this kind of dialogue-like way. But there are models that do exactly that. So you can definitely do that.

JD: That's cool. Yeah, so like with ChatGPT, I can really converse with my chatbot, more or less, and ask questions about stuff, even if it's an image.

JH: Yeah, I mean, I think GPT-4 now has this image feature activated, right? I think they advertised it when we started this podcast, and I think by now it's online, if I remember correctly.

And there are other models out there. From Microsoft, there's a model called Kosmos-2, which I really like, and which does basically that, I think. You can say, here's an image, for example of a person sitting next to a fireplace, I think that's one of their examples in the paper, and you can draw a bounding box around the fire and then ask what is there, and the model will output that it's a fire. And then you can ask, okay, is there something next to it? And then the model can draw a bounding box around the person sitting next to it and say, there's a person, and it's sitting in this part of the image. So this way, they figured out how to actually ground a language model in visual space, which is super, super interesting, I think.
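
For the curious, Kosmos-2 is published on Hugging Face, and grounded generation roughly follows the pattern sketched below. The image path is illustrative, and the exact format of the post-processed entities may differ slightly from the comment.

```python
# Sketch: grounded image description with Kosmos-2. The <grounding> tag asks the
# model to link the generated phrases to bounding boxes in the image.
from PIL import Image
from transformers import AutoProcessor, Kosmos2ForConditionalGeneration

model_id = "microsoft/kosmos-2-patch14-224"
model = Kosmos2ForConditionalGeneration.from_pretrained(model_id)
processor = AutoProcessor.from_pretrained(model_id)

image = Image.open("fireplace_scene.jpg")  # illustrative local image
prompt = "<grounding>An image of"

inputs = processor(text=prompt, images=image, return_tensors="pt")
generated_ids = model.generate(**inputs, max_new_tokens=64)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]

# Split the raw output into plain text plus (phrase, span, bounding boxes) tuples.
caption, entities = processor.post_process_generation(generated_text)
print(caption)
print(entities)
```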

JD: Yeah, that's really cool. So now let's talk about real-world applications of multimodal vision models. What are the real-world applications?

JH: Well, I mean, image generation is, I think, the most famous current real-world application, and it is being used heavily. And I think there is also a lot of money in that market right now. I think it's a bit hyped, but it's a real thing. So that's a real-world application.

I must say, I think what we're doing at AskUI, for example, also falls into a similar area. We're not yet doing things like pairing a language model with the vision part, but what we are doing is basically saying, hey, do the following thing by clicking on a screen. So, for example, do the login process by first clicking the login button, entering your details, and so on. So that's also a multimodal, real-world application, and I think a really, really handy one, not only for testers. If you think of it in broader terms, what are visual question answering tools used for? If you're blind or visually impaired, you can ask about the content of an image that's on your screen and does not have an alt text.

JD: Great example.

JH: The model will output this. Say I want to do a login process. I cannot really see what text is written, and the webpage is not implemented with all the ARIA labels. Let's use a multimodal vision model for that, one that can do the clicks by following my instructions. These are, I think, really, really interesting real-world applications that exist already right now. The task we're currently solving is making those actually usable in an easy-to-use manner.

JD: Another real-world application I thought of was the translation of sign language automatically into audio, for example.

JH: Yeah, a really, really nice example. Or the other way around, right? Taking an audio stream and reversing it: putting it into text or generating a sign language video out of that. Why not, right? Maybe it's easier to follow somebody doing sign language than reading a text. I don't know.

JD: Yes, yes. Then I came across the term zero-shot learning. What is that?

JH: So the zero-shot implies that you have a shot at something, right? But you have zero shots. So the model basically has to solve the task without having seen any examples of how to solve that specific task. That does not mean that I have not trained the model on solving this class of tasks.

But for example, if we are thinking again about language models, an example of few-shot task solving, few-shot learning, would be: I have a fixed, trained text model and I say, here's an example of what I want you to output. Here's a question, this is the answer. Here's another question, that's the answer. Now answer this question. And then you had two shots, basically. Now it's your third shot.

Zero-shot would be, here's my question, answer it without any previous example. And you can apply this kind of thinking to any kind of model where you basically say, okay, here you go, solve it without ever having explained explicitly how to solve that specific task.
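
Applied to vision models, this is how CLIP is commonly used for zero-shot image classification: the candidate labels are handed over as plain text at inference time, with no task-specific training. A sketch, with illustrative labels and image path:

```python
# Sketch: zero-shot image classification with CLIP. No task-specific training;
# the candidate labels are supplied as plain text at inference time.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("street_scene.jpg")  # illustrative local image
labels = ["a photo of a dog", "a photo of a bicycle", "a photo of a fireplace"]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds the image-text similarity scores; softmax turns them
# into a probability distribution over the candidate labels.
probs = outputs.logits_per_image.softmax(dim=-1)
for label, p in zip(labels, probs[0].tolist()):
    print(f"{label}: {p:.2f}")
```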

JD: Can you think of another real-world application? And if not, then we stop.

JH: Oh man. I can dream up real-world applications, which I don't want to do because that's always a bit hand-wavy. But I have to reiterate: one of my favorite real-world applications is what we're doing at AskUI, because it's giving eyes and hands to a computer and allowing it to follow instructions from a user, which is just super interesting. A really nice multimodal vision model.

JD: Thank you, Johannes. That was our last episode for season two of our podcast. I hope we see you again in a few months because we're going on vacation now.


