J&J Talk AI Season 2 Episode 04: Generative AI - GANs and NeRFs

Johannes Dienst
October 10, 2023

Transcript


JD (Johannes Dienst): Welcome to J&J Talk AI, episode 4 of season 2, together with Johannes again.

JH (Johannes Haux): Hi.

JD: And we touched on this topic in the previous episode: generative adversarial networks, in short GANs. Not guns, but GANs. Actually, when I researched this topic it felt a little bit like magic, and I hope Johannes can dispel the magic, because the explanation went like: okay, we have noise and we have two networks, one generator network and one test network, they are pitched against each other, and at the end we have image generation. That was the basic explanation of what GANs are. Is that correct?

JH: That is correct, but I will never dispel the magic because it's pure magic.

Yeah, generative adversarial networks are like a crazy idea that just works. So as you said, right, it's about generating images from noise.

So I have some set of random numbers stored in a vector, for example, or you could also take a 2D array of noise vectors, feed them into some generative neural network, and in the end you'll get an image that depends on the noise you put in. We'll talk later about why you would even want to do that. The reason this works is that you have a so-called discriminator network, I think you called it a test network, which tries to tell you whether an image is coming from the ground truth image distribution, so it's part of the training data set, for example, or whether it's a generated image. So you show this discriminator...

JD: Just to ask, what is the ground truth data set? The real images, or can they be cuts from real images?

JH: Well, whatever you want, right? You can take any kind of image set, but one of the most famous data sets in the early days of GANs was the CelebA data set, which was public-domain images of celebrities' faces. It's an interesting data set because faces are something that we as humans can really easily tell apart as real or not.

So it's a really hard task to generate an image of a face that you would believe is real. You have this set of hundreds of thousands of images of celebrities and you want to build a model that can generate more of those. That's the overall goal.

And what you do is you take some noise, you generate an image from it, and then you ask the discriminator network, this generated image, is this a real image or not? And then it will say yes or no. Maybe it will say no, but it's also learning. So in the beginning it might say yes. And then you show the discriminator network a real image and ask it, is that real? Is it fake? And then you train it to say yes to the correct image, so from the ground truth data distribution, and to say no to the generated image.

But the discriminator is not the only network being trained. You also train your generative network to generate images such that the discriminator network says: yes, this is a real image. So the generator tries to fool the discriminator into thinking it's seeing a real image, and the discriminator tries to figure out whether it's being fooled. In the end, this leads to the generator generating images that are more and more similar to the images from the data set, while the discriminator becomes ever smarter about what a real image looks like.

But in the end, the discriminator will be completely overwhelmed. The training of such a GAN is successful if the discriminator ends up always outputting a 50% chance of an image being real or fake, because it cannot distinguish real and fake images anymore. And that's how you get realistic-looking images from pure noise, which to me is magic.
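
For readers who want to see the idea in code, here is a minimal sketch of the adversarial training loop described above, written in PyTorch. The tiny fully connected networks, the 28x28 image size, and the learning rates are illustrative assumptions, not details from the episode.

```python
# Minimal sketch of the adversarial training loop described above (PyTorch).
# The tiny fully connected networks, 28x28 image size, and learning rates are
# illustrative assumptions, not details from the episode.
import torch
import torch.nn as nn

latent_dim = 100
generator = nn.Sequential(
    nn.Linear(latent_dim, 256), nn.ReLU(), nn.Linear(256, 28 * 28), nn.Tanh()
)
discriminator = nn.Sequential(
    nn.Linear(28 * 28, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1)
)
opt_g = torch.optim.Adam(generator.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

def train_step(real_images):  # real_images: (batch, 28*28) tensor
    batch = real_images.size(0)
    noise = torch.randn(batch, latent_dim)
    fake_images = generator(noise)

    # Discriminator: say "real" (1) for ground-truth images, "fake" (0) for generated ones.
    d_loss = bce(discriminator(real_images), torch.ones(batch, 1)) + \
             bce(discriminator(fake_images.detach()), torch.zeros(batch, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator: try to fool the discriminator into answering "real" for generated images.
    g_loss = bce(discriminator(fake_images), torch.ones(batch, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```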

JD: Sounds like science fiction from the movie theaters, actually, if I think about it, pitching AIs against each other.

JH: Yeah, and to be honest, it is adversarial training, right? You are training things to fool each other, which at that scale is probably not that bad. But yeah, just keep it in mind when training more capable models.

JD: So we talked a lot about the theory here. So what are some real world applications aside from generating pictures of faces?

JH: To be honest, faces are the thing. So say you want to build a sample web application and you need avatars for your fake users, right? It's such a hassle to take images of real people because they have rights. But if you just generate them, which is a thing nowadays, you get realistic-looking images for your fake web app users, which is actually a use case people pay money for.

And then of course, another thing you can do with this is try to understand how images compare to each other in terms of their content. For example, for us, two images of the same person with just slight differences in posture and where the person looks are basically the same image, right? We know they're two different photos, but it's the same person, the same point in time, the lighting is the same. So it's basically the same. But if you want to compare those images in terms of pixel values, it's really hard. It gets easier if you can map them back to the latent space where the noise is coming from, which is not directly possible when you're working with pure GANs.

But in variational autoencoders, for example, you always have this bottleneck, which is also a vector-like thing, or can be a vector-like thing, and you can still train the generative part of the autoencoder with a GAN loss. This vector encodes the information that makes up the content of the image, or it can be interpreted as such. And if you compare those vectors between images, you get small distances for images with similar content and larger distances for images with different content. And then you can ask super interesting questions.

Those are usually mid- to high-dimensional vectors. In which dimensions, in which directions, are those vectors similar? In which directions are they more distant? You can assign content categories to directions and shift the vectors around, for example to give people glasses or change the lighting, this kind of thing. So you can do a bunch of crazy stuff with that, which is super interesting.
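
As a rough illustration of comparing images via their latent vectors and shifting them along a semantic direction, here is a small NumPy sketch. The encoder, decoder, and the "glasses" direction vector are hypothetical placeholders, not something from the episode.

```python
# Rough sketch of comparing images via their latent vectors and shifting them
# along a semantic direction. The encoder, decoder, and "glasses" direction
# are hypothetical placeholders, not something from the episode.
import numpy as np

def latent_distance(z_a: np.ndarray, z_b: np.ndarray) -> float:
    # Small distance -> similar content, large distance -> different content.
    return float(np.linalg.norm(z_a - z_b))

def shift_attribute(z: np.ndarray, direction: np.ndarray, strength: float = 1.0) -> np.ndarray:
    # Move the latent code along a direction associated with some attribute,
    # e.g. the mean difference between codes of images with and without glasses.
    return z + strength * direction

# z_a, z_b = encoder(image_a), encoder(image_b)          # hypothetical encoder
# print(latent_distance(z_a, z_b))                       # content similarity
# with_glasses = decoder(shift_attribute(z_a, glasses_direction, 1.5))  # hypothetical decoder
```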

JD: So are there any more applications in the real world before we shift to the next thing?

JH: I mean, so for pure GANs, I would say it's really only about generating more stuff of the same. So wherever you have a situation where you need more of the same, that would be a GAN thing.

But GAN losses, so adding a discriminator to your image-generating setup, that is something that's part of far more applications. Then you're already in the realm of models like Midjourney or Stable Diffusion, where you have a little more control. But to get realistic images, or to align the distribution of generated images with the ground truth data set, that's where you would want a GAN loss in there.
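
A minimal sketch of what "adding a GAN loss" can look like on top of a plain reconstruction loss, assuming a PyTorch setup with an existing discriminator that outputs one logit per image; the loss weight is an arbitrary illustration.

```python
# Sketch of a combined loss: a reconstruction term plus an adversarial ("GAN") term,
# assuming an existing PyTorch discriminator that outputs one logit per image.
# The 0.1 weight is an arbitrary illustration.
import torch
import torch.nn as nn

bce = nn.BCEWithLogitsLoss()
l1 = nn.L1Loss()

def generator_loss(generated, target, discriminator, adv_weight=0.1):
    # Reconstruction term keeps the output close to the target image...
    rec = l1(generated, target)
    # ...while the adversarial term pushes the output toward the real-image distribution.
    adv = bce(discriminator(generated), torch.ones(generated.size(0), 1))
    return rec + adv_weight * adv
```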

JD: Okay, that's cool.

JD: Then I researched further what generative AI can do, and I wondered if I could find something that's not possible with it. The next thing I found was 3D reconstruction from 2D images. When we talked about computer vision today or in the former episodes, we always talked about 2D, like headshots, for example, and that kind of thing. And I found 3D reconstruction, and one specific technique that came up was neural radiance fields, called NeRFs.

JH: Those are, again, quite a crazy technique, where the goal is: I have an image and I want to generate a new view of what I'm seeing in this image.

So for example, I want to take the camera that took this photo and move it a little to the right and then see what the photo would have looked like from that position.

And the way this is done is with so-called radiance fields. I train a neural network to take a coordinate, and I ask it: what color would this point in space have, and what opacity? For example, say I have a sphere lying in the middle of my image, and I look from my camera in all the directions where my pixels point. For each pixel I cast a ray from my camera through that pixel into space, and along that ray I can ask: is there something here? And if yes, what color? I repeat this over and over, and everywhere there is sphere in 3D space I get a "yes, there is something, and the color is this and that", and everywhere else there is nothing, which I can render as white. In the end I end up with a sphere. Then you take another location for your camera, you move it around the sphere or shift it a little bit, you ask the same questions, and you get a shifted view of that sphere.
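
To make the ray-casting idea concrete, here is a toy NumPy sketch of volume rendering along a single camera ray. Instead of a trained neural network, the radiance field is a hard-coded red sphere, and the sample counts and near/far bounds are arbitrary assumptions.

```python
# Toy sketch of the ray-casting idea behind radiance fields: march along a camera
# ray and ask "is there something here, and what color?". Instead of a trained
# neural network, the field here is a hard-coded red sphere; sample counts and
# near/far bounds are arbitrary.
import numpy as np

def field(point):
    # A real NeRF is a neural network returning (color, density) for a 3D point
    # (and a viewing direction); here: a red sphere of radius 1 at the origin.
    density = 10.0 if np.linalg.norm(point) < 1.0 else 0.0
    return np.array([1.0, 0.0, 0.0]), density

def render_ray(origin, direction, n_samples=64, near=0.5, far=4.0):
    # Simple volume rendering: accumulate color weighted by local opacity times
    # the light that still survives up to this sample (transmittance).
    dt = (far - near) / n_samples
    color, transmittance = np.zeros(3), 1.0
    for t in np.linspace(near, far, n_samples):
        rgb, density = field(origin + t * direction)
        alpha = 1.0 - np.exp(-density * dt)
        color += transmittance * alpha * rgb
        transmittance *= 1.0 - alpha
    return color  # stays at the background color where the ray hits nothing

# One pixel's ray, looking at the sphere from z = -3 along +z:
pixel_color = render_ray(np.array([0.0, 0.0, -3.0]), np.array([0.0, 0.0, 1.0]))
```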

That's pretty cool if you, for example, take an image of a flower, something really complex and beautiful, and then you want to shift the viewpoint a little bit to get an idea of the depth and, how do you say that, the shape of that flower. From this you can generate beautiful videos. So a really, really nice approach. In the end you have a model that basically understands the underlying geometric structure of something you've shown it in only a single image or a few images.

JD: Sounds a little bit like I could use that in augmented reality, like projecting something onto the wall or another item.

JH: I mean, yeah, it's again a great way to compress information, right? If you look at my webcam image right now, where we're looking at each other, you can tell what's further back and what's closer to the camera. Just from a single image, you have experience that helps you understand it. Even people with only a single eye can walk through the world without constantly hitting things, because their experience is already enough information to be quite good at navigating space. And this is basically how I interpret what neural radiance fields do and how you can use them.

Given the information from a single image or a few images, I can generate depth information and synthesize new views based on it. But it's more than just using depth information, because depth alone will not help me generate correct geometry from novel viewpoints: everything that's occluded I cannot estimate using only depth estimation. With neural radiance fields, I learn to interpret what I'm seeing, given also my experience, basically given also my knowledge about the data distribution.

JD: Could I also use this in video game programming, like open world stuff?

JH: I mean, it's not generative in the sense that it would come up with entirely new landscapes or stuff like that. I think for that you would need something else. I think it's more interesting in terms of compression. So whenever you need to send information around with small bandwidth, that's where it's interesting.

JD: Can you think of any more applications in the real world for NeRFs?

JH: That's a good question. What they've been shown to do quite well is deducing geometric information from some given images. So one interesting thing could be that you could start 3D printing stuff you find on the internet using Google Image Search. All the more reason not to photograph your keys, guys. But yeah, there are a lot of applications. Anytime you want a new perspective on things, that's where NeRFs can be handy.

JD: Oh, that's the perfect last words for this episode.
