J&J Talk AI Episode 03: Deep Learning and Convolutional Neural Networks (CNNs)

Johannes Dienst
June 13, 2023


Transcript

Welcome back to episode 3 of J&J Talk AI.

We talked about classical approaches to computer vision last time, and we saw that they are quite powerful for single tasks. But sometimes we want to detect more than one element or feature in an image, and this is where the classical approaches reach their limits.

This time we talk about deep learning and convolutional neural networks, or CNNs, and how they can solve this problem.

So Johannes, what are the state-of-the-art approaches? Because I think convolutional neural networks are the state of the art. Am I right?

JH: It really depends on who you ask, I guess. But I think at the time of recording, convolutional neural networks, CNNs, are still the state of the art for certain tasks.

JD: So how do they tackle the problem that comes with the limits of classical approaches?

JH: Yeah, very good question. Maybe let's define the problem a little more specifically first. What we talked about in the last episode is that classical computer vision approaches are filter-based, or kernel-based, approaches. They require you to define a set of filters that will produce the features we're interested in. This process of figuring out how to extract information from the image is not only tedious, but also biased by and limited to what I can come up with. And sometimes I don't know which features are the most interesting for the task I want to solve.

And that's where a machine learning system comes in really handy, because I can then ask the question: system, what filters do you need to answer the following question?
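To make that contrast concrete, here is a minimal sketch, assuming PyTorch; the image size, kernel values, and channel counts are illustrative and not anything from the episode:

```python
import torch
import torch.nn.functional as F

image = torch.randn(1, 1, 28, 28)  # dummy grayscale image, batch of one

# Classical approach: a fixed, hand-designed Sobel kernel for vertical edges.
sobel = torch.tensor([[-1., 0., 1.],
                      [-2., 0., 2.],
                      [-1., 0., 1.]]).view(1, 1, 3, 3)
edges = F.conv2d(image, sobel, padding=1)  # one hand-crafted feature map

# Learned approach: the same operation, but the kernel weights are free
# parameters that training adjusts to whatever the task needs.
conv = torch.nn.Conv2d(in_channels=1, out_channels=8, kernel_size=3, padding=1)
features = conv(image)  # eight feature maps from eight learned filters
```

The only difference between the two calls is who chooses the kernel weights: the engineer, or gradient descent.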

Maybe to put this into historical context: the most prominent time this was done, I cannot tell you exactly the year, was by Yann LeCun, who is currently a lead researcher at Facebook, now Meta. He did this when he wanted to extract the handwritten digits from checks. So if you want to transfer money.

JD: So you use paper checks.

JH: You use paper checks. You have to write down the numbers, and that information has to be extracted, usually by a person who then puts it into an Excel sheet. That's quite tedious and sometimes error-prone.

So how can we automate this process using images, photographs of those paper checks? He solved this problem using convnets, really simple ones with only a few layers, where the features are generated by learned filters, or learned weights as you could also call them.

The network then decides, for example, which digit it is currently seeing in the image. That was the first CNN approach that became really, really popular, and it sparked the broader interest in convolutional neural networks.

And then we quickly saw ever larger models that were able to handle not only classification tasks, but also more complex tasks like visual question answering, and at some point generative tasks as well, where I take an image and change its style so that it looks like an artwork, and so on. That's what you can do with convolutional neural networks and learned filters.

In the end, we solved the problem of asking what kind of filters I need to extract the relevant information: we only give a task and don't care about the exact filters anymore, which is quite handy, because coming up with them by hand is a lot of work.

JD: So let me sum it up. You ask the question, I need the digits basically, and then you say to the network: solve this question, please.

And is this some kind of layer? So you have a filter, then you get a feature map, and the feature map is the input to another layer. I remember Google's deep learning work did it like this.

JH: Yeah, that's where the term deep comes from in this scenario. What you want to do is aggregate information. I have an image, an array of pixels, and I turn it into a number of feature maps using my learned filters. But that information by itself is not really helpful yet. It's just a bunch of features.

So I take this high-dimensional collection of features and do the same thing again: I learn another bunch of filters that take this information and turn it into another set of feature maps. And because of the nature of the process, each filter always collects information from a fixed neighborhood of pixels, for example 7 by 7 or 11 by 11 or even smaller, and aggregates the information in that area into a single value.
So with each layer, I aggregate more and more information into single values. And at some point, in my final layers, I scale the number of features down to a small, condensed set.

And then finally, I take those features and put them into a prediction layer, a classification layer, and say: for digit classification, I want 10 values. If value one is high, I know it's a zero. And if value ten is high and all the other values are very low compared to it, I know it's a nine, right? This way I turn feature layer after feature layer after feature layer into really condensed information that I can then use to do, for example, classification.

So basically you do the same thing as in the classical approach. You learn filters and you stack them.
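A minimal LeNet-style sketch of this stacking, again assuming PyTorch; the layer sizes are chosen for a 28 by 28 digit image and are illustrative, not a configuration discussed in the episode:

```python
import torch
import torch.nn as nn

class DigitCNN(nn.Module):
    def __init__(self):
        super().__init__()
        # Each conv + pool step aggregates a larger pixel neighborhood
        # into each value of the next, smaller set of feature maps.
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2),   # 28x28 -> 12x12
            nn.Conv2d(6, 16, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2),  # 12x12 -> 4x4
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(16 * 4 * 4, 64), nn.ReLU(),
            nn.Linear(64, 10),  # 10 values, one per digit; the highest wins
        )

    def forward(self, x):  # x: (batch, 1, 28, 28)
        return self.classifier(self.features(x))

logits = DigitCNN()(torch.randn(1, 1, 28, 28))
print(logits.argmax(dim=1))  # the predicted digit
```

Each convolution plus pooling step shrinks the spatial size while condensing a larger pixel neighborhood into each value, exactly the aggregation described above.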

JD: And you can still reason about the learned filters, right?

JH: Depends on what you mean by reason.

JD: So I can look at those and I can understand their function. Okay. Yeah, that's what I meant.

JH: And that's actually super, super interesting, because a human would, for example, define a bunch of filters for edges, for blobs, for these kinds of things. And the neural network does the same, just a little more unconstrained, so to say. So we don't just get the zero-degree edge and the 90-degree edge, but also a two-degree and a 92-degree edge, for example.

And the combination of those can then probably describe any edge orientation. So it's really interesting to see what the network comes up with.
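If you want to look at the filters yourself, here is a quick sketch of one way to do it, assuming a trained instance of the hypothetical DigitCNN from the earlier sketch: plot the weights of the first convolutional layer. On digit data, these often end up resembling edge and blob detectors.

```python
import matplotlib.pyplot as plt

model = DigitCNN()  # in practice: a trained model
kernels = model.features[0].weight.detach()  # shape (6, 1, 5, 5)

fig, axes = plt.subplots(1, 6, figsize=(9, 2))
for ax, kernel in zip(axes, kernels):
    ax.imshow(kernel[0], cmap="gray")  # one learned 5x5 filter
    ax.axis("off")
plt.show()
```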

JD: You mentioned in your explanation before that you can use the neural network not just to say, okay, this is the number on the paper check, but also to generate things. Why would I want to generate things in the first place?

JH: First of all, because it's cool. I mean, look at Stable Diffusion and all the other works. And maybe, just for the sake of argument, let's leave out all the questions about ownership of art and so on.

But if you just enable a person to generate a set of pixels that look funny, given only a prompt, that in itself is, I think, a really cool thing to do. So that's one reason why you would want to generate things. But there are more reasons. For example, you would like to generate images when you don't have labeled data. That probably sounds a little strange.

JD: Maybe you have to explain what a label is first.

JH: Okay, yeah, sorry, I'm getting ahead of myself. A label, in the scenario where I want to detect or classify digits, would be the digit that's contained in a given image.

So if I have an image of a 2, the label would be 2. Now, what happens if I have a million images, or so many images that it would be really hard to label them all? How would I still be able to get information out of them?

JD: Yeah, so basically you're talking about manual tagging.

JH: Exactly. If I don't want to do this, is there still a way I can extract information from these images by applying some kind of algorithm?

And one way to do this is to build a network that has to condense the information in the image into, for example, a low-dimensional vector, so into a dimensionality that's way lower than the inherent dimensionality of the space we're coming from, image space with all the possible combinations of pixels. Then the network takes this condensed information and uses it to produce an image that should look exactly like the image I put in. I can compare the generated image with the given image and calculate a loss, which I can then use to train this system to come up with low-dimensional representations of the images I feed it.
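What Johannes describes here is essentially an autoencoder. A minimal sketch, assuming PyTorch and a 28 by 28 grayscale input; all sizes, including the 16-dimensional bottleneck, are illustrative:

```python
import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    def __init__(self, latent_dim=16):
        super().__init__()
        # Encoder: squeeze 784 pixel values down to a 16-dimensional vector.
        self.encoder = nn.Sequential(
            nn.Flatten(), nn.Linear(28 * 28, 128), nn.ReLU(),
            nn.Linear(128, latent_dim),
        )
        # Decoder: reconstruct the image from that vector alone.
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 128), nn.ReLU(),
            nn.Linear(128, 28 * 28), nn.Sigmoid(),
        )

    def forward(self, x):
        z = self.encoder(x)  # the low-dimensional representation
        return self.decoder(z).view_as(x), z

model = AutoEncoder()
images = torch.rand(32, 1, 28, 28)  # a dummy unlabeled batch
reconstruction, vectors = model(images)
loss = nn.functional.mse_loss(reconstruction, images)  # generated vs. given image
```

No labels appear anywhere: the image itself is the training target.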

And then I can, for example, use this for retrieval tasks. I can encode all the million images I have and group them by the similarity of their vectors. Then I can find structures in my data, which is super interesting. I can say: okay, this group of images, those are all ones, and those over there are sevens. Some of the sevens are so poorly written that they sit somewhere in between the two groups, and some of the ones are so squiggly that they are interpreted as a seven, and I cannot really decide. That's how you can do really interesting things: take high-dimensional data, project it into a low-dimensional space, and look at it.
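Continuing the autoencoder sketch above, retrieval by similarity could look like this; cosine similarity is one common choice, but not the only one:

```python
import torch

# Encode the images and rank them by cosine similarity to a query image.
with torch.no_grad():
    _, vectors = model(images)  # (32, 16) latent vectors from the sketch above

query = vectors[0]
similarities = torch.nn.functional.cosine_similarity(query.unsqueeze(0), vectors)
nearest = similarities.argsort(descending=True)[:5]  # index 0 is the query itself
print(nearest)
```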

JD: That's cool. So basically, the generative AI stuff is a byproduct of this reconstruction process.

JH: I would say it maybe started out that way, but it's definitely not a byproduct anymore. The two got married at some point, and then people used the generative side in its own right.

JD: Cool.

So that's already it for convolutional neural networks. Thanks, Johannes.

And in the next episode we will talk about deep learning architectures, because this was a very high-level explanation of what deep learning is, and there are several architectures that are specialized for specific domain tasks.
