## Transcript

Hello and welcome to episode four of J&J Talk AI.

This time we are talking about deep learning architectures, specifically architectures built from convolutional neural networks.

So let's get into designing a real deep learning architecture. Is there a common architecture everyone is using, Johannes?

**JH:** That's not the case, I would say. There are prevalent ones, but there's a high variety of things you can do. And I think a good way to start would be to talk about the various building blocks we have at hand.

Maybe we could start with why there is so much variety. Well, the beautiful thing about deep neural networks is that they are basically constrained by what we can do with mathematics or, to be more precise, with everything differentiable in mathematics. So as long as you write a mathematical function that you can take the derivative of, you can build it into your network and train it with backpropagation.
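As a rough illustration of "differentiable means trainable", here is a minimal sketch in plain Python. The one-weight model, the squared-error loss, and the names `loss`, `loss_gradient`, and `train` are assumptions made for this example and are not from the episode.

```python
# Hypothetical one-weight "network": prediction = w * x.
# The loss is the squared difference between prediction and target.
def loss(w, x, target):
    return (w * x - target) ** 2


def loss_gradient(w, x, target):
    # Chain rule: d(loss)/dw = 2 * (w * x - target) * x
    return 2.0 * (w * x - target) * x


def train(w, x, target, learning_rate=0.1, steps=50):
    # Gradient descent: repeatedly step against the derivative.
    for _ in range(steps):
        w -= learning_rate * loss_gradient(w, x, target)
    return w


w_trained = train(w=0.0, x=1.0, target=3.0)
print(w_trained)  # approaches 3.0, the weight that makes w * x equal the target
```

Backpropagation applies the same chain rule through many stacked functions instead of one.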

**JD:** So what are the building blocks?

**JH:** So the most basic building block is the matrix multiplication. That's what makes neural networks work. Now why matrix multiplication? Why not plus and minus? Those are differentiable mathematical operations too, right?

The answer is that matrix multiplication already contains plus, minus, and multiplication in itself. So matrix multiplication allows me to take the outputs of a number of neurons and multiply and add them with learned weights to produce a feature map of a different dimensionality than the input. So if you remember how we used to draw neural networks, it's usually a bunch of circles that are stacked on top of each other.

**JD:** Yes, yes.

**JH:** Next to it is another column of circles, and then we draw a bunch of lines between those and it gets really crowded. What this says is that each of the circles on the right side gets as input the weighted sum of all the circles on the left side, right? That's why every circle on the left side has a line drawn to each of the circles on the right side. And the way to do this mathematically is to use matrix multiplication between the values of the circles on the left side and a weight matrix.

Those are the lines, basically, which then produce the values of the circles on the right side.

So that's the basic building block of deep neural networks of any sort.
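The circles-and-lines picture can be sketched in a few lines of plain Python. The function name `dense_layer` and the example numbers are made up for illustration; real frameworks do the same computation with optimized matrix routines.

```python
def dense_layer(inputs, weights):
    """Each output is the weighted sum of all inputs.

    `weights` holds one row per output neuron (circle on the right side),
    with one weight per input neuron (circle on the left side).
    """
    return [sum(w * x for w, x in zip(row, inputs)) for row in weights]


left_circles = [1.0, 2.0, 3.0]  # three neurons on the left
weights = [                      # two neurons on the right, one row each
    [0.5, -1.0, 0.0],
    [1.0, 1.0, 1.0],
]
right_circles = dense_layer(left_circles, weights)
print(right_circles)  # [-1.5, 6.0]
```

Three inputs go in and two outputs come out, which is the change in dimensionality mentioned above.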

You also said we're talking about convolutional neural networks. And the problem with matrix multiplication is that it gets quite compute-heavy, right? If I took a whole image and did a matrix multiplication, even if it's only a small matrix, I would still have to combine every pixel, and those are three values each for RGB, with some kind of weight matrix, and that's infeasible. So there we have more efficient algorithms to do that. They still do matrix multiplication under the hood, but only for constrained areas. And that process is the so-called convolution. That's where the "convolutional" in convolutional neural networks comes from: a filter is convolved over an image and this way produces features.
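To get a feeling for the cost difference, here is a small back-of-the-envelope calculation. All sizes (a 224x224 RGB image, 1000 output neurons, 64 filters of size 3x3) are assumptions chosen for illustration, not numbers from the episode.

```python
# Assumed input: a 224 x 224 image with 3 color channels (RGB).
height, width, channels = 224, 224, 3

# Fully connected: every input value gets its own weight for every output neuron.
dense_outputs = 1000  # assumed layer size
dense_weights = height * width * channels * dense_outputs

# Convolution: a small filter is reused at every position of the image.
kernel_size, num_filters = 3, 64  # assumed filter size and count
conv_weights = kernel_size * kernel_size * channels * num_filters

print(dense_weights)  # 150528000
print(conv_weights)   # 1728
```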

**JD:** So you basically slide the filter over the image?

**JH:** And do the matrix multiplication there, right?

**JD:** Ah, okay.

**JH:** We solve the same thing, only on a subsection of the image. And thus you can parallelize it, and that's why it's faster. That's an added bonus, definitely. Yeah.
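Sliding a filter over an image could look roughly like the sketch below. It assumes a single-channel image, no padding, and a step size of one pixel; the name `convolve2d` and the example filter are chosen for this illustration.

```python
def convolve2d(image, kernel):
    """Slide `kernel` over `image` and return the resulting feature map.

    Like most deep learning libraries, this does not flip the kernel,
    so strictly speaking it computes a cross-correlation.
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Weighted sum over the small window that starts at (i, j).
            total = 0.0
            for di in range(kh):
                for dj in range(kw):
                    total += image[i + di][j + dj] * kernel[di][dj]
            row.append(total)
        output.append(row)
    return output


image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
edge_filter = [[-1, 1], [-1, 1]]  # responds where dark turns to bright, left to right
feature_map = convolve2d(image, edge_filter)
print(feature_map)  # [[0.0, 2.0, 0.0], [0.0, 2.0, 0.0]]
```

Each window is computed independently of the others, which is why the work parallelizes well.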

**JD:** Oh, okay. Are there other building blocks?

**JH:** Yeah, of course. Those are the core building blocks. Now, one more building block is the so-called activation function. That means I take my feature map, and from this convolutional process, from this matrix multiplication, I basically get unconstrained values. They can be insanely large or really, really small, close to zero. And I want to make sure that I can work with those features. And I also want to give the network the ability to say this is a non-activation.

So the feature I'm looking for here is not present, or it's strongly present, or slightly present. And that's where the activation function comes in. In its most basic form, for example, I would map a value that is smaller than zero to zero, and a value that is higher than zero I would just leave as is. So I would have a high-pass filter which removes all the low values and keeps all the high values. And there are variations on this that are more beneficial for the entire training process, so they give stronger signals or propagate gradients better through the system. But in general, that's the overall goal: to map things to different values.
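The basic activation described here is commonly called ReLU (rectified linear unit). The sketch below shows it next to one well-known variation, the leaky ReLU, which keeps a small slope for negative values; the slope of 0.01 is an assumed default for this example.

```python
def relu(x):
    # Values below zero become zero, everything else passes through unchanged.
    return max(0.0, x)


def leaky_relu(x, slope=0.01):
    # Variation: negative values are scaled down instead of being cut to zero,
    # so a small gradient still flows for them during training.
    return x if x > 0 else slope * x


features = [-2.0, -0.5, 0.0, 1.5]
activated = [relu(v) for v in features]
leaky_activated = [leaky_relu(v) for v in features]
print(activated)  # [0.0, 0.0, 0.0, 1.5]
```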

**JD:** Was that it? Or do we have to do other stuff?

**JH:** Okay, now we're getting to the point where we should combine those blocks. Because now I can start building things. So basically what we already mentioned are network layers. So every time I have a set of neurons, that's what we call a layer. The features that are present in this layer, we activate those. And then we stack those layers on top of each other to aggregate information more and more.

And now the way I can plug these together, that's where the variety comes in. Maybe sometimes I don't want to have an activation function. Maybe sometimes I want to switch those activation functions out. There are also normalization functions I can use at certain points during the execution of my network. And that's where we then see the high variety of architectures.
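One way to picture this plug-and-play idea is a network built from a list of interchangeable functions. This is a minimal sketch with hypothetical helper names (`dense`, `relu`, `identity`, `build_network`) and made-up weights; normalization layers are left out to keep it short.

```python
def dense(weights):
    # Returns a layer function that computes one weighted sum per weight row.
    def layer(inputs):
        return [sum(w * x for w, x in zip(row, inputs)) for row in weights]
    return layer


def relu(values):
    return [max(0.0, v) for v in values]


def identity(values):
    # Stands in for "no activation function at this point".
    return list(values)


def build_network(layers):
    # Stack the layers: the output of one is the input of the next.
    def network(inputs):
        for layer in layers:
            inputs = layer(inputs)
        return inputs
    return network


hidden = dense([[1.0, -1.0], [-1.0, 1.0]])
output = dense([[1.0, 1.0]])

with_relu = build_network([hidden, relu, output])
without_activation = build_network([hidden, identity, output])

print(with_relu([2.0, 0.5]))           # [1.5]
print(without_activation([2.0, 0.5]))  # [0.0]
```

Swapping a single block changes the result, even though the weights stay the same.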

**JD:** So what would I typically do in computer vision? How would I typically combine those building blocks? Maybe you have an example there.

**JH:** Yeah, actually, one or two years back the question would have been fairly easy to answer. I would have said, you know, the starting point would be an AlexNet or a VGG. Just a bunch of convolutional layers stacked on top of each other. And in the end, you get a classification vector out of it.

Or you want to build a generative model which produces images, for example. Then you build an autoencoder or a generative adversarial network. So those are a few of the names I would have thrown out there.

Nowadays, there are also so-called vision transformers that are different to the classical convolutional neural networks in the way they aggregate information. So I think it's actually interesting to take a close look at this. The convolutional neural network aggregates information by condensing information present in image patches and then feature patches down and down again. So I always take a small window, condense the information into a feature map, and then take a window in this feature map, condense it down again and again. And in the end, I have really dense information that combines the information present basically from all over the image and from certain parts of feature maps.
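The repeated condensing can be sketched as follows. As a simplifying assumption, each window is condensed by plain averaging, whereas a real convolutional network would use learned filters; the name `condense` and the 4x4 example are made up for illustration.

```python
def condense(feature_map, window=2):
    """Condense each non-overlapping window x window patch into one value.

    Averaging is used here as a stand-in for a learned filter.
    Assumes a square map whose side length is divisible by `window`.
    """
    size = len(feature_map) // window
    return [
        [
            sum(
                feature_map[i * window + di][j * window + dj]
                for di in range(window)
                for dj in range(window)
            ) / window ** 2
            for j in range(size)
        ]
        for i in range(size)
    ]


image = [[float(r * 4 + c) for c in range(4)] for r in range(4)]  # 4x4, values 0..15
level1 = condense(image)   # 2x2 feature map
level2 = condense(level1)  # 1x1 feature map
print(level1)  # [[2.5, 4.5], [10.5, 12.5]]
print(level2)  # [[7.5]]
```

The single value at the end depends on every pixel of the input, which mirrors how deeper layers combine information from all over the image.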

**JD:** You're getting way ahead, into episode five already.

So stay tuned for the next episode.

Then we'll talk about the combination of vision transformer architectures with convolutional neural networks.
