Welcome to episode two of J&J Talk AI. This time we're talking about computer vision and again I have Johannes Haux with me.
JH: Hey there.
JD: And when we're talking about computer vision we also need an origin story, right? So what is computer vision and where does it come from Johannes?
JH: So, good question. Computer vision, as the name implies, is the research field, or the process, of giving computers eyes, giving computers vision. In computer vision we try to answer the question: how can I give computers the ability to understand concepts that are present in images? So for example finding a person, or describing the content of an image, these kinds of things.
And interestingly, in the 60s, I think around 1966, people thought: okay, we have computers now, let's just teach them how to see. They started a summer project with the goal of solving computer vision within a few weeks. Seriously, that was the idea. And they quickly figured out that this is not possible, right? It's a little more complicated than that. And that's why we're still doing computer vision, and why we need neural networks for certain computer vision tasks: some things in computer vision are fairly easy to solve, but others are really complex and not necessarily intuitive for us as humans to implement.
JD: So when they started out in 1966, how did they start out? What are the classical approaches from that time and until now?
JH: Okay, so this notion of "explain to me what you're seeing" was there right from the start. Now the question is: what am I seeing, right? For us as humans that's an intuitive thing to answer: I'm seeing the world around me, in color and partly also in 3D. But for a machine, the usual input I would give it in the vision context is pixels: a grid of numbers that are more or less strongly correlated in two dimensions.
Now, how do I get information out of this? If I just take the raw image, it's really just a bunch of numbers. I need to figure out how to put those numbers into context, and one of the earliest and most successful ways of doing that is applying filters, also known as kernel methods. I start with a small patch and say: this looks like an edge, for example, the border of something. Then I slide this small patch over the given image, and at each stop I ask: does this section of the image look like the patch? I get a yes or a no, or something in between. By doing this for the whole image, I can say: these parts of the image look similar to this little patch. And then I can define a number of patches with different characteristics: this one looks like a blob, this one looks like an edge oriented 90 degrees to the other edge.
These kinds of things. From them I generate a bunch of feature maps that can then be used to explain the content of the image. For example, if I know where edges are, I can say: if edges are connected in an image, then the convex hull they form probably encloses one object. Or if I know what an eye looks like, I can say: at these locations in the image there are probably eyes. Then I can use this for face tracking, and use, for example, the distance between the eyes to say: okay, this is a person close to the camera, and so on. So I can use these kinds of features to answer not only basic but also interesting questions about the content of images.
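The sliding-patch idea described above can be sketched in a few lines of NumPy. The edge patch and the toy image are invented for illustration, and a real system would use optimized convolution routines instead of explicit Python loops:

```python
import numpy as np

def slide_filter(image, patch):
    """Cross-correlate a small patch with every position in the image.

    Returns a response map: high values mean "this region looks like
    the patch", values near zero mean it does not.
    """
    ph, pw = patch.shape
    ih, iw = image.shape
    out = np.zeros((ih - ph + 1, iw - pw + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            window = image[y:y + ph, x:x + pw]
            out[y, x] = np.sum(window * patch)
    return out

# A vertical-edge patch: dark on the left, bright on the right.
edge_patch = np.array([[-1.0, 1.0],
                       [-1.0, 1.0]])

# A toy image with a vertical edge down the middle.
image = np.zeros((4, 6))
image[:, 3:] = 1.0

response = slide_filter(image, edge_patch)
```

The response map peaks exactly where the window straddles the dark-to-bright transition, which is the "yes, this looks like my patch" signal the filter is designed to produce.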
JD: Let's stay with the eye example. You're not looking straight at the camera all the time, so we have things like head tilt: left or right, or up and down.
JH: Yeah, that's a really good point, and it's also one of the limitations of these kinds of approaches. I need to cover all possible scenarios with corresponding filters, for example looking from the side, or looking from above.
These kinds of things. That said, if you restrict the problem to only cover cases where I'm looking at the camera, which is, you know, 80% of the time in a video call for example, then I'm totally fine with only a small number of filters to detect eyes. The moment those filters don't activate anymore, I don't know where the eyes are, and then I need, for example, tracking methods to make predictions, or to bridge the gap using signals from earlier in time once the detection comes back.
So yeah, that's a problem.
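The gap-bridging idea can be sketched with a toy constant-velocity tracker. The detection sequence below is hypothetical: one eye x-coordinate per frame, with `None` where the filters didn't activate:

```python
def track(detections):
    """Fill gaps in a detection sequence with constant-velocity predictions."""
    positions = []
    velocity = 0.0
    last = None
    for det in detections:
        if det is not None:
            if last is not None:
                velocity = det - last  # update velocity from two detections
            last = det
        elif last is not None:
            last = last + velocity  # no detection: predict forward
        positions.append(last)
    return positions

# Eye x-coordinate per frame; None = the filters did not activate.
print(track([10.0, 12.0, None, None, 18.0]))
# → [10.0, 12.0, 14.0, 16.0, 18.0]
```

Real trackers (Kalman filters, for instance) do this with explicit noise models, but the principle is the same: carry the motion forward until the detector fires again.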
JD: It also sounds like if you want to detect not only eyes but several things in an image, then you would need a lot of filters.
JH: That's true, and there are ways to handle it. For example, what I didn't mention yet is scale: if I get closer to the camera, my eye gets larger, so in terms of pixel space the area my eye covers is larger. I now need to compare against a differently sized patch, and so on. But scaling patches is fairly easy, right?
I can do that, so I can cover multiple scales this way. If I want to detect edges of various orientations, I can parametrically generate a number of filters for certain angles and so on, and get more fine-grained. The question, though, is always: what kind of task do I want to solve, and what filters do I need for it? That's the interesting question. And that's why, even today, face tracking is something that can be done very efficiently: we know very well how to do it with classical approaches, and those are of course highly optimized, so you can do it in real time at practically no cost.
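Generating filters parametrically for different angles and sizes might look like the following sketch; `oriented_edge_filter` is an invented helper that builds a rotated version of a plain gradient filter:

```python
import numpy as np

def oriented_edge_filter(angle_deg, size=3):
    """Build a simple derivative filter that responds to edges
    perpendicular to the given direction.

    Each kernel value is the projection of the pixel's offset from the
    center onto the unit vector at angle_deg, so angle 0 reproduces a
    plain horizontal-gradient filter and other angles rotate it.
    """
    theta = np.deg2rad(angle_deg)
    direction = np.array([np.cos(theta), np.sin(theta)])
    half = size // 2
    kernel = np.zeros((size, size))
    for y in range(size):
        for x in range(size):
            offset = np.array([x - half, y - half])
            kernel[y, x] = offset @ direction
    return kernel

# A small parametric filter bank: edges at 0, 45, 90 and 135 degrees.
# Larger scales come for free by passing a bigger size.
bank = {angle: oriented_edge_filter(angle) for angle in (0, 45, 90, 135)}
```

Classical filter banks such as Gabor filters follow the same pattern: one formula, swept over angle and scale parameters, instead of hand-drawing every filter.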
JD: It's basically what my camera is doing right now. Tracking my face to not lose its focus and it's blazingly fast yeah. Okay you have filters but basically do you have weights there? I know the concept of weights so you're not using every filter in the same weight or weight the same is that a concept here?
JH: I mean, yeah, that's basically a post-processing step, I would say. What you're referring to is: now that I have information about the image, which information do I consider more valuable for the task at hand? But the term weights is also used, primarily in deep learning, for learnable parameters. That is again very similar, but in that sense the weights are learned filters. So depending on what you're getting at, the term weight might mean something different.
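The contrast between fixed, hand-chosen weights and learned ones can be illustrated with a small sketch; the two response maps and the weight values here are entirely made up:

```python
import numpy as np

# Hypothetical response maps from two hand-crafted filters
# (say, an eye detector and an edge detector), same spatial size.
eye_response = np.array([[0.1, 0.9],
                         [0.2, 0.8]])
edge_response = np.array([[0.5, 0.5],
                          [0.4, 0.6]])

# Classical weighting: fixed, hand-chosen importance per filter,
# combined as a post-processing step on the feature maps.
weights = {"eye": 0.8, "edge": 0.2}
score = weights["eye"] * eye_response + weights["edge"] * edge_response

# In deep learning, both the filters producing the response maps and
# the weights combining them would be learned from data instead.
```

The combination step is identical in both worlds; what changes is whether a human picks the numbers or an optimizer does.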
JD: Thanks for the explanation. That's already it for episode two. We might get into other methods, like deep learning as the next evolutionary step, because as we discussed here, the classical approaches are very good at solving a single task, but they can get complicated when you need a lot of filters.
So stay tuned for the next episode!
In season two, we are back to explore practical applications of computer vision and their unique challenges. From identifying humans through facial analysis software to assisting doctors in diagnosing patients from medical images, computer vision is making a big impact across various domains. Join us as we unravel the mysteries of object detection, classification, and segmentation, and discover how these techniques are powering self-driving cars like Tesla's autopilot.
In the fifth episode we talk about Vision Transformer architectures: an approach originally developed for language translation and only recently applied to computer vision. It is well suited to certain tasks because it takes context into account.