JD (Johannes Dienst): Hello and welcome to season two of J&J Talk AI, this time again with Johannes. This season we are talking about practical applications of computer vision and the tasks involved in solving them.
What are practical applications of computer vision, Johannes?
JH (Johannes Haux): Oh man, that's kind of a very broad question.
I mean, there are so many, but maybe we can take a step back and revisit the term computer vision. In general, we're talking about everything that somehow involves two-dimensional or image data, right?
So everything where, as humans, we would look at things to deduce actions or extract information, and where we would like support from a computer-based tool. That's where computer vision applications come in. And they've been around for quite a while.
JH: I think the most notable application, one that has also been discussed in public for many years, is the identification of humans through, for example, facial analysis software.
For that we don't even need deep learning, so to speak.
Classical computer vision has been very good at detecting faces, or at least at extracting facial configurations and using them to identify persons. That's applicable, for example, when you want to do surveillance, so it's also directly a very controversial topic.
But again, the application: I want a tool, I want some assistance, right? Because sitting in front of six screens to find this one person is a really hard task, especially if you want to do it in real time and as quickly as possible. And right away you can see the value of a computer assistant that can sift through all those images, potentially in parallel, at a far greater speed than you ever could. That's really helpful if you want to identify persons in video data. But of course there are other applications too.
One very interesting one is for medical purposes, right? For example, if I have an x-ray, what could the diagnosis be? Having some kind of pre-diagnosis from a tool can reduce the time a doctor has to spend looking through x-rays or CT scans. With CT scans it's even more applicable, because they are 3D data and I have to look through each slice of a body. So having a tool there that assists you is again super helpful. Not to solve the task of diagnosing the patient, but just to reduce the number of images I have to look at to come to a conclusion. That's already a huge speed gain.
Understanding, for example, where forest fires could occur is also something where computer vision is really helpful.
Then again, in the arts sector, we've seen very recently that image generation as a computer vision application is becoming more and more popular. If you've been on YouTube in the past months, there are a bunch of auto-generated YouTube Shorts that talk about, I don't know, historical figures, and then you have auto-generated images of those people. So again, it's a very real application that is probably a bit controversial but is being used.
The list is longer; we could talk about so many real-world applications. But maybe you could ask a few more specific questions to narrow it down.
JD: I would add another example I use very practically in my daily job. There's a service where I can take a YouTube video, or any video really, or even a Twitch stream, and it transcribes the speech, recognizes where to take screenshots, detects where code appears, and even transcribes the code. So this is a very practical way to get a tutorial or blog post out of your video. That is what I think about when I think about computer vision: a really practical application.
But maybe you could take a step back. You talked about x-rays: is that object detection, or is it a classification example, like the patient is healthy or the patient is not healthy? So taking a step back: what tasks do we actually have to solve?
JH: Well, as I said, anything where I as a human would need to look at something, and need to do it at scale, that's where computer vision comes into play.
The tasks that are of course always a bit controversial are those where the military is involved, right? Where do I need to place the bomb from my drone, for example?
Or autonomous vehicles. Tesla is one of the companies that does computer vision quite prominently, right? They're very well known for their Autopilot, which, at least as they say, only uses image or video data to navigate the streets. So yeah, that would be a real task to solve.
JD: So what is Tesla actually doing? So let's remain at that example.
JH: Oh man.
JD: They're very famous, and they said, okay, we'll have self-driving cars in like five years, and I think five years have already gone by now.
JH: Yeah I mean that's, I've heard that sentence repeated a number of times.
JD: So do they do object detection, or how do they see the world? Do they classify: can I step on the gas or do I have to brake? Or do they say, oh, there's a human on the street, like, I detected an object that should be a human? How do they do it?
JH: So I don't know. I have to be blatantly honest here. I've not seen their code base. I've seen a few talks and from what I gather I would expect them to use a number of things.
So of course object detection, paired with identifying where the street is, which is a segmentation task, plus depth estimation, and then combining all this into some kind of decision logic. That logic could also be done by neural networks, or it could be hard-coded. I don't know, but they will most definitely not rely on a single model doing everything; they probably have a number of expert models that identify critical situations or extract information that's important for navigating the roads, like street signs, and interpret them.
JD: Okay, so you dropped another interesting term there: segmentation. I'm not sure everyone is familiar with it, and I've always struggled to explain it. Maybe you can elaborate on that.
JH: Okay, well, you have a picture, right? It's a number of pixels, and segmentation asks the question: which pixel can be assigned which class or label? So if I have a picture of a street, for example, the question could be: which pixels belong to a car? And then I color them red, and I know where in the picture there are cars.
And in the Tesla example I could use this information paired with depth estimation. Then I could say, okay, there are cars in this picture, but they are far away, because for those pixels my depth estimate says they're way down the road. But then I ask: is there a person in the picture? And I see that in the center of the picture there are pixels labeled person, and the depth estimate for those pixels says they're very close. So this way you can see how you could combine segmentation with depth estimation, for example, to decide that it's time to brake.
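To make that combination concrete, here is a minimal toy sketch of the idea, not how Tesla actually implements it. It assumes we already have a per-pixel segmentation map and a per-pixel depth map from some models; the class IDs, arrays, and threshold are all made up for illustration.

```python
import numpy as np

# Hypothetical model outputs for a tiny 4x4 image:
# seg holds a class label per pixel (0 = background, 1 = car, 2 = person),
# depth holds an estimated distance per pixel in meters.
seg = np.array([
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [2, 2, 0, 0],
    [2, 2, 0, 0],
])
depth = np.array([
    [50.0, 50.0, 40.0, 40.0],
    [50.0, 50.0, 40.0, 40.0],
    [ 3.0,  3.0, 60.0, 60.0],
    [ 3.0,  3.0, 60.0, 60.0],
])

def should_brake(seg, depth, obstacle_classes=(1, 2), threshold=10.0):
    """Brake if any pixel of an obstacle class is closer than the threshold."""
    obstacle_mask = np.isin(seg, obstacle_classes)
    return bool(obstacle_mask.any() and depth[obstacle_mask].min() < threshold)

print(should_brake(seg, depth))  # the person pixels are at 3 m, so: True
```

The cars sit 40 to 50 meters away, but the person pixels are estimated at 3 meters, so the combined logic says brake, even though neither map alone would be enough to decide.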
JD: Okay, are there different types of segmentation? When I skim articles, it's not just one type of segmentation. Are there different types?
JH: Segmentation, as the name says, means I take a set of values and I put them into different segments, right?
That's segmentation, and in the case of computer vision that means my set of values are pixels, and I segment those pixels into different classes. There are a number of approaches to solving this, and you can also distinguish between the types of classes you assign.
For example, we just heard that you could ask for cars, but that would not differentiate between individual cars, right? If you want to know where individual cars are in the image, you speak of instance segmentation, which is, in itself, again basic segmentation. You would usually see a combination of assigning a segment class, like car or person, and then an instance label or instance ID, so that you have these two pieces of information per pixel: you know this is a car, and it's car number 25 or car number 7. Then you can be a bit more fine-grained about which classes are where and whether pixels belong together.
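The two-labels-per-pixel idea can be sketched in a few lines. This is a hypothetical toy output of an instance segmentation model; the maps and the IDs 7 and 25 are invented for illustration.

```python
import numpy as np

# Hypothetical instance segmentation output for a tiny 4x4 image:
# a semantic map (which class each pixel belongs to) and an
# instance map (which individual object each pixel belongs to).
semantic = np.array([
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 1, 0, 0],
])  # 0 = background, 1 = car
instance = np.array([
    [ 0,  7,  7, 0],
    [ 0,  7,  7, 0],
    [25, 25,  0, 0],
    [25, 25,  0, 0],
])  # 0 = no instance, otherwise an arbitrary instance ID

# Per pixel we now know both the class ("car") and which car it is,
# so we can count the individual cars in the image.
car_ids = np.unique(instance[semantic == 1]).tolist()
print(car_ids)  # two distinct cars: [7, 25]
```

With only the semantic map, the two cars would merge into one "car" region; the instance map is what keeps car 7 and car 25 apart.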
You can also put a relation label on it. You can do anything.
The question about relations of course is always what kind of relation do I want right?
Those two cars that have crashed and are thus bonded together, or those two cars that are next to each other, or one car in front of the other from my perspective? Those are relations you can of course extract, probably by combining a bunch of approaches.
JD: Thank you, Johannes. This was our introductory episode, and the next episode will be about geospatial analysis and its practical applications.
In the fourth episode we talk about the prevalent deep learning architectures and their building blocks. We cover how they work mathematically and how they are stacked together to achieve specific tasks.
In the fifth episode we talk about Vision Transformer architectures: an approach originally developed for language translation and only recently applied to computer vision. This approach is more suitable for certain tasks, as it takes context into account.