Common Terms in Computer Vision

June 12, 2024

When you read about Artificial Intelligence, Machine Learning and Computer Vision you often come across terms that seem to be common.

But do you know all of them and what they actually mean?

In this blog we will list the most common terms that we use in our day-to-day work.

Bounding Box

A Bounding Box is a rectangle around a detected element or region in a picture. In code, bounding boxes are represented by their coordinates, and several formats are in use. Some examples:

  • pascal_voc: [xmin, ymin, xmax, ymax]
  • coco: [xmin, ymin, width, height]
  • yolo: normalized [xcenter, ycenter, width, height]

Example in JSON:

```json
{
  "xmin": "1128.2720982142857",
  "ymin": "160.21332310267857",
  "xmax": "1178.8204241071428",
  "ymax": "180.83512834821428"
}
```
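To make the difference between the formats more concrete, here is a minimal Python sketch that converts a pascal_voc box into the coco and yolo layouts. The function names, the sample box and the image size of 400 x 200 pixels are assumptions made up for illustration.

```python
def pascal_voc_to_coco(box):
    """Convert [xmin, ymin, xmax, ymax] to [xmin, ymin, width, height]."""
    xmin, ymin, xmax, ymax = box
    return [xmin, ymin, xmax - xmin, ymax - ymin]


def pascal_voc_to_yolo(box, image_width, image_height):
    """Convert pixel [xmin, ymin, xmax, ymax] to normalized
    [xcenter, ycenter, width, height]."""
    xmin, ymin, xmax, ymax = box
    return [
        (xmin + xmax) / 2 / image_width,
        (ymin + ymax) / 2 / image_height,
        (xmax - xmin) / image_width,
        (ymax - ymin) / image_height,
    ]


# Made-up box in pascal_voc format on an assumed 400 x 200 pixel image.
box = [100, 50, 300, 150]
print(pascal_voc_to_coco(box))
print(pascal_voc_to_yolo(box, image_width=400, image_height=200))
```

Note that the yolo layout needs the image size, because its values are normalized to the range 0 to 1.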


Annotation

An annotation can be the visual representation of a bounding box or just a keypoint. A textual description is also possible.

Bounding boxes are usually used together with labels in grounding, and to establish a ground truth or reference for the training data and test data of computer vision models.

Annotations differ from labels in that they can include any additional metadata.

Annotation with labels on a website with a lot of text.
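As a rough sketch of how an annotation could look in code, the following Python dataclass bundles a bounding box, a label and free-form metadata. The class name, its fields and the sample values are assumptions for illustration, not a standard format.

```python
from dataclasses import dataclass, field


@dataclass
class Annotation:
    """Hypothetical annotation record: a box, a label and extra metadata."""
    box: list  # [xmin, ymin, xmax, ymax] in pixels
    label: str
    metadata: dict = field(default_factory=dict)


# Made-up example values.
annotation = Annotation(
    box=[1128.27, 160.21, 1178.82, 180.84],
    label="button",
    metadata={"annotator": "human", "text": "Sign up"},
)
print(annotation.label, annotation.metadata)
```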


Label

A label is attached to a bounding box or a keypoint and identifies the class or category the model assigned to that object or area. In the example above, the text person above the bounding box is a label.

Labels are also used to train models where the training set consists of correctly labeled images.

Image Classification

Image classification is used to put a predefined label on an image.

For example, an image of a cat would get the label cat, while an image of a sunset would get the label sunset.

This seems like a trivial thing to do but it actually has a lot of practical use cases:

  • Recognition of objects or entities depicted in images
  • Classifying medical images to detect abnormalities
  • Image retrieval based on categories
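The last step of image classification, picking one of the predefined labels, can be sketched in a few lines of Python. This assumes the model has already produced one raw score per label, which is a common setup but not the only one. The label list and the scores are made up.

```python
import math

LABELS = ["cat", "sunset", "dog"]  # hypothetical predefined labels


def classify(scores):
    """Turn raw scores (one per label) into a label and its probability."""
    # Softmax: exponentiate and normalize so the values sum to 1.
    highest = max(scores)  # subtracted for numerical stability
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    probabilities = [e / total for e in exps]
    # The predicted label is the one with the highest probability.
    best = max(range(len(LABELS)), key=lambda i: probabilities[i])
    return LABELS[best], probabilities[best]


label, probability = classify([2.0, 0.5, 0.1])  # made-up model output
print(label, round(probability, 2))
```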

Image Segmentation

The goal of image segmentation is to divide an image into areas that belong together. The areas are visually distinct from each other.

The output of the process is a set of segmentation masks that can be used to process the image further.

Take this segmentation of an avatar image where the masks define where a person is and where the background is. A practical application would be to remove the background based on these masks.

Colored segmentation areas for an avatar image.
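Removing the background with a mask could look roughly like the following Python sketch. It assumes a tiny grayscale image stored as nested lists and a binary mask where 1 marks the person and 0 the background. Real masks usually come from a segmentation model and are stored as arrays.

```python
def remove_background(image, mask, fill=0):
    """Keep pixels where the mask is 1 and replace all others with fill."""
    return [
        [pixel if keep else fill for pixel, keep in zip(image_row, mask_row)]
        for image_row, mask_row in zip(image, mask)
    ]


# Made-up 3 x 4 grayscale image and a matching person mask.
image = [
    [12, 40, 41, 13],
    [11, 200, 210, 14],
    [10, 205, 220, 15],
]
mask = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
]
for row in remove_background(image, mask):
    print(row)
```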

Object Detection

Object detection is used to identify objects in images and video. It is a crucial task in a lot of domains such as Surveillance and Security, Environmental Monitoring and Augmented Reality.

It takes visual input and determines the objects that are present. Each object is described with a bounding box and also gets a class label and sometimes a confidence score.

Detected objects are person and cell phone in a screenshot of a person sitting in a chair holding up a cell phone.
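The output of an object detector can be pictured as a list of records, each with a bounding box, a class label and a confidence score. The Python sketch below uses a hypothetical Detection class and made-up values, and drops detections below an assumed confidence threshold of 0.5.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    """Hypothetical detector output for a single object."""
    label: str
    box: list  # [xmin, ymin, xmax, ymax] in pixels
    confidence: float


def filter_detections(detections, min_confidence=0.5):
    """Keep only detections the model is reasonably confident about."""
    return [d for d in detections if d.confidence >= min_confidence]


# Made-up detections for a person holding up a cell phone.
detections = [
    Detection("person", [40, 20, 310, 460], 0.97),
    Detection("cell phone", [180, 90, 230, 170], 0.81),
    Detection("chair", [10, 200, 120, 470], 0.32),
]
for d in filter_detections(detections):
    print(d.label, d.box, d.confidence)
```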


Feature

A feature is a distinctive property of an area or keypoint of an image. Usually you attribute features to a specific class you want to detect.

For a mouse, those could be the shape, surroundings or fur pattern. Features are represented as numerical vectors, called descriptors, that capture the visual properties of an area or keypoint.

Feature Vector

Feature vectors are not limited to Computer Vision, but are used there for all kinds of tasks.

A feature vector is generated from input data such as an image and contains one numerical entry for every feature. Feature vectors are typically high-dimensional.

In practice, feature vectors are generated by a Backbone (see next heading), which is itself a model trained to extract useful features.
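Learned features are hard to show in a few lines, so the following Python sketch uses a hand-crafted stand-in: a normalized intensity histogram of a tiny grayscale image. It only illustrates the idea of one numerical entry per feature. The function name and the bin count are assumptions, and real feature vectors are learned and have far more dimensions.

```python
def histogram_features(image, bins=4, max_value=256):
    """Return a normalized intensity histogram as a small feature vector."""
    counts = [0] * bins
    pixels = [p for row in image for p in row]
    for p in pixels:
        # Map the intensity (0 to max_value - 1) to one of the bins.
        counts[p * bins // max_value] += 1
    # Normalize so the entries sum to 1, independent of the image size.
    return [c / len(pixels) for c in counts]


# Made-up 2 x 2 grayscale image: three dark pixels and one bright pixel.
mostly_dark = [[10, 20], [30, 200]]
print(histogram_features(mostly_dark))  # [0.75, 0.0, 0.0, 0.25]
```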


Grounding

Grounding is the process of connecting high-level concepts, for example Mouse, with visual features, for example tail.

Once you have high-level concepts and their representation in the model, you can also put them into context and relate them to each other.

Grounding can also be done by humans for training data. There you annotate an input with bounding boxes and labels, so the model has a Ground Truth to learn from.

Referring Expression

Referring expressions in Computer Vision are used to describe objects or areas in images. They can contain attributes like color or relations to other objects.

Take this example: "The bike next to the child." You do not point to the bike directly but give the child as a reference point. Or you could give a description: "The red building next to the blue car."

Referring Expressions are typically used in tasks like object localization, object detection or image retrieval.
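As a toy illustration of "The bike next to the child", the Python sketch below picks the bike whose center is closest to the child among a set of made-up detections. Real systems interpret the expression with a language model. Here the target and the reference are passed in by hand, and the helper names are assumptions.

```python
import math


def center(box):
    """Center point of a [xmin, ymin, xmax, ymax] box."""
    xmin, ymin, xmax, ymax = box
    return ((xmin + xmax) / 2, (ymin + ymax) / 2)


def resolve_next_to(detections, target_label, reference_label):
    """Return the target box closest to the reference object, or None."""
    references = [box for label, box in detections if label == reference_label]
    targets = [box for label, box in detections if label == target_label]
    if not references or not targets:
        return None
    # Simplifying assumption: only the first reference object is considered.
    reference_center = center(references[0])
    return min(targets, key=lambda box: math.dist(center(box), reference_center))


# Made-up detections as (label, box) pairs.
detections = [
    ("bike", [20, 200, 120, 300]),
    ("bike", [400, 210, 500, 310]),
    ("child", [130, 180, 180, 300]),
]
print(resolve_next_to(detections, "bike", "child"))  # [20, 200, 120, 300]
```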


Zero-Shot

In general, zero-shot means that a model can do tasks it was not specifically trained on.

To some extent, every model can perform tasks without specific training, though most do so rather poorly, because the goal of training is to gain a higher-level understanding of concepts.

Zero-shot learning in Computer Vision also includes more information than just labeled examples. The focus is on data that describes relationships between categories, so the model can generalize better to new categories.

This also requires large-scale training on a lot of data, so the model is exposed to billions of examples and can generalize to new tasks without specific training.
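One simple way to picture data that describes relationships between categories is a table of attributes per class. In the Python sketch below, a class with no labeled images can still be recognized because its attribute description is known. The attributes, the classes and the predicted values are all made up, and real zero-shot models are far more elaborate.

```python
# Order of the entries in every attribute vector below.
ATTRIBUTES = ["has_stripes", "has_four_legs", "has_wings"]

# Hypothetical class descriptions: 1 means the attribute applies.
CLASS_DESCRIPTIONS = {
    "horse": [0, 1, 0],
    "zebra": [1, 1, 0],  # assumed to have no labeled training images
    "bird": [0, 0, 1],
}


def zero_shot_classify(predicted_attributes):
    """Pick the class whose description is closest to the predicted attributes."""
    def distance(description):
        # Squared Euclidean distance between prediction and description.
        return sum((p - d) ** 2 for p, d in zip(predicted_attributes, description))

    return min(CLASS_DESCRIPTIONS, key=lambda name: distance(CLASS_DESCRIPTIONS[name]))


# Made-up attribute scores predicted for an image: striped, four legs, no wings.
print(zero_shot_classify([0.9, 0.8, 0.1]))  # zebra
```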


That is all for now, but there are surely more. Do you know of any we should include?

Let us know on Social Media.

Johannes Dienst