    Academy · 4 min read · November 11, 2024

    Common Terms in Computer Vision

    When you read about Artificial Intelligence, Machine Learning, and Computer Vision, you often come across terms that seem to be common knowledge. But do you know what they actually mean?

    Johannes Dienst

    TL;DR

    This blog post defines and clarifies common terminology within the fields of Artificial Intelligence (AI), Machine Learning (ML), and Computer Vision. Terms like bounding boxes, annotations, labels, image classification, image segmentation, object detection, features, feature vectors, grounding, referring expressions, and zero-shot learning are explained, providing a foundational understanding of their meanings and applications.

    Introduction

    Venturing into the domains of Artificial Intelligence, Machine Learning, and Computer Vision often means grappling with a specialized vocabulary. This blog post serves as a glossary, demystifying some of the most frequently encountered terms. By providing clear explanations, we aim to enhance comprehension and empower you to navigate these exciting fields with greater confidence.

    The Visual Language of AI: Bounding Boxes and Annotations

    Bounding Boxes: Defining Visual Boundaries

    A bounding box is a rectangular frame that visually outlines a detected object or region within an image. Think of it as a digital lasso, encircling a specific element of interest. These boxes are defined by their coordinates, with different formats used to represent them. For example:

    • pascal_voc: [xmin, ymin, xmax, ymax]
    • coco: [xmin, ymin, width, height]
    • yolo: normalized [xcenter, ycenter, width, height]
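
    To make the three formats concrete, here is a small conversion sketch. The box values are hypothetical, and the image width and height are needed because the yolo format is normalized:

```python
def pascal_voc_to_coco(box):
    """[xmin, ymin, xmax, ymax] -> [xmin, ymin, width, height]"""
    xmin, ymin, xmax, ymax = box
    return [xmin, ymin, xmax - xmin, ymax - ymin]

def pascal_voc_to_yolo(box, image_width, image_height):
    """[xmin, ymin, xmax, ymax] -> normalized [xcenter, ycenter, width, height]"""
    xmin, ymin, xmax, ymax = box
    width, height = xmax - xmin, ymax - ymin
    xcenter, ycenter = xmin + width / 2, ymin + height / 2
    return [xcenter / image_width, ycenter / image_height,
            width / image_width, height / image_height]

box = [100, 50, 300, 200]                   # hypothetical pascal_voc box
print(pascal_voc_to_coco(box))              # [100, 50, 200, 150]
print(pascal_voc_to_yolo(box, 640, 480))    # [0.3125, 0.2604..., 0.3125, 0.3125]
```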

    Bounding boxes are a fundamental component of object detection: modern detectors refine their predicted box coordinates through bounding box regression, which improves localization compared to approaches that skip this refinement step.

    Annotations: Adding Context and Meaning

    An annotation is either a visual marker, such as a bounding box or keypoint, or descriptive text linked to an element in an image. Annotations provide the "ground truth," the reference point used for training and evaluating computer vision models, particularly when combined with labels for grounding. What sets annotations apart from simple labels is the inclusion of additional metadata.
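
    As a rough illustration, an annotation bundles the visual marker, the label, and extra metadata into a single record, often stored in a COCO-style JSON structure. All values below are hypothetical:

```python
# A single annotation, loosely following the COCO style (values are hypothetical).
annotation = {
    "image_id": 42,
    "category": "dog",                 # the label
    "bbox": [120, 60, 200, 150],       # the visual marker: [xmin, ymin, width, height]
    "metadata": {                      # extra context a bare label would not carry
        "annotator": "human",
        "occluded": False,
        "notes": "partially behind a fence",
    },
}
```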

    Categorizing and Understanding Images

    Labels: Assigning Identities

    A label is an identifier assigned to a bounding box or keypoint, categorizing the object or area detected by the model. In image classification, labels are used to train models by associating images with their corresponding categories. Labeled datasets are the foundation of supervised learning: both the size and the quality of the labeled data directly affect model performance.
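
    A minimal sketch of how labels typically pair with images in a classification dataset (file names and classes are hypothetical):

```python
# Each training example is an image associated with exactly one label.
labeled_dataset = [
    ("images/0001.jpg", "dog"),
    ("images/0002.jpg", "cat"),
    ("images/0003.jpg", "mountain"),
]

# Models work with numeric class indices, so labels are usually mapped to integers.
label_to_index = {"dog": 0, "cat": 1, "mountain": 2}
targets = [label_to_index[label] for _, label in labeled_dataset]
```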

    Image Classification: The Art of Categorization

    Image classification is the process of assigning a predefined label to an entire image based on its overall content. For instance, an image showing a dog would be labeled "dog," while an image of a mountain range would be labeled "mountain." Classification accuracy has steadily increased with advances in deep learning: in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC), top-performing models have achieved top-5 error rates below 5%.
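
    A minimal sketch of image classification with a pretrained model, assuming PyTorch and torchvision are installed (the input file name is hypothetical):

```python
import torch
from torchvision import models
from PIL import Image

# Load a pretrained classifier (weights enum available in torchvision >= 0.13).
weights = models.ResNet50_Weights.DEFAULT
model = models.resnet50(weights=weights)
model.eval()

preprocess = weights.transforms()                   # resize, crop, normalize as the model expects
image = Image.open("example.jpg").convert("RGB")    # hypothetical input image
batch = preprocess(image).unsqueeze(0)              # add batch dimension -> (1, 3, H, W)

with torch.no_grad():
    logits = model(batch)
probs = logits.softmax(dim=1)
top_prob, top_class = probs.max(dim=1)
label = weights.meta["categories"][top_class.item()]   # human-readable label for the whole image
print(f"{label}: {top_prob.item():.2%}")
```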

    Image Segmentation: Dividing and Conquering Images

    Image segmentation's goal is to divide an image into distinct regions, each corresponding to a meaningful object or area. The result is a set of segmentation masks, useful for tasks like background removal or object isolation. Segmentation is a critical step in many computer vision pipelines, including medical imaging, autonomous driving, and satellite image analysis.
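
    A sketch of semantic segmentation with a pretrained torchvision model; the input file is hypothetical, and the class index for "person" assumes the default weights' Pascal VOC label set:

```python
import torch
from torchvision import models
from torchvision.models.segmentation import DeepLabV3_ResNet50_Weights
from PIL import Image

weights = DeepLabV3_ResNet50_Weights.DEFAULT
model = models.segmentation.deeplabv3_resnet50(weights=weights)
model.eval()

preprocess = weights.transforms()
image = Image.open("street.jpg").convert("RGB")   # hypothetical input
batch = preprocess(image).unsqueeze(0)

with torch.no_grad():
    output = model(batch)["out"]          # shape (1, num_classes, H, W)
mask = output.argmax(dim=1)[0]            # per-pixel class index -> segmentation mask
person_mask = (mask == 15)                # 15 = "person" in the VOC label set used by these weights
```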

    Object Detection: Finding and Identifying

    Object detection goes beyond simple classification by identifying and locating objects within images and videos. This involves drawing bounding boxes around detected objects, assigning them class labels, and often providing confidence scores. Object detection algorithms such as YOLO and Faster R-CNN have made real-time video analysis practical and are core building blocks of autonomous systems.
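
    A sketch of object detection with a pretrained Faster R-CNN from torchvision, returning boxes, class labels, and confidence scores (the input file and score threshold are hypothetical):

```python
import torch
from torchvision import models
from torchvision.models.detection import FasterRCNN_ResNet50_FPN_Weights
from torchvision.io import read_image

weights = FasterRCNN_ResNet50_FPN_Weights.DEFAULT
model = models.detection.fasterrcnn_resnet50_fpn(weights=weights)
model.eval()

image = read_image("street.jpg")          # hypothetical input, uint8 tensor (C, H, W)
preprocess = weights.transforms()
batch = [preprocess(image)]

with torch.no_grad():
    detections = model(batch)[0]          # dict with "boxes", "labels", "scores"

categories = weights.meta["categories"]
for box, label, score in zip(detections["boxes"], detections["labels"], detections["scores"]):
    if score > 0.8:                       # keep only confident detections
        print(categories[label], box.tolist(), float(score))
```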

    The Inner Workings: Features, Vectors, and Grounding

    Features: Distinguishing Characteristics

    A feature is a distinctive attribute of an image area or keypoint, typically associated with a specific class of objects. These can be shapes, textures, or patterns that help identify an object.

    Feature Vectors: Numerical Representations

    A feature vector is a numerical representation of an image's features, generated from input data and containing one numerical entry for each feature. Feature vectors are often high-dimensional and are typically produced by a "Backbone," which is a pre-trained model designed to extract useful features.
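
    A minimal sketch of extracting a feature vector with a pretrained backbone, here a ResNet-50 with its classification head removed (the input file is hypothetical):

```python
import torch
from torchvision import models
from PIL import Image

# Use a pretrained ResNet as a backbone and drop its classification head.
weights = models.ResNet50_Weights.DEFAULT
backbone = models.resnet50(weights=weights)
backbone.fc = torch.nn.Identity()         # the output is now the feature vector itself
backbone.eval()

preprocess = weights.transforms()
image = Image.open("example.jpg").convert("RGB")   # hypothetical input
batch = preprocess(image).unsqueeze(0)

with torch.no_grad():
    feature_vector = backbone(batch)      # shape (1, 2048): one entry per feature dimension
print(feature_vector.shape)
```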

    Grounding: Bridging Concepts and Vision

    Grounding connects high-level concepts (e.g., "car") with their corresponding visual features (e.g., "wheels," "windshield"). This can be achieved through human annotation, where inputs are labeled with bounding boxes and labels, providing the model with a "Ground Truth" to learn from.

    Advanced Concepts: Referring Expressions and Zero-Shot Learning

    Referring Expressions: Describing with Detail

    Referring expressions are used to describe objects or areas in images using attributes like color or relations to other objects. An example is "The red house on the left." Referring expressions are commonly used in tasks like object localization, object detection, or image retrieval.
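
    As a purely illustrative sketch (the detections and attributes below are hypothetical), resolving a referring expression can be thought of as filtering candidate objects by attribute and spatial relation:

```python
# Hypothetical candidate detections with predicted attributes.
detections = [
    {"label": "house", "color": "red",   "box": [40, 80, 210, 260]},   # [xmin, ymin, xmax, ymax]
    {"label": "house", "color": "blue",  "box": [480, 90, 640, 270]},
    {"label": "tree",  "color": "green", "box": [300, 60, 380, 250]},
]

def resolve(label, color, detections):
    """Pick the matching object; 'on the left' = smallest x-coordinate among matches."""
    matches = [d for d in detections if d["label"] == label and d["color"] == color]
    return min(matches, key=lambda d: d["box"][0]) if matches else None

# "The red house on the left"
print(resolve("house", "red", detections))
```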

    Zero-Shot Learning: Learning Without Examples

    Zero-shot learning empowers a model to perform tasks it wasn't explicitly trained on. This is achieved by training on a broad dataset that describes relationships between categories, enabling the model to generalize to new categories without specific training examples. Zero-shot learning is especially valuable when labeled data is scarce or nonexistent, and ongoing research focuses on improving how well knowledge transfers across domains.
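
    A sketch of zero-shot image classification in the style of CLIP, assuming the Hugging Face transformers library is installed; the candidate labels were never part of any task-specific training, and the input file is hypothetical:

```python
import torch
from transformers import CLIPModel, CLIPProcessor
from PIL import Image

# CLIP scores how well each text prompt matches the image, so new labels
# can be classified without task-specific training examples.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg").convert("RGB")   # hypothetical input
candidate_labels = ["a photo of a dog", "a photo of a mountain", "a photo of a car"]

inputs = processor(text=candidate_labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

probs = outputs.logits_per_image.softmax(dim=1)    # similarity scores over candidate labels
best = probs.argmax(dim=1).item()
print(candidate_labels[best], float(probs[0, best]))
```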

    Conclusion

    A solid understanding of these core terms is crucial for navigating the world of AI, ML, and Computer Vision. As these fields continue to evolve, staying updated on these foundational concepts will empower you to understand new developments and contribute to future advancements.

    FAQ

    What is the difference between a label and an annotation?

    A label is a simple identifier or category assigned to an object or image, like "cat" or "dog." An annotation is richer, often including visual markers like bounding boxes or keypoints, along with textual descriptions and metadata providing more context about the object or area.

    Why are bounding boxes important in object detection?

    Bounding boxes are fundamental because they define the precise location and boundaries of objects within an image or video. They allow object detection models to not only classify objects but also pinpoint their position, which is crucial for applications like autonomous driving and surveillance.

    How does zero-shot learning work without training examples?

    Zero-shot learning relies on training a model on a large dataset that establishes relationships between different categories or concepts. The model learns to generalize these relationships, allowing it to recognize and classify new objects or categories it hasn't seen before, based on its understanding of their properties and relationships to known categories.

    What is the purpose of a feature vector?

    A feature vector is a numerical representation of an image's features, capturing the distinctive attributes of an object or area. These vectors are essential for machine learning models because they convert visual information into a format that algorithms can process and analyze. They help the model to identify and classify objects based on their underlying characteristics.

    How is image segmentation different from object detection?

    While both involve identifying objects in an image, image segmentation aims to partition the image into pixel-level regions corresponding to different objects or areas, providing a more detailed and precise representation. Object detection, on the other hand, focuses on drawing bounding boxes around objects and assigning them class labels, without necessarily segmenting the entire image.
