Computer vision is the discipline that gives machines the ability to interpret and understand the visual world. Every object detector, face recognition system, quality inspection pipeline, and medical imaging AI was built on annotated image data. The annotation technique you choose directly determines what your model can learn — and what it cannot.
Bounding Box Annotation
The most common and cost-effective technique. Annotators draw an axis-aligned rectangle around each object instance. Bounding boxes are fast to produce and sufficient for object detection tasks where you need to know where objects are but not their precise shape. Used in pedestrian detection, product recognition in retail, and vehicle detection.
The key quality concern with bounding boxes is tightness consistency: boxes that are too loose include background pixels that confuse the model, while overly tight boxes clip part of the object. Clear guidelines with worked examples are essential.
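Tightness can be audited numerically by comparing an annotator's box against a trusted reference box using Intersection over Union. A minimal sketch, assuming boxes are stored as (x1, y1, x2, y2) corner tuples (the storage format varies by tool):

```python
def box_iou(a, b):
    """Intersection over Union for two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

# A slightly loose annotator box against a tight reference box:
reference = (10, 10, 50, 50)
annotated = (8, 8, 54, 52)
print(round(box_iou(reference, annotated), 3))  # 0.791
```

A box only a few pixels loose on each side already drops the IoU noticeably, which is why guidelines usually pair worked examples with a numeric acceptance threshold.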
Polygon Annotation
For irregularly shaped objects where a rectangle is too imprecise, annotators draw a polygon by placing vertices along the object boundary. More time-consuming than bounding boxes but significantly more accurate for elongated objects, curved surfaces, and items with complex silhouettes. Common in document layout analysis and satellite imagery.
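The gain over a rectangle is easy to quantify: a polygon's area can be computed from its vertex list with the shoelace formula and compared against the area its bounding box would cover. A small sketch, assuming vertices are (x, y) pairs in drawing order:

```python
def polygon_area(vertices):
    """Area of a simple polygon via the shoelace formula.

    vertices: list of (x, y) pairs in drawing order.
    """
    acc = 0.0
    n = len(vertices)
    for i in range(n):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % n]
        acc += x1 * y2 - x2 * y1
    return abs(acc) / 2.0

# A thin L-shaped object that a bounding box would badly over-cover:
poly = [(0, 0), (10, 0), (10, 2), (2, 2), (2, 10), (0, 10)]
print(polygon_area(poly))  # 36.0, vs. 100.0 for its 10x10 bounding box
```

For this L-shape, a bounding box would include nearly twice as much background as object, which is exactly the imprecision polygons exist to avoid.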
Semantic Segmentation
Every pixel in the image is assigned a class label. The output is a color-coded mask where all road pixels are one color, all sky pixels another, all pedestrians another. Semantic segmentation is labor-intensive to annotate (often an order of magnitude or more slower per image than bounding boxes) but enables models that need to understand scene geometry, such as autonomous driving perception stacks.
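The per-pixel nature of the label also makes quality measurable per pixel. A minimal sketch, assuming masks are stored as 2D grids of integer class IDs (real pipelines typically use NumPy arrays or PNG label maps):

```python
def pixel_accuracy(pred, truth):
    """Fraction of pixels whose class label matches the reference mask.

    pred, truth: equally sized 2D grids of integer class IDs.
    """
    total = correct = 0
    for row_p, row_t in zip(pred, truth):
        for p, t in zip(row_p, row_t):
            total += 1
            correct += (p == t)
    return correct / total

# 0 = road, 1 = sky in a tiny 2x4 mask; one sky pixel was mislabeled:
truth = [[1, 1, 1, 1],
         [0, 0, 0, 0]]
pred  = [[1, 1, 0, 1],
         [0, 0, 0, 0]]
print(pixel_accuracy(pred, truth))  # 0.875
```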
Instance Segmentation
Like semantic segmentation, but each object instance gets its own unique mask — so two overlapping cars are labeled as separate instances rather than merged into a single "car" region. Required for crowd counting, surgical instrument tracking, and any task where distinguishing individual objects matters.
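The difference from semantic segmentation shows up directly in the mask encoding: instead of one class ID per pixel, each pixel carries an instance ID, so counting objects is just counting distinct IDs. A sketch under that assumption (0 as background is a common but not universal convention):

```python
def count_instances(instance_mask):
    """Count distinct object instances in an instance-ID mask.

    instance_mask: 2D grid where 0 is background and each positive
    integer identifies one object instance.
    """
    ids = {v for row in instance_mask for v in row if v != 0}
    return len(ids)

# Two adjacent cars keep separate IDs instead of merging into one region:
mask = [[0, 1, 1, 0],
        [0, 1, 2, 2],
        [0, 0, 2, 2]]
print(count_instances(mask))  # 2
```

A purely semantic mask of the same scene would label all six car pixels with one class ID, making the two vehicles indistinguishable.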
Keypoint Annotation
Annotators mark specific points on objects — joints on a human skeleton (shoulder, elbow, wrist, hip, knee, ankle), landmarks on a face (eye corners, nose tip, lip edges), or reference points on mechanical parts. Keypoint data trains pose estimation models used in fitness apps, sports analytics, and gesture recognition systems.
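Keypoint quality is usually scored as the mean pixel distance between annotated points and reference points, matched by position in a fixed joint order. A minimal sketch, assuming both lists share the same ordering (e.g. shoulder, elbow, wrist):

```python
import math

def keypoint_error(annotated, reference):
    """Mean Euclidean pixel distance between matched keypoints.

    annotated, reference: equal-length lists of (x, y) points
    in the same fixed order.
    """
    dists = [math.dist(a, r) for a, r in zip(annotated, reference)]
    return sum(dists) / len(dists)

# Three joints, two of them annotated a few pixels off the reference:
ref = [(100, 40), (120, 80), (140, 120)]
ann = [(103, 44), (120, 80), (137, 116)]
print(round(keypoint_error(ann, ref), 2))  # 3.33
```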
Classification Tagging
Assigning one or more tags to an entire image without localizing objects. Used to train image classification models: "contains_vehicle", "indoor_scene", "product_defect_visible". The simplest and fastest annotation type — well-suited for large-scale content moderation or image search indexing.
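Because tags are free-form strings, the main failure mode at scale is vocabulary drift (typos and near-duplicate labels). One common safeguard is validating every tag against an agreed taxonomy; a sketch with a hypothetical vocabulary built from the examples above:

```python
# Hypothetical taxonomy; real projects define their own controlled vocabulary.
VOCABULARY = {"contains_vehicle", "indoor_scene", "product_defect_visible"}

def validate_tags(tags):
    """Reject tags outside the agreed vocabulary, catching typos like
    'contains_vehicles' before they pollute the dataset."""
    unknown = set(tags) - VOCABULARY
    if unknown:
        raise ValueError(f"unknown tags: {sorted(unknown)}")
    return sorted(set(tags))

print(validate_tags(["indoor_scene", "contains_vehicle"]))
# ['contains_vehicle', 'indoor_scene']
```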
Choosing the Right Technique
- Object detection (where is it?): Bounding boxes — fast, scalable, widely supported.
- Shape-sensitive detection: Polygons — when object shape matters more than speed.
- Scene understanding: Semantic segmentation — essential for robotics and autonomous driving.
- Individual instance counting: Instance segmentation — when objects overlap or cluster.
- Pose and gesture: Keypoints — human body, hands, and face landmark tasks.
- Content tagging: Classification — image-level labels for search and moderation.
Domain-Specific Annotation Challenges
Medical imaging annotation (X-rays, MRIs, histology slides) requires clinically trained annotators, not general-purpose labelers. A misidentified tumor margin is not the same as a misidentified car. Similarly, satellite and aerial imagery annotation requires annotators who understand top-down perspective and can identify structures that look very different from ground-level photos.
Manufacturing quality inspection data often contains subtle defects — micro-cracks, surface discoloration, dimensional deviations — that only experienced inspectors can reliably identify. In these domains, domain expertise is a prerequisite, not a nice-to-have.
Quality Metrics for Image Annotation
- Intersection over Union (IoU): For bounding boxes and segmentation masks. Target IoU > 0.85 for most applications.
- Pixel accuracy: Percentage of correctly labeled pixels in segmentation tasks.
- Keypoint localization error: Mean pixel distance between annotated and ground-truth keypoints.
- Annotation consistency: Re-annotating a random sample of completed images to measure drift.
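The last two metrics combine naturally: consistency drift on segmentation work can be measured by computing IoU between an original mask and its blind re-annotation. A sketch assuming binary (0/1) masks for a single class:

```python
def mask_iou(a, b):
    """IoU between two binary masks (2D grids of 0/1), e.g. an original
    annotation vs. a blind re-annotation of the same image."""
    inter = union = 0
    for row_a, row_b in zip(a, b):
        for pa, pb in zip(row_a, row_b):
            inter += pa & pb
            union += pa | pb
    return inter / union if union else 1.0

original = [[0, 1, 1],
            [0, 1, 1]]
redone   = [[0, 1, 1],
            [0, 0, 1]]
print(mask_iou(original, redone))  # 0.75
```

A re-annotation IoU falling below the project's acceptance threshold (e.g. the 0.85 target above) is a signal that guidelines have drifted and annotators need recalibration.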