
Imagine a robot cleaning your house, a tractor working a farm field without a farmer, or a self-driving car taking you to the airport. You may wonder how the AI systems in each of these examples navigate their scenarios. This is a critical question, because the AI system needs to understand the environment (e.g., streets, pedestrians) and the task (driving) so that it can take proper actions (move forward, turn left).

In practice, the AI system relies on a computer vision model that can automatically generate a semantic label for every video pixel, a task known as automatic video semantic segmentation. For example, it helps self-driving cars understand what they’re seeing as they drive. Here’s how it works:

  1. Capture Video: The car has cameras that record everything around it as it moves.
  2. Prepare the Video: The car’s computer cleans up the video to make it clearer and easier to analyze.
  3. Learn to See: The computer is trained using lots of annotated videos, where each part of the video is marked as street, car, person, etc. This helps the computer learn to recognize these things on its own. The figure below shows a video frame example.
  4. Identify Objects: When the car is driving, the computer looks at each frame of the video and identifies every pixel as part of the street, a car, a person, etc. This creates a map of everything around the car (see the code sketch after the figure below).
  5. Keep it Smooth: The computer uses special techniques to make sure its understanding of the scene is consistent from one frame to the next, avoiding the perception of flickering or jumping objects.
  6. Make Decisions: The car uses this detailed map to make decisions, like when to stop, go, or turn, ensuring it can navigate safely and avoid obstacles.
A video frame example from the Cityscapes dataset, where different colors denote different labels at the pixel level. For example, red means person and light blue means sky.
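
As a concrete illustration of steps 4 and 5, here is a minimal sketch of per-pixel labeling with a simple temporal smoothing pass. It uses an off-the-shelf DeepLabV3 model from torchvision (pretrained on a generic label set rather than Cityscapes), and the frame file names are placeholders; this is not the pipeline from our paper, just a sketch of the general idea.

```python
# A sketch of steps 4-5: label every pixel of each video frame, then smooth
# the predictions over time. The model is an off-the-shelf DeepLabV3
# (pretrained on a generic label set, not Cityscapes) and the frame file
# names are placeholders.
import torch
from torchvision import transforms
from torchvision.models.segmentation import deeplabv3_resnet50
from PIL import Image

model = deeplabv3_resnet50(weights="DEFAULT").eval()

preprocess = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],  # ImageNet statistics
                         std=[0.229, 0.224, 0.225]),  # expected by the weights
])

frame_paths = ["frame_0001.png", "frame_0002.png"]  # captured video frames (step 1)
prev_probs = None                                   # running estimate used for smoothing

for path in frame_paths:
    frame = Image.open(path).convert("RGB")
    batch = preprocess(frame).unsqueeze(0)          # shape: (1, 3, H, W)

    with torch.no_grad():
        logits = model(batch)["out"]                # shape: (1, num_classes, H, W)

    # Step 4: per-pixel class probabilities for this frame.
    probs = logits.softmax(dim=1)

    # Step 5: blend with the previous frame's probabilities (same resolution
    # assumed) so the predicted labels do not flicker between frames.
    probs = probs if prev_probs is None else 0.7 * probs + 0.3 * prev_probs
    prev_probs = probs

    labels = probs.argmax(dim=1)[0]                 # (H, W) map: street, car, person, ...
    print(path, labels.shape)
```

The final argmax gives the per-pixel map the car reasons about in step 6; a real system would use a model trained on driving scenes and a more careful temporal consistency method.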

The key step in this whole process is step 3: Learn to See. How can we train an AI model that can precisely make predictions for every pixel in every video frame? It sounds challenging in practice, but it is not impossible.

In order to successfully train a computer, we need lots of accurately annotated videos. Unfortunately, it’s usually impossible to ask a human annotator to fully label an entire video sequence due to the high per-pixel annotation cost. Therefore, the annotations are usually limited to a small subset of the video frames to save on this cost.

However, are all video frames equally important to get annotated for training purposes? Not really!

From a machine learning point of view, this kind of problem falls into the framework of active learning, where the algorithm interacts with a human annotator over multiple rounds. At each round, the algorithm strategically selects a video frame and sends it to the human annotator to be labeled. The algorithm then uses the annotated frame to learn about the environment, retrains the model, and decides which video frame to send in the next round. The key point of active learning is that the algorithm always selects the video frame whose annotation helps the training process most. Overall, by selecting only a small subset of video frames, active learning is expected to significantly reduce annotation costs.
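
To make this loop concrete, here is a generic sketch of an uncertainty-based active learning loop. The helpers `train_model`, `predict_probs`, and `request_annotation` are hypothetical placeholders for "retrain the model", "run the current model on a frame", and "ask the human annotator"; the selection rule below (mean per-pixel entropy) is one common choice, not necessarily the one used in our paper.

```python
# A generic active learning loop, assuming an uncertainty-based selection rule
# (mean per-pixel entropy). train_model, predict_probs, and request_annotation
# are hypothetical placeholders, not functions from our paper or any library.
import numpy as np

def frame_uncertainty(probs: np.ndarray) -> float:
    """Mean per-pixel entropy of a (num_classes, H, W) probability map."""
    entropy = -(probs * np.log(probs + 1e-12)).sum(axis=0)
    return float(entropy.mean())

def active_learning(unlabeled_frames, num_rounds, train_model, predict_probs, request_annotation):
    labeled = {}   # frame id -> human annotation
    model = None
    for _ in range(num_rounds):
        # Retrain on everything annotated so far (the first round may simply
        # return a pretrained or randomly initialized model).
        model = train_model(labeled)

        # Score each remaining frame and pick the one the model is least sure about.
        scores = {f: frame_uncertainty(predict_probs(model, f)) for f in unlabeled_frames}
        query = max(scores, key=scores.get)

        # Send only that frame to the human annotator.
        labeled[query] = request_annotation(query)
        unlabeled_frames.remove(query)

    return model, labeled
```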

Our paper [QSLXLZK, WACV’23] systematically applies active learning to automatic video semantic segmentation. We propose a novel human-in-the-loop framework, called HVSA (Human-in-the-Loop Video Semantic Segmentation Auto-Annotation), to generate semantic segmentation annotations for an entire video using only a small annotation budget. Our method alternates between active learning and training algorithms until the annotation quality is satisfactory. In particular, the active learning algorithm picks the most important samples to receive manual annotations, where a sample can be a video frame, a rectangle, or even a super-pixel (i.e., a set of pixels with an arbitrary shape). The test-time fine-tuning algorithm then propagates the manual annotations of the selected samples to the entire video. Real-world experiments on the Cityscapes dataset show that our method generates highly accurate and consistent semantic segmentation annotations while enjoying a significantly smaller annotation cost.
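
At a high level, the alternation in HVSA can be sketched as follows. Every helper function here is a hypothetical placeholder standing in for a component described above (sample selection, manual annotation, test-time fine-tuning, and quality estimation); this is a schematic of the loop, not our actual implementation.

```python
# A schematic sketch of the HVSA loop described above. All helpers passed in
# are hypothetical placeholders; the real selection criteria, fine-tuning
# procedure, and stopping rule are detailed in the paper.
def hvsa_annotate(video_frames, model, budget, quality_target,
                  select_samples, annotate, fine_tune, estimate_quality):
    spent = 0
    while spent < budget:
        # Active learning: pick the most informative samples, which may be
        # whole frames, rectangles, or super-pixels.
        samples = select_samples(model, video_frames)

        # Human in the loop: obtain manual labels for just those samples.
        labels = {s: annotate(s) for s in samples}
        spent += len(samples)  # sample count as a stand-in for annotation cost

        # Test-time fine-tuning: adapt the model on this video so the sparse
        # manual labels propagate to every frame and pixel.
        model = fine_tune(model, video_frames, labels)

        if estimate_quality(model, video_frames) >= quality_target:
            break

    # The fine-tuned model's predictions serve as annotations for the full video.
    return {frame: model(frame) for frame in video_frames}
```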

In the future, a successful AI self-driving system cannot depend solely on visual information. An interesting research direction is to incorporate more information sources, such as sound and laser-based sensing (LiDAR).

Results on two video frames (Figure 10 of [QSLXLZK, WACV’23]). Column (a) is the video frame, (b) costs about 10.4% of the human annotation clicks for annotating the frame, (c) costs about 44% of the clicks, (d) is the mimicked manual annotation, and (e) is the ground truth. Our method HVSA achieves similar performance in (b) compared to (c) but costs significantly fewer clicks.

People

Chong Liu

Postdoctoral Scholar, Data Science Institute