Abstract

Teaser image


Deep learning has led to remarkable strides in scene understanding, with panoptic segmentation emerging as a key holistic scene interpretation task. However, the performance of panoptic segmentation is severely impacted in the presence of out-of-distribution (OOD) objects, i.e., categories of objects that deviate from the training distribution. To overcome this limitation, we propose Panoptic Out-of-Distribution Segmentation for joint pixel-level semantic in-distribution and out-of-distribution classification with instance prediction. We extend two established panoptic segmentation benchmarks, Cityscapes and BDD100K, with out-of-distribution instance segmentation annotations, propose suitable evaluation metrics, and present multiple strong baselines. Importantly, we propose the novel PoDS architecture with a shared backbone, an OOD contextual module for learning global and local OOD object cues, and dual symmetrical decoders with task-specific heads that employ our alignment-mismatch strategy for better OOD generalization. Combined with our data augmentation strategy, this approach facilitates progressive learning of out-of-distribution objects while maintaining in-distribution performance. We perform extensive evaluations that demonstrate that our proposed PoDS network effectively addresses the main challenges and substantially outperforms the baselines.

What is Panoptic Out-of-Distribution Segmentation?

Overview of our task

Recent advances in deep learning have substantially improved the ability of autonomous systems to interpret their surroundings. Central to these advances is panoptic segmentation, which integrates semantic segmentation with instance segmentation, providing a holistic understanding of the environment. However, a significant challenge is that these models yield overconfident predictions for object categories outside the distribution they were trained on, known as out-of-distribution (OOD) objects. Segmenting these OOD objects poses a major challenge as they can vary significantly in appearance and semantics, include fine-grained details, and share visual characteristics with in-distribution objects, leading to ambiguity. Moreover, learning to jointly segment OOD objects and in-distribution categories is extremely challenging. Given the potential consequences of autonomous systems malfunctioning due to unexpected inputs, it is crucial to ensure their safe and robust deployment.

To directly address these challenges at the task level, we introduce panoptic out-of-distribution segmentation, which focuses on holistic scene understanding while effectively segmenting OOD objects. The proposed task aims to predict the semantic segmentation of stuff classes and the instance segmentation of thing classes, as well as an OOD class. An object is considered OOD if it is not present in the training distribution but appears during testing or deployment. Thus, panoptic out-of-distribution segmentation aims to assign each pixel \(i\) of an input image to an output pair \((c_i, \kappa_i) \in (C \cup O) \times N\). Here, \(C\) denotes the known semantic classes, \(O\) represents the out-of-distribution class such that \(C \cap O = \emptyset\), and \(N\) is the total number of instances. \(C\) is further divided into stuff labels \(C^S\) (e.g., sidewalks) and thing labels \(C^T\) (e.g., pedestrians). In this task, the variable \(c_i\) can be a semantic or OOD class, and \(\kappa_i\) indicates the corresponding instance ID; for stuff classes, \(\kappa_i\) is not applicable.
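To make the output format concrete, below is a minimal sketch of how a panoptic out-of-distribution prediction can be represented as paired per-pixel maps for \(c_i\) and \(\kappa_i\). The class IDs, image size, and the convention of using 0 for "no instance" are illustrative assumptions, not values prescribed by the paper.

```python
import numpy as np

# A toy panoptic out-of-distribution output: a per-pixel class map c_i and
# instance map kappa_i. Class IDs, image size, and the use of 0 for
# "no instance" are illustrative placeholders, not values from the paper.
STUFF = {0: "road", 1: "sidewalk"}      # C^S: stuff classes (no instance ID)
THING = {2: "car", 3: "pedestrian"}     # C^T: thing classes (with instance IDs)
OOD_CLASS = 255                         # O: out-of-distribution label, disjoint from C

H, W = 4, 6
semantic = np.zeros((H, W), dtype=np.int32)   # c_i for every pixel (all "road" here)
instance = np.zeros((H, W), dtype=np.int32)   # kappa_i; 0 means "no instance" (stuff)

semantic[1:3, 1:3] = 3
instance[1:3, 1:3] = 1                         # a pedestrian, instance 1
semantic[2:4, 4:6] = OOD_CLASS
instance[2:4, 4:6] = 2                         # an unknown (OOD) object, instance 2

# Every pixel carries a (c_i, kappa_i) pair; stuff pixels carry no instance ID.
for c, k in zip(semantic.ravel(), instance.ravel()):
    assert c in STUFF or c in THING or c == OOD_CLASS
    if c in STUFF:
        assert k == 0
```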



Technical Approach

Overview of the PoDS architecture

As the first approach to addressing the task of Panoptic Out-of-Distribution Segmentation, we propose the PoDS architecture. PoDS builds on top of a base panoptic segmentation network with a shared backbone and task-specific decoders (purple) by incorporating modules specifically designed to embed out-of-distribution capabilities based on prior knowledge of in-distribution classes. We incorporate an OOD contextual module (blue) that complements the robust in-distribution semantic features of the shared backbone with both global discriminatory and fine local OOD object representations. Subsequently, we introduce an additional task-specific decoder (green), equipped with dynamic modules, alongside the existing ones. This design allows for adaptive integration of OOD features while preserving the in-distribution features of the high-performing base panoptic network. The dual task-specific decoder configuration further benefits from our novel alignment-mismatch loss, which encourages learning finer distinctions between in-distribution semantic classes and what lies outside them by balancing consensus and divergence between the two decoders. Furthermore, we incorporate a data augmentation strategy to facilitate the training of our novel modules. Please refer to our paper for more details.
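For illustration, here is a minimal, heavily simplified PyTorch-style sketch of this structure under our own assumptions: the placeholder backbone, the design of the OOD contextual module, the channel widths, and the consensus/divergence form of the loss are illustrative stand-ins rather than the authors' implementation, and the instance-prediction heads of the actual decoders are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class OODContextualModule(nn.Module):
    """Stand-in for the OOD contextual module: fuses a global context vector
    with local per-pixel features (illustrative, not the paper's exact design)."""
    def __init__(self, channels: int):
        super().__init__()
        self.global_proj = nn.Linear(channels, channels)
        self.local_conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        g = self.global_proj(feats.mean(dim=(2, 3)))   # global discriminatory cue
        l = self.local_conv(feats)                     # fine local cue
        return l + g[:, :, None, None]


class PoDSSketch(nn.Module):
    """Shared backbone + OOD contextual module + dual symmetrical decoders."""
    def __init__(self, num_classes: int, channels: int = 64):
        super().__init__()
        self.backbone = nn.Sequential(                 # placeholder shared backbone
            nn.Conv2d(3, channels, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.ood_context = OODContextualModule(channels)
        # One decoder keeps in-distribution behavior; the other adaptively
        # integrates OOD features and predicts an extra OOD class.
        self.id_decoder = nn.Conv2d(channels, num_classes, 1)
        self.ood_decoder = nn.Conv2d(channels, num_classes + 1, 1)

    def forward(self, x):
        feats = self.backbone(x)
        ood_feats = self.ood_context(feats)
        id_logits = self.id_decoder(feats)
        ood_logits = self.ood_decoder(feats + ood_feats)
        return id_logits, ood_logits


def alignment_mismatch_style_loss(id_logits, ood_logits, ood_mask):
    """Illustrative consensus/divergence objective: push the two decoders to
    agree on in-distribution pixels and diverge on pixels flagged as OOD
    (a stand-in for the alignment-mismatch loss, not the exact formulation)."""
    p_id = F.softmax(id_logits, dim=1)
    p_ood = F.softmax(ood_logits[:, :-1], dim=1)             # drop the OOD channel
    diff = (p_ood - p_id).pow(2).mean(dim=1, keepdim=True)   # per-pixel disagreement
    in_dist, ood = (~ood_mask).float(), ood_mask.float()
    agree = (diff * in_dist).sum() / in_dist.sum().clamp(min=1.0)
    diverge = (diff * ood).sum() / ood.sum().clamp(min=1.0)
    return agree - diverge


if __name__ == "__main__":
    model = PoDSSketch(num_classes=19)
    x = torch.randn(1, 3, 128, 256)
    id_logits, ood_logits = model(x)
    ood_mask = torch.zeros(1, 1, *id_logits.shape[2:], dtype=torch.bool)
    print(id_logits.shape, ood_logits.shape,
          alignment_mismatch_style_loss(id_logits, ood_logits, ood_mask))
```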

Video

Code

Coming soon...

Publications

If you find our work useful, please consider citing our paper:

Rohit Mohan, Kiran Kumaraswamy, Juana Valeria Hurtado, Kürsat Petek, and Abhinav Valada
Panoptic Out-of-Distribution Segmentation

(PDF) (BibTeX)

Authors

Acknowledgment

This work was funded by the German Research Foundation Emmy Noether Program grant number 468878300 and an academic grant from NVIDIA.