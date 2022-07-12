



Posted by Qihang Yu, Student Researcher, and Liang-Chieh Chen, Research Scientist, Google Research

Panoptic segmentation is a computer vision problem that serves as a core task for many real-world applications. Because of its complexity, previous work often divides panoptic segmentation into semantic segmentation (assigning semantic labels, such as “person” and “sky”, to every pixel in the image) and instance segmentation (identifying and segmenting only countable objects, such as “pedestrians” and “cars”, in the image), and further divides it into several sub-tasks. Each sub-task is processed individually, and extra modules are applied to merge the results of each sub-task stage. This process is not only complex, it also introduces many hand-designed priors, both when processing the sub-tasks and when combining the results from different sub-task stages.

Recently, inspired by Transformer and DETR, MaX-DeepLab proposed an end-to-end solution for panoptic segmentation using mask transformers, an extension of the Transformer architecture that is used to generate segmentation masks. The solution adopts a pixel path (consisting of either convolutional neural networks or vision transformers) to extract pixel features, a memory path (consisting of transformer decoder modules) to extract memory features, and a dual-path transformer for the interaction between pixel features and memory features. However, the dual-path transformer, which relies on cross-attention, was originally designed for language tasks, where the input sequence consists of dozens or hundreds of words. In vision tasks, and in segmentation in particular, the input sequence consists of tens of thousands of pixels. The input is therefore not only much larger in scale, it is also a lower-level embedding than the words of a sentence.

In “CMT-DeepLab: Clustering Mask Transformers for Panoptic Segmentation”, presented at CVPR 2022, and “kMaX-DeepLab: k-means Mask Transformer”, to be presented at ECCV 2022, we propose to reinterpret and redesign cross-attention from a clustering perspective (that is, grouping together pixels that share the same semantic label), which adapts better to vision tasks. CMT-DeepLab is built on the previous state-of-the-art method, MaX-DeepLab, and uses a pixel clustering approach to perform cross-attention, which leads to a denser and more plausible attention map. kMaX-DeepLab further redesigns cross-attention to be more like a k-means clustering algorithm, with a simple change to the activation function. We show that CMT-DeepLab delivers significant performance improvements, while kMaX-DeepLab not only simplifies the modification but also pushes the state of the art further by a large margin, without test-time augmentation. We are also pleased to announce the open-source release of kMaX-DeepLab, our best-performing segmentation model, in the DeepLab2 library.

Overview

Instead of applying cross-attention directly to vision tasks without modification, we propose to reinterpret it from a clustering perspective. Specifically, we note that the mask Transformer object queries can be thought of as cluster centers (which aim to group pixels with the same semantic label), and that the cross-attention process is similar to the k-means clustering algorithm. k-means is an iterative process of (1) assigning pixels to cluster centers, where multiple pixels can be assigned to a single cluster center and some cluster centers may have no pixels assigned, and (2) updating each cluster center by averaging the pixels assigned to it (a cluster center is not updated if no pixel is assigned to it).

CMT-DeepLab and kMaX-DeepLab reformulate cross-attention from a clustering perspective, which consists of iterative cluster-assignment and cluster-update steps.
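The two k-means steps described above can be sketched in a few lines of plain Python. This is only an illustrative toy, not the model code: the function names `assign` and `update`, the squared Euclidean distance, and the 2-D feature values are all assumptions made for the example.

```python
def assign(pixels, centers):
    """Step (1): give each pixel the index of its nearest cluster center."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return [min(range(len(centers)), key=lambda k: sq_dist(p, centers[k]))
            for p in pixels]


def update(pixels, centers, labels):
    """Step (2): move each center to the mean of its assigned pixels."""
    new_centers = []
    for k, center in enumerate(centers):
        members = [p for p, lab in zip(pixels, labels) if lab == k]
        if not members:
            # No pixel was assigned: the center is left unchanged.
            new_centers.append(list(center))
            continue
        dim = len(center)
        new_centers.append([sum(m[d] for m in members) / len(members)
                            for d in range(dim)])
    return new_centers


# Hypothetical 2-D "pixel features" and three centers; the third is far away.
pixels = [[0.0, 0.0], [0.0, 2.0], [10.0, 10.0], [10.0, 12.0]]
centers = [[1.0, 1.0], [9.0, 9.0], [100.0, 100.0]]

labels = assign(pixels, centers)
centers = update(pixels, centers, labels)
print(labels)   # [0, 0, 1, 1]
print(centers)  # [[0.0, 1.0], [10.0, 11.0], [100.0, 100.0]]
```

Two pixels share each of the first two centers, while the third center receives no pixels and therefore keeps its original position.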

Given the popularity of the k-means clustering algorithm, in CMT-DeepLab we redesign cross-attention so that the spatial-wise softmax operation (that is, the softmax applied along the spatial resolution of the image), which in effect assigns cluster centers to pixels, is instead applied along the cluster centers. In kMaX-DeepLab, we further simplify the spatial-wise softmax to a cluster-wise argmax (that is, the argmax operation applied along the cluster centers). Note that the argmax operation is the same as the hard assignment used in the k-means clustering algorithm, in which each pixel is assigned to exactly one cluster.
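The difference between the three normalizations is only the axis along which they operate. The sketch below illustrates this on a made-up affinity matrix with clusters as rows and pixels as columns; the layout, the variable names, and the values are assumptions for the example, not taken from the released code.

```python
import math


def softmax(values):
    """Numerically stable softmax over a flat list."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


# Toy affinity logits: rows are cluster centers, columns are pixels.
logits = [
    [2.0, 1.0, 0.1, 0.0],   # cluster 0
    [1.9, 0.9, 3.0, 2.5],   # cluster 1
]
num_clusters, num_pixels = len(logits), len(logits[0])

# Spatial-wise softmax: normalize each cluster's row over all pixels.
spatial = [softmax(row) for row in logits]

# Cluster-wise softmax: normalize each pixel's column over all clusters.
columns = [[logits[k][n] for k in range(num_clusters)]
           for n in range(num_pixels)]
soft_columns = [softmax(col) for col in columns]
cluster_soft = [[soft_columns[n][k] for n in range(num_pixels)]
                for k in range(num_clusters)]

# Cluster-wise argmax: each pixel goes to exactly one cluster (one-hot column).
winners = [max(range(num_clusters), key=lambda k: col[k]) for col in columns]
cluster_hard = [[1.0 if winners[n] == k else 0.0 for n in range(num_pixels)]
                for k in range(num_clusters)]

print(winners)  # [0, 0, 1, 1]
```

With the spatial-wise softmax each row sums to one, so a cluster spreads its weight over every pixel. With the cluster-wise variants each column sums to one, so every pixel is distributed over (softmax) or committed to (argmax) the clusters.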

Reformulating the cross-attention of the mask transformer from a clustering perspective significantly improves segmentation performance and makes the complex mask transformer pipeline easier to interpret. First, pixel features are extracted from the input image with an encoder-decoder structure. Then, a set of cluster centers is used to group the pixels, and the cluster centers are further updated based on the clustering assignments. Finally, the clustering assignment and update steps are performed iteratively, with the last assignment serving directly as the segmentation prediction.

To convert a typical mask Transformer decoder (consisting of cross-attention, multi-head self-attention, and a feed-forward network) into the proposed k-means cross-attention, we simply replace the spatial-wise softmax with a cluster-wise argmax.
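To make the idea that the last assignment is itself the segmentation more concrete, here is a minimal sketch that runs the assign-and-update loop on a tiny grayscale grid. It is a strong simplification: the real model updates its cluster centers with learned transformer decoder layers, whereas this toy uses plain averaging on scalar intensities. The function name `cluster_image` and all values are hypothetical.

```python
def cluster_image(image, centers, max_iters=10):
    """Alternate assignment and update; return the last assignment as a label map."""
    height, width = len(image), len(image[0])
    pixels = [v for row in image for v in row]      # flatten H x W into a sequence
    centers = list(centers)
    labels = None
    for _ in range(max_iters):
        new_labels = [min(range(len(centers)), key=lambda k: abs(p - centers[k]))
                      for p in pixels]
        if new_labels == labels:                    # assignments stopped changing
            break
        labels = new_labels
        for k in range(len(centers)):
            members = [p for p, lab in zip(pixels, labels) if lab == k]
            if members:                             # empty clusters keep their center
                centers[k] = sum(members) / len(members)
    # Reshape the final assignment back to H x W: this is the segmentation.
    label_map = [labels[r * width:(r + 1) * width] for r in range(height)]
    return label_map, centers


image = [
    [0.1, 0.2, 0.9],
    [0.0, 0.8, 1.0],
]
segmentation, final_centers = cluster_image(image, centers=[0.4, 0.6])
for row in segmentation:
    print(row)
# [0, 0, 1]
# [0, 1, 1]
```

No separate prediction head is applied at the end of the loop: the per-pixel cluster index is returned directly as the label map.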

The meta-architecture of the proposed kMaX-DeepLab consists of three components: a pixel encoder, an enhanced pixel decoder, and a kMaX decoder. The pixel encoder is any network backbone used to extract image features. The enhanced pixel decoder includes transformer encoders to enhance the pixel features and upsampling layers to generate higher-resolution features. The series of kMaX decoders transforms the cluster centers into (1) mask embedding vectors, which are multiplied with the pixel features to generate the predicted masks, and (2) class predictions for each mask.

kMaX-DeepLab meta-architecture.
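The final step of the meta-architecture, multiplying mask embedding vectors with pixel features, can be illustrated with a small sketch. The helper names `predict_masks` and `panoptic_labels`, the 2-D features, and the fixed class names are assumptions for the example; in the model the class of each mask is itself predicted from the cluster center.

```python
def predict_masks(mask_embeddings, pixel_features):
    """Dot each mask embedding with every pixel feature to get mask logits."""
    return [[sum(e * f for e, f in zip(embedding, feature))
             for feature in pixel_features]
            for embedding in mask_embeddings]


def panoptic_labels(mask_logits, class_ids):
    """Assign each pixel to its highest-scoring mask and report that mask's class."""
    num_masks, num_pixels = len(mask_logits), len(mask_logits[0])
    mask_ids = [max(range(num_masks), key=lambda k: mask_logits[k][n])
                for n in range(num_pixels)]
    return [(m, class_ids[m]) for m in mask_ids]


# Hypothetical toy values: 4 pixels with 2-D features, 2 masks.
pixel_features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.2, 0.8]]
mask_embeddings = [[1.0, 0.0], [0.0, 1.0]]
class_ids = ["sky", "person"]   # stand-in for the per-mask class predictions

logits = predict_masks(mask_embeddings, pixel_features)
print(panoptic_labels(logits, class_ids))
# [(0, 'sky'), (0, 'sky'), (1, 'person'), (1, 'person')]
```

Each output pair is a mask index and the class attached to that mask, which is the form a panoptic prediction takes: one instance identity and one semantic label per pixel.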

Results

We evaluate CMT-DeepLab and kMaX-DeepLab against MaX-DeepLab and other state-of-the-art methods using the panoptic quality (PQ) metric on two of the most challenging panoptic segmentation datasets, COCO and Cityscapes. CMT-DeepLab delivers significant performance improvements, while kMaX-DeepLab not only simplifies the modification but also pushes the state of the art further by a large margin: 58.0% PQ on the COCO val set, and 68.4% PQ, 44.0% mask Average Precision (mask AP), and 83.5% mean Intersection-over-Union (mIoU) on the Cityscapes val set, without test-time augmentation or the use of an external dataset.

Comparison on COCO val set.

Method              PQ               mask AP          mIoU
Panoptic-DeepLab    63.0% (-5.4%)    35.3% (-8.7%)    80.5% (-3.0%)
Axial-DeepLab       64.4% (-4.0%)    36.7% (-7.3%)    80.6% (-2.9%)
SWideRNet           66.4% (-2.0%)    40.1% (-3.9%)    82.2% (-1.3%)
kMaX-DeepLab        68.4%            44.0%            83.5%

Comparison on Cityscapes val set.
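For readers unfamiliar with the PQ metric reported above, the sketch below computes it for a single class using the standard definition: the sum of IoUs over matched segments, divided by the number of true positives plus half the false positives and half the false negatives, where a match requires an IoU above 0.5. Segments are represented here as sets of pixel indices, which is an assumption made for brevity. The official evaluation also averages over classes and handles void regions, which this toy omits.

```python
def iou(a, b):
    """Intersection-over-union of two segments given as sets of pixel indices."""
    return len(a & b) / len(a | b)


def panoptic_quality(predicted, ground_truth):
    """PQ for one class: sum of matched IoUs over TP + FP/2 + FN/2."""
    matched_ious = []
    matched_pred = set()
    for gt in ground_truth:
        for i, pred in enumerate(predicted):
            overlap = iou(gt, pred)
            if overlap > 0.5:           # the IoU > 0.5 rule makes each match unique
                matched_ious.append(overlap)
                matched_pred.add(i)
                break
    tp = len(matched_ious)
    fp = len(predicted) - len(matched_pred)
    fn = len(ground_truth) - tp
    denom = tp + 0.5 * fp + 0.5 * fn
    return sum(matched_ious) / denom if denom else 0.0


# Hypothetical segments: one good match, one missed object, one spurious mask.
ground_truth = [set(range(0, 10)), set(range(10, 20))]
predicted = [set(range(0, 8)), set(range(30, 34))]
print(panoptic_quality(predicted, ground_truth))  # 0.4
```

Here one segment is matched with IoU 0.8, one ground-truth segment is missed, and one prediction is spurious, giving 0.8 / (1 + 0.5 + 0.5) = 0.4.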

Designed from a clustering perspective, kMaX-DeepLab not only achieves higher performance, it also yields a more plausible visualization of the attention map, which helps in understanding how the model works. In the example below, kMaX-DeepLab iteratively performs clustering assignments and updates, which gradually improves the quality of the masks.

The kMaX-DeepLab attention map can be directly visualized as a panoptic segmentation, which makes the model’s working mechanism more plausible (image credits: coco_url, and license).

Conclusion

We have shown a way to better design mask transformers for vision tasks. With simple modifications, CMT-DeepLab and kMaX-DeepLab reformulate cross-attention to be more like a clustering algorithm. As a result, the proposed models achieve state-of-the-art performance on the challenging COCO and Cityscapes datasets. We hope that the open-source release of kMaX-DeepLab in the DeepLab2 library will facilitate future research on the design of vision-specific transformer architectures.

Acknowledgments

We would like to thank Huiyu Wang, Dahun Kim, Siyuan Qiao, Maxwell Collins, Yukun Zhu, Florian Schroff, Hartwig Adam, and Alan Yuille for their valuable discussions and support.

