Context-aware Image Semantic Segmentation

Publication Type:
Thesis
Issue Date:
2022
Abstract:
Semantic segmentation is a fundamental task for computer vision applications, yet existing solutions struggle with difficult cases. This thesis develops three novel approaches that improve the generalization ability of existing solutions at significantly reduced computational cost. Extensive experiments on multiple benchmark datasets demonstrate the superior performance of the proposed approaches.

Scale-invariant: State-of-the-art semantic segmentation solutions usually leverage different receptive fields via multiple parallel branches to handle objects of different sizes. However, employing a separate kernel for each branch degrades the network's generalization to objects of different scales, and the computational cost grows with the number of branches. This thesis proposes a novel network structure, Kernel-Sharing Atrous Convolution (KSAC), in which branches with different receptive fields share the same kernel, i.e., a single kernel "sees" the input feature maps more than once, each time with a different receptive field (a minimal sketch follows this abstract).

Seamless dual attention: Spatial attention and channel attention, which model semantic inter-dependencies in the spatial and channel dimensions respectively, have recently been widely used for semantic segmentation. However, computing the two attentions separately sometimes causes errors, especially in difficult cases. In this research, Channelized Axial Attention (CAA) is developed to seamlessly integrate channel attention and spatial attention into a single operation with negligible computational overhead. Furthermore, a novel grouped vectorization approach allows the proposed model to run with very little memory consumption and without slowing down the computation (see the second sketch below).

Class-aware regularization: Recent segmentation methods that utilize class-level information in addition to pixel features have achieved notable success in boosting the accuracy of existing network models. However, the extracted class-level information is simply concatenated to the pixel features rather than explicitly exploited to learn better pixel representations. Moreover, these approaches learn soft class centers from coarse mask predictions, which is prone to error accumulation. Motivated by the fact that humans can recognize an object on its own, no matter which other objects it appears with, and aiming to use class-level information more effectively, a universal Class-Aware Regularization (CAR) approach is proposed to optimize the intra-class variance and inter-class distance during feature learning. Furthermore, the class centers in the proposed approach are generated directly from the ground truth rather than from error-prone coarse predictions. CAR can easily be applied to most existing segmentation models and largely improves their accuracy at no additional inference cost (the third sketch below illustrates the losses).
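The shared-kernel idea behind KSAC lends itself to a compact implementation. Below is a minimal PyTorch sketch, not the thesis code: the same 3x3 weight tensor is applied at several dilation rates, so adding a branch widens the receptive field without adding parameters. The class name, the default rates, and merging the per-rate outputs by summation are assumptions of this sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class KernelSharingAtrousConv(nn.Module):
    """Minimal sketch of Kernel-Sharing Atrous Convolution (KSAC).

    One 3x3 weight tensor is reused across all dilation rates, so the
    same kernel "sees" the input with several receptive fields.
    """

    def __init__(self, in_channels, out_channels, rates=(6, 12, 18)):
        super().__init__()
        self.rates = rates
        # Single shared kernel for every branch.
        self.weight = nn.Parameter(torch.empty(out_channels, in_channels, 3, 3))
        nn.init.kaiming_normal_(self.weight)
        self.bn = nn.BatchNorm2d(out_channels)

    def forward(self, x):
        # Apply the *same* weights at each dilation rate; for a 3x3 kernel,
        # padding == dilation keeps the spatial size unchanged.
        out = sum(F.conv2d(x, self.weight, padding=r, dilation=r)
                  for r in self.rates)
        return F.relu(self.bn(out))


# Example: three receptive fields, one kernel's worth of parameters.
feats = torch.randn(2, 256, 33, 33)
print(KernelSharingAtrousConv(256, 256)(feats).shape)  # torch.Size([2, 256, 33, 33])
```

Note the contrast with an ASPP-style module, where each branch holds its own kernel: here the parameter count is constant in the number of rates, which is what the abstract credits for both the reduced cost and the improved scale generalization.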
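The claim that CAA fuses channel and spatial attention "into a single operation" can be illustrated as follows. This is a heavily simplified sketch under stated assumptions, not the thesis implementation: it covers the height axis only, and the sigmoid self-gating stands in for whatever channel-attention formulation the thesis uses. The chunked loop shows how a grouped vectorization can bound memory without serializing the computation per pixel.

```python
import torch
import torch.nn as nn

class ChannelizedAxialAttention(nn.Module):
    """Minimal sketch of the CAA idea, height axis only.

    Instead of computing spatial and channel attention as two separate
    modules, the per-key contributions of axial attention are re-weighted
    channel-wise *before* being summed over keys. Layer names, the gating
    function, and the mid-channel width are assumptions of this sketch.
    """

    def __init__(self, channels, mid_channels=64):
        super().__init__()
        self.query = nn.Conv2d(channels, mid_channels, 1)
        self.key = nn.Conv2d(channels, mid_channels, 1)
        self.value = nn.Conv2d(channels, channels, 1)

    def forward(self, x, group_size=16):
        b, c, h, w = x.shape
        q = self.query(x).permute(0, 3, 2, 1)  # (B, W, H, Cm)
        k = self.key(x).permute(0, 3, 1, 2)    # (B, W, Cm, H)
        v = self.value(x).permute(0, 3, 2, 1)  # (B, W, H, C)

        # Plain axial (height-axis) spatial attention.
        a = torch.softmax(q @ k, dim=-1)       # (B, W, Hq, Hk)

        out = torch.empty_like(v)
        # Grouped vectorization: materializing all per-key contributions at
        # once needs a (B, W, H, H, C) tensor; chunking the query rows caps
        # the intermediate at (B, W, group_size, H, C).
        for i in range(0, h, group_size):
            # Per-key weighted values, kept separate per channel.
            contrib = a[:, :, i:i + group_size, :, None] * v[:, :, None, :, :]
            # Channel attention applied inside the spatial attention
            # (sigmoid self-gating is an assumption, not the thesis form).
            out[:, :, i:i + group_size] = (contrib * torch.sigmoid(contrib)).sum(dim=3)
        return out.permute(0, 3, 2, 1)          # back to (B, C, H, W)
```

The key structural point matches the abstract: the channel re-weighting happens on the intermediate attention result rather than in a parallel branch, so the two attentions cannot disagree on difficult pixels, and `group_size` trades peak memory against the degree of vectorization.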
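The CAR objectives translate naturally into two auxiliary training losses. The sketch below is one plausible instantiation, not the thesis formulation: `car_losses`, the cosine distance, mean pooling for the class centers, and the convention that `labels` is already downsampled to the feature resolution are all assumptions introduced here for illustration. What it does preserve from the abstract is that the centers come from the ground-truth mask, not from an error-prone coarse prediction.

```python
import torch
import torch.nn.functional as F

def car_losses(features, labels, num_classes, ignore_index=255):
    """Minimal sketch of Class-Aware Regularization (CAR) losses.

    features: (B, C, H, W) pixel features.
    labels:   (B, H, W) int64 ground-truth ids at feature resolution.
    """
    b, c, h, w = features.shape
    feats = features.permute(0, 2, 3, 1).reshape(-1, c)    # (N, C)
    labels = labels.reshape(-1)                            # (N,)
    keep = labels != ignore_index
    feats, labels = feats[keep], labels[keep]

    # Ground-truth class centers: per-class mean of pixel features.
    onehot = F.one_hot(labels, num_classes).float()        # (N, K)
    counts = onehot.sum(dim=0)                             # (K,)
    centers = (onehot.t() @ feats) / counts.clamp(min=1)[:, None]

    # Intra-class variance: pull every pixel toward its own class center.
    intra_loss = (1 - F.cosine_similarity(feats, centers[labels], dim=1)).mean()

    # Inter-class distance: push apart centers of classes in the batch.
    present = F.normalize(centers[counts > 0], dim=1)      # (P, C)
    sim = present @ present.t()
    off_diag = ~torch.eye(len(present), dtype=torch.bool, device=sim.device)
    inter_loss = sim[off_diag].mean() if len(present) > 1 else sim.new_zeros(())
    return intra_loss, inter_loss
```

Because both terms are computed only during training and discarded afterwards, the segmentation network itself is unchanged at test time, which is consistent with the abstract's claim of no additional inference cost.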