Towards Image Semantic Segmentation: From Context to Language
- Publication Type:
- Thesis
- Issue Date:
- 2025
Open Access
Copyright Clearance Process
- Recently Added
- In Progress
- Open Access
This item is open access.
Semantic image segmentation is crucial for contemporary computer vision applications. It aims to classify each pixel into a specific category or class. Despite significant advancements in semantic segmentation, current methods still face challenges, including low efficiency and limited capture of contextual dependencies due to structural limitations. This research primarily focuses on improving the semantic segmentation algorithms from three different aspects.
Class-Aware Contextual Information: Leveraging contextual dependencies is a commonly used technique to enhance the performance of image segmentation. However, existing solutions do not effectively catch the class-level association between the pixels along the boundary across the objects of the different classes, but focus more on the local pixel-to-pixel relation. In this thesis, a Class-Aware Affinity module (CAA) is proposed that considers both pixel-to-pixel relation and pixel-to-class association.
Extended Context-Aware Classifier: The vanilla classifier captures global information from the training data, encoded through a fixed set of parameters, including weights and biases. However, each image has a different class distribution, which prevents the classifier from addressing the unique characteristics of individual images. At the dataset level, class imbalance leads to segmentation results being biased towards majority classes, limiting the model's effectiveness in identifying and segmenting minority class regions. In this research, we propose an Extended Context-Aware Classifier (ECAC) that dynamically adjusts the classifier using global (dataset-level) and local (image-level) contextual information.
Open-Vocabulary Semantic Segmentation: Open-vocabulary semantic segmentation relies on precise pixel-level alignment of visual and textual representations, using text as a universal reference to bridge visual disparities across diverse datasets. While prior work has primarily focused on improving visual representations or alignment models, the pivotal role of textual representations has often been neglected. This research proposes a novel approach that harnesses large language models (LLMs) to produce enriched text prompts, replacing rudimentary templates with semantically detailed descriptions.
Please use this identifier to cite or link to this item:
