School of Engineering
Department of Computer Science and Engineering

Language-Guided Dense Prediction for Scene Understanding
Supervisor: XU Dan / CSE
Student: CHEN Ruiping / DSCT
Course: UROP 1100, Spring

Referring image segmentation (RIS) is a task at the intersection of computer vision and natural language processing: it aims to localize and segment objects in an image based on a natural language description, which requires precise multimodal alignment between visual and textual features. While fully supervised RIS methods can achieve strong performance, their reliance on costly pixel-level annotations limits scalability. This report explores recent advances in weakly supervised RIS. We analyze two state-of-the-art approaches, Curriculum Point Prompting and "Shatter and Gather", examining their overall frameworks and internal mechanisms as representative examples of current weakly supervised RIS methods. Finally, we present our experimental results, observations, and conclusions.

Language-Guided Dense Prediction for Scene Understanding
Supervisor: XU Dan / CSE
Student: GUO Zilin / MATH-GM
Course: UROP 1100, Summer

Referring Image Segmentation (RIS) aims to segment a specific object in an image based on a natural language description. Most existing methods require expensive pixel-level annotations. This project explores weakly supervised RIS using only image-text pairs. We propose a novel framework that integrates multi-granularity dense alignment through mask-text and patch-text alignment. A mask communication module enhances spatial awareness among regions, while cross-modal pre-alignment improves feature consistency. The model is trained with a contrastive cross-modal consistency loss, without dense labels. An auxiliary patch-level branch further refines predictions. Experiments on RefCOCO, RefCOCO+, and G-Ref show significant improvements over existing weakly supervised methods and competitive performance against fully supervised ones. This work demonstrates the potential of label-efficient learning in vision-language tasks.

Language-Guided Dense Prediction for Scene Understanding
Supervisor: XU Dan / CSE
Student: LU Junchi / COGBM
Course: UROP 1100, Fall

This in-progress report presents ongoing work on a project that uses large-scale vision-language models (VLMs), such as CLIP, to achieve open-vocabulary semantic segmentation and anomaly detection in complex and dynamic scenes. Building on the literature review and an initial baseline implementation, I am currently focusing on integrating text-driven message passing and region-level reasoning into the segmentation framework. I have been developing a foundational codebase that integrates CLIP embeddings with classical dense prediction architectures, aiming to enable zero-shot labeling of pixel-level regions. Preliminary experiments on standard benchmarks suggest promise in handling familiar categories, although further refinement is needed to fully validate the approach. I am now moving toward the 'fourth phase' of my research plan, the detection of anomalies and novel objects, where the focus is on exploring methods to detect, localize, and properly label unknown objects using linguistic cues. This report describes my recent progress, preliminary results, and tentative plans for improving the system's ability to adapt to new and unexpected classes.
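To make the training objective mentioned in the second abstract more concrete, below is a minimal sketch of a contrastive cross-modal consistency loss over a batch of image-text pairs, in the spirit of mask-text alignment without dense labels. The function and tensor names (cross_modal_contrastive_loss, mask_feats, text_feats) and the temperature value are illustrative assumptions, not the project's actual implementation.

```python
import torch
import torch.nn.functional as F

def cross_modal_contrastive_loss(mask_feats: torch.Tensor,
                                 text_feats: torch.Tensor,
                                 temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over a batch of paired embeddings.

    mask_feats: (B, D) mask-pooled region embeddings, one per image.
    text_feats: (B, D) sentence embeddings for the paired expressions.
    """
    # Normalize so dot products become cosine similarities.
    v = F.normalize(mask_feats, dim=-1)
    t = F.normalize(text_feats, dim=-1)
    logits = v @ t.T / temperature            # (B, B) similarity matrix
    targets = torch.arange(v.size(0), device=v.device)
    # Match each mask to its own text (rows) and each text to its mask (cols).
    loss_v2t = F.cross_entropy(logits, targets)
    loss_t2v = F.cross_entropy(logits.T, targets)
    return 0.5 * (loss_v2t + loss_t2v)
```

The diagonal of the similarity matrix holds the matched pairs; off-diagonal entries serve as in-batch negatives, which is what allows training from image-text pairs alone.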
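Similarly, the zero-shot pixel labeling described in the third abstract can be sketched as follows. The sketch assumes per-pixel visual features have already been projected into CLIP's joint embedding space by a dense prediction head (itself a non-trivial step that is not shown here); the class list and the label_pixels helper are hypothetical, not part of the reported codebase.

```python
import torch
import torch.nn.functional as F
import clip  # https://github.com/openai/CLIP

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)

# Illustrative open vocabulary; in practice the class list is user-supplied.
class_names = ["road", "building", "person", "car"]
prompts = clip.tokenize([f"a photo of a {c}" for c in class_names]).to(device)
with torch.no_grad():
    text_emb = F.normalize(model.encode_text(prompts).float(), dim=-1)  # (C, D)

def label_pixels(dense_feats: torch.Tensor) -> torch.Tensor:
    """dense_feats: (H, W, D) per-pixel features in CLIP's embedding space.
    Returns an (H, W) map of predicted class indices."""
    v = F.normalize(dense_feats.float(), dim=-1)
    # Cosine similarity between every pixel feature and every class embedding.
    sims = torch.einsum("hwd,cd->hwc", v, text_emb)
    return sims.argmax(dim=-1)
```

Because the class names enter only through tokenized text prompts, swapping in new categories requires no retraining, which is the property that makes the segmentation open-vocabulary.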