School of Engineering
Department of Computer Science and Engineering

Language-Guided Dense Prediction for Scene Understanding
Supervisor: XU Dan / CSE
Student: LIANG Yan / COMP
Course: UROP 1100, Summer

This report presents a comprehensive overview of my UROP 1100 research experience. The purpose of this project is to explore various methods for open-vocabulary dense prediction. The first section briefly introduces the open-vocabulary pixel-wise segmentation task, together with its settings and goals. The second section reviews methods introduced by previous researchers, which can broadly be classified into two approaches: end-to-end pipelines and two-stage methods. The third section describes our method and its implementation in detail, which builds on the state-of-the-art model CAT-Seg. The next section then reports the details of the experiments and analyzes the effectiveness of the new auxiliary modules. Finally, the report concludes with a discussion of future work and potential research directions. Owing to limited computing resources, the experiments are not comprehensive; extending them is itself a direction for future work.

Language-Guided Dense Prediction for Scene Understanding
Supervisor: XU Dan / CSE
Student: LU Junchi / COGBM
Course: UROP 1000, Summer

This report explores the current state of research in language-guided dense prediction for scene understanding, focusing in particular on leveraging large-scale pretrained vision-language models such as CLIP. The study investigates how these models can be adapted for open-vocabulary semantic segmentation. Through a comprehensive literature review, I analyze various approaches to semantic segmentation, open-world detection, and model adaptation. The review identifies key trends and critical challenges in the field, such as the need for improved handling of masked image regions and the integration of language cues for finer-grained segmentation tasks. The findings highlight several gaps in the existing literature, particularly in model generalization and scalability, which will inform the direction of future research. The report lays the groundwork for my further exploration into enhancing state-of-the-art models to achieve more accurate and flexible scene understanding in diverse, real-world environments.

Open World Understanding based on Large Vision-Language Models
Supervisor: XU Dan / CSE
Student: TONG Tsun Man / COMP
Course: UROP 1100, Summer

Over the years, image segmentation models have steadily improved in accuracy. However, these models are often constrained to a fixed set of predefined classes, which hampers their ability to generalize. Referring Image Segmentation (RIS) seeks to address this limitation by associating segmentation masks of object instances with linguistic expressions, thereby enabling models that are not restricted to predefined categories. This project aims to develop a system capable of open-world understanding by leveraging pre-trained large vision-language models, which facilitate semantic comprehension of the real world across a broad spectrum of object classes. Given the limited timeframe of the Summer term, the project has focused on conducting a literature review and selecting a baseline model.
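The abstracts above share a common technical core: comparing dense image features against CLIP text embeddings so that the label set is open rather than fixed. As a concrete illustration, the following is a minimal sketch of the cost-volume idea behind CAT-Seg-style models, assuming PyTorch; build_cost_volume is an illustrative helper, and the random tensors stand in for real CLIP features, so this is not the actual CAT-Seg implementation.

```python
# Minimal sketch of the cost-volume idea behind CAT-Seg-style models:
# dense image features are compared against CLIP text embeddings for
# every candidate class, producing a per-pixel similarity ("cost") map
# that a decoder can refine into segmentation logits.
# All shapes and feature tensors here are illustrative placeholders.

import torch
import torch.nn.functional as F

def build_cost_volume(dense_feats: torch.Tensor,
                      text_embeds: torch.Tensor) -> torch.Tensor:
    """dense_feats: (B, D, H, W) image features; text_embeds: (C, D)
    per-class CLIP text embeddings. Returns (B, C, H, W) cosine
    similarities, one channel per candidate class."""
    dense_feats = F.normalize(dense_feats, dim=1)   # unit-norm per pixel
    text_embeds = F.normalize(text_embeds, dim=1)   # unit-norm per class
    # einsum contracts the shared embedding dimension D.
    return torch.einsum("bdhw,cd->bchw", dense_feats, text_embeds)

# Toy usage with random tensors standing in for real CLIP features.
B, D, H, W, num_classes = 1, 512, 24, 24, 5
image_feats = torch.randn(B, D, H, W)
class_embeds = torch.randn(num_classes, D)  # e.g. encoded prompts like "a photo of a {class}"
cost = build_cost_volume(image_feats, class_embeds)
pred = cost.argmax(dim=1)                   # (B, H, W) coarse per-pixel class map
print(cost.shape, pred.shape)               # torch.Size([1, 5, 24, 24]) torch.Size([1, 24, 24])
```

In CAT-Seg itself, such a cost volume is further refined by aggregation modules rather than read out directly with an argmax as done here.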
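The two-stage methods noted in the first abstract, and the masked-image-region challenge raised in the second, follow a different recipe: generate class-agnostic mask proposals first, then classify each masked region with CLIP. Below is a hedged sketch of that pipeline; classify_masks and toy_embed are hypothetical stand-ins, since a real system would use a proposal network such as MaskFormer and CLIP's actual image encoder.

```python
# Sketch of a two-stage open-vocabulary pipeline: class-agnostic mask
# proposals are scored against CLIP text embeddings. The mask generator
# and image embedder are stubs, not a real model.

import torch
import torch.nn.functional as F

def classify_masks(image_embed_fn, image: torch.Tensor,
                   masks: torch.Tensor, text_embeds: torch.Tensor):
    """image: (3, H, W); masks: (N, H, W) binary proposals;
    text_embeds: (C, D). Returns (N,) predicted class id per mask."""
    scores = []
    for m in masks:
        region = image * m.unsqueeze(0)                    # zero out background
        emb = F.normalize(image_embed_fn(region), dim=-1)  # (D,) region embedding
        sims = emb @ F.normalize(text_embeds, dim=-1).T    # (C,) class similarities
        scores.append(sims)
    return torch.stack(scores).argmax(dim=-1)

# Stub embedder standing in for CLIP's image encoder.
D = 512
proj = torch.nn.Linear(3, D)
def toy_embed(img: torch.Tensor) -> torch.Tensor:
    # Global-average-pool the image, then project to the embedding space.
    return proj(img.mean(dim=(1, 2)))

image = torch.rand(3, 64, 64)
masks = (torch.rand(4, 64, 64) > 0.5).float()  # 4 fake proposals
text_embeds = torch.randn(7, D)                # 7 candidate classes
print(classify_masks(toy_embed, image, masks, text_embeds))  # tensor of 4 class ids
```

The same mask-scoring step underlies the Referring Image Segmentation task described in the third abstract: with the embedding of a single free-form expression in place of class prompts, the highest-scoring proposal is taken as the referred object.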