UROP Proceedings 2024-25

School of Engineering
Department of Computer Science and Engineering

Language-Guided Dense Prediction for Scene Understanding
Supervisor: XU Dan / CSE
Student: ZHANG Tianrui / COSC
Course: UROP 1100, Fall; UROP 2100, Spring

In this UROP project, I contributed to weakly supervised referring image segmentation (WSRIS), which segments the object referred to by a natural language expression using only image-text pairs, without pixel-level annotations. I developed a novel framework that integrates advanced models for mask proposal generation and selection, optimized by a custom loss function that aligns visual and textual features. Through extensive experiments on the RefCOCO, RefCOCO+, and GRef datasets, my approach achieved state-of-the-art performance on GRef, demonstrating robust segmentation. This work deepened my expertise in computer vision and machine learning, advanced WSRIS, and shaped my research trajectory.

Learning Implicit Representations for Talking Head Generation
Supervisor: XU Dan / CSE
Student: MAO Qiujing / CPEG; WU Yuetong / DSCT
Course: UROP 1100, Spring

Recent advances in 3D Gaussian Splatting (3DGS) have enabled highly efficient and photorealistic head avatar generation, overcoming limitations of traditional mesh-based and NeRF-based approaches. Unlike earlier methods such as NeRF, 3DGS represents dynamic facial geometry and appearance with anisotropic 3D Gaussians, enabling real-time rendering and explicit control. This review systematically analyzes recent work applying 3DGS and related methods to talking head generation, covering: 1) hybrid representations (e.g., combining 3DGS with the FLAME model); 2) dynamic modeling (e.g., audio-driven Gaussian avatars); and 3) real-time talking head generation.

Open World Understanding Based on Large Vision-Language Models
Supervisor: XU Dan / CSE
Student: CHEN Ruiping / DSCT
Course: UROP 1100, Spring

Referring image segmentation (RIS) is a task that combines computer vision and natural language processing: it localizes and segments objects in images based on natural language descriptions, requiring precise multimodal alignment between visual and textual features. While fully supervised RIS methods achieve strong performance, their reliance on costly pixel-level annotations limits scalability. This report surveys recent advances in weakly supervised RIS. We analyze two state-of-the-art approaches, Curriculum Point Prompting and "Shatter and Gather", examining their overall frameworks and internal mechanisms as representative examples of current weakly supervised RIS methods. Finally, we present our experimental results, observations, and conclusions.
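Both referring-segmentation abstracts above turn on the same idea: without pixel-level labels, the supervision signal must come from aligning mask proposals with the referring expression. The sketch below is a minimal, generic version of that idea, not the specific frameworks or loss functions described in the abstracts; the function name, the hard best-proposal selection, and the symmetric InfoNCE objective are illustrative assumptions, and the feature extractors (e.g., CLIP-style image and text encoders) are left outside the snippet.

```python
import torch
import torch.nn.functional as F

def select_and_align(mask_feats, text_feats, temperature=0.07):
    """For each image, pick the mask proposal that best matches its
    referring expression, then apply a symmetric InfoNCE loss across
    the batch so matched pairs score higher than mismatched ones.

    mask_feats: list of (N_i, D) tensors, one per image (N_i proposals)
    text_feats: (B, D) sentence embeddings, one per referring expression
    """
    text_feats = F.normalize(text_feats, dim=-1)
    selected = []
    for feats, t in zip(mask_feats, text_feats):
        feats = F.normalize(feats, dim=-1)
        sims = feats @ t                       # (N_i,) proposal-text similarity
        selected.append(feats[sims.argmax()])  # keep the best-scoring proposal
    selected = torch.stack(selected)           # (B, D)

    logits = selected @ text_feats.t() / temperature  # (B, B)
    targets = torch.arange(logits.size(0), device=logits.device)
    # Symmetric cross-entropy: each selected mask should match its own text,
    # and each text should match its own selected mask.
    loss = 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
    return loss
```

In practice, weakly supervised methods usually make the proposal selection differentiable (e.g., soft attention over proposals) rather than a hard argmax, so that gradients reach the proposal generator as well.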
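For the talking head project, the central object is the anisotropic 3D Gaussian itself. The sketch below shows the standard per-Gaussian parameter set from the original 3DGS formulation (centers, unit-quaternion rotations, per-axis log-scales, opacities, and spherical-harmonic color coefficients) and how the anisotropic covariance is assembled from rotation and scale; the class name and layout are illustrative and not taken from any of the surveyed head-avatar methods.

```python
import torch

class GaussianHead:
    """Minimal container for the per-Gaussian parameters used in 3DGS.

    Anisotropy comes from combining a per-axis scale with a rotation:
    the covariance of each Gaussian is R * S * S^T * R^T.
    """
    def __init__(self, num_points: int, sh_degree: int = 3):
        n = num_points
        self.xyz = torch.zeros(n, 3)             # Gaussian center positions
        self.rotation = torch.zeros(n, 4)        # unit quaternions (w, x, y, z)
        self.rotation[:, 0] = 1.0                # start at identity rotation
        self.log_scale = torch.zeros(n, 3)       # per-axis scales, log-space
        self.opacity_logit = torch.zeros(n, 1)   # opacity before the sigmoid
        num_sh = (sh_degree + 1) ** 2
        self.sh = torch.zeros(n, num_sh, 3)      # view-dependent color (SH coefficients)

    def covariances(self) -> torch.Tensor:
        """Build the 3x3 covariance of every Gaussian from scale and rotation."""
        w, x, y, z = self.rotation.unbind(-1)
        R = torch.stack([
            1 - 2 * (y * y + z * z), 2 * (x * y - w * z),     2 * (x * z + w * y),
            2 * (x * y + w * z),     1 - 2 * (x * x + z * z), 2 * (y * z - w * x),
            2 * (x * z - w * y),     2 * (y * z + w * x),     1 - 2 * (x * x + y * y),
        ], dim=-1).reshape(-1, 3, 3)
        S = torch.diag_embed(self.log_scale.exp())
        return R @ S @ S.transpose(1, 2) @ R.transpose(1, 2)
```

Head-avatar methods typically animate such a set by driving the Gaussian centers and rotations with a parametric face model (e.g., FLAME) or with audio features, which is what gives 3DGS its explicit control while keeping rendering real-time.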

RkJQdWJsaXNoZXIy NDk5Njg=