UROP Proceedings 2024-25

School of Engineering
Department of Computer Science and Engineering

Safe Diffusion Models for Robust AI Generation
Supervisor: GUO Song / CSE
Student: LU Zetian / CPEG
Course: UROP 1100, Fall

Machine unlearning is an emerging concept in machine learning that enables models to efficiently forget specific data points or learned representations. This report explores the application of machine unlearning methods to eliminate certain concepts from the Stable Diffusion model. As generative models gain prominence, the ability to remove specific learned representations is crucial for maintaining data privacy and model integrity. We investigate attack and unlearning techniques, focusing on their effectiveness in modifying the Stable Diffusion model's parameters without changing its architecture.

Empower Multimodal Language Models with Long Video Understanding
Supervisor: HE Junxian / CSE
Student: GAO Yitang / COMP
Course: UROP 1100, Fall

Long video understanding has emerged as a critical challenge in the field of multimodal machine learning. While recent years have seen the rapid development of large vision-language models (VLMs) and video-language models, current solutions often struggle with extended video content spanning tens of minutes or even hours. The temporal dimension, vast frame counts, and complexity of narrative structures pose unique difficulties not encountered in shorter video scenarios or static images. Our project focuses on adapting and extending state-of-the-art VLM architectures to better handle long-form videos. By gathering training data from large-scale video sources and incorporating more robust pooling and compression, we aim to improve temporal reasoning capabilities and produce models that genuinely understand lengthy, complex content.
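The pooling-based compression described above can be sketched as follows. This is a minimal illustration of temporal average pooling over frame embeddings; the window size, feature dimension, and function name are illustrative assumptions, not the project's actual configuration.

```python
import numpy as np

def compress_frames(frame_feats: np.ndarray, window: int) -> np.ndarray:
    """Average-pool frame embeddings over non-overlapping temporal windows.

    frame_feats: (T, D) array of per-frame features from a vision encoder.
    Returns a (ceil(T / window), D) array of compressed visual tokens.
    """
    T, D = frame_feats.shape
    pad = (-T) % window  # zero-pad so T divides evenly into windows
    if pad:
        frame_feats = np.concatenate([frame_feats, np.zeros((pad, D))], axis=0)
    return frame_feats.reshape(-1, window, D).mean(axis=1)

# Example: a 1-hour video sampled at 1 fps gives 3600 frames;
# an 8-frame pooling window compresses them to 450 visual tokens.
feats = np.random.default_rng(0).standard_normal((3600, 256))
tokens = compress_frames(feats, window=8)
print(tokens.shape)
```

Coarser pooling trades temporal resolution for a shorter token sequence, which is the central tension in fitting hour-long videos into a language model's context window.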
Making Large Language Models (LLMs) Interact with Physical World
Supervisor: LI Mo / CSE
Student: LIU Zihe / COMP
Course: UROP 1100, Fall; UROP 2100, Spring

AutoTour is an end-to-end system that transforms a geo-tagged photograph into two complementary deliverables: (i) an annotated image in which salient scene objects are bounded and labelled, and (ii) a concise, tour-guide-style narrative that situates those objects within their urban or indoor context. The core innovation lies in jointly reasoning over visual and spatial signals. A lightweight Overpass engine extracts OpenStreetMap (OSM) primitives inside a camera-aligned cone, while a vision–language model (VLM) locates corresponding entities in the picture. A large language model (LLM) then fuses geometric, cartographic, and textual cues to generate fluent descriptions. We present the system architecture, implementation, and an evaluation on HKUST campus photos that demonstrates high precision and compelling subjective usability.
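The camera-aligned cone filter can be sketched geometrically as below: compute the bearing from the camera to each OSM point and keep those within a half-angle of the camera heading. The function names, coordinates, and 30° half-angle are illustrative assumptions; the real system would first fetch candidate primitives from the Overpass API, which this sketch omits.

```python
import math

def bearing_deg(lat1, lon1, lat2, lon2):
    """Initial great-circle bearing from point 1 to point 2, in degrees [0, 360)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dlon = math.radians(lon2 - lon1)
    y = math.sin(dlon) * math.cos(phi2)
    x = math.cos(phi1) * math.sin(phi2) - math.sin(phi1) * math.cos(phi2) * math.cos(dlon)
    return math.degrees(math.atan2(y, x)) % 360

def in_view_cone(cam, heading_deg, half_angle_deg, poi):
    """True if the point of interest lies inside the camera-aligned cone."""
    b = bearing_deg(cam[0], cam[1], poi[0], poi[1])
    diff = abs((b - heading_deg + 180) % 360 - 180)  # smallest angular difference
    return diff <= half_angle_deg

# Hypothetical camera near HKUST, facing due east (90 degrees), 30-degree half-angle
cam = (22.3364, 114.2655)
pois = {"east_poi": (22.3364, 114.2700), "north_poi": (22.3400, 114.2655)}
visible = {name: in_view_cone(cam, 90.0, 30.0, p) for name, p in pois.items()}
print(visible)
```

Only entities passing this geometric test would be handed to the VLM for visual grounding, keeping the candidate set small and spatially consistent with the photograph.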
