School of Engineering
Department of Computer Science and Engineering

Generative AI
Supervisor: CHEN Qifeng / CSE
Student: CHEN Jing / COMP
Course: UROP 1100, Summer

In UROP 1100 I participated in the project led by PhD student Runtao Liu, working on two tasks: reproducing SafetyDPO and labeling Hunyuan video pairs. For the SafetyDPO reproduction, I measured the IP (Inappropriate Probability) of the model at different checkpoint steps; the results show that the checkpoint trained for 2000 steps is safer than the one trained for 1000 steps. For the labeling task, at least 500 video pairs had to be labeled each week (250 pairs in the first week), for 2000 pairs in total. The purpose of the labeling is to provide preference data from which the model can learn human preferences.

Generative AI
Supervisor: CHEN Qifeng / CSE
Student: DENG King Ho / COMP
Course: UROP 1100, Fall

The interplay between preference learning and latent space representations in Stable Diffusion models has profound implications for the quality, fidelity, and safety of generated outputs. This paper investigates how these two factors influence text-to-image synthesis, particularly under complex and detailed prompts. By synthesizing insights from recent advancements, we explore the effects of preference alignment, latent space transformations, and safety mechanisms on image generation. Our findings indicate that strategic interventions in both preference learning and latent representations significantly enhance alignment with user intent, improve aesthetic quality, and provide a foundation for safe text-to-image generation. Additionally, we present progress on training models designed to ensure safe content generation, while acknowledging the challenges that remain in testing and evaluation.

Generative AI
Supervisor: CHEN Qifeng / CSE
Student: FANG Pengjun / COMP
Course: UROP 3100, Fall

We introduce a new framework for animating portrait images into videos using only textual input. Unlike existing methods that rely on audio or video signals, which limit flexibility and diversity, our approach relies solely on textual descriptions. To support this, we developed a comprehensive dataset of 150,000 captioned portrait videos, which serves as the foundation for training our model. Our model includes three key innovations: 1) a hierarchical caption structure that enhances resilience to prompts with varying detail levels, 2) a novel method for introducing perturbations into reference latent vectors to improve dynamic quality while maintaining identity alignment, especially for out-of-domain portraits, and 3) a multi-condition classifier-free guidance scheme that effectively balances text, identity, and reference portrait conditions during inference. Our results highlight the effectiveness of these innovations, showcasing the potential for more adaptable and high-quality portrait video synthesis.
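The multi-condition classifier-free guidance mentioned in the last abstract raises the question of how several conditioning signals are balanced at inference time. The sketch below shows one common way to do this: each condition is applied on its own, and its deviation from the unconditional prediction is scaled by a separate guidance weight. This is only a minimal illustration of the general technique, assuming a noise-prediction diffusion model; the function name, the cond argument, and the weight values are illustrative and are not taken from the report.

    def multi_condition_cfg(model, x_t, t, cond_text, cond_id, cond_ref,
                            w_text=7.5, w_id=2.0, w_ref=1.5):
        # Noise estimate for one denoising step under three conditions.
        # Each condition contributes its own weighted guidance term
        # relative to the unconditional prediction.
        eps_uncond = model(x_t, t, cond=None)      # unconditional prediction
        eps_text = model(x_t, t, cond=cond_text)   # text caption only
        eps_id = model(x_t, t, cond=cond_id)       # identity embedding only
        eps_ref = model(x_t, t, cond=cond_ref)     # reference portrait only
        return (eps_uncond
                + w_text * (eps_text - eps_uncond)
                + w_id * (eps_id - eps_uncond)
                + w_ref * (eps_ref - eps_uncond))

Keeping a separate weight per condition lets inference trade off prompt fidelity, identity preservation, and faithfulness to the reference portrait without retraining, which is the kind of balancing the abstract describes.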
RkJQdWJsaXNoZXIy NDk5Njg=