School of Engineering
Department of Computer Science and Engineering

Generative AI
Supervisor: CHEN Qifeng / CSE
Student: WU Kam Man / DSCT
Course: UROP 2100, Spring

This project explores audio separation using a diffusion model adapted from the Stable Audio Tools repository. We create a dataset from VGGSound by mixing audio clips and use text prompts to guide the separation process. The model, trained on a JSONL dataset listing mixed and target audio paths, isolates specific sounds as specified by prompts such as “remove the sound of heartbeats.” Our approach demonstrates the efficacy of diffusion models for audio separation, achieving promising results in isolating sounds from complex mixtures. This work suggests potential applications in audio editing and sound engineering.

Generative AI
Supervisor: CHEN Qifeng / CSE
Student: YAN Yao / DSCT
Course: UROP 1000, Summer

In recent years, diffusion models have made significant progress in text-to-image (T2I) and text-to-video (T2V) generation. However, existing methods still face numerous challenges, particularly insufficient safety filtering mechanisms and poor alignment of generated content with user preferences. In this research report, I first reviewed published papers to gain a deeper understanding of the related concepts and applications, and then reported on replicating and training the SD1.5 results in Tables 1 and 2 of the paper “AlignGuard: Scalable Safety Alignment for Text-to-Image Generation”.

Generative AI
Supervisor: CHEN Qifeng / CSE
Student: ZHAO Donghao / COMP
Course: UROP 1100, Spring

Artificial Intelligence-Generated Content (AIGC) has emerged as a pivotal research area, particularly in video generation. Despite rapid progress, state-of-the-art (SOTA) text-to-video (T2V) and image-to-video (I2V) models still exhibit limitations such as pixel inconsistencies and temporal artifacts. This project addresses these challenges through two key contributions: (1) comprehensive benchmarking of SOTA models (e.g., HunyuanVideo, Wan2.1, CogVideoX, Step Video) to evaluate their strengths and weaknesses, and (2) the development of a binary classifier that serves as a reward model providing feedback on output quality. The classifier distinguishes between artifact-laden and high-quality frames, enabling iterative refinement of generative models. Additionally, we curated a custom dataset for reward fine-tuning, combining frames from flawed generated videos and high-quality Flux-generated images. Our work provides actionable insights for improving video generation pipelines and lays the foundation for future integration of reward-based training.
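To illustrate the kind of reward model the last project describes, the sketch below fine-tunes a pretrained image backbone as a binary frame-quality classifier. It is a minimal illustration, not the project's actual code: the directory layout (data/train/artifact, data/train/clean), the ResNet-18 backbone, and all hyperparameters are assumptions.

```python
# Minimal sketch of a binary frame-quality classifier used as a reward signal.
# Assumed layout: data/train/artifact/*.png (label 0), data/train/clean/*.png (label 1).
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import models, transforms
from torchvision.datasets import ImageFolder

transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])
train_set = ImageFolder("data/train", transform=transform)
loader = DataLoader(train_set, batch_size=32, shuffle=True)

# Reuse an ImageNet-pretrained backbone and replace the head with a single logit.
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
model.fc = nn.Linear(model.fc.in_features, 1)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
criterion = nn.BCEWithLogitsLoss()

model.train()
for images, labels in loader:
    logits = model(images).squeeze(1)
    loss = criterion(logits, labels.float())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# At inference time, sigmoid(logit) gives a per-frame quality score that a
# reward-based fine-tuning loop could use as feedback on generated frames.
```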
RkJQdWJsaXNoZXIy NDk5Njg=