School of Engineering
Department of Computer Science and Engineering

Generative AI
Supervisor: CHEN Qifeng / CSE
Student: FANG Pengjun / COMP
Course: UROP 1100, Spring; UROP 2100, Summer

Although video generation models have developed rapidly, current approaches still fall short in accurately generating human faces, including expressions and subtle head movements. This research develops and trains a state-of-the-art (SOTA) video generation model designed specifically for human faces. Leveraging advances in deep learning and computer vision, the study aims to create a model capable of generating realistic, high-fidelity videos depicting human facial expressions, movements, and interactions. The report focuses primarily on the processing and construction of the dataset, detailing some of the training attempts undertaken to improve the model's ability to generate lifelike facial animations. It also covers data processing methods applicable to general video datasets.

Generative AI
Supervisor: CHEN Qifeng / CSE
Student: FEI Yang / COSC
Course: UROP 1000, Summer

In this paper, we explore two distinct video VAE architectures developed by PKU-Open-Sora-Plan and OpenSora. We evaluate these models using metrics such as PSNR, SSIM, and LPIPS, and examine the reasons behind their differing performance. Furthermore, we introduce a novel approach to video VAE design in which the spatial dimensions of video data are compressed first, followed by the temporal dimension. Building on this idea, we propose a new video VAE architecture that sets a new benchmark in the field, particularly for text-to-video generation tasks. This study not only sheds light on the mechanics of video VAEs but also pushes the boundaries of their capabilities.

Generative AI
Supervisor: CHEN Qifeng / CSE
Student: HE Songyu / COMP
Course: UROP 1100, Fall

This report describes CLIP, an OpenAI-developed multimodal neural network that combines vision with natural language to learn concepts efficiently. Unlike traditional visual models, which require labour- and cost-intensive labelled datasets, CLIP makes use of the rich natural language supervision available on the Internet. Notably, CLIP achieves impressive performance on a variety of classification benchmarks without task-specific training, similar to the "zero-shot" capabilities of GPT-2 and GPT-3. Even without using any of the 1.28 million labelled training examples, its zero-shot performance is comparable to that of dedicated models trained on ImageNet-1K. In addition, CLIP transfers efficiently to new tasks, achieving performance competitive with fully supervised methods. This report also describes some applications and limitations of CLIP.
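As a brief illustration of the zero-shot classification described in the CLIP abstract above, the following minimal sketch uses OpenAI's open-source clip package to score one image against a set of free-text class descriptions. The image path and candidate labels are placeholders chosen for the example, not inputs used in the report.

import torch
import clip  # OpenAI's CLIP reference package
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
# Load a pretrained CLIP model together with its matching image preprocessing.
model, preprocess = clip.load("ViT-B/32", device=device)

# Placeholder inputs: any image file and any set of candidate class descriptions.
image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]
text = clip.tokenize(labels).to(device)

with torch.no_grad():
    # CLIP scores the image against every text prompt; a softmax over the
    # similarities gives zero-shot class probabilities.
    logits_per_image, _ = model(image, text)
    probs = logits_per_image.softmax(dim=-1).cpu().numpy()

print(dict(zip(labels, probs[0])))

Because the class names are ordinary text prompts, the same loaded model can be pointed at a new classification task simply by changing the label list, which is the "zero-shot" behaviour the abstract refers to.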