School of Engineering
Department of Computer Science and Engineering

Generative AI
Supervisor: CHEN Qifeng / CSE
Student: SUN Mengxi / DSCT
Course: UROP 1100, Summer

This report summarizes what I did and learned in UROP 1100 this summer. I worked with two of Professor Chen's postgraduate students on the LLM Photographer project. I had never taken a course on deep learning and knew very little about the area, so I joined out of interest. After talking with one of the postgraduate students on the project, I was assigned some fundamental tasks to prepare me for formally joining their research: writing code to construct AlexNet, connecting to the server, running the AlexNet code, and self-studying CS231n, Deep Learning for Computer Vision. After finishing the first two tasks and checking in with that student again, I began reading papers while continuing with CS231n. In the end, I read two papers, Visual Instruction Tuning and Augmented Language Models: a Survey. Although both papers were hard for me to understand, I learned many new terms and commonly used methods, and gained some inspiration. In this report, I describe each task in turn and explain what I learned from it.

Generative AI
Supervisor: CHEN Qifeng / CSE
Student: WANG Dingqi / MATH-PMA; ZHOU Yukai / MATH-PMA
Course: UROP 1000, Summer; UROP 1000, Summer

Estimating 3D geometry from 2D visuals has long been a challenging yet fundamental task in computer vision. Recent studies have shown that latent diffusion models, built on U-Net and attention architectures, can achieve state-of-the-art results in estimating depth and camera parameters from images or videos, in addition to their well-known generative abilities. This report examines the implementation details of such estimation models, as well as a generative diffusion model that accepts camera pose representations as its conditions.

Generative AI
Supervisor: CHEN Qifeng / CSE
Student: WU Kam Man / DSCT
Course: UROP 1100, Spring; UROP 1000, Summer

This summer semester, I concentrated on handling more specific questions in the text-to-video research field. First, I was asked to run several repositories to understand how a text-to-video model can be built from scratch. Studying the code in these repositories showed me how the models differ, which pointed me toward better data preprocessing and model construction methods. Next, I generated videos with some of the models I had tested previously, for example VideoCrafter. I then analyzed their code to learn the techniques used in a well-constructed project. Finally, I also learned some data preprocessing skills, for example video captioning (LLaVA), aesthetic scoring, and image quality assessment.