UROP Proceedings 2023-24

School of Engineering
Department of Computer Science and Engineering

Large-Scale Spatiotemporal Data Analytics and Learning
Supervisor: ZHOU Xiaofang / CSE
Student: ZHANG Lveyang / COMP
Course: UROP 1100, Fall

Graphs, as a popular data structure, have been used to model many real-world objects and their relationships. Subgraph matching is a fundamental graph problem that aims to find occurrences of a given query graph within a larger data graph. Over the years, numerous algorithms have been developed to address this problem, and researchers have extensively compared their overall effectiveness. This report takes a different approach by analyzing the individual performance of four key components: 1) candidate vertex filtering, 2) query vertex ordering, 3) partial result enumeration, and 4) other optimization techniques (a minimal sketch of this filter-order-enumerate pipeline follows the abstracts below). Our focus is on two representative algorithms, GraphQL and CFL. Through experimental evaluation, we present the results of our study.

Large-Scale Spatiotemporal Data Analytics and Learning
Supervisor: ZHOU Xiaofang / CSE
Student: ZOU Hao / COMP
Course: UROP 1100, Fall

Multi-modal learning has recently become a hot research direction; one representative result is the CLIP model released by OpenAI. It brings the strengths of large pre-trained NLP models (in the GPT family) into computer vision by using image-text pairs gathered from the Internet to pre-train the model. According to the paper published by OpenAI, the training dataset contained around 400 million image-text pairs, and the model was trained for 12 days on 256 V100 GPUs. This naturally raises one problem: with such an extremely large dataset, it is very challenging to ensure that the quality of the selected data is high enough. Our research target is therefore to develop efficient data pruning algorithms that can robustly prune fake or mismatched data samples from a large dataset. In addition, the chosen datasets may contain many identical or near-identical pairs, so de-duplicating such data is another of our research targets. High-quality data means datasets without excessive duplication or defects; for example, for the LLaMA LLM, the researchers selected raw book data from the Gutenberg Project, which contains books in the public domain, and from Books3. Such high-quality data can improve the training results of LLMs and make them more robust in downstream tasks. The article on Scaling Language Models likewise describes the researchers' efforts to prune and clean their huge pre-training datasets.
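
The first abstract above decomposes subgraph matching into candidate filtering, vertex ordering, and result enumeration. The following is a minimal, hypothetical Python sketch of that generic pipeline; the filtering rule (label plus degree) and the ordering rule (fewest candidates first) are deliberately simplified stand-ins, not the actual strategies of GraphQL or CFL.

# Hypothetical sketch only: a generic filter / order / enumerate pipeline for
# subgraph matching.  The concrete rules below are simplified stand-ins, not
# the actual filtering or ordering strategies of GraphQL or CFL.

def filter_candidates(query, data):
    # Step 1: candidate vertex filtering by label and degree.
    cand = {}
    for u in query["vertices"]:
        cand[u] = [v for v in data["vertices"]
                   if data["label"][v] == query["label"][u]
                   and len(data["adj"][v]) >= len(query["adj"][u])]
    return cand

def order_vertices(query, cand):
    # Step 2: query vertex ordering -- here simply "fewest candidates first".
    return sorted(query["vertices"], key=lambda u: len(cand[u]))

def enumerate_matches(query, data, cand, order, i=0, mapping=None):
    # Step 3: backtracking enumeration of partial results.
    mapping = {} if mapping is None else mapping
    if i == len(order):
        yield dict(mapping)
        return
    u = order[i]
    for v in cand[u]:
        if v in mapping.values():
            continue
        # every already-mapped neighbour of u must be adjacent to v
        if all(mapping[w] in data["adj"][v]
               for w in query["adj"][u] if w in mapping):
            mapping[u] = v
            yield from enumerate_matches(query, data, cand, order, i + 1, mapping)
            del mapping[u]

# Toy example: find all embeddings of a labelled triangle.
query = {"vertices": [0, 1, 2],
         "label": {0: "A", 1: "B", 2: "B"},
         "adj": {0: {1, 2}, 1: {0, 2}, 2: {0, 1}}}
data = {"vertices": [0, 1, 2, 3],
        "label": {0: "A", 1: "B", 2: "B", 3: "B"},
        "adj": {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1, 3}, 3: {0, 2}}}

cand = filter_candidates(query, data)
order = order_vertices(query, cand)
for m in enumerate_matches(query, data, cand, order):
    print(m)

On this toy data graph the sketch enumerates every embedding of the labelled triangle; real algorithms differ mainly in how aggressively steps 1 and 2 shrink the search space explored in step 3.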

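The second abstract targets pruning low-quality pairs and de-duplicating near-identical pairs. Below is a minimal, hypothetical sketch of those two steps, assuming each image-text pair has already been encoded into embedding vectors (for example by a pre-trained encoder); the thresholds and the greedy de-duplication rule are illustrative assumptions, not the project's actual method.

# Hypothetical sketch only: prune mismatched image-text pairs, then greedily
# de-duplicate near-identical ones.  Thresholds and random toy embeddings are
# illustrative assumptions, not the project's actual pipeline.
import numpy as np

def normalize(x):
    # L2-normalise rows so that dot products become cosine similarities.
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def prune_mismatched(img_emb, txt_emb, min_sim=0.2):
    # Keep only pairs whose image and text embeddings agree reasonably well.
    sim = np.sum(normalize(img_emb) * normalize(txt_emb), axis=1)
    return np.where(sim >= min_sim)[0]

def deduplicate(emb, max_sim=0.95):
    # Greedy de-duplication: keep a sample only if it is not too close to any
    # sample already kept (quadratic; fine for a toy, not for 400M pairs).
    emb = normalize(emb)
    kept = []
    for i in range(len(emb)):
        if all(emb[i] @ emb[j] < max_sim for j in kept):
            kept.append(i)
    return kept

# Toy usage with random vectors standing in for real image/text features.
rng = np.random.default_rng(0)
img_emb = rng.normal(size=(1000, 64))
txt_emb = img_emb + 0.5 * rng.normal(size=(1000, 64))   # roughly matched pairs

keep = prune_mismatched(img_emb, txt_emb)
keep = [keep[i] for i in deduplicate(img_emb[keep])]
print(f"kept {len(keep)} of 1000 pairs after pruning and de-duplication")

Note that pairwise greedy de-duplication scales quadratically; at the 400-million-pair scale mentioned above it would likely be replaced by approximate nearest-neighbour search or hashing, which is part of what makes the problem challenging.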