UROP Proceedings 2022-23

School of Engineering
Department of Computer Science and Engineering

Automatic and Scalable Data Collection and Pruning for LLM Training
Supervisor: ZHOU, Xiaofang / CSE
Student: BIE, Jiarui / COMP
HAN, Liuruo / SSCI
ZHANG, Zhanhua / DSCT
Course: UROP1100, Summer
UROP1000, Summer
UROP2100, Summer

Python is widely known as a powerful programming language that can be applied to a broad range of subjects. With the support of its large library ecosystem, one can use various tools to perform high-level tasks conveniently without starting everything from scratch. In this report, we first discuss one major application built on public packages such as Scrapy and Selenium. Second, we present the methodology we applied, together with shared code samples (a sketch is given after these abstracts). Using the samples, we discuss the feasibility and applicability of our approach to the desired functionalities. Finally, we share the conclusions of our project.

Automatic and Scalable Data Collection and Pruning for LLM Training
Supervisor: ZHOU, Xiaofang / CSE
Student: HU, Yutong / SSCI
Course: UROP1000, Summer

A Large Language Model (LLM), trained on vast amounts of data, generates human-like text, traditionally by means of multilayer recurrent neural networks. ChatGPT marks an evolution of this model: instead of using traditional statistical techniques, it "uses transformer-based models that allow for the processing of vast amounts of data in parallel". This project aims to clean raw data in preparation for training a Chinese LLM, and it also evaluates the Chinese evaluation protocol so that the training data remain accurate. This paper highlights the challenges and strategies in: i) downloading data from Common Crawl and cleaning it with the cc_net package; ii) downloading arXiv data and running the cleaning algorithm; and iii) evaluating the translated texts for the Chinese protocol.

Automatic and Scalable Data Collection and Pruning for LLM Training
Supervisor: ZHOU, Xiaofang / CSE
Student: YANG, Haolin / SSCI
Course: UROP1000, Summer

This report discusses the research efforts made on data preparation for a Chinese Large Language Model (LLM). Existing LLMs are mainly trained on mixed datasets with a large portion of English documents and only a small share of Chinese text. Our research explores how the data preparation stage can improve an LLM's performance on Chinese. As part of the data preparation and evaluation team, my main responsibilities included cleaning the pre-training data and preparing the Chinese evaluation protocol. Specifically, I completed tasks on downloading data from Common Crawl, deduplication, and improving the Chinese translation of commonsense data.
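As a concrete illustration of the Scrapy-based collection mentioned in the first abstract, the sketch below crawls paragraph text and follows pagination. The seed URL, CSS selectors, and output file name are hypothetical placeholders rather than the project's actual configuration; Selenium would be swapped in for pages that require JavaScript rendering.

```python
# Minimal Scrapy sketch: collect paragraph text and follow pagination.
# The URL and selectors are illustrative assumptions, not the project's.
import scrapy
from scrapy.crawler import CrawlerProcess


class TextSpider(scrapy.Spider):
    name = "text_spider"
    # Hypothetical seed page; a real crawl would supply its own URL list.
    start_urls = ["https://example.com/articles"]

    def parse(self, response):
        # Extract visible paragraph text from the page.
        for paragraph in response.css("p::text").getall():
            text = paragraph.strip()
            if text:
                yield {"url": response.url, "text": text}
        # Follow a "next page" link, if present, to scale the crawl.
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)


if __name__ == "__main__":
    # Write collected records to a JSON-lines file for later pruning.
    process = CrawlerProcess(
        settings={"FEEDS": {"corpus.jsonl": {"format": "jsonlines"}}}
    )
    process.crawl(TextSpider)
    process.start()
```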
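For step (i) of the second abstract, the project used the cc_net pipeline; the standalone sketch below only illustrates the shape of that task: fetching one WET (extracted-text) segment of a Common Crawl snapshot and keeping Chinese-dominant documents. The snapshot label and the character-ratio threshold are assumptions, and cc_net itself goes further with language identification, perplexity-based filtering, and deduplication.

```python
# Sketch: stream one Common Crawl WET file and keep Chinese-heavy documents.
# SNAPSHOT is a hypothetical label; pick a real one from commoncrawl.org.
import gzip
import io

import requests

BASE = "https://data.commoncrawl.org"
SNAPSHOT = "CC-MAIN-2023-23"  # assumed snapshot label


def chinese_ratio(text: str) -> float:
    """Fraction of characters in the CJK Unified Ideographs block."""
    if not text:
        return 0.0
    han = sum(1 for ch in text if "\u4e00" <= ch <= "\u9fff")
    return han / len(text)


def iter_wet_records(snapshot: str):
    # wet.paths.gz lists the relative paths of every WET file in a snapshot.
    listing = requests.get(f"{BASE}/crawl-data/{snapshot}/wet.paths.gz", timeout=60)
    first_path = gzip.decompress(listing.content).decode().splitlines()[0]
    wet = requests.get(f"{BASE}/{first_path}", timeout=300)
    with gzip.open(io.BytesIO(wet.content), "rt",
                   encoding="utf-8", errors="ignore") as fh:
        body, in_header = [], False
        for line in fh:
            if line.startswith("WARC/1.0"):  # a new record begins
                if body:
                    yield "".join(body)
                body, in_header = [], True
            elif in_header:
                in_header = bool(line.strip())  # headers end at a blank line
            else:
                body.append(line)
        if body:
            yield "".join(body)


if __name__ == "__main__":
    for doc in iter_wet_records(SNAPSHOT):
        if chinese_ratio(doc) > 0.5:  # crude keep/drop heuristic
            print(doc[:200])
```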
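The deduplication task named in the third abstract can be pictured with the minimal exact-dedup sketch below, which keeps the first occurrence of each normalized document in a JSON-lines corpus. The file names are hypothetical, and production pipelines such as cc_net hash normalized paragraphs rather than whole documents.

```python
# Sketch: exact deduplication of a JSON-lines corpus by content hash.
# Input/output paths are hypothetical placeholders.
import hashlib
import json
import unicodedata


def normalize(text: str) -> str:
    # Unicode-normalize, lowercase, and collapse whitespace so that
    # near-identical boilerplate hashes to the same key.
    text = unicodedata.normalize("NFKC", text).lower()
    return " ".join(text.split())


def dedup(in_path: str, out_path: str) -> None:
    seen: set[bytes] = set()
    with open(in_path, encoding="utf-8") as fin, \
         open(out_path, "w", encoding="utf-8") as fout:
        for line in fin:
            doc = json.loads(line)
            key = hashlib.sha1(normalize(doc["text"]).encode("utf-8")).digest()
            if key in seen:
                continue  # drop exact duplicates, keep first occurrence
            seen.add(key)
            fout.write(json.dumps(doc, ensure_ascii=False) + "\n")


if __name__ == "__main__":
    dedup("corpus.jsonl", "corpus.dedup.jsonl")
```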
