UROP Proceedings 2023-24

School of Engineering
Department of Computer Science and Engineering

Large-Scale Spatiotemporal Data Analytics and Learning
Supervisor: ZHOU Xiaofang / CSE
Student: REN Yichen / COMP
Course: UROP 1100, Fall; UROP 2100, Spring

In modern societies, cargo transport serves as a backbone of global trade, enabling the movement of goods across the world. Intra-facility transport and organization, that is, the internal movement and storage of goods, materials, and personnel within an enclosed facility, plays a pivotal role in the efficiency of supply chain operations. The complexity of modern in-site logistics requires seamless integration of processes ranging from inventory management to resource deployment, and effective management and visualization of intra-facility transport are crucial to achieving these objectives. However, currently available commercial visualization systems for intra-facility cargo transfer predominantly offer only rudimentary levels of information integration, such as linking databases to visualization platforms. Moreover, using SQL for data manipulation and retrieval demands a level of technical proficiency that can be challenging for many professionals in the warehousing industry. Consequently, we are exploring the potential of OpenAI's GPT-4 to facilitate database queries and modifications through natural language, thereby simplifying the user experience. To enhance this solution, we have also integrated these capabilities with visualization platforms, streamlining the workflow and improving data accessibility.

Large-Scale Spatiotemporal Data Analytics and Learning
Supervisor: ZHOU Xiaofang / CSE
Student: YANG Haolin / DSCT
Course: UROP 1100, Fall

This report discusses the research efforts made on data preparation for a Chinese Large Language Model (LLM).
Existing LLMs are mainly trained on mixed datasets dominated by English documents, with only a small share of Chinese text. Our research explores how to improve Large Language Models' performance in Chinese through better data preparation. As a member of the data preparation and evaluation team, my main responsibilities include cleaning pre-training data and preparing a Chinese evaluation protocol. Specifically, my tasks encompassed downloading data from Hugging Face, implementing deduplication using the RedPajama method, and refining the data with the SlimPajama processing pipeline.
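The deduplication step mentioned above can be illustrated in miniature. The RedPajama and SlimPajama pipelines detect near-duplicate documents by comparing MinHash signatures computed over document shingles; the sketch below is a simplified, self-contained illustration of that idea (the signature size, shingle length, and example texts are arbitrary choices, not the actual pipeline's parameters or code):

```python
# Minimal sketch of MinHash-based near-duplicate detection, in the spirit of
# the RedPajama/SlimPajama deduplication step. Not the actual pipeline code.
import hashlib

NUM_HASHES = 64  # number of hash slots in each MinHash signature (arbitrary)

def shingles(text, n=5):
    """Character n-gram shingles of a document."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def minhash(doc, num_hashes=NUM_HASHES):
    """For each seed, keep the minimum hash over all shingles of the doc."""
    sig = []
    for seed in range(num_hashes):
        sig.append(min(
            int.from_bytes(hashlib.md5(f"{seed}:{s}".encode()).digest()[:8], "big")
            for s in shingles(doc)
        ))
    return sig

def est_jaccard(sig_a, sig_b):
    """Fraction of matching slots estimates the Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

a = "Large language models are trained on web-scale text corpora."
b = "Large language models are trained on web scale text corpora!"
c = "Cargo transport is the backbone of global trade."

print(est_jaccard(minhash(a), minhash(b)))  # high: a and b are near-duplicates
print(est_jaccard(minhash(a), minhash(c)))  # low: a and c are unrelated
```

Documents whose estimated similarity exceeds a threshold are treated as duplicates and dropped; production pipelines scale this with locality-sensitive hashing rather than pairwise comparison.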
