UROP Proceedings 2023-24

School of Engineering
Department of Computer Science and Engineering

Large-Scale Spatiotemporal Data Analytics and Learning
Supervisor: ZHOU Xiaofang / CSE
Student: HOU Jingcheng / COMP
Course: UROP 2100, Fall

SELF-INSTRUCT is a framework for improving the instruction-following performance of pre-trained language models using their own generations. However, besides invalid and near-duplicate instances, the original dataset contains other issues, such as hallucinations and empty outputs, that may negatively affect the outcome of the model. AlpacaDataCleaned improves the performance of natural language processing models trained on this data by removing errors and inconsistencies. In this project, we first generate 52k instances using the SELF-INSTRUCT method, then compile statistics on the ten issues that may occur in the dataset and on the lengths of instances, and compare the results with the Alpaca dataset. We also improve the efficiency of invoking the OpenAI API to reduce the cost of data generation.

Large-Scale Spatiotemporal Data Analytics and Learning
Supervisor: ZHOU Xiaofang / CSE
Student: HU Ruixi / COMP
Course: UROP 1100, Summer

In the logistics industry, complex mechanical systems are used extensively to handle thousands of orders from day to day, and millions of IoT data flows are generated. However, such data flows are connected in space and time, and thus require different treatment from individual data points. This study focuses on enhancing a simulation system designed to represent and analyze cargo movements, addressing the shortcomings of previous models. Key improvements include the development of a centralized online database to manage spatio-temporal data, the implementation of an interactive 3D modeling framework using Power BI for better data visualization and comparison, and the incorporation of cargo prioritization into the simulation algorithms.
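Cargo prioritization of the kind described above is commonly implemented with a priority queue, where more urgent orders leave the queue before less urgent ones. The following is a minimal sketch, not the project's actual code; the `Cargo` fields and the numeric priority scheme are illustrative assumptions.

```python
import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class Cargo:
    # Lower priority numbers dispatch first. Fields marked
    # compare=False are excluded from ordering, so ties in
    # priority never fall back to comparing order IDs.
    priority: int
    order_id: str = field(compare=False)
    destination: str = field(compare=False)

class DispatchQueue:
    """Min-heap of cargo items: the most urgent order is popped first."""

    def __init__(self):
        self._heap = []

    def push(self, cargo: Cargo) -> None:
        heapq.heappush(self._heap, cargo)

    def pop(self) -> Cargo:
        return heapq.heappop(self._heap)

    def __len__(self) -> int:
        return len(self._heap)

# Hypothetical orders: the priority-1 cargo is dispatched first
# regardless of insertion order.
q = DispatchQueue()
q.push(Cargo(priority=2, order_id="A17", destination="HKG"))
q.push(Cargo(priority=1, order_id="B03", destination="SZX"))
q.push(Cargo(priority=3, order_id="C42", destination="PVG"))
print(q.pop().order_id)
```

Wrapping the heap in a small class rather than using `heapq` directly reflects the OOP transition mentioned below: the dispatch policy lives in one place and can be swapped or extended without touching the rest of the simulation.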
Additionally, the transition to an Object-Oriented Programming (OOP) methodology aimed to enhance the maintainability and flexibility of the simulation code. These advancements are intended to provide a more accurate representation of real-world systems, facilitate comprehensive data analysis, and improve operational efficiency.

Large-Scale Spatiotemporal Data Analytics and Learning
Supervisor: ZHOU Xiaofang / CSE
Student: HU Yutong / DSCT
Course: UROP 1100, Fall

Large Language Model development places strong emphasis on the quality and quantity of training data, highlighting the significance of bulk data collection and data trimming. In particular, this work introduces a simple method to categorize 340,000 academic papers from arXiv, by crawling the categories suggested by arXiv, to support further domain-stratified language model training. Meanwhile, experiments were conducted to gather statistics on the NOUGAT algorithm, on both GPUs and CPUs. In addition, a literature review of OpenWebMath, CCNet, and LLemma was carried out to compare their methods and analyze their techniques for data collection. Finally, experiments were performed to convert HTML files into JSON format and to assemble questions with answers from GPT-4 to provide data for another model.
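The HTML-to-JSON step mentioned above can be sketched with the standard-library `html.parser`. This is an illustrative assumption about the record shape, not the project's actual pipeline: it pairs each `<h2>` heading (treated as a question) with the paragraph text that follows it (treated as the answer) and serializes the pairs as JSON.

```python
import json
from html.parser import HTMLParser

class QAExtractor(HTMLParser):
    """Pair each <h2> heading with the <p> text that follows it.

    The question/answer interpretation of the tags is an assumption
    made for this sketch.
    """

    def __init__(self):
        super().__init__()
        self._tag = None
        self.records = []

    def handle_starttag(self, tag, attrs):
        # Remember which element we are inside.
        self._tag = tag

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self._tag == "h2":
            # A heading opens a new question-answer record.
            self.records.append({"question": text, "answer": ""})
        elif self._tag == "p" and self.records:
            # Paragraph text accumulates into the latest answer.
            self.records[-1]["answer"] += text

# Hypothetical input document.
html_doc = "<h2>What is NOUGAT?</h2><p>A document-parsing model.</p>"
parser = QAExtractor()
parser.feed(html_doc)
print(json.dumps(parser.records, ensure_ascii=False))
```

A real conversion would need to handle nested markup, tables, and math, but the same pattern applies: a streaming parser emits flat JSON records that downstream training code can consume directly.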
