School of Engineering
Department of Computer Science and Engineering

Large-Scale Spatiotemporal Data Analytics and Learning
Supervisor: ZHOU Xiaofang / CSE
Student: REN Yichen / COMP
Course: UROP 3200, Spring

This paper dives deeper into the field of Natural Language (NL) to Structured Query Language (SQL) conversion (NL2SQL). Using the widely accepted NL2SQL agent provided with the Spider-2 dataset, it aims to identify basic and common issues present in most, if not all, NL2SQL agents. Specifically, it evaluates the performance of the original Spider-2 agent and the Spider-2 + DIN-SQL model on the Spider-2 Snow dataset. Out of the 547 results, we manually examine a subset of 50, a sample size that is statistically significant. The results reveal that current models struggle primarily with understanding semi-structured variable names, such as column names in schemas and table names. The performance is particularly poor in the absence of relevant illustrative files. Even when such files are available, the agent often fails to interpret the meaning of file names correctly, leading to the selection of incorrect files or tables which hold the data. This study also proposes potential directions for improvement, particularly in cases where file or table names involve temporal elements, such as dates or times. Based on our experiments, we believe that incorporating a hierarchical tree structure could offer a promising solution.

Retrieval Augmented Generation with Vector Database
Supervisor: ZHOU Xiaofang / CSE
Student: CHOW Wang Hin / COMP
Course: UROP 2100, Fall

Retrieval Augmented Generation (RAG) systems combine external knowledge retrieval with generative models, making vector databases integral for efficiently storing and querying high-dimensional embeddings. Vector databases are critical to RAG’s performance, enabling rapid retrieval of relevant data points.
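The retrieval step described above can be illustrated with a minimal sketch. It is not taken from the report: the names `cosine_similarity`, `top_k`, and `store`, and the toy three-dimensional embeddings, are hypothetical. It uses a brute-force scan purely for illustration, whereas production vector databases typically rely on approximate nearest-neighbour indexes.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def top_k(query, store, k=2):
    """Return the ids of the k stored vectors most similar to the query."""
    scored = [(cosine_similarity(query, vec), doc_id) for doc_id, vec in store.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [doc_id for _, doc_id in scored[:k]]


# Hypothetical toy "vector store": document id -> embedding.
store = {
    "doc_a": [1.0, 0.0, 0.0],
    "doc_b": [0.9, 0.1, 0.0],
    "doc_c": [0.0, 1.0, 0.0],
}

nearest = top_k([1.0, 0.0, 0.0], store, k=2)
```

In a RAG pipeline, the documents behind the returned ids would then be passed to the generative model as context.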
Serverless vector databases like LanceDB further streamline deployment by reducing infrastructure overhead while maintaining scalability. LanceDB is distinguished by its use of the Lance format, which is designed to optimize vector data storage and retrieval. Compared to traditional formats like Parquet, Lance has demonstrated improved query and storage efficiency. This report explores LanceDB’s encoding mechanics, contrasts it with Parquet, and evaluates its suitability for RAG workflows, focusing on storage efficiency, retrieval performance, and implications for serverless architectures.

Retrieval Augmented Generation with Vector Database
Supervisor: ZHOU Xiaofang / CSE
Student: FANG Zihao / DSCT
Course: UROP 1100, Spring

This Undergraduate Research Opportunities Program (UROP) project evaluates the FinRobot Agent, an AI-driven system for financial decision-making, through tasks like stock price prediction, annual report generation, and trading strategies. A concise literature review of financial AI agents, multimodal systems, and adaptive architectures informed the study’s theoretical framework. Evaluations, focusing on prompt engineering, achieved high accuracy (92% for report generation) and highlighted real-time adaptability, but Azure API rate limits and LLM reliability posed challenges. Conducted under supervisor Zhang Ruiyuan’s guidance, this first-time research endeavor balanced academics with experimentation, fostering technical and personal growth. The report synthesizes literature insights, evaluation results, challenges, and reflections, recommending future work on hallucination reduction and model comparisons (e.g., DeepSeek, Grok, Gemini) to enhance FinRobot’s robustness.