UROP Proceeding 2024-25

School of Engineering
Department of Computer Science and Engineering

Advanced Analytics on Domain-Specific Knowledge Graphs
Supervisor: ZHOU Xiaofang / CSE
Student: LIU Qiushi / RMBI
Course: UROP 1100, Spring

This report presents the progress of my UROP 1100 project for Spring 2025, which focused on advanced analytics for financial knowledge graphs, with particular emphasis on state-of-the-art techniques for graph anomaly detection. The primary objective was to analyze the ConsisGAD model, a consistency training framework with learnable data augmentation designed for robust anomaly detection under limited supervision, and to reproduce its experiments. Insights from the recent ARC framework, a generalist graph anomaly detector, are also discussed as an alternative approach. The report analyzes the methods, experiments, and applicability of these techniques, and outlines the challenges I encountered and my goals for future research in this domain.

Automatic and Scalable Data Collection and Pruning for LLM Training
Supervisor: ZHOU Xiaofang / CSE
Student: LI Jingsheng / MATH-STAT
Course: UROP 1000, Summer

This report explores the application of optical character recognition (OCR) and large language model (LLM) technologies to extracting and analyzing tabular data from PDF documents. Using PaddleOCR, an open-source OCR toolkit, this study implemented table recognition algorithms to detect and parse structured content, addressing challenges such as varying layouts, fonts, and noise in scanned PDFs. Tables were also processed with multimodal LLMs, such as Qwen2.5-VL, to perform text extraction and table construction in HTML and CSV formats. Preliminary experiments show reasonable accuracy in table extraction with OCR techniques (up to 90% F1-score), though limitations persist in handling complex merged cells and in detecting structured cells with multi-level headers. Qwen2.5-VL performed better, especially on tables containing large amounts of information.

Automatic and Scalable Data Collection and Pruning for LLM Training
Supervisor: ZHOU Xiaofang / CSE
Student: YANG Haolin / DSCT
Course: UROP 3200, Spring

Recent text-to-SQL models have achieved strong performance, but their effectiveness remains largely confined to SQLite due to dataset limitations. Real-world applications, however, require SQL generation across multiple dialects with varying syntax and functions, which remains a challenge for current models. The main obstacle in building a dialect-aware model lies in acquiring high-quality dialect-specific data and integrating execution feedback beyond static “text and SQL” pairs. This work introduces ExeSQL, a text-to-SQL framework with execution-driven bootstrapping. The approach consists of translation bootstrapping, iterative data generation, and preference training, allowing models to adapt to different SQL dialects through execution-guided learning. Experiments show that ExeSQL achieves improvements of 15.2% and 13.21% over GPT-4 in two SQL dialects.
