UROP Proceedings 2024-25

School of Engineering
Department of Computer Science and Engineering

Mental State Reasoning for Large Language Models
Supervisor: SONG Yangqiu / CSE
Student: TU Tianyi / COMP
Course: UROP 1000, Summer

This report presents summer UROP research on Theory of Mind (ToM) capabilities in large language models (LLMs). The project encompassed theoretical analysis of emerging ToM evaluation frameworks, including ATOM, FANTOM, NegotiationToM, and CCoToM, alongside practical experimentation with lightweight model fine-tuning. Using the Verl reinforcement learning platform, we implemented adaptation procedures for the Qwen3-0.6B model across cloud (Google Colab) and containerized (Docker) environments. Complementary work involved refining Belief-Desire-Intention (BDI) model annotations within ToM datasets. These research activities significantly advanced my understanding of ToM evaluation paradigms and practical LLM fine-tuning workflows. Through hands-on experimentation with containerization and dataset annotation, I developed foundational skills for future research in cognitive AI systems.

Reasoning with Large Foundation Models
Supervisor: SONG Yangqiu / CSE
Student: LIU Jiayu / COMP
Course: UROP 1100, Spring

Large language models (LLMs) have rapidly evolved, transforming applications across diverse fields. Despite these advances, the use of LLMs in high-stakes environments continues to face challenges regarding reliability and accuracy. A critical aspect of this reliability is the ability of LLMs to accurately reflect their internal confidence through explicit communication, particularly via natural language expressions known as epistemic markers. Epistemic markers, such as "fairly confident" or "uncertain," are frequently used by humans to express varying degrees of uncertainty. However, it remains unclear whether LLMs use these markers consistently to communicate their internal uncertainty. This report examines the extent to which epistemic markers generated by LLMs align with their intrinsic confidence, investigating their reliability across multiple scenarios.

Reasoning with Large Foundation Models
Supervisor: SONG Yangqiu / CSE
Student: LYU Zongwei / COMP
Course: UROP 1100, Fall

In recent years, large foundation models have been widely employed across many fields and have shown impressive performance on a range of natural language processing tasks, such as language understanding, generation, and translation. However, LLMs still struggle to answer questions about tabular data. In this project, I worked with PhD student Weiqi Wang to participate in Task 8 of the 19th International Workshop on Semantic Evaluation (SemEval), Question Answering over Tabular Data (task page: https://jorses.github.io/semeval). We built a system that answers queries of the kind found in DataBench, operating on the regular (full-size) datasets. We will submit our predictions once the task organizers release the test set created specifically for the competition. This report describes how the task was completed.
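
The abstract above does not detail the system's design, so the following is an illustrative sketch only of one common pattern for DataBench-style tabular question answering: prompting an LLM to write a pandas expression against the table's schema and executing it. The ask_llm helper, the prompt wording, and the hard-coded demo response are hypothetical placeholders, not the team's actual implementation.

# Illustrative sketch: LLM-generated pandas expressions for tabular QA.
# Assumptions: ask_llm stands in for a real LLM client; a real system would
# sandbox the evaluation step and validate the answer type.
import pandas as pd

def ask_llm(prompt: str) -> str:
    # Hypothetical placeholder for an LLM call. For this toy demo it simply
    # returns a fixed pandas expression answering the example question below.
    return "df['age'].max()"

def answer_table_question(df: pd.DataFrame, question: str) -> str:
    # Describe the table to the model via its column names and dtypes.
    schema = ", ".join(f"{col} ({dtype})" for col, dtype in df.dtypes.items())
    prompt = (
        "You are given a pandas DataFrame named `df` with columns: "
        f"{schema}.\n"
        f"Write a single pandas expression that answers: {question}\n"
        "Return only the expression."
    )
    expression = ask_llm(prompt).strip()
    # Evaluate the generated expression against the DataFrame.
    result = eval(expression, {"df": df, "pd": pd})
    return str(result)

if __name__ == "__main__":
    toy = pd.DataFrame({"name": ["Ada", "Bob"], "age": [36, 41]})
    print(answer_table_question(toy, "What is the maximum age?"))  # -> 41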
