ASSESSMENT

We develop Hong Kong's first open-source LLM for financial generative AI (GenAI) applications, capable of generating investment-related, human-like responses comparable to those of well-known commercial chatbots, including OpenAI's ChatGPT. InvestLM is trained on LLaMA-65B using a carefully curated instruction dataset related to finance and investment (Figure 1). We evaluate InvestLM's utility in providing helpful investment advice in collaboration with a group of six financial experts, including hedge fund managers and research analysts.

Figure 1. Financial Domain Instruction Dataset

Source                 Size    Input len.   Resp. len.
All                    1,335   152.9        145.5
Stackexchange          205     19.4         296.2
CFA                    329     125.6        157.4
Academic Journals      200     169.3        74.8
Textbooks              200     128.9        136.6
SEC Filings            80      316.2        88.2
Financial NLP tasks    200     325.9        74.5
Investments            119     72.7         144.3

We manually write 30 test questions related to financial markets and investment. For each question, we generate a single response from InvestLM and from each of the three commercial models (GPT-3.5, GPT-4, and Claude-2). We then ask the financial experts to compare InvestLM's response with each baseline's response and to label which response is better, or whether neither response is significantly better than the other.

In addition to the expert evaluation, we also conduct a GPT-4 evaluation, following the protocol of Zhou et al. (2023). Specifically, we present GPT-4 with exactly the same annotation instructions and response pairs given to the experts, and ask it which response is better or whether neither is significantly better than the other.

The expert evaluation and GPT-4 evaluation results are presented in Figure 2 and Figure 3, respectively. These results indicate that financial experts rate InvestLM's responses as either comparable to or better than those of the GPT-3.5 and GPT-4 models. This expert assessment aligns with GPT-4's own evaluation, which also prefers InvestLM's responses. While financial experts tend to favor Claude-2's responses over InvestLM's most of the time, GPT-4 shows a preference for InvestLM's responses.

Figure 2. Expert Evaluation
[Stacked-bar chart of expert judgments (InvestLM wins / tie / InvestLM loses) in pairwise comparisons: vs. GPT-4, 20% / 40% / 40%; vs. GPT-3.5, 50% / 10% / 40%; vs. Claude-2, 20% / 20% / 60%.]

Figure 3. GPT-4 Evaluation
[Stacked-bar chart of GPT-4's judgments in the same pairwise comparisons: InvestLM wins 53% vs. GPT-4, 78% vs. GPT-3.5, and 62% vs. Claude-2.]

Overall, it is encouraging to observe that our domain-specific instruction tuning yields a model that generates helpful answers to investment-related questions, especially considering that its foundation model, LLaMA, frequently produces hallucinations (introducing numbers that are not even mentioned in the news). In contrast, InvestLM's responses are grounded in the information presented in the news and reflect logical reasoning with consideration of risks. This highlights the value of domain instruction tuning.
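To make the GPT-4 evaluation protocol concrete, the sketch below shows one way such a pairwise comparison can be run. It is a minimal illustration under our own assumptions, not the authors' actual evaluation harness: the prompt wording, the judge_pair helper, the "gpt-4" model identifier, and the random A/B ordering (a common guard against position bias) are all illustrative. Only the three possible labels (one response is better, the other is better, or neither is significantly better) follow the protocol described above.

```python
# Minimal sketch of a GPT-4 pairwise evaluation in the style described
# above (cf. Zhou et al., 2023). Prompt wording and helper names are
# illustrative assumptions, not InvestLM's actual evaluation code.
import random

from openai import OpenAI  # pip install openai

client = OpenAI()  # reads OPENAI_API_KEY from the environment

JUDGE_PROMPT = """You are comparing two responses to an investment-related question.

Question:
{question}

Response A:
{a}

Response B:
{b}

Which response is better? Answer with exactly one of:
"A", "B", or "neither" (if neither response is significantly better)."""


def judge_pair(question: str, investlm_resp: str, baseline_resp: str) -> str:
    """Ask GPT-4 to compare an InvestLM response against a baseline response.

    Returns "wins", "loses", or "tie" from InvestLM's point of view.
    """
    # Randomize which model appears as "A" to reduce position bias.
    flipped = random.random() < 0.5
    a, b = (baseline_resp, investlm_resp) if flipped else (investlm_resp, baseline_resp)

    completion = client.chat.completions.create(
        model="gpt-4",
        temperature=0,  # deterministic judging
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(question=question, a=a, b=b),
        }],
    )
    verdict = completion.choices[0].message.content.strip().lower()

    if verdict.startswith("neither"):
        return "tie"
    investlm_label = "b" if flipped else "a"
    return "wins" if verdict.startswith(investlm_label) else "loses"
```

Running such a judge over the 30 test questions for each baseline and tallying the three labels would produce win/tie/loss shares of the kind summarized in Figure 3; the expert evaluation in Figure 2 uses the same comparison format, with human judges in place of GPT-4.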