IEMS - Thought Leadership Brief #87

We further evaluate InvestLM's performance on financial NLP benchmarks. We consider the following LLMs: two instruction-tuned models from OpenAI, GPT-3.5 and GPT-4; two financial LLMs, BloombergGPT (a 50B foundation model) and FinMA (an instruction-tuned model built on LLaMA-7B); and LLaMA-65B, the foundation model upon which InvestLM is built. The results are presented in Figure 4. Comparing InvestLM with LLaMA-65B, we find that domain instruction tuning is highly effective: InvestLM outperforms LLaMA-65B on 8 of the 9 tasks. Second, GPT-4 achieves the best performance on 6 of the 9 tasks, while InvestLM is best on 2, suggesting that GPT-4 remains the state-of-the-art commercial LLM.

To assess the advantages of domain instruction tuning across foundation models of varying sizes, we train an InvestLM-7B model on the LLaMA-7B foundation model using our domain instruction dataset. The relative improvement brought about by domain instruction tuning is considerably more pronounced for the smaller 7B model than for the 65B model: domain instruction tuning improves performance by an average of 138.4% across tasks, compared with a 28.2% increment for LLaMA-65B. These results indicate that in scenarios where computational constraints prevent deploying a 65B model, domain instruction tuning is vital for optimizing the performance of the smaller model.
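For concreteness, here is a minimal sketch (not the authors' code) of the relative-improvement calculation behind these headline figures: for each task, compute (tuned − base) / base, then average across tasks. The per-task scores are taken from Figure 5 below; which metric stands in for ECTSum (Rouge-1 here) is our assumption, since the brief does not specify its per-task metric choices, so the output approximates rather than exactly reproduces the reported 138.4%.

```python
# Sketch of the average relative improvement from domain instruction tuning.
# Scores are the (LLaMA-7B base, InvestLM-7B tuned) pairs from Figure 5.
# Assumption: Rouge-1 stands in for ECTSum; the brief's exact per-task
# metric choices are unspecified, so this only approximates 138.4%.
scores_7b = {
    "FinSent":   (0.53, 0.69),
    "FPB":       (0.12, 0.74),
    "FOMC":      (0.25, 0.40),
    "FiQA ABSA": (0.31, 0.76),
    "ESG":       (0.19, 0.61),
    "FLS":       (0.34, 0.53),
    "QA":        (0.72, 0.84),
    "FinQA":     (0.07, 0.07),
    "ECTSum":    (0.06, 0.24),  # Rouge-1 (assumed choice of metric)
}

# Per-task relative gain: (tuned - base) / base.
rel_gains = {task: (tuned - base) / base
             for task, (base, tuned) in scores_7b.items()}
avg_gain = sum(rel_gains.values()) / len(rel_gains)

for task, gain in rel_gains.items():
    print(f"{task:10s} {gain:+.1%}")
print(f"Average relative improvement: {avg_gain:.1%}")
```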
We also explore whether the inclusion of general-purpose instructions can further enhance the model's performance on domain NLP tasks. Given that the general-purpose instruction dataset encompasses instructions related to numerical reasoning and sentiment, integrating such instructions could potentially also improve the model's capability on financial NLP tasks. We incorporate the instruction-following data used to fine-tune the Stanford Alpaca model (Taori et al., 2023), comprising 52K instructions, into our domain instruction dataset, and train an InvestLM-7B+Alpaca-Instructions model on this augmented dataset (a sketch of how such a dataset can be assembled follows the tables below). We then evaluate the utility of generic instructions on the financial NLP benchmarks. The results (Figure 5) lead to an interesting finding: the inclusion of generic instructions appears to negatively impact the model's generalizability on domain-specific NLP tasks. Comparing InvestLM-7B+Alpaca-Instructions (trained on the combined instruction dataset) with InvestLM-7B (trained solely on the domain instruction dataset), InvestLM-7B consistently outperforms InvestLM-7B+Alpaca-Instructions across all tasks. This underscores the value of our carefully curated domain instructions and suggests that, rather than generating a large volume of general-purpose instructions, creating a set of high-quality, domain-specific instructions can be more effective in tapping into a model's capabilities for domain tasks.

Figure 4. Performance of Different LLMs on Financial NLP Benchmarks.

Dataset   Metric     LLaMA-65B  InvestLM  GPT-3.5  BloombergGPT  FinMA  GPT-4
FinSent   Micro-F1   0.71       0.79      0.75     -             0.80   0.81
FPB       Micro-F1   0.38       0.71      0.75     0.51          0.88   0.90
FOMC      Micro-F1   0.53       0.61      0.60     -             0.52   0.73
FiQA      Micro-F1   0.75       0.90      0.77     0.75          0.87   0.92
ESG       Micro-F1   0.67       0.80      0.64     -             0.51   0.63
FLS       Micro-F1   0.60       0.51      0.57     -             0.27   0.57
QA        Micro-F1   0.73       0.81      0.71     -             0.68   0.78
FinQA     Acc        0.23       0.29      0.47     -             0.15   0.54
ECTSum    Rouge-1    0.14       0.26      0.21     -             0.08   0.30
          Rouge-2    0.12       0.12      0.13     -             0.01   0.15
          Rouge-L    0.13       0.17      0.15     -             0.06   0.20
          CHRF++     23.65      31.53     29.79    -             6.34   36.31

Figure 5. Performance of InvestLM-7B Trained Using Different Instruction Datasets.

Dataset    Metric     LLaMA-7B  InvestLM-7B  InvestLM-7B+Alpaca-Instructions
FinSent    Micro-F1   0.53      0.69         0.64
FPB        Micro-F1   0.12      0.74         0.42
FOMC       Micro-F1   0.25      0.40         0.32
FiQA ABSA  Micro-F1   0.31      0.76         0.40
ESG        Micro-F1   0.19      0.61         0.48
FLS        Micro-F1   0.34      0.53         0.17
QA         Micro-F1   0.72      0.84         0.40
FinQA      Acc        0.07      0.07         0.03
ECTSum     Rouge-1    0.06      0.24         0.14
           Rouge-2    0.06      0.10         0.05
           Rouge-L    0.06      0.15         0.09
           BERTScore  0.73      0.78         0.75
           CHRF++     12.90     29.18        19.48
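As referenced above, the following is a hypothetical sketch of assembling the augmented instruction dataset for InvestLM-7B+Alpaca-Instructions. The file names are placeholders; the only details taken from the brief are that the 52K Stanford Alpaca instruction-following records (which use the instruction/input/output JSON format) are added to the curated domain instruction dataset.

```python
# Hypothetical sketch of assembling the augmented instruction dataset.
# File names are placeholders; the brief only states that the 52K Stanford
# Alpaca records were merged into the domain instruction dataset.
import json

def load_instructions(path):
    """Load a list of {"instruction", "input", "output"} records,
    the JSON format used by the Stanford Alpaca release."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)

domain = load_instructions("domain_instructions.json")  # placeholder path
alpaca = load_instructions("alpaca_data.json")          # 52K generic instructions

combined = domain + alpaca
print(f"{len(domain)} domain + {len(alpaca)} generic = {len(combined)} records")

with open("combined_instructions.json", "w", encoding="utf-8") as f:
    json.dump(combined, f, ensure_ascii=False, indent=2)
```

Note that, per Figure 5, this augmented set actually hurt domain performance; the merge is shown only to make the experimental setup concrete.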

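A brief note on the headline metric: most of the classification benchmarks above report Micro-F1. The illustrative snippet below (with invented labels) shows how it can be computed with scikit-learn; for single-label classification, Micro-F1 coincides with plain accuracy.

```python
# Illustration of the Micro-F1 metric used for the classification tasks
# in Figures 4 and 5 (labels here are invented for the example).
from sklearn.metrics import f1_score

y_true = ["positive", "negative", "neutral", "positive", "neutral"]
y_pred = ["positive", "neutral",  "neutral", "positive", "negative"]

# Micro-F1 pools true/false positives across all classes; for
# single-label classification it equals accuracy.
print(f1_score(y_true, y_pred, average="micro"))  # 0.6
```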