<p align="center" width="100%"> <img src="https://i.postimg.cc/xTpWgq3L/pixiu-logo.png" width="100%" height="100%"> </p> <div> <div align="left"> <a target='_blank'>Qianqian Xie<sup>1</sup></a>&emsp; <a target='_blank'>Weiguang Han<sup>2</sup></a>&emsp; <a target='_blank'>Zhengyu Chen<sup>2</sup></a>&emsp; <a target='_blank'>Ruoyu Xiang<sup>1</sup></a>&emsp; <a target='_blank'>Xiao Zhang<sup>1</sup></a>&emsp; <a target='_blank'>Yueru He<sup>1</sup></a>&emsp; <a target='_blank'>Mengxi Xiao<sup>2</sup></a>&emsp; <a target='_blank'>Dong Li<sup>2</sup></a>&emsp; <a target='_blank'>Yongfu Dai<sup>7</sup></a>&emsp; <a target='_blank'>Duanyu Feng<sup>7</sup></a>&emsp; <a target='_blank'>Yijing Xu<sup>1</sup></a>&emsp; <a target='_blank'>Haoqiang Kang<sup>5</sup></a>&emsp; <a target='_blank'>Ziyan Kuang<sup>12</sup></a>&emsp; <a target='_blank'>Chenhan Yuan<sup>3</sup></a>&emsp; <a target='_blank'>Kailai Yang<sup>3</sup></a>&emsp; <a target='_blank'>Zheheng Luo<sup>3</sup></a>&emsp; <a target='_blank'>Tianlin Zhang<sup>3</sup></a>&emsp; <a target='_blank'>Zhiwei Liu<sup>3</sup></a>&emsp; <a target='_blank'>Guojun Xiong<sup>10</sup></a>&emsp; <a target='_blank'>Zhiyang Deng<sup>9</sup></a>&emsp; <a target='_blank'>Yuechen Jiang<sup>9</sup></a>&emsp; <a target='_blank'>Zhiyuan Yao<sup>9</sup></a>&emsp; <a target='_blank'>Haohang Li<sup>9</sup></a>&emsp; <a target='_blank'>Yangyang Yu<sup>9</sup></a>&emsp; <a target='_blank'>Gang Hu<sup>8</sup></a>&emsp; <a target='_blank'>Jiajia Huang<sup>11</sup></a>&emsp; <a target='_blank'>Xiao-Yang Liu<sup>5</sup></a>&emsp; <a href='https://warrington.ufl.edu/directory/person/12693/' target='_blank'>Alejandro Lopez-Lira<sup>4</sup></a>&emsp; <a target='_blank'>Benyou Wang<sup>6</sup></a>&emsp; <a target='_blank'>Yanzhao Lai<sup>13</sup></a>&emsp; <a target='_blank'>Hao Wang<sup>7</sup></a>&emsp; <a target='_blank'>Min Peng<sup>2*</sup></a>&emsp; <a target='_blank'>Sophia Ananiadou<sup>3</sup></a>&emsp; <a href='' 
target='_blank'>Jimin Huang<sup>1</sup></a> </div> <br /> <div align="left"> <sup>1</sup>The Fin AI&emsp; <sup>2</sup>Wuhan University&emsp; <sup>3</sup>The University of Manchester&emsp; <sup>4</sup>University of Florida&emsp; <sup>5</sup>Columbia University&emsp; <sup>6</sup>The Chinese University of Hong Kong, Shenzhen&emsp; <sup>7</sup>Sichuan University&emsp; <sup>8</sup>Yunnan University&emsp; <sup>9</sup>Stevens Institute of Technology&emsp; <sup>10</sup>Stony Brook University&emsp; <sup>11</sup>Nanjing Audit University&emsp; <sup>12</sup>Jiangxi Normal University&emsp; <sup>13</sup>Southwest Jiaotong University </div> <br /> <div align="left"> <img src='https://i.postimg.cc/CLtkBwz7/57-EDDD9-FB0-DF712-F3-AB627163-C2-1-EF15655-13-FCA.png' alt='Wuhan University Logo' height='50px'>&emsp; <img src='https://assets.manchester.ac.uk/corporate/images/design/logo-university-of-manchester.png' alt='Manchester University Logo' height='50px'>&emsp; <img src='https://i.postimg.cc/XY1s2RHD/University-of-Florida-Logo-1536x864.jpg' alt='University of Florida Logo' height='50px'>&emsp; <img src='https://admissions.ucr.edu/sites/default/files/styles/form_preview/public/2020-07/ucr-education-logo-columbia-university.png?itok=-0FD6Ma2' alt='Columbia University Logo' height='50px'>&emsp; <img src='https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcQeMTMkJVT6g36_LN-8qJ4nMvgT3vM5spUHV3ITRYbym1CEg4Af5Shlp5jX2sWtDFtTK9I&usqp=CAU' alt='HK University (shenzhen) Logo' height='50px'>&emsp; <img src='https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcToJAAiyqxfFuwro5N9Um9TB5LDkiJNKF3hMMQp3pfC0A&s' alt='Sichuan University' height='50px'>&emsp; <img src='https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcRbx3AQWiMhxwOvFb7r1PH-h_i5-b3H9xsGVKnkQwbFlA&s' alt='Yunnan University' height='50px'>&emsp; <img src='https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcRS_o8HItSOTkg5M75N59D6V5u9qg7QYfBa_ITxdfEfwQ&s' alt='Stevens Institute of Technology' height='50px'>&emsp; <img 
src='https://www.stonybrook.edu/sbu-brand/_images/2015/10/logo_stacked_vert.jpg' alt='Stony Brook University' height='50px'>&emsp; <img src='https://upload.wikimedia.org/wikipedia/en/9/9c/Nanjing_Audit_University_logo.png' alt='Nanjing Audit University' height='50px'>&emsp; <img src='https://upload.wikimedia.org/wikipedia/en/thumb/c/c5/Jiangxi_Normal_University.svg/1200px-Jiangxi_Normal_University.svg.png' alt='Jiangxi Normal University' height='50px'>&emsp; <img src='https://i.postimg.cc/k5WpYj0r/SWJTULogo.png' alt='Southwest Jiaotong University Logo' height='50px'>&emsp; </div>

Discord

Pixiu Paper | FinBen Leaderboard

Disclaimer

This repository and its contents are provided for academic and educational purposes only. None of the material constitutes financial, legal, or investment advice. No warranties, express or implied, are offered regarding the accuracy, completeness, or utility of the content. The authors and contributors are not responsible for any errors, omissions, or any consequences arising from the use of the information herein. Users should exercise their own judgment and consult professionals before making any financial, legal, or investment decisions. The use of the software and information contained in this repository is entirely at the user's own risk.

By using or accessing the information in this repository, you agree to indemnify, defend, and hold harmless the authors, contributors, and any affiliated organizations or persons from any and all claims or damages.

📢 Update (Date: 09-22-2023)

🚀 We're thrilled to announce that our paper, "PIXIU: A Comprehensive Benchmark, Instruction Dataset and Large Language Model for Finance", has been accepted by the NeurIPS 2023 Datasets and Benchmarks Track!

📢 Update (Date: 10-08-2023)

We're proud to share the enhanced versions of FinBen, which now support both Chinese and Spanish!

📢 Update (Date: 02-20-2024)

We're delighted to share that our paper, "The FinBen: An Holistic Financial Benchmark for Large Language Models", is now available at FinBen.

📢 Update (Date: 05-02-2024)

We're pleased to invite you to attend the IJCAI2024-challenge, "Financial Challenges in Large Language Models - FinLLM". The starter kit is available at Starter-kit.

Checkpoints:

Languages

Papers

Evaluations:

Sentiment Analysis

Classification

Knowledge Extraction

Number Understanding

Text Summarization

Credit Scoring

Forecasting

Overview

Welcome to the PIXIU project! This project supports the development, fine-tuning, and evaluation of Large Language Models (LLMs) in the financial domain, and is a step towards understanding and applying LLMs to financial tasks.

Structure of the Repository

The repository is organized into several key components, each serving a unique purpose in the financial NLP pipeline:

Key Features


FinBen 2.0: Financial Language Understanding and Prediction Evaluation Benchmark

In this section, we provide a detailed performance analysis of FinMA compared to other leading models, including ChatGPT, GPT-4, BloombergGPT, and others. For this analysis, we've chosen a range of tasks and metrics that span various aspects of financial Natural Language Processing and financial prediction. All FinBen model results can be found on our leaderboard!

Tasks

| Data | Task | Raw | Data Types | Modalities | License | Paper |
| --- | --- | --- | --- | --- | --- | --- |
| FPB | sentiment analysis | 4,845 | news | text | CC BY-SA 3.0 | [1] |
| FiQA-SA | sentiment analysis | 1,173 | news headlines, tweets | text | Public | [2] |
| TSA | sentiment analysis | 561 | news headlines | text | CC BY-NC-SA 4.0 | [3] |
| FOMC | hawkish-dovish classification | 496 | FOMC transcripts | text | CC BY-NC 4.0 | [4] |
| Headlines | news headline classification | 11,412 | news headlines | text | CC BY-SA 3.0 | [5] |
| FinArg-ECC-Task1 | argument unit classification | 969 | earnings conference call | text | CC BY-NC-SA 4.0 | [6] |
| FinArg-ECC-Task2 | argument relation classification | 690 | earnings conference call | text | CC BY-NC-SA 4.0 | [6] |
| Multifin EN | multi-class classification | 546 | article headlines | text | Public | [7] |
| M&A | deal completeness classification | 500 | news articles, tweets | text | Public | [8] |
| MLESG EN | ESG issue identification | 300 | news articles | text | CC BY-NC-ND | [9] |
| NER | named entity recognition | 1,366 | financial agreements | text | CC BY-SA 3.0 | [10] |
| Finer Ord | named entity recognition | 1,080 | news articles | text | CC BY-NC 4.0 | [11] |
| FinRED | relation extraction | 1,070 | earnings call transcripts | text | Public | [12] |
| FinCausal 2020 Task1 | causal classification | 8,630 | news articles, SEC | text | CC BY 4.0 | [13] |
| FinCausal 2020 Task2 | causal detection | 226 | news articles, SEC | text | CC BY 4.0 | [13] |
| FinQA | question answering | 8,281 | earnings reports | text, table | MIT License | [14] |
| TatQA | question answering | 1,670 | financial reports | text, table | MIT License | [15] |
| FNXL | numeric labeling | 318 | SEC | text | Public | [16] |
| FSRL | token classification | 97 | news articles | text | MIT License | [17] |
| ECTSUM | text summarization | 495 | earnings call transcripts | text | Public | [18] |
| EDTSUM | text summarization | 2,000 | news articles | text | Public | [19] |
| German | credit scoring | 1,000 | credit records | table | CC BY 4.0 | [20] |
| Australian | credit scoring | 690 | credit records | table | CC BY 4.0 | [21] |
| Lending Club | credit scoring | 13,453 | financial information | table | CC0 1.0 | [22] |
| BigData22 | stock movement prediction | 7,164 | tweets, historical prices | text, time series | Public | [23] |
| ACL18 | stock movement prediction | 27,053 | tweets, historical prices | text, time series | MIT License | [24] |
| CIKM18 | stock movement prediction | 4,967 | tweets, historical prices | text, time series | Public | [25] |
| ConvFinQA | multi-turn question answering | 1,490 | earnings reports | text, table | MIT License | [26] |
| Credit Card Fraud | fraud detection | 11,392 | financial information | table | (DbCL) v1.0 | [22] |
| ccFraud | fraud detection | 10,485 | financial information | table | Public | [22] |
| Polish | financial distress identification | 8,681 | financial status features | table | CC BY 4.0 | [22] |
| Taiwan Economic Journal | financial distress identification | 6,819 | financial status features | table | CC BY 4.0 | [22] |
| PortoSeguro | claim analysis | 11,904 | claim and financial information | table | Public | [22] |
| Travel Insurance | claim analysis | 12,665 | claim and financial information | table | (ODbL) v1.0 | [22] |

<span id="1">1.</span> Pekka Malo, Ankur Sinha, Pekka Korhonen, Jyrki Wallenius, and Pyry Takala. 2014. Good debt or bad debt: Detecting semantic orientations in economic texts. Journal of the Association for Information Science and Technology 65, 4 (2014), 782–796.

<span id="2">2.</span> Macedo Maia, Siegfried Handschuh, André Freitas, Brian Davis, Ross McDermott, Manel Zarrouk, and Alexandra Balahur. 2018. WWW’18 open challenge: financial opinion mining and question answering. In Companion Proceedings of The Web Conference 2018. 1941–1942.

<span id="3">3.</span> Keith Cortis, André Freitas, Tobias Daudert, Manuela Huerlimann, Manel Zarrouk, Siegfried Handschuh, and Brian Davis. 2017. SemEval-2017 Task 5: Fine-Grained Sentiment Analysis on Financial Microblogs and News. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pages 519–535, Vancouver, Canada. Association for Computational Linguistics.

<span id="4">4.</span> Agam Shah, Suvan Paturi, and Sudheer Chava. 2023. Trillion Dollar Words: A New Financial Dataset, Task & Market Analysis. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 6664–6679, Toronto, Canada. Association for Computational Linguistics.

<span id="5">5.</span> Ankur Sinha and Tanmay Khandait. 2021. Impact of news on the commodity market: Dataset and results. In Advances in Information and Communication: Proceedings of the 2021 Future of Information and Communication Conference (FICC), Volume 2. Springer, 589–601.

<span id="6">6.</span> Chen C C, Lin C Y, Chiu C J, et al. Overview of the NTCIR-17 FinArg-1 Task: Fine-grained argument understanding in financial analysis[C]//Proceedings of the 17th NTCIR Conference on Evaluation of Information Access Technologies, Tokyo, Japan. 2023.

<span id="7">7.</span> Rasmus Jørgensen, Oliver Brandt, Mareike Hartmann, Xiang Dai, Christian Igel, and Desmond Elliott. 2023. MultiFin: A Dataset for Multilingual Financial NLP. In Findings of the Association for Computational Linguistics: EACL 2023, pages 894–909, Dubrovnik, Croatia. Association for Computational Linguistics.

<span id="8">8.</span> Yang, L., Kenny, E.M., Ng, T.L., Yang, Y., Smyth, B., & Dong, R. (2020). Generating Plausible Counterfactual Explanations for Deep Transformers in Financial Text Classification. International Conference on Computational Linguistics.

<span id="9">9.</span> Chung-Chi Chen, Yu-Min Tseng, Juyeon Kang, Anaïs Lhuissier, Min-Yuh Day, Teng-Tsai Tu, and Hsin-Hsi Chen. 2023. Multi-lingual ESG issue identification. In Proceedings of the Fifth Workshop on Financial Technology and Natural Language Processing (FinNLP) and the Second Multimodal AI For Financial Forecasting (Muffin).

<span id="10">10.</span> Julio Cesar Salinas Alvarado, Karin Verspoor, and Timothy Baldwin. 2015. Domain adaption of named entity recognition to support credit risk assessment. In Proceedings of the Australasian Language Technology Association Workshop 2015. 84–90.

<span id="11">11.</span> Shah A, Vithani R, Gullapalli A, et al. Finer: Financial named entity recognition dataset and weak-supervision model[J]. arXiv preprint arXiv:2302.11157, 2023.

<span id="12">12.</span> Sharma, Soumya et al. “FinRED: A Dataset for Relation Extraction in Financial Domain.” Companion Proceedings of the Web Conference 2022 (2022): n. pag.

<span id="13">13.</span> Dominique Mariko, Hanna Abi-Akl, Estelle Labidurie, Stephane Durfort, Hugues De Mazancourt, and Mahmoud El-Haj. 2020. The Financial Document Causality Detection Shared Task (FinCausal 2020). In Proceedings of the 1st Joint Workshop on Financial Narrative Processing and MultiLing Financial Summarisation, pages 23–32, Barcelona, Spain (Online). COLING.

<span id="14">14.</span> Zhiyu Chen, Wenhu Chen, Charese Smiley, Sameena Shah, Iana Borova, Dylan Langdon, Reema Moussa, Matt Beane, Ting-Hao Huang, Bryan R Routledge, et al. 2021. FinQA: A Dataset of Numerical Reasoning over Financial Data. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. 3697–3711.

<span id="15">15.</span> Zhu, Fengbin, Wenqiang Lei, Youcheng Huang, Chao Wang, Shuo Zhang, Jiancheng Lv, Fuli Feng and Tat-Seng Chua. “TAT-QA: A Question Answering Benchmark on a Hybrid of Tabular and Textual Content in Finance.” ArXiv abs/2105.07624 (2021): n. pag.

<span id="16">16.</span> Soumya Sharma, Subhendu Khatuya, Manjunath Hegde, Afreen Shaikh, Koustuv Dasgupta, Pawan Goyal, and Niloy Ganguly. 2023. Financial Numeric Extreme Labelling: A dataset and benchmarking. In Findings of the Association for Computational Linguistics: ACL 2023, pages 3550–3561, Toronto, Canada. Association for Computational Linguistics.

<span id="17">17.</span> Matthew Lamm, Arun Chaganty, Christopher D. Manning, Dan Jurafsky, and Percy Liang. 2018. Textual Analogy Parsing: What’s Shared and What’s Compared among Analogous Facts. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 82–92, Brussels, Belgium. Association for Computational Linguistics.

<span id="18">18.</span> Rajdeep Mukherjee, Abhinav Bohra, Akash Banerjee, Soumya Sharma, Manjunath Hegde, Afreen Shaikh, Shivani Shrivastava, Koustuv Dasgupta, Niloy Ganguly, Saptarshi Ghosh, and Pawan Goyal. 2022. ECTSum: A New Benchmark Dataset For Bullet Point Summarization of Long Earnings Call Transcripts. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 10893–10906, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.

<span id="19">19.</span> Zhihan Zhou, Liqian Ma, and Han Liu. 2021. Trade the Event: Corporate Events Detection for News-Based Event-Driven Trading. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 2114–2124, Online. Association for Computational Linguistics.

<span id="20">20.</span> Hofmann, Hans. (1994). Statlog (German Credit Data). UCI Machine Learning Repository. https://doi.org/10.24432/C5NC77.

<span id="21">21.</span> Quinlan, Ross. Statlog (Australian Credit Approval). UCI Machine Learning Repository. https://doi.org/10.24432/C59012.

<span id="26to32">22.</span> Duanyu Feng, Yongfu Dai, Jimin Huang, Yifang Zhang, Qianqian Xie, Weiguang Han, Alejandro Lopez-Lira, Hao Wang. 2023. Empowering Many, Biasing a Few: Generalist Credit Scoring through Large Language Models. ArXiv abs/2310.00566 (2023): n. pag.

<span id="23">23.</span> Yejun Soun, Jaemin Yoo, Minyong Cho, Jihyeong Jeon, and U Kang. 2022. Accurate Stock Movement Prediction with Self-supervised Learning from Sparse Noisy Tweets. In 2022 IEEE International Conference on Big Data (Big Data). IEEE, 1691–1700.

<span id="24">24.</span> Yumo Xu and Shay B Cohen. 2018. Stock movement prediction from tweets and historical prices. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 1970–1979.

<span id="25">25.</span> Huizhe Wu, Wei Zhang, Weiwei Shen, and Jun Wang. 2018. Hybrid deep sequential modeling for social text-driven stock prediction. In Proceedings of the 27th ACM international conference on information and knowledge management. 1627–1630.

<span id="26">26.</span> Zhiyu Chen, Shiyang Li, Charese Smiley, Zhiqiang Ma, Sameena Shah, and William Yang Wang. 2022. ConvFinQA: Exploring the Chain of Numerical Reasoning in Conversational Finance Question Answering. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 6279–6292, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.

Evaluation

Preparation

Local installation

git clone https://github.com/The-FinAI/PIXIU.git --recursive
cd PIXIU
pip install -r requirements.txt
cd src/financial-evaluation
pip install -e .[multilingual]

Docker image

sudo bash scripts/docker_run.sh

The command above starts a Docker container; you can modify docker_run.sh to fit your environment. A pre-built image is also available by running sudo docker pull tothemoon/pixiu:latest

docker run --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 \
    --network host \
    --env https_proxy=$https_proxy \
    --env http_proxy=$http_proxy \
    --env all_proxy=$all_proxy \
    --env HF_HOME=$hf_home \
    -it [--rm] \
    --name pixiu \
    -v $pixiu_path:$pixiu_path \
    -v $hf_home:$hf_home \
    -v $ssh_pub_key:/root/.ssh/authorized_keys \
    -w $workdir \
    $docker_user/pixiu:$tag \
    [--sshd_port 2201 --cmd "echo 'Hello, world!' && /bin/bash"]

Explanation of the arguments:

Automated Task Assessment

Before evaluation, please download the BART checkpoint to src/metrics/BARTScore/bart_score.pth.

For automated evaluation, please follow these instructions:

  1. Huggingface Transformer

    To evaluate a model hosted on the HuggingFace Hub (for instance, finma-7b-full), use this command:

python eval.py \
    --model "hf-causal-llama" \
    --model_args "use_accelerate=True,pretrained=TheFinAI/finma-7b-full,tokenizer=TheFinAI/finma-7b-full,use_fast=False" \
    --tasks "flare_ner,flare_sm_acl,flare_fpb"

More details can be found in the lm_eval documentation.

  2. Commercial APIs

Please note, for tasks such as NER, the automated evaluation is based on a specific pattern. This might fail to extract relevant information in zero-shot settings, resulting in relatively lower performance compared to previous human-annotated results.

export OPENAI_API_SECRET_KEY=YOUR_KEY_HERE
python eval.py \
    --model gpt-4 \
    --tasks flare_ner,flare_sm_acl,flare_fpb

  3. Self-Hosted Evaluation

To run inference backend:

bash scripts/run_interface.sh

Please adjust run_interface.sh according to your environment requirements.

To evaluate:

python data/*/evaluate.py

Create new tasks

Creating a new task for FinBen involves creating a Huggingface dataset and implementing the task in a Python file. This guide walks you through each step of setting up a new task using the FinBen framework.

Creating your dataset in Huggingface

Your dataset should be created in the following format:

{
    "query": "...",
    "answer": "...",
    "text": "..."
}

In this format:

For Multi-turn tasks (such as )

For Classification tasks (such as FPB (FinBen_fpb)), additional keys should be defined:

For Sequential Labeling tasks (such as Finer Ord (FinBen_finer_ord)), additional keys should be defined:

For Extractive Summarization tasks (such as ECTSUM (FinBen_ectsum)), additional keys should be defined:

For abstractive Summarization and Question Answering tasks (such as EDTSUM (FinBen_edtsum)), no additional keys should be defined
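
As a minimal sketch of the base format above, the following Python builds one row with the three keys `query`, `answer`, and `text`, checks that the keys are present, and serializes rows as JSON Lines. The helper names (`make_record`, `validate_record`, `to_jsonl`) and the example prompt wording are assumptions made for illustration; they are not part of the FinBen code base, and task-specific additional keys are not covered here.

```python
import json

# The three base keys shown in the format above.
REQUIRED_KEYS = ("query", "answer", "text")


def make_record(query: str, answer: str, text: str) -> dict:
    """Build one dataset row with the three base keys (hypothetical helper)."""
    return {"query": query, "answer": answer, "text": text}


def validate_record(record: dict) -> None:
    """Raise ValueError if one of the base keys is missing."""
    for key in REQUIRED_KEYS:
        if key not in record:
            raise ValueError(f"missing key: {key}")


def to_jsonl(records: list) -> str:
    """Serialize rows as JSON Lines, one JSON object per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)


# Illustrative row; the prompt wording is an assumption, not a FinBen prompt.
text = "Operating profit rose to EUR 13.1 mn from EUR 8.7 mn."
record = make_record(
    query=f"Analyze the sentiment of this statement. Text: {text} Answer:",
    answer="positive",
    text=text,
)
validate_record(record)
jsonl = to_jsonl([record])
print(jsonl)
```

A file in this shape can then be uploaded as a Hugging Face dataset through whichever workflow you normally use.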

Implementing the task

Once your dataset is ready, you can start implementing your task. Your task should be defined within a new class in flare.py or any other Python file located within the tasks directory.

To cater to a range of tasks, we offer several specialized base classes, including Classification, SequentialLabeling, RelationExtraction, ExtractiveSummarization, AbstractiveSummarization and QA.

For instance, if you are embarking on a classification task, you can directly leverage our Classification base class. This class allows for efficient and intuitive task creation. To better demonstrate this, let's delve into an example of crafting a task named FinBen-FPB using the Classification base class:

# The class name must match the one referenced in TASK_REGISTRY (flare.FPB).
class FPB(Classification):
    DATASET_PATH = "flare-fpb"

And that's it! Once you've created your task class, the next step is to register it in the src/tasks/__init__.py file. To do this, add a new line following the format "task_name": module.ClassName. Here is how it's done:

TASK_REGISTRY = {
    "flare_fpb": flare.FPB,
    "your_new_task": your_module.YourTask,  # This is where you add your task
}
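
To make the registration step concrete, here is a self-contained sketch of the name-to-class lookup that a registry like this enables. The `Classification` class below is a hypothetical stand-in for the framework's base class, and `get_task` is an assumed helper name; the real framework may resolve tasks differently.

```python
class Classification:
    """Hypothetical stand-in for the framework's Classification base class."""

    DATASET_PATH = None


class FPB(Classification):
    # Only the dataset path needs to be set on the subclass.
    DATASET_PATH = "flare-fpb"


# Maps the task name used on the command line to the class implementing it.
TASK_REGISTRY = {
    "flare_fpb": FPB,
}


def get_task(name: str):
    """Look up a task class by name and instantiate it (assumed helper)."""
    if name not in TASK_REGISTRY:
        known = ", ".join(sorted(TASK_REGISTRY))
        raise KeyError(f"unknown task {name!r}; registered tasks: {known}")
    return TASK_REGISTRY[name]()


task = get_task("flare_fpb")
print(type(task).__name__, task.DATASET_PATH)
```

The point of the pattern is that the string passed to `--tasks` must be a key of the registry, and the value must be the class object itself rather than an instance.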

Predefined task metrics

| Task | Metric | Illustration |
| --- | --- | --- |
| Classification | Accuracy | This metric represents the ratio of correctly predicted observations to total observations. It is calculated as (True Positives + True Negatives) / Total Observations. |
| Classification | F1 Score | The F1 Score represents the harmonic mean of precision and recall, thereby creating an equilibrium between these two factors. It proves particularly useful in scenarios where one factor bears more significance than the other. The score ranges from 0 to 1, with 1 signifying perfect precision and recall, and 0 indicating the worst case. Furthermore, we provide both 'weighted' and 'macro' versions of the F1 score. |
| Classification | Missing Ratio | This metric calculates the proportion of responses where no options from the given choices in the task are returned. |
| Classification | Matthews Correlation Coefficient (MCC) | The MCC is a metric that assesses the quality of binary classifications, producing a score ranging from -1 to +1. A score of +1 signifies perfect prediction, 0 denotes a prediction no better than random chance, and -1 indicates a completely inverse prediction. |
| Sequential Labeling | F1 score | In the context of Sequential Labeling tasks, we utilize the F1 Score as computed by the seqeval library, a robust entity-level evaluation metric. This metric mandates an exact match of both the entity's span and type between the predicted and ground truth entities for a correct evaluation. True Positives (TP) represent correctly predicted entities, False Positives (FP) denote incorrectly predicted entities or entities with mismatched spans/types, and False Negatives (FN) signify missed entities from the ground truth. Precision, recall, and F1-score are then computed using these quantities, with the F1 Score representing the harmonic mean of precision and recall. |
| Sequential Labeling | Label F1 score | This metric evaluates model performance based solely on the correctness of the labels predicted, without considering entity spans. |
| Relation Extraction | Precision | Precision measures the proportion of correctly predicted relations out of all predicted relations. It is calculated as the number of True Positives (TP) divided by the sum of True Positives and False Positives (FP). |
| Relation Extraction | Recall | Recall measures the proportion of correctly predicted relations out of all actual relations. It is calculated as the number of True Positives (TP) divided by the sum of True Positives and False Negatives (FN). |
| Relation Extraction | F1 score | The F1 Score is the harmonic mean of precision and recall, and it provides a balance between these two metrics. The F1 Score is at its best at 1 (perfect precision and recall) and worst at 0. |
| Extractive and Abstractive Summarization | Rouge-N | This measures the overlap of N-grams (a contiguous sequence of N items from a given sample of text) between the system-generated summary and the reference summary. 'N' can be 1, 2, or more, with ROUGE-1 and ROUGE-2 being commonly used to assess unigram and bigram overlaps respectively. |
| Extractive and Abstractive Summarization | Rouge-L | This metric evaluates the longest common subsequence (LCS) between the system and the reference summaries. LCS takes into account sentence level structure similarity naturally and identifies longest co-occurring in-sequence n-grams automatically. |
| Question Answering | EmACC | EmACC assesses the exact match between the model-generated response and the reference answer. In other words, the model-generated response is considered correct only if it matches the reference answer exactly, word-for-word. |
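
The classification metrics in the table can be written out in a few lines of plain Python. This is an illustrative sketch of the definitions given above (accuracy, macro F1, binary MCC, and missing ratio), not the implementation used by the evaluation code; the function names and the toy labels are assumptions.

```python
from math import sqrt


def accuracy(gold, pred):
    """Share of predictions that equal the gold label."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def f1_for_label(gold, pred, label):
    """One-vs-rest F1: harmonic mean of precision and recall for `label`."""
    tp = sum(g == label and p == label for g, p in zip(gold, pred))
    fp = sum(g != label and p == label for g, p in zip(gold, pred))
    fn = sum(g == label and p != label for g, p in zip(gold, pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def macro_f1(gold, pred):
    """Unweighted mean of per-label F1 over the labels present in gold."""
    labels = sorted(set(gold))
    return sum(f1_for_label(gold, pred, lab) for lab in labels) / len(labels)


def mcc_binary(gold, pred, positive):
    """Matthews correlation coefficient for a binary task."""
    tp = sum(g == positive and p == positive for g, p in zip(gold, pred))
    tn = sum(g != positive and p != positive for g, p in zip(gold, pred))
    fp = sum(g != positive and p == positive for g, p in zip(gold, pred))
    fn = sum(g == positive and p != positive for g, p in zip(gold, pred))
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def missing_ratio(pred, choices):
    """Share of responses that are not one of the allowed choices."""
    return sum(p not in choices for p in pred) / len(pred)


# Toy data: 3 TP, 3 TN, 1 FP, 1 FN with "yes" as the positive class.
gold = ["yes", "yes", "yes", "no", "no", "no", "no", "yes"]
pred = ["yes", "yes", "no", "no", "no", "yes", "no", "yes"]

print(accuracy(gold, pred))           # 0.75
print(macro_f1(gold, pred))           # 0.75
print(mcc_binary(gold, pred, "yes"))  # 0.5
print(missing_ratio(["yes", "no", "maybe", ""], {"yes", "no"}))  # 0.5
```

The weighted F1 variant mentioned in the table differs from the macro version only in weighting each label's F1 by its frequency in the gold labels.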

Additionally, you can determine if the labels should be lowercased during the matching process by specifying LOWER_CASE in your class definition. This is pertinent since labels are matched based on their appearance in the generated output. For tasks like examinations where the labels are a specific set of capitalized letters such as 'A', 'B', 'C', this should typically be set to False.
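
The effect of this setting can be seen in a small sketch. The function below is a hypothetical illustration of matching labels by their appearance in the generated output; `match_label` and its `lower_case` parameter are names assumed here and do not reproduce the framework's internal logic.

```python
def match_label(output: str, choices: list, lower_case: bool = True):
    """Return the first choice that appears in the output, else None.

    Hypothetical helper: a None result corresponds to a "missing" response.
    """
    haystack = output.lower() if lower_case else output
    for choice in choices:
        needle = choice.lower() if lower_case else choice
        if needle in haystack:
            return choice
    return None


# Word labels: lowercasing makes the match robust to capitalization.
sentiment = match_label(
    "The sentiment is Positive.", ["negative", "positive", "neutral"]
)
print(sentiment)  # positive

# Letter labels: lowercasing matches the "a" inside "answer" by accident.
exam_choices = ["A", "B", "C"]
wrong = match_label("The answer is B.", exam_choices, lower_case=True)
right = match_label("The answer is B.", exam_choices, lower_case=False)
print(wrong, right)  # A B
```

With single-letter labels, case-insensitive matching picks up letters inside ordinary words, which is why such tasks are usually better served by case-sensitive matching.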


FIT: Financial Instruction Dataset

Our instruction dataset is tailored for the domain-specific LLM, FinMA. It has been assembled to fine-tune our model on a diverse range of financial tasks, and features publicly available multi-task and multi-modal data derived from multiple openly released financial datasets.

The dataset is multi-faceted, featuring tasks including sentiment analysis, news headline classification, named entity recognition, question answering, and stock movement prediction. It covers both textual and time-series data modalities, offering a rich variety of financial data. The task-specific instruction prompts for each task have been carefully designed by domain experts.

Modality and Prompts

The table below summarizes the different tasks, their corresponding modalities, text types, and examples of the instructions used for each task:

| Task | Modalities | Text Types | Instructions Examples |
| --- | --- | --- | --- |
| Sentiment Analysis | Text | news headlines, tweets | "Analyze the sentiment of this statement extracted from a financial news article. Provide your answer as either negative, positive or neutral. For instance, 'The company's stocks plummeted following the scandal.' would be classified as negative." |
| News Headline Classification | Text | News Headlines | "Consider whether the headline mentions the price of gold. Is there a Price or Not in the gold commodity market indicated in the news headline? Please answer Yes or No." |
| Named Entity Recognition | Text | financial agreements | "In the sentences extracted from financial agreements in U.S. SEC filings, identify the named entities that represent a person ('PER'), an organization ('ORG'), or a location ('LOC'). The required answer format is: 'entity name, entity type'. For instance, in 'Elon Musk, CEO of SpaceX, announced the launch from Cape Canaveral.', the entities would be: 'Elon Musk, PER; SpaceX, ORG; Cape Canaveral, LOC'" |
| Question Answering | Text | earnings reports | "In the context of this series of interconnected finance-related queries and the additional information provided by the pretext, table data, and post text from a company's financial filings, please provide a response to the final question. This may require extracting information from the context and performing mathematical calculations. Please take into account the information provided in the preceding questions and their answers when formulating your response:" |
| Stock Movement Prediction | Text, Time-Series | tweets, Stock Prices | "Analyze the information and social media posts to determine if the closing price of {tid} will ascend or descend at {point}. Please respond with either Rise or Fall." |
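
The stock movement instruction contains the placeholders `{tid}` and `{point}`, which are filled in per sample. A minimal sketch of that substitution with Python's `str.format` follows; the `build_prompt` helper, the `Context:`/`Answer:` layout, and the sample values are assumptions for illustration and are not taken from the dataset itself.

```python
# Instruction text copied from the table above.
TEMPLATE = (
    "Analyze the information and social media posts to determine if the "
    "closing price of {tid} will ascend or descend at {point}. "
    "Please respond with either Rise or Fall."
)


def build_prompt(tid: str, point: str, context: str) -> str:
    """Fill the placeholders and append the per-sample context.

    The "Context:" / "Answer:" layout is an assumed convention.
    """
    instruction = TEMPLATE.format(tid=tid, point=point)
    return f"{instruction}\nContext: {context}\nAnswer:"


# Hypothetical sample values: a ticker, a target date, and a short context.
prompt = build_prompt(
    tid="$aapl",
    point="2020-01-02",
    context="date,open,high,low,close ... (prices and tweets go here)",
)
print(prompt)
```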

Dataset Statistics

The dataset contains a vast amount of instruction data samples (136K), allowing FinMA to capture the nuances of the diverse financial tasks. The table below provides the statistical details of the instruction dataset:

| Data | Task | Raw | Instruction | Data Types | Modalities | License | Original Paper |
| --- | --- | --- | --- | --- | --- | --- | --- |
| FPB | sentiment analysis | 4,845 | 48,450 | news | text | CC BY-SA 3.0 | [1] |
| FiQA-SA | sentiment analysis | 1,173 | 11,730 | news headlines, tweets | text | Public | [2] |
| Headline | news headline classification | 11,412 | 11,412 | news headlines | text | CC BY-SA 3.0 | [3] |
| NER | named entity recognition | 1,366 | 13,660 | financial agreements | text | CC BY-SA 3.0 | [4] |
| FinQA | question answering | 8,281 | 8,281 | earnings reports | text, table | MIT License | [5] |
| ConvFinQA | question answering | 3,892 | 3,892 | earnings reports | text, table | MIT License | [6] |
| BigData22 | stock movement prediction | 7,164 | 7,164 | tweets, historical prices | text, time series | Public | [7] |
| ACL18 | stock movement prediction | 27,053 | 27,053 | tweets, historical prices | text, time series | MIT License | [8] |
| CIKM18 | stock movement prediction | 4,967 | 4,967 | tweets, historical prices | text, time series | Public | [9] |

  1. Pekka Malo, Ankur Sinha, Pekka Korhonen, Jyrki Wallenius, and Pyry Takala. 2014. Good debt or bad debt: Detecting semantic orientations in economic texts. Journal of the Association for Information Science and Technology 65, 4 (2014), 782–796.
  2. Macedo Maia, Siegfried Handschuh, André Freitas, Brian Davis, Ross McDermott, Manel Zarrouk, and Alexandra Balahur. 2018. WWW’18 open challenge: financial opinion mining and question answering. In Companion Proceedings of The Web Conference 2018. 1941–1942.
  3. Ankur Sinha and Tanmay Khandait. 2021. Impact of news on the commodity market: Dataset and results. In Advances in Information and Communication: Proceedings of the 2021 Future of Information and Communication Conference (FICC), Volume 2. Springer, 589–601.
  4. Julio Cesar Salinas Alvarado, Karin Verspoor, and Timothy Baldwin. 2015. Domain adaption of named entity recognition to support credit risk assessment. In Proceedings of the Australasian Language Technology Association Workshop 2015. 84–90.
  5. Zhiyu Chen, Wenhu Chen, Charese Smiley, Sameena Shah, Iana Borova, Dylan Langdon, Reema Moussa, Matt Beane, Ting-Hao Huang, Bryan R Routledge, et al. 2021. FinQA: A Dataset of Numerical Reasoning over Financial Data. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. 3697–3711.
  6. Zhiyu Chen, Shiyang Li, Charese Smiley, Zhiqiang Ma, Sameena Shah, and William Yang Wang. 2022. ConvFinQA: Exploring the chain of numerical reasoning in conversational finance question answering. arXiv preprint arXiv:2210.03849 (2022).
  7. Yejun Soun, Jaemin Yoo, Minyong Cho, Jihyeong Jeon, and U Kang. 2022. Accurate Stock Movement Prediction with Self-supervised Learning from Sparse Noisy Tweets. In 2022 IEEE International Conference on Big Data (Big Data). IEEE, 1691–1700.
  8. Yumo Xu and Shay B Cohen. 2018. Stock movement prediction from tweets and historical prices. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 1970–1979.
  9. Huizhe Wu, Wei Zhang, Weiwei Shen, and Jun Wang. 2018. Hybrid deep sequential modeling for social text-driven stock prediction. In Proceedings of the 27th ACM international conference on information and knowledge management. 1627–1630.
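
As a quick arithmetic check on the 136K figure, the Raw and Instruction counts from the statistics table can be summed directly. The numbers below are copied from the table; the variable names are only for illustration.

```python
# (raw samples, instruction samples) per dataset, copied from the table.
COUNTS = {
    "FPB": (4_845, 48_450),
    "FiQA-SA": (1_173, 11_730),
    "Headline": (11_412, 11_412),
    "NER": (1_366, 13_660),
    "FinQA": (8_281, 8_281),
    "ConvFinQA": (3_892, 3_892),
    "BigData22": (7_164, 7_164),
    "ACL18": (27_053, 27_053),
    "CIKM18": (4_967, 4_967),
}

total_instruction = sum(instr for _, instr in COUNTS.values())
print(total_instruction)  # 136609, i.e. roughly 136K

# Datasets whose instruction count is a multiple of the raw count.
expanded = {
    name: instr // raw
    for name, (raw, instr) in COUNTS.items()
    if instr != raw and instr % raw == 0
}
print(expanded)  # {'FPB': 10, 'FiQA-SA': 10, 'NER': 10}
```

In the table, FPB, FiQA-SA, and NER each have ten instruction samples per raw sample, while the remaining datasets have one.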

Generating Datasets for FIT

When you are working with the Financial Instruction Dataset (FIT), it's crucial to follow the prescribed format for training and testing models.

The format should look like this:

{
    "id": "unique id",
    "conversations": [
        {
            "from": "human",
            "value": "Your prompt and text"
        },
        {
            "from": "agent",
            "value": "Your answer"
        }
    ],
    "text": "Text to be classified",
    "label": "Your label"
}

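Here's what each field means:

"id" is a unique identifier for the example. "conversations" is a list of conversation turns: the first turn should always be from "human" and contain your prompt and the text, and the second turn should be from "agent" and contain your answer. "text" is the text to be classified, and "label" is the label for that text.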
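As a rough illustration, a record in this format could be built and checked in Python along the following lines. This is only a sketch: the helper names make_fit_record and validate_fit_record are hypothetical and not part of PIXIU, the example prompt and label are made up, and the checks encode nothing beyond the rules stated above.

```python
import json


def make_fit_record(record_id, prompt, answer, text, label):
    """Build one record in the FIT format shown above (hypothetical helper)."""
    return {
        "id": record_id,
        "conversations": [
            {"from": "human", "value": prompt},
            {"from": "agent", "value": answer},
        ],
        "text": text,
        "label": label,
    }


def validate_fit_record(record):
    """Return a list of problems; an empty list means the record looks valid."""
    problems = []
    for key in ("id", "conversations", "text", "label"):
        if key not in record:
            problems.append(f"missing field: {key}")
    turns = record.get("conversations", [])
    if len(turns) < 2:
        problems.append("conversations needs a human turn and an agent turn")
    else:
        if turns[0].get("from") != "human":
            problems.append("first turn must be from 'human'")
        if turns[1].get("from") != "agent":
            problems.append("second turn must be from 'agent'")
    return problems


# Made-up example values, for illustration only.
record = make_fit_record(
    record_id="example-0",
    prompt="Analyze the sentiment of this statement: Profits rose sharply this quarter.",
    answer="positive",
    text="Profits rose sharply this quarter.",
    label="positive",
)
print(json.dumps(record, indent=4))
print(validate_fit_record(record))
```

Running a check like this over every record before training or evaluation is a cheap way to catch swapped turns or missing fields early.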


FinMA v0.1: Financial Large Language Model

We are pleased to introduce the first version of FinMA, which includes three models: FinMA-7B, FinMA-7B-full, and FinMA-30B, fine-tuned from LLaMA-7B and LLaMA-30B. FinMA-7B and FinMA-30B are trained with the NLP instruction data, while FinMA-7B-full is trained with the full instruction data from FIT, covering both NLP and prediction tasks.

FinMA v0.1 is now available on Hugging Face for public use. We look forward to the contributions this initial version will make to the financial NLP field and encourage users to apply it to various financial tasks and scenarios. We also invite feedback and shared experiences to help improve future versions.

How to fine-tune a new large language model using PIXIU based on FIT?

Coming soon.


FinMem: A Performance-Enhanced LLM Trading Agent

FinMem is a novel LLM-based agent framework devised for financial decision-making. It encompasses three core modules: Profiling, to outline the agent's characteristics; Memory, with layered processing, to aid the agent in assimilating realistic hierarchical financial data; and Decision-making, to convert insights gained from memories into investment decisions. Currently, FinMem can trade single stocks with high returns after a simple-mode warm-up. Below is a quick start for a dockerized version of the framework, with TSLA as sample input.

Step 1: Set environment variables in .env. Add your Hugging Face token and OpenAI API key as needed.

OPENAI_API_KEY = "<Your OpenAI Key>"
HF_TOKEN = "<Your HF token>"
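Before building the container, it can help to confirm that the keys in .env are filled in and not still placeholders. The following is a minimal sketch using only the Python standard library; parse_env and missing_keys are hypothetical helpers, not part of FinMem, and how FinMem itself loads .env is not shown here. Since the keys are only needed "as needed", the required tuple is a parameter you would adjust to your setup.

```python
def parse_env(text):
    """Parse simple KEY = "value" lines; skips blank lines and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(values, required=("OPENAI_API_KEY", "HF_TOKEN")):
    """Return required keys that are absent, empty, or still '<...>' placeholders."""
    return [
        key for key in required
        if not values.get(key) or values[key].startswith("<")
    ]


# Sample contents: one placeholder left in, one made-up token value.
sample = 'OPENAI_API_KEY = "<Your OpenAI Key>"\nHF_TOKEN = "hf_example"\n'
print(missing_keys(parse_env(sample)))
```

In practice the text would come from reading the .env file, for example with pathlib.Path(".env").read_text().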

Step 2: Set the endpoint URL in config.toml. Use the endpoint URL to deploy models based on the model of choice (OpenAI, Gemini, open-source models on Hugging Face, etc.). For open-source models on Hugging Face, one choice for generating TGI endpoints is RunPod.

[chat]
model = "tgi"
end_point = "<set your endpoint address>"
tokenization_model_name = "<model name>"
...

Step 3: Build Docker Image and Container

docker build -t test-finmem .devcontainer/. 

Start the container:

docker run -it --rm -v $(pwd):/finmem test-finmem bash

Step 4: Start Simulation!

 Usage: run.py sim [OPTIONS]

 Start Simulation

 Options:
   --market-data-path    -mdp  TEXT  The environment data pickle path [default: data/06_input/subset_symbols.pkl]
   --start-time          -st   TEXT  The training or test start time [default: 2022-06-30 For Ticker 'TSLA']
   --end-time            -et   TEXT  The training or test end time [default: 2022-10-11]
   --run-model           -rm   TEXT  Run mode: train or test [default: train]
   --config-path         -cp   TEXT  config file path [default: config/config.toml]
   --checkpoint-path     -ckp  TEXT  The checkpoint save path [default: data/10_checkpoint_test]
   --result-path         -rp   TEXT  The result save path [default: data/11_train_result]
   --trained-agent-path  -tap  TEXT  Only used in test mode, the path of trained agent [default: None. Can be changed to data/05_train_model_output OR data/06_train_checkpoint]
   --help                            Show this message and exit.

Example Usage:

python run.py sim --market-data-path data/03_model_input/tsla.pkl --start-time 2022-06-30 --end-time 2022-10-11 --run-model train --config-path config/tsla_tgi_config.toml --checkpoint-path data/06_train_checkpoint --result-path data/05_train_model_output
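If you launch several simulations, the command above can be assembled from a dictionary of options so that the paths and dates live in one place. The sketch below uses only the flags and values from the example; build_sim_command is a hypothetical helper, and the date-order check is an added convenience, not something run.py is known to require.

```python
import shlex
from datetime import date

# Flag values copied from the example command above.
options = {
    "--market-data-path": "data/03_model_input/tsla.pkl",
    "--start-time": "2022-06-30",
    "--end-time": "2022-10-11",
    "--run-model": "train",
    "--config-path": "config/tsla_tgi_config.toml",
    "--checkpoint-path": "data/06_train_checkpoint",
    "--result-path": "data/05_train_model_output",
}


def build_sim_command(options):
    """Turn the options into an argv list, after checking the date range."""
    start = date.fromisoformat(options["--start-time"])
    end = date.fromisoformat(options["--end-time"])
    if start >= end:
        raise ValueError("start time must be before end time")
    argv = ["python", "run.py", "sim"]
    for flag, value in options.items():
        argv += [flag, value]
    return argv


argv = build_sim_command(options)
print(shlex.join(argv))
```

The resulting list can be passed to subprocess.run(argv, check=True) from inside the container, which avoids shell quoting issues.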

FinMem also supports checkpointing. For more details, please visit the FinMem repository directly.


Citation

If you use PIXIU in your work, please cite our paper.

@misc{xie2023pixiu,
      title={PIXIU: A Large Language Model, Instruction Data and Evaluation Benchmark for Finance}, 
      author={Qianqian Xie and Weiguang Han and Xiao Zhang and Yanzhao Lai and Min Peng and Alejandro Lopez-Lira and Jimin Huang},
      year={2023},
      eprint={2306.05443},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

@misc{xie2024FinBen,
      title={The FinBen: An Holistic Financial Benchmark for Large Language Models}, 
      author={Qianqian Xie and Weiguang Han and Zhengyu Chen and Ruoyu Xiang and Xiao Zhang and Yueru He and Mengxi Xiao and Dong Li and Yongfu Dai and Duanyu Feng and Yijing Xu and Haoqiang Kang and Ziyan Kuang and Chenhan Yuan and Kailai Yang and Zheheng Luo and Tianlin Zhang and Zhiwei Liu and Guojun Xiong and Zhiyang Deng and Yuechen Jiang and Zhiyuan Yao and Haohang Li and Yangyang Yu and Gang Hu and Jiajia Huang and Xiao-Yang Liu and Alejandro Lopez-Lira and Benyou Wang and Yanzhao Lai and Hao Wang and Min Peng and Sophia Ananiadou and Jimin Huang},
      year={2024},
      eprint={2402.12659},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

License

PIXIU is licensed under the MIT License. For more details, please see the MIT file.
