eXtensible Business Reporting Language (XBRL)

Task Description

In this task, we assess the ability of LLMs to retrieve and interpret eXtensible Business Reporting Language (XBRL) filings and their ability to use this data to generate precise answers to financial math questions. The objective of this task is to evaluate how well LLMs can process the structured financial data encoded in XBRL format.

Companies need to provide accurate and standardized financial information to stakeholders, including investors, regulators, and auditors. XBRL is a globally recognized standard for electronic communication of business and financial data. Using XBRL will help ensure consistency and comparability across different companies and jurisdictions. Using LLMs to retrieve and process financial data in XBRL format is an important application in financial regulation.

Example Input and Output:

XBRL Term

Q: What does the term ‘abstract’ mean in the context of the XBRL standard? Please provide a detailed explanation of this term.

A: An attribute of an element to indicate that the element is only used in a hierarchy to group related elements together. An abstract element cannot be used to tag data in an instance document.

Domain Query to XBRL Reports

Q: Among operations, investing, and financing activities, which brought in the most net (or lowest net) cash flow for Nike in FY2023?

A: Among the three, cash flow from operations was the highest for Nike in FY2023.

Financial Math

Q: A project expects annual cash inflows of $6,000 over 7 years. If the discount rate is 8%, what is the Net Present Value (NPV) of the project?

A: 21462.58

Numeric Query to XBRL Reports

Q: What is the FY2015 unadjusted EBITDA margin for Netflix? Calculate unadjusted EBITDA using unadjusted operating income and D&A.

A: 0.054

Below are examples of XBRL-related questions:

Dataset

The XBRL benchmark dataset is used to evaluate the ability of LLMs to interpret the XBRL standards, retrieve data in XBRL filings, and answer financial math questions. This benchmark dataset comprises XBRL terms, domain queries to XBRL reports, numeric queries to XBRL reports, tag queries to XBRL reports, financial math questions, and financial ratio formula with XBRL tags.

The questions for XBRL terms is to evaluate LLMs’ ability to explain an XBRL term. Domain queries are questions about different domains, such as products and services, in XBRL reports. Numeric queries are questions asking LLMs to retrieve specific data from XBRL reports. Tag queries are questions asking LLMs to retrieve corresponding tags for an item from XBRL reports. Financial math questions asking LLMs to return the result of the given math problem. Financial ratio formula questions ask LLMs to return the calculation formula with corresponding tags for a given financial ratio.

Data statistics

Data	Size	XBRL reports	Data Source
XBRL Term	500	Not provided	XBRL Agent
Domain Query to XBRL Reports	50	Selectively provided	XBRL Agent
Financial Math	90	Not provided	XBRL Agent
Numeric Query to XBRL Reports	50	Selectively provided	XBRL Agent
XBRL Tag Query to XBRL Reports	50	Selectively provided	XBRL filings from SEC
FiNER: Financial Numeric Entity Recognition for XBRL Tagging	100	Selectively provided	FiNER-139 Dataset: https://huggingface.co/datasets/nlpaueb/finer-139, https://github.com/nlpaueb/finer
FNXL: Financial Numeric Extreme Labelling	100	Selectively provided	FNXL Dataset: https://huggingface.co/datasets/ChanceFocus/flare-fnxl, https://arxiv.org/abs/2306.03723
Total	940

Users can fine-tune or evaluate LLMs using this dataset, call additional tools, using additional open-source.

Metrics

We use accuracy for financial math questions, numeric queries to XBRL reports, tag queries to XBRL reports, and financial ratio formulas. We use FActScore for XBRL terms and domain queries to XBRL reports.

Baseline Performance

Model	Method	XBRL Term (FActScore)	Domain and Numeric Query to XBRL Reports (FActScore)	Financial Math (Accuracy)	Tag Query to XBRL Reports (Accuracy)	Score (Average)
Llama 3.1-8B	Zero-shot	0.7083	0.5845	0.7667	0.1667	0.5565
GPT-4o	Zero-shot	0.8503	0.5851	0.8842	0.7778	0.7743
Mistral Large 2	Zero-shot	0.8221	0.6831	0.7444	0.8667	0.7791

Reference

[1] Sewon Min et al. (2023). FactScore: Fine-grained atomic evaluation of factual precision in long-form text generation. arXiv preprint arXiv:2305.14251. Available at: https://arxiv.org/abs/2305.14251

Shijie Han, et al. XBRL-Agent: Leveraging Large Language Models for Financial Report Analysis. Proceedings of the Conference ICAIF ‘24: Proceedings of the 5th ACM International Conference on AI in Finance https://doi.org/10.1145/3677052.3698614.