Blog

5 Steps to Build An AI-Powered Financial Statement Knowledge Graph

Jason Jones
December 19, 2024

Recently we launched FootNoteDB Private, an AI-powered private internal database that we construct for each of our clients. It contains hundreds or thousands of their financial statements, making it easier to search and retrieve valuable information from their proprietary dataset. Our product and engineering team is rapidly iterating on our database infrastructure, applying advanced AI techniques to surface high-quality results.

We use a methodology called RAG (retrieval-augmented generation) to retrieve contextually relevant information from the financial statements, run it through an LLM (GPT-4o), and generate results from natural language queries. In our RAG methodology, we use an embedding model and a vector database to help with the semantic understanding of the underlying documents. While our embedding-based RAG methodology is very good at finding the exact thing a user asks for in a single document, it is not as good at finding related information across documents, or related information in separate sections of a larger document.

We are considering adding a knowledge graph to RAG by upgrading our methodology to a hybrid database approach that includes both vector and graph databases. By creating a knowledge graph derived from the financial statements, we can examine the whole picture. This graph-based approach to RAG maps how key concepts and data points in the financial statements are connected, so related information can be found across documents and in separate sections of larger documents.

The following is our five-step framework to build an AI-powered knowledge base for analyzing financial data across large quantities of financial statements.

Step 1: Data extraction and data structuring: We have built a data pipeline that utilizes a sequence of AI models to extract data from PDFs, transform unstructured content into a structured format, and tag the data. These models identify and separate objects such as characters, words, paragraphs, headers in text, cells, tables, and images. Beyond organizing the data, they retain precise coordinates for each element, ensuring that every piece of information can be traced back to its original location on the page.
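To make the provenance idea concrete, here is a minimal sketch of what an extracted element with traceable coordinates might look like. The class and field names are illustrative, not our actual pipeline schema:

```python
from dataclasses import dataclass

@dataclass
class PageElement:
    """One extracted object (word, cell, header, ...) with its provenance."""
    kind: str    # e.g. "paragraph", "table_cell", "header"
    text: str
    page: int
    bbox: tuple  # (x0, y0, x1, y1) in page coordinates

def trace_to_source(element: PageElement) -> str:
    """Return a human-readable pointer back to the original PDF location."""
    x0, y0, x1, y1 = element.bbox
    return f"page {element.page}, region ({x0}, {y0})-({x1}, {y1})"

cell = PageElement("table_cell", "Total assets: $1,204,000", page=12,
                   bbox=(72.0, 340.5, 310.2, 356.0))
print(trace_to_source(cell))  # page 12, region (72.0, 340.5)-(310.2, 356.0)
```

Keeping the bounding box alongside the text is what lets every downstream answer cite its exact location on the page.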

Step 2: Knowledge base architecture - a hybrid database approach: We will use two types of databases in our knowledge base layer. The first will be a vector database to store and query embedded chunks of financial data. The second will be a graph database representing the nodes and relationships between financial concepts, allowing us to link concepts across documents, understand relationships within financial statements, identify commonalities between financial statements, and model financial hierarchies.
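The two stores answer different kinds of questions. The sketch below uses hypothetical in-memory stand-ins to show the division of labor; a real system would use a production vector database and a graph database (e.g., Neo4j) here:

```python
class VectorStore:
    """Toy vector store: nearest chunk by cosine similarity."""
    def __init__(self):
        self.chunks = []  # list of (embedding, text)

    def add(self, embedding, text):
        self.chunks.append((embedding, text))

    def nearest(self, query_vec):
        def cos(a, b):
            dot = sum(x * y for x, y in zip(a, b))
            na = sum(x * x for x in a) ** 0.5
            nb = sum(x * x for x in b) ** 0.5
            return dot / (na * nb)
        return max(self.chunks, key=lambda c: cos(c[0], query_vec))[1]

class GraphStore:
    """Toy graph store: concept -> [(relation, concept), ...]."""
    def __init__(self):
        self.edges = {}

    def relate(self, src, rel, dst):
        self.edges.setdefault(src, []).append((rel, dst))

    def neighbors(self, concept):
        return self.edges.get(concept, [])

vectors = VectorStore()
vectors.add([1.0, 0.0], "Note 7: long-term debt matures in 2027.")
vectors.add([0.0, 1.0], "Revenue grew 12% year over year.")

graph = GraphStore()
graph.relate("long-term debt", "DISCLOSED_IN", "Note 7")

print(vectors.nearest([0.9, 0.1]))        # semantic lookup -> the debt chunk
print(graph.neighbors("long-term debt"))  # explicit relationship traversal
```

The vector store finds "what sounds like the question"; the graph store finds "what is explicitly connected to it" — the hybrid approach needs both.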

Step 3: Ontology development strategy: To capture the richness and complexity of the financial statement data, we need to build an ontology as our foundation. By building an ontology, we can go beyond just storing data and use the power of LLMs for reasoning and inference. For instance, if our ontology defines the relationship that "current assets minus current liabilities equals working capital," then we can infer new knowledge by automatically calculating working capital when we add data about assets and liabilities.
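The working-capital example can be sketched as a simple rule-based inference step. The fact keys and rule format below are illustrative, not a real ontology language:

```python
# Facts extracted from a statement, keyed by ontology concept.
facts = {"current_assets": 500_000, "current_liabilities": 320_000}

# Each rule: (derived concept, required inputs, formula).
rules = [
    ("working_capital",
     ("current_assets", "current_liabilities"),
     lambda assets, liabilities: assets - liabilities),
]

def infer(facts, rules):
    """Apply each rule whose inputs are all present, adding derived facts."""
    derived = dict(facts)
    for concept, inputs, formula in rules:
        if all(i in derived for i in inputs):
            derived[concept] = formula(*(derived[i] for i in inputs))
    return derived

print(infer(facts, rules)["working_capital"])  # 180000
```

This is the payoff of an ontology: working capital was never stated in the document, yet the system can derive it the moment both inputs are present.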

We are considering using XBRL (eXtensible Business Reporting Language) as our schema for building our ontology. The tags in XBRL can be mapped to the concepts in our ontology, and our ontology can enrich the XBRL data by adding a layer of semantic meaning and relationships. Then, we can add custom extensions like footnote-specific concepts not already tagged in XBRL, as well as temporal (dates, timestamps, durations, etc.) and contextual (location, source, author, conditions, etc.) attributes.
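A mapping layer like this could look as follows. The `us-gaap` tags are real taxonomy elements, but the extension concept and attribute names are hypothetical:

```python
# Core mapping from XBRL taxonomy tags to ontology concepts.
XBRL_TO_ONTOLOGY = {
    "us-gaap:AssetsCurrent": "current_assets",
    "us-gaap:LiabilitiesCurrent": "current_liabilities",
}

# Custom extensions for footnote concepts XBRL does not tag.
CUSTOM_EXTENSIONS = {
    "footnote:LeaseCommitmentNarrative": "lease_commitments",
}

def map_tag(tag, period=None, source=None):
    """Resolve a tag to an ontology concept, attaching temporal and
    contextual attributes at mapping time."""
    concept = XBRL_TO_ONTOLOGY.get(tag) or CUSTOM_EXTENSIONS.get(tag)
    if concept is None:
        raise KeyError(f"no ontology concept for {tag}")
    return {"concept": concept, "period": period, "source": source}

print(map_tag("us-gaap:AssetsCurrent", period="FY2024", source="10-K p.4"))
```

The point of the enrichment step is visible in the return value: the bare tag becomes a concept carrying temporal and contextual attributes that XBRL alone does not provide.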

The question then becomes: how do we map the XBRL concepts to the ontology? We can perform LLM-assisted ontology construction and refinement using an XBRL schema modification process. We would start by using an LLM to analyze the existing XBRL schema and build the first iteration of the ontology. We would then use a "human in the loop" system to edit and iterate, working closely with the LLM to refine various ontology sections. We will manually add the temporal and contextual attributes. Our subject matter experts (i.e., CPAs) will find gaps in our footnote coverage and propose extensions for missing concepts. Finally, we will test and validate our ontology against real financial statement samples.

Step 4: RAG implementation architecture: We will build a multi-stage retrieval pipeline consisting of Query Planning, Retrieval Execution, and Results Processing layers. 

The Query Planning layer, which functions as the "brains" of the RAG pipeline, will ensure that the right information is retrieved to generate the best possible answer. This layer will understand the knowledge base architecture, including the available databases and their underlying models (e.g., ADA-003 as a possible embedding model and Neo4j as a possible graph database). The query processor, which uses the LLM for processing, will analyze the query to understand its intent and key concepts. It will involve techniques like indexing, query rewriting, and query optimization to break the query into its essential components, or it may translate the query into a format suitable for the knowledge source, such as a structured query or a keyword-based search.

The query planner, which also uses the LLM for planning, will decide the most efficient way to find relevant information, which may involve selecting the appropriate knowledge source (e.g., the graph database, the vector database, or even an external API) and choosing the best search algorithm. If the query requires information from multiple sources, the planner will coordinate the retrieval process, ensuring that the results are combined in a meaningful way.
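As a concrete (and deliberately simplified) illustration of the planning decision, here is a rule-based fallback that routes a query to a knowledge source. In the design described above, an LLM would produce this plan instead of keyword rules:

```python
def plan_query(query: str) -> dict:
    """Pick a knowledge source for a query: vector, graph, or hybrid."""
    q = query.lower()
    wants_relations = any(w in q for w in ("related", "across", "compare", "linked"))
    wants_lookup = any(w in q for w in ("what is", "amount", "value", "how much"))
    if wants_relations and wants_lookup:
        source = "hybrid"       # coordinate retrieval from both databases
    elif wants_relations:
        source = "graph"        # relationship traversal
    else:
        source = "vector"       # semantic similarity lookup
    return {"source": source, "query": query}

print(plan_query("How much goodwill is on the balance sheet?")["source"])
print(plan_query("Which footnotes are related to lease liabilities?")["source"])
```

A single factual lookup goes to the vector database, a relationship question goes to the graph, and a query that needs both triggers the coordinated hybrid retrieval described above.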

The Retrieval Execution layer is the system that will actually go out and fetch the relevant information from our knowledge sources. It will take the plan created by the query planner and put it into action to bring back the information needed to answer the query. It will retrieve the most relevant chunks of information, which could be paragraphs from documents in the vector database, nodes or relationships from the graph database, or hybrid retrieval from both databases for more complex queries. The retrieved chunks will be further ranked and filtered based on their relevance to the query (e.g., BM25 keyword-based ranking). The information will then be formatted in a way suitable to feed into the language model.
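For the ranking step, here is a minimal BM25 implementation over already-retrieved chunks. This is a stand-in to show the mechanics; a production system would typically use a library (e.g., rank_bm25) or the search engine's built-in scoring:

```python
import math

def bm25_rank(query, chunks, k1=1.5, b=0.75):
    """Rank chunks against a query with the classic BM25 formula."""
    docs = [c.lower().split() for c in chunks]
    avgdl = sum(len(d) for d in docs) / len(docs)
    n = len(docs)
    scores = []
    for d, chunk in zip(docs, chunks):
        score = 0.0
        for term in query.lower().split():
            df = sum(1 for doc in docs if term in doc)  # document frequency
            if df == 0:
                continue
            idf = math.log((n - df + 0.5) / (df + 0.5) + 1)
            tf = d.count(term)                          # term frequency
            score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(d) / avgdl))
        scores.append((score, chunk))
    return [c for _, c in sorted(scores, key=lambda s: -s[0])]

chunks = [
    "revenue increased due to new contracts",
    "long-term debt consists of senior notes due 2027",
    "the company leases office space",
]
print(bm25_rank("senior notes debt", chunks)[0])
```

BM25 complements embedding similarity: it rewards exact keyword overlap, which catches cases where semantically fuzzy matches would rank a wrong chunk first.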

The Results Processing layer will transform the raw retrieved information into a refined, well-structured, and informative context that will enable the language model to generate the best possible response to the user's query. We can use a wide variety of techniques in this layer.

Cross-reference resolution: We can use cross-reference resolution to automatically identify and link related pieces of information. For instance, if a retrieved chunk mentions an item in the balance sheet, cross-reference resolution could automatically include the relevant details from the corresponding footnote, ensuring that the language model has a complete and coherent picture.
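A sketch of that enrichment step, using a hypothetical footnote index (in the real system these links would come from the knowledge graph built in step 2):

```python
# Hypothetical index linking balance-sheet items to their footnotes.
FOOTNOTE_INDEX = {
    "long-term debt": "Note 7: Senior notes bearing interest at 5.2%, due 2027.",
    "goodwill": "Note 4: Goodwill arose from the 2022 acquisition of ExampleCo.",
}

def resolve_cross_references(chunk: str) -> str:
    """Append footnote details for any indexed item the chunk mentions."""
    extras = [note for item, note in FOOTNOTE_INDEX.items()
              if item in chunk.lower()]
    return chunk if not extras else chunk + "\n" + "\n".join(extras)

enriched = resolve_cross_references("Long-term debt totaled $48.0 million.")
print(enriched)
```

The language model now sees the balance-sheet line and its footnote together, rather than only the chunk that happened to match the query.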

Time-based query understanding: Time-based query understanding can analyze trends and changes over time. For instance, if a user asks about a company's revenue "in the last quarter," temporal alignment would ensure that the system retrieves data from the correct financial reporting period.
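Resolving "last quarter" to a concrete reporting period could look like this. The sketch assumes calendar-year fiscal quarters, which is a simplification — real issuers have varied fiscal calendars:

```python
from datetime import date

def resolve_period(phrase: str, today: date) -> tuple:
    """Map a relative time phrase to a (year, quarter) pair."""
    q = (today.month - 1) // 3 + 1  # current calendar quarter, 1-4
    if phrase == "last quarter":
        year, quarter = (today.year, q - 1) if q > 1 else (today.year - 1, 4)
    elif phrase == "this quarter":
        year, quarter = today.year, q
    else:
        raise ValueError(f"unrecognized phrase: {phrase}")
    return year, quarter

print(resolve_period("last quarter", date(2024, 12, 19)))  # (2024, 3)
```

With the period pinned down, retrieval can filter chunks by the temporal attributes attached during ontology mapping in step 3.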

Knowledge graph-based reranking: We can use knowledge graph-based reranking to allow the system to identify chunks that might not have exact keyword matches but are still semantically related to the query. For example, if the query is about "debt," the system might use the knowledge graph to identify chunks that discuss related concepts like lease liabilities, which may not necessarily be considered debt but do represent future obligations in a manner similar to debt and, therefore, may be incorporated into debt ratios.
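The debt example can be sketched as a graph-aware re-scoring pass. The relationship map here is a toy dict; in the design above these edges would live in the graph database:

```python
# Toy concept-relationship map standing in for graph-database edges.
RELATED = {
    "debt": {"lease liabilities", "senior notes", "interest expense"},
}

def graph_boost(query_concept, chunks, boost=1.0):
    """Re-score chunks: exact concept matches first, then graph-related ones."""
    related = RELATED.get(query_concept, set())
    def score(chunk):
        text = chunk.lower()
        if query_concept in text:
            return 2.0
        if any(r in text for r in related):
            return boost
        return 0.0
    return sorted(chunks, key=score, reverse=True)

chunks = [
    "operating lease liabilities of $12.3 million",
    "revenue recognition policy",
    "total debt outstanding was $48.0 million",
]
print(graph_boost("debt", chunks))
```

Note that the lease-liability chunk, which contains no "debt" keyword at all, still ranks above the irrelevant one — exactly the cross-concept recall that pure keyword matching misses.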

Source document listing: We can use source document listing to link retrieved chunks back to their original source documents (i.e., financial statements), allowing users to verify the information or explore the context further.

Query rewriting: Since the original query may be ambiguous or incomplete, we can use query rewriting to clarify the intent. We can expand the scope to broader concepts to retrieve a wider range of relevant information. 
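A minimal illustration of that expansion, using a fixed synonym table. This is purely for mechanics — the design above would use the LLM to rewrite queries rather than a static lookup:

```python
# Illustrative broader-term table; an LLM would generate these in practice.
BROADER_TERMS = {
    "revenue": ["sales", "turnover", "net revenue"],
    "debt": ["borrowings", "notes payable", "loans"],
}

def expand_query(query: str) -> str:
    """Broaden a query into an OR-expression over related terms."""
    terms = [query]
    for word, synonyms in BROADER_TERMS.items():
        if word in query.lower():
            terms.extend(synonyms)
    return " OR ".join(terms)

print(expand_query("total revenue"))
# total revenue OR sales OR turnover OR net revenue
```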

Step 5: Output layer and evaluation framework: Our last step will be to produce an output layer paired with an evaluation framework. The output layer will include answer generation from the LLM, accompanied by source attribution from the retrieval layer, as discussed in step 4. We will also have a confidence-scoring evaluation system that measures metrics including accuracy, retrieval quality, and user experience, along with validation, including technical validation and benchmarking.
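One way a composite confidence score could be assembled from component metrics. The specific components and weights below are hypothetical, not a validated evaluation methodology:

```python
def confidence_score(retrieval_relevance, answer_groundedness, source_coverage,
                     weights=(0.4, 0.4, 0.2)):
    """Weighted average of three component metrics, each in [0, 1]."""
    components = (retrieval_relevance, answer_groundedness, source_coverage)
    if not all(0.0 <= c <= 1.0 for c in components):
        raise ValueError("metrics must be in [0, 1]")
    return sum(w * c for w, c in zip(weights, components))

# Strong retrieval, mostly grounded answer, full source attribution.
print(round(confidence_score(0.9, 0.8, 1.0), 2))  # 0.88
```

Surfacing a score like this next to each answer lets users calibrate how much to trust it before clicking through to the attributed sources.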

Our five-step framework for building an AI-powered knowledge base for financial statements is a big project with broad applicability. We are building it as part of our software that enables auditors to generate footnote disclosures when preparing financial statements for middle-market clients. However, this knowledge base likely has many investing, finance, and accounting applications, and many other teams could utilize this solution or may be building similar ones. We welcome your feedback if you are working on a similar problem or have thoughts on this design.

Finally, please reach out if you run an accounting firm and would like to turn your financial statement repository into an AI-powered private database to automate the generation of footnotes. Special thanks to Subham Kundu for co-writing this blog post with me!
