richardso21/llm-plus-10k

LLM + 10-K

screenshot

Leveraging Large Language Models (LLMs) to extract and summarize financial information from 10-K filings retrieved from the SEC EDGAR database.

Form 10-K filings are annual financial reports submitted by publicly reporting companies in the U.S., in which crucial information about the corporation's financial status, numbers, and risks is disclosed. They are notoriously long, with some running to over 50,000 words, which makes it tedious to derive meaningful insights from them by hand. LLMs, with their ability to perform information retrieval and summarization, can automate this process and reduce the effort of reading through each filing manually.

Features

  • Retrieve, compile, and visualize key metrics (e.g. net sales, gross margin) across a timespan
    • Default metrics are Net Sales, Gross Margin, and Total Cost of Operations. I picked these because they are commonly discussed across the tickers I've selected, and they are intuitively central to evaluating a company's financial status.
    • Ability to customize metrics that the LLM retrieves from 10-K filings
  • Compare key metrics across three different companies/tickers
  • Generate summaries of important sections of a particular Form 10-K
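
As a rough illustration of how customizable metrics could be requested and then compiled across years for charting, here is a minimal sketch. It is not taken from this repository: the function names, the prompt wording, and the JSON shape (one object mapping each metric name to a number or null) are all assumptions made for the example.

```python
import json

# Default metrics listed above; callers may pass their own list instead.
DEFAULT_METRICS = ("Net Sales", "Gross Margin", "Total Cost of Operations")


def build_metrics_prompt(filing_text, metrics=DEFAULT_METRICS):
    """Build a prompt asking for each metric as a field of one JSON object.

    Hypothetical prompt wording; the real project may phrase this differently.
    """
    wanted = "\n".join(f"- {name}" for name in metrics)
    return (
        "From the Form 10-K below, report each of these metrics:\n"
        f"{wanted}\n"
        "Reply with a single JSON object that maps each metric name to a "
        "number, or to null if the filing does not state it.\n\n"
        f"FORM 10-K:\n{filing_text}"
    )


def parse_metrics_response(raw, metrics=DEFAULT_METRICS):
    """Decode the model's JSON reply, keeping only the requested metrics."""
    data = json.loads(raw)
    # Metrics missing from the reply become None rather than raising.
    return {name: data.get(name) for name in metrics}


def compile_by_year(responses_by_year, metrics=DEFAULT_METRICS):
    """Turn {year: raw JSON reply} into rows sorted by year, ready to plot."""
    rows = []
    for year in sorted(responses_by_year):
        row = {"year": year}
        row.update(parse_metrics_response(responses_by_year[year], metrics))
        rows.append(row)
    return rows
```

Running `compile_by_year` once per ticker yields one list of rows per company, which is a convenient shape for side-by-side comparison charts.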

Tech Stack

Below is a list of libraries/tools I've used heavily in this project:

  • edgartools: One of the most polished and full-featured libraries I've encountered for retrieving 10-K filings from SEC EDGAR. It makes extracting the raw text of each filing easy, and it also works especially well as a CLI tool for debugging and exploration.

  • gemini-1.5-flash-latest: The LLM API of choice for this project. It supports a very generous input context window (up to 1 million tokens), which is ideal for a document as large as a Form 10-K. It can also generate responses in JSON format, which makes the retrieved data much easier to work with for visualization.

  • streamlit: Used for the UI frontend: displaying visualizations, presenting user options, collecting input for LLM API calls, etc. The library was especially intuitive and hassle-free for the most part. I also used Streamlit Community Cloud to host the project site.
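
To show how the first two pieces might fit together, here is a hedged sketch of fetching the latest 10-K text with edgartools and asking Gemini for a JSON reply. It is an illustration under assumptions, not this repository's code: `ask_about_filing` and its prompt are hypothetical glue, the API key is assumed to live in a `GOOGLE_API_KEY` environment variable, and the library calls reflect the `edgartools` and `google-generativeai` packages as I understand them, so check them against the versions you install.

```python
import json
import os

MODEL_NAME = "gemini-1.5-flash-latest"  # model named in the list above


def fetch_latest_10k_text(ticker, identity):
    """Download the most recent 10-K for `ticker` as plain text."""
    from edgar import Company, set_identity  # pip install edgartools

    # SEC EDGAR asks callers to identify themselves, e.g. "Name email".
    set_identity(identity)
    filing = Company(ticker).get_filings(form="10-K").latest()
    return filing.text()


def ask_gemini_for_json(prompt):
    """Send `prompt` to Gemini, requesting a JSON-formatted reply."""
    import google.generativeai as genai  # pip install google-generativeai

    # Assumption: the key is stored in the GOOGLE_API_KEY variable.
    genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
    model = genai.GenerativeModel(
        MODEL_NAME,
        generation_config={"response_mime_type": "application/json"},
    )
    return model.generate_content(prompt).text


def ask_about_filing(ticker, question, fetch_text, ask_llm):
    """Hypothetical glue: fetch a filing, ask one question, decode the JSON.

    `fetch_text` and `ask_llm` are passed in so the network-bound steps can
    be swapped for stubs during testing.
    """
    filing_text = fetch_text(ticker)
    prompt = f"{question}\nAnswer in JSON.\n\nFORM 10-K:\n{filing_text}"
    return json.loads(ask_llm(prompt))
```

A real call could look like `ask_about_filing("AAPL", "Summarize the main risk factors.", lambda t: fetch_latest_10k_text(t, "Jane Doe jane@example.com"), ask_gemini_for_json)`, where the identity string is a placeholder. Passing the two callables as arguments keeps the function usable offline with canned text.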
