If you ever decide to join the suit-wearers, the bankers, the number crunchers, the financial modelers, and the night owls pulling all-nighters, you’ll need to make nice with company disclosures. It’s a foregone conclusion that you’ll spend hours going over the chunky text documents MumboJumbo Industries — or whatever company you’re covering — puts out every quarter and files with the SEC: presentations, press releases, and forms. It’s a lot of text, but you get used to it.
I was there, and I was never good at it. It takes an annoyingly meticulous person to take in all the words, all those pages, and turn them into something insightful. As with any skill, the longer you keep reading these disclosures, the better you get at capturing the right information. After all, it gets repetitive. You learn which sections to focus on and which keywords to look for.
You typically go over the company’s reported numbers, run calculations in your templated Excel sheet, then compile your questions. You don’t “read” the pages — you look for answers.
The questions you ask tend to be: What new risks is MumboJumbo disclosing? What does management attribute the rise in enterprise customers this quarter to? What do they think of the interest-rate cuts, and how will those affect the balance sheet?
You get the idea. But even the most experienced analyst has an error rate. They’ll miss a key insight mentioned in passing or fail to connect two disparate pieces of information. Even the best analysts are only as good as the information they choose to consume and decipher. But what about tangentially related, material information in companies they do not cover? The big shops have addressed that tunnel-vision problem by dedicating teams of data scientists and economists to aggregate information and look at the big picture. This project, though, is about empowering the analyst to do more on their own: running bottom-up and top-down research, and extracting and distilling cross-industry insights.
Can AI help? Sorry to report: yes
Every working professional — or even a “disclosures enthusiast” — has different motives for reading these documents. Some are interested in a company’s financial performance. Others want management’s perspective on macroeconomic events. And some weirdos are counting the number of times “AI” is mentioned because data science told them to. Well, OK, I am one of those weirdos, but keyword counting is table stakes; I’m after more insightful analysis.
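Keyword counting really is table stakes. A minimal sketch of what those weirdos (myself included) do, assuming plain-text filings; the sample text and keyword list are illustrative, not from any real filing:

```python
from collections import Counter
import re

def keyword_counts(filing_text: str, keywords: list[str]) -> Counter:
    """Count case-insensitive, whole-word keyword mentions in a filing."""
    counts = Counter()
    for kw in keywords:
        # \b word boundaries keep "AI" from matching inside "said" or "airline"
        pattern = re.compile(rf"\b{re.escape(kw)}\b", re.IGNORECASE)
        counts[kw] = len(pattern.findall(filing_text))
    return counts

# Illustrative snippet standing in for a filing:
sample = "AI is central to our roadmap. Our AI platform is said to scale."
mentions = keyword_counts(sample, ["AI", "platform"])
```

Note what this can’t tell you: whether “AI” was mentioned as a growth driver, a risk factor, or boilerplate. That’s the context gap the next section is about.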
If you’re an analyst covering tech stocks, you’re probably focused on certain verticals, or maybe only the large-cap names. It’s almost certain you’re not going over the disclosures of the 400+ publicly listed tech and tech-adjacent companies. That’s where the magic of large language models (LLMs) comes in. They excel at synthesizing text at scale.
Decoding text and finding themes
There’s nothing wrong with keyword stats, but they provide no context for how a word is used. Here’s the task at hand: I want LLMs to read hundreds of filings for all the listed tech stocks over the past four quarters and produce a structured summary that highlights what’s discussed, the sentiment, and why it’s being raised. That suits my needs — finding themes in filings — though the structure could be adapted to whatever an analyst cares about.
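One way to pin down that structured summary is to have the model return JSON matching a small schema: theme name, sentiment, and rationale. A hedged sketch, assuming the LLM call itself happens elsewhere; the prompt wording, field names, and sample reply below are my own illustrations, not the original project’s:

```python
import json
from dataclasses import dataclass

@dataclass
class Theme:
    name: str        # what's being discussed
    sentiment: str   # "positive" | "negative" | "neutral"
    rationale: str   # why management is raising it

# Hypothetical prompt; its field names mirror the Theme dataclass so the
# model's JSON reply can be parsed directly.
PROMPT = """You are an equity research assistant. Read the filing below and
return a JSON list of themes. Each theme must have: "name" (the topic),
"sentiment" (positive, negative, or neutral), and "rationale" (why it's raised).

Filing:
{filing_text}
"""

def parse_themes(llm_json: str) -> list[Theme]:
    """Validate the model's JSON reply into typed Theme records."""
    return [Theme(**t) for t in json.loads(llm_json)]

# Illustrative reply an LLM might return for a MumboJumbo-style filing:
reply = ('[{"name": "enterprise customer growth", "sentiment": "positive", '
         '"rationale": "new pricing tiers drove upsell"}]')
themes = parse_themes(reply)
```

Typed records like these are what make hundreds of filings comparable: once every document reduces to the same three fields, aggregation becomes trivial.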
You may wonder if that approach gives the LLM too much freedom to invent its own themes. That’s deliberate. A more traditional approach is to build a lexicon of terms, hand it to a natural language processing library — a much less powerful tool — and ask it to check for the presence of those terms across filings: a classification exercise that places documents in different buckets. But I’m running a discovery exercise to surface hidden patterns I might miss; a fixed lexicon would only reinforce tunnel vision.
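For contrast, the traditional lexicon approach amounts to a few lines: a fixed term list per bucket, and a filing lands in whichever buckets contain at least one matching term. The bucket names and terms below are illustrative placeholders:

```python
# Fixed lexicon: every theme must be anticipated up front, which is
# exactly the tunnel-vision risk described above.
LEXICON = {
    "macro": ["interest rate", "inflation", "recession"],
    "ai": ["artificial intelligence", "machine learning", "llm"],
    "supply_chain": ["logistics", "inventory", "supplier"],
}

def classify(filing_text: str) -> list[str]:
    """Assign a filing to every bucket whose lexicon it mentions."""
    text = filing_text.lower()
    return [bucket for bucket, terms in LEXICON.items()
            if any(term in text for term in terms)]

doc = "Management cited interest rate pressure and new machine learning tooling."
buckets = classify(doc)
```

The limitation is plain: a theme absent from the lexicon simply never surfaces, no matter how often management raises it.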
I built and ran this script over a few hundred filings, and now I have a long list of themes distilled from those documents. I haven’t found hidden patterns yet; to get there, I’ll need to expand the number of filings and extend the time horizon. Surely some interesting patterns will emerge. Until then, I’ll keep running it.
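Distilling that long list is mostly an aggregation problem. A minimal sketch, assuming each filing yields a list of theme names; the normalization (lowercasing) is deliberately crude, and the sample data is made up:

```python
from collections import Counter

def top_themes(per_filing_themes: list[list[str]], n: int = 3) -> list[tuple[str, int]]:
    """Count theme names across filings, normalizing case, and return the top n."""
    counts = Counter(
        name.strip().lower()
        for themes in per_filing_themes
        for name in themes
    )
    return counts.most_common(n)

# Illustrative theme lists from three filings:
filings = [
    ["AI capex", "Enterprise growth"],
    ["ai capex", "FX headwinds"],
    ["Enterprise growth", "AI capex"],
]
ranked = top_themes(filings, 2)  # [('ai capex', 3), ('enterprise growth', 2)]
```

Real theme names from an LLM won’t match this cleanly, so in practice you’d want fuzzier grouping (embeddings or a second LLM pass) before counting.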
You can run a version of this project in your browser here.
