I’m building a knowledge base from scratch. I’ve done that before, but this time there are a few curveballs. A knowledge base can be a collection of books, documents, or files of any kind. Each entry typically has metadata that defines it: for books, that might be “author,” “release date,” or “keywords.” But I don’t have any of that information. I’m building a knowledge base for startups, and all I have is a laundry list of startup names that I somehow managed to scrape off the corners of the internet. That’s a good place to start.
Data providers with knowledge bases on startups – think Crunchbase, PitchBook, et al. – have the structured data: the startup’s website, contact details, funding rounds, latest news, and all the other key information. “Good, then it’s solved, why are you doing this?” you might be wondering. Well, I have a few reasons, but it boils down to this: I’m building a different dataset; a deeper, more relational, and organic one rather than a structural one. But in order to gather any data, I need more than text strings.
Oftentimes, I’ll use the search bar in my browser to get to a company’s website. If I haven’t visited in a while, autocomplete fails and I land on Google.com, where the first result is usually the sought-after website. I’m not alone in this behavior; it turns out a third of Google searches are what are called navigational searches, a term describing users who want to get to Netflix but can’t be bothered to type in “.com.” Sometimes it’s more complicated than that, with users not knowing what the top-level domain is. It could be a “.com,” a “.net,” or, more commonly now, a “.ai.”
And that is my quandary, except it’s not Netflix, it’s thousands of startup names. This would typically be called a data enrichment task, and it is exactly what it sounds like: a table with hundreds or thousands of entries in one column – usually a name – and other blank columns representing data points that are missing. Any dataset needs a starting point for enrichment, and that is usually a URL, the gateway to information stored on the web. Without a URL, I’m left with a laundry list of text strings, devoid of any context.
Exa is a search engine built for the AI world; it can be plugged into an AI workflow to enable web search for accurate AI responses with cited sources. They’ve crawled the web and built their own knowledge base for semantic retrieval. It’s a Google that can be plugged into AI tools and applications. But instead of building some fancy agent, I figured I could just use it to get the official startup websites. And I did.
I created a simple script that issues one search query per company in my database: the company name with “official website” appended. It then looks at the retrieved search results and picks the one whose URL most likely points to the company’s official website. It can run several queries at once, process companies in batches, and run on a schedule.
This process, I’ve found, is the modern equivalent of walking into a library and handing over a list of companies to the librarian to search for all the books where these companies are mentioned. And yes, search engines are an engineering marvel; they search through all the books and surface the answers. You just have to ask a lot, and you have to pay per question asked. So for thousands of companies, the costs will quickly add up. No thank you, I’ll just read the books.
I’m making a version of this script available in a Jupyter notebook, which you can access here. You’ll need to get an API key from Exa.
