Like pitchbook, but open. An open source investor/venture capital database

iloveitaly/openbook

OpenBook: A Public VC Database

Like Pitchbook, but open. You can view the resulting dataset here.

I'm starting a new company, but we haven't picked an idea yet. I thought it would be helpful to have a database of investors with what they are interested in so we can easily reach out to folks within those firms to get feedback on our ideas. VCs have been really helpful in offering thoughtful feedback on ideas really early in the process.

Open source databases have always been interesting to me. There's lots of open source code, but not much open source data. For instance, getting a list of US zip codes isn't easy. It's not clear why this is the case; sure, maintaining a dataset is hard, but it can't be harder than building or maintaining open source software. Is zip code data, or IP to location data, core to your business? Probably not. If you have any thoughts on the mechanics here, I'd love to hear them.

Anyway, the other motivation for this project was playing with a couple of new technologies, and I'm always a sucker for a good learning project:

  • Dolthub
  • Langchain
  • LLMs applied to web scraping
  • pnpm

I ran into rehype as part of this, which is a neat parsing pipeline. Reminds me of html-pipeline.

Goals

  • Completely open source database of venture capital firms
  • Make it easy for people to contribute to the database
  • Database of firms with some helpful metadata about those companies that can be easily queried
  • Database of investors within those firms with as much publicly available contact information sourced as possible

Development

Pull the dolthub database:

dolt clone iloveitaly/venture_capital_firms .

Run a local Dolt SQL server (probably in a separate tab):

dolt sql-server

Install node modules:

pnpm install

Set up secrets:

cp .env.example .env
direnv allow .

Now you should be ready to roll:

tsx commands.ts --help

Contributing

Scrape a specific company:

tsx commands.ts scrape-companies --url 'a16z.com'
tsx commands.ts scrape-team --url 'a16z.com'

Scrape a random set:

tsx commands.ts scrape-companies --limit 5
tsx commands.ts scrape-team --limit 5

Scrape a random set every day:

watch -n 86400 '
tsx commands.ts scrape-companies --limit 2 2>&1 | tee -a ./scrape-companies.log;
tsx commands.ts scrape-team --limit 2 2>&1 | tee -a ./scrape-team.log;
'

Create a dolthub PR.

Usage

Usage: commands [options] [command]

Options:
  -h, --help                  display help for command

Commands:
  scrape-companies [options]  scrape company sites and categorize pages
  scrape-team [options]       scrape team command
  help [command]              display help for command

Development

Debugging

One of my biggest gripes with the node ecosystem is the terrible debugging tools. As part of this project, I've been iterating on better-node-inspect:

better-node-inspect --loader tsx ./commands.ts scrape-team

# or, with a specific site
better-node-inspect --loader tsx ./commands.ts scrape-team --url playfair.vc

If you want to hack on better-node-inspect (it's still early in development) you can link the local package with:

pnpm link ~/Projects/javascript/better-node-repl

Queries

Some helpful SQL queries (TablePlus is great for working with these):

-- Clear all scrape results and timestamps. There is no WHERE clause, so this touches every firm.
UPDATE venture_capital_firms
SET scrape_categorization = NULL, team_members = NULL, normalized_url = NULL,
    scrape_team_members_at = NULL, scrape_categorization_at = NULL;

Create a view that flattens the JSON team data into one row per team member:

CREATE VIEW `people` AS
SELECT
  `name`,
  `normalized_url`,
  JSON_UNQUOTE(JSON_EXTRACT(`team_member`, '$.name')) AS 'person_name',
  JSON_UNQUOTE(JSON_EXTRACT(`team_member`, '$.role')) AS 'person_role',
  JSON_UNQUOTE(JSON_EXTRACT(`team_member`, '$.roleDescription')) AS 'person_role_description',
  JSON_UNQUOTE(JSON_EXTRACT(`team_member`, '$.email')) AS 'person_email',
  JSON_UNQUOTE(JSON_EXTRACT(`team_member`, '$.twitter')) AS 'person_twitter',
  JSON_UNQUOTE(JSON_EXTRACT(`team_member`, '$.linkedin')) AS 'person_linkedin'
FROM
  `venture_capital_firms`,
  JSON_TABLE(
    `team_members`,
    '$[*]' COLUMNS(
      `team_member` JSON PATH '$'
    )
  ) AS `team_members_table`;
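
If it helps to see the flattening outside of SQL, here is a rough TypeScript sketch of what the view does for well-formed data: one output row per element of the `team_members` array, with missing keys becoming null. The `FirmRow` and `PersonRow` types and the `flattenTeamMembers` function are hypothetical names made up for this example, not code from this repo, and only a few of the view's columns are included.

```typescript
// Hypothetical row shape, loosely mirroring the columns the `people` view reads.
interface FirmRow {
  name: string;
  normalized_url: string | null;
  // JSON text holding an array of team member objects, or null if not scraped yet.
  team_members: string | null;
}

// Hypothetical output shape: a subset of the columns the view exposes.
interface PersonRow {
  name: string;
  normalized_url: string | null;
  person_name: string | null;
  person_role: string | null;
  person_email: string | null;
}

// Return obj[key] when it is a string, otherwise null.
function stringField(obj: Record<string, unknown>, key: string): string | null {
  const value = obj[key];
  return typeof value === "string" ? value : null;
}

// Emit one PersonRow per element of each firm's team_members array.
function flattenTeamMembers(firms: FirmRow[]): PersonRow[] {
  const people: PersonRow[] = [];
  for (const firm of firms) {
    // Firms without scraped team data contribute no rows.
    if (firm.team_members === null) continue;
    const parsed: unknown = JSON.parse(firm.team_members);
    if (!Array.isArray(parsed)) continue;
    for (const member of parsed) {
      // This sketch skips array elements that are not objects.
      if (typeof member !== "object" || member === null) continue;
      const m = member as Record<string, unknown>;
      people.push({
        name: firm.name,
        normalized_url: firm.normalized_url,
        person_name: stringField(m, "name"),
        person_role: stringField(m, "role"),
        person_email: stringField(m, "email"),
      });
    }
  }
  return people;
}
```

Calling `flattenTeamMembers` with one firm that has two team members and one firm whose `team_members` is null returns two rows, both carrying the first firm's `name` and `normalized_url`.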

Design Decisions

  • Some sites put a query string on the team page URL that only determines which person is displayed first. In this case, a lot of extra text gets processed, but there's no easy way to determine what a query string does, so we leave it alone.
  • JSON columns holding arrays of JSON blobs instead of separate tables. I know this is probably a bad idea, but it keeps some aspects of the project easier for now. The plan is to eventually normalize the team data attached to companies into its own table.
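
To make the query string decision concrete, here is a minimal sketch of a URL normalizer that tidies the scheme, host, and trailing slash but passes the query string through untouched. `normalizeUrl` is a hypothetical helper written for this example; how the repo actually fills `normalized_url` may differ, and the specific rules (defaulting to https, dropping `www.`) are assumptions.

```typescript
// Hypothetical helper: NOT the repo's implementation, just an illustration of
// normalizing a URL while leaving the query string alone.
function normalizeUrl(input: string): string {
  // Assumption: bare domains like "a16z.com" should default to https.
  const hasScheme = /^[a-z][a-z0-9+.-]*:\/\//i.test(input);
  const url = new URL(hasScheme ? input : `https://${input}`);

  // The URL parser already lowercases the host; dropping "www." is an assumption.
  const host = url.host.replace(/^www\./, "");

  // Drop trailing slashes so "/team/" and "/team" compare equal.
  const path = url.pathname.replace(/\/+$/, "");

  // The query string (url.search) is kept exactly as parsed, since we cannot
  // tell whether it changes the page content or only its ordering.
  return `${url.protocol}//${host}${path}${url.search}`;
}
```

With these rules, `normalizeUrl("a16z.com")` gives `https://a16z.com`, and a team page URL such as `https://www.example.vc/team/?person=jane` keeps its `?person=jane` suffix.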

Scraping with GPT

This repo has some interesting code which uses langchain + openai to return JSON results from a webpage by converting the raw HTML to simplified markdown. It's amazing how well this works; so many interesting opportunities here. Some more of my thoughts on this.
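
The general shape of the idea, as a dependency-free sketch: strip the HTML down to something markdown-like, then wrap it in a prompt that asks for JSON. This is not the code in this repo (which uses langchain + openai, with rehype for parsing); the regex-based `htmlToSimplifiedMarkdown` and the `buildTeamPrompt` wording are hypothetical stand-ins, and the actual LLM call is left out.

```typescript
// Crude, regex-based HTML simplifier. A real parser handles far more cases;
// this only illustrates the "shrink the page before prompting" step.
function htmlToSimplifiedMarkdown(html: string): string {
  return html
    // Drop scripts, styles, and noscript blocks along with their contents.
    .replace(/<(script|style|noscript)\b[^>]*>[\s\S]*?<\/\1>/gi, "")
    // Drop HTML comments.
    .replace(/<!--[\s\S]*?-->/g, "")
    // Turn <h1>..<h6> into markdown headings.
    .replace(
      /<h([1-6])\b[^>]*>([\s\S]*?)<\/h\1>/gi,
      (_match: string, level: string, text: string) =>
        `\n${"#".repeat(Number(level))} ${text}\n`,
    )
    // Turn links with a double-quoted href into [text](href).
    .replace(/<a\b[^>]*\bhref="([^"]*)"[^>]*>([\s\S]*?)<\/a>/gi, "[$2]($1)")
    // Start each list item on its own line.
    .replace(/<li\b[^>]*>/gi, "\n- ")
    // Treat closing block tags and <br> as line breaks.
    .replace(/<\/(p|div|li|ul|ol|section|tr)>|<br\s*\/?>/gi, "\n")
    // Strip whatever tags are left.
    .replace(/<[^>]+>/g, "")
    // Collapse runs of spaces/tabs and blank lines.
    .replace(/[ \t]+/g, " ")
    .replace(/ *\n\s*/g, "\n")
    .trim();
}

// Hypothetical prompt wording; the keys match a subset of those used in the
// `people` view (name, role, email).
function buildTeamPrompt(markdown: string): string {
  return [
    "Extract the team members from the page below.",
    'Respond with only a JSON array of objects with the keys "name", "role", and "email".',
    "Use null for any value that is not on the page.",
    "",
    markdown,
  ].join("\n");
}
```

The resulting prompt string would then be sent to whichever model you're using, and the response parsed with `JSON.parse`. The simplification step matters mostly because it cuts the token count and removes markup noise before the model sees the page.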

Existing Lists & SaaS

Here is a list of venture capital firms, and some paid services that provide these lists. I think there's a good opportunity to disrupt traditional data brokers using scraping + LLMs, but over the long term some larger provider will just support real-time web data ingestion and make any LLM-powered scraping DB obsolete.

Paid
