When I was putting together the pitch deck for our startup, I wanted to search for slides to learn from, but I was looking for specific sections or specific types of startups. I had to open tens of decks and scroll through them, which sucked. So I decided to make a tool that would let me search inside the decks more easily. Happy to answer questions.
Nice project for finding pitch deck references. Thanks for building and sharing it. I am curious about the tech behind it - are you doing OCR on images? The search is very responsive - it's definitely not Elasticsearch. What index/search system are you using?
Glad it helps! There are 4 key steps that I took:
- Upscaling (using Upscayl[0])
- OCR (using tesseract[1])
- Indexing (using Algolia[2])
- Scaling the processing and running on AWS (Klotho[3] - our startup)
I mentioned this in a separate comment: the source images for some of the slides are too low-resolution for the upscaling algorithm to recognize or improve the text, so it comes out all mangled.
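For anyone curious how the steps above fit together, here is a minimal sketch of the upscale -> OCR -> index flow. It is not the actual code behind the site: the `upscale` and `ocr` callables are hypothetical stand-ins for Upscayl and Tesseract, and a small in-memory inverted index stands in for Algolia so the example runs on its own.

```python
import re
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Set, Tuple

# Hypothetical shape for a slide: (slide_id, raw image bytes).
Slide = Tuple[str, bytes]


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(
    slides: Iterable[Slide],
    upscale: Callable[[bytes], bytes],
    ocr: Callable[[bytes], str],
) -> Dict[str, Set[str]]:
    """Upscale each slide, OCR it, and map every token to the slides containing it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for slide_id, image in slides:
        text = ocr(upscale(image))  # OCR runs on the upscaled image
        for token in tokenize(text):
            index[token].add(slide_id)
    return index


def search(index: Dict[str, Set[str]], query: str) -> Set[str]:
    """Return the ids of slides that contain every token in the query."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    hits = [index.get(token, set()) for token in tokens]
    return set.intersection(*hits)
```

In a real setup the `ocr` callable would wrap an actual OCR engine and the index would live in a hosted search service. The point of the sketch is only that search can never find a word the OCR step failed to read.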
Wow, very cool! I’m building almost the exact same thing but for public company investor relations decks as a side project. My use case stemmed from building decks in investment banking, very similar to yours.
Let me make a suggestion: paginate, and don't serve full-resolution images that are merely scaled down to thumbnail size in the browser. I was able to keep scrolling and then collect ~1,000 full-size slides just by pressing Command+S.
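A rough sketch of what I mean, assuming the server keeps pre-rendered small thumbnails. The `/thumbs/...` URL scheme and the page size of 24 are made up for illustration:

```python
from typing import Dict, List

PAGE_SIZE = 24  # assumed page size, tune to taste


def thumbnail_url(slide_id: str) -> str:
    """Hypothetical URL scheme pointing at a pre-rendered small image."""
    return f"/thumbs/{slide_id}.jpg"


def get_page(slide_ids: List[str], page: int, page_size: int = PAGE_SIZE) -> Dict[str, object]:
    """Return one page of thumbnail URLs plus the next page number, if any."""
    if page < 1:
        raise ValueError("page numbers start at 1")
    start = (page - 1) * page_size
    chunk = slide_ids[start:start + page_size]
    has_more = start + page_size < len(slide_ids)
    return {
        "thumbnails": [thumbnail_url(s) for s in chunk],
        "next_page": page + 1 if has_more else None,
    }
```

The client only ever receives small images for one page at a time, so saving the page no longer hands over the whole corpus at full resolution.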
This is cool. I was in a similar position when I was going to try to raise some money for a potential product (which I didn’t end up doing…). I was thinking about putting something like this together for fun out of the hundred or so decks I had downloaded and found online, but wasn’t sure how to go about requesting permission from all the deck creators, or even how to find them. So I didn’t go through with it.
The fact that you were able to get permission from all these people, with an order of magnitude more decks than I had, is astounding! Kudos. Do you mind if I ask what the secret sauce was for getting all these deck authors to agree to let you use their decks on your site?
Maybe think of the most likely target audience for a corpus of startup pitch decks?
This little instance of not-asking-permission seems very minor, compared to the ruthless exploitation of people on which some of the most lucrative startups are predicated. Perhaps laid out in some of these very same decks.
Someone uninterested in becoming the next exploiter could do ethical analysis on this corpus.
The slides are already tagged with the deck they're associated with - just gotta implement that feature. We'll likely open-source the GUI so folks can add features to it.
The search does not seem to be returning what you're searching for. For example, searching for "IPFS" returns nothing that mentions IPFS or related tech. I guess this is a fake-it-till-you-make-it POC?
Not at all - the search is only as good as the OCR. I suspect Tesseract has a harder time recognizing "IPFS" because it's not a dictionary-like word. Try "congratulations". You'll also notice that the detected word may be inside a screenshot or in a smaller font than you were expecting.
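If the OCR is misreading individual characters in acronyms like this, one possible mitigation (not something the site does today) is to fold commonly confused characters together before comparing the query against OCR tokens. The confusion table below is an illustrative assumption, not Tesseract's actual error model:

```python
from typing import List

# Characters that OCR output often mixes up, folded to one representative.
# Illustrative assumption only, not derived from any real OCR error data.
CONFUSABLE = str.maketrans({"0": "o", "1": "l", "i": "l", "5": "s", "8": "b"})


def ocr_key(word: str) -> str:
    """Lowercase the word and collapse look-alike characters."""
    return word.lower().translate(CONFUSABLE)


def fuzzy_hits(query: str, ocr_tokens: List[str]) -> List[str]:
    """Return the OCR tokens whose folded form matches the folded query."""
    key = ocr_key(query)
    return [token for token in ocr_tokens if ocr_key(token) == key]
```

With this, a query for "IPFS" would also match a token read as "1PF5". The trade-off is more false positives, since genuinely different words that differ only in those characters collapse to the same key.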
That's a peculiar search term. What kind of business would you envision being based on a public file distribution protocol? NFT scams don't count as a business.