Show HN: SingleAPI – Convert the Internet into your own API (singleapi.co)
7 points by semanser on Oct 17, 2023 | 6 comments
Hi! I created a universal data API that uses headless browsers and GPT to extract any data from the web in JSON format. I started this project because I needed an API for data enrichment, specifically to get company data (headcount, investment rounds, etc.). Once I built the first version, I quickly realized that there are many use cases for such a tool: data enrichment, web scraping, data validation, etc.



This is pretty cool: it is able to parse data out of a random pricing table somewhere on the page. It does seem to just make up data if it is not found on the page (probably expected with LLMs). I wonder if you can reduce that with some prompting, or maybe verify the data is actually present? Your schema docs page is broken: https://singleapi.co/docs/schema


Fixed the link: https://singleapi.co/docs/getting-data/ (the docs/schema link was the incorrect one). Thanks for that!

Yes, it's able to parse data out of a random pricing table somewhere on the page. Here is an exact example of how to do that: https://singleapi.co/docs/examples/scraping-pricing/

Made-up data is a pretty common issue that I still have to address; ideally, it should just return empty fields for data that it couldn't find on the page.
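One way to approach the "verify the data is actually present" idea is a post-check against the page text. This is only a minimal sketch under stated assumptions: it is not how SingleAPI works, the names `drop_ungrounded` and `_normalize` are hypothetical, and the check is a crude substring match (so "$9" would pass on a page that only contains "$99").

```python
import re


def _normalize(text: str) -> str:
    """Lowercase and collapse whitespace so formatting differences don't matter."""
    return re.sub(r"\s+", " ", text).strip().lower()


def drop_ungrounded(extracted: dict, page_text: str) -> dict:
    """Keep a field only if its value appears verbatim in the page text.

    Hypothetical helper: values that cannot be found are replaced with None.
    """
    haystack = _normalize(page_text)
    checked = {}
    for field, value in extracted.items():
        needle = _normalize(str(value)) if value is not None else ""
        checked[field] = value if needle and needle in haystack else None
    return checked


# Example: the model returned one real price and one invented price.
page = "Pro plan   $29 / month. Team plan $99 / month."
result = drop_ungrounded({"pro_price": "$29", "enterprise_price": "$499"}, page)
print(result)  # {'pro_price': '$29', 'enterprise_price': None}
```

A check like this only catches values that never appear on the page. It cannot tell whether a value that does appear was attached to the right field.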


I also published a simplified version on GitHub, so you can try to self-host it. I'm really excited to see all the possible use cases for such a tool besides web scraping or data enrichment.

https://github.com/semanser/JsonGenius


What role does GPT play in this?


After retrieving all the text data from the webpage (using a headless browser), GPT is used to filter out the noise and extract the information specified in the request schema. Let's say you request {"product_name": "string"}. GPT will retrieve that product name from the webpage and return correctly formatted JSON with the fields you requested.

It works much like GraphQL: you define the schema you want, and the backend returns exactly the data you requested. In this case, though, the data comes from the webpage that you provided.
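The schema-in, JSON-out flow described here could be sketched roughly as follows. This is an illustration under assumptions, not the actual SingleAPI implementation: `build_prompt`, `extract`, and `fake_model` are hypothetical names, the prompt wording is made up, and the model call is passed in as a plain function so no real GPT client API is implied.

```python
import json
from typing import Callable


def build_prompt(schema: dict, page_text: str) -> str:
    """Combine the requested schema and the scraped page text into one prompt."""
    return (
        "Extract the fields described by this schema from the page text.\n"
        "Reply with JSON only. Use null for anything not present on the page.\n"
        f"Schema: {json.dumps(schema)}\n"
        f"Page text:\n{page_text}"
    )


def extract(schema: dict, page_text: str, complete: Callable[[str], str]) -> dict:
    """Ask the model for JSON and return exactly the keys named in the schema."""
    raw = complete(build_prompt(schema, page_text))
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        data = {}
    if not isinstance(data, dict):
        data = {}
    # Unrequested keys are dropped; requested keys the model omitted become None.
    return {key: data.get(key) for key in schema}


def fake_model(prompt: str) -> str:
    """Stand-in for a real model call (assumption, for illustration only)."""
    return '{"product_name": "Acme Widget", "unrequested": "ignored"}'


record = extract(
    {"product_name": "string", "price": "string"},
    "Acme Widget, the best widget.",
    fake_model,
)
print(record)  # {'product_name': 'Acme Widget', 'price': None}
```

Filtering the response down to the schema's keys is what gives the GraphQL-like feel: the caller gets back the shape it asked for, even when the model adds or omits fields.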


Can the results really be trusted then? Isn't it possible for GPT to make something up that doesn't exist on the site?



