Hi! I created a universal data API that uses headless browsers and GPT to extract any data from the web in JSON format. I started this project because I needed an API for data enrichment: pulling company data (headcount, investment rounds, etc.). Once I built the first version, I quickly realized there are many use cases for such a tool: data enrichment, web scraping, data validation, etc.
This is pretty cool: it can parse data out of a random pricing table somewhere on the page.
It does seem to just make up data if it isn't found on the page (probably expected with LLMs). I wonder if you can reduce that with some prompting, or maybe verify that the data is actually present? Something like the sketch below, maybe.
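A naive post-hoc check, assuming you already have the rendered page text and the model's extracted dict on hand:

```python
def drop_unverified(extracted: dict, page_text: str) -> dict:
    """Null out any string value that doesn't literally appear in the page,
    since it was likely made up by the model."""
    normalized = page_text.lower()
    checked = {}
    for key, value in extracted.items():
        if isinstance(value, str) and value.lower() not in normalized:
            checked[key] = None  # not found verbatim, treat as hallucinated
        else:
            checked[key] = value
    return checked
```

A substring match is crude (it misses reformatted numbers or dates), but it at least catches values that appear from nowhere.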
Your schema docs page is broken: https://singleapi.co/docs/schema
Hallucination is a pretty common issue that I still have to address; ideally, it should just return empty fields for any data it couldn't find on the page.
I also published a simplified version on GitHub, so you can try to self-host it. I'm really excited to see all the possible use cases for such a tool besides web scraping or data enrichment.
After retrieving all the text data from the webpage (using a headless browser), GPT is used to filter out the noise and extract the actual information described in the request schema. Say you request {"product_name": "string"}. GPT will pull that product name from the webpage and return correctly formatted JSON with the fields you requested.
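The service's internals aren't public, so here's only a minimal sketch of that flow, assuming Playwright for the headless browser and the OpenAI client for extraction; all names are illustrative:

```python
import json
from playwright.sync_api import sync_playwright
from openai import OpenAI

def extract(url: str, schema: dict) -> dict:
    # 1. Render the page in a headless browser and grab its visible text.
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url)
        text = page.inner_text("body")
        browser.close()

    # 2. Ask the model to fill the schema from the page text only,
    #    returning null for anything it can't find.
    client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": (
                "Extract the fields described by this JSON schema from the "
                "page text. Use null for any field not present in the text. "
                f"Schema: {json.dumps(schema)}"
            )},
            {"role": "user", "content": text},
        ],
    )
    return json.loads(response.choices[0].message.content)

print(extract("https://example.com/products/42", {"product_name": "string"}))
```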
It works much like GraphQL: you define the schema you want, and the backend returns exactly the data you requested. In this case, though, the data comes from the webpage you provided.
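For illustration, a call might look something like this; the endpoint URL and payload shape here are assumptions, not the documented API:

```python
import requests

# Hypothetical endpoint and payload shape, for illustration only.
response = requests.post(
    "https://api.singleapi.co/extract",  # assumed URL
    json={
        "url": "https://example.com/products/42",
        "schema": {"product_name": "string", "price": "string"},
    },
)
print(response.json())
# Expected shape: {"product_name": "...", "price": "..."}
```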