Hey HN! My brother Arctic_fly and I spent the last two weeks since the GPT-4 launch building Taxy, an open source Chrome extension that lets you automate arbitrary tasks in your browser using GPT-4. You can see a few demos in the GitHub README, but basically it works like this:
1. You open the extension and write the task you'd like done (e.g., "schedule a meeting with David tomorrow at 2").
2. Taxy pulls the DOM of the current page, puts it through a pipeline to remove all non-semantic information, hidden elements, etc., and sends it to GPT-4 along with your text instructions.
3. GPT-4 tries to figure out what action to take. In our prompt we give it the option to either click an element or set an input's value. We use the ReAct paradigm (https://arxiv.org/abs/2210.03629) so it explains what it's trying to do before taking an action, which both makes it more accurate and helps with debugging.
4. Taxy parses GPT-4's response and performs the action requested on the page. It then goes back to step (2) and asks GPT-4 for the next action to take with the updated page DOM. It also sends the list of actions already taken as part of the current task so GPT-4 can detect if it's getting stuck in a loop and abort. :)
5. Once GPT-4 has decided the task is done or it can't make any more progress, it responds with a special action indicating it's done.
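For anyone who wants the loop in steps 2-5 spelled out, here is a rough TypeScript sketch. It is not Taxy's actual code: the `<Thought>`/`<Action>` reply format, the `Page` and `Model` interfaces, and the `maxSteps` safety cap are all assumptions made up for illustration.

```typescript
// Hypothetical action shape; the real prompt format may differ.
type Action =
  | { kind: "click"; elementId: number }
  | { kind: "setValue"; elementId: number; value: string }
  | { kind: "finish" }
  | { kind: "fail" };

interface Step {
  thought: string; // the ReAct-style explanation given before the action
  action: Action;
}

// Parse a reply of the assumed form:
//   <Thought>...</Thought><Action>click(12)</Action>
function parseResponse(text: string): Step | null {
  const thoughtMatch = /<Thought>([\s\S]*?)<\/Thought>/.exec(text);
  const actionMatch = /<Action>([\s\S]*?)<\/Action>/.exec(text);
  if (!thoughtMatch || !actionMatch) return null;
  const thought = thoughtMatch[1].trim();
  const call = actionMatch[1].trim();

  let m: RegExpExecArray | null;
  if ((m = /^click\((\d+)\)$/.exec(call))) {
    return { thought, action: { kind: "click", elementId: Number(m[1]) } };
  }
  if ((m = /^setValue\((\d+),\s*"([^"]*)"\)$/.exec(call))) {
    return { thought, action: { kind: "setValue", elementId: Number(m[1]), value: m[2] } };
  }
  if (call === "finish()") return { thought, action: { kind: "finish" } };
  if (call === "fail()") return { thought, action: { kind: "fail" } };
  return null; // unknown action: treat as unparseable
}

// Stand-ins for the browser page and the GPT-4 call (both hypothetical).
interface Page {
  simplifiedDom(): string;
  perform(action: Action): void;
}
type Model = (dom: string, task: string, history: Step[]) => Promise<string>;

interface TaskResult {
  status: "done" | "failed" | "aborted";
  history: Step[];
}

async function runTask(
  task: string,
  page: Page,
  model: Model,
  maxSteps = 20, // assumed hard cap, in addition to the model's own loop detection
): Promise<TaskResult> {
  const history: Step[] = [];
  for (let i = 0; i < maxSteps; i++) {
    // Re-read the page each round so the model sees the updated DOM,
    // and pass the history so it can notice when it is stuck.
    const reply = await model(page.simplifiedDom(), task, history);
    const step = parseResponse(reply);
    if (!step) return { status: "aborted", history };
    history.push(step);
    if (step.action.kind === "finish") return { status: "done", history };
    if (step.action.kind === "fail") return { status: "failed", history };
    page.perform(step.action);
  }
  return { status: "aborted", history };
}
```

In this sketch the model is just a function, so you can plug in canned replies to exercise the loop without touching a browser or an API key.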
Right now there are a lot of limitations, and this is more a "research preview" than a finished product. That said, I've found it surprisingly capable for a number of tasks, and I think it's in a stable enough place that we can share it. Happy to answer any questions!
Very cool! I'm curious if you tried having it interactively query the DOM, rather than sending a simplified DOM? I was toying with a similar idea but was thinking it could just look iteratively for all buttons/text fields on the page.
Not yet, but that's definitely a direction I'm excited to explore. At a minimum we're planning on shipping a "viewport" concept and giving GPT the ability to scroll up/down so it doesn't have to load the whole context at once. There are more ideas I'm excited to experiment with like auto-collapsing long lists of similar elements or long chunks of text, and then inserting a special tag to let GPT-4 know there's more content there that it can unfold if necessary.
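To make the auto-collapsing idea concrete, here is a small TypeScript sketch of one way it could work on an already-simplified tree. This is speculative: the `SimpleNode` type, the `collapsed` placeholder tag, and the `keep` threshold are assumptions for illustration, and "similar" is reduced here to "consecutive siblings with the same tag".

```typescript
// Hypothetical simplified-DOM node; not Taxy's real data structure.
interface SimpleNode {
  tag: string;
  text: string;
  children: SimpleNode[];
}

// Keep the first `keep` nodes of every run of same-tag siblings and
// replace the rest with one placeholder the model could ask to unfold.
function collapse(node: SimpleNode, keep = 3): SimpleNode {
  const out: SimpleNode[] = [];
  let i = 0;
  while (i < node.children.length) {
    const tag = node.children[i].tag;
    let j = i;
    while (j < node.children.length && node.children[j].tag === tag) j++;
    const run = node.children.slice(i, j);

    // Recurse into the siblings we keep so nested lists are collapsed too.
    for (const child of run.slice(0, keep)) out.push(collapse(child, keep));

    if (run.length > keep) {
      out.push({
        tag: "collapsed",
        text: `${run.length - keep} more <${tag}> elements`,
        children: [],
      });
    }
    i = j;
  }
  return { tag: node.tag, text: node.text, children: out };
}
```

A real version would probably need a smarter similarity check (class names, structure) and a stable ID on the placeholder so an "unfold" action can refer back to it.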
XPCOM or the WebExtension API makes it even easier to run arbitrary apps than using a speculative execution or bithammer-type attack to NOP-sled into shellcode.