Hey HN! My brother Arctic_fly and I spent the last two weeks since the GPT-4 launch building Taxy, an open source Chrome extension that lets you automate arbitrary tasks in your browser using GPT-4. You can see a few demos in the GitHub README, but basically it works like this:
1. You open the extension and write the task you'd like done (eg. "schedule a meeting with David tomorrow at 2").
2. Taxy pulls the DOM of the current page, puts it through a pipeline to remove all non-semantic information, hidden elements, etc and sends it to GPT-4 along with your text instructions.
3. GPT-4 tries to figure out what action to take. In our prompt we give it the option to either click an element or set an input's value. We use the ReAct paradigm (https://arxiv.org/abs/2210.03629) so it explains what it's trying to do before taking an action, which both makes it more accurate and helps with debugging.
4. Taxy parses GPT-4's response and performs the action requested on the page. It then goes back to step (2) and asks GPT-4 for the next action to take with the updated page DOM. It also sends the list of actions already taken as part of the current task so GPT-4 can detect if it's getting stuck in a loop and abort. :)
5. Once GPT-4 has decided the task is done or it can't make any more progress, it responds with a special action indicating it's done.
Right now there are a lot of limitations, and this is more a "research preview" than a finished product. That said, I've found it surprisingly capable for a number of tasks, and I think it's in a stable enough place we can share. Happy to answer any questions!
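If it helps to picture the loop, here's a rough TypeScript sketch of steps 2-5; simplifyDom, askModel, and performAction are illustrative stubs, not our actual code:

  // Illustrative sketch of the observe-act loop; helper names are made up.
  type Action =
    | { kind: 'click'; elementId: number }
    | { kind: 'setValue'; elementId: number; value: string }
    | { kind: 'finish'; success: boolean };

  declare function simplifyDom(root: HTMLElement): string;        // step 2: strip non-semantic/hidden nodes
  declare function askModel(input: {
    instructions: string;
    dom: string;
    history: Action[];
  }): Promise<Action>;                                             // step 3: GPT call with ReAct-style prompt
  declare function performAction(action: Action): Promise<void>;  // step 4: click or set a value on the page

  async function runTask(instructions: string): Promise<boolean> {
    const history: Action[] = [];
    for (let i = 0; i < 50; i++) {                 // hard cap so a confused run can't loop forever
      const dom = simplifyDom(document.documentElement);
      const action = await askModel({ instructions, dom, history });
      if (action.kind === 'finish') return action.success;        // step 5: model says it's done or stuck
      await performAction(action);
      history.push(action);                        // prior actions help the model notice loops
    }
    return false;
  }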
Very cool! I'm curious if you tried having it interactively query the DOM, rather than sending a simplified DOM? I was toying with a similar idea but was thinking it could just look iteratively for all buttons/text fields on the page.
Not yet, but that's definitely a direction I'm excited to explore. At a minimum we're planning on shipping a "viewport" concept and giving GPT the ability to scroll up/down so it doesn't have to load the whole context at once. There are more ideas I'm excited to experiment with like auto-collapsing long lists of similar elements or long chunks of text, and then inserting a special tag to let GPT-4 know there's more content there that it can unfold if necessary.
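Purely as a sketch of the auto-collapsing idea (not something Taxy does today), the simplification pass could drop repeated siblings and leave a marker tag the model can later ask to unfold:

  // Illustrative only: collapse a long run of same-shaped siblings into a marker element.
  function collapseRepeatedChildren(parent: Element, keep = 3): void {
    const children = Array.from(parent.children);
    if (children.length <= keep) return;
    const signature = (el: Element) => `${el.tagName}.${el.className}`;
    const allSimilar = children.every((c) => signature(c) === signature(children[0]));
    if (!allSimilar) return;                       // only collapse homogeneous lists
    for (const extra of children.slice(keep)) extra.remove();
    const marker = document.createElement('collapsed');   // made-up tag the prompt would explain
    marker.setAttribute('hidden-count', String(children.length - keep));
    parent.appendChild(marker);
  }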
XPCOM or the WebExtension API makes it even easier to run arbitrary apps than just using a speculative execution or bit-hammer type attack to nop-sled into shellcode.
This project demonstrates the death of the UI. UIs were created for humans to interact with software, not for bots to perform tasks. If all you need to provide is your text request, then we don't need UIs. All software can just expose REST/text interfaces, and an LLM can perform the task for you.
But certain countries will build a culture and infrastructure that completely revolves around the car and results in the average citizen walking only 3,000 steps per day [0].
Interfaces will exist because sometimes we need them (walking between your car and the actual destination) and sometimes we like them (just like some people like walking), but just like walking, they may lose massive share.
This is true today, but will it be true in the future where language models are a common interface between human language (probably voice) and software actions?
When I go to my bank's website, it's easy for me to find my tax forms. They're usually one or two clicks away in some prominent top-level navigation component. But I would rather converse: "Download tax forms"/"Which ones: A, B, or C"/"The first two"/"Here ya go".
And in the future I'd rather say "Hey computer, download my tax forms from my bank and attach them in a draft email to <tax preparer>".
And in the later future I'd rather say "Hey computer, do my taxes", and it will know what sources to gather info from and how to make sense of the numbers and how to file on my behalf. Or better yet, the machine will anticipate I need my taxes done and will kick off that process and solicit approvals and information from me. (Maybe by then the government will just tell me how much I owe ;))
Power users will likely sometimes find utility in dropping down "close to the metal" -- i.e., pointing and clicking and typing. But power users will mostly be composing workflows of models working together to achieve tasks (the term "scripts" will fit nicely). Yes this will be buggy and error-prone, but there will be glue models papering over the errors and trying different things and waiting for outages until the human user's intent is done.
Language models are giving us another layer of abstraction over software. Perhaps the final layer, as this one can interact directly with language (which is reified thought). This layer has the ability to cope with the inherent ambiguity of thought by being conversational -- it can ask you to clarify what you mean and gain confidence that it understands your intent.
> When I go to my bank's website, it's easy for me to find my tax forms. They're usually one or two clicks away in some prominent top-level navigation component. But I would rather converse: "Download tax forms"/"Which ones: A, B, or C"/"The first two"/"Here ya go".
I think this is supposedly solved by a search box on the site, where you enter "tax forms" and get results with a download button.
Creating the actions may be easier, but parsing/understanding them will often be easier in any format other than prose/plain text. I'd rather have a UI with a list of steps and colors and whatnot than a text I have to read.
Take for example the list of steps in the project. They contain a lot of redundant information I have to mentally ignore.
Yes, but I would rather the AI determine which actions these are and dynamically generate UI elements for me instead of relying on whatever shithead UI/UX person at LinkedIn wants me to look at to trick me into sharing all of my contact book.
ShitGPT: Are you sure you want to turn them off? LinkedIn notifications helpfully provide you with the latest news about your workplace and colleagues, past and present.
User: Yes, turn them off.
ShitGPT: Did you know that research has proven that employees with LinkedIn notifications turned on earn 10% higher salaries compared to their peers? Do you still want to turn off notifications?
User: Yes.
ShitGPT: What is your reason for wanting to turn off notifications? Your answer has to be at least 400 words long.
...
ShitGPT: No, the words 'fuck openai' copied 400 times is not a valid answer.
...
ShitGPT: Error: Too many requests in 1 hour. Try again later. (Of course you'll have to start over, tee hee.)
That's fair, I think the Microsoft of 30 years ago did pretty well with UIs. That said, given the state of Windows 11 and their other modern products I think their current priorities are quite a bit different.
There was a fascinating (but sadly discontinued) app called Shortcat for macOS. It basically let you control your entire Mac by typing text with the keyboard.
So this has actually been doable, at a larger scale, for over 10 years.
Update: it turns out development has resumed, but I don't use a Mac anymore.
I really like this, but I wonder if a better approach would be to take that simplified DOM and instead generate playwright or puppeteer code instead of direct DOM manipulation. That way it’s reproducible browser automation.
That's the direction I'd like to take this longer-term. The major advantages are: lower latency, lower costs and better auditability/reproducibility each time you run a workflow.
The major disadvantage is less resiliency if the structure of the page changes, e.g. if a site adds a "subscribe now!" modal or something you have to click through to get to the content. But that's solvable (if desired) by falling back to the LLM if the scripted steps don't produce the expected output.
We actually started our implementation in this direction but decided to change course and focus on direct LLM manipulation to iterate and launch faster. But we'll likely get back there at some point, especially as we build support for scheduled/automated workflows.
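To make the fallback idea concrete, here's a rough Playwright sketch; RecordedStep and recoverWithLlm are made-up names for illustration, not anything Taxy ships today:

  import { chromium, Page } from 'playwright';

  type RecordedStep = { description: string; selector: string; action: 'click' | 'fill'; value?: string };
  declare function recoverWithLlm(page: Page, step: RecordedStep): Promise<void>;  // hypothetical LLM fallback

  async function replay(startUrl: string, steps: RecordedStep[]): Promise<void> {
    const browser = await chromium.launch();
    const page = await browser.newPage();
    await page.goto(startUrl);
    for (const step of steps) {
      try {
        if (step.action === 'click') await page.click(step.selector, { timeout: 5000 });
        else await page.fill(step.selector, step.value ?? '', { timeout: 5000 });
      } catch {
        // The script drifted (e.g. a surprise modal): hand just this step back to the LLM.
        await recoverWithLlm(page, step);
      }
    }
    await browser.close();
  }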
While your example of "site adds a modal" is true, with browser based regression tests, the brittleness I find much more often is minor changes to the DOM, which don't have a large impact on what the user sees. (Sometimes this is due to refactoring the DOM, sometimes it's because the DOM doesn't have good class names to use as handles, and sometimes it's because I want to reuse a step on 2 different pages, but the rendered DOM is slightly different).
My mental model of how I think about drift (which might be totally worthless from an LLM perspective):
1) I give a prompt
2) snapshot of dom is taken
3) gpt looks at that snapshot DOM and implements some solution which works
4) that solution is transformed into a more concrete implementation
4i) whether this "transformation" is any of the following isn't SUPER important, and to be honest I'd love to see all 3: a) selenium/playwright code written by Taxy, b) a hybrid of explicit code and a gpt prompt, or c) developer can override with fully custom code
5) it runs correctly for X amount of time
6) application dom changes
7) taxy notices the step is failing
8) taxy takes a new snapshot of the dom
9) taxy runs a (reinforcement?) algorithm against the new snapshot and confirms it finds the "same" dom element as the one from the old snapshot (a rough sketch of this matching step follows below).
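To make that last step concrete, here's one rough way the re-matching could work: fingerprint the element the old script used, then score candidates in the new snapshot against it (purely illustrative, nothing Taxy does today):

  // Score candidate elements against a saved fingerprint and pick the best match.
  type Fingerprint = { tag: string; text: string; attrs: Record<string, string> };

  function similarity(fp: Fingerprint, el: Element): number {
    let score = 0;
    if (el.tagName.toLowerCase() === fp.tag) score += 2;
    if ((el.textContent ?? '').trim() === fp.text) score += 3;
    for (const [k, v] of Object.entries(fp.attrs)) {
      if (el.getAttribute(k) === v) score += 1;
    }
    return score;
  }

  function relocate(fp: Fingerprint, newDoc: Document): Element | null {
    let best: Element | null = null;
    let bestScore = 0;
    for (const el of Array.from(newDoc.querySelectorAll('*'))) {
      const s = similarity(fp, el);
      if (s > bestScore) { best = el; bestScore = s; }
    }
    return bestScore >= 3 ? best : null;   // arbitrary threshold: require a text or strong attribute match
  }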
Unrelated: the other thing I've found very hard to program into my browser tests (it also makes the code hard to interpret), and I'm curious how gpt/taxy could help:
Given I have this dom:
  <ul>
    <li>
      <div class="tweet">
        <h1>Check out my sandwich!</h1>
        <button>Retweet</button>
      </div>
    </li>
    <li>
      <div class="tweet">
        <h1>Check out my shoes!</h1>
        <button>Retweet</button>
      </div>
    </li>
  </ul>
I want to write test code which is:
1) When I load the page, I should see a tweet "Check out my sandwich!"
2) I can retweet that tweet.
Currently, I need to do either:
a) a dom traversal: find(text: "Check out my sandwich", css: ".tweet h1").parents(".tweet").find(text: "retweet"). It's that `parents(".tweet")` part which becomes awkward at scale and incentivizes developers to only create 1 tweet in the test database....
b) use Page Objects, which I love, but adds overhead/training for the team
I would love it if gpt could figure out that these 2 elements are "related". :)
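For what it's worth, Playwright's locator filtering can already express that scoping without the awkward parents() hop; a small sketch with a placeholder URL:

  import { test, expect } from '@playwright/test';

  test('retweet the sandwich tweet', async ({ page }) => {
    await page.goto('https://example.com/timeline');   // placeholder URL
    // Scope to the card whose text matches, then act within that scope.
    const tweet = page.locator('.tweet', { hasText: 'Check out my sandwich!' });
    await expect(tweet.locator('h1')).toBeVisible();
    await tweet.getByRole('button', { name: 'Retweet' }).click();
  });

The interesting question is whether gpt could infer that kind of scoping from the DOM on its own.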
Long-term, generating reproducible code will be important for speed/economics. However, even then the LLM will still have to calculate its actions one at a time when creating the reproducible workflow, since Taxy will have to deal with html elements being added and removed from the DOM after each action, not to mention navigation to other webpages.
Automation seems like a poor fit for an LLM since results are random and irreproducible? But I could see asking it to write a script step by step, and, once you've confirmed it works, keeping the script.
Also, it could help you fix the script when it breaks.
Anxiously been waiting for something like this - very cool.
One use case I've had is that I hate spending time on my linkedin, twitter, etc newsfeeds. But there are a handful of people I care about and want to keep tabs on.
Is there a way I could use TaxyAI to setup a role to monitor my LinkedIn newsfeed and keep tabs on certain people + topics and then email me a digest of that?
Right now the tool isn't reliable enough for us to feel comfortable releasing automated workflows. However, there are a lot of low-hanging fruit improvements we're planning on making to reliability, and once that's taken care of we're definitely planning on supporting automated workflows. Your example should be pretty straightforward at that point!
Similar to this, does anyone know of a browser extension that I can paste a series of playwright or puppeteer steps into (or choose from some saved snippets) and have it execute? I could use the saved snippets in the sources tab of dev tools but miss the auto-waiting and other niceties. This project seems a bit too slow and non-deterministic.
I don't know about being able to paste in Playwright steps, but a friend of mine has built an extension for browser automation: https://browserflow.app/
You can record steps & have the extension replay them on your machine or in the cloud (presumably using puppeteer/playwright).
I wrote a piece on my professional blog last week about the imminent death of most UI-based software, and it's funny to see this released today, furthering my argument.
And as I commented elsewhere: yes, UI elements make sense sometimes. But it makes sense for an AI to dynamically make these for us when needed instead of relying on the software's own implementation that may suck or have dark patterns or force workflows I don't want to deal with
Can you link the blog? An AGI that can generate on demand UIs could be the last interface we ever need. And those UI could be adapted and projected to different form factors, if you think about it there is a massive amount of waste in developer developing the same UI widgets over and over ad infinitum. Not sure I agree it is imminent though.
Unfortunately I don't tie my real identity to this account. The high level TLDR is that no one actually wants to interact with UIs (or software in general) in 99% of use cases, we just want the results we get from doing so. And as we see automation tools like the OP appear, we're going to be forced to do so much less. The companies that don't align to this new paradigm are going to die out, because their competition that makes interactions for customers easier are going to win the market share.
If you don't agree with imminent, don't take my word for it. Richard Ngo at OpenAI predicts this by the end of 2025. Less than 2 years away.
> if you think about it there is a massive amount of waste in developer developing the same UI widgets over and over ad infinitum
My blog also touches on this. 95% of consumer-facing software is the same concepts and UIs repeated. It's all CRUD stuff using one of 30 (made up number) UI elements arranged in roughly the same ways. There's a massive amount of training available of the translations of data and code into UIs. It's a logical next step that we will build software autonomously based on a (no)SQL database alone, and that (no)SQL databases will be autonomously created based on structured business requirements alone.
Whoever makes either step of this process work really well, either the database to UI part, or the requirements to database part, will make an absolute shitload of money.
I agree with basically all of that and thanks for sharing. The OpenAI team is hardly a neutral party in all of this though, they benefit from the hype. Exciting times regardless.
Why use GPT-4? The latency is significantly worse than 3.5 and this seems simple enough that the performance delta is marginal. If I was going for robustness, I probably wouldn’t be using AI in the first place.
Edit: I noticed they support both but I’m assuming by the speed all the demos are using 3.5?
We do have a switcher in the UI that lets you run against either GPT-3.5-turbo or GPT-4.
We've mainly been using GPT-4 for internal tests because in the "Make it work, make it right, make it fast" development flow, we're still firmly in the "make it work" phase. ;)
This is amazing already! Very exciting. I'll make sure I follow this project's progress. It also reminds me of Adept and their goal with ACT-1. I still haven't seen their product launch, though...
It will be interesting to see whether this sort of approach works better than something using GPT-4's vision capabilities. Obviously websites are built to be easy to use visually rather than easy to use via the DOM. On the other hand, it's much less clear how to ground action proposals in the visual domain - how do you ask GPT where on an image of the screen it wants to click?
According to this survey, 97.4% of websites don't comply with WCAG, which isn't surprising at all to me as someone who has been in the industry since 2004.
Since GPT4 released, I've been hoping the vision capabilities will be very shortly followed by projects to essentially allow natural language RPA of a desktop computer.
Copying financials from a PDF to an Excel sheet, for instance, is the kind of task that is tricky to manually automate but seems like it would be trivial for an LLM to execute.
This is very cool! I was messing with some browser automation (Playwright) via GPT recently.
One idea I had: it would be cool if I could teach the agent. For instance, give it a task, but if it struggles, just complete it myself while the extension observes my interactions.
Perhaps these could be used as few shot examples for priming the model?
We aren't introspecting previous runs or human examples to optimize workflows right now, but it's a powerful tool and one that I expect we'll employ in the future to make Taxy more efficient and reliable.
Does the demo show the AI, given the prompt "Schedule standup tomorrow at 10am. Invite david@taxy.ai", scheduling a meeting at 10am TODAY, which was also already five hours in the past?
> Makes me worried about AI with internet access...
I think I'm going to start ending every post with the signature "You are a friendly AI." Hopefully if it repeats enough times in the dataset, our AIs will be aligned.
Hahah that's not a bad idea. Maybe you could be more subtle, for the rare cases that a human is actually reading your post. Thanks for your kind and helpful response.
A smart, reliable form-filler was one of the pre-GPT ideas I've thought about building for a very long time. That may have been one of the subconscious inspirations for what became Taxy. :D
I always thought that a perfect form-filler would have to be based on image recognition to "see" forms the way humans do (instead of filtering through code).
Very cool. The “sending everything of relevance on the page to OpenAI” is of course creepy. But that’s table stakes for anything like this until people can run them externally.
This would make a cool, “magic box”, at the top of a web page. Type in what you want to do, it sends it to the server along with the DOM extract (same site server). Server asks magical LLM how to do it, and then spits it back to the client. So no plug-in needed and data flow would pass through the source server.
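A minimal sketch of that, assuming a hypothetical same-site /magic-box endpoint that asks the LLM and sends back a list of actions:

  // Hypothetical "magic box" client; the endpoint name and action shape are made up.
  async function runMagicBox(task: string): Promise<void> {
    const domExtract = document.body.innerHTML.slice(0, 50_000);  // naive extract, for illustration
    const res = await fetch('/magic-box', {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ task, dom: domExtract }),
    });
    const { actions } = (await res.json()) as {
      actions: { kind: 'click' | 'setValue'; selector: string; value?: string }[];
    };
    for (const a of actions) {
      const el = document.querySelector<HTMLElement>(a.selector);
      if (a.kind === 'click') el?.click();
      if (a.kind === 'setValue' && el instanceof HTMLInputElement) el.value = a.value ?? '';
    }
  }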
That is a great idea! I can't believe that the same functionality that seemed impossible a year ago (for which some star-studded startup raised $80m) is now achievable by a guy and his brother in 2 weeks. Godspeed!
This is actually an interesting idea. I'm not sure about the pricing, but I might actually try something like this for a project I'm working on. Could blow the socks off of my CEO.
I'm feeling that an API is something much more stable and deterministic than a human-readable interface. Also, you can train an AI to learn which API calls to make for the task by looking at page sources. Why not translate prompts into single API calls instead of a script that clicks through DOM elements to achieve the same thing?
Very cool idea! I'm excited to try it. I'm a little bit worried about the reliability of interfacing with a website via the DOM. I trust GPT-4 enough, but I could see a situation where the correct fields to fill in are ambiguous in the DOM and the plugin ends up saving or deleting the wrong data.
From a developer point of view, this would vastly improve QA for web apps that have UI. As a user, I'd really appreciate a text-to-speech UI that does not require a keyboard or mouse. Ideally, I'd like to see the same UI available for anything, as if asking another person to interact with computer for you.
Auto-fill tools would still be super manual because none of these application workflows are consistent, even on the exact same provider (myworkdayjobs). With auto-fill you're still effectively aligning fields on a page with saved values, and it takes forever.
Filling in a timesheet for meetings for the week based on calendar items. Categorizing whether it's internal or external meeting based on whether email addresses invited match your employer's domains, and which project it was for based on the title and description.
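The internal/external part is basically just a domain check; a toy sketch (the domain list is illustrative):

  // Classify a meeting as internal or external based on attendee email domains.
  const employerDomains = ['example.com'];   // placeholder for your employer's domains

  function classifyMeeting(attendeeEmails: string[]): 'internal' | 'external' {
    const hasExternal = attendeeEmails.some(
      (email) => !employerDomains.includes(email.split('@')[1]?.toLowerCase() ?? '')
    );
    return hasExternal ? 'external' : 'internal';
  }

The harder part is mapping titles/descriptions to projects, which is where the LLM would actually earn its keep.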
I could use it to fill out timesheets in Zoho, find a list of Facebook events that I'm actually interested in, and download forms from my bank. These are tasks I do often that the UIs make far more difficult than they need to be.
This is very cool, impressive work in 2 weeks!
Each action seems to have some delay after it, is there any reason for that? Is it because you are streaming the OpenAI response and performing the actions as they come? If not, I imagine streaming the query response and executing each action as they emit would speed it up quite a bit?
There are multiple sources of latency: (1) parsing, simplifying and templatizing the DOM, (2) sending it to OpenAI and waiting for the response, (3) actually performing the action (which includes some internal waiting depending on what the actual action is), and (4) a hacky 2-second sleep() after each action to make sure any site changes are reflected in the DOM before running again. :)
(1) is already very fast. There's a lot of room for improvement in both (3) and (4); we currently set the timeouts very conservatively to improve reliability, but there are a number of heuristics we can use to cut those timeouts substantially in the average case.
Unfortunately for (2) I'm not sure there's much we can do. We only ask OpenAI for one action at a time to give it a chance to observe the state of the page between actions and correct course as necessary. We experimented with letting it return multiple actions at once but that hurt reliability; we can perform more experiments in that direction but it probably won't be a priority in the short term.
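For (4), one heuristic we could try instead of the fixed sleep is waiting for the DOM to go quiet with a MutationObserver; a sketch, not what Taxy does today:

  // Resolve once the DOM has been quiet for `quietMs`, or after `maxMs` regardless.
  function waitForDomToSettle(quietMs = 300, maxMs = 2000): Promise<void> {
    return new Promise((resolve) => {
      let quietTimer = setTimeout(finish, quietMs);
      const hardStop = setTimeout(finish, maxMs);
      const observer = new MutationObserver(() => {
        clearTimeout(quietTimer);
        quietTimer = setTimeout(finish, quietMs);   // reset the quiet window on every mutation
      });
      observer.observe(document.documentElement, { childList: true, subtree: true, attributes: true });
      function finish() {
        observer.disconnect();
        clearTimeout(quietTimer);
        clearTimeout(hardStop);
        resolve();
      }
    });
  }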
Wow... Another AI thing that I can't use because there's a "waitlist". GTFO, software doesn't need waitlists and you're a jerk for advertising uselessness.
We have a waitlist because we haven't released the extension in the Chrome Web Store yet, but it's fully open source and you can download and run it on your local machine today!
If you advertise something and don't deliver, you're fair game, open source or not. This is the opposite of an unwanted PR request: this is baiting and collecting email addresses with no product.