Computer Use & Browser Agents
Agents that see your screen and use mouse & keyboard
The Problem: A great deal of software exposes no API: legacy desktop apps, internal portals, third-party tools. You can't script them the usual way. How do you automate a system you can only see and click?
The Solution: The Screenshot → Reason → Action Loop
A computer-use agent has no special backdoor — it operates the GUI like a human. Each turn it captures a screenshot, reasons about the next step, and emits a concrete action. The hard part is grounding: turning "click Submit" into a precise x,y. Pixel-only grounding is flexible but brittle; reading the accessibility tree (or the browser DOM via Playwright/CDP) targets the real element by role and label. Because actions can be irreversible, a safety gate asks a human to confirm before anything destructive.
Think of it like a remote assistant who can only see your screen:
1. Capture the screen: Take a screenshot (or read the UI/accessibility state) of the current app.
2. Reason about the goal: The model decides the next UI action that moves toward the task goal.
3. Emit a concrete action: Output click at x,y / type text / key press, grounded to a real target.
4. Execute & loop: Run the action, capture the new state, and repeat until the goal is met.
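The four steps above can be sketched in a few lines of Python. This is a minimal illustration, not a real agent: `FakeForm` stands in for the GUI, `scripted_policy` stands in for the model's reasoning, and the `Action` fields are hypothetical names chosen for this example.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    kind: str         # "click", "type", or "done" (hypothetical vocabulary)
    target: str = ""  # label of the UI element to act on
    text: str = ""    # text to type, if any


class FakeForm:
    """Stand-in for a real GUI: one text field and a Submit button."""

    def __init__(self) -> None:
        self.name_field = ""
        self.submitted = False

    def capture(self) -> dict:
        # A real agent would return a screenshot or accessibility snapshot.
        return {"name_field": self.name_field, "submitted": self.submitted}

    def execute(self, action: Action) -> None:
        if action.kind == "type" and action.target == "Name":
            self.name_field = action.text
        elif action.kind == "click" and action.target == "Submit":
            self.submitted = True


def scripted_policy(state: dict) -> Action:
    """Stand-in for the model: pick the next action from the observed state."""
    if state["submitted"]:
        return Action("done")
    if not state["name_field"]:
        return Action("type", target="Name", text="Ada")
    return Action("click", target="Submit")


def run_agent(env, policy, max_steps: int = 10) -> list[Action]:
    """Capture, reason, act, repeat, with a step budget as a safeguard."""
    history: list[Action] = []
    for _ in range(max_steps):
        state = env.capture()    # step 1: capture the screen
        action = policy(state)   # steps 2-3: reason and emit an action
        if action.kind == "done":
            return history
        env.execute(action)      # step 4: execute, then loop
        history.append(action)
    raise RuntimeError("step budget exhausted before the goal was met")


form = FakeForm()
steps = run_agent(form, scripted_policy)
print([a.kind for a in steps], form.submitted)
```

The step budget is a design choice worth keeping in any real loop: without it, an agent that never reaches its goal would keep acting indefinitely.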
Browser agents run the same loop but use the DOM via Playwright or the Chrome DevTools Protocol (CDP) for far more reliable grounding than raw pixels.
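As a rough sketch of what DOM-based grounding looks like, the function below targets elements by role and accessible name instead of coordinates. It assumes `page` is a Playwright sync-API `Page` (the block does not import Playwright itself), and the button labels and confirmation text are illustrative values, not taken from any real site.

```python
def submit_form(page) -> None:
    """Dismiss a cookie banner if one is showing, click Submit, then verify.

    `page` is assumed to be a Playwright sync-API Page object.
    """
    # Locate by role and accessible name, not by x,y.
    banner = page.get_by_role("button", name="Accept cookies")
    # is_visible() is a point-in-time check and does not wait for the banner.
    if banner.is_visible():
        banner.click()

    page.get_by_role("button", name="Submit").click()

    # Verify the outcome: wait_for raises a timeout error if the
    # confirmation text does not become visible in time.
    confirmation = page.get_by_text("Thanks, form submitted")
    confirmation.wait_for(state="visible", timeout=5000)


# Possible usage, assuming Playwright is installed (URL is hypothetical):
#
# from playwright.sync_api import sync_playwright
# with sync_playwright() as p:
#     browser = p.chromium.launch()
#     page = browser.new_page()
#     page.goto("https://example.com/form")
#     submit_form(page)
#     browser.close()
```

Because the locators are resolved against the live page at click time, a layout shift between observation and action does not invalidate them the way it invalidates a stored coordinate.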
Where Computer-Use Agents Shine
- Web automation & form filling: Fill out portals, book appointments, scrape data from sites that expose no API.
- QA & end-to-end UI testing: Drive a real browser through user flows and catch visual regressions a human would notice.
- RPA across legacy desktop apps: Automate old desktop software that has no integration points — the agent just clicks.
- Data entry & migration: Move records between two systems by reading one screen and typing into another.
Fun Fact: The whole capability boils down to a tiny loop: screenshot → reason → action → new screenshot. Everything else — better grounding, verification steps, safety gates — is about making that loop reliable enough to trust on a real screen.
Try It Yourself!
Use the interactive demo below to step through the screenshot → reason → action loop, compare pixel-coordinate grounding with the accessibility tree, and watch a safety gate pause before a destructive action.
Frequently asked questions
What is computer use in AI agents?
Computer use is a capability where an agent operates a graphical interface the way a human does: it takes a screenshot, reasons about the goal, emits a concrete action (click at x,y, type text, press a key), executes it, takes a new screenshot, and repeats. Because it works through the GUI, it can in principle automate any application a person can operate on screen, including legacy software with no API. Claude computer use and similar tools follow this screenshot-reason-action loop.
What is the difference between pixel-coordinate and accessibility-tree grounding?
Grounding turns an instruction like 'click Submit' into a concrete target. Pixel-coordinate grounding predicts raw x,y from the screenshot — flexible but brittle, since it mis-clicks when the layout shifts. Accessibility-tree (and DOM) grounding reads the structured list of UI elements the OS or browser exposes, so the agent targets the actual 'Submit' button by its role and label. Accessibility grounding is far more robust; many production agents combine both.
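The difference can be illustrated with a toy accessibility tree. In this sketch the `Node` type, its fields, and the coordinates are all invented for the example; a real tree from an OS or browser carries much more information.

```python
from __future__ import annotations

from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Node:
    """One UI element: role, accessible name, and bounding box in pixels."""
    role: str
    name: str
    x: int
    y: int
    width: int
    height: int


def hit_test(tree: list[Node], x: int, y: int) -> Optional[Node]:
    """Return the element a raw click at (x, y) would land on, if any."""
    for node in tree:
        if node.x <= x < node.x + node.width and node.y <= y < node.y + node.height:
            return node
    return None


def ground_by_role(tree: list[Node], role: str, name: str) -> tuple[int, int]:
    """Resolve a role/name pair to the center of the single matching element."""
    matches = [n for n in tree if n.role == role and n.name == name]
    if len(matches) != 1:
        raise LookupError(f"expected exactly one {role} named {name!r}, found {len(matches)}")
    node = matches[0]
    return (node.x + node.width // 2, node.y + node.height // 2)


# Layout as first observed: Submit is centered at (412, 530).
before = [Node("button", "Submit", x=372, y=515, width=80, height=30)]

# A banner appears and pushes existing content down by 60 pixels.
after = [replace(n, y=n.y + 60) for n in before]
after.append(Node("button", "Accept cookies", x=600, y=10, width=120, height=40))

stale = hit_test(after, 412, 530)                 # the remembered pixel target
fresh = ground_by_role(after, "button", "Submit")  # looked up in the current tree
print(stale, fresh)
```

The remembered coordinate no longer hits anything, while the role-and-name lookup follows the button to its new position. Raising on zero or multiple matches is a deliberate choice here: an ambiguous target is safer to report than to guess.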
What are the main safety risks of browser and computer-use agents, and how are they mitigated?
The biggest risks are that an agent takes a destructive, irreversible action (deleting files, sending money, submitting a wrong form) or gets hijacked by prompt injection from page content. The standard mitigations work in layers: a safety gate that requires explicit human confirmation before irreversible actions, a sandboxed environment with least-privilege access, a domain allowlist, and a verification step that checks the screen state before and after each action.
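A safety gate and a domain allowlist might be combined along these lines. This is a sketch under stated assumptions: the action dictionaries, the set of destructive kinds, and the allowed host are hypothetical, and a real deployment would need a far more careful definition of what counts as irreversible.

```python
from urllib.parse import urlparse

# Hypothetical policy values for this example only.
ALLOWED_HOSTS = {"portal.example.com"}
DESTRUCTIVE_KINDS = {"delete", "pay", "submit"}


def host_allowed(url: str, allowed=ALLOWED_HOSTS) -> bool:
    """Check the URL's hostname against an exact-match allowlist."""
    host = urlparse(url).hostname or ""
    return host in allowed


def gate(action: dict, confirm) -> bool:
    """Return True if the action may run.

    `confirm` is a callable that asks a human and returns True or False.
    It is only consulted for destructive actions.
    """
    url = action.get("url")
    if url is not None and not host_allowed(url):
        return False  # off-allowlist navigation is refused outright
    if action["kind"] in DESTRUCTIVE_KINDS:
        return bool(confirm(action))  # pause for explicit human approval
    return True


def deny_all(action: dict) -> bool:
    """Stand-in for a human who declines every request."""
    return False


print(gate({"kind": "click"}, deny_all))
print(gate({"kind": "delete"}, deny_all))
print(gate({"kind": "navigate", "url": "https://evil.example.net/"}, deny_all))
```

The allowlist is checked before the confirmation prompt so that a page-injected instruction cannot route around policy by persuading the human reviewer.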
Try it yourself
A computer-use agent must click the "Submit" button in a web form where a cookie banner appears on load.
Action: click(412, 530)
Result: nothing happened. The cookie banner shifted the page layout, so (412, 530) no longer points at the Submit button and the click lands on the banner instead. The agent missed the button and never noticed, because it never checked whether the screen state changed.
Screenshot: I see a cookie banner over the form.
Thought: the banner overlays the page and shifts elements, so dismiss it first.
Action: click(role="button", name="Accept cookies")
Screenshot: banner gone, the form is fully visible.
Thought: now find the Submit button by role and label, not old coordinates.
Action: click(role="button", name="Submit")
Screenshot (verification): a "Thanks, form submitted" page appeared.
Verified: state changed as expected → task complete.
Pixel-only grounding breaks the moment the layout shifts. Grounding by role and label, combined with a post-action verification step, makes a computer-use agent far more reliable.
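The post-action verification step can be sketched as a wrapper around any action. The function and status names here are hypothetical, and comparing screenshot hashes is only one simple way to detect "nothing changed"; real screens with animations or clocks would need a more tolerant comparison.

```python
import hashlib


def fingerprint(screenshot: bytes) -> str:
    """Hash the raw screenshot bytes so two captures can be compared cheaply."""
    return hashlib.sha256(screenshot).hexdigest()


def act_and_verify(capture, execute, action, expect) -> str:
    """Run one action and report what happened to the screen.

    capture: callable returning the current screenshot as bytes
    execute: callable that performs the action
    expect:  callable that checks the new screenshot for the intended result
    """
    before = capture()
    execute(action)
    after = capture()
    if fingerprint(after) == fingerprint(before):
        return "no_change"         # the click probably missed its target
    if not expect(after):
        return "unexpected_state"  # something changed, but not as intended
    return "ok"


# Tiny simulated screen for demonstration.
screen = {"pixels": b"form with cookie banner"}

def capture() -> bytes:
    return screen["pixels"]

def stale_click(action) -> None:
    pass  # the coordinate missed, so the screen stays the same

def role_click(action) -> None:
    screen["pixels"] = b"Thanks, form submitted"

def shows_thanks(shot: bytes) -> bool:
    return b"Thanks" in shot

print(act_and_verify(capture, stale_click, "click(412, 530)", shows_thanks))
print(act_and_verify(capture, role_click, "click Submit", shows_thanks))
```

The wrapper reports a status instead of retrying on its own, since blindly repeating an action such as a form submission could itself be destructive; the agent can re-plan from the returned status.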
This lesson is part of a structured LLM course.