Lesson 12 · Hands-on

Computer Use & Browser Agents

Agents that see your screen and use mouse & keyboard

The Problem: Many applications and websites expose no API (legacy desktop apps, internal portals, third-party tools), so you can't script them the usual way. How do you automate a system you can only see and click?

The Solution: The Screenshot → Reason → Action Loop

A computer-use agent has no special backdoor — it operates the GUI like a human. Each turn it captures a screenshot, reasons about the next step, and emits a concrete action. The hard part is grounding: turning "click Submit" into a precise x,y. Pixel-only grounding is flexible but brittle; reading the accessibility tree (or the browser DOM via Playwright/CDP) targets the real element by role and label. Because actions can be irreversible, a safety gate asks a human to confirm before anything destructive.

Think of it like a remote assistant who can only see your screen:

1. Capture the screen: Take a screenshot (or read the UI/accessibility state) of the current app.
2. Reason about the goal: The model decides the next UI action that moves toward the task goal.
3. Emit a concrete action: Output click at x,y / type text / key press — grounded to a real target.
4. Execute & loop: Run the action, capture the new state, and repeat until the goal is met.
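
The four steps above can be sketched as a small loop. This is a minimal illustration under stated assumptions, not a real agent: `FakeApp` is a hypothetical in-memory stand-in for a GUI, `decide` is a rule-based placeholder for the model's reasoning, and the field values are made up for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Action:
    kind: str         # "type", "click", or "done"
    target: str = ""  # label of the UI element (hypothetical naming)
    text: str = ""    # text to type, if any


@dataclass
class FakeApp:
    """Hypothetical stand-in for a real GUI: one password field, one button."""
    password: str = ""
    submitted: bool = False

    def capture(self) -> dict:
        # Step 1: a real agent would take a screenshot or read the UI tree here.
        return {"password": self.password, "submitted": self.submitted}

    def execute(self, action: Action) -> None:
        # Step 4: apply the action to the application.
        if action.kind == "type" and action.target == "Password":
            self.password = action.text
        elif action.kind == "click" and action.target == "Submit" and self.password:
            self.submitted = True


def decide(state: dict) -> Action:
    # Steps 2 and 3: placeholder for the model choosing the next concrete action.
    if state["submitted"]:
        return Action("done")
    if not state["password"]:
        return Action("type", "Password", "example-password")
    return Action("click", "Submit")


def run_agent(app: FakeApp, max_steps: int = 10) -> List[Action]:
    """Capture, decide, execute, and repeat until the goal is met or steps run out."""
    history: List[Action] = []
    for _ in range(max_steps):
        state = app.capture()
        action = decide(state)
        if action.kind == "done":
            break
        app.execute(action)
        history.append(action)
    return history


app = FakeApp()
steps = run_agent(app)
print([a.kind for a in steps], app.submitted)
```

The `max_steps` cap is a design choice worth keeping in any real loop: without it, an agent that never reaches the goal would keep acting indefinitely.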

Browser agents run the same loop but use the DOM via Playwright or the Chrome DevTools Protocol (CDP) for far more reliable grounding than raw pixels.
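
As a rough sketch of DOM-based grounding, the helper below clicks an element by ARIA role and accessible name using Playwright's `get_by_role` locator. The helper names `click_by_role` and `submit_form` are hypothetical, and the `page` argument is assumed to be a Playwright `Page` from the sync API; the commented usage shows how it might be wired up.

```python
def click_by_role(page, role: str, name: str) -> None:
    """Click an element located by ARIA role and accessible name.

    `page` is assumed to be a Playwright Page (sync API). Any object with the
    same get_by_role(...).click() shape also works, which keeps this sketch
    runnable without a browser.
    """
    page.get_by_role(role, name=name).click()


def submit_form(page) -> None:
    # Target the button semantically instead of by pixel coordinates.
    click_by_role(page, "button", "Submit")


# Possible usage with real Playwright (assumes the package and a browser are installed):
#   from playwright.sync_api import sync_playwright
#   with sync_playwright() as p:
#       browser = p.chromium.launch()
#       page = browser.new_page()
#       page.goto("https://app.example.com")  # illustrative URL
#       submit_form(page)
#       browser.close()
```

Because the locator is resolved against the live page at click time, a layout shift does not invalidate it the way a stored x,y pair would be invalidated.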

Where Computer-Use Agents Shine

Web automation & form filling: Fill out portals, book appointments, scrape data from sites that expose no API.
QA & end-to-end UI testing: Drive a real browser through user flows and catch visual regressions a human would notice.
RPA across legacy desktop apps: Automate old desktop software that has no integration points — the agent just clicks.
Data entry & migration: Move records between two systems by reading one screen and typing into another.

Fun Fact: The whole capability boils down to a tiny loop: screenshot → reason → action → new screenshot. Everything else — better grounding, verification steps, safety gates — is about making that loop reliable enough to trust on a real screen.

Frequently asked questions

What is computer use in AI agents?

Computer use is a capability where an agent operates a graphical interface the way a human does: it takes a screenshot, reasons about the goal, emits a concrete action (click at x,y, type text, press a key), executes it, takes a new screenshot, and repeats. Because it works through the GUI, it can in principle automate any application a person can operate on screen, including legacy software with no API. Claude computer use and similar tools follow this screenshot → reason → action loop.

What is the difference between pixel-coordinate and accessibility-tree grounding?

Grounding turns an instruction like 'click Submit' into a concrete target. Pixel-coordinate grounding predicts raw x,y from the screenshot — flexible but brittle, since it mis-clicks when the layout shifts. Accessibility-tree (and DOM) grounding reads the structured list of UI elements the OS or browser exposes, so the agent targets the actual 'Submit' button by its role and label. Accessibility grounding is far more robust; many production agents combine both.
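
The difference can be illustrated with a toy model of a screen. In this sketch the `Element` class, the layout, and all coordinates are invented for the example; a real accessibility tree or DOM is far richer. A stored pixel coordinate goes stale once a banner shifts the layout, while a lookup by role and name still finds the button.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Element:
    role: str
    name: str
    x: int
    y: int
    width: int
    height: int

    def contains(self, px: int, py: int) -> bool:
        return self.x <= px < self.x + self.width and self.y <= py < self.y + self.height

    def center(self) -> Tuple[int, int]:
        return (self.x + self.width // 2, self.y + self.height // 2)


def element_at(tree: List[Element], px: int, py: int) -> Optional[Element]:
    """Return the first element whose box contains the point (topmost first)."""
    for el in tree:
        if el.contains(px, py):
            return el
    return None


def ground_by_tree(tree: List[Element], role: str, name: str) -> Tuple[int, int]:
    """Resolve a semantic target to the current center of the matching element."""
    for el in tree:
        if el.role == role and el.name == name:
            return el.center()
    raise LookupError(f"no {role} named {name!r}")


# Hypothetical layout after a cookie banner has pushed the button down.
shifted = [
    Element("region", "Cookie banner", 0, 480, 800, 120),
    Element("button", "Submit", 372, 630, 80, 40),
]

stale_click = (412, 530)  # where Submit used to be before the shift
fresh_click = ground_by_tree(shifted, "button", "Submit")

print(element_at(shifted, *stale_click).name)  # the stale point hits the banner
print(element_at(shifted, *fresh_click).name)  # the tree lookup hits the button
```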

What are the main safety risks of browser and computer-use agents, and how are they mitigated?

The biggest risk is that an agent takes a destructive, irreversible action — deleting files, sending money, submitting a wrong form — or gets hijacked by prompt injection from page content. The standard mitigation is a safety gate: require explicit human confirmation before irreversible actions, run the agent in a sandboxed environment with least-privilege access, allowlist domains, and add a verification step that checks the screen state before and after each action.
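
A safety gate of this kind might look like the sketch below. It is deliberately simplistic: the keyword list, the allowlist, and the function names are assumptions made for illustration, and a keyword match is only a crude proxy for "irreversible". The `confirm` callback stands in for whatever mechanism asks a human.

```python
from typing import Callable
from urllib.parse import urlparse

# Hypothetical policy values; a real deployment would define its own.
DESTRUCTIVE_KEYWORDS = ("delete", "pay", "send", "transfer")
ALLOWED_DOMAINS = {"app.example.com"}


def is_destructive(label: str) -> bool:
    """Crude check: does the target's label contain a risky keyword?"""
    lowered = label.lower()
    return any(keyword in lowered for keyword in DESTRUCTIVE_KEYWORDS)


def domain_allowed(url: str) -> bool:
    return urlparse(url).hostname in ALLOWED_DOMAINS


def gate(url: str, label: str, confirm: Callable[[str], bool]) -> bool:
    """Return True only if the action may proceed."""
    if not domain_allowed(url):
        return False  # off-allowlist pages are blocked outright
    if is_destructive(label):
        # Pause and defer to a human for anything that looks irreversible.
        return confirm(f"Agent wants to click {label!r} on {url}. Allow?")
    return True


def deny(message: str) -> bool:
    return False


print(gate("https://app.example.com/form", "Next page", deny))
print(gate("https://app.example.com/settings", "Delete account", deny))
```

In an interactive setting, `confirm` could wrap `input()`; in a service, it could enqueue an approval request and block until a reviewer responds.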

Technique Comparison

Task · Intermediate · Analysis

A computer-use agent must click the "Submit" button in a web form where a cookie banner appears on load.

Without technique

Prompt

Click the Submit button. Its coordinates are (412, 530). Use click(x, y) with these coordinates.

Response

Action: click(412, 530)

Result: nothing happened. The cookie banner shifted the page down, so (412, 530) now lands on the banner itself. The agent missed the button and never noticed, because it did not check whether the screen state changed.

Tokens: 40/60

Time: 700 ms

With technique

Prompt

Goal: submit the form. Work in the screenshot → reason → action loop. Ground to elements by role and label (accessibility tree / DOM), not fixed pixels. After each action take a new screenshot and verify the state changed as expected; retry if needed.

Response

Screenshot: I see a cookie banner over the form.
Thought: the banner overlays the page and shifts elements; dismiss it first.
Action: click(role="button", name="Accept cookies")
Screenshot: banner gone, the form is fully visible.

Thought: now find the Submit button by role and label, not old coordinates.
Action: click(role="button", name="Submit")
Screenshot (verification): a "Thanks, form submitted" page appeared.

Verified: state changed as expected → task complete.

👁️ Screenshot: detected a cookie banner overlaying the form
🧠 Action: dismiss the banner by role/label, not pixels
👁️ Action: click(role="button", name="Submit") (semantic grounding)
🔍 Verification: a new screenshot confirms successful submission

Tokens: 95/150

Time: 2600 ms

Why this works

Pixel-only grounding breaks the moment the layout shifts. Grounding by role and label, combined with a post-action verification step, makes a computer-use agent far more reliable.
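
The post-action verification step can be sketched as a small wrapper that captures state before and after an action and retries when the expected change did not happen. The function name `act_and_verify`, its parameters, and the toy page state are all assumptions for illustration.

```python
from typing import Callable, Dict


def act_and_verify(
    act: Callable[[], None],
    capture: Callable[[], Dict],
    expected: Callable[[Dict, Dict], bool],
    max_attempts: int = 3,
) -> bool:
    """Run `act`, then check the new state; retry until verified or attempts run out."""
    for _ in range(max_attempts):
        before = capture()
        act()
        after = capture()
        if expected(before, after):
            return True
    return False


# Toy page: the first click is swallowed (as if a banner were in the way),
# the second one submits the form.
page_state = {"clicks": 0, "submitted": False}


def click_submit() -> None:
    page_state["clicks"] += 1
    if page_state["clicks"] >= 2:
        page_state["submitted"] = True


def snapshot() -> Dict:
    return dict(page_state)  # copy, so before and after can differ


ok = act_and_verify(click_submit, snapshot, lambda before, after: after["submitted"])
print(ok, page_state["clicks"])
```

Blind retries are only safe for idempotent actions; for something like a payment, a failed verification should escalate to a human rather than repeat the click.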

