Playwright AI Agents - LLM-driven test authoring + execution

1. What is a Playwright AI agent

An LLM-driven runner that turns natural-language goals into Playwright actions and verifies the result. Think test author + test runner glued together by a small reasoning loop.

goal -> action plan | DOM as context | tool calling | observation step | retry on failure | structured output | deterministic seed (sometimes) | budget guard

The agent runs roughly four phases: perceive (snapshot DOM, accessibility tree, or screenshot), plan (LLM proposes the next Playwright action - click, fill, expect), act (we call the actual Playwright API), and observe (did the action succeed, did the page advance, is the assertion green). A failed observation feeds back into the next plan call so the loop self-corrects.
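The four phases above can be sketched as a small loop. This is a minimal illustration, not an implementation from any specific library: the names AgentAction, AgentDeps and runAgent are hypothetical, and the four phase functions are injected so the loop itself stays independent of Playwright and of the LLM provider.

```typescript
// Hypothetical types for illustration only; not taken from a real library.
type AgentAction =
  | { kind: 'click'; selector: string }
  | { kind: 'fill'; selector: string; value: string }
  | { kind: 'finish' };

interface Observation {
  ok: boolean;     // did the last action succeed?
  detail: string;  // short error or page-state summary fed back to the planner
}

interface AgentDeps {
  perceive: () => Promise<string>;           // e.g. an ARIA snapshot as text
  plan: (goal: string, context: string, last?: Observation) => Promise<AgentAction>; // the LLM call
  act: (action: AgentAction) => Promise<void>;  // the real Playwright call
  observe: () => Promise<Observation>;       // assertion or page-advance check
}

type AgentResult = { status: 'done' | 'max_steps'; steps: number };

async function runAgent(goal: string, deps: AgentDeps, maxSteps = 10): Promise<AgentResult> {
  let last: Observation | undefined;
  for (let step = 1; step <= maxSteps; step++) {
    const context = await deps.perceive();
    const action = await deps.plan(goal, context, last);
    if (action.kind === 'finish') return { status: 'done', steps: step };
    try {
      await deps.act(action);
      last = await deps.observe();
    } catch (err) {
      // A failed action becomes the observation for the next plan call.
      last = { ok: false, detail: err instanceof Error ? err.message : String(err) };
    }
  }
  return { status: 'max_steps', steps: maxSteps };
}
```

Because a failed act is turned into an observation rather than an exception, the next plan call sees the error text and can pick a different action, which is the self-correcting behaviour described above.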

The two design choices that change everything: (a) what you feed the LLM as page context - full DOM, ARIA tree, screenshot, or all three - and (b) how strictly you constrain the action it can return - free text vs JSON tool calls vs typed Playwright commands. Strict tools + ARIA tree is the cheap, fast, boring default. Free-form vision is the slow, expensive, brilliant backup.
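To make the "strict tools" option concrete, here is a minimal sketch of validating an LLM response before it reaches Playwright. The action names and the parseAction helper are assumptions made for this example; a real project would more likely use a schema library or the provider's own tool-calling format.

```typescript
// Hypothetical action shape built from the {action, selector, value} idea.
type AgentAction =
  | { action: 'click'; selector: string }
  | { action: 'fill'; selector: string; value: string }
  | { action: 'expectVisible'; selector: string };

// Parse raw LLM text into a typed action, or throw so the caller can re-prompt.
function parseAction(raw: string): AgentAction {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch {
    throw new Error('LLM response is not valid JSON');
  }
  if (typeof data !== 'object' || data === null) {
    throw new Error('expected a JSON object');
  }
  const { action, selector, value } = data as Record<string, unknown>;
  if (typeof selector !== 'string' || selector.length === 0) {
    throw new Error('selector must be a non-empty string');
  }
  if (action === 'click') return { action: 'click', selector };
  if (action === 'expectVisible') return { action: 'expectVisible', selector };
  if (action === 'fill') {
    if (typeof value !== 'string') throw new Error('fill requires a string value');
    return { action: 'fill', selector, value };
  }
  throw new Error('unknown action: ' + String(action));
}
```

Anything the model returns outside the three allowed shapes is rejected before a single Playwright call is made, which keeps the act step boring and predictable.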

Try it on TTA. Open the TTACart-AI sandbox and watch the 5 mini-demos that ship with it - each one is a different way to wire an agent into Playwright.

Exercises

Map the loop. Sketch the perceive-plan-act-observe loop for "log in to TTACart, add the highlighted SKU to cart, verify total = 999". Mark which step is deterministic and which is LLM-driven.
Compare contexts. Drive the same TTACart "add to cart" flow once using the full HTML, once using only the accessibility tree, once using a screenshot. Compare token cost.
Force structured output. Define a JSON schema for {action, selector, value} and prompt the LLM to fill it. Reject any response that fails schema validation.
Inject a deliberate bug. Rename the checkout button to "Pay now" in the DOM via DevTools, then re-run the agent. Does the loop recover by re-reading the page, or does it hard-fail?

2. Stagehand-style API on top of Playwright

A thin wrapper that adds ai.act, ai.observe, and ai.extract to your Playwright Page. You keep all the existing page.goto, expect, page.locator APIs and only reach for AI when the locator is fragile or unknown.

ai.act(intent) | ai.click(text) | ai.fill(label, value) | ai.extract(schema) | ai.observe() | cache resolved selectors | fall back to Playwright

The mental model: stop the agent from generating code, let it generate resolved locators at runtime. The first time the agent sees "the green checkout button" it asks the LLM, gets back a CSS / ARIA locator, runs it, caches it. Next run it uses the cached locator and skips the LLM entirely. Cost amortises to near-zero after the first green test run.

example.spec.ts (5 line illustration)

// Illustrative TTA snippet, not a copy from upstream.
await page.goto('https://app.thetestingacademy.com/playwright/ttacart-ai/');
await ai.act('add the highlighted SKU to the cart');
await ai.act('open the cart drawer');
const total = await ai.extract({ schema: { total: 'number' } });
expect(total.total).toBeGreaterThan(0);
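The ask-once-then-cache behaviour described in this section can be sketched as follows. LocatorCache and AskLlm are hypothetical names for this example, and the in-memory Map is a simplification: to skip the LLM on the next test run, a real wrapper would have to persist the cache (for example to a file checked by CI).

```typescript
// Hypothetical helper: resolve an English intent to a selector string,
// asking the LLM only on a cache miss.
type AskLlm = (intent: string) => Promise<string>; // returns a CSS / ARIA selector

class LocatorCache {
  private readonly store = new Map<string, string>();

  async resolve(intent: string, askLlm: AskLlm): Promise<string> {
    const cached = this.store.get(intent);
    if (cached !== undefined) return cached; // warm path: no LLM call
    const selector = await askLlm(intent);   // cold path: exactly one LLM call
    this.store.set(intent, selector);
    return selector;
  }

  // Call this when a cached selector stops matching, so the next
  // resolve() goes back to the LLM instead of reusing a stale entry.
  invalidate(intent: string): void {
    this.store.delete(intent);
  }
}
```

The invalidate method is the hook for the stale-cache case: when the cached selector fails at runtime, drop it and resolve again.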

Try it on TTA. Use the Stagehand-style demo card on TTACart-AI - it shows the same flow as the snippet above against our real DOM.

Exercises

Replace 5 locators. Take one TTACart spec and rewrite 5 brittle page.locator calls as ai.act intents. Measure the first run cost vs. cached run cost.
Strict schema extract. Write an ai.extract call that returns { items: Array<{ sku, qty, price }> } from the TTACart cart drawer. Fail the spec if the schema mismatches.
Hybrid spec. Use ai.act only for the unstable steps (search, add-to-cart) and page.locator for the stable steps (login form). Diff the test runtime.
Stale cache test. Cache a resolved locator, rename the matching attribute in DevTools, re-run, and watch the cache invalidate + re-ask the LLM.

3. browser-use and web-agent libraries

A different shape - instead of decorating Playwright with AI calls, you hand the entire browser to the agent and let it drive on its own. The agent generates a multi-step plan, executes step by step, re-plans on error.

agent.run(goal) | action plan list | page state digest | history of past steps | replan on error | max_steps cap | screenshot every step

The agent gets one goal ("book the cheapest TTAStays room for next weekend") and gets to choose every Playwright action - including click, fill, scroll, extract, and finish. The library exposes Playwright primitives as tools, and the LLM decides the sequence. Where Stagehand is a hammer, browser-use is a robot arm.

Cost goes up fast - every step is a fresh LLM call - so production usage caps max_steps (typically 10-30), runs a cheap small model first, and only escalates to a large model on retry. Some libraries cache the final action plan and let you replay it deterministically the next run - you got the AI to "discover" the test, and now you can re-run it like a normal Playwright script.
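The replay idea can be sketched like this. The PlanStep format is invented for the example (each library that supports replay defines its own), and PageLike is a structural subset of Playwright's Page so the sketch runs without a browser; a real Page object satisfies it.

```typescript
// Hypothetical recorded-plan format; real libraries define their own.
type PlanStep =
  | { op: 'goto'; url: string }
  | { op: 'click'; selector: string }
  | { op: 'fill'; selector: string; value: string };

// Structural subset of Playwright's Page that the replay needs.
interface PageLike {
  goto(url: string): Promise<unknown>;
  click(selector: string): Promise<void>;
  fill(selector: string, value: string): Promise<void>;
}

// Replay a plan captured from an earlier agent run. No LLM is involved:
// each step maps directly to one Playwright call, in recorded order.
async function replayPlan(page: PageLike, planJson: string): Promise<number> {
  const steps = JSON.parse(planJson) as PlanStep[];
  for (const step of steps) {
    switch (step.op) {
      case 'goto':
        await page.goto(step.url);
        break;
      case 'click':
        await page.click(step.selector);
        break;
      case 'fill':
        await page.fill(step.selector, step.value);
        break;
    }
  }
  return steps.length;
}
```

If a replayed step throws, that is the signal to hand the goal back to the agent for a fresh discovery run rather than patching the plan by hand.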

Try it on TTA. The TTACart-AI sandbox has a smart-data demo button that generates 10 edge-case checkouts - same idea, scoped to a single page.

Exercises

Generate 10 checkouts. Use the smart-data button on TTACart-AI to produce 10 edge-case checkout payloads (Unicode names, very long addresses, 0 quantity). Assert the orders table renders all 10 rows cleanly.
Step cap. Set max_steps=8 and goal "find the SKU on page 3 of TTACart and add it to cart". What does the agent do when it hits the cap before finding the SKU?
Plan replay. Capture the action plan from one successful run as JSON. Replay it deterministically without any LLM calls. Verify the run is byte-identical.
Fail injection. Mid-run, break the network for 2 seconds (via Playwright route handler). Does the agent retry, re-plan, or give up?
Cost report. Sum tokens used by the agent across 5 different goals. Pick the cheapest model that still hits all 5.

4. Self-healing locators

One of the most popular AI features in Playwright land. When a locator fails, the AI proposes 3 fallback locators using the current DOM, ranked by stability. You retry with the top pick and log the heal so a human can promote it later.

primary fails | redact DOM | 3 fallback candidates | stability ranking | retry top pick | log to heals.jsonl | human promotion gate

The flow lives one layer below your test: the locator wrapper catches not-found / not-visible, asks an AI client for 3 candidates, applies the top one, and only throws if all three fail. The healed locator is written to a log file but the source spec is left untouched - that way a heal never silently masks a real regression. A weekly maintainer PR promotes the most-used heals into the codebase as real selectors.

Detailed write-up of how we wire this in V2 lives in Framework + AI #heal, plus the Top 5 AI features card with a sequence diagram and an excerpt from UtilElementLocator.ts.

illustrative heal call (5 lines)

// On primary locator fail, ask AI for 3 candidates against the current DOM.
const dom = await page.locator('main').innerHTML();
const candidates = await ai.heal({ original: '#sku-1842', intent: 'click', dom });
await page.locator(candidates[0]).click();    // try the best one
log.heal({ original: '#sku-1842', healed: candidates[0] });
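The snippet above only tries the top candidate. A slightly fuller sketch of the wrapper, trying each of the three candidates in rank order and throwing only when all fail, could look like this. HealDeps and clickWithHeal are hypothetical names: suggest stands in for the ai.heal call and tryClick for a Playwright click that rejects on not-found / not-visible.

```typescript
// Hypothetical dependencies, injected so the flow can run without a browser.
interface HealDeps {
  tryClick: (selector: string) => Promise<void>;    // rejects when the locator fails
  suggest: (original: string) => Promise<string[]>; // ranked fallback selectors
  logHeal: (entry: { original: string; healed: string }) => void;
}

// Returns the selector that actually worked.
async function clickWithHeal(original: string, deps: HealDeps): Promise<string> {
  try {
    await deps.tryClick(original);
    return original; // primary worked, nothing to heal
  } catch {
    const candidates = (await deps.suggest(original)).slice(0, 3);
    for (const candidate of candidates) {
      try {
        await deps.tryClick(candidate);
        // Log only: the spec source stays untouched until a human promotes the heal.
        deps.logHeal({ original, healed: candidate });
        return candidate;
      } catch {
        // This candidate failed too; fall through to the next one.
      }
    }
    throw new Error('locator ' + original + ' failed and no fallback matched');
  }
}
```

Keeping the heal in a log entry rather than rewriting the spec is what stops a heal from silently masking a real regression.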

Try it on TTA. The TTACart-AI sandbox has a "heal" demo card - pass a corrupted DOM snippet and watch which of the 3 fallbacks the AI picks.

Exercises

Corrupt the DOM. Pass a snippet where the original id is renamed but the label text is intact. Reason about why the AI's fallback uses getByText rather than another id.
Stability ranking. Force the AI to return 3 candidates: an xpath, an nth-child css, and a getByRole. Rank them by likely-stability and explain.
Heal explosion. Run a TTACart suite where every spec fails locator-wise. How many heals fire? Is the budget guard catching them?
Promotion PR. Read logs/heals.jsonl after a run, group by frequency, and draft a PR that promotes the top 3 heals into source.

5. Prompt to spec - natural language to .spec.ts

You write the test in English. The agent reads the target page (or a recorded session), generates a real Playwright spec file, runs it, captures the result. Spec files live in source control - no AI at test time.

English intent | DOM snapshot | trace replay | generated .spec.ts | human review gate | no AI at test time | stable + replayable

This is the design pattern most teams should default to: AI helps you write the test, but the test itself is plain Playwright code that runs without any LLM dependency in CI. You get the speed of AI authoring with the determinism of vanilla Playwright at runtime. The same generator can re-emit specs when the page changes - you keep the English prompt as source of truth, treat generated .spec.ts as build output.
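One way to keep the generated file reproducible is to have the LLM return steps as data and let a deterministic renderer produce the TypeScript. The sketch below shows only that renderer. SpecStep and renderSpec are assumptions for this example, not the generator used in the TTA framework; the emitted code uses standard Playwright calls (page.goto, getByRole, getByLabel, getByText).

```typescript
// Hypothetical intermediate format: the LLM fills this in, the renderer is plain code.
type SpecStep =
  | { op: 'goto'; url: string }
  | { op: 'click'; role: string; name: string }
  | { op: 'fill'; label: string; value: string }
  | { op: 'expectText'; text: string };

// JSON.stringify gives a safely quoted string literal for the generated source.
const q = (s: string): string => JSON.stringify(s);

function renderStep(s: SpecStep): string {
  switch (s.op) {
    case 'goto':
      return `  await page.goto(${q(s.url)});`;
    case 'click':
      return `  await page.getByRole(${q(s.role)}, { name: ${q(s.name)} }).click();`;
    case 'fill':
      return `  await page.getByLabel(${q(s.label)}).fill(${q(s.value)});`;
    case 'expectText':
      return `  await expect(page.getByText(${q(s.text)})).toBeVisible();`;
  }
}

// Same steps in, same source text out: no LLM call at render time.
function renderSpec(title: string, steps: SpecStep[]): string {
  return [
    `import { test, expect } from '@playwright/test';`,
    ``,
    `test(${q(title)}, async ({ page }) => {`,
    ...steps.map(renderStep),
    `});`,
    ``,
  ].join('\n');
}
```

With this split, the non-deterministic part (the LLM choosing steps) is reviewed once as data, and the .spec.ts that CI runs is plain Playwright code.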

How we use this in the TTA framework + AI build is described in Framework + AI #features (prompt-to-spec card) with a sequence diagram. The TTACart-AI sandbox includes a live generator demo.

Try it on TTA. Use the prompt-to-spec demo card on TTACart-AI. Give it 3 scenarios and read the generated TypeScript.

Exercises

Three flavors. Generate the same TTACart login spec three times - with locator-only output, with full POM output, and with fixture-based output. Compare ergonomics.
Re-emit on drift. Change a TTACart label, re-run the generator, diff the new spec vs. the old one. Did the agent update only the affected lines?
Determinism check. Run the generator 3 times with the same prompt + same DOM. Assert the diff is empty (you'll need temperature=0).
Human gate. Add an approval step - the agent emits a draft PR, you review, then a CI step merges. What does the PR description look like?

6. Visual reasoning - the vision LLM as a tester

Instead of feeding the LLM DOM text, you feed it a screenshot. The model can verify layout, spot z-index bugs, catch dark-mode contrast issues, and even reason about which button to click when the DOM is opaque (canvases, custom drawn UI).

screenshot in | bounding box out | layout verification | contrast check | canvas drawing UI | expensive - cap per run

Vision is the right tool when the DOM lies - canvas-rendered charts, custom GL renderers, PDF viewers, drawn-on-canvas tables. It's also useful for layout assertions ("the price label must be within 20px below the SKU title"). The trade-off: a vision call is 5-20x the token cost of a DOM call, so most teams cap it to once per spec, used only when DOM-based reasoning fails.

A standard recipe: page.screenshot() -> send to vision LLM with a tight prompt ("return the bounding box of the green checkout button in pixels") -> convert pixels back to Playwright page.mouse.click(x, y). Brittle? Yes. Catches bugs DOM-only tests miss? Also yes.

vision-click illustration (8 lines)

// scale: 'css' gives one image pixel per CSS pixel, so the returned coordinates map directly onto page.mouse.
const png = await page.screenshot({ fullPage: false, scale: 'css' });
const box = await ai.visionLocate({
  image: png,
  intent: 'the green Checkout button at the bottom of the cart',
});
await page.mouse.click(box.cx, box.cy);
await expect(page).toHaveURL(/\/payment/);
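Since a vision model can return a box that is malformed or off-screen, it is worth checking the answer before clicking. The sketch below assumes the model returns a pixel box with x, y, width and height in screenshot coordinates; that shape, and the names PixelBox and clickPoint, are assumptions for this example.

```typescript
// Hypothetical shape for the model's answer, in screenshot pixel coordinates.
interface PixelBox { x: number; y: number; width: number; height: number }
interface Size { width: number; height: number }

// Return the centre of the box as a click point, or throw if the box is
// malformed or its centre falls outside the screenshot.
function clickPoint(box: PixelBox, screenshot: Size): { cx: number; cy: number } {
  const values = [box.x, box.y, box.width, box.height];
  if (!values.every((v) => Number.isFinite(v))) {
    throw new Error('box contains non-numeric values');
  }
  if (box.width <= 0 || box.height <= 0) {
    throw new Error('box has no area');
  }
  const cx = Math.round(box.x + box.width / 2);
  const cy = Math.round(box.y + box.height / 2);
  if (cx < 0 || cy < 0 || cx >= screenshot.width || cy >= screenshot.height) {
    throw new Error('box centre lies outside the screenshot');
  }
  return { cx, cy };
}
```

For a box at (100, 200) sized 80 x 40 inside a 1280 x 720 screenshot, the click point is (140, 220). A thrown error here is cheaper to debug than a click that lands on the wrong element.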

Exercises

Layout assertion. Take a TTACart screenshot at 320px viewport and ask the vision LLM to flag elements that overflow the viewport. Compare against axe-core output.
Canvas test. Add a canvas-rendered "verify chart legend reads X" assertion. Show how DOM-only Playwright can't do this without vision.
Cost report. Run a 50-test suite with vision enabled vs. disabled. Compute extra tokens and decide which 5 tests deserve vision.
Hybrid spec. Use vision once to find the button, cache the resulting (x, y) for the next 9 runs. Watch the cache get stale on layout change.

7. Cost + privacy guards

Five guards that stop AI agents from leaking customer data or running up the LLM bill. Every one of them lives in your AIClient wrapper, not in the upstream provider library.

serverless proxy | PII redactor | budget hard-stop | response cache | audit log | local provider fallback | vision opt-in only

Proxy every cloud call. Route requests through a Cloudflare Pages function or equivalent. Keys never touch the test machine. We do this already for our chat demo via functions/api/chat.js.
Redact before send. Strip emails, phone numbers, postal codes, card last-4s from DOM snapshots and trace summaries. Local providers (Ollama, LM Studio) skip the redactor - data never leaves the box anyway.
Hard token budget. A per-run cap on input + output tokens. Hit the cap, fail fast. No surprise $500 bills.
Cache responses. Hash {model, system, prompt}, store the result, return it on cache hit. Same query in CI never pays twice.
Audit every prompt + response. Append-only log so a security review can answer "what did we ask the model about prod data last Tuesday".
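Two of these guards, the hard token budget and the response cache, fit into a few lines of wrapper code. This is a minimal sketch, not the AIClient from the framework: GuardedClient, Provider and Completion are hypothetical names, and the cache key is a plain JSON string where a real wrapper would hash {model, system, prompt}.

```typescript
// Hypothetical types; Provider stands in for whichever SDK call reaches the model.
interface Completion { text: string; inputTokens: number; outputTokens: number }
type Provider = (model: string, system: string, prompt: string) => Promise<Completion>;

class GuardedClient {
  private used = 0;
  private readonly cache = new Map<string, Completion>();
  private readonly provider: Provider;
  private readonly tokenBudget: number;

  constructor(provider: Provider, tokenBudget: number) {
    this.provider = provider;
    this.tokenBudget = tokenBudget;
  }

  get tokensUsed(): number {
    return this.used;
  }

  async complete(model: string, system: string, prompt: string): Promise<Completion> {
    const key = JSON.stringify([model, system, prompt]); // stands in for a real hash
    const hit = this.cache.get(key);
    if (hit) return hit; // cache hit: no provider call, no tokens spent
    if (this.used >= this.tokenBudget) {
      // Fail fast once the per-run budget is spent.
      throw new Error('token budget of ' + this.tokenBudget + ' exhausted');
    }
    const result = await this.provider(model, system, prompt);
    this.used += result.inputTokens + result.outputTokens;
    this.cache.set(key, result);
    return result;
  }
}
```

Note the design choice: the budget is checked before each call, so the call that crosses the cap still completes and the next one is refused. A stricter variant would estimate the prompt's tokens up front and refuse before sending.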

Full implementation in Framework + AI #guards. The principle: AI agents are not exempt from the same data handling rules as the rest of your test suite. They get extra scrutiny, not less.

Privacy note. Do not pass production user data to a cloud LLM during testing. Synthesise data, or run on a local provider. If you must use real data, redact first and confirm the redactor catches every PII pattern in your dataset - not just the obvious ones.
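A redactor can start as a short list of patterns applied in order. The three patterns below (email, phone, UK postcode) are illustrative assumptions and deliberately loose; they are not a complete PII detector and do not cover card last-4s, so treat this as a starting point for the unit-test exercise rather than something to ship.

```typescript
// Illustrative patterns only; tune and extend them against your own dataset.
const PII_PATTERNS: Array<[string, RegExp]> = [
  // Anything@anything.tld without whitespace, so non-ASCII addresses are caught too.
  ['EMAIL', /[^\s@<>"',;()]+@[^\s@<>"',;().]+(?:\.[^\s@<>"',;().]+)+/g],
  // Optional +, then at least 10 digits with spaces, brackets or hyphens in between.
  ['PHONE', /\+?\d[\d ()-]{8,}\d/g],
  // UK postcode such as SW1A 1AA (upper case only).
  ['UK_POSTCODE', /\b[A-Z]{1,2}\d[A-Z\d]?\s?\d[A-Z]{2}\b/g],
];

// Replace every match with a bracketed label, applying the patterns in order.
function redact(text: string): string {
  return PII_PATTERNS.reduce(
    (out, [label, pattern]) => out.replace(pattern, '[' + label + ']'),
    text,
  );
}
```

Run it over a DOM snapshot or trace summary just before the request leaves the wrapper, and keep the raw text out of the audit log as well.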

Exercises

Audit log diff. Run a TTACart suite twice with caching on. Read the audit log and confirm the second run made zero cloud calls.
Redactor unit test. Write 20 inputs (including unicode emails, +91 phone numbers, UK postcodes) and assert your redactor removes them all.
Budget breach. Set the budget to 100 tokens, run a heal-heavy suite. What happens at the boundary?
Provider switch. Same spec, run on cloud DeepSeek vs. local Ollama. Compare time, cost, accuracy.

8. When NOT to use AI agents

AI is not free. For stable selectors, deterministic flows, and well-defined assertions, vanilla Playwright is faster, cheaper, and easier to debug. Use this checklist to decide.

stable test-ids | tight runtime budget | 100% determinism | strict CI quotas | air-gapped env | audit-sensitive flows

You already have stable test-ids. A data-testid attribute on every interactive element makes AI redundant. Use the test-id.
You need byte-identical reruns. Even temperature=0 LLM calls have provider drift. If you cannot accept any non-determinism, generate the spec once and check it in.
Runtime budget is under 30s per spec. First-run AI calls add 2-10s each. Cache helps, but cold starts will blow the budget.
You can't ship traffic to a 3rd party. Healthcare, finance, defense. Run a local model or skip AI entirely.
Your DOM is small and stable. A 50-element TTAStays booking form rarely needs healing. Save the AI for the 5,000-element data table.

Good heuristic: use AI to generate tests, use vanilla Playwright to run them. The generated spec is yours to read, edit, commit, and replay forever.

Exercises

Decision matrix. Score 5 TTACart specs on the 5 checklist items above. Decide AI-or-not per spec.
Cold start measurement. Time 10 specs with AI enabled (cold cache) vs. AI disabled. Plot the distribution.
Forbidden flow. Pick one TTACart spec that touches a payment-like flow. Explain why AI must not run on it.

Provider switcher - 7 models behind one interface

Same as the multi-model adapter in the Framework + AI doc. Switch by setting process.env.TTA_AI_MODEL. Local providers (Ollama, LM Studio) need no key - data never leaves the machine.

deepseek-chat (DeepSeek, default) - Cheap + fast. Good for heal calls and trace summaries.

claude-3-5-sonnet-20241022 (Anthropic) - Strong at multi-step reasoning. Use for prompt-to-spec and agent loops where determinism matters most.

gpt-4o-mini (OpenAI) - Reliable JSON output, fast. A solid fallback when DeepSeek hits rate limits.

gemini-1.5-flash (Gemini) - Long context window - handy for whole-page DOM snapshots without trimming.

mistral-large-latest (Mistral) - European data residency option. Comparable reasoning, predictable pricing.

ollama:llama3.1 (Ollama, local) - No key, no network. Air-gapped CI and privacy-sensitive flows.

lmstudio:any (LM Studio, local) - Same idea as Ollama, GUI-driven model picker on dev laptops.

Full table with endpoints + auth in Framework + AI #adapter.
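As a rough sketch of how one model id can select a provider, the resolver below maps the ids listed above to a provider name and a local flag. The prefix rules, the ResolvedModel shape and the resolveModel name are assumptions for this example; the real adapter may route differently.

```typescript
// Provider names follow the list above; the routing rules are illustrative.
type ProviderName =
  | 'deepseek' | 'anthropic' | 'openai' | 'gemini' | 'mistral' | 'ollama' | 'lmstudio';

interface ResolvedModel { provider: ProviderName; model: string; local: boolean }

const DEFAULT_MODEL = 'deepseek-chat';
const OLLAMA_PREFIX = 'ollama:';
const LMSTUDIO_PREFIX = 'lmstudio:';

// Map a model id (for example the value of TTA_AI_MODEL) to a provider.
function resolveModel(id: string | undefined): ResolvedModel {
  const value = id !== undefined && id.trim() !== '' ? id.trim() : DEFAULT_MODEL;
  if (value.startsWith(OLLAMA_PREFIX)) {
    return { provider: 'ollama', model: value.slice(OLLAMA_PREFIX.length), local: true };
  }
  if (value.startsWith(LMSTUDIO_PREFIX)) {
    return { provider: 'lmstudio', model: value.slice(LMSTUDIO_PREFIX.length), local: true };
  }
  if (value.startsWith('deepseek')) return { provider: 'deepseek', model: value, local: false };
  if (value.startsWith('claude')) return { provider: 'anthropic', model: value, local: false };
  if (value.startsWith('gpt')) return { provider: 'openai', model: value, local: false };
  if (value.startsWith('gemini')) return { provider: 'gemini', model: value, local: false };
  if (value.startsWith('mistral')) return { provider: 'mistral', model: value, local: false };
  throw new Error('unknown model id: ' + value);
}
```

In a Node test run this would typically be called as resolveModel(process.env.TTA_AI_MODEL). The local flag is what lets the wrapper skip the API key and the redactor for Ollama and LM Studio.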

Diagram - the agent loop

One mermaid diagram. Perceive, plan, act, observe, decide-to-continue. Vanilla Playwright runs the action; the LLM only owns the plan step.

Agent loop on top of Playwright

Steps in green run inside Playwright. Steps in violet are LLM calls. The loop exits when the goal is satisfied, the budget is exhausted, or max_steps is hit.

flowchart LR
  G[Goal in English] --> P[Perceive: DOM / ARIA / screenshot]
  P --> LP[LLM: propose next action]
  LP --> A[Playwright: click / fill / scroll / extract]
  A --> O[Observe: assertion or page advance]
  O -- goal done --> Z[finish + emit trace]
  O -- not yet, retry --> LP
  O -- failed, replan --> RP[LLM: rewrite plan with new context]
  RP --> A
  O -- budget or max_steps --> X[fail with last screenshot + log]

  classDef pw fill:#d1fae5,stroke:#16a34a,color:#111
  classDef ai fill:#ede9fe,stroke:#8b5cf6,color:#111
  classDef end1 fill:#fef9c3,stroke:#f59e0b,color:#111
  class P,A,O pw
  class LP,RP ai
  class Z,X end1
