Playwright AI agents - LLM-driven authoring and execution
An AI agent is a small loop that reads the page, picks the next step, runs it through Playwright, and
checks the result. You stop writing rigid page.locator('#sku-1842') calls and start
writing intents - "click the green checkout button", "fill the email field with a fresh
seeded value", "verify the cart total matches the receipt". This page covers eight
concepts you need to use agents safely on top of Playwright, with try-on-TTA exercises that hit our
TTACart-AI sandbox.
Draft - private preview - not yet in the main sidebar. This page is being built
ahead of the Lecture: Playwright AI Agents session. Treat it as a working notebook, not a launched
doc. Links to external repos point only at the public Playwright API + agent libraries.
1. What is a Playwright AI agent
An LLM-driven runner that turns natural-language goals into Playwright actions and verifies the result. Think test author + test runner glued together by a small reasoning loop.
goal -> action plan · DOM as context · tool calling · observation step · retry on failure · structured output · deterministic seed (sometimes) · budget guard
The agent runs in roughly four phases: perceive (snapshot DOM, accessibility tree, or
screenshot), plan (LLM proposes the next Playwright action - click, fill, expect),
act (we call the actual Playwright API), and observe (did the action succeed, did the
page advance, is the assertion green). A failed observation feeds back into the next plan call so
the loop self-corrects.
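The four phases can be sketched as a small loop. This is a minimal illustration, not the API of any agent library: AgentDeps, runAgent, and the Action / Observation shapes are hypothetical names, and the four callbacks stand in for a real page snapshot, an LLM call, a Playwright action, and an assertion.

```typescript
// Hypothetical types for illustration; they do not come from Playwright.
type Action = { kind: "click" | "fill" | "expect"; target: string; value?: string };
type Observation = { ok: boolean; goalDone: boolean; detail: string };

interface AgentDeps {
  perceive: () => Promise<string>; // e.g. an accessibility snapshot as text
  plan: (goal: string, context: string, lastFailure: string | null) => Promise<Action>; // LLM call
  act: (action: Action) => Promise<void>; // wraps the real Playwright call
  observe: (action: Action) => Promise<Observation>; // assertion or page-advance check
}

async function runAgent(
  goal: string,
  deps: AgentDeps,
  maxSteps = 10,
): Promise<{ done: boolean; steps: number }> {
  let lastFailure: string | null = null;
  for (let step = 1; step <= maxSteps; step++) {
    const context = await deps.perceive();
    const action = await deps.plan(goal, context, lastFailure);
    let observation: Observation;
    try {
      await deps.act(action);
      observation = await deps.observe(action);
    } catch (err) {
      // A thrown Playwright error is treated as a failed observation.
      observation = { ok: false, goalDone: false, detail: String(err) };
    }
    if (observation.goalDone) return { done: true, steps: step };
    // Feed the failure into the next plan call so the loop can self-correct.
    lastFailure = observation.ok ? null : observation.detail;
  }
  return { done: false, steps: maxSteps };
}
```

Only plan is LLM-driven here; perceive, act, and observe stay deterministic, which makes it easy to mark each step when mapping the loop in the exercises.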
The two design choices that change everything: (a) what you feed the LLM as page context -
full DOM, ARIA tree, screenshot, or all three - and (b) how strictly you constrain the
action it can return - free text vs JSON tool calls vs typed Playwright commands. Strict tools +
ARIA tree is the cheap, fast, boring default. Free-form vision is the slow, expensive, brilliant
backup.
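One way to enforce the strict end of choice (b) is to validate every LLM reply before it reaches Playwright. The sketch below assumes the {action, selector, value} shape used in the exercises; AgentAction and parseAction are hypothetical names, and a real project might use a schema library instead of hand-written checks.

```typescript
// Hypothetical action shape, mirroring {action, selector, value} from the exercises.
interface AgentAction {
  action: "click" | "fill" | "expect";
  selector: string;
  value?: string;
}

function parseAction(raw: string): AgentAction {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch {
    throw new Error("LLM response is not valid JSON");
  }
  if (typeof data !== "object" || data === null) {
    throw new Error("expected a JSON object");
  }
  const { action, selector, value } = data as Record<string, unknown>;
  if (typeof selector !== "string" || selector.length === 0) {
    throw new Error("selector must be a non-empty string");
  }
  if (action === "click") return { action, selector };
  if (action === "fill" || action === "expect") {
    // fill and expect are meaningless without something to type or compare.
    if (typeof value !== "string") throw new Error(`${action} requires a string value`);
    return { action, selector, value };
  }
  throw new Error(`unknown action: ${String(action)}`);
}
```

Rejecting anything that fails validation keeps free-text answers ("Sure, I will click the button") from ever being interpreted as an action.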
Try it on TTA. Open the TTACart-AI sandbox and watch the 5 mini-demos that ship
with it - each one is a different way to wire an agent into Playwright.
Exercises
Map the loop. Sketch the perceive-plan-act-observe loop for "log in to TTACart, add the highlighted SKU to cart, verify total = 999". Mark which step is deterministic and which is LLM-driven.
Compare contexts. Drive the same TTACart "add to cart" flow once using the full HTML, once using only the accessibility tree, once using a screenshot. Compare token cost.
Force structured output. Define a JSON schema for {action, selector, value} and prompt the LLM to fill it. Reject any response that fails schema validation.
Inject a deliberate bug. Rename the checkout button to "Pay now" in the DOM via DevTools, then re-run the agent. Does the loop recover by re-reading the page, or does it hard-fail?
2. Stagehand-style API on top of Playwright
A thin wrapper that adds ai.act, ai.observe, and ai.extract to your Playwright Page. You keep all the existing page.goto, expect, page.locator APIs and only reach for AI when the locator is fragile or unknown.
ai.act(intent) · ai.click(text) · ai.fill(label, value) · ai.extract(schema) · ai.observe() · cache resolved selectors · fall back to Playwright
The mental model: stop the agent from generating code, let it generate resolved
locators at runtime. The first time the agent sees "the green checkout button" it asks the
LLM, gets back a CSS / ARIA locator, runs it, caches it. Next run it uses the cached locator and
skips the LLM entirely. Cost amortises to near-zero after the first green test run.
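The resolve-once, reuse-afterwards idea can be sketched with a small cache keyed by intent. LocatorCache and the Resolver signature are hypothetical; the sketch keeps the cache in memory only, so persisting it between runs (which the amortised-cost claim depends on) would need a file or similar store on top.

```typescript
// Hypothetical resolver: takes an English intent, returns a selector string via an LLM.
type Resolver = (intent: string) => Promise<string>;

class LocatorCache {
  private readonly resolved = new Map<string, string>();
  private readonly askLlm: Resolver;
  llmCalls = 0; // exposed so a test or cost report can count paid calls

  constructor(askLlm: Resolver) {
    this.askLlm = askLlm;
  }

  async resolve(intent: string): Promise<string> {
    const hit = this.resolved.get(intent);
    if (hit !== undefined) return hit; // cached: no LLM call
    this.llmCalls++;
    const selector = await this.askLlm(intent);
    this.resolved.set(intent, selector);
    return selector;
  }

  // Call when a cached selector stops matching, so the next resolve re-asks the LLM.
  invalidate(intent: string): void {
    this.resolved.delete(intent);
  }
}
```

A wrapper such as ai.act would call resolve, run the selector through page.locator, and call invalidate if the locator no longer matches.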
example.spec.ts (5 line illustration)
// Illustrative TTA snippet, not a copy from upstream.
await page.goto('https://app.thetestingacademy.com/playwright/ttacart-ai/');
await ai.act('add the highlighted SKU to the cart');
await ai.act('open the cart drawer');
const total = await ai.extract({ schema: { total: 'number' } });
expect(total.total).toBeGreaterThan(0);
Try it on TTA. Use the Stagehand-style demo card on TTACart-AI - it shows the
same flow as the snippet above against our real DOM.
Exercises
Replace 5 locators. Take one TTACart spec and rewrite 5 brittle page.locator calls as ai.act intents. Measure the first run cost vs. cached run cost.
Strict schema extract. Write an ai.extract call that returns { items: Array<{ sku, qty, price }> } from the TTACart cart drawer. Fail the spec if the schema mismatches.
Hybrid spec. Use ai.act only for the unstable steps (search, add-to-cart) and page.locator for the stable steps (login form). Diff the test runtime.
Stale cache test. Cache a resolved locator, rename the matching attribute in DevTools, re-run, and watch the cache invalidate + re-ask the LLM.
3. browser-use and web-agent libraries
A different shape - instead of decorating Playwright with AI calls, you hand the entire browser to the agent and let it drive on its own. The agent generates a multi-step plan, executes step by step, re-plans on error.
agent.run(goal) · action plan list · page state digest · history of past steps · replan on error · max_steps cap · screenshot every step
The agent gets one goal ("book the cheapest TTAStays room for next weekend") and gets to choose
every Playwright action - including click, fill, scroll,
extract, and finish. The library exposes Playwright primitives as tools,
and the LLM decides the sequence. Where Stagehand is a hammer, browser-use is a robot arm.
Cost goes up fast - every step is a fresh LLM call - so production usage caps max_steps
(typically 10-30), runs a cheap small model first, and only escalates to a large model on retry.
Some libraries cache the final action plan and let you replay it deterministically the next run -
you got the AI to "discover" the test, and now you can re-run it like a normal Playwright script.
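Plan capture and replay can be sketched as plain JSON plus a runner callback. The PlannedStep format below is an assumption made for illustration (real libraries define their own plan formats), and the ToolRunner callback stands in for whatever maps a step onto a Playwright call.

```typescript
// Hypothetical plan format; not taken from any specific agent library.
interface PlannedStep {
  tool: "click" | "fill" | "scroll" | "extract" | "finish";
  args: Record<string, string>;
}

type ToolRunner = (step: PlannedStep) => Promise<void>;

// Save the steps of a successful agent run so they can be checked in or cached.
function serializePlan(steps: PlannedStep[]): string {
  return JSON.stringify(steps, null, 2);
}

// Replay a saved plan in order with no LLM calls. Returns how many steps ran.
async function replayPlan(json: string, run: ToolRunner, maxSteps = 30): Promise<number> {
  const steps = JSON.parse(json) as PlannedStep[];
  if (steps.length > maxSteps) {
    throw new Error(`plan has ${steps.length} steps, cap is ${maxSteps}`);
  }
  let executed = 0;
  for (const step of steps) {
    if (step.tool === "finish") break; // finish marks the end of the plan
    await run(step);
    executed++;
  }
  return executed;
}
```

Replay is deterministic only as long as the page still matches the recorded steps; a failed replay is the signal to hand the goal back to the agent.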
Try it on TTA. The TTACart-AI sandbox has a smart-data demo button that
generates 10 edge-case checkouts - same idea, scoped to a single page.
Exercises
Generate 10 checkouts. Use the smart-data button on TTACart-AI to produce 10 edge-case checkout payloads (Unicode names, very long addresses, 0 quantity). Assert the orders table renders all 10 rows cleanly.
Step cap. Set max_steps=8 and goal "find the SKU on page 3 of TTACart and add it to cart". What does the agent do when it hits the cap before finding the SKU?
Plan replay. Capture the action plan from one successful run as JSON. Replay it deterministically without any LLM calls. Verify the run is byte-identical.
Fail injection. Mid-run, break the network for 2 seconds (via Playwright route handler). Does the agent retry, re-plan, or give up?
Cost report. Sum tokens used by the agent across 5 different goals. Pick the cheapest model that still hits all 5.
4. Self-healing locators
One of the most popular AI features in Playwright tooling. When a locator fails, the AI proposes 3 fallback locators using the current DOM, ranked by stability. You retry with the top pick and log the heal so a human can promote it later.
primary fails · redact DOM · 3 fallback candidates · stability ranking · retry top pick · log to heals.jsonl · human promotion gate
The flow lives one layer below your test: the locator wrapper catches not-found / not-visible,
asks an AI client for 3 candidates, applies the top one, and only throws if all three fail.
The healed locator is written to a log file but the source spec is left untouched - that way a
heal never silently masks a real regression. A weekly maintainer PR promotes the most-used heals
into the codebase as real selectors.
Detailed write-up of how we wire this in V2 lives in
Framework + AI #heal, plus the
Top 5 AI features card with a sequence
diagram and an excerpt from UtilElementLocator.ts.
illustrative heal call (5 lines)
// On primary locator fail, ask AI for 3 candidates against the current DOM.
const dom = await page.locator('main').innerHTML();
const candidates = await ai.heal({ original: '#sku-1842', intent: 'click', dom });
await page.locator(candidates[0]).click(); // try the best one
log.heal({ original: '#sku-1842', healed: candidates[0] });
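The snippet above tries only the top candidate. A sketch of the fuller behaviour described earlier, walking all three candidates and throwing only when every one fails, could look like this. healAndRetry, HealRecord, and the two callbacks are hypothetical names; tryLocator stands in for the real Playwright action and logHeal for the append to the heal log.

```typescript
interface HealRecord {
  original: string;
  healed: string;
  rank: number; // 1 = top-ranked candidate
}

// tryLocator is a hypothetical callback: in a real wrapper it would run the
// Playwright action and reject when the element is not found or not visible.
async function healAndRetry(
  original: string,
  candidates: string[],
  tryLocator: (selector: string) => Promise<void>,
  logHeal: (record: HealRecord) => void,
): Promise<string> {
  const errors: string[] = [];
  let rank = 0;
  for (const candidate of candidates.slice(0, 3)) {
    rank++;
    try {
      await tryLocator(candidate);
      // Log the heal; the source spec itself is left untouched.
      logHeal({ original, healed: candidate, rank });
      return candidate;
    } catch (err) {
      errors.push(`${candidate}: ${String(err)}`);
    }
  }
  throw new Error(`all heal candidates failed for ${original}\n${errors.join("\n")}`);
}
```

Recording the rank alongside the healed selector gives the promotion PR a hint about how often the top pick is actually the one that works.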
Try it on TTA. The TTACart-AI sandbox has a "heal" demo card - pass a corrupted
DOM snippet and watch which of the 3 fallbacks the AI picks.
Exercises
Corrupt the DOM. Pass a snippet where the original id is renamed but the label text is intact. Reason about why the AI's fallback uses getByText rather than another id.
Stability ranking. Force the AI to return 3 candidates: an xpath, an nth-child css, and a getByRole. Rank them by likely-stability and explain.
Heal explosion. Run a TTACart suite where every spec fails locator-wise. How many heals fire? Is the budget guard catching them?
Promotion PR. Read logs/heals.jsonl after a run, group by frequency, and draft a PR that promotes the top 3 heals into source.
5. Prompt to spec - natural language to .spec.ts
You write the test in English. The agent reads the target page (or a recorded session), generates a real Playwright spec file, runs it, captures the result. Spec files live in source control - no AI at test time.
English intent · DOM snapshot · trace replay · generated .spec.ts · human review gate · no AI at test time · stable + replayable
This is the design pattern most teams should default to: AI helps you write the test, but
the test itself is plain Playwright code that runs without any LLM dependency in CI. You get the
speed of AI authoring with the determinism of vanilla Playwright at runtime. The same generator
can re-emit specs when the page changes - you keep the English prompt as source of truth, treat
generated .spec.ts as build output.
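To make "generated .spec.ts as build output" concrete, here is a minimal sketch of the last stage of such a generator: turning a structured step list into spec text. SpecStep and emitSpec are hypothetical, and the step list is assumed to come from the LLM stage; the emitted code uses the standard Playwright getByRole, getByLabel, and getByText locators.

```typescript
// Hypothetical intermediate format produced from the English prompt.
type SpecStep =
  | { kind: "goto"; url: string }
  | { kind: "click"; role: string; name: string }
  | { kind: "fill"; label: string; value: string }
  | { kind: "expectText"; text: string };

// JSON.stringify yields a correctly escaped string literal for the generated code.
const q = (s: string): string => JSON.stringify(s);

function emitSpec(title: string, steps: SpecStep[]): string {
  const body = steps.map((step) => {
    switch (step.kind) {
      case "goto":
        return `  await page.goto(${q(step.url)});`;
      case "click":
        return `  await page.getByRole(${q(step.role)}, { name: ${q(step.name)} }).click();`;
      case "fill":
        return `  await page.getByLabel(${q(step.label)}).fill(${q(step.value)});`;
      case "expectText":
        return `  await expect(page.getByText(${q(step.text)})).toBeVisible();`;
    }
  });
  return [
    `import { test, expect } from '@playwright/test';`,
    ``,
    `test(${q(title)}, async ({ page }) => {`,
    ...body,
    `});`,
    ``,
  ].join("\n");
}
```

Because the emitter is a pure function of the step list, the same steps always produce the same file, which keeps diffs small when the generator re-emits a spec after a page change.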
How we use this in the TTA framework + AI build is described in
Framework + AI #features (prompt-to-spec card)
with a sequence diagram. The TTACart-AI sandbox includes a live generator demo.
Try it on TTA. Use the prompt-to-spec demo card on TTACart-AI. Give it 3
scenarios and read the generated TypeScript.
Exercises
Three flavors. Generate the same TTACart login spec three times - with locator-only output, with full POM output, and with fixture-based output. Compare ergonomics.
Re-emit on drift. Change a TTACart label, re-run the generator, diff the new spec vs. the old one. Did the agent update only the affected lines?
Determinism check. Run the generator 3 times with the same prompt + same DOM. Assert the diff is empty (you'll need temperature=0).
Human gate. Add an approval step - the agent emits a draft PR, you review, then a CI step merges. What does the PR description look like?
6. Visual reasoning - the vision LLM as a tester
Instead of feeding the LLM DOM text, you feed it a screenshot. The model can verify layout, spot z-index bugs, catch dark-mode contrast issues, and even reason about which button to click when the DOM is opaque (canvases, custom drawn UI).
screenshot in · bounding box out · layout verification · contrast check · canvas drawing UI · expensive - cap per run
Vision is the right tool when the DOM lies - canvas-rendered charts, custom GL renderers, PDF
viewers, drawn-on-canvas tables. It's also useful for layout assertions ("the price label must be
within 20px below the SKU title"). The trade-off: a vision call is 5-20x the token cost of a DOM
call, so most teams cap it to once per spec, used only when DOM-based reasoning fails.
A standard recipe: page.screenshot() -> send to vision LLM with a tight prompt
("return the bounding box of the green checkout button in pixels") -> convert pixels back to
Playwright page.mouse.click(x, y). Brittle? Yes. Catches bugs DOM-only tests miss?
Also yes.
vision-click illustration (7 lines)
const png = await page.screenshot({ fullPage: false });
const box = await ai.visionLocate({
  image: png,
  intent: 'the green Checkout button at the bottom of the cart',
});
await page.mouse.click(box.cx, box.cy);
await expect(page).toHaveURL(/\/payment/);
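One detail the pixel-to-click conversion has to handle: page.screenshot() defaults to device-scale pixels, while page.mouse.click(x, y) takes CSS pixels relative to the viewport. The helper below is a sketch of that conversion, assuming the vision model returns a box as {x, y, width, height} in screenshot pixels; PixelBox and boxToClickPoint are hypothetical names, and the cx / cy fields mirror the box.cx / box.cy used in the snippet above.

```typescript
// Bounding box in screenshot pixels, as a vision model might return it (assumed shape).
interface PixelBox {
  x: number;
  y: number;
  width: number;
  height: number;
}

function boxToClickPoint(
  box: PixelBox,
  deviceScaleFactor: number,
  viewport: { width: number; height: number },
): { cx: number; cy: number } {
  if (deviceScaleFactor <= 0) throw new Error("deviceScaleFactor must be positive");
  // Centre of the box, converted from screenshot pixels to CSS pixels.
  const cx = (box.x + box.width / 2) / deviceScaleFactor;
  const cy = (box.y + box.height / 2) / deviceScaleFactor;
  // Refuse to click outside the viewport instead of clicking something random.
  if (cx < 0 || cy < 0 || cx > viewport.width || cy > viewport.height) {
    throw new Error(`click point (${cx}, ${cy}) is outside the viewport`);
  }
  return { cx, cy };
}
```

Taking the screenshot with the scale: 'css' option makes the two coordinate systems match, in which case the scale factor passed here is simply 1.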
Exercises
Layout assertion. Take a TTACart screenshot at 320px viewport and ask the vision LLM to flag elements that overflow the viewport. Compare against axe-core output.
Canvas test. Add a canvas-rendered "verify chart legend reads X" assertion. Show how DOM-only Playwright can't do this without vision.
Cost report. Run a 50-test suite with vision enabled vs. disabled. Compute extra tokens and decide which 5 tests deserve vision.
Hybrid spec. Use vision once to find the button, cache the resulting (x, y) for the next 9 runs. Watch the cache get stale on layout change.
7. Cost + privacy guards
Five guards that stop AI agents from leaking customer data or running up the LLM bill. Every one of them lives in your AIClient wrapper, not in the upstream provider library.
serverless proxy · PII redactor · budget hard-stop · response cache · audit log · local provider fallback · vision opt-in only
Proxy every cloud call. Route requests through a Cloudflare Pages function or equivalent. Keys never touch the test machine. We do this already for our chat demo via functions/api/chat.js.
Redact before send. Strip emails, phone numbers, postal codes, card last-4s from DOM snapshots and trace summaries. Local providers (Ollama, LM Studio) skip the redactor - data never leaves the box anyway.
Hard token budget. A per-run cap on input + output tokens. Hit the cap, fail fast. No surprise $500 bills.
Cache responses. Hash {model, system, prompt}, store the result, return it on cache hit. Same query in CI never pays twice.
Audit every prompt + response. Append-only log so a security review can answer "what did we ask the model about prod data last Tuesday?".
Full implementation in Framework + AI #guards.
The principle: AI agents are not exempt from the same data handling rules as the rest of
your test suite. They get extra scrutiny, not less.
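Two of the guards, the hard token budget and the response cache, are small enough to sketch here. TokenBudget, ResponseCache, and cacheKey are hypothetical names, not the Framework + AI implementation, and the FNV-1a hash is a compact stand-in: a real wrapper would more likely use a cryptographic hash such as SHA-256 to make collisions a non-issue.

```typescript
class TokenBudget {
  private used = 0;
  private readonly limit: number;

  constructor(limit: number) {
    this.limit = limit;
  }

  // Throws when the running total would exceed the cap; the charge is then not recorded.
  charge(inputTokens: number, outputTokens: number): void {
    const next = this.used + inputTokens + outputTokens;
    if (next > this.limit) {
      throw new Error(`token budget exceeded: ${next} > ${this.limit}`);
    }
    this.used = next;
  }

  remaining(): number {
    return this.limit - this.used;
  }
}

interface CacheKeyParts {
  model: string;
  system: string;
  prompt: string;
}

// 32-bit FNV-1a over UTF-16 code units: illustrative only, see the note above.
function cacheKey(parts: CacheKeyParts): string {
  const text = JSON.stringify([parts.model, parts.system, parts.prompt]);
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash.toString(16);
}

class ResponseCache {
  private readonly store = new Map<string, string>();

  get(parts: CacheKeyParts): string | undefined {
    return this.store.get(cacheKey(parts));
  }

  set(parts: CacheKeyParts, response: string): void {
    this.store.set(cacheKey(parts), response);
  }
}
```

In a wrapper, the cache lookup would come first so that a hit costs no tokens, and charge would run on every miss using the usage numbers the provider reports.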
Privacy note. Do not pass production user data to a cloud LLM during testing.
Synthesise data, or run on a local provider. If you must use real data, redact first and confirm
the redactor catches every PII pattern in your dataset - not just the obvious ones.
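A redactor along these lines can start as a list of regular expressions applied in order. The sketch below covers only emails, card-like digit runs, and phone-like numbers; the patterns are illustrative assumptions, deliberately err on the side of over-redacting (long order IDs or timestamps may be caught by the phone rule), and do not cover postal codes. REDACTION_RULES and redact are hypothetical names.

```typescript
// Illustrative patterns only; a production redactor needs rules tuned to its dataset.
const REDACTION_RULES: { label: string; pattern: RegExp }[] = [
  // Email: anything@anything.tld, with non-ASCII characters allowed.
  { label: "EMAIL", pattern: /[^\s@<>"']+@[^\s@<>"']+\.[^\s@<>"',;:!?.()]+/g },
  // Card-like: 13 to 19 digits, optionally separated by spaces or hyphens.
  { label: "CARD", pattern: /\b(?:\d[ -]?){13,19}\b/g },
  // Phone-like: optional +, then at least 9 digits/separators, ending in a digit.
  { label: "PHONE", pattern: /\+?\d[\d\s().-]{7,}\d/g },
];

function redact(text: string): string {
  // Rules run in order, so card numbers are replaced before the phone rule sees them.
  return REDACTION_RULES.reduce(
    (out, rule) => out.replace(rule.pattern, `[${rule.label}]`),
    text,
  );
}
```

The redactor unit test from the exercises is the real safety net: each new PII shape found in the dataset becomes a failing input first and a new rule second.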
Exercises
Audit log diff. Run a TTACart suite twice with caching on. Read the audit log and confirm the second run made zero cloud calls.
Redactor unit test. Write 20 inputs (including unicode emails, +91 phone numbers, UK postcodes) and assert your redactor removes them all.
Budget breach. Set the budget to 100 tokens, run a heal-heavy suite. What happens at the boundary?
Provider switch. Same spec, run on cloud DeepSeek vs. local Ollama. Compare time, cost, accuracy.
8. When NOT to use AI agents
AI is not free. For stable selectors, deterministic flows, and well-defined assertions, vanilla Playwright is faster, cheaper, and easier to debug. Use this checklist to decide.
stable test-ids · tight runtime budget · 100% determinism · strict CI quotas · air-gapped env · audit-sensitive flows
You already have stable test-ids. A data-testid attribute on every interactive element makes AI redundant. Use the test-id.
You need byte-identical reruns. Even temperature=0 LLM calls have provider drift. If you cannot accept any non-determinism, generate the spec once and check it in.
Runtime budget is under 30s per spec. First-run AI calls add 2-10s each. Cache helps, but cold starts will blow the budget.
You can't ship traffic to a 3rd party. Healthcare, finance, defense. Run a local model or skip AI entirely.
Your DOM is small and stable. A 50-element TTAStays booking form rarely needs healing. Save the AI for the 5,000-element data table.
Good heuristic: use AI to generate tests, use vanilla Playwright to run them. The
generated spec is yours to read, edit, commit, and replay forever.
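The checklist can be encoded as a small function that returns the reasons a spec should stay on vanilla Playwright. This is a sketch: SpecProfile and reasonsToSkipAi are hypothetical names, the 30-second figure comes from the checklist, and the 500-element cut-off for a "small" DOM is an assumption chosen for illustration.

```typescript
interface SpecProfile {
  hasStableTestIds: boolean;
  needsByteIdenticalReruns: boolean;
  runtimeBudgetSeconds: number;
  thirdPartyTrafficAllowed: boolean;
  usesLocalModel: boolean;
  domElementCount: number;
}

// Assumption: the checklist gives examples (50 vs 5,000 elements), not a threshold.
const SMALL_DOM_THRESHOLD = 500;

// Returns one reason per checklist item that argues against runtime AI.
function reasonsToSkipAi(spec: SpecProfile): string[] {
  const reasons: string[] = [];
  if (spec.hasStableTestIds) reasons.push("stable test-ids already cover the locators");
  if (spec.needsByteIdenticalReruns) reasons.push("byte-identical reruns rule out runtime LLM calls");
  if (spec.runtimeBudgetSeconds < 30) reasons.push("runtime budget is under 30s per spec");
  if (!spec.thirdPartyTrafficAllowed && !spec.usesLocalModel) {
    reasons.push("traffic cannot go to a third party and no local model is configured");
  }
  if (spec.domElementCount < SMALL_DOM_THRESHOLD) reasons.push("DOM is small enough to maintain by hand");
  return reasons;
}
```

An empty list does not mean AI is required, only that none of the five items rules it out.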
Exercises
Decision matrix. Score 5 TTACart specs on the 5 checklist items above. Decide AI-or-not per spec.
Cold start measurement. Time 10 specs with AI enabled (cold cache) vs. AI disabled. Plot the distribution.
Forbidden flow. Pick one TTACart spec that touches a payment-like flow. Explain why AI must not run on it.
Provider switcher - 7 models behind one interface
Same as the multi-model adapter in the Framework + AI doc. Switch by setting process.env.TTA_AI_MODEL. Local providers (Ollama, LM Studio) need no key - data never leaves the machine.
Model ID | Provider | Notes
deepseek-chat (default) | DeepSeek | Cheap + fast. Good for heal calls and trace summaries.
claude-3-5-sonnet-20241022 | Anthropic | Strong at multi-step reasoning. Use for prompt-to-spec and agent loops where determinism matters most.
gpt-4o-mini | OpenAI | Reliable JSON output, fast. A solid fallback when DeepSeek hits rate limits.
gemini-1.5-flash | Gemini | Long context window - handy for whole-page DOM snapshots without trimming.
mistral-large-latest | Mistral | European data residency option. Comparable reasoning, predictable pricing.
ollama:llama3.1 | Ollama (local) | No key, no network. Air-gapped CI and privacy-sensitive flows.
lmstudio:any | LM Studio (local) | Same idea as Ollama, GUI-driven model picker on dev laptops.
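A switcher over these seven entries can be sketched as a lookup keyed by the TTA_AI_MODEL value. The model IDs and the default come from the list above; ProviderInfo, PROVIDERS, and resolveModel are hypothetical names, and the environment is passed in as a plain object so the function can be tested without touching the real process environment.

```typescript
interface ProviderInfo {
  provider: string;
  local: boolean; // local providers need no key and send no data off the machine
}

const PROVIDERS: Record<string, ProviderInfo> = {
  "deepseek-chat": { provider: "DeepSeek", local: false },
  "claude-3-5-sonnet-20241022": { provider: "Anthropic", local: false },
  "gpt-4o-mini": { provider: "OpenAI", local: false },
  "gemini-1.5-flash": { provider: "Gemini", local: false },
  "mistral-large-latest": { provider: "Mistral", local: false },
  "ollama:llama3.1": { provider: "Ollama", local: true },
  "lmstudio:any": { provider: "LM Studio", local: true },
};

const DEFAULT_MODEL = "deepseek-chat";

function resolveModel(
  env: Record<string, string | undefined>,
): { model: string; provider: string; local: boolean } {
  // Fall back to the default when the variable is unset or empty.
  const model = env["TTA_AI_MODEL"] || DEFAULT_MODEL;
  const info = PROVIDERS[model];
  if (!info) throw new Error(`unknown TTA_AI_MODEL: ${model}`);
  return { model, provider: info.provider, local: info.local };
}
```

In Node this would be called as resolveModel(process.env); the local flag is what a wrapper could use to decide whether the redactor and the API key lookup can be skipped.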
One mermaid diagram. Perceive, plan, act, observe, decide-to-continue. Vanilla Playwright runs the action; the LLM only owns the plan step.
Agent loop on top of Playwright
Steps in green run inside Playwright. Steps in violet are LLM calls. The loop exits when the goal is satisfied, the budget is exhausted, or max_steps is hit.
flowchart LR
G[Goal in English] --> P[Perceive: DOM / ARIA / screenshot]
P --> LP[LLM: propose next action]
LP --> A[Playwright: click / fill / scroll / extract]
A --> O[Observe: assertion or page advance]
O -- goal done --> Z[finish + emit trace]
O -- not yet, retry --> LP
O -- failed, replan --> RP[LLM: rewrite plan with new context]
RP --> A
O -- budget or max_steps --> X[fail with last screenshot + log]
classDef pw fill:#d1fae5,stroke:#16a34a,color:#111
classDef ai fill:#ede9fe,stroke:#8b5cf6,color:#111
classDef end1 fill:#fef9c3,stroke:#f59e0b,color:#111
class P,A,O pw
class LP,RP ai
class Z,X end1
Next step. Open the TTACart-AI sandbox and
step through each demo with this doc open. Then jump back to the
Framework + AI page for the production code patterns.