Codex for QA.
A practical Codex masterclass for QA engineers and SDETs — learn AI test automation, agentic testing workflows, AGENTS.md, Skills, Subagents, Hooks, MCP, Playwright, model routing, Gemini CLI, CI, and portfolio deployment.
The room you just walked into
QA is no longer the last step.
It is the agent's quality system.
Codex is OpenAI's coding agent for writing, reviewing, testing, and shipping code across the CLI, app, IDE extension, and cloud task surfaces.
For a tester, Codex is not a faster autocomplete. It reads the repo, changes files, runs the suite, drives browser tools, reviews the diff, opens PRs, and keeps a transcript of what happened. Your job becomes designing the quality loop.
Where Codex sits in your stack.
Think of Codex as a harness-native pair tester. It has repo context, shell access, patch-based edits, web search, image inputs, browser/computer use through the app, MCP connectors, and review mode. Your advantage is turning that into a repeatable QA factory.
Read
AGENTS.md, code, logs, traces, tickets, screenshots, docs.
Plan
Use /plan before editing risky code or tests.
Act
Patch files, run shell commands, call MCP tools.
Verify
Run impacted specs, review diffs, collect evidence.
Ship
PR, review, CI, deploy, monitor, document.
Four surfaces. One workflow.
Codex CLI
Terminal-first TUI. Best for repo edits, quick scripts, local tests, model switches, and repeatable QA prompts.
Codex app
Local workspaces, in-app browser, Chrome/computer use, images, automations, worktrees, and visual QA.
IDE extension
Editor-aware agenting with open files and selections as context. Good for tight code-review loops.
Codex Cloud
Delegated tasks in managed environments. Good for isolated bug fixes, PR prep, and longer work while you continue locally.
Setup · macOS / Linux / Windows
Five lines. One terminal.
    # 1. install Codex CLI
    npm i -g @openai/codex
    # 2. open your QA repo
    cd ~/work/qa-portfolio
    # 3. start Codex
    codex
    # 4. first run: sign in when prompted
    #    ChatGPT account or API key auth
    # 5. scaffold repo guidance
    > /init    → creates AGENTS.md
Run Codex from the directory you want it to understand. The first run asks you to authenticate. After that, start every serious repo by creating and editing AGENTS.md.
A session has four moving parts.
Workspace
The folder or worktree Codex can inspect and edit. Keep tasks scoped to one feature or suite.
AGENTS.md
Stable repo rules, test commands, locator policy, CI rules, review expectations.
Shell + patch + MCP
Codex edits via patches, runs commands, and reaches external tools through MCP or app connectors.
Permissions
Auto, read-only, or full access. Pick the smallest mode that can finish the task.
Three ways to steer Codex.
Plain English
Describe the job and let Codex decide tools.
> run checkout smoke, fix only locator flakes,
then show the diff and test output
Session control
Use built-ins for model, plan, review, permissions.
> /plan
> /model
> /review
Exact command
Ask Codex to run it, or run it yourself in a terminal.
    npx playwright test --grep @smoke
    npm run lint
    git diff --stat
The complete QA cheat-sheet
Every slash command a QA touches.
| Control | Use it for |
|---|---|
| /permissions | Switch between Auto, Read Only, or tighter approval requirements. |
| /model | Choose model and reasoning effort for the current task. |
| /fast | Toggle Fast service tier when available. |
| /plan | Ask for a plan before implementing. |
| /goal | Set a persistent objective for long-running work. |
| /status | Check thread state, context usage, rate limits. |
| /compact | Summarize long context and keep going. |
| /resume | Continue a saved conversation. |
| Workflows | Use it for |
|---|---|
| /init | Create an AGENTS.md scaffold. |
| /review | Review current working tree, branch, or commit. |
| /diff | Inspect local edits before committing. |
| /mcp | Inspect configured MCP servers and tools. |
| /skills | Browse and explicitly invoke skills. |
| /hooks | Review and trust lifecycle hooks. |
| /agent | Switch to a spawned subagent thread. |
| /side | Ask a side question without polluting the main thread. |
QA habit · type /plan before broad changes, /diff before claiming done, and /review before opening a PR.
Patch · Shell · Search · Browser.
Everything else is orchestration.
> find the flaky wait in tests/login.spec.ts, replace it with an assertion-based wait, run that spec 20 times, and report the pass rate.
// Codex will usually:
    rg(waitForTimeout|sleep|networkidle)
    apply_patch(tests/login.spec.ts)
    shell(npx playwright test tests/login.spec.ts --repeat-each=20)
    summarize(diff + exit code + failures)
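The search step in that transcript is easy to reproduce outside Codex when you want a quick audit. The following is a minimal sketch, assuming the spec source is already loaded as a string; `findHardWaits` and `WaitFinding` are hypothetical names, not part of Codex or Playwright.

```typescript
// Hypothetical helper: flags hard waits that commonly cause flaky specs.
// The pattern mirrors the rg search shown in the transcript.
interface WaitFinding {
  line: number;  // 1-based line number in the spec
  match: string; // the offending token
  text: string;  // trimmed source line for the report
}

const HARD_WAIT = /waitForTimeout|\bsleep\b|networkidle/;

function findHardWaits(source: string): WaitFinding[] {
  const findings: WaitFinding[] = [];
  source.split(/\r?\n/).forEach((text, index) => {
    const hit = HARD_WAIT.exec(text);
    if (hit) {
      findings.push({ line: index + 1, match: hit[0], text: text.trim() });
    }
  });
  return findings;
}

// Usage with an inline sample spec (illustrative content only).
const sampleSpec = [
  "await page.goto('/login');",
  "await page.waitForTimeout(2000);",
  "await expect(page.getByRole('heading', { name: 'Dashboard' })).toBeVisible();",
  "await page.waitForLoadState('networkidle');",
].join("\n");

const waitFindings = findHardWaits(sampleSpec);
console.log(waitFindings.map((f) => `${f.line}: ${f.match}`).join("\n"));
```

A scan like this only finds candidates. The proof is still the repeated run and its pass rate.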
Controlled edits
Small, reviewable hunks. Best for test fixes and framework changes.
Proof loop
Run tests, lint, typecheck, curl, git, trace viewers, report generators.
Current facts
Use live web search when docs, models, prices, or rules might have changed.
External systems
Browser, Jira, GitHub, Figma, docs, APIs, and team-specific tools.
Give Codex just enough autonomy.
Auto
Reads, edits, and runs commands inside the workspace. Still asks before network or outside-scope actions.
Read Only
Best for audits, root-cause analysis, onboarding, and plan-first reviews.
Full Access
Use only in trusted repos or disposable sandboxes. Powerful, expensive, and easy to regret.
The single file that changes everything
Teach Codex your QA rules once.
Codex reads AGENTS.md before doing work. It layers global guidance from ~/.codex, repo guidance from the Git root, and nested directory overrides down to your current folder.
    # QA conventions — qa-portfolio

    ## Build and test
    - Package manager: pnpm
    - Unit: pnpm test
    - E2E: npx playwright test
    - Smoke: npx playwright test --grep @smoke

    ## Locators
    - Prefer getByRole, getByLabel, getByTestId.
    - No raw xpath in committed tests.
    - No page.waitForTimeout.

    ## Done means
    - Diff reviewed.
    - Impacted test run pasted with exit code.
    - Trace/screenshot attached when browser behavior changed.
| Scope | File |
|---|---|
| Global | ~/.codex/AGENTS.md or AGENTS.override.md |
| Repo root | ./AGENTS.md |
| Nested folder | tests/e2e/AGENTS.md |
| Override | AGENTS.override.md wins in that directory |
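The layering in the table can be pictured as a walk from the Git root down to the current folder. This is a rough sketch of that idea only, assuming POSIX-style paths and leaving out the global ~/.codex layer; `guidanceCandidates` is a hypothetical function and does not reflect how Codex is actually implemented.

```typescript
// Hypothetical sketch of the lookup order described in the table; not Codex's code.
// For each directory from the repo root down to cwd, the override file is
// listed first, then the regular AGENTS.md.
function guidanceCandidates(repoRoot: string, cwd: string): string[][] {
  const root = repoRoot.replace(/\/+$/, "");
  if (cwd !== root && cwd.indexOf(root + "/") !== 0) {
    throw new Error("cwd must be inside repoRoot");
  }
  const parts = cwd
    .slice(root.length)
    .split("/")
    .filter((part) => part.length > 0);
  const dirs = [root];
  for (const part of parts) {
    dirs.push(`${dirs[dirs.length - 1]}/${part}`);
  }
  // One [override, regular] pair per directory, outermost first.
  return dirs.map((dir) => [`${dir}/AGENTS.override.md`, `${dir}/AGENTS.md`]);
}

const layers = guidanceCandidates(
  "/work/qa-portfolio",
  "/work/qa-portfolio/tests/e2e",
);
console.log(layers.map((pair) => pair[1]).join("\n"));
```

Listing the candidates this way makes it easier to see why a rule in tests/e2e/AGENTS.md can differ from the repo root without editing the root file.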
Run /init to scaffold it.
What goes inside AGENTS.md
Nine sections every QA repo needs.
    ## 1. Stack
    Playwright, TypeScript, APIRequestContext, axe.
    ## 2. Commands
    pnpm lint · pnpm test · npx playwright test.
    ## 3. Folder layout
    tests/e2e · tests/api · tests/fixtures · tests/pom.
    ## 4. Locators
    Role/label/testid first. XPath forbidden.
    ## 5. Waits
    No hard sleeps. Use assertions and expect.poll.
    ## 6. Test data
    No real PII. Use builders and env-based users.
    ## 7. Tags
    @smoke @regression @a11y @visual @flaky.
    ## 8. Review
    Prioritize bugs, regressions, missing tests.
    ## 9. Guardrails
    Ask before deleting specs, changing CI, or bumping deps.
Commands
Codex can prove work instead of guessing how to run tests.
Locators
The fastest way to stop generated tests from becoming flaky.
Data
Keeps secrets and real users out of prompts and fixtures.
Review
Turns Codex into a QA reviewer, not a style-nit machine.
Guardrails
Defines where autonomy must stop and ask.
When Codex makes a wrong assumption, do not just correct the prompt. Update AGENTS.md so the next session starts smarter.
Context is a budget. Spend it like one.
AGENTS.md
Rules that should be true next month: commands, conventions, architecture boundaries.
/compact
Summarize a long thread when it has useful decisions but too much transcript weight.
/memories
Manage useful local context learned across work, where enabled by your setup.
Current Codex model map
Pick the model for the risk, not the ego.
| Model | Best QA use | Command |
|---|---|---|
| gpt-5.5 | Hard debugging, large refactors, research-heavy QA strategy, computer use. | codex -m gpt-5.5 |
| gpt-5.4 | Professional coding and test framework work with strong reasoning. | codex -m gpt-5.4 |
| gpt-5.4-mini | Fast, lower-cost edits, subagents, simple spec generation. | codex -m gpt-5.4-mini |
| gpt-5.3-codex | Dedicated agentic coding and local code review workflows. | codex -m gpt-5.3-codex |
| gpt-5.3-codex-spark | Near-instant text-only coding iteration where available. | codex -m gpt-5.3-codex-spark |
Inside the CLI, use /model to switch mid-session and set reasoning effort. For simple subagents, a mini model is often enough. For migrations or risk-heavy changes, start with a frontier model.
Gemini, OpenAI-compatible endpoints, and local models
Use Gemini as a second expert, not a random swap.
    # install
    npm install -g @google/gemini-cli
    # run in the same repo
    gemini
    # pick a specific Gemini model
    gemini -m gemini-2.5-flash
    # non-interactive review
    gemini -p "Review tests/e2e for missing assertions" \
      --output-format json
The cleanest Gemini workflow is side-by-side: let Codex edit and verify in your repo, then ask Gemini CLI for an independent review, long-context explanation, search-grounded research, or alternative test strategy.
| Lane | Use it when |
|---|---|
| Codex | You want patch-based repo edits, code review, worktrees, goals, skills, hooks. |
| Gemini CLI | You want Google Search grounding, Gemini model behavior, another read on requirements or test gaps. |
| Gateway | Your org exposes Gemini or other models through a Responses-compatible endpoint for Codex. |
| Local | You want Ollama or LM Studio for private, lower-capability experiments. |
    # ~/.codex/config.toml
    model_provider = "qa-gateway"
    model = "gemini/gemini-2.5-pro"

    [model_providers.qa-gateway]
    name = "QA model gateway"
    base_url = "https://gateway.example.com/v1"
    env_key = "QA_GATEWAY_API_KEY"
    wire_api = "responses"
    supports_websockets = false
Important · Codex custom providers currently use the Responses protocol. Google's Gemini OpenAI-compatibility examples use Chat Completions, so direct Gemini endpoint routing may not be enough unless your gateway translates to Responses. For practical QA teams, native Gemini CLI plus Codex is the reliable path.
A tester's routing table.
Risky framework work
Use gpt-5.5 or gpt-5.4 high/xhigh. Require plan, diff, targeted tests, and review.
Spec edits
Use gpt-5.4-mini or current recommended mini. Run the exact spec immediately.
Subagents
Use mini for explorers and one frontier reviewer for final synthesis.
Gemini CLI
Ask for independent risk review, missing scenarios, and edge-case brainstorming.
Local model
Use Ollama/LM Studio for docs summarization, not production code edits unless proven.
Bounded automation
Use narrow prompts, max turns, focused diff, and explicit output caps.
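A routing table is easier to enforce when it lives in the repo as data. The sketch below encodes a few rows of the table above; the model names are the ones used on this page, while the task labels, the `Route` shape, and `routeFor` are illustrative assumptions.

```typescript
// Hypothetical routing map that mirrors the table above. Model names come from
// this page; the task labels and the Route shape are assumptions.
type QaTask =
  | "framework-change"
  | "spec-edit"
  | "explorer-subagent"
  | "docs-summary";

interface Route {
  model: string;
  requiredProof: string[]; // evidence to demand before accepting the result
}

const ROUTES: Record<QaTask, Route> = {
  "framework-change": {
    model: "gpt-5.5",
    requiredProof: ["plan", "diff", "targeted tests", "review"],
  },
  "spec-edit": { model: "gpt-5.4-mini", requiredProof: ["exact spec run"] },
  "explorer-subagent": {
    model: "gpt-5.4-mini",
    requiredProof: ["frontier reviewer synthesis"],
  },
  "docs-summary": { model: "local (Ollama or LM Studio)", requiredProof: [] },
};

function routeFor(task: QaTask): Route {
  return ROUTES[task];
}

const frameworkRoute = routeFor("framework-change");
console.log(`${frameworkRoute.model}: ${frameworkRoute.requiredProof.join(", ")}`);
```

Keeping the proof requirements next to the model choice means a cheaper model never implies a weaker definition of done.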
Spawn specialists, not chaos.
Codex can spawn specialized agents in parallel when you explicitly ask. Built-ins include default, worker, and explorer. Custom agents live as TOML files under ~/.codex/agents/ or .codex/agents/.
Codebase map
Find test owners, fixtures, helper APIs, flaky waits, and routes without editing.
Implementation
Make a bounded change after the plan is approved.
QA diff review
Read the final diff for flake risk, missing assertions, bad test data, and CI gaps.
> Review this branch vs main. Spawn one agent per topic:
1. security risk
2. test flakiness
3. missing assertions
4. API contract risk
5. maintainability
Wait for all agents, then summarize the top 8 findings.
Skills are playbooks Codex loads on demand.
A skill is a directory with a required SKILL.md plus optional scripts/, references/, assets/, and helper files. Codex starts with the skill name and description, then reads the full instructions only when the task matches.
    ---
    name: flake-hunter
    description: Use when a Playwright spec fails intermittently or contains waitForTimeout, sleep, networkidle, or brittle locator patterns.
    ---
    # Flake Hunter
    1. Read the failing spec and related fixture.
    2. Search for hard waits and brittle selectors.
    3. Replace with role locators and assertion waits.
    4. Run the spec with --repeat-each=20.
    5. Report pass rate, changed lines, and remaining risk.
Where Codex finds skills
| Scope | Location |
|---|---|
| Repo | $CWD/.agents/skills or repo-root/.agents/skills |
| User | $HOME/.agents/skills |
| Admin | /etc/codex/skills |
| System | Bundled skills such as skill-creator |
Invoke a skill with $skill-name or via /skills.
Build your own Codex for QA
Package your testing brain as skills.
locator-auditor
Scans specs for XPath, nth-child, CSS chains, and missing accessible names.
api-contract-maker
Turns curl/OpenAPI/Postman exports into Playwright APIRequestContext suites.
bug-from-trace
Reads trace, screenshot, console, and network logs and drafts a Jira-ready bug.
> Use $skill-creator to create a repo-scoped skill named locator-auditor.
It should trigger when tests use XPath, CSS chains, nth-child, test-only waits,
or missing assertions. It should read tests/, output a risk table, and only edit
when I explicitly say "fix them".
When a skill is not enough, use a plugin.
Plugins package skills, MCP servers, and apps together. For a QA organization, a plugin can ship the company browser tools, Jira connector, test-data service, and house skills as one installable bundle.
Team playbooks
Common workflows: smoke triage, accessibility audit, release-readiness report.
Tool servers
Playwright, Jira, test data, internal QA dashboards, contract registry.
Local UI
A mini dashboard for traces, screenshots, and run summaries inside Codex.
Hooks fire around tool calls.
Hooks let you run scripts on Codex lifecycle events: prompt submit, pre-tool, permission request, post-tool, compaction, subagent start/stop, session start, and stop. Use them for formatting, test targeting, audit logs, and safety gates.
    "PostToolUse": [{
      "matcher": "apply_patch",
      "hooks": [{ "type": "command", "command": "npm run lint -- --quiet" }]
    }],
    "PreToolUse": [{
      "matcher": "shell",
      "hooks": [{ "type": "command", "command": "node scripts/block-main-edits.js" }]
    }]
| Hook | QA use |
|---|---|
| UserPromptSubmit | Log prompts or block secrets. |
| PreToolUse | Prevent destructive shell commands. |
| PermissionRequest | Auto-deny unsafe escalations. |
| PostToolUse | Format or run impacted tests. |
| Stop | Emit a run summary. |
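The PreToolUse row in the table depends on a script that can tell a risky command from a routine one. Here is a minimal sketch of that decision logic only. How Codex hands the command to a hook and how a hook signals a block are not covered on this page, so the function takes a plain string; `isDestructive` and the pattern list are assumptions to adapt.

```typescript
// Hypothetical decision logic for a PreToolUse guard script.
// The hook payload format is not covered here, so this takes a plain command string.
const DESTRUCTIVE_PATTERNS: RegExp[] = [
  /\brm\s+-[a-zA-Z]*[rf]/,                   // rm -rf, rm -r, rm -f
  /\bgit\s+push\b.*\s(--force|-f)(\s|$)/,    // plain force pushes
  /\bgit\s+reset\s+--hard\b/,                // discards local work
  /\bgit\s+clean\s+-[a-zA-Z]*f/,             // deletes untracked files
];

function isDestructive(command: string): boolean {
  return DESTRUCTIVE_PATTERNS.some((pattern) => pattern.test(command));
}

// Usage: classify a few sample commands.
const guardSamples = [
  "npx playwright test --grep @smoke",
  "rm -rf tests/e2e",
  "git push --force origin main",
];
for (const cmd of guardSamples) {
  console.log(`${isDestructive(cmd) ? "BLOCK" : "allow"}  ${cmd}`);
}
```

A deny-list like this is a safety net, not a sandbox. It complements the permission modes rather than replacing them.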
MCP turns Codex into a browser, Jira, GitHub, docs, and your internal API.
Model Context Protocol exposes external tools as structured actions. For QA, the high-value servers are Playwright/browser, Jira/Confluence, GitHub, design systems, test data, and custom product APIs.
Playwright MCP
Navigate, click by role, inspect accessibility tree, capture screenshots and network logs.
Atlassian MCP
Read acceptance criteria, write test plans, file bugs with traces.
GitHub
PRs, checks, issues, release notes, review comments, and branch status.
Notion / Confluence
Pull specs and publish execution reports.
Test data MCP
Create safe seeded users, orders, payments, feature flags.
Your own MCP
Wrap internal CLIs and APIs so Codex can test like your team tests.
codex mcp add playwright -- npx @playwright/mcp@latest
A QA agent needs eyes and hands.
Use the in-app browser for local apps, screenshots, accessibility snapshots, and visual verification. Use Chrome automation when cookies, extensions, or logged-in remote sessions matter.
Open
Localhost or remote target.
Snapshot
Capture accessible structure.
Act
Click, type, upload, resize.
Assert
Check text, pixels, console.
Codify
Turn findings into tests.
Demo · authoring a spec without typing locators
"Open saucedemo.com, log in, add an item, screenshot the cart."
▸ browser_navigate('https://saucedemo.com')
▸ browser_snapshot() · captured a11y tree
▸ browser_click(role=button, name='Login')
▸ browser_click(name='Add to cart')
▸ browser_take_screenshot('cart.png')
▸ apply_patch(tests/cart.spec.ts) ✓
▸ npx playwright test tests/cart.spec.ts · green
    import { test, expect } from '@playwright/test';

    test('adds backpack to cart', async ({ page }) => {
      await page.goto('https://www.saucedemo.com');
      // Log in with the demo credentials.
      await page.getByRole('textbox', { name: /user/i }).fill('standard_user');
      await page.getByRole('textbox', { name: /pass/i }).fill('secret_sauce');
      await page.getByRole('button', { name: 'Login' }).click();
      // Add the first listed item, then check that the product name is visible.
      await page.getByRole('button', { name: /add to cart/i }).first().click();
      await expect(page.getByText('Sauce Labs Backpack')).toBeVisible();
    });
API tests from a single curl.
Paste a curl, OpenAPI URL, or Postman export. Codex can infer happy path, negative path, schema validation, auth variants, and fixture structure.
> generate Playwright API tests for this endpoint. Include positive, invalid email, missing auth, schema, and one contract drift check. Use zod for runtime validation.
    curl -X POST https://api.demo.dev/v1/users \
      -H "Authorization: Bearer $T" \
      -H 'Content-Type: application/json' \
      -d '{"email":"[email protected]","plan":"pro"}'
tests/api/users.spec.ts
fixtures/apiClient.ts
tests/contracts/user.schema.ts
Coverage:
- 201 create user
- 400 invalid payload
- 401 missing token
- schema validation
- idempotency or duplicate email behavior
Tests from requirements
A Jira ticket in. A test plan out.
> fetch QA-482 from Jira, read acceptance criteria,
produce: 1) Gherkin scenarios, 2) Playwright skeleton,
3) coverage matrix mapping each AC to a test id,
4) risk list for untestable or ambiguous criteria.
AC-1 valid promo → TC-482-001 @smoke
AC-2 expired promo → TC-482-002 @negative
AC-3 country tax → TC-482-003 @regression
AC-4 rounding rule → TC-482-004 @edge
Ambiguous: tax source of truth missing.
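A coverage matrix like the one above is worth checking mechanically, so a criterion with no test cannot slip through. This is a small sketch under stated assumptions: the `PlannedTest` shape and both function names are hypothetical, and AC-5 is an invented criterion added only to show the uncovered case.

```typescript
// Hypothetical types; test IDs and AC-1..AC-4 come from the example output above.
interface PlannedTest {
  id: string;
  covers: string[]; // acceptance-criterion IDs this test exercises
  tag: string;
}

// Map each criterion to the IDs of the tests that cover it.
function coverageByCriterion(
  criteria: string[],
  tests: PlannedTest[],
): Record<string, string[]> {
  const matrix: Record<string, string[]> = {};
  for (const ac of criteria) {
    matrix[ac] = tests
      .filter((t) => t.covers.indexOf(ac) !== -1)
      .map((t) => t.id);
  }
  return matrix;
}

// Criteria with no covering test at all.
function uncoveredCriteria(matrix: Record<string, string[]>): string[] {
  return Object.keys(matrix).filter((ac) => matrix[ac].length === 0);
}

const plannedTests: PlannedTest[] = [
  { id: "TC-482-001", covers: ["AC-1"], tag: "@smoke" },
  { id: "TC-482-002", covers: ["AC-2"], tag: "@negative" },
  { id: "TC-482-003", covers: ["AC-3"], tag: "@regression" },
  { id: "TC-482-004", covers: ["AC-4"], tag: "@edge" },
];

// AC-5 is invented for this example to demonstrate a gap.
const acMatrix = coverageByCriterion(
  ["AC-1", "AC-2", "AC-3", "AC-4", "AC-5"],
  plannedTests,
);
console.log(uncoveredCriteria(acMatrix)); // [ 'AC-5' ]
```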
From a screenshot to a filed bug.
> [screenshot attached]
User reports the price chip overflows on mobile.
Reproduce at 390x844, capture screenshot + trace,
find likely component, and draft a Jira-ready bug
with steps, expected, actual, severity, and evidence.
▸ viewport: 390x844
▸ component: PricingCard / price-chip
▸ evidence: screenshot + trace.zip
▸ bug title: Pricing chip overflow on mobile
done
QA reviews code too. Now they have leverage.
| Prompt | Outcome |
|---|---|
| /review | Severity-tagged findings on current diff. |
| review as QA | Flaky waits, missing assertions, untested branches. |
| review as security | Auth, SSRF, injection, secret leakage. |
| spawn reviewers | Parallel specialist review, final synthesis. |
tests/login.spec.ts:14 · P1
page.waitForTimeout(2000) hides race.
Fix: wait for dashboard heading and toast.
src/auth/middleware.ts:42 · P0
Token branch skips expiry validation.
Fix: assert exp before session creation.
playwright.config.ts:8 · P2
retries: 3 masks flakes.
Fix: retry once, quarantine with owner.
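Severity tags become more useful once something acts on them. The sketch below sorts findings and applies one possible merge policy, which is to block on any P0. That policy, the `ReviewFinding` shape, and `triage` are assumptions for illustration, not Codex output formats.

```typescript
// Hypothetical shapes; severities follow the P0/P1/P2 tags in the sample findings.
type Severity = "P0" | "P1" | "P2";

interface ReviewFinding {
  location: string; // file:line
  severity: Severity;
  summary: string;
}

const SEVERITY_RANK: Record<Severity, number> = { P0: 0, P1: 1, P2: 2 };

// Sort most severe first and block the merge when any P0 is present
// (the blocking rule is an assumed team policy).
function triage(findings: ReviewFinding[]): {
  blocked: boolean;
  ordered: ReviewFinding[];
} {
  const ordered = findings
    .slice()
    .sort((a, b) => SEVERITY_RANK[a.severity] - SEVERITY_RANK[b.severity]);
  const blocked = ordered.some((f) => f.severity === "P0");
  return { blocked, ordered };
}

const reviewResult = triage([
  { location: "tests/login.spec.ts:14", severity: "P1", summary: "waitForTimeout hides race" },
  { location: "src/auth/middleware.ts:42", severity: "P0", summary: "token branch skips expiry validation" },
  { location: "playwright.config.ts:8", severity: "P2", summary: "retries: 3 masks flakes" },
]);
console.log(reviewResult.blocked, reviewResult.ordered.map((f) => f.location));
```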
Run Codex in your pipeline.
Use headless Codex only for narrow, bounded CI jobs: summarize failing tests, draft PR review comments, triage smoke failures, or create a follow-up issue. Keep prompts small and permissions tight.
    name: qa-bot
    on: { pull_request: { types: [opened, synchronize] } }
    jobs:
      review:
        runs-on: ubuntu-latest
        steps:
          - uses: actions/checkout@v4
          - run: npm ci
          - run: npx playwright install --with-deps
          # Install the Codex CLI so `codex exec` is available on the runner.
          - run: npm i -g @openai/codex
          - name: Codex QA review
            env:
              # Read the key from repository secrets; never hard-code it.
              OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
            run: |
              codex exec --model gpt-5.4-mini \
                "Review this PR for QA risk only. Focus on flaky waits,
                hard-coded data, missing assertions, and broken smoke coverage.
                Output max 8 bullets with file:line when possible."
Plan first. Then let it loose.
Read-only thinking
Use /plan for unknown codebases, migrations, auth, payments, CI, and shared test fixtures.
Isolated sandbox
Run the risky work in a branch or worktree so your main checkout and dev server stay stable.
Together these are the two controls that let you delegate real work without losing engineering judgment.
Things that will save you a workday.
/plan
Force strategy before edits.
/diff
Inspect exactly what changed.
/review
Ask Codex to critique its own patch.
/compact
Preserve decisions, free context.
@file
Point Codex at the exact spec, fixture, or trace.
$skill
Explicitly invoke a QA skill.
/side
Ask a side question without polluting the main task.
rg first
Search before editing. Tests fail slower than search.
proof
Ask for command, exit code, and report path.
branch
Never let autonomous work start on protected main.
A QA day · before vs after.
Before
09:00 Four flaky tests overnight.
10:00 Manually translate Jira AC into tests.
12:00 Still debugging the locator.
15:00 Write bug report and attach evidence by hand.
18:00 Backlog grew.
After
09:00 Codex summarizes overnight failures.
09:30 Gemini gives second-opinion scenario gaps.
11:00 Codex patches the flake and runs 20x proof.
14:00 Playwright MCP exploration produces trace evidence.
17:00 PR reviewed, CI gated, release notes drafted.
When one repo has 12 apps and 8 suites
Codex in a monorepo.
| Problem | Pattern |
|---|---|
| Too much context | Run Codex from the package folder and add only needed dirs. |
| Different test commands | Put package-specific AGENTS.md files near each suite. |
| Shared fixtures | Give Codex ownership rules before editing shared files. |
| Slow CI | Teach impacted-test selection and smoke tags. |
| Cross-app flows | Use a project-level plan before patching any suite. |
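Impacted-test selection from the table can start as a plain prefix map from changed paths to test tags. The following is one possible sketch; the folder names, the tags other than @smoke, and `impactedTags` are hypothetical and would need to match your monorepo layout.

```typescript
// Hypothetical rule: files under `prefix` are exercised by tests tagged `tag`.
interface SuiteRule {
  prefix: string;
  tag: string;
}

// Returns the de-duplicated tags to run for a set of changed files.
// Files that match no rule fall back to the smoke tag.
function impactedTags(
  changedFiles: string[],
  rules: SuiteRule[],
  fallbackTag: string,
): string[] {
  const selected: string[] = [];
  for (const file of changedFiles) {
    const hits = rules.filter((rule) => file.indexOf(rule.prefix) === 0);
    const tags = hits.length > 0 ? hits.map((rule) => rule.tag) : [fallbackTag];
    for (const tag of tags) {
      if (selected.indexOf(tag) === -1) selected.push(tag);
    }
  }
  return selected;
}

// Illustrative layout only.
const suiteRules: SuiteRule[] = [
  { prefix: "apps/checkout/", tag: "@checkout" },
  { prefix: "apps/search/", tag: "@search" },
];

const tagsToRun = impactedTags(
  ["apps/checkout/src/cart.ts", "apps/checkout/src/tax.ts", "README.md"],
  suiteRules,
  "@smoke",
);
console.log(tagsToRun.join(" ")); // @checkout @smoke
```

The resulting tags could feed a command such as npx playwright test --grep, so CI runs the affected suites first and smoke as the safety net.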
Do not rewrite by hand
Selenium · Cypress · TestCafe → Playwright.
Let Codex inspect the old suite, build a migration matrix, convert one vertical slice, run it, then scale. Ask Gemini to independently review missing behavior before deleting the old tests.
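A migration matrix can begin as a simple count of which Cypress commands the old suite actually uses. This sketch counts direct `cy.*` calls in a spec string. Only the cy.intercept to page.route row comes from this page; the other mappings, `migrationMatrix`, and the sample spec are assumptions to confirm against your own suite.

```typescript
// Hypothetical starter mapping. Only intercept → page.route comes from this
// page; the other rows are assumptions to verify before relying on them.
const CYPRESS_TO_PLAYWRIGHT: Record<string, string> = {
  intercept: "page.route",
  visit: "page.goto",
  get: "page.locator (prefer getByRole/getByLabel/getByTestId)",
  contains: "page.getByText",
};

interface MatrixRow {
  command: string;
  count: number;
  target: string;
}

// Counts direct cy.<command>( calls; chained calls such as .click() are ignored.
function migrationMatrix(legacySource: string): MatrixRow[] {
  const counts: Record<string, number> = {};
  const pattern = /\bcy\.(\w+)\(/g;
  let hit: RegExpExecArray | null;
  while ((hit = pattern.exec(legacySource)) !== null) {
    counts[hit[1]] = (counts[hit[1]] || 0) + 1;
  }
  return Object.keys(counts)
    .sort()
    .map((command) => ({
      command,
      count: counts[command],
      target: CYPRESS_TO_PLAYWRIGHT[command] || "needs manual review",
    }));
}

// Illustrative legacy spec content.
const legacySpec = [
  "cy.intercept('POST', '/api/orders').as('order');",
  "cy.visit('/checkout');",
  "cy.get('[data-cy=pay]').click();",
  "cy.get('[data-cy=total]').should('contain', '42');",
  "cy.task('seedCart');",
].join("\n");

const matrixRows = migrationMatrix(legacySpec);
for (const row of matrixRows) {
  console.log(`${row.command} x${row.count} -> ${row.target}`);
}
```

Rows marked for manual review are the ones to convert first in the vertical slice, since they carry the most unknowns.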
> Migrate cypress/e2e/checkout to Playwright under tests/e2e/checkout.
Rules: prefer getByRole/getByLabel/getByTestId, no waitForTimeout,
cy.intercept becomes page.route, fixtures become typed builders.
Convert one spec first, run it, show diff and pass/fail before continuing.
Numbers that drive QA decisions
Flake rate · p95 · MTTR — let Codex do the math.
Flake rate
Failures that pass on retry divided by total executions.
p95 duration
Protects CI from slow suite creep.
MTTR
Time from red build to green fix.
Coverage by AC
Links acceptance criteria to tests and release confidence.
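The first three metrics reduce to a few lines of arithmetic. Below is a minimal sketch, assuming run records that note whether a test needed a retry; the record shapes and function names are hypothetical, and p95 uses the nearest-rank method, which is only one of several percentile definitions.

```typescript
// Hypothetical record of one test execution.
interface RunRecord {
  passed: boolean;
  retried: boolean; // true when the first attempt failed
  durationMs: number;
}

// Flake rate as defined above: failures that pass on retry over all executions.
function flakeRate(runs: RunRecord[]): number {
  if (runs.length === 0) return 0;
  const flaky = runs.filter((r) => r.passed && r.retried).length;
  return flaky / runs.length;
}

// p95 via nearest rank: the smallest value with at least 95% of samples at or below it.
function p95(durations: number[]): number {
  if (durations.length === 0) throw new Error("no durations");
  const sorted = durations.slice().sort((a, b) => a - b);
  const rank = Math.ceil((95 * sorted.length) / 100);
  return sorted[rank - 1];
}

// MTTR in minutes: mean time from red build to green fix (timestamps in ms).
function mttrMinutes(incidents: { redAt: number; greenAt: number }[]): number {
  if (incidents.length === 0) return 0;
  const totalMs = incidents.reduce((sum, i) => sum + (i.greenAt - i.redAt), 0);
  return totalMs / incidents.length / 60000;
}

// Illustrative data: 20 runs, two of which passed only on retry.
const sampleRuns: RunRecord[] = [];
for (let i = 1; i <= 20; i++) {
  sampleRuns.push({ passed: true, retried: i === 5 || i === 15, durationMs: i * 1000 });
}

console.log(flakeRate(sampleRuns));                      // 0.1
console.log(p95(sampleRuns.map((r) => r.durationMs)));   // 19000
console.log(mttrMinutes([
  { redAt: 0, greenAt: 30 * 60000 },
  { redAt: 0, greenAt: 90 * 60000 },
]));                                                     // 60
```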
Hands-on · run these now
Six drills. Do them in order.
First contact
Run codex, then /init. Edit AGENTS.md with real test commands.
Review first
Ask Codex to review a recent PR for QA risk only.
Tame a flake
Remove one waitForTimeout and prove stability with repeat-each.
Gemini second pass
Use Gemini CLI to find missing scenarios in the same change.
Ship a skill
Create locator-auditor under .agents/skills and run it on tests/.
Codex in CI
Run a bounded QA review job on a PR and post an artifact.
Capstone · build your own Codex for QA
Skill · model routing · MCP · site · live URL.
By the end, you have a repo-scoped QA agent: AGENTS.md, three skills, Playwright MCP, Gemini second-review lane, GitHub Actions, and a portfolio page showing how the system works.
Your QA system.
Codified once.
Codex handles the toil; you own risk, judgment, release confidence.
Step 1 · foundation
Create AGENTS.md and the first skill.
> Build the QA agent foundation for this repo.
Create AGENTS.md with commands, locator policy, wait policy, data policy,
review rules, and stop conditions. Then create .agents/skills/locator-auditor/SKILL.md.
Do not edit product code. Run the skill on tests/ and report the top 10 risks.
Step 2 · model routing
Codex edits. Gemini critiques.
> Add a docs/model-routing.md file for our QA team.
Include when to use gpt-5.5, gpt-5.4, gpt-5.4-mini, Gemini CLI,
and local models. Add examples for: flaky spec, API contract suite,
release-readiness review, and migration planning. Keep it practical.
Step 3 · Playwright + CI
Cover every route. Gate every merge.
> Add Playwright smoke, a11y, visual, link, and SEO suites.
Run mobile 390x844 and desktop 1440x900. Add CI with lint, typecheck,
e2e, and report upload. If anything fails locally, fix it before reporting done.
Step 4 · deploy
From localhost to thetestingacademy.com.
> Deploy ./qa-portfolio to Vercel.
Run the Playwright suite against the prod URL.
Print preview URL, prod URL, test report path, and next manual checks.
> Publish this Codex masterclass page under
app.thetestingacademy.com/masterclass/codex.html.
Verify with curl and a browser snapshot after deploy.
Don't do these. Ever.
Skipping AGENTS.md
Without it Codex guesses your conventions and you fight it every turn.
Trusting green without proof
Ask for command, exit code, and report path.
Putting secrets in prompts
Use env vars, vaults, and CI secrets.
Direct Gemini config without protocol check
Use native Gemini CLI unless your gateway supports Codex's required API protocol.
Auto-merging agent PRs
The agent writes. You review.
Running full access on protected main
Use worktrees, feature branches, and scoped permissions.
The new QA toolbelt.
Codex
The editor, reviewer, tester, and orchestrator.
GPT-5 family
Frontier reasoning down to fast mini work.
Gemini CLI
Independent review and search-grounded planning.
Playwright
Browser, API, trace, visual, a11y.
MCP
Connects agent to tools.
AGENTS.md
Team rules and commands.
Skills
Reusable QA workflows.
You
Risk, release confidence, product sense.