TL;DR
A test script tells the system exactly what to do. An LLM-based testing agent is told what to verify, and figures out the rest. That difference is what changes the maintenance equation entirely.
Why Test Scripts Break: A Structural Problem
A script is a recording. It encodes every step in sequence: which element to click, in what order, at what selector or coordinate. It works as long as nothing changes.
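To make that concrete, here is what such a recording typically looks like: a minimal Selenium-style sketch, with a hypothetical URL and hypothetical element IDs.

```python
# A script as a recording: every step, order, and selector is hard-coded.
# The URL and element IDs are hypothetical, for illustration only.
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://example.com/login")
driver.find_element(By.ID, "username").send_keys("qa-user")
driver.find_element(By.ID, "password").send_keys("secret")
driver.find_element(By.ID, "login-button").click()
# Any renamed ID, reordered step, or restructured page fails the test
# here, even when the product itself still works.
assert driver.find_element(By.ID, "dashboard").is_displayed()
driver.quit()
```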
Modern products do not stay still. New features, new device variants, new markets, firmware updates. Each one invalidates some number of recorded paths. By the time broken tests are fixed, the team has already manually tested the same build anyway. And then the next release comes out and the cycle starts again.
The maintenance burden compounds in a predictable way. Teams using selector-based tools find themselves stuck at 40 to 60 percent automation coverage, not because they stop trying, but because every hour spent writing new tests is offset by hours spent fixing existing ones.
The endpoint is always the same. Teams stop writing new tests entirely. They spend their time managing instabilities in the test infrastructure instead.
The team is not the bottleneck. The architecture of the tests is.
How an LLM Executes a Test Differently
Instead of encoding every step, you write what the test needs to verify:
```python
from askui import ComputerAgent

# The instruction states the intent; the agent works out the steps.
with ComputerAgent() as agent:
    agent.act("Log in with these credentials and verify the dashboard loads")
```

That single instruction is passed to the LLM. The LLM reads the intent, reasons about what needs to happen first, selects a tool, executes it, observes the result, and reasons again. It loops until the intent is fulfilled.
The Execution Loop
- Read the instruction
- Reason about what action is needed
- Select the right tool for that step
- Execute and observe the result
- Reason again and pick the next action
- Repeat until done
At each step, the LLM decides what to do based on what it sees, not based on what was pre-recorded. The LLM is the orchestrator. The tools are what it calls.
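A minimal sketch of that loop, assuming hypothetical `llm_decide` and `run_tool` helpers standing in for the model call and the tool layer:

```python
from dataclasses import dataclass
from typing import Any, Optional

@dataclass
class Decision:
    done: bool                  # has the intent been fulfilled?
    tool: Optional[str] = None  # which tool to call next
    args: Optional[dict] = None

# Hypothetical stand-ins for the model call and tool execution.
def llm_decide(instruction: str, observation: Any) -> Decision: ...
def run_tool(tool: str, args: dict) -> Any: ...

def execute(instruction: str) -> None:
    """Reason, act, observe; repeat until the intent is fulfilled."""
    observation = None
    while True:
        # The model sees the instruction plus the latest observation
        # and decides the next action; nothing is pre-recorded.
        decision = llm_decide(instruction, observation)
        if decision.done:
            return
        observation = run_tool(decision.tool, decision.args)
```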
How the LLM Selects Tools
A common assumption is that agentic testing means screen-based testing: that the LLM looks at screenshots instead of reading selectors.
That is not how it works. The LLM does not replace one technique with another. It chooses between all available tools depending on what the current step requires.
In a desktop environment, the agent has access to mouse, keyboard, and screenshot tools. For web flows, it extends this with browser tools including goto, get_page_url, and get_page_title. For Android, it uses tap, swipe, shell, and drag_and_drop. When a selector is available and faster, it uses that. When the interface has no DOM, it uses screen-based execution. When the step requires a shell command or an API call, it calls those instead.
| Environment | What the LLM Calls |
|---|---|
| Web / DOM available | Selectors, browser tools |
| No DOM | Screen-based execution |
| Backend verification | API calls, database queries |
| System operations | Shell commands |
| External services | MCP-connected tools |
The LLM picks whichever tool fits the current step. It is not screen-based execution instead of selectors. It is the agent choosing the right method for each situation, the way a human engineer would.
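Conceptually, that amounts to handing the model a per-environment tool registry and letting it choose on each pass through the loop. A sketch, assuming a registry shape of our own invention (the tool names mirror the ones above; the structure is an illustration, not AskUI's actual internals):

```python
# Hypothetical per-environment tool registry. Tool names mirror those
# listed above; the registry shape is an illustration, not AskUI's API.
TOOLS: dict[str, list[str]] = {
    "desktop": ["mouse", "keyboard", "screenshot"],
    "web": ["mouse", "keyboard", "screenshot",
            "goto", "get_page_url", "get_page_title"],
    "android": ["tap", "swipe", "shell", "drag_and_drop"],
}

def available_tools(environment: str) -> list[str]:
    # The model is offered every tool the environment supports and
    # selects per step; screen-based execution is just one option.
    return TOOLS[environment]
```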
Why UI Changes Stop Breaking Agentic Tests
When a test encodes intent rather than steps, a UI change does not make the test wrong.
A button that moves is still a button with the same purpose. A label that changes still represents the same action. The LLM was never following the old path. It was working toward a goal. So it finds the new path on its own.
This breaks the coupling between coverage and maintenance that makes script-based suites expensive to scale. Every new script is another recording that will eventually need updating. A natural language instruction does not encode which element IDs exist or which selectors will match. When those things change, the instruction remains valid.
The practical consequence is that QA teams can expand coverage without expanding the maintenance work that comes with it. The agent adapts. The team focuses on defining what needs to be verified, not how.
FAQ
What is the core difference between an LLM-based testing agent and a traditional script?
A script encodes every step and breaks when any step is wrong. An LLM reads an intent, reasons about how to fulfill it, and selects the right tool for each step. One is a recording. The other is a reasoning process.
Is agentic testing just another name for screen-based testing?
No. Screen-based execution is one method the LLM can call. It also uses DOM selectors, terminal commands, API calls, and external service tools via MCP. The agent selects the execution method based on what the environment supports, not as a workaround for broken selectors.
Does this mean test cases never need updating?
UI changes such as elements moving or labels changing no longer require updates because the agent reasons from intent, not from a recorded path. Test cases need updating when the intended behavior itself changes.
How does this integrate with existing CI/CD pipelines?
The same way any automated test does. Tests are triggered normally. The execution layer handles the rest.
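For instance, the instruction from earlier can be wrapped in an ordinary pytest case that any pipeline runner invokes with `pytest`. A sketch, assuming `act` surfaces an unfulfilled intent as an exception:

```python
# test_login.py -- runs in CI like any other pytest case.
from askui import ComputerAgent

def test_login_reaches_dashboard():
    with ComputerAgent() as agent:
        # Assumption: if the intent cannot be fulfilled, act() raises,
        # failing the test like any other assertion would.
        agent.act("Log in with these credentials and verify the dashboard loads")
```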
