TLDR:
AskUI, a German tech startup, announced that its Vision Agent achieved 2nd place on the AndroidWorld benchmark (developed by Google Research) with a 94.8% task completion rate (110 out of 116 tasks).
- What it does: The agent automates complex, multi-app mobile tasks on Android using simple natural language instructions.
- How it works: It uses a vision-first approach, relying only on what's visible on the screen. It employs Anthropic's Claude Sonnet 4.5 for reasoning and planning, combined with a proprietary controller for precise, pixel-level UI interactions.
- Significance: The result validates that AskUI's single-model, vision-based automation approach is highly effective and ready for production use in areas like QA testing and enterprise workflow automation.
- Next Steps: The team is focused on improving performance in tasks with dynamic or rapidly changing UIs (where timing mismatch is an issue).
Imagine a world where routine mobile tasks, from managing contacts to tracking expenses, from scheduling appointments to organizing notes, can be automated with simple natural language instructions. Android, the world's most popular mobile operating system with over 3 billion active devices, represents an enormous opportunity for this. From automatically organizing photos to scheduling recurring calendar events and backing up important notes, the use cases span productivity, communication, entertainment, and beyond. For enterprises, this means QA teams can automate complex multi-app user journeys. For individuals, it means reclaiming hours spent on repetitive mobile tasks.
But in a crowded field of automation tools and AI agents, why should you choose AskUI to automate your Android phone? We constantly challenge ourselves with this exact question. Rather than relying on marketing claims, we sought a quantitative, objective answer.
That's why we're excited to announce: AskUI Vision Agent has achieved 2nd place on the AndroidWorld benchmark, with a task completion rate of 94.8%.
This result validates that our vision-first approach to UI automation, powered by our Android Vision Agent, delivers real-world results on one of the most challenging mobile automation benchmarks available.
What is AndroidWorld?
AndroidWorld is a comprehensive benchmark environment developed by Google Research specifically for evaluating autonomous agents on Android devices. It consists of 116 tasks spanning domains such as communication, productivity, entertainment, utilities, and system operations. These tasks represent multi-step workflows that require the agent to navigate across multiple apps, extract relevant information, and interact with dynamic UIs. Because AndroidWorld maintains a public leaderboard, it gives us a great opportunity to pit our Android Vision Agent against the systems of other research teams and companies.
What is the AskUI Vision Agent?
We are a small tech startup from Germany. We have built a library for reliable, automated end-to-end UI automation that depends only on what is shown on your screen, not on the technology or platform you are running on. One core feature of the library is our Vision Agent: a system that interfaces with modern large language models on one end and, through our AskUI Controller, with your system on the other. It lets you control your desktop or mobile phone and thereby automate tasks, with support for multiple AI models, multi-platform compatibility, and enterprise-ready features.
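To make this concrete, here is a minimal sketch of driving the agent from Python. It assumes the `askui` package exposes a `VisionAgent` with an `act()` method that accepts a natural-language goal; treat the names as illustrative and check the current documentation for the exact API.

```python
# Minimal sketch: automating a task from a natural-language goal.
# Assumes the askui package exposes VisionAgent with an act() method;
# consult the current AskUI documentation for the exact API.
from askui import VisionAgent

with VisionAgent() as agent:
    # The agent perceives the screen, plans, and acts step by step
    # until it considers the goal accomplished.
    agent.act("Open the Contacts app and add a contact named Ada Lovelace")
```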
How We Run our Vision Agent on AndroidWorld
Running our Vision Agent on AndroidWorld is very straightforward. You can find detailed instructions on how to set this system up on your machine in a few minutes in our recent post here. In contrast to many of our competitors, we do not use complicated, expensive, and slow multi-step pipelines. We believe in starting with simplicity and are convinced that modern state-of-the-art UI agents can be built on a single AI model when using 1) meaningful prompts, 2) effectively engineered context, and 3) a high-bandwidth, precise, and error-free interface to OS operations.
Hence, we use Anthropic's Claude Sonnet 4.5 as our reasoning and planning model, and our proprietary controller to capture screenshots and handle UI interactions. For each task, Claude analyzes the instruction, evaluates the current state of the UI, and plans the next interaction, ranging from capturing a screenshot to executing touch or swipe gestures to entering data via the keyboard. Our controller then executes these interactions on the UI with pixel precision.
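Conceptually, each task runs through a perception-action loop like the sketch below. The helper names (`controller.screenshot`, `model.plan`, `controller.execute`) are hypothetical stand-ins for our controller and model calls, not our actual implementation.

```python
from dataclasses import dataclass

@dataclass
class Action:
    kind: str   # e.g. "tap", "swipe", "type", "done"
    args: dict  # parameters such as coordinates or text

def run_task(task: str, controller, model, max_steps: int = 50) -> bool:
    """Illustrative perception-action loop (hypothetical helper names).

    controller.screenshot() -> current screen as an image
    model.plan(task, screenshot, history) -> next Action chosen by the LLM
    controller.execute(action) -> performs the tap/swipe/type on the device
    """
    history: list[Action] = []
    for _ in range(max_steps):
        screenshot = controller.screenshot()             # perceive
        action = model.plan(task, screenshot, history)   # reason and plan
        if action.kind == "done":                        # model judges task complete
            return True
        controller.execute(action)                       # act, pixel-precise
        history.append(action)
    return False  # step budget exhausted
```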
As we found Claude Sonnet 4.5 to occasionally refuse to execute a task for no obvious reason, we use its less capable but more robust predecessor, Claude 4.0, as a fallback in these situations (these tasks are marked with an asterisk in the overview below).
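The fallback itself is a plain retry: if the primary model declines, the identical request is handed to the predecessor. A sketch, with `ModelRefusalError` as a hypothetical exception type:

```python
class ModelRefusalError(Exception):
    """Raised when a model declines to act (hypothetical exception)."""

def plan_with_fallback(task, screenshot, history, primary, fallback):
    """Ask the primary model (Claude Sonnet 4.5) for the next action;
    on an unexplained refusal, retry identically with Claude 4.0."""
    try:
        return primary.plan(task, screenshot, history)
    except ModelRefusalError:
        return fallback.plan(task, screenshot, history)
```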
What We Found: ~95% Success Rate
Our Android Vision Agent successfully completed 110 of AndroidWorld’s 116 tasks, securing second place on the leaderboard! While this shows that our agent can navigate UIs effectively and accomplish tasks at a human level, we see this as a starting point rather than a finish line. Through our analysis, we identified a key challenge: tasks involving dynamic or rapidly changing UIs. When screen content changes between the moment our agent captures a screenshot and the moment it executes an action, the visual context becomes stale. These scenarios create a timing mismatch in our perception-action loop. We're actively working on solutions, including faster screenshot-to-action cycles, predictive UI state modeling, and more robust handling of transient UI elements.
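For illustration only (a simplified idea, not our production mitigation): one way to guard against stale context is to re-capture the screen immediately before acting and re-plan if it no longer matches the screenshot the plan was based on.

```python
def execute_if_fresh(controller, action, planned_screenshot) -> bool:
    """Skip an action when the screen changed since planning.

    Illustrative sketch using the same hypothetical controller
    interface as above; returns False so the caller can re-enter
    the perception-action loop and re-plan.
    """
    current = controller.screenshot()   # re-perceive just before acting
    if current != planned_screenshot:   # UI changed, so the plan is stale
        return False
    controller.execute(action)          # context still fresh, so act
    return True
```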
Full Results Overview
Conclusion
Numbers don’t lie. Achieving 2nd place on AndroidWorld with a ~95% task completion rate validates that vision-based AI automation is ready for production use. Our combination of powerful reasoning with robust, precise OS-level control delivers the reliability and flexibility needed for real-world mobile automation.
Whether you're automating QA testing, building accessibility tools, or streamlining enterprise mobile workflows, AskUI provides the technology to make it happen. The future of mobile automation is agentic. And it's available today.