Combining AskUI's visual perception with the advanced reasoning of an LLM like Anthropic's Claude 3.5 Sonnet lets you build powerful autonomous agents that can interact with any UI.
But it's not a simple task.
As the team behind AskUI, we know firsthand how complex, and how powerful, this integration is. Most users can get this full, integrated capability immediately, with no code required, by using our enterprise platform, caesr.ai.
This tutorial is different: it is for the expert developer who wants to understand the core technology at a deeper level, or the engineer who needs to build a highly customized agent from scratch.
In this guide, we'll walk you through building the core logic of a vision agent with AskUI's Python client and the Claude 3.5 Sonnet API: an agent that can perceive, reason, and act.
The Core Concept: Perceive ➔ Reason ➔ Act
Our agent will follow three core steps, sketched in code right after this list:
- Perceive: AskUI's vision AI "sees" the screen and extracts a list of all UI elements (buttons, text, icons).
- Reason: The Claude LLM "understands" this list of elements and, based on a given goal, determines the next logical action in a structured JSON format.
- Act: AskUI parses Claude's JSON decision and performs the physical action like a click or keystroke on the screen.
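Before wiring up the real services, it helps to see the shape of that loop. Here is a minimal, runnable sketch; perceive(), reason(), and act() are hypothetical placeholders that the steps below implement for real with AskUI and Claude:

import asyncio

# Hypothetical placeholders: the steps below implement these for real.
async def perceive():
    return []  # AskUI: return the UI elements currently on screen

def reason(goal, elements):
    # Claude: decide the next action and return it as structured data
    return {"action": "wait", "element_text": "No action needed"}

async def act(decision):
    pass  # AskUI: perform the physical click or keystroke

async def agent_step(goal):
    elements = await perceive()        # 1. Perceive
    decision = reason(goal, elements)  # 2. Reason
    await act(decision)                # 3. Act

asyncio.run(agent_step("log in"))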
Step 1: Environment Setup and Installation
First, you'll need to install the AskUI and Anthropic Python libraries.
# (Recommended) Create and activate a virtual environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install the AskUI and Anthropic libraries
pip install askui anthropic
Step 2: Initialize AskUI and Claude Clients
Create a new Python file named agent_test.py. This script needs your API keys for both AskUI and Claude. (In real projects, always load these from environment variables instead of hard-coding them; see the sketch below.)
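For example, instead of the hard-coded placeholders shown below, a real project might read the keys like this (a small sketch; the AskUI variable names are this tutorial's own convention, while ANTHROPIC_API_KEY is the variable the Anthropic SDK also reads by default):

import os

# Read secrets from the environment instead of committing them to code.
ASKUI_WORKSPACE_ID = os.environ["ASKUI_WORKSPACE_ID"]
ASKUI_ACCESS_TOKEN = os.environ["ASKUI_ACCESS_TOKEN"]
CLAUDE_API_KEY = os.environ["ANTHROPIC_API_KEY"]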
import asyncio
import os
import json # Added for parsing JSON responses
from askui_core.askui_core import AskuiCore
from anthropic import Anthropic
# AskUI Credentials
ASKUI_WORKSPACE_ID = "YOUR_WORKSPACE_ID"
ASKUI_ACCESS_TOKEN = "YOUR_ACCESS_TOKEN"
# Anthropic (Claude) API Key
CLAUDE_API_KEY = "YOUR_CLAUDE_API_KEY"
# Initialize AskUI Core
aui = AskuiCore(
    workspace_id=ASKUI_WORKSPACE_ID,
    access_token=ASKUI_ACCESS_TOKEN
)
# Initialize Anthropic Client
claude_client = Anthropic(api_key=CLAUDE_API_KEY)
Step 3: Build the AI Agent Logic in Python (Runnable Version)
Now, let's connect these two clients to create our "Perceive, Reason, Act" loop.
The key is to engineer a System Prompt that forces Claude to return a reliable JSON response we can actually parse.
async def run_vision_agent():
    try:
        # Connect to the AskUI Controller
        await aui.connect()

        # --- 1. PERCEIVE ---
        # AskUI visually perceives all elements on the screen
        print("Agent is 'seeing' the screen...")
        elements = await aui.get().all().exec()

        # Convert elements to a simple string for the LLM prompt
        element_list_str = ", ".join([f"'{el.text}' ({el.name})" for el in elements if el.text])
        print(f"Seen elements: {element_list_str}")

        # --- 2. REASON ---
        # Ask Claude to reason about the next action based on the visible elements
        print("Agent is 'thinking' with Claude...")

        # System prompt engineered to force a JSON response
        system_prompt = """
You are an AI automation assistant. Your goal is to decide the single most logical next action to achieve a user's goal.
You must ONLY respond with a JSON object in the following format:
{"action": "click", "element_text": "text_on_element_to_click"}
If no logical action can be taken, respond with:
{"action": "wait", "element_text": "No action needed"}
"""

        user_prompt = f"My goal is to log in. The current screen elements are: [{element_list_str}]. What is my next action?"

        message = claude_client.messages.create(
            model="claude-3-5-sonnet-20240620",
            max_tokens=200,
            temperature=0.0,
            system=system_prompt,
            messages=[
                {"role": "user", "content": user_prompt}
            ]
        )
        claude_response_text = message.content[0].text
        print(f"Claude's raw response: {claude_response_text}")

        # --- 3. ACT ---
        # Parse the JSON response from Claude and execute the action
        try:
            action_data = json.loads(claude_response_text)
            action_type = action_data.get("action")
            element_text = action_data.get("element_text")

            if action_type == "click" and element_text:
                print(f"Agent is 'acting': Clicking the element with text '{element_text}'...")
                # We use .text() to click any element with the text Claude identified
                await aui.click().text().with_text(element_text).exec()
                await asyncio.sleep(3)
                print("Action successful!")
            else:
                print(f"Claude decided to 'wait' or returned an unknown action: {element_text}")
        except json.JSONDecodeError:
            print(f"Error: Claude did not return valid JSON. Response: {claude_response_text}")

    except Exception as e:
        print(f"Automation failed: {e}")
        await aui.annotate()  # Take a screenshot on failure

    finally:
        await aui.close()

# Run the asynchronous function
if __name__ == "__main__":
    asyncio.run(run_vision_agent())
Conclusion: Why This is Still Complex (And What the Easier Path Is)
As you can see, you can absolutely build a working agent by combining AskUI's Python client and the Claude API.
However, this tutorial only covers the "happy path." To make this agent robust enough for a production environment, you would still need to handle the enormous complexity of LLMOps:
- JSON Parsing Failures: What if Claude breaks the system prompt and returns invalid JSON?
- Complex Reasoning: How do you get the agent to remember its "state" and perform multi-step tasks (e.g., what to do after logging in)?
- Error Handling: What if the aui.click() action fails? You need a retry loop that feeds the failure information back to Claude for a new decision; a minimal version of both this and the JSON-parsing problem is sketched below.
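To make these failure modes concrete, here is a minimal, hedged sketch of such a retry loop. ask_claude() and execute_action() are hypothetical stand-ins for the messages.create() call and the aui.click() logic from the script above, and parse_action() shows one best-effort way to tolerate Claude wrapping its JSON in markdown fences. Treat this as an illustration of the pattern, not a production implementation:

import json

MAX_ATTEMPTS = 3

def ask_claude(goal, elements, feedback):
    # Hypothetical stand-in for claude_client.messages.create();
    # 'feedback' carries failure information back into the prompt on retries.
    return '{"action": "wait", "element_text": "No action needed"}'

async def execute_action(action):
    # Hypothetical stand-in for the aui.click()...exec() logic above.
    pass

def parse_action(raw):
    """Best-effort JSON extraction that tolerates markdown code fences."""
    text = raw.strip()
    if text.startswith("```"):
        text = text.strip("`")
        if text.startswith("json"):  # drop an optional language tag
            text = text[4:]
        text = text.strip()
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return None

async def reason_and_act_with_retries(goal, elements):
    feedback = ""
    for attempt in range(MAX_ATTEMPTS):
        raw = ask_claude(goal, elements, feedback)
        action = parse_action(raw)
        if action is None:
            feedback = f"Your last reply was not valid JSON: {raw!r}. Reply with ONLY the JSON object."
            continue
        try:
            await execute_action(action)
            return True  # Success
        except Exception as e:
            feedback = f"The action {action} failed with: {e}. Choose a different action."
    return False  # Out of retries: log, annotate the screen, or escalate to a human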
This is precisely the problem that caesr.ai solves.
caesr.ai is our enterprise platform where all of this complex orchestration (the perception, reasoning, JSON parsing, action, and error handling) is already built in. It allows developers and non-developers alike to simply state their goal in natural language and let the platform handle the rest.
While you can build it yourself as an expert, caesr.ai provides the finished, reliable solution to accelerate your business.
Learn more about how caesr.ai simplifies all of this.
About the AskUI Content Team
This article was written and fact-checked by the AskUI Content Team. Our team works closely with engineers and product experts, including the minds behind caesr.ai, to bring you accurate, insightful, and practical information about the world of Agentic AI. We are passionate about making technology more accessible to everyone.