Developing an Automated UI Controller using GPT Agents and GPT-4 Vision

February 23, 2024
Academy

1. Introduction

The emergence of Large Language Models (LLMs) like ChatGPT has ushered in a new era of text generation and AI advancements. While tools such as AutoGPT aim at task automation, their reliance on the underlying DOM/HTML code poses challenges for desktop applications built on .NET or SAP. The GPT-4 Vision (GPT-4v) API addresses this limitation by working directly from visual inputs, eliminating the need for HTML code.

See the original post here

1.1 The Challenge of Grounding

Although GPT-4v excels at image analysis, accurately identifying the locations of UI elements remains a challenge. In response, a technique called "Grounding" is applied: images are pre-annotated with tools like Segment-Anything (SAM) so that GPT-4v only has to pick from pre-identified objects rather than locate them itself.

GPT-4 Vision model struggling to locate people correctly. (Source: https://huyenchip.com/2023/10/10/multimodal.html)

Set-of-Mark example
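To make the grounding step concrete, the snippet below is a minimal sketch of how a screenshot could be pre-annotated with SAM, assuming the `segment-anything` package is installed and a ViT-H checkpoint has been downloaded locally (the checkpoint path is a placeholder). The pipeline in this article uses an annotation API for this step instead, as shown later.

-- CODE language-python line-numbers --
# Minimal sketch: pre-annotating a screenshot with Segment-Anything (SAM).
# The checkpoint filename is a placeholder; download it separately.
import numpy as np
from PIL import Image
from segment_anything import SamAutomaticMaskGenerator, sam_model_registry

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
mask_generator = SamAutomaticMaskGenerator(sam)

image = np.array(Image.open("screenshot.png").convert("RGB"))
masks = mask_generator.generate(image)

# Each mask dict carries a bounding box in XYWH format; convert to XYXY
# boxes so they can be numbered and drawn in the next step.
bboxes = [
    (m["bbox"][0], m["bbox"][1],
     m["bbox"][0] + m["bbox"][2], m["bbox"][1] + m["bbox"][3])
    for m in masks
]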

1.2 Adapting to the Use Case

UI elements can be numbered for reference, streamlining the interaction process. For example, the "Start for Free" button might be associated with the number "34," enabling GPT-4v to provide instructions that can be translated into corresponding coordinates for controller actions.

Object Detection on a UI screen
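The Controller in Section 4 calls a `draw_bboxes` helper to produce exactly this kind of numbered overlay, but the helper itself is not shown. The sketch below is one possible implementation using PIL, assuming the annotation response stores detections as a list of boxes under an `elements` key; the actual response format of the annotation service may differ.

-- CODE language-python line-numbers --
# A possible sketch of the `draw_bboxes` helper used later by the Controller.
# Assumption: raw_data looks like {"elements": [{"box": [x1, y1, x2, y2]}, ...]}.
from PIL import Image, ImageDraw

def draw_bboxes(image, raw_data):
    annotated = image.copy()
    draw = ImageDraw.Draw(annotated)
    for index, element in enumerate(raw_data.get("elements", [])):
        x1, y1, x2, y2 = element["box"]
        # Draw the box and write its reference number at the top-left corner,
        # matching the prompt given to GPT-4v.
        draw.rectangle([x1, y1, x2, y2], outline="red", width=2)
        draw.text((x1 + 2, y1 + 2), str(index), fill="red")
    return annotated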

2. Building Agents

The envisioned tool takes a goal or task as input and returns the set of UI actions needed to achieve it. The system combines three sequential capabilities: goal breakdown, vision (image understanding), and system control. A GPT-4 text model acts as the orchestrator, interfacing with both GPT-4v and the device controller.

UI controller agent which can plan and analyze images

3. GPT-4V Integration: Visionary Function

GPT-4v serves as the visual interpreter, analyzing UI screenshots and identifying elements for interaction. An example function is provided to showcase how UI element identification is requested.

-- CODE language-python line-numbers --
import requests

def request_ui_element_gpt4v(base64_image, query, openai_api_key):
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {openai_api_key}"
    }
    payload = {
        "model": "gpt-4-vision-preview",
        "temperature": 0.1,
        "messages": [
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": f"Hey, imagine that you are guiding me navigating the UI elements in the provided image(s). All the UI elements are numbered for reference. The associated numbers are on top left of corresponding bbox. For the prompt/query asked, return the number associated to the target element to perform the action. query: {query}"
                    },
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/jpeg;base64,{base64_image}"}
                    }
                ]
            }
        ],
        "max_tokens": 400
    }
    response = requests.post("https://api.openai.com/v1/chat/completions", headers=headers, json=payload)
    return response.json()
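As a usage sketch, the snippet below shows how this function might be called. It includes a plausible version of the `encode_image` helper that the Controller relies on later (not shown in the original), and reads the API key from an environment variable whose name is an assumption. It presumes an annotated screenshot already exists on disk.

-- CODE language-python line-numbers --
# Sketch of the `encode_image` helper plus a sample call.
# Assumption: the API key lives in the OPENAI_API_KEY environment variable.
import base64
import os

def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode("utf-8")

openai_api_key = os.environ["OPENAI_API_KEY"]
base64_image = encode_image("screenshot_annotated.png")
response = request_ui_element_gpt4v(base64_image, "click the 'Start for Free' button", openai_api_key)
print(response["choices"][0]["message"]["content"])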

4. Device Controller: The Interactive Core

The Controller class is pivotal, executing UI actions based on GPT-4v's guidance. Functions include moving the mouse, double-clicking, entering text, and capturing annotated screenshots, ensuring seamless interaction with the system:

- move_mouse(): Moves the cursor and performs clicks.

- double_click_at_location(): Executes double clicks at specified locations.

- enter_text_at_location(): Inputs text at a given location.

- take_screenshot(): Captures and annotates the current UI state.

The class below implements these actions, executing each step based on GPT-4v's guidance.

-- CODE language-python line-numbers --
import os
import subprocess
import time

from PIL import Image

# request_image_annotation, draw_bboxes, extract_element_bbox, encode_image,
# extract_numbers, analyze_ui_state_gpt4v, workspaceId and token are helpers
# and credentials defined elsewhere in the project.

class Controller:
    def __init__(self, window_name="Mozilla Firefox") -> None:
        # Get the window ID (replace 'Window_Name' with your window's title)
        self.window_name = window_name
        self.get_window_id = ["xdotool", "search", "--name", self.window_name]
        self.window_id = subprocess.check_output(self.get_window_id).strip().decode()

    def move_mouse(self, x, y, click=1):
        # AI logic to determine the action (not shown)
        action = {"x": x, "y": y, "click": click}
        # Move the mouse and click within the window
        subprocess.run(["xdotool", "mousemove", "--window", self.window_id,
                        str(action["x"]), str(action["y"])])
        if action["click"]:
            subprocess.run(["xdotool", "click", "--window", self.window_id, "1"])
        # wait before next action
        time.sleep(2)

    def double_click_at_location(self, x, y):
        # Move the mouse to the specified location
        subprocess.run(["xdotool", "mousemove", "--window", self.window_id,
                        str(int(x)), str(int(y))])
        # Double click
        subprocess.run(["xdotool", "click", "--repeat", "1", "--window", self.window_id, "1"])
        time.sleep(0.1)
        subprocess.run(["xdotool", "click", "--repeat", "1", "--window", self.window_id, "1"])

    def enter_text_at_location(self, text, x, y):
        # Move the mouse to the specified location
        subprocess.run(["xdotool", "mousemove", "--window", self.window_id,
                        str(int(x)), str(int(y))])
        # Click to focus at the location
        subprocess.run(["xdotool", "click", "--window", self.window_id, "1"])
        # Type the text
        subprocess.run(["xdotool", "type", "--window", self.window_id, text])

    def press_enter(self):
        subprocess.run(["xdotool", "key", "--window", self.window_id, "Return"])

    def take_screenshot(self):
        # Take a screenshot
        screenshot_command = ["import", "-window", self.window_id, "screenshot.png"]
        subprocess.run(screenshot_command)
        # Wait before next action
        time.sleep(1)
        self.image = Image.open("screenshot.png").convert("RGB")
        self.aui_annotate()
        return "screenshot taken with UI elements numbered at screenshot_annotated.png"

    def aui_annotate(self):
        assert os.path.exists("screenshot.png"), "Screenshot not taken"
        self.raw_data = request_image_annotation("screenshot.png", workspaceId, token).json()
        self.image_with_bboxes = draw_bboxes(self.image, self.raw_data)
        self.image_with_bboxes.save("screenshot_annotated.png")

    def extract_location_from_index(self, index):
        bbox = extract_element_bbox([index], self.raw_data)
        return [(bbox[0] + bbox[2]) / 2, (bbox[1] + bbox[3]) / 2]

    def convert_image_to_base64(self):
        return encode_image("screenshot_annotated.png")

    def get_target_UIelement_number(self, query):
        base64_image = self.convert_image_to_base64()
        # openai_api_key as defined at module level
        gpt_response = request_ui_element_gpt4v(base64_image, query, openai_api_key)
        gpt_response = gpt_response["choices"][0]["message"]["content"]
        # Extract numbers from the GPT response
        numbers = extract_numbers(gpt_response)
        return numbers[0]

    def analyze_ui_state(self, query):
        base64_image = self.convert_image_to_base64()
        gpt_response = analyze_ui_state_gpt4v(base64_image, query)
        gpt_response = gpt_response["choices"][0]["message"]["content"]
        # Extract numbers from the GPT response
        numbers = extract_numbers(gpt_response)
        return numbers[0]
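The Controller also relies on a few small helpers that are not shown in the original: `extract_numbers`, `extract_element_bbox`, and `analyze_ui_state_gpt4v` (presumably analogous to `request_ui_element_gpt4v` with a state-analysis prompt), as well as `request_image_annotation` with its `workspaceId` and `token` credentials. The sketch below gives plausible versions of the first two, assuming the same annotation structure as in the earlier `draw_bboxes` sketch.

-- CODE language-python line-numbers --
# Hedged sketches of helpers the Controller calls but which are not shown
# in the original; both are assumptions about their behaviour.
import re

def extract_numbers(text):
    # Pull all integers out of a GPT reply such as "the target element is 34".
    return [int(n) for n in re.findall(r"\d+", text)]

def extract_element_bbox(indices, raw_data):
    # Return the [x1, y1, x2, y2] box of the first requested element index,
    # assuming raw_data stores detections as a list under the "elements" key.
    element = raw_data["elements"][indices[0]]
    return element["box"]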

5. Orchestrator: The Strategic Planner

The Orchestrator, a GPT-4 text model, plans and executes tasks by leveraging the capabilities of the Device Controller and GPT-4v. Key components include the Planner (goal breakdown and action strategy), UserProxyAgent (communication facilitator), and Controller (UI action execution).

Key Components

  • Planner: Breaks down goals and strategizes actions.
  • UserProxyAgent: Facilitates communication between the planner and the controller.
  • Controller: Executes actions within the UI.

Workflow Example

1. Initialization: The Orchestrator receives a task.

2. Planning: It outlines the necessary steps.

3. Execution: The Controller interacts with the UI, guided by the Orchestrator’s strategy.

4. Verification: The Orchestrator checks the outcomes and adjusts actions if needed.

We use AutoGen, an open-source library that lets GPT agents talk to each other, to implement the automated UI controller. After giving each GPT its system message (its identity), we register the functions so that the agents are aware of them and can call them when needed.

-- CODE language-python line-numbers --
import autogen

# llm_config (model, API key, registered function schemas) is assumed to be
# defined elsewhere in the project.

planner = autogen.AssistantAgent(
    name="Planner",
    system_message="""
    You are the orchestrator that must achieve the given task.
    You are given functions to handle the UI window.
    Remember that you are given a UI window and you start the task by taking a screenshot
    and take screenshot after each action.
    For coding tasks, only use the functions you have been provided with.
    Reply TERMINATE when the task is done.
    Take a deep breath and think step-by-step
    """,
    llm_config=llm_config,
)

# create a UserProxyAgent instance named "user_proxy"
user_proxy = autogen.UserProxyAgent(
    name="user_proxy",
    is_termination_msg=lambda x: x.get("content", "") and x.get("content", "").rstrip().endswith("TERMINATE"),
    human_input_mode="NEVER",
    max_consecutive_auto_reply=10,
    code_execution_config={"work_dir": "coding"},
    llm_config=llm_config,
)

controller = Controller(window_name="Mozilla Firefox")

# register the functions
user_proxy.register_function(
    function_map={
        "take_screenshot": controller.take_screenshot,
        "move_mouse": controller.move_mouse,
        "extract_location_from_index": controller.extract_location_from_index,
        "convert_image_to_base64": controller.convert_image_to_base64,
        "get_target_UIelement_number": controller.get_target_UIelement_number,
        "enter_text_at_location": controller.enter_text_at_location,
        "press_enter": controller.press_enter,
        "double_click_at_location": controller.double_click_at_location,
        "analyze_UI_image_state": controller.analyze_ui_state,
    }
)

def start_agents(query):
    user_proxy.initiate_chat(
        planner,
        message=f"Task is to: {query}. Check if the task is achieved by looking at the window. Don't quit immediately",
    )

6. Workflow on a Sample Task

The workflow involves the Orchestrator receiving a task, planning the necessary steps, executing actions through the Controller, and verifying outcomes. The article demonstrates this on a sample task: "click on the GitHub icon and click on the 'blogs' repository".
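With everything wired together, the sample task can be kicked off with a single call to the `start_agents` function defined above (assuming Firefox is already open on the relevant page in the window the Controller is attached to):

-- CODE language-python line-numbers --
# Kick off the sample task from this section.
start_agents("click on the GitHub icon and click on the 'blogs' repository")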

Attached below is the log history of the run, in which we can see the different functions being called and executed in sequence.

This article provides insight into the evolving automation landscape, anticipating remarkable developments built upon these technologies.

Murali Kondragunta · February 23, 2024