Creating a New Device Agent - Complete Tutorial

This tutorial shows how to create a new device agent (such as MobileAgent, AndroidAgent, or iOSAgent) and integrate it with UFO³'s multi-device orchestration system. LinuxAgent serves as the primary reference implementation throughout.

Introduction

What is a Device Agent?

A Device Agent is a specialized AI agent that controls and automates tasks on a specific type of device or platform. Unlike traditional third-party agents that extend specific functionality, device agents represent entire computing platforms with their own:

Execution Environment: Device-specific OS, runtime, and APIs
Control Mechanism: UI automation, CLI commands, or platform APIs
Communication Protocol: Client-server architecture via WebSocket
MCP Integration: Device-specific MCP servers for command execution

Device Agent vs Third-Party Agent

Aspect	Device Agent	Third-Party Agent
Scope	Full platform control (Windows, Linux, Mobile)	Specific functionality (Hardware, Web)
Architecture	Client-Server separation	Runs on orchestrator server
Communication	WebSocket + AIP Protocol	Direct method calls
MCP Servers	Platform-specific MCP servers	Shares MCP servers
Examples	WindowsAgent, LinuxAgent, MobileAgent	HardwareAgent, WebAgent
Deployment	Separate client process on device	Part of orchestrator

When to Create a Device Agent

Create a Device Agent when you need to:

Control an entirely new platform (mobile, IoT, embedded)
Execute tasks on remote or distributed devices
Integrate with Galaxy multi-device orchestration
Isolate execution for security or scalability

Create a Third-Party Agent when you need to:

Extend existing platform with new capabilities
Add specialized tools or APIs
Run alongside existing agents

Prerequisites

Before starting this tutorial, ensure you have:

Knowledge Requirements

✅ Python 3.10+: Intermediate Python programming skills
✅ Async Programming: Understanding of async/await patterns
✅ UFO³ Basics: Familiarity with Agent Architecture
✅ MCP Protocol: Understanding of Model Context Protocol
✅ WebSocket: Basic knowledge of WebSocket communication

Recommended Reading

Priority	Topic	Link	Time
🥇	Agent Architecture Overview	Infrastructure/Agents	20 min
🥇	LinuxAgent Quick Start	Quick Start: Linux	15 min
🥈	Server-Client Architecture	Server Overview, Client Overview	30 min
🥈	MCP Integration	MCP Overview	20 min
🥉	AIP Protocol	AIP Protocol	15 min

Development Environment

# Clone UFO³ repository
git clone https://github.com/microsoft/UFO.git
cd UFO

# Install dependencies
pip install -r requirements.txt

# Verify installation
python -c "import ufo; print('UFO³ installed successfully')"

Understanding Device Agents

Three-Layer Architecture

All device agents in UFO³ follow a unified three-layer architecture:

graph TB
    subgraph "Device Agent Architecture"
        subgraph "Level-1: State Layer (FSM)"
            S1[AgentState]
            S2[State Machine]
            S3[State Transitions]
            S1 --> S2 --> S3
        end
        
        subgraph "Level-2: Strategy Layer (Execution Logic)"
            P1[ProcessorTemplate]
            P2[DATA_COLLECTION]
            P3[LLM_INTERACTION]
            P4[ACTION_EXECUTION]
            P5[MEMORY_UPDATE]
            P1 --> P2 --> P3 --> P4 --> P5
        end
        
        subgraph "Level-3: Command Layer (System Interface)"
            C1[CommandDispatcher]
            C2[MCP Tools]
            C3[Device Commands]
            C1 --> C2 --> C3
        end
        
        S3 -->|delegates to| P1
        P4 -->|executes via| C1
    end
    
    style S1 fill:#e1f5ff
    style P1 fill:#fff3e0
    style C1 fill:#f3e5f5

Key Layers:

State Layer (Level-1): Finite State Machine controlling agent lifecycle
Strategy Layer (Level-2): Processing pipeline with modular strategies
Command Layer (Level-3): Atomic system operations via MCP

For detailed architecture, see Agent Architecture Documentation.
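
To make the layering concrete, here is a minimal, self-contained sketch of how the three layers might hand work to each other. All class names (`ToyAgent`, `ToyProcessor`, `ToyDispatcher`) are hypothetical stand-ins invented for illustration; they are not UFO³ classes, and the real state machine and processor are considerably richer.

```python
from dataclasses import dataclass, field
from enum import Enum, auto


class Phase(Enum):
    """Processing phases named after the Strategy Layer diagram."""
    DATA_COLLECTION = auto()
    LLM_INTERACTION = auto()
    ACTION_EXECUTION = auto()
    MEMORY_UPDATE = auto()


@dataclass
class ToyDispatcher:
    """Command Layer stand-in: records commands instead of calling MCP tools."""
    sent: list = field(default_factory=list)

    def dispatch(self, tool: str, **args) -> dict:
        self.sent.append((tool, args))
        return {"tool": tool, "status": "ok"}


@dataclass
class ToyProcessor:
    """Strategy Layer stand-in: runs the four phases in a fixed order."""
    dispatcher: ToyDispatcher
    trace: list = field(default_factory=list)

    def process(self, request: str) -> dict:
        result = {}
        for phase in Phase:  # Enum iteration follows definition order
            self.trace.append(phase.name)
            if phase is Phase.ACTION_EXECUTION:
                result = self.dispatcher.dispatch("execute_command", command=request)
        return result


class ToyAgent:
    """State Layer stand-in: a two-state machine (CONTINUE -> FINISH)."""

    def __init__(self) -> None:
        self.state = "CONTINUE"
        self.processor = ToyProcessor(ToyDispatcher())

    def step(self, request: str) -> dict:
        result = self.processor.process(request)  # the state delegates to the processor
        self.state = "FINISH" if result.get("status") == "ok" else "CONTINUE"
        return result


agent = ToyAgent()
agent.step("ls -la /tmp")
print(agent.state, agent.processor.trace)
```

The point of the sketch is the direction of the dependencies: the state only knows about the processor, and the processor only reaches the device through the dispatcher.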

Server-Client Separation

Device agents use a server-client architecture for security and scalability:

graph LR
    subgraph "Server Side (Orchestrator)"
        Server[Device Agent Server]
        State[State Machine]
        Processor[Strategy Processor]
        LLM[LLM Service]
        
        Server --> State
        Server --> Processor
        Processor -.-> LLM
    end
    
    subgraph "Communication"
        AIP[AIP Protocol<br/>WebSocket]
    end
    
    subgraph "Client Side (Device)"
        Client[Device Client]
        MCP[MCP Server Manager]
        Tools[Platform Tools]
        OS[Device OS]
        
        Client --> MCP
        MCP --> Tools
        Tools --> OS
    end
    
    Server <-->|Commands/Results| AIP
    AIP <-->|Commands/Results| Client
    
    style Server fill:#e1f5ff
    style Client fill:#c8e6c9
    style AIP fill:#fff3e0

Component Responsibilities:

Component	Location	Responsibilities	Security
Agent Server	Orchestrator	Reasoning, planning, state management	Untrusted (LLM-driven)
Device Client	Target Device	Command execution, resource access	Trusted (validated operations)
AIP Protocol	Network	Message transport, serialization	Encrypted channel

Separation Benefits:

Security: Isolates LLM reasoning from system-level execution
Scalability: Single orchestrator manages multiple devices
Flexibility: Clients run on resource-constrained devices (mobile, IoT)
Safety: Client validates all commands before execution
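
As an illustration of the safety point, the following sketch shows one way a client could vet a shell command before running it. The allowlist, the forbidden-token set, and the function name `validate_command` are assumptions made for this example; the source does not specify how UFO³ clients validate commands, and a real policy would depend on the platform.

```python
import shlex

# Hypothetical policy; a real client would define its own rules.
ALLOWED_PROGRAMS = {"ls", "cat", "echo", "pwd"}
FORBIDDEN_TOKENS = {";", "&&", "||", "|", ">", "<", "`", "$("}


def validate_command(command: str) -> tuple[bool, str]:
    """Return (accepted, reason) for a shell command string."""
    if any(token in command for token in FORBIDDEN_TOKENS):
        return False, "shell metacharacters are not allowed"
    try:
        argv = shlex.split(command)
    except ValueError as exc:  # for example, unbalanced quotes
        return False, f"unparseable command: {exc}"
    if not argv:
        return False, "empty command"
    if argv[0] not in ALLOWED_PROGRAMS:
        return False, f"program {argv[0]!r} is not on the allowlist"
    return True, "ok"


print(validate_command("ls -la /tmp"))
print(validate_command("ls; rm -rf /"))
```

Because the check runs on the device, a mistaken or malicious command produced by the LLM on the server is rejected before it reaches the operating system.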

LinuxAgent: Reference Implementation

Why LinuxAgent as Reference?

LinuxAgent is the ideal reference for creating new device agents because:

✅ Simple Architecture: Single-tier agent (no HostAgent delegation)
✅ Clear Separation: Clean server-client boundary
✅ Well-Documented: Comprehensive code and documentation
✅ Production-Ready: Battle-tested in real deployments
✅ Minimal Complexity: Focuses on core device agent patterns

LinuxAgent Components

graph TB
    subgraph "Server Side (ufo/agents/)"
        LA[LinuxAgent Class<br/>customized_agent.py]
        LAP[LinuxAgentProcessor<br/>customized_agent_processor.py]
        LAS[LinuxAgent Strategies<br/>linux_agent_strategy.py]
        LAST[LinuxAgent States<br/>linux_agent_state.py]
        
        LA --> LAP
        LAP --> LAS
        LA --> LAST
    end
    
    subgraph "Client Side (ufo/client/)"
        Client[UFO Client<br/>client.py]
        MCP[MCP Server Manager<br/>mcp_server_manager.py]
        LinuxMCP[Linux MCP Server<br/>linux_mcp_server.py]
        
        Client --> MCP
        MCP --> LinuxMCP
    end
    
    subgraph "Configuration"
        Config[third_party.yaml]
        Devices[devices.yaml]
        Prompts[Prompt Templates]
    end
    
    LA -.reads.-> Config
    Client -.reads.-> Devices
    LA -.uses.-> Prompts
    
    style LA fill:#c8e6c9
    style LAP fill:#c8e6c9
    style LAS fill:#c8e6c9
    style LAST fill:#c8e6c9
    style Client fill:#e1f5ff
    style MCP fill:#e1f5ff
    style LinuxMCP fill:#e1f5ff

File Locations:

Component	File Path	Purpose
Agent Class	`ufo/agents/agent/customized_agent.py`	LinuxAgent definition
Processor	`ufo/agents/processors/customized/customized_agent_processor.py`	LinuxAgentProcessor
Strategies	`ufo/agents/processors/strategies/linux_agent_strategy.py`	LLM & Action strategies
States	`ufo/agents/states/linux_agent_state.py`	State machine states
Prompter	`ufo/prompter/customized/linux_agent_prompter.py`	Prompt construction
Client	`ufo/client/client.py`	Device client entry point
MCP Server	`ufo/client/mcp/http_servers/linux_mcp_server.py`	Command execution

LinuxAgent Architecture Diagram

sequenceDiagram
    participant User
    participant Server as LinuxAgent Server
    participant AIP as AIP Protocol
    participant Client as Linux Client
    participant MCP as Linux MCP Server
    participant Shell as Bash Shell
    
    User->>Server: User Request: "List files in /tmp"
    
    Server->>Server: State: ContinueLinuxAgentState
    Server->>Server: Processor: LinuxAgentProcessor
    
    Server->>Server: Strategy: LLM_INTERACTION
    Note over Server: Construct prompt, call LLM
    Server->>Server: LLM Response: execute_command("ls -la /tmp")
    
    Server->>Server: Strategy: ACTION_EXECUTION
    Server->>AIP: COMMAND: execute_command
    AIP->>Client: WebSocket: COMMAND
    
    Client->>MCP: Call MCP Tool: execute_command
    MCP->>Shell: Execute: ls -la /tmp
    Shell-->>MCP: stdout, stderr, exit_code
    MCP-->>Client: Result
    Client->>AIP: WebSocket: RESULT
    AIP->>Server: RESULT
    
    Server->>Server: Strategy: MEMORY_UPDATE
    Server->>Server: Update memory & blackboard
    
    Server->>Server: State Transition: FINISH
    Server->>User: Task Complete

Key Execution Flow:

User Request → LinuxAgent Server receives request
State Machine → Activates ContinueLinuxAgentState
Processor → Executes LinuxAgentProcessor strategies
LLM Interaction → Generates shell command
Action Execution → Sends command via AIP to client
MCP Execution → Client executes via Linux MCP Server
Result Handling → Server receives result, updates memory
State Transition → Moves to FINISH state
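
The round trip above can be simulated in a few lines. This is a sketch under stated assumptions: the JSON message fields (`type`, `tool`, `args`, `stdout`, `exit_code`) and the helper names `fake_llm`, `client_handle`, and `run_request` are hypothetical and do not reflect the actual AIP wire format; the LLM and the MCP call are replaced by canned stubs so the example runs anywhere.

```python
import json


def fake_llm(request: str) -> dict:
    """Stand-in for the LLM call: maps one known request to a tool call."""
    if request == "List files in /tmp":
        return {"tool": "execute_command", "args": {"command": "ls -la /tmp"}}
    return {"tool": "noop", "args": {}}


def client_handle(raw_message: str) -> str:
    """Stand-in for the device client: decodes a COMMAND, returns a RESULT."""
    message = json.loads(raw_message)
    # A real client would call the MCP tool here; this sketch returns canned output.
    result = {
        "type": "RESULT",
        "tool": message["tool"],
        "stdout": f"(output of {message['args'].get('command', '')})",
        "exit_code": 0,
    }
    return json.dumps(result)


def run_request(request: str) -> dict:
    """Server side: LLM_INTERACTION, then ACTION_EXECUTION, then MEMORY_UPDATE."""
    memory = []
    action = fake_llm(request)                           # LLM_INTERACTION
    command = json.dumps({"type": "COMMAND", **action})  # serialize for transport
    result = json.loads(client_handle(command))          # ACTION_EXECUTION round trip
    memory.append({"request": request, "action": action, "result": result})  # MEMORY_UPDATE
    state = "FINISH" if result["exit_code"] == 0 else "CONTINUE"
    return {"state": state, "memory": memory}


outcome = run_request("List files in /tmp")
print(outcome["state"])
```

In a real deployment the `client_handle` call would be a WebSocket send and receive, but the server-side sequence of phases stays the same.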

Architecture Overview

Complete Device Agent Architecture

When creating a new device agent (e.g., MobileAgent), you'll implement these components:

graph TB
    subgraph "1. Agent Definition"
        A1[Agent Class<br/>MobileAgent]
        A2[Processor<br/>MobileAgentProcessor]
        A3[State Manager<br/>MobileAgentStateManager]
    end
    
    subgraph "2. Processing Strategies"
        S1[DATA_COLLECTION<br/>Screenshot, UI Tree]
        S2[LLM_INTERACTION<br/>Prompt Construction]
        S3[ACTION_EXECUTION<br/>Command Dispatch]
        S4[MEMORY_UPDATE<br/>Context Update]
    end
    
    subgraph "3. MCP Server"
        M1[MCP Server<br/>mobile_mcp_server.py]
        M2[MCP Tools<br/>tap, swipe, type, etc.]
    end
    
    subgraph "4. Configuration"
        C1[third_party.yaml<br/>Agent Config]
        C2[devices.yaml<br/>Device Registry]
        C3[Prompt Templates<br/>LLM Prompts]
    end
    
    subgraph "5. Client"
        CL1[Device Client<br/>client.py]
        CL2[MCP Manager<br/>mcp_server_manager.py]
    end
    
    A1 --> A2
    A2 --> S1 & S2 & S3 & S4
    S3 --> M1
    M1 --> M2
    A1 -.reads.-> C1
    CL1 --> CL2
    CL2 --> M1
    CL1 -.reads.-> C2
    A2 -.uses.-> C3
    
    style A1 fill:#c8e6c9
    style A2 fill:#c8e6c9
    style A3 fill:#c8e6c9
    style M1 fill:#e1f5ff
    style CL1 fill:#e1f5ff

Implementation Checklist:

Tutorial Roadmap

This tutorial is split into 6 detailed guides:

📘 Part 1: Core Components

Learn to implement the server-side components:

Agent Class definition
Processor and strategies
State Manager and states
Prompter for LLM interaction

Time: 45 minutes
Difficulty: ⭐⭐⭐

📘 Part 2: MCP Server Development

Create a platform-specific MCP server:

MCP server architecture
Defining MCP tools
Command execution logic
Error handling and validation

Time: 30 minutes
Difficulty: ⭐⭐

📘 Part 3: Client Configuration

Set up the device client:

Client initialization
MCP server manager integration
WebSocket connection setup
Platform detection

Time: 20 minutes
Difficulty: ⭐⭐

📘 Part 4: Configuration & Deployment

Configure and deploy your agent:

third_party.yaml configuration
devices.yaml device registration
Prompt template creation
Galaxy integration

Time: 25 minutes
Difficulty: ⭐⭐

📘 Part 5: Testing & Debugging

Test and debug your implementation:

Unit testing strategies
Integration testing
Debugging techniques
Common issues and solutions

Time: 30 minutes
Difficulty: ⭐⭐⭐

📘 Part 6: Complete Example: MobileAgent

Hands-on walkthrough creating MobileAgent:

Step-by-step implementation
Android/iOS platform specifics
UI Automator integration
Complete working example

Time: 60 minutes
Difficulty: ⭐⭐⭐⭐

Quick Start Guide

For experienced developers, here's a minimal implementation checklist:

1️⃣ Create Agent Class

# ufo/agents/agent/customized_agent.py

@AgentRegistry.register(
    agent_name="MobileAgent",
    third_party=True,
    processor_cls=MobileAgentProcessor
)
class MobileAgent(CustomizedAgent):
    def __init__(self, name, main_prompt, example_prompt):
        super().__init__(name, main_prompt, example_prompt,
                         process_name=None, app_root_name=None, is_visual=None)
        self._blackboard = Blackboard()
        self.set_state(self.default_state)
        self._context_provision_executed = False
    
    @property
    def default_state(self):
        return ContinueMobileAgentState()
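
For readers unfamiliar with decorator-based registration, the sketch below shows the general pattern that a call like `@AgentRegistry.register(...)` suggests. `ToyRegistry` and everything inside it are hypothetical; this is not UFO³'s `AgentRegistry`, whose internals and full parameter list are not shown in this document.

```python
from typing import Callable, Dict, Optional, Type


class ToyRegistry:
    """Hypothetical name-to-class registry; not UFO's AgentRegistry."""

    _agents: Dict[str, Type] = {}

    @classmethod
    def register(
        cls, agent_name: str, processor_cls: Optional[Type] = None
    ) -> Callable[[Type], Type]:
        def decorator(agent_cls: Type) -> Type:
            if agent_name in cls._agents:
                raise ValueError(f"agent {agent_name!r} is already registered")
            agent_cls.processor_cls = processor_cls  # remember the paired processor
            cls._agents[agent_name] = agent_cls
            return agent_cls  # the decorated class is returned for normal use
        return decorator

    @classmethod
    def get(cls, agent_name: str) -> Type:
        return cls._agents[agent_name]


class ToyProcessor:
    pass


@ToyRegistry.register(agent_name="ToyMobileAgent", processor_cls=ToyProcessor)
class ToyMobileAgent:
    def __init__(self, name: str) -> None:
        self.name = name


agent_cls = ToyRegistry.get("ToyMobileAgent")
agent = agent_cls("mobile_agent_1")
print(type(agent).__name__, agent_cls.processor_cls.__name__)
```

Registration by name is what lets a configuration file refer to an agent as a plain string while the framework looks up the class at runtime.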

2️⃣ Create Processor

# ufo/agents/processors/customized/customized_agent_processor.py

class MobileAgentProcessor(CustomizedProcessor):
    def _setup_strategies(self):
        # Compose multiple data collection strategies
        self.strategies[ProcessingPhase.DATA_COLLECTION] = ComposedStrategy(
            strategies=[
                MobileScreenshotCaptureStrategy(fail_fast=True),
                MobileAppsCollectionStrategy(fail_fast=False),
                MobileControlsCollectionStrategy(fail_fast=False),
            ],
            name="MobileDataCollectionStrategy",
            fail_fast=True,
        )
        
        self.strategies[ProcessingPhase.LLM_INTERACTION] = (
            MobileLLMInteractionStrategy(fail_fast=True)
        )
        self.strategies[ProcessingPhase.ACTION_EXECUTION] = (
            MobileActionExecutionStrategy(fail_fast=False)
        )
        self.strategies[ProcessingPhase.MEMORY_UPDATE] = (
            AppMemoryUpdateStrategy(fail_fast=False)
        )
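
The `fail_fast` flags above are easier to reason about with a toy model. The sketch below assumes that `fail_fast=True` aborts the whole composition on the first error, while `fail_fast=False` records the error and continues; that reading is an assumption, and `ToyComposedStrategy` is a hypothetical stand-in, not UFO³'s `ComposedStrategy`.

```python
from typing import Callable, List, Tuple

Strategy = Callable[[dict], dict]  # takes the context, returns keys to merge into it


class ToyComposedStrategy:
    """Runs child strategies in order and merges their outputs into the context."""

    def __init__(self, strategies: List[Tuple[Strategy, bool]], name: str) -> None:
        # Each entry is (callable, fail_fast); fail_fast=True aborts on error.
        self.strategies = strategies
        self.name = name

    def execute(self, context: dict) -> dict:
        errors = []
        for strategy, fail_fast in self.strategies:
            try:
                context.update(strategy(context))
            except Exception as exc:
                if fail_fast:
                    raise RuntimeError(f"{self.name}: {strategy.__name__} failed") from exc
                errors.append(f"{strategy.__name__}: {exc}")  # tolerated failure
        context["errors"] = errors
        return context


def capture_screenshot(context: dict) -> dict:
    return {"screenshot": "screen.png"}


def collect_apps(context: dict) -> dict:
    raise TimeoutError("package list timed out")


def collect_controls(context: dict) -> dict:
    return {"controls": ["Button:OK"]}


composed = ToyComposedStrategy(
    [(capture_screenshot, True), (collect_apps, False), (collect_controls, False)],
    name="ToyDataCollection",
)
result = composed.execute({})
print(result)
```

Under this reading, a missing screenshot stops the step because the LLM has nothing to look at, whereas a failed app listing only degrades the context.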

3️⃣ Create MCP Server

# ufo/client/mcp/http_servers/mobile_mcp_server.py

def create_mobile_mcp_server(host="localhost", port=8020):
    mcp = FastMCP("Mobile MCP Server", stateless_http=False, 
                  json_response=True, host=host, port=port)
    
    @mcp.tool()
    async def tap_element(x: int, y: int) -> dict:
        # Execute tap via ADB or platform API
        pass
    
    mcp.run(transport="streamable-http")
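
The body of `tap_element` is left as a placeholder above. One possible implementation for Android, sketched here as an assumption rather than as the tutorial's method, shells out to `adb shell input tap`. The helper names `build_tap_command` and `tap`, the optional `serial` parameter, and the shape of the returned dict are all hypothetical.

```python
import subprocess
from typing import List, Optional


def build_tap_command(x: int, y: int, serial: Optional[str] = None) -> List[str]:
    """Build the adb argv for a tap at pixel coordinates (x, y)."""
    if x < 0 or y < 0:
        raise ValueError("coordinates must be non-negative")
    argv = ["adb"]
    if serial:  # target a specific device when several are attached
        argv += ["-s", serial]
    argv += ["shell", "input", "tap", str(x), str(y)]
    return argv


def tap(x: int, y: int, serial: Optional[str] = None, timeout: float = 10.0) -> dict:
    """Run the tap and report the outcome as a plain dict."""
    try:
        completed = subprocess.run(
            build_tap_command(x, y, serial),
            capture_output=True, text=True, timeout=timeout,
        )
    except FileNotFoundError:
        return {"success": False, "error": "adb not found on PATH"}
    except subprocess.TimeoutExpired:
        return {"success": False, "error": "adb timed out"}
    return {
        "success": completed.returncode == 0,
        "stdout": completed.stdout,
        "stderr": completed.stderr,
    }


print(build_tap_command(540, 1200))
```

Passing the arguments as a list instead of a shell string avoids quoting problems, and keeping command construction separate from execution makes the tool testable without a device attached.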

4️⃣ Configure Agent

# config/ufo/third_party.yaml

ENABLED_THIRD_PARTY_AGENTS: ["MobileAgent"]

THIRD_PARTY_AGENT_CONFIG:
  MobileAgent:
    VISUAL_MODE: True
    AGENT_NAME: "MobileAgent"
    APPAGENT_PROMPT: "ufo/prompts/third_party/mobile_agent.yaml"
    APPAGENT_EXAMPLE_PROMPT: "ufo/prompts/third_party/mobile_agent_example.yaml"
    INTRODUCTION: "MobileAgent controls Android/iOS devices..."

5️⃣ Register Device

# config/galaxy/devices.yaml

devices:
  - device_id: "mobile_agent_1"
    server_url: "ws://localhost:5010/ws"
    os: "android"
    capabilities: ["ui_automation", "app_testing"]
    metadata:
      device_model: "Pixel 6"
      android_version: "13"
    max_retries: 5
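
A registry entry like the one above is easy to get subtly wrong, so a quick sanity check before starting the server can save time. The sketch below works on the already-parsed structure (a list of dicts) to stay dependency-free. The set of required keys and the function name `validate_devices` are assumptions for illustration; the source does not state which fields Galaxy treats as mandatory.

```python
from urllib.parse import urlparse

# Assumed required fields; adjust to whatever the registry actually enforces.
REQUIRED_KEYS = {"device_id", "server_url", "os", "capabilities"}


def validate_devices(devices: list) -> list:
    """Return a list of human-readable problems; an empty list means no issues found."""
    problems = []
    seen_ids = set()
    for index, device in enumerate(devices):
        label = device.get("device_id", f"entry {index}")
        missing = REQUIRED_KEYS - set(device)
        if missing:
            problems.append(f"{label}: missing {sorted(missing)}")
        if urlparse(device.get("server_url", "")).scheme not in {"ws", "wss"}:
            problems.append(f"{label}: server_url must use ws:// or wss://")
        if label in seen_ids:
            problems.append(f"{label}: duplicate device_id")
        seen_ids.add(label)
    return problems


devices = [
    {
        "device_id": "mobile_agent_1",
        "server_url": "ws://localhost:5010/ws",
        "os": "android",
        "capabilities": ["ui_automation", "app_testing"],
        "metadata": {"device_model": "Pixel 6", "android_version": "13"},
        "max_retries": 5,
    }
]
print(validate_devices(devices))
```

The `device_id` checked here is the same value the client later passes as `--client-id`, so a typo in either place would prevent the two from matching.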

6️⃣ Start Server & Client

# Terminal 1: Start Agent Server
python -m ufo.server.app --port 5010

# Terminal 2: Start Device Client
python -m ufo.client.client \
  --ws --ws-server ws://localhost:5010/ws \
  --client-id mobile_agent_1 \
  --platform android

# Terminal 3: Start MCP Server (on device or accessible endpoint)
python -m ufo.client.mcp.http_servers.mobile_mcp_server --port 8020

Next Steps

Ready to Build Your Device Agent?

Start with Part 1: Core Components →

Or jump to a specific topic:

Summary

Key Takeaways:

Device Agents control entire platforms (Windows, Linux, Mobile)
Server-Client Architecture separates reasoning from execution
Three-Layer Design provides modular, extensible framework
LinuxAgent is the best reference implementation
6-Part Tutorial covers all aspects of device agent creation
MCP Integration enables platform-specific command execution
Galaxy Integration supports multi-device orchestration

Ready to build your first device agent? Let's get started! 🚀
