Desktop AI Agents

Build AI agents that use computers

Give LLMs the ability to see and interact with desktop applications. Build autonomous agents that can complete complex tasks across any software.

desktop-agent.ts
import {
    screen, FileType
} from "@nut-tree/nut-js";
import Anthropic from "@anthropic-ai/sdk";

async function runDesktopAgent(task: string) {
    const anthropic = new Anthropic({
        apiKey: process.env['ANTHROPIC_API_KEY'],
    });

    while (true) {
        // Capture what the agent "sees"
        const screenshot = await screen.grab();
        const dataUrl = await screenshot.toDataURL(FileType.PNG);
        const base64Image = dataUrl.replace(/^data:image\/png;base64,/, ""); // the API expects raw base64, not a data URL

        // Ask the LLM what to do
        const response = await anthropic.messages.create({
            model: "claude-sonnet-4-5-20250929",
            max_tokens: 1024,
            messages: [{
                role: "user",
                content: [
                    { type: "image", source: { type: "base64", media_type: "image/png", data: base64Image } },
                    { type: "text", text: `Task: ${task}\nWhat action should I take?` }
                ]
            }],
        });

        // Parse the model's reply into a structured action, then execute it
        // (parseAction and isTaskComplete are sketched after this file)
        const action = parseAction(response);
        await executeAction(action);

        if (isTaskComplete(response)) break;
    }
}
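
The loop above leaves response handling open. Below is a minimal sketch of the helpers it relies on, under the assumption that the model is prompted to reply with a single JSON action object and with the word DONE once the task is finished; parseAction, responseText, and that prompt contract are illustrative, not part of nut.js or the Anthropic SDK.

parse-response.ts
// Loosely typed to stay independent of SDK type export paths
type ModelResponse = { content: Array<{ type: string; text?: string }> };

// Collect all text blocks from the model's reply
function responseText(response: ModelResponse): string {
    return response.content
        .filter((block) => block.type === "text")
        .map((block) => block.text ?? "")
        .join("\n");
}

// Pull the first JSON object out of the reply and treat it as an AgentAction
// (AgentAction is the interface from execute-action.ts below)
function parseAction(response: ModelResponse): AgentAction {
    const match = responseText(response).match(/\{[\s\S]*\}/);
    if (!match) {
        return { type: "wait" }; // nothing actionable: pause and re-observe
    }
    return JSON.parse(match[0]) as AgentAction;
}

// The task is considered complete when the model says DONE
function isTaskComplete(response: ModelResponse): boolean {
    return responseText(response).includes("DONE");
}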

The building blocks for computer use

nut.js provides everything you need to enable AI agents to interact with desktop applications.

👁️

Visual Understanding

Capture screenshots for LLM vision models. Let AI see and understand the desktop.

🎯

Multiple Search Methods

Find elements by UI structure, images, text, or colors. Flexible strategies for any app; see the image-search sketch after these cards.

📝

OCR Integration

Extract text from the screen for enhanced context. Help AI understand UI elements.

💻

Cross-Platform

Build agents that work on Windows, macOS, and Linux without code changes.
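
For example, image-based search lets an agent locate a known control precisely instead of guessing coordinates. A minimal sketch, assuming an image finder plugin such as @nut-tree/template-matcher is installed and registered via its side-effect import, and with the template file name and resource directory as placeholders:

find-by-image.ts
import "@nut-tree/template-matcher"; // assumed: registers an image finder provider
import {
    screen, mouse, straightTo, centerOf, imageResource, Button
} from "@nut-tree/nut-js";

async function clickByTemplate() {
    // Directory that imageResource() resolves template files against (placeholder)
    screen.config.resourceDirectory = "./templates";

    // Find the template on screen; this throws if no match is found
    const region = await screen.find(imageResource("button.png"));

    // Move to the center of the match and click it
    await mouse.move(straightTo(centerOf(region)));
    await mouse.click(Button.LEFT);
}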

Translate AI decisions into real actions

Connect your LLM's output to precise desktop interactions. nut.js handles the execution while your AI handles the reasoning.

Coordinate-based clicking

LLM outputs coordinates, nut.js executes clicks

Natural text input

Type text with configurable speed and timing

Complex interactions

Drag and drop, right-click menus, keyboard shortcuts (see the sketch after execute-action.ts)

execute-action.ts
import {
    mouse, Button, keyboard, straightTo, sleep
} from "@nut-tree/nut-js";

interface AgentAction {
    type: "click" | "type" | "scroll" | "wait";
    x?: number;
    y?: number;
    text?: string;
    direction?: "up" | "down";
    amount?: number;
}

async function executeAction(action: AgentAction) {
    switch (action.type) {
        case "click":
            await mouse.move(
                straightTo({ x: action.x!, y: action.y! })
            );
            await mouse.click(Button.LEFT);
            break;

        case "type":
            await keyboard.type(action.text!);
            break;

        case "scroll":
            if (action.direction === "up") {
                await mouse.scrollUp(action.amount ?? 3);
            } else {
                await mouse.scrollDown(action.amount ?? 3);
            }
            break;

        case "wait":
            await sleep(1000);
            break;
    }
}
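
The complex interactions listed above use the same primitives. A minimal sketch of drag and drop, a context menu, and a keyboard shortcut; the coordinates are placeholders an LLM would supply:

complex-actions.ts
import {
    mouse, keyboard, straightTo, Button, Key, Point
} from "@nut-tree/nut-js";

// Drag from one point to another, e.g. to reorder an item
async function dragAndDrop(from: Point, to: Point) {
    await mouse.move(straightTo(from));
    await mouse.drag(straightTo(to));
}

// Open a right-click context menu at the given position
async function openContextMenu(at: Point) {
    await mouse.move(straightTo(at));
    await mouse.click(Button.RIGHT);
}

// Send a keyboard shortcut such as Ctrl+S
async function saveShortcut() {
    await keyboard.pressKey(Key.LeftControl, Key.S);
    await keyboard.releaseKey(Key.LeftControl, Key.S);
}

// Example usage with coordinates chosen by the model:
// await dragAndDrop(new Point(100, 200), new Point(400, 200));
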
screen-context.ts
import { screen, getActiveWindow } from "@nut-tree/nut-js";

async function getScreenContext() {
    // Capture the screen
    const screenshot = await screen.grab();

    // Get active window information
    const activeWindow = await getActiveWindow();

    return {
        // Note: Image.data is a raw pixel buffer; encode it to PNG (as in desktop-agent.ts) before sending to a vision model
        screenshot: screenshot.data.toString("base64"),
        activeApp: await activeWindow.getTitle(),
        windowRegion: await activeWindow.getRegion(),
        screenSize: {
            width: await screen.width(),
            height: await screen.height()
        },
    };
}

// Provide rich context to your LLM
async function buildPrompt(task: string) {
    const context = await getScreenContext();
    return `
Current screen shows: ${context.activeApp}
Screen size: ${context.screenSize.width}x${context.screenSize.height}

Task: ${task}
`;
}
Enhanced Vision

Give your agent rich context

Combine screenshots with OCR text extraction, window information, and screen metadata to help your LLM make better decisions.

  • Full-screen and region capture (see the sketch after this list)
  • OCR text extraction with @nut-tree/plugin-ocr
  • Active window detection and metadata
  • Multi-monitor support
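
Region capture is a simple way to cut image size and focus the model on what matters. A minimal sketch that grabs only the active window instead of the full screen (the helper name is illustrative):

grab-active-window.ts
import { screen, getActiveWindow } from "@nut-tree/nut-js";

// Capture only the pixels of the currently focused window
async function grabActiveWindow() {
    const activeWindow = await getActiveWindow();
    const region = await activeWindow.getRegion();

    // grabRegion returns an Image limited to the given region
    return screen.grabRegion(region);
}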

Choose your architecture

nut.js is flexible enough to support various agent architectures and AI models.

Vision-Language Models

Use GPT-4 Vision, Claude, or Gemini to interpret screenshots and decide actions.

Claude 3.5 · GPT-4 Vision · Gemini Pro

Hybrid Approaches

Combine LLM reasoning with traditional image recognition for reliable automation.

LLM + Template Matching · LLM + OCR · Multi-modal

Autonomous Agents

Build agents that can plan, execute, and verify multi-step tasks independently.

ReAct Pattern · Plan-Execute · Self-Correcting
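
As one way to structure such an agent, the reasoning loop from desktop-agent.ts can be wrapped in a plan-execute-verify cycle. A rough sketch; planSteps, verifyStep, runPlanExecuteAgent, and the JSON / YES-NO prompt contracts are illustrative assumptions, not part of nut.js or the Anthropic SDK:

plan-execute-agent.ts
import { screen, FileType } from "@nut-tree/nut-js";
import Anthropic from "@anthropic-ai/sdk";

const anthropic = new Anthropic({ apiKey: process.env['ANTHROPIC_API_KEY'] });

// Ask the model for an ordered list of short steps (assumes it replies with a JSON array)
async function planSteps(task: string): Promise<string[]> {
    const response = await anthropic.messages.create({
        model: "claude-sonnet-4-5-20250929",
        max_tokens: 1024,
        messages: [{
            role: "user",
            content: `Break this desktop task into short, ordered steps. Reply with a JSON array of strings only.\nTask: ${task}`,
        }],
    });
    const block = response.content.find((b) => b.type === "text");
    return block && "text" in block ? JSON.parse(block.text) : [];
}

// Take a fresh screenshot and ask the model whether the step visibly succeeded
async function verifyStep(step: string): Promise<boolean> {
    const screenshot = await screen.grab();
    const dataUrl = await screenshot.toDataURL(FileType.PNG);
    const response = await anthropic.messages.create({
        model: "claude-sonnet-4-5-20250929",
        max_tokens: 16,
        messages: [{
            role: "user",
            content: [
                {
                    type: "image",
                    source: {
                        type: "base64",
                        media_type: "image/png",
                        data: dataUrl.replace(/^data:image\/png;base64,/, ""),
                    },
                },
                { type: "text", text: `Did this step succeed: "${step}"? Answer YES or NO.` },
            ],
        }],
    });
    const block = response.content.find((b) => b.type === "text");
    return !!block && "text" in block && block.text.includes("YES");
}

// Plan, execute each step with the loop from desktop-agent.ts, then verify and retry once
async function runPlanExecuteAgent(task: string) {
    for (const step of await planSteps(task)) {
        await runDesktopAgent(step);
        if (!(await verifyStep(step))) {
            await runDesktopAgent(`${step} (previous attempt did not succeed, try again)`);
        }
    }
}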

What can you build?

🤖

AI Assistants

Build AI assistants that can operate any desktop application on behalf of users.

⚙️

Process Automation

Create intelligent RPA that adapts to UI changes and handles exceptions.

🧪

Testing Agents

Build AI-powered testers that explore applications and find issues autonomously.

📊

Data Extraction

Extract data from any application by teaching AI what to look for.

Ready to build intelligent desktop agents?

Get started with nut.js and give your AI the power to use any desktop application.