AI SDK

AI vision-based content matching using LLMs via Vercel AI SDK.

Overview

The @nut-tree/plugin-ai-sdk plugin enables AI vision-based content matching in nut.js using large language models via the Vercel AI SDK. Instead of pixel-based template matching, it uses multimodal LLMs to understand screen content from natural-language descriptions — both for locating UI elements and for formulating test expectations.

Content Matching

Find UI elements by describing what you see

screen.find(contentMatchingDescription("a login button"))

Test Assertions

Write visual expectations in natural language

expect(screen).toShow(contentMatchingDescription("a welcome dialog"))

Multiple Providers

OpenAI, Anthropic, and Ollama support

useOpenAIVisionProvider()

Installation

Install the plugin along with the AI SDK provider for your preferred LLM:

shell
# Core plugin
npm install @nut-tree/plugin-ai-sdk

# Choose one or more AI SDK providers:
npm install @ai-sdk/openai      # OpenAI (GPT-5, etc.)
npm install @ai-sdk/anthropic   # Anthropic (Claude Opus 4.6, etc.)
npm install ollama-ai-provider   # Ollama (local models)

Subscription Required

This package is included in Solo and Team subscription plans.

Quick Reference

Provider Functions

useOpenAIVisionProvider

useOpenAIVisionProvider(options?)
void

Activate OpenAI as the vision provider for content matching

useAnthropicVisionProvider

useAnthropicVisionProvider(options?)
void

Activate Anthropic as the vision provider for content matching

useOllamaVisionProvider

useOllamaVisionProvider(options?)
void

Activate Ollama as the vision provider for local AI matching

Query Functions

contentMatchingDescription

contentMatchingDescription(description: string)
ContentQuery

Creates a content query that describes what to look for on screen. Used with screen.find, screen.findAll, screen.waitFor, and test matchers like toShow and toMatchContentDescription.
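Beyond single-element lookups, the same query works with multi-match searches. A minimal sketch using `screen.findAll` (the standard nut.js multi-match API), assuming an OpenAI provider has already been activated:

```typescript
import { screen, contentMatchingDescription } from "@nut-tree/nut-js";
import { useOpenAIVisionProvider } from "@nut-tree/plugin-ai-sdk";
import { openai } from "@ai-sdk/openai";

useOpenAIVisionProvider({ model: openai("gpt-5") });

// Locate every region matching the description, e.g. all checkboxes in a form
const regions = await screen.findAll(
    contentMatchingDescription("a checkbox")
);
console.log(`Found ${regions.length} matching regions`);
```

Each returned region can then be used with `mouse`, `centerOf`, and friends just like a single `screen.find` result.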


Provider Setup

OpenAI

typescript
import { screen, contentMatchingDescription } from "@nut-tree/nut-js";
import { useOpenAIVisionProvider } from "@nut-tree/plugin-ai-sdk";
import { openai } from "@ai-sdk/openai";

useOpenAIVisionProvider({
    model: openai("gpt-5"),
});

// Find elements by description
const region = await screen.find(
    contentMatchingDescription("the login button")
);

API Key Required

Set the OPENAI_API_KEY environment variable with your OpenAI API key.

Anthropic

typescript
import { screen, contentMatchingDescription } from "@nut-tree/nut-js";
import { useAnthropicVisionProvider } from "@nut-tree/plugin-ai-sdk";
import { anthropic } from "@ai-sdk/anthropic";

useAnthropicVisionProvider({
    model: anthropic("claude-opus-4-6"),
});

const region = await screen.find(
    contentMatchingDescription("the submit button")
);

API Key Required

Set the ANTHROPIC_API_KEY environment variable with your Anthropic API key.

Ollama (Local)

Use Ollama for fully local AI matching without external API calls:

typescript
import { screen, contentMatchingDescription } from "@nut-tree/nut-js";
import { useOllamaVisionProvider } from "@nut-tree/plugin-ai-sdk";
import { ollama } from "ollama-ai-provider";

useOllamaVisionProvider({
    model: ollama("llava"),
});

const region = await screen.find(
    contentMatchingDescription("the search input field")
);

Local Setup

Make sure Ollama is running locally (ollama serve) and you have pulled a vision-capable model like llava or llava-llama3.
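The local setup described above typically boils down to two commands (assuming Ollama is already installed):

```shell
# Start the Ollama server (if it is not already running)
ollama serve &

# Pull a vision-capable model for the plugin to use
ollama pull llava
```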

Configuration

All provider setup functions accept a configuration object:

typescript
import { useOpenAIVisionProvider } from "@nut-tree/plugin-ai-sdk";
import { openai } from "@ai-sdk/openai";

useOpenAIVisionProvider({
    // Required: the AI model to use
    model: openai("gpt-5"),

    // Optional: default confidence threshold (0-1)
    defaultConfidence: 0.7,

    // Optional: maximum matches to return per search
    defaultMaxMatches: 5,

    // Optional: matching strategy
    matching: "default",
});

Options Reference

model

model: LanguageModelV1
required

The AI SDK language model instance to use for vision analysis

defaultConfidence

defaultConfidence?: number
optional

Default confidence threshold for matches (0-1). Can be overridden per search.

defaultMaxMatches

defaultMaxMatches?: number
optional

Maximum number of matches to return per search. Can be overridden per search.

matching

matching?: VisionMatchingSettings
optional

Explicit matcher tuning settings. When present, these values override equivalent top-level options.


Usage

Finding Elements

Use contentMatchingDescription with screen.find to locate UI elements by describing what they look like:

typescript
import { screen, mouse, centerOf, straightTo, Button, contentMatchingDescription } from "@nut-tree/nut-js";
import { useOpenAIVisionProvider } from "@nut-tree/plugin-ai-sdk";
import { openai } from "@ai-sdk/openai";

useOpenAIVisionProvider({ model: openai("gpt-5") });

// Find a specific UI element by description
const region = await screen.find(
    contentMatchingDescription("a system menu called 'Navigate'"),
    { confidence: 0.8 }
);

// Interact with the found region
await mouse.move(straightTo(centerOf(region)));
await mouse.click(Button.LEFT);

Waiting for Elements

Wait for content to appear on screen within a timeout:

typescript
// Wait up to 10 seconds for a dialog to appear, checking every second
const dialog = await screen.waitFor(
    contentMatchingDescription("a confirmation dialog with 'Save changes?' text"),
    10000,
    1000
);

Test Assertions

The real power of contentMatchingDescription shines in end-to-end tests, where you can formulate visual expectations in plain language:

typescript
import { screen, contentMatchingDescription } from "@nut-tree/nut-js";
import { useOpenAIVisionProvider } from "@nut-tree/plugin-ai-sdk";
import { openai } from "@ai-sdk/openai";

useOpenAIVisionProvider({ model: openai("gpt-5") });

// Assert that the screen shows specific content
await expect(screen).toShow(
    contentMatchingDescription("a browser window with an open npmjs.org tab")
);

// Assert with a custom confidence threshold
await expect(screen).toShow(
    contentMatchingDescription("a navigation bar with a 'Home' link"),
    { confidence: 0.9 }
);

The toShow matcher is available through the Jest and Vitest integration matchers provided by @nut-tree/nut-js.
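In a Jest or Vitest setup file, registering those matchers might look like the following sketch (assuming the `jestMatchers` bundle exported by @nut-tree/nut-js):

```typescript
import { expect } from "vitest";
import { jestMatchers } from "@nut-tree/nut-js";

// Register the nut.js visual matchers (toShow, etc.) with the test runner
expect.extend(jestMatchers);
```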

Full E2E Test Example

Here is a complete example combining content matching with test assertions:

typescript
import { describe, it, expect, beforeAll } from "vitest";
import { screen, keyboard, mouse, centerOf, straightTo, Button, contentMatchingDescription } from "@nut-tree/nut-js";
import { useOpenAIVisionProvider } from "@nut-tree/plugin-ai-sdk";
import { openai } from "@ai-sdk/openai";

beforeAll(() => {
    useOpenAIVisionProvider({ model: openai("gpt-5") });
});

describe("Application navigation", () => {
    it("should open the settings page", async () => {
        // Find and click the settings menu
        const settingsMenu = await screen.find(
            contentMatchingDescription("a menu item labeled 'Settings'")
        );
        await mouse.move(straightTo(centerOf(settingsMenu)));
        await mouse.click(Button.LEFT);

        // Verify the settings page is displayed
        await expect(screen).toShow(
            contentMatchingDescription("a settings page with 'General' and 'Advanced' tabs")
        );
    });

    it("should display search results", async () => {
        await keyboard.type("nut.js automation");

        await expect(screen).toShow(
            contentMatchingDescription("a list of search results related to 'nut.js'")
        );
    });
});

Best Practices

Writing Good Descriptions

  • Be specific about what you see (e.g., "a system menu called 'Navigate'" vs "a menu")
  • Include visual characteristics like color, position, or text content
  • Describe the element in context (e.g., "a browser window with an open npmjs.org tab")

Precision and Flakiness

Vision models do not return pixel-perfect results. Using descriptive queries to precisely locate a small element (e.g. a specific button to click) is highly dependent on model precision and carries a significant risk of flaky or failing results. For interactions that require accurate coordinates, prefer template-based matching with nl-matcher. Descriptive queries shine in test expectations, where the goal is to verify what is shown on screen rather than to pinpoint an exact location:
expect(screen).toShow(contentMatchingDescription("an upwards trending graph"))

Performance Considerations

  • AI vision matching is slower than template matching (nl-matcher) due to API latency
  • Cloud providers (OpenAI, Anthropic) require internet connectivity and incur API costs
  • Ollama provides local matching but requires a capable GPU for good performance
  • Consider using nl-matcher for speed-critical operations and AI SDK for complex visual understanding
