Image Matching
AI SDK
AI vision-based content matching using LLMs via Vercel AI SDK.
Overview
The @nut-tree/plugin-ai-sdk plugin enables AI vision-based content matching in nut.js using large language models via the Vercel AI SDK. Instead of pixel-based template matching, it uses multimodal LLMs to understand screen content by natural language description — both for locating UI elements and for formulating test expectations.
Content Matching
Find UI elements by describing what you see
screen.find(contentMatchingDescription("a login button"))
Test Assertions
Write visual expectations in natural language
expect(screen).toShow(contentMatchingDescription("a welcome dialog"))
Multiple Providers
OpenAI, Anthropic, and Ollama support
useOpenAIVisionProvider()
Installation
Install the plugin along with the AI SDK provider for your preferred LLM:
# Core plugin
npm install @nut-tree/plugin-ai-sdk
# Choose one or more AI SDK providers:
npm install @ai-sdk/openai # OpenAI (GPT-5, etc.)
npm install @ai-sdk/anthropic # Anthropic (Claude Opus 4.6, etc.)
npm install ollama-ai-provider # Ollama (local models)
Subscription Required
@nut-tree/plugin-ai-sdk is a premium plugin and requires an active nut.js subscription.
Quick Reference
Provider Functions
useOpenAIVisionProvider
useOpenAIVisionProvider(options?)
Activate OpenAI as the vision provider for content matching.
useAnthropicVisionProvider
useAnthropicVisionProvider(options?)
Activate Anthropic as the vision provider for content matching.
useOllamaVisionProvider
useOllamaVisionProvider(options?)
Activate Ollama as the vision provider for local AI matching.
Query Functions
contentMatchingDescription
contentMatchingDescription(description: string)
Creates a content query that describes what to look for on screen. Used with screen.find, screen.findAll, screen.waitFor, and test matchers like toShow and toMatchContentDescription.
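Since the same query type works with screen.findAll, a minimal sketch (assuming an OpenAI provider is already active, with an illustrative element description) could enumerate every match on screen:

```typescript
import { screen, contentMatchingDescription } from "@nut-tree/nut-js";
import { useOpenAIVisionProvider } from "@nut-tree/plugin-ai-sdk";
import { openai } from "@ai-sdk/openai";

useOpenAIVisionProvider({ model: openai("gpt-5") });

// findAll returns one region per matching element
const results = await screen.findAll(
  contentMatchingDescription("an unchecked checkbox")
);
for (const region of results) {
  console.log(`match at (${region.left}, ${region.top})`);
}
```

Pair this with the defaultMaxMatches option described below to cap how many regions a single search may return.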
Provider Setup
OpenAI
import { screen, contentMatchingDescription } from "@nut-tree/nut-js";
import { useOpenAIVisionProvider } from "@nut-tree/plugin-ai-sdk";
import { openai } from "@ai-sdk/openai";
useOpenAIVisionProvider({
model: openai("gpt-5"),
});
// Find elements by description
const region = await screen.find(
contentMatchingDescription("the login button")
);
API Key Required
Set the OPENAI_API_KEY environment variable to your OpenAI API key.
Anthropic
import { screen, contentMatchingDescription } from "@nut-tree/nut-js";
import { useAnthropicVisionProvider } from "@nut-tree/plugin-ai-sdk";
import { anthropic } from "@ai-sdk/anthropic";
useAnthropicVisionProvider({
model: anthropic("claude-opus-4-6"),
});
const region = await screen.find(
contentMatchingDescription("the submit button")
);
API Key Required
Set the ANTHROPIC_API_KEY environment variable to your Anthropic API key.
Ollama (Local)
Use Ollama for fully local AI matching without external API calls:
import { screen, contentMatchingDescription } from "@nut-tree/nut-js";
import { useOllamaVisionProvider } from "@nut-tree/plugin-ai-sdk";
import { ollama } from "ollama-ai-provider";
useOllamaVisionProvider({
model: ollama("llava"),
});
const region = await screen.find(
contentMatchingDescription("the search input field")
);
Local Setup
Make sure Ollama is running (ollama serve) and that you have pulled a vision-capable model such as llava or llava-llama3.
Configuration
All provider setup functions accept a configuration object:
import { useOpenAIVisionProvider } from "@nut-tree/plugin-ai-sdk";
import { openai } from "@ai-sdk/openai";
useOpenAIVisionProvider({
// Required: the AI model to use
model: openai("gpt-5"),
// Optional: default confidence threshold (0-1)
defaultConfidence: 0.7,
// Optional: maximum matches to return per search
defaultMaxMatches: 5,
// Optional: matching strategy
matching: "default",
});
Options Reference
model
model: LanguageModelV1
The AI SDK language model instance to use for vision analysis.
defaultConfidence
defaultConfidence?: number
Default confidence threshold for matches (0-1). Can be overridden per search.
defaultMaxMatches
defaultMaxMatches?: number
Maximum number of matches to return per search. Can be overridden per search.
matching
matching?: VisionMatchingSettings
Explicit matcher tuning settings. When present, these values override equivalent top-level options.
Usage
Finding Elements
Use contentMatchingDescription with screen.find to locate UI elements by describing what they look like:
import { screen, mouse, centerOf, straightTo, Button, contentMatchingDescription } from "@nut-tree/nut-js";
import { useOpenAIVisionProvider } from "@nut-tree/plugin-ai-sdk";
import { openai } from "@ai-sdk/openai";
useOpenAIVisionProvider({ model: openai("gpt-5") });
// Find a specific UI element by description
const region = await screen.find(
contentMatchingDescription("a system menu called 'Navigate'"), {
confidence: 0.8,
}
);
// Interact with the found region
await mouse.move(straightTo(centerOf(region)));
await mouse.click(Button.LEFT);
Waiting for Elements
Wait for content to appear on screen within a timeout:
// Wait up to 10 seconds for a dialog to appear, checking every second
const dialog = await screen.waitFor(
contentMatchingDescription("a confirmation dialog with 'Save changes?' text"),
10000,
1000
);
Test Assertions
The real power of contentMatchingDescription shines in end-to-end tests, where you can formulate visual expectations in plain language:
import { screen, contentMatchingDescription } from "@nut-tree/nut-js";
import { useOpenAIVisionProvider } from "@nut-tree/plugin-ai-sdk";
import { openai } from "@ai-sdk/openai";
useOpenAIVisionProvider({ model: openai("gpt-5") });
// Assert that the screen shows specific content
await expect(screen).toShow(
contentMatchingDescription("a browser window with an open npmjs.org tab")
);
// Assert with a custom confidence threshold
await expect(screen).toShow(
contentMatchingDescription("a navigation bar with a 'Home' link"),
{ confidence: 0.9 }
);
The toShow matcher is available through the Jest and Vitest integration matchers provided by @nut-tree/nut-js.
Full E2E Test Example
Here is a complete example combining content matching with test assertions:
import { describe, it, expect, beforeAll } from "vitest";
import { screen, mouse, keyboard, centerOf, straightTo, Button, contentMatchingDescription } from "@nut-tree/nut-js";
import { useOpenAIVisionProvider } from "@nut-tree/plugin-ai-sdk";
import { openai } from "@ai-sdk/openai";
beforeAll(() => {
useOpenAIVisionProvider({ model: openai("gpt-5") });
});
describe("Application navigation", () => {
it("should open the settings page", async () => {
// Find and click the settings menu
const settingsMenu = await screen.find(
contentMatchingDescription("a menu item labeled 'Settings'")
);
await mouse.move(straightTo(centerOf(settingsMenu)));
await mouse.click(Button.LEFT);
// Verify the settings page is displayed
await expect(screen).toShow(
contentMatchingDescription("a settings page with 'General' and 'Advanced' tabs")
);
});
it("should display search results", async () => {
await keyboard.type("nut.js automation");
await expect(screen).toShow(
contentMatchingDescription("a list of search results related to 'nut.js'")
);
});
});
Best Practices
Writing Good Descriptions
- Be specific about what you see (e.g., "a system menu called 'Navigate'" vs "a menu")
- Include visual characteristics like color, position, or text content
- Describe the element in context (e.g., "a browser window with an open npmjs.org tab")
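As an illustration of the guidelines above (the element names and descriptions are hypothetical), compare a vague query with a specific one:

```typescript
import { screen, contentMatchingDescription } from "@nut-tree/nut-js";

// Too vague: several menus could match, so results may vary between runs
// contentMatchingDescription("a menu")

// Specific: names the element, its visible text, and its context
const region = await screen.find(
  contentMatchingDescription(
    "a system menu called 'Navigate' in the application's top menu bar"
  )
);
```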
Precision and Flakiness
Overly broad descriptions can match unintended content and make assertions flaky; keep expectations as precise as the visual state allows. A loose assertion such as the following may pass against many different charts:
expect(screen).toShow(contentMatchingDescription("an upwards trending graph"))
Performance Considerations
- AI vision matching is slower than template matching (nl-matcher) due to API latency
- Cloud providers (OpenAI, Anthropic) require internet connectivity and incur API costs
- Ollama provides local matching but requires a capable GPU for good performance
- Consider using nl-matcher for speed-critical operations and AI SDK for complex visual understanding