# AI SDK
AI vision-based image matching using LLMs via Vercel AI SDK.
## Overview
The @nut-tree/plugin-ai-sdk plugin enables AI vision-based image matching in nut.js using large language models via the Vercel AI SDK. Instead of pixel-based template matching, it uses multimodal LLMs to understand and locate UI elements by description.
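In practice the difference is the argument you pass to `screen.find`: a template image resource versus a plain-language description. A minimal sketch of the contrast, assuming a vision provider has already been activated as shown later on this page (the file name is illustrative):

```ts
import { screen, imageResource } from "@nut-tree/nut-js";

// Classic template matching: compare pixels against a reference image.
const byTemplate = await screen.find(imageResource("submit-button.png"));

// AI vision matching: describe the element in natural language.
const byDescription = await screen.find("the blue Submit button");
```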
- **AI Vision**: locate elements by natural language description, e.g. `screen.find("login button")`
- **Multiple Providers**: OpenAI, Anthropic, and Ollama support, e.g. `useOpenAIVisionProvider()`
- **Configurable**: custom models, prompts, and matching options via `{ model, systemPrompt }`

## Installation
Install the plugin along with the AI SDK provider for your preferred LLM:
```bash
# Core plugin
npm install @nut-tree/plugin-ai-sdk

# Choose one or more AI SDK providers:
npm install @ai-sdk/openai        # OpenAI (GPT-5, etc.)
npm install @ai-sdk/anthropic     # Anthropic (Claude Opus 4.6, etc.)
npm install ollama-ai-provider    # Ollama (local models)
```

> **Subscription Required:** @nut-tree/plugin-ai-sdk is a premium plugin and requires an active nut.js subscription to install.
## Quick Reference
| Function | Description |
| --- | --- |
| `useOpenAIVisionProvider(options?)` | Activate OpenAI as the vision provider for image matching |
| `useAnthropicVisionProvider(options?)` | Activate Anthropic as the vision provider for image matching |
| `useOllamaVisionProvider(options?)` | Activate Ollama as the vision provider for local AI matching |
## Provider Setup
### OpenAI
```ts
import { screen } from "@nut-tree/nut-js";
import { useOpenAIVisionProvider } from "@nut-tree/plugin-ai-sdk";
import { openai } from "@ai-sdk/openai";

useOpenAIVisionProvider({
  model: openai("gpt-5"),
});

// Find elements by description
const loginButton = await screen.find("the login button");
```

> **API Key Required:** Set the `OPENAI_API_KEY` environment variable with your OpenAI API key.
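If you prefer not to rely on the default environment variable, the AI SDK's `createOpenAI` factory accepts an explicit key. A minimal sketch, where `MY_OPENAI_KEY` is a hypothetical variable name of your choosing:

```ts
import { useOpenAIVisionProvider } from "@nut-tree/plugin-ai-sdk";
import { createOpenAI } from "@ai-sdk/openai";

// createOpenAI lets you pass the key explicitly instead of
// relying on OPENAI_API_KEY being set in the environment.
const openai = createOpenAI({
  apiKey: process.env.MY_OPENAI_KEY, // hypothetical variable name
});

useOpenAIVisionProvider({ model: openai("gpt-5") });
```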
### Anthropic

```ts
import { screen } from "@nut-tree/nut-js";
import { useAnthropicVisionProvider } from "@nut-tree/plugin-ai-sdk";
import { anthropic } from "@ai-sdk/anthropic";

useAnthropicVisionProvider({
  model: anthropic("claude-opus-4-6"),
});

const submitButton = await screen.find("the submit button");
```

> **API Key Required:** Set the `ANTHROPIC_API_KEY` environment variable with your Anthropic API key.

### Ollama (Local)
Use Ollama for fully local AI matching without external API calls:
```ts
import { screen } from "@nut-tree/nut-js";
import { useOllamaVisionProvider } from "@nut-tree/plugin-ai-sdk";
import { ollama } from "ollama-ai-provider";

useOllamaVisionProvider({
  model: ollama("llava"),
});

const element = await screen.find("the search input field");
```

> **Local Setup:** Make sure the Ollama server is running (`ollama serve`) and you have pulled a vision-capable model like `llava` or `llava-llama3`.
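Downloading the model is a one-time step using the standard Ollama CLI, for example:

```bash
# Pull a vision-capable model once, then start the local server
ollama pull llava
ollama serve
```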
## Configuration

All provider setup functions accept a configuration object:
```ts
import { useOpenAIVisionProvider } from "@nut-tree/plugin-ai-sdk";
import { openai } from "@ai-sdk/openai";

useOpenAIVisionProvider({
  // Required: the AI model to use
  model: openai("gpt-5"),

  // Optional: default confidence threshold (0-1)
  defaultConfidence: 0.7,

  // Optional: maximum matches to return per search
  defaultMaxMatches: 5,

  // Optional: matching strategy
  matching: "default",

  // Optional: custom system prompt for the AI model
  systemPrompt: "You are analyzing a desktop application screenshot.",
});
```
### Options Reference

| Option | Type | Description |
| --- | --- | --- |
| `model` | `LanguageModelV1` | The AI SDK language model instance to use for vision analysis. Required. |
| `defaultConfidence` | `number` (optional) | Default confidence threshold for matches (0-1). Can be overridden per search. |
| `defaultMaxMatches` | `number` (optional) | Maximum number of matches to return per search. Can be overridden per search. |
| `matching` | `"default"` (optional) | Matching strategy to use. |
| `systemPrompt` | `string` (optional) | Custom system prompt to provide context to the AI model about what it is analyzing. |
## Usage Examples
### Finding Elements
```ts
import { screen, mouse, centerOf, straightTo, Button } from "@nut-tree/nut-js";
import { useOpenAIVisionProvider } from "@nut-tree/plugin-ai-sdk";
import { openai } from "@ai-sdk/openai";

useOpenAIVisionProvider({ model: openai("gpt-5") });

// Find by natural language description
const button = await screen.find("the blue Submit button");
await mouse.move(straightTo(centerOf(button)));
await mouse.click(Button.LEFT);

// Find multiple matches
const items = await screen.findAll("list items in the sidebar");
console.log(`Found ${items.length} sidebar items`);

// Wait for an element to appear (10,000 ms timeout, re-checking every 1,000 ms)
const dialog = await screen.waitFor("a confirmation dialog", 10000, 1000);
```
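`screen.find` rejects its promise when no match is found, so optional elements are best wrapped in a `try`/`catch` rather than letting the script crash. A minimal sketch (the exact error message depends on the active provider):

```ts
import { screen } from "@nut-tree/nut-js";

// Guard an element that may or may not be on screen.
try {
  const banner = await screen.find("a cookie consent banner");
  console.log("Banner found at", banner);
} catch (err) {
  console.log("No banner visible, continuing:", err);
}
```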
### With Confidence Override

```ts
import { screen } from "@nut-tree/nut-js";
import { useOpenAIVisionProvider } from "@nut-tree/plugin-ai-sdk";
import { openai } from "@ai-sdk/openai";

useOpenAIVisionProvider({
  model: openai("gpt-5"),
  defaultConfidence: 0.7,
});

// Override confidence for a specific search
const result = await screen.find("the navigation menu", {
  confidence: 0.9,
});
```
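Since `defaultMaxMatches` is also documented as overridable per search, the same pattern presumably applies there. A sketch under that assumption; the exact `maxMatches` key name is not confirmed by this page:

```ts
// Assumption: maxMatches can be overridden per search,
// mirroring the confidence override shown above.
const rows = await screen.findAll("rows in the results table", {
  maxMatches: 10,
});
```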
## Best Practices

### Descriptions
- Be specific in your descriptions (e.g., "the blue Submit button" vs. "a button")
- Include visual characteristics like color, position, or text content
- Use the `systemPrompt` option to give the model context about your application (see the sketch below)
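For instance, a domain-specific system prompt can help the model disambiguate similar-looking controls. A minimal sketch, with an illustrative prompt for a hypothetical invoicing app:

```ts
import { useOpenAIVisionProvider } from "@nut-tree/plugin-ai-sdk";
import { openai } from "@ai-sdk/openai";

useOpenAIVisionProvider({
  model: openai("gpt-5"),
  // Illustrative prompt: describe your own application's layout
  // so the model can tell similar controls apart.
  systemPrompt:
    "You are analyzing screenshots of an invoicing desktop app. " +
    "The main toolbar is at the top; invoice rows appear in a table below it.",
});
```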
### Performance Considerations
- AI vision matching is slower than template matching (`nl-matcher`) due to API latency
- Cloud providers (OpenAI, Anthropic) require internet connectivity and incur API costs
- Ollama provides local matching but requires a capable GPU for good performance
- Consider using `nl-matcher` for speed-critical operations and the AI SDK plugin for complex visual understanding; when in doubt, measure (see the timing sketch below)
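If you are unsure whether the latency of an AI-backed lookup is acceptable for a given step, measuring it directly is cheap. A minimal sketch using standard console timers:

```ts
import { screen } from "@nut-tree/nut-js";

// Rough latency check: the AI vision call typically dominates,
// so time the lookup before relying on it in a hot path.
console.time("ai-find");
const target = await screen.find("the Save icon in the toolbar");
console.timeEnd("ai-find");
console.log("Found:", target);
```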