OCR plugins

@nut-tree/plugin-azure


Installation

npm i @nut-tree/plugin-azure

Buy

@nut-tree/plugin-azure is included in the Solo and Team plans.


Prerequisites

In order to use @nut-tree/plugin-azure, you need to have an Azure account and an Azure AI Vision OCR resource. You can use the free pricing tier (F0) to try the service, and upgrade later to a paid tier for production.

Once both are set up, you'll need a key and the endpoint of the resource you created to connect your application to the Azure AI Vision service:

  • After your Azure AI Vision resource is deployed, select Go to resource.
  • In the left navigation menu, select Keys and Endpoint.
  • Copy one of the keys and the endpoint.
  • Use them in your code, e.g. via environment variables.

Description

@nut-tree/plugin-azure is a nut.js plugin that integrates the Azure AI Vision OCR service with nut.js. It provides an implementation of the TextFinderInterface to perform on-screen text search. Additionally, it provides a plugin that extends the nut.js Screen with the ability to extract text from screen regions.


Configuration

@nut-tree/plugin-azure is designed as a collection of subpackages, each with its own configuration options. Currently, a single subpackage is available, @nut-tree/plugin-azure/ocr, which provides the ability to perform on-screen text search.

You can either use the separate configuration methods of each subpackage, or use the configure() method of the main package to configure all subpackages at once.

interface AzurePluginConfiguration {
    ocr: AzureVisionOCRApiConfiguration;
}

configure()

Configure the plugin by providing an AzurePluginConfiguration. This config object holds all configuration options for all subpackages.
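As a minimal sketch, configuring the ocr subpackage through the main package might look as follows (assuming configure is exported from the package root as described above, and your credentials live in environment variables):

const {configure} = require("@nut-tree/plugin-azure");

configure({
    ocr: {
        apiKey: process.env.VISION_KEY,
        apiEndpoint: process.env.VISION_ENDPOINT,
    }
});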

Azure AI Vision OCR

@nut-tree/plugin-azure/ocr uses the Azure AI Vision OCR service to perform OCR.

Configuration

The subpackage comes with the following configuration options:

interface AzureVisionOCRApiConfiguration {
    apiEndpoint: string;
    apiKey: string;
    checkResultInterval?: number;
    checkResultRetryCount?: number;
    language?: string;
    modelVersion?: string;
    readingOrder?: AzureOcrServiceReadingOrder;
}

AzureVisionOCRApiConfiguration.apiKey

Your API key for the Azure AI Vision OCR resource. This is a required option to use the Azure AI Vision OCR service.

AzureVisionOCRApiConfiguration.apiEndpoint

The URL of your Azure AI Vision OCR resource. This is a required option to use the Azure AI Vision OCR service.

AzureVisionOCRApiConfiguration.modelVersion

@nut-tree/plugin-azure/ocr allows you to explicitly specify which of the available models to use. If you don't specify a model version, the latest model will be used.

AzureVisionOCRApiConfiguration.language

@nut-tree/plugin-azure/ocr allows you to explicitly specify a single language to use for OCR. By default, the service extracts all text, including mixed languages. If you want to force the use of a single, specific language, set this option (see the sketch following the list of languages below).

Available languages are:

export enum Language {
    Afrikaans,
    Albanian,
    Angika,
    Arabic,
    Asturian,
    AwadhiHindi,
    Azerbaijani,
    Bagheli,
    Basque,
    BelarusianCyrillic,
    BelarusianLatin,
    BhojpuriHindi,
    Bislama,
    Bodo,
    Bosnian,
    Brajbha,
    Breton,
    Bulgarian,
    Bundeli,
    Buryat,
    Catalan,
    Cebuano,
    Chamling,
    Chamorro,
    Chhattisgarhi,
    ChineseSimplified,
    ChineseTraditional,
    Cornish,
    Corsican,
    CrimeanTatar,
    Croatian,
    Czech,
    Danish,
    Dari,
    Dhimal,
    Dogri,
    Dutch,
    English,
    Erzya,
    Estonian,
    Faroese,
    Fijian,
    Filipino,
    Finnish,
    French,
    Friulian,
    Gagauz,
    Galician,
    German,
    Gilbertese,
    Gondi,
    Greenlandic,
    Gurung,
    HaitianCreole,
    Halbi,
    Hani,
    Haryanvi,
    Hawaiian,
    Hindi,
    HmongDaw,
    Ho,
    Hungarian,
    Icelandic,
    InariSami,
    Indonesian,
    Interlingua,
    Inuktitut,
    Irish,
    Italian,
    Japanese,
    Jaunsari,
    Javanese,
    Kabuverdianu,
    KachinLatin,
    KangriDevanagiri,
    KarachayBalkar,
    KaraKalpakCyrillic,
    KaraKalpakLatin,
    Kashubian,
    KazakhCyrillic,
    KazakhLatin,
    Khaling,
    Khasi,
    Kiche,
    Korean,
    Korku,
    Koryak,
    Kosraean,
    Kumyk,
    KurdishArabic,
    KurdishLatin,
    KurukhDevanagiri,
    KyrgyzCyrillic,
    Lakota,
    Latin,
    Lithuanian,
    LowerSorbian,
    LuleSami,
    Luxembourgish,
    MahasuPahari,
    Malay,
    Maltese,
    Malto,
    Manx,
    Maori,
    Marathi,
    Mongolian,
    MontenegrinCyrillic,
    MontenegrinLatin,
    Neapolitan,
    Nepali,
    Niuean,
    Nogay,
    NorthernSami,
    Norwegian,
    Occitan,
    Ossetic,
    Pashto,
    Persian,
    Polish,
    Portuguese,
    Punjabi,
    Ripuarian,
    Romanian,
    Romansh,
    Russian,
    Sadri,
    Samoan,
    Sanskrit,
    Santali,
    Scots,
    ScottishGaelic,
    Serbian,
    Sherpa,
    Sirmauri,
    SkoltSami,
    Slovak,
    Slovenian,
    Somali,
    SouthernSami,
    Spanish,
    Swahili,
    Swedish,
    Tajik,
    Tatar,
    Tetum,
    Thangmi,
    Tongan,
    Turkish,
    Turkmen,
    Tuvan,
    UpperSorbian,
    Urdu,
    Uyghur,
    UzbekArabic,
    UzbekCyrillic,
    UzbekLatin,
    Volapuk,
    Walser,
    Welsh,
    WesternFrisian,
    YucatecMaya,
    Zhuang,
    Zulu
}
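A hedged sketch of forcing OCR to a single language (assuming the Language enum is exported from @nut-tree/plugin-azure/ocr alongside configure, and its values map to the service's language codes):

const {configure, Language} = require("@nut-tree/plugin-azure/ocr");

configure({
    apiKey: process.env.VISION_KEY,
    apiEndpoint: process.env.VISION_ENDPOINT,
    // Force German; omit the language option to extract text in any language
    language: Language.German,
});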

AzureVisionOCRApiConfiguration.readingOrder

@nut-tree/plugin-azure/ocr allows you to explicitly specify the reading order to use for OCR. The default is AzureOcrServiceReadingOrder.BASIC, which uses a left-to-right reading order. AzureOcrServiceReadingOrder.NATURAL uses a more natural reading order, but is only available for Latin-script languages.

AzureVisionOCRApiConfiguration.checkResultInterval

@nut-tree/plugin-azure/ocr submits async jobs to the Azure AI Vision OCR service and then polls for the result. To avoid depleting your API quota, you can configure the polling interval.

AzureVisionOCRApiConfiguration.checkResultRetryCount

@nut-tree/plugin-azure/ocr submits async jobs to the Azure AI Vision OCR service and then polls for the result. To avoid depleting your API quota, you can configure a maximum number of polls.
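A hedged sketch combining the reading order and polling options (the interval is assumed to be in milliseconds; verify against your package version):

const {configure, AzureOcrServiceReadingOrder} = require("@nut-tree/plugin-azure/ocr");

configure({
    apiKey: process.env.VISION_KEY,
    apiEndpoint: process.env.VISION_ENDPOINT,
    readingOrder: AzureOcrServiceReadingOrder.NATURAL, // Latin-script languages only
    checkResultInterval: 500,  // assumed: wait 500 ms between polls
    checkResultRetryCount: 20, // stop polling after 20 attempts
});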

Usage: On-screen text search

Let's dive right into an example:

const {centerOf, mouse, screen, singleWord, straightTo} = require("@nut-tree/nut-js");
const {configure} = require("@nut-tree/plugin-azure/ocr");

configure({
    apiKey: process.env.VISION_KEY,
    apiEndpoint: process.env.VISION_ENDPOINT,
});

(async () => {
    const location = await screen.find(singleWord("WebStorm"));
    await mouse.move(
        straightTo(
            centerOf(
                location
            )
        )
    );
})();

As you can see, the minimal configuration for @nut-tree/plugin-azure/ocr only requires you to provide your Azure AI Vision OCR API key and endpoint, which are read from environment variables in this case.

That's all you need to search for text on your screen using text queries. The currently supported text queries are singleWord and textLine. singleWord searches for a single word, while textLine searches for a whole line of text.
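For instance, searching for a whole line works the same way (the window title below is a hypothetical example):

const {screen, textLine} = require("@nut-tree/nut-js");

(async () => {
    // Locate a whole line of text on screen, e.g. a window title
    const location = await screen.find(textLine("Welcome to WebStorm"));
})();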

ProviderData

You can pass a ProviderData object to screen.find to override the configuration options for text search on a per-call basis.

The TextFinderConfig is defined as follows:

interface TextFinderConfig {
    apiEndpoint?: string;
    apiKey?: string;
    caseSensitive?: boolean;
    checkResultInterval?: number;
    checkResultRetryCount?: number;
    language?: string;
    modelVersion?: string;
    partialMatch?: boolean;
    readingOrder?: AzureOcrServiceReadingOrder;
}

As you can see, you're able to override the global configuration options for @nut-tree/plugin-azure/ocr on a per-call basis. This allows you to use different endpoints, languages or models for different calls to screen.find, overriding the global configuration.

You can also tweak some text search related options on a per-call basis:

TextFinderConfig.caseSensitive

@nut-tree/plugin-azure/ocr will perform case-insensitive text search by default. Toggle this flag to enable case-sensitive text search.

TextFinderConfig.partialMatch

@nut-tree/plugin-azure/ocr will search for an exact match by default. Toggle this flag to enable partial text matches.
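A hedged sketch of a per-call override, assuming the config object is passed via the providerData property of the nut.js search parameters:

const {screen, singleWord} = require("@nut-tree/nut-js");

(async () => {
    // Case-sensitive, partial match for this single call only
    const location = await screen.find(singleWord("Storm"), {
        providerData: {
            caseSensitive: true,
            partialMatch: true,
        }
    });
})();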


Usage: On-screen text extraction

Just as with Usage: On-screen text search, we'll start with an example:

const {getActiveWindow, screen} = require("@nut-tree/nut-js");
const {configure, TextSplit, useAzureVisionOCR} = require("@nut-tree/plugin-azure/ocr");

configure({
    apiKey: process.env.VISION_KEY,
    apiEndpoint: process.env.VISION_ENDPOINT,
});

useAzureVisionOCR();

const activeWindowRegion = async () => {
    const activeWindow = await getActiveWindow();
    return activeWindow.region;
};

(async () => {
    const text = await screen.read({searchRegion: activeWindowRegion(), split: TextSplit.WORD});
    console.log(text);
})();

screen.read uses the same configuration as screen.find.

Additionally, screen.read supports a set of configuration options for text extraction, passed via ReadTextConfig:

interface ReadTextConfig {
    apiEndpoint?: string;
    apiKey?: string;
    checkResultInterval?: number;
    checkResultRetryCount?: number;
    language?: Language;
    modelVersion?: string;
    readingOrder?: AzureOcrServiceReadingOrder;
    searchRegion?: Region | Promise<Region>;
    split?: TextSplit;
}

As you can see, you're able to override the configuration options for text extraction on a per-call basis. This allows you to use different endpoints, languages or models for different calls to screen.read, overriding the global configuration.

Additionally, you can pass a searchRegion to screen.read, which will be used to limit the screen area to extract text from.

The split option allows you to configure the level of detail for text extraction. With the default, TextSplit.NONE, a single block of text which contains all extracted text will be returned.

TextSplit

TextSplit is an enum that defines how the extracted text should be split:

enum TextSplit {
    WORD,
    LINE,
    NONE
}

This allows you to configure the level of detail for text extraction: TextSplit.LINE will split the result at line level, TextSplit.WORD at word level.

The default value is TextSplit.NONE, which will return the extracted text as a single block of text.

Depending on the configured text split, the result of screen.read is one of the following types:

interface WordOCRResult {
    text: string,
    confidence: number,
}

interface LineOCRResult {
    text: string,
    confidence: number,
    words: WordOCRResult[],
}

interface BlockOCRResult {
    text: string,
    confidence: number,
    lines: LineOCRResult[],
}
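As a hedged sketch, consuming the result of the default split might look like this (assuming TextSplit.NONE yields a single BlockOCRResult as defined above):

const {screen} = require("@nut-tree/nut-js");
const {TextSplit} = require("@nut-tree/plugin-azure/ocr");

(async () => {
    // Default split: one block containing all extracted text
    const block = await screen.read({split: TextSplit.NONE});
    console.log(`Overall confidence: ${block.confidence}`);
    for (const line of block.lines) {
        console.log(`${line.text} (${line.confidence})`);
    }
})();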


Which OCR package should I choose?

As always in software, it depends :)

The obvious difference is that the @nut-tree/plugin-ocr package is a standalone package, while the @nut-tree/plugin-azure package is a wrapper around the Azure AI Vision API. This means that with @nut-tree/plugin-ocr you can use the OCR functionality without involving any third-party service, while @nut-tree/plugin-azure requires a (free) Azure subscription.

With @nut-tree/plugin-ocr, OCR is performed locally on your machine, so no data is sent to any third-party service. This might be a requirement for some use cases. On the other hand, @nut-tree/plugin-azure offers a more powerful OCR engine, which performs better on complex images and achieves higher accuracy, even on low-quality images.
