@nut-tree/plugin-ocr

Kind: OCR provider


Installation

npm i @nut-tree/plugin-ocr

Attention: @nut-tree/plugin-ocr is only available to sponsors of nut.js.
If you want to get access to it, please consult the sponsoring profile.


Description

@nut-tree/plugin-ocr is an OCR plugin for nut.js.
It provides an implementation of the TextFinderInterface to perform on-screen text search.
Additionally, it provides a plugin that extends the nut.js Screen with the ability to extract text from screen regions.


Configuration

@nut-tree/plugin-ocr both extends existing nut.js functionality and exports a set of configuration and utility functions.

configure()

Configure the plugin by providing an OcrPluginConfiguration. Calling configure() is optional, as the plugin comes with sensible defaults.

interface OcrPluginConfiguration {
    languageModelType?: LanguageModelType;
    dataPath?: string;
}

OcrPluginConfiguration.languageModelType

The type of language model to use. Defaults to LanguageModelType.DEFAULT.

@nut-tree/plugin-ocr uses language models to perform OCR.
There are different language models available which might lead to more accurate or faster results.

In total, there are three different language models available:

  • DEFAULT: The default language model.
  • BEST: Better accuracy, but slower.
  • FAST: Faster, but less accurate.

OcrPluginConfiguration.dataPath

The path to store language models.

You can adjust this path to avoid re-downloading language models.
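
For example, you could point dataPath at a stable, per-user directory so models are downloaded once and reused across runs. A minimal sketch using Node's os and path modules — the directory name ".nutjs-ocr" is an arbitrary choice for illustration, not a plugin default:

```typescript
import { homedir } from "os";
import { join } from "path";

// Build a stable, per-user cache directory for language models.
// ".nutjs-ocr" is an arbitrary example name, not mandated by the plugin.
const modelCachePath = join(homedir(), ".nutjs-ocr", "models");

// In real code (requires sponsor access to @nut-tree/plugin-ocr):
// configure({ dataPath: modelCachePath });

console.log(modelCachePath);
```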

preloadLanguages()

@nut-tree/plugin-ocr supports multiple languages.

By default, the plugin will check if a required language model is available on every OCR run and download it if necessary.

If you want to avoid delays during execution due to language model downloads, you can preload language models by calling preloadLanguages().

function preloadLanguages(languages: Language[], languageModels: LanguageModelType[] = [LanguageModelType.DEFAULT]): Promise<void[]>

  • languages: An array of languages to preload.
  • languageModels: An array of language models to preload. Defaults to [LanguageModelType.DEFAULT].

Supported languages are:

export enum Language {
    Afrikaans,
    Albanian,
    Amharic,
    Arabic,
    Armenian,
    Assamese,
    Azerbaijani,
    AzerbaijaniCyrilic,
    Basque,
    Belarusian,
    Bengali,
    Bosnian,
    Breton,
    Bulgarian,
    Burmese,
    Catalan,
    Cebuano,
    CentralKhmer,
    Cherokee,
    ChineseSimplified,
    ChineseTraditional,
    Corsican,
    Croatian,
    Czech,
    Danish,
    Dutch,
    Dzongkha,
    English,
    EnglishMiddle,
    Esperanto,
    Estonian,
    Faroese,
    Filipino,
    Finnish,
    French,
    FrenchMiddle,
    Galician,
    Georgian,
    GeorgianOld,
    German,
    GermanFraktur,
    GreekAncient,
    GreekModern,
    Gujarati,
    Haitian,
    Hebrew,
    Hindi,
    Hungarian,
    Icelandic,
    Indonesian,
    Inuktitut,
    Irish,
    Italian,
    ItalianOld,
    Japanese,
    Javanese,
    Kannada,
    Kazakh,
    Kirghiz,
    Korean,
    KoreanVertical,
    Kurdish,
    Kurmanji,
    Lao,
    Latin,
    Latvian,
    Lithuanian,
    Luxembourgish,
    Macedonian,
    Malay,
    Malayalam,
    Maltese,
    Maori,
    Marathi,
    Math,
    Mongolian,
    Nepali,
    Norwegian,
    Occitan,
    Oriya,
    Panjabi,
    Persian,
    Polish,
    Portuguese,
    Pushto,
    Quechua,
    Romanian,
    Russian,
    Sanskrit,
    ScottishGaelic,
    Serbian,
    SerbianLatin,
    Sindhi,
    Sinhala,
    Slovak,
    Slovenian,
    Spanish,
    SpanishOld,
    Sundanese,
    Swahili,
    Swedish,
    Syriac,
    Tagalog,
    Tajik,
    Tamil,
    Tatar,
    Telugu,
    Thai,
    Tibetan,
    Tigrinya,
    Tonga,
    Turkish,
    Uighur,
    Ukrainian,
    Urdu,
    Uzbek,
    UzbekCyrilic,
    Vietnamese,
    Welsh,
    WesternFrisian,
    Yiddish,
    Yoruba
}

Let's dive right into an example:

import {centerOf, mouse, screen, singleWord, straightTo} from "@nut-tree/nut-js";
import {configure, Language, LanguageModelType, preloadLanguages} from "@nut-tree/plugin-ocr";

configure({
    dataPath: "/path/to/store/language/models",
    languageModelType: LanguageModelType.BEST
});

(async () => {
    await preloadLanguages([Language.English, Language.German]);

    screen.config.ocrConfidence = 0.8;
    screen.config.autoHighlight = true;

    const location = await screen.find(singleWord("WebStorm"), {
        providerData: {
            lang: [Language.English, Language.German],
            partialMatch: false,
            caseSensitive: false
        }
    });
    await mouse.move(straightTo(centerOf(location)));
})();

We already talked about configure() and preloadLanguages() in the configuration section, but there are a few additional things to note here:

  • screen.config.ocrConfidence: When combining image and text search, you can set a separate confidence threshold for text search, so that image and text search use different thresholds.
  • singleWord: nut.js currently supports two kinds of text search, singleWord and textLine. singleWord searches for a single word, while textLine searches for a whole line of text.

Search configuration

@nut-tree/plugin-ocr supports a set of configuration options for text search, passed via the providerData property of the OptionalSearchParameters object.

export interface TextFinderConfig {
    lang?: Language[], // Languages used for OCR, defaults to [Language.English]
    partialMatch?: boolean, // Allow partial matches, defaults to false
    caseSensitive?: boolean, // Case sensitive search, defaults to false
    preprocessConfig?: ImagePreprocessingConfig // Image preprocessing configuration
}
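
To make these options concrete, here is a hedged sketch that mirrors the TextFinderConfig shape locally. The enum subset and interface below are re-declared for illustration only — in real code you would import them from @nut-tree/plugin-ocr:

```typescript
// Local mirrors of the plugin's types, for illustration only.
enum Language { English, German }

interface TextFinderConfig {
    lang?: Language[];
    partialMatch?: boolean;
    caseSensitive?: boolean;
}

// A search configuration as it would be passed via `providerData`:
const providerData: TextFinderConfig = {
    lang: [Language.English, Language.German],
    partialMatch: true,   // also match "Web" inside "WebStorm"
    caseSensitive: false, // "webstorm" matches "WebStorm"
};

// In real code: screen.find(singleWord("WebStorm"), { providerData });
console.log(providerData);
```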

Usage: On-screen text extraction

Just as with on-screen text search, we'll start with an example:

import {getActiveWindow, screen} from "@nut-tree/nut-js";
import {configure, Language, LanguageModelType, preloadLanguages, TextSplit} from "@nut-tree/plugin-ocr";

configure({
    dataPath: "/path/to/store/language/models",
    languageModelType: LanguageModelType.BEST
});

const activeWindowRegion = async () => {
    const activeWindow = await getActiveWindow();
    return activeWindow.region;
};

(async () => {
    await preloadLanguages([Language.English, Language.German]);
    const text = await screen.read({searchRegion: activeWindowRegion(), split: TextSplit.LINE});
})();

screen.read uses the same configuration and preload mechanisms as screen.find.

Additionally, screen.read supports a set of configuration options for text extraction, passed via ReadTextConfig:

export interface ReadTextConfig {
    searchRegion?: Region | Promise<Region>, // The region to extract text from. Defaults to the entire screen
    languages?: Language[], // An array of languages to use for OCR. Defaults to `[Language.English]`
    split?: TextSplit, // How to split the extracted text. Defaults to `TextSplit.NONE`
    preprocessConfig?: ImagePreprocessingConfig // Image preprocessing configuration
}

TextSplit

TextSplit is an enum that defines how the extracted text should be split:

enum TextSplit {
    SYMBOL,
    WORD,
    LINE,
    PARAGRAPH,
    BLOCK,
    NONE
}

This allows you to configure the level of detail for text extraction.
TextSplit.SYMBOL splits the result at single-character level, TextSplit.WORD at word level, and so on.

The default value is TextSplit.NONE, which will return the extracted text as a single string (similar to TextSplit.BLOCK in most cases).
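
The effect of the split level can be illustrated with a plain string. The helper below is purely illustrative — the real partitioning is done by the OCR engine on recognized text regions, not by string operations:

```typescript
// Local mirror of the plugin's enum, for illustration only.
enum TextSplit { SYMBOL, WORD, LINE, PARAGRAPH, BLOCK, NONE }

// Illustrative only: split plain text the way the corresponding
// TextSplit level would partition an OCR result.
function splitText(text: string, split: TextSplit): string[] {
    switch (split) {
        case TextSplit.SYMBOL:
            return [...text].filter((c) => c.trim().length > 0);
        case TextSplit.WORD:
            return text.split(/\s+/).filter((w) => w.length > 0);
        case TextSplit.LINE:
            return text.split(/\n/).map((l) => l.trim()).filter((l) => l.length > 0);
        default:
            // PARAGRAPH / BLOCK / NONE: return the text as a single chunk
            return [text];
    }
}

const sample = "Hello world\nSecond line";
console.log(splitText(sample, TextSplit.WORD)); // ["Hello", "world", "Second", "line"]
console.log(splitText(sample, TextSplit.LINE)); // ["Hello world", "Second line"]
console.log(splitText(sample, TextSplit.NONE).length); // 1
```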

Depending on the configured text split, the result of screen.read is one of the following types:

interface SymbolOCRResult {
    text: string,
    confidence: number,
    isSuperscript: boolean,
    isSubscript: boolean,
    isDropcap: boolean
}

interface WordOCRResult {
    text: string,
    confidence: number,
    isNumeric: boolean,
    isInDictionary: boolean,
    textDirection: TextDirection, // text direction of the word
    symbols: SymbolOCRResult[],
    font: FontInfo,
}

interface FontInfo {
    isBold: boolean;
    isItalic: boolean;
    isUnderlined: boolean;
    isMonospace: boolean;
    isSerif: boolean;
    isSmallcaps: boolean;
    fontSize: number;
    fontId: number;
}

interface LineOCRResult {
    text: string,
    confidence: number,
    words: WordOCRResult[],
}

interface ParagraphOCRResult {
    text: string,
    confidence: number,
    isLeftToRight: boolean,
    lines: LineOCRResult[],
}

interface BlockOCRResult {
    text: string,
    confidence: number,
    blockType: TextBlockType,
    paragraphs: ParagraphOCRResult[],
}

enum TextBlockType {
    UNKNOWN,         // Type is not yet known. Keep as the first element.
    FLOWING_TEXT,    // Text that lives inside a column.
    HEADING_TEXT,    // Text that spans more than one column.
    PULLOUT_TEXT,    // Text that is in a cross-column pull-out region.
    EQUATION,        // Partition belonging to an equation region.
    INLINE_EQUATION, // Partition has inline equation.
    TABLE,           // Partition belonging to a table region.
    VERTICAL_TEXT,   // Text-line runs vertically.
    CAPTION_TEXT,    // Text that belongs to an image.
    FLOWING_IMAGE,   // Image that lives inside a column.
    HEADING_IMAGE,   // Image that spans more than one column.
    PULLOUT_IMAGE,   // Image that is in a cross-column pull-out region.
    HORZ_LINE,       // Horizontal Line.
    VERT_LINE,       // Vertical Line.
    NOISE,           // Lies outside of any column.
    COUNT
}
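
Since the result types nest (block → paragraph → line → word), a common task is to walk the tree, for example to flag low-confidence words for review. A sketch with minimal local mirrors of the interfaces above — in real code these types come from @nut-tree/plugin-ocr, and the helper function is a hypothetical name, not part of the plugin:

```typescript
// Minimal local mirrors of the plugin's result types, for illustration only.
interface WordOCRResult { text: string; confidence: number; }
interface LineOCRResult { text: string; confidence: number; words: WordOCRResult[]; }
interface ParagraphOCRResult { text: string; confidence: number; lines: LineOCRResult[]; }
interface BlockOCRResult { text: string; confidence: number; paragraphs: ParagraphOCRResult[]; }

// Collect every word below a confidence threshold from a block result.
function lowConfidenceWords(block: BlockOCRResult, threshold: number): WordOCRResult[] {
    return block.paragraphs.flatMap((p) =>
        p.lines.flatMap((l) => l.words.filter((w) => w.confidence < threshold))
    );
}

// Sample data shaped like a screen.read(...) result with TextSplit.BLOCK:
const block: BlockOCRResult = {
    text: "Hello wrold", confidence: 0.9,
    paragraphs: [{
        text: "Hello wrold", confidence: 0.9,
        lines: [{
            text: "Hello wrold", confidence: 0.9,
            words: [
                { text: "Hello", confidence: 0.98 },
                { text: "wrold", confidence: 0.41 },
            ],
        }],
    }],
};

console.log(lowConfidenceWords(block, 0.8).map((w) => w.text)); // ["wrold"]
```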

© 2023