Kind: OCR provider
```shell
npm i @nut-tree/plugin-ocr
```
Attention: @nut-tree/plugin-ocr is only available to sponsors of nut.js. If you want access to it, please consult the sponsoring profile.
@nut-tree/plugin-ocr is an OCR plugin for nut.js. It provides an implementation of the TextFinderInterface for on-screen text search. It also provides a plugin that extends the nut.js Screen with the ability to extract text from screen regions.
@nut-tree/plugin-ocr extends existing nut.js functionality and also exports a set of configuration and utility functions. To configure the plugin, pass an OcrPluginConfiguration to configure(). Calling configure() is optional because the plugin comes with sensible defaults.
```typescript
interface OcrPluginConfiguration {
    languageModelType?: LanguageModelType; // Which language model variant to use
    dataPath?: string;                     // Where language models are stored
}
```
languageModelType: The type of language model to use. Defaults to LanguageModelType.DEFAULT.
@nut-tree/plugin-ocr uses language models to perform OCR. Different language models trade accuracy against speed, and three are available:
DEFAULT: The default language model.
BEST: Better accuracy, but slower.
FAST: Faster, but less accurate.
dataPath: The path where language models are stored. You can set this path to avoid re-downloading language models.
@nut-tree/plugin-ocr supports multiple languages. By default, the plugin checks on every OCR run whether a required language model is available, and downloads it if necessary. To avoid delays caused by model downloads during execution, you can preload language models by calling preloadLanguages():
```typescript
declare function preloadLanguages(
    languages: Language[],
    languageModels?: LanguageModelType[] // Defaults to [LanguageModelType.DEFAULT]
): Promise<void[]>;
```
languages: An array of languages to preload.
languageModels: An array of language models to preload. Defaults to [LanguageModelType.DEFAULT].
Supported languages are:
```typescript
export enum Language {
    Afrikaans,
    Albanian,
    Amharic,
    Arabic,
    Armenian,
    Assamese,
    Azerbaijani,
    AzerbaijaniCyrilic,
    Basque,
    Belarusian,
    Bengali,
    Bosnian,
    Breton,
    Bulgarian,
    Burmese,
    Catalan,
    Cebuano,
    CentralKhmer,
    Cherokee,
    ChineseSimplified,
    ChineseTraditional,
    Corsican,
    Croatian,
    Czech,
    Danish,
    Dutch,
    Dzongkha,
    English,
    EnglishMiddle,
    Esperanto,
    Estonian,
    Faroese,
    Filipino,
    Finnish,
    French,
    FrenchMiddle,
    Galician,
    Georgian,
    GeorgianOld,
    German,
    GermanFraktur,
    GreekAncient,
    GreekModern,
    Gujarati,
    Haitian,
    Hebrew,
    Hindi,
    Hungarian,
    Icelandic,
    Indonesian,
    Inuktitut,
    Irish,
    Italian,
    ItalianOld,
    Japanese,
    Javanese,
    Kannada,
    Kazakh,
    Kirghiz,
    Korean,
    KoreanVertical,
    Kurdish,
    Kurmanji,
    Lao,
    Latin,
    Latvian,
    Lithuanian,
    Luxembourgish,
    Macedonian,
    Malay,
    Malayalam,
    Maltese,
    Maori,
    Marathi,
    Math,
    Mongolian,
    Nepali,
    Norwegian,
    Occitan,
    Oriya,
    Panjabi,
    Persian,
    Polish,
    Portuguese,
    Pushto,
    Quechua,
    Romanian,
    Russian,
    Sanskrit,
    ScottishGaelic,
    Serbian,
    SerbianLatin,
    Sindhi,
    Sinhala,
    Slovak,
    Slovenian,
    Spanish,
    SpanishOld,
    Sundanese,
    Swahili,
    Swedish,
    Syriac,
    Tagalog,
    Tajik,
    Tamil,
    Tatar,
    Telugu,
    Thai,
    Tibetan,
    Tigrinya,
    Tonga,
    Turkish,
    Uighur,
    Ukrainian,
    Urdu,
    Uzbek,
    UzbekCyrilic,
    Vietnamese,
    Welsh,
    WesternFrisian,
    Yiddish,
    Yoruba
}
```
Let's dive right into an example:
```typescript
import {centerOf, mouse, screen, singleWord, straightTo} from "@nut-tree/nut-js";
import {configure, Language, LanguageModelType, preloadLanguages} from "@nut-tree/plugin-ocr";

// Store language models in a custom location and use the most accurate model type
configure({
    dataPath: "/path/to/store/language/models",
    languageModelType: LanguageModelType.BEST
});

(async () => {
    // Preload the BEST models for English and German so no download happens during the search
    await preloadLanguages([Language.English, Language.German], [LanguageModelType.BEST]);

    // Confidence threshold for text search, independent of the image search threshold
    screen.config.ocrConfidence = 0.8;
    // Highlight matches on screen
    screen.config.autoHighlight = true;

    // Search for the single word "WebStorm": exact word match, case-insensitive
    const location = await screen.find(singleWord("WebStorm"), {
        providerData: {
            lang: [Language.English, Language.German],
            partialMatch: false,
            caseSensitive: false
        }
    });

    // Move the mouse to the center of the match
    await mouse.move(straightTo(centerOf(location)));
})();
```
We already covered configure() and preloadLanguages() in the configuration section, but a few additional things are worth noting here:
screen.config.ocrConfidence: Sets the confidence threshold for text search independently of the threshold used for image search, so you can use two different thresholds when you combine image and text search.
singleWord: nut.js currently supports two kinds of text search, singleWord and textLine. singleWord searches for a single word, while textLine searches for a whole line of text.
@nut-tree/plugin-ocr supports a set of configuration options for text search. You pass them via the providerData property of the OptionalSearchParameters object:
```typescript
export interface TextFinderConfig {
    lang?: Language[],                          // Languages used for OCR, defaults to [Language.English]
    partialMatch?: boolean,                     // Allow partial matches, defaults to false
    caseSensitive?: boolean,                    // Case-sensitive search, defaults to false
    preprocessConfig?: ImagePreprocessingConfig // Image preprocessing configuration
}
```
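To make the effect of partialMatch and caseSensitive concrete, here is a small self-contained sketch that compares a recognized word against a search query. It is an illustration only and not the plugin's actual matcher. It assumes that a partial match means the recognized text contains the query as a substring. The names MatchOptions, withDefaults and matchesQuery are hypothetical, and the defaults mirror the ones documented for TextFinderConfig.

```typescript
// Hypothetical local mirror of the two TextFinderConfig flags that affect matching.
interface MatchOptions {
    partialMatch?: boolean;  // documented default: false
    caseSensitive?: boolean; // documented default: false
}

// Hypothetical helper: fills in the documented defaults for omitted flags.
function withDefaults(options: MatchOptions = {}): Required<MatchOptions> {
    return {
        partialMatch: options.partialMatch ?? false,
        caseSensitive: options.caseSensitive ?? false,
    };
}

// Hypothetical helper: illustrates the assumed semantics, not the plugin's implementation.
// Without partialMatch, the whole recognized word must equal the query.
// With partialMatch, the query only has to occur somewhere inside the recognized word.
function matchesQuery(recognized: string, query: string, options?: MatchOptions): boolean {
    const {partialMatch, caseSensitive} = withDefaults(options);
    const text = caseSensitive ? recognized : recognized.toLowerCase();
    const needle = caseSensitive ? query : query.toLowerCase();
    return partialMatch ? text.includes(needle) : text === needle;
}

console.log(matchesQuery("WebStorm", "webstorm"));                        // true
console.log(matchesQuery("WebStorm", "webstorm", {caseSensitive: true})); // false
console.log(matchesQuery("WebStorm", "Storm"));                           // false
console.log(matchesQuery("WebStorm", "Storm", {partialMatch: true}));     // true
```

With the documented defaults (both flags false), a search for singleWord("webstorm") would, under this assumption, still match the on-screen word "WebStorm", but it would not match "Storm".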
As with on-screen text search, we'll start with an example:
```typescript
import {getActiveWindow, screen, TextSplit} from "@nut-tree/nut-js";
import {configure, Language, LanguageModelType, preloadLanguages} from "@nut-tree/plugin-ocr";

configure({
    dataPath: "/path/to/store/language/models",
    languageModelType: LanguageModelType.BEST
});

// Resolves to the region of the currently active window
const activeWindowRegion = async () => {
    const activeWindow = await getActiveWindow();
    return activeWindow.region;
};

(async () => {
    await preloadLanguages([Language.English, Language.German], [LanguageModelType.BEST]);
    // searchRegion accepts a Promise<Region>, so the helper's result can be passed without awaiting it
    const text = await screen.read({
        searchRegion: activeWindowRegion(),
        languages: [Language.English, Language.German],
        split: TextSplit.LINE
    });
    console.log(text);
})();
```
screen.read uses the same configuration and preload mechanisms as screen.find. In addition, screen.read supports a set of configuration options for text extraction, passed via ReadTextConfig:
```typescript
export interface ReadTextConfig {
    searchRegion?: Region | Promise<Region>,    // The region to extract text from. Defaults to the entire screen
    languages?: Language[],                     // Languages to use for OCR. Defaults to [Language.English]
    split?: TextSplit,                          // How to split the extracted text. Defaults to TextSplit.NONE
    preprocessConfig?: ImagePreprocessingConfig // Image preprocessing configuration
}
```
TextSplit is an enum that defines how the extracted text should be split:
```typescript
enum TextSplit {
    SYMBOL,
    WORD,
    LINE,
    PARAGRAPH,
    BLOCK,
    NONE
}
```
This lets you configure the level of detail for text extraction. TextSplit.SYMBOL splits the result at the level of single characters, TextSplit.WORD at the word level, TextSplit.LINE at the line level, and so on. The default value is TextSplit.NONE, which returns the extracted text as a single string (in most cases similar to TextSplit.BLOCK).
Depending on the configured text split, the result of screen.read is one of the following types:
```typescript
interface SymbolOCRResult {
    text: string,
    confidence: number,
    isSuperscript: boolean,
    isSubscript: boolean,
    isDropcap: boolean
}

interface WordOCRResult {
    text: string,
    confidence: number,
    isNumeric: boolean,
    isInDictionary: boolean,
    textDirection: result.textDirection,
    symbols: SymbolOCRResult[],
    font: FontInfo,
}

interface FontInfo {
    isBold: boolean;
    isItalic: boolean;
    isUnderlined: boolean;
    isMonospace: boolean;
    isSerif: boolean;
    isSmallcaps: boolean;
    fontSize: number;
    fontId: number;
}

interface LineOCRResult {
    text: string,
    confidence: number,
    words: WordOCRResult[],
}

interface ParagraphOCRResult {
    text: string,
    confidence: number,
    isLeftToRight: boolean,
    lines: LineOCRResult[],
}

interface BlockOCRResult {
    text: string,
    confidence: number,
    blockType: TextBlockType,
    paragraphs: ParagraphOCRResult[],
}

enum TextBlockType {
    UNKNOWN,         // Type is not yet known. Keep as the first element.
    FLOWING_TEXT,    // Text that lives inside a column.
    HEADING_TEXT,    // Text that spans more than one column.
    PULLOUT_TEXT,    // Text that is in a cross-column pull-out region.
    EQUATION,        // Partition belonging to an equation region.
    INLINE_EQUATION, // Partition has inline equation.
    TABLE,           // Partition belonging to a table region.
    VERTICAL_TEXT,   // Text-line runs vertically.
    CAPTION_TEXT,    // Text that belongs to an image.
    FLOWING_IMAGE,   // Image that lives inside a column.
    HEADING_IMAGE,   // Image that spans more than one column.
    PULLOUT_IMAGE,   // Image that is in a cross-column pull-out region.
    HORZ_LINE,       // Horizontal Line.
    VERT_LINE,       // Vertical Line.
    NOISE,           // Lies outside of any column.
    COUNT
}
```
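The coarser result types nest the finer ones (block, then paragraphs, lines, words and symbols). As a result, a single BLOCK-level read can be walked down to any finer level. The following self-contained sketch shows that traversal. The interfaces are hypothetical local mirrors reduced to the fields the sketch uses. Because TypeScript typing is structural, real BlockOCRResult objects should also satisfy them. linesOf and confidentWords are hypothetical helpers, not part of the plugin. The sample data assumes confidences on a 0 to 1 scale, so check which scale your results actually use before you pick a threshold.

```typescript
// Hypothetical, reduced mirrors of the documented result types.
interface WordLike { text: string; confidence: number; }
interface LineLike { text: string; confidence: number; words: WordLike[]; }
interface ParagraphLike { text: string; confidence: number; lines: LineLike[]; }
interface BlockLike { text: string; confidence: number; paragraphs: ParagraphLike[]; }

// Collects the text of every line by walking block -> paragraph -> line.
function linesOf(blocks: BlockLike[]): string[] {
    const lines: string[] = [];
    for (const block of blocks) {
        for (const paragraph of block.paragraphs) {
            for (const line of paragraph.lines) {
                lines.push(line.text);
            }
        }
    }
    return lines;
}

// Collects the text of every word whose confidence is at least minConfidence.
function confidentWords(blocks: BlockLike[], minConfidence: number): string[] {
    const words: string[] = [];
    for (const block of blocks) {
        for (const paragraph of block.paragraphs) {
            for (const line of paragraph.lines) {
                for (const word of line.words) {
                    if (word.confidence >= minConfidence) {
                        words.push(word.text);
                    }
                }
            }
        }
    }
    return words;
}

// Hand-written sample data in the documented shape (not real OCR output).
const sample: BlockLike[] = [{
    text: "Hello world\nnut.js",
    confidence: 0.9,
    paragraphs: [{
        text: "Hello world\nnut.js",
        confidence: 0.9,
        lines: [
            {text: "Hello world", confidence: 0.9, words: [
                {text: "Hello", confidence: 0.95},
                {text: "world", confidence: 0.41}
            ]},
            {text: "nut.js", confidence: 0.88, words: [
                {text: "nut.js", confidence: 0.88}
            ]}
        ]
    }]
}];

console.log(linesOf(sample));             // [ 'Hello world', 'nut.js' ]
console.log(confidentWords(sample, 0.8)); // [ 'Hello', 'nut.js' ]
```

Reading once at a coarse level and walking the tree is one option. Requesting the desired TextSplit directly is simpler when you only need one level of detail.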
© 2023