
Text Analysis Activities package for UiPath - manual

by Pawel Wesolowski, 2019

Text Analysis (or Text Mining) is the automated process of extracting information from text. The goal is to turn free-text content into structured data that a computer can interpret more easily. It can be used for categorizing press articles, user reviews, incoming e-mails and tickets, for monitoring comments or social media, and much more.

The first release contains several basic but powerful methods and techniques:

1. Detect Language, 2. Find Collocations, 3. Prepare String To Analyze, 4. Remove Stopwords, 5. Word Frequency Analyse, 6. Word Position Analyse.

The newest version and documentation are always available at:



Detect Language

Recognizes the language of the provided string using a base of the 1000+ most common words for each supported language. A longer input string gives better confidence. The returned confidence score ranges from 0 to 100%.

Returns a three-letter code for the recognized language. The codes are ISO 639-2 codes; where the bibliographic (B) and terminologic (T) forms differ, the list at the end of this manual uses the T form (e.g. deu, fra, ces). They can therefore be used, e.g., with Tesseract OCR without any changes. If no language is detected (the provided string is too short or empty), English (eng) is returned by default. 16 languages are supported; the list is at the end of this manual.
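
The manual does not describe how the activity scores a match, so the following Python sketch only illustrates the general idea of word-list based detection. The tiny word lists, the function name detect_language and the confidence formula (share of input words found in the best-matching list) are all assumptions made for illustration, not the package's actual implementation.

```python
# Hypothetical, heavily shortened word lists; the activity is described
# as using 1000+ common words per supported language.
COMMON_WORDS = {
    "eng": {"the", "and", "is", "of", "to", "in", "this", "for"},
    "deu": {"der", "die", "das", "und", "ist", "nicht", "für", "mit"},
    "pol": {"i", "w", "nie", "na", "jest", "to", "się", "z"},
}


def detect_language(text, default="eng"):
    """Return (language_code, confidence_percent) for the given text."""
    words = text.lower().split()
    if not words:
        # Empty input: fall back to the default language, as the manual describes.
        return default, 0.0
    # Count how many input words appear in each language's word list.
    hits = {
        code: sum(word in vocabulary for word in words)
        for code, vocabulary in COMMON_WORDS.items()
    }
    best = max(hits, key=hits.get)
    if hits[best] == 0:
        return default, 0.0
    # Assumed confidence: percentage of input words that were recognized.
    return best, 100.0 * hits[best] / len(words)
```

With these sample lists, detect_language("das ist nicht der Hund") recognizes four of the five words as German, which also shows why longer inputs tend to give a more reliable score.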

Find Collocations

Collocations help identify words that commonly co-occur. The activity returns bigrams, i.e. pairs of adjacent words, e.g. "invoice date" for an invoice document.

Returns the collocations found in the provided string. Only unique values are returned.
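
As a rough illustration of bigram extraction, the Python sketch below collects every pair of adjacent words and returns each pair once. The function name find_collocations and the ordering (most frequent pair first) are assumptions; the manual only states that the returned values are unique.

```python
from collections import Counter


def find_collocations(text):
    """Return the unique bigrams of `text`, most frequent first."""
    words = text.split()
    # zip(words, words[1:]) pairs every word with its right-hand neighbour.
    pairs = Counter(zip(words, words[1:]))
    # most_common() yields each distinct pair exactly once.
    return [" ".join(pair) for pair, _count in pairs.most_common()]
```

For the input "invoice date invoice number invoice date" the pair "invoice date" occurs twice and is therefore listed first, followed by the pairs that occur once.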

Prepare String To Analyze

The basic activity for text analysis. It prepares the provided string by lowercasing it and removing special characters, numbers and single letters. The returned text contains only full words separated by spaces. Runs of consecutive spaces are converted to a single space.

It is advisable to prepare text with this activity before analyzing it with the other activities in this package.
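
The clean-up steps described above can be sketched in Python roughly as follows. The function name prepare_string and the exact rules (every character that is not a letter counts as "special") are assumptions for illustration; the activity's real rules may differ in detail.

```python
import re


def prepare_string(text):
    """Lowercase `text` and keep only words of two or more letters."""
    text = text.lower()
    # Replace every run of non-letters (punctuation, digits, underscores,
    # whitespace) with a single space.
    text = re.sub(r"[\W\d_]+", " ", text)
    # Drop single-letter tokens; join() also collapses repeated spaces.
    return " ".join(word for word in text.split() if len(word) > 1)
```

For example, "Invoice No. 123: a  TOTAL of 45,00 EUR!" becomes "invoice no total of eur".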

Remove Stopwords

To make automated analysis of a text more accurate, it is important to remove all words that are very frequent but carry very little semantic information or no meaning at all. These words are known as stopwords. 16 languages are supported.

Takes as input the text and a language code, which can be detected by the Detect Language activity. Returns the text without stopwords.
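
A minimal Python sketch of stopword removal is shown below. The short stopword sets, the function name remove_stopwords and the behaviour for an unknown language code (text returned unchanged) are assumptions for illustration; the package ships its own lists for the 16 supported languages.

```python
# Hypothetical, heavily shortened stopword lists keyed by language code.
STOPWORDS = {
    "eng": {"the", "a", "an", "of", "is", "and", "to", "in"},
    "deu": {"der", "die", "das", "und", "ist", "ein", "eine"},
}


def remove_stopwords(text, language_code):
    """Return `text` without the stopwords of the given language."""
    # Assumption: an unknown language code leaves the text unchanged.
    stopwords = STOPWORDS.get(language_code, set())
    return " ".join(word for word in text.split() if word not in stopwords)
```

The sketch expects text that has already been lowercased, for example "the date of the invoice is missing" with the code "eng" yields "date invoice missing".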

Word Frequency Analyse

Finds the most frequently used words in a document. This is useful, e.g., for identifying a document type by comparing the results with a defined pattern.

Takes a string as input and returns a DataTable with two columns ("Word", "Count"): the first contains the word, the second the number of occurrences of that word in the provided string. The DataTable is sorted by word count in descending order.
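
The counting step can be sketched in Python as follows. A list of dictionaries stands in for the DataTable here, and the function name word_frequency is an assumption; only the column names and the descending sort order come from the description above.

```python
from collections import Counter


def word_frequency(text):
    """Return rows with "Word" and "Count", sorted by count descending."""
    counts = Counter(text.split())
    # most_common() sorts by count, highest first.
    return [{"Word": word, "Count": count} for word, count in counts.most_common()]
```

For "invoice date invoice total invoice date" the first row is "invoice" with a count of 3, followed by "date" (2) and "total" (1).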

Word Position Analyse

The position of a word in a document says a lot about the document type. An "Invoice" keyword at the top of the document, in the header, makes it more likely that the document is an invoice than the same keyword mentioned somewhere on the 5th page.

Takes a string as input and returns a DataTable with two columns ("Words", "Positions"): the first contains the word, the second the position of that word in the provided string. The returned words are unique (no duplicates), and the position is the word's average position in the document. The DataTable is sorted by position in descending order.
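
The manual does not say how a position is measured, so the Python sketch below assumes a 1-based word index and averages it over all occurrences of a word. A list of dictionaries stands in for the DataTable, and the function name word_positions is hypothetical.

```python
from collections import defaultdict


def word_positions(text):
    """Return rows with "Words" and "Positions", sorted by position descending."""
    positions = defaultdict(list)
    # Assumption: position = 1-based index of the word in the text.
    for index, word in enumerate(text.split(), start=1):
        positions[word].append(index)
    rows = [
        {"Words": word, "Positions": sum(found) / len(found)}
        for word, found in positions.items()
    ]
    rows.sort(key=lambda row: row["Positions"], reverse=True)
    return rows
```

In "invoice number date invoice" the word "invoice" appears at positions 1 and 4, so its average position is 2.5 and it is listed between "date" (3.0) and "number" (2.0).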

Supported Languages for Detect Language and Remove Stopwords activities

1. Bulgarian (bul), 2. Croatian (hrv), 3. Czech (ces), 4. Dutch (nld), 5. English (eng), 6. French (fra), 7. German (deu), 8. Hungarian (hun), 9. Italian (ita), 10. Polish (pol), 11. Romanian (ron), 12. Russian (rus), 13. Slovak (slk), 14. Slovenian (slv), 15. Spanish (spa), 16. Turkish (tur).
