N-Gram Word Frequency Counter for Various Languages
Introduction
This app displays the frequencies of word n-grams (n = 1, 2, 3, ...) that appear in a given text, according to the user's selection.
A 1-gram (or monogram/unigram) is a one-word sequence.
A 2-gram (or bigram/digram) is a two-word sequence, like "I love," "United States," or "Latin America."
A 3-gram (or trigram) is a three-word sequence, like "I love reading," "about data science," or "New York Times."
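As a rough illustration of the definitions above, n-grams in a space-delimited text can be counted with a sliding window over the word list. This is only a minimal sketch: the function name `ngram_counts` and the simple regex tokenizer are assumptions made for the example, not the app's actual code.

```python
import re
from collections import Counter


def ngram_counts(text, n):
    """Count word n-grams in a space-delimited text (illustrative helper)."""
    # Lowercase, then treat runs of word characters and apostrophes as words.
    words = re.findall(r"[\w']+", text.lower())
    # Pair each word with the n - 1 words that follow it.
    grams = zip(*(words[i:] for i in range(n)))
    return Counter(" ".join(gram) for gram in grams)


bigrams = ngram_counts("I love reading. I love data science.", 2)
print(bigrams["i love"])
```

Note that this sketch ignores sentence boundaries, so "reading i" is counted as a bigram; a real tool might split on punctuation first.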
Spaces serve as word delimiters in most languages. In languages such as Chinese, Japanese, and Thai, which have no spaces between words, counting word frequencies can be very challenging. In this app, foreign-language-to-English dictionaries are used to identify words in such languages.
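The text does not say how the dictionaries are applied. One common approach, shown here purely as an assumption, is greedy forward maximum matching: at each position, take the longest string found in the dictionary. The function `segment` and the four-word lexicon below are hypothetical.

```python
def segment(text, lexicon):
    """Split unspaced text into words by greedy longest-match lookup."""
    # Longest entry in the lexicon bounds the window size (at least 1).
    max_len = max([1, *map(len, lexicon)])
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                i += size
                break
    return words


# Toy lexicon standing in for the headwords of a real dictionary.
toy_lexicon = {"我", "爱", "北京", "天安门"}
print(segment("我爱北京天安门", toy_lexicon))
```

Greedy matching is simple but can pick the wrong split when dictionary entries overlap, which is one reason word counting in these languages is hard.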
The most distinctive feature of this app
When a user wants to know how many times the word ‘USA’ appears within a certain text, most word frequency tools entirely omit the frequencies of variants like ‘US’, ‘U.S.’, ‘U.S.A.’, ‘America’, ‘United States of America’, etc. It would be useful if the sum of the frequencies of all variants were displayed along with the frequency of each variant separately.
The most distinctive feature of this app is its ability to automatically generate variants of a representative lexicon (here, ‘USA’, for example). Some other examples of representative lexicons are ‘book’ and ‘study’. The noun ‘book’ may have ‘Book’, ‘book's’, ‘books’, etc. as its variants, and the verb ‘study’ may have ‘studies’, ‘studied’, ‘studying’, etc. as its variants. As for Korean, since particles such as -이, -가, -을, -를, -에게 (-i, -ka, -ul, -lul, -e.key), etc. can be attached to a basic or dictionary word such as 미국 (mi.kuk ‘America’), all word+particle strings such as 미국이 (mi.kuk.i), 미국을 (mi.kuk.ul), 미국에게 (mi.kuk.e.key), etc. can be regarded as variants of 미국 (mi.kuk). This app displays the frequency of each representative word, the frequency of each of its variants, and their combined total.
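The per-variant and combined counts described above could be computed along the following lines. This sketch assumes the variant list is already available (here it is written by hand; how the app generates variants is not described), and the name `variant_frequencies` is hypothetical. Matching is case-sensitive and whole-word only.

```python
import re


def variant_frequencies(text, variants):
    """Return per-variant counts and their total for one representative word."""
    # Try longer variants first so 'U.S.A.' is not counted as 'U.S.'.
    ordered = sorted(variants, key=len, reverse=True)
    pattern = re.compile(
        r"(?<!\w)(?:" + "|".join(re.escape(v) for v in ordered) + r")(?!\w)"
    )
    counts = {v: 0 for v in variants}
    for match in pattern.finditer(text):
        counts[match.group(0)] += 1
    return counts, sum(counts.values())


usa_variants = ["USA", "US", "U.S.", "U.S.A.", "America", "United States of America"]
sample = "The USA and the U.S. signed it. America agreed; the U.S.A. did too. US policy changed."
per_variant, total = variant_frequencies(sample, usa_variants)
print(per_variant, total)
```

The same function works for the Korean example if the variant list contains 미국, 미국이, 미국을, and so on, since each word+particle string is matched as its own whole word.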