Unigram tokenization

The Unigram algorithm is a subword tokenization algorithm based on the unigram language model in NLP. It is used to preprocess text input for some deep learning language models.
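The core idea is easy to state: under a unigram language model, the probability of a segmentation is simply the product of its tokens' probabilities, and the tokenizer picks the most probable segmentation. A tiny illustration (the tokens and probabilities are invented for the example, not taken from any real vocabulary):

```javascript
// Toy unigram vocabulary: token -> probability (illustrative numbers only).
const p = { "hu": 0.15, "g": 0.1, "hug": 0.25 };

// Probability of a segmentation = product of its tokens' probabilities.
const probOf = (tokens) => tokens.reduce((acc, t) => acc * p[t], 1);

probOf(["hu", "g"]); // 0.15 * 0.1
probOf(["hug"]);     // 0.25 -- the single-token segmentation wins here
```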

I followed this section of the Hugging Face NLP course to implement the algorithm in JavaScript. I played with the Byte-Pair Encoding (BPE), WordPiece, and Unigram algorithms as explained in the course and found that Unigram gives the most reasonable results, though it is also the slowest.

Enter training text in the text area below and press the 'Train' button to train a Unigram model with the given target vocabulary size. Note that the algorithm implemented here is very slow on anything but small texts.
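To give a rough sense of why training is slow: the algorithm repeatedly re-scores the whole corpus to measure how much the loss would grow if each candidate token were removed, then prunes the least useful tokens until the vocabulary hits the target size. The sketch below is a simplified illustration of that loop, not the demo's actual implementation; all names are mine, and the real algorithm removes a fixed fraction of tokens per step (and uses expected counts) rather than one token at a time.

```javascript
// Best (highest) log-probability over all segmentations of `word`,
// or -Infinity if the word cannot be segmented with this vocabulary.
function bestScore(word, logProbs) {
  const best = new Array(word.length + 1).fill(-Infinity);
  best[0] = 0;
  for (let end = 1; end <= word.length; end++) {
    for (let start = 0; start < end; start++) {
      const piece = word.slice(start, end);
      if (logProbs.has(piece)) {
        best[end] = Math.max(best[end], best[start] + logProbs.get(piece));
      }
    }
  }
  return best[word.length];
}

// Turn raw token counts into log-probabilities.
function logProbsFromCounts(counts) {
  let total = 0;
  for (const c of counts.values()) total += c;
  return new Map([...counts].map(([tok, c]) => [tok, Math.log(c / total)]));
}

// Negative log-likelihood of the corpus (word -> frequency) under the vocabulary.
function corpusLoss(words, counts) {
  const lp = logProbsFromCounts(counts);
  let loss = 0;
  for (const [word, freq] of words) loss -= freq * bestScore(word, lp);
  return loss;
}

// Prune until the vocabulary reaches the target size: at each step, drop the
// token whose removal increases the corpus loss the least, always keeping
// single characters so every word stays tokenizable.
function train(words, counts, targetSize) {
  counts = new Map(counts);
  while (counts.size > targetSize) {
    let bestTok = null;
    let bestLoss = Infinity;
    for (const tok of counts.keys()) {
      if (tok.length === 1) continue; // never remove single characters
      const trial = new Map(counts);
      trial.delete(tok);
      const loss = corpusLoss(words, trial);
      if (loss < bestLoss) { bestLoss = loss; bestTok = tok; }
    }
    if (bestTok === null) break; // only single characters left
    counts.delete(bestTok);
  }
  return counts;
}
```

Each pruning step re-tokenizes the corpus once per candidate token, which is where the slowness comes from.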

In the Tokenization section below, you can interactively see how the trained model tokenizes input text.
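At inference time, finding the most probable segmentation does not require enumerating all of them: a Viterbi-style dynamic program does it in one pass over the word. Here is a minimal sketch of that step (the vocabulary and its probabilities are made up for illustration; this is a simplified version, not the code behind the demo):

```javascript
// Hypothetical vocabulary: token -> log-probability (illustrative numbers).
const logProbs = new Map([
  ["h", Math.log(0.1)],
  ["u", Math.log(0.1)],
  ["g", Math.log(0.1)],
  ["hu", Math.log(0.15)],
  ["ug", Math.log(0.2)],
  ["hug", Math.log(0.25)],
]);

// Viterbi-style segmentation: best[i] holds the best score for word[0..i)
// and the start index of the last token on that best path.
function tokenize(word, logProbs) {
  const best = new Array(word.length + 1).fill(null);
  best[0] = { score: 0, start: null };
  for (let end = 1; end <= word.length; end++) {
    for (let start = 0; start < end; start++) {
      const piece = word.slice(start, end);
      if (best[start] !== null && logProbs.has(piece)) {
        const score = best[start].score + logProbs.get(piece);
        if (best[end] === null || score > best[end].score) {
          best[end] = { score, start };
        }
      }
    }
  }
  if (best[word.length] === null) return null; // word cannot be segmented
  // Walk back from the end to recover the token sequence.
  const tokens = [];
  let end = word.length;
  while (end > 0) {
    const start = best[end].start;
    tokens.unshift(word.slice(start, end));
    end = start;
  }
  return tokens;
}

tokenize("hug", logProbs); // ["hug"] beats ["hu","g"], ["h","ug"], ["h","u","g"]
```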