Within the realm of artificial intelligence, tokenization refers to the transformation of information into portions/bits of recurrent data.

Known as Byte-pair encoding, tokenization is basically breaking down text strings into micro-batches of characters and tagging them so that they can be adeptly stored and understood by the binary functions of a computer.

Consider the combination of the three letters i-n-g. Each letter is an independent token; however, when put together to form “ing,” they take on a wholly different representation that is commonly found at the tail end of past tense actions (end-ing, mean-ing, vot-ing, etc.).

Taking this a step further, within the three characters, you will also see the ability for a multitude of two-letter combinations for tags such as “ig,” “ng,” “gi,” and so on. Each combination of the letters would be its own unique token. By tokenizing every possible combination of the individual letters, it becomes easier to recognize patterns in large datasets rather than processing each letter individually. For situations where “in” would arise, we know that if the two letters are detached from any other letters, then it would classify as a word; however, if they are included with other letters, then we can ignore the word take and operate on the sentence structure more accurately.



Last updated