Tokenizer: Get a better understanding of what counts as a token (Grok)

In the context of Large Language Models (LLMs), a token is a unit of text that the model processes. Tokens can represent whole words, subwords, characters, or even punctuation, depending on the tokenisation method used.

How Tokens Work in LLMs

  • Tokenisation: Before processing text, LLMs convert input text into tokens using a tokeniser. This step breaks text into manageable pieces based on predefined rules (see the sketch after this list).
  • Vocabulary: LLMs have a fixed vocabulary of tokens they can understand. If a word isn't in the vocabulary, it may be split into multiple subword tokens.
  • Processing: Each token is assigned a numerical representation (embedding), which the model processes to generate output.
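
To see these three steps in action, here is a minimal sketch using OpenAI's open-source tiktoken library (pip install tiktoken). The cl100k_base encoding is the one GPT-4 uses; the printed IDs are illustrative, and other models (including Grok) have their own vocabularies.

```python
# Minimal tokenisation sketch with tiktoken (pip install tiktoken).
# cl100k_base is the encoding used by GPT-4; other models differ.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

text = "Hello world!"
token_ids = enc.encode(text)                    # text -> numerical token IDs
pieces = [enc.decode([i]) for i in token_ids]   # inspect each token

print(token_ids)  # e.g. [9906, 1917, 0]
print(pieces)     # e.g. ['Hello', ' world', '!']
```

Note the leading space in ' world': BPE vocabularies typically fold the preceding space into a word's token.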

Examples of Tokenisation

Word-based: "Hello world!" → ["Hello", "world", "!"]

Subword-based (Byte-Pair Encoding, BPE, used in GPT models; a toy sketch of the merge procedure follows these examples):

  • "unhappiness" → ["un", "happiness"]
  • "running" → ["run", "ning"]

Character-based (used in some models):

  • "Hello" → ["H", "e", "l", "l", "o"]

Why Tokens Matter

  • Cost: LLM providers charge by token usage (e.g., OpenAI prices GPT-4 per input and output token); a simple token count and cost estimate is sketched after this list.
  • Context Length: Models have a maximum number of tokens they can process in a single request (e.g., GPT-4 Turbo has a 128K token limit).
  • Processing Speed: More tokens mean longer processing times and higher computational costs.
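
To see how this translates into money and limits, the sketch below counts tokens with tiktoken; the price per 1K tokens is a made-up placeholder, and the 128K limit is GPT-4 Turbo's advertised context length.

```python
# Sketch: count tokens to estimate cost and check a context limit.
# The price below is an illustrative assumption, not a real quote;
# check your provider's current pricing and model documentation.
import tiktoken

PRICE_PER_1K_INPUT_TOKENS = 0.01  # hypothetical USD rate
CONTEXT_LIMIT = 128_000           # e.g. GPT-4 Turbo's token limit

enc = tiktoken.get_encoding("cl100k_base")
prompt = "Hello world! " * 1000

n_tokens = len(enc.encode(prompt))
cost = n_tokens / 1000 * PRICE_PER_1K_INPUT_TOKENS

print(f"{n_tokens} tokens, estimated input cost ${cost:.4f}")
print("fits in context" if n_tokens <= CONTEXT_LIMIT else "too long")
```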

Grok Tokenizer

You can open the Tokenizer in Grok.

This gives you a rough idea of the tokens; in my opinion, they are mostly similar to whole words!

Steem to the Moon🚀!
