Skip to content

Bpetokenizer

A Byte Pair Encoding (BPE) tokenizer, which algorithmically follows along the GPT tokenizer(tiktoken), allows you to train your own tokenizer. The tokenizer is capable of handling special tokens and uses a customizable regex pattern for tokenization(includes the gpt4 regex pattern). supports save and load tokenizers in the json format. The bpetokenizer also supports pretrained tokenizers.

Overview

The Byte Pair Encoding (BPE) algorithm is a simple yet powerful method for building a vocabulary of subword units for a given text corpus. This tokenizer can be used for training your tokenizer of the LLM on various languages of text corpus.

this algorithm is first introduced in the paper Neural Machine Translation of Rare Words with Subword Units and then used this in the gpt2 tokenizer(Language Models are Unsupervised Multitask Learners)

Every LLM(LLama, Gemini, Mistral..) use their own Tokenizers trained on their own text dataset.

Features

  • Implements Byte Pair Encoding (BPE) algorithm.
  • Handles special tokens.
  • Uses a customizable regex pattern for tokenization.
  • Compatible with Python 3.9 and above

This repository has 3 different Tokenizers:

  • BPETokenizer
  • Tokenizer
  • PreTrained

  • Tokenizer: This class contains train, encode, decode and functionalities to save and load. Also contains few helper functions get_stats, merge, replace_control_characters.. to perform the BPE algorithm for the tokenizer.

  • BPETokenizer: This class emphasizes the real power of the tokenizer(used in gpt4 tokenizer..tiktoken), uses the GPT4_SPLIT_PATTERN to split the text as mentioned in the gpt4 tokenizer. also handles the special_tokens. which inherits the save and load functionlities to save and load the tokenizer respectively.

  • PreTrained Tokenizer: PreTrained Tokenizer wi17k_base, has a 17316 vocabulary. trained with the wikitext dataset (len: 1000000). with 6 special_tokens.

Usage

this tutorial leverages the special_tokens usage in the Tokenizer.

Install the package

pip install bpetokenizer
from bpetokenizer import BPETokenizer

special_tokens = {
    "<|endoftext|>": 1001,
    "<|startoftext|>": 1002,
    "[SPECIAL1]": 1003,
    "[SPECIAL2]": 1004,
}

tokenizer = BPETokenizer(special_tokens=special_tokens) # you can also use the method _special_tokens to register the special tokens (if not passed when intializing)
texts = "<|startoftext|> Hello, World! This is a sample text with the special tokens [SPECIAL1] and [SPECIAL2] to test the tokenizer.<|endoftext|>"

tokenizer.train(texts, vocab_size=310, verbose=True)
# tokenizer._special_tokens(special_tokens) # if not passed when intialization of the BPETokenizer

encode_text = """
<|startoftext|>Hello, World! This is a sample text with the special tokens [SPECIAL1] and [SPECIAL2] to test the tokenizer.
Hello, Universe! Another example sentence containing [SPECIAL1] and [SPECIAL2], used to ensure tokenizer's robustness.
Greetings, Earth! Here we have [SPECIAL1] appearing once again, followed by [SPECIAL2] in the same sentence.
Hello, World! This is yet another sample text, with [SPECIAL1] and [SPECIAL2] making an appearance.
Hey there, World! Testing the tokenizer with [SPECIAL1] and [SPECIAL2] to see if it handles special tokens properly.
Salutations, Planet! The tokenizer should recognize [SPECIAL1] and [SPECIAL2] in this long string of text.
Hello again, World! [SPECIAL1] and [SPECIAL2] are special tokens that need to be handled correctly by the tokenizer.
Welcome, World! Including [SPECIAL1] and [SPECIAL2] multiple times in this large text to ensure proper encoding.
Hi, World! Let's add [SPECIAL1] and [SPECIAL2] in various parts of this long sentence to test the tokenizer thoroughly.
<|endoftext|>
"""
ids = tokenizer.encode(encode_text, special_tokens="all")
print(ids)

decode_text = tokenizer.decode(ids)
print(decode_text)

tokenizer.save("sample_bpetokenizer", mode="json")

To Load the Tokenizer

from bpetokenizer import BPETokenizer

tokenizer = BPETokenizer()

tokenizer.load("sample_bpetokenizer.json", mode="json")

encode_text = """
<|startoftext|>Hello, World! This is a sample text with the special tokens [SPECIAL1] and [SPECIAL2] to test the tokenizer.
Hello, Universe! Another example sentence containing [SPECIAL1] and [SPECIAL2], used to ensure tokenizer's robustness.
Greetings, Earth! Here we have [SPECIAL1] appearing once again, followed by [SPECIAL2] in the same sentence.<|endoftext|>"""

print("vocab: ", tokenizer.vocab)
print('---')
print("merges: ", tokenizer.merges)
print('---')
print("special tokens: ", tokenizer.special_tokens)

ids = tokenizer.encode(encode_text, special_tokens="all")
print('---')
print(ids)

decode_text = tokenizer.decode(ids)
print('---')
print(decode_text)

# you can also print the tokens and the text chunks split with the pattern.
tokens = tokenizer.tokens(encode_text, verbose=True) # if verbose, prints the text chunks and also the pattern used to split.
print('---')
print("tokens: ", tokens)
refer to the load_json_vocab and run the bpetokenizer_json to get an overview of vocab, merges, special_tokens and to view the tokens that are split by the tokenizer using pattern, look at tokens

To load the pretrained tokenizers

from bpetokenizer import BPETokenzier

tokenizer = BPETokenizer.from_pretrained("wi17k_base", verbose=True)

texts = """
def get_stats(tokens, counts=None) -> dict:
    "Get statistics of the tokens. Includes the frequency of each consecutive pair of tokens"
    counts = if counts is None else counts
    for pair in zip(tokens, tokens[1:]):
        counts[pair] = counts.get(pair, 0) + 1
    return counts
"""
tokenizer.tokens(texts, verbose=True)
for now, we only have a single 17k vocab tokenizer wi17_base at pretrained

Run Tests

the tests folder tests/ include the tests of the tokenizer, uses pytest.

python3 -m pytest

additionally, the workflows are setup to run the tests when made a PR.

Thanks for reading the blog! I believe you got an understanding on how the tokenizers power and a crucial part of the LLMs.