bpetokenizer
A Byte Pair Encoding (BPE) tokenizer that algorithmically follows the GPT tokenizer (tiktoken) and lets you train your own tokenizer. The tokenizer is capable of handling special tokens and uses a customizable regex pattern for tokenization (including the GPT-4 regex pattern). It supports saving and loading tokenizers in JSON format. The bpetokenizer also supports pretrained tokenizers.
Overview
The Byte Pair Encoding (BPE) algorithm is a simple yet powerful method for building a vocabulary of subword units from a text corpus. This tokenizer can be used to train your own LLM tokenizer on text corpora in various languages.
The algorithm was first introduced in the paper Neural Machine Translation of Rare Words with Subword Units and later used in the GPT-2 tokenizer (Language Models are Unsupervised Multitask Learners).
Every LLM (Llama, Gemini, Mistral, ...) uses its own tokenizer trained on its own text dataset.
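To make the idea concrete, below is a minimal, self-contained sketch of the BPE training loop. The get_stats and merge helpers here echo the helper functions described later on this page, but this snippet is only an illustration of the algorithm, not the library's implementation.

def get_stats(ids):
    # count how often each consecutive pair of token ids occurs
    counts = {}
    for pair in zip(ids, ids[1:]):
        counts[pair] = counts.get(pair, 0) + 1
    return counts

def merge(ids, pair, new_id):
    # replace every occurrence of `pair` in `ids` with `new_id`
    merged = []
    i = 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            merged.append(new_id)
            i += 2
        else:
            merged.append(ids[i])
            i += 1
    return merged

text = "aaabdaaabac"
ids = list(text.encode("utf-8"))          # start from raw bytes (ids 0..255)
for n in range(3):                        # perform 3 merges
    stats = get_stats(ids)
    top_pair = max(stats, key=stats.get)  # most frequent adjacent pair
    ids = merge(ids, top_pair, 256 + n)   # assign it a brand new token id
    print("merged", top_pair, "->", 256 + n)
print(ids)

Each merge shrinks the sequence and grows the vocabulary by one token; repeating this until the target vocab_size is reached is essentially what the train method does at a larger scale.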
Features
- Implements Byte Pair Encoding (BPE) algorithm.
- Handles special tokens.
- Uses a customizable regex pattern for tokenization.
- Compatible with Python 3.9 and above
This repository has 3 different tokenizers:
- Tokenizer: This class contains train, encode, decode and functionalities to save and load the tokenizer. It also contains a few helper functions (get_stats, merge, replace_control_characters, ...) that perform the BPE algorithm for the tokenizer.
- BPETokenizer: This class brings out the real power of the tokenizer (as used in the GPT-4 tokenizer, tiktoken). It uses the GPT4_SPLIT_PATTERN to split the text, as in the GPT-4 tokenizer, and also handles special_tokens. It inherits the save and load functionalities to save and load the tokenizer respectively.
- PreTrained Tokenizer: The pretrained tokenizer wi17k_base has a vocabulary of 17,316 tokens and 6 special_tokens, trained on the wikitext dataset (len: 1000000).
Usage
This tutorial demonstrates the use of special_tokens with the tokenizer.
Install the package:
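Assuming the package is published on PyPI under the same name as the import, it can be installed with pip:

pip install bpetokenizer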
from bpetokenizer import BPETokenizer
special_tokens = {
"<|endoftext|>": 1001,
"<|startoftext|>": 1002,
"[SPECIAL1]": 1003,
"[SPECIAL2]": 1004,
}
tokenizer = BPETokenizer(special_tokens=special_tokens) # you can also use the method _special_tokens to register the special tokens (if not passed when initializing)
texts = "<|startoftext|> Hello, World! This is a sample text with the special tokens [SPECIAL1] and [SPECIAL2] to test the tokenizer.<|endoftext|>"
tokenizer.train(texts, vocab_size=310, verbose=True)
# tokenizer._special_tokens(special_tokens) # if not passed when initializing the BPETokenizer
encode_text = """
<|startoftext|>Hello, World! This is a sample text with the special tokens [SPECIAL1] and [SPECIAL2] to test the tokenizer.
Hello, Universe! Another example sentence containing [SPECIAL1] and [SPECIAL2], used to ensure tokenizer's robustness.
Greetings, Earth! Here we have [SPECIAL1] appearing once again, followed by [SPECIAL2] in the same sentence.
Hello, World! This is yet another sample text, with [SPECIAL1] and [SPECIAL2] making an appearance.
Hey there, World! Testing the tokenizer with [SPECIAL1] and [SPECIAL2] to see if it handles special tokens properly.
Salutations, Planet! The tokenizer should recognize [SPECIAL1] and [SPECIAL2] in this long string of text.
Hello again, World! [SPECIAL1] and [SPECIAL2] are special tokens that need to be handled correctly by the tokenizer.
Welcome, World! Including [SPECIAL1] and [SPECIAL2] multiple times in this large text to ensure proper encoding.
Hi, World! Let's add [SPECIAL1] and [SPECIAL2] in various parts of this long sentence to test the tokenizer thoroughly.
<|endoftext|>
"""
ids = tokenizer.encode(encode_text, special_tokens="all")
print(ids)
decode_text = tokenizer.decode(ids)
print(decode_text)
tokenizer.save("sample_bpetokenizer", mode="json")
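This writes the tokenizer to sample_bpetokenizer.json; the loading example below reads that file back and inspects its vocab, merges, and special tokens.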
To Load the Tokenizer
from bpetokenizer import BPETokenizer
tokenizer = BPETokenizer()
tokenizer.load("sample_bpetokenizer.json", mode="json")
encode_text = """
<|startoftext|>Hello, World! This is a sample text with the special tokens [SPECIAL1] and [SPECIAL2] to test the tokenizer.
Hello, Universe! Another example sentence containing [SPECIAL1] and [SPECIAL2], used to ensure tokenizer's robustness.
Greetings, Earth! Here we have [SPECIAL1] appearing once again, followed by [SPECIAL2] in the same sentence.<|endoftext|>"""
print("vocab: ", tokenizer.vocab)
print('---')
print("merges: ", tokenizer.merges)
print('---')
print("special tokens: ", tokenizer.special_tokens)
ids = tokenizer.encode(encode_text, special_tokens="all")
print('---')
print(ids)
decode_text = tokenizer.decode(ids)
print('---')
print(decode_text)
# you can also print the tokens and the text chunks split with the pattern.
tokens = tokenizer.tokens(encode_text, verbose=True) # if verbose, prints the text chunks and also the pattern used to split.
print('---')
print("tokens: ", tokens)
Refer to bpetokenizer_json for an overview of the vocab, merges, and special_tokens, and look at tokens to view the text chunks split by the tokenizer using the pattern.

To load the pretrained tokenizers
from bpetokenizer import BPETokenizer
tokenizer = BPETokenizer.from_pretrained("wi17k_base", verbose=True)
texts = """
def get_stats(tokens, counts=None) -> dict:
    "Get statistics of the tokens. Includes the frequency of each consecutive pair of tokens"
    counts = {} if counts is None else counts
    for pair in zip(tokens, tokens[1:]):
        counts[pair] = counts.get(pair, 0) + 1
    return counts
"""
tokenizer.tokens(texts, verbose=True)
The wi17k_base tokenizer is available at pretrained.

Run Tests
The tests/ folder includes the tests for the tokenizer and uses pytest.
Additionally, workflows are set up to run the tests whenever a PR is made.
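To run the tests locally (assuming pytest is installed in your environment):

pytest tests/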
Thanks for reading the blog! I hope you now have an understanding of how tokenizers work and why they are a crucial part of LLMs.