Optimizing the training of a GPT-style tokenizer in C++
Introduction
In February 2024, renowned AI researcher and educator Andrej Karpathy published the following video.
https://www.youtube.com/watch?v=zduSFxRajkE
It is an excellent introduction to the fascinating topic of tokenization, something he considers a necessary evil right now. Maybe the need for it will go away, but for now it must be well understood by AI engineers in order to get good results when building and using large language models.
For software engineers in 2025, the advent of large language models, and generative AI in general, brings about a profound shift in how we design, develop and implement software systems. As such I have been delving into the details of how they work, with tokenization being a significant side quest.
In this blog, related video and accompanying C++ project, I will walk through my exploration of the training aspect of tokenizers and look at their performance characteristics and optimization opportunities.
My main goal was to port the Python code to C++, but along the way I became interested in optimizing the training stage.
Training SOTA (state of the art) models is a massive expense in terms of time and computation. The Llama 3 model was trained on 15 trillion tokens, and although it is not public knowledge, its tokenizer may also have been trained on the whole data set. Even if it used only a smaller representative sample, tokenizer training is still a significant expense.
It therefore stands to reason that you would want to optimize it as much as possible, so that changes to the data set or algorithm can be made more easily and quickly.
Ultimately, I was able to reduce the training time by a factor of 23 (for the large wikitext input) as shown below.
Optimization results
The interactive chart below shows the time taken to train a Tokenizer on different text corpora. Use the buttons to switch between datasets and see the performance difference.
Lower is better. Time in seconds.
Input data sets used:
- taylorswift. Text of the Taylor Swift Wikipedia page.
- shakespeare. The complete works of Shakespeare.
- bible. The Old and New Testaments of the Bible.
- wikitext-103. https://huggingface.co/datasets/Salesforce/wikitext. 100 million tokens of high quality Wikipedia articles.
These are included in the repository, apart from wikitext which you can download from Hugging Face.
Briefly, Tokenization and BPE
I don't want to even try to do a better job of explaining the need for tokenization and the BPE algorithm than Karpathy, so instead I recommend watching his video and, for additional insight, reading the paper below:
[Neural Machine Translation of Rare Words with Subword Units](https://arxiv.org/pdf/1508.07909)
The need for tokenization
- Large language models deal with numbers not words or characters, so we must map input text to numbers.
- The model requires a fixed size vocabulary.
- Unfortunately, that vocabulary would have to be enormous to fit all words from all languages.
Sub word unit tokenization
The solution to these issues is sub-word tokenization: splitting words up into numbers representing components of words. Once you have that you can handle:
- Open vocabulary. Part of the token set is the basic characters of each language, so you can represent every word.
- Rare words. Because the vocabulary is open, any rare word can be handled.
- Translation of novel words. The model can translate and generate words it has not encountered before by composing them from sub-word units.
Here is a diagram showing how the fixed sized vocabulary of tokens maps to an array of learned embedding vectors that feed into the Transformer model underlying all LLMs.
```
Input Text
    │
    ▼
"I love NLP"
    │
┌─────────┴─────────┐
│     Tokenizer     │
└─────────┬─────────┘
    │
    ▼
Tokens ["i", "love", "nlp"]
    │
┌─────────┴─────────┐
│ Vocabulary Lookup │
└─────────┬─────────┘
    │
    ▼
Token IDs [25, 2097, 12510]
    │
    │        ┌───────────────────────────────────┐
    │        │          Embedding Matrix         │
    │        │       (Size: |V| x d_model)       │
    │        ├───────────────────────────────────┤
    ├───────►│ Row 25:    [0.1, -0.4, 0.2, ...]  │ ───► Embedding for "i"
    │        ├───────────────────────────────────┤
    ├───────►│ Row 2097:  [-0.8, 0.5, 0.9, ...]  │ ───► Embedding for "love"
    │        ├───────────────────────────────────┤
    ├───────►│ Row 12510: [0.3, 0.7, -0.1, ...]  │ ───► Embedding for "nlp"
    │        ├───────────────────────────────────┤
    │        │                ...                │
    │        └───────────────────────────────────┘
    │
    ▼
Input Embeddings (Dense Vectors fed to the model)
```
Tokenization means taking text and splitting it into sub-word (and whole-word or even multi-word) units. An input text like "My cat, Blivarian, is making a mess."
may be tokenized into something like this:
You can explore this tokenization here: https://platform.openai.com/tokenizer
My cat , Bl iv arian , is making a mess .
[5444, 9059, 11, 3130, 569, 21203, 11, 382, 4137, 261, 13017, 13]
Notice that the commas have the same token value wherever they appear, and that common words like cat and mess have their own tokens.
I deliberately gave the imaginary cat a name that is not a real word, "Blivarian". You can see that it is split up into three sub-words. Without sub-word tokenization this would instead have been stored as a special "out of vocabulary" token, which carries no semantic meaning. With sub-word tokens, however, the LLM has the opportunity to build up meaning for those components, which may help overall model quality.
Byte Pair Encoding - BPE
Why BPE?
From the above we understand that we should split words into sub-word components to handle the vast space of human vocabulary in the finite space of the LLM's vocabulary.
How to do that is the next question. Why not, for example, just have a vocabulary consisting of the punctuation and alphabetic characters of every language?
It won't work well because during LLM training the model builds up an embedding vector for each token, the unit of vocabulary. This vector is an array of numbers that represents a direction in multidimensional space. To us those numbers mean nothing, but during training those numbers, used in conjunction with the rest of the model's weights, can learn and represent all kinds of meaning. A single character carries very little meaning on its own, so the model gets less value from each of those embeddings.
Models like Transformers have a finite context window. When sequences are excessively long, it becomes much harder for the model to capture long-range dependencies and relationships between words that are far apart. The model has to work harder to understand the overall context.
Instead we want something that splits text into meaningful chunks (morphemes), as well as capturing commonly used words as single tokens. This ends up looking something like Huffman coding:
https://en.wikipedia.org/wiki/Huffman_coding
It represents more frequently occurring substrings with fewer bits, giving us a more compact representation.
Similarly, BPE is a data-driven algorithm that creates a vocabulary of meaningful and frequently occurring sub-word units.
BPE algorithm
First you need to train across a large corpus of realistic text. For state of the art (SOTA) LLMs this is likely in the trillions of characters of data.
The algorithm itself is very simple; it works as follows:
Start with 256 tokens (0 to 255), our basic character set.
- First turn the text into its underlying numeric representation (typically just the bytes of a UTF-8 input).
- Count all the pairs of bytes.
- Pick the most frequently occurring pair and generate the next new token (257, 258…).
- Replace that pair wherever it occurs with the new token.
Repeat until you have your full vocabulary. You can then save the merge pairs, and these are used by end users to encode their text before sending it to the model.
They can also be used to reconstruct the original text in the decoding process when the response comes from the model.
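To make this concrete, here is a minimal, naive sketch of the training loop in C++. It is illustrative only, not the minbpe-cc implementation: it recounts every pair on every step (exactly the cost we will optimize away later), and ties are broken by std::map ordering rather than the conflict resolution strategies discussed next.

```cpp
#include <algorithm>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Naive BPE training sketch (illustrative only): recounts all pairs each step.
std::vector<std::pair<int,int>> train_bpe(const std::string& text, int vocab_size) {
  std::vector<int> ids;                                        // bytes become tokens 0..255
  ids.reserve(text.size());
  for (unsigned char c : text) ids.push_back(c);

  std::vector<std::pair<int,int>> merges;
  int next_token = 256;
  while (next_token < vocab_size && ids.size() > 1) {
    std::map<std::pair<int,int>, int> counts;                  // count all adjacent pairs
    for (std::size_t i = 0; i + 1 < ids.size(); ++i) counts[{ids[i], ids[i + 1]}]++;

    auto best = std::max_element(counts.begin(), counts.end(),
        [](const auto& a, const auto& b) { return a.second < b.second; });

    std::vector<int> out;                                      // replace the best pair everywhere
    out.reserve(ids.size());
    for (std::size_t i = 0; i < ids.size(); ++i) {
      if (i + 1 < ids.size() && std::pair{ids[i], ids[i + 1]} == best->first) {
        out.push_back(next_token);
        ++i;                                                   // skip the second element of the pair
      } else {
        out.push_back(ids[i]);
      }
    }
    ids = std::move(out);
    merges.push_back(best->first);
    ++next_token;
  }
  return merges;
}
```

The merges vector, read in order, is exactly what gets saved and later replayed during encoding and decoding.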
Conflict Resolution
An important decision in tokenization is how to handle pairs with the same frequency. In this post I'll consider two methods:
- First in corpus wins.
- Lexicographical ordering.
With any tokenization algorithm design we need to consider efficiency of implementation alongside methods that give the best results. Some of these concerns will be highlighted below.
minbpe-cc: an exercise in optimization
With these algorithmic decisions in mind, I was ready to dive into the C++ implementation and see how they performed in practice. This led to my project, minbpe-cc. I find the best way to learn a topic is to get my hands dirty, and as such I decided to reimplement Karpathy's Python code in C++.
I also wanted to focus on optimization of the training stage, for no other reason than curiosity.
Why C++?
- It's a low level language with generally low to zero cost abstractions.
- I've recently been catching up with modern C++ and wanted to try out some of the new features (C++23 required).
The final code here fully implements all the facets of Karpathy's minbpe, including encoding, decoding and training. I've included end-to-end tests and tested in Linux and macOS environments. I have not tested on Windows yet, but I expect it will work without much modification.
Implementation tales
Speed bumps
Converting from Python to C++ is fairly straightforward although I hit some speed bumps on the way:
- Python dictionary behaviour. The Python dictionary is designed to be flexible for multiple purposes rather than optimized, so getting the same behaviour from C++ containers required some additional thought.
- Polymorphism. I didn't really like Karpathy's polymorphic version and instead decided to use a single class design with flags and other parameters to handle whether special tokens are used, what the conflict strategy is, and whether to use a regex or not. It was quite easy to make this work with some tweaks to the original code. Ironically, I did use polymorphism in the PairCount class so I can use different implementations at runtime depending on the user's preferences.
- CMake. CMake is not a casual tool. I found I could just about get my project to build and run using it, but after switching to the Zig build system I found things much easier to manage. In other words, to use CMake effectively would require me to read the manual in a lot more detail.
Regex compatibility
Firstly, what are regexes needed for?
In the GPT series of tokenizers, OpenAI realized that it is beneficial to try to keep certain parts of the text together. As such, rather than run BPE on the whole input text, they first divide it up into sections using the following regular expressions:
- GPT2
"'(?:[sdmt]|ll|ve|re)| ?\\p{L}+| ?\\p{N}+| ?[^\\s\\p{L}\\p{N}]+|\\s+(?!\\S)|\\s+"
- GPT4
"'(?i:[sdmt]|ll|ve|re)|[^\\r\\n\\p{L}\\p{N}]?+\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]++[\\r\\n]*|\\s*[\\r\\n]|\\s+(?!\\S)|\\s+"
These expressions are designed to preserve various aspects of English text rather than allow them to be split up during the merge process.
Whilst there are a few established regex libraries for C++ (writing my own being out of scope for this project), finding one that was capable of handling these regular expressions took some looking.
These expressions need support for Unicode property matchers and also negative lookahead.
I compared several libraries:
- RE2 from Google.
- std::regex in the C++ standard library.
- Boost::regex
- Re-Flex
None of these met the requirements.
In the end I found the Perl-compatible PCRE2 library worked the best.
The biggest footgun was that the Boost::regex library was asserting because Boost was not linking properly with the ICU (internationalization) library. I suspect this could be made to work but I gave up.
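To give a flavour of what using it looks like, here is a minimal sketch of pre-splitting text with the GPT-2 pattern via the PCRE2 C API. This is my own illustrative example, not the code from minbpe-cc; the PCRE2_UTF and PCRE2_UCP options are what make the \p{L} and \p{N} classes work on UTF-8 input.

```cpp
// Illustrative sketch (not the minbpe-cc code): split text into chunks using
// the GPT-2 pattern with PCRE2 before running BPE on each chunk separately.
#define PCRE2_CODE_UNIT_WIDTH 8
#include <pcre2.h>
#include <string>
#include <vector>

std::vector<std::string> split_chunks(const std::string& text, const std::string& pattern) {
  int errcode = 0;
  PCRE2_SIZE erroffset = 0;
  // PCRE2_UTF treats the subject as UTF-8; PCRE2_UCP makes \p{L}, \p{N} etc. use Unicode properties.
  pcre2_code* re = pcre2_compile(reinterpret_cast<PCRE2_SPTR>(pattern.c_str()),
                                 PCRE2_ZERO_TERMINATED, PCRE2_UTF | PCRE2_UCP,
                                 &errcode, &erroffset, nullptr);
  if (re == nullptr) return {};

  pcre2_match_data* md = pcre2_match_data_create_from_pattern(re, nullptr);
  std::vector<std::string> chunks;
  PCRE2_SIZE start = 0;
  while (start < text.size()) {
    int rc = pcre2_match(re, reinterpret_cast<PCRE2_SPTR>(text.data()), text.size(),
                         start, 0, md, nullptr);
    if (rc < 0) break;                                   // no more matches
    PCRE2_SIZE* ov = pcre2_get_ovector_pointer(md);
    if (ov[1] == ov[0]) break;                           // avoid an infinite loop on empty matches
    chunks.emplace_back(text.substr(ov[0], ov[1] - ov[0]));
    start = ov[1];
  }
  pcre2_match_data_free(md);
  pcre2_code_free(re);
  return chunks;
}
```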
Optimization mantras
In Systems Performance: Enterprise and the Cloud by Brendan Gregg (2021), the following mantras for performance are listed, ordered from most to least effective. I find these useful when considering optimization.
- Don’t do it.
- Do it, but don’t do it again.
- Do it less.
- Do it later.
- Do it when they’re not looking.
- Do it concurrently.
- Do it more cheaply.
We will refer back to these throughout the post.
Data structures
The first step in porting the Python code and making it more efficient is to think about the data involved and how that data needs to be accessed.
Data
- Body text. We will store this as a vector (array) of numbers representing the input text for training.
- Pair frequencies. We need to keep track of all the pairs in the body text and their frequencies.
Access patterns
- Body text. We need sequential access to scan for pairs. Then we need to be able to delete elements as part of the merge process.
- Pair frequencies. We need to be able to store the pairs and their frequencies and efficiently update them as we scan the body text. In addition we need fast access to the next most frequent pair.
Implementation
Body text
Because the body text requires sequential access and the ability to quickly remove elements, I used a singly linked list, std::forward_list. This has the desirable properties of sequential access and O(1) deletions. forward_list also has the lowest per-element memory overhead of the node-based standard C++ containers (a single pointer to the next element).
Other valid options considered:
- Keep in a vector but use tombstones for removed items. This has the advantage of eliminating the memory moves for each replacement, and it doesn't share forward_list's problem of not giving us an easy way to know the position in the input text (see later). This is quite a tricky implementation but perfectly feasible.
- Keep in a vector and do the memory moves. Requires a lot of memory bandwidth and CPU for the copying, but it is simple.
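To make the merge step over the linked list concrete, here is a minimal sketch of replacing one pair throughout a std::forward_list. This is illustrative rather than the exact minbpe-cc code; the real version also has to update the pair frequency bookkeeping as it goes.

```cpp
#include <forward_list>
#include <iterator>
#include <utility>

// Illustrative sketch: replace every occurrence of the adjacent pair `target`
// in the token list with the newly minted token `new_token`.
template<typename T>
void merge_pair(std::forward_list<T>& tokens, std::pair<T,T> target, T new_token) {
  for (auto it = tokens.begin(); it != tokens.end(); ++it) {
    auto next = std::next(it);
    if (next != tokens.end() && *it == target.first && *next == target.second) {
      *it = new_token;           // overwrite the first element of the pair...
      tokens.erase_after(it);    // ...and remove the second in O(1)
    }
  }
}
```

For example, merging the pair (101, 108) into token 256 turns the byte sequence for "hello", [104, 101, 108, 108, 111], into [104, 256, 108, 111].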
Pair frequencies
Ultimately I needed multiple structures here, as I wanted to support more than one conflict resolution strategy, and since these are picked by the user at runtime we need dynamic dispatch. So first I made an abstract class with the required interface for both:
```cpp
template<typename T>
class PairCount {
public:
  // Virtual destructor to ensure proper cleanup of derived classes.
  virtual ~PairCount() = default;
  // Gets the total number of unique pairs stored.
  virtual size_t get_count() = 0;
  // Retrieves the count for a specific pair.
  virtual optional<int> get_pair(pair<T,T> mp) = 0;
  // Creates a new pair or modifies the frequency of an existing one.
  virtual bool create_or_modify_pair(T a, T b, int freq) = 0;
  // Gets the pair with the highest count.
  virtual optional<pair<T,T>> get_top_pair_count() = 0;
  // Retrieves all pairs and their counts.
  virtual std::vector<std::vector<T>> get_all() = 0;
};
```
Note that the class has a template parameter, as the Tokenizer can be recompiled with different underlying numeric types for the tokens.
- Conflict resolution strategy: First seen in input
Imagine a sequence as follows:
1,2,8,9,3,4…
After counting all the pairs we find that [1,2] and [3,4] have the same frequency.
- [1,2] => 20
- [3,4] => 20
In this case we pick the one added first, which means the one first seen in the input text.
In Python this insertion order comes for free because of Raymond Hettinger's 2012 redesign of the Python dictionary. Implemented in Python 3.6 (released December 23, 2016), it introduced compact dictionaries with better memory use and faster performance. A side effect of this redesign was that dictionaries began preserving insertion order as an implementation detail. This was later formalized as a language guarantee in Python 3.7 (released June 27, 2018), meaning dictionaries officially maintain the order of key-value pairs as they are inserted.
In Karpathy's code you can see that he simply relies on this behaviour to get a consistent result, as described above.
```python
# count up the number of times every consecutive pair appears
stats = get_stats(ids)
# find the pair with the highest count
pair = max(stats, key=stats.get)
```
And from the Python documentation: https://docs.python.org/3/library/functions.html#max
If multiple items are maximal, the function returns the first one encountered. This is consistent with other sort-stability preserving tools such as sorted(iterable, key=keyfunc, reverse=True)[0] and heapq.nlargest(1, iterable, key=keyfunc).
In order to implement that we must track the insertion order. Rather than let the user deal with that, I built it into the PairCount class. As elements are added, each new one is stamped with the current insertion counter, which is then incremented.
Picking a data structure here is tricky because we want to be able to quickly store and modify pair frequencies (an unordered_map), and a way to get the most frequent pair (a priority_queue). Furthermore, we want to keep track of insertion order.
Sometimes you need to use multiple data structures to support a use case with conflicting requirements. For this purpose I used boost::multi_index: https://www.boost.org/doc/libs/1_88_0/libs/multi_index/doc/index.html
There's nothing to stop you from using a set and a priority queue and keeping them in sync yourself, but multi_index handles that for you based on the declaration of which indexes and access patterns you need.
Let's take a look at the implementation of PairCountInsertOrder. First the data; we need to store the pair, the count and the insertion order.
```cpp
template<typename T>
struct PairCountOrder {
  ::pair<T,T> pair;
  int count;
  size_t insert_order;
  PairCountOrder(::pair<T,T> p, int c, size_t fo) : pair(p), count(c), insert_order(fo) {}
  PairCountOrder(::pair<T,T> p, int c) : pair(p), count(c), insert_order(std::numeric_limits<size_t>::max()) {}
};

// Comparison struct for sorting. Sorts by count (descending), then by insertion order (ascending).
template<typename T>
struct CompareCountOrder {
  bool operator()(const PairCountOrder<T>& a, const PairCountOrder<T>& b) const {
    if(a.count == b.count) {
      return a.insert_order < b.insert_order;
    } else {
      return a.count > b.count; // higher count is greater
    }
  }
};
```
Next we define the container itself. We just specify the indexes required and Boost takes care of picking the underlying data structures.
```cpp
template<typename T>
using PairCountStore = boost::multi_index_container<
  PairCountOrder<T>,
  indexed_by<
    // Index 0: Hashed unique index on the 'pair' member for fast lookups.
    hashed_unique<member<PairCountOrder<T>, pair<T,T>, &PairCountOrder<T>::pair>>,
    // Index 1: Ordered non-unique index for sorting by count and insertion order.
    ordered_non_unique<identity<PairCountOrder<T>>, CompareCountOrder<T>>
  >
>;
```
Index 0 explanation: It is hashed, so we should get O(1) lookups, and unique, meaning keys are unique: each pair can occur once only. The rest of the declaration says how to get the key for this index (use the pair member).
Index 1 explanation: This needs to be an ordered collection so we can extract the highest frequency. It also needs to be non-unique (in its sort criteria), because we can have multiple elements with the same frequency.
Now in our code we can grab the appropriate index depending on the current purpose, and when we modify the data the Boost library ensures the changes are synchronized across all the indexes in the container.
```cpp
auto& index_by_key = pcs.template get<0>();
auto f = index_by_key.find(mp);
if(f != pcs.end()) {
  index_by_key.modify(f, [freq](PairCountOrder<T>& pc) {
    pc.count += freq;
  });
  return false;
} else {
  pcs.insert(PairCountOrder<T>(mp, freq, next_insert++));
  return true;
}
```
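Retrieving the most frequent pair is then just a read of the first element of the ordered index. Here is a sketch, assuming the PairCountStore alias above; the member function in minbpe-cc may be shaped slightly differently.

```cpp
#include <optional>
#include <utility>

// Sketch: the ordered index (index 1) is kept sorted by CompareCountOrder,
// i.e. highest count first with insertion order as the tie-breaker, so the
// best pair is simply the first element.
template<typename T>
std::optional<std::pair<T,T>> top_pair(const PairCountStore<T>& pcs) {
  const auto& by_count = pcs.template get<1>();
  if (by_count.empty()) return std::nullopt;
  return by_count.begin()->pair;
}
```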
- Conflict resolution strategy: Lexicographical
Referred to as lexical in my implementation to save typing, this method means that on a tie we pick the pair that comes first in lexicographic (numeric) order. For example, given the following two pairs:
- [1,2] => 20
- [3,4] => 20
They have the same frequency, so we pick [1,2] because 1 < 3. The second member of the pair is used as the tie-breaker, and of course if both members are the same then they would already be a single entry in the PairCount.
Again a multi_index container is needed here. Let's start with the data:
```cpp
template<typename T>
struct PairCountLexical {
  ::pair<T,T> pair;
  int count;
  PairCountLexical(::pair<T,T> p, int c) : pair(p), count(c) {}
};

// Comparison struct for sorting. Sorts by count (descending), then by pair (lexical ascending).
template<typename T>
struct CompareLexicalOrder {
  bool operator()(const PairCountLexical<T>& a, const PairCountLexical<T>& b) const {
    if(a.count == b.count) {
      if (a.pair.first == b.pair.first) {
        return a.pair.second < b.pair.second;
      } else {
        return a.pair.first < b.pair.first;
      }
    } else {
      return a.count > b.count; // higher count is greater
    }
  }
};
```
And the container looks like this:
```cpp
template<typename T>
using PairCountLexicalStore = boost::multi_index_container<
  PairCountLexical<T>,
  indexed_by<
    // Index 0: Hashed unique index on the 'pair' member for fast lookups.
    hashed_unique<member<PairCountLexical<T>, pair<T,T>, &PairCountLexical<T>::pair>>,
    // Index 1: Ordered non-unique index for sorting by count and lexical order.
    ordered_non_unique<identity<PairCountLexical<T>>, CompareLexicalOrder<T>>
  >
>;
```
Index 0 explanation: Same as above this gives us fast insert, modify and lookup for the pair frequencies.
Index 1 explanation: Same as above, except the ordering is different because of the implementation of CompareLexicalOrder.
- Optimization of frequency counts
When running the code I can see that the biggest cost is regenerating the frequency map each step. For example, when churning through wikitext (a 500 MB text corpus) it takes the Python code 28 seconds on my MacBook to count all the pairs.
Let's work through Brendan Gregg's impactful optimizations:
- Don’t do it.
- Do it, but don’t do it again.
- Do it less.
- Do it later.
- Do it when they’re not looking.
- Do it concurrently.
- Do it more cheaply.
"Don't do it" is not an option; we need those updated counts at each step. "Do it, but don't do it again" is fruitful though.
The key insight here is that we only need to do a full frequency count one time. Then we can incrementally update the pair frequencies as we walk through doing the merge process. Essentially we are removing and adding a number of pairs on each replacement.
I noticed that the authors of the paper mentioned this too:
"In practice, we increase efficiency by indexing all pairs, and updating data structures incrementally."
You can see their incremental update code in the subword-nmt repository (linked in the references below).
In addition to this optimization they use a pruning technique that drops pairs whose frequency falls below some threshold. This makes sense because the Python max function iterates the whole collection. In my case the data structures keep the pairs ordered, so pruning is probably not worth the additional complexity. It might still be worth trying.
In any case, for my lexicographical conflict strategy I do implement this optimization and it is a huge win on performance as shown in the charts.
Crucially, it is not implemented for the first-occurrence strategy, because the current implementation gives no way to easily keep track of the first occurrence of a pair in the input corpus.
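For the lexical strategy, the bookkeeping for a single merge site looks roughly like this. The helper below is hypothetical and written against a plain map of counts rather than the PairCount containers used in minbpe-cc: when ... l a b r ... becomes ... l z r ..., the pair (a,b) loses a count, the pairs (l,a) and (b,r) lose a count, and the new pairs (l,z) and (z,r) gain one.

```cpp
#include <map>
#include <optional>
#include <utility>

// Hypothetical helper: apply the count deltas for one replacement of the pair
// (a, b) by the new token z, with optional left/right neighbours l and r.
//   (a,b) -1, (l,a) -1, (l,z) +1, (b,r) -1, (z,r) +1
template<typename T>
void update_counts_around_merge(std::map<std::pair<T,T>, long>& counts,
                                std::optional<T> left, T a, T b,
                                std::optional<T> right, T z) {
  auto bump = [&](T x, T y, long delta) {
    long& c = counts[{x, y}];
    c += delta;
    if (c <= 0) counts.erase({x, y});   // drop pairs whose count reaches zero
  };
  bump(a, b, -1);
  if (left)  { bump(*left, a, -1); bump(*left, z, +1); }
  if (right) { bump(b, *right, -1); bump(z, *right, +1); }
}
```

Runs of identical tokens, where consecutive merge sites overlap, are the fiddly part and need extra care whichever container is used.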
Next steps
For me the project is at a good point to move on to other things, but here are some things I would do next otherwise:
- Port to Zig. Currently I'm using Zig for some other projects and would be interested in the porting experience and how the performance compares.
- Work on different data structures for the input text to support incremental frequency counting for the first strategy.
- Optimization of the encode and decode steps.
- Implement download and conversion of GPT merges like Karpathy does in his gpt4 code.
- Look at implementations of other Tokenization algorithms.
- Optimization by parallel computation. At face value it seems possible to do the merge process on multiple cores using a divide and conquer approach. Edge cases where the sections overlap may be tricky.
Conclusion
I had a lot of pain and a lot of fun working on the code. I highly recommend this kind of process to fully understand the nuances and implementation details required for AI engineering.
As a refresher on modern C++ this was a great project. I recommend [A Tour of C++](https://www.amazon.ca/Tour-C-2nd-Bjarne-Stroustrup/dp/0134997832) and the [C++ Core Guidelines](https://isocpp.github.io/CppCoreGuidelines/CppCoreGuidelines).
Apart from the complexity of CMake I found that using C++ today is a pleasant and safe experience as long as you carefully tread the recommended path.
References
Karpathy's YouTube video
https://www.youtube.com/watch?v=zduSFxRajkE
And his Python code
https://github.com/karpathy/minbpe
If you want to dive into the code or run the benchmarks yourself, you can find the full project on GitHub.
https://github.com/justinhj/minbpe-cc
An early paper on BPE for tokenization is "Neural Machine Translation of Rare Words with Subword Units"
https://arxiv.org/pdf/1508.07909
The original source code from the paper.
https://github.com/rsennrich/subword-nmt
Thanks for reading!
©2025 Justin Heyes-Jones. All Rights Reserved