SharpToken


SharpToken is a C# library that serves as a port of the Python tiktoken library. It provides functionality for encoding and decoding tokens using GPT-based encodings. This library is built for .NET 6, .NET 8 and .NET Standard 2.0, making it compatible with a wide range of frameworks.

[!Important] The functionality in SharpToken has been added to Microsoft.ML.Tokenizers. Microsoft.ML.Tokenizers is a tokenizer library being developed by the .NET team and going forward, the central place for tokenizer development in .NET. By using Microsoft.ML.Tokenizers, you should see improved performance over existing tokenizer library implementations, including SharpToken. A stable release of Microsoft.ML.Tokenizers is expected alongside the .NET 9.0 release (November 2024). Instructions for migration can be found at https://github.com/dotnet/machinelearning/blob/main/docs/code/microsoft-ml-tokenizers-migration-guide.md.

Installation

To install SharpToken, use the NuGet package manager:

Install-Package SharpToken

Or, if you prefer using the .NET CLI:

dotnet add package SharpToken

For more information, visit the NuGet package page.

Usage

To use SharpToken in your project, first import the library:

using SharpToken;

Next, create an instance of GptEncoding by specifying the desired encoding or model:

// Get encoding by encoding name
var encoding = GptEncoding.GetEncoding("cl100k_base");

// Get encoding by model name
var encoding = GptEncoding.GetEncodingForModel("gpt-4");

You can then use the Encode method to encode a string:

var encoded = encoding.Encode("Hello, world!"); // Output: [9906, 11, 1917, 0]

And use the Decode method to decode the encoded tokens:

var decoded = encoding.Decode(encoded); // Output: "Hello, world!"

SharpToken also provides a high-performance CountTokens method. It is useful for checking prompt size before sending text to an LLM, or for building a text splitter/chunker for RAG.

var count = encoding.CountTokens("Hello, world!"); // Output: 4
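As a sketch of the chunking use case, the following helper (SplitByTokens is hypothetical, not part of SharpToken) greedily packs sentences into chunks while keeping each chunk under a token budget, using CountTokens to measure the running size:

```csharp
using System.Collections.Generic;
using SharpToken;

// Hypothetical helper: split text into chunks of at most maxTokens tokens by
// greedily appending sentence fragments and checking the running token count.
// Note: a single fragment larger than maxTokens is not split further here.
static List<string> SplitByTokens(GptEncoding encoding, string text, int maxTokens)
{
    var chunks = new List<string>();
    var current = string.Empty;
    foreach (var sentence in text.Split('.'))
    {
        var candidate = current.Length == 0 ? sentence : current + "." + sentence;
        if (encoding.CountTokens(candidate) > maxTokens && current.Length > 0)
        {
            chunks.Add(current);   // current chunk is full; start a new one
            current = sentence;
        }
        else
        {
            current = candidate;
        }
    }
    if (current.Length > 0) chunks.Add(current);
    return chunks;
}
```

Usage would look like `var chunks = SplitByTokens(GptEncoding.GetEncoding("cl100k_base"), longDocument, 512);`.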

Supported Models

SharpToken currently supports the r50k_base, p50k_base, p50k_edit, cl100k_base and o200k_base encodings, covering the corresponding OpenAI models.

You can use any of these encodings when creating an instance of GptEncoding:

var r50kBaseEncoding = GptEncoding.GetEncoding("r50k_base");
var p50kBaseEncoding = GptEncoding.GetEncoding("p50k_base");
var p50kEditEncoding = GptEncoding.GetEncoding("p50k_edit");
var cl100kBaseEncoding = GptEncoding.GetEncoding("cl100k_base");
var o200kBaseEncoding = GptEncoding.GetEncoding("o200k_base");

Model Prefix Matching

Apart from specifying direct model names, SharpToken also provides functionality to map model names based on specific prefixes. This allows users to retrieve an encoding based on a model's prefix.

Here are the current supported prefixes and their corresponding encodings:

| Model Prefix   | Encoding    |
|----------------|-------------|
| gpt-4o         | o200k_base  |
| gpt-4-         | cl100k_base |
| gpt-3.5-turbo- | cl100k_base |
| gpt-35-turbo   | cl100k_base |

Model names such as gpt-4-0314 fall under these prefixes. To retrieve the encoding name based on a model name or its prefix, use the GetEncodingNameForModel method:

string encodingName = Model.GetEncodingNameForModel("gpt-4-0314");  // This will return "cl100k_base"

If the provided model name doesn't match any direct model names or prefixes, the method will return null.
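For illustration, here is a short sketch of both outcomes (the model names below are only examples):

```csharp
using SharpToken;

// "gpt-3.5-turbo-16k" matches the "gpt-3.5-turbo-" prefix.
var matched = Model.GetEncodingNameForModel("gpt-3.5-turbo-16k"); // "cl100k_base"

// No direct model name or prefix matches, so null is returned.
var unmatched = Model.GetEncodingNameForModel("my-custom-model"); // null
```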

Understanding Encoded Values

When you encode a string using the Encode method, the returned value is a list of integers that represent tokens in the specified encoding. These tokens are a compact way of representing the input text and can be processed more efficiently by various algorithms.

For example, encoding the text "Hello world!" using the cl100k_base encoding might produce the following list of integers:

var encoded = cl100kBaseEncoding.Encode("Hello world!"); // Output: [9906, 1917, 0]

You can then use the Decode method to convert these tokenized integer values back into the original text:

var decoded = cl100kBaseEncoding.Decode(encoded); // Output: "Hello world!"

With SharpToken, you can seamlessly switch between different encodings to find the one that best suits your needs. Just remember to use the same encoding for both the Encode and Decode methods to ensure accurate results.
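To illustrate why the encodings must match, here is a small sketch: the same token ids decoded with a different encoding are looked up in a different vocabulary, so the result is generally unrelated text.

```csharp
using System;
using SharpToken;

var cl100k = GptEncoding.GetEncoding("cl100k_base");
var p50k = GptEncoding.GetEncoding("p50k_base");

var tokens = cl100k.Encode("Hello world!");

// Same encoding: the round trip restores the original text.
Console.WriteLine(cl100k.Decode(tokens)); // "Hello world!"

// Different encoding: the ids index a different vocabulary,
// so the decoded string is typically garbled.
Console.WriteLine(p50k.Decode(tokens));
```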

Advanced usage

Custom Allowed Sets

SharpToken allows you to specify custom sets of allowed special tokens when encoding text. To do this, pass a HashSet<string> containing the allowed special tokens as a parameter to the Encode method:

const string encodingName = "cl100k_base";
const string inputText = "Some Text <|endofprompt|>";
var allowedSpecialTokens = new HashSet<string> { "<|endofprompt|>" };

var encoding = GptEncoding.GetEncoding(encodingName);
var encoded = encoding.Encode(inputText, allowedSpecialTokens);
var expectedEncoded = new List<int> { 8538, 2991, 220, 100276 };

Assert.Equal(expectedEncoded, encoded);

Custom Disallowed Sets

Similarly, you can specify custom sets of disallowed special tokens when encoding text. Pass a HashSet<string> containing the disallowed special tokens as a parameter to the Encode method:

const string encodingName = "cl100k_base";
const string inputText = "Some Text";

var encoding = GptEncoding.GetEncoding(encodingName);

void TestAction()
{
    encoding.Encode(inputText, disallowedSpecial: new HashSet<string> { "Some" });
}

Assert.Throws<ArgumentException>(TestAction);

In this example, an ArgumentException is thrown because the input text contains the disallowed special token "Some".

Testing and Validation

SharpToken includes a set of test cases in the TestPlans.txt file to ensure its compatibility with the Python tiktoken library. These test cases validate the functionality and behavior of SharpToken, providing a reliable reference for developers. Running the unit tests and verifying the test cases helps maintain consistency between the C# SharpToken library and the original Python implementation.

Performance Compared to TiktokenSharp and TokenizerLib

Among the libraries compared here (TiktokenSharp and TokenizerLib), SharpToken is the fastest and allocates the least; only the newer Microsoft.ML.Tokenizers is faster.

<details> <summary>Benchmark Code</summary>
[SimpleJob(RuntimeMoniker.Net60)]
[SimpleJob(RuntimeMoniker.Net80)]
[SimpleJob(RuntimeMoniker.Net471)]
[RPlotExporter]
[MemoryDiagnoser]
public class CompareBenchmark
{
    private GptEncoding _sharpToken;
    private TikToken _tikToken;
    private ITokenizer _tokenizer;
    private Tokenizer _mlTokenizer;
    private string _kLongText;

    [GlobalSetup]
    public async Task Setup()
    {
        _sharpToken = GptEncoding.GetEncoding("cl100k_base");
        _tikToken = await TikToken.GetEncodingAsync("cl100k_base").ConfigureAwait(false);
        _tokenizer = await TokenizerBuilder.CreateByModelNameAsync("gpt-4").ConfigureAwait(false);
        // Initialize the Microsoft.ML.Tokenizers tokenizer, which the original
        // listing left unassigned (assumes the TiktokenTokenizer factory API).
        _mlTokenizer = TiktokenTokenizer.CreateForModel("gpt-4");
        _kLongText = "King Lear, one of Shakespeare's darkest and most savage plays, tells the story of the foolish and Job-like Lear, who divides his kingdom, as he does his affections, according to vanity and whim. Lear’s failure as a father engulfs himself and his world in turmoil and tragedy.";
    }

    [Benchmark]
    public int SharpToken()
    {
        var sum = 0;
        for (var i = 0; i < 10000; i++)
        {
            var encoded = _sharpToken.Encode(_kLongText);
            var decoded = _sharpToken.Decode(encoded);
            sum += decoded.Length;
        }

        return sum;
    }

    [Benchmark]
    public int TiktokenSharp()
    {
        var sum = 0;
        for (var i = 0; i < 10000; i++)
        {
            var encoded = _tikToken.Encode(_kLongText);
            var decoded = _tikToken.Decode(encoded);
            sum += decoded.Length;
        }

        return sum;
    }

    [Benchmark]
    public int TokenizerLib()
    {
        var sum = 0;
        for (var i = 0; i < 10000; i++)
        {
            var encoded = _tokenizer.Encode(_kLongText);
            var decoded = _tokenizer.Decode(encoded.ToArray());
            sum += decoded.Length;
        }

        return sum;
    }

    [Benchmark]
    public int MLTokenizers()
    {
        var sum = 0;
        for (var i = 0; i < 10000; i++)
        {
            var encoded = _mlTokenizer.EncodeToIds(_kLongText);
            var decoded = _mlTokenizer.Decode(encoded);
            sum += decoded.Length;
        }

        return sum;
    }
}
</details>
BenchmarkDotNet v0.13.9+228a464e8be6c580ad9408e98f18813f6407fb5a, Windows 11 (10.0.22631.3296)
11th Gen Intel Core i9-11950H 2.60GHz, 1 CPU, 16 logical and 8 physical cores
.NET SDK 9.0.100-preview.2.24157.14
  [Host]               : .NET 8.0.3 (8.0.324.11423), X64 RyuJIT AVX2
  .NET 6.0             : .NET 6.0.28 (6.0.2824.12007), X64 RyuJIT AVX2
  .NET 8.0             : .NET 8.0.3 (8.0.324.11423), X64 RyuJIT AVX2
  .NET Framework 4.7.1 : .NET Framework 4.8.1 (4.8.9181.0), X64 RyuJIT VectorSize=256
| Method        | Job                  | Runtime              | Mean      | Error    | StdDev    | Median    | Gen0       | Gen1      | Allocated |
|---------------|----------------------|----------------------|----------:|---------:|----------:|----------:|-----------:|----------:|----------:|
| MLTokenizers  | .NET 8.0             | .NET 8.0             |  60.55 ms | 1.143 ms |  1.123 ms |  60.45 ms |  1000.0000 |         - |  13.12 MB |
| MLTokenizers  | .NET 6.0             | .NET 6.0             |  95.75 ms | 1.374 ms |  1.147 ms |  95.54 ms | 10500.0000 |         - | 126.19 MB |
| MLTokenizers  | .NET Framework 4.7.1 | .NET Framework 4.7.1 | 291.77 ms | 5.811 ms | 11.195 ms | 291.64 ms | 21000.0000 |         - | 127.33 MB |
| SharpToken    | .NET 8.0             | .NET 8.0             |  87.78 ms | 1.700 ms |  1.590 ms |  87.34 ms |  1000.0000 |         - |  22.13 MB |
| SharpToken    | .NET 6.0             | .NET 6.0             | 128.84 ms | 1.718 ms |  1.607 ms | 128.17 ms | 16250.0000 |  500.0000 | 196.31 MB |
| SharpToken    | .NET Framework 4.7.1 | .NET Framework 4.7.1 | 356.21 ms | 6.843 ms | 10.854 ms | 355.09 ms | 34000.0000 | 1000.0000 | 204.39 MB |
| TokenizerLib  | .NET 8.0             | .NET 8.0             | 109.26 ms | 2.082 ms |  4.482 ms | 107.90 ms | 18200.0000 |  600.0000 | 217.82 MB |
| TokenizerLib  | .NET 6.0             | .NET 6.0             | 126.16 ms | 2.959 ms |  8.630 ms | 122.34 ms | 18000.0000 |  500.0000 | 217.82 MB |
| TokenizerLib  | .NET Framework 4.7.1 | .NET Framework 4.7.1 | 374.71 ms | 7.374 ms | 16.794 ms | 370.12 ms | 40000.0000 | 1000.0000 | 243.79 MB |
| TiktokenSharp | .NET 8.0             | .NET 8.0             | 177.34 ms | 3.506 ms |  8.797 ms | 174.98 ms | 28000.0000 | 1000.0000 | 338.98 MB |
| TiktokenSharp | .NET 6.0             | .NET 6.0             | 196.17 ms | 3.912 ms |  8.422 ms | 195.52 ms | 26000.0000 |  666.6667 | 313.26 MB |
| TiktokenSharp | .NET Framework 4.7.1 | .NET Framework 4.7.1 | 488.22 ms | 9.696 ms | 15.931 ms | 487.17 ms | 63000.0000 | 1000.0000 | 378.31 MB |

Performance

SharpToken is heavily performance-optimized on net8.0. It makes use of modern vectorized CPU instructions and performs almost no heap allocations.

All core methods (Encode, Decode and CountTokens) were benchmarked on a small and a large input text.

BenchmarkDotNet v0.13.12, Windows 11 (10.0.22631.3296/23H2/2023Update/SunValley3)
AMD Ryzen 9 3900X, 1 CPU, 24 logical and 12 physical cores
.NET SDK 8.0.200
  [Host]               : .NET 8.0.2 (8.0.224.6711), X64 RyuJIT AVX2
  .NET 6.0             : .NET 6.0.16 (6.0.1623.17311), X64 RyuJIT AVX2
  .NET 8.0             : .NET 8.0.2 (8.0.224.6711), X64 RyuJIT AVX2
  .NET Framework 4.7.1 : .NET Framework 4.8.1 (4.8.9181.0), X64 RyuJIT VectorSize=256
.NET 8.0

| Method                | Mean         | Error       | StdDev      | Ratio | RatioSD | Allocated | Alloc Ratio |
|-----------------------|-------------:|------------:|------------:|------:|--------:|----------:|------------:|
| Encode_SmallText      |    22.649 us |   0.4244 us |   0.4359 us |  0.28 |    0.01 |     696 B |       0.02  |
| Encode_LargeText      | 4,542.505 us |  87.7988 us | 104.5182 us |  0.24 |    0.01 |  155547 B |       0.03  |
| Decode_SmallText      |     1.623 us |   0.0324 us |   0.0373 us |  0.44 |    0.02 |    2320 B |       0.98  |
| Decode_LargeText      |   454.570 us |   6.8980 us |   6.4524 us |  0.80 |    0.02 |  286979 B |       1.00  |
| CountTokens_SmallText |    22.008 us |   0.1165 us |   0.0909 us |  0.28 |    0.00 |     184 B |       0.005 |
| CountTokens_LargeText | 4,231.353 us |  14.5157 us |  11.3329 us |  0.23 |    0.00 |     195 B |       0.000 |

.NET 6.0

| Method                | Mean          | Error       | StdDev      | Ratio | RatioSD | Allocated | Alloc Ratio |
|-----------------------|--------------:|------------:|------------:|------:|--------:|----------:|------------:|
| Encode_SmallText      |     36.370 us |   0.7178 us |   1.0962 us |  0.45 |    0.02 |   37344 B |       0.91  |
| Encode_LargeText      | 11,213.070 us | 219.6291 us | 269.7243 us |  0.59 |    0.02 | 5062574 B |       0.91  |
| Decode_SmallText      |      2.588 us |   0.0394 us |   0.0350 us |  0.70 |    0.02 |    2320 B |       0.98  |
| Decode_LargeText      |    489.467 us |   8.9195 us |   8.3433 us |  0.86 |    0.02 |  286985 B |       1.00  |
| CountTokens_SmallText |     34.758 us |   0.2027 us |   0.1896 us |  0.45 |    0.01 |   36832 B |       0.907 |
| CountTokens_LargeText | 11,252.083 us | 215.8912 us | 212.0340 us |  0.61 |    0.01 | 4907169 B |       0.907 |

.NET Framework 4.7.1

| Method                | Mean          | Error       | StdDev      | Ratio | RatioSD | Allocated | Alloc Ratio |
|-----------------------|--------------:|------------:|------------:|------:|--------:|----------:|------------:|
| Encode_SmallText      |     79.947 us |   1.5621 us |   3.0097 us |  1.00 |    0.00 |   41138 B |       1.00  |
| Encode_LargeText      | 18,961.252 us | 253.1816 us | 236.8262 us |  1.00 |    0.00 | 5567685 B |       1.00  |
| Decode_SmallText      |      3.723 us |   0.0728 us |   0.0997 us |  1.00 |    0.00 |    2375 B |       1.00  |
| Decode_LargeText      |    570.787 us |  11.0356 us |  11.8080 us |  1.00 |    0.00 |  287496 B |       1.00  |
| CountTokens_SmallText |     77.521 us |   1.0802 us |   0.9020 us |  1.00 |    0.00 |   40616 B |       1.000 |
| CountTokens_LargeText | 18,485.392 us | 313.5834 us | 277.9836 us |  1.00 |    0.00 | 5413237 B |       1.000 |

Contributions and Feedback

If you encounter any issues or have suggestions for improvements, please feel free to open an issue or submit a pull request on the project's repository.

We hope you find SharpToken useful for your projects and welcome any feedback you may have.