Awesome
Tokenizer
This repo contains Typescript and C# implementation of byte pair encoding(BPE) tokenizer for OpenAI LLMs, it's based on open sourced rust implementation in the OpenAI tiktoken. Both implementation are valuable to run prompt tokenization in Nodejs and .NET environment before feeding prompt into a LLM.
Typescript implementation
Please follow README.
C# implementation
[!IMPORTANT] Users of
Microsoft.DeepDev.TokenizerLib
should migrate toMicrosoft.ML.Tokenizers
. The functionality inMicrosoft.DeepDev.TokenizerLib
has been added toMicrosoft.ML.Tokenizers
.Microsoft.ML.Tokenizers
is a tokenizer library being developed by the .NET team and going forward, the central place for tokenizer development in .NET. By usingMicrosoft.ML.Tokenizers
, you should see improved performance over existing tokenizer library implementations, includingMicrosoft.DeepDev.TokenizerLib
. A stable release ofMicrosoft.ML.Tokenizers
is expected alongside the .NET 9.0 release (November 2024). Instructions for migration can be found at https://github.com/dotnet/machinelearning/blob/main/docs/code/microsoft-ml-tokenizers-migration-guide.md.
Contributing
We welcome contributions. Please follow this guideline.
Trademarks
This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos are subject to those third-party's policies.