# MultiTok: Variable-Length Tokenization for Efficient LLMs Adapted from LZW Compression
MultiTok provides a novel variable-length tokenizer in which each token can represent a variable number of sub-words. The advantages of MultiTok are that it (i) dynamically compresses the necessary training data by close to 33%, (ii) allows the LLM to train close to three times faster, and (iii) maintains performance comparable to the current state-of-the-art BERT tokenizer. Specifically, MultiTok compresses repetitive words or phrases within the training data without significantly harming model performance. We hope that MultiTok can mark the beginning of using information-theoretic approaches to provide efficient, secure, and robust LLM systems.
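To give intuition for how MultiTok compresses repeated phrases, below is a minimal sketch of LZW-style variable-length tokenization over words. This illustrates the general idea only, not the exact algorithm in `multitok.py`; the function name `lzw_tokenize` and its interface are our own.

```python
# Minimal sketch of LZW-style variable-length tokenization (illustrative only).
def lzw_tokenize(words):
    """Greedily map repeated multi-word phrases to single token ids."""
    table = {}  # phrase (tuple of words) -> token id

    def token_id(phrase):
        if phrase not in table:
            table[phrase] = len(table)
        return table[phrase]

    tokens = []
    i = 0
    while i < len(words):
        # Extend the current phrase while the longer phrase is already known.
        phrase = (words[i],)
        j = i + 1
        while j < len(words) and phrase + (words[j],) in table:
            phrase = phrase + (words[j],)
            j += 1
        tokens.append(token_id(phrase))
        # Learn the one-word-longer phrase for future occurrences (LZW step).
        if j < len(words):
            token_id(phrase + (words[j],))
        i = j
    return tokens, table

words = "the cat sat on the mat and the cat sat again".split()
tokens, table = lzw_tokenize(words)
print(len(words), "->", len(tokens))  # repeated "the cat" collapses to one id
```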
This is an implementation of our paper "MultiTok: Variable-Length Tokenization for Efficient LLMs Adapted from LZW Compression", submitted to ICASSP 2025. The full paper can be accessed at https://arxiv.org/abs/2410.21548.
```
.
├── README.md
├── experiments
│   ├── bert.py
│   ├── bert_multitok.py
│   ├── main.py
│   ├── model.py
│   ├── multitok.py
│   ├── multitok_freq.py
│   └── random.py
├── misc
│   └── pos_emb.py
└── requirements.txt
```
## Installation
Clone the repository:
```bash
git clone https://github.com/noelkelias/multitok.git
```
Install the requirements:
```bash
cd multitok
pip install -r requirements.txt
```
## Datasets
Our experiments focus on three mainstream text classification datasets:

| Name | Description |
|---|---|
| IMDB | The IMDb Movie Reviews dataset is a binary sentiment analysis dataset consisting of 50,000 reviews from the Internet Movie Database (IMDb), each labeled as positive or negative. |
| SST-2 | The Stanford Sentiment Treebank (SST-2) comprises sentences from movie reviews, annotated for binary sentiment (positive/negative). |
| AG-News | AG News is a collection of news articles annotated for their topic across four classes (World, Sports, Business, Sci/Tech). |
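The repository's experiment scripts handle data loading internally; as a hedged illustration (an assumption about the setup, not necessarily how the scripts obtain the data), all three datasets are also available through the Hugging Face `datasets` library:

```python
# Hypothetical loading of the three datasets via Hugging Face `datasets`;
# the repository's scripts may obtain them differently.
from datasets import load_dataset

imdb = load_dataset("imdb")          # 50,000 movie reviews, binary sentiment
sst2 = load_dataset("glue", "sst2")  # SST-2 sentences, binary sentiment
ag_news = load_dataset("ag_news")    # news articles, four topic classes

print(imdb["train"][0]["label"], imdb["train"][0]["text"][:60])
```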
## Usage
The experiments can be found in the `experiments` folder. Specifically, we point users to the MultiTok tokenization implementation found in `multitok.py`. We demonstrate applying MultiTok tokenization on top of BERT tokens in `bert_multitok.py`. Additionally, we include a frequency analysis component in `multitok_freq.py` that can optionally be added to MultiTok for improved performance. Code for a few sample end-to-end tests, which train a basic model with these tokenization schemes, is found in `main.py`.
We encourage users to modify the different parameters and experiment with varying datasets to utilize MultiTok in their own pipelines.
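As a hypothetical end-to-end sketch (the actual function names and interfaces in `multitok.py` and `bert_multitok.py` may differ), one could apply LZW-style compression on top of BERT sub-word tokens, reusing the `lzw_tokenize` sketch from above:

```python
# Hypothetical pipeline: BERT sub-word tokens -> LZW-style multi-word tokens.
# `lzw_tokenize` is the illustrative sketch above, not the repository's API.
from transformers import BertTokenizerFast

bert = BertTokenizerFast.from_pretrained("bert-base-uncased")
subwords = bert.tokenize("the movie was great and the movie was long")
tokens, table = lzw_tokenize(subwords)  # repeated sub-word runs collapse
print(len(subwords), "->", len(tokens), "tokens")
```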
## Citation
```bibtex
@unpublished{elias24,
  author = {Noel Elias and Homa Esfahanizadeh and H. Kaan Kale and Sriram Vishwanath and Muriel Medard},
  title  = {MultiTok: Variable-Length Tokenization for Efficient LLMs Adapted from LZW Compression},
  note   = {Manuscript submitted for publication},
  year   = {2024}
}
```