
<a href="https://sambanova.ai/"> <picture> <source media="(prefers-color-scheme: dark)" srcset="/img/SambaNova-light-logo-1.png" height="60"> <img alt="SambaNova logo" src="/img/SambaNova-dark-logo-1.png" height="60"> </picture> </a>

Generative data preparation

This software package allows you to prepare datasets for training generative LLMs on SambaStudio and SambaNova's Reconfigurable Data Units (RDUs). Some features include efficient multiprocessing, shuffling data that outsizes RAM, and specifying tokens to attend to during training.

The pipeline.py script streamlines the data preparation process. It takes a single input file, shuffles and splits it into train/dev/test files, then tokenizes the splits, packs them into sequences, and converts them to HDF5 format using the utilities in data_prep.py. The output directory contains multiple split HDF5 files that are needed to run data parallel training. This output directory can be used directly as a training dataset in SambaStudio. While this package features simple flows that work out of the box, it also supports extensive customization, allowing for many styles of packing variable-length text into tokenized sequences.

If you are an advanced user looking to process data with pre-defined splits, integrate with the package validation tools, or contribute, check out the Advanced Usage section below!

</br>

Table of contents

</br>

Requirements

</br>

Installation

git clone https://github.com/sambanova/generative_data_prep.git
cd generative_data_prep
pip install .
</br>

Getting Started

The following simple example will help you get started with your first processed dataset:

Example

python3 -m generative_data_prep pipeline --input_path=<PATH TO DATASET FILE> --output_path=<PATH TO OUTPUT DIRECTORY> --pretrained_tokenizer=openai-community/gpt2 --max_seq_length=1024 --input_packing_config='greedy::drop' --shuffle=on_RAM

Here are a few important parameters to know about when running this example:

<table> <thead> <tr> <th>Flag Name</th> <th>Type</th> <th width="50%">Description</th> <th width="50%">Instructions</th> </tr> </thead> <tbody> <tr> <td><code>input_path</code></td> <td>str</td> <td>An existing file path to the dataset to be processed, or directory of files. File must be in <code>.jsonl</code> or <code>.txt</code> format.</td> <td>Check out the <a href="#input">input</a> section for more details.</td> </tr> <tr> <td><code>output_path</code></td> <td>str</td> <td>A path to the desired output location for the directory of processed dataset files. If the path doesn't exist, a new directory will be created using the provided path.</td> <td>Check out the <a href="#output">output</a> section for more details.</td> </tr> <tr> <td><code>pretrained_tokenizer</code></td> <td>str</td> <td>The model specific tokenizer to use when tokenizing the input dataset.</td> <td>You can specify the tokenizer in two ways. The preferred method is to provide the directory path to the locally downloaded base checkpoint. The alternative method is to use the model ID from the Hugging Face model card, such as <code>"mistralai/Mistral-7B-v0.1"</code> for Mistral-7B-v0.1. If the model is gated on Hugging Face, you must request access and <a href="https://huggingface.co/docs/huggingface_hub/en/guides/cli#huggingface-cli-login">log in via the Hugging Face CLI</a> before executing the data preparation command.</td> </tr> <tr> <td><code>max_seq_length</code></td> <td>int</td> <td>The maximum sequence length (in tokens) that an RDU model training configuration can support.</td> <td>When launching the training job on SambaStudio, under "Hyperparameters and Settings," ensure that the max_seq_length value during training matches exactly with this input flag. Note that the available <code>max_seq_length</code> training configurations may not align with the model’s maximum sequence length on Hugging Face.</td> </tr> <tr> <td><code>input_packing_config</code></td> <td>str</td> <td> Defines the strategy used to pack the provided text data into fixed-length sequences. </td> <td> For pre-training, use <code>'full'</code>.<br><br> For fine-tuning:<br> &bull; <code>'greedy::truncate_right'</code> for efficient training with multiple data points per sequence<br> &bull; <code>'single::truncate_right'</code> for limited data with one data point per sequence<br><br> See <a href="#input_packing_config"><code>input_packing_config</code></a> for all options and details. </td> </tr> <tr> <td><code>shuffle</code></td> <td>str</td> <td>Determines whether to shuffle the input dataset, and whether to shuffle on RAM.</td> <td>There are 3 options for this flag: <code>'False'</code>, <code>'on_RAM'</code>, <code>'large_file'</code>. Check out the <a href="#shuffle"><code>shuffle</code></a> flag below for more details.</td> </tr> <tr> <td><code>apply_chat_template</code></td> <td>bool</td> <td>Whether to tokenize the data using <code>tokenizer.apply_chat_template</code>, adding chatML tags during tokenization (e.g., <code>&lt;user&gt;:</code> ... <code>&lt;assistant&gt;:</code>).</td> <td>This option is typically used for instruction tuning or fine-tuning chat models. To enable this flag, the tokenizer you are loading must have a chat template defined. You can verify this by checking the <code>tokenizer_config.json</code> file for a <code>chat_template</code> key.</td> </tr> </tbody> </table> </br>

Input

The input_path argument must be a file, or a directory containing one or more files. Each file must be in .txt or .jsonl format.

.jsonl Format

The JSON Lines format can be used for fine-tuning as well as pre-training/continual pre-training. Each line in a .jsonl file should be a JSON object with a prompt and a completion element. For example:

{"prompt": "What did the fox do?", "completion": "The quick brown fox jumped over the lazy dog."}
{"prompt": "How much wood does a woodchuck chuck?", "completion": "A woodchuck chucks 1000 wood."}
{"prompt": "Who sells seashells by the sea shore?", "completion": "She sells seashells by the sea shore."}

We also support lists of prompt/completion pairs within a .jsonl file. This guarantees that the prompt/completion pairs in the list will be placed contiguously in the same sequence. If the input prompt/completion pairs are placed on separate lines rather than in a list, they may be shuffled and appear in different training sequences. Your input file may include lines in both list format and regular prompt/completion pair format. Here's an example structure:

[{"prompt": "What's your favorite type of music?", "completion": "I love hip-hop"}, {"prompt": "That's cool. Who's your favorite rapper?", "completion": "I really like Kendrick Lamar"}]
[{"prompt": "What is your favorite type of dessert?", "completion": "My favorite dessert is cheesecake."}, {"prompt": "What is your favorite flavor of cheesecake?", "completion": "My favorite flavor of cheesecake is raspberry."}]
[{"prompt": "What is your favorite sport?", "completion": "My favorite sport is football."}, {"prompt": "Who is your favorite football player?", "completion": "My favorite football player is Tom Brady."}]

If the JSON objects in your .jsonl contain keywords other than prompt and completion, refer to the prompt_keyword and completion_keyword flags below.
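If you generate training data programmatically, a minimal sketch like the following (illustrative, not part of this package; the file name is a placeholder) writes records in the expected format, covering both single pairs and contiguous lists of pairs:

```python
import json

# Single prompt/completion pairs: each line is one JSON object.
pairs = [
    {"prompt": "What did the fox do?", "completion": "The quick brown fox jumped over the lazy dog."},
]

# A list of pairs written on one line keeps those pairs contiguous in the same sequence.
conversation = [
    {"prompt": "What's your favorite type of music?", "completion": "I love hip-hop"},
    {"prompt": "That's cool. Who's your favorite rapper?", "completion": "I really like Kendrick Lamar"},
]

with open("my_dataset.jsonl", "w", encoding="utf-8") as f:
    for record in pairs:
        f.write(json.dumps(record) + "\n")   # one JSON object per line
    f.write(json.dumps(conversation) + "\n")  # one JSON list per line
```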

.txt Format

This format should only be used for pre-training/continual pre-training, not fine-tuning. Additionally, even though .txt format is supported, we recommend that you still use the prompt/completion .jsonl format because it can handle newlines in the text. If you use .txt format, newlines within individual text articles will separate the text into different data points that may be shuffled and not placed into the same contiguous sequence.

The quick brown fox jumped over the lazy dog
I come from a land down under
SambaNova makes extremely good software and hardware that's fun to use

The above .txt input is equivalent to this .jsonl input:

{"prompt": "", "completion": "The quick brown fox jumped over the lazy dog"}
{"prompt": "", "completion": "I come from a land down under"}
{"prompt": "", "completion": "SambaNova makes extremely good software and hardware that's fun to use"}
</br>

Formatting data for Chat/Instruction/Fine Tuned Models

Many chat and instruct models require very specific formatting of multi-turn conversations for training and inference. The tokenizer.apply_chat_template function easily adapts your jsonl data to this format. To use this feature, prepare your data in jsonl format as specified above, then include the --apply_chat_template flag to automatically apply that formatting.

If your data is in the classic chat template format like [{"role": "user", "content": "..."}...], and you would like to convert it into the prompt completion format to be compatible with this repo, please use the generative_data_prep/utils/convert_chat_template_to_prompt_completion.py script.
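For reference, the conversion is conceptually a matter of pairing each user message with the assistant reply that follows it. The sketch below illustrates that idea only; it is not the bundled script, and it assumes strictly alternating user/assistant turns:

```python
import json

def chat_to_prompt_completion(messages):
    """Pair each user message with the assistant reply that follows it."""
    pairs = []
    for i in range(0, len(messages) - 1, 2):
        if messages[i]["role"] == "user" and messages[i + 1]["role"] == "assistant":
            pairs.append(
                {"prompt": messages[i]["content"], "completion": messages[i + 1]["content"]}
            )
    return pairs

chat = [
    {"role": "user", "content": "What is your favorite sport?"},
    {"role": "assistant", "content": "My favorite sport is football."},
]
print(json.dumps(chat_to_prompt_completion(chat)))
# [{"prompt": "What is your favorite sport?", "completion": "My favorite sport is football."}]
```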

Output

The output_path should be a directory that will contain all the tokenized HDF5 split files, along with a sub-directory called tokenizer. This output directory constitutes a processed dataset and can be used for training a model after uploading to SambaStudio. The tokenizer sub-directory will be transferred to any output checkpoints saved by SambaStudio so that the tokenizer can be used for inference later on.

Holdout Evaluation and Test Data

To evaluate on a holdout set of data during training, pipeline.py can create splits of holdout evaluation and test data.

To do this, choose only one of the two options below. Please review the Flags section for detailed descriptions of these flags.

  1. Specify the dev_ratio and test_ratio flags to hold out data by percentage.

OR

  2. Specify the num_dev_splits and num_test_splits flags (together with num_training_splits) to set the number of splits directly.

All of this evaluation data will be saved under <OUTPUT_DIR>. If you want to run evaluation on the eval splits during training, you must enable do_eval on SambaStudio. All test data will be saved under <OUTPUT_DIR>/test; it is left in jsonl text format because running evaluation or inference usually requires text inputs instead of tokenized inputs.
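For example, the following command uses the ratio-based option to hold out 5% of the data for evaluation and 5% for testing (the ratio values are illustrative; dev_ratio and test_ratio are described in the Flags section):

python3 -m generative_data_prep pipeline --input_path=<PATH TO DATASET FILE> --output_path=<PATH TO OUTPUT DIRECTORY> --pretrained_tokenizer=openai-community/gpt2 --max_seq_length=1024 --input_packing_config='greedy::drop' --shuffle=on_RAM --dev_ratio=0.05 --test_ratio=0.05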

View Decoded HDF5 Files

If you want to view the contents of a processed dataset, you can decode an HDF5 file into a human readable text format. To do so, run the following command:

python3 generative_data_prep/utils/decode_hdf5.py --pretrained_tokenizer=<HF TOKENIZER KEY> --hdf5_file_path=<PATH TO HDF5 FILE> --output_decoded_file_path=<PATH TO OUTPUT TXT FILE>

Note: The same tokenizer used to prepare the data must be used for decoding!

Dataset Size Requirements

  1. You need to ensure your dataset is large enough to run one batch of training.
  2. Make sure that the number of sequences in the output dataset files satisfy this by checking max_batch_size_train in the <OUTPUT_DIR>/metadata.yaml file.
  3. Ensure that the batch_size hyper-parameter is <= max_batch_size_train during training. To understand more, expand the details section below or see the FAQs.
<details>

How to Check and Set

When starting a training job, ensure that the batch_size hyper-parameter is no bigger than the max_batch_size_train shown in metadata.yaml.

For example:

$ cat <PROCESSED DATA DIRECTORY>/metadata.yaml

max_batch_size_dev: null
max_batch_size_train: 7
max_seq_length: 1024
number_of_dev_files: 0
number_of_test_files: 0
number_of_training_files: 32
token_type_ids: true
tokenizer_model_type: <class 'transformers.models.gpt2.configuration_gpt2.GPT2Config'>
vocab_size: 50257

Here you can see that max_batch_size_train is 7, so the batch size hyper-parameter cannot be greater than 7.
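If you prefer to automate this check, a minimal sketch along these lines reads metadata.yaml and compares it against the batch size you plan to use. This is not part of the package; it assumes PyYAML is installed, and the path is a placeholder:

```python
import yaml  # requires PyYAML

desired_batch_size = 7  # the batch_size you plan to set when launching training

# Point this at <PROCESSED DATA DIRECTORY>/metadata.yaml
with open("path/to/processed_dataset/metadata.yaml") as f:
    metadata = yaml.safe_load(f)

max_bs = metadata["max_batch_size_train"]
if desired_batch_size > max_bs:
    raise ValueError(
        f"batch_size {desired_batch_size} exceeds max_batch_size_train {max_bs}: "
        "add more data, switch to a 'single' packing config, or lower num_training_splits"
    )
```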

Explanation

With a sufficiently large dataset, you are generally fine with the defaults and can ignore this section. However, when the provided dataset is small (~1000 data points or less), you need to set the above values correctly or you will likely run into a training error.

The dataset that you are providing will be split up across multiple HDF5 files based on the input parameters of the pipeline command.

Based on the size and structure of the provided dataset and these parameter settings, a different max_batch_size_train will be shown in metadata.yaml, which dictates how large you can set the corresponding batch_size hyper-parameter when starting a model training job.

Note: Not all models trained in SambaStudio will expose the batch_size parameter. For those that don't, you should ensure your max_batch_size_train is larger than the default batch size (generally 16).

</details>

Additional Details

<details>

If you include the keep_split_jsonls flag, then the output_path will additionally contain a splits directory that saves the jsonl versions of the HDF5 files, meaning that splits/train_1_of_X.jsonl is the jsonl text version of train_1_of_X.hdf5.

The output HDF5 files each contain two datasets: the token IDs for each packed sequence, and the corresponding token type IDs that mark how each token (for example prompt, completion, or padding) should be treated during training.
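If you want to inspect the raw files directly rather than decoding them, a minimal sketch using h5py (not part of this package; the file name is illustrative) prints each dataset's name, shape, and dtype:

```python
import h5py  # pip install h5py

# Point this at any split file in your output directory.
with h5py.File("path/to/output_dir/train_1_of_32.hdf5", "r") as f:
    for name, dataset in f.items():
        print(name, dataset.shape, dataset.dtype)
```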

</details> </br>

Flags

This section outlines all the flags you can set to customize the data prep pipeline for your use case!

<table>
<thead>
<tr> <th>Flag Name</th> <th>Type</th> <th>Default</th> <th>Options</th> <th width="50%">Description</th> </tr>
</thead>
<tbody>
<tr> <td><code>input_path</code></td> <td>str</td> <td>REQUIRED</td> <td>Any existing file path</td> <td>Path to the input dataset file or directory of files, which must be in .jsonl or .txt format. If the dataset is in .jsonl format, it needs to conform to the structure specified in the Input section.</td> </tr>
<tr> <td><code>output_path</code></td> <td>str</td> <td>input_path's directory</td> <td>Any valid directory path</td> <td>The directory to store the output files.</td> </tr>
<tr> <td><code>log_file_path</code></td> <td>str</td> <td>output_path/logs.log</td> <td>Any valid file path</td> <td>The file to save the logs in. This will save the date and time, git commit hash, input arguments, and metrics associated with the dataset.</td> </tr>
<tr> <td><code>overwrite_output_path</code></td> <td>bool</td> <td>False</td> <td>Include flag for True, no arguments</td> <td>Permission to delete and overwrite files in output_path.</td> </tr>
<tr> <td><code>ignore_input_format_error</code></td> <td>bool</td> <td>False</td> <td>Include flag for True, no arguments</td> <td>Permission to skip misformatted lines in the input file. The number of skipped lines will be logged, and the skipped lines are stored in output_path/json_load_failed_lines.log. Warning: the skipped misformatted lines are dropped from the generated dataset.</td> </tr>
<tr> <td><code>pretrained_tokenizer</code> <span id="pretrained_tokenizer"></span></td> <td>str</td> <td>None</td> <td>Valid tokenizer key from Hugging Face</td> <td>The pretrained tokenizer to be used for tokenizing the input data. Loaded using transformers' AutoTokenizer.from_pretrained method.<br><br>You also have the option of loading a tokenizer from a local file path. This includes a saved model checkpoint where a tokenizer is saved along with the model.<br><br>Note: Llama models/tokenizers from Meta are gated. You can either use a non-gated version, or visit the Llama2 Model Card to request access.</td> </tr>
<tr> <td><code>special_tokens_dict</code></td> <td>str</td> <td>None</td> <td>String representation of JSON</td> <td>Any non-standard special tokens in JSON format to add to the tokenizer, e.g. "{'sep_token': "[SEP]"}". Additional tokens can also be added using the "additional_special_tokens" keyword. For example, indentation encoding can be added with "{'additional_special_tokens': ["\t", "\t\t", "\t\t\t"]}".</td> </tr>
<tr> <td><code>max_seq_length</code></td> <td>int</td> <td>2048</td> <td>Maximum sequence length of base checkpoint</td> <td>The maximum sequence length of the model you are using, measured in tokens. Different models use different tokenizers, which will impact the number of tokens a given sequence will be represented as. See pretrained_tokenizer above.<br><br>You can find this information in a few places. We recommend first looking at the specific model card within SambaStudio since it will have the most accurate information. In the event the SambaStudio model card has missing info, you can also find this value on the Hugging Face model card, under the "Files and Versions" tab, in the config.json file.</td> </tr>
<tr> <td><code>input_packing_config</code> <span id="input_packing_config"></span></td> <td>str</td> <td>'full'</td> <td>['full', 'single::truncate_left', 'single::truncate_right', 'single::drop', 'greedy::truncate_left', 'greedy::truncate_right', 'greedy::drop']</td> <td>The first argument in the packing config defines the method of placing text into sequences; the second argument defines how to handle jsonls that do not fit within the max_seq_length. 'full': defines the entire packing config; completely fill sequences with tokens, and as soon as a sequence is full, start packing into a new sequence. Article boundaries are ignored, so articles may be split across multiple sequences. 'greedy': fit as many articles as possible into a sequence while making sure no article is split across multiple sequences; fill the leftover space in each sequence with padding. 'single': each sequence contains only one article; fill the rest of the sequence with padding. 'drop': drop the entire article if any tokens overflow beyond the max sequence length. 'truncate_left': truncate the article from the left if any tokens overflow beyond the max sequence length. 'truncate_right': truncate the article from the right if any tokens overflow beyond the max sequence length.</td> </tr>
<tr> <td><code>packing_boundary</code></td> <td>str</td> <td>'jsonl'</td> <td>['jsonl', 'prompt_completion_pair']</td> <td>'jsonl': when packing text into sequences, keep json lines together. This means that for greedy or single packing, if the entire line does not fit in the sequence it will be thrown out. 'prompt_completion_pair': when packing text into sequences, keep prompt/completion pairs together, but json lines that contain a list of prompt/completion pairs may be broken up.</td> </tr>
<tr> <td><code>attention_boundary</code></td> <td>str</td> <td>'jsonl'</td> <td>['jsonl', 'prompt_completion_pair']</td> <td>The boundary to use when training with the --article_attention flag. If you choose prompt_completion_pair, tokens will only attend to tokens in the same prompt/completion pair. If you choose jsonl, tokens will attend to all the prompt/completion pairs in the jsonl.</td> </tr>
<tr> <td><code>prompt_keyword</code> <span id="prompt_keyword"></span></td> <td>str</td> <td>'prompt'</td> <td></td> <td>If your input json has a string keyword for prompt other than "prompt", place the keyword here. e.g. Input_json: {"source": ... "target": ...} -> --prompt_keyword='source'.</td> </tr>
<tr> <td><code>completion_keyword</code></td> <td>str</td> <td>'completion'</td> <td></td> <td>If your input json has a string keyword for completion other than "completion", place the keyword here. e.g. Input_json: {"source": ... "target": ...} -> --completion_keyword='target'.</td> </tr>
<tr> <td><code>apply_chat_template</code></td> <td>bool</td> <td>False</td> <td></td> <td>Whether to tokenize the data using tokenizer.apply_chat_template, adding the chatML tags during tokenization (e.g. &lt;user&gt;: ... &lt;assistant&gt;:). This should usually be used when instruction tuning or training chat models. The tokenizer you are loading must have a chat_template defined; you can check whether it is defined by looking for a chat_template key in the tokenizer_config.json file.</td> </tr>
<tr> <td><code>prompt_prefix</code></td> <td>str</td> <td>'None'</td> <td></td> <td>Text to add before the prompt, for chatML conventions (e.g. "&lt;human&gt;:").</td> </tr>
<tr> <td><code>prompt_postfix</code></td> <td>str</td> <td>'None'</td> <td></td> <td>Text to add after the prompt, for chatML conventions (e.g. "&lt;bot&gt;:").</td> </tr>
<tr> <td><code>disable_space_separator</code></td> <td>bool</td> <td>False</td> <td>Include flag for True, no arguments</td> <td>If you include this flag, NO spaces will be prepended to the completion. (If you do not add this flag, a space is added to every completion if it does not already have one.) Including this flag is dangerous and not recommended, because if you have input data like {"prompt": "hello." "completion": "how are you?"}, when the prompt and completion are combined it will look like "hello.how are you?", which will mess up the tokenization.</td> </tr>
<tr> <td><code>keep_prompt_only_sequences</code></td> <td>bool</td> <td>False</td> <td>Include flag for True, no arguments</td> <td>If you include this flag, packed sequences with only prompt tokens will not be dropped. Data with only prompts is dropped by default because training with prompt-only sequences with prompt_loss_weight=0.0 may lead to errors. Data is dropped when one of the following conditions is met: 1. the prompt/completion pairs in the input file contain only a prompt; 2. the sequence is truncated such that only prompt tokens remain.</td> </tr>
<tr> <td><code>categories_path</code></td> <td>str</td> <td>False</td> <td>Valid file path</td> <td>If you include this flag, the 'category' field from your input jsonls will be stored in the 'category_id' dataset in your output hdf5 files. This flag must point to the file path of a json file that contains a list of all the strings of the 'category' keys in your dataset.</td> </tr>
<tr> <td><code>shuffle</code> <span id="shuffle"></span></td> <td>str</td> <td>'False'</td> <td>['False', 'on_RAM', 'large_file']</td> <td>Choose the on_RAM option if your file is small enough to fit in RAM (if you are not sure whether it fits in RAM, you can probably use this option). If you are running a Linux operating system and your file is too large to fit in RAM, choose the large_file option; this runs approximate file shuffling that can handle files of any size. If you want to do large file shuffling but are not on Linux, please shuffle the file before using this script. If the input file should not be shuffled, do not include this flag; it defaults to False.</td> </tr>
<tr> <td><code>num_training_splits</code></td> <td>int</td> <td>32 if input_file_size &lt; 10GB, 128 if 10GB &lt; input_file_size &lt; 100GB, 256 if 100GB &lt; input_file_size</td> <td></td> <td>The number of training files to split the input data into. We recommend that you do not include this flag and allow it to default. If you do not default this flag, you have two options. Option 1: specify this flag with the dev_ratio and test_ratio flags; the total number of splits will be (num_training_splits / (1 - dev_ratio - test_ratio)), and the number of dev and test splits is calculated accordingly. Option 2: specify this flag with the num_dev_splits and num_test_splits flags, which define the number of splits directly. NOTE: the number of training splits must be greater than the number of training workers you have, and we recommend that the number of splits is a multiple of the number of workers.</td> </tr>
<tr> <td><code>dev_ratio</code></td> <td>float</td> <td>0.0</td> <td>[0 - 1]</td> <td>The ratio of data that should be excluded from the train set and used for evaluation; defaults to 0%. If you specify this flag, do not specify num_dev_splits or num_test_splits.</td> </tr>
<tr> <td><code>test_ratio</code></td> <td>float</td> <td>0.0</td> <td>[0 - 1]</td> <td>The ratio of data that should be excluded from the train set and saved for testing. This data is not tokenized and is left in jsonline format; defaults to 0%. If you specify this flag, do not specify num_dev_splits or num_test_splits.</td> </tr>
<tr> <td><code>num_dev_splits</code></td> <td>int</td> <td>None</td> <td>Any int</td> <td>Number of dev (eval) splits. If you do not specify dev_ratio, you may specify this flag. If you include this flag, you must also include the num_test_splits and num_training_splits flags.</td> </tr>
<tr> <td><code>num_test_splits</code></td> <td>int</td> <td>None</td> <td>Any int</td> <td>Number of test splits. If you do not specify test_ratio, you may specify num_test_splits. If you include this flag, you must also include the num_dev_splits and num_training_splits flags.</td> </tr>
<tr> <td><code>do_not_balance_hdf5</code></td> <td>bool</td> <td>False</td> <td>Include flag for True, no arguments</td> <td>Include this flag if you DO NOT want to balance the HDF5 files. This is not recommended unless you are dealing with a huge amount of data (many terabytes), or do not want shuffling between splits.</td> </tr>
<tr> <td><code>keep_split_jsonls</code></td> <td>bool</td> <td>False</td> <td>Include flag for True, no arguments</td> <td>Include this flag if you DO NOT want to delete the split jsonl files that are in text format in the output_path/splits directory. The only reason to include this flag is if you want to see what text is in each HDF5 file, meaning that splits/train_1_of_X.jsonl is the jsonl text version of train_1_of_X.hdf5. Including this flag will increase the storage space of your dataset by more than two times.</td> </tr>
<tr> <td><code>num_workers</code></td> <td>int</td> <td>False</td> <td>0 &lt;= num_workers &lt;= number of available CPUs</td> <td>The number of CPU workers to run tokenization with. If the previous run failed due to OOM, you need to decrease this number.</td> </tr>
</tbody>
</table>
</br>

Examples

Fine-tuning

Fine-tuning (also known as "generative tuning") is a technique used to adapt a pre-trained language model to perform better at a specific task. This approach typically involves training the model on input data that is structured as a "prompt" followed by a "completion". The prompt represents the input for a specific task, while the completion is the output that the model should generate. During training, the model learns to generate the relevant completion tokens based on the context provided by the prompt tokens.

The benefit of using this training format is that the model can learn to generate high-quality outputs for a specific task without requiring a large amount of task-specific training data. By leveraging the pre-trained language model's knowledge gained from being trained on a large corpus of text data, the fine-tuned model can quickly adapt to the new task and generate high-quality outputs with minimal training data.

When training on this kind of data using SambaStudio, set prompt_loss_weight=0.0. This ensures that the model does not learn to generate the prompt tokens and only learns to generate the completion tokens.

Example data

For fine-tuning, your data should be in .jsonl format with prompts and completions designed for the task you're adapting to.

Jsonlines with a prompt and completion

Example command

python3 -m generative_data_prep pipeline --input_path=./tests/examples/generative_tuning/example_generative_tuning_data.jsonl --output_path=./tests/examples/generative_tuning/pipelined_generative_tuning --pretrained_tokenizer=openai-community/gpt2 --max_seq_length=1024 --shuffle=on_RAM --input_packing_config=single::drop

View decoded output

Pre-training

Pre-training on unstructured data enables large language models to learn general language patterns and structures that are useful for a wide range of downstream tasks. In order to prepare pre-training data, you need a large amount of unstructured text data. To prepare pre-training data, use the flag --input_packing_config=full.

Example data

For pre-training, your data can be in one of two formats:

  1. Text separated by newlines.
  2. Jsonlines with empty prompts and the text in the completions.

We recommend using jsonlines with empty prompts and all the text in the completion, so that newlines in the text do not separate semantically related articles.
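For instance, if your corpus is a folder of plain-text documents, a small sketch like the following (illustrative; the file and directory names are placeholders) writes each document as a single jsonl line with an empty prompt, so newlines inside a document stay within one article:

```python
import json
from pathlib import Path

with open("pretraining_data.jsonl", "w", encoding="utf-8") as out:
    for doc_path in sorted(Path("my_text_documents").glob("*.txt")):
        text = doc_path.read_text(encoding="utf-8")
        # json.dumps escapes internal newlines, so each document stays on one line.
        out.write(json.dumps({"prompt": "", "completion": text}) + "\n")
```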

Example command

python3 -m generative_data_prep pipeline --input_path=./tests/examples/pretraining/example_pretraining_data.jsonl --output_path=./tests/examples/pretraining/pipelined_pretraining --pretrained_tokenizer=openai-community/gpt2 --max_seq_length=1024 --shuffle=on_RAM --input_packing_config=full

View decoded output

Dialogue

Dialogue data often involves multiple turns in a conversation between a user and an agent. In order to train on this data, the entire conversation needs to be in the same sequence of tokens, and the model should only learn to generate the agent's responses based on the user's inputs. To prepare data like this, create a list of prompt completion pairs. If you train with packing_boundary=jsonl and either input_packing_config=greedy::truncate_right or input_packing_config=single::truncate_right, then these conversations are guaranteed to appear in the provided order within the same sequence. Additionally, if you include the prompt_loss_weight=0.0 option while training on SambaStudio, only the completions will be learned. For training dialogue in chat-ML style, you can also set prompt_prefix and prompt_postfix.

Example data

Lists of prompt completion pairs that represent turns in a conversation

Example command

python3 -m generative_data_prep pipeline --input_path=./tests/examples/dialogue/example_dialogue_data.jsonl --output_path=./tests/examples/dialogue/pipelined_dialogue --pretrained_tokenizer=openai-community/gpt2 --max_seq_length=1024 --shuffle=on_RAM --input_packing_config=single::truncate_right

View decoded output

Meta in context learning

Meta in-context learning improves the few-shot performance of a model by including training data formatted in a few-shot style. This infrastructure allows you to prepare data in a variant of meta in-context learning that SambaNova calls "All Shot" learning. To prepare data in this format, create lists of prompt completion pairs, where every list contains prompt completion pairs that complete the same instruction/task. Then prepare the data with input_packing_config=greedy::drop, packing_boundary=prompt_completion_pair, and attention_boundary=jsonl. This ensures that every sequence contains prompt completion pairs following the same "instruction", and that when learning a completion the model attends to all the other prompt completion pairs before it.

Example data

Lists of prompt completion pairs that are all from the same task

Example command

python3 -m generative_data_prep pipeline --input_path=./tests/examples/metaICL/example_metaICL_data.jsonl --output_path=./tests/examples/metaICL/pipelined_metaICL --pretrained_tokenizer=openai-community/gpt2 --max_seq_length=1024 --shuffle=on_RAM --input_packing_config=greedy::drop --packing_boundary=prompt_completion_pair --attention_boundary=jsonl

View decoded output

</br>

Understanding Command Outputs

Terminal Output

The metrics associated with this dataset will be printed in the terminal as well as being logged at <OUTPUT DIR PATH>/logs.log. These metrics give some insight into how the data was packed into sequences, and information about the training dataset.

<table> <thead> <tr> <th width="15%">Metric Name</th> <th width="42.5%">Definition</th> <th width="42.5%">How to Interpret?</th> </tr> </thead> <tbody>
<tr> <td>Articles</td> <td>The number of lines in the input dataset.</td> <td>How many text documents are in the input dataset.</td> </tr>
<tr> <td>Dataset Tokens</td> <td>Number of tokens in the output hdf5 dataset.</td> <td>How many tokens are in the training dataset. This includes both prompt tokens and padding tokens, so this metric does not necessarily show how many tokens will be learned by the model.</td> </tr>
<tr> <td>Prompt Tokens</td> <td>Number of prompt tokens in the output hdf5 dataset.</td> <td>&lt;-</td> </tr>
<tr> <td>Completion Tokens</td> <td>Number of completion tokens in the output hdf5 dataset.</td> <td>&lt;-</td> </tr>
<tr> <td>Padding Tokens</td> <td>Number of padding tokens in the output hdf5 dataset.</td> <td>&lt;-</td> </tr>
<tr> <td>Average Completion Length</td> <td>Number of completion tokens divided by number of input articles.</td> <td>The length of the average completion in the dataset.</td> </tr>
<tr> <td>Average Prompt Length</td> <td>Number of prompt tokens divided by number of input articles.</td> <td>The length of the average prompt in the dataset.</td> </tr>
<tr> <td>Data Utilization</td> <td>Percent of non-padding tokens in the output HDF5 dataset divided by number of tokens in the input dataset.</td> <td>This metric reveals how much of the input data makes it to the output dataset. If this percent is much less than 100%, a lot of the input data will not be trained on. Refer to the "Dropped From Packing" or "Dropped From All Prompt" metrics to see why this is happening.</td> </tr>
<tr> <td>Dropped From Packing</td> <td>Number of tokens dropped during packing, divided by number of tokens in the input dataset.</td> <td>The percent of tokens that are dropped because they do not fit into the sequence length and the <code>input_packing_config</code> does not allow them to be overflowed.</td> </tr>
<tr> <td>Dropped From All Prompt</td> <td>Number of tokens dropped because all the tokens in a sequence are prompt tokens, divided by the number of tokens in the input dataset.</td> <td>Sequences that are all prompts or padding (no completion tokens) are dropped. This is because the model will not learn anything from these sequences and the loss will be 0, which may cause errors.</td> </tr>
<tr> <td>Sequence Utilization</td> <td>Average number of non-padding tokens in a sequence divided by sequence length.</td> <td>The percent of tokens in each sequence that are actually used for training. This number can be changed by using a different <code>input_packing_config</code>. The packing styles from highest sequence utilization to lowest are: <code>full</code>, <code>greedy::truncate_left</code> (or truncate_right), <code>greedy::drop</code>, <code>single::truncate_left</code> (or truncate_right), <code>single::drop</code>.</td> </tr>
<tr> <td>Seq Completion Utilization</td> <td>Average number of completion tokens in a sequence divided by sequence length.</td> <td>The percent of tokens in a sequence that are learned.</td> </tr>
</tbody> </table>

Metadata Output File

To help improve speed and cross-checking, we provide a metadata file along with the dataset. This file is located directly under the output_dir as metadata.yaml. It is used together with a custom pydantic model, which you can import from this library, to verify the dataset parameters against the training parameters. This can be used as a way to catch bugs before training begins.

max_seq_length: int
token_type_ids: bool
vocab_size: int
tokenizer_model_type: str
number_of_training_files: int
number_of_dev_files: int
number_of_test_files: int
max_batch_size_train: int
max_batch_size_dev: Optional[int]
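If you want to sanity-check the file yourself, a stand-alone sketch mirroring the fields above might look like the following. The DatasetMetadata class here is an illustrative stand-in, not the library's actual pydantic model or import path, and the file path and the max_seq_length value being checked are placeholders:

```python
from typing import Optional

import yaml  # requires PyYAML
from pydantic import BaseModel


class DatasetMetadata(BaseModel):
    """Illustrative stand-in for the library's own pydantic model."""

    max_seq_length: int
    token_type_ids: bool
    vocab_size: int
    tokenizer_model_type: str
    number_of_training_files: int
    number_of_dev_files: int
    number_of_test_files: int
    max_batch_size_train: int
    max_batch_size_dev: Optional[int]


with open("path/to/output_dir/metadata.yaml") as f:
    metadata = DatasetMetadata(**yaml.safe_load(f))

# Example cross-check against the training job's settings.
assert metadata.max_seq_length == 1024, "max_seq_length must match the training configuration"
```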

NOTE:

FAQs

"Cannot access gated repo" error

If you pass in a --pretrained_tokenizer for a model tokenizer that is gated on Hugging Face, you need to get access to the model by going to the model card and requesting access, then follow this documentation to generate a Hugging Face API token, and finally log in with the Hugging Face CLI.

If you have the model checkpoint downloaded locally you can also pass in the path to the model checkpoint as the --pretrained_tokenizer!
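For reference, logging in from the terminal looks like the following; you will be prompted to paste the access token you generated:

huggingface-cli login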

Number of samples in each file must be greater than or equal to batch size

This error will occur if you try to run training with a batch size that is greater than the maximum batch size of the prepared dataset. The maximum batch size is printed in the terminal as "Batch size <=..." and also logged in the logs.log file in the output directory.

To fix this, you can do one of the following:

  1. Increase the amount of input data you use.
  2. Change to a "single" input packing configuration like single::truncate_right, which will not pack sequences with multiple data points and will therefore create more training sequences. However, this may make training less efficient because a lot of the available sequence length is wasted on padding tokens.
  3. Decrease the num_training_splits so that each split has more data. Keep in mind, however, that you must have more training splits than the number of parallel RDUs you use to train.

Advanced Usage

The following are some advanced usage patterns that may be applicable to you. Follow the links for more information: