Home

Awesome

PROJECT NOT UNDER ACTIVE MANAGEMENT

This project will no longer be maintained by Intel.

Intel has ceased development and contributions including, but not limited to, maintenance, bug fixes, new releases, or updates, to this project.

Intel no longer accepts patches to this project.

If you have an ongoing need to use this project, are interested in independently developing it, or would like to maintain patches for the open source software community, please create your own fork of this project.

Contact: webadmin@linux.intel.com

Text Data Generation

Introduction

This reference kit highlights how the Intel® AI Tools can be used to generate a Machine Learning (ML) application from PyTorch-based Large Language Models (LLMs). It generates domain specific synthetic news headlines using a Generative Pre-trained Transformer 2 (GPT2) fine-tuned model.

Check out more workflow examples in the Developer Catalog.

Solution Technical Overview

Synthetic data is information that's artificially generated rather than produced by real-world events. Businesses can benefit from synthetic data for privacy issues, faster product testing turnaround and training machine learning algorithms. Most data privacy rules limit how organizations handle sensitive data. Any leaking or disclosure of personally identifiable consumer information can result in costly litigation that harms the brand's reputation. As a result, avoiding privacy issues is the primary reason why businesses engage in synthetic data-generating technologies.

Data is frequently unavailable for completely new goods. Furthermore, human-annotated data is an expensive and time-consuming procedure. This may be avoided if businesses invest in synthetic data, which can be created quickly and aid in the development of solid machine learning models. The text generation model will be built by using a pre-trained GPT2 model provided by the HuggingFace* transformers package and implemented in PyTorch*. To focus the model on the task of generating similar headlines, it will be fine-tuned on a headline dataset specified below. Once the model is fine-tuned, it will be used for generating new pieces of data that look and feel similar to the provided dataset without being exactly the same.

There are many ways of building a text generation system. This implementation will follow a state-of-the-art approach using a pre-trained language model for text generation, and fine-tuning it with a news headline dataset using the Causal Language Model formulation. This will refine the model to generate similar text compared to the source dataset. The chosen pre-trained model is gpt2-medium provided by the HuggingFace* transformers project, though many other models can be used.

For larger and more powerful systems, more modern, but expensive pre-trained Large Language Models can be used to improve performance. For example, gpt-j-6B is a 6B parameter open-source implementation of a smaller GPT3 model, trained on a publicly available dataset. However, use of this model requires at least 48GB of CPU RAM to run the model in full precision.

Intel® Extension for PyTorch* and Intel® Neural Compressor are used to optimize this pipeline for better performance:

Solution Technical Details

Dataset

The dataset used in this reference kit is taken from Kaggle.

It consists of 1,048,576 rows of news headlines sourced from the reputable Australian news source ABC (Australian Broadcasting Corporation), where each row in the dataset represents a news headline with fields:

For the task of text generation publish_date is ignored and the headline_text field is used to train each of the model to generate novel news headlines similar to the source text.

For development, 5000 samples from the million news headlines dataset are considered since they represent the majority of the context information in the current dataset and also because of the computing effort associated with the experiment.

Use caseAI Unstructured Synthetic Text Data Generation
SizeTotal 1.04 Million rows of headlines, Initial 5000 rows used for development
Sourcehttps://www.kaggle.com/datasets/therohk/million-headlines

Note: See the dataset applicable license for terms and conditions. Intel® Corporation does not own the rights to this dataset and does not confer any rights to it.

Validated Hardware Details

Intel® oneAPI is used to achieve quick results even when the data for a model is huge. It provides the capability to reuse the code present in different languages so that the hardware utilization is optimized to provide these results.

Recommended HardwarePrecision
CPU: Intel® 2nd Gen Xeon® Platinum 8280 CPU @ 2.70GHz or higherFP32, BF16, INT8
RAM: 187 GB
Recommended Free Disk Space: 20 GB or more

Code was tested on Ubuntu* 22.04 LTS.

How it Works

Use Case End-To-End Workflow

Use_case_flow

The workflow pipeline follows these steps:

  1. Use the pre-trained tokenizers for gpt2-medium to do tokenization.
  2. Perform fine-tuning training on the pre-trained gpt2-medium model using our dataset and the Causal Language Modeling objective.
  3. Generate new headlines based on either a seed text or from scratch using the fine-tuned model.

The tokenization step for data-preprocessing is automatically done within the provided scripts and utilities using the transformers package.

Expected Input-Output

InputOutput
Optional ContextUnstructured text headline generated ("Alp Claims 10 point lead after victory over London Greens.")

Get Started

Start by defining an environment variable that will store the workspace path, this can be an existing directory or one to be created in further steps. This ENVVAR will be used for all the commands executed using absolute paths.

export WORKSPACE=$PWD/text-data-generation

Define DATA_DIR, CONFIG_DIR and OUTPUT_DIR as follows:

export DATA_DIR=$WORKSPACE/data
export OUTPUT_DIR=$WORKSPACE/output
export CONFIG_DIR=$WORKSPACE/configs

Download the Workflow Repository

Create a working directory for the workflow and clone the Text Data Generation repository into your working directory.

mkdir -p $WORKSPACE && cd $WORKSPACE
git clone https://www.github.com/oneapi-src/text-data-generation.git $WORKSPACE
mkdir -p $DATA_DIR $OUTPUT_DIR/models

Set Up Conda

To learn more, please visit install anaconda on Linux.

wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh
rm Miniconda3-latest-Linux-x86_64.sh

Set Up Environment

The conda yaml dependencies are kept in $WORKSPACE/env/intel_env.yml.

Packages required in YAML file:Version:
python3.9
intel-extension-for-pytorch2.0.100
neural-compressor2.3.1
numpy1.24.3
pandas1.5.3
kaggle1.5.16
pip23.3.1
datasets2.16.0
accelerate0.25.0
transformers4.26.0
optimum[onnxruntime]1.6.4
onnxruntime1.16.3

Follow the next steps for Intel® Python* Distribution setup inside conda environment:

# If you have conda 23.10.0 or greater you can skip the following two lines
# since libmamba is already set as the default solver.
conda install -n base conda-libmamba-solver -y
conda config --set solver libmamba
conda env create -f $WORKSPACE/env/intel_env.yml

Environment setup is required only once. Make sure no conda environment exists with the same name since this step does not cleanup/overwrite the existing environment. During this setup a new conda environment will be created with the dependencies listed in the YAML file.

Once the appropriate environment is created it has to be activated using the conda command as given below:

conda activate text_generation_intel

Download the Datasets

A Kaggle* account is necessary to use the Kaggle* CLI. Instructions can be found at Kaggle* api website.

Within this process, an API Token File will be created and as consequence, a json file named kaggle.json will be downloaded. That json file should be stored in a .kaggle folder that should be created by the user (usually in the home folder).

If you are behind a proxy, the kaggle.json file can be modified to add it. An example is shown as follows:

{"username":"your_user","key":"your_key","proxy":"your_proxy"}

Where your_user and your_key were previously generated by Kaggle*. You should replace your_proxy with you proxy ip address.

To setup the data for benchmarking under these requirements, run the following set of commands:

cd $DATA_DIR
kaggle datasets download -d therohk/million-headlines
unzip million-headlines.zip
rm million-headlines.zip

The following transformations are automatically done when using the finetune_model.py script below:

  1. To each headline, the phrase "|HEADLINE| " is prepended. This allows for more direct generation of headlines without the need for of an explicit seed text. This introduction of baseline text which informs the language model with more specifics about what to generate is known as prompting.
  2. To each headline, we attach the "." token to allow for the model to learn to naturally generate endings for valid headlines.

Supported Runtime Environment

You can execute the references pipelines using the following environments:


Run Using Bare Metal

Set Up System Software

Our examples use the conda package and environment on your local computer. If you don't already have conda installed or the conda environment created, go to Set Up Conda* or see the Conda* Linux installation instructions.

Apply patch for XPU

If the user wants to use GPU, the following steps must be followed:

  1. Follow the instructions in Set Up Environment but use intel_env_xpu.yml instead of intel_env.yml.
  2. Activate conda environment:
conda activate text_generation_xpu_intel
  1. Apply patch:
cd $WORKSPACE/src
python apply_xpu_patch.py
  1. Follow the instructions described in Run Workflow.

  2. Once the user wants to remove the environment run:

conda activate base
conda env remove -n text_generation_xpu_intel

Run Workflow

Setting correct locations inside config files

Before running the scripts, the paths to the prompt file and models need to be set inside the config files. It can be set manually by configuring the different variables inside $CONFIG_DIR/config_base.yml, $CONFIG_DIR/config_finetuned_inc.yml and $CONFIG_DIR/config_finetuned_intel.yml (see Configuration Parameters). For this reference kit use the following commands to do it automatically.

sed -i "s|prompt_file:.*|prompt_file: $CONFIG_DIR/prompt.csv|g" $CONFIG_DIR/config_base.yml
sed -i -e "s|prompt_file:.*|prompt_file: $CONFIG_DIR/prompt.csv|g" -e "s|path:.*|path: $OUTPUT_DIR/models/gpt2-medium-finetuned-inc-onnx|g" $CONFIG_DIR/config_finetuned_inc.yml
sed -i -e "s|prompt_file:.*|prompt_file: $CONFIG_DIR/prompt.csv|g" -e "s|path:.*|path: $OUTPUT_DIR/models/gpt2-medium-finetuned-intel-onnx|g" $CONFIG_DIR/config_finetuned_intel.yml
Configuration Parameters

Within the yaml configuration files, the following optional arguments can be specified:

prompt_file: A prompt file can be provided using the prompt_file argument to generate headlines that start with a certain phrase. For example, the file prompt.csv, which is included in this repo for demo purposes, can be

Community
Farmers

which will ask the script to generate 2 sets of headlines, one starting with "Community" and the other starting with "Farmers". Not providing a prompt file will generate headlines starting with anything.

min_length: The min_length argument will force the model to output a body of text that has at least min_length number of tokens, excluding the ".". However, based on the training data, this can lead to non-sensical generation. For example, for our model and data, headlines are rather short, so after a certain length, the model generates nonsense. This can be post-processed at the users discretion.

max_length_buffer: The max_length_buffer argument limits the max length of the generated text to be max_length = min_length + max_length_buffer.

Training the model

The finetune_model.py script can be used to fine-tune a pre-trained transformers model using the prepared dataset with the aforementioned prompt. This trains the model using the Causal Language Modeling objective which is a key component for the model to naturally learn to generate text. The following command downloads a pre-trained gpt2-medium model from the transformers hub and fine-tunes it on the dataset. The optional --bfloat16 flag reduces the precision to bfloat16 (check if CPU is capable of using bfloat16 data type).

usage: finetune_model.py [-h] --model_config MODEL_CONFIG --data_path DATA_PATH --save_path SAVE_PATH [--num_epochs NUM_EPOCHS] [--lr LR] [--bfloat16]

optional arguments:
  -h, --help            show this help message and exit
  --model_config MODEL_CONFIG
                        yaml configuration file for model.
  --data_path DATA_PATH
                        path to the "ABC million news headlines" csv.
  --save_path SAVE_PATH
                        path to save the model.
  --num_epochs NUM_EPOCHS
                        number of epochs to train the model. Defaults to 3.
  --lr LR               learning rate for training. Defaults to 5e-5.
  --bfloat16            use bfloat16 for training. Defaults to False.

The command to fine-tune the model is:

ipexrun --use_logical_core --enable_tcmalloc $WORKSPACE/src/finetune_model.py --model_config $CONFIG_DIR/config_base.yml --data_path $DATA_DIR/abcnews-date-text.csv --save_path $OUTPUT_DIR/models/gpt2-medium-finetuned-intel

Converting the saved model to ONNX to optimize runtime

Before this pre-trained model is used to generate text, we will first use the optimum package and command line tool to convert the format of the saved model to ONNX format. This allows it to be used by multiple accelerators and generally speeds up execution time. The below command can be used to convert the saved model above to an ONNX format to be used in the generation script with the provided $CONFIG_DIR/config_finetuned_intel.yml model config. Along with accelerating execution time, the ONNX format also enables full use of the next quantization step from Intel® Neural Compressor.

optimum-cli export onnx --model $OUTPUT_DIR/models/gpt2-medium-finetuned-intel --task causal-lm $OUTPUT_DIR/models/gpt2-medium-finetuned-intel-onnx

Note: When running the ONNX conversion using optimum, the following error may be displayed during the validation step

Validating ONNX model...
An error occurred with the error message: 'tuple' object has no attribute 'items'.
The exported model was saved at: saved_models/gpt2-medium-finetuned-stock-onnx

If this occurs, please proceed as normal. Although unable to be validated, the ONNX model is still successfully saved and can be used normally.

The converted ONNX fine-tuned model will be saved to $OUTPUT_DIR/models/gpt2-medium-finetuned-stock-onnx.

Text Generation from the Fine-tuned Model

Using the fine-tuned model to generate new text can be done using the generate_text.py script. This takes in a pre-trained ONNX format model and generates similar entries to the dataset using a generation config file, provided by default here as $CONFIG_DIR/config_finetuned_intel.yml.

The arguments for the generate_text.py script are as follows:

usage: generate_text.py [-h] --model_config MODEL_CONFIG [--benchmark_mode] [--benchmark_seq_length BENCHMARK_SEQ_LENGTH]

optional arguments:
  -h, --help            show this help message and exit
  --model_config MODEL_CONFIG
                        yaml configuration file for model.
  --benchmark_mode      use intel pytorch extension to optimize model.
  --benchmark_seq_length BENCHMARK_SEQ_LENGTH
                        length of generation if benchmark mode is used.

The trained model can be used to generate text using the provided generate_text.py script as well as the provided configuration file $CONFIG_DIR/config_finetuned_intel.yml. An example command is:

python $WORKSPACE/src/generate_text.py --model_config $CONFIG_DIR/config_finetuned_intel.yml

This will output a json string with the generated headlines.

Quantizing the Trained Models

Model Quantization is a way of optimizing the inference process of a trained model by converting weights and operations from FP32 precision to the much smaller INT8 precision. This can dramatically reduce the size of the model and computational cost of the model, leading to accelerated downstream performance. This process can be implemented using Intel® Neural Compressor, a library which offers multiple algorithms for compressing models to accelerate execution time, including Model Quantization.

For the fine-tuned model generated above, we will accelerate text generation time by quantizing the FP32 model to INT8 format using the Accuracy Aware Dynamic Quantization algorithm provided by Intel® Neural Compressor. The Accuracy Aware component automatically tunes the quantization procedure while also monitoring for potential accuracy losses due to changes in precision.

To do dynamic quantization of the above fine-tuned GPT2 model and save a new version that can be used within the above text generation script use quantize_inc_gpt2.py. Other pre-trained model architectures may require their own quantization script:

usage: quantize_inc_gpt2.py [-h] --model_config MODEL_CONFIG --save_model_dir SAVE_MODEL_DIR --data_path DATA_PATH

optional arguments:
  -h, --help            show this help message and exit
  --model_config MODEL_CONFIG
                        yaml configuration file for model.
  --save_model_dir SAVE_MODEL_DIR
                        directory to save the quantized model to.
  --data_path DATA_PATH
                        path to the "ABC million news headlines" csv.

A command to quantize the finetuned ONNX model, $OUTPUT_DIR/models/gpt2-medium-finetuned-intel-onnx, saving it in INT8 ONNX format at $OUTPUT_DIR/models/gpt2-medium-finetuned-inc-onnx/ is:

python $WORKSPACE/src/quantize_inc_gpt2.py --model_config $CONFIG_DIR/config_finetuned_intel.yml --save_model_dir $OUTPUT_DIR/models/gpt2-medium-finetuned-inc-onnx/ --data_path $DATA_DIR/abcnews-date-text.csv

Once the quantized model is created, we can use the generate_text.py script on the quantized model, using the $CONFIG_DIR/config_finetuned_inc.yml model config file:

python $WORKSPACE/src/generate_text.py --model_config $CONFIG_DIR/config_finetuned_inc.yml

Running a GPTJ-6B Model

GPT-J 6B is a transformer model trained using Ben Wang's Mesh Transformer JAX. "GPT-J" refers to the class of model, while "6B" represents the number of trainable parameters.

Inference

Run the gptj_generate_text.py script to generate text using the gpt-j-6B IR model:

usage: gptj_generate_text.py [-h] --prompt PROMPT [--max_new_tokens MAX_NEW_TOKENS] [--temperature TEMPERATURE]

optional arguments:
  -h, --help            show this help message and exit
  --prompt PROMPT, -p PROMPT
                        Prompt to be provided
  --max_new_tokens MAX_NEW_TOKENS
                        Maximum no. of new tokens to be generated. Default - 32
  --temperature TEMPERATURE
                        Temperature parameter for the GPT model. Default - 0.9

Example command:

python $WORKSPACE/src/gptj_generate_text.py  --prompt "hello i am"

Clean Up Bare Metal

Before proceeding to the cleaning process, it is strongly recommended to make a backup of the data that the user wants to keep. To clean the previously downloaded and generated data, run the following commands:

conda activate base
conda env remove -n text_generation_intel
rm -rf $DATA_DIR $OUTPUT_DIR

To remove WORKSPACE:

rm -rf $WORKSPACE

Expected Output

A successful execution of finetune_model.py should return similar results as shown below:

...
Map: 100%|█████████████████████████████████████████████████████████████████████████████████████| 4000/4000 [00:00<00:00, 4252.59 examples/s]
Map: 100%|█████████████████████████████████████████████████████████████████████████████████████| 1000/1000 [00:00<00:00, 3381.15 examples/s]
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
        - Avoid using `tokenizers` before the fork if possible
        - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
        - Avoid using `tokenizers` before the fork if possible
        - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
        - Avoid using `tokenizers` before the fork if possible
        - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
/localdisk/aagalleg/miniconda3/envs/text_generation_intel/lib/python3.9/site-packages/intel_extension_for_pytorch/optim/_optimizer_utils.py:348: UserWarning: Does not suport fused step for <class 'torch.optim.adamw.AdamW'>, will use non-fused step
  warnings.warn("Does not suport fused step for " + str(type(optimizer)) + ", will use non-fused step")
  0%|                                                                                                              | 0/1500 [00:00<?, ?it/s]You're using a GPT2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
{'loss': 3.7542, 'learning_rate': 4.795663052543787e-05, 'epoch': 1.0}                                                                      
{'loss': 2.8271, 'learning_rate': 4.587155963302753e-05, 'epoch': 2.0}                                                                      
{'loss': 2.2259, 'learning_rate': 4.378648874061718e-05, 'epoch': 3.0}                                                                      
{'train_runtime': 1423.8045, 'train_samples_per_second': 8.428, 'train_steps_per_second': 1.054, 'train_loss': 2.9357469075520832, 'epoch': 3.0}
100%|███████████████████████████████████████████████████████████████████████████████████████████████████| 1500/1500 [23:43<00:00,  1.05it/s]
Training time: 1423.8078002929688s

A successful execution of optimum-cli should return similar results as shown below:

...
Local PyTorch model found.
Framework not specified. Using pt to export to ONNX.
Using framework PyTorch: 2.0.1+cpu
Overriding 2 configuration item(s)
        - use_cache -> False
        - pad_token_id -> 0
/localdisk/aagalleg/miniconda3/envs/text_generation_intel/lib/python3.9/site-packages/transformers/models/gpt2/modeling_gpt2.py:807: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  if batch_size <= 0:
============== Diagnostic Run torch.onnx.export version 2.0.1+cpu ==============
verbose: False, log level: Level.ERROR
======================= 0 NONE 0 NOTE 0 WARNING 0 ERROR ========================

Validating ONNX model...
An error occured with the error message: 'tuple' object has no attribute 'items'.
The exported model was saved at: saved_models/gpt2-medium-finetuned-intel-onnx

A successful execution of quantize_inc_gpt2.py should return similar results as shown below:

...
2023-12-27 16:42:07 [INFO] Tune 4 result is: [Accuracy (int8|fp32): 45.0290|44.8773, Duration (seconds) (int8|fp32): 9.7832|20.0922], Best tune result is: [Accuracy: 45.0290, Duration (seconds): 9.7832]
2023-12-27 16:42:07 [INFO] |**********************Tune Result Statistics**********************|
2023-12-27 16:42:07 [INFO] +--------------------+----------+---------------+------------------+
2023-12-27 16:42:07 [INFO] |     Info Type      | Baseline | Tune 4 result | Best tune result |
2023-12-27 16:42:07 [INFO] +--------------------+----------+---------------+------------------+
2023-12-27 16:42:07 [INFO] |      Accuracy      | 44.8773  |    45.0290    |     45.0290      |
2023-12-27 16:42:07 [INFO] | Duration (seconds) | 20.0922  |    9.7832     |     9.7832       |
2023-12-27 16:42:07 [INFO] +--------------------+----------+---------------+------------------+
2023-12-27 16:42:07 [INFO] [Strategy] Found a model that meets the accuracy requirements.
2023-12-27 16:42:07 [INFO] Save tuning history to /localdisk/aagalleg/frameworks.ai.platform.sample-apps.text-data-generation/nc_workspace/2023-12-27_16-38-24/./history.snapshot.
2023-12-27 16:42:07 [INFO] Specified timeout or max trials is reached! Found a quantized model which meet accuracy goal. Exit.
2023-12-27 16:42:07 [INFO] Save deploy yaml to /localdisk/aagalleg/frameworks.ai.platform.sample-apps.text-data-generation/nc_workspace/2023-12-27_16-38-24/deploy.yaml

A successful execution of generate_text.py should return similar results as shown below:

...
[{"sentences": ["Community experiences performance enhancing drug (PED) experiences."]}, {"sentences": ["Farmers pleased with figures."]}]
(text_generation_intel) 

A successful execution of gptj_generate_text.py should return similar results as shown below:

...
hello i am using this code to get the location name using latitude and longitude

Summary and Next Steps

Synthetic Text Generators can be used for many purposes, including building intelligent chat bots and creating artificial data generators for calibrating models. In this reference kit, we have demonstrated one approach to building a synthetic text generation model using a small pre-trained Transformer-based Large Language Model (LLM), gpt-2. By leveraging fine-tuning, we were able to orient the model to generate data that matched our dataset while also utilizing the knowledge it was pre-trained with. This underlying solution powers many of the existing Large Language Model technologies seen today.

Appendix

Notices & Disclaimers

Performance varies by use, configuration, and other factors. Learn more on the Performance Index site. Performance results are based on testing as of dates shown in configurations and may not reflect all publicly available updates. See backup for configuration details. No product or component can be secure. Your costs and results may vary. Intel technologies may require enabled hardware, software or service activation. © Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others.

To the extent that any public or non-Intel datasets or models are referenced by or accessed using tools or code on this site those datasets or models are provided by the third party indicated as the content source. Intel does not create the content and does not warrant its accuracy or quality. By accessing the public content, or using materials trained on or with such content, you agree to the terms associated with that content and that your use complies with the applicable license.

Intel expressly disclaims the accuracy, adequacy, or completeness of any such public content, and is not liable for any errors, omissions, or defects in the content, or for any reliance on the content. Intel is not liable for any liability or damages relating to your use of public content.

*Other names and brands that may be claimed as the property of others. Trademarks.