
<div align="center">

Intel® Extension for Transformers

<h3>An Innovative Transformer-based Toolkit to Accelerate GenAI/LLM Everywhere</h3>

Release Notes

🏭Architecture   |   💬NeuralChat   |   😃Inference on CPU   |   😃Inference on GPU   |   💻Examples   |   📖Documentation

</div>

🚀Latest News


<div align="left">

🏃Installation

Quick Install from PyPI

pip install intel-extension-for-transformers

For system requirements and other installation tips, please refer to the Installation Guide.
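
To confirm the package is visible in your environment, a quick check using only the Python standard library is shown below (an optional step, not part of the official guide):

# Optional sanity check after installation (standard library only)
from importlib.metadata import PackageNotFoundError, version

try:
    print("intel-extension-for-transformers", version("intel-extension-for-transformers"))
except PackageNotFoundError:
    print("intel-extension-for-transformers is not installed in this environment")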

🌟Introduction

Intel® Extension for Transformers is an innovative toolkit designed to accelerate GenAI/LLM everywhere with the optimal performance of Transformer-based models on various Intel platforms, including Intel Gaudi2, Intel CPU, and Intel GPU. Its key capabilities include NeuralChat (a customizable chatbot framework), low-bit LLM inference on CPU and GPU, and Transformers- and LangChain-based extension APIs, as illustrated in the sections below.

🔓Validated Hardware

<table> <tbody> <tr> <td rowspan="2">Hardware</td> <td colspan="2">Fine-Tuning</td> <td colspan="2">Inference</td> </tr> <tr> <td>Full</td> <td>PEFT</td> <td>8-bit</td> <td>4-bit</td> </tr> <tr> <td>Intel Gaudi2</td> <td>✔</td> <td>✔</td> <td>WIP (FP8)</td> <td>-</td> </tr> <tr> <td>Intel Xeon Scalable Processors</td> <td>✔</td> <td>✔</td> <td>✔ (INT8, FP8)</td> <td>✔ (INT4, FP4, NF4)</td> </tr> <tr> <td>Intel Xeon CPU Max Series</td> <td>✔</td> <td>✔</td> <td>✔ (INT8, FP8)</td> <td>✔ (INT4, FP4, NF4)</td> </tr> <tr> <td>Intel Data Center GPU Max Series</td> <td>WIP </td> <td>WIP </td> <td>WIP (INT8)</td> <td>✔ (INT4)</td> </tr> <tr> <td>Intel Arc A-Series</td> <td>-</td> <td>-</td> <td>WIP (INT8)</td> <td>✔ (INT4)</td> </tr> <tr> <td>Intel Core Processors</td> <td>-</td> <td>✔</td> <td>✔ (INT8, FP8)</td> <td>✔ (INT4, FP4, NF4)</td> </tr> </tbody> </table>

In the table above, "-" means not applicable or not started yet.

🔓Validated Software

<table> <tbody> <tr> <td rowspan="2">Software</td> <td colspan="2">Fine-Tuning</td> <td colspan="2">Inference</td> </tr> <tr> <td>Full</td> <td>PEFT</td> <td>8-bit</td> <td>4-bit</td> </tr> <tr> <td>PyTorch</td> <td>2.0.1+cpu,</br> 2.0.1a0 (gpu)</td> <td>2.0.1+cpu,</br> 2.0.1a0 (gpu)</td> <td>2.1.0+cpu,</br> 2.0.1a0 (gpu)</td> <td>2.1.0+cpu,</br> 2.0.1a0 (gpu)</td> </tr> <tr> <td>Intel® Extension for PyTorch</td> <td>2.1.0+cpu,</br> 2.0.110+xpu</td> <td>2.1.0+cpu,</br> 2.0.110+xpu</td> <td>2.1.0+cpu,</br> 2.0.110+xpu</td> <td>2.1.0+cpu,</br> 2.0.110+xpu</td> </tr> <tr> <td>Transformers</td> <td>4.35.2(CPU),</br> 4.31.0 (Intel GPU)</td> <td>4.35.2(CPU),</br> 4.31.0 (Intel GPU)</td> <td>4.35.2(CPU),</br> 4.31.0 (Intel GPU)</td> <td>4.35.2(CPU),</br> 4.31.0 (Intel GPU)</td> </tr> <tr> <td>Synapse AI</td> <td>1.13.0</td> <td>1.13.0</td> <td>1.13.0</td> <td>1.13.0</td> </tr> <tr> <td>Gaudi2 driver</td> <td>1.13.0-ee32e42</td> <td>1.13.0-ee32e42</td> <td>1.13.0-ee32e42</td> <td>1.13.0-ee32e42</td> </tr> <tr> <td>intel-level-zero-gpu</td> <td>1.3.26918.50-736~22.04 </td> <td>1.3.26918.50-736~22.04 </td> <td>1.3.26918.50-736~22.04 </td> <td>1.3.26918.50-736~22.04 </td> </tr> </tbody> </table>

Please refer to the detailed requirements for CPU, Gaudi2, and Intel GPU.

🔓Validated OS

Ubuntu 20.04/22.04, CentOS 8.

🌱Getting Started

Chatbot

Below is the sample code to create your chatbot. See more examples.

Serving (OpenAI-compatible RESTful APIs)

NeuralChat provides OpenAI-compatible RESTful APIs for chat, so you can use NeuralChat as a drop-in replacement for the OpenAI APIs. You can start the NeuralChat server using either a shell command or Python code.

# Shell Command
neuralchat_server start --config_file ./server/config/neuralchat.yaml
# Python Code
from intel_extension_for_transformers.neural_chat import NeuralChatServerExecutor
server_executor = NeuralChatServerExecutor()
server_executor(config_file="./server/config/neuralchat.yaml", log_file="./neuralchat.log")

The NeuralChat service is accessible through the OpenAI client library, curl commands, and the requests library. See more in NeuralChat.
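
For example, a minimal request with the requests library might look like the sketch below; the host, port, and model name are assumptions and should be adjusted to match your neuralchat.yaml.

# Minimal client sketch; the host, port, and model name are assumptions (match them to your neuralchat.yaml)
import requests

resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "Intel/neural-chat-7b-v3-1",
        "messages": [{"role": "user", "content": "Tell me about Intel Xeon Scalable Processors."}],
    },
)
print(resp.json())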

Offline

from intel_extension_for_transformers.neural_chat import build_chatbot
chatbot = build_chatbot()
response = chatbot.predict("Tell me about Intel Xeon Scalable Processors.")
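
To build the chatbot around a specific base model, a configuration object can be passed to build_chatbot. The sketch below assumes the PipelineConfig class and its model_name_or_path argument from the NeuralChat API; see the NeuralChat documentation for the exact options.

# Build a chatbot on a specific base model (PipelineConfig usage is an assumption; see the NeuralChat docs)
from intel_extension_for_transformers.neural_chat import PipelineConfig, build_chatbot

config = PipelineConfig(model_name_or_path="Intel/neural-chat-7b-v3-1")
chatbot = build_chatbot(config)
print(chatbot.predict("Tell me about Intel Xeon Scalable Processors."))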

Transformers-based extension APIs

Below is the sample code to use the extended Transformers APIs. See more examples.

INT4 Inference (CPU)

We encourage you to install NeuralSpeed to get the latest features (e.g., GGUF support) for low-bit LLM inference on CPUs. You may also use v1.3 without NeuralSpeed by following the document.

from transformers import AutoTokenizer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM
model_name = "Intel/neural-chat-7b-v3-1"     
prompt = "Once upon a time, there existed a little girl,"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
inputs = tokenizer(prompt, return_tensors="pt").input_ids

model = AutoModelForCausalLM.from_pretrained(model_name, load_in_4bit=True)
outputs = model.generate(inputs)
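
The generate call returns token IDs; decoding them with the tokenizer gives the generated text:

# Decode the generated token IDs back to text
print(tokenizer.decode(outputs[0], skip_special_tokens=True))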

You can also load GGUF-format models from Hugging Face; only the Q4_0, Q5_0, and Q8_0 GGUF formats are supported for now.

from transformers import AutoTokenizer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM

# Specify the GGUF repo on Hugging Face
model_name = "TheBloke/Llama-2-7B-Chat-GGUF"
# Download the specific GGUF model file from the above repo
gguf_file = "llama-2-7b-chat.Q4_0.gguf"
# Make sure you have been granted access to this model on Hugging Face
tokenizer_name = "meta-llama/Llama-2-7b-chat-hf"
prompt = "Once upon a time, there existed a little girl,"
tokenizer = AutoTokenizer.from_pretrained(tokenizer_name, trust_remote_code=True)
inputs = tokenizer(prompt, return_tensors="pt").input_ids

model = AutoModelForCausalLM.from_pretrained(model_name, gguf_file=gguf_file)
outputs = model.generate(inputs)

You can also load a PyTorch model from ModelScope.

Note: this requires the modelscope package.

from transformers import TextStreamer
from modelscope import AutoTokenizer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM
model_name = "qwen/Qwen-7B"     # Modelscope model_id or local model
prompt = "Once upon a time, there existed a little girl,"

model = AutoModelForCausalLM.from_pretrained(model_name, load_in_4bit=True, model_hub="modelscope")
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
inputs = tokenizer(prompt, return_tensors="pt").input_ids
streamer = TextStreamer(tokenizer)
outputs = model.generate(inputs, streamer=streamer, max_new_tokens=300)

You can also load low-bit models quantized with the GPTQ/AWQ/RTN/AutoRound algorithms.

from transformers import AutoTokenizer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM, GPTQConfig

# Hugging Face GPTQ/AWQ model, or a local quantized model path
model_name = "MODEL_NAME_OR_PATH"
prompt = "Once upon a time, a little girl"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
inputs = tokenizer(prompt, return_tensors="pt").input_ids
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True)
outputs = model.generate(inputs)
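
The GPTQConfig imported above can also be used to quantize a full-precision checkpoint on the fly rather than loading an already-quantized one. The sketch below is only illustrative; the parameter names (bits, tokenizer, dataset) are assumptions, so check docs/weightonlyquant.md for the exact signature.

# Hypothetical on-the-fly GPTQ quantization; parameter names are assumptions (see docs/weightonlyquant.md)
quantization_config = GPTQConfig(bits=4, tokenizer=tokenizer, dataset="NeelNanda/pile-10k")
model = AutoModelForCausalLM.from_pretrained(
    "MODEL_NAME_OR_PATH",
    quantization_config=quantization_config,
    trust_remote_code=True,
)
outputs = model.generate(inputs)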

INT4 Inference (GPU)

import intel_extension_for_pytorch as ipex
from intel_extension_for_transformers.transformers.modeling import AutoModelForCausalLM
from transformers import AutoTokenizer
import torch

device_map = "xpu"
model_name ="Qwen/Qwen-7B"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
prompt = "Once upon a time, there existed a little girl,"
inputs = tokenizer(prompt, return_tensors="pt").input_ids.to(device_map)

model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True,
                                              device_map=device_map, load_in_4bit=True)

model = ipex.optimize_transformers(model, inplace=True, dtype=torch.float16, quantization_config=True, device=device_map)

output = model.generate(inputs)

Note: Please refer to the example and script for more details.

LangChain-based extension APIs

Below is the sample code to use the extended LangChain APIs. See more examples.

from langchain_community.llms.huggingface_pipeline import HuggingFacePipeline
from langchain.chains import RetrievalQA
from langchain_core.vectorstores import VectorStoreRetriever
from intel_extension_for_transformers.langchain.vectorstores import Chroma
retriever = VectorStoreRetriever(vectorstore=Chroma(...))
retrievalQA = RetrievalQA.from_llm(llm=HuggingFacePipeline(...), retriever=retriever)
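
Once the retriever and chain are constructed, the chain can be queried like any other LangChain chain; a minimal sketch follows (the question string is illustrative):

# Query the retrieval-augmented chain (illustrative question)
answer = retrievalQA.run("What is Intel Extension for Transformers?")
print(answer)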

🎯Validated Models

You can access the validated models, accuracy, and performance from the Release data or the Medium blog.

📖Documentation

<table> <thead> <tr> <th colspan="8" align="center">OVERVIEW</th> </tr> </thead> <tbody> <tr> <td colspan="4" align="center"><a href="intel_extension_for_transformers/neural_chat">NeuralChat</a></td> <td colspan="4" align="center"><a href="https://github.com/intel/neural-speed/tree/main">Neural Speed</a></td> </tr> <tr> <th colspan="8" align="center">NEURALCHAT</th> </tr> <tr> <td colspan="2" align="center"><a href="intel_extension_for_transformers/neural_chat/docs/notebooks/deploy_chatbot_on_spr.ipynb">Chatbot on Intel CPU</a></td> <td colspan="3" align="center"><a href="intel_extension_for_transformers/neural_chat/docs/notebooks/deploy_chatbot_on_xpu.ipynb">Chatbot on Intel GPU</a></td> <td colspan="3" align="center"><a href="intel_extension_for_transformers/neural_chat/docs/notebooks/deploy_chatbot_on_habana_gaudi.ipynb">Chatbot on Gaudi</a></td> </tr> <tr> <td colspan="4" align="center"><a href="intel_extension_for_transformers/neural_chat/examples/deployment/talkingbot/pc/build_talkingbot_on_pc.ipynb">Chatbot on Client</a></td> <td colspan="4" align="center"><a href="intel_extension_for_transformers/neural_chat/docs/full_notebooks.md">More Notebooks</a></td> </tr> <tr> <th colspan="8" align="center">NEURAL SPEED</th> </tr> <tr> <td colspan="2" align="center"><a href="https://github.com/intel/neural-speed/tree/main/README.md">Neural Speed</a></td> <td colspan="2" align="center"><a href="https://github.com/intel/neural-speed/tree/main/README.md#2-neural-speed-straight-forward">Streaming LLM</a></td> <td colspan="2" align="center"><a href="https://github.com/intel/neural-speed/tree/main/neural_speed/core#support-matrix">Low Precision Kernels</a></td> <td colspan="2" align="center"><a href="https://github.com/intel/neural-speed/tree/main/docs/tensor_parallelism.md">Tensor Parallelism</a></td> </tr> <tr> <th colspan="8" align="center">LLM COMPRESSION</th> </tr> <tr> <td colspan="2" align="center"><a href="docs/smoothquant.md">SmoothQuant (INT8)</a></td> <td colspan="3" align="center"><a href="docs/weightonlyquant.md">Weight-only Quantization (INT4/FP4/NF4/INT8)</a></td> <td colspan="3" align="center"><a href="docs/qloracpu.md">QLoRA on CPU</a></td> </tr> <tr> <th colspan="8" align="center">GENERAL COMPRESSION</th> <tr> <tr> <td colspan="2" align="center"><a href="docs/quantization.md">Quantization</a></td> <td colspan="2" align="center"><a href="docs/pruning.md">Pruning</a></td> <td colspan="2" align="center"><a href="docs/distillation.md">Distillation</a></td> <td align="center" colspan="2"><a href="examples/huggingface/pytorch/text-classification/orchestrate_optimizations/README.md">Orchestration</a></td> </tr> <tr> <td align="center" colspan="2"><a href="docs/data_augmentation.md">Data Augmentation</a></td> <td align="center" colspan="2"><a href="docs/export.md">Export</a></td> <td align="center" colspan="2"><a href="docs/metrics.md">Metrics</a></td> <td align="center" colspan="2"><a href="docs/objectives.md">Objectives</a></td> </tr> <tr> <td align="center" colspan="2"><a href="docs/pipeline.md">Pipeline</a></td> <td align="center" colspan="3"><a href="examples/huggingface/pytorch/question-answering/dynamic/README.md">Length Adaptive</a></td> <td align="center" colspan="3"><a href="docs/examples.md#early-exit">Early Exit</a></td> </tr> <tr> <th colspan="8" align="center">TUTORIALS & RESULTS</a></th> </tr> <tr> <td colspan="2" align="center"><a href="docs/tutorials/README.md">Tutorials</a></td> <td colspan="2" align="center"><a 
href="https://github.com/intel/neural-speed/blob/main/docs/supported_models.md">LLM List</a></td> <td colspan="2" align="center"><a href="docs/examples.md">General Model List</a></td> <td colspan="2" align="center"><a href="intel_extension_for_transformers/transformers/runtime/docs/validated_model.md">Model Performance</a></td> </tr> </tbody> </table>

🙌Demo

https://github.com/intel/intel-extension-for-transformers/assets/109187816/1698dcda-c9ec-4f44-b159-f4e9d67ab15b

https://github.com/intel/intel-extension-for-transformers/assets/88082706/9d9bdb7e-65db-47bb-bbed-d23b151e8b31

📃Selected Publications/Events

View Full Publication List

Additional Content

Acknowledgements

💁Collaborations

We welcome ideas on model compression techniques and LLM-based chatbot development! Feel free to reach out to us; we look forward to collaborating with you on Intel Extension for Transformers!