Candle Tutorial - Convert PyTorch Models to Candle

Candle is an ML framework written in Rust that takes advantage of the speed and memory safety Rust provides for machine learning workloads. It can be used as a drop-in replacement for ML frameworks like PyTorch, and it also has Python bindings, so you can use it from Python.

This repo provides a guide for converting PyTorch models from the transformers library to Candle by directly translating the PyTorch code to Candle.

❗️❗️: To make the code easily understandable, I have annotated each line of the Rust/Candle code with the equivalent PyTorch code.

Tutorial Structure:

Getting Started:

0. Important things to note

1. Start a new Rust project

The command below creates a new Rust project called candle-roberta in the current directory, with a Cargo.toml file and a src directory containing a main.rs file.

$ cargo new candle-roberta

2. Install Candle & Other Packages

You can follow the instructions here to install Candle, or you can use the commands below to install Candle directly from GitHub.

For this tutorial, we will use the candle-core and candle-nn crates. candle-core provides the core functionality of the Candle framework: it implements the basic building blocks for neural networks and integrates with different backends such as CUDA, MKL and CPU. candle-nn provides a high-level API for building neural networks.

- cargo add --git https://github.com/huggingface/candle.git candle-core  # install candle-core
- cargo add --git https://github.com/huggingface/candle.git candle-nn # install candle-nn

Other frameworks we would need for this tutorial are:

3. Parallels between PyTorch and Candle

To convert a PyTorch model to Candle, it is important to understand the parallels between the two frameworks.

Tensors

The examples shown below can be found here:

Tensor Operations:

Performing tensor operations is pretty similar across both frameworks.
Some examples can be found here: [Candle CheatSheet](https://github.com/huggingface/candle/blob/main/README.md#how-to-use)

3. Translating a PyTorch Transformer Model into Candle

Here's the fun part! In this section we are going to take a look at translating models from the transformers library to Candle. We will use the RoBERTa and XLM-RoBERTa models for this tutorial.

We will translate the PyTorch source code into Candle code, then load the pretrained checkpoint into Rust and compare the output from both frameworks.

Note ❗️❗️: To make the code easily understandable, I have annotated each line of the Rust/Candle code with the equivalent PyTorch code.

3.1. RoBERTa

RoBERTa is a variant of the BERT model. Although the two models use different pretraining approaches, they are structurally very similar. The major difference is that in RoBERTa position numbers begin at padding_idx + 1, while in BERT position numbers begin at 0.
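The position numbering difference described above can be sketched in plain Rust, without Candle. This is only an illustration on a single sequence stored in a `Vec<u32>`; the function names `create_position_ids` and `bert_position_ids` are hypothetical helpers for this sketch, and it assumes that padding tokens keep `padding_idx` as their position while real tokens are counted from `padding_idx + 1`.

```rust
/// Sketch of RoBERTa-style position ids for one sequence of token ids.
/// Non-padding tokens are numbered from `padding_idx + 1`; padding tokens
/// keep `padding_idx` as their position.
fn create_position_ids(input_ids: &[u32], padding_idx: u32) -> Vec<u32> {
    // Running count of non-padding tokens seen so far (a cumulative sum of the mask).
    let mut seen = 0u32;
    input_ids
        .iter()
        .map(|&id| {
            if id == padding_idx {
                padding_idx
            } else {
                seen += 1;
                seen + padding_idx
            }
        })
        .collect()
}

/// BERT-style position ids for comparison: 0, 1, 2, ...
fn bert_position_ids(seq_len: usize) -> Vec<u32> {
    (0..seq_len as u32).collect()
}
```

With `padding_idx = 1`, a sequence of four real tokens followed by two padding tokens gets positions 2, 3, 4, 5 for the real tokens and 1 for each padding token, whereas the BERT-style helper simply counts from 0.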

Following the transformers PyTorch implementation, the RoBERTa model can be divided into 2 main parts (embeddings and encoder):

RobertaModel(
  (embeddings): RobertaEmbeddings(
    (word_embeddings): Embedding(50265, 768, padding_idx=1)
    (position_embeddings): Embedding(514, 768, padding_idx=1)
    (token_type_embeddings): Embedding(1, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): RobertaEncoder(
    (layer): ModuleList(
      (0-11): 12 x RobertaLayer(
        (attention): RobertaAttention(
          (self): RobertaSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): RobertaSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
        (intermediate): RobertaIntermediate(
          (dense): Linear(in_features=768, out_features=3072, bias=True)
          (intermediate_act_fn): GELUActivation()
        )
        (output): RobertaOutput(
          (dense): Linear(in_features=3072, out_features=768, bias=True)
          (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
  )
)

Listed above are the main components of the model. Other building blocks for implementing the model include:

Translating PyTorch Modules into Candle

Import necessary Modules:

Import the necessary modules from candle and other crates:

a. Writing Building Blocks:

b. Roberta Config:

Up next is the Roberta Config. This is a struct that holds the configuration of the model. It is similar to the RobertaConfig in the transformers library. For this struct, we initialize the default values by implementing the Default trait for RobertaConfig, and then use the serde crate to deserialize the config from a JSON file. Alternatively, we can create a RobertaConfig::new() method for creating a new instance of RobertaConfig.

pub struct RobertaConfig {
    vocab_size: usize,
    hidden_size: usize,
    num_hidden_layers: usize,
    num_attention_heads: usize,
    intermediate_size: usize,
    hidden_act: String,
    hidden_dropout_prob: f64,
    max_position_embeddings: usize,
    type_vocab_size: usize,
    initializer_range: f64,
    layer_norm_eps: f64,
    pad_token_id: usize,
    bos_token_id: usize,
    eos_token_id: usize,
    position_embedding_type: String,
    use_cache: bool,
    classifier_dropout: Option<f64>,
    model_type: Option<String>,
}

impl Default for RobertaConfig {
    fn default() -> Self {
        Self {
            vocab_size: 50265,
            hidden_size: 768,
            num_hidden_layers: 12,
            num_attention_heads: 12,
            intermediate_size: 3072,
            hidden_act: "gelu".to_string(),
            hidden_dropout_prob: 0.1,
            max_position_embeddings: 512,
            type_vocab_size: 2,
            initializer_range: 0.02,
            layer_norm_eps: 1e-12,
            pad_token_id: 1,
            bos_token_id: 0,
            eos_token_id: 2,
            // The field is declared as a String in the struct above, so the default must be a String too
            position_embedding_type: "absolute".to_string(),
            use_cache: true,
            classifier_dropout: None,
            model_type: Some("roberta".to_string()),
        }
    }
}
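As a rough illustration of the "defaults plus overrides" idea, here is a std-only sketch. `MiniRobertaConfig` is a hypothetical, trimmed-down struct that exists only for this example, and its `new()` signature is an assumption rather than the tutorial's actual API. The default values mirror the Default implementation above, and the override values (514 positions, 1 token type, eps of 1e-05) are the ones visible in the printed model earlier in this section. Deserializing with serde is not shown, because it needs an external crate.

```rust
/// Hypothetical, trimmed-down config used only for this illustration.
#[derive(Debug, Clone, PartialEq)]
struct MiniRobertaConfig {
    hidden_size: usize,
    max_position_embeddings: usize,
    type_vocab_size: usize,
    layer_norm_eps: f64,
}

impl Default for MiniRobertaConfig {
    fn default() -> Self {
        Self {
            hidden_size: 768,
            max_position_embeddings: 512,
            type_vocab_size: 2,
            layer_norm_eps: 1e-12,
        }
    }
}

impl MiniRobertaConfig {
    /// Hypothetical constructor: start from the defaults and override only
    /// the fields that differ for a particular checkpoint.
    fn new(max_position_embeddings: usize, type_vocab_size: usize, layer_norm_eps: f64) -> Self {
        Self {
            max_position_embeddings,
            type_vocab_size,
            layer_norm_eps,
            // Every field not listed above falls back to its default value.
            ..Self::default()
        }
    }
}
```

Calling `MiniRobertaConfig::new(514, 1, 1e-5)` keeps `hidden_size` at 768 and replaces the other three fields. The struct update syntax (`..Self::default()`) keeps the constructor short even when the real config has many more fields.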

c. RobertaEmbeddings:

HuggingFace PyTorch Implementation

In the __init__ function of the embedding class, we have 3 embedding layers: word_embeddings, position_embeddings and token_type_embeddings. Similar to the PyTorch implementation, there are two important class methods that we need to implement.
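To make the role of the three embedding tables concrete, the following std-only sketch looks up one row per id in each table and sums the results element-wise. It is a simplification under stated assumptions: tables are plain `Vec<Vec<f32>>` values, the helper names `embed` and `sum_embeddings` are hypothetical, and the LayerNorm and dropout steps that follow the sum in the real module are left out.

```rust
/// Look up one row per id in an embedding table (one row per vocabulary entry).
fn embed(table: &[Vec<f32>], ids: &[usize]) -> Vec<Vec<f32>> {
    ids.iter().map(|&id| table[id].clone()).collect()
}

/// Sum the word, position and token-type embeddings element-wise.
/// LayerNorm and dropout, which follow in the real module, are omitted here.
fn sum_embeddings(
    word: &[Vec<f32>],
    position: &[Vec<f32>],
    token_type: &[Vec<f32>],
    input_ids: &[usize],
    position_ids: &[usize],
    token_type_ids: &[usize],
) -> Vec<Vec<f32>> {
    let w = embed(word, input_ids);
    let p = embed(position, position_ids);
    let t = embed(token_type, token_type_ids);
    w.iter()
        .zip(p.iter())
        .zip(t.iter())
        .map(|((w, p), t)| {
            // Add the three vectors for this token position.
            w.iter().zip(p).zip(t).map(|((a, b), c)| a + b + c).collect()
        })
        .collect()
}
```

Each output row has the same width as the embedding tables, so all three tables must share the same hidden size.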

d. RobertaSelfAttention:

HuggingFace PyTorch Implementation. The self attention layer is made up of 3 linear layers for processing the query, key and value. The attention scores are the dot product of the query and key, scaled by the square root of the attention head size. The scores are then passed through a softmax layer and a dropout layer, and the result is multiplied by the value.
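Before the Candle version, the computation described above can be sketched for a single attention head using only the standard library. This is a hedged illustration, not the tutorial's implementation: matrices are `[seq_len][head_size]` nested vectors, the function names `softmax` and `attention` are local to this sketch, and dropout is skipped as it would be at inference time.

```rust
/// Numerically stable softmax over one row of scores.
fn softmax(row: &[f32]) -> Vec<f32> {
    let max = row.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = row.iter().map(|x| (x - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|e| e / sum).collect()
}

/// Single-head attention over [seq_len][head_size] matrices.
/// Dropout is skipped, as it would be at inference time.
fn attention(q: &[Vec<f32>], k: &[Vec<f32>], v: &[Vec<f32>]) -> Vec<Vec<f32>> {
    let head_size = q[0].len();
    let scale = (head_size as f32).sqrt();
    q.iter()
        .map(|qi| {
            // scores[j] = (q_i . k_j) / sqrt(head_size)
            let scores: Vec<f32> = k
                .iter()
                .map(|kj| qi.iter().zip(kj).map(|(a, b)| a * b).sum::<f32>() / scale)
                .collect();
            let probs = softmax(&scores);
            // context_i = sum_j probs[j] * v_j
            let mut context = vec![0.0f32; v[0].len()];
            for (p, vj) in probs.iter().zip(v) {
                for (c, x) in context.iter_mut().zip(vj) {
                    *c += p * x;
                }
            }
            context
        })
        .collect()
}
```

The multi-head version in the Candle code reshapes the hidden dimension into (num_attention_heads, attention_head_size) and runs this same computation for every head in parallel.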


```rust
struct RobertaSelfAttention {
    query: Linear,
    key: Linear,
    value: Linear,
    dropout: Dropout,
    num_attention_heads: usize,
    attention_head_size: usize,
}

impl RobertaSelfAttention {
    fn load(vb: VarBuilder, config: &RobertaConfig) -> Result<Self> {
        // config.hidden_size / config.num_attention_heads
        let attention_head_size = config.hidden_size / config.num_attention_heads;
        // self.num_attention_heads * self.attention_head_size
        let all_head_size = config.num_attention_heads * attention_head_size; 
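        // nn.Dropout(config.attention_probs_dropout_prob)
        // RobertaConfig above has no attention_probs_dropout_prob field, so hidden_dropout_prob is reused here
        let dropout = Dropout::new(config.hidden_dropout_prob);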
        let hidden_size = config.hidden_size;

        // nn.Linear(config.hidden_size, self.all_head_size)
        let query = linear(hidden_size, all_head_size, vb.pp("query"))?; 
        // nn.Linear(config.hidden_size, self.all_head_size)
        let value = linear(hidden_size, all_head_size, vb.pp("value"))?; 
        // nn.Linear(config.hidden_size, self.all_head_size)
        let key = linear(hidden_size, all_head_size, vb.pp("key"))?; 
        Ok(Self {
            query,
            key,
            value,
            dropout,
            num_attention_heads: config.num_attention_heads,
            attention_head_size,
        })
    }

    fn transpose_for_scores(&self, xs: &Tensor) -> Result<Tensor> {
        
        // x.size()[:-1] + (self.num_attention_heads, self.attention_head_size)
        let mut new_x_shape = xs.dims().to_vec();
        new_x_shape.pop();
        new_x_shape.push(self.num_attention_heads);
        new_x_shape.push(self.attention_head_size);

        //  x = x.view(new_x_shape) || x.permute(0, 2, 1, 3)
        let xs = xs.reshape(new_x_shape.as_slice())?.transpose(1, 2)?;
        xs.contiguous()
    }

    fn forward(&self, hidden_states: &Tensor) -> Result<Tensor> {
        // self.query(hidden_states)
        let query_layer = self.query.forward(hidden_states)?;
        // self.key(hidden_states) 
        let key_layer = self.key.forward(hidden_states)?; 
        // self.value(hidden_states)
        let value_layer = self.value.forward(hidden_states)?; 

        // self.transpose_for_scores(query_layer)
        let query_layer = self.transpose_for_scores(&query_layer)?;
        // self.transpose_for_scores(key_layer) 
        let key_layer = self.transpose_for_scores(&key_layer)?;
        // self.transpose_for_scores(value_layer)
        let value_layer = self.transpose_for_scores(&value_layer)?; 

        // attention_scores = torch.matmul(query_layer, key_layer.transpose(-1, -2))
        let attention_scores = query_layer.matmul(&key_layer.t()?)?;
        // attention_scores / math.sqrt(self.attention_head_size)
        let attention_scores = (attention_scores / (self.attention_head_size as f64).sqrt())?; 
        // attention_probs = nn.functional.softmax(attention_scores, dim=-1)
        let attention_probs = {candle_nn::ops::softmax(&attention_scores, candle_core::D::Minus1)?}; 
        // attention_probs = self.dropout(attention_probs)
        let attention_probs = self.dropout.forward(&attention_probs)?; 

        // torch.matmul(attention_probs, value_layer)
        let context_layer = attention_probs.matmul(&value_layer)?;
        // context_layer = context_layer.permute(0, 2, 1, 3).contiguous()
        let context_layer = context_layer.transpose(1, 2)?.contiguous()?; 

        // new_context_layer_shape = context_layer.size()[:-2] + (self.all_head_size,)
        // context_layer = context_layer.view(new_context_layer_shape)
        let context_layer = context_layer.flatten_from(candle_core::D::Minus2)?;
        Ok(context_layer)
    }
}
```

e. RobertaSelfOutput:

HuggingFace PyTorch Implementation. The output of the Self Attention Layer is passed through the Self Output layer which is made up of a linear layer, layer norm and dropout layer.
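The residual connection and layer norm at the end of this block can be sketched on a single hidden vector with the standard library alone. The helper names `layer_norm_plain` and `add_and_norm` are hypothetical, and the sketch assumes a layer norm without the learned weight and bias (equivalent to weight = 1 and bias = 0); the dense projection and dropout are left out.

```rust
/// Layer normalization over one hidden vector, without the learned
/// weight and bias (i.e. weight = 1, bias = 0).
fn layer_norm_plain(x: &[f32], eps: f32) -> Vec<f32> {
    let n = x.len() as f32;
    let mean = x.iter().sum::<f32>() / n;
    // Biased variance (divide by n), as layer norm does.
    let var = x.iter().map(|v| (v - mean).powi(2)).sum::<f32>() / n;
    let denom = (var + eps).sqrt();
    x.iter().map(|v| (v - mean) / denom).collect()
}

/// Residual connection followed by layer norm: LayerNorm(hidden + input).
fn add_and_norm(hidden: &[f32], input: &[f32], eps: f32) -> Vec<f32> {
    let summed: Vec<f32> = hidden.iter().zip(input).map(|(h, i)| h + i).collect();
    layer_norm_plain(&summed, eps)
}
```

The output has approximately zero mean and unit variance across the hidden dimension, which is what the LayerNorm call in the Candle code produces before its learned scale and shift are applied.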

struct RobertaSelfOutput {
    dense: Linear,
    layer_norm: LayerNorm,
    dropout: Dropout,
}

impl RobertaSelfOutput {
    fn load(vb: VarBuilder, config: &RobertaConfig) -> Result<Self> {
        // nn.Linear(config.hidden_size, config.hidden_size)
        let dense = linear(config.hidden_size, config.hidden_size, vb.pp("dense"))?; 
        //  nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)
        let layer_norm = layer_norm(
            config.hidden_size,
            config.layer_norm_eps,
            vb.pp("LayerNorm"),
        )?;

        // nn.Dropout(config.hidden_dropout_prob)
        let dropout = Dropout::new(config.hidden_dropout_prob); 
        Ok(Self {
            dense,
            layer_norm,
            dropout,
        })
    }

    fn forward(&self, hidden_states: &Tensor, input_tensor: &Tensor) -> Result<Tensor> {
        // self.dense(hidden_states)
        let hidden_states = self.dense.forward(hidden_states)?;
        // self.dropout(hidden_states)
        let hidden_states = self.dropout.forward(&hidden_states)?;
        // self.LayerNorm(hidden_states + input_tensor)
        self.layer_norm.forward(&(hidden_states + input_tensor)?) 
    }
}

f. RobertaAttention:

HuggingFace PyTorch Implementation. The Roberta Attention Layer is made up of the Self Attention Layer and the Self Output Layer implemented earlier. The output of the Self Attention Layer is passed through the Self Output Layer.

struct RobertaAttention {
    self_attention: RobertaSelfAttention, 
    self_output: RobertaSelfOutput,
}

impl RobertaAttention {
    fn load(vb: VarBuilder, config: &RobertaConfig) -> Result<Self> {
        // RobertaSelfAttention(config, position_embedding_type=position_embedding_type)
        let self_attention = RobertaSelfAttention::load(vb.pp("self"), config)?;
        // RobertaSelfOutput(config) 
        let self_output = RobertaSelfOutput::load(vb.pp("output"), config)?; 

        Ok(Self {
            self_attention,
            self_output,
        })
    }

    fn forward(&self, hidden_states: &Tensor) -> Result<Tensor> {
        //self_outputs = self.self(hidden_states)
        let self_outputs = self.self_attention.forward(hidden_states)?; 
        // attention_output = self.output(self_outputs[0], hidden_states)
        let attention_output = self.self_output.forward(&self_outputs, hidden_states)?; 

        Ok(attention_output)
    }
}

g. RobertaIntermediate

HuggingFace PyTorch Implementation. The intermediate layer is made up of a linear layer and an activation function. Here we use the GELU activation function. This layer, combined with the Attention Layer and an Output Layer, makes up a RobertaLayer, the building block of the Encoder.
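As a rough picture of what this layer computes, the sketch below applies a linear projection followed by GELU to one input vector. Note the assumptions: it uses the tanh approximation of GELU because the Rust standard library has no error function, so its values can differ slightly from an exact erf-based GELU, and `gelu_tanh` and `intermediate` are hypothetical names for this sketch only.

```rust
/// Tanh approximation of GELU:
/// 0.5 * x * (1 + tanh(sqrt(2 / pi) * (x + 0.044715 * x^3))).
fn gelu_tanh(x: f32) -> f32 {
    let c = (2.0f32 / std::f32::consts::PI).sqrt();
    0.5 * x * (1.0 + (c * (x + 0.044715 * x.powi(3))).tanh())
}

/// Linear projection followed by the activation, for one input vector.
/// `weight` is [out_features][in_features], matching nn.Linear's layout.
fn intermediate(x: &[f32], weight: &[Vec<f32>], bias: &[f32]) -> Vec<f32> {
    weight
        .iter()
        .zip(bias)
        .map(|(row, b)| {
            // z = row . x + b
            let z: f32 = row.iter().zip(x).map(|(w, xi)| w * xi).sum::<f32>() + b;
            gelu_tanh(z)
        })
        .collect()
}
```

In the model printed earlier, this projection expands the hidden size from 768 to 3072 before the Output layer projects it back down.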

struct RobertaIntermediate {
    dense: Linear,
    intermediate_act: Activation, // same type that load() constructs with Activation::new()
}

impl RobertaIntermediate {
    fn load(vb: VarBuilder, config: &RobertaConfig) -> Result<Self> {
        // nn.Linear(config.hidden_size, config.intermediate_size)
        let dense = linear(config.hidden_size, config.intermediate_size, vb.pp("dense"))?; 
        Ok(Self {
            dense,
            intermediate_act: Activation::new(),
        })
    }

    fn forward(&self, hidden_states: &Tensor) -> Result<Tensor> {
        // self.dense(hidden_states)
        let hidden_states = self.dense.forward(hidden_states)?; 
        // self.intermediate_act_fn(hidden_states)
        let ys = self.intermediate_act.forward(&hidden_states)?; 
        Ok(ys)
    }
}

h. RobertaOutput

HuggingFace PyTorch Implementation. The output layer is made up of a linear layer, layer norm and dropout layer. This layer, combined with the Attention Layer and an Intermediate Layer, makes up a RobertaLayer, the building block of the Encoder.

struct RobertaOutput {
    dense: Linear,
    layer_norm: LayerNorm,
    dropout: Dropout,
}

impl RobertaOutput {
    fn load(vb: VarBuilder, config: &RobertaConfig) -> Result<Self> {
        // nn.Linear(config.intermediate_size, config.hidden_size)
        let dense = linear(config.intermediate_size, config.hidden_size, vb.pp("dense"))?;
        // nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)
        let layer_norm = layer_norm(
            config.hidden_size,
            config.layer_norm_eps,
            vb.pp("LayerNorm"),
        )?; 
        let dropout = Dropout::new(config.hidden_dropout_prob);
        Ok(Self {
            dense,
            layer_norm,
            dropout,
        })
    }

    fn forward(&self, hidden_states: &Tensor, input_tensor: &Tensor) -> Result<Tensor> {
        // self.dense(hidden_states)
        let hidden_states = self.dense.forward(hidden_states)?;
        // self.dropout(hidden_states)
        let hidden_states = self.dropout.forward(&hidden_states)?;
        // self.LayerNorm(hidden_states + input_tensor)
        self.layer_norm.forward(&(hidden_states + input_tensor)?) 
    }
}

i. RobertaLayer

HuggingFace PyTorch Implementation: This does not include an implementation of cross-attention as in the PyTorch code. As mentioned in the previous sections, the RobertaLayer is made up of an Attention Layer, an Intermediate Layer and an Output Layer. A stack of these layers makes up the Encoder.

struct RobertaLayer {
    attention: RobertaAttention,
    intermediate: RobertaIntermediate,
    output: RobertaOutput,
}

impl RobertaLayer {
    fn load(vb: VarBuilder, config: &RobertaConfig) -> Result<Self> {
        // RobertaAttention(config)
        let attention = RobertaAttention::load(vb.pp("attention"), config)?;
        // RobertaIntermediate(config)
        let intermediate = RobertaIntermediate::load(vb.pp("intermediate"), config)?; 
        // RobertaOutput(config)
        let output = RobertaOutput::load(vb.pp("output"), config)?; 
        Ok(Self {
            attention,
            intermediate,
            output,
        })
    }

    fn forward(&self, hidden_states: &Tensor) -> Result<Tensor> {
        // self.attention(hidden_states)
        let attention_output = self.attention.forward(hidden_states)?; 

        //  self.intermediate(attention_output)
        let intermediate_output = self.intermediate.forward(&attention_output)?; 
        // self.output(intermediate_output, attention_output)
        let layer_output = self
            .output
            .forward(&intermediate_output, &attention_output)?; 
        Ok(layer_output)
    }
}

j. RobertaEncoder

HuggingFace PyTorch Implementation. The Encoder is made up of a stack of RobertaLayers. The output of the Encoder is the output of the last RobertaLayer.

struct RobertaEncoder {
    layers: Vec<RobertaLayer>,
}

impl RobertaEncoder {
    fn load(vb: VarBuilder, config: &RobertaConfig) -> Result<Self> {
        // nn.ModuleList([RobertaLayer(config) for _ in range(config.num_hidden_layers)])
        let layers = (0..config.num_hidden_layers)
            .map(|index| RobertaLayer::load(vb.pp(&format!("layer.{index}")), config))
            .collect::<Result<Vec<_>>>()?; 
        Ok(RobertaEncoder { layers })
    }

    fn forward(&self, hidden_states: &Tensor) -> Result<Tensor> {
        let mut hidden_states = hidden_states.clone();

        //for i, layer_module in enumerate(self.layer):
        //  layer_outputs = layer_module(hidden_states)

        for layer in self.layers.iter() {
            hidden_states = layer.forward(&hidden_states)?
        }
        Ok(hidden_states)
    }
}

k. RobertaModel

HuggingFace PyTorch Implementation. Voilà! We have implemented all the components of the RoBERTa model. The RoBERTa model is made up of an Embedding Layer and an Encoder. The output of the RoBERTa model is the output of the Encoder.

pub struct RobertaModel {
    embeddings: RobertaEmbeddings,
    encoder: RobertaEncoder,
    pub device: Device,
}

impl RobertaModel {
    pub fn load(vb: VarBuilder, config: &RobertaConfig) -> Result<Self> {
        let (embeddings, encoder) = match (
            RobertaEmbeddings::load(vb.pp("embeddings"), config), // RobertaEmbeddings(config)
            RobertaEncoder::load(vb.pp("encoder"), config), // RobertaEncoder(config)
        ) {
            (Ok(embeddings), Ok(encoder)) => (embeddings, encoder),
            (Err(err), _) | (_, Err(err)) => {
                if let Some(model_type) = &config.model_type {
                    if let (Ok(embeddings), Ok(encoder)) = (
                        RobertaEmbeddings::load(vb.pp(&format!("{model_type}.embeddings")), config),
                        RobertaEncoder::load(vb.pp(&format!("{model_type}.encoder")), config),
                    ) {
                        (embeddings, encoder)
                    } else {
                        return Err(err);
                    }
                } else {
                    return Err(err);
                }
            }
        };
        Ok(Self {
            embeddings,
            encoder,
            device: vb.device().clone(),
        })
    }

    pub fn forward(&self, input_ids: &Tensor, token_type_ids: &Tensor) -> Result<Tensor> {
        // self.embedding(input_ids=input_ids)
        let embedding_output = self.embeddings.forward(input_ids, token_type_ids, None, None)?;
         // self.encoder(embedding_output )
        let sequence_output = self.encoder.forward(&embedding_output)?;
        Ok(sequence_output)
    }

}

Debugging the Model

Unit Tests for Different Components

It is important to write unit tests for the different components of the model to ensure that each one works as expected. Unit tests can sometimes seem time-consuming, but they pay off in the long run. Here are some unit tests I wrote during the porting process:

// Regression_test = https://github.com/huggingface/transformers/blob/21dc5859421cf0d7d82d374b10f533611745a8c5/tests/models/xlm_roberta_xl/test_modeling_xlm_roberta_xl.py#L496
#[test]
fn test_create_position_ids_from_input_embeds() -> Result<()> {

    let config = RobertaConfig::default();
    let vb = VarBuilder::zeros(DType::F32, &Device::Cpu);
    let embeddings_module = RobertaEmbeddings::load(vb, &config).unwrap();

    let input_embeds = Tensor::randn(0f32, 1f32, (2, 4, 30), &Device::Cpu).unwrap();
    let position_ids = embeddings_module.create_position_ids_from_input_embeds(&input_embeds);

    let expected_tensor: &[[u32; 4]; 2] = &[
        [0 + embeddings_module.padding_idx + 1, 1 + embeddings_module.padding_idx + 1, 2 + embeddings_module.padding_idx + 1, 3 + embeddings_module.padding_idx + 1,],
        [0 + embeddings_module.padding_idx + 1, 1 + embeddings_module.padding_idx + 1, 2 + embeddings_module.padding_idx + 1, 3 + embeddings_module.padding_idx + 1,]
    ];

    assert_eq!(position_ids.unwrap().to_vec2::<u32>()?, expected_tensor);

    Ok(())

}