spacy-setfit

This repository contains an easy and intuitive approach to using SetFit in combination with spaCy.

Installation

Before using spaCy with SetFit, make sure you have the necessary packages installed. You can install them using pip:

pip install spacy spacy-setfit

Additionally, you might want to download a spaCy model, for example:

python -m spacy download en_core_web_sm

Getting Started

To use spaCy with SetFit, use the following code:

import spacy

# Create some example data
train_dataset = {
    "inlier": ["This text is about chairs.",
               "Couches, benches and televisions.",
               "I really need to get a new sofa."],
    "outlier": ["Text about kitchen equipment",
                "This text is about politics",
                "Comments about AI and stuff."]
}

# Load the spaCy language model:
nlp = spacy.load("en_core_web_sm")

# Add the "spacy_setfit" pipeline component to the spaCy model, and configure it with SetFit parameters:
nlp.add_pipe("spacy_setfit", config={
    "pretrained_model_name_or_path": "paraphrase-MiniLM-L3-v2",
    "setfit_trainer_args": {
        "train_dataset": train_dataset
    }
})
doc = nlp("I really need to get a new sofa.")
doc.cats
# {'inlier': 0.902350975129, 'outlier': 0.097649024871}

The code above processes the input text with the spaCy model, and the doc.cats attribute returns the predicted categories and their associated probabilities.
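Because doc.cats is a plain dictionary of labels and scores, picking the predicted label takes only a few lines. The following is a minimal sketch that uses the example output shown above as a stand-in for a real doc.cats; the helper name top_category and its threshold parameter are hypothetical and not part of spaCy or spacy-setfit.

```python
def top_category(cats, threshold=0.5):
    """Return (label, score) for the highest-scoring category.

    Returns None when `cats` is empty or the best score is below `threshold`.
    """
    if not cats:
        return None
    label, score = max(cats.items(), key=lambda item: item[1])
    return (label, score) if score >= threshold else None


# Stand-in for doc.cats, copied from the example output above.
cats = {"inlier": 0.902350975129, "outlier": 0.097649024871}

best = top_category(cats)
print(best)
```

With a real pipeline you would call top_category(doc.cats) instead; the threshold of 0.5 is an arbitrary illustrative default.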

That's it! You have now successfully integrated spaCy with SetFit for text categorization tasks. You can further customize and train the model using additional data or adjust the SetFit parameters as needed.

Feel free to explore more features and documentation of spaCy and SetFit to enhance your text classification projects.

setfit_trainer_args

The setfit_trainer_args are a simplified version of the official args from the SetFit library.

Arguments

The arguments are only summarized here. For detailed information and usage examples, refer to the official SetFit library documentation.

Usage

To use the setfit_trainer_args, you can create a dictionary with the desired values for the arguments. Here's an example:

# train_data, eval_data and column_map are placeholders for objects you define yourself.
setfit_trainer_args = {
    "train_dataset": train_data,
    "eval_dataset": eval_data,
    "num_iterations": 20,
    "num_epochs": 1,
    "learning_rate": 2e-5,
    "batch_size": 16,
    "seed": 42,
    "column_mapping": column_map,
    "use_amp": False
}
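In the Getting Started example, setfit_trainer_args is passed as a nested key of the component config. The sketch below assembles that config from a separately built dictionary. The helper build_config is hypothetical, and the training data and argument values are illustrative only.

```python
def build_config(model_name, trainer_args):
    """Assemble a config dict shaped like the one in Getting Started."""
    return {
        "pretrained_model_name_or_path": model_name,
        # Copy so later changes to the caller's dict do not leak into the config.
        "setfit_trainer_args": dict(trainer_args),
    }


# Illustrative training data in the label -> list of texts layout used above.
train_dataset = {
    "inlier": ["This text is about chairs."],
    "outlier": ["This text is about politics"],
}

trainer_args = {
    "train_dataset": train_dataset,
    "num_iterations": 20,
    "seed": 42,
}

config = build_config("paraphrase-MiniLM-L3-v2", trainer_args)

# With spacy and spacy-setfit installed, the config would then be used as:
# nlp.add_pipe("spacy_setfit", config=config)
```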

setfit_from_pretrained_args

The setfit_from_pretrained_args are a simplified version of the official args from the SetFit library and Hugging Face transformers.

Arguments

The arguments are only summarized here. For detailed information and usage examples, refer to the official SetFit library documentation.

Usage

To use the setfit_from_pretrained_args, you can create a dictionary with the desired values for the arguments. Here's an example:

setfit_from_pretrained_args = {
    'pretrained_model_name_or_path': '',  # str or Path
    'revision': None,  # str, optional
    'force_download': False,  # bool, optional
    'resume_download': False,  # bool, optional
    'proxies': None,  # Dict[str, str], optional
    'token': None,  # str or bool, optional
    'cache_dir': None,  # str or Path, optional
    'local_files_only': False,  # bool, optional
    'model_kwargs': None  # Dict, optional
}

Pretrained SetFit models

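You can also load an already trained SetFit model instead of training one. Pass its name as pretrained_model_name_or_path and leave out setfit_trainer_args: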

import spacy

# Load the spaCy language model:
nlp = spacy.load("en_core_web_sm")

# Add the "spacy_setfit" pipeline component to the spaCy model
nlp.add_pipe("spacy_setfit", config={
    "pretrained_model_name_or_path": "lewtun/my-awesome-setfit-model",
})
doc = nlp("I really need to get a new sofa.")
doc.cats

Saving and Loading models

You can use the pickle module in Python to save and load instances of the pre-trained pipeline. pickle allows you to serialize Python objects, including custom classes, into a binary format that can be saved to a file and loaded back into memory later. Here's an example of how to save and load using pickle:

import pickle

nlp = ...

# Save nlp pipeline
with open("my_cool_model.pkl", "wb") as file:
    pickle.dump(nlp, file)

# Load nlp pipeline
with open("my_cool_model.pkl", "rb") as file:
    nlp = pickle.load(file)

doc = nlp("I really need to get a new sofa.")
doc.cats
# {'inlier': 0.902350975129, 'outlier': 0.097649024871}
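The save and load steps above can be wrapped in two small helpers. This is a sketch only: save_pipeline and load_pipeline are hypothetical names, and a plain dictionary stands in for the nlp object so the snippet runs without spaCy installed.

```python
import pickle
import tempfile
from pathlib import Path


def save_pipeline(obj, path):
    """Pickle `obj` to `path`, creating parent directories if needed."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as file:
        pickle.dump(obj, file)
    return path


def load_pipeline(path):
    """Load a pickled object. Only load files you trust: unpickling can run arbitrary code."""
    with Path(path).open("rb") as file:
        return pickle.load(file)


# Stand-in for the nlp pipeline (assumption: any picklable object works the same way).
stand_in = {"labels": ["inlier", "outlier"]}

with tempfile.TemporaryDirectory() as tmp:
    saved = save_pipeline(stand_in, Path(tmp) / "models" / "my_cool_model.pkl")
    restored = load_pipeline(saved)

print(restored == stand_in)
```

In real use, pass the nlp object in place of stand_in and a permanent path in place of the temporary directory.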

Logo Reference

Quotation by Adrien Coquet from Noun Project (https://thenounproject.com/browse/icons/term/quotation/)