Home

Awesome

AI-Powered-Text-Classifier-Harnessing-Large-Language-Models-for-Precise-Data-Categorization(Prompting Techniques/RAG)

Problem Statement

Dataset: train 40k .csv

Design and implement a classifier using any LLM to classify the data in Column name “Text” with Column name “Cat2” and Column name “Cat3”.

Report the Accuracy on a sample test set split from 40k samples.

SOLUTION

DATASET PREPARATION

In addressing this classification problem and aiming to construct a classifier using a Large Language Model (LLM), the data must be formatted in a specific manner for fine-tuning. In this case, I am opting to prepare the data in Alpaca format.The dataset has been segregated into inference data and training data.The initial 1000 datapoints are designated as inference data, while the remaining datapoints are allocated for training data.

Input to your prompt will be a text from column name Text, and output should be class name from Column Name Cat 2

Following steps are performed for data preparation.

### InstructionInstruction.### Input:input+### Output:output

For more details ,please refer the notebook :training data preparation2 .ipynb

Input to your prompt will be a text from column name Text, and output should be class name from Column Name Cat 3

Following steps are performed for data preparation.

    ### InstructionInstruction.### Input:input+### Output:output

For more details ,please refer the notebook :training data preparation2 .ipynb

MODEL FINE TUNING

For our task, we’re employing the Llama 2 model. Given that Llama 2 lacks specific knowledge about our data domain, we plan to enhance its performance by fine-tuning the model. This approach aims to yield improved results tailored to our specific domain.

Input to your prompt will be a text from column name Text, and output should be class name from Column Name Cat 2

Following steps are used for fine tuning the Model:

Input to your prompt will be a text from column name Text, and output should be class name from Column Name Cat 3

EXPERIMENT WITH FINE TUNED MODEL AND PROMPTING TECHNIQUES

We utilized Langchain along with our fine-tuned model to build the classifier.

Input to your prompt will be a text from column name Text, and output should be class name from Column Name Cat 2

Following steps are performed to build the system:

Input to your prompt will be a text from column name Text, and output should be class name from Column Name Cat 3

Following steps are performed to build the system:

RESULT & METRICS

Input to your prompt will be a text from column name Text, and output should be class name from Column Name Cat 2

For simplicity, I have selected 100 data points for inference, and within this sample, there are 24 categories.

Out of 100 records , the model is able to predict 46 records correctly . Note : The inference result can be found in csv file entitled : inference data cat2 with accuracy.csv

Input to your prompt will be a text from column name Text, and output should be class name from Column Name Cat 3

For simplicity, I have selected 100 data points for inference, and within this sample, there are 34 categories.

Out of 100 records , the model is able to predict 14 records correctly . Note : The inference result can be found in csv file entitled :** inference data cat3 with accuracy.csv**

IMPROVEMENT SUGGESTIONS

Explore, Appreciate, and Give the Repository a Shining ⭐

Feel free to explore the repository and show your appreciation by giving it a star⭐! Your support means a lot! 😉