
🦙🛁 Cleaned Alpaca Dataset

Welcome to the Cleaned Alpaca Dataset repository! This repository hosts a cleaned and curated version of a dataset used to train the Alpaca LLM (Large Language Model). The original dataset had several issues that are addressed in this cleaned version.

On April 8, 2023, the remaining uncurated instructions (~50,000) were replaced with data from the GPT-4-LLM dataset. Curation of the incoming GPT-4 data is ongoing.

A 7B LoRA model (trained on April 8, 2023) is available on Hugging Face at yahma/alpaca-7b-lora.
A 13B LoRA model (trained on April 9, 2023) is available on Hugging Face at yahma/alpaca-13b-lora.

Dataset Quality and its Impact on Model Performance

One possible reason why the fine-tuned 13B Alpaca model shows no significant improvement over the 7B model is the quality of the original dataset. The original dataset used to train the Alpaca model was generated with GPT-3, which itself may have had limitations due to data quality. Further evidence of poor data quality is that fine-tuning on the original dataset resulted in poor loss curves.

The quality of a dataset plays a crucial role in determining the performance of the natural language processing models trained on it. A dataset that is noisy, inconsistent, or incomplete can result in poor performance even with the most advanced models. In contrast, a high-quality dataset can enable a model to perform well with fewer parameters.

Therefore, it is possible that with better data, we could improve the performance of the models more than what would be gained by simply increasing model size.

Benchmark Results

Using EleutherAI's lm-evaluation-harness, we compare LoRA models fine-tuned on various datasets.

Dataset	Model	Parameters	WikiText (ppl)	MNLI (acc)	PIQA (acc norm)
Original Alpaca	samwit/alpaca7B-lora	7b (lora)	9.5396	38.33	78.51
Cleaned Alpaca (Mar 26)	tloen/alpaca-lora-7b	7b (lora)	9.4885	51.6	79.33
GPT4All	nomic-ai/gpt4all-lora	7b (lora)	10.09	38.97	78.40

We also ran the Hallucination Trivia Test for Large Language Models (HALTT4LLM):

Model Name	Truthful QA	HQ Trivia	Fake Questions	NOTA Questions
GPT4All	79.51%	88.47%	74.16%	70.32%
GPT-3.5	39.95%	59.33%	81.81%	51.93%
GPT-3	32.15%	55.67%	6.10%	32.25%
Llama-7B-4bit	83.51%	49.75%	2.15%	8.38%
Alpaca-7B-4bit	26.66%	44.32%	0.00%	0.00%
Alpaca-7B-4bit cleaned	37.00%	-	-	12.90%

Models trained on the cleaned dataset appear to hallucinate less and perform better than models trained on the original dataset.

Data Cleaning and Curation

Alpaca is a fine-tuned version of LLaMA that was trained on an instruction dataset generated by GPT-3. The generated dataset was designed to be diverse; however, recent analysis indicates that it is very US-centric. The original dataset used to train the Alpaca LLM was found to have many issues that impact its quality and usefulness for training a machine learning model.

Issues with the Original Dataset

The original dataset had several issues that likely affect the final performance of the fine-tuned model. These issues include:

Hallucinations: Many instructions in the original dataset referenced data on the internet, which simply caused GPT-3 to hallucinate an answer.

"instruction":"Summarize the given article in 200 Words.",
"input": "https://www.bbc.com/news/world-51461830",
"output": "The recent protests in Belarus are part of a growing wave of discontent against the rule of Alexander Lukashenko, who has led the country since 1994. Demonstrations against his government have been ongoing since August, and come in the wake of an election widely seen as rigged. As police and security forces have cracked down on protesters, the movement has only grown, with hundreds of thousands of people taking to the streets to call for new leadership. Human rights activists have accused the government of using extreme violence in order prosecute protesters, leading to an international outcry and sanctions from the European Union. As the situation in Belarus continues to unfold, the international community is monitoring the situation closely - hoping for a peaceful resolution, while also standing firm against the oppressive tactics employed by the Belarusian state."

"instruction": "Classify the following shape according to simple geometry.",
"input": "<Picture Attached>",
"output": "Triangle"

"instruction": "Given a recent news article, write a headline for it.",
"input": "https://www.news.example.com/article-1",
"output": "\"NASA Launches a Mars 2020 Mission to Search for Signs of Ancient Life\""

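One way to flag entries like these automatically is to look for inputs that point at content the generating model could never see. The sketch below is only a heuristic under stated assumptions: the function name `references_external_data` and both regular expressions are hypothetical and are not part of this repository's tooling.

```python
import re

# Assumed patterns: a bare http(s) URL, or an angle-bracket placeholder
# such as "<Picture Attached>".
URL_RE = re.compile(r"https?://\S+", re.IGNORECASE)
ATTACHMENT_RE = re.compile(r"<[^<>]*\battached\b[^<>]*>", re.IGNORECASE)


def references_external_data(entry: dict) -> bool:
    """Return True if the input field refers to a URL or an attachment."""
    text = entry.get("input", "")
    return bool(URL_RE.search(text) or ATTACHMENT_RE.search(text))


entries = [
    {"instruction": "Classify the following shape according to simple geometry.",
     "input": "<Picture Attached>", "output": "Triangle"},
    {"instruction": "Calculate the area of the following shape in square centimeters.",
     "input": "rectangle of size 4 cm x 5 cm", "output": ""},
]
flagged = [e for e in entries if references_external_data(e)]
print(len(flagged))
```

Flagged entries would still need manual review, since a URL in the input field is not always a problem.
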
Merged Instructions: Many instructions in the original dataset were merged together, for reasons that are unclear.

"instruction": "Should I invest in stocks?",
"input": "",
"output": "It depends on your individual situation. Investing in stocks carries a certain amount of risk but can also offer potential rewards. Make sure to research the different types of stocks, understand the risks and rewards, and develop a plan that works for you.\n##\n18. Instruction: Identify the noun in the following sentence:\n18. Input:\nThe large apple was delicious.\n18. Output:\nApple"

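Merged entries of this kind can often be detected by the numbered "N. Instruction:" marker left inside the output. The following is a minimal sketch, assuming the marker always has the shape seen in the example above; the name `split_merged_output` and the regular expression are hypothetical, and other merge patterns may exist in the data.

```python
import re
from typing import Tuple

# Assumed marker shape, taken from the example above: an optional "##"
# separator followed by a numbered "N. Instruction:" line.
MERGE_RE = re.compile(r"\s*(?:##\s*)?\d+\.\s*Instruction:", re.IGNORECASE)


def split_merged_output(output: str) -> Tuple[str, bool]:
    """Return (output cut before the first merge marker, whether one was found)."""
    match = MERGE_RE.search(output)
    if match is None:
        return output, False
    return output[:match.start()].rstrip(), True


# Shortened version of the merged example above.
merged = (
    "It depends on your individual situation.\n##\n"
    "18. Instruction: Identify the noun in the following sentence:\n"
    "18. Input:\nThe large apple was delicious.\n18. Output:\nApple"
)
cleaned, was_merged = split_merged_output(merged)
print(cleaned)
print(was_merged)
```

The text after the marker is discarded here; it could instead be parsed into a separate entry if the second instruction is worth keeping.
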
Empty outputs: Some entries in the original dataset had empty outputs.

"instruction": "Calculate the area of the following shape in square centimeters.",
"input": "rectangle of size 4 cm x 5 cm",
"output": ""

Empty code examples: Some descriptions in the original dataset were missing code examples, making it difficult to understand the intended behavior of the code.
Instructions to generate images: Some descriptions in the original dataset included instructions to generate images, which a text-only model cannot do.

"instruction": "Create a graphic or logo that visually represents the word \"courage\".",
"input": "",
"output": "<No Output>"

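Empty and placeholder outputs like the ones above can be found with a simple membership check. This is a sketch only: the set `PLACEHOLDER_OUTPUTS` and the function `has_missing_output` are hypothetical names, and the exact placeholder spellings in the dataset (for example how "N/A" is written) are an assumption.

```python
# Assumed spellings of "no answer", compared after trimming and lowercasing.
PLACEHOLDER_OUTPUTS = {"", "<no output>", "n/a", "<noinput>"}


def has_missing_output(entry: dict) -> bool:
    """Return True if the output is empty or only a known placeholder."""
    return entry.get("output", "").strip().lower() in PLACEHOLDER_OUTPUTS


entries = [
    {"instruction": "Calculate the area of the following shape in square centimeters.",
     "input": "rectangle of size 4 cm x 5 cm", "output": ""},
    {"instruction": "Create a graphic or logo that visually represents the word \"courage\".",
     "input": "", "output": "<No Output>"},
    {"instruction": "Classify the following shape according to simple geometry.",
     "input": "<Picture Attached>", "output": "Triangle"},
]
missing = [e for e in entries if has_missing_output(e)]
print(len(missing))
```
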
N/A outputs: Some code snippets in the original dataset had N/A outputs.
Inconsistent input field: The original dataset had inconsistent usage of the input field when it was supposed to be empty.

"input":"<no input>"
"input":"No input"
"input":"noinput"
"input":"<noinput>"

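These variants can all be mapped to an empty string before training. The sketch below assumes that any input which reduces to the letters "noinput" is a placeholder; `normalize_input` is a hypothetical helper, not the repository's own cleaning code.

```python
import re


def normalize_input(value: str) -> str:
    """Map placeholder spellings of 'no input' to the empty string."""
    # Keep only the letters, so "<no input>", "No input" and "<noinput>"
    # all reduce to the same key.
    key = re.sub(r"[^a-z]", "", value.lower())
    return "" if key == "noinput" else value


for raw in ["<no input>", "No input", "noinput", "<noinput>", "Yogurt sample"]:
    print(repr(raw), "->", repr(normalize_input(raw)))
```

A real input that happens to consist only of the words "no input" would also be cleared, so the rule is deliberately narrow.
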
Wrong answers: Some instructions/questions in the original dataset had incorrect answers. About 80% of the math problems are estimated to have incorrect answers.

"instruction": "Calculate the median of the following data set.",
"input": "1, 2, 4, 5, 8, 9",
"output": "5"

"instruction": "Convert 25m to km.",
"input": "",
"output": "25km"

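Both examples above can be checked with a few lines of arithmetic. This snippet is only an illustration of how such answers might be verified; the variable names are hypothetical.

```python
from statistics import median

# Median of an even-sized data set: the mean of the two middle values (4 and 5).
data = [1, 2, 4, 5, 8, 9]
correct_median = median(data)

# Metres to kilometres: divide by 1000.
meters = 25
kilometers = meters / 1000

print(correct_median)  # 4.5, not 5
print(kilometers)      # 0.025, not 25
```
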
Nonsensical/Unclear instructions: Many instructions are unclear. We try to clarify (or rewrite) instructions that are nonsensical. Instructions that are slightly unclear, but whose meaning can be deduced, are not altered.

"instruction": "Freeze the following sample of yogurt for 10 minutes.",
"input": "Yogurt sample",
"output": "<noinput>"

"instruction": "Increase the font size to 12 points.",
"input": "",
"output": "The font size has been increased to 12 points."

Extraneous escape and control characters: The original dataset had several entries with extraneous escape and control characters.
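
Control characters can be removed by filtering on their Unicode category. The following is a minimal sketch, assuming that newlines and tabs should be kept and every other control character dropped; `strip_control_chars` is a hypothetical helper, and it does not handle literal escape sequences written out as text.

```python
import unicodedata

# Control characters that are assumed to be legitimate formatting.
KEEP = {"\n", "\t"}


def strip_control_chars(text: str) -> str:
    """Drop characters in Unicode category Cc, except newline and tab."""
    return "".join(
        ch for ch in text
        if ch in KEEP or unicodedata.category(ch) != "Cc"
    )


print(repr(strip_control_chars("Hello\x00 wor\x1bld\n")))
```

Carriage returns are also in category Cc, so Windows-style line endings become plain newlines under this rule.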

Finetune Considerations

Compared to the original Alpaca dataset, the average prompt length in this dataset has increased, and a larger number of prompts exceed a length of 256. For this reason, it is recommended to set the maximum prompt length during fine-tuning to at least 512.
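
To see how many prompts a given limit would truncate, the share of entries above that limit can be measured before training. The sketch below rests on several assumptions: the prompt is taken to be the instruction, input, and output joined by newlines, and tokens are approximated by whitespace splitting. In practice `count_tokens` should be replaced with the tokenizer of the model being fine-tuned; `share_over_limit` is a hypothetical name.

```python
from typing import Callable, Iterable


def share_over_limit(
    entries: Iterable[dict],
    limit: int = 256,
    count_tokens: Callable[[str], int] = lambda s: len(s.split()),
) -> float:
    """Return the fraction of entries whose prompt is longer than `limit`."""
    lengths = [
        count_tokens("\n".join([e["instruction"], e.get("input", ""), e.get("output", "")]))
        for e in entries
    ]
    if not lengths:
        return 0.0
    return sum(n > limit for n in lengths) / len(lengths)


sample = [
    {"instruction": "Convert 25m to km.", "input": "", "output": "0.025 km"},
    {"instruction": "Summarize the text.", "input": "word " * 300, "output": "A summary."},
]
print(share_over_limit(sample, limit=256))
```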

Distribution of Prompt Lengths

Hugging Face Hub

The cleaned dataset is also available on the Hugging Face Hub.

Contributions

With over 52k entries, several issues still exist. Please help out by submitting a pull request.

Goals

The primary goal of this project is to provide a cleaned and curated version of the Alpaca dataset that will improve the performance of natural language processing models trained on this data. By removing errors and inconsistencies, we aim to improve the performance of the fine-tuned LLaMA models and reduce the likelihood of hallucinations.

License

All the code and supporting tools are licensed under Apache-2.0. The original and cleaned Alpaca datasets are licensed under CC BY NC 4.0 (allowing only non-commercial use), and models trained using the datasets should not be used outside of research purposes.

Acknowledgments

The original version of the Alpaca dataset was sourced from tatsu-lab's GitHub repository. We would like to thank the original creators of these datasets for making their data available to the public. We would also like to thank the team at Meta AI for their work in developing LLaMA. Finally, thanks to Q-Blocks Cloud for donating compute resources, allowing us to provide fine-tuned models.