# Llama Datasets 🦙📝
This repo is a companion to the llama-hub repo and is meant to be the actual storage of data files associated with a llama-dataset. Like tools, loaders, and llama-packs, llama-datasets are offered through llama-hub. You can view all of the available llama-hub artifacts conveniently on the llama-hub website.
The primary use of a llama-dataset is for evaluating the performance of a RAG system. In particular, it serves as a new test set (in traditional machine learning speak) for one to build a RAG system over, predict on, and subsequently perform evaluations comparing the predicted responses versus the reference responses.
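To make "reference responses" concrete, here is a minimal sketch of what a single example in a `LabelledRagDataset` carries, assuming the `LabelledRagDataExample`, `CreatedBy`, and `CreatedByType` classes exported by `llama_index.llama_dataset`; treat the field values as illustrative:

```python
from llama_index.llama_dataset import (
    CreatedBy,
    CreatedByType,
    LabelledRagDataExample,
    LabelledRagDataset,
)

# a single labelled example: a query plus the reference (ground-truth)
# answer and the reference contexts that support it
example = LabelledRagDataExample(
    query="What did the author do growing up?",
    query_by=CreatedBy(type=CreatedByType.HUMAN),
    reference_contexts=["Before college, I worked on writing and programming."],
    reference_answer="The author worked on writing and programming.",
    reference_answer_by=CreatedBy(type=CreatedByType.HUMAN),
)

# a llama-dataset is a collection of such examples
rag_dataset = LabelledRagDataset(examples=[example])
rag_dataset.save_json("rag_dataset.json")  # serialize for sharing
```

When you build a RAG system over the dataset's source documents and predict on each `query`, the evaluation then compares your predicted responses against these reference answers and contexts.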
## How to add a llama-dataset
Similar to the process of adding a tool / loader / llama-pack, adding a llama-dataset also requires forking the llama-hub repo and making a Pull Request. However, for a llama-dataset, only its metadata is checked into the llama-hub repo. The actual dataset and its source files are instead checked into this repo. You will need to fork and clone the llama-hub repo in addition to forking and cloning this one.
### Forking and cloning this repository
After forking this repo to your own GitHub account, the next step is to clone your fork. This repository is configured with Git LFS, so without special care you may end up downloading large files to your local machine. As such, when the time comes to clone your fork, please ensure that you set the environment variable `GIT_LFS_SKIP_SMUDGE` prior to calling the `git clone` command:
```bash
# for bash
GIT_LFS_SKIP_SMUDGE=1 git clone git@github.com:<your-github-user-name>/llama-datasets.git  # for ssh
GIT_LFS_SKIP_SMUDGE=1 git clone https://github.com/<your-github-user-name>/llama-datasets.git  # for https

# for windows, it's done in two commands
set GIT_LFS_SKIP_SMUDGE=1
git clone git@github.com:<your-github-user-name>/llama-datasets.git  # for ssh

set GIT_LFS_SKIP_SMUDGE=1
git clone https://github.com/<your-github-user-name>/llama-datasets.git  # for https
```
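With smudging skipped, the large files are checked out as small LFS pointer stubs. If you later need the actual data for a particular dataset, you can fetch it selectively with `git lfs pull`; a minimal sketch, where the directory path is illustrative and should be replaced with the dataset's actual location in the repo:

```bash
# fetch the real LFS files for just one dataset when you need them
# (path is illustrative -- substitute the dataset's actual directory)
git lfs pull --include="llama_datasets/paul_graham_essay/*"
```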
To submit a llama-dataset, follow the submission template notebook.

The high-level steps are:

- Create a `LabelledRagDataset` (the initial class of llama-dataset made available on llama-hub)
- Generate a baseline result with a RAG system of your own choosing on the `LabelledRagDataset`
- Prepare the dataset's metadata (`card.json` and `README.md`); a sketch of `card.json` follows this list
- Submit a Pull Request to the llama-hub repo to check in the metadata
- Submit a Pull Request to this llama-datasets repo to check in the `LabelledRagDataset` and the source files

(NOTE: you can use the above process for submitting any of our other supported types of llama-datasets, such as the `LabelledEvaluatorDataset`.)
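For the metadata step, `card.json` is a small JSON file describing the dataset. A minimal sketch, written in Python for convenience; the field names and values here are assumptions for illustration, so consult the submission template notebook for the authoritative schema:

```python
import json

# illustrative card.json contents -- field names are assumptions,
# not the authoritative schema; see the submission template notebook
card = {
    "name": "Paul Graham Essay",
    "className": "LabelledRagDataset",
    "description": "A labelled RAG dataset based on Paul Graham's essays.",
    "numberObservations": 44,
    "containsExamplesByHumans": False,
    "containsExamplesByAi": True,
    "sourceUrls": ["http://www.paulgraham.com/articles.html"],
    "baselines": [],
}

with open("card.json", "w") as f:
    json.dump(card, f, indent=4)
```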
## Usage Pattern
(NOTE: in what follows we present the pattern for producing a RAG benchmark with the `RagEvaluatorPack` over a `LabelledRagDataset`. However, there are also other types of llama-datasets, such as the `LabelledEvaluatorDataset`, with corresponding llama-packs for producing benchmarks on their respective tasks. They all follow a similar usage pattern. Please refer to their READMEs to learn more about each type of llama-dataset.)

As mentioned earlier, llama-datasets are mainly used for evaluating RAG systems. To perform the evaluation, the recommended usage pattern involves applying the `RagEvaluatorPack`. We recommend reading the docs for the "Evaluation" module for more information.
```python
from llama_index import VectorStoreIndex
from llama_index.llama_dataset import download_llama_dataset
from llama_index.llama_pack import download_llama_pack

# download the benchmark dataset and its source documents
rag_dataset, documents = download_llama_dataset("PaulGrahamEssayDataset", "./data")

# build a basic RAG system over the source documents
index = VectorStoreIndex.from_documents(documents=documents)
query_engine = index.as_query_engine()

# evaluate using the RagEvaluatorPack
RagEvaluatorPack = download_llama_pack("RagEvaluatorPack", "./rag_evaluator_pack")
rag_evaluator_pack = RagEvaluatorPack(
    rag_dataset=rag_dataset,
    query_engine=query_engine,
)
benchmark_df = rag_evaluator_pack.run()  # async arun() supported as well
```
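The returned `benchmark_df` is a pandas DataFrame summarizing the benchmark; it typically reports mean scores for metrics such as correctness, relevancy, faithfulness, and context similarity, though the exact columns may vary with the pack version.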
Llama-datasets can also be downloaded directly using `llamaindex-cli`, which comes installed with the `llama-index` Python package:

```bash
llamaindex-cli download-llamadataset PaulGrahamEssayDataset --download-dir ./data
```
After downloading with `llamaindex-cli`, you can inspect the dataset and its source files (stored in a `/source_files` directory) and then load them into Python:
```python
from llama_index import SimpleDirectoryReader
from llama_index.llama_dataset import LabelledRagDataset

# load the labelled examples and the raw source documents
rag_dataset = LabelledRagDataset.from_json("./data/rag_dataset.json")
documents = SimpleDirectoryReader(input_dir="./data/source_files").load_data()
```
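Once loaded, a quick way to inspect the dataset is its pandas view. A minimal sketch, assuming the `to_pandas` helper and `examples` attribute exposed by `LabelledRagDataset`:

```python
# tabular view of the examples (query, reference contexts, reference answer, ...)
df = rag_dataset.to_pandas()
print(df.head())

# individual examples can also be accessed directly
first = rag_dataset.examples[0]
print(first.query)
print(first.reference_answer)
```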