Awesome
LLM RAG Workshop
Chat with your own data - LLM+RAG workshop
The content here is based on LLM Zoomcamp - a free course about the engineering aspects of LLMs. We will run the course again in Spring-Summer 2025. Sign up if you're interested in attending it.
If you want to run a similar workshop in your company, contact me at alexey@datatalks.club.
For this workshop, you need:
- Docker
- Python 3 (we use 3.12)
- GitHub account + VS Code (optional - if you want to use codespaces, already contains Docker and Python)
- OpenAI account (optional)
- Groq account (optional)
- HuggingFace account (optional - if you want to access some open-source LLMs in the extended version)
Plan
- LLM and RAG (theory)
- Preparing the environment (codespaces)
- Installing pipenv and direnv
- Running ElasticSearch
- Indexing and retrieving documents with ElasticSearch
- Generating the answers with OpenAI
Extended workshop:
- Creating a web interface with Streamlit
- Running LLMs locally
- Replacing OpenAI with Ollama
- Running Ollama and ElasticSearch in Docker-Compose
- Using Open-Source LLMs from HuggingFace Hub
Or
- Evaluating retrieval and RAG
LLM and RAG
I generated that with ChatGPT:
Large Language Models (LLMs)
- Purpose: Generate and understand text in a human-like manner.
- Structure: Built using deep learning techniques, especially Transformer architectures.
- Size: Characterized by having a vast number of parameters (billions to trillions), enabling nuanced understanding and generation.
- Training: Pre-trained on large datasets of text to learn a broad understanding of language, then fine-tuned for specific tasks.
- Applications: Used in chatbots, translation services, content creation, and more.
Retrieval-Augmented Generation (RAG)
- Purpose: Enhance language model responses with information retrieved from external sources.
- How It Works: Combines a language model with a retrieval system, typically a document database or search engine.
- Process:
- Queries an external knowledge source based on input.
- Integrates retrieved information into the generation process to provide contextually rich and accurate responses.
- Advantages: Improves the factual accuracy and relevance of generated text.
- Use Cases: Fact-checking, knowledge-intensive tasks like medical diagnosis assistance, and detailed content creation where accuracy is crucial.
Use ChatGPT to show the difference between generating and RAG.
What we will do:
- Index Zoomcamp FAQ documents
- Create a Q&A system for answering questions about these documents
Preparing the Environment
We will use codespaces - but it will work in any environment with Docker and Python 3
In codespaces:
- Create a repository, e.g. "llm-zoomcamp-rag-workshop"
- Start a codespace there
Libraries
Install the packages:
pip install tqdm jupyter openai elasticsearch
if you use groq:
pip install groq
LLM services
If you use OpenAI, we need the key:
- Sign up at https://platform.openai.com/ if you don't have an account
- Go to https://platform.openai.com/api-keys
- Create a new key, copy it
Let's put the key to an env variable:
export OPENAI_API_KEY="TOKEN"
For groq:
- Sign up at https://console.groq.com/
- Go to https://console.groq.com/keys
- Create a new key, copy it
- Use the
GROQ_API_KEY
env variable
Managing secrets
You can also use GitHub Codespaces secrets for better secret management.
If you don't use codespaces, you can do it with direnv:
sudo apt update
sudo apt install direnv
direnv hook bash >> ~/.bashrc
Create / edit .envrc
in your project directory:
export OPENAI_API_KEY='sk-proj-key'
or
export GROQ_API_KEY='your-key'
Make sure .envrc
is in your .gitignore
- never commit it!
echo ".envrc" >> .gitignore
Allow direnv to run:
direnv allow
Jupyter
Start a new terminal, and there run jupyter:
jupyter notebook
Elasticsearch
In another terminal, run elasticsearch with docker:
docker run -it \
--rm \
--name elasticsearch \
-m 2G \
-p 9200:9200 \
-p 9300:9300 \
-e "discovery.type=single-node" \
-e "xpack.security.enabled=false" \
docker.elastic.co/elasticsearch/elasticsearch:8.4.3
Verify that ES is running
curl http://localhost:9200
You should get something like this:
{
"name" : "63d0133fc451",
"cluster_name" : "docker-cluster",
"cluster_uuid" : "AKW1gxdRTuSH8eLuxbqH6A",
"version" : {
"number" : "8.4.3",
"build_flavor" : "default",
"build_type" : "docker",
"build_hash" : "42f05b9372a9a4a470db3b52817899b99a76ee73",
"build_date" : "2022-10-04T07:17:24.662462378Z",
"build_snapshot" : false,
"lucene_version" : "9.3.0",
"minimum_wire_compatibility_version" : "7.17.0",
"minimum_index_compatibility_version" : "7.0.0"
},
"tagline" : "You Know, for Search"
}
Retrieval
RAG consists of multiple components, and the first is R - "retrieval". For retrieval, we need a search system. In our example, we will use elasticsearch for searching.
Searching in the documents
Create a nootebook "elastic-rag" or something like that. We will use it for our experiments
First, we need to download the docs:
wget https://github.com/alexeygrigorev/llm-rag-workshop/raw/main/notebooks/documents.json
Let's load the documents
import json
with open('./documents.json', 'rt') as f_in:
documents_file = json.load(f_in)
documents = []
for course in documents_file:
course_name = course['course']
for doc in course['documents']:
doc['course'] = course_name
documents.append(doc)
Now we'll index these documents with elastic search
First initiate the connection and check that it's working:
from elasticsearch import Elasticsearch
es = Elasticsearch("http://localhost:9200")
es.info()
You should see the same response as earlier with curl
.
Before we can index the documents, we need to create an index (an index in elasticsearch is like a table in a "usual" databases):
index_settings = {
"settings": {
"number_of_shards": 1,
"number_of_replicas": 0
},
"mappings": {
"properties": {
"text": {"type": "text"},
"section": {"type": "text"},
"question": {"type": "text"},
"course": {"type": "keyword"}
}
}
}
index_name = "course-questions"
response = es.indices.create(index=index_name, body=index_settings)
response
Now we're ready to index all the documents:
from tqdm.auto import tqdm
for doc in tqdm(documents):
es.index(index=index_name, document=doc)
Retrieving the docs
user_question = "How do I join the course after it has started?"
search_query = {
"size": 5,
"query": {
"bool": {
"must": {
"multi_match": {
"query": user_question,
"fields": ["question^3", "text", "section"],
"type": "best_fields"
}
},
"filter": {
"term": {
"course": "data-engineering-zoomcamp"
}
}
}
}
}
This query:
- Retrieves top 5 matching documents.
- Searches in the "question", "text", "section" fields, prioritizing "question" using
multi_match
query with typebest_fields
(see here for more information) - Matches user query "How do I join the course after it has started?".
- Shows results only for the "data-engineering-zoomcamp" course.
Let's see the output:
response = es.search(index=index_name, body=search_query)
for hit in response['hits']['hits']:
doc = hit['_source']
print(f"Section: {doc['section']}")
print(f"Question: {doc['question']}")
print(f"Answer: {doc['text'][:60]}...\n")
Cleaning it
We can make it cleaner by putting it into a function:
def retrieve_documents(query, index_name="course-questions", max_results=5):
es = Elasticsearch("http://localhost:9200")
search_query = {
"size": max_results,
"query": {
"bool": {
"must": {
"multi_match": {
"query": query,
"fields": ["question^3", "text", "section"],
"type": "best_fields"
}
},
"filter": {
"term": {
"course": "data-engineering-zoomcamp"
}
}
}
}
}
response = es.search(index=index_name, body=search_query)
documents = [hit['_source'] for hit in response['hits']['hits']]
return documents
And print the answers:
user_question = "How do I join the course after it has started?"
response = retrieve_documents(user_question)
for doc in response:
print(f"Section: {doc['section']}")
print(f"Question: {doc['question']}")
print(f"Answer: {doc['text'][:60]}...\n")
Generation - Answering questions
Now let's do the "G" part - generation based on the "R" output
OpenAI
Today we will use OpenAI (it's the easiest to get started with). In the course, we will learn how to use open-source models
Make sure we have the SDK installed and the key is set.
This is how we communicate with ChatGPT3.5:
from openai import OpenAI
client = OpenAI()
response = client.chat.completions.create(
model="gpt-4o",
messages=[{"role": "user", "content": "The course already started. Can I still join?"}]
)
print(response.choices[0].message.content)
Groq
With groq it's almost the same:
from groq import Groq
client = Groq()
response = client.chat.completions.create(
model="llama3-8b-8192",
messages=[{"role": "user", "content": "The course already started. Can I still join?"}]
)
print(response.choices[0].message.content)
Building a Prompt
Now let's build a prompt. First, we put all the documents together in one string:
context_template = """
Section: {section}
Question: {question}
Answer: {text}
""".strip()
context_docs = retrieve_documents(user_question)
context_result = ""
for doc in context_docs:
doc_str = context_template.format(**doc)
context_result += ("\n\n" + doc_str)
context = context_result.strip()
print(context)
Now build the actual prompt:
prompt = f"""
You're a course teaching assistant. Answer the user QUESTION based on CONTEXT - the documents retrieved from our FAQ database.
Only use the facts from the CONTEXT. If the CONTEXT doesn't contan the answer, return "NONE"
QUESTION: {user_question}
CONTEXT:
{context}
""".strip()
Now we can put it to OpenAI API:
response = client.chat.completions.create(
model="gpt-4o",
messages=[{"role": "user", "content": prompt}]
)
answer = response.choices[0].message.content
answer
(Replace with llama3-8b-8192
if using groq)
Note: there are system and user prompts, we can also experiment with them to make the design of the prompt cleaner.
Cleaning
Now let's put everything together in one function:
context_template = """
Section: {section}
Question: {question}
Answer: {text}
""".strip()
prompt_template = """
You're a course teaching assistant.
Answer the user QUESTION based on CONTEXT - the documents retrieved from our FAQ database.
Don't use other information outside of the provided CONTEXT.
QUESTION: {user_question}
CONTEXT:
{context}
""".strip()
def build_context(documents):
context_result = ""
for doc in documents:
doc_str = context_template.format(**doc)
context_result += ("\n\n" + doc_str)
return context_result.strip()
def build_prompt(user_question, documents):
context = build_context(documents)
prompt = prompt_template.format(
user_question=user_question,
context=context
)
return prompt
def ask_openai(prompt, model="gpt-4o"):
response = client.chat.completions.create(
model=model,
messages=[{"role": "user", "content": prompt}]
)
answer = response.choices[0].message.content
return answer
# use ask_groq and model="llama3-8b-8192" if using groq
def qa_bot(user_question):
context_docs = retrieve_documents(user_question)
prompt = build_prompt(user_question, context_docs)
answer = ask_openai(prompt)
return answer
Now we can ask it different questions
qa_bot("I'm getting invalid reference format: repository name must be lowercase")
qa_bot("I can't connect to postgres port 5432, my password doesn't work")
qa_bot("how can I run kafka?")
What's next
- Use Open-Souce
- Build an interface, e.g. streamlit
- Deploy it
Extended version
For an extended version of this workshop, we will
- Build a UI with streamlit
- Experiment with open-source LLMs and replace OpenAI
Streamlit UI
We can build simple UI apps with streamlit. Let's install it
pipenv install streamlit
If you want to learn more about streamlit, you can use this material.
We need a simple form with
- Input box for the prompt
- Button
- Text field to display the response (in markdown)
import streamlit as st
def qa_bot(prompt):
import time
time.sleep(2)
return f"Response for the prompt: {prompt}"
def main():
st.title("DTC Q&A System")
with st.form(key='rag_form'):
prompt = st.text_input("Enter your prompt")
response_placeholder = st.empty()
submit_button = st.form_submit_button(label='Submit')
if submit_button:
response_placeholder.markdown("Loading...")
response = qa_bot(prompt)
response_placeholder.markdown(response)
if __name__ == "__main__":
main()
Let's run it
streamlit run app.py
Now we can replace the function qa_bot
. Let's create
a file rag.py
with the content from the notebook.
You can see the content of the file here.
Also, we add a special dropdown menu to select the course:
courses = [
"data-engineering-zoomcamp",
"machine-learning-zoomcamp",
"mlops-zoomcamp"
]
zoomcamp_option = st.selectbox("Select a zoomcamp", courses)
Open-Source LLMs
There are many open-source LLMs. We will use two platforms:
- Ollama for running on CPU
- HuggingFace for running on GPU
Ollama
The easiest way to run an LLM without a GPU is using Ollama
Note that the 2 core codespaces instance is not enough. For this part it's better to create a separate instance with 4 cores.
You can also run it locally. I have 8 cores on my laptop, so it's faster than doing it on codespaces.
Installing for Linux:
curl -fsSL https://ollama.com/install.sh | sh
Installing for other OS - check the Ollama website. I successfully tested it on Windows too.
Let's run it:
# in one terminal
ollama start
# in another terminal
ollama run phi3
Prompt example:
Question: I just discovered the couse. can i still enrol
Context:
Course - Can I still join the course after the start date? Yes, even if you don't register, you're still eligible to submit the homeworks. Be aware, however, that there will be deadlines for turning in the final projects. So don't leave everything for the last minute.
Environment - Is Python 3.9 still the recommended version to use in 2024? Yes, for simplicity (of troubleshooting against the recorded videos) and stability. [source] But Python 3.10 and 3.11 should work fine.
How can we contribute to the course? Star the repo! Share it with friends if you find it useful ❣️ Create a PR if you see you can improve the text or the structure of the repository.
Answer:
Ollama's API is compatible with OpenAI's python client, so we can use it by changing only a few lines of code:
from openai import OpenAI
client = OpenAI(
base_url='http://localhost:11434/v1/',
api_key='ollama',
)
response = client.chat.completions.create(
model='phi3',
messages=[{"role": "user", "content": prompt}]
)
response.choices[0].message.content
That's it! Now let's put everything in Docker
Ollama + Elastic in Docker
We already know how to run Elasticsearch in Docker:
docker run -it \
--rm \
--name elasticsearch \
-p 9200:9200 \
-p 9300:9300 \
-e "discovery.type=single-node" \
-e "xpack.security.enabled=false" \
docker.elastic.co/elasticsearch/elasticsearch:8.4.3
This is how we run Ollama in Docker:
docker run -it \
--rm \
--name ollama \
-v ollama:/root/.ollama \
-p 11434:11434 \
ollama/ollama
When we run it, we need to log in to the container to download the phi3 model:
docker exec -it ollama bash
ollama pull phi3
After pulling the model, we can query it with OpenAI's python package. Because we do volume mapping, the model files will stay in the container across multiple runs.
Let's now combine them into one docker-compose file.
Create a docker-compose.yaml
file with both Ollama and Elasticsearch.
And now run it:
docker-compose up
HuggingFace Hub
Ollama can run locally on a CPU. But there are many models that require a GPU.
For running them, we will use Colab or other notebook platform with a GPU (for example, SaturnCloud). Let's stop our codespace for now.
In Colab, you need to enable GPU:
- Create a notebook: https://colab.research.google.com/#create=true
- Runtime -> Change runtime type -> T4 GPU
!nvidia-smi
to verify you have a GPU
Now we need to install the dependencies:
!pip install -U transformers accelerate bitsandbytes
Also, it's tricky to run Elasticsearch on Colab, so we will replace it with minsearch - a simple in-memory search library:
!pip install minsearch
Let's get the data and create an index:
import requests
docs_url = 'https://github.com/alexeygrigorev/llm-rag-workshop/raw/main/notebooks/documents.json'
docs_response = requests.get(docs_url)
documents_raw = docs_response.json()
documents = []
for course in documents_raw:
course_name = course['course']
for doc in course['documents']:
doc['course'] = course_name
documents.append(doc)
import minsearch
index = minsearch.Index(
text_fields=["question", "text", "section"],
keyword_fields=["course"]
)
index.fit(documents)
Searching with minsearch:
query = "I just discovered the course, can I still join?"
filter_dict = {"course": "data-engineering-zoomcamp"}
boost_dict = {"question": 3}
index.search(query, filter_dict, boost_dict, num_results=5)
Let's replace our search function:
def retrieve_documents(query, max_results=5):
filter_dict = {"course": "data-engineering-zoomcamp"}
boost_dict = {"question": 3}
return index.search(query, filter_dict, boost_dict, num_results=5)
We will use Google's FLAN T5 model: google/flan-t5-xl
.
Downloading and loading it:
from transformers import T5ForConditionalGeneration, T5Tokenizer
model_name = "google/flan-t5-xl"
model = T5ForConditionalGeneration.from_pretrained(model_name, device_map="auto")
tokenizer = T5Tokenizer.from_pretrained(model_name, legacy=False)
tokenizer.model_max_length = 4096
Using it:
input_text = "translate English to German: How old are you?"
input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to("cuda")
outputs = model.generate(input_ids)
print(tokenizer.decode(outputs[0]))
Let's put it to a function:
def llm(prompt):
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to("cuda")
outputs = model.generate(input_ids, )
result = tokenizer.decode(outputs[0])
return result
Everything together:
context_template = """
Section: {section}
Question: {question}
Answer: {text}
""".strip()
prompt_template = """
You're a course teaching assistant.
Answer the user QUESTION based on CONTEXT - the documents retrieved from our FAQ database.
Don't use other information outside of the provided CONTEXT.
QUESTION: {user_question}
CONTEXT:
{context}
""".strip()
def build_context(documents):
context_result = ""
for doc in documents:
doc_str = context_template.format(**doc)
context_result += ("\n\n" + doc_str)
return context_result.strip()
def build_prompt(user_question, documents):
context = build_context(documents)
prompt = prompt_template.format(
user_question=user_question,
context=context
)
return prompt
def qa_bot(user_question):
context_docs = retrieve_documents(user_question)
prompt = build_prompt(user_question, context_docs)
answer = llm(prompt)
return answer
Making the answers longer:
def llm(prompt, generate_params=None):
if generate_params is None:
generate_params = {}
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to("cuda")
outputs = model.generate(
input_ids,
max_length=generate_params.get("max_length", 100),
num_beams=generate_params.get("num_beams", 5),
do_sample=generate_params.get("do_sample", False),
temperature=generate_params.get("temperature", 1.0),
top_k=generate_params.get("top_k", 50),
top_p=generate_params.get("top_p", 0.95),
)
result = tokenizer.decode(outputs[0], skip_special_tokens=True)
return result
Explanation of the parameters:
max_length
: Set this to a higher value if you want longer responses. For example,max_length=300
.num_beams
: Increasing this can lead to more thorough exploration of possible sequences. Typical values are between 5 and 10.do_sample
: Set this toTrue
to use sampling methods. This can produce more diverse responses.temperature
: Lowering this value makes the model more confident and deterministic, while higher values increase diversity. Typical values range from 0.7 to 1.5.top_k
andtop_p
: These parameters control nucleus sampling.top_k
limits the sampling pool to the topk
tokens, whiletop_p
uses cumulative probability to cut off the sampling pool. Adjust these based on the desired level of randomness.
Final notebook:
Other models:
microsoft/Phi-3-mini-128k-instruct
mistralai/Mistral-7B-v0.1
- And many more
Alternative - evaluation
Retrieval evaluation
Using hitrate to evaluate
First, we generate ground truth data (notebook) (also add ID to documents)
prompt_template = """
You emulate a student who's taking our course.
Formulate 5 questions this student might ask based on a FAQ record. The record
should contain the answer to the questions, and the questions should be complete and not too short.
If possible, use as fewer words as possible from the record.
The record:
section: {section}
question: {question}
answer: {text}
Provide the output in parsable JSON without using code blocks:
["question1", "question2", ..., "question5"]
wget https://github.com/DataTalksClub/llm-zoomcamp/blob/main/03-vector-search/eval/ground-truth-data.csv?raw=1
mv ground-truth-data.csv?raw=1 ground-truth-data.csv
Then evaluate it:
relevance_total = []
for q in tqdm(ground_truth):
doc_id = q['document']
results = elastic_search(query=q['question'], course=q['course'])
relevance = [d['id'] == doc_id for d in results]
relevance_total.append(relevance)
Metrics:
def hit_rate(relevance_total):
cnt = 0
for line in relevance_total:
if True in line:
cnt = cnt + 1
return cnt / len(relevance_total)
def mrr(relevance_total):
total_score = 0.0
for line in relevance_total:
for rank in range(len(line)):
if line[rank] == True:
total_score = total_score + 1 / (rank + 1)
return total_score / len(relevance_total)
Cosine
Q->A->Q cosine similarity (see here)
from sentence_transformers import SentenceTransformer
model_name = 'multi-qa-MiniLM-L6-cos-v1'
model = SentenceTransformer(model_name)
answer_orig = 'Yes, sessions are recorded if you miss one. Everything is recorded, allowing you to catch up on any missed content. Additionally, you can ask questions in advance for office hours and have them addressed during the live stream. You can also ask questions in Slack.'
answer_llm = 'Everything is recorded, so you won’t miss anything. You will be able to ask your questions for office hours in advance and we will cover them during the live stream. Also, you can always ask questions in Slack.'
v_llm = model.encode(answer_llm)
v_orig = model.encode(answer_orig)
v_llm.dot(v_orig)
LLM as a Judge
prompt1_template = """
You are an expert evaluator for a Retrieval-Augmented Generation (RAG) system.
Your task is to analyze the relevance of the generated answer compared to the original answer provided.
Based on the relevance and similarity of the generated answer to the original answer, you will classify
it as "NON_RELEVANT", "PARTLY_RELEVANT", or "RELEVANT".
Here is the data for evaluation:
Original Answer: {answer_orig}
Generated Question: {question}
Generated Answer: {answer_llm}
Please analyze the content and context of the generated answer in relation to the original
answer and provide your evaluation in parsable JSON without using code blocks:
{{
"Relevance": "NON_RELEVANT" | "PARTLY_RELEVANT" | "RELEVANT",
"Explanation": "[Provide a brief explanation for your evaluation]"
}}
""".strip()
prompt2_template = """
You are an expert evaluator for a Retrieval-Augmented Generation (RAG) system.
Your task is to analyze the relevance of the generated answer to the given question.
Based on the relevance of the generated answer, you will classify it
as "NON_RELEVANT", "PARTLY_RELEVANT", or "RELEVANT".
Here is the data for evaluation:
Question: {question}
Generated Answer: {answer_llm}
Please analyze the content and context of the generated answer in relation to the question
and provide your evaluation in parsable JSON without using code blocks:
{{
"Relevance": "NON_RELEVANT" | "PARTLY_RELEVANT" | "RELEVANT",
"Explanation": "[Provide a brief explanation for your evaluation]"
}}
""".strip()
Conclusions
That was fun - thanks!