6. Building the RAG Pipeline
Our python_faq_retrieval_tool is currently just a promise; it relies on an faq_engine object to do the heavy lifting of vector search, but we haven’t built this engine yet. This component is the heart of the “Retrieval” part of our RAG system. It will be responsible for two critical tasks:
- Indexing: Taking our plain text data, converting it into numerical representations (embeddings), and storing them efficiently in our Qdrant vector database.
- Searching: Providing a method that can take a user’s query, convert it into an embedding, and search the database for the most relevant pieces of stored information.
To accomplish this, we’ll orchestrate several powerful libraries. We’ll use llama-index for its robust embedding capabilities, qdrant-client to communicate with our database, and a few standard Python utilities to handle the data processing efficiently.
Let’s begin by creating the rag_app.py file and adding the necessary imports that will form the backbone of our pipeline.
import uuid
from itertools import islice
from typing import List, Dict, Any, Generator
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from tqdm import tqdm
from qdrant_client import models, QdrantClient
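If you haven’t installed these dependencies yet, they’re all available on PyPI; depending on your environment and llama-index version, something like `pip install llama-index-embeddings-huggingface qdrant-client tqdm` should cover the imports above.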
The first step in building our pipeline is to define the knowledge base itself. This is the proprietary data our agent will search through to answer questions related to Python.
In our rag_app.py file, add our source text just below the imports:
PYTHON_FAQ_TEXT = """
Question: What is the difference between a list and a tuple in Python?
Answer: Lists are mutable, meaning their elements can be changed, while tuples are immutable. Lists use square brackets `[]` and tuples use parentheses `()`.
Question: What are Python decorators?
Answer: Decorators are a design pattern in Python that allows a user to add new functionality to an existing object without modifying its structure. They are often used for logging, timing, and access control.
Question: How does Python's Global Interpreter Lock (GIL) work?
Answer: The GIL is a mutex that protects access to Python objects, preventing multiple threads from executing Python bytecode at the same time. This means that even on a multi-core processor, only one thread can execute Python code at once.
Question: What is the difference between `==` and `is` in Python?
Answer: `==` checks for equality of value (do two objects have the same content?), while `is` checks for identity (do two variables point to the same object in memory?).
Question: Explain list comprehensions and their benefits.
Answer: List comprehensions provide a concise way to create lists. They are often more readable and faster than using traditional `for` loops and `.append()` calls.
Question: What is `*args` and `**kwargs` in function definitions?
Answer: `*args` allows you to pass a variable number of non-keyword arguments to a function, which are received as a tuple. `**kwargs` allows you to pass a variable number of keyword arguments, received as a dictionary.
Question: What are Python's magic methods (e.g., `__init__`, `__str__`)?
Answer: Magic methods, or dunder methods, are special methods that you can define to add "magic" to your classes. They are invoked by Python for built-in operations, like `__init__` for object creation or `__add__` for the `+` operator.
Question: How does error handling work in Python?
Answer: Python uses `try...except` blocks to handle exceptions. Code that might raise an error is placed in the `try` block, and the code to handle the exception is placed in the `except` block.
Question: What is the purpose of the `if __name__ == "__main__":` block?
Answer: This block ensures that the code inside it only runs when the script is executed directly, not when it is imported as a module into another script.
Question: What are generators in Python?
Answer: Generators are a simple way to create iterators. They are functions that use the `yield` keyword to return a sequence of values one at a time, saving memory for large datasets.
"""
For this guide, we’re embedding our knowledge directly into the script as a multi-line string. In a production application, you would typically load this data from a collection of documents, like text files, PDFs, or a database, but the principle remains the same.
This PYTHON_FAQ_TEXT variable contains a curated list of common Python questions and answers. Each Q&A pair is a self-contained chunk of information, making it an ideal format for our vector database, and the blank line between pairs is what we’ll later use to split the text into individual chunks. When we index this data, each chunk will become a searchable unit that our agent can retrieve.
With our data defined, we now need to build the engine that can process, index, and query it. This is where the FAQEngine class comes in. It’s the central component of our RAG pipeline, encapsulating all the logic for interacting with both the embedding model and the Qdrant vector database.
First, we’ll add a small helper function to our rag_app.py. This function will help us process our data in manageable chunks, which is crucial for memory efficiency and performance when dealing with large datasets.
def batch_generator(data: List[Any], batch_size: int) -> Generator[List[Any], None, None]:
    """Yields successive batch_size-sized chunks from a list."""
    for i in range(0, len(data), batch_size):
        yield data[i : i + batch_size]
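As a quick sanity check (this snippet is purely illustrative; you don’t need to add it to rag_app.py), a list of five items with a batch size of two yields batches of two, two, and one elements:

```python
# Illustrative only: exercise the helper on a tiny list.
for batch in batch_generator(["a", "b", "c", "d", "e"], batch_size=2):
    print(batch)

# ['a', 'b']
# ['c', 'd']
# ['e']
```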
Now, let’s define the main class that puts everything together.
class FAQEngine:
    """
    An engine for setting up and querying a FAQ database using Qdrant and HuggingFace embeddings.
    """

    def __init__(self,
                 qdrant_url: str = "http://localhost:6333",
                 collection_name: str = "python-faq",
                 embed_model_name: str = "nomic-ai/nomic-embed-text-v1.5"):
        self.collection_name = collection_name

        print("Loading embedding model...")
        self.embed_model = HuggingFaceEmbedding(
            model_name=embed_model_name,
            trust_remote_code=True
        )
        self.vector_dim = len(self.embed_model.get_text_embedding("test"))
        print(f"Embedding model loaded. Vector dimension: {self.vector_dim}")

        self.client = QdrantClient(url=qdrant_url, prefer_grpc=True)
        print("Connected to Qdrant.")
    @staticmethod
    def parse_faq(text: str) -> List[str]:
        """Parses the raw FAQ text into a list of Q&A strings."""
        return [
            qa.replace("\n", " ").strip()
            for qa in text.strip().split("\n\n")
        ]
    def setup_collection(self, faq_contexts: List[str], batch_size: int = 64):
        """
        Creates a Qdrant collection (if it doesn't exist) and ingests the FAQ data.
        """
        try:
            self.client.get_collection(collection_name=self.collection_name)
            print(f"Collection '{self.collection_name}' already exists. Skipping creation.")
        except Exception:
            print(f"Creating collection '{self.collection_name}'...")
            self.client.create_collection(
                collection_name=self.collection_name,
                vectors_config=models.VectorParams(
                    size=self.vector_dim,
                    distance=models.Distance.DOT
                )
            )

            print(f"Embedding and ingesting {len(faq_contexts)} documents...")
            for batch in tqdm(batch_generator(faq_contexts, batch_size),
                              total=(len(faq_contexts) + batch_size - 1) // batch_size,
                              desc="Ingesting FAQ data"):
                embeddings = self.embed_model.get_text_embedding_batch(batch, show_progress=False)
                points = [
                    models.PointStruct(
                        id=str(uuid.uuid4()),
                        vector=vector,
                        payload={"context": context}
                    )
                    for context, vector in zip(batch, embeddings)
                ]
                self.client.upload_points(
                    collection_name=self.collection_name,
                    points=points,
                    wait=False
                )
            print("Data ingestion complete.")

            print("Updating collection indexing threshold...")
            self.client.update_collection(
                collection_name=self.collection_name,
                optimizer_config=models.OptimizersConfigDiff(indexing_threshold=20000)
            )

        print("Collection setup is finished.")
    def answer_question(self, query: str, top_k: int = 3) -> str:
        """
        Searches the vector database for a given query and returns the most relevant contexts.
        """
        query_embedding = self.embed_model.get_query_embedding(query)

        search_result = self.client.search(
            collection_name=self.collection_name,
            query_vector=query_embedding,
            limit=top_k,
            score_threshold=0.5
        )

        if not search_result:
            return "I couldn't find a relevant answer in my knowledge base."

        relevant_contexts = [hit.payload["context"] for hit in search_result]

        # Prepend the header once, then separate the individual contexts with a divider.
        formatted_output = (
            "Here are the most relevant pieces of information I found:\n\n"
            + "\n\n---\n\n".join(relevant_contexts)
        )
        return formatted_output
The constructor, __init__, is responsible for setting up all the necessary components. It initializes the HuggingFaceEmbedding model from llama-index, which will handle the conversion of text to vectors.
A key detail here is that we dynamically determine the vector_dim by embedding a test string; this makes our code robust and adaptable to different embedding models. Finally, it establishes a connection to our Qdrant database. The print statements provide helpful feedback to the user when the script is run.
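Note that this assumes a Qdrant instance is already reachable at localhost:6333 (and port 6334 for gRPC, since we pass prefer_grpc=True), for example one started via Docker with `docker run -p 6333:6333 -p 6334:6334 qdrant/qdrant`. And because the dimension is probed at runtime rather than hard-coded, switching to another embedding model is a one-line change; the model name below is just an illustrative example:

```python
# Hypothetical example: a different HuggingFace embedding model, same code path.
# The constructor probes the model once, so vector_dim adapts automatically
# (e.g., 384 for BAAI/bge-small-en-v1.5 instead of 768 for the Nomic model).
engine = FAQEngine(embed_model_name="BAAI/bge-small-en-v1.5")
```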
Before we can index our data, we need to clean it up. The parse_faq static method is a simple utility for this. It takes our raw PYTHON_FAQ_TEXT and splits it into a clean list of individual Question/Answer strings, which is the perfect format for our database.
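Given the blank-line-separated text above, the result is one flattened string per Q&A pair:

```python
chunks = FAQEngine.parse_faq(PYTHON_FAQ_TEXT)

print(len(chunks))  # 10
print(chunks[0])    # "Question: What is the difference between a list and a tuple in Python? Answer: Lists are mutable, ..." (truncated)
```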
The setup_collection method is where the indexing magic happens. It first checks if a Qdrant collection with our specified name already exists, preventing us from wastefully re-indexing data every time we start the application. If it doesn’t exist, it creates a new one, configuring it with the correct vector size and distance metric.
Then, it begins the main ingestion loop, using our batch_generator helper to process the FAQ data in efficient chunks. Inside this loop, a three-step process occurs for each batch:
- Embed: It converts the batch of text chunks into a list of numerical vectors.
- Structure: It packages each vector into a Qdrant PointStruct, which includes a unique ID and a payload containing the original text. This payload is crucial, as it allows us to retrieve the human-readable context after a search.
- Upload: It sends these points to the Qdrant collection.
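Put together, indexing our FAQ data only takes a few lines. Here is a minimal sketch of how these pieces fit together when the engine is set up:

```python
# Minimal sketch: parse the raw FAQ text and index it into Qdrant.
engine = FAQEngine()
faq_chunks = FAQEngine.parse_faq(PYTHON_FAQ_TEXT)
engine.setup_collection(faq_chunks)
```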
Finally, the answer_question method is what our agent’s tool will call. This is the “retrieval” step. It takes a user’s query, converts it into a vector using the same embedding model, and then uses the client.search function to find the most similar vectors in our database.
It then extracts the original text from the payload of the top results and formats it into a single, clean string. This formatted text is the “context” that will ultimately be used to generate a high-quality answer.
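For instance, a retrieval call might look like this (the exact hits depend on the embedding model and the 0.5 score threshold, so the output below is only illustrative):

```python
# Illustrative retrieval call; the printed output is abbreviated.
context = engine.answer_question("How do tuples differ from lists?")
print(context)
# Here are the most relevant pieces of information I found:
#
# Question: What is the difference between a list and a tuple in Python? Answer: Lists are mutable, ...
#
# ---
#
# ... (any further hits above the score threshold follow, separated the same way)
```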