The following section shows an overview of the project's artifacts including the thoughts and "raison d'être" behind every component.
The following list shows the relevant notebooks of this project. Every Jupyter Notebook was also exported to an .html file so the outputs can be viewed without running a Jupyter server. These exports can be found in the notebooks/exports folder.
- Main Notebook: The main notebook (main.ipynb in the project's root) was used to explore different ways to build a RAG pipeline. It is the main artifact of this project and holds the team's observations and conclusions on the results that the baseline system and its extensions yielded. The notebook guides you through all explorations, starting from monitoring, through preprocessing, chunking and embedding, to all experiments. DISCLAIMER: If you plan to run main.ipynb, please consult the Setup section inside the main notebook, which gives more information on the available runtime settings (i.e. for caching and monitoring). Also make sure to read the Getting Started section below.
- Exploration Notebook: The exploration notebook (in /notebooks/exploration.ipynb) holds the exploratory data analysis of the challenge's dataset (Cleantech Media Dataset). The observations in that notebook led to many of the measures taken in the preprocessing step of the project, found in /src/preprocessing.py.
- MVP Notebook: The MVP notebook (in /notebooks/mvp.ipynb) holds the team's first fully working RAG pipeline, which was then consolidated in the main.ipynb notebook.
- Eval Mapping: To evaluate each explored RAG system, it was necessary to map each evaluation sample to its relevant chunk. This entire process is described in /notebooks/eval_mapping.ipynb.
The scripts used in this challenge are listed here.
- Subset Generation: To allow development in a lightweight environment, the generate_subset.py script reduces the dataset to a given number of samples n. This way the vector store doesn't need to ingest the full dataset, which saves time whenever the embedding or retrieval step changes.
- Testset Generation: In addition to the subset generation, there is a testset generator (in /scripts/generate_testset.py) that acts as a wrapper around the RAGAS TestsetGenerator. This way we are able to control how the evaluation set is distributed.
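The idea behind the subset generation can be illustrated with a minimal sketch. This is not the code of generate_subset.py; the function name sample_subset and the seed parameter are assumptions made for illustration. It only shows how a dataset can be reduced to n samples in a reproducible way.

```python
import random
from typing import Sequence, TypeVar

T = TypeVar("T")


def sample_subset(rows: Sequence[T], n: int, seed: int = 42) -> list[T]:
    """Return up to n rows drawn without replacement (hypothetical helper)."""
    if n < 0:
        raise ValueError("n must be non-negative")
    # Never ask for more rows than the dataset contains.
    n = min(n, len(rows))
    # A dedicated, seeded generator keeps the subset stable across runs.
    rng = random.Random(seed)
    return rng.sample(list(rows), n)
```

A fixed seed means the same subset is produced on every run, so a vector store built from it stays comparable between experiments.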
There are numerous processes under the hood that are used in main.ipynb. These components are listed here:
- Evaluation: /src/evaluation.py holds the main component for evaluation, namely the Evaluator. This class steers the entire evaluation process for all explored experiments inside main.ipynb. This abstraction provides a common place for all explored systems to be rated with the same metrics and the same adaptations of the evaluation set. Since we mainly used RAGAS for evaluation, we also had to wrap our evaluation set into a RAGAS-digestible format. To achieve that, evaluation.py also holds a DatasetCreator that aligns our evaluation set with the RAGAS structure.
- Generation: The /src/generation.py module streamlines the LLM that is used inside the main.ipynb notebook. The get_llm_model function allows us to control the model that should be used and its temperature in one place.
- Preprocessing: The Preprocessor class inside /src/preprocessing.py allows for a controlled and streamlined way of removing unwanted noise in the Cleantech dataset. The preprocessing measures taken mainly stem from observations made inside the exploration.ipynb notebook. This component removes duplicate chunks, non-English chunks and chunks with special characters that wouldn't add meaningful information.
- Vector Store: The VectorStore class inside /src/vector_store.py wraps around the Chroma object, allowing for a structured way to control how the vector store can be accessed.
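To make the preprocessing measures more concrete, here is a minimal sketch of two of them: dropping duplicate chunks and dropping chunks dominated by special characters. It is not the project's Preprocessor class; the function names and the MAX_SPECIAL_RATIO threshold are assumptions chosen for illustration, and language filtering is left out because it would require an external language-detection library.

```python
import re

# Hypothetical threshold: chunks with a larger share of special
# characters than this are treated as noise.
MAX_SPECIAL_RATIO = 0.3


def special_char_ratio(text: str) -> float:
    """Share of characters that are neither alphanumeric nor whitespace."""
    if not text:
        return 1.0
    special = sum(1 for ch in text if not (ch.isalnum() or ch.isspace()))
    return special / len(text)


def clean_chunks(chunks: list[str]) -> list[str]:
    """Drop duplicate chunks and chunks dominated by special characters."""
    seen: set[str] = set()
    kept: list[str] = []
    for chunk in chunks:
        # Normalise whitespace and case so trivially different copies match.
        key = re.sub(r"\s+", " ", chunk).strip().lower()
        if not key or key in seen:
            continue
        if special_char_ratio(key) > MAX_SPECIAL_RATIO:
            continue
        seen.add(key)
        kept.append(chunk)
    return kept
```

Comparing on a normalised key while keeping the original text means the first occurrence of each chunk is preserved unchanged.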
The cache folder holds the cached evaluation results of the experiments that were conducted in main.ipynb. The cache is used to speed up the evaluation process and to avoid re-running the token- and time-intensive evaluation for each experiment.
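The caching idea can be sketched as follows. This is an illustration under assumptions, not the project's implementation: the function name cached_evaluation, the JSON file format and the one-file-per-experiment layout are all hypothetical.

```python
import json
from pathlib import Path
from typing import Callable


def cached_evaluation(
    name: str,
    evaluate: Callable[[], dict],
    cache_dir: Path = Path("cache"),
) -> dict:
    """Return cached results for an experiment, evaluating only on a miss."""
    cache_file = cache_dir / f"{name}.json"
    if cache_file.exists():
        # Cache hit: skip the expensive evaluation entirely.
        return json.loads(cache_file.read_text(encoding="utf-8"))
    # Cache miss: run the evaluation once and persist its result.
    result = evaluate()
    cache_dir.mkdir(parents=True, exist_ok=True)
    cache_file.write_text(json.dumps(result), encoding="utf-8")
    return result
```

With this pattern, deleting a single cache file is enough to force one experiment to be re-evaluated while the others stay cached.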
The excerpt on the usage of assistants like ChatGPT and GitHub Copilot can be found in USE-OF-AI.md in the root folder of the project.
In order to save time on indexing the ChromaDB dataset, we provide a pre-indexed SQLite database. Download the ChromaDB SQLite database from the following link: ChromaDB.
After downloading the database, place it in the root directory of the project. The expected path structure should look like this:
npr-rag/
    chroma/
        chroma.sqlite3
        ...
    ...
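A quick way to confirm that the database ended up in the expected location is a small path check. This is a sketch only; the helper names are hypothetical and project_root is assumed to point at the npr-rag/ folder.

```python
from pathlib import Path


def chroma_db_path(project_root: Path) -> Path:
    """Expected location of the pre-indexed database inside the project."""
    return project_root / "chroma" / "chroma.sqlite3"


def chroma_db_available(project_root: Path) -> bool:
    """True if the SQLite file exists at the expected path."""
    return chroma_db_path(project_root).is_file()
```

For example, calling chroma_db_available(Path.cwd()) from the project root returns False until the downloaded file has been placed correctly.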
To configure access to the OpenAI API for the project, start by duplicating the default.env file and renaming the copy to .env. Once copied, update the environment variables according to your API access details.
For OpenAI API Users:
- Locate the OPENAI_API_KEY variable in your .env file.
- Replace the placeholder ... with your actual OpenAI API key.
- If you are not using Azure OpenAI, ensure this is the only API key line un-commented.
For Azure OpenAI API Users: If you are using the Azure OpenAI API, follow these steps instead:
- Comment out or remove the OPENAI_API_KEY line.
- Fill in the AZURE_OPENAI_API_KEY with your Azure API key.
- Update AZURE_OPENAI_ENDPOINT with your specific Azure endpoint URL.
- Set the AZURE_OPENAI_DEPLOYMENT to your designated deployment ID.
Additional Settings:
- The TOKENIZERS_PARALLELISM variable should be set to false to avoid parallelism in tokenizers, which can lead to better performance in certain environments.
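The rules above can be checked with a small validation sketch. It is not part of the project; the function names are hypothetical, and treating the placeholder ... as "not set" is an assumption based on the placeholder mentioned above. Only the variable names come from the .env description.

```python
from typing import Mapping

AZURE_VARS = (
    "AZURE_OPENAI_API_KEY",
    "AZURE_OPENAI_ENDPOINT",
    "AZURE_OPENAI_DEPLOYMENT",
)


def _is_set(env: Mapping[str, str], name: str) -> bool:
    """Treat missing, empty and placeholder ('...') values as unset."""
    value = env.get(name, "").strip()
    return bool(value) and value != "..."


def detect_provider(env: Mapping[str, str]) -> str:
    """Return 'openai' or 'azure' depending on which variables are filled in."""
    has_openai = _is_set(env, "OPENAI_API_KEY")
    azure_set = [name for name in AZURE_VARS if _is_set(env, name)]
    if has_openai and azure_set:
        # Only one provider's key lines should be active at a time.
        raise ValueError("Both OpenAI and Azure OpenAI variables are set")
    if has_openai:
        return "openai"
    if len(azure_set) == len(AZURE_VARS):
        return "azure"
    missing = [name for name in AZURE_VARS if name not in azure_set]
    raise ValueError(f"No complete configuration; missing: {missing}")
```

After the .env values have been loaded into the process environment, detect_provider(os.environ) reports which API the configuration selects or raises if the setup is incomplete.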
- Docker: Install from here.
- Docker Compose: Install from Docker Compose Installation Guide.
- Start the JupyterLab server:
  docker-compose up
  Access the server at http://localhost:8888. The project directory is mounted within the container for real-time file synchronization.
- Build the Docker image:
  docker build -t npr-rag-jupyterlab .
- Run the Docker container:
  docker run -p 8888:8888 -v "$(pwd):/usr/src/app" npr-rag-jupyterlab
  Navigate to http://localhost:8888 in your web browser.