
Using LLMs on private data, all locally

This project is a learning exercise on using large language models (LLMs) to retrieve information from private data, running all pieces (including the model) locally. The goal is to run an LLM on your computer to ask questions about a set of files that are also on your computer. The files can be any type of document, such as PDF, Word, or text files.

This method of combining LLMs and private data is known as retrieval-augmented generation (RAG). It was introduced in this paper.

Credit where credit is due: I based this project on the original privateGPT (what they now call the primordial version). I reimplemented the pieces to understand how they work. See more in the sources section.

What we are trying to achieve: given a set of files on a computer (A), we want a large language model (B) running on that computer to answer questions (C) on them.

What we are trying to achieve

However, we cannot feed the files directly to the model. Large language models have a context window (their working memory) that limits how much information we can feed into them. To overcome that limitation, we split the files into smaller pieces, called chunks, and feed only the relevant ones to the model (D).

Solution part 1

But then, the question becomes "how do we find the relevant chunks?". We use similarity search (E) to match the question and the chunks. Similarity search, in turn, requires vector embeddings (F), a representation of text as vectors that encode semantic relationships (technically, dense vector embeddings, not to be confused with sparse vector representations such as bag-of-words and TF-IDF). Once we have the relevant chunks, we combine them with the question to create a prompt (G) that instructs the LLM to answer the question.

Solution part 2
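
To make the idea concrete, here is a minimal sketch of similarity search over dense embeddings. It assumes the sentence-transformers package and the all-MiniLM-L6-v2 model, which are illustrative choices, not necessarily the ones this project uses.

# Minimal sketch of similarity search with dense vector embeddings.
# Assumes the sentence-transformers package; the model name is illustrative.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

chunks = [
    "The invoice is due on the first business day of each month.",
    "Our cat prefers to sleep on the warm laptop keyboard.",
    "Late payments incur a 2% fee after a 10-day grace period.",
]
question = "When do I have to pay the invoice?"

# Encode the question and the chunks into dense vectors (F).
chunk_vectors = model.encode(chunks, convert_to_tensor=True)
question_vector = model.encode(question, convert_to_tensor=True)

# Cosine similarity ranks the chunks by semantic closeness to the question (E).
scores = util.cos_sim(question_vector, chunk_vectors)[0]
best = int(scores.argmax())
print(f"Most relevant chunk: {chunks[best]!r} (score={float(scores[best]):.2f})")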

We need one last piece: persistent storage. Creating embeddings for the chunks takes time. We don't want to do that every time we ask a question. Therefore, we need to save the embeddings and the original text (the chunks) in a vector store (or database) (H). The vector store can grow large because it stores the original text chunks and their vector embeddings. We use a vector index (I) to find relevant chunks efficiently.

Solution part 3

Now we have all the pieces we need.

We can divide the implementation into two parts: ingesting and retrieving data.

  1. Ingestion: The goal is to divide the local files into smaller chunks that fit into the LLM input size (context window). We also need to create vector embeddings for each chunk. The vector embeddings allow us to find the most relevant chunks to help answer the question. Because chunking and embedding take time, we want to do that only once, so we save the results in a vector store (database).
  2. Retrieval: Given a user question, we use similarity search to find the most relevant chunks (i.e. the pieces of the local files related to the question). Once we determine the most relevant chunks, we can use the LLM to answer the question. To do so, we combine the user question with the relevant chunks and a prompt instructing the LLM to answer the question.

These two steps are illustrated in the following diagram.

Ingestion and retrieval

How to use this project

If you haven't done so yet, prepare the environment. If you have already prepared the environment, activate it with source venv/bin/activate.

There are two ways to use this project:

  1. Command line interface: use this one to see more logs and understand what is going on (see the --verbose flag below).
  2. Streamlit app: use this one for a more user-friendly experience.

Command-line interface

  1. Copy the files you want to use into the data folder.
  2. Run python main.py ingest to ingest the files into the vector store.
  3. Run python main.py retrieve to retrieve data from the vector store. It will prompt you for a question.

Use the --verbose flag to get more details on what the program is doing behind the scenes.

To re-ingest the data, delete the vector_store folder and run python main.py ingest again.

Streamlit app

Run streamlit run app.py. It will open the app in a browser window.

This command may fail the first time you run it. There is a glitch somewhere in how the Python environment works together with pyenv. If Streamlit shows a "cannot import module" message, deactivate the Python environment with deactivate, activate it again with source venv/bin/activate, and run streamlit run app.py again.

Design

Ingesting data

If you haven't done so yet, prepare the environment. If you have already prepared the environment, activate it with source venv/bin/activate.

Command: python main.py ingest [--verbose]

The goal of this stage is to make the data searchable. However, the user's question and the data contents may not match exactly. Therefore, we cannot use a simple search engine. We need to perform a similarity search supported by vector embeddings. The vector embedding is the most important part of this stage.

Ingesting data has the following steps:

  1. Load the file: a document reader that matches the document type is used to load the file. At this point, we have an array of characters with the file contents (a "document" from now on). Metadata, pictures, etc., are ignored.
  2. Split the document into chunks: a document splitter divides the document into chunks of the specified size. We need to split the document to fit the context size of the model (and to send fewer tokens when using a paid model). The exact size of each chunk depends on the document splitter. For example, a sentence splitter attempts to split at the sentence level, making some chunks smaller than the specified size.
  3. Create vector embeddings for each chunk: an embedding model creates a vector embedding for each chunk. This is the crucial step that allows us to find the most relevant chunks to help answer the question.
  4. Save the embeddings into the vector database (store): persist all the work we did above so we don't have to repeat it in the future.
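
For reference, the sketch below condenses these four steps using LangChain-style components. The specific choices (the unstructured PDF loader, Chroma as the vector store, the MiniLM embedding model, the chunk size) are assumptions for illustration; the actual code in this repository may use different components and parameters.

# Sketch of the ingestion pipeline. Assumes an older LangChain API, Chroma as
# the vector store, and a sentence-transformers embedding model; these are
# illustrative choices, not necessarily what this repository uses.
from langchain.document_loaders import UnstructuredPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import Chroma

# 1. Load the file with a reader that matches the document type.
documents = UnstructuredPDFLoader("data/example.pdf").load()

# 2. Split the document into chunks that fit the model's context window.
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = splitter.split_documents(documents)

# 3. Create a vector embedding for each chunk, and
# 4. persist the chunks and their embeddings in the vector store.
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
store = Chroma.from_documents(chunks, embeddings, persist_directory="vector_store")
store.persist()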

Future improvements:

  • More intelligent document parsing. For example, do not mix figure captions with the section text; do not parse the reference section (alternatively, replace the inline references with the actual reference text).
  • Improve parallelism. Ideally, we want to run the entire workflow (load document, chunk, embed, persist) in parallel for each file. This requires a solution that parallelizes not only I/O-bound but also CPU-bound tasks. The vector store must also support multiple writers. See the sketch after this list for one possible approach.
  • Try different chunking strategies, e.g. check if sentence splitters (NLTKTextSplitter or SpacyTextSplitter) improve the answers.
  • Choose chunking size based on the LLM input (context) size. It is currently hardcoded to a small number, which may affect the quality of the results. On the other hand, it saves costs on the LLM API. We need to find a balance.
  • Automate the ingestion process: detect if there are new or changed files and ingest them.
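
On the parallelism point above: a minimal sketch of per-file parallel ingestion with a process pool. This is a possible approach, not the current implementation; the chunking below is a placeholder for the real load/split/embed pipeline, and persistence stays in the main process on the assumption that the vector store supports only one writer.

# Sketch of per-file parallel ingestion (a possible approach, not the current
# implementation). CPU-bound work runs in worker processes; persisting stays
# in the main process, assuming the vector store allows only a single writer.
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path

CHUNK_SIZE = 500  # illustrative value


def load_and_chunk(path: Path) -> list[str]:
    """Load one file and split it into fixed-size chunks.

    A real version would also embed the chunks here, so the expensive
    CPU-bound work happens inside the worker process.
    """
    text = path.read_text(errors="ignore")
    return [text[i:i + CHUNK_SIZE] for i in range(0, len(text), CHUNK_SIZE)]


if __name__ == "__main__":
    files = sorted(Path("data").glob("*.txt"))
    with ProcessPoolExecutor() as pool:
        for path, chunks in zip(files, pool.map(load_and_chunk, files)):
            # Persist sequentially here (single writer to the vector store).
            print(f"{path.name}: {len(chunks)} chunks ready to persist")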

Retrieving data

If you haven't done so yet, prepare the environment. If you have already prepared the environment, activate it with source venv/bin/activate.

Command: python main.py retrieve [--verbose]

The goal of this stage is to retrieve information from the local data. We do that by fetching the most relevant chunks from the vector store and combining them with the user's question and a prompt. The prompt instructs the language model (LLM) to answer the question.

Retrieving data has the following steps:

  1. Find the most relevant chunks: the vector store is queried to find the chunks most relevant to the question.
  2. Combine the chunks with the question and a prompt: the chunks are combined with the question and a prompt. The prompt instructs the LLM to answer the question.
  3. Send the combined text to the LLM: the combined text is sent to the LLM to get the answer.
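
The sketch below condenses these three steps with LangChain-style components. It assumes the same Chroma store and embedding model used in the ingestion sketch and a GPT4All model file in the models folder; file names and parameters are illustrative, and the actual code may differ.

# Sketch of the retrieval stage. Assumes an older LangChain API, the Chroma
# store and embedding model from the ingestion sketch, and a GPT4All model
# file; names and parameters are illustrative.
from langchain.chains import RetrievalQA
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.llms import GPT4All
from langchain.vectorstores import Chroma

embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
store = Chroma(persist_directory="vector_store", embedding_function=embeddings)
llm = GPT4All(model="models/mistral-7b-openorca.Q4_0.gguf")  # example file name

# 1. The retriever finds the chunks most relevant to the question.
# 2. and 3. The chain combines chunks, question, and prompt, and calls the LLM.
qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=store.as_retriever(search_kwargs={"k": 4}),
)
print(qa.run("What is the late payment fee?"))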

Future improvements:

  • Add LangChain callbacks to view the steps of the retrieval process.
  • Improve the prompt to answer only with what is in the local documents, e.g. "Use only information from the following documents: ...". Without this step the model seems to dream up an answer from its training data, which is not always relevant. See the prompt sketch after this list.
  • Add moderation to filter out offensive answers.
  • Improve the answers with reranking: "over-fetch our search results, and then deterministically rerank based on a modifier or set of modifiers".
  • Try different chain types (related to the previous point).
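
On the prompt improvement above: a sketch of a more restrictive prompt. The wording is an example, not the prompt this project currently uses; with LangChain it could be passed to the chain through chain_type_kwargs.

# Sketch of a prompt that restricts the LLM to the retrieved documents.
# The wording is illustrative, not the prompt this project currently uses.
from langchain.prompts import PromptTemplate

template = """Use only the information in the following documents to answer
the question. If the documents do not contain the answer, say you don't know.

Documents:
{context}

Question: {question}
Answer:"""

prompt = PromptTemplate(template=template, input_variables=["context", "question"])

# Wiring it into the retrieval chain from the sketch above (hypothetical):
# qa = RetrievalQA.from_chain_type(
#     llm, chain_type="stuff", retriever=store.as_retriever(),
#     chain_type_kwargs={"prompt": prompt},
# )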

Improving results

We had to make some compromises to make it run on a local machine in a reasonable amount of time.

  • We use a small model. This one is hard to change. The model has to run on a CPU and fit in memory.
  • We use a small embedding size. We can increase the embedding size if we wait longer for the ingestion process.
  • Keep everything else the same and try different chain types.

Sources

Most of the ingest/retrieve code is based on the original privateGPT, the one they now call primordial.

What is different:

  • Streamlit app for the UI.
  • Updated to newer embedding and large language model versions.
  • Modernized the Python code. For example, it uses pathlib instead of os.path and has proper logging instead of print statements.
  • Added more logging to understand what is going on. Use the --verbose flag to see the details.
  • Added a main program to run the ingest/retrieve steps.
  • Filled in requirements.txt with the indirect dependencies, for example, for HuggingFace transformers and LangChain document loaders.

See this file for more notes collected during the development of this project.

Preparing the environment

This is a one-time step. If you have already done this, just activate the virtual environment with source venv/bin/activate.

Python environment

Run the following commands to create a virtual environment and install the required packages.

python3 -m venv venv
source venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt

PDF parsing

The PDF parser in unstructured is a layer on top of the actual parser packages. Follow the instructions in the unstructured README, under the "Install the following system dependencies" bullets. The poppler and tesseract packages are required (ignore the others).

Model

I suggest starting with a small model that runs on a CPU. GPT4All has a list of models here. I tested with mistral-7b-openorca Q4, which requires 8 GB of RAM to run. Note that some of the models have restrictive licenses. Check the license before using them in commercial projects.

  1. Create a folder named models.
  2. Click here to download Mistral 7B OpenOrca (3.8 GB download, 8 GB RAM).
  3. Copy the model to the models folder.
