diff --git a/README.md b/README.md
index 5dcd92dfe5b9db666908ff3f24a3ebafb65a26aa..7474178287d520b25dcf26aeef52999d648acefe 100644
--- a/README.md
+++ b/README.md
@@ -1,7 +1,71 @@
 # README
 
-## Future Work
+This repo is intended to serve as a simple example of how to use `ollama` to build a rudimentary Retrieval Augmented Generation (RAG) model using the UAB RC Documentation. This repo assumes you are using Cheaha at UAB (<https://rc.uab.edu>). If not, some of the commands will have to be modified.
+
+There isn't anything particularly special or magical about RAG at a basic level. I found breaking through the terminology more challenging than actually building a conceptual understanding of what is happening.
+
+1. Take some data of interest, like the [UAB RC Documentation](https://docs.rc.uab.edu).
+1. Generate a database of vectors embedded in some latent space. This is generally referred to colloquially as the "Embedding". These embedding vectors are typically generated by some embedding-specific deep learning model, but this isn't necessary. It could be as simple as some old-school Natural Language Processing, though the results might not be great.
+1. Create a prepared prompt that includes three parts:
+    1. An engineered prompt suitable for your application. For a help-desk chat bot, you might instruct the model to reply as though it were a help-desk agent working at a call center.
+    1. A "blank space" for the user-supplied prompt.
+    1. A "blank space" for supporting data from the Embedding.
+1. When the user submits a prompt, generate an embedding vector for it in the same latent space as the Embedding.
+1. Use the prompt embedding vector to find nearest neighbors in the embedding database. Select one or more of these to be the supporting data.
+1. Using the prepared prompt, fill out the "blank spaces" with the user-supplied prompt and supporting data.
+1. Submit the filled-out prepared prompt to the LLM and return the result.
+
+While this is the most basic approach, there are several places where improvements can be made.
+
+- More complex graph structures for the data making up the embedding. Having a graph structure in the database enables more complex lookup schemes that may produce more accurate results.
+- Adding a quality-checking model to select supporting data.
+- Tuning the level of granularity of data in the embedding. Sentences? Sections? Pages?
+- See [Future Directions](#future-directions) for even more ideas of where to go next.
+
+## How to Use
+
+It is recommended to use an HPC Desktop job in the Interactive Apps section of <https://rc.uab.edu>.
+
+### One-time Setup
+
+1. Clone this repository using `git clone`.
+1. Install the `conda` environment.
+    1. `module load Miniforge3`
+    1. `conda env create --file environment.yml`
+1. Obtain the rendered UAB RC Documentation pages by running `pull-site.sh`.
+1. Set up `ollama` by running `setup-ollama.sh`.
+1. Start the `ollama` server by running `./ollama serve`.
+
+### Once-per-job Setup
+
+1. Load the Miniforge module with `module load Miniforge3`.
+1. Start the `ollama` server application with `./ollama serve`.
+
+### To Run
+
+1. Run the Jupyter notebook `main.ipynb`.
+    - At the time of writing, the Documentation pages are enough data that it takes about 7-10 minutes to generate the embeddings. Please be patient.
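+
+To make the conceptual steps above concrete, here is a minimal sketch of an embed-and-query loop using the `ollama` Python client and `chromadb` directly. It is an illustrative approximation, not the actual code in `main.ipynb`: the section parsing is skipped, and the `sections` list, the `rc-docs` collection name, and the prompt wording are placeholders. It assumes `./ollama serve` is already running and uses the models listed under [Supporting Software Used](#supporting-software-used).
+
+```python
+import chromadb
+import ollama
+
+# Placeholder data: the notebook actually parses the pulled documentation pages into sections.
+sections = [
+    "Cheaha is UAB's high performance computing cluster...",
+    "Jobs are submitted to the Slurm scheduler...",
+]
+
+# Build the embedding database (the "Embedding") with ChromaDB.
+client = chromadb.Client()
+collection = client.get_or_create_collection(name="rc-docs")
+for i, text in enumerate(sections):
+    vector = ollama.embeddings(model="bge-m3", prompt=text)["embedding"]
+    collection.add(ids=[str(i)], embeddings=[vector], documents=[text])
+
+# Embed the user prompt into the same latent space and find nearest neighbors.
+user_prompt = "How do I submit a job on Cheaha?"
+query_vector = ollama.embeddings(model="bge-m3", prompt=user_prompt)["embedding"]
+supporting = collection.query(query_embeddings=[query_vector], n_results=2)["documents"][0]
+
+# Fill out the prepared prompt with the supporting data and submit it to the LLM.
+context = "\n\n".join(supporting)
+prepared_prompt = (
+    "You are a help-desk agent for UAB Research Computing. "
+    "Answer the question using only the supporting documentation below.\n\n"
+    f"Documentation:\n{context}\n\nQuestion: {user_prompt}"
+)
+response = ollama.generate(model="llama3.1", prompt=prepared_prompt)
+print(response["response"])
+```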
+
+## Supporting Software Used
+
+The `llama-index` framework is used for RAG data parsing. The embedding database is generated using a custom section-based approach with ChromaDB as the backend. Frameworks like `langchain` could be used instead for this purpose, but frankly, writing the custom code was simpler, easier to understand, and did not produce errors and warnings.
+
+The models are supplied by `ollama`:
+
+- LLM: <https://ollama.com/library/llama3.1>
+- Embedding: <https://ollama.com/library/bge-m3>
+
+## Using other versions and models
+
+Newer versions of `ollama` are distributed as compressed `.tar.gz` files on the GitHub releases page (<https://github.com/ollama/ollama/releases>). When modifying the `setup-ollama.sh` script to use one of these versions, you will need to take this into account.
+
+Changing to other models may require varying levels of modification to the Jupyter notebook, depending on the model.
+
+## Future Directions
 
+- other models
+  - LLM: there is llama 3.2, as well as models with more parameters
+  - Embedding
 - cloud deployment
 - improve chunking strategy
   - probably too fine-grained right now using individual sections
@@ -9,14 +73,16 @@
 - try a hierarchical retrieval strategy
   - use full pages as initial pass
   - then use sections only from within that page as second pass
-- "BS" mitigation strategies?
+  - try a graph-based strategy
+    - use internal linking to connect sections and pages in the database
+- "BS" (hallucination) mitigation strategies
 - improve embedding db persistence strategy
 - CI/CD triggered by docs changes
-- mitigate prompt injection attacks
+- mitigation for prompt injection attacks
   - <https://github.com/protectai/rebuff> not yet fully local
   - word counts limits (start at 1k maybe?)
   - check if response is similar to system prompt, if so, emit message
 - server-client model
   - client should be a page sending queries to a server which runs the backend code
   - client should be very thin and light-weight
-  - streamlit could be a starting point: <https://docs.streamlit.io/develop/api-reference/chat>
+  - streamlit could be a starting point for prototyping (see the sketch below): <https://docs.streamlit.io/develop/api-reference/chat>
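+
+As a rough illustration of how thin that client could be, below is a sketch of a streamlit chat page that forwards each query to a hypothetical backend. Nothing here exists in the repo yet: the `BACKEND_URL` value and the request/response shape (`prompt` in, `answer` out) are placeholders for whatever server ends up running the RAG code.
+
+```python
+import requests
+import streamlit as st
+
+st.title("UAB RC Documentation Chat")
+
+# Placeholder URL: the backend server described above does not exist yet.
+BACKEND_URL = "http://localhost:8000/query"
+
+if "history" not in st.session_state:
+    st.session_state.history = []
+
+# Replay the conversation so far.
+for role, text in st.session_state.history:
+    with st.chat_message(role):
+        st.write(text)
+
+if prompt := st.chat_input("Ask a question about the UAB RC docs"):
+    st.session_state.history.append(("user", prompt))
+    with st.chat_message("user"):
+        st.write(prompt)
+
+    # The heavy RAG work happens server-side; the client only sends the prompt.
+    answer = requests.post(BACKEND_URL, json={"prompt": prompt}, timeout=120).json()["answer"]
+    st.session_state.history.append(("assistant", answer))
+    with st.chat_message("assistant"):
+        st.write(answer)
+```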