# README
This repo is intended to serve as a simple example of how to use `ollama` to build a rudimentary Retrieval Augmented Generation (RAG) model using the UAB RC Documentation. It assumes you are using Cheaha at UAB (<https://rc.uab.edu>); if not, some of the commands will need to be modified.
There isn't anything particularly special or magical about RAG at a basic level. I found breaking through the terminology more challenging than actually building a conceptual understanding of what is happening. The basic recipe, sketched in code after the list below, is:
1. Take some data of interest, like the [UAB RC Documentation](https://docs.rc.uab.edu).
1. Generate a database of vectors embedded into some latent space. This database is generally referred to colloquially as the "Embedding". The embedding vectors are typically generated by an embedding-specific deep learning model, but this isn't necessary. It could be as simple as some old-school Natural Language Processing, though the results might not be great.
1. Create a prepared prompt that includes three parts:
    1. An engineered prompt suitable for your application. For a help-desk chat bot, you might instruct the model to reply as though it were a help-desk agent working at a call center.
    1. A "blank space" for the user-supplied prompt.
    1. A "blank space" for supporting data from the Embedding.
1. When the user submits a prompt, generate an embedding vector into the same latent space as the Embedding.
1. Use the prompt embedding vector to find nearest-neighbors in the embedding database. Select one or more of these to be the supporting data.
1. Using the prepared prompt, fill out the "blank spaces" with the user-supplied prompt and supporting data.
1. Submit the filled-out prepared prompt to the LLM and return the result.
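As a concrete illustration, here is a minimal sketch of the recipe above using the `ollama` Python client and `chromadb`, with a running `ollama` server assumed. The model names match those listed under Supporting Software Used; the document list and prompt text are placeholders, not the actual code from `main.ipynb`.

```python
import ollama
import chromadb

# Hypothetical stand-ins for the parsed documentation sections.
documents = [
    "Cheaha is UAB's high-performance computing cluster.",
    "Interactive Apps on rc.uab.edu include an HPC Desktop job.",
]

# Step 2: embed each document into the latent space and store the vectors.
client = chromadb.Client()
collection = client.create_collection(name="docs")
for i, doc in enumerate(documents):
    response = ollama.embeddings(model="bge-m3", prompt=doc)
    collection.add(ids=[str(i)], embeddings=[response["embedding"]], documents=[doc])

# Step 3: a prepared prompt with an engineered instruction and two "blank spaces".
PREPARED_PROMPT = (
    "You are a help-desk agent for UAB Research Computing. "
    "Answer the question using only the supporting data.\n\n"
    "Supporting data:\n{supporting_data}\n\n"
    "Question: {user_prompt}"
)

def answer(user_prompt: str) -> str:
    # Step 4: embed the user prompt into the same latent space.
    query = ollama.embeddings(model="bge-m3", prompt=user_prompt)
    # Step 5: nearest-neighbor lookup to select supporting data.
    results = collection.query(query_embeddings=[query["embedding"]], n_results=1)
    supporting_data = "\n".join(results["documents"][0])
    # Steps 6-7: fill out the blank spaces and submit to the LLM.
    filled = PREPARED_PROMPT.format(
        supporting_data=supporting_data, user_prompt=user_prompt
    )
    return ollama.generate(model="llama3.1", prompt=filled)["response"]

print(answer("How do I start an HPC Desktop job?"))
```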
While this is the most basic approach, there are several places where improvement can be made.
- More complex graph structures for the data making up the embedding. Having a graph structure in the database enables more complex lookup schemes that may produce more accurate results.
- Adding a quality-checking model to select supporting data.
- Tuning the level of granularity of data in the embedding. Sentences? Sections? Pages? (A section-splitting sketch follows this list.)
- See [Future Directions](#future-directions) for even more ideas of where to go next.
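For instance, chunking at the section level might look like the following sketch. The heading-based regex is an assumption about how the rendered pages are structured, not the parsing logic actually used in this repo.

```python
import re

def split_markdown_sections(text: str) -> list[str]:
    """Split a markdown document into chunks, one per heading-delimited section."""
    # Split immediately before any line that starts with a markdown heading.
    parts = re.split(r"(?m)^(?=#{1,6} )", text)
    return [part.strip() for part in parts if part.strip()]

page = "# Cheaha\nIntro text.\n\n## Quotas\nStorage limits.\n\n## Jobs\nSlurm basics."
for section in split_markdown_sections(page):
    print(repr(section))
```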
## How to Use
It is recommended to use an HPC Desktop job in the Interactive Apps section of <https://rc.uab.edu>.
### One-time Setup
1. Clone this repository using `git clone`.
1. Create the `conda` environment.
    1. `module load Miniforge3`
    1. `conda env create --file environment.yml`
1. Obtain the rendered UAB RC Documentation pages by running `pull-site.sh`.
1. Set up `ollama` by running `setup-ollama.sh`.
1. Start the `ollama` server by running `./ollama serve`.
### Once-per-job Setup
1. Load the Miniforge module with `module load Miniforge3`.
1. Start the `ollama` server application with `./ollama serve`.
### To Run
1. Run the Jupyter notebook `main.ipynb`.
    - At the time of writing, the Documentation pages contain enough data that generating the embeddings takes about 7-10 minutes. Please be patient.
## Supporting Software Used
The `llama-index` framework is used for RAG data parsing. The embedding database is generated using a custom section-based approach with ChromaDB as the backend. Frameworks like `langchain` could be used instead for this purpose, but frankly, writing the custom code was simpler, easier to understand, and did not report errors and warnings.
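If you want the embedding database to survive across jobs, ChromaDB's persistent client is one option. This is a hedged sketch; the path and collection name are hypothetical choices, not the repo's actual configuration.

```python
import chromadb

# Store the collection on disk so embeddings are not regenerated every job.
# The path is a hypothetical choice; pick somewhere in your project space.
client = chromadb.PersistentClient(path="./chroma-db")
collection = client.get_or_create_collection(name="uab-rc-docs")

# On later runs the collection is loaded from disk; only embed when empty.
if collection.count() == 0:
    print("Empty collection: generate and add embeddings here.")
```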
The models are supplied by `ollama`.
- LLM: <https://ollama.com/library/llama3.1>
- Embedding: <https://ollama.com/library/bge-m3>
## Using other versions and models
Newer versions of `ollama` are distributed as `.tar.gz` archives on the GitHub releases page (<https://github.com/ollama/ollama/releases>). When modifying the `setup-ollama.sh` script to use a newer version, you will need to take this into account.
Changing to other models may require varying levels of modification in the Jupyter notebook depending on the model.
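For example, swapping the LLM can be as simple as pulling a different model and changing one name, as in this sketch. The `llama3.2` tag comes from the ollama library page; everything else is illustrative.

```python
import ollama

LLM_MODEL = "llama3.2"  # previously "llama3.1"

# Pull the model through the running ollama server, then use it as before.
ollama.pull(LLM_MODEL)
reply = ollama.generate(model=LLM_MODEL, prompt="Say hello.")
print(reply["response"])
```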
## Future Directions
- other models
  - LLM: there is llama 3.2, and there are models with more parameters
  - Embedding
- cloud deployment
- improve chunking strategy
  - probably too fine-grained right now using individual sections
  - try a hierarchical retrieval strategy
    - use full pages as initial pass
    - then use sections only from within that page as second pass
  - try a graph-based strategy
    - use internal linking to connect sections and pages in the database
- "BS" (hallucination) mitigation strategies
- improve embedding db persistence strategy
  - CI/CD triggered by docs changes
- mitigation for prompt injection attacks
  - <https://github.com/protectai/rebuff> not yet fully local
  - word count limits (start at 1k maybe?)
  - check if response is similar to system prompt; if so, emit a message (see the sketch after this list)
- server-client model
  - client should be a page sending queries to a server which runs the backend code
  - client should be very thin and light-weight
  - streamlit could be a starting point for prototyping: <https://docs.streamlit.io/develop/api-reference/chat>
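One way to implement the response-similarity check above: embed both the system prompt and the model's response, then compare them with cosine similarity. This is a sketch under assumed names; the threshold is a starting guess to tune, not a measured value.

```python
import math

import ollama

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def leaked_system_prompt(system_prompt: str, response: str, threshold: float = 0.8) -> bool:
    """Return True if the response looks suspiciously similar to the system prompt."""
    sp = ollama.embeddings(model="bge-m3", prompt=system_prompt)["embedding"]
    rp = ollama.embeddings(model="bge-m3", prompt=response)["embedding"]
    return cosine_similarity(sp, rp) >= threshold

if leaked_system_prompt("You are a help-desk agent...", "You are a help-desk agent..."):
    print("Sorry, I can't share that. Please rephrase your question.")
```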