Commit 19916209 authored by William E Warriner

initial commit
# File created using '.gitignore Generator' for Visual Studio Code: https://bit.ly/vscode-gig
# Created by https://www.toptal.com/developers/gitignore/api/visualstudiocode,linux,macos,python,windows
# Edit at https://www.toptal.com/developers/gitignore?templates=visualstudiocode,linux,macos,python,windows
### Linux ###
*~
# temporary files which can be created if a process still has a handle open of a deleted file
.fuse_hidden*
# KDE directory preferences
.directory
# Linux trash folder which might appear on any partition or disk
.Trash-*
# .nfs files are created when an open file is removed but is still being accessed
.nfs*
### macOS ###
# General
.DS_Store
.AppleDouble
.LSOverride
# Icon must end with two \r
Icon
# Thumbnails
._*
# Files that might appear in the root of a volume
.DocumentRevisions-V100
.fseventsd
.Spotlight-V100
.TemporaryItems
.Trashes
.VolumeIcon.icns
.com.apple.timemachine.donotpresent
# Directories potentially created on remote AFP share
.AppleDB
.AppleDesktop
Network Trash Folder
Temporary Items
.apdisk
### macOS Patch ###
# iCloud generated files
*.icloud
### Python ###
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class
# C extensions
*.so
# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST
# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec
# Installer logs
pip-log.txt
pip-delete-this-directory.txt
# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py,cover
.hypothesis/
.pytest_cache/
cover/
# Translations
*.mo
*.pot
# Django stuff:
*.log
local_settings.py
db.sqlite3
db.sqlite3-journal
# Flask stuff:
instance/
.webassets-cache
# Scrapy stuff:
.scrapy
# Sphinx documentation
docs/_build/
# PyBuilder
.pybuilder/
target/
# Jupyter Notebook
.ipynb_checkpoints
# IPython
profile_default/
ipython_config.py
# pyenv
# For a library or package, you might want to ignore these files since the code is
# intended to run in multiple environments; otherwise, check them in:
# .python-version
# pipenv
# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
# However, in case of collaboration, if having platform-specific dependencies or dependencies
# having no cross-platform support, pipenv may install dependencies that don't work, or not
# install all needed dependencies.
#Pipfile.lock
# poetry
# Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control.
# This is especially recommended for binary packages to ensure reproducibility, and is more
# commonly ignored for libraries.
# https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control
#poetry.lock
# pdm
# Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control.
#pdm.lock
# pdm stores project-wide configurations in .pdm.toml, but it is recommended to not include it
# in version control.
# https://pdm.fming.dev/#use-with-ide
.pdm.toml
# PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm
__pypackages__/
# Celery stuff
celerybeat-schedule
celerybeat.pid
# SageMath parsed files
*.sage.py
# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/
# Spyder project settings
.spyderproject
.spyproject
# Rope project settings
.ropeproject
# mkdocs documentation
/site
# mypy
.mypy_cache/
.dmypy.json
dmypy.json
# Pyre type checker
.pyre/
# pytype static type analyzer
.pytype/
# Cython debug symbols
cython_debug/
# PyCharm
# JetBrains specific template is maintained in a separate JetBrains.gitignore that can
# be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore
# and can be added to the global gitignore or merged into this file. For a more nuclear
# option (not recommended) you can uncomment the following to ignore the entire idea folder.
#.idea/
### Python Patch ###
# Poetry local configuration file - https://python-poetry.org/docs/configuration/#local-configuration
poetry.toml
# ruff
.ruff_cache/
# LSP config files
pyrightconfig.json
### VisualStudioCode ###
.vscode/*
!.vscode/settings.json
!.vscode/tasks.json
!.vscode/launch.json
!.vscode/extensions.json
!.vscode/*.code-snippets
# Local History for Visual Studio Code
.history/
# Built Visual Studio Code Extensions
*.vsix
### VisualStudioCode Patch ###
# Ignore all local history of files
.history
.ionide
### Windows ###
# Windows thumbnail cache files
Thumbs.db
Thumbs.db:encryptable
ehthumbs.db
ehthumbs_vista.db
# Dump file
*.stackdump
# Folder config file
[Dd]esktop.ini
# Recycle Bin used on file shares
$RECYCLE.BIN/
# Windows Installer files
*.cab
*.msi
*.msix
*.msm
*.msp
# Windows shortcuts
*.lnk
# End of https://www.toptal.com/developers/gitignore/api/visualstudiocode,linux,macos,python,windows
# Custom rules (everything added below won't be overriden by 'Generate .gitignore File' if you use 'Update' option)
/site/
/embeddings/
ollama
name: ollama
dependencies:
  - conda-forge::python=3.11.9
  - conda-forge::pip=24.0
  - conda-forge::ipykernel=6.28.0
  - pip:
      - llama-index-core==0.10.62
      - llama-index-llms-ollama==0.2.2
      - llama-index-embeddings-nomic==0.4.0
      - nomic[local]==3.1.1
      - ollama==0.3.1
notes.md
# NOTES
## SETUP
The Makefile output is not quite right: `requirements.in` ends up containing `unstructured[md]jupyter`, which causes the later `uv pip compile` command to fail with "Distribution not found". Strip that entry back to just `unstructured` and the rest of the commands work.
- Prefer `ollama`; it is FOSS
### Setting up ollama
- Use `wget -O ollama https://github.com/ollama/ollama/releases/download/v0.3.4/ollama-linux-amd64` to download the ollama binary
- Use `chmod u+x ollama` to make it executable (put it in a bin folder)
- Use `./ollama serve` (consider running it in the background). This starts the ollama API frontend, which serves requests over local HTTP
- In a new terminal, use `./ollama pull llama3.1:8b` to download the llama3.1 model locally
- Use `./ollama pull nomic-embed-text` to download the nomic text embedding model locally
- Use `pip install 'unstructured[md]'` (quote the brackets so the shell does not glob them)
## Langchain
- Provides a higher-level API on top of ollama
- Has vector embedding models
- Has RAG model interface
- There are dataloaders for MD and HTML: <https://python.langchain.com/v0.1/docs/modules/data_connection/document_loaders/>
- Tables?
- Chunking by section?
- What about section contexts?
- Will need to use some amount of empiricism to optimize
- Larger chunks give lower granularity, harder to map back to source, but more context
- Smaller chunks have higher granularity, easier to map back to source, but less context
- Retrievers: <https://python.langchain.com/v0.1/docs/modules/data_connection/retrievers/>
- Lots of options here, worth examining
- *Note*: using the 8B model for document examination in production is not recommended
> it's empirically defined, like I showed in the RAG notebook 3.0, but a good rule of thumb is: if you want granular access to information, use strategically small chunk sizes, then experiment across metrics that matter to you, like relevancy, retrieving certain types of information, etc...
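To make the tradeoff concrete, here is a minimal sketch of fixed-size character chunking with overlap (the chunk sizes and sample text are illustrative, not tuned values):

```python
def chunk_text(text, chunk_size, overlap):
    """Split text into fixed-size character chunks, with overlap between
    neighbors so content straddling a boundary appears in both chunks."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks

doc = "word " * 200  # 1000-character stand-in for a real document
small = chunk_text(doc, chunk_size=100, overlap=20)  # granular, easy to map to source
large = chunk_text(doc, chunk_size=500, overlap=50)  # more context, coarser mapping
```

Smaller chunks produce many more pieces to embed and retrieve; larger chunks produce fewer, context-rich pieces.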
Consider using `llama-index`
## embeddings
- vectorization of existing content as a "pre-analyzer" for the LLM proper
- helps identify which files/data are most likely relevant to a given query
- langchain helps build this and provides reasonable results
- paths to files
- content of files
- **question** is there a simple interface to get sections/anchor links from MD?
- **question** how to have langchain load a pre-built database?
- ultimately we'd like to have a chatbot that uses the docs site content as part of the vector store
- it should be built as part of CI/CD by cloning the docs repo, building the DB, and hosting it on a server
- the chatbot would then use that db for similarity search
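The similarity-search step can be sketched with plain cosine similarity; the file paths and vectors below are toy stand-ins for real embeddings (e.g. from `nomic-embed-text` via ollama):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical pre-built store mapping file paths to embedding vectors.
store = {
    "docs/setup.md": [0.9, 0.1, 0.0],
    "docs/faq.md": [0.1, 0.8, 0.2],
}
query = [0.85, 0.15, 0.05]  # embedding of the user's question
best = max(store, key=lambda path: cosine_similarity(store[path], query))
```

The chatbot would hand the top-scoring files (or chunks) to the model as context.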
## doc parsing
- open-source frontend: <https://github.com/nlmatics/llmsherpa>
- backend is open-source also: <https://github.com/nlmatics/nlm-ingestor>
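On the open question above about getting sections/anchor links from MD, a rough stdlib sketch is possible; the slug rule below only approximates GitHub-style anchors (lowercase, punctuation stripped, spaces to hyphens) rather than implementing the full spec:

```python
import re

def heading_anchors(markdown: str):
    """Extract ATX headings and derive approximate anchor slugs for each."""
    anchors = []
    for match in re.finditer(r"^#{1,6}\s+(.+?)\s*$", markdown, re.MULTILINE):
        title = match.group(1)
        slug = re.sub(r"[^\w\s-]", "", title).strip().lower()
        slug = re.sub(r"\s+", "-", slug)
        anchors.append((title, slug))
    return anchors

sample = "# Setup\n\nSome prose.\n\n## Using ollama\n"
```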
## temperature
Closer to 0 is more "precise"
Closer to 1 is more "creative"
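The knob can be illustrated with a plain softmax over token logits (the values here are made up): dividing logits by the temperature sharpens the distribution when T < 1 and flattens it when T > 1.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Scale logits by 1/temperature, then softmax (max-subtracted for stability)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]
cold = softmax_with_temperature(logits, 0.1)  # near-deterministic: top token dominates
hot = softmax_with_temperature(logits, 2.0)   # flatter: more varied sampling
```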
## Tool Calling
- This is a way of supplying the model with an API it can invoke. Along with the message, the model receives tool definitions (function, DB lookup, etc.) it can use as sources of additional information.
- This is an alternative to supplying it with pre-parsed documents
- The LLM response may indicate or suggest a tool call be used, and provide arguments to use with that tool call.
- Check 4.0 notebook
- 8b model is less robust than 70b
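The dispatch step can be sketched without a model in the loop; the tool name, schema, and JSON reply below are all made up for illustration (with ollama the structured tool call comes back in the chat response instead):

```python
import json

# Hypothetical tool exposed to the model; name and signature are illustrative.
def get_doc_url(page: str) -> str:
    return f"https://docs.example.org/{page}"

TOOLS = {"get_doc_url": get_doc_url}

# Mocked model reply suggesting a tool call with arguments.
model_reply = json.loads('{"tool": "get_doc_url", "arguments": {"page": "setup"}}')

# The application executes the call and would feed the result back to the model.
result = TOOLS[model_reply["tool"]](**model_reply["arguments"])
```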
## Chat Agent
- Investigate 4.3
- 5.0 brings things all together
#!/bin/sh
# Start the ollama API server (serves over local HTTP; run in the background if desired)
./ollama serve
#!/bin/sh
# Download the ollama binary, make it executable, and pull the models used here
VERSION=0.3.4
TARGET=linux-amd64
wget -O ollama "https://github.com/ollama/ollama/releases/download/${VERSION}/ollama-${TARGET}"
chmod u+x ollama
./ollama pull llama3.1:8b
./ollama pull bge-m3:latest # RAG embedding model