Fasten Answers AI

This project uses llama.cpp to answer users' questions with an LLM, grounding the answers in documents indexed in Elasticsearch when such documents are available. The architecture includes a data flow that covers document uploads, indexing in Elasticsearch, and response generation for queries using a large language model (LLM).

Project Summary

It uses the following components:

  • Elasticsearch: For document storage and search.
  • llama.cpp: For response generation using Llama 3 8B.
  • FastAPI: To provide the web API.

Data Flow Architecture

  1. Indexing in Elasticsearch:

    • FHIR resources must be in JSON format.
    • As the project is still in development, we are testing different approaches for storing FHIR resources in the vector database. The alternatives tested so far are: (i) saving each resource as a string and splitting it into overlapping chunks, (ii) flattening each resource before chunking with overlap, and (iii) summarizing each FHIR resource with the OpenAI API or a local LLM served by llama.cpp. You can find more details on each approach in the indexing strategies documentation (see the indexing sketch after this list).
    • Text embeddings are generated using sentence-transformers and stored in Elasticsearch.
  2. Response Generation:

    • Queries are sent through a FastAPI endpoint.
    • Relevant results are retrieved from Elasticsearch.
    • An LLM, served by llama.cpp, generates a response based on the retrieved results.
    • You can find more details on how to set up generation in the generation strategies documentation (see the generation sketch after this list).
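
As a rough sketch of the indexing flow in step 1 (approach i), assuming an index whose mapping has a text field and a dense_vector embedding field; the file name, chunk sizes, and mapping here are illustrative placeholders rather than the project's actual code:

    # Illustrative indexing sketch; not the project's exact implementation.
    import json

    from elasticsearch import Elasticsearch
    from sentence_transformers import SentenceTransformer

    es = Elasticsearch("http://localhost:9200", basic_auth=("elastic", "changeme"))
    embedder = SentenceTransformer("all-MiniLM-L6-v2")  # EMBEDDING_MODEL_NAME

    def chunk_text(text: str, size: int = 512, overlap: int = 64) -> list[str]:
        """Split a serialized FHIR resource into overlapping chunks (approach i)."""
        step = size - overlap
        return [text[i:i + size] for i in range(0, len(text), step)]

    # "resource.json" is a placeholder for a single FHIR resource in JSON format.
    with open("resource.json") as f:
        resource_text = json.dumps(json.load(f))

    for chunk in chunk_text(resource_text):
        # Assumes the index mapping defines "embedding" as a dense_vector field.
        es.index(index="fasten-index", document={
            "text": chunk,
            "embedding": embedder.encode(chunk).tolist(),
        })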
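
A corresponding sketch of the retrieval and generation step, assuming a kNN query on that embedding field and the llama.cpp server's /completion endpoint; the project's real prompt templates live in the prompts folder, and the FastAPI endpoint wraps logic roughly like this:

    # Illustrative retrieval + generation sketch; not the project's exact implementation.
    import requests
    from elasticsearch import Elasticsearch
    from sentence_transformers import SentenceTransformer

    es = Elasticsearch("http://localhost:9200", basic_auth=("elastic", "changeme"))
    embedder = SentenceTransformer("all-MiniLM-L6-v2")

    question = "What medications is the patient currently taking?"

    # 1. Retrieve the most relevant chunks from Elasticsearch via kNN search.
    hits = es.search(index="fasten-index", knn={
        "field": "embedding",
        "query_vector": embedder.encode(question).tolist(),
        "k": 5,
        "num_candidates": 50,
    })["hits"]["hits"]
    context = "\n".join(hit["_source"]["text"] for hit in hits)

    # 2. Ask the llama.cpp server for an answer grounded in the retrieved context.
    #    8080 is the host-mapped llama.cpp port listed under "Running the Project".
    response = requests.post("http://localhost:8080/completion", json={
        "prompt": f"Context:\n{context}\n\nQuestion: {question}\nAnswer:",
        "n_predict": 256,  # the -n / --n-predict parameter described below
    })
    print(response.json()["content"])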

Hardware and technical recommendations

Since the project utilizes llama.cpp for LLM execution, it’s crucial to configure the command properly to optimize performance based on the available hardware resources. All parameters can be found in the llama.cpp server documentation, but for this project, we use the following configuration:

  • -n, --predict, --n-predict N: number of tokens to predict (default: -1; -1 = infinity, -2 = until context filled).
  • -c, --ctx-size N: size of the prompt context, typically a power of 2 (default: 0; 0 = loaded from the model).
  • -t, --threads N: number of threads to use during generation (default: -1). More threads generally means faster inference, up to the number of available CPU cores.
  • -np, --parallel N: (optional) number of parallel sequences to decode (default: 1). For example, to process 4 sequences of a batch in parallel, set -np 4.

To configure these parameters based on the available resources of the local machine, modify the command directly in the llama service section of the docker-compose.yml file.

If the hardware has sufficient resources to run llama.cpp in parallel, it is recommended to add the -np parameter to the Docker Compose configuration before running docker compose up. In this scenario, the endpoints that support parallel execution, as specified in routes, are documented in evaluate_generation.md and evaluate_retrieval.md.
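
As a rough illustration (not the repository's exact service definition), and assuming the llama service's entrypoint is the llama.cpp server binary so that command only carries flags, a configuration for a machine with 8 cores and enough memory for 4 parallel sequences could look like the snippet below, where <model-file> stands for one of the .gguf files in ./models:

    llama:
      # image, ports and volumes stay as already defined in docker-compose.yml
      # -c 4096: context size; -t 8: CPU threads; -np 4: parallel sequences
      command: "-m /models/<model-file>.gguf -c 4096 -t 8 -np 4"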

Running the Project

Prerequisites

  • Docker.
  • Docker Compose.
  • LLM models in .gguf format should be downloaded and stored in the ./models folder. We have tested the performance of Phi 3.5 Mini and Llama 3.1 in various quantization formats. The prompts for conversation and summary generation are configured in the prompts folder. If you want to add a new model with a different prompt, update the prompt files in that directory and place the corresponding model in the models folder.

Instructions to Launch the RAG System

  1. Clone the repository:

    git clone https://github.com/fastenhealth/fasten-answers-ai.git
    cd fasten-answers-ai
  2. Modify the app service environment variables in the docker-compose.yml file (if necessary):

    ES_HOST: http://elasticsearch:9200
    ES_USER: elastic
    ES_PASSWORD: changeme
    ES_INDEX_NAME: fasten-index
    EMBEDDING_MODEL_NAME: all-MiniLM-L6-v2
    LLM_HOST: http://llama:9090
  3. Start the services with Docker Compose:

    docker-compose up --build

    This command will start the following services (a quick reachability check is sketched after this list):

    • Elasticsearch: Available at http://localhost:9200
    • Llama: Served by llama.cpp at http://localhost:8080
    • FastAPI Application: Available at http://localhost:8000
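
Once the containers are up, you can confirm that the three services are reachable with a small check against the URLs above (a sketch: recent llama.cpp server builds expose a /health endpoint, and FastAPI serves its interactive docs at /docs):

    # Quick reachability check for the three services (illustrative).
    import requests

    checks = {
        "Elasticsearch": ("http://localhost:9200", ("elastic", "changeme")),
        "llama.cpp": ("http://localhost:8080/health", None),  # /health on recent builds
        "FastAPI": ("http://localhost:8000/docs", None),       # interactive API docs
    }

    for name, (url, auth) in checks.items():
        try:
            status = requests.get(url, auth=auth, timeout=5).status_code
            print(f"{name}: HTTP {status}")
        except requests.RequestException as exc:
            print(f"{name}: not reachable ({exc})")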

Running the evaluations

About

This is a POC for the Fasten Answers AI engine.
