Configuration
Colette provides two main interfaces for interaction:
JSON Interface: Accessible via Python for direct integration.
HTTP Interface: Usable with curl or any programming language that supports HTTP requests.
Colette operates as an application that is configured through a JSON file, allowing users to define its behavior before sending queries.
Pre-configured RAG systems
Colette comes with several pre-configured RAG systems, each tailored for specific use cases. These systems are designed to simplify the setup process and provide a quick start for users.
They can be used via the Colette CLI:
Indexing
To run the indexing phase:
colette_cli index [OPTIONS]
Options:
* --app-dir PATH Specify the application directory [default: None] [required]
* --data-dir PATH Specify the data directory [default: None] [required]
--models-dir PATH Specify the models directory [default: None]
--config-file PATH Specify the config file [default: None]
--index-file PATH [default: None]
--help Show this message and exit.
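For example, using the placeholder paths from the configuration shown further down this page (the config file path is likewise a placeholder):

colette_cli index --app-dir /path/to/rag --data-dir /path/to/data --config-file /path/to/config.json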
Querying
To start a chat with the application:
colette_cli chat [OPTIONS]
Options:
* --app-dir PATH Specify the application directory [default: None] [required]
* --msg TEXT Specify the user message [default: None] [required]
--models-dir PATH Specify the models directory [default: None]
--help Show this message and exit.
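For example, reusing the question from the API examples at the end of this page:

colette_cli chat --app-dir /path/to/rag --msg "Quels sont les principes de gestion des débris spatiaux ?"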
Configuring a RAG-based System
Colette supports Retrieval-Augmented Generation (RAG), enabling the ingestion and retrieval of documents to enhance answer generation. Below is an overview of the RAG pipeline used by Colette:
Overview of the RAG Process
The diagram above illustrates the five key steps involved in processing a query with the RAG pipeline:
Encoding
Documents (e.g., PDFs, docx, …) are converted into embeddings using an embedding model.
When a user submits a query, it is also encoded into an embedding.
Indexing & Similarity Search
Encoded document embeddings are stored in a vector database (e.g., ChromaDB).
When a query embedding is received, the system searches for similar documents in the vector database.
Retrieving Relevant Chunks
The most relevant document chunks are retrieved from the database.
Generating a Prompt
The retrieved chunks are formatted into a structured prompt that provides context to the language model.
Generating the Response
The prompt is passed to the LLM, which generates a final response based on the retrieved context.
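As a toy illustration of these five steps, the snippet below strings them together. The bag-of-words "embedding", in-memory index, and prompt-only output are stand-ins for illustration, not Colette's actual components:

def embed(text):
    # stand-in embedding: a bag of lowercase tokens instead of a neural model
    return set(text.lower().split())

documents = [
    "La limitation des débris spatiaux commence dès la conception des missions.",
    "Les satellites en fin de vie sont désorbités ou placés sur une orbite cimetière.",
]
index = [(embed(doc), doc) for doc in documents]  # encoding + indexing

def answer(question, template_prompt, k=1):
    query = embed(question)  # encode the query
    ranked = sorted(index, key=lambda item: len(query & item[0]), reverse=True)
    chunks = [doc for _, doc in ranked[:k]]  # retrieve the most relevant chunks
    # build the prompt; a real system would now send it to the LLM
    return template_prompt.format(context=" ".join(chunks), question=question)

print(answer("Comment gérer les débris spatiaux ?", "Contexte: {context} Question: {question}"))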
Configuration Structure
The configuration file consists of three main sections:
app section: Defines the application repository and logging level.
parameters section: Contains sub-sections related to data input, preprocessing, retrieval settings, and prompt templates.
llm section: Specifies the LLM model and inference settings.
Below is a detailed breakdown of each section.
1. Application Settings (app)
{
  "app": {
    "repository": "/path/to/rag",
    "verbose": "info"
  }
}
This section controls Colette’s internal configurations, including:
repository: The directory where Colette will store its internal files, configurations, and data.
verbose: The logging level (e.g., info, debug).
2. Parameters (parameters)
The parameters section defines how data is processed, indexed, and retrieved before being passed to the LLM for answer generation.
2.1 Input Settings (input)
"input": {
"preprocessing": {
"files": ["all"],
"lib": "unstructured",
"save_output": false,
"filters": ["\/~[^\/]*$"]
},
"rag": {
"indexdb_lib": "chromadb",
"embedding_lib": "huggingface",
"embedding_model": "intfloat/multilingual-e5-small",
"gpu_id": 0,
"search": true,
"reindex": false
},
"template": {
"template_prompt": "Tu es un assistant expert dans le management des systèmes spatiaux et orbitaux. Réponds en francais en utilisant les informations du contexte qui suit. Contexte: {context}. Question: {question}. Réponse: ",
"template_prompt_variables": ["context", "question"]
},
"data": ["/path/to/data/"]
}
Preprocessing (preprocessing)
This section controls how Colette filters and processes files before they are indexed in the RAG system:
files: Determines which files to include (["all"] means all files are considered).
lib: Specifies the library used for preprocessing (unstructured for document parsing).
save_output: Boolean flag (false here) indicating whether to store preprocessed files.
filters: Regex patterns to exclude files based on their names; see the sketch below.
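The pattern "\/~[^\/]*$" from the example configuration decodes to the regex /~[^/]*$, which excludes files whose basename starts with ~ (e.g. Office lock files such as ~$report.docx). Colette applies the filters internally; the short sketch below only demonstrates the pattern's effect:

import re

filters = [r"/~[^/]*$"]  # same pattern as in the config, without JSON escaping

paths = [
    "/path/to/data/report.docx",
    "/path/to/data/~$report.docx",  # temporary lock file, should be excluded
]

kept = [p for p in paths if not any(re.search(f, p) for f in filters)]
print(kept)  # ['/path/to/data/report.docx']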
RAG Configuration (rag)
Controls settings for retrieval and vector database indexing:
indexdb_lib: The vector database used (chromadb).
embedding_lib: The library used to compute text embeddings (huggingface).
embedding_model: The embedding model (intfloat/multilingual-e5-small); see the sketch after this list.
gpu_id: Specifies which GPU to use (0 means the first available GPU).
search: If true, retrieval is enabled.
reindex: If false, existing indexes are used instead of re-processing everything.
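Colette computes these embeddings through its huggingface backend. As a standalone illustration of what the configured model produces, the sketch below calls it directly with the sentence-transformers library; that library choice is an assumption for this sketch, not necessarily Colette's code path:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("intfloat/multilingual-e5-small")
# E5 models expect a "query: " or "passage: " prefix on their input text.
embedding = model.encode("query: Quels sont les principes de gestion des débris spatiaux ?")
print(embedding.shape)  # e.g. (384,) for multilingual-e5-small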
Prompt Template (template)
Defines the system prompt used when querying the LLM:
template_prompt: The structured prompt instructing the assistant to respond as an expert in space systems management.
template_prompt_variables: The placeholders used in the prompt (context, question); see the example below.
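At query time the placeholders are substituted with the retrieved context and the user question before the prompt is sent to the LLM. The minimal sketch below uses plain Python string formatting to show the substitution; it is an illustration, not Colette's internal mechanism:

template_prompt = (
    "Tu es un assistant expert dans le management des systèmes spatiaux et orbitaux. "
    "Réponds en francais en utilisant les informations du contexte qui suit. "
    "Contexte: {context}. Question: {question}. Réponse: "
)

prompt = template_prompt.format(
    context="<chunks retrieved from the vector database>",
    question="Quels sont les principes de gestion des débris spatiaux ?",
)
print(prompt)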
Data Sources (data)
Defines where Colette retrieves documents for processing:
data: The directory containing documents (/path/to/data/).
3. LLM Configuration (llm)
"llm": {
"source": "llama3.1:latest",
"inference": {
"lib": "ollama"
}
}
This section defines the language model used for answering queries:
source: The LLM model version (llama3.1:latest).
inference.lib: The inference library (ollama).
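Putting the three sections together, a complete configuration file could look like the outline below, with the input fragment filled in exactly as shown above. The top-level layout follows the section list in Configuration Structure (app, parameters, llm side by side); verify the exact nesting against your Colette version.

{
  "app": {
    "repository": "/path/to/rag",
    "verbose": "info"
  },
  "parameters": {
    "input": { ... }
  },
  "llm": {
    "source": "llama3.1:latest",
    "inference": {
      "lib": "ollama"
    }
  }
}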
How Colette Processes RAG Queries
Preprocessing: Colette filters and processes input documents based on the defined rules.
Indexing: The documents are embedded using the huggingface embedding model (intfloat/multilingual-e5-small) and stored in chromadb.
Retrieval: When a user submits a query, Colette searches the document embeddings to retrieve relevant context.
Prompt Construction: The retrieved context is inserted into the template_prompt.
LLM Query: The constructed prompt is sent to llama3.1:latest using ollama for inference.
Response Generation: The model generates an answer using both the retrieved documents and its own knowledge.
Example API Usage
Querying via JSON Interface
import requests

url = "http://localhost:1873/predict"
data = {
    "question": "Quels sont les principes de gestion des débris spatiaux ?"
}

response = requests.post(url, json=data)
print(response.json())
Querying via cURL
curl -X POST http://localhost:1873/predict -H "Content-Type: application/json" -d '{
"question": "Quels sont les principes de gestion des débris spatiaux ?"
}'