# Vector DB

## **Overview**

The **Vector DB** connector handles the **data ingestion** stage of your pipeline.\
It converts information into a **vector representation** that can later be used for **semantic search and retrieval**. When a prompt is received, similarity calculations identify the most relevant vectors, and their corresponding text is retrieved to enrich the context provided to the large language model (LLM).

Unlike traditional databases, which store text or structured data, a vector database stores **embeddings**, which are numerical representations that capture the meaning of content. These embeddings make it possible for AI models to find related information based on similarity rather than exact keyword matches.
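
As a rough illustration of similarity-based retrieval, the sketch below ranks a few stored texts against a query vector using cosine similarity. Cosine similarity is only one common metric and is an assumption here, since this page does not state which metric is used. The vectors, texts, and function names are made up for the example.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def top_k(query, stored, k=2):
    """Rank stored (text, vector) pairs by similarity to the query vector."""
    scored = [(cosine_similarity(query, vec), text) for text, vec in stored]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]


# Toy 3-dimensional vectors; real embeddings have hundreds or thousands of values.
stored = [
    ("refund policy", [0.9, 0.1, 0.0]),
    ("shipping times", [0.1, 0.9, 0.0]),
    ("return an item", [0.8, 0.2, 0.1]),
]
query = [1.0, 0.0, 0.0]  # hypothetical embedding of a question about refunds
print(top_k(query, stored))  # ['refund policy', 'return an item']
```

The two texts whose vectors point in nearly the same direction as the query rank first, even though they share no keywords with each other.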

## **How it works**

The connector’s operation involves a sequential process with three main stages:

{% stepper %}
{% step %}

#### **Data ingestion**

The connector receives data from a previous pipeline step. This data can come from various sources, such as the trigger or another Platform connector.

You can define the source type through the **Source Type** parameter:

* **Text:** Processes raw text content.
* **File:** Processes a stored document.

{% endstep %}

{% step %}

#### **Embedding generation**

The received content is processed using the configured **Embedding Model**, which converts the data into a vector (a list of numbers that represents its semantic meaning). These vectors are not human-readable but are essential for AI-based search and retrieval in later stages.

Supported embedding model providers include:

* **Local (default):** A lightweight local embedding model (**all-MiniLM-L6-v2**) that is useful for basic use cases or testing.
* **External providers:** You can select more advanced options, such as:
  * **Hugging Face**: Offering a variety of text and multimodal models.
  * **OpenAI**: Supporting models like `text-embedding-3-small` and `text-embedding-3-large`.
  * **Google Vertex AI**: Enabling enterprise-grade embedding generation.
    {% endstep %}

{% step %}

#### **Vector storage**

After the embeddings are generated, they are stored in the configured **Vector Store**. Currently, the connector supports:

* **Neo4j** (graph-based database).
* **PostgreSQL (PGVector)** (relational database with the pgvector extension).
  {% endstep %}
  {% endstepper %}
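
The three stages above can be sketched end to end in a few lines. This is a conceptual illustration only, not the connector's implementation: `toy_embed` is a hypothetical stand-in that hashes words into buckets and carries no semantic meaning, and the in-memory list stands in for Neo4j or PostgreSQL.

```python
import hashlib
import math

DIMENSION = 8  # toy size; a real model emits, for example, 384 values


def toy_embed(text, dimension=DIMENSION):
    """Stand-in for an embedding model: hash each word into a bucket, then L2-normalize."""
    vector = [0.0] * dimension
    for word in text.lower().split():
        digest = hashlib.sha256(word.encode("utf-8")).digest()
        vector[digest[0] % dimension] += 1.0
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector] if norm else vector


def ingest(text, store, metadata=None):
    """Receive text (stage 1), embed it (stage 2), and append it to the store (stage 3)."""
    record = {"text": text, "embedding": toy_embed(text), "metadata": metadata or {}}
    store.append(record)
    return record


vector_store = []  # in-memory stand-in for the configured vector store
ingest("Orders ship within two business days.", vector_store, {"source": "faq"})
print(len(vector_store), len(vector_store[0]["embedding"]))  # 1 8
```

Each stored record keeps the original text next to its vector, which is what allows the text to be returned later when its vector is found to be similar to a query.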

## **Vector dimensions**

Each embedding model produces vectors with a specific **dimension** (for example, 3072 values). The model's dimension must **exactly match** the dimension defined in the target vector store table. If they differ, the ingestion process fails.

When the **Auto-Create Table** option is enabled and the target table doesn't exist, the connector creates it with the vector dimension configured for the selected embedding model.
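
To make the matching rule concrete, the sketch below checks a table dimension against a model's default output size and builds a pgvector `CREATE TABLE` statement. It is an illustration under assumptions: the function names, table name, and column names (`content`, `metadata`, `embedding`) are hypothetical and do not describe the schema the connector actually creates. The listed sizes are the models' default output dimensions.

```python
import re

# Default output sizes of embedding models mentioned on this page.
KNOWN_DIMENSIONS = {
    "all-MiniLM-L6-v2": 384,
    "text-embedding-3-small": 1536,
    "text-embedding-3-large": 3072,
}


def check_dimension(model_name, table_dimension):
    """Raise if the table's vector size differs from the model's output size."""
    expected = KNOWN_DIMENSIONS[model_name]
    if expected != table_dimension:
        raise ValueError(
            f"{model_name} produces {expected}-dimensional vectors, "
            f"but the table expects {table_dimension}"
        )
    return expected


def create_table_sql(table_name, dimension):
    """Build pgvector DDL; column names are hypothetical, not the connector's schema."""
    if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", table_name):
        raise ValueError("unsafe table name")
    return (
        f"CREATE TABLE IF NOT EXISTS {table_name} ("
        "id bigserial PRIMARY KEY, "
        "content text, "
        "metadata jsonb, "
        f"embedding vector({int(dimension)}))"
    )


print(create_table_sql("documents", check_dimension("all-MiniLM-L6-v2", 384)))
```

Because the dimension is fixed in the column type, switching to a model with a different output size generally means writing to a new table rather than reusing the existing one.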

## **Supported operations**

The connector currently supports only **ingestion** operations:

* **Insert:** Stores the generated embeddings in the vector store.
* **Metadata:** You can include metadata (additional key–value pairs) when storing embeddings, but metadata-based filters are not yet available.

## **Output**

The connector returns a **confirmation message** indicating the result of the ingestion process.

If supported by the embedding model, the response may also include additional information such as the **number of tokens processed** during embedding generation.

## **Parameter configuration**

Configure the connector using the parameters below. Fields that support [Double Braces expressions](https://docs.digibee.com/documentation/connectors-and-triggers/double-braces) are marked in the **Supports DB** column.

{% tabs fullWidth="true" %}
{% tab title="General" %}

<table data-full-width="true"><thead><tr><th>Parameter</th><th>Description</th><th>Data type</th><th>Supports DB</th><th>Default</th></tr></thead><tbody><tr><td><strong>Alias</strong></td><td>Name (alias) for this connector’s output, allowing you to reference it later in the flow using Double Braces expressions.</td><td>String</td><td>✅</td><td>vector-db-1</td></tr><tr><td><strong>Source Type</strong></td><td>Defines the type of data that the connector will process. Supported types: <strong>Text</strong> and <strong>File</strong>.</td><td>String</td><td>❌</td><td>N/A</td></tr><tr><td><strong>Metadata</strong></td><td>Stores extra information to identify the vectors.</td><td>Key-value pairs</td><td>❌</td><td>N/A</td></tr></tbody></table>
{% endtab %}

{% tab title="Embedding Model" %}
An embedding model converts text or other types of data into numerical vectors that represent their semantic meaning. These vectors allow the system to measure similarity between pieces of content based on meaning rather than exact wording.

Embedding models are commonly used for tasks such as **semantic search**, **clustering**, and **Retrieval-Augmented Generation (RAG)**, where they enable efficient comparison and retrieval of contextually relevant information.

#### **Local (all-MiniLM-L6-v2)**

There are no configurable parameters for this provider. However, the connector still uses an internal **Vector Dimension**, which defines the dimension of the embedding vectors. This dimension must **exactly match** the model’s vector size. If the table doesn’t exist, **Auto-Create Table** creates it with this dimension; an existing table with a different dimension causes errors. For this provider, the **default dimension is 384**.

#### **OpenAI**

<table data-full-width="true"><thead><tr><th>Parameter</th><th>Description</th><th>Data type</th><th>Supports DB</th><th>Default</th></tr></thead><tbody><tr><td><strong>Embedding Model Name</strong></td><td>Defines the name of the embedding model to use, such as <code>text-embedding-3-large</code>.</td><td>String</td><td>✅</td><td>N/A</td></tr><tr><td><strong>Embedding Account</strong></td><td>Specifies the account configured with OpenAI credentials. Supported type: <a href="https://app.gitbook.com/s/jvO5S91EQURCEhbZOuuZ/platform-administration/settings/accounts"><strong>Secret Key</strong></a>.</td><td>Select</td><td>❌</td><td>N/A</td></tr><tr><td><strong>Vector Dimension</strong></td><td>Sets the dimension of the embedding vectors. Must exactly match the model’s vector size. If the table doesn’t exist, auto-create uses this dimension; mismatched tables cause errors.</td><td>Integer</td><td>❌</td><td>N/A</td></tr><tr><td><strong>Timeout</strong></td><td>Defines the maximum time limit (in seconds) for the operation before it is aborted. For example, <code>120</code> equals 2 minutes.</td><td>Integer</td><td>❌</td><td><code>30</code></td></tr></tbody></table>

#### **Google Vertex AI**

<table data-full-width="true"><thead><tr><th>Parameter</th><th>Description</th><th>Data type</th><th>Supports DB</th><th>Default</th></tr></thead><tbody><tr><td><strong>Embedding Model Name</strong></td><td>Defines the name of the embedding model to use, such as <code>textembedding-gecko@003</code>.</td><td>String</td><td>✅</td><td>N/A</td></tr><tr><td><strong>Embedding Account</strong></td><td>Specifies the account configured with Google Cloud credentials. Supported type: <a href="https://app.gitbook.com/s/jvO5S91EQURCEhbZOuuZ/platform-administration/settings/accounts"><strong>Google Key</strong></a>.</td><td>Select</td><td>❌</td><td>N/A</td></tr><tr><td><strong>Vector Dimension</strong></td><td>Sets the dimension of the embedding vectors. Must exactly match the model’s vector size. If the table doesn’t exist, auto-create uses this dimension; mismatched tables cause errors.</td><td>Integer</td><td>❌</td><td>N/A</td></tr><tr><td><strong>Project ID</strong></td><td>Defines the ID of the Google Cloud project associated with the account.</td><td>String</td><td>✅</td><td>N/A</td></tr><tr><td><strong>Location</strong></td><td>Specifies the region where the Vertex AI model is deployed, such as <code>us-central1</code>.</td><td>String</td><td>✅</td><td>N/A</td></tr><tr><td><strong>Endpoint</strong></td><td>Defines the endpoint of the embedding model, such as <code>us-central1-aiplatform.googleapis.com:443</code>.</td><td>String</td><td>✅</td><td>N/A</td></tr><tr><td><strong>Publisher</strong></td><td>Specifies the publisher of the model, typically <code>google</code>.</td><td>String</td><td>✅</td><td>N/A</td></tr><tr><td><strong>Max Retries</strong></td><td>Defines the maximum number of retry attempts in case of temporary API failures.</td><td>Integer</td><td>❌</td><td>3</td></tr><tr><td><strong>Timeout</strong></td><td>Defines the maximum time limit (in seconds) for the operation before it is aborted. For example, <code>120</code> equals 2 minutes.</td><td>Integer</td><td>❌</td><td><code>30</code></td></tr></tbody></table>

#### **Hugging Face**

<table data-full-width="true"><thead><tr><th>Parameter</th><th>Description</th><th>Data type</th><th>Supports DB</th><th>Default</th></tr></thead><tbody><tr><td><strong>Embedding Model Name</strong></td><td>Defines the name of the embedding model to use, such as <code>sentence-transformers/all-mpnet-base-v2</code>.</td><td>String</td><td>✅</td><td>N/A</td></tr><tr><td><strong>Embedding Account</strong></td><td>Specifies the account configured with Hugging Face credentials. Supported type: <a href="https://app.gitbook.com/s/jvO5S91EQURCEhbZOuuZ/platform-administration/settings/accounts"><strong>Secret Key</strong></a>.</td><td>Select</td><td>❌</td><td>N/A</td></tr><tr><td><strong>Vector Dimension</strong></td><td>Sets the dimension of the embedding vectors. Must exactly match the model’s vector size. If the table doesn’t exist, auto-create uses this dimension; mismatched tables cause errors.</td><td>Integer</td><td>❌</td><td>N/A</td></tr><tr><td><strong>Wait for Model</strong></td><td>Determines whether the system should wait for the model to load before generating embeddings (<code>true</code>) or return an error if the model is not ready (<code>false</code>).</td><td>Boolean</td><td>❌</td><td>True</td></tr></tbody></table>
{% endtab %}

{% tab title="Vector Store" %}
A vector store is a specialized database designed to store and retrieve vector representations of data (embeddings). It enables similarity searches by comparing numerical vectors instead of exact text matches, allowing more relevant and semantic results.

<table data-full-width="true"><thead><tr><th>Parameter</th><th>Description</th><th>Data type</th><th>Supports DB</th><th>Default</th></tr></thead><tbody><tr><td><strong>Vector Store Provider</strong></td><td>Defines the database provider used for storing and querying embeddings. Options are: <strong>PostgreSQL (PGVector)</strong> and <strong>Neo4j</strong>.</td><td>Select</td><td>❌</td><td>N/A</td></tr></tbody></table>

#### **PostgreSQL (PGVector)**

<table data-full-width="true"><thead><tr><th>Parameter</th><th>Description</th><th>Data type</th><th>Supports DB</th><th>Default</th></tr></thead><tbody><tr><td><strong>Host</strong></td><td>Defines the hostname or IP address of the PostgreSQL server.</td><td>String</td><td>✅</td><td>N/A</td></tr><tr><td><strong>Port</strong></td><td>Defines the port number used to connect to the PostgreSQL server.</td><td>Number</td><td>❌</td><td><code>5432</code></td></tr><tr><td><strong>Database Name</strong></td><td>Defines the name of the PostgreSQL database containing the vector table.</td><td>String</td><td>✅</td><td>N/A</td></tr><tr><td><strong>Vector Store Account</strong></td><td>Specifies the account configured with PostgreSQL credentials.</td><td>Select</td><td>❌</td><td>N/A</td></tr><tr><td><strong>Table Name</strong></td><td>Defines the name of the table where vectors are stored.</td><td>String</td><td>✅</td><td>N/A</td></tr><tr><td><strong>Auto-Create Table</strong></td><td>Automatically creates the table if it doesn’t exist (<strong>PGVector only</strong>).</td><td>Boolean</td><td>❌</td><td>True</td></tr><tr><td><strong>Clear Table Before Ingest</strong></td><td>Deletes all existing records before ingesting new data (<strong>PGVector only</strong>).</td><td>Boolean</td><td>❌</td><td>False</td></tr><tr><td><strong>Auto-Create Index</strong></td><td>Automatically creates the vector index if it doesn’t exist.</td><td>Boolean</td><td>❌</td><td>True</td></tr></tbody></table>

#### **Neo4j**

<table data-full-width="true"><thead><tr><th>Parameter</th><th>Description</th><th>Data type</th><th>Supports DB</th><th>Default</th></tr></thead><tbody><tr><td><strong>Database Name</strong></td><td>Defines the name of the Neo4j database where the vector index is stored.</td><td>String</td><td>✅</td><td>N/A</td></tr><tr><td><strong>Vector Store Account</strong></td><td>Specifies the account configured with Neo4j credentials.</td><td>Select</td><td>❌</td><td>N/A</td></tr><tr><td><strong>Index Name</strong></td><td>Defines the name of the index used to store and query vectors.</td><td>String</td><td>✅</td><td>N/A</td></tr><tr><td><strong>URI</strong></td><td>Defines the connection URI to the Neo4j instance.</td><td>String</td><td>✅</td><td>N/A</td></tr><tr><td><strong>Node Label</strong></td><td>Defines the label assigned to nodes containing embedding data.</td><td>String</td><td>✅</td><td>N/A</td></tr><tr><td><strong>Embedding Property</strong></td><td>Defines the node property used to store the embedding vector.</td><td>String</td><td>✅</td><td>N/A</td></tr><tr><td><strong>Text Property</strong></td><td>Defines the node property used to store the original text or document.</td><td>String</td><td>✅</td><td>N/A</td></tr></tbody></table>
{% endtab %}

{% tab title="Ingestion Process" %}

<table data-full-width="true"><thead><tr><th>Parameter</th><th>Description</th><th>Data type</th><th>Supports DB</th><th>Default</th></tr></thead><tbody><tr><td><strong>Splitting Strategy</strong></td><td>Defines how documents are split into smaller chunks for embedding.</td><td>String</td><td>❌</td><td>Recursive Character Splitter (Recommended)</td></tr><tr><td><strong>Max Segment Size</strong></td><td>Maximum number of characters allowed per chunk. Larger values generate fewer, longer segments.</td><td>Integer</td><td>❌</td><td><code>500</code></td></tr><tr><td><strong>Segment Overlap</strong></td><td>Number of characters shared between consecutive chunks to preserve context.</td><td>Integer</td><td>❌</td><td><code>50</code></td></tr></tbody></table>
{% endtab %}

{% tab title="Documentation" %}

<table data-full-width="true"><thead><tr><th>Parameter</th><th>Description</th><th>Data type</th><th>Supports DB</th><th>Default</th></tr></thead><tbody><tr><td><strong>Documentation</strong></td><td>Optional field to describe the connector configuration and any relevant business rules.</td><td>String</td><td>❌</td><td>N/A</td></tr></tbody></table>
{% endtab %}
{% endtabs %}
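
To show how **Max Segment Size** and **Segment Overlap** from the **Ingestion Process** tab interact, here is a minimal sliding-window splitter that uses the default values of 500 and 50 characters. It is a simplified sketch and not the Recursive Character Splitter itself, which typically also prefers to break on separators such as paragraphs and sentences. The function name is hypothetical.

```python
def split_with_overlap(text, max_segment_size=500, segment_overlap=50):
    """Cut text into windows of at most max_segment_size characters, each
    sharing segment_overlap characters with the previous window."""
    if max_segment_size <= 0:
        raise ValueError("max_segment_size must be positive")
    if not 0 <= segment_overlap < max_segment_size:
        raise ValueError("segment_overlap must be in [0, max_segment_size)")
    step = max_segment_size - segment_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + max_segment_size])
        if start + max_segment_size >= len(text):
            break  # the window already reached the end of the text
        start += step
    return chunks


document = "x" * 1200  # placeholder for a 1,200-character document
chunks = split_with_overlap(document)
print([len(chunk) for chunk in chunks])  # [500, 500, 300]
```

With the defaults, each new chunk starts 450 characters after the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk when it is shorter than the overlap.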
