GPT-RAG - Data Ingestion Component

Part of GPT-RAG

Table of Contents

  1. GPT-RAG - Data Ingestion Component
  2. How-to: Developer
  3. How-to: User
  4. Reference

Concepts

Document Ingestion Process

The diagram below provides an overview of the document ingestion pipeline, which handles various document types, preparing them for indexing and retrieval.

Document Ingestion Pipeline

Workflow

  1. The ragindex-indexer-chunk-documents indexer reads new documents from the documents blob container.

  2. For each document, it calls the document-chunking function app to segment the content into chunks and generate embeddings using the ADA model.

  3. Finally, each chunk is indexed in the AI Search Index.
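
The indexer runs on a schedule, but during development it can be useful to trigger a run manually after uploading new documents. Below is a minimal sketch using the azure-search-documents SDK; the endpoint and key placeholders are assumptions you would replace with your own service values.

```python
# Minimal sketch: trigger the ingestion indexer on demand with the
# azure-search-documents SDK. Endpoint and admin key are placeholder assumptions.
from azure.core.credentials import AzureKeyCredential
from azure.search.documents.indexes import SearchIndexerClient

endpoint = "https://<your-search-service>.search.windows.net"  # assumption
credential = AzureKeyCredential("<your-admin-key>")            # assumption

client = SearchIndexerClient(endpoint, credential)

# Start a run of the indexer named in the workflow above; it will pick up new
# blobs from the documents container and call the document-chunking skill.
client.run_indexer("ragindex-indexer-chunk-documents")

# Optionally check the most recent execution status.
status = client.get_indexer_status("ragindex-indexer-chunk-documents")
print(status.last_result.status if status.last_result else "no runs yet")
```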

Document Chunking Process

The document_chunking function breaks documents into smaller segments called chunks.

When a document is submitted, the system identifies its file type and selects the appropriate chunker to divide it into chunks suitable for that specific type.

This setup ensures each document is processed by the most suitable chunker, leading to efficient and accurate chunking.

Important: The file extension determines which chunker is used, as detailed in the Supported Formats and Chunkers reference below.
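
As a rough illustration of this dispatch, the sketch below maps file extensions to chunker classes. The class names and fallback choice are assumptions for illustration; the repository's document_chunking function may organize this differently.

```python
# Illustrative sketch of extension-based chunker selection.
# Class names and the fallback choice are assumptions, not the repo's exact code.
import os

class DocAnalysisChunker: ...      # Document Intelligence based (pdf, images, docx, pptx, ...)
class LangChainChunker: ...        # text-like formats (md, txt, html, json, csv, ...)
class TranscriptionChunker: ...    # vtt transcripts
class SpreadsheetChunker: ...      # xlsx workbooks

CHUNKER_BY_EXTENSION = {
    "pdf": DocAnalysisChunker, "docx": DocAnalysisChunker, "pptx": DocAnalysisChunker,
    "md": LangChainChunker, "txt": LangChainChunker, "html": LangChainChunker,
    "vtt": TranscriptionChunker,
    "xlsx": SpreadsheetChunker,
}

def get_chunker(filename: str):
    """Pick a chunker based on the file extension."""
    ext = os.path.splitext(filename)[1].lstrip(".").lower()
    return CHUNKER_BY_EXTENSION.get(ext, LangChainChunker)()  # assumed fallback
```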

Customization

The chunking process is customizable. You can modify existing chunkers or create new ones to meet specific data processing needs, optimizing the pipeline.
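
For example, a new chunker could be added for a format not covered by the defaults and registered in the extension mapping shown above. The get_chunks interface below is hypothetical and used only for illustration.

```python
# Hypothetical custom chunker: splits a .log file into fixed-size groups of lines.
# The get_chunks interface and the registration step are illustrative assumptions.
class LogFileChunker:
    def __init__(self, lines_per_chunk: int = 50):
        self.lines_per_chunk = lines_per_chunk

    def get_chunks(self, content: str) -> list[str]:
        lines = content.splitlines()
        return [
            "\n".join(lines[i:i + self.lines_per_chunk])
            for i in range(0, len(lines), self.lines_per_chunk)
        ]

# Register it in the extension map from the previous sketch.
CHUNKER_BY_EXTENSION["log"] = LogFileChunker
```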

NL2SQL Ingestion Process

If you are using the few-shot or few-shot scaled NL2SQL strategies in your orchestration component, you may want to index NL2SQL content for use during the retrieval step. This content guides SQL query generation for those strategies. More details about the NL2SQL strategies can be found in the orchestrator repository.

The NL2SQL Ingestion Process indexes three content types: queries, tables, and columns.

[!NOTE] If you are using the few-shot strategy, you will only need to index queries.

Each item—whether a query, table, or column—is represented by a JSON file containing information specific to that item type.

Here’s an example of a query file:

```json
{
    "question": "What are the top 5 most expensive products currently available for sale?",
    "query": "SELECT TOP 5 ProductID, Name, ListPrice FROM SalesLT.Product WHERE SellEndDate IS NULL ORDER BY ListPrice DESC",
    "selected_tables": [
        "SalesLT.Product"
    ],
    "selected_columns": [
        "SalesLT.Product-ProductID",
        "SalesLT.Product-Name",
        "SalesLT.Product-ListPrice",
        "SalesLT.Product-SellEndDate"
    ],
    "reasoning": "This query retrieves the top 5 products with the highest selling prices that are currently available for sale. It uses the SalesLT.Product table, selects relevant columns, and filters out products that are no longer available by checking that SellEndDate is NULL."
}
```

In the nl2sql directory of this repository, you can find additional examples of queries, tables, and columns for the following Adventure Works sample SQL Database tables.

Sample Adventure Works Database Tables

[!NOTE]
You can deploy this sample database in your Azure SQL Database.

The diagram below illustrates the NL2SQL data ingestion pipeline.

NL2SQL Ingestion Pipeline

Workflow

The following steps outline the ingestion workflow for query items.

Note:
The workflow for tables and columns is similar; just replace queries with tables or columns in the steps below.

  1. The AI Search queries-indexer scans for new query files (each containing a single query) within the queries folder in the nl2sql storage container.

    Note:
    Files are stored in the queries folder, not in the root of the nl2sql container. This setup also applies to tables and columns.

  2. The queries-indexer then uses the #Microsoft.Skills.Text.AzureOpenAIEmbeddingSkill to create a vectorized representation of the question text using the Azure OpenAI Embeddings model.

    Note:
    For query items, the question itself is vectorized. For tables and columns, their descriptions are vectorized.

  3. Finally, the indexed content is added to the nl2sql-queries index.
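
To add a new query example, you can drop a JSON file like the one shown earlier into the queries folder of the nl2sql storage container; the queries-indexer picks it up on its next run. Below is a minimal sketch using azure-storage-blob; the connection string and blob name are placeholder assumptions.

```python
# Minimal sketch: upload a query JSON file into the queries folder of the
# nl2sql container. Connection string and blob name are placeholder assumptions.
import json
from azure.storage.blob import BlobServiceClient

connection_string = "<your-storage-connection-string>"  # assumption
service = BlobServiceClient.from_connection_string(connection_string)

query_item = {
    "question": "What are the top 5 most expensive products currently available for sale?",
    "query": "SELECT TOP 5 ProductID, Name, ListPrice FROM SalesLT.Product "
             "WHERE SellEndDate IS NULL ORDER BY ListPrice DESC",
    "selected_tables": ["SalesLT.Product"],
    "selected_columns": [
        "SalesLT.Product-ProductID",
        "SalesLT.Product-Name",
        "SalesLT.Product-ListPrice",
        "SalesLT.Product-SellEndDate",
    ],
    "reasoning": "Retrieves the 5 highest-priced products still available for sale.",
}

# Note the queries/ prefix: files go in the queries folder, not the container root.
blob = service.get_blob_client(container="nl2sql", blob="queries/top5_expensive_products.json")
blob.upload_blob(json.dumps(query_item, indent=4), overwrite=True)
```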

SharePoint Indexing

The SharePoint connector operates through two primary processes, each running as a function in the Data Ingestion Function App:

  1. Indexing SharePoint Files: sharepoint_index_files function retrieves files from SharePoint, processes them, and indexes their content into the Azure AI Search Index (ragindex).
  2. Purging Deleted Files: sharepoint_purge_deleted_files identifies and removes files that have been deleted from SharePoint to keep the search index up-to-date.

Both processes are managed by scheduled Azure Functions that run at regular intervals, leveraging configuration settings to determine their behavior. The diagram below illustrates the SharePoint indexing workflow.

SharePoint Indexing Workflow

Workflow

1. Indexing Process (sharepoint_index_files)

1.1. List files from a specific SharePoint site, directory, and file types configured in the settings.
1.2. Check if the document exists in the AI Search Index. If it exists, compare the metadata_storage_last_modified field to determine if the file has been updated.
1.3. Use the Microsoft Graph API to download the file if it is new or has been updated (see the sketch after this list).
1.4. Process the file content using the regular document chunking process. For specific formats, like PDFs, use Document Intelligence.
1.5. Use Azure OpenAI to generate embeddings for the document chunks.
1.6. Upload the processed document chunks, metadata, and embeddings into the Azure AI Search Index.
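
Step 1.3 above relies on Microsoft Graph to fetch file content. A minimal sketch of such a download is shown below; how the access token is obtained and the site, drive, and item identifiers are assumptions for illustration.

```python
# Minimal sketch: download a SharePoint file's content via Microsoft Graph.
# Token acquisition (client credentials flow) and the site/drive/item IDs
# are placeholder assumptions.
import requests
from azure.identity import ClientSecretCredential

credential = ClientSecretCredential(
    tenant_id="<tenant-id>",         # assumption
    client_id="<client-id>",         # assumption
    client_secret="<client-secret>", # assumption
)
token = credential.get_token("https://graph.microsoft.com/.default").token

site_id, drive_id, item_id = "<site-id>", "<drive-id>", "<item-id>"  # assumptions
url = f"https://graph.microsoft.com/v1.0/sites/{site_id}/drives/{drive_id}/items/{item_id}/content"

response = requests.get(url, headers={"Authorization": f"Bearer {token}"})
response.raise_for_status()
file_bytes = response.content  # handed to the document chunking step
```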

2. Purging Deleted Files (sharepoint_purge_deleted_files)

2.1. Connect to the Azure AI Search Index to identify indexed documents.
2.2. Query the Microsoft Graph API to verify the existence of corresponding files in SharePoint.
2.3. Remove entries in the Azure AI Search Index for files that no longer exist.
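
As an illustration of this purge logic, the sketch below walks the index and deletes entries whose source files no longer exist in SharePoint. The field names and the sharepoint_item_exists helper are assumptions; the actual function may use different fields and batching.

```python
# Minimal sketch of the purge idea: remove index entries whose SharePoint source
# is gone. Field names and the existence check are placeholder assumptions.
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient

search_client = SearchClient(
    endpoint="https://<your-search-service>.search.windows.net",  # assumption
    index_name="ragindex",
    credential=AzureKeyCredential("<admin-key>"),                 # assumption
)

def sharepoint_item_exists(doc) -> bool:
    """Placeholder: query Microsoft Graph for the item and return False on 404."""
    raise NotImplementedError

# Iterate indexed documents (assumed key field: "id").
to_delete = []
for doc in search_client.search(search_text="*", select=["id"]):
    if not sharepoint_item_exists(doc):
        to_delete.append({"id": doc["id"]})

if to_delete:
    search_client.delete_documents(documents=to_delete)
```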

Azure Function triggers automate the indexing and purging processes. Indexing runs at regular intervals to ingest updated SharePoint files, while purging removes deleted files to maintain an accurate search index. By default, both processes run every 10 minutes when enabled.
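
A hedged sketch of what such timer triggers can look like with the Azure Functions Python v2 programming model is below; the decorator arguments and function bodies are illustrative, not the repo's exact code.

```python
# Illustrative timer-triggered functions (Azure Functions Python v2 model).
# The NCRONTAB expression "0 */10 * * * *" fires every 10 minutes; bodies are stubs.
import azure.functions as func

app = func.FunctionApp()

@app.schedule(schedule="0 */10 * * * *", arg_name="timer", run_on_startup=False)
def sharepoint_index_files(timer: func.TimerRequest) -> None:
    # Would list, download, chunk, embed, and index updated SharePoint files here.
    pass

@app.schedule(schedule="0 */10 * * * *", arg_name="timer", run_on_startup=False)
def sharepoint_purge_deleted_files(timer: func.TimerRequest) -> None:
    # Would remove index entries for files deleted from SharePoint here.
    pass
```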

If you'd like to learn how to set up the SharePoint connector, check out SharePoint Connector Setup.

How-to: Developer

Redeploying the Ingestion Component

Running Locally

Configuring SharePoint Connector

Follow the instructions to configure the SharePoint Connector in the Configuration Guide: SharePoint Connector.

How-to: User

Uploading Documents for Ingestion

Reindexing Documents in AI Search

Reference

Supported Formats and Chunkers

Here are the formats supported by each chunker. The file extension determines which chunker is used.

Doc Analysis Chunker (Document Intelligence based)

| Extension | Doc Int API Version |
|-----------|---------------------|
| pdf       | 3.1, 4.0            |
| bmp       | 3.1, 4.0            |
| jpeg      | 3.1, 4.0            |
| png       | 3.1, 4.0            |
| tiff      | 3.1, 4.0            |
| xlsx      | 4.0                 |
| docx      | 4.0                 |
| pptx      | 4.0                 |

LangChain Chunker

| Extension | Format                      |
|-----------|-----------------------------|
| md        | Markdown document           |
| txt       | Plain text file             |
| html      | HTML document               |
| shtml     | Server-side HTML document   |
| htm       | HTML document               |
| py        | Python script               |
| json      | JSON data file              |
| csv       | Comma-separated values file |
| xml       | XML data file               |

Transcription Chunker

| Extension | Format              |
|-----------|---------------------|
| vtt       | Video transcription |

Spreadsheet Chunker

| Extension | Format      |
|-----------|-------------|
| xlsx      | Spreadsheet |

External Resources