
GPT on your data Ingestion

Part of GPT-RAG

Getting started

You can provision the infrastructure and deploy the whole solution using the GPT-RAG template, as instructed at: https://aka.ms/gpt-rag.

What if I want to redeploy just the ingestion component?

At some point, you may want to adjust the data ingestion code and redeploy the component.

To redeploy only the ingestion component (after the initial deployment of the solution), clone this repository and run the following commands within the gpt-rag-ingestion directory:

```shell
azd auth login
azd env refresh
azd deploy
```

Note: When running `azd env refresh`, use the same environment name, subscription, and region used in the initial provisioning of the infrastructure.

Running Locally with VS Code

How can I test the data ingestion component locally in VS Code?

Document Intelligence API version

To use version 4.0 of Document Intelligence, add the property `DOCINT_API_VERSION` with the value `2024-07-31-preview` to the function app settings. Check that this version is supported in the region where the service was created; see the Document Intelligence documentation for details. If the property is not set (the default behavior), version 2023-07-31 (3.1) is used.
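For example, the setting can be added with the Azure CLI (the resource names below are placeholders for your own function app and resource group):

```shell
# Set the Document Intelligence API version on the ingestion function app.
# <function-app-name> and <resource-group> are placeholders.
az functionapp config appsettings set \
  --name <function-app-name> \
  --resource-group <resource-group> \
  --settings DOCINT_API_VERSION=2024-07-31-preview
```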

Document Chunking Process

The document_chunking function is responsible for breaking down documents into smaller pieces known as chunks.

When a document is submitted, the system identifies its file extension and selects the appropriate chunker to divide the document into chunks. This ensures that each document is processed with the chunker best suited to its format, leading to efficient and accurate chunking.

> [!IMPORTANT]
> The choice of chunker is determined by the file format, following the guidelines above.
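As an illustration, the extension-based selection described above can be sketched as follows. The registry and chunker names here are hypothetical; the actual implementation may differ.

```python
import os

# Hypothetical registry mapping file extensions to chunker names.
# The real component selects chunkers in a similar extension-driven way.
CHUNKER_REGISTRY = {
    ".pdf": "DocAnalysisChunker",
    ".docx": "DocAnalysisChunker",
    ".md": "LangChainChunker",
    ".txt": "LangChainChunker",
    ".vtt": "TranscriptionChunker",
    ".xlsx": "SpreadsheetChunker",
    ".nl2sql": "NL2SQLChunker",
}

def select_chunker(filename: str) -> str:
    """Return the chunker name for a file, based on its extension."""
    ext = os.path.splitext(filename)[1].lower()
    try:
        return CHUNKER_REGISTRY[ext]
    except KeyError:
        raise ValueError(f"Unsupported file extension: {ext}")

print(select_chunker("report.pdf"))  # DocAnalysisChunker
```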

Customization

The chunking process is flexible and can be customized. You can modify the existing chunkers or create new ones to suit your specific data processing needs, allowing for a more tailored and efficient processing pipeline.
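For example, a custom chunker for a new format might look like the following sketch. The class name, method signature, and splitting strategy are illustrative assumptions, not the component's actual API.

```python
# Illustrative sketch of a custom chunker; the class name and method
# signature are assumptions, not the component's actual API.
class CustomLogChunker:
    """Splits plain-text log content into fixed-size chunks with overlap."""

    def __init__(self, chunk_size: int = 1000, overlap: int = 100):
        self.chunk_size = chunk_size
        self.overlap = overlap

    def get_chunks(self, text: str) -> list[str]:
        # Slide a window of chunk_size characters, stepping so that
        # consecutive chunks share `overlap` characters.
        chunks = []
        step = self.chunk_size - self.overlap
        for start in range(0, len(text), step):
            chunks.append(text[start:start + self.chunk_size])
        return chunks

chunker = CustomLogChunker(chunk_size=10, overlap=2)
print(chunker.get_chunks("abcdefghijklmnop"))
```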

Supported Formats

Below are the formats supported by each chunker. The decision of which chunker is used for a given format is described earlier.

Doc Analysis Chunker (Document Intelligence based)

| Extension | Doc Int API Version |
|-----------|---------------------|
| pdf | 3.1, 4.0 |
| bmp | 3.1, 4.0 |
| jpeg | 3.1, 4.0 |
| png | 3.1, 4.0 |
| tiff | 3.1, 4.0 |
| xlsx | 4.0 |
| docx | 4.0 |
| pptx | 4.0 |

LangChain Chunker

| Extension | Format |
|-----------|--------|
| md | Markdown document |
| txt | Plain text file |
| html | HTML document |
| shtml | Server-side HTML document |
| htm | HTML document |
| py | Python script |
| json | JSON data file |
| csv | Comma-separated values file |
| xml | XML data file |

Transcription Chunker

| Extension | Format |
|-----------|--------|
| vtt | Video transcription |

Spreadsheet Chunker

| Extension | Format |
|-----------|--------|
| xlsx | Spreadsheet |

NL2SQL Chunker

| Extension | Description |
|-----------|-------------|
| nl2sql | JSON files containing natural language questions and corresponding SQL queries |
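A minimal illustration of such a record is shown below. The field names are assumptions for illustration only; check the repository's sample files for the actual schema.

```python
import json

# Hypothetical .nl2sql record; the field names are assumptions.
record = {
    "question": "How many orders were placed last month?",
    "query": "SELECT COUNT(*) FROM orders "
             "WHERE order_date >= DATEADD(month, -1, GETDATE());",
}

# A record like this would be serialized to a .nl2sql file as JSON.
serialized = json.dumps(record, indent=2)
print(serialized)
```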

Contributing

We appreciate your interest in contributing to this project! Please refer to the CONTRIBUTING.md page for detailed guidelines on how to contribute, including information about the Contributor License Agreement (CLA), code of conduct, and the process for submitting pull requests.

Thank you for your support and contributions!

Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos is subject to those third parties' policies.