
<picture> <source media="(prefers-color-scheme: dark)" srcset="https://assets.vespa.ai/logos/Vespa-logo-green-RGB.svg"> <source media="(prefers-color-scheme: light)" srcset="https://assets.vespa.ai/logos/Vespa-logo-dark-RGB.svg"> <img alt="#Vespa" width="200" src="https://assets.vespa.ai/logos/Vespa-logo-dark-RGB.svg" style="margin-bottom: 25px;"> </picture>

Managed Vector Search using Vespa Cloud

There is a growing interest in AI-powered vector representations of unstructured multimodal data and searching efficiently over these representations. This repository describes how your organization can unlock the full potential of multimodal AI-powered vector representations using Vespa Cloud -- the industry-leading managed Vector Search Service.

Create your tenant in the Vespa Cloud

If you don't already have a Vespa Cloud tenant, create one at console.vespa-cloud.com. Signing up requires a Google or GitHub account and starts your free trial period; no credit card is required.

Clone this repo

git clone --depth 1 https://github.com/vespa-cloud/vector-search.git && cd vector-search

Install Vespa-CLI

Install Vespa-CLI, the official command-line client for interacting with Vespa. Vespa-CLI works with both Vespa Cloud and self-hosted on-premise Vespa deployments.

brew install vespa-cli

You can also download Vespa CLI binaries for Windows, Linux and macOS.

Configure Vespa-CLI

Replace <tenant-name> with your Vespa Cloud tenant name. In this case, the application name is vector-search and the instance name is default:

vespa config set target cloud && \
vespa config set --local application <tenant-name>.vector-search.default

Security

Authorize access to the Vespa Cloud control plane:

vespa auth login

Create a self-signed certificate for data plane (read and write) endpoint access:

vespa auth cert

Read more about how Vespa Cloud keeps your data safe and private at rest and in transit in the Vespa Cloud Security Guide.

Configure Vector Schema

Now the app is ready to be deployed. The vector schema is configured with 768 dimensions using float precision.

The vector schema can be changed before deploying to match your vector data:

Change the vector dimensionality (default 768).
Change the vector precision type (default float): choose between int8, bfloat16, float, or double.
Change the distance-metric (default angular, useful for models trained with cosine similarity): euclidean, innerproduct, and hamming are also supported.

Note that this sample application ships with CI/CD tests for production deployment that use 768 dimensions. Changing the schema requires corresponding changes to the CI/CD tests.

Deploy to dev environment

Vespa Cloud supports multiple environments. The following sections guide you through:

Deploying to dev for developing and testing of your vector search use case
Deploying to perf for performance validation and benchmarking
Deploying to prod for high availability production serving

The Vespa Cloud dev zone is where development happens. Resources are downscaled to nodes with 2 vCPU, 8 GB of RAM, and 50 GB of disk. A dev deployment with a single content node can index about 1M 768-dimensional vectors.

Deploy app to dev:

vespa deploy

The very first deployment to the dev environment takes about 12 minutes for provisioning resources and signing endpoint certificates. Later deployments take less than a minute.

Deploy to perf environment

The perf zone is used for benchmarking and performance testing. It uses the same resource specification as in production, except for redundancy.

Deploy app to perf by using the --zone parameter:

vespa deploy --zone perf.aws-us-east-1c

Deploy to production environment

This deploys the application to production via an automated deployment pipeline that runs:

System test tests/system-test/feed-and-search-test.json
Staging setup test tests/staging-setup/staging-feed-before-upgrade.json
Staging test tests/staging-test/staging-after-upgrade.json

The above tests also demonstrate Vespa vector search query and feed usage.

Deploying to production requires choosing which production regions the app should be deployed to. The deployment.xml in this sample app uses aws-us-east-1c.

For high availability and low network latency, consider using multiple regions. Vespa Cloud supports global query traffic routing so that query requests are served by the region which is closest to the client. See deployment.xml global endpoints.

The currently available Vespa Cloud production zones are listed in zones. Requests for new regions can be made by sending an email to support@vespa.ai.

The following deploys the application to the production regions specified in deployment.xml:

vespa prod deploy

Refer to Production Deployment to deploy to the production environment with CI/CD.

Vespa Cloud - Vector Search Price Examples

Vespa Cloud pricing is simple and transparent. All customers receive all features and services, and are charged a fee proportional to the resources the application uses.

The production env configuration in services.xml specifies the following resources:

<nodes deploy:environment="prod" count="2" groups="2">
      <resources memory="32GB" vcpu="8" disk="300GB" storage-type="local" />
</nodes>

The above specifies a redundant, high-availability deployment using grouped data distribution: two groups for redundancy, with one node per group.

Vectors	Dimensionality	Precision type	Queries per second	Writes per second	Estimated cost per hour ($)
5M	768	float	2000	1000	3.36
5M	768	float	6000	1000	10.08
10M	384	float	2000	1000	3.36
20M	384	bfloat16	1500	750	3.36

Fewer vector dimensions and a lower precision type (e.g., bfloat16 instead of float) increase the number of vectors that can be indexed per node, since memory is the limiting resource. The supported queries per second and writes per second depend on the vector search parameters.
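The relationship between dimensionality, precision, and vectors per node can be checked with back-of-the-envelope arithmetic. The sketch below estimates only the raw vector payload, assuming 1, 2, 4, and 8 bytes per value for int8, bfloat16, float, and double. It ignores index structures and other per-document overhead, so actual memory use will be higher. The function name is illustrative and not part of any Vespa tooling.

```python
# Bytes used by one vector value for each precision type (assumed sizes).
BYTES_PER_VALUE = {"int8": 1, "bfloat16": 2, "float": 4, "double": 8}


def raw_vector_gb(num_vectors: int, dimensions: int, precision: str) -> float:
    """Raw vector payload in GB (10^9 bytes), excluding any index overhead."""
    return num_vectors * dimensions * BYTES_PER_VALUE[precision] / 1e9


# The three distinct data sizes from the price table above.
for vectors, dims, precision in [
    (5_000_000, 768, "float"),
    (10_000_000, 384, "float"),
    (20_000_000, 384, "bfloat16"),
]:
    size = raw_vector_gb(vectors, dims, precision)
    print(f"{vectors:>10,} x {dims} {precision:<8} -> {size:.2f} GB")
```

All three configurations come to 15.36 GB of raw vector data, which is consistent with them sharing the same node size and hourly price in the table.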

Vespa Cloud sizing experts can assist in finding the most cost efficient resource specification matching your vector search use case. Sizing and cost estimation uses samples of your data in the perf environment.

Vespa Cloud also supports auto-scaling, which can lower the cost of a deployment because resources scale with changes in query volume throughout the week.

Vespa Cloud endpoint testing

In the security section above, the vespa auth cert command generated the data-plane credentials:

vespa auth cert
Success: Certificate written to security/clients.pem
Success: Certificate written to ~/.vespa/<tenant-name>.vector-search.default/data-plane-public-cert.pem
Success: Private key written to ~/.vespa/<tenant-name>.vector-search.default/data-plane-private-key.pem

This is the certificate/key-pair used when feeding and querying documents. The endpoint is found in the console and used in the commands below.

Before feeding or running queries, one can easily check the endpoint:

curl --verbose \
  --cert ~/.vespa/<tenant-name>.vector-search.default/data-plane-public-cert.pem \
  --key ~/.vespa/<tenant-name>.vector-search.default/data-plane-private-key.pem \
  https://vector-search.<tenant-name>.aws-us-east-1c.dev.z.vespa-app.cloud/

Expect a 200 OK with output like:

{
  "handlers" : [ {
    "id" : "com.yahoo.container.usability.BindingsOverviewHandler",
    "class" : "com.yahoo.container.usability.BindingsOverviewHandler",
    "bundle" : "container-disc:8.89.6",
    "serverBindings" : [ "http://*/" ]
  } ...

Or simply use the Vespa CLI:

vespa status query

Container (query API) at https://vector-search.<tenant-name>.aws-us-east-1c.dev.z.vespa-app.cloud/ is ready
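The same check can be scripted. The following is a minimal sketch using only the Python standard library. It assumes the certificate and key locations printed by vespa auth cert and the dev endpoint URL pattern from the curl example; the helper names and the tenant name my-tenant are placeholders, and the actual endpoint should be taken from the console.

```python
import ssl
import urllib.request
from pathlib import Path

APPLICATION = "vector-search"
INSTANCE = "default"


def credential_paths(tenant: str) -> tuple[Path, Path]:
    """Locations where `vespa auth cert` reported writing the cert and key."""
    base = Path.home() / ".vespa" / f"{tenant}.{APPLICATION}.{INSTANCE}"
    return base / "data-plane-public-cert.pem", base / "data-plane-private-key.pem"


def dev_endpoint(tenant: str, zone: str = "aws-us-east-1c") -> str:
    """Dev endpoint following the URL pattern used in the curl example."""
    return f"https://{APPLICATION}.{tenant}.{zone}.dev.z.vespa-app.cloud/"


def check_endpoint(tenant: str) -> int:
    """Send a GET with the client certificate and return the HTTP status."""
    cert, key = credential_paths(tenant)
    context = ssl.create_default_context()
    context.load_cert_chain(certfile=str(cert), keyfile=str(key))
    url = dev_endpoint(tenant)
    with urllib.request.urlopen(url, context=context, timeout=10) as response:
        return response.status


# Example (requires a deployed application and valid credentials):
# print(check_endpoint("my-tenant"))
```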

Feeding example

feed.py is a script to generate test documents based on the schema. Use this as a template for feeding your own vector data. Example feed using feed.py to generate 20K test vectors with 768 dimensions:

vespa feed <(python3 feed.py 20000 768)
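For orientation, a generator of this kind can be sketched as follows. This is not the repository's feed.py: the document type vector and the field names id, vector, and tags are inferred from the query and visit examples in this README, while the namespace test, the integer id, and the tag values are placeholders. Check the schema before adapting it.

```python
import json
import random
import sys


def make_document(doc_id: int, dimensions: int) -> dict:
    """Build one put operation in Vespa's JSON feed format."""
    return {
        # Document id format: id:<namespace>:<document-type>::<local id>.
        # Namespace "test" and document type "vector" are assumptions.
        "put": f"id:test:vector::{doc_id}",
        "fields": {
            "id": doc_id,
            "tags": [f"tag{doc_id % 3}"],  # hypothetical tag values
            "vector": {"values": [random.random() for _ in range(dimensions)]},
        },
    }


def generate(count: int, dimensions: int):
    """Yield one JSON line per document (JSONL)."""
    for doc_id in range(count):
        yield json.dumps(make_document(doc_id, dimensions))


# Usage: python3 sketch.py <count> <dimensions>
if __name__ == "__main__" and len(sys.argv) == 3:
    for line in generate(int(sys.argv[1]), int(sys.argv[2])):
        print(line)
```

Writing one operation per line keeps memory use flat for large feeds, since no document list is built up before printing.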

Query examples

Test the query API by querying for all documents - examples:

Using the Vespa CLI

vespa query 'yql=select * from vectors where true' \
'ranking=unranked' \
'hits=1'

Using the configured all document-summary, which also returns the vector data. This is slower, as more data is returned:

vespa query 'yql=select * from vectors where true' \
'ranking=unranked' \
'hits=1' \
'summary=all'

Same query, but with different rendering of the vector tensor field:

vespa query 'yql=select * from vectors where true' \
'ranking=unranked' \
'hits=1' \
'summary=all' \
'presentation.format.tensors=short-value'

Use vespa query -v to print the curl equivalent.

Using HTTP GET

Using GET, the request parameters must be URL-encoded. Here each space is replaced with +, and the URL is quoted so that the shell does not interpret ? and *:

curl \
 --cert ~/.vespa/<tenant-name>.vector-search.default/data-plane-public-cert.pem \
 --key ~/.vespa/<tenant-name>.vector-search.default/data-plane-private-key.pem \
 "https://vector-search.<tenant-name>.aws-us-east-1c.dev.z.vespa-app.cloud/search/?yql=select+*+from+vector+where+true"
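Encoding by hand becomes error-prone for longer queries. A small sketch using urlencode from Python's standard library to build the query string; the parameters are the ones from the CLI examples above.

```python
from urllib.parse import urlencode

params = {
    "yql": "select * from vectors where true",
    "ranking": "unranked",
    "hits": 1,
}

# urlencode applies quote_plus: spaces become "+" and "*" becomes "%2A".
query_string = urlencode(params)
print(query_string)
# yql=select+%2A+from+vectors+where+true&ranking=unranked&hits=1
```

Append the result after ? on the /search/ path of your endpoint.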

Using HTTP POST

 curl \
 --cert ~/.vespa/<tenant-name>.vector-search.default/data-plane-public-cert.pem \
 --key ~/.vespa/<tenant-name>.vector-search.default/data-plane-private-key.pem \
 --json '
  {
   "yql": "select * from vectors where true",
   "ranking": "unranked",
   "hits":1
  }' \
  https://vector-search.<tenant-name>.aws-us-east-1c.dev.z.vespa-app.cloud/search/

Using POST is recommended for large request payloads.

Nearest neighbor queries

Approximate nearest neighbor search, asking for the ten nearest neighbors with {targetHits:10}:

query=$(cat query-vector.json) && \
curl \
 --cert ~/.vespa/<tenant-name>.vector-search.default/data-plane-public-cert.pem \
 --key ~/.vespa/<tenant-name>.vector-search.default/data-plane-private-key.pem \
 --json "
  {
   'yql': 'select * from vectors where {targetHits:10}nearestNeighbor(vector, q)',
   'input.query(q)': '$query' 
  }" \
 https://vector-search.<tenant-name>.aws-us-east-1c.dev.z.vespa-app.cloud/search/

Exact nearest neighbor search, asking for the ten nearest neighbors with approximate:false:

query=$(cat query-vector.json) && \
curl \
 --cert ~/.vespa/<tenant-name>.vector-search.default/data-plane-public-cert.pem \
 --key ~/.vespa/<tenant-name>.vector-search.default/data-plane-private-key.pem \
 --json "
  {
   'yql': 'select * from vectors where {targetHits:10,approximate:false}nearestNeighbor(vector, q)',
   'input.query(q)': '$query' 
  }" \
 https://vector-search.<tenant-name>.aws-us-east-1c.dev.z.vespa-app.cloud/search/

Approximate nearest neighbor search combined with a filter on tags:

query=$(cat query-vector.json) && \
curl \
 --cert ~/.vespa/<tenant-name>.vector-search.default/data-plane-public-cert.pem \
 --key ~/.vespa/<tenant-name>.vector-search.default/data-plane-private-key.pem \
 --json "
  {
   'yql': 'select * from vectors where {targetHits:10}nearestNeighbor(vector, q) and tags contains \"tag1\"',
   'input.query(q)': '$query' 
  }" \
 https://vector-search.<tenant-name>.aws-us-east-1c.dev.z.vespa-app.cloud/search/

See also using Vespa Vector Search for nearest neighbor search queries.
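The three request bodies above differ only in the query annotations and the optional filter. A hedged sketch of a helper that builds them in Python: the function name and parameters are illustrative, the query vector is passed as a string to mirror the quoted $query in the curl examples, and the tag value is inserted without escaping, so it must not contain double quotes.

```python
import json


def nearest_neighbor_body(query_vector, target_hits=10, approximate=True, tag=None):
    """Build a JSON-serializable request body for a nearestNeighbor query."""
    annotations = f"targetHits:{target_hits}"
    if not approximate:
        annotations += ",approximate:false"
    yql = f"select * from vectors where {{{annotations}}}nearestNeighbor(vector, q)"
    if tag is not None:
        yql += f' and tags contains "{tag}"'
    return {
        "yql": yql,
        # Passed as a string, mirroring the quoted '$query' in the curl examples.
        "input.query(q)": json.dumps(query_vector),
    }


# Exact search with a tag filter, using a toy two-dimensional vector.
body = nearest_neighbor_body([0.1, 0.2], approximate=False, tag="tag1")
print(json.dumps(body, indent=2))
```

The serialized body can be sent with any HTTP client that supports client certificates, using the same certificate and key as in the curl commands.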

Visiting and exporting the data

Getting the vector data out of Vespa can be as important as getting it in, especially when using native embedders that embed text into vector representations. Use the Vespa visit functionality to export data.

vespa visit --field-set vector:vector,id > ../vector-data.jsonl

This exports all documents from the vector schema with only the vector and id fields. Using [all] exports both schema fields and derived fields:

vespa visit --field-set "[all]" > ../vector-data.jsonl

Use a document selection to limit what is returned. Note that the selection is not evaluated using index data structures; it is a linear scan operation.

vespa visit --field-set "vector:vector" \
--selection "vector.tags != null" > ../vector-data.jsonl
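Once exported, the file can be inspected with a few lines of Python. This sketch assumes the export is JSON Lines with one document per line, each carrying its fields under a fields key. Verify that against your own output before relying on it; the function name is illustrative.

```python
import json
from collections import Counter


def summarize_export(lines):
    """Count documents and how often each field appears in a JSONL export."""
    doc_count = 0
    field_counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        doc = json.loads(line)
        doc_count += 1
        field_counts.update(doc.get("fields", {}).keys())
    return doc_count, dict(field_counts)


# Example:
# with open("../vector-data.jsonl") as f:
#     print(summarize_export(f))
```

A per-field count is a quick way to confirm that a field set or selection returned what was intended before loading the data elsewhere.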