Vespa Documentation Search

Vespa Documentation Search is a Vespa Cloud instance for searching documents in:

This sample app is auto-deployed to Vespa Cloud; see deploy-vespa-documentation-search.yaml.


Deployment status:

Query API

Open API endpoints:

Example requests:

<pre data-test="exec" data-test-assert-contains="namespace">
$ curl "https://api.search.vespa.ai/document/v1/open/doc/docid/open%2Fen%2Freference%2Fquery-api-reference.html"
</pre>

<pre data-test="exec" data-test-assert-contains="the-great-search-engine-debate">
$ curl --data-urlencode 'yql=select * from doc where userInput(@userinput)' \
    --data-urlencode 'userinput=vespa ranking is great' \
    https://api.search.vespa.ai/search/
</pre>

Using these endpoints is a good way to get started with Vespa. To run your own instance, see the GitHub deploy action (use vespa:deploy to deploy to a dev instance), or the quick-start to deploy using Docker.

Refer to getting-started-ranking for example use of the Query API.

Feed your own instance

It is easy to set up your own instance on Vespa Cloud and feed documents from vespa-engine/documentation:

1: Generate the open_index.json feed file:

$ cd vespa-engine/documentation && bundle exec jekyll build -p _plugins-vespafeed

Refer to vespa_index_generator.rb for how the feed file is generated.

2: Add data plane credentials:

$ pwd; ll *.pem
-rwxr-xr-x@ 1 myuser  staff  3272 Mar 17 09:30 data-plane-private-key.pem
-rwxr-xr-x@ 1 myuser  staff  1696 Mar 17 09:30 data-plane-public-key.pem

3: Set endpoint in _config.yml (get this from the Vespa Cloud Console):

diff --git a/_config.yml b/_config.yml
-        - url: https://vespacloud-docsearch.vespa-team.aws-us-east-1c.z.vespa-app.cloud/
-          indexes:
-              - open_index.json
-        - url: https://vespacloud-docsearch.vespa-team.aws-ap-northeast-1a.z.vespa-app.cloud/
+        - url: https://myinstance.vespacloud-docsearch.mytenant.aws-us-east-1c.dev.z.vespa-app.cloud/

4: Feed open_index.json:

$ ./feed_to_vespa.py


The ranking is deliberately simple and serves as an introduction to using query rank features and summary features:

    rank-profile documentation inherits default {
        inputs {
            query(titleWeight) double: 2.0
            query(headersWeight) double: 1.0
            query(contentWeight) double: 1.0
            query(keywordsWeight) double: 10.0
            query(pathWeight) double: 1.0
        }
        first-phase {
            expression {
                query(titleWeight) * bm25(title) +
                query(contentWeight) * bm25(content) +
                query(headersWeight) * bm25(headers) +
                query(pathWeight) * bm25(path) +
                query(keywordsWeight) * bm25(keywords)
            }
        }
        summary-features {
            ...
        }
    }

With this it is easy to experiment with ranking by sending rank-properties in the query and observing the values in summary-features, like:

api.search.vespa.ai/search/?yql=select * from doc where userInput(@userinput)&ranking=documentation&input.query(pathWeight)=10&userinput=vespa ranking is great
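As a minimal sketch of such an experiment, the Python below builds a properly encoded query URL with a rank input override. The helper name build_query_url is hypothetical; the endpoint, YQL, rank profile name and the input.query(...) parameter form are taken from the example above. It only constructs the URL and does not send the request.

```python
from typing import Dict, Optional
from urllib.parse import urlencode

SEARCH_ENDPOINT = "https://api.search.vespa.ai/search/"


def build_query_url(user_input: str,
                    rank_inputs: Optional[Dict[str, float]] = None) -> str:
    """Return a search URL that overrides rank-profile inputs for one query.

    build_query_url is a hypothetical helper, not part of any Vespa client.
    """
    params = {
        "yql": "select * from doc where userInput(@userinput)",
        "userinput": user_input,
        "ranking": "documentation",
    }
    # Each override is sent as input.query(<name>)=<value>,
    # matching the inputs declared in the rank profile.
    for name, value in (rank_inputs or {}).items():
        params[f"input.query({name})"] = value
    # urlencode percent-encodes spaces, parentheses and other reserved characters.
    return f"{SEARCH_ENDPOINT}?{urlencode(params)}"


if __name__ == "__main__":
    print(build_query_url("vespa ranking is great", {"pathWeight": 10}))
```

Changing one weight at a time and comparing the returned summary-features is a simple way to see which field drives the score for a given query.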

See approximate-nn-hnsw.md for an example of setting (comma-separated) keywords in the frontmatter so that the page ranks higher for those terms, e.g.:

title: "Approximate Nearest Neighbor Search using HNSW Index"
keywords: "ann, approximate nearest neighbor"
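To illustrate why a keyword match moves a page up, here is a small sketch of the first-phase expression from the rank profile above, evaluated outside Vespa. The weights are the defaults from the profile; the bm25 values are made-up numbers for illustration only, and first_phase_score is a hypothetical helper.

```python
from typing import Dict

# Default weights copied from the inputs of the "documentation" rank profile.
DEFAULT_WEIGHTS: Dict[str, float] = {
    "title": 2.0,
    "headers": 1.0,
    "content": 1.0,
    "keywords": 10.0,
    "path": 1.0,
}


def first_phase_score(bm25: Dict[str, float],
                      weights: Dict[str, float] = DEFAULT_WEIGHTS) -> float:
    """Weighted sum of per-field bm25 scores, mirroring the first-phase expression.

    Fields missing from `bm25` are treated as 0.0 (no match in that field).
    """
    return sum(weight * bm25.get(field, 0.0) for field, weight in weights.items())


if __name__ == "__main__":
    # Hypothetical bm25 values: both pages match title and content equally,
    # but only the second one also matches on a frontmatter keyword.
    without_keyword = {"title": 3.0, "content": 4.0}
    with_keyword = {"title": 3.0, "content": 4.0, "keywords": 1.5}
    print(first_phase_score(without_keyword))
    print(first_phase_score(with_keyword))
```

Because keywordsWeight defaults to 10.0, even a modest bm25(keywords) value contributes more than a similar match in the title or content fields.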

Document feed automation

Vespa Documentation is stored in GitHub:

Jekyll is used to serve the documentation; it rebuilds at each commit.

A change also triggers GitHub Actions. The Build step in the workflow uses the Jekyll Generator plugin to build a JSON feed, used in the Feed step:

Vespa Cloud secures endpoints using mTLS. Secrets can be stored in GitHub Settings for a repository. Here, the private key secret is accessed in the GitHub Actions workflow that feeds to Vespa Cloud: feed.yml

Document processing

The documents are split into paragraphs for multi-vector ranking; see the example in feed-split.py.
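The idea of paragraph splitting can be sketched as follows. This is not the logic of feed-split.py; it is a simplified illustration that splits plain text on blank lines, and the record field names (id, path, content) are hypothetical.

```python
import re
from typing import Dict, List


def split_paragraphs(text: str) -> List[str]:
    """Split text on one or more blank lines and normalize inner whitespace."""
    chunks = re.split(r"\n\s*\n", text)
    return [" ".join(chunk.split()) for chunk in chunks if chunk.strip()]


def to_paragraph_records(path: str, text: str) -> List[Dict[str, str]]:
    """Build one record per paragraph; the field names are hypothetical."""
    return [
        {"id": f"{path}#p{index}", "path": path, "content": paragraph}
        for index, paragraph in enumerate(split_paragraphs(text))
    ]


if __name__ == "__main__":
    sample = "First paragraph\nstill first.\n\n\nSecond one.\n"
    for record in to_paragraph_records("/en/example.html", sample):
        print(record)
```

Each paragraph record can then be embedded or ranked on its own, which is what makes per-paragraph (multi-vector) ranking possible.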

Query integration

Query results are open to the internet. Because the Vespa Cloud endpoint itself requires mTLS, an AWS Lambda function sits in front of it: the function gets the private key secret from AWS Parameter Store and uses it in the HTTPS request to Vespa Cloud.

The lambda needs AmazonSSMReadOnlyAccess added to its Role to access the Parameter Store.

Note that JSON-P is used (jsoncallback=); this simplifies the search result page, search.html.
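For readers unfamiliar with JSON-P, the sketch below shows the general shape of such a response: the JSON body is wrapped in a call to the function named by the callback parameter, so a browser can load it through a script tag. This illustrates the convention only; it is not Vespa's implementation, the exact bytes Vespa returns may differ, and wrap_jsonp and unwrap_jsonp are hypothetical helpers.

```python
import json
import re
from typing import Any

# Conservative pattern for a JavaScript callback name such as "cb" or "app.onResult".
_CALLBACK_NAME = r"[A-Za-z_$][\w$.]*"


def wrap_jsonp(callback: str, payload: Any) -> str:
    """Wrap a JSON-serializable payload in a JSON-P callback invocation."""
    if not re.fullmatch(_CALLBACK_NAME, callback):
        raise ValueError(f"unsafe callback name: {callback!r}")
    return f"{callback}({json.dumps(payload)});"


def unwrap_jsonp(body: str) -> Any:
    """Extract and parse the JSON payload from a JSON-P response body."""
    match = re.fullmatch(rf"\s*{_CALLBACK_NAME}\((.*)\)\s*;?\s*", body, re.DOTALL)
    if match is None:
        raise ValueError("not a JSON-P response")
    return json.loads(match.group(1))


if __name__ == "__main__":
    wrapped = wrap_jsonp("showResults", {"root": {"children": []}})
    print(wrapped)
    print(unwrap_jsonp(wrapped))
```

Validating the callback name matters because the value is echoed into executable JavaScript.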

Vespa Cloud Development and Deployments

This is a Vespa Cloud application and therefore has automated deployments.

The feed can contain an array of links from each document. The OutLinksDocumentProcessor is custom Java code that adds an in-link to each target document using the Vespa Document API.
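The underlying idea, inverting out-links into in-links, can be sketched in a few lines. This is an in-memory illustration only, not the OutLinksDocumentProcessor itself, which is Java and updates target documents through the Document API; compute_inlinks and the sample paths are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set


def compute_inlinks(outlinks: Dict[str, Iterable[str]]) -> Dict[str, List[str]]:
    """Invert a source -> targets mapping into target -> sorted list of sources."""
    inlinks: Dict[str, Set[str]] = defaultdict(set)
    for source, targets in outlinks.items():
        for target in targets:
            # Every out-link from `source` is an in-link for `target`.
            inlinks[target].add(source)
    return {target: sorted(sources) for target, sources in inlinks.items()}


if __name__ == "__main__":
    sample_outlinks = {
        "/en/a.html": ["/en/b.html", "/en/c.html"],
        "/en/b.html": ["/en/c.html"],
    }
    for target, sources in compute_inlinks(sample_outlinks).items():
        print(target, "<-", sources)
```

In the real application this happens incrementally, one fed document at a time, which is why a document processor issuing updates is used rather than a batch job.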

To test this functionality, the VespaDocSystemTest runs for each deployment.

Creating a System Test is also a great way to develop a Vespa application:

Feed grouping examples

cat << EOF | vespa feed -t https://vespacloud-docsearch.vespa-team.aws-us-east-1c.z.vespa-app.cloud -
{"fields": {"customer": "Smith","date": 1157526000,"item": "Intake valve","price": "1000","tax": "0.24"},"put": "id:purchase:purchase::0"}
{"fields": {"customer": "Smith","date": 1157616000,"item": "Rocker arm","price": "1000","tax": "0.12"},"put": "id:purchase:purchase::1"}
{"fields": {"customer": "Smith","date": 1157619600,"item": "Spring","price": "2000","tax": "0.24"},"put": "id:purchase:purchase::2"}
{"fields": {"customer": "Jones","date": 1157709600,"item": "Valve cover","price": "3000","tax": "0.12"},"put": "id:purchase:purchase::3"}
{"fields": {"customer": "Jones","date": 1157702400,"item": "Intake port","price": "5000","tax": "0.24"},"put": "id:purchase:purchase::4"}
{"fields": {"customer": "Brown","date": 1157706000,"item": "Head","price": "8000","tax": "0.12"},"put": "id:purchase:purchase::5"}
{"fields": {"customer": "Smith","date": 1157796000,"item": "Coolant","price": "1300","tax": "0.24"},"put": "id:purchase:purchase::6"}
{"fields": {"customer": "Jones","date": 1157788800,"item": "Engine block","price": "2100","tax": "0.12"},"put": "id:purchase:purchase::7"}
{"fields": {"customer": "Brown","date": 1157792400,"item": "Oil pan","price": "3400","tax": "0.24"},"put": "id:purchase:purchase::8"}
{"fields": {"customer": "Smith","date": 1157796000,"item": "Oil sump","price": "5500","tax": "0.12"},"put": "id:purchase:purchase::9"}
{"fields": {"customer": "Jones","date": 1157875200,"item": "Camshaft","price": "8900","tax": "0.24"},"put": "id:purchase:purchase::10"}
{"fields": {"customer": "Brown","date": 1157878800,"item": "Exhaust valve","price": "1440","tax": "0.12"},"put": "id:purchase:purchase::11"}
{"fields": {"customer": "Brown","date": 1157882400,"item": "Rocker arm","price": "2330","tax": "0.24"},"put": "id:purchase:purchase::12"}
{"fields": {"customer": "Brown","date": 1157875200,"item": "Spring","price": "3770","tax": "0.12"},"put": "id:purchase:purchase::13"}
{"fields": {"customer": "Smith","date": 1157878800,"item": "Spark plug","price": "6100","tax": "0.24"},"put": "id:purchase:purchase::14"}
{"fields": {"customer": "Jones","date": 1157968800,"item": "Exhaust port","price": "9870","tax": "0.12"},"put": "id:purchase:purchase::15"}
{"fields": {"customer": "Brown","date": 1157961600,"item": "Piston","price": "1597","tax": "0.24"},"put": "id:purchase:purchase::16"}
{"fields": {"customer": "Smith","date": 1157965200,"item": "Connection rod","price": "2584","tax": "0.12"},"put": "id:purchase:purchase::17"}
{"fields": {"customer": "Jones","date": 1157968800,"item": "Rod bearing","price": "4181","tax": "0.24"},"put": "id:purchase:purchase::18"}
{"fields": {"customer": "Jones","date": 1157972400,"item": "Crankshaft","price": "6765","tax": "0.12"},"put": "id:purchase:purchase::19"}
EOF

Simplified node.js Lambda code

<pre>
'use strict';
const https = require('https')
const AWS = require('aws-sdk')

// Public certificate matching the private key stored in AWS Parameter Store
const publicCert = `-----BEGIN CERTIFICATE-----
MIIFbDCCA1QCCQCTyf46/BIdpDANBgkqhkiG9w0BAQsFADB4MQswCQYDVQQGEwJO
...
NxoOxvYcP8Pnxn8UGILy7sKl3VRQWIMrlOfXK4DEg8EGqeQzlFVScfSdbH0i6gQz
-----END CERTIFICATE-----`;

exports.handler = async (event, context) => {
    console.log('Received event:', JSON.stringify(event, null, 4));
    const query = event.queryStringParameters.query ? event.queryStringParameters.query : '';
    const jsoncallback = event.queryStringParameters.jsoncallback;
    // Assumed to be passed as query parameters as well
    const hits = event.queryStringParameters.hits;
    const ranking = event.queryStringParameters.ranking;
    const path = encodeURI(`/search/?jsoncallback=${jsoncallback}&query=${query}&hits=${hits}&ranking=${ranking}`);

    // Read the private key secret from AWS Parameter Store
    const ssm = new AWS.SSM();
    const privateKeyParam = await new Promise((resolve, reject) => {
        ssm.getParameter({
            Name: 'ThePrivateKey',
            WithDecryption: true
        }, (err, data) => {
            if (err) { return reject(err); }
            return resolve(data);
        });
    });

    // mTLS: present the key and certificate in the request to Vespa Cloud
    var options = {
        hostname: 'vespacloud-docsearch.vespa-team.aws-us-east-1c.z.vespa-app.cloud',
        port: 443,
        path: path,
        method: 'GET',
        headers: { 'accept': 'application/json' },
        key: privateKeyParam.Parameter.Value,
        cert: publicCert
    }

    var body = '';
    const response = await new Promise((resolve, reject) => {
        const req = https.get(
            options,
            res => {
                res.setEncoding('utf8');
                res.on('data', (chunk) => {body += chunk})
                res.on('end', () => {
                    resolve({
                        statusCode: 200,
                        body: body
                    });
                });
            });
        req.on('error', (e) => {
            reject({
                statusCode: 500,
                body: 'Something went wrong!'
            });
        });
    });
    return response
};
</pre>