zkJSON Litepaper v1.0

     _        _  _____  ____  _   _   _      _ _                                   
     | |      | |/ ____|/ __ \| \ | | | |    (_) |                                  
  ___| | __   | | (___ | |  | |  \| | | |     _| |_ ___ _ __   __ _ _ __   ___ _ __ 
 |_  / |/ /   | |\___ \| |  | | . ` | | |    | | __/ _ \ '_ \ / _` | '_ \ / _ \ '__|
  / /|   < |__| |____) | |__| | |\  | | |____| | ||  __/ |_) | (_| | |_) |  __/ |   
 /___|_|\_\____/|_____/ \____/|_| \_| |______|_|\__\___| .__/ \__,_| .__/ \___|_|   
                                                       | |         | |              
                                                       |_|         |_|              

Zero Knowledge Provable JSON

zkJSON makes any arbitrary JSON data provable with zero knowledge proofs, and makes it verifiable both offchain and onchain (on blockchains).

EVM blockchains like Ethereum will get a hyper-scalable NoSQL database extension whereby offchain JSON data become directly queryable from within Solidity smart contracts.

<div align="center"><img src="./assets/zkjson.png" /></div>

Why

Most offchain data on the web are represented in JSON format, yet blockchains have so far failed to connect with them efficiently.

As a result, data on web2 (offchain) and web3 (onchain) are divided and web3 is missing a great wide variety of use cases with offchain data. What if we could verify any offchain JSON data in onchain smart contracts, and also build a general-purpose database with web2-like performance and scalability? zkJSON and zkDB will allow direct connections from smart contract to offchain database. And we will further make it practical and sustainable with modular blockchain rollups (Ethereum security + Arweave permanency and scalability) and a decentralized physical infrastructure network (DePIN) using Cosmos IBC.

This entire tech stack will enable novel use cases to web3 such as decentralized oracles and indexers, as well as provide a decentralized database alternative to web2 with the performance and scalability of cloud databases. We could, for instance, build a fully decentralized Twitter without any centralized components. Connecting securely with offchain data with privacy is also the way to bring in enterprise use cases to web3 in combination with DID and Verifiable Credentials (VC). We are working on it too with PolygonID and ICP VETKeys.

We envision the web where offchain data are seamlessly connected with blockchains. Our ultimate goal is to liberate the web2 data silos and redirect the huge monopolistic web2 revenue models such as ad networks and future AI-based networks to web3. Any offchain data without zkJSON are not legit, since they are not verifiable onchain.

Onchain verifiability is what scales the decentralized web. Onchain is the new online, and zkJSON expands what's online/onchain (verifiable).

How

There are 4 steps to build a complete solution.

  1. make any JSON provable with zk circuits - zkJSON
  2. build a database structure with merkle trees and zkJSON - zkDB
  3. commit db states to an EVM blockchain - zkRollup
  4. make it queryable with Solidity - zkQuery

And 3 bonus steps to make it practical and sustainable (using Arweave & Cosmos IBC).

  1. make zkDB feature-rich to bear any web2/web3 usages - WeaveDB
  2. make WeaveDB performant, scalable, and secure with Arweave+EVM hybrid rollup - WeaveDB Rollup
  3. make the rollups sustainable with DePIN - WeaveChain

This repo contains only the first 4 steps. You can find the rest here.

zkJSON

The key to making JSON verifiable with zkp is to invent a deterministic encoding that is friendly to zk circuits. zk circuits can only handle arithmetic operations on natural numbers, so we need to convert any JSON to a series of natural numbers and back, then pack everything into as few uints as possible to save space efficiently. The default storage slot in Solidity is uint256 and Circom uses a prime modulus just below the 256-bit range, so optimizing for uint makes sense. Just to clarify, you cannot simply convert JSON to a binary format or any existing encoding format, because the encoding has to specifically make sense to the circuit logic and to Solidity.

Encoding
<div align="center"><img src="./assets/encode.png" /></div>

zk circuits can neither handle objects nor dynamically nested arrays. So we first need to flatten all the paths into a simple array.

{
  "a": 1,
  "c": false,
  "b": { "e": null, "d": "four" },
  "f": 3.14,
  "ghi": [ 5, 6, 7 ]
}

becomes

[
  [ "a", 1 ],
  [ "c", false ],
  [ "b.e", null ],
  [ "b.d", "four" ],
  [ "f", 3.14 ],
  [ "ghi", [ 5, 6, 7 ] ]
]

Each path is then converted to Unicode code points, one number per character.

[
  [ [ [ 97 ] ], 1 ],
  [ [ [ 99 ] ], false ],
  [ [ [ 98 ], [ 101 ] ], null ],
  [ [ [ 98 ], [ 100 ] ], "four" ],
  [ [ [ 102 ] ], 3.14 ],
  [ [ [ 103, 104, 105 ] ], [ 5, 6, 7 ] ]
]

To make it deterministic, items must be lexicographically sorted by the paths.

[
  [ [ [ 97 ] ], 1 ],
  [ [ [ 98 ], [ 100 ] ], "four" ],
  [ [ [ 98 ], [ 101 ] ], null ],
  [ [ [ 99 ] ], false ],
  [ [ [ 102 ] ], 3.14 ],
  [ [ [ 103, 104, 105 ] ], [ 5, 6, 7 ] ]
]

Here's a tricky part: if the value is an array, we need to create a path for each element, but we also need to tell the difference between ghi.0 and ghi[0] with just numbers (ghi.0 is a path into an object, while ghi[0] is a path to an array element). There is also the case where a key is empty, like { "" : "empty" }. Another case to note is that a bare primitive value with no top-level object is also valid JSON, such as null, true, [ 1, 2, 3 ] or 1. And paths can contain empty strings, such as a..b for { "a" : { "" : { "b" : 1 } } }.

To address all these edge cases, we prefix each key with the number of characters that follow, or with 0 if the key is empty (0 followed by 1) or an array index (0 followed by another 0 and then the index).

[
  [ [ [ 1, 97 ] ], 1 ],
  [ [ [ 1, 98 ], [ 1, 100 ] ], "four" ],
  [ [ [ 1, 98 ], [ 1, 101 ] ], null ],
  [ [ [ 1, 99 ] ], false ],
  [ [ [ 1, 102 ] ], 3.14 ],
  [ [ [ 3, 103, 104, 105 ], [ 0, 0, 0 ] ], 5 ],
  [ [ [ 3, 103, 104, 105 ], [ 0, 0, 1 ] ], 6 ],
  [ [ [ 3, 103, 104, 105 ], [ 0, 0, 2 ] ], 7 ]
]
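
Concretely, the key-prefix rule can be sketched in a few lines (encodeKey is a hypothetical helper for illustration, not part of the zkjson package):

const encodeKey = key => {
  if (typeof key === "number") return [ 0, 0, key ] // array index: 0, another 0, then the index
  if (key === "") return [ 0, 1 ] // empty key: 0 followed by 1
  const codes = [...key].map(c => c.codePointAt(0))
  return [ codes.length, ...codes ] // character count, then unicode code points
}

console.log(encodeKey("ghi")) // [ 3, 103, 104, 105 ]
console.log(encodeKey(0)) // [ 0, 0, 0 ]
console.log(encodeKey("")) // [ 0, 1 ]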

Now we flatten the paths but also prefix them with how many nested keys each path contains.

[
  [ [ 1, 1, 97 ], 1 ],
  [ [ 2, 1, 98, 1, 100 ], "four" ],
  [ [ 2, 1, 98, 1, 101 ], null ],
  [ [ 1, 1, 99 ], false ],
  [ [ 1, 1, 102 ], 3.14 ],
  [ [ 2, 3, 103, 104, 105, 0, 0, 0 ], 5 ],
  [ [ 2, 3, 103, 104, 105, 0, 0, 1 ], 6 ],
  [ [ 2, 3, 103, 104, 105, 0, 0, 2 ], 7 ]
]

If the top level is a non-object value such as 1 or null, the flattened path is always [ 1, 0, 1 ].

Let's numerify the values in a similar fashion. There are only 6 valid data types in JSON (null / boolean / number / string / array / object), and since the paths are flattened, we only need to handle the 4 primitive types. We assign a type number to each.

The first digit will always be the type number.

null (0)

null is always [ 0 ] as there's nothing else to tell.

boolean (1)

There are only 2 cases. true is [ 1, 1 ] and false is [ 1, 0 ].

number (2)

number is a bit tricky as we need to differentiate integers from floats, and positive numbers from negative ones. Remember that circuits can only handle natural numbers. A number contains 4 elements: the type (2), a sign flag (1 for the non-negative numbers in the examples below), the number of fractional digits, and the concatenated digits themselves.

For instance, 1 becomes [ 2, 1, 0, 1 ] and 3.14 becomes [ 2, 1, 2, 314 ].

string (3)

The first digit is the type 3 and the second digit tells how many characters follow, then each character is converted to its Unicode code point (e.g. abc = [ 3, 3, 97, 98, 99 ]).

array | object (4)

In the case of an array or object, the encoding is prefixed with 4 and all the nested values are recursively encoded. The final array includes the internal paths too.

For example, when encoding the array [ 1, 2 ], the internal path to 1 is 1, 0, 0, 0 and the internal path to 2 is 1, 0, 0, 1, and both are included.
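
To make the type encodings concrete, here is how the encodeVal export of the zkjson package (introduced in full below) behaves on the primitive types, judging from the examples in this section (we leave out type 4, whose output includes the recursively encoded internal paths):

const { encodeVal } = require("zkjson")

console.log(encodeVal(null)) // [ 0 ]
console.log(encodeVal(true)) // [ 1, 1 ]
console.log(encodeVal(1)) // [ 2, 1, 0, 1 ]
console.log(encodeVal(3.14)) // [ 2, 1, 2, 314 ]
console.log(encodeVal("abc")) // [ 3, 3, 97, 98, 99 ]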

Now let's convert the values in our original JSON example.

[
  [ [ 1, 1, 97 ], [ 2, 1, 0, 1 ] ],
  [ [ 2, 1, 98, 1, 100 ], [ 3, 4, 102, 111, 117, 114 ] ],
  [ [ 2, 1, 98, 1, 101 ], [ 0 ] ],
  [ [ 1, 1, 99 ], [ 1, 0 ] ],
  [ [ 1, 1, 102 ], [ 2, 1, 2, 314 ] ],
  [ [ 2, 3, 103, 104, 105, 0, 0, 0 ], [ 2, 1, 0, 5 ] ],
  [ [ 2, 3, 103, 104, 105, 0, 0, 1 ], [ 2, 1, 0, 6 ] ],
  [ [ 2, 3, 103, 104, 105, 0, 0, 2 ], [ 2, 1, 0, 7 ] ]
]

Now we are to flatten the entire nested arrays, but each number must be prefixed with the number of digits it contains; otherwise there is no way to tell where to partition the series of digits. And here's another tricky part: if a number contains more than 8 digits, we cannot prefix it with 9, 10, 11... because when all the numbers are concatenated later, 10 no longer means that 10 digits follow; it means 1 digit follows and it's 0. So we allow at most 8 digits in each partition, and a prefix of 9 means another partition follows the current one.

By the way, digits are in fact stored as strings, so a leading 0 won't disappear.
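
Here is a minimal sketch of this partitioning rule (prefixDigits is a hypothetical helper; it assumes long numbers are split into 8-digit chunks, each continuation marked with 9):

// prefix a digit-string with its length; 9 marks a continuing partition
const prefixDigits = digits => {
  if (digits.length <= 8) return String(digits.length) + digits
  return "9" + digits.slice(0, 8) + prefixDigits(digits.slice(8))
}

console.log(prefixDigits("97")) // "297"
console.log(prefixDigits("12345678901")) // "9123456783901"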

This is the prefixed version.

[
  [ [ 1, 1, 1, 1, 2, 97 ], [ 1, 2, 1, 1, 1, 0, 1, 1 ] ],
  [ [ 1, 2, 1, 1, 2, 98, 1, 1, 3, 100 ], [ 1, 3, 1, 4, 3, 102, 3, 111, 3, 117, 3, 114 ] ],
  [ [ 1, 2, 1, 1, 2, 98, 1, 1, 3, 101 ], [ 1, 0 ] ],
  [ [ 1, 1, 1, 1, 2, 99 ], [ 1, 1, 1, 0 ] ],
  [ [ 1, 1, 1, 1, 3, 102 ], [ 1, 2, 1, 1, 1, 2, 3, 314 ] ],
  [ [ 1, 2, 1, 3, 3, 103, 3, 104, 3, 105, 1, 0, 1, 0, 1, 0 ], [ 1, 2, 1, 1, 1, 0, 1, 5 ] ],
  [ [ 1, 2, 1, 3, 3, 103, 3, 104, 3, 105, 1, 0, 1, 0, 1, 1 ], [ 1, 2, 1, 1, 1, 0, 1, 6 ] ],
  [ [ 1, 2, 1, 3, 3, 103, 3, 104, 3, 105, 1, 0, 1, 0, 1, 2 ], [ 1, 2, 1, 1, 1, 0, 1, 7 ] ]
]

Then this is the final form all flattened.

[ 1, 1, 1, 1, 2, 97, 1, 2, 1, 1, 1, 0, 1, 1, 1, 2, 1, 1, 2, 98, 1, 1, 3, 100, 1, 3, 1, 4, 3, 102, 3, 111, 3, 117, 3, 114, 1, 2, 1, 1, 2, 98, 1, 1, 3, 101, 1, 0, 1, 1, 1, 1, 2, 99, 1, 1, 1, 0, 1, 1, 1, 1, 3, 102, 1, 2, 1, 1, 1, 2, 3, 314, 1, 2, 1, 3, 3, 103, 3, 104, 3, 105, 1, 0, 1, 0, 1, 0, 1, 2, 1, 1, 1, 0, 1, 5, 1, 2, 1, 3, 3, 103, 3, 104, 3, 105, 1, 0, 1, 0, 1, 1, 1, 2, 1, 1, 1, 0, 1, 6, 1, 2, 1, 3, 3, 103, 3, 104, 3, 105, 1, 0, 1, 0, 1, 2, 1, 2, 1, 1, 1, 0, 1, 7 ]

It's 144 integers, or 182 digits. The original JSON was 66 characters long when JSON.stringified, so it's not too bad considering integers vs characters (say one ASCII character takes up 3 digits and one Unicode character up to 7 digits). And zk circuits and Solidity cannot handle stringified JSON directly anyway. But it gets better.

When passed to a circuit, all digits will be concatenated into one integer. Circom by default uses the prime modulus

21888242871839275222246405745257275088548364400416034343698204186575808495617 (77 digits)

which means up to 76 digits are safe while a 77-digit number could overflow; this also stays within the range of uint / uint256 in Solidity.

So to convert the encoded array to a circuit signal, it becomes

[
  1111297121110111211298113100131431023111311731141211298113101101111299111011,
  1131021211123314121331033104310510101012111015121331033104310510101112111016,
  121331033104310510101212111017
]
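
The packing itself is straightforward: join all the prefixed digit-strings and split the result into 76-digit chunks, which is essentially what the toSignal export shown later does (a simplified sketch, before the extra compression described next):

const toSignalSketch = nums => {
  const digits = nums.map(String).join("") // digits are handled as strings
  const signals = []
  for (let i = 0; i < digits.length; i += 76) signals.push(digits.slice(i, i + 76))
  return signals // 182 digits -> chunks of 76, 76 and 30
}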

If you observe carefully, there's room for more compression. Most numbers are a single digit with a prefix of 1, so we can remove those prefixes and join each run of single digits, marking the run with 0 followed by the number of single digits in it. For instance 121110111211 becomes 06210121, and we save 4 digits.

We will prefix each integer with 1, since a 0 can now come at the beginning and would disappear without the prefix. So

032123314121331033104310509000210523310331043105090012106233103310431051010

will be prefixed with 1 and become

1032123314121331033104310509000210523310331043105090012106233103310431051010

otherwise the leading 0 would disappear when the string is evaluated as a number.

[
  1111129706210121298113100131431023111311731141211298113101030112990410113102,
  1032123314121331033104310509000210523310331043105090012106233103310431051010,
  10522107
]

Now it's much shorter than before. What's surprising here is that the entire JSON is compressed into just 3 integers in the end (well, almost 2 integers). It's just uint[3] in Solidity. This is extreme efficiency indeed! The zkJSON circuit by default allows up to 256 integers (256 * 76 safe digits), which can contain a huge JSON, and Solidity handles it efficiently with a dynamic array uint[], optimized with Yul assembly. What's even better is that the only bits passed to Solidity are the tiny bits of the value at the queried path, not the entire JSON. So if you are querying the value at the path a, 1111297 (path: "a") and 1042101 (value: 1) are the only digits passed to Solidity as public signals of the zkp.
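
For intuition, the single-digit run compression can be sketched like this (a simplified illustration that assumes runs of fewer than 10 single-digit numbers; the actual codec also handles longer runs):

// drop the 1-prefixes of a run of single digits, mark it with 0 + run length
const compressRun = nums => "0" + nums.length + nums.join("")

console.log(compressRun([ 2, 1, 0, 1, 2, 1 ])) // "06210121" (was "121110111211")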

Now we can build a circuit to handle these digits and prove the value at a selected path without revealing the entire JSON. It's easy to explain the encoding, but much harder to write the actual encoder/decoder and a circuit that properly processes it. Fortunately, we already wrote them!

You can use the zkjson node package to encode and decode JSON.

yarn add zkjson

const { encode, decode, toSignal, fromSignal } = require("zkjson")

const json = { a : 1 }
const encoded = encode(json) // [ 1, 1, 97, 2, 1, 0, 1 ]
const signal = toSignal(encoded) // [ '11111297042101' ]
const encoded2 = fromSignal(signal) // [ 1, 1, 97, 2, 1, 0, 1 ]
const decoded = decode(encoded2) // { a : 1 }

const { encodePath, decodePath, encodeVal, decodeVal } = require("zkjson")

const path = "a"
const encodedPath = encodePath(path) // [ 1, 1, 97 ]
const decodedPath = decodePath(encodedPath) // "a"

const val = 1
const encodedVal = encodeVal(val) // [ 2, 1, 0, 1 ]
const decodedVal = decodeVal(encodedVal) // 1

zkDB

Once we have zkJSON, we can build a database structure with zkJSON as the base building block.

A document-based NoSQL database would have collections, and each collection in turn would have a bunch of documents, which are JSONs.

<div align="center"><img src="./assets/structure.png" /></div>
Collection

We can use a sparse merkle tree (SMT) to represent all the document data in a collection with a single root hash. An SMT is perfect here because circuits cannot handle dynamic tree sizes, an SMT can represent a large number of documents efficiently, and any membership or non-membership can be proven efficiently with a zk proof instead of an actual merkle proof. This is what enables efficient direct queries to offchain databases from within EVM smart contracts.

<div align="center"><img src="./assets/collection.png" /></div>

Each leaf node will be the Poseidon hash of the zkJSON encoding of the data. To hash 256 * 76 digits, 16 Poseidon hashes are hashed together into another Poseidon hash. This allows a fairly large JSON size to be proven.
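
For illustration, the leaf hashing could look like the following with circomlibjs (a sketch under the assumption that the 256 signals are hashed in 16 groups of 16; the actual circuit layout may differ):

const { buildPoseidon } = require("circomlibjs")

const hashLeaf = async signals => {
  // signals: the (up to 256) uints of a zkJSON-encoded document
  const poseidon = await buildPoseidon()
  const groups = []
  for (let i = 0; i < signals.length; i += 16) {
    // 16 inner Poseidon hashes over 16 signals each
    groups.push(poseidon.F.toObject(poseidon(signals.slice(i, i + 16))))
  }
  return poseidon.F.toObject(poseidon(groups)) // hashed together once more
}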

And each leaf node has an index number, so we need to somehow convert document IDs to numbers without collisions. How many leaf nodes an SMT has depends on the pre-defined depth of the tree; for example, a 32-level SMT can have 2 ** 32 = 4294967296 leaf nodes. The depth must be fixed at circuit compile time, so we need to find the right conversion and balance.

Due to this constraint, we only allow a 64-character alphabet for document IDs to keep things compact and efficient, although there can be different optimized setups for your specific use cases.

Now 2 digits can represent one character collision-free, which means we can have only up to 4 characters in a document ID with a 32-level SMT. The last allowed digit always has the possibility of overflowing, so we prefix the converted numbers with 1 to differentiate A from AA (both would be 0 without the prefix).

We can of course increase the depth to allow more characters, but the deeper the tree, the more computation in the circuit, so we need to find the right balance. For instance, to allow 10 characters we need a 67-level SMT.

You can use zkjson to convert the string to an SMT index.

 const { toIndex, fromIndex } = require("zkjson")
 
 const index = toIndex("zkJSON") // 1513609181413
 const str = fromIndex(index) // "zkJSON"
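
For intuition, the conversion appears to prepend 1 and map each character to two digits; here is a hypothetical re-implementation consistent with the example above (the exact 64-character alphabet beyond letters is an assumption):

// "1" prefix + two digits per character (A-Z -> 00..25, a-z -> 26..51)
const toIndexSketch = id =>
  Number("1" + [...id].map(c => {
    const n = c >= "a" ? c.charCodeAt(0) - 97 + 26 : c.charCodeAt(0) - 65
    return String(n).padStart(2, "0")
  }).join(""))

console.log(toIndexSketch("zkJSON")) // 1513609181413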

Practically, a 100-level SMT allows 15-character IDs and 1,267,650,600,228,229,401,496,703,205,376 documents in a collection, which should be sufficient for most applications if the IDs are designed wisely.

One way to get a longer ID length with the same depth is to restrict the allowed characters to 31 or fewer, since 31 * 31 = 961: then 3 digits can represent 2 characters instead of 4 digits representing 2 characters. But we won't cover that here.

Database

For the database, we can take the exact same approach as with the collections: use an SMT to represent multiple collection states in a DB with one root hash, where each leaf node is the merkle root of a collection, which in turn represents all the documents in that collection.

We could give each collection an ID with the same ID-to-index conversion as the documents. However, collection IDs are not as essential as document IDs, since document IDs are usually part of access control rules but collection IDs are not, so we can simply use an incremental count for collection IDs; no well-structured DB has as many collections as documents. Let's say 2 ** 8 = 256: an 8-level SMT gives us 256 collections, which should be more than enough for most applications. If you need alphanumeric IDs for collections, you can map them to numeric indexes offchain (e.g. 0 = FirstCollection, 1 = AnotherCollection, 2 = YetAnotherCollection...). Note that this is different from the deterministic toIndex / fromIndex conversion. This way we can use a smaller tree and keep the circuit small.

<div align="center"><img src="./assets/db.png" /></div>

Now we can write a circuit to prove a collection root hash, then we can write another circuit to prove a database root hash, which represents multiple collections within the database. This circuit can also prove any value in any JSON document in any collection in a database without revealing the entire JSON data. zkJSON enables this.

zkRollup

How do we make zkDB secure and queryable from other blockchains? We can write a circuit to prove the merkle root transitions and deploy a Solidity contract to verify those proofs onchain. Fortunately, Circom auto-generates a Solidity verifier for us, so we can use that function in our verifier contract. We need to keep track of the current database merkle root hash as Solidity contract state.

interface IZKRollup {
  function committer () external view returns (address); // the rollup operator
  function root () external view returns (uint); // current database merkle root
  function commit (uint[] memory zkp) external returns (uint);
}
<div align="center"><img src="./assets/rollup.png" /></div>
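
A rollup operator's commit flow might then look like this from Node.js (a hedged sketch: the contract address, ABI wiring and proof formatting depend on your deployment; ethers v5 is assumed):

const { ethers } = require("ethers")

const abi = [
  "function commit(uint256[] zkp) returns (uint256)",
  "function root() view returns (uint256)",
]

const commitRoot = async (address, signer, zkp) => {
  const rollup = new ethers.Contract(address, abi, signer)
  const tx = await rollup.commit(zkp) // reverts unless the proof verifies
  await tx.wait()
  return await rollup.root() // the new database merkle root
}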

zkQuery

Finally, we can deploy the zkDB query circuit verifier as a Solidity contract too, and make it possible to securely query any path with the right proof. When querying, the Solidity contract must check the DB root hash to verify the queried value against the current database state.

interface IZKQuery {
  function qNull (uint[] memory path, uint[] memory zkp) external view returns (bool);
  function qBool (uint[] memory path, uint[] memory zkp) external view returns (bool);
  function qInt (uint[] memory path, uint[] memory zkp) external view returns (int);
  function qFloat (uint[] memory path, uint[] memory zkp) external view returns (uint[3] memory);
  function qString (uint[] memory path, uint[] memory zkp) external view returns (string memory);
  function qRaw (uint[] memory path, uint[] memory zkp) external view returns (uint[] memory);
  function qCond (uint[] memory path, uint[] memory cond, uint[] memory zkp) external view returns (bool);
}

path[0] is a collection index, and path[1] is a doc index, then the rest of the path follows.
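
For example, building the path argument for a query on the field a could look like this (a sketch: zkQuery is assumed to be a contract instance of IZKQuery and zkp a proof generated offchain against the current DB root):

const { encodePath, toIndex } = require("zkjson")

const collectionIndex = 0 // incremental collection ID (see zkDB above)
const docIndex = toIndex("zkJSON") // document ID converted to an SMT index
const path = [ collectionIndex, docIndex, ...encodePath("a") ]

// const one = await zkQuery.qInt(path, zkp) // 1, if the proof checks out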

qNull returns true only if the value is null and throws an error otherwise. And qFloat returns the array of encoded numbers without the type prefix (e.g. [ 1, 2, 314 ]) since Solidity cannot handle floating-point numbers.

qRaw returns the raw encoded value for non-primitive data types (array and object), and you can further query the raw value with the getX functions: pass the raw value returned from qRaw along with the path to query, instead of a zk proof.

interface IZKQuery {
  function getNull (uint[] memory path, uint[] memory raw) external pure returns (bool);
  function getBool (uint[] memory path, uint[] memory raw) external pure returns (bool);
  function getInt (uint[] memory path, uint[] memory raw) external pure returns (int);
  function getFloat (uint[] memory path, uint[] memory raw) external pure returns (uint[3] memory);
  function getString (uint[] memory path, uint[] memory raw) external pure returns (string memory);
}
Conditional Operators

qCond queries a field with a conditional operator and returns true if the condition is met.

const { encodeQuery } = require("zkjson")

const json = { num: 5, arr: [ 1, 2, 3 ]}

// for num field
const num_gt = encodeQuery([ "$gt", 4 ])
const num_gte = encodeQuery([ "$gte", 5 ])
const num_lt = encodeQuery([ "$lt", 6 ])
const num_lte = encodeQuery([ "$lte", 5 ])
const num_eq = encodeQuery([ "$eq", 5 ])
const num_ne = encodeQuery([ "$ne", 7 ])
const num_in = encodeQuery([ "$in", [ 4, 5, 6 ]])
const num_nin = encodeQuery([ "$nin", [ 1, 2, 3 ]])

// for arr field
const arr_contains = encodeQuery([ "$contains", 3 ])
const arr_contains_any = encodeQuery([ "$contains_any", [ 3, 4, 5 ]])
const arr_contains_all = encodeQuery([ "$contains_all", [ 2, 3 ]])
const arr_contains_none = encodeQuery([ "$contains_none", [ 4, 5, 6 ]])
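
Onchain, these encoded conditions are then passed to qCond along with the path and proof, along these lines (hypothetical wiring: zkQuery, num_path and zkp are assumed to be set up as in the previous sections):

// prove that json.num > 4 without revealing the rest of the document
const ok = await zkQuery.qCond(num_path, num_gt, zkp) // true, since 5 > 4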
Other Structures

You could also write functions to extract an array of numbers or a more specific data structure, but which data types to extract depends on your application, so we leave that up to you.

<div align="center"><img src="./assets/query.png" /></div>

Going Further

With the first 4 components (zkJSON / zkDB / zkRollup / zkQuery), it's now technically possible to build a fully verifiable zkp-based DB connecting blockchains and offchain data. But that alone doesn't make it practical for real use cases. We will briefly introduce the 3 additional steps of WeaveDB / WeaveDB Rollup / WeaveChain that implement zkDB in the real world.

WeaveDB

WeaveDB is a general-purpose NoSQL database as a smart contract. It utilizes SCP (Storage-based Consensus Paradigm) enabled by Arweave, and the entire database including indexes is a SmartWeave contract. It has a powerful DSL called FPJSON to operate on JSON objects, which enables highly advanced features a decentralized database would require.

WeaveDB queries are almost compatible with Firestore from Google but way more powerful thanks to FPJSON. In the future, we will write a circuit to prove all the FPJSON operations so zkDB will be even more secure with powerful data manipulations.

Each data block of WeaveDB will be a zkJSON document, so we can query WeaveDB data directly from Ethereum smart contracts as well as from Arweave smart contracts (SmartWeave).

WeaveDB Rollup

SmartWeave (Arweave smart contracts) provides scalability and cost-effectiveness with lazy offchain computation. This is the only way to hyperscale the decentralized web. But when it comes to databases, a blockchain sequencer is a bottleneck for performance and latency, because the sequencer processes transactions in sequence while a DB must maintain ACID properties with hyper-low latency. So WeaveDB has developed an L3 rollup to the L2 sequencer (Warp) on top of the L1 Arweave permanent storage. This way, we can have a centralized node for parallel query execution with the high performance and low latency of web2 cloud databases, while keeping full decentralization with L1 verifiability and L2 composability.

Although Arweave already guarantees permanent data verifiability and full decentralization, WeaveDB rollups can optionally inherit the Ethereum (or any EVM) security and interoperability with EVM smart contracts via zkp by turning on the zkRollup feature.

<div align="center"><img src="./assets/weavedb.png" /></div>

WeaveDB Rollup will roll out in 3 phases.

WeaveChain

WeaveChain will be a CosmosSDK-based DePIN blockchain and a marketplace to match database developers / dapps with rollup operators. It's basically Filecoin for databases: zkDB/WeaveDB is to WeaveChain as IPFS is to Filecoin. We will introduce 2 unique components to connect with real-world data and web2.

WeaveChain will also be a PoS network to manage rollup nodes, and IBC-compatible to communicate between different chains.

Resources

Tutorials

API Reference

Demos

Other Links