Home

Awesome

tfhesql-rs

A pure Rust library for executing simple FHE-encrypted SQL queries on a clear database using TFHE-rs

APIs

the lib comes with two API flavours. The lib API and the bounty-specific API.

Bounty API Example

use tfhe::{ClientKey, ConfigBuilder};
use tfhesql::test_util::tfhesql_test_db_dir;
use tfhesql::bounty_api::*;

fn main() {
    let csv_dir = tfhesql_test_db_dir("tiny");

    // 1. Call default_parameters
    let params = default_parameters();

    // Generate Keys
    let config = ConfigBuilder::with_custom_parameters(params.0, None).build();
    let client_key = ClientKey::generate(config);
    let sks = client_key.generate_server_key();

    // 2. Call load_tables
    let tables = load_tables(&csv_dir);

    // 3. Call encrypt
    let query = "SELECT CustomerID,PostalCode,Country FROM Customers WHERE Country='Germany'";
    let enc_query = encrypt_query(query, &client_key);
    
    // 4. Call run_fhe_query
    let enc_result = run_fhe_query(&sks, &enc_query, &tables);

    // 5. Call decrypt_result to retrieve the clear result as a csv string
    let csv_string = decrypt_result(&client_key, &enc_result);
    
    println!("{}", csv_string);
}

tfhesql API Example

use tfhesql::test_util::{print_pretty_batches, tfhesql_test_db_dir};
use tfhesql::*;

fn main() {
    let csv_dir = tfhesql_test_db_dir("medium");

    // Client Side
    // ===========

    // 1. Load the SAME schemas in the SAME order as the server.
    //    This is critical since server and client must share
    //    the same table + schema order
    let client_ordered_schemas = OrderedSchemas::load_from_directory(&csv_dir).unwrap();

    // 2. Creates a new FheSqlClient instance
    let sql_client = FheSqlClient::new(client_ordered_schemas.clone()).unwrap();

    // 3. Generates a new SQL query with a SQL SELECT statement and the default options (compress = true, format = by rows + padding).
    let sql = "SELECT CustomerID,PostalCode,Country FROM Customers WHERE Country IN ('France', 'Germany')";
    let clear_sql_query = sql_client.clear_sql(sql, SqlResultOptions::best()).unwrap();

    // Server Side
    // ===========

    // 1. Load csv file located in the specified directory and stores them into an ordered list of tables.
    //    Note: Order is critical and should remain sealed since all the masking operations between the client
    //    and the server are based on it.
    let server_tables = OrderedTables::load_from_directory(&csv_dir).unwrap();

    // 2. Executes the SQL query on the server
    let clear_sql_result = FheSqlServer::run(&clear_sql_query, &server_tables).unwrap();

    // Client Side
    // ===========

    // 1. Extract the RecordBatch from the SQL query result
    let rb = clear_sql_result.clone().into_record_batch().unwrap();

    // 2. Prints the RecordBatch using arrow pretty print
    print_pretty_batches(&[rb]).unwrap();

    // 3. FYI, displays the total number of Boolean + U8 gates
    //    When tfhesql lib is compiled with 'stats' feature
    #[cfg(feature = "stats")]
    clear_sql_result.print_stats();
}

The Problem & The Approach

  1. Define a SQL query format
  2. Write an SQL SELECT interpretor
  3. Define a SQL result

While progressing in solving the bounty, it became very clear that the final solution would be impracticable on real-life SQL databases. It's likely impracticable on very small SQL databases as well (kind of depressing 😩).

Thus, significant efforts were directed towards optimizing performance to enable the execution of a SQL SELECT request on a very small database. Emphasis was placed on refining both the SQL query and result formats.

The SQL interpreter was deliberately tested less for the following reasons:

SQL SELECT type comparison specs

The lib supports the MySQL SELECT type comparison specs.

Note: MSSQL type comparison specs are much more advanced.

SQL AST Tree

  1. Simplification: the AST Tree is simplified to get rid of the parenthesis, +/- signs and 'Not' unary operators.
  2. Optimisation: the tree is parsed to eliminate trivial binary operations (for example: Some U8 > -1 is always true)
  3. Conversion to Binary Tree: the tree is expanded to form a perfect binary tree with the following properties:

Encoding the right operand

One of the major optimisation lies on the type of data sent to the server. The goal was to maximize performance at the expense of a bigger SQL Query size. The tradeoff appeared to be massively beneficial, and the relative large size of the request could be solved by using CompressedFheBool types plus a global zip to reduce the size of the query.

The right operand values are converted into 32xu8 values, each of these values are computed to produce 256 pre-calculated boolean pairs using the following formula:

With 0 <= i < 256
Pair(0,value,i) = (value == i) 
Pair(1,value,i) = (value > i) 

All the computed values are stored in a SqlQueryRightBytes256 structure.

Encrypted SQL Request format

struct EncryptedSqlQuery<B> {
    /// A header containing the crypted projections and table
    header: TableBoolMaskHeader<B>,
    /// A crypted boolean: True if running a SELECT DISTINCT query
    is_distinct: B,
    /// A structure defining the WHERE clause
    where_tree: SqlQueryTree<B>,
}

pub struct TableBoolMaskHeader<B> {
    /// A crypted boolean mask
    /// Len = the number of tables
    pub table_mask: BoolMask<B>,
    /// A crypted boolean mask 
    /// Len = Maximum number of columns in a single table
    pub field_mask: BoolMask<B>,
    /// A crypted boolean mask = NOT(field_mask)
    pub not_field_mask: BoolMask<B>,
}

pub struct SqlQueryTree<B> {
    /// A binary tree of BoolBinOpMask<B> nodes.
    /// Each node can be one of the following
    /// - a AND operator
    /// - a OR operator 
    /// - None
    tree: OptionalBoolTree<B>,
    /// A vector of boolean pairs (len = 2^k).
    /// One boolean pair for each leaf of `tree`.
    /// - True: the leaf is dummy
    /// - False: the leaf is a valid Binary Operation.
    pub(super) dummy_mask: Vec<EqNe<B>>,
    pub(super) compare_ops: SqlQueryBinOpArray<B>,
}

pub struct OptionalBoolTree<B> {
    // tree.len = 2^levels - 1
    tree: Vec<BoolBinOpMask<B>>,
}

pub struct BoolBinOpMask<B> {
    /// True is op is a binary AND operator
    pub is_and: B,
    /// True is op is a binary OR operator
    pub is_or: B,
}

pub struct SqlQueryBinOpArray<B> {
    pub(super) array: Vec<SqlQueryBinaryOp<B>>,
}

pub struct SqlQueryBinaryOp<B> {
    /// Leaf position is the binary tree
    /// The len is equal to 2^n = total number of leaves 
    pub position_mask: BoolMask<B>,
    /// Bool mask which encodes the operator: =, >, <, <=, >=, <>
    /// Len = 6
    pub comparator_mask: ComparatorMask<B>,
    /// The left operand column name encoded as a boolean mask 
    /// The mask len = the number of columns. The mask is full of zeros 
    /// except one 1 at the column index corresponding to the left operand column 
    /// identifier. 
    pub left_ident_mask: BoolMask<B>,
    /// The right operand (can be a value or an identifier)
    pub right: SqlQueryRightOperand<B>,
}

pub struct SqlQueryRightOperand<B> {
    /// The column mask (if not a numerical or ASCII value)
    pub ident_mask: BoolMask<B>,
    /// Right operand data encoded in structure of 4x64Bits words map
    pub bytes_256: SqlQueryRightBytes256<B>,
    /// True is the right operand is a numerical value strictly negative
    pub is_strictly_negative: EqNe<B>,
    /// True is the right operand is a value (numerical or ASCII)
    pub is_value: B,
}

/// A 4x64Bits map
pub struct SqlQueryRightBytes256<B> {
    pub word_0_eq_gt: Bytes64EqGt<B>,
    pub word_1_eq_ne: Bytes64EqNe<B>,
    pub word_2_eq_ne: Bytes64EqNe<B>,
    pub word_3_eq_ne: Bytes64EqNe<B>,
}

Encrypted Result

The lib API offers 3 SQL result formats. For each format type a specific bunch of Bytes are computed. The bytes are later masked with the given table mask to produce the final encrypted result.

pub struct SqlResultOptions {
    compress: bool,
    format: SqlResultFormat,
}

pub enum SqlResultFormat {
    RowBytes(bool),
    TableBytesInRowOrder,
    TableBytesInColumnOrder,
}

Final bytes order as stored in the SQL encrypted result structure

The following table has 4 columns and 2 rows:

Col1Col2Col3Col4
abcd
efgh

The bytes are arranged as follow.

// SqlResultFormat::RowBytes(padding)
[
    [compress_byte_array(a, b, c, d)]
    [compress_byte_array(e, f, g, h)]
]

// SqlResultFormat::TableBytesInRowOrder
[compress_byte_array(a, b, c, d, e, f, g ,h)]

// SqlResultFormat::TableBytesInRowOrder
[compress_byte_array(a, e, b, f, c, g, d ,h)]

The Rust encrypted structure

The following structure describes how the SQL result is computed. It consists of an encrypted part and a clear part.

pub(crate) struct SqlResult<U8, B> {
    /// Encrypted part
    
    /// A boolean mask (redundant) that encodes the unique selected table
    table_mask: BoolMask<B>,
    /// A boolean mask (redundant) that encodes the multiple selected columns
    field_mask: BoolMask<B>,
    /// A boolean mask that encodes the selected rows computed from the SQL SELECT operation
    select_mask: BoolMask<B>,
    /// A two-dimentional array of encrypted bytes which encodes the table data values
    /// How this byte array is computed is explained below
    byte_arrays: Vec<ByteArray<U8>>,

    /// Clear part. Redundant, allows self-decryption.
    pub(crate) options: SqlResultOptions,
    pub(crate) ordered_schemas: OrderedSchemas,

    #[cfg(feature = "stats")]
    #[serde(skip_serializing, skip_deserializing)]
    pub(crate) stats: SqlStats,
}

Benchmarks

The following results where optained running the clear-api.rs example located in the tfhesql lib examples directory.

TableRowsColumns
Customers.csv911xUIn32 column and 6xASCII columns
Categories.csv81xUIn32 column and 2xASCII columns
SELECT CustomerID,PostalCode,Country FROM Customers WHERE Country IN ('France', 'Germany')
FormatBool ORUInt8 ORBool ANDUInt8 ANDBool NOTUInt8 NOTIFTotalGain
compress=false, ByRow(true)1835755217533186611029001137710 %
compress=false, TableBytesInRowOrder1835756917533184301029001129151 %
compress=false, TableBytesInColumnOrder1835756417533184241029001128711 %
compress=true, ByRow(true)1835745517533111021029008314727 %
compress=true, TableBytesInRowOrder183572431753351711029005857549 %
compress=true, TableBytesInColumnOrder183572381753350561029005809549 %

Cost of result

One reason why the overhaul solution appears to always be somewhat impracticable is that a final U8 masking operation will always be performed on every single data value in every table of the database.

Total Number of U8 AND operations = Sum(0 <= t < n_tables; NumOfRows(t)*NumColumns(t))

Where to go from here ?