

Logistic Regression Training Examples using OpenFHE

This repository provides examples of training logistic regression models on FHE-encrypted data using the OpenFHE library. The implementation demonstrates how to use OpenFHE for model training; the examples are for illustrative purposes only and are not intended for benchmarking. Much more efficient approaches to logistic regression training exist, but they either use proprietary designs or are less suitable as illustrative examples. The approach used here is based on Nesterov-accelerated gradient descent.

Note: These examples were developed as part of the DARPA DPRIVE program. They are solely for research purposes and should not be used for benchmarking or for production purposes where performance is critical. Although the sample code was contributed by Duality Technologies, it is not related to the privacy-preserving logistic regression training capabilities provided in past, present, or future Duality Technologies products.

Table of Contents

  1. Logistic Regression
  2. Building the Code
  3. Implementation Notes
    1. Iterative Bootstrapping
    2. Sparse Packing
  4. Contents
    1. C++ Files
    2. PyScripts
    3. Results
    4. Sigmoid Approx Results
    5. Train Data
  5. Acknowledgments

Logistic Regression

In this repository we have implemented logistic-regression model training and model inference on the 2014 US Infant Mortality Dataset (as provided in the train_data directory).


Given our design matrix, $\mathbf{X}$, and our vector of weights, $\bar{w}$, our model prediction is expressed as:

$$
\hat{y} = \mathrm{sigmoid}(\mathbf{X}\bar{w})
$$

Note: the sigmoid function is also called the logistic function, which gives logistic regression its name.

Note: one limitation of FHE is that we cannot directly calculate non-linear functions. As such, we use polynomial approximations of our non-linear functions. In OpenFHE, we provide the utility EvalChebyshev to compute such approximations. See our official documentation and our working example.
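To make the idea of a polynomial stand-in for the sigmoid concrete, here is a minimal plaintext sketch (no encryption involved) that fits a Chebyshev interpolant and measures its error. The interval [-8, 8], the degree 15, and the helper names `chebyshev_coefficients` and `chebyshev_eval` are illustrative assumptions, not the values or functions used in this repository.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def chebyshev_coefficients(func, a, b, degree):
    """Coefficients c_0..c_degree of the Chebyshev interpolant of func on [a, b]."""
    n = degree + 1
    # Angles of the Chebyshev nodes; cos(theta) is the node on [-1, 1].
    thetas = [math.pi * (k + 0.5) / n for k in range(n)]
    # Function values at the nodes mapped from [-1, 1] to [a, b].
    values = [func(0.5 * (b - a) * math.cos(t) + 0.5 * (b + a)) for t in thetas]
    coeffs = []
    for j in range(n):
        total = sum(v * math.cos(j * t) for v, t in zip(values, thetas))
        coeffs.append(2.0 * total / n)
    coeffs[0] *= 0.5  # the constant term carries a factor of 1/2
    return coeffs


def chebyshev_eval(coeffs, a, b, x):
    """Evaluate sum_j c_j T_j(u) with Clenshaw's recurrence, where u maps x to [-1, 1]."""
    u = (2.0 * x - a - b) / (b - a)
    b1 = b2 = 0.0
    for c in reversed(coeffs[1:]):
        b1, b2 = 2.0 * u * b1 - b2 + c, b1
    return u * b1 - b2 + coeffs[0]


# Assumed interval and degree, chosen only for illustration.
coeffs = chebyshev_coefficients(sigmoid, -8.0, 8.0, 15)
grid = [-8.0 + 16.0 * i / 400 for i in range(401)]
max_error = max(abs(chebyshev_eval(coeffs, -8.0, 8.0, x) - sigmoid(x)) for x in grid)
print(f"max |approx - sigmoid| on [-8, 8]: {max_error:.2e}")
```

Widening the interval or lowering the degree increases the error, which is the trade-off to keep in mind when choosing approximation parameters.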

Loss function

Our loss function is the cross-entropy loss:

$$
l = -\left(y \log(\hat{y}) + (1 - y) \log(1 - \hat{y})\right)
$$

Gradient Descent

We optimize our model via gradient descent, which requires the gradient of the loss with respect to the weights:

$$
\frac{\partial l}{\partial \bar{w}} = -\mathbf{X}^T(y - \hat{y})
$$
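As a plaintext sanity check of the gradient formula, the sketch below compares it against a central finite difference of the cross-entropy loss. It assumes the loss is summed over all rows (which matches a gradient without a 1/n factor); the tiny dataset and the function names are made up for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(X, w):
    """y_hat = sigmoid(X w), computed row by row."""
    return [sigmoid(sum(x_j * w_j for x_j, w_j in zip(row, w))) for row in X]


def cross_entropy(X, y, w):
    """Cross-entropy loss summed over all samples."""
    y_hat = predict(X, w)
    return -sum(
        y_i * math.log(p) + (1.0 - y_i) * math.log(1.0 - p)
        for y_i, p in zip(y, y_hat)
    )


def gradient(X, y, w):
    """Analytic gradient: -X^T (y - y_hat)."""
    y_hat = predict(X, w)
    residual = [y_i - p for y_i, p in zip(y, y_hat)]
    return [-sum(row[j] * r for row, r in zip(X, residual)) for j in range(len(w))]


# Illustrative data: the first column is a constant bias feature.
X = [[1.0, 0.5], [1.0, -1.5], [1.0, 2.0]]
y = [1.0, 0.0, 1.0]
w = [0.1, -0.2]

# Central finite differences of the loss, one coordinate at a time.
eps = 1e-6
numeric = []
for j in range(len(w)):
    w_plus, w_minus = list(w), list(w)
    w_plus[j] += eps
    w_minus[j] -= eps
    numeric.append((cross_entropy(X, y, w_plus) - cross_entropy(X, y, w_minus)) / (2 * eps))

analytic = gradient(X, y, w)
print("analytic:", analytic)
print("numeric: ", numeric)
```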

Optimization method

Nesterov Accelerated Gradient

Nesterov Accelerated Gradient can be thought of as a classical gradient descent, with a second "phase" that involves a special momentum parameter.

Given our parameters, $\theta$ (the weights) and $\phi$ (the momentum term), our update can be expressed as follows:

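$$
\begin{aligned}
\text{Stage 1, momentum update:} \quad & \phi_{t+1} = \theta_t - \eta \nabla f(\theta_t) \\
\text{Stage 2, weight update:} \quad & \theta_{t+1} = \phi_{t+1} + \mu (\phi_{t+1} - \phi_t)
\end{aligned}
$$

Here $f$ is the loss being minimized, $\eta$ is the learning rate, and $\mu$ is the momentum parameter.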
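The two stages can be sketched in plaintext Python as follows. This is a minimal illustration, not the repository's encrypted implementation: it takes $f$ to be the cross-entropy loss, so the gradient step descends (the gradient is subtracted), and the dataset, the learning rate `eta`, the momentum `mu`, and the iteration count are assumed values chosen for the example.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def loss(X, y, theta):
    """Cross-entropy loss summed over all samples."""
    total = 0.0
    for row, y_i in zip(X, y):
        p = sigmoid(sum(x_j * t_j for x_j, t_j in zip(row, theta)))
        total -= y_i * math.log(p) + (1.0 - y_i) * math.log(1.0 - p)
    return total


def loss_gradient(X, y, theta):
    """Gradient of the summed cross-entropy loss: -X^T (y - y_hat)."""
    y_hat = [sigmoid(sum(x_j * t_j for x_j, t_j in zip(row, theta))) for row in X]
    return [
        -sum(row[j] * (y_i - p) for row, y_i, p in zip(X, y, y_hat))
        for j in range(len(theta))
    ]


def nag_train(X, y, eta, mu, iterations):
    dim = len(X[0])
    theta = [0.0] * dim  # weights
    phi = [0.0] * dim    # momentum term
    for _ in range(iterations):
        grad = loss_gradient(X, y, theta)
        # Stage 1: gradient step taken from the current weights.
        phi_next = [t - eta * g for t, g in zip(theta, grad)]
        # Stage 2: extrapolate along the change in phi.
        theta = [pn + mu * (pn - p) for pn, p in zip(phi_next, phi)]
        phi = phi_next
    return theta


# Illustrative, non-separable data; the first column is a bias feature.
X = [[1.0, -2.0], [1.0, -1.0], [1.0, -0.5], [1.0, 0.5], [1.0, 1.0], [1.0, 2.0]]
y = [0.0, 0.0, 1.0, 0.0, 1.0, 1.0]

theta = nag_train(X, y, eta=0.1, mu=0.9, iterations=200)
initial_loss = loss(X, y, [0.0, 0.0])
final_loss = loss(X, y, theta)
grad_norm = math.sqrt(sum(g * g for g in loss_gradient(X, y, theta)))
print("theta:", theta, "loss:", final_loss, "gradient norm:", grad_norm)
```

Setting `mu=0.0` reduces the loop to plain gradient descent, which is a convenient baseline when checking the effect of the momentum stage.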

FHE Logistic Regression Notes

Although logistic regression in-the-clear and in FHE are similar, we leverage a number of optimizations. For example, we pre-compute various multiplications in-the-clear to reduce the number of required ciphertext multiplications.


Building this repository

  1. Install OpenFHE as per the instructions

  2. Build this repo

mkdir build
cd build
cmake ..
make -j N

where N is the number of cores you want to use.

  3. Go into your build directory and run ./lr_nag.

Options

CLI Arguments + Defaults

-b flag: whether to run with bootstrapping. DEFAULT: true
-n int: number of iterations for training. DEFAULT: 200
-r int: rows to read from the dataset. DEFAULT: -1 (read all)
-x string: train features (CSV). DEFAULT: train_data/X_norm_1024.csv
-y string: train labels (CSV). DEFAULT: train_data/y_1024.csv
-j string: test features (CSV). DEFAULT: train_data/X_norm.csv
-k string: test labels (CSV). DEFAULT: train_data/y.csv
-d int: ring dimension. DEFAULT: 1 << 17
-w string: Output file prefix. DEFAULT: See below
-p int: Output precision. DEFAULT: 0. If non-zero, we run a 2-iteration bootstrap. See below for more information

-w default: depends on the formulation (sgd/nag), and amounts to either ../results/nag_ or ../results/sgd_
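As a usage illustration, the sketch below assembles a command line from the options listed above and prints it without running anything. The helper name `build_lr_nag_command`, the chosen values, and the output prefix `../results/demo_` are hypothetical; only the flags themselves come from the list above.

```python
import shlex


def build_lr_nag_command(iterations, rows, output_prefix=None):
    """Assemble an argument list for ./lr_nag from a few of the documented flags."""
    command = ["./lr_nag", "-n", str(iterations), "-r", str(rows)]
    if output_prefix is not None:
        command += ["-w", output_prefix]
    return command


# A short run on the first 64 rows of the default training set (values are illustrative).
command = build_lr_nag_command(iterations=50, rows=64, output_prefix="../results/demo_")
print(shlex.join(command))
```

Options that are not passed keep the defaults documented above.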

Implementation Notes:

Multi-iteration bootstrap

In the case of 64-bit precision (specified when installing OpenFHE), we can run the bootstrap twice, which improves precision and should rival that of bootstrapping in 128-bit. If you specify a non-zero output precision (the -p option), we run in 2-iteration mode; otherwise we run a single iteration. See iterative-ckks-bootstrapping for more information.
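The intuition for why a second iteration helps can be shown with a toy numerical model. This is not OpenFHE code and not the library's algorithm: `toy_bootstrap` is a hypothetical stand-in that returns its input plus a bounded error, and the second pass refreshes the scaled-up residual of the first pass so that the remaining error shrinks by roughly the same factor again.

```python
import random


def toy_bootstrap(value, precision_bits, rng):
    """Toy stand-in for one bootstrap: value plus an error of magnitude at most 2**-precision_bits."""
    return value + rng.uniform(-1.0, 1.0) * 2.0 ** -precision_bits


def two_iteration_bootstrap(value, precision_bits, rng):
    """Refresh once, then refresh the scaled residual and add it back as a correction."""
    first = toy_bootstrap(value, precision_bits, rng)
    residual = value - first                    # error left by the first pass
    scaled = residual * 2.0 ** precision_bits   # scale it up to order 1
    correction = toy_bootstrap(scaled, precision_bits, rng) * 2.0 ** -precision_bits
    return first + correction


rng = random.Random(0)
message, bits = 0.3, 20  # illustrative values
single_error = abs(toy_bootstrap(message, bits, rng) - message)
double_error = abs(two_iteration_bootstrap(message, bits, rng) - message)
print(f"one iteration error:  {single_error:.3e}")
print(f"two iteration error:  {double_error:.3e}")
```

In this toy model the error bound drops from about 2^-20 to about 2^-40, which mirrors the qualitative claim that two iterations at lower precision can rival a single higher-precision bootstrap.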

Sparse Packing

We pack $\theta$ and $\phi$ into a single ciphertext, so that each iteration needs only one bootstrap instead of two (one per parameter). See advanced-ckks-bootstrapping for more information.
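A plaintext sketch of the packing idea follows. The slot layout (the first vector followed by the second, zero-padded to the slot count) is an assumption made for illustration and may differ from the layout used in the C++ code; `toy_refresh` is a hypothetical stand-in for a bootstrap that only counts its invocations.

```python
def pack(theta, phi, slots):
    """Place theta and phi side by side in one slot vector, zero-padded to `slots` entries."""
    if len(theta) != len(phi):
        raise ValueError("theta and phi must have the same length")
    if 2 * len(theta) > slots:
        raise ValueError("not enough slots for both vectors")
    return list(theta) + list(phi) + [0.0] * (slots - 2 * len(theta))


def unpack(packed, dim):
    """Recover theta and phi from the packed slot vector."""
    return packed[:dim], packed[dim:2 * dim]


counter = {"refreshes": 0}


def toy_refresh(vector):
    """Toy stand-in for a bootstrap; records how many times it is called."""
    counter["refreshes"] += 1
    return list(vector)


theta = [0.1, -0.2, 0.3]
phi = [0.05, 0.0, -0.1]

# One refresh covers both parameter vectors.
packed = toy_refresh(pack(theta, phi, slots=8))
theta_out, phi_out = unpack(packed, dim=len(theta))
print("refreshes:", counter["refreshes"], "theta:", theta_out, "phi:", phi_out)
```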

Repository Contents

C++ Code

py_scripts folder

Contains miscellaneous Python scripts.

Note: the code for training in parameter_search.ipynb and step_by_step_training_debugger.ipynb is very similar. However, the code is separated because we use numba to accelerate our hyperparameter search, and numba is not amenable to debugging via print.



results folder

Where we store the results of our encrypted runs.

Note: our encrypted run produces four artifacts:


Sigmoid Approx Results

Investigates what happens as we modify the sigmoidApprox parameters. The goal of this is to explore:




Train Data

Contains the data files. We prototyped on (X_norm_64, y_64) and then validated on (X_norm_1024, y_1024) before moving to the full-scale (X_norm_32764, y_32764).

reduceDataset.py subsamples the dataset so that the numbers of true and false cases are equal.
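The kind of class balancing described above can be sketched as follows. This is not the code of reduceDataset.py; the function name `balance_classes`, the random downsampling strategy, and the toy data are assumptions made for illustration.

```python
import random


def balance_classes(rows, labels, seed=0):
    """Randomly downsample the larger class so both classes have the same count."""
    rng = random.Random(seed)
    positives = [i for i, label in enumerate(labels) if label == 1]
    negatives = [i for i, label in enumerate(labels) if label != 1]
    keep = min(len(positives), len(negatives))
    # Sample without replacement from each class, then restore the original order.
    chosen = sorted(rng.sample(positives, keep) + rng.sample(negatives, keep))
    return [rows[i] for i in chosen], [labels[i] for i in chosen]


# Illustrative imbalanced data: 2 positive cases and 6 negative cases.
rows = [[float(i)] for i in range(8)]
labels = [1, 0, 0, 0, 1, 0, 0, 0]

balanced_rows, balanced_labels = balance_classes(rows, labels)
print(balanced_rows, balanced_labels)
```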


Acknowledgments

These examples were mainly developed by Ian Quah, with some contributions/suggestions from Ahmad Al Badawi, David Bruce Cousins, and Yuriy Polyakov.


Distribution Statement "A" (Approved for Public Release, Distribution Unlimited). This work is supported in part by DARPA through HR0011-21-9-0003 and HR0011-20-9-0102. The views, opinions, and/or findings expressed are those of the author(s) and should not be interpreted as representing the official views or policies of the Department of Defense or the U.S. Government.