Home

Awesome

p256-m is a minimalistic implementation of ECDH and ECDSA on NIST P-256, especially suited to constrained 32-bit environments. It's written in standard C, with optional bits of assembly for Arm Cortex-M and Cortex-A CPUs.

Its design is guided by the following goals in this order:

  1. correctness & security;
  2. low code size & RAM usage;
  3. runtime performance.

Most cryptographic implementations care more about speed than footprint, and some might even risk weakening security for more speed. p256-m was written because I wanted to see what happened when reversing the usual emphasis.

The result is a full implementation of ECDH and ECDSA in less than 3KiB of code, using less than 768 bytes of RAM, with comparable performance to existing implementations (see below) - in less than 700 LOC.

Contents of this Readme:

Correctness

API design:

Testing:

Code quality:

Short Weierstrass pitfalls:

Its has been pointed out that the NIST curves, and indeed all Short Weierstrass curves, have a number of pitfalls including risk for the implementation to:

Security

In addition to the above correctness claims, p256-m has the following properties:

These properties are checked using valgrind and MemSan with the ideas behind ctgrind, see consttime.sh.

In addition to avoiding branches and memory accesses depending on secret data, p256-m also avoid instructions (or library functions) whose execution time depends on the value of operands on cores of interest. Namely, it never uses integer division, and for multiplication by default it only uses 16x16->32 bit unsigned multiplication. On cores which have a constant-time 32x32->64 bit unsigned multiplication instruction, the symbol MUL64_IS_CONSTANT_TIME can be defined by the user at compile-time to take advantage of it in order to improve performance and code size. (On Cortex-M and Cortex-A cores wtih GCC or Clang this is not necessary, since inline assembly is used instead.)

As a result, p256-m should be secure against the following classes of attackers:

  1. attackers who can only manipulate the input and observe the output;
  2. attackers who can also measure the total computation time of the operation;
  3. attackers who can also observe and manipulate micro-architectural features such as the cache or branch predictor with arbitrary precision.

However, p256-m makes no attempt to protect against:

  1. passive physical attackers who can record traces of physical emissions (power, EM, sound) of the CPU while it manipulates secrets;
  2. active physical attackers who can also inject faults in the computation.

(Note: p256-m should actually be secure against SPA, by virtue of being fully constant-flow, but is not expected to resist any other physical attack.)

Warning: p256-m requires an externally-provided RNG function. If that function is not cryptographically secure, then neither is p256-m's key generation or ECDSA signature generation.

Note: p256-m also follows best practices such as securely erasing secret data on the stack before returning.

Code size

Compiled with ARM-GCC 9, with -mthumb -Os, here are samples of code sizes reached on selected cores:

Clang was also tried but tends to generate larger code (by about 10%). For details, see sizes.sh.

What's included:

What's excluded:

RAM usage

p256-m doesn't use any dynamic memory (on the heap), only the stack. Here's how much stack is used by each of its 4 public functions on selected cores:

FunctionCortex-M0Cortex-M4Cortex-A7
p256_gen_keypair608564564
p256_ecdh_shared_secret640596596
p256_ecdsa_sign664604604
p256_ecdsa_verify752700700

For details, see stack.sh, wcs.py and libc.msu (the above figures assume that the externally-provided RNG function uses at most 384 bytes of stack).

Runtime performance

Here are the timings of each public function in milliseconds measured on platforms based on a selection of cores:

FunctionCortex-M0Cortex-M4Cortex-A7
p256_gen_keypair92114511
p256_ecdh_shared_secret92214411
p256_ecdsa_sign99015512
p256_ecdsa_verify197630924
Sum of the above480975359

The sum of these operations corresponds to a TLS handshake using ECDHE-ECDSA with mutual authentication based on raw public keys or directly-trusted certificates (otherwise, add one 'verify' for each link in the peer's certificate chain).

Note: the above figures where obtained by compiling with GCC, which is able to use inline assembly. Without that inline assembly (22 lines for Cortex-M0, 1 line for Cortex-M4), the code would be roughly 2 times slower on those platforms. (The effect is much less important on the Cortex-A7 core.)

For details, see bench.sh, benchmark.c and on-target-benchmark/.

Comparison with other implementations

The most relevant/convenient implementation for comparisons is TinyCrypt, as it's also a standalone implementation of ECDH and ECDSA on P-256 only, that also targets constrained devices. Other implementations tend to implement many curves and build on a shared bignum/MPI module (possibly also supporting RSA), which makes fair comparisons less convenient.

The scripts used for TinyCrypt measurements are available in this branch, based on version 0.2.8.

Code size

Corep256-mTinyCrypt
Cortex-M029886134
Cortex-M429005934
Cortex-A729245934

RAM usage

TinyCrypto also uses no heap, only the stack. Here's the RAM used by each operation on a Cortex-M0 core:

operationp256-mTinyCrypt
key generation608824
ECDH shared secret640728
ECDSA sign664880
ECDSA verify752824

On a Cortex-M4 or Cortex-A7 core (identical numbers):

operationp256-mTinyCrypt
key generation564796
ECDH shared secret596700
ECDSA sign604844
ECDSA verify700808

Runtime performance

Here are the timings of each operation in milliseconds measured on platforms based on a selection of cores:

Cortex-M0 at 48 MHz: STM32F091 board running Mbed OS 6

Operationp256-mTinyCrypt
Key generation921979
ECDH shared secret922975
ECDSA sign9901009
ECDSA verify19761130
Sum of those 448094093

Cortex-M4 at 100 MHz: STM32F411 board running Mbed OS 6

Operationp256-mTinyCrypt
Key generation145178
ECDH shared secret144177
ECDSA sign155188
ECDSA verify309210
Sum of those 4753753

Cortex-A7 at 900 MHz: Raspberry Pi 2B running Raspbian Buster

Operationp256-mTinyCrypt
Key generation1113
ECDH shared secret1113
ECDSA sign1214
ECDSA verify2415
Sum of those 45955

64-bit Intel (i7-6500U at 2.50GHz) laptop running Ubuntu 20.04

Note: results in microseconds (previous benchmarks in milliseconds)

Operationp256-mTinyCrypt
Key generation10601627
ECDH shared secret10601611
ECDSA sign11361712
ECDSA verify22791888
Sum of those 455356838

Other differences

Design overview

The implementation is contained in a single file to keep most functions static and allow for more optimisations. It is organized in multiple layers:

Multi-precision arithmetic.

Large integers are represented as arrays of uint32_t limbs. When carries may occur, casts to uint64_t are used to nudge the compiler towards using the CPU's carry flag. When overflow may occur, functions return a carry flag.

This layer contains optional assembly for Cortex-M and Cortex-A cores, for the internal u32_muladd64() function, as well as two pure C versions of this function, depending on whether MUL64_IS_CONSTANT_TIME.

This layer's API consists of:

Modular arithmetic.

All modular operations are done in the Montgomery domain, that is x is represented by x * 2^256 mod m; integers need to be converted to that domain before computations, and back from it afterwards. Montgomery constants associated to the curve's p and n are pre-computed and stored in static structures.

Modular inversion is computed using Fermat's little theorem to get constant-time behaviour with respect to the value being inverted.

This layer's API consists of:

Operations on curve points.

Curve points are represented using either affine or Jacobian coordinates; affine coordinates are extended to represent 0 as (0,0). Individual coordinates are always in the Montgomery domain.

Not all formulas associated with affine or Jacobian coordinates are complete; great care is taken to document and satisfy each function's pre-conditions.

This layer's API consists of:

Scalar operations.

The crucial function here is scalar multiplication. It uses a signed binary ladder, which is a variant of the good old double-and-add algorithm where an addition/subtraction is performed at each step. Again, care is taken to make sure the pre-conditions for the addition formulas are always satisfied. The signed binary ladder only works if the scalar is odd; this is ensured by negating both the scalar (mod n) and the input point if necessary.

This layer's API consists of:

Public API.

This layer builds on the others, but unlike them, all inputs and outputs are byte arrays. Key generation and ECDH shared secret computation are thin wrappers around internal functions, just taking care of format conversions and errors. The ECDSA functions have more non-trivial logic.

This layer's API consists of:

Testing.

A self-contained, straightforward, pure-Python implementation was first produced as a warm-up and to help check intermediate values. Test vectors from various sources are embedded and used to validate the implementation.

This implementation, p256.py, is used by a second Python script, gen-test-data.py, to generate additional data for both positive and negative testing, available from a C header file, that is then used by the closed-box and open-box test programs.

p256-m can be compiled with extra instrumentation to mark secret data and allow either valgrind or MemSan to check that no branch or memory access depends on it (even indirectly). Macros are defined for this purpose near the top of the file.

Tested platforms.

There are 4 versions of the internal function u32_muladd64: two assembly versions, for Cortex-M/A cores with or without the DSP extension, and two pure-C versions, depending on whether MUL64_IS_CONSTANT_TIME.

Tests are run on the following platforms:

In addition:

Notes about other curves

It should be clear that minimal code size can only be reached by specializing the implementation to the curve at hand. Here's a list of things in the implementation that are specific to the NIST P-256 curve, and how the implementation could be changed to expand to other curves, layer by layer (see "Design Overview" above).

Fixed-width multi-precision arithmetic:

Fixed-width modular arithmetic:

Operations on curve points:

Scalar operations:

Public API:

Notes about other platforms

While p256-m is standard C99, it is written with constrained 32-bit platforms in mind and makes a few assumptions about the platform:

Also, on platforms on which 64-bit addition and subtraction with carry, or even 64x64->128-bit multiplication, are available, p256-m makes no use of them, though they could significantly improve performance.

This could be improved by replacing uses of arrays of uint32_t with a defined type throughout the internal APIs, and then on 64-bit platforms define that type to be an array of uint64_t instead, and making the obvious adaptations in the multi-precision arithmetic layer.

Finally, the optional assembly code (which boosts performance by a factor 2 on tested Cortex-M CPUs, while slightly reducing code size and stack usage) is currently only available with compilers that support GCC's extended asm syntax (which includes GCC and Clang).