Awesome
p256-m is a minimalistic implementation of ECDH and ECDSA on NIST P-256, especially suited to constrained 32-bit environments. It's written in standard C, with optional bits of assembly for Arm Cortex-M and Cortex-A CPUs.
Its design is guided by the following goals in this order:
- correctness & security;
- low code size & RAM usage;
- runtime performance.
Most cryptographic implementations care more about speed than footprint, and some might even risk weakening security for more speed. p256-m was written because I wanted to see what happened when reversing the usual emphasis.
The result is a full implementation of ECDH and ECDSA in less than 3KiB of code, using less than 768 bytes of RAM, with comparable performance to existing implementations (see below) - in less than 700 LOC.
Contents of this Readme:
- Correctness
- Security
- Code size
- RAM usage
- Runtime performance
- Comparison with other implementations
- Design overview
- Notes about other curves
- Notes about other platforms
Correctness
API design:
- The API is minimal: only 4 public functions.
- Each public function fully validates its inputs and returns specific errors.
- The API uses arrays of octets for all input and output.
Testing:
- p256-m is validated against multiple test vectors from various RFCs and NIST.
- In addition, crafted inputs are used for negative testing and to reach corner cases.
- Two test suites are provided: one for closed-box testing (using only the public API), one for open-box testing (for unit-testing internal functions, and reaching more error cases by exploiting knowledge of how the RNG is used).
- The resulting branch coverage is maximal: closed-box testing reaches all
branches except four; three of them are reached by open-box testing using a
rigged RNG; the last branch could only be reached by computing a discrete log
on P-256... See
coverage.sh
. - Testing also uses dynamic analysis: valgrind, ASan, MemSan, UBSan.
Code quality:
- The code is standard C99; it builds without warnings with
clang -Weverything
andgcc -Wall -Wextra -pedantic
. - The code is small and well documented, including internal APIs: with the header file, it's less than 700 lines of code, and more lines of comments than of code.
- However it has not been reviewed independently so far, as this is a personal project.
Short Weierstrass pitfalls:
Its has been pointed out that the NIST curves, and indeed all Short Weierstrass curves, have a number of pitfalls including risk for the implementation to:
- "produce incorrect results for some rare curve points" - this is avoided by carefully checking the validity domain of formulas used throughout the code;
- "leak secret data when the input isn't a curve point" - this is avoided by validating that points lie on the curve every time a point is deserialized.
Security
In addition to the above correctness claims, p256-m has the following properties:
- it has no branch depending (even indirectly) on secret data;
- it has no memory access depending (even indirectly) on secret data.
These properties are checked using valgrind and MemSan with the ideas
behind ctgrind, see consttime.sh
.
In addition to avoiding branches and memory accesses depending on secret data,
p256-m also avoid instructions (or library functions) whose execution time
depends on the value of operands on cores of interest. Namely, it never uses
integer division, and for multiplication by default it only uses 16x16->32 bit
unsigned multiplication. On cores which have a constant-time 32x32->64 bit
unsigned multiplication instruction, the symbol MUL64_IS_CONSTANT_TIME
can
be defined by the user at compile-time to take advantage of it in order to
improve performance and code size. (On Cortex-M and Cortex-A cores wtih GCC or
Clang this is not necessary, since inline assembly is used instead.)
As a result, p256-m should be secure against the following classes of attackers:
- attackers who can only manipulate the input and observe the output;
- attackers who can also measure the total computation time of the operation;
- attackers who can also observe and manipulate micro-architectural features such as the cache or branch predictor with arbitrary precision.
However, p256-m makes no attempt to protect against:
- passive physical attackers who can record traces of physical emissions (power, EM, sound) of the CPU while it manipulates secrets;
- active physical attackers who can also inject faults in the computation.
(Note: p256-m should actually be secure against SPA, by virtue of being fully constant-flow, but is not expected to resist any other physical attack.)
Warning: p256-m requires an externally-provided RNG function. If that function is not cryptographically secure, then neither is p256-m's key generation or ECDSA signature generation.
Note: p256-m also follows best practices such as securely erasing secret data on the stack before returning.
Code size
Compiled with
ARM-GCC 9,
with -mthumb -Os
, here are samples of code sizes reached on selected cores:
- Cortex-M0: 2988 bytes
- Cortex-M4: 2900 bytes
- Cortex-A7: 2924 bytes
Clang was also tried but tends to generate larger code (by about 10%). For
details, see sizes.sh
.
What's included:
- Full input validation and (de)serialisation of input/outputs to/from bytes.
- Cleaning up secret values from the stack before returning from a function.
- The code has no dependency on libc functions or the toolchain's runtime
library (such as helpers for long multiply); this can be checked for the
Arm-GCC toolchain with the
deps.sh
script.
What's excluded:
- A secure RNG function needs to be provided externally, see
p256_generate_random()
inp256-m.h
.
RAM usage
p256-m doesn't use any dynamic memory (on the heap), only the stack. Here's how much stack is used by each of its 4 public functions on selected cores:
Function | Cortex-M0 | Cortex-M4 | Cortex-A7 |
---|---|---|---|
p256_gen_keypair | 608 | 564 | 564 |
p256_ecdh_shared_secret | 640 | 596 | 596 |
p256_ecdsa_sign | 664 | 604 | 604 |
p256_ecdsa_verify | 752 | 700 | 700 |
For details, see stack.sh
, wcs.py
and libc.msu
(the above figures assume
that the externally-provided RNG function uses at most 384 bytes of stack).
Runtime performance
Here are the timings of each public function in milliseconds measured on platforms based on a selection of cores:
- Cortex-M0 at 48 MHz: STM32F091 board running Mbed OS 6
- Cortex-M4 at 100 MHz: STM32F411 board running Mbed OS 6
- Cortex-A7 at 900 MHz: Raspberry Pi 2B running Raspbian Buster
Function | Cortex-M0 | Cortex-M4 | Cortex-A7 |
---|---|---|---|
p256_gen_keypair | 921 | 145 | 11 |
p256_ecdh_shared_secret | 922 | 144 | 11 |
p256_ecdsa_sign | 990 | 155 | 12 |
p256_ecdsa_verify | 1976 | 309 | 24 |
Sum of the above | 4809 | 753 | 59 |
The sum of these operations corresponds to a TLS handshake using ECDHE-ECDSA with mutual authentication based on raw public keys or directly-trusted certificates (otherwise, add one 'verify' for each link in the peer's certificate chain).
Note: the above figures where obtained by compiling with GCC, which is able to use inline assembly. Without that inline assembly (22 lines for Cortex-M0, 1 line for Cortex-M4), the code would be roughly 2 times slower on those platforms. (The effect is much less important on the Cortex-A7 core.)
For details, see bench.sh
, benchmark.c
and on-target-benchmark/
.
Comparison with other implementations
The most relevant/convenient implementation for comparisons is TinyCrypt, as it's also a standalone implementation of ECDH and ECDSA on P-256 only, that also targets constrained devices. Other implementations tend to implement many curves and build on a shared bignum/MPI module (possibly also supporting RSA), which makes fair comparisons less convenient.
The scripts used for TinyCrypt measurements are available in this branch, based on version 0.2.8.
Code size
Core | p256-m | TinyCrypt |
---|---|---|
Cortex-M0 | 2988 | 6134 |
Cortex-M4 | 2900 | 5934 |
Cortex-A7 | 2924 | 5934 |
RAM usage
TinyCrypto also uses no heap, only the stack. Here's the RAM used by each operation on a Cortex-M0 core:
operation | p256-m | TinyCrypt |
---|---|---|
key generation | 608 | 824 |
ECDH shared secret | 640 | 728 |
ECDSA sign | 664 | 880 |
ECDSA verify | 752 | 824 |
On a Cortex-M4 or Cortex-A7 core (identical numbers):
operation | p256-m | TinyCrypt |
---|---|---|
key generation | 564 | 796 |
ECDH shared secret | 596 | 700 |
ECDSA sign | 604 | 844 |
ECDSA verify | 700 | 808 |
Runtime performance
Here are the timings of each operation in milliseconds measured on platforms based on a selection of cores:
Cortex-M0 at 48 MHz: STM32F091 board running Mbed OS 6
Operation | p256-m | TinyCrypt |
---|---|---|
Key generation | 921 | 979 |
ECDH shared secret | 922 | 975 |
ECDSA sign | 990 | 1009 |
ECDSA verify | 1976 | 1130 |
Sum of those 4 | 4809 | 4093 |
Cortex-M4 at 100 MHz: STM32F411 board running Mbed OS 6
Operation | p256-m | TinyCrypt |
---|---|---|
Key generation | 145 | 178 |
ECDH shared secret | 144 | 177 |
ECDSA sign | 155 | 188 |
ECDSA verify | 309 | 210 |
Sum of those 4 | 753 | 753 |
Cortex-A7 at 900 MHz: Raspberry Pi 2B running Raspbian Buster
Operation | p256-m | TinyCrypt |
---|---|---|
Key generation | 11 | 13 |
ECDH shared secret | 11 | 13 |
ECDSA sign | 12 | 14 |
ECDSA verify | 24 | 15 |
Sum of those 4 | 59 | 55 |
64-bit Intel (i7-6500U at 2.50GHz) laptop running Ubuntu 20.04
Note: results in microseconds (previous benchmarks in milliseconds)
Operation | p256-m | TinyCrypt |
---|---|---|
Key generation | 1060 | 1627 |
ECDH shared secret | 1060 | 1611 |
ECDSA sign | 1136 | 1712 |
ECDSA verify | 2279 | 1888 |
Sum of those 4 | 5535 | 6838 |
Other differences
- While p256-m fully validates all inputs, Tinycrypt's ECDH shared secret function doesn't include validation of the peer's public key, which should be done separately by the user for static ECDH (there are attacks when users forget).
- The two implementations have slightly different security characteristics: p256-m is fully constant-time from the ground up so should be more robust than TinyCrypt against powerful local attackers (such as an untrusted OS attacking a secure enclave); on the other hand TinyCrypt includes coordinate randomisation which protects against some passive physical attacks (such as DPA, see Table 3, column C9 of this paper), which p256-m completely ignores.
- TinyCrypt's code looks like it could easily be expanded to support other curves, while p256-m has much more hard-coded to minimize code size (see "Notes about other curves" below).
- TinyCrypt uses a specialised routine for reduction modulo the curve prime, exploiting its structure as a Solinas prime, which should be faster than the generic Montgomery reduction used by p256-m, but other factors appear to compensate for that.
- TinyCrypt uses Co-Z Jacobian formulas for point operation, which should be faster (though a bit larger) than the mixed affine-Jacobian formulas used by p256-m, but again other factors appear to compensate for that.
- p256-m uses bits of inline assembly for 64-bit multiplication on the platforms used for benchmarking, while TinyCrypt uses only C (and the compiler's runtime library).
- TinyCrypt uses a specialised routine based on Shamir's trick for ECDSA verification, which gives much better performance than the generic code that p256-m uses in order to minimize code size.
Design overview
The implementation is contained in a single file to keep most functions static and allow for more optimisations. It is organized in multiple layers:
- Fixed-width multi-precision arithmetic
- Fixed-width modular arithmetic
- Operations on curve points
- Operations with scalars
- The public API
Multi-precision arithmetic.
Large integers are represented as arrays of uint32_t
limbs. When carries may
occur, casts to uint64_t
are used to nudge the compiler towards using the
CPU's carry flag. When overflow may occur, functions return a carry flag.
This layer contains optional assembly for Cortex-M and Cortex-A cores, for the
internal u32_muladd64()
function, as well as two pure C versions of this
function, depending on whether MUL64_IS_CONSTANT_TIME
.
This layer's API consists of:
- addition, subtraction;
- multiply-and-add, shift by one limb (for Montgomery multiplication);
- conditional assignment, assignment of a small value;
- comparison of two values for equality, comparison to 0 for equality;
- (de)serialization as big-endian arrays of bytes.
Modular arithmetic.
All modular operations are done in the Montgomery domain, that is x is
represented by x * 2^256 mod m
; integers need to be converted to that domain
before computations, and back from it afterwards. Montgomery constants
associated to the curve's p and n are pre-computed and stored in static
structures.
Modular inversion is computed using Fermat's little theorem to get constant-time behaviour with respect to the value being inverted.
This layer's API consists of:
- the curve's constants p and n (and associated Montgomery constants);
- modular addition, subtraction, multiplication, and inversion;
- assignment of a small value;
- conversion to/from Montgomery domain;
- (de)serialization to/from bytes with integrated range checking and Montgomery domain conversion.
Operations on curve points.
Curve points are represented using either affine or Jacobian coordinates; affine coordinates are extended to represent 0 as (0,0). Individual coordinates are always in the Montgomery domain.
Not all formulas associated with affine or Jacobian coordinates are complete; great care is taken to document and satisfy each function's pre-conditions.
This layer's API consists of:
- curve constants: b from the equation, the base point's coordinates;
- point validity check (on the curve and not 0);
- Jacobian to affine coordinate conversion;
- point doubling in Jacobian coordinates (complete formulas);
- point addition in mixed affine-Jacobian coordinates (P not in {0, Q, -Q});
- point addition-or-doubling in affine coordinates (leaky version, only used for ECDSA verify where all data is public);
- (de)serialization to/from bytes with integrated validity checking
Scalar operations.
The crucial function here is scalar multiplication. It uses a signed binary ladder, which is a variant of the good old double-and-add algorithm where an addition/subtraction is performed at each step. Again, care is taken to make sure the pre-conditions for the addition formulas are always satisfied. The signed binary ladder only works if the scalar is odd; this is ensured by negating both the scalar (mod n) and the input point if necessary.
This layer's API consists of:
- scalar multiplication
- de-serialization from bytes with integrated range checking
- generation of a scalar and its associated public key
Public API.
This layer builds on the others, but unlike them, all inputs and outputs are byte arrays. Key generation and ECDH shared secret computation are thin wrappers around internal functions, just taking care of format conversions and errors. The ECDSA functions have more non-trivial logic.
This layer's API consists of:
- key-pair generation
- ECDH shared secret computation
- ECDSA signature creation
- ECDSA signature verification
Testing.
A self-contained, straightforward, pure-Python implementation was first produced as a warm-up and to help check intermediate values. Test vectors from various sources are embedded and used to validate the implementation.
This implementation, p256.py
, is used by a second Python script,
gen-test-data.py
, to generate additional data for both positive and negative
testing, available from a C header file, that is then used by the closed-box
and open-box test programs.
p256-m can be compiled with extra instrumentation to mark secret data and allow either valgrind or MemSan to check that no branch or memory access depends on it (even indirectly). Macros are defined for this purpose near the top of the file.
Tested platforms.
There are 4 versions of the internal function u32_muladd64
: two assembly
versions, for Cortex-M/A cores with or without the DSP extension, and two
pure-C versions, depending on whether MUL64_IS_CONSTANT_TIME
.
Tests are run on the following platforms:
make
on x64 tests the pure-C version withoutMUL64_IS_CONSTANT_TIME
(with Clang)../consttime.sh
on x64 tests both pure-C versions (with Clang).make
on Arm v7-A (Raspberry Pi 2) tests the Arm-DSP assembly version (with Clang).on-target-*box
on boards based on Cortex-M0 and M4 cores test both assembly versions (with GCC).
In addition:
sizes.sh
builds the code for three Arm cores with GCC and Clang.deps.sh
checks for external dependencies with GCC.
Notes about other curves
It should be clear that minimal code size can only be reached by specializing the implementation to the curve at hand. Here's a list of things in the implementation that are specific to the NIST P-256 curve, and how the implementation could be changed to expand to other curves, layer by layer (see "Design Overview" above).
Fixed-width multi-precision arithmetic:
- The number of limbs is hard-coded to 8. For other 256-bit curves, nothing to change. For a curve of another size, hard-code to another value. For multiple curves of various sizes, add a parameter to each function specifying the number of limbs; when declaring arrays, always use the maximum number of limbs.
Fixed-width modular arithmetic:
- The values of the curve's constant p and n, and their associated Montgomery
constants, are hard-coded. For another curve, just hard-code the new constants.
For multiple other curves, define all the constants, and from this layer's API
only keep the functions that already accept a
mod
parameter (that is, remove convenience functionsm256_xxx_p()
). - The number of limbs is again hard-coded to 8. See above, but it order to
support multiple sizes there is no need to add a new parameter to functions
in this layer: the existing
mod
parameter can include the number of limbs as well.
Operations on curve points:
- The values of the curve's constants b (constant term from the equation) and gx, gy (coordinates of the base point) are hard-coded. For another curve, hard-code the other values. For multiple curves, define each curve's value and add a "curve id" parameter to all functions in this layer.
- The value of the curve's constant a is implicitly hard-coded to
-3
by using a standard optimisation to save one multiplication in the first step ofpoint_double()
. For curves that don't have a == -3, replace that with the normal computation. - The fact that b != 0 in the curve equation is used indirectly, to ensure that (0, 0) is not a point on the curve and re-use that value to represent the point 0. As far as I know, all Short Weierstrass curves standardized so far have b != 0.
- The shape of the curve is assumed to be Short Weierstrass. For other curve shapes (Montgomery, (twisted) Edwards), this layer would probably look very different (both implementation and API).
Scalar operations:
- If multiple curves are to be supported, all function in this layer need to gain a new "curve id" parameter.
- This layer assumes that the bit size of the curve's order n is the same as
that of the modulus p. This is true of most curves standardized so far, the
only exception being secp224k1. If that curve were to be supported, the
representation of
n
and scalars would need adapting to allow for an extra limb. - The bit size of the curve's order is hard-coded in
scalar_mult()
. For multiple curves, this should be deduced from the "curve id" parameter. - The
scalar_mult()
function exploits the fact that the second least significant bit of the curve's order n is set in order to avoid a special case. For curve orders that don't meet this criterion, we can just handle that special case (multiplication by +-2) separately (always compute that and conditionally assign it to the result). - The shape of the curve is again assumed to be Short Weierstrass. For other curve shapes (Montgomery, (twisted) Edwards), this layer would probably have a very different implementation.
Public API:
- For multiple curves, all functions in this layer would need to gain a "curve id" parameter and handle variable-sized input/output.
- The shape of the curve is again assumed to be Short Weierstrass. For other curve shapes (Montgomery, (twisted) Edwards), the ECDH API would probably look quite similar (with differences in the size of public keys), but the ECDSA API wouldn't apply and an EdDSA API would look pretty different.
Notes about other platforms
While p256-m is standard C99, it is written with constrained 32-bit platforms in mind and makes a few assumptions about the platform:
- The types
uint8_t
,uint16_t
,uint32_t
anduint64_t
exist. - 32-bit unsigned addition and subtraction with carry are constant time.
- 16x16->32-bit unsigned multiplication is available and constant time.
Also, on platforms on which 64-bit addition and subtraction with carry, or even 64x64->128-bit multiplication, are available, p256-m makes no use of them, though they could significantly improve performance.
This could be improved by replacing uses of arrays of uint32_t
with a
defined type throughout the internal APIs, and then on 64-bit platforms define
that type to be an array of uint64_t
instead, and making the obvious
adaptations in the multi-precision arithmetic layer.
Finally, the optional assembly code (which boosts performance by a factor 2 on tested Cortex-M CPUs, while slightly reducing code size and stack usage) is currently only available with compilers that support GCC's extended asm syntax (which includes GCC and Clang).