Awesome
X25519 for ARM AArch64
This implements highly optimimzed assembler versions of X25519 for AArch64. It's especially optimized for Cortex-A53 but works fast on other AArch64 processors as well. It utilizes the NEON floating point engine to archieve the high performance.
X25519
X25519 is an Elliptic curve version of the Diffie-Hellman protocol, using Curve25519 as the elliptic curve, as introduced in https://cr.yp.to/ecdh.html.
API
void X25519_calc_public_key(uint8_t output_public_key[32], const uint8_t input_secret_key[32]);
void X25519_calc_shared_secret(uint8_t output_shared_secret[32], const uint8_t my_secret_key[32], const uint8_t their_public_key[32]);
- To use, first generate a 32 byte random value using a Cryptographically Secure Number Generator (specifically do NOT use
rand()
from the C library), which gives your secret key. - Feed that secret key into
X25519_calc_public_key
which will give you the corresponding public key you then transfer to the other part. The other part does the same. - When you get the other part's public key, feed that into
X25519_calc_shared_secret
together with your private key which will give you the shared secret. Rather than using this shared secret directly, it should be hashed (for example with SHA-256) on both sides before use. For further usage instructions see the official web site.
Note that this library automatically "clamps" the secret key for you (i.e. sets all the three lowest bits to 0 and the two highest to 0 and 1), compared to some other implementations.
Setup
- The header file
X25519-AArch64.h
should be included when using the API from C/C++. - When compiling with GCC,
X25519-AArch64.s
must be added to the project as a compilation unit. The compiler switch-march=armv8-a
or similar might be needed depending on target architecture.
Example
An example can be seen in linux_example.c
that uses /dev/urandom
to get random data. It can be compiled on for example Raspberry Pi 3 running Debian (64-bit version) with:
gcc linux_example.c X25519-AArch64.s -o linux_example
Performance
The library uses only 5840 bytes of code space in compiled form, uses 416 bytes of stack and runs one scalar multiplication in ~145k cycles on Cortex-A53, which is speed record as far as I know. On Amazon AWS's A1 CPU (Cortex-A72), the implementation uses ~150k cycles (at 2.3 GHz).
Code
The code is written in GCC's assembler syntax.
Security
The implementation runs in constant time and uses a constant memory access pattern, regardless of the private key in order to protect against side channel attacks.
Copying
The code is released under CC0.