Home

Awesome

Hardened Library for AES-128 encryption/decryption on ARM Cortex M4 Achitecture

Authors: Ryad Benadjila, Louiza Khati, Emmanuel Prouff and Adrian Thillard

This work is linked to the H2020 funded project REASSURE.

Introduction

The members of ANSSI's laboratory of embedded security have developed a C library to perform AES-128 encryption and decryption on 32-bit Cortex-M ARM architecture while taking Side-Channel Attacks (SCA for short) into account.

The implementation codes are published for research and pedagogical purposes only.

The platforms on which the code has been tested are the STM32F3 and STM32F4, but should be compatible with any Cortex-M3/Cortex-M4 using the Thumb-2 ARMv7-M instruction set.

The STM32 MCUs are not secure ones; in particular no effort has been made to harden them against side-channel attacks and fault injections (e.g. clock jittering, shield, etc.). The information leakage is consequently particularly high and there is almost no jittering (traces' acquisition should therefore not suffer from too much de-synchronization). To secure the implementation, it has been chosen to apply state of the art techniques.

Project Description

This project aims at proposing a C library to perform AES-128 encryption and decryption on 32-bit Cortex M4 ARM architecture while taking Side-Channel Attacks into account.

To ease the use of the library, the AES-128 primitive can be executed by calling the following function:

UINT aes(UCHAR Mode, STRUCT_AES* struct_aes, const UCHARp key, const UCHARp input, UCHARp output, UCHARp random_key, UCHARp random_aes)

where:

An example of AES-128 execution is given here-after:

UINT Mode;
STRUCT_AES aes_struct;
UCHAR key[16] = {0x2b,0x7e, 0x15, 0x16 ,0x28, 0xae, 0xd2, 0xa6, 0xab, 0xf7, 0x15, 0x88, 0x09, 0xcf, 0x4f, 0x3c};		
UCHAR in[16] = {0x32,0x43, 0xf6, 0xa8, 0x88, 0x5a, 0x30 , 0x8d, 0x31, 0x31, 0x98, 0xa2, 0xe0, 0x37, 0x07, 0x34};
Mode = MODE_KEYINIT|MODE_AESINIT_ENC|MODE_ENC;
ret = aes(Mode, &aes_struct, key, in, out);

Project compilation and tests

Testing the implementation

In order to test the masked AES implementation, we have chosen to provide three different targets:

The two first tests only require software to compile and check the implementation, while the last one requires dedicated hardware (an STM32F4Discovery board and UART converters to communicate with a host PC).

Dependencies to install

The strict minimum required to compile the project is a cross-toolchain that targets ARMv7-M MCUs. Depending on the target you choose (QEMU, Keil or STM32F4Discovery) different toolchains are used.

For compiling for QEMU, an arm-linux-gnueabi-gcc toolchain is needed:

$ sudo apt-get install gcc-arm-linux-gnueabi

For testing with QEMU, the static versions of the emulator are needed:

$ sudo apt-get install qemu-user-static

For compiling and testing with Keil, the Keil ARM MDK is required. The demo version should be sufficient to test the project.

For compiling for the STM32F4Discovery board, you will use a gcc-arm-none-eabi toolchain:

$ sudo apt-get install gcc-arm-none-eabi

The QEMU and on board tests use Python scripts that require some packages that you can install using your distro package manager or pip:

$ sudo apt-get install python-crypto
$ sudo apt-get install python-serial

python-crypto is used to generate AES test vectors, and python-serial is used for the UART serial communication with the STM32F4Discovery board.

Finally, for programming the board with the compiled firmware, the st-link utility is needed. Some Linux distros provide it as a package, others not. In any case, you can compile and install this software by following the dedicated section of this project.

Compiling

Once the compilation toolchains and/or Keil are installed, it is possible to compile the project.

A Makefile is present at the root folder of the project. Here is the help of the targets:

$ make
Please choose the target you want to compile:
	make qemu
	=> for qemu target compilation
	===============================
	make firmware
	=> for firmware target compilation, for the STM32Discovery F407 board
	===============================
	make burn
	=> for firmware burning on the STM32Discovery F407 board.
	   Burning uses the st-utils tools.
	===============================
	The masked AES static library can also be compiled using the dedicated target:
	make libaes
	NOTE: use the UNROLL=1 environment variable to force the fully unrolled version compilation

Compiling is then as straightforward as make qemu for the QEMU target, and make firmware for generating a full firmware for the STM32Discovery board.

Two variants of the masked AES implementation are provided: a fast version where most of the loops are fully unrolled, and a slower version with almost no unrolling. The fast unrolled version uses more non-volatile (flash) memory. By default, the project compiles the small footprint version: the fast unrolled one can be activated using the UNROLL=1 environment variable:UNROLL=1 make qemu or UNROLL=1 make firmware.

The Makefile will produce its binaries in a build folder. For flexibility and versatility, the various subparts of the project are built as static libraries: this helps using them as standalone elements. The masked AES is compiled as libmaskedaes.a that can be linked against various other object files.

In order to compile with Keil, you can open the dedicated project in the keil folder and run the compilation process (tests have been performed with version µVision V5.26.2.0).

The STM32Discovery firmware is produced in the files firmware.bin (flat binary), firmware.elf (ELF file) and firmware.hex (Intel HEX format file). You can use the make burn target to push the firmware on the board (provided that st-link binaries are installed and in your path).

NOTE1: the Makefile is built to be versatile regarding the toolchain and options in order to bring flexibility for advanced users. You can override the cross-compiler using the COMPILER environment variable, override the ar utility with the AR variable, and override the objcopy utility with the OBJCOPY variable. You can also override the CFLAGS and LDFLAGS environment variables with your own. A simple example of such advanced usage is compiling the QEMU binary with clang (provided that the clang and LLVM toolchain is installed):

$ COMPILER=clang make qemu
clang -fPIC -fno-builtin -Os -g -mcpu=cortex-m4 -march=armv7e-m -DSTATIC_TESTS_PRESENT -I"./generated_tests" -Isrc/aes -Isrc/printf -Isrc/tests -Isrc  --target=arm-linux-gnueabi -DUSE_QEMU_PLATFORM src/printf/printf.c -c
ar rcs build/libprintf.a  build/printf.o
...

NOTE2: some compilation toolchains might arise issues during compilation, but such issues could be related to bugs in the version packaged by your distro. For instance, on Ubuntu 16.04 the arm-none-eabi cross-compiler emits warning about using #pragma (see the bug report here). In this specific case, the warning should not break the resulting binary.

Compatibility with other platforms

As previously detailed, the masked AES library should be compatible with any Cortex-M3/Cortex-M4 MCU as it only requires compatibility with the ARMv7-M instruction set.

For our side-channel characterization, we have used the ChipWhisperer CW308T-STM32F target based on a STM32F303RCT7 chip (using a Cortex-M4 core), and the AES library has been easily integrated to the ChipWhisperer simple serial AES example. Using the ChipWhisperer as a board for side-channel characterization is explained by an easy access to power consumption and clean signal acquisition chain. Furthermore, this board natively supports the undercloking of the Cortex-M4 target using an external oscillator: slowing down the core frequency to 4MHz allows to use reasonable sampling rates during the acquisition.

The side-channel characterization is publicly available in the document doc/technical-report/CR_evaluation_AES.pdf.

The test script

We provide a Python script scripts/test_aes.py for performing tests. This script can be used to generate static tests that will be included with the source and self-tested at the beginning of the main loop of the produced executables (either for QEMU or for a hardware target).

The same script can also be used to generate dynamic tests that are provided dynamically to the binaries: standard input for QEMU, and UART line for the STM32Discovery firmware.

Generate static tests

Generating static tests that will be self-run during the execution of the main loop of the produced binary, either for QEMU or for the STM32Discovery board:

python scripts/test_aes.py STATIC 100

will produce 100 static tests in the header file generated_tests/aes_tests.h that is included in the build process. The results of the tests are printed either on stdout for QEMU, or on the UART output.

Testing using QEMU

Generating dynamic tests for QEMU:

python scripts/test_aes.py QEMU 100

will produce 100 dynamic tests in the file generated_tests/aes_tests.bin.

Once you have compiled the project for the QEMU target using make qemu, you will find the produced binary ./build/affine_aes in the build folder.

When executing the binary using QEMU, the static tests should be executed automatically:

$ ./build/affine_aes 
[Tests] Welcome to static AES tests
[Tests] Begin received!
[Tests] Basic test received!
[+] OK!: test "[ENC0]full_0"
[Tests] End received!
[Tests] Begin received!
[Tests] Basic test received!
[+] OK!: test "[ENC0]split_step0_0"
[Tests] Basic test received!
[+] OK!: test "[ENC0]split_step1_0"
[Tests] End received!
...

The dynamic tests produced by python scripts/test_aes.py QEMU 100 are stored in generated_tests/aes_tests.binand can be provided as stdin to the QEMU binary to be processed:

$ ./build/affine_aes < generated_tests/aes_tests.bin 
[Tests] Welcome to dynamic AES tests ... Waiting on the line!
[Tests] Begin received!
[Tests] Basic test received!
[+] OK!: test "[ENC0]full_0"
...

Testing on a STM32Discovery board

Static tests are generated the same way for the embedded firmware as for the other targets.

Generating dynamic tests for STM32Discovery is performed this way:

python scripts/test_aes.py UART 100 /dev/ttyUSB0 

This will produce 100 dynamic tests that are sent to the UART peripheral connected to the pseudo-file /dev/ttyUSB0 (please check that the user executing these tests, has the rights to write in this file).

In order to separate the debug output line and the tests line, we use two UARTs on the STM32Discovery:

NOTE: When performing the tests on the STM32Discovery board, the number of cycles taken by the various steps of the AES encryption are printed. We summarize such information in the table below (we distinguish the compact version of the masked AES and the fully unrolled one):

AES type cyclesaes_load_keyaes_init_encaes_enc
Compact masked AES153291218253072
Unrolled masked AES127781130029920

Testing using Keil

In order to test using Keil, you can generate static tests to be embedded in the source files as described above. Then, using the keil project in keil/aes_m4.uvprojx one can check that all the tests are OK by using the simulator provided by the SDK, and putting a breakoint at the end of the main() routine: reaching the end of the tests implies that all the tests are OK.

Although configuring the UART with Keil's simulator is possible, it is not straightforward: you can check here for more details.

NOTE: The Keil simulator allows to measure the number of cycles taken by the emulated code. One might see some (important) differences between such cycles count and the ones reported by the AES running on the real hardware (STM32Discovery board for instance). Such discrepancies are explained by the fact that the Keil simulator is cycle-accurate with some exceptions (see here). More specifically, the stalls induced by reading from 'slow' memory (such as flash) are not taken into account, usually resulting in better performances in the simulator. The curious reader can refer to the discussion here about the wait states and how ST handles them using the ART (Adaptive Real-Time Memory Accelerator). This explains why the discrepancies between the simulator and the real hardware are more important for large footprint code (unrolled AES) than for compact programs.

Code Architecture

The code architecture is composed of various elements:

Core AES Assembly Functions

The function aes itself makes use of the following primitives which have been implemented directly in assembly for security purpose:

UCHAR aes_loadKey (STRUCT_KEY_CONTEXT*, const UCHARp t_key, UCHARp t_rand_key)
UCHAR aes_init_enc (STRUCT_AES128_CONTEXT*,UCHARp t_rand)
UCHAR aes_init_dec (STRUCT_AES128_CONTEXT*, UCHARp t_rand)
UCHAR aes_dec (STRUCT_AES128_CONTEXT*, STRUCT_KEY_CONTEXT*, const UCHARp t_plaintext, UCHARp t_ciphertext)
UCHAR aes_enc (STRUCT_AES128_CONTEXT*, STRUCT_KEY_CONTEXT*, const UCHARp t_ciphertext, UCHARp t_plaintext)

The variablest_rand_key and t_rand refer to 19-byte arrays containing random values used to secure the implementation against SCA. The other variables t_key, t_plaintext and t_ciphertext refer to 16-byte arrays respectively corresponding to the AES-128 key, the plaintext and the ciphertext. The specific structures STRUCT_KEY_CONTEXT and STRUCT_AES128_CONTEXT contain context elements that are used during key schedule and encryption/decryption masking (see hereafter in the security section).

Security

The internal state of the processing is secure with the so-called affine masking countermeasure (see "Affine Masking against Higher-Order Side Channel Analysis" and "A Technique with an Information-Theoretic Basis for Protecting Secret Data from Differential Power Attacks" for more details). The AES key-scheduling is protected similarly as the encryption/decryption with its own random material. The affinely-masked round keys are computed in the aes function by calling the aes_loadKey primitive which stores them in a dedicated array and the corresponding masks in another array.

AES Round with Affine Masking

The elements of the aes state and of the key state are always manipulated in random order. For security purpose, different random permutations are used for the masked states and the masks states. Also, dedicated random permutations (working on 4-byte arrays) are used for the MixColumns executions. During the round-keys scheduling (function aes_loadKey), the random index permutations are generated at the very beginning of the processing and are deduced from the 4 first random bytes in random_key (or equivalently the corresponding bytes in case of internal random generation). For the aes encryption processing (resp. the aes decryption processing), the random index permutations are generated by function aes_init_enc (resp. aes_init_dec) and are deduced from the 4 first random bytes in random_aes (or equivalently the corresponding bytes in case of internal random generation). Random index permutations dedicated to the MixColumns primitive are deduced from the first random byte only.

The security of the implementation against SCA has been tested on a STM32F303 target (one of ChipWhisperer's CW308 target boards). This evaluation was performed on power consumption traces captured through an oscilloscope, sampling 100 000 000 samples by second. The obtained traces consisted in 2 000 000 samples, encompassing the whole AES implementation. Figure below shows two main steps: first some pre-processing is performed then the raw AES rounds are executed.

AES trace

The security analysis is detailed in the dedicated report doc/technical-report/technical_analysis.pdf.

Licenses

The source codes are released under the BSD License. See the LICENSE file in each source folder for more information.