### Implementation aspects of Keccak

Guido Bertoni<sup>1</sup>, Joan Daemen<sup>1</sup>, <u>Michaël Peeters</u><sup>2</sup>, Gilles Van Assche<sup>1</sup>

Parts jointly with Nicolas Debande<sup>3</sup>, Thanh-Ha Le<sup>3</sup>, Ronny Van Keer<sup>1</sup>

<sup>1</sup>STMicroelectronics <sup>2</sup>NXP Semiconductors <sup>3</sup>Morpho

KECCAK & SHA-3 Day Université Libre de Bruxelles March 27, 2013

### Outline

- 1 Implementing Keccak (how to cut a state)
- 2 Power-attacking Keccak
- 3 Power-protecting Keccak



- 5  $\times$  5 lanes, each containing 2 $^{\ell}$  bits (1, 2, 4, 8, 16, 32 or 64)
- $\bullet$  (5 × 5)-bit slices,  $2^{\ell}$  of them



- 5  $\times$  5 lanes, each containing 2 $^{\ell}$  bits (1, 2, 4, 8, 16, 32 or 64)
- $(5 \times 5)$ -bit slices,  $2^{\ell}$  of them



- 5  $\times$  5 lanes, each containing 2 $^{\ell}$  bits (1, 2, 4, 8, 16, 32 or 64)
- $\bullet$  (5 × 5)-bit slices,  $2^{\ell}$  of them



- 5  $\times$  5 lanes, each containing 2 $^{\ell}$  bits (1, 2, 4, 8, 16, 32 or 64)
- $\bullet$  (5 × 5)-bit slices,  $2^{\ell}$  of them



- 5  $\times$  5 lanes, each containing 2 $^{\ell}$  bits (1, 2, 4, 8, 16, 32 or 64)
- $\bullet$  (5 × 5)-bit slices,  $2^{\ell}$  of them

### Not cutting it: straightforward hardware architecture



- Logic for one round + register for the state
  - very short critical path ⇒ high throughput
- Multiple rounds can be computed in a single clock cycle
  - 2, 3, 4 or 6 rounds in one shot

## Lanes: straightforward software implementation

- $\blacksquare$  Lanes fit in 2 $^{\ell}$ -bit registers
  - 64-bit lanes for Keccak-f[1600]
  - 8-bit lanes for Keccak-f[200]
- Very basic operations required:
  - $\theta$  XOR and 1-bit rotations
  - ρ rotations
  - $\pi$  just reading the correct words
  - $\chi$  XOR, AND, NOT
    - ↓ just a XOR



### Lanes: straightforward software implementation



- Faster than SHA-2 on all modern PC
- KECCAKTREE faster than MD5 on some platforms

| C/b   | Algo             | Strength |
|-------|------------------|----------|
| 4.79  | keccakc256treed2 | 128      |
| 4.98  | md5 broken!      | 64       |
| 5.89  | keccakc512treed2 | 256      |
| 6.09  | sha1 broken!     | 80       |
| 8.25  | keccakc256       | 128      |
| 10.02 | keccakc512       | 256      |
| 13.73 | sha512           | 256      |
| 21.66 | sha256           | 128      |
|       |                  |          |

[eBASH, hydra6, http://bench.cr.yp.to/]

### Bit interleaving

- Ex.: map 64-bit lane to 32-bit words
  - lacksquare ho seems the critical step
  - Even bits in one word Odd bits in a second word
  - $ROT_{64} \leftrightarrow 2 \times ROT_{32}$
- Can be generalized
  - to 16- and 8-bit words
- Can be combined
  - with lane/slice-wise architectures
  - with most other techniques



[KECCAK impl. overview, Section 2.1]

### Interleaved lanes for 32-bit implementations



- Speed between SHA-256 and SHA-512
- Lower RAM usage

| C/b | RAM | Algo        | Strength |
|-----|-----|-------------|----------|
| 41  | 300 | sha256      | 128      |
| 76  | 260 | keccakc256* | 128      |
| 94  | 260 | keccakc512  | 256      |
| 173 | 916 | sha512      | 256      |

[XBX, ARM Cortex-M3, http://xbx.das-labor.org/]

\*estimated for c = 256

#### Lane-wise hardware architecture

- Basic processing unit + RAM
- Improvements over our co-processor:
  - 5 registers and barrel rotator [Kerckhof et al. CARDIS 2011]
  - 4-stage pipeline, ρ in 2 cycles, instruction-based parallel execution
     [San and At, IS] 2012]
- Permutation latency in clock cycles:
  - From 5160, to 2137, down to 1062



- Re-schedule the execution
  - $\blacksquare$   $\chi$ ,  $\theta$ ,  $\pi$  and  $\iota$  on blocks of slices
  - ρ by addressing [Jungk et al, ReConFig 2011]
- Suitable for compact FPGA or ASIC
  - Performance-area trade-offs
    - Possible to select number of processed slices from 1 up to 32



- Re-schedule the execution
  - $\blacksquare$   $\chi$ ,  $\theta$ ,  $\pi$  and  $\iota$  on blocks of slices
  - ρ by addressing
     [Jungk et al, ReConFig 2011]
- Suitable for compact FPGA or ASIC
  - Performance-area trade-offs
    - Possible to select number of processed slices from 1 up to 32



- Re-schedule the execution
  - $\blacksquare$   $\chi$ ,  $\theta$ ,  $\pi$  and  $\iota$  on blocks of slices
  - ρ by addressing [Jungk et al, ReConFig 2011]
- Suitable for compact FPGA or ASIC
  - Performance-area trade-offs
    - Possible to select number of processed slices from 1 up to 32



- Re-schedule the execution
  - $\blacksquare$   $\chi$ ,  $\theta$ ,  $\pi$  and  $\iota$  on blocks of slices
  - ρ by addressing
     [Jungk et al, ReConFig 2011]
- Suitable for compact FPGA or ASIC
- Performance-area trade-offs
  - Possible to select number of processed slices from 1 up to 32



### Cutting the state in lanes or in slices?

Both solutions are efficient, results for Virtex 5

| Architecture   | T.put  | Freq. | Slices   | Latency | Efficiency   |
|----------------|--------|-------|----------|---------|--------------|
|                | Mbit/s | MHz   | (+RAM)   | clocks  | Mbit/s/slice |
| Lane-wise [1]  | 52     | 265   | 448      | 5160    | 0.12         |
| Lane-wise [2]  | 501    | 520   | 151 (+3) | 1062    | 3.32         |
| Slice-wise [3] | 813    | 159   | 372      | 200     | 2.19         |
| High-Speed [4] | 12789  | 305   | 1384     | 24      | 9.24         |

- [1] Keccak Team, Keccak implementation overview
- [2] San, At, ISJ 2012
- [3] Jungk, Apfelbeck, ReConFig 2011 (scaled to r = 1024)
- [4] GMU ATHENa (scaled to r = 1024)

#### Outline

- 1 Implementing Keccak (how to cut a state)
- Power-attacking Keccak
- 3 Power-protecting Keccak

### A model of the power consumption

Consumption at any time instance can be modeled as

$$P = \sum_{i} T_{i}[d_{i}]$$

- $\blacksquare$   $d_i$ : Boolean variables that express activity
  - bit 1 in a given register or gate output at some stage
  - flipping of a specific register or gate output at some stage
- $T_i[0]$  and  $T_i[1]$ : stochastic variables

#### Simplified model

$$P = \alpha + \sum_{i} (-1)^{d_i}$$

### DPA on a keyed sponge function



- Attack the first round after absorbing known input bits
- Compute backward by inverting the permutation

### The Keccak-f round function in a DPA perspective

$$R = \iota \circ \chi \circ \pi \circ \rho \circ \theta$$

- Linear part  $\lambda$  followed by non-linear part  $\chi$
- $\lambda = \pi \circ \rho \circ \theta$ : mixing followed by bit transposition
- $\blacksquare$   $\chi$ : simple mapping operating on rows:

$$b_i \leftarrow b_i + (b_{i+1} + 1)b_{i+2}$$





- Leakage exploited: switching consumption of register bit 0
- Value switches from  $a_0$  to  $b_0 + (b_1 + 1)b_2$
- Activity equation:  $d = a_0 + b_0 + (b_1 + 1)b_2$



- $\blacksquare$  Take the case M=0
- We call *K* the input of  $\chi$ -block if M = 0
- K will be our target



- We call the effect of M at input of  $\chi$ :  $\mu$
- $\blacksquare \mu = \lambda(M||\mathbf{0}^{\mathsf{c}})$
- Linearity of  $\lambda$ :  $B = K + \lambda(M||0^c)$



- $d = a_0 + k_0 + (k_1 + 1)(k_2) + \mu_0 + (\mu_1 + 1)\mu_2 + k_1\mu_2 + k_2\mu_1$
- Fact: value of  $q = a_0 + k_0 + (k_1 + 1)k_2$  is same for all traces
- Let  $M_0$ : traces with d = q and  $M_1$ : d = q + 1



- Selection:  $s(M, K^*) = \mu_0 + (\mu_1 + 1)\mu_2 + k_1^*\mu_2 + k_2^*\mu_1$
- Values of  $\mu_1$  and  $\mu_2$  computed from M
- Hypothesis has two bits only:  $k_1^*$  and  $k_2^*$

- Correct hypothesis *K* 
  - traces in  $M_0$ : d = q
  - traces in  $M_1$ : d = q + 1
- Incorrect hypothesis  $K^* = K + \Delta$ 
  - trace in  $M_0$ :  $d = q + \mu_1 \delta_2 + \mu_2 \delta_1$
  - trace in  $M_1$ :  $d = q + \mu_1 \delta_2 + \mu_2 \delta_1 + 1$
- Remember:  $\mu = \lambda(M||0^c)$ 
  - random inputs M lead to random  $\mu_1$  and  $\mu_2$
  - Incorrect hypothesis: d uncorrelated with  $\{M_0, M_1\}$

### Result of experiments

Analytical prediction of success probability possible [Bertoni, Daemen, Debande, Le, Peeters, Van Assche, HASP 2012]



#### Outline

- 1 Implementing Keccak (how to cut a state)
- 2 Power-attacking Keccak
- 3 Power-protecting Keccak

### Secret sharing

- Countermeasure at algorithmic level:
  - Split variables in *random* shares:  $x = a \oplus b \oplus ...$
  - Keep computed variables *independent* from *native* variables
  - Protection against *n*-th order DPA: at least n + 1 shares

## Software: two-share masking

 $\chi: x_i \leftarrow x_i + (x_{i+1} + 1)x_{i+2}$  becomes:

$$a_i \leftarrow a_i + (a_{i+1} + 1)a_{i+2} + a_{i+1}b_{i+2}$$
  
 $b_i \leftarrow b_i + (b_{i+1} + 1)b_{i+2} + b_{i+1}a_{i+2}$ 

- Independence from native variables, if:
  - we compute left-to-right
  - we avoid leakage in register or bus transitions
- $\lambda = \pi \circ \rho \circ \theta$  becomes:

$$\begin{array}{ll} a & \leftarrow \lambda(a) \\ b & \leftarrow \lambda(b) \end{array}$$

## Software: two-share masking (faster)

- Making it faster!
- $\blacksquare$   $\chi$  becomes:

$$a_i \leftarrow a_i + (a_{i+1} + 1)a_{i+2} + a_{i+1}b_{i+2} + (b_{i+1} + 1)b_{i+2} + b_{i+1}a_{i+2}$$
  
 $b_i \leftarrow b_i$ 

- Precompute  $R = b + \lambda(b)$
- $\lambda = \pi \circ \rho \circ \theta$  becomes:

$$\begin{array}{ll} \textit{a} & \leftarrow \lambda(\textit{a}) + \textit{R} \\ \textit{b} & \leftarrow \textit{b} \end{array}$$

## Software: two-share masking (faster)

- Making it faster!
- $\blacksquare \chi$  becomes:

$$a_i \leftarrow a_i + (a_{i+1} + 1)a_{i+2} + a_{i+1}b_{i+2} + (b_{i+1} + 1)b_{i+2} + b_{i+1}a_{i+2}$$

- Precompute  $R = b + \lambda(b)$
- $\lambda = \pi \circ \rho \circ \theta$  becomes:

$$a \leftarrow \lambda(a) + R$$

### Hardware: two shares are not enough

Unknown order in combinatorial logic!

$$a_i \leftarrow a_i + (a_{i+1} + 1)a_{i+2} + a_{i+1}b_{i+2}$$

## Using a threshold secret-sharing scheme

- Idea: incomplete computations only
  - Each circuit does not leak anything [Nikova, Rijmen, Schläffer 2008]
- Number of shares: at least 1 + algebraic degree 3 shares are needed for  $\chi$
- Glitches as second-order effect
  - A glitch can leak about two shares, say, a + b
  - Another part can leak c
  - $\blacksquare$   $\Rightarrow$  as if two shares only!

## Using a threshold secret-sharing scheme

- Idea: incomplete computations only
  - Each circuit does not leak anything [Nikova, Rijmen, Schläffer 2008]
- Number of shares: at least 1 + algebraic degree 3 shares are needed for  $\chi$
- Glitches as second-order effect
  - A glitch can leak about two shares, say, a + b
  - Another part can leak c
  - $\blacksquare \Rightarrow$  as if two shares only!

## Using a threshold secret-sharing scheme

- Idea: incomplete computations only
  - Each circuit does not leak anything [Nikova, Rijmen, Schläffer 2008]
- Number of shares: at least 1 + algebraic degree

3 shares are needed for  $\chi$ 

- Glitches as second-order effect
  - A glitch can leak about two shares, say, a + b
  - Another part can leak c
  - $\blacksquare$   $\Rightarrow$  as if two shares only!

### Three-share masking for $\chi$

■ Implementing  $\chi$  in three shares:

$$a_{i} \leftarrow b_{i} + (b_{i+1} + 1)b_{i+2} + b_{i+1}c_{i+2} + c_{i+1}b_{i+2}$$

$$b_{i} \leftarrow c_{i} + (c_{i+1} + 1)c_{i+2} + c_{i+1}a_{i+2} + a_{i+1}c_{i+2}$$

$$c_{i} \leftarrow a_{i} + (a_{i+1} + 1)a_{i+2} + a_{i+1}b_{i+2} + b_{i+1}a_{i+2}$$

### One-cycle round architecture



## Three-cycle round architecture



### Parallel vs sequential leakage

■ Generalization of results for protected implementation [Bertoni, Daemen, Debande, Le, Peeters, Van Assche, HASP 2012]



# Some references (1/2)

#### Main references

- Van Keer and KT, Keccak implementation overview
- Debande, Le and KT, PA of HW impl. protected with secret sharing, HASP 2012 + ePrint 2013/067
- Note on side-channel attacks and their counterm..., NIST hash forum 2009
- Building power analysis resistant implementations of KECCAK, SHA-3 2010
- Software implementations and benchmarks
  - Bernstein and Lange, eBASH
  - Wenzel-Benner and Gräf, XBX
  - Balasch et al., CARDIS 2012

Optimized implementations available at http://keccak.noekeon.org/

## Some references (2/2)

- Hardware benchmarks and implementations on FPGA
  - Kerckhof et al., CARDIS 2011
  - Jungk and Apfelbeck, ReConFig 2011
  - San and At, ISJ 2012
  - Gaj et al.; Mahboob et al.; Kaps et al.; SHA-3 2012
- Hardware benchmarks and implementations on ASIC
  - Henzen et al., CHES 2010
  - Tillich et al., SHA-3 2010
  - Guo et al., DATE 2012
  - Gurkaynak et al.; Kavun et al.; SHA-3 2012

VHDL code available at http://keccak.noekeon.org/

## Questions?

