Harm of Dead Store Elimination

Background

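  Recently, during a code review with a large company, the reviewers suggested replacing memset with memzero_explicit to improve security. I tried it out, and it really is a security issue. This post shows the results of that experiment.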

Concept of Dead Store

  Wikipedia defines a dead store as follows.

In computer programming, a dead store is a local variable that is assigned a value but is read by no following instruction. Dead stores waste processor time and memory, and may be detected through the use of static program analysis, and removed by an optimizing compiler.

  Here is an example:

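#include <stdio.h>
#include <string.h>

void test_memset(void)
{
        unsigned char test[] = "hello_memzero_explicit";
        unsigned int length = strlen((const char *)test);

        printf("%s\n", test);
        /* test is never read again, so this store is dead */
        memset(test, 0, length);
}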

  Compiled with an optimization level of -O1 or higher, objdump shows the following:

0000000000400650 <test_memset>:
  400650:       90000001        adrp    x1, 400000 <__abi_tag-0x254>
  400654:       911be021        add     x1, x1, #0x6f8
  400658:       a9bd7bfd        stp     x29, x30, [sp, #-48]!
  40065c:       910003fd        mov     x29, sp
  400660:       a9400c22        ldp     x2, x3, [x1]
  400664:       a9018fe2        stp     x2, x3, [sp, #24]
  400668:       910063e0        add     x0, sp, #0x18
  40066c:       f840f021        ldur    x1, [x1, #15]
  400670:       f80273e1        stur    x1, [sp, #39]
  400674:       97ffff9f        bl      4004f0 <puts@plt>
  400678:       a8c37bfd        ldp     x29, x30, [sp], #48
  40067c:       d65f03c0        ret

  For comparison, here is the result when compiled with -O0:

0000000000400684 <test_memset>:
  400684:       a9bd7bfd        stp     x29, x30, [sp, #-48]!
  400688:       910003fd        mov     x29, sp
  40068c:       90000000        adrp    x0, 400000 <__abi_tag-0x254>
  400690:       911f0000        add     x0, x0, #0x7c0
  400694:       910043e2        add     x2, sp, #0x10
  400698:       aa0003e3        mov     x3, x0
  40069c:       a9400460        ldp     x0, x1, [x3]
  4006a0:       a9000440        stp     x0, x1, [x2]
  4006a4:       91003c61        add     x1, x3, #0xf
  4006a8:       91003c40        add     x0, x2, #0xf
  4006ac:       f9400021        ldr     x1, [x1]
  4006b0:       f9000001        str     x1, [x0]
  4006b4:       910043e0        add     x0, sp, #0x10
  4006b8:       97ffff92        bl      400500 <strlen@plt>
  4006bc:       b9002fe0        str     w0, [sp, #44]
  4006c0:       910043e0        add     x0, sp, #0x10
  4006c4:       97ffffa3        bl      400550 <puts@plt>
  4006c8:       b9402fe1        ldr     w1, [sp, #44]
  4006cc:       910043e0        add     x0, sp, #0x10
  4006d0:       aa0103e2        mov     x2, x1
  4006d4:       52800001        mov     w1, #0x0                        // #0
  4006d8:       97ffff92        bl      400520 <memset@plt>
  4006dc:       d503201f        nop
  4006e0:       a8c37bfd        ldp     x29, x30, [sp], #48
  4006e4:       d65f03c0        ret

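  As you can see, with -O1 or higher the memset call is optimized away (and the strlen call along with it, since its result was only used by memset). This is what is known as Dead Store Elimination.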

Possible Consequence

  Imagine that this buffer holds sensitive data such as a password. The intent was to wipe it, but the data stays on the stack in memory, where it is at risk of being leaked.

Removing buffer scrubbing code is an example of what D’Silva et al. [30] call a “correctness-security gap.” From the perspective of the C standard, removing the memset above is allowed because the contents of unreachable memory are not considered part of the semantics of the C program. However, leaving sensitive data in memory increases the damage posed by memory disclosure vulnerabilities and direct attacks on physical memory. This leaves gap between what the standard considers correct and what a security developer might deem correct. Unfortunately, the C language does not provide a guaranteed way to achieve what the developer intends, and attempts to add a memory scrubbing function to the C standard library have not seen mainstream adoption. Security-conscious developers have been left to devise their own means to keep the compiler from optimizing away their scrubbing functions, and this has led to a proliferation of “secure memset” implementations of varying quality.

Solution

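  Reference 3 offers quite a few approaches, and developers can choose among them depending on their development environment. Here I try out a few of the C-related ones.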

OpenBSD explicit_bzero

/* Set N bytes of S to 0.  The compiler will not delete a call to this
   function, even if S is dead after the call.  */
extern void explicit_bzero (void *__s, size_t __n) __THROW __nonnull ((1))
    __fortified_attr_access (__write_only__, 1, 2);
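
As a rough sketch of how this function might be used, the example below copies a secret into a stack buffer and scrubs the copy with explicit_bzero before returning. It assumes glibc 2.25 or later (or a BSD libc that provides explicit_bzero); the function name `check_password` and the hard-coded comparison string are made up for illustration.

```c
#define _DEFAULT_SOURCE   /* exposes explicit_bzero in glibc's <string.h> */
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copies a secret onto the stack, uses it, then scrubs
 * the copy. Returns 1 if the input matches the expected value, 0 otherwise. */
int check_password(const char *input)
{
        char secret[64];
        int ok;

        /* Bounded copy that always leaves the buffer NUL-terminated. */
        strncpy(secret, input, sizeof(secret) - 1);
        secret[sizeof(secret) - 1] = '\0';

        ok = (strcmp(secret, "hello_memzero_explicit") == 0);

        /* Unlike a plain memset here, this call is not removed as a dead store. */
        explicit_bzero(secret, sizeof(secret));
        return ok;
}
```

The whole buffer is scrubbed with sizeof(secret) rather than strlen, so no stale bytes beyond the string terminator are left behind.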

Disabling Optimization

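  This approach is very safe, but it gives up the compiler's code optimizations and the code will run slower, so whether to use it depends on your actual situation.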
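
If turning off optimization for the whole build is too costly, one possible variation is to disable it only for the function that owns the sensitive buffer. The sketch below assumes GCC's `optimize("O0")` function attribute or Clang's `optnone` attribute; the `NO_OPTIMIZE` macro and the function `sum_secret_unoptimized` are hypothetical names. Note that the GCC documentation describes the `optimize` attribute as intended for debugging rather than production code, so treat this as an illustration, not a recommendation.

```c
#include <stddef.h>
#include <string.h>

/* Assumption: compiler-specific attributes. On other compilers the macro
 * expands to nothing and gives no protection at all. */
#if defined(__clang__)
#define NO_OPTIMIZE __attribute__((optnone))
#elif defined(__GNUC__)
#define NO_OPTIMIZE __attribute__((optimize("O0")))
#else
#define NO_OPTIMIZE
#endif

/* Hypothetical example: sums the bytes of a secret held in a stack buffer,
 * then clears the buffer. The function is compiled without optimization,
 * so the trailing memset is expected to survive as it does with -O0. */
NO_OPTIMIZE unsigned int sum_secret_unoptimized(const char *input)
{
        unsigned char secret[32];
        size_t n = strlen(input);
        unsigned int sum = 0;
        size_t i;

        if (n > sizeof(secret))
                n = sizeof(secret);     /* truncate over-long input */
        memcpy(secret, input, n);
        for (i = 0; i < n; i++)
                sum += secret[i];

        memset(secret, 0, sizeof(secret));
        return sum;
}
```

Checking the objdump output, as done earlier in this post, is still the only way to confirm that the memset is really present in a given build.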

Volatile Function Pointer

  The memzero_explicit implementation in OP-TEE uses this approach.

static volatile void * (*memset_func)(void *, int, size_t) =
	(volatile void * (*)(void *, int, size_t))&memset;

void memzero_explicit(void *s, size_t count)
{
	memset_func(s, 0, count);
}

  OpenSSL's OPENSSL_cleanse is implemented the same way.

typedef void *(*memset_t)(void *, int, size_t);

static volatile memset_t memset_func = memset;

void OPENSSL_cleanse(void *ptr, size_t len)
{
    memset_func(ptr, 0, len);
}

Volatile Data Pointer

  I declared the buffer in the example as volatile, but that does not seem to help.

void test_memset(void)
{
        volatile unsigned char test[] = "hello_memzero_explicit";
        unsigned int length = strlen(test);

        printf("%s\n", test);
        memset(test, 0, length);
}
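
One likely reason the experiment above has no effect is that passing a volatile array to memset discards the qualifier at the call. The technique usually meant by "volatile data pointer" is different: the scrubbing routine itself writes every byte through a volatile-qualified pointer. A minimal sketch follows; the name `secure_zero` is hypothetical. Compilers in practice keep these stores, but the C standard only strictly guarantees that for objects that are themselves defined as volatile.

```c
#include <stddef.h>

/* Hypothetical helper: zeroes n bytes at s, one byte at a time, through a
 * volatile-qualified pointer so each store counts as a volatile access. */
void secure_zero(void *s, size_t n)
{
        volatile unsigned char *p = (volatile unsigned char *)s;

        while (n--)
                *p++ = 0;
}
```

The byte-by-byte loop is also why this variant tends to be slower than approaches that keep the native memset.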

Memory Barrier

  Change the code to:

#define barrier_data(ptr) \
        __asm__ __volatile__("": :"r"(ptr) :"memory")

void test_memset(void)
{
        unsigned char test[] = "hello_memzero_explicit";
        unsigned int length = strlen(test);

        printf("%s\n", test);
        memset(test, 0, length);
        barrier_data(test);
}

  The disassembly is:

0000000000400650 <test_memset>:
  400650:       90000001        adrp    x1, 400000 <__abi_tag-0x254>
  400654:       911c6021        add     x1, x1, #0x718
  400658:       a9bc7bfd        stp     x29, x30, [sp, #-64]!
  40065c:       910003fd        mov     x29, sp
  400660:       a9400c22        ldp     x2, x3, [x1]
  400664:       f9000bf3        str     x19, [sp, #16]
  400668:       f840f021        ldur    x1, [x1, #15]
  40066c:       9100a3f3        add     x19, sp, #0x28
  400670:       a9028fe2        stp     x2, x3, [sp, #40]
  400674:       aa1303e0        mov     x0, x19
  400678:       f80373e1        stur    x1, [sp, #55]
  40067c:       97ffff9d        bl      4004f0 <puts@plt>
  400680:       a902ffff        stp     xzr, xzr, [sp, #40] //clear first 16 Bytes
  400684:       b9003bff        str     wzr, [sp, #56]      //clear 4 Bytes
  400688:       79007bff        strh    wzr, [sp, #60]      //clear 2 Bytes
  40068c:       f9400bf3        ldr     x19, [sp, #16]
  400690:       a8c47bfd        ldp     x29, x30, [sp], #64
  400694:       d65f03c0        ret
  400698:       d503201f        nop
  40069c:       d503201f        nop

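  At first glance there is no call to memset here, and indeed there is none. However, the stp, str, and strh instructions zero the buffer allocated on the stack: 16 + 4 + 2 = 22 bytes, exactly the strlen result. The memzero_explicit implementation in Linux uses this memory barrier approach.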
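
Wrapped up as a reusable helper, the barrier approach could look like the sketch below. It is modeled on the memzero_explicit and barrier_data pair in the Linux kernel headers and assumes a compiler that supports GCC-style inline assembly (GCC or Clang).

```c
#include <stddef.h>
#include <string.h>

/* Empty asm statement that takes the pointer as an input and clobbers
 * "memory": the compiler must assume the buffer contents are read here,
 * so the preceding stores can no longer be treated as dead. */
#define barrier_data(ptr) \
        __asm__ __volatile__("" : : "r"(ptr) : "memory")

static inline void memzero_explicit(void *s, size_t count)
{
        memset(s, 0, count);
        barrier_data(s);
}
```

Because the memset stays visible to the compiler, it can still be expanded inline into wide stores, as in the disassembly above.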

Performance

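  In Reference 3 the authors give a detailed performance analysis; the large block size case is the one to focus on. The conclusion is to use the native memset wherever possible: as long as it is kept from being optimized away, it performs very well, as with the Volatile Function Pointer approach. Judging from the disassembly of the memory barrier approach used by Linux, each store clears the largest chunk it can, for example 32 bytes at a time with NEON, so its efficiency should not be bad either. It does not use a loop, though, so I worry that the code size could get large. Below is the disassembly of the memory barrier approach clearing a 161-byte buffer.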

  4006f0:       4f000400        movi    v0.4s, #0x0
  4006f4:       3902827f        strb    wzr, [x19, #160]    // 1 Bytes
  4006f8:       ad000260        stp     q0, q0, [x19]       // 32 Bytes
  4006fc:       ad010260        stp     q0, q0, [x19, #32]  // 32 Bytes
  400700:       ad020260        stp     q0, q0, [x19, #64]  // 32 Bytes
  400704:       ad030260        stp     q0, q0, [x19, #96]  // 32 Bytes
  400708:       ad040260        stp     q0, q0, [x19, #128] // 32 Bytes

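  The same holds for small block sizes: memset performs best, but since the size itself is small, the difference is minor. If you are interested, the PDF is worth reading carefully.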

Reference

1. Dead Store
2. Dead Store Elimination
3. Harm of DSE

This post is licensed under CC BY 4.0 by the author.