ptmalloc 的简单分析

Posted on May 23, 2021

前言

断断续续写了一个多月，终于写完了。写本文的主要目的是获得一个对堆的理性认知，从知其然到知其所以然。主要分析了 __libc_malloc，_int_malloc，__libc_free，_int_free，malloc_consolidate 五个比较长且核心的函数。基本覆盖了较小（即未大到需要进行 sys_malloc）的情况。当然对于较大的请求的处理也是有必要学习的，之后应该会再写一篇。总的来说本文写的比较简略，且略过了对 malloc_chunk 结构体、unlink 函数的介绍，您在看之前应该需要对 ptmalloc 的流程和基础结构已有一定了解。

当然由于本人才学疏浅，肯定会有许多纰漏，欢迎各位指正。

本文分析的 libc 版本为 glibc 2.33

约定：

如无特别说明，内部大小指用户申请的空间所对应的 chunk 的大小（包括 overhead），即 normalized request size。申请的大小则需要根据上下文来区分，可能指用户传入的大小，也可能指内部大小。
bin 和链表：四大 bin（fastbin，smallbin，largebin，unsorted bin）和 tcache 都是由一个或多个储存特定大小的 chunk 的 bin 组成的，而每个 bin 从数据结果上的角度来说都是隔离链表，以下对这样的 bin 可能会有多种称呼，如 bin，链表，隔离链表等，他们都指同样的东西。
largebin 的特性保证了其大小可以容纳两对双向链表指针，一对是和其他 chunk 类似的隔离链表指针，另一对是用来加速 largebin 遍历的 fd_nextsize 和 bk_nextsize 指针，我称之为 skip 链表（glibc 的注释中，称该链表为 skip list）
源码中注释使用的都是 C 风格的（/*..*/），所以笔者使用 // 来表示该注释是笔者加入的。
关于 best-fit 和 exact-fit：在 ptmalloc 中，exact-fit 代表返回的 chunk 大小和内部大小恰好相等，best-fit 则是最优的分配，这样的分配是可能会进行切割的，被切割的 chunk 是大小最接近的内部大小的 chunk。总的来说 ptmalloc 大多数情况下是 exact-fit 的，在 largebin 中则是 best-fit。

一些容易忘记的数据

MINSIZE：

#define MINSIZE  \
  (unsigned long)(((MIN_CHUNK_SIZE+MALLOC_ALIGN_MASK) & ~MALLOC_ALIGN_MASK))

64 位下值为 0x20，32 位下值为 0x10，不过这个也不绝对，在 2.26 及之后，由于头文件包含优先级的问题，导致 32 位下的 MALLOC_ALIGN_MASK 也变成了 15，及此时 MINSIZE 为 0x20，每个 chunk 以 0x10 对齐。具体可以参考此文。

SIZE_SZ：

32 位下为 4，64 位下为 8，就是机器字长。

malloc

__libc_malloc

glibc 通过 strong_alias (__libc_malloc, malloc) 使 malloc 成了 __libc_malloc 的别名，我们在调用 malloc 时，实际调用的是 __libc_malloc。其实现如下。

void *
__libc_malloc (size_t bytes)
{
  mstate ar_ptr;
  void *victim;

  _Static_assert (PTRDIFF_MAX <= SIZE_MAX / 2,
                  "PTRDIFF_MAX is not more than half of SIZE_MAX");

  void *(*hook) (size_t, const void *)
    = atomic_forced_read (__malloc_hook);
  if (__builtin_expect (hook != NULL, 0))
    return (*hook)(bytes, RETURN_ADDRESS (0));
#if USE_TCACHE
  /* int_free also calls request2size, be careful to not pad twice.  */
  size_t tbytes;
  if (!checked_request2size (bytes, &tbytes))
    {
      __set_errno (ENOMEM);
      return NULL;
    }
  size_t tc_idx = csize2tidx (tbytes);

  MAYBE_INIT_TCACHE ();

  DIAG_PUSH_NEEDS_COMMENT;
  if (tc_idx < mp_.tcache_bins
      && tcache
      && tcache->counts[tc_idx] > 0)
    {
      victim = tcache_get (tc_idx);
      return TAG_NEW_USABLE (victim);
    }
  DIAG_POP_NEEDS_COMMENT;
#endif

  if (SINGLE_THREAD_P)
    {
      victim = TAG_NEW_USABLE (_int_malloc (&main_arena, bytes));
      assert (!victim || chunk_is_mmapped (mem2chunk (victim)) ||
	      &main_arena == arena_for_chunk (mem2chunk (victim)));
      return victim;
    }

  arena_get (ar_ptr, bytes);

  victim = _int_malloc (ar_ptr, bytes);
  /* Retry with another arena only if we were able to find a usable arena
     before.  */
  if (!victim && ar_ptr != NULL)
    {
      LIBC_PROBE (memory_malloc_retry, 1, bytes);
      ar_ptr = arena_get_retry (ar_ptr, bytes);
      victim = _int_malloc (ar_ptr, bytes);
    }

  if (ar_ptr != NULL)
    __libc_lock_unlock (ar_ptr->mutex);

  victim = TAG_NEW_USABLE (victim);

  assert (!victim || chunk_is_mmapped (mem2chunk (victim)) ||
          ar_ptr == arena_for_chunk (mem2chunk (victim)));
  return victim;
}
libc_hidden_def (__libc_malloc)

_Static_assert 是静态断言，在编译期完成，和我们分析的关系不大。

void *(*hook) (size_t, const void *)
    = atomic_forced_read (__malloc_hook);
  if (__builtin_expect (hook != NULL, 0))
    return (*hook)(bytes, RETURN_ADDRESS (0));

初始化

进入流程第一步是对 __malloc_hook 进行检测，如果 __malloc_hook 的值不为零，则直接调用其所指的函数。在第一次执行 __libc_malloc，也就是堆尚未初始化的情况下，__malloc_hook 指向的是 malloc_hook_ini 这个函数。

static void *
malloc_hook_ini (size_t sz, const void *caller)
{
  __malloc_hook = NULL;
  ptmalloc_init ();
  return __libc_malloc (sz);
}

这个函数会调用 ptmalloc_init () 来对堆进行一些初始化，主要是对 main_arena 的初始化等，然后重新调用 __libc_malloc 来为用户申请空间。

Tcache

完成堆初始化后，进入 Tcache 分支，总体如下

#if USE_TCACHE
  /* int_free also calls request2size, be careful to not pad twice.  */
  size_t tbytes;
  if (!checked_request2size (bytes, &tbytes))
    {
      __set_errno (ENOMEM);
      return NULL;
    }
  size_t tc_idx = csize2tidx (tbytes);

  MAYBE_INIT_TCACHE ();

  DIAG_PUSH_NEEDS_COMMENT;
  if (tc_idx < mp_.tcache_bins
      && tcache
      && tcache->counts[tc_idx] > 0)
    {
      victim = tcache_get (tc_idx);
      return TAG_NEW_USABLE (victim);
    }
  DIAG_POP_NEEDS_COMMENT;
#endif

预处理

首先通过 checked_request2size 将用户请求的大小转为内部的大小并存储到 tbytes 中，注意到它在 2.33 下它已经成为函数，定义如下

static inline bool
checked_request2size (size_t req, size_t *sz) __nonnull (1)
{
  if (__glibc_unlikely (req > PTRDIFF_MAX))
    return false;

#ifdef USE_MTAG
  /* When using tagged memory, we cannot share the end of the user
     block with the header for the next chunk, so ensure that we
     allocate blocks that are rounded up to the granule size.  Take
     care not to overflow from close to MAX_SIZE_T to a small
     number.  Ideally, this would be part of request2size(), but that
     must be a macro that produces a compile time constant if passed
     a constant literal.  */
  req = (req + ~__mtag_granule_mask) & __mtag_granule_mask;
#endif

  *sz = request2size (req);
  return true;
}

和旧版本主要区别在于不再会主动报错，需要主调函数自行判断申请的空间是否过大

然后通过 csize2tidx(tbytes) 算出请求的大小所处的 idx，存入 tc_idx 中

# define csize2tidx(x) (((x) - MINSIZE + MALLOC_ALIGNMENT - 1) / MALLOC_ALIGNMENT)

如果此时 Tcache 还未初始化，自然要先初始化 Tcache，通过宏 MAYBE_INIT_TCACHE() 判断执行

# define MAYBE_INIT_TCACHE() \
  if (__glibc_unlikely (tcache == NULL)) \
    tcache_init();

Tcache 管理结构体

此处的指针 tcache 的定义为 static __thread tcache_perthread_struct *tcache = NULL;，注意到该全局变量使用了 __thread 关键字修饰，在每个线程对这个变量访问时，访问的都是其副本，修改时只会修改副本的值，而不会对原值（image）发生修改，也就是说每个线程都拥有一个自己的 Tcache。tcache_perthread_struct 的定义如下

typedef struct tcache_perthread_struct
{
  uint16_t counts[TCACHE_MAX_BINS];
  tcache_entry *entries[TCACHE_MAX_BINS];
} tcache_perthread_struct;

在 2.30 及之后版本中，counts 数组每一个计数器都占两字节，而不是旧版本中的一字节，改变的原因尚无权威的说法，这里不做揣测。TCACHE_MAX_BINS 的值为 64，也就是 Tcache 最多支持 64 个不同大小的隔离链表，内部大小从 0x20 到 0x410。由于 counts 类型的变化，新版本中每一个该结构体多会多占用 0x40 的空间，比如 64 位下总共占用 0x290 字节。

tcache_entry 结构体的定义为

typedef struct tcache_entry
{
  struct tcache_entry *next;
  /* This field exists to detect double frees.  */
  struct tcache_perthread_struct *key;
} tcache_entry;

key 字段是在 2.28 及之后的版本后加入的，存在的意义是检测 double free，具体原理后面再说。

注意对 entries 的定义，使用的是 tcache_entry *entries[TCACHE_MAX_BINS];。可知 entries[tc_idx] 指向是链表中的第一个 chunk。

初始化

如果 tcache 指针指向 NULL，就代表该线程中 tcache 尚未初始化，就会调用 tcache_init 尝试为该结构体分配空间。

static void
tcache_init(void)
{
  mstate ar_ptr;
  void *victim = 0;
  const size_t bytes = sizeof (tcache_perthread_struct);

  if (tcache_shutting_down)
    return;

  arena_get (ar_ptr, bytes);
  victim = _int_malloc (ar_ptr, bytes);
  if (!victim && ar_ptr != NULL)
    {
      ar_ptr = arena_get_retry (ar_ptr, bytes);
      victim = _int_malloc (ar_ptr, bytes);
    }


  if (ar_ptr != NULL)
    __libc_lock_unlock (ar_ptr->mutex);

  /* In a low memory situation, we may not be able to allocate memory
     - in which case, we just keep trying later.  However, we
     typically do this very early, so either there is sufficient
     memory, or there isn't enough memory to do non-trivial
     allocations anyway.  */
  if (victim)
    {
      tcache = (tcache_perthread_struct *) victim;
      memset (tcache, 0, sizeof (tcache_perthread_struct));
    }
}

初始化的过程也比较简单，就是获取分配区加锁（这里的 ar_ptr 通过 arena_get 宏指向了 thread_arena），调用 _int_malloc 来为 tcache 分配管理器所需空间然后解锁并对分配的空间清零。

分配

完成初始化后执行

DIAG_PUSH_NEEDS_COMMENT;
  if (tc_idx < mp_.tcache_bins
      && tcache
      && tcache->counts[tc_idx] > 0)
    {
      victim = tcache_get (tc_idx);
      return TAG_NEW_USABLE (victim);
    }
  DIAG_POP_NEEDS_COMMENT;
#endif

DIAG_PUSH_NEEDS_COMMENT; DIAG_POP_NEEDS_COMMENT; 这两句比较怪异，定义在 libc-diag.h 中，不是很重要。

if (tc_idx < mp_.tcache_bins
      && tcache
      && tcache->counts[tc_idx] > 0)

这里判断了

申请的空间是否在 tcache 的范围中（即小于 0x410）
tcache 是否被成功初始化，当机器所能支持的最大堆内存非常小（这种情况应该比较少）或者其他分配区占用了过多的空间时，初始化就会失败，这个时候自然不能再使用 tcache 进行分配。
对应的大小的链表中是否有 chunk。（注意这是在 2.30 版本修改的，在旧版本中是用 tcache->entries[tc_idx] != NULL 来判断的，建议思考一下两者的区别。在旧版本中相对更容易利用一些，直接改最底部的 tcache chunk 的 next 就可以实现任意地址分配）

如果三个条件都满足，则调用 tcache_get (tc_idx) 来为 victim 分配空间

/* Caller must ensure that we know tc_idx is valid and there's
   available chunks to remove.  */
static __always_inline void *
tcache_get (size_t tc_idx)
{
  tcache_entry *e = tcache->entries[tc_idx];
  if (__glibc_unlikely (!aligned_OK (e)))
    malloc_printerr ("malloc(): unaligned tcache chunk detected");
  tcache->entries[tc_idx] = REVEAL_PTR (e->next);
  --(tcache->counts[tc_idx]);
  e->key = NULL;
  return (void *) e;
}

首先取出链表头存入局部变量 e 中，先检测 e 指向的 chunk 是否对齐（64 位下和 0x10 对齐，32 位下和 0x8 对齐），然后对 e->next 通过 REVEAL_PTR 解密后更新 entries[tc_idx]，并更新 counts[tc_idx] 的值，完成取出 chunk 的操作，并对将要返还给用户的 chunk 的 key 字段清零。

关于这里的 REVEAL_PTR，与 2.32（包括）之后的版本中加入的 Safe-Linking 操作有关。

是 2.32 加入的不是 2.31，2.31 下就放心去打吧，不用泄露堆地址

/* Safe-Linking:
   Use randomness from ASLR (mmap_base) to protect single-linked lists
   of Fast-Bins and TCache.  That is, mask the "next" pointers of the
   lists' chunks, and also perform allocation alignment checks on them.
   This mechanism reduces the risk of pointer hijacking, as was done with
   Safe-Unlinking in the double-linked lists of Small-Bins.
   It assumes a minimum page size of 4096 bytes (12 bits).  Systems with
   larger pages provide less entropy, although the pointer mangling
   still works.  */
#define PROTECT_PTR(pos, ptr) \
  ((__typeof (ptr)) ((((size_t) pos) >> 12) ^ ((size_t) ptr)))
#define REVEAL_PTR(ptr)  PROTECT_PTR (&ptr, ptr)

核心的是

#define PROTECT_PTR(pos, ptr) \
  ((__typeof (ptr)) ((((size_t) pos) >> 12) ^ ((size_t) ptr)))

这个宏，进行了异或加密，将 next 指针（对 tcache 而言）或 fd 指针（对 fastbin 而言）与该指针所在的地址的高 52 位相异或进行加解密（这里的位移操作还是有必要的，不然由于高地址相同的概率较大，低 12 位又可预测，容易被爆破）。当前版本只对 tcache 和 fastbin 进行了 Safe-Linking 的操作，加上了这一层保护后进行相关的攻击的时候就可能需要 leak 堆的地址了，给我们的利用增加了不少难度。

通过 tcache_get 取得目标 chunk 了后，就会通过 return TAG_NEW_USABLE (victim); 返回给用户。（这里的 TAG_NEW_USABLE 默认情况下也没做，就是简单的将 victim 返回给用户，这也是在 2.33 中新加入的）。

到这里就可以看出 Tcache 的优先级是所有 bins 中最高的，在 __libc_malloc 中就完成了分配。

准备调用 _int_malloc

如果 Tcache 无法完成分配，就会执行下面的流程来调用 _int_malloc 来分配

if (SINGLE_THREAD_P)
    {
      victim = TAG_NEW_USABLE (_int_malloc (&main_arena, bytes));
      assert (!victim || chunk_is_mmapped (mem2chunk (victim)) ||
	      &main_arena == arena_for_chunk (mem2chunk (victim)));
      return victim;
    }

  arena_get (ar_ptr, bytes);

  victim = _int_malloc (ar_ptr, bytes);

这里会对线程模式进行判断，但都是获取线程的分配区然后调用 _int_malloc 来分配。

_int_malloc

引言

这个基本上是 malloc 的核心函数了，比较长。

先粗略地说一下 _int_malloc 中的分配流程

从 fast bin 中用 best fit 策略进行分配，如果成功，退出流程
从 small bin 中用 best fit 策略进行分配，如果成功，退出流程
到了这一步，说明 fast bin 和 small bin 中都不存在合适的 chunk，就可能需要从 unsorted bin 和 large bin 中尝试分配，但并不是直接分配，还要经过下面的流程
- 如果申请是 large request，则调用 malloc_consolidate 函数。这个函数做的就是把 fast bin 中所有的 bin 能合并的都合并掉，然后把整个 fast bin 中的 bin 都放到 Unsorted Bin 中。
- 遍历 Unsorted Bin，对其中的每个 chunk 分为两种情况
  - small request：存在合适的 chunk（注意这个合适的条件是十分苛刻的，详见后文中“唯一的切割分配”）就会进行切割（tcache 加入后这里会进行一个 stash 操作，tcache 加入前会直接把 chunk 返回给用户），否则就把当前的 chunk 放到其对应的 bin 中
  - 非 small request：放到对应的 bin 中
- 尝试从 large bin 中对于 idx 的链表中分配
到了这里表面没有对应的链表中都没有 chunk 可以分配，从更大的链表中尝试分配，这里使用 bitmap 来遍历。

tcache，fastbin，smallbin 都是基于 bestfit 策略的，而 unsorted bin 是几个 bin 的缓冲区，必要时会切割合并。当然几个 bin 中的 chunk 也会在少数情况下合并。

定义在开头的一些常用变量

INTERNAL_SIZE_T nb;               /* normalized request size */
  unsigned int idx;                 /* associated bin index */
  mbinptr bin;                      /* associated bin */

  mchunkptr victim;                 /* inspected/selected chunk */
  INTERNAL_SIZE_T size;             /* its size */
  int victim_index;                 /* its bin index */

  mchunkptr remainder;              /* remainder from a split */
  unsigned long remainder_size;     /* its size */

  unsigned int block;               /* bit map traverser */
  unsigned int bit;                 /* bit map traverser */
  unsigned int map;                 /* current word of binmap */

  mchunkptr fwd;                    /* misc temp for linking */
  mchunkptr bck;                    /* misc temp for linking */

#if USE_TCACHE
  size_t tcache_unsorted_count;	    /* count of unsorted chunks processed */
#endif

这里的变量 block，bit，map 三个变量是为了从 binmap 中寻址的，这个机构我们比较陌生，这里多说一点，binmap 是在结构体 mstate 中的一个成员数组，定义为

unsigned int binmap[BINMAPSIZE];

fastbin

首先是 fastbin，在 2.26 之后的版本中，由于增加了 fastbin stash 的机制，fastbin 可能需要进行多次解链，将该操作在多线程情况下的处理抽离成了一个宏

#define REMOVE_FB(fb, victim, pp)			\
  do							\
    {							\
      victim = pp;					\
      if (victim == NULL)				\
	break;						\
      pp = REVEAL_PTR (victim->fd);                                     \
      if (__glibc_unlikely (pp != NULL && misaligned_chunk (pp)))       \
	malloc_printerr ("malloc(): unaligned fastbin chunk detected"); \
    }							\
  while ((pp = catomic_compare_and_exchange_val_acq (fb, pp, victim)) \
	 != victim);					\

注意调用时的方法为 REMOVE_FB (fb, pp, victim);，宏中的 catomic_compare_and_exchange_val_acq 解释如下

/* Atomically store NEWVAL in *MEM if *MEM is equal to OLDVAL.
   Return the old *MEM value.  */

可见该宏通过原子变量来避免多线程情况下的赋值错误，而做的事情和单线程下也无区别，就是 *fb = REVEAL_PTR (victim->fd);。

if ((unsigned long) (nb) <= (unsigned long) (get_max_fast ()))
    {
      idx = fastbin_index (nb);
      mfastbinptr *fb = &fastbin (av, idx);
      mchunkptr pp;
      victim = *fb;

      if (victim != NULL)
	{
	  if (__glibc_unlikely (misaligned_chunk (victim)))
	    malloc_printerr ("malloc(): unaligned fastbin chunk detected 2");

	  if (SINGLE_THREAD_P)
	    *fb = REVEAL_PTR (victim->fd);
	  else
	    REMOVE_FB (fb, pp, victim);
	  if (__glibc_likely (victim != NULL))
	    {
	      size_t victim_idx = fastbin_index (chunksize (victim));
	      if (__builtin_expect (victim_idx != idx, 0))
		malloc_printerr ("malloc(): memory corruption (fast)");
	      check_remalloced_chunk (av, victim, nb);

如果内部大小小于 fastbin 的最大大小，则尝试从 fastbin 中分配，通过 get_max_fast 函数来取得最大大小

static inline INTERNAL_SIZE_T
get_max_fast (void)
{
  /* Tell the GCC optimizers that global_max_fast is never larger
     than MAX_FAST_SIZE.  This avoids out-of-bounds array accesses in
     _int_malloc after constant propagation of the size parameter.
     (The code never executes because malloc preserves the
     global_max_fast invariant, but the optimizers may not recognize
     this.)  */
  if (global_max_fast > MAX_FAST_SIZE)
    __builtin_unreachable ();
  return global_max_fast;
}

也就是直接返回 global_max_fast，是一个全局变量（pwn 中有时可以通过修改这个变量来进行进一步利用）。

fastbin 的分配处理比较简单：

通过 idx = fastbin_index (nb); 算出申请的空间所属的下标，然后通过 mfastbinptr *fb = &fastbin (av, idx); 取得链表头，两函数的定义为

#define fastbin(ar_ptr, idx) ((ar_ptr)->fastbinsY[idx])
  
  /* offset 2 to use otherwise unindexable first 2 bins */
#define fastbin_index(sz)
  ((((unsigned int) (sz)) >> (SIZE_SZ == 8 ? 4 : 3)) - 2)

把取得的链表头给 victim，判断 victim 是否为 NULL，若不为 NULL，则代表 fastbin 中存在至少一个 chunk，然后就会进行解链，也就是 *fb = REVEAL_PTR (victim->fd);，将头节点指向链表中的下一个节点
进行 size 的合法性检查，正是此检查使我们在做 house of spirit 的时候必须在目标地址伪造 size。

if (__glibc_likely (victim != NULL))
	  {
	    size_t victim_idx = fastbin_index (chunksize (victim));
	    if (__builtin_expect (victim_idx != idx, 0))
		malloc_printerr ("malloc(): memory corruption (fast)");
	    check_remalloced_chunk (av, victim, nb);// 这一句仅调试时启用，做了更细致的检查

在 2.26 tcache 加入后，便增加了 fastbin stash 的机制，紧跟在合法性检查之后，其实现如下

#if USE_TCACHE
	      /* While we're here, if we see other chunks of the same size,
		 stash them in the tcache.  */
	      size_t tc_idx = csize2tidx (nb);
	      if (tcache && tc_idx < mp_.tcache_bins)
		{
		  mchunkptr tc_victim;

		  /* While bin not empty and tcache not full, copy chunks.  */
		  while (tcache->counts[tc_idx] < mp_.tcache_count
			 && (tc_victim = *fb) != NULL)
		    {
		      if (__glibc_unlikely (misaligned_chunk (tc_victim)))
			malloc_printerr ("malloc(): unaligned fastbin chunk detected 3");
		      if (SINGLE_THREAD_P)
			*fb = REVEAL_PTR (tc_victim->fd);
		      else
			{
			  REMOVE_FB (fb, pp, tc_victim);
			  if (__glibc_unlikely (tc_victim == NULL))
			    break;
			}
		      tcache_put (tc_victim, tc_idx);
		    }
		}
#endif

实现也简单，就是从 fastbin 中一个个取出，同时一个个放到 tcache 中，直到 fastbin 空或 tcache 满，注意这里仅做了 chunk 是否对齐的检查，也就是说如果能控制一个 fastbin 的指针，借用 stash 机制，我们可以将任意地址放入 tcache 中，但是一般不会这么做（可以直接用 tcache 打..）。

void *p = chunk2mem (victim);
	      alloc_perturb (p, bytes);
	      return p;

完成 stash 之后就可以返回分配的 chunk 了，return 前的两个语句一般情况下都不会生效。

在加入 tcache 机制后，fastbin attack 变得黯然失色，这里我也放一下 2.23 下 fastbin 的实现

/*
     If the size qualifies as a fastbin, first check corresponding bin.
     This code is safe to execute even if av is not yet initialized, so we
     can try it without checking, which saves some time on this fast path.
   */

  if ((unsigned long) (nb) <= (unsigned long) (get_max_fast ()))
    {
      idx = fastbin_index (nb);
      mfastbinptr *fb = &fastbin (av, idx);
      mchunkptr pp = *fb;
      do
        {
          victim = pp;
          if (victim == NULL)
            break;
        }
      while ((pp = catomic_compare_and_exchange_val_acq (fb, victim->fd, victim))
             != victim);
      if (victim != 0)
        {
          if (__builtin_expect (fastbin_index (chunksize (victim)) != idx, 0))
            {
              errstr = "malloc(): memory corruption (fast)";
            errout:
              malloc_printerr (check_action, errstr, chunk2mem (victim), av);
              return NULL;
            }
          check_remalloced_chunk (av, victim, nb);
          void *p = chunk2mem (victim);
          alloc_perturb (p, bytes);
          return p;
        }
    }

和 2.33 相比基本上只是少了 tcache stash、safe-linking、align_check 三个操作。

smallbin

如果 fastbin 无法完成分配，则尝试到 smallbin 中寻找 bestfit。（其实 ctf 的堆题中 smallbin 的存在感并不是很高，到现在为止只在 house of orange，tcache stash unlink 以及一次通过 unsortedbin leak 不出来而不得不使用 smallbin leak 中碰到过相关的利用。）

/*
     If a small request, check regular bin.  Since these "smallbins"
     hold one size each, no searching within bins is necessary.
     (For a large request, we need to wait until unsorted chunks are
     processed to find best fit. But for small ones, fits are exact
     anyway, so we can check now, which is faster.)
   */

  if (in_smallbin_range (nb))
    {
      idx = smallbin_index (nb);
      bin = bin_at (av, idx);

      if ((victim = last (bin)) != bin)
        {
          bck = victim->bk;
	  if (__glibc_unlikely (bck->fd != victim))
	    malloc_printerr ("malloc(): smallbin double linked list corrupted");
          set_inuse_bit_at_offset (victim, nb);
          bin->bk = bck;
          bck->fd = bin;

          if (av != &main_arena)
	    set_non_main_arena (victim);
          check_malloced_chunk (av, victim, nb);

这里的处理也比较简单，如果有 bestfit，也就是链表头不指向自己，取出头节点，检查链表完整性，即 victim->bk->fd == victim，合法则给 chunk 的后一个 chunk 的 prev_inuse 位置 1，然后解链。

#if USE_TCACHE
	  /* While we're here, if we see other chunks of the same size,
	     stash them in the tcache.  */
	  size_t tc_idx = csize2tidx (nb);
	  if (tcache && tc_idx < mp_.tcache_bins)
	    {
	      mchunkptr tc_victim;

	      /* While bin not empty and tcache not full, copy chunks over.  */
	      while (tcache->counts[tc_idx] < mp_.tcache_count
		     && (tc_victim = last (bin)) != bin)
		{
		  if (tc_victim != 0)
		    {
		      bck = tc_victim->bk;
		      set_inuse_bit_at_offset (tc_victim, nb);
		      if (av != &main_arena)
			set_non_main_arena (tc_victim);
		      bin->bk = bck;
		      bck->fd = bin;

		      tcache_put (tc_victim, tc_idx);
	            }
		}
	    }
#endif

类似于 fastbin，smallbin 也有类似的 stash 机制，从 smallbin 中一个个取出，同时一个个放到 tcache 中，直到 smallbin 空或 tcache 满，注意这里没有对链表的完整性进行检查，通过进行恰当的伪造，可以实现地址分配或者类似于 unsorted bin attack 的效果，也就是 Tcache stash unlink attack 系列攻击。

void *p = chunk2mem (victim);
          alloc_perturb (p, bytes);
          return p;
        }
    }

完成 stash 后，返回 chunk，没什么可说的。

malloc_consolidate

到了这里说明前面的 chunk 中都不存在 bestfit，为了避免内存碎片过多的情况，会使用 malloc_consolidate 取出并所有的 fastbin，并对每一个 fastbin chunk，尝试进行前后向合并。

/*
     If this is a large request, consolidate fastbins before continuing.
     While it might look excessive to kill all fastbins before
     even seeing if there is space available, this avoids
     fragmentation problems normally associated with fastbins.
     Also, in practice, programs tend to have runs of either small or
     large requests, but less often mixtures, so consolidation is not
     invoked all that often in most programs. And the programs that
     it is called frequently in otherwise tend to fragment.
   */

  else
    {
      idx = largebin_index (nb);
      if (atomic_load_relaxed (&av->have_fastchunks))
        malloc_consolidate (av);
    }

注意这里的 else，只有在申请是 large request 的时候才会调用！

进入函数后，首先定义了并初始化了一些变量

static void malloc_consolidate(mstate av)
{
  mfastbinptr*    fb;                 /* current fastbin being consolidated */
  mfastbinptr*    maxfb;              /* last fastbin (for loop control) */
  mchunkptr       p;                  /* current chunk being consolidated */
  mchunkptr       nextp;              /* next chunk to consolidate */
  mchunkptr       unsorted_bin;       /* bin header */
  mchunkptr       first_unsorted;     /* chunk to link to */

  /* These have same use as in free() */
  mchunkptr       nextchunk;
  INTERNAL_SIZE_T size;
  INTERNAL_SIZE_T nextsize;
  INTERNAL_SIZE_T prevsize;
  int             nextinuse;

  atomic_store_relaxed (&av->have_fastchunks, false);

  unsorted_bin = unsorted_chunks(av);
  /*
    Remove each chunk from fast bin and consolidate it, placing it
    then in unsorted bin. Among other reasons for doing this,
    placing in unsorted bin avoids needing to calculate actual bins
    until malloc is sure that chunks aren't immediately going to be
    reused anyway.
  */

  maxfb = &fastbin (av, NFASTBINS - 1);
  fb = &fastbin (av, 0);

正式进入流程，虽然比较长，但是还是比较好理解的

do {
    p = atomic_exchange_acq (fb, NULL);
    if (p != 0) {
      do {
	{
	  if (__glibc_unlikely (misaligned_chunk (p)))
	    malloc_printerr ("malloc_consolidate(): "
			     "unaligned fastbin chunk detected");

	  unsigned int idx = fastbin_index (chunksize (p));
	  if ((&fastbin (av, idx)) != fb)
	    malloc_printerr ("malloc_consolidate(): invalid chunk size");
	}

	check_inuse_chunk(av, p);
	nextp = REVEAL_PTR (p->fd);

	/* Slightly streamlined version of consolidation code in free() */
	size = chunksize (p);
	nextchunk = chunk_at_offset(p, size);
	nextsize = chunksize(nextchunk);

	if (!prev_inuse(p)) {
	  prevsize = prev_size (p);
	  size += prevsize;
	  p = chunk_at_offset(p, -((long) prevsize));
	  if (__glibc_unlikely (chunksize(p) != prevsize))
	    malloc_printerr ("corrupted size vs. prev_size in fastbins");
	  unlink_chunk (av, p);
	}

	if (nextchunk != av->top) {
	  nextinuse = inuse_bit_at_offset(nextchunk, nextsize);

	  if (!nextinuse) {
	    size += nextsize;
	    unlink_chunk (av, nextchunk);
	  } else
	    clear_inuse_bit_at_offset(nextchunk, 0);

	  first_unsorted = unsorted_bin->fd;
	  unsorted_bin->fd = p;
	  first_unsorted->bk = p;

	  if (!in_smallbin_range (size)) {
	    p->fd_nextsize = NULL;
	    p->bk_nextsize = NULL;
	  }

	  set_head(p, size | PREV_INUSE);
	  p->bk = unsorted_bin;
	  p->fd = first_unsorted;
	  set_foot(p, size);
	}

	else {
	  size += nextsize;
	  set_head(p, size | PREV_INUSE);
	  av->top = p;
	}

      } while ( (p = nextp) != 0);

    }
  } while (fb++ != maxfb);

最外层循环的结束条件是 while (fb++ != maxfb);，也就是内层每一轮循环完成对一个隔离链表的解链和合并操作。

对于每一个隔离链表，首先取出头节点并对分配区中对应的指针置空，注意这里的 p 初始化后，指向要被操作的 fastbin，之后会进行多次更新，实际上指向的是合并完成的 chunk。

/* Store NEWVALUE in *MEM and return the old value.  */
#ifndef atomic_exchange_acq
# define atomic_exchange_acq(mem, newvalue) \

p = atomic_exchange_acq (fb, NULL);

如果 p 不为 NULL，也就是该链表中有 chunk，那么需要对该链表进行操作，使用了一个 do..while 循环对链表进行遍历，每一次遍历首先进行当前 chunk 进行合法性检查，检查了 chunk 是否对齐和 size 字段的合法性，实现如下

if (p != 0) {
      do {
	{
	  if (__glibc_unlikely (misaligned_chunk (p)))
	    malloc_printerr ("malloc_consolidate(): "
			     "unaligned fastbin chunk detected");

	  unsigned int idx = fastbin_index (chunksize (p));
	  if ((&fastbin (av, idx)) != fb)
	    malloc_printerr ("malloc_consolidate(): invalid chunk size");
	}

注意这里用两个花括号将合法性检查扩了起来。

check_inuse_chunk(av, p); // debug
	nextp = REVEAL_PTR (p->fd);

	/* Slightly streamlined version of consolidation code in free() */
	size = chunksize (p);
	nextchunk = chunk_at_offset(p, size);
	nextsize = chunksize(nextchunk);

通过合法性检查后，首先初始化一些局部变量，nextp 是循环控制变量，这里提前置为 fastbin 中的下一个 chunk，size 被初始化为当前 chunk 的大小，之后会更新，维护合并完的 chunk 的大小。

首先考虑前向合并

if (!prev_inuse(p)) {
	  prevsize = prev_size (p);
	  size += prevsize;
	  p = chunk_at_offset(p, -((long) prevsize));
	  if (__glibc_unlikely (chunksize(p) != prevsize))
	    malloc_printerr ("corrupted size vs. prev_size in fastbins");
	  unlink_chunk (av, p);
	}

如果前一个 chunk 不被使用，则可以前向合并，更新 size 和 p，将前一个 chunk unlink。

然后进行后向合并，如果后一个 chunk 是 top_chunk，那么直接和 top_chunk 合并，如果不是，首先检查后一个 chunk 是否在使用中，若不在使用，则进行后向合并并插入 unsorted bin 中（插到链表头和第一个节点中间），如果在使用，则修改其 prev_inuse 位。具体实现如下。

if (nextchunk != av->top) {
	  nextinuse = inuse_bit_at_offset(nextchunk, nextsize);

	  if (!nextinuse) {
	    size += nextsize;
	    unlink_chunk (av, nextchunk);
	  } else
	    clear_inuse_bit_at_offset(nextchunk, 0);

	  first_unsorted = unsorted_bin->fd;
	  unsorted_bin->fd = p;
	  first_unsorted->bk = p;

	  if (!in_smallbin_range (size)) {
	    p->fd_nextsize = NULL;
	    p->bk_nextsize = NULL;
	  }

	  set_head(p, size | PREV_INUSE);
	  p->bk = unsorted_bin;
	  p->fd = first_unsorted;
	  set_foot(p, size);
	}

	else {
	  size += nextsize;
	  set_head(p, size | PREV_INUSE);
	  av->top = p;
	}

完成前后向合并后对该 fastbin 的操作就结束了，对 fastbin 中的每一个 chunk 都做同样的操作后函数流程结束。

遍历和处理 unsorted bin

完成 malloc_consolidate 之后，内部碎片已在一定程度上受到了整理，并且存入了 unsorted bin，就可以尝试从 unsorted bin 中进行分配了。

粗略的说，该流程会遍历每一个 unsorted bin chunk，对于每一个 chunk 的处理只有三种情况

unsorted bin 的大小大于申请的大小且满足一些严苛的条件时，切割后返回给用户，结束流程
unsorted bin 的大小等于申请的大小
unsorted bin 的大小小于申请的大小，将该 chunk 从 unsorted bin 中取出放入对应的 bin 中

每个 chunk 都会被取出，所以 while 循环可以每一次都取链表的尾部。

遍历开头的合法性检查

对 unsorted bin 的遍历是反向的，结束条件为链表中仅有一个节点。需要注意在 2.29 之后，在对每一个 unsorted bin 处理前都加入了一些合法性检查，主要是 chunk 的大小、是否和下一个 chunk 邻接和对 fd 指针的检查（这导致不少 unlink 攻击方法变得困难许多）

/*
     Process recently freed or remaindered chunks, taking one only if
     it is exact fit, or, if this a small request, the chunk is remainder from
     the most recent non-exact fit.  Place other traversed chunks in
     bins.  Note that this step is the only place in any routine where
     chunks are placed in bins.

     The outer loop here is needed because we might not realize until
     near the end of malloc that we should have consolidated, so must
     do so and retry. This happens at most once, and only when we would
     otherwise need to expand memory to service a "small" request.
   */

#if USE_TCACHE
  INTERNAL_SIZE_T tcache_nb = 0;
  size_t tc_idx = csize2tidx (nb);
  if (tcache && tc_idx < mp_.tcache_bins)
    tcache_nb = nb;
  int return_cached = 0;

  tcache_unsorted_count = 0;
#endif

  for (;; )
    {
      int iters = 0;
      while ((victim = unsorted_chunks (av)->bk) != unsorted_chunks (av))
        {
          bck = victim->bk;
          size = chunksize (victim);
          mchunkptr next = chunk_at_offset (victim, size);

          if (__glibc_unlikely (size <= CHUNK_HDR_SZ)
              || __glibc_unlikely (size > av->system_mem))
            malloc_printerr ("malloc(): invalid size (unsorted)");
          if (__glibc_unlikely (chunksize_nomask (next) < CHUNK_HDR_SZ)
              || __glibc_unlikely (chunksize_nomask (next) > av->system_mem))
            malloc_printerr ("malloc(): invalid next size (unsorted)");
          if (__glibc_unlikely ((prev_size (next) & ~(SIZE_BITS)) != size))
            malloc_printerr ("malloc(): mismatching next->prev_size (unsorted)");
          if (__glibc_unlikely (bck->fd != victim)
              || __glibc_unlikely (victim->fd != unsorted_chunks (av)))
            malloc_printerr ("malloc(): unsorted double linked list corrupted");
          if (__glibc_unlikely (prev_inuse (next)))
            malloc_printerr ("malloc(): invalid next->prev_inuse (unsorted)");

这里在 #if USE_TCACHE 宏中的处理是为了后面的处理做准备的，此处按下不表

切割分配

/*
             If a small request, try to use last remainder if it is the
             only chunk in unsorted bin.  This helps promote locality for
             runs of consecutive small requests. This is the only
             exception to best-fit, and applies only when there is
             no exact fit for a small chunk.
           */

          if (in_smallbin_range (nb) &&
              bck == unsorted_chunks (av) &&
              victim == av->last_remainder &&
              (unsigned long) (size) > (unsigned long) (nb + MINSIZE))

结合注释，可知这里的 if 是处理从 unsorted bin 中分配 small chunk 的情况。

仅当满足如下的条件：

申请为 small request
unsorted bin 中仅有一个 chunk
unsorted bin 中仅有的 chunk 是 last remainder chunk
申请的 chunk 的大小加上 MINSIZE 要小于 unsorted bin 中仅有的 chunk 的大小
（一个隐含的条件）无法从 small bin 中分配

的时候才会切割 unsorted bin 的 chunk，返回给用户。从注释中可以看出，这是分配器对 exact-fit 原则的一次例外，这种严苛的处理方式有利于提高分配器的局部性（locality）。

分配的实现方法为

{
              /* split and reattach remainder */
              remainder_size = size - nb;
              remainder = chunk_at_offset (victim, nb);
              unsorted_chunks (av)->bk = unsorted_chunks (av)->fd = remainder;
              av->last_remainder = remainder;
              remainder->bk = remainder->fd = unsorted_chunks (av);
              if (!in_smallbin_range (remainder_size))
                {
                  remainder->fd_nextsize = NULL;
                  remainder->bk_nextsize = NULL;
                }

              set_head (victim, nb | PREV_INUSE |
                        (av != &main_arena ? NON_MAIN_ARENA : 0));
              set_head (remainder, remainder_size | PREV_INUSE);
              set_foot (remainder, remainder_size);

              check_malloced_chunk (av, victim, nb); // 仅 DEBUG 时启用
              void *p = chunk2mem (victim);
              alloc_perturb (p, bytes);
              return p;
            }

首先更新 remainder 相关的参数，并设置 unsorted bin 指向的 chunk 为切割后的 chunk。根据切割后的 chunk 的大小决定是否初始化 large bin chunk 所需使用的指针。然后通过 set_head 和 set_foot 设置 chunk 的各字段，完成后返回给用户，结束 _int_malloc。

如果不满足切割分配的严苛条件，那么该 unsorted bin 要么进入对应的 bin 中，要么满足 exact-fit 可以直接返回，无论如何都会离开 unsorted bin，所以先从 unsorted bin 中解链，注意这里对 bck->fd 做了检查（链表完整性检查），因此 unsorted bin attack 失效。

/* bck = victim->fd */

/* remove from unsorted list */
          if (__glibc_unlikely (bck->fd != victim))
            malloc_printerr ("malloc(): corrupted unsorted chunks 3");
          unsorted_chunks (av)->bk = bck;
          bck->fd = unsorted_chunks (av);

exact-fit

在 2.26（不包括）版本前，若当前遍历到的 unsorted bin 的大小和申请的内部大小恰好相等（exact-fit），就直接分配给用户。2.26 之后 tcache 加入后，即使是 exact-fit 也不直接返回，而是先缓存到 tcache 中（除非 tcache 已被填满）。并置 return_cached 变量为 1，该变量在进入 unsorted bin 遍历前定义。

/* Take now instead of binning if exact fit */

          if (size == nb)
            {
              set_inuse_bit_at_offset (victim, size);
              if (av != &main_arena)
		set_non_main_arena (victim);
#if USE_TCACHE
	      /* Fill cache first, return to user only if cache fills.
		 We may return one of these chunks later.  */
	      if (tcache_nb
		  && tcache->counts[tc_idx] < mp_.tcache_count)
		{
		  tcache_put (victim, tc_idx);
		  return_cached = 1;
		  continue;
		}
	      else
		{
#endif
              check_malloced_chunk (av, victim, nb);
              void *p = chunk2mem (victim);
              alloc_perturb (p, bytes);
              return p;
#if USE_TCACHE
		}
#endif
            }

放入对应的 bin 中

Small Bin

如果该 chunk 不满足前两个分支的条件，该 chunk 就会被放入对应的 bin 中。如果 chunk 属于 small bin，那么处理起来十分容易

          /* place chunk in bin */

          if (in_smallbin_range (size))
            {
              victim_index = smallbin_index (size);
              bck = bin_at (av, victim_index);
              fwd = bck->fd;
            }

首先算出目标的大小所在的 smallbin_index，然后将该 index 的表头所在地址赋值给 bck，把链表尾赋值给 fwd，在之后会把该 chunk 插入到两者之间

Large Bin

如果属于 large bin，处理就会变得较为复杂

          else
            {
              victim_index = largebin_index (size);
              bck = bin_at (av, victim_index);
              fwd = bck->fd;

一开始也和 small bin 一样，首先获得 largebin_index，然后将该 index 的表头所在地址赋值给 bck，把链表尾赋值给 fwd，也就是“插入”（这里的插入之所以打引号，是因为实际上没有插入，只是在流程结束后会通过 fwd 和 bck 来进行插入）到链表头上。但是这还没有结束，之后还需要进行对插入位置的调整。

              /* maintain large bins in sorted order */
              if (fwd != bck)
                {

首先判断 fwd 和 bck 是否相等，由于 fwd = bck->fd，bck->fd 指向 bin 中的第一个 chunk，如果 fwd 和 bck 相等，说明该链表是为空的，不需要做额外处理。

如果两者不相等，说明该链表非空。由于 largebin 的每一个隔离链表中的 chunk 大小不一定相同，而且链表是有序的，插入 chunk 后也需要保证链表的有序性。之后的流程就完成了这个工作，实现如下

                  /* Or with inuse bit to speed comparisons */
                  size |= PREV_INUSE;
                  /* if smaller than smallest, bypass loop below */
                  assert (chunk_main_arena (bck->bk));

首先是预处理，先给 size 或上一个 PREV_INUSE 位，注释里说这是为了加快比较，估计是因为每个 largebin chunk 的 PREV_INUSE 位都为 1，这里提前或上后面比较的时候就不需要或了。

然后还会做一个断言，断言该链表尾的 chunk 必为主分配区 chunk。这个我倒是没看太懂，引用大佬的话就是

因为所有在 large bin 中的 chunk 都处于空闲状态，该标志位一定是清零的。

还不是特别理解。

                  if ((unsigned long) (size)
		      < (unsigned long) chunksize_nomask (bck->bk))
                    {
                      fwd = bck;
                      bck = bck->bk;

                      victim->fd_nextsize = fwd->fd;
                      victim->bk_nextsize = fwd->fd->bk_nextsize;
                      fwd->fd->bk_nextsize = victim->bk_nextsize->fd_nextsize = victim;
                    }

预处理完可以正式进行处理，首先特判链表尾的 chunk 的大小是否大于当前 chunk 的大小，如果是，直接更新 fwd 和 bck，也就是准备将当前 chunk 插入到链表尾后，由此可见 largebin 中的 chunk 是从大到小排序的。这里还会更新 nextsize 的两个指针，可以看出每个 largebin 存在于两个双链表中，一个是 largebin 隔离链表，由 fd bk 指针维护；另一个是 skip 链表，由 fd_nextsize 和 bk_nextsize 维护。

skip 链表

这里特别说一下 skip 链表存在的意义和其结构，largebin 由于大小不固定，寻找 fit 的 chunk 必然是远远慢于别的 bin 的，但是 largebin 的特性就是比较大，所以这里复用更多的空间，增加一对指针来维护 skip 链表以加快遍历速度。该链表的结构有星阑科技的大佬画了一张示意图，很形象，这里不要脸的盗过来（图中只绘出了 fd(_nextsize) 指针，bk(_nextsize) 指针就是反着指回去）

虽然每个隔离链表中的 chunk 都是不一样的，但是他们是有序的，所以如果存在大小相同的 chunk，一个个遍历过去是十分浪费时间的。从隔离链表中每个相同大小的 chunk的取出第一个 chunk 按照大小顺序用 fd_nextsize 和 bk_nextsize 按大小顺序链接在一起，让每一个 fd_nextsize 指针指向下一个大小的 chunk，bk_nextsize 则指向上一个大小的 chunk，这样在遍历时可以通过这两个指针跨越大小相同的 chunk，在一定程度上节省了遍历的时间。

回到刚才对当前 chunk 的 fd_nextsize 和 bk_nextsize 的更新上，由于当前 chunk 的大小小于原链表表尾 chunk 的大小，所以需要设置两个指针。

else
                    {
                      assert (chunk_main_arena (fwd));
                      while ((unsigned long) size < chunksize_nomask (fwd))
                      //#define chunksize_nomask(p)         ((p)->mchunk_size)
                        {
                          fwd = fwd->fd_nextsize;
			  assert (chunk_main_arena (fwd));
                        }

如果不满足上述条件，就需要遍历整个隔离链表了，使用 fd_nextsize 进行正向遍历（从大到小），直到找第一个小于等于当前 chunk 大小的 chunk 为止。注意这里进行了多次断言，和前文所述的相同。

if ((unsigned long) size
			  == (unsigned long) chunksize_nomask (fwd))
                        /* Always insert in the second position.  */
                        fwd = fwd->fd;

退出循环后对找到的第一个满足条件的 chunk 的大小做检测，如果当前 chunk 和满足条件的 chunk 的大小相同，那么把当前 chunk “插入”到满足条件的 chunk 后，这样就不需要再更新两个 next_size 指针了。

else
                        {
                          victim->fd_nextsize = fwd;
                          victim->bk_nextsize = fwd->bk_nextsize;
                          // 以下的检测是在 2.30（包括）后加入的
                          if (__glibc_unlikely (fwd->bk_nextsize->fd_nextsize != fwd))
                            malloc_printerr ("malloc(): largebin double linked list corrupted (nextsize)");
                          fwd->bk_nextsize = victim;
                          victim->bk_nextsize->fd_nextsize = victim;
                        }

否则由于当前 chunk 的大小大于满足条件的 chunk，所以把当前 chunk 插到该 chunk 之前，并更新二者的 nextsize 指针。这里做了链表完整性检测。

bck = fwd->bk;
                      if (bck->fd != fwd)
                        malloc_printerr ("malloc(): largebin double linked list corrupted (bk)");
                    }
                }

最后更新 bck 并做链表完整性检测。结束 if (fwd != bck) 也就是隔离链表中有 chunk 的情况的处理流程。

else
                victim->fd_nextsize = victim->bk_nextsize = victim;
            }

如果隔离链表中没有 chunk，也有必要设置两个 nextsize 指针，这样才能在之后向该链表中插入 chunk 时不出错。

//#define mark_bin(m, i)    ((m)->binmap[idx2block (i)] |= idx2bit (i))
          mark_bin (av, victim_index);
          victim->bk = bck;
          victim->fd = fwd;
          fwd->bk = victim;
          bck->fd = victim;

这一段代码是在

/* place chunk in bin */

          if (in_smallbin_range (size))
	    ...
	  else
	    ...

流程结束时进行的，这里真正将当前 chunk 插入到了 largebin 中，并将 large bin 所对应的 binmap 的相应 bit 置位。

到这里就正式结束了放入对应 bin 中的操作。

一轮操作结束的收尾

前面的一同操作只处理了一个 unsorted bin 哦，处理完一个后需要做一些收尾

#if USE_TCACHE
      /* If we've processed as many chunks as we're allowed while
	 filling the cache, return one of the cached ones.  */
      ++tcache_unsorted_count;
      if (return_cached
	  && mp_.tcache_unsorted_limit > 0
	  && tcache_unsorted_count > mp_.tcache_unsorted_limit)
	{
	  return tcache_get (tc_idx);
	}
#endif

这里是为了防止由于向 tcache 中 stash chunk 导致明明有的 exact-fit 被雪藏于 tache 中，当 stash 进的 tcache_unsorted_count > mp_.tcache_unsorted_limit 时且有 exact-fit 被 stash 进 tcache 中时，会直接调用 tcache_get 从 tcache 中返回。（通过调试，在笔者的机器上，这里的 mp_.tcache_unsorted_limit 值为 0，即默认是不会设置 cache 的上限的。）

#define MAX_ITERS       10000
          if (++iters >= MAX_ITERS)
            break;
        }

这里的 iters 是在进入遍历操作前定义的，记录了 unsorted bin 处理的总轮数，当总轮数过大即大于 10000 轮时直接结束对 unsorted bin 的遍历，避免长时间处理影响速度。

之后就会循环下去做下一轮操作直到满足退出条件。

#if USE_TCACHE
      /* If all the small chunks we found ended up cached, return one now.  */
      if (return_cached)
	{
	  return tcache_get (tc_idx);
	}
#endif

如果结束 unsorted bin 遍历时有在 tcache 中 cache 过chunk，那就直接返回 cache 的 chunk。

largebin

如果执行到这里说明 unsorted bin 中的空闲 chunk 已经进入了对应的 bin 中了，并且没有合适的可以返回给用户的 chunk。如果申请是 large request，就可以尝试从 largebin 中分配了。

       /*
         If a large request, scan through the chunks of current bin in
         sorted order to find smallest that fits.  Use the skip list for this.
       */

      if (!in_smallbin_range (nb))
        {
          bin = bin_at (av, idx);

          /* skip scan if empty or largest chunk is too small */
          if ((victim = first (bin)) != bin
	      && (unsigned long) chunksize_nomask (victim)
	        >= (unsigned long) (nb))
            {

如果申请的大小在 largebin 的范围中，查询该大小所对应链表，如果该链表为空或链表中第一个（也就是最大的一个）chunk 的大小小于申请的大小，说明无法从 largebin 中分配。否则对链表进行遍历，尝试从链表中找到合适的 chunk 分配

              victim = victim->bk_nextsize;
              while (((unsigned long) (size = chunksize (victim)) <
                      (unsigned long) (nb)))
                victim = victim->bk_nextsize;

遍历为反向遍历，即从小到大遍历，直到找到第一个大于等于申请的 chunk 的大小的 chunk 为止。

              /* Avoid removing the first entry for a size so that the skip
                 list does not have to be rerouted.  */
              if (victim != last (bin)
		  && chunksize_nomask (victim)
		    == chunksize_nomask (victim->fd))
                victim = victim->fd;

best-fit 分配

找到最合适的 chunk 后，该 chunk 之后的一个 chunk 的大小是否和该 chunk 相等，如果相等，就把候选的将返回的 chunk 设置为该 chunk 的下一个 chunk，这样可以避免重新设置 skip 链表。

              remainder_size = size - nb;
              unlink_chunk (av, victim);

然后记录切割后的 chunk 余留下的 chunk 的大小（注意这时还没有切割），并把当前的候选 chunk 从链表中 unlink。

              /* Exhaust */
              if (remainder_size < MINSIZE)
                {
                  set_inuse_bit_at_offset (victim, size);
                  if (av != &main_arena)
		    set_non_main_arena (victim);
                }

然后根据余留的 chunk 的大小分类处理，如果余留的 chunk 大小小于 MINSIZE，那么就浪费掉这段空间，不进行切割，在将下一个 chunk 的 prev_inuse 位置 1 并根据当前的分配区设置好 chunk 的 non_main_arena 位后就可以准备一起返回给用户了。

              /* Split */
              else
                {
                  remainder = chunk_at_offset (victim, nb);
                  /* We cannot assume the unsorted list is empty and therefore
                     have to perform a complete insert here.  */
                  bck = unsorted_chunks (av);
                  fwd = bck->fd;
		  if (__glibc_unlikely (fwd->bk != bck))
		    malloc_printerr ("malloc(): corrupted unsorted chunks");
                  remainder->bk = bck;
                  remainder->fd = fwd;
                  bck->fd = remainder;
                  fwd->bk = remainder;
                  if (!in_smallbin_range (remainder_size))
                    {
                      remainder->fd_nextsize = NULL;
                      remainder->bk_nextsize = NULL;
                    }
                  set_head (victim, nb | PREV_INUSE |
                            (av != &main_arena ? NON_MAIN_ARENA : 0));
                  set_head (remainder, remainder_size | PREV_INUSE);
                  set_foot (remainder, remainder_size);
                }

如果切割后的 chunk 的大小可以自成一个 chunk，那么就在切割后分配，注意这是我们第二次碰到 ptmalloc 的非 exact-fit 分配。由于此时 unsorted bin 可能是非空的，所以需要做一次完全的插入，插入到 unsorted bin 的链表头上。同时如果剩下的 chunk 仍然是 largebin chunk，那么就需要将 skip 链表相关的指针置 NULL。然后通过 set_head 和 set_foot 进行切割。

              check_malloced_chunk (av, victim, nb);
              void *p = chunk2mem (victim);
              alloc_perturb (p, bytes);
              return p;
            }
        }

执行到这里，就是从 largebin list 中得到了 chunk，返回给用户，结束流程。

基于 binmap 的再一次扫描

如果之前的分配都失败了，即 exact-fit（smallbin）和 best-fit（largebin）的隔离链表都无法提供 chunk，分配器并不会死心，它仍然会尝试从对应的 size 更大的隔离链表中找最优的 chunk 来分配。注意这里说的最优不是

      /*
         Search for a chunk by scanning bins, starting with next largest
         bin. This search is strictly by best-fit; i.e., the smallest
         (with ties going to approximately the least recently used) chunk
         that fits is selected.

         The bitmap avoids needing to check that most blocks are nonempty.
         The particular case of skipping all bins during warm-up phases
         when no chunks have been returned yet is faster than it might look.
       */

      ++idx;
      bin = bin_at (av, idx);
      block = idx2block (idx);
      map = av->binmap[block];
      bit = idx2bit (idx);

首先增加 idx，然后获取 idx 对应的链表表头，并获取 idx 对应的 binmap 的 block 的下标和 bit 所处的位置，同时将整个 block 赋值给 map。

      for (;; )
        {
          /* Skip rest of block if there are no more set bits in this block.  */
          if (bit > map || bit == 0)
            {
              do
                {
                  if (++block >= BINMAPSIZE) /* out of bins */
                    goto use_top;
                }
              while ((map = av->binmap[block]) == 0);

              bin = bin_at (av, (block << BINMAPSHIFT));
              bit = 1;
            }

然后进行处理，如果 bit > map || bit == 0，说明整个 block 中都没有大于申请的大小 bin 中有 chunk。这个时候尝试从下一个 block 中寻找 bin，这里使用 while 循环找到第一个不为 0 的 block，如果找到了这个 block 中必然有可以分配的隔离链表不为空，这个时候结束循环准备进行切割分配；如果扫完了所有的 block 都没有可以分配的链表，直接 goto 到 use_top 上去，准备使用 top_chunk 分配（这里可以看出 goto 也不是永远不能用的，有些时候还是有使用的意义的——如此处的跳出多重循环）。

          /* Advance to bin with set bit. There must be one. */
          while ((bit & map) == 0)
            {
              bin = next_bin (bin);
              bit <<= 1;
              assert (bit != 0);
            }

执行到这里说明有一个链表中有 best-fit，这里通过循环找到一个不为空的、best-fit 的链表，这里有一个断言，代表 bit 不可能溢出，即一定能找到一个 best-fit 的链表。当 (bit & map) == 1 时退出循环，表示找到了需要的链表。

          /* Inspect the bin. It is likely to be non-empty */
          victim = last (bin);

将链表中的最后一个 chunk（链表中最小的 chunk）作为候选 chunk。

          /*  If a false alarm (empty bin), clear the bit. */
          if (victim == bin)
            {
              av->binmap[block] = map &= ~bit; /* Write through */
              bin = next_bin (bin);
              bit <<= 1;
            }

然后检测找到的链表是否真的不是空的。由于 binmap 的更新不是立刻的，所以可能会出现 binmap 不准确的情况，这时就需要更新 binmap，即将对应的 bit 更新，获取当前 bin 的下一个 bin，并将 bit 移到高一位，即乘二。

如果链表的却非空，那么就可以进行 best-fit 分配，这里的实现和之前 largebin 中的 best-fit 分配是一致的。如下

          else
            {
              size = chunksize (victim);

              /*  We know the first chunk in this bin is big enough to use. */
              assert ((unsigned long) (size) >= (unsigned long) (nb));

              remainder_size = size - nb;

              /* unlink */
              unlink_chunk (av, victim);

              /* Exhaust */
              if (remainder_size < MINSIZE)
                {
                  set_inuse_bit_at_offset (victim, size);
                  if (av != &main_arena)
		    set_non_main_arena (victim);
                }

              /* Split */
              else
                {
                  remainder = chunk_at_offset (victim, nb);

                  /* We cannot assume the unsorted list is empty and therefore
                     have to perform a complete insert here.  */
                  bck = unsorted_chunks (av);
                  fwd = bck->fd;
		  if (__glibc_unlikely (fwd->bk != bck))
		    malloc_printerr ("malloc(): corrupted unsorted chunks 2");
                  remainder->bk = bck;
                  remainder->fd = fwd;
                  bck->fd = remainder;
                  fwd->bk = remainder;

                  /* advertise as last remainder */
                  if (in_smallbin_range (nb))
                    av->last_remainder = remainder;
                  if (!in_smallbin_range (remainder_size))
                    {
                      remainder->fd_nextsize = NULL;
                      remainder->bk_nextsize = NULL;
                    }
                  set_head (victim, nb | PREV_INUSE |
                            (av != &main_arena ? NON_MAIN_ARENA : 0));
                  set_head (remainder, remainder_size | PREV_INUSE);
                  set_foot (remainder, remainder_size);
                }
              check_malloced_chunk (av, victim, nb);
              void *p = chunk2mem (victim);
              alloc_perturb (p, bytes);
              return p;
            }

这里就不再赘述了。

binmap 处理的外层套了一个 for (;; ) 的大循环，这是为了处理 false alarm 即 binmap 出错的而设置的，出错后将 bit 左移，重新进行一次搜索分配，直到完成分配或者 goto 到 use_top 上。这里会对 binmap 进行一定的纠错。

use_top

终于到了使用 top_chunk 分配了，这也代表 _int_malloc 的尾声。当各个隔离链表中都无法提供合适的 chunk，或者申请的空间非常大，就会尝试从 top_chunk 中切割分配。

    use_top:
      /*
         If large enough, split off the chunk bordering the end of memory
         (held in av->top). Note that this is in accord with the best-fit
         search rule.  In effect, av->top is treated as larger (and thus
         less well fitting) than any other available chunk since it can
         be extended to be as large as necessary (up to system
         limitations).

         We require that av->top always exists (i.e., has size >=
         MINSIZE) after initialization, so if it would otherwise be
         exhausted by current request, it is replenished. (The main
         reason for ensuring it exists is that we may need MINSIZE space
         to put in fenceposts in sysmalloc.)
       */

      victim = av->top;
      size = chunksize (victim);

      if (__glibc_unlikely (size > av->system_mem))
        malloc_printerr ("malloc(): corrupted top size");

首先把当前分配区的 top_chunk 的地址赋值给 victim，设之为候选 chunk。记录 top_chunk 的大小，并检测其合法性，及 top_chunk 的大小不得超过 av->system_mem（经过调试，在我的电脑上这个值是 0x21000），该检测的存在使 house of force 攻击在 libc 2.29 之后失效。

切割分配

      if ((unsigned long) (size) >= (unsigned long) (nb + MINSIZE))
        {
          remainder_size = size - nb;
// /* Treat space at ptr + offset as a chunk */
// #define chunk_at_offset(p, s)  ((mchunkptr) (((char *) (p)) + (s)))
          remainder = chunk_at_offset (victim, nb);
          av->top = remainder;
          set_head (victim, nb | PREV_INUSE |
                    (av != &main_arena ? NON_MAIN_ARENA : 0));
          set_head (remainder, remainder_size | PREV_INUSE);

          check_malloced_chunk (av, victim, nb);
          void *p = chunk2mem (victim);
          alloc_perturb (p, bytes);
          return p;
        }

如果 top_chunk 在被切割后仍然可以存留大于 MINSIZE 的大小，那么就进行切割分配。保留的不小于 MINSIZE 大小的空间是做 fencepost 的。分配的流程比较简单，把新的 top_chunk 设置为 remainder 然后设置将要返回的 chunk 和新的 top_chunk 的 size 字段。由于 top_chunk 的尾部仍然做 fencepost，所以这里不需要 set_foot。然后就可以返回给用户了。

关于 fencepost：这似乎是用在 sub_heap 上的一个栅栏，用于各个分配区的区分，主分配区的 fencepost 段上都为 0。具体笔者也不是很了解

第二次 malloc_consolidate

      /* When we are using atomic ops to free fast chunks we can get
         here for all block sizes.  */
      else if (atomic_load_relaxed (&av->have_fastchunks))
        {
          malloc_consolidate (av);
          /* restore original bin index */
          if (in_smallbin_range (nb))
            idx = smallbin_index (nb);
          else
            idx = largebin_index (nb);
        }

然后在这里会再进行一次 malloc_consolidate。如果是 large request，虽然之前清理过了一次 fastbin，但由于 free fastbin 是不需要加锁的，所以在之前的处理过程中仍然可能会有别的线程 free fastbin 导致 fastbin 中又出现 chunk；如果是 small request，之前没有做过 malloc_consolidate，这里 fastbin 中很可能有非 exact-fit 的 chunk。

当分配区中有 fastchunks 时，调用 malloc_consolidate，取出所有的 fastchunk，更新 idx 后结束这个 if。还记得套在 unsorted bin 遍历前的大循环嘛

/*
     Process recently freed or remaindered chunks, taking one only if
     it is exact fit, or, if this a small request, the chunk is remainder from
     the most recent non-exact fit.  Place other traversed chunks in
     bins.  Note that this step is the only place in any routine where
     chunks are placed in bins.

     The outer loop here is needed because we might not realize until
     near the end of malloc that we should have consolidated, so must
     do so and retry. This happens at most once, and only when we would
     otherwise need to expand memory to service a "small" request.
   */

#if USE_TCACHE
  INTERNAL_SIZE_T tcache_nb = 0;
  size_t tc_idx = csize2tidx (nb);
  if (tcache && tc_idx < mp_.tcache_bins)
    tcache_nb = nb;
  int return_cached = 0;

  tcache_unsorted_count = 0;
#endif

  for (;; )
    {

结束这个 if 后就会触发这个循环，回到之前对 unsorted bin 的遍历上，正如注释所说，再一次 malloc_consolidate 后，unsorted bin 中又会增加 chunk，说不定这些 chunk 就可以满足 request 了，说不定就可以不用调用 sysmalloc 分配了（sysmalloc 需要进行系统调用来分配空间，会慢许多）。

其实在老版本中这里有区分是否开启 ATOMIC_FASTBINS 优化并做了不同处理，新版本把两者和到了一起

sysmalloc

执行到这里了，那是真的没办法了，必须使用 sysmalloc 来分配了。

      /*
         Otherwise, relay to handle system-dependent cases
       */
      else
        {
          void *p = sysmalloc (nb, av);
          if (p != NULL)
            alloc_perturb (p, bytes);
          return p;
        }
    }
}

使用 sysmalloc 申请成功后，返回给用户，_int_malloc 正式结束。

至此，malloc 分析结束。

free

free 的代码要比 malloc 短一点。

__libc_free

void
__libc_free (void *mem)
{
  mstate ar_ptr;
  mchunkptr p;                          /* chunk corresponding to mem */

  void (*hook) (void *, const void *)
    = atomic_forced_read (__free_hook);
  if (__builtin_expect (hook != NULL, 0))
    {
      (*hook)(mem, RETURN_ADDRESS (0));
      return;
    }

  if (mem == 0)                              /* free(0) has no effect */
    return;

#ifdef USE_MTAG
  /* Quickly check that the freed pointer matches the tag for the memory.
     This gives a useful double-free detection.  */
  *(volatile char *)mem;
#endif

  int err = errno;

  p = mem2chunk (mem);

  /* Mark the chunk as belonging to the library again.  */
  (void)TAG_REGION (chunk2rawmem (p), CHUNK_AVAILABLE_SIZE (p) - CHUNK_HDR_SZ);

  if (chunk_is_mmapped (p))                       /* release mmapped memory. */
    {
      /* See if the dynamic brk/mmap threshold needs adjusting.
	 Dumped fake mmapped chunks do not affect the threshold.  */
      if (!mp_.no_dyn_threshold
          && chunksize_nomask (p) > mp_.mmap_threshold
          && chunksize_nomask (p) <= DEFAULT_MMAP_THRESHOLD_MAX
	  && !DUMPED_MAIN_ARENA_CHUNK (p))
        {
          mp_.mmap_threshold = chunksize (p);
          mp_.trim_threshold = 2 * mp_.mmap_threshold;
          LIBC_PROBE (memory_mallopt_free_dyn_thresholds, 2,
                      mp_.mmap_threshold, mp_.trim_threshold);
        }
      munmap_chunk (p);
    }
  else
    {
      MAYBE_INIT_TCACHE ();

      ar_ptr = arena_for_chunk (p);
      _int_free (ar_ptr, p, 0);
    }

  __set_errno (err);
}
libc_hidden_def (__libc_free)

类似于 malloc，glibc 也把 __libc_free 作为 free 的别名，我们在调用 free 的时候实际会调用 __libc_free。

我们分段来看

void
__libc_free (void *mem)
{
  mstate ar_ptr;
  mchunkptr p;                          /* chunk corresponding to mem */

  void (*hook) (void *, const void *)
    = atomic_forced_read (__free_hook);
  if (__builtin_expect (hook != NULL, 0))
    {
      (*hook)(mem, RETURN_ADDRESS (0));
      return;
    }

  if (mem == 0)                              /* free(0) has no effect */
    return;

最开始先定义 ar_ptr 和 p 两变量，前者在未来指向当前分配区的 mstate，后者则在未来指向传入的 mem 对应的内部 chunk 地址。

程序一开始做两个特判

先用原子操作把 __free_hook 中的值读出来并判断值是否为 NULL，若不为 NULL 直接执行 __free_hook 指向的函数。__free_hook 不为零主要有两种情况，一是处理线程新建时的操作，另一种是使用用户提供的 free 函数。
判断被 free 的指针是否为 NULL，如果是直接返回。

两者都通过后继续之后的流程。

#ifdef USE_MTAG
  /* Quickly check that the freed pointer matches the tag for the memory.
     This gives a useful double-free detection.  */
  *(volatile char *)mem;
#endif

然后是 2.33 新增加的对 double-free 的检测，MTAG 这个东西，和下面的 TAG_REGION 是一套的，这个还不是很了解，以后再说。

然后是一些初试化

  int err = errno;

  p = mem2chunk (mem);

  /* Mark the chunk as belonging to the library again.  */
  (void)TAG_REGION (chunk2rawmem (p), CHUNK_AVAILABLE_SIZE (p) - CHUNK_HDR_SZ);

这里的 TAG_REGION 也是在 2.33 才加入的，是对 chunk 的染色操作（tagging,sometimes know as coloring），具体干啥的我也不太清楚，暂且放过。

  if (chunk_is_mmapped (p))                       /* release mmapped memory. */
    {
      /* See if the dynamic brk/mmap threshold needs adjusting.
	 Dumped fake mmapped chunks do not affect the threshold.  */
      if (!mp_.no_dyn_threshold
          && chunksize_nomask (p) > mp_.mmap_threshold
          && chunksize_nomask (p) <= DEFAULT_MMAP_THRESHOLD_MAX
	  && !DUMPED_MAIN_ARENA_CHUNK (p))
        {
          mp_.mmap_threshold = chunksize (p);
          mp_.trim_threshold = 2 * mp_.mmap_threshold;
          LIBC_PROBE (memory_mallopt_free_dyn_thresholds, 2,
                      mp_.mmap_threshold, mp_.trim_threshold);
        }
      munmap_chunk (p);
    }

然后是对 chunk 为 mmapped 的情况时的处理，由于我还没有研究过 sys_malloc，所以这里也暂时跳过。

  else
    {
      MAYBE_INIT_TCACHE ();

      ar_ptr = arena_for_chunk (p);
//#define heap_for_ptr(ptr) \
//  ((heap_info *) ((unsigned long) (ptr) & ~(HEAP_MAX_SIZE - 1)))
//#define arena_for_chunk(ptr) \
//  (chunk_main_arena (ptr) ? &main_arena : heap_for_ptr (ptr)->ar_ptr)
      _int_free (ar_ptr, p, 0);
    }
__set_errno (err);
}
libc_hidden_def (__libc_free)

然后是对 tcache 进行可能需要的初试化，即执行 MAYBE_INIT_TCACHE ()，这个在分析 malloc 是已经分析过了，这里不多说，总的就是分配了 tcache_perthread_struct 结构体然后清零。然后获取 chunk 所处的分配区，存储于 ar_ptr 中，调用 _int_free() 来进行 free 操作。

可见 __libc_free 仅处理了 mmaped 的内存，其余的都交给 _int_free 了。

_int_free

static void
_int_free (mstate av, mchunkptr p, int have_lock)
{
  INTERNAL_SIZE_T size;        /* its size */
  mfastbinptr *fb;             /* associated fastbin */
  mchunkptr nextchunk;         /* next contiguous chunk */
  INTERNAL_SIZE_T nextsize;    /* its size */
  int nextinuse;               /* true if nextchunk is used */
  INTERNAL_SIZE_T prevsize;    /* size of previous contiguous chunk */
  mchunkptr bck;               /* misc temp for linking */
  mchunkptr fwd;               /* misc temp for linking */

  size = chunksize (p);

  /* Little security check which won't hurt performance: the
     allocator never wrapps around at the end of the address space.
     Therefore we can exclude some size values which might appear
     here by accident or by "design" from some intruder.  */
  if (__builtin_expect ((uintptr_t) p > (uintptr_t) -size, 0)
      || __builtin_expect (misaligned_chunk (p), 0))
    malloc_printerr ("free(): invalid pointer");
  /* We know that each chunk is at least MINSIZE bytes in size or a
     multiple of MALLOC_ALIGNMENT.  */
  if (__glibc_unlikely (size < MINSIZE || !aligned_OK (size)))
    malloc_printerr ("free(): invalid size");

  check_inuse_chunk(av, p); // debug

进入函数首先对应了一些变量供之后的操作使用，然后进行两个轻量级的检测，都是检测 size 字段的合法性

free to tcache

然后进入 tcache 分支，首先取得当前 chunk 所对应的 tc_idx，然后仅在 tcache 非空且当前 chunk 的大小在 tcache 范围内才会进入处理（显然的）

#if USE_TCACHE
  {
    size_t tc_idx = csize2tidx (size);
    if (tcache != NULL && tc_idx < mp_.tcache_bins)
      {

	/* Check to see if it's already in the tcache.  */
	tcache_entry *e = (tcache_entry *) chunk2mem (p);

	/* This test succeeds on double free.  However, we don't 100%
	   trust it (it also matches random payload data at a 1 in
	   2^<size_t> chance), so verify it's not an unlikely
	   coincidence before aborting.  */
	if (__glibc_unlikely (e->key == tcache))
	  {
	    tcache_entry *tmp;
	    size_t cnt = 0;
	    LIBC_PROBE (memory_tcache_double_free, 2, e, tc_idx);
	    for (tmp = tcache->entries[tc_idx];
		 tmp;
		 tmp = REVEAL_PTR (tmp->next), ++cnt)
	      {
		if (cnt >= mp_.tcache_count)
		  malloc_printerr ("free(): too many chunks detected in tcache");
		if (__glibc_unlikely (!aligned_OK (tmp)))
		  malloc_printerr ("free(): unaligned chunk detected in tcache 2");
		if (tmp == e)
		  malloc_printerr ("free(): double free detected in tcache 2");
		/* If we get here, it was a coincidence.  We've wasted a
		   few cycles, but don't abort.  */
	      }
	  }

然后是 glibc 在 libc-2.27 之后加入的 tcache double free 的检测，首先取得当前 chunk 在 tcache 下的地址（即指向 payload），然后查看该 chunk 的 key 字段（即普通 chunk 的 bk 字段）是否和 tcache 的地址相同，如果相同，基本上可以认为是 double free 了，因为在 tcache_put（把 chunk 放到 tcache 中）中会把 tcache_chunk 的 key 字段写上 tcache 的地址，而 tcache_get（把 chunk 从 tcache 中取出）的时候会把这个字段置 NULL，也就是说一个 tcache_chunk 被 free 掉后 key 才会指向 tcache。

但是由于用户数据有概率恰好和 tcache 的地址相等，所以这里不直接报错，而是遍历一遍 tcache->entries[tc_idx] 链表，如果在遍历过程中发现遍历的次数超过了 tcache 的上限、有 chunk 没有对齐或的确在链表中找到这个 chunk，就会报对应的错误。

如果通过以上检测，就通过 tcache_put 来把 chunk 放入 tcache 中。

	if (tcache->counts[tc_idx] < mp_.tcache_count)
	  {
	    tcache_put (p, tc_idx);
	    return;
	  }
      }
  }
#endif

tcache_put

static __always_inline void
tcache_put (mchunkptr chunk, size_t tc_idx)
{
  tcache_entry *e = (tcache_entry *) chunk2mem (chunk);

  /* Mark this chunk as "in the tcache" so the test in _int_free will
     detect a double free.  */
  e->key = tcache;

  e->next = PROTECT_PTR (&e->next, tcache->entries[tc_idx]);
  tcache->entries[tc_idx] = e;
  ++(tcache->counts[tc_idx]);
}

这里的实现还是比较好理解的，可以注意到这里把 chunk 的 key 字段置为了 tcache 的地址。

注意！在 free 到 tcache 的整个过程都没有对下一个 chunk 的 prev_inuse 位进行修改，而是默认该位一定是 1。特别的，也没有对该位进行检测，这为一些 off-by-one 攻击提供了便利

free to fastbin

  if ((unsigned long)(size) <= (unsigned long)(get_max_fast ())

#if TRIM_FASTBINS
      /*
	If TRIM_FASTBINS set, don't place chunks
	bordering top into fastbins
      */
      && (chunk_at_offset(p, size) != av->top)
#endif
      ) {

如果 chunk 的 size 属于 fastbin 的范围，并且无法 free 到 tcache 中，就尝试把它 free 到 fastbin 里面。如果开启了 TRIM_FASTBINS 选项，还会判断 chunk 是否和 top_chunk 相邻，只有不相邻才会 free 到 fastbin 中，不过该选项默认关闭，也就是 free fastbin chunk 默认情况下不会和 top_chunk 合并。此流程如下

合法性检查

    if (__builtin_expect (chunksize_nomask (chunk_at_offset (p, size))
			  <= CHUNK_HDR_SZ, 0)
	|| __builtin_expect (chunksize (chunk_at_offset (p, size))
			     >= av->system_mem, 0))

首先进行两个合法性检查

检查下一个 chunk 的大小是否小于等于 CHUNK_HDR_SZ（即 2 * SIZE_SZ）。小于表示下一个 chunk 的大小不合法，等于表示下一个 chunk 的 prev_inuse 位为 0，有 double free。
检查下一个 chunk 是否过大，av->system_mem 代表现在已分配的内存，如果下一个 chunk 的大小大于这个，不用说，肯定不合法。

如果上两个检查未通过，则进行第二次检测

	bool fail = true;
	/* We might not have a lock at this point and concurrent modifications
	   of system_mem might result in a false positive.  Redo the test after
	   getting the lock.  */
	if (!have_lock)
	  {
	    __libc_lock_lock (av->mutex);
	    fail = (chunksize_nomask (chunk_at_offset (p, size)) <= CHUNK_HDR_SZ
		    || chunksize (chunk_at_offset (p, size)) >= av->system_mem);
	    __libc_lock_unlock (av->mutex);
	  }

	if (fail)
	  malloc_printerr ("free(): invalid next size (fast)");
      }

have_lock 为传入的形参，传入时值为 0，那么这里会再做一次同上的合法性检查，不通过则直接报错，重复进行的原因是在并行的情况下别的线程也会对分配区进行操作，这里在加锁后再进行检测保证了正确性，检测完后解锁。通过此处的检测流程后，准备进行之后的入链。

入链


    free_perturb (chunk2mem(p), size - CHUNK_HDR_SZ);

    atomic_store_relaxed (&av->have_fastchunks, true);
    unsigned int idx = fastbin_index(size);
    fb = &fastbin (av, idx);

    /* Atomically link P to its fastbin: P->FD = *FB; *FB = P;  */
    mchunkptr old = *fb, old2;

然后进行 free_perturb，根据 perturb_byte 的值进行 memset，不过由于该变量一般情况下值都为零，所以该语句一般情况下不会起效。

然后通过原子操作设置 av->have_fastchunks 为真，表示 fastbin 中有 chunk。然后取得 chunk 的大小对应的 fastbin 链表头，并定义变量 old，old2，初始化 old 指向链表的第一个元素。


    if (SINGLE_THREAD_P)
      {
	/* Check that the top of the bin is not the record we are going to
	   add (i.e., double free).  */
	if (__builtin_expect (old == p, 0))
	  malloc_printerr ("double free or corruption (fasttop)");
	p->fd = PROTECT_PTR (&p->fd, old);
	*fb = p;
      }

然后判断线程情况，在单线程情况下，直接把被 free 的 chunk 插入到链表的头上。这里进行了较松散的 double free 检测，即链表头指向的 chunk 是否和被 free 的 chunk 为同一个 chunk。通过交叉 free 的方法仍然可以实现 double free。

    else
      do
	{
	  /* Check that the top of the bin is not the record we are going to
	     add (i.e., double free).  */
	  if (__builtin_expect (old == p, 0))
	    malloc_printerr ("double free or corruption (fasttop)");
	  old2 = old;
	  p->fd = PROTECT_PTR (&p->fd, old);
	}
      while ((old = catomic_compare_and_exchange_val_rel (fb, p, old2))
	     != old2);

在多线程情况下就需要通过原子操作来插入了，也就是哪个 catomic 开头的宏，当然其实它和宏 catomic_compare_and_exchange_val_acq 是一个东西，glibc 的解释为（之前 malloc 中也碰到过，原样复制过来）

/* Atomically store NEWVAL in *MEM if *MEM is equal to OLDVAL.
   Return the old *MEM value.  */

可见该宏通过原子变量来避免多线程情况下的赋值错误，而做的事情和单线程下也无区别，就是 *fb = REVEAL_PTR (victim->fd);。

    /* Check that size of fastbin chunk at the top is the same as
       size of the chunk that we are adding.  We can dereference OLD
       only if we have the lock, otherwise it might have already been
       allocated again.  */
    if (have_lock && old != NULL
	&& __builtin_expect (fastbin_index (chunksize (old)) != idx, 0))
      malloc_printerr ("invalid fastbin entry (free)");
  }

完成入链后会最后再做一次合法性检查，检查原先出于链表头的 chunk（也就是现在 fastbin 中第二个 chunk）的 size 是否属于当前链表。

free to unsorted bin

若 chunk 无法进入上面的俩个 bin 中，则将 chunk 置入 unsorted bin 这个各 bin 的缓冲区中。这里为了减少内存碎片，还会进行合并等操作。

  /*
    Consolidate other non-mmapped chunks as they arrive.
  */

  else if (!chunk_is_mmapped(p)) {

首先判断 chunk 是否为 mmap 分配的，若是直接 unmap 即可，若否进入之后的流程

初始化和合法性检查

在这里会初始化一些变量，并进行许多合法性检查

    /* If we're single-threaded, don't lock the arena.  */
    if (SINGLE_THREAD_P)
      have_lock = true;

    if (!have_lock)
      __libc_lock_lock (av->mutex);

    nextchunk = chunk_at_offset(p, size);

    /* Lightweight tests: check whether the block is already the
       top block.  */
    if (__glibc_unlikely (p == av->top))
      malloc_printerr ("double free or corruption (top)");
    /* Or whether the next chunk is beyond the boundaries of the arena.  */
    if (__builtin_expect (contiguous (av)
			  && (char *) nextchunk
			  >= ((char *) av->top + chunksize(av->top)), 0))
	malloc_printerr ("double free or corruption (out)");
    /* Or whether the block is actually not marked used.  */
    if (__glibc_unlikely (!prev_inuse(nextchunk)))
      malloc_printerr ("double free or corruption (!prev)");

    nextsize = chunksize(nextchunk);
    if (__builtin_expect (chunksize_nomask (nextchunk) <= CHUNK_HDR_SZ, 0)
	|| __builtin_expect (nextsize >= av->system_mem, 0))
      malloc_printerr ("free(): invalid next size (normal)");

    free_perturb (chunk2mem(p), size - CHUNK_HDR_SZ);

首先根据线程情况考虑是否需要加锁，并获得下一个 chunk 的地址，存于变量 nextchunk 中。

然后进行许多合法性检查

检查被 free 的 chunk 是否是 top_chunk，top_chunk 不可能被 free，若为 top_chunk，进行报错
判断 nextchunk 的地址是否大于 top_chunk，若分配区模式为“是否分配连续的虚拟地址”，则 nextchunk 的地址不可能大于 top_chunk，此时进行报错
检查 nextchunk 的 prev_inuse 位，若本 chunk 本来就是 free 状态，代表是 double free，进行报错
获取下一个 chunk 的大小存于 nextsize 中，进行类似于 fastbin 中的 chunk 大小检查。

前后向合并

通过合法性检查后首先尝试进行向低地址合并

    /* consolidate backward */
    if (!prev_inuse(p)) {
      prevsize = prev_size (p);
      size += prevsize;
      p = chunk_at_offset(p, -((long) prevsize));
      if (__glibc_unlikely (chunksize(p) != prevsize))
        malloc_printerr ("corrupted size vs. prev_size while consolidating");
      unlink_chunk (av, p);
    }

对 size，prev_size 等字段值的设置会在向高地址合并时统一进行，这里只做了解链和大小的更新。

在 libc-2.29 及更高版本中，这里和 unlink_chunk 函数中都加入了对前后 chunk 在地址上是否相连的检测，使 unlink 攻击方式变得困难许多，有一种通过 large chunk 绕过的方法，比较典型的是 Balsn_CTF_2019-PlainText 一题，WP

然后尝试向高地址合并，这里要判断下一个 chunk 是不是 top_chunk，若不是则尝试合并

    if (nextchunk != av->top) {
      /* get and clear inuse bit */
      nextinuse = inuse_bit_at_offset(nextchunk, nextsize);

      /* consolidate forward */
      if (!nextinuse) {
	unlink_chunk (av, nextchunk);
	size += nextsize;
      } else
	clear_inuse_bit_at_offset(nextchunk, 0);

然后将处理好的 chunk 放入 unsorted bin 中

      /*
	Place the chunk in unsorted chunk list. Chunks are
	not placed into regular bins until after they have
	been given one chance to be used in malloc.
      */

      bck = unsorted_chunks(av);
      fwd = bck->fd;
      if (__glibc_unlikely (fwd->bk != bck))
	malloc_printerr ("free(): corrupted unsorted chunks");
      p->fd = fwd;
      p->bk = bck;
      if (!in_smallbin_range(size))
	{
	  p->fd_nextsize = NULL;
	  p->bk_nextsize = NULL;
	}
      bck->fd = p;
      fwd->bk = p;

      set_head(p, size | PREV_INUSE);
      set_foot(p, size);

      check_free_chunk(av, p);
    }

入链的操作中对 unsorted bin 的链表进行了完整性检查，检查链表中第一个 chunk 的 bk 是否指向链表头。然后将合并好的 chunk 插到 unsorted bin 的头部，可见 unsorted bin 链表为先进后出结构（malloc 的时候是从尾部开始遍历的），平均了每个 chunk 被访问的机会。

如果下一个 chunk 为 top_chunk，则将当前 chunk 和 top_chunk 合并形成新的 top_chunk。


    /*
      If the chunk borders the current high end of memory,
      consolidate into top
    */

    else {
      size += nextsize;
      set_head(p, size | PREV_INUSE);
      av->top = p;
      check_chunk(av, p);
    }

取出 fastbin

如果合并完成的 chunk 的 size 大于 FASTBIN_CONSOLIDATION_THRESHOLD，默认为 65536（64KB），并且 fastbin 中存在 chunk，调用 malloc_consolidate 将 fastbin 取出

    /*
      If freeing a large space, consolidate possibly-surrounding
      chunks. Then, if the total unused topmost memory exceeds trim
      threshold, ask malloc_trim to reduce top.

      Unless max_fast is 0, we don't know if there are fastbins
      bordering top, so we cannot tell for sure whether threshold
      has been reached unless fastbins are consolidated.  But we
      don't want to consolidate on each free.  As a compromise,
      consolidation is performed if FASTBIN_CONSOLIDATION_THRESHOLD
      is reached.
    */

    if ((unsigned long)(size) >= FASTBIN_CONSOLIDATION_THRESHOLD) {
      if (atomic_load_relaxed (&av->have_fastchunks))
	malloc_consolidate(av);

收缩 top_chunk

之后的处理，根据分配区的不同

对主分配区如果 top_chunk 的大小大于 trim_threshold（堆的收缩阀值），调用 systrim 直接收缩 top_chunk 的大小
对非主分配区直接调用（always try） heap_trim 收缩 sub_heap。

      if (av == &main_arena) {
#ifndef MORECORE_CANNOT_TRIM
	if ((unsigned long)(chunksize(av->top)) >=
	    (unsigned long)(mp_.trim_threshold))
	  systrim(mp_.top_pad, av);
#endif
      } else {
	/* Always try heap_trim(), even if the top chunk is not
	   large, because the corresponding heap might go away.  */
	heap_info *heap = heap_for_ptr(top(av));

	assert(heap->ar_ptr == av);
	heap_trim(heap, mp_.top_pad);
      }
    }

    if (!have_lock)
      __libc_lock_unlock (av->mutex);
  }

完事后根据分配区的锁情况释放锁。

  /*
    If the chunk was allocated via mmap, release via munmap().
  */

  else {
    munmap_chunk (p);
  }
}

代码的最后是之前提到的对 mmap 分配的堆块的处理，通过调用 munmap_chunk 来释放。到这里 _int_free 就结束了。

总结

关于 _int_free 的形参 have_lock，从 __libc_free 进入的时候，是直接置成零的，所以直接 free 的话，在置入 chunk 的时候，是不会对分配区加锁的（并不代表整个 _int_free 都不会加锁）。所以之前在 _int_malloc 中可以看到在结尾时，即便调用过了一次 malloc_consolidate，仍然会再调用一次，就是因为未加锁，fastbin 仍然可能在 malloc 的途中被其他线程加入 chunk。这个 have_lock 也不会总为零的，因为还有别的函数也会调用只，所以许多判断不是多余的。

类似的，对于 chunk 是否为 mmaped 的判断也不多余，因为别的函数调用时不一定会分开来处理。

至此 free 分析结束。

参考

Heap Exploit v2.31 | 最新堆利用技巧，速速查收
glibc内存管理ptmalloc源代码分析
woboq
bootlin
ctf-wiki