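Linux Kernel Study Notes (2): Memory Management

Overview

This post first introduces some concepts related to memory management, such as pages and zones, then covers the various ways of obtaining memory, and finally describes Linux's slab layer (the slab allocator).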

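Pages

The page is the basic unit the Linux kernel uses to manage memory. The MMU and virtual memory also use the page as their basic unit of management. Page size depends on the architecture: most architectures, including x86 and x86-64, use 4KB pages by default, while a few 64-bit architectures such as Alpha use 8KB pages. This means that a machine with 1GB of physical memory and a 4KB page size has its physical memory divided into 262,144 pages.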
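The page arithmetic above can be checked with a small user-space sketch. This is not kernel code: the function names pages_in and page_index_of are made up for illustration, and the page size is passed in as a parameter rather than taken from any kernel header.

```c
#include <assert.h>
#include <stdint.h>

/* Number of whole pages in mem_bytes of memory, for a given page size. */
uint64_t pages_in(uint64_t mem_bytes, uint64_t page_size)
{
    assert(page_size != 0);
    return mem_bytes / page_size;
}

/* Index of the page that contains phys_addr, for pages of 2^page_shift bytes. */
uint64_t page_index_of(uint64_t phys_addr, unsigned int page_shift)
{
    assert(page_shift < 64);
    return phys_addr >> page_shift;
}
```

Calling pages_in(1ULL << 30, 4096) reproduces the 262,144 pages mentioned above for 1GB of memory and 4KB pages.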

A page is represented by the struct page data structure, defined in <linux/mm_types.h>. Below is the definition of struct page, with some fields omitted:

struct page
{
    unsigned long flags;
    atomic_t _count;
    atomic_t _mapcount;
    unsigned long private;
    struct address_space *mapping;
    pgoff_t index;
    struct list_head lru;
    void *virtual;
};

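The flags field stores the state of the page, such as whether the page is dirty or whether it is locked in memory.
The _count field stores the page's reference count. A negative count means the page is unused; a positive count means the page is in use. A page may be used by the page cache, mapped by a page table, or hold private data (pointed at by private). Kernel code normally does not read _count directly; it calls page_count() instead, which returns 0 when _count is negative.
The virtual field stores the page's virtual address. High memory (covered later) cannot be permanently mapped into the kernel's address space, and in that case virtual is set to NULL.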

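The kernel uses struct page to keep track of every page in physical memory. It needs to know whether a page is free (that is, whether it has been allocated), and if the page is not free, who owns it.

struct page describes physical pages only, not virtual pages. A given virtual page can be backed by different physical pages at different times, even when the data it holds has not changed (because of swapping, for example).

Every page in physical memory has a corresponding struct page instance, so there are as many struct page instances in memory as there are pages, whether or not a page is in use. That sounds like a lot of overhead, but it is actually small. Take a system with 4GB of physical memory and 8KB pages: it needs 524,288 struct page instances. Assuming each struct page occupies 40 bytes, that comes to 20MB in total, which is a small cost relative to 4GB of physical memory.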
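As a rough check of that overhead estimate, here is a user-space sketch. The 40-byte size of struct page is an assumption taken from the book's example; the real size depends on the kernel version and configuration. The function name struct_page_overhead is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Bytes needed for one struct page per physical page.
 * struct_page_size is an assumed value supplied by the caller,
 * not something read from a real kernel.
 */
uint64_t struct_page_overhead(uint64_t mem_bytes, uint64_t page_size,
                              uint64_t struct_page_size)
{
    assert(page_size != 0);
    uint64_t nr_pages = mem_bytes / page_size;
    return nr_pages * struct_page_size;
}
```

With 4GB of memory, 8KB pages, and 40 bytes per struct page, the result is 20MB, which is roughly 0.5% of physical memory.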

Zones

The kernel divides pages into zones; the pages in each zone share characteristics that differ from those of other zones.

Linux has four main memory zones:

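ZONE_DMA: pages that can undergo DMA
ZONE_DMA32: pages that can undergo DMA only by 32-bit devices
ZONE_NORMAL: pages that are mapped normally
ZONE_HIGHMEM: high memory, whose physical pages cannot be permanently mapped into the kernel's address space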

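If you are not familiar with high memory, the blog post "linux 用户空间与内核空间——高端内存详解" (Linux user space and kernel space: high memory explained) covers it in detail.

The actual use and layout of the zones is architecture-dependent. The table below lists the zones and their physical address ranges on x86-32.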

Zone	Physical address range
ZONE_DMA	<16MB
ZONE_NORMAL	16~896MB
ZONE_HIGHMEM	> 896MB
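The x86-32 layout in the table can be expressed as a simple classification function. This is a user-space sketch, not the kernel's implementation: the enum and the function toy_zone_of are hypothetical names, and the boundaries are treated as half-open ranges, so an address of exactly 16MB counts as ZONE_NORMAL and exactly 896MB counts as ZONE_HIGHMEM.

```c
#include <stdint.h>

#define MB (1024ULL * 1024ULL)

/* Toy stand-ins for the kernel's zones on x86-32. */
enum toy_zone { TOY_ZONE_DMA, TOY_ZONE_NORMAL, TOY_ZONE_HIGHMEM };

/* Classify a physical address according to the x86-32 table above. */
enum toy_zone toy_zone_of(uint64_t phys_addr)
{
    if (phys_addr < 16 * MB)
        return TOY_ZONE_DMA;        /* below 16MB */
    if (phys_addr < 896 * MB)
        return TOY_ZONE_NORMAL;     /* 16MB up to 896MB */
    return TOY_ZONE_HIGHMEM;        /* 896MB and above */
}
```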

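x86-64 has no ZONE_HIGHMEM, because it can permanently map all of its physical memory; memory falls in ZONE_DMA and ZONE_NORMAL (plus ZONE_DMA32 for devices limited to 32-bit DMA addresses).

The kernel represents a zone with struct zone, defined in <linux/mmzone.h>.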

Allocating physical pages

In <linux/gfp.h> the kernel defines several low-level interfaces that allocate memory in units of pages.

Allocating pages:

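Function	Description
struct page * alloc_pages(gfp_t gfp_mask, unsigned int order)	Allocates 2^order contiguous physical pages and returns a pointer to the first page's struct page
unsigned long __get_free_pages(gfp_t gfp_mask, unsigned int order)	Allocates 2^order contiguous physical pages and returns the logical address of the first page
struct page * alloc_page(gfp_t gfp_mask)	Allocates a single physical page and returns a pointer to its struct page
unsigned long __get_free_page(gfp_t gfp_mask)	Allocates a single physical page and returns its logical address
unsigned long get_zeroed_page(gfp_t gfp_mask)	Allocates a single physical page, fills it with zeros, and returns its logical address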

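The kernel also provides a function that converts a page pointer to a logical address:

void * page_address(struct page *page) takes a page pointer and returns the page's logical address.

Freeing pages: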

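Function	Description
void __free_pages(struct page *page, unsigned int order)	Frees 2^order physical pages; the argument is a pointer to the first page
void free_pages(unsigned long addr, unsigned int order)	Frees 2^order physical pages; the argument is the logical address of the first page
void free_page(unsigned long addr)	Frees a single page; the argument is the logical address of the page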
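Because these interfaces work in units of 2^order pages, a caller has to turn a byte count into an order and pass the same order when freeing. The user-space sketch below shows that arithmetic only. It assumes a 4KB page size, and toy_get_order and toy_order_bytes are hypothetical names written for this example, not kernel functions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_PAGE_SIZE 4096UL  /* assumed page size for this sketch */

/* Smallest order such that 2^order pages hold at least `size` bytes. */
unsigned int toy_get_order(size_t size)
{
    unsigned int order = 0;
    size_t span = TOY_PAGE_SIZE;

    assert(size <= SIZE_MAX / 2);  /* keeps the shift below from overflowing */
    while (span < size) {
        span <<= 1;
        order++;
    }
    return order;
}

/* Number of bytes actually covered by an allocation of the given order. */
size_t toy_order_bytes(unsigned int order)
{
    return TOY_PAGE_SIZE << order;
}
```

A request for three pages' worth of data, for instance, needs order 2 and therefore occupies four pages; the extra page is the cost of the power-of-two granularity.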

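kmalloc(): allocating physically contiguous memory of a given size

The kmalloc() function, declared in <linux/slab.h>, is similar to user-space malloc(), except that it takes an additional flags argument of type gfp_t: void * kmalloc(size_t size, gfp_t flags). kmalloc() returns a physically contiguous region of at least size bytes, or NULL on failure. It is "at least" because the allocator serves requests from fixed-size internal caches that are ultimately carved out of whole pages, so a request may be rounded up to the next size the allocator can provide; the caller cannot tell how much extra, if any, it received. The return value is the logical address of the start of the region.

The function that frees memory obtained from kmalloc() is void kfree(const void *ptr).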

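gfp_t flags

Both the low-level page allocation interfaces and kmalloc() take an argument of type gfp_t; gfp stands for "get free page". The flags fall into three groups: action modifiers, zone modifiers, and types.

Action Modifiers

Action modifiers specify how the kernel should go about allocating the memory (an interrupt handler, for example, must not sleep while obtaining memory). The table below lists the action modifiers and what they do.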

Flag	Description
__GFP_WAIT	The allocator can sleep.
__GFP_HIGH	The allocator can access emergency pools.
__GFP_IO	The allocator can start disk I/O.
__GFP_FS	The allocator can start filesystem I/O.
__GFP_COLD	The allocator should use cache cold pages.
__GFP_NOWARN	The allocator does not print failure warnings.
__GFP_REPEAT	The allocator repeats the allocation if it fails, but the allocation can potentially fail.
__GFP_NOFAIL	The allocator indefinitely repeats the allocation. The allocation cannot fail.
__GFP_NORETRY	The allocator never retries if the allocation fails.
__GFP_NOMEMALLOC	The allocator does not fall back on reserves.
__GFP_HARDWALL	The allocator enforces “hardwall” cpuset boundaries.
__GFP_RECLAIMABLE	The allocator marks the pages reclaimable.
__GFP_COMP	The allocator adds compound page metadata (used internally by the hugetlb code).

Modifiers can be combined with bitwise OR, for example: ptr = kmalloc(size, __GFP_WAIT | __GFP_IO | __GFP_FS);

Zone Modifiers

Zone modifiers specify which zone the memory is allocated from.

Flag	Description
__GFP_DMA	Allocates only from ZONE_DMA
__GFP_DMA32	Allocates only from ZONE_DMA32
__GFP_HIGHMEM	Allocates from ZONE_HIGHMEM or ZONE_NORMAL
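No zone modifier	Allocates from ZONE_DMA or ZONE_NORMAL, with a preference for ZONE_NORMAL

Note that __GFP_HIGHMEM must not be passed to __get_free_page(), __get_free_pages(), or kmalloc(). These functions return a logical address, and a page allocated from high memory may not be mapped into the kernel's address space, so it may have no logical address to return.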
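The zone modifier rules can be summarized in a small user-space model. Everything here is an assumption made for illustration: the TOY_ bit values are invented and do not match the kernel's, the two functions are hypothetical, and the model reports only the preferred zone, ignoring the fallback from ZONE_HIGHMEM to ZONE_NORMAL or from ZONE_NORMAL to ZONE_DMA.

```c
#include <stdbool.h>

/* Invented bit values; the real __GFP_* constants differ. */
#define TOY_GFP_DMA     0x1u
#define TOY_GFP_HIGHMEM 0x2u
#define TOY_GFP_DMA32   0x4u

enum toy_zone_id {
    TOY_Z_DMA, TOY_Z_DMA32, TOY_Z_NORMAL, TOY_Z_HIGHMEM
};

/* Zone an allocation prefers, given its zone modifier (at most one set). */
enum toy_zone_id toy_preferred_zone(unsigned int flags)
{
    if (flags & TOY_GFP_DMA)
        return TOY_Z_DMA;
    if (flags & TOY_GFP_DMA32)
        return TOY_Z_DMA32;
    if (flags & TOY_GFP_HIGHMEM)
        return TOY_Z_HIGHMEM;
    return TOY_Z_NORMAL;  /* no modifier: ZONE_NORMAL is preferred */
}

/*
 * Interfaces that return a logical address must not be given the
 * high-memory modifier, since such pages may have no logical address.
 */
bool toy_ok_for_logical_address(unsigned int flags)
{
    return (flags & TOY_GFP_HIGHMEM) == 0;
}
```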

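Type Flags

Type flags are predefined combinations of the two kinds of modifiers above. They exist so that common situations do not require spelling out individual flags, which simplifies the call and reduces mistakes.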

Flag	Description
GFP_ATOMIC	The allocation is high priority and must not sleep. This is the flag to use in interrupt handlers, in bottom halves, while holding a spinlock, and in other situations where you cannot sleep.
GFP_NOWAIT	Like GFP_ATOMIC, except that the call will not fall back on emergency memory pools. This increases the likelihood of the memory allocation failing.
GFP_NOIO	This allocation can block, but must not initiate disk I/O. This is the flag to use in block I/O code when you cannot cause more disk I/O, which might lead to some unpleasant recursion.
GFP_NOFS	This allocation can block and can initiate disk I/O, if it must, but it will not initiate a filesystem operation. This is the flag to use in filesystem code when you cannot start another filesystem operation.
GFP_KERNEL	This is a normal allocation and might block. This is the flag to use in process context code when it is safe to sleep. The kernel will do whatever it has to do to obtain the memory requested by the caller. This flag should be your default choice.
GFP_USER	This is a normal allocation and might block. This flag is used to allocate memory for user-space processes.
GFP_HIGHUSER	This is an allocation from ZONE_HIGHMEM and might block. This flag is used to allocate memory for user-space processes.
GFP_DMA	This is an allocation from ZONE_DMA. Device drivers that need DMA-able memory use this flag, usually in combination with one of the preceding flags.

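The most commonly used are GFP_KERNEL, GFP_ATOMIC, and GFP_DMA.

GFP_KERNEL means the allocation may block and the process may go to sleep. It has the best chance of succeeding, because the kernel can take steps such as swapping (writing the contents of physical pages out to disk when memory is short, to free up memory).
GFP_ATOMIC means the allocation must not block and the caller must not sleep; it is typically used in interrupt handlers. Because the caller cannot block, the kernel cannot swap or take similar steps, so the chance of success is lower than with GFP_KERNEL.
GFP_DMA means the memory will be used for DMA and must come from ZONE_DMA.

Type flags can likewise be combined with bitwise OR; for example, (GFP_DMA | GFP_KERNEL) requests memory from ZONE_DMA and allows the process to sleep.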
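To make the idea of a type flag as a bundle of modifiers concrete, here is a user-space model. The bit values are invented. The compositions follow the 2.6-era definitions that Linux Kernel Development describes (for example GFP_KERNEL as __GFP_WAIT | __GFP_IO | __GFP_FS), and newer kernels define these flags differently, so treat this as a sketch of the concept rather than of any current header. All TOY_ names and toy_ functions are hypothetical.

```c
#include <stdbool.h>

/* Invented bit values for four action modifiers. */
#define TOY_GFP_WAIT 0x01u  /* may sleep */
#define TOY_GFP_HIGH 0x02u  /* may use emergency pools */
#define TOY_GFP_IO   0x04u  /* may start disk I/O */
#define TOY_GFP_FS   0x08u  /* may start filesystem I/O */

/* Type flags modeled as combinations of the modifiers above. */
#define TOY_GFP_ATOMIC (TOY_GFP_HIGH)
#define TOY_GFP_NOIO   (TOY_GFP_WAIT)
#define TOY_GFP_NOFS   (TOY_GFP_WAIT | TOY_GFP_IO)
#define TOY_GFP_KERNEL (TOY_GFP_WAIT | TOY_GFP_IO | TOY_GFP_FS)

bool toy_may_sleep(unsigned int flags)
{
    return (flags & TOY_GFP_WAIT) != 0;
}

bool toy_may_start_disk_io(unsigned int flags)
{
    return (flags & TOY_GFP_IO) != 0;
}

bool toy_may_start_fs_io(unsigned int flags)
{
    return (flags & TOY_GFP_FS) != 0;
}

/* Pick a type flag from the calling context, as the text recommends. */
unsigned int toy_gfp_for_context(bool can_sleep)
{
    return can_sleep ? TOY_GFP_KERNEL : TOY_GFP_ATOMIC;
}
```

The helper toy_gfp_for_context mirrors the rule of thumb above: process context that may sleep uses the GFP_KERNEL style of allocation, while interrupt handlers and code holding a spinlock use the GFP_ATOMIC style.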

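vmalloc(): allocating virtually contiguous memory of a given size

vmalloc() is declared in <linux/vmalloc.h>: void * vmalloc(unsigned long size)

vmalloc() allocates memory of the requested size that is contiguous in the virtual address space but not necessarily contiguous physically. Because the pages may be scattered in physical memory, vmalloc() has to set up page table entries that map them into a contiguous range of virtual addresses. The pages must be mapped one by one, which puts far more pressure on the TLB than directly mapped memory and hurts performance. For this reason most kernel code uses kmalloc(), and vmalloc() is used only when there is no alternative (for example, to obtain a very large region of memory).

The function that frees memory obtained from vmalloc() is void vfree(const void *addr).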
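The difference between virtual and physical contiguity can be illustrated with a toy page table in user space. This is only a model of the idea: struct toy_vm_area and toy_translate are hypothetical, the page size is assumed to be 4KB, and real page tables are multi-level structures walked by hardware.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_PAGE_SHIFT 12                      /* assumed 4KB pages */
#define TOY_PAGE_SIZE  (1ULL << TOY_PAGE_SHIFT)

/* A virtually contiguous area: frames[i] is the physical frame behind page i. */
struct toy_vm_area {
    const uint64_t *frames;
    size_t nr_pages;
};

/*
 * Translate a byte offset inside the area to a physical address.
 * Returns 0 on success and -1 when the offset is outside the area.
 */
int toy_translate(const struct toy_vm_area *area, uint64_t offset,
                  uint64_t *phys)
{
    assert(area != NULL && phys != NULL);

    uint64_t vpage = offset >> TOY_PAGE_SHIFT;
    if (vpage >= area->nr_pages)
        return -1;

    /* Frame number selects the physical page, low bits select the byte. */
    *phys = (area->frames[vpage] << TOY_PAGE_SHIFT)
          | (offset & (TOY_PAGE_SIZE - 1));
    return 0;
}
```

With frames {7, 2, 9}, consecutive virtual pages land in physical frames 7, 2, and 9: the caller sees one contiguous range, but every page needs its own mapping entry, which is where the extra TLB pressure comes from.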

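The slab layer (slab allocator)

The slab layer is a generic data structure caching layer provided by the Linux kernel. Its main job is to maintain object caches for data structures (objects) that are allocated and freed frequently. When an object is no longer needed, the slab layer reclaims it and keeps it in the cache instead of destroying it. When code asks for a new object, an available object is returned straight from that object's cache rather than from freshly allocated memory.

The design and implementation of the Linux slab layer follow these principles: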

Frequently used data structures tend to be allocated and freed often, so cache them.
Frequent allocation and deallocation can result in memory fragmentation (the inability to find large contiguous chunks of available memory). To prevent this, the cached free lists are arranged contiguously. Because freed data structures return to the free list, there is no resulting fragmentation.
The free list provides improved performance during frequent allocation and deallocation because a freed object can be immediately returned to the next allocation.
If the allocator is aware of concepts such as object size, page size, and total cache size, it can make more intelligent decisions.
If part of the cache is made per-processor (separate and unique to each processor on the system), allocations and frees can be performed without an SMP lock.
If the allocator is NUMA-aware, it can fulfill allocations from the same memory node as the requestor.
Stored objects can be colored to prevent multiple objects from mapping to the same cache lines.

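The slab layer divides objects into caches by type: each cache holds objects of a single type, and objects of a given type live in exactly one cache.

A cache is divided into slabs, and a slab consists of one or more physically contiguous pages (typically a single page).

Each slab contains some number of objects, which are the data structures being cached.

In short, a cache is made up of slabs, and a slab is made up of objects.

Each slab is in one of three states:

full: no free objects; every object in the slab has been allocated.
partial: some objects are free and some have been allocated.
empty: no objects have been allocated; all are free.

When some part of the kernel requests a new object, the request is satisfied from a partial slab if one exists. If there is no partial slab, it is satisfied from an empty slab. If there is no empty slab either, a new empty slab is created.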
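The selection policy just described (partial first, then empty, otherwise create a new slab) can be sketched in user space. The kernel keeps slabs on separate lists per state; this toy version scans a plain array instead, and struct toy_slab, toy_state_of, and toy_pick_slab are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

enum toy_slab_state { TOY_SLAB_EMPTY, TOY_SLAB_PARTIAL, TOY_SLAB_FULL };

struct toy_slab {
    unsigned int inuse;     /* objects currently allocated */
    unsigned int capacity;  /* objects the slab can hold */
};

/* Derive the slab's state from its counters. */
enum toy_slab_state toy_state_of(const struct toy_slab *s)
{
    assert(s->inuse <= s->capacity);
    if (s->inuse == 0)
        return TOY_SLAB_EMPTY;
    if (s->inuse == s->capacity)
        return TOY_SLAB_FULL;
    return TOY_SLAB_PARTIAL;
}

/*
 * Choose the slab to allocate from: a partial slab if any exists,
 * otherwise an empty one. NULL means the caller must create a new slab.
 */
struct toy_slab *toy_pick_slab(struct toy_slab *slabs, size_t n)
{
    struct toy_slab *empty = NULL;

    for (size_t i = 0; i < n; i++) {
        enum toy_slab_state st = toy_state_of(&slabs[i]);
        if (st == TOY_SLAB_PARTIAL)
            return &slabs[i];
        if (st == TOY_SLAB_EMPTY && empty == NULL)
            empty = &slabs[i];
    }
    return empty;
}
```

Preferring partial slabs keeps empty slabs untouched for as long as possible, which makes them easy to give back when memory is short.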

A cache is represented by struct kmem_cache, which contains three lists, slabs_full, slabs_partial, and slabs_empty, holding the slabs in each of the three states.

A slab is represented by struct slab:

struct slab {
    struct list_head list; /* full, partial, or empty list */
    unsigned long colouroff; /* offset for the slab coloring */
    void *s_mem; /* first object in the slab */
    unsigned int inuse; /* allocated objects in the slab */
    kmem_bufctl_t free; /* first free object, if any */
};

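When a new object is needed but the cache has neither a partial nor an empty slab, the slab allocator creates a new empty slab for the cache pointed to by cachep with the function static void *kmem_getpages(struct kmem_cache *cachep, gfp_t flags, int nodeid). kmem_getpages() calls __get_free_pages() to obtain contiguous physical pages.

When memory is low and the system is trying to free memory, or when a cache is destroyed, the slab allocator calls kmem_freepages(struct kmem_cache *cachep, void *addr) to release the slab in cachep identified by the logical address addr. kmem_freepages() calls free_pages() to release the physical pages.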

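Slab allocator interface

The slab layer gives callers a set of interfaces for 1) creating a cache for objects of some type, 2) destroying a cache for objects of some type, 3) obtaining an object from a cache, and 4) returning an object to a cache.

Through these interfaces the slab allocator hides its internals (finding a partial slab, creating a new slab, destroying empty slabs) and behaves like a specialized object allocator.

The table below lists the main interfaces provided by the slab allocator: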

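Function	Description
struct kmem_cache * kmem_cache_create(const char *name, size_t size, size_t align, unsigned long flags, void (*ctor)(void *))	Creates a cache. name is the name of the cache, size is the size of each object, align is the required alignment of the objects, flags controls the cache's behavior (for example, whether each slab must be allocated from ZONE_DMA), and ctor is a constructor for the objects.
int kmem_cache_destroy(struct kmem_cache *cachep)	Destroys the cache pointed to by cachep; the caller must make sure all slabs in the cache are empty
void * kmem_cache_alloc(struct kmem_cache *cachep, gfp_t flags)	Obtains an available object from the cache
void kmem_cache_free(struct kmem_cache *cachep, void *objp)	Frees the object pointed to by objp, marking it as free in cachep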
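The four operations (create, destroy, allocate, free) can be mimicked in user space with a free-list object cache. This is a toy written for illustration, not the kernel's slab allocator: it has no slabs, coloring, or per-CPU state, it falls back to malloc() when the free list is empty, and every toy_cache_ name is hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

struct toy_cache {
    size_t obj_size;   /* size of each object, at least sizeof(void *) */
    void *free_list;   /* recycled objects, linked through their first bytes */
    size_t nr_free;    /* number of objects on the free list */
};

struct toy_cache *toy_cache_create(size_t obj_size)
{
    struct toy_cache *c = malloc(sizeof(*c));
    if (c == NULL)
        return NULL;
    /* A free object must be able to hold the link to the next one. */
    c->obj_size = obj_size < sizeof(void *) ? sizeof(void *) : obj_size;
    c->free_list = NULL;
    c->nr_free = 0;
    return c;
}

/* Reuse a recycled object if one exists, otherwise allocate fresh memory. */
void *toy_cache_alloc(struct toy_cache *c)
{
    assert(c != NULL);
    if (c->free_list != NULL) {
        void *obj = c->free_list;
        c->free_list = *(void **)obj;
        c->nr_free--;
        return obj;
    }
    return malloc(c->obj_size);
}

/* Return an object to the cache instead of releasing its memory. */
void toy_cache_free(struct toy_cache *c, void *obj)
{
    assert(c != NULL && obj != NULL);
    *(void **)obj = c->free_list;
    c->free_list = obj;
    c->nr_free++;
}

/* Release all recycled objects; the caller must have freed every object. */
void toy_cache_destroy(struct toy_cache *c)
{
    assert(c != NULL);
    while (c->free_list != NULL) {
        void *obj = c->free_list;
        c->free_list = *(void **)obj;
        free(obj);
    }
    free(c);
}
```

Freeing an object and then allocating again hands back the same memory, which is the behavior the slab principles above describe: a freed object can be returned immediately to the next allocation.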

References

Linux Kernel Development, 3rd Edition

Understanding the Linux Kernel, 3rd Edition