1. Overview
This post covers the memory portion of the EAL initialization process; heap initialization will be filled in later.
2. Flow Analysis
2.1 eal_hugepage_info_init
In the normal case hugepages are enabled, and this function's main job is to gather hugepage information.
2.1.1 hugepage_info_init
At system boot we reserve hugepages up front:
echo 1024 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
With NUMA:
echo 1024 > /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
echo 1024 > /sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages
mount -t hugetlbfs nodev /mnt/huge
This function parses that system configuration and fills in the following structure:
struct hugepage_info {
	uint64_t hugepage_sz;   /**< size of a huge page */
	char hugedir[PATH_MAX]; /**< dir where hugetlbfs is mounted */
	uint32_t num_pages[RTE_MAX_NUMA_NODES];
	/**< number of hugepages of that size on each socket */
	int lock_descriptor;    /**< file descriptor for hugepage dir */
};
The code is straightforward: hugepage_sz is the hugepage size, and hugedir is the hugetlbfs mount point (found by parsing /proc/mounts).
Next, the total number of hugepages and their distribution across NUMA nodes are read and recorded in the hugepage_info array inside internal_config:
struct internal_config {
	...
	unsigned num_hugepage_sizes; /**< number of hugepage size kinds (2048 kB, 1 GB) */
	struct hugepage_info hugepage_info[MAX_HUGEPAGE_SIZES];
	...
	volatile unsigned int init_complete;
};
2.1.2 create_shared_memory
Writes the internal_config.hugepage_info data to /var/run/dpdk/rte/hugepage_info.
2.2 rte_eal_log_init
2.3 rte_eal_vfio_setup
2.4 rte_eal_memzone_init
2.4.1 rte_fbarray_init
First, look at the structure of rte_fbarray:
struct rte_fbarray {
	char name[RTE_FBARRAY_NAME_LEN]; /**< name associated with an array */
	unsigned int count;              /**< number of entries stored */
	unsigned int len;                /**< current length of the array */
	unsigned int elt_sz;             /**< size of each element */
	void *data;                      /**< data pointer */
	rte_rwlock_t rwlock;             /**< multiprocess lock */
};
This is fairly conventional: the array stores equally sized objects. elt_sz is the element size, len is the number of slots, count is the number of entries currently in use, and data points to the start of the data.
rte_fbarray_init simply builds this structure from its arguments:
int __rte_experimental
rte_fbarray_init(struct rte_fbarray *arr, const char *name,
		unsigned int len, unsigned int elt_sz)
{
	page_sz = sysconf(_SC_PAGESIZE);
	mmap_len = calc_data_size(page_sz, elt_sz, len);
	data = eal_get_virtual_area(NULL, &mmap_len, page_sz, 0, 0);

	if (internal_config.no_shconf) {
	} else {
		eal_get_fbarray_path(path, sizeof(path), name);
		fd = open(path, O_CREAT | O_RDWR, 0600);
		if (fd < 0) {
		} else if (flock(fd, LOCK_EX | LOCK_NB)) {
			RTE_LOG(DEBUG, EAL, "%s(): couldn't lock %s: %s\n",
				__func__, path, strerror(errno));
			rte_errno = EBUSY;
			goto fail;
		}
		if (resize_and_map(fd, data, mmap_len))
			goto fail;
	}

	/* initialize the data */
	memset(data, 0, mmap_len);

	/* populate data structure */
	strlcpy(arr->name, name, sizeof(arr->name));
	arr->data = data;
	arr->len = len;
	arr->elt_sz = elt_sz;
	arr->count = 0;

	msk = get_used_mask(data, elt_sz, len);
	msk->n_masks = MASK_LEN_TO_IDX(RTE_ALIGN_CEIL(len, MASK_ALIGN));

	rte_rwlock_init(&arr->rwlock);
	return 0;
}
- The fbarray's backing memory has two parts: a data area of elt_sz * len bytes, and a used_mask bitmap (an array of u64 words managing one bit per entry, sized to a multiple of sizeof(u64)) that tracks which entries are in use. The total length is rounded up to a multiple of page_sz. With this layout in mind, the code above is easy to follow.
Now that the data layout is clear, the next step is requesting the memory from the system.
2.4.2 eal_get_virtual_area
- On the first call (next_baseaddr == NULL), the address the user passed via --base-virtaddr is used; if none was given, the default baseaddr below is used:
static uint64_t baseaddr = 0x100000000;
A comment explains why this address was chosen (all of this targets 64-bit systems):
/*
* Linux kernel uses a really high address as starting address for serving
* mmaps calls. If there exists addressing limitations and IOVA mode is VA,
* this starting address is likely too high for those devices. However, it
* is possible to use a lower address in the process virtual address space
* as with 64 bits there is a lot of available space.
*
* Current known limitations are 39 or 40 bits. Setting the starting address
* at 4GB implies there are 508GB or 1020GB for mapping the available
* hugepages. This is likely enough for most systems, although a device with
* addressing limitations should call rte_mem_check_dma_mask for ensuring all
* memory is within supported range.
*/
- If the caller did not request a specific address (requested_addr == NULL), the request uses the next_baseaddr determined above, aligned to page_sz.
- Next, mmap reserves a read-only anonymous region:
mapped_addr = mmap(requested_addr, (size_t)map_sz, PROT_READ, mmap_flags, -1, 0);
If the mapping fails, requested_addr is advanced by one page and the call is retried.
2.4.3 resize_and_map
After the memory reservation above, resize_and_map maps the region onto a shared file; by default the path is /var/run/dpdk/rte/fbarray_memzone.
It ultimately calls:
map_addr = mmap(addr, len, PROT_READ | PROT_WRITE, MAP_SHARED | MAP_FIXED, fd, 0);
Note that mapping the same address twice is fine: within the mmap system call, the kernel unmaps the previous mapping at that range and installs the new one. As for why two mappings are performed at all, some online sources say it prevents race conditions: reserving a region up front guarantees that the later allocation won't fail to find address space. I'll leave that question for later, since I haven't fully worked it out myself. TODO
2.5 rte_eal_memory_init
2.5.1 rte_eal_memseg_init
getrlimit/setrlimit is used to raise the open-fd limit.
memseg_primary_init has a lot of implementation detail, so instead of pasting the code, here is the flow.
First, n_memtypes is computed: the number of distinct (hugepage_sz, socket_id) pairs.
For each memtype, n_seglists segment lists are needed, each containing n_segs segments; this structure is represented with the fb_array described above. alloc_memseg_list allocates the fb_array holding the rte_memsegs, and alloc_va_space reserves the (read-only) anonymous hugepage VA space. I drew a diagram illustrating this:
2.5.2 eal_memalloc_init
rte_memseg_list_walk iterates over the memsegs and invokes fd_list_create_walk, which allocates fd storage for every page (segment) in each list:
static int alloc_list(int list_idx, int len)
{
	int *data;
	int i;

	/* ensure we have space to store fd per each possible segment */
	data = malloc(sizeof(int) * len);
	if (data == NULL) {
		RTE_LOG(ERR, EAL, "Unable to allocate space for file descriptors\n");
		return -1;
	}
	/* set all fd's as invalid */
	for (i = 0; i < len; i++)
		data[i] = -1;
	fd_list[list_idx].fds = data;
	fd_list[list_idx].len = len;
	fd_list[list_idx].count = 0;
	fd_list[list_idx].memseg_list_fd = -1;
	return 0;
}
For reference, the fd_list definition:
/*
* we have two modes - single file segments, and file-per-page mode.
*
* for single-file segments, we need some kind of mechanism to keep track of
* which hugepages can be freed back to the system, and which cannot. we cannot
* use flock() because they don't allow locking parts of a file, and we cannot
* use fcntl() due to issues with their semantics, so we will have to rely on a
* bunch of lockfiles for each page. so, we will use 'fds' array to keep track
* of per-page lockfiles. we will store the actual segment list fd in the
* 'memseg_list_fd' field.
*
* for file-per-page mode, each page will have its own fd, so 'memseg_list_fd'
* will be invalid (set to -1), and we'll use 'fds' to keep track of page fd's.
*
* we cannot know how many pages a system will have in advance, but we do know
* that they come in lists, and we know lengths of these lists. so, simply store
* a malloc'd array of fd's indexed by list and segment index.
*
* they will be initialized at startup, and filled as we allocate/deallocate
* segments.
*/
static struct {
	int *fds; /**< dynamically allocated array of segment lock fd's */
	int memseg_list_fd; /**< memseg list fd */
	int len; /**< total length of the array */
	int count; /**< entries used in an array */
} fd_list[RTE_MAX_MEMSEG_LISTS];
2.5.3 rte_eal_hugepage_init
When hugepages are in use, eal_hugepage_init is called:
static int eal_hugepage_init(void)
{
	struct hugepage_info used_hp[MAX_HUGEPAGE_SIZES];
	uint64_t memory[RTE_MAX_NUMA_NODES];
	int hp_sz_idx, socket_id;

	test_phys_addrs_available();

	/* calculate final number of pages */
	if (calc_num_pages_per_socket(memory,
			internal_config.hugepage_info, used_hp,
			internal_config.num_hugepage_sizes) < 0)
		return -1;

	for (hp_sz_idx = 0; hp_sz_idx < (int)internal_config.num_hugepage_sizes;
			hp_sz_idx++) {
		for (socket_id = 0; socket_id < RTE_MAX_NUMA_NODES;
				socket_id++) {
			struct rte_memseg **pages;
			struct hugepage_info *hpi = &used_hp[hp_sz_idx];
			unsigned int num_pages = hpi->num_pages[socket_id];
			int num_pages_alloc, i;

			if (num_pages == 0)
				continue;

			pages = malloc(sizeof(*pages) * num_pages);

			RTE_LOG(DEBUG, EAL, "Allocating %u pages of size %" PRIu64 "M on socket %i\n",
				num_pages, hpi->hugepage_sz >> 20, socket_id);

			num_pages_alloc = eal_memalloc_alloc_seg_bulk(pages,
					num_pages, hpi->hugepage_sz,
					socket_id, true);
			if (num_pages_alloc < 0) {
				free(pages);
				return -1;
			}

			/* mark preallocated pages as unfreeable */
			for (i = 0; i < num_pages_alloc; i++) {
				struct rte_memseg *ms = pages[i];
				ms->flags |= RTE_MEMSEG_FLAG_DO_NOT_FREE;
			}
			free(pages);
		}
	}
	return 0;
}
- test_phys_addrs_available->rte_mem_virt2phy
rte_mem_virt2phy obtains the physical address backing a virtual address. It is implemented by reading the current process's /proc/self/pagemap to find the VPN-to-PFN mapping and computing the physical address from it; see pagemap.txt under the kernel's Documentation directory for details.
Note that each PFN entry occupies 8 bytes in pagemap, so the function's logic is easy to follow:
virt_pfn = (unsigned long)virtaddr / page_size;
offset = sizeof(uint64_t) * virt_pfn;
lseek(fd, offset, SEEK_SET);
read(fd, &page, PFN_MASK_SIZE);
if ((page & 0x7fffffffffffffULL) == 0)
	return RTE_BAD_IOVA;
physaddr = ((page & 0x7fffffffffffffULL) * page_size)
		+ ((unsigned long)virtaddr % page_size);
- Next, the number of rte_memsegs to allocate is determined per (hugepage_sz, socket_id) pair:
struct rte_memseg **pages;
unsigned int num_pages = hpi->num_pages[socket_id];

pages = malloc(sizeof(*pages) * num_pages);
num_pages_alloc = eal_memalloc_alloc_seg_bulk(pages,
		num_pages, hpi->hugepage_sz,
		socket_id, true);
/* mark preallocated pages as unfreeable */
for (i = 0; i < num_pages_alloc; i++) {
	struct rte_memseg *ms = pages[i];
	ms->flags |= RTE_MEMSEG_FLAG_DO_NOT_FREE;
}
- The core function eal_memalloc_alloc_seg_bulk performs the actual hugepage allocation for every memtype, i.e. each (hugepage_sz, socket_id) memtype gets num_pages hugepages (segments):
wa.exact = exact;
wa.hi = hi;
wa.ms = ms;
wa.n_segs = n_segs;
wa.page_sz = page_sz;
wa.socket = socket;
wa.segs_allocated = 0;
/* memalloc is locked, so it's safe to use thread-unsafe version */
ret = rte_memseg_list_walk_thread_unsafe(alloc_seg_walk, &wa);
We've already seen rte_memseg_list_walk_thread_unsafe: it iterates over all of mcfg->memsegs and applies an operation to each memseg list; here that operation is alloc_seg_walk, which performs the allocation.
1. Find n_segs free segments in the memseg list (easy to follow alongside the earlier diagram):
need = wa->n_segs;
/* try finding space in memseg list */
cur_idx = rte_fbarray_find_next_n_free(&cur_msl->memseg_arr, 0, need);
2. Allocate backing space for each segment. Recall from above that cur_msl->base_va holds the anonymous mapping covering all page_sz * n_segs bytes:
for (i = 0; i < need; i++, cur_idx++) {
	struct rte_memseg *cur;
	void *map_addr;

	cur = rte_fbarray_get(&cur_msl->memseg_arr, cur_idx);
	map_addr = RTE_PTR_ADD(cur_msl->base_va, cur_idx * page_sz);

	if (alloc_seg(cur, map_addr, wa->socket, wa->hi,
			msl_idx, cur_idx))
		...
}
The code above shows exactly where each segment sits inside base_va during allocation; alloc_seg then performs the per-segment work. The key lines:
/* takes out a read lock on segment or segment list */
fd = get_seg_fd(path, sizeof(path), hi, list_idx, seg_idx);
map_offset = 0;
ftruncate(fd, alloc_sz);

mmap_flags = MAP_SHARED | MAP_POPULATE | MAP_FIXED;
va = mmap(addr, alloc_sz, PROT_READ | PROT_WRITE, mmap_flags, fd,
		map_offset);
/* touch the page to make sure it is populated */
*(volatile int *)addr = *(volatile int *)addr;
iova = rte_mem_virt2iova(addr);

ms->addr = addr;
ms->hugepage_sz = alloc_sz;
ms->len = alloc_sz;
ms->nchannel = rte_memory_get_nchannel();
ms->nrank = rte_memory_get_nrank();
ms->iova = iova;
ms->socket_id = socket_id;
This branch is the file-per-page mode, where each segment has its own fd; its initialization was described above. get_seg_fd creates the backing file:
/* create a hugepage file path */
eal_get_hugefile_path(path, buflen, hi->hugedir,
		list_idx * RTE_MAX_MEMSEG_PER_LIST + seg_idx);

fd = fd_list[list_idx].fds[seg_idx];
if (fd < 0) {
	fd = open(path, O_CREAT | O_RDWR, 0600);
}
fd_list[list_idx].fds[seg_idx] = fd;
The backing file lives under the hugetlbfs mount point; with this article's example that is /mnt/huge/rtemap_x, where x is the segment's index across the memseg lists (identifying which list it belongs to).
Next, mmap maps the segment as a file-backed shared mapping. We know the whole memseg list already has one large anonymous mapping, so mapping over it again will not fail. Looking at the kernel code: when mmap detects that the new range overlaps an existing mapping, it unmaps only the overlapping part, i.e. it splits the original VMA, carving out this segment and leaving the rest mapped. Later segment allocations therefore never find their anonymous reservation already consumed.
My impression is that once MAP_POPULATE is specified, mmap prefaults the mapping, so later accesses won't block on page faults. The extra write to the freshly mapped segment may be there for compatibility with older Linux versions (some only honored MAP_POPULATE for anonymous mappings). At this point the segment's rte_memseg can be initialized, and the code above indeed fills it in.
Once the loop completes, every hugepage of the memtype is mapped and described by an rte_memseg; the hugepages are ready for use!
2.5.4 rte_eal_memdevice_init
This just initializes the global nchannel and nrank values.
2.6 rte_eal_malloc_heap_init
This function puts contiguous memsegs under heap-style management. The heap abstraction:
struct malloc_heap {
	rte_spinlock_t lock;
	LIST_HEAD(, malloc_elem) free_head[RTE_HEAP_NUM_FREELISTS];
	struct malloc_elem *volatile first;
	struct malloc_elem *volatile last;

	unsigned alloc_count;
	unsigned int socket_id;
	size_t total_size;
	char name[RTE_HEAP_NAME_MAX_LEN];
} __rte_cache_aligned;
First, the heap names are determined; heaps are per-socket, and each is named socket_x (x being the socket id).
Next, register_mp_requests is called to register multiprocess requests.
The heart of the function is rte_memseg_contig_walk(malloc_add_seg, NULL);
rte_memseg_contig_walk walks the contiguous memsegs within each memseg list, and malloc_add_seg places that memory under heap management:
heap = &mcfg->malloc_heaps[heap_idx];
/* msl is const, so find it */
msl_idx = msl - mcfg->memsegs;
if (msl_idx < 0 || msl_idx >= RTE_MAX_MEMSEG_LISTS)
	return -1;
found_msl = &mcfg->memsegs[msl_idx];
malloc_heap_add_memory(heap, found_msl, ms->addr, len);
heap->total_size += len;
The heap bookkeeping itself is implemented in malloc_heap_add_memory.