DPDK Series, Part 29: Huge Page Memory Optimization

I. Huge Page Memory

Huge page memory has been covered many times earlier in this series, but mostly in terms of its outward form. This article focuses on the characteristics of huge pages, why they improve performance, and how the optimization works. To talk about huge pages, we have to mention how the x86 family manages virtual memory: segments and pages. Only a brief recap is given here; readers who want a deeper understanding of these two concepts should consult an OS-principles text or a book on assembly language.
Segmented management: the program is divided into a number of segments by content or by related functions, and each segment is given a name. A user process corresponds to a two-dimensional linear virtual address space (an address is located as segment base + offset), and lookup goes through a segment table.
Paged management: the virtual address space is divided into equal-sized sub-spaces, i.e. pages, and a page table maps each page's virtual address to a physical address. Dedicated hardware (the MMU) performs the address translation.
Segmented-paged management: a combination of the two, at the cost of extra complexity.
On today's mainstream x86-64 operating systems, segmentation is essentially no longer used except in a few special cases; in other words, paging is what matters. With paging, a frequently encountered problem is that a page fault stalls the process, which noticeably degrades overall performance and, in some cases, can have consequences that are hard to anticipate.
Page faults are closely related to the TLB (Translation Lookaside Buffer), which can be thought of simply as a cache of page-table entries: a TLB miss forces a page-table walk, and if the walk finds no valid, resident mapping, a page fault is raised.
Page faults come in two broad kinds: a soft (minor) fault, where the page is already in memory but has not yet been registered with the MMU (no valid page-table entry), and a hard (major) fault, where the page is not in memory at all (it has been swapped out to disk). A soft fault costs relatively little; a hard fault is very expensive.
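To make the soft/hard distinction concrete, the following standalone sketch reads the process's minor (soft) and major (hard) fault counters with getrusage() before and after touching a freshly mapped region. It is a generic Linux measurement given only for illustration; it is not part of DPDK.

#define _GNU_SOURCE
/* Sketch: observe minor (soft) vs major (hard) page faults with getrusage(). */
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/resource.h>

int main(void)
{
	struct rusage before, after;
	size_t len = 64UL * 1024 * 1024;           /* 64 MB of anonymous memory */

	void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED)
		return 1;

	getrusage(RUSAGE_SELF, &before);
	memset(buf, 1, len);                        /* first touch: roughly one soft fault per page */
	getrusage(RUSAGE_SELF, &after);

	printf("soft faults: %ld, hard faults: %ld\n",
	       after.ru_minflt - before.ru_minflt,
	       after.ru_majflt - before.ru_majflt);

	munmap(buf, len);
	return 0;
}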
In fact this goes back to the recurring theme of scarce machine resources: in theory, the more memory there is, the less likely page faults become. And it is precisely because resources are tight that operating systems are generally designed so that virtual address space can be shared, rarely used pages can be swapped out, and so on.
From the analysis above, the basic cause of a page fault is that the content the process needs is not resident (pages are small). The obvious idea is to make the pages bigger, or even to use a single page (going back to early-style memory management). But that raises another problem: what happens when memory is insufficient? Flexibility is lost, and multi-process handling also regresses to the early days.
So be clear that huge page memory has its own applicable scenarios, namely workloads that are highly sensitive to memory behavior: 1) large memory footprint, tens of GB or more; 2) frequent and random access, i.e. poor locality; 3) memory access is the bottleneck. Otherwise huge pages simply waste large amounts of memory and can actually lower overall performance.
Linux itself ships a huge page library, libhugetlbfs, which can be used after some configuration. In addition, large Internet companies often adapt huge page handling to their own workloads and build in-house libraries; there are articles online by former ByteDance engineers on this topic, for those who are interested.
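As a quick illustration of using huge pages outside DPDK, the minimal sketch below maps one anonymous 2 MB huge page directly with mmap() and MAP_HUGETLB. It assumes the kernel hugepage pool has already been reserved (e.g. via /proc/sys/vm/nr_hugepages); it is a plain Linux example, not DPDK code.

#define _GNU_SOURCE
/* Minimal sketch: map one anonymous 2 MB huge page with mmap().
 * Assumes the kernel hugepage pool is non-empty,
 * e.g. echo 64 > /proc/sys/vm/nr_hugepages.
 */
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#define HUGE_PAGE_SZ (2UL * 1024 * 1024)

int main(void)
{
	void *addr = mmap(NULL, HUGE_PAGE_SZ, PROT_READ | PROT_WRITE,
			  MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
	if (addr == MAP_FAILED) {
		perror("mmap(MAP_HUGETLB)");
		return 1;
	}

	memset(addr, 0, HUGE_PAGE_SZ);   /* touch the page so it is actually backed */
	printf("huge page mapped at %p\n", addr);

	munmap(addr, HUGE_PAGE_SZ);
	return 0;
}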
Next, let's focus on how DPDK handles huge page memory.

II. Huge Page Memory Management in DPDK

It should be noted that DPDK manages huge page memory in a segmented-paged fashion, with one precondition: the huge pages backing a segment must belong to the same CPU socket and be contiguous. To use its own memory management system, DPDK has to replace the usual system-provided calls (malloc, free, etc.) with its own; in DPDK these are rte_malloc and rte_free.
rte_malloc in effect obtains its memory through the memzone layer, and memzones are in turn carved out of rte_memseg (segment memory); the segments ultimately track the series of huge pages.
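As a quick usage sketch (assuming rte_eal_init() has already succeeded; the tag "example_buf", the 4 KB size, and the 64-byte alignment are illustrative values, not requirements), allocation and release through DPDK's allocator look like this:

/* Sketch: allocating from DPDK's huge-page-backed heap instead of libc. */
#include <rte_malloc.h>
#include <rte_lcore.h>

static int use_dpdk_heap(void)
{
	/* 4 KB buffer, cache-line (64-byte) aligned, on the caller's NUMA node */
	char *buf = rte_malloc_socket("example_buf", 4096, 64, rte_socket_id());
	if (buf == NULL)
		return -1;

	buf[0] = 'x';            /* memory is backed by huge pages, not the libc heap */

	rte_free(buf);           /* returns the element to the per-socket heap */
	return 0;
}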
The overall huge page flow:
1. Initialization and mapping of the shared configuration
This was analyzed earlier: the hugepage information is loaded from sysfs (/sys/kernel/mm/hugepages) according to the configuration; the primary process then allocates the shared memory and secondary processes map it.
2. Mapping the huge pages
The primary process's hugepage mapping works against the mount point configured under /mnt/huge; the huge pages are then mapped into the shared memory just set up, and a table recording the virtual-to-physical relationship of the huge pages is maintained. Huge pages are mapped twice: the first pass completes the per-page information and performs the various conversions; the second pass exists to keep the virtual addresses contiguous and finally provides the mapping into shared memory. Contiguity is the precondition for segmented-paged management.
Once this work is done, memory management can be offered at the abstraction layer (e.g. the rte_malloc just mentioned).
3. Segment memory management
The advantage of segmented-paged management is obvious, higher efficiency; the drawback is added complexity. DPDK chose this scheme precisely for the sake of efficiency.
4. The memzone implementation
The segments hold the memory available for use, while a memzone manages a sized reservation taken from the huge page memory the segments point to. It is effectively a bookkeeping list describing the in-use memory of the pool: every reservation made against a segment produces an rte_memzone object.
5. DPDK memory management
This brings us back to the opening of this section: the upper layers call rte_malloc and rte_free to control memory. Note that because of NUMA, DPDK must maintain a heap for every CPU socket, and each heap contains a set of free lists bucketed by the size of the elements they point to, so that memory is used more sensibly, waste is reduced, and efficiency improves.
As for managing allocations with linked lists, anyone who has studied memory pools knows there are many ways to do it. In DPDK specifically, taking a single free list as an example: find a free element from the list head, take its memory, and use the malloc_elem structure to split it into two parts, each with its own malloc_elem header (similar to a doubly linked arrangement: the element at free_head stays put while the tail end keeps being carved off to form the list), then hand out the allocated part. On the next allocation the free list is consulted first; if the remaining element is large enough, the request is carved out of it, continuing from the previously split tail, and only when it no longer fits is new memory requested and re-linked through free_head, and so on.
Freeing is the reverse operation; note that the malloc_elem header of a freed element still points back to the previous element, which keeps traversal of the heap contiguous. A simplified sketch of this header-before-data layout follows below.
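The following is a simplified toy, not the real struct malloc_elem; it only illustrates the header-before-data idea: every element carries a small header in front of the user data, so the allocator can recover the header from the user pointer when the memory is freed (the real counterpart is malloc_elem_from_data(), which appears in the source below).

/* Toy illustration of the header-before-data layout used by DPDK's heap.
 * Not the actual struct malloc_elem; fields are reduced to the essentials.
 */
#include <stddef.h>
#include <stdint.h>

struct toy_elem {
	struct toy_elem *prev;   /* previous element in the heap (used when merging) */
	size_t size;             /* size of this element, header included */
	int free;                /* 1 = on a free list, 0 = allocated */
};

/* pointer handed to the user: just past the header */
static inline void *toy_data(struct toy_elem *e)
{
	return (void *)(e + 1);
}

/* the inverse, as used on free(): step back over the header */
static inline struct toy_elem *toy_elem_from_data(void *data)
{
	return (struct toy_elem *)((uintptr_t)data - sizeof(struct toy_elem));
}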

III. Source Code Analysis

With the logic above clear, let's walk through the source. The mapping part was analyzed earlier and is not repeated here; first, segment memory allocation:

1. Segment memory allocation

//eal/eal_memory.c
static unsigned
map_all_hugepages(struct hugepage_file *hugepg_tbl, struct hugepage_info *hpi,
		  uint64_t *essential_memory __rte_unused)
{
	...
	for (i = 0; i < hpi->num_pages[0]; i++) {
		struct hugepage_file *hf = &hugepg_tbl[i];
		...
		virtaddr = mmap(NULL, hugepage_sz, PROT_READ | PROT_WRITE,
				MAP_SHARED | MAP_POPULATE, fd, 0);
		if (virtaddr == MAP_FAILED) {
			RTE_LOG(DEBUG, EAL, "%s(): mmap failed: %s\n", __func__,
					strerror(errno));
			close(fd);
			goto out;
		}

		hf->orig_va = virtaddr;
		...
	}
	...
}

static int __rte_unused
prealloc_segments(struct hugepage_file *hugepages, int n_pages)
{
	struct rte_mem_config *mcfg = rte_eal_get_configuration()->mem_config;
	int cur_page, seg_start_page, end_seg, new_memseg;
	unsigned int hpi_idx, socket, i;
	int n_contig_segs, n_segs;
	int msl_idx;

	/* before we preallocate segments, we need to free up our VA space.
	 * we're not removing files, and we already have information about
	 * PA-contiguousness, so it is safe to unmap everything.
	 */
	for (cur_page = 0; cur_page < n_pages; cur_page++) {
		struct hugepage_file *hpi = &hugepages[cur_page];
		munmap(hpi->orig_va, hpi->size);
		hpi->orig_va = NULL;
	}

	/* we cannot know how many page sizes and sockets we have discovered, so
	 * loop over all of them
	 */
	for (hpi_idx = 0; hpi_idx < internal_config.num_hugepage_sizes;
			hpi_idx++) {
		uint64_t page_sz =
			internal_config.hugepage_info[hpi_idx].hugepage_sz;

		for (i = 0; i < rte_socket_count(); i++) {
			struct rte_memseg_list *msl;

			socket = rte_socket_id_by_idx(i);
			n_contig_segs = 0;
			n_segs = 0;
			seg_start_page = -1;

			for (cur_page = 0; cur_page < n_pages; cur_page++) {
				struct hugepage_file *prev, *cur;
				int prev_seg_start_page = -1;

				cur = &hugepages[cur_page];
				prev = cur_page == 0 ? NULL :
						&hugepages[cur_page - 1];

				new_memseg = 0;
				end_seg = 0;

				if (cur->size == 0)
					end_seg = 1;
				else if (cur->socket_id != (int) socket)
					end_seg = 1;
				else if (cur->size != page_sz)
					end_seg = 1;
				else if (cur_page == 0)
					new_memseg = 1;
#ifdef RTE_ARCH_PPC_64
				/* On PPC64 architecture, the mmap always start
				 * from higher address to lower address. Here,
				 * physical addresses are in descending order.
				 */
				else if ((prev->physaddr - cur->physaddr) !=
						cur->size)
					new_memseg = 1;
#else
				else if ((cur->physaddr - prev->physaddr) !=
						cur->size)
					new_memseg = 1;
#endif
				if (new_memseg) {
					/* if we're already inside a segment,
					 * new segment means end of current one
					 */
					if (seg_start_page != -1) {
						end_seg = 1;
						prev_seg_start_page =
								seg_start_page;
					}
					seg_start_page = cur_page;
				}

				if (end_seg) {
					if (prev_seg_start_page != -1) {
						/* we've found a new segment */
						n_contig_segs++;
						n_segs += cur_page -
							prev_seg_start_page;
					} else if (seg_start_page != -1) {
						/* we didn't find new segment,
						 * but did end current one
						 */
						n_contig_segs++;
						n_segs += cur_page -
								seg_start_page;
						seg_start_page = -1;
						continue;
					} else {
						/* we're skipping this page */
						continue;
					}
				}
				/* segment continues */
			}
			/* check if we missed last segment */
			if (seg_start_page != -1) {
				n_contig_segs++;
				n_segs += cur_page - seg_start_page;
			}

			/* if no segments were found, do not preallocate */
			if (n_segs == 0)
				continue;

			/* we now have total number of pages that we will
			 * allocate for this segment list. add separator pages
			 * to the total count, and preallocate VA space.
			 */
			n_segs += n_contig_segs - 1;

			/* now, preallocate VA space for these segments */

			/* first, find suitable memseg list for this */
			for (msl_idx = 0; msl_idx < RTE_MAX_MEMSEG_LISTS;
					msl_idx++) {
				msl = &mcfg->memsegs[msl_idx];

				if (msl->base_va != NULL)
					continue;
				break;
			}
			if (msl_idx == RTE_MAX_MEMSEG_LISTS) {
				RTE_LOG(ERR, EAL, "Not enough space in memseg lists, please increase %s\n",
					RTE_STR(CONFIG_RTE_MAX_MEMSEG_LISTS));
				return -1;
			}

			/* now, allocate fbarray itself */
			if (alloc_memseg_list(msl, page_sz, n_segs, socket,
						msl_idx) < 0)
				return -1;

			/* finally, allocate VA space */
			if (alloc_va_space(msl) < 0)
				return -1;
		}
	}
	return 0;
}

2. The memzone-related code (initialization, reservation, etc.):

int
rte_eal_memzone_init(void)
{
	struct rte_mem_config *mcfg;
	int ret = 0;

	/* get pointer to global configuration */
	mcfg = rte_eal_get_configuration()->mem_config;

	rte_rwlock_write_lock(&mcfg->mlock);

	if (rte_eal_process_type() == RTE_PROC_PRIMARY &&
			rte_fbarray_init(&mcfg->memzones, "memzone",
			RTE_MAX_MEMZONE, sizeof(struct rte_memzone))) {
		RTE_LOG(ERR, EAL, "Cannot allocate memzone list\n");
		ret = -1;
	} else if (rte_eal_process_type() == RTE_PROC_SECONDARY &&
			rte_fbarray_attach(&mcfg->memzones)) {
		RTE_LOG(ERR, EAL, "Cannot attach to memzone list\n");
		ret = -1;
	}

	rte_rwlock_write_unlock(&mcfg->mlock);

	return ret;
}
int
rte_fbarray_init(struct rte_fbarray *arr, const char *name, unsigned int len,
		unsigned int elt_sz)
{
  ...
  ma->addr = data;
	ma->len = mmap_len;
	ma->fd = fd;

	/* do not close fd - keep it until detach/destroy */
	TAILQ_INSERT_TAIL(&mem_area_tailq, ma, next);

	/* initialize the data */
	memset(data, 0, mmap_len);

	/* populate data structure */
	strlcpy(arr->name, name, sizeof(arr->name));
	arr->data = data;
	arr->len = len;
	arr->elt_sz = elt_sz;
	arr->count = 0;

	msk = get_used_mask(data, elt_sz, len);
	msk->n_masks = MASK_LEN_TO_IDX(RTE_ALIGN_CEIL(len, MASK_ALIGN));
  ...
}
// reserving a memzone
static const struct rte_memzone *
rte_memzone_reserve_thread_safe(const char *name, size_t len, int socket_id,
		unsigned int flags, unsigned int align, unsigned int bound)
{
	struct rte_mem_config *mcfg;
	const struct rte_memzone *mz = NULL;

	/* get pointer to global configuration */
	mcfg = rte_eal_get_configuration()->mem_config;

	rte_rwlock_write_lock(&mcfg->mlock);

	mz = memzone_reserve_aligned_thread_unsafe(
		name, len, socket_id, flags, align, bound);

	rte_rwlock_write_unlock(&mcfg->mlock);

	return mz;
}
static const struct rte_memzone *
memzone_reserve_aligned_thread_unsafe(const char *name, size_t len,
		int socket_id, unsigned int flags, unsigned int align,
		unsigned int bound)
{
	struct rte_memzone *mz;
	struct rte_mem_config *mcfg;
	struct rte_fbarray *arr;
	void *mz_addr;
	size_t requested_len;
	int mz_idx;
	bool contig;
......

	struct malloc_elem *elem = malloc_elem_from_data(mz_addr);

	/* fill the zone in config */
	mz_idx = rte_fbarray_find_next_free(arr, 0);

	if (mz_idx < 0) {
		mz = NULL;
	} else {
		rte_fbarray_set_used(arr, mz_idx);
		mz = rte_fbarray_get(arr, mz_idx);
	}

	if (mz == NULL) {
		RTE_LOG(ERR, EAL, "%s(): Cannot find free memzone\n", __func__);
		malloc_heap_free(elem);
		rte_errno = ENOSPC;
		return NULL;
	}

	strlcpy(mz->name, name, sizeof(mz->name));
	mz->iova = rte_malloc_virt2iova(mz_addr);
	mz->addr = mz_addr;
	mz->len = requested_len == 0 ?
			elem->size - elem->pad - MALLOC_ELEM_OVERHEAD :
			requested_len;
	mz->hugepage_sz = elem->msl->page_sz;
	mz->socket_id = elem->msl->socket_id;
	mz->flags = 0;

	return mz;
}
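For reference, a typical use of the memzone API from application code might look like the sketch below; the name "example_zone" and the 1 MB size are made-up values, and the lookup fallback assumes a multi-process setup where another process may have reserved the zone first.

/* Sketch: reserving and looking up a named memzone. */
#include <rte_memzone.h>
#include <rte_lcore.h>

static const struct rte_memzone *get_example_zone(void)
{
	const struct rte_memzone *mz;

	/* reserve 1 MB on the caller's NUMA node, default flags */
	mz = rte_memzone_reserve("example_zone", 1 << 20,
				 rte_socket_id(), 0);
	if (mz == NULL) {
		/* another process may already have reserved it - look it up */
		mz = rte_memzone_lookup("example_zone");
	}
	return mz;   /* mz->addr / mz->iova give the VA / IOVA of the zone */
}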

3. rte_malloc and rte_free

First the allocation path:

/*
 * Allocate memory on specified heap.
 */
void *
rte_malloc_socket(const char *type, size_t size, unsigned int align,
		int socket_arg)
{
	/* return NULL if size is 0 or alignment is not power-of-2 */
	if (size == 0 || (align && !rte_is_power_of_2(align)))
		return NULL;

	/* if there are no hugepages and if we are not allocating from an
	 * external heap, use memory from any socket available. checking for
	 * socket being external may return -1 in case of invalid socket, but
	 * that's OK - if there are no hugepages, it doesn't matter.
	 */
	if (rte_malloc_heap_socket_is_external(socket_arg) != 1 &&
				!rte_eal_has_hugepages())
		socket_arg = SOCKET_ID_ANY;

	return malloc_heap_alloc(type, size, socket_arg, 0,
			align == 0 ? 1 : align, 0, false);
}

/*
 * Allocate memory on default heap.
 */
void *
rte_malloc(const char *type, size_t size, unsigned align)
{
	return rte_malloc_socket(type, size, align, SOCKET_ID_ANY);
}
/* this will try lower page sizes first */
static void *
malloc_heap_alloc_on_heap_id(const char *type, size_t size,
		unsigned int heap_id, unsigned int flags, size_t align,
		size_t bound, bool contig)
{
	struct rte_mem_config *mcfg = rte_eal_get_configuration()->mem_config;
	struct malloc_heap *heap = &mcfg->malloc_heaps[heap_id];
	unsigned int size_flags = flags & ~RTE_MEMZONE_SIZE_HINT_ONLY;
	int socket_id;
	void *ret;

	rte_spinlock_lock(&(heap->lock));

	align = align == 0 ? 1 : align;

	/* for legacy mode, try once and with all flags */
	if (internal_config.legacy_mem) {
		ret = heap_alloc(heap, type, size, flags, align, bound, contig);
		goto alloc_unlock;
	}

	/*
	 * we do not pass the size hint here, because even if allocation fails,
	 * we may still be able to allocate memory from appropriate page sizes,
	 * we just need to request more memory first.
	 */

	socket_id = rte_socket_id_by_idx(heap_id);
	/*
	 * if socket ID is negative, we cannot find a socket ID for this heap -
	 * which means it's an external heap. those can have unexpected page
	 * sizes, so if the user asked to allocate from there - assume user
	 * knows what they're doing, and allow allocating from there with any
	 * page size flags.
	 */
	if (socket_id < 0)
		size_flags |= RTE_MEMZONE_SIZE_HINT_ONLY;

	ret = heap_alloc(heap, type, size, size_flags, align, bound, contig);
	if (ret != NULL)
		goto alloc_unlock;

	/* if socket ID is invalid, this is an external heap */
	if (socket_id < 0)
		goto alloc_unlock;

	if (!alloc_more_mem_on_socket(heap, size, socket_id, flags, align,
			bound, contig)) {
		ret = heap_alloc(heap, type, size, flags, align, bound, contig);

		/* this should have succeeded */
		if (ret == NULL)
			RTE_LOG(ERR, EAL, "Error allocating from heap\n");
	}
alloc_unlock:
	rte_spinlock_unlock(&(heap->lock));
	return ret;
}

void *
malloc_heap_alloc(const char *type, size_t size, int socket_arg,
		unsigned int flags, size_t align, size_t bound, bool contig)
{
	int socket, heap_id, i;
	void *ret;

	/* return NULL if size is 0 or alignment is not power-of-2 */
	if (size == 0 || (align && !rte_is_power_of_2(align)))
		return NULL;

	if (!rte_eal_has_hugepages() && socket_arg < RTE_MAX_NUMA_NODES)
		socket_arg = SOCKET_ID_ANY;

	if (socket_arg == SOCKET_ID_ANY)
		socket = malloc_get_numa_socket();
	else
		socket = socket_arg;

	/* turn socket ID into heap ID */
	heap_id = malloc_socket_to_heap_id(socket);
	/* if heap id is negative, socket ID was invalid */
	if (heap_id < 0)
		return NULL;

	ret = malloc_heap_alloc_on_heap_id(type, size, heap_id, flags, align,
			bound, contig);
	if (ret != NULL || socket_arg != SOCKET_ID_ANY)
		return ret;

	/* try other heaps. we are only iterating through native DPDK sockets,
	 * so external heaps won't be included.
	 */
	for (i = 0; i < (int) rte_socket_count(); i++) {
		if (i == heap_id)
			continue;
		ret = malloc_heap_alloc_on_heap_id(type, size, i, flags, align,
				bound, contig);
		if (ret != NULL)
			return ret;
	}
	return NULL;
}
static void *
heap_alloc(struct malloc_heap *heap, const char *type __rte_unused, size_t size,
		unsigned int flags, size_t align, size_t bound, bool contig)
{
	struct malloc_elem *elem;

	size = RTE_CACHE_LINE_ROUNDUP(size);
	align = RTE_CACHE_LINE_ROUNDUP(align);

	/* roundup might cause an overflow */
	if (size == 0)
		return NULL;
	elem = find_suitable_element(heap, size, flags, align, bound, contig);
	if (elem != NULL) {
		elem = malloc_elem_alloc(elem, size, align, bound, contig);

		/* increase heap's count of allocated elements */
		heap->alloc_count++;
	}

	return elem == NULL ? NULL : (void *)(&elem[1]);
}
struct malloc_elem *
malloc_elem_alloc(struct malloc_elem *elem, size_t size, unsigned align,
		size_t bound, bool contig)
{
	struct malloc_elem *new_elem = elem_start_pt(elem, size, align, bound,
			contig);
	const size_t old_elem_size = (uintptr_t)new_elem - (uintptr_t)elem;
	const size_t trailer_size = elem->size - old_elem_size - size -
		MALLOC_ELEM_OVERHEAD;

	malloc_elem_free_list_remove(elem);

	if (trailer_size > MALLOC_ELEM_OVERHEAD + MIN_DATA_SIZE) {
		/* split it, too much free space after elem */
		struct malloc_elem *new_free_elem =
				RTE_PTR_ADD(new_elem, size + MALLOC_ELEM_OVERHEAD);

		split_elem(elem, new_free_elem);
		malloc_elem_free_list_insert(new_free_elem);

		if (elem == elem->heap->last)
			elem->heap->last = new_free_elem;
	}

	if (old_elem_size < MALLOC_ELEM_OVERHEAD + MIN_DATA_SIZE) {
		/* don't split it, pad the element instead */
		elem->state = ELEM_BUSY;
		elem->pad = old_elem_size;

		/* put a dummy header in padding, to point to real element header */
		if (elem->pad > 0) { /* pad will be at least 64-bytes, as everything
		                     * is cache-line aligned */
			new_elem->pad = elem->pad;
			new_elem->state = ELEM_PAD;
			new_elem->size = elem->size - elem->pad;
			set_header(new_elem);
		}

		return new_elem;
	}

	/* we are going to split the element in two. The original element
	 * remains free, and the new element is the one allocated.
	 * Re-insert original element, in case its new size makes it
	 * belong on a different list.
	 */
	split_elem(elem, new_elem);
	new_elem->state = ELEM_BUSY;
	malloc_elem_free_list_insert(elem);

	return new_elem;
}

In the end, it all comes back to operations on the free-list queues.

Now look at the release path:

/* Free the memory space back to heap */
void rte_free(void *addr)
{
	if (addr == NULL) return;
	if (malloc_heap_free(malloc_elem_from_data(addr)) < 0)
		RTE_LOG(ERR, EAL, "Error: Invalid memory\n");
}
int
malloc_heap_free(struct malloc_elem *elem)
{
	struct malloc_heap *heap;
	void *start, *aligned_start, *end, *aligned_end;
	size_t len, aligned_len, page_sz;
	struct rte_memseg_list *msl;
	unsigned int i, n_segs, before_space, after_space;
	int ret;

	if (!malloc_elem_cookies_ok(elem) || elem->state != ELEM_BUSY)
		return -1;

	/* elem may be merged with previous element, so keep heap address */
	heap = elem->heap;
	msl = elem->msl;
	page_sz = (size_t)msl->page_sz;

	rte_spinlock_lock(&(heap->lock));

	/* mark element as free */
	elem->state = ELEM_FREE;

	elem = malloc_elem_free(elem);

	/* anything after this is a bonus */
	ret = 0;

	/* ...of which we can't avail if we are in legacy mode, or if this is an
	 * externally allocated segment.
	 */
	if (internal_config.legacy_mem || (msl->external > 0))
		goto free_unlock;

	/* check if we can free any memory back to the system */
	if (elem->size < page_sz)
		goto free_unlock;

	/* if user requested to match allocations, the sizes must match - if not,
	 * we will defer freeing these hugepages until the entire original allocation
	 * can be freed
	 */
	if (internal_config.match_allocations && elem->size != elem->orig_size)
		goto free_unlock;

	/* probably, but let's make sure, as we may not be using up full page */
	start = elem;
	len = elem->size;
	aligned_start = RTE_PTR_ALIGN_CEIL(start, page_sz);
	end = RTE_PTR_ADD(elem, len);
	aligned_end = RTE_PTR_ALIGN_FLOOR(end, page_sz);

	aligned_len = RTE_PTR_DIFF(aligned_end, aligned_start);

	/* can't free anything */
	if (aligned_len < page_sz)
		goto free_unlock;

	/* we can free something. however, some of these pages may be marked as
	 * unfreeable, so also check that as well
	 */
	n_segs = aligned_len / page_sz;
	for (i = 0; i < n_segs; i++) {
		const struct rte_memseg *tmp =
				rte_mem_virt2memseg(aligned_start, msl);

		if (tmp->flags & RTE_MEMSEG_FLAG_DO_NOT_FREE) {
			/* this is an unfreeable segment, so move start */
			aligned_start = RTE_PTR_ADD(tmp->addr, tmp->len);
		}
	}

	/* recalculate length and number of segments */
	aligned_len = RTE_PTR_DIFF(aligned_end, aligned_start);
	n_segs = aligned_len / page_sz;

	/* check if we can still free some pages */
	if (n_segs == 0)
		goto free_unlock;

	/* We're not done yet. We also have to check if by freeing space we will
	 * be leaving free elements that are too small to store new elements.
	 * Check if we have enough space in the beginning and at the end, or if
	 * start/end are exactly page aligned.
	 */
	before_space = RTE_PTR_DIFF(aligned_start, elem);
	after_space = RTE_PTR_DIFF(end, aligned_end);
	if (before_space != 0 &&
			before_space < MALLOC_ELEM_OVERHEAD + MIN_DATA_SIZE) {
		/* There is not enough space before start, but we may be able to
		 * move the start forward by one page.
		 */
		if (n_segs == 1)
			goto free_unlock;

		/* move start */
		aligned_start = RTE_PTR_ADD(aligned_start, page_sz);
		aligned_len -= page_sz;
		n_segs--;
	}
	if (after_space != 0 && after_space <
			MALLOC_ELEM_OVERHEAD + MIN_DATA_SIZE) {
		/* There is not enough space after end, but we may be able to
		 * move the end backwards by one page.
		 */
		if (n_segs == 1)
			goto free_unlock;

		/* move end */
		aligned_end = RTE_PTR_SUB(aligned_end, page_sz);
		aligned_len -= page_sz;
		n_segs--;
	}

	/* now we can finally free us some pages */

	rte_mcfg_mem_write_lock();

	/*
	 * we allow secondary processes to clear the heap of this allocated
	 * memory because it is safe to do so, as even if notifications about
	 * unmapped pages don't make it to other processes, heap is shared
	 * across all processes, and will become empty of this memory anyway,
	 * and nothing can allocate it back unless primary process will be able
	 * to deliver allocation message to every single running process.
	 */

	malloc_elem_free_list_remove(elem);

	malloc_elem_hide_region(elem, (void *) aligned_start, aligned_len);

	heap->total_size -= aligned_len;

	if (rte_eal_process_type() == RTE_PROC_PRIMARY) {
		/* notify user about changes in memory map */
		eal_memalloc_mem_event_notify(RTE_MEM_EVENT_FREE,
				aligned_start, aligned_len);

		/* don't care if any of this fails */
		malloc_heap_free_pages(aligned_start, aligned_len);

		request_sync();
	} else {
		struct malloc_mp_req req;

		memset(&req, 0, sizeof(req));

		req.t = REQ_TYPE_FREE;
		req.free_req.addr = aligned_start;
		req.free_req.len = aligned_len;

		/*
		 * we request primary to deallocate pages, but we don't do it
		 * in this thread. instead, we notify primary that we would like
		 * to deallocate pages, and this process will receive another
		 * request (in parallel) that will do it for us on another
		 * thread.
		 *
		 * we also don't really care if this succeeds - the data is
		 * already removed from the heap, so it is, for all intents and
		 * purposes, hidden from the rest of DPDK even if some other
		 * process (including this one) may have these pages mapped.
		 *
		 * notifications about deallocated memory happen during sync.
		 */
		request_to_primary(&req);
	}

	RTE_LOG(DEBUG, EAL, "Heap on socket %d was shrunk by %zdMB\n",
		msl->socket_id, aligned_len >> 20ULL);

	rte_mcfg_mem_write_unlock();
free_unlock:
	rte_spinlock_unlock(&(heap->lock));
	return ret;
}

Likewise, this also ends up back at the list operations.

IV. Summary

When reading the source of a fairly large framework, there are three main problems. First, how to get the overall flow clear at the macro level; this is crucial, because otherwise you get stuck in a local understanding, and the macro view is what shapes your own architecture and design thinking. Second, how the details are implemented, which directly affects how well you understand the code and which techniques and tricks you can learn and apply. Third, how the detailed modules hook into the macro flow, that is, the layering and API design; this reflects yet another kind of thinking: the process of turning design ideas into practice.
Once these three aspects are clear, you can consider how to borrow from them in your own development and how to blend them with the techniques you have already mastered. Computer science is a product of theory and practice tightly combined, and only through repeated cycles of negation of the negation does one's technical level make real progress.
