Overview
nginx divides memory allocation into two cases: (1) large blocks and (2) small blocks. A request counts as small only if its size is below both the page size and the pool's own size (the smaller of the two becomes the pool's max).
Large blocks are allocated individually with malloc and tracked in a singly linked list.
Small blocks are carved out of the pool's existing data area. No bookkeeping structure records each piece handed out; the memory is reclaimed all at once when the owning object's lifetime ends. This storage scheme closely resembles the kernel's sk_buff, which uses its tail and end pointers to indicate how much of a buffer has already been given out.
Memory pool storage structure
Data structure design (src/core/ngx_palloc.h)
typedef struct {
    u_char              *last;    /* end of the used portion of the data area */
    u_char              *end;     /* end of the available data area */
    ngx_pool_t          *next;    /* next pool in the chain */
    ngx_uint_t           failed;  /* number of failed allocations from this pool */
} ngx_pool_data_t;

struct ngx_pool_s {
    ngx_pool_data_t      d;        /* pool-internal bookkeeping, defined above */
    size_t               max;      /* largest request size served from the pool */
    ngx_pool_t          *current;  /* pool to start searching from */
    ngx_chain_t         *chain;    /* bufs managed as a chain */
    ngx_pool_large_t    *large;    /* head of the large-block singly linked list */
    ngx_pool_cleanup_t  *cleanup;  /* cleanup callbacks run when the pool is released */
    ngx_log_t           *log;      /* log handle */
};
Overall allocation flow
Figure 1: allocation flow of the nginx memory pool
Figure 2: storage layout of the nginx memory pool
Creating a memory pool
The interface lives in src/core/ngx_palloc.c:
ngx_pool_t *
ngx_create_pool(size_t size, ngx_log_t *log)
{
    ngx_pool_t  *p;

    p = ngx_memalign(NGX_POOL_ALIGNMENT, size, log); /* roughly: allocate size bytes, aligned */
    if (p == NULL) {
        return NULL;
    }

    p->d.last = (u_char *) p + sizeof(ngx_pool_t);   /* space for the pool header is already reserved */
    p->d.end = (u_char *) p + size;                  /* end points past the whole allocation */
    p->d.next = NULL;
    p->d.failed = 0;

    size = size - sizeof(ngx_pool_t);                /* usable space, minus the header */
    p->max = (size < NGX_MAX_ALLOC_FROM_POOL) ? size : NGX_MAX_ALLOC_FROM_POOL;

    p->current = p;                                  /* current points at this pool */
    p->chain = NULL;
    p->large = NULL;
    p->cleanup = NULL;
    p->log = log;

    return p;
}
Large-block allocation
How a request is classified as large. Step 1) when the pool is created, compute the pool's max:
size = size - sizeof(ngx_pool_t);
/* the pool's max is the smaller of the pool's usable space and NGX_MAX_ALLOC_FROM_POOL */
p->max = (size < NGX_MAX_ALLOC_FROM_POOL) ? size : NGX_MAX_ALLOC_FROM_POOL;
Step 2) at allocation time, ngx_palloc compares the requested size against pool->max:
void *
ngx_palloc(ngx_pool_t *pool, size_t size)
{
    u_char      *m;
    ngx_pool_t  *p;

    if (size <= pool->max) {
        p = pool->current;

        do {
            m = ngx_align_ptr(p->d.last, NGX_ALIGNMENT);

            if ((size_t) (p->d.end - m) >= size) {
                p->d.last = m + size;
                return m;
            }

            p = p->d.next;
        } while (p);

        return ngx_palloc_block(pool, size);
    }

    return ngx_palloc_large(pool, size); /* anything above pool->max is a large block */
}
ngx_palloc_large allocates the block and links it into the pool's large list:
static void *
ngx_palloc_large(ngx_pool_t *pool, size_t size)
{
    void              *p;
    ngx_uint_t         n;
    ngx_pool_large_t  *large;

    /* allocate the requested size directly */
    p = ngx_alloc(size, pool->log);
    if (p == NULL) {
        return NULL;
    }

    n = 0;

    /* look for a vacant slot to hang the block on in the existing large list */
    for (large = pool->large; large; large = large->next) {
        if (large->alloc == NULL) {
            large->alloc = p;
            return p;
        }

        if (n++ > 3) {
            break;
        }
    }

    /* no vacant slot nearby: allocate a new node and prepend it to pool->large */
    large = ngx_palloc(pool, sizeof(ngx_pool_large_t));
    if (large == NULL) {
        ngx_free(p);
        return NULL;
    }

    large->alloc = p;
    large->next = pool->large;
    pool->large = large;

    return p;
}
Freeing large blocks
Releasing a large block before the pool is destroyed requires the user to call ngx_pfree explicitly. The function is implemented as follows:
ngx_int_t
ngx_pfree(ngx_pool_t *pool, void *p)
{
    ngx_pool_large_t  *l;

    /* walk the list, matching the block to free by address */
    for (l = pool->large; l; l = l->next) {
        if (p == l->alloc) {
            ngx_log_debug1(NGX_LOG_DEBUG_ALLOC, pool->log, 0,
                           "free: %p", l->alloc);
            ngx_free(l->alloc);  /* release the data with ngx_free */
            l->alloc = NULL;     /* note: the large node itself is not freed; it is kept for reuse */

            return NGX_OK;
        }
    }

    return NGX_DECLINED;
}
Lessons from elsewhere: sk_buff
The sk_buff data structure and its operations:
/**
 * struct sk_buff - socket buffer
 * @next: Next buffer in list
 * @prev: Previous buffer in list
 * @sk: Socket we are owned by
 * @tstamp: Time we arrived
 * @dev: Device we arrived on/are leaving by
 * @transport_header: Transport layer header
 * @network_header: Network layer header
 * @mac_header: Link layer header
 * @_skb_dst: destination entry
 * @sp: the security path, used for xfrm
 * @cb: Control buffer. Free for use by every layer. Put private vars here
 * @len: Length of actual data
 * @data_len: Data length
 * @mac_len: Length of link layer header
 * @hdr_len: writable header length of cloned skb
 * @csum: Checksum (must include start/offset pair)
 * @csum_start: Offset from skb->head where checksumming should start
 * @csum_offset: Offset from csum_start where checksum should be stored
 * @local_df: allow local fragmentation
 * @cloned: Head may be cloned (check refcnt to be sure)
 * @nohdr: Payload reference only, must not modify header
 * @pkt_type: Packet class
 * @fclone: skbuff clone status
 * @ip_summed: Driver fed us an IP checksum
 * @priority: Packet queueing priority
 * @users: User count - see {datagram,tcp}.c
 * @protocol: Packet protocol from driver
 * @truesize: Buffer size
 * @head: Head of buffer
 * @data: Data head pointer
 * @tail: Tail pointer
 * @end: End pointer
 * @destructor: Destruct function
 * @mark: Generic packet mark
 * @nfct: Associated connection, if any
 * @ipvs_property: skbuff is owned by ipvs
 * @peeked: this packet has been seen already, so stats have been
 *          done for it, don't do them again
 * @nf_trace: netfilter packet trace flag
 * @nfctinfo: Relationship of this skb to the connection
 * @nfct_reasm: netfilter conntrack re-assembly pointer
 * @nf_bridge: Saved data about a bridged frame - see br_netfilter.c
 * @iif: ifindex of device we arrived on
 * @queue_mapping: Queue mapping for multiqueue devices
 * @tc_index: Traffic control index
 * @tc_verd: traffic control verdict
 * @ndisc_nodetype: router type (from link layer)
 * @dma_cookie: a cookie to one of several possible DMA operations
 *              done by skb DMA functions
 * @secmark: security marking
 * @vlan_tci: vlan tag control information
 */
struct sk_buff {
    /* These two members must be first. */
    struct sk_buff          *next;
    struct sk_buff          *prev;

    struct sock             *sk;
    ktime_t                  tstamp;
    struct net_device       *dev;

    unsigned long            _skb_dst;
#ifdef CONFIG_XFRM
    struct sec_path         *sp;
#endif
    /*
     * This is the control buffer. It is free to use for every
     * layer. Please put your private variables there. If you
     * want to keep them across layers you have to do a skb_clone()
     * first. This is owned by whoever has the skb queued ATM.
     */
    char                     cb[48];

    unsigned int             len,
                             data_len;
    __u16                    mac_len,
                             hdr_len;
    union {
        __wsum               csum;
        struct {
            __u16            csum_start;
            __u16            csum_offset;
        };
    };
    __u32                    priority;
    kmemcheck_bitfield_begin(flags1);
    __u8                     local_df:1,
                             cloned:1,
                             ip_summed:2,
                             nohdr:1,
                             nfctinfo:3;
    __u8                     pkt_type:3,
                             fclone:2,
                             ipvs_property:1,
                             peeked:1,
                             nf_trace:1;
    __be16                   protocol:16;
    kmemcheck_bitfield_end(flags1);

    void                     (*destructor)(struct sk_buff *skb);
#if defined(CONFIG_NF_CONNTRACK) || defined(CONFIG_NF_CONNTRACK_MODULE)
    struct nf_conntrack     *nfct;
    struct sk_buff          *nfct_reasm;
#endif
#ifdef CONFIG_BRIDGE_NETFILTER
    struct nf_bridge_info   *nf_bridge;
#endif

    int                      iif;
#ifdef CONFIG_NET_SCHED
    __u16                    tc_index;   /* traffic control index */
#ifdef CONFIG_NET_CLS_ACT
    __u16                    tc_verd;    /* traffic control verdict */
#endif
#endif

    kmemcheck_bitfield_begin(flags2);
    __u16                    queue_mapping:16;
#ifdef CONFIG_IPV6_NDISC_NODETYPE
    __u8                     ndisc_nodetype:2;
#endif
    kmemcheck_bitfield_end(flags2);

    /* 0/14 bit hole */

#ifdef CONFIG_NET_DMA
    dma_cookie_t             dma_cookie;
#endif
#ifdef CONFIG_NETWORK_SECMARK
    __u32                    secmark;
#endif

    __u32                    mark;

    __u16                    vlan_tci;

    sk_buff_data_t           transport_header;
    sk_buff_data_t           network_header;
    sk_buff_data_t           mac_header;
    /* These elements must be at the end, see alloc_skb() for details. */
    sk_buff_data_t           tail;
    sk_buff_data_t           end;
    unsigned char           *head,
                            *data;
    unsigned int             truesize;
    atomic_t                 users;
};
Now look at what sk_buff's data pointers do (note: the sk_buff diagram is taken from the web):
Figure 3: how sk_buff's data pointers move
The pointer-moving operations on an sk_buff are:
/*
 *  Add data to an sk_buff
 */
extern unsigned char *skb_put(struct sk_buff *skb, unsigned int len);
static inline unsigned char *__skb_put(struct sk_buff *skb, unsigned int len)
{
    unsigned char *tmp = skb_tail_pointer(skb);
    SKB_LINEAR_ASSERT(skb);
    skb->tail += len;
    skb->len  += len;
    return tmp;
}

extern unsigned char *skb_push(struct sk_buff *skb, unsigned int len);
static inline unsigned char *__skb_push(struct sk_buff *skb, unsigned int len)
{
    skb->data -= len;
    skb->len  += len;
    return skb->data;
}

extern unsigned char *skb_pull(struct sk_buff *skb, unsigned int len);
static inline unsigned char *__skb_pull(struct sk_buff *skb, unsigned int len)
{
    skb->len -= len;
    BUG_ON(skb->len < skb->data_len);
    return skb->data += len;
}

extern unsigned char *__pskb_pull_tail(struct sk_buff *skb, int delta);
static inline unsigned char *__pskb_pull(struct sk_buff *skb, unsigned int len)
{
    if (len > skb_headlen(skb) &&
        !__pskb_pull_tail(skb, len - skb_headlen(skb)))
        return NULL;
    skb->len -= len;
    return skb->data += len;
}

static inline unsigned char *pskb_pull(struct sk_buff *skb, unsigned int len)
{
    return unlikely(len > skb->len) ? NULL : __pskb_pull(skb, len);
}

static inline int pskb_may_pull(struct sk_buff *skb, unsigned int len)
{
    if (likely(len <= skb_headlen(skb)))
        return 1;
    if (unlikely(len > skb->len))
        return 0;
    return __pskb_pull_tail(skb, len - skb_headlen(skb)) != NULL;
}

/**
 *  skb_headroom - bytes at buffer head
 *  @skb: buffer to check
 *
 *  Return the number of bytes of free space at the head of an &sk_buff.
 */
static inline unsigned int skb_headroom(const struct sk_buff *skb)
{
    return skb->data - skb->head;
}

/**
 *  skb_tailroom - bytes at buffer end
 *  @skb: buffer to check
 *
 *  Return the number of bytes of free space at the tail of an sk_buff
 */
static inline int skb_tailroom(const struct sk_buff *skb)
{
    return skb_is_nonlinear(skb) ? 0 : skb->end - skb->tail;
}

/**
 *  skb_reserve - adjust headroom
 *  @skb: buffer to alter
 *  @len: bytes to move
 *
 *  Increase the headroom of an empty &sk_buff by reducing the tail
 *  room. This is only allowed for an empty buffer.
 */
static inline void skb_reserve(struct sk_buff *skb, int len)
{
    skb->data += len;
    skb->tail += len;
}
One of sk_buff's core ideas is to pre-allocate enough memory up front for a packet's headers, payload, and any padding. While a packet travels through the stack, stripping or prepending headers is done simply by moving pointers rather than repeatedly copying the data to new memory. Since sending and receiving packets are the hottest operations in the network stack, this optimization matters enormously.
nginx's small-block allocation likewise hands out memory by coordinating a pair of pointers (last and end), which shares much of the same idea as sk_buff.
Small-block allocation
The small-block flow is: starting from the current pool, walk the pool chain looking for one with enough free space; if one is found, allocate from it (advance the last pointer and return its pre-advance, aligned value). Otherwise allocate a fresh block and link it onto the pool chain.
Revisit ngx_palloc, this time focusing on the small-block path:
void *
ngx_palloc(ngx_pool_t *pool, size_t size)
{
    u_char      *m;
    ngx_pool_t  *p;

    if (size <= pool->max) {
        p = pool->current;

        /* starting from current, walk every pool looking for enough free space */
        do {
            m = ngx_align_ptr(p->d.last, NGX_ALIGNMENT);

            /* enough room: return the start of the free space and advance last */
            if ((size_t) (p->d.end - m) >= size) {
                p->d.last = m + size;
                return m;
            }

            p = p->d.next;
        } while (p);

        /* no suitable pool found: allocate a new block */
        return ngx_palloc_block(pool, size);
    }

    return ngx_palloc_large(pool, size);
}
And the implementation of ngx_palloc_block:
static void *
ngx_palloc_block(ngx_pool_t *pool, size_t size)
{
    u_char      *m;
    size_t       psize;
    ngx_pool_t  *p, *new, *current;

    psize = (size_t) (pool->d.end - (u_char *) pool);

    m = ngx_memalign(NGX_POOL_ALIGNMENT, psize, pool->log); /* roughly: allocate another block the same size as the original pool */
    if (m == NULL) {
        return NULL;
    }

    new = (ngx_pool_t *) m;

    new->d.end = m + psize;
    new->d.next = NULL;
    new->d.failed = 0;

    m += sizeof(ngx_pool_data_t);  /* later blocks only need the smaller ngx_pool_data_t header */
    m = ngx_align_ptr(m, NGX_ALIGNMENT);
    new->d.last = m + size;

    /* starting from pool->current, look for the next suitable current */
    current = pool->current;

    for (p = current; p->d.next; p = p->d.next) {
        if (p->d.failed++ > 4) {
            current = p->d.next;   /* after several failed allocations this pool is too full to
                                      stay first in the search order, so current moves forward */
        }
    }                              /* note: this loop always walks to the end of the chain */

    /* append the newly allocated block to the end of the chain */
    p->d.next = new;

    pool->current = current ? current : new;

    return m;
}
A brief comparison with HAProxy's pool management
Together with the earlier post 《HAProxy内存池实现源码分析》 (a source analysis of HAProxy's memory pool implementation), we can see that the nginx and haproxy memory pools each have their strengths, and we can borrow from either depending on our application's needs. The comparison below ignores code elegance.
Since I have not analyzed the full nginx and haproxy source trees in detail, the observations below may contain errors or inaccuracies.
Strengths of nginx's pool strategy:
1) supports both large-block and small scattered allocations;
2) small scattered allocations are fast;
3) for variable-sized requests it wastes almost nothing to fragmentation.
Weaknesses:
1) the implementation is relatively complex;
2) large blocks are poorly reused, so frequently used large blocks still pay significant system-call overhead.
Strengths of haproxy's pool strategy:
1) large blocks are allocated and released quickly and reused well;
2) the large-block list needs no extra management space (no equivalent of nginx's large node);
3) the implementation is simple.
Weaknesses:
1) small pieces of memory cannot be allocated from the pool;
2) because pools are fixed-size and grow one chunk at a time, fragmentation waste can be higher.
References
http://www.oschina.net/question/234345_42068
http://www.tbdata.org/archives/1390
http://simohayha.iteye.com/blog/545192