CPU Microarchitecture Resources and Monitoring

(To be continued; updated on an ongoing basis)

Contents

PMU

perf map

Cache types

TLB

TLB structure and hierarchy

Task switching

Perf monitoring

Cache

Structure and hierarchy

Perf monitoring


PMU

perf map

perf list shows the generic hardware events.

How do these map to Intel PMU events? See the code in arch/x86/events/intel/core.c:

intel_pmu_init()
---
	case INTEL_FAM6_SKYLAKE_MOBILE:
	case INTEL_FAM6_SKYLAKE_DESKTOP:
	case INTEL_FAM6_SKYLAKE_X:
	case INTEL_FAM6_KABYLAKE_MOBILE:
	case INTEL_FAM6_KABYLAKE_DESKTOP:
		x86_pmu.late_ack = true;
		memcpy(hw_cache_event_ids, skl_hw_cache_event_ids, sizeof(hw_cache_event_ids));
		...
		name = "skylake";
	...
	}
	snprintf(pmu_name_str, sizeof(pmu_name_str), "%s", name);
---
  1. Check the Intel microarchitecture version. For the history of the microarchitectures, see List of Intel CPU microarchitectures (https://en.wikipedia.org/wiki/List_of_Intel_CPU_microarchitectures). To check: cat /sys/devices/cpu/caps/pmu_name; for example, on an Intel(R) Xeon(R) Gold 5218 CPU @ 2.30GHz it prints skylake;
  2. Look at skl_hw_cache_event_ids, for example the iTLB-related entries:

For the exact meaning of each event, see Intel PMU Hardware Event: https://perfmon-events.intel.com/skylake.html

The event descriptions also use the word "retired"; for its meaning, see "performance - What does Intel mean by "retired"? - Stack Overflow":

In the context "retired" means: the instruction (microoperation, μop) leaves the "Retirement Unit". It means that in Out-of-order CPU pipeline the instruction is finally executed and its results are correct and visible in the architectural state as if they execute in-order. In performance context this is the number you should check to compute how many instructions were really executed (with useful output).

Cache types

http://www.cs.uni.edu/~diesburg/courses/cs3430_sp14/sessions/s14/s14_caching_and_tlbs.pdf

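Cache organization:

The three organizations (direct-mapped, set-associative, fully associative) can be thought of intuitively as an array, a hash table, and a linked list, respectively.

For example: 32 KB, 8-way set associative, 64-byte line size

  • Total cache capacity is 32 KB
  • Each set holds 8 * 64 bytes = 512 bytes
  • There are 2^15 / 2^9 = 2^6 = 64 sets in total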
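The arithmetic above can be checked with a short script. This is a minimal sketch, not taken from the original post: the helper names cache_geometry and split_address are hypothetical, and it assumes the line size and the number of sets are powers of two so that the address splits cleanly into tag, set index, and line offset.

```python
def cache_geometry(size_bytes, ways, line_bytes):
    """Return (set_bytes, num_sets, offset_bits, index_bits).

    Assumes line_bytes and the resulting number of sets are powers of two.
    """
    set_bytes = ways * line_bytes
    if size_bytes % set_bytes:
        raise ValueError("cache size is not a multiple of ways * line size")
    num_sets = size_bytes // set_bytes
    offset_bits = line_bytes.bit_length() - 1  # log2 of a power of two
    index_bits = num_sets.bit_length() - 1
    return set_bytes, num_sets, offset_bits, index_bits


def split_address(addr, offset_bits, index_bits):
    """Split an address into (tag, set index, byte offset within the line)."""
    offset = addr & ((1 << offset_bits) - 1)
    index = (addr >> offset_bits) & ((1 << index_bits) - 1)
    tag = addr >> (offset_bits + index_bits)
    return tag, index, offset


# 32 KB, 8-way set associative, 64-byte lines
set_bytes, num_sets, offset_bits, index_bits = cache_geometry(32 * 1024, 8, 64)
print(set_bytes, num_sets, offset_bits, index_bits)
print(split_address(0x12345, offset_bits, index_bits))
```

With 64 sets and 64-byte lines, bits 5:0 of the address select the byte within the line, bits 11:6 select the set, and the remaining upper bits form the tag that is compared against the 8 ways of that set.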

In addition, a cache hierarchy can be inclusive or exclusive; see this passage from Section 3.2 of "Memory part 2: CPU caches" [LWN.net]:

To be able to load new data in a cache it is almost always first necessary to make room in the cache. An eviction from L1d pushes the cache line down into L2 (which uses the same cache line size). This of course means room has to be made in L2. This in turn might push the content into L3 and ultimately into main memory. Each eviction is progressively more expensive. What is described here is the model for an exclusive cache as is preferred by modern AMD and VIA processors. Intel implements inclusive caches {This generalization is not completely correct. A few caches are exclusive and some inclusive caches have exclusive cache properties.} where each cache line in L1d is also present in L2. Therefore evicting from L1d is much faster. With enough L2 cache the disadvantage of wasting memory for content held in two places is minimal and it pays off when evicting. A possible advantage of an exclusive cache is that loading a new cache line only has to touch the L1d and not the L2, which could be faster.

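In summary, the relationship between adjacent cache levels falls into two kinds:

  • inclusive: every line held in the upper level is also held in the lower level
  • exclusive: the lower level acts only as a victim cache for the upper level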
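The difference between the two policies can be illustrated with a toy two-level model. This is only a sketch under stated assumptions: TwoLevelCache is a hypothetical class, both levels are fully associative with LRU replacement (real hardware uses neither), and capacities are counted in cache lines.

```python
from collections import OrderedDict


class TwoLevelCache:
    """Toy L1/L2 model; LRU replacement is an assumption of this sketch."""

    def __init__(self, l1_lines, l2_lines, inclusive):
        self.l1 = OrderedDict()
        self.l2 = OrderedDict()
        self.l1_lines = l1_lines
        self.l2_lines = l2_lines
        self.inclusive = inclusive

    @staticmethod
    def _insert(level, capacity, line):
        """Make `line` most recently used; return the evicted line, if any."""
        level[line] = True
        level.move_to_end(line)
        if len(level) > capacity:
            victim, _ = level.popitem(last=False)
            return victim
        return None

    def access(self, line):
        if line in self.l1:
            self.l1.move_to_end(line)
            return "L1 hit"
        in_l2 = line in self.l2
        if self.inclusive:
            # The line lives in both levels; an L2 eviction must also
            # remove the line from L1 to keep L1 a subset of L2.
            l2_victim = self._insert(self.l2, self.l2_lines, line)
            if l2_victim is not None:
                self.l1.pop(l2_victim, None)
            # The L1 victim is simply dropped: L2 still holds a copy.
            self._insert(self.l1, self.l1_lines, line)
        else:
            # The line moves from L2 to L1; the L1 victim is pushed down
            # into L2, which therefore acts as a victim cache.
            self.l2.pop(line, None)
            l1_victim = self._insert(self.l1, self.l1_lines, line)
            if l1_victim is not None:
                self._insert(self.l2, self.l2_lines, l1_victim)
        return "L2 hit" if in_l2 else "miss"


inc = TwoLevelCache(2, 4, inclusive=True)
exc = TwoLevelCache(2, 4, inclusive=False)
for name in "abca":
    print(name, inc.access(name), exc.access(name))
```

In this model the inclusive hierarchy holds at most 4 distinct lines (the L2 capacity), while the exclusive one can hold up to 6 (L1 plus L2), at the cost of moving a line on every L1 eviction.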

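TLB

TLB structure and hierarchy

I could not find official Xeon documentation for the TLB entry format, but the Nios II documentation, section 3.2.4 "TLB Organization", can serve as a reference.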

TLB Tag Format

Field Name | Description
VPN | VPN is the virtual page number field. This field is compared with the top 20 bits of the virtual address.
PID | PID is the process identifier field. This field is compared with the value of the current process identifier stored in the tlbmisc control register, effectively extending the virtual address. The field size is configurable in the Nios_II Processor parameter editor, and can be between 8 and 14 bits.
G | G is the global flag. When G = 1, the PID is ignored in the TLB lookup.

TLB Data Format

Field Name | Description
PFN | PFN is the physical frame number field. This field specifies the upper bits of the physical address. The size of this field depends on the range of physical addresses present in the system. The maximum size is 20 bits.
C | C is the cacheable flag. Determines the default data cacheability of a page. Can be overridden for data accesses using I/O load and store family of Nios II instructions.
R | R is the readable flag. Allows load instructions to read a page.
W | W is the writable flag. Allows store instructions to write a page.
X | X is the executable flag. Allows instruction fetches from a page.

Task switching

PID and G deserve particular attention: they determine whether the TLB has to be invalidated when the context is switched.
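Why the PID and G fields matter can be seen in a small lookup model based on the Nios II tag format quoted above. This is a sketch, not real hardware behavior: TlbEntry and tlb_lookup are hypothetical names, and the TLB is modeled as a plain list that is searched linearly.

```python
from dataclasses import dataclass


@dataclass
class TlbEntry:
    vpn: int   # virtual page number (tag)
    pid: int   # process identifier the entry was created under (tag)
    g: bool    # global flag: when set, pid is ignored in the lookup
    pfn: int   # physical frame number (data)


def tlb_lookup(entries, vpn, current_pid):
    """Return the PFN for vpn, or None on a TLB miss."""
    for entry in entries:
        if entry.vpn == vpn and (entry.g or entry.pid == current_pid):
            return entry.pfn
    return None


tlb = [
    TlbEntry(vpn=0x10, pid=1, g=False, pfn=0xA0),  # private page of process 1
    TlbEntry(vpn=0x20, pid=1, g=True, pfn=0xB0),   # global page
]

print(tlb_lookup(tlb, 0x10, current_pid=1))  # matches while process 1 runs
print(tlb_lookup(tlb, 0x10, current_pid=2))  # no match after a switch
print(tlb_lookup(tlb, 0x20, current_pid=2))  # global entry matches anyway
```

Because stale entries of process 1 can never match while process 2 runs, a switch only needs to change the current PID instead of invalidating the whole TLB, and global entries stay usable across all processes.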

Reference:

Intel® 64 and IA-32 Architectures Software Developer’s Manual

Volume 3A: System Programming Guide, Part 1 September 2016

4.10.1 Process-Context Identifiers (PCIDs)

Process-context identifiers (PCIDs) are a facility by which a logical processor may cache information for multiple linear-address spaces. The processor may retain cached information when software switches to a different linear-address space with a different PCID (e.g., by loading CR3; see Section 4.10.4.1 for details).

A PCID is a 12-bit identifier. Non-zero PCIDs are enabled by setting the PCIDE flag (bit 17) of CR4. If CR4.PCIDE = 0, the current PCID is always 000H; otherwise, the current PCID is the value of bits 11:0 of CR3. Not all processors allow CR4.PCIDE to be set to 1.

When a logical processor creates entries in the TLBs (Section 4.10.2) and paging-structure caches (Section 4.10.3), it associates those entries with the current PCID. When using entries in the TLBs and paging-structure caches to translate a linear address, a logical processor uses only those entries associated with the current PCID.

4.10.4.1 Operations that Invalidate TLBs and Paging-Structure Caches

MOV to CR3. The behavior of the instruction depends on the value of CR4.PCIDE:

—  If CR4.PCIDE = 1 and bit 63 of the instruction’s source operand is 0, the instruction invalidates all TLB entries associated with the PCID specified in bits 11:0 of the instruction’s source operand except those for global pages. It also invalidates all entries in all paging-structure caches associated with that PCID. It is not required to invalidate entries in the TLBs and paging-structure caches that are associated with other PCIDs.
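The CR3 layout described in these excerpts can be decoded with a few bit operations. A minimal sketch, assuming CR4.PCIDE = 1; decode_cr3_write is a hypothetical helper. The bit 63 = 1 case is not part of the excerpt above: to my understanding the same SDM section states that the instruction is then not required to invalidate anything, but treat that as an assumption and check the manual.

```python
PCID_MASK = 0xFFF          # bits 11:0 of CR3 hold the PCID when CR4.PCIDE = 1
CR3_NO_INVALIDATE = 1 << 63  # bit 63 of the MOV-to-CR3 source operand


def decode_cr3_write(value):
    """Decode a 64-bit MOV-to-CR3 source operand, assuming CR4.PCIDE = 1.

    Returns (pcid, invalidates), where invalidates tells whether the
    non-global TLB entries tagged with that PCID are invalidated.
    """
    pcid = value & PCID_MASK
    invalidates = not (value & CR3_NO_INVALIDATE)
    return pcid, invalidates


print(decode_cr3_write(0x0000_0001_2345_6003))
print(decode_cr3_write(CR3_NO_INVALIDATE | 0x0000_0001_2345_6003))
```

This is the mechanism behind the last argument of load_new_mm_cr3() in the kernel excerpt further down: the same page-table root and PCID can be loaded with or without invalidating the entries cached under that PCID.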

4.10.4 Invalidation of TLBs and Paging-Structure Caches

As noted in Section 4.10.2 and Section 4.10.3, the processor may create entries in the TLBs and the paging-structure caches when linear addresses are translated, and it may retain these entries even after the paging structures used to create them have been modified. To ensure that linear-address translation uses the modified paging structures, software should take action to invalidate any cached entries that may contain information that has since been modified.

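The global pages mentioned above are marked by one bit (the G flag) in the page-table entry; see the figure below:

(The figure is from the 2006 edition of the manual.)

In the Linux kernel, the relevant code is in:

arch/x86/mm/tlb.c

ASID switching during a context switch: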
switch_mm_irqs_off()
---
	if (real_prev == next) {
		...
	} else {
		u16 new_asid;
		bool need_flush;
		...
		next_tlb_gen = atomic64_read(&next->context.tlb_gen);

		choose_new_asid(next, next_tlb_gen, &new_asid, &need_flush);

		if (need_flush) {
			this_cpu_write(cpu_tlbstate.ctxs[new_asid].ctx_id, next->context.ctx_id);
			this_cpu_write(cpu_tlbstate.ctxs[new_asid].tlb_gen, next_tlb_gen);
			load_new_mm_cr3(next->pgd, new_asid, true);
		} else {
			/* The new ASID is already up to date. */
			load_new_mm_cr3(next->pgd, new_asid, false);
			...
		}
		...
	}
---

There are three key values here:

  • asid: there are only 6 per CPU (TLB_NR_DYN_ASIDS is 6); all tasks scheduled onto that CPU take turns using them
    choose_new_asid()
    ---
    	/*
    	 * We don't currently own an ASID slot on this CPU.
    	 * Allocate a slot.
    	 */
    	*new_asid = this_cpu_add_return(cpu_tlbstate.next_asid, 1) - 1;
    	if (*new_asid >= TLB_NR_DYN_ASIDS) {
    		*new_asid = 0;
    		this_cpu_write(cpu_tlbstate.next_asid, 1);
    	}
    	*need_flush = true;
    ---
    
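  • ctx_id: every task's address space (mm) gets a ctx_id, allocated from a global atomic counter (last_mm_ctx_id). It is used to check whether the mm already holds an ASID on the current CPU, which decides whether a new ASID has to be allocated next; allocating one means invalidating the TLB entries previously associated with that ASID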
    init_new_context()
    ---
    	mm->context.ctx_id = atomic64_inc_return(&last_mm_ctx_id);
    	atomic64_set(&mm->context.tlb_gen, 0);
    ---
    
    
    choose_new_asid()
    ---
    	for (asid = 0; asid < TLB_NR_DYN_ASIDS; asid++) {
    		if (this_cpu_read(cpu_tlbstate.ctxs[asid].ctx_id) !=
    		    next->context.ctx_id)
    			continue;
    
    		*new_asid = asid;
    		*need_flush = (this_cpu_read(cpu_tlbstate.ctxs[asid].tlb_gen) <
    			       next_tlb_gen);
    		return;
    	}
    ---
    
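  • tlb_gen: per-mm, i.e. per address space. Even the TLB entries of the same context may need a flush: flush_tlb_mm_range() bumps the mm's tlb_gen, and a CPU whose recorded tlb_gen for that ASID is older must flush when it switches back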
    flush_tlb_mm_range()
    ---
    	/* This is also a barrier that synchronizes with switch_mm(). */
    	info.new_tlb_gen = inc_mm_tlb_gen(mm);
    	---
    		return atomic64_inc_return(&mm->context.tlb_gen);
    	---
    ---
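How asid, ctx_id, and tlb_gen interact can be replayed in a small Python model of the excerpts above. It is a sketch, not the kernel implementation: CpuTlbState and switch_mm are hypothetical names, locking and per-CPU details are omitted, and the initial next_asid of 1 and the placeholder ctx_id of 0 for empty slots are assumptions of this model.

```python
TLB_NR_DYN_ASIDS = 6


class CpuTlbState:
    """Per-CPU state: one (ctx_id, tlb_gen) record per ASID slot."""

    def __init__(self):
        self.next_asid = 1  # assumption: initial value used by this model
        self.ctxs = [{"ctx_id": 0, "tlb_gen": 0}
                     for _ in range(TLB_NR_DYN_ASIDS)]


def choose_new_asid(cpu, ctx_id, next_tlb_gen):
    """Return (asid, need_flush), mirroring the two excerpts above."""
    # Reuse a slot this mm already owns; flush only if its tlb_gen is stale.
    for asid, slot in enumerate(cpu.ctxs):
        if slot["ctx_id"] != ctx_id:
            continue
        return asid, slot["tlb_gen"] < next_tlb_gen
    # Otherwise take the next slot round-robin; its old entries must go.
    asid = cpu.next_asid
    cpu.next_asid += 1
    if asid >= TLB_NR_DYN_ASIDS:
        asid = 0
        cpu.next_asid = 1
    return asid, True


def switch_mm(cpu, ctx_id, next_tlb_gen):
    """Model of the else-branch in switch_mm_irqs_off()."""
    asid, need_flush = choose_new_asid(cpu, ctx_id, next_tlb_gen)
    if need_flush:
        cpu.ctxs[asid]["ctx_id"] = ctx_id
        cpu.ctxs[asid]["tlb_gen"] = next_tlb_gen
    return asid, need_flush


cpu = CpuTlbState()
print(switch_mm(cpu, ctx_id=10, next_tlb_gen=0))  # first time: new slot, flush
print(switch_mm(cpu, ctx_id=10, next_tlb_gen=0))  # slot still valid: no flush
print(switch_mm(cpu, ctx_id=10, next_tlb_gen=1))  # tlb_gen advanced: flush
```

Running more than six distinct address spaces through the same CPU wraps next_asid around, so an mm that comes back after its slot was recycled is treated as new and flushed again.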
    

Perf monitoring

Consider the following data collected with perf:

perf stat -e dTLB-load-misses,dTLB-loads,dTLB-store-misses,dTLB-stores,iTLB-load-misses,iTLB-loads make O=../out -j20 > /dev/null

 Performance counter stats for 'make O=../out -j20':

    10,095,672,437      dTLB-load-misses          #    0.09% of all dTLB cache hits   (63.78%)
10,869,338,077,000      dTLB-loads                                                    (63.77%)
     3,475,643,439      dTLB-store-misses                                             (63.77%)
 5,408,658,177,811      dTLB-stores                                                   (63.77%)
     8,101,851,811      iTLB-load-misses          #   22.88% of all iTLB cache hits   (63.76%)
    35,402,725,030      iTLB-loads                                                    (63.77%)

     684.556635266 seconds time elapsed

The corresponding Intel PMU events are:

  • dTLB-loads: MEM_INST_RETIRED.ALL_LOADS, All retired load instructions; the count used here is all load instructions that retired
  • dTLB-load-misses: DTLB_LOAD_MISSES.WALK_COMPLETED, Counts completed page walks (all page sizes) caused by demand data loads. This implies it missed in the DTLB and further levels of TLB
  • dTLB-stores: MEM_INST_RETIRED.ALL_STORES, All retired store instructions
  • dTLB-store-misses: DTLB_STORE_MISSES.WALK_COMPLETED, Counts completed page walks (all page sizes) caused by demand data stores. This implies it missed in the DTLB and further levels of TLB
  • iTLB-loads: ITLB_MISSES.STLB_HIT, Instruction fetch requests that miss the ITLB and hit the STLB
  • iTLB-load-misses: ITLB_MISSES.WALK_COMPLETED, Counts completed page walks (all page sizes) caused by a code fetch. This implies it missed in the ITLB (Instruction TLB) and further levels of TLB
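Given this mapping, the percentages perf prints can be recomputed and interpreted. The sketch below uses the counts from the output above. Since iTLB-loads maps to ITLB_MISSES.STLB_HIT, the 22.88% figure is completed page walks divided by STLB hits, not a miss rate over all instruction fetches. The last ratio additionally assumes that STLB_HIT and WALK_COMPLETED count disjoint cases, which is my reading of the event descriptions rather than something the source states.

```python
# Counts copied from the perf stat output above.
counts = {
    "dTLB-load-misses": 10_095_672_437,
    "dTLB-loads": 10_869_338_077_000,
    "dTLB-store-misses": 3_475_643_439,
    "dTLB-stores": 5_408_658_177_811,
    "iTLB-load-misses": 8_101_851_811,
    "iTLB-loads": 35_402_725_030,
}


def percent(numerator, denominator):
    """Return numerator / denominator as a percentage."""
    return 100.0 * numerator / denominator


# Completed page walks per retired load / store instruction.
dtlb_load = percent(counts["dTLB-load-misses"], counts["dTLB-loads"])
dtlb_store = percent(counts["dTLB-store-misses"], counts["dTLB-stores"])

# What perf prints for the iTLB: walks completed relative to STLB hits.
itlb_as_printed = percent(counts["iTLB-load-misses"], counts["iTLB-loads"])

# Assumption: the two iTLB events are disjoint, so their sum approximates
# all ITLB misses and this is the share of them that needed a page walk.
itlb_walk_share = percent(
    counts["iTLB-load-misses"],
    counts["iTLB-load-misses"] + counts["iTLB-loads"],
)

print(f"dTLB load walks per retired load:   {dtlb_load:.2f}%")
print(f"dTLB store walks per retired store: {dtlb_store:.2f}%")
print(f"iTLB walks / STLB hits (as printed): {itlb_as_printed:.2f}%")
print(f"iTLB misses that needed a walk:      {itlb_walk_share:.2f}%")
```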

Cache

Structure and hierarchy

Again taking Intel Skylake as the example, see the following document:

Skylake (server) - Microarchitectures - Intel - WikiChip

Its cache hierarchy is:

  • L1I Cache:
    • 32 KiB/core, 8-way set associative
    • 64 sets, 64 B line size
    • competitively shared by the threads/core
  • L1D Cache:
    • 32 KiB/core, 8-way set associative
    • 64 sets, 64 B line size
    • competitively shared by threads/core
    • 4 cycles for fastest load-to-use (simple pointer accesses)
      • 5 cycles for complex addresses
    • 128 B/cycle load bandwidth
    • 64 B/cycle store bandwidth
    • Write-back policy
  • L2 Cache:
    • 1 MiB/core, 16-way set associative
    • 64 B line size
    • Inclusive
    • 64 B/cycle bandwidth to L1$
    • Write-back policy
    • 14 cycles latency
  • L3 Cache:
    • 1.375 MiB/core, 11-way set associative, shared across all cores
    • 2,048 sets, 64 B line size
    • Non-inclusive victim cache
    • Write-back policy
    • 50-70 cycles latency

We see that L2 is inclusive while L3 is non-inclusive. What does that mean? See: Skylake Processors - HECC Knowledge Base

 An inclusive L3 cache guarantees that every block that exists in the L2 cache also exists in the L3 cache. A non-inclusive L3 cache does not guarantee this.

A larger L2 cache increases the hit rate into the L2 cache, resulting in lower effective memory latency and lower demand on the mesh interconnect and L3 cache.

If the processor has a miss on all the levels of the cache, it fetches the line from memory and puts it directly into the L2 cache of the requesting core, rather than putting a copy into both the L2 and L3 caches, as is done on Broadwell. When the cache line is evicted from the L2 cache, it is placed into L3 if it is expected to be reused.

Due to the non-inclusive nature of the L3 cache, the absence of a cache line in L3 does not indicate that the line is absent in private caches of any of the cores. Therefore, a snoop filter is used to keep track of the location of cache lines in the L1 or L2 caches of cores when a cache line is not allocated in L3. On the previous-generation processors, the shared L3 itself takes care of this task.

As the L3 description above says, it is a non-inclusive victim cache.

A similar description can also be found at https://www.quora.com/Why-did-Intel-and-AMD-add-an-additional-layer-of-L3-cache-to-share-among-CPU-cores-and-why-couldnt-they-just-expand-and-share-the-L2-cache

Perf monitoring

Which hardware events do the perf cache events map to?

 Performance counter stats for 'make O=../out -j20':

   586,724,859,432      L1-dcache-load-misses     #    5.40% of all L1-dcache hits    (95.80%)
10,872,724,488,141      L1-dcache-loads                                               (95.85%)
 5,408,474,346,992      L1-dcache-stores                                              (95.84%)
   930,699,520,202      L1-icache-load-misses                                         (95.79%)

     725.217684983 seconds time elapsed

  • L1-dcache-loads: MEM_INST_RETIRED.ALL_LOADS, All retired load instructions; the count used here is all load instructions that retired, the same event as dTLB-loads
  • L1-dcache-load-misses: L1D.REPLACEMENT, Counts L1D data line replacements including opportunistic replacements, and replacements that require stall-for-replace or block-for-replace
  • L1-dcache-stores: MEM_INST_RETIRED.ALL_STORES, All retired store instructions
  • L1-icache-load-misses: ICACHE_64B.MISS, Instruction fetch tag lookups that miss in the instruction cache (L1I). Counts at 64-byte cache-line granularity.
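The counters can also be pulled out of the perf stat text programmatically. This is a minimal sketch: parse_perf_stat is a hypothetical helper, the regular expression only targets the layout shown above, and the sample is the output quoted in this section. Because L1-dcache-load-misses maps to L1D.REPLACEMENT (line replacements) while L1-dcache-loads counts retired load instructions, the resulting ratio is best read as an approximation, not an exact per-load miss rate.

```python
import re

SAMPLE = """\
 Performance counter stats for 'make O=../out -j20':

   586,724,859,432      L1-dcache-load-misses     #    5.40% of all L1-dcache hits    (95.80%)
10,872,724,488,141      L1-dcache-loads                                               (95.85%)
 5,408,474,346,992      L1-dcache-stores                                              (95.84%)
   930,699,520,202      L1-icache-load-misses                                         (95.79%)

     725.217684983 seconds time elapsed
"""

# A counter line is: optional spaces, a comma-grouped integer, an event name.
COUNTER_LINE = re.compile(r"^\s*([\d,]+)\s+([A-Za-z][\w.:-]*)")


def parse_perf_stat(text):
    """Return {event name: count} for every counter line in text."""
    counters = {}
    for line in text.splitlines():
        match = COUNTER_LINE.match(line)
        if match:
            counters[match.group(2)] = int(match.group(1).replace(",", ""))
    return counters


counters = parse_perf_stat(SAMPLE)
ratio = 100.0 * counters["L1-dcache-load-misses"] / counters["L1-dcache-loads"]
print(counters)
print(f"L1D replacements per retired load: {ratio:.2f}%")
```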
