Reading Cache Information with CPUID


weixin_33892359


Original link: http://www.cnblogs.com/long123king/p/3522717.html


 
void cpuidTest() 
{ 
    u32 val_eax, val_ebx, val_ecx, val_edx;  
    asm("cpuid" 
            : "=a" (val_eax), 
              "=b" (val_ebx), 
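              "=d" (val_ecx),   /* as written, val_ecx receives EDX; the article corrects this further down */
              "=c" (val_edx)    /* and val_edx receives ECX */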
            : "a" (2)); 
   
    printk("eax: 0x%08X\n", val_eax); 
    printk("ebx: 0x%08X\n", val_ebx); 
    printk("ecx: 0x%08X\n", val_ecx); 
    printk("edx: 0x%08X\n", val_edx); 
} 
  

The output is:

 
[190894.986103] ################################################################### 
[190894.986109] eax: 0x76035A01 
[190894.986110] ebx: 0x00F0B0FF 
[190894.986111] ecx: 0x00CA0000 
[190894.986112] edx: 0x00000000 
[190894.986951] ################################################################### 
  

Decoding the valid descriptors:

 
76H: TLB Instruction TLB: 2M/4M pages, fully associative, 8 entries  
03H: TLB Data TLB: 4 KByte pages, 4-way set associative, 64 entries 
5AH: TLB Data TLB0: 2-MByte or 4 MByte pages, 4-way set associative, 32 entries 
F0H: Prefetch 64-Byte prefetching 
B0H: TLB Instruction TLB: 4 KByte pages, 4-way set associative, 128 entries 
FFH: General CPUID leaf 2 does not report cache descriptor information, use CPUID leaf 4 to query cache parameters 
CAH: STLB Shared 2nd-Level TLB: 4 KByte pages, 4-way associative, 512 entries 
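
The byte-by-byte decoding above can be sketched in plain C. This is a minimal illustration rather than the kernel's implementation: the function name `leaf2_descriptors` and the output order (registers in EAX, EBX, ECX, EDX order, low byte first) are choices made for this example. It skips the low byte of EAX (the iteration count), any register with bit 31 set, and null (0x00) bytes.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Collect the non-zero descriptor bytes from one CPUID leaf 2 result.
 * regs[] holds four register values; out[] needs room for 15 bytes.
 * Returns the number of descriptors written to out[].
 * (Hypothetical helper for illustration, not a kernel function.)
 */
static size_t leaf2_descriptors(const uint32_t regs[4], uint8_t out[15])
{
    size_t n = 0;

    for (int r = 0; r < 4; r++) {
        uint32_t v = regs[r];

        /* Bit 31 set means the register holds no valid descriptors. */
        if (v & 0x80000000u)
            continue;

        for (int b = 0; b < 4; b++) {
            uint8_t des = (uint8_t)(v >> (8 * b));

            /* The low byte of the first register (EAX) is the iteration count. */
            if (r == 0 && b == 0)
                continue;

            /* 0x00 is the null descriptor. */
            if (des != 0)
                out[n++] = des;
        }
    }
    return n;
}
```

Fed the four values printed above (0x76035A01, 0x00F0B0FF, 0x00CA0000, 0x00000000), it yields the seven descriptors 5A, 03, 76, FF, B0, F0, CA. The set of descriptors does not depend on which of the last two values is really ECX and which is EDX.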
  

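As the FFH descriptor says,

General CPUID leaf 2 does not report cache descriptor information, use CPUID leaf 4 to query cache parameters

leaf 2 returns no cache information here, only TLB (and prefetch) descriptors. To learn about the caches, use 4 as the EAX input.

Let's rework the code to read the cache information: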

 
void cpuidTest() 
{ 
    u32 val_eax, val_ebx, val_ecx, val_edx;  
    asm("cpuid" 
            : "=a" (val_eax), 
              "=b" (val_ebx), 
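              "=d" (val_ecx),   /* as written, val_ecx receives EDX; the article corrects this further down */
              "=c" (val_edx)    /* and val_edx receives ECX */
            : "a" (4), "c"(1)); /* EAX = leaf 4, ECX = subleaf (cache index) */
   
    u32 ways,partitions,line_Size, sets; 
   
    ways = val_ebx >> 22;                  /* EBX[31:22] = ways - 1 */
    partitions = (val_ebx >> 12) & 0x3FF;  /* EBX[21:12] = partitions - 1 */
    line_Size = (val_ebx) & 0xFFF;         /* EBX[11:0]  = line size - 1 */
    sets = val_ecx;                        /* ECX        = sets - 1 */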
   
    printk("eax: 0x%08X\n", val_eax); 
    printk("ebx: 0x%08X\n", val_ebx); 
    printk("ecx: 0x%08X\n", val_ecx); 
    printk("edx: 0x%08X\n", val_edx); 
   
    printk("ways: %d\n", ways+1); 
    printk("partitions: %d\n", partitions+1); 
    printk("line_size: %d\n", line_Size+1); 
    printk("sets: %d\n", sets+1); 
    printk("Cache L1 size: %d\n", (ways + 1)*(partitions + 1)*(line_Size + 1)*(sets + 1)); 
} 
  

The result:

 
[193334.815202] ################################################################### 
[193334.815206] eax: 0x00000021 
[193334.815207] ebx: 0x01C0003F 
[193334.815208] ecx: 0x00000000 
[193334.815209] edx: 0x0000003F 
[193334.815209] ways: 8 
[193334.815210] partitions: 1 
[193334.815211] line_size: 64 
[193334.815211] sets: 1 
[193334.815212] Cache L1 size: 512 
[193334.815672] ################################################################### 
  

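From this it looks as if the L1 cache is fully associative: a single cache set with 8 ways, i.e. 8 cache lines of 64 bytes each, for 512 bytes of cache in total. (This reading turns out to be wrong; the bug behind it is tracked down later in the article.)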

How does Linux read this information?

 
daniel@ubuntu:/mod/pslist$ cat /proc/cpuinfo 
processor    : 0 
vendor_id    : GenuineIntel 
cpu family    : 6 
model        : 42 
model name    : Intel(R) Core(TM) i5-2500 CPU @ 3.30GHz 
stepping    : 7 
cpu MHz        : 3269.310 
cache size    : 6144 KB 
fdiv_bug    : no 
hlt_bug        : no 
f00f_bug    : no 
coma_bug    : no 
fpu        : yes 
fpu_exception    : yes 
cpuid level    : 5 
wp        : yes 
flags        : fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx rdtscp lm constant_tsc up pni monitor ssse3 lahf_lm 
bogomips    : 6538.62 
clflush size    : 64 
cache_alignment    : 64 
address sizes    : 36 bits physical, 48 bits virtual 
power management: 
  

 
static int show_cpuinfo(struct seq_file *m, void *v) 
{ 
    struct cpuinfo_x86 *c = v; 
    unsigned int cpu; 
    int i; 
   
****** 
/* Cache size */ 
if (c->x86_cache_size >= 0) 
    seq_printf(m, "cache size\t: %d KB\n", c->x86_cache_size); 
****** 
} 
  

 
unsigned int __cpuinit init_intel_cacheinfo(struct cpuinfo_x86 *c) 
{ 
    /* Cache sizes */ 
    unsigned int trace = 0, l1i = 0, l1d = 0, l2 = 0, l3 = 0; 
    unsigned int new_l1d = 0, new_l1i = 0; /* Cache sizes from cpuid(4) */ 
    unsigned int new_l2 = 0, new_l3 = 0, i; /* Cache sizes from cpuid(4) */ 
    unsigned int l2_id = 0, l3_id = 0, num_threads_sharing, index_msb; 
#ifdef CONFIG_X86_HT 
    unsigned int cpu = c->cpu_index; 
#endif 
   
    if (c->cpuid_level > 3) { 
        static int is_initialized; 
   
        if (is_initialized == 0) { 
            /* Init num_cache_leaves from boot CPU */ 
            num_cache_leaves = find_num_cache_leaves(); 
            is_initialized++; 
        } 
   
        /* 
         * Whenever possible use cpuid(4), deterministic cache 
         * parameters cpuid leaf to find the cache details 
         */ 
        for (i = 0; i < num_cache_leaves; i++) { 
            struct _cpuid4_info_regs this_leaf; 
            int retval; 
   
            retval = cpuid4_cache_lookup_regs(i, &this_leaf); 
            if (retval >= 0) { 
                switch (this_leaf.eax.split.level) { 
                case 1: 
                    if (this_leaf.eax.split.type == 
                            CACHE_TYPE_DATA) 
                        new_l1d = this_leaf.size/1024; 
                    else if (this_leaf.eax.split.type == 
                            CACHE_TYPE_INST) 
                        new_l1i = this_leaf.size/1024; 
                    break; 
                case 2: 
                    new_l2 = this_leaf.size/1024; 
                    num_threads_sharing = 1 + this_leaf.eax.split.num_threads_sharing; 
                    index_msb = get_count_order(num_threads_sharing); 
                    l2_id = c->apicid >> index_msb; 
                    break; 
                case 3: 
                    new_l3 = this_leaf.size/1024; 
                    num_threads_sharing = 1 + this_leaf.eax.split.num_threads_sharing; 
                    index_msb = get_count_order( 
                            num_threads_sharing); 
                    l3_id = c->apicid >> index_msb; 
                    break; 
                default: 
                    break; 
                } 
            } 
        } 
    } 
    /* 
     * Don't use cpuid2 if cpuid4 is supported. For P4, we use cpuid2 for 
     * trace cache 
     */ 
    if ((num_cache_leaves == 0 || c->x86 == 15) && c->cpuid_level > 1) { 
        /* supports eax=2  call */ 
        int j, n; 
        unsigned int regs[4]; 
        unsigned char *dp = (unsigned char *)regs; 
        int only_trace = 0; 
   
        if (num_cache_leaves != 0 && c->x86 == 15) 
            only_trace = 1; 
   
        /* Number of times to iterate */ 
        n = cpuid_eax(2) & 0xFF; 
   
        for (i = 0 ; i < n ; i++) { 
            cpuid(2, &regs[0], &regs[1], &regs[2], &regs[3]); 
   
            /* If bit 31 is set, this is an unknown format */ 
            for (j = 0 ; j < 3 ; j++) 
                if (regs[j] & (1 << 31)) 
                    regs[j] = 0; 
   
            /* Byte 0 is level count, not a descriptor */ 
            for (j = 1 ; j < 16 ; j++) { 
                unsigned char des = dp[j]; 
                unsigned char k = 0; 
   
                /* look up this descriptor in the table */ 
                while (cache_table[k].descriptor != 0) { 
                    if (cache_table[k].descriptor == des) { 
                        if (only_trace && cache_table[k].cache_type != LVL_TRACE) 
                            break; 
                        switch (cache_table[k].cache_type) { 
                        case LVL_1_INST: 
                            l1i += cache_table[k].size; 
                            break; 
                        case LVL_1_DATA: 
                            l1d += cache_table[k].size; 
                            break; 
                        case LVL_2: 
                            l2 += cache_table[k].size; 
                            break; 
                        case LVL_3: 
                            l3 += cache_table[k].size; 
                            break; 
                        case LVL_TRACE: 
                            trace += cache_table[k].size; 
                            break; 
                        } 
   
                        break; 
                    } 
   
                    k++; 
                } 
            } 
        } 
    } 
   
    if (new_l1d) 
        l1d = new_l1d; 
   
    if (new_l1i) 
        l1i = new_l1i; 
   
    if (new_l2) { 
        l2 = new_l2; 
#ifdef CONFIG_X86_HT 
        per_cpu(cpu_llc_id, cpu) = l2_id; 
#endif 
    } 
   
    if (new_l3) { 
        l3 = new_l3; 
#ifdef CONFIG_X86_HT 
        per_cpu(cpu_llc_id, cpu) = l3_id; 
#endif 
    } 
   
    c->x86_cache_size = l3 ? l3 : (l2 ? l2 : (l1i+l1d)); 
   
    return l2; 
} 
  

c->x86_cache_size = l3 ? l3 : (l2 ? l2 : (l1i+l1d));
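
That last assignment is what decides the "cache size" line in /proc/cpuinfo: the largest reported level wins, and the L1 sizes are only summed when neither L2 nor L3 was found. A small stand-alone restatement, where `pick_cache_size_kb` is a name invented for this example:

```c
/*
 * Restates the kernel's selection expression as a plain function.
 * All sizes are in KB; 0 means "this level was not reported".
 * (Hypothetical helper for illustration.)
 */
static unsigned int pick_cache_size_kb(unsigned int l1i, unsigned int l1d,
                                       unsigned int l2, unsigned int l3)
{
    if (l3)
        return l3;      /* prefer L3 when present */
    if (l2)
        return l2;      /* otherwise fall back to L2 */
    return l1i + l1d;   /* otherwise the sum of both L1 caches */
}
```

So the 6144 KB shown in /proc/cpuinfo above would be an L3 size under this rule, not a total over all levels.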

 
static inline void native_cpuid(unsigned int *eax, unsigned int *ebx, 
                unsigned int *ecx, unsigned int *edx) 
{ 
    /* ecx is often an input as well as an output. */ 
    asm volatile("cpuid" 
        : "=a" (*eax), 
          "=b" (*ebx), 
          "=c" (*ecx), 
          "=d" (*edx) 
        : "0" (*eax), "2" (*ecx)); 
} 
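
The same enumeration can be tried from user space without writing a kernel module. The sketch below assumes GCC or Clang on x86, whose `<cpuid.h>` header provides `__get_cpuid_max` and the `__cpuid_count` macro; the function name `print_cache_leaves` and the cap of 16 subleaves are choices made for this example. It walks the leaf 4 subleaves until the cache type field (EAX[4:0]) reads 0.

```c
#include <stddef.h>
#include <stdio.h>
#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>
#endif

/*
 * Print every cache reported by CPUID leaf 4 and return how many were found.
 * Returns 0 on non-x86 builds or when leaf 4 is not available.
 * (Hypothetical helper for illustration.)
 */
static int print_cache_leaves(void)
{
    int count = 0;
#if defined(__x86_64__) || defined(__i386__)
    unsigned int eax, ebx, ecx, edx;

    if (__get_cpuid_max(0, NULL) < 4)
        return 0;

    /* The cap of 16 subleaves is an arbitrary safety bound for this sketch. */
    for (unsigned int sub = 0; sub < 16; sub++) {
        __cpuid_count(4, sub, eax, ebx, ecx, edx);
        (void)edx;

        unsigned int type = eax & 0x1F;        /* 0 means no more caches */
        if (type == 0)
            break;

        unsigned int level = (eax >> 5) & 0x7; /* 1 = L1, 2 = L2, ... */
        unsigned long long size =
            (unsigned long long)((ebx >> 22) + 1) *   /* ways */
            (((ebx >> 12) & 0x3FF) + 1) *             /* partitions */
            ((ebx & 0xFFF) + 1) *                     /* line size */
            ((unsigned long long)ecx + 1);            /* sets */

        printf("subleaf %u: L%u type %u, %llu KB\n",
               sub, level, type, size / 1024);
        count++;
    }
#endif
    return count;
}
```

Calling `print_cache_leaves()` from `main` prints one line per cache. On processors that do not implement leaf 4 in this form the loop may simply report no caches.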
  

Why is x86_cache_size 6144 KB, when our code reported 512 bytes for L1, 512 bytes for L2 and 1536 bytes for L3?

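The program above has a bug: the "c" and "d" output constraints were swapped, so val_ecx received EDX and val_edx received ECX. With the constraints paired correctly:

asm("cpuid"
        : "=a" (val_eax),
          "=b" (val_ebx),
          "=c" (val_ecx),   /* ECX now lands in val_ecx */
          "=d" (val_edx)    /* EDX now lands in val_edx */
        : "a" (4), "c"(1));

the result becomes: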

 
[261214.698170] ################################################################### 
[261214.698174] eax: 0x00000041 
[261214.698175] ebx: 0x05C0003F 
[261214.698176] ecx: 0x00000FFF 
[261214.698177] edx: 0x00000000 
[261214.698178] ways: 24 
[261214.698178] partitions: 1 
[261214.698179] line_size: 64 
[261214.698180] sets: 4096 
[261214.698181] Cache L3 size: 6144 KB 
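
The arithmetic behind these numbers can be checked in isolation. The sketch below takes the EBX and ECX values of a leaf 4 result and applies the same field layout the article's code uses (each field stores the value minus one); `struct cache_geometry`, `decode_leaf4` and `cache_size_bytes` are names made up for this example.

```c
#include <stdint.h>

/* Hypothetical container for the decoded leaf 4 fields. */
struct cache_geometry {
    uint32_t ways;        /* EBX[31:22] + 1 */
    uint32_t partitions;  /* EBX[21:12] + 1 */
    uint32_t line_size;   /* EBX[11:0]  + 1, in bytes */
    uint32_t sets;        /* ECX        + 1 */
};

static struct cache_geometry decode_leaf4(uint32_t ebx, uint32_t ecx)
{
    struct cache_geometry g;

    g.ways = (ebx >> 22) + 1;
    g.partitions = ((ebx >> 12) & 0x3FF) + 1;
    g.line_size = (ebx & 0xFFF) + 1;
    g.sets = ecx + 1;
    return g;
}

/* Total size = ways * partitions * line size * sets. */
static uint64_t cache_size_bytes(struct cache_geometry g)
{
    return (uint64_t)g.ways * g.partitions * g.line_size * g.sets;
}
```

With EBX = 0x05C0003F and ECX = 0x00000FFF this gives 24 × 1 × 64 × 4096 = 6291456 bytes, i.e. 6144 KB, matching /proc/cpuinfo. Applying it to the first run, and treating the value that was printed as edx (0x3F) as the real ECX, gives 8 × 1 × 64 × 64 = 32768 bytes, i.e. 32 KB rather than 512 bytes.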
  

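Intel's cache-related memory types:

Strong Uncacheable (UC):

All reads and writes appear on the system bus in exactly the order the program issued them; no reordering takes place.

Every possible hardware optimization is disabled, such as speculative memory accesses, page-table walks, and prefetches of speculated branch targets.

This type is useful when device I/O is mapped into the physical address space (memory-mapped I/O), because it guarantees that accesses reach the device strictly as written, with no ambiguity.

Used for ordinary RAM, however, it severely degrades performance.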

Uncacheable (UC-):

Similar to Strong Uncacheable, with one difference:

An Uncacheable region can be overridden to the WC type by programming the MTRRs, whereas a Strong Uncacheable region cannot be overridden.

MTRR stands for Memory Type Range Register.

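Write Combining (WC):

Reads and writes are not cached in the cache, but writes are buffered in the WC buffer, and coherency is not enforced.

Writes to memory may be delayed and combined in the WC buffer to reduce the number of memory accesses.

They are written out to memory only when certain events occur.

WC is mainly used where the order of memory writes does not matter, such as a video frame buffer.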

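Write Through (WT):

Reads are cached.

Writes go straight through to memory.

On a cache hit, the write either updates the cache line or invalidates it.

This policy by itself keeps the cache and memory coherent.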

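Write Back (WB):

Both reads and writes are cached.

Writes accumulate in the cache and reach memory only when a write-back operation is triggered,

or when a cache line is chosen for eviction, in which case its contents must first be written back to memory.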
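
To make the difference between write-through and write-back concrete, here is a toy model that only counts how many write transactions reach memory for a sequence of store addresses. Everything in it is an assumption made for illustration: a direct-mapped cache of 4 lines of 64 bytes, no modelling of reads or line fills, and a final flush of all dirty lines for the write-back case. It is not a description of how any real processor is built.

```c
#include <stddef.h>
#include <stdint.h>

enum policy { WRITE_THROUGH, WRITE_BACK };

#define LINE_SIZE 64u  /* bytes per cache line (assumed) */
#define NUM_LINES 4u   /* lines in the toy direct-mapped cache (assumed) */

/* Count memory write transactions caused by a sequence of store addresses. */
static unsigned count_memory_writes(enum policy p,
                                    const uint32_t *addrs, size_t n)
{
    uint32_t tag[NUM_LINES] = {0};
    int valid[NUM_LINES] = {0};
    int dirty[NUM_LINES] = {0};
    unsigned mem_writes = 0;

    for (size_t i = 0; i < n; i++) {
        uint32_t line = addrs[i] / LINE_SIZE;
        uint32_t idx = line % NUM_LINES;

        if (p == WRITE_THROUGH) {
            mem_writes++;           /* every store goes to memory */
            continue;
        }

        /* Write-back: evicting a dirty line costs one memory write. */
        if (valid[idx] && tag[idx] != line && dirty[idx])
            mem_writes++;

        tag[idx] = line;
        valid[idx] = 1;
        dirty[idx] = 1;             /* the store stays in the cache for now */
    }

    if (p == WRITE_BACK) {
        /* Final flush: each remaining dirty line is written back once. */
        for (uint32_t idx = 0; idx < NUM_LINES; idx++)
            if (valid[idx] && dirty[idx])
                mem_writes++;
    }
    return mem_writes;
}
```

For the store addresses 0x00, 0x08, 0x10, 0x40, 0x48, 0x100 the write-through count is 6 (one per store), while the write-back count is 3: one eviction of the first line plus two lines flushed at the end.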

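Write Protected (WP):

Reads are cached.

Writes are first propagated to the system bus, i.e. written to memory,

and then the cache lines corresponding to the written memory are invalidated on every processor.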

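How DMA affects and depends on the cache

When a DMA transfer reads data from an external device into memory, the cache lines covering that memory must all be invalidated. DMA moves data directly between the device and memory without going through the CPU, and therefore without going through the cache. After the transfer, memory and cache may disagree, so the corresponding cache lines have to be invalidated.

Likewise, before a DMA transfer writes a block of memory out to an external device, the cached contents must first be written back to memory, and only then transferred from memory to the device.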

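Broadly speaking, there are three kinds of cache:

Cache:

This is the cache in the narrow sense, covering the data cache and the instruction cache. L1 is usually split into separate data and instruction caches, while L2 and L3 are unified caches.

For the instruction cache, the CPU already provides good support and optimization, such as branch prediction.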

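TLB: Translation Look-aside Buffer

To avoid repeating the page-mapping walk of the paging mechanism, part of the page directory and page table entries is kept in a buffer inside the CPU. That buffer is the TLB.

It is a cache dedicated to paging.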
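
As a reminder of what the TLB saves, the sketch below splits a linear address into the indices a page walk would use. It assumes the classic 32-bit, two-level scheme with 4 KB pages (10-bit page directory index, 10-bit page table index, 12-bit offset); the machine shown earlier uses 48-bit virtual addresses and more levels, so this is a simplification. `struct va_parts` and `split_va32` are names invented for this example.

```c
#include <stdint.h>

/* Hypothetical container for the three parts of a 32-bit linear address. */
struct va_parts {
    uint32_t pd_index;  /* bits 31..22: index into the page directory */
    uint32_t pt_index;  /* bits 21..12: index into the page table */
    uint32_t offset;    /* bits 11..0:  byte offset inside the 4 KB page */
};

static struct va_parts split_va32(uint32_t va)
{
    struct va_parts p;

    p.pd_index = va >> 22;
    p.pt_index = (va >> 12) & 0x3FF;
    p.offset = va & 0xFFF;
    return p;
}
```

For the address 0xC0101234 this yields page directory index 0x300, page table index 0x101 and offset 0x234. A TLB hit returns the translation for that page directly, so neither table has to be read from memory.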

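Write Buffer

This is the WC buffer mentioned above.

When the CPU writes to memory while the system bus is locked, the data can first be written into a buffer. The cache in the narrow sense can play that role, but for Write Combining memory, which does not use that cache, the WC buffer serves as the CPU's write buffer.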

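The caching method that applies to a region of memory can be controlled in two ways:

1. Through fields in the page-table entries (the PWT, PCD and PAT bits, which select an entry of the Page Attribute Table), the caching method can be specified per page;

2. Through the MTRRs, the caching method can be set for ranges of physical memory rather than individual pages.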
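
The per-page route can be sketched as two small lookups: the three page-table-entry bits form an index from 0 to 7, and that index selects one byte of the IA32_PAT register, whose low three bits encode the memory type. The bit positions (PWT = bit 3, PCD = bit 4, PAT = bit 7 in a 4 KB page-table entry), the type encodings and the power-on value 0x0007040600070406 are quoted from Intel's documentation as best understood here. An operating system may reprogram the PAT, so the reset value is only an example input. The helper names are invented for this sketch.

```c
#include <stdint.h>

/* Memory type encodings used in PAT entries. */
enum mem_type {
    MT_UC = 0,        /* Strong Uncacheable */
    MT_WC = 1,        /* Write Combining */
    MT_WT = 4,        /* Write Through */
    MT_WP = 5,        /* Write Protected */
    MT_WB = 6,        /* Write Back */
    MT_UC_MINUS = 7   /* Uncacheable (UC-) */
};

#define PTE_PWT (1u << 3)
#define PTE_PCD (1u << 4)
#define PTE_PAT (1u << 7)  /* position valid for 4 KB page-table entries */

/* IA32_PAT value after reset (assumed input for this example). */
#define PAT_RESET_VALUE 0x0007040600070406ULL

/* Build the 3-bit PAT index from the PAT, PCD and PWT bits of an entry. */
static unsigned pte_pat_index(uint64_t pte)
{
    return ((pte & PTE_PAT) ? 4u : 0u) |
           ((pte & PTE_PCD) ? 2u : 0u) |
           ((pte & PTE_PWT) ? 1u : 0u);
}

/* Extract the memory type stored in PAT entry `index` (0..7). */
static unsigned pat_entry(uint64_t pat_msr, unsigned index)
{
    return (unsigned)(pat_msr >> (8 * index)) & 0x7u;
}
```

With the reset value, an entry with none of the three bits set maps to WB, PWT alone maps to WT, PCD alone maps to UC-, and PCD together with PWT maps to UC.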

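How ordering appears in the cache and in memory

When a processor executes a program, the changes seen by the cache and the changes seen by memory are not exactly the same, and the instruction sequences they reflect differ as well.

Program ordering is the order seen by the cache, which matches the order in the executing program;

Processor ordering is the order in which accesses appear on the system bus, which may differ from program order.

If the two are the same, the ordering is called strong; otherwise it is called weak.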

Reposted from: https://www.cnblogs.com/long123king/p/3522717.html
