Abstract
The security of computer systems fundamentally relies on memory isolation, e.g., kernel address ranges are marked as non-accessible and are protected from user access. In this paper, we present Meltdown. Meltdown exploits side effects of out-of-order execution on modern processors to read arbitrary kernel-memory locations including personal data and passwords. Out-of-order execution is an indispensable performance feature and present in a wide range of modern processors. The attack is independent of the operating system, and it does not rely on any software vulnerabilities. Meltdown breaks all security assumptions given by address space isolation as well as paravirtualized environments and, thus, every security mechanism building upon this foundation. On affected systems, Meltdown enables an adversary to read memory of other processes or virtual machines in the cloud without any permissions or privileges, affecting millions of customers and virtually every user of a personal computer. We show that the KAISER defense mechanism for KASLR [8] has the important (but inadvertent) side effect of impeding Meltdown. We stress that KAISER must be deployed immediately to prevent large-scale exploitation of this severe information leakage.
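Memory isolation is the foundation of computer system security. For example, kernel address ranges are usually marked as protected, so a user-mode program that reads or writes a kernel address triggers an exception and is denied access. In this article we describe in detail the hardware vulnerability called Meltdown. Meltdown exploits side effects of out-of-order execution on modern processors so that a user-mode program can read data from kernel space, including personal data and passwords. Because it improves performance, out-of-order execution is widely used in modern processors. A Meltdown attack is independent of the operating system and does not rely on any software vulnerability. Meltdown breaks the security guarantees provided by address space isolation (paravirtualized environments included), so every security mechanism built on address space isolation is no longer safe. On affected systems, Meltdown lets an attacker read the data of other processes, or of other virtual machines on a cloud server, without the corresponding permissions. The paper also shows that KAISER (originally intended to address the weaknesses of KASLR) impedes the Meltdown attack. We therefore strongly recommend deploying KAISER immediately to prevent large-scale, severe information leakage.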
I. Introduction
One of the central security features of today’s operating systems is memory isolation. Operating systems ensure that user applications cannot access each other’s memories and prevent user applications from reading or writing kernel memory. This isolation is a cornerstone of our computing environments and allows running multiple applications on personal devices or executing processes of multiple users on a single machine in the cloud.
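One of the core security features of today's operating systems is memory isolation. Memory isolation means that the operating system ensures user applications cannot access each other's memory, and also prevents user applications from accessing kernel space. On a personal device, many processes run side by side and must be isolated from one another. In a cloud environment, the processes of multiple users (virtual machines) share one physical host, and a process belonging to one user (virtual machine) must not be able to access the data of processes belonging to another. This memory isolation is therefore a cornerstone of our computing environments.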
On modern processors, the isolation between the kernel and user processes is typically realized by a supervisor bit of the processor that defines whether a memory page of the kernel can be accessed or not. The basic idea is that this bit can only be set when entering kernel code and it is cleared when switching to user processes. This hardware feature allows operating systems to map the kernel into the address space of every process and to have very efficient transitions from the user process to the kernel, e.g., for interrupt handling. Consequently, in practice, there is no change of the memory mapping when switching from a user process to the kernel.
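On modern processors, isolation between the kernel and user address spaces is usually implemented by a bit in a processor control register (the supervisor bit, which identifies the mode the processor is currently in). This bit determines whether memory pages of kernel space can be accessed. The basic idea is that the bit is set to 1 only while kernel code is executing and is cleared when switching to a user process. With this hardware support, the operating system can map the kernel address space into every process. A user process often needs to switch from user space to kernel space, for example when it requests a kernel service through a system call, or when an interrupt arrives while it is running in user space and the interrupt handler must run in kernel space to handle the asynchronous event from a peripheral. Because switches from user mode to kernel mode are very frequent, not having to switch address spaces during the transition avoids a performance penalty.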
In this work, we present Meltdown. Meltdown is a novel attack that allows overcoming memory isolation completely by providing a simple way for any user process to read the entire kernel memory of the machine it executes on, including all physical memory mapped in the kernel region. Meltdown does not exploit any software vulnerability, i.e., it works on all major operating systems. Instead, Meltdown exploits side-channel information available on most modern processors, e.g., modern Intel microarchitectures since 2010 and potentially on other CPUs of other vendors.
While side-channel attacks typically require very specific knowledge about the target application and are tailored to only leak information about its secrets, Meltdown allows an adversary who can run code on the vulnerable processor to obtain a dump of the entire kernel address space, including any mapped physical memory. The root cause of the simplicity and strength of Meltdown is side effects caused by out-of-order execution.
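In this work we present a completely new attack that exploits the Meltdown vulnerability. With it, any user process can defeat the operating system's address space isolation and read kernel-space data in a simple way, including all physical memory mapped into the kernel address space. Meltdown does not exploit any software vulnerability, which means it works against any operating system. Instead, it uses side-channel information available on most modern processors (for example Intel microarchitectures since 2010; CPUs from other vendors may have the same problem). Ordinary side-channel attacks need detailed knowledge of the target and are tailored to it in order to extract its secrets. Meltdown is different: it can dump the entire kernel address space (including all physical memory mapped into it). The root cause of Meltdown's strength is the side effects of out-of-order execution.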
Out-of-order execution is an important performance feature of today’s processors in order to overcome latencies of busy execution units, e.g., a memory fetch unit needs to wait for data arrival from memory. Instead of stalling the execution, modern processors run operations out-of-order, i.e., they look ahead and schedule subsequent operations to idle execution units of the processor. However, such operations often have unwanted side-effects, e.g., timing differences [28, 35, 11] can leak information from both sequential and out-of-order execution.
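Sometimes a CPU execution unit has to wait for the result of an operation, for example when loading data from memory into a register. To improve performance, the CPU does not stall. Instead it executes out of order: it keeps processing subsequent instructions and schedules them onto idle execution units. However, such operations often have unwanted side effects, and these side effects of instruction execution, for example timing differences [28, 35, 11], can be used to steal information.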
From a security perspective, one observation is particularly significant: vulnerable out-of-order CPUs allow an unprivileged process to load data from a privileged (kernel or physical) address into a temporary CPU register. Moreover, the CPU even performs further computations based on this register value, e.g., access to an array based on the register value. The processor ensures correct program execution by simply discarding the results of the memory lookups (e.g., the modified register states) if it turns out that an instruction should not have been executed. Hence, on the architectural level (i.e., the abstract definition of how the processor should perform computations), no security problem arises.
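Performance improves, but from a security perspective there is a problem. The key point is that, under out-of-order execution, a vulnerable CPU lets an unprivileged process read data from an address that requires privileged access and load it into a temporary register. The CPU can even perform further computation based on the value in that temporary register, for example accessing an array indexed by it. Of course, the CPU eventually detects the illegal access and discards the results of the computation (for example the modified register state). Although the instructions after the faulting one were executed early, the CPU cleans up their results in the end, so it looks as if nothing happened. This guarantees that, from the point of view of the CPU architecture, no security problem exists.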
However, we observed that out-of-order memory lookups influence the cache, which in turn can be detected through the cache side channel. As a result, an attacker can dump the entire kernel memory by reading privileged memory in an out-of-order execution stream, and transmit the data from this elusive state via a microarchitectural covert channel (e.g., Flush+Reload) to the outside world. On the receiving end of the covert channel, the register value is reconstructed. Hence, on the microarchitectural level (i.e., the actual hardware implementation), there is an exploitable security problem.
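However, we can observe the effect of out-of-order execution on the cache and mount an attack using the side-channel information the cache provides. The attack works as follows: the attacker uses out-of-order execution to read a memory address that requires privileged access into a temporary register, and the program then uses the data in that register to influence the state of the cache. The attacker then builds a covert channel (for example Flush+Reload) to carry the data out, and the register value is reconstructed at the receiving end of the channel. So at the level of the CPU microarchitecture (which depends on the actual hardware implementation) there is indeed a security problem.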
Meltdown breaks all security assumptions given by the CPU’s memory isolation capabilities. We evaluated the attack on modern desktop machines and laptops, as well as servers in the cloud. Meltdown allows an unprivileged process to read data mapped in the kernel address space, including the entire physical memory on Linux and OS X, and a large fraction of the physical memory on Windows. This may include physical memory of other processes, the kernel, and in case of kernel-sharing sandbox solutions (e.g., Docker, LXC) or Xen in paravirtualization mode, memory of the kernel (or hypervisor), and other co-located instances. While the performance heavily depends on the specific machine, e.g., processor speed, TLB and cache sizes, and DRAM speed, we can dump kernel and physical memory with up to 503 KB/s. Hence, an enormous number of systems are affected.
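Meltdown easily breaks the memory isolation that the CPU works so hard to maintain. We attacked modern desktops, laptops and cloud servers, and found that on systems such as Linux and OS X Meltdown lets a user process dump all physical memory (because all physical memory is mapped into the kernel address space). On Windows, Meltdown lets a user process dump a large fraction of physical memory. This physical memory may contain data of other processes or of the kernel. With kernel-sharing sandbox solutions (e.g., Docker, LXC) or Xen in paravirtualization mode, the dumped physical memory also includes data of the kernel (or hypervisor) and of other guest OSes. Depending on the system (for example processor speed, TLB and cache sizes, and DRAM speed), memory can be dumped at up to 503 KB/s. The impact of Meltdown is therefore very broad.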
The countermeasure KAISER [8], originally developed to prevent side-channel attacks targeting KASLR, inadvertently protects against Meltdown as well. Our evaluation shows that KAISER prevents Meltdown to a large extent. Consequently, we stress that it is of utmost importance to deploy KAISER on all operating systems immediately. Fortunately, during a responsible disclosure window, the three major operating systems (Windows, Linux, and OS X) implemented variants of KAISER and will roll out these patches in the near future.
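The countermeasure we propose is KAISER [8]. KAISER was originally designed to prevent side-channel attacks against KASLR, but it inadvertently mitigates the Meltdown vulnerability as well. Our evaluation shows that KAISER prevents Meltdown to a large extent, so we strongly recommend deploying KAISER on all operating systems immediately. Fortunately, the three major operating systems (Windows, Linux and OS X) have all implemented variants of KAISER and will roll out these patches in the near future.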
Meltdown is distinct from the Spectre Attacks [19] in several ways, notably that Spectre requires tailoring to the victim process’s software environment, but applies more broadly to CPUs and is not mitigated by KAISER.
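Meltdown differs from the Spectre attacks [19] in several ways. The most notable difference is that a Spectre attack requires knowledge of the victim process's software environment and must be tailored to it. On the other hand, the Spectre vulnerability exists on more CPUs, and KAISER is ineffective against Spectre.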
Contributions. The contributions of this work are:
1. We describe out-of-order execution as a new, extremely powerful, software-based side channel.
2. We show how out-of-order execution can be combined with a microarchitectural covert channel to transfer the data from an elusive state to a receiver on the outside.
3. We present an end-to-end attack combining out-of-order execution with exception handlers or TSX, to read arbitrary physical memory without any permissions or privileges, on laptops, desktop machines, and on public cloud machines.
4. We evaluate the performance of Meltdown and the effects of KAISER on it.
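The contributions of this work are:
1. We are the first to show that out-of-order execution can be attacked as a side channel, and that the attack is extremely powerful.
2. We show how out-of-order execution and a microarchitectural covert channel can be combined to transfer data and leak information.
3. We present an end-to-end attack that uses out-of-order execution (combined with exception handling or TSX). With it we can read arbitrary physical memory on laptops, desktops and cloud servers without any privileges.
4. We evaluate the performance of Meltdown and the effect KAISER has on it.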
Outline. The remainder of this paper is structured as follows: In Section 2, we describe the fundamental problem which is introduced with out-of-order execution. In Section 3, we provide a toy example illustrating the side channel Meltdown exploits. In Section 4, we describe the building blocks of the full Meltdown attack. In Section 5, we present the Meltdown attack. In Section 6, we evaluate the performance of the Meltdown attack on several different systems. In Section 7, we discuss the effects of the software-based KAISER countermeasure and propose solutions in hardware. In Section 8, we discuss related work and conclude our work in Section 9.
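Outline: the rest of this paper is structured as follows. Section 2 describes the fundamental problem introduced by out-of-order execution. Section 3 gives a simple example illustrating the side channel that Meltdown exploits. Section 4 describes the building blocks of the Meltdown attack. Section 5 shows how the Meltdown attack is carried out. Section 6 evaluates the performance of the Meltdown attack on several different systems. Section 7 discusses software and hardware countermeasures against Meltdown: the software solution is mainly the KAISER mechanism, and we also make suggestions for hardware solutions. Section 8 discusses related work, and Section 9 gives our conclusions.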
II. Background
In this section, we provide background on out-of-order execution, address translation, and cache attacks.
In this section we cover some basic background on out-of-order execution, address translation, and cache attacks.
1. Out-of-order execution
Out-of-order execution is an optimization technique that allows maximizing the utilization of all execution units of a CPU core as exhaustively as possible. Instead of processing instructions strictly in the sequential program order, the CPU executes them as soon as all required resources are available. While the execution unit of the current operation is occupied, other execution units can run ahead. Hence, instructions can be run in parallel as long as their results follow the architectural definition.
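Out-of-order execution is an optimization technique that makes the fullest possible use of the execution units in a CPU core. Unlike an in-order CPU, a CPU that supports out-of-order execution does not have to execute code in program order: as soon as the resources an instruction needs are available (not occupied), the instruction enters an execution unit. If the execution unit needed by the current instruction is busy, other instructions can run ahead (provided the execution units they need are idle). Under out-of-order execution, instructions can therefore run in parallel as long as the results conform to the architectural definition.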
In practice, CPUs supporting out-of-order execution support running operations speculatively to the extent that the processor’s out-of-order logic processes instructions before the CPU is certain whether the instruction will be needed and committed. In this paper, we refer to speculative execution in a more restricted meaning, where it refers to an instruction sequence following a branch, and use the term out-of-order execution to refer to any way of getting an operation executed before the processor has committed the results of all prior instructions.
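In practice, out-of-order execution and speculative execution go together in a CPU. When the CPU cannot be certain that the next instruction will actually be needed, it usually makes a prediction and executes out of order according to that prediction. In this paper, speculative execution is used in a restricted sense: it refers specifically to the execution of the instruction sequence following a branch. The term out-of-order execution refers to the processor executing an instruction before it has committed the results of all prior instructions.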
In 1967, Tomasulo developed an algorithm [33] that enabled dynamic scheduling of instructions to allow out-of-order execution. Tomasulo [33] introduced a unified reservation station that allows a CPU to use a data value as it has been computed instead of storing it to a register and re-reading it. The reservation station renames registers to allow instructions that operate on the same physical registers to use the last logical one to solve read-after-write (RAW), write-after-read (WAR) and write-after-write (WAW) hazards. Furthermore, the reservation unit connects all execution units via a common data bus (CDB). If an operand is not available, the reservation unit can listen on the CDB until it is available and then directly begin the execution of the instruction.
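In 1967, Tomasulo designed an algorithm [33] that implements dynamic scheduling of instructions and thereby allows out-of-order execution. Tomasulo [33] designed a unified reservation station for the CPU's execution units. Previously, an execution unit had to read its operands from registers and write its results back to registers. With a reservation station, an execution unit can use the station to obtain operands and to hold results. Here is a concrete read-after-write (RAW) example: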
R2 <- R1 + R3
R4 <- R2 + R3
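The first instruction computes R1+R3 and stores the result in R2. The second instruction depends on the value of R2. Without a reservation station, the second instruction can only execute after the result of the first has been committed to register R2, because it has to load its operand from R2. With a reservation station, R2 can be renamed inside the station; call the renamed register R2.rename. The first instruction then stores its result in R2.rename without first committing the final result to R2, and the second instruction can take its operand directly from R2.rename and execute, which resolves the RAW hazard. WAR and WAW are handled similarly and are not repeated here. (Translator's note: I expanded on the original sentence here to make the reservation station easier to understand.) In addition, the reservation station is connected to all execution units through a common data bus (CDB). If an operand is not yet available, the reservation station can listen on the CDB, and as soon as the operand appears, execution of the instruction begins immediately.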
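To make the renaming idea above concrete, here is a minimal sketch of a rename table in C. It is a simplified software model, not how any particular CPU implements Tomasulo's algorithm: the names `rename_table`, `rename_init` and `rename_insn`, and the register counts, are assumptions made for illustration, and physical registers are never recycled.

```c
#include <assert.h>

#define NUM_ARCH 8   /* architectural registers R0..R7 (assumed) */
#define NUM_PHYS 32  /* physical registers (assumed) */

/* Maps each architectural register to the physical register that
   currently holds (or will hold) its latest value. */
typedef struct {
    int map[NUM_ARCH];
    int next_free;
} rename_table;

/* A renamed instruction: dst <- src1 op src2, in physical registers. */
typedef struct {
    int dst, src1, src2;
} phys_insn;

void rename_init(rename_table *t) {
    for (int i = 0; i < NUM_ARCH; i++)
        t->map[i] = i;          /* identity mapping at start */
    t->next_free = NUM_ARCH;
}

phys_insn rename_insn(rename_table *t, int dst, int src1, int src2) {
    assert(dst >= 0 && dst < NUM_ARCH);
    assert(src1 >= 0 && src1 < NUM_ARCH);
    assert(src2 >= 0 && src2 < NUM_ARCH);
    assert(t->next_free < NUM_PHYS);

    phys_insn out;
    /* Sources read the current mapping, so a consumer picks up the
       renamed register of the most recent producer. */
    out.src1 = t->map[src1];
    out.src2 = t->map[src2];
    /* Every write gets a fresh physical register, so later writes to
       the same architectural register cannot clash with earlier ones. */
    out.dst = t->next_free++;
    t->map[dst] = out.dst;
    return out;
}
```

Renaming `R2 <- R1 + R3` followed by `R4 <- R2 + R3` makes the second instruction read the physical register written by the first, and a later write to R2 receives a different physical register again, which is what removes WAR and WAW conflicts in this model.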
On the Intel architecture, the pipeline consists of the front-end, the execution engine (back-end) and the memory subsystem [14]. x86 instructions are fetched by the front-end from the memory and decoded to microoperations (μOPs) which are continuously sent to the execution engine. Out-of-order execution is implemented within the execution engine as illustrated in Figure 1. The Reorder Buffer is responsible for register allocation, register renaming and retiring. Additionally, other optimizations like move elimination or the recognition of zeroing idioms are directly handled by the reorder buffer. The μOPs are forwarded to the Unified Reservation Station that queues the operations on exit ports that are connected to Execution Units. Each execution unit can perform different tasks like ALU operations, AES operations, address generation units (AGU) or memory loads and stores. AGUs as well as load and store execution units are directly connected to the memory subsystem to process its requests.
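On the Intel CPU architecture, the pipeline consists of the front-end, the execution engine (back-end) and the memory subsystem [14]. The front-end fetches x86 instructions from memory and decodes them into microoperations (μOPs), which are then sent to the execution engine. Out-of-order execution is implemented in the execution engine, as illustrated in Figure 1. The Reorder Buffer is responsible for register allocation, register renaming, and committing results to the software-visible registers (this process is also called retirement). The reorder buffer also handles some other optimizations, such as move elimination and the recognition of zeroing idioms. The μOPs are sent to the unified reservation station, where they are queued on its exit ports, and those exit ports are connected directly to the execution units. Each execution unit can perform a different task, such as ALU operations, AES operations, address generation (AGU), memory loads and memory stores. The AGU, load and store execution units are connected directly to the memory subsystem in order to process memory requests.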
Since CPUs usually do not run linear instruction streams, they have branch prediction units that are used to obtain an educated guess of which instruction will be executed next. Branch predictors try to determine which direction of a branch will be taken before its condition is actually evaluated. Instructions that lie on that path and do not have any dependencies can be executed in advance and their results immediately used if the prediction was correct. If the prediction was incorrect, the reorder buffer allows rolling back by clearing the reorder buffer and re-initializing the unified reservation station.
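Because a CPU does not always run a linear instruction stream, it has a branch prediction unit. This unit can record the outcomes of past branches and use them to guess which instruction is likely to execute next. The branch predictor determines the path the program will take before the actual condition has been evaluated. Instructions on that path that have no dependencies can be executed in advance. If the prediction is correct, their results can be used immediately. If the prediction is wrong, the reorder buffer can roll back the results, which is done by clearing the reorder buffer and re-initializing the unified reservation station.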
Various approaches to predict the branch exist: With static branch prediction [12], the outcome of the branch is solely based on the instruction itself. Dynamic branch prediction [2] gathers statistics at run-time to predict the outcome. One-level branch prediction uses a 1-bit or 2-bit counter to record the last outcome of the branch [21]. Modern processors often use two-level adaptive predictors [36] that remember the history of the last n outcomes, which allows predicting regularly recurring patterns. More recently, ideas to use neural branch prediction [34, 18, 32] have been picked up and integrated into CPU architectures [3].
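There are many approaches to branch prediction. With static branch prediction [12], the predicted outcome is based solely on the instruction itself. Dynamic branch prediction [2] collects statistics at run time to predict the outcome. One-level branch prediction uses a 1-bit or 2-bit counter to record the outcome of the branch [21]. Modern processors usually use two-level adaptive predictors [36], which remember the last n branch outcomes and use this history to find regularly recurring patterns. More recently, the idea of neural branch prediction [34, 18, 32] has been picked up again and integrated into CPU architectures [3].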
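As an illustration of the one-level scheme mentioned above, the following is a minimal sketch of a 2-bit saturating counter in C. The type and function names (`predictor2`, `predict_taken`, `train`) are hypothetical, and the state encoding (0 and 1 predict not taken, 2 and 3 predict taken) is the common textbook convention, assumed here rather than taken from the paper.

```c
#include <assert.h>
#include <stdbool.h>

/* 2-bit saturating counter: 0 = strongly not taken, 1 = weakly not
   taken, 2 = weakly taken, 3 = strongly taken (assumed encoding). */
typedef struct {
    unsigned state;
} predictor2;

/* Predict "taken" when the counter is in the upper half. */
bool predict_taken(const predictor2 *p) {
    assert(p->state <= 3);
    return p->state >= 2;
}

/* Move the counter one step toward the observed outcome, saturating
   at 0 and 3 so a single surprise does not flip a strong prediction. */
void train(predictor2 *p, bool taken) {
    assert(p->state <= 3);
    if (taken && p->state < 3)
        p->state++;
    else if (!taken && p->state > 0)
        p->state--;
}
```

Compared with a 1-bit counter, which flips its prediction after every misprediction, the extra bit means a loop branch that is taken many times and then falls through once is still predicted as taken the next time the loop is entered.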
2. Address spaces
To isolate processes from each other, CPUs support virtual address spaces where virtual addresses are translated to physical addresses. A virtual address space is divided into a set of pages that can be individually mapped to physical memory through a multi-level page translation table. The translation tables define the actual virtual to physical mapping and also protection properties that are used to enforce privilege checks, such as readable, writable, executable and user-accessible. The currently used translation table is held in a special CPU register. On each context switch, the operating system updates this register with the next process’ translation table address in order to implement per-process virtual address spaces. Because of that, each process can only reference data that belongs to its own virtual address space. Each virtual address space itself is split into a user and a kernel part. While the user address space can be accessed by the running application, the kernel address space can only be accessed if the CPU is running in privileged mode. This is enforced by the operating system disabling the user-accessible property of the corresponding translation tables. The kernel address space does not only have memory mapped for the kernel’s own usage, but it also needs to perform operations on user pages, e.g., filling them with data. Consequently, the entire physical memory is typically mapped in the kernel. On Linux and OS X, this is done via a direct-physical map, i.e., the entire physical memory is directly mapped to a pre-defined virtual address (cf. Figure 2).
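To isolate processes from each other, CPUs support virtual address spaces. The CPU puts physical addresses on the bus, however, so the virtual addresses used by a program must be translated to physical addresses. A virtual address space is divided into pages, and these pages can be mapped to physical pages through multi-level page tables. Besides the virtual-to-physical mapping, the page tables also define protection properties such as readable, writable, executable, and user-accessible. The address of the page table currently in use is held in a special CPU register (on x86 this is cr3; on ARM it is the TTBR family of registers). On a context switch, the operating system always updates this register with the page table address of the next process, which switches the process virtual address space. As a result, each process can only access data belonging to its own virtual address space. The virtual address space of each process is itself split into a user part and a kernel part. A process running in user mode can only access the user address space; the kernel address space can only be accessed in kernel mode (with the CPU running in privileged mode). The operating system disables the user-accessible property in the page table entries for the kernel address space, which forbids user-mode access to kernel space. The kernel address space not only maps memory for the kernel itself (for example the kernel's text and data segments), it also needs to operate on user pages, for example to fill them with data. The entire physical memory of the system is therefore usually mapped into the kernel address space. On Linux and OS X this is done with a direct-physical map, i.e., the entire physical memory is mapped directly at a pre-defined virtual address (cf. Figure 2).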
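The translation and permission check described above can be sketched in C for one concrete case. The sketch assumes x86-64 with 4-level paging and 4 KiB pages (four 9-bit table indices plus a 12-bit page offset, and the Present and User/Supervisor flags in bits 0 and 2 of a page table entry). Other architectures and page sizes differ, and the names `vaddr_parts`, `split_vaddr` and `user_may_access` are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Indices into the four page table levels plus the offset in the page. */
typedef struct {
    unsigned pml4, pdpt, pd, pt, offset;
} vaddr_parts;

/* Split a canonical 48-bit virtual address (x86-64, 4 KiB pages). */
vaddr_parts split_vaddr(uint64_t va) {
    /* Canonical form: bits 63..47 are all zeros or all ones. */
    uint64_t top = va >> 47;
    assert(top == 0 || top == 0x1FFFF);

    vaddr_parts p;
    p.offset = (unsigned)(va & 0xFFF);          /* bits 0..11  */
    p.pt     = (unsigned)((va >> 12) & 0x1FF);  /* bits 12..20 */
    p.pd     = (unsigned)((va >> 21) & 0x1FF);  /* bits 21..29 */
    p.pdpt   = (unsigned)((va >> 30) & 0x1FF);  /* bits 30..38 */
    p.pml4   = (unsigned)((va >> 39) & 0x1FF);  /* bits 39..47 */
    return p;
}

#define PTE_PRESENT (1ULL << 0)  /* page is mapped */
#define PTE_USER    (1ULL << 2)  /* user-mode access allowed */

/* A user-mode access is architecturally allowed only if the entry is
   present and its user-accessible bit is set. Kernel mappings clear
   PTE_USER, which is the check Meltdown's transient loads race. */
bool user_may_access(uint64_t pte) {
    return (pte & PTE_PRESENT) != 0 && (pte & PTE_USER) != 0;
}
```

In this model a kernel page that is mapped into the process (present) but has the user bit cleared is visible in the page tables yet rejected by `user_may_access`, which matches the text: the mapping exists in every process, and only the permission bit protects it.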
Instead of a direct-physical map, Windows maintains multiple so-called paged pools, non-paged pools, and the system cache. These pools are virtual memory regions in the kernel address space mapping physical pages to virtual addresses which are either required to remain in the memory (non-paged pool) or can be removed from the memory because a copy is already stored on the disk (paged pool). The system cache further contains mappings of all file-backed pages. Combined, these memory pools will typically map a large fraction of the physical memory into the kernel address space of every process.
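(Translator's note: the paragraph above describes the address-mapping mechanism in Windows. I did not look into it and have left it untranslated.)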
The exploitation of memory corruption bugs often requires the knowledge of addresses of specific data. In order to impede such attacks, address space layout randomization (ASLR) has been introduced as well as non-executable stacks and stack canaries. In order to protect the kernel, KASLR randomizes the offsets where drivers are located on every boot, making attacks harder as they now require guessing the location of kernel data structures. However, side-channel attacks allow detecting the exact location of kernel data structures [9, 13, 17] or derandomizing ASLR in JavaScript [6]. A combination of a software bug and the knowledge of these addresses can lead to privileged code execution.
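Attacks that exploit memory corruption bugs (bugs that let memory contents be modified, typically causing a crash) usually need to know the address of specific data, because the data at that address has to be modified. To impede such attacks, three measures were introduced: address space layout randomization (ASLR), non-executable stacks, and stack canaries. To protect the kernel, KASLR places drivers at a random offset each time they are loaded at boot. This makes attacks harder, because the attacker has to guess the addresses of kernel data structures. However, an attacker can use side-channel attacks to obtain the exact location of kernel data structures [9, 13, 17] or to derandomize ASLR from JavaScript [6]. Combining a software bug with knowledge of these addresses can lead to privileged code execution.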
3. Cache attacks
In order to speed up memory accesses and address translation, the CPU contains small memory buffers, called caches, that store frequently used data. CPU caches hide slow memory access latencies by buffering frequently used data in smaller and faster internal memory. Modern CPUs have multiple levels of caches that are either private to its cores or shared among them. Address space translation tables are also stored in memory and are also cached in the regular caches.
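To speed up memory accesses and address translation, the CPU contains small memory buffers, called caches, that hold recently and frequently used data. The CPU caches thus hide the access latency of the slow underlying memory. Modern CPUs have multiple levels of cache, which are either private to a particular CPU core or shared among several cores. The page tables of an address space are stored in memory and are cached in the regular caches as well (separately from the TLB, which caches completed translations).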
Cache side-channel attacks exploit timing differences that are introduced by the caches. Different cache attack techniques have been proposed and demonstrated in the past, including Evict+Time [28], Prime+Probe [28, 29], and Flush+Reload [35]. Flush+Reload attacks work on a single cache line granularity. These attacks exploit the shared, inclusive last-level cache. An attacker frequently flushes a targeted memory location using the clflush instruction. By measuring the time it takes to reload the data, the attacker determines whether data was loaded into the cache by another process in the meantime. The Flush+Reload attack has been used for attacks on various computations, e.g., cryptographic algorithms [35, 16, 1], web server function calls [37], user input [11, 23, 31], and kernel addressing information [9].
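A cache side-channel attack exploits the timing differences introduced by the cache. When memory is accessed, data that is already cached is returned very quickly, while data that is not cached is returned slowly, and cache side-channel attacks use this time difference to steal data. Various cache attack techniques have been proposed and shown to work, including Evict+Time [28], Prime+Probe [28, 29], and Flush+Reload [35]. Flush+Reload works at the granularity of a single cache line. These attacks exploit the shared, inclusive last-level cache. The attacker repeatedly flushes the target memory location from the cache with the clflush instruction, then reads the target memory and measures how long the load takes. From this timing the attacker learns whether another process has loaded the data into the cache in the meantime. Flush+Reload has been used to attack a variety of computations, e.g., cryptographic algorithms [35, 16, 1], web server function calls [37], user input [11, 23, 31], and kernel addressing information [9].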
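The two primitives behind Flush+Reload, flushing a line and timing a reload, can be sketched in C as follows. This is a minimal sketch that assumes an x86 processor and a GCC- or Clang-compatible compiler providing `<x86intrin.h>`; the function names `flush_line` and `timed_reload` are hypothetical, and the fences are one reasonable choice for ordering the measurement rather than the only one.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <x86intrin.h>  /* _mm_clflush, _mm_mfence, _mm_lfence, __rdtscp */

/* Evict the cache line containing addr from the cache hierarchy. */
void flush_line(const void *addr) {
    assert(addr != NULL);
    _mm_clflush(addr);
    _mm_mfence();  /* make sure the flush completes before we go on */
}

/* Load one byte from addr and return the elapsed time stamp counter
   ticks. A small value suggests the line was cached, a large value
   suggests it had to come from memory. */
uint64_t timed_reload(const volatile uint8_t *addr) {
    unsigned aux;
    assert(addr != NULL);
    _mm_mfence();
    uint64_t start = __rdtscp(&aux);
    (void)*addr;                      /* the timed memory access */
    uint64_t end = __rdtscp(&aux);
    _mm_lfence();
    return end - start;
}
```

A receiver would call `flush_line` on the monitored address, wait, and then call `timed_reload`: a fast reload indicates that someone touched the line in the meantime. The threshold separating fast from slow is machine-specific and has to be calibrated on the target.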
A special use case is covert channels. Here the attacker controls both the part that induces the side effect and the part that measures the side effect. This can be used to leak information from one security domain to another, while bypassing any boundaries existing on the architectural level or above. Both Prime+Probe and Flush+Reload have been used in high-performance covert channels [24, 26, 10].
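A special use case of cache side-channel attacks is building a covert channel. In this scenario the attacker controls both the sending and the receiving end of the channel: the attacker's code triggers the cache side effect, and the attacker also measures it. In this way, information can bypass the boundary checks at the architectural level and leak from one security domain to the outside world. Both Prime+Probe and Flush+Reload have been used to build high-performance covert channels [24, 26, 10].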
III. A toy example
In this section, we start with a toy example, a simple code snippet, to illustrate that out-of-order execution can change the microarchitectural state in a way that leaks information. However, despite its simplicity, it is used as a basis for Section 4 and Section 5, where we show how this change in state can be exploited for an attack.
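In this section we give a simple example and explain how executing the sample code on an out-of-order CPU changes the CPU's microarchitectural state and leaks information. Simple as it is, it serves as the basis for Sections 4 and 5, where we show the actual Meltdown attack.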
Listing 1 shows a simple code snippet first raising an (unhandled) exception and then accessing an array. The property of an exception is that the control flow does not continue with the code after the exception, but jumps to an exception handler in the operating system. Regardless of whether this exception is raised due to a memory access, e.g., by accessing an invalid address, or due to any other CPU exception, e.g., a division by zero, the control flow continues in the kernel and not with the next user space instruction.
    raise_exception();
    // the line below is never reached
    access(probe_array[data * 4096]);
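Listing 1 above shows a simple code snippet: it first raises an exception (which we do not handle) and then accesses the probe_array array. Because of the exception, control flow does not continue with the code after it but jumps to the exception handler in the operating system. Whether the exception is caused by a memory access (for example accessing an invalid address) or by some other CPU exception (for example a division by zero), control flow continues in the kernel rather than staying in user space and executing the access to probe_array.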
Thus, our toy example cannot access the array in theory, as the exception immediately traps to the kernel and terminates the application. However, due to the out-of-order execution, the CPU might have already executed the following instructions as there is no dependency on the exception. This is illustrated in Figure 3. Due to the exception, the instructions executed out of order are not retired and, thus, never have architectural effects.
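In theory, then, our sample code never accesses probe_array, since the exception immediately traps into the kernel and terminates the application. Because of out-of-order execution, however, the CPU may already have executed the instructions that follow the faulting instruction, as they do not depend on it. This is illustrated in Figure 3. Although the instructions after the faulting one were executed, they are not retired because of the exception. (Translator's note: instruction retire and instruction commit mean the same thing, namely making the result of an instruction visible in the software-visible registers or memory. Because a literal Chinese translation of "retire" is easily misunderstood, this article renders it as "commit" throughout or leaves it untranslated.) From the point of view of the CPU architecture there is therefore no problem at all (that is, a software engineer looking at the ISA level cannot see that these instructions were executed).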
Although the instructions executed out of order do not have any visible architectural effect on registers or memory, they have microarchitectural side effects. During the out-of-order execution, the referenced memory is fetched into a register and is also stored in the cache. If the out-of-order execution has to be discarded, the register and memory contents are never committed. Nevertheless, the cached memory contents are kept in the cache. We can leverage a microarchitectural side-channel attack such as Flush+Reload [35], which detects whether a specific memory location is cached, to make this microarchitectural state visible. There are other side channels as well which also detect whether a specific memory location is cached, including Prime+Probe [28, 24, 26], Evict+Reload [23], or Flush+Flush [10]. However, as Flush+Reload is the most accurate known cache side channel and is simple to implement, we do not consider any other side channel for this example.
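Although program order was violated and the CPU executed instructions that should not have run, no change caused by these instructions can be observed in registers or memory (that is, there is no architectural effect). From the point of view of the CPU microarchitecture, however, there are side effects. During out-of-order execution, loading a value from memory into a register also places that value in the cache. If the results of the out-of-order execution have to be discarded, neither the register nor the memory contents are committed. The cache contents, however, are not discarded and remain in the cache. We can then use a microarchitectural side-channel attack such as Flush+Reload [35] to detect whether a given memory address is cached, which makes this microarchitectural state visible to the user. There are other ways to detect whether a memory address is cached, including Prime+Probe [28, 24, 26], Evict+Reload [23], and Flush+Flush [10]. Flush+Reload is the most accurate known cache side channel and is very simple to implement, so this paper mainly uses Flush+Reload.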
Based on the value of data in this toy example, a different part of the cache is accessed when executing the memory access out of order. As data is multiplied by 4096, data accesses to probe_array are scattered over the array with a distance of 4 kB (assuming a 1 B data type for probe_array). Thus, there is an injective mapping from the value of data to a memory page, i.e., there are no two different values of data which result in an access to the same page. Consequently, if a cache line of a page is cached, we know the value of data. The spreading over different pages eliminates false positives due to the prefetcher, as the prefetcher cannot access data across page boundaries [14].
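Let us return to the sample code in Listing 1. probe_array is an array organized in 4 KB blocks, and varying the value of data walks through the array in 4 KB steps. If the out-of-order execution accesses the 4 KB block of probe_array selected by data, the data of the corresponding page (meaning that 4 KB block within probe_array) is loaded into the cache. A program can therefore scan the cache state of each page of probe_array and infer the value of data (values of data and pages of probe_array are in one-to-one correspondence). On Intel processors the prefetcher does not cross page boundaries, so the cache state of different pages is completely independent. Spreading the probed locations over separate pages is done mainly to avoid false positives caused by the prefetcher.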
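The one-to-one correspondence between data values and pages can be written down directly. The following C sketch assumes, as the text does, a 1-byte element type, 4096-byte pages, and a probe array of 256 pages; the helper names `probe_offset` and `page_to_value` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE   4096u
#define NUM_VALUES  256u  /* one page per possible byte value */

/* Offset into probe_array touched for a given byte value, matching
   the expression data * 4096 from Listing 1. */
size_t probe_offset(uint8_t data) {
    return (size_t)data * PAGE_SIZE;
}

/* Inverse direction: which byte value does a touched offset encode?
   Any offset inside the page maps back to the same value. */
unsigned page_to_value(size_t offset) {
    assert(offset < (size_t)NUM_VALUES * PAGE_SIZE);
    return (unsigned)(offset / PAGE_SIZE);
}
```

Because each value gets a page of its own, two different values can never touch the same page, so observing which page is cached identifies the value unambiguously.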
Figure 4 shows the result of a Flush+Reload measurement iterating over all pages, after executing the out-of-order snippet with data = 84. Although the array access should not have happened due to the exception, we can clearly see that the index which would have been accessed is cached. Iterating over all pages (e.g., in the exception handler) shows only a cache hit for page 84. This shows that even instructions which are never actually executed change the microarchitectural state of the CPU. Section 4 modifies this toy example to not read a value, but to leak an inaccessible secret.
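Figure 4 plots the result of using Flush+Reload to iterate over the pages of probe_array and measure the access time of each page. The horizontal axis is the page index, 256 in total, and the vertical axis is the access time. On a cache miss the access time is a little over 400 cycles, and on a cache hit it is below roughly 200 cycles, so the two are clearly distinguishable. The figure shows that although the access to probe_array should not have happened because of the exception, there is clearly a cache hit at data = 84. This shows that under out-of-order execution, instructions that should not have executed still affect the CPU's microarchitectural state. In the following sections we modify the sample code to steal secret data.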
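The final step, turning 256 timing measurements into a value, can be sketched in C as below. The function name `recover_value` is hypothetical, and the threshold is an assumption: with the figures quoted above (misses a little over 400 cycles, hits below about 200), something like 300 cycles would separate the two, but the right value has to be calibrated per machine.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Given the measured reload time of each of the 256 probe pages,
   return the index of the single page that was a cache hit (time
   below threshold). Returns -1 if no page or more than one page
   looks cached, so the caller can retry the measurement. */
int recover_value(const uint64_t times[256], uint64_t threshold) {
    assert(times != NULL);
    int hit = -1;
    for (int i = 0; i < 256; i++) {
        if (times[i] < threshold) {
            if (hit != -1)
                return -1;  /* ambiguous: more than one hit */
            hit = i;
        }
    }
    return hit;
}
```

For the measurement in Figure 4, every entry would be a miss except index 84, so the function would return 84 under a suitable threshold. Returning -1 on an empty or ambiguous result is a design choice that lets noisy measurements be repeated instead of being misread.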
IV. Building blocks of the attack
The toy example in Section 3 illustrated that side-effects of out-of-order execution can modify the microarchitectural state to leak information. While the code snippet reveals the data value passed to a cache-side channel, we want to show how this technique can be leveraged to leak otherwise inaccessible secrets. In this section, we want to generalize and discuss the necessary building blocks to exploit out-of-order execution for an attack.
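In the previous section we used simple sample code to show that side effects of out-of-order execution modify the microarchitectural state and so leak information. The code snippet showed the value of the data variable being passed to the cache side channel. Below we describe how this technique can be used to leak protected data. In this section we generalize and discuss the building blocks needed to exploit out-of-order execution in an attack.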
The adversary targets a secret value that is kept somewhere in physical memory. Note that register contents are also stored in memory upon context switches, i.e., they are also stored in physical memory. As described in Section 2.2, the address space of every process typically includes the entire user space, as well as the entire kernel space, which typically also has all physical memory (in-use) mapped. However, these memory regions are only accessible in privileged mode (cf. Section 2.2).
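The attacker's target is a secret value held somewhere in physical memory. Note that register values are also saved to physical memory on a context switch. As described in Section 2.2, the address space of every process usually includes the entire user address space as well as the entire kernel address space (into which all physical memory in use is mapped). Although the kernel-space mappings are present in the process, these memory regions can only be accessed in privileged mode (cf. Section 2.2).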
In this work, we demonstrate leaking secrets by bypassing the privileged-mode isolation, giving an attacker full read access to the entire kernel space including any physical memory mapped, including the physical memory of any other process and the kernel. Note that Kocher et al. [19] pursue an orthogonal approach, called Spectre Attacks, which trick speculatively executed instructions into leaking information that the victim process is authorized to access. As a result, Spectre Attacks lack the privilege escalation aspect of Meltdown and require tailoring to the victim process’s software environment, but apply more broadly to CPUs that support speculative execution and are not stopped by KAISER.
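In this work we bypass the address space isolation mechanism and give the attacker full read access to the entire kernel space, including the direct mapping of physical memory. Through the direct mapping, the attacker can access the physical memory of any other process and of the kernel. Note that Kocher et al. [19] pursue a different approach, called Spectre attacks, which uses speculative execution to leak information that the victim process itself is authorized to access. Spectre attacks therefore do not involve the privilege escalation of Meltdown and must be tailored to the software environment of the target process. However, Spectre affects more CPUs (any CPU that supports speculative execution is affected), and KAISER cannot stop Spectre attacks.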
The full Meltdown attack consists of two building blocks, as illustrated in Figure 5. The first building block of Meltdown is to make the CPU execute one or more instructions that would never occur in the executed path. In the toy example (cf. Section 3), this is an access to an array, which would normally never be executed, as the previous instruction always raises an exception. We call such an instruction, which is executed out of order, leaving measurable side effects, a transient instruction. Furthermore, we call any sequence of instructions containing at least one transient instruction a transient instruction sequence.
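The full Meltdown attack consists of two components, as the figure above shows. The first component makes the CPU execute one or more instructions that would never be executed on the normal path. In the simple example code of Section 3, the instruction that accesses the array should never run, because the preceding instruction always raises an exception. We call such an instruction a transient instruction: the CPU executes it during out-of-order execution (it would not run otherwise), and it leaves measurable side effects. We call any instruction sequence that contains at least one transient instruction a transient instruction sequence.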
In order to leverage transient instructions for an attack, the transient instruction sequence must utilize a secret value that an attacker wants to leak. Section 4.1 describes building blocks to run a transient instruction sequence with a dependency on a secret value.
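To use transient instructions in an attack, the transient instruction sequence must access the secret value the attacker wants and make use of it. Section 4.1 describes such a transient instruction sequence and looks closely at how it uses the protected data.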
The second building block of Meltdown is to transfer the microarchitectural side effect of the transient instruction sequence to an architectural state to further process the leaked secret. Thus, the second building block, described in Section 4.2, consists of building blocks to transfer a microarchitectural side effect to an architectural state using a covert channel.
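The second component of Meltdown detects the side effect that the transient instruction sequence leaves on the CPU microarchitecture after it has executed, and converts it into an architectural state that software can observe, thereby leaking the data. The second component, described in Section 4.2, therefore uses a covert channel to convert the microarchitectural side effect into a CPU architectural state.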
1、Executing transient instructions
The first building block of Meltdown is the execution of transient instructions. Transient instructions basically occur all the time, as the CPU continuously runs ahead of the current instruction to minimize the experienced latency and thus maximize the performance (cf. Section 2.1). Transient instructions introduce an exploitable side channel if their operation depends on a secret value. We focus on addresses that are mapped within the attacker's process, i.e., the user-accessible user space addresses as well as the user-inaccessible kernel space addresses. Note that attacks targeting code that is executed within the context (i.e., address space) of another process are possible [19], but out of scope in this work, since all physical memory (including the memory of other processes) can be read through the kernel address space anyway.
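The first component of Meltdown is the execution of transient instructions. Transient instructions in fact occur all the time, because besides the current instruction the CPU usually executes the instructions that follow it ahead of time in order to maximize performance (see Section 2.1). If the execution of a transient instruction depends on a protected value, it introduces an exploitable side channel. This article concentrates on the attacker's own process address space, that is, on an attacker in user mode accessing protected data in the kernel address space. It is also possible for the attacking process to steal data from another process's address space, but this article does not cover that scenario: the attacking process can reach all memory in the system through the kernel address space anyway, and the other process's data is stored at some address in physical memory.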
Accessing user-inaccessible pages, such as kernel pages, triggers an exception which generally terminates the application. If the attacker targets a secret at a user-inaccessible address, the attacker has to cope with this exception. We propose two approaches: With exception handling, we catch the exception effectively occurring after executing the transient instruction sequence, and with exception suppression, we prevent the exception from occurring at all and instead redirect the control flow after executing the transient instruction sequence. We discuss these approaches in detail in the following.
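Accessing a privileged page, such as a kernel page, while running in user mode triggers an exception, and that exception normally terminates the application. If the attacker's target is data stored at a kernel-space address, the attacker has to deal with this exception. We propose two approaches. The first is to set up an exception handler that is called when the exception occurs (by which time the transient instruction sequence has already executed). The second is to suppress the exception so that it is never raised. We discuss both approaches in detail below.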
Exception handling. A trivial approach is to fork the attacking application before accessing the invalid memory location that terminates the process, and only access the invalid memory location in the child process. The CPU executes the transient instruction sequence in the child process before crashing. The parent process can then recover the secret by observing the microarchitectural state, e.g., through a side-channel.
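The program defines its own exception handling.
A simple method is to fork before accessing the kernel address (the access raises an exception and aborts the program) and to access the kernel address, and so trigger the exception, only in the child process. Before the child process crashes, the CPU has already executed the transient instruction sequence. The parent process can then steal the kernel-space data by observing the CPU's microarchitectural state.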
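The fork-and-crash pattern described above can be sketched in C roughly as follows. This is a minimal illustration for a POSIX system, not the paper's code: `make_inaccessible_page` and `child_faults_on` are hypothetical names, and a page mapped with `PROT_NONE` is used as a harmless stand-in for a kernel address. The sketch shows only the process structure (the child faults, the parent survives); it contains no transient instruction sequence and no covert channel.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical stand-in for a user-inaccessible address: one page mapped
 * with PROT_NONE, so any read from it faults. Returns NULL on failure. */
void *make_inaccessible_page(void) {
    void *p = mmap(NULL, 4096, PROT_NONE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}

/* Fork, read one byte from addr in the child only, and report in the parent
 * what happened to the child.
 * Returns 1 if the child was killed by SIGSEGV or SIGBUS,
 *         0 if it ended any other way, -1 if fork or waitpid failed. */
int child_faults_on(const void *addr) {
    assert(addr != NULL);
    pid_t pid = fork();
    if (pid < 0) {
        return -1;
    }
    if (pid == 0) {
        /* Child: the faulting access. In the attack, the transient
         * instruction sequence would follow this load. */
        volatile unsigned char sink = *(const volatile unsigned char *)addr;
        (void)sink;
        _exit(0); /* reached only if the read did not fault */
    }
    /* Parent: it never touches addr, so it keeps running and could now
     * inspect the microarchitectural state. */
    int status = 0;
    if (waitpid(pid, &status, 0) != pid) {
        return -1;
    }
    if (WIFSIGNALED(status)) {
        int sig = WTERMSIG(status);
        return (sig == SIGSEGV || sig == SIGBUS) ? 1 : 0;
    }
    return 0;
}
```

Calling `child_faults_on(make_inaccessible_page())` is expected to return 1 on a typical Linux system, while passing the address of an ordinary variable returns 0. The cost of this design is one process creation per attempt, which is the overhead the signal-handler variant avoids.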
It is also possible to install a signal handler that will be executed if a certain exception occurs, in this specific case a segmentation fault. This allows the attacker to issue the instruction sequence and prevent the application from crashing, reducing the overhead as no new process has to be created.
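You can of course also install a signal handler, which runs once the exception has been raised (in this scenario the exception is a segmentation fault). The advantage of this method is that the application does not crash and no new process has to be created, so the overhead is lower.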
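A signal-handler variant might look like the sketch below. It assumes a POSIX system and uses `sigaction` together with `sigsetjmp`/`siglongjmp` to resume after the fault. The names `install_fault_handler`, `try_read` and `make_inaccessible_page` are hypothetical, and the `PROT_NONE` page again stands in for a kernel address. The article does not say how the authors' handler returns control, so the jump-based recovery is an assumption of this sketch.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <setjmp.h>
#include <signal.h>
#include <stddef.h>
#include <sys/mman.h>

static sigjmp_buf recover_point;

/* Handler for the memory fault: jump back to the saved recovery point
 * instead of letting the default action terminate the process. */
static void on_fault(int signo) {
    (void)signo;
    siglongjmp(recover_point, 1);
}

/* Install the handler for SIGSEGV (and SIGBUS, which some systems raise
 * for protection faults). Returns 0 on success, -1 on failure. */
int install_fault_handler(void) {
    struct sigaction sa;
    sa.sa_handler = on_fault;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    if (sigaction(SIGSEGV, &sa, NULL) != 0) {
        return -1;
    }
    return sigaction(SIGBUS, &sa, NULL);
}

/* Hypothetical stand-in for a user-inaccessible address. */
void *make_inaccessible_page(void) {
    void *p = mmap(NULL, 4096, PROT_NONE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}

/* Try to read one byte from addr.
 * Returns 1 and stores the byte in *out if the read succeeded,
 * 0 if the read faulted and the handler brought us back here. */
int try_read(const volatile unsigned char *addr, unsigned char *out) {
    assert(addr != NULL && out != NULL);
    /* savemask = 1: the signal mask is restored on siglongjmp, so the
     * fault signal is not left blocked for the next attempt. */
    if (sigsetjmp(recover_point, 1) != 0) {
        return 0;
    }
    *out = *addr; /* faulting access; transient instructions would follow */
    return 1;
}
```

After `install_fault_handler()` succeeds, `try_read` can be called repeatedly in the same process: a faulting address yields 0 and the program keeps running, which is why this variant needs no fork per attempt.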
Exception suppression.
This approach relates to transactional memory. Interested readers can consult the original paper.
2、Building a covert channel
The second building block of Meltdown is the transfer of the microarchitectural state, which was changed by the transient instruction sequence, into an architectural state (cf. Figure 5). The transient instruction sequence can be seen as the sending end of a microarchitectural covert channel. The receiving end of the covert channel receives the microarchitectural state change and deduces the secret from the state. Note that the receiver is not part of the transient instruction sequence and can be a different thread or even a different process, e.g., the parent process in the fork-and-crash approach.
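The second Meltdown component converts the change in CPU microarchitectural state caused by the transient instruction sequence into a corresponding architectural state (see the figure above). The transient instruction sequence can be regarded as the sending end of a microarchitectural covert channel. The receiving end picks up the change in microarchitectural state and deduces the protected data from it. Note that the receiver is not part of the transient instruction sequence. It can run in another thread or even another process. In the fork example of the previous section, the transient instruction sequence runs in the child process and the receiver runs in the parent.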
We leverage techniques from cache attacks, as the cache state is a microarchitectural state which can be reliably transferred into an architectural state using various techniques [28, 35, 10]. Specifically, we use Flush+Reload [35], as it allows building a fast and low-noise covert channel. Thus, depending on the secret value, the transient instruction sequence (cf. Section 4.1) performs a regular memory access, e.g., as it does in the toy example (cf. Section 3).
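We can use cache-attack techniques: the cache state is one kind of microarchitectural state, and by probing it we can reliably convert it into a CPU architectural state using various techniques [28, 35, 10]. Specifically, we can use Flush+Reload [35], because it makes it possible to build a fast, low-noise covert channel. Depending on the secret data, the transient instruction sequence (see Section 4.1) then performs an ordinary memory access, as the simple example program in Section 3 does.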
After the transient instruction sequence, i.e., the sender of the covert channel, has accessed an accessible address, the address is cached for subsequent accesses. The receiver can then monitor whether the address has been loaded into the cache by measuring the access time to the address. Thus, the sender can transmit a '1'-bit by accessing an address which is loaded into the monitored cache, and a '0'-bit by not accessing such an address.
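At the sending end of the covert channel, the transient instruction sequence accesses an ordinary memory address, which causes the data at that address to be loaded into the cache (to speed up later accesses). The receiver can then check whether the data has been loaded into the cache by measuring the access time to that address. The sender can therefore transmit a 1 bit by accessing the address (which loads it into the cache) or a 0 bit by not accessing it (so that it is not loaded). The receiver obtains the 0 or 1 bit by monitoring the cache.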
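The receiver's decision reduces to comparing a measured access time against a threshold. The C sketch below covers only that decision logic and assumes the latencies have already been measured elsewhere (how they are timed is platform-specific and not shown). `pick_threshold` and `received_bit` are hypothetical names, and choosing the midpoint between the mean cached and mean uncached latency is one simple calibration rule assumed here, not something the article prescribes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Calibrate a hit/miss threshold from sample latencies (in cycles).
 * Assumption of this sketch: the midpoint between the mean latency of
 * known-cached accesses and the mean latency of known-uncached accesses. */
uint32_t pick_threshold(const uint32_t *cached, size_t n_cached,
                        const uint32_t *uncached, size_t n_uncached) {
    assert(cached != NULL && uncached != NULL);
    assert(n_cached > 0 && n_uncached > 0);
    uint64_t sum_cached = 0;
    uint64_t sum_uncached = 0;
    for (size_t i = 0; i < n_cached; i++) {
        sum_cached += cached[i];
    }
    for (size_t i = 0; i < n_uncached; i++) {
        sum_uncached += uncached[i];
    }
    uint64_t mean_cached = sum_cached / n_cached;
    uint64_t mean_uncached = sum_uncached / n_uncached;
    return (uint32_t)((mean_cached + mean_uncached) / 2);
}

/* Receiver: a fast access means the sender touched the address (bit 1),
 * a slow access means it did not (bit 0). */
int received_bit(uint32_t latency, uint32_t threshold) {
    return latency < threshold ? 1 : 0;
}
```

In practice the calibration samples would come from timing the same address once right after accessing it and once right after flushing it. Any concrete cycle counts are machine-dependent.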
Using multiple different cache lines, as in our toy example in Section 3, allows transmitting multiple bits at once. For each of the 256 different byte values, the sender accesses a different cache line. By performing a Flush+Reload attack on all of the 256 possible cache lines, the receiver can recover a full byte instead of just one bit. However, since the Flush+Reload attack takes much longer (typically several hundred cycles) than the transient instruction sequence, transmitting only a single bit at once is more efficient. The attacker can simply do that by shifting and masking the secret value accordingly.
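One cache line can carry one bit. Using several different cache lines (as in the simple example code of Section 3) makes it possible to transmit several bits at once. A byte (8 bits) has 256 possible values, and the sender accesses a different cache line for each value, so by performing a Flush+Reload attack on all 256 possible cache lines the receiver can recover a whole byte rather than a single bit. However, because a Flush+Reload attack takes much longer (typically several hundred cycles) than executing the transient instruction sequence, transmitting only one bit at a time is more efficient. The attacker can steal the secret data bit by bit using shift and mask operations.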
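The two encodings can be compared in a small C sketch of the bookkeeping on each side. It simulates the channel with plain arrays and performs no cache operations. All names are hypothetical, and `PROBE_STRIDE` (4096 bytes between probe lines) is an assumption of this sketch rather than a value stated in this section.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PROBE_LINES 256
#define PROBE_STRIDE 4096 /* assumed spacing between probe lines, in bytes */

/* Sender, byte at once: offset of the one probe line touched for a byte. */
size_t offset_for_byte(uint8_t secret) {
    return (size_t)secret * PROBE_STRIDE;
}

/* Sender, bit at once: shift and mask to select one bit of the secret.
 * The sender touches the single monitored line only when this is 1. */
int bit_of(uint8_t secret, unsigned bit) {
    assert(bit < 8);
    return (secret >> bit) & 1;
}

/* Receiver, byte at once: scan all 256 measured latencies and return the
 * index of the fastest line if it is below the threshold, else -1.
 * This is the expensive part: 256 reloads per transmitted byte. */
int decode_byte(const uint32_t latency[PROBE_LINES], uint32_t threshold) {
    int best = 0;
    for (int i = 1; i < PROBE_LINES; i++) {
        if (latency[i] < latency[best]) {
            best = i;
        }
    }
    return latency[best] < threshold ? best : -1;
}

/* Receiver, bit at once: rebuild the byte from 8 received bits,
 * where bits[i] is bit i of the secret (least significant bit first). */
uint8_t assemble_byte(const int bits[8]) {
    uint8_t value = 0;
    for (unsigned i = 0; i < 8; i++) {
        if (bits[i]) {
            value |= (uint8_t)(1u << i);
        }
    }
    return value;
}
```

With the bit-serial encoding the receiver reloads one line per transmitted bit, so 8 reloads per byte instead of 256, at the cost of running the transient instruction sequence 8 times. That is the trade-off the paragraph above describes.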
Note that the covert channel is not limited to microarchitectural states which rely on the cache. Any microarchitectural state which can be influenced by an instruction (sequence) and is observable through a side channel can be used to build the sending end of a covert channel. The sender could, for example, issue an instruction (sequence) which occupies a certain execution port such as the ALU to send a '1'-bit. The receiver measures the latency when executing an instruction (sequence) on the same execution port. A high latency implies that the sender sends a '1'-bit, whereas a low latency implies that the sender sends a '0'-bit. The advantage of the Flush+Reload cache covert channel is the noise resistance and the high transmission rate [10]. Furthermore, the leakage can be observed from any CPU core [35], i.e., rescheduling events do not significantly affect the covert channel.
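Note that a covert channel does not always depend on the cache. Any CPU microarchitectural state that the transient instruction sequence can influence, and whose change can be observed through a side channel, can be used to build the sending end of a covert channel. For example, the sender can execute an instruction that occupies the port of some execution unit (such as the ALU) to send a "1" bit. The receiver executes an instruction on the same execution port and measures the latency. A high latency means the sender sent a "1" bit, and a low latency means the sender sent a "0" bit. The advantages of the Flush+Reload covert channel are its noise resistance and its high transmission rate [10]. In addition, the leak can be observed from any CPU core [35], so scheduling events do not significantly affect the covert channel.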
(To be continued.)