A Parallel Full-System Emulator for Risc Architure Host

H.-Y. Jeong et al. (eds.), Advances in Computer Science and Its Applications,
Lecture Notes in Electrical Engineering 279,  
DOI: 10.1007/978-3-642-41674-3_145, © Springer-Verlag Berlin Heidelberg 2014 
Xiao-Wu Jiang, Xiang-Lan Chen, Huang Wang, and Hua-Ping Chen 
Department of Computer Science, University of Science and Technology of China 
No. 443, Huangshan Road, Baohe District, Hefei City, Anhui Province, PRC 
Abstract. In this paper, we port a parallel full-system emulator to a RISC host so that
it achieves higher performance by using all the cores of the physical CPU. In contrast,
a traditional full-system emulator emulates an SMP machine sequentially and can use
only one core of the host machine. We mainly deal with the translation of atomic
instructions to RISC ll/sc pairs, and apply lightweight lock-free FIFO queue algorithms
that use both interleaving and non-interleaving ll/sc pairs.
The tests show that the performance of the parallel full-system emulator has high
Keywords: parallel emulation, atomic instructions, lock-free queue.
1 Introduction 
RISC is a type of microprocessor architecture that uses a small, highly optimized set
of instructions. The first RISC projects appeared in the late 1970s and early 1980s:
the IBM 801, Stanford MIPS, and Berkeley RISC I and II. In the 1980s and early 1990s,
a wide variety of similar RISC processors were used in the Unix workstation market as
well as in printers, routers, etc. Since the beginning of the 21st century, RISC
architectures have dominated low-end and mobile systems, mainly because of their low
power consumption and low cost compared to x86. Today the typical RISC architectures
are ARM and MIPS.
As the performance of a single processor has nearly reached its ceiling, other
technologies are used to keep up with Moore's Law. Symmetric multiprocessing is the
most efficient one. Manufacturers typically integrate multiple cores onto a single
integrated circuit die, which is known as a chip multiprocessor (CMP). Around 2006,
Intel and AMD introduced the x86 CMP CPUs Core Duo and Athlon 64 X2 to desktop
users. Four years later, the ARMv7-A based multi-core Cortex-A9 and the
MIPS64-compatible quad-core Loongson 3A became available. Desktop CPUs have now
reached ten cores (e.g. Intel Xeon E7-2850) and mobile CPUs eight cores (e.g. Samsung
Exynos 5 Octa).
RISC architectures are now moving towards the desktop. AMD has announced ARM CPUs
for servers, Google already runs ChromeOS on ARM, and Microsoft has released
Windows RT to support ARM. In addition, some desktops and laptops are built on the
Loongson 3 family, a multicore MIPS64-compatible CPU developed by the
Chinese Academy of Sciences.
Even though RISC architectures, especially ARM and MIPS, are developing rapidly,
on the desktop they still lack applications compared to x86.
Full-system emulation has proved to be an effective way to migrate existing
applications to other architectures. There are already some full-system emulators that
enable an x86 OS to run on ARM or MIPS. A typical tool is QEMU, which emulates SMP
machines sequentially and can emulate multiple guest architectures on multiple host
architectures, including x86 on ARM/MIPS. Thanks to the high single-core performance
of x86, sequential QEMU is fast enough to use on an x86 machine. MIPS and ARM chips
have no fewer cores than x86 chips, but each core is much slower. For example, the
MIPS64-compatible Loongson 3A has 4 cores at 900 MHz while the Intel i5-2400 has 4
cores at 3.1 GHz; when running 7z in one thread, the Intel CPU is 5.8x faster than the
Loongson. So when emulating an x86 machine on ARM/MIPS, it is very important to use
all of the cores rather than only one to make the guest machine run smoothly.
x86 hosts have several parallel full-system emulators that can use all the resources of
the host machine, but there is currently no parallel full-system emulator for ARM or
MIPS hosts.
In order to make full-system emulation faster, we port the parallel full-system emulator
COREMU to RISC architectures. The remainder of this paper presents how we port it.
Section 2 introduces full-system emulators and the parallelization strategy of
full-system emulators on x86. Section 3 focuses on the translation of atomic
instructions from CISC x86 to RISC MIPS/ARM. Section 4 presents lock-free FIFO queue
algorithms using ll/sc pairs for interrupt simulation. Section 5 reports experimental
results comparing our parallel full-system emulator with the ordinary full-system
emulator and the host OS. This paper mainly discusses MIPS; ARM is handled in the same
way except for the lock-free queue algorithm in Section 4.3.
2 QEMU and Parallel Full-System Emulator 
2.1 QEMU and QEMU’s Multiprocessor Emulation 
QEMU [1] is a hosted virtual machine monitor: it emulates CPUs through dynamic
binary translation (DBT) and provides a set of device models, enabling it to run a
variety of unmodified guest operating systems. QEMU has a user-mode emulation mode
and a full-system emulation mode; in the latter, QEMU emulates a full computer system,
including one or more processors and peripherals.
For a single emulated processor, QEMU translates the guest code to TCG (Tiny Code
Generator) operations and then translates those to host instructions. After a block of
guest code has been translated to host instructions, QEMU executes it. In
multiprocessor emulation, QEMU emulates an SMP machine with multiple processors and
a device that supports inter-core communication (such as the APIC on x86). QEMU
emulates these processors sequentially with a round-robin strategy: each emulated
processor gets a time slice to execute, after which the physical CPU switches to the
next emulated processor. Between time slices, the physical CPU executes peripheral
simulation and inter-core communication.
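The round-robin schedule described above can be sketched in C. This is only a minimal
model under stated assumptions, not QEMU's actual main loop: vcpu_t, run_slice,
service_devices and the constants are hypothetical names introduced for illustration.

```c
#define TIME_SLICE 1000           /* hypothetical: guest instructions per slice */

/* Hypothetical per-core state: just counts executed guest instructions. */
typedef struct {
    unsigned long executed;
} vcpu_t;

static unsigned long io_rounds;   /* how often device emulation ran */

/* Stand-in for "translate and execute guest code for one time slice". */
static void run_slice(vcpu_t *cpu)
{
    cpu->executed += TIME_SLICE;
}

/* Stand-in for peripheral simulation and inter-core communication. */
static void service_devices(void)
{
    io_rounds++;
}

/* One host thread walks over all emulated cores in turn, so only one
   emulated core makes progress at any moment. */
static void round_robin(vcpu_t cpus[], int n, int rounds)
{
    for (int r = 0; r < rounds; r++) {
        for (int i = 0; i < n; i++) {
            run_slice(&cpus[i]);
            service_devices();    /* handled between time slices */
        }
    }
}
```

With four emulated cores, each core runs for only a quarter of the host thread's time,
which is the cost that parallel emulation tries to remove.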
2.2 Parallel Full-System Emulator 
While QEMU is a sequential full-system emulator, a few parallel full-system emulators
exist: Parallel SimOS, COREMU [2], PQEMU [5] and HQEMU [6]. Parallel SimOS is designed
for the Alpha architecture and the other three are designed specifically for x86
machines; none of them can currently run on the MIPS architecture. Compared to the
other parallel emulators, COREMU has high scalability and high performance, so our
work is mainly based on it.
COREMU is hosted on x86 and targets multiple architectures, especially x86 and
ARM. It wraps the translation-execution logic of each emulated core in its own thread
and binds these threads to different physical CPU cores. It also wraps all peripheral
emulation in a separate thread called the IO thread. COREMU thus relies mainly on
multithreading to achieve parallelism. It provides efficiently emulated synchronization
primitives to coordinate concurrent accesses to the emulated shared memory from the
emulated processors. To deal with inter-core communication, COREMU uses a lock-free
FIFO queue.
When building a parallel full-system emulator like COREMU on MIPS, we mainly
deal with the atomic instruction translation strategy for lightweight memory
transactions and the lock-free FIFO queue for inter-core communication.
Fig. 1. Sequential and parallel full-system emulation 
3 Atomic Instruction on MIPS Host 
3.1 Atomic Instruction on X86 
CAS (compare-and-swap) is an atomic instruction that is widely used in multithreaded
code to achieve synchronization. The C function in Figure 2 shows the basic behavior
of CAS; the hardware guarantees that this sequence executes atomically.

int CAS(int *mem, int oldval, int newval){
  int old_reg_val = *mem;
  if (old_reg_val == oldval)
    *mem = newval;        /* store only if the expected value was found */
  return old_reg_val;
}
Fig. 2. CAS in C

void inc(int *reg){
  int old, new;
  do {                    /* retry until no other writer interfered */
    old = *reg;
    new = old + 1;
  } while (CAS(reg, old, new) != old);
}
Fig. 3. Translation of atomic inc using CAS

COREMU uses the CASN (multiword CAS) algorithm in atomic instruction translation,
which executes multiple CAS operations to simulate an atomic instruction of the guest
machine. Figure 3 shows the translation of an atomic inc using CAS in C. COREMU uses
CASN mainly because it targets x86 hosts, and x86 has cmpxchg as its CAS instruction.
3.2 Atomic Instruction on MIPS 
Unlike x86, MIPS is a RISC architecture and has no CAS instruction. MIPS provides
ll (Load Linked) and sc (Store Conditional) to implement atomic read-modify-write
(RMW) operations; on ARM the corresponding instructions are ldrex and strex.
ll reg,mem loads a word from memory into reg and records the access. sc reg,mem stores
a word to the same memory location. When the sc instruction executes, it checks
whether the location has been modified since the last ll instruction. If it has not
been modified, the store succeeds and reg is set to 1; if it has been modified, the
store is not performed and reg is set to 0. LL/SC has two advantages over CAS: reads
and writes are separate instructions, and both instructions can be performed using
only two registers.
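The ll/sc retry pattern can be mirrored in portable C. C has no direct ll/sc primitive,
so this sketch assumes C11 <stdatomic.h> and uses atomic_compare_exchange_weak as a
stand-in: it is allowed to fail spuriously, much like sc, and therefore has to sit in a
retry loop. The function names atomic_inc_retry and cas_value are hypothetical.

```c
#include <stdatomic.h>

/* Atomic increment written as load / modify / conditional store / retry,
   the same shape as an ll / addi / sc / beqz sequence. */
static int atomic_inc_retry(_Atomic int *mem)
{
    int old = atomic_load(mem);       /* plays the role of ll */
    int desired;
    do {
        desired = old + 1;
        /* On failure, old is refreshed with the current value of *mem. */
    } while (!atomic_compare_exchange_weak(mem, &old, desired));
    return desired;
}

/* Software CAS that returns the value observed in *mem.
   A spurious failure leaves expected == oldval, so the loop retries;
   a real mismatch updates expected and ends the loop. */
static int cas_value(_Atomic int *mem, int oldval, int newval)
{
    int expected = oldval;
    while (!atomic_compare_exchange_weak(mem, &expected, newval)) {
        if (expected != oldval)
            break;                    /* another value is stored: give up */
    }
    return expected;
}
```

The second function is the building block that a CAS-based algorithm needs when the
host only offers an ll/sc-style primitive.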
3.3 Aligned Instruction 
The x86 atomic instructions include inc, dec, add, xchg, and, or, xadd, bit_testandset,
bit_testandreset, etc., but operating systems and applications do not use them all. Our
experiments show that the Linux kernel and the applications running on it only use inc,
dec, xchg, cmpxchg and xadd. This paper uses ll/sc pairs and inline assembly to
implement lightweight memory transactions. Figures 4-7 show the translated inc, xchg,
xadd and cmpxchg in MIPS.
1:  ll    t,*mem
    addi  t,t,1
    sc    t,*mem
    beqz  t,1b            # retry if the store failed
Fig. 4. inc mem

1:  ll    temp1,*mem
    move  temp,reg
    sc    temp,*mem
    beqz  temp,1b         # retry if the store failed
    move  reg,temp1       # update reg only after sc has succeeded
Fig. 5. xchg reg,mem

1:  ll    temp,*dst
    add   temp1,reg,temp
    sc    temp1,*dst
    beqz  temp1,1b        # retry if the store failed
    move  reg,temp        # update reg only after sc has succeeded
Fig. 6. xadd reg,mem

1:  ll    temp,mem
    bne   temp,old,2f     # mismatch: store back the value just loaded
    move  temp,new
2:  sc    temp,mem
    beqz  temp,1b         # retry if the store failed
Fig. 7. cmpxchg mem,old,new

3.4 Unaligned Instruction
The above covers all 32-bit aligned memory accesses. As a CISC architecture, however,
x86 allows memory accesses that are not 32-bit aligned, while MIPS requires 32-bit
aligned accesses. Our experiments show that all these unaligned accesses come from
8-bit or 16-bit xchg and cmpxchg, and that the accessed memory never crosses a 32-bit
boundary.
Given this property, we handle an unaligned atomic instruction as follows: when QEMU
encounters one, it aligns the address down to a 32-bit boundary (new_addr = addr &
~0x3) and operates on the whole 32-bit word atomically, which ensures the atomicity of
the original operation.
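A minimal C sketch of this idea for an 8-bit xchg follows. It makes several assumptions
that are not stated in the paper: C11 atomics stand in for the ll/sc loop, bytes are
numbered in little-endian order inside the word, and the helper names align_down32 and
xchg8_in_word are hypothetical.

```c
#include <stdatomic.h>
#include <stdint.h>

/* new_addr = addr & ~0x3: round a guest address down to its 32-bit word. */
static uintptr_t align_down32(uintptr_t addr)
{
    return addr & ~(uintptr_t)0x3;
}

/* Atomically exchange one byte inside an aligned 32-bit word.
   byte_index is addr & 0x3 (assumed little-endian numbering).
   Returns the byte that was stored there before. */
static uint8_t xchg8_in_word(_Atomic uint32_t *word, unsigned byte_index,
                             uint8_t newval)
{
    unsigned shift = byte_index * 8;
    uint32_t mask = (uint32_t)0xff << shift;
    uint32_t old = atomic_load(word);
    uint32_t desired;
    do {
        /* Keep the three untouched bytes, replace only the target byte. */
        desired = (old & ~mask) | ((uint32_t)newval << shift);
    } while (!atomic_compare_exchange_weak(word, &old, desired));
    return (uint8_t)((old & mask) >> shift);
}
```

Because the whole word is replaced in one atomic step, a concurrent write to any of
the four bytes makes the loop retry, so the narrow operation stays atomic.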
4 Lock-Free Queue in Interrupt Simulation
4.1 Interrupt Simulation
Because emulation in QEMU is sequential, the asynchronous communication between cores
and between cores and devices is emulated in a synchronous way. All processor execution
is scheduled in a round-robin fashion. When a core is scheduled out, QEMU handles the
pending events, including device interrupts and inter-processor interrupts.
In parallel emulation, however, more than one emulated core is running at the same
time, so the interrupt vector may be modified concurrently by several running cores.
COREMU uses a lock-free FIFO queue to achieve asynchronous communication.
4.2 Lock-Free Queue in X86 
Unlike an ll/sc pair on MIPS, CAS on x86 cannot detect the ABA problem [8]. A typical
ABA scenario looks like this:
• Process1 reads value A from shared memory.
• Process1 is then preempted, allowing Process2 to run.
• Process2 modifies the shared memory value from A to B and back to A before it is
preempted.
• Process1 begins execution again, sees that the shared memory value has not changed,
and continues.
The ABA problem is a major concern when designing a lock-free queue algorithm, because
queue nodes are usually referenced by pointers, and the same pointer value may reappear
when malloc returns a previously freed address for a newly enqueued node. COREMU adds
a counter to each queue node and uses CAS2 to avoid the ABA problem in its lock-free
queue. CAS2 checks a queue node consisting of a pointer and a counter, and the counter
is changed by every enqueue and dequeue operation, so it never matches a stale value.
Moreover, x86 has native CAS2 instructions: cmpxchg8b/16b.
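The counter technique can be sketched as follows. This is an illustration under
assumptions, not COREMU's code: a 32-bit index stands in for the node pointer so that
index and counter fit into one 64-bit atomic word, and tagged_t, pack_tagged,
unpack_tagged and tagged_cas are hypothetical names.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* A value plus a modification counter, compared and swapped as one unit. */
typedef struct {
    uint32_t index;   /* stands in for the node pointer */
    uint32_t count;   /* bumped on every successful update */
} tagged_t;

static uint64_t pack_tagged(tagged_t v)
{
    return ((uint64_t)v.count << 32) | v.index;
}

static tagged_t unpack_tagged(uint64_t raw)
{
    tagged_t v = { (uint32_t)raw, (uint32_t)(raw >> 32) };
    return v;
}

/* Succeeds only if both index and counter still match the snapshot. */
static bool tagged_cas(_Atomic uint64_t *slot, tagged_t expected,
                       uint32_t new_index)
{
    uint64_t exp = pack_tagged(expected);
    tagged_t next = { new_index, expected.count + 1 };
    return atomic_compare_exchange_strong(slot, &exp, pack_tagged(next));
}
```

If another thread changes the index from A to B and back to A, the counter has advanced
by two, so a CAS that still holds the old snapshot fails even though the index looks
unchanged.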
4.3 Lock-Free Queue in MIPS 
Interleaving ll/sc pairs can be used directly to build a lock-free FIFO queue, as
described by Claude Evéquoz [3]. We apply this algorithm on ARM because ARM supports
interleaving of ll/sc pairs.
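Q: array[0..Q_LENGTH-1] of *NODE;    // Circular list initialized with null
unsigned int Head, Tail;             // Extraction and insertion indices

bool enqueue(node *p){
  unsigned int t,tail;
  node *slot;
  while(true){
    t = Tail;
    if(t == Head + Q_LENGTH)
      return FULL_QUEUE;
    tail = t % Q_LENGTH;
    slot = LL(&Q[tail]);
    if(t == Tail){
      if(slot != null){              // slot already taken: help advance Tail
        if(LL(&Tail) == t)
          SC(&Tail,t+1);
      }
      else if(SC(&Q[tail],p)){       // node stored: now advance Tail
        if(LL(&Tail) == t)
          SC(&Tail,t+1);
        return OK;
      }
    }
  }
}

node *Dequeue(void){
  unsigned int h,head;
  node *slot;
  while(true){
    h = Head;
    if(h == Tail)
      return null;
    head = h % Q_LENGTH;
    slot = LL(&Q[head]);
    if(h == Head){
      if(slot == null){              // slot already emptied: help advance Head
        if(LL(&Head) == h)
          SC(&Head,h+1);
      }
      else if(SC(&Q[head],null)){    // node removed: now advance Head
        if(LL(&Head) == h)
          SC(&Head,h+1);
        return slot;
      }
    }
  }
}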
Fig. 8. Lock free FIFO queue using ll/sc pair 
Because this algorithm always needs either nesting or interleaving of ll/sc pairs, we
cannot use it on MIPS, which does not support either. A single enqueue or dequeue
actually consists of two operations: updating the Head/Tail index and inserting or
removing the node. The two operations must be executed as one atomic step, while MIPS
only supports one atomic operation at a time.
Harris et al. [7] describe a way to build CASN, which atomically performs a multiword
CAS. We first use an ll/sc pair to build a software version of CAS and then build CAS2
from it. In this way we can use a lock-free queue in COREMU, but CAS2 on MIPS is much
heavier than CAS, not to mention that there is no native support for CAS itself. So it
is very important to reduce the use of CAS2 in the lock-free queue algorithm.
We use the lock-free algorithm by John D. Valois [4], which specifically reduces the
number of CASN operations: both enqueue and dequeue need only one CAS2 and one xadd.
The algorithm is based on a standard circular array. Besides node values there are
three special values: HEAD, TAIL and EMPTY. Initially, two adjacent locations are set
to HEAD and TAIL while all others are set to EMPTY. To enqueue the value x, a process
finds the unique location containing the special TAIL value. CAS2 is then used to
change the two adjacent locations from <TAIL, EMPTY> to <x, TAIL>. The dequeue
operation is similar, using CAS2 to change <HEAD, x> to <EMPTY, HEAD> and then
returning x.
In addition, we keep two counters: the number of enqueues and the number of dequeues.
Both are incremented by FAA (fetch-and-add) whenever an enqueue or dequeue completes.
These two counters help to find HEAD and TAIL quickly. FAA can be emulated by
xadd 1,mem (Figure 6), and CAS2 can be built from multiple CAS operations.
The algorithm still works when it reaches the beginning or the end of the array,
because the software CAS2 does not require the two memory locations to be adjacent.
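The state transitions of this queue can be walked through with a sequential model in C.
The sketch below is not the lock-free implementation: cas2 is an ordinary, non-atomic
function that only stands in for the software CAS2, payloads are assumed to be
non-negative integers, and all names (QLEN, slots, enq_count, deq_count) are
hypothetical.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define QLEN 8                                /* holds at most QLEN - 2 values */
enum { EMPTY = -1, HEAD = -2, TAIL = -3 };    /* payloads must be >= 0 */

static int slots[QLEN];
static atomic_uint enq_count, deq_count;      /* completed operations */

/* Sequential stand-in for the software CAS2: NOT atomic. The two
   locations need not be adjacent in memory, which allows wraparound. */
static bool cas2(int *a, int *b, int old_a, int old_b, int new_a, int new_b)
{
    if (*a != old_a || *b != old_b)
        return false;
    *a = new_a;
    *b = new_b;
    return true;
}

static void queue_init(void)
{
    for (int i = 0; i < QLEN; i++)
        slots[i] = EMPTY;
    slots[0] = HEAD;                          /* two adjacent marker slots */
    slots[1] = TAIL;
    atomic_store(&enq_count, 0);
    atomic_store(&deq_count, 0);
}

static bool enqueue(int x)
{
    /* The counter tells where TAIL currently sits. */
    unsigned t = (atomic_load(&enq_count) + 1) % QLEN;
    unsigned n = (t + 1) % QLEN;              /* may wrap to slot 0 */
    if (!cas2(&slots[t], &slots[n], TAIL, EMPTY, x, TAIL))
        return false;                         /* next slot not EMPTY: full */
    atomic_fetch_add(&enq_count, 1);          /* the FAA step */
    return true;
}

static bool dequeue(int *out)
{
    unsigned h = atomic_load(&deq_count) % QLEN;   /* where HEAD sits */
    unsigned n = (h + 1) % QLEN;
    int x = slots[n];
    if (x < 0)
        return false;                         /* TAIL follows HEAD: empty */
    if (!cas2(&slots[h], &slots[n], HEAD, x, EMPTY, HEAD))
        return false;
    atomic_fetch_add(&deq_count, 1);          /* the FAA step */
    *out = x;
    return true;
}
```

In a concurrent version the counters are only hints, so a failed CAS2 would be followed
by re-reading the counters and retrying rather than returning immediately.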
5 Experiments and Discussion 
To compare the performance of the original QEMU with our modified QEMU, and of a
multithreaded program running natively with the same program running in our modified
QEMU, we chose two benchmarks. The benchmarks were run on a Loongson 3A, a quad-core
MIPS CPU at 900 MHz, running Debian 6. The guest OS is Debian 6 with kernel version
2.6.32-5.
First, we wrote a simple multithreaded pi program that calculates pi in N steps in
total, using T threads concurrently. Each thread t calculates a partial sum $\pi_t$,
and the partial sums are finally added up to obtain $\pi$:

$$\pi_t = \sum_{i} \frac{8}{16(iT+t)^2 - 16(iT+t) + 3}, \qquad \pi = \sum_{t} \pi_t$$
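The partitioning of the series among threads can be sketched in C. The sketch assumes
the series pi = sum over k >= 1 of 8 / (16k^2 - 16k + 3) with k = iT + t, thread
numbers t = 1..T and i starting at 0; these index ranges and the function names are
assumptions. The per-thread sums are computed one after another here to keep the
example short, whereas the benchmark runs them in T concurrent threads.

```c
/* Partial sum of thread t (1..T): terms k = t, t+T, t+2T, ... up to N. */
static double partial_pi(unsigned t, unsigned T, unsigned long N)
{
    double sum = 0.0;
    for (unsigned long i = 0; i * T + t <= N; i++) {
        double k = (double)(i * T + t);
        sum += 8.0 / (16.0 * k * k - 16.0 * k + 3.0);
    }
    return sum;
}

/* Adds up the partial sums; each loop iteration is one thread's work. */
static double pi_estimate(unsigned T, unsigned long N)
{
    double pi = 0.0;
    for (unsigned t = 1; t <= T; t++)
        pi += partial_pi(t, T, N);
    return pi;
}
```

Since every k from 1 to N is visited by exactly one thread, the threads share no data
until the final addition, which is why the benchmark scales well with the core count.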

The results are shown below. OriQemu stands for the original QEMU and ModQemu for
our modified QEMU; n (1, 2, 4) means that QEMU is run with the -smp n option (emulating
an n-core machine). The results show the efficiency of the modified QEMU: it is 3x
faster than the original QEMU when the emulated machine has 4 cores and the program
runs 4 threads. The speedup thus reaches 3, and the parallel efficiency is nearly 3/4
relative to the number of physical cores.
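As a worked reading of these numbers, using the usual definitions of speedup and
parallel efficiency (the symbols S, E and p are introduced here for illustration, with
p = 4 physical cores on the Loongson 3A):

```latex
S = \frac{T_{\mathrm{OriQemu}}}{T_{\mathrm{ModQemu}}} \approx 3,
\qquad
E = \frac{S}{p} \approx \frac{3}{4} = 0.75
```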
Fig. 9. The time of multithread pi 
Second, we compare the modified QEMU with the native machine using 7z, a widely used
multithreaded compression application that contains a built-in benchmark. The
dictionary size is set to 256 KB. The results show the compression and decompression
speed on both the native machine and the modified QEMU. A higher thread count gives a
higher compression/decompression rate, and the application achieves almost the same
speedup natively and in the modified QEMU: 2.79 versus 2.68 for compression and 3.68
versus 3.65 for decompression.
[Chart series: host_compress, host_decompress, guest_compress, guest_decompress]
Fig. 10. The speed of 7z compress/decompress 
6 Conclusion 
We present a strategy for translating atomic instructions to ll/sc pairs on RISC and
use a more lightweight lock-free FIFO queue for the emulation of asynchronous
communication. With these, we successfully emulate x86 in parallel on a MIPS host. The
experiments demonstrate its efficiency compared to the original QEMU and the host
machine.
References
1. Bellard, F.: QEMU, a fast and portable dynamic translator. USENIX (2005)
2. Wang, Z., Liu, R., Chen, Y., Wu, X., Chen, H., Zhang, W., Zang, B.: COREMU: a scalable 
and portable parallel full-system emulator. In: Cascaval, C., Yew, P.-C. (eds.) PPOPP, pp. 
213–222. ACM (2011) 
3. Evéquoz, C.: Non-Blocking Concurrent FIFO Queues with Single Word Synchronization 
Primitives. In: ICPP, pp. 397–405. IEEE Computer Society (2008) 
4. Valois, J.D.: Implementing Lock-Free Queues. In: Proceedings of the Seventh International 
Conference on Parallel and Distributed Computing Systems, Las Vegas, NV (1994) 
5. Ding, J.-H., Chang, P.-C., Hsu, W.-C., Chung, Y.-C.: PQEMU: A Parallel System Emulator 
Based on QEMU. In: ICPADS, pp. 276–283. IEEE (2011) 
6. Hong, D.-Y., Hsu, C.-C., Yew, P.-C., Wu, J.-J., Hsu, W.-C., Liu, P., Wang, C.-M., Chung, 
Y.-C.: HQEMU: a multi-threaded and retargetable dynamic binary translator on multicores. 
Paper presented at the Meeting of the CGO (2012) 
7. Harris, T.L., Fraser, K., Pratt, I.: A Practical Multi-word Compare-and-Swap Operation. In: 
Malkhi, D. (ed.) DISC 2002. LNCS, vol. 2508, Springer, Heidelberg (2002) 
8. Dechev, D., Pirkelbauer, P., Stroustrup, B.: Understanding and Effectively Preventing the 
ABA Problem in Descriptor-Based Lock-Free Designs. Paper presented at the Meeting of the 
ISORC (2010) 