Port QEMU to another CPU

最新推荐文章于 2022-05-02 09:06:12 发布

DerekLouie

最新推荐文章于 2022-05-02 09:06:12 发布

阅读量1k

点赞数

分类专栏： Automobile electric development 文章标签： function translation subroutine assembly optimization compiler

Automobile electric development 专栏收录该内容

2 篇文章 0 订阅

订阅专栏

Quote: http://louis-vuitton.daili.name/paypal/gold/browse.php?u=Oi8vbGlidm5jc2VydmVyLnNvdXJjZWZvcmdlLm5ldC9xZW11L3FlbXUtcG9ydGluZy5odG1s&b=29

Prerequisites AKA quick and dirty introduction to assembler

This is a short introduction to assembler. You will not learn how to write assembler, just get an idea how a compiler transforms a simple C function to assembler, just enough to understand the next section.

I assume that you are familiar with C, especially with pointers and addresses. I use a simple form of RISC here, because I find it easier to understand for an assembly newbie: RISC means that each instruction is as simple as possible (so that execution is very fast).

Assembler is much simpler than C. It has fewer instructions and only a handful of registers, the "variables" of assembler. This makes it hard to express complex ideas, and even harder to understand those ideas expressed in assembler.

The general registers have integer values, and their precision (measured in bits, also called width) tells you how much memory can be addressed through this CPU (there are exceptions, but let's keep it simple, okay?), because you can get a value by loading an address into a register, and then load the value from that address.

In Assembler this would look like

 	li $r1, some_variable	# load the address of some_variable into
 				# the register called r1
 	lw $r2, $r1		# load the word at $r1 into the register r2

Assembler does not care for type safety, that's the job for your compiler. Therefore you can just add two registers not caring if they hold numbers, or addresses. This would look like

 	addl $r2,$r1,$r3	# r2=r1+r3

Let's look at a simple C function now:

 int add(int a,int b) {
 	return a+b;
 }

and translate that into assembler. This looks like this:

 add:
 	addl $v0,$a0,$a1	# v0=a0+a1
 	jr $ra			# jump back to where add was called from

What are those v0,a0,a1? Well, they are other names for special registers. On MIPS, you have 32 registers: r0, .., r31. But normally you don't use them equally, but by convention r2 is called v0 and used as return value, r4 is called a0 and holds the first parameter to a function, ... This convention is called the Application Binary Interface ("ABI").

Also the register r31, which is called ra, has a special purpose: when a function is called, this register holds the address of the next instruction of the caller, i.e. "jr $ra" is just a "return" in C. Note that "return" can return a value, but "jr $ra" does not. This is what I meant by "simple": You return a value in assembler by assigning that value to a register before returning to the caller, using two instructions, not one.

Now imagine that a function calls a function calls a function etc. No chance 32 registers would suffice! So, a few of them, called non-volatile registers, are supposed to be saved by the callee, i.e. if a function wants to use one of those registers, it has to save the original value, and restore it before returning. Which register is volatile an which not? This is defined in the ABI.

Where can you save the original value? That's where the stack comes in. (Stack space is where you can temporarily store something.) Another special purpose register, called $sp - the stack pointer, holds an address which you can decrement, thus reserving stack space to use. The stack sits in the main memory, so it is not very fast as compared to registers. For programs, $sp is already initialised before the start of the program; you don't have to worry about that.

Example:

 	subl $sp,$sp,32		# subtract 32 from $sp ($sp=$sp-32)
 	sw $r7,28($sp)		# store the value of $r7 to $sp+28
 	...
 	lw $r7,28($sp)		# restore the value of $r7 from $sp+28
 	addl $sp,$sp,32		# restore $sp

Why $sp+28 and not $sp+31? Because $sp is a char*, and you need 4 bytes to store a 32 bit register. Thus, by reserving 32 bytes on the stack, we have space for 8 registers.

In this example, we reserved 32 bytes from the stack, saved the value of register $r7 into that space, did something ("..."), restored the value and freed the reserved space.

It is a good idea to write a simple function in C and look what gcc does with it: Instead of "-c" to compile a C file, you pass "-S" to write symbolic assembly code. This is more or less human readable and you can learn quite a lot from it.

It is important for the next section to keep in mind that very simple functions, i.e. not needing many registers, and not calling other functions (for which you have to save the return address, $ra), you don't need stack space, and you can then return in the middle of such a function by inserting the return instruction ("jr $ra"). This can be done with simple function in contrast to complex ones: Complex functions have to restore values from the stack, and more importantly, restore the stack pointer itself.

One last hint before delving into the details of QEmu: In QEmu's source code, the term "assembler instruction" is often abbreviated by "insn". This abbreviation is quite common when reading about assembler (for some strange reason some people don't like words with more than one syllable...).

Portable dynamic translation - detailed explanation

In the following, guest code means the assembly code which is to be emulated, while host code means the assembly code which is actually executed.

The dynamic translation is a complex process, which is made even more complex by the fact that some basic opimization takes place at translation time, too.

The basic idea is that for each instruction of the guest CPU, there is a function which simulates that instruction. Now, when emulating the guest CPU, we could call the function for the guest CPU's current instructions one by one. This is called interpreting and that's what Bochs does.

QEmu however is a dynamic translator, which means that those functions are not called directly, but a sequence of (guest) code is translated by chaining the corresponding functions (of host code). That means that prior to executing those instructions, the according functions (without the return statement) are pasted together. Evidently, if that particular code is executed more than once, this method is much faster than interpreting, because there have to be a lot less lookups, and also the overhead of function calls is no longer needed.

Dynamic translation is a lot more complicated than interpreting, because you have to change the compiled code on the fly. For example, if the instruction has operands (like "add" has), those operands have to be inserted in the copied code, at the right place.

Fortunately, that is exactly what a linker must do with the object files it links together. If, for example, an object file contains a reference to an extern int, the linker puts the address of the extern int into that location. The object file contains a list of those locations together with information what should be put there. We call those "relocations", because the object code can be relocated where you want by changing the values at these locations appropriately.

So, every (guest) instruction which takes operands, has a corresponding function with references to extern ints (__op_param1, __op_param2, ...). The program dyngen is then used to extract the relocation information, and write a function (dyngen_code) which uses this information to copy the compiled code and fill in the operands. This function is written at compile time. It is called at translate time.

There is also the problem of jumps. This is why the guest code is only translated in blocks. At the end of each block, the code must return to jump to (and possibly - before that - translate) another block.

There is more to QEmu's dynamic translation: Some tricks optimize the speed of the translated code, like not calculating unneeded flags. In order to achieve this, QEmu breaks up the functions for each instruction into oplets and introduces an intermediate translation, also known as 2-pass translation. This makes it possible to delay flag calculation or to leave out some unneeded calculations.

In the file op.c for each target, those oplets (op_*) are defined. As stated earlier, the dyngen compiler writes a function, dyngen_code, which can copy those oplets and fill in the operands, and which is placed in the file op.h. However, dyngen also generates the file opc.h, containing the opcodes for the intermediate "CPU" (i.e. numbers for the oplets), and the file gen-op.h, in which there is a function (gen_op_*) for each oplet. The latter functions are used to create the intermediate code for the oplets.

The file translate.c contains functions to translate the (guest) instructions into the intermediate code (gen_*), and the function disas_insns (or disas_{target}_insns), which reads the guest code and calls the gen_* functions (1st pass).

The file translate-all.c contains the function cpu_gen_code, which calls disas_insns to create the intermediate code, and then dyngen_code (2nd pass) to actually create the assembler code which eventually gets called.

Porting to a new host

If your host architecture does not produce ELF files, you have to implement an object file loader for that file format (find CONFIG_FORMAT_COFF for a good example).

In configure, add support for your CPU (hint: look for the string "$cpu" and adapt the lines for another CPU).

In file dyngen-exec.h, before the define of FORCE_RET, add defines for the registers (AREG*), just like it was done for the other architecures.

Also in dyngen-exec.h, at the end, add a define for EXIT_TB, which is typically just the assembler code for a return. Make sure this is as simple as it can be: if this code uses the stack, it will be fatal (the compiler will try to clean up the stack *after* the return instruction).

In exec-all.h, before the typedef of spinlock_t, add code for testandset, which is an atomic equivalent to if(*p==0) { *p=1; return 1; } else { return 0; } Atomic means that between the test of *p, and the setting of *p, the current thread may not be interrupted (or, if interrupted, return 0;). If your CPU is supported by Linux, you may want to look at /usr/include/asm/bitops.h, the function test_and_set_bit. I believe that such short pieces of code are unprotectable, especially if you don't copy them, but instead imitate them (otherwise there's a legal issue: Linux is GPL, while libqemu is LGPL. Don't you just hate it being bullied by lawyers all the time?). Test your code! QEmu will run just fine up to a certain stage (on Linux: the calibration loop) with a wrong implementation.

At the end of dyngen.h, add code for flush_icache_range, which should invalidate your CPU's instruction cache. If in doubt, leave empty and hope for the best. If you have a RISC processor, you probably have to find out how to invalidate the cache, because most of RISC's speed comes from aggressive caching.

In dyngen.c, you have to add some defines. Find the line "#error unsupported CPU - please update the code" and add a section for your new host. The only tricky thing is whether your compiler produces ELF object files with relocations in RELOC or RELOCA format. The easiest way to find out is to try both and then look at the code which dyngen produced (op.h). If there are no relocations, you took the wrong format.

In dyngen.c in the function gen_code, add code to find the return instruction (look for "#error unsupported CPU"). The variable p_start contains the start of the instruction, and p_end points just after it. Your job is to set copy_size so that p_start+copy_size points to the return code. A good example is HOST_PPC.

Also in the function gen_code, add code to handle the relocations. Note that you don't have to handle all relocations, as the file op.o will be linked to libqemu.a, and you only need to handle the relocations for __op_param1 and friends, and the relocations containing absolute addresses pointing into the code.

In vl.c, before the definition of cpu_ticks_offset, add a definition for cpu_get_real_ticks. If there is no CPU support for this, or you are too lazy to find it (like I am), you can use this code:

 -- snip --
 /* Derived from: "m68k updates #2" by Richard Zidlicky
   "crude hack to get some sort of rdtsc support" */

 
 #include <sys/time.h>
 static int64_t cputicks=0;
 static struct timeval lastcptcall={0,0};

 
 // assume 5 MHz Pentium, min 80 ticks between rdtsc calls

 
 int64_t cpu_get_real_ticks(void)
 {
       struct timeval tp;
       gettimeofday(&tp,(void*)0);
       if (tp.tv_sec == lastcptcall.tv_sec &&
          tp.tv_usec == lastcptcall.tv_usec ){
         cputicks += 1;
       } else {
         cputicks=0;
         lastcptcall=tp;
       }
       return ((int64_t)tp.tv_sec*1000000+tp.tv_usec)*5+cputicks;
 }
 -- snap --

Remarks AKA my experience with porting to MIPS host

In a lot of places, there were already stubs for mips, but a few modifications were missing.

My biggest problem was the ELF code: first attempts only produced PIC code which uses a global table. What I mean is this: Everytime there was a reference to __op_param1 or friends, the compiler produced code like

 	LI r0, offset
 	LW r1,r0(gp)

which means that a 32 value is loaded from the address which is in the global table at offset, where offset is a 16 bit number, or in C:

 	r1 = *(int)(gp[(short)offset]);

This is of no use, of course, as dyngen_code wants to modify the code in question to use another parameter than __op_param1, and the address in the global table would stay the same for every instance of that particular oplet.

I played around with several assemblers, and tried to force gcc to not use the global table, only to find out that it was IRIX' assembler introducing it stubbornly, not gcc. At one point I had an object file which did not contain code using the global table. Unfortunately this was no longer valid PIC code, so the linker would not link it. I finally gave up and implemented asm constructs for __op_param1 and friends.

As MIPS is a RISC code, which basically means that all instructions are 32 bit wide, and therefore cannot take any 32 bit operand, a register can load a value only in two steps. That means two relocations for each operand.

Next problem: IRIX' assembler uses RELOC *and* RELOCA. Don't shout at me, it's not my fault. Therefore I had to make an ugly hack which copies those relocations without addend (the RELOC type) together with the other relocations (the RELOCA type) into a new place. The funniest thing: the addend is not even used in my code!

The only relocations I had to deal with now were the calling relocations. There was much frustration with trying to make those relocations independent from the global table, but in the end I realized that this is not bad at all: Those relocations (with the exception of op_jmp, which is treated differently, anyway) are satisfied at link time (as op.o gets linked into libqemu.a), so dyngen_code does not need to modify them.

One word about the AREGs: MIPS has 32 registers, but according to the ELF ABI they cannot be freely used. Some of them have to be saved explicitely before calling functions, others are clobbered. This does concern us, because we use compiled functions, and must play by the same rules. It would only be different if we implemented all op_* functions as hand crafted assembly code.

When I was trying to find out what is happening, I added an fprintf to EXIT_TB(). *BIG MISTAKE*. Don't do that. An fprintf will almost certainly reserve stack space, which clobbers your stack, because the code to cleanup the stack never will be called. It is no help to use "return;" instead, as this most likely will result in a "jmp .end_of_function", and we want to strip away the final ret (chaining!).

Another mistake I made: I compiled op.c with "-fomit-frame-pointer", and the rest without. This leads to wrong code! Try to compile every object file with the same options until it runs, *then* try to optimize the compilation process.

Speaking of optimization: Everytime I compiled op.c with -O0, almost every oplet started by reserving stack space. This is not what I want, because often, there is an EXIT_TB() in an oplet (meaning: return from the current Translated Block to QEmu), which is just a return instruction. On CISC machines, the return address is on the stack. On RISC machines, it is in a register which is possibly saved on the stack. So this would lead to segmentation faults because of stack corruption.

So I had to turn on optimization in order to avoid using the stack. Fortunately, only moderate optimization is needed, because the oplets are really simple functions using mainly global variables. Unfortunately, gcc insisted on optimizing away simple assignments with registers s0-s7. This was fascinating for me, as I thought that the optimizer realized that the result of those assignments is unnecessary. After many hours, I finally read again an email from Paul Brook, and understood what is going on: on RISC processors, a jump is not immediate... it executes the next instruction called the "branch delay slot", and then jumps. As dyngen wants to strip away the jump, this is no good: we would have to strip away the instruction before the last one! I finally found the gcc option "-mno-delayed-branch" to fix that.

Hint: If you have a similar problem, namely that you do not know which gcc option makes your op.o invalid, you might want to try to add "-daP" to the OP_CFLAGS in order to get a bunch of files, named "op.c.*" which contain output from the various stages in gcc processing. In my case, the file op.c.34.dbr contained the reordering, and dbr stands for "delayed branch".

I also had great fun with debugging all the time: there's no decent debugger for IRIX but gdb, which seems to be a bit broken when it comes to pthreads. The only remedy was to use extensive logging (to stderr, feels like Matrix) which was better in the end anyway, because I used a few tricks so I was even faster than single stepping interactively.

So, I was almost there. A few instructions worked already, but then there was either a segmentation fault, or an illegal instruction, or even a bus error. Never at the same time, though. Okay, googling again, I found a nice tutorial about shellcode on MIPS. Shellcode is assembler code with one restriction: it cannot contain bytes which are zero, because shellcode is meant to be injected by some hackery (most likely a buffer overflow created by a creative string, which is terminated by a zero). Therefore, shellcode has to be compact, and has to deal with the total absence of convenience: no libc, no nothing. And this tutorial contained an information I missed before: RISC processors do have an instruction cache, which means that quite a chunk of instructions are prepared to be executed long before they are actually executed. Unfortunately, as a dynamic translator creates new code all the time, this is not so desirable. Further googling provided an API call on IRIX to flush that cache.

And then, QEmu started to work. Almost. A window flashed, but then QEmu crashed with a Bus Error. A Bus Error means that there was an unaligned access, i.e. a 32-bit number should have been read from an address, which is not a multiple of 4. MIPS CPUs are quite picky when it comes to that. The problem was in hw/vga_template.h, and it said it right there:

         /* XXX: unaligned accesses are done */

The remedy was to rewrite those accesses using byte accesses.

What worked really well was to add disassembly: I just copied the respective code from gdb (normally the file gdb-{your_version}/opcodes/{arch}-dis.c), stripped away the gdb specific includes, added a prototype of print_insn_mips to dis-asm.h and two lines in disas.c just before the line 'fprintf(out, "Asm output not supported on this arch/n");' This helped a lot as I really could use the "-d out_asm" now.

Remarks II AKA my experiences with Solaris/SPARC

First of all, other people worked at the huge task to get QEmu running on SPARC. To mention a few: David S. Miller, who wrote the first support for QEmu on a SPARC host, J黵gen Keil, who evidently did much to make it run after QEmu evolved beyond the working stage (on a SPARC host), and Martin Bochnig, who worked a bit on it, and was optimistic enough to trust me I could do it (when I wasn't; the AXi he promised me when I succeed sure was helping a lot...). I am not sure who used Linux/SPARC and who Solaris/SPARC, but sfter so much work already done, there were only a few things left. However, not knowing anything about the SPARC architecture, it was quite hard for me.

The two major problems were in my case: - Register Windows, and - which global registers can I use?

As explained in the "Assembler Crash Course" at the beginning of this document, a few registers - the non-volatile registers - have to be saved before they can be used. So, the wise people at SPARC invented the saving of whole sets of registers: Out of 32 registers, only 8 registers stay the same all the time, and are therefore called g0-g7 (global registers). The other 24 registers are divided into three categories: i0-i7 (input), t0-t7 (temporary) and o0-o7 (output).

When a subroutine is called, the input parameters are stored into the output registers o0-o7. The subroutine starts by shifting the Register Window by 16 registers, so that now these registers become i0-i7, i.e. the output of the caller is the input of the callee. The input and temporary registers of the caller are hidden away, and the callee has a fresh set of temporary and output registers. At the end of the subroutine the output is stored into the registers i0-i7, the Register Window is shifted back, and so the output is visible to the caller in the registers which are now o0-o7, the same ones in which the input to the subroutine was stored before calling it.

The shifting of the Register Windows is done by the instructions "save" and "restore". These instructions have to be stripped away before the oplets can be chained, so there is code in dyngen to analyze the oplets in order to strip them. A few oplets, however, did not have these instructions in my case, because they are so simple they don't need them. I added code to not strip these functions.

Of course, there is more to it. For example, the "save" instruction takes an argument, which basically says how much space would be needed on the stack if we're out of Register Windows (depending on the processor, there are only 2-32 windows). And you don't really have 8 input registers, as the ABI defines i7 as the return address and i6 as the frame pointer.

I strongly believe that the inventor of the term "Register Windows" at SPARC is a very funny person, indeed. If I ever meet him, I buy him whatever drink he likes most.

Which global register to use?

You might have asked yourself why it is necessary to strip the "save" and "restore" insns. Here is why: we want to be able to use the local registers l0-l7 also!

I tried those global registers which were already selected by David S. Miller, only to find out that QEmu crashed on my box. I then reduced the #define'd AREGs to the minimum, 4 registers. Although I passed "-ffixed-g1" to gcc, which means that gcc should leave the register g1 alone, it got messed up from time to time. I now believe that some libc function just uses it without restoring it, but maybe I am wrong and the ABI specifies this as a volatile register. By using g2, g3, g4 and g6, I finally had a running system.

But not for long! Martin tested it and soon found that sometimes it does not work. I now know why: I should have read the ABI! It's clearly stated there: g1 is volatile. g5-g7 are reserved for the system! As g0 is always 0 (as on most RISC machines, for a clean design), I had to use l0-l7. Of course, for that to work, I had to tell QEmu that whenever a translated block is executed, those registers are clobbered. Remember, the syntax is

 asm("some assembler" : output registers : input registers : clobbered registers)

As on SPARC, the translated block is called via assembly anyway (see cpu-exec.c), this was easy.

What helped me most in the course of getting QEmu running on SPARC, was forcing it to single step (in target-i386/translate.c, I just made the "if" checking for "cflags & CF_SINGLE_INSN" succeed always), and then run it with "-d in_asm,cpu". After the crashes, I compared the output in /tmp/qemu.log with that of a running QEmu on another machine, and soon found which register was clobbered.