http://locklessinc.com/articles/gcc_asm/

GCC Inline ASM

GCC has an extremely powerful feature where it allows inline assembly within C (or C++) code.  Other assemblers allow verbatim assembly constructs to be inserted into object code.  The assembly code then interfaces with the outside world through the standard ABI.  GCC is different.  It exposes an interface into its "Register Transfer Language" (RTL).  This means that gcc understands the meaning of the inputs and outputs to the fragment of assembly code.

The extra information gcc has allows it to carefully choose the registers (or other operands) that define the interface.  The ones chosen can vary depending on the surrounding code.  In addition, gcc can be told which registers will be "clobbered" by the assembly code.  It will then automatically save and restore them if required.  This contrasts strongly with other methods, where inlined assembly code needs to do this saving and restoring manually, even when the surrounding code is such that it isn't needed.

The result is that commonly a piece of gcc inline assembly will compile into a single asm instruction in the executable or library.  (Often you just want access to a single instruction not exposed by C.)  However, to do this, you need to understand how to craft the constraints given to the compiler.  If they are incorrect, then subtle bugs can result.

Constraints

A simple function using inline-assembly might look like:

static __attribute__((used)) int var1;
int func1(void)
{
	int out;
	asm("mov var1, %0" : "=r" (out));
	return out;
}

The above shows several features of gcc's interface.  Firstly, the asm code is a compile-time C constant string.  You can put anything you like within that string.  GCC doesn't parse the assembly language itself.  What it does do is use escape sequences (i.e. %0 in the above) to reference the interface described by the programmer.  In this case %0 corresponds to the zeroth constraint, which in turn is described after the colon.

That constraint "=r" is an output-constraint (due to the use of the '=' symbol), and consists of a general-purpose register (due to the use of the 'r' symbol.  The resulting output is then stored into the variable within the parenthesis, 'out'.

The result is a magic bit of code that somehow materializes a value, and then stores it into the variable 'out'.  GCC doesn't understand where the value comes from.  So, in turn, it doesn't know that the variable 'var1' is actually used unless you tell it explicitly with the used attribute.  (An unused variable can be elided from the executable object as a simple optimization.)

When the above is put inside a .c file called gcc_asm.c, and then compiled, the result is:


	.p2align 4,,15
	.globl	func1
	.type	func1, @function
func1:
.LFB7:
	.cfi_startproc
#APP
# 8 "gcc_asm.c" 1
	mov var1, %eax
# 0 "" 2
#NO_APP
	ret
	.cfi_endproc
.LFE7:
	.size	func1, .-func1

The standard ABI on 64-bit x86 machines is to return integers in the %eax register.  GCC picks this as the register to hold the variable 'out'.  Thus the resulting function actually only consists of two instructions:  (The full listing above also contains a whole lot of asm directives describing unwinding and debug information, but those don't appear in the straight-line code.)


	mov var1, %eax
	ret

See how gcc has replaced the '%0' in the asm string with the register it picked for the zeroth constraint.  If there were more constraints, we could use '%1', '%2' etc. for them in the asm string.  Values up to '%9' are available.

The above describes how to get information out of a fragment of inline assembly code.  So what about the reverse, getting information in?  An example function that does that looks like:

static __attribute__((used)) int var2;
void func2(int parm)
{
	/* Register input - volatile because has no outputs, writes to memory */
	asm volatile("mov %0, var2" : : "r" (parm) : "memory");
}

The above looks very similar to the first function.  However, it has two more colon-delimited sections in the asm statement.  The first section is again the asm string.  The second, for the outputs, is blank in this case since this function has no outputs.  The third section holds an input constraint.  Notice that the '=' symbol is missing.  (It's an input, not an output.)  What remains is the 'r', describing that this asm code wants that input stored in some general register.  Finally, the asm ends with a 'memory' clobber.  This tells gcc that the code writes to arbitrary memory.

One other difference from the other function is that the asm fragment has an extra 'volatile' keyword.  This is necessary because the code has no outputs, so gcc would otherwise be free to elide an apparently useless asm that doesn't interact with anything else.  The 'volatile' tells gcc that it shouldn't be removed.  The 'memory' clobber tells gcc that it shouldn't move this asm across other memory references.  (Otherwise a later read of 'var2' might be moved before our write to it.)

It is possible to write output-less inline asm without the above annotations.  However, be aware that gcc can optimize your asm away, or move it around, if they are missing.  If that happens when you don't expect it, the result will again be subtle bugs.

The above when compiled yields:


	.p2align 4,,15
	.globl	func2
	.type	func2, @function
func2:
.LFB8:
	.cfi_startproc
#APP
# 17 "gcc_asm.c" 1
	mov %edi, var2
# 0 "" 2
#NO_APP
	ret
	.cfi_endproc
.LFE8:
	.size	func2, .-func2

Which again is as small as possible.  GCC picks the %edi register corresponding to the ABI register for the first parameter on x86_64.  (If you want to find the exact code generated by the asm fragment, look for the areas surrounded by #APP, #NO_APP comments.)
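
As an aside, not every output-less asm needs the "memory" clobber.  A minimal sketch (the helper name is hypothetical): a spin-wait hint touches no memory at all, so only the 'volatile' is needed to stop gcc from deleting it.

static inline void cpu_relax(void)
{
	/* No inputs, outputs, or memory effects: volatile alone keeps
	   gcc from removing the instruction */
	asm volatile ("pause");
}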

It is easy to create an inline asm with both input and output parameters:

int func3(int parm)
{
	int out;
	asm ("mov %1, %0": "=r" (out) : "r" (parm));
	return out;
}

Here, the input parameter is %1, and the output is %0.  Note the AT&T syntax used by default, which puts the destination operand on the right of the asm instructions.  Intel format can be used, which swaps things around.  However, most gcc inline asm you will see will stick to AT&T format, so you should get used to seeing it.

The above compiles into:


	.p2align 4,,15
	.globl	func3
	.type	func3, @function
func3:
.LFB9:
	.cfi_startproc
#APP
# 24 "gcc_asm.c" 1
	mov %edi, %eax
# 0 "" 2
#NO_APP
	ret
	.cfi_endproc
.LFE9:
	.size	func3, .-func3

GCC has picked both input and output registers so that again the result is a single instruction.
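
As an aside on the Intel format mentioned above, a hedged sketch of the same function written that way, assuming the whole file is compiled with -masm=intel (so gcc prints register names without a '%' prefix), might look like:

int func3_intel(int parm)
{
	int out;
	/* Intel operand order: destination first, source second */
	asm ("mov %0, %1" : "=r" (out) : "r" (parm));
	return out;
}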

A slightly more complex example is when you want something to be both an input and an output at the same time.  For that, place it in the output list, and use a '+' symbol instead of an '=':

int func4(int parm)
{
	asm ("add $0xff, %0": "+r" (parm) : : "cc");
	return parm;
}

The above also shows how you should prefix immediates with a dollar symbol in AT&T syntax.  It also has the 'cc' clobber.  This stands for "condition codes".  Since the add instruction will affect the carry flag amongst other things, we need to tell gcc about it.  Otherwise it might schedule a test and its matching branch on either side of our code.  If it did so, the branch might go the wrong way due to the condition codes being corrupted.  Basically, any inline asm that does arithmetic should explicitly clobber the flags like this.

When compiled, we get:


	.p2align 4,,15
	.globl	func4
	.type	func4, @function
func4:
.LFB10:
	.cfi_startproc
	movl	%edi, %eax
#APP
# 31 "gcc_asm.c" 1
	add $0xff, %eax
# 0 "" 2
#NO_APP
	ret
	.cfi_endproc
.LFE10:
	.size	func4, .-func4

So now the input and outputs are one and the same register, %eax.  However, since the parameter passed to the function is in %edi, gcc helpfully copies it into %eax for us.  Only when the copying was really needed did gcc insert it.

Looking at a slightly more complex example:

int foo(int);

int func5(int parm)
{
	int out;
	asm ("mov $0xff, %0\n\t"
		"add %1, %0\n\t"
		: "=&r" (out) : "r" (parm) : "cc");
	return foo(out);
}

/* Broken, because input and output share a register */
int func6(int parm)
{
	int out;
	asm ("mov $0xff, %0\n\t"
		"add %1, %0\n\t"
		: "=r" (out) : "r" (parm) : "cc");
	return foo(out);
}

Functions 5 and 6 attempt to do something similar to function 4.  However, instead of returning a value, they call some other function called foo.  This means that the output should end up in the %edi register.  However, the input will also arrive in that register.  The result shows how gcc will assume that output and input registers are allowed to overlap unless you tell it otherwise.

func6() will not work correctly.  gcc will pick %edi for both of 'out' and 'parm'.  This will compile into:


	.p2align 4,,15
	.globl	func6
	.type	func6, @function
func6:
.LFB12:
	.cfi_startproc
#APP
# 51 "gcc_asm.c" 1
	mov $0xff, %edi
	add %edi, %edi
	
# 0 "" 2
#NO_APP
	jmp	foo@PLT
	.cfi_endproc
.LFE12:
	.size	func6, .-func6

Which isn't what we want.  The register is corrupted, and then added to itself.

To fix this, use the '=&' constraint.  That tells gcc that the output constraint register shouldn't overlap an input register.  Using that instead gives us function 5:


	.p2align 4,,15
	.globl	func5
	.type	func5, @function
func5:
.LFB11:
	.cfi_startproc
#APP
# 41 "gcc_asm.c" 1
	mov $0xff, %eax
	add %edi, %eax
	
# 0 "" 2
#NO_APP
	movl	%eax, %edi
	jmp	foo@PLT
	.cfi_endproc
.LFE11:
	.size	func5, .-func5

Which uses two registers, as required.  It picks %eax for this, and inserts the extra copy needed.

You may have noticed that the multi-line asm used '\n\t' control codes.  This simply makes the result nice.  You just need a newline '\n' to go to the next line.  The tab character indents things to line up with the code generated by gcc for the rest of the program.  (Remember that the inline asm string is basically inserted verbatim into the output sent to the assembler, modulo simple replacements.)

To have multiple inputs, just separate them with commas:

int func7(int p1, int p2)
{
	int out;
	asm ("add %2, %1\n\t"
		"mov %1, %0\n\t"
		: "=r" (out) : "r" (p1), "r" (p2) : "cc");
	return out;
}

Which will compile into:


	.p2align 4,,15
	.globl	func7
	.type	func7, @function
func7:
.LFB13:
	.cfi_startproc
#APP
# 61 "gcc_asm.c" 1
	add %esi, %edi
	mov %edi, %eax
	
# 0 "" 2
#NO_APP
	ret
	.cfi_endproc
.LFE13:
	.size	func7, .-func7

Another possibility is that you might want some inputs and outputs to share a register.  As described above, one way to do that is to use the '+' constraint.  However, there is another way.  You can use the number corresponding to another constraint within a second constraint.  If you do this, then gcc will know that the two are linked, and must be the same.  An example of using this is:

/* Input and output for same variable */
int func8(int p1, int p2)
{
	asm ("add %2, %0"
		: "=r" (p1) : "0" (p1), "r" (p2) : "cc");
	return p1;
}

Which compiles into:


	.p2align 4,,15
	.globl	func8
	.type	func8, @function
func8:
.LFB14:
	.cfi_startproc
	movl	%edi, %eax
#APP
# 70 "gcc_asm.c" 1
	add %esi, %eax
# 0 "" 2
#NO_APP
	ret
	.cfi_endproc
.LFE14:
	.size	func8, .-func8

This may or may not be a more readable technique than using a '+' constraint.  '+' used to be buggy in old versions of gcc, so old code tends to use this method.  Newer code might want to use the more concise '+' descriptor.
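
For comparison, here is func8 rewritten with the '+' form (a minimal sketch; it should generate the same single add instruction):

int func8b(int p1, int p2)
{
	/* '+' marks p1 as both read and written by the asm */
	asm ("add %1, %0"
		: "+r" (p1) : "r" (p2) : "cc");
	return p1;
}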

In addition to passing information in registers, gcc can understand references to raw memory.  This will expand to some more complex addressing mode within the asm string.  Note that not all instructions can handle arbitrary memory references.  Thus sometimes you need gcc to create a register with the required information.  However, if you can get away with it, it is more efficient to use memory directly.  Some code that does this looks like:

int var9;
int func9(void)
{
	int out;
	asm ("mov %0, %1"
		: "=r" (out) : "m" (var9));
	return out;
}

Which compiles into:


	.p2align 4,,15
	.globl	func9
	.type	func9, @function
func9:
.LFB15:
	.cfi_startproc
#APP
# 80 "gcc_asm.c" 1
	mov var9(%rip), %eax
# 0 "" 2
#NO_APP
	ret
	.cfi_endproc
.LFE15:
	.size	func9, .-func9

Notice how in the above, gcc has generated a %rip-relative addressing mode for us.

Sometimes you really want a constraint to be satisfied by a certain register.  Fortunately, gcc has specialized constraints for many (but not all) of the general purpose registers used on x86_64.

int func10(int p1)
{
	asm ("inc %%eax" : "+a" (p1) :: "cc");

	return p1;
}

The above code shows how you can explicitly use the 'a' register (which corresponds to %al, %ax, %eax, or %rax, depending on size).  Note how we need to use a double-percent sign within the asm string.  This is similar to a normal printf format string, where to print a single percent you need two of them.  (This is due to a percent symbol being an escape character.)

Compiling, we get:


	.p2align 4,,15
	.globl	func10
	.type	func10, @function
func10:
.LFB16:
	.cfi_startproc
	movl	%edi, %eax
#APP
# 89 "gcc_asm.c" 1
	inc %eax
# 0 "" 2
#NO_APP
	ret
	.cfi_endproc
.LFE16:
	.size	func10, .-func10

GCC has copied from %edi into the constraint register defined by 'a', %eax for us.  Note that different machines will have differing names, and differing constraint symbols for their registers.  You will need to look at the gcc documentation for your particular machine to find out what they are.  This article will concentrate on the x86_64 case.

Another commonly used register is the 'd' (%dl, %dx, %edx, %rdx) register:

int func11(int p1, int p2, int p3)
{
	asm ("add %1, %0" : "+d" (p1) : "r" (p3) : "cc");

	return p1;
}

The above is a little tricky.  p3 is passed in %edx, as specified by the function ABI.  This means that gcc needs to copy it into another register so that p1 can go there.  Fortunately, gcc handles all of the marshalling for us:


	.p2align 4,,15
	.globl	func11
	.type	func11, @function
func11:
.LFB17:
	.cfi_startproc
	movl	%edx, %ecx
	movl	%edi, %edx
#APP
# 96 "gcc_asm.c" 1
	add %ecx, %edx
# 0 "" 2
#NO_APP
	movl	%edx, %eax
	ret
	.cfi_endproc
.LFE17:
	.size	func11, .-func11

Note the extra moves before the add instruction, and afterwards in order to get things where they need to be.  This is the reason why you really shouldn't use explicit named registers if you can avoid them.  The only time where they are unavoidable is if you want to match some kind of ABI, or have to interface with an instruction with fixed inputs or outputs.

An example of this on x86 is the mul instruction.  That will put its output in the 'a' and 'd' registers, and always takes one of its inputs from the 'a' register.  So to describe its use you might do something like:

/* Commuting inputs */
unsigned func12(unsigned p1, unsigned p2)
{
	unsigned hi, lo;

	asm ("mul %3"
		: "=a" (lo), "=d" (hi)
		: "%0" (p1), "r" (p2) : "cc");

	return hi + lo;
}

The above uses another feature of gcc asm.  Sometimes inputs commute, and we don't really care which of them uses a particular register.  In this case p1*p2 = p2*p1, and we don't mind which of them goes in %eax.  To tell gcc this, we can use the '%' constraint flag, which means that that constraint and the following one commute.


	.p2align 4,,15
	.globl	func12
	.type	func12, @function
func12:
.LFB18:
	.cfi_startproc
	movl	%edi, %eax
#APP
# 109 "gcc_asm.c" 1
	mul %esi
# 0 "" 2
#NO_APP
	addl	%edx, %eax
	ret
	.cfi_endproc
.LFE18:
	.size	func12, .-func12

In this case, gcc decides not to swap the order of the two inputs because it doesn't matter.

We can try something slightly different, where we use the 'D' constraint to force the use of %edi as the multiplicand.

unsigned func13(unsigned p1, unsigned p2)
{
	unsigned hi, lo;

	asm ("mul %3"
		: "=a" (lo), "=d" (hi)
		: "%0" (p1), "D" (p2) : "cc");

	return hi + lo;
}

This compiles into:


	.p2align 4,,15
	.globl	func13
	.type	func13, @function
func13:
.LFB19:
	.cfi_startproc
	movl	%edi, %eax
	movl	%esi, %edi
#APP
# 121 "gcc_asm.c" 1
	mul %edi
# 0 "" 2
#NO_APP
	addl	%edx, %eax
	ret
	.cfi_endproc
.LFE19:
	.size	func13, .-func13

Unfortunately, gcc fails to make the swap in this case as well, even though it would be very profitable to do so.  It looks like you can't really count on the '%' constraint specifier, which is a shame.

Generalized Constraints

There is another way to get more flexibility within the constraints.  You can simply list more than one constraint symbol.  GCC will choose the best one.  An example of using either a register, or a direct memory reference is:

int var14;
int func14(int p1)
{
	asm ("add %1, %0"
		:"+r" (p1) : "rm" (var14) : "cc");
	return foo(p1);
}

Which will use the better direct-memory operand:


	.globl	func14
	.type	func14, @function
func14:
.LFB20:
	.cfi_startproc
#APP
# 133 "gcc_asm.c" 1
	add var14(%rip), %edi
# 0 "" 2
#NO_APP
	jmp	foo
	.cfi_endproc
.LFE20:
	.size	func14, .-func14

Another way of gaining flexibility is using a more general constraint.  'g' allows a register, memory, or immediate operand.  Using it:

int var15;
int func15(int p1)
{
	asm ("add %1, %0"
		:"+r" (p1) : "g" (var15) : "cc");
	return foo(p1);
}

GCC will again pick the best option, which in this case is a direct memory addressing mode.


	.p2align 4,,15
	.globl	func15
	.type	func15, @function
func15:
.LFB21:
	.cfi_startproc
#APP
# 142 "gcc_asm.c" 1
	add var15(%rip), %edi
# 0 "" 2
#NO_APP
	jmp	foo
	.cfi_endproc
.LFE21:
	.size	func15, .-func15

Of course, if you want an immediate, there is a symbol for that as well, 'i'.  The limitation is that an immediate must be a compile-time or link-time constant.

int func16(int p1)
{
	asm ("add %1, %0"
		:"+r" (p1) : "i" (99) : "cc");
	return foo(p1);
}

Which compiles into:


	.globl	func16
	.type	func16, @function
func16:
.LFB22:
	.cfi_startproc
#APP
# 150 "gcc_asm.c" 1
	add $99, %edi
# 0 "" 2
#NO_APP
	jmp	foo
	.cfi_endproc
.LFE22:
	.size	func16, .-func16

Notice how gcc automatically converts into the AT&T syntax for us, with the dollar symbol preceding the constant.

There are other constraint modifiers.  One of these is the '#' symbol, which acts like a comment character.

int var17;
int func17(int p1)
{
	asm ("add %1, %0"
		:"+r" (p1) : "m#hello" (var17) : "cc");
	return foo(p1);
}

The above compiles into:


	.p2align 4,,15
	.globl	func17
	.type	func17, @function
func17:
.LFB23:
	.cfi_startproc
#APP
# 159 "gcc_asm.c" 1
	add var17(%rip), %edi
# 0 "" 2
#NO_APP
	jmp	foo
	.cfi_endproc
.LFE23:
	.size	func17, .-func17

Everything after the hash symbol is ignored.  Unfortunately, you can't include spaces or punctuation symbols within the comment.  The other thing that ends the 'comment' is a comma.  This is because you can use commas to allow multiple alternatives in an inline asm.  The alternatives are linked together (all first option, all second option, etc.) rather than being unlinked like in the 'rm' case.  Some example code is:

int var18, var18a;
int func18(int p1, int p2)
{
	/* Uses reg-reg option */
	asm ("add %1, %0"
		:"+m,r,r" (p1) : "r,m,r" (p2) : "cc");

	/* Uses mem-reg option */
	asm ("add %1, %0"
		:"+m,r,r" (var18) : "r,m,r" (p2) : "cc");
#if 1
	/* Uses reg-mem option even with two memory operands (gcc copies for us) */
	asm ("add %1, %0"
		:"+m,r,r" (var18) : "r,m,r" (var18a) : "cc");
#else
	/* Creates invalid instruction with two memory operands */
	asm ("add %1, %0"
		:"+g" (var18) : "g" (var18a) : "cc");
#endif
	/* Uses reg-mem option */
	asm ("add %1, %0"
		:"+m,r,r" (p1) : "r,m,r" (var18) : "cc");

#if 0
	/* Doesn't work "inconsistent operand constraints" */
	asm ("add %1, %0"
		:"+r,m" (p1) : "m,r" (var18) : "cc");
#endif

	return foo(p1);
}

The above shows the power of the technique.  In x86 assembly language, there can only be a single reference to memory within an instruction.  Thus if we use two 'g' constraints, we can sometimes generate invalid code.  One fix for this is to use register-only 'r' constraints.  However, they can lead to inefficiency.  What we want to do is only ban the invalid option.  By using alternative constraints, we select the valid 'm + r', 'r + m', and 'r + r' options.

Note that this feature isn't used very often within inline asm code, so it is a little buggy.  The final inline asm in the above function, which is disabled with #if 0, should work.  However, gcc gets confused by it.  The fix is to add the 'r + r' option, like in the other cases.

When compiled, the above yields:


	.p2align 4,,15
	.globl	func18
	.type	func18, @function
func18:
.LFB24:
	.cfi_startproc
#APP
# 174 "gcc_asm.c" 1
	add %esi, var18(%rip)
# 0 "" 2
# 170 "gcc_asm.c" 1
	add %esi, %edi
# 0 "" 2
#NO_APP
	movl	var18a(%rip), %eax
#APP
# 178 "gcc_asm.c" 1
	add %eax, var18(%rip)
# 0 "" 2
# 186 "gcc_asm.c" 1
	add var18(%rip), %edi
# 0 "" 2
#NO_APP
	jmp	foo
	.cfi_endproc
.LFE24:
	.size	func18, .-func18
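
Returning to the alternative that gets the "inconsistent operand constraints" error above, a minimal sketch of the fix just described (appending the reg-reg option to both operands) might look like:

int func18b(int p1)
{
	/* Same add as the disabled case, with an extra 'r' alternative on
	   both sides so gcc always has a consistent reg-reg fallback */
	asm ("add %1, %0"
		:"+r,m,r" (p1) : "m,r,r" (var18) : "cc");
	return foo(p1);
}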

Disparagement

Another possibility is when you want a constraint, but you don't want the compiler to worry too much about the cost of that constraint.  This doesn't really come into play very often.  In fact, with orthogonal architectures like x86, it may not happen at all.  This is really a case of API leakage, where gcc exposes to everyone a feature that is only useful on some machines.  The '*' constraint specifier causes the following character to not count in terms of register pressure.  The canonical example is the following:

int var19;
int func19(int p1)
{
	int out;

	/* Picks the "same register" case */
	asm ("add %1, %0"
		:"=*r,m" (out) : "0,r" (p1) : "cc");

	/* Picks the "mem-register operand" case */
	asm ("add %1, %0"
		:"=*r,m" (var19) : "0,r" (out) : "cc");

	return var19 - out;
}

In the above we have an instruction (an add, in this case), which will either take two references to the same register, or a memory-register combination.  The same-reg, same-reg case is more strict, and we would like gcc to use the memory-addressing version if possible.  The '*' accomplishes this.  However, this trick is rather subtle... and probably shouldn't be used with inline asm.  The above compiles into:


	.p2align 4,,15
	.globl	func19
	.type	func19, @function
func19:
.LFB25:
	.cfi_startproc
#APP
# 204 "gcc_asm.c" 1
	add %edi, %edi
# 0 "" 2
# 208 "gcc_asm.c" 1
	add %edi, var19(%rip)
# 0 "" 2
#NO_APP
	movl	var19(%rip), %eax
	subl	%edi, %eax
	ret
	.cfi_endproc
.LFE25:
	.size	func19, .-func19

Note how the differing form of the instruction is chosen.

A much better technique is to use constraint modifiers that explicitly penalize some alternatives over others.  By using the right amount of penalization, you can create patterns that match the machine's costs.  GCC will then be able to make intelligent choices about which is best.  The simple way to do this is to add a '?' character to the more costly alternative.

/* Alternatives - costs (works with partial matches) */
int func20(int p1, int p2)
{
#if 1
	/* Picks using eax */
	asm ("add %1, %0"
		:"+r?,a" (p1) : "d?,r" (p2) : "cc");
#else
	/* Picks using edx */
	asm ("add %1, %0"
		:"+r,a?" (p1) : "d,r?" (p2) : "cc");
#endif
	return foo(p1);
}

The above shows how you can tell the compiler that (for example) %eax can be more or less expensive to use than %edx.  It compiles into:


	.p2align 4,,15
	.globl	func20
	.type	func20, @function
func20:
.LFB26:
	.cfi_startproc
	movl	%edi, %eax
#APP
# 220 "gcc_asm.c" 1
	add %esi, %eax
# 0 "" 2
#NO_APP
	movl	%eax, %edi
	jmp	foo
	.cfi_endproc
.LFE26:
	.size	func20, .-func20

Of course, a single level of penalization might not be enough.  You can add more '?' symbols.  Two question marks are penalized even more heavily than one.

/* Can use more than one '?' for larger cost */
int func21(int p1, int p2)
{
#if 1
	/* Picks using eax */
	asm ("add %1, %0"
		:"+r??,a?" (p1) : "d??,r?" (p2) : "cc");
#else
	/* Picks using edx */
	asm ("add %1, %0"
		:"+r?,a??" (p1) : "d?,r??" (p2) : "cc");
#endif
	return foo(p1);
}

Giving:


	.p2align 4,,15
	.globl	func21
	.type	func21, @function
func21:
.LFB27:
	.cfi_startproc
	movl	%edi, %eax
#APP
# 235 "gcc_asm.c" 1
	add %esi, %eax
# 0 "" 2
#NO_APP
	movl	%eax, %edi
	jmp	foo
	.cfi_endproc
.LFE27:
	.size	func21, .-func21

For even greater penalization, you can use the '!' symbol.  It is equivalent to 100 '?' symbols.  This should be very rarely needed.

/* Even stronger disparagement of the alternative ! = 100?'s */
int func22(int p1, int p2)
{
#if 1
	/* Picks using eax */
	asm ("add %1, %0"
		:"+r!,a??" (p1) : "d!,r??" (p2) : "cc");
#else
	/* Picks using edx */
	asm ("add %1, %0"
		:"+r??,a!" (p1) : "d??,r!" (p2) : "cc");
#endif
	return foo(p1);
}

Giving:


	.p2align 4,,15
	.globl	func22
	.type	func22, @function
func22:
.LFB28:
	.cfi_startproc
	movl	%edi, %eax
#APP
# 250 "gcc_asm.c" 1
	add %esi, %eax
# 0 "" 2
#NO_APP
	movl	%eax, %edi
	jmp	foo
	.cfi_endproc
.LFE28:
	.size	func22, .-func22

Clobbers

Up until now, we have only used the clobber part of the asm intrinsic for 'memory' and 'cc' (condition codes).  However, there are other things you can put in there.  The most often used are names of registers.  This tells gcc that that register is somehow used in the asm string.  It will not use that register for inputs or outputs, and will helpfully save that register before the asm executes, and then automatically restore it afterwards, if required.

An example of this where we clobber the %rdx register is:

int func23(int p1)
{
	unsigned out = 25;
	asm ("mul %1"
		: "+a" (out) : "g" (p1) : "%rdx", "cc");
	return out;
}

The mul instruction will write to %rax and %rdx.  We don't care about the upper part, so it isn't an output.  To tell gcc about the register write, the clobber does the job.  (Yes, there are other versions of the x86 multiply instruction that don't clobber %rdx unnecessarily, but this is just an example of how clobbers might be useful.)  This compiles into:


 	.p2align 4,,15
	.globl	func23
	.type	func23, @function
func23:
.LFB29:
	.cfi_startproc
	movl	$25, %eax
#APP
# 264 "gcc_asm.c" 1
	mul %edi
# 0 "" 2
#NO_APP
	ret
	.cfi_endproc
.LFE29:
	.size	func23, .-func23

In this case, %rdx is 'dead' because it is a parameter-register in the ABI.  GCC doesn't need to save or restore it, so doesn't.  Without the clobber, we would need to save and restore the register manually.  That would be inefficient in cases like the above, where such saves and restores are not needed.
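
As an aside, here is a hedged sketch of the alternative alluded to above: the two-operand imul form keeps its result in a single register, so no %rdx clobber is needed at all.

int func23b(int p1)
{
	int out = 25;
	/* imul src, dst multiplies in place; only the flags are affected */
	asm ("imul %1, %0" : "+r" (out) : "rm" (p1) : "cc");
	return out;
}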

Of course, you can clobber more than one register:

/* More than one clobber */
int func24(int p1)
{
	int out;

	/* %1 cannot overlap clobber list.  Note use of "%%" in asm */
	asm ("mov %1, %%edi\n\t"
		"call foo\n\t"
		: "=a" (out) : "g" (p1) : "%rdi", "%rsi", "%rdx", "%rcx", "%r8", "%r9", "memory", "cc");
	return out;
}

The above is bad coding style.  You really shouldn't use control-flow altering instructions inside inline asm.  GCC doesn't know about them, and can do optimizations that invalidate what you are trying to do.  (If 'foo' is inlined everywhere, it may not even exist to call.)  Also, there have been many bugs when the number of clobbered registers gets too large.  If gcc can't find a way to save and restore everything it may simply give up and crash.

In the above case, we are lucky, and it compiles without issue.  The trick is to notice that the clobbered registers are all dead (except %rdi) due to the x86_64 SYSV ABI.


	.p2align 4,,15
	.globl	func24
	.type	func24, @function
func24:
.LFB30:
	.cfi_startproc
	movl	%edi, %eax
#APP
# 275 "gcc_asm.c" 1
	mov %eax, %edi
	call foo
	
# 0 "" 2
#NO_APP
	ret
	.cfi_endproc
.LFE30:
	.size	func24, .-func24

A much better technique is to use explicit temporaries.  GCC can then allocate them wherever it has space.  It can also move things around for more efficiency, based on the needs of the surrounding code.  An example of doing this is:

/* Better option, if available.  Use temp out registers */
int func25(int p1, int p2)
{
	int temp1, temp2;
	int out;

	/* '=&' so temp's don't overlap with inputs */
	asm ("mov %3, %1\n\t"
		"mov %4, %2\n\t"
		"shr $10, %1\n\t"
		"shl $10, %2\n\t"
		"add %3, %1\n\t"
		"lea (%4, %2, 1), %0\n\t"
		"xor %1, %0\n\t"
		: "=r" (out), "=&r" (temp1), "=&r" (temp2): "g" (p1), "g" (p2) : "cc");

	return out;
}

In the above, we use two temporary registers.  Since we don't want them to overlap the other inputs or outputs, they need to be defined by '=&r' constraints.  The only thing left on the clobber list is the 'cc' due to the arithmetic and logic instructions altering the condition codes.


	.p2align 4,,15
	.globl	func25
	.type	func25, @function
func25:
.LFB31:
	.cfi_startproc
#APP
# 288 "gcc_asm.c" 1
	mov %edi, %edx
	mov %esi, %ecx
	shr $10, %edx
	shl $10, %ecx
	add %edi, %edx
	lea (%esi, %ecx, 1), %eax
	xor %edx, %eax
	
# 0 "" 2
#NO_APP
	ret
	.cfi_endproc
.LFE31:
	.size	func25, .-func25

Finally, there is another way to refer to operands within the asm string itself.  Depending on your point of view, the numerical '%0-%9' scheme may be more or less readable than the following:

int func26(int p1)
{
	int out = 137;
	asm ("sub %[p1], %[out_name]"
		: [out_name] "+r" (out) : [p1] "g" (p1) : "cc");
	return out;
}

By putting a name within square brackets in the constraints we can then use those names in the asm string.  Note that the asm operand name does not have to be the same as the C variable it comes from.  However, for readability, it may be better to keep the two the same if possible.  The main disadvantage of the technique is that it can make the asm string a little longer, and can make it harder to see what addressing modes are used.


	.p2align 4,,15
	.globl	func26
	.type	func26, @function
func26:
.LFB32:
	.cfi_startproc
	movl	$137, %eax
#APP
# 304 "gcc_asm.c" 1
	sub %edi, %eax
# 0 "" 2
#NO_APP
	ret
	.cfi_endproc
.LFE32:
	.size	func26, .-func26

Less Common Constraint Types

There are a few standard constraints beyond those discussed above.

One of these is for "offsetable memory", which is any memory reference which can take an offset to it.  In the orthogonal x86 architecture, this is anything that 'm' could reference, so this constraint class isn't too useful there.  Other machines may be different though.  An example of its usage is:

int func27(int p1)
{
	static int out[2];

	asm ("mov %1, 4+%0"
		: "=o" (out) : "r" (p1));

	return out[1];
}

Which compiles into:


	.p2align 4,,15
	.globl	func27
	.type	func27, @function
func27:
.LFB33:
	.cfi_startproc
#APP
# 317 "gcc_asm.c" 1
	mov %edi, 4+out.2398(%rip)
# 0 "" 2
#NO_APP
	movl	out.2398+4(%rip), %eax
	ret
	.cfi_endproc
.LFE33:
	.size	func27, .-func27

The linker and assembler understand the more complex addressing within "out.2398+4(%rip)", and will generate the appropriate fix-up for us.

Since some machines have offsetable memory as a separate class from normal memory constraints, there is some memory which is not offsetable.  If you want to have a constraint that references such memory, you can use the 'V' constraint flag.  However, since x86 doesn't have such a beast, we don't provide an example of its use.

Some machines provide memory that automatically increments or decrements things stored within it.  Such memory can be described by the '<' and '>' constraints.  Again, x86 doesn't have anything like that, so those constraints are not supported, and no example is provided.

Another constraint that isn't so useful on x86 is 'n'.  That refers to a constant integer that is known at assembly time.  Some machines have less capable assemblers and linkers, and cannot use the more general 'i' constraint.  'i' is an integer constant known at link time.  Since 'n' defines a sub-category of 'i', you can also use it on x86:

int func28(void)
{
	int out = 0;

	asm ("add %1,%0"
		: "+r" (out) : "n" (5));

	return out;
}

The above acts just like 'i' would do, and uses the 5 as an immediate:


	.p2align 4,,15
	.globl	func28
	.type	func28, @function
func28:
.LFB34:
	.cfi_startproc
	xorl	%eax, %eax
#APP
# 332 "gcc_asm.c" 1
	add $5,%eax
# 0 "" 2
#NO_APP
	ret
	.cfi_endproc
.LFE34:
	.size	func28, .-func28

Another integer immediate constraint type is 's'.  This describes an integer that is known at link time, but not compile or assembly time.  This isn't particularly useful on x86, but on other machines can lead to optimizations.

Not all immediates are integers.  Some machines allow immediate floating point numbers.  The 'E' constraint is for floating point immediates that are defined on the compiling machine.  If the target machine is different, then the bit-values may be incorrect.  Thus, this constraint shouldn't be used if you are cross-compiling.

The x86 architecture really doesn't allow floating point immediates.  You should get constants into SSE registers and the legacy floating point stack from memory instead.  However, there are a couple of special cases that still work:

double func29(void)
{
	double out;
	unsigned long long temp;

	asm ("movabs %2, %1\n\t"
		"movq %1, %0\n\t"
		: "=x" (out), "=r" (temp) : "E" (2.0));
	return out;
}

The above uses the bit-pattern for the double '2.0', and indirectly moves it into an SSE register (selected by the 'x' constraint).  It would be more efficient to do a direct memory load, but the above does work:


	.p2align 4,,15
	.globl	func29
	.type	func29, @function
func29:
.LFB35:
	.cfi_startproc
#APP
# 346 "gcc_asm.c" 1
	movabs $0x4000000000000000, %rax
	movq %rax, %xmm0
	
# 0 "" 2
#NO_APP
	ret
	.cfi_endproc
.LFE35:
	.size	func29, .-func29

The code for float-sized immediates is similar:

float func30(void)
{
	float out;
	unsigned temp;

	asm ("mov %2, %1\n\t"
		"movd %1, %0\n\t"
		: "=x" (out), "=r" (temp) : "E" (2.0f));
	return out;
}

Giving:


	.p2align 4,,15
	.globl	func30
	.type	func30, @function
func30:
.LFB36:
	.cfi_startproc
#APP
# 357 "gcc_asm.c" 1
	mov $0x40000000, %eax
	movd %eax, %xmm0
	
# 0 "" 2
#NO_APP
	ret
	.cfi_endproc
.LFE36:
	.size	func30, .-func30
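
As noted above, a direct memory load is more efficient than routing the bit-pattern through an integer register.  A minimal sketch of that alternative, assuming a static constant kept in memory:

double func29b(void)
{
	/* Let gcc place the constant in memory and load it straight into
	   an SSE register with a single movsd */
	static const double two = 2.0;
	double out;
	asm ("movsd %1, %0" : "=x" (out) : "m" (two));
	return out;
}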

In addition to the 'E' constraint there is the 'F' constraint.  This is cross-compiling friendly, and should probably be used instead.  Otherwise, it has the same meaning as its 'E' cousin.

double func31(void)
{
	double out;
	unsigned long long temp;
	asm ("movabs %2, %1\n\t"
		"movq %1, %0\n\t"
		: "=x" (out), "=r" (temp) : "F" (2.0));
	return out;
}

Which produces identical code to the 'E' version:


	.p2align 4,,15
	.globl	func31
	.type	func31, @function
func31:
.LFB37:
	.cfi_startproc
#APP
# 368 "gcc_asm.c" 1
	movabs $0x4000000000000000, %rax
	movq %rax, %xmm0
	
# 0 "" 2
#NO_APP
	ret
	.cfi_endproc
.LFE37:
	.size	func31, .-func31

Another rarely used constraint is 'p'.  It describes a valid memory address.  On x86, it behaves just like 'm' does.  You should use the more standard 'm' instead.

void *func32(void)
{
	static int mem;
	void *out;
	asm ("lea (%1), %0"
		: "=r" (out) : "p" (&mem));
	return out;
}

Which compiles into:


	.p2align 4,,15
	.globl	func32
	.type	func32, @function
func32:
.LFB38:
	.cfi_startproc
#APP
# 381 "gcc_asm.c" 1
	lea ($mem.2421), %rax
# 0 "" 2
#NO_APP
	ret
	.cfi_endproc
.LFE38:
	.size	func32, .-func32

There is one final constraint common to all machines, 'X'.  This constraint matches absolutely everything.  This catch-all doesn't give gcc any information about how to pass the information to the inline asm, so gcc picks the form most convenient for it.  Since the exact output will be highly variable, it is difficult to use in normal asm instructions.  However, it may be helpful in asm directives:

const char *func33(int p1)
{
	const char *str;
	asm (".pushsection .data\n\t"
		"1:\n\t"
		".asciz \"%1\"\n\t"
		".popsection\n\t"
		"lea 1b(%%rip),%0\n\t"
		: "=r" (str) : "X" (p1));
	return str;
}

The above compiles into:


	.p2align 4,,15
	.globl	func33
	.type	func33, @function
func33:
.LFB39:
	.cfi_startproc
#APP
# 390 "gcc_asm.c" 1
	.pushsection .data
	1:
	.asciz "%edi"
	.popsection
	lea 1b(%rip),%rax
	
# 0 "" 2
#NO_APP
	ret
	.cfi_endproc
.LFE39:
	.size	func33, .-func33

This creates a zero-terminated ASCII string containing the operand used by gcc.  With a bit of section magic, it obtains a pointer to it, which is then returned in the output.

X86 Register Constraints

Most of the previous constraint types will work on all machines.  Some have been x86-only though.  For example, 'a', which will expand to '%al', '%ax', '%eax' or '%rax', will obviously not work the same way on another architecture.  We have seen a few of these x86-only constraints, but there are many more.

A simple register constraint is 'R'.  This selects any legacy register for use, i.e. one of the a, b, c, d, si, di, bp, or sp registers.  This may be useful when interfacing with old code unable to use any of the new 64-bit registers.  Otherwise, the constraint acts just like 'r' would do:

int func34(int p1, int p2, int p3, int p4, int p5)
{
	int out;
	/* Copies from r8 into legacy register first */
	asm ("mov %1, %0"
		: "=r" (out) : "R" (p5));
	return foo(out);
}

The above cannot use p5 as is because it is passed in %r8 by the ABI.  Thus gcc will insert a move instruction into a legacy register as requested.  This copy wouldn't happen if 'r' were used instead:


	.p2align 4,,15
	.globl	func34
	.type	func34, @function
func34:
.LFB40:
	.cfi_startproc
	movl	%r8d, %eax
#APP
# 405 "gcc_asm.c" 1
	mov %eax, %edi
# 0 "" 2
#NO_APP
	jmp	foo
	.cfi_endproc
.LFE40:
	.size	func34, .-func34

Another constraint that picks a subset of the available registers is 'q'.  This picks a register with an addressable lower 8-bit part.  The list of available registers differs between 64-bit mode and 32-bit mode.  In 32-bit mode, some of the registers don't exist.  i.e. you can't access %dil or %sil.

char func35(char p1)
{
	char out;
	asm ("mov %1, %0"
		: "=q" (out) : "q" (p1));
	return out;
}

Otherwise the use looks exactly like 'r' would have.


	.p2align 4,,15
	.globl	func35
	.type	func35, @function
func35:
.LFB41:
	.cfi_startproc
#APP
# 414 "gcc_asm.c" 1
	mov %dil, %al
# 0 "" 2
#NO_APP
	ret
	.cfi_endproc
.LFE41:
	.size	func35, .-func35

A variant of the above is the 'Q' constraint, that picks a register with a 'high' 8-bit sub-register.  i.e. any of the a, b, c or d registers:

char func35a(char p1)
{
	char out;
	asm ("mov %1, %0"
		: "=Q" (out) : "Q" (p1));
	return out;
}

Which compiles into:


	.p2align 4,,15
	.globl	func35a
	.type	func35a, @function
func35a:
.LFB42:
	.cfi_startproc
	movl	%edi, %edx
#APP
# 423 "gcc_asm.c" 1
	mov %dl, %al
# 0 "" 2
#NO_APP
	ret
	.cfi_endproc
.LFE42:
	.size	func35a, .-func35a

Notice how the compiler was not allowed to use the %edi register as the operand any more.  Instead, it picked %edx.

As we have seen in the earlier sections, some of the x86 registers have constraints of their very own.  We have seen 'a' and 'd'.  Similarly, 'b' and 'c' do what you might expect, and refer to the '%bl', '%bx', '%ebx', and '%rbx' registers, and the '%cl', '%cx', '%ecx', and '%rcx' registers respectively.  An example of this might be:

int func36(int p1, int p2, int p3, int p4)
{
	int out;
	asm ("mov %1, %0\n\t"
		"add %2, %3\n\t"
		"add %4, %0\n\t"
		"add %3, %0\n\t"
		: "=&r" (out) : "a" (p1), "b" (p2), "c" (p3), "d" (p4) : "cc");
	return out;
}

Where every input has had its register manually defined by an explicit constraint.  GCC needs to do a little bit of copying to get everything into the right spot:


	.p2align 4,,15
	.globl	func36
	.type	func36, @function
func36:
.LFB42:
	.cfi_startproc
	movl	%edx, %r8d
	pushq	%rbx
	.cfi_def_cfa_offset 16
	.cfi_offset 3, -16
	movl	%ecx, %edx
	movl	%edi, %eax
	movl	%esi, %ebx
	movl	%r8d, %ecx
#APP
# 423 "gcc_asm.c" 1
	mov %eax, %r9d
	add %ebx, %ecx
	add %edx, %r9d
	add %ecx, %r9d
	
# 0 "" 2
#NO_APP
	popq	%rbx
	.cfi_def_cfa_offset 8
	movl	%r9d, %eax
	ret
	.cfi_endproc
.LFE42:
	.size	func36, .-func36

There are also special constraints for the si and di registers, 'S' and 'D' respectively.  (We have used 'D' before in func13().)  Something using them looks like:

int func37(int p1, int p2)
{
	int out;
	asm ("mov %1, %0\n\t"
		"add %2, %0\n\t"
		: "=&r" (out) : "D" (p1), "S" (p2) : "cc");
	return out;
}

Which compiles into:


	.p2align 4,,15
	.globl	func37
	.type	func37, @function
func37:
.LFB43:
	.cfi_startproc
#APP
# 435 "gcc_asm.c" 1
	mov %edi, %eax
	add %esi, %eax
	
# 0 "" 2
#NO_APP
	ret
	.cfi_endproc
.LFE43:
	.size	func37, .-func37

There is one final way to access the general purpose registers, which is via the 'A' constraint.  This is the two-register pair defined by the a and d registers.  This is useful when you want to deal with 128-bit quantities in 64-bit mode, or 64-bit quantities in 32-bit mode.  The low bits are stored in the a register, and the high bits in the d register, just like the multiply and division asm instructions expect.  Its use looks like:

__uint128_t func38(unsigned long long p1, unsigned long long p2)
{
	__uint128_t out;
	asm ("mul %2"
		:"=&A" (out) : "%0" (p1), "g" (p2) : "cc");
	return out;
}

Which compiles into:


	.p2align 4,,15
	.globl	func38
	.type	func38, @function
func38:
.LFB44:
	.cfi_startproc
	movq	%rdi, %rax
#APP
# 445 "gcc_asm.c" 1
	mul %rsi
# 0 "" 2
#NO_APP
	ret
	.cfi_endproc
.LFE44:
	.size	func38, .-func38

Since the ABI requires a function returning a 128-bit integer to do so in %rax and %rdx, the above has no extra register to register copies.  (Other than that required to get the multiply instruction initialized.)

X86 Floating Point Constraints

The x86 has a strange floating-point coprocessor which uses an internal stack of registers.  Dealing with this is difficult with gcc.  You need to make sure that the right number of values are added to and removed from this stack.  GCC assumes that all output constraints are under its purview, and are popped by it.  Input constraints are more complex: they can either be popped by gcc afterwards or not.

The least complex method is to tie an input constraint to an output.  That makes it popped afterwards along with the output that replaces it.  You can also clobber an input to tell gcc it has been implicitly popped.  Otherwise, gcc will assume it can use the input later for other calculations, and will handle the popping of that register itself.

One critical detail is that the floating point processor acts on a stack.  That means that the used (popped or not) registers must be contiguous.  It's not possible for gcc to re-arrange the stack by popping something in the middle.  You need to make sure the outputs are first in the stack, followed by all registers you pop, and finally followed by the ones gcc will pop from that stack.

The constraint for the top of the floating point stack is 't'.  We can add things to the stack without a floating point register input by using memory instead:

long double func39(int p1)
{
	long double out;
	asm ("fild %1"
		: "=t" (out) : "m" (p1));
	return out;
}

The above converts an integer into a long double float:


	.p2align 4,,15
	.globl	func39
	.type	func39, @function
func39:
.LFB45:
	.cfi_startproc
	movl	%edi, -12(%rsp)
#APP
# 454 "gcc_asm.c" 1
	fild -12(%rsp)
# 0 "" 2
#NO_APP
	ret
	.cfi_endproc
.LFE45:
	.size	func39, .-func39

The ABI mandates that long doubles are returned in st(0), so the above routine doesn't need to alter the stack.

The next-from-top floating point register, st(1), also has a special constraint: 'u'.  An example of its use might be:

long double func40(long double p1, long double p2)
{
	long double out;
	asm ("fadd %2, %0"
		: "=&t" (out) : "%0" (p1), "u" (p2));
	return out;
}

Note how in the above we link the first input to the output, so it is stored in st(0), and popped by gcc afterwards.  The other input is in st(1), and since it is not clobbered, will also be popped by gcc afterwards.


	.globl	func40
	.type	func40, @function
func40:
.LFB46:
	.cfi_startproc
	fldt	8(%rsp)
	fldt	24(%rsp)
	fxch	%st(1)
#APP
# 463 "gcc_asm.c" 1
	fadd %st(1), %st
# 0 "" 2
#NO_APP
	fstp	%st(1)
	ret
	.cfi_endproc
.LFE46:
	.size	func40, .-func40

You can see how gcc sets up the floating point stack (in a not particularly efficient way).  You can also see how the st(1) input is cleaned up afterwards by the fstp instruction.  st(0) is still live at the end of the function, and is used for the long double output.
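
The other case described earlier, where the asm itself pops an input, can be sketched like this (a hedged example, not from the original set above): fmulp consumes its %st(1) operand, so that register goes in the clobber list instead of being left for gcc to pop.

long double func40b(long double p1, long double p2)
{
	long double out;
	/* fmulp multiplies st(0) by st(1), pops the stack, and leaves the
	   result at the top; the popped st(1) input is listed as clobbered */
	asm ("fmulp"
		: "=t" (out) : "0" (p1), "u" (p2) : "st(1)");
	return out;
}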

Finally, you can create an input in an arbitrary floating point slot by using the 'f' constraint.  (This doesn't work as an output constraint.)  An example of this is:

long double func41(long double p1)
{
	long double out = 2.0;
	asm ("fadd %1\n\t"
		: "+&t" (out) : "f" (p1));
	return out;
}

Where just to be different from the previous function, we use an in-out parameter on the top of the stack.


	.p2align 4,,15
	.globl	func41
	.type	func41, @function
func41:
.LFB47:
	.cfi_startproc
	flds	.LC1(%rip)
	fldt	8(%rsp)
	fxch	%st(1)
#APP
# 472 "gcc_asm.c" 1
	fadd %st(1)
	
# 0 "" 2
#NO_APP
	fstp	%st(1)
	ret
	.cfi_endproc
.LFE47:
	.size	func41, .-func41

Again the generated code has an extra fxch that isn't really needed.  You really shouldn't use the legacy floating point instructions.  Instead, modern code should use SSE instructions for its floating point work.

Another legacy part of the x86 instruction set is the mmx registers.  These are aliases of the legacy floating point stack.  This means that they are difficult to use, because you need to issue the 'emms' instruction afterwards to avoid floating point exceptions.  However, some older vectorized code does use them.  The constraint for their use is 'y':

typedef char mmx64 __attribute__ ((vector_size (8)));
mmx64 func42(mmx64 p1, mmx64 p2)
{
	asm ("paddb %1, %0"
		: "+&y" (p1) : "y" (p2));
	return p1;
}

Which compiles into:


	.globl	func42
	.type	func42, @function
func42:
.LFB48:
	.cfi_startproc
	movdq2q	%xmm0, %mm0
	movdq2q	%xmm1, %mm1
#APP
# 485 "gcc_asm.c" 1
	paddb %mm1, %mm0
# 0 "" 2
#NO_APP
	movq	%mm0, -8(%rsp)
	movq	-8(%rsp), %xmm0
	ret
	.cfi_endproc
.LFE48:
	.size	func42, .-func42

The above is obviously very inefficient, as gcc goes through the better SSE registers as mandated by the vector ABI.  Another thing missing is the emms instruction.  You'll need to use yet another inline asm in order to add it where needed.  A better option is to avoid these registers if possible.
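
A minimal sketch of that extra asm (the helper name is hypothetical, and where exactly it must be issued depends on the surrounding code):

static inline void mmx_done(void)
{
	/* Clear the MMX state so later x87 floating point code doesn't fault */
	asm volatile ("emms");
}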

Instead, most modern code should be using the 16-byte SSE registers.  The constraint for accessing those is 'x'.  (This was also used in func29.)  Since the vector ABI already passes these values in SSE registers, the overhead is much lower:

typedef double xmmd __attribute__ ((vector_size (16)));
xmmd func43(xmmd p1, xmmd p2)
{
	asm ("addpd %1, %0"
		: "+&x" (p1) : "x" (p2));
	return p1;
}

Which when compiled, produces:


	.p2align 4,,15
	.globl	func43
	.type	func43, @function
func43:
.LFB49:
	.cfi_startproc
#APP
# 494 "gcc_asm.c" 1
	addpd %xmm1, %xmm0
# 0 "" 2
#NO_APP
	ret
	.cfi_endproc
.LFE49:
	.size	func43, .-func43

Many fewer instructions are used in the above, with the bulk of the function just a single SSE instruction.

The final register constraint type is defined by the two-character string 'Yz'.  This constrains to the first SSE register, %xmm0.  This is useful because that register is often mentioned by the ABI.  It is the first floating point or vector parameter passed to a function, and also the register used for floating point or vectorized output.  Using it is easy:

xmmd func44(xmmd p1, xmmd p2)
{
	asm ("addpd %1, %0"
		: "+&Yz" (p2) : "x" (p1));
	return p2;
}

Here we deliberately cause gcc to have to swap the SSE registers around in order to get p2 into %xmm0:


	.globl	func44
	.type	func44, @function
func44:
.LFB50:
	.cfi_startproc
	movapd	%xmm0, %xmm2
	movapd	%xmm1, %xmm0
#APP
# 502 "gcc_asm.c" 1
	addpd %xmm2, %xmm0
# 0 "" 2
#NO_APP
	ret
	.cfi_endproc
.LFE50:
	.size	func44, .-func44

X86 Integer Constraints

In addition to the machine-specific register constraints, the x86 inline asm in gcc also supports special integer constraints.  Most of these are actually not useful for inline asm, being 'leakage' from the RTL pattern-matching used by the optimizer.  They can still be used, although this is not recommended, as they are not really documented.

The first of these is relatively useful.  The 'I' constraint specifies a constant integer in the range 0-31.  It is useful for 32-bit shift instructions:

unsigned func45(unsigned p1)
{
	asm ("shl %1, %0"
		: "+g" (p1) : "I" (20) : "cc");
	return p1;
}

This compiles as you might expect:


	.p2align 4,,15
	.globl	func45
	.type	func45, @function
func45:
.LFB51:
	.cfi_startproc
	movl	%edi, %eax
#APP
# 510 "gcc_asm.c" 1
	shl $20, %eax
# 0 "" 2
#NO_APP
	ret
	.cfi_endproc
.LFE51:
	.size	func45, .-func45

Similarly, there is the 'J' constraint which specifies a constant integer in the range 0-63, for 64-bit shift instructions:

unsigned long long func46(unsigned long long p1)
{
	asm ("shl %1, %0"
		: "+g" (p1) : "J" (40) : "cc");
	return p1;
}

Which compiles into:


	.p2align 4,,15
	.globl	func46
	.type	func46, @function
func46:
.LFB52:
	.cfi_startproc
	movq	%rdi, %rax
#APP
# 518 "gcc_asm.c" 1
	shl $40, %rax
# 0 "" 2
#NO_APP
	ret
	.cfi_endproc
.LFE52:
	.size	func46, .-func46

The above two constraints are helpful in that gcc will error out if the constants are the wrong size.  This extra error-checking can prevent bugs.
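
For example, a hypothetical out-of-range constant is rejected at compile time (the exact diagnostic wording varies between gcc versions):

/* Does not compile: 40 is outside the range the 'I' constraint accepts */
unsigned func45_bad(unsigned p1)
{
	asm ("shl %1, %0"
		: "+g" (p1) : "I" (40) : "cc");
	return p1;
}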

Perhaps less useful is the 'K' constraint.  This specifies a signed 8-bit integer constant:

signed char func47(signed char p1)
{
	asm ("add %1, %0"
		: "+a" (p1) : "K" (-127) : "cc");
	return p1;
}

Compiling into:


	.p2align 4,,15
	.globl	func47
	.type	func47, @function
func47:
.LFB53:
	.cfi_startproc
	movl	%edi, %eax
#APP
# 526 "gcc_asm.c" 1
	add $-127, %al
# 0 "" 2
#NO_APP
	ret
	.cfi_endproc
.LFE53:
	.size	func47, .-func47

On the other hand, the 'L' constraint is obviously something that has escaped from RTL-land.  It only allows the two integers 0xFF and 0xFFFF.  It is basically a method of pattern-matching certain zero-extending constructs.  Since you can't alter the asm string based on which constant matched, this constraint is barely useful.  Of course, it still can be used:

unsigned func48(unsigned p1)
{
	unsigned out;

	asm ("mov %1, %0\n\t"
		"and %2, %0\n\t"
		: "=&r" (out) : "L" (0xff), "g" (p1) : "cc");
	return out;
}

Where the above shows how the and instruction may be used for zero-extension:


	.p2align 4,,15
	.globl	func48
	.type	func48, @function
func48:
.LFB54:
	.cfi_startproc
#APP
# 539 "gcc_asm.c" 1
	mov $255, %eax
	and %edi, %eax
	
# 0 "" 2
#NO_APP
	ret
	.cfi_endproc
.LFE54:
	.size	func48, .-func48

Another not so useful constraint is 'M'.  This specifies integer constants from 0-3.  This is useful for RTL pattern-matching shifts that may otherwise be better done with an lea instruction.  However, again the result is something not so useful for inline asm.  You probably shouldn't use it.  However, if you do, it may look something like:

unsigned func49(unsigned p1)
{
	asm ("shl %1, %0"
		: "+&g" (p1) : "M" (3) : "cc");
	return p1;
}

Which compiles into:


	.p2align 4,,15
	.globl	func49
	.type	func49, @function
func49:
.LFB55:
	.cfi_startproc
	movl	%edi, %eax
#APP
# 548 "gcc_asm.c" 1
	shl $3, %eax
# 0 "" 2
#NO_APP
	ret
	.cfi_endproc
.LFE55:
	.size	func49, .-func49

The next integer constraint is 'N'.  This one specifies an unsigned 8-bit integer constant.  It is useful for the I/O instructions 'in' and 'out':

unsigned func50(void)
{
	unsigned out;
	asm volatile ("in %1, %0"
		: "=a" (out) : "N" (0x80));
	return out;
}

Which gives:


	.p2align 4,,15
	.globl	func50
	.type	func50, @function
func50:
.LFB56:
	.cfi_startproc
#APP
# 557 "gcc_asm.c" 1
	in $128, %eax
# 0 "" 2
#NO_APP
	ret
	.cfi_endproc
.LFE56:
	.size	func50, .-func50

The addition of 64-bit support to gcc meant that new constraints needed to be added.  Since most instructions do not support 64-bit immediates, we need something different from 'i', which will allow such large integers.  Instead, you can use 'e', the constraint for a constant 32-bit signed integer:

long long func51(void)
{
	long long out;

	asm ("mov %1, %0"
		: "=a" (out) : "e" (-1));
	return out;
}

Which when compiled gives:


	.p2align 4,,15
	.globl	func51
	.type	func51, @function
func51:
.LFB57:
	.cfi_startproc
#APP
# 567 "gcc_asm.c" 1
	mov $-1, %rax
# 0 "" 2
#NO_APP
	ret
	.cfi_endproc
.LFE57:
	.size	func51, .-func51

Similarly, there now is also a constraint for 32-bit unsigned integer constants, 'Z':

unsigned long long func52(void)
{
	unsigned long long out;

	asm ("mov %1, %0"
		: "=a" (out) : "Z" (0xfffffffful) : "cc");

	return out;
}

Which we can compile to give:


	.p2align 4,,15
	.globl	func52
	.type	func52, @function
func52:
.LFB58:
	.cfi_startproc
#APP
# 577 "gcc_asm.c" 1
	mov $4294967295, %rax
# 0 "" 2
#NO_APP
	ret
	.cfi_endproc
.LFE58:
	.size	func52, .-func52

Finally there are two floating point constant constraints that you probably shouldn't use at all.  These are used by gcc for optimizations.  The first of these, 'G', will match a constant that can be generated by the i387 with a single instruction.  However, since the resulting operand cannot actually be used by floating point instructions, there is very little point in using it in inline asm:

const char *func53(void)
{
	const char *out;

	asm (".pushsection .data\n"
		"1:\n\t"
		".asciz \"%1\"\n\t"
		".popsection\n\t"
		"lea 1b, %0\n\t"
		: "=r" (out) : "G" (1.0L));
	return out;
}

Where in the above we use the same trick as used with the 'X' constraint, and simply convert the operand into a string.  The resulting code after compilation is:


	.p2align 4,,15
	.globl	func53
	.type	func53, @function
func53:
.LFB59:
	.cfi_startproc
#APP
# 592 "gcc_asm.c" 1
	.pushsection .data
1:
	.asciz "1.0e+0"
	.popsection
	lea 1b, %rax
	
# 0 "" 2
#NO_APP
	ret
	.cfi_endproc
.LFE59:
	.size	func53, .-func53

The other floating point constraint is the equivalent for SSE registers, 'C'.  Since there are fewer constants constructible with a single instruction, this is even less useful:

const char *func54(void)
{
	const char *out;

	asm (".pushsection .data\n"
		"1:\n\t"
		".asciz \"%1\"\n\t"
		".popsection\n\t"
		"lea 1b, %0\n\t"
		: "=r" (out) : "C" (0));
	return out;
}

Which when compiled produces:


	.p2align 4,,15
	.globl	func54
	.type	func54, @function
func54:
.LFB60:
	.cfi_startproc
#APP
# 610 "gcc_asm.c" 1
	.pushsection .data
1:
	.asciz "$0"
	.popsection
	lea 1b, %rax
	
# 0 "" 2
#NO_APP
	ret
	.cfi_endproc
.LFE60:
	.size	func54, .-func54

X86 Operand Modifiers

The constraints described so far don't cover everything you might want to do in an inline assembly statement.  The problem is that the operand %0 might not be in quite the form you want.  For example, you may want to access a sub-register of %0, or use a different addressing mode that requires slightly different formatting than the default.  Fortunately, gcc offers operand modifiers that allow these changes.

Operand modifiers work by inserting a symbol between the percent sign and the number for the operand (or its square-bracketed operand name).  By using different modifiers, you can get different effects.  However, many of the modifiers are really designed for RTL usage, so aren't helpful in inline asm mode.
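
One directly useful family, mentioned here as an aside (the rest of this section covers different modifiers), prints the sub-register names of an operand: '%b0', '%w0', '%k0' and '%q0' give the 8, 16, 32 and 64-bit names of whichever register gcc chose for operand 0.  A minimal sketch, using the 'q' constraint described earlier to guarantee a byte-addressable register:

int func_iszero(int p1)
{
	int out;
	/* sete writes only the low byte, so name it with %b0 and then
	   zero-extend into the full register */
	asm ("test %1, %1\n\t"
		"sete %b0\n\t"
		"movzbl %b0, %0\n\t"
		: "=&q" (out) : "r" (p1) : "cc");
	return out;
}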

The simplest modifier is one that just outputs the character 'b' (for byte-sized accesses) if the compiler is in AT&T mode.  This helps in writing asm strings that can also be parsed in Intel mode, which requires unadorned instructions.  Use the 'B' symbol to do this:

void func55(unsigned char *p1)
{
	asm volatile ("mov%B0 $1, (%0)"
		: : "r" (p1) : "memory");
}

Which compiles into:


	.p2align 4,,15
	.globl	func55
	.type	func55, @function
func55:
.LFB61:
	.cfi_startproc
#APP
# 627 "gcc_asm.c" 1
	movb $1, (%rdi)
# 0 "" 2
#NO_APP
	ret
	.cfi_endproc
.LFE61:
	.size	func55, .-func55

Note how 'mov' gets changed into 'movb'.  This particular operand modifier doesn't really depend on the operand itself.

There are other versions of this for the 16-bit and 32-bit cases.  'W' will generate a 'w', and 'L' will create an 'l':

void func56(unsigned short *p1)
{
	asm volatile ("mov%W0 $1, (%0)"
		: : "r" (p1) : "memory");
}

void func57(unsigned *p1)
{
	asm volatile ("mov%L0 $1, (%0)"
		: : "r" (p1) : "memory");
}

Which compile into:


	.p2align 4,,15
	.globl	func56
	.type	func56, @function
func56:
.LFB62:
	.cfi_startproc
#APP
# 634 "gcc_asm.c" 1
	movw $1, (%rdi)
# 0 "" 2
#NO_APP
	ret
	.cfi_endproc
.LFE62:
	.size	func56, .-func56
	.p2align 4,,15
	.globl	func57
	.type	func57, @function
func57:
.LFB63:
	.cfi_startproc
#APP
# 641 "gcc_asm.c" 1
	movl $1, (%rdi)
# 0 "" 2
#NO_APP
	ret
	.cfi_endproc
.LFE63:
	.size	func57, .-func57

Unfortunately, this pattern does not continue into 64 bits.  The 'Q' modifier outputs an 'l', rather than the 'q' you might expect.  Perhaps this is due to the fact that most instructions cannot take a 64-bit immediate.  An example of using it is:

void func58(unsigned long long *p1)
{
	asm volatile ("mov%Q0 $1, (%0)"
		: : "r" (p1) : "memory");
}

Yielding:


	.p2align 4,,15
	.globl	func58
	.type	func58, @function
func58:
.LFB64:
	.cfi_startproc
#APP
# 648 "gcc_asm.c" 1
	movl $1, (%rdi)
# 0 "" 2
#NO_APP
	ret
	.cfi_endproc
.LFE64:
	.size	func58, .-func58

Finally, there are two other character-printing modifiers.  'S' creates an 's', and 'T' makes a 't'.  These are less useful, corresponding to legacy floating-point use.  Of course,  since the output is a raw string, you don't actually have to use them for that... and other sillier usages are possible, as is shown below.

void func59(unsigned long long *p1)
{
	asm volatile ("bt%S0 $1, (%0)"
		: : "r" (p1) : "memory");
}

void func60(unsigned long long *p1)
{
	asm volatile ("no%T0q (%0)"
		: : "r" (p1) : "memory");
}

Giving when compiled:


	.p2align 4,,15
	.globl	func59
	.type	func59, @function
func59:
.LFB65:
	.cfi_startproc
#APP
# 655 "gcc_asm.c" 1
	bts $1, (%rdi)
# 0 "" 2
#NO_APP
	ret
	.cfi_endproc
.LFE65:
	.size	func59, .-func59
	.p2align 4,,15
	.globl	func60
	.type	func60, @function
func60:
.LFB66:
	.cfi_startproc
#APP
# 662 "gcc_asm.c" 1
	notq (%rdi)
# 0 "" 2
#NO_APP
	ret
	.cfi_endproc
.LFE66:
	.size	func60, .-func60

Of course it goes without saying that such tricks should be avoided in real code.

Another operand modifier tells gcc that the operand is a label.  This is used in the "asm goto" extension.  Labels are listed after the clobber list, and can be referred to inside the asm string.  Such asms should not have any outputs.  They are designed for control flow usage.

The problem is that there is no real way to get condition code information into and out of an inline asm statement.  The asm goto method avoids this problem by letting the user do the branching inside, and thus all condition usage is encapsulated.  Other gcc optimizers can then deal with the jump labels and move them around as needed.  The result can be very efficient code.  An example using it is:

int func61(volatile void *p1, size_t p2)
{
	asm goto (
		"lock; bts %1, (%0)\n\t"
		"jc %l2\n\t"
		: : "r" (p1), "r" (p2) : "memory", "cc" : carry);
	return 0;

carry:
	return 1;
}

Which compiles into:


	.p2align 4,,15
	.globl	func61
	.type	func61, @function
func61:
.LFB67:
	.cfi_startproc
#APP
# 669 "gcc_asm.c" 1
	lock; bts %rsi, (%rdi)
	jc .L64
	
# 0 "" 2
#NO_APP
	xorl	%eax, %eax
	ret
	.p2align 4,,10
	.p2align 3
.L64:
.L63:
	movl	$1, %eax
	ret
	.cfi_endproc
.LFE67:
	.size	func61, .-func61

If this function gets inlined inside an if statement, then the extra statements that set the output will be removed by optimizers.
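
For illustration, a caller might look like the following sketch.  The handler functions and the bitmap61 variable are hypothetical, and func61() is assumed to be visible for inlining (for example, defined in the same file and compiled with optimization):

extern void was_set(void);	/* hypothetical handlers */
extern void was_clear(void);

static unsigned long bitmap61;

void caller61(void)
{
	/* Once func61() is inlined, the xor/mov pair that materializes the
	   0/1 return value disappears, and the jc inside the asm goto
	   branches directly to the chosen side of the if statement. */
	if (func61(&bitmap61, 3))
		was_set();
	else
		was_clear();
}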

The above modifiers didn't really change the output of the operands.  However, the following do.  The 'a' and 'A' modifiers deal with addresses.  They are helpful when allowing compilation in Intel mode.  They modify the operands in the correct way so that dereferencing is written in the right syntax.  An example of their use is:

void func62(unsigned char *p1)
{
	asm volatile ("movb $1, %a0"
		: : "r" (p1) : "memory");
}

void func63(void *p1)
{
	asm volatile ("jmp %A0"
		: : "r" (p1) :);
}

	.p2align 4,,15
	.globl	func62
	.type	func62, @function
func62:
.LFB68:
	.cfi_startproc
#APP
# 681 "gcc_asm.c" 1
	movb $1, (%rdi)
# 0 "" 2
#NO_APP
	ret
	.cfi_endproc
.LFE68:
	.size	func62, .-func62
	.p2align 4,,15
	.globl	func63
	.type	func63, @function
func63:
.LFB69:
	.cfi_startproc
#APP
# 687 "gcc_asm.c" 1
	jmp *%rdi
# 0 "" 2
#NO_APP
	ret
	.cfi_endproc
.LFE69:
	.size	func63, .-func63

Note how 'a' added parentheses around the register name, and 'A' added an asterisk in front.

The 'p' modifier is similar.  It modifies an operand to be a raw symbol name.  For constants, it removes the leading dollar symbol.  This is useful because in some contexts a dollar symbol is incorrect syntax.  For example, in segment-offset addressing:

int func64(void)
{
	int out;
	asm volatile ("movl %%gs:%p1, %0"
		: "=r" (out) : "i" (40) : "memory");
	return out;
}

Compiles into:


	.p2align 4,,15
	.globl	func64
	.type	func64, @function
func64:
.LFB70:
	.cfi_startproc
#APP
# 694 "gcc_asm.c" 1
	movl %gs:40, %eax
# 0 "" 2
#NO_APP
	ret
	.cfi_endproc
.LFE70:
	.size	func64, .-func64

Notice how %gs:$40 would be wrong.

The 'P' modifier does a little more work: it removes things like '@PLT'.  This is helpful if you are creating something like a dynamic linker, where you need to do inline asm before relocations have been calculated:

unsigned long long func65(void)
{
	unsigned long long out;
	asm ("leaq (%P1), %0"
		: "=r" (out) : "g" (func65));
	return out;
}

Which gives:


	.p2align 4,,15
	.globl	func65
	.type	func65, @function
func65:
.LFB71:
	.cfi_startproc
#APP
# 702 "gcc_asm.c" 1
	leaq (func65), %rax
# 0 "" 2
#NO_APP
	ret
	.cfi_endproc
.LFE71:
	.size	func65, .-func65

Notice how the raw unadorned 'func65' is used.

The 'X' modifier is similar to 'P'.  It outputs a symbol name with a prefixed dollar symbol.  It is useful for symbolic immediates:

unsigned long long func66(void)
{
	unsigned long long out;
	asm ("movabs %X1, %0"
		: "=r" (out) : "g" (func66));
	return out;
}

Which compiles into:


	.p2align 4,,15
	.globl	func66
	.type	func66, @function
func66:
.LFB72:
	.cfi_startproc
#APP
# 710 "gcc_asm.c" 1
	movabs $func66, %rax
# 0 "" 2
#NO_APP
	ret
	.cfi_endproc
.LFE72:
	.size	func66, .-func66

Compare with the output from the 'P' modifier.  Basically, these symbol modifiers are only useful if you are playing with linker tricks.  Usually, the default behavior from the 'm' or 'g' constraint is what you want.  Only when you absolutely need some other form of linkage are they needed.

Occasionally, you may want to use a differently sized sub-register based on a given constraint.  Without operand modifiers there is no way to do this, since the names of the different-sized sub-registers are not related in any simple textual way.  Compare %rax to %eax, versus %r8 to %r8d.  Fortunately, gcc provides ways of accessing all possible registers based on a given constraint.

The 'b' operand modifier gives you the 8-bit register related to a given register operand.  (For those registers that have two 8-bit sub-registers, it picks the low one, i.e. %al, not %ah from %eax.)  Code using it looks like:

unsigned long long func67(unsigned p1)
{
	unsigned long long out = 0;
	asm volatile ("mov %b1, (%0)"
		: : "r" (&out), "r" (p1) : "memory");
	return out;
}

The above takes the bottom 8 bits of the 32-bit integer parameter, and sets the corresponding bits of the 64-bit output:


	.p2align 4,,15
	.globl	func67
	.type	func67, @function
func67:
.LFB73:
	.cfi_startproc
	movq	$0, -8(%rsp)
	leaq	-8(%rsp), %rax
#APP
# 718 "gcc_asm.c" 1
	mov %dil, (%rax)
# 0 "" 2
#NO_APP
	movq	-8(%rsp), %rax
	ret
	.cfi_endproc
.LFE73:
	.size	func67, .-func67

There are, of course, other sized sub-registers.  The 16-bit operand modifier is 'w':

unsigned long long func68(unsigned p1)
{
	unsigned long long out = 0;
	asm volatile ("mov %w1, (%0)"
		: : "r" (&out), "r" (p1) : "memory");
	return out;
}

Which does a similar thing as the previous function, but to the bottom 16 bits:


	.p2align 4,,15
	.globl	func68
	.type	func68, @function
func68:
.LFB74:
	.cfi_startproc
	movq	$0, -8(%rsp)
	leaq	-8(%rsp), %rax
#APP
# 726 "gcc_asm.c" 1
	mov %di, (%rax)
# 0 "" 2
#NO_APP
	movq	-8(%rsp), %rax
	ret
	.cfi_endproc
.LFE74:
	.size	func68, .-func68

The operand modifier for 32-bits is 'k':

unsigned long long func69(unsigned p1)
{
	unsigned long long out;
	asm volatile ("mov %1, %k0"
		: "=r" (out) : "r" (p1));
	return out;
}

The above uses the x86-64 feature that writing to a 32-bit register automatically clears the upper 32 bits of the corresponding 64-bit register.  The asm looks like:


	.p2align 4,,15
	.globl	func69
	.type	func69, @function
func69:
.LFB75:
	.cfi_startproc
#APP
# 734 "gcc_asm.c" 1
	mov %edi, %eax
# 0 "" 2
#NO_APP
	ret
	.cfi_endproc
.LFE75:
	.size	func69, .-func69

Finally, if you want the 64-bit version of a register, use the 'q' modifier:

unsigned long long func70(unsigned p1)
{
	unsigned long long out = 0;
	asm volatile ("mov %q1, %0"
		: "=r" (out) : "r" (p1));
	return out;
}

Which compiles into:


	.p2align 4,,15
	.globl	func70
	.type	func70, @function
func70:
.LFB76:
	.cfi_startproc
#APP
# 743 "gcc_asm.c" 1
	mov %rdi, %rax
# 0 "" 2
#NO_APP
	ret
	.cfi_endproc
.LFE76:
	.size	func70, .-func70

Of course, we still may want to access the other 8-bit "high" sub-register.  The 'h' operand modifier allows this:

unsigned long long func71(unsigned p1)
{
	unsigned long long out = 0;
	asm volatile ("mov %h1, (%0)"
		: : "r" (&out), "Q" (p1) : "memory");
	return out;
}

Note how we had to use the 'Q' constraint to make sure that the high sub-register existed.  The resulting code chooses the %edx and %dh registers for this:


	.p2align 4,,15
	.globl	func71
	.type	func71, @function
func71:
.LFB78:
	.cfi_startproc
	movq	$0, -8(%rsp)
	leaq	-8(%rsp), %rax
	movl	%edi, %edx
#APP
# 759 "gcc_asm.c" 1
	mov %dh, (%rax)
# 0 "" 2
#NO_APP
	movq	-8(%rsp), %rax
	ret
	.cfi_endproc
.LFE78:
	.size	func71, .-func71

Somewhat related is the 'H' operand modifier.  This allows you to access the high 8-byte part of a 16-byte SSE variable in memory.  It adds 8 bytes to the offset in the memory access.  This effect can of course be simulated manually, as sketched after the compiled output below.

xmmd var72;
xmmd func72(double p1)
{
	asm volatile ("movlpd %1, %H0"
		: : "m" (var72), "x" (p1) : "memory");
	return var72;
}

Which compiles into:


	.p2align 4,,15
	.globl	func72
	.type	func72, @function
func72:
.LFB79:
	.cfi_startproc
#APP
# 767 "gcc_asm.c" 1
	movlpd %xmm0, var72+8(%rip)
# 0 "" 2
#NO_APP
	movapd	var72(%rip), %xmm0
	ret
	.cfi_endproc
.LFE79:
	.size	func72, .-func72
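
As mentioned above, the same +8 offset can be written by hand.  A minimal sketch, using a hypothetical func72b() that takes a pointer and spells out the 8-byte displacement instead of using the 'H' modifier:

void func72b(xmmd *p1, double p2)
{
	/* Store p2 into the high 8 bytes of *p1, writing the offset
	   explicitly rather than letting '%H' add it. */
	asm volatile ("movlpd %1, 8(%0)"
		: : "r" (p1), "x" (p2) : "memory");
}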

Another useful feature is that there are operand modifiers that help inline asm statements that deal with constants.  The main issue is that in AT&T syntax, you may need to add a suffix to an instruction to tell the assembler what operand size to use.  In Intel syntax, this suffix should not be there.  The other problem is that flexible code may need to accept many possible instruction sizes.  The 'z' and 'Z' modifiers help here.  They print the correct suffix for a given register size:

int func73(void)
{
	int out;
	asm ("mov%z0 %1, %0"
		: "=r" (out) : "i" (25));
	return out;
}

Compiles into:


	.p2align 4,,15
	.globl	func73
	.type	func73, @function
func73:
.LFB80:
	.cfi_startproc
#APP
# 775 "gcc_asm.c" 1
	movl $25, %eax
# 0 "" 2
#NO_APP
	ret
	.cfi_endproc
.LFE80:
	.size	func73, .-func73

Notice the 'l' in the 'movl' instruction has been added for us.  The 'Z' variant is similar:

int func74(void)
{
	int out;
	asm ("mov%Z0 %1, %0"
		: "=r" (out) : "i" (25));
	return out;
}

And in this case compiles identically.


	.p2align 4,,15
	.globl	func74
	.type	func74, @function
func74:
.LFB81:
	.cfi_startproc
#APP
# 783 "gcc_asm.c" 1
	movl $25, %eax
# 0 "" 2
#NO_APP
	ret
	.cfi_endproc
.LFE81:
	.size	func74, .-func74

The difference between 'z' and 'Z' is that 'Z' is more flexible.  It works with floating-point registers as well as the integer ones.  Unfortunately, neither modifier will work with constant asm constraints, just register constraints.

Sometimes you may want to write accesses to the top of the legacy floating-point stack slightly differently.  The 'y' modifier converts 'st' into 'st(0)':

long double func75(long double p1, long double p2)
{
	long double out;
	asm ("fadd %2, %y0"
		: "=&t" (out) : "%0" (p1), "u" (p2));
	return out;
}

Compare the result with the output from func40().


	.p2align 4,,15
	.globl	func75
	.type	func75, @function
func75:
.LFB82:
	.cfi_startproc
	fldt	8(%rsp)
	fldt	24(%rsp)
	fxch	%st(1)
#APP
# 791 "gcc_asm.c" 1
	fadd %st(1), %st(0)
# 0 "" 2
#NO_APP
	fstp	%st(1)
	ret
	.cfi_endproc
.LFE82:
	.size	func75, .-func75

'n' is a weird operand modifier.  It negates the value of an integer constant.  It also suppresses the leading dollar sign:

int func76(void)
{
	int out;
	asm ("movl $%n1, %0"
		: "=r" (out) : "i" (25));
	return out;
}

Which gives:


	.p2align 4,,15
	.globl	func76
	.type	func76, @function
func76:
.LFB83:
	.cfi_startproc
#APP
# 799 "gcc_asm.c" 1
	movl $-25, %eax
# 0 "" 2
#NO_APP
	ret
	.cfi_endproc
.LFE83:
	.size	func76, .-func76

Another strange one is the 's' modifier.  It prints out an integer constant, followed by a comma.  It does not suppress the leading dollar sign:

int func77(void)
{
	int out;
	asm ("movl %s1 %0"
		: "=r" (out) : "i" (25));
	return out;
}

Which gives:


	.p2align 4,,15
	.globl	func77
	.type	func77, @function
func77:
.LFB84:
	.cfi_startproc
#APP
# 807 "gcc_asm.c" 1
	movl $25,  %eax
# 0 "" 2
#NO_APP
	ret
	.cfi_endproc
.LFE84:
	.size	func77, .-func77

The next set of modifiers helps with asm that uses AVX instructions.  The 't' modifier converts an SSE register name into its AVX equivalent:

typedef double ymmd __attribute__ ((vector_size (32)));
ymmd func78(xmmd p1, xmmd p2)
{
	ymmd out;
	asm ("vmovapd %t1, %0"
		: "=x" (out) : "x" (p2));
	return out;
}

If you compile with the -mavx flag, you get:


	.p2align 4,,15
	.globl	func78
	.type	func78, @function
func78:
.LFB85:
	.cfi_startproc
#APP
# 816 "gcc_asm.c" 1
	vmovapd %ymm1, %ymm0
# 0 "" 2
#NO_APP
	ret
	.cfi_endproc
.LFE85:
	.size	func78, .-func78

The reverse is implemented by the 'x' modifier, which converts an AVX name into the SSE version:

xmmd func79(ymmd p1, ymmd p2)
{
	xmmd out;
	asm ("movapd %x1, %x0"
		: "=x" (out) : "x" (p2));
	return out;
}

Giving:


	.p2align 4,,15
	.globl	func79
	.type	func79, @function
func79:
.LFB86:
	.cfi_startproc
	pushq	%rbp
	.cfi_def_cfa_offset 16
	.cfi_offset 6, -16
#APP
# 824 "gcc_asm.c" 1
	movapd %xmm1, %xmm0
# 0 "" 2
#NO_APP
	movq	%rsp, %rbp
	.cfi_def_cfa_register 6
	popq	%rbp
	.cfi_def_cfa 7, 8
	vzeroupper
	ret
	.cfi_endproc
.LFE86:
	.size	func79, .-func79

Also potentially useful for AVX code is the 'd' operand modifier.  This is documented to duplicate an operand.  Since the fused multiply-add instructions come in three and four operand variants, it would be convenient to be able to support both from the same code-base.  Using duplicated operands would help somewhat.  Unfortunately, simple usage of 'd' with AVX registers leads to internal compiler errors with the current version of gcc (4.7.1), so this modifier should be avoided for now.

Other modifiers to be avoided are those dealing with condition codes.  There is no way for inline asm to input a condition code operand type.  (They are generated from RTL, however.)  So you shouldn't use the 'c', 'C', 'f', 'F', 'D' and 'Y' modifiers.

The one remaining modifier is 'O'.  It isn't particularly useful.  It prints nothing if Sun syntax is off (the default).  Otherwise it prints 'w', 'l' or 'q', which is helpful for cmov instructions, which are written slightly differently in that asm dialect.

Special Operands

In addition to operands specified by the constraints, there are a few others.  The first of these we have seen before.  '%%' will print a single percent sign.  This is helpful for writing asm registers explicitly within the output string.  The '%%' behavior is the same as that for the printf() function, so it is easy to remember.  func10() above shows its use.

The '%*' operand prints an asterisk if you are using AT&T assembly output.  Otherwise, nothing is printed.  This is helpful for portability:

void func81(void *p1)
{
	asm volatile ("jmp %*%0"
		: : "r" (p1) :);
}

Which compiles into:


	.p2align 4,,15
	.globl	func81
	.type	func81, @function
func81:
.LFB87:
	.cfi_startproc
#APP
# 839 "gcc_asm.c" 1
	jmp *%rdi
# 0 "" 2
#NO_APP
	ret
	.cfi_endproc
.LFE87:
	.size	func81, .-func81

Again, you probably shouldn't use control flow instructions like that in inline asm, since gcc will not understand them.  However...  sometimes you might just need to, and tricks like that often help.

The '%=' operand prints a unique numeric identifier within the compilation region.  This is helpful for constructing a unique symbol from within an inline asm.  Perhaps __LINE__ or local symbols should be used instead, though (a sketch using __LINE__ follows the example below).  For example:

void func82(void *p1)
{
	asm volatile (".L%=something:"
		: : :);
}

Which compiles to give:


	.p2align 4,,15
	.globl	func82
	.type	func82, @function
func82:
.LFB88:
	.cfi_startproc
#APP
# 845 "gcc_asm.c" 1
	.L820something:
# 0 "" 2
#NO_APP
	ret
	.cfi_endproc
.LFE88:
	.size	func82, .-func82

Where in this particular case, it expanded to "820".  Note that since you can construct a symbol name with a given pattern, this trick may be helpful for debugging.
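
If you do prefer the __LINE__ alternative mentioned above, the usual preprocessor stringification trick can build the label name instead.  A hypothetical sketch (the macro and function names are made up):

#define ASM_STR_(x) #x
#define ASM_STR(x) ASM_STR_(x)

void func82b(void)
{
	/* Builds a label from the source line number.  Unlike '%=', this
	   collides if the asm ends up duplicated by inlining or unrolling,
	   or if two such statements share a line. */
	asm volatile (".Lline" ASM_STR(__LINE__) "something:" : : :);
}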

The '%@' operand expands to the TLS segment register: in 32-bit mode this is %gs, and in 64-bit mode it is %fs.  If you are writing low-level thread library code, this may be helpful for portability.

int func83(void)
{
	int out;
	asm volatile ("movl %@:%p1, %0"
		: "=r" (out) : "i" (40) : "memory");
	return out;
}

Which compiles to give:


	.p2align 4,,15
	.globl	func83
	.type	func83, @function
func83:
.LFB89:
	.cfi_startproc
#APP
# 852 "gcc_asm.c" 1
	movl %fs:40, %eax
# 0 "" 2
#NO_APP
	ret
	.cfi_endproc
.LFE89:
	.size	func83, .-func83

The '%~' operand expands to 'i' if AVX2 is available.  Otherwise it expands to 'f'.  I don't know why this could be useful.

The '%;' operand expands to ':' if gcc was built with a workaround for certain buggy versions of the GNU assembler.  Otherwise, it expands to nothing.  This is apparently useful for getting segment overrides to work.  However, these days binutils is most likely modern enough, so you don't have to worry about this.

Finally, there are two more operands that are not useful from inline asm.  The '%+' operand is designed to add branch-prediction prefixes.  However, inline asm can't give the information it needs.  The '%&' operand expands to the name of a dynamic tls variable used within the function the inline asm is invoked in.  This is used internally within gcc to get thread local variables to work correctly.  You shouldn't need to use it in inline asm code.

Other Tricks

Another interface with assembly language within gcc are register variables.  GCC has an extension that lets you assign which particular register a variable may use.  An example of this is:

int func84(int p1)
{
	register int out asm("r10d") = p1;

	return out;
}

We would like the input parameter p1 to be stored in %r10d before being copied into %eax for output.  Unfortunately, reality isn't so kind:


	.p2align 4,,15
	.globl	func84
	.type	func84, @function
func84:
.LFB90:
	.cfi_startproc
	movl	%edi, %eax
	ret
	.cfi_endproc
.LFE90:
	.size	func84, .-func84

GCC ignores our request, and instead optimizes the extra moves away.  You might think you could use a volatile specifier on the variable to make loads and stores to it more explicit.  This doesn't work either.  In fact, there is a warning "-Wvolatile-register-var" for this broken usage.  In light of the fact that asm register variables are held captive to the whims of the optimizer, they should perhaps not be used.  It is difficult to make sure they will have the behavior you might need.
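
The one use that the gcc documentation does promise to respect is a local register variable supplied as an operand of an inline asm statement.  This is handy for registers that have no constraint letter of their own, such as %r10.  A minimal sketch, with a hypothetical func84b():

void func84b(unsigned long p1)
{
	/* The "r" constraint alone cannot request %r10 specifically, but
	   binding the variable to that register beforehand can. */
	register unsigned long in_r10 asm("r10") = p1;

	asm volatile ("# p1 is available here in %0"
		: : "r" (in_r10));
}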

A final trick is that it is possible to insert asm at top level within a C source code file.  Normally, you would need to be inside a function to use inline assembly language.  However, we can use the fact that the section attribute is inserted verbatim into the output.  Since we can embed newlines, we can put anything we like there.  The only constraint is that the input must be a constant C string:

int func85(void);
int __attribute__((used, section(".text\n\t"
			".globl	func85\n\t"
			".type	func85, @function\n\t"
			"func85:\n\t"
			"mov $1, %eax\n\t"
			"retq\n\t"
			".size	func85, .-func85\n\t"
			".section .data"))) func85a;

The above creates a function called func85() within the section attribute.  The 'used' attribute is there to make sure that the variable func85a is not removed.  The result is that func85 is inserted into the object code manually:


	.globl	func85a
	.section	.text
	.globl	func85
	.type	func85, @function
	func85:
	mov $1, %eax
	retq
	.size	func85, .-func85
	.section .data,"aw",@progbits
	.align 4
	.type	func85a, @object
	.size	func85a, 4
func85a:
	.zero	4

A similar version of this trick allows variables to be put into ELF sections that are not '@progbits'.  Simply add the section details you want, and then end them with a comment '#' character.  The comment will remove the unwanted details gcc adds as a suffix.
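
A sketch of that variant might look like the following, where a hypothetical variable var86 is placed into a bss-like (@nobits) section, and the trailing '#' turns the ,"aw",@progbits suffix that gcc appends into an assembler comment:

/* Hypothetical: zero-initialized variable in a custom nobits section. */
int __attribute__((used, section(".mybss, \"aw\", @nobits #"))) var86;

As with func85a above, the 'used' attribute keeps the variable from being discarded.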
