Using Assembly Language in Linux

 

Using Assembly Language in Linux.

by Phillip

phillip@ussrback.com

Last updated: Monday 8th January 2001

Note: there is a turkish translation of this article .

Contents:


Introduction.

This article will describe assembly language programming under Linux. Contained within the bounds of the article is a comparison between Intel and AT&T syntax asm, a guide to using syscalls and a introductory guide to using inline asm in gcc.

This article was written due to the lack of (good) info on this field of programming (inline asm section in particular), in which case i should remind thee that this is not a shellcode writing tutorial because there is no lack of info in this field.

Various parts of this text I have learnt about through experimentation and hence may be prone to error. Should you find any of these errors on my part, do not hesitate to notify me via email and enlighten me on the given issue.

There is only one prerequisite for reading this article, and thats obviously a basic knowledge of x86 assembly language and C.


Intel and AT&T Syntax.

Intel and AT&T syntax Assembly language are very different from each other in appearance, and this will lead to confusion when one first comes across AT&T syntax after having learnt Intel syntax first, or vice versa. So lets start with the basics.

Prefixes.

In Intel syntax there are no register prefixes or immed prefixes. In AT&T however registers are prefixed with a '%' and immed's are prefixed with a '$'. Intel syntax hexadecimal or binary immed data are suffixed with 'h' and 'b' respectively. Also if the first hexadecimal digit is a letter then the value is prefixed by a '0'.

Example:

Intex Syntax
mov	eax,1
mov ebx,0ffh
int 80h
AT&T Syntax
movl	$1,%eax
movl $0xff,%ebx
int $0x80

Direction of Operands.

The direction of the operands in Intel syntax is opposite from that of AT&T syntax. In Intel syntax the first operand is the destination, and the second operand is the source whereas in AT&T syntax the first operand is the source and the second operand is the destination. The advantage of AT&T syntax in this situation is obvious. We read from left to right, we write from left to right, so this way is only natural.

Example:

Intex Syntax
instr	dest,source
mov eax,[ecx]
AT&T Syntax
instr 	source,dest
movl (%ecx),%eax

Memory Operands.

Memory operands as seen above are different also. In Intel syntax the base register is enclosed in '[' and ']' whereas in AT&T syntax it is enclosed in '(' and ')'.

Example:

Intex Syntax
mov	eax,[ebx]
mov eax,[ebx+3]
AT&T Syntax
movl	(%ebx),%eax
movl 3(%ebx),%eax

The AT&T form for instructions involving complex operations is very obscure compared to Intel syntax. The Intel syntax form of these is segreg:[base+index*scale+disp]. The AT&T syntax form is %segreg:disp(base,index,scale).

Index/scale/disp/segreg are all optional and can simply be left out. Scale, if not specified and index is specified, defaults to 1. Segreg depends on the instruction and whether the app is being run in real mode or pmode. In real mode it depends on the instruction whereas in pmode its unnecessary. Immediate data used should not '$' prefixed in AT&T when used for scale/disp.

Example:

Intel Syntax
instr 	foo,segreg:[base+index*scale+disp]
mov eax,[ebx+20h]
add eax,[ebx+ecx*2h
lea eax,[ebx+ecx]
sub eax,[ebx+ecx*4h-20h]
AT&T Syntax
instr	%segreg:disp(base,index,scale),foo
movl 0x20(%ebx),%eax
addl (%ebx,%ecx,0x2),%eax
leal (%ebx,%ecx),%eax
subl -0x20(%ebx,%ecx,0x4),%eax

As you can see, AT&T is very obscure. [base+index*scale+disp] makes more sense at a glance than disp(base,index,scale).

Suffixes.

As you may have noticed, the AT&T syntax mnemonics have a suffix. The significance of this suffix is that of operand size. 'l' is for long, 'w' is for word, and 'b' is for byte. Intel syntax has similar directives for use with memory operands, i.e. byte ptr, word ptr, dword ptr. "dword" of course corresponding to "long". This is similar to type casting in C but it doesnt seem to be necessary since the size of registers used is the assumed datatype.

Example:

Intel Syntax
mov	al,bl
mov ax,bx
mov eax,ebx
mov eax, dword ptr [ebx]
AT&T Syntax
movb	%bl,%al
movw %bx,%ax
movl %ebx,%eax
movl (%ebx),%eax

**NOTE: ALL EXAMPLES FROM HERE WILL BE IN AT&T SYNTAX**


Syscalls.

This section will outline the use of linux syscalls in assembly language. Syscalls consist of all the functions in the second section of the manual pages located in /usr/man/man2. They are also listed in: /usr/include/sys/syscall.h. A great list is at http://www.linuxassembly.org/syscall.html. These functions can be executed via the linux interrupt service: int $0x80.

Syscalls with < 6 args.

For all syscalls, the syscall number goes in %eax. For syscalls that have less than six args, the args go in %ebx,%ecx,%edx,%esi,%edi in order. The return value of the syscall is stored in %eax.

The syscall number can be found in /usr/include/sys/syscall.h. The macros are defined as SYS_<syscall name> i.e. SYS_exit, SYS_close, etc.

Example:
(Hello world program - it had to be done)

According to the write(2) man page, write is declared as: ssize_t write(int fd, const void *buf, size_t count);

Hence fd goes in %ebx, buf goes in %ecx, count goes in %edx and SYS_write goes in %eax. This is followed by an int $0x80 which executes the syscall. The return value of the syscall is stored in %eax.

$ cat write.s
.include "defines.h"
.data
hello:
.string "hello world/n"

.globl main
main:
movl $SYS_write,%eax
movl $STDOUT,%ebx
movl $hello,%ecx
movl $12,%edx
int $0x80

ret
$

The same process applies to syscalls which have less than five args. Just leave the un-used registers unchanged. Syscalls such as open or fcntl which have an optional extra arg will know what to use.

Syscalls with > 5 args.

Syscalls whos number of args is greater than five still expect the syscall number to be in %eax, but the args are arranged in memory and the pointer to the first arg is stored in %ebx.

If you are using the stack, args must be pushed onto it backwards, i.e. from the last arg to the first arg. Then the stack pointer should be copied to %ebx. Otherwise copy args to an allocated area of memory and store the address of the first arg in %ebx.

Example:
(mmap being the example syscall). Using mmap() in C:

#include <sys/types.h>
#include <sys/stat.h>
#include <sys/mman.h>
#include <fcntl.h>
#include <unistd.h>

#define STDOUT 1

void main(void) {
char file[]="mmap.s";
char *mappedptr;
int fd,filelen;

fd=fopen(file, O_RDONLY);
filelen=lseek(fd,0,SEEK_END);
mappedptr=mmap(NULL,filelen,PROT_READ,MAP_SHARED,fd,0);
write(STDOUT, mappedptr, filelen);
munmap(mappedptr, filelen);
close(fd);
}

Arrangement of mmap() args in memory:

%esp %esp+4 %esp+8 %esp+12 %esp+16 %esp+20
00000000 filelen 00000001 00000001 fd 00000000

ASM Equivalent:

$ cat mmap.s
.include "defines.h"

.data
file:
.string "mmap.s"
fd:
.long 0
filelen:
.long 0
mappedptr:
.long 0

.globl main
main:
push %ebp
movl %esp,%ebp
subl $24,%esp

// open($file, $O_RDONLY);

movl $fd,%ebx // save fd
movl %eax,(%ebx)

// lseek($fd,0,$SEEK_END);

movl $filelen,%ebx // save file length
movl %eax,(%ebx)

xorl %edx,%edx

// mmap(NULL,$filelen,PROT_READ,MAP_SHARED,$fd,0);
movl %edx,(%esp)
movl %eax,4(%esp) // file length still in %eax
movl $PROT_READ,8(%esp)
movl $MAP_SHARED,12(%esp)
movl $fd,%ebx // load file descriptor
movl (%ebx),%eax
movl %eax,16(%esp)
movl %edx,20(%esp)
movl $SYS_mmap,%eax
movl %esp,%ebx
int $0x80

movl $mappedptr,%ebx // save ptr
movl %eax,(%ebx)

// write($stdout, $mappedptr, $filelen);
// munmap($mappedptr, $filelen);
// close($fd);

movl %ebp,%esp
popl %ebp

ret
$

**NOTE: The above source listing differs from the example source code found at the end of the article. The code listed above does not show the other syscalls, as they are not the focus of this section. The source above also only opens mmap.s, whereas the example source reads the command line arguments. The mmap example also uses lseek to get the filesize.**

Socket Syscalls.

Socket syscalls make use of only one syscall number: SYS_socketcall which goes in %eax. The socket functions are identified via a subfunction numbers located in /usr/include/linux/net.h and are stored in %ebx. A pointer to the syscall args is stored in %ecx. Socket syscalls are also executed with int $0x80.

$ cat socket.s
.include "defines.h"

.globl _start
_start:
pushl %ebp
movl %esp,%ebp
sub $12,%esp

// socket(AF_INET,SOCK_STREAM,IPPROTO_TCP);
movl $AF_INET,(%esp)
movl $SOCK_STREAM,4(%esp)
movl $IPPROTO_TCP,8(%esp)

movl $SYS_socketcall,%eax
movl $SYS_socketcall_socket,%ebx
movl %esp,%ecx
int $0x80

movl $SYS_exit,%eax
xorl %ebx,%ebx
int $0x80

movl %ebp,%esp
popl %ebp
ret
$

Command Line Arguments.

Command line arguments in linux executables are arranged on the stack. argc comes first, followed by an array of pointers (**argv) to the strings on the command line followed by a NULL pointer. Next comes an array of pointers to the environment (**envp). These are very simply obtained in asm, and this is demonstrated in the example code (args.s).


GCC Inline ASM.

This section on GCC inline asm will only cover the x86 applications. Operand constraints will differ on other processors. The location of the listing will be at the end of this article.

Basic inline assembly in gcc is very straightforward. In its basic form it looks like this:

	__asm__("movl	%esp,%eax");	// look familiar ?

or

	__asm__("
movl $1,%eax // SYS_exit
xor %ebx,%ebx
int $0x80
");

It is possible to use it more effectively by specifying the data that will be used as input, output for the asm as well as which registers will be modified. No particular input/output/modify field is compulsory. It is of the format:

	__asm__("<asm routine>" : output : input : modify);

The output and input fields must consist of an operand constraint string followed by a C expression enclosed in parentheses. The output operand constraints must be preceded by an '=' which indicates that it is an output. There may be multiple outputs, inputs, and modified registers. Each "entry" should be separated by commas (',') and there should be no more than 10 entries total. The operand constraint string may either contain the full register name, or an abbreviation.

Abbrev Table
Abbrev Register
a %eax/%ax/%al
b %ebx/%bx/%bl
c %ecx/%cx/%cl
d %edx/%dx/%dl
S %esi/%si
D %edi/%di
m memory

Example:

	__asm__("test	%%eax,%%eax", : /* no output */ : "a"(foo));

OR

	__asm__("test	%%eax,%%eax", : /* no output */ : "eax"(foo));

You can also use the keyword __volatile__ after __asm__: "You can prevent an `asm' instruction from being deleted, moved significantly, or combined, by writing the keyword `volatile' after the `asm'."

(Quoted from the "Assembler Instructions with C Expression Operands" section in the gcc info files.)

$ cat inline1.c
#include <stdio.h>

int main(void) {
int foo=10,bar=15;

__asm__ __volatile__ ("addl %%ebxx,%%eax"
: "=eax"(foo) // ouput
: "eax"(foo), "ebx"(bar)// input
: "eax" // modify
);
printf("foo+bar=%d/n", foo);
return 0;
}
$

You may have noticed that registers are now prefixed with "%%" rather than '%'. This is necessary when using the output/input/modify fields because register aliases based on the extra fields can also be used. I will discuss these shortly.

Instead of writing "eax" and forcing the use of a particular register such as "eax" or "ax" or "al", you can simply specify "a". The same goes for the other general purpose registers (as shown in the Abbrev table). This seems useless when within the actual code you are using specific registers and hence gcc provides you with register aliases. There is a max of 10 (%0-%9) which is also the reason why only 10 inputs/outputs are allowed.

$ cat inline2.c
int main(void) {
long eax;
short bx;
char cl;

__asm__("nop;nop;nop"); // to separate inline asm from the rest of
// the code
__volatile__ __asm__("
test %0,%0
test %1,%1
test %2,%2"
: /* no outputs */
: "a"((long)eax), "b"((short)bx), "c"((char)cl)
);
__asm__("nop;nop;nop");
return 0;
}
$ gcc -o inline2 inline2.c
$ gdb ./inline2
GNU gdb 4.18
Copyright 1998 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB. Type "show warranty" for details.
This GDB was configured as "i686-pc-linux-gnulibc1"...
(no debugging symbols found)...
(gdb) disassemble main
Dump of assembler code for function main:
... start: inline asm ...
0x8048427 : nop
0x8048428 : nop
0x8048429 : nop
0x804842a : mov 0xfffffffc(%ebp),%eax
0x804842d : mov 0xfffffffa(%ebp),%bx
0x8048431 : mov 0xfffffff9(%ebp),%cl
0x8048434 : test %eax,%eax
0x8048436 : test %bx,%bx
0x8048439 : test %cl,%cl
0x804843b : nop
0x804843c : nop
0x804843d : nop
... end: inline asm ...
End of assembler dump.
$

As you can see, the code that was generated from the inline asm loads the values of the variables into the registers they were assigned to in the input field and then proceeds to carry out the actual code. The compiler auto detects operand size from the size of the variables and so the corresponding registers are represented by the aliases %0, %1 and %2. (Specifying the operand size in the mnemonic when using the register aliases may cause errors while compiling).

The aliases may also be used in the operand constraints. This does not allow you to specify more than 10 entries in the input/output fields. The only use for this i can think of is when you specify the operand constraint as "q" which allows the compiler to choose between a,b,c,d registers. When this register is modified we will not know which register has been chosen and consequently cannot specify it in the modify field. In which case you can simply specify "<number>".

Example:

$ cat inline3.c
#include <stdio.h>

int main(void) {
long eax=1,ebx=2;

__asm__ __volatile__ ("add %0,%2"
: "=b"((long)ebx)
: "a"((long)eax), "q"(ebx)
: "2"
);
printf("ebx=%x/n", ebx);
return 0;
}
$

Compiling

Compiling assembly language programs is much like compiling normal C programs. If your program looks like Listing 1, then you would compile it like you would a C app. If you use _start instead of main, like in Listing 2 you would compile the app slightly differently:

  • Listing 1
$ cat write.s
.data
hw:
.string "hello world/n"
.text
.globl main
main:
movl $SYS_write,%eax
movl $1,%ebx
movl $hw,%ecx
movl $12,%edx
int $0x80
movl $SYS_exit,%eax
xorl %ebx,%ebx
int $0x80
ret
$ gcc -o write write.s
$ wc -c ./write
4790 ./write
$ strip ./write
$ wc -c ./write
2556 ./write
  • Listing 2
$ cat write.s
.data
hw:
.string "hello world/n"
.text
.globl _start
_start:
movl $SYS_write,%eax
movl $1,%ebx
movl $hw,%ecx
movl $12,%edx
int $0x80
movl $SYS_exit,%eax
xorl %ebx,%ebx
int $0x80

$ gcc -c write.s
$ ld -s -o write write.o
$ wc -c ./write
408 ./write

The -s switch is optional, it just creates a stripped ELF executable which is smaller than a non-stripped one. This method (Listing 2) also creates smaller executables, since the compiler isnt adding extra entry and exit routines as would normally be the case.


Links.

Further reference.

http://www.linuxassembly.org
GNU Assembler Manual
GNU C Compiler Manual
GNU Debugger Manual
Operand Constraint Reference
AT&T Syntax Reference

Example Code

args.s Reads command line arguments passed to the prog
daemon.s Binds a shell to a port (backdoor style)
mmap.s Maps a file to memory, and dumps its contents
socket.s Creates a socket
write.s Hello world !
linasm-src.tgz Makefile defines.h args.s daemon.s socket.s write.s
  • 0
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
蛋白质是生物体中普遍存在的一类重要生物大分子,由天然氨基酸通过肽键连接而成。它具有复杂的分子结构和特定的生物功能,是表达生物遗传性状的一类主要物质。 蛋白质的结构可分为四级:一级结构是组成蛋白质多肽链的线性氨基酸序列;二级结构是依靠不同氨基酸之间的C=O和N-H基团间的氢键形成的稳定结构,主要为α螺旋和β折叠;三级结构是通过多个二级结构元素在三维空间的排列所形成的一个蛋白质分子的三维结构;四级结构用于描述由不同多肽链(亚基)间相互作用形成具有功能的蛋白质复合物分子。 蛋白质在生物体内具有多种功能,包括提供能量、维持电解质平衡、信息交流、构成人的身体以及免疫等。例如,蛋白质分解可以为人体提供能量,每克蛋白质能产生4千卡的热能;血液里的蛋白质能帮助维持体内的酸碱平衡和血液的渗透压;蛋白质是组成人体器官组织的重要物质,可以修复受损的器官功能,以及维持细胞的生长和更新;蛋白质也是构成多种生理活性的物质,如免疫球蛋白,具有维持机体正常免疫功能的作用。 蛋白质的合成是指生物按照从脱氧核糖核酸(DNA)转录得到的信使核糖核酸(mRNA)上的遗传信息合成蛋白质的过程。这个过程包括氨基酸的活化、多肽链合成的起始、肽链的延长、肽链的终止和释放以及蛋白质合成后的加工修饰等步骤。 蛋白质降解是指食物中的蛋白质经过蛋白质降解酶的作用降解为多肽和氨基酸然后被人体吸收的过程。这个过程在细胞的生理活动中发挥着极其重要的作用,例如将蛋白质降解后成为小分子的氨基酸,并被循环利用;处理错误折叠的蛋白质以及多余组分,使之降解,以防机体产生错误应答。 总的来说,蛋白质是生物体内不可或缺的一类重要物质,对于维持生物体的正常生理功能具有至关重要的作用。

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值