Chapter 4 – Code and Data Segments

weixin_33896069

Original link: https://my.oschina.net/zhuzihasablog/blog/266484

4.1 Introduction to the Code and Data Segments

The use of segments by the IA-32 architecture supports a wide range of operating system and program designs. Of interest to us is the flat memory model, which gives application programs access to a contiguous block of address space. To implement a flat memory model, at least two segments must be created in memory: a code segment and a data segment. Windows maps both segments onto the entire linear address space, so the CS and DS registers are given the same base address and the same segment limit. The instruction pointer (EIP) and variable names contain offsets into the CODE and DATA segments, respectively. The flat memory model minimizes coding requirements while protecting your code and data from other programs.

4.2 Scope and Purpose

This chapter introduces how to create and use CODE and DATA segments in your inline assembly code. Topics covered in this chapter include:

Overview of the CODE and DATA segments
General description of the SEGMENT directive
Discussion of the extensions to the SEGMENT directive including: PRIVATE, PUBLIC, ALIGN, INFO, COMMON, and USE32

4.3 Segment Directive

A typical 80x86 processor running in 32-bit protected mode can access a maximum of four gigabytes (2³² bytes) of memory. When running in protected mode, an operating system like Windows may use the flat memory model to assign your application up to 4GB of memory. Flat memory looks to your program like a linear array of bytes, with the first byte at address zero (0) and the last at 4,294,967,295 (FFFFFFFFh).

Although the Windows operating system can assign your program up to 4GB of memory, it does so by using a technique called paging, whereby parts of your allocated memory space are swapped between physical memory and the hard drive in 4KB chunks. Paging allows Windows, drivers, libraries, and every application running on the computer to have 4GB of virtual memory. This means that although your program may see 4GB of virtual memory, it certainly does not possess 4GB of physical memory.

The assembler allows your code to define and reference specific memory segments allocated to your program. Typically, program code is placed in the CODE section. The DATA segment is divided into a section for initialized data and a section for uninitialized data. NASM labels these sections TEXT for the code section, DATA for the initialized data section, and BSS for the uninitialized data section.

The SEGMENT directive changes which section of the output file the code you write is assembled into. Note that the SECTION directive is synonymous with using the SEGMENT directive; both mean the same thing. Either one is used to tell the assembler and linker how to set up your segments.

4.4 Code Segment – TEXT

The CODE segment defines the section where your code is stored. The '.text' instruction marks this section as readable and executable, but not writable. It also tells the linker that this section of memory contains program code. The code written in the code section of inline assembly blocks is analogous to writing IWBASIC code, only the syntax is different.

Example of setting up a CODE segment:

  section .text (or segment .text)  ; MUST BE preceded with a period—no underlines
  . . .                             ; code statements start here

NOTE
NASM labels the code segment ‘text’. Naming it the ‘code’ segment will not work. The word ‘text’ must be in small letters too. Capitalizing the word ‘TEXT’ will cause an error to occur. The same applies to the ‘data’ and ‘bss’ instructions. Precede segment instructions with a period.

Figure 4-1 shows the logical memory layout when using the flat memory model. Notice that programs may be assigned up to 4GB of memory space starting at zero (0) and extending upwards to a maximum of 4,294,967,295 (FFFFFFFFh). IWBASIC may also set aside some memory as additional storage space for such structures as buffers and heaps.

Figure 4-1 Flat Memory Model Representation

In the code segment, the CS and EIP registers hold the logical address of the next instruction to be executed. The 16-bit CS register is used as an index into a table that contains the actual 32-bit starting address of the segment. This table is set up and controlled by Windows. Table 4-1 contains an example of logical addressing. You can see that the CS index points to the bottom of the assigned code segment at address 100A1142h, with 32-bit offsets indicated after the colons. These 32-bit offsets are tracked by the EIP register.

Table 4-1 Logical Addressing in the Code Segment

Logical CS:EIP	Machine Language Opcode and Operand	Assembly Language Mnemonics and Operand
100A1142:01002286	B9 00000000	mov ecx, 0
100A1142:0100228B	48	dec eax
100A1142:0100228C	3D 00000000	cmp eax, 0
100A1142:01002291	31D2	xor edx, edx

4.5 Initialized Data Segment – DATA

The DATA section defines an area of memory where defined program variables are located. The ‘.data’ instruction marks this section as readable and writable, but not executable. It also tells the linker that the data in this section are initialized (variables and constants have assigned values). Variables placed in the DATA segment are analogous to global variables in IWBASIC.

Figure 4-2 shows how IWBASIC and assembly data are stored in the DATA segment. Notice the order that data bytes are stored.

Figure 4-2 Data Storage

There are two approaches to storing multi-byte data in memory: big endian and little endian. Big endian order means that the most significant byte (or word) is stored first in memory, that is, at the lowest memory address. Little endian order is the reverse: the least significant byte is stored at the lowest address. Intel IA-32 processors store data in little endian order. As Figure 4-2 shows, the IWBASIC string is stored in character order starting with character ‘D’ stored at a lower address than character ‘A’. Notice how the IWBASIC string ends with a zero (null terminated), making the string variable myStr five bytes wide.

The WORD in Figure 4-2, defined by the assembler, is stored with its least significant byte at the lowest memory address. Little endian storage allows the least significant byte to be accessed first so the processor can begin operating on it while the most significant byte is being accessed. This has become a moot point on modern processors.
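
As a small sketch of little endian storage in NASM syntax (the label sample is hypothetical and does not appear in the book's figures), the following fragment stores a DWORD and then reads individual pieces of it back:

```nasm
section .data
    sample  dd 12345678h           ; stored in memory as the bytes 78h, 56h, 34h, 12h

section .text
    movzx eax, byte [sample]       ; EAX = 00000078h, least significant byte at the lowest address
    movzx ecx, byte [sample + 3]   ; ECX = 00000012h, most significant byte at the highest address
    movzx edx, word [sample]       ; EDX = 00005678h, the low WORD of the DWORD
```

Reading the bytes in ascending address order therefore yields them in the reverse of the order in which the value is written in the source.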

Example of declaring a DATA section with variables:

  section .data (or segment .data)
    byte_data       db  41h, 42h, 43h, 44h            ; Define four BYTES
    word_data       dw  2143h                         ; Define WORD
    doubleword_data dd  12345678                      ; Define DWORD
    quadword_data   dq  1.25e25                       ; Define QWORD
    tenbyte_data    dt  -12_67_98_50_32_54_99_03_10p  ; Define TWORD
    constant_number equ 1234                          ; Define a constant

Notice that when you use the EQU symbol, the source line must contain a label. You can also use EQU in conjunction with the ‘$’ symbol to define an absolute address or length of a string. EQU stands for “equate.”

Example of using EQU and $ symbols to determine the length of a string:

  message db  'Assembly is fun!',0   ; Define a zero delimited string
  msglen  equ $-message              ; Length of the string including the zero

The ‘$’ symbol evaluates to the current assembly address. Subtracting the address represented by ‘message’ from it gives the length of the string. Remember that labels like ‘message’ represent the starting offset address in memory of the defined variable. In the previous example, msglen is equal to 17 bytes.
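
The same technique works for data other than strings. The following is a minimal sketch, using the hypothetical labels table, tablesize, and tablelen, that computes both the size of an array in bytes and its number of elements:

```nasm
section .data
    table     dd 10, 20, 30, 40     ; four DWORDs occupy 16 bytes
    tablesize equ $ - table         ; 16, the size of the array in bytes
    tablelen  equ ($ - table) / 4   ; 4, the number of DWORD elements
```

The EQU lines must directly follow the data they measure, because ‘$’ advances with every byte the assembler emits. EQU itself emits nothing, so both constants are measured from the same position.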

The ‘$$’ symbol evaluates to the beginning of the current assembly code section so that you can tell how far into the section you are by using the combination of $-$$. This is useful when you need to align data in the DATA segment. Listing 4-1 shows how to use the $$ symbol.

Listing 4-1 Using the $$ Symbol

INT myResult               REM Declare uninitialized integer variable

_asm
segment .data
  message1 db 'Assembly is fun!', 0  ; Declare three initialized variables
  message2 db 'Learning assembly is challenging.',0
  distance dd $-$$         ; Calculate distance in bytes from start of the data segment

segment .text
  mov eax, [distance]      ; Copy calculated value into EAX register
  mov [myResult], eax      ; Copy value into the variable myResult so it can be printed
_endasm

PRINT myResult             REM The number 51 should print to the screen
DO: UNTIL INKEY$ <> ""     REM Wait for a key to be pressed
END                        REM Close command window and exit the program
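
Another common use of $-$$ is padding a section to a fixed size with the TIMES prefix. The fragment below is only a sketch: the labels header and payload are hypothetical, and it assumes that header is the first item the assembler places in the section. If more than 16 bytes already precede the TIMES line, the repeat count becomes negative and NASM reports an error.

```nasm
section .data
    header  db 'IWB', 0          ; 4 bytes of hypothetical data at the start of the section
    times 16-($-$$) db 0         ; emit 12 zero bytes so that 16 bytes have been used so far
    payload dd 1                 ; begins exactly 16 bytes from the start of the section

section .text                    ; return to the CODE segment
```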

Declaring initialized variables in assembly is very similar to declaring IWBASIC variables. Unless you change the name of the DATA segment, initialized assembly variables are stored along with variables declared in IWBASIC code.

Table 4-2 compares some IWBASIC variable declarations to assembly language variable declarations.

Table 4-2 IWBASIC and NASM Variable Declarations

IWBASIC Variable Declarations	Size	NASM Variable Declarations
INT myInt = 123	4 bytes	myInt dd 123
STRING myStr = "Assembly is fun!"	IWBASIC up to 255 bytes NASM unlimited	myStr db 'Assembly is fun!', 0
FLOAT myFloat = 1.23	4 bytes	myFloat dd 1.23
INT64 myBigInt = 123456789	8 bytes	myBigInt dq 123456789
DOUBLE myBigFloat = 123.456789	8 bytes	myBigFloat dq 123.456789
N/A	10 bytes	myTword dt 12_34_56_78_90_09_87_56_43p
N/A	16 bytes	myDQword do 123.0e34

4.6 Uninitialized Data Segment – BSS

The BSS segment also defines an area of memory where program variables are located. The ‘.bss’ instruction marks this section as readable and writable, but not executable. It also tells the linker that the data in this section are not initialized (variables have not been assigned values). However, during the coding process, you identify how much space you will need when variables in the BSS section are eventually initialized. The data storage directives used in the BSS segment include: RESB, RESW, RESD, RESQ, REST, and RESO.

Example of reserving uninitialized space for variables in the BSS section:

  section .bss
    byte_space     resb 50   ; Reserve space for 50 BYTES  
    word_one       resw 1    ; Reserve space for a WORD (2 bytes)
    integer_number resd 1    ; Reserve space for a DWORD (4 bytes)
    qword_float    resq 1    ; Reserve space for a QWORD (8 bytes)
    BCD_element    rest 1    ; Reserve space for a TWORD (10 bytes)
    vector_matrix  reso 1    ; Reserve space for a DQWORD (16 bytes)
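
Because the BSS section holds no initial values, anything stored there must be written at run time. A minimal sketch, assuming a hypothetical array named counters:

```nasm
section .bss
    counters resd 8                ; room for eight DWORDs (32 bytes), contents not initialized here

section .text
    mov dword [counters], 0        ; store 0 in the first element at run time
    mov dword [counters + 4], 1    ; store 1 in the second element, 4 bytes further on
```

The DWORD size specifier is needed because neither operand is a register, so the assembler cannot otherwise tell how many bytes to write.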

NOTE
In 32-bit protected mode flat memory model programming, you do not have to manipulate the segment registers. In fact, you have no need to read or change them directly; just consider them part of the operating system.

4.7 Segment Extensions

It is important to remember that your code is placed at a physical memory address assigned by the system. The assigned address is stored in descriptor tables in memory. Your program is then assigned linear memory space beginning at zero (0) so that your code always thinks its starting address is at the bottom of memory. The linear address combines the segment base address, found through the descriptor table index held in a 16-bit segment register, with a 32-bit offset: the instruction pointer for the CODE segment, or the offset assigned to each variable in the DATA segment. This is a simplistic view, but adequate at our stage of assembly programming. We never have to worry about segment registers and calculating physical addresses in our programs.

4.7.1 Segment Alignment with the ALIGN Extension

The IA-32 architecture works best when program code and data are aligned on even numbered boundaries. More specifically, the fetch, pipeline, and prediction mechanics of the IA-32 architecture work best when segments are aligned on specific even numbered boundaries.

Example of alignment directives using ALIGN:

  section .text align=16   ; Align CODE segment on a 16-byte boundary
  section .data align=4    ; Align DATA segment on a 4-byte boundary
  section .bss  align=4    ; Align BSS segment on a 4-byte boundary

Typically, your code and data are aligned on the next available memory location ending in a zero (0). If you do not specify an alignment strategy, the assembler uses the alignment shown in the previous example as the default strategy. Be aware that if you share a segment with IWBASIC, then whatever settings IWBASIC has declared are what you must use. You cannot change segment extension values once they are initially set. Additionally, you can utilize these segment instructions inside your _asm/_endasm blocks using various strategies as depicted in Listings 4-2 and 4-3.

Listing 4-2 Data Section Placed After the Code Section

_asm
section .text            ; Start a CODE segment
  mov  eax, [integer1]   ; Copy integer1 value into EAX
  add  eax, 500          ; Add 500 to value in EAX
  mov  [integer1], eax   ; Copy the result of the addition back into the integer1 variable

section .data            ; Start a DATA segment
integer1:  dd  255       ; Declare 32-bit integer variable

section .text            ; Return back to the CODE segment before exiting assembly code block
_endasm

In Listing 4-2, the data section is placed after the code section. Listing 4-3 is an example of putting the data section before the code section. Take care to ensure you are in the CODE segment prior to exiting the inline assembly code block.

Listing 4-3 Data Sections Placed Before the Code Section

_asm
section .data            ; Start a DATA segment for initialized data variables
integer1  dd  255        ; Declare DWORD variable labeled integer1

section .bss             ; Start BSS segment for uninitialized data variables
buffer1  resd  1         ; Reserve one DWORD-size memory space for buffer1 variable

section .text            ; Start a CODE segment
  mov  eax, [integer1]   ; Copy contents of integer1 into the EAX register
  add  eax, 500          ; Add 500 to contents in EAX register
  mov  [buffer1], eax    ; Copy result in EAX register to buffer1 variable
_endasm

Note that when you enter a _asm/_endasm block, by default it is regarded as the CODE segment, therefore you do not need to label it with ‘text’ unless you create DATA and BSS segments. The compiler, assembler, and linker generally position segments in the order in which they appear in the source code. However, if you put a DATA or BSS segment after the CODE segment, you must return the program back to the CODE segment by adding another ‘SECTION .text’ directive before leaving the assembly code block. This can be a source of many errors in your program because all subsequent IWBASIC code instructions will likely be written into the DATA section unless you switch back to the CODE section.

The ALIGN directive is also used to align individual variable declarations in the DATA and BSS segments.

Example directives to align data and reserve space on even boundaries:

  align  4, db 123   ; Pad with bytes of value 123 up to the next 4-byte boundary
  align  4, resb 1   ; Pad with reserved (uninitialized) bytes up to the next 4-byte boundary
  alignb 4           ; Same as previous statement

The reason for correct alignment is that on some 80x86 processors, extra cycles are used to fetch unaligned data. We will see in Chapter 19 that some processors even crash if data aren’t properly aligned when using SIMD extensions. However, if you are not using SIMD extensions, then alignment won’t become a big issue unless you are coding a lot of iterative loops and so forth.
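
To see how these directives fit around actual declarations, consider the following sketch. All labels are hypothetical. Note that ALIGN and ALIGNB work relative to the start of the section, so the result is only as well aligned as the section itself.

```nasm
section .data
    flag    db 1          ; a single byte, so the next offset is probably not a multiple of 4
    align 4, db 0         ; pad with zero bytes up to the next 4-byte boundary
    total   dd 0          ; starts on a 4-byte boundary

section .bss
    onebyte resb 1        ; one reserved byte
    alignb 4              ; pad with reserved bytes up to the next 4-byte boundary
    buffer2 resd 4        ; four DWORDs starting on a 4-byte boundary

section .text             ; return to the CODE segment
```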

4.7.2 INFO, COMMON, PRIVATE, PUBLIC, and USE32 Extensions

You will probably never use the INFO, COMMON, PRIVATE, and PUBLIC extensions when writing inline assembly code, but they are here for your understanding.

The INFO directive defines a section to be an informational section, which is not included in the executable file by the linker. The purpose of the INFO section is to pass information to the linker.

Example of an INFO section:

  section .linkerdirective INFO align=4
  -p myInc.inc                           ; Include myInc.inc file when linking

The COMMON directive is used to declare common or global variables.

Example of the COMMON directive:

  common integer_one 4   ; Declare a global integer

The purpose of the COMMON directive is to tell the linker to merge integer_one into one reference if multiple modules use this same variable name.

PUBLIC is the default condition for segments and tells the linker to combine all code segments into one code group and all data segments into one data group. Using the PRIVATE directive instructs the linker not to combine segments into like groups.

Example of PUBLIC and PRIVATE directives:

  section .data public align=4     ; Group DATA segments (default)
  section .code private align=16   ; Private code section aligned on 16-byte boundary

USE32 is the default offset mode for all your 32-bit coding. IWBASIC works in this mode and you typically do not want to change this even if you could. Therefore, you should never have the occasion to use this directive. However, you will see it in some of the references in the appendices of this book and should be aware that USE32 is used to indicate the size of the offset into the segment.

4.8 Putting it all Together

Although you may never need to use segment directives, this book makes ample use of these instructions to demonstrate the full range of inline assembly coding techniques. One use is to implement higher precision integers and floating-point numbers using the 80-bit FPU, 64-bit MMX, and 128-bit XMM registers, something that may not be possible in IWBASIC. Therefore, to use these extended numbers, you need to put them into a DATA section in your assembly code.

You can also close a segment and reopen it later in your code with another SEGMENT directive. Be aware that you cannot change the attribute extensions such as ALIGN once you have defined them and if you share IWBASIC segments, those attributes are already defined for you. Listing 4-4 depicts a typical assembly block of code that declares CODE, DATA, and BSS segments.

Listing 4-4 Defining Segments within Blocks of Assembly Source Code

INT myInteger = 100       REM Declare myInteger as type integer (4 bytes) initialized to 100

_asm                      ; Declare a block of inline assembly code—notice the semi-colon here
;=============================================================================================
; Comment about what the following code is supposed to do – good idea to put the date too!
;=============================================================================================
section .data             ; Start a DATA section for initialized data variables
    value1    dd 1000     ; Declare value1 as DWORD (4 bytes)
    constant1 equ 250     ; Declare constant1 as a constant equal to 250

section .bss              ; Start a BSS section for uninitialized data variables
    value2 resd 1         ; Reserve space in memory for one DWORD (4 bytes) labeled value2

section .text             ; Start a CODE section
    mov eax, [value1]     ; Copy 1000 to EAX register
    add eax, constant1    ; EAX = 1000 + 250
    mov [value2], eax     ; Copy value in EAX to value2 variable in BSS segment
    add [myInteger], eax  ; myInteger = myInteger + EAX; i.e., myInteger = 100 + 1250
_endasm                   REM End of inline assembly code block—notice the REM instruction

PRINT myInteger           REM Screen should display 1350
DO: UNTIL INKEY$ <> ""    REM Wait until a key is pressed
END                       REM Close command window and exit program

The program in Listing 4-4 shows you how to set up and use the DATA, BSS, and CODE segments within an IWBASIC program. Note the use of brackets ‘[’ and ‘]’ to enclose variable names. The brackets instruct the assembler to use the contents of the variable as the operand. In IWBASIC, this is called passing data by value. If we omit the brackets, then the address of the variable is used instead. In IWBASIC, this is called passing variables by reference.
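
The difference between a bracketed and an unbracketed name can be sketched as follows. This fragment reuses the value1 declaration from Listing 4-4; the choice of registers is arbitrary.

```nasm
section .data
    value1 dd 1000

section .text
    mov eax, [value1]     ; EAX = 1000, the contents of value1
    mov edx, value1       ; EDX = the 32-bit offset (address) of value1
    mov ecx, [edx]        ; ECX = 1000, read through the address held in EDX
```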

NOTE
Copious use of comments is always a good idea. Comments do not take up any space in the object file and they let you understand your code when you return to it many months later.

Again, note that IWBASIC comments (begun with REM in the listings, or with a single quote) differ from the semi-colons used to begin comments in the assembly code block. Interestingly, although the _asm directive is an IWBASIC instruction, you must begin a comment after this directive with a semi-colon or your source code won’t compile. It appears that your assembly code starts at the _asm directive. However, after the _endasm directive, you are back in IWBASIC territory and must use an IWBASIC comment again.

4.9 Chapter Review

Chapter 4 introduced you to the notion of declaring and using DATA, BSS, and CODE sections in your inline assembly code. These three sections share similarly named sections set up by IWBASIC, and although you may never use these directives in your assembly code, understanding how they work is key to knowing how and where your code and variables are stored.

You learned that if you use these directives, it matters what order you place them in. If segments are defined in your assembly code, you must ensure that you switch back to the CODE segment before exiting any assembly code block. We also discussed segment extensions that may or may not be used in your programs. These extensions include PRIVATE, PUBLIC, ALIGN, INFO, COMMON, and USE32. Typically, IWBASIC has already declared values for these directives.

Finally, we described how the processor’s segment registers hold indexes into descriptor tables containing the physical addresses of each memory segment assigned to your program. You learned that your code is located in a linear address space and instructions are located by combining the 32-bit offset address in the EIP register with the physical address indexed by the CS register. Similarly, variable data are located by combining the index pointed to by the DS register with the offsets assigned to each variable. Therefore, when we discuss copying the address of a variable into a register, what we really mean is that we are copying the 32-bit offset of the variable. The actual address is the combination of the DS register and the offset into the DATA segment.

Chapter 4 Exercises:

Which memory segment is used to place initialized data?
Can you read and write to the CODE segment?
Historically, NASM names the CODE segment TEXT. Is it equally appropriate to name this section CODE?
What do the EQU and $ symbols represent in NASM?
What are the default boundary alignments for the CODE, DATA, and BSS segments?
If you do not specify CODE, DATA, or BSS segments in your assembly code block, what is the implied default segment?
How many bytes does the following variable declaration occupy in your data section: myVariable dq 12345?
Why should you align information in the data and code sections of your programs?
How would the following variable be stored in memory by Intel 80x86 microprocessors: myInteger dd 0C21CB0Ah?
How would you reserve space in the .BSS section for a word-size integer named myInteger?