3.17.15 Intel 386 and AMD x86-64 Options
- Generate instructions for the machine type
cpu-type. In contrast to -mtune=cpu-type, which merely tunes the generated code for the specified
cpu-type, -march=cpu-type allows GCC to generate code that may not run at all on processors other than the one indicated. Specifying -march=cpu-type implies -mtune=cpu-type.
The choices for cpu-type are:
- This selects the CPU to generate code for at compilation time by determining the processor type of the compiling machine. Using
-march=native enables all instruction subsets supported by the local machine (hence the result might not run on different machines). Using
-mtune=native produces code optimized for the local machine under the constraints of the selected instruction set.
- Original Intel i386 CPU.
- Intel i486 CPU. (No scheduling is implemented for this chip.)
- Intel Pentium CPU with no MMX support.
- Intel Pentium MMX CPU, based on Pentium core with MMX instruction set support.
- Intel Pentium Pro CPU.
- When used with -march, the Pentium Pro instruction set is used, so the code runs on all i686 family chips. When used with
-mtune, it has the same meaning as `generic'.
- Intel Pentium II CPU, based on Pentium Pro core with MMX instruction set support.
- Intel Pentium III CPU, based on Pentium Pro core with MMX and SSE instruction set support.
- Intel Pentium M; low-power version of Intel Pentium III CPU with MMX, SSE and SSE2 instruction set support. Used by Centrino notebooks.
- Intel Pentium 4 CPU with MMX, SSE and SSE2 instruction set support.
- Improved version of Intel Pentium 4 CPU with MMX, SSE, SSE2 and SSE3 instruction set support.
- Improved version of Intel Pentium 4 CPU with 64-bit extensions, MMX, SSE, SSE2 and SSE3 instruction set support.
- Intel Core 2 CPU with 64-bit extensions, MMX, SSE, SSE2, SSE3 and SSSE3 instruction set support.
- Intel Core i7 CPU with 64-bit extensions, MMX, SSE, SSE2, SSE3, SSSE3, SSE4.1 and SSE4.2 instruction set support.
- Intel Core i7 CPU with 64-bit extensions, MMX, SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2, AVX, AES and PCLMUL instruction set support.
- Intel Core CPU with 64-bit extensions, MMX, SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2, AVX, AES, PCLMUL, FSGSBASE, RDRND and F16C instruction set support.
- Intel Atom CPU with 64-bit extensions, MMX, SSE, SSE2, SSE3 and SSSE3 instruction set support.
- AMD K6 CPU with MMX instruction set support.
- Improved versions of AMD K6 CPU with MMX and 3DNow! instruction set support.
- AMD Athlon CPU with MMX, 3DNow!, enhanced 3DNow! and SSE prefetch instructions support.
- Improved AMD Athlon CPU with MMX, 3DNow!, enhanced 3DNow! and full SSE instruction set support.
- Processors based on the AMD K8 core with x86-64 instruction set support, including the AMD Opteron, Athlon 64, and Athlon 64 FX processors. (This supersets MMX, SSE, SSE2, 3DNow!, enhanced 3DNow! and 64-bit instruction set extensions.)
- Improved versions of AMD K8 cores with SSE3 instruction set support.
- CPUs based on AMD Family 10h cores with x86-64 instruction set support. (This supersets MMX, SSE, SSE2, SSE3, SSE4A, 3DNow!, enhanced 3DNow!, ABM and 64-bit instruction set extensions.)
- CPUs based on AMD Family 15h cores with x86-64 instruction set support. (This supersets FMA4, AVX, XOP, LWP, AES, PCL_MUL, CX16, MMX, SSE, SSE2, SSE3, SSE4A, SSSE3, SSE4.1, SSE4.2, ABM and 64-bit instruction set extensions.)
- AMD Family 15h core based CPUs with x86-64 instruction set support. (This supersets BMI, TBM, F16C, FMA, AVX, XOP, LWP, AES, PCL_MUL, CX16, MMX, SSE, SSE2, SSE3, SSE4A, SSSE3, SSE4.1, SSE4.2, ABM and 64-bit instruction set extensions.)
- CPUs based on AMD Family 14h cores with x86-64 instruction set support. (This supersets MMX, SSE, SSE2, SSE3, SSSE3, SSE4A, CX16, ABM and 64-bit instruction set extensions.)
- IDT WinChip C6 CPU, handled in the same way as i486 with additional MMX instruction set support.
- IDT WinChip 2 CPU, handled in the same way as i486 with additional MMX and 3DNow! instruction set support.
- VIA C3 CPU with MMX and 3DNow! instruction set support. (No scheduling is implemented for this chip.)
- VIA C3-2 (Nehemiah/C5XL) CPU with MMX and SSE instruction set support. (No scheduling is implemented for this chip.)
- AMD Geode embedded processor with MMX and 3DNow! instruction set support.
The choices for cpu-type are the same as for -march. In addition, -mtune supports an extra choice for cpu-type:
- Produce code optimized for the most common IA32/AMD64/EM64T processors. If you know the CPU on which your code will run, then you should use the corresponding
-mtune or -march option instead of
-mtune=generic. But, if you do not know exactly what CPU users of your application will have, then you should use this option.
As new processors are deployed in the marketplace, the behavior of this option will change. Therefore, if you upgrade to a newer version of GCC, code generation controlled by this option will change to reflect the processors that are most common at the time that version of GCC is released.
There is no -march=generic option because -march indicates the instruction set the compiler can use, and there is no generic instruction set applicable to all processors. In contrast, -mtune indicates the processor (or, in this case, collection of processors) for which the code is optimized.
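The difference between -march and -mtune can be observed from inside a program, because GCC predefines a macro for each instruction set extension that -march (or an explicit option such as -msse2) enables, while -mtune alone changes none of them. The following is a minimal sketch under that assumption; the function names append_isa and enabled_isa_extensions are hypothetical and chosen only for illustration.

```c
#include <stddef.h>
#include <string.h>

/* Append `name` to the space-separated list in `buf`, never writing
   past `cap` bytes. Names that do not fit are silently dropped. */
void append_isa(char *buf, size_t cap, const char *name)
{
    size_t used = strlen(buf);
    size_t need = strlen(name) + (used ? 1 : 0);
    if (used + need + 1 > cap)
        return;
    if (used)
        buf[used++] = ' ';
    strcpy(buf + used, name);
}

/* Fill `buf` with the extensions that were enabled when this file was
   compiled and return how many were found. The macros tested here are
   predefined by GCC according to -march and the -m<extension> options. */
int enabled_isa_extensions(char *buf, size_t cap)
{
    int count = 0;
    if (cap == 0)
        return 0;
    buf[0] = '\0';
#ifdef __MMX__
    append_isa(buf, cap, "mmx"); count++;
#endif
#ifdef __SSE__
    append_isa(buf, cap, "sse"); count++;
#endif
#ifdef __SSE2__
    append_isa(buf, cap, "sse2"); count++;
#endif
#ifdef __SSE3__
    append_isa(buf, cap, "sse3"); count++;
#endif
#ifdef __SSSE3__
    append_isa(buf, cap, "ssse3"); count++;
#endif
#ifdef __SSE4_1__
    append_isa(buf, cap, "sse4.1"); count++;
#endif
#ifdef __SSE4_2__
    append_isa(buf, cap, "sse4.2"); count++;
#endif
#ifdef __AVX__
    append_isa(buf, cap, "avx"); count++;
#endif
    return count;
}
```

Compiling the same file once with -march=core2 and once with -mtune=core2 and printing the resulting list should show the extra extensions only in the first case, since only -march widens the instruction set the compiler may use.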
- Use the standard 387 floating-point coprocessor present on the majority of chips and emulated otherwise. Code compiled with this option runs almost everywhere. The temporary results are computed in 80-bit precision instead of the precision specified by
the type, resulting in slightly different results compared to most other chips. See
-ffloat-store for a more detailed description.
This is the default choice for the i386 compiler.
- Use scalar floating-point instructions present in the SSE instruction set. This instruction set is supported by Pentium III and newer chips, and in the AMD line by Athlon-4, Athlon XP and Athlon MP chips. The earlier version of the SSE instruction set supports
only single-precision arithmetic, thus the double and extended-precision arithmetic are still done using 387. A later version, present only in Pentium 4 and AMD x86-64 chips, supports double-precision arithmetic too.
For the i386 compiler, you must use -march=cpu-type, -msse or -msse2 switches to enable SSE extensions and make this option effective. For the x86-64 compiler, these extensions are enabled by default.
The resulting code should be considerably faster in the majority of cases and avoid the numerical instability problems of 387 code, but may break some existing code that expects temporaries to be 80 bits.
This is the default choice for the x86-64 compiler.
- Attempt to utilize both instruction sets at once. This effectively doubles the number of available registers and, on chips with separate execution units for 387 and SSE, the execution resources too. Use this option with care: it is still experimental, because the GCC register allocator does not model separate functional units well, which results in unstable performance.
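The effect of the floating-point unit selection on temporaries is visible through the standard FLT_EVAL_METHOD macro from <float.h>. The sketch below assumes the usual GCC behavior on x86, where 387 arithmetic reports 2 (temporaries held as long double) and SSE arithmetic reports 0 (each type evaluated in its own precision); the function names are hypothetical.

```c
#include <float.h>

/* Return the evaluation method the compiler reports for this
   translation unit. */
int fp_eval_method(void)
{
    return (int)FLT_EVAL_METHOD;
}

/* Map the value to a short description. The association with 387 and
   SSE arithmetic is an assumption about typical x86 GCC behavior. */
const char *fp_eval_description(void)
{
#if FLT_EVAL_METHOD == 0
    return "each type is evaluated in its own precision (typical of SSE arithmetic)";
#elif FLT_EVAL_METHOD == 2
    return "temporaries are evaluated as long double (typical of 387 arithmetic)";
#elif FLT_EVAL_METHOD == 1
    return "float and double are evaluated as double";
#else
    return "indeterminate or implementation-specific evaluation";
#endif
}
```

Code that depends on temporaries being 80 bits wide can test this value at compile time instead of relying on a particular choice of floating-point unit.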
Warning: the requisite libraries are not part of GCC. Normally the facilities of the machine's usual C compiler are used, but this can't be done directly in cross-compilation. You must make your own arrangements to provide suitable library functions for cross-compilation.
On machines where a function returns floating-point results in the 80387 register stack, some floating-point opcodes may be emitted even if
-msoft-float is used.
The usual calling convention has functions return values of types
double in an FPU register, even if there is no FPU. The idea is that the operating system should emulate an FPU.
The option -mno-fp-ret-in-387 causes such values to be returned in ordinary CPU registers instead.
sqrt instructions for the 387. Specify this option to avoid generating those instructions. This option is the default on FreeBSD, OpenBSD and NetBSD. This option is overridden when -march indicates that the target CPU always has an FPU and so the instruction does not need emulation. These instructions are not generated unless you also use the -funsafe-math-optimizations switch.
long double, and
long long variables on a two-word boundary or a one-word boundary. Aligning
double variables on a two-word boundary produces code that runs somewhat faster on a Pentium at the expense of more memory.
On x86-64, -malign-double is enabled by default.
Warning: if you use the -malign-double switch, structures containing the above types are aligned differently than the published application binary interface specifications for the 386 and are not
binary compatible with structures in code compiled without that switch.
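The binary incompatibility described in the warning can be seen in the layout of a small structure. The sketch below is illustrative only; struct sample and the two helper functions are hypothetical names.

```c
#include <stddef.h>

/* A structure whose layout depends on the alignment of double. */
struct sample {
    char   tag;
    double value;
};

/* Offset of the double member: 4 when double is aligned on a one-word
   boundary (32-bit default), 8 when it is aligned on a two-word
   boundary (-malign-double, or x86-64). */
size_t double_member_offset(void)
{
    return offsetof(struct sample, value);
}

/* Total size of the structure, including any padding. */
size_t sample_size(void)
{
    return sizeof(struct sample);
}
```

Two modules that disagree about this offset read and write the value member at different addresses, which is why all code sharing such structures must be built with the same setting.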
long double type. The i386 application binary interface specifies the size to be 96 bits, so -m96bit-long-double is the default in 32-bit mode.
Modern architectures (Pentium and newer) prefer
long double to be aligned to an 8- or 16-byte boundary. In arrays or structures conforming to the ABI, this is not possible. So specifying
long double to a 16-byte boundary by padding the
long double with an additional 32-bit zero.
In the x86-64 compiler, -m128bit-long-double is the default choice as its ABI specifies that
long double is aligned on 16-byte boundary.
Notice that neither of these options enable any extra precision over the x87 standard of 80 bits for a
Warning: if you override the default value for your target ABI, this changes the size of structures and arrays containing
long double variables, as well as modifying the function calling convention for functions taking
long double. Hence they are not binary-compatible with code compiled without that switch.
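The distinction between storage size and precision can be checked directly. This is a minimal sketch with hypothetical function names: on x86, the size is expected to be 12 bytes with -m96bit-long-double and 16 bytes with -m128bit-long-double, while the mantissa width reported by LDBL_MANT_DIG stays at 64 bits in both cases.

```c
#include <float.h>
#include <stddef.h>

/* Storage size of long double in bytes, including padding. */
size_t long_double_size(void)
{
    return sizeof(long double);
}

/* Number of mantissa bits; unaffected by the padding options. */
int long_double_mantissa_bits(void)
{
    return LDBL_MANT_DIG;
}

/* Distance in bytes between consecutive array elements, which is what
   changes for arrays and structures when the default is overridden. */
size_t long_double_array_stride(void)
{
    long double pair[2];
    return (size_t)((char *)&pair[1] - (char *)&pair[0]);
}
```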
retnum instruction, which pops their arguments while returning. This saves one instruction in the caller since there is no need to pop the arguments there.
You can specify that an individual function is called with this calling sequence with the function attribute `stdcall'. You can also override the -mrtd option by using the function attribute `cdecl'. See Function Attributes.
Warning: this calling convention is incompatible with the one normally used on Unix, so you cannot use it if you need to call libraries compiled with the Unix compiler.
Also, you must provide function prototypes for all functions that take variable numbers of arguments (including
printf); otherwise incorrect code is generated for calls to those functions.
In addition, seriously incorrect code results if you call a function with too many arguments. (Normally, extra arguments are harmlessly ignored.)
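The per-function attributes mentioned above can be applied as in the following sketch. It assumes a 32-bit x86 target for the attributes to have any effect; on other targets the macros expand to nothing so that the file still compiles. The macro and function names are hypothetical.

```c
/* The stdcall and cdecl attributes are only meaningful on 32-bit x86. */
#if defined(__i386__)
#define CALLEE_POPS __attribute__((stdcall))
#define CALLER_POPS __attribute__((cdecl))
#else
#define CALLEE_POPS
#define CALLER_POPS
#endif

/* Prototypes are visible at every call site, as the calling
   convention requires. */
CALLEE_POPS int add_callee_pops(int a, int b);
CALLER_POPS int add_caller_pops(int a, int b);

/* Returns with the callee popping its own arguments, regardless of
   whether -mrtd is in effect. */
CALLEE_POPS int add_callee_pops(int a, int b)
{
    return a + b;
}

/* Keeps the usual Unix convention (caller pops) even under -mrtd. */
CALLER_POPS int add_caller_pops(int a, int b)
{
    return a + b;
}
```

Marking individual functions this way limits the incompatible convention to the interfaces that need it, instead of changing every function in the module.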
Warning: if you use this switch, and num is nonzero, then you must build all modules with the same value, including any libraries. This includes the system libraries and startup modules.
Warning: if you use this switch then you must build all modules with the same value, including any libraries. This includes the system libraries and startup modules.
Setting the rounding of floating-point operations to less than the default 80 bits can speed some programs by 2% or more. Note that some mathematical libraries assume that extended-precision (80-bit) floating-point operations are enabled by default; routines
in such libraries could suffer significant loss of accuracy, typically through so-called “catastrophic cancellation”, when this option is used to set the precision to less than extended precision.
force_align_arg_pointer, applicable to individual functions.
On Pentium and Pentium Pro,
long double values should be aligned to an 8-byte boundary (see
-malign-double) or suffer significant run time performance penalties. On Pentium III, the Streaming SIMD Extension (SSE) data type
__m128 may not work properly if it is not 16-byte aligned.
To ensure proper alignment of these values on the stack, the stack boundary must be as aligned as that required by any value stored on the stack. Further, every function must be generated such that it keeps the stack aligned. Thus calling a function compiled with a higher preferred stack boundary from a function compiled with a lower preferred stack boundary most likely misaligns the stack. It is recommended that libraries that use callbacks always use the default setting.
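For a single callback that may be entered from code built with a lower stack boundary, the force_align_arg_pointer attribute mentioned above offers a per-function alternative. The following is a minimal sketch, assuming an x86 target; REALIGN_STACK and the function name are hypothetical.

```c
#include <stdint.h>

/* The attribute exists only on x86 targets; expand to nothing elsewhere. */
#if defined(__i386__) || defined(__x86_64__)
#define REALIGN_STACK __attribute__((force_align_arg_pointer))
#else
#define REALIGN_STACK
#endif

/* A callback that realigns the stack on entry, so a 16-byte aligned
   local is usable even when the caller left the stack less aligned.
   Returns 1 when the local buffer is 16-byte aligned. */
REALIGN_STACK int callback_local_is_aligned(void)
{
    volatile unsigned char scratch[16] __attribute__((aligned(16)));
    scratch[0] = 0;
    return ((uintptr_t)(const volatile void *)scratch % 16u) == 0;
}
```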
This extra alignment does consume extra stack space, and generally increases code size. Code that is sensitive to stack space usage, such as embedded systems and operating system kernels, may want to reduce the preferred alignment to
To generate SSE/SSE2 instructions automatically from floating-point code (as opposed to 387 instructions), see -mfpmath=sse.
GCC does not emit legacy SSEx instructions when -mavx is used. Instead, it generates new AVX instructions or AVX equivalents for all SSEx instructions when needed.
These options enable GCC to use these extended instructions in generated code, even without
-mfpmath=sse. Applications that perform run-time CPU detection must compile separate files for each supported architecture, using the appropriate flags. In particular, the file containing the CPU detection code should
be compiled without these options.
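A detection file of the kind described above might look like the following sketch. It assumes a GCC release that provides the __builtin_cpu_init and __builtin_cpu_supports built-ins (GCC 4.8 and later) and falls back to reporting no support elsewhere. The function names are hypothetical, and the file is meant to be compiled without any of the instruction set options.

```c
/* Returns 1 if the running CPU supports AVX, 0 otherwise or when the
   check is not available with this compiler or target. */
int cpu_has_avx(void)
{
#if (defined(__i386__) || defined(__x86_64__)) && defined(__GNUC__) && \
    (__GNUC__ > 4 || (__GNUC__ == 4 && __GNUC_MINOR__ >= 8))
    __builtin_cpu_init();
    return __builtin_cpu_supports("avx") ? 1 : 0;
#else
    return 0;
#endif
}

/* Pick which separately compiled implementation to call. The AVX
   variant would live in its own file built with -mavx. */
const char *select_kernel(void)
{
    return cpu_has_avx() ? "avx" : "generic";
}
```

Keeping this file free of -mavx and similar options matters because otherwise the compiler could emit the very instructions whose presence the code is trying to test.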
cld instruction in the prologue of functions that use string instructions. String instructions depend on the DF flag to select between autoincrement or autodecrement mode. While the ABI specifies the DF flag to be cleared on function entry, some operating systems violate this specification by not clearing the DF flag in their exception dispatchers. The exception handler can be invoked with the DF flag set, which leads to wrong direction mode when string instructions are used. This option can be enabled by default on 32-bit x86 targets by configuring GCC with the --enable-cld configure option. Generation of
cld instructions can be suppressed with the -mno-cld compiler option in this case.
vzeroupper instruction before a transfer of control flow out of the function to minimize the AVX to SSE transition penalty as well as remove unnecessary
CMPXCHG16B allows for atomic operations on 128-bit double quadword (or oword) data types. This is useful for high-resolution counters that can be updated by multiple processors (or cores). This instruction is generated as part of atomic built-in functions: see __sync Builtins or __atomic Builtins for details.
SAHF instructions in 64-bit code. Early Intel Pentium 4 CPUs with Intel 64 support, prior to the introduction of Pentium 4 G1 step in December 2005, lacked the
LAHF and SAHF instructions which were supported by AMD64. These are load and store instructions, respectively, for certain status flags. In 64-bit mode, the
SAHF instruction is used to optimize
remainder built-in functions; see Other Builtins for details.
movbe instruction to implement
__builtin_ia32_crc32di to generate the
RSQRTSS instructions (and their vectorized variants
RSQRTPS) with an additional Newton-Raphson step to increase precision instead of
SQRTSS (and their vectorized variants) for single-precision floating-point arguments. These instructions are generated only when -funsafe-math-optimizations is enabled together with -finite-math-only and -fno-trapping-math. Note that while the throughput of the sequence is higher than the throughput of the non-reciprocal instruction, the precision of the sequence can be decreased by up to 2 ulp (i.e. the inverse of 1.0 equals 0.99999994).
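The refinement idea can be illustrated in portable C. The sketch below is not the sequence GCC emits: in place of the hardware estimate it uses a well-known integer bit manipulation as a stand-in (accurate to only a few percent), and then applies one Newton-Raphson step for the reciprocal square root, x1 = x0 * (1.5 - 0.5 * a * x0 * x0). The function names are hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* Stand-in for a hardware estimate instruction: a crude approximation
   of 1/sqrt(a) for positive, finite a, obtained by manipulating the
   bits of the float. This is an illustrative substitute only. */
float rsqrt_estimate(float a)
{
    uint32_t bits;
    float r;
    memcpy(&bits, &a, sizeof bits);
    bits = 0x5f3759dfu - (bits >> 1);
    memcpy(&r, &bits, sizeof r);
    return r;
}

/* One Newton-Raphson step applied to the estimate. Each step roughly
   squares the relative error of its input. */
float rsqrt_refined(float a)
{
    float x0 = rsqrt_estimate(a);
    return x0 * (1.5f - 0.5f * a * x0 * x0);
}
```

For a = 4 the estimate is off by a few percent, and a single step brings the result to within about 0.2% of 0.5. A hardware estimate starts from a much smaller error, which is why one step is enough to come close to full single precision there.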
Note that GCC implements
) in terms of
RSQRTPS) already with -ffast-math (or the above option combination), and doesn't need
Also note that GCC emits the above sequence with additional Newton-Raphson step for vectorized single-float division and vectorized
) already with -ffast-math (or the above option combination), and doesn't need
- Enable all estimate instructions.
- Enable the default instructions, equivalent to -mrecip.
- Disable all estimate instructions, equivalent to -mno-recip.
- Enable the approximation for scalar division.
- Enable the approximation for vectorized division.
- Enable the approximation for scalar square root.
- Enable the approximation for vectorized square root.
So, for example, -mrecip=all,!sqrt enables all of the reciprocal approximations, except for square root.
GCC currently emits calls to
vmlsAcos4 for corresponding function type when
-mveclibabi=svml is used, and
__vrs4_powf for the corresponding function type when
-mveclibabi=acml is used.
-D_MT; when linking, it links in a special thread helper library -lmingwthrd which cleans up per-thread exception-handling data.
memset for short lengths.
- Expand using i386
rep prefix of the specified size.
- Expand into an inline loop.
- Always use a library call.
%fs for 64-bit), or whether the thread base pointer must be added. Whether or not this is valid depends on the operating system, and whether it maps the segment to cover the entire TLS area.
For systems that use the GNU C Library, the default is on.
ms_hook_prologue isn't possible at the moment for -mfentry and -pg.
These `-m' switches are supported in addition to the above on x86-64 processors in 64-bit environments.
- Generate code for a 32-bit or 64-bit environment. The
-m32 option sets
long, and pointer types to 32 bits, and generates code that runs on any i386 system.
The -m64 option sets
int to 32 bits and
long and pointer types to 64 bits, and generates code for the x86-64 architecture. For Darwin only, the -m64 option also turns off the -fno-pic and -mdynamic-no-pic options.
The -mx32 option sets
long, and pointer types to 32 bits, and generates code for the x86-64 architecture.
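The type sizes selected by these options can be confirmed from within a program. The sketch below relies on the macros __x86_64__, __i386__ and __ILP32__ that GCC predefines for these environments; the function names are hypothetical.

```c
#include <limits.h>

/* Name the environment this file was compiled for, based on the
   target macros GCC predefines. */
const char *target_environment(void)
{
#if defined(__x86_64__) && defined(__ILP32__)
    return "x32 (x86-64 code, 32-bit long and pointers)";
#elif defined(__x86_64__)
    return "64-bit (x86-64 code, 64-bit pointers)";
#elif defined(__i386__)
    return "32-bit (i386 code, 32-bit long and pointers)";
#else
    return "not an x86 target";
#endif
}

/* Widths in bits of the types the options affect. */
unsigned int_bits(void)     { return (unsigned)(sizeof(int) * CHAR_BIT); }
unsigned long_bits(void)    { return (unsigned)(sizeof(long) * CHAR_BIT); }
unsigned pointer_bits(void) { return (unsigned)(sizeof(void *) * CHAR_BIT); }
```

Building the same file with -m32, -m64 and -mx32 in turn and printing these values is a quick way to check which environment a given toolchain configuration actually targets.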
- Do not use a so-called “red zone” for x86-64 code. The red zone is mandated by the x86-64 ABI; it is a 128-byte area beyond the location of the stack pointer that is not modified by signal or interrupt handlers
and therefore can be used for temporary data without adjusting the stack pointer. The flag
-mno-red-zone disables this red zone.
- Generate code for the small code model: the program and its symbols must be linked in the lower 2 GB of the address space. Pointers are 64 bits. Programs can be statically or dynamically linked. This is the default code model.
- Generate code for the kernel code model. The kernel runs in the negative 2 GB of the address space. This model has to be used for Linux kernel code.
- Generate code for the medium model: the program is linked in the lower 2 GB of the address space. Small symbols are also placed there. Symbols with sizes larger than
-mlarge-data-threshold are put into large data or BSS sections and can be located above 2GB. Programs can be statically or dynamically linked.
- Generate code for the large model. This model makes no assumptions about addresses and sizes of sections.
- Generate code for long address mode. This is only supported for 64-bit and x32 environments. It is the default address mode for 64-bit environments.
- Generate code for short address mode. This is only supported for 32-bit and x32 environments. It is the default address mode for 32-bit and x32 environments.