Thread Local Storage

wenbin0301

Source: http://www.nynaeve.net/?p=180

Thread Local Storage, part 1: Overview

Windows, like practically any other mainstream multithreading operating system, provides a mechanism to allow programmers to efficiently store state on a per-thread basis. This capability is typically known as Thread Local Storage, and it’s quite handy in a number of circumstances where global variables might need to be instanced on a per-thread basis.

Although the usage of TLS on Windows is fairly well documented, the implementation details of it are not so much (though there are a smattering of pieces of third party documentation floating out there).

Conceptually, TLS is in principle not all that complicated (famous last words), at least from a high level. The general design is that all TLS accesses go through either a pointer or array that is present on the TEB, which is a system-defined data structure that is already instanced per thread.

The “per-thread” resolution of the TEB is fairly well documented, but for the benefit of those that are unaware, the general idea is that one of the segment registers (fs on x86, gs on x64) is repurposed by the OS to point to the base address of the TEB for the current thread. This allows, say, an access to fs:[0x0] (or gs:[0x0] on x64) to always access the TEB allocated for the current thread, regardless of other threads in the address space. The TEB does really exist in the flat address space of the process (and indeed there is a field in the TEB that contains the flat virtual address of it), but the segmentation mechanism is simply used to provide a convenient way to access the TEB quickly without having to search through a list of thread IDs and TEB pointers (or other relatively slow mechanisms).

On non-x86 and non-x64 architectures, the underlying mechanism by which the TEB is accessed varies, but the general theme is that there is a register of some sort which is always set to the base address of the current thread’s TEB for easy access.

The TEB itself is probably one of the best-documented undocumented Windows structures, primarily because there is type information included for the debugger’s benefit in all recent ntdll and ntoskrnl.exe builds. With this information and a little disassembly work, it is not that hard to understand the implementation behind TLS.

Before we can look at the implementation of how TLS works on Windows, however, it is necessary to know the documented mechanisms to use it. There are two ways to accomplish this task on Windows. The first mechanism is a set of kernel32 APIs (comprising TlsGetValue, TlsSetValue, TlsAlloc, and TlsFree) that allows explicit access to TLS. The usage of the functions is fairly straightforward; TlsAlloc reserves space on all threads for a pointer-sized variable, and TlsGetValue can be used to read this per-thread storage on any thread (TlsSetValue and TlsFree are conceptually similar).

The second mechanism by which TLS can be accessed on Windows is through some special support from the loader (residing in ntdll) and the compiler and linker, which allow “seamless”, implicit usage of thread local variables, just as one would use any global variable, provided that the variables are tagged with __declspec(thread) (when using the Microsoft build utilities). This is more convenient than using the TLS APIs, as one doesn’t need to call a function every time one wants to use a per-thread variable. It also relieves the programmer of having to explicitly remember to call TlsAlloc and TlsFree at initialization time and deinitialization time, and it implies an efficient usage of per-thread storage space (implicit TLS operates by allocating a single large chunk of memory, the size of which is defined by the sum of all per-thread variables, for each thread so that only one index into the implicit TLS array is used for all variables in a module).

With the advantages of implicit TLS, why would anyone use the explicit TLS API? Well, it turns out that prior to Windows Vista, there are some rather annoying limitations baked into the loader’s implicit TLS support. Specifically, implicit TLS does not operate when a module using it is not being loaded at process initialization time (during static import resolution). In practice, this means that it is typically not usable except by the main process image (.exe) of a process, and any DLL(s) that are guaranteed to be loaded at initialization time (such as DLL(s) that the main process image statically links to).

Thread Local Storage, part 2: Explicit TLS

Previously, I outlined some of the general design principles behind both flavors of TLS in use on Windows. Anyone can see the design and high level interface to TLS by reading MSDN, though; the interesting parts relate to the implementation itself.

The explicit TLS API is (by far) the simpler of the two classes of TLS in terms of the implementation, as it touches the fewest “moving parts”. As I mentioned last time, there are really just four key functions in the explicit TLS API. The most important two are TlsGetValue and TlsSetValue, which manage the actual setting and retrieving of per-thread pointers.

These two functions are simple enough to annotate entirely. The essential mechanism behind them is that they are basically just “dumb accessors” into an array (two arrays in actuality, TlsSlots and TlsExpansionSlots) in the TEB, which is indexed by the dwTlsIndex argument to return (or set) the desired per-thread variable. The implementation of TlsGetValue on Vista (32-bit) is as follows (TlsSetValue is similar, except that it writes to the arrays instead of reading from them, and has support for demand-allocating the TlsExpansionSlots array; more on that later):

    PVOID
    WINAPI
    TlsGetValue(
        __in DWORD dwTlsIndex
        )
    {
        PTEB Teb = NtCurrentTeb(); // fs:[0x18]

        // Reset the last error state.
        Teb->LastErrorValue = 0;

        // If the variable is in the main array, return it.
        if (dwTlsIndex < 64)
            return Teb->TlsSlots[ dwTlsIndex ];

        // Valid indexes are 0 through 1087 (64 main + 1024 expansion slots).
        if (dwTlsIndex >= 1088)
        {
            BaseSetLastNTError( STATUS_INVALID_PARAMETER );
            return 0;
        }

        // Otherwise it's in the expansion array.
        // If it's not allocated, we default to zero.
        if (!Teb->TlsExpansionSlots)
            return 0;

        // Fetch the value from the expansion array.
        return Teb->TlsExpansionSlots[ dwTlsIndex - 64 ];
    }

(The assembler version (annotated) is also available.)

The TlsSlots array in the TEB is a part of every thread, which gives each thread a guaranteed set of 64 thread local storage indexes. Later on, Microsoft decided that 64 was not enough TLS slots to go around and added the TlsExpansionSlots array, for an additional 1024 TLS slots. The TlsExpansionSlots array is demand-allocated in TlsAlloc if the initial set of 64 slots is exceeded.

(This is, by the way, the origin of the seemingly arbitrary 64 and 1088 TLS slot limitations mentioned by MSDN, for those keeping score.)

TlsAlloc and TlsFree are, for all intents and purposes, implemented just as one would expect. They acquire a lock and search for a free TLS slot, returning the index if one is found and otherwise indicating to the caller that there are no free slots. If the first 64 slots are exhausted and the TlsExpansionSlots array has not been created, then TlsAlloc will allocate and zero space for 1024 more TLS slots (pointer-sized values), and then update TlsExpansionSlots to refer to the newly allocated storage.

Internally, TlsAlloc and TlsFree utilize the Rtl bitmap package to track usage of individual TLS slots; each bit in a bitmap describes whether a particular TLS slot is free or in use. This allows for reasonably fast (and space efficient) mapping of TLS slot usage for book-keeping purposes.
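To make the bitmap book-keeping a little more concrete, here is a minimal, portable sketch of the idea. It is not the Rtl bitmap package and does not use its API; the SlotBitmap type and the slot_* functions are names I made up for illustration, and the sketch collapses all 1088 slots into one bitmap and leaves out locking entirely.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SLOT_COUNT    1088u                      /* 64 main + 1024 expansion */
#define BITS_PER_WORD 32u
#define WORD_COUNT    (SLOT_COUNT / BITS_PER_WORD) /* 1088 / 32 = 34 words */
#define NO_FREE_SLOT  0xFFFFFFFFu                /* same value as TLS_OUT_OF_INDEXES */

/* Hypothetical stand-in for the slot usage bitmap: a set bit means "in use". */
typedef struct SlotBitmap {
    uint32_t words[WORD_COUNT];
} SlotBitmap;

void slot_bitmap_init(SlotBitmap *bm)
{
    assert(bm != NULL);
    memset(bm->words, 0, sizeof bm->words);
}

/* Find the lowest clear bit, mark it as used, and return its index. */
uint32_t slot_alloc(SlotBitmap *bm)
{
    for (uint32_t w = 0; w < WORD_COUNT; ++w) {
        if (bm->words[w] == 0xFFFFFFFFu)
            continue;                            /* this word is fully used */
        for (uint32_t b = 0; b < BITS_PER_WORD; ++b) {
            uint32_t mask = (uint32_t)1 << b;
            if (!(bm->words[w] & mask)) {
                bm->words[w] |= mask;
                return w * BITS_PER_WORD + b;
            }
        }
    }
    return NO_FREE_SLOT;
}

/* Clear a slot's bit; returns 0 if the index is invalid or was not in use. */
int slot_free(SlotBitmap *bm, uint32_t index)
{
    if (index >= SLOT_COUNT)
        return 0;
    uint32_t mask = (uint32_t)1 << (index % BITS_PER_WORD);
    if (!(bm->words[index / BITS_PER_WORD] & mask))
        return 0;
    bm->words[index / BITS_PER_WORD] &= ~mask;
    return 1;
}
```

Skipping whole words that are already full is what keeps the search cheap, and the entire table for 1088 slots fits in 136 bytes.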

If one has been following along so far, then the question as to what happens when TlsAlloc is called such that it must create the TlsExpansionSlots array after there is already more than one thread in the current process may have come to mind. This might appear to be a problem at first glance, as TlsAlloc only creates the array for the current thread. Although one might be tempted to conclude that, given this behavior of TlsAlloc, explicit TLS therefore doesn’t work reliably above 64 TLS slots if the extra slots are allocated after the second thread in the process is created, this is in fact not the case. There exists some clever sleight of hand that is performed by TlsGetValue and TlsSetValue, which compensates for the fact that TlsAlloc can only create the TlsExpansionSlots memory block for the current thread.

Specifically, if TlsGetValue is called with an array index within the confines of the TlsExpansionSlots array, but the array has not been allocated for the current thread, then zero is returned. (This is the default value for an uninitialized TLS slot, and is thus legal.) Similarly, if TlsSetValue is called with an array index that falls under the TlsExpansionSlots array, and the array has not yet been created, TlsSetValue allocates the memory block on demand and initializes the requested TLS slot.
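The get/set behavior just described can be modeled in a few lines of portable C. This is only a sketch under my own assumptions: ModelTeb is a made-up structure holding just the two fields discussed here (it is not the real TEB layout), model_tls_get and model_tls_set are hypothetical names, and calloc stands in for whatever allocator the real TlsSetValue uses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define MAIN_SLOTS      64u
#define EXPANSION_SLOTS 1024u

/* Hypothetical model of the TLS-related TEB fields; not the real TEB. */
typedef struct ModelTeb {
    void  *TlsSlots[MAIN_SLOTS];   /* present in every thread */
    void **TlsExpansionSlots;      /* NULL until this thread first needs it */
} ModelTeb;

/* Read a slot; a missing expansion array reads as all zeroes. */
void *model_tls_get(const ModelTeb *teb, unsigned index)
{
    if (index < MAIN_SLOTS)
        return teb->TlsSlots[index];
    if (index >= MAIN_SLOTS + EXPANSION_SLOTS)
        return NULL;               /* out of range */
    if (teb->TlsExpansionSlots == NULL)
        return NULL;               /* nothing stored on this thread yet */
    return teb->TlsExpansionSlots[index - MAIN_SLOTS];
}

/* Write a slot, creating this thread's expansion array on first use. */
int model_tls_set(ModelTeb *teb, unsigned index, void *value)
{
    if (index < MAIN_SLOTS) {
        teb->TlsSlots[index] = value;
        return 1;
    }
    if (index >= MAIN_SLOTS + EXPANSION_SLOTS)
        return 0;
    if (teb->TlsExpansionSlots == NULL) {
        teb->TlsExpansionSlots = calloc(EXPANSION_SLOTS, sizeof(void *));
        if (teb->TlsExpansionSlots == NULL)
            return 0;              /* allocation failed */
    }
    teb->TlsExpansionSlots[index - MAIN_SLOTS] = value;
    return 1;
}
```

With two zero-initialized ModelTeb instances standing in for two threads, a set of slot 100 on the first leaves the second without an expansion array at all, and a get of slot 100 on the second still returns NULL, which is exactly the property that lets the allocation of the array be deferred per thread.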

There also exists one final twist in TlsFree that is required to support the behavior of releasing a TLS slot while there are multiple threads running. A potential problem exists whereby a thread releases a TLS slot, and then it becomes reallocated, following which the previous contents of the TLS slot are still present on other threads running in the process. TlsFree alleviates this problem by asking the kernel for help, in the form of the ThreadZeroTlsCell thread information class. When the kernel sees a NtSetInformationThread call for ThreadZeroTlsCell, it enumerates all threads in the process and writes a zero pointer-length value to each running thread’s instance of the requested TLS slot, thus flushing the old contents and resetting the slot to the unallocated default state. (It is not strictly necessary for this to have been done in kernel mode, although the designers chose to go this route.)
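The effect of that zeroing pass is easy to model, even though the real work happens in the kernel. The sketch below is purely illustrative: ModelThread and model_zero_tls_cell are hypothetical names, only the 64-entry main array is modeled, and a plain array of structures stands in for the kernel's enumeration of the process's threads.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_SLOTS 64u

/* Hypothetical per-thread state: just the 64-entry main slot array. */
typedef struct ModelThread {
    void *TlsSlots[MODEL_SLOTS];
} ModelThread;

/*
 * Model of the ThreadZeroTlsCell step: clear one slot in every thread so
 * that a later allocation of the same index starts out as NULL everywhere.
 */
void model_zero_tls_cell(ModelThread *threads, size_t thread_count,
                         unsigned index)
{
    assert(index < MODEL_SLOTS);
    for (size_t i = 0; i < thread_count; ++i)
        threads[i].TlsSlots[index] = NULL;
}
```

Without this pass, a thread that never touched a recycled index itself could read a stale pointer left behind by the slot's previous owner.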

When a thread exits normally, if the TlsExpansionSlots pointer has been allocated, it is freed to the process heap. (Of course, if a thread is terminated by TerminateThread, the TlsExpansionSlots array is leaked. This is yet another reason among innumerable others why you should stay away from TerminateThread.)

Next up: Examining implicit TLS support (__declspec(thread) variables).

Thread Local Storage, part 3: Compiler and linker support for implicit TLS

Last time, I discussed the mechanisms by which so-called explicit TLS operates (the TlsGetValue, TlsSetValue and other associated supporting routines).

Although explicit TLS is certainly fairly heavily used, many of the more “interesting” pieces about how TLS works in fact relate to the work that the loader does to support implicit TLS, or __declspec(thread) variables (in CL). While both TLS mechanisms are designed to provide a similar effect, namely the capability to store information on a per-thread basis, many aspects of the implementations of the two different mechanisms are very different.

When you declare a variable with the __declspec(thread) extended storage class, the compiler and linker cooperate to allocate storage for the variable in a special region in the executable image. By convention, all variables with the __declspec(thread) storage class are placed in the .tls section of a PE image, although this is not technically required (in fact, the thread local variables do not even really need to be in their own section, merely contiguous in memory, at least from the loader’s perspective). On disk, this region of memory contains the initializer data for all thread local variables in a particular image. However, this data is never actually modified and references to a particular thread local variable will never refer to an address within this section of the PE image; the data is merely a “template” to be used when allocating storage for thread local variables after a thread has been created.

The compiler and linker also make use of several special variables in the context of implicit TLS support. Specifically, a variable by the name of _tls_used (of the type IMAGE_TLS_DIRECTORY) is created by a portion of the C runtime that is statically linked into every program to represent the TLS directory that will be used in the final image (references to this variable should be extern "C" in C++ code for name decoration purposes, and storage for the variable need not be allocated as the supporting CRT stub code already creates the variable). The TLS directory is a part of the PE header of an executable image which describes to the loader how the image’s thread local variables are to be managed. The linker looks for a variable by the name of _tls_used and ensures that in the on-disk image, it overlaps with the actual TLS directory in the final image.

The source code for the particular section of C runtime logic that declares _tls_used lives in the tlssup.c file (which comes with Visual Studio), making the variable pseudo-documented. The standard declaration for _tls_used is as follows:

    _CRTALLOC(".rdata$T")
    const IMAGE_TLS_DIRECTORY _tls_used =
    {
        (ULONG)(ULONG_PTR) &_tls_start, // start of tls data
        (ULONG)(ULONG_PTR) &_tls_end,   // end of tls data
        (ULONG)(ULONG_PTR) &_tls_index, // address of tls_index
        (ULONG)(ULONG_PTR) (&__xl_a+1), // pointer to callbacks
        (ULONG) 0,                      // size of tls zero fill
        (ULONG) 0                       // characteristics
    };

The CRT code also provides a mechanism to allow a program to register a set of TLS callbacks, which are functions with a similar prototype to DllMain that are called when a thread starts or exits (cleanly) in the current process. (These callbacks can even be registered for a main process image, where there is no DllMain routine.) The callbacks are typed as PIMAGE_TLS_CALLBACK, and the TLS directory points to a null-terminated array of callbacks (called in sequence).

For a typical image, there will not exist any TLS callbacks (in practice, almost everything uses DllMain to perform per-thread initialization tasks). However, the support is retained and is fully functional. To use the support that the CRT provides for TLS callbacks, one needs to declare a variable that is stored in the specially named “.CRT$XLx” section, where x is a value between A and Z. For example, one might write the following code:

    #pragma section(".CRT$XLY",long,read)

    extern "C" __declspec(allocate(".CRT$XLY"))
    PIMAGE_TLS_CALLBACK _xl_y = MyTlsCallback;

The strange business with the special section names is required because the in-memory ordering of the TLS callback pointers is significant. To understand what is happening with this peculiar looking declaration, it is first necessary to understand a bit about how the compiler and linker organize data in the final PE image that is produced.

Non-header data in a PE image is placed into one or more sections, which are regions of memory with a common set of attributes (such as page protection). The __declspec(allocate("section-name")) keyword (CL-specific) tells the compiler that a particular variable is to be placed in a specific section in the final executable. The linker additionally has support for concatenating similarly-named sections into one larger section. This support is activated by suffixing a section name with a $ character followed by any other text. The linker concatenates the resulting section with the section of the same name, truncated at the $ character (inclusive).

The linker alphabetically orders individual sections when concatenating them (due to the usage of the $ character in the section name). This means that in-memory (in the final executable image), a variable in the “.CRT$XLB” section will be after a variable in the “.CRT$XLA” section but before a variable in the “.CRT$XLZ” section. The C runtime uses this behavior to create a null-terminated array of function pointers to TLS callbacks (with the pointer stored in the “.CRT$XLZ” section being the null terminator). Thus, in order to ensure that the declared function pointer resides within the confines of the TLS callback array being referenced by _tls_used, it is necessary to place it in a section of the form “.CRT$XLx”.
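The alphabetical ordering of $-suffixed sections can be mimicked with a small, portable model. Nothing here is real toolchain code: SectionContribution, model_merge_sections and model_run_callbacks are names I invented, each "section" contributes exactly one pointer, and I am assuming that __xl_a lives in “.CRT$XLA” (which is what the &__xl_a+1 expression in _tls_used suggests), so the walk skips the first entry and stops at the first null.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef void (*ModelCallback)(int *counter);

/* Hypothetical record for one contribution: a section name and one pointer. */
typedef struct SectionContribution {
    const char   *name;            /* e.g. ".CRT$XLY" */
    ModelCallback callback;        /* NULL for the bracketing entries */
} SectionContribution;

static int compare_by_name(const void *lhs, const void *rhs)
{
    const SectionContribution *a = lhs;
    const SectionContribution *b = rhs;
    return strcmp(a->name, b->name);
}

/* Order the contributions by full name, i.e. by the text after the '$'. */
void model_merge_sections(SectionContribution *items, size_t count)
{
    qsort(items, count, sizeof items[0], compare_by_name);
}

/*
 * Walk the merged array as a callback list: skip the first entry (the
 * ".CRT$XLA" marker, mirroring &__xl_a + 1) and stop at the first NULL.
 */
size_t model_run_callbacks(const SectionContribution *items, size_t count,
                           int *counter)
{
    size_t called = 0;
    for (size_t i = 1; i < count && items[i].callback != NULL; ++i) {
        items[i].callback(counter);
        ++called;
    }
    return called;
}

/* Two stand-in callbacks so the call order is observable. */
void add_one(int *counter) { *counter += 1; }
void add_ten(int *counter) { *counter += 10; }
```

Feeding the model entries named “.CRT$XLZ”, “.CRT$XLY”, “.CRT$XLA” and “.CRT$XLB” in that scrambled order yields A, B, Y, Z after the merge, so a pointer placed in any section between XLA and XLZ ends up inside the array and ahead of the terminator, regardless of the order in which the object files were supplied.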

The creation of the TLS directory is, however, only one portion of how the compiler and linker work together to support __declspec(thread) variables. Next time, I’ll discuss just how the compiler and linker manage accesses to such variables.

Update: Phil mentions that this support for TLS callbacks does not work before the Visual Studio 2005 release. Be warned if you are still using an old compiler package.

Thread Local Storage, part 4: Accessing __declspec(thread) data

Yesterday, I outlined how the compiler and linker cooperate to support TLS. However, I didn’t mention just what exactly goes on under the hood when one declares a __declspec(thread) variable and accesses it.

Before the inner workings of a __declspec(thread) variable access can be explained, however, it is necessary to discuss several more special variables in tlssup.c. These special variables are referenced by _tls_used to create the TLS directory for the image.

The first variable of interest is _tls_index, which is implicitly referenced by the compiler in the per-thread storage resolution mechanism any time a thread local variable is referenced (well, almost every time; there’s an exception to this, which I’ll mention later on). _tls_index is also the only variable declared in tlssup.c that uses the default allocation storage class. Internally, it represents the current module’s TLS index. The per-module TLS index is, in principle, similar to a TLS index returned by TlsAlloc. However, the two are not compatible, and there exists significantly more work behind the per-module TLS index and its supporting code. I’ll cover all of that later as well; for now, just bear with me.

The definitions of _tls_start and _tls_end appear as follows in tlssup.c:

    #pragma data_seg(".tls")

    #if defined (_M_IA64) || defined(_M_AMD64)
    _CRTALLOC(".tls")
    #endif
    char _tls_start = 0;

    #pragma data_seg(".tls$ZZZ")

    #if defined (_M_IA64) || defined(_M_AMD64)
    _CRTALLOC(".tls$ZZZ")
    #endif
    char _tls_end = 0;

This code creates the two variables and places them at the start and end of the “.tls” section. The compiler and linker will automatically assume a default allocation section of “.tls” for all __declspec(thread) variables, such that they will be placed between _tls_start and _tls_end in the final image. The two variables are used to tell the linker the bounds of the TLS storage template section, via the image’s TLS directory (_tls_used).

Now that we know how __declspec(thread) works from a language level, it is necessary to understand the supporting code the compiler generates for an access to a __declspec(thread) variable. This supporting code is, fortunately, fairly straightforward. Consider the following test program:

    __declspec(thread) int threadedint = 0;

    int __cdecl wmain(int ac,
        wchar_t **av)
    {
        threadedint = 42;

        return 0;
    }

For x64, the compiler generated the following code:

    mov ecx, DWORD PTR _tls_index
    mov rax, QWORD PTR gs:88
    mov edx, OFFSET FLAT:threadedint
    mov rax, QWORD PTR [rax+rcx*8]
    mov DWORD PTR [rdx+rax], 42

Recall that the gs segment register refers to the base address of the TEB on x64. 88 (0x58) is the offset in the TEB for the ThreadLocalStoragePointer member on x64 (more on that later):

+0x058 ThreadLocalStoragePointer : Ptr64 Void

If we examine the code after the linker has run, however, we’ll notice something strange:

    mov ecx, cs:_tls_index
    mov rax, gs:58h
    mov edx, 4
    mov rax, [rax+rcx*8]
    mov dword ptr [rdx+rax], 2Ah ; 42
    xor eax, eax

If you haven’t noticed it already, the offset of the “threadedint” variable was resolved to a small value (4). Recall that in the pre-link disassembly, the “mov edx, 4” instruction was “mov edx, OFFSET FLAT:threadedint”.

Now, 4 isn’t a very flat address (one would expect an address within the confines of the executable image to be used). What happened?

Well, it turns out that the linker has some tricks up its sleeve that were put into play here. The “offset” of a __declspec(thread) variable is assumed to be relative to the base of the “.tls” section by the linker when it is resolving address references. If one examines the “.tls” section of the image, things begin to make a bit more sense:

    0000000001007000 _tls segment para public 'DATA' use64
    0000000001007000 assume cs:_tls
    0000000001007000 ;org 1007000h
    0000000001007000 _tls_start dd 0
    0000000001007004 ; int threadedint
    0000000001007004 ?threadedint@@3HA dd 0
    0000000001007008 _tls_end dd 0

The offset of “threadedint” from the start of the “.tls” section is indeed 4 bytes. But all of this still doesn’t explain how the instructions the compiler generated access a variable that is instanced per thread.

The “secret sauce” here lies in the following three instructions:

    mov ecx, cs:_tls_index
    mov rax, gs:58h
    mov rax, [rax+rcx*8]

These instructions fetch ThreadLocalStoragePointer out of the TEB and index it by _tls_index. The resulting pointer is then indexed again with the offset of threadedint from the start of the “.tls” section to form a complete pointer to this thread’s instance of the threadedint variable.

In C, the code that the compiler generated could be visualized as follows:

    // This represents the ".tls" section
    typedef struct _MODULE_TLS_DATA
    {
        int tls_start;
        int threadedint;
        int tls_end;
    } MODULE_TLS_DATA, *PMODULE_TLS_DATA;

    PTEB             Teb;
    PMODULE_TLS_DATA TlsData;

    Teb     = NtCurrentTeb();
    TlsData = Teb->ThreadLocalStoragePointer[ _tls_index ];

    TlsData->threadedint = 42;

This should look familiar if you’ve used explicit TLS before. The typical paradigm for explicit TLS is to place a structure pointer in a TLS slot, and then to access your thread local state, the per thread instance of the structure is retrieved and the appropriate variable is then referenced off of the structure pointer. The difference here is that the compiler and linker (and loader, more on that later) cooperated to save you (the programmer) from having to do all of that explicitly; all you had to do was declare a __declspec(thread) variable and all of this happens magically behind the scenes.

There’s actually an additional curve that the compiler will sometimes throw with respect to how implicit TLS variables work from a code generation perspective. You may have noticed how I showed the x64 version of an access to a __declspec(thread) variable; this is because, by default, x86 builds of a .exe involve a special optimization (/GA (Optimize for Windows Application), quite possibly the worst name for a compiler flag ever) that eliminates the step of referencing the special _tls_index variable by assuming that it is zero.

This optimization is only possible with a .exe that will run as the main process image. The assumption works in this case because the loader assigns per-module TLS index values on a sequential basis (based on the loaded module list), and the main process image should be the second thing in the loaded module list, after NTDLL (which, now that this optimization is being used, can never have any __declspec(thread) variables, or it would get TLS index zero instead of the main process image). It’s worth noting that in the (extremely rare) case that a .exe exports functions and is imported by another .exe, this optimization will cause random corruption if the imported .exe happens to use __declspec(thread).

For reference, with /GA enabled, the x86 build of the above code results in the following instructions:

    mov eax, large fs:2Ch
    mov ecx, [eax]
    mov dword ptr [ecx+4], 2Ah ; 42

Remember that on x86, fs points to the base address of the TEB, and that ThreadLocalStoragePointer is at offset +0x2C from the base of the x86 TEB.

Notice that there is no reference to _tls_index; the compiler assumes that it will take on the value zero. If one examines a .dll built with the x86 compiler, the /GA optimization is always disabled, and _tls_index is used as expected.

The magic behind __declspec(thread) extends beyond just the compiler and linker, however. Something still has to set up the storage for each module’s per-thread state, and that something is the loader. More on how the loader plays a part in this complex process next time.

Thread Local Storage, part 5: Loader support for __declspec(thread) variables (process initialization time)

Last time, I described the mechanism by which the compiler and linker generate code to access a variable that has been instanced per-thread via the __declspec(thread) extended storage class. Although the compiler and linker have essentially “set the stage” with respect to implicit TLS at this point, the loader is the component that “fills in the dots” and supplies the necessary run-time infrastructure to allow everything to operate.

Specifically, the loader is responsible for managing the allocation of per-module TLS index values, and the allocation and management of the memory for the ThreadLocalStoragePointer array referred to by the TEB of every thread. Additionally, the loader is also responsible for managing the memory for each module’s thread-instanced (that is, __declspec(thread)-decorated) variables.

The loader’s TLS-related allocation and management duties can conceptually be split up into four distinct areas (note that this represents the Windows Server 2003 and earlier view of things; I will go over some of the changes that Windows Vista makes to this model in a future posting in the TLS series):

1. At process initialization time, allocate _tls_index values, determine the extent of memory required for each module’s TLS block, and call TLS and DLL initializers (in that order).

2. At thread initialization time, allocate and initialize TLS memory blocks for each module utilizing TLS, allocate the ThreadLocalStoragePointer array for the current thread, and link the TLS memory blocks in to the ThreadLocalStoragePointer array. Additionally, TLS initializers and then DLL initializers (in that order) are invoked for the current thread.

3. At thread deinitialization time, call TLS deinitializers and then DLL deinitializers (in that order), release the current thread’s TLS memory blocks for each module using TLS, and release the ThreadLocalStoragePointer array.

4. At process deinitialization time, call TLS and DLL deinitializers (in that order).

Of course, the loader performs a number of other tasks when these events occur; this is simply a list of those that have some bearing on TLS support.

Most of these operations are fairly straightforward, with the arguable exception of process initialization. Process initialization of TLS is primarily handled in two subroutines inside ntdll, LdrpInitializeTls and LdrpAllocateTls.

LdrpInitializeTls is invoked during process initialization after all DLLs have been loaded, but before any initializer (or TLS) routines have been called. It essentially walks the loaded module list and sums the length of TLS data for each module that contains a valid TLS directory. For each module that contains TLS, a data structure is allocated that contains the length of the module’s TLS data and the TLS index that has been assigned to that module. (The TlsIndex field in the LDR_DATA_TABLE_ENTRY structure appears to be unused except as a flag that the module has TLS (being always set to -1), at least as far back as Windows XP. It is worth mentioning that the WINE implementation of implicit TLS incorrectly uses TlsIndex as the real module TLS index, so it may be unreliable to assume that it is always -1 if you care about working on WINE.)

Modules that use implicit TLS and which are present at initialization time are additionally marked as pinned in memory for the lifetime of the process by LdrpInitializeProcess (the LoadCount of any such module is fixed to 0xFFFF). In practice, this is typically unlikely to matter, as for such modules to be present at process initialization time, they must also by definition be statically linked by either the main process image or a dependency of the main process image.

After LdrpInitializeTls has determined which modules use TLS in the current process and has assigned those modules TLS index values, LdrpAllocateTls is called to allocate and initialize module TLS values for the initial thread.
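As a rough mental model of this two-step process, consider the following portable sketch. It is emphatically not the ntdll code: ModelModule is a made-up record (not LDR_DATA_TABLE_ENTRY), model_initialize_tls and model_allocate_tls are hypothetical names loosely corresponding to the two loader routines, and malloc/calloc stand in for the loader's own allocator. The sketch assumes, as described above, that indexes are handed out sequentially in module list order and that each thread's block is a plain copy of the module's template bytes.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a loaded module; not LDR_DATA_TABLE_ENTRY. */
typedef struct ModelModule {
    const unsigned char *tls_start; /* template start, NULL if no TLS */
    const unsigned char *tls_end;   /* one past the template's last byte */
    unsigned             tls_index; /* assigned by model_initialize_tls */
} ModelModule;

/* Assign sequential TLS indexes to modules that carry a TLS template. */
unsigned model_initialize_tls(ModelModule *modules, size_t count)
{
    unsigned next_index = 0;
    for (size_t i = 0; i < count; ++i) {
        if (modules[i].tls_start != NULL)
            modules[i].tls_index = next_index++;
    }
    return next_index;              /* number of modules using TLS */
}

/*
 * Build one thread's pointer array: one freshly allocated block per
 * TLS-using module, initialized by copying that module's template bytes.
 */
void **model_allocate_tls(const ModelModule *modules, size_t count,
                          unsigned tls_module_count)
{
    if (tls_module_count == 0)
        return NULL;
    void **pointers = calloc(tls_module_count, sizeof(void *));
    if (pointers == NULL)
        return NULL;
    for (size_t i = 0; i < count; ++i) {
        if (modules[i].tls_start == NULL)
            continue;
        size_t length = (size_t)(modules[i].tls_end - modules[i].tls_start);
        void *block = malloc(length ? length : 1);
        if (block == NULL) {
            /* Undo the partial work so the caller sees all-or-nothing. */
            for (unsigned j = 0; j < tls_module_count; ++j)
                free(pointers[j]);
            free(pointers);
            return NULL;
        }
        memcpy(block, modules[i].tls_start, length);
        pointers[modules[i].tls_index] = block;
    }
    return pointers;
}
```

Calling model_allocate_tls twice models two threads: each gets its own copies of every template, so a write through one thread's array is invisible to the other, and the template itself is never modified.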

At this point, process initialization continues, eventually resulting in TLS initializers and then DLL initializers (DllMain) being called for loaded modules. (Note that the main process image can have one or more TLS callbacks, even though it cannot have a DLL initializer routine.)

One interesting fact about TLS initializers is that they are always called before DLL initializers for their corresponding DLL. (The process occurs in sequence, such that DLL A’s TLS and DLL initializers are called, then DLL B’s TLS and DLL initializers, and so forth.) This means that TLS initializers need to be careful about making, say, CRT calls (as the C runtime is initialized before the user’s DllMain routine is called, by the actual DLL initializer entrypoint, such that the CRT will not be initialized when a TLS initializer for the module is invoked). This can be dangerous, as global objects will not have been constructed yet; the module will be in a completely uninitialized state except that imports have been snapped.

Another point worth mentioning about the loader’s TLS support is that contrary to the Portable Executable specification, the SizeOfZeroFill member of the IMAGE_TLS_DIRECTORY structure is not used (or supported) by the linker or the loader. This means that in practice, all TLS template data is initialized, and the size of the memory block allocated for per-module implicit TLS does not include the SizeOfZeroFill member as the PE documentation (or certain other publications that appear to be based on said documentation) would seem to state. (It seems that the WINE folks happened to get it wrong as well, thanks to the implication in the PE specification that the field is actually used.)

Some programs abuse TLS callbacks for anti-debugging purposes (gaining code execution before the normal process entrypoint routine is executed by creating a TLS callback for the main process image), although this is, in practice, quite obvious as almost all PE images do not use TLS callbacks at all.

Up through Windows Server 2003, the above is really all the loader needs to do with respect to supporting __declspec(thread). While this approach would appear to work quite well, it turns out that there are, in fact, some problems with it (if you’ve been following along thus far, you can probably figure out what they are). More on some of the limitations of the Windows Server 2003 approach to implicit TLS next week.