Operating system code, as one of my colleague developers recently realized, is “just code”. It’s not voodoo and it does not exist on a higher plane of knowledge. In fact, an operating system kernel is usually remarkably well structured and well designed in comparison to other pieces of software. When you think about it, it has to be. More than one person needs to understand and maintain a core set of code that must work and must support debugging of all other software that runs upon it. People move on, and change jobs to look for new challenges to keep learning. If only one person understood the way an operating system worked, there would be a huge amount of risk.

One of the most interesting facets of operating systems that I’ve followed in my career is how they start. Initialization is the last step in design, but at the same time it uncovers the most fundamental bedrock of the principles used. You start with literally nothing but a CPU which can execute instructions (sometimes not even with memory to use), and must take a platform from that point to a fully functioning system - one that not only utilizes available hardware, but abstracts it to a common understanding.

What I’m going to do in this article is discuss the details of how the Windows CE 6.0 kernel starts, and the association of the ‘Microsoft’ kernel code with the code that comes from an Original Equipment Manufacturer (OEM). It is hoped that by relating this understanding more people will have a better idea of the ‘hows’ and ‘whys’ of the Microsoft design.

To start, let’s quickly review how the operating system software is built. The Microsoft tool chain emits .EXE and/or .DLL program files. Files with either extension are in the “Portable Executable”, or “PE”, format. They are practically identical in every aspect:

  • They are extended Common Object File Format (COFF) format files
  • They have import tables and export tables (EXE export tables are usually blank)
  • They have an entry point defined in their headers for where execution should start

There is nothing extraordinary about the operating system kernel program – it is compiled using the standard compiler and with a minimal set of definitions for that compiler. An EXE file is produced (called NK.EXE). It does not link to any external library or DLL – it can’t. When this code starts there is nothing in the system, or even a system for that matter. Since the EXE is in a known format (PE COFF), you can determine the entry point from looking at the EXE header. This means we know where to set the CPU’s instruction pointer to so that the program can start.
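To make "determine the entry point from looking at the EXE header" concrete, here is a minimal sketch in C. It assumes a standard 32-bit PE file laid out as it would be on disk, and the function name `pe_entry_point` is made up for this illustration. The header offsets used (the `e_lfanew` field at 0x3C, `AddressOfEntryPoint` at byte 16 and `ImageBase` at byte 28 of the optional header) come from the published PE format, not from the Windows CE sources.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Read a little-endian 32-bit value from a byte buffer. */
static uint32_t read_le32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/*
 * Return the virtual address of the entry point of a PE32 file held in
 * `file`, or 0 if the headers do not look valid.
 * (Hypothetical helper, for illustration only.)
 */
uint32_t pe_entry_point(const uint8_t *file, size_t len)
{
    assert(file != NULL);
    if (len < 0x40 || file[0] != 'M' || file[1] != 'Z')
        return 0;                                 /* no DOS header */

    uint32_t pe_off = read_le32(file + 0x3C);     /* e_lfanew */
    /* Need: signature (4) + COFF header (20) + 32 bytes of optional header. */
    if (pe_off > len || len - pe_off < 4 + 20 + 32)
        return 0;
    if (memcmp(file + pe_off, "PE\0\0", 4) != 0)
        return 0;                                 /* no PE signature */

    const uint8_t *opt = file + pe_off + 4 + 20;  /* optional header */
    if ((opt[0] | (opt[1] << 8)) != 0x10B)        /* PE32 magic */
        return 0;

    uint32_t entry_rva  = read_le32(opt + 16);    /* AddressOfEntryPoint */
    uint32_t image_base = read_le32(opt + 28);    /* ImageBase */
    return image_base + entry_rva;
}
```

For a file linked at 0x80000000 whose entry point sits 0x1000 bytes into the image, the function would return 0x80001000, which is the value to load into the instruction pointer.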

One additional property is that a PE file can be arranged so that it may “execute in place”. This means that if the file data is placed at a particular virtual address, no changes need to be made to the program code in the file in order for it to address other code and data at the correct addresses. For example, I can tell the Microsoft linker program to place the kernel program file at the virtual address 0x80000000. Then references to code (function entry points) will be placed in the EXE file such that other code can jump to them by address. If function foo() is at address 0x80001000 and inside its body it calls a function bar() which sits at address 0x80005000, there will be an instruction stored directly in the program code for ‘foo()’ that calls to 0x80005000. The dotted lines are just the delineation of function code start or end.

If the EXE program file for the kernel could not sit at 0x80000000 and had to be moved, the ‘bar()’ function would move with it and the call instruction in ‘foo()’ would have to be changed to have the correct, new address. Otherwise it would call to the wrong place:

 

You can see in the example above that if the kernel EXE file that is designed to be placed at 0x80000000 is loaded at 0x80050000 instead, the instructions in the program will be incorrect.

The process of changing an EXE or DLL program file after it has been loaded to reflect the actual load address is called “fixing up”. Records are placed in a standard EXE file which allow the program file to be fixed up. However, until the fixup process is done the addresses of functions in the EXE will be incorrect. To get around this, Windows CE kernel EXE files are fixed up beforehand to be loaded at a specific address. A program called ROMIMAGE actually pre-processes the kernel EXE file and some DLLs that are used and fixes them up when it builds the operating system image file (NK.BIN).
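The idea behind fixing up can be sketched in a few lines of C. This is a simplified model, not the real PE relocation format (which groups its records by page and tags each with a type): the image is assumed to be an array of 32-bit words, and the fixup records are assumed to be a plain list of word indexes that hold absolute addresses. The name `apply_fixups` is invented for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Add the load-address delta to every 32-bit absolute address listed in
 * `fixup_offsets`. Each offset is the index (in 32-bit words) of a slot
 * in `image` that holds an absolute address.
 * (Simplified, hypothetical record format.)
 */
void apply_fixups(uint32_t *image, size_t image_words,
                  const size_t *fixup_offsets, size_t fixup_count,
                  uint32_t preferred_base, uint32_t actual_base)
{
    uint32_t delta = actual_base - preferred_base; /* wraps modulo 2^32 */

    for (size_t i = 0; i < fixup_count; i++) {
        assert(fixup_offsets[i] < image_words);
        image[fixup_offsets[i]] += delta;
    }
}
```

With a preferred base of 0x80000000 and an actual base of 0x80050000, a stored call target of 0x80005000 becomes 0x80055000. ROMIMAGE does the equivalent work once, at build time, so that nothing has to be patched when the device boots.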

To recap, we get a fixed-up EXE file which is called NK.EXE which contains portions of the operating system kernel. This EXE has an entry point defined in it, the same as every other COFF EXE or DLL. For execution to start, the bootloader for the system is supposed to put the image file at the right address, find this EXE entry point and jump to it. The bootloader is a separate discussion, and its startup and execution are very platform-specific. For the context of this article, we will simply assume that a bootloader places the OS image file into memory at a specific address. We will see below how the bootloader can find the NK.EXE file within the image and then find its entry point.

The NK.EXE is only part of the Windows Embedded CE 6.0 kernel – it comprises the OEM Adaptation Layer (OAL) and boilerplate code to start the system. The main portion of the operating system kernel that does all the process, thread and memory functionality lives in a Microsoft-supplied DLL called ‘kernel.dll’. This is a DLL which is also ‘fixed up’ by the ROMIMAGE program to live at a specific virtual address in memory. So this means there are at least two executable modules that we need to know the location and the entry point of. The entry point address is stored inside the EXE or DLL file, but what about the location of the EXE and DLL files inside the image?

Windows CE images have an important structure set up by ROMIMAGE that is placed into the image file, called the “Table Of Contents”, or TOC. This TOC holds pointers and metadata for the operating system image file. Somewhere near the beginning of the image file a marker is placed – the bytes “CECE” (0x43454345). Right after this marker is placed an offset to the TOC. This allows a bootloader or other program looking at the file to be able to find information about the image. In addition to this offset value that is prefixed by a marker, the OAL must define a public symbol called ‘pTOC’ (exported using ‘C’ naming conventions), which ROMIMAGE can find and fill in with the virtual address of the TOC when it prepares the image file. When compiled, the pTOC variable in the NK.EXE must have the value 0xFFFFFFFF. When it prepares the NK.BIN OS image, ROMIMAGE does the following (in addition to other tasks):

  1. Load NK.EXE and fix it up.
  2. Make the TOC and find a place for it in the image file (will live in virtual memory when the OS image is loaded).
  3. Find the ‘pTOC’ variable in the NK.EXE file and make sure it has the current value 0xFFFFFFFF.
  4. Set the pTOC variable value to the virtual address of the TOC that was created in step (2).

This way, when the NK.EXE starts it can reference this variable to know where the TOC is. Using the TOC, the program can find all the other pieces of the operating system image.
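Steps (3) and (4) of the ROMIMAGE list amount to a guarded write. The sketch below is not ROMIMAGE's actual code; `patch_ptoc` and `PTOC_UNSET` are names invented here, and the only fact taken from the description above is that the slot must hold 0xFFFFFFFF before it is filled in.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PTOC_UNSET 0xFFFFFFFFu  /* value pTOC is compiled with */

/*
 * Fill in the pTOC slot with the virtual address of the TOC.
 * Returns 0 on success, or -1 if the slot does not hold the expected
 * placeholder, in which case it is left untouched.
 * (Hypothetical helper, for illustration only.)
 */
int patch_ptoc(uint32_t *ptoc_slot, uint32_t toc_virtual_address)
{
    assert(ptoc_slot != NULL);
    if (*ptoc_slot != PTOC_UNSET)
        return -1;              /* step 3: placeholder check failed */
    *ptoc_slot = toc_virtual_address;   /* step 4 */
    return 0;
}
```

Checking for the placeholder first is a cheap way to catch the wrong symbol being patched, or an image that has already been processed.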

ROMIMAGE uses the configuration .BIB files to know where the image is supposed to go and where RAM is. There are two important parts of the CONFIG.BIB file – the RAMIMAGE and the RAM lines. Here is an example from the Device Emulator’s CONFIG.BIB:

NK 0x80070000 0x02000000 RAMIMAGE
RAM 0x82070000 0x01E7F000 RAM

These entries tell ROMIMAGE what to do. It knows to place the OS image file at 0x80070000, and that it can start using read/write memory at 0x82070000. With this information it can place modules such as NK.EXE and KERNEL.DLL into virtual memory, and then build a TOC and put that into the image as well. To help the kernel start, the TOC also contains information on where RAM is.

[Figure: a more detailed look at what is in memory when the image file has been placed]
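As a quick arithmetic check on the two CONFIG.BIB lines quoted above, the RAMIMAGE region starts at 0x80070000 and is 0x02000000 bytes long, so it ends exactly where the RAM region begins. A small C sketch of that check follows; the `bib_region` type and the function names are made up for this example and are not part of any Windows CE tool.

```c
#include <assert.h>
#include <stdint.h>

/* One memory region from CONFIG.BIB: start address and size in bytes. */
struct bib_region {
    uint32_t start;
    uint32_t size;
};

/* First address past the end of the region. */
uint32_t region_end(struct bib_region r)
{
    assert(r.size <= UINT32_MAX - r.start);  /* must not wrap around */
    return r.start + r.size;
}

/* Nonzero when region b begins exactly where region a ends. */
int regions_adjacent(struct bib_region a, struct bib_region b)
{
    return region_end(a) == b.start;
}
```

Feeding in the emulator values gives 0x80070000 + 0x02000000 = 0x82070000 for the end of the image region, and 0x82070000 + 0x01E7F000 = 0x83EEF000 for the end of RAM.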

In order for the actual operating system to start, the bootloader needs to:

  1. Put the image file at the right place in memory.
  2. Find the “CECE” marker.
  3. Use the TOC pointer that comes right after it to find the TOC.
  4. Search the TOC for the “NK.EXE” file entry.
  5. Scan the EXE file to find its entry point (it is a standard PE format file).
  6. Jump to the address that corresponds to the entry point.
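Steps (2) and (3) of the bootloader list can be sketched as a short scan. Several things here are assumptions rather than facts from the text: the marker is compared as a 32-bit value (0x43454345, built from the ASCII codes of ‘C’ and ‘E’), only aligned positions in the first 4 KB are searched, and the name `find_toc_pointer` is invented. A real bootloader would normally know the exact offset of the marker instead of scanning for it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOC_MARKER 0x43454345u  /* assumed marker value ('C' = 0x43, 'E' = 0x45) */
#define SCAN_LIMIT 0x1000u      /* assumed: only search the first 4 KB */

/*
 * Look for the marker near the start of the image and fetch the 32-bit
 * value stored right after it. Returns 0 and fills *toc_out on success,
 * or -1 if the marker is not found.
 * (Hypothetical helper; values are read in host byte order.)
 */
int find_toc_pointer(const uint8_t *image, size_t len, uint32_t *toc_out)
{
    assert(image != NULL && toc_out != NULL);
    size_t limit = len < SCAN_LIMIT ? len : SCAN_LIMIT;

    for (size_t off = 0; off + 8 <= limit; off += 4) {
        uint32_t word;
        memcpy(&word, image + off, sizeof word);
        if (word == TOC_MARKER) {
            memcpy(toc_out, image + off + 4, sizeof *toc_out);
            return 0;
        }
    }
    return -1;
}
```

Once the TOC has been located, steps (4) to (6) are a table search for the NK.EXE entry followed by an ordinary entry-point lookup and a jump.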

The really interesting stuff happens once the NK.EXE program is started. In broad strokes, it has its own tasks to perform:

  1. Set up virtual memory and turn it on.
  2. Gather important information that the KERNEL.DLL will need to use to run the system.
  3. Use the pTOC to scan the TOC for the KERNEL.DLL file inside the operating system image.
  4. Find the entry point of KERNEL.DLL (it is a standard PE format file).
  5. Pass critical information gathered in (2) to KERNEL.DLL in a call to its entry point.

We will walk through these activities in detail to better understand them. Some parts of the startup process are CPU-type-specific. For instance, the ARM CPU and the X86 CPU have different virtual memory management hardware and mapping structures. However, to keep things consistent a general process is maintained. Whenever possible I will attempt to call out any operations specific to an architecture.

When the NK.EXE starts, there are a few prerequisites of the system:

  1. All caches are disabled
  2. The entire RAMIMAGE and RAM regions specified in the CONFIG.BIB file are physically addressable and readable.
  3. Virtual Memory is in a predefined state (CPU typically executes in physical address mode).

An additional prerequisite can be satisfied before NK.EXE starts, or can be done in the very beginning of NK.EXE execution:

4. RAM should be writeable without any supplemental configuration (for example, of a memory controller).

These assumptions allow the NK.EXE startup code to do what is necessary to bring any particular system up, and not have to worry about some things being done and others not being done. Point (3) above may be counterintuitive, but since the kernel must be entirely self-contained, it does not make sense for it to rely on the bootloader to configure virtual memory properly before it starts. This ‘decouples’ the OS from whatever bootloader is used to start it.

When it starts executing instructions in physical address mode, the first action taken by NK.EXE is to calculate the physical address of the OEMAddressTable symbol. This is a table that is built into the kernel that defines the static (unchanging) default regions of virtual memory. NK.EXE knows:

  1. Its own location in virtual memory (where it will be executing instructions)
  2. Its own location in physical memory (where it currently is executing instructions)
  3. The virtual address of the OEMAddressTable (it was determined when the NK.EXE was built and subsequently fixed up by ROMIMAGE).

Using this information, a simple calculation tells it the physical address of OEMAddressTable:

NK::PhysicalBase + (NK::Virtual OEMAddressTable – NK::Virtual Base) → NK Physical OEMAddressTable
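Written as C, the calculation is a single line. The function name and parameter names below are invented for the illustration; the real startup code does this in architecture-specific assembly before any stack or data is available.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Convert a virtual address inside the image to the physical address at
 * which the same byte currently sits, given where the image starts in
 * each address space. (Hypothetical helper, for illustration only.)
 */
uint32_t image_virt_to_phys(uint32_t phys_base, uint32_t virt_base,
                            uint32_t virt_addr)
{
    assert(virt_addr >= virt_base);  /* address must lie inside the image */
    return phys_base + (virt_addr - virt_base);
}
```

For example, if the image is linked at virtual 0x80070000 but is sitting at physical 0x00070000, a symbol at virtual 0x80071234 is found at physical 0x00071234.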

The OEMAddressTable has triads of DWORDS making up a line in a table, with the following format:

<region virtual start> <region physical start> <region size in MB>

<region virtual start> <region physical start> <region size in MB>

...

From the information found in this table, the NK.EXE program can set up the virtual memory mapping tables for the Memory Management Unit (MMU) to function. Where the MMU-formatted mapping tables are kept and what they look like is platform-specific – the OEMAddressTable is a simplistic format that works for any architecture. Virtual memory is set up using the data in the OEMAddressTable and enabled, and then the NK.EXE transitions to the virtual address where it can execute code.
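To show how little information the table carries, here is a sketch that walks a table of triads and translates a virtual address to a physical one. The struct and function names are hypothetical, and the sketch assumes the table is terminated by an all-zero row. The real startup code uses the same rows to fill in MMU page tables instead of answering one lookup at a time.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One row of the table: three DWORDs, in the order described above. */
struct address_region {
    uint32_t virt_start;
    uint32_t phys_start;
    uint32_t size_mb;
};

#define MB (1024u * 1024u)

/*
 * Translate a virtual address using the static table. The table is
 * assumed to end with a row whose size is zero. Returns 0 and stores the
 * physical address on success, or -1 if no region covers the address.
 */
int table_virt_to_phys(const struct address_region *table,
                       uint32_t virt, uint32_t *phys_out)
{
    assert(table != NULL && phys_out != NULL);

    for (const struct address_region *r = table; r->size_mb != 0; r++) {
        uint64_t size = (uint64_t)r->size_mb * MB;  /* 64-bit: cannot overflow */
        if (virt >= r->virt_start &&
            (uint64_t)virt - r->virt_start < size) {
            *phys_out = r->phys_start + (virt - r->virt_start);
            return 0;
        }
    }
    return -1;
}
```

With a 64 MB region mapping virtual 0x80000000 to physical 0x00000000, the virtual address 0x80070000 translates to physical 0x00070000.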

One thing to note at this point is that anything that is supposed to be in RAM that needs to be pre-initialized (set to zero or some other known value) is not yet available. RAM is still a clean slate and can have any contents whatsoever. The initialization values in the image file (the .data sections of NK.EXE and other modules) for read/write data must be copied from the image to actual RAM addresses before they can be properly used. How does the NK.EXE know what to copy or where to place things in virtual RAM for these modules? The TOC.

The TOC not only lists the start addresses of all modules in the image, but it also describes RAM and where the read/write portions of each module are to be located so that the kernel can work with them. Pieces of the OS image that need to be copied to RAM are called “copy entries”. Before the NK.EXE can access its own read/write variables, it needs to copy the copy entries to RAM. This raises the question – the pTOC is a variable, isn’t it? How could the NK.EXE know where the pTOC is if it hasn’t been set up? The answer is that the pTOC is a read-only variable – only ROMIMAGE writes to it when the image file is created. The storage for pTOC is not located in RAM, and does not need to be copied before its value can be used. The function inside NK.EXE that copies all the copy entries described by the pTOC to RAM is typically called “KernelRelocate()”. It is a simple process of going through a simple table of structures and copying ranges of virtual memory from one place to another. Once it is finished all NK.EXE variables can be read from or written to just like any other program.
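The copy loop itself is short. The sketch below is not the real KernelRelocate(); the `copy_entry` layout and the function name are assumptions made for the example. It also assumes that each entry gives both the number of initialized bytes to copy and the total size of the destination, with the remainder set to zero, which is one plausible way to get zero-initialized variables into a known state.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical copy entry: where the initial data lives and where it goes. */
struct copy_entry {
    const uint8_t *source;    /* initial values inside the OS image */
    uint8_t       *dest;      /* read/write location in RAM */
    size_t         copy_len;  /* bytes of initial data to copy */
    size_t         dest_len;  /* total size of the RAM region */
};

/* Copy each entry to RAM and zero-fill whatever the copy does not cover. */
void relocate_copy_entries(const struct copy_entry *entries, size_t count)
{
    for (size_t i = 0; i < count; i++) {
        const struct copy_entry *e = &entries[i];
        assert(e->copy_len <= e->dest_len);
        memcpy(e->dest, e->source, e->copy_len);
        memset(e->dest + e->copy_len, 0, e->dest_len - e->copy_len);
    }
}
```

Note that the loop only reads the table and writes to the destinations, so it can run before any of the program's own read/write variables are usable.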

 

At this point we have a working program, just like any other program for decades past. It executes instructions, can call functions, and can read and write memory locations. There are no threads, no processes, and no operating system constructs, but everything is placed in a known location and can be accessed to let us do the rest of the startup of the higher-level systems.

Virtual Memory allows a tremendous amount of flexibility. Windows CE reserves a few regions of the virtual address range for its own private use inside the OS kernel. There are several ranges of 4k ‘pages’ of virtual memory that are set aside in the highest address ranges, from about 0xFFFE0000 upwards. The kernel maps some physical memory into this range to store its ‘global’ dynamic data. Some of this memory can be used for memory mapping tables for an architecture-specific MMU. Some is reserved for the kernel-mode and interrupt stacks. Most importantly, at least one of the 4k pages is reserved specifically as a ‘Kernel Data Page’. This page contains a plethora of data fields which are specific to a version of the kernel. The NK.EXE sets up the location and initial contents of this page directly.

Three important values are stored in the structure by NK.EXE:

  1. A copy of pTOC
  2. The address of OEMAddressTable
  3. The address of the function OEMInitGlobals()

The first two pieces of information are placed in the Kernel Data Page so that any code that knows the address of the Page can find what is in the OS image and the basic layout of virtual memory. The last piece of information is specifically used so that the NK.EXE contents can be used once control has been passed to KERNEL.DLL. In general, the contents of the reserved portion of virtual memory look like:

Now that the Kernel data page has been initialized and virtual memory is active, we can jump into the Microsoft KERNEL.DLL executable’s entry point. Remember, we can find the KERNEL.DLL file in the image by using the TOC, and then we can scan for the entry point of the module. Even though NK.EXE knows where it is going to put the kernel data page in virtual memory beforehand, the KERNEL.DLL cannot assume its location. Therefore, we pass the virtual address of the kernel data page to the entry point of KERNEL.DLL. Although the Microsoft code can call back into the NK.EXE function addresses, control is never fully restored to the NK.EXE program.

After the jump, we are now executing Microsoft kernel code. The code at the entry point is given the address of the Kernel Data Page, and through its fields it can reach the TOC to learn anything it needs to know about the OS image. The kernel does some basic setup of its own and sets some critical data fields for its own use into the Kernel Data Page.

The KERNEL.DLL has a static table of functions and data, called “NKGlobals”, which is built into its DLL simply as a static data structure. Since the KERNEL.DLL is fixed up by ROMIMAGE to run from a particular virtual address, the function pointers in the NKGlobals will be correct when the KERNEL.DLL code starts to run. Some of the functions pointed to by this structure are ones like SetLastError() and NKwvsprintfW(). These are routines that the NK.EXE is allowed to call directly. However, it is important to note that at this point the NK.EXE does not know where these functions are in KERNEL.DLL – it still needs to be told where this table of functions and data is inside KERNEL.DLL.

The KERNEL.DLL passes the address of “NKGlobals” back to NK.EXE in a function call to OEMInitGlobals(), the address of which was left in the Kernel Data Page. So, in essence the function call graph looks like this:

 

As shown above, the OEMInitGlobals() function stores a pointer to the NKGlobals structure that resides in KERNEL.DLL. After it stores this pointer, NK.EXE can use it to find the addresses of the KERNEL.DLL functions it is allowed to call.

OEMInitGlobals also passes back (via function return value) a pointer to its own structure, called “OEMGlobals”. This structure is critical to the kernel to get access to all the functionality that is platform-specific that is inside NK.EXE. The KERNEL.DLL module is constructed so that it will run on any processor belonging to a certain architecture (X86, ARM, etc). The NK.EXE is the abstraction of a specific species of the architecture (such as XSCALE or OMAP processor) and the platform that supports that architecture. The OEMGlobals structure is comprised of function pointers and data just like NKGlobals. Some of its members include:

  • PFN_InitDebugSerial(), PFN_WriteDebugByte(), PFN_ReadDebugByte()
  • PFN_SetRealTime(), PFN_GetRealTime(), PFN_SetAlarmTime()
  • PFN_Ioctl()

These function pointers point to the legacy OEM functions like OEMInitDebugSerial and OEMIoctl that live inside NK.EXE. Many other functions are listed so that KERNEL.DLL can do what is necessary for a particular platform. The functions are fairly self-explanatory in name and are well documented on MSDN.

Once the call to OEMInitGlobals() completes, the KERNEL.DLL has everything it needs to do architecture-generic and platform-specific processing. It knows where memory is and how it is laid out virtually, as well as the location of every module in the image. The NK.EXE also has a pointer to a table of functions it can call. In essence, the two code modules have executed a manual ‘handshake’ by executing a simplistic method of manual dynamic linking.
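The handshake can be modelled in a single C file. Everything named here is a stand-in: the two structs are heavily trimmed and hypothetical, each with one function pointer, and the real NKGlobals and OEMGlobals structures have many more members and different signatures. What the sketch preserves is the shape of the exchange: one call passes a table pointer in, and the return value passes a table pointer back.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, heavily trimmed versions of the two tables. */
struct nk_globals {
    void (*set_last_error)(unsigned long code);
};

struct oem_globals {
    void (*init_debug_serial)(void);
};

/* ---- "NK.EXE" side ---- */
const struct nk_globals *g_nk;   /* saved pointer into "KERNEL.DLL" */
int g_serial_ready;

static void oem_init_debug_serial(void) { g_serial_ready = 1; }

struct oem_globals g_oem = { oem_init_debug_serial };

/* Stores the kernel's table and hands back the OEM table. */
struct oem_globals *oem_init_globals(const struct nk_globals *nk)
{
    assert(nk != NULL);
    g_nk = nk;
    return &g_oem;
}

/* ---- "KERNEL.DLL" side ---- */
unsigned long g_last_error;

static void kernel_set_last_error(unsigned long code) { g_last_error = code; }

const struct nk_globals g_nk_globals = { kernel_set_last_error };

/*
 * In the real system the address of the init function is read from the
 * Kernel Data Page; here it is simply passed in as a parameter.
 */
struct oem_globals *kernel_handshake(
    struct oem_globals *(*init_globals)(const struct nk_globals *))
{
    return init_globals(&g_nk_globals);
}
```

After `kernel_handshake(oem_init_globals)` returns, each side holds a pointer to the other side's table and can call through it, which is all that "manual dynamic linking" means here.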

Everything up to this point that NK.EXE and KERNEL.DLL have done has been done without any processes or threads, and without any kernel services running. To bring the rest of the system up, the KERNEL.DLL has to do three things:

  1. Architecture-specific setup
  2. Architecture-neutral setup
  3. Platform-specific setup (specific CPU and BSP initialization)

The architecture-specific setup is done first by a call to a KERNEL.DLL function called <architecture>Setup. On an ARM platform this would be called ARMSetup(). On an X86 platform this would be called X86Setup(). The actions taken by the architecture-specific code are numerous, but they all execute in a single-threaded context with no processes running. The actions taken here include but are not limited to:

  • Set up hard required page tables and reserve VM for kernel page tables
  • Update cache information in Page Tables
  • Flush the Translation Lookaside Buffer (TLB)
  • Set up architecture-specific buses and components (companion chips, coprocessors, etc).

The one other thing this architecture-specific code does is set up the Interlocked API code so that NK.EXE knows where it is and can call it. This is a bit of an aside, but I will explain in detail because it is a critically important piece of the OS.

Even at the most basic level, Windows CE needs to coordinate actions among different threads of execution – even some that run inside the kernel, outside the scope of any specific process. The mechanism used to do this with the highest amount of efficiency is the Interlocked API. The API consists of a handful of functions, the most important of which is InterlockedCompareExchange(). The purpose of this function is to:

  1. Read a memory location (M) into register (R)
  2. Compare the value read (R) with a match value in another register (R2)
  3. If (R) and (R2) are not equal, exit
  4. Write the value of another register (R3) back to memory location (M)

These four steps are meant to execute atomically, and they form the basis of coordination between different threads. That is, there should be no interruption between each of (1), (2), (3) and (4). The only way to guarantee this on some of today’s processors where the operation is not available directly in hardware is to ensure interrupts are disabled. Herein lies a problem, since user-mode processes do not have sufficient privilege to disable interrupts, and it would be very inefficient to have to do a system call to the kernel and disable interrupts every time two threads wanted to coordinate with each other.
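For reference, the four steps look like this when written as ordinary C. This version is deliberately not atomic: it only pins down what the operation is supposed to do, using the same return convention as the Win32 InterlockedCompareExchange() call (the value that was read is returned). The function name is made up for the example.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Non-atomic reference version of compare-and-exchange.
 * Returns the value observed in *target before any write.
 * (Illustration only: nothing here prevents an interruption
 * between the read and the write.)
 */
long compare_exchange_reference(long *target, long exchange, long comparand)
{
    assert(target != NULL);
    long observed = *target;      /* (1) read M into R */
    if (observed == comparand)    /* (2), (3) compare R with R2; skip on mismatch */
        *target = exchange;       /* (4) write R3 back to M */
    return observed;
}
```

A caller can tell whether the exchange happened by comparing the return value with the comparand it passed in. The whole difficulty described above is making those few lines behave as if nothing could run between the read and the write.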

To be efficient, there is one single place in the entire system where the InterlockedCompareExchange() happens. The code for the four steps above is placed in the Kernel Data Page, at a particular location that is well known. Then the NK.EXE and KERNEL.DLL (and any process which has the Kernel Data Page mapped) can call the code, and the instructions all occur in the same place. This is done so that the API is restartable. What does this mean? Why do we do this?

Thread switches in an operating system can happen for three reasons:

  • It has been specifically requested by the executing thread
  • The thread’s time-slice has expired (noted by a timer interrupt event) and it is another thread’s turn to run.
  • Another type of interrupt occurs, which causes a situation where a thread of higher priority should execute.

The second two cases are really the same – an interrupt occurs that ultimately causes a thread switch. Since an interrupt can occur between any of the steps (1) to (4) and potentially switch out the thread, the operation we needed to be atomic might not be – some other thread might run in between (2) and (3), for example.

To ensure that the instructions (1) to (4) occur atomically, every time there is an interrupt a simple bounds check is made to see if the CPU was currently executing somewhere in (1) to (4). If the interrupt occurred when the CPU was executing after (1) and before (4), then the instruction pointer for the current thread is reset to point to instruction (1), so that the operation may be retried. In order for the interrupt code to be able to check if the CPU was executing in between (1) and (4), the code for it must be in a single known location. That location is inside the Kernel Data Page.
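The bounds check described above can be sketched as follows. The `thread_context` struct and the function name are hypothetical, and the sketch assumes the saved instruction pointer is the address of the next instruction to execute, with `seq_end` being the first address after the final store. Under that assumption, a saved pointer equal to `seq_end` means the store has already completed and nothing needs to be redone.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical saved CPU state of the interrupted thread. */
struct thread_context {
    uintptr_t pc;  /* next instruction to execute when the thread resumes */
};

/*
 * Called on every interrupt. If the thread was interrupted inside the
 * restartable sequence [seq_start, seq_end), rewind it to the first
 * instruction so that the whole sequence runs again from step (1).
 */
void restart_if_in_sequence(struct thread_context *ctx,
                            uintptr_t seq_start, uintptr_t seq_end)
{
    assert(ctx != NULL && seq_start < seq_end);
    if (ctx->pc >= seq_start && ctx->pc < seq_end)
        ctx->pc = seq_start;
}
```

Rewinding is safe because the sequence has no side effects until its final store. It also explains why there can be only one copy of the code: the check compares against a single pair of addresses.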

Once the Interlocked API code has been copied to the Kernel Data Page, the NK.EXE knows where it is and can coordinate actions with KERNEL.DLL when multiple threads become active – ultimately by using the Interlocked API.

Back to our main discussion: the next step in the KERNEL.DLL startup is the architecture-neutral setup. One of the first architecture-neutral steps is to check whether the OS image includes a KITL.DLL to allow communication with and debugging of the OS kernel.

KITL stands for “Kernel Independent Transport Layer”. This is basically a mechanism by which data ‘packets’ specific to the Windows CE system can be passed between the kernel of the device and Platform Builder running on the desktop. Usually, the portions of KITL which are implemented in NK.EXE purely revolve around the encoding for transport and the transport of the data packets. A Board Support Package (BSP) does not have to know anything about the data being sent and received between the device and the desktop – it just has to facilitate the correct transmission and reception. Mechanisms for transport of the KITL packets include but are not limited to RS232 Serial, Ethernet, and USB. A full description of KITL is beyond the scope of this blog article.

Other actions that happen during the architecture-neutral setup include:
  1. Initialize Kernel Debug Output (by calling OEMInitDebugSerial() through the function pointer in the OEMGlobals structure)
  2. Write a masthead debug string (“Windows CE Kernel Version xxxx”) to the debug output.
  3. Select the kernel processor type from the available options

When the architecture-neutral portions have been completed, we can do the platform-specific setup. This code lives in NK.EXE since it is OEM and board specific. To initialize this part, the kernel calls into OEMInit() through the function pointer that is in the OEMGlobals structure. OEMInit does board-specific initialization, and can do one other important thing – start KITL.

If KITL is built into the NK.EXE, then its functions are directly accessible from NK.EXE. If KITL is in a DLL, then that DLL will have been loaded by the kernel at the beginning of the architecture-neutral setup, as shown above. In either event, the OEMInit() function can call a Kernel IO control saying that KITL should be started. Based on whether the KITL.DLL was found or not, the kernel knows what to do.

Upon return from OEMInit(), the kernel is ready to start processes and threads to run. It synchronizes its cache, and then enters the processor architecture’s service mode if it is not already running in it. Then it does any one-time inits that do not require a current thread. These actions include:

  1. Enumerate available Memory (optional call to OEMEnumExtensionDRAM() )
  2. Initialize critical sections in the kernel (critical section code uses the Interlocked API, the setup of which was discussed above).
  3. Initialize heap structures
  4. Initialize process and thread tracking structures
  5. Any other actions done before multi-threading is enabled.

After all single-threaded initialization is done, the kernel is ready to schedule the first thread. This first thread is called “SystemStartupFunc()”, and lives in KERNEL.DLL. To start the thread, the kernel specifies that there is no current thread to switch from, sets the first thread as the only one available to run, then calls into the thread scheduler code. The scheduler code takes a look at all available threads and chooses the next one to run. At this point in startup we only have one thread that has been manually set up to run, so that one is the one that is switched to.

The SystemStartupFunc() function begins execution by flushing the system cache, then does things that require a ‘current’ thread to be running in order to happen. These actions include:
  1. Initialize the system loader
  2. Initialize the paging pool
  3. Initialize system logging
  4. Initialize system debugger

The SystemStartupFunc() will call one more OEM function before it completes initialization – it will call the OEMIoctl() function through the function pointer in the OEMGlobals, with an argument ‘IOCTL_HAL_POSTINIT’. This tells the NK.EXE that all system startup has completed and we are about to schedule threads and processes.

Upon exit from this first call to OEMIoctl(), the SystemStartupFunc() initializes the system message queue, any watchdogs, and then creates and starts the threads for the power manager and file system. Thus, the rest of the higher-level parts of the operating system begin to execute here. The last operation taken by the SystemStartupFunc() is to create another thread which executes the function “RunAppsAtStartup()”. This function creates the first user processes.

We are now at the point where the kernel, power manager, and file system are all executing, and the applications that the system registry lists to run at startup can begin executing.

This concludes the blog entry on how Windows Embedded CE 6.0 starts. The internals of Windows CE are quite interesting and very well structured, and the startup process described above gives insight into the most critical system components. In the future I hope to publish other articles on the internals of the system registry, the file system, and the device and power managers.

From:

http://blogs.msdn.com/b/ce_base/archive/2007/11/26/how-does-windows-embedded-ce-6.0-start_3f00_.aspx