Powerful x86/x64 Mini Hook-Engine

10 Apr 2008 CPOL
A powerful x86/x64 Hook-Engine

Introduction

I wrote this little hook-engine for a much bigger article. Sometimes it seems such a waste to write valuable code for large articles whose topic isn't directly related to the code. This often leads to the problem that the code won't be found by the people who are looking for it.

Personally, I would've used Microsoft's Detours hook engine, but the free license only applies to x86 applications, and that seemed a little bit too restrictive to me. So, I decided to write my own engine in order to support x64 as well. I've never downloaded Detours, nor have I ever seen its APIs, but from the general overview given by Microsoft, it's easy to guess how it works.

As I said, this is only a part of something bigger. It's not perfect, but it can easily be improved. Since this is not a beginner's guide about hooking, I assume that the reader already possesses the necessary knowledge to understand the material. If you've never heard about this subject, you'd better start with another article. There are plenty of guides out there, so there's no need to repeat the same things here.

As everybody knows, there's only one easy and secure way to hook a Win32 API: put an unconditional jump at the beginning of its code to redirect it to our hook function. By secure I just mean that our hook can't be bypassed. Of course, there are other ways, but they're either complicated or insane or both. A proxy DLL, for instance, might work in some cases, but it's rather insane for system DLLs. Overwriting the IAT is insecure for two reasons:

1. The program might use GetProcAddress to retrieve the address of an API (and in that case we would have to hook GetProcAddress as well).
2. It's not always possible: there are many cases, such as packed programs, where the IAT gets built by the protection code and not by the Windows loader.

Ok, I guess you're convinced. Let's just say that there's a reason why Microsoft uses the method presented in this article.

How It Works

A common technique used in combination with the unconditional jump is:

This approach may seem unsafe in a multi-threaded environment, and it is. It might work, but our technique is much more powerful. It's nothing new: we put our unconditional jump at the beginning of the code we want to hook and copy the original instructions of the API elsewhere in memory (the bridge). When the hooked function jumps to our code, we can call the bridge we created, which, after executing those first instructions, jumps to the API code that follows our unconditional jump:

Let's make a real world example. If the first instructions of the function/API we want to hook are:

mov edi, edi
push ebp
mov ebp, esp
xor ecx, ecx

They will be replaced by our:

00400000 jmp our_code
00400005 xor ecx, ecx

Our bridge will look like this:

mov edi, edi
push ebp
mov ebp, esp
jmp 00400005
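
As a rough sketch of the rule illustrated above (not the engine's actual code), the number of bytes moved into the bridge is the sum of whole instruction lengths, accumulated until the jump fits. The name `BytesToRelocate` and the hard-coded instruction lengths are assumptions made for illustration only.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: returns how many bytes of whole instructions must be
// copied to the bridge so that a jump of jumpSize bytes fits, or 0 if the
// given instructions don't cover enough bytes.
std::size_t BytesToRelocate(const unsigned char *instrSizes, std::size_t count,
                            std::size_t jumpSize)
{
    assert(instrSizes != nullptr || count == 0);

    std::size_t total = 0;
    for (std::size_t i = 0; i < count; i++)
    {
        if (total >= jumpSize)
            break;                  // the jump already fits
        total += instrSizes[i];     // an instruction is never split
    }
    return total >= jumpSize ? total : 0;
}
```

For the example above the lengths are 2 (mov edi, edi), 1 (push ebp), 2 (mov ebp, esp) and 2 (xor ecx, ecx), so a 5-byte jump relocates exactly the first three instructions and the bridge jumps back to the function start + 5.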

Of course, to know the size of the instructions we're going to replace, we need a disassembler both for x86 and x64. I searched on Google for an x64 disassembler and found the diStorm64 disassembler. I quote from its homepage:

diStorm64 is a professional quality open source disassembler library for AMD64, licensed under the BSD license.

diStorm is a binary stream disassembler. It's capable of disassembling 80x86 instructions in 64 bits (AMD64, X86-64) and both in 16 and 32 bits. In addition, it disassembles FPU, MMX, SSE, SSE2, SSE3, SSSE3, SSE4, 3DNow! (w/ extensions), new x86-64 instruction sets, VMX, and AMD's SVM! diStorm was written to decode quickly every instruction as accurately as possible. Robust decoding, while taking special care for valid or unused prefixes, is what makes this disassembler powerful, especially for research. Another benefit that might come in handy is that the module was written as multi-threaded, which means you could disassemble several streams or more simultaneously.
For rapidly use, diStorm is compiled for Python and is easily used in C as well. diStorm was originally written under Windows and ported later to Linux and Mac. The source code is portable and platform independent (supports both little and big endianity).
It also can be used as a ring0 disassembler (tested as a kernel driver using the DDK under Windows)!

This sounded pretty good to me. Now that we have our disassembler, we can start!

The first thing I wanted to know was whether it was possible to create bridges without having to relocate jumps. As the reader knows, jumps usually take a relative address as operand, not an absolute one. This means I can't relocate a jump without recalculating its relative address. I also wanted to test whether this disassembler really worked fine. So, I wrote a little program which logs, for every exported function in a DLL, the instructions that are going to be overwritten by an unconditional jump. Here's the code:

#include "stdafx.h"
#include "distorm.h"
#include <stdlib.h>
#include <stdio.h>
#include <Windows.h>

VOID AddFunctionToLog(FILE *Log, BYTE *FileBuf, DWORD FuncRVA);
VOID GetInstructionString(char *Str, _DecodedInst *Instr);
DWORD RvaToOffset(IMAGE_NT_HEADERS *NT, DWORD Rva);

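int _tmain(int argc, _TCHAR* argv[])
{
    // Both arguments are required: argv[1] = PE file, argv[2] = log file
    if (argc < 3) return 0;

    //
    // Open log file
    //
    FILE *Log = NULL;

    if (_tfopen_s(&Log, argv[2], _T("w")) != 0)
        return 0;

    //
    // Open PE file
    // (first arguments assumed: read-only open of the file in argv[1])
    //
    HANDLE hFile = CreateFile(argv[1], GENERIC_READ, FILE_SHARE_READ, NULL,
        OPEN_EXISTING, 0, NULL);

    if (hFile == INVALID_HANDLE_VALUE)
    {
        fclose(Log);
        return 0;
    }

    DWORD FileSize = GetFileSize(hFile, NULL);
    BYTE *FileBuf = new BYTE [FileSize];
    DWORD BRW;

    // Read the whole file into memory (call assumed from context)
    if (FileBuf)
        ReadFile(hFile, FileBuf, FileSize, &BRW, NULL);

    CloseHandle(hFile);

    // Header pointers (declarations assumed from context)
    IMAGE_DOS_HEADER *pDosHeader = (IMAGE_DOS_HEADER *) FileBuf;
    IMAGE_NT_HEADERS *pNtHeaders = (IMAGE_NT_HEADERS *) ((FileBuf != NULL ?
        pDosHeader->e_lfanew : 0) + (ULONG_PTR) FileBuf);

    // The last two conditions are assumed: valid NT signature and
    // a non-empty export directory
    if (!FileBuf || pDosHeader->e_magic != IMAGE_DOS_SIGNATURE ||
        pNtHeaders->Signature != IMAGE_NT_SIGNATURE ||
        pNtHeaders->OptionalHeader.DataDirectory
            [IMAGE_DIRECTORY_ENTRY_EXPORT].VirtualAddress == 0)
    {
        fclose(Log);
        if (FileBuf)
            delete [] FileBuf;
        return 0;
    }

    //
    // Walk through export dir's functions
    // (RvaToOffset arguments assumed from context)
    //
    IMAGE_EXPORT_DIRECTORY *pExportDir = (IMAGE_EXPORT_DIRECTORY *)
        (RvaToOffset(pNtHeaders, pNtHeaders->OptionalHeader.DataDirectory
            [IMAGE_DIRECTORY_ENTRY_EXPORT].VirtualAddress) +
            (ULONG_PTR) FileBuf);

    DWORD *pFunctions = (DWORD *) (RvaToOffset(pNtHeaders,
        pExportDir->AddressOfFunctions) + (ULONG_PTR) FileBuf);

    for (DWORD x = 0; x < pExportDir->NumberOfFunctions; x++)
    {
        if (pFunctions[x] == 0) continue;

        AddFunctionToLog(Log, FileBuf, pFunctions[x]);
    }

    fclose(Log);
    delete [] FileBuf;
    return 0;
}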

//
// This function adds to the log the instructions
// at the beginning of each function which are going
// to be overwritten by the hook jump
//
VOID AddFunctionToLog(FILE *Log, BYTE *FileBuf, DWORD FuncRVA)
{
#define MAX_INSTRUCTIONS 100

    // Locate the NT headers (declaration assumed from context)
    IMAGE_NT_HEADERS *pNtHeaders = (IMAGE_NT_HEADERS *)
        ((*(IMAGE_DOS_HEADER *) FileBuf).e_lfanew + (ULONG_PTR) FileBuf);

    _DecodeResult res;
    _DecodedInst decodedInstructions[MAX_INSTRUCTIONS];
    unsigned int decodedInstructionsCount = 0;

#ifdef _M_IX86
    _DecodeType dt = Decode32Bits;
#define JUMP_SIZE 10 // worst case scenario
#elif defined(_M_AMD64)
    _DecodeType dt = Decode64Bits;
#define JUMP_SIZE 14 // worst case scenario
#endif

    _OffsetType offset = 0;

    res = distorm_decode(offset,    // offset for buffer, e.g. 0x00400000
        // buffer to disassemble (argument assumed from context)
        (const BYTE *) &FileBuf[RvaToOffset(pNtHeaders, FuncRVA)],
        50,                         // function size (code size to disasm)
        dt,                         // x86 or x64?
        decodedInstructions,        // decoded instr
        MAX_INSTRUCTIONS,           // array size
        &decodedInstructionsCount   // how many instr were disassembled?
        );

    if (res == DECRES_INPUTERR)
        return;

    DWORD InstrSize = 0;

    for (UINT x = 0; x < decodedInstructionsCount; x++)
    {
        // Stop as soon as the logged instructions cover the jump
        if (InstrSize >= JUMP_SIZE)
            break;

        InstrSize += decodedInstructions[x].size;

        char Instr[100];
        GetInstructionString(Instr, &decodedInstructions[x]);
        fprintf(Log, "%s\n", Instr);
    }

    fprintf(Log, "\n\n\n");
}

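VOID GetInstructionString(char *Str, _DecodedInst *Instr)
{
    wsprintfA(Str, "%s %s", Instr->mnemonic.p, Instr->operands.p);
    _strlwr_s(Str, 100);
}

// Converts an RVA into a file offset by walking the section table
// (signature and the lines marked "assumed" are restored from context)
DWORD RvaToOffset(IMAGE_NT_HEADERS *NT, DWORD Rva)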

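{
    DWORD Offset = Rva, Limit;
    IMAGE_SECTION_HEADER *Img;      // assumed declaration
    WORD i;

    Img = IMAGE_FIRST_SECTION(NT);

    // RVAs inside the headers map 1:1 to file offsets
    if (Rva < Img->PointerToRawData)
        return Rva;

    for (i = 0; i < NT->FileHeader.NumberOfSections; i++)
    {
        if (Img[i].SizeOfRawData)
            Limit = Img[i].SizeOfRawData;
        else
            Limit = Img[i].Misc.VirtualSize;

        // assumed: the RVA must fall inside this section
        if (Rva >= Img[i].VirtualAddress &&
            Rva < (Img[i].VirtualAddress + Limit))
        {
            if (Img[i].PointerToRawData != 0)
            {
                Offset -= Img[i].VirtualAddress;    // assumed
                Offset += Img[i].PointerToRawData;
            }
            return Offset;
        }
    }

    return 0;
}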

The command line syntax is: pefile logfile (e.g. disasmtest ntdll.dll ntdll.log). As you can see, I reserved 10 bytes for x86 hooks. It's possible to use 5-byte relative jumps on both x86 and x64, but then it's necessary to check that there's less than 2 GB between the original function and our code, and between the bridge and the original function. We'd have to check that on x86 as well, although there the condition is very likely to hold. The worst case scenario for both x86 and x64 is this absolute jump:

jmp [xxxxx]
xxxxx: absolute address (DWORD on x86 and QWORD on x64)
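
To make the byte counts concrete, here is a minimal sketch of one encoding consistent with the jump above: `FF 25` followed by a 32-bit operand and then the pointer itself. The function names `FitsRel32` and `EncodeAbsoluteJump` are hypothetical, and this is not the engine's own WriteJump, whose source isn't shown here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// True if a 5-byte "jmp rel32" placed at 'from' could reach 'to'.
bool FitsRel32(std::uint64_t from, std::uint64_t to)
{
    // the displacement is relative to the end of the 5-byte instruction
    std::int64_t disp = static_cast<std::int64_t>(to - (from + 5));
    return disp >= INT32_MIN && disp <= INT32_MAX;
}

// Stores the low 'n' bytes of 'value' in little-endian order.
static void PutLE(std::uint8_t *out, std::uint64_t value, int n)
{
    for (int i = 0; i < n; i++)
        out[i] = static_cast<std::uint8_t>(value >> (8 * i));
}

// Writes "jmp [mem]" followed by the absolute target into 'out'.
// 'at' is the address where the bytes will finally live; it matters on x86,
// where the operand is the absolute address of the pointer slot.
// Returns the number of bytes written: 10 for x86, 14 for x64.
std::size_t EncodeAbsoluteJump(std::uint8_t *out, std::uint64_t at,
                               std::uint64_t target, bool is64)
{
    assert(out != nullptr);

    out[0] = 0xFF;                  // opcode FF /4: jmp r/m
    out[1] = 0x25;                  // ModRM: disp32 (x86), RIP + disp32 (x64)

    if (is64)
    {
        PutLE(out + 2, 0, 4);       // RIP-relative 0: slot follows directly
        PutLE(out + 6, target, 8);  // QWORD target
        return 14;
    }

    PutLE(out + 2, at + 6, 4);      // absolute address of the DWORD below
    PutLE(out + 6, target, 4);      // DWORD target
    return 10;
}
```

The 2 opcode bytes, the 4-byte operand and the 4-byte or 8-byte pointer are where the totals come from.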

This means we'd have a worst case scenario of 10 bytes on x86 and of 14 bytes on x64. In this hook engine, I'm using only worst case scenarios (no 5-byte relative jumps), simply because if the distance between the original function and the hook, or between the original function and the bridge, is greater than 2 GB, I would have to recreate the bridge from scratch every time I hook/unhook the function. A professional engine should do this (and it's not much work), but I'll keep it simple (for me) and use only absolute jumps.

As for the results of the little program above, I created logs for ntdll.dll and advapi32.dll, both for x86 and x64. Here, for instance, is a small part of the ntdll.dll x86 log:

mov eax, 0x44
mov edx, 0x7ffe0300

mov eax, 0x45
mov edx, 0x7ffe0300

mov eax, 0x46
mov edx, 0x7ffe0300

mov eax, 0x47
mov edx, 0x7ffe0300

mov eax, 0x48
mov edx, 0x7ffe0300

mov eax, 0x49
mov edx, 0x7ffe0300

mov eax, 0x4a
mov edx, 0x7ffe0300

mov eax, 0x4b
mov edx, 0x7ffe0300

mov eax, 0x4c
mov edx, 0x7ffe0300

This is of course pretty encouraging, but let's see the results for the x64 platform.

sub rsp, 0x48
mov rax, [rsp+0x78]
mov byte [rsp+0x30], 0x0

mov [rsp+0x10], rbx
mov [rsp+0x18], rbp
mov [rsp+0x20], rsi

push rsi
push r14
push r15
sub rsp, 0x480

mov rax, rsp
mov [rax+0x8], rbx
mov [rax+0x10], rsi
mov [rax+0x18], r12

sub rsp, 0x38
mov [rsp+0x20], r8
mov r9d, edx
mov r8, rcx

mov rax, rsp
mov [rax+0x8], rsi
mov [rax+0x10], rdi
mov [rax+0x18], r12

mov [rsp+0x10], rbx
mov [rsp+0x18], rsi
push rdi
push r12

sub rsp, 0x68
mov rax, r9
mov r9d, [rsp+0xb0]

But what about the functions which just issue a syscall after moving a number into a register, like NtCreateProcess, NtOpenKey, etc.? These functions have very few instructions, and our 14-byte jump will overwrite more code than the function itself contains. That doesn't seem to be a problem, though: as we can see from the disassembly, these functions are aligned on 16-byte boundaries, so we won't overwrite another function's code anyway.
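
The alignment argument boils down to a little arithmetic. The sketch below assumes that each short stub starts on an aligned boundary and that everything up to the next boundary is either its own code or padding; `OverwriteStaysInSlot` is a hypothetical name, not part of the engine.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Returns true if writing jumpSize bytes at funcAddr stays inside the
// aligned slot that funcAddr belongs to, i.e. the write cannot reach
// the next function.
bool OverwriteStaysInSlot(std::uint64_t funcAddr, std::size_t jumpSize,
                          std::size_t alignment = 16)
{
    // the alignment must be a power of two for the mask below to work
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);

    std::uint64_t offsetInSlot = funcAddr & (alignment - 1);
    return offsetInSlot + jumpSize <= alignment;
}
```

With 16-byte alignment, both the 10-byte and the 14-byte jump fit when the function starts exactly on a boundary, which is the situation described above.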

Here's the main code of the hook engine (the whole engine is about 300 lines):

//
// This function creates a bridge of the original function
//
VOID *CreateBridge(ULONG_PTR Function, const UINT JumpSize)
{
    if (pBridgeBuffer == NULL) return NULL;

#define MAX_INSTRUCTIONS 100

    _DecodeResult res;
    _DecodedInst decodedInstructions[MAX_INSTRUCTIONS];
    unsigned int decodedInstructionsCount = 0;

#ifdef _M_IX86
    _DecodeType dt = Decode32Bits;
#elif defined(_M_AMD64)
    _DecodeType dt = Decode64Bits;
#endif

    _OffsetType offset = 0;

    res = distorm_decode(offset,    // offset for buffer
        (const BYTE *) Function,    // buffer to disassemble
        50,                         // function size (code size to disasm)
                                    // 50 bytes should be _quite_ enough
        dt,                         // x86 or x64?
        decodedInstructions,        // decoded instr
        MAX_INSTRUCTIONS,           // array size
        &decodedInstructionsCount   // how many instr were disassembled?
        );

    if (res == DECRES_INPUTERR)
        return NULL;

    DWORD InstrSize = 0;
    VOID *pBridge = (VOID *) &pBridgeBuffer[CurrentBridgeBufferSize];

    for (UINT x = 0; x < decodedInstructionsCount; x++)
    {
        // Stop once the copied instructions cover the jump
        if (InstrSize >= JumpSize)
            break;

        BYTE *pCurInstr = (BYTE *) (InstrSize + (ULONG_PTR) Function);

        //
        // This is a sample attempt at handling a jump
        // It works, but it converts the jz to jmp
        // since I didn't write the code for writing
        // conditional jumps
        //
        /*
        if (*pCurInstr == 0x74) // jz short
        {
            // the 8-bit displacement is relative to the
            // end of the 2-byte instruction, hence the + 2
            ULONG_PTR Dest = (InstrSize + (ULONG_PTR) Function)
                + 2 + (char) pCurInstr[1];

            WriteJump(&pBridgeBuffer[CurrentBridgeBufferSize], Dest);

            CurrentBridgeBufferSize += JumpSize;
        }
        else
        {*/
            memcpy(&pBridgeBuffer[CurrentBridgeBufferSize],
                (VOID *) pCurInstr, decodedInstructions[x].size);

            CurrentBridgeBufferSize += decodedInstructions[x].size;
        //}

        InstrSize += decodedInstructions[x].size;
    }

    // Jump from the end of the bridge back into the original function
    WriteJump(&pBridgeBuffer[CurrentBridgeBufferSize], Function + InstrSize);
    CurrentBridgeBufferSize += GetJumpSize((ULONG_PTR)
        &pBridgeBuffer[CurrentBridgeBufferSize],
        Function + InstrSize);

    return pBridge;
}

//
// Hooks a function
//
extern "C" __declspec(dllexport)
BOOL __cdecl HookFunction(ULONG_PTR OriginalFunction, ULONG_PTR NewFunction)
{
    //
    // Check if the function has already been hooked
    // If so, no disassembling is necessary since we already
    // have our bridge
    //
    HOOK_INFO *hinfo = GetHookInfoFromFunction(OriginalFunction);

    if (hinfo)
    {
        WriteJump((VOID *) OriginalFunction, NewFunction);
    }
    else
    {
        if (NumberOfHooks == (MAX_HOOKS - 1))
            return FALSE;

        VOID *pBridge = CreateBridge(OriginalFunction,
            GetJumpSize(OriginalFunction, NewFunction));

        if (pBridge == NULL)
            return FALSE;

        HookInfo[NumberOfHooks].Function = OriginalFunction;
        HookInfo[NumberOfHooks].Bridge = (ULONG_PTR) pBridge;
        HookInfo[NumberOfHooks].Hook = NewFunction;

        NumberOfHooks++;

        WriteJump((VOID *) OriginalFunction, NewFunction);
    }

    return TRUE;
}

//
// Unhooks a function
//
extern "C" __declspec(dllexport)
VOID __cdecl UnhookFunction(ULONG_PTR Function)
{
    //
    // Check if the function has already been hooked
    // If not, I can't unhook it
    //
    HOOK_INFO *hinfo = GetHookInfoFromFunction(Function);

    if (hinfo)
    {
        //
        // I'm not completely unhooking since I'm not
        // restoring the original bytes
        //
        WriteJump((VOID *) hinfo->Function, hinfo->Bridge);
    }
}

//
// Get the bridge to call instead of the original function from hook
//
extern "C" __declspec(dllexport)
ULONG_PTR __cdecl GetOriginalFunction(ULONG_PTR Hook)
{
    if (NumberOfHooks == 0)
        return NULL;

    for (UINT x = 0; x < NumberOfHooks; x++)
    {
        if (HookInfo[x].Hook == Hook)
            return HookInfo[x].Bridge;
    }

    return NULL;
}

I implemented it as a DLL (but you can include it in your code as well).
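
The listing relies on a HOOK_INFO table and on GetHookInfoFromFunction, neither of which is shown. A minimal sketch of what they might look like follows; the field names come from the code above, while the layout, the value of MAX_HOOKS and the helper `AddHookRecord` are assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Field names follow the listing; the layout itself is assumed.
struct HOOK_INFO
{
    std::uintptr_t Function;    // address of the hooked function
    std::uintptr_t Hook;        // address of the replacement function
    std::uintptr_t Bridge;      // address of the bridge to the original code
};

const std::size_t MAX_HOOKS = 100;  // assumed value

HOOK_INFO HookInfo[MAX_HOOKS];
std::size_t NumberOfHooks = 0;

// Linear search by original function address; nullptr if not hooked yet.
HOOK_INFO *GetHookInfoFromFunction(std::uintptr_t OriginalFunction)
{
    assert(NumberOfHooks < MAX_HOOKS);

    for (std::size_t x = 0; x < NumberOfHooks; x++)
    {
        if (HookInfo[x].Function == OriginalFunction)
            return &HookInfo[x];
    }
    return nullptr;
}

// Hypothetical helper mirroring the bookkeeping done inside HookFunction.
bool AddHookRecord(std::uintptr_t Function, std::uintptr_t Hook,
                   std::uintptr_t Bridge)
{
    if (NumberOfHooks == MAX_HOOKS - 1) // same limit check as HookFunction
        return false;

    HookInfo[NumberOfHooks].Function = Function;
    HookInfo[NumberOfHooks].Hook = Hook;
    HookInfo[NumberOfHooks].Bridge = Bridge;
    NumberOfHooks++;
    return true;
}
```

A fixed-size array with a linear search is enough for a handful of hooks; nothing in this sketch is thread-safe.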

Using the Code

Using the code is very simple. Basically, the DLL only exports three functions: one to hook, another to unhook and the last to get the address of the bridge of the hooked function. Of course, we need to retrieve the address of the bridge, otherwise we can't call the original code of the hooked function.

Let's see a little code sample which works both on x86 and x64:

#include "stdafx.h"
#include "NtHookEngine_Test.h"

BOOL (__cdecl *HookFunction)(ULONG_PTR OriginalFunction,
    ULONG_PTR NewFunction);
VOID (__cdecl *UnhookFunction)(ULONG_PTR Function);
ULONG_PTR (__cdecl *GetOriginalFunction)(ULONG_PTR Hook);

int WINAPI MyMessageBoxW(HWND hWnd, LPCWSTR lpText, LPCWSTR lpCaption,
    UINT uType, WORD wLanguageId, DWORD dwMilliseconds);

int APIENTRY _tWinMain(HINSTANCE hInstance, HINSTANCE hPrevInstance,
    LPTSTR lpCmdLine, int nCmdShow)
{
    //
    // Retrieve hook functions
    // (the DLL file name is assumed from the project name, and the
    // GetProcAddress calls are restored from context)
    //
    HMODULE hHookEngineDll = LoadLibrary(_T("NtHookEngine.dll"));

    HookFunction = (BOOL (__cdecl *)(ULONG_PTR, ULONG_PTR))
        GetProcAddress(hHookEngineDll, "HookFunction");
    UnhookFunction = (VOID (__cdecl *)(ULONG_PTR))
        GetProcAddress(hHookEngineDll, "UnhookFunction");
    GetOriginalFunction = (ULONG_PTR (__cdecl *)(ULONG_PTR))
        GetProcAddress(hHookEngineDll, "GetOriginalFunction");

    if (HookFunction == NULL || UnhookFunction == NULL ||
        GetOriginalFunction == NULL)
        return 0;

    //
    // Hook MessageBoxTimeoutW
    // (first line of the call assumed; the API lives in User32.dll)
    //
    HookFunction((ULONG_PTR) GetProcAddress(LoadLibrary(_T("User32.dll")),
        "MessageBoxTimeoutW"),
        (ULONG_PTR) &MyMessageBoxW);

    MessageBox(0, _T("Hi, this is a message box!"), _T("This is the title."),
        MB_ICONINFORMATION);

    //
    // Unhook MessageBoxTimeoutW
    // (first line of the call assumed)
    //
    UnhookFunction((ULONG_PTR) GetProcAddress(LoadLibrary(_T("User32.dll")),
        "MessageBoxTimeoutW"));

    MessageBox(0, _T("Hi, this is a message box!"), _T("This is the title."),
        MB_ICONINFORMATION);

    return 0;
}

int WINAPI MyMessageBoxW(HWND hWnd, LPCWSTR lpText, LPCWSTR lpCaption,
    UINT uType, WORD wLanguageId, DWORD dwMilliseconds)
{
    int (WINAPI *pMessageBoxW)(HWND hWnd, LPCWSTR lpText,
        LPCWSTR lpCaption, UINT uType, WORD wLanguageId,
        DWORD dwMilliseconds);

    // The bridge lets us reach the original, unhooked code
    pMessageBoxW = (int (WINAPI *)(HWND, LPCWSTR, LPCWSTR, UINT, WORD, DWORD))
        GetOriginalFunction((ULONG_PTR) MyMessageBoxW);

    return pMessageBoxW(hWnd, lpText, L"Hooked MessageBox",
        uType, wLanguageId, dwMilliseconds);
}

In this sample I'm hooking the API MessageBoxTimeoutW. I tried to hook MessageBoxW and that worked fine on x86, then I tried on x64 and the code generated an exception. So, I disassembled the MessageBoxW function on x64:

Unfortunately, as you can notice, the first instructions of this API include a jz which is going to be overwritten by our unconditional jump. And since we don't relocate jumps in our bridge, we can't hook this function. So, I had to hook the function MessageBoxTimeoutW, which is called inside MessageBoxW and has no jumps at the beginning.
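
To give an idea of what relocating such a jump would involve, here is a minimal sketch that is not part of the engine. It decodes the destination of a 2-byte `jz rel8` (opcode 0x74) and re-encodes it as the 6-byte `jz rel32` form (0x0F 0x84), which only works while the destination stays within 2 GB of the bridge. Both function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Destination of a short jz located at instrAddr. The 8-bit displacement is
// signed and relative to the end of the 2-byte instruction.
std::uint64_t ShortJzDestination(std::uint64_t instrAddr,
                                 const std::uint8_t *instr)
{
    assert(instr != nullptr && instr[0] == 0x74);
    return instrAddr + 2 + static_cast<std::int8_t>(instr[1]);
}

// Writes "jz rel32" at 'out' so that, once placed at newAddr, it jumps to
// 'dest'. Returns the bytes written (6), or 0 if 'dest' is out of range.
std::size_t RelocateShortJz(std::uint8_t *out, std::uint64_t newAddr,
                            std::uint64_t dest)
{
    assert(out != nullptr);

    // rel32 is relative to the end of the 6-byte instruction
    std::int64_t disp = static_cast<std::int64_t>(dest - (newAddr + 6));
    if (disp < INT32_MIN || disp > INT32_MAX)
        return 0;

    out[0] = 0x0F;
    out[1] = 0x84;

    std::uint32_t d = static_cast<std::uint32_t>(
        static_cast<std::int32_t>(disp));
    for (int i = 0; i < 4; i++)
        out[2 + i] = static_cast<std::uint8_t>(d >> (8 * i)); // little-endian

    return 6;
}
```

Note that the relocated instruction is longer than the original, so a bridge built this way needs more room than the bytes it replaces. On x64 a destination further than 2 GB away would need a different sequence altogether.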

In the code example I first hook the function and call it, then I unhook it and call it again. So, the output will be:

That's all. Of course, this code works only if MessageBoxTimeoutW is available. I'm not completely sure about when it was first introduced, since it's an undocumented API. I guess it has been introduced with XP, so chances are that this particular hook won't work on Windows 2000.

Conclusions

As the previous example shows, the hook engine isn't perfect, but it can easily be improved. I'm not developing it further because I don't need a more powerful one (right now, I mean). I just needed an x86/x64 hook engine with no license restrictions. I wrote this engine and the article in just one day; it really wasn't much work. Most of the work in such a hook engine is writing the disassembler, which I didn't do. So, in my opinion, it doesn't make much sense to pay for a hook engine. The only thing which I really can't provide in this engine is support for Itanium. That's because I don't have a disassembler for this platform. But I would rather write one myself than buy a hook engine. I might actually add an Itanium disassembler in the future, who knows...

I hope you can find this code useful.