Semantics-Aware Malware Detection

Notes from: "Semantics-Aware Malware Detection"
	    [Christodorescu, Jha, Seshia, Song, Bryant]
--------
Problem:
--------
 - old trend in malware: create a supervirus, infect millions with it

 - new trend in malware: create a virus, create several derivatives of
   that virus (each of which would require a distinct signature for AV
   scanners) which differ from the base virus only slightly, have each 
   derivative infect smaller numbers of (possibly just local) hosts

   --> so, how to make virus signatures for the main base virus that also
       "catch" these derivatives, rather than having to have a sample of each
       derivative and having to have even more bloated signature databases

------------------
malware detector : system which attempts to detect whether a program has
------------------ malicious intent

 - evade detection via obfuscating the malware

 - malware detectors use pattern matching are susceptible to obfuscation

   -- a purely syntactic approach

      - match some sequence of instructions (perhaps using regular expressions)

      - but not matching based on what code DOES, matching based on
        particular instruction sequences in code

----------------
Their approach :
----------------
 - malware detection algo incorporates instruction semantics : seeking to 
   match based on malicious program traits

 - so theoretically would catch derivative programs which have the same
   functionality as the original

========
I. Intro
========
malware instance : program that has malicious intent

 - e.g. viruses, trojans, worms

 - can classify malware w.r.t. its propagation method and goal

malware detector : system that attempts to identify malware

 - e.g. a virus scanner : uses sigs and other heuristics to identify malware

goal of malware writer : modify or morph malware so that it evades detection

 - ideally do this derivative/variant-generation automatically

 - e.g. evade detection via program obfuscation

   -- e.g. virus encrypts its malicious payload and decrypts it during exec

   -- e.g. polymorphism and metamorphism

   -- polymorphic virus : obfuscates decryption loop using transformations

      -- sample transformations :

	 -- nop insertion

	 -- code transposition (change order of instructions and put jumps
	    in in order to retain original semantics)

	 -- register reassignment (permute register allocation)

   -- metamorphic virus : obfuscate entire virus

      -- on replication, change code in variety of ways

	 -- e.g. code transposition, substitution of equivalent
            instruction sequences, change of conditional jumps,
	    register reassignment

 - another (obfuscation) technique : add new behaviors to existing malware

   -- by adding new behavior to base code, able to evade detection even though 
      base code would have been detected

   -- cf. Sobig.A through Sobig.F (worms), Netsky, Beagle (from 2004) to present

 - commercial malware detectors : use simple pattern matching approach

   -- match : program contains a sequence of instructions that is matched
	      by a regular expression

   -- basic deficiency : ignore SEMANTICs of these instructions

   -- not resilient to slight variations

      -- so must use different patterns for detecting two malware instances
         that are slight variations of each other

 - suppose program P transformed by a compiler phase to produce program P^t
   (maybe that compiler phase consisted of register allocation)

   - translation-validation problem :

     -- determine whether P^t simulates P

     -- used to prove that various compiler phases preserve semantics of P

   - view hacker obfuscations as compiler phases

     -- then malware detection problem seems similar to translation-validation 
	problem

	- after hacker obfuscation produces P' out of P (where P is a
          known bad program), does P' have same semantics as P?

     -- but hacker obfuscations will produce P' such that P' has *different*
        semantics than P (added functionality, e.g.)

	- so P and P' are not equivalent

     -- plus, translation-validation makes assumptions a/b transformed program :

	-- e.g. all branches in P' must match branches in P

	   - this does not necessarily hold for hacker transformations

     -- all the same, some useful ideas from translation-validation : 
	(1) modeling semantics of instructions, (2) use decision procedures
     
 - observe that certain malicious behaviors appear in all variants of a
   certain malware (e.g. the decryption loop in a polymorphic virus)
	
   -- but decryption loop, e.g., might appear in different form in variants

 - algo for discovering specified malicious behaviors in a given program

 - formal semantics : define problem of determining whether a program
		      exhibits a specified malicious behavior

   -- in general, undecidable

 - semantics-aware malware detection algo : 

   -- handle limited set of transformations

   -- detects multiple variants with single template

   -- more resilient than commercial AV scanners [claim]

==================================
II. Semantics of Malware Detection
==================================
(1) Specify malicious behavior : via templates

 Template T : instruction sequence (I_T) which uses variables (V_T) and constants (C_T)
 ----------------
 Sample template:
 ----------------
   (1)	A = const_addr1      // encrypted memory starts here
   (2)	B = const_addr2      // location to put decrypted memory
   (3)	condition(A)?	TRUE : goto (7), FALSE : goto (4)
   (4)	mem[B] = f(mem[A])
   (5)	A = A + c
   (6)	B = B + d
	goto (3)
   (7)	jump const_addr2

 variables, V_T : { A, B }

 symbolic constants, C_T : { const_addr1, const_addr2, c, d, f(x), p(x) }

  -- symbolic constants : either n-ary function, F(n), or n-ary predicate, P(n)
  -- actual constants (e.g. c, d) are 0-ary functions
  -- n-ary function : takes n arguments, returns a value
  -- n-ary predicate : takes n arguments, returns a boolean

  -- c : TAU	     <==>    the type of symbolic constant c is TAU

 ==> resilient to obfuscations : 
     -- register renaming (since use vars and constants)
     -- changing starting addy of a memory block (since use vars)

 ==> When does an instruction sequence contain a malicious behavior?
     -- if template and instruction sequence have the same effect on memory
     -- IOW, the malicious behavior specified by the template is
	demonstrated by the instruction sequence

 ==> also resilient to insertion of irrelevant instruction sequences

 -----------------------------------------------------------------
 Sample malware instance (instruction sequence) of above template:
 -----------------------------------------------------------------
   (1) eax = 0x403000
   (2) ebx = 0x400000
   (3) edx = eax + 3
   (4) eax >= 406000 ?	TRUE : goto (9), FALSE : goto (5)
   (5) mem[ebx] = mem[edx - 3] << 2 + 1
   (6) eax = eax + 4
   (7) edx = edx + 4
   (8) ebx = ebx + 4
       goto (4)
   (9) jump 0x400000

   - edx above is totally unnecessary, could have written :

   (1) eax =  0x403000
   (2) ebx =  0x400000
   (3) eax >= 0x406000 ?	TRUE : goto (7), FALSE : goto (4)
   (4) mem[ebx] = mem[eax] << 2 + 1
   (5) eax = eax + 4
   (6) ebx = ebx + 4
       goto (3)
   (7) jump 0x400000

   - but this second instance is a more obvious match to the given template

   - so provided first instance to illustrate that even when additional
     registers are included, that the first instance matches would be caught

     --> some resilience based on semantic matching to hacker obfuscation

 - execution context : an assignment of values (of appropriate types) to
		       the set of symbolic constants, C_T

   -- e.g. const_addr1  <-- 0x403000
	   const_addr2  <-- 0x400000
	   condition(X) <-- X >= 0x406000
	   f(X)         <-- X << 2 + 1
	   c	        <-- 4
	   d		<-- 4

   -- EC_T : execution context for a template T

      EC_T( c ) : where c is a constant, returns the value of that constant

   -- EC_T : is a function with domain C_T such that 

      for all c in C_T, the type of c and the type of EC_T(c) are the same

 - EC_T( T ) : the template obtained by replacing every constant c in C_T by EC_T(c)

 - state s^T for the template T is a 3-tuple denoted : ( val^T, pc^T, mem^T )

   -- val^T : V_T --> Values, is an assignment of values to the variables in V_T

   -- pc^T : value for the program counter

   -- mem^T : Addr --> Values, gives the memory content

 - a state for the instruction sequence I is a 3-tuple (val, pc, mem)

   -- val : Reg --> Values, is an assignment of values to the set of registers

   -- pc : a value for the PC

   -- mem : Addr --> Values, gives memory contents

 - S^T is state space for template; S is state space for instruction sequence

 - given EC_T for a template T and template is in state s^T

 - if execute instruction i from the template EC_T(T) in state s^T, we
   transition to a new state s_1^T and generate a system event e (event
   may be sys call or kernel trap, e.g.); denote this via :

   s^T ---[e]---> s_1^T

 - if an instruction i does NOT generate a system event, e(i) = null
   [it generates the null event]

 - for every initial template state, s_0^T, executing T in an execution
   context EC_T generates a sequence :

   SIGMA( T, EC_T, s_0^T ) = s_0^T ---[e_1^T]---> s_1^T ---[e_2^T]---> ...

   - for i  >= 1, 

     s_i^T == state after executing i'th instruction from the template EC_T(T) and 

     e_i^T == the event generated by the i'th instruction

     --> s_i^T ---> execute (i+1)-th instruction ---> generate event ---> s_(i+1)^T

   - if template doesn't terminate, sequence can be infinite

   SIGMA( I, s_0 ) == sequence when the instruction sequence I is executed
		      from the initial state s_0

   ==========================================================================
 - Definition 1 : "instruction sequence I contains a behavior specified by T"
   ==========================================================================
   an instruction sequence I contains a behavior specified by the template
   T if there exists a program state s_0, an execution context EC_T and a
   template state s_0^T such that mem( s_0^T ) == mem( s_0 ) and the two
   sequences SIGMA( T, EC_T, s_0^T ) and SIGMA( I, s_0 ) are finite and :

   -- note that the initial PCs don't have to be the same

   ------------------------------------
   requirements for the two sequences :
   ------------------------------------
   (1) SIGMA( T, EC_T, s_0^T ) 

       = s_0^T ---[e_1^T]---> s_1^T ---[e_2^T]---> ... ---[e_k^T]---> s_k^T

       SIGMA( I, s_0 ) 

       = s_0 ---[e_1]---> s_1 ---[e_2]---> ... ---[e_r]---> s_r

    Let affected( SIGMA( T, EC_T, s_0^T ) ) be the set of addresses such that :
    
      mem(s_0^T)[a] != mem(s_k^T)[a]

      (addresses whose contents before applying the template from the initial
       state are different than AFTER applying the template to completion)

    REQUIRED :	 mem( s_k^T )[a] == mem( s_r )[a] for all a in affected(...)

    for any memory address whose contents change when applying the template, 
    the value of those contents are the same as the value of those contents 
    when applying the instruction sequence from the init state to completion
   
    RECALL: the memory in the two starting states is identical
 
    THIS SAYS: call M the set of memory addresses whose content changes
    during the execution of the template. Then for every m in M,
    mem[m] at the end of execution of the template must equal mem[m] at
    the end of execution of the instruction sequence

    --> NOTE that this is NOT the same thing as saying that the contents
	of memory at the end of both must be identical

	we only care about the memory addresses whose content changed
	during execution of the template

	this means that there could be some memory address A and its
	contents could be changed during execution of the instruction
	sequence BUT NOT during execution of the template

	--> for example, edx in the given example
	--> edx could have value 0 at the beginning of both
	--> then edx could have value 0 at the end of the template but
	    have value 0x406000 at the end of the instruction sequence

	    ==> this is OK; edx as such is "extraneous"

    (2) ignoring null events, the event sequence < e_0^T, e_1^T, ..., e_k^T > 
	is a subsequence of < e_1, e_2, ..., e_r >

    two system events e_i^T and e_j MATCH if the arguments to each are the
    same and the return values from each are the same

    (3) if pc( s_k^T ) is in affected( SIGMA( T, EC_T, s_0^T ) )

        then pc( s_r ) is in affected( SIGMA( T, EC_T, s_0^T ) )

    - if the program counter at the end of applying the template is an
      address whose contents have changed (due to applying the template)
      then the program counter after issuing the instruction sequence
      must also have changed due to that instruction sequence

    [A problem is that affected(...) contains ADDRESSES whereas pc( s_i^T )
     refers to a VALUE or the *contents* of some address]

     - do we assume that the *address* of the program counter stays static?

    < Tangent re: condition 3; see cond_3_tangent.txt for details >

   ==============================================
 - Definition 2 : Template Matching Problem (TMP)
   ==============================================
   a program P satisfies a template T iff P contains an instruction
   sequence I such that I contains a behavior specified by T

     NOTATION : P |= T        (P satisfies template T)

   e.g. P is malware; T specifies decryption loop; if P contains *any*
   instruction sequence I which contains the behavior specified by T
   then P satisfies T (e.g. PGP should satisfy that T)

   ----------------
   Variant Family : 
   ----------------
    - Let T be a set of templates (e.g. T = { T_1, T_2, ..., T_l } and
      T_1 specifies decryption loop behavior, T_2 specifies email addy
      search behavior, ... )

    - The set T defines a variant family F : 

	  { P | for all T_i in T, P satisfies T_i }

      -- note that it is possible for some family members, f in F,
	 to have *additional* behavior than what is specified by T

	 -- so adding functionality to a family member will not cause that
            family member to be thrown out of the family

	 -- removing some functionality may result in booting the family
            member though

   ------------------------------
 - Theorem 1 : TMP is undecidable
   ------------------------------
   Let M be a Turing machine and P_M be a program that uses instructions
   and simulates M

   Assume there is some address sp_addr which P_M doesn't alter the contents of

   Before starting to simulate M, P_M sets mem[sp_addr] = 0
   After simulating M, P_M sets mem[sp_addr] = 1

   Consider template T_1 : mem[sp_addr] = 0
			   mem[sp_addr] = 1

   So P_M satisfies template T_1 if and only if M halts which implies a
   solution to the halting problem which is undecidable therefore...

------------------
A weaker semantics
------------------
Semantics from above may be too strict; imagine a template uses certain
memory locations to store temporary values then such mem locs should not
be checked for equality in comparing the executions of an instruction
sequence and a template.

core memory locations: core( SIGMA( T, EC_T, s_0^T ) ) is a subset of the
		       affected memory addresses

Then modified condition 1 :

 - for all a in core( SIGMA( T, EC_T, s_0^T ) ), 

   mem( s_k^T )[a] = mem( s_r )[a]

 - so we only check "core" memory locations for equality

How to define "core" memory locations?

 - label each instruction in I_T as "temp" or "persistent"

 - core memory locations = any address in affected( ... ) that is the
			   target of a PERSISTENT load or store

=======================================
III. Semantics-aware matching algorithm :
=======================================
 - Sound but not complete

   Sound : no false positives (anything that matches definitely matches)   
    --> every solution produced is indeed a solution

   Complete : no false negatives (catch everything that matches)
    --> produces all possible solutions

 - tool takes as input : template and a binary (for MSFT Windows on Intel
		         IA-32 (x86) platform)

 - tool outputs whether any fragment of the program contains the behavior 
   specified by the template

   -- describes the matched fragment of the program plus relates template
      vars to program registers and memory locations and lists irrelevant
      instruction sequences that are part of the matched program fragment

 - architecture : tool built on top of IDA Pro disassembler

   - binary comes in and is disassembled, create control flow graphs from
     it (one per program function), produce intermediate representation (IR)

     -- IR generated using library of x86 instruction transformers
     -- IR is architecture-independent
     -- IR can still contain library and sys calls that are platform specific
	--> remove this dependency by using summary functions expressed in IR
	    - not much info on this (how works exactly, examples, ...)

     ==> point of all this, though, is for IR to be platform- and
	 architecture-independent, so could conceivably use one template
	 to detect some virus with semantically identical variants
	 produced for Win32, Linux, ...

   - so malware detector algo takes Template and program IR, interacts with :
     - decision procedures, which include :
       -- nop library, random execution, simplify, UCLID
     and outputs YES or NO

     nop library : predefined patterns of irrelevant instructions
     randomized execution : randomized x86 interpreter
     simplify : theorem prover
     UCLID : decision procedure with bit-vector arithmetic

   ===============================================
 - algo description, malware detection algo : A_MD
   ===============================================
   - finds, for each template node, a matching node in the program

   - a template node is an expression; e.g. "A = const_addr1", "condition(A)?",

   - a program node is an expression; e.g. "eax = 0x403000", "eax >= 0x406000?"

   - two nodes, n_T (from the template) and n_P (from the program) match if
     there exists an assignment to vars from the n_T expression that unifies 
     it with n_P

   - once two matching nodes are found, check whether def-use relationships
     true b/n template nodes also hold true in the corresponding program

     -- def-use relationship : 

	two nodes n_1 and n_2 are related by a def-use relationship IF
	n_1's DEFinition of a var reaches n_2's USE of the same var

	e.g. n_1 = statement S_i and n_2 = statement S_j and 
	n_1 defines some variable X and n_2 *uses* that value of X
	(as set during statement S_i)

        [see def_use_rp.txt in this folder for slighly more in-depth
	 though still somewhat elementary explanation]

    - if all the nodes in the template have matching counterparts in the
      program that satisfy these conditions, then the algo has found a
      program fragment that satisfies the template and produces a proof of
      this relationship

 - formally, given a malicious code template T and a program P, 

   A_MD : is a predicate on pairs of programs and templates

     A_MD : Programs x Templates ---> { "yes", "don't know" }

   "don't know" == algo can't determine that prog satisfies the template

   such that A_MD( P, T ) returns "yes" when P satisfies T under the
   conditions below
----------------------
Simplified template T;  X^key == X (bit-vector XOR) key
----------------------
(1)   A = const_addr1					// b_1
(2)   B = const_addr2					// b_2
(3)   A > const_addr3 ?  TRUE: goto 7, FALSE: goto 4	// b_3
(4)   mem[B] = mem[A]^key				// b_4
(5)   A = A + c						// b_5
(6)   B = B + d						// b_6
      goto 3 // they have another typo here
(7)   jump const_addr2					// b_7
---------------------
Simplified program P:
---------------------
(1) eax = 0x403000					// a_1
(2) ebx = 0x400000					// a_2
(3) nop							// a_3
(4) ecx = eax + 1					// a_4
(5) ecx - 1 > 0x406000 ?  TRUE: goto 10, FALSE: goto 6	// a_5
(6) eax = ecx - 1					// a_6
(7) mem[ebx] = mem[eax]^5				// a_7
(8) eax = eax + 1					// a_8
(9) ebx = ebx + 2					// a_9
    goto 4    
(10) jump 0x400000					// a_10

===========================================================
For A_MD(P, T) = "yes", the following conditions must hold:
===========================================================
 -----------------------------------------------
 (1) matching of template nodes to program nodes
 -----------------------------------------------
 - each node n in T has to unify with a node m in P
 
 - call n_T : the set of template nodes and n_P the set of program nodes

 - IOW, there is an onto partial function f from the nodes of P to the
   nodes of T (there may exist nodes p in n_P such that f(p) = not defined;
   however for every node t in n_T, there exists some p in n_P such that
   f(p) = t)

   then f : { nodes in P }         ---> { nodes in T }
	    { n_P }		   ---> { n_T }
	    { a_1, a_2, ..., a_m } ---> { b_1, b_2, ..., b_n }

	    m >= n

   then:    f( a_i ) = b_j   for some i,j

	    and a_i and b_j are unifiable

   B( X, n, m ), where n = f( m ) 

      == the binding of variable X (where X is referred to in the template
	 instruction at node n) TO an expression referred to in the
	 program instruction at node m; 

   ==> So is it okay if more than one program node maps to the SAME
       template node? By the definition of f, this is allowed...

   e.g.  B(A,1,1) = eax		     also	B(B,2,1) = eax
	 B(A,2,2) = ebx				B(A,1,2) = ebx
	 B(A,3,5) = (ecx - 1)
	 B(A,5,8) = eax
	 B(B,6,9) = ebx

   ----------------------------------------------------------------------------------
   sample definition of f [there is more than one possible that satisfies this cond]:
   ----------------------------------------------------------------------------------
     a_1 -->    b_1
     a_2 -->	b_2
     a_3 -->	undefined
     a_4 -->	undefined
     a_5 -->	b_3
     a_6 -->	undefined
     a_7 -->	b_4
     a_8 -->	b_5
     a_9 -->	b_6
     a_10 -->   b_7

 ---------------------------------
 (2) preservation of def-use paths
 ---------------------------------
  - for each node n in T, define def(n) as the set of vars defined in n
  - for each node n in T, define use(n) as the set of vars USED in node n
    ----------	    ------	 ------
    nodes in T	    def(n)	 use(n)
    ----------	    ------	 ------
    b_1		    A		 none
    b_2		    B		 none
    b_3		    none	 A
    b_4		    none	 A, B
    b_5		    A		 A
    b_6		    B		 B
    b_7		    none	 none

  - nodes n_pre : predecessor to the template entry node
  - nodes n_post : successor to all template exit nodes
  - def( n_pre ) = use( n_post ) = V_T
  - so  def( n_pre ) = set of all nodes defined in the template?
  - and use( n_post ) = set of all nodes used in the template?
 
  - a def-use path is a sequence of nodes < n_1, n_2, ..., n_k > such that

    + for 1 <= i < k, there is an edge from n_i to n_(i+1) in T

      [i.e. < n_1, n_2, ..., n_k > is a valid path through template T ]

    + I( n_1 ) == the instruction at node n_1

      I( n_1 ) DEFines a variable that I( n_k ) USEs

      [i.e. def( n_1 ) INTERSECT use( n_k ) != {} ]

    + none of the instructions at the intermediate nodes redefine the
      value of the variables in def(n_1)

      [i.e. def(n_1) INTERSECT def(n_i) = {} for 1 < i < k ]

  - so from our template graph, 

    paths : n_1 --> n_2 --> n_3
	    n_3 --> n_4
	    n_3 --> n_7
	    n_4 --> n_5 --> n_6 --> n_3 

    def( A ) : { n_1, n_5 }	 use( A ) : { n_3, n_4, n_5 }
    def( B ) : { n_2, n_6 }	 use( B ) : { n_4, n_6 }

    ---------------------------------------------
    def-use relationships, denoted duse(n_1,n_2):
    ---------------------------------------------
    A : duse( n_1, n_3 ) path : n_1 --> n_2 --> n_3
    A : duse( n_1, n_4 ) path : n_1 --> n_2 --> n_3 --> n_4    
    A : duse( n_1, n_5 ) path : n_1 --> n_2 --> n_3 --> n_4 --> n_5
    A : duse( n_5, n_3 ) path : n_5 --> n_6 --> n_3
    A : duse( n_5, n_4 ) path : n_5 --> n_6 --> n_3 --> n_4
    A : duse( n_5, n_5 ) path : n_5 --> n_6 --> n_3 --> n_4 --> n_5

    B : duse( n_2, n_4 ) path : n_2 --> n_3 --> n_4
    B : duse( n_2, n_6 ) path : n_2 --> n_3 --> n_4 --> n_5 --> n_6
    B : duse( n_6, n_4 ) path : n_6 --> n_3 --> n_4
    B : duse( n_6, n_6 ) path : n_6 --> n_3 --> n_4 --> n_5 --> n_6

 - any def-use relationship in the template, e.g. between nodes n and n',
   must hold between nodes m and m' WHERE n = f(m) and n' = f(m')

 - consider two nodes n and n' in T that are def-use related

   - let nodes n and n' in T be matched to nodes m and m' in P
   
   - suppose X is a var that is defined in n and used in n'

     B( X, n, m )   = r	     "X is bound to expression r in node m"
     B( X, n', m' ) = r'     "X is bound to expression r' in node m'"

   - for example, n = n_1, n' = n_3, X = A
		  m = m_1, m' = m_5, X = A

     then B(X, n_1, m_1 ) = eax
	  B(X, n_3, m_5 ) = (ecx - 1)

     so r = eax and r' = (ecx - 1)

     then need to make sure that value of (eax) after "eax = 0x403000" (m_1)
     is equal to value of (ecx - 1) before "ecx - 1 > 0x406000" (m_5)
     *for every program path corresponding to the template path from n_1 to n_3*

     - in this case there is just one such program path

   - need to make sure that value of r AFTER executing I(m) is same as
     value of r' BEFORE executing I(m')

   - let's try another example: n_6 and n_4 are related by def-use r/p in T

     and n_6 corresponds to m_9 and n_4 corresponds to m_7

     B( B, n_6, m_9 ) = ebx
     B( B, n_4, m_7 ) = ebx

     condition is trivially satisfied

   ===========================================================
 - Theorem 2 : A_MD is sound with respect to the TMP semantics; 
   ===========================================================

	i.e. A_MD( P, T ) = "yes" ===> P |= T

 - sound but not complete
 - can show there exists a program P and a template T such that 

   P satisfies T but A_MD( P, T ) returns "don't know"

   template T				  program P
   ----------				  ---------
   for ( i = 0; i < 10; i++ )		  for ( i = 0; i < 10; i += 2 )
     a[i] = 0;				      a[i] = 0;
					  for ( i = 1; i < 10; i += 2 )
					      a[i] = 0;

   - A_MD won't catch this matching as the ordering of updates in the
     template loop is different than that of the program loops

   =================
 - Local unification
   =================
   -- produces an assignment to vars in an instruction at a template node
      such that this matches a program node instruction

   -- program node expressions contain ground terms only
   -- instantiate template vars with program expression

   -- template node 1 ( A = const_addr1 ) can be unified with

        program node 1 ( eax = 0x403000 ) OR
	program node 2 ( ebx = 0x400000 )

      but not with program node 8 ( eax = eax + 1 ) or 9 ( ebx = ebx + 2 )

   ++ so note that unification assigns to vars BUT ALSO to constants

   -- [ t_1, p_1 ] { A <-- eax, const_addr1 <-- 0x403000 }
      [ t_2, p_2 ] { B <-- ebx, const_addr2 <-- 0x400000 }
      [ t_3, p_5 ] { A <-- ecx - 1, const_addr3 <-- 0x406000 }
      [ t_4, p_7 ] { B <-- ebx, A <-- eax, key <-- 5 }
      [ t_5, p_8 ] { A <-- eax, c <-- 1 }
      [ t_6, p_9 ] { B <-- ebx, d <-- 2 }
      [ t_7, p_10 ] { const_addr2 <-- 0x400000 }

      --> B is only assigned ebx so we don't have to worry about B
      --> const_addr1 is only assigned 0x403000
      --> const_addr2 is only assigned 0x400000
      --> key is only assigned 5
      --> const_addr3 is only assigned 0x406000
      --> c is only assigned 1
      --> d is only assigned 2

      --> A is assigned both: { eax, ecx - 1 }

 RULES about producing an assignment to vars in an instruction at a
 template node such that the resulting expr matches a prog node instr:

  (1) a var in the template can be unified with any prog expr except for
      assignment expressions

  (2) a constant in the template can only be unified with a prog constant

  (3) the memory function (in the template) can only be unified with the
      program's memory function

  (4) a predefined operator in the template (e.g. OR, AND, XOR, NOT,
      BITAND, LEQ, PLUS, MINUS, ...) can only be unified with THE SAME
      operator in the program

  (5) an external function call in the template can only be unified with
      the same external function call in the program

  ==> result of this is a binding relating template variables to program expressions

 - note that we don't require that bindings be consistent (monomorphic) since
   as above we have that A is bound to eax at one point yet bound to ecx - 1 
   at another

   -- such a requirement would be overly restrictive (not resilient to
      obfuscation attacks)

 - instead we want to check program expressions bound to the same template
   variable (e.g. eax is bound to A and so is ecx - 1 )

   --> verify that such expressions change value in the same way that the
       template variable changes value

   ---------------------=====================================
   ACCOMPLISH THIS via : value preservation on def-use chains
   ---------------------=====================================
   - for each def-use chain in the template, check :

     whether the value of the program expression corresponding to the
     template-variable use is the same as the value of the program
     expression corresponding to the template-variable definition

   - e.g. t_1 and t_3 are def-use related for variable A

     p_1 maps to t_1 and p_5 maps to t_3

     R : the program fragment between (exclusive between) p_1 and p_5

     R :     ebx = 0x400000
	     nop
	     ecx = eax + 1

     so the value for A after p_1 must be the same as the value for A
     before p_5

     -- the value for A at p_1 is : eax
     -- the value for A at p_5 is : ecx - 1

     so, confirm that from < p_2, p_3, p_4 >, eax == (ecx - 1)

     "the expressions corresponding to template variable A after p_1
      must be equal in value to the expressions corresponding to 
      template variable A before p_5 (since p_1 and p_5 are related by
      a def-use relationship for variable A)"

     express this via a predicate : 

      for each path I in program fragment R

        val_pre( I )(eax) = val_post( I )(ecx - 1)

      [e.g. I = <p_2, p_3, p_4>]

      val_pre( I )( x ) : the value of x before path I
      val_post( I )( x ) : the value of x after path I

 - for each match between a program node p_i and a template node t_j, the
   algo checks whether the def-use chain is preserved for each program
   expression corresponding to a template var used at the matched node

 - does R maintain the value predicate PHI : 

   PHI = { for all I in R, 

	     ( val_pre<I>(eax) = val_post<I>( ecx - 1 ) ) 
	 }

 - treat decision procedures as oracles that answer the question, 

   "Does a program fragment R maintain an invariant PHI?"

 - another example, def-use chain t_5, t_3 maps to p_8, p_5

   R : < p_9, p_4 >

     : ebx = ebx + 2
       ecx = eax + 1

   definition of A in t_5 : eax
   definition of A in t_3 : ecx - 1

   val_pre( I )( eax ) =? val_post( ecx - 1 ) Yes.

 - another example, def-use chain t_1, t_4 maps to p_1, p_7

   R : < p_2, p_3, p_4, p_5, p_6 >

     : ebx = 0x400000
       nop
       ecx = eax + 1
       ecx - 1 > 0x406000 ?
       eax = ecx - 1

   definition of A in t_1 : eax
   definition of A in t_5 : eax

   val_pre( I )( eax ) =? val_post( eax ) Yes.

   =====================================================
 - Using decision procedures to check value-preservation
   =====================================================
   - want to figure out whether a program fragment P is a semantic nop
     with respect to some conditions, e_1 and e_2

     if -- for every path I in that program fragment -- the value of e_1 
     *before* that path is the same as the value of e_2 *after* that path
     then that PHI( I, e_1, e_2 ) holds

     if ( PHI( I, e_1, e_2 ) for every I in R holds )

        then PHI( P, e_1, e_2 ) holds and D returns TRUE

     if on the other hand, there exists some I' in R for which

	PHI( I', e_1, e_2 ) == "don't know"  or "false"

	then D( P, e_1, e_2 ) = "don't know" or "false"

     we can also determine whether: !PHI( P, e_1, e_2 ) holds...

       if ( PHI( I, e_1, e_2 ) for every I in R *does not hold* )

         then !PHI( P, e_1, e_2 ) does not hold

     D^+ : a decision procedure for PHI( P, e_1, e_2 )
     D^- : a decision procedure for !PHI( P, e_1, e_2 )

   =======================================================================
   - "semantic nops" : some program fragment I is a semantic nop with
     respect to two conditions e_1 and e_2 if the value of e_1 before that
     fragment is the same as the value of e_2 after that fragment
   =======================================================================

   - value-preservation oracle : decision procedure, determines whether a
     given program fragment is a semantic nop w.r.t. certain prog vars

     D determines whether the value predicate, 

       PHI( P, e_1, e_2 ) 

     holds for all paths in P; i.e.

       for all paths I in P, 

	   ( val_pre<I>(e_1) = val_post<I>(e_2) )

   -------------
 - NOP library : identifies sequences of nop instructions
   -------------
   -- annotate basic blocks as nop sequences when applicable
   -- use actual nop instructions plus some sequences which are known to be nops

   -----------------------------
 - randomized symbolic execution
   -----------------------------
   -- program frag executed using a random initial state

   -- at completion, check whether !PHI(P,e_1,e_2) is true

      -- if so, at least one path in the program fragment is not a
         semantic nop; and so whole P is NOT a semantic nop

      -- else, whole P may be a semantic nop

   --------------------------
 - theorem proving : SIMPLIFY
   --------------------------
   -- value preservation problem formulated as a theorem to be proved
      (as long as program frag has NO loops)

   -- program frag represented as state transformer DELTA

   -- then use SIMPLIFY to prove the formula

      DELTA ==> PHI( P, e_1, e_2 )

      - if SIMPLIFY confirms that the formula is valid, 
	then the program fragment is a semantic nop

   -- can't handle programs which include bit-vector operations

   2: ebx = 0x400000
   3: nop
   4: ecx = eax + 1

   Simplify formula :

     (IMPLIES (AND (EQ ebx_1 4194304)
		   (EQ ecx_4 (+ eax_pre 1) )
		   (EQ ecx_post ecx_4)
	       )
              (EQ eax_pre (- eax_post 1))
     )

   same as:

   (ebx_1 == 4194304) && (ecx_4 == (eax_pre + 1)) && (ecx_post == ecx_4)

    ==> ( eax_pre == (eax_post - 1) )

 - theorem proving : UCLID

   -- based on UCLID infinite-state bounded model checker

   -- logic supported by UCLID is a superset of that supported by SIMPLIFY

   -- UCLID precisely models int and bit-vector arith

   -- model program fragment instructions as state transformers for each
      register and for memory

   -- UCLID simulates program frag for a given # of steps and determines
      whether PHI( P, e_1, e_2 ) holds at the end of the simulation

=========================
Strengths and Limitations
=========================
----------------------------------------------------------------
4 classes of obfuscations, A_MD can handle (to varying degrees):
----------------------------------------------------------------
 - code reordering, equivalent instruction replacement, register renaming,
   garbage insertion

(1) code reordering : change order of instructions; maintain execution
		      order using jumps

 - handled via use of control flow graphs

(2) register renaming : reassigns registers in selected program fragments

 - so any tool searching for signature which is based on specific register
   names will miss the obfuscated variant

 - use templates; vars assigned registers only during unification

 - so can detect any program with the same behavior as the template
   irrespective of register allocation

(3) garbage insertion : insert instruction sequences that don't change
			behavior of program

 - uses decision procedures to reason about value predicates on def-use chains

(4) equivalent instruction replacement 

 - uses rich instruction set to replace groups of consecutive instructions
   with other short instruction sequences that have the same semantics

 - e.g. to add 1 to register X, can use:  < add X, 1 > or < inc X >

 - handle limited kind of this via normalize code into IR which supports only
   semantically-disjoint operations

-----------
LIMITATIONS
-----------
 - doesn't recognize that:  "x = x + 2" is equivalent to "eax = eax << 1"

 - can only detect programs that exhibit same ordering of memory updates

   -- recall the problem where we populated a[i] with zero via pattern:

      <1,2,3,...,20> versus <2,4,6,8,...,20,1,3,5,...,19>

   -- comes from use of def-use chains for value preservation checking

   but NB : automatically generating functionally-equivalent variants is hard

Used : (1) decryption-loop template and (2) mass email template
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值