Undo .NET Constant Obfuscation in IDA Pro

While .NET malware samples are usually easy to decompile using dnspy and similar tools, possibly after an initial unpacking step using dnspy’s debugger or a dedicated unpacker for protectors like ConfuserEx, additional “small” obfuscations of strings or constants often remain. These can be reverted manually, but that is tedious and calls for automation. IDA Pro, the traditional tool for such tasks on x86 binaries, offers a great Python interface - unfortunately only in its commercial version. Until some time ago, its support for .NET’s intermediate language MSIL (Microsoft Intermediate Language) was limited, in particular when binary patching was required, but current versions can deal with this. We’ll look at a very simple example of a constant obfuscation we have seen occasionally, and at how a simple script can undo it. While the example is a real one, it’s nearly trivial and could also be solved in other ways; still, we think it makes a nice educational study.

Let’s first look into how this constant obfuscation manifests itself in the decompiled code:

if (num <= (1757753783U ^ 144094884U))
{
 if (num <= (391156780U ^ 190026018U))
 {
  if (num != (3105210592U ^ 3177030179U))
  {
   if (num != (2570662017U ^ 2471372789U))
   {
    if (num == (3212393845U ^ 2742696059U))
    {

Our script will basically calculate the XOR operations and transform this fragment into:

if (num <= 1616086803U)
{
 if (num <= 469959950U)
 {
  if (num != 71852739U)
  {
   if (num != 175576948U)
   {
    if (num == 469959950U)
    {

Another example is

DateTime kj = new DateTime(1870872117 ^ 1870870921, 1, 1, 0, 0, 0);

This becomes:

DateTime kj = new DateTime(1980, 1, 1, 0, 0, 0);
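The folded values are easy to double-check outside of any reverse engineering tool. The following lines are only a sanity check in plain Python, not part of the IDA script; the operand pairs are copied from the listings above:

```python
# Operand pairs copied from the decompiled listings above
pairs = [
    (1757753783, 144094884),
    (391156780, 190026018),
    (3105210592, 3177030179),
    (2570662017, 2471372789),
    (3212393845, 2742696059),
    (1870872117, 1870870921),  # year argument of the DateTime example
]

# XOR each pair, which is all the obfuscation does at run time
folded = [a ^ b for a, b in pairs]

for (a, b), value in zip(pairs, folded):
    print(f'{a} ^ {b} = {value}')
```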

Of course we could just use dnspy’s project export feature and modify the produced .cs files by matching text patterns and replacing the constants. However, it is tricky to find the correct patterns and parse the values correctly, especially if more complicated obfuscations are used, and we might miss out on the constant analysis that dnspy would otherwise do for us. One example of such a tricky case is:

BindingFlags invokeAttr = (BindingFlags)633067955 ^ (BindingFlags)633063859;

The problem here is the cast to (BindingFlags), which our text-based pattern must be able to handle, probably producing something like "(BindingFlags)4096". However, if we patch this at the binary level and then pass the file to dnspy, we get the following much nicer result:

BindingFlags invokeAttr = BindingFlags.GetProperty;

Because dnspy now sees a plain constant, it can also show the correct enum member GetProperty instead of 4096.
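To make the limitation of the text-based route concrete, here is a rough sketch of what such a replacement might look like. The regular expression and the helper name fold_xor_text are our own illustration and only cover the literal forms shown in this post; a real export will contain more variants:

```python
import re

# "<optional (Type) cast>A<optional U> ^ <optional (Type) cast>B<optional U>"
XOR_EXPR = re.compile(
    r'(?P<cast>\(\w+\))?(?P<a>\d+)(?P<u>U?)\s*\^\s*(?:\(\w+\))?(?P<b>\d+)U?'
)

def fold_xor_text(source: str) -> str:
    """Replace 'A ^ B' on decimal literals by the computed value (illustrative only)."""
    def repl(m):
        value = int(m['a']) ^ int(m['b'])
        # keep the cast and the unsigned suffix of the first literal, if present
        return f"{m['cast'] or ''}{value}{m['u']}"
    return XOR_EXPR.sub(repl, source)

samples = [
    'if (num <= (1757753783U ^ 144094884U))',
    'DateTime kj = new DateTime(1870872117 ^ 1870870921, 1, 1, 0, 0, 0);',
    'BindingFlags invokeAttr = (BindingFlags)633067955 ^ (BindingFlags)633063859;',
]
for line in samples:
    print(fold_xor_text(line))
```

The last sample ends up as "(BindingFlags)4096": the text is simplified, but the enum member name is not recovered, and redundant parentheses stay in place.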

If we load the binary into IDA and examine the MSIL code, it’s easy to see how the obfuscator works - here is how this looks in the (BindingFlags) case:

20 B3 D9 BB 25     ldc.i4   0x25BBD9B3
20 B3 C9 BB 25     ldc.i4   0x25BBC9B3
61                 xor

There are also (rare) examples of 64-bit operations:

21 61 58 78 81 C6 24 EB F3      ldc.i8   0xF3EB24C681785861
21 00 00 00 00 00 00 00 80      ldc.i8   0x8000000000000000
61                              xor

The ldc instruction simply pushes the immediate constant encoded in the bytes following the opcode onto the stack, while the xor instruction pops the top two stack elements and pushes their XOR (exclusive or), as MSIL is a stack-based machine. So we can scan for this sequence of instructions and replace it with a single ldc instruction, overwriting the rest with nop instructions (opcode 00). For the 32-bit (BindingFlags) example above, we’d like to end up with the following code:

00                 nop
00                 nop
00                 nop
00                 nop
00                 nop
20 00 10 00 00     ldc.i4   0x1000
00                 nop
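The intended byte layout can be reproduced on a plain byte string, independent of IDA. The helper fold_sequence below is a hypothetical illustration, not part of the script; it assumes the input consists of exactly one "ldc, ldc, xor" sequence with both ldc instructions of the same width:

```python
import struct

NOP, LDC_I4, LDC_I8, XOR = 0x00, 0x20, 0x21, 0x61
OPERAND_FMT = {LDC_I4: '<I', LDC_I8: '<Q'}  # little-endian 32/64-bit operands

def fold_sequence(code: bytes) -> bytes:
    """Fold one 'ldc, ldc, xor' byte sequence into nops plus a single ldc."""
    opcode = code[0]
    fmt = OPERAND_FMT.get(opcode)
    if fmt is None:
        raise ValueError('sequence does not start with ldc.i4 or ldc.i8')
    size = 1 + struct.calcsize(fmt)  # 5 for ldc.i4, 9 for ldc.i8
    if len(code) != 2 * size + 1 or code[size] != opcode or code[-1] != XOR:
        raise ValueError('not an ldc/ldc/xor sequence')
    a, = struct.unpack_from(fmt, code, 1)
    b, = struct.unpack_from(fmt, code, size + 1)
    # first ldc -> nops, second ldc gets the folded operand, xor -> nop
    return bytes([NOP]) * size + bytes([opcode]) + struct.pack(fmt, a ^ b) + bytes([NOP])

patched = fold_sequence(bytes.fromhex('20 B3 D9 BB 25 20 B3 C9 BB 25 61'))
print(patched.hex(' '))
```

Keeping the folded ldc in the position of the second ldc means that no instruction has to be moved and the total length stays the same, so offsets elsewhere in the method body remain valid.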

We could try to just read the file as binary data, search for the patterns using YARA-like signatures, and then apply the patches. However, we might find wrong matches in data sections, so it’s preferable to apply the patches using a real disassembler. We make a few assumptions to keep the code simple:

The three instructions always appear back to back, without any other instruction in between, be it a simple nop or something like a jump. In the samples we studied, this was always the case; otherwise, a more complex state machine would be needed.
No “short” instructions, like ldc.i4.2, are used. These push a very small constant (here 2) onto the stack and save bytes, as the operand is implied by the opcode itself. Such single-byte forms only exist for the constants -1 (ldc.i4.m1) and 0 to 8, so if we assume that the obfuscator chooses its constants randomly, the probability of one showing up in our xor obfuscation is tiny: roughly 1 in 400 million per constant. (The two-byte ldc.i4.s form for values from -128 to 127 is ignored as well.)
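Should the second assumption fail for some sample, the compact forms would have to be decoded as well. The sketch below is not part of our script; decode_ldc is a hypothetical helper that works on raw bytes and uses the opcode values from the CLI specification (ECMA-335) to return the pushed value and the instruction length:

```python
import struct

def decode_ldc(code: bytes, offset: int = 0):
    """Return (value, length) for an integer ldc instruction at offset, else None."""
    op = code[offset]
    if op == 0x15:                      # ldc.i4.m1 pushes -1
        return -1, 1
    if 0x16 <= op <= 0x1E:              # ldc.i4.0 ... ldc.i4.8
        return op - 0x16, 1
    if op == 0x1F:                      # ldc.i4.s <int8>
        return struct.unpack_from('<b', code, offset + 1)[0], 2
    if op == 0x20:                      # ldc.i4 <int32>
        return struct.unpack_from('<i', code, offset + 1)[0], 5
    if op == 0x21:                      # ldc.i8 <int64>
        return struct.unpack_from('<q', code, offset + 1)[0], 9
    return None                         # not an integer ldc instruction

print(decode_ldc(bytes([0x18])))                      # ldc.i4.2
print(decode_ldc(bytes.fromhex('20 00 10 00 00')))    # ldc.i4 0x1000
```

Note that with short forms the folded result would also have to be re-encoded so that it fits into the bytes available, which is one reason we leave them out here.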

So let’s start with the script. We’re using the sark module, which offers a neat assembly instruction wrapper for IDAPython, and define a dataclass to store the relevant values of a specific ldc instruction:

import idc
import sark
from dataclasses import dataclass

# Ldc stores information about an ldc.i4 or ldc.i8 instruction (pushing a constant to the stack)
@dataclass
class Ldc:
    ea: int     # effective address of the ldc instruction
    size: int   # number of bytes of ldc instruction (5 for 32 bit ldc.i4, 9 for 64 bit ldc.i8)
    value: int  # value pushed to stack

We iterate through all functions and, for each one, initialize an empty list xor_list that collects consecutive ldc instructions. Then we iterate over all instructions of the function: ldc instructions are appended to xor_list, an xor that follows at least two ldc instructions triggers the patching, and any other instruction clears the list. Note that sark offers no direct way to access 64-bit operands; the upper 32 bits are exposed as the displacement, which we must shift and OR with the lower 32 bits from the immediate property:

for fct in sark.functions():
    xor_list: list[Ldc] = []  
    for l in fct.lines:
        ops = l.insn.operands  # just a shortcut
        if l.insn.mnem == 'ldc.i4' and len(ops) == 1 and ops[0].type.is_imm:
            # ldc.i4 instruction detected, append it to xor_list (5 bytes in size)
            xor_list.append(Ldc(l.ea, 5, ops[0].imm))
        elif l.insn.mnem == 'ldc.i8' and len(ops) == 1 and ops[0].type.is_imm:
            # ldc.i8 instruction detected, append it to xor_list (9 bytes in size)
            xor_list.append(Ldc(l.ea, 9, (ops[0].displacement << 32) | ops[0].imm))
        elif l.insn.mnem == 'xor' and len(xor_list) > 1:
            # xor instruction detected after at least 2 consecutive ldc instructions
            apply_xor(xor_list)
            xor_list.clear()
        else:
            # Any other instruction clears the list
            xor_list.clear()
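The scanning logic can be tried without IDA by feeding it a plain list of instructions. The following sketch mirrors the loop above under that assumption; the tuple format (ea, mnemonic, size, value) and the function name find_foldable are made up for illustration, and it only collects the matches instead of patching anything:

```python
from dataclasses import dataclass

@dataclass
class Ldc:
    ea: int     # address of the ldc instruction
    size: int   # instruction length in bytes
    value: int  # constant pushed to the stack

def find_foldable(instructions):
    """Yield (first_ldc, second_ldc, xor_ea) for each 'ldc, ldc, xor' run."""
    pending = []
    for ea, mnem, size, value in instructions:
        if mnem in ('ldc.i4', 'ldc.i8'):
            pending.append(Ldc(ea, size, value))
            continue
        if mnem == 'xor' and len(pending) > 1:
            yield pending[-2], pending[-1], ea
        pending.clear()  # any non-ldc instruction ends the run

demo = [
    (0x100, 'ldc.i4', 5, 0x25BBD9B3),
    (0x105, 'ldc.i4', 5, 0x25BBC9B3),
    (0x10A, 'xor', 1, None),
    (0x10B, 'ldc.i4', 5, 7),
    (0x110, 'nop', 1, None),   # breaks the run, so the next xor is ignored
    (0x111, 'ldc.i4', 5, 9),
    (0x116, 'xor', 1, None),
]
matches = list(find_foldable(demo))
for op1, op2, xor_ea in matches:
    print(f'xor at {xor_ea:#x}: folded value {op1.value ^ op2.value:#x}')
```

The second xor in the demo is skipped on purpose: it shows the effect of the first assumption, namely that an intervening instruction resets the state.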

apply_xor() does the actual work, after having verified that the sizes of the previous two ldc instructions are the same (which they always should be). Depending on this size, we patch a double word in the case of 5 bytes, or a quad word in the case of 9 bytes, with the calculated value. We use the second ldc instruction for patching. Finally, the actual xor instruction (1 byte) and the first ldc instruction are replaced by nop instructions:

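def apply_xor(xor_list: list[Ldc]):
    # NOTE: relies on the scan loop's variables l (the xor line) and fct (the
    # current function) being visible as globals, so this function must be
    # defined before that loop runs.
    op1, op2 = xor_list[-2:]
    if op1.size != op2.size:
        print(f'Different op sizes in fct {fct.name} on {sark.Line(ea=op1.ea)} and {sark.Line(ea=op2.ea)}, ignored')
        return

    # patch calculated value into the operand of the 2nd ldc (1 byte after its
    # opcode), masked to the operand width as a precaution:
    print(f' Patching VA {l.ea:x}')
    if op2.size == 5:
        idc.patch_dword(op2.ea + 1, (op1.value ^ op2.value) & 0xFFFFFFFF)
    else:
        idc.patch_qword(op2.ea + 1, (op1.value ^ op2.value) & 0xFFFFFFFFFFFFFFFF)

    # nop out actual xor instruction (1 byte):
    idc.patch_byte(l.ea, 0)

    # nop out first ldc instruction, byte by byte
    for i in range(op1.size):
        idc.patch_byte(op1.ea + i, 0)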