
The Evolution of Executable Formats

From Raw Machine Code to Modern PE Format


Welcome to the Interactive PE Format Evolution Timeline

This interactive timeline explores the fascinating journey of executable formats, from the earliest days of computing with raw machine code to the sophisticated PE (Portable Executable) format used in modern Windows systems.

Understanding the evolution of executable formats is essential for malware analysts, reverse engineers, and anyone interested in how programs actually work at a fundamental level. As you progress through this timeline, you'll discover how each advancement in executable design enabled new capabilities while addressing limitations of previous formats. We'll also touch upon the role of compilers and linkers in creating these executable files.

The timeline covers the following topics, starting with Binary Foundations:

Binary Basics
CPU Registers
Machine Code
Assembly Language
Memory Models
File Identification
Early Formats
PE Intro
PE Headers
PE Sections
Advanced

Binary Foundations: The Basics


Number Systems in Computing

Computers operate using binary (base-2) because electronic components have two reliable states:

  • 0: Off, low voltage, false
  • 1: On, high voltage, true

All data in a computer is ultimately stored as sequences of these binary digits ("bits"):

10110110 01101001 11001010 10101101

Binary to Hexadecimal Conversion

Hexadecimal (base-16) serves as a more concise way to represent binary data:

  • Each hex digit represents exactly 4 binary bits (a nibble)
  • Range: 0-9 and A-F (where A=10, B=11, C=12, D=13, E=14, F=15)
Binary:       1011  0110
Hexadecimal:     B     6

This makes hexadecimal ideal for representing binary data compactly while maintaining a direct mapping to the underlying bits. It's commonly used in memory dumps, debuggers, and PE analysis tools.
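
As a quick check of the nibble-to-digit mapping, here is a minimal Python sketch that converts a bit string to hexadecimal one nibble at a time. The function name `bits_to_hex` is illustrative only and does not come from any library.

```python
def bits_to_hex(bits: str) -> str:
    """Convert a string of binary digits to hexadecimal, one nibble at a time."""
    bits = bits.replace(" ", "")
    if len(bits) % 4 != 0:
        raise ValueError("bit string length must be a multiple of 4")
    # Each group of 4 bits maps to exactly one hex digit.
    nibbles = [bits[i:i + 4] for i in range(0, len(bits), 4)]
    return "".join(format(int(n, 2), "X") for n in nibbles)

print(bits_to_hex("1011 0110"))  # B6
```

Python's built-in `hex(int(bits, 2))` gives the same value in one step; the explicit loop is only there to make the 4-bits-per-digit relationship visible.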

Bits, Bytes, Words, and Beyond

Data is organized into progressively larger units. The standard sizes evolved with processor architectures:

  • Bit: The smallest unit (0 or 1).
  • Nibble: 4 bits (half a byte).
  • Byte: 8 bits. The fundamental unit of addressable memory. Early microprocessors like the Intel 8080 worked primarily with bytes.
  • Word: 16 bits (2 bytes). Became standard with 16-bit processors like the Intel 8086.
  • Double Word (DWORD): 32 bits (4 bytes). The standard size for registers and addresses in 32-bit architectures (IA-32, like the Intel 80386).
  • Quad Word (QWORD): 64 bits (8 bytes). The standard size for registers and addresses in 64-bit architectures (x86-64).

These terms are essential vocabulary in assembly language, PE file format structures (which use types like WORD, DWORD), and low-level programming.

Endianness: Byte Order Matters

When dealing with multi-byte values (like Words, DWORDs, QWORDs), the order in which bytes are stored in memory becomes critically important:

  1. Little-Endian: Least significant byte (LSB) comes first in memory (at the lowest address).
    • Used by x86/x64 processors (Intel, AMD).
    • The DWORD value 0x12345678 is stored in memory as: 78 56 34 12
  2. Big-Endian: Most significant byte (MSB) comes first in memory (at the lowest address).
    • Used by some architectures (e.g., older PowerPC, SPARC, MIPS) and standard network protocols (hence "network byte order").
    • The DWORD value 0x12345678 is stored in memory as: 12 34 56 78

This concept of "endianness" is crucial in malware analysis because:

  • You must interpret multi-byte values from memory dumps or file structures correctly based on the target architecture (usually little-endian for Windows malware).
  • Some malware deliberately uses reversed byte order to obscure strings or values.
  • Network communication often requires byte-swapping when moving between the big-endian network format and little-endian memory.
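
The two byte orders can be reproduced with Python's standard `struct` module. This is a small sketch using the same example value, 0x12345678; the variable names are arbitrary.

```python
import struct

value = 0x12345678
little = struct.pack("<I", value)  # "<" = little-endian, "I" = unsigned 32-bit
big = struct.pack(">I", value)     # ">" = big-endian

print(little.hex(" "))  # 78 56 34 12
print(big.hex(" "))     # 12 34 56 78

# Reading the same four bytes with the wrong byte order yields a different number.
misread = struct.unpack(">I", little)[0]
print(hex(misread))  # 0x78563412
```

The last line shows the typical analysis mistake: bytes copied from a little-endian memory dump but read left to right as if they were big-endian.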

Malware analysts must master number systems and data representations to properly analyze binary files. The ability to read and convert between binary, hexadecimal, and decimal, understand data sizes (Byte, Word, DWORD, QWORD), and interpret endianness is fundamental to understanding executable file structures, memory dumps, and disassembled code. This knowledge forms the bedrock for all deeper analysis of PE files.

CPU Registers: The Processor's Workbench


CPU Registers: High-Speed CPU Storage

Registers are small, extremely fast storage locations built directly into the CPU. They are the primary working space for the processor, holding data currently being processed, instruction pointers, and status flags. Understanding registers is key to understanding assembly language.

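16-bit (8086/80286)

  • General purpose: AX (AH:AL), BX (BH:BL), CX (CH:CL), DX (DH:DL)
  • Pointer, index, and control: SP, BP, SI, DI, IP, FLAGS

32-bit (i386+)

  • General purpose: EAX (low 16 bits: AX), EBX (BX), ECX (CX), EDX (DX)
  • Pointer, index, and control: ESP, EBP, ESI, EDI, EIP, EFLAGS

64-bit (x86-64)

  • General purpose: RAX (EAX, AX), RBX (EBX, BX), RCX (ECX, CX), RDX (EDX, DX)
  • Pointer, index, and control: RSP, RBP, RSI, RDI, RIP, RFLAGS
  • Added in 64-bit mode: R8, R9, R10, R11, R12, R13, R14, R15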

Understanding how registers evolved (16-bit -> 32-bit -> 64-bit) and how smaller registers are part of larger ones (e.g., AL/AH make up AX, AX is the lower 16 bits of EAX, EAX is the lower 32 bits of RAX) is crucial for analyzing code across different architectures.
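
The aliasing relationship can be modelled with simple bit masks. The sketch below is an illustration in Python rather than real register access; `sub_registers` is a hypothetical helper name.

```python
def sub_registers(rax: int) -> dict:
    """Split a 64-bit RAX value into its aliased sub-registers using masks."""
    return {
        "RAX": rax & 0xFFFFFFFFFFFFFFFF,
        "EAX": rax & 0xFFFFFFFF,   # low 32 bits
        "AX": rax & 0xFFFF,        # low 16 bits
        "AH": (rax >> 8) & 0xFF,   # bits 8-15
        "AL": rax & 0xFF,          # low 8 bits
    }

for name, val in sub_registers(0x1122334455667788).items():
    print(f"{name:>3} = {val:#x}")
```

For the sample value this yields EAX = 0x55667788, AX = 0x7788, AH = 0x77, and AL = 0x88.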

Register Categories and Common Uses

Registers are often grouped by their typical function:

General Purpose Registers (GPRs)

  • AX/EAX/RAX: Accumulator - Often used for arithmetic results, function return values, and some I/O operations.
  • BX/EBX/RBX: Base - Historically used as a base pointer for memory access (e.g., [BX+SI]). In 64-bit, RBX is often preserved across function calls (non-volatile).
  • CX/ECX/RCX: Counter - Frequently used as a loop counter (LOOP instruction) or for string operations (REP prefixes). First integer argument in the Microsoft x64 calling convention (often informally called x64 fastcall).
  • DX/EDX/RDX: Data - Used for I/O port access (IN/OUT instructions) and as the high half of the product or dividend (and the remainder) in MUL/DIV. Second integer argument in the Microsoft x64 calling convention.

Index and Pointer Registers

  • SI/ESI/RSI: Source Index - Often used as a source pointer in string/memory operations (e.g., LODSB, MOVSB). Non-volatile (callee-saved) in the Microsoft x64 calling convention; in the System V AMD64 ABI (Linux, macOS) it carries the second integer argument.
  • DI/EDI/RDI: Destination Index - Often used as a destination pointer in string/memory operations (e.g., STOSB, MOVSB). Non-volatile (callee-saved) in the Microsoft x64 calling convention; in the System V AMD64 ABI it carries the first integer argument.
  • SP/ESP/RSP: Stack Pointer - Points to the current top of the stack. Crucial for function calls (PUSH, POP, CALL, RET) and local variables.
  • BP/EBP/RBP: Base Pointer - Points to the base of the current stack frame, used to access parameters and local variables. Often optional in optimized 64-bit code where RSP-relative addressing might be used instead. Often non-volatile.

Instruction Pointer

  • IP/EIP/RIP: Instruction Pointer - Holds the address of the next instruction to be executed. Cannot be accessed directly by most instructions but modified by jumps, calls, and returns. Central to control flow.

Flags Register

  • FLAGS/EFLAGS/RFLAGS: Status Register - Contains individual bits (flags) indicating results of arithmetic/logical operations (Zero Flag (ZF), Carry Flag (CF), Sign Flag (SF), Overflow Flag (OF)) and controlling CPU behavior (Interrupt Flag (IF), Direction Flag (DF) for string ops). Conditional jumps (JZ, JNE, JC, etc.) depend on these flags.
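
To make the four arithmetic flags concrete, here is a hedged Python sketch that models an 8-bit ADD and derives ZF, CF, SF, and OF from the result. It is a simplified model of the flag rules, not an emulator, and `add8_flags` is a made-up name.

```python
def add8_flags(a: int, b: int) -> dict:
    """Add two 8-bit values and compute ZF, CF, SF, OF as an x86 ADD would."""
    a &= 0xFF
    b &= 0xFF
    full = a + b
    result = full & 0xFF
    return {
        "result": result,
        "ZF": int(result == 0),
        "CF": int(full > 0xFF),   # unsigned carry out of bit 7
        "SF": result >> 7,        # copy of the result's sign bit
        # Signed overflow: both operands share a sign that the result lacks.
        "OF": int(((a ^ result) & (b ^ result) & 0x80) != 0),
    }

print(add8_flags(0xFF, 0x01))  # wraps to 0: ZF and CF set
print(add8_flags(0x7F, 0x01))  # 127 + 1 overflows signed range: SF and OF set
```

A following `jz` would be taken after the first addition, and a `jo` after the second.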

Segment Registers (16-bit, but still relevant concepts in protected/long mode)

  • CS (Code Segment): Points to the segment containing executable instructions. Implicitly used with EIP/RIP.
  • DS (Data Segment): Default segment for most data access.
  • SS (Stack Segment): Points to the segment containing the program stack. Implicitly used with ESP/RSP and EBP/RBP.
  • ES (Extra Segment): Additional data segment, often used for string operations with DI/EDI.
  • FS & GS (Extra Segments): Additional data segments with no specific hardware-defined use. In modern Windows:
    • FS (32-bit) / GS (64-bit) are famously used to point to thread-specific data structures:
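      • TEB (Thread Environment Block) / TIB (Thread Information Block): In 32-bit Windows the FS segment base points at the TEB, whose first member is the TIB, so FS:[0] reads the head of the SEH chain. The structure contains pointers to the PEB, the SEH chain, the stack base and limit, the thread ID, and the last error value.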
      • Malware frequently accesses FS:[0x18] (TEB), FS:[0x30] (PEB pointer in TEB), or GS:[0x30] (TEB in x64), GS:[0x60] (PEB in x64) for anti-debugging (checking `PEB.BeingDebugged`), finding loaded modules, or getting other process/thread info without direct API calls.

Note: In the "flat" memory model used by modern Windows, CS, DS, SS, and ES hold selectors for segments that span the entire address space, so explicitly manipulating them for memory addressing is far less common than in older segmented architectures. FS and GS are the exception: their segment bases are set per thread, which gives them the special roles described above.

64-bit Additional GPRs

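  • R8 - R15: Additional general-purpose registers available in 64-bit mode. In the Microsoft x64 calling convention, R8 and R9 carry the third and fourth integer arguments (after RCX and RDX); any further arguments are passed on the stack. R10 and R11 are volatile scratch registers, while R12-R15 are non-volatile.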

CPU registers are the heart of low-level execution. Malware analysts must meticulously track register values when debugging or reverse engineering disassembled code. EIP/RIP dictates control flow, ESP/RSP manages the stack (critical for buffer overflows), EBP/RBP helps understand function context, and GPRs reveal data manipulation and function arguments/return values. Malware often uses registers in non-standard ways to obfuscate its actions.

Machine Code: The CPU's Native Language


Raw Machine Code: Direct CPU Instructions

At the most fundamental level, CPUs only understand binary sequences known as machine code or opcodes. Each sequence directly triggers a specific hardware operation.

While technically binary, we almost always represent machine code in hexadecimal for readability:

Binary:             Hexadecimal:   Assembly:
01010101            55             push ebp
10001001 11100101   89 E5          mov ebp, esp  (Standard 32-bit function prologue)

Note: The mov ebp, esp instruction shown above uses the bytes 89 E5. However, due to redundancy in the x86 instruction set, the functionally identical instruction could also be encoded as 8B EC. Different compilers or assemblers (like NASM vs MASM) might choose either valid encoding. This is important because seeing 8B EC instead of 89 E5 doesn't mean the code is wrong, just that a different (but valid) encoding was chosen. This variation can sometimes be used to guess which compiler produced the code (compiler fingerprinting).
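
The equivalence of the two encodings follows from how the ModR/M byte is split into its mod, reg, and r/m fields. The Python sketch below decodes only the register-to-register forms of opcodes 89 and 8B, assuming 32-bit operand size and no prefixes; `decode_mov_reg_reg` is an illustrative name, not a real disassembler API.

```python
REG32 = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"]

def decode_mov_reg_reg(code: bytes) -> str:
    """Decode a 2-byte register-to-register MOV (opcodes 89 /r and 8B /r only)."""
    opcode, modrm = code[0], code[1]
    mod = modrm >> 6          # top 2 bits
    reg = (modrm >> 3) & 7    # middle 3 bits
    rm = modrm & 7            # low 3 bits
    if mod != 0b11:
        raise ValueError("only register-direct operands (mod == 11) are handled")
    if opcode == 0x89:        # MOV r/m32, r32: destination is the r/m field
        dst, src = rm, reg
    elif opcode == 0x8B:      # MOV r32, r/m32: destination is the reg field
        dst, src = reg, rm
    else:
        raise ValueError("unsupported opcode")
    return f"mov {REG32[dst]}, {REG32[src]}"

print(decode_mov_reg_reg(bytes.fromhex("89E5")))  # mov ebp, esp
print(decode_mov_reg_reg(bytes.fromhex("8BEC")))  # mov ebp, esp
```

The two opcodes differ only in which ModR/M field names the destination, so swapping the opcode and swapping the fields cancel out.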

Instruction Encoding Concepts

x86/x64 instructions don't have a fixed length; they can range from 1 to 15 bytes. An instruction is typically composed of several parts, though not all parts are present in every instruction:

  • Prefixes (Optional): Single bytes that modify instruction behavior (e.g., operand size override, segment override, lock prefix, repeat prefixes).
  • Opcode (Required): One or more bytes specifying the core operation (e.g., mov, add, push, ret).
  • ModR/M Byte (Often Required): A complex byte that specifies operands. It indicates whether operands are registers or memory locations and defines the addressing mode used for memory access.
  • SIB Byte (Sometimes Required): Scale-Index-Base byte. Used with ModR/M for more complex memory addressing involving a scaled index register (e.g., [eax + ecx*4]).
  • Displacement (Optional): An offset (1, 2, or 4 bytes) added to a base address when accessing memory.
  • Immediate Value (Optional): A constant value (1, 2, 4, or 8 bytes) embedded directly in the instruction, used as an operand.

Examples showing different structures:

  • 50: push eax (Opcode only)
  • C3: ret (Opcode only; near return that pops just the return address, with no extra argument bytes removed)
  • B8 01000000: mov eax, 1 (Opcode + 32-bit Immediate)
  • 89 E5: mov ebp, esp (Opcode + ModR/M specifying two registers)
  • E8 00000000: call near_relative_offset (Opcode + 32-bit Displacement/relative offset)
  • FF 15 00104000: call dword ptr [0x401000] (Opcode + ModR/M + 32-bit Displacement/absolute address, stored little-endian; typical IAT call)

You don't need to memorize all encodings, but understanding that instructions have variable lengths and different components is key for reading disassembly and hex dumps. It's the job of a compiler (like GCC, Clang, or MSVC) to translate high-level code (like C or C++) into these machine code sequences, often via an intermediate assembly language step.

Common Instruction Byte Patterns

In malware analysis, recognizing common byte patterns directly in a hex editor or disassembler can quickly reveal program structure and behavior:

Function Prologues/Epilogues (32-bit)

  • 55 89 E5 or 55 8B EC: Standard 32-bit prologue (push ebp; mov ebp, esp)
  • C9 C3: Standard 32-bit epilogue (leave; ret)

Function Prologues/Epilogues (64-bit)

  • 40 55 / 48 89 E5 / 48 8B EC: Various common 64-bit prologue starts (often involve saving non-volatile registers).
  • C3: Simple return (often ends functions).

Control Flow

  • E8 xx xx xx xx: Relative call
  • E9 xx xx xx xx: Relative jmp
  • EB xx: Short relative jmp
  • 74 xx: je (short jump if equal/zero)
  • 75 xx: jne (short jump if not equal/zero)
  • FF 15 xx.. / FF 25 xx..: Indirect call/jmp through a memory operand (an absolute address in 32-bit code, a RIP-relative address in 64-bit code); typically used for IAT calls. Import calls are sometimes routed through a small piece of code called a thunk, which simply jumps to the real target address (e.g., jmp dword ptr [__imp__FunctionName]).

Stack Operations

  • 50-57: push General Purpose Register (eax, ecx, edx, ebx, esp, ebp, esi, edi)
  • 58-5F: pop General Purpose Register
  • 68 xx xx xx xx: push immediate_dword
  • 6A xx: push immediate_byte

No Operation

  • 90: nop (Often used for padding or overwritten by hooks)

Return Instructions

  • C3: Near Return: Pops the return address (pushed by call) from the stack into EIP/RIP. Used for returns within the same code segment.
  • C2 iw: Near Return and Pop N bytes: Pops the return address, then pops an additional N bytes (specified by the 16-bit immediate word iw) off the stack. Used by conventions like stdcall where the callee cleans up arguments.
  • CB: Far Return: Pops CS:IP (Code Segment and Instruction Pointer) from the stack. Used for returns between different code segments (rare in modern flat memory models).
  • CA iw: Far Return and Pop N bytes: Pops CS:IP, then pops an additional N bytes off the stack.

Machine code is the raw material of executables. All higher-level structures in PE files ultimately translate down to sequences of these byte instructions. When malware analysts perform deep static analysis or examine memory dumps, they are often looking directly at this machine code. Recognizing common patterns (like function prologues, API call sequences, or loops) in the raw hex can significantly speed up the analysis process, especially when dealing with obfuscated or packed malware where standard disassembly might fail.
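
As a rough illustration of pattern recognition in raw bytes, the following Python sketch searches a buffer for the two 32-bit prologue encodings listed above. The sample bytes are synthetic and `find_prologues` is a hypothetical helper. A plain byte search like this can produce false positives when the pattern happens to occur inside data or in the middle of another instruction.

```python
PROLOGUES = [bytes.fromhex("5589E5"), bytes.fromhex("558BEC")]

def find_prologues(code: bytes) -> list:
    """Return sorted offsets where a 32-bit push ebp / mov ebp, esp pattern occurs."""
    hits = []
    for pattern in PROLOGUES:
        start = 0
        while (pos := code.find(pattern, start)) != -1:
            hits.append(pos)
            start = pos + 1
    return sorted(hits)

# nop; nop; push ebp; mov ebp, esp; leave; ret; int3; push ebp; mov ebp, esp; pop ebp; ret
sample = bytes.fromhex("90905589E5C9C3CC558BEC5DC3")
print(find_prologues(sample))  # [2, 8]
```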

Assembly Language: Human-Readable Machine Code


Assembly Language Basics

Assembly language provides human-readable mnemonics for machine code instructions. An assembler translates assembly code into machine code, and a disassembler does the reverse. There's typically a direct one-to-one mapping (though some assemblers support macros).

High-level programming languages like C, C++, Go, or Delphi are translated by a compiler (e.g., GCC, Clang, MSVC) into assembly language (or sometimes directly to machine code), which is then assembled into the final machine code bytes stored in the executable file. Different compilers might generate slightly different, but functionally equivalent, assembly code for the same high-level source due to optimization choices or instruction selection.

Example: From Assembly to Machine Code

Address   Machine Code      Assembly Instruction   Comment
-------   ------------      --------------------   -------
00401000  B8 01000000       mov eax, 1             ; Load 1 into EAX
00401005  03 C3             add eax, ebx           ; Add EBX to EAX
00401007  50                push eax               ; Push EAX onto stack
00401008  E8 F3FFFFFF       call 00401000          ; Call relative address (example)
0040100D  C3                ret                    ; Return from function

Assembly uses mnemonics (mov, add, push), register names (eax, ebx), memory addressing modes ([ebp+8], [my_var]), and labels (start_loop:) to represent the underlying machine operations.
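
The call at 00401008 in the listing above encodes its target as a signed 32-bit offset from the address of the next instruction. Here is a small Python sketch of that arithmetic; `call_target` is an illustrative name.

```python
import struct

def call_target(address: int, instruction: bytes) -> int:
    """Compute the destination of an E8 rel32 call located at `address`."""
    if len(instruction) != 5 or instruction[0] != 0xE8:
        raise ValueError("expected a 5-byte E8 rel32 call")
    # rel32 is a signed little-endian offset from the NEXT instruction.
    (rel32,) = struct.unpack("<i", instruction[1:])
    return (address + len(instruction) + rel32) & 0xFFFFFFFF

print(hex(call_target(0x00401008, bytes.fromhex("E8F3FFFFFF"))))  # 0x401000
```

Here F3 FF FF FF is -13 as a signed little-endian DWORD, and 0x0040100D - 13 = 0x00401000.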

Disassembly vs. Decompilation vs. Bytecode

These terms represent different levels of abstraction when analyzing code:

  • Machine Code: The raw binary instructions the CPU directly executes (e.g., 55 89 E5).
  • Disassembly: Translating machine code into human-readable assembly language (e.g., push ebp; mov ebp, esp). This is a direct, accurate representation of the machine code. Tools: IDA Pro, Ghidra, Binary Ninja, debuggers (x64dbg, OllyDbg, WinDbg).
  • Decompilation: Attempting to translate assembly/machine code back into a high-level language like C/C++. This is an interpretive process, generating an approximation of potential source code. It's very helpful for understanding logic but may lose low-level details or be inaccurate. Tools: Hex-Rays Decompiler (IDA Pro plugin), Ghidra's decompiler, Binary Ninja.
  • Bytecode: An intermediate code format used by some languages (e.g., Java .class files, Python .pyc files, .NET CIL). Bytecode is executed by a Virtual Machine (JVM, PVM, CLR) rather than directly by the CPU. Decompiling bytecode back to its original source language (e.g., Java bytecode to Java source using JD-GUI, or .NET CIL to C# using dnSpy/ILSpy) is generally much easier and more accurate than decompiling native machine code, because bytecode often retains more metadata and structure.

Understanding these differences is crucial. Analyzing native code (C, C++, Go, Delphi PE files) primarily involves disassembly, with decompilation as a helpful aid. Analyzing managed code (.NET) or interpreted languages (Java, Python) often involves specific bytecode decompilers.

Function Calling Conventions (32-bit stdcall Example)

Calling conventions define how parameters are passed, return values are handled, and registers are managed during function calls. This example uses the common 32-bit stdcall convention, widely used by Windows APIs.

Caller Side

; Calling MyStdcallFunc(arg1, arg2) which is declared as STDCALL
push arg2           ; Push arguments onto stack (right-to-left)
push arg1
call MyStdcallFunc  ; Call the function (pushes return address)
; NO stack cleanup here! The callee does it in stdcall.
; Return value is typically in EAX

Callee Side (MyStdcallFunc)

MyStdcallFunc:
  ; Prologue
  push ebp          ; Save old base pointer
  mov ebp, esp      ; Set new stack frame base (using 8B EC or 89 E5 encoding)

  ; Access arguments
  mov eax, [ebp+8]  ; Access arg1 (first arg is at ebp+8)
  mov ecx, [ebp+12] ; Access arg2 (second arg is at ebp+12)

  ; ... function body ...
  ; Place return value in EAX (if any)

  ; Epilogue
  mov esp, ebp      ; Deallocate local variables (if any)
  pop ebp           ; Restore old base pointer
  ret 8             ; Return AND clean up 8 bytes (2*DWORD) from stack (Opcode C2 0800)

Key differences from cdecl: The callee (the function being called) is responsible for cleaning the arguments off the stack using the ret N instruction (opcode C2), where N is the total size of the arguments in bytes. Many Windows APIs use stdcall.
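
The stack offsets and the `ret N` operand in the example follow directly from the argument count. The Python sketch below works them out, assuming every argument is one DWORD and the function uses the standard EBP-based frame; `stdcall_frame` is a hypothetical helper, and a real function with zero arguments would simply end in a plain `ret` (C3).

```python
def stdcall_frame(arg_count: int) -> dict:
    """Describe the stack frame of a 32-bit stdcall function with DWORD arguments."""
    # [ebp+0] holds the saved EBP and [ebp+4] the return address,
    # so the first argument starts at [ebp+8].
    offsets = [f"[ebp+{8 + 4 * i}]" for i in range(arg_count)]
    cleanup = 4 * arg_count  # bytes removed by `ret N`
    # C2 is followed by a 16-bit little-endian immediate.
    encoding = bytes([0xC2]) + cleanup.to_bytes(2, "little")
    return {"args": offsets, "ret": f"ret {cleanup}", "bytes": encoding.hex(" ").upper()}

print(stdcall_frame(2))
```

For two arguments this reproduces the values in the listing: arguments at [ebp+8] and [ebp+12], and `ret 8` encoded as C2 08 00.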

Assembly language is the primary tool for reverse engineering and malware analysis. Disassemblers convert the machine code within PE files back into assembly. By reading the assembly, analysts can understand the program's logic, identify algorithms, track data flow, pinpoint API calls, and discover vulnerabilities or malicious behavior, even without the original source code. Recognizing standard patterns like function prologues/epilogues and calling conventions (like stdcall for WinAPIs) is essential for efficient analysis.

Memory Models & Virtual Memory


Process Memory Layout: Stack, Heap, Code, Data

When a program runs, the operating system allocates a virtual address space for it, typically organized into several key regions:

Typical Process Layout

High addresses
  Kernel Space (OS, inaccessible from user mode)
  Stack (grows down ↓): function calls, local variables
  Free / unallocated memory
  Heap (grows up ↑): dynamic allocation (malloc/new)
  BSS (uninitialized data)
  .data (initialized data)
  .text (executable code)
Low addresses (often includes an unmapped null page)

The Stack

A LIFO (Last-In-First-Out) structure managed automatically by the CPU/compiler. Grows towards lower memory addresses.

  • Stores function return addresses.
  • Holds local variables declared within functions.
  • Used to pass arguments (in some calling conventions).
  • Fast allocation/deallocation (just move stack pointer).
  • Limited size, susceptible to stack buffer overflows.

The Heap

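A region for dynamically allocated memory (using malloc, new). In the classic layout it grows towards higher memory addresses, although modern allocators may place heap blocks at varied locations in the address space.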

  • Used for data whose size isn't known at compile time.
  • Used for data that needs to outlive the function that created it.
  • Slower allocation/deallocation (requires memory management).
  • Larger size available, susceptible to heap overflows, use-after-free, etc.

Virtual Memory vs. Physical Memory

Modern OSes use virtual memory to give each process its own private, contiguous address space, isolating it from other processes and the underlying physical RAM:

Physical Memory (RAM)

  • The actual hardware memory chips.
  • A limited, shared resource managed by the OS kernel.
  • OS maps parts of physical RAM to different processes' virtual addresses.

Virtual Memory

  • An abstraction provided by the OS and CPU's Memory Management Unit (MMU).
  • Each process gets its own large, linear address space (e.g., 4GB for 32-bit, much larger for 64-bit).
  • Addresses used by the program (pointers, EIP/RIP) are virtual addresses.
  • The MMU translates virtual addresses to physical addresses on-the-fly.
  • Allows for memory protection (read/write/execute permissions per page).
  • Enables features like paging (swapping data to disk).

PE files are designed entirely around this virtual memory concept. Addresses within the PE file (like the entry point or section locations) are virtual addresses (or RVAs relative to a virtual base address).

Image Base Address & Relocation

The Image Base is the preferred starting virtual address where the OS loader attempts to map the PE file into memory:

  • Defined in OptionalHeader.ImageBase.
  • Typical defaults: 0x00400000 (32-bit EXE), 0x10000000 (32-bit DLL), 0x0000000140000000 (64-bit EXE).
  • If this preferred address is available (and ASLR doesn't override it), the file is loaded there.
  • If the address is occupied (e.g., by another DLL), the loader must place the module elsewhere. This is called rebasing.
  • When rebasing occurs, any hardcoded absolute virtual addresses within the module's code/data become incorrect.
  • The Base Relocation Table (.reloc section, pointed to by Data Directory entry 5) contains a list of locations within the image that need to be "fixed up" by adding the difference between the actual load address and the preferred ImageBase.
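  • Older EXEs were often linked on the assumption that they would load at their ImageBase, with relocation information stripped. Modern ASLR-aware EXEs keep their relocations so they can be loaded at a randomized address, and DLLs almost always include relocation information because they are likely to be rebased.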
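
A single fix-up amounts to adding the load delta to a stored address. The following Python sketch applies one 32-bit fix-up of the kind used by IMAGE_REL_BASED_HIGHLOW entries to a toy buffer; it does not parse a real .reloc section, and `apply_fixup` plus the sample addresses are illustrative assumptions.

```python
import struct

def apply_fixup(image: bytearray, offset: int, preferred_base: int, actual_base: int) -> None:
    """Adjust one 32-bit absolute address stored at `offset` in the image."""
    delta = actual_base - preferred_base
    (old,) = struct.unpack_from("<I", image, offset)
    struct.pack_into("<I", image, offset, (old + delta) & 0xFFFFFFFF)

# A 4-byte slot holding the absolute address 0x10001234 (preferred base 0x10000000).
image = bytearray(struct.pack("<I", 0x10001234))
apply_fixup(image, 0, preferred_base=0x10000000, actual_base=0x6F000000)
print(hex(struct.unpack("<I", image)[0]))  # 0x6f001234
```

The RVA part of the address (0x1234) is unchanged; only the base moves.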

PE File Memory Mapping Process

When a PE file is executed, the Windows loader performs a detailed sequence of steps to load it into virtual memory:

  1. Read Headers: Parse the DOS MZ Header to find e_lfanew, jump to that offset, validate the PE Signature ('PE\0\0'), and then parse the COFF Header and the crucial Optional Header.
  2. Reserve Address Space: Based on OptionalHeader.ImageBase and OptionalHeader.SizeOfImage, reserve a contiguous block of virtual address space. If ASLR is enabled and supported (DYNAMIC_BASE flag), the OS chooses a randomized base address instead of the preferred ImageBase. If the preferred/randomized address is unavailable, the loader attempts to find another free block (rebasing).
  3. Map Sections: Iterate through the Section Table (using NumberOfSections from the COFF Header). For each IMAGE_SECTION_HEADER:
    • Calculate the target memory address: Actual Load Address + SectionHeader.VirtualAddress.
    • Allocate virtual memory pages for the section based on SectionHeader.VirtualSize, respecting SectionAlignment.
    • Copy the section's raw data from the file (from offset SectionHeader.PointerToRawData, length SectionHeader.SizeOfRawData) into the allocated virtual memory. Note that VirtualSize can be larger than SizeOfRawData (e.g., for .bss), in which case the extra space is zero-filled.
    • Set initial memory page protections (Read/Write/Execute) based on SectionHeader.Characteristics.
  4. Process Imports (Recursively): Examine the Import Table (via Data Directory 1). For each required DLL:
    • Check if the DLL is already loaded in the process. If not, load it by performing these same steps (1-7) for the DLL. This can trigger loading of further dependencies.
    • Once the DLL is loaded, get the actual memory addresses of the functions listed in the Import Name Table (INT) / OriginalFirstThunk.
    • Write these actual function addresses into the Import Address Table (IAT) / FirstThunk for the module being loaded.
  5. Perform Base Relocations: If the module was rebased (loaded at an address different from OptionalHeader.ImageBase), process the Base Relocation Table (via Data Directory 5). This table lists all the locations in the code/data that contain absolute addresses which need to be adjusted ("fixed up") based on the difference between the actual load address and the preferred ImageBase.
  6. Set Final Memory Protections: Apply the final, potentially stricter, memory protections based on section characteristics and system policies (like DEP). For example, code sections typically become Read+Execute, data sections Read+Write (or Read-Only for .rdata).
  7. TLS Callbacks: If a Thread Local Storage table exists (via Data Directory 9) and contains callback function pointers, execute these callbacks.
  8. Transfer Execution: Finally, set up the initial thread context and jump to the module's entry point RVA (OptionalHeader.AddressOfEntryPoint added to the actual load address).
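
Step 3 (mapping a section) can be sketched in Python with plain buffers. This is a simplified model under stated assumptions: the "image" is a zero-filled bytearray standing in for the reserved address space, alignment and page protections are ignored, and `map_section` and its sample values are hypothetical.

```python
def map_section(image: bytearray, file_data: bytes, virtual_address: int,
                virtual_size: int, pointer_to_raw_data: int, size_of_raw_data: int) -> None:
    """Copy one section's raw bytes into a zero-filled in-memory image buffer."""
    # This sketch copies at most VirtualSize bytes; anything beyond the raw
    # data stays zero, which is how a .bss-style tail ends up zero-filled.
    count = min(size_of_raw_data, virtual_size)
    raw = file_data[pointer_to_raw_data:pointer_to_raw_data + count]
    image[virtual_address:virtual_address + len(raw)] = raw

file_data = bytes(0x200) + b"\xAA" * 0x10   # raw section data at file offset 0x200
image = bytearray(0x3000)                    # stands in for SizeOfImage bytes
map_section(image, file_data, virtual_address=0x1000, virtual_size=0x40,
            pointer_to_raw_data=0x200, size_of_raw_data=0x10)

print(image[0x1000:0x1010].hex())     # sixteen AA bytes copied from the file
print(any(image[0x1010:0x1040]))      # False: the rest of the section is zero
```

The example shows why file offsets and memory addresses differ: the bytes at file offset 0x200 end up at RVA 0x1000.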

Memory Allocation, Compilers, & Data Sections

How variables and constants end up in specific PE sections is largely determined by the compiler (like GCC, Clang, MSVC) and linker based on C/C++ (or other language) declarations:

C/C++ Example               Storage Class              Typical PE Section      Memory Permissions  Initialized?
-------------               -------------              ------------------      ------------------  ------------
void func() { int x; }      Local Automatic            Stack (not in PE file)  Read/Write          No (garbage)
int global_y = 10;          Global Initialized         .data                   Read/Write          Yes (value 10 stored in file)
static int static_z = 20;   Static Initialized         .data                   Read/Write          Yes (value 20 stored in file)
int global_a;               Global Uninitialized       .bss                    Read/Write          No (zeroed by loader)
static int static_b;        Static Uninitialized       .bss                    Read/Write          No (zeroed by loader)
const char* str = "Hello";  Constant / String Literal  .rdata (often)          Read-Only           Yes (values stored in file)
const int val = 5;          Constant / String Literal  .rdata (often)          Read-Only           Yes (values stored in file)
int* ptr = new int;         Dynamic Allocation         Heap (not in PE file)   Read/Write          Varies (by allocator)

The compiler makes optimization decisions (e.g., placing truly constant data in .rdata, pooling identical strings). The linker then gathers all the code and data generated by the compiler (from potentially multiple source files and libraries) and arranges them into the final PE sections according to rules and directives.

Understanding memory layout, virtual memory, and the loading process is crucial for malware analysis. Attacks like stack/heap overflows, Return-Oriented Programming (ROP), and process injection directly manipulate these memory structures. Malware might try to load at unusual ImageBases, map sections with incorrect permissions (e.g., writable code), or abuse the relocation process. Analyzing memory dumps requires knowing where different types of data (code, stack, heap, imports) reside in the virtual address space.

File Identification: Digital File Structure and Signatures


Understanding Digital File Signatures

Digital files are not just random sequences of bytes - they follow specific formats that help operating systems and applications identify and process them correctly:

Magic Numbers and File Signatures

  • PE Files: Begin with "MZ" (4D 5A) at offset 0, and "PE\0\0" (50 45 00 00) at the PE header offset.
  • ELF Files: Start with 7F 45 4C 46 (DEL + "ELF").
  • Java Class: Begin with CA FE BA BE.
  • .NET Assemblies: Use the PE format but contain a CLR header.
  • Legacy Office Documents (DOC/XLS/PPT): Usually begin with D0 CF 11 E0 (Compound File Binary Format).
  • ZIP-based: Start with "PK\x03\x04" (50 4B 03 04), including:
    • JAR files (Java Archives)
    • APK files (Android Packages)
    • DOCX/XLSX/PPTX (Modern Office)
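
A first-pass identifier based on the signatures above can be sketched in a few lines of Python. The table and the `identify` function are illustrative; real tools use much larger signature databases, and a leading MZ on its own does not prove that a file is a valid PE.

```python
SIGNATURES = [
    (bytes.fromhex("4D5A"), "MZ executable (possibly PE)"),
    (bytes.fromhex("7F454C46"), "ELF"),
    (bytes.fromhex("CAFEBABE"), "Java class"),
    (bytes.fromhex("D0CF11E0"), "Compound File Binary (legacy Office)"),
    (bytes.fromhex("504B0304"), "ZIP container (JAR, APK, DOCX, ...)"),
]

def identify(data: bytes) -> str:
    """Guess a file type from its leading magic bytes."""
    for magic, name in SIGNATURES:
        if data.startswith(magic):
            return name
    return "unknown"

print(identify(b"\x7fELF\x02\x01\x01"))   # ELF
print(identify(b"PK\x03\x04\x14\x00"))    # ZIP container (JAR, APK, DOCX, ...)
```

Telling a JAR from an APK or DOCX requires looking inside the ZIP container, since all three share the same leading bytes.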

These signatures serve multiple purposes:

  • Quick file type identification without parsing the whole file
  • Validation of file integrity and format
  • Prevention of accidental misuse (e.g., trying to execute non-executable files)
  • Historical compatibility (e.g., MZ header for DOS)

File Headers and Metadata Structures

Most modern file formats include sophisticated header structures that provide metadata about the file's contents and organization:

Common Header Elements

  • Signature/Magic Number: Identifies the file type
  • Version Information: Format version, compatibility flags
  • Size Fields: File/content sizes, offsets to important structures
  • Checksums/Hashes: For integrity verification
  • Timestamps: Creation, modification dates
  • Feature Flags: Indicates supported features or restrictions

PE Format Header Chain

The PE format demonstrates a sophisticated header chain design:

                    +---------------------------+
                    |        File Start         |
                    |     DOS Header (MZ)       |
                    +---------------------------+
                                │
                                ▼
                    +---------------------------+
                    |         DOS Stub          |
                    |  Optional DOS Program     |
                    +---------------------------+
                                │
                                ▼
                    +---------------------------+
                    |         PE Header         |
                    | PE Signature + File Header|
                    +---------------------------+
                                │
                                ▼
                    +---------------------------+
                    |      Optional Header      |
                    | Windows-Specific Fields   |
                    +---------------------------+
                                │
                                ▼
                    +---------------------------+
                    |       Section Table       |
                    |   Section Definitions     |
                    +---------------------------+
                                │
                                ▼
                    +---------------------------+
                    |       Section Data        |
                    |     Actual Content        |
                    +---------------------------+
                 

Rigorous File Identification

Proper file identification involves more than just checking signatures:

Multi-Layer Validation

  • Signature Checking:
    • Verify magic numbers at correct offsets
    • Check for secondary signatures (e.g., PE after MZ)
    • Validate header checksums
  • Structural Validation:
    • Parse and validate header fields
    • Verify pointer/offset validity
    • Check section alignment and sizes
  • Content Analysis:
    • Validate internal data structures
    • Check for format-specific markers
    • Analyze entropy and patterns
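
The signature and structural layers above can be sketched in a few lines of Python. This is a minimal illustration for PE files only, not a complete validator; the function name check_pe_layers, its messages, and the synthetic test input are invented for this example.

```python
import struct

def check_pe_layers(data: bytes) -> list:
    """Return a list of problems found; an empty list means the basic checks passed."""
    # Signature layer: primary magic number at offset 0
    if len(data) < 0x40 or data[:2] != b"MZ":
        return ["missing MZ signature or truncated DOS header"]
    # Structural layer: the pointer stored at 0x3C must land inside the file
    (e_lfanew,) = struct.unpack_from("<I", data, 0x3C)
    if e_lfanew + 4 > len(data):
        return ["e_lfanew points outside the file"]
    # Signature layer again: secondary signature at the location the pointer names
    if data[e_lfanew:e_lfanew + 4] != b"PE\0\0":
        return ["no PE signature at e_lfanew"]
    return []

# Hypothetical input: a 64-byte DOS header followed directly by the PE signature
sample = bytearray(0x44)
sample[0:2] = b"MZ"
struct.pack_into("<I", sample, 0x3C, 0x40)
sample[0x40:0x44] = b"PE\0\0"
print(check_pe_layers(bytes(sample)))
```

A real validator would continue with the remaining header fields and section bounds; the point here is only that each pointer is checked before it is followed.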

Security Implications

Thorough file identification is crucial for security:

  • Prevents file type confusion attacks
  • Identifies malformed or crafted files
  • Detects attempts to bypass file type restrictions
  • Helps identify packed or obfuscated malware

File Format Evolution

File formats have evolved to meet changing needs:

Historical Progression

  • Early Era (1960s-70s):
    • Simple binary formats
    • No standardized headers
    • Platform-specific designs
  • Standardization Era (1980s-90s):
    • Introduction of magic numbers
    • Structured headers
    • Cross-platform considerations
  • Modern Era (2000s+):
    • Complex metadata structures
    • Security features
    • Extensible designs
    • Container formats (e.g., ZIP-based)

File identification and format understanding are fundamental to malware analysis and reverse engineering. Malware authors often manipulate file headers and structures to evade detection or confuse analysis tools. A deep understanding of file formats enables analysts to:

  • Identify malformed or suspicious files
  • Detect attempts to hide malicious content
  • Understand packing and obfuscation techniques
  • Extract and analyze embedded payloads
  • Reconstruct damaged or manipulated files

Early Executable Formats: The Precursors


The Dawn: Raw Machine Code & Punch Cards

In the earliest days of computing (ENIAC, UNIVAC), there wasn't really an "executable format" as we know it. Programs were:

  • Entered via physical switches or wiring plugboards.
  • Loaded from punch cards or paper tape containing raw machine instructions.
  • Loaded directly into specific memory locations.
  • Execution started by manually setting the instruction pointer.
  • No metadata, no OS loader assistance, just raw bytes loaded and run.

The .COM Era (CP/M, Early MS-DOS)

The .COM (Command) file format was a step up, but still incredibly simple:

  • Structure: Essentially formatless. The file is just raw x86 machine code.
  • Loading: The OS allocated a 64KB memory segment, loaded the entire file content starting at offset 0x100 within that segment, set all segment registers (CS, DS, ES, SS) to point to the start of the segment, set SP to the end of the segment, and jumped to 0x100 to start execution.
  • Size Limit: Maximum size was 65,280 bytes (64KB - 256 bytes for the PSP).
  • No Metadata: No header, no relocation info, no import/export tables. Everything (code, data, stack) had to fit and manage itself within the single 64KB segment.
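  • Relocatability: No relocation support. The file carries no fixup information. Because every address is a 16-bit offset within the one segment, DOS can place the image in any free segment, but the code always sits at offset 0x100 and cannot refer to other segments unless it computes them at run time.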
; Example COM program structure (NASM syntax)
org 0x100     ; Tell assembler code starts at 0x100

section .text
start:
    mov ah, 9       ; DOS function: Print string
    mov dx, message ; Address of string
    int 21h         ; Call DOS interrupt

    mov ah, 4Ch     ; DOS function: Terminate program
    int 21h

section .data
message db 'Hello from COM!', 0Dh, 0Ah, '$' ; String must end with '$'

Simple, but extremely limited for larger, more complex programs.
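
To make the effect of the 0x100 load offset concrete, here is a hand-encoded equivalent of the program above, built in Python. It is a sketch under one assumption: the string directly follows the code with no padding (an assembler may insert alignment bytes between sections, which would shift the address). The variable names are illustrative.

```python
COM_LOAD_OFFSET = 0x100                       # DOS loads the image right after the 256-byte PSP
MAX_COM_SIZE = 0x10000 - COM_LOAD_OFFSET      # one 64KB segment minus the PSP

code = bytes([
    0xB4, 0x09,        # mov ah, 9
    0xBA, 0x00, 0x00,  # mov dx, message   (16-bit operand patched below)
    0xCD, 0x21,        # int 21h
    0xB4, 0x4C,        # mov ah, 4Ch
    0xCD, 0x21,        # int 21h
])
message = b"Hello from COM!\r\n$"

# The string's run-time address is the load offset plus its position in the file.
message_addr = COM_LOAD_OFFSET + len(code)
image = bytearray(code + message)
image[3:5] = message_addr.to_bytes(2, "little")

print(hex(message_addr), len(image), MAX_COM_SIZE)
```

The file itself contains no trace of the value 0x100 other than the addresses baked into the instructions, which is why the loader must always place the image at that offset.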

The MZ Revolution (.EXE in MS-DOS)

The .EXE format, identified by the "MZ" signature (for Mark Zbikowski), was a major leap forward introduced with MS-DOS:

  • Structure: Introduced the first real header: the MZ Header.
  • MZ Header: Contained metadata like file size, initial stack segment/pointer, entry point (CS:IP), and crucially, a Relocation Table.
  • Relocatability: The relocation table listed segment addresses within the code/data that needed to be "fixed up" by the DOS loader based on the actual memory segment where the program was loaded. This allowed EXEs to be loaded anywhere in memory.
  • Multi-Segment Support: Allowed programs to use multiple code and data segments, breaking the 64KB barrier of COM files.
  • No Imports/Exports Yet: Still lacked standardized ways to link dynamically with other code modules (libraries).
; Conceptual MZ EXE Structure
┌─────────────────────┐
│ MZ Header           │ Contains file size, entry point (CS:IP),
│ (IMAGE_DOS_HEADER)  │ initial SS:SP, relocation table offset...
├─────────────────────┤
│ Relocation Table    │ List of segment addresses needing fixup
├─────────────────────┤
│                     │
│ Program Code & Data │ Loaded into memory based on header info
│ (Load Module)       │
│                     │
└─────────────────────┘

The MZ header is still present at the beginning of modern PE files, primarily for backward compatibility and to point to the real PE header via the e_lfanew field.
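
The fixup step performed by the DOS loader can be sketched as follows. This is a simplified model, assuming the relocation table has already been read into a list of (offset, segment) pairs; the function name and the tiny example module are hypothetical.

```python
import struct

def apply_mz_relocations(load_module: bytearray, relocs, start_segment: int) -> None:
    """Add start_segment to every 16-bit segment value named by the relocation table.

    relocs holds (offset, segment) pairs relative to the start of the load module.
    """
    for offset, segment in relocs:
        pos = segment * 16 + offset            # linear position inside the load module
        (value,) = struct.unpack_from("<H", load_module, pos)
        struct.pack_into("<H", load_module, pos, (value + start_segment) & 0xFFFF)

# Hypothetical 32-byte load module holding one far pointer at byte 4:
# offset part 0x0010, segment part 0x0001 (the segment word lives at byte 6).
module = bytearray(32)
struct.pack_into("<HH", module, 4, 0x0010, 0x0001)
apply_mz_relocations(module, [(6, 0)], start_segment=0x2000)
print(module[4:8].hex())
```

Only the segment half of the pointer changes; the offset half stays as the linker wrote it.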

The evolution from raw code to COM and then MZ EXE files demonstrates the increasing need for metadata and flexibility as programs became more complex. COM files were simple but restrictive. MZ EXEs introduced headers and relocation, enabling larger programs that could load anywhere in memory. However, they still lacked features like dynamic linking and robust memory protection found in modern formats. This historical context helps understand why the PE format includes elements like the MZ header and why features like relocation tables were developed.

PE Format Introduction: The Modern Standard


The Portable Executable (PE) Format

Introduced with Windows NT, the Portable Executable (PE) format is the standard for executables, object code, DLLs, and others on 32-bit and 64-bit versions of Windows. It's derived from the Unix COFF (Common Object File Format) specification and adds features specific to Windows.

Key Goals and Characteristics

  • Portability: Designed to support multiple CPU architectures (though primarily used for x86/x64). The COFF header specifies the target machine.
  • Extensibility: Supports various data types beyond code and basic data, like resources, debug info, digital signatures, and .NET metadata via Data Directories.
  • Virtual Memory Centric: Designed explicitly for paged, protected virtual memory operating systems. Addresses and layout are defined in terms of virtual addresses.
  • Dynamic Linking: Rich support for importing functions from DLLs and exporting functions for others to use (Import/Export Tables).
  • Section-Based Layout: Organizes the file into logical sections (.text, .data, .rsrc, etc.) with specific memory permissions (Read/Write/Execute). This organization is typically determined by the linker tool, which combines compiled code and data.

The PE format is used for nearly all executable content on Windows:

  • .exe: Applications
  • .dll: Dynamic Link Libraries
  • .sys: Kernel-mode Drivers
  • .ocx: ActiveX Controls
  • .cpl: Control Panel Applets
  • .scr: Screen Savers
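  • Object files (.obj) produced during compilation use the closely related COFF layout: the same file header and section table, but no MZ header, PE signature, or Optional Header.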

High-Level PE Structure Overview

A PE file follows a well-defined structure, starting with legacy headers and progressing to Windows-specific information:

DOS MZ Header (Legacy compatibility, points to PE Header via e_lfanew)
DOS Stub (Optional, runs in DOS mode)
PE Signature ("PE\0\0") (Marks start of PE structures)
COFF / Image File Header (Machine type, # sections, timestamp, optional header size)
Optional Header (Entry point, ImageBase, sizes, subsystem, Data Directories... *Required* for executables)
Section Table / Headers (Defines layout/properties of each section)
Section 1 Data (e.g., .text - Code)
Section 2 Data (e.g., .data - Initialized Data)
Section 3 Data (e.g., .rsrc - Resources)
...
Section N Data (e.g., .idata - Import Data)
Other Data (Debug Info, Overlay, etc.)

Note: Section order in the file doesn't necessarily match memory layout.

PE32 vs PE32+ (64-bit)

The PE format adapts for 32-bit and 64-bit architectures, primarily within the Optional Header:

PE32 (32-bit)

  • Optional Header Magic Number: 0x10B (IMAGE_NT_OPTIONAL_HDR32_MAGIC)
  • Addresses/Sizes (like ImageBase, stack sizes): 32-bit (DWORD)
  • Includes BaseOfData field in Optional Header.
  • Designed for 32-bit address space.

PE32+ (64-bit)

  • Optional Header Magic Number: 0x20B (IMAGE_NT_OPTIONAL_HDR64_MAGIC)
  • Addresses/Sizes (like ImageBase, stack sizes): 64-bit (ULONGLONG or DWORD64)
  • Omits BaseOfData field.
  • Designed for 64-bit address space.
  • Structurally very similar to PE32, just wider fields for addresses.

Tools analyzing PE files must check the Magic number to parse the Optional Header correctly.
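
A short sketch of that check, reading ImageBase as an example of a field whose position and width depend on the Magic value. The function name is illustrative, and the input is assumed to be the raw bytes of the Optional Header.

```python
import struct

PE32_MAGIC = 0x10B
PE32_PLUS_MAGIC = 0x20B

def read_image_base(optional_header: bytes) -> tuple:
    """Return (format_name, ImageBase) from raw Optional Header bytes."""
    (magic,) = struct.unpack_from("<H", optional_header, 0)
    if magic == PE32_MAGIC:
        # PE32: BaseOfData occupies offset 24, so the 4-byte ImageBase sits at offset 28
        (image_base,) = struct.unpack_from("<I", optional_header, 28)
        return "PE32", image_base
    if magic == PE32_PLUS_MAGIC:
        # PE32+: no BaseOfData, so the 8-byte ImageBase starts at offset 24
        (image_base,) = struct.unpack_from("<Q", optional_header, 24)
        return "PE32+", image_base
    raise ValueError(f"unexpected Optional Header magic 0x{magic:X}")
```

Rejecting unknown Magic values outright, rather than guessing a layout, keeps a parser from misreading crafted files.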

The PE format is the container for almost all executable code on Windows. For malware analysts, it's the first thing encountered. Understanding its structure is fundamental. Analyzing the PE headers and section layout provides initial clues about a sample's nature: Is it packed? Is it a DLL or EXE? What architecture does it target? Does it import suspicious functions? Does it contain unusual resources? Are security features like ASLR/DEP enabled? Mastering PE structure is step one in static malware analysis.

PE Headers In Detail: The Blueprint


DOS MZ Header (IMAGE_DOS_HEADER)

The very first part of a PE file, a remnant from DOS days. Starts with the signature 'MZ' (4D 5A in hex).

typedef struct _IMAGE_DOS_HEADER {      // DOS .EXE header
    WORD   e_magic;                     // Magic number (0x5A4D)
    WORD   e_cblp;                      // Bytes on last page of file
    WORD   e_cp;                        // Pages in file
    WORD   e_crlc;                      // Relocations
    WORD   e_cparhdr;                   // Size of header in paragraphs
    WORD   e_minalloc;                  // Minimum extra paragraphs needed
    WORD   e_maxalloc;                  // Maximum extra paragraphs needed
    WORD   e_ss;                        // Initial (relative) SS value
    WORD   e_sp;                        // Initial SP value
    WORD   e_csum;                      // Checksum
    WORD   e_ip;                        // Initial IP value
    WORD   e_cs;                        // Initial (relative) CS value
    WORD   e_lfarlc;                    // File address of relocation table
    WORD   e_ovno;                      // Overlay number
    WORD   e_res[4];                    // Reserved words
    WORD   e_oemid;                     // OEM identifier (for e_oeminfo)
    WORD   e_oeminfo;                   // OEM information; e_oemid specific
    WORD   e_res2[10];                  // Reserved words
    LONG   e_lfanew;                    // File address of PE header
} IMAGE_DOS_HEADER, *PIMAGE_DOS_HEADER;

Key Fields for PE:

  • e_magic: Must be 0x5A4D ('MZ'). Identifies the file as potentially executable.
  • e_lfanew: Crucial. This 4-byte value at offset 0x3C gives the file offset where the actual PE Signature and Headers begin.

A small "DOS stub" program often follows this header, which prints "This program cannot be run in DOS mode" if executed on DOS.
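
The structure above maps directly onto a Python struct format, which is a convenient way to confirm the offsets. This is a minimal sketch; parse_dos_header and the fields it returns are choices made for this example.

```python
import struct

# 30 WORDs (e_magic through e_res2) followed by one signed 32-bit LONG (e_lfanew)
DOS_HEADER = struct.Struct("<30Hl")

def parse_dos_header(data: bytes) -> dict:
    fields = DOS_HEADER.unpack_from(data, 0)
    return {
        "e_magic": fields[0],       # 0x5A4D when the file starts with "MZ"
        "e_cparhdr": fields[4],     # header size in 16-byte paragraphs
        "e_lfanew": fields[30],     # last field, at offset 0x3C
    }

print(DOS_HEADER.size)   # total header size in bytes
```

Because the 30 WORDs occupy 60 bytes, e_lfanew falls at offset 60 (0x3C), matching the description above.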

PE Signature & COFF/Image File Header (IMAGE_FILE_HEADER)

At the offset specified by e_lfanew, we find:

  1. PE Signature: 4 bytes - 50 45 00 00 ('P' 'E' \0 \0).
  2. COFF / Image File Header: Contains basic properties of the file.
typedef struct _IMAGE_FILE_HEADER {
    WORD    Machine;                   // Target architecture (e.g., 0x14c=x86, 0x8664=x64)
    WORD    NumberOfSections;          // How many sections follow the headers
    DWORD   TimeDateStamp;             // Linker timestamp (seconds since Unix epoch)
    DWORD   PointerToSymbolTable;      // File offset of COFF symbol table (usually 0)
    DWORD   NumberOfSymbols;           // Number of entries in symbol table (usually 0)
    WORD    SizeOfOptionalHeader;      // Size of the *next* header (Optional Header)
    WORD    Characteristics;           // Flags describing the file (e.g., executable image, DLL, large-address aware)
} IMAGE_FILE_HEADER, *PIMAGE_FILE_HEADER;

Key Fields:

  • Machine: Identifies the target CPU (IMAGE_FILE_MACHINE_I386, IMAGE_FILE_MACHINE_AMD64, etc.).
  • NumberOfSections: Tells the loader how many section headers to read from the Section Table.
  • TimeDateStamp: Can sometimes indicate compilation time, but easily forged by malware.
  • SizeOfOptionalHeader: Size of the next structure (IMAGE_OPTIONAL_HEADER).
  • Characteristics: Important flags like:
    • IMAGE_FILE_EXECUTABLE_IMAGE (0x0002): File is runnable.
    • IMAGE_FILE_DLL (0x2000): File is a DLL.
    • IMAGE_FILE_LARGE_ADDRESS_AWARE (0x0020): App can handle >2GB addresses (32-bit).
    • IMAGE_FILE_RELOCS_STRIPPED (0x0001): No relocation info (bad for DLLs/ASLR).
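
A sketch of reading this header and decoding the fields discussed above. The lookup tables cover only the values named in this section, and the function name and returned dictionary keys are illustrative.

```python
import struct
from datetime import datetime, timezone

FILE_HEADER = struct.Struct("<HHIIIHH")   # 20 bytes, same field order as the struct above

MACHINES = {0x014C: "x86", 0x8664: "x64"}
CHARACTERISTICS = {
    0x0001: "RELOCS_STRIPPED",
    0x0002: "EXECUTABLE_IMAGE",
    0x0020: "LARGE_ADDRESS_AWARE",
    0x2000: "DLL",
}

def parse_file_header(data: bytes, e_lfanew: int) -> dict:
    # Skip the 4-byte "PE\0\0" signature that precedes the header
    machine, nsections, stamp, _symtab, _nsyms, opt_size, flags = \
        FILE_HEADER.unpack_from(data, e_lfanew + 4)
    return {
        "machine": MACHINES.get(machine, hex(machine)),
        "sections": nsections,
        "linked": datetime.fromtimestamp(stamp, tz=timezone.utc),  # easily forged
        "optional_header_size": opt_size,
        "flags": [name for bit, name in CHARACTERISTICS.items() if flags & bit],
    }
```

Unknown machine values are reported as hex rather than rejected, since the format supports many more architectures than the two listed here.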

Optional Header (IMAGE_OPTIONAL_HEADER32 / IMAGE_OPTIONAL_HEADER64)

Despite the name, this header is required for executable images (EXEs, DLLs). It contains the most critical information for the OS loader.

// Field widths shown are for PE32+ (64-bit). In PE32, ImageBase and the stack/heap
// size fields are DWORDs, and BaseOfData follows BaseOfCode.
typedef struct _IMAGE_OPTIONAL_HEADER {
    // Standard COFF fields.
    WORD    Magic;                     // 0x10b = PE32, 0x20b = PE32+ (64-bit)
    BYTE    MajorLinkerVersion;
    BYTE    MinorLinkerVersion;
    DWORD   SizeOfCode;                // Sum of all code sections' size
    DWORD   SizeOfInitializedData;
    DWORD   SizeOfUninitializedData;   // Size of .bss section
    DWORD   AddressOfEntryPoint;       // RVA where execution starts
    DWORD   BaseOfCode;                // RVA of the beginning of the code section
    // DWORD   BaseOfData;             // RVA of beginning of data section (PE32 only!)

    // NT additional fields.
    ULONGLONG ImageBase;               // Preferred load address (64-bit in PE32+)
    DWORD   SectionAlignment;          // Alignment (in bytes) of sections in memory
    DWORD   FileAlignment;             // Alignment (in bytes) of sections in file
    WORD    MajorOperatingSystemVersion;
    /* ... other version fields ... */
    DWORD   SizeOfImage;               // Total size of the image in memory
    DWORD   SizeOfHeaders;             // Size of DOS hdr + PE sig + COFF hdr + Opt hdr + Section hdrs
    DWORD   CheckSum;                  // Image file checksum (often 0)
    WORD    Subsystem;                 // Target subsystem (e.g., Windows GUI, Console)
    WORD    DllCharacteristics;        // Flags like ASLR, DEP, CFG support
    ULONGLONG SizeOfStackReserve;      // Total stack size to reserve (64-bit in PE32+)
    /* ... other stack/heap size fields ... */
    DWORD   NumberOfRvaAndSizes;       // Number of entries in DataDirectory (usually 16)
    IMAGE_DATA_DIRECTORY DataDirectory[IMAGE_NUMBEROF_DIRECTORY_ENTRIES]; // Array of directory entries
} IMAGE_OPTIONAL_HEADER;

Key Fields:

  • Magic: Distinguishes PE32 (0x10B) from PE32+ (0x20B).
  • AddressOfEntryPoint: RVA of the first instruction to execute. Crucial for analysis.
  • ImageBase: Preferred virtual address for loading.
  • SectionAlignment / FileAlignment: Dictate how sections are aligned in memory vs. the file. Must be powers of 2.
  • SizeOfImage: Total virtual size needed when mapped into memory.
  • SizeOfHeaders: Combined size of all headers, rounded up to FileAlignment. Defines where the first section's data starts in the file.
  • Subsystem: (IMAGE_SUBSYSTEM_WINDOWS_GUI, _CONSOLE, _NATIVE, etc.).
  • DllCharacteristics: Security flags (IMAGE_DLLCHARACTERISTICS_DYNAMIC_BASE (ASLR), _NX_COMPAT (DEP), _GUARD_CF (CFG)).
  • DataDirectory: Array pointing to other important data structures (Imports, Exports, Resources, Relocations, etc.).

Data Directories (IMAGE_DATA_DIRECTORY)

The last field of the Optional Header is an array (typically 16 entries) of IMAGE_DATA_DIRECTORY structures. Each entry points to a specific table or data structure within the PE file, if present.

typedef struct _IMAGE_DATA_DIRECTORY {
    DWORD   VirtualAddress; // RVA of the data/table
    DWORD   Size;           // Size in bytes of the data/table
} IMAGE_DATA_DIRECTORY, *PIMAGE_DATA_DIRECTORY;

// Indices into the DataDirectory array:
#define IMAGE_DIRECTORY_ENTRY_EXPORT          0   // Export Table (.edata)
#define IMAGE_DIRECTORY_ENTRY_IMPORT          1   // Import Table (.idata)
#define IMAGE_DIRECTORY_ENTRY_RESOURCE        2   // Resource Table (.rsrc)
#define IMAGE_DIRECTORY_ENTRY_EXCEPTION       3   // Exception Table (.pdata)
#define IMAGE_DIRECTORY_ENTRY_SECURITY        4   // Certificate Table (Attribute Certificates)
#define IMAGE_DIRECTORY_ENTRY_BASERELOC       5   // Base Relocation Table (.reloc)
#define IMAGE_DIRECTORY_ENTRY_DEBUG           6   // Debug Directory
// ... Architecture Specific (7) ...
// ... Global Ptr (8) ...
#define IMAGE_DIRECTORY_ENTRY_TLS             9   // TLS Table
// ... Load Config (10) ...
#define IMAGE_DIRECTORY_ENTRY_BOUND_IMPORT   11   // Bound Import Table
#define IMAGE_DIRECTORY_ENTRY_IAT            12   // Import Address Table
#define IMAGE_DIRECTORY_ENTRY_DELAY_IMPORT   13   // Delay Import Descriptors
#define IMAGE_DIRECTORY_ENTRY_COM_DESCRIPTOR 14   // CLR Runtime Header (.NET)

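If an entry's VirtualAddress and Size are both zero, that directory is not present. These directories are essential for finding imports/exports, resources, relocation info, digital signatures, .NET headers, etc. One exception to the RVA rule: for entry 4 (the certificate table), VirtualAddress holds a file offset, because the certificates are not mapped into memory.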
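
Reading the array can be sketched like this, assuming the raw Optional Header bytes are already in hand. The short directory names and the function name are labels chosen for this example.

```python
import struct

DIRECTORY_NAMES = ["EXPORT", "IMPORT", "RESOURCE", "EXCEPTION", "SECURITY",
                   "BASERELOC", "DEBUG", "ARCHITECTURE", "GLOBALPTR", "TLS",
                   "LOAD_CONFIG", "BOUND_IMPORT", "IAT", "DELAY_IMPORT",
                   "COM_DESCRIPTOR", "RESERVED"]

def read_data_directories(optional_header: bytes) -> dict:
    """Return {name: (VirtualAddress, Size)} for every directory that is present."""
    (magic,) = struct.unpack_from("<H", optional_header, 0)
    # NumberOfRvaAndSizes sits at offset 92 (PE32) or 108 (PE32+); the array follows it
    if magic == 0x10B:
        count_offset = 92
    elif magic == 0x20B:
        count_offset = 108
    else:
        raise ValueError("unknown Optional Header magic")
    (count,) = struct.unpack_from("<I", optional_header, count_offset)
    count = min(count, len(DIRECTORY_NAMES))   # never trust the count blindly
    present = {}
    for i in range(count):
        rva, size = struct.unpack_from("<II", optional_header, count_offset + 4 + 8 * i)
        if rva or size:
            present[DIRECTORY_NAMES[i]] = (rva, size)
    return present
```

Clamping the count guards against files that declare more entries than the header can hold.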

Offset vs. Length: The PE Pattern

A recurring pattern in PE structures is the use of pairs of values to define data:

  • An Offset or Address indicating where the data starts.
  • A Size or Length indicating how much data there is.

Examples:

  • Section Headers: PointerToRawData (file offset) + SizeOfRawData (file length), and VirtualAddress (memory RVA) + VirtualSize (memory length).
  • Data Directories: VirtualAddress (RVA of the table) + Size (size of the table).
  • Resource Entries: Pointers to resource data + size of resource data.
  • Relocation Blocks: RVA of the block + size of the block.

Understanding this pattern is key to navigating the file. Always use the Size field to know how much data to read or parse starting from the given offset/address. Alignment rules (FileAlignment, SectionAlignment) can mean the space occupied is larger than the actual data size.
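
Two small helpers capture the arithmetic behind this pattern: rounding a size up to an alignment boundary, and checking that an offset/size pair stays inside the file before reading it. Both names are illustrative.

```python
def align_up(value: int, alignment: int) -> int:
    """Round value up to the next multiple of alignment (which must be a power of two)."""
    return (value + alignment - 1) & ~(alignment - 1)

def range_in_file(offset: int, size: int, file_size: int) -> bool:
    """True if the byte range [offset, offset + size) lies entirely inside the file."""
    return 0 <= offset <= file_size and 0 <= size <= file_size - offset

# 0x1234 bytes of data occupy 0x1400 bytes on disk with a FileAlignment of 0x200
print(hex(align_up(0x1234, 0x200)))
```

Writing the bounds check as size <= file_size - offset avoids adding two attacker-controlled values together, which matters more in languages with fixed-width integers.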

The PE headers act as the file's blueprint, guiding the OS loader. Malware frequently tampers with these headers for various purposes: anti-analysis (confusing tools by setting invalid sizes/pointers), hiding code (placing the entry point in an unusual section or outside any defined section), modifying characteristics (e.g., marking a data section as executable), or manipulating Data Directories (e.g., hiding imports). Careful examination of all header fields against expected values is a critical step in static analysis.

PE Sections: Organizing Code and Data


Segments vs. Sections

While sometimes used interchangeably, these terms have distinct meanings in the context of PE files and x86 architecture:

  • Segments (Historical x86): A memory management concept from older x86 modes (Real Mode, Protected Mode with segmentation) using segment registers (CS, DS, SS, ES, FS, GS) to define base addresses for memory access. Largely abstracted away in modern flat memory models used by Windows, but segment registers are still used (implicitly or explicitly).
  • Sections (PE File Format): Logical divisions of the PE file defined in the Section Table. Each section groups related content (code, data, resources) and has associated attributes like name, size in file, size in memory, file offset, virtual address (RVA), and memory permissions (Read/Write/Execute). The linker (part of the compiler toolchain, like `link.exe` for MSVC or `ld` for GCC/Clang) is responsible for grouping code and data into these sections.

In essence, the PE loader maps the Sections defined in the file into the process's virtual address space, which is typically treated as a single large Segment (flat memory model) by the running code.

Section Table (Array of IMAGE_SECTION_HEADER)

Immediately following the Optional Header is the Section Table, which is an array of IMAGE_SECTION_HEADER structures. The number of entries in this array is given by IMAGE_FILE_HEADER.NumberOfSections. Each header describes one section:

#define IMAGE_SIZEOF_SHORT_NAME              8

typedef struct _IMAGE_SECTION_HEADER {
    BYTE    Name[IMAGE_SIZEOF_SHORT_NAME]; // 8-byte, null-padded ASCII name (e.g., ".text\0\0\0")
    union {
        DWORD   PhysicalAddress;       // (Historical/Obsolete)
        DWORD   VirtualSize;           // Total size of the section in memory (bytes)
    } Misc;
    DWORD   VirtualAddress;            // RVA of the section's start in memory
    DWORD   SizeOfRawData;             // Size of the section's data in the file (bytes)
    DWORD   PointerToRawData;          // File offset to the section's data
    DWORD   PointerToRelocations;      // File offset to relocations for this section (OBJ files)
    DWORD   PointerToLinenumbers;      // File offset to line numbers (debug)
    WORD    NumberOfRelocations;       // Number of relocation entries
    WORD    NumberOfLinenumbers;       // Number of line number entries
    DWORD   Characteristics;           // Flags describing section permissions and content type
} IMAGE_SECTION_HEADER, *PIMAGE_SECTION_HEADER;

Key Fields:

  • Name: An 8-byte name (often starting with '.', like .text, .data). Not guaranteed to be null-terminated if exactly 8 chars.
  • VirtualSize: The actual size the section will occupy in virtual memory. Can be larger than SizeOfRawData (e.g., for .bss).
  • VirtualAddress: The RVA (relative to ImageBase) where the section will be loaded in memory.
  • SizeOfRawData: The size of the section's data in the file. Must be a multiple of FileAlignment. Can be 0 for uninitialized data sections like .bss.
  • PointerToRawData: The offset from the beginning of the file where this section's data starts. Must be a multiple of FileAlignment. Can be 0 if SizeOfRawData is 0.
  • Characteristics: Flags defining memory permissions and content type. Very important for analysis. Common flags include:
    • IMAGE_SCN_CNT_CODE (0x20): Contains executable code.
    • IMAGE_SCN_CNT_INITIALIZED_DATA (0x40): Contains initialized data.
    • IMAGE_SCN_CNT_UNINITIALIZED_DATA (0x80): Contains uninitialized data (.bss).
    • IMAGE_SCN_MEM_EXECUTE (0x20000000): Section is executable.
    • IMAGE_SCN_MEM_READ (0x40000000): Section is readable.
    • IMAGE_SCN_MEM_WRITE (0x80000000): Section is writable.
    • IMAGE_SCN_MEM_SHARED (0x10000000): Section memory is shared across processes mapping the image.

The actual raw data for the sections follows the section table in the file, located at the offsets specified by PointerToRawData.
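
A sketch of decoding one 40-byte section header, including a compact R/W/X summary of the permission flags listed above. The function name and dictionary keys are invented for this example.

```python
import struct

SECTION_HEADER = struct.Struct("<8sIIIIIIHHI")   # 40 bytes per entry

def parse_section_header(raw: bytes) -> dict:
    (name, vsize, vaddr, raw_size, raw_ptr,
     _relocs, _lines, _nrelocs, _nlines, flags) = SECTION_HEADER.unpack(raw)
    perms = "".join(
        letter if flags & bit else "-"
        for letter, bit in (("R", 0x40000000), ("W", 0x80000000), ("X", 0x20000000))
    )
    return {
        # The name may use all 8 bytes with no terminator, so strip padding rather than
        # searching for a null byte
        "name": name.rstrip(b"\0").decode("ascii", errors="replace"),
        "virtual_size": vsize,
        "virtual_address": vaddr,
        "size_of_raw_data": raw_size,
        "pointer_to_raw_data": raw_ptr,
        "permissions": perms,
    }
```

To walk a whole Section Table, apply this to NumberOfSections consecutive 40-byte slices starting right after the Optional Header.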

Common PE Section Names and Purposes

While developers can name sections arbitrarily, linkers typically use standard names:

Name | Typical Content | Common Characteristics | Malware Relevance
.text | Executable code | CNT_CODE, MEM_EXECUTE, MEM_READ | Main analysis target; packers often encrypt/compress this.
.data | Initialized global/static variables | CNT_INITIALIZED_DATA, MEM_READ, MEM_WRITE | Stores configuration, hardcoded strings/values.
.rdata | Read-only data (constants, strings) | CNT_INITIALIZED_DATA, MEM_READ | Often contains import/export info, string literals.
.bss | Uninitialized global/static variables | CNT_UNINITIALIZED_DATA, MEM_READ, MEM_WRITE | Takes no file space; zeroed by loader. Used for large buffers.
.idata | Import Tables (DLL names, function names/ordinals) | CNT_INITIALIZED_DATA, MEM_READ (sometimes MEM_WRITE for IAT patching) | Crucial for understanding external dependencies; often obfuscated.
.edata | Export Table (functions exported by a DLL) | CNT_INITIALIZED_DATA, MEM_READ | Defines the DLL's interface; malware DLLs export malicious functions.
.rsrc | Resources (icons, dialogs, menus, strings, version info, custom data) | CNT_INITIALIZED_DATA, MEM_READ | Common place for malware to hide payloads, config, or dropper files.
.reloc | Base Relocation Information | CNT_INITIALIZED_DATA, MEM_READ, MEM_DISCARDABLE | Needed if ASLR rebases the image; malware might strip this or add fake entries.
.pdata | Exception Handling Information (x64 primarily) | CNT_INITIALIZED_DATA, MEM_READ | Used for stack unwinding; less common target for manipulation.
.tls | Thread Local Storage data & callbacks | CNT_INITIALIZED_DATA, MEM_READ, MEM_WRITE | TLS callbacks run before entry point; common malware trick for early execution/anti-debug.
Custom/Packer Names (e.g., UPX0, .RLPACK, .themida) | Varies, often packed code/data | Often unusual combinations like RWE, or high entropy | Strong indicator of packing/protection.

RVAs, File Offsets, and Alignment

Mapping between memory addresses (RVAs) and file positions (Offsets) is fundamental:

  • RVA (Relative Virtual Address): An address relative to the ImageBase when the file is loaded into memory. Actual Memory Address = ImageBase + RVA.
  • File Offset: A byte offset from the beginning of the PE file on disk.
  • Mapping: To find the file offset corresponding to an RVA:
    1. Iterate through the Section Table to find the section containing the RVA:
      Section.VirtualAddress <= RVA < Section.VirtualAddress + Section.VirtualSize
    2. Calculate the offset within the section: OffsetInSection = RVA - Section.VirtualAddress
    3. Calculate the file offset: FileOffset = Section.PointerToRawData + OffsetInSection
    4. Caveat: This only works if OffsetInSection < Section.SizeOfRawData. If the RVA points to data that exists in memory but not in the file (like the upper part of .bss), there's no direct file offset.
  • Alignment: FileAlignment dictates the alignment of PointerToRawData and SizeOfRawData in the file (typically 512 bytes or 4KB). SectionAlignment dictates the alignment of VirtualAddress in memory (typically page size, 4KB). This can create gaps between sections in the file or in memory.

PE analysis tools (like PE-bear, CFF Explorer) automate this mapping, but understanding the process is vital for manual analysis or scripting.
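
The mapping steps above translate almost line for line into code. This sketch assumes the section table has already been parsed; the Section tuple and the function name are illustrative, and the example values are made up.

```python
from typing import NamedTuple, Optional

class Section(NamedTuple):
    virtual_address: int
    virtual_size: int
    pointer_to_raw_data: int
    size_of_raw_data: int

def rva_to_offset(rva: int, sections) -> Optional[int]:
    """Translate an RVA to a file offset, or return None if no file bytes back it."""
    for s in sections:
        # Step 1: find the section whose virtual range contains the RVA
        if s.virtual_address <= rva < s.virtual_address + s.virtual_size:
            # Step 2: offset within the section
            offset_in_section = rva - s.virtual_address
            # Step 4 (caveat): memory-only tail of the section has no file offset
            if offset_in_section >= s.size_of_raw_data:
                return None
            # Step 3: file offset
            return s.pointer_to_raw_data + offset_in_section
    return None

sections = [
    Section(0x1000, 0x1A00, 0x400, 0x1C00),   # hypothetical .text
    Section(0x3000, 0x2000, 0x2000, 0x200),   # hypothetical .data with a large zero-filled tail
]
print(hex(rva_to_offset(0x1010, sections)))
```

Returning None for both "no such section" and "memory-only data" keeps the sketch short; a real tool would probably distinguish the two cases.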

Section analysis is a cornerstone of static malware analysis. Analysts scrutinize section names, sizes (VirtualSize vs SizeOfRawData), file pointers, and especially characteristics. Red flags include: unusual names (often packers), sections with unexpected permissions (e.g., writable code or executable data sections), sections with zero size on disk but large size in memory (.bss or packed data), sections with high entropy (indicating encryption/compression), or code execution starting from non-.text sections. Malware often adds its own sections or modifies existing ones to hide code or data.
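
Two of these red flags are easy to check mechanically. The sketch below computes Shannon entropy over a section's raw bytes and combines it with the write/execute permission test. The 7.0 bits-per-byte threshold is an assumption chosen for illustration, not a standard value, and the function names are hypothetical.

```python
import math
from collections import Counter

def shannon_entropy(data: bytes) -> float:
    """Entropy in bits per byte: 0.0 for constant data, 8.0 for uniformly random bytes."""
    if not data:
        return 0.0
    total = len(data)
    return -sum((n / total) * math.log2(n / total) for n in Counter(data).values())

def section_red_flags(flags: int, raw: bytes, threshold: float = 7.0) -> list:
    found = []
    # IMAGE_SCN_MEM_WRITE together with IMAGE_SCN_MEM_EXECUTE
    if flags & 0x80000000 and flags & 0x20000000:
        found.append("writable and executable")
    if shannon_entropy(raw) > threshold:
        found.append("high entropy")
    return found
```

Entropy alone is only a hint: compressed resources such as images are also high-entropy, so the result should be read together with the section's name and permissions.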

Advanced PE Concepts & Malware Techniques


Import & Export Tables (.idata, .edata)

These tables manage dynamic linking – how PE files use functions from other DLLs or provide functions for others to use.

Import Address Table (IAT) & Import Directory Table (IDT)

Located via Data Directory entry 1 (IMAGE_DIRECTORY_ENTRY_IMPORT):

  • The Import Directory Table (IDT) is an array of IMAGE_IMPORT_DESCRIPTOR structures, one for each imported DLL. Each descriptor points to the DLL name and two parallel arrays:
    • Import Name Table (INT) / OriginalFirstThunk: An array of RVAs pointing to hint/name structures (IMAGE_IMPORT_BY_NAME) or ordinals for each imported function. This table remains unchanged after loading.
    • Import Address Table (IAT) / FirstThunk: Another array, initially identical to the INT. The Windows loader overwrites this array with the actual memory addresses of the imported functions during the loading process.
  • Code within the PE file typically calls imported functions indirectly via the IAT (e.g., call dword ptr [iat_entry_for_MessageBoxA]). Often, the compiler generates a small piece of code, called a thunk, for each imported function. This thunk usually contains just a jump instruction (e.g., jmp dword ptr [__imp__FunctionName]) that redirects execution to the address stored in the IAT.
  • Malware Uses: IAT Hooking (overwriting IAT entries to redirect API calls), manually parsing IDT/INT to resolve APIs dynamically (to hide imports from static analysis), stripping import information.
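
The on-disk layout described above can be sketched as two small parsers: one that walks the descriptor array up to its all-zero terminator, and one that classifies a single INT entry. Both operate on raw bytes or integers that are assumed to have been located already (for example, via an RVA-to-offset translation); the function names and dictionary keys are illustrative.

```python
import struct

# OriginalFirstThunk, TimeDateStamp, ForwarderChain, Name, FirstThunk
IMPORT_DESCRIPTOR = struct.Struct("<IIIII")   # 20 bytes

def parse_import_descriptors(table: bytes) -> list:
    """Read IMAGE_IMPORT_DESCRIPTOR entries until the all-zero terminator."""
    descriptors = []
    step = IMPORT_DESCRIPTOR.size
    for pos in range(0, len(table) - step + 1, step):
        oft, _stamp, _chain, name_rva, first_thunk = IMPORT_DESCRIPTOR.unpack_from(table, pos)
        if not (oft or name_rva or first_thunk):
            break
        descriptors.append({"int_rva": oft, "name_rva": name_rva, "iat_rva": first_thunk})
    return descriptors

def decode_thunk(value: int, is_pe32_plus: bool) -> tuple:
    """Classify one INT entry as an ordinal import or an RVA to a hint/name structure."""
    ordinal_flag = 1 << (63 if is_pe32_plus else 31)   # top bit marks import-by-ordinal
    if value & ordinal_flag:
        return ("ordinal", value & 0xFFFF)
    return ("name_rva", value & 0x7FFFFFFF)
```

Each INT is itself terminated by a zero entry, so a full import walker would call decode_thunk in a loop until it reads 0.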

Export Table (.edata)

Located via Data Directory entry 0 (IMAGE_DIRECTORY_ENTRY_EXPORT). Used primarily by DLLs:

  • Contains the DLL name, a list of exported function names, a list of exported function addresses (RVAs), and a list of ordinals (numeric IDs for functions).
  • Allows functions to be imported by name or by ordinal.
  • Malware Uses: Malicious DLLs export functions for other malware components to call, sometimes using non-descriptive names or exporting only by ordinal to hinder analysis. Export Forwarding (redirecting an export to a function in another DLL).
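
Lookup by name ties the three export arrays together, and a forwarded export can be recognized because its "address" points back into the export directory itself. The sketch below works on plain Python lists that are assumed to have been read from the file already; the function name, argument names, and example values are hypothetical.

```python
def resolve_export(name, names, name_ordinals, functions, export_dir_rva, export_dir_size):
    """Resolve an exported name.

    names[i] pairs with name_ordinals[i], which is an index into functions.
    Returns ("rva", value), ("forwarder", value), or None if the name is absent.
    """
    if name not in names:
        return None
    index = name_ordinals[names.index(name)]
    rva = functions[index]
    # An address inside the export directory is the RVA of a forwarder string, not code
    if export_dir_rva <= rva < export_dir_rva + export_dir_size:
        return ("forwarder", rva)
    return ("rva", rva)

# Hypothetical export data: "Beta" is real code, "Alpha" is forwarded elsewhere
names = ["Alpha", "Beta"]
name_ordinals = [1, 0]
functions = [0x1500, 0x4010]
print(resolve_export("Beta", names, name_ordinals, functions, 0x4000, 0x100))
```

Functions that appear in the address array but have no entry in the name array are the ordinal-only exports mentioned above.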

Symbols & Debug Information

Symbols map memory addresses or offsets to human-readable names (functions, variables). Debug information provides more detail for source-level debugging.

  • Storage: Can be embedded (partially) in the PE file via the Debug Directory (Data Directory entry 6) or, more commonly, stored externally in Program Database (.PDB) files (common for MSVC compiler). Other compilers like GCC or Clang might use DWARF format embedded in sections or separate files, though they can also generate PDBs on Windows. The Debug Directory often contains a reference (GUID, path) to the external debug file.
  • Origin: Symbol and debug information is generated by the compiler (e.g., GCC, Clang, MSVC) and linker during the build process, usually controlled by build configurations (e.g., "Debug" vs "Release").
  • Content: Function names, variable names (global/static/local), type information (structs, classes), source file/line number mappings.
  • Malware Relevance: Malware is almost always stripped of symbols and debug information to make reverse engineering harder. The presence of rich symbols in a suspicious file might indicate it's a legitimate tool being misused, or potentially an unsophisticated threat actor. Analysts often create their own symbols (renaming functions/variables) during analysis in tools like IDA Pro or Ghidra.

PE Security Features & Evasion

Windows and the PE format include features to mitigate exploits, which malware often tries to bypass. Many of these features require support from the compiler (like MSVC, GCC, Clang) and linker during the build process to be effective.

  • ASLR (Address Space Layout Randomization): Randomizes base addresses of DLLs, EXEs, stack, heap. Flag: IMAGE_DLLCHARACTERISTICS_DYNAMIC_BASE. Makes fixed-address exploits unreliable. Malware may try to find ways around it (information leaks, spraying) or target non-ASLR modules.
  • DEP (Data Execution Prevention): Marks memory regions (stack, heap, data sections) as non-executable using hardware support (NX/XD bit). Flag: IMAGE_DLLCHARACTERISTICS_NX_COMPAT. Prevents simple shellcode execution from data areas. Malware uses techniques like ROP (Return-Oriented Programming) or changes memory permissions (VirtualProtect) to bypass DEP.
  • SafeSEH (Safe Structured Exception Handling): Validates exception handlers against a table of registered handlers before calling them (32-bit x86 only). The table is referenced from the Load Configuration directory (Data Directory entry 10) rather than signalled by a DllCharacteristics bit. The related flag IMAGE_DLLCHARACTERISTICS_NO_SEH declares that the image uses no structured exception handlers at all. Prevents classic SEH overwrite exploits.
  • CFG (Control Flow Guard): Validates targets of indirect calls at runtime against a bitmap of valid call targets, which the loader builds from a function table emitted by the linker. Flag: IMAGE_DLLCHARACTERISTICS_GUARD_CF. Mitigates exploits that hijack indirect call pointers. Requires compiler support. Malware may target non-CFG-aware code or find ways to bypass checks.
  • Authenticode (Digital Signatures): Cryptographically signs PE files to verify publisher identity and integrity. Stored via Data Directory entry 4 (IMAGE_DIRECTORY_ENTRY_SECURITY). Malware is usually unsigned or uses stolen/forged certificates.
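
The flag-based mitigations above can be read straight out of the DllCharacteristics field. This sketch decodes only the flags named in this section; the table and function name are illustrative.

```python
MITIGATION_FLAGS = {
    0x0040: "DYNAMIC_BASE",   # ASLR
    0x0100: "NX_COMPAT",      # DEP
    0x0400: "NO_SEH",         # image uses no structured exception handlers
    0x4000: "GUARD_CF",       # Control Flow Guard
}

def decode_dll_characteristics(value: int) -> list:
    """Return the names of the known mitigation flags set in DllCharacteristics."""
    return [name for bit, name in MITIGATION_FLAGS.items() if value & bit]

print(decode_dll_characteristics(0x4140))
```

A value of 0 means none of these were requested at build time, which is common in older binaries and in some packed samples.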

Modern Languages and PE Files (e.g., Go)

While C and C++ are traditional sources of PE files, modern languages like Go, Rust, and Nim also compile directly to native code and produce PE executables. These often have distinct characteristics relevant to malware analysis:

  • Static Linking & Large Size: Go binaries, by default, statically link their runtime and all dependencies. This results in large PE files (often several megabytes minimum) that contain the Go runtime scheduler, garbage collector, and standard library code, alongside the developer's code.
  • Custom Runtime & Imports: Go binaries do not link against the standard C runtime libraries (like MSVCRT), and their Import Table is typically short, often limited to a set of kernel32.dll functions. The Go runtime resolves most other Windows APIs dynamically at run time, so the APIs a sample actually uses are not visible in the Import Table. This can make initial import analysis less informative compared to C/C++ binaries.
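  • Section Names & Symbols: Go stores function metadata in the pclntab (program counter line table). ELF builds expose it as a .gopclntab section, but PE section names are limited to eight characters, so Windows builds typically embed it inside a standard section such as .rdata, where tools locate it by its header magic. A .symtab section is also common unless the binary was stripped. Recovering meaningful symbols often requires Go-specific tooling.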
  • Malware Popularity Reasons:
    • Ease of Distribution: Static linking means the malware is a single file, requiring no external DLL dependencies on the target system.
    • Cross-Compilation: Go makes it relatively easy to compile Windows executables from other operating systems (like Linux).
    • Analysis Challenges: The large size, custom runtime, and non-standard structure can hinder analysis by tools and techniques primarily designed for traditional C/C++ PE files. Standard API import analysis is less effective, and decompilers may struggle with Go's runtime conventions.
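
A quick way to recognize a Go binary before reaching for Go-specific tooling is to look for artifacts the Go toolchain leaves behind. The sketch below is a heuristic under stated assumptions: it searches raw file bytes for the build ID marker written by the Go linker and for a plausible pclntab header. The function names are hypothetical, and stripped, packed, or obfuscated samples may defeat both checks.

```python
import struct

# Marker the Go linker places before the build ID
# (assumption: the sample has not been patched to remove it).
GO_BUILD_ID_PREFIX = b'\xff Go build ID: "'

# pclntab header magics and the Go release that introduced each layout.
PCLNTAB_MAGICS = {
    0xFFFFFFFB: "Go 1.2",
    0xFFFFFFFA: "Go 1.16",
    0xFFFFFFF0: "Go 1.18",
    0xFFFFFFF1: "Go 1.20",
}


def find_pclntab(data):
    """Return (offset, layout) of the first plausible pclntab header, or None."""
    for magic, layout in PCLNTAB_MAGICS.items():
        # The header starts with the little-endian magic followed by two zero bytes.
        needle = struct.pack("<I", magic) + b"\x00\x00"
        pos = data.find(needle)
        while pos != -1:
            header = data[pos:pos + 8]
            # Byte 6 is the instruction size quantum, byte 7 the pointer size.
            if len(header) == 8 and header[6] in (1, 2, 4) and header[7] in (4, 8):
                return pos, layout
            pos = data.find(needle, pos + 1)
    return None


def looks_like_go(data):
    """Heuristic only: True if either Go artifact is present in the raw bytes."""
    return GO_BUILD_ID_PREFIX in data or find_pclntab(data) is not None
```

Typical use would be `looks_like_go(open(path, "rb").read())`. A positive result suggests switching to Go-aware symbol recovery; a negative result does not rule Go out.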

Other Notable PE Structures & Techniques

  • Resources (.rsrc): Hierarchical structure storing icons, strings, dialogs, version info, and arbitrary binary data. Malware frequently hides encrypted payloads, configuration, or entire dropped files within resources.
  • TLS (Thread Local Storage): Allows per-thread data. Includes optional TLS Callbacks (an array of function pointers) that the loader invokes when a process or thread starts or stops; the process-start callbacks run before the official AddressOfEntryPoint. Heavily abused by malware for anti-debug tricks and early code execution. Located via Data Directory entry 9.
  • Relocations (.reloc): Table of fixups the loader applies when the image is loaded at an address other than its preferred ImageBase, as happens under ASLR. Malware might strip this from DLLs to make them crash if rebased (simple anti-analysis) or add invalid entries. Located via Data Directory entry 5.
  • .NET Headers (CLR): For managed code (.NET), Data Directory entry 14 points to the CLR header, which in turn locates the metadata. The program logic is stored as Intermediate Language (IL) bytecode instead of traditional native code. Requires different analysis tools (dnSpy, ILSpy).
  • Packing/Encryption: Malware often compresses/encrypts its original code/sections and embeds a small "stub" loader as the new entry point. The stub unpacks/decrypts the original code into memory at runtime. Indicated by few imports, unusual section names/permissions (Write+Execute), and high entropy sections. Requires dynamic analysis or unpacking to analyze the real code.
  • Anti-Analysis Tricks: Manipulating header values (e.g., incorrect SizeOfImage, overlapping sections, invalid RVA pointers), using TLS callbacks, checking for debugger presence, timing checks.
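
Several of the structures above are located through numbered Data Directory entries (4, 5, 9, and 14). The following sketch shows one way to read that table with the standard library. It assumes a well-formed PE32 or PE32+ header, and the helper and dictionary names are illustrative rather than taken from any particular tool.

```python
import struct

# Only the indices mentioned in the text; the full table has up to 16 entries.
DIRECTORY_NAMES = {
    4: "SECURITY (Authenticode)",
    5: "BASERELOC",
    9: "TLS",
    14: "COM_DESCRIPTOR (CLR)",
}


def read_data_directories(pe_bytes):
    """Map directory index -> (address, size) for every non-empty entry."""
    if pe_bytes[:2] != b"MZ":
        raise ValueError("missing MZ signature")
    (e_lfanew,) = struct.unpack_from("<I", pe_bytes, 0x3C)
    if pe_bytes[e_lfanew:e_lfanew + 4] != b"PE\x00\x00":
        raise ValueError("missing PE signature")
    opt = e_lfanew + 24  # 4-byte signature + 20-byte COFF file header
    (magic,) = struct.unpack_from("<H", pe_bytes, opt)
    if magic == 0x10B:      # PE32
        count_offset = 92
    elif magic == 0x20B:    # PE32+
        count_offset = 108
    else:
        raise ValueError("unknown optional header magic")
    (count,) = struct.unpack_from("<I", pe_bytes, opt + count_offset)
    table = opt + count_offset + 4
    entries = {}
    # Cap at 16 so a manipulated NumberOfRvaAndSizes cannot run the loop away.
    for index in range(min(count, 16)):
        address, size = struct.unpack_from("<II", pe_bytes, table + index * 8)
        if address or size:
            # The address is an RVA, except for entry 4, where it is a file offset.
            entries[index] = (address, size)
    return entries
```

A non-empty entry 9 in a sample is a cue to inspect TLS callbacks before setting a breakpoint at the entry point, and a non-empty entry 14 suggests opening the file in a .NET decompiler instead of a native disassembler.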

Advanced PE concepts are where malware authors and defenders play a constant cat-and-mouse game. Malware authors use imports/exports, TLS, resources, and relocations in non-standard ways to hide, persist, and execute. They also work to bypass security features like ASLR, DEP, and CFG, often relying on specific compiler/linker behaviors or exploiting weaknesses in the loading process. Understanding these structures and techniques, along with common evasion tactics like packing and header manipulation, is essential for analyzing sophisticated modern threats. In practice, analysis combines static examination of these PE structures with dynamic analysis (debugging, memory forensics) to uncover a sample's true behavior.
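
Two of the static packing indicators listed earlier, high section entropy and Write+Execute permissions, are straightforward to check. The sketch below computes Shannon entropy in bits per byte and combines it with the section characteristics flags. The 7.0 threshold is a commonly used rule of thumb rather than a fixed standard, and the `section_looks_packed` helper is a hypothetical name for illustration.

```python
import math
from collections import Counter

# Section characteristics bits from the PE/COFF specification.
IMAGE_SCN_MEM_EXECUTE = 0x20000000
IMAGE_SCN_MEM_WRITE = 0x80000000


def shannon_entropy(data):
    """Entropy in bits per byte: 0.0 for constant data, 8.0 for uniform bytes."""
    if not data:
        return 0.0
    total = len(data)
    return sum(-(n / total) * math.log2(n / total) for n in Counter(data).values())


def section_looks_packed(raw, characteristics, threshold=7.0):
    """Triage heuristic (assumption): flag high entropy or a writable+executable section."""
    writable_and_executable = bool(characteristics & IMAGE_SCN_MEM_WRITE) and bool(
        characteristics & IMAGE_SCN_MEM_EXECUTE
    )
    return shannon_entropy(raw) >= threshold or writable_and_executable
```

Compressed or encrypted data tends to score close to 8.0, while ordinary compiled code usually scores noticeably lower. Legitimate files can also contain compressed resources, so a hit is a reason to look closer, not a verdict.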