Welcome to the Interactive PE Format Evolution Timeline
This interactive timeline explores the fascinating journey of executable formats, from the earliest days of computing with raw machine code to the sophisticated PE (Portable Executable) format used in modern Windows systems.
Understanding the evolution of executable formats is essential for malware analysts, reverse engineers, and anyone interested in how programs actually work at a fundamental level. As you progress through this timeline, you'll discover how each advancement in executable design enabled new capabilities while addressing limitations of previous formats. We'll also touch upon the role of compilers and linkers in creating these executable files.
Begin your journey by clicking on any point in the timeline below, or use the "Get Started" button to start with Binary Foundations.
Binary Foundations: The Basics
Number Systems in Computing
Computers operate using binary (base-2) because electronic components have two reliable states:
- 0: Off, low voltage, false
- 1: On, high voltage, true
All data in a computer is ultimately stored as sequences of these binary digits ("bits"):
10110110 01101001 11001010 10101101
Binary to Hexadecimal Conversion
Hexadecimal (base-16) serves as a more concise way to represent binary data:
- Each hex digit represents exactly 4 binary bits (a nibble)
- Range: 0-9 and A-F (where A=10, B=11, C=12, D=13, E=14, F=15)
This makes hexadecimal ideal for representing binary data compactly while maintaining a direct mapping to the underlying bits. It's commonly used in memory dumps, debuggers, and PE analysis tools.
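The nibble-to-hex-digit mapping can be sketched in a few lines of Python. The function names `bits_to_hex` and `hex_to_bits` are illustrative helpers, not part of any tool mentioned here.

```python
def bits_to_hex(bits: str) -> str:
    """Convert a string of binary digits (spaces allowed) to uppercase hex."""
    digits = bits.replace(" ", "")
    if len(digits) % 4 != 0:
        raise ValueError("bit count must be a multiple of 4")
    # Each group of 4 bits (a nibble) maps to exactly one hex digit.
    nibbles = [digits[i:i + 4] for i in range(0, len(digits), 4)]
    return "".join(format(int(n, 2), "X") for n in nibbles)


def hex_to_bits(hex_digits: str) -> str:
    """Expand each hex digit back into its 4-bit group."""
    return " ".join(format(int(h, 16), "04b") for h in hex_digits)


print(bits_to_hex("10110110 01101001 11001010 10101101"))  # B669CAAD
print(hex_to_bits("B6"))                                   # 1011 0110
```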
Bits, Bytes, Words, and Beyond
Data is organized into progressively larger units. The standard sizes evolved with processor architectures:
- Bit: The smallest unit (0 or 1).
- Nibble: 4 bits (half a byte).
- Byte: 8 bits. The fundamental unit of addressable memory. Early microprocessors like the Intel 8080 worked primarily with bytes.
- Word: 16 bits (2 bytes). Became standard with 16-bit processors like the Intel 8086.
- Double Word (DWORD): 32 bits (4 bytes). The standard size for registers and addresses in 32-bit architectures (IA-32, like the Intel 80386).
- Quad Word (QWORD): 64 bits (8 bytes). The standard size for registers and addresses in 64-bit architectures (x86-64).
These terms are essential vocabulary in assembly language, PE file format structures (which use types like WORD, DWORD), and low-level programming.
Endianness: Byte Order Matters
When dealing with multi-byte values (like Words, DWORDs, QWORDs), the order in which bytes are stored in memory becomes critically important:
- Little-Endian: Least significant byte (LSB) comes first in memory (at the lowest address).
  - Used by x86/x64 processors (Intel, AMD).
  - The DWORD value `0x12345678` is stored in memory as: `78 56 34 12`
- Big-Endian: Most significant byte (MSB) comes first in memory (at the lowest address).
  - Used by some architectures (e.g., older PowerPC, SPARC, MIPS) and standard network protocols (hence "network byte order").
  - The DWORD value `0x12345678` is stored in memory as: `12 34 56 78`
This concept of "endianness" is crucial in malware analysis because:
- You must interpret multi-byte values from memory dumps or file structures correctly based on the target architecture (usually little-endian for Windows malware).
- Some malware deliberately uses reversed byte order to obscure strings or values.
- Network communication often requires byte-swapping when moving between the big-endian network format and little-endian memory.
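As a small illustration, Python's standard `struct` module can pack the same DWORD in both byte orders. The `hexdump` helper is a made-up name for this sketch.

```python
import struct

value = 0x12345678

little = struct.pack("<I", value)   # little-endian, as x86/x64 stores it
big = struct.pack(">I", value)      # big-endian, i.e. network byte order


def hexdump(data: bytes) -> str:
    """Render bytes as space-separated uppercase hex pairs."""
    return " ".join(f"{b:02X}" for b in data)


print(hexdump(little))  # 78 56 34 12
print(hexdump(big))     # 12 34 56 78

# Reading little-endian bytes with the wrong byte order gives a different value.
misread = struct.unpack(">I", little)[0]
print(hex(misread))     # 0x78563412
```

This is the kind of mistake that shows up when a hex dump is read left to right without accounting for the target's byte order.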
Malware analysts must master number systems and data representations to properly analyze binary files. The ability to read and convert between binary, hexadecimal, and decimal, understand data sizes (Byte, Word, DWORD, QWORD), and interpret endianness is fundamental to understanding executable file structures, memory dumps, and disassembled code. This knowledge forms the bedrock for all deeper analysis of PE files.
CPU Registers: The Processor's Workbench
CPU Registers: High-Speed CPU Storage
Registers are small, extremely fast storage locations built directly into the CPU. They are the primary working space for the processor, holding data currently being processed, instruction pointers, and status flags. Understanding registers is key to understanding assembly language.
Register widths by processor generation: 16-bit (8086/80286), 32-bit (i386+), 64-bit (x86-64).
Understanding how registers evolved (16-bit -> 32-bit -> 64-bit) and how smaller registers are part of larger ones (e.g., AL/AH make up AX, AX is the lower 16 bits of EAX, EAX is the lower 32 bits of RAX) is crucial for analyzing code across different architectures.
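The nesting of the accumulator's sub-registers can be modelled with bit masks. This is a sketch of the aliasing only; `sub_registers` is a hypothetical helper, and it does not model side effects such as the zero-extension that happens when 32-bit registers are written in 64-bit mode.

```python
def sub_registers(rax: int) -> dict:
    """Show the narrower views that alias the low bits of a 64-bit RAX value."""
    rax &= 0xFFFFFFFFFFFFFFFF
    eax = rax & 0xFFFFFFFF          # lower 32 bits of RAX
    ax = eax & 0xFFFF               # lower 16 bits of EAX
    return {
        "RAX": rax,
        "EAX": eax,
        "AX": ax,
        "AH": (ax >> 8) & 0xFF,     # high byte of AX
        "AL": ax & 0xFF,            # low byte of AX
    }


views = sub_registers(0x1122334455667788)
for name, value in views.items():
    print(f"{name:>3} = {value:#x}")
```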
Register Categories and Common Uses
Registers are often grouped by their typical function:
General Purpose Registers (GPRs)
- AX/EAX/RAX: Accumulator - Often used for arithmetic results, function return values, and some I/O operations.
- BX/EBX/RBX: Base - Historically used as a base pointer for memory access (e.g., `[BX+SI]`). In 64-bit, RBX is often preserved across function calls (non-volatile).
- CX/ECX/RCX: Counter - Frequently used as a loop counter (`LOOP` instruction) or for string operations (`REP` prefixes). First argument in the Microsoft x64 calling convention (sometimes called x64 fastcall).
- DX/EDX/RDX: Data - Used for I/O port access (`IN`/`OUT` instructions). Holds the high half of the product in multiplication and of the dividend in division, and receives the remainder. Second argument in the Microsoft x64 calling convention.
Index and Pointer Registers
- SI/ESI/RSI: Source Index - Often used as a source pointer in string/memory operations (e.g., `LODSB`, `MOVSB`). Non-volatile in the Microsoft x64 calling convention and not used for argument passing there (the System V AMD64 convention used on Linux passes the second argument in RSI).
- DI/EDI/RDI: Destination Index - Often used as a destination pointer in string/memory operations (e.g., `STOSB`, `MOVSB`). Non-volatile in the Microsoft x64 calling convention and not used for argument passing there (System V AMD64 passes the first argument in RDI).
- SP/ESP/RSP: Stack Pointer - Points to the current top of the stack. Crucial for function calls (`PUSH`, `POP`, `CALL`, `RET`) and local variables.
- BP/EBP/RBP: Base Pointer - Points to the base of the current stack frame, used to access parameters and local variables. Often optional in optimized 64-bit code where RSP-relative addressing might be used instead. Often non-volatile.
Instruction Pointer
- IP/EIP/RIP: Instruction Pointer - Holds the address of the next instruction to be executed. Cannot be accessed directly by most instructions but modified by jumps, calls, and returns. Central to control flow.
Flags Register
- FLAGS/EFLAGS/RFLAGS: Status Register - Contains individual bits (flags) indicating results of arithmetic/logical operations (Zero Flag (ZF), Carry Flag (CF), Sign Flag (SF), Overflow Flag (OF)) and controlling CPU behavior (Interrupt Flag (IF), Direction Flag (DF) for string ops). Conditional jumps (`JZ`, `JNE`, `JC`, etc.) depend on these flags.
Segment Registers (16-bit, but still relevant concepts in protected/long mode)
- CS (Code Segment): Points to the segment containing executable instructions. Implicitly used with EIP/RIP.
- DS (Data Segment): Default segment for most data access.
- SS (Stack Segment): Points to the segment containing the program stack. Implicitly used with ESP/RSP and EBP/RBP.
- ES (Extra Segment): Additional data segment, often used for string operations with DI/EDI.
- FS & GS (Extra Segments): Additional data segments with no specific hardware-defined use. In modern Windows, `FS` (32-bit) / `GS` (64-bit) are famously used to point to thread-specific data structures:
  - TEB (Thread Environment Block) / TIB (Thread Information Block): Accessed via `FS:[0]` in 32-bit Windows. Contains pointers to PEB, SEH chain, Stack Base/Limit, ThreadID, LastError.
  - Malware frequently accesses `FS:[0x18]` (TEB), `FS:[0x30]` (PEB pointer in TEB), or `GS:[0x30]` (TEB in x64), `GS:[0x60]` (PEB in x64) for anti-debugging (checking `PEB.BeingDebugged`), finding loaded modules, or getting other process/thread info without direct API calls.
Note: In modern "flat" memory models used by Windows, segment registers typically point to selectors that cover the entire address space, so their explicit manipulation for memory addressing is less common than in older segmented architectures. However, FS/GS have taken on special roles.
64-bit Additional GPRs
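- R8 - R15: Additional general-purpose registers available in 64-bit mode. In the Microsoft x64 calling convention, R8 and R9 carry the third and fourth integer arguments (after RCX and RDX), and further arguments go on the stack. R10 and R11 are volatile scratch registers, while R12-R15 are non-volatile and available for general computation.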
CPU registers are the heart of low-level execution. Malware analysts must meticulously track register values when debugging or reverse engineering disassembled code. EIP/RIP dictates control flow, ESP/RSP manages the stack (critical for buffer overflows), EBP/RBP helps understand function context, and GPRs reveal data manipulation and function arguments/return values. Malware often uses registers in non-standard ways to obfuscate its actions.
Machine Code: The CPU's Native Language
Raw Machine Code: Direct CPU Instructions
At the most fundamental level, CPUs only understand binary sequences known as machine code or opcodes. Each sequence directly triggers a specific hardware operation.
While technically binary, we almost always represent machine code in hexadecimal for readability:
    Binary:             Hexadecimal:   Assembly:
    01010101            55             push ebp
    10001001 11100101   89 E5          mov ebp, esp

(Standard 32-bit function prologue)
Note: The `mov ebp, esp` instruction shown above uses the bytes `89 E5`. However, due to redundancy in the x86 instruction set, the functionally identical instruction could also be encoded as `8B EC`. Different compilers or assemblers (like NASM vs MASM) might choose either valid encoding. This is important because seeing `8B EC` instead of `89 E5` doesn't mean the code is wrong, just that a different (but valid) encoding was chosen. This variation can sometimes be used to guess which compiler produced the code (compiler fingerprinting).
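To see why both byte pairs mean the same thing, the sketch below decodes the ModR/M byte for the two register-to-register `mov` opcodes. It assumes 32-bit operand size and no prefixes, and `decode_mov_reg_reg` is an illustrative name, not a real disassembler API.

```python
REG32 = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"]


def decode_mov_reg_reg(code: bytes) -> str:
    """Decode a two-byte register-to-register MOV (opcodes 89 and 8B only)."""
    opcode, modrm = code[0], code[1]
    mod = modrm >> 6            # top 2 bits: addressing mode
    reg = (modrm >> 3) & 0b111  # middle 3 bits: register operand
    rm = modrm & 0b111          # low 3 bits: register/memory operand
    if mod != 0b11:
        raise ValueError("only register-to-register forms are handled")
    if opcode == 0x89:          # MOV r/m32, r32: destination is r/m
        dst, src = rm, reg
    elif opcode == 0x8B:        # MOV r32, r/m32: destination is reg
        dst, src = reg, rm
    else:
        raise ValueError("opcode not handled by this sketch")
    return f"mov {REG32[dst]}, {REG32[src]}"


print(decode_mov_reg_reg(bytes.fromhex("89E5")))  # mov ebp, esp
print(decode_mov_reg_reg(bytes.fromhex("8BEC")))  # mov ebp, esp
```

The two opcodes differ only in which ModR/M field names the destination, so swapping the fields yields the same instruction.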
Instruction Encoding Concepts
x86/x64 instructions don't have a fixed length; they can range from 1 to 15 bytes. An instruction is typically composed of several parts, though not all parts are present in every instruction:
- Prefixes (Optional): Single bytes that modify instruction behavior (e.g., operand size override, segment override, lock prefix, repeat prefixes).
- Opcode (Required): One or more bytes specifying the core operation (e.g., `mov`, `add`, `push`, `ret`).
- ModR/M Byte (Often Required): A complex byte that specifies operands. It indicates whether operands are registers or memory locations and defines the addressing mode used for memory access.
- SIB Byte (Sometimes Required): Scale-Index-Base byte. Used with ModR/M for more complex memory addressing involving a scaled index register (e.g., `[eax + ecx*4]`).
- Displacement (Optional): An offset (1, 2, or 4 bytes) added to a base address when accessing memory.
- Immediate Value (Optional): A constant value (1, 2, 4, or 8 bytes) embedded directly in the instruction, used as an operand.
Examples showing different structures:
- `50` → `push eax` (Opcode only)
- `C3` → `ret` (Opcode only - near return that pops just the return address, no extra stack bytes)
- `B8 01000000` → `mov eax, 1` (Opcode + 32-bit Immediate)
- `89 E5` → `mov ebp, esp` (Opcode + ModR/M specifying two registers)
- `E8 00000000` → `call near_relative_offset` (Opcode + 32-bit Displacement/relative offset)
- `FF 15 00104000` → `call dword ptr [0x401000]` (Opcode + ModR/M + 32-bit Displacement/absolute address - typical IAT call)
You don't need to memorize all encodings, but understanding that instructions have variable lengths and different components is key for reading disassembly and hex dumps. It's the job of a compiler (like GCC, Clang, or MSVC) to translate high-level code (like C++ or C#) into these machine code sequences, often via an intermediate assembly language step.
Common Instruction Byte Patterns
In malware analysis, recognizing common byte patterns directly in a hex editor or disassembler can quickly reveal program structure and behavior:
Function Prologues/Epilogues (32-bit)
- `55 89 E5` or `55 8B EC`: Standard 32-bit prologue (`push ebp; mov ebp, esp`)
- `C9 C3`: Standard 32-bit epilogue (`leave; ret`)
Function Prologues/Epilogues (64-bit)
- `40 55` / `48 89 E5` / `48 8B EC`: Various common 64-bit prologue starts (often involve saving non-volatile registers).
- `C3`: Simple return (often ends functions).
Control Flow
- `E8 xx xx xx xx`: Relative `call`
- `E9 xx xx xx xx`: Relative `jmp`
- `EB xx`: Short relative `jmp`
- `74 xx`: `je` (short jump if equal/zero)
- `75 xx`: `jne` (short jump if not equal/zero)
- `FF 15 xx..` / `FF 25 xx..`: Indirect `call`/`jmp` via absolute address (often used for IAT calls or jump tables). Sometimes implemented via a small piece of code called a thunk, which simply jumps to the real target address (e.g., `jmp dword ptr [__imp__FunctionName]`).
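The relative forms (`E8`, `E9`, `EB`) store a signed displacement measured from the end of the instruction, not an absolute address. A minimal sketch of the arithmetic, assuming 32-bit code (the result is masked to 32 bits) and using the hypothetical helper name `relative_target`:

```python
import struct


def relative_target(address: int, code: bytes) -> int:
    """Return the destination of an E8/E9 (rel32) or EB (rel8) instruction."""
    opcode = code[0]
    if opcode in (0xE8, 0xE9):
        length = 5
        (disp,) = struct.unpack_from("<i", code, 1)   # signed 32-bit, little-endian
    elif opcode == 0xEB:
        length = 2
        (disp,) = struct.unpack_from("<b", code, 1)   # signed 8-bit
    else:
        raise ValueError("unsupported opcode")
    # The displacement is relative to the address of the NEXT instruction.
    return (address + length + disp) & 0xFFFFFFFF


# A call at 0x00401008 with displacement bytes F3 FF FF FF (-13).
print(hex(relative_target(0x00401008, bytes.fromhex("E8F3FFFFFF"))))  # 0x401000
# EB FE jumps back onto itself (an infinite loop).
print(hex(relative_target(0x00401000, bytes.fromhex("EBFE"))))        # 0x401000
```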
Stack Operations
- `50`-`57`: `push` General Purpose Register (eax, ecx, edx, ebx, esp, ebp, esi, edi)
- `58`-`5F`: `pop` General Purpose Register
- `68 xx xx xx xx`: `push immediate_dword`
- `6A xx`: `push immediate_byte`
No Operation
- `90`: `nop` (Often used for padding or overwritten by hooks)
Return Instructions
- `C3`: Near Return: Pops the return address (pushed by `call`) from the stack into EIP/RIP. Used for returns within the same code segment.
- `C2 iw`: Near Return and Pop N bytes: Pops the return address, then pops an additional N bytes (specified by the 16-bit immediate word `iw`) off the stack. Used by conventions like `stdcall` where the callee cleans up arguments.
- `CB`: Far Return: Pops CS:IP (Code Segment and Instruction Pointer) from the stack. Used for returns between different code segments (rare in modern flat memory models).
- `CA iw`: Far Return and Pop N bytes: Pops CS:IP, then pops an additional N bytes off the stack.
Machine code is the raw material of executables. All higher-level structures in PE files ultimately translate down to sequences of these byte instructions. When malware analysts perform deep static analysis or examine memory dumps, they are often looking directly at this machine code. Recognizing common patterns (like function prologues, API call sequences, or loops) in the raw hex can significantly speed up the analysis process, especially when dealing with obfuscated or packed malware where standard disassembly might fail.
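A naive way to look for such patterns is a plain byte search. The sketch below scans a buffer for the two 32-bit prologue encodings; `find_prologues` and the sample bytes are made up for illustration, and a raw byte match can produce false positives because the same bytes may occur inside data or in the middle of another instruction.

```python
PROLOGUES = (bytes.fromhex("5589E5"), bytes.fromhex("558BEC"))


def find_prologues(data: bytes) -> list[int]:
    """Return sorted offsets where a 32-bit 'push ebp; mov ebp, esp' pattern occurs."""
    hits = []
    for pattern in PROLOGUES:
        start = 0
        while (pos := data.find(pattern, start)) != -1:
            hits.append(pos)
            start = pos + 1
    return sorted(hits)


# Synthetic bytes: nop nop | prologue | leave; ret | int3 | prologue | pop ebp; ret
sample = bytes.fromhex("9090" "558BEC" "C9C3" "CC" "5589E5" "5DC3")
print(find_prologues(sample))  # [2, 8]
```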
Assembly Language: Human-Readable Machine Code
Assembly Language Basics
Assembly language provides human-readable mnemonics for machine code instructions. An assembler translates assembly code into machine code, and a disassembler does the reverse. There's typically a direct one-to-one mapping (though some assemblers support macros).
High-level programming languages like C, C++, Go, or Delphi are translated by a compiler (e.g., GCC, Clang, MSVC) into assembly language (or sometimes directly to machine code), which is then assembled into the final machine code bytes stored in the executable file. Different compilers might generate slightly different, but functionally equivalent, assembly code for the same high-level source due to optimization choices or instruction selection.
Example: From Assembly to Machine Code
    Address   Machine Code   Assembly Instruction   Comment
    -------   ------------   --------------------   -------
    00401000  B8 01000000    mov eax, 1             ; Load 1 into EAX
    00401005  03 C3          add eax, ebx           ; Add EBX to EAX
    00401007  50             push eax               ; Push EAX onto stack
    00401008  E8 F3FFFFFF    call 00401000          ; Call relative address (example)
    0040100D  C3             ret                    ; Return from function
Assembly uses mnemonics (`mov`, `add`, `push`), register names (`eax`, `ebx`), memory addressing modes (`[ebp+8]`, `[my_var]`), and labels (`start_loop:`) to represent the underlying machine operations.
Disassembly vs. Decompilation vs. Bytecode
These terms represent different levels of abstraction when analyzing code:
- Machine Code: The raw binary instructions the CPU directly executes (e.g., `55 89 E5`).
- Disassembly: Translating machine code into human-readable assembly language (e.g., `push ebp; mov ebp, esp`). This is a direct, accurate representation of the machine code. Tools: IDA Pro, Ghidra, Binary Ninja, debuggers (x64dbg, OllyDbg, WinDbg).
- Decompilation: Attempting to translate assembly/machine code back into a high-level language like C/C++. This is an interpretive process, generating an approximation of potential source code. It's very helpful for understanding logic but may lose low-level details or be inaccurate. Tools: Hex-Rays Decompiler (IDA Pro plugin), Ghidra's decompiler, Binary Ninja.
- Bytecode: An intermediate code format used by some languages (e.g., Java `.class` files, Python `.pyc` files, .NET CIL). Bytecode is executed by a Virtual Machine (JVM, PVM, CLR) rather than directly by the CPU. Decompiling bytecode back to its original source language (e.g., Java bytecode to Java source using JD-GUI, or .NET CIL to C# using dnSpy/ILSpy) is generally much easier and more accurate than decompiling native machine code, because bytecode often retains more metadata and structure.
Understanding these differences is crucial. Analyzing native code (C, C++, Go, Delphi PE files) primarily involves disassembly, with decompilation as a helpful aid. Analyzing managed code (.NET) or interpreted languages (Java, Python) often involves specific bytecode decompilers.
Function Calling Conventions (32-bit stdcall Example)
Calling conventions define how parameters are passed, return values are handled, and registers are managed during function calls. This example uses the common 32-bit `stdcall` convention, widely used by Windows APIs.
Caller Side
    ; Calling MyStdcallFunc(arg1, arg2) which is declared as STDCALL
    push arg2              ; Push arguments onto stack (right-to-left)
    push arg1
    call MyStdcallFunc     ; Call the function (pushes return address)
    ; NO stack cleanup here! The callee does it in stdcall.
    ; Return value is typically in EAX
Callee Side (MyStdcallFunc)
    MyStdcallFunc:
        ; Prologue
        push ebp            ; Save old base pointer
        mov ebp, esp        ; Set new stack frame base (using 8B EC or 89 E5 encoding)
        ; Access arguments
        mov eax, [ebp+8]    ; Access arg1 (first arg is at ebp+8)
        mov ecx, [ebp+12]   ; Access arg2 (second arg is at ebp+12)
        ; ... function body ...
        ; Place return value in EAX (if any)
        ; Epilogue
        mov esp, ebp        ; Deallocate local variables (if any)
        pop ebp             ; Restore old base pointer
        ret 8               ; Return AND clean up 8 bytes (2*DWORD) from stack (Opcode C2 0800)
Key differences from `cdecl`: The callee (the function being called) is responsible for cleaning the arguments off the stack using the `ret N` instruction (opcode `C2`), where N is the total size of the arguments in bytes. Many Windows APIs use `stdcall`.
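The stack bookkeeping of a `stdcall` call can be followed with simple arithmetic. The sketch below tracks ESP through the caller and callee steps for DWORD-sized arguments. The function name and the starting ESP value are arbitrary choices for illustration.

```python
def stdcall_stack_trace(esp: int, arg_count: int) -> dict:
    """Track ESP through a 32-bit stdcall call with DWORD arguments."""
    start = esp
    esp -= 4 * arg_count        # caller: push each argument (right-to-left)
    esp -= 4                    # call: pushes the return address
    esp -= 4                    # callee: push ebp
    ebp = esp                   # callee: mov ebp, esp
    # [ebp+0] = saved EBP, [ebp+4] = return address, [ebp+8] = first argument
    arg_addresses = [ebp + 8 + 4 * i for i in range(arg_count)]
    esp = ebp                   # callee: mov esp, ebp
    esp += 4                    # callee: pop ebp
    esp += 4 + 4 * arg_count    # callee: ret N pops return address plus N bytes
    return {"args": arg_addresses, "final_esp": esp, "balanced": esp == start}


trace = stdcall_stack_trace(0x0019FF00, 2)
print([hex(a) for a in trace["args"]])  # ['0x19fef8', '0x19fefc']
print(trace["balanced"])                # True
```

Because `ret 8` removes the arguments, ESP is back at its starting value as soon as the call returns, with no cleanup instruction in the caller.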
Assembly language is the primary tool for reverse engineering and malware analysis. Disassemblers convert the machine code within PE files back into assembly. By reading the assembly, analysts can understand the program's logic, identify algorithms, track data flow, pinpoint API calls, and discover vulnerabilities or malicious behavior, even without the original source code. Recognizing standard patterns like function prologues/epilogues and calling conventions (like `stdcall` for WinAPIs) is essential for efficient analysis.
Memory Models & Virtual Memory
Process Memory Layout: Stack, Heap, Code, Data
When a program runs, the operating system allocates a virtual address space for it, typically organized into several key regions:
Typical Process Layout (diagram): the stack (function calls, local vars) and the heap (dynamic allocation - malloc/new).
The Stack
A LIFO (Last-In-First-Out) structure managed automatically by the CPU/compiler. Grows towards lower memory addresses.
- Stores function return addresses.
- Holds local variables declared within functions.
- Used to pass arguments (in some calling conventions).
- Fast allocation/deallocation (just move stack pointer).
- Limited size, susceptible to stack buffer overflows.
The Heap
A region for dynamically allocated memory (using `malloc`, `new`). Grows towards higher memory addresses.
- Used for data whose size isn't known at compile time.
- Used for data that needs to outlive the function that created it.
- Slower allocation/deallocation (requires memory management).
- Larger size available, susceptible to heap overflows, use-after-free, etc.
Virtual Memory vs. Physical Memory
Modern OSes use virtual memory to give each process its own private, contiguous address space, isolating it from other processes and the underlying physical RAM:
Physical Memory (RAM)
- The actual hardware memory chips.
- A limited, shared resource managed by the OS kernel.
- OS maps parts of physical RAM to different processes' virtual addresses.
Virtual Memory
- An abstraction provided by the OS and CPU's Memory Management Unit (MMU).
- Each process gets its own large, linear address space (e.g., 4GB for 32-bit, much larger for 64-bit).
- Addresses used by the program (pointers, EIP/RIP) are virtual addresses.
- The MMU translates virtual addresses to physical addresses on-the-fly.
- Allows for memory protection (read/write/execute permissions per page).
- Enables features like paging (swapping data to disk).
PE files are designed entirely around this virtual memory concept. Addresses within the PE file (like the entry point or section locations) are virtual addresses (or RVAs relative to a virtual base address).
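Conceptually, translation splits a virtual address into a page number and an offset within the page. The toy model below assumes 4 KB pages and a flat dictionary as the page table. Real MMUs walk multi-level page tables, and the mappings shown are made-up values.

```python
PAGE_SIZE = 0x1000  # 4 KB pages


def translate(virtual_address: int, page_table: dict) -> int:
    """Map a virtual address to a physical one via a toy single-level page table."""
    page_number, offset = divmod(virtual_address, PAGE_SIZE)
    if page_number not in page_table:
        raise LookupError(f"page fault at {virtual_address:#x}")
    # The offset within the page is unchanged; only the page is remapped.
    return page_table[page_number] * PAGE_SIZE + offset


# Hypothetical mapping: virtual page -> physical frame
toy_table = {0x401: 0x2A, 0x402: 0x07}
print(hex(translate(0x00401234, toy_table)))  # 0x2a234
```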
Image Base Address & Relocation
The Image Base is the preferred starting virtual address where the OS loader attempts to map the PE file into memory:
- Defined in `OptionalHeader.ImageBase`.
- Typical defaults: `0x00400000` (32-bit EXE), `0x10000000` (32-bit DLL), `0x0000000140000000` (64-bit EXE).
- If this preferred address is available (and ASLR doesn't override it), the file is loaded there.
- If the address is occupied (e.g., by another DLL), the loader must place the module elsewhere. This is called rebasing.
- When rebasing occurs, any hardcoded absolute virtual addresses within the module's code/data become incorrect.
- The Base Relocation Table (`.reloc` section, pointed to by Data Directory entry 5) contains a list of locations within the image that need to be "fixed up" by adding the difference between the actual load address and the preferred ImageBase.
- EXEs are often compiled assuming they will load at their ImageBase (no relocations needed), while DLLs almost always include relocation information because they are likely to be rebased.
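The fix-up itself is a single addition per location. The sketch below applies a rebasing delta to 32-bit absolute addresses in a byte buffer. It is simplified: `fixup_rvas` is a plain list of offsets assumed for illustration, whereas the real relocation table groups entries into per-page blocks with a type and offset for each entry.

```python
import struct


def apply_fixups(image: bytearray, fixup_rvas, preferred_base: int, actual_base: int) -> None:
    """Add (actual - preferred) to each 32-bit absolute address listed in fixup_rvas."""
    delta = actual_base - preferred_base
    for rva in fixup_rvas:
        (old,) = struct.unpack_from("<I", image, rva)
        struct.pack_into("<I", image, rva, (old + delta) & 0xFFFFFFFF)


# Toy image with one hardcoded absolute address stored at RVA 4.
image = bytearray(16)
struct.pack_into("<I", image, 4, 0x00401000)

apply_fixups(image, [4], preferred_base=0x00400000, actual_base=0x00550000)
fixed = struct.unpack_from("<I", image, 4)[0]
print(hex(fixed))  # 0x551000
```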
PE File Memory Mapping Process
When a PE file is executed, the Windows loader performs a detailed sequence of steps to load it into virtual memory:
1. Read Headers: Parse the DOS MZ Header to find `e_lfanew`, jump to that offset, validate the PE Signature ('PE\0\0'), and then parse the COFF Header and the crucial Optional Header.
2. Reserve Address Space: Based on `OptionalHeader.ImageBase` and `OptionalHeader.SizeOfImage`, reserve a contiguous block of virtual address space. If ASLR is enabled and supported (`DYNAMIC_BASE` flag), the OS chooses a randomized base address instead of the preferred `ImageBase`. If the preferred/randomized address is unavailable, the loader attempts to find another free block (rebasing).
3. Map Sections: Iterate through the Section Table (using `NumberOfSections` from the COFF Header). For each `IMAGE_SECTION_HEADER`:
   - Calculate the target memory address: `Actual Load Address + SectionHeader.VirtualAddress`.
   - Allocate virtual memory pages for the section based on `SectionHeader.VirtualSize`, respecting `SectionAlignment`.
   - Copy the section's raw data from the file (from offset `SectionHeader.PointerToRawData`, length `SectionHeader.SizeOfRawData`) into the allocated virtual memory. Note that `VirtualSize` can be larger than `SizeOfRawData` (e.g., for `.bss`), in which case the extra space is zero-filled.
   - Set initial memory page protections (Read/Write/Execute) based on `SectionHeader.Characteristics`.
4. Process Imports (Recursively): Examine the Import Table (via Data Directory 1). For each required DLL:
   - Check if the DLL is already loaded in the process. If not, load it by performing these same steps (1-7) for the DLL. This can trigger loading of further dependencies.
   - Once the DLL is loaded, get the actual memory addresses of the functions listed in the Import Name Table (INT) / OriginalFirstThunk.
   - Write these actual function addresses into the Import Address Table (IAT) / FirstThunk for the module being loaded.
5. Perform Base Relocations: If the module was rebased (loaded at an address different from `OptionalHeader.ImageBase`), process the Base Relocation Table (via Data Directory 5). This table lists all the locations in the code/data that contain absolute addresses which need to be adjusted ("fixed up") based on the difference between the actual load address and the preferred `ImageBase`.
6. Set Final Memory Protections: Apply the final, potentially stricter, memory protections based on section characteristics and system policies (like DEP). For example, code sections typically become Read+Execute, data sections Read+Write (or Read-Only for `.rdata`).
7. TLS Callbacks: If a Thread Local Storage table exists (via Data Directory 9) and contains callback function pointers, execute these callbacks.
8. Transfer Execution: Finally, set up the initial thread context and jump to the module's entry point RVA (`OptionalHeader.AddressOfEntryPoint` added to the actual load address).
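The header-reading step can be sketched with Python's `struct` module. This is a minimal parser under stated assumptions: it reads only a handful of fields, does no bounds checking, and is exercised against a synthetic buffer built in the example (not a real executable). `parse_pe_headers` is an illustrative name.

```python
import struct


def parse_pe_headers(data: bytes) -> dict:
    """Follow e_lfanew to the PE signature and read a few header fields."""
    if data[:2] != b"MZ":
        raise ValueError("missing MZ signature")
    (e_lfanew,) = struct.unpack_from("<I", data, 0x3C)
    if data[e_lfanew:e_lfanew + 4] != b"PE\0\0":
        raise ValueError("missing PE signature")
    # The 20-byte COFF header follows the 4-byte signature.
    machine, number_of_sections = struct.unpack_from("<HH", data, e_lfanew + 4)
    optional = e_lfanew + 24
    (magic,) = struct.unpack_from("<H", data, optional)
    (entry_rva,) = struct.unpack_from("<I", data, optional + 16)
    if magic == 0x10B:      # PE32: 4-byte ImageBase at offset 28
        (image_base,) = struct.unpack_from("<I", data, optional + 28)
    elif magic == 0x20B:    # PE32+: 8-byte ImageBase at offset 24
        (image_base,) = struct.unpack_from("<Q", data, optional + 24)
    else:
        raise ValueError("unknown optional header magic")
    return {"machine": machine, "sections": number_of_sections,
            "entry_rva": entry_rva, "image_base": image_base,
            "entry_va": image_base + entry_rva}


# Synthetic header bytes for demonstration only.
sample = bytearray(0x80 + 24 + 32)
sample[0:2] = b"MZ"
struct.pack_into("<I", sample, 0x3C, 0x80)             # e_lfanew
sample[0x80:0x84] = b"PE\0\0"
struct.pack_into("<HH", sample, 0x84, 0x014C, 3)       # i386, 3 sections
struct.pack_into("<H", sample, 0x98, 0x10B)            # PE32 magic
struct.pack_into("<I", sample, 0x98 + 16, 0x1000)      # AddressOfEntryPoint
struct.pack_into("<I", sample, 0x98 + 28, 0x00400000)  # ImageBase

info = parse_pe_headers(bytes(sample))
print(hex(info["entry_va"]))  # 0x401000
```

The entry point virtual address here assumes the image loads at its preferred base. After rebasing, the actual load address replaces `image_base` in the sum.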
Memory Allocation, Compilers, & Data Sections
How variables and constants end up in specific PE sections is largely determined by the compiler (like GCC, Clang, MSVC) and linker based on C/C++ (or other language) declarations:
C/C++ Example | Storage Class | Typical PE Section | Memory Permissions | Initialized? |
---|---|---|---|---|
void func() { int x; } | Local Automatic | Stack (Not in PE file) | Read/Write | No (Garbage) |
int global_y = 10; | Global Initialized | .data | Read/Write | Yes (value 10 stored in file) |
static int static_z = 20; | Static Initialized | .data | Read/Write | Yes (value 20 stored in file) |
int global_a; | Global Uninitialized | .bss | Read/Write | No (Zeroed by loader) |
static int static_b; | Static Uninitialized | .bss | Read/Write | No (Zeroed by loader) |
const char* str = "Hello"; const int val = 5; | Constant / String Literal | .rdata (often) | Read-Only | Yes (values stored in file) |
int* ptr = new int; | Dynamic Allocation | Heap (Not in PE file) | Read/Write | Varies (by allocator) |
The compiler makes optimization decisions (e.g., placing truly constant data in `.rdata`, pooling identical strings). The linker then gathers all the code and data generated by the compiler (from potentially multiple source files and libraries) and arranges them into the final PE sections according to rules and directives.
Understanding memory layout, virtual memory, and the loading process is crucial for malware analysis. Attacks like stack/heap overflows, Return-Oriented Programming (ROP), and process injection directly manipulate these memory structures. Malware might try to load at unusual ImageBases, map sections with incorrect permissions (e.g., writable code), or abuse the relocation process. Analyzing memory dumps requires knowing where different types of data (code, stack, heap, imports) reside in the virtual address space.
File Identification: Digital File Structure and Signatures
Understanding Digital File Signatures
Digital files are not just random sequences of bytes - they follow specific formats that help operating systems and applications identify and process them correctly:
Magic Numbers and File Signatures
- PE Files: Begin with "MZ" (`4D 5A`) at offset 0, and "PE\0\0" (`50 45 00 00`) at the PE header offset.
- ELF Files: Start with `7F 45 4C 46` (DEL + "ELF").
- Java Class: Begin with `CA FE BA BE`.
- .NET Assemblies: Use the PE format but contain a CLR header.
- Legacy Office Documents (DOC/XLS/PPT): Usually begin with `D0 CF 11 E0` (Compound File Binary Format).
- ZIP-based: Start with "PK\x03\x04" (`50 4B 03 04`), including:
  - JAR files (Java Archives)
  - APK files (Android Packages)
  - DOCX/XLSX/PPTX (Modern Office)
These signatures serve multiple purposes:
- Quick file type identification without parsing the whole file
- Validation of file integrity and format
- Prevention of accidental misuse (e.g., trying to execute non-executable files)
- Historical compatibility (e.g., MZ header for DOS)
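A first-pass identifier based only on leading bytes might look like the sketch below. The table holds the signatures listed above, and `identify` is an illustrative helper. A match on magic bytes alone is only a hint, since a file can carry a valid signature and still be malformed or disguised.

```python
SIGNATURES = [
    (b"MZ", "DOS/PE executable"),
    (b"\x7fELF", "ELF"),
    (b"\xCA\xFE\xBA\xBE", "Java class"),
    (b"\xD0\xCF\x11\xE0", "Compound File Binary (legacy Office)"),
    (b"PK\x03\x04", "ZIP container (JAR/APK/DOCX...)"),
]


def identify(data: bytes) -> str:
    """Return a coarse file type guess based on the leading magic bytes."""
    for magic, name in SIGNATURES:
        if data.startswith(magic):
            return name
    return "unknown"


print(identify(bytes.fromhex("4D5A9000")))   # DOS/PE executable
print(identify(bytes.fromhex("7F454C46")))   # ELF
print(identify(b"hello"))                    # unknown
```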
File Headers and Metadata Structures
Most modern file formats include sophisticated header structures that provide metadata about the file's contents and organization:
Common Header Elements
- Signature/Magic Number: Identifies the file type
- Version Information: Format version, compatibility flags
- Size Fields: File/content sizes, offsets to important structures
- Checksums/Hashes: For integrity verification
- Timestamps: Creation, modification dates
- Feature Flags: Indicates supported features or restrictions
PE Format Header Chain
The PE format demonstrates a sophisticated header chain design:
    +---------------------------+
    |        File Start         |
    |      DOS Header (MZ)      |
    +---------------------------+
                  │
                  ▼
    +---------------------------+
    |         DOS Stub          |
    |   Optional DOS Program    |
    +---------------------------+
                  │
                  ▼
    +---------------------------+
    |         PE Header         |
    | PE Signature + File Header|
    +---------------------------+
                  │
                  ▼
    +---------------------------+
    |      Optional Header      |
    |  Windows-Specific Fields  |
    +---------------------------+
                  │
                  ▼
    +---------------------------+
    |       Section Table       |
    |    Section Definitions    |
    +---------------------------+
                  │
                  ▼
    +---------------------------+
    |       Section Data        |
    |      Actual Content       |
    +---------------------------+
Rigorous File Identification
Proper file identification involves more than just checking signatures:
Multi-Layer Validation
- Signature Checking:
- Verify magic numbers at correct offsets
- Check for secondary signatures (e.g., PE after MZ)
- Validate header checksums
- Structural Validation:
- Parse and validate header fields
- Verify pointer/offset validity
- Check section alignment and sizes
- Content Analysis:
- Validate internal data structures
- Check for format-specific markers
- Analyze entropy and patterns
Security Implications
Thorough file identification is crucial for security:
- Prevents file type confusion attacks
- Identifies malformed or crafted files
- Detects attempts to bypass file type restrictions
- Helps identify packed or obfuscated malware
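Entropy analysis, mentioned under content analysis, is commonly done with Shannon entropy over byte frequencies. A minimal sketch follows. Values near 8 bits per byte suggest compressed or encrypted data, though any threshold for calling something "packed" is a judgment call and not defined here.

```python
import math
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Return the Shannon entropy of data in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    # Sum p * log2(1/p) over every byte value that occurs.
    return sum((c / total) * math.log2(total / c) for c in counts.values())


print(shannon_entropy(b"\x00" * 64))        # 0.0 (one repeated byte)
print(shannon_entropy(bytes(range(256))))   # 8.0 (every byte value once)
```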
File Format Evolution
File formats have evolved to meet changing needs:
Historical Progression
- Early Era (1960s-70s):
- Simple binary formats
- No standardized headers
- Platform-specific designs
- Standardization Era (1980s-90s):
- Introduction of magic numbers
- Structured headers
- Cross-platform considerations
- Modern Era (2000s+):
- Complex metadata structures
- Security features
- Extensible designs
- Container formats (e.g., ZIP-based)
File identification and format understanding is fundamental to malware analysis and reverse engineering. Malware authors often manipulate file headers and structures to evade detection or confuse analysis tools. A deep understanding of file formats enables analysts to:
- Identify malformed or suspicious files
- Detect attempts to hide malicious content
- Understand packing and obfuscation techniques
- Extract and analyze embedded payloads
- Reconstruct damaged or manipulated files
Early Executable Formats: The Precursors
The Dawn: Raw Machine Code & Punch Cards
In the earliest days of computing (ENIAC, UNIVAC), there wasn't really an "executable format" as we know it. Programs were:
- Entered via physical switches or wiring plugboards.
- Loaded from punch cards or paper tape containing raw machine instructions.
- Loaded directly into specific memory locations.
- Execution started by manually setting the instruction pointer.
- No metadata, no OS loader assistance, just raw bytes loaded and run.
The .COM Era (CP/M, Early MS-DOS)
The `.COM` (Command) file format was a step up, but still incredibly simple:
- Structure: Essentially formatless. The file is just raw x86 machine code.
- Loading: The OS allocated a 64KB memory segment, loaded the entire file content starting at offset `0x100` within that segment, set all segment registers (CS, DS, ES, SS) to point to the start of the segment, set SP to the end of the segment, and jumped to `0x100` to start execution.
- Size Limit: Maximum size was 65,280 bytes (64KB - 256 bytes for the PSP).
- No Metadata: No header, no relocation info, no import/export tables. Everything (code, data, stack) had to fit and manage itself within the single 64KB segment.
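- Relocatability: No relocation mechanism. The code always sits at offset `0x100` of whichever segment DOS picks, so offset-only references work in any segment, but the file cannot carry segment references that need fixing up at load time.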
    ; Example COM program structure (NASM syntax)
    org 0x100               ; Tell assembler code starts at 0x100
    section .text
    start:
        mov ah, 9           ; DOS function: Print string
        mov dx, message     ; Address of string
        int 21h             ; Call DOS interrupt
        mov ah, 4Ch         ; DOS function: Terminate program
        int 21h
    section .data
    message db 'Hello from COM!', 0Dh, 0Ah, '$' ; String must end with '$'
Simple, but extremely limited for larger, more complex programs.
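The loading rule described above for .COM files can be sketched in a few lines of Python. This is only an illustration of the memory layout, not an emulator: `load_com`, `PSP_SIZE`, and `SEGMENT_SIZE` are hypothetical names, and the PSP area is simply left zeroed.

```python
PSP_SIZE = 0x100        # 256-byte Program Segment Prefix at the start of the segment
SEGMENT_SIZE = 0x10000  # one 64KB real-mode segment


def load_com(image: bytes) -> bytearray:
    """Place a .COM image at offset 0x100 of a fresh 64KB segment."""
    if len(image) > SEGMENT_SIZE - PSP_SIZE:
        raise ValueError("image does not fit in a single segment after the PSP")
    segment = bytearray(SEGMENT_SIZE)
    segment[PSP_SIZE:PSP_SIZE + len(image)] = image
    return segment  # execution would begin at segment offset 0x100
```

The size check is the 65,280-byte limit from the list above: 0x10000 - 0x100.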
The MZ Revolution (.EXE in MS-DOS)
The `.EXE` format, identified by the "MZ" signature (for Mark Zbikowski), was a major leap forward introduced with MS-DOS:
- Structure: Introduced the first real header: the MZ Header.
- MZ Header: Contained metadata like file size, initial stack segment/pointer, entry point (CS:IP), and crucially, a Relocation Table.
- Relocatability: The relocation table listed segment addresses within the code/data that needed to be "fixed up" by the DOS loader based on the actual memory segment where the program was loaded. This allowed EXEs to be loaded anywhere in memory.
- Multi-Segment Support: Allowed programs to use multiple code and data segments, breaking the 64KB barrier of COM files.
- No Imports/Exports Yet: Still lacked standardized ways to link dynamically with other code modules (libraries).
    ; Conceptual MZ EXE Structure
    ┌─────────────────────┐
    │ MZ Header           │ Contains file size, entry point (CS:IP),
    │ (IMAGE_DOS_HEADER)  │ initial SS:SP, relocation table offset...
    ├─────────────────────┤
    │ Relocation Table    │ List of segment addresses needing fixup
    ├─────────────────────┤
    │                     │
    │ Program Code & Data │ Loaded into memory based on header info
    │ (Load Module)       │
    │                     │
    └─────────────────────┘
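To make the fixup step concrete, here is a minimal Python sketch of what a DOS-style loader does with the relocation table. It assumes the standard MZ header layout (fourteen 16-bit words, relocation entries stored as offset/segment pairs); `apply_mz_relocations` is a hypothetical helper name, and PSP setup and register initialisation are left out.

```python
import struct


def apply_mz_relocations(exe: bytes, load_segment: int) -> bytearray:
    """Return the load module with every relocated word adjusted by load_segment."""
    fields = struct.unpack_from("<14H", exe, 0)
    e_magic, e_cblp, e_cp, e_crlc, e_cparhdr = fields[:5]
    e_lfarlc = fields[12]
    if e_magic != 0x5A4D:
        raise ValueError("not an MZ executable")

    # File size is counted in 512-byte pages; e_cblp is the used part of the last page.
    file_size = e_cp * 512 - ((512 - e_cblp) if e_cblp else 0)
    header_size = e_cparhdr * 16  # header size is given in 16-byte paragraphs
    image = bytearray(exe[header_size:file_size])

    for i in range(e_crlc):
        offset, segment = struct.unpack_from("<HH", exe, e_lfarlc + 4 * i)
        pos = segment * 16 + offset  # position of the word inside the load module
        (value,) = struct.unpack_from("<H", image, pos)
        struct.pack_into("<H", image, pos, (value + load_segment) & 0xFFFF)
    return image
```

Each fixup adds the actual load segment to a 16-bit segment value stored in the code or data, which is what lets the same file run wherever DOS places it.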
The MZ header is still present at the beginning of modern PE files, primarily for backward compatibility and to point to the real PE header via the `e_lfanew` field.
The evolution from raw code to COM and then MZ EXE files demonstrates the increasing need for metadata and flexibility as programs became more complex. COM files were simple but restrictive. MZ EXEs introduced headers and relocation, enabling larger programs that could load anywhere in memory. However, they still lacked features like dynamic linking and robust memory protection found in modern formats. This historical context helps understand why the PE format includes elements like the MZ header and why features like relocation tables were developed.
PE Format Introduction: The Modern Standard
The Portable Executable (PE) Format
Introduced with Windows NT, the Portable Executable (PE) format is the standard for executables, object code, DLLs, and others on 32-bit and 64-bit versions of Windows. It's derived from the Unix COFF (Common Object File Format) specification and adds features specific to Windows.
Key Goals and Characteristics
- Portability: Designed to support multiple CPU architectures (though primarily used for x86/x64). The COFF header specifies the target machine.
- Extensibility: Supports various data types beyond code and basic data, like resources, debug info, digital signatures, and .NET metadata via Data Directories.
- Virtual Memory Centric: Designed explicitly for paged, protected virtual memory operating systems. Addresses and layout are defined in terms of virtual addresses.
- Dynamic Linking: Rich support for importing functions from DLLs and exporting functions for others to use (Import/Export Tables).
- Section-Based Layout: Organizes the file into logical sections (`.text`, `.data`, `.rsrc`, etc.) with specific memory permissions (Read/Write/Execute). This organization is typically determined by the linker tool, which combines compiled code and data.
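The PE format is used for nearly all executable content on Windows:
- `.exe`: Applications
- `.dll`: Dynamic Link Libraries
- `.sys`: Kernel-mode Drivers
- `.ocx`: ActiveX Controls
- `.cpl`: Control Panel Applets
- `.scr`: Screen Savers
- Object files (`.obj`) produced during compilation use the closely related COFF layout: the same file header and section table, but no MZ header or PE signature, and typically no Optional Header.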
High-Level PE Structure Overview
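A PE file follows a well-defined structure, starting with legacy headers and progressing to Windows-specific information: the DOS MZ header and DOS stub, the PE signature, the COFF file header, the Optional Header (ending with the Data Directories), the Section Table, and finally the raw data of each section.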
Note: Section order in the file doesn't necessarily match memory layout.
PE32 vs PE32+ (64-bit)
The PE format adapts for 32-bit and 64-bit architectures, primarily within the Optional Header:
PE32 (32-bit)
- Optional Header Magic Number: `0x10B` (`IMAGE_NT_OPTIONAL_HDR32_MAGIC`)
- Addresses/Sizes (like ImageBase, stack sizes): 32-bit (`DWORD`)
- Includes `BaseOfData` field in Optional Header.
- Designed for 32-bit address space.
PE32+ (64-bit)
- Optional Header Magic Number: `0x20B` (`IMAGE_NT_OPTIONAL_HDR64_MAGIC`)
- Addresses/Sizes (like ImageBase, stack sizes): 64-bit (`ULONGLONG` or `DWORD64`)
- Omits `BaseOfData` field.
- Designed for 64-bit address space.
- Structurally very similar to PE32, just wider fields for addresses.
Tools analyzing PE files must check the Magic number to parse the Optional Header correctly.
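As a rough illustration of that check, the sketch below follows `e_lfanew` to the PE signature and reads the Optional Header magic. The function name `optional_header_format` is made up for this example, and it assumes the whole file is already in memory as `bytes`.

```python
import struct

PE32_MAGIC = 0x10B
PE32_PLUS_MAGIC = 0x20B


def optional_header_format(data: bytes) -> str:
    """Return 'PE32' or 'PE32+' based on the Optional Header magic."""
    if data[:2] != b"MZ":
        raise ValueError("missing MZ signature")
    (e_lfanew,) = struct.unpack_from("<I", data, 0x3C)
    if data[e_lfanew:e_lfanew + 4] != b"PE\0\0":
        raise ValueError("missing PE signature")
    # The Optional Header follows the 4-byte signature and the 20-byte COFF header.
    (magic,) = struct.unpack_from("<H", data, e_lfanew + 4 + 20)
    if magic == PE32_MAGIC:
        return "PE32"
    if magic == PE32_PLUS_MAGIC:
        return "PE32+"
    raise ValueError(f"unexpected Optional Header magic {magic:#x}")
```

A parser would branch on the returned value before reading any field after `BaseOfCode`, since the offsets diverge from that point on.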
The PE format is the container for almost all executable code on Windows. For malware analysts, it's the first thing encountered. Understanding its structure is fundamental. Analyzing the PE headers and section layout provides initial clues about a sample's nature: Is it packed? Is it a DLL or EXE? What architecture does it target? Does it import suspicious functions? Does it contain unusual resources? Are security features like ASLR/DEP enabled? Mastering PE structure is step one in static malware analysis.
PE Headers In Detail: The Blueprint
DOS MZ Header (IMAGE_DOS_HEADER)
The very first part of a PE file, a remnant from DOS days. Starts with the signature 'MZ' (`4D 5A` in hex).
    typedef struct _IMAGE_DOS_HEADER {  // DOS .EXE header
        WORD e_magic;      // Magic number (0x5A4D)
        WORD e_cblp;       // Bytes on last page of file
        WORD e_cp;         // Pages in file
        WORD e_crlc;       // Relocations
        WORD e_cparhdr;    // Size of header in paragraphs
        WORD e_minalloc;   // Minimum extra paragraphs needed
        WORD e_maxalloc;   // Maximum extra paragraphs needed
        WORD e_ss;         // Initial (relative) SS value
        WORD e_sp;         // Initial SP value
        WORD e_csum;       // Checksum
        WORD e_ip;         // Initial IP value
        WORD e_cs;         // Initial (relative) CS value
        WORD e_lfarlc;     // File address of relocation table
        WORD e_ovno;       // Overlay number
        WORD e_res[4];     // Reserved words
        WORD e_oemid;      // OEM identifier (for e_oeminfo)
        WORD e_oeminfo;    // OEM information; e_oemid specific
        WORD e_res2[10];   // Reserved words
        LONG e_lfanew;     // File address of PE header
    } IMAGE_DOS_HEADER, *PIMAGE_DOS_HEADER;
Key Fields for PE:
- `e_magic`: Must be `0x5A4D` ('MZ'). Identifies the file as potentially executable.
- `e_lfanew`: Crucial. This 4-byte value at offset `0x3C` gives the file offset where the actual PE Signature and Headers begin.
A small "DOS stub" program often follows this header, which prints "This program cannot be run in DOS mode" if executed on DOS.
PE Signature & COFF/Image File Header (IMAGE_FILE_HEADER)
At the offset specified by `e_lfanew`, we find:
- PE Signature: 4 bytes - `50 45 00 00` ('P' 'E' \0 \0).
- COFF / Image File Header: Contains basic properties of the file.
    typedef struct _IMAGE_FILE_HEADER {
        WORD  Machine;               // Target architecture (e.g., 0x14c=x86, 0x8664=x64)
        WORD  NumberOfSections;      // How many sections follow the headers
        DWORD TimeDateStamp;         // Linker timestamp (seconds since Unix epoch)
        DWORD PointerToSymbolTable;  // File offset of COFF symbol table (usually 0)
        DWORD NumberOfSymbols;       // Number of entries in symbol table (usually 0)
        WORD  SizeOfOptionalHeader;  // Size of the *next* header (Optional Header)
        WORD  Characteristics;       // Flags describing the file (e.g., EXE, DLL, large-address aware)
    } IMAGE_FILE_HEADER, *PIMAGE_FILE_HEADER;
Key Fields:
- `Machine`: Identifies the target CPU (`IMAGE_FILE_MACHINE_I386`, `IMAGE_FILE_MACHINE_AMD64`, etc.).
- `NumberOfSections`: Tells the loader how many section headers to read from the Section Table.
- `TimeDateStamp`: Can sometimes indicate compilation time, but easily forged by malware.
- `SizeOfOptionalHeader`: Size of the next structure (IMAGE_OPTIONAL_HEADER).
- `Characteristics`: Important flags like:
  - `IMAGE_FILE_EXECUTABLE_IMAGE (0x0002)`: File is runnable.
  - `IMAGE_FILE_DLL (0x2000)`: File is a DLL.
  - `IMAGE_FILE_LARGE_ADDRESS_AWARE (0x0020)`: App can handle >2GB addresses (32-bit).
  - `IMAGE_FILE_RELOCS_STRIPPED (0x0001)`: No relocation info (bad for DLLs/ASLR).
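The fields above map directly onto a 20-byte little-endian record, so a small Python sketch can decode them. `parse_file_header` and the two lookup tables are illustrative names; only the machine and flag values quoted in this section are included, and the caller is assumed to pass the bytes that start right after the `PE\0\0` signature.

```python
import struct
from datetime import datetime, timezone

MACHINES = {0x014C: "x86", 0x8664: "x64"}
CHARACTERISTICS = {
    0x0001: "RELOCS_STRIPPED",
    0x0002: "EXECUTABLE_IMAGE",
    0x0020: "LARGE_ADDRESS_AWARE",
    0x2000: "DLL",
}


def parse_file_header(raw: bytes) -> dict:
    """Decode an IMAGE_FILE_HEADER (20 bytes) into a readable dict."""
    (machine, n_sections, timestamp, _sym_ptr, _n_syms,
     opt_size, flags) = struct.unpack_from("<HHIIIHH", raw, 0)
    return {
        "machine": MACHINES.get(machine, hex(machine)),
        "sections": n_sections,
        # The timestamp is attacker-controlled, so treat it as a hint only.
        "timestamp": datetime.fromtimestamp(timestamp, tz=timezone.utc),
        "optional_header_size": opt_size,
        "flags": [name for bit, name in CHARACTERISTICS.items() if flags & bit],
    }
```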
Optional Header (IMAGE_OPTIONAL_HEADER32 / IMAGE_OPTIONAL_HEADER64)
Despite the name, this header is required for executable images (EXEs, DLLs). It contains the most critical information for the OS loader.
    // Structure differs slightly between 32/64 bit (field sizes)
    typedef struct _IMAGE_OPTIONAL_HEADER {
        // Standard COFF fields.
        WORD      Magic;                    // 0x10b = PE32, 0x20b = PE32+ (64-bit)
        BYTE      MajorLinkerVersion;
        BYTE      MinorLinkerVersion;
        DWORD     SizeOfCode;               // Sum of all code sections' size
        DWORD     SizeOfInitializedData;
        DWORD     SizeOfUninitializedData;  // Size of .bss section
        DWORD     AddressOfEntryPoint;      // RVA where execution starts
        DWORD     BaseOfCode;               // RVA of the beginning of the code section
        // DWORD  BaseOfData;               // RVA of beginning of data section (PE32 only!)
        // NT additional fields.
        ULONGLONG ImageBase;                // Preferred load address (64-bit in PE32+)
        DWORD     SectionAlignment;         // Alignment (in bytes) of sections in memory
        DWORD     FileAlignment;            // Alignment (in bytes) of sections in file
        WORD      MajorOperatingSystemVersion;
        /* ... other version fields ... */
        DWORD     SizeOfImage;              // Total size of the image in memory
        DWORD     SizeOfHeaders;            // Size of DOS hdr + PE sig + COFF hdr + Opt hdr + Section hdrs
        DWORD     CheckSum;                 // Image file checksum (often 0)
        WORD      Subsystem;                // Target subsystem (e.g., Windows GUI, Console)
        WORD      DllCharacteristics;       // Flags like ASLR, DEP, CFG support
        ULONGLONG SizeOfStackReserve;       // Total stack size to reserve (64-bit in PE32+)
        /* ... other stack/heap size fields ... */
        DWORD     NumberOfRvaAndSizes;      // Number of entries in DataDirectory (usually 16)
        IMAGE_DATA_DIRECTORY DataDirectory[IMAGE_NUMBEROF_DIRECTORY_ENTRIES]; // Array of directory entries
    } IMAGE_OPTIONAL_HEADER;
Key Fields:
- `Magic`: Distinguishes PE32 (`0x10B`) from PE32+ (`0x20B`).
- `AddressOfEntryPoint`: RVA of the first instruction to execute. Crucial for analysis.
- `ImageBase`: Preferred virtual address for loading.
- `SectionAlignment` / `FileAlignment`: Dictate how sections are aligned in memory vs. the file. Must be powers of 2.
- `SizeOfImage`: Total virtual size needed when mapped into memory.
- `SizeOfHeaders`: Combined size of all headers, rounded up to FileAlignment. Defines where the first section's data starts in the file.
- `Subsystem`: Target subsystem (`IMAGE_SUBSYSTEM_WINDOWS_GUI`, `IMAGE_SUBSYSTEM_WINDOWS_CUI` for console programs, `IMAGE_SUBSYSTEM_NATIVE`, etc.).
- `DllCharacteristics`: Security flags (`IMAGE_DLLCHARACTERISTICS_DYNAMIC_BASE` (ASLR), `_NX_COMPAT` (DEP), `_GUARD_CF` (CFG)).
- `DataDirectory`: Array pointing to other important data structures (Imports, Exports, Resources, Relocations, etc.).
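For triage, the `DllCharacteristics` word is often the quickest thing to decode. The sketch below covers only the three mitigations named above; the numeric bit values come from the Windows SDK headers rather than from this text, and both function names are hypothetical.

```python
# Bit values as defined in the Windows SDK (winnt.h).
MITIGATION_FLAGS = {
    0x0040: "DYNAMIC_BASE",  # ASLR
    0x0100: "NX_COMPAT",     # DEP
    0x4000: "GUARD_CF",      # CFG
}


def enabled_mitigations(dll_characteristics: int) -> list:
    """Names of the mitigation flags that are set."""
    return [name for bit, name in MITIGATION_FLAGS.items()
            if dll_characteristics & bit]


def missing_mitigations(dll_characteristics: int) -> list:
    """Names of the mitigation flags that are not set."""
    return [name for bit, name in MITIGATION_FLAGS.items()
            if not dll_characteristics & bit]
```

For example, a value of `0x8160` has ASLR and DEP set but not CFG; other bits in the word are simply ignored by this sketch.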
Data Directories (IMAGE_DATA_DIRECTORY)
The last field of the Optional Header is an array (typically 16 entries) of `IMAGE_DATA_DIRECTORY` structures. Each entry points to a specific table or data structure within the PE file, if present.
    typedef struct _IMAGE_DATA_DIRECTORY {
        DWORD VirtualAddress;  // RVA of the data/table
        DWORD Size;            // Size in bytes of the data/table
    } IMAGE_DATA_DIRECTORY, *PIMAGE_DATA_DIRECTORY;

    // Indices into the DataDirectory array:
    #define IMAGE_DIRECTORY_ENTRY_EXPORT          0  // Export Table (.edata)
    #define IMAGE_DIRECTORY_ENTRY_IMPORT          1  // Import Table (.idata)
    #define IMAGE_DIRECTORY_ENTRY_RESOURCE        2  // Resource Table (.rsrc)
    #define IMAGE_DIRECTORY_ENTRY_EXCEPTION       3  // Exception Table (.pdata)
    #define IMAGE_DIRECTORY_ENTRY_SECURITY        4  // Certificate Table (Attribute Certificates)
    #define IMAGE_DIRECTORY_ENTRY_BASERELOC       5  // Base Relocation Table (.reloc)
    #define IMAGE_DIRECTORY_ENTRY_DEBUG           6  // Debug Directory
    // ... Architecture Specific (7) ...
    // ... Global Ptr (8) ...
    #define IMAGE_DIRECTORY_ENTRY_TLS             9  // TLS Table
    // ... Load Config (10) ...
    #define IMAGE_DIRECTORY_ENTRY_BOUND_IMPORT   11  // Bound Import Table
    #define IMAGE_DIRECTORY_ENTRY_IAT            12  // Import Address Table
    #define IMAGE_DIRECTORY_ENTRY_DELAY_IMPORT   13  // Delay Import Descriptors
    #define IMAGE_DIRECTORY_ENTRY_COM_DESCRIPTOR 14  // CLR Runtime Header (.NET)
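If an entry's `VirtualAddress` and `Size` are both zero, that directory is not present. These directories are essential for finding imports/exports, resources, relocation info, digital signatures, .NET headers, etc. One exception to the RVA rule: for the Certificate Table (entry 4), `VirtualAddress` holds a file offset, because certificates are not mapped into memory.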
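A short sketch of walking the directory array follows. It assumes the caller passes the raw Optional Header bytes and relies on the standard layout, where `NumberOfRvaAndSizes` sits at offset 92 in PE32 and 108 in PE32+ with the array immediately after it. `present_directories` and the short names in `DIRECTORY_NAMES` are illustrative.

```python
import struct

DIRECTORY_NAMES = [
    "EXPORT", "IMPORT", "RESOURCE", "EXCEPTION", "SECURITY", "BASERELOC",
    "DEBUG", "ARCHITECTURE", "GLOBALPTR", "TLS", "LOAD_CONFIG",
    "BOUND_IMPORT", "IAT", "DELAY_IMPORT", "COM_DESCRIPTOR", "RESERVED",
]


def present_directories(opt_header: bytes) -> dict:
    """Map directory name -> (VirtualAddress, Size) for non-empty entries."""
    (magic,) = struct.unpack_from("<H", opt_header, 0)
    count_offset = {0x10B: 92, 0x20B: 108}[magic]  # PE32 vs PE32+
    (count,) = struct.unpack_from("<I", opt_header, count_offset)
    found = {}
    # Clamp the count: a hostile file may claim more than 16 entries.
    for i in range(min(count, len(DIRECTORY_NAMES))):
        rva, size = struct.unpack_from("<II", opt_header, count_offset + 4 + 8 * i)
        if rva or size:
            found[DIRECTORY_NAMES[i]] = (rva, size)
    return found
```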
Offset vs. Length: The PE Pattern
A recurring pattern in PE structures is the use of pairs of values to define data:
- An Offset or Address indicating where the data starts.
- A Size or Length indicating how much data there is.
Examples:
- Section Headers: `PointerToRawData` (file offset) + `SizeOfRawData` (file length), and `VirtualAddress` (memory RVA) + `VirtualSize` (memory length).
- Data Directories: `VirtualAddress` (RVA of the table) + `Size` (size of the table).
- Resource Entries: Pointers to resource data + size of resource data.
- Relocation Blocks: RVA of the block + size of the block.
Understanding this pattern is key to navigating the file. Always use the Size field to know how much data to read or parse starting from the given offset/address. Alignment rules (FileAlignment, SectionAlignment) can mean the space occupied is larger than the actual data size.
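The effect of alignment on occupied space is simple arithmetic. The helper below (a hypothetical `align_up`, assuming the alignment is a power of two as the format requires) rounds a size up to the next boundary.

```python
def align_up(value: int, alignment: int) -> int:
    """Round value up to the next multiple of a power-of-two alignment."""
    if alignment <= 0 or alignment & (alignment - 1):
        raise ValueError("alignment must be a power of two")
    return (value + alignment - 1) & ~(alignment - 1)
```

With a `FileAlignment` of 0x200, 0x1234 bytes of section data occupy `align_up(0x1234, 0x200)` = 0x1400 bytes in the file; with a `SectionAlignment` of 0x1000 the same data occupies 0x2000 bytes of address space.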
The PE headers act as the file's blueprint, guiding the OS loader. Malware frequently tampers with these headers for various purposes: anti-analysis (confusing tools by setting invalid sizes/pointers), hiding code (placing the entry point in an unusual section or outside any defined section), modifying characteristics (e.g., marking a data section as executable), or manipulating Data Directories (e.g., hiding imports). Careful examination of all header fields against expected values is a critical step in static analysis.
PE Sections: Organizing Code and Data
Segments vs. Sections
While sometimes used interchangeably, these terms have distinct meanings in the context of PE files and x86 architecture:
- Segments (Historical x86): A memory management concept from older x86 modes (Real Mode, Protected Mode with segmentation) using segment registers (CS, DS, SS, ES, FS, GS) to define base addresses for memory access. Largely abstracted away in modern flat memory models used by Windows, but segment registers are still used (implicitly or explicitly).
- Sections (PE File Format): Logical divisions of the PE file defined in the Section Table. Each section groups related content (code, data, resources) and has associated attributes like name, size in file, size in memory, file offset, virtual address (RVA), and memory permissions (Read/Write/Execute). The linker (part of the compiler toolchain, like `link.exe` for MSVC or `ld` for GCC/Clang) is responsible for grouping code and data into these sections.
In essence, the PE loader maps the Sections defined in the file into the process's virtual address space, which is typically treated as a single large Segment (flat memory model) by the running code.
Section Table (Array of IMAGE_SECTION_HEADER)
Immediately following the Optional Header is the Section Table, which is an array of `IMAGE_SECTION_HEADER` structures. The number of entries in this array is given by `IMAGE_FILE_HEADER.NumberOfSections`. Each header describes one section:
    #define IMAGE_SIZEOF_SHORT_NAME 8

    typedef struct _IMAGE_SECTION_HEADER {
        BYTE  Name[IMAGE_SIZEOF_SHORT_NAME];  // 8-byte, null-padded ASCII name (e.g., ".text\0\0\0")
        union {
            DWORD PhysicalAddress;            // (Historical/Obsolete)
            DWORD VirtualSize;                // Total size of the section in memory (bytes)
        } Misc;
        DWORD VirtualAddress;                 // RVA of the section's start in memory
        DWORD SizeOfRawData;                  // Size of the section's data in the file (bytes)
        DWORD PointerToRawData;               // File offset to the section's data
        DWORD PointerToRelocations;           // File offset to relocations for this section (OBJ files)
        DWORD PointerToLinenumbers;           // File offset to line numbers (debug)
        WORD  NumberOfRelocations;            // Number of relocation entries
        WORD  NumberOfLinenumbers;            // Number of line number entries
        DWORD Characteristics;                // Flags describing section permissions and content type
    } IMAGE_SECTION_HEADER, *PIMAGE_SECTION_HEADER;
Key Fields:
- `Name`: An 8-byte name (often starting with '.', like `.text`, `.data`). Not guaranteed to be null-terminated if exactly 8 chars.
- `VirtualSize`: The actual size the section will occupy in virtual memory. Can be larger than `SizeOfRawData` (e.g., for `.bss`).
- `VirtualAddress`: The RVA (relative to ImageBase) where the section will be loaded in memory.
- `SizeOfRawData`: The size of the section's data in the file. Must be a multiple of `FileAlignment`. Can be 0 for uninitialized data sections like `.bss`.
- `PointerToRawData`: The offset from the beginning of the file where this section's data starts. Must be a multiple of `FileAlignment`. Can be 0 if `SizeOfRawData` is 0.
- `Characteristics`: Flags defining memory permissions and content type. Very important for analysis. Common flags include:
  - `IMAGE_SCN_CNT_CODE (0x20)`: Contains executable code.
  - `IMAGE_SCN_CNT_INITIALIZED_DATA (0x40)`: Contains initialized data.
  - `IMAGE_SCN_CNT_UNINITIALIZED_DATA (0x80)`: Contains uninitialized data (`.bss`).
  - `IMAGE_SCN_MEM_EXECUTE (0x20000000)`: Section is executable.
  - `IMAGE_SCN_MEM_READ (0x40000000)`: Section is readable.
  - `IMAGE_SCN_MEM_WRITE (0x80000000)`: Section is writable.
  - `IMAGE_SCN_MEM_SHARED (0x10000000)`: Section memory is shared across processes mapping the image.
The actual raw data for the sections follows the section table in the file, located at the offsets specified by `PointerToRawData`.
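Each section header is a fixed 40-byte record, so it can be unpacked in one call. The sketch below assumes `raw` holds the Section Table bytes; `parse_section_header` and the `R`/`W`/`X` permission string are conventions invented for this example.

```python
import struct

MEM_EXECUTE = 0x20000000
MEM_READ = 0x40000000
MEM_WRITE = 0x80000000


def parse_section_header(raw: bytes, offset: int = 0) -> dict:
    """Decode one IMAGE_SECTION_HEADER (40 bytes) starting at offset."""
    (name, virtual_size, virtual_address, raw_size, raw_pointer,
     _reloc_ptr, _line_ptr, _n_relocs, _n_lines,
     characteristics) = struct.unpack_from("<8sIIIIIIHHI", raw, offset)
    perms = "".join(
        letter if characteristics & bit else "-"
        for letter, bit in (("R", MEM_READ), ("W", MEM_WRITE), ("X", MEM_EXECUTE))
    )
    return {
        # The name is null-padded and may fill all 8 bytes with no terminator.
        "name": name.split(b"\0", 1)[0].decode("ascii", "replace"),
        "virtual_size": virtual_size,
        "virtual_address": virtual_address,
        "size_of_raw_data": raw_size,
        "pointer_to_raw_data": raw_pointer,
        "characteristics": characteristics,
        "permissions": perms,
    }
```

To read the whole table, call it with `offset = 40 * i` for each `i` below `NumberOfSections`.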
Common PE Section Names and Purposes
While developers can name sections arbitrarily, linkers typically use standard names:
Name | Typical Content | Common Characteristics | Malware Relevance |
---|---|---|---|
`.text` | Executable Code | `CNT_CODE`, `MEM_EXECUTE`, `MEM_READ` | Main analysis target; packers often encrypt/compress this. |
`.data` | Initialized global/static variables | `CNT_INITIALIZED_DATA`, `MEM_READ`, `MEM_WRITE` | Stores configuration, hardcoded strings/values. |
`.rdata` | Read-only data (constants, strings) | `CNT_INITIALIZED_DATA`, `MEM_READ` | Often contains import/export info, string literals. |
`.bss` | Uninitialized global/static variables | `CNT_UNINITIALIZED_DATA`, `MEM_READ`, `MEM_WRITE` | Takes no file space; zeroed by loader. Used for large buffers. |
`.idata` | Import Tables (DLL names, function names/ordinals) | `CNT_INITIALIZED_DATA`, `MEM_READ` (sometimes `MEM_WRITE` for IAT patching) | Crucial for understanding external dependencies; often obfuscated. |
`.edata` | Export Table (Functions exported by a DLL) | `CNT_INITIALIZED_DATA`, `MEM_READ` | Defines the DLL's interface; malware DLLs export malicious functions. |
`.rsrc` | Resources (Icons, Dialogs, Menus, Strings, Version Info, custom data) | `CNT_INITIALIZED_DATA`, `MEM_READ` | Common place for malware to hide payloads, config, or dropper files. |
`.reloc` | Base Relocation Information | `CNT_INITIALIZED_DATA`, `MEM_READ`, `MEM_DISCARDABLE` | Needed if ASLR rebases the image; malware might strip this or add fake entries. |
`.pdata` | Exception Handling Information (x64 primarily) | `CNT_INITIALIZED_DATA`, `MEM_READ` | Used for stack unwinding; less common target for manipulation. |
`.tls` | Thread Local Storage data & callbacks | `CNT_INITIALIZED_DATA`, `MEM_READ`, `MEM_WRITE` | TLS callbacks run before entry point; common malware trick for early execution/anti-debug. |
Custom/Packer Names | (Varies - often packed code/data) | (Often unusual combinations like RWE or high entropy) | e.g., `UPX0`, `.RLPACK`, `.themida` - Strong indicator of packing/protection. |
RVAs, File Offsets, and Alignment
Mapping between memory addresses (RVAs) and file positions (Offsets) is fundamental:
- RVA (Relative Virtual Address): An address relative to the `ImageBase` when the file is loaded into memory. `Actual Memory Address = ImageBase + RVA`.
- File Offset: A byte offset from the beginning of the PE file on disk.
- Mapping: To find the file offset corresponding to an RVA:
  - Iterate through the Section Table to find the section containing the RVA: `Section.VirtualAddress <= RVA < Section.VirtualAddress + Section.VirtualSize`
  - Calculate the offset within the section: `OffsetInSection = RVA - Section.VirtualAddress`
  - Calculate the file offset: `FileOffset = Section.PointerToRawData + OffsetInSection`
  - Caveat: This only works if `OffsetInSection < Section.SizeOfRawData`. If the RVA points to data that exists in memory but not in the file (like the upper part of `.bss`), there's no direct file offset.
- Alignment: `FileAlignment` dictates the alignment of `PointerToRawData` and `SizeOfRawData` in the file (typically 512 bytes or 4KB). `SectionAlignment` dictates the alignment of `VirtualAddress` in memory (typically page size, 4KB). This can create gaps between sections in the file or in memory.
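The mapping steps above translate almost line for line into code. This is a minimal sketch: the `Section` tuple and `rva_to_offset` are hypothetical names, and the section list is assumed to have been parsed from the Section Table already.

```python
from typing import NamedTuple, Optional, Sequence


class Section(NamedTuple):
    virtual_address: int
    virtual_size: int
    pointer_to_raw_data: int
    size_of_raw_data: int


def rva_to_offset(rva: int, sections: Sequence[Section]) -> Optional[int]:
    """Translate an RVA to a file offset, or None if no file bytes back it."""
    for s in sections:
        if s.virtual_address <= rva < s.virtual_address + s.virtual_size:
            offset_in_section = rva - s.virtual_address
            if offset_in_section < s.size_of_raw_data:
                return s.pointer_to_raw_data + offset_in_section
            return None  # memory-only part of the section (e.g. zero-filled tail)
    return None  # RVA is not inside any section
```

For a section mapped at RVA 0x1000 whose raw data starts at file offset 0x400, the RVA 0x1010 resolves to offset 0x410.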
PE analysis tools (like PE-bear, CFF Explorer) automate this mapping, but understanding the process is vital for manual analysis or scripting.
Section analysis is a cornerstone of static malware analysis. Analysts scrutinize section names, sizes (VirtualSize vs SizeOfRawData), file pointers, and especially characteristics. Red flags include: unusual names (often packers), sections with unexpected permissions (e.g., writable code or executable data sections), sections with zero size on disk but large size in memory (`.bss` or packed data), sections with high entropy (indicating encryption/compression), or code execution starting from non-`.text` sections. Malware often adds its own sections or modifies existing ones to hide code or data.
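Several of these red flags are easy to compute. The sketch below measures Shannon entropy in bits per byte and checks three of the conditions mentioned above. The 7.0 threshold is a commonly used rule of thumb and an assumption here, not a value from this text, and `section_red_flags` is a hypothetical helper.

```python
import math
from collections import Counter

MEM_EXECUTE = 0x20000000
MEM_WRITE = 0x80000000


def shannon_entropy(data: bytes) -> float:
    """Entropy in bits per byte: 0.0 for constant data, 8.0 for uniform bytes."""
    if not data:
        return 0.0
    total = len(data)
    return -sum((n / total) * math.log2(n / total) for n in Counter(data).values())


def section_red_flags(characteristics: int, raw_data: bytes, virtual_size: int,
                      entropy_threshold: float = 7.0) -> list:
    """Return human-readable warnings for one section."""
    flags = []
    if characteristics & MEM_WRITE and characteristics & MEM_EXECUTE:
        flags.append("writable and executable")
    if len(raw_data) == 0 and virtual_size > 0:
        flags.append("no data on disk but occupies memory")
    if shannon_entropy(raw_data) > entropy_threshold:
        flags.append("high entropy")
    return flags
```

None of these flags proves malice on its own: an ordinary `.bss` section trips the second check, and compressed resources trip the third.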
Advanced PE Concepts & Malware Techniques
Import & Export Tables (.idata, .edata)
These tables manage dynamic linking – how PE files use functions from other DLLs or provide functions for others to use.
Import Address Table (IAT) & Import Directory Table (IDT)
Located via Data Directory entry 1 (`IMAGE_DIRECTORY_ENTRY_IMPORT`):
- The Import Directory Table (IDT) is an array of `IMAGE_IMPORT_DESCRIPTOR` structures, one for each imported DLL. Each descriptor points to the DLL name and two parallel arrays:
  - Import Name Table (INT) / OriginalFirstThunk: An array of RVAs pointing to hint/name structures (`IMAGE_IMPORT_BY_NAME`) or ordinals for each imported function. This table remains unchanged after loading.
  - Import Address Table (IAT) / FirstThunk: Another array, initially identical to the INT. The Windows loader overwrites this array with the actual memory addresses of the imported functions during the loading process.
- Code within the PE file typically calls imported functions indirectly via the IAT (e.g., `call dword ptr [iat_entry_for_MessageBoxA]`). Often, the compiler generates a small piece of code, called a thunk, for each imported function. This thunk usually contains just a jump instruction (e.g., `jmp dword ptr [__imp__FunctionName]`) that redirects execution to the address stored in the IAT.
- Malware Uses: IAT Hooking (overwriting IAT entries to redirect API calls), resolving APIs dynamically at runtime, for example with `GetProcAddress` or by walking the export tables of already-loaded DLLs (to hide imports from static analysis), stripping import information.
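Walking the import structures can be sketched as follows. To keep the example short it makes loud simplifying assumptions: the image is PE32 (4-byte thunks, ordinal flag in bit 31), and `image` is laid out as in memory so that an RVA can be used directly as an index. A file on disk would first need RVA-to-offset translation. The descriptor array is assumed to end with an all-zero entry, as the PE specification requires. `list_imports` and `_cstring` are hypothetical helpers.

```python
import struct

ORDINAL_FLAG32 = 0x80000000


def _cstring(image: bytes, rva: int) -> str:
    """Read a NUL-terminated ASCII string starting at rva."""
    end = image.index(b"\0", rva)
    return image[rva:end].decode("ascii", "replace")


def list_imports(image: bytes, import_dir_rva: int) -> dict:
    """Map DLL name -> list of imported function names (or '#ordinal')."""
    imports = {}
    desc = import_dir_rva
    while True:
        # IMAGE_IMPORT_DESCRIPTOR: five DWORDs (20 bytes).
        int_rva, _stamp, _fwd, name_rva, iat_rva = struct.unpack_from("<5I", image, desc)
        if not (int_rva or name_rva or iat_rva):
            break  # all-zero descriptor terminates the table
        thunk = int_rva or iat_rva  # fall back to the IAT if there is no INT
        names = []
        while True:
            (entry,) = struct.unpack_from("<I", image, thunk)
            if entry == 0:
                break
            if entry & ORDINAL_FLAG32:
                names.append(f"#{entry & 0xFFFF}")
            else:
                names.append(_cstring(image, entry + 2))  # skip the 2-byte hint
            thunk += 4
        imports[_cstring(image, name_rva)] = names
        desc += 20
    return imports
```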
Export Table (.edata)
Located via Data Directory entry 0 (`IMAGE_DIRECTORY_ENTRY_EXPORT`). Used primarily by DLLs:
- Contains the DLL name, a list of exported function names, a list of exported function addresses (RVAs), and a list of ordinals (numeric IDs for functions).
- Allows functions to be imported by name or by ordinal.
- Malware Uses: Malicious DLLs export functions for other malware components to call, sometimes using non-descriptive names or exporting only by ordinal to hinder analysis. Export Forwarding (redirecting an export to a function in another DLL).
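The relationship between the three export arrays is easier to see in code. The sketch below assumes the standard 40-byte `IMAGE_EXPORT_DIRECTORY` layout and, as a simplification, an image laid out as in memory so RVAs index it directly. It does not detect forwarded exports. `list_exports` and `_cstring` are hypothetical helpers.

```python
import struct


def _cstring(image: bytes, rva: int) -> str:
    """Read a NUL-terminated ASCII string starting at rva."""
    end = image.index(b"\0", rva)
    return image[rva:end].decode("ascii", "replace")


def list_exports(image: bytes, export_dir_rva: int) -> dict:
    """Map exported name -> (ordinal, function RVA)."""
    (_flags, _stamp, _major, _minor, _dll_name_rva, ordinal_base,
     _n_functions, n_names, functions_rva, names_rva,
     name_ordinals_rva) = struct.unpack_from("<IIHHIIIIIII", image, export_dir_rva)
    exports = {}
    for i in range(n_names):
        (name_rva,) = struct.unpack_from("<I", image, names_rva + 4 * i)
        # The ordinal array holds an index into the function address array.
        (index,) = struct.unpack_from("<H", image, name_ordinals_rva + 2 * i)
        (func_rva,) = struct.unpack_from("<I", image, functions_rva + 4 * index)
        exports[_cstring(image, name_rva)] = (ordinal_base + index, func_rva)
    return exports
```

Functions exported only by ordinal have an entry in the address array but none in the name array, which is why they do not show up in this name-keyed result.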
Symbols & Debug Information
Symbols map memory addresses or offsets to human-readable names (functions, variables). Debug information provides more detail for source-level debugging.
- Storage: Can be embedded (partially) in the PE file via the Debug Directory (Data Directory entry 6) or, more commonly, stored externally in Program Database (`.PDB`) files (common for MSVC compiler). Other compilers like GCC or Clang might use DWARF format embedded in sections or separate files, though they can also generate PDBs on Windows. The Debug Directory often contains a reference (GUID, path) to the external debug file.
- Origin: Symbol and debug information is generated by the compiler (e.g., GCC, Clang, MSVC) and linker during the build process, usually controlled by build configurations (e.g., "Debug" vs "Release").
- Content: Function names, variable names (global/static/local), type information (structs, classes), source file/line number mappings.
- Malware Relevance: Malware is almost always stripped of symbols and debug information to make reverse engineering harder. The presence of rich symbols in a suspicious file might indicate it's a legitimate tool being misused, or potentially an unsophisticated threat actor. Analysts often create their own symbols (renaming functions/variables) during analysis in tools like IDA Pro or Ghidra.
PE Security Features & Evasion
Windows and the PE format include features to mitigate exploits, which malware often tries to bypass. Many of these features require support from the compiler (like MSVC, GCC, Clang) and linker during the build process to be effective.
- ASLR (Address Space Layout Randomization): Randomizes base addresses of DLLs, EXEs, stack, heap. Flag: `IMAGE_DLLCHARACTERISTICS_DYNAMIC_BASE`. Makes fixed-address exploits unreliable. Malware may try to find ways around it (information leaks, spraying) or target non-ASLR modules.
- DEP (Data Execution Prevention): Marks memory regions (stack, heap, data sections) as non-executable using hardware support (NX/XD bit). Flag: `IMAGE_DLLCHARACTERISTICS_NX_COMPAT`. Prevents simple shellcode execution from data areas. Malware uses techniques like ROP (Return-Oriented Programming) or changes memory permissions (`VirtualProtect`) to bypass DEP.
- SafeSEH (Safe Structured Exception Handling): Validates exception handlers against a table of registered handlers before calling them (32-bit only). The table is referenced from the Load Configuration directory (Data Directory entry 10) rather than signalled by a header flag; the related `IMAGE_DLLCHARACTERISTICS_NO_SEH` flag declares that the image uses no SEH handlers at all. Prevents classic SEH overwrite exploits.
- CFG (Control Flow Guard): Validates targets of indirect calls at runtime against a bitmap of valid function entry points generated by the linker. Flag: `IMAGE_DLLCHARACTERISTICS_GUARD_CF`. Mitigates exploits that hijack indirect call pointers. Requires compiler support. Malware may target non-CFG-aware code or find ways to bypass checks.
- Authenticode (Digital Signatures): Cryptographically signs PE files to verify publisher identity and integrity. Stored via Data Directory entry 4 (`IMAGE_DIRECTORY_ENTRY_SECURITY`). Malware is usually unsigned or uses stolen/forged certificates.
Modern Languages and PE Files (e.g., Go)
While C and C++ are traditional sources of PE files, modern languages like Go, Rust, and Nim also compile directly to native code and produce PE executables. These often have distinct characteristics relevant to malware analysis:
- Static Linking & Large Size: Go binaries, by default, statically link their runtime and all dependencies. This results in large PE files (often several megabytes minimum) that contain the Go runtime scheduler, garbage collector, and standard library code, alongside the developer's code.
- Custom Runtime & Imports: They don't typically rely heavily on standard C runtime libraries (like MSVCRT) or make numerous direct calls to common Windows APIs visible in the Import Table. Instead, they use their own runtime, which then makes necessary system calls. This can make initial import analysis less informative compared to C/C++ binaries.
- Section Names & Symbols: Go binaries often have unique section names like `.gopclntab` (Go program counter line table) or `.gosymtab` (Go symbol table), although these might be stripped. Recovering meaningful symbols often requires Go-specific tooling.
- Malware Popularity Reasons:
  - Ease of Distribution: Static linking means the malware is a single file, requiring no external DLL dependencies on the target system.
  - Cross-Compilation: Go makes it relatively easy to compile Windows executables from other operating systems (like Linux).
  - Analysis Challenges: The large size, custom runtime, and non-standard structure can hinder analysis by tools and techniques primarily designed for traditional C/C++ PE files. Standard API import analysis is less effective, and decompilers may struggle with Go's runtime conventions.
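Because section names and imports say little about a Go binary, a quick first check is to look for byte strings the Go toolchain normally embeds. The sketch below is a heuristic only: the marker list is an assumption based on the build-info magic and build-ID text that standard Go builds contain, stripped or obfuscated samples may lack all of them, and `go_markers_found` is a hypothetical helper.

```python
# Byte strings commonly present in binaries produced by the standard Go toolchain.
GO_MARKERS = (
    b"\xff Go buildinf:",  # magic prefix of the embedded build-info blob
    b"Go build ID: ",      # build ID text placed near the start of the code
    b"runtime.main",       # runtime symbol name kept in the pclntab
)


def go_markers_found(data: bytes) -> list:
    """Return the markers present in data; an empty list proves nothing."""
    return [marker for marker in GO_MARKERS if marker in data]
```

Finding one or more markers is a reason to switch to Go-aware tooling for symbol recovery; finding none does not rule Go out.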
Other Notable PE Structures & Techniques
- Resources (.rsrc): Hierarchical structure storing icons, strings, dialogs, version info, and arbitrary binary data. Malware frequently hides encrypted payloads, configuration, or entire dropped files within resources.
- TLS (Thread Local Storage): Allows per-thread data. Includes optional TLS Callbacks (array of function pointers) that execute before the official `AddressOfEntryPoint` when a process or thread starts/stops. Heavily abused by malware for anti-debug tricks and early code execution. Located via Data Directory entry 9.
- Relocations (.reloc): Table of fixups needed if ASLR rebases the image. Malware might strip this from DLLs to make them crash if rebased (simple anti-analysis) or add invalid entries. Located via Data Directory entry 5.
- .NET Headers (CLR): For managed code (.NET), Data Directory entry 14 points to CLR metadata, replacing traditional native code in `.text` with Intermediate Language (IL) bytecode. Requires different analysis tools (dnSpy, ILSpy).
- Packing/Encryption: Malware often compresses/encrypts its original code/sections and embeds a small "stub" loader as the new entry point. The stub unpacks/decrypts the original code into memory at runtime. Indicated by few imports, unusual section names/permissions (Write+Execute), and high entropy sections. Requires dynamic analysis or unpacking to analyze the real code.
- Anti-Analysis Tricks: Manipulating header values (e.g., incorrect `SizeOfImage`, overlapping sections, invalid RVA pointers), using TLS callbacks, checking for debugger presence, timing checks.
Advanced PE concepts are where malware authors and defenders play a constant cat-and-mouse game. Malware leverages imports/exports, TLS, resources, and relocations in non-standard ways to hide, persist, and execute. They actively work to bypass security features like ASLR, DEP, and CFG, often relying on specific compiler/linker behaviors or exploiting weaknesses in the loading process. Understanding these advanced structures and techniques, along with common evasion tactics like packing and header manipulation, is essential for analyzing sophisticated modern threats. Analysis often involves combining static examination of these PE structures with dynamic analysis (debugging, memory forensics) to uncover the true behavior.