Rustifying Excavate + Recursive Decoding #2832
Replies: 2 comments
Relevant: #2094
A quick analysis of binwalk's codebase (TIL binwalk is written in Rust):

Architecture Overview
Binwalk uses a signature-based system with three components: signature definitions (magic byte patterns plus metadata), parser functions that validate matches and determine region boundaries, and extractors that carve or decompress the matched data.
Signature Organization
Signatures are defined in a central table. Each signature includes a name, one or more magic byte patterns, a description, a parser function, and optionally an associated extractor.
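As a rough sketch of what such a signature entry might look like, here is a minimal Rust version. Field and type names are illustrative guesses based on the description above, not binwalk's actual API:

```rust
/// Result a parser reports back for one matched region.
pub struct SignatureResult {
    pub offset: usize,       // where the data region starts
    pub size: usize,         // region size; 0 means "unknown, infer later"
    pub description: String,
}

/// Hypothetical signature-table entry: magic bytes plus metadata.
pub struct Signature {
    pub name: &'static str,
    /// One or more magic byte patterns fed to the multi-pattern matcher.
    pub magic: Vec<&'static [u8]>,
    /// Offset of the magic bytes relative to the start of the format.
    pub magic_offset: usize,
    /// Parser that validates a match and reports the region's boundaries.
    pub parser: fn(&[u8], usize) -> Result<SignatureResult, String>,
}

/// Example entry: gzip files start with the two bytes 0x1f 0x8b.
pub fn gzip_signature() -> Signature {
    Signature {
        name: "gzip",
        magic: vec![b"\x1f\x8b".as_slice()],
        magic_offset: 0,
        parser: |data: &[u8], offset: usize| {
            if data.len() < offset + 2 {
                return Err("truncated".into());
            }
            Ok(SignatureResult {
                offset,
                // gzip has no total-size field in its header; the real
                // boundary would come from decompression (case 2 below).
                size: 0,
                description: "gzip compressed data".into(),
            })
        },
    }
}
```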
Identification Process
The scanning process uses the Aho-Corasick algorithm to search for all magic patterns simultaneously.

Boundary Determination
Parsers determine data region boundaries in several ways:
1. Header-based size (most common): formats with explicit size fields in their headers, e.g., SquashFS.
2. Decompression-based size: for compressed formats, boundaries are determined by decompressing the data.
3. Block-based parsing: for formats with block structures (e.g., LZ4, LZOP), parsers iterate through the blocks.
4. Fallback inference: if a parser returns size 0, binwalk infers boundaries from the next signature or EOF.

Extraction Process
Extractors can be internal (Rust functions) or external (command-line tools).

Supported Formats
Compression formats: gzip, bzip2, lzma, xz, zstd, lz4, lzop, lzfse, zlib, compressd
Archive formats: zip, 7zip, tarball, rar, cab, arj, cpio, deb
Filesystem formats: squashfs, jffs2, yaffs, yaffs2, cramfs, ext2/3/4, fat, ntfs, apfs, btrfs, romfs, ubi, ubifs, iso9660, qcow
Firmware formats: uimage, trx, seama, android_bootimg, android_sparse, uboot, cfe, jboot, packimg, tplink, rtk, dlink_tlv, mh01, csman, matter_ota
Image formats: png, jpeg, gif, bmp, svg
Other: elf, pe, pdf, pcap, pcapng, dtb, uefi, gpg, pem, srecord, dmg, riff, and many others

The architecture is modular: adding a new format requires defining a signature with magic bytes, a parser function, and optionally an extractor.
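To make the identification and fallback-inference steps concrete, here is a hedged sketch. The naive multi-pattern scan below only stands in for the Aho-Corasick pass (binwalk uses a real Aho-Corasick implementation that matches all patterns in a single pass); the second function implements the "size 0 means infer from the next signature or EOF" rule. All names are hypothetical:

```rust
/// Naive stand-in for the Aho-Corasick scan: for each pattern, record every
/// offset where it occurs. Returns sorted (offset, pattern_index) pairs.
pub fn find_magic_offsets(data: &[u8], patterns: &[&[u8]]) -> Vec<(usize, usize)> {
    let mut hits = Vec::new();
    for (pat_idx, pat) in patterns.iter().enumerate() {
        if pat.is_empty() {
            continue;
        }
        for start in 0..data.len().saturating_sub(pat.len() - 1) {
            if &data[start..start + pat.len()] == *pat {
                hits.push((start, pat_idx));
            }
        }
    }
    hits.sort();
    hits
}

/// Fallback inference: if a parser reported size 0, extend the region to the
/// next signature's offset, or to EOF if this was the last match.
pub fn infer_size(region_start: usize, reported: usize, sorted_offsets: &[usize], eof: usize) -> usize {
    if reported != 0 {
        return reported; // parser already knew the boundary
    }
    let end = sorted_offsets
        .iter()
        .find(|&&o| o > region_start)
        .copied()
        .unwrap_or(eof);
    end - region_start
}
```

A real implementation would do the scan in one automaton pass rather than one sweep per pattern, but the inputs and outputs are the same shape.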
As we think about rustifying Excavate, there are three categories or "layers" of features that are really important:
1. Curate - Context-Dependent Reconstruction / Post-Processing
Web Module (context: Source URL)
File System (context: file path)
2. Excavate - Extracting common goodies from strings
Examples include emails, URLs, secrets, etc.
Yara-x rules
3. Translate - Detecting and Decoding/Preprocessing data for Extraction
Having a recursive, modular system for detecting encodings/compressions/packing methods like base64, hex, gzip, webpack, etc. is extremely beneficial to Excavate, especially for complex and multifaceted blobs of data like web content, binaries, zip files, word documents.
Most likely, these would be two separate GitHub repos, with Excavate depending on Translate. Translate is a recursive system with its own modules, while Excavate is a simple, non-recursive system that intakes strings and extracts goodies.
Need a way to check whether given text is compatible with a given Translate module (in some cases, this could be a yara rule)
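One way to frame the compatibility check plus recursion is a shared detect/decode interface: each module either declines the input or returns decoded bytes, and the driver recurses on every successful decode up to a depth limit. The sketch below uses a single hypothetical hex module; real modules (base64, gzip, webpack, etc.) would plug in behind the same function shape. Everything here is illustrative, not a committed design:

```rust
/// Hypothetical Translate module: hex. Acts as both the compatibility check
/// (returns None if the input isn't plausibly hex) and the decoder.
fn try_hex(data: &[u8]) -> Option<Vec<u8>> {
    if data.len() < 2 || data.len() % 2 != 0 {
        return None;
    }
    if !data.iter().all(|b| b.is_ascii_hexdigit()) {
        return None;
    }
    let mut out = Vec::with_capacity(data.len() / 2);
    for pair in data.chunks(2) {
        let hi = (pair[0] as char).to_digit(16)?;
        let lo = (pair[1] as char).to_digit(16)?;
        out.push((hi * 16 + lo) as u8);
    }
    Some(out)
}

/// Recursive Translate driver: run every module that accepts the input,
/// collect each decoded layer for Excavate to scan, and recurse on it.
/// The depth limit guards against pathological self-decoding inputs.
pub fn translate(data: &[u8], depth: usize) -> Vec<Vec<u8>> {
    let mut layers = Vec::new();
    if depth == 0 {
        return layers;
    }
    let modules: [fn(&[u8]) -> Option<Vec<u8>>; 1] = [try_hex];
    for m in &modules {
        if let Some(decoded) = m(data) {
            layers.push(decoded.clone());
            layers.extend(translate(&decoded, depth - 1));
        }
    }
    layers
}
```

Excavate would then scan the original input plus every layer returned by `translate`, which is what lets it "peer into" encoded content.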
Architecture
Before we commit to a concrete path for either of these, we should look at similar projects and see what lessons we can learn from them.
Also, we should have a general idea of how to compartmentalize the context, so that features that rely on context-specific info (current URL, .git config, syntax language, file path, etc.) have access to everything they need, while the task-based string extraction stays abstracted and that code stays as simple as possible.
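One possible shape for that split: bundle all context-dependent facts into a single struct that Curate-layer modules receive, while Excavate-layer extractors take only the raw text. A minimal sketch, with all names hypothetical:

```rust
/// All context-specific info lives in one place; each field is filled in by
/// whichever module has it (Web module sets the URL, FS module the path).
#[derive(Default, Clone)]
pub struct ScanContext {
    pub source_url: Option<String>,
    pub file_path: Option<String>,
    pub syntax_language: Option<String>,
}

/// Context-free extraction stays trivially simple and easy to test:
/// here, just "return whitespace-separated tokens containing `needle`"
/// as a stand-in for real goodie extraction.
pub fn extract_words_with(text: &str, needle: &str) -> Vec<String> {
    text.split_whitespace()
        .filter(|w| w.contains(needle))
        .map(|w| w.to_string())
        .collect()
}

/// Context-aware post-processing (the Curate layer) wraps the simple
/// extractor, e.g. resolving root-relative paths against the source URL.
pub fn resolve_relative_urls(ctx: &ScanContext, hits: &[String]) -> Vec<String> {
    hits.iter()
        .map(|h| match (&ctx.source_url, h.starts_with('/')) {
            (Some(base), true) => format!("{}{}", base.trim_end_matches('/'), h),
            _ => h.clone(),
        })
        .collect()
}
```

The payoff of this layering is that `extract_words_with` never needs to know a URL exists, and `resolve_relative_urls` never needs to know how extraction works.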
Similar Projects - Excavate
- ripgrep
- semgrep
- jsluice
  - Also regex soup, but uses AST
- js-link-finder
  - Regex soup - resource intensive, doesn't handle edge cases well, can't peer into decoded content, no context
- yara-x
Similar Projects - Translate
TODO:
Design Choices