Rustifying Excavate + Recursive Decoding #2832
Replies: 2 comments
Relevant: #2094
A quick analysis of binwalk's codebase (TIL binwalk is written in Rust):

Architecture Overview
Binwalk uses a signature-based system with three components: signature definitions (magic byte patterns plus metadata), parser functions that validate matches and determine region boundaries, and extractors that carve or decompress the matched data.
Signature Organization
Signatures are defined in a central table. Each signature includes a name, one or more magic byte patterns, a description, a parser function, and optionally an associated extractor.
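As a rough sketch of what such a signature entry might look like, here is a minimal Rust version. Field and type names are illustrative guesses based on the description above, not binwalk's actual API:

```rust
/// Result a parser reports back for one matched region.
pub struct SignatureResult {
    pub offset: usize,       // where the data region starts
    pub size: usize,         // region size; 0 means "unknown, infer later"
    pub description: String,
}

/// Hypothetical signature-table entry: magic bytes plus metadata.
pub struct Signature {
    pub name: &'static str,
    /// One or more magic byte patterns fed to the multi-pattern matcher.
    pub magic: Vec<&'static [u8]>,
    /// Offset of the magic bytes relative to the start of the format.
    pub magic_offset: usize,
    /// Parser that validates a match and reports the region's boundaries.
    pub parser: fn(&[u8], usize) -> Result<SignatureResult, String>,
}

/// Example entry: gzip files start with the two bytes 0x1f 0x8b.
pub fn gzip_signature() -> Signature {
    Signature {
        name: "gzip",
        magic: vec![b"\x1f\x8b".as_slice()],
        magic_offset: 0,
        parser: |data: &[u8], offset: usize| {
            if data.len() < offset + 2 {
                return Err("truncated".into());
            }
            Ok(SignatureResult {
                offset,
                // gzip has no total-size field in its header; the real
                // boundary would come from decompression (case 2 below).
                size: 0,
                description: "gzip compressed data".into(),
            })
        },
    }
}
```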
Identification Process
The scanning process uses the Aho-Corasick algorithm to search for all magic patterns simultaneously.

Boundary Determination
Parsers determine data region boundaries in several ways:
1. Header-based size (most common): formats with explicit size fields in their headers, e.g., SquashFS.
2. Decompression-based size: for compressed formats, boundaries are determined by decompressing the data.
3. Block-based parsing: for formats with block structures (e.g., LZ4, LZOP), parsers iterate through the blocks.
4. Fallback inference: if a parser returns size 0, binwalk infers boundaries from the next signature or EOF.

Extraction Process
Extractors can be internal (Rust functions) or external (command-line tools).

Supported Formats
Compression formats: gzip, bzip2, lzma, xz, zstd, lz4, lzop, lzfse, zlib, compressd
Archive formats: zip, 7zip, tarball, rar, cab, arj, cpio, deb
Filesystem formats: squashfs, jffs2, yaffs, yaffs2, cramfs, ext2/3/4, fat, ntfs, apfs, btrfs, romfs, ubi, ubifs, iso9660, qcow
Firmware formats: uimage, trx, seama, android_bootimg, android_sparse, uboot, cfe, jboot, packimg, tplink, rtk, dlink_tlv, mh01, csman, matter_ota
Image formats: png, jpeg, gif, bmp, svg
Other: elf, pe, pdf, pcap, pcapng, dtb, uefi, gpg, pem, srecord, dmg, riff, and many others

The architecture is modular: adding a new format requires defining a signature with magic bytes, a parser function, and optionally an extractor.
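To make the identification and fallback-inference steps concrete, here is a hedged sketch. The naive multi-pattern scan below only stands in for the Aho-Corasick pass (binwalk uses a real Aho-Corasick implementation that matches all patterns in a single pass); the second function implements the "size 0 means infer from the next signature or EOF" rule. All names are hypothetical:

```rust
/// Naive stand-in for the Aho-Corasick scan: for each pattern, record every
/// offset where it occurs. Returns sorted (offset, pattern_index) pairs.
pub fn find_magic_offsets(data: &[u8], patterns: &[&[u8]]) -> Vec<(usize, usize)> {
    let mut hits = Vec::new();
    for (pat_idx, pat) in patterns.iter().enumerate() {
        if pat.is_empty() {
            continue;
        }
        for start in 0..data.len().saturating_sub(pat.len() - 1) {
            if &data[start..start + pat.len()] == *pat {
                hits.push((start, pat_idx));
            }
        }
    }
    hits.sort();
    hits
}

/// Fallback inference: if a parser reported size 0, extend the region to the
/// next signature's offset, or to EOF if this was the last match.
pub fn infer_size(region_start: usize, reported: usize, sorted_offsets: &[usize], eof: usize) -> usize {
    if reported != 0 {
        return reported; // parser already knew the boundary
    }
    let end = sorted_offsets
        .iter()
        .find(|&&o| o > region_start)
        .copied()
        .unwrap_or(eof);
    end - region_start
}
```

A real implementation would do the scan in one automaton pass rather than one sweep per pattern, but the inputs and outputs are the same shape.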
As we think about rustifying Excavate, there are three categories or "layers" of features that are really important:
1. Curate - Context-Dependent Reconstruction / Post-Processing
Web Module (context: Source URL)
File System (context: file path)
2. Excavate - Extracting common goodies from strings
Examples include emails, URLs, secrets, etc.
Yara-x rules
3. Translate - Detecting and Decoding/Preprocessing data for Extraction
Having a recursive, modular system for detecting encodings/compressions/packing methods like base64, hex, gzip, webpack, etc. is extremely beneficial to Excavate, especially for complex and multifaceted blobs of data like web content, binaries, zip files, word documents.
Most likely, these would be two separate GitHub repos, with Excavate depending on Translate. Translate is a recursive system with its own modules, while Excavate is a simple, non-recursive system that intakes strings and extracts goodies.
Need a way to check whether given text is compatible with a given Translate module (in some cases, this could be a yara rule)
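One way to frame the compatibility check plus recursion is a shared detect/decode interface: each module either declines the input or returns decoded bytes, and the driver recurses on every successful decode up to a depth limit. The sketch below uses a single hypothetical hex module; real modules (base64, gzip, webpack, etc.) would plug in behind the same function shape. Everything here is illustrative, not a committed design:

```rust
/// Hypothetical Translate module: hex. Acts as both the compatibility check
/// (returns None if the input isn't plausibly hex) and the decoder.
fn try_hex(data: &[u8]) -> Option<Vec<u8>> {
    if data.len() < 2 || data.len() % 2 != 0 {
        return None;
    }
    if !data.iter().all(|b| b.is_ascii_hexdigit()) {
        return None;
    }
    let mut out = Vec::with_capacity(data.len() / 2);
    for pair in data.chunks(2) {
        let hi = (pair[0] as char).to_digit(16)?;
        let lo = (pair[1] as char).to_digit(16)?;
        out.push((hi * 16 + lo) as u8);
    }
    Some(out)
}

/// Recursive Translate driver: run every module that accepts the input,
/// collect each decoded layer for Excavate to scan, and recurse on it.
/// The depth limit guards against pathological self-decoding inputs.
pub fn translate(data: &[u8], depth: usize) -> Vec<Vec<u8>> {
    let mut layers = Vec::new();
    if depth == 0 {
        return layers;
    }
    let modules: [fn(&[u8]) -> Option<Vec<u8>>; 1] = [try_hex];
    for m in &modules {
        if let Some(decoded) = m(data) {
            layers.push(decoded.clone());
            layers.extend(translate(&decoded, depth - 1));
        }
    }
    layers
}
```

Excavate would then scan the original input plus every layer returned by `translate`, which is what lets it "peer into" encoded content.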
Architecture
Before we commit to a concrete path for either of these, we should look at similar projects and see what lessons we can learn from them.
Also, we should have a general idea of how to compartmentalize the context, so that features that rely on context-specific info (current URL, .git config, syntax language, file path, etc.) have access to everything they need, while the task-based string extraction stays abstracted and that code stays as simple as possible.
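One possible shape for that split: bundle all context-dependent facts into a single struct that Curate-layer modules receive, while Excavate-layer extractors take only the raw text. A minimal sketch, with all names hypothetical:

```rust
/// All context-specific info lives in one place; each field is filled in by
/// whichever module has it (Web module sets the URL, FS module the path).
#[derive(Default, Clone)]
pub struct ScanContext {
    pub source_url: Option<String>,
    pub file_path: Option<String>,
    pub syntax_language: Option<String>,
}

/// Context-free extraction stays trivially simple and easy to test:
/// here, just "return whitespace-separated tokens containing `needle`"
/// as a stand-in for real goodie extraction.
pub fn extract_words_with(text: &str, needle: &str) -> Vec<String> {
    text.split_whitespace()
        .filter(|w| w.contains(needle))
        .map(|w| w.to_string())
        .collect()
}

/// Context-aware post-processing (the Curate layer) wraps the simple
/// extractor, e.g. resolving root-relative paths against the source URL.
pub fn resolve_relative_urls(ctx: &ScanContext, hits: &[String]) -> Vec<String> {
    hits.iter()
        .map(|h| match (&ctx.source_url, h.starts_with('/')) {
            (Some(base), true) => format!("{}{}", base.trim_end_matches('/'), h),
            _ => h.clone(),
        })
        .collect()
}
```

The payoff of this layering is that `extract_words_with` never needs to know a URL exists, and `resolve_relative_urls` never needs to know how extraction works.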
Similar Projects - Excavate
- ripgrep
- semgrep
- jsluice
  - Also regex soup, but uses AST
- js-link-finder
  - Regex soup - resource intensive, doesn't handle edge cases well, can't peer into decoded content, no context
- yara-x
Similar Projects - Translate
TODO:
Design Choices