As researchers, we aim to make our code fully reproducible. However, due to the evolving nature of third-party APIs, tools, and malware samples, issues may arise. This guide outlines key modules, common pitfalls, and optimization tips for working with and extending ADAPT.
- Third-party tools & APIs:
- Censys API
- lief
- oletools
- malcat yara
- FLOSS and Exiftool (included in the bin/ directory with fixed versions)
/downloaded_samples_folder/
├── 0123abcd…/
├── 0123abcd… # Sample file (PDF, EXE, DOC, etc.)
├── 0123abcd….json # VT metadata file (required for Censys queries)
Note:
The presence of the {file_hash}.json VT metadata file is crucial.
It provides the first submission date, which is used to narrow the time window in Censys certificate and host queries.
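As a rough sketch of how the first submission date could be read from the metadata file: the layout below assumes a VirusTotal API v3 file report, where `first_submission_date` is a Unix timestamp under `data.attributes`. That layout and both function names are assumptions for illustration, not ADAPT's own code.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def first_submission_date(vt_report: dict) -> datetime:
    """Return the first submission date of a VT report as a UTC datetime."""
    # ASSUMPTION: VT API v3 layout ({"data": {"attributes": {...}}}).
    # If the file is already flattened, fall back to the top-level dict.
    attributes = vt_report.get("data", {}).get("attributes", vt_report)
    timestamp = attributes["first_submission_date"]  # Unix epoch seconds
    return datetime.fromtimestamp(timestamp, tz=timezone.utc)


def load_first_submission_date(sample_dir, file_hash: str) -> datetime:
    """Read {file_hash}.json from the sample folder and extract the date."""
    metadata_path = Path(sample_dir) / f"{file_hash}.json"
    with metadata_path.open(encoding="utf-8") as fh:
        return first_submission_date(json.load(fh))
```

A missing file or a missing key raises an exception here, which makes a broken sample folder visible before any Censys query is sent.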
Before running the pipeline, ensure that the file paths are correctly configured for your local environment.
async def main():
    BASE_DIR = r"provide\\the\\folderpath\\malware\\samples"
    ...
Update BASE_DIR to point to your folder containing malware samples.
ADAPT implements two different strategies for regex-based feature extraction:
- Approach: Matches each regex individually against string content.
- Pros: High flexibility; useful for detailed analysis and debugging.
- Cons: Very slow on large datasets (~10,000+ samples).
- Use case: Debugging or working with small datasets.
- Reference: See code block starting around line 1366 in feature_processing.py.
- Approach: Combines all regexes into a single large pattern using named groups.
- Pros: Much faster; optimized for batch processing.
- Cons: Less fine-grained control; regex patterns must be carefully structured.
- Optimization: Strings longer than 2000 characters are skipped to avoid regex timeouts and high memory usage.
candidate_strings = [
    s.get("string").strip()
    for s in all_strings.get("static_strings", [])
    if s.get("string") and len(s.get("string")) < 2000
]
You can adjust or remove the string length constraint in the snippet above depending on your dataset and system capabilities.
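The two strategies described above can be sketched as follows. This is a minimal illustration, not the code in feature_processing.py: the pattern names, the two example regexes, and both function names are hypothetical.

```python
import re

# Hypothetical example patterns; they use only non-capturing groups so that
# the named groups added below are the only capturing groups.
PATTERNS = {
    "email": r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}",
    "ipv4": r"(?:\d{1,3}\.){3}\d{1,3}",
}


def match_individually(strings, patterns=PATTERNS):
    """Strategy 1: run every regex separately over every string."""
    compiled = {name: re.compile(p) for name, p in patterns.items()}
    results = {name: [] for name in patterns}
    for s in strings:
        for name, rx in compiled.items():
            results[name].extend(m.group(0) for m in rx.finditer(s))
    return results


def match_combined(strings, patterns=PATTERNS, max_len=2000):
    """Strategy 2: one alternation of named groups, one pass per string."""
    combined = re.compile(
        "|".join(f"(?P<{name}>{p})" for name, p in patterns.items())
    )
    results = {name: [] for name in patterns}
    for s in strings:
        if len(s) >= max_len:  # skip very long strings, as in the snippet above
            continue
        for m in combined.finditer(s):
            # lastgroup is the name of the alternative that matched
            results[m.lastgroup].append(m.group(0))
    return results
```

One trade-off to keep in mind: the combined pattern yields non-overlapping matches, so a substring claimed by one alternative is never reported for another, whereas the individual approach reports every pattern's matches independently.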
ADAPT queries the Censys API to extract certificate and host metadata.
To enable this, you need to set your API credentials as environment variables.
In the code, the credentials are accessed like this:
censys_api_id = os.getenv("CENSYS_API_ID")
censys_api_secret = os.getenv("CENSYS_API_SECRET")
Set these in your system or runtime environment; for security reasons, avoid hardcoding them in the script. The credentials are required to authenticate your requests to the Censys API and fetch metadata reliably.
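If it helps, the credentials can be validated once at startup so that a missing variable fails early with a clear message instead of surfacing later as an authentication error. The helper name below is hypothetical; only the two environment variable names come from the code above.

```python
import os


def load_censys_credentials(env=os.environ):
    """Return (api_id, api_secret) or raise if either variable is unset."""
    api_id = env.get("CENSYS_API_ID")
    api_secret = env.get("CENSYS_API_SECRET")
    missing = [
        name
        for name, value in (
            ("CENSYS_API_ID", api_id),
            ("CENSYS_API_SECRET", api_secret),
        )
        if not value  # treats unset and empty values the same way
    ]
    if missing:
        raise RuntimeError(
            "Missing Censys credentials: " + ", ".join(missing)
        )
    return api_id, api_secret
```

Passing the environment as a parameter keeps the function easy to test without touching the real process environment.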
Censys responses can change over time as Censys revises the structure of its API. If the pipeline stops producing the expected results, start with the following code block, which builds the certificate query.
def censys_certificate_data(domain_name: str, sample_left_date: datetime, sample_right_date: str = "*") -> list:
    sample_left_date_str = sample_left_date.strftime("%Y-%m-%d")
    # Adjacent f-strings are concatenated verbatim, so the space before AND is required.
    certificate_query = (
        f"parsed.extensions.subject_alt_name.dns_names:{domain_name}"
        f" AND added_at:[{sample_left_date_str} TO {sample_right_date}]"
    )
After feature extraction, your folder structure should look like this:
/downloaded_samples_folder/
├── 0123abcd…/
├── 0123abcd… # Sample file (PDF, EXE, DOC, etc.)
├── 0123abcd….json # VT metadata file (required for Censys cert queries)
├── censys_features_withhostdata.json
├── exiftool_results.json
├── flossresults_reduced_7.json
├── lief_features.json # Present only for PE (executable) files
├── malcatYararesults.json
├── oletool_features_updated.json
└── regex_results.json
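A quick way to check a sample folder against the layout above is to list which expected output files are absent. This is only a sketch: the function name is hypothetical, and it assumes every file except lief_features.json is produced for every sample, which may not hold for all file types.

```python
from pathlib import Path

# File names taken from the folder layout above.
COMMON_OUTPUTS = [
    "censys_features_withhostdata.json",
    "exiftool_results.json",
    "flossresults_reduced_7.json",
    "malcatYararesults.json",
    "oletool_features_updated.json",
    "regex_results.json",
]
PE_ONLY_OUTPUTS = ["lief_features.json"]  # present only for PE files


def missing_outputs(sample_dir, is_pe=False):
    """Return the expected feature files that are not in sample_dir."""
    sample_dir = Path(sample_dir)
    expected = COMMON_OUTPUTS + (PE_ONLY_OUTPUTS if is_pe else [])
    return [name for name in expected if not (sample_dir / name).is_file()]
```

Running this over every sample folder before clustering gives a short list of samples whose feature extraction needs to be repeated.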
The groupAttribution.ipynb notebook performs clustering of malware samples based on extracted group-level features.
It merges features from several sources:
- exiftool metadata
- malcat rule matches
- regex-matched patterns
- censys data
These features often include string-based metadata (e.g., authors, company names, email addresses), which can be semantically similar even if lexically different. To normalize these and group similar entries, ADAPT computes text embeddings using a transformer-based language model (Model Name: sentence-transformers/multi-qa-MiniLM-L6-cos-v1).
The core logic for embedding-based feature normalization is implemented in the following files:
group_features.py
def compute_embeddings(self, data):
    # This function loops over selected columns and applies embedding-based normalization using string_feature_embed_similarity.
    ...
util.py
Similarity Computation: Computes cosine similarity between embeddings. For each value, it finds similar entries above a given threshold.
def compute_similar_candidates(self, unique_values_sets, doc_emb, sim_threshold=0.9) -> dict:
    ...
Returns a mapping: {original_value: [similar_candidates...]}.
The similarity computation itself happens in these lines:
scores = torch.mm(query_emb, doc_emb.transpose(0, 1)).squeeze()
scores_list = scores.cpu().tolist()
torch.mm creates a full similarity vector for each input string, so with 10,000 unique values you are creating and holding a 10,000 x 10,000 similarity matrix. That is 100M floats (~400 MB as float32, just for the scores).
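The arithmetic above, and one way to avoid holding the full matrix, can be sketched in plain Python. This is not ADAPT's implementation: the function names are hypothetical and the embeddings are assumed to be unit-normalized lists of floats, so a dot product equals cosine similarity. With torch, the per-chunk block would instead come from something like `torch.mm(query_emb[start:stop], doc_emb.transpose(0, 1))`.

```python
def similarity_matrix_bytes(n_values, bytes_per_score=4):
    """Bytes needed for a dense n x n score matrix (float32 by default)."""
    return n_values * n_values * bytes_per_score


def iter_row_chunks(n_rows, chunk_size):
    """Yield (start, stop) index pairs covering range(n_rows)."""
    for start in range(0, n_rows, chunk_size):
        yield start, min(start + chunk_size, n_rows)


def similar_indices_chunked(embeddings, threshold=0.9, chunk_size=2):
    """Map each row index to the other rows scoring at or above threshold.

    Only chunk_size x n scores exist at any one time; after thresholding,
    the block is discarded and only the matching indices are kept.
    """
    result = {}
    for start, stop in iter_row_chunks(len(embeddings), chunk_size):
        block = [
            [sum(a * b for a, b in zip(embeddings[i], other)) for other in embeddings]
            for i in range(start, stop)
        ]
        for offset, row in enumerate(block):
            i = start + offset
            result[i] = [j for j, s in enumerate(row) if j != i and s >= threshold]
    return result
```

For example, `similarity_matrix_bytes(10_000)` gives 400,000,000 bytes, while a chunk of 256 rows against the same 10,000 values needs about 10 MB at a time.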
Finally, the function below performs the normalization: it extracts the unique strings from the column, embeds them using a transformer, and finds similar values using cosine similarity.
def string_feature_embed_similarity(self, data, column, tokenizer, model, similarity_threshold=0.70) -> pd.Series:
    ...
Computing embeddings for a large number of unique strings can consume significant memory, especially on machines without GPUs or with limited VRAM.
Additionally, certain scenarios can increase processing time:
- Large Malware Samples: Samples larger than 20 MB and with a large amount of string content (evidenced by large FLOSS output files) can slow down embedding generation.
- Agglomerative Clustering: This step can be computationally expensive, particularly if you are clustering a large number of samples or requesting more than ~50 clusters, which increases runtime due to pairwise distance computations and hierarchical merging.
RuntimeError: CUDA out of memory
- Use encode_list_of_texts_batched() instead of the full-text version.
- Reduce batch size (e.g., batch_size=16).
- Filter very long strings (e.g., skip strings longer than 2000 characters).
- Consider switching to a smaller transformer model (e.g., distilbert instead of bert-large).
- Run on CPU (slower but safer): comment out .to(device) or set device = torch.device("cpu").
- Pre-filter large samples (e.g., skip samples > 20 MB unless necessary).
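Three of the mitigations above (length filtering, smaller batches, and size-based pre-filtering) can be sketched with generic helpers. All names below are hypothetical and do not reflect how encode_list_of_texts_batched() is implemented; the limits mirror the values suggested in the list.

```python
from pathlib import Path

MAX_STRING_LEN = 2000                  # characters, as suggested above
MAX_SAMPLE_BYTES = 20 * 1024 * 1024    # 20 MB


def filter_long_strings(texts, max_len=MAX_STRING_LEN):
    """Drop strings longer than max_len before they reach the model."""
    return [t for t in texts if len(t) <= max_len]


def iter_batches(items, batch_size=16):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def small_enough(sample_path, max_bytes=MAX_SAMPLE_BYTES):
    """True if the sample file on disk is within the size limit."""
    return Path(sample_path).stat().st_size <= max_bytes
```

Each batch from `iter_batches(filter_long_strings(texts))` can then be embedded on its own, so peak memory scales with the batch size rather than with the number of unique strings.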
Feel free to open an issue or pull request if you encounter a bug or have an improvement to suggest!