Static malware triage with Python

Static triage means pulling strings, indicators, and structure out of a sample without running it, and Python is well suited to the job. This lesson covers string and IOC extraction and identifying a sample by its hashes. The techniques mirror recon, since both are extraction, but here they are aimed at samples. Analyse malware only in an isolated, authorised lab.

You'll learn to

  • Extract printable strings from a binary
  • Pull indicators of compromise with regex
  • Identify a sample by its hashes

Strings and indicators

A malware sample is binary, so you read it as bytes and extract the runs of printable ASCII characters. This is the classic strings utility, reimplemented:

import re

def strings(data, min_len=5):
    return re.findall(rb"[\x20-\x7e]{%d,}" % min_len, data)

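IOC = {
    # ipv4 is loose: it also matches 999.1.1.1 and dotted version strings
    "ipv4":   re.compile(rb"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
    "url":    re.compile(rb"https?://[^\s\"'<>]+", re.I),
    # domain also matches file names such as kernel32.dll; review hits by hand
    "domain": re.compile(rb"\b(?:[a-z0-9-]+\.)+[a-z]{2,}\b", re.I),
}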

def extract(path):
    with open(path, "rb") as f:                 # read as BYTES; file is closed on exit
        data = f.read()
    blob = b"\n".join(strings(data))            # printable strings first
    out = {}
    for name, rx in IOC.items():
        out[name] = sorted({m.decode(errors="ignore") for m in rx.findall(blob)})
    return out

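Note everything is bytes (rb"...", open(path, "rb")) because a sample is arbitrary binary data, not text in any encoding. Text mode would either fail on undecodable bytes or alter the data through newline translation, and a str pattern cannot be applied to bytes at all. Extract the printable strings first, since they surface URLs, IPs, commands, and config buried in the binary, then run the IOC battery over them.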
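The ipv4 pattern is deliberately loose, so it can match strings that are not real addresses, such as 999.10.10.10. One possible way to tighten the results, sketched here with the standard-library ipaddress module, is to validate each candidate after extraction. The helper name valid_ipv4 and the sample bytes are illustrative assumptions, not part of the lesson's code.

```python
import ipaddress
import re

# Same loose pattern as the lesson's "ipv4" entry.
IPV4_RX = re.compile(rb"\b(?:\d{1,3}\.){3}\d{1,3}\b")

def valid_ipv4(candidates):
    """Keep only candidates that parse as IPv4 addresses (hypothetical helper)."""
    kept = set()
    for raw in candidates:
        text = raw.decode("ascii", errors="ignore")
        try:
            ipaddress.IPv4Address(text)   # raises ValueError for octets above 255
        except ValueError:
            continue
        kept.add(text)
    return sorted(kept)

# Synthetic bytes standing in for a sample; no real malware involved.
blob = b"\x00\x01connect 203.0.113.7\x00build 999.10.10.10\x00\xff"
found = IPV4_RX.findall(blob)   # both dotted quads match the loose regex
clean = valid_ipv4(found)       # only the parseable address survives
```

Validation removes out-of-range octets but not dotted version strings such as 1.2.3.4, which parse as addresses and still need a manual look.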

Identifying the sample

import hashlib

def hashes(path):
    with open(path, "rb") as f:                 # read as BYTES; file is closed on exit
        data = f.read()
    return {
        "md5":    hashlib.md5(data).hexdigest(),
        "sha256": hashlib.sha256(data).hexdigest(),
    }   # look these up on threat-intel platforms

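Hashes identify a sample for threat-intel lookups and IOC sharing. SHA-256 is the one to rely on: it lets you check whether a sample is already known and what others have found. MD5 is included because older feeds still index by it, but MD5 collisions can be constructed deliberately, so an MD5 match alone is not proof of identity. A hash matches only byte-identical files, so a repacked or patched variant of the same malware gets a different hash.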
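Reading the whole file into memory is fine for small samples. For large ones, a common alternative is to feed the hash objects in chunks. The sketch below assumes a hypothetical helper name, hashes_chunked, and an arbitrarily chosen 64 KiB chunk size; it should produce the same digests as hashing the full contents at once.

```python
import hashlib

def hashes_chunked(path, chunk_size=65536):
    """Hash a file without loading it all into memory (hypothetical helper)."""
    md5 = hashlib.md5()
    sha256 = hashlib.sha256()
    with open(path, "rb") as f:
        # iter() with a sentinel keeps calling f.read until it returns b"" (EOF)
        for chunk in iter(lambda: f.read(chunk_size), b""):
            md5.update(chunk)
            sha256.update(chunk)
    return {"md5": md5.hexdigest(), "sha256": sha256.hexdigest()}
```

Both digests are updated in a single pass, so the file is read only once however many algorithms you add.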

Checkpoint

Why must malware-analysis code read the sample with open(path, 'rb') and use byte patterns (rb'...') rather than normal text mode?

Try it yourself

In an isolated environment, take any binary file (even a harmless one) and write a function that extracts its printable strings of length 5 or more. Then run an IP-address regex over those strings. Observe what surfaces: in real malware, this is often where command-and-control (C2) addresses appear.

Key takeaways

  • Read samples as bytes; extract printable strings first.
  • The IOC battery (IPs, URLs, domains) runs over the extracted strings.
  • Hashes (MD5/SHA-256) identify a sample for threat-intel lookups.
  • Run your own payloads through it to see what indicators they leak.

Next, red-team automation — handling payloads, infrastructure, and the activity logging that real operations require.
