Static malware triage with Python
Python is well suited to static malware triage: pulling strings, indicators, and structure out of a sample without running it. This lesson covers string and IOC extraction and the static-analysis basics. The techniques mirror recon (it is all extraction), now aimed at samples. Analyse malware only in an isolated, authorised lab.
You'll learn to
- Extract printable strings from a binary
- Pull indicators of compromise with regex
- Identify a sample by its hashes
Strings and indicators
A malware sample is binary, so you read it as bytes and extract the printable runs — the classic strings utility, reimplemented:
import re

def strings(data, min_len=5):
    # Runs of printable ASCII (0x20-0x7e) at least min_len bytes long
    return re.findall(rb"[\x20-\x7e]{%d,}" % min_len, data)

IOC = {
    "ipv4": re.compile(rb"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
    "url": re.compile(rb"https?://[^\s\"'<>]+", re.I),
    "domain": re.compile(rb"\b(?:[a-z0-9-]+\.)+[a-z]{2,}\b", re.I),
}

def extract(path):
    with open(path, "rb") as fh:  # read as BYTES
        data = fh.read()
    blob = b"\n".join(strings(data))  # printable strings first
    out = {}
    for name, rx in IOC.items():
        # Deduplicate, then decode only the matched printable bytes
        out[name] = sorted({m.decode(errors="ignore") for m in rx.findall(blob)})
    return out
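Note that everything is bytes (rb"...", open(path, "rb")) because malware isn’t valid text. Extract the printable strings first, since they surface URLs, IPs, commands, and config buried in the binary, then run the IOC battery over them. The patterns are deliberately loose, so treat matches as candidates: the domain pattern also matches file names such as kernel32.dll, and the IPv4 pattern also matches dotted version strings.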
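As a quick illustration, the sketch below runs the same string extraction over a small in-memory byte blob, so no real sample is needed. The sample bytes, the addresses, and the valid_ipv4 helper are made up for this example and are not part of the lesson's code. The helper assumes the standard-library ipaddress module is an acceptable way to discard candidates such as 999.1.1.1 that the loose IPv4 pattern accepts.

```python
import ipaddress
import re

def strings(data, min_len=5):
    # Runs of printable ASCII at least min_len bytes long.
    return re.findall(rb"[\x20-\x7e]{%d,}" % min_len, data)

IPV4 = re.compile(rb"\b(?:\d{1,3}\.){3}\d{1,3}\b")

def valid_ipv4(text):
    # Hypothetical helper: keep only candidates that parse as IPv4 addresses.
    try:
        ipaddress.IPv4Address(text)
        return True
    except ValueError:
        return False

# Made-up bytes standing in for a sample: binary noise around printable runs.
sample = b"\x00\x01\xffconnect 203.0.113.7\x00\x90\x90ver 999.1.1.1\x00ab\x00"

blob = b"\n".join(strings(sample))
candidates = sorted({m.decode("ascii") for m in IPV4.findall(blob)})
confirmed = [ip for ip in candidates if valid_ipv4(ip)]

print(strings(sample))  # the two printable runs; "ab" is too short to count
print(candidates)       # both dotted quads match the loose pattern
print(confirmed)        # only the address that parses as IPv4 remains
```

Validating after matching keeps the regex simple and leaves the stricter check to a library that already knows the octet rules.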
Identifying the sample
import hashlib

def hashes(path):
    with open(path, "rb") as fh:  # read as BYTES, and close the file afterwards
        data = fh.read()
    return {
        "md5": hashlib.md5(data).hexdigest(),
        "sha256": hashlib.sha256(data).hexdigest(),
    }  # look these up on threat-intel platforms
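Hashes identify a sample for threat-intel lookups and IOC sharing. A SHA-256 lets you check whether a sample is already known and what others have found. Treat SHA-256 as the authoritative identifier: MD5 is still widely indexed, but practical collisions exist, so it is useful for lookups rather than as proof that two files are identical.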
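Reading the whole file into memory is fine for typical samples, but a large file can be hashed in pieces instead. The following is a minimal sketch of that idea, assuming a hypothetical hashes_chunked function and an arbitrary 1 MiB chunk size. It relies only on the update() method that hashlib digest objects provide.

```python
import hashlib

def hashes_chunked(path, chunk_size=1024 * 1024):
    # Hypothetical variant of hashes(): feed the file to both digests in
    # fixed-size chunks so memory use does not grow with the sample size.
    md5 = hashlib.md5()
    sha256 = hashlib.sha256()
    with open(path, "rb") as fh:
        # iter() with a sentinel keeps calling fh.read() until it returns b"".
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            md5.update(chunk)
            sha256.update(chunk)
    return {"md5": md5.hexdigest(), "sha256": sha256.hexdigest()}
```

Calling hashes_chunked(path) should return the same digests as hashing the full contents in one call, because feeding a digest in pieces is equivalent to feeding it the concatenated data.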
Checkpoint
Why must malware-analysis code read the sample with open(path, 'rb') and use byte patterns (rb'...') rather than normal text mode?
A malware binary is not valid text: it contains arbitrary bytes that would cause decoding errors or corruption if read as a string. Reading in binary mode ('rb') gives you the raw bytes, and byte-string regex patterns (rb'...') match against those bytes directly. You extract the printable runs first and can then safely decode just those for display.
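The point can be seen on a few made-up bytes. In this sketch the raw value and the example.test URL are invented for illustration, and the URL pattern is a simplified one restricted to printable bytes rather than the lesson's IOC pattern. Decoding the whole buffer fails, while a byte pattern still finds the printable run, which can then be decoded on its own.

```python
import re

# Made-up bytes: a printable URL surrounded by bytes that are not valid UTF-8.
raw = b"\xff\xfe\x00http://example.test/a\x00\x80\x81"

try:
    raw.decode("utf-8")
    decoded_ok = True
except UnicodeDecodeError:
    # 0xff can never start a UTF-8 sequence, so decoding the buffer fails.
    decoded_ok = False

# Simplified pattern (an assumption for this example): printable bytes only.
urls = re.findall(rb"https?://[\x21-\x7e]+", raw)

# Only the matched, printable bytes are decoded for display.
shown = [u.decode("ascii") for u in urls]

print(decoded_ok)
print(shown)
```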
Try it yourself
In an isolated environment, take any binary file (even a harmless one) and write a function that extracts its printable strings of length 5 or more. Then run an IP-address regex over those strings. Observe what surfaces — for real malware this is where C2 addresses appear.
Key takeaways
- Read samples as bytes; extract printable strings first.
- The IOC battery (IPs, URLs, domains) runs over the extracted strings.
- Hashes (MD5/SHA-256) identify a sample for threat-intel lookups.
- Run your own payloads through it to see what indicators they leak.
Next, red-team automation — handling payloads, infrastructure, and the activity logging that real operations require.