Threat hunting: extracting IOCs
Threat hunting often starts with extracting indicators of compromise — IPs, domains, URLs, file hashes, emails — from logs, reports, and samples. Each has a recognisable shape, so regex pulls them out of any text at scale. This lesson covers the IOC patterns and the defanging that analysts use.
You'll learn to
- Match the common IOC types
- Handle defanged indicators
- Extract IOCs from any dataset
The IOC patterns
IPv4: \b(?:\d{1,3}\.){3}\d{1,3}\b
Domain: \b(?:[a-z0-9-]+\.)+[a-z]{2,}\b
URL: https?://[^\s"'<>]+
MD5: \b[a-f0-9]{32}\b
SHA256: \b[a-f0-9]{64}\b
Email: \b[\w.+-]+@[\w-]+\.[\w.-]+\b
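Each indicator has a fixed structure: an IPv4 address is four dot-separated number groups, a SHA-256 hash is 64 hex characters, a URL starts with a scheme. These shapes make extraction reliable: run the whole battery over any text and collect the matches. The domain and hash patterns are written in lowercase, so apply them case-insensitively. The IPv4 pattern checks shape only, so it also accepts out-of-range octets such as 999.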
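As a minimal sketch of running that battery, the Python below compiles the six patterns listed above and collects unique matches per type. The names IOC_PATTERNS and extract_iocs and the sample text are illustrative assumptions, not part of the lesson; case-insensitive matching is added so that upper-case hashes and domains are not missed.

```python
import re

# The lesson's patterns, compiled case-insensitively where letter case varies.
IOC_PATTERNS = {
    "ipv4": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
    "domain": re.compile(r"\b(?:[a-z0-9-]+\.)+[a-z]{2,}\b", re.IGNORECASE),
    "url": re.compile(r"https?://[^\s\"'<>]+", re.IGNORECASE),
    "md5": re.compile(r"\b[a-f0-9]{32}\b", re.IGNORECASE),
    "sha256": re.compile(r"\b[a-f0-9]{64}\b", re.IGNORECASE),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
}


def extract_iocs(text):
    """Return a dict mapping each IOC type to its sorted, unique matches."""
    return {
        name: sorted(set(pattern.findall(text)))
        for name, pattern in IOC_PATTERNS.items()
    }


# Made-up sample text using documentation-style addresses and names.
sample = (
    "Beacon to 203.0.113.7 via http://bad.example/payload.bin dropped a file "
    "with md5 d41d8cd98f00b204e9800998ecf8427e; contact admin@bad.example"
)

for kind, hits in extract_iocs(sample).items():
    print(kind, hits)
```

On this sample the domain pattern returns payload.bin alongside bad.example, because a file name has the same shape as a domain. Shape-only matching over-collects, so results usually need a validation pass afterwards.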
Defanged indicators
Analysts 'defang' IOCs so they're not accidentally clicked or auto-blocked:
hxxp://evil[.]com/path
1.2.3[.]4
evil(dot)com
user[at]evil.com
Your patterns must handle both live and defanged forms: match hxxps?, the bracketed dot [.], and the (dot)/(at) variants.
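One way to handle both forms, sketched in Python under a few assumptions: widen the URL scheme to accept hxxp, then re-fang the text before running the live patterns. The names DEFANGED_URL, REFANG_RULES, and refang are hypothetical, and the rule table covers only the variants shown above plus [dot]; real feeds may use others.

```python
import re

# Accepts http, https, hxxp and hxxps. A bracketed dot in the host is plain
# non-space text, so the lesson's character class already lets it through.
DEFANGED_URL = re.compile(r"h(?:tt|xx)ps?://[^\s\"'<>]+", re.IGNORECASE)

# Hypothetical rule table: defanged token -> live replacement.
REFANG_RULES = [
    (re.compile(r"hxxp", re.IGNORECASE), "http"),
    (re.compile(r"\[\.\]|\(dot\)|\[dot\]", re.IGNORECASE), "."),
    (re.compile(r"\[at\]|\(at\)", re.IGNORECASE), "@"),
]


def refang(text):
    """Rewrite defanged tokens back to their live form."""
    for pattern, replacement in REFANG_RULES:
        text = pattern.sub(replacement, text)
    return text


report = "hxxp://evil[.]com/path 1.2.3[.]4 evil(dot)com user[at]evil.com"

print(DEFANGED_URL.findall(report))
print(refang(report))
```

Re-fanging a whole document can also rewrite ordinary prose that happens to contain "(at)" or "(dot)", so a more cautious variant would apply refang only to candidate strings that a defang-aware pattern has already matched.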
Checkpoint
Why must IOC-extraction patterns handle 'defanged' indicators like hxxp://evil[.]com?
Threat reports and intelligence feeds deliberately defang indicators, writing hxxp instead of http and [.] instead of a dot, so the URLs and IPs can't be accidentally clicked, resolved, or auto-blocked when the report is read or processed. A hunter extracting IOCs from those sources must therefore match the defanged forms (hxxp, the bracketed dot, the (dot)/(at) variants), not just the live ones, or they'll miss every indicator written that way. Typically you then re-fang the matches to their real form so you can search your own live data for them.
Try it yourself
Write patterns to extract IPv4 addresses and SHA-256 hashes from text, noting the exact length that makes the hash pattern reliable. Then describe how you’d modify a URL pattern to also catch the defanged hxxp:// and [.] forms used in threat reports.
Key takeaways
- IOCs (IPs, domains, URLs, hashes, emails) have fixed, matchable shapes.
- Hashes are reliable by exact length: MD5 is 32 hex characters, SHA-256 is 64.
- Handle defanged forms (hxxp, [.], (dot)) from reports, then re-fang to search.
- Validate matches (octet range, real TLD, exact length) to cut false IOCs.
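The validation step in the last takeaway might look like the Python sketch below. The function names are hypothetical, and KNOWN_TLDS is a deliberately tiny stand-in: a real check would load the current IANA top-level domain list rather than hard-code four entries.

```python
import re

IPV4_SHAPE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
DOMAIN_SHAPE = re.compile(r"\b(?:[a-z0-9-]+\.)+[a-z]{2,}\b", re.IGNORECASE)

# Hypothetical allowlist for illustration only.
KNOWN_TLDS = {"com", "net", "org", "io"}


def valid_ipv4(candidate):
    """True only if every octet is within 0-255."""
    return all(0 <= int(octet) <= 255 for octet in candidate.split("."))


def valid_domain(candidate):
    """True only if the last label is in the allowlist."""
    return candidate.rsplit(".", 1)[-1].lower() in KNOWN_TLDS


text = "Seen 10.0.0.5 and 999.300.1.2; fetched evil.com and payload.bin"

ips = [ip for ip in IPV4_SHAPE.findall(text) if valid_ipv4(ip)]
domains = [d for d in DOMAIN_SHAPE.findall(text) if valid_domain(d)]

print(ips)
print(domains)
```

Both 999.300.1.2 and payload.bin have the right shape and are matched by the regex, but the second pass drops them: the first for out-of-range octets, the second because its last label is not in the allowlist.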
Next, language-specific regex implementations — the quirks across PHP, Java, C#, Ruby, Rust, and Perl.