The re module for pentesters
Regex is the extraction engine of security automation, and Python’s re module is how you drive it from code. You already know patterns from the regex course; this lesson is about the Python side — which method to call, how matches come back, and the one rule that prevents most bugs.
You'll learn to
- Use the core re methods correctly
- Capture groups and extract just what you need
- Avoid the backslash trap with raw strings
The methods you’ll use
Python’s re module gives you a handful of functions. The ones that matter for security work:
import re
text = "Authorization: Bearer eyJhbGci.payload.sig and id=42"
re.search(r"Bearer\s+(\S+)", text).group(1) # 'eyJhbGci.payload.sig' — first match, group 1
re.findall(r"id=(\d+)", text) # ['42'] — every match of the group
re.sub(r"\d", "X", text) # replace every digit with X
# finditer keeps groups AND positions — best for structured extraction:
for m in re.finditer(r"(\w+)=(\w+)", text):
    print(m.group(1), m.group(2)) # key, value
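search finds the first match anywhere in the string and returns None when nothing matches; findall returns every match as a list (the whole match if the pattern has no group, just the group if it has one, tuples if it has several); finditer yields match objects with groups and positions; sub rewrites. For extracting many things with groups, prefer finditer: it always yields match objects, so its return type is predictable.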
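The difference in return shapes is easiest to see side by side. The sketch below reuses the sample header from above with an extra uid=7 pair appended (an addition made only for illustration), and also shows one way to guard a search call that may find nothing.

```python
import re

# Sample text from above, plus a second key=value pair added for illustration.
text = "Authorization: Bearer eyJhbGci.payload.sig and id=42 uid=7"

# findall's return shape depends on how many groups the pattern has.
no_group = re.findall(r"\w+=\w+", text)        # list of whole matches
one_group = re.findall(r"\w+=(\w+)", text)     # list of the single group's text
two_groups = re.findall(r"(\w+)=(\w+)", text)  # list of (key, value) tuples

# finditer always yields match objects, whatever the group count,
# so groups and positions are available in the same loop.
pairs = [
    (m.group(1), m.group(2), m.start())
    for m in re.finditer(r"(\w+)=(\w+)", text)
]

# search returns None when nothing matches, so check before calling .group().
m = re.search(r"Basic\s+(\S+)", text)
basic_token = m.group(1) if m else None
```

Because search can return None, chaining .group(1) directly onto it, as in the one-liner above, raises AttributeError on input that does not contain the pattern; the guard avoids that.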
The one rule: always use raw strings
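Write every pattern as a raw string with the r prefix: r"\d+", not "\d+". In a normal string, Python processes backslash escapes before the regex engine ever sees the pattern, so "\b" arrives as a backspace character instead of a word boundary. Escapes Python does not recognize, such as "\d", currently pass through unchanged, but they trigger a warning (a SyntaxWarning since Python 3.12), so do not rely on that.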
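A quick way to see the trap is to compare the two spellings of the same pattern. This is a minimal sketch; the sample text is made up for the demonstration.

```python
import re

# "\b" in a normal string is a single character: backspace (0x08).
plain_len = len("\b")
# r"\b" keeps both characters: a backslash followed by the letter b.
raw_len = len(r"\b")

# Made-up sample text for the demonstration.
text = "token=abc123 tokenizer"

# Without the r prefix the pattern looks for literal backspace characters,
# which the text does not contain, so nothing matches.
plain_hits = re.findall("\btoken\b", text)

# With the r prefix the regex engine sees \b and treats it as a word boundary.
raw_hits = re.findall(r"\btoken\b", text)
```

The non-raw version fails silently: there is no error, just an empty result, which is why this bug is easy to miss in a scanner.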
Compile patterns you reuse
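# Compile once, use many times in loops:
AWS = re.compile(r"\b(?:AKIA|ASIA)[0-9A-Z]{16}\b")
for filename in files:  # files: an iterable of paths you supply
    # errors="ignore" skips bytes that do not decode instead of raising
    with open(filename, errors="ignore") as fh:
        text = fh.read()
    for hit in AWS.findall(text):
        print(filename, hit)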
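re.compile turns a pattern into a reusable pattern object with its own search, findall, finditer and sub methods. The module-level functions cache recently compiled patterns, so they do not recompile on every call, but compiling once still skips that cache lookup each time and gives the pattern a name you can share across a scanner. The saving is small per call and adds up when you run the same pattern across thousands of files or lines.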
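A compiled pattern combines naturally with finditer when you want to report where each hit was found. The sketch below runs the same pattern over a few in-memory lines instead of files; the sample lines and the AWS_KEY name are made up for illustration.

```python
import re

# Same pattern as above, compiled once at module level.
AWS_KEY = re.compile(r"\b(?:AKIA|ASIA)[0-9A-Z]{16}\b")

# Made-up sample lines; the key IDs are placeholders, not real credentials.
lines = [
    "aws_access_key_id = AKIAIOSFODNN7EXAMPLE",
    "nothing to see here",
    "export KEY=ASIAABCDEFGHIJKLMNOP  # temp creds",
]

hits = []
for lineno, line in enumerate(lines, start=1):
    # The compiled object exposes the same methods as the re module.
    for m in AWS_KEY.finditer(line):
        hits.append((lineno, m.start(), m.group(0)))

for lineno, col, key in hits:
    print(f"line {lineno}, col {col}: {key}")
```

Recording the line number and column alongside the match makes the output usable as a finding, not just a list of strings.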
Checkpoint
Why must you write regex patterns as raw strings (r'...') in Python?
Because Python's normal string parser interprets backslash escapes (like \b, \n) before the regex engine sees the pattern. r'\b' passes a literal backslash-b to the regex engine (a word boundary), while '\b' becomes a backspace character. Raw strings keep your patterns intact.
Try it yourself
Take a block of text containing a few fake tokens. Use re.findall with a raw-string pattern to extract them, then re.sub to redact them (replace with asterisks). Notice how findall returns the group when your pattern has one.
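If you want a starting point, here is one possible shape for the exercise. The tok_ prefix and the sample text are invented for this sketch; swap in whatever fake token format you like.

```python
import re

# Made-up text with fake tokens; the "tok_" format is invented for the exercise.
sample = "user=alice token=tok_AAAA1111 then token=tok_BBBB2222 done"

# One capture group, so findall returns just the token text.
PATTERN = r"token=(tok_[A-Za-z0-9]+)"

found = re.findall(PATTERN, sample)

# Redact by replacing each token with asterisks of the same length.
redacted = re.sub(
    PATTERN,
    lambda m: "token=" + "*" * len(m.group(1)),
    sample,
)

print(found)
print(redacted)
```

Passing a function as the replacement lets the redaction keep each token's length; a fixed string such as "token=****" works just as well if length does not matter.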
Key takeaways
- search finds first, findall gets all, finditer gives groups+positions, sub rewrites.
- Always write patterns as raw strings: r"...".
- re.compile once and reuse for speed across many inputs.
- The same patterns from the regex course become automated scanners here.
Next, building custom web scanners that put these extraction skills to work against live targets.