Bypass techniques and the filter-interpreter gap
Almost every filter bypass comes from one idea: the filter and the thing it’s protecting read input differently. The filter sees a string; the database, browser, or shell sees something else after decoding. Exploiting that gap is the heart of bypass technique. This lesson names the gaps systematically.
You'll learn to
- Understand the filter-interpreter gap
- Apply the main bypass families
- See why normalisation is the defence
The core idea
A filter checks input as text. But the input then passes to an interpreter (a SQL engine, an HTML parser, a shell) that may decode or normalise it first. If the filter and the interpreter disagree about what the input means, a payload the filter accepts can still reach the interpreter as an attack.
Filter sees: %3Cscript%3E (just a harmless-looking string)
Browser decodes to: <script> (the actual attack)
-> filter blocked '<script>' but never saw it, because it was encoded
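The gap shown above can be reproduced in a few lines of Python. This is a minimal sketch, not a real filter: `naive_filter` is a hypothetical stand-in for any check that inspects the raw string, and `urllib.parse.unquote` plays the role of the decoding step that happens later in the pipeline.

```python
from urllib.parse import unquote


def naive_filter(raw: str) -> bool:
    """Hypothetical filter: allow the input unless it contains a literal <script>."""
    return "<script>" not in raw


payload = "%3Cscript%3E"

# The filter inspects the raw, still-encoded string and finds nothing to block.
allowed = naive_filter(payload)

# A later stage URL-decodes the same string and recovers the real tag.
decoded = unquote(payload)

print("filter allowed it:", allowed)
print("interpreter sees:", decoded)
```

The filter and the decoder are both working as written; the bypass comes only from the order in which they run.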
The bypass families
Encoding: URL (%3C), HTML entity (&lt;), Unicode (\u003c), double-encoding
Case: <ScRiPt> when the filter is case-sensitive
Whitespace: union/**/select, tab/newline where the filter expects spaces
Nesting: <scr<script>ipt> when a naive filter strips once and stops
Normalisation: Unicode forms (e.g. fullwidth ＜, U+FF1C) that collapse to the dangerous character when normalised after the filter has run
Each family is a way the interpreter’s view differs from the filter’s. Encoding is the biggest: if the filter checks before decoding and the interpreter decodes after, the filter never sees the real payload.
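The case, whitespace, and nesting families can be sketched against three deliberately weak filters. All three functions below are hypothetical illustrations written for this lesson, not taken from any real product; they assume an interpreter that treats tag names case-insensitively and accepts a comment where the filter expects a space.

```python
import re

BLOCKED_TAG = "<script>"


def case_sensitive_filter(text: str) -> bool:
    """Allow input unless it contains the exact lower-case tag."""
    return BLOCKED_TAG not in text


def literal_space_filter(text: str) -> bool:
    """Allow input unless it contains 'union select' with one literal space."""
    return re.search(r"union select", text) is None


def strip_once(text: str) -> str:
    """Remove the blocked tag in a single pass, without re-checking the result."""
    return text.replace(BLOCKED_TAG, "")


# Case: the mixed-case tag does not match the lower-case literal.
case_bypass = case_sensitive_filter("<ScRiPt>")

# Whitespace: a comment stands in for the space the pattern expects.
whitespace_bypass = literal_space_filter("union/**/select")

# Nesting: removing the inner tag reassembles the outer one.
nesting_result = strip_once("<scr<script>ipt>")

print(case_bypass, whitespace_bypass, nesting_result)
```

In each case the filter's rule is narrower than what the interpreter will accept, which is the same disagreement seen with encoding.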
Checkpoint
What single principle underlies almost every filter bypass?
The filter and the interpreter that ultimately processes the input read it differently. The filter checks the input as raw text, but the input is then decoded or normalised by the interpreter (a SQL engine, HTML parser, or shell) before it acts. If a payload is encoded, mixed-case, or otherwise shaped so the filter doesn't recognise it as dangerous but the interpreter still decodes it to the real attack, it slips through. Bypasses exploit that disagreement about what the input means.
Try it yourself
Take a filter that blocks the literal less-than sign. List how you’d represent that character so the filter misses it but the browser still interprets it — URL encoding, HTML entity, Unicode escape, and double encoding. Then state the one defensive change that defeats all of them.
Key takeaways
- Bypasses exploit the gap between how a filter and an interpreter read input.
- Families: encoding, case, whitespace, nesting, Unicode normalisation.
- Encoding is the biggest family: the filter checks before decoding, the interpreter acts after.
- Defence: decode and normalise to final form first, then validate that.
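One way to picture the defence is a canonicalise-then-validate step. The sketch below is an assumption-laden illustration: `canonicalise` and `is_allowed` are hypothetical names, the round limit of 5 is arbitrary, and the decoders chosen (URL decoding, HTML entity decoding, Unicode NFKC) cover only some of the forms discussed. A real system would need to mirror exactly the decoding its own interpreter performs, including any script-level escapes such as \u003c.

```python
import html
import unicodedata
from urllib.parse import unquote


def canonicalise(raw: str, max_rounds: int = 5) -> str:
    """Decode repeatedly until the string stops changing (handles double-encoding)."""
    current = raw
    for _ in range(max_rounds):
        decoded = html.unescape(unquote(current))
        decoded = unicodedata.normalize("NFKC", decoded)
        if decoded == current:
            return current
        current = decoded
    # Still changing after max_rounds: treat as suspicious rather than guess.
    raise ValueError("input did not reach a stable form")


def is_allowed(raw: str) -> bool:
    """Validate the canonical form, not the raw text."""
    try:
        return "<" not in canonicalise(raw)
    except ValueError:
        return False


for sample in ["hello", "%3C", "&lt;", "%253C", "\uff1c"]:
    print(repr(sample), "allowed" if is_allowed(sample) else "blocked")
```

Looping until the string is stable is what defeats double-encoding; a single decode pass would turn %253C into %3C and let it through.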
Next, ReDoS — when a regex pattern itself becomes a denial-of-service vulnerability.