Automating source code review
When you have source access — a repo, a decompiled app, a grey-box target — Python lets you triage thousands of files in seconds. This lesson builds a reusable scanner that finds hardcoded credentials, dangerous functions, and vulnerability patterns, reporting them as file-and-line hits you can verify.
You'll learn to
- Walk a codebase efficiently
- Match the patterns that signal real bugs
- Report findings as file:line you can jump to
Walking the tree
import os
SKIP = {"node_modules", ".git", "vendor", "dist", "__pycache__"}
EXT = {".py", ".js", ".php", ".java", ".rb", ".go", ".env"}
def code_files(root):
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames[:] = [d for d in dirnames if d not in SKIP]  # prune noise dirs
        for name in filenames:
            # splitext(".env") returns (".env", ""), so also match bare dotfiles by name
            if os.path.splitext(name)[1] in EXT or name in EXT:
                yield os.path.join(dirpath, name)
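os.walk visits every directory under root and yields a (dirpath, dirnames, filenames) tuple for each one. Because it walks top-down by default, pruning dirnames in place stops it descending into node_modules and .git, a big speed and signal win. yield makes code_files a generator, so it scales to huge trees without building a giant list.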
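To see why the pruning has to happen in place, here is a minimal sketch that builds a throwaway tree and walks it twice. The helper names (walk_pruned, walk_rebound, make_demo_tree) and the two-file layout are made up for illustration; only os.walk and tempfile are real APIs.

```python
import os
import shutil
import tempfile

SKIP = {"node_modules", ".git"}

def walk_pruned(root):
    """Yield every file path under root, skipping SKIP directories."""
    for dirpath, dirnames, filenames in os.walk(root):
        # Slice assignment mutates the list os.walk holds, so the pruning sticks.
        dirnames[:] = [d for d in dirnames if d not in SKIP]
        for name in filenames:
            yield os.path.join(dirpath, name)

def walk_rebound(root):
    """Same loop, but rebinding the name leaves os.walk's own list untouched."""
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames = [d for d in dirnames if d not in SKIP]  # no effect on the walk
        for name in filenames:
            yield os.path.join(dirpath, name)

def make_demo_tree():
    """Create a temp tree with one project file and one vendored file."""
    root = tempfile.mkdtemp()
    for rel in ("src/app.py", "node_modules/lib/index.js"):
        path = os.path.join(root, *rel.split("/"))
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "w", encoding="utf-8") as fh:
            fh.write("x = 1\n")
    return root

demo_root = make_demo_tree()
pruned = sorted(os.path.relpath(p, demo_root) for p in walk_pruned(demo_root))
rebound = sorted(os.path.relpath(p, demo_root) for p in walk_rebound(demo_root))
print("pruned: ", pruned)   # only the file under src/
print("rebound:", rebound)  # the node_modules file is still visited
shutil.rmtree(demo_root)
```

The rebound version still descends into node_modules because os.walk never sees the new list. That is the whole reason for the dirnames[:] spelling.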
The patterns that find bugs
import re
RULES = {
    "hardcoded-cred": re.compile(r"(?i)(password|secret|api[_-]?key|token)\s*[:=]\s*['\"][^'\"]{4,}['\"]"),
    "code-exec": re.compile(r"\b(eval|exec|system|popen)\s*\("),
    "py-deser": re.compile(r"\b(pickle\.loads|yaml\.load)\s*\("),
    "sql-fstring": re.compile(r"(?i)(execute|cursor\.execute)\s*\(\s*f['\"]"),
}
def scan(path):
    # errors="ignore" keeps the scan going on files that are not valid UTF-8
    with open(path, encoding="utf-8", errors="ignore") as fh:
        src = fh.read()
    for rule, rx in RULES.items():
        for m in rx.finditer(src):
            line = src[:m.start()].count("\n") + 1  # line number of the hit
            print(f"{path}:{line}: [{rule}] {m.group(0)[:60]}")
Three families: hardcoded credentials, dangerous sinks (code execution, unsafe deserialization), and vulnerability patterns like SQL built with an f-string. The line-number trick — counting newlines before the match — turns each hit into a file:line reference you can jump straight to.
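The newline-counting trick can be checked on a small in-memory string. The sketch below also shows one possible variant for very large files: record the newline offsets once and binary-search them instead of rescanning the prefix for every match. The function names and the sample source are assumptions made for this example, not part of the lesson's scanner.

```python
import bisect
import re

CODE_EXEC = re.compile(r"\b(eval|exec|system|popen)\s*\(")

def line_by_count(src, offset):
    """Line number of a character offset: newlines before it, plus one."""
    return src[:offset].count("\n") + 1

def newline_offsets(src):
    """Offsets of every newline in src, in ascending order."""
    return [i for i, ch in enumerate(src) if ch == "\n"]

def line_by_bisect(offsets, offset):
    """Same answer as line_by_count, using a precomputed offset list."""
    # bisect_left counts the newlines that sit strictly before `offset`.
    return bisect.bisect_left(offsets, offset) + 1

sample = 'import os\nname = "x"\nos.system(name)\nresult = eval(name)\n'
offsets = newline_offsets(sample)

hits = []
for m in CODE_EXEC.finditer(sample):
    line = line_by_count(sample, m.start())
    # Both methods should agree on every match.
    assert line == line_by_bisect(offsets, m.start())
    hits.append((line, m.group(0)))

print(hits)  # [(3, 'system('), (4, 'eval(')]
```

For typical source files the simple count is fast enough. The offset list only pays off when one file produces many matches.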
Checkpoint
Why does the scanner prune directories like node_modules and .git from the os.walk, and how?
Those directories contain huge amounts of third-party and version-control data that produce noise and slow the scan. The code prunes them by reassigning dirnames in place (dirnames[:] = [d for d in dirnames if d not in SKIP]), which tells os.walk not to descend into them. This keeps the scan fast and focused on the project's own source.
Try it yourself
Point the file walker at a small project directory and list the source files it finds. Then run one rule — say the code-exec pattern — over each file and print any file:line hits. Verify a hit by opening that line in context.
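For the verification step, a small helper that prints the lines around a hit can save opening an editor each time. This is a sketch only: context, its radius parameter, and the demo file are hypothetical names introduced here, and the helper assumes the same 1-based line numbers the scanner reports.

```python
import os
import tempfile

def context(path, line, radius=2):
    """Return the lines around a 1-based line number, marking the hit with '>'."""
    with open(path, encoding="utf-8", errors="ignore") as fh:
        # Split on "\n" so numbering matches the scanner's newline counting.
        lines = fh.read().split("\n")
    start = max(line - 1 - radius, 0)
    end = min(line + radius, len(lines))
    out = []
    for i in range(start, end):
        marker = ">" if i == line - 1 else " "
        out.append(f"{marker} {i + 1:4d}  {lines[i]}")
    return out

# Demo on a throwaway file with an eval() call on line 3.
fd, demo_path = tempfile.mkstemp(suffix=".py")
with os.fdopen(fd, "w", encoding="utf-8") as fh:
    fh.write("import os\nname = input()\nresult = eval(name)\nprint(result)\n")

snippet = context(demo_path, 3, radius=1)
print("\n".join(snippet))
os.remove(demo_path)
```

Seeing line 2 next to line 3 shows at a glance that the value reaching eval comes straight from input(), which is the kind of context that separates a real finding from noise.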
Key takeaways
- os.walk with in-place dirname pruning scans big trees fast.
- Three rule families: credentials, dangerous sinks, vulnerability patterns.
- Count newlines before a match to get its line number.
- Review output is triage — verify each candidate in context; scan git history too.
Next, automating Active Directory enumeration with LDAP and the impacket toolkit.