Python › Recon Automation

Parsing URLs and extracting domains

5 min read Intermediate 6 sections

Recon is mostly text wrangling at scale: pulling apart URLs, extracting domains, collecting parameters. The first rule is one beginners always break — don’t parse URLs by hand with string splitting. Python’s standard library does it correctly, handling every edge case you’d otherwise get wrong. Get this building block right and the rest of recon stacks on top of it.

You'll learn to

  • Break a URL into its parts reliably
  • Extract clean domains and parameters
  • Resolve relative links into absolute URLs

Use urllib, not string splitting

from urllib.parse import urlparse, parse_qs, urljoin

u = urlparse("https://api.example.com:443/v1/users?id=5&sort=asc")
u.scheme     # 'https'
u.hostname   # 'api.example.com'   ← the bare domain, no port
u.port       # 443
u.path       # '/v1/users'
u.query      # 'id=5&sort=asc'

urlparse returns an object whose attributes are the URL’s components. hostname gives you the bare domain with no port or path — which is what you collect and deduplicate. Doing this by hand with split("/") breaks on ports, credentials in the URL, missing schemes, and a dozen other cases the library already handles.

Parsing the query string

parse_qs(u.query)
# {'id': ['5'], 'sort': ['asc']} — each value is a list (params can repeat)

parse_qs turns a query string into a dictionary so you can read and manipulate parameters programmatically. Note each value is a list, because a parameter can appear more than once in a URL.

# A page links to "/login" — turn it into a full URL:
urljoin("https://example.com/app/", "/login")
# → 'https://example.com/login'

urljoin("https://example.com/app/", "../api/v1")
# → 'https://example.com/api/v1'

urljoin correctly resolves a relative link (the href="/login" you scraped from a page) against a base URL. This is essential for crawling, and it’s another thing that’s surprisingly easy to get wrong by hand — relative paths, .. segments, and absolute paths all have rules the library knows.

Putting it together: extract unique domains and params

from urllib.parse import urlparse, parse_qs

urls = open("urls.txt").read().splitlines()

domains = set()
params = set()
for raw in urls:
    u = urlparse(raw)
    if u.hostname:
        domains.add(u.hostname)
    for name in parse_qs(u.query):
        params.add(name)

print(f"{len(domains)} unique domains, {len(params)} unique parameters")

This reads a file of URLs and accumulates two sets — every unique host and every unique parameter name. It’s the shape of real recon: loop the input, parse each item, accumulate into sets that dedupe automatically. The parameter set is directly useful as a fuzzing wordlist.

Checkpoint

You've collected 8000 URLs from several tools and want a clean, deduplicated list of just the hostnames. What's the approach?

Try it yourself

Take a few sample URLs (mix in one with a port and one with query parameters). Use urlparse to print the hostname and path of each, then parse_qs to list the parameter names. Notice how it handles the port and the query correctly without any manual splitting.

Summary

Use urllib.parse, never manual string splitting, to break URLs apart: urlparse gives .hostname, .path, .query and more; parse_qs turns a query string into a dict (values are lists); urljoin resolves relative links against a base. The core recon pattern is to loop your URLs, parse each, and accumulate hostnames and parameter names into sets that dedupe automatically — producing clean host lists and observed-parameter wordlists.

Key takeaways

  • urlparse(url).hostname is the right way to get a clean domain.
  • parse_qs(query) turns parameters into a dict (each value is a list).
  • urljoin(base, link) resolves relative links for crawling.
  • Loop, parse, accumulate into sets — the shape of every recon script.

Quick quiz

Next, fetching and analysing JavaScript files to extract the endpoints, routes, and secrets an app’s frontend reveals.

Was this lesson helpful?