Generating a Constant Stream of Web Traffic with Python
There are honest reasons to want a steady trickle of outbound web traffic from a host you own. You might be testing a proxy, watching a DNS resolver under realistic load, validating a network monitoring rule, exercising an SD-WAN policy, or just generating background noise in a homelab so the dashboards aren’t flat. Whatever the motivation, the requirement is the same: predictable, polite, low-volume requests against well-known public sites — not scraping, not load-testing, not pretending to be something you’re not.
This post walks through a small Python script that does exactly that. It picks one URL at a time from a curated list of ten popular public sites, sleeps a randomised interval, and goes again. It respects robots.txt, enforces a per-domain cooldown, identifies itself honestly in the User-Agent, and backs off when a server says to. The whole thing is about 250 lines and depends on requests only.
The full code is on GitHub at michealgarner/web-traffic-generator. The walkthrough below is the same file, broken up so I can explain the bits that matter.
What it is and what it is not
It is a polite client. It sends one GET at a time, waits, and goes again. The defaults — 30 to 90 seconds between requests, 120 seconds minimum between two requests to the same host — keep it well inside what any large public site will treat as background noise.
It is not a scraper, a load tester, or a click-fraud tool. It does not log in, submit forms, or follow tracking pixels. It does not pretend to be a single specific browser; it rotates a small pool of real User-Agents and always appends traffic-generator/1.0 (+https://michealgarner.co.uk) so any operator who looks at their access logs can see what hit them and from where. That last detail matters — abusive scrapers omit it, and the entire premise of “polite” automation is that you do not.
If you want to use this against a service you do not own, the rule is the same as it is for any bot: read the target’s terms of use first, and don’t crank the intervals to zero just because the script lets you.
The site list
The ten sites it crawls by default:
- en.wikipedia.org (Special:Random, a different page every time)
- github.com/trending
- stackoverflow.com/questions
- developer.mozilla.org
- news.ycombinator.com
- python.org
- bbc.com/news
- theguardian.com/international
- nasa.gov
- reddit.com/r/linux/.json (Reddit’s documented public JSON endpoint)
These were chosen for three properties. They are all public and well-resourced — none of them will notice a request every minute or two. They all either welcome polite automated access (Wikipedia, MDN, NASA, Python.org explicitly do) or are large enough that the background noise is invisible. And none of them are politically charged, so the access logs you generate aren’t going to look strange in a corporate environment.
The walkthrough
Imports and configuration
import argparse
import logging
import random
import signal
import sys
import time
from dataclasses import dataclass, field
from typing import Optional
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser
import requests
Standard library plus requests. No aiohttp, no scrapy, no selenium — the whole point is that this is small enough to read in one sitting. urllib.robotparser is the often-forgotten part of the standard library that knows how to parse robots.txt properly, including Allow: rules and Crawl-delay:.
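As a rough illustration of what urllib.robotparser does with Allow: and Crawl-delay: lines, the sketch below parses an inline robots.txt instead of fetching one. The rules, the example.org URLs, and the bare traffic-generator agent name are made up for the example and are not taken from the script.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, parsed from a string so no network is needed.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 10
Allow: /public/
Disallow: /
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# The parser checks rules in file order, so the Allow line is reached first
# for anything under /public/ and the blanket Disallow covers the rest.
public_ok = rp.can_fetch("traffic-generator", "https://example.org/public/page")
private_ok = rp.can_fetch("traffic-generator", "https://example.org/private")

# crawl_delay() returns the Crawl-delay value for the matching group, or None.
delay = rp.crawl_delay("traffic-generator")

print(public_ok, private_ok, delay)
```

Because the standard-library parser applies the first matching rule, the order of Allow and Disallow lines in the file matters.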
SITES: list[str] = [
    "https://en.wikipedia.org/wiki/Special:Random",
    "https://github.com/trending",
    ...
]
USER_AGENTS: list[str] = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 ...",
    ...
]
UA_SUFFIX = " traffic-generator/1.0 (+https://michealgarner.co.uk)"
Three flat constants. The site list is what we cycle through. The User-Agent pool is three real strings — one Chrome, one Safari, one Firefox — that we randomise per-request so we don’t look like a single misconfigured client. The suffix is the part you must not remove: it’s how a server operator can identify what’s hitting them.
Per-domain state
@dataclass
class DomainState:
    last_hit: float = 0.0
    robots: Optional[RobotFileParser] = None
    robots_loaded: bool = False
    consecutive_errors: int = 0
    backoff_until: float = 0.0
For every domain we touch we keep a small bag of state. last_hit is a time.monotonic() timestamp that drives the per-domain cooldown. robots and robots_loaded cache the parsed robots.txt for the host so we fetch it once per process, not once per request. backoff_until is set when a host returns 429 Too Many Requests or 503 Service Unavailable; until that timestamp passes, _hit waits rather than sending the host another request.
time.monotonic() rather than time.time() is deliberate. Wall-clock time can jump backwards when the system clock is corrected, for example when NTP steps it; the monotonic clock cannot, so our cooldowns can never end up “in the past” by accident.
robots.txt — fetch once, cache forever
def _robots_for(self, url: str) -> Optional[RobotFileParser]:
    host = urlparse(url).netloc
    state = self.domains.setdefault(host, DomainState())
    if state.robots_loaded:
        return state.robots
    robots_url = f"{urlparse(url).scheme}://{host}/robots.txt"
    rp = RobotFileParser()
    rp.set_url(robots_url)
    try:
        rp.read()
        state.robots = rp
    except Exception as exc:
        logging.warning("robots.txt unreachable for %s: %s — proceeding", host, exc)
        state.robots = None
    state.robots_loaded = True
    return state.robots
The first time we touch a host, we fetch its robots.txt, parse it, and cache the parser. On subsequent requests the cache hit is free.
The try/except is important. If the host is briefly unreachable or the body cannot be decoded, rp.read() raises, we set robots = None and fail open, i.e. allow the fetch. A plain 404 never reaches the except branch: RobotFileParser.read() handles HTTP errors itself, treating 401 and 403 as “disallow everything” and any other 4xx as “allow everything”. Either way a missing robots.txt means “no rules”, which is the standard convention for crawlers. The alternative, refusing to fetch anything when robots.txt is unreachable, would mean a single dropped packet at startup disables the script for that host until restart.
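The post calls self._allowed() later but does not show it. A minimal sketch of what a fail-open check could look like, assuming it receives the cached parser (or None) from _robots_for; the allowed function, the rules, and the example.org URLs are hypothetical:

```python
from typing import Optional
from urllib.robotparser import RobotFileParser


def allowed(robots: Optional[RobotFileParser], url: str, user_agent: str) -> bool:
    """Hypothetical stand-in for _allowed(): no cached parser means no rules."""
    if robots is None:
        return True  # fail open, as described above
    return robots.can_fetch(user_agent, url)


# Made-up rules, parsed locally so the example needs no network access.
rules = RobotFileParser()
rules.parse(["User-agent: *", "Disallow: /private/"])

no_rules_result = allowed(None, "https://example.org/private/x", "traffic-generator/1.0")
blocked_result = allowed(rules, "https://example.org/private/x", "traffic-generator/1.0")
open_result = allowed(rules, "https://example.org/index.html", "traffic-generator/1.0")
```

Keeping the None case in one place means the fail-open policy can be flipped to fail-closed by changing a single line.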
Cooldowns and backoff
def _hit(self, url: str) -> None:
    host = urlparse(url).netloc
    state = self.domains.setdefault(host, DomainState())
    now = time.monotonic()
    wait = state.last_hit + self.per_domain_cooldown - now
    if wait > 0:
        time.sleep(wait)
    # Read the clock once more so the remaining backoff is never negative.
    wait = state.backoff_until - time.monotonic()
    if wait > 0:
        time.sleep(wait)
    ...
Two layered waits. The first enforces the per-domain cooldown — even if the round-robin happens to come back to the same host quickly, we won’t hit it again until the cooldown expires. The second honours any backoff window from a previous 429/503. We re-read the monotonic clock before computing the second wait because the first sleep may have already eaten part of the backoff.
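Because the second wait is computed after the first sleep, the two sleeps together last as long as whichever deadline is later. A small sketch of that arithmetic as a pure function, using a trimmed copy of DomainState; seconds_to_wait and the sample timestamps are illustrative names and values, not part of the script:

```python
from dataclasses import dataclass


@dataclass
class DomainState:
    # Trimmed copy of the dataclass above: only the two timing fields.
    last_hit: float = 0.0
    backoff_until: float = 0.0


def seconds_to_wait(state: DomainState, cooldown: float, now: float) -> float:
    """Total delay before the next request to this host may go out."""
    cooldown_left = state.last_hit + cooldown - now
    backoff_left = state.backoff_until - now
    # Sleeping out the cooldown and then the rest of the backoff is the
    # same as sleeping until the later of the two deadlines.
    return max(0.0, cooldown_left, backoff_left)


cooling = seconds_to_wait(DomainState(last_hit=1000.0), cooldown=120.0, now=1050.0)
backing_off = seconds_to_wait(
    DomainState(last_hit=1000.0, backoff_until=1300.0), cooldown=120.0, now=1050.0
)
ready = seconds_to_wait(DomainState(last_hit=1000.0), cooldown=120.0, now=1400.0)
```

Passing now in as an argument keeps the function testable without touching the real clock.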
Sending the request
    ua = random.choice(USER_AGENTS) + UA_SUFFIX
    if not self._allowed(url, ua):
        logging.info("robots.txt disallows %s — skipping", url)
        return
    headers = {
        "User-Agent": ua,
        "Accept": "text/html,application/xhtml+xml,application/json;q=0.9,*/*;q=0.5",
        "Accept-Language": "en-GB,en;q=0.9",
    }
    r = self.session.get(url, headers=headers, timeout=self.timeout, allow_redirects=True)
Three things to call out.
The requests.Session is created once in the constructor and reused for every request. That lets the underlying urllib3 pool keep connections alive and reuse sockets across requests to the same host, so a reused connection skips the TCP and TLS handshakes.
The Accept header includes application/json because one of our endpoints (Reddit) returns JSON. Without that, Reddit’s CDN can return an HTML error page.
The timeout is critical. requests does not impose a default timeout, which means a hung server can wedge the script indefinitely. We pass an explicit timeout=20 from the CLI default.
Reading the response
    if r.status_code in (429, 503):
        # Retry-After may also be an HTTP-date; anything that is not a plain
        # integer falls back to 60 seconds instead of raising ValueError.
        try:
            retry = int(r.headers.get("Retry-After") or "60")
        except ValueError:
            retry = 60
        state.backoff_until = time.monotonic() + retry
        state.consecutive_errors += 1
        self.requests_failed += 1
        return
    if 200 <= r.status_code < 400:
        self.requests_ok += 1
        state.consecutive_errors = 0
        if self.crawl_links and "text/html" in r.headers.get("Content-Type", ""):
            self._maybe_follow(r.text, url)
429 and 503 are the two status codes that mean “stop”. We honour Retry-After when it carries a number of seconds; if the header is absent, we fall back to a minute. Any error increments consecutive_errors, which a future version could use to drop a host out of the rotation entirely after enough failures.
2xx and 3xx (after redirects) count as success. If the user passed --crawl and the response is HTML, we optionally follow one link from the page.
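Retry-After can carry either a number of seconds or an HTTP-date, and a bare int() only copes with the first form. A hedged sketch of a parser that accepts both and falls back to a default otherwise; retry_after_seconds and its parameters are hypothetical names, not part of the script:

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def retry_after_seconds(
    value: Optional[str],
    default: float = 60.0,
    now: Optional[datetime] = None,
) -> float:
    """Turn a Retry-After header value into a non-negative delay in seconds."""
    if not value:
        return default
    value = value.strip()
    if value.isdigit():
        return float(value)  # delta-seconds form, e.g. "120"
    try:
        when = parsedate_to_datetime(value)  # HTTP-date form
    except (TypeError, ValueError):
        return default  # unparseable header: use the fallback
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    # A date in the past means there is nothing left to wait for.
    return max(0.0, (when - now).total_seconds())


fixed_now = datetime(2015, 10, 21, 7, 27, 0, tzinfo=timezone.utc)
from_seconds = retry_after_seconds("120")
from_date = retry_after_seconds("Wed, 21 Oct 2015 07:28:00 GMT", now=fixed_now)
from_missing = retry_after_seconds(None)
from_garbage = retry_after_seconds("soon")
```

The result is a duration, so it can be added to time.monotonic() exactly as the integer value is in the excerpt above.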
Optional shallow link-following
def _maybe_follow(self, html: str, base_url: str) -> None:
    import re
    from urllib.parse import urljoin
    host = urlparse(base_url).netloc
    candidates: list[str] = []
    for m in re.finditer(r'href="([^"#?]+)"', html):
        full = urljoin(base_url, m.group(1))
        p = urlparse(full)
        if p.scheme in ("http", "https") and p.netloc == host:
            candidates.append(full)
            if len(candidates) >= 50:
                break
    if not candidates:
        return
    choice = random.choice(candidates)
    self._hit(choice)
This is the “and now make it look more like a real browser” bit. With --crawl, after every successful HTML fetch we pick one same-domain link from the page and request that too. The regex is deliberately tiny (we are not building a real HTML parser, we just want a plausible URL) and only matches double-quoted href values with no query string or fragment. We cap candidates at 50 so a 5MB landing page doesn’t leave us collecting thousands of links.
Same-domain only is important. An <a href="https://ads.example.net/..."> would otherwise pull us to a host that is not on the curated list and that we never agreed to crawl.
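_hit calls itself recursively here. With a single hop that is harmless: one extra frame is nowhere near the recursion limit. As excerpted, though, nothing marks the nested call as a follow-up, so the followed page passes through the same success branch and, with --crawl set, triggers _maybe_follow again. Keeping the promise of exactly one link, not a tree, needs a guard such as a depth argument on _hit.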
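One way to make the “exactly one link” guarantee explicit is to pass a depth counter through the recursive call. The sketch below is a simplified model, not the script’s code: hit, fetch_links, and fake_fetch are hypothetical stand-ins for _hit, the GET plus link extraction, and a real server.

```python
import random
from typing import Callable, List


def hit(
    url: str,
    fetch_links: Callable[[str], List[str]],
    visited: List[str],
    depth: int = 0,
    max_depth: int = 1,
) -> None:
    """Fetch url, then follow at most max_depth further same-site links."""
    links = fetch_links(url)  # stand-in for the real request and regex scan
    visited.append(url)
    if depth < max_depth and links:
        hit(random.choice(links), fetch_links, visited, depth + 1, max_depth)


def fake_fetch(url: str) -> List[str]:
    # Every page links onward, so only the depth check stops the chain.
    return [url + "/a", url + "/b"]


visited: List[str] = []
hit("https://example.org", fake_fetch, visited)
```

With max_depth=1 the run touches the starting page and one followed page, then returns to the caller.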
The main loop
def run(self) -> None:
    random.shuffle(self.sites)
    i = 0
    while self.running:
        url = self.sites[i % len(self.sites)]
        i += 1
        self._hit(url)
        if not self.running:
            break
        sleep_for = random.uniform(self.min_interval, self.max_interval)
        end = time.monotonic() + sleep_for
        while self.running and time.monotonic() < end:
            # Clamp at zero: the clock can pass `end` between the check and this line.
            time.sleep(max(0.0, min(0.5, end - time.monotonic())))
Round-robin with a randomised inter-request delay. The shuffle on entry means two instances started simultaneously won’t lock-step.
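The inner while sleep is a small piece of UX: rather than calling time.sleep(60) and forcing the user to wait up to a minute for Ctrl-C to take effect, we sleep in 500 ms slices and check self.running between them. When the signal arrives during this inter-request sleep, the script exits within about half a second. The cooldown and backoff sleeps inside _hit are plain time.sleep calls, so a signal that lands there waits for them to finish.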
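The sliced sleep can be pulled out into a helper so the same pattern is available to any wait in the script. This is a sketch only; interruptible_sleep, keep_going, and slice_s are names invented for the example.

```python
import time
from typing import Callable


def interruptible_sleep(
    seconds: float,
    keep_going: Callable[[], bool],
    slice_s: float = 0.5,
) -> float:
    """Sleep up to `seconds`, re-checking keep_going() between short slices.

    Returns the time actually spent, measured on the monotonic clock.
    """
    start = time.monotonic()
    end = start + seconds
    while keep_going():
        remaining = end - time.monotonic()
        if remaining <= 0:
            break
        time.sleep(min(slice_s, remaining))
    return time.monotonic() - start


flag = {"running": True}
slept = interruptible_sleep(0.05, lambda: flag["running"], slice_s=0.01)

flag["running"] = False  # what the signal handler would do
skipped = interruptible_sleep(5.0, lambda: flag["running"])
```

The slice length is the trade-off: shorter slices react faster to a stop request at the cost of more wake-ups.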
Signal handling
def stop(self, *_args) -> None:
    self.running = False
    logging.info("Stop requested — finishing current request and exiting.")

signal.signal(signal.SIGINT, gen.stop)
signal.signal(signal.SIGTERM, gen.stop)
We catch SIGINT (Ctrl-C) and SIGTERM (the polite signal systemd and kill send by default) and flip a flag. The current request gets to finish cleanly; the next inter-request sleep wakes up and the loop exits. The try/finally in main() then prints the request summary so you can see what the run did.
Argparse
p.add_argument("--min", type=float, default=30.0, ...)
p.add_argument("--max", type=float, default=90.0, ...)
p.add_argument("--per-domain", type=float, default=120.0, ...)
p.add_argument("--timeout", type=float, default=20.0, ...)
p.add_argument("--crawl", action="store_true", ...)
p.add_argument("--log-file", type=str, default=None, ...)
p.add_argument("--verbose", "-v", action="store_true", ...)
The defaults are deliberately gentle. --min 30 --max 90 --per-domain 120 averages out to roughly one request a minute across the whole rotation, with no host being touched more often than every two minutes. If you want to dial it up for stress-testing your own infrastructure, you can — but be deliberate about it.
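The arithmetic behind those defaults is short enough to check by hand. The sketch below assumes strict round-robin and ignores the time each request takes; mean_interval and revisit_period are illustrative helper names.

```python
def mean_interval(min_s: float, max_s: float) -> float:
    """Expected value of random.uniform(min_s, max_s)."""
    return (min_s + max_s) / 2


def revisit_period(n_sites: int, min_s: float, max_s: float) -> float:
    """Average seconds between two visits to the same site under round-robin."""
    return n_sites * mean_interval(min_s, max_s)


default_gap = mean_interval(30.0, 90.0)       # seconds between any two requests
ten_sites = revisit_period(10, 30.0, 90.0)    # default list of ten sites
two_sites = revisit_period(2, 30.0, 90.0)     # a trimmed list of two sites

print(default_gap, ten_sites, two_sites)
```

With ten sites each host comes round about every 600 seconds, far above the 120-second cooldown. With two sites the average is exactly 120 seconds, so the cooldown starts to dictate the pace.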
Running it from a bash prompt
Clone, set up a virtualenv, install one dependency, run.
git clone https://github.com/michealgarner/web-traffic-generator.git
cd web-traffic-generator
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
python3 traffic_generator.py
Output looks like this:
2026-05-03 09:14:22 INFO traffic-generator starting — interval 30-90s, per-domain cooldown 120s
2026-05-03 09:14:22 INFO 200 https://en.wikipedia.org/wiki/Special:Random — 84211 bytes in 312ms
2026-05-03 09:15:08 INFO 200 https://news.ycombinator.com/ — 11820 bytes in 91ms
2026-05-03 09:16:01 INFO 200 https://www.python.org/ — 49003 bytes in 188ms
Stop it with Ctrl-C and you’ll get a one-line summary:
2026-05-03 17:42:11 INFO Stop requested — finishing current request and exiting.
2026-05-03 17:42:11 INFO Requests: 412 (ok=409, failed=3)
Running it under tmux
If you want it running in the background, surviving SSH disconnects and the closing of your terminal, run it in tmux. That’s the right tool for this — nohup works but loses you the ability to peek at what it’s doing; systemd is overkill for a personal script you start and stop by hand.
# Create a session named "traffic" and start the script in one go
tmux new -s traffic 'source .venv/bin/activate && python3 traffic_generator.py --log-file traffic.log'
Detach without stopping the script: press Ctrl-b, release, then press d. Your shell prompt comes back, the script keeps running.
Reattach later, from anywhere:
tmux attach -t traffic
List all sessions:
tmux ls
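Stop it cleanly: reattach and press Ctrl-C. Because the session was created around that one command, tmux closes it as soon as the script exits, so the summary only flashes past on screen; with --log-file it should also be at the end of traffic.log.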
A neat side benefit of --log-file is that even if your tmux session is killed unexpectedly (server reboot, OOM-killer, you), the log on disk shows exactly what the script did up to that moment. Bodies are never logged — only request lines and bytes — so the file stays small even over a long run.
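The post does not show how logging is configured. One plausible setup that would produce lines in the format shown earlier and mirror them to a file is sketched below; build_logger, the logger name, and the format string are assumptions inferred from the sample output, not the script’s actual code.

```python
import logging
import os
import tempfile
from typing import List, Optional


def build_logger(log_file: Optional[str] = None, verbose: bool = False) -> logging.Logger:
    """Hypothetical logging setup: console always, file only when requested."""
    logger = logging.getLogger("traffic-generator-sketch")
    logger.setLevel(logging.DEBUG if verbose else logging.INFO)
    logger.handlers.clear()
    logger.propagate = False
    fmt = logging.Formatter(
        "%(asctime)s %(levelname)s %(message)s", datefmt="%Y-%m-%d %H:%M:%S"
    )
    handlers: List[logging.Handler] = [logging.StreamHandler()]
    if log_file:
        handlers.append(logging.FileHandler(log_file, encoding="utf-8"))
    for handler in handlers:
        handler.setFormatter(fmt)
        logger.addHandler(handler)
    return logger


# Write one request line to a temporary file and read it back.
log_path = os.path.join(tempfile.mkdtemp(), "traffic.log")
logger = build_logger(log_path)
logger.info("%d %s %d bytes in %dms", 200, "https://example.org/", 1234, 56)
for handler in logger.handlers:
    handler.flush()
with open(log_path, encoding="utf-8") as fh:
    logged = fh.read()
```

Only the status, URL, size, and timing are formatted into the message, which is what keeps the file small over a long run.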
Where to take it next
Three obvious next steps:
- Add or swap sites. Edit the SITES list at the top of the file. Keep the per-domain cooldown in mind: ten sites at 120 seconds each is fine, two sites at 120 seconds each is borderline.
- Change selection strategy. The current code shuffles once and round-robins. Replacing the round-robin index with random.choice(self.sites) gives you uniform random selection; passing a weights= list to random.choices gives you “hit Wikipedia twice as often as Reddit”.
- Persist counts. The requests_made / requests_ok / requests_failed counters are memory-only. Writing them to a file every N requests, or exposing them through the Prometheus node_exporter textfile collector, gives you a graph in Grafana for free.
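For the selection-strategy idea, a minimal sketch of weighted picking with random.choices. The two URLs come from the SITES excerpt above; the WEIGHTS values and the pick_site helper are hypothetical.

```python
import random
from collections import Counter

SITES = [
    "https://en.wikipedia.org/wiki/Special:Random",
    "https://github.com/trending",
]
WEIGHTS = [2, 1]  # hypothetical: the first site is twice as likely as the second


def pick_site(rng: random.Random) -> str:
    """Pick one URL, honouring the relative weights."""
    # random.choices returns a list of k items, so take the single element.
    return rng.choices(SITES, weights=WEIGHTS, k=1)[0]


rng = random.Random(42)  # seeded only to make the example repeatable
counts = Counter(pick_site(rng) for _ in range(3000))
```

Random selection can return the same host twice in a row, so the per-domain cooldown matters more here than it does with round-robin.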
The repo is at github.com/michealgarner/web-traffic-generator. MIT-licensed. Issues and pull requests welcome — particularly if you spot a site I’ve put in the default list that’s about to throttle me.