Who Sent That RST? Forensic Classification of TCP Resets with rst-forensics

“The customer says it’s our app. Our app says it’s their firewall.”

If you’ve supported a B2B SaaS for any length of time, you’ve taken this ticket. A specific customer — usually a big bank, government agency, or healthcare org — reports that connections to your service are dying mid-flight. Sometimes during the TLS handshake. Sometimes a few seconds in, right when their browser tries to upload something. Wireshark on their side shows a clean SYN, SYN-ACK, ACK, and then a [RST, ACK] killing the connection. Your side shows the same RST. Neither end sent it.

Or rather: somebody sent it, but TCP is a fairly anonymous protocol. The packet has a source IP and a TTL and a window value, and from those alone you’re supposed to magically deduce whether it came from the legitimate server stack, from the client kernel giving up, or from a corporate firewall that decided your traffic violated some policy nobody warned anyone about.

Wireshark won’t tell you. The five-tuple says “came from your server’s IP,” because spoofing that is the entire point of inline reset injection. So you fall back to tribal knowledge — “I’ve seen FortiGates do this when…” — and that’s not a verdict you can hand to a customer.

rst-forensics is my attempt to make the verdict reproducible. It takes a small struct of observations from each RST it sees and runs six independent scorers — TTL, IP-ID continuity, window value, TCP options, sequence-number plausibility, and arrival timing — that each vote for one of SERVER, MIDPATH, or CLIENT. Their weighted votes aggregate into a verdict with a confidence number and a per-scorer evidence list, so when you hand it to a customer you’re not saying “it feels like your firewall,” you’re saying “here are four independent signals that all point the same direction.”

The scenario that motivated it

A SaaS API hosted on Linux behind a normal cloud load balancer. Most customers work fine. One specific customer, call them BigBank, started seeing intermittent failures last Thursday. Their security team rolled out a new IPS policy on their FortiGate the same week, but they swear that’s unrelated and ask us to investigate the server.

We pull a pcap from the LB-facing tap during a failing call. We see a normal three-way handshake from BigBank’s egress IP, a TLS ClientHello, our TLS ServerHello, and then four packets later — well before the application has done anything — a [RST, ACK] from our server’s IP, killing the session. The application logs on the LB show no close, no error, no connection event at all for that flow. From the server’s perspective, the connection just disappeared.

This is exactly the case rst-forensics was built for.

What it actually does

The library is two layers stacked: a pure-Python classifier that takes a RstObservation and returns a Verdict, and a set of capture adapters that turn live traffic, a pcap, or an active probe into those observations. The classifier has zero dependency on scapy — that lives only in the adapters. You can unit-test the verdict logic with no privileges, no libpcap, and no network at all, which makes it pleasant to extend.

The interesting work happens in flow.py’s FlowTracker. It pins the server side of every flow off the SYN-ACK (the first packet that unambiguously identifies who’s listening), then maintains a rolling baseline per flow: the server’s TTL, last IP-ID, window value, advertised TCP options, RTT, expected next sequence number in each direction, and the rolling rcv-window high/low. When a RST comes through, the tracker emits a RstObservation carrying the RST’s own header values plus the baseline they should be compared against. Adapters never touch scoring — they just fill in PacketMeta structs and hand them to the tracker. That separation is what lets the same classifier run against a live AsyncSniffer, a stored pcap, or an active probe without duplicated code.
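To make the baseline idea concrete, here is a minimal sketch of a tracker that pins the server off the SYN-ACK and hands back the RST together with the baseline it should be compared against. Pkt, Baseline, and MiniTracker are hypothetical names for illustration; the real PacketMeta and FlowTracker carry more state (options, RTT, expected sequence numbers) and may be shaped differently.

```python
from dataclasses import dataclass
from typing import Optional

SYN, RST, ACK = 0x02, 0x04, 0x10  # TCP flag bits


@dataclass
class Pkt:
    """Hypothetical stand-in for PacketMeta; field names are assumptions."""
    src: str
    dst: str
    sport: int
    dport: int
    flags: int
    ttl: int
    ip_id: int
    window: int


@dataclass
class Baseline:
    server: tuple      # (ip, port) pinned from the SYN-ACK
    ttl: int
    last_ip_id: int
    window: int


class MiniTracker:
    """Toy flow tracker: one baseline per flow, refreshed by server packets."""

    def __init__(self) -> None:
        self.flows: dict = {}

    @staticmethod
    def _key(p: Pkt) -> frozenset:
        # Direction-independent flow key: the unordered pair of endpoints.
        return frozenset({(p.src, p.sport), (p.dst, p.dport)})

    def feed(self, p: Pkt) -> Optional[tuple]:
        """Return (rst_packet, baseline) when a RST hits a tracked flow."""
        key = self._key(p)
        if p.flags & SYN and p.flags & ACK:
            # SYN-ACK: its sender is the listening side, so pin it as server.
            self.flows[key] = Baseline((p.src, p.sport), p.ttl, p.ip_id, p.window)
            return None
        base = self.flows.get(key)
        if base is None:
            return None  # no SYN-ACK seen, so there is nothing to compare against
        if p.flags & RST:
            return (p, base)  # the RST itself is never folded into the baseline
        if (p.src, p.sport) == base.server:
            base.ttl, base.last_ip_id, base.window = p.ttl, p.ip_id, p.window
        return None
```

The point of the sketch is the ordering: the RST is compared against what the flow looked like before it arrived, so a forged packet cannot contaminate its own baseline.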

The six scorers

Each scorer is a pure function from RstObservation to a Score(origin, weight, reason). They’re independent on purpose: any one of them being fooled by a weird path shouldn’t move the verdict on its own.

TTL delta is the classical fingerprint. A real RST from the server takes the same number of hops back as every other packet from the server, so its TTL on arrival should match the baseline within a hop or two. An inline injector forges the source IP, but its packets start from a different point on the path, and often from a different initial TTL as well (some network devices default to 255 where Linux defaults to 64), so they arrive with a TTL the server could not have produced. Higher than baseline = injected from closer = MIDPATH.
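A scorer along these lines might look like the sketch below. The Score tuple mirrors the Score(origin, weight, reason) shape described above, but score_ttl, the two-hop tolerance, the 0.5 server weight, and the decision to abstain on a lower TTL are all assumptions for illustration; only the 0.85 midpath weight appears in the sample evidence later in this post.

```python
from typing import NamedTuple


class Score(NamedTuple):
    origin: str   # "server", "midpath", "client", or "unknown"
    weight: float
    reason: str


def score_ttl(rst_ttl: int, baseline_ttl: int, tolerance: int = 2) -> Score:
    """Vote on origin from the RST's TTL versus the server baseline."""
    delta = rst_ttl - baseline_ttl
    if abs(delta) <= tolerance:
        # Same path length as every other server packet, give or take a hop.
        return Score("server", 0.5,
                     f"TTL {rst_ttl} within {tolerance} of baseline {baseline_ttl}")
    if delta > 0:
        # Fewer hops consumed (or a higher initial TTL) than the real server.
        return Score("midpath", 0.85,
                     f"TTL {rst_ttl} exceeds server baseline {baseline_ttl} by {delta}")
    # A lower TTL is ambiguous (reroute, different stack), so this sketch abstains.
    return Score("unknown", 0.0,
                 f"TTL {rst_ttl} below baseline {baseline_ttl}; abstaining")
```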

IP-ID continuity catches stack mismatches. Linux running per-flow IP-ID counters often emits 0; a stack with a per-host counter increments by 1 every packet it sends. Either pattern is fine — the point is consistency. A RST whose IP-ID jumps thousands of values forward from the last server packet is almost certainly minted by a different stack entirely.
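The one subtlety worth sketching is that IP-ID is a 16-bit field, so "how far did it jump" has to be computed modulo 65536. The helper names and the max_step threshold of 64 below are hypothetical; the real scorer may draw the line elsewhere.

```python
def ip_id_jump(last_server_id: int, rst_id: int) -> int:
    """Forward distance from the last server IP-ID to the RST's, modulo 2**16."""
    return (rst_id - last_server_id) % 65536


def ip_id_consistent(last_server_id: int, rst_id: int, max_step: int = 64) -> bool:
    """True when the RST's IP-ID fits either pattern the server might use.

    max_step is an illustrative allowance for a per-host counter that also
    advances on packets sent to other flows between our two observations.
    """
    if last_server_id == 0 and rst_id == 0:
        return True  # constant-zero pattern on both packets
    return 0 < ip_id_jump(last_server_id, rst_id) <= max_step
```

Without the modulo, a counter wrapping from 65535 to 1 would look like a jump of tens of thousands and trigger a false MIDPATH vote.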

Window value is where the FortiGate fingerprint lives. Inline reset injectors don’t bother computing a real receive window for a packet they’re forging — they hardcode something. The set {0, 4128, 8192, 16384} covers Cisco ASA, FortiGate, Palo Alto, and several common IPS appliances. Real sockets very rarely land on those values by accident, so a sentinel match is a strong MIDPATH vote.
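As a sketch, the check is little more than set membership. The sentinel values are the ones listed above and 0.75 is the weight from the sample evidence later in this post; score_window and its abstain-on-miss behaviour are assumptions about how such a scorer could be written.

```python
# Sentinel windows quoted in the post.
FIREWALL_SENTINELS = frozenset({0, 4128, 8192, 16384})


def score_window(rst_window: int) -> tuple:
    """Return (origin, weight, reason) for a RST's advertised window."""
    if rst_window in FIREWALL_SENTINELS:
        return ("midpath", 0.75,
                f"window {rst_window} matches firewall sentinel set")
    # Abstaining on a non-match is an assumption: a non-sentinel window
    # does not by itself prove the server sent the packet.
    return ("unknown", 0.0, f"window {rst_window} is not a known sentinel")
```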

TCP options trip the same wire. A server stack that negotiated SACK and timestamps in the SYN-ACK keeps emitting the timestamp option on its RSTs (RFC 7323 makes this a SHOULD rather than a MUST, but every modern stack does it). Inline injectors strip everything down to the bare 20-byte TCP header to keep their forged packets fast and minimal. The absence of timestamps on a RST whose flow had them is the tell.
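A minimal sketch of that comparison, assuming options are reduced to a set of names such as "Timestamp" and "SAckOK" before scoring. The function name, the option spelling, and all three weights are placeholders, not values taken from the library.

```python
def score_options(baseline_opts: frozenset, rst_opts: frozenset) -> tuple:
    """Return (origin, weight, reason) by comparing option sets.

    baseline_opts -- option names the server advertised on the SYN-ACK
    rst_opts      -- option names present on the RST itself
    """
    negotiated_ts = "Timestamp" in baseline_opts
    if negotiated_ts and "Timestamp" not in rst_opts:
        return ("midpath", 0.7,
                "flow negotiated timestamps but the RST carries none")
    if negotiated_ts:
        return ("server", 0.5, "RST keeps the negotiated timestamp option")
    # Without a negotiated timestamp there is nothing to go missing.
    return ("unknown", 0.0, "flow never negotiated timestamps")
```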

Sequence-number plausibility is the hardest signal to fake. To send a legitimate RST, you need to know what byte the receiver expects next — that’s the entire point of TCP state. A RST whose seq matches expected_seq exactly is a server vote with weight 0.85. A seq inside the advertised receive window is a softer server vote. A seq outside the window is a blind injection — the firewall doesn’t actually track per-flow byte-level sequence state, so it lobs in a best-guess seq and hopes the kernel accepts it. Modern Linux usually doesn’t.
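The three cases can be sketched as follows. Sequence numbers wrap at 2**32, so the in-window test uses modular distance rather than a plain comparison. The 0.85 and 0.80 weights are the ones quoted in this post; score_seq, its parameter names, and the 0.4 in-window weight are assumptions.

```python
SEQ_MOD = 2 ** 32  # TCP sequence numbers are 32-bit and wrap around


def score_seq(rst_seq: int, expected_seq: int, rcv_window: int) -> tuple:
    """Return (origin, weight, reason) for a RST's sequence number."""
    if rst_seq == expected_seq:
        return ("server", 0.85, f"seq {rst_seq} matches expected_seq exactly")
    # Distance ahead of expected_seq, measured around the 32-bit circle.
    offset = (rst_seq - expected_seq) % SEQ_MOD
    if offset < rcv_window:
        # 0.4 is a placeholder for the softer in-window server vote.
        return ("server", 0.4, f"seq {rst_seq} inside receive window")
    upper = (expected_seq + rcv_window) % SEQ_MOD
    return ("midpath", 0.80,
            f"seq {rst_seq} outside receive window [{expected_seq},{upper}]")
```

Fed the BigBank numbers (seq 0, expected_seq 3284100132, a 65536-byte window), the modular offset is about a billion bytes, nowhere near the window.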

Arrival timing is the dissent vote. A RST that the server sends in reaction to our last outbound byte cannot arrive sooner than half an RTT after that byte left, because the byte needs at least that long just to reach the server. Anything faster physically can’t have been the server. So a Δt < ½ RTT after our last outbound byte is a strong MIDPATH vote: the firewall is closer to us than the server is, so it can answer faster than the real endpoint could. The same scorer flips for outgoing RSTs: a RST going toward the server that arrives more than 1.5× RTT after the last activity is the local stack giving up, which is the CLIENT vote.
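Both thresholds fit in a few lines. This is a sketch under assumptions: score_timing and its parameters are hypothetical, 0.90 is the midpath weight from the sample evidence later in this post, and 0.6 is a placeholder for the CLIENT weight, which the post does not state.

```python
def score_timing(delta_s: float, rtt_s: float, outgoing: bool) -> tuple:
    """Return (origin, weight, reason) from the gap before the RST.

    delta_s  -- seconds between the last activity on the flow and the RST
    rtt_s    -- the flow's measured round-trip time in seconds
    outgoing -- True when the RST is travelling toward the server
    """
    if rtt_s <= 0:
        return ("unknown", 0.0, "no RTT baseline for this flow")
    if not outgoing and delta_s < rtt_s / 2:
        # Faster than a one-way trip to the server: something closer replied.
        return ("midpath", 0.90,
                f"Δ={delta_s:.4f}s < ½ RTT ({rtt_s / 2:.4f}s)")
    if outgoing and delta_s > 1.5 * rtt_s:
        # Long silence followed by a RST toward the server: local stack gave up.
        return ("client", 0.6,
                f"Δ={delta_s:.4f}s > 1.5× RTT ({1.5 * rtt_s:.4f}s)")
    return ("unknown", 0.0, "timing inconclusive")
```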

Aggregation

classify() runs all six scorers, sums weights into the bucket each scorer voted for (UNKNOWN abstentions don’t count), picks the bucket with the largest total, and reports confidence = winning_total / total_cast. A unanimous verdict shows up as 100%; four agreeing votes with two abstentions still hit 100%, because confidence is computed over cast votes only; a contested verdict with conflicting fingerprints comes out at 60-70% and tells you to look closer at the per-scorer evidence list. That confidence number is doing real work: it’s the difference between “definitely a midpath box” and “the timing scorer says midpath but everything else says server, you might be looking at a route flap.”
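The arithmetic is simple enough to sketch in full. The function below is named aggregate rather than classify to make clear it is an illustration of the rule just described, not the library's code; it takes plain (origin, weight) pairs instead of Score objects.

```python
from collections import defaultdict


def aggregate(votes: list) -> tuple:
    """Return (verdict, confidence) from a list of (origin, weight) votes."""
    totals = defaultdict(float)
    for origin, weight in votes:
        if origin != "unknown":      # abstentions are not cast votes
            totals[origin] += weight
    total_cast = sum(totals.values())
    if total_cast == 0:
        return ("unknown", 0.0)      # every scorer abstained
    winner = max(totals, key=totals.get)
    return (winner, totals[winner] / total_cast)


# Four midpath votes plus two abstentions: confidence is 1.0 on cast votes.
unanimous = aggregate([
    ("midpath", 0.85), ("unknown", 0.0), ("midpath", 0.75),
    ("unknown", 0.0), ("midpath", 0.80), ("midpath", 0.90),
])

# Timing says midpath, two other scorers say server: a contested verdict.
contested = aggregate([("midpath", 0.90), ("server", 0.85), ("server", 0.5)])
```

In the contested case the server bucket wins with 1.35 of 2.25 cast, so the confidence lands at 60%, which is the signal to read the evidence list instead of trusting the headline.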

What the verdict looks like on BigBank’s pcap

Run the pcap through the CLI:

$ rst-forensics pcap captures/bigbank-rst.pcap
┏━━━━━┳━━━━━┳━━━━━┳━━━━━━━┳━━━━━━┳━━━━━━━┳━━━━━━━━━┳━━━━━━┓
┃   # ┃ dir ┃ ttl ┃ ip-id ┃  win ┃ opts  ┃ verdict ┃ conf ┃
┡━━━━━╇━━━━━╇━━━━━╇━━━━━━━╇━━━━━━╇━━━━━━━╇━━━━━━━━━╇━━━━━━┩
│   1 │ →cli│ 240 │ 38291 │    0 │ -     │ midpath │ 100% │
└─────┴─────┴─────┴───────┴──────┴───────┴─────────┴──────┘
exit 2

--json adds the full evidence list, and that’s what goes in the ticket. The four scorers that voted are each a paragraph BigBank’s security team can verify independently against their own logs:

[midpath  w=0.85] TTL 240 exceeds server baseline 64 by 176 (closer hop)
[midpath  w=0.75] window 0 matches firewall sentinel set
[midpath  w=0.80] seq 0 outside receive window [3284100132,3284165668]
[midpath  w=0.90] Δ=0.0008s < ½ RTT (0.0270s); too fast for the server to have replied

Four separate fingerprints — hop count, window value, sequence number, and arrival latency — each independently say the RST didn’t come from our server. None of them can be fooled by the firewall spoofing the source IP, because none of them rely on the source IP. The TTL says the packet originated closer to the capture point than our server is. The window value matches a hardcoded sentinel that real sockets don’t pick. The sequence number is wildly outside what our server would have used, because the firewall didn’t bother tracking byte state. And the timing is sub-millisecond from the last outbound byte, against a 27ms half-RTT — there is no way the server replied that fast.

With that in the ticket, BigBank’s security team can grep their FortiGate IPS log for a session-end on that five-tuple at that timestamp and the conversation moves from “who broke it” to “which IPS signature, and can we tune it.”

Capture, CLI, and CI

Three subcommands, Rich table by default, --json for machine consumption:

# Analyse a packet capture you already have
rst-forensics pcap captures/bigbank-rst.pcap

# Sniff live for thirty seconds (needs root or CAP_NET_RAW)
sudo rst-forensics passive --iface eth0 --timeout 30

# Initiate a connection and classify whatever RST comes back
sudo rst-forensics active --host api.example.com --port 443

# Pipe to jq in CI
rst-forensics --json pcap suspect.pcap | jq '.[] | select(.verdict=="midpath")'

Exit codes follow the same convention as pmtud-sweeper: 0 for clean (no RSTs, or RSTs with SERVER / CLIENT verdicts), 1 for setup errors (bad path, scapy missing, no privileges), 2 for “at least one RST classified MIDPATH.” The exit-code contract is the entire reason this is a CLI rather than a notebook — drop it into your post-deploy smoke pcap walk, fail the build on exit 2, and you’ll catch the day a new security appliance starts forging closes on your egress.
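If a pipeline step consumes the --json output in Python rather than checking the process exit code, the same contract can be reproduced in a few lines. This sketch assumes only what the jq filter above implies, namely a top-level array of objects with a verdict field; gate_exit_code is a hypothetical helper, and setup errors (exit 1) are left to the CLI itself.

```python
import json

EXIT_CLEAN = 0     # no RSTs, or only SERVER / CLIENT verdicts
EXIT_MIDPATH = 2   # at least one RST classified MIDPATH


def gate_exit_code(json_text: str) -> int:
    """Map rst-forensics JSON output onto the documented exit codes."""
    records = json.loads(json_text)
    if any(rec.get("verdict") == "midpath" for rec in records):
        return EXIT_MIDPATH
    return EXIT_CLEAN
```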

How it’s tested

A library that confidently labels firewalls had better not be wrong, and the test surface is where you check that. The repo carries a deterministic fixture builder (tests/fixtures/build_fixtures.py) that synthesises three lab pcaps with scapy:

  • server_netem_rst.pcap — a Linux server politely closing under simulated path latency. Verdict: SERVER, confidence 1.00.
  • fortigate_inline_rst.pcap — the firewall scenario from the post, with TTL=240, window=0, no options, sub-RTT arrival. Verdict: MIDPATH, confidence 1.00.
  • client_rst.pcap — the local stack giving up on an outgoing flow. Verdict relies on the timing scorer’s CLIENT vote in isolation, because every fingerprint scorer compares against the server baseline (a known soft spot, flagged for phase 4).

The pcaps themselves aren’t committed — they’re rebuilt from the script in CI before the suite runs, so the repo stays lean and the fixtures stay byte-identical across machines. The full suite (46 tests) runs against Python 3.10 through 3.14 on every push.

Where this fits

This is the companion tool to the Fortinet packet-flow series: same kernel-level packet behaviour, different forensic vantage. The packet-flow series walked through what a FortiGate does to a packet on its way through; rst-forensics is what you reach for when you suspect that processing includes closing connections on someone else’s behalf, and you need evidence to back the claim.

Repo, install instructions, deterministic fixtures, and CI matrix on GitHub: github.com/MichealGarner/rst-forensics. Pure-Python classifier, optional scapy adapters, MIT-licensed. Install with pip install git+https://github.com/MichealGarner/rst-forensics, point it at a pcap, and stop guessing who hung up.

Phase 4, making outgoing client RSTs classify cleanly as CLIENT instead of relying on the timing scorer’s lone dissent, is open work. The current scorers all read every RST against the server baseline, which is the right call for incoming RSTs but soft on outgoing ones. Direction-aware scoring is the natural next step, and PRs are welcome.