
When protections outlive their purpose: A lesson on managing defense systems at scale


GitHub Blog · 13 minute read

To keep a platform like GitHub available and responsive, it's critical to build defense mechanisms. A whole lot of them. Rate limits, traffic controls, and protective measures spread across multiple layers of infrastructure all play a role in keeping the service healthy during abuse or attacks. We recently ran into a challenge: those same protections can quietly outlive their usefulness and start blocking legitimate users. This is especially true for protections added as emergency responses during incidents, when responding quickly means accepting broader controls that aren't necessarily meant to be long-term.

User feedback led us to clean up outdated mitigations and reinforced that observability is just as critical for defenses as it is for features. We apologize for the disruption; we should have caught and removed these protections sooner. Here's what happened.

What users reported

We saw reports on social media from people getting "too many requests" errors during normal, low-volume browsing, such as when following a GitHub link from another service or app, or just browsing around with no obvious pattern of abuse.

Users encountered a "Too many requests" error during normal browsing.

These were users making a handful of normal requests, yet hitting rate limits that shouldn't have applied to them.

What we found

Investigating these reports, we discovered the root cause: protection rules added during past abuse incidents had been left in place. These rules were based on patterns that had been strongly associated with abusive traffic when they were created. The problem is that those same patterns were also matching some logged-out requests from legitimate clients.

These patterns are combinations of industry-standard fingerprinting techniques and platform-specific business logic: composite signals that help us distinguish legitimate usage from abuse. Unfortunately, composite signals can occasionally produce false positives.

The composite approach did provide filtering. Among requests that matched the suspicious fingerprints, only about 0.5-0.9% were actually blocked; specifically, those that also triggered the business-logic rules. Requests that matched both criteria were blocked 100% of the time.

Not all fingerprint matches resulted in blocks; only those also matching business logic patterns.

The overall impact was small but consistent; however, for the customers who were affected, we recognize that any incorrect blocking is unacceptable and can be disruptive.

To put all of this in perspective, the following shows the false-positive rate relative to total traffic.

False positives represented roughly 0.003-0.004% of total traffic.

Although the percentage was low, it still meant that real users were incorrectly blocked during normal browsing, which is not acceptable.

The chart below zooms in specifically on this false-positive pattern over time.

In the hour before cleanup, approximately 3-4 requests per 100,000 (0.003-0.004%) were incorrectly blocked.

This is a common challenge when defending platforms at scale. During active incidents, you need to respond quickly, and you accept some tradeoffs to keep the service available. The mitigations are correct and necessary at that moment. But those emergency controls don't age well as threat patterns evolve and legitimate tools and usage change. Without active maintenance, temporary mitigations become permanent, and their side effects compound quietly.
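To make the composite matching described above concrete, here is a minimal sketch of how such a rule could be evaluated. The request fields, rule predicates, and example patterns are hypothetical illustrations under the assumptions stated in the comments; they are not GitHub's actual signals or implementation, which are intentionally not disclosed.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical request record. The real fingerprinting inputs and business
# logic are not public; these fields exist only to make the sketch runnable.
@dataclass
class Request:
    fingerprint: str
    path: str
    authenticated: bool

# In this sketch, a protection rule is simply a predicate over a request.
Rule = Callable[[Request], bool]

def should_block(req: Request,
                 fingerprint_rules: List[Rule],
                 business_rules: List[Rule]) -> bool:
    """Block only when a fingerprint rule AND a business-logic rule both match."""
    if not any(rule(req) for rule in fingerprint_rules):
        return False  # most traffic never matches a suspicious fingerprint
    # Among fingerprint matches, only ~0.5-0.9% also tripped business logic;
    # requests matching both criteria were blocked 100% of the time.
    return any(rule(req) for rule in business_rules)

# Example: a logged-out request that shares a suspicious fingerprint but does
# not trip any business-logic rule is allowed through.
fingerprint_rules: List[Rule] = [lambda r: r.fingerprint == "suspicious-client-combo"]
business_rules: List[Rule] = [lambda r: not r.authenticated and r.path.startswith("/search")]

req = Request(fingerprint="suspicious-client-combo", path="/torvalds/linux", authenticated=False)
print(should_block(req, fingerprint_rules, business_rules))  # False: fingerprint alone is not enough
```

Scaled against total traffic, the requests that matched both criteria worked out to roughly 3-4 incorrectly blocked requests per 100,000, the 0.003-0.004% figure shown above.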
Tracing through the stack

The investigation itself highlighted why these issues can persist. When users reported errors, we traced requests across multiple layers of infrastructure to identify where the blocks occurred. To understand why this tracing is necessary, it helps to see how protection mechanisms are applied throughout our infrastructure.

We've built a custom, multi-layered protection infrastructure tailored to GitHub's unique operational requirements and scale, building upon the flexibility and extensibility of open-source projects like HAProxy. At a high level, requests flow through several defense layers (we keep this description simplified to avoid disclosing specific defense mechanisms and to keep the concepts broadly applicable). Each layer has legitimate reasons to rate-limit or block requests. During an incident, a protection might be added at any of these layers depending on where the abuse is best mitigated and what controls are fastest to deploy.

The challenge: when a request gets blocked, tracing which layer made that decision requires correlating logs across multiple systems, each with different schemas. In this case, we started with user reports and worked backward:

1. User reports provided timestamps and approximate behavior patterns.
2. Edge tier logs showed the requests reaching our infrastructure.
3. Application tier logs revealed 429 "Too Many Requests" responses.
4. Protection rule analysis ultimately identified which rules matched these requests.

The investigation took us from external reports to distri
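To illustrate what correlating logs across systems with different schemas can look like, here is a minimal sketch that normalizes per-layer log entries into one shape keyed by a shared request identifier, then walks the layers in traversal order to find where the block was decided. The layer names, field names, and rule name are hypothetical assumptions for this sketch, not GitHub's actual log schemas.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional

# Order in which a request traverses the (simplified) stack.
LAYER_ORDER = ["edge", "application"]

@dataclass
class NormalizedEvent:
    layer: str               # which tier produced the log line
    request_id: str          # shared identifier used to join across systems
    status: Optional[int]    # HTTP status recorded by this tier, if any
    rule: Optional[str]      # protection rule that fired at this tier, if any

def normalize_edge(entry: Dict) -> NormalizedEvent:
    # Edge-tier logs (e.g. an HAProxy-based layer) use one set of key names...
    return NormalizedEvent("edge", entry["request_id"],
                           entry.get("http_status"), entry.get("deny_rule"))

def normalize_application(entry: Dict) -> NormalizedEvent:
    # ...while application-tier logs use different keys for the same facts.
    return NormalizedEvent("application", entry["rid"],
                           entry.get("response_code"), entry.get("rate_limit_rule"))

def find_blocking_layer(events: Iterable[NormalizedEvent]) -> Optional[NormalizedEvent]:
    """Walk the layers in traversal order; the first event naming a rule is the
    layer that actually made the blocking decision."""
    by_layer = {e.layer: e for e in events}
    for layer in LAYER_ORDER:
        event = by_layer.get(layer)
        if event and event.rule is not None:
            return event
    return None

# Example: the edge tier only proxied the 429 back, while the application tier
# both returned 429 and recorded the (hypothetical) rule that matched.
events: List[NormalizedEvent] = [
    normalize_edge({"request_id": "abc123", "http_status": 429}),
    normalize_application({"rid": "abc123", "response_code": 429,
                           "rate_limit_rule": "legacy-abuse-mitigation-7"}),
]
print(find_blocking_layer(events))
# NormalizedEvent(layer='application', request_id='abc123', status=429, rule='legacy-abuse-mitigation-7')
```

The point of the normalization step is that each tier can keep its own log schema while investigators still get a single joinable view of one request's path through the stack.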

