Friday Night, Lights Out – Deconstructing 20 Years of Infrastructure Failure

ID: FN-2023-07
STATUS: HARDENED
INTERVENTION: High-Concurrency Scoreboard Stabilization
OBJECTIVE: Uptime Reliability & Overhead Liquidation
OUTCOME: 0.00% Downtime | 75% Cost Reduction

It was Friday night, and the site had crashed for the 20th year in a row. I was the guy responsible. I could have blamed the website hosting provider, but being the expert means owning the problem until it’s solved. They hired me to be the one to fix it, and I took the job knowing I could.

I spent the next 10 days in the logs tracking down this Infrastructure Failure. Here are my findings:

The Context: “The Fragile State”

The Symptom

High school football season coverage on a major news outlet crashes under high traffic, with requests bottlenecking until the site returns 504 Gateway Timeout errors.

The History

20 years of Agency-level patches. Multiple experts failed to solve the root-cause SQL query death-spiral.

The Business Consequence

Lost trust among local sports fans, lost ad revenue, and internal team burnout.

The Diagnosis: “The Logic Leak”

The Discovery

The scoreboard fetched its data with standard PHP calls. On the client side, the scoreboard page requested every game, with all of its information, on every refresh, and it refreshed every 5 seconds.

30,000 concurrent users were each polling the database every 5 seconds.
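
To put that polling rate in perspective, the back-of-envelope arithmetic can be sketched as follows. The 30,000 users and the 5-second interval come from the findings above. The function name is illustrative, and the calculation assumes every poll reaches the database uncached, which is a simplification.

```python
def requests_per_second(concurrent_users: int, poll_interval_s: float) -> float:
    """Average request rate if every user polls once per interval."""
    return concurrent_users / poll_interval_s


# Figures from the diagnosis: 30,000 users, one poll every 5 seconds.
rate = requests_per_second(30_000, 5)
print(f"{rate:,.0f} requests per second")       # 6,000 requests per second
print(f"{rate * 60:,.0f} requests per minute")  # 360,000 requests per minute
```

Each of those requests asked for every game that night, so the database was rebuilding the same full result set thousands of times per second.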

The Decision

Extracted the Scoreboard Logic into a standalone, hardened engine.

The Intervention: “The Hardening”

Decoupling

Deployed the Scoreboard functionality on its own server during football season. This cut hosting costs by 75% and cleanly separated concerns, allowing news stories to load flawlessly, even during the 4th-quarter rush.

API Optimization

Switched from standard PHP cURL calls to WordPress-optimized WPDB requests, and then simplified those to grab only what had changed in the seconds since the last call.
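
The "only what changed" idea can be sketched as a parameterized query that filters on a last-updated timestamp. This is a minimal illustration in Python, with the standard-library sqlite3 module standing in for the WordPress database layer. The table name, column names, and sample rows are hypothetical and are not taken from the actual scoreboard schema. In WordPress itself, the comparable safeguard would be building the query with `$wpdb->prepare()`.

```python
import sqlite3


def fetch_changed_scores(conn, since_ts):
    """Return only the games whose score changed after since_ts (Unix seconds)."""
    cur = conn.execute(
        "SELECT game_id, home_score, away_score, updated_at "
        "FROM scores WHERE updated_at > ? ORDER BY updated_at",
        (since_ts,),  # bound parameter, never string-interpolated
    )
    return cur.fetchall()


# Demo with an in-memory database and made-up rows (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE scores (game_id INTEGER PRIMARY KEY, home_score INTEGER, "
    "away_score INTEGER, updated_at INTEGER)"
)
conn.executemany(
    "INSERT INTO scores VALUES (?, ?, ?, ?)",
    [(1, 7, 0, 100), (2, 14, 14, 103), (3, 0, 3, 108)],
)

last_poll = 102  # timestamp of the client's previous request
changed = fetch_changed_scores(conn, last_poll)
print(changed)  # [(2, 14, 14, 103), (3, 0, 3, 108)]
```

Game 1 has not changed since the previous poll, so it is left out of the response. On a quiet stretch of a game night, most polls would return little or nothing.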

Stress Testing

Simulated 45,000 concurrent users via Swarm. I thought I had made an error when the Scoreboard's CPU and RAM availability barely dipped, but the test was configured correctly, and the Scoreboard proved itself on game night.

Process        | Before                  | After
DB calls       | PHP cURL                | WPDB sanitized requests
Data requested | All games that night    | Only scores that changed
Tables stored  | Every game for 20 years | Last 2 years, the rest archived
Refresh method | Entire frame            | Targeted Ajax
Hosting        | AWS                     | Cloudways serving Linode
The main steps that made the Scoreboard Page ready for game night
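
The move from refreshing the entire frame to a targeted refresh can be sketched as a merge step: the client keeps the board it already has and redraws only the games named in the latest response. This is an illustrative Python sketch, not the production JavaScript. The function name and the record fields are assumptions.

```python
def apply_score_updates(board, updates):
    """Merge changed games into the displayed board; return the ids that need redrawing."""
    redrawn = []
    for game in updates:
        game_id = game["game_id"]
        if board.get(game_id) != game:  # skip rows that are already current
            board[game_id] = game
            redrawn.append(game_id)
    return redrawn


# The board the client is already showing (made-up data).
board = {
    1: {"game_id": 1, "home": 7, "away": 0},
    2: {"game_id": 2, "home": 14, "away": 7},
}
# The latest response contains only the game that changed.
updates = [{"game_id": 2, "home": 14, "away": 14}]

print(apply_score_updates(board, updates))  # [2]
print(board[1])  # {'game_id': 1, 'home': 7, 'away': 0}
```

Only game 2 is touched. Game 1 stays on screen exactly as it was, with no re-fetch and no re-render.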

I had long requested that the website be moved to its own server, but every previous provider ignored that recommendation, resulting in the same failures year after year. Joshua not only listened but took the time to understand my reasoning, test the idea, and implement it. He was the first developer in over two decades to take that step.

This season our website has supported all traffic without a single failure. I believe that outcome reflects both Joshua’s expertise and the collaborative spirit he brings to every project.

Tim Loughry, VP, WVRC Digital

Facing Infrastructure Failure like this? Schedule an Operational Friction Audit