In November 2023, a major Brazilian retailer lost approximately R$ 4 million in sales in less than two hours. The reason? The checkout system couldn't handle the Black Friday traffic spike and simply went down. While the IT team scrambled to restore the service, competitors were converting customers who had migrated to other sites. It wasn't a product, pricing, or marketing problem. It was an architecture problem.
This scenario repeats itself every year, across companies of all sizes. And the most frustrating part is that it's almost always preventable. Black Friday is not a surprise. The date is known months in advance. The traffic volume is predictable within a reasonable range. And yet, systems crash, carts disappear, payments freeze. Why? Because most companies treat resilience as a future project, not as a fundamental property of the architecture.
In this article, I'll share the strategies I apply with my clients — including financial sector companies such as BTG, XP, and Inter — to ensure their systems not only survive Black Friday, but perform at their best throughout it.
The problem isn't the traffic. It's the architecture that wasn't built to handle variation
When a system crashes during traffic spikes, the instinct is to blame the volume. "We received 10x more traffic than usual." But this explanation hides the real problem: an architecture that was designed for a stable state and was never tested for variation.
Monolithic systems, databases without a distributed read strategy, synchronous queues, and cascading dependencies are the most common culprits. When one part of the system becomes overloaded, it doesn't just degrade — it brings everything around it down. This is the domino effect of poorly planned architecture.
The first question I ask when evaluating a system before a major event is: where are the single points of failure? Every system has critical points. The difference between a resilient system and a fragile one is whether those points have been identified, isolated, and protected with redundancy and fallback mechanisms.
A client in the loyalty sector, who processes millions of points daily, came to me with exactly this problem. The system worked perfectly under normal conditions, but any campaign spike would take down the balance inquiry service. The solution wasn't simply to increase server capacity — it was to redesign the flow so that balance inquiries became independent from transaction processing, with proper cache layers and circuit breakers.
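To make the pattern concrete, here is a minimal Python sketch of that read path, assuming a Redis-compatible cache; the get_balance_from_db stub is a hypothetical stand-in for the transactional database call. It illustrates the idea, not the client's actual code.

```python
import time
import redis  # assumes the redis-py client is available

cache = redis.Redis(host="localhost", port=6379, decode_responses=True)

def get_balance_from_db(customer_id: str) -> int:
    # Placeholder for the real transactional query: imagine a slow, expensive call here.
    raise NotImplementedError("replace with the real database query")

class CircuitBreaker:
    """Opens after consecutive failures and rejects calls until a cooldown passes."""

    def __init__(self, max_failures: int = 5, reset_after: float = 30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        if time.time() - self.opened_at > self.reset_after:
            # Half-open: let one call through to probe whether the dependency recovered.
            self.opened_at = None
            self.failures = 0
            return True
        return False

    def record_success(self):
        self.failures = 0

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.max_failures:
            self.opened_at = time.time()

breaker = CircuitBreaker()

def get_balance(customer_id: str) -> int:
    """Serve balance reads from cache first; hit the database only behind the breaker."""
    cached = cache.get(f"balance:{customer_id}")
    if cached is not None:
        return int(cached)
    if not breaker.allow():
        raise RuntimeError("balance service degraded: serve a friendly fallback instead")
    try:
        balance = get_balance_from_db(customer_id)
        breaker.record_success()
        cache.set(f"balance:{customer_id}", balance, ex=60)  # short TTL keeps data reasonably fresh
        return balance
    except Exception:
        breaker.record_failure()
        raise
```

The point of the sketch is the separation of concerns: a spike in balance inquiries is absorbed by the cache, and a failing database degrades gracefully instead of dragging the whole flow down with it.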
Scalability is not synonymous with "adding more servers"
There is a very common confusion in the industry between scalability and raw capacity. Scaling horizontally on AWS doesn't simply mean spinning up more EC2 instances. It means designing the system so that each component can grow independently, without creating bottlenecks elsewhere.
In practice, a resilient architecture for spikes like Black Friday needs at least four well-resolved layers:
- Entry layer: use of a CDN (such as CloudFront) to absorb static traffic and reduce the load on application servers. In some cases, up to 60% of traffic can be served directly from the edge, without touching your backend infrastructure.
- Application layer: decoupled services with auto scaling configured using policies based on real business metrics — not just CPU, but response latency, queue depth, and error rates (a sketch of such a policy follows this list).
- Data layer: separation between reads and writes, use of read replicas in RDS or Aurora, and a caching strategy with ElastiCache for high-frequency, low-variability queries.
- Integration layer: use of asynchronous queues (SQS, SNS, EventBridge) to decouple critical processes such as order confirmation, inventory updates, and communication with payment systems.
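As an illustration of the application-layer point above, the snippet below is a minimal boto3 sketch of a target tracking policy driven by a business metric. The Auto Scaling group name, the namespace, and the BacklogPerInstance metric (queue depth divided by instance count, published to CloudWatch by the application) are assumptions made for the example.

```python
import boto3

autoscaling = boto3.client("autoscaling")

# Target tracking on a custom "backlog per instance" metric instead of raw CPU.
# Names here are illustrative; the application is assumed to publish the metric.
autoscaling.put_scaling_policy(
    AutoScalingGroupName="checkout-workers",        # hypothetical ASG name
    PolicyName="scale-on-queue-backlog",
    PolicyType="TargetTrackingScaling",
    TargetTrackingConfiguration={
        "CustomizedMetricSpecification": {
            "MetricName": "BacklogPerInstance",
            "Namespace": "Ecommerce/Checkout",      # hypothetical namespace
            "Statistic": "Average",
        },
        # Keep roughly 100 pending messages per worker; the right number comes from load tests.
        "TargetValue": 100.0,
    },
)
```

The value of scaling on backlog rather than CPU is that it tracks what the business actually feels: how long an order waits before being processed, not how busy the machines look.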
This separation seems obvious on paper, but in practice most systems I evaluate have at least two of these layers coupled in a way that causes one to degrade the other under pressure. The checkout directly depends on the real-time inventory system. The inventory system queries the primary database on every request. The primary database lacks proper indexes for the volume of simultaneous reads. And when checkout freezes, no one knows exactly why.
The role of chaos engineering and load testing
There is no resilience without testing. This is one of the simplest and most ignored statements in platform engineering. Companies spend months building redundant architectures and never validate whether the redundancy actually works when it's needed.
Chaos engineering — popularized by Netflix with Chaos Monkey — starts from a straightforward principle: if you don't introduce controlled failures into your system before the production environment does it for you, you're operating with a false sense of security. On AWS, the Fault Injection Simulator (FIS) service allows you to simulate instance failures, network latency, availability zone failures, and other degradation scenarios in a controlled manner.
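As a sketch of what such an experiment can look like, the boto3 snippet below defines an FIS template that stops a portion of a tagged fleet and aborts automatically if an error-rate alarm fires. The role ARN, tags, alarm, and percentages are placeholders, and the actions you choose should mirror the failures you actually fear.

```python
import uuid
import boto3

fis = boto3.client("fis")

# Stops 30% of the instances tagged service=checkout, restarts them after 10 minutes,
# and aborts the experiment if the error-rate alarm goes off. ARNs are placeholders.
fis.create_experiment_template(
    clientToken=str(uuid.uuid4()),
    description="Black Friday rehearsal: lose part of the checkout fleet",
    roleArn="arn:aws:iam::123456789012:role/fis-experiment-role",
    targets={
        "checkout-instances": {
            "resourceType": "aws:ec2:instance",
            "resourceTags": {"service": "checkout"},
            "selectionMode": "PERCENT(30)",
        }
    },
    actions={
        "stop-instances": {
            "actionId": "aws:ec2:stop-instances",
            "parameters": {"startInstancesAfterDuration": "PT10M"},
            "targets": {"Instances": "checkout-instances"},
        }
    },
    stopConditions=[
        {
            "source": "aws:cloudwatch:alarm",
            "value": "arn:aws:cloudwatch:us-east-1:123456789012:alarm:checkout-error-rate",
        }
    ],
)
```

Note the stop condition: a chaos experiment without a clearly defined abort criterion is just an outage you scheduled yourself.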
With clients in the financial sector, where availability is not just a user experience metric but a regulatory obligation, chaos testing is part of the development cycle. It's not optional. Before any high-volume event — whether a Black Friday, an IPO, or a large-scale campaign — we run failure scenarios to validate that automatic recovery mechanisms work within the defined SLAs.
Load tests, in turn, need to be more sophisticated than simply firing a thousand simultaneous requests and seeing what happens. Black Friday traffic has a specific profile: there is gradual growth in the hours leading up to it, an abrupt spike at the stroke of midnight or at the start of flash promotions, and user behavior that differs from everyday patterns — more price comparison, more cart abandonment, more payment attempts. Simulating this realistic profile makes all the difference in the quality of the test.
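One way to encode that profile, assuming Locust as the load-testing tool (the argument holds for any tool), is a custom load shape: a gradual ramp, an abrupt jump at "midnight", and a sustained plateau. The endpoints, durations, and user counts below are illustrative.

```python
from locust import HttpUser, LoadTestShape, between, task

class Shopper(HttpUser):
    wait_time = between(1, 5)

    @task(3)
    def browse_product(self):
        self.client.get("/products/123")   # illustrative endpoint: heavy price comparison

    @task(1)
    def checkout(self):
        self.client.post("/checkout", json={"cart_id": "abc"})  # illustrative payload

class BlackFridayShape(LoadTestShape):
    """Compressed Black Friday profile: slow ramp, midnight spike, sustained plateau."""

    # (end of stage in seconds, target users, spawn rate per second)
    stages = [
        (300, 200, 10),     # first 5 minutes: gradual growth
        (360, 2000, 200),   # the "midnight" spike: jump to 2,000 users in one minute
        (900, 1200, 50),    # settle into a high, sustained plateau
    ]

    def tick(self):
        run_time = self.get_run_time()
        for end, users, spawn_rate in self.stages:
            if run_time < end:
                return users, spawn_rate
        return None  # stop the test after the last stage
```

The task weights matter as much as the shape: three browsing requests for every checkout attempt is closer to real Black Friday behavior than a uniform hammering of a single endpoint.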
Disaster recovery is not backup. It's a strategy with defined RTO and RPO
One of the most costly misconceptions I see in mid-sized companies is confusing backup with disaster recovery. Backup is a copy of your data. Disaster recovery is the ability to resume operations within an acceptable timeframe and with an acceptable level of data loss for the business.
Two concepts are fundamental here: RTO (Recovery Time Objective) — how long you can afford to be offline before the impact becomes unacceptable — and RPO (Recovery Point Objective) — how much data, measured in time, you can afford to lose without serious consequences. For an e-commerce platform on Black Friday, an RTO of 4 hours can represent tens of millions in lost revenue. For a bank, an RPO of 1 hour can have serious regulatory implications.
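A back-of-envelope calculation makes the trade-off tangible. The figures below are purely hypothetical and should be replaced with your own revenue and order profile.

```python
# Purely illustrative numbers; plug in your own peak-hour figures.
peak_revenue_per_hour = 2_000_000      # R$ per hour at the height of Black Friday
orders_per_hour = 15_000               # completed orders per hour at the peak

rto_hours = 4        # tolerated downtime before recovery completes
rpo_minutes = 60     # tolerated window of lost data

revenue_at_risk = peak_revenue_per_hour * rto_hours
orders_at_risk = orders_per_hour * (rpo_minutes / 60)

print(f"Revenue exposed within the RTO window: R$ {revenue_at_risk:,.0f}")
print(f"Orders potentially lost within the RPO window: {orders_at_risk:,.0f}")
```

Numbers like these, even rough ones, are what turn an abstract DR discussion into a budget decision the business can actually evaluate.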
AWS offers different DR strategies with clear trade-offs between cost and recovery speed: Backup and Restore (cheapest, slowest), Pilot Light (minimal infrastructure always running), Warm Standby (reduced capacity always available), and Multi-Site Active/Active (maximum availability, highest cost). The right choice depends on the criticality profile of each service — and most companies don't make this segmentation. They treat everything as if it had the same level of criticality, which results in either spending more than necessary or being unprotected where it matters most.
Observability: you can't protect what you can't see
During a traffic spike, the speed of problem detection is just as important as the speed of recovery. A system that fails silently — with no alerts, no dashboards, no log correlation — can take 20 or 30 minutes before anyone notices the problem. On Black Friday, those minutes are gold.
Observability is not just monitoring. It's the ability to understand the internal state of the system from the signals it emits: metrics, logs, and distributed traces. The three-pillar model — supported in the AWS ecosystem by CloudWatch, X-Ray, and the AWS Distro for OpenTelemetry — allows you not only to know that something is wrong, but to understand where and why.
A practical example: in a payments system I worked with, the average response time was within the SLA, but the error rate was gradually rising. Without granular observability, this could have gone unnoticed until it became a crisis. With distributed traces, we identified in under three minutes that the problem was in an external fraud prevention dependency that was degrading under load — not in the main system. The fix was to activate a fallback that was already built into the architecture, but had never been triggered before.
Operational dashboards for high-volume events should display, in real time: error rate per service, latency by percentile (p95, p99), transaction throughput, asynchronous queue depth, and the status of each availability zone. Any significant deviation needs to generate proactive alerts, not reactive ones.
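As one example of turning that into a proactive alert, here is a minimal boto3 sketch of a p99 latency alarm on an Application Load Balancer. The load balancer dimension, threshold, and SNS topic are placeholders to adapt to your own environment.

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Fires when checkout p99 latency stays above 800 ms for three consecutive minutes.
cloudwatch.put_metric_alarm(
    AlarmName="checkout-p99-latency",
    Namespace="AWS/ApplicationELB",
    MetricName="TargetResponseTime",
    Dimensions=[{"Name": "LoadBalancer", "Value": "app/checkout/abc123"}],  # placeholder
    ExtendedStatistic="p99",
    Period=60,
    EvaluationPeriods=3,
    Threshold=0.8,                      # seconds
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:oncall-alerts"],  # placeholder topic
)
```

Alarming on p99 rather than the average is deliberate: averages hide exactly the kind of gradual degradation described in the payments example above.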
The checklist I use before any major event
After more than 20 years working with systems architecture and having closely followed dozens of Black Fridays, IPOs, and high-volume campaigns, I've consolidated an operational checklist that I apply with my clients. It's not exhaustive, but it covers the highest-risk areas:
- Are all auto scaling groups tested with reviewed capacity limits for the expected peak?
- Does the database have active read replicas and is the connection pool sized for the expected peak?
- Is the cache warmed up with the most frequently accessed data? (A warm-up sketch appears after this checklist.)
- Are circuit breakers configured on all external integrations?
- Have load tests been run with a realistic traffic profile within the last two weeks?
- Is the incident runbook up to date, and does the on-call team know exactly what to do in each failure scenario?
- Are rate limiting and DDoS protection limits configured in WAF and CloudFront?
- Has the quick rollback process been tested for the most recently deployed versions?
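On the cache warm-up item, this is a minimal sketch of what that can look like against a Redis-compatible ElastiCache endpoint. The endpoint, TTL, and the fetch_top_products helper are hypothetical and stand in for a real catalog query.

```python
import json
import redis

# Placeholder endpoint; with ElastiCache the client normally runs inside the same VPC.
cache = redis.Redis(host="my-cluster.cache.amazonaws.com", port=6379)

def warm_cache(products):
    """Pre-load the catalog entries most likely to be hammered at midnight."""
    pipe = cache.pipeline()
    for product in products:
        # One-hour TTL: long enough to absorb the opening spike, short enough to stay fresh.
        pipe.set(f"product:{product['id']}", json.dumps(product), ex=3600)
    pipe.execute()

# Example usage; fetch_top_products is a hypothetical catalog query, e.g. the 5,000
# best sellers plus everything featured in the campaign:
# warm_cache(fetch_top_products(limit=5000))
```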
Each item on this checklist represents a failure I've seen happen in production. None of them are trivial to implement, but all are manageable when addressed with sufficient lead time.
Resilience is a business decision, not just a technical one
What bothers me most when discussing this topic with company leaders is the perception that resilience is an infrastructure cost, not a business investment. This view is misguided and expensive.
When you calculate the cost of one hour of downtime during Black Friday — lost revenue, recovery costs, brand reputation impact, potential customer churn — and compare it to the cost of a well-designed resilient architecture, the math always favors resilience. The problem is that the cost of failure appears on one line and the investment in prevention appears on another, and the people who approve budgets rarely connect the two.
The role of the CTO, CIO, and technology advisor is precisely to make that connection explicit. To translate technical risk into business risk. To show that the choice is not between spending or not spending, but between spending in a planned way now or spending in an emergency — and far more costly — way during a crisis.
Black Friday 2025 is on the horizon. The infrastructure that will support it needs to be designed, tested, and validated before then. If you're reading this article in April, there is still time. If you're reading it in October, time is running out. Either way, the question that needs to be answered is the same: do you know, with certainty, where your system will break under pressure? If the answer is no — or an unconvincing "probably not" — that is your starting point.