At 11:47 PM on a Friday, your bank's support team starts receiving complaints. The app is slow. Transfers are failing. Within 13 minutes, the volume of complaints on Twitter has tripled. At 12:02 AM, someone wakes up the CTO. At 12:15 AM, the incident is officially declared. The problem? It had started at 10:31 PM.

One hour and forty-four minutes of dissatisfied customers, failing operations, and eroding reputation, all because nobody knew something had broken until the phone rang.

This scenario is not hypothetical. It is the daily reality of a large portion of technology companies in Brazil, including some you would recognize by name. The difference between companies that operate with excellence and those that are constantly fighting fires is not just about code quality. It lies in the ability to know what is happening inside systems in real time — before the customer notices, before the business is impacted, before the problem becomes a crisis.

This has a name: observability.

Monitoring and observability are not the same thing

There is a common confusion in the market that needs to be cleared up before any conversation on the topic. System monitoring and observability are related concepts, but fundamentally different.

Monitoring is reactive by nature. You define known metrics, configure alerts for when they deviate from normal, and receive a notification when something exceeds a threshold. CPU above 90%? Alert. Response time greater than 2 seconds? Alert. It is useful, but has a structural limitation: you can only monitor what you already know might go wrong.

Observability goes further. The term, borrowed from the control theory of dynamical systems, describes the ability to infer the internal state of a system from its external outputs. In practical terms, it means you can ask questions you have never asked before about your system's behavior — and find answers. Not just "is the server up?", but "why are users in the Southern region experiencing 3x higher latency only on checkout requests between 7 PM and 9 PM?"

To reach this level, observability relies on three fundamental pillars: metrics, logs, and distributed tracing (traces). Together, they form what the SRE community calls the three pillars of observability. In isolation, each one tells only part of the story.

The three pillars in practice

Understanding each pillar in a practical way is the first step toward building a platform that explains itself.

Metrics are time series of numerical data. Request rate per second, error percentage, latency at the 99th percentile, memory usage. They are ideal for dashboards, alerts, and trend analysis. Tools like Prometheus and AWS CloudWatch are the reference tools in this category. The key caveat: metrics aggregate information. They tell you something is wrong, but rarely tell you exactly where or why.
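
To make this concrete, here is a minimal sketch of metric instrumentation using the open-source Prometheus Python client. The metric names, labels, and bucket boundaries are illustrative choices, not a prescribed convention.

```python
# Minimal metric instrumentation sketch with the Prometheus Python client.
# Metric names, labels, and buckets are illustrative, not a convention.
from prometheus_client import Counter, Histogram, start_http_server
import random
import time

# Request counter, labeled by endpoint and status class
REQUESTS = Counter(
    "app_requests_total",
    "Total HTTP requests handled",
    ["endpoint", "status"],
)

# Latency histogram; buckets chosen to make p99 behavior visible around the SLO
LATENCY = Histogram(
    "app_request_latency_seconds",
    "Request latency in seconds",
    ["endpoint"],
    buckets=[0.05, 0.1, 0.3, 0.5, 1.0, 2.0],
)

def handle_checkout():
    """Hypothetical request handler instrumented with both metrics."""
    with LATENCY.labels(endpoint="/checkout").time():
        time.sleep(random.uniform(0.05, 0.4))  # simulated work
        status = "5xx" if random.random() < 0.01 else "2xx"
        REQUESTS.labels(endpoint="/checkout", status=status).inc()

if __name__ == "__main__":
    start_http_server(8000)  # exposes /metrics for Prometheus to scrape
    while True:
        handle_checkout()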

Logs are detailed records of events. Every transaction, every error, every decision the system makes can be recorded. They are the source of truth for post-incident investigations. The problem is volume: modern systems generate gigabytes of logs per hour, and without proper structure and tooling, finding the relevant log amid the noise is like looking for a needle in a burning haystack.
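
The practical antidote to the needle-in-the-haystack problem is structure. The sketch below uses only the Python standard library to emit each log line as a JSON document; the field names are illustrative, and in production these lines would typically be shipped to a centralized platform for search and correlation.

```python
# Structured (JSON) logging sketch using only the standard library.
# Field names are illustrative; the point is that every log line becomes
# a searchable document rather than free text.
import json
import logging
import sys

class JsonFormatter(logging.Formatter):
    def format(self, record):
        entry = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Merge any structured context passed via `extra=`
        entry.update(getattr(record, "context", {}))
        return json.dumps(entry, ensure_ascii=False)

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("payments")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Example: a failed transfer logged with enough context to be found later
logger.error(
    "transfer failed",
    extra={"context": {"transfer_id": "tx-123", "amount_cents": 15000,
                       "reason": "insufficient_funds"}},
)
```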

Distributed traces are the most powerful pillar and the least implemented. In microservices architectures — which today dominate financial, e-commerce, and logistics systems in Brazil — a single user request can travel through dozens of different services. A trace follows that request from start to finish, showing exactly where time was spent, where errors occurred, and how services interacted. Tools like Jaeger, Zipkin, and AWS X-Ray are the reference tools here.
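
As an illustration of what that instrumentation looks like, here is a minimal tracing sketch using the OpenTelemetry Python SDK, which in a real deployment would export to a collector, Jaeger, Zipkin, or AWS X-Ray rather than the console; the service and span names are just for demonstration.

```python
# Minimal distributed tracing sketch with the OpenTelemetry Python SDK.
# The console exporter and the span names are for demonstration only.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor, ConsoleSpanExporter

provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("checkout-service")

def process_checkout(order_id: str):
    # The parent span covers the whole request...
    with tracer.start_as_current_span("checkout") as span:
        span.set_attribute("order.id", order_id)
        # ...and child spans show where the time actually goes.
        with tracer.start_as_current_span("reserve-inventory"):
            pass  # call to the inventory service would go here
        with tracer.start_as_current_span("charge-payment"):
            pass  # call to the payment service would go here

process_checkout("order-42")
```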

When the three pillars work together, with correlation between them, you can go from a performance degradation alert directly to the responsible line of code, tracing the exact sequence of calls that led to the problem. This transforms an investigation that takes hours into one that takes minutes.
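
One common way to achieve that correlation is to stamp every log line with the identifiers of the trace that produced it. The sketch below builds on the OpenTelemetry setup above and uses a hypothetical logging filter to do exactly that; with the IDs in place, jumping from a slow trace to its log lines becomes a single query.

```python
# Pillar correlation sketch: stamp every log line with the IDs of the trace
# that produced it, so an alert -> trace -> logs pivot is one query.
# Assumes the OpenTelemetry TracerProvider from the previous sketch is set.
import logging
from opentelemetry import trace

class TraceContextFilter(logging.Filter):
    """Attach the current trace/span IDs to every record passing through."""
    def filter(self, record):
        ctx = trace.get_current_span().get_span_context()
        record.trace_id = format(ctx.trace_id, "032x") if ctx.is_valid else "-"
        record.span_id = format(ctx.span_id, "016x") if ctx.is_valid else "-"
        return True

handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter(
    "%(asctime)s %(levelname)s trace=%(trace_id)s span=%(span_id)s %(message)s"))
handler.addFilter(TraceContextFilter())

logger = logging.getLogger("checkout")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

tracer = trace.get_tracer("checkout-service")
with tracer.start_as_current_span("charge-payment"):
    # This log line now carries the same trace_id your tracing backend shows.
    logger.info("payment authorized")
```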

The real cost of operating without observability

In 2023, the average cost of downtime for financial services companies was around R$ 1.5 million per hour, according to industry data. For e-commerce companies on dates like Black Friday, that number can be substantially higher. But the direct financial cost is only the surface.

There is the engineering cost: development teams that spend more time investigating incidents than building features. There is the regulatory cost: in Brazil, the Central Bank has increasingly specific requirements around availability and recovery time for financial systems — BACEN Resolution 4,658 and more recently the operational resilience guidelines make observability not just a best practice, but a regulatory necessity. And there is the hardest cost to quantify: the lost trust of a customer who tried to make a Pix transfer at 11 PM and the app froze.

In projects I have led at large financial institutions, the structured implementation of observability reduced MTTR (Mean Time to Recovery) — the average time to recover after an incident — by more than 70%. What used to take 4 hours to diagnose and fix started being resolved in under 45 minutes. Not because people became smarter. Because they finally had the right data, in the right place, at the right time.

How an SRE culture transforms observability into a competitive advantage

Observability is not just technology. It is an operational discipline, and this is where the concept of SRE (Site Reliability Engineering) comes in as a structuring framework.

Born at Google and now adopted by the world's leading technology companies, SRE treats reliability as a product feature, not as a vague infrastructure responsibility. Within this model, observability is the foundation upon which all reliability decisions are made.

Two SRE concepts are particularly relevant here:

  • SLIs (Service Level Indicators): the metrics that truly matter to the user — latency, availability, error rate, throughput.
  • SLOs (Service Level Objectives): the targets you commit to achieving for those metrics — for example, 99.9% of requests responded to in under 300ms.

Without observability, SLIs and SLOs are fiction. You have no way of knowing whether you are meeting your objectives if you cannot measure what is happening. With structured observability, they become real management instruments: you know exactly where your error budget stands, you can make data-driven decisions about where to invest in improvements, and you can have honest conversations with the business about what is technically feasible.
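
To see why measurement is the whole game, consider a back-of-the-envelope error-budget calculation for the SLO cited above (99.9% of requests under 300 ms). The request counts below are made up for illustration; in practice they would come from your metrics backend.

```python
# Back-of-the-envelope error-budget calculation for a 99.9% SLO.
# The request counts are invented for illustration.
SLO_TARGET = 0.999                    # 99.9% of requests must be "good"

total_requests = 48_000_000           # e.g. requests served this month
good_requests  = 47_976_500           # responded in < 300 ms, without error

allowed_bad  = total_requests * (1 - SLO_TARGET)   # the error budget
actual_bad   = total_requests - good_requests
budget_left  = 1 - actual_bad / allowed_bad        # fraction remaining

print(f"Error budget consumed:  {actual_bad / allowed_bad:.1%}")
print(f"Error budget remaining: {budget_left:.1%}")
```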

A practice I recommend for teams starting this journey: before implementing any tooling, define the five most important SLIs for your business. For a digital bank, these might be: Pix availability, login latency, transfer success rate, statement loading time, chat support availability. Everything else is secondary until you can measure these five with precision.
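
One way to make that exercise concrete is to write the SLIs down as data before any tooling exists, so they can later drive dashboards and alerts. The names, targets, windows, and definitions below are illustrative placeholders for the digital-bank example, not recommendations.

```python
# The "five SLIs first" exercise written down as plain data.
# Targets, windows, and definitions are illustrative placeholders.
CRITICAL_SLIS = [
    {"name": "pix_availability",          "target": 0.999,   "window": "30d",
     "definition": "successful Pix transfers / attempted Pix transfers"},
    {"name": "login_latency_p99",         "target": "800ms", "window": "30d",
     "definition": "99th percentile of login request duration"},
    {"name": "transfer_success_rate",     "target": 0.995,   "window": "30d",
     "definition": "transfers completed without error / transfers attempted"},
    {"name": "statement_load_time_p95",   "target": "1.5s",  "window": "30d",
     "definition": "95th percentile of statement load duration"},
    {"name": "support_chat_availability", "target": 0.99,    "window": "30d",
     "definition": "minutes chat is reachable / minutes in the period"},
]
```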

Where to start: a pragmatic roadmap

The question I hear most from CTOs and technical leaders when we discuss observability is: "where do I start without halting everything already in progress?" The answer is: start with what hurts the most.

A pragmatic roadmap for companies starting from zero in structured observability:

  • Phase 1 — Basic instrumentation (weeks 1-4): Implement system and application metrics collection on the most critical services. Define your SLIs. Set up dashboards for the indicators the business truly cares about. Entry-level tools: Prometheus + Grafana, or CloudWatch if you are on AWS.
  • Phase 2 — Log management (weeks 5-8): Centralize logs in a platform with search and correlation capabilities. Structure the logs of the main systems (structured JSON is the minimum). Define retention and access policies. OpenSearch, Datadog, or CloudWatch Logs Insights are good options in the AWS context.
  • Phase 3 — Distributed tracing (weeks 9-16): Instrument services with tracing. Start with the services on the critical user journey. Correlate traces with logs and metrics. This phase is the most labor-intensive, but it is where ROI becomes most visible.
  • Phase 4 — Intelligent alerts and incident response (ongoing): Build runbooks based on the data you now have. Configure alerts based on SLOs, not just static thresholds. Implement on-call with context — when someone wakes up at 3 AM, they should receive not just the alert, but the relevant dashboard and the response playbook.
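
To illustrate what "alerts based on SLOs, not just static thresholds" means in Phase 4, here is a sketch of a multi-window burn-rate check in the style popularized by the Google SRE books. The error_ratio function is a hypothetical stand-in for a query to your metrics backend, and the numbers are illustrative.

```python
# Sketch of an SLO burn-rate alert (multi-window), instead of a static threshold.
# `error_ratio` is a hypothetical stand-in for a metrics-backend query.
SLO_TARGET = 0.999
ERROR_BUDGET = 1 - SLO_TARGET            # 0.1% of requests may fail

def error_ratio(window: str) -> float:
    """Hypothetical: fraction of bad requests over the given window,
    normally queried from Prometheus/CloudWatch. Hard-coded for illustration."""
    return {"5m": 0.004, "1h": 0.0035}[window]

def burn_rate(window: str) -> float:
    # How many times faster than "sustainable" the error budget is being spent
    return error_ratio(window) / ERROR_BUDGET

# Page a human only if the budget is burning fast in BOTH windows,
# which filters out the short blips a static threshold would page on.
if burn_rate("1h") > 14.4 and burn_rate("5m") > 14.4:
    print("PAGE: error budget burning over 14x faster than sustainable")
else:
    print("OK: within budget burn tolerance")
```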

The most common mistake I see on this journey is jumping to sophisticated tools without having solved the basics. A state-of-the-art APM platform does not compensate for unstructured logs and the absence of defined SLOs. Foundation first, sophistication later.

Platform resilience starts with visibility

There is a principle that guides my work in platform resilience: you cannot protect what you cannot see.

Many companies invest heavily in redundancy, multi-region setups, circuit breakers, and chaos engineering — and all of these are legitimate and important practices. But none of them replace the ability to understand system behavior under real conditions, with real traffic, from real users.

Observability is the nervous system of your platform. Without it, you are operating in the dark, reacting to symptoms instead of treating causes, managing crises instead of preventing them. With it, you begin to have a continuous conversation with your systems — and the systems start telling you what they need before they fail.

In the projects I have led at companies such as B3, BTG, and other players in the Brazilian financial market, the pattern is consistent: organizations that invested in structured observability not only resolve incidents faster. They have fewer incidents. Because the visibility that enables failure detection also enables identifying degradations before they become failures — and fixing them silently, without the customer ever knowing there was a problem.

That is the ultimate goal. Not to be the best at managing crises. To be so good at preventing them that crises become rare.

The operational maturity of a technology company is not measured by what it does when the system goes down. It is measured by how often the system goes down — and that frequency is, in large part, a direct consequence of its ability to see what is happening before it is too late.

If you are a CTO, CIO, or founder of a company that still operates on the model of "we wait for the customer to complain to know something broke," this is the moment to change. Not because it is a market best practice. Because the competition is getting better at this, customers are becoming more demanding, and the tolerance window for instability is closing.

If you want to understand how to structure observability and platform resilience in a pragmatic way within the context of your business, get in touch with me. I have been helping companies transform reactive operations into self-monitoring platforms — and the results show up in weeks, not years.