---
title: "Taking Back Control: Observability as Reclaiming Agency"
authorIds: ["nimisha"]
date: 2026-01-01
draft: false
featured: false
weight: 1
---

The story of every growing system is fundamentally a story about losing and
regaining control.

## Act I: The Illusion of Control

When the product is still in its inception and the team consists of a couple of
engineers, perhaps with a hands-on CTO who is still programming and contributing
to the systems, the team is in tight control of what is happening. The context
is small enough that everyone knows every endpoint, every query, and every
integration that exists.

Most likely there is no environment segregation, and everything that ships moves
directly from an engineer's development machine to production. If something
breaks, engineers know where and what is breaking, reproduce it locally, and fix
it quickly. Everyone knows what changed and who changed it; breaking production
and quickly fixing it has become the way the team works. This is what control
looks like, and it's natural, because the system still fits within an engineer's
cognitive capacity.

This creates a dangerous illusion: you believe control comes from your knowledge
of the code. But it doesn't. Control comes from the context being small and the
communication being tight.

## Act II: The Gradual Loss

Control doesn't disappear suddenly. It erodes gradually, almost imperceptibly:

### Week 1

It's a two-person engineering team. You have to optimize some API response
times, and as the default strategy for faster reads, the team adds a caching
layer. Now there is a new thing to worry about: which requests are served from
the cache and which from the database. The team has to figure out the right
cache eviction strategy and when to update entries.

### Month 2

The product is doing well, there is good traction in the market, and response
time is now critical for you. You quickly spot an operation in the
customer-facing flow that can be moved async; the team jumps on it and slashes
response time by 30 percent. Everyone is happy, and engineering earns
recognition for the short turnaround. It was a big win. But you can no longer
trace a request end to end easily, and that concern is pushed off for later.
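
Moving work off the request path usually looks something like this: enqueue the
slow operation and acknowledge immediately. A sketch using an in-process queue
and thread; in production this would be a broker like RabbitMQ or SQS, and the
function names here are made up:

```python
import queue
import threading

# Hypothetical work queue; a real system would use a message broker.
tasks: queue.Queue = queue.Queue()
sent = []  # records side effects, standing in for "emails actually sent"

def send_confirmation_email(order_id: str) -> None:
    sent.append(order_id)  # stand-in for the slow side effect

def worker() -> None:
    while True:
        order_id = tasks.get()
        if order_id is None:  # shutdown sentinel
            break
        send_confirmation_email(order_id)
        tasks.task_done()

threading.Thread(target=worker, daemon=True).start()

def handle_checkout(order_id: str) -> str:
    tasks.put(order_id)    # enqueue instead of doing the work inline
    return "202 Accepted"  # respond immediately; work happens in the background
```

The latency win is real, but the request's story now spans two execution
contexts, which is exactly why end-to-end tracing got harder.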

### Month 4

The product team wants to make data-based decisions, and the roadmap is evolving
toward experimentation, A/B testing, and more mature data-driven product
decisions. For now the engineering team adds a second database to unblock the
analytics team. Data now lives in two places and consistency is questionable,
but that question is swept under the carpet until it breaks.

### Month 9

The company has grown, and the engineering team is around 20 people now. A
single codebase is getting hard to manage, and as it grows, early architecture
choices are becoming a bottleneck. There are incidents every now and then, but
the product has good traction and customers really love it. That keeps fueling
the team's energy; everyone stays close to the customer and tries to solve for
them. The CTO makes the call to rearchitect the system while it is still easy to
change, to enable the next phase of growth. The tech team decides to go the
microservices route and identifies 5 services, 3 databases, and message queues
across multiple deployment regions.

### Month 12

The team jumps in. Product development is halted for the overhaul, and the goal
is achieved within 8 weeks, with the team copy-pasting most things because the
timeline is tight. The result goes to staging and is put under test. A lot of
bugs and issues are reported, but engineers figure them out and push fixes
quickly, mostly by reading logs on a single staging machine.

Then it goes to production, and the reality is completely different from what
leadership imagined. Uptime is a complex metric to compute with so many systems
in place. It's a distributed environment now. Lots of issues keep coming up
while the team wants to maintain its delivery speed. No engineer has all the
context, which was normal six months back. Deployments break, contracts break
often, because a task that was simple some time back now depends on other teams'
context and availability.

Incidents have become scarier, because you have grown on the business front and
your customers expect more maturity from you. But every incident is like running
in the dark: most of the time the first signal is an escalation from your
operations team or your customers. The team is all reactive, running manual
checks against production APIs to figure out whether the systems are working
fine. More customers also means more escalations, and most of your engineering
bandwidth goes into fixing these issues. Engineers are pulling all-nighters, and
things still look out of control.

## Act III: The False Solutions

Teams try different approaches to regain control:

**Let's document everything**: You write architecture diagrams, API specs,
and deployment runbooks. They're outdated within weeks. Documentation describes
what the system should be, not what it is.

**Let's hire tech operations**: A dedicated team keeps an extra manual eye on
your systems, from cloud dashboards to customer tickets, and reports issues back
to engineering. This creates the false illusion that you'll figure out the issue
before your customers do.

**Let's be more careful**: You add approval gates, slow down deployments, and
review every change. You've gained caution, not control. You're still operating
blind, just more slowly.

**Let's assign experts**: You designate "the person who understands the auth
service" and "the person who knows the database." You've created knowledge
silos. When they're unavailable, you're stuck.

None of these are bad practices. But none of them restores your ability to
understand what the system is actually doing right now.

## Act IV: Understanding Control

Here's the realization: you can't control what you can't see. Control isn't
about preventing all problems; that's impossible in complex systems. Control
means:

1. Knowing what's happening: when something goes wrong, you can see what
   actually failed, not a guess.
2. Understanding why: you can trace cause and effect across the system.
3. Predicting what's next: you can see trends before they become crises.
4. Acting with confidence: you can make changes knowing you'll see their actual
   impact.
5. A reduced bus factor: you are not dependent on just a couple of engineers to
   really understand what's going wrong.

## Act V: Observability as Agency

Observability is about taking back control. But it's not about returning to the
simplicity you had; it's about scaling your understanding to match your system's
complexity.

![](/images/blogs/taking-back-control/oneliner.png)

### Scenario: Regaining Control of Deployment

With observability: You deploy a change. You immediately see:

- Request latency for the affected endpoints: no change
- Error rates: flat at baseline
- Database query times: unchanged
- Resource utilization: normal
- User-facing metrics: stable

You know it's working. You're in control.
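
That kind of post-deploy verification can even be automated as a comparison
against a pre-deploy baseline. A sketch with invented metric names, values, and
tolerance:

```python
# Hypothetical pre-deploy baseline for a few key metrics.
BASELINE = {"p95_latency_ms": 180, "error_rate": 0.002, "db_query_ms": 12}

def check_deploy(current: dict, tolerance: float = 0.10) -> list:
    """Return the metrics that worsened by more than `tolerance` (10% here).

    An empty list means the deploy looks healthy against the baseline.
    """
    regressions = []
    for name, baseline in BASELINE.items():
        if current[name] > baseline * (1 + tolerance):
            regressions.append(name)
    return regressions
```

A non-empty result would trigger an alert or an automatic rollback; the point is
that "is it working?" becomes a measured answer rather than a hope.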

### Scenario: Regaining Control of Incidents

Without observability: A user reports checkout is broken. You check the checkout
service; it's running. You check the logs; no errors. You check the database;
it's up. You're 20 minutes in and still don't know what's wrong. The user is
waiting. Pressure mounts.

With observability: A user reports checkout is broken. You pull up their trace.
You see:

- The checkout service called the payment gateway
- Payment gateway response time: 30 seconds (normally 200ms)
- The gateway returned 503
- The problem started 15 minutes ago, affecting 12% of checkouts
- Only users in the EU region are affected

You understand the problem in 60 seconds. You're in control of the investigation.
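
The trace that answers those questions is just a tree of timed spans with parent
links. A deliberately minimal, hand-rolled sketch; real systems use
OpenTelemetry or similar rather than anything like this:

```python
import time

# Minimal hand-rolled tracing sketch; real systems use OpenTelemetry.
spans = []       # finished spans, appended as they close
_current = None  # the span we are currently inside

class Span:
    """Context manager recording name, parent, and duration of an operation."""
    def __init__(self, name: str):
        self.name = name

    def __enter__(self):
        global _current
        self.parent = _current
        _current = self
        self.start = time.monotonic()
        return self

    def __exit__(self, *exc):
        global _current
        spans.append({
            "name": self.name,
            "parent": self.parent.name if self.parent else None,
            "duration_ms": (time.monotonic() - self.start) * 1000,
        })
        _current = self.parent

with Span("checkout"):
    with Span("payment_gateway_call"):
        time.sleep(0.01)  # the slow gateway call would be timed here
```

Because every span records its parent and duration, "the payment gateway call
inside checkout took 30 seconds" falls straight out of the data.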

### Scenario: Regaining Control of Growth

Without observability: Your system has been running fine for months. Suddenly,
during a peak traffic event, everything falls apart. The database maxes out.
Services crash. You spend hours firefighting. The post-mortem question: "How did
we not see this coming?"

With observability: You see database connection usage trending upward over 8
weeks: 40% → 55% → 70% → 85%. Traffic projections suggest you'll hit 100% in two
weeks, during your product launch. You proactively optimize connection usage and
plan for scaling. The launch goes smoothly.
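
The projection in that scenario is nothing fancier than a linear fit over the
sampled usage. A sketch assuming the four readings were taken two weeks apart
(the cadence is an assumption; the percentages are the ones above):

```python
# Connection-pool usage sampled every two weeks: 40%, 55%, 70%, 85%.
weeks = [2, 4, 6, 8]
usage = [40.0, 55.0, 70.0, 85.0]

# Least-squares slope and intercept in plain Python (numpy.polyfit also works).
n = len(weeks)
mean_x = sum(weeks) / n
mean_y = sum(usage) / n
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(weeks, usage))
         / sum((x - mean_x) ** 2 for x in weeks))
intercept = mean_y - slope * mean_x

# Solve usage(t) = 100 for t: the week when the pool is projected to saturate.
week_at_100 = (100 - intercept) / slope
```

With these numbers the fit gives 7.5 percentage points per week, projecting
saturation at week 10, two weeks after the last sample, matching the scenario.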

## The Empowerment

Here's what makes this powerful: observability doesn't just help you react; it
changes what you can do.

With proper observability, you can:

**Deploy with confidence**: You're not hoping changes work; you're measuring
their actual impact and can roll back immediately if needed.

**Debug efficiently**: Investigation time drops from hours to minutes because
you have the data to understand what happened.

**Optimize strategically**: You're not guessing which optimizations matter;
you're measuring actual bottlenecks and their impact.

**Scale proactively**: You're not reacting to outages; you're seeing capacity
limits before you hit them.

**Experiment safely**: You can try new approaches and measure their real-world
impact, not theoretical benefits.

This is agency. You're not at the mercy of your system's complexity.
You've built the visibility to understand it, predict it, and shape it.

Observability is the ability to maintain agency over increasingly complex systems.
It's how you ensure that growth doesn't mean loss of understanding.
It's how you prove to yourself, your team, and your users that you're in control.