
Here are the insights of Karan, a former Staff SRE with years of experience spanning software engineering, systems reliability and incident response. He shared his thoughts on what separates great SREs from good ones, the hardest parts of the job and where the role is heading with the advent of AI.
The mindset shift from software engineering to SRE
Karan moved into SRE from a software engineering background, and he describes the core shift simply: “A developer says, ‘my code works.’ A great SRE says, ‘the system works.’”
As a developer, your world is bounded by your own code – i.e. does it compile, does it pass tests, does the feature ship? As an SRE, that boundary vanishes.
“The moment you shift to SRE, it’s all about resiliency. Not just of an application, but of the whole system. How observable is it? How operable is it? How fault-tolerant is it? Those are the questions we ask.”
And the scope of ownership changes dramatically. “As a developer, you own an application. As an SRE, you own the complete system: the database, API latency spikes, memory and disk usage, DNS, certificate expiries. All of it.”
Takeaways:
- Actively expand your mental model beyond your immediate application. Start asking questions about the systems adjacent to yours (DNS, storage, API dependencies) even if they’re not formally your responsibility yet. You need to be a systems thinker.
- Use the four lenses of SRE thinking as a self-check: is this system observable, operable, resilient and fault-tolerant? If you can’t answer those questions, that’s where to focus next.
The skill that separates great SREs from good ones
The biggest differentiator, Karan says, is the ability to think clearly under pressure:
“If somebody can really think well under pressure – if they can understand how components interact and make safe decisions quickly, with incomplete data – that is what separates a great SRE.”
Closely tied to this is the ability to rapidly identify the blast radius of an incident: how far-reaching is the impact, and what else might be affected?
Getting to that answer fast, with limited information, is a skill that takes time and deliberate effort to build.
So how do you develop it? Karan is candid: “It’s tricky. Being in those scenarios is the best way.” He points to simulation-based training as one route in, alongside shadowing experienced engineers during live incidents.
But there’s another habit he credits more than almost anything else: attending postmortems, even for incidents you weren’t directly involved in.
“No matter whether I’d been in that incident or not, I used to be present in every postmortem. Being there gives you exposure to every incident that’s happened across the company. Getting to know how systems have failed – in all their variety – is a much better way to understand how a system works, and it gives you the ability to make informed decisions when it’s your turn.”
Tips:
- Make attending postmortems (or post-incident reviews) a habit – even for incidents you had no part in. The pattern recognition you build across dozens of failures is hard to replicate any other way.
- Get stuck into incidents – starting with P3s and P4s for junior SREs, and working your way up. Handling pressure is a skill, and it can only be developed by practising under it.
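The blast-radius question raised earlier in this section can be made concrete with a small sketch. It assumes you keep some map of which components depend on which; the `DEPENDENTS` map and every component name below are hypothetical, and a real system would draw this from a service catalogue or tracing data rather than a hand-written dictionary.

```python
from collections import deque

# Hypothetical dependency map: each key is a component, each value lists
# the components that depend on it directly. All names are illustrative.
DEPENDENTS = {
    "dns": ["api-gateway"],
    "database": ["orders-service", "billing-service"],
    "orders-service": ["api-gateway"],
    "billing-service": ["api-gateway"],
    "api-gateway": ["web-frontend"],
    "web-frontend": [],
}


def blast_radius(failed, dependents):
    """Return every component that could be affected if `failed` breaks,
    found by a breadth-first walk over the dependents map."""
    affected = set()
    queue = deque([failed])
    while queue:
        component = queue.popleft()
        for downstream in dependents.get(component, []):
            # Skip anything already recorded so cycles cannot loop forever.
            if downstream != failed and downstream not in affected:
                affected.add(downstream)
                queue.append(downstream)
    return affected


if __name__ == "__main__":
    # A database failure reaches both services, the gateway and the frontend.
    print(sorted(blast_radius("database", DEPENDENTS)))
```

A map like this only answers "what could be affected"; judging what actually is affected, with incomplete data, remains the human skill Karan describes.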
What early-stage SREs usually get wrong during their first incident
An adage often attributed to Einstein runs: ‘If I had an hour to solve a problem, I’d spend 55 minutes thinking about the problem and 5 minutes thinking about solutions.’
Likewise for SREs, the most common mistake, Karan says, is jumping to solutions before properly understanding the problem. Sometimes interventions are necessary to move an investigation forward, and all interventions are calculated risks. How much risk you are prepared to take is entirely context-specific. But early-career SREs can be inclined towards impulsive, rash decisions.
“We don’t yet have a complete understanding of the system, and so we start reaching for fixes: restarting services, scaling infrastructure, looking only at application logs.” Without a broader view of the architecture, those actions can be premature at best (and damaging at worst).
He also flags process failures that are easy to overlook in the heat of the moment: not declaring a severity level, not establishing an incident commander and poor communication throughout.
“Think about the layers of a system – the data centre at the bottom, then virtual machines, then hosted applications, then your code, then your API interface. An issue can be at any of those layers. You need to think in terms of the bigger picture, because for the system to work, every layer has to work – not just the one that’s visible to you.”
That broader vision comes with time and experience. But being aware of the gap is the first step.
Takeaways:
- However tempting, resist the urge to act before you’ve diagnosed. In the early minutes of an incident, your job is to understand. Fixing comes later, and a premature restart can destroy the evidence you need.
- Always declare a severity level and establish an incident commander, even if it feels excessive. Process exists precisely for the moments when thinking clearly is hardest.
The advice Karan wishes someone had given him early on
When Karan started out, he expected his impact to be visible. It took time to realise that the best reliability work often isn’t.
“I started realising that reliability work is often invisible. The impact is huge – but if you’re doing your job well, nothing breaks, and nobody notices.”
He uses a football analogy: “It’s like a goalkeeper. Nobody values how many saves they make. People only remember the one that got through.”

“To find fault is easy; to do better is difficult.” – Plutarch
Learning to be at peace with that – and to find ways to make the invisible visible – is part of the job.
On the technical side, Karan is equally direct about the foundations he wishes he’d invested in earlier: “Strong SREs will always have a solid grounding in network basics, Linux internals, storage behaviour, and application failure modes. That knowledge compounds over time.”
Takeaways:
- Invest deliberately in the ‘unglamorous’ foundations: networking, Linux internals, storage and how applications fail. These insights will pay dividends in every incident you ever attend.
- Find ways to make your reliability work visible, e.g. through incident metrics, postmortem summaries and trend reports. If nobody sees the saves, advocate for yourself by documenting them.
Making the case for reliability work
There is a persistent tension in most engineering organisations between shipping new features and investing in reliability. Karan’s advice on navigating it: stop speaking in technical language, and start speaking in business language.
“I used to say things like, ‘we need to improve our observability.’ But that’s hard for senior management to act on. So I started translating – revenue protection, operational efficiency, customer trust. The moment you frame it as ‘if we have this much downtime, we’ll lose this much revenue,’ people listen differently.”
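The ‘this much downtime, this much revenue’ framing is simple arithmetic. A minimal sketch follows, assuming revenue accrues evenly over time, which is a simplification since real traffic has peaks and troughs. The function name and the figures are hypothetical, not numbers from Karan.

```python
def downtime_cost(downtime_minutes, revenue_per_hour):
    """Estimate revenue at risk during an outage.

    Assumes revenue accrues evenly across each hour, so treat the
    result as a rough order of magnitude rather than a precise loss.
    """
    if downtime_minutes < 0 or revenue_per_hour < 0:
        raise ValueError("inputs must be non-negative")
    return downtime_minutes / 60 * revenue_per_hour


if __name__ == "__main__":
    # Illustrative figures only: a 45-minute outage at 20,000 per hour.
    cost = downtime_cost(downtime_minutes=45, revenue_per_hour=20_000)
    print(f"Estimated revenue at risk: {cost:,.0f}")
```

Even a rough figure like this gives senior stakeholders something to weigh against the cost of the reliability work being proposed.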
The metrics he found most persuasive to senior stakeholders include time-to-recovery trends, incident frequency, repeat incidents, manual effort being absorbed by the team and SLA breaches. Although metrics such as MTTR often get a bad press in SRE circles (we wrote a whole blog about this), meeting stakeholders where they are is important in nudging them in the right direction.
He also makes the case that early-career SREs who can speak this language will stand out. “An SRE has end-to-end ownership: right from the customer experience through to the infrastructure underpinning it. If you can speak to engineers in technical terms, and to stakeholders in business terms, that is a very big plus. Especially early in your career.”
Takeaways:
- Translate every reliability proposal into business terms before presenting it upwards. For example, ‘Reduce MTTR’ becomes ‘protect revenue and customer trust during outages’. It’s the same idea with a (hopefully) very different reception.
- Track and share metrics that resonate with leadership: incident frequency, time-to-recovery trends and the cost of manual toil. Numbers that connect to revenue or customer experience will always land better than technical benchmarks alone.
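As a rough sketch of how the metrics above might be pulled together, the snippet below summarises incident frequency, mean time to recover and repeat incidents per month. The record layout, the cause labels and the `summarise` helper are all hypothetical; real data would come from whatever incident tracker your team uses.

```python
from collections import Counter, defaultdict
from datetime import datetime
from statistics import mean

# Hypothetical incident records: (started, resolved, cause label).
INCIDENTS = [
    ("2024-01-05 10:00", "2024-01-05 11:30", "expired-certificate"),
    ("2024-01-20 02:15", "2024-01-20 02:45", "disk-full"),
    ("2024-02-11 14:00", "2024-02-11 14:40", "expired-certificate"),
]

TIME_FORMAT = "%Y-%m-%d %H:%M"


def summarise(incidents):
    """Return (per-month stats, causes seen more than once)."""
    minutes_by_month = defaultdict(list)
    causes = Counter()
    for started, resolved, cause in incidents:
        start = datetime.strptime(started, TIME_FORMAT)
        end = datetime.strptime(resolved, TIME_FORMAT)
        # Group each incident under the month in which it started.
        month = start.strftime("%Y-%m")
        minutes_by_month[month].append((end - start).total_seconds() / 60)
        causes[cause] += 1
    monthly = {
        month: {
            "incidents": len(minutes),
            "mean_minutes_to_recover": mean(minutes),
        }
        for month, minutes in sorted(minutes_by_month.items())
    }
    repeats = {cause: count for cause, count in causes.items() if count > 1}
    return monthly, repeats


if __name__ == "__main__":
    monthly, repeats = summarise(INCIDENTS)
    for month, stats in monthly.items():
        print(month, stats)
    print("Repeat causes:", repeats)
```

Averages can hide a lot, so a summary like this works best as a conversation starter with leadership alongside the postmortem narrative, not as a replacement for it.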
What it takes to get promoted to senior or staff SRE
Technical skill matters. But it will only take you so far.
“The biggest challenge I’ve seen with people who don’t progress is that they draw a boundary around themselves. ‘This is my remit. These are the technical issues I’m responsible for. Customer experience is not my problem. Stakeholder experience is not my problem.’”
The SREs who move forward, Karan says, are the ones who treat every problem as their problem – and who resist the urge to define themselves by what sits outside their job description.
He goes further: when asked whether he’d promote a less technically skilled SRE with a strong ownership mindset over a more technically gifted one who remained narrowly focused, his answer is unambiguous. “Exactly. Technical skill is one of the major factors – but unless you have complete ownership, people will always find a reason not to take on a problem. You need the mindset that says: I own this system, and if something is wrong, it’s mine to resolve.”
Takeaways:
- Notice when you’re tempted to say ‘that’s not my problem’, and try to push back against that instinct. Ownership is a behaviour you can practise before it becomes a mindset.
- Volunteer for problems that sit at the edges of your role, such as customer-facing incidents, cross-team reliability reviews and stakeholder updates. That’s where senior SREs are forged.
Where the SRE role is heading (and what AI won’t replace)
Karan is measured on the question of AI. He believes it will change the role significantly – but not eliminate it.
“AI can detect anomalies. But we need people to interpret them. It can suggest auto-remediation, but we need people to judge whether that remediation is safe. Currently we have runbooks. In future, we’ll need intelligence in decision making – and that intelligence still needs to come from somewhere human.
“I’ve seen systems break in a hundred different ways. The question I think SREs will increasingly be asked is: how do I bring that knowledge into architectural decisions from the start? How do I isolate failure to certain areas before it propagates? That shifts the SRE from incident responder to reliability architect.”
Finally, he points to domain knowledge as an undervalued differentiator. “If you’re working in industries like fintech or investment banking, where the cost of downtime is measured in ways beyond just revenue, the functional knowledge you carry matters enormously. That context shapes the decisions you make – and AI won’t have it.”
The SREs who thrive, he concludes, will be the ones who can hold three things in mind simultaneously: the business function, the system architecture and the codebase. They’ll be able to reason across all of them at once. In other words, they’ll be systems thinkers.
Takeaways:
- Start moving upstream. The more you understand about system design and architectural decisions, the more valuable you become – both to your team and in an AI-augmented future where reactive work gets automated.
- Invest in domain knowledge, not just technical knowledge. Understanding why your system matters to the business, and what failure actually costs, is context that will always require a human to hold.
Good luck!
And, of course, if you’re an early-stage SRE looking to upskill, try our incident simulations, which will identify your strengths and areas for improvement. Log in and get started here.




