
Karan was a Staff SRE at IG Group, and also worked at Morgan Stanley, Credit Suisse, Fidelity and Tata Consultancy Services. He is now a Senior Customer Success Engineer at Uptime Labs.
The Mindset Shift of Moving From Senior to Staff SRE
Q: What do you think is the biggest mindset shift when moving from senior to staff SRE?
Karan: It’s a big one.
As a senior engineer, my focus was on me and the problem - resolving incidents quickly, keeping systems stable during my shift, handling logging and monitoring tasks.
But as a staff SRE, the mindset completely changes. It’s no longer about how fast I can fix something. It’s about how the team collectively owns uptime - from new graduates to senior engineers - and how we maintain that shared vision of reliability.
It’s about building systems for the long term, not just quick fixes. You need to think about reliability from design all the way through operations. That means getting involved earlier in the development process, training your team, and ensuring everyone understands that uptime is a shared responsibility.
In short, it’s not about individual performance anymore. It’s about enabling the team to succeed together.
Getting Promoted
Q: That’s a big shift. Why do you think you were promoted to staff SRE?
Karan: Yes - and honestly, I never chased the title.
By the time I was promoted, I was already acting like a staff SRE - taking ownership of the system as a whole, not just my eight-hour shift. I made sure my teammates were up to speed because uptime isn’t an eight-hour job - it’s 24/7.
That sense of ownership was key. When you truly believe the system’s reliability is your responsibility, everything else - team training, knowledge gaps, collaboration - starts to fall into place.
It’s like firefighting: when there’s a fire, everyone jumps in, regardless of rank. But the best firefighters are the ones who practice the most. The more you sweat in training, the less you stress when it counts.
Common Mistakes from First-Time Staff SREs
Q: What common mistakes do you see first-time staff SREs make?
Karan: There are many, but the biggest one is getting stuck in constant firefighting.
You’re so focused on fixing incidents that you never pause to reflect on why they happen. Teams burn out because they stay reactive.
Another mistake is thinking you can solve everything within your team. You can’t. Successful staff SREs bring developers, platform, network, and performance engineers together around a shared goal.
When I worked under Hamed, we started holding weekly OKR (Operational Key Responsibility) meetings, where everyone - developers, platform, QA, and network teams - came together to discuss blockers and priorities. Making operational pain visible across teams helped reduce silos and improve collaboration.
Staff SRE - Not Just The Most Senior Engineer
Q: That ties into the idea that a staff SRE isn’t just the ‘most senior’ engineer - they’re a connector. So, what does technical leadership mean to you?
Karan: To me, technical leadership is about connecting the dots.
You don’t need to be an expert in every system, but you must understand how everything fits together - how upstream and downstream systems interact, how network, application, and infrastructure layers influence each other.
A great staff SRE sees the whole system and can interpret the signals - identifying where problems might emerge and how components interrelate. That systems-level thinking defines technical leadership for me.
Mentoring Your Team
Q: And how do you mentor your team as a staff SRE?
Karan: Two main things.
First: let them make mistakes - just not the same one twice. Learning by doing (and failing) is essential.
Second: give them exposure. Encourage them to present, speak in forums, and take ownership of decisions. The more they interact with other teams, the faster they grow.
A Staff SRE needs broad, horizontal knowledge - understanding how everything from data centres to application deployments connects. You don’t need full depth in every area, but you should grasp enough of each to make smart judgments.
Knowledge sharing is also crucial. We held daily handover calls so the entire team knew what was happening, ensuring no one was out of the loop during incidents.
Firefighting vs. Long-Term Improvements
Q: That’s a lot to juggle. How do you balance firefighting with long-term improvements?
Karan: It’s the classic Staff SRE dilemma. There’s no perfect balance.
At one point, under Hamed’s leadership, we tried a ‘car wash model’ - taking a set of applications, fixing them thoroughly, and then handing them back to dev before taking on new ones.
We also had a simple rule: if the same issue happens twice, fix the root cause. Don’t apply the same workaround more than twice. If the same problem recurs after that, it’s no longer an SRE issue - it’s a design issue, and the development team must own it.
Without that mindset, you’ll stay stuck in firefighting mode forever.
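The recurrence rule above can be sketched as a small counter. This is only an illustration under assumptions: the class name, the issue identifiers and the returned action labels are hypothetical, and the thresholds are one reading of the rule as Karan states it (workaround on the first occurrence, root-cause fix on the second, hand-over to development after that).

```python
from collections import Counter

# Assumption: thresholds are one interpretation of the rule in the interview.
ROOT_CAUSE_AT = 2    # second occurrence: the root cause must be fixed
MAX_WORKAROUNDS = 2  # the same workaround is applied at most twice


class WorkaroundTracker:
    """Counts occurrences per issue and suggests the next step (hypothetical helper)."""

    def __init__(self):
        self.occurrences = Counter()

    def record(self, issue_id):
        """Register one more occurrence of issue_id and return a suggested action."""
        self.occurrences[issue_id] += 1
        seen = self.occurrences[issue_id]
        if seen > MAX_WORKAROUNDS:
            # Recurred after two workarounds: treat it as a design issue.
            return "escalate-to-dev"
        if seen >= ROOT_CAUSE_AT:
            # Second occurrence: workaround is still allowed, but fix the cause.
            return "workaround-and-fix-root-cause"
        return "workaround"


tracker = WorkaroundTracker()
for _ in range(3):
    # "queue-backlog" is a made-up issue identifier.
    print(tracker.record("queue-backlog"))
```

In practice the count would more likely come from an incident tracker than from an in-memory counter; the point is only that the escalation threshold is explicit rather than left to whoever is on shift.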
Automation vs. Process & People
Q: And when it comes to fixes - how do you decide between automating something versus improving process or people?
Karan: I always start by blaming the system, not the person. Even human errors usually point to system design flaws - the system allowed the error to happen.
That mindset pushes you toward automation. Solve problems systematically, not manually.
Only when an issue falls outside the system’s control - like traders making manual configuration changes - should process improvements take the lead. Otherwise, technology should always be the first solution.
Keeping Healthy On-Call Habits
Q: Let’s talk about on-call. What’s the key to keeping it healthy?
Karan: The biggest factor is ensuring every alert is actionable.
When I first became staff SRE, 80% of after-hours alerts weren’t actionable - they were noise. Fixing that reduced burnout dramatically.
We audited every alert: what’s actionable, what’s just signal noise, what defines a ‘healthy’ system. It’s like a medical report - you only want alerts that indicate real problems, not false alarms.
Second, if someone is paged after hours, that alert becomes a top priority for a permanent fix the next day.
For me, the real work begins after you close an incident - understanding why it happened and preventing recurrence.
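As a rough illustration of the alert audit described above, the sketch below computes the share of after-hours alerts that were not actionable and lists the noisy ones as candidates for tuning or removal. The record fields, alert names and sample data are all hypothetical; the actionable flag is assumed to be a human judgment made during the audit, not something the code can infer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    name: str
    after_hours: bool
    actionable: bool  # assumption: set by a person reviewing the alert


def audit(alerts):
    """Return (noise ratio among after-hours alerts, sorted names of noisy alerts)."""
    after_hours = [a for a in alerts if a.after_hours]
    noisy = [a for a in after_hours if not a.actionable]
    ratio = len(noisy) / len(after_hours) if after_hours else 0.0
    return ratio, sorted({a.name for a in noisy})


# Made-up sample: five after-hours alerts, four of them noise.
sample = [
    Alert("cpu-spike-warning", after_hours=True, actionable=False),
    Alert("cpu-spike-warning", after_hours=True, actionable=False),
    Alert("disk-80-percent", after_hours=True, actionable=False),
    Alert("cert-expiry-30d", after_hours=True, actionable=False),
    Alert("order-api-down", after_hours=True, actionable=True),
    Alert("disk-80-percent", after_hours=False, actionable=False),
]

noise_ratio, noisy_names = audit(sample)
print(f"after-hours noise: {noise_ratio:.0%}")
print("candidates to tune or remove:", noisy_names)
```

Tracking this ratio over time gives a simple way to check whether on-call noise is actually going down after each round of clean-up.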
Good vs. Great Staff SREs
Q: Finally, what’s the difference between a good SRE and a great SRE?
Karan: That’s for my team to say! But I think empathy is the key.
A great Staff SRE shows empathy not only toward people but also toward the system. You understand how it behaves, where it struggles, and how other teams interact with it.
A great Staff SRE inspires developers, platform engineers and network teams to see uptime as a shared goal - not just an SRE responsibility.
Reliability is a team sport. When everyone feels ownership of uptime, that’s when you go from good to great.
Recapping the Advice
Shift from individual ownership to collective reliability.
Moving from senior to staff means thinking beyond your own incident queue. Your job is to help the whole team own uptime, not just fix problems yourself.
Focus on long-term reliability, not quick fixes.
A staff SRE examines the system holistically, influencing design, development and processes to prevent future incidents - moving beyond just reacting to outages.
Stop firefighting; start system thinking.
Constant reaction mode leads to burnout. Take time to analyse root causes and involve cross-functional teams in solving them.
Create shared ownership across teams.
Reliability isn’t an SRE-only problem. Bring developers, platform and network engineers into shared OKRs and conversations about uptime.
Lead through empathy and exposure.
Let your team make (and learn from) mistakes, encourage them to speak up in larger forums and ensure knowledge is widely shared.
Prioritise healthy on-call habits.
Every alert should be actionable. Cut out noise, fix repeat issues permanently and treat post-incident reviews as where the real work begins.
Automate where possible - blame systems, not people.
Even human errors often stem from system design flaws. Whenever possible, build systematic, automated safeguards instead of relying on manual processes.
Practice, don’t panic.
The more you rehearse incidents in low-risk settings, the calmer and more effective you’ll be when real ones occur - reinforcing Uptime Labs’ mantra: ‘drill like it’s real, respond when it is real.’
Good luck!





