A conversation between Gandhi Mathi Nathan Kumar (Gandhi M N Kumar), Principal Incident Commander at Twilio, and Hamed Silatani, CEO at Uptime Labs.
I am coming back to this conversation a few months after we first published it. We originally created it for Fixmas, but I wanted to share it again because the themes feel just as relevant now.
In our discussion, Gandhi reflects on the unique pressures of managing incidents during the festive season. Staffing is lighter. Escalation decisions carry more weight. Paging someone in the middle of the night can feel like a bigger interruption than usual.
There are moments in the year when the stakes feel higher. People are spending time with family. Colleagues are observing important holidays. Others are trying to disconnect from work for a short period of rest. During these times routines shift, and the balance between work, rest, and personal commitments becomes more delicate.
Periods like this test more than our systems. They test our culture. They reveal how comfortable people feel escalating an issue, how teams support one another, and whether organisations rely too heavily on individuals during critical moments.
Escalation and key-person dependency are both examples of the impact of leadership on incident response. I firmly believe that leaders play a key role in incident response, but much of their work happens before an incident takes place.
Making the call to escalate during the uncertainty of an incident situation is hard. There is a fine balance between being seen as incompetent and restoring service. Also, bear in mind that sometimes it is not clear early on how serious the incident is or how fast it can expand.
Feeling safe to ask for help and having clear guidelines on when to escalate are both important, e.g. “If you are not able to work out what’s going on in 15 minutes, escalate”, “If you believe the incident could be sev1 or higher, escalate” or “the moment you think you need help, escalate”.
Escalation and asking for help should always be celebrated. Both are principles that leaders should regularly communicate and model through their own example.
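The example guidelines above can be written down as explicit checks, so that the decision to escalate does not rest on an individual's nerve at 2 a.m. The following is a minimal sketch, not a description of any real tool: the `IncidentState` fields, the function names, and the convention that a lower number means a more severe incident (sev1 being the most severe) are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class IncidentState:
    """Hypothetical snapshot of what the on-call responder knows so far."""
    minutes_without_diagnosis: float  # time spent without understanding the cause
    suspected_severity: int           # assumed scale: 1 is most severe (sev1)
    responder_wants_help: bool        # the responder feels they need help


def escalation_reasons(state, diagnosis_limit_minutes=15, severity_threshold=1):
    """Return every guideline that currently says 'escalate'.

    The defaults mirror the example guidelines in the text: 15 minutes
    without a diagnosis, a suspected sev1 (or more severe), or simply
    wanting help.
    """
    reasons = []
    if state.responder_wants_help:
        reasons.append("responder asked for help")
    if state.suspected_severity <= severity_threshold:
        reasons.append("suspected sev1 or higher")
    if state.minutes_without_diagnosis >= diagnosis_limit_minutes:
        reasons.append("no diagnosis within the time limit")
    return reasons


def should_escalate(state):
    """Escalate as soon as any single guideline applies."""
    return bool(escalation_reasons(state))


if __name__ == "__main__":
    state = IncidentState(minutes_without_diagnosis=20,
                          suspected_severity=3,
                          responder_wants_help=False)
    print(should_escalate(state), escalation_reasons(state))
```

Returning the reasons alongside the decision is a deliberate choice in this sketch: the page that goes out can state which guideline fired, which makes it clearer to everyone that the escalation came from the agreed rules rather than from a lack of competence.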
The other side of the coin is that frequent escalation can put a smaller group of more experienced staff under a lot of pressure. The drive to restore the service as soon as possible leads to calling upon people who have dealt with many incidents in their area or who know the system best.
The paradox is that each time less experienced people escalate to the-person-who-knows-it-all, they lose an opportunity to learn and gain confidence, which leads to more escalation to the-person-who-knows-it-all. Only leaders have the authority to break this cycle by driving initiatives to develop expertise across a larger population of engineers.
That means investing in game days and simulations, and sometimes accepting a slightly longer recovery time as an investment in building expertise and confidence across a larger group.
What follows is a thoughtful reflection from Gandhi on these challenges. We discuss escalation culture and the psychological weight that can come with being on call (especially during the festive season).
Could you tell us about a holiday season incident horror story?
One of my first Christmas-time horror incidents dates back to 2011, when I was a network engineer on call that week. It was my first year in the corporate world, and I was still learning. We experienced a major outage caused by an undersea cable cut, and suddenly hundreds of customers began reporting latency and service degradation. Because it happened outside business hours in APAC, most of the customers actively experiencing impact were on the other side of the globe.
It took some time before word spread through the telco carrier network that the root cause was indeed a fibre cut; in fact, there were two of them. The first hit the SEA-ME-WE 3 cable near the Suez Canal in Egypt, and the second impacted the i2i link between Chennai, India, and Singapore.
For context, SEA-ME-WE 3 is the longest data cable in the world, stretching about 39,000 kilometres (24,000 miles) and connecting Southeast Asia to Western Europe through the Red Sea.

Both cable cuts caused widespread internet slowdowns across South Asia and the Middle East, especially in the UAE.
As network engineers, our job was to work with the NOC and other carriers to reroute traffic (where possible) and keep critical voice and video services running with as little disruption as possible. The incident hit the UAE particularly hard, forcing us to redirect large volumes of traffic through alternate global routes to reach the same destinations.
I was fortunate to work in an environment where escalation was encouraged; I could reach out to senior leaders and our NOC leaders in the middle of the night without hesitation. I still remember some of my leaders in Pune driving to the office at 2 a.m. to support mitigation efforts. Such a culture is pivotal to the success of the organisation as it raises the bar and inspires junior engineers to give their best.
That experience taught me the value of teamwork and the power of a supportive, blameless culture within the IPSOC/NOC I was part of.
What are your thoughts on escalating or recruiting help when no one is around? What about the guilt of disrupting colleagues’ time off?
It’s unpredictable how quickly calm can turn into chaos. One minute everything’s running smoothly, and the next you’re in the middle of an incident with alarms going off and dashboards lighting up like a Christmas tree. In those moments, asking for help can feel harder than fixing the issue itself, especially when it’s late at night during the holidays.
A lot of that comes down to culture. In some teams, escalation is seen as a weakness, as if you couldn’t handle it on your own. In others, it’s a sign of strength and maturity.
The difference usually lies in how leadership sets the tone. When a company promotes a blameless culture where people are encouraged to reach out, page others, and escalate issues early, trust grows. People feel safer, more loyal, and more invested in the bigger picture.
Most tech companies have a code freeze during the holidays, which helps minimise risk. But as we all know, incidents don’t check the calendar. They can happen anytime. Even with the best tools and processes, things break. That’s why it’s so important to have escalation protocols that kick in automatically based on severity, impact, or duration, so no one has to hesitate or feel guilty about pushing the button.
Compare that to a Black Friday or Cyber Monday. Those days are planned chaos. Teams know the traffic will spike, so war rooms are set up and extra hands are on deck. But Christmas is different. People are with family, unplugged, and it’s easy for the on-call engineer to feel that pang of guilt before waking someone up or paging a leader.
Here’s the thing: duty calls, and that’s okay. What matters is having a culture where both sides understand that it’s part of the job. The on-call shouldn’t feel guilty about escalating, and the person being paged should know it comes from necessity, not carelessness. When everyone’s aligned on that, the guilt fades, and what remains is a sense of shared responsibility, the kind that makes teams stronger long after the holidays are over.
Because in the end, reliability isn’t just about systems. It’s very much about people who’ve got each other’s backs, even at 2 a.m. on Christmas morning.
Is the bar for a major incident higher during the festive season? It should be far more serious to distract people.
I believe so, mainly because when stuff breaks and human availability is minimal, the resiliency of engineering and incident response teams is put to the test. Major Incident Managers / Incident Commanders are often put in the hot seat to navigate the incident to resolution without much assistance. The festive season makes this all the more challenging because you are engaging with minimal resources and do not have the SMEs you would normally have at your disposal to help in ambiguous situations. As for distracting people, I think it’s a double-edged sword, and it ultimately boils down to the severity of the situation.
Pulling people away from family or leave is a serious disruption.
Holiday major incidents have a psychological and organisational cost, so teams often want to be very sure an incident actually warrants that interruption. Systems are expected to withstand periods of low staffing.
So when something breaks, it tests automation, failover mechanisms, monitoring clarity, runbooks, and the self-reliance of the on-call engineer. This can make borderline issues feel more major, but paradoxically it also makes teams more hesitant to label something “major” unless absolutely necessary. With fewer experts available, the Incident Commander has to make faster decisions with less information, rely heavily on predefined processes, avoid escalation clutter, and maintain calm while resources are thin. This reinforces the tendency to reserve “Major Incident” status for only the most clearly impactful events. Looked at that way, the holiday season is a time to truly realise strengths and weaknesses and learn from them. There is a hidden opportunity for learning and building muscle memory: running enough simulations and game days to practise and implement the lessons learnt from incidents that happen when fewer experts are available.
System baselines and traffic norms change during the festive seasons. For some, it’s deadly quiet; for others, it’s extremely busy. What does this tell us about designing alerts?
Typically, most technology systems follow Newton’s first law of motion: they tend to work flawlessly until someone meddles with them and introduces a change. That is where most issues begin, and change is often the highest contributor to incidents. Change is not inherently bad, but change without a deep and thorough understanding, regression testing, a baking period in lower environments, and canary deployment validation in prod is a recipe for disaster. During the Christmas season, most companies institute a code freeze, and this includes vendors and their customers alike, which largely contributes to the quiet.
However, for others, the Christmas season is the busiest time of the year, because of peak traffic from customers, especially in businesses like e-commerce, travel and entertainment and even businesses that support these players.
Alerts are crucial, but they need to make sense. I call this noise management. The biggest challenge with alerts is that no one revisits them after they are set up (which could be years ago) to determine whether they are still needed or relevant, or to ask: when was the last time this rule was triggered? Is the alert actually adding value? Has the landscape changed? Teams seldom have the bandwidth to perform an alert sanity exercise, cleaning up alerts that no longer make sense and tweaking the rest to fit the new norms. Many companies now offer AI-assisted tooling for this, which helps correlate and link related alerts and suppress unwanted ones. I see alert management as an ongoing exercise that continues to evolve and adapt as the landscape changes. Eventually we will have sentient monitoring and alerting systems that constantly learn and adapt to the ever-growing and ever-changing tech landscape, but until then we will continue to depend on human beings to build alerts that make sense for what a particular team, product or org cares about.
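The first question in an alert sanity exercise, "when was the last time a rule was triggered?", is straightforward to automate once the rule inventory is exported from whatever alerting tool a team uses. The sketch below assumes a hypothetical `AlertRule` record with a creation time and a last-triggered time, and an arbitrary 180-day quiet window; none of these names or values come from a specific product.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class AlertRule:
    """Hypothetical export of one alert rule from an alerting tool."""
    name: str
    created_at: datetime
    last_triggered_at: Optional[datetime]  # None if the rule has never fired


def find_stale_rules(rules: List[AlertRule],
                     now: datetime,
                     max_quiet: timedelta = timedelta(days=180)) -> List[AlertRule]:
    """Return rules that have been silent for longer than max_quiet.

    A rule that has never fired is measured from its creation time, so a
    freshly added rule is not flagged just because it has no history yet.
    """
    stale = []
    for rule in rules:
        last_activity = rule.last_triggered_at or rule.created_at
        if now - last_activity > max_quiet:
            stale.append(rule)
    return stale


if __name__ == "__main__":
    now = datetime(2024, 12, 25)
    rules = [
        AlertRule("api-latency-high", datetime(2020, 1, 1), datetime(2024, 11, 1)),
        AlertRule("legacy-queue-depth", datetime(2019, 6, 1), None),
        AlertRule("new-checkout-errors", datetime(2024, 10, 1), None),
    ]
    for rule in find_stale_rules(rules, now):
        print(f"review candidate: {rule.name}")
```

A silent rule is only a candidate for review, not for automatic deletion: an alert that guards against a rare but severe failure may rightly go years without firing, which is why the judgement about what a team cares about stays with people.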
Working or having the burden of being ready to jump on a call during the holiday period disrupts personal/family life, e.g. you can’t travel or can’t go out with family. You can’t immerse yourself in the festivities. What is the psychological impact?
Working or being on call during the festive period takes quite a toll that most people outside these roles rarely see. There’s this constant, invisible tension; you’re technically ‘off,’ but never fully present. You hesitate to make travel plans, skip family outings, or keep your phone within reach even during dinner. Even when you do go out, you make sure your phone is on loud and fully charged, and you have your laptop with you. You can’t fully sink into the spirit of Christmas Eve or Christmas Day because a part of your mind stays alert, ready to jump into incident mode at any moment.
Over time, that state of partial readiness tends to wear on you. It creates mental fatigue and emotional distance, even if nothing goes wrong. The anticipation itself can be stressful: the feeling that at any moment, you might have to switch from laughter with family to troubleshooting in front of a glowing screen. For many engineers and on-call professionals, that blurred boundary between work and rest can lead to burnout, irritability, and guilt: guilt for missing moments, and guilt if you don’t respond fast enough when things go wrong and the policy escalates to the next level before you have joined the call. Worse still is joining the call late only to realise that your Sr. Director, VP or CXO is running the call that you should have been running.
That’s why organisational empathy matters so much. Rotating schedules, clear backup plans, and a culture that encourages real downtime help protect mental health. But most of all, leaders need to recognise that being “on call” isn’t just a technical role; it’s an emotional one, and they need to ensure that knowledge doesn’t concentrate in individuals; rather, it’s dissolved and absorbed into tools and processes.
Because sometimes the hardest incident to manage isn’t in the system; it’s in our own minds, when we never truly get to switch off. It can take years to build a team, culture, and processes that break down human single points of failure and allow the team to constantly learn and thrive together.





