It's a Saturday, and you're about to sit down and take another look at Kickstarter when you get a call. It's a member of your team. She has bad news. A crucial system is down and it's going to take a while to come back up. Deployments to production have stopped and some people are already starting to get worked up.
You know that your customers have had to stop work and pretty soon you're going to be the centre of attention. There are a lot of teams that will be willing to help, but you know that the skills required are so niche that their help will not be much use. A Tech Bridge is being set up and some very large fish will be on the call wanting updates on the hour, every hour.
You know that this will require a 24x7 effort until it's back up and running. Whilst you know your team will step up and give it their all, you're also concerned about your team.
You're big enough to have crucial responsibilities, but your team's skills are too niche for an enterprise-grade Major Incident Management process.
Welcome to the underworld of Major Incident Management with a small team. I have been in exactly this situation many times - sometimes as the only person who could fix something, other times with only a couple of other people, and also as the manager of a crucial Category A business service.
Major Incident Management Procedure
What I find works best is having a Major Incident Management procedure all set up and ready to roll. The procedure goes something like this:
Get the whole team together and explain the situation. Only the most life-and-death work on other items is to continue.
Sort out who has the relevant skills for the task at hand using your skills register (you do have one, right?).
Set up the people who have the skills into 8-10 hour shifts, and try to get the most relevantly skilled people into the shifts where the most activity is bound to happen. The people who do not have the most relevant skills go on the shifts as well - these are your Point People. Ensure each shift has a 30 minute handover.
Send the people who will be on shifts home to get some sleep. Now.
Publish the roster, with the relevant contact details, to those who need to know. Make sure people know that before calling anyone they must check the roster. Make it clear that people will be sleeping at odd times and that tired people make mistakes.
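The shift set-up described above can be sketched in a few lines of Python. This is only a starting point under stated assumptions: the names `Shift`, `build_roster` and `on_shift` are made up for illustration, people are simply rotated in the order given (so list your most relevantly skilled people first), and the 30 minute handover is modelled as the outgoing crew staying on for the first 30 minutes of the next shift.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from itertools import cycle


@dataclass
class Shift:
    start: datetime
    end: datetime        # includes the handover overlap with the next shift
    guru: str            # the Technical Guru working the problem
    point_person: str    # handles communications and keeps distractions away


def build_roster(gurus, point_people, first_shift_start,
                 shift_hours=8, handover_minutes=30, shift_count=6):
    """Pair one guru with one point person per shift, rotating through both lists."""
    if not gurus or not point_people:
        raise ValueError("each shift needs a guru and a point person")
    if not 8 <= shift_hours <= 10:
        raise ValueError("shifts should be 8-10 hours long")

    length = timedelta(hours=shift_hours)
    handover = timedelta(minutes=handover_minutes)
    guru_cycle, point_cycle = cycle(gurus), cycle(point_people)

    roster = []
    for i in range(shift_count):
        start = first_shift_start + i * length
        # The outgoing crew stays for the handover, so shifts overlap slightly.
        roster.append(Shift(start, start + length + handover,
                            next(guru_cycle), next(point_cycle)))
    return roster


def on_shift(roster, when):
    """Return the shifts active at a given time: check this before calling anyone."""
    return [shift for shift in roster if shift.start <= when < shift.end]
```

With three gurus and three point people on 8 hour shifts, each person works eight and a half hours and then has over fifteen hours off, which is the whole point: nobody is awake for days on end. `on_shift` returns two shifts during a handover window and one at any other time within the roster.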
Cool, so now you have micro squads set up in shifts. There are three roles at play here, as follows.
The Technical Guru(s)
This is the person or people who will be working to get things going. Their focus is on solving the problem. They will need to focus.
The Point Person
This is the role that small teams often miss. This is the person whose task is to handle all the communications and updates and to support the Technical Guru(s) in whatever way is required. They will be the firewall who gives an update to the Developer who just got sent by his Manager to ask "are we there yet?". In short, their role is to minimise distractions and to keep morale up. They are the people who will sit on the Tech Bridge whilst the Technical Guru needs to "Facilitate the Amenities", and who order the pizza, get the caffeine flowing and make sure the dry cleaning is picked up.
The Manager
As their Manager, you are the third person in the squad. Congratulations, you will have to focus on any of the political situations or discussions. If you can find a peer who can help, rope them in so you can get some sleep. If you can't, well sometimes it's good to be king, other times, not so much.
The Post Operative Analysis
When the Incident is resolved, stand down your team's roster and, depending on how long the roster has been running, keep your team in skeleton crew mode until everyone has readjusted their sleep cycles to the normal rhythm.
Always do a post-incident analysis with the team and refine your process accordingly. What went well? What needs to change?
If applicable, update your shakeout procedures, monitoring and scripts to keep an eye out for the problem in the future.
Some Examples
Over the years I have found that this procedure works well. I have seen other teams keep their best 3 people awake for days on end, making mistakes. When I see this situation, I ask myself "why is their manager so irresponsible?"
Recently, my team intervened to assist another team that was in a whole world of trouble. By the end of the incident many days later, the results spoke for themselves. My team could have sustained the required support for many more days. The other team was a complete wreck - both physically and mentally.
Prerequisites
It is also worth noting that whilst there will always be limitations on small teams and the skills spread within the team, as their manager you should always work on ensuring that you have more than one person with any of the most relevant skills. Your team needs to "Cross skill" - particularly if it's a small team. If you have someone in your team who wants to be the maverick and won't share their expertise, they need to be managed.
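If your skills register lives somewhere you can export it from, a quick check for skills held by fewer than two people is easy to script. The sketch below assumes the register is a plain mapping of person to skills. The function name `single_points_of_failure` and the example skill names are hypothetical.

```python
def single_points_of_failure(skills_register, critical_skills, minimum_people=2):
    """Return the critical skills held by fewer than `minimum_people` team members.

    skills_register: dict mapping a person's name to the set of skills they hold.
    critical_skills: the skills you would need during a major incident.
    """
    holders = {skill: set() for skill in critical_skills}
    for person, skills in skills_register.items():
        for skill in skills:
            if skill in holders:
                holders[skill].add(person)
    # Report who currently holds each under-covered skill (possibly nobody).
    return {skill: sorted(people)
            for skill, people in holders.items()
            if len(people) < minimum_people}


register = {
    "Ana": {"scm admin", "build server"},
    "Ben": {"scm admin"},
    "Cy": {"database"},
}
gaps = single_points_of_failure(
    register, ["scm admin", "build server", "backup restore"])
```

Here `gaps` lists "build server" (only Ana) and "backup restore" (nobody), which is your cross-skilling to-do list.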
You should find that the components in play here can be useful to tie into your company's Disaster Recovery and Business Continuity planning and procedures, which have a much broader scope than this post can cover.
I'm not claiming that this is a perfect process, but I can tell you that it works, and I sincerely hope it helps - even if it's just as a starting point for you. If there is one thing I really need to reinforce, it is that the Point Person is a crucial role here, and it is one I have never seen other small teams adopt. At least not until they see its benefit.
Good Luck!
Jack has Managed a Software Configuration Management Service which services approximately 2000 developers. He hasn't had an opportunity to execute this procedure for over a year as the systems have achieved 100% uptime for the year ;)