This article was originally posted on my Medium account. It has been reproduced here for your viewing pleasure.
--
For anyone on an on-call team, a huge concern for an IRT (Incident Response Team) is Alert Fatigue. Let’s dive in and discuss it.
Alert Fatigue: What is it?
According to whatis.techtarget.com, Alert Fatigue is defined as follows:
Alert fatigue, also called alarm fatigue, is an instance where an overwhelming number of alerts causes an individual to become desensitized to them. Alert fatigue can lead to a person ignoring or failing to respond to a number of safety alerts. — whatis.techtarget.com
If you are part of an IRT or if you are thinking of creating an IRT, you’ll need a few things for your team to be successful.
- You need a good way to consolidate alerts. You don’t want your team to keep 17 different windows open to monitor different services concurrently. This is normally done via a SIEM like Splunk Enterprise Security.
- You need a good way to communicate those alerts to your team. To stick with the Splunk SIEM, we’re going to use Splunk On Call.
- You need at least 2 team members willing to give up sleep, events, and recreation for a specified period of time (usually a week out of the month) to be your on-call.
- You need a standard response plan that details the procedure to be taken by your team when an alert comes in.
Once you have checked the box next to the 4 items above, you essentially have the skeleton of an IRT. Now it’s up to you as the team lead to add the muscle to the skeleton and empower your team to take on any problem that arises head-on.
The problem that you undoubtedly will encounter eventually is Alert Fatigue. It is usually caused by misclassified alerts (e.g., having all API calls come in under CRITICAL regardless of origin) or by a sheer volume of alerts that exceeds what a single person can reasonably accommodate.
Misclassification of alerts
Misclassification of Alerts is a state of having a deluge of alerts that have been classified as High/Critical/Very-Critical even though that rating doesn’t necessarily apply to the event. This can quickly overwhelm an IRT, especially if you only have 2 people on-call.
The way to avoid this is to frequently take a sample of the alerts that are coming in and examine them. If the alerts don’t need an IRT response, downgrading them to Medium or Low could be warranted.
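To make that sampling concrete, here’s a minimal Python sketch. The field names (source, severity) and the threshold are assumptions for illustration, not any particular SIEM’s schema, so adapt them to whatever your SIEM exports:

```python
from collections import Counter
from typing import Iterable

# Hypothetical alert shape: a dict with "source" and "severity" keys.
# These field names are placeholders; map them to your SIEM's export format.

def severity_breakdown(alerts: Iterable[dict]) -> Counter:
    """Tally (source, severity) pairs so noisy sources stand out."""
    return Counter(
        (a.get("source", "unknown"), a.get("severity", "unknown")) for a in alerts
    )

def downgrade_candidates(alerts: Iterable[dict], threshold: int = 50) -> list[tuple[str, str]]:
    """Flag sources that raised more than `threshold` high-severity alerts in the sample."""
    counts = severity_breakdown(alerts)
    return [
        (source, severity)
        for (source, severity), n in counts.items()
        if severity in {"High", "Critical", "Very-Critical"} and n > threshold
    ]

if __name__ == "__main__":
    sample = [
        {"source": "internal-api", "severity": "Critical"},
        {"source": "internal-api", "severity": "Critical"},
        {"source": "edge-firewall", "severity": "High"},
    ]
    print(downgrade_candidates(sample, threshold=1))  # [('internal-api', 'Critical')]
```

Anything that shows up on that list is a conversation starter, not an automatic downgrade.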
The other way to tackle this is to look at the source of the alerts. If it’s something internal, it could be beneficial to communicate to the owner of the internal resource that these alerts are coming through. Many resource owners aren’t aware they generate alerts, and something as simple as clearing event logs to make space can trigger a Very Critical alert for the IRT. Work with your resource owners and keep them in the loop.
Sometimes it's as easy as changing some settings on the source, and as a bonus that can be a way to get security on the minds of those resource owners. In the above example, it may be that their machine is running out of storage space and needs to be expanded.
Alert Flood
While I’m sure there’s a better term for it, an alert flood is exactly as it’s described: a flood of alerts that quickly overwhelms the response capability of the IRT. While some of these could be legitimate, they are quickly lost amongst the flood as page after page of alerts come in.
More often than not, these floods are caused by miscommunication or a lack of communication from resource owners to security. Stress-testing a service exposed on your external network (even if it is dev) is a great way to flood your IRT.
If your team is experiencing this, you need to have a conversation with your resource owners. They need to tell you when they are testing, and all they need to do is send you an email. They should include the list of IPs or hostnames that will be involved, and the length of testing. This lets you suppress the alerts from those machines for that period of time. Your team can’t perform well when they are inundated with pages and pages of alerts.
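As a rough sketch of what that suppression could look like, here’s some Python that drops pages from hosts inside an approved testing window. The hosts, dates, and field names are placeholders, and this isn’t any vendor’s API; it’s just the shape of the logic:

```python
from datetime import datetime, timezone

# Placeholder suppression list built from the resource owner's email:
# which hosts/IPs are under test, and for how long. All values are illustrative.
SUPPRESSIONS = [
    {
        "hosts": {"10.0.5.21", "dev-stress-01"},
        "start": datetime(2024, 1, 8, 9, 0, tzinfo=timezone.utc),
        "end": datetime(2024, 1, 12, 17, 0, tzinfo=timezone.utc),
        "reason": "Approved stress test of a dev service",
    },
]

def should_page(alert: dict, now: datetime | None = None) -> bool:
    """Return False if the alert's host falls inside an active suppression window."""
    now = now or datetime.now(timezone.utc)
    host = alert.get("host", "")
    for window in SUPPRESSIONS:
        if host in window["hosts"] and window["start"] <= now <= window["end"]:
            return False
    return True
```

Keeping the reason alongside each window also gives you an audit trail for why a page never fired.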
Finally, consider changing what qualifies as an alert and what is merely informational. Make use of your priority filters so that critical assets or events get acted on first.
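A priority filter can be as simple as a sort key. The asset list and severity ranking below are placeholders you’d swap for your own inventory:

```python
# Placeholder inventory of assets that always jump the queue.
CRITICAL_ASSETS = {"domain-controller-01", "payments-db"}

# Lower rank = handled sooner.
SEVERITY_RANK = {"Very-Critical": 0, "Critical": 1, "High": 2, "Medium": 3, "Low": 4, "Info": 5}

def triage_order(alerts: list[dict]) -> list[dict]:
    """Sort alerts so critical assets and higher severities are acted on first."""
    def key(alert: dict) -> tuple[int, int]:
        on_critical_asset = 0 if alert.get("host") in CRITICAL_ASSETS else 1
        rank = SEVERITY_RANK.get(alert.get("severity", "Info"), 5)
        return (on_critical_asset, rank)
    return sorted(alerts, key=key)
```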
The Impact of Alert Fatigue
Alert Fatigue can be quite devastating to your IRT. It can cause them to miss critical alerts or overlook alerts because ‘it always does that’, and it creates a ‘cry wolf’ situation where even a genuinely critical alert gets ignored because it comes in all the time and has always turned out to be innocuous. If your IRT is experiencing these, communicate this to your resource owners. They should fix the issue, not just tell you that it’s safe (looking at you, my 2AM wake up call).
Finally, remember that you are part of your team. Make sure that you aren’t over-extending yourself. The same precautions mentioned in this article absolutely apply to you, too.
Having a 24/7 on-call is definitely an important step toward protecting your company and your assets, but keep in mind the health of your team. Make sure your team knows they can come to you and discuss their IRT experience. You won’t know there’s a problem unless you reach out and ask.
I thrive on feedback. Feel free to start a discussion below!