Alerts are a vital part of security, and the challenges of reviewing them effectively at scale are frequently discussed. But managing the detection rules that generate these alerts is just as challenging. Over the last year we’ve completely changed the way we write detection rules at Monzo, to better understand our coverage and respond to new threats faster.
Why we need detection rules
There are two types of security control: preventative and detective. Preventative controls stop a bad thing happening in the first place. Requiring a password in order to log into your email is a preventative control: it prevents people who don’t know your password from logging into your account. But what happens if you’re a victim of a phishing attack and someone tricks you into telling them your password?
One solution is to layer on more preventative controls, for example, requiring a second factor of authentication like a one-time code sent to your phone. But these can also fail—someone can steal your phone or trick you into giving them the code.
Another approach is to accept that preventative controls can fail and build systems to detect this. A notification that you’ve signed into your email account on a new device is a detective control. It doesn’t attempt to stop the bad thing happening, but will alert you if it does.
Good security involves a healthy mix of preventative and detective controls.
How we originally wrote detection rules
When we first started writing detection rules at Monzo, our security team looked radically different than it does today. Back in 2019 we were only a dozen people so it made sense to adopt a process that was optimised for engineers (who made up the majority of the team).
So, rather than introducing any new technologies, we chose to reuse the design patterns and technology used by the wider Monzo engineering squads.
Across Monzo, we make heavy use of microservices written in Go that communicate with each other both via synchronous RPC, and asynchronous message queues.
Our security alerting setup looked a little like this:
Here we have a microservice accepting audit events from GitHub. It sends a copy of these events to long term storage as well as to a security alerting service. This alerting service is where we implemented our security rules as a set of Go functions. These processed each event and could raise a task for a human to review if it was deemed suspicious.
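As a rough sketch of what that looked like (the event fields and function below are hypothetical and simplified for illustration, not our actual code), a rule in this world was just a Go function that took an event and decided whether it warranted a human review:

```go
// Package rules illustrates the shape of a detection rule written as plain Go.
package rules

// GitHubAuditEvent is a hypothetical, simplified view of a GitHub audit log event.
type GitHubAuditEvent struct {
	Action string // e.g. "repo.destroy"
	Actor  string
	Repo   string
}

// DetectRepoDeletion is a hypothetical rule: it flags repository deletions,
// which we'd want a human to review.
func DetectRepoDeletion(event GitHubAuditEvent) (suspicious bool, reason string) {
	if event.Action == "repo.destroy" {
		return true, "repository deleted: " + event.Repo + " by " + event.Actor
	}
	return false, ""
}
```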
This design was very productive for us because:
this design pattern was well established at Monzo so we knew it could scale to the volume of events we wanted to process
the people writing rules (engineers) were already familiar with the language and so could quickly ship new rules
we could benefit from all the existing libraries our Platform teams had written—if we wanted to understand the performance of a rule, we could simply use the same metrics and logging libraries that we’d use for any other microservice
Struggling to scale as our team grew
Over the last couple of years, Monzo’s security team has grown rapidly, both in size (we’re now almost 50 people!) and in the range of roles it includes. Our original design failed to keep up on either front.
In many cases our security operations analysts (who review the alerts these rules generate) are best placed to spot opportunities for new rules, or ways our existing rules could be improved. But without familiarising themselves with our engineering processes they couldn’t contribute directly: they always needed an engineer to make the change for them.
Even for engineers, the process of adding a rule wasn’t as fast as we wanted. Our rules were written to consume events from a message queue, and so there was no way for us to test a new rule against all the historical data we’d collected. To avoid the risk of a bad rule being deployed and swamping us with spurious alerts, we had to first deploy rules in a test mode to evaluate their performance before upgrading them to “real” alerts.
It became hard to keep track of our ever-growing catalogue of rules. Since they were written in Go, we found it hard to analyse them and answer questions about our detection coverage. Did we already have a rule to detect a given suspicious behaviour? What detection rules are reliant on a particular logsource?
Our new detection rule language: Sigma
For our new and improved detection rules, we decided to write them in a language called Sigma.
Unlike Go, Sigma is a language designed specifically for detection rules, which makes it much faster to write and easier to work with. Rules are written in YAML, a common and widely supported format, so even someone who hasn’t seen Sigma before can pick it up quickly.
Sigma gave us flexibility. Other detection rule languages tie you to a specific vendor or technology. For example, if you write your rules using ElastAlert, you’re completely dependent on Elasticsearch to run them. In contrast, Sigma aims to be portable across a range of different implementations.
A basic Sigma rule looks like this:
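For illustration, here is a sketch of what such a rule might look like, assuming a macOS process-creation logsource with the standard Image and CommandLine fields (the exact field names depend on your log pipeline):

```yaml
title: Silent screenshot on macOS
status: experimental
description: Use of the built-in screencapture tool with the -x flag, which suppresses the screenshot sound
logsource:
    product: macos
    category: process_creation
detection:
    selection:
        Image|endswith: '/screencapture'
        CommandLine|contains: '-x'
    condition: selection
level: medium
```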
This alerts when the built-in screenshot tool on macOS is used to take a screenshot with the screenshot sound disabled (a tactic malware might use for reconnaissance without tipping off the user that something odd is happening).
This looked really promising as the new language for our detection rules because it's:
very quick to write, with minimal boilerplate. And because it’s a pseudo-standard, it’s more likely that people will already be familiar with it as a detection rule language
rich in easy-to-parse metadata (like tags and the logsource the rule is consuming from) which would make it easier to understand our rule catalogue as it continued to grow
But, while we really liked the look of Sigma, there was still the challenge of how to start adopting it. Sigma is how you write your detection rules, but it’s supposed to be used alongside a SIEM (Security Information and Event Management) product that can actually evaluate rules and raise alerts. Using the provided `sigmac` tool, you can convert Sigma-formatted rules into your SIEM’s query language.
Although Sigma is portable across many different SIEM products, at the time, we didn’t have anything that was compatible and deploying a SIEM would be a large infrastructure project distracting us from our immediate challenge: detection rule writing. So, to start trialling Sigma, we wrote our own implementation that could fit into our existing architecture.
Writing our own implementation of Sigma
To start our migration to Sigma, we wrote our own library (sigma-go) for evaluating the minimum subset of the Sigma language that we needed. It could take a basic Sigma rule and tell you whether it matched a given event.
With this library, we could take our existing security alerting microservices and simply swap out the Go logic with calls to the sigma-go library.
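As a sketch of what that swap looks like in practice, here’s a minimal example that parses a rule and evaluates it against a single event. It assumes sigma-go’s parsing and evaluator API is roughly of this shape; the exact package paths, types and signatures are best checked against the library itself.

```go
package main

import (
	"context"
	"fmt"

	"github.com/bradleyjkemp/sigma-go"
	"github.com/bradleyjkemp/sigma-go/evaluator"
)

// ruleYAML embeds the silent-screenshot rule from earlier for the example.
const ruleYAML = `
title: Silent screenshot on macOS
logsource:
    product: macos
    category: process_creation
detection:
    selection:
        Image|endswith: '/screencapture'
        CommandLine|contains: '-x'
    condition: selection
`

func main() {
	// Parse the rule once at startup...
	rule, err := sigma.ParseRule([]byte(ruleYAML))
	if err != nil {
		panic(err)
	}
	eval := evaluator.ForRule(rule)

	// ...then evaluate it against each event as it arrives off the queue.
	event := map[string]interface{}{
		"Image":       "/usr/sbin/screencapture",
		"CommandLine": "screencapture -x /tmp/grab.png",
	}
	result, err := eval.Matches(context.Background(), event)
	if err != nil {
		panic(err)
	}
	if result.Match {
		fmt.Println("raise a task for a human to review")
	}
}
```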
This let us try out writing Sigma rules without any changes to our architecture. If we found it wasn’t the right fit for our needs, we could simply delete the sigma-go code, and look into other solutions. There wouldn’t be any infrastructure to tear down or software contracts we’d be stuck with.
But this experiment confirmed that Sigma was a good choice for the future of our detection rules, and so we invested more effort in making it the primary way we write rules.
Expanding our Detection as Code tooling
Moving from writing detection rules in Go to writing them in Sigma was a big win for productivity, but it meant we lost many of the CI checks we were used to. Sigma didn’t have any linters for checking rules were formatted correctly or test frameworks for ensuring rules actually worked as expected.
Thankfully, these were straightforward to build. The YAML-based language was easy to format, so we created sigmafmt to format and lint our rules. And because sigma-go lets us evaluate rules without any additional infrastructure, we built sigma-test, a small wrapper that lets us unit test our rules.
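The idea behind unit testing a rule can be sketched as a plain Go table test against the sigma-go evaluator; sigma-test wraps this up more conveniently. The rule, test cases and API shape below are illustrative only.

```go
package rules_test

import (
	"context"
	"testing"

	"github.com/bradleyjkemp/sigma-go"
	"github.com/bradleyjkemp/sigma-go/evaluator"
)

// ruleYAML is a compact copy of the silent-screenshot rule used in earlier examples.
const ruleYAML = `
title: Silent screenshot on macOS
detection:
    selection:
        Image|endswith: '/screencapture'
        CommandLine|contains: '-x'
    condition: selection
`

func TestSilentScreenshotRule(t *testing.T) {
	rule, err := sigma.ParseRule([]byte(ruleYAML))
	if err != nil {
		t.Fatalf("failed to parse rule: %v", err)
	}
	eval := evaluator.ForRule(rule)

	cases := []struct {
		name      string
		event     map[string]interface{}
		wantMatch bool
	}{
		{
			name: "silent screenshot should alert",
			event: map[string]interface{}{
				"Image":       "/usr/sbin/screencapture",
				"CommandLine": "screencapture -x /tmp/grab.png",
			},
			wantMatch: true,
		},
		{
			name: "normal screenshot should not alert",
			event: map[string]interface{}{
				"Image":       "/usr/sbin/screencapture",
				"CommandLine": "screencapture /tmp/grab.png",
			},
			wantMatch: false,
		},
	}

	for _, tc := range cases {
		t.Run(tc.name, func(t *testing.T) {
			result, err := eval.Matches(context.Background(), tc.event)
			if err != nil {
				t.Fatalf("evaluating rule: %v", err)
			}
			if result.Match != tc.wantMatch {
				t.Errorf("got match=%v, want %v", result.Match, tc.wantMatch)
			}
		})
	}
}
```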
With these automated checks, we put together a lightweight, GitHub-based approvals process that gives us confidence in our rules, without slowing us down. When someone wants to add or change a rule they go through this flow:
Firstly, they make their change on a branch and open a pull request.
Our automated checks verify the rule is formatted properly and run any tests that have been included.
Once the checks have passed and someone else from Security has approved the change, it can be merged.
The merged change is automatically deployed to production.
The future for our detection rules
We now have rules that are significantly faster to write, easier to deploy, and far easier to maintain. For new detection rules, Sigma is our default choice and we’re working on porting many of our existing detection rules to Sigma. The majority of these rules are being written by our analysts with minimal support needed from engineers.
But we’re still investing in making the experience even better. The faster we can get high-quality rules deployed the faster we can protect ourselves from new tactics that attackers are using. One thing in particular we’re looking at is the ability to take new rules and evaluate them over historical data to see how well they’re likely to perform in future.
A great benefit of our new setup is that we’ve separated the secret bit of our detection pipeline (the exact techniques and signatures our rules are looking for) from the technology that runs them. This means we’ve been able to build all of our Sigma tooling (sigma-go, sigmafmt, and sigma-test) as open source projects, and we’ll continue to do so as we improve on them.
If you're interested in joining us to continue this work and much more, check out our open roles for a Security Engineering Director and Offensive Security Engineer.