What is a cron job anyway?
In software engineering, jobs that run on regular schedules are commonly known as “cron jobs”.
At Monzo we use cron jobs to do things like process inbound BACS files, make scheduled payments, or run health checks on our back-end systems. The pattern that we use is pretty straightforward: we expose endpoints in our microservices whose handlers perform the desired workloads, and then we automatically call those endpoints on various configured schedules. This pattern aligns with our general microservices architecture, and means that a workload invoked by a cron job behaves the same way as any other workload. The only difference is that it is initiated automatically, rather than by a human or by some other service.
The history of cron jobs at Monzo
As we’ve grown, the way we handle cron jobs has changed. In our early days, we used long running cron containers, and then migrated to Kubernetes CronJobs.
‘Long running cron containers’ were pods in our Kubernetes (k8s) cluster that were always running, with the sole purpose of calling some endpoint on a regular schedule. This worked well, but was a very inefficient use of system resources as the container would run 24/7, with nothing to do for the vast majority of the time.
Kubernetes CronJobs are the cron offering supported by Kubernetes, and as our services are all deployed on k8s, it was a natural progression for us. With k8s CronJobs, a pod is spun up, performs some work, and then powers down. This is better than something running 24/7, but there is still a lot of platform overhead in allocating resources and starting/stopping pods - and the more cron jobs we have, the more cost we incur.
While both of these approaches got the job done, we were spending energy and money to run pods that ultimately made a single request, and so even though k8s CronJobs were more efficient than long running containers, they were still inefficient for the way we’d use them. However, the thing that really tipped the scales and encouraged us to build something ourselves was the engineering experience of creating and operating cron jobs.
Where’s the magic?
At Monzo, we care deeply about providing a first class engineering experience. We create tooling that turns clunky, multi-step processes into simple, easy-to-execute commands with any laborious grunt work automated away behind the scenes. We also sprinkle a generous amount of “Monzo Magic” on these tools to make things simple, intuitive, and delightful - things like tips and tricks on startup, or seasonal emojis in our CLI prompt.
To put into context how undelightful creating cron jobs was, here’s an example of the configuration an engineer would need to write to create a cron:
30 * * * 1-5 sh /run.sh
The ‘30 * * * 1-5’ part of this line is a crontab expression meaning “at minute 30 of every hour, Monday to Friday”; the rest of the line says to call a particular shell script at those times. The shell script itself would make a cURL request to a service on our platform which has all the business logic of what needs to be executed.
K8s CronJobs also trigger on UTC without any support for timezones, which often made them painful for us to deal with. Suppose in this example we want our cron to be called at 15:30 in the UK’s local time; for 7 months of the year the UK shifts to British Summer Time, which is 1 hour ahead of UTC. This means that during BST, our cron jobs would fire at 16:30, not 15:30. To get around this we had to do things like execute the cron every hour and check if it was the hour we wanted… and if it wasn’t, do nothing.
The cron system was also very opaque, and engineers had no easy way to know if their cron had been created successfully, if it ran successfully, or what went wrong if it failed. This meant that we had to write our own alerting and monitoring solutions within each handler, as we wouldn’t inherit any from the system itself.
Let’s build better
We had a strong conviction that this was a problem worth solving, so we spent some time thinking about what our dream cron system would look like and came up with some high level ambitions.
We wanted to:
Write crons in Go, the same language that we write all of our services in
Express, in simple language, that a cron should run every X minutes/hours/days, rather than using a crontab expression
Get Slack notifications when crons were created, updated, or failed
Have fine grained control over things like retries, timeouts, and whether to page on failure
Define crons within the services they’ll call, so that their creation, deployment, and updates happen as part of our regular service deployments
Build delightful CLI tooling so that people can easily list crons, get full details for a single cron, and even do things like pause crons with minimal effort (very useful in a disaster scenario!)
We already have a deployment service that we can extend to register crons, and a “squad inbox” service that we can use to send Slack messages to teams, so the only bit we really needed to build was a “cron service” that would take care of storing crons, then scheduling and executing them.
At a high level, this is what we’d be looking at putting together:
We also built a new interface that engineers can use to define crons in Go, and exposed it through a library. Here’s what it looks like - this first example is how we’d define the same complicated cron from the example I shared earlier:
cron.Config{
    CronName: "some-job",
    Description: "Call service foo at 15:30 in Europe/London",
    Request: fooproto.SomeRequest{},
    Schedule: cron.Schedule{
        Crontab: "30 15 * * 1-5",
        Timezone: "Europe/London",
    },
}
Much simpler, right?
Here’s an example of an even simpler cron that does something every 10 minutes and so doesn’t need a crontab expression at all:
cron.Config{
    CronName: "some-simple-job",
    Description: "Call service foo every 10 minutes",
    Request: fooproto.SomeRequest{},
    Schedule: cron.Schedule{
        OncePerDuration: 10 * time.Minute,
    },
}
There is a subtle superpower to the OncePerDuration option here. Crontabs need to be defined based on hours and minutes, which means that if you want a cron to run every 10 minutes, you express something like */10 * * * * which means “every tenth minute”. But if we have lots of crons that we want to run every 10 minutes, they’ll all overlap on minutes 0, 10, 20, 30, etc.
Worse, if we have hourly crons, they’ll overlap on the zeroth minute. This “top of the hour” problem meant that our backend systems would get a burst of cron traffic every hour on the hour. OncePerDuration solves this entirely as we schedule executions starting from the cron’s creation time, which is already pretty random. This means that even with hundreds of OncePerDuration: 10 * time.Minute crons, they’ll be naturally spread out.
You can solve the “top of the hour” problem with crontabs, but the syntax is not intuitive, and it would require engineers to manually select minutes that are sufficiently well spread apart. We don’t want our engineers to have to spend any time thinking about what minute they should select for their crontab offset - we want something that just works, and that is intuitive - and OncePerDuration ticks those boxes.
And lastly, here’s a cron with some fine grained control over what to do in failure scenarios:
cron.Config{
    CronName: "some-highly-configured-job",
    Description: "Call service foo every hour",
    Request: fooproto.SomeRequest{},
    Schedule: cron.Schedule{
        OncePerDuration: 1 * time.Hour,
    },
    FailureSemantics: &cron.FailureSemantics{
        RpcTimeout: 5 * time.Minute,
        ExecutionWindow: 20 * time.Minute,
        PageOnFailure: true,
        RunbookURL: "https://example.com/runbook",
    },
}
Notice that the failure semantics are human readable. We’ve tried to maintain this property in all aspects of our new cron interface (which we refer to as cronfig) so that crons are easily understandable at a glance, and by extension, much more accessible - including for Monzonauts who aren’t engineers.
Putting it all together
Once we were happy with the new interface, we came to the final piece of the puzzle, which was how we’d take that Go and turn it into an actual cron that would be scheduled and executed. For this, we reached for another of the common patterns that supercharge our velocity at Monzo - code generation.
We use the Go definition of crons within a service to generate their JSON representation and then, at deploy time, we make a request to our crons service telling it about all the crons that exist in the service being deployed. If nothing has changed, nothing happens, but if a new cron is added, one is updated, or one is deleted, the crons service will take the appropriate action, including recalculating any schedules or notifying the service owners of any changes.
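The reconciliation step described above can be sketched as a simple diff between what the cron service already knows and what the deployment declares. The `Cron` and `Changes` types and the `diff` function are hypothetical stand-ins, not the real service's data model.

```go
package main

import (
	"fmt"
	"sort"
)

// Cron is a hypothetical, pared-down stored representation of a cron.
type Cron struct {
	Name     string
	Schedule string
}

// Changes lists the names of crons that need action after a deployment.
type Changes struct {
	Created, Updated, Deleted []string
}

// diff compares the crons stored for a service with those declared by the
// deployment. An empty result means nothing has changed.
func diff(stored, declared map[string]Cron) Changes {
	var c Changes
	for name, d := range declared {
		s, ok := stored[name]
		switch {
		case !ok:
			c.Created = append(c.Created, name)
		case s != d:
			c.Updated = append(c.Updated, name)
		}
	}
	for name := range stored {
		if _, ok := declared[name]; !ok {
			c.Deleted = append(c.Deleted, name)
		}
	}
	// Map iteration order is random, so sort for stable notifications.
	sort.Strings(c.Created)
	sort.Strings(c.Updated)
	sort.Strings(c.Deleted)
	return c
}

func main() {
	stored := map[string]Cron{
		"old-job":  {Name: "old-job", Schedule: "1h"},
		"kept-job": {Name: "kept-job", Schedule: "10m"},
	}
	declared := map[string]Cron{
		"kept-job": {Name: "kept-job", Schedule: "5m"},
		"new-job":  {Name: "new-job", Schedule: "24h"},
	}
	fmt.Printf("%+v\n", diff(stored, declared))
}
```

Each non-empty list would then drive the appropriate action, such as recalculating a schedule or notifying the owning team.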
Coupling cron registration to deployments in this way is another subtle superpower of the new system. Because we register crons on every deployment, if we want to add a new config option and polyfill some new default, we can do that within the cron service and rely on the fact that our users will inherit the change on their next deployment, even if their cron configuration hasn’t changed. And because our services get deployed a lot, we can let the polyfilling happen naturally rather than needing to ask cron owners to do anything.
Every 60 seconds, the cron service polls to see if there are any crons that are due to be executed, and if so, it executes them and calculates their next scheduled execution. If anything goes wrong, we still schedule the next execution, but we also alert the owners of the cron and let them deal with things from there.
And that’s it!
The results
Since we started this project, we’ve seen fantastic uptake of, and very positive feedback about, the new way of handling crons at Monzo, which has been hugely rewarding.
We’ve had contributions from other engineers who have added wonderful features like the ability to add “jitter” so that a cron doesn’t run on a fully predictable schedule which someone might become reliant on.