Building a Monitoring Stack for High Availability: A Real-Life Story

Published in Coder stories

Sept. 18, 2019

Max Hédouin

Technical lead @ Sqreen

At Sqreen, we’re big proponents of visibility. Sqreen is the industry’s first provider of Application Security Management (ASM), unifying application security needs on one single platform. Bringing visibility into application security is a major driver for our product, and it’s a big part of our culture as well.

When you visit our Sqreenhouse in Paris or San Francisco, you’ll notice that we always have our monitoring dashboards prominently displayed. We’ve built an integrated stack to keep an eye on a lot of different things, from our users’ interactions in our application to our infrastructure usage. The reason behind this is simple: We want to make data-driven decisions in order to provide our users with the best experience possible. And what worse experience is there for users than not being able to use our services at all? These situations can arise in many ways, ranging from a simple bug blocking functionality to infrastructure issues. But no matter what the cause is, we do our best to minimize and avoid these issues.

Human beings aren’t perfect and make mistakes, so it’s inevitable that every software engineer has been the root cause of a production issue at least once. Things happen, so we need to thoroughly monitor our services. This isn’t about blaming an individual, but rather about figuring out whether we can add a check to prevent someone from making a similar mistake in the future.

Learning from our mistakes and our imperfect processes is important, so we use a variety of tools to add checks and balances to the development process before code makes it to production, and to validate the proper behavior once it is there. Here we’ll review some of the tools we have in place and run through a real-life example of how these tools helped us, from the start of an incident to the implementation of the fix, and how we validated the fix so that the issue didn’t recur.

Diving into our code lifecycle

To begin, let’s follow our code lifecycle, from inception to serving our customers. At each step, we’ll take a look at the different checks that let us iterate and validate quickly. Fixing a bug when the code is fresh in our mind is always going to be a lot easier than trying to do it months later. Here’s what our process looks like:

  1. No code is pushed directly onto the master branch, so all of it goes through a pull request (PR), and this is where the checks start.
  2. As soon as the PR is open, we check that all automated unit and integration tests are passing, that our global coverage isn’t being reduced, and that the coverage of the new/updated code is higher than that of the existing code. If the tests fail, the PR is prevented from being merged, while the two coverage checks only add a warning when a merge is attempted.
  3. PRs also need to be approved before merging. The automated checks catch low-hanging fruit, as well as regressions on existing functionality if the coverage is high enough, and, combined with the team’s manual review, give us the confidence to move on to production.
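
The merge rules in the list above can be sketched as a small decision function. This is only an illustration of the logic described, not our actual GitHub or Codecov configuration: the `PullRequestReport` fields, the `evaluate` function, and the `required_approvals` default are all hypothetical names and values.

```python
from dataclasses import dataclass


@dataclass
class PullRequestReport:
    """Hypothetical summary of the checks run on a pull request."""
    tests_passed: bool
    base_coverage: float   # global coverage on the target branch, in percent
    head_coverage: float   # global coverage with the PR applied, in percent
    patch_coverage: float  # coverage of the new/updated lines only, in percent
    approvals: int


def evaluate(report, required_approvals=1):
    """Return (mergeable, warnings) for a pull request."""
    warnings = []
    # Coverage checks only warn; they never block the merge on their own.
    if report.head_coverage < report.base_coverage:
        warnings.append("global coverage decreased")
    if report.patch_coverage <= report.base_coverage:
        warnings.append("patch coverage is not higher than existing coverage")
    # Failing tests and missing approvals are the blocking conditions.
    mergeable = report.tests_passed and report.approvals >= required_approvals
    return mergeable, warnings


if __name__ == "__main__":
    report = PullRequestReport(True, 80.0, 79.5, 60.0, approvals=1)
    print(evaluate(report))
```

Keeping the coverage checks as warnings rather than hard failures leaves the final call to the reviewer, which matches the split between blocking and non-blocking checks described above.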

Sometimes, however, this is not enough, and errors appear in production when the new code is shipped. In this case, we receive notifications with the detailed stack trace as well as the application context when the error occurs. With this information, we can easily roll back our code to the previous release, properly fix the issue, and go through the whole process again. This direct feedback loop reduces the time between the development of the feature and the report, making the fix usually straightforward for the developer.

Once the code has reached production, we closely monitor the performance, our infrastructure, and our security activities, verifying that the newly introduced code doesn’t have a negative impact by setting alerts on key metrics. Memory leaks, performance issues, security bugs, and the like are a lot easier to debug if we can pinpoint the exact deployment in which one was introduced.

Having a lot of small deployments is preferable to having fewer, larger ones because it reduces the amount of code we need to review in order to identify a problem. For example, say we introduce a performance issue that only becomes noticeable further down the road as data is slowly migrated to a new format: The functional requirements are met, but the index doesn’t update properly and the performance slowly deteriorates. Once it crosses an unacceptable threshold, we get notified, and we can quickly pinpoint the deployment that introduced the regression and fix the problem. It would be far more difficult to identify the problem without knowing precisely when it first appeared. With bigger deployments, there’s a lot more code to dig through to uncover the source of the problem.
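
As a minimal sketch of that pinpointing step, assuming we have a list of deployment timestamps and know roughly when the metric started to degrade, the suspect is the most recent deployment at or before that moment. The function name and the dates in the usage example are made up for illustration.

```python
from bisect import bisect_right
from datetime import datetime


def suspect_deployment(deploy_times, regression_start):
    """Return the latest deployment at or before regression_start, or None."""
    ordered = sorted(deploy_times)
    # bisect_right gives the number of deployments made up to that moment.
    index = bisect_right(ordered, regression_start)
    return ordered[index - 1] if index > 0 else None


if __name__ == "__main__":
    # Hypothetical deployment history.
    deploys = [
        datetime(2019, 9, 1, 10, 0),
        datetime(2019, 9, 3, 15, 0),
        datetime(2019, 9, 5, 9, 0),
    ]
    print(suspect_deployment(deploys, datetime(2019, 9, 4, 12, 0)))
```

The smaller and more frequent the deployments, the less code sits behind the single timestamp this returns.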

The tools we use for our deployment and monitoring processes

So, now that we’ve established our process, we’ll briefly go through each tool that we use to make that process possible.

  • Our code is hosted and reviewed on GitHub, built with AWS CodeBuild and Jenkins, and our coverage is generated with Codecov.
  • For errors and exceptions, we use Sentry to track software errors and Loggly to centralize our application logs.
  • For monitoring different aspects of our application, we use New Relic for our application performance as well as application monitoring, a combination of Datadog and CloudWatch to monitor our infrastructure, and of course, Sqreen to monitor our security.
  • Finally, we rely on Slack and Gmail for notifications and PagerDuty for incident management.

How our tools can come to the rescue: A real-life story

An important question to ask yourself when evaluating your stack is, “What value are those tools bringing to the table when stuff goes really wrong?” Now that we’ve covered the context and explanations for our monitoring tools, it’s story time! And the story today is about the time our database (MongoDB) went down(-ish).

Let’s start at the beginning, because as our good friend Murphy and his law state, things tend to go wrong at the worst time possible. At 1:57 AM, our Datadog monitoring noticed that one of the three instances in our Mongo cluster had stopped responding, and triggered a PagerDuty incident. It probably wouldn’t have been such a big deal if a read-replica had gone down, but in this case, the master node went down…

At this point, still foggy from getting woken up in the middle of the night, we went back to basics and checked for software errors and, well, there were a ton of them because our database wasn’t responding properly. Also, as our service availability was going down, PagerDuty incidents were coming from our monitoring in New Relic.

In the middle of the night, the goal is not to repair the underlying process, but to find a quick fix that will restore the availability of our services. The options here include increasing memory limits on containers, adding more containers, removing unused collections, restarting machines, freeing up RAM, and increasing I/O capabilities.

The “game” at this point is to identify what is going wrong and see how we can address it quickly. In this case, it appeared that Mongo was swapping more and more (as we noted from our Datadog infrastructure monitoring) and that memory needed to be freed up. So, our obvious next step was to restart the culprit machines and go back to bed, right? Well, not entirely. While restarting the machine did free up the RAM, the initial problem came back to bite us really quickly as the new master started swapping, too, and had issues with keeping up with the load. A more extended analysis was needed there to get it back in working order.

The key finding after a few hours of analysis and exploration was that we also had to increase input/output (I/O), as the outage triggered a retry logic that increased our load as the system came back online, which put more pressure on our already-struggling Mongo cluster.
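
To give a feel for why retries add pressure, here is a back-of-the-envelope model. It assumes each failed attempt is retried immediately, up to a fixed number of times, and that attempts fail independently; the function names and the numbers are illustrative and do not describe our actual retry logic.

```python
def attempts_per_request(failure_rate, max_retries):
    """Expected attempts per request under immediate, bounded retries."""
    if not 0.0 <= failure_rate <= 1.0:
        raise ValueError("failure_rate must be between 0 and 1")
    # Attempt k (0-based) only happens if the k previous attempts all failed.
    return sum(failure_rate ** k for k in range(max_retries + 1))


def effective_load(base_rps, failure_rate, max_retries):
    """Requests per second actually hitting the database, retries included."""
    return base_rps * attempts_per_request(failure_rate, max_retries)


if __name__ == "__main__":
    # Hypothetical numbers: 100 requests/s, database fully down, 3 retries.
    print(effective_load(100, 1.0, 3))
```

Under these assumptions, a database that fails every request with three retries per request sees four times its normal load at the very moment it is trying to recover.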

Being able to understand what is happening during an incident is key to being able to bring back our system’s availability. This incident kept our on-call engineers up for a good part of the night, even with good monitoring solutions in place, so I can’t even imagine what it would have been like if we didn’t have the proper monitoring in place.

Naturally, we don’t like being woken up at night, but providing reliable services and software is our top priority. So the next day (in the afternoon, after catching up on some sleep), we started working on a postmortem. The goal of this document was to understand the system failure and implement steps to prevent it from happening again. Using the same tools to do a thorough analysis, we identified key actions to take to make our system more reliable. They had an impact on various aspects of our stack, and had different levels of complexity.

In this case, the root cause was purely technical, and so were the fixes. For other incidents, the root cause could be a mistake we made, in the code or while performing operational changes, in which case we would have looked for ways to add automated checks to the process to prevent such mistakes from happening in the first place. The answer can never be that we will simply not make the same mistake again, because humans, by definition, make mistakes (everyone can have a bad day after a disturbed night’s sleep or get distracted by personal concerns).

So, once we understood what actions we needed to take, we went back to our usual iteration process: Implementation (using continuous integration/continuous delivery (CI/CD) checks), release, measurement of impact using targeted metrics, repeat. By following this, we were able to implement a few quick wins, which considerably relieved the pressure on our system shortly after that incident, and bought us time to prioritize and implement the longer-term solutions.

What about security monitoring?

As we also monitor security events, we extensively use Sqreen playbooks to slow down attackers. One of the ways we do this is by limiting the use of the password-reset functionality, and blocking the IP once a certain threshold is met.

Once, while we were working on updating our password reset page, we triggered a security response that blocked our own organization’s IP, preventing everyone in our offices from using our application. A quick configuration update to our playbook unblocked our IP and prevented this from happening again.
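
The behavior described here can be sketched in a few lines. This is not the Sqreen playbook API, only a simplified stand-in: the `PasswordResetGuard` class, its thresholds, and the allowlist mechanism are assumptions made for illustration, and the IP addresses come from documentation ranges.

```python
import time
from collections import defaultdict, deque


class PasswordResetGuard:
    """Block an IP once it exceeds max_attempts resets within the window."""

    def __init__(self, max_attempts=5, window_seconds=3600, allowlist=()):
        self.max_attempts = max_attempts
        self.window_seconds = window_seconds
        self.allowlist = set(allowlist)      # IPs that are never blocked
        self.attempts = defaultdict(deque)   # ip -> timestamps of recent attempts
        self.blocked = set()

    def allow(self, ip, now=None):
        """Record a password-reset attempt and say whether it may proceed."""
        now = time.time() if now is None else now
        if ip in self.allowlist:
            return True
        if ip in self.blocked:
            return False
        recent = self.attempts[ip]
        # Drop attempts that have fallen out of the sliding window.
        while recent and now - recent[0] >= self.window_seconds:
            recent.popleft()
        recent.append(now)
        if len(recent) > self.max_attempts:
            self.blocked.add(ip)  # stays blocked until removed manually
            return False
        return True


if __name__ == "__main__":
    guard = PasswordResetGuard(max_attempts=3, window_seconds=60,
                               allowlist={"203.0.113.10"})
    print([guard.allow("198.51.100.7", now=t) for t in range(5)])
```

In this sketch, adding the office address to the allowlist is what keeps internal testing from tripping the block; a real setup would likely also let blocks expire.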

This little anecdote highlights that tuning your monitoring is key, and as with software, it is never complete and there are always new iterations and improvements waiting. Our current processes have not reached their optimal state yet either. For example, one improvement we could make to our deployment process would be to use “canary” releases. The idea is to gradually deploy code to all of your instances. If an unusual number of errors comes from one of the servers (the “singing canary”), then the deployment can be automatically reverted and the errors manually reviewed before they have any impact on customers. The goal with improvements like these is not to be perfect, but to continue to get better with each change.
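
As a rough sketch of the canary idea, the decision can be reduced to comparing the error rate of the canary instance with that of the instances still running the previous release. The function name, the `tolerance`, `floor_rate`, and `min_requests` parameters, and their defaults are all hypothetical; this is not something we run today.

```python
def canary_verdict(canary_errors, canary_requests,
                   baseline_errors, baseline_requests,
                   tolerance=2.0, floor_rate=0.01, min_requests=100):
    """Return 'wait', 'rollback', or 'promote' for a canary instance."""
    # Not enough traffic on the canary yet to judge it fairly.
    if canary_requests < min_requests:
        return "wait"
    canary_rate = canary_errors / canary_requests
    baseline_rate = baseline_errors / baseline_requests if baseline_requests else 0.0
    # Roll back when the canary errors clearly more often than the baseline;
    # floor_rate avoids reacting to noise when the baseline is near zero.
    threshold = max(baseline_rate * tolerance, floor_rate)
    return "rollback" if canary_rate > threshold else "promote"


if __name__ == "__main__":
    print(canary_verdict(30, 1000, 100, 10000))
```

A deployment tool would call something like this repeatedly while widening the rollout, reverting as soon as the verdict is "rollback".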

In this article, we took a look at our monitoring tools and processes at Sqreen and saw how monitoring our services helps us iterate quickly and with confidence. We also saw how effective these tools are, from the identification of an incident to its resolution. We are continuously improving our monitoring and increasing the reliability of our software. All of this enables us to give our users a great experience while using our product!

This article is part of Behind the Code, the media for developers, by developers. Discover more articles and videos by visiting Behind the Code!

Illustration by Blok