"We like to think of Opstrace as open-source distribution for observability"
Sebastien Pahl is one of the cofounders of the company Docker (formerly dotCloud), having previously worked as an engineer at Red Hat, Mesosphere, and Cloudflare. On March 25, 2021, he was our guest for an Ask Me Anything (AMA) session on Clubhouse, during which he discussed his latest project, Opstrace, which he recently launched with Mat Appelman. Initially described as the “open source Datadog,” Opstrace allows you to deploy secure, horizontally scalable open-source observability in your own cloud account, combining open APIs with the user experience of a large service provider. In this article, you will find the main insights gained from this AMA session. To listen to our future AMAs live, join the Welcome Tech club on Clubhouse!
The reasons behind Opstrace’s creation
I met Opstrace’s cofounder, Mat Appelman, at Mesosphere, along with a lot of our team. We wanted to solve new problems in the infrastructure space, and it seemed like a good time. So we started talking to a lot of companies about how they used observability and how they monitored and logged their systems, and we saw that it wasn’t a solved problem. Everybody was either using a SaaS company, which is fine but expensive, or they had to put something together themselves. And in the world of open source, using cool things like Prometheus requires experts.
But we love open-source software! That’s the kind of software we want to use, and that’s the kind of software we want more people to be able to use. That’s why we decided to create Opstrace: an open-source tool that has more automation, thanks to APIs, making it really accessible and easy to monitor your platforms and systems at scale. It is not just something to start with, but really something you can grow with, without having to be an expert. We like to think of Opstrace as open-source distribution for observability.
The current state of the project
Right now, there are seven engineers working at Opstrace, and the project currently comprises an installer and a controller to basically install and maintain the system. So all you need to get started is a Google Cloud Platform [GCP] or an Amazon Web Services [AWS] account, and you can deploy and install our open-source observability solution there. The kits have repository builds and a command line interface [CLI]. You take the CLI and you give it pod credentials, and then you can deploy a three-node cluster, or more if you need to, that has the APIs you can start sending logs and metrics to. We then provide a CLI command to upgrade. This is the current state of things. And we also built a user interface [UI]. We built some APIs that don’t exist anywhere else, and we’re going to be adding features that don’t necessarily exist anywhere else. For example, there’s nothing in open source today that can do synthetics.
And what we’re currently working on is how to make upgrades work well. We’ve thought about this from the beginning, and although we currently provide a command for upgrades, it’s not that easy. These are also things that we’re working on automating right now. And then, the things that we’re doing in parallel include simplifying certain things that we believe are repeatable tasks. For example, we’ve built something to help users easily collect metrics and logs from various cloud providers in one place.
In addition to building the platform itself, we are experimenting with some companies with a managed version of Opstrace. This basically involves us running it inside their cloud account, and we’re on call, so that they won’t have to be woken up if the system breaks. We know that we can automate it to a point where eventually you will be able to have Opstrace the open-source project, where everything is open, and then the managed service, where we are basically like a SaaS provider, but we will run it in all of the customer’s accounts and not just one central one. But that’s the future.
Integration with other platforms
We’re definitely in the business of using other open-source projects and cloud platforms, since we built this on AWS and GCP to make it work. We also use projects like Cortex and Grafana Loki. What we were focused on is, how do you automate them, how do you put APIs in front of them that, by default, have Transport Layer Security [TLS] or the user’s authentication and authorization? And then we test the system. We run this all in one way, where we have a continuous integration to make sure that you’ll be able to upgrade to the next version. This is another thing that’s not easy to do with open source—how do you make sure that everything works together in one coherent way? That’s what we’re building.
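To make "APIs that have TLS and authentication by default" a little more concrete from the client side, here is a minimal Python sketch that prepares an authenticated HTTPS request. The endpoint URL, the `/push` path, the token, and the JSON payload shape are all hypothetical placeholders chosen for illustration; they are not Opstrace's actual API.

```python
import json
import urllib.request
from typing import List


def build_push_request(base_url: str, token: str, lines: List[str]) -> urllib.request.Request:
    """Prepare (but do not send) an authenticated HTTPS POST.

    Everything about the endpoint and payload here is an assumption
    made for illustration only.
    """
    # Refuse plain HTTP so the bearer token never travels unencrypted.
    if not base_url.startswith("https://"):
        raise ValueError("refusing to send credentials over a non-TLS URL")

    body = json.dumps({"lines": lines}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/push",  # hypothetical path
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",  # per-tenant token (assumed scheme)
            "Content-Type": "application/json",
        },
    )


# Hypothetical endpoint and token; nothing is sent over the network here.
request = build_push_request(
    "https://logs.example.invalid", "HYPOTHETICAL_TOKEN", ["service started"]
)
```

The point of the sketch is only that the secure path is the default one: the client cannot accidentally fall back to an unencrypted or unauthenticated request.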
And one of the fundamental things, when wanting to automate and make sure that we build a platform that is cost effective, is to use an object storage provider like S3. At first, we looked at Thanos, because it is a super-simple thing that you could install next to Prometheus and it then starts sending the Prometheus time series database blocks straight to S3. That led us to Cortex, which at that point was not fully using S3; it was using S3 with Cassandra or DynamoDB. So we started automating around that because we saw that Cortex was going down the road of being fully backed by object storage. It’s horizontally scalable on the query path, it’s horizontally scalable on the write path, and it is pure object storage.
But it can change in the future. There are other things coming along. There are other platforms from other vendors that will be usable. For example, I don’t see anything from Grafana Labs right now that could do pure events-based stuff. There are also things that you could do in the future with engines like PrestoDB. Because, with PrestoDB, you can write back, and so you could query pure S3 across these things. We’re just starting with where the things are and where it works, and then we’ll evolve the platform.
Managing the upgrades
It’s not enough to just build a system that you can put together and configure however you want. You need a system that has the fewest nodes possible, while still satisfying quite a lot of use cases, at least to start with. And you need to make it so that you can always deploy it, install it, and constantly test it. So every release of Opstrace is tested, even once it’s been merged to main. You test and you launch upgrade tests. If you don’t take testing seriously, and if you don’t do that in a very disciplined way, you’re not going to be able to build on top of that foundation.
Then, when it comes to actually upgrading a managed service, down the road, when this works and we’re managing, let’s say, thousands of customers, we will have a relationship as well as metrics. We will hook up these systems and we will use Opstrace to monitor all these other Opstrace instances. So we will know how much traffic goes in. We will also have the controllers of these systems connected to a central place where we will be able to orchestrate what rolls out. We won’t be able to see the data, it will be in the customer’s network, but we still get to have a control plane to manage them. This is how to create something that you can actually upgrade, whether you manage it for the customers, or whether you’re just rolling it out as though it’s the latest Chrome release.
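One common way to get a controlled, browser-style rollout is to place each installation in a stable bucket and raise a percentage threshold over time. The following Python sketch illustrates that general technique only; it is not a description of how Opstrace's control plane works, and the instance IDs and release name are made up.

```python
import hashlib


def rollout_bucket(instance_id: str, release: str) -> int:
    """Map an instance to a stable bucket in the range 0..99 for a given release."""
    digest = hashlib.sha256(f"{release}:{instance_id}".encode("utf-8")).digest()
    # The first 8 bytes of the hash give a well-spread, deterministic integer.
    return int.from_bytes(digest[:8], "big") % 100


def should_upgrade(instance_id: str, release: str, rollout_percent: int) -> bool:
    """Return True if this instance falls inside the current rollout percentage."""
    return rollout_bucket(instance_id, release) < rollout_percent


# Hypothetical fleet of managed instances.
fleet = [f"customer-{n}" for n in range(1000)]
early_wave = [i for i in fleet if should_upgrade(i, "release-a", 10)]
wider_wave = [i for i in fleet if should_upgrade(i, "release-a", 50)]
```

Because the bucket depends only on the release and the instance ID, an instance that was upgraded in the early wave stays included as the percentage grows, and the operator can pause the rollout simply by not raising the threshold.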
How to monetize Opstrace
Once you deploy something inside somebody else’s network where they pay their cloud provider for actually running it, you’re not incentivized anymore to charge them per byte, you’re incentivized to show them exactly how much it costs. You’re incentivized to actually build that into the product: “This is how much it costs and this is how much it’s going to cost.” And then you are much freer to charge them by a model that seems fair: it can be by the size of the infrastructure or per user. There are exceptions, but big infrastructures usually have a lot of users, users logging in, being on alerts, using the data, machine users. You don’t necessarily just have human users, so you can eventually monetize your product this way.
And we are going to have to work on what exactly we will charge our customers for. We could be selling the fact that you don’t have to be on call for your system, and then, the usual support and everything. But that’s too easy to say, so I don’t even go in that direction. We need to be a bit more creative. We could, for example, imagine something around helping people to upgrade their monitoring system, the way browsers have a very controlled rollout of upgrades, or the way CoreOS makes sure that everybody gets upgrades, or Red Hat does for OpenShift. Because the last thing that you want to go down is monitoring.
It could also be helping companies to hook up their system to be monitored correctly. Prometheus has a wonderful system, where you can create error budgets and service-level objectives [SLOs], but you have to learn time series math to do that. We could instead create a few UIs that will do that. I know that other open-source software will hopefully get there, too, but it would be nice to streamline this and help people set up their alerts the right way, or at least 80 percent of the alerts the right way.
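To give a sense of the arithmetic behind an error budget, here is a small Python sketch of the standard calculation: the SLO target determines how many failures are allowed in a window, and the budget consumed is the share of those that have already happened. The function name and the request counts are illustrative assumptions, not figures from Opstrace or from any real system.

```python
def error_budget(slo_target: float, total_requests: int, failed_requests: int) -> dict:
    """Compute how much of an error budget has been used in one window.

    slo_target is the fraction of requests that must succeed, e.g. 0.999.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")

    # The budget is the number of requests that are allowed to fail.
    allowed_failures = (1.0 - slo_target) * total_requests
    if allowed_failures == 0:
        raise ValueError("no traffic in the window, so there is no budget to measure")

    consumed = failed_requests / allowed_failures
    return {
        "allowed_failures": allowed_failures,
        "budget_consumed": consumed,         # 1.0 means the budget is exhausted
        "budget_remaining": 1.0 - consumed,  # negative means the SLO is violated
    }


# Illustrative numbers: a 99.9% SLO over one million requests with 250 failures.
report = error_budget(slo_target=0.999, total_requests=1_000_000, failed_requests=250)
```

With these example numbers, roughly 1,000 failures are allowed and about a quarter of the budget has been used. A UI that asks only for the target and the window, and derives the alert thresholds from this kind of calculation, is the sort of streamlining described above.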
A community around Opstrace
This is meant to be an open-source project first, and then others learn too, because it’s quite ambitious. We can lay the foundations, we can demonstrate the strategy, but there will be developers in companies rebuilding this again and again, with Helm charts and other ways, inside their own infrastructure. So we’re hoping that other people join and participate. We’re going to actively work in terms of communication to achieve this goal.
This article is part of Behind the Code, the media for developers, by developers. Discover more articles and videos by visiting Behind the Code!
Illustration by WTTJ
Tech Editor @ WTTJ