Honeycomb is a tool for introspecting and interrogating your production systems. It’s a new type of tool, designed to bring observability to platforms, microservices, serverless apps, and other increasingly complex systems, all the way down to individual customers.
I had a chance to catch up with Emily Nakashima, Director of Engineering at Honeycomb, to discuss the concept of observability in practice, and dive into how new tools are helping developers troubleshoot the inner workings of complex systems and take ownership of the entire lifecycle of their code.
Tell us about yourself and your role at Honeycomb.
I used to identify as the frontend engineer most likely to be added to the on-call rotation, which probably gives you some sense of how weird my technical path has been. I currently manage the engineering and design teams at Honeycomb, and in the recent past I’ve mostly worked on other SaaS developer tools, including Bugsnag and GitHub. I can find joy working on most types of products, but I am fascinated by all the intricacies involved in shipping good software, so designing and building the technology that helps my peers do that every day is peak nerdy delight for me.
My interests as an engineer skew toward frontend technologies, monitoring, observability, and performance optimization, but I started my journey in the early 2000s, when everyone was a “webmaster” or a “web designer” and I was both. I began with HTML skills and a design degree rather than a traditional CS degree, so I also have a lot of enthusiasm for user experience design and related disciplines.
How has the emergence of microservices and serverless complicated observability?
Microservices and serverless are both helping us ship new software quickly, iterate fast, and control costs. They’ve also helped us build dramatically more complex systems with many new and poorly understood failure modes, and they have made a lot of our traditional monitoring tools less effective. In particular, host-level system metrics and alerts on those metrics have become a whole lot less valuable. Charity Majors, Honeycomb’s cofounder and CEO, has spoken about this often and more eloquently than I will, but I’ve felt this pain over and over with my own systems while serving heterogeneous customer traffic.
There’s always the moment where one small customer contacts you about some problem they are having with your system, and there’s no way to dig down to that one tiny slice of traffic with your metrics. In the old monolith world, maybe you could save yourself by digging through the logs for a single service, but it’s a lot tougher to do that when the request might have been routed through a dozen distinct services and you don’t know where the issue is happening. This is the exact problem that sets the stage for needing observability instead of just metrics — suddenly you need to be able to ask questions about just this one user, just this one request, or just this one path through your infrastructure.
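To make the "just this one user, just this one request" idea concrete, here is a minimal sketch in plain Python. It is not Honeycomb's query engine or API. The event fields (`customer_id`, `request_id`, `service`, `status`), the sample data, and the helper names `where` and `errors_by_service` are all assumptions made up for illustration. The point is only that when each service emits events carrying high-cardinality fields such as a customer ID, you can slice down to one customer across every service the request touched.

```python
from collections import Counter

# Hypothetical wide events emitted by several services. All field names and
# values here are illustrative, not a real Honeycomb schema.
EVENTS = [
    {"request_id": "r1", "customer_id": "cust-001", "service": "gateway", "status": 200},
    {"request_id": "r1", "customer_id": "cust-001", "service": "auth", "status": 200},
    {"request_id": "r2", "customer_id": "cust-042", "service": "gateway", "status": 502},
    {"request_id": "r2", "customer_id": "cust-042", "service": "auth", "status": 200},
    {"request_id": "r2", "customer_id": "cust-042", "service": "sessions", "status": 500},
    {"request_id": "r3", "customer_id": "cust-001", "service": "gateway", "status": 200},
]


def where(events, **filters):
    """Return the events whose fields equal every given filter value."""
    return [e for e in events if all(e.get(k) == v for k, v in filters.items())]


def errors_by_service(events):
    """Count events with a 5xx status, grouped by the service that emitted them."""
    return Counter(e["service"] for e in events if e["status"] >= 500)


# Slice down to one small customer, then ask where their requests are failing.
one_customer = where(EVENTS, customer_id="cust-042")
print(errors_by_service(one_customer))
```

An aggregate metric such as "error rate per host" would average this customer away; keeping the raw fields on each event is what lets the question be asked after the fact.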
How is Honeycomb supporting some of the new responsibilities developers have?
Honeycomb is built by and for teams that write, deploy, operate, and maintain their code end to end, from planning to production. We know not every team is here yet; there are still plenty of teams that throw their software over a wall to someone else to deploy or manage in production. The thing that brings Honeycomb engineers together is a common belief that the best software comes out of a model of software ownership, where the same team writes the code, deploys it to production, and monitors its failure or success once it’s there. Once you start doing that, it’s really valuable to have tools that let you dig into what your code is doing in production from the perspective of the code itself.
The prior generation’s tools — particularly metrics platforms and APM tools — are best equipped to answer “how is my infrastructure doing?” and “how is this application doing on this one set of hosts?” However, we’ve learned that the question we really want to answer most of the time is, “how are individual customer requests doing as they traverse the entire path from the customer’s device to our systems and back?”
As an engineer, when it’s suddenly your job to figure out why a particular canary deploy is throwing errors or why just one customer in Latvia can’t log in, it’s a massive time saver to have tools that let you drill down to the individual request and are aware of when you last deployed and what changed in that deploy.
In the early 2010s, I remember having the sad experience of constantly sending customer bug reports back to the support team with a message like, “Can’t reproduce. Let us know if the customer can provide reproduction steps.” With high-cardinality querying over wide events, our software engineers can now identify what went wrong and fix it before the customer even has time to finish emailing us about it. It feels amazing.
There are definitely teams that may resist this trend toward software ownership. Some folks “just want to write code,” or they are rightly intimidated by all the new skills they might have to learn. I hope that the learning aspect is something we can make easier, which should help more teams start to adopt it in a bottom-up way, but I also expect that over time, companies will see the increase in software quality and decrease in time to incident resolution that can come out of software ownership and will start a top-down push too.
How does Honeycomb instrument observability?
On nearly every engineering team I know, instrumentation work has been a careful handicraft performed by only the most senior, forward-thinking engineers. This tends to work against the goal of achieving observability, where you really want all systems and all new features to have some basic level of instrumentation.
At Honeycomb, we knew we needed to make instrumentation best practices more accessible to engineers so they could be successful with the product. We definitely have a lot of blog posts and documentation on the topic. The basic advice is always to start at the edge and try to capture as much metadata as possible (“wide events”) at each important point in the request lifecycle. We encourage people to start with at least one event per HTTP request to their systems and build from there, although in a tracing universe, you often end up with many, many events (“spans”) per request.
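As a rough illustration of "one wide event per HTTP request", the sketch below wraps a request handler and emits a single JSON event with as much context as it can gather. It uses only the Python standard library and is not Honeycomb's SDK. The function `handle_request`, the `X-Customer-Id` header, the field names, and the `login` handler are hypothetical names chosen for the example.

```python
import json
import time


def handle_request(method, path, headers, handler, emit=print):
    """Run `handler` and emit exactly one wide event describing the request.

    `handler` receives the event dict so it can attach its own fields.
    `emit` is a stand-in for whatever ships events to a backend.
    """
    event = {
        "request.method": method,
        "request.path": path,
        "request.user_agent": headers.get("User-Agent"),
        "customer_id": headers.get("X-Customer-Id"),  # hypothetical header
    }
    start = time.perf_counter()
    try:
        status, body = handler(event)
        event["response.status"] = status
        return status, body
    except Exception as exc:
        # Record the failure on the same event, then let the error propagate.
        event["response.status"] = 500
        event["error"] = repr(exc)
        raise
    finally:
        # Runs on success and failure, so every request yields one event.
        event["duration_ms"] = round((time.perf_counter() - start) * 1000, 3)
        emit(json.dumps(event))


def login(event):
    """Hypothetical handler that adds its own context to the request event."""
    event["auth.method"] = "password"
    return 200, "ok"


handle_request("POST", "/login", {"X-Customer-Id": "cust-042"}, login)
```

Passing the event into the handler is one possible design choice: it keeps all the context for a request on a single record instead of scattering it across separate log lines.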
The blog posts and documentation help, but most engineering teams are busy enough that it can be tough to take time to read all this educational content and then spend the time hand-instrumenting your apps. Beelines are our attempt to roll up a lot of those best practices into an easy-to-use open source library that will automatically instrument apps using the most common languages, patterns, and libraries.
The thing I like about them is that they are both automatic and extensible — so you get events and traces automatically out of the box, but it’s easy to add your own custom spans and context to them as you find there is more information you want to collect about your software. The early Honeycomb team was a lot of engineers with many gripes about the inability to do custom instrumentation in APM-style tools, so we tried to capture all that good APM-style data and leave room for customization too.
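The "automatic and extensible" idea can be sketched with a toy tracer: a decorator records a span for every call without the function author doing anything, while `add_context` and nested `span` blocks leave room for custom instrumentation. This is a stdlib-only illustration, not the Beeline API. The `Tracer` class, its methods, and the `load_cart` example are hypothetical.

```python
import contextlib
import functools
import time
import uuid


class Tracer:
    """Toy tracer: automatic spans via a decorator, custom fields on demand."""

    def __init__(self):
        self.finished = []  # completed spans, in the order they closed
        self._stack = []    # currently open spans, innermost last

    @contextlib.contextmanager
    def span(self, name):
        """Open a span; it becomes a child of the innermost open span, if any."""
        parent = self._stack[-1] if self._stack else None
        span = {
            "name": name,
            "trace_id": parent["trace_id"] if parent else uuid.uuid4().hex,
            "span_id": uuid.uuid4().hex[:16],
            "parent_id": parent["span_id"] if parent else None,
        }
        self._stack.append(span)
        start = time.perf_counter()
        try:
            yield span
        finally:
            span["duration_ms"] = (time.perf_counter() - start) * 1000
            self._stack.pop()
            self.finished.append(span)

    def add_context(self, **fields):
        """Attach custom fields to the innermost open span."""
        if self._stack:
            self._stack[-1].update(fields)

    def instrument(self, func):
        """Wrap `func` so every call is recorded as a span automatically."""
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            with self.span(func.__name__):
                return func(*args, **kwargs)
        return wrapper


tracer = Tracer()


@tracer.instrument  # the automatic part: no manual timing code
def load_cart(customer_id):
    # The extensible part: add fields and child spans as questions come up.
    tracer.add_context(customer_id=customer_id, cart_items=3)
    with tracer.span("price_lookup"):
        tracer.add_context(currency="EUR")
    return 3


load_cart("cust-042")
for s in tracer.finished:
    print(s["name"], s["parent_id"])
```

The inner span closes first, so `price_lookup` appears in `tracer.finished` before `load_cart`, and the two are linked through `parent_id` and a shared `trace_id`.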
What are the most common or interesting use cases today?
I’m a bit biased because I came from a frontend engineering background, but my favorite observability use cases today are ones that span all the way from the client — usually running on a customer’s device, like a laptop or mobile phone — to the server. Charity always says, “Nines don’t matter if users aren’t happy,” meaning that an arbitrary statistic like having 99.999% uptime doesn’t matter if your users can’t successfully use your product.
But how do you know if users are happy? We started to try to answer that question by instrumenting our web application, but a web app running in the cloud somewhere is still many hops away from our end users. And new client-side web technologies like Service Worker suddenly mean that significant user interactions with your app may not trigger an HTTP request to your infrastructure at all.
What innovations on the product roadmap are you most excited about?
Honeycomb started as an event-driven product that mainly displayed time series data, but we realized that the event-based data model supported lots of other amazing visualizations quickly and easily, including tracing! In 2018 we launched a tracing product, and I am so, so excited about using 2019 to better understand how the power of events and traces can be combined.
Tracing products tend to be very depth-first in the way they make you look at systems, and classic event-driven Honeycomb does great with breadth-first inquiries. We are working on building Honeycomb into a tool that lets you seamlessly move back and forth between those broad and deep views, something that can be tricky in other products when you are jumping among logs, metrics, and traces all living in different tools.
Talk a bit about the Honeycomb community. How does Honeycomb support a thriving community?
There are two ways we are enormously lucky here — we have a great community of diehard Honeycomb customers who have evangelized us to their teams and their peers, and we have a number of early engineers on the team who love to teach and help others be successful with Honeycomb.
In particular, the Honeycomb Slack workspace has been a great place to get customer feedback and share conversations around best practices — and we often find that Honeycomb engineers do as much learning as they do teaching. I think Honeycomb’s emphasis on transparency has helped so much with community building — we are open about saying what we don’t know and what problems we haven’t solved yet, and I think we end up with a much better conversation with customers and leaders in the observability space as a result.
More questions for Emily? Let us know in the comments below.