Charity Majors – Observability and the Glorious Future, Iterate 2018
Nate Barbettini: Welcome back to the Build track. Just a little bit of housekeeping: I wanted to mention that even though you're here, there are still great talks going on on the other side, so you don't have to worry about missing them. All of these talks, all the slides, and all the video are going to be up on YouTube very shortly after the conference, so you can check the conference website to see anything that you might have missed.
With that being said, our next presenter has a very impressive background. Charity Majors was the first infrastructure engineer hired at Parse, and after Parse was acquired by Facebook she went on to lead a team of production engineers at Facebook solving some of the toughest infrastructure challenges in the world. She is now the CEO and lead engineer of her own startup, Honeycomb.io, and she's the co-author of the O'Reilly book Database Reliability Engineering.
She is an expert in the world of building distributed systems that won't fall apart at 2:00 AM and wake you up. So please welcome to the stage Charity Majors.
Charity Majors: Did you memorize all that? That's amazing. I don't think I can remember all that shit. Yeah. Hi, I'm Charity. I cannot see anything except for the lights in my eyes but I assume there are people out there. I do these things. You can tell I'm from ops because this is basically how I feel about software. I'm not completely against software but I do believe that it causes problems. I'm a fan of running less software wherever possible. And the only good diff is usually a red diff. Personal preference. Did write a book. If you happen to have this book you'll notice that it's a horse. It was not supposed to be a horse but O'Reilly won't let you do mythical creatures. So I have stickers that will fix it to be a unicorn. I brought them with me, you can just... No one will ever know, it just has a little rainbow unicorn and then rainbow mane. I got it under control.
So I do come from the ops side of the house, but there is this association of monitoring with ops people, and I hate it. I have always hated monitoring. I have always been very much from the databases side of the house, where things are fun and have consequences. And we're not a monitoring company. And I'm not going to be talking about what we're doing at Honeycomb, but it's important to, like... You guys are mostly software engineers, right? Some of my best friends are software engineers, it's fine. Sorry. I think that's hilarious. So let's start with the problems with monitoring. Gregory gave this great talk at Monitorama a couple of years ago, a conference which I have categorically boycotted because, again, it's right there in the name: it's a monitoring conference, it's terrible.
Monitoring is dead because it hasn't changed in 20 years, it really hasn't. Like, we've gone through how many architectural revolutions and fashions in the last 20 years, and monitoring is the same old, same old. It's like you've got your dashboards, and you've got your metrics and some tags to group the metrics, and you've got some checks, right? Every post mortem ends in, "Well, let's create a dashboard so that next time we have this problem we can find it immediately." Good theory, cool theory. This is hyperbole; this is his statement, not mine. I would say that it deserves to die, but it's not dead. No strong opinions to be had here. Any of you on call, generally speaking? I don't know why I'm looking at you, I can't see you. Thank you for that verbal response. Good to know. This is an outdated model.
It's good for what it does but it doesn't actually speak to the problems that we're having developing modern software, most of which boils down to complexity. Now, you all are software engineers, I assume that you are nice computer scientists who went to computer science school where they teach you things like Halstead volumes and cyclomatic complexities. Did they teach you that at computer science school? Okay. Well, I'm a music major dropout, I did not know any of that shit until I started researching because it always bugs me when people go, "Complexity, mwa-mwa-mwa." And you're like, "You're using that word to avoid thinking very hard about what you're actually trying to say." Right? So when I started throwing around the word complexity I'm like, "Okay, I got to understand this."
So I did some research, but honestly, this is the graph that I'm going to draw for you. And I have a graph, so you know that it's true. Okay, let's look at some architecture diagrams. There is the humble, wonderful, beloved-ish LAMP stack. I cut my teeth on LAMP stacks. I've been on call since I was 17, right? I grew up with these. And if you're familiar with Parse, mobile backend as a service, kind of like Heroku for mobile, there is Parse's infrastructure diagram from a couple of years ago.
Some problems are like a tree falling across the highway: you can't predict it, and you're only going to see it when you zoom way in. There are some problems that you'll only see when you zoom way out, like, "The state of Arizona is down!" And there are some problems that are going to be systemic and not really visible, like maybe all of the transformers that were manufactured by this one plant and replaced in 1983 are starting to wear out really fast.
The point of all this is you can't use the same way of thinking about your systems, or you're going to have a bad time. And this, to me, encapsulates the differences between monitoring and observability. Now, if you read Twitter... first of all, stop, don't, it's a terrible habit. Why are you still reading Twitter? Second of all, you'll probably have heard something like, "Well, it's just a marketing term." It's not, actually, because we have 20 years of best practices that we've developed for monitoring systems. Things like "you should not stare at graphs all day" and "your system should inform you when something is broken so you can go check it." Cool. With observability, the best practices are the opposite. It's that you will drive yourself crazy and drive everyone to quit if you page them about every fluctuation and everything that might matter, because you only know if something matters in the context that you bring to it, right?
Some things are spiky and that's fine, sometimes it's not fine that it's spiky and you have no way of telling that programmatically and the best practice is honestly just you have muscle memory: you ship a change, you go look at it. Did what you expected to happen actually happen? Did anything else just obviously jump out at you that also happened? And that's just a handful of things like that where I don't want to fuck with a very robust set of practices that we've developed by just going, "They're all the same." They're not. They're not the same. They're similar. They're first cousins, there's overlap, but not the same. Congratulations, you're now distributed systems engineers. You should probably ask for a raise, I hear they make lots of money.
This is basically the patron saint of distributed systems, because this is the first thing you have to internalize. Your shit is never working. Never actually up. So many problems exist right now, and if you're sleeping well through the night, you probably shouldn't be. Your tools probably just aren't good enough. And you kind of just have to embrace it. It's fun if you like whiskey and gallows humor. The complexity that we're talking about is driven... It's Moore's Law all the way down. Right? The proliferation of types of hardware, of types of software, of polyglot persistence. Remember when the best practice was the database? You had the database, and it was considered terrible practice to run more than one type of database because supporting them is so expensive for your team.
Well almost none of us get away with just one storage layer anymore because there's too much competitive advantage to be gained from some of the awesome advances in data storage. And where known unknowns used to be almost everything. I remember Second Life, when I was just a wee pup, like two or three times a year something would puzzle us. It would be like, "All hands on deck, we have no idea what's going on, something crazy." And the rest of the time it was more like, "Oh, yeah, that, got to go fix that. Oh, yep, that again, we should get rid of this alert." Oh, cool, you'd be on call for a few months and then you kind of had it, you had your repertoire.
At Parse, the reason I started this company is... Around the time that we got acquired by Facebook... I was the first infrastructure engineer, I was there pre-beta, and I took them through the Facebook acquisition to well over a million users. Around the Facebook acquisition, though, we were at about 60,000 mobile apps hosted on the platform. And this is around the time that I began to realize, with sinking dread, that we had built a system that was effectively undebuggable by some of the best engineers in the world. And I tried everything, but we were losing ground. Like 50% of our time, 60, 70, was taken up tracking down one-offs. Right? A user comes in like, "Parse is down, so angry at you!" And I'm like, "Parse is not down, motherfucker. Look at my wall of dashboards, they are all green. Fuck you." Which super helps, right? I'm just losing credibility the longer I argue with them about their experience, but... I would dispatch an engineer, or I'd go track it down.
And it was like I knew I was in for hours, sometimes days, of tracking down these incredibly weird edge cases, it could be anything. And anytime there's a platform that you have to think about one user you failed in some way. So we were failing, a lot. And the thing that helped us finally get on top of our shit was a combination of a tool at Facebook called Scuba, some stuff we wrote in-house... Whatever, like fast forward, slice, and dice high cardinality dimensions in real time, boom. We fixed it. We moved on with our lives. I'm in ops, like on to the next fire. I did not even really stop to think about why it was so transformative. God, I'm just telling the story, aren't I? It's probably all on the slides later, so we can just go fast forward.
But when I decided to leave Facebook, I was planning on just being an engineering manager somewhere. And I'm like, "Oh shit, well surely the rest of the world must have come a long way since I was last looking, right?" Surely! And I went to every company's marketing website, and they all said yes, of course. Well, they haven't. So this is what I reluctantly did instead. It's simple when you can ask the right question; it's incredibly hard when you can't, when you're dancing around trying to knit together things that... NTP is off on this system, so you can't even correlate... It's insane. All right, it's insane.
Monitoring versus observability. Monitoring is traditionally very black-boxy, right? It is very biased towards action, towards actionable alerts, right? You should never get an alert you can't take an action on. Action, action, action!
Observability is more like, well, maybe it's good, maybe it's bad, but I just need to ask a question. Maybe I need to understand it for business purposes? Maybe I'm trying to understand the flow of my code. It doesn't matter; it's very much more about getting inside the head of your software and making it tractable and clear. Okay. Also, it's about drinking a lot. Full honesty.
So the term observability comes from control theory. I have copy-pasted the traditional Wikipedia definition. Mwa, mwa, mwa. I don't really know what that means, but I take this to mean for systems that if you've done your job well you can answer any question, you can explain any state in your entire system without having to ship new code, right? Without having to predict what questions you're going to need to answer.
This sounds very simple, but there are a lot of things that proceed from this that are not entirely obvious. Things about storage formats and how you choose what data to throw away, because believe me, we are all throwing away data, all the time. You know this, right? Right? I hate monitoring, but I love debugging. Like, any day that I get to launch strace and justify it is such a good day to be alive. I also think it's a little ironic: we spent like 10 years telling people, "Because of DevOps, monitoring is not just for operations." Well, yeah, it is. It is. It's how you operate a service, right? Just like instrumentation is how you develop a service. It doesn't mean the same people can't and don't do both, it's just what it means. You can also kind of think of observability as the superset of monitoring and other types of asking questions.
You have an understandable system when your team can track down any problem without having seen it before, and without that being painful. This should not be a dread, pit-of-your-stomach, "Oh god. This is something new?" We have to get rid of that feeling. Let's try some examples. Let's take a very familiar problem. Like photos are slow. Why? LAMP stackers, I assume you all are just like, "Yep." Right? No?
Yep. Excellent. I don't know what I'd do if you were like, "No." You can monitor these things. They're so easy to monitor. Put a threshold on the connections to the database. Check the error output. It's so monitorable. Characteristics of these systems: they're friendly to intuition. Dashboards are amazing. And this last one is subtler and funner, which is that the health of the system is pretty evenly distributed. So in other words, if your uptime is 99.5%, about half a percent of the time somebody's getting an error. This is pretty easy to handle with client-side magic that you guys do and I don't. And it's not that bad, because it's evenly distributed. We'll get back to this in a sec. Best practices. Blah, blah, blah. Let's look at the same questions, taking some real examples from Parse and Instagram.
What exactly am I supposed to monitor? I'm not quite sure. Also, I'm not done. I've got more. Oh, the Heisenbugs. The ones that keep happening until you look, and then just won't ever happen while you're looking. Why? Oh. Oh. Oh. And this one was one of my favorites. They're like, "Push is down." And I'm like, "Push is not down. I'm looking at it." Days go by. Every day they're like, "Push is down. No. Really. People are really upset. It's not just the usual suspects, it seems to really be down." I'm just like, "What?" And I went and dug in and eventually realized that we had added more capacity, and we were using round-robin DNS, and the response had exceeded the UDP packet size. No biggie. It's supposed to fail over to TCP. It did everywhere in the world except one router in Eastern Europe, for whom Push was down. Go figure.
Do you realize how much time we spend talking shit about our users, not respecting them, and not wanting to follow up on their problem reports because it's so painful to do so? It's such a high bar for us to want to go, "Oh god, I guess I'll go track this one down." What if we could just shrink that to, "Click, click, click. Yep. Pow."? You know what the secret is? High cardinality data. Okay, let me explain. High cardinality. Imagine you have a dataset of ten million users. The highest cardinality field is going to be unique IDs of any sort, because there are as many uniques as you have records, right? Somewhat lower cardinality will be last names, first names. Right? Very low cardinality would be gender, and presumably species would be very, very, very low. Like one.
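As a rough sketch of the idea with a hypothetical dataset (not from the talk): cardinality is just the number of distinct values a field can take.

```python
# Hypothetical dataset: 10,000 records, one per user.
records = [
    {"user_id": f"u{i}", "last_name": name, "species": "human"}
    for i, name in enumerate(["Majors", "Smith", "Smith", "Lee"] * 2500)
]

def cardinality(field):
    """Number of distinct values the field takes across all records."""
    return len({r[field] for r in records})

print(cardinality("user_id"))    # 10000 -- unique per record: highest cardinality
print(cardinality("last_name"))  # 3     -- much lower
print(cardinality("species"))    # 1     -- lowest possible
```

The fields that identify one specific user or request are exactly the high-cardinality ones, which is why they're the most useful for debugging.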
So what kinds of information would you, just casually, with your big computer science brains, expect might be the most useful? High cardinality. Well, dashboards can't do that. Metrics can't do that. I'll explain why in a few minutes. Really, the key to getting a handle on our shit at Parse was almost entirely just the ability to break down by any one of a million app IDs, and then break down by any combination of any other values. That's it, but suddenly it's like we had auto-generated all of the dashboards, for all of Parse, for each user. Right? Instead of being the top-10 list that you get if you're lucky, we were just like, "Okay, Disney, you're doing eight requests per second." Well, we're doing 100,000 requests per second. You're never even going to show up, right?
Your error's never going to show. Well, you break down by just Disney's traffic and suddenly you go, "Oh, yeah. Your latency is going up. I see. Oh, you deployed a query. Looks like it's timing out in the login endpoint." Or your dataset grew by another ten rows, and suddenly you went from 29.9 seconds to 30.1 seconds, and now they think it's all down. There are so many things it could possibly be. You need to be able to inspect the actual reality of each user. These are all unknown-unknowns. I feel like the future of debugging, for all of us, looks like this infinitely long tail of things that almost never happen, except that once. You can either think this is awesome or terrifying, probably based on how good your tools are. Characteristics of these systems:
Unknown-unknowns are most of the problems. Or rather, if you find yourself in a system like this and you're still going, "This again? Oh. This again?" you have different problems, and I'm not here to help you with those today. You can't model the entire system in your head. Dashboards are often actively misleading, because everything spikes at once, or everything drops at once. There's no causality. There's no correlation. There's no sense of context and where in time and space you exist. The hardest problem is often identifying which component to debug or trace.
If you have a system that is self-referential, that can loop back in, then a single node or container, or a database instance that's running a snapshot on the live volume, can infect every single node, right? It can be very hard to tell where it's emanating from. And this last one, I love this. Half of the system is irrelevant. It doesn't matter. I'm exaggerating a little bit, but not that much.
Imagine if you're in AWS and you've just done your job, you're distributed across some availability zones, and if one goes down you don't even get woken up. Your users can never tell. Meanwhile, all your dashboards are red and you're 25 or 33% down. Do you care? I don't want to care. I deeply don't want to care. On the other hand, imagine that Disney... Imagine that your entire system is only doing eight requests per second, and four of them are Disney's, four out of eight. You've got to care. You actually care about the health of each individual event, and every high-cardinality slicing and dicing grouping of these.
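To make the Disney example concrete, here is a minimal sketch (invented numbers and hypothetical field names) of why you need to break down by a high-cardinality field like app ID rather than trust the global average:

```python
from collections import defaultdict
from statistics import mean

# Hypothetical events: 100,000 healthy requests spread across many apps,
# plus one small tenant whose logins are timing out.
events = (
    [{"app_id": f"app{i % 1000}", "endpoint": "/classes", "latency_ms": 40}
     for i in range(100_000)]
    + [{"app_id": "disney", "endpoint": "/login", "latency_ms": 2900}] * 8
)

# The global average looks perfectly healthy.
print(round(mean(e["latency_ms"] for e in events), 1))

# Break down by the high-cardinality field, and the pain is obvious.
by_app = defaultdict(list)
for e in events:
    by_app[e["app_id"]].append(e["latency_ms"])

print(round(mean(by_app["disney"]), 1))  # 2900.0
```

Eight bad requests out of 100,008 barely move the average; the per-tenant breakdown is what surfaces them.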
This is also relevant because in distributed systems, it tends to be that not everyone has the same experience, even though they're all using the same components. It's more like, if your reliability is 99.5, well, most people think that you're 99.999% up, and everyone whose last name starts with B-H-E thinks you are 100% down. Or something that's localized. You know, we do all this fancy-ass shit for partitioning and fault tolerance, but what we've done is create a lot of little pockets that can go very unnoticed, for whom the experience can be very bad. And if you can't tell the difference, if you can't slice and dice, if you can't both zoom in from, "I have everything. I want to find where the problem is coming from." And then you need to pivot back to, "Okay. Now show me everyone who shares this. Who else is experiencing this?" If you can't do that, you have two choices.
One is just be comfortable with kind of sucking. And I do not say this with judgment. That should be your first choice if you can get away with it because not sucking is very hard, and expensive, and will take a lot of time and energy. Like I'm in no way saying you should definitely not suck. Do I sound judgmental? Those are your options. You can either suck or you can develop the tooling. Or you can throw bodies at this problem. It will work for a ways. Quite a ways actually. Usually. It depends. Sorry. I'm not really good at delivering good news.
Best practices. It's all about instrumentation, right? It's so much about developing those muscles. This is what separates a senior engineer from everyone else: having operational instincts, and knowledge, and just the reflex.
I wrap every call, right? You start at the edge and you work your way down. Events, not metrics. I think the rest of this talk, I'm just going into each one of these in a little bit more detail. Events are powerful because... All right. So there are two meanings for the word metrics. There's the generic noun meaning things I use to understand my business. That's not what I'm talking about. I'm talking about the metric, which is this thing that we devised back when hardware was incredibly expensive and we were just like, "Oh my god, I cannot write a log line to my 64 MB hard drive." So the metric is born. The metric is a number. That's it. It's a number. And then, in order to find and group our numbers, we pin on some tags. Well, you cannot use high-cardinality values for those tags because of the write amplification.
So you're limited to the number of tags that you can have, which is typically in the hundreds. And you're going to want to assign some of them to like system details. If you have more than a couple hundred hosts or containers, you can't use the ID as a tag either. So spend them wisely. Anyone who has spent any time in this space has blown out their key space. Like just trying to put useful data in their time series database. How dare they? I have a thing against time series databases.
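A back-of-the-envelope sketch of the write amplification she's describing (the numbers are made up): every unique combination of tag values becomes its own time series, so the tag cardinalities multiply.

```python
# Tags with modest cardinality are fine in a metrics system.
hosts = 200
endpoints = 50
status_codes = 5

series = hosts * endpoints * status_codes
print(series)  # 50000 time series: manageable

# Add one high-cardinality tag and the key space explodes.
user_ids = 10_000_000
print(series * user_ids)  # 500000000000 time series: key space blown out
```

This multiplication is why metrics stores cap you at a few hundred tag values, and why a user ID or container ID as a tag blows out the key space.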
Events are powerful because they pack a wallop. They tell you not only a number or a bunch of numbers, but also these all describe the same event. And all of these things were true at once. And if you think about the many hours or days of your life that you have lost to trying to correlate things across different systems, I feel like you're just going to have a gut feeling for how important this actually is.
When I say events, I basically mean structured logs that don't necessarily write to disk. In fact, please don't write to disk if you don't have to; disks are terrible. Disks also cause problems, but structured data? Yeah. You want that. The difference between these unstructured strings that you all like to put in your logs and structured data is like the difference between grep and all of computer science. And you went to computer science school. So fix your fucking logs. Sorry. You have to test in production with these systems, a lot. And testing in production has gotten a bad rap, mostly, I think, because of that really hilarious meme that's like, "I don't always test, but when I do, I test in production." Funny meme, but in fact you test in production whether you admit it or not.
In fact, staging is like the equivalent of the tests, right? You can test some things. The easy things, the really dumb things. You do not know how your code is going to perform in production until it's in production. And I feel like a better metaphor for code and production is fourth trimester. You know how in every other mammalian species, babies are born and they just go toddling off to eat real food. Well, human babies don't. I don't know if this is news to you, but they come out extremely helpless and they can't do much for themselves. And this is how you should feel about your software when you're pushing it to prod for the first time. You should like wrap it up in a nice little feature flag so it does not get hit by all those terrible users out there who don't understand how cute your baby is. It's very cute. Just not to be trusted, but you should... And this advice is for people who have services that are mature enough to have users that count on them.
Again, if you don't have to care about quality, please don't. It's hard and time-consuming. But if you do, then I recommend things like canaries. Roll it out to 10% of people. Facebook always rolls it out to Brazil for some reason. I never found out what they had against Brazil. They give it to 10% of Brazil. Okay. Looks okay. Roll it. Depending on how much engineering time you have to devote to this. I personally find that almost everyone spends too much time trying to make staging work, because it's a pain in the ass, and not nearly enough time thinking about guardrails for baking in and maturing their code like a fine wine in production.
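A minimal sketch of the feature-flag-plus-canary idea (a hypothetical helper, not any particular flag library): bucket users by a stable hash so the same user always gets the same answer, and start the rollout at 10%.

```python
import hashlib

def flag_enabled(user_id: str, rollout_percent: int) -> bool:
    """True if this user falls inside the rollout percentage."""
    digest = hashlib.sha256(user_id.encode()).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket in 0..99
    return bucket < rollout_percent

# Same user, same bucket, every request -- so the canary cohort is stable.
print(flag_enabled("user42", 0))    # False: flag off for everyone
print(flag_enabled("user42", 100))  # True: fully rolled out
```

Hashing rather than random sampling is the important design choice: a user doesn't flip in and out of the new code path between requests, so you can watch the cohort's behavior over time.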
This last one is your reward. I'm not going to lie: the observability tools, tactics, techniques, they ask a little bit more of us up front. If you can get away with intuition, and some monitoring checks, and your LAMP stack, do that. But when you reach the point where you're just like, "I'm dying," or even before then, and you need to invest in observability, your reward is you don't need to page yourselves that much. Maybe the big four, right? Latency, request rate, errors, and possibly saturation. And a couple of end-to-end checks that traverse the critical path around what makes you money. That's it.
If you trust your tools to give you the answers no matter what it is, quickly, every time, you can get rid of all these terrible things that we're all just kind of embarrassed about and don't talk about.
Like the fact that we get paged by... We get a page about 50 unrelated things, but we know that it means that it's... This is not science. I'm going to skip through it because I really want to save some time for questions. The one thing that I wanted to talk about... I want to talk about sampling real quick. Oh, yes. Here's an example of some potentially useful data. It's all high cardinality. I wanted to talk about ... Oh, yes. Dashboards to find your dashboards with your dashboards. That's a terrible pattern. Every dashboard is an artifact of a past failure. Oh, right. Okay. So sampling, and then teams, and then I'm done.
So every time that you're understanding a system you make a choice about what to save and what to throw away. And you throw away far more than you save. This has been abstracted away from a lot of people by time series databases, which is part of why I hate them because what they do is they select an interval, like a second, and they smush everything that happens over that interval into one number. They're just like "Wee! Only one thing ever happened." That is a lie. Not only is it a lie, it's a dirty lie because you could never ask new questions of that data that you smushed and threw away, right? You don't have access to those events anymore.
Once smushed, you cannot un-smush, right? So they're throwing away everything that's happened to give you high-performing metrics. And I do not think that you can even call it observability if all you have is these time series aggregates, because you can't ask a new question about what happened. You have to have the event. You have to have the request. You have to have something that you can trace back to a real user's experience. However, nobody in their right mind is going to pay for an observability stack that is 100 times the size of production.
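A small sketch of the smushing problem, with made-up numbers: once an interval is collapsed into one aggregate, new questions about that interval become unanswerable.

```python
from statistics import mean, quantiles

# Raw events for one interval: 99 fast requests and one terrible one.
latencies = [40] * 99 + [3000]

smushed = mean(latencies)  # the one number a time series database keeps
print(round(smushed, 1))   # 69.6 -- looks basically fine

# With the raw events, you can still ask a brand-new question later:
p99 = quantiles(latencies, n=100)[98]
print(p99 > 1000)          # True -- the outlier is still findable

# From `smushed` alone, that 3000ms request is gone forever.
```

The average was computed honestly; it just can't be interrogated. Only the raw events can answer a question you didn't think to pre-aggregate.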
So if we're going to choose events, we have to choose what to throw away in order to get that nice, fat, rich event. And I'd recommend sampling. Now, sampling is big with data scientists. They're just like, "Well, obviously you sample everything, because counting every single thing is impossible, and why would you even try?" I like these guys. It's cool, but we're not used to it. Like, we don't have the muscles. We don't have the language. We don't have the habits associated with this, and a lot of people just... They immediately go, "Oh, but that one weird event. What if I don't have it? I've got to keep everything, 'cause I don't know, maybe I bill off of these logs." First of all, stop that. So many people do this. Your billing logs and your operational data must not be the same thing, or you will do very bad things in one direction or the other, or probably both, really. Different strokes for different folks.
But you don't want to sample dumbly, right? You absolutely have the ability to keep all of the things that happen rarely, keep most of the things that happen semi-commonly, and sample heavily the stuff that almost never matters. Or, to rephrase, that matters in aggregate, right? 'Cause when things are good, what do you care about? All you care about is the shape, the direction, the curve, right? What's happening on average. It's only when something goes wrong, or you have a specific question, that you want to go in and find that needle.
And in your brain, try to reframe it away from, "I must find that thing," to, "I must find a representative instance of that thing, or something that is affected by that thing." That's all. It's a pretty easy mental habit. So for web datasets: 200s, right? Very easy to sample. Very good to sample heavily. Except for endpoints that only get hit rarely. Keep all of those. Keep more of them. I don't know. 500s, keep them all, right? When the site goes down you can rely on the server side to cut the top off the mountain. It's fine, but under normal circumstances you want all those details.
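That sampling scheme can be sketched roughly like this (hypothetical thresholds and field names); the important part is recording the sample rate with each kept event so aggregates can be reconstructed later.

```python
import random

RARE_ENDPOINTS = {"/delete_app"}  # endpoints that are rarely hit: keep them all

def sample(event):
    """Return (keep, sample_rate) for one request event."""
    if event["status"] >= 500:
        return True, 1           # keep every server error
    if event["endpoint"] in RARE_ENDPOINTS:
        return True, 1           # keep all traffic to rare endpoints
    rate = 100                   # boring 200s: keep roughly 1 in 100
    return random.randrange(rate) == 0, rate

print(sample({"status": 503, "endpoint": "/login"}))       # (True, 1)
print(sample({"status": 200, "endpoint": "/delete_app"}))  # (True, 1)
# A kept 200 carries rate=100, so it stands in for ~100 requests in aggregates.
```

Because each kept event carries its sample rate, counts and averages can still be computed correctly by weighting each event by its rate.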
For databases, often, like at Parse, the only time a delete got issued was when somebody was deleting their app. So we kept 100% of those, right? Inserts, I think we kept close to 100% of those. But selects? No. Not really. You get what I'm saying, right? And I feel like curating sample rates is the new and vastly improved curating of alerts. Right? It's not waking you up. It's something you don't always get right immediately, but over time, oh my God, it's a great way to deal with this data.
More ranting about aggregates. Yes. Here are some examples of just things that get really easy with this kind of a system. So I feel like when you're debugging with dashboards, it's not really debugging. You're not doing science. You're pattern matching with your eyeballs, right? You're just kind of like flipping through them like, "Was it that one? Is it that? Is it that one? Is it that outage? Does that one? Oh. This one looks-" And the thing is that it takes so much submerged intuition and...
You know how the person that's the best at debugging the system is always the person who's been there the longest? Now, I love being that person, because I have a hero complex. It's good for me, but it's kind of demoralizing to everyone else. That's not very democratic. It's kind of shitty for me too, when I'm in Hawaii. It just doesn't scale. So I've now had the experience three times in a row of being on teams where that was not the case. Where three to six months after joining a team, the best debuggers were not the people who'd been there the longest; the best debuggers were the people who had written those services, right? And I thought a lot about why and how this was, and I think it's because the old style, with dashboards, just asked so much of you, to model so much shit in your head.
And especially when you're like reading lines of code and you have your mental model of how... It takes so much. And you have to be exposed to so many outages, and so many things, to just get that gut feeling of, "I know what it is." Right? You jump straight to the answer. It's the greatest feeling in the world, unless you can't do that, and then it's the worst feeling in the world. But it doesn't take you there, whereas this kind of debugging, it's more like the way we do business intelligence. Right? Instead of jumping to answers and sifting through answers, you're just asking a question, looking at the answer, asking a new question based on the answer. It's like you're following the data where it's taking you, right? And following the breadcrumbs where they take you, while it feels more open ended and terrifying in the beginning, it is actually so much more secure.
And you're just taking so much of what you need to know about the system out of your intuition and into the realm of your tool where you can interact with it. And it's mind blowing. This is another reason that I started Honeycomb because I feel like this has to be the future of engineering. I see too many teams just burn their people out, lean so heavily on their senior folks, and it's just toxic. And I just, it's just so much easier. We can just ask a question. Just like follow the trail. It's not that hard. It's like a puzzle. It's super fun. Anyway the thing to remember is our users don't give a shit about our nines and our fancy-ass metrics. Like they care about their experience and if our tools don't place experience front and center then they're the wrong tools.
Test in production… own their own services. Services need owners, not operators. I feel like the center of gravity is moving absolutely relentlessly towards a general software engineer who sits in the middle, following these external services and APIs, just desperately kind of trying to get something to work. Sorry. You have a glorious profession. And Ops people like me, I mean, we're not going away, but we increasingly live on the other side of an API. And you can't always run over and bug us. And when I talk about software engineers needing to learn Ops, it's with this in mind. That it's empowering. It's reassuring. Well, to everyone around you, but also to you. On-call responsibilities do not have to be death sentences, or life sentences, or really unpleasant at all.
And I know that a lot of people have scar tissue and bad experiences, but this can be experienced as a very empowering and a very fun thing. And yeah. That's my view of the future. Anyway, I think it's table stakes. I do not think that operational skills are nice-to-haves anymore. You would not hire an Ops person who couldn't write code. Right? Not optional. It's part of building software. And I was going to go through all of the things about the development process and how it's observability-driven, all this shit. It's just working ourselves out of a job. And I'll end on that. Woo. Thank you.
Any questions if we have time? I'm happy to talk afterwards too. Just come find me. Also, I have stickers that say very rude things about computers. So I'll put them out there and you can take some if you want some.
Question: So on the topic of testing in production, I think you're able to do that if all of your customers are end users, where a subset of them will probably complain to you, but are there any other…?
Charity Majors: No. No. No. No. You should definitely have users that you can test on. Like your friends.
Question: What if you're a company that has a lot of very stringent SLAs with other companies and if you try to test in production you're going actually cause a hit on your SLA. What do you do in that case?
Charity Majors: So first of all, testing in production, a lot of it is isolating your critical path, making it as dependent on as few things as possible, as resilient as possible, as loosely coupled as possible. You want so many things to be able to fail gracefully, right? That is where the bulk of the good engineering is done. You cannot care about all customers, all users, equally. You cannot. So you have to create fake tiers of users, or free tiers of users, or something, like bots. You have to have people that you can experiment on, is the high-level answer.
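The "tiers of users you can experiment on" idea is essentially a gate in front of risky code paths. Here's a minimal sketch, assuming a made-up `user_tier` attribute and tier names; the function names and the `rollout_pct` knob are hypothetical, not from any specific feature-flag library.

```python
import random

# Hypothetical test tiers: internal accounts, bots, free-tier users.
# SLA-covered customers never fall into these.
TEST_TIERS = {"internal", "bot", "free"}

def should_test_in_prod(user_tier, rollout_pct=0.0):
    """Route test-tier users to the experimental path, plus an optional
    small random slice of everyone else once you gain confidence."""
    if user_tier in TEST_TIERS:
        return True
    return random.random() < rollout_pct

def handle_request(user_tier):
    if should_test_in_prod(user_tier):
        return "experimental-path"
    return "stable-path"  # paying, SLA-bound customers stay here
```

The point matches her answer: the experiment's blast radius is confined to users you've chosen in advance, so stringent SLAs on the rest are never at risk.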
Question: So you talked a little bit about sampling the common use case and saving everything from the rare use cases. Are you able to like decide what is common and what is rare? Are you having a machine figure that out for you?
Charity Majors: I mean, it depends. Usually you have some knowledge of your systems that you can start with, and you're never going to get it completely right, but you usually kind of know where to start, or you just pay for two or three times as much as you know you're eventually going to want, and over time you just kind of prune it. Dealing with unknown data sets is obviously categorically more challenging than dealing with ones that you know. I'm a big fan of the way that we do events: you can have data sets that are 300, 500, thousands of dimensions wide. We don't want to incentivize people to not put any data there. We want them to put in as much as possible, because who knows what's going to be valuable, right? So, yeah. It's always a trade-off.
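The sampling idea from the question (keep everything rare, sample down the common case) can be sketched roughly like this. This is a hypothetical toy, not Honeycomb's implementation: real dynamic samplers recompute rates per time window and store the sample rate with each kept event so aggregates can be reweighted.

```python
import random
from collections import Counter

class DynamicSampler:
    """Keep every rare event; probabilistically drop high-frequency ones."""

    def __init__(self, target=10):
        self.target = target     # keys seen fewer than ~target times are always kept
        self.counts = Counter()  # how often each event key has been seen

    def should_keep(self, key):
        self.counts[key] += 1
        # Rare keys get rate 1 (keep everything); a key seen N times is
        # kept at roughly 1-in-(N / target).
        rate = max(1, self.counts[key] // self.target)
        return random.randrange(rate) == 0
```

Here `key` would be something like `(endpoint, status_code)`: the hot `("/home", 200)` traffic gets sampled heavily, while a rare `("/checkout", 500)` stays at a low count and is always kept.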