
Observability Virtual Conference – Q&A Report from Slack

Be a part of the conversation on CMG’s Slack workspace

 

Thomas Cesare-Herriau asked speaker Austin Parker about his presentation: “I’m curious – how did you arrive at the factor of 5 for efficient observability?”

Speaker Austin Parker replied: “We did user research with our customers and found that the average time reduction was usually around 5x. A lot of that came from big reductions in ‘routine’ checks (so, things that would take 15 minutes would go down to 1 or 2). Complex problems are still complex, but sometimes the context would be extremely helpful. One story I’m fond of is someone who was trying to figure out a DB issue. They used Lightstep to pinpoint the exact DB shard that was having problems in a few seconds (out of thousands of shards), so instead of having to comb through a bunch of logs doing correlation by hand, they could jump right to the server that was having problems and start remediating.”
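As a rough sketch of what that kind of pinpointing relies on (this is illustrative, not Lightstep’s actual instrumentation; the function and attribute names are assumptions), the query path just needs to attach the shard identifier to each span so the tracing backend can group and search by it:

```python
# Hedged sketch: tag DB query spans with a shard id so a slow shard can be
# found by querying traces instead of grepping logs. Names are illustrative.
from opentelemetry import trace

tracer = trace.get_tracer("example.db")

def query_shard(shard_id: str, sql: str):
    # Each query gets its own span; the shard id becomes a searchable
    # attribute across thousands of shards.
    with tracer.start_as_current_span("db.query") as span:
        span.set_attribute("db.system", "postgresql")   # assumed engine
        span.set_attribute("db.shard.id", shard_id)     # e.g. "shard-1042"
        span.set_attribute("db.statement", sql)
        ...  # execute the query here
```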

 

Alex Podelko asked speaker Chris Bailey about the tracing timeline: “As we see from https://en.wikipedia.org/wiki/Application_Response_Measurement v1 of ARM was released in 1996. And that was supposed to be an industry-wide tracing standard. It is interesting why it didn’t succeed, and with all this interest in tracing nobody even mentioned it.”

 

Alex Podelko asked speaker Austin Parker about Java: “Java, for example, has some internal instrumentation – so you may get pretty good insights into what is happening inside using an APM product. What would, say, OpenTelemetry and Lightstep add here?”

Speaker Liz Fong-Jones replied: “It makes it possible to do distributed tracing. Most services aren’t single process anymore. APM is… fine if your service is a Java monolith.”

 

Alex Podelko asked speaker Liz Fong-Jones about differentiation: “Some vendors like Dynatrace say that they can work with distributed configurations (and that they are not APM anymore – AIOps or something like this). How would you differentiate Honeycomb/Lightstep/OpenTelemetry from them?”

Speaker Liz Fong-Jones replied: “It’s really confusing why Dynatrace isn’t talking about otel in sales – see https://www.dynatrace.com/news/press-release/dynatrace-teams-with-google-and-microsoft-on-opentelemetry/. Dynatrace is part of the otel community, and they’ve actually donated some of their Java agent technology. So the answer is… in the future; maybe not today, because people are still migrating, but in the future. OpenTelemetry is how you produce that telemetry data, and you can analyze it with Honeycomb, or with Dynatrace, or with Lightstep. We’d all rather that you not be locked in and not have to instrument more than once. As far as AIOps, I feel it’s mostly marketing snake oil. The New Stack article ‘Observability and the Misleading Promise of AIOps’ https://thenewstack.io/observability-and-the-misleading-promise-of-aiops/ may be helpful.”
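To make Liz’s “instrument once, analyze anywhere” point concrete, here is a minimal sketch using the OpenTelemetry Python SDK; the collector endpoint is a placeholder, and real setups would add authentication headers for whichever backend (Honeycomb, Dynatrace, Lightstep) receives the data:

```python
# Hedged sketch: application code is instrumented once with OpenTelemetry;
# switching vendors means changing the exporter endpoint, not the code.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

provider = TracerProvider()
exporter = OTLPSpanExporter(endpoint="https://collector.example.com:4317")  # placeholder
provider.add_span_processor(BatchSpanProcessor(exporter))
trace.set_tracer_provider(provider)
```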

Amanda Hendley relayed a question to speaker Chris Bailey: “Related to Capacity Management as a practice. Is it replaced by Observability? What about the trend? What about pacct data?”

Speaker Chris Bailey replied: “Observability is the instrumentation that allows you to answer capacity questions. Instrument so that you can make good educated guesses on capacity – predict trends, when to scale, etc.”
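As a toy illustration of Chris’s point (the sample data and the 90% threshold are invented), once utilization is exported as a metric, even a simple linear fit over recent samples supports “when do we need to scale?” questions:

```python
# Hedged sketch: project when a utilization metric crosses a capacity limit.
import numpy as np

def days_until_limit(days: np.ndarray, used_pct: np.ndarray, limit: float = 90.0) -> float:
    """Fit used% ~ a*day + b and solve for the day it crosses `limit`."""
    a, b = np.polyfit(days, used_pct, deg=1)  # least-squares linear trend
    if a <= 0:
        return float("inf")  # flat or shrinking usage: no exhaustion in sight
    return (limit - b) / a

# Example: six daily samples of disk utilization (made-up data).
days = np.array([0, 1, 2, 3, 4, 5])
used = np.array([52.0, 54.1, 55.9, 58.2, 60.1, 62.0])
print(f"~{days_until_limit(days, used):.1f} days until 90% full")
```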

 

Amanda Hendley relayed another question to speaker Chris Bailey: “Hello Chris, I am wondering if there are automation capabilities to discover and connect the dots and the service path, or are those manually built? Is there modeling behind observability?”

Speaker Chris Bailey replied: “Honestly, everyone’s use cases are different, as is what they are trying to solve – so you do have to get in the weeds with your data and use cases.”

 

Jose Adan Ortiz asked speaker Melanie Cey: “What do you mean by Systems Thinking? Interesting concept.”

Shelby Spees (Honeycomb) replied: “This wiki page is a good starting point https://en.wikipedia.org/wiki/Systems_theory.”

Speaker Melanie Cey replied: “Donella Meadows’ essay ‘Leverage Points: Places to Intervene in a System’ http://donellameadows.org/archives/leverage-points-places-to-intervene-in-a-system/ and her book Thinking in Systems https://www.amazon.com/Thinking-Systems-Donella-H-Meadows/dp/1603580557 are good places to get started. It was introduced to me by Ramin Khatibi https://www.youtube.com/watch?v=tlXbnPmhngY.”

 

Alex Podelko asked speaker Melanie Cey: “Why is criticize the second step after analyze? I’d rather say theorize and then criticize (the theory) – or do you mean something else?”

Speaker Melanie Cey replied: “The idea is – thinking critically about the data you’re processing, then creating a theory about how to act on it. It comes mostly from my experience that when people theorize before thinking critically, they do stuff that isn’t based on any rational ideas. But you can criticize the theories, for sure.”

 

Alex Podelko asked another question of speaker Melanie Cey: “‘Thinking critically’ is rather a part of analysis for me – I wouldn’t equate it to ‘criticize’. [I completely agree with your point here, just somewhat confused by the wording.]”

Speaker Melanie Cey replied: “Yeah – I think this language works for my current company culture, and it’s useful for me when I help assess skill sets. I think that very experienced practitioners combine critical thinking and analysis into one step, and pulling them apart has helped me evaluate people who are not having a lot of success.”

 

Ganesh Joe asked speaker Melanie Cey: “Can you share a few references on problem-solving blogs that you have researched?”

Speaker Melanie Cey replied:

Ganesh Joe asked a second question of speaker Melanie Cey: “Is Google the only Bible for the SRE and observability journey? Or the frontrunner?”

Speaker Melanie Cey replied: “I think it’s a good reference – as Liz said, take from it what you think will work for your culture or what you want to make change happen.”

 

Ganesh Joe asked a third question of speaker Melanie Cey: “Are there any common observability Slack channels?”

Speaker Melanie Cey replied: “https://hangops.slack.com/”.

Shelby Spees (Honeycomb) replied: “chaosengineering.slack.com has an #observability channel. Honeycomb invites new users to our Pollinators community Slack: honeycomb.io/signup/free.”

 

Ganesh Joe asked a question of Shelby Spees (Honeycomb): “I have seen multiple videos on Honeycomb… Is this something we can contribute to as beta-testers here with Honeycomb?”

Shelby Spees (Honeycomb) replied: “We don’t currently have a beta program, but we do ask for user feedback in the Pollinators Slack. We also accept contributions on our open source repos. We’ve accepted some significant contributions from users. Speaker Liz is also on the governance committee for OpenTelemetry, and she and speaker Austin Parker host OTel Tuesdays on Twitch.”

Speaker Liz Fong-Jones replied: “See https://www.honeycomb.io/liz/observability-office-hours/ if you’d like a 1:1 meeting. And yeah, if you want to have a sales conversation, talk to [email protected], but I’m not here to sell Honeycomb. I’m here to be a vendor-neutral ambassador for the observability and OpenTelemetry community. (Which is also why I wasn’t taking a dig at Dynatrace earlier, because they are a partner in the OTel community.)”

 

Derek asked a general question: “Are the video presentations today going to be shared by any chance?” 

Speaker Liz Fong-Jones replied: “Yes, there will be. ‘Application Performance Management’ https://www.apmdigest.com/advanced-observability-2 may be handy too.”

Amanda Hendley also replied: “YES! Videos will start being posted this evening. Also, presentations are being posted (as we get them) as well.”

Jose Adan Ortiz also replied: “Great series about Observability.”

Alex Podelko asked speaker Austin Parker about differentiations: “About your explanation of the difference between monitoring/APM and observability: I guess it is another dimension of the problem – managing by deviation (the traditional enterprise approach; anomaly detection is a sexier name) vs. an exploratory approach, when you are looking for something special (not set in advance). Either APM or tracing may be used in the same way. Actually, APM may work better in some environments (say, Java / monolith) due to built-in instrumentation – as with manual tracing you need to decide what to trace in advance?”

Speaker Austin Parker replied: “I think it’s a distinction without a difference, Alex. OpenTelemetry, for example, is taking the approach of making automatic instrumentation a standard part of the project, so that people won’t feel like they’re required to use proprietary agents. If you want to get into more of the nitty-gritty and capture all your function calls and perf/timing/whatever, then sure, that’s a slightly different story, but I’d argue that 90% of the time people are better served by having a standard set of high-level observability tools for distributed request tracing and top-line metrics, and only dipping into profiling when they really need it.”
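For a concrete sense of what “automatic instrumentation as a standard part of the project” looks like in OpenTelemetry, here is a minimal sketch using the Python Flask instrumentor (Flask is an assumed example; other frameworks have analogous instrumentors, and there is also an `opentelemetry-instrument` wrapper that needs no code changes at all):

```python
# Hedged sketch: auto-instrumentation creates a span per HTTP request
# without any manual tracing code in the handlers.
from flask import Flask
from opentelemetry.instrumentation.flask import FlaskInstrumentor

app = Flask(__name__)
FlaskInstrumentor().instrument_app(app)  # spans carry route, status, duration

@app.route("/hello")
def hello():
    return "hello"  # traced automatically; no proprietary agent required
```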

Speaker Liz Fong-Jones also replied: “This. Remember that more data doesn’t always mean more signal, at least not proportional to the cost of storing it.”
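One common way to act on that cost/signal tradeoff is head-based sampling; a minimal sketch with the OpenTelemetry Python SDK follows (the 10% ratio is an arbitrary example value):

```python
# Hedged sketch: keep ~1 in 10 traces, deterministically by trace id, so
# storage cost drops while sampled traces stay complete and representative.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# ParentBased makes child spans follow the parent's sampling decision.
provider = TracerProvider(sampler=ParentBased(TraceIdRatioBased(0.1)))
trace.set_tracer_provider(provider)
```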

 

George Herman asked a general question: “Here’s what I’ve been struggling with. I’m trying to track how much of a physical resource (CPU, storage, etc.) an application/microservice and its associated infrastructure take as we scale. We’re running apps in containers/pods, scheduled by Kubernetes, running on top of Linux, running in a VM on a hypervisor (VMware) on a physical system. I would like to track the physical resources that each app takes. I see the problem that there are multiple layers of abstraction. How do I track the physical resource usage of an app through all the layers to the physical system?”

Speaker Austin Parker replied: “It’s hard to say without knowing why, I guess. Your strategy would be different depending on what you want to use the data for. Is it so you can charge back resource utilization to dev teams? Is it because you’re trying to do capacity planning? I’d suggest, architecturally, you’re often better off figuring some of this stuff out after the fact – like, if you’re trying to manage spend, that’s going to have a different solution vs. other forms of perf/scale testing. And also the cadence you need to know this stuff on – like, if it gets the job done, nothing wrong with Excel, y’know?”
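One piece of George’s puzzle can be read directly at the container layer: on Linux, the kernel’s cgroup accounting exposes the CPU time a container has actually consumed. A hedged sketch (the path below is for cgroup v1; cgroup v2 and different runtimes put the file elsewhere):

```python
# Hedged sketch: measure average CPU cores consumed by this container's
# cgroup over a short interval, from kernel accounting (cgroup v1 path).
import time

CPUACCT = "/sys/fs/cgroup/cpuacct/cpuacct.usage"  # cumulative CPU time in ns

def cpu_cores_used(interval_s: float = 1.0) -> float:
    def read_ns() -> int:
        with open(CPUACCT) as f:
            return int(f.read().strip())
    start = read_ns()
    time.sleep(interval_s)
    return (read_ns() - start) / (interval_s * 1e9)

print(f"{cpu_cores_used():.2f} cores")
```

The same kind of measurement repeats at each layer George lists (pod, node, VM, hypervisor, physical host), which is why correlating the numbers across layers, rather than collecting them, is usually the hard part.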

 

George Herman further added insights on why they track: “There are a number of reasons why we want to track this. One is that there isn’t a true understanding of the resources spent running a microservice-based application. From my early findings, we spend more time in the infrastructure than we do in the app itself. All these layers of abstraction add to the costs. These costs go up as we scale the applications/users. I’ve seen plenty of times where we can get more return from optimizing the infrastructure than the app itself. Getting the necessary data to gain this understanding would require more granular data as well. (I can share a paper from IBM Labs that measures the overhead of a microservices app and compares it to a monolithic app, if you’d like.) As for capacity planning, there are other techniques that can be used to get the information needed, as you pointed out.”