With any digital transformation, technical infrastructure is often the foundation that ambitious innovation relies on. Evolving engineering best practices requires examination - even beyond developer circles. Nathan Chandler, partner and consulting CTO at software development and product strategy firm Vynyl, discusses a vital organizational competency called minimum viable observability (MVO), offering stability where many systems hemorrhage productivity.

Nathan Chandler, Partner & Consulting CTO at Vynyl

What does “minimum viable observability” mean for organizations?

Chandler: So at its core, minimum viable observability is the ability to understand and manage the performance and health of a given system. Specifically, we’re talking about centralized logging, real-time metrics, and error tracing. Tracking this data permits tuning responsiveness while optimizing infrastructure costs. MVO specifically traces errors end-to-end for rapid diagnosis, reducing the risky dependency on fading expertise from key engineers over time.

Systems without that baseline observability struggle with finding the root cause of incidents or issues. Analysis becomes tedious and sometimes the causes aren't even found - they just mark it transient and hope it doesn't recur. Most software produces logs but actually surfacing those logs and understanding them constitutes another critical piece. There are solutions that help centralize and structure logs as a starting point.
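
To illustrate the "centralize and structure logs" starting point described above, here is a minimal sketch using only Python's standard `logging` module. It emits one JSON object per log line so that an aggregator can index the fields. The `JsonFormatter` class, the `build_logger` helper, the `billing` logger name, and the `request_id` field are all assumptions made for illustration; they are not taken from the interview or from any particular vendor's tooling.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object (one per line)."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "time": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # Hypothetical field: a request ID passed in via `extra=`.
            "request_id": getattr(record, "request_id", None),
        }
        if record.exc_info:
            # Keep the stack trace with the record instead of losing it.
            payload["stack_trace"] = self.formatException(record.exc_info)
        return json.dumps(payload)


def build_logger(name: str, stream) -> logging.Logger:
    """Create a logger that writes structured lines to `stream`."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.propagate = False
    return logger


def demo() -> dict:
    """Log one event and return it parsed back from the structured output."""
    stream = io.StringIO()  # stand-in for a shipper to a central log store
    logger = build_logger("billing", stream)
    logger.info("invoice created", extra={"request_id": "req-123"})
    return json.loads(stream.getvalue().splitlines()[0])


if __name__ == "__main__":
    print(demo())
```

In a real deployment the stream would typically be standard output or a file that a log shipper forwards to the central store. The point of the sketch is only that every line carries the same machine-readable fields.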

Then you advance to tracking specific performance indicators and errors. APM solutions like DataDog produce stack traces when bugs appear so engineers see the full trace across the many systems involved in a given request or body of work. This facilitates investigation versus just receiving an alert.
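
The difference between receiving an alert and seeing the full trace can be sketched as follows. This is not how any specific APM product works internally. It is a simplified, assumed stand-in in which a hypothetical `traced` decorator captures the stack trace and a request ID when a call fails, and appends them to an in-memory `ERROR_REPORTS` list that plays the role of the monitoring backend. The `checkout` and `load_price` functions are invented examples.

```python
import functools
import traceback
import uuid
from typing import Callable, Dict, List

# Stand-in for an APM backend; a real system would ship these reports out.
ERROR_REPORTS: List[Dict[str, str]] = []


def traced(func: Callable) -> Callable:
    """Record the full stack trace and a request ID when `func` raises."""

    @functools.wraps(func)
    def wrapper(*args, request_id: str = "", **kwargs):
        request_id = request_id or uuid.uuid4().hex
        try:
            return func(*args, **kwargs)
        except Exception as exc:
            ERROR_REPORTS.append({
                "request_id": request_id,
                "error": type(exc).__name__,
                "message": str(exc),
                # Full sequence of calls leading to the failure.
                "stack_trace": traceback.format_exc(),
            })
            raise  # Still fail loudly; the report is extra context.

    return wrapper


def load_price(sku: str) -> float:
    prices = {"A1": 9.99}
    return prices[sku]  # Raises KeyError for an unknown SKU.


@traced
def checkout(sku: str, qty: int) -> float:
    return load_price(sku) * qty


if __name__ == "__main__":
    try:
        checkout("ZZ", 2, request_id="req-42")
    except KeyError:
        pass
    print(ERROR_REPORTS[0]["stack_trace"])
```

The captured trace names both `checkout` and `load_price`, so whoever investigates can see which call chain produced the error rather than only that an error occurred.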

Does MVO primarily provide value when renovating legacy platforms or does it suit next-gen development as well?

Chandler: Legacy stacks certainly suffer chronic pain without observability. But universal problems like engineering staff turnover or unforeseen interactions inevitably emerge even in new code over time. So issues hiding within the complexity of any project plague all systems eventually.

MVO also facilitates institutional learning - documenting why past decisions were made - so that knowledge persists as systems and people change. Building flawless software is not practical or realistic. When it comes to legacy platforms, the laws governing entropy apply: more often than not, they’ll continue to increase in complexity over time, which brings fragility with it. Any tool that can help mitigate this will be extremely valuable.

Still, new platforms absolutely should enable key metrics tracking and error handling upfront through solutions like DataDog. But, in reality, that doesn't always happen despite familiarity with the concept and need. MVO constitutes necessary hygiene but oftentimes meets organizational resistance because of cost and the perception that it isn’t adding value, since many organizations define progress as the production of new features. There remains a huge backlog of existing software that would benefit from retrospective additions.

Walk us through a common MVO engagement improving root cause analysis for a convoluted legacy platform.

Chandler: That's going to be pretty contextual, but we might first centralize dispersed logs into unified repositories with modern aggregators. The minimal version just extracts local data, while a more advanced installation automatically surfaces stack traces. This structured issue reporting reduces operational panic while expediting investigation and remediation.

Next we construct balanced feedback loops by monitoring key metrics but crucially taking informed actions. For example, tracking cluster capacity is useless without adjustments to maintain headroom. Data should inspire decisions, not simply pile up.
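
The cluster-capacity example above can be made concrete with a small sketch in which a capacity reading is turned directly into a recommended action rather than just being stored. The `CapacityReading` type, the `recommend_action` function, and the 20% and 60% headroom thresholds are illustrative assumptions, not figures from the interview.

```python
from dataclasses import dataclass


@dataclass
class CapacityReading:
    """One sample of cluster usage (hypothetical shape)."""
    used_cpu_cores: float
    total_cpu_cores: float


def headroom_fraction(reading: CapacityReading) -> float:
    """Fraction of total capacity that is currently unused."""
    if reading.total_cpu_cores <= 0:
        raise ValueError("total_cpu_cores must be positive")
    return 1.0 - reading.used_cpu_cores / reading.total_cpu_cores


def recommend_action(
    reading: CapacityReading,
    min_headroom: float = 0.20,  # assumed threshold: below this, add capacity
    max_headroom: float = 0.60,  # assumed threshold: above this, trim cost
) -> str:
    """Map a metric to a decision so the data drives an adjustment."""
    headroom = headroom_fraction(reading)
    if headroom < min_headroom:
        return "scale_up"
    if headroom > max_headroom:
        return "scale_down"
    return "hold"


if __name__ == "__main__":
    print(recommend_action(CapacityReading(90, 100)))  # little headroom left
    print(recommend_action(CapacityReading(50, 100)))  # comfortable range
    print(recommend_action(CapacityReading(30, 100)))  # mostly idle
```

Whether the action is automated or handed to a person is a separate choice. The sketch only shows the loop being closed from metric to decision.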

Depending on the organization's specific situation and tech stack, we may recommend starting with a search and aggregation tool like SumoLogic. This would provide some baseline insight into the operation of the system and allow the organization to start seeing the benefits of increased observability.

More advanced end-to-end and Application Performance Management (APM) tools like DataDog create stack traces automatically when errors occur, pushing these to engineers so they see the full sequence of calls leading to a crash. There's necessary work in instrumenting this visibility. It's not hard, but it requires consistency and organizational discipline around using the data gathered to drive improvements.

When should an organization consider seeking external MVO assistance versus struggling internally?

Chandler: Honestly, it really depends on the CTO/SVP Product/etc’s relationship with their team and available talent pool. As far as warning signs go, those could include engineers constantly extinguishing fires or relying on institutional knowledge from long-tenured engineers who pose a succession risk. Code navigable only by one hero inspires fear rather than future-proofing.

To keep things grounded, I always advise starting this process as soon as possible, before collapse redirects focus into a risky major overhaul. We meet clients with engineers who have worked there 27 years. But obviously knowledge leaves when they eventually do.

Once achieving MVO, what outcomes are typical in a smoother-running IT environment?

Chandler: Suddenly staff have capacity to execute higher-value priorities and roadmapped objectives since constant urgent distractions lessen. Plus improved visibility, tailored alerts, and streamlined issue routing bolster app reliability and experience. But MVO crucially enables knowledge transfer - technical and business process - escaping individual-centric bottlenecks.

There are DevOps metrics constituting operational excellence too. First, are all logs actually centralized, or do they remain dispersed? If centralized, do we proactively audit them or just store them? Are we leveraging technology to surface key signals among the noise? Similarly, have we pulled together system state metrics to guide decisions on costs, performance, etc.?

And for software, can we trace the full context of errors end-to-end when they occur? Are we detecting issues comprehensively, or only finding those customers happen to report manually? MVO smooths coordination between dev and ops teams through accountability. But reliability begets velocity - addressing drag permits pursuing ambition.

Have you seen systems that seemed too far gone? CTOs that assume they’ll need to start from scratch? Have you seen it turn around and work for them instead of against them?

Chandler: Yes, absolutely! It's painful, but this is deferred pain. It's a pain you've chosen to wait till later to address and so, of course, it's going to hurt. People oftentimes undervalue legacy software. We sometimes reflexively jump to let's build it new and better. There's a whole type of engineer that is exclusively that way. They only see the world through the new software they could build. And there's so many things you lose, prematurely throwing away legacy software, that just don't come to mind.

There's this parable - it's called Chesterton's Fence. If you're walking down the middle of a road and you see a fence across it, you shouldn't think, ‘Ah, there is a fence in the middle of this road! Let's take it out!’ The right answer is to ask why it's there in the first place. And legacy code could be that answer, illuminating why it is there in the first place. Oftentimes, there are decisions that are hard to understand just by looking at the code - it has to be taken holistically in terms of how it got there, who asked for it, and who wrote it. This is especially true in companies that have had high turnover. It's easy to assume you have a comprehensive understanding of what it would take to build something new, and the vast majority of the time you don't. Especially if the software has been around for a decade, you do not have the visibility required without spending years in the platform, purposefully digging that stuff out and documenting it in detail. There's a lot of hidden detail in legacy operations, and ‘okay, let’s rebuild this from scratch’ is a decision many companies are making every day. Throw it away, build it new. Without proper preparation and experience, it can be a massive business-killing mistake to do that.

What's your philosophy on implementing new systems versus working with existing platforms?

Chandler: New is hard, especially in the context of a business that already exists. It's different to go from zero to one than it is to go from one to two. And so the company has to change its attitude once it has hit ‘one’, knowing that it is now in the business of operating a business. They oftentimes are now revenue positive and don't have to be concerned about that. And that changes the whole dynamic. It's different when you're with your friends and colleagues, sitting in a living room, hacking away at things. When you were making zero dollars, you could throw away whatever you wanted. Nobody cares. But now you are billing clients, you are providing a service, you are now legally on the hook to provide these products in the way that you agreed to provide them. And so now throwing it away is a massive risk, a massive, tremendous risk that needs to be taken seriously.

Final Words

Minimum viable observability can be a game-changer for organizations looking to create a more stable and efficient IT environment. By implementing a few key practices, companies can reduce downtime, resolve issues faster, and ensure smoother knowledge transfer. While starting fresh with new software may be tempting, it's important to recognize the value and hidden wisdom within legacy systems. Engaging external experts can help navigate technical challenges, break down information silos, and establish sustainable engineering practices. Ultimately, MVO empowers organizations to tackle long-standing pain points, boost reliability, and confidently pursue their goals.