Postmortem of Root Certificate Expiration: 30 May 2020

We had a partial production outage the weekend of 30 May 2020, and we missed a few things, outlined in the next few sections. Since others may encounter a similar incident in the future, we thought it would be worthwhile to share our experience and help others learn as much as we have. We’re addressing where we failed through better customer communication, improved planning for future workarounds, accelerated completion of our CI/CD improvements, and stronger endpoint alerting.

Note: All times listed are in UTC.

On 30 May 2020 at 13:04, a customer alerted us in our public Slack that their LogDNA agents suddenly stopped shipping data. Our on-call SRE began an investigation immediately.

What Happened

The certificate chain for our systems relied on the AddTrust External CA Root. That root certificate expired on 30 May 2020 at 10:48, which caused a certificate expiry error that broke the TLS handshake between some of our customers’ agents and libraries and our systems. Any client that attempted a new TLS handshake with our endpoints using an older certificate authority store or an older TLS implementation failed to connect.

Detailed Times (in UTC)

2020-05-30 10:48

The certificate expired as noted.

2020-05-30 12:59

We received our first customer ticket reporting something wasn’t working correctly.

2020-05-30 13:04

We first received word from a customer in our public Slack that their agents stopped shipping data.

2020-05-30 13:13

The SRE on-call finished verifying the customer report and opened an incident.

2020-05-30 13:52

In analyzing the data coming in for ingestion, we misunderstood the impact of the certificate expiration because of a discrepancy between two different datasets. Our data did not show a complete loss of ingestion traffic, so while the problem was severe, we believed it was isolated to one version of our agent. Based on customer reports, we thought that only the Docker-based image of our v1 agent was affected. As such, we focused on releasing a new v1 agent build and identifying a way for customers running older operating systems to update their certificate lists.

2020-05-30 15:22

We identified the patch for Debian-based systems and attempted to apply the same patch to our v1 agent image. Then, we realized that the problem with the v1 agent was due to how NodeJS manages certificates (described under NodeJS Certificate Management). We focused our efforts on rebuilding the v1 agent with the correct certificate store for both Docker and Debian-based systems.

2020-05-30 19:10

We started validating packages internally before shipping to customers. We also believed we had found the explanation for the lack of any drop in ingestion traffic: agents with existing connections did not need to attempt a new TLS handshake.

2020-05-30 19:13

We pushed a new v1 agent build that allowed customers to restart ingestion on most platforms and documented hotfixes for older Debian-based systems in our public Slack. We did not realize our build had failures on AWS due to a hiccup in our CI/CD process.

2020-05-31

We continued to think that the patched image solved the problem and pointed customers to the new 1.6.3 image as needed.

2020-06-01 13:07

We received reports from multiple customers, with supporting data, that the new 1.6.3 image had issues that caused it to enter CrashLoopBackOff. We discovered a bug in our release chain that prevented us from releasing a new image. We were back at square one.

2020-06-01 15:00

We fixed the bug in our release chain and started generating new packages. An intermittent hiccup in service with our CI/CD provider caused a further delay.

2020-06-01 17:14

We released the new packages and started exploring the option of switching to a new certificate authority. We wanted to remove the need for customer intervention and to avoid issues with how NodeJS vendored OpenSSL (described under NodeJS Certificate Management and Certificate Update Process).

2020-06-01 18:13

We began the process of updating our certificates on testing environments, pushing the changes out to higher environments one by one after testing was complete.

2020-06-01 21:46

We completed a switchover to a new certificate authority and certificate chain. With this fix, all systems immediately resumed ingestion without any further customer action, except those requiring manual certificate chain installation, such as syslog ingestion. Customers with those systems were given further instructions and pointed to the new chain on our CDN.

Key Factors

NodeJS Certificate Management

Along with our libraries and direct endpoints, we currently support two separate agents: the v1 agent, written in NodeJS, and the v2 agent, written in Rust. The v1 agent has been kept on an older version of NodeJS to provide compatibility with older operating systems. That older version of NodeJS uses a default list of trusted certificate authority certificates that did not include the new root certificate, and NodeJS overall does not read from the local system’s trusted certificate authority list.

NodeJS, similarly to Java, ships with a bundled list of trusted certificates to ensure the security of TLS calls. Browsers do the same thing, except they manage their own lists; NodeJS uses Mozilla’s list, as the core developers deferred to Mozilla’s "well-defined policy" to keep the list current. In the past, there were a number of calls for NodeJS to let teams add new certificates or otherwise better manage the certificate store. As of version 7.3.0 (along with LTS 6.10.0 and LTS 4.8.0), the core developers of NodeJS added the ability to include new certificates in that trusted list. Before that release, end-user developers and ops teams had to recompile NodeJS with their own certificate additions or patches. Coincidentally, the NodeJS community raised a request to remove the AddTrust certificate and add the correct root in its place, and that fix landed on 01 June 2020. We discovered this change in the source code itself during the postmortem phase of this incident.

NodeJS also ran into an issue with how it vendored OpenSSL. Different versions of NodeJS used different versions of OpenSSL, and some older versions of OpenSSL gave up when finding invalid certificates in a given path rather than trying alternatives.

How does that work, exactly? To make sure we’re all on the same page, let’s talk about certificate chains. The basic chain involves three pieces: a root certificate from a certificate authority, an intermediate certificate that is verified by the root certificate, and a leaf certificate that is verified by the intermediate certificate. The root certificates, as noted, are generally coded into operating systems, browsers, and other local certificate stores to ensure no one can (easily) impersonate a certificate authority. Leaf certificates are familiar to most technical teams, as they are the certificates a team receives as a result of a request to a certificate authority. Intermediate certificates, on the other hand, serve a few functions. The most important is adding a layer of security between the all-powerful root certificate, with its private keys, and the external world. The function most relevant here, however, is bridging between old root certificates and new ones, known as cross-signing. Certificate authorities release two intermediate certificates, one for each root certificate, and both validate the same leaf certificate.

Older versions of OpenSSL, specifically 1.0.x and older, have issues with this system: they follow the certificate path up the chain once and then fail if that path leads to an expired root certificate. Newer versions of OpenSSL attempt to follow an alternate chain when one is available, as in this case, where an additional cross-signed intermediate pointed to the new root.
This last issue caused problems when we built our systems on different versions of NodeJS that ship with older versions of OpenSSL, and it also caused problems with other systems because of the next point.
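
For illustration, here is a minimal TypeScript sketch, not our agent code, of the two ways a NodeJS client on 7.3.0 or newer can trust a root certificate that is missing from the bundled Mozilla list; the endpoint host name and PEM file path are placeholders.

```typescript
import * as fs from "fs";
import * as tls from "tls";

// Option 1 (no code changes): launch the process with
//   NODE_EXTRA_CA_CERTS=/etc/ssl/extra/usertrust-root.pem node agent.js
// and NodeJS appends that PEM file to its built-in trust store.

// Option 2: pass the missing root (and any cross-signed intermediate)
// explicitly. Note that the `ca` option replaces the bundled list for
// this connection rather than extending it.
const extraCas = fs.readFileSync("/etc/ssl/extra/usertrust-root.pem");

const socket = tls.connect(
  {
    host: "logs.example.com", // placeholder ingestion endpoint
    port: 443,
    servername: "logs.example.com",
    ca: [extraCas],
  },
  () => {
    console.log("handshake ok, authorized =", socket.authorized);
    socket.end();
  }
);

socket.on("error", (err) => {
  // An expired or missing root surfaces here, e.g. "certificate has expired".
  console.error("TLS handshake failed:", err.message);
});
```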

Certificate Update Process

When we updated certificates in the past, we only updated the leaf certificate (the bottommost part of the certificate chain) rather than including the intermediate certificate. Sectigo offered intermediate certificates that cross-signed the AddTrust certificate with the new USERTrust RSA certificate to ensure that older systems remained supported. Because our update process didn’t include intermediate certificates, we missed adding this cross-signed intermediate. As noted, this omission caused errors with OpenSSL and GnuTLS on other older systems, such as older Debian or Fedora builds, not just in our NodeJS build.
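
As a sketch of the corrected behavior, assuming a plain NodeJS HTTPS listener rather than our actual ingestion tier, the server should present the leaf followed by the cross-signed intermediate as one bundle; the file paths are placeholders.

```typescript
import * as fs from "fs";
import * as https from "https";

const options: https.ServerOptions = {
  key: fs.readFileSync("/etc/ssl/private/ingest.key"),
  // NodeJS expects the leaf certificate first, followed by any intermediates,
  // concatenated into a single PEM bundle. Serving only the leaf was the gap
  // described above.
  cert: Buffer.concat([
    fs.readFileSync("/etc/ssl/certs/ingest-leaf.pem"),
    fs.readFileSync("/etc/ssl/certs/cross-signed-intermediate.pem"),
  ]),
};

https
  .createServer(options, (req, res) => {
    res.writeHead(200);
    res.end("ok\n");
  })
  .listen(8443, () => console.log("TLS server with full chain on :8443"));
```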

Endpoint Monitoring

Finally, and possibly most importantly, we did not have any external endpoint monitoring to alert us when this certificate chain broke. We ended up relying on customers flagging the incident for us.
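
A minimal sketch of the kind of external check we were missing, assuming a simple scheduled TLS probe is acceptable; the endpoint names and the alerting hook are placeholders rather than our real monitoring setup.

```typescript
import * as tls from "tls";

const endpoints = ["logs.example.com", "syslog.example.com"]; // placeholders

function probe(host: string): void {
  const socket = tls.connect({ host, port: 443, servername: host }, () => {
    console.log(`${host}: handshake ok, authorized=${socket.authorized}`);
    socket.end();
  });
  socket.setTimeout(10_000, () => socket.destroy(new Error("timeout")));
  socket.on("error", (err) => {
    // In production this would page the on-call SRE instead of logging.
    console.error(`ALERT ${host}: TLS handshake failed: ${err.message}`);
  });
}

// Probe every five minutes so a broken chain is caught within minutes,
// not when the first customer reports it.
setInterval(() => endpoints.forEach(probe), 5 * 60 * 1000);
endpoints.forEach(probe);
```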

Next Steps

Outage and Customer Communications

We weren't as prompt, nor as expansive, as we should have been in communicating with our customers. That is changing immediately, starting with this public postmortem. We are discussing steps to ensure we have the right channels of communication with customers based on incident severity, and we will implement them as quickly as possible. Here’s the most up-to-date version of our plan:

  • Low-severity incidents: Our customers can expect timely status page updates.
  • Medium-severity incidents: In addition to status page updates, we’ll deliver in-app notifications for all impacted customers. Note that we are working on this functionality.
  • High-severity incidents: You can expect the above communications, as well as direct email notice to all impacted customers.

Ultimately, if there is an incident, we owe it to every impacted customer to notify them as early as possible and to provide consistent status page updates. Our customers can expect effective communication moving forward.

Workarounds and Solutions

We know customers rely on us every day to help with troubleshooting, security, compliance, and other critical tasks. We were not fast enough in providing potential workarounds for the full problem. We are starting work on documenting temporary workarounds for future use so we can respond faster during an incident.

We know some customers need a backfill solution for security or compliance. We are aware of the problem and are in the extremely early stages of defining solutions and workarounds for future use. Given the current complexities, we do not recommend attempting to backfill missing logs from this recent incident. Some of the difficulties we need to address for a backfill solution are identifying when the problem actually requires backfill, minimizing impact of any potential option on throughput to livetail, reducing duplicate entries, and ensuring that timestamps are preserved and placed in order. For example, any log management system can see occasional delays when coupled with slower delivery from external systems, and it might appear that a batch of log lines was not ingested when, in fact, they are in the pipeline and would appear on their own without any intervention. As these difficulties could seriously warp or destroy data if not handled properly, we are starting with small, noncritical experiments and ensuring that any process we eventually recommend for any workaround our customers need is thoroughly tested, repeatable, and reliable.
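
As a purely illustrative sketch of two of those difficulties, deduplicating re-sent lines and restoring timestamp order, and not a supported backfill procedure, the core of such a cleanup step might look like this; the LogLine shape is hypothetical.

```typescript
import * as crypto from "crypto";

// Hypothetical shape for a buffered log line awaiting re-ingestion.
interface LogLine {
  timestamp: number; // epoch milliseconds
  line: string;
}

function dedupeAndOrder(lines: LogLine[]): LogLine[] {
  const seen = new Set<string>();
  const unique: LogLine[] = [];
  for (const entry of lines) {
    // Hash timestamp plus content so an identical line intentionally logged
    // twice at different times is kept, while a re-sent duplicate is dropped.
    const key = crypto
      .createHash("sha256")
      .update(`${entry.timestamp}:${entry.line}`)
      .digest("hex");
    if (!seen.has(key)) {
      seen.add(key);
      unique.push(entry);
    }
  }
  // sort() is stable in modern NodeJS, so lines sharing a timestamp keep
  // their original relative order.
  return unique.sort((a, b) => a.timestamp - b.timestamp);
}
```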

Technical Initiatives

We’re improving our CI/CD systems to speed up releases when we need them. We were actually in the middle of this very task, which caused some of the delay in releasing an initial patch as we discovered issues we needed to fix along the way.

Our current monitoring solution does not include external endpoint monitoring. To increase our level of proactive alerting and notification, we have already started talking to vendors to supplement our current monitoring so that we are notified the moment any of our endpoints or any of our own systems goes down. We are also working to identify the right automated solution for checking certificates, one that covers the entire certificate chain rather than just the tail end of it.
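
For example, a whole-chain check could look something like the following minimal sketch, which walks every certificate an endpoint presents and flags any nearing expiry; the host name and warning threshold are placeholders, not our chosen tooling.

```typescript
import * as tls from "tls";

function checkChainExpiry(host: string, warnDays = 30): void {
  const socket = tls.connect({ host, port: 443, servername: host }, () => {
    // Passing `true` asks NodeJS for the full chain via issuerCertificate links.
    let cert = socket.getPeerCertificate(true);
    const seen = new Set<string>();
    while (cert && cert.fingerprint && !seen.has(cert.fingerprint)) {
      seen.add(cert.fingerprint);
      const daysLeft =
        (new Date(cert.valid_to).getTime() - Date.now()) / 86_400_000;
      if (daysLeft < warnDays) {
        console.error(
          `ALERT ${host}: "${cert.subject.CN}" expires in ${daysLeft.toFixed(1)} days`
        );
      }
      cert = cert.issuerCertificate; // a self-signed root links back to itself
    }
    socket.end();
  });
  socket.on("error", (err) => console.error(`${host}: ${err.message}`));
}

checkChainExpiry("logs.example.com"); // placeholder endpoint
```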

Wrap Up

While this certificate expiration incident affected multiple software providers over the weekend, we should have known about the issue proactively, we should have communicated updates more actively during the incident, we should have been faster at identifying and providing workarounds, and ultimately, we should have resolved the issue much faster. Our commitment to you is that we are going to do better, starting with the fixes outlined above.
