AWS re:Invent 2023 highlights: Observability at Stripe, Capital One, and McDonald’s
Last week, I attended the Amazon Web Services (AWS) re:Invent conference in Las Vegas, NV, with 50,000+ others. It was quite a busy week with several keynotes, announcements, and many sessions. While the hot topic at re:Invent was generative AI, I’ll focus my blog post on a few customer sessions I attended around observability: Stripe, Capital One, and McDonald’s.
FSI319 | Stripe: Architecting for observability at massive scale
This is by far my favorite session as Stripe has a massive setup—3k engineers across 360 teams, 500M Metrics, 40k alerts, and 150k dashboards to support its service. With such a large scale, Cody Rioux, a Staff Software Engineer at Stripe, and others first looked at the axioms of observability within Stripe. In their review, they found three things:
- There were only a few dozen unique queries used for most alerts
- They were only using between 2-20% of their telemetry data (metrics, events, logs, and traces)
- Observability data value diminishes as time goes on (e.g. hour old data is more valuable than day-old data)
Once Cody and his team had an understanding of how the observability data was being used, Stripe looked into their architecture and noticed it was database-centric. While this setup is common, it has its trade-offs: the good - it is simple, limited failure modes, and a feature-rich walled garden; the bad - it can’t handle business growth, and creates high cardinality data leading to scalability, reliability, and database issues.
So, what did Cody and Stripe do? Priority one was evolving their architecture to be more scalable, reliable, and cost-effective. From a technical perspective, Cody highlighted five areas:
- Sharding partitions
- Tiered storage
- Streaming alerts
Cody also talked about the cultural shift that needed to take place at Stripe—observability should be self-reliant and easy to do the right thing. Watch this session in depth here.
FSI314 | Capital One: Achieving resiliency to run mission-critical applications
With over 7 million transactions a day for Capital One’s credit card business, achieving a target SLA of 99.9% can be pretty hard when your systems are dependent on each other. The reality is even if each service has a 99.9% SLA, if you have four dependencies, it leads to a 99.5% SLA (100-((.1)+(.1)+(.1)+(.1)+(.1)) - 99.5). When any of those dependencies fail, having a backup service is essential. In the example given, if an address verification system is down during the process of a credit card application, do you resort to seeing if the person already has an existing account to verify their identity?
Sharmila Ravi, SVP of Technology at Capital One discussed the CAP Theorem—Consistency, Availability, and Partition Tolerance—and how critical it is to have these discussions up front. Additionally, Sharmilla took a quote from Werner Vogel, CTO at Amazon, design for failure as “everything fails all the time.” At this point, they transitioned to architecture design with Kathleen DeValk, VP of Architecture, and she summed it up best with the need for operational excellence—testing, observability, resilience, and operations. Having end-to-end visibility is important to discover problems before your customers do. You can watch the session in full here.
PRO201 | McDonald’s & AWS ProServe implement reusable & observable pipelines
AWS Professional Services assessed McDonald’s DevOps capabilities and noticed they had a high toil, onerous security, and limited pipeline visibility. Additionally, they noted non-functional requirements of scalability, reliability, and performance. When Stephanie Koenig, Senior Director of Cloud Infrastructure at McDonald's, was presented with these findings, it became clear that re-architecting and changing the culture were necessary next steps.
Stephanie’s first step was to build a DevOps Platform team—with the goal to build consistent toolchains and workflows for McDonald’s developers. Doing this would reduce departmental silos, improve the developer experience, and improve cross-functional observability within McDonald’s services. Stephanie closed out by talking about what the future DevOps roadmap looks like at McDonald’s with the foreseeable future focused on ensuring a “minimum lovable platform (MLP)” and then expanding the DevOps platform to other teams. You can watch the session at length here.
A Common Theme: You need observability, but just at the right level
Finding the right ratio of observability data is challenging—too much information, and you're swimming in a sea of data, but too little information and you’re scratching your head at what’s happening. Discussing scale, reliability, and consistency when architecting services is vital for the technology and team culture. With Mezmo, you can help find that harmony by using features like data profiler to understand your telemetry data and decide what is useful and what isn’t. Learn more about Mezmo and how you can take control of your telemetry data.