AWS re:Invent 2023 highlights: Observability at Stripe, Capital One, and McDonald’s

4 MIN READ

8 MIN READ

April Yep

April has several years of experience in the observability space, leading back to the days when it was called APM, DevOps, or infrastructure monitoring. April is a Senior Product Marketing Manager at Mezmo and loves cats and tea.

4 MIN READ

8 MIN READ

Last week, I attended the Amazon Web Services (AWS) re:Invent conference in Las Vegas, NV, with 50,000+ others. It was quite a busy week with several keynotes, announcements, and many sessions. While the hot topic at re:Invent was generative AI, I’ll focus my blog post on a few customer sessions I attended around observability: Stripe, Capital One, and McDonald’s.

‍

*Looking at the good and bad of Stripe’s architecture*

FSI319 | Stripe: Architecting for observability at massive scale

This is by far my favorite session as Stripe has a massive setup—3k engineers across 360 teams, 500M Metrics, 40k alerts, and 150k dashboards to support its service. With such a large scale, Cody Rioux, a Staff Software Engineer at Stripe, and others first looked at the axioms of observability within Stripe. In their review, they found three things:

‍

There were only a few dozen unique queries used for most alerts
They were only using between 2-20% of their telemetry data (metrics, events, logs, and traces)
Observability data value diminishes as time goes on (e.g. hour old data is more valuable than day-old data)

‍

Once Cody and his team had an understanding of how the observability data was being used, Stripe looked into their architecture and noticed it was database-centric. While this setup is common, it has its trade-offs: the good - it is simple, limited failure modes, and a feature-rich walled garden; the bad - it can’t handle business growth, and creates high cardinality data leading to scalability, reliability, and database issues.

‍

So, what did Cody and Stripe do? Priority one was evolving their architecture to be more scalable, reliable, and cost-effective. From a technical perspective, Cody highlighted five areas:

Sharding partitions
Aggregation
Tiered storage
Streaming alerts
Isolation

Cody also talked about the cultural shift that needed to take place at Stripe—observability should be self-reliant and easy to do the right thing. Watch this session in depth here.

‍

FSI314 | Capital One: Achieving resiliency to run mission-critical applications

With over 7 million transactions a day for Capital One’s credit card business, achieving a target SLA of 99.9% can be pretty hard when your systems are dependent on each other. The reality is even if each service has a 99.9% SLA, if you have four dependencies, it leads to a 99.5% SLA (100-((.1)+(.1)+(.1)+(.1)+(.1)) - 99.5). When any of those dependencies fail, having a backup service is essential. In the example given, if an address verification system is down during the process of a credit card application, do you resort to seeing if the person already has an existing account to verify their identity?

‍

Sharmila Ravi, SVP of Technology at Capital One discussed the CAP Theorem—Consistency, Availability, and Partition Tolerance—and how critical it is to have these discussions up front. Additionally, Sharmilla took a quote from Werner Vogel, CTO at Amazon, design for failure as “everything fails all the time.” At this point, they transitioned to architecture design with Kathleen DeValk, VP of Architecture, and she summed it up best with the need for operational excellence—testing, observability, resilience, and operations. Having end-to-end visibility is important to discover problems before your customers do. You can watch the session in full here.

‍

*DevOps culture at McDonald’s is more “DevFoodOps”*

PRO201 | McDonald’s & AWS ProServe implement reusable & observable pipelines

AWS Professional Services assessed McDonald’s DevOps capabilities and noticed they had a high toil, onerous security, and limited pipeline visibility. Additionally, they noted non-functional requirements of scalability, reliability, and performance. When Stephanie Koenig, Senior Director of Cloud Infrastructure at McDonald's, was presented with these findings, it became clear that re-architecting and changing the culture were necessary next steps.

‍

Stephanie’s first step was to build a DevOps Platform team—with the goal to build consistent toolchains and workflows for McDonald’s developers. Doing this would reduce departmental silos, improve the developer experience, and improve cross-functional observability within McDonald’s services. Stephanie closed out by talking about what the future DevOps roadmap looks like at McDonald’s with the foreseeable future focused on ensuring a “minimum lovable platform (MLP)” and then expanding the DevOps platform to other teams. You can watch the session at length here.

A Common Theme: You need observability, but just at the right level

Finding the right ratio of observability data is challenging—too much information, and you're swimming in a sea of data, but too little information and you’re scratching your head at what’s happening. Discussing scale, reliability, and consistency when architecting services is vital for the technology and team culture. With Mezmo, you can help find that harmony by using features like data profiler to understand your telemetry data and decide what is useful and what isn’t. Learn more about Mezmo and how you can take control of your telemetry data.

false

TABLE OF CONTENTS

SHARE ARTICLE

RSS FEED

RELATED ARTICLES

Transforming Observability: Simpler, Smarter, and More Affordable Data Control

Debug Logs and Analyze Trends with Log Data Rehydration

How Telemetry Pipelines Save Your Budget

My Favorite Observability and DevOps Articles of 2024

From Gartner IOCS 2024 Conference: AI, Observability Data, and Telemetry Pipelines

April Yep

RELATED ARTICLES

Transforming Observability: Simpler, Smarter, and More Affordable Data Control

Debug Logs and Analyze Trends with Log Data Rehydration

How Telemetry Pipelines Save Your Budget

My Favorite Observability and DevOps Articles of 2024

From Gartner IOCS 2024 Conference: AI, Observability Data, and Telemetry Pipelines

SHARE ARTICLE

AWS re:Invent 2023 highlights: Observability at Stripe, Capital One, and McDonald’s

‍

FSI319 | Stripe: Architecting for observability at massive scale

‍

There were only a few dozen unique queries used for most alerts
They were only using between 2-20% of their telemetry data (metrics, events, logs, and traces)
Observability data value diminishes as time goes on (e.g. hour old data is more valuable than day-old data)

‍

So, what did Cody and Stripe do? Priority one was evolving their architecture to be more scalable, reliable, and cost-effective. From a technical perspective, Cody highlighted five areas:

Sharding partitions
Aggregation
Tiered storage
Streaming alerts
Isolation

Cody also talked about the cultural shift that needed to take place at Stripe—observability should be self-reliant and easy to do the right thing. Watch this session in depth here.

‍

FSI314 | Capital One: Achieving resiliency to run mission-critical applications

‍

PRO201 | McDonald’s & AWS ProServe implement reusable & observable pipelines

‍

AWS re:Invent 2023 highlights: Observability at Stripe, Capital One, and McDonald’s

FSI319 | Stripe: Architecting for observability at massive scale

FSI314 | Capital One: Achieving resiliency to run mission-critical applications

PRO201 | McDonald’s & AWS ProServe implement reusable & observable pipelines

A Common Theme: You need observability, but just at the right level

AWS re:Invent 2023 highlights: Observability at Stripe, Capital One, and McDonald’s

FSI319 | Stripe: Architecting for observability at massive scale

FSI314 | Capital One: Achieving resiliency to run mission-critical applications

PRO201 | McDonald’s & AWS ProServe implement reusable & observable pipelines

A Common Theme: You need observability, but just at the right level

AWS re:Invent 2023 highlights: Observability at Stripe, Capital One, and McDonald’s

April Yep

FSI319 | Stripe: Architecting for observability at massive scale

FSI314 | Capital One: Achieving resiliency to run mission-critical applications

PRO201 | McDonald’s & AWS ProServe implement reusable & observable pipelines

A Common Theme: You need observability, but just at the right level

Talk to an expert to learn more

Logging in the Age of DevOps eBook

Log Data Restoration beta program

LogDNA Streaming Early-Access

LogDNA Variable Retention Early-Access

@2022 Copyright Mezmo Inc.

NEWSLETTER SECTION

AWS re:Invent 2023 highlights: Observability at Stripe, Capital One, and McDonald’s

FSI319 | Stripe: Architecting for observability at massive scale

FSI314 | Capital One: Achieving resiliency to run mission-critical applications

PRO201 | McDonald’s & AWS ProServe implement reusable & observable pipelines

A Common Theme: You need observability, but just at the right level

AWS re:Invent 2023 highlights: Observability at Stripe, Capital One, and McDonald’s

FSI319 | Stripe: Architecting for observability at massive scale

FSI314 | Capital One: Achieving resiliency to run mission-critical applications

PRO201 | McDonald’s & AWS ProServe implement reusable & observable pipelines

A Common Theme: You need observability, but just at the right level

Use cases >

Reduce your SIEM cost

Integrations

Use cases >

ELK Replacement

Digital Transformation

DevSecOps

Control

Learn >

Blog

eBooks

Reports and Guides

Videos

Webinars

Case Studies

Infographics

Log Management

Observability

DevOps

Kubernetes

Security

About

Customers

Partners

Newsroom

Events

Career

Culture

Compliance & Security

Contact us

AWS re:Invent 2023 highlights: Observability at Stripe, Capital One, and McDonald’s

April Yep

Related articles

SHARE ARTICLE

FSI319 | Stripe: Architecting for observability at massive scale

FSI314 | Capital One: Achieving resiliency to run mission-critical applications

PRO201 | McDonald’s & AWS ProServe implement reusable & observable pipelines

A Common Theme: You need observability, but just at the right level

Related articles

share article

Talk to an expert to learn more

Logging in the Age of DevOps eBook

Log Data Restoration beta program

LogDNA Streaming Early-Access

LogDNA Variable Retention Early-Access

@2022 Copyright Mezmo Inc.