LogDNA Guide: Putting Alerts into Practice

4 MIN READ

Alerts are a core part of monitoring systems. Using alerts keeps you aware of changes within your infrastructure and applications, helping you identify and respond to issues faster. Log management solutions like Mezmo provide an ideal environment for configuring alerts, since they allow you to create detailed alerts based on your log data. Rather than manually searching for problems, you can use alerts to scan your log data in real time and receive immediate notifications about potential problems.

1. Monitoring System Activity

Log volume is a strong indicator of how active your systems and applications are. A sudden increase or decrease in overall log volume could indicate several things, including:

  • Surges in user-driven traffic to your website or web application
  • Surges in illegitimate traffic (e.g. a DDoS attack)
  • Problems with one or more components, resulting in spikes in error logs
  • Changes to your infrastructure or application (such as adding a new node to a cluster or scaling a Kubernetes Deployment)

While log volume will vary on a minute-by-minute basis, tracking your logs over a longer period of time will help you establish a baseline for comparison. LogDNA includes a powerful graphing feature that we can use to identify this baseline. For example, the following graph shows our total log volume over a three-day period. Mousing over a point on the graph shows us the number of events recorded for that particular hour. Here we record an average of 2,700–2,800 events per hour. Note that the sudden drop at the end is the result of taking this screenshot shortly after 10 a.m., and not any problem with our systems:

Log Volume Total

We can also see some pretty obvious deviations as indicated by the peaks. In one case, our log volume surged to 1.5X its normal value. If we click on this point and select Show Logs, we can jump directly to the events in the log timeline. As it turns out, one of our Kubernetes Pods was stuck in a constant loop of attempting to start, crashing, and restarting.

While Kubernetes logs only account for half of our total log volume, they are the direct cause of the two biggest surges in the chart (as indicated by the yellow line):

Log Surge Kubernetes Comparison

Once the problem was addressed, our log volume immediately returned to normal.

Use Presence Alerts to Detect Surges

To monitor for future problems like this, we can create a view that only displays Kubernetes events. The chart shows an average of roughly 1,500 events per hour. With this in mind, we can then add a view-specific alert that notifies us if the number of events exceeds (for example) 1,800 events in one hour. As soon as we pass that threshold, LogDNA sends an alert to our DevOps team:

Kubernetes Alert
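If you prefer to manage alerts as code rather than through the UI, a view-specific alert like this can also be created programmatically. The following is a minimal Python sketch against the LogDNA Config API; the endpoint, the servicekey header, and field names such as operator, triggerlimit, and triggerinterval are assumptions based on the Config API documentation, so verify them against the current docs before relying on this.

import requests

# Assumption: LogDNA Config API endpoint and field names; confirm against
# the current documentation before use.
LOGDNA_CONFIG_API = "https://api.logdna.com/v1/config/view"
SERVICE_KEY = "YOUR-SERVICE-KEY"  # placeholder for a real service key

payload = {
    "name": "Kubernetes events",
    "query": "",                      # no text filter; the apps list does the filtering
    "apps": ["kubernetes"],           # restrict the view to Kubernetes logs
    "channels": [{
        "integration": "email",
        "emails": ["devops@example.com"],
        "operator": "presence",       # fire when the event count EXCEEDS the limit
        "triggerlimit": 1800,         # ~20% above our 1,500 events/hour baseline
        "triggerinterval": "1h",
    }],
}

response = requests.post(LOGDNA_CONFIG_API, json=payload,
                         headers={"servicekey": SERVICE_KEY})
response.raise_for_status()
print(response.json())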

Use Absence Alerts to Detect Downtime

When monitoring for problems, the common approach is to look for increases in log volume. However, a decrease in log volume could indicate a bigger problem, such as:

  • Decreasing traffic to your applications
  • Hardware or software failures
  • Networking problems between your systems and your log management solution
  • Problems with the log management solution itself

With absence alerts, we can send a notification if volume drops below a certain threshold. Using our Kubernetes example, we can create an alert that notifies the team if we get fewer than 1,000 logs in an hour. If we want to reduce the risk of false positives (alerts triggered by natural changes in volume, rather than actual problems), we can reduce this number even further:

Kubernetes Absence Alert
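Using the same hedged Config API sketch from the previous section, the absence alert only changes the channel definition: flip the operator and lower the trigger limit (again, the field names are assumptions to check against the current documentation).

# Absence alert: notify if FEWER than 1,000 Kubernetes events arrive in an hour.
# Field names (operator, triggerlimit, triggerinterval) are assumed from the
# LogDNA Config API docs; verify before use.
absence_channel = {
    "integration": "email",
    "emails": ["devops@example.com"],
    "operator": "absence",     # fire when the count stays BELOW the limit
    "triggerlimit": 1000,      # lower this further to reduce false positives
    "triggerinterval": "1h",
}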

2. Detecting Security Events

Software security is a constantly moving target, with attackers always searching for new ways to break into and exploit applications. Logging security incidents provides several benefits, including:

  • Creating an audit trail of incidents
  • Recording important contextual data such as the origin of the attack (e.g. IP addresses) and affected component(s)
  • Reporting incidents in near real-time

Alerting on security events lets you keep track of potential incidents so you can respond to and protect against threats more quickly.

For example, a common security event on any public-facing server is SSH probing. In an SSH probe, an attacker repeatedly tries to log into SSH using a variety of username and password combinations. Since SSH is the default administration tool for Linux servers, attackers use automated scripts to detect and attack public SSH servers.

For example, these logs record an attempt from 192.168.105.294 to log in as a user named bw:

Feb 27 18:51:44 sshd[337]: Invalid user bw from 192.168.105.294 port 45012
Feb 27 18:51:44 sshd[337]: Disconnected from invalid user bw 192.168.105.294 port 45012 [preauth]
Feb 27 18:51:44 sshd[337]: Received disconnect from 192.168.105.294 port 45012:11: Bye Bye [preauth]

SSH probes are a frequent occurrence, and alerting on each one would quickly lead to alarm fatigue. But what if one of these attacks was successful? Let's say there actually is a bw user on the server, and the attacker happened to guess the right password. In that case, we would see this message appear in our logs:

pam_unix(sshd:session): session opened for user bw by (uid=0)

We could alert on all successful SSH logins, but that could also lead to alarm fatigue. Instead, we can create an alert that only triggers on successful logins that don't belong to a predetermined set of users.

For example, let's say we have a server with a single administrator named logdna. We can view all SSH events caused by logdna by using the search program:sshd logdna.

The search results show a failed login followed by a successful login. The first and third logs show the authentication method (password), while the second and fourth logs show the result of the authentication attempt (failed and successful, respectively). Given that logdna is the only user with access, we can search for unauthorized logins using program:sshd "session opened for user" -logdna. This searches for all successful authentications where the user is not the logdna user. We'll save this as a new view, then create an alert based on that view.

Sshd Alert

Now, if any user other than logdna successfully logs in via SSH, the team will immediately receive a notification.
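The filter behind this alert is easy to prototype locally before committing it to a view. The Python sketch below (the allow list and sample log line are illustrative) applies the same logic as the search query: keep successful session openings and flag any user who isn't on the allow list.

import re

# Users who are allowed to log in over SSH; anything else is suspicious.
ALLOWED_USERS = {"logdna"}

# Matches pam_unix lines such as:
#   pam_unix(sshd:session): session opened for user bw by (uid=0)
SESSION_OPENED = re.compile(r"pam_unix\(sshd:session\): session opened for user (\S+)")

def unauthorized_logins(lines):
    """Yield usernames from successful SSH logins that aren't on the allow list."""
    for line in lines:
        match = SESSION_OPENED.search(line)
        if match and match.group(1) not in ALLOWED_USERS:
            yield match.group(1)

sample = ["pam_unix(sshd:session): session opened for user bw by (uid=0)"]
print(list(unauthorized_logins(sample)))  # -> ['bw']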

3. Billing Notifications

Besides tracking changes in demand, monitoring log volume plays another important role: estimating costs. This is particularly important when using SaaS-based log management solutions, where your costs are often directly related to log volume. You can view your total usage by opening the LogDNA web app and navigating to Settings and then Usage. You can learn more about our enhancements to the usage dashboard by reading this blog post.

You can also create an alert if your usage exceeds a certain amount. While in the LogDNA web app, navigate to Settings, Usage, and then Usage Alerts. Here, you can choose to receive an email alert after logging the amount specified (in GB). You should also choose to be notified when intermediate thresholds are reached, such as 25%, 50%, 75%, and 100%, so that you have enough time to create filters or suspend logging before reaching your budget.

LogDNA usage alerts

Alternatively, if you know the average size of your events, you can translate a volume budget into an approximate maximum number of events per day, hour, or minute. For example, let's say our logs average 200 bytes per event, and we want to be notified when we exceed 30 GB in a given month. That gives us a maximum of 150 million events per month, which equates to 5 million events per day, 208,333 events per hour, and 3,472 events per minute. Creating an alert with these limits can automatically notify you of a potential overage well before you reach your limit.
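This arithmetic is easy to script, which makes it simple to recompute the thresholds whenever your average event size or budget changes. The short Python sketch below mirrors the numbers above (200-byte events, a 30 GB monthly budget, and an assumed 30-day month); adjust the constants to match your own data.

# Derive per-day/hour/minute event thresholds from a monthly volume budget.
AVG_EVENT_BYTES = 200     # average size of one log event
MONTHLY_BUDGET_GB = 30    # ingestion budget per month
DAYS_PER_MONTH = 30       # assumption: a 30-day month

events_per_month = MONTHLY_BUDGET_GB * 10**9 // AVG_EVENT_BYTES
events_per_day = events_per_month // DAYS_PER_MONTH
events_per_hour = events_per_day // 24
events_per_minute = events_per_hour // 60

print(f"{events_per_month:,} events/month")    # 150,000,000
print(f"{events_per_day:,} events/day")        # 5,000,000
print(f"{events_per_hour:,} events/hour")      # 208,333
print(f"{events_per_minute:,} events/minute")  # 3,472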

Conclusion

Alerts are a powerful yet often underutilized part of log management. Using alerts effectively can help keep you and your team aware of your infrastructure's operational performance, potential problems, security incidents, and more. With LogDNA, you can create an unlimited number of alerts with alert intervals as short as 30 seconds, and you can integrate your alerts with multiple channels including email, Slack, PagerDuty, and Datadog. To get started, sign up for a free trial account.
