How to Search Through Log Archives

4 MIN READ

Log retention is a crucial factor when adopting a log management solution. For most organizations, 30 days strikes a good balance between access to historical log data and storage costs. However, some organizations need to retain logs for much longer, whether to meet compliance requirements, conduct audits, or observe long-term operational changes. This is where log archiving and searchability become essential parts of a modern telemetry pipeline.

With Mezmo, you can send archives of older log data to services such as AWS S3 (including S3 access logs), Google Cloud Storage, and IBM Cloud Object Storage. These archives are meant for cold storage; however, if you need to query or retrieve these logs, traditional parsing methods fall short. Archives store log data as standard JSON, but tools like grep or basic JSON parsers lack the flexibility of modern SIEM tools. Actions such as searching, filtering, and creating views would likely require you to build a custom frontend, adding time and complexity.

The temptation is to download the archives locally and use grep to find what you need. But as archive sizes grow, this approach becomes inefficient and impractical.
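To see why, consider what the manual approach actually involves. The sketch below decompresses the archive file used later in this post and filters it with jq and grep; the "_app" and "_line" field names are assumptions about the archive's JSON layout, so check your own files for the exact names:

$ gunzip -c 8028507f8d.2019-05-02.json.gz \
    | jq -r 'select(._app == "daemon.log") | ._line' \
    | grep -i error

This works for a single small archive, but there is no indexing, no structured filtering, and no easy way to query across many days of data at once.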

Presto

Presto is a high-performance, distributed SQL query engine for big data. You can install Presto on your own server and use the local file connector to read your archives. Amazon Athena's query engine is based on Presto, so if you have the infrastructure in-house to extract the archives locally, this is likely your lowest-cost option.
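As a very rough sketch of where that leads (the catalog, schema, and table names below are placeholders, and the connector configuration that exposes your extracted archives is omitted), querying through the Presto CLI could look like:

$ presto --server localhost:8080 --catalog logs --schema archive \
    --execute "SELECT * FROM archive_20190502 WHERE message LIKE '%error%'"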

DNAQuery

Life.Church developed DNAQuery, a Go-based utility for exporting LogDNA archives to Google BigQuery.

BigQuery is a cloud solution for querying large datasets. With BigQuery, you can search through log data without having to administer a database or use JSON parsing tools. And since your data is stored in the cloud, you can set your own retention policies, store as much log data as you need, and process terabytes of events at a time.

How the DNAQuery Script Works

The DNAQuery script performs the following steps:

1. Downloads LogDNA archives from your Google Cloud Storage (GCS) account.
2. Extracts any logs matching the app name that you specify.
3. Parses the contents of each matched log using a regular expression (regex).
4. Stores the results in a CSV file and uploads it to a second GCS bucket.
5. Creates a new BigQuery table and imports the contents of the CSV file.

Running the DNAQuery script requires you to provide the date of the archive that you want to process. For example, this command processes an archive from May 6, 2019:

$ dnaquery --date 2019-05-06

The resulting BigQuery table is named after the archive date (in this case, “20190506”).

Log Archiving Prerequisites

Before continuing, follow the LogDNA archiving instructions and the instructions listed in the project’s README. Due to recent changes in how archives are structured, the current version of the script does not support recently generated archives. You can download a forked version of the script that supports the new schema here.

In addition, you will need to have:

  • A GCS bucket containing your LogDNA archives.
  • A second GCS bucket for storing CSV files from DNAQuery.
  • A BigQuery dataset named “dnaquery”.
  • A table in the “dnaquery” dataset named “logdna”. The table’s schema should match the fields extracted using your DNAQuery regex. We’ll show an example of this later.
  • A Service Account and key file for use by DNAQuery. This account must have access to your GCS and BigQuery resources.
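If you are creating these resources from scratch, a minimal sketch using the gsutil and bq command-line tools might look like this (the bucket names match the example that follows; yours will differ, and the “logdna” table schema is covered in the example below):

$ gsutil mb gs://logdna-archive
$ gsutil mb gs://logdna-to-bigquery
$ bq mk --dataset dnaquery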

Example: Searching Kubernetes Logs in BigQuery

In this example, we’ll use DNAQuery to process older logs from a Kubernetes cluster. We created a GCS bucket named “logdna-archive” to store our LogDNA archives. We also created a second bucket named “logdna-to-bigquery” to store the CSV files created by DNAQuery.

Next, we cloned the DNAQuery repository to our workstation and renamed the “example.toml” file to “dnaquery.toml”. We modified the “[apps]” section to match the format of our Kubernetes logs:

[[apps]]
  Name = "daemon.log"
  Regex = '^(\w+\s+\d+\s+\d{2}:\d{2}:\d{2}) (\w+) (\w+.+)\[(\d+)\]: (.*)'
  TimeGroup = 1
  TimeFormat = "Jan  2 15:04:05"
For each log whose app name is “daemon.log”, DNAQuery runs the regex against the log message, splitting it into capture groups. These capture groups correspond to the columns in our BigQuery table. “TimeGroup” is the index of the capture group containing the log’s timestamp, and “TimeFormat” specifies the format of that timestamp (DNAQuery is written in Go, so this uses Go’s reference-time layout). For the “[gcp]” section, we simply replaced the default values with those specific to our GCP deployment.
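To make the mapping concrete, here is a hypothetical syslog-style line from “daemon.log” and how the regex splits it (the line itself is made up; the column names match the schema described below):

May  2 14:03:22 node1 microk8s.daemon-kubelet[1234]: Error syncing pod

Group 1 (TimeGroup): "May  2 14:03:22"         -> timestamp
Group 2:             "node1"                   -> source
Group 3:             "microk8s.daemon-kubelet" -> program
Group 4:             "1234"                    -> pid
Group 5:             "Error syncing pod"       -> message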

In BigQuery, we configured our “logdna” table with a schema whose fields correspond to the capture groups in our regex, with the exception of “app”, which is added automatically by DNAQuery.
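A sketch of that schema expressed as a bq CLI command (the field names and types here are inferred from the capture groups and the MySQL example later in this post):

$ bq mk --table dnaquery.logdna \
    app:STRING,timestamp:TIMESTAMP,source:STRING,program:STRING,pid:INTEGER,message:STRING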

With the setup complete, we then ran the script:

$ dnaquery --date 2019-05-02
2019/05/07 12:08:27 File in GCS is 0.010615 GB
2019/05/07 12:08:27 Downloading from GCS
Downloaded logs/8028507f8d.2019-05-02.json.gz 11397647 bytes
2019/05/07 12:08:29 Opening Logfile logs/8028507f8d.2019-05-02.json.gz
2019/05/07 12:08:29 Starting processLine
2019/05/07 12:08:29 Scanning log file
2019/05/07 12:12:49 Scanning complete. 268205 lines scanned
2019/05/07 12:12:49 Matched 114284 lines, Skipped 592 lines
2019/05/07 12:12:49 Completed processLine
2019/05/07 12:12:49 Starting upload to GCS
2019/05/07 12:12:49 Upload size: 37.235619 MB
2019/05/07 12:13:35 Completed upload to GCS
2019/05/07 12:13:35 Starting load into BQ
2019/05/07 12:13:36 BQ Job created...
2019/05/07 12:13:41 Completed load into BQ

We can verify that the data was loaded into BigQuery by clicking on the newly created “20190502” table and selecting the “Preview” tab.

Analyzing the Data

Once our data was in BigQuery, we could query it just like any other SQL table. For example, say we want to find all errors occurring between 2pm and 5pm. We can do so by running:

SELECT *
FROM `dnaquery.20190502`
WHERE timestamp BETWEEN '2019-05-02 14:00:00' AND '2019-05-02 17:00:00'
  AND message LIKE '%error%'

If we want to monitor events coming from a specific node, we can do so using:

SELECT *
FROM `dnaquery.20190502`
WHERE `source` = 'node1'
  AND program = 'microk8s.daemon-kubelet'

We can even search across multiple archives by using a table wildcard and filtering _TABLE_SUFFIX with BETWEEN to specify the range of archive dates to search:

SELECT timestamp, source, message
FROM `bigquery-dnaquery-v2.dnaquery.*`
WHERE _TABLE_SUFFIX BETWEEN '20190501' AND '20190504'
LIMIT 1000

Using Archive Data Outside of BigQuery

Although the DNAQuery script is made to work with BigQuery, it can be used to generate data for other querying tools such as Amazon Athena or Azure SQL Database. Since DNAQuery converts LogDNA archives into standard CSV files (which you can find in the project’s “logs/” directory), any database tool capable of importing CSV can import your log data. You just need to make sure that the table you’re importing into matches your schema.


For example, if you want to use MySQL instead of BigQuery, you can use the LOAD DATA statement to import your CSV file into a MySQL table. We’ll create a new table using the following query:

CREATE TABLE archive_20190502 (
  app VARCHAR(255) NOT NULL,
  timestamp TIMESTAMP NOT NULL,
  source VARCHAR(255),
  program VARCHAR(255),
  pid INT,
  message TEXT
)

Next, we’ll import the CSV file stored in our dnaquery/logs folder into the new table:

LOAD DATA LOCAL INFILE '/home/logdna/dnaquery/logs/results_2019-05-07.csv'
INTO TABLE archive_20190502
FIELDS TERMINATED BY ','
ENCLOSED BY '"'
LINES TERMINATED BY '\n'

Now, we can use any MySQL client such as MySQL Workbench or phpMyAdmin to query the table.

Conclusion

Having access to historical log data is a huge benefit that comes by default with LogDNA. You can configure the settings for archiving in the LogDNA web app.

Use this fork of DNAQuery and you'll be able to extract your archives into the query tool of your choice, whether that's Presto, Amazon Athena, Google BigQuery, or simply a searchable CSV file. Which tool makes the most sense largely depends on how large your archives are.

With BigQuery, you can search through as much historical log data as necessary from a single interface. To automate the process of sending your archives to BigQuery, you could even schedule a cron job to run the DNAQuery script each day.
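As a sketch, a crontab entry that processes the previous day's archive each morning might look like this (the working directory, log path, and GNU date usage are assumptions about your environment, and dnaquery still needs to find its dnaquery.toml in that directory):

# Run DNAQuery at 06:00 daily for yesterday's archive
0 6 * * * cd /home/logdna/dnaquery && ./dnaquery --date $(date -d yesterday +\%Y-\%m-\%d) >> /var/log/dnaquery.log 2>&1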

Keep in mind that BigQuery only allows you to query log data. Creating views, alerts, graphs, and integrations is still only possible in the LogDNA web app. In addition, BigQuery charges you for each query you execute based on the amount of data being processed. For analyzing recent and real-time log data, the LogDNA web app is still the best way to go. However, for long-term and archived logs, DNAQuery is incredibly useful.

A big thanks to Chris Vaughn and the team at Life.Church for their work!
To learn more, visit the DNAQuery GitHub page.
