Top Open Source AI SRE Tools in 2026

Mezmo paired with AURA is the top pick for 2026. It is the only entry on this list that runs an open-source agentic execution harness alongside an active telemetry pipeline, so agents act on filtered, reduced data rather than raw signal floods. The AURA harness stays inspectable and forkable, from the reasoning chains to the tool calls, while Mezmo's Active Telemetry Pipeline is a separate commercial product. The rest of this list ranks the observability platforms, pipelines, and closed-source agents around that combination.

Why open source matters at the execution layer, not just observability

Most open source AI SRE lists stop at dashboards. They rank Prometheus, Grafana, and a handful of OTel-native platforms, then call it a day. Those tools show you what broke. They do not act on it, and they say nothing about the layer where agents diagnose and remediate.

That gap matters because AI SRE agents operate at four distinct autonomy tiers. A read-only agent observes and summarizes. An advised agent recommends a fix with rationale. An approved agent executes after a human signs off. An autonomous agent runs bounded remediation inside guardrails, with rollback and audit trails (augmentcode.com). Irreversibility is the trigger for mandatory human review: an action that cannot be rolled back should wait for a human sign-off.
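The four tiers can be read as a small policy gate. The sketch below is a hypothetical illustration, not AURA's API: the `Tier` enum, the `Action` type, and the `may_execute` function are names invented here, and the rule that irreversible actions always need sign-off is one reasonable reading of the tiers described above.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    READ_ONLY = 1   # observes and summarizes
    ADVISED = 2     # recommends a fix with rationale
    APPROVED = 3    # executes after a human signs off
    AUTONOMOUS = 4  # bounded remediation inside guardrails


@dataclass(frozen=True)
class Action:
    name: str
    reversible: bool              # can the change be rolled back?
    human_approved: bool = False  # has a human signed off?


def may_execute(tier: Tier, action: Action) -> bool:
    """Return True if an agent at `tier` may run `action` right now."""
    if tier in (Tier.READ_ONLY, Tier.ADVISED):
        return False  # these tiers never change production state
    if not action.reversible:
        return action.human_approved  # irreversible: always needs sign-off
    if tier is Tier.APPROVED:
        return action.human_approved
    return True  # AUTONOMOUS tier, reversible action
```

Under this sketch, an autonomous agent can restart a pod on its own but still has to wait for a human before running anything it cannot undo.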

Open source earns its keep at the approved and autonomous tiers. Once an agent can change production state, you need to see exactly what it decided and why. MIT Sloan research found people are 2.8x more likely to trust AI systems they can interpret, and replayable audit trails are a named requirement for safe autonomous operation (augmentcode.com). A black-box agent that opens remediation PRs hides the one thing you most need to audit. You cannot fork it, you cannot inspect its blast-radius controls, and you cannot prove to a compliance team what it will and will not touch.

AURA is the reference implementation of this layer. Mezmo built it as an Apache 2.0 agentic harness in Rust that turns an LLM into an autonomous service capable of real SRE work, with full OpenTelemetry tracing across every plan, prompt, tool call, and handoff. By Mezmo's account, it is the only open-source execution harness built for production use, while the RCA and remediation logic in Resolve AI, Traversal, PagerDuty, and Komodor stays proprietary (mezmo.com). This list covers harnesses, pipelines, and orchestration alongside the dashboards, because the agentic layer is where open source actually changes how you run incidents.

What counts as an open source AI SRE tool

This list covers four categories of open source tooling, and they do different jobs. Knowing which layer a tool occupies tells you whether it dashboards your incidents or actually acts on them.

Agentic harnesses run the agents that investigate and remediate. They handle agent composition, multi-agent coordination, bounded execution, and the audit trails that make autonomous action safe. AURA is the reference example, and it is the only open source harness built for production use.

Telemetry pipelines move and shape your data before anything else sees it. Tools like Fluent Bit collect, filter, and route logs and metrics, which controls cost and decides what signal reaches your agents.

RCA and observability platforms store telemetry and let you query it. Grafana, Prometheus, SigNoz, and OpenObserve live here. They show you what broke. They do not fix it.

Incident orchestration ties detection, paging, and remediation into a workflow. Closed-source tools dominate this category, which is exactly where open source has the most room to win.

The best open source AI SRE tools in 2026

The entries below are ranked by how much of the AI SRE stack each tool actually owns. Mezmo with AURA leads because it is the only option operating at the execution layer and the pipeline layer at once, pairing agentic remediation with active telemetry pipeline control.

Mezmo + AURA

Mezmo earns the top spot because it runs at two layers no competitor combines. AURA gives you an open agentic harness for executing SRE work. Mezmo's Active Telemetry Pipeline cleans and reduces the data those agents reason over before they ever see it. Every other tool on this list owns one layer or the other, never both.

AURA: the open execution harness

AURA is an Apache 2.0 harness written in Rust that turns an LLM into an autonomous service capable of running real SRE work (github.com/mezmo/aura). You compose agents declaratively in TOML, swap providers across OpenAI, Anthropic, Bedrock, Gemini, Ollama, and OpenRouter, and run multi-agent teams without rewriting code. The orchestration layer uses a coordinator/worker model with DAG-based parallel execution, so independent investigation branches run at once and dependent steps wait their turn.
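To make the coordinator/worker idea concrete, here is a minimal sketch of DAG-based parallel execution using only the Python standard library. It is not AURA's implementation (AURA is written in Rust and configured in TOML). The `run_dag` function and the step names are hypothetical, chosen only to show independent branches running at once while dependent steps wait.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait
from graphlib import TopologicalSorter


def run_dag(dependencies, tasks, max_workers=4):
    """Run tasks in dependency order, executing independent ones in parallel.

    dependencies: maps a step name to the set of steps it must wait for.
    tasks: maps a step name to a function that receives the results so far.
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()
    results = {}
    in_flight = {}  # future -> step name
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while sorter.is_active():
            # Coordinator: dispatch every step whose dependencies are finished.
            for step in sorter.get_ready():
                in_flight[pool.submit(tasks[step], dict(results))] = step
            # Block until at least one worker completes.
            done, _ = wait(in_flight, return_when=FIRST_COMPLETED)
            for future in done:
                step = in_flight.pop(future)
                results[step] = future.result()
                sorter.done(step)  # unblocks steps that depend on this one
    return results


# Hypothetical investigation: two branches fan out from triage, RCA joins them.
deps = {
    "triage": set(),
    "check_logs": {"triage"},
    "check_deploys": {"triage"},
    "rca": {"check_logs", "check_deploys"},
}
tasks = {
    "triage": lambda r: "checkout latency alert",
    "check_logs": lambda r: "logs clean",
    "check_deploys": lambda r: "deploy 42 suspicious",
    "rca": lambda r: f"{r['check_logs']}; {r['check_deploys']}",
}
results = run_dag(deps, tasks)
```

Here `check_logs` and `check_deploys` can run concurrently once `triage` finishes, and `rca` starts only after both have reported back.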

The harness ships the unglamorous infrastructure that production agents actually need. Guardrails, state management, authentication, streaming, error handling, and tool integrations come built in. AURA discovers MCP tools dynamically over Streamable HTTP, SSE, and STDIO transports, and it sanitizes schemas automatically for OpenAI function-calling. OpenTelemetry and OpenInference tracing record every plan, prompt, tool call, and handoff, then egress to Arize Phoenix, Jaeger, Datadog, or Mezmo for a full audit trail.

Getting to a running agent takes three commands. You copy .env.example, run docker compose up -d, then docker exec -it aura ./aura-cli. The target is under an hour to a production agent. Star and fork the repo at github.com/mezmo/aura. Tech Times named AURA one of the top four open source projects for AI-driven development in June 2026, calling it "an ambitious project from a company unafraid to make big moves" (techtimes.com).

The combined stack: pipeline as the brain, AURA as the hands

Pointing an agent at raw telemetry burns tokens and money on noise. Mezmo's Active Telemetry Pipeline cuts data volume by 99.98% before agents read it, so an investigation works through fewer than 1,000 signals instead of millions (mezmo.com). The cost difference is stark. Mezmo cites under $1 per investigation against the $30-plus typical of pointing an agent at a raw-vendor MCP server.
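A quick back-of-the-envelope check shows how those figures relate. The 5 million raw-signal input below is an assumption picked for illustration, not a number from Mezmo. The 99.98% reduction and the per-investigation costs are the vendor figures quoted above.

```python
def signals_after_reduction(raw_signals: int, reduction_pct: float) -> int:
    """Signals left after the pipeline drops `reduction_pct` percent of them."""
    return round(raw_signals * (1 - reduction_pct / 100))


# Assumed raw volume for one investigation window (hypothetical).
raw = 5_000_000
remaining = signals_after_reduction(raw, 99.98)

# Vendor-quoted costs per investigation, in dollars.
raw_mcp_cost = 30
reduced_cost = 1
cost_ratio = raw_mcp_cost / reduced_cost
```

At a 99.98% reduction, 5 million raw signals shrink to about 1,000, so the "fewer than 1,000 signals" claim holds for inputs up to roughly that size. The quoted costs imply at least a 30x difference per investigation.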

The numbers compound across a real workflow. A multi-agent team configuration takes MTTR from 15 minutes to 5, total investigation time drops under a minute, and post-mortem generation moves from four hours to automated (mezmo.com). Mezmo reports 60 to 80% of toil eliminated. Those are vendor figures, so test them against your own incident history before you trust them.

The pre-built workflow chains a triage agent to an RCA agent to a remediation agent, each grounded in your existing runbooks. AURA plugs into LangChain, LangGraph, CrewAI, and Temporal out of the box, and connects to Mezmo through the MCP server at mcp.mezmo.com/mcp. You keep your Prometheus, Datadog, and OTel stacks. No data migration, no pre-trained models.

Where it falls short

AURA is young. At 125 stars, 18 forks, and 76 open issues, the community is smaller than the LGTM stack or SigNoz, and the configuration format is still moving. Two breaking changes in April 2026 relocated the [llm] block and consolidated Ollama parameters, so pin a version and read the changelog before you upgrade (github.com/mezmo/aura).

AURA itself is free and forkable under Apache 2.0. You can run it against your own LLM and observability backend at zero license cost. Mezmo's Active Telemetry Pipeline is a separate commercial product. Evaluate the harness first and layer the pipeline in when you need the data reduction at scale.

Grafana + Prometheus

Grafana plus Prometheus is the observability foundation almost every SRE team already runs, and the right place to start before you add an execution layer. The LGTM stack combines Loki for logs, Grafana for visualization, Tempo for tracing, and Mimir for scalable Prometheus metrics storage (Grafana). Prometheus collects the metrics. Grafana turns them into the dashboards your on-call rotation stares at during an incident.

The strongest reason to keep this stack is the institutional knowledge baked into it. Your PromQL queries, recording rules, alert configs, and Grafana dashboards encode years of hard-won operational understanding (stackgen.com). PromQL and OpenTelemetry compatibility mean your existing queries and instrumentation keep working without a rewrite. The plugin ecosystem covers nearly any integration you need, and the LGTM components give you logs, metrics, and traces under one ecosystem, with Grafana Pyroscope adding profiles.

Running the stack yourself carries a real operational tax. Prometheus storage tuning can consume two or more SREs full-time, and multi-cluster federation tends to break down once you push past two Kubernetes clusters (stackgen.com). High-cardinality Kubernetes metrics, like pod-level labels and ephemeral container IDs, drive unpredictable cost overruns. Alertmanager dedup issues and Prometheus OOM events surface at 2 AM during the exact incidents you built the stack to handle.
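The cardinality problem is easiest to see as arithmetic. For one metric, the worst-case number of time series is the product of the distinct values of each label. The sketch below uses made-up label counts. `max_series` is a hypothetical helper, and real series counts are usually lower because not every label combination occurs.

```python
from math import prod


def max_series(label_cardinalities: dict[str, int]) -> int:
    """Upper bound on time series for one metric: product of label cardinalities."""
    return prod(label_cardinalities.values())


# Hypothetical cluster: 20 namespaces, 50 services.
baseline = max_series({"namespace": 20, "service": 50})

# Adding a pod-level label with 30 distinct values multiplies the bound by 30.
with_pods = max_series({"namespace": 20, "service": 50, "pod": 30})
```

In this example one extra pod-level label raises the bound from 1,000 to 30,000 series for a single metric. Ephemeral container IDs are worse because the set of distinct values keeps growing as containers churn.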

The harder ceiling is what the stack cannot do once an incident starts. Grafana and Prometheus make data available and visible, but they do not tell you what the data means while production burns. Independent evaluations describe root cause analysis across Grafana dashboards as a manual correlation exercise that runs three to four hours (stackgen.com). Alert routing, silence management, and dashboard upkeep stay manual.

No part of the stack acts on its own. There is no agent reading your telemetry, forming hypotheses, or proposing remediation, and nothing that executes a fix within guardrails. Grafana and Prometheus answer the question "what is happening." An agentic execution layer like AURA answers "what do I do about it" and can carry out the response. Treat the LGTM stack as the data plane underneath your AI SRE work, not as the layer where agents act.

SigNoz

If you want unified signals without stitching together Prometheus, Grafana, and Loki yourself, SigNoz is the strongest single-install alternative. It runs on ClickHouse, the same columnar database Uber and Cloudflare use, and stores logs, metrics, and traces in one place. The repository carries over 27,400 stars and ships frequently, so the project has real momentum behind it.

The correlation story is what separates SigNoz from the LGTM patchwork. SigNoz instruments your services through OpenTelemetry SDKs, which inject a TraceId and SpanId into every log line automatically. You click a slow span in a flamegraph and jump straight to the related logs without writing a join or hopping between three tools. Prometheus covers metrics only, so the same workflow on a Grafana stack means manually correlating across separate dashboards.
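The correlation itself is a lookup on shared IDs. The sketch below is not SigNoz code. It assumes log records and spans are plain dictionaries carrying `trace_id` and `span_id` fields (hypothetical field names standing in for the TraceId and SpanId that the SDKs inject), and shows why no manual join is needed once every log line carries both.

```python
from collections import defaultdict


def index_logs_by_trace(log_records):
    """Group log records by trace ID so span lookups are a dictionary hit."""
    index = defaultdict(list)
    for record in log_records:
        trace_id = record.get("trace_id")
        if trace_id:
            index[trace_id].append(record)
    return index


def logs_for_span(span, log_index):
    """Return the log records emitted while `span` was active."""
    candidates = log_index.get(span["trace_id"], [])
    return [r for r in candidates if r.get("span_id") == span["span_id"]]


# Made-up sample data for illustration.
logs = [
    {"trace_id": "t1", "span_id": "s1", "message": "query started"},
    {"trace_id": "t1", "span_id": "s2", "message": "db timeout after 5s"},
    {"trace_id": "t2", "span_id": "s9", "message": "unrelated request"},
    {"message": "log line without trace context"},
]
slow_span = {"trace_id": "t1", "span_id": "s2", "duration_ms": 5012}

related = logs_for_span(slow_span, index_logs_by_trace(logs))
```

Clicking a slow span in the UI amounts to the same operation: take the span's IDs and pull the matching log lines.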

ClickHouse also changes the cost math. SigNoz claims roughly 50% lower resource use than Elastic during log ingestion, and its columnar storage handles aggregations on high-cardinality data without the index limits that cap Loki. You can run pod-level labels and ephemeral container IDs through it without the streams blowing up. For teams hitting Loki's ceiling or paying Elastic's bill, that gap is the reason to switch.

The feature set is wide for a single install. SigNoz handles APM with p99 and p50 latency, error rates, and Apdex. It tracks distributed traces with span-level drill-down, centralized logs, dashboards built on PromQL or ClickHouse queries, and anomaly-based alerting across all three signal types. It even tracks LLM call costs and token usage for production AI apps.

What SigNoz does not have is any agentic layer. It surfaces a correlated trace-to-log view and stops there. No tool reads that view, forms a hypothesis, or proposes a fix. During an active incident, an engineer still does every step of root cause analysis by hand. SigNoz makes that manual work faster and cheaper than the Prometheus stack, but it remains a passive observability platform. The execution layer where agents act on this data is exactly the gap AURA fills.

OpenObserve

OpenObserve wins on storage economics. The team claims 140x lower storage costs than Elasticsearch by writing to Parquet columnar files on S3 instead of replicating data across hot nodes (github.com/openobserve/openobserve). If your Elasticsearch bill keeps climbing because retention and replication multiply every gigabyte you ingest, this is the tool to evaluate first.

The deployment story matches the cost story. OpenObserve ships as a single Rust binary you can run in under two minutes with Docker, and it uses roughly a quarter of the hardware Elasticsearch needs for comparable workloads. You start with one binary handling terabytes and scale to a High Availability mode that the project says handles petabytes, with a largest known production deployment ingesting 2+ PB per day. Partitioning and caching cut search space by up to 99% on most queries, so the cheap storage does not cost you query speed.

The free tier is genuinely usable. You get 50 GB/day ingestion, around 1.5 TB a month, with full commercial use and no registration until you cross that line. Coverage spans logs, metrics, traces, Real User Monitoring with session replay, dashboards, alerts, and ingest pipelines for enrichment and redaction. Queries run in SQL and PromQL, so you skip learning a proprietary language.

Two things should temper the enthusiasm. The first is the license. The open source edition is AGPL-3.0, which means any modified version you expose as a network service must publish its source. Run a fork internally and you are fine. Build a commercial SaaS on top of it and the license has real teeth. The second is feature gating: SSO, advanced RBAC, audit trails, and sensitive data redaction sit behind the separate Enterprise edition.

The harder limit for this list is scope. OpenObserve stores and queries telemetry well, but it has no agent, no automated root cause analysis, and no remediation. You still correlate signals and act on them yourself.

Fluent Bit

Fluent Bit is the data plane every other tool on this list depends on. It collects logs, metrics, and traces in a single agent, then routes them to wherever your stack needs them. The CNCF graduated it under the Fluent project, and it has been deployed more than one billion times.

The architecture is small and predictable. Input plugins pull from TCP, syslog, HTTP, and file tailing. Parsers convert unstructured records into structured data, filters add or drop fields, and a persistent buffer protects against backpressure. Output plugins ship the result to more than 40 destinations, including Prometheus, Elasticsearch, and Jaeger.
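The stage order is easier to follow as code. The sketch below is a conceptual model in Python, not Fluent Bit configuration or its plugin API. The log format, the `cluster` field, and all function names are invented for illustration, and the in-memory queue only stands in for the persistent buffer.

```python
import re
from collections import deque

# Assumed line format for this example: "LEVEL service message..."
LINE = re.compile(r"(?P<level>[A-Z]+) (?P<service>\S+) (?P<message>.*)")


def parse(raw_line):
    """Parser stage: turn an unstructured line into a structured record."""
    match = LINE.match(raw_line)
    return match.groupdict() if match else None


def keep(record):
    """Filter stage: drop debug noise before it leaves the node."""
    return record["level"] != "DEBUG"


def run_pipeline(raw_lines, output):
    """Input -> parser -> filter -> buffer -> output, in that order."""
    buffer = deque()  # in-memory stand-in; the real buffer persists records
    for line in raw_lines:
        record = parse(line)
        if record is not None and keep(record):
            record["cluster"] = "demo"  # filters can also add fields
            buffer.append(record)
    while buffer:
        output(buffer.popleft())  # output stage: ship to a destination


delivered = []
run_pipeline(
    ["INFO checkout paid order 17", "DEBUG checkout cache miss", "garbage"],
    delivered.append,
)
```

Only the structured, non-debug record reaches the output. The unparseable line and the debug line are dropped upstream, which is the kind of shaping that decides what signal downstream tools see.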

Fluent Bit runs lean. It is tuned for low CPU and memory, which lets you deploy it on edge nodes and embedded devices as easily as on cloud infrastructure. The project positions itself as an alternative to the OpenTelemetry Collector when you want a compact, high-throughput telemetry agent without the operational weight of a larger collector. Many teams run both, with Fluent Bit at the edge forwarding to a central OTel Collector for export.

Treat Fluent Bit as plumbing, not intelligence. It moves and shapes telemetry, but it does not reason about an incident or act on one. An agent reading from a Fluent Bit pipeline still needs a harness to triage, correlate, and remediate.

That is the line between the data plane and the execution layer. Fluent Bit feeds the signals. AURA and Mezmo's pipeline turn those signals into investigations and actions. Run Fluent Bit upstream, then point your agentic layer at the cleaned, routed stream it produces. The two roles complement each other rather than compete.

Resolve AI

Resolve AI runs the most aggressive autonomous investigation of any tool in this list. It reads from your code repositories, infrastructure state, and existing observability tools, then spins up multiple competing hypotheses in parallel to connect symptoms back to code-level and infrastructure-level changes (mezmo.com). The output goes beyond a summary. Resolve AI drafts pull requests with full incident context and proposed fixes, then writes structured postmortems and action items without anyone touching a keyboard.

That makes Resolve AI the closest commercial benchmark for what agentic RCA looks like when it works. Teams with homogeneous stacks and clean instrumentation get autonomous triage that pushes humans from investigators to approvers. The quality of the hypotheses tracks the quality of your telemetry, so partial or inconsistent instrumentation weakens every conclusion the agent reaches.

Two limits keep Resolve AI off the open-source shortlist. The internal reasoning is not always visible, which makes validating or redirecting an agent's conclusion difficult when you disagree with it (mezmo.com). For a system drafting production PRs, opaque reasoning is a governance problem, not a UX one. Pricing is contact-sales with nothing public, so you cannot model cost before a sales call.

Compare that to AURA, where the harness is fully inspectable, forkable, and bounded by approval gates you control (github.com/mezmo/aura). Resolve AI shows you what autonomous remediation can do. AURA lets you read exactly how it decided to do it. For any team that has to defend an automated change in a postmortem, that difference decides the tool.

Neubird

Neubird markets itself as "The Production Operations Agent" and backs the pitch with the strongest self-reported numbers on this list. The company claims it prevents 73% of issues and cuts MTTR by 92%, recovering 200+ engineering hours a month across its customers. The homepage demo walks through a 4.5-hour incident resolved in five minutes, from a PagerDuty alert to a rollback to a drafted post-mortem in Slack.

What sets Neubird apart from the more opaque autonomous responders is its chain-of-thought visibility. The agent assembles live investigation data at query time rather than pulling from stale indexes, then surfaces its causal reasoning as it works. Neubird reports 94% RCA accuracy and ships "Preventive Ops Insights" that flag risk before an incident fires. The integration list is wide, covering Datadog, CloudWatch, Azure Monitor, Splunk, Dynatrace, and Grafana.

The catch is total closure. Neubird ships no open-source component, no GitHub repository, no agent SDK, and no community edition. Nothing about the harness is inspectable or forkable, so you cannot audit how the agent reaches a remediation decision or run it on your own terms. Pricing follows the same posture. The site lists a pricing page but publishes no tiers or figures, leaving contact-sales as the only path in.

Neubird is a polished SaaS agent for teams that want autonomous resolution without owning the execution layer. If you need an inspectable, self-hostable harness, AURA answers a question Neubird does not.

Comparison table: best open source AI SRE tools at a glance

| Tool | Category | Open source | Agentic / execution layer | Best for | License |
| --- | --- | --- | --- | --- | --- |
| Mezmo + AURA | Agentic harness + telemetry pipeline | Yes, AURA | Yes | Agentic SRE execution with pipeline control | Apache-2.0 |
| Grafana + Prometheus | Observability, LGTM stack | Yes | No | Visualizing logs, metrics, and traces | AGPL / Apache |
| SigNoz | Unified observability | Yes | No | Single-install OTel observability | MIT / Apache |
| OpenObserve | Observability + storage | Yes | No | Low-cost telemetry storage at scale | AGPL-3.0 |
| Fluent Bit | Telemetry pipeline | Yes | No | High-performance data-plane routing | Apache-2.0 |
| Resolve AI | Autonomous incident responder | No | Yes | Hands-off triage on homogeneous stacks | Proprietary |
| Neubird | Autonomous ops agent | No | Yes | Polished SaaS autonomous remediation | Proprietary |

How we chose these tools

Four criteria decided which tools made this list. First is open source license and community health. We checked the actual license file and the GitHub signals that show a project is alive, including commit frequency, fork count, and open issue volume. A repo with a permissive license and a dead commit history does not earn a spot.

Second is functional layer. A tool that visualizes telemetry, a tool that routes it, and a tool that acts on it solve different problems. We tagged each entry so you can see whether it observes, transports, or executes.

Third is agentic capability depth. We separated tools that recommend from tools that execute within guardrails. Fourth is deployment burden, measured by how long it takes to reach a running setup.

Calibrate every autonomy claim against IBM's ITBench benchmark. Tested against 42 real-world SRE scenarios, current models resolved 13.8% of them autonomously (augmentcode.com). Treat any vendor promising full hands-off remediation with that ceiling in mind.

Conclusion

Mezmo and AURA lead because they are the only entry on this list that pairs an open, forkable execution harness with active telemetry pipeline control. AURA gives you an Apache 2.0 Rust harness you can read, fork, and run in under an hour. Mezmo's Active Telemetry Pipeline cuts the data agents see by 99.98% before an investigation starts, which Mezmo says keeps cost under $1 per investigation (mezmo.com). Every other autonomous responder on this list keeps its harness closed source.

The observability tools on this list remain valuable foundations. Grafana, SigNoz, OpenObserve, and Fluent Bit each do their job well, but none act when an incident fires.

Clone the harness and run it yourself at github.com/mezmo/aura. To connect your own agent framework to Mezmo's pipeline intelligence, explore the MCP server at mcp.mezmo.com/mcp.

Frequently asked questions

What is the difference between open source observability tools and open source AI SRE tools?

Observability tools collect, store, and display telemetry so you can see what your systems are doing. AI SRE tools act on that telemetry by triaging alerts, finding root causes, and running remediation within guardrails. Mezmo combines both, pairing an active telemetry pipeline with the AURA execution harness so the same stack observes and acts.

Is AURA production-ready?

AURA is an Apache 2.0 harness built in Rust with guardrails, state management, authentication, and OpenTelemetry tracing for a full audit trail. Mezmo ships it with pre-built triage, RCA, and remediation workflows and targets under one hour to a running agent (github.com/mezmo/aura). The community is newer than Grafana's, so expect active development and frequent config changes.

Can I use these tools alongside my existing Datadog or Prometheus stack?

Yes. AURA integrates with Prometheus, Datadog, and OpenTelemetry stacks without any data migration, and it egresses traces to Jaeger, Datadog, and Arize Phoenix (mezmo.com). You keep your existing instrumentation and layer agentic execution on top.

What does "execution layer" mean in the context of AI SRE?

The execution layer is where an agent takes action, running diagnostics, proposing fixes, and applying bounded remediation rather than only displaying data. Mezmo positions AURA as this layer, the hands that act while the telemetry pipeline supplies the intelligence. Open source at this layer means you can audit and fork the logic that touches production.
