The intelligence layer for production AI
Mezmo's Active Telemetry reduces millions of raw events into curated, context-rich signals. AURA, the open-source control plane on your infrastructure, orchestrates agents that get smarter with every incident. Together, they give your AI the right data and the framework to act on it.
Pick your entry point
Single Agent
Pick a use case (incident triage, runbook RCA, or on-call assistant). Wire it up with a TOML config. Ship your first production agent in under an hour.
- OpenAI-compatible with streaming SSE: Point LibreChat, OpenWebUI, or any existing frontend at it—zero adapter code.
- LLM agnostic: OpenAI, Anthropic, Bedrock, Gemini, Ollama, etc.
- MCP tool discovery at runtime: Datadog, PagerDuty, Slack, internal APIs—dynamic discovery, no code changes.
- Pre-built agentic SRE workflows grounded in your runbooks: Triage agent fires first, passes curated context to RCA agent, remediation agent acts on confirmed root cause.
< 1 hr to running an agent
5 LLM providers
0 boilerplate
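Because the endpoint speaks the standard OpenAI chat-completions protocol, streaming responses arrive as SSE `data:` lines carrying delta chunks, which is why off-the-shelf frontends work unmodified. A minimal sketch of client-side chunk assembly (the sample payloads are illustrative, not captured from AURA):

```python
import json

def assemble_sse_stream(lines):
    """Concatenate the content deltas from OpenAI-style SSE chat chunks."""
    text = []
    for line in lines:
        if not line.startswith("data: "):
            continue  # skip blank keep-alives and comments
        payload = line[len("data: "):]
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        delta = chunk["choices"][0]["delta"]
        if "content" in delta:
            text.append(delta["content"])
    return "".join(text)

# Illustrative stream, shaped like OpenAI chat.completion.chunk events
stream = [
    'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    'data: {"choices": [{"delta": {"content": "Checking "}}]}',
    'data: {"choices": [{"delta": {"content": "PagerDuty..."}}]}',
    "data: [DONE]",
]
print(assemble_sse_stream(stream))  # Checking PagerDuty...
```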
[llm]
provider = "anthropic"
api_key = "{{ env.ANTHROPIC_API_KEY }}"
model = "claude-opus-4-6"
[agent]
name = "Ops Assistant"
system_prompt = "You're an SRE assistant"
turn_depth = 3
[mcp.servers.clickhouse]
transport = "http_streamable"
url = "http://clickhouse-mcp:8000/mcp"
[mcp.servers.clickhouse.headers]
Authorization = "Bearer {{ env.MCP_TOKEN }}"
# Optional: Connect to Mezmo's MCP Server
[mcp.servers.mezmo]
transport = "http_streamable"
url = "https://mcp.mezmo.com/mcp"
[mcp.servers.mezmo.headers]
Authorization = "Bearer {{ env.MEZMO_API_KEY }}"
Agent Team
One agent handled one job. Now coordinate a team of specialized agents to triage, investigate, and remediate with an orchestrator managing handoffs.
- Multi-agent orchestration: Specialized workers coordinated by an orchestrator agent for complex, multi-step investigations.
- Safety controls: turn_depth, streaming timeouts, graceful shutdown, backpressure. Human-in-the-loop approval gates before any remediation action.
- OpenTelemetry + OpenInference tracing: Full audit trail across every agent—plans, prompts, tool calls, handoffs. Egresses to Arize Phoenix, Jaeger, Datadog, Mezmo.
15 → 5 min MTTR
60-80% toil eliminated
4 hrs → auto post mortem
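In the team config below, each worker is scoped to a tool allowlist via `mcp_filter`, using glob patterns like `analyze_logs_*`. The matching semantics can be pictured with Python's `fnmatch` (this illustrates the pattern style, not AURA's actual implementation; the tool names are examples):

```python
from fnmatch import fnmatch

def allowed_tools(discovered, mcp_filter):
    """Keep only the MCP tools whose names match an allowlist pattern."""
    return [t for t in discovered if any(fnmatch(t, p) for p in mcp_filter)]

# Tool names as an MCP server might advertise them (illustrative)
discovered = [
    "analyze_logs_errors",
    "deduplicate_logs_by_template",
    "delete_pipeline",       # destructive -- not in the allowlist
    "get_current_time",
]
mcp_filter = ["analyze_logs_*", "deduplicate_logs_*", "get_current_time"]
print(allowed_tools(discovered, mcp_filter))
# ['analyze_logs_errors', 'deduplicate_logs_by_template', 'get_current_time']
```

Scoping workers this way keeps destructive tools out of reach of agents that only need read access.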
# Orchestrator routes to specialist agents
[llm]
provider = "openai"
api_key = "{{ env.OPENAI_API_KEY }}"
model = "gpt-5.2"
[[vector_stores]]
name = "runbooks"
type = "qdrant"
url = "http://{{ env.QDRANT_HOST | default: 'localhost' }}:6334"
collection_name = "sre_runbooks"
context_prefix = "Operational runbooks covering incident response procedures, known failure modes, and troubleshooting guides"
embedding_model = { provider = "openai", model = "text-embedding-3-small", api_key = "{{ env.OPENAI_API_KEY }}" }
[agent]
name = "SRE Orchestrator"
system_prompt = """
You are an SRE Orchestrator. Decompose incident response tasks and delegate:
- incident-responder: PagerDuty incident lookup, alert details, oncall schedules
- metrics-analyst: Prometheus queries to validate alerts and check trends
- log-analyst: Log search, error patterns, timeline correlation
Maximize parallel execution when tasks have no data dependency.
"""
turn_depth = 15
temperature = 0.3
[mcp]
sanitize_schemas = true
[mcp.servers.pagerduty]
transport = "http_streamable"
url = "https://mcp.pagerduty.com/mcp"
headers = { Authorization = "Token token={{ env.PAGERDUTY_API_KEY }}" }
description = "PagerDuty MCP for incident details, oncall schedules, and alert status"
[mcp.servers.prometheus]
transport = "http_streamable"
url = "http://{{ env.PROMETHEUS_MCP_HOST | default: 'localhost' }}:8080/mcp"
description = "Prometheus MCP for querying system metrics"
[mcp.servers.log_analysis]
transport = "http_streamable"
url = "https://mcp.mezmo.com/mcp"
description = "Log analysis MCP for searching and correlating log events"
[orchestration]
enabled = true
[orchestration.worker.incident-responder]
description = "PagerDuty incident triage: fetch incident details, parse alerts, check oncall schedules"
turn_depth = 8
mcp_filter = [
"list_incidents",
"get_incident",
"list_alerts_from_incident",
"get_alert_from_incident",
"list_services",
"get_service",
"get_current_time",
]
preamble = """
You are an Incident Responder. Use PagerDuty tools to fetch and parse incidents.
Extract: environment, alert category, severity, timestamp, metric value, RunBook URL, and triggering query.
Always use tools — do not fabricate incident data.
"""
[orchestration.worker.metrics-analyst]
description = "Prometheus metrics analysis: validate alerts, check trends, identify anomalies"
turn_depth = 20
mcp_filter = [
"execute_query",
"execute_range_query",
"list_metrics",
"get_current_time",
]
preamble = """
You are a Metrics Analyst. Query Prometheus to validate alerts, check trends, and identify anomalies.
Always get current time before range queries. Do not fabricate metric values.
Report query results clearly with metric names, labels, and values.
"""
[orchestration.worker.log-analyst]
description = "Log analysis: search logs, analyze error patterns, correlate events across time"
turn_depth = 20
vector_stores = ["runbooks"]
mcp_filter = [
"analyze_logs_*",
"deduplicate_logs_*",
"get_correlated_timeline_*",
"get_current_time",
"get_log_histogram",
"list_log_fields",
]
preamble = """
You are a Log Analyst. Search and analyze logs for operational investigations.
Search runbooks for known failure patterns when errors match documented scenarios.
Report findings with timestamps, error messages, and relevant context.
"""
Engineered Context
Already using LangChain, CrewAI, or your own framework? The bottleneck is the data going in. Mezmo is the context layer that makes any agent smarter.
- Active Telemetry Pipeline: Deduplicate, cluster, enrich before agents see data. Up to 99.98% compression—every removed token saves inference cost.
- Agent-optimized MCP server: Returns curated, task-scoped data—not raw firehose.
- Just-in-time context delivery: Each workflow step gets precisely scoped data. Dynamic assembly as investigations unfold—not a dump of everything.
~$1 per investigation
99.98% data reduction
50-70% more efficient
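The numbers above follow from simple token arithmetic. A sketch, assuming a raw investigation pulls ~2.4M tokens of context and an illustrative input price of $15 per million tokens (actual model pricing varies):

```python
RAW_TOKENS = 2_400_000   # raw vendor MCP firehose per investigation
REDUCTION = 0.9998       # pipeline compression ("up to 99.98%")
PRICE_PER_M = 15.00      # illustrative $/1M input tokens

curated_tokens = RAW_TOKENS * (1 - REDUCTION)
raw_cost = RAW_TOKENS / 1_000_000 * PRICE_PER_M
curated_cost = curated_tokens / 1_000_000 * PRICE_PER_M

print(f"curated context: {curated_tokens:.0f} tokens")        # ~480 -> "<1K curated signals"
print(f"raw: ${raw_cost:.2f}  curated: ${curated_cost:.4f}")  # $36.00 vs. well under $1
```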
AURA + Mezmo MCP (curated context)
[mcp.servers.mezmo]
transport = "http_streamable"
url = "https://mcp.mezmo.com/mcp"
[mcp.servers.mezmo.headers]
Authorization = "Bearer {{ env.MEZMO_TOKEN }}"
# No local MCP server to run.
# Mezmo returns pipeline-processed signals,
# not raw API firehose.
# WITHOUT Mezmo (raw vendor MCP)
# → 2.4M tokens per investigation
# → 88% noise in context window
# → $30-36 per investigation
# → 14+ min MTTR
# WITH Mezmo pipeline + MCP
# → <1K curated signals
# → noise removed before agent sees it
# → <$1 per investigation
# → <5 min MTTR
Control your data
Many teams start here. OTel migration, cost reduction, vendor consolidation. Get your data under control first, then layer agents on top when you're ready.
- Flexible telemetry routing: Ingest with OTel and route to Mezmo, Datadog, Grafana, Elastic, or S3. Migrate between destinations slowly or all at once.
- Cost profiling: Identify high-volume, low-value streams. Cut observability spend up to 70%.
- Proactive anomaly detection: Continuous monitoring for degraded signals and drift. Surface issues before they become incidents.
Up to 70% cost reduction
0 vendor lock-in
Proactive not reactive
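Cost profiling comes down to ranking streams by volume against how often they are actually queried. A toy sketch of the idea (stream names and numbers are invented):

```python
# (volume in GB/day, reads in queries/day) per telemetry stream -- invented data
streams = {
    "debug-logs-staging": (900, 2),
    "access-logs-prod":   (400, 150),
    "k8s-events":         (250, 4),
    "app-traces-prod":    (120, 300),
}

def low_value(streams, min_gb=200, max_reads=10):
    """Flag high-volume streams that almost nobody queries."""
    return sorted(
        name for name, (gb, reads) in streams.items()
        if gb >= min_gb and reads <= max_reads
    )

print(low_value(streams))  # ['debug-logs-staging', 'k8s-events']
```

Streams flagged this way are candidates for sampling, archiving to S3, or dropping entirely.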
Mezmo as the brain, AURA as the hands.
Ingests, profiles, and understands telemetry in real time, with pipelines that can modify and alert in-stream.
- Easy to get started with over 100 integrations
- In-stream parsing and enrichment with intent-based direction
- One-click OTel migration
Open-source agentic harness that orchestrates AI workflows across your stack. Forever open source & production ready.
- MCP-native tool connectivity, LLM agnostic
- Self-correcting through a plan → execute → synthesize → evaluate loop
- Custom agentic runbooks
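The plan → execute → synthesize → evaluate cycle can be pictured as a retry loop in which the evaluator's critique feeds the next planning pass (a structural sketch only, not AURA's actual control flow):

```python
def run(task, plan, execute, synthesize, evaluate, max_turns=3):
    """Self-correcting loop: re-plan with evaluator feedback until accepted."""
    feedback = None
    for _ in range(max_turns):
        steps = plan(task, feedback)        # decompose, using prior critique
        results = [execute(s) for s in steps]
        answer = synthesize(task, results)  # merge step results into an answer
        ok, feedback = evaluate(task, answer)
        if ok:
            return answer
    return answer  # best effort after max_turns

# Toy stand-ins to show the control flow
answer = run(
    "why is p99 up?",
    plan=lambda t, fb: ["check metrics"] if fb is None else ["check metrics", "check logs"],
    execute=lambda s: f"{s}: done",
    synthesize=lambda t, rs: "; ".join(rs),
    evaluate=lambda t, a: ("check logs" in a, "also check logs"),
)
print(answer)  # check metrics: done; check logs: done
```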
The right data for your agents. Faster resolution for your team.