Telemetry Tracing: Best Practices & Use Cases

Learning Objectives

This learn article dives deep into the key aspects of telemetry tracing and OpenTelemetry. It covers definitions for traces and spans, offers best practices for OpenTelemetry tracing, and walks through an example of how to set it up.

What is telemetry tracing?

Telemetry tracing - often referred to as distributed tracing - is a method for tracking and visualizing the journey of a request or transaction as it flows through the components of a distributed system. It provides end-to-end visibility into system behavior, performance, and dependencies.

Telemetry tracing is made up of traces, spans, context propagation, and instrumentation. A trace represents the full lifecycle of a request as it moves through various services/components in a system, and each span within a trace represents a single operation or unit of work. Context propagation is what happens as requests travel through different services - the tracing context (like trace ID and span ID) is passed along to maintain linkage between operations. And instrumentation involves inserting code or using libraries (e.g., OpenTelemetry) to collect trace data.

Telemetry tracing can be used to diagnose bottlenecks across microservices, identify latency sources in request paths, understand system dependencies and failure points, correlate traces with logs and metrics for full observability, and improve performance optimization and incident response.

What is OpenTelemetry?

OpenTelemetry (often abbreviated as OTel) is an open-source observability framework designed to collect, generate, and export telemetry data (metrics, logs, and traces) from applications and infrastructure. It provides vendor-neutral, standardized instrumentation so developers and operators can understand system behavior and performance across distributed systems.

OTel has a number of key components including:

  1. APIs
    Provide a language-specific interface for creating telemetry data (traces, metrics, logs).
  2. SDKs
    Offer the implementation of the API, including sampling, batching, and exporting.
  3. Instrumentation Libraries
    Prebuilt or custom libraries that auto-instrument common frameworks (HTTP, gRPC, database clients).
  4. Collectors
    The OpenTelemetry Collector is a vendor-agnostic agent/service that receives, processes, and exports telemetry data to backends like Jaeger, Prometheus, Mezmo, etc.
  5. Exporters
    Translate telemetry data into formats compatible with external observability platforms (OTLP, Zipkin, or Prometheus formats).

OTel primarily handles three types of data: traces (which track request paths across services), metrics (which measure system behavior), and logs (which capture structured or unstructured application/system events). OTel works in three steps: first, code is instrumented using OTel libraries or auto-instrumentation; then, traces, metrics, and logs are collected at runtime; and finally, collected data is sent to an observability backend via the OpenTelemetry Collector or exporters.

Organizations report a wide variety of benefits from adopting OTel, including a unified standard for all telemetry types, vendor-neutral open-source governance, wide ecosystem support, reduced vendor lock-in, and support for both manual and automatic instrumentation.

Traces: Definitions

TracerProvider

A TracerProvider in OpenTelemetry is the central component responsible for creating and managing tracers, which in turn generate and record spans (units of work in a trace). It acts as the entry point to tracing within an application.

In the OpenTelemetry tracing pipeline, the TracerProvider is the top-level object that configures tracing. The tracer is created by the TracerProvider and is used in code to start spans; each span represents an individual operation or step in a trace.

A tracer provider creates tracers, controls span processors and exporters, manages configuration, and routes spans to the right backends or collectors, as the sketch below illustrates.
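Here is a minimal Python sketch of that pipeline, mirroring the setup walkthrough later in this article; the service and span names are illustrative.

python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# The TracerProvider is the top-level object: it owns processors and exporters.
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

# Tracers come from the provider and are what application code uses to start spans.
tracer = trace.get_tracer("checkout-service")
with tracer.start_as_current_span("charge-card") as span:
    span.set_attribute("payment.amount", 42.5)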

Tracer

In OpenTelemetry, a Tracer is the component used to create spans, which are the individual units of work in a distributed trace. It's the primary interface developers use in application code to generate and record tracing data.

The Tracer starts and ends spans, connects spans into traces, attaches attributes, events, and status to spans, and helps instrument code to observe distributed workflows. Tracers are critical because they enable distributed tracing, provide context for telemetry data, improve debugging, and support root cause analysis.

Tracer vs. Tracer Provider

Tracer Provider | Tracer
Singleton or app-level object | Specific to a module or component
Configures tracing pipeline | Starts and manages spans
Manages exporters/processors | Used in business logic code

Trace Exporters

Trace exporters in OpenTelemetry are components responsible for sending collected trace data (spans) to an external backend system or observability platform for storage, visualization, and analysis.

They act as the final step in the telemetry pipeline: after spans are created and processed, exporters send them out to observability tools or custom destinations.

A trace exporter converts span data into the correct format for the destination, transmits spans to external systems via protocols, and works with Span Processors to handle batch delivery, retries, and error handling.

Teams find trace exporters useful for a number of reasons. They decouple instrumentation from backends, support multiple observability platforms, enable centralized tracing and analysis, optimize performance through batching and asynchronous export, and facilitate vendor flexibility and observability portability.
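As a rough sketch of that decoupling (the endpoint and names are placeholders), the same instrumentation can fan out to a console exporter for local debugging and an OTLP exporter for a collector or backend:

python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter

provider = TracerProvider()
# Local debugging: print finished spans to stdout.
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
# Production path: ship the same spans to a collector or backend over OTLP/HTTP.
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4318/v1/traces"))
)
trace.set_tracer_provider(provider)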

Context Propagation

Context propagation in OpenTelemetry is the mechanism that allows telemetry data - especially trace context - to be passed across service boundaries and asynchronous operations, so that spans can be linked into a complete distributed trace.

In a distributed system, a request may travel through many services. Each service creates a span, but without context propagation, those spans would appear as unrelated traces. With context propagation, all spans can be connected into one coherent trace, showing the full lifecycle of a request.

Four items are typically propagated: the trace ID, the span ID, sampling decisions and sometimes baggage.

The process of context propagation kicks off with a tracer starting a span and attaching its context to the current execution thread. Then the context is injected into headers before making a remote call. In the final step, the receiving service extracts the context from incoming headers and uses it to create a child span linked to the original trace.
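Here is a minimal sketch of that inject/extract flow in Python, assuming an HTTP call made with the requests library and a hypothetical handler on the receiving side:

python
import requests
from opentelemetry import trace
from opentelemetry.propagate import inject, extract

tracer = trace.get_tracer("frontend")

# Caller: start a span and inject its context into the outgoing headers.
with tracer.start_as_current_span("call-orders-service"):
    headers = {}
    inject(headers)  # adds W3C traceparent/tracestate entries
    requests.get("http://orders.internal/api/orders", headers=headers)

# Callee: extract the incoming context and start a child span linked to the original trace.
def handle_request(incoming_headers: dict):
    ctx = extract(incoming_headers)
    with trace.get_tracer("orders").start_as_current_span("list-orders", context=ctx):
        ...  # this span continues the caller's trace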

Context propagation has a number of benefits. It enables end-to-end trace visibility across distributed systems, maintains parent-child relationships between spans, supports both synchronous and asynchronous operations, and facilitates root cause analysis and latency breakdown.

Spans: Definitions

Span Context

In OpenTelemetry, a span context is a lightweight, immutable object that carries the identity and metadata of a span, allowing it to be linked to other spans and propagated across services or threads. It is crucial for enabling distributed tracing and context propagation.

A SpanContext includes the following key fields:

Field | Description
Trace ID | Globally unique identifier for the entire trace
Span ID | Unique identifier for the current span
Trace Flags | Indicates sampling decision (e.g., sampled or not)
Trace State (optional) | Vendor-specific trace metadata (e.g., priority, tenant info)
Is Remote | Indicates whether the context was extracted from a remote service

A span context links spans together in a trace (parent-child relationships), carries tracing information across service boundaries, and enables context propagation so all spans stay part of the same trace. Span context is important because it maintains trace continuity across services and threads, enables correlation of spans into a coherent trace, powers trace exporters and visualization tools, and allows tools to apply sampling and filtering decisions.
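To make those fields concrete, here is a small sketch (names are illustrative) that reads the current span's SpanContext:

python
from opentelemetry import trace

tracer = trace.get_tracer("inventory")

with tracer.start_as_current_span("reserve-stock") as span:
    ctx = span.get_span_context()
    print("trace_id :", format(ctx.trace_id, "032x"))  # shared by every span in the trace
    print("span_id  :", format(ctx.span_id, "016x"))   # unique to this span
    print("sampled  :", ctx.trace_flags.sampled)       # sampling decision (trace flags)
    print("is_remote:", ctx.is_remote)                 # True if extracted from another service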

Span Attributes

Span attributes in OpenTelemetry are key-value pairs attached to a span to provide additional context and metadata about the operation it represents. They help describe what happened, where it happened, and how, making trace data more meaningful, searchable, and actionable.

Span attributes describe details like the HTTP request method, database query, user ID, cloud region, or hostname. This metadata helps filter and search traces, group spans by common tags, and diagnose performance issues and trace root causes. Span attributes add context to each span, while enabling fine-grained filtering in observability tools. Span attributes power dashboards, alerts, and analysis so teams can improve troubleshooting and root cause identification.
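For example, here is a sketch of setting attributes on a span; the keys follow common semantic conventions, but the values are illustrative:

python
from opentelemetry import trace

tracer = trace.get_tracer("api-gateway")

with tracer.start_as_current_span("GET /orders/{id}") as span:
    # Standard semantic-convention keys where they exist...
    span.set_attribute("http.request.method", "GET")
    span.set_attribute("http.route", "/orders/{id}")
    span.set_attribute("cloud.region", "us-east-1")
    # ...and namespaced custom keys for business context.
    span.set_attribute("order.id", "1234")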

Span Events

Span events in OpenTelemetry are timestamped annotations added to a span to represent notable moments or intermediate steps during the span's execution. They help enrich span data with in-line context about what happened within the span’s lifetime, without creating separate spans.

A Span event consists of a name (usually a short label for the event), a timestamp, and sometimes attributes. Teams use Span events to mark significant occurrences like errors or exceptions, retries or fallbacks, time of external calls, or state transitions. Span events provide granular detail without needing additional spans and help with debugging, performance profiling, and understanding behavior.
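A short sketch of adding events to a span (the event names and attributes are illustrative):

python
from opentelemetry import trace

tracer = trace.get_tracer("payments")

with tracer.start_as_current_span("charge-card") as span:
    span.add_event("request.sent", {"peer.service": "payment-provider"})
    # ... call the payment provider ...
    span.add_event("retry", {"retry.attempt": 1, "retry.reason": "timeout"})
    span.add_event("response.received", {"http.response.status_code": 200})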

Span Events vs. Spans

Feature | Span Event | Span
Represents | A moment in time | A duration of work
Overhead | Minimal | Higher (creates new span context)
Hierarchy | Lives inside a span | Can be parent/child spans
Use case | Checkpoints, logs, minor events | Full operations or service calls

When to use Span Events vs. Span Attributes

Choosing between span events and span attributes depends on what you're capturing and when it happens during the execution of a span.

Here's a detailed comparison to help you decide:

Use Span Attributes When:

Use Case | Description
Static context | Information known at the start or throughout the entire span (e.g., HTTP method, DB type)
Describing the operation | Metadata that explains the span's purpose or environment (e.g., user ID, resource name, cloud region)
Span-wide properties | Attributes that apply to the full duration of the span

Use Span Events When:

Use Case | Description
Dynamic or time-specific events | Something that happens at a specific time during the span (e.g., a retry, exception, cache miss)
Milestones in execution | Intermediate stages or checkpoints inside the operation (e.g., request sent, response received)
Debugging or tracing internal steps | When you want to see what happened inside the span over time

Overall, use attributes for describing the span and use events for what happened during the span.

Span Links

Span links in OpenTelemetry are references to other spans that are related to the current span but are not its direct parent. They allow you to connect spans across traces or branches that are logically related but do not follow the traditional parent-child hierarchy.

A span link contains a reference to a SpanContext and optional attributes describing the relationship, but it implies no timing or causal relationship the way parent-child spans do. Span links are useful when a span has multiple parents or depends on multiple inputs, or when it is important to preserve context across asynchronous or concurrent operations. Span links also matter when you're sampling traces but want to maintain relationships with unsampled spans.
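As a sketch, a batch worker might link to the span contexts of the messages it processes instead of claiming any one of them as a parent (the function and attribute names are illustrative):

python
from opentelemetry import trace
from opentelemetry.trace import Link

tracer = trace.get_tracer("batch-worker")

def process_batch(message_span_contexts):
    # One link per originating message; links carry no parent-child or timing semantics.
    links = [Link(ctx, {"messaging.operation": "process"}) for ctx in message_span_contexts]
    with tracer.start_as_current_span("process-batch", links=links) as span:
        span.set_attribute("messaging.batch.message_count", len(message_span_contexts))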

Span Status

In OpenTelemetry, a span status indicates the outcome of the operation represented by a span—whether it was successful, failed, or encountered an error. It is a semantic signal used to describe how the operation ended, which is essential for troubleshooting, alerting, and trace analysis.

A span’s status consists of two parts:

Field | Description
Status Code | A standardized result indicator (UNSET, OK, or ERROR)
Description (optional) | A human-readable explanation (e.g., "Timeout connecting to DB")

Use Span status for error tracking, alerting, performance debugging and filtering.

Span kind

In OpenTelemetry, SpanKind specifies the role a span plays in a distributed system interaction, such as whether it represents a client request, a server response, or an internal operation. It helps observability tools interpret the meaning of each span and how it relates to other spans in the trace.

By setting the correct SpanKind, you provide semantic meaning about how the span participates in a system’s architecture. This is critical for trace correlation across services, accurate dependency mapping, and meaningful visualization in observability tools.

Client

The client makes the outbound remote call. A client span is typically the parent, while the corresponding server span is the child - for example, “service A calls service B.”

Server

The server handles an inbound request and is usually the child of a client span.

Internal

An Internal Span Kind is used for in-process (local) operations; it is the default kind when none is specified.

Producer

A Producer Span Kind represents sending a message to a queue (outbound). It may act as the parent of a corresponding Consumer Span Kind.

Consumer

A Consumer Span Kind is receiving or processing an inbound message.
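A quick sketch of setting each kind explicitly (the span names are illustrative; INTERNAL is the default when no kind is passed):

python
from opentelemetry import trace
from opentelemetry.trace import SpanKind

tracer = trace.get_tracer("orders")

# Outbound call to another service; the receiving side would use SpanKind.SERVER.
with tracer.start_as_current_span("GET /inventory", kind=SpanKind.CLIENT):
    ...

# Publishing to a queue; the worker that processes it would use SpanKind.CONSUMER.
with tracer.start_as_current_span("publish order.created", kind=SpanKind.PRODUCER):
    ...

# Purely in-process work defaults to SpanKind.INTERNAL.
with tracer.start_as_current_span("validate-order"):
    ...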

Best practices for OpenTelemetry Tracing

Data consistency

Data consistency is a critical best practice in OpenTelemetry tracing because it ensures that trace data collected from distributed systems is reliable, coherent, and useful for performance monitoring, debugging, and root cause analysis.

Inconsistent data - such as mismatched trace IDs, incorrect span kinds, or irregular attribute naming - can break trace continuity, mislead analysis, or make it impossible to correlate telemetry across services.

Data consistency refers to the uniformity and correctness of the tracing data across spans, services, and systems. It covers trace structure consistency, attribute and naming conventions, context propagation integrity, time and clock synchronization, and span semantic correctness.

To ensure data consistency in OTel tracing, experts recommend seven best practices.

1. Maintain consistent trace context across services

2. Follow semantic conventions for attributes

3. Use span kinds accurately

4. Synchronize timestamps

5. Apply consistent naming for spans and services

6. Ensure sampling decisions are honored across services

7. Keep span status accurate

Attribute Selection

Attribute selection in OpenTelemetry tracing refers to the intentional and consistent choice of attributes (i.e., key-value metadata) that are attached to spans to describe the who, what, where, and why of an operation.

Get the most out of OTel tracing by following these six best practices for attribute selection:

1. Use semantic conventions

2. Avoid high-cardinality attributes

3. Include business-relevant metadata

4. Limit attribute volume per span

5. Tag spans with environment context

6. Ensure consistency across services

Naming Conventions

In OpenTelemetry tracing, naming conventions are a critical best practice that ensure your telemetry data is consistent, interpretable, and useful across all teams, services, and tools.

Well-defined naming conventions apply to span names, attribute keys, service names, and instrumentation libraries.

By following these five naming best practices, you make trace data easier to search, visualize, analyze, and correlate across distributed systems.

1. Use clear, consistent span names

2. Follow standardized attribute keys

3. Standardize service names

4. Use consistent naming for custom attributes

5. Include versioning where relevant

Context Propagation

Context propagation is a foundational best practice in OpenTelemetry tracing that ensures trace data remains coherent and connected as requests flow through distributed systems across services, threads, processes, and network boundaries.

Without consistent context propagation, your traces become fragmented, making it impossible to accurately reconstruct the end-to-end journey of a request.

Experts suggest following six steps to get the most out of context propagation.

1. Always inject and extract trace context

2. Use standard propagation formats

3. Propagate context across async and threaded work

4. Handle context in messaging systems

5. Use a global propagator across your application

6. Respect and continue incoming trace context

Resource Management

Batching and compression

The goal of batching and compression is to shrink outbound traffic and reduce CPU context switching. This is critical because network egress costs and per‑span export overhead can dwarf application work in large deployments.

Best-practice checklist

Setting | Guideline | Practical Tip
Batch size | 50–512 spans | Tune until P99 flush latency < acceptable SLO
Export interval | 1–5 s | Keep below half the smallest scrape/aggregation window you care about
Queue limit | 2× batch size | Drop oldest spans if the queue is full to avoid OOM
Compression | Enable gzip when RTT > 10 ms or bandwidth is metered | Use the OTEL_EXPORTER_OTLP_COMPRESSION=gzip env var
Graceful shutdown | Call tracerProvider.shutdown() on SIGTERM | Ensures a final flush in a Kubernetes pre‑stop hook
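Here is one way those settings might be wired up in Python. The numbers are illustrative starting points, and the Compression enum import assumes the OTLP/HTTP exporter package; setting OTEL_EXPORTER_OTLP_COMPRESSION=gzip in the environment is an equivalent alternative.

python
import atexit

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http import Compression
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter

exporter = OTLPSpanExporter(
    endpoint="http://localhost:4318/v1/traces",
    compression=Compression.Gzip,   # gzip the outbound OTLP payloads
)

provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(
        exporter,
        max_export_batch_size=512,   # spans per batch
        schedule_delay_millis=5000,  # flush every 5 s
        max_queue_size=2048,         # bounded queue so memory can't grow without limit
    )
)
trace.set_tracer_provider(provider)

# Flush whatever is still queued when the process exits (e.g., SIGTERM in Kubernetes).
atexit.register(provider.shutdown)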

Sampling

With sampling, teams want to control how many spans are actually stored or exported. This process removes noise, keeps retention affordable, and avoids UI overload.

Choosing a policy

Scenario | Sampling Recommendation
Dev / CI | AlwaysOn for full visibility
Low‑QPS prod service | TraceIdRatioBased(1.0) (i.e., keep every trace) may be fine
High‑volume APIs | TraceIdRatioBased(0.01–0.10) + tail sampling on errors/long‑latency
Compliance / audits | ParentBased(TraceIdRatioBased(...)) so child spans obey the decision made at the edge gateway
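A minimal sketch of configuring the SDK sampler along those lines; the 5% ratio is illustrative, and tail sampling on errors or latency would be configured in the collector rather than in the SDK:

python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# Sample ~5% of traces at the root; child services respect the parent's decision,
# so a whole trace is either kept or dropped consistently.
sampler = ParentBased(root=TraceIdRatioBased(0.05))
trace.set_tracer_provider(TracerProvider(sampler=sampler))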

Automatic and manual instrumentation

Teams need to balance coverage, accuracy, and engineering effort, which is why it’s important to have a strategy around automatic and manual instrumentation. Over‑instrumenting hurts performance; under‑instrumenting leaves blind spots.

Best‑practice blend

Layer | Approach
Framework / I/O | Auto‑instrument to avoid missing common calls
Business logic | Manually wrap key user journeys, error paths, and high‑latency loops
Background workers / batch jobs | Manual spans with SpanKind.INTERNAL (or CONSUMER/PRODUCER); auto agents often miss these
Performance hotspots | Opt out (otel.instrumentation.<lib>.enabled=false) or use SpanSuppression to skip

To sum up: Batch first, then compress to slash network chatter without losing data. Sample deterministically and consistently - edge or tail, but make the policy explicit and version‑controlled. Mix auto and manual instrumentation: auto for breadth, manual for depth. Provide shared libraries and CI checks so every service follows the same rules, otherwise “resource management” becomes “resource chaos.”

Security and Configuration

In OpenTelemetry tracing, Security and Configuration best practices are essential to ensure that your observability stack is safe, compliant, and performant. Since tracing data often includes sensitive information (user IDs, API keys, request headers, etc.), poor security or misconfiguration can lead to:

  • Data leaks
  • Regulatory violations (e.g., GDPR, HIPAA)
  • Attack surface expansion

Secure Configuration

The key goals of security configuration are to prevent unauthorized access, protect data in transit and at rest, and ensure trace context can't be spoofed. 

Suggested Best Practices:

Area | Practice | Why
Transport Security | Use TLS for all telemetry pipelines (e.g., OTLP gRPC/HTTP) | Prevents man-in-the-middle attacks
Authentication | Secure communication between apps and the OpenTelemetry Collector with mTLS or API tokens | Prevents unauthorized ingestion/export
Environment isolation | Don’t share trace pipelines across prod/dev/test | Limits blast radius and access controls
Secrets management | Store exporter credentials (e.g., Mezmo API keys) in secure vaults, not code | Avoids accidental exposure
Rate limiting | Set max batch size and export intervals | Prevents abuse and unintended data flooding
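As a sketch of the transport-security and secrets points, the exporter can use a TLS endpoint and an auth token pulled from the environment; the endpoint, header name, and environment variable are placeholders for whatever your collector or backend expects:

python
import os

from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter

# TLS endpoint plus an auth token read from the environment (populated from a vault),
# never hard-coded in source or config files.
exporter = OTLPSpanExporter(
    endpoint="https://otel-collector.internal:4318/v1/traces",
    headers={"authorization": f"Bearer {os.environ['TRACING_INGEST_TOKEN']}"},
)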

Minimizing components

The key goals of minimizing components are to reduce the attack surface, simplify auditing and maintenance, and improve traceability and compliance. Every extra component adds operational overhead and potential vulnerabilities. 

Suggested Best Practices:

Component | Best Practice
Auto-instrumentation agents | Use only what's necessary (disable unused libraries with otel.instrumentation.<lib>.enabled=false)
Collectors | Deploy the minimum number of components: prefer an agent + central collector setup over many standalone agents
Exporters | Export only to necessary backends; avoid duplicating data unless explicitly needed
Third-party libraries | Validate and audit custom exporters or processors for telemetry data leaks

Data Scrubbing

The key goals of data scrubbing are to avoid sending sensitive or PII data, maintain compliance, and reduce telemetry noise. Even a single exposed token or ID in a span can result in a serious breach or compliance violation.

Suggested Best Practices:

Area | Recommendation
Span attributes | Never include raw PII (e.g., emails, passwords, tokens) in span attributes
Request/response payloads | Avoid attaching full HTTP bodies or database responses
Baggage & context | Scrub user-defined baggage keys that may leak secrets
Scrubbing processors | Use OTel Collector processors such as the attributes processor to filter or redact keys before export
Custom logic | Implement middleware in your app to exclude or hash sensitive values before adding them to spans
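The "custom logic" row might look something like this sketch: hash the sensitive value in application code before it ever becomes a span attribute (the function and key names are illustrative).

python
import hashlib

from opentelemetry import trace

tracer = trace.get_tracer("auth-service")

def hashed(value: str) -> str:
    # One-way hash: spans stay correlatable per user without exposing the raw PII.
    return hashlib.sha256(value.encode()).hexdigest()[:16]

with tracer.start_as_current_span("login") as span:
    span.set_attribute("user.id_hash", hashed("alice@example.com"))  # never the raw email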

Collector Security

The key goals of collector security are to harden the OpenTelemetry Collector as a trusted system component and protect against data injection, misrouting, or unauthorized use.

Suggested Best Practices:

Area | Best Practice
Run as non-root | Deploy collector containers as non-root users
Limit permissions | Use Kubernetes RBAC or IAM policies to restrict what the collector can access
Isolate network | Run collectors in private subnets or separate namespaces (e.g., observability)
TLS everywhere | Use TLS on both internal and external endpoints (OTLP, Prometheus, etc.)
Logging & auditing | Enable audit logs and monitor collector activity
Upgrade regularly | Keep the collector updated to patch CVEs and stay aligned with OTel spec changes

Error Handling

Error handling is a crucial best practice in OpenTelemetry tracing that ensures application errors are captured, classified, and traceable throughout a distributed system. Properly instrumented errors make it easier to:

  • Identify failure points
  • Diagnose root causes
  • Improve system reliability
  • Trigger alerts and observability workflows

Experts suggest following these six best practices (a short sketch follows the list):

1. Set span status explicitly on error

2. Capture and record exceptions as events

3. Use semantic attributes to enrich error context

4. Handle errors in both client and server spans

5. Don’t swallow or misclassify errors

6. Link logs to traces for full context
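A minimal sketch pulling several of these together; the exception and attribute key are illustrative:

python
from opentelemetry import trace
from opentelemetry.trace import Status, StatusCode

tracer = trace.get_tracer("orders")

with tracer.start_as_current_span("charge-card") as span:
    try:
        raise TimeoutError("payment provider did not respond")  # simulate a failure
    except TimeoutError as exc:
        span.record_exception(exc)                            # recorded as a span event
        span.set_attribute("error.type", type(exc).__name__)  # enrich the error context
        span.set_status(Status(StatusCode.ERROR, str(exc)))   # explicit ERROR status
        # handle or re-raise here; don't swallow the error silently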

End Spans 

Ending spans properly is a fundamental best practice in OpenTelemetry tracing that ensures spans accurately represent the lifecycle of operations and reflect correct timing, relationships, and resource usage. If you don’t explicitly end spans (or end them incorrectly), you risk broken traces, inaccurate metrics, and misleading observability data.

Suggested best practices (a short sketch follows the list):

1. Always end spans explicitly

2. End spans at the right time

3. Use context managers or try/finally blocks

4. Avoid ending a span multiple times

5. Ensure asynchronous work ends the original span
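A short sketch of the two safest patterns (the span names are illustrative):

python
from opentelemetry import trace

tracer = trace.get_tracer("reports")

# Preferred: the context manager ends the span exactly once, even if an exception is raised.
with tracer.start_as_current_span("generate-report"):
    ...

# Manually started spans: pair start_span() with try/finally so the span always ends.
span = tracer.start_span("export-report")
try:
    ...
finally:
    span.end()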

Choosing the right backend for staging and analysis

Choosing the right backend for staging and analysis is a strategic best practice in OpenTelemetry tracing. Your backend determines how traces are stored, queried, visualized, and acted upon, and the right choice depends on your use case, team maturity, cost constraints, and compliance requirements.

Select a backend that balances observability depth, scalability, and operational fit for both staging (test) and production (analysis) environments.

Suggested best practices: 

1. Define your environment-specific goals

2. Understand backend types

3. Evaluate key criteria

4. Match tooling to team maturity

5. Use separate backends for stage vs. prod (optional)

A great example of a backend for telemetry tracing would be Mezmo. Mezmo (formerly LogDNA) leverages OpenTelemetry to enhance its observability platform. By integrating OpenTelemetry collectors and exporters, Mezmo enables users to ingest logs, metrics, and traces from across their infrastructure with minimal setup. This unified view of telemetry data empowers DevOps and SRE teams to diagnose issues faster, optimize performance, and ensure reliability.

When used together, OpenTelemetry collects and standardizes observability data, while Mezmo ingests, enriches, and routes that data to optimize performance, cost, and insights.

Together, that leads to:

  • An end-to-end observability pipeline: From source to destination with flexibility and control.
  • Better incident response: Faster troubleshooting using structured and enriched logs and traces.
  • Optimized telemetry costs: Collect broadly, route selectively, and store strategically.
  • Enhanced developer workflows: Faster debugging and visibility without reinventing tooling.

Example on how to set up OpenTelemetry tracing

Here's a simple example to help you set up OpenTelemetry tracing in an application. We'll walk through the steps using Python, but the principles apply to any language supported by OpenTelemetry.

Step 1: Install Required Packages

bash
pip install opentelemetry-api
pip install opentelemetry-sdk
pip install opentelemetry-exporter-otlp
pip install opentelemetry-instrumentation
pip install opentelemetry-instrumentation-requests

Step 2: Initialize OpenTelemetry Tracer

python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter

# Step 1: Set the global tracer provider
trace.set_tracer_provider(TracerProvider())

# Step 2: Create an OTLP exporter (can point to collector or backend)
otlp_exporter = OTLPSpanExporter(endpoint="http://localhost:4318/v1/traces")  # the HTTP exporter has no insecure flag; the endpoint scheme (http/https) controls TLS

# Step 3: Configure batch processor
span_processor = BatchSpanProcessor(otlp_exporter)
trace.get_tracer_provider().add_span_processor(span_processor)

# Step 4: Get a tracer
tracer = trace.get_tracer("my-service-name")

Step 3: Create and End a Span

python
from opentelemetry.trace import Status, StatusCode

with tracer.start_as_current_span("process-order") as span:
    try:
        # Simulate work
        result = "Order processed"
        span.set_attribute("order.id", "1234")
        span.set_status(Status(StatusCode.OK))
    except Exception as e:
        span.record_exception(e)
        span.set_status(Status(StatusCode.ERROR, str(e)))

Step 4: Auto-Instrument Common Libraries (Optional)

python
from opentelemetry.instrumentation.requests import RequestsInstrumentor
RequestsInstrumentor().instrument()

This automatically traces outbound HTTP calls made with the requests library.

Step 5: Run an OpenTelemetry Collector (Optional)

Use the OpenTelemetry Collector if you want to buffer, transform, or export to multiple backends:
Sample Collector Config:
yaml
receivers:
  otlp:
    protocols:
      http:
      grpc:

exporters:
  logging:
    loglevel: debug
  otlphttp:
    endpoint: https://api.your-backend.com
    compression: gzip

processors:
  batch:

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [logging, otlphttp]

Run the collector with this config to act as an intermediary.
Output Example in Logs (with logging exporter)
json
{
  "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
  "span_id": "00f067aa0ba902b7",
  "name": "process-order",
  "status": {
    "code": "OK"
  },
  "attributes": {
    "order.id": "1234"
  }
}

Summing it up

Telemetry tracing is a complex but critical component of observability. Industry-approved best practices can make the process easier, as can the right choice of observability tool. Our best advice? Take it step-by-step!
