
Configuring and Leveraging Cluster Observability

This tutorial explains how to configure, access, and query telemetry data (metrics, logs, and traces) from your exalsius cluster. You'll learn how to use the built-in Grafana dashboards, query data programmatically via REST APIs, and extend observability to collect custom metrics and traces from your applications.

When observability is enabled on your cluster, exalsius automatically configures secure access to:

  • Metrics: Prometheus-compatible metrics stored in VictoriaMetrics
  • Logs: Application and system logs stored in VictoriaLogs
  • Traces: Distributed tracing data stored in VictoriaTraces

All telemetry data is automatically collected, annotated with cluster-specific metadata, and persisted in a secure, cluster-scoped manner.


Before You Begin

Before continuing, ensure you have:

  • Installed and configured the exalsius CLI
  • Logged in with your exalsius account
  • At least one active cluster (Working with Clusters)
  • A cluster with observability enabled (see Step 1 below)

Note

You can verify your setup using:

exls --version
exls clusters list


Step 1 — Enable Observability on Your Cluster

Observability must be enabled when creating a cluster. If you haven't created a cluster yet, or you need a new cluster with observability, pass the --enable-telemetry flag during cluster creation:

exls clusters deploy --name <CLUSTER-NAME> --enable-telemetry

Tip

If you're unsure whether observability is enabled on an existing cluster, you can check for Grafana dashboard access (see Step 2 below).

Once observability is enabled, exalsius automatically deploys and configures the observability stack, which includes:

  • OpenTelemetry Collectors: Automatically discover and scrape metrics, logs, and traces from your cluster
  • VictoriaMetrics: Stores time-series metrics data
  • VictoriaLogs: Stores log data
  • VictoriaTraces: Stores distributed tracing data
  • Grafana: Provides visualization and querying capabilities

The setup process typically takes a few minutes. Once complete, you can start accessing your telemetry data.


Step 2 — Access the Grafana Dashboard

The easiest way to explore your cluster's telemetry data is through the Grafana web interface. Grafana provides pre-configured dashboards for common Kubernetes metrics, system performance, and application logs.

Getting Your Dashboard URL

To obtain a cluster-scoped login link to Grafana:

exls clusters get-dashboard-url <CLUSTER-ID>

This command returns a unique URL that provides:

  • Passwordless authentication: No credentials needed; the link includes authentication tokens
  • Cluster-scoped access: You can only view data from the specified cluster
  • Read-only mode: All data is read-only for security

Using Grafana Dashboards

Once you open the dashboard URL, you'll have access to several pre-configured dashboards:

  • Kubernetes Overview: Cluster-level metrics including node status, pod counts, and resource usage
  • Node Exporter: Detailed system metrics from each node (CPU, memory, disk, network)
  • Kubelet Metrics: Container and pod-level resource consumption
  • Application Logs: Searchable logs from all pods in your cluster
  • Distributed Traces: View and analyze request traces across services using the integrated trace explorer

You can explore these dashboards, create custom queries using PromQL (for metrics), LogsQL (for logs), or trace queries (for traces), and even create your own custom dashboards. All queries are automatically scoped to your cluster's data.

Tip

The URL remains valid as long as your cluster exists, but you'll need to regenerate it if your authentication expires.


Step 3 — Query Data via REST API

While Grafana is excellent for interactive exploration, you may need programmatic access to telemetry data for automation, integration with external systems, or custom analysis. The observability stack exposes REST APIs compatible with Prometheus (for metrics), VictoriaLogs (for logs), and VictoriaTraces (for traces).

Retrieving Authentication Credentials

To authenticate against the REST APIs, you'll need credentials stored in a Kubernetes secret. Let's retrieve them:

# First, get the kubeconfig for your cluster
exls clusters import-kubeconfig <CLUSTER-ID> --kubeconfig-path kube.conf

# Set the KUBECONFIG environment variable
export KUBECONFIG=kube.conf

# Retrieve the username from the secret
USERNAME=$(kubectl get secret storage-vmuser-credentials -n kof -o jsonpath='{.data.username}' | base64 -d)

# Retrieve the password from the secret
PASSWORD=$(kubectl get secret storage-vmuser-credentials -n kof -o jsonpath='{.data.password}' | base64 -d)

# Verify the credentials were retrieved
echo "Username: $USERNAME"
echo "Password: $PASSWORD"

The credentials are stored in the storage-vmuser-credentials secret in the kof namespace. These credentials are automatically created by exalsius and are unique to your cluster.

Preparing Basic Authentication

Next, we'll encode these credentials for HTTP Basic Authentication:

# Create the Basic Auth header value
BASIC_AUTH=$(echo -n "$USERNAME:$PASSWORD" | base64)
echo "Basic Auth Header: Basic $BASIC_AUTH"

This encoded string will be used in the Authorization header of your API requests.
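If you want to sanity-check the encoding, you can decode the header value back into the user:password pair. The credentials below are hypothetical, used only for illustration:

```shell
# Hypothetical credentials, for illustration only.
USERNAME="alice"
PASSWORD="s3cret"

# printf '%s' guarantees no trailing newline is included; a stray newline
# would silently corrupt the Base64 value and break authentication.
BASIC_AUTH=$(printf '%s' "$USERNAME:$PASSWORD" | base64)

# Decoding should print back the original user:password pair.
printf '%s' "$BASIC_AUTH" | base64 -d   # prints alice:s3cret
```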

Querying Metrics

Metrics are stored in VictoriaMetrics and can be queried using PromQL (Prometheus Query Language). Here's an example query to check if all targets are up:

curl \
    -H "Authorization: Basic $BASIC_AUTH" \
    -H "Content-Type: application/x-www-form-urlencoded" \
    "https://vmauth-de1.exalsius.ai/vm/select/0/prometheus/api/v1/query" \
    -d 'query=up'

This query returns a JSON response with the current value of the up metric, which indicates whether monitoring targets are reachable.
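The response shape follows the Prometheus HTTP API. As a sketch of how you might post-process it in a script, the example below parses a hard-coded sample response with python3 so the parsing step can be shown offline:

```shell
# Canned sample response (shape per the Prometheus HTTP API instant-query endpoint).
RESPONSE='{"status":"success","data":{"resultType":"vector","result":[{"metric":{"job":"kubelet","instance":"node-1"},"value":[1700000000,"1"]}]}}'

# Print "<instance> <value>" for every series in the result vector.
printf '%s' "$RESPONSE" | python3 -c '
import json, sys
for r in json.load(sys.stdin)["data"]["result"]:
    print(r["metric"]["instance"], r["value"][1])
'
```

For a live query, replace the canned RESPONSE with the output of the curl command above.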

Querying Logs

Logs are stored in VictoriaLogs and can be queried using LogsQL. Here's an example to search for error logs:

curl \
    -H "Authorization: Basic $BASIC_AUTH" \
    -H "Content-Type: application/x-www-form-urlencoded" \
    "https://vmauth-de1.exalsius.ai/vls/select/logsql/query" \
    -d 'query=error | limit 10'

This query searches for log entries containing "error" and limits the results to 10 entries.
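VictoriaLogs streams results back as newline-delimited JSON, with fields such as `_time` and `_msg` per its data model. As a sketch, here is how you might extract just the message from each entry; the two-line sample is hard-coded so the step can be shown offline:

```shell
# Canned sample of the newline-delimited JSON the query endpoint streams back
# (field names follow the VictoriaLogs data model).
LOGS='{"_time":"2024-01-01T00:00:00Z","_msg":"error: connection refused"}
{"_time":"2024-01-01T00:00:01Z","_msg":"error: timeout"}'

# Print only the message field of each log entry.
printf '%s\n' "$LOGS" | python3 -c '
import json, sys
for line in sys.stdin:
    if line.strip():
        print(json.loads(line)["_msg"])
'
```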

Querying Traces

Traces are stored in VictoriaTraces and can be queried via the Jaeger HTTP API. Here's an example to search for traces from a specific service:

curl \
    -H "Authorization: Basic $BASIC_AUTH" \
    "https://vmauth-de1.exalsius.ai/vt/select/0/jaeger/api/traces?service=my-service&limit=10"

You can filter traces by multiple criteria. Here's an example querying traces with errors, filtered by operation and duration:

curl \
    -H "Authorization: Basic $BASIC_AUTH" \
    "https://vmauth-de1.exalsius.ai/vt/select/0/jaeger/api/traces?service=my-service&operation=my-operation&tags=%7B%22error%22%3A%22true%22%7D&minDuration=1ms&maxDuration=10ms&limit=20"

The tags parameter uses JSON format (URL-encoded in the example above). You can filter by span attributes, resource attributes (with resource_attr: prefix), or instrumentation scope attributes (with scope_attr: prefix).
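To build the URL-encoded tags value yourself, any percent-encoder works; as a sketch, python3's urllib produces the encoding used in the example above:

```shell
# Percent-encode the JSON tags filter for use in the query string.
TAGS=$(python3 -c 'import urllib.parse; print(urllib.parse.quote("{\"error\":\"true\"}", safe=""))')
echo "$TAGS"   # prints %7B%22error%22%3A%22true%22%7D
```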

You can also retrieve a specific trace by its trace ID:

curl \
    -H "Authorization: Basic $BASIC_AUTH" \
    "https://vmauth-de1.exalsius.ai/vt/select/0/jaeger/api/traces/<TRACE-ID>"

Note

The most convenient way to explore traces is through the Grafana trace explorer, which provides an interactive interface for trace visualization and analysis. The REST API is useful for programmatic access and integration with external tools. VictoriaTraces also supports LogsQL queries for advanced trace filtering.

Advanced Querying

For more complex queries, refer to the official VictoriaMetrics, VictoriaLogs, and VictoriaTraces documentation.

Note

The API endpoints (vmauth-de1.exalsius.ai) may vary depending on your cluster's region. Check your cluster configuration or contact support if you encounter connection issues.


Step 4 — Collect Custom Metrics

While exalsius automatically collects standard Kubernetes and system metrics, you may need to monitor custom application metrics, third-party services, or specialized workloads. The observability stack supports collecting metrics from custom exporters and application endpoints.

Understanding Metric Collection

The OpenTelemetry collectors deployed by exalsius automatically discover and scrape metrics from:

  • Kubernetes system components (kubelet, kube-proxy, etc.)
  • Node exporters (CPU, memory, disk, network metrics)
  • Pods and Services configured via PodMonitor and ServiceMonitor custom resources

To add your own metrics, you need to:

  1. Expose metrics in Prometheus format (typically at a /metrics endpoint)
  2. Create a PodMonitor or ServiceMonitor resource to configure scraping
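For step 1, the Prometheus exposition format is plain text served over HTTP. A minimal /metrics response (with a hypothetical metric name) looks like:

```
# HELP myapp_requests_total Total number of HTTP requests handled.
# TYPE myapp_requests_total counter
myapp_requests_total{method="get",code="200"} 1027
myapp_requests_total{method="get",code="500"} 3
```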

Using PodMonitor and ServiceMonitor

PodMonitor and ServiceMonitor custom resources provide fine-grained control over metric collection, allowing you to configure:

  • Custom scraping intervals
  • Relabeling rules for metric transformation
  • TLS/authentication configuration
  • Complex label selectors
  • Multiple endpoints per resource

Using PodMonitor

PodMonitor is used to scrape metrics directly from Pods based on label selectors. This is ideal when you want to monitor multiple pods with a single configuration:

apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: my-custom-exporter
  namespace: default
spec:
  selector:
    matchLabels:
      app: my-exporter
  podMetricsEndpoints:
  - port: metrics                    # Must match a named port in the Pod
    path: /metrics                   # Metrics endpoint path
    interval: 30s                    # Scraping interval
    scheme: http                     # http or https

Key points:

  • The selector.matchLabels must match labels on your Pods
  • The port must be a named port in your Pod specification (not a number)
  • The namespace should match where your Pods are deployed
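For reference, here is the shape of a Pod the PodMonitor above would match. The image name is hypothetical; the key details are the matching label and the named containerPort:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: my-exporter-pod
  namespace: default
  labels:
    app: my-exporter            # matched by selector.matchLabels above
spec:
  containers:
  - name: exporter
    image: example.com/my-exporter:latest   # hypothetical image
    ports:
    - name: metrics             # the named port the PodMonitor refers to
      containerPort: 9100
```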

Using ServiceMonitor

ServiceMonitor is used to scrape metrics from Services, which is particularly useful for monitoring services that may have multiple backend pods or when you want to scrape through a Service rather than individual Pods:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-service-monitor
  namespace: default
spec:
  selector:
    matchLabels:
      app: my-service
  endpoints:
  - port: metrics                    # Must match a named port in the Service
    path: /metrics
    interval: 30s
    scheme: https                    # Can use https
    tlsConfig:
      insecureSkipVerify: false      # Set true only to skip certificate verification

Key points:

  • The selector.matchLabels must match labels on your Service
  • The port must be a named port in your Service specification
  • You can configure TLS settings for secure endpoints
  • Multiple endpoints can be defined for different ports or paths

Best Practices

When collecting custom metrics:

  1. Use named ports: Always use named ports in your Pod/Service definitions when using PodMonitor/ServiceMonitor
  2. Follow Prometheus format: Ensure your metrics endpoint follows the Prometheus exposition format
  3. Use appropriate intervals: Balance between data freshness and resource usage (30s-1m is typical)
  4. Label your metrics: Use meaningful labels in your metrics for better querying and filtering
  5. Namespace considerations: Create monitor resources in the same namespace as your workloads, or ensure proper RBAC permissions

Once your custom metrics are being collected, they'll appear in Grafana alongside standard metrics and can be queried via the REST API using PromQL.


Step 5 — Collect Custom Traces

While exalsius automatically collects standard Kubernetes and system telemetry, you may need to instrument your applications to emit distributed traces. Distributed tracing helps you understand request flows across services, identify performance bottlenecks, and debug issues in microservices architectures.

To collect traces from your applications, you need to configure OpenTelemetry SDKs in your application code. The OpenTelemetry collectors deployed by exalsius automatically receive traces from instrumented applications via OTLP (OpenTelemetry Protocol). Configure your OpenTelemetry SDK to export traces to the collector's OTLP endpoint, which is typically accessible at the otel-collector service in the observability or kof namespace.
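As a sketch, the standard OpenTelemetry SDK environment variables (names defined by the OpenTelemetry specification) can point your application at the collector. The endpoint below assumes an otel-collector Service in the kof namespace on the default OTLP gRPC port (4317); verify the actual Service name and namespace in your cluster:

```shell
# Standard OTel SDK environment variables; the endpoint host is an assumption,
# so check your cluster for the actual collector Service name and namespace.
export OTEL_EXPORTER_OTLP_ENDPOINT="http://otel-collector.kof.svc.cluster.local:4317"
export OTEL_EXPORTER_OTLP_PROTOCOL="grpc"
export OTEL_SERVICE_NAME="my-service"
export OTEL_TRACES_EXPORTER="otlp"
```

Most OpenTelemetry SDKs read these variables automatically, so no code changes are needed beyond instrumentation itself.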

For detailed instructions on instrumenting your application, refer to the OpenTelemetry documentation for your specific programming language. Once your application is instrumented and sending traces, they'll appear in Grafana's trace explorer and can be queried via the Jaeger-compatible REST API (see Step 3).


Understanding the Architecture

To help you make the most of observability, here's what happens under the hood when observability is enabled on your cluster.

Component Overview

exalsius deploys a comprehensive observability stack based primarily on OpenTelemetry, an industry-standard observability framework. The stack includes:

  1. OpenTelemetry Collectors: Deployed as DaemonSets and Deployments, these collectors:

    • Automatically discover Kubernetes resources (Pods, Services, etc.)
    • Scrape metrics from annotated resources and system components
    • Collect logs from pods and system components
    • Receive distributed traces from instrumented applications via OTLP
    • Annotate all telemetry data with cluster-specific metadata (cluster ID, cluster name, region, etc.)
  2. Telemetry Storage: Collected data is securely transmitted to a dedicated monitoring cluster that runs:

    • VictoriaMetrics: High-performance time-series database for metrics
    • VictoriaLogs: Efficient log storage and indexing system
    • VictoriaTraces: Distributed tracing backend for storing and querying trace data
    • Grafana: Visualization and querying platform with integrated trace explorer
  3. Security and Access Control:

    • Outbound authentication: Credentials are automatically created and stored in your cluster's kof namespace, allowing collectors to authenticate when sending data
    • Inbound authentication: The monitoring cluster only accepts data from authorized clusters
    • Query-time filtering: When you query data (via Grafana or API), results are automatically filtered to show only data from your cluster
    • Read-only access: All user-facing access is read-only for security

Data Flow

flowchart TB
    subgraph YourCluster["Your Cluster"]
        Apps["Applications<br/>(Pods)"]
        System["System Components<br/>(kubelet, etc.)"]
        Collectors["OpenTelemetry<br/>Collectors"]
    end

    subgraph MonitoringCluster["Monitoring Cluster"]
        VM["VictoriaMetrics<br/>(Metrics Storage)"]
        VL["VictoriaLogs<br/>(Log Storage)"]
        VT["VictoriaTraces<br/>(Trace Storage)"]
        Grafana["Grafana<br/>(Visualization)"]
    end

    Users["Your Queries<br/>(Grafana / REST API)"]

    Apps -->|"metrics, logs, traces"| Collectors
    System -->|"metrics, logs"| Collectors
    Collectors -->|"authenticated<br/>transmission"| VM
    Collectors -->|"authenticated<br/>transmission"| VL
    Collectors -->|"authenticated<br/>transmission"| VT
    Users -->|"queries"| Grafana
    Grafana -->|"read"| VM
    Grafana -->|"read"| VL
    Grafana -->|"read"| VT

    style YourCluster fill:#e1f5ff,stroke:#01579b,stroke-width:2px
    style MonitoringCluster fill:#fff3e0,stroke:#e65100,stroke-width:2px
    style Collectors fill:#f3e5f5,stroke:#4a148c,stroke-width:2px
    style Grafana fill:#e8f5e9,stroke:#1b5e20,stroke-width:2px

Key Benefits

This architecture provides several advantages:

  • Centralized storage: All telemetry data is stored in a dedicated, optimized monitoring cluster
  • Automatic discovery: No manual configuration needed for standard Kubernetes metrics
  • Cluster isolation: Data is automatically tagged and filtered by cluster
  • Scalability: The monitoring infrastructure scales independently from your workloads
  • Security: Multi-layer authentication ensures only authorized access
  • Standards-based: Uses OpenTelemetry and Prometheus standards for compatibility

Understanding this architecture helps you make informed decisions about what to monitor, how to structure your custom metrics, and how to optimize your observability setup.


Next Steps

Now that you understand how to use observability on your cluster, you can:

  • Explore the pre-configured Grafana dashboards to understand your cluster's behavior
  • Create custom dashboards for metrics specific to your workloads
  • Integrate REST API queries into your automation and monitoring scripts
  • Add custom metrics from your applications using PodMonitor or ServiceMonitor resources
  • Instrument your applications to emit distributed traces and analyze them in Grafana
  • Set up alerts in Grafana based on your metrics, logs, and traces

For more information, refer to the Prometheus documentation, VictoriaMetrics documentation, VictoriaTraces documentation, OpenTelemetry documentation, and Grafana documentation.