Observability

Ilum ships an opt-in observability layer that produces a single distributed trace for every pipeline run, covering the Airflow task that triggered it, the Ilum service request that received it, and the Spark driver and executors that ran it. Each job detail view gains a Pipeline Trace tab and a Cost tab built from that trace stream.

Where the Lineage feature answers what data flowed where, the Pipeline Trace answers what ran when, on which pod, with what latency, and which log lines correlate. The Cost tab answers where the compute and storage cost went on this run, broken down per table.

What it shows

Pipeline Trace tab

The Trace tab renders a waterfall with one row per operation in the run, grouped by component (Airflow, Ilum service, Spark driver, Spark executors, Spark SQL queries, stages, table commits, and so on). It makes the structure of a run visible at a glance:

  • Spark SQL queries, jobs, and stages, with per-stage metrics such as shuffle bytes, spill, GC time, and peak memory attached to each stage row.
  • Iceberg and Delta table commits, MERGE operations, and table maintenance (OPTIMIZE, VACUUM, compaction), called out as distinct rows so "data was written" is separated from "table layout was rewritten".
  • Streaming micro-batches, each rendered as its own row.

Hover any span to highlight the correlated log lines in the Loki panel below it. Click Open in Grafana to load the same trace in Grafana's Tempo explore view, or Download to export the full trace as JSON (useful for support tickets).

A runtime dependency graph sits at the top of the tab. It shows what a single run actually called: the Airflow task that triggered it, the Spark driver and executors, the metastore it queried, the object-storage buckets it touched, and the tables it committed. Click any node or edge to inspect its calls, latency, and errors.

When a Spark job was triggered by an Airflow DAG, the job detail header shows a Triggered by · chip linking back to the task in the Airflow UI.

Cost tab

The Cost tab attributes a run's executor time, storage request counts, shuffle, spill, GC, and commit bytes across the datasets the job actually touched. It renders:

  • A treemap where each tile is a dataset, sized by its share of the run's I/O and coloured by role (read, write, or both). A callout names any single table that carries more than half the run's bytes.
  • A per-table cost breakdown listing each dataset with its role, bytes share, executor seconds, and a modeled cost figure.

Costs that cannot be tied to a specific dataset (driver bootstrap calls, queries with no table target) are reported as explicit "unattributed" rows rather than hidden, so the totals always reconcile.
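
The reconciliation described above can be illustrated with a short sketch. It is not Ilum's implementation: the function names, the `unattributed` key, and the sample byte counts are assumptions made for illustration. It shows how per-dataset byte shares, an explicit unattributed row, and the "more than half the run's bytes" callout fit together.

```python
from typing import Dict, Optional

UNATTRIBUTED = "unattributed"  # hypothetical row label


def byte_shares(bytes_by_dataset: Dict[str, int], total_bytes: int) -> Dict[str, float]:
    """Return each dataset's share of the run's bytes, plus an explicit
    unattributed row so the shares always sum to 1."""
    attributed = sum(bytes_by_dataset.values())
    if attributed > total_bytes:
        raise ValueError("attributed bytes exceed the run total")
    if total_bytes == 0:
        return {}
    rows = dict(bytes_by_dataset)
    leftover = total_bytes - attributed
    if leftover:
        # Bytes with no table target are kept visible rather than dropped.
        rows[UNATTRIBUTED] = leftover
    return {name: count / total_bytes for name, count in rows.items()}


def dominant_table(shares: Dict[str, float]) -> Optional[str]:
    """Return the single table carrying more than half the bytes, if any."""
    for name, share in shares.items():
        if name != UNATTRIBUTED and share > 0.5:
            return name
    return None


if __name__ == "__main__":
    shares = byte_shares({"orders": 600, "customers": 300}, total_bytes=1000)
    print(shares, dominant_table(shares))
```

Keeping the leftover as its own row, instead of rescaling the attributed rows, is what lets the per-table figures add up to the run total.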

Modeled cost reflects a configurable pricing profile (provider, region, instance type), not real cloud invoices. The provider and region can be changed in the UI without a redeploy. For details on which cost dimensions are tracked and how to add your own, see Cost metrics configuration.

A top-level Cost route aggregates across jobs over a selected window, with a heatmap of the most expensive tables and a daily cost time series stacked by namespace.

Enabling observability

Observability is off by default. An operator enables it with a single Helm flag:

global:
  observability:
    otel:
      enabled: true

With this set, Ilum deploys the trace backend, wires tracing into the Spark driver and executors, and surfaces the Trace and Cost tabs in the UI. To enable the cross-job cost rollup (the warehouse fallback and the top-level Cost route), also enable the cost aggregator:

global:
  costAggregator:
    enabled: true

Note: Only jobs launched after observability is enabled are traced. When the feature is disabled, the Trace and Cost tabs are hidden.

Per-job tracing profile

The job submit form exposes a Tracing profile selector that overrides the cluster default for a single run:

Profile        | Effect
---------------|-------
clusterDefault | Inherit whatever the operator configured. The common case.
off            | Disable tracing for this run; near-zero overhead. Useful for latency-sensitive or smoke-test jobs.
structural     | Emit the structural waterfall (queries, jobs, stages) only, without per-task detail.
detailed       | Enable per-task spans for this run, for investigating a single slow stage or skewed tasks.

The profile applies to direct API, Livy, and Airflow submissions alike. When observability is disabled cluster-wide, the selector is disabled and the run falls back to clusterDefault.

Sampling

By default every trace is recorded. On busy production clusters, the sampling ratio can be lowered (for example to record one run in ten):

global:
  observability:
    otel:
      sampling:
        ratio: 0.1

Sampling is parent-based: when a run is recorded, every downstream span on that run is recorded too, so traces are always complete. A fraction of runs simply have no trace at all and render the empty-state copy. For job classes where per-table cost is critical, keep the ratio at 1.0, since per-table attribution needs per-stage trace data.
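
A minimal sketch of what "parent-based" means in practice is shown below. It follows the general idea of OpenTelemetry's parent-based, trace-ID-ratio sampling; whether Ilum's sampler behaves exactly this way is an assumption, and the function names are hypothetical. The root span makes one decision from the trace ID, and every downstream span copies its parent's decision, so a run is either fully traced or not traced at all.

```python
import random
from typing import Optional

_MASK_64 = (1 << 64) - 1


def should_record_root(trace_id: int, ratio: float) -> bool:
    """Deterministic ratio decision based on the low 64 bits of the trace ID."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be in [0, 1]")
    bound = int(ratio * (1 << 64))
    return (trace_id & _MASK_64) < bound


def should_record(trace_id: int, ratio: float,
                  parent_recorded: Optional[bool] = None) -> bool:
    """Parent-based decision: a child inherits its parent's choice;
    only a root span (no parent) consults the ratio."""
    if parent_recorded is not None:
        return parent_recorded
    return should_record_root(trace_id, ratio)


if __name__ == "__main__":
    rng = random.Random(0)
    runs = 10_000
    recorded = sum(should_record(rng.getrandbits(128), 0.1) for _ in range(runs))
    print(f"recorded {recorded} of {runs} runs")
```

Because the decision depends only on the trace ID, every component that sees the same trace reaches the same answer without coordination.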

Listener defaults

The Spark listener decides how much span detail each run emits. Its cluster-wide policy is configured in the Cost → Settings view under Listener defaults, and is applied at job submit time unless a per-job tracing profile overrides it for a single run. The trace search window that the UI uses when querying Tempo (job.otel.searchWindowSeconds, configured through Helm) is shown read-only at the top of the same section.

Each setting narrows what is traced. A blank value is not "off" — it means the listener applies its built-in default, listed below:

Setting              | Effect                                                                              | Default
---------------------|-------------------------------------------------------------------------------------|-----------
granularity          | How deep the trace goes: disabled (no spans), job, stage, task, or all.             | stage
Slow stage threshold | Only stages slower than this many milliseconds emit spans; blank emits all stages.  | off
Slow task threshold  | Only tasks slower than this many milliseconds emit spans; blank emits all tasks.    | off
Stage name filter    | Regular expression; only stages whose name matches emit spans; blank matches all.   | none
Retry-only tracing   | Emit task detail only for retried or speculative tasks.                             | off
planRedaction        | How much of the SQL plan is attached to query spans: full, paths-only, or minimal.  | paths-only
taskParentMode       | Whether task spans parent to a virtual stage span or the native one.                | virtual

The granularity is the primary control: disabled switches emission off entirely, while the thresholds and filters only narrow which stages or tasks emit once granularity is stage, task, or all. Deeper granularity produces a richer Pipeline Trace waterfall and more precise per-table cost, at the cost of more spans per run.
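
How these settings combine for a single stage can be sketched as follows. This is an illustrative model, not the listener's code: the `ListenerPolicy` class, its field names, and the use of a regex search (rather than a full match) are assumptions. Granularity is checked first, and the threshold and name filter only narrow the result.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Granularity values at which stage spans can be emitted, per the text above.
STAGE_LEVELS = {"stage", "task", "all"}


@dataclass
class ListenerPolicy:  # hypothetical container for the settings in the table
    granularity: str = "stage"
    slow_stage_ms: Optional[int] = None     # None models a blank value: all stages
    stage_name_regex: Optional[str] = None  # None models a blank value: match all


def stage_emits_span(policy: ListenerPolicy, stage_name: str, duration_ms: int) -> bool:
    """Return True if a finished stage would emit a span under this policy."""
    if policy.granularity not in STAGE_LEVELS:
        return False  # "disabled" and "job" never emit stage spans
    if policy.slow_stage_ms is not None and duration_ms <= policy.slow_stage_ms:
        return False  # only stages slower than the threshold emit
    if policy.stage_name_regex and not re.search(policy.stage_name_regex, stage_name):
        return False  # name filter set and not matched
    return True


if __name__ == "__main__":
    policy = ListenerPolicy(slow_stage_ms=500, stage_name_regex=r"^merge")
    print(stage_emits_span(policy, "merge into orders", 900))
    print(stage_emits_span(policy, "scan customers", 900))
```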

Rate card

The rate card is the pricing model that converts a run's measured resource usage (vCPU-hours, memory GB-hours) into a modeled cost figure. It is configured in the Cost → Settings view under Rate card and is applied in the browser, so no redeploy is needed to change rates, provider, or region.

Field                        | Meaning
-----------------------------|--------
Configured                   | Master switch. While off, modeled cost is hidden and cost cells show a "pricing not configured" notice instead of a currency value.
Provider / Region / Currency | Labels carried onto the modeled figures.
vCPU rate                    | Currency charged per vCPU-hour.
Memory rate                  | Currency charged per GB-hour.
Pricing model + Multipliers  | The selected pricing model picks one multiplier (for example spot = 0.35), which scales the base rates.

Modeled cost reflects this rate card on the per-job Cost tab, the per-table breakdown, and the cost-flow treemap, all of which are computed in the browser. The top-level Cost dashboard serves pre-aggregated warehouse data and is not affected by rate-card edits. Because the rate card is off by default and its rates are entered by the operator, the modeled figure is an estimate for relative comparison, not a billing amount; the underlying resource dimensions (executor seconds, bytes, I/O) are exact regardless of the rate card.
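
As a rough illustration of the arithmetic, the sketch below applies a rate card to measured usage. The additive formula, the `RateCard` class, and the example rates are assumptions made for illustration; only the spot multiplier of 0.35 comes from the table above. When the card is not configured, the function returns no figure, mirroring the "pricing not configured" notice.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class RateCard:  # hypothetical shape; field names are illustrative
    configured: bool = False
    currency: str = "USD"
    vcpu_rate: float = 0.0    # currency per vCPU-hour
    memory_rate: float = 0.0  # currency per GB-hour
    pricing_model: str = "on_demand"
    multipliers: Dict[str, float] = field(
        default_factory=lambda: {"on_demand": 1.0}
    )


def modeled_cost(card: RateCard, vcpu_hours: float,
                 memory_gb_hours: float) -> Optional[float]:
    """Scale base rates by the selected multiplier; None when not configured."""
    if not card.configured:
        return None
    multiplier = card.multipliers[card.pricing_model]
    base = vcpu_hours * card.vcpu_rate + memory_gb_hours * card.memory_rate
    return base * multiplier


if __name__ == "__main__":
    card = RateCard(
        configured=True,
        vcpu_rate=0.04,     # made-up rate for the example
        memory_rate=0.005,  # made-up rate for the example
        pricing_model="spot",
        multipliers={"on_demand": 1.0, "spot": 0.35},
    )
    cost = modeled_cost(card, vcpu_hours=10, memory_gb_hours=40)
    print(f"{cost:.2f} {card.currency}")
```

Changing the rate card only changes the output of this conversion; the measured vCPU-hours and GB-hours passed in stay the same.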

Multi-engine coverage

The trace covers the Spark batch happy path (queries, jobs, stages), Iceberg and Delta commits, MERGE, table maintenance and schema evolution, optional per-task detail, and structured-streaming micro-batches. Jobs run through Kyuubi or Livy still produce the full Spark engine waterfall on every query; what they do not yet get is a per-session parent grouping multiple queries together, which is blocked on upstream support.

Custom Spark images

If a workload uses a custom Spark image, ensure the OpenTelemetry Java agent is present at /opt/spark/jars/opentelemetry-javaagent.jar. The bundled Ilum Spark image ships it at that path; for a custom image, either rebuild on top of the Ilum Spark base image or copy the agent jar to the equivalent path. The Spark engine spans require no custom-image change, as the listener jar is added at job-submit time.