Apache Spark
Apache Spark is the default execution engine for distributed data processing in Ilum. It runs on Kubernetes (with native CRD-based pod orchestration) or Apache Hadoop YARN, and is exposed through batch jobs, interactive sessions, in-app SQL notebooks, and the Apache Kyuubi SQL gateway.
Ilum bundles Apache Spark 4.x by default, with Spark 3.x available for legacy workloads.
When to use Spark
Spark is the right engine for:
- Large-scale ETL and data transformation pipelines.
- Machine learning workloads using Spark ML or MLlib.
- Complex joins and aggregations across large datasets.
- Streaming workloads with Spark Structured Streaming.
- Workloads that benefit from horizontal scaling across many executors.
For interactive analytics on medium-to-large data, consider Trino. For small-data and local execution, consider DuckDB. For low-latency stream processing, consider Apache Flink.
Execution model
Spark runs as a driver and a configurable number of executors:
- Driver pod: One per job. Coordinates execution, holds the Spark session, and tracks task state.
- Executor pods: Provisioned dynamically based on workload. Run individual tasks in parallel and hold cached data.
Ilum manages the full pod lifecycle, including image selection, resource limits, dynamic allocation, and cleanup on completion.
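As an illustration, driver and executor pod sizing can be tuned through standard Spark properties in the per-cluster defaults. The property names below are standard Spark/Kubernetes settings; the enclosing `defaults` structure mirrors the Helm values shown in the Configuration section, and the specific values are examples only:

```yaml
defaults:
  spark.driver.memory: "2g"                    # memory for the single driver pod
  spark.driver.cores: "1"
  spark.executor.memory: "4g"                  # memory per executor pod
  spark.executor.cores: "2"                    # parallel task slots per executor
  spark.kubernetes.executor.limit.cores: "2"   # hard CPU limit on executor pods
```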
Workload types
Spark powers four kinds of workloads in Ilum:
- Jobs: One-shot batch executions.
- Services: Long-running interactive Spark sessions that execute code on demand without per-call initialization overhead.
- Schedules: Cron-driven recurring jobs.
- Requests: Ad-hoc submissions through the REST API or UI.
All four are managed through the Workloads section of the Ilum UI.
Supported catalogs
Spark connects to all four Ilum catalogs:
- Hive Metastore (default)
- Project Nessie (Iceberg with Git-style branching)
- Unity Catalog (Databricks-compatible governance)
- DuckLake (DuckDB-native, primarily used by DuckDB)
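As a sketch, pointing a Spark job at a Nessie-backed Iceberg catalog uses the standard Iceberg and Nessie catalog properties. The service URL, branch, and warehouse path below are hypothetical placeholders, and whether Ilum injects these automatically is an assumption to verify:

```yaml
defaults:
  spark.sql.catalog.nessie: "org.apache.iceberg.spark.SparkCatalog"
  spark.sql.catalog.nessie.catalog-impl: "org.apache.iceberg.nessie.NessieCatalog"
  spark.sql.catalog.nessie.uri: "http://nessie:19120/api/v2"   # hypothetical Nessie service URL
  spark.sql.catalog.nessie.ref: "main"                         # Nessie branch to read/write
  spark.sql.catalog.nessie.warehouse: "s3a://warehouse/"       # hypothetical warehouse path
```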
Supported table formats
Spark reads and writes:
- Delta Lake: ACID transactions, time travel, schema evolution.
- Apache Iceberg : Partition evolution, hidden partitioning.
- Apache Hudi : Record-level upserts, incremental processing.
- Parquet, ORC, CSV, JSON, Avro: Standard file formats.
The Ilum Tables abstraction lets you read and write Delta, Iceberg, and Hudi using the same Spark API.
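For example, enabling Delta Lake in a Spark session typically requires the Delta extension and catalog properties shown below. These are the standard Delta Lake settings; whether Ilum pre-configures them for you is an assumption worth checking before setting them manually:

```yaml
defaults:
  spark.sql.extensions: "io.delta.sql.DeltaSparkSessionExtension"
  spark.sql.catalog.spark_catalog: "org.apache.spark.sql.delta.catalog.DeltaCatalog"
```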
Configuration
Spark configuration is managed through Helm values and per-cluster settings:
```yaml
ilum-core:
  spark:
    enabled: true
  cluster:
    defaults:
      spark.dynamicAllocation.enabled: "true"
      spark.dynamicAllocation.minExecutors: "1"
      spark.dynamicAllocation.maxExecutors: "20"
      spark.dynamicAllocation.executorIdleTimeout: "60s"
```
Per-cluster overrides are configured in the Workloads > Clusters UI and apply to all Spark jobs targeting that cluster.
Spark Connect
Spark Connect provides a client-server architecture for remote Spark execution. Ilum deploys Spark Connect servers as standard jobs and includes a Kubernetes-aware proxy that allows Spark Connect endpoints to be reached across cluster boundaries.
Refer to Spark Connect for details.
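A client can target such an endpoint through the standard `spark.remote` connection string. The host below is a hypothetical in-cluster service name; 15002 is the default Spark Connect port:

```yaml
defaults:
  spark.remote: "sc://spark-connect.ilum.svc.cluster.local:15002"  # hypothetical endpoint
```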
Submitting a Spark job
For a step-by-step walkthrough, refer to Run a simple Spark job.