Apache Spark
Apache Spark is the default execution engine for distributed data processing in Ilum. It runs on Kubernetes (with native CRD-based pod orchestration) or Apache Hadoop YARN, and is exposed through batch jobs, interactive sessions, in-app SQL notebooks, and the Apache Kyuubi SQL gateway.
Ilum bundles Apache Spark 4.x by default, with Spark 3.x available for legacy workloads.
When to use Spark
Spark is the right engine for:
- Large-scale ETL and data transformation pipelines.
- Machine learning workloads using Spark ML or MLlib.
- Complex joins and aggregations across large datasets.
- Streaming workloads with Spark Structured Streaming.
- Workloads that benefit from horizontal scaling across many executors.
For interactive analytics on medium-to-large data, consider Trino. For small-data and local execution, consider DuckDB. For low-latency stream processing, consider Apache Flink.
Execution model
Spark runs as a driver and a configurable number of executors:
- Driver pod: One per job. Coordinates execution, holds the Spark session, and tracks task state.
- Executor pods: Provisioned dynamically based on workload. Run individual tasks in parallel and hold cached data.
Ilum manages the full pod lifecycle, including image selection, resource limits, dynamic allocation, and cleanup on completion.
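As an illustration, driver and executor pod sizing can be tuned through standard Spark properties in the per-cluster defaults. The property names below are standard Spark/Kubernetes settings; the enclosing `defaults` structure mirrors the Helm values shown in the Configuration section, and the specific values are examples only:

```yaml
defaults:
  spark.driver.memory: "2g"                    # memory for the single driver pod
  spark.driver.cores: "1"
  spark.executor.memory: "4g"                  # memory per executor pod
  spark.executor.cores: "2"                    # parallel task slots per executor
  spark.kubernetes.executor.limit.cores: "2"   # hard CPU limit on executor pods
```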
Workload types
Spark powers four kinds of workloads in Ilum:
- Jobs: One-shot batch executions.
- Services: Long-running interactive Spark sessions that execute code on demand without per-call initialization overhead.
- Schedules: Cron-driven recurring jobs.
- Requests: Ad-hoc submissions through the REST API or UI.
All four are managed through the Workloads section of the Ilum UI.
Supported catalogs
Spark connects to all four Ilum catalogs:
- Hive Metastore (default)
- Project Nessie (Iceberg with Git-style branching)
- Unity Catalog (Databricks-compatible governance)
- DuckLake (DuckDB-native, primarily used by DuckDB)
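As a sketch, pointing a Spark job at a Nessie-backed Iceberg catalog uses the standard Iceberg and Nessie catalog properties. The service URL, branch, and warehouse path below are hypothetical placeholders, and whether Ilum injects these automatically is an assumption to verify:

```yaml
defaults:
  spark.sql.catalog.nessie: "org.apache.iceberg.spark.SparkCatalog"
  spark.sql.catalog.nessie.catalog-impl: "org.apache.iceberg.nessie.NessieCatalog"
  spark.sql.catalog.nessie.uri: "http://nessie:19120/api/v2"   # hypothetical Nessie service URL
  spark.sql.catalog.nessie.ref: "main"                         # Nessie branch to read/write
  spark.sql.catalog.nessie.warehouse: "s3a://warehouse/"       # hypothetical warehouse path
```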
Supported table formats
Spark reads and writes:
- Delta Lake: ACID transactions, time travel, schema evolution.
- Apache Iceberg : Partition evolution, hidden partitioning.
- Apache Hudi : Record-level upserts, incremental processing.
- Parquet, ORC, CSV, JSON, Avro: Standard file formats.
The Ilum Tables abstraction lets you read and write Delta, Iceberg, and Hudi using the same Spark API.
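For example, enabling Delta Lake in a Spark session typically requires the Delta extension and catalog properties shown below. These are the standard Delta Lake settings; whether Ilum pre-configures them for you is an assumption worth checking before setting them manually:

```yaml
defaults:
  spark.sql.extensions: "io.delta.sql.DeltaSparkSessionExtension"
  spark.sql.catalog.spark_catalog: "org.apache.spark.sql.delta.catalog.DeltaCatalog"
```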
Configuration
Spark configuration is managed through Helm values and per-cluster settings:
```yaml
ilum-core:
  spark:
    enabled: true
  cluster:
    defaults:
      spark.dynamicAllocation.enabled: "true"
      spark.dynamicAllocation.minExecutors: "1"
      spark.dynamicAllocation.maxExecutors: "20"
      spark.dynamicAllocation.executorIdleTimeout: "60s"
```
Per-cluster overrides are configured in the Workloads > Clusters UI and apply to all Spark jobs targeting that cluster.
Spark Connect
Spark Connect provides a client-server architecture for remote Spark execution. Ilum deploys Spark Connect servers as standard jobs and includes a Kubernetes-aware proxy that allows Spark Connect endpoints to be reached across cluster boundaries.
Refer to Spark Connect for details.
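A client can target such an endpoint through the standard `spark.remote` connection string. The host below is a hypothetical in-cluster service name; 15002 is the default Spark Connect port:

```yaml
defaults:
  spark.remote: "sc://spark-connect.ilum.svc.cluster.local:15002"  # hypothetical endpoint
```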
Submitting a Spark job
For a step-by-step walkthrough, refer to Run a simple Spark job.