Apache Spark is a distributed data processing framework designed for speed and scalability. It works internally through a combination of cluster computing, in-memory processing, and DAG execution. Below is a breakdown of how Spark works internally:
1. Architecture Components
1.1. Spark Driver
- The Spark Driver is the main orchestrator of a Spark job.
- It creates the SparkContext and converts user-defined code into execution plans.
- Responsible for:
- Breaking jobs into stages.
- Scheduling and distributing tasks across worker nodes.
- Monitoring execution.
1.2. Cluster Manager
- Manages and allocates resources for Spark applications.
- Supported cluster managers:
- Standalone (built-in).
- YARN (Hadoop cluster).
- Mesos (deprecated since Spark 3.2).
- Kubernetes.
1.3. Worker Nodes (Executors)
- Worker nodes execute the assigned tasks.
- Each worker runs one or more executors, which:
- Process data in partitions.
- Store intermediate results in memory.
- Communicate with the driver.
2. Execution Flow
Step 1: Spark Application Submission
- The user submits a Spark job (Python, Scala, Java).
- The Spark Driver initializes and requests resources from the cluster manager.
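As a minimal sketch, driver initialization in PySpark looks like this (the application name, master URL, and memory setting are placeholder values):

```python
from pyspark.sql import SparkSession

# Building a SparkSession starts the driver, which then asks the
# cluster manager (here a hypothetical standalone master) for executors.
spark = SparkSession.builder \
    .appName("example-app") \
    .master("spark://master-host:7077") \
    .config("spark.executor.memory", "4g") \
    .getOrCreate()
```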
Step 2: Logical Plan Creation
- Parsing: The Spark job (e.g., a PySpark script) is converted into a Logical Plan.
- Optimization:
- Spark optimizes the logical plan using Catalyst Optimizer (e.g., predicate pushdown, column pruning).
- Execution Plan:
- The logical plan is converted into a Directed Acyclic Graph (DAG).
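You can inspect these plans yourself with `explain()`; a small sketch, assuming the `spark` session above and a hypothetical `events.parquet` file with `value` and `category` columns:

```python
# explain(True) prints the parsed, analyzed, and optimized logical
# plans, followed by the physical plan Spark will actually run.
df = spark.read.parquet("events.parquet")
df.filter(df["value"] > 100).select("category").explain(True)
```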
Step 3: DAG Scheduling
- The DAG Scheduler:
- Breaks the execution plan into stages at shuffle boundaries (conceptually similar to map and reduce phases).
- Groups transformations that can run together.
- Creates tasks for execution.
Step 4: Task Execution on Executors
- The Task Scheduler assigns tasks to available worker nodes.
- Each worker executes tasks in parallel on data partitions.
- Data shuffling happens if necessary (e.g., during `groupBy()` or `join()` operations).
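As an illustration (reusing the hypothetical `df` from the sketch above), a `groupBy` forces a shuffle, and the number of partitions it produces is configurable:

```python
# spark.sql.shuffle.partitions (default 200) sets how many partitions
# a shuffle produces; groupBy moves all rows with the same category
# onto the same partition before counting.
spark.conf.set("spark.sql.shuffle.partitions", "64")
counts = df.groupBy("category").count()
```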
Step 5: In-Memory Computation & Caching
- Spark processes data in-memory (RAM), reducing disk I/O.
- Users can persist intermediate results using RDD caching (`persist()` and `cache()`).
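A brief sketch of both calls (the `value` column is again a placeholder):

```python
from pyspark import StorageLevel

# cache() stores at the default level; persist() lets you choose one,
# e.g. spilling to disk when partitions do not fit in memory.
filtered = df.filter(df["value"] > 100)
filtered.persist(StorageLevel.MEMORY_AND_DISK)
filtered.count()      # the first action materializes the cache
filtered.count()      # later actions reuse the cached partitions
filtered.unpersist()  # release the storage when done
```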
Step 6: Collecting Results
- The results are gathered back to the driver using `collect()` (for small results) or written to storage (HDFS, S3, databases).
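For example, reusing the `counts` DataFrame from the Step 4 sketch (the S3 path is a placeholder):

```python
# collect() pulls every row into the driver's memory, so it is only
# safe for small results; large outputs are written from the executors.
small_result = counts.collect()  # a list of Row objects on the driver
counts.write.mode("overwrite").parquet("s3a://bucket/output/")
```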
3. Key Optimizations in Spark
3.1. Catalyst Optimizer
- Spark’s query optimizer applies:
- Predicate Pushdown: Filters applied early to reduce data scans.
- Column Pruning: Loads only required columns.
- Join Optimization: Optimized join strategies like broadcast join.
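As a sketch of the broadcast strategy, assuming a large `facts` DataFrame and a small `dims` lookup table:

```python
from pyspark.sql.functions import broadcast

# The hint tells Spark to copy the small table to every executor,
# so the large table never needs to be shuffled for the join.
joined = facts.join(broadcast(dims), on="category")
```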
3.2. Tungsten Execution Engine
- Low-level memory management to improve performance.
- Uses binary (off-heap) data processing, avoiding JVM object and garbage-collection overhead.
3.3. Lazy Evaluation
- Transformations (e.g., `map()`, `filter()`, `groupBy()`) are not executed immediately.
- Execution starts only when an action (`count()`, `show()`, `collect()`) is triggered.
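A quick illustration with the hypothetical `df` from earlier:

```python
# No work happens here: transformations only extend the plan.
pipeline = df.filter(df["value"] > 100).select("category")

# The action triggers execution of the entire pipeline at once.
n = pipeline.count()
```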
3.4. Shuffle Optimization
- Spark minimizes data shuffling by:
- Using broadcast joins for small datasets.
- Caching results to avoid recomputation.
4. Spark Execution Modes
- Local Mode (Single Machine)
- Runs on a single node.
- Good for testing.
- Cluster Mode (Distributed)
- Runs on multiple nodes.
- Uses YARN, Mesos, or Kubernetes.
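In code, the mode is selected through the master URL (a sketch; in practice the master is usually set via `spark-submit` rather than hard-coded):

```python
from pyspark.sql import SparkSession

# Local mode: driver and executors run in a single JVM; [*] uses all cores.
local_spark = SparkSession.builder.master("local[*]").getOrCreate()

# Cluster mode is typically chosen at submission time instead, e.g.
# `spark-submit --master yarn` or `--master k8s://<api-server>`.
```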
5. Data Abstractions in Spark
- RDD (Resilient Distributed Dataset) – Low-level API.
- DataFrame – Optimized table-like abstraction (faster due to Catalyst Optimizer).
- Dataset – Typed version of DataFrame (Scala/Java only).
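To see the difference in practice, here is the same aggregation written against both APIs (column names are placeholders):

```python
# Low-level RDD API: you spell out the computation step by step.
rdd_counts = df.rdd.map(lambda row: (row["category"], 1)) \
                   .reduceByKey(lambda a, b: a + b)

# DataFrame API: declarative, so Catalyst can optimize the plan.
df_counts = df.groupBy("category").count()
```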
DAG Execution in Spark
The Directed Acyclic Graph (DAG) is central to how Spark plans and executes tasks. Understanding how Spark builds and executes a DAG helps clarify how tasks are parallelized and optimized.
DAG Breakdown
- Logical Plan:
- The user-defined transformations (like `map`, `filter`, etc.) are parsed into a logical execution plan.
- It defines what needs to be done, not how.
- DAG Scheduler:
- The DAG Scheduler divides the job into stages.
- Stages are defined by wide transformations (e.g., `groupBy`, `join`, etc.) that require data shuffling (moving data between nodes).
- A stage consists of tasks that can be executed in parallel.
- Each stage contains tasks for every partition of the data.
- Task Scheduler:
- The Task Scheduler splits the job further into tasks, one per partition of data.
- These tasks are distributed to the available executors in the cluster.
- If there’s a failure (e.g., a task or node fails), Spark can re-run the failed tasks (thanks to RDD lineage).
Fault Tolerance in DAG
- If any task fails, Spark re-runs only the failed task rather than the entire job.
- Spark uses RDD lineage to keep track of how each partition was computed.
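You can print an RDD's lineage with `toDebugString()`; a small sketch:

```python
# The debug string shows the chain of parent RDDs that Spark would
# replay to recompute a lost partition.
rdd = spark.sparkContext.parallelize(range(1000)) \
           .map(lambda x: x * 2) \
           .filter(lambda x: x > 100)
print(rdd.toDebugString().decode())
```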
Example of a DAG
Imagine you have a job like this (assuming a DataFrame with numeric `amount` and `category` columns; note that DataFrame `filter()` takes a column expression, not a lambda):
df.filter(df["amount"] > 100).groupBy("category").count()
- Stage 1: The filter (a narrow transformation, applied to each partition independently).
- Stage 2: The groupBy and count (a wide transformation requiring a shuffle between partitions).
The DAG scheduler ensures the operations are executed efficiently across multiple nodes.
Catalyst Optimizer
The Catalyst Optimizer is Spark’s query optimization engine and a key part of Spark SQL. It’s responsible for making decisions about how to execute a query efficiently.
How Catalyst Works
- Logical Optimization:
- It performs optimizations like predicate pushdown (moving filters closer to the data), constant folding (evaluating expressions like `1 + 1` before runtime), and column pruning (removing unused columns).
- Physical Planning:
- Catalyst generates different physical plans for how to execute the query.
- The optimizer then selects the best plan based on statistics and heuristics.
- Cost-Based Optimization (CBO):
- In addition to logical optimizations, Catalyst supports cost-based optimizations where Spark estimates the cost of different plans and chooses the most efficient one.
- CBO is based on statistics about data, like the number of rows, column distributions, etc.
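Statistics are collected explicitly; a sketch assuming a table named `sales` registered in the catalog:

```python
# CBO needs table and column statistics, and is disabled by default.
spark.sql("ANALYZE TABLE sales COMPUTE STATISTICS FOR COLUMNS category, amount")
spark.conf.set("spark.sql.cbo.enabled", "true")
```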
Examples of Catalyst Optimizations
- Predicate Pushdown: Instead of reading all data and then filtering, Spark will push the `filter()` operation down to the data source (e.g., HDFS, Parquet) to avoid reading unnecessary data.
- Join Reordering: For multiple joins, Catalyst may reorder the joins to minimize the amount of data shuffled between nodes.
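You can verify pushdown in the physical plan; with the hypothetical Parquet file from earlier, the `FileScan` node lists the predicates as `PushedFilters`:

```python
# The physical plan's FileScan node shows which filters were pushed
# down to the Parquet reader instead of being applied after the scan.
spark.read.parquet("events.parquet") \
     .filter("value > 100") \
     .explain()
```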
Tungsten Execution Engine
The Tungsten Execution Engine is a low-level optimization in Spark that focuses on improving memory management, code generation, and CPU efficiency. It provides major performance improvements, particularly for DataFrame and Dataset operations.
Key Features of Tungsten
- Memory Management:
- Tungsten manages memory manually rather than relying on the JVM’s garbage collection. This minimizes overhead.
- It uses off-heap memory for storage, which means Spark can avoid Java’s memory management issues.
- Code Generation:
- Spark uses whole-stage code generation, which compiles entire stages of the query plan into Java bytecode, making it faster than interpreting each operation one-by-one.
- For example, operations like `map` or `filter` are converted into specialized Java code, resulting in faster execution.
- Binary Processing:
- Data is processed using binary formats rather than standard JVM objects (e.g., primitive values like integers and floats are processed in raw memory, reducing object overhead).
- Cache Optimization:
- Tungsten optimizes the in-memory cache, ensuring that data is stored and accessed as efficiently as possible in memory.
Tungsten Example
Let’s say you perform a `groupBy` operation on a DataFrame. The transformation is optimized by Tungsten to:
- Use off-heap memory for intermediate results.
- Generate specialized code for each operation (like `map`, `filter`).
- Avoid unnecessary JVM object allocations.
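You can inspect the generated code directly (Spark 3.0+); operators fused into one stage appear under a `WholeStageCodegen` node:

```python
# mode="codegen" prints the Java code Spark generates for each
# whole-stage-compiled subtree of the physical plan.
df.groupBy("category").count().explain(mode="codegen")
```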
Putting it All Together
When you run a Spark job:
- The driver sends the job to the DAG Scheduler.
- The job is broken into stages and tasks.
- Catalyst Optimizer transforms your query into an optimized physical plan.
- Tungsten further optimizes memory usage, execution speed, and CPU usage.
- The tasks are then executed across the cluster on worker nodes.
- Fault tolerance ensures that if a task fails, it is re-executed based on lineage information.
Spark’s DAG Scheduler, Catalyst Optimizer, and Tungsten Execution Engine work together to achieve high performance, scalability, and fault tolerance.
Key Benefits of These Components
- Optimized Execution: Through logical, physical, and cost-based optimizations.
- Fault Tolerance: Task-level failure recovery using RDD lineage.
- In-Memory Performance: Through in-memory caching and optimized memory management (Tungsten).
- Parallel Execution: Efficient task scheduling and parallel execution across a cluster.
Conclusion
- Spark is a distributed, fault-tolerant framework that executes jobs in parallel across a cluster.
- It breaks jobs into DAGs, optimizes execution, and runs tasks efficiently using in-memory computation.
- Optimizations like Catalyst Optimizer and Tungsten Engine ensure high performance.