Transformations Cheat Sheet

By Darshini, December 27, 2024

Explore a detailed comparison of PySpark transformations with a comprehensive table highlighting key points for RDD and DataFrame operations. Learn the differences, use cases, and examples for efficient big data processing.

Key Attribute	map()	filter()	flatMap()	groupByKey()	reduceByKey()
Definition	Applies a function to each element, returning a new dataset with transformed elements.	Filters elements or rows based on a condition	Maps each input element to multiple outputs and flattens the results.	Groups elements of an RDD by their key.	Aggregates values for each key using a function.
One-to-Many Mapping	No	No	Yes	No	No
Input	Single RDD or DataFrame column.	Single RDD or DataFrame column.	Single RDD or DataFrame column.	RDD of key-value pairs.	RDD of key-value pairs.
Output	Transformed RDD or modified DataFrame column.	Filtered RDD or subset DataFrame.	Flattened RDD or expanded DataFrame rows.	RDD of key and grouped values (as iterable).	RDD of key and aggregated value.
Use Cases	Element-wise transformations (e.g., scaling, formatting).	Conditional filtering (e.g., remove unwanted data).	Splitting text, expanding lists into rows.	Grouping data by a key (e.g., categorizing).	Aggregating metrics like sum, count, or average for each key.
Examples (RDD)	rdd.map(lambda x: x * 2)	rdd.filter(lambda x: x > 0)	rdd.flatMap(lambda x: x.split(" "))	rdd.groupByKey()	rdd.reduceByKey(lambda x, y: x + y)
Examples (DataFrame)	df.withColumn("new_col", df["col"] * 2)	df.filter(df["col"] > 0)	df.select(explode(split(df["col"], " ")).alias("words"))	df.groupBy("key").agg(collect_list("value"))	df.groupBy("key").agg(sum("value"))
Performance	Efficient; processes each element independently, with no shuffle.	Efficient; drops non-matching rows/elements, with no shuffle.	Slightly less efficient; requires flattening of multiple outputs.	Can be less efficient for large datasets because every value is shuffled across the network.	Optimized for aggregations; still shuffles, but combines values locally on each partition first, so less data is transferred.
Common Use Cases	– Scaling values. – Formatting strings. – Converting data types.	– Removing invalid rows. – Filtering based on range or condition	– Splitting text. – Expanding hierarchical data	– Creating categories. – Grouping data for subsequent transformations.	– Summing sales per region. – Counting occurrences per category.
Lazy Evaluation	Yes	Yes	Yes	Yes	Yes
Key Difference	One-to-one mapping.	Selects a subset of elements.	Can produce one-to-many mapping; results are flattened.	Groups values with a key into a single collection.	Combines and aggregates values for each key.
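
The Key Difference and Performance rows can be made concrete with a small model. The sketch below is plain Python, not Spark code: it imitates what map(), flatMap(), groupByKey(), and reduceByKey() compute, and counts how many records would cross the shuffle in each keyed case. The sample data, the two-partition layout, and the helper names `group_by_key` and `reduce_by_key` are assumptions made up for illustration; real Spark partitioning and shuffle behavior are more involved.

```python
from collections import defaultdict
from itertools import chain

# --- map vs. flatMap: one-to-one vs. one-to-many ---
lines = ["big data", "spark"]  # hypothetical input
mapped = [line.split(" ") for line in lines]                  # one output per input
flat_mapped = [w for line in lines for w in line.split(" ")]  # outputs flattened

# --- groupByKey vs. reduceByKey: how much data is shuffled ---
# Hypothetical data: two "partitions" of (region, sales) pairs.
partitions = [
    [("east", 10), ("west", 5), ("east", 7)],
    [("west", 3), ("east", 1), ("west", 4)],
]

def group_by_key(parts):
    """Model of groupByKey: every record crosses the shuffle unchanged."""
    shuffled = list(chain.from_iterable(parts))
    grouped = defaultdict(list)
    for key, value in shuffled:
        grouped[key].append(value)
    return dict(grouped), len(shuffled)

def reduce_by_key(parts, func):
    """Model of reduceByKey: combine inside each partition, then shuffle."""
    shuffled = []
    for part in parts:
        local = {}
        for key, value in part:
            local[key] = func(local[key], value) if key in local else value
        shuffled.extend(local.items())  # at most one record per key per partition
    merged = {}
    for key, value in shuffled:
        merged[key] = func(merged[key], value) if key in merged else value
    return merged, len(shuffled)

grouped, grouped_shuffled = group_by_key(partitions)
reduced, reduced_shuffled = reduce_by_key(partitions, lambda x, y: x + y)

print(mapped)       # [['big', 'data'], ['spark']]
print(flat_mapped)  # ['big', 'data', 'spark']
print(grouped, grouped_shuffled)  # {'east': [10, 7, 1], 'west': [5, 3, 4]} 6
print(reduced, reduced_shuffled)  # {'east': 18, 'west': 12} 4
```

In this toy run the grouping model moves all 6 records, while the reducing model moves only 4 because each partition sends a single pre-combined value per key. The gap grows with the number of repeated keys per partition, which is why reduceByKey() is usually preferred when only an aggregate is needed.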
