Transformations Cheat Sheet

This cheat sheet compares five PySpark transformations: map(), filter(), flatMap(), groupByKey(), and reduceByKey(). The table below covers their definitions, inputs and outputs, use cases, RDD and DataFrame examples, and performance characteristics for efficient big data processing.

| Key Attribute | `map()` | `filter()` | `flatMap()` | `groupByKey()` | `reduceByKey()` |
| --- | --- | --- | --- | --- | --- |
| Definition | Applies a function to each element, returning a new dataset of transformed elements. | Keeps only the elements or rows that satisfy a condition. | Maps each input element to zero or more outputs and flattens the results. | Groups the values of a key-value RDD by key. | Aggregates the values for each key using an associative, commutative function. |
| One-to-Many Mapping | No | No | Yes | No | No |
| Input | Single RDD or DataFrame column. | Single RDD or DataFrame column. | Single RDD or DataFrame column. | RDD of key-value pairs. | RDD of key-value pairs. |
| Output | Transformed RDD or modified DataFrame column. | Filtered RDD or subset DataFrame. | Flattened RDD or expanded DataFrame rows. | RDD of key and grouped values (as an iterable). | RDD of key and aggregated value. |
| Use Cases | Element-wise transformations (e.g., scaling, formatting). | Conditional filtering (e.g., removing unwanted data). | Splitting text, expanding lists into rows. | Grouping data by a key (e.g., categorizing). | Aggregating metrics such as sum or count for each key; an average needs (sum, count) pairs. |
| Examples (RDD) | `rdd.map(lambda x: x * 2)` | `rdd.filter(lambda x: x > 0)` | `rdd.flatMap(lambda x: x.split(" "))` | `rdd.groupByKey()` | `rdd.reduceByKey(lambda x, y: x + y)` |
| Examples (DataFrame) | `df.withColumn("new_col", df["col"] * 2)` | `df.filter(df["col"] > 0)` | `df.select(explode(split(df["col"], " ")).alias("words"))` | `df.groupBy("key").agg(collect_list("value"))` | `df.groupBy("key").agg(sum("value"))` |
| Performance | Efficient; processes each element independently. | Efficient; drops non-matching rows or elements without a shuffle. | Slightly more work than `map()`; each element's outputs must be flattened. | Can be slow on large datasets because every value is shuffled across the cluster. | Optimized for aggregations; still shuffles, but combines values within each partition first, so far less data is moved. |
| Common Use Cases | Scaling values; formatting strings; converting data types. | Removing invalid rows; filtering based on a range or condition. | Splitting text; expanding hierarchical data. | Creating categories; grouping data for subsequent transformations. | Summing sales per region; counting occurrences per category. |
| Lazy Evaluation | Yes | Yes | Yes | Yes | Yes |
| Key Difference | One-to-one mapping. | Selects a subset of elements. | Can produce a one-to-many mapping; results are flattened. | Groups the values for a key into a single collection. | Combines and aggregates the values for each key. |

The DataFrame examples assume that explode, split, collect_list, and sum are imported from pyspark.sql.functions.
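
The performance gap between groupByKey() and reduceByKey() is easier to see with a small model. The sketch below is plain Python, not Spark code: it treats each inner list as one partition and counts how many records would have to cross the shuffle under each strategy. The helper names (`group_by_key`, `reduce_by_key`) and the sample sales data are hypothetical, chosen only for illustration, and the model ignores everything else Spark does.

```python
from collections import defaultdict


def group_by_key(partitions):
    """Toy groupByKey: every (key, value) record crosses the shuffle."""
    shuffled = [record for part in partitions for record in part]
    grouped = defaultdict(list)
    for key, value in shuffled:
        grouped[key].append(value)
    return dict(grouped), len(shuffled)


def reduce_by_key(partitions, func):
    """Toy reduceByKey: combine inside each partition, then shuffle."""
    shuffled = []
    for part in partitions:
        local = {}  # one partial result per key in this partition
        for key, value in part:
            local[key] = func(local[key], value) if key in local else value
        shuffled.extend(local.items())
    merged = {}  # final merge of the partial results after the shuffle
    for key, value in shuffled:
        merged[key] = func(merged[key], value) if key in merged else value
    return merged, len(shuffled)


# Hypothetical sales records, split across two partitions.
partitions = [
    [("east", 10), ("west", 5), ("east", 7)],
    [("west", 3), ("east", 1), ("west", 4)],
]

grouped, grouped_shuffled = group_by_key(partitions)
reduced, reduced_shuffled = reduce_by_key(partitions, lambda x, y: x + y)

print(grouped, grouped_shuffled)
print(reduced, reduced_shuffled)
```

In this model groupByKey() moves all six records, while reduceByKey() moves only one partial sum per key per partition (four records here). The saving grows with the number of repeated keys per partition, which is why reduceByKey() is usually preferred when the goal is an aggregate rather than the full list of values.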