This table compares five PySpark transformations, map(), filter(), flatMap(), groupByKey(), and reduceByKey(), across RDD and DataFrame operations, covering their differences, use cases, and examples for efficient big data processing.
Key Attribute | map() | filter() | flatMap() | groupByKey() | reduceByKey() |
Definition | Applies a function to each element, returning a new dataset with transformed elements. | Keeps only the elements or rows that satisfy a condition. | Maps each input element to zero or more outputs and flattens the results. | Groups the values of a key-value RDD by key. | Merges the values for each key using an associative and commutative function. |
One-to-Many Mapping | No | No | Yes | No | No |
Input | Single RDD or DataFrame column. | Single RDD or DataFrame column. | Single RDD or DataFrame column. | RDD of key-value pairs. | RDD of key-value pairs. |
Output | Transformed RDD or modified DataFrame column. | Filtered RDD or subset DataFrame. | Flattened RDD or expanded DataFrame rows. | RDD of key and grouped values (as iterable). | RDD of key and aggregated value. |
Use Cases | Element-wise transformations (e.g., scaling, formatting). | Conditional filtering (e.g., remove unwanted data). | Splitting text, expanding lists into rows. | Grouping data by a key (e.g., categorizing). | Aggregating per-key metrics such as sum or count; an average needs (sum, count) pairs, because averaging is not associative. |
Examples (RDD) | rdd.map(lambda x: x * 2) | rdd.filter(lambda x: x > 0) | rdd.flatMap(lambda x: x.split(" ")) | rdd.groupByKey() | rdd.reduceByKey(lambda x, y: x + y) |
Examples (DataFrame) | df.withColumn("new_col", df["col"] * 2) | df.filter(df["col"] > 0) | df.select(explode(split(df["col"], " ")).alias("words")) | df.groupBy("key").agg(collect_list("value")) | df.groupBy("key").agg(sum("value")) |
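Performance | Efficient; a narrow transformation that processes each element independently, with no shuffle. | Efficient; a narrow transformation that drops non-matching rows/elements, with no shuffle. | Narrow transformation with no shuffle; cost grows with the number of outputs produced per input. | Can be inefficient for large datasets, because every value is shuffled across the network with no local combining. | Usually cheaper than groupByKey(); it still shuffles, but combines values within each partition first, so less data is transferred. |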
Common Use Cases | Scaling values; formatting strings; converting data types. | Removing invalid rows; filtering based on a range or condition. | Splitting text; expanding hierarchical data. | Creating categories; grouping data for subsequent transformations. | Summing sales per region; counting occurrences per category. |
Lazy Evaluation | Yes | Yes | Yes | Yes | Yes |
Key Difference | One-to-one mapping. | Selects a subset of elements. | Can produce one-to-many mapping; results are flattened. | Groups values with a key into a single collection. | Combines and aggregates values for each key. |
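
The mapping and shuffle behaviour summarized in the table can be sketched without a cluster. The block below is a plain-Python model of the semantics, not Spark itself: the `partitions` list, the helper functions `group_by_key` and `reduce_by_key`, and the sample data are all hypothetical stand-ins chosen for illustration. It counts how many records would cross the shuffle in each approach and shows why an average is computed from (sum, count) pairs.

```python
from collections import defaultdict

# Two hypothetical partitions of (key, value) pairs, standing in for an RDD.
partitions = [
    [("a", 1), ("b", 2), ("a", 3)],
    [("a", 4), ("b", 5), ("b", 6)],
]


def group_by_key(parts):
    """Model of groupByKey: every record crosses the shuffle unchanged."""
    shuffled = [pair for part in parts for pair in part]
    grouped = defaultdict(list)
    for key, value in shuffled:
        grouped[key].append(value)
    return dict(grouped), len(shuffled)


def reduce_by_key(parts, func):
    """Model of reduceByKey: combine inside each partition, then shuffle."""
    shuffled = []
    for part in parts:
        local = {}
        for key, value in part:
            local[key] = func(local[key], value) if key in local else value
        shuffled.extend(local.items())  # one record per key per partition
    merged = {}
    for key, value in shuffled:
        merged[key] = func(merged[key], value) if key in merged else value
    return merged, len(shuffled)


grouped, grouped_shuffled = group_by_key(partitions)
reduced, reduced_shuffled = reduce_by_key(partitions, lambda x, y: x + y)

# Averages go through (sum, count) pairs, since a mean of means is not the mean.
with_counts = [[(k, (v, 1)) for k, v in part] for part in partitions]
totals, _ = reduce_by_key(with_counts, lambda x, y: (x[0] + y[0], x[1] + y[1]))
averages = {k: s / n for k, (s, n) in totals.items()}

# map is one-to-one; flatMap is one-to-many and flattens the results.
lines = ["big data", "spark"]
mapped = [line.split(" ") for line in lines]
flat_mapped = [word for line in lines for word in line.split(" ")]

print(grouped, grouped_shuffled)
print(reduced, reduced_shuffled)
print(averages)
print(mapped, flat_mapped)
```

In this toy data the grouped version moves all six records, while the reduced version moves four, one per key per partition. The gap widens as the number of values per key grows, which is the usual reason to prefer reduceByKey() when only an aggregate is needed.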