In PySpark, the when function, part of the pyspark.sql.functions module, is a key building block for implementing conditional logic in a DataFrame transformation pipeline. Here’s an in-depth explanation of its usage, alternatives, and internal behavior.


1. Using when in PySpark

The when function works like a SQL CASE statement. It allows you to implement conditional logic within a column transformation.

Syntax:

from pyspark.sql.functions import when, col

df = df.withColumn("new_column", when(condition, value).otherwise(default_value))

Example:

from pyspark.sql import SparkSession
from pyspark.sql.functions import when, col

# Initialize SparkSession
spark = SparkSession.builder.appName("WhenExample").getOrCreate()

# Sample DataFrame
data = [(1, "A"), (2, "B"), (3, None)]
df = spark.createDataFrame(data, ["id", "category"])

# Apply 'when' and 'otherwise'
df = df.withColumn(
    "category_updated", when(col("category").isNull(), "Unknown").otherwise(col("category"))
)
df.show()

Output:

+---+--------+----------------+
| id|category|category_updated|
+---+--------+----------------+
|  1|       A|               A|
|  2|       B|               B|
|  3|    null|         Unknown|
+---+--------+----------------+

2. Alternative Approaches

a. CASE statement in SQL Expression

You can achieve similar logic by using PySpark’s SQL capabilities:

df.createOrReplaceTempView("temp_table")
result_df = spark.sql("""
    SELECT id, 
           category, 
           CASE 
               WHEN category IS NULL THEN 'Unknown' 
               ELSE category 
           END AS category_updated 
    FROM temp_table
""")
result_df.show()
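
If you prefer not to register a temporary view, the same CASE expression can be embedded directly with expr. A minimal sketch, reusing the df from the earlier example:

from pyspark.sql.functions import expr

df_expr = df.withColumn(
    "category_updated",
    expr("CASE WHEN category IS NULL THEN 'Unknown' ELSE category END")
)
df_expr.show()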

b. UDF (User Defined Function)

For complex logic, you can use a UDF:

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def update_category(category):
    return "Unknown" if category is None else category

update_category_udf = udf(update_category, StringType())
df = df.withColumn("category_updated", update_category_udf(col("category")))
df.show()

c. Using fillna for Simple Null Replacements

If your use case is only to handle NULL values, you can use fillna:

df = df.fillna({"category": "Unknown"})

3. Internal Behavior

  • Execution Plan: PySpark translates the when function into a Spark SQL logical plan, which the Catalyst optimizer then rewrites and optimizes, minimizing execution overhead (see the sketch below).
  • Performance: Compared to Python UDFs, when is generally faster because it runs in Spark’s native SQL execution engine and avoids Python serialization overhead.
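
To see what Catalyst produces, you can inspect the plan Spark builds for a when expression with explain(). A minimal sketch, reusing the df and imports from the first example (the column name here is illustrative):

# Print the parsed, analyzed, optimized logical plans and the physical plan
df.withColumn(
    "category_check", when(col("category").isNull(), "Unknown").otherwise(col("category"))
).explain(True)

# The when/otherwise appears as a single CASE WHEN expression inside a Project
# node, with no separate Python execution step (unlike a Python UDF).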

To implement multiple levels of conditions using when in PySpark, you can chain several when calls before the final otherwise. This allows multiple conditional checks within a single column transformation.

Here’s an example demonstrating how to use chained when calls:

Example with Multiple Levels of when:

from pyspark.sql import SparkSession
from pyspark.sql.functions import when, col

# Initialize SparkSession
spark = SparkSession.builder.appName("WhenNestedExample").getOrCreate()

# Sample DataFrame
data = [(1, "A"), (2, "B"), (3, None), (4, "C"), (5, "Unknown")]
df = spark.createDataFrame(data, ["id", "category"])

# Apply chained 'when' and 'otherwise'
df = df.withColumn(
    "category_updated",
    when(col("category").isNull(), "Unknown")  # First condition: if category is NULL
    .when(col("category") == "A", "Category A")  # Second condition: if category is 'A'
    .when(col("category") == "B", "Category B")  # Third condition: if category is 'B'
    .otherwise("Other Category")  # Default value for all other cases
)

df.show()

Output:

+---+--------+----------------+
| id|category|category_updated|
+---+--------+----------------+
|  1|       A|      Category A|
|  2|       B|      Category B|
|  3|    null|         Unknown|
|  4|       C|  Other Category|
|  5| Unknown|  Other Category|
+---+--------+----------------+

Key Explanation:

  1. when Conditions:
    • Each when specifies a condition and its associated value.
    • Conditions are evaluated in the order they appear.
    • Once a condition is satisfied, subsequent conditions are ignored for that row (see the sketch after this list).
  2. Default Case (otherwise):
    • The otherwise clause acts as a fallback value when none of the specified when conditions are met.
  3. Chaining Conditions:
    • You can chain multiple when conditions to handle complex logic.
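
As a quick illustration of the ordering rule, here is a minimal sketch (hypothetical score data, reusing the spark session from above) with overlapping conditions, where the order of the when calls decides the result:

scores = spark.createDataFrame([(5,), (50,), (500,)], ["score"])
scores.withColumn(
    "band",
    when(col("score") > 100, "high")    # checked first, so 500 becomes "high"
    .when(col("score") > 10, "medium")  # also true for 500, but never reached
    .otherwise("low")
).show()
# Swapping the first two when calls would label 500 as "medium",
# because the first matching condition always wins.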

Optimization for Complex Conditions

For readability and performance:

Use separate variables for conditions:

is_null = col("category").isNull()
is_a = col("category") == "A"
is_b = col("category") == "B"

df = df.withColumn(
    "category_updated",
    when(is_null, "Unknown")
    .when(is_a, "Category A")
    .when(is_b, "Category B")
    .otherwise("Other Category")
)

If conditions are highly complex or repetitive, consider a user-defined function (UDF), a SQL CASE expression, or building the chain programmatically, as sketched below.
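
When the same rule repeats across many values, one option is to fold a mapping into a single chained when expression. A minimal sketch, assuming a hypothetical category_labels mapping and the df from the earlier examples:

from functools import reduce
from pyspark.sql.functions import when, col

# Hypothetical mapping from raw category values to display labels
category_labels = {"A": "Category A", "B": "Category B"}

# Start from the null check, then fold each mapping entry into a chained when
label_expr = reduce(
    lambda acc, kv: acc.when(col("category") == kv[0], kv[1]),
    category_labels.items(),
    when(col("category").isNull(), "Unknown"),
)
df = df.withColumn("category_updated", label_expr.otherwise("Other Category"))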


The when function in PySpark is versatile and can be adapted to handle a wide range of scenarios for conditional logic in DataFrames. Below are multiple examples and explanations to help students understand and prepare for the most common use cases.

1. Basic when Usage with otherwise

from pyspark.sql.functions import when, col

df = df.withColumn(
    "new_column", when(col("column") > 10, "Greater").otherwise("Smaller or Equal")
)
  • Explanation:
    • when evaluates the condition (col("column") > 10).
    • If true, it assigns “Greater”; otherwise, “Smaller or Equal.”

2. Using Multiple when Conditions

You can chain multiple conditions with when.

df = df.withColumn(
    "new_column",
    when(col("column") > 10, "Greater")
    .when(col("column") == 10, "Equal")
    .otherwise("Smaller")
)
  • Explanation:
    • Each when condition is checked in order.
    • If none match, otherwise provides the fallback.

3. Using when Without otherwise

If otherwise is not used, rows that do not match any condition will contain null.

df = df.withColumn("new_column", when(col("column") > 10, "Greater"))
  • Explanation:
    • Rows where col("column") <= 10 will have null in new_column.
    • Best Practice: Use otherwise to avoid null values unless explicitly required.

4. Handling Null Values

You can explicitly check for nulls using .isNull() or .isNotNull().

df = df.withColumn(
    "new_column",
    when(col("column").isNull(), "Missing Value")
    .otherwise("Present Value")
)
  • Explanation:
    • Handles null values explicitly.
    • Useful for data cleaning and validation.

5. Using Complex Conditions

Combine multiple conditions using logical operators like & (AND) or | (OR).

df = df.withColumn(
    "new_column",
    when((col("column1") > 10) & (col("column2") < 5), "Condition 1")
    .when((col("column1") == 0) | (col("column2").isNull()), "Condition 2")
    .otherwise("Other")
)
  • Explanation:
    • Combines multiple conditions for greater flexibility.

6. Nested when Statements

You can nest when inside another when for deeply nested logic.

df = df.withColumn(
    "new_column",
    when(col("column1") > 10, 
         when(col("column2") < 5, "Nested Condition True")
         .otherwise("Nested Condition False"))
    .otherwise("Outer Condition False")
)
  • Explanation:
    • Use nested when for hierarchical logic.

7. Without otherwise When Only Partial Updates Are Needed

Sometimes, you may only want to update specific rows and leave others unchanged.

df = df.withColumn(
    "new_column",
    when(col("column") > 10, "Updated Value")
)
  • Explanation:
    • Rows not matching the condition are set to null in new_column; they do not retain an original value.
    • To leave an existing column unchanged for non-matching rows, overwrite that column and pass it to otherwise, e.g. .otherwise(col("column")).

8. Using lit for Constant Values

You can wrap constant values in lit to make the literal explicit, although when and otherwise also accept plain Python values directly.

from pyspark.sql.functions import lit

df = df.withColumn(
    "new_column",
    when(col("column") > 10, lit("Greater"))
    .otherwise(lit("Smaller"))
)
  • Explanation:
    • lit is used to create a column of constant values.

9. Applying Conditional Logic to Multiple Columns

Build conditions that reference multiple columns within a single when clause; a sketch after this example also shows applying similar logic across several output columns.

df = df.withColumn(
    "new_column",
    when((col("column1") > 10) & (col("column2") < 20), "Match")
    .otherwise("No Match")
)
  • Explanation:
    • Applies conditions involving multiple columns.
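
If you also want to derive several output columns that share similar conditional logic, one option is to loop over a mapping of output names to conditions. A brief sketch with hypothetical column and rule names:

# Hypothetical rules: output column name -> boolean condition
rules = {
    "column1_flag": col("column1") > 10,
    "column2_flag": col("column2") < 20,
}
for out_name, condition in rules.items():
    df = df.withColumn(out_name, when(condition, "Match").otherwise("No Match"))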

10. Combining with Other Functions

Combine when with other PySpark SQL functions like concat, substr, etc.

from pyspark.sql.functions import concat

df = df.withColumn(
    "new_column",
    when(col("column1") > 10, concat(col("column2"), lit("_High")))
    .otherwise(concat(col("column2"), lit("_Low")))
)
  • Explanation:
    • Use when with other transformations for dynamic values.

Key Points

  1. Always Plan Conditions: Order conditions deliberately, because chained when clauses are evaluated sequentially and the first match wins.
  2. Use otherwise When Appropriate: Avoid null results unless they are intended.
  3. Understand Null Behavior: Handle null values explicitly to avoid unintended results (see the sketch after this list).
  4. Optimize for Readability: Use clear variable names and avoid deeply nested when statements.
  5. Performance Consideration: Native when is optimized by Spark, so avoid using Python UDFs unless necessary.
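
To make the null point concrete: comparisons against null are neither true nor false, so null rows fall through to otherwise unless you test isNull() explicitly. A minimal sketch, reusing a DataFrame like the (id, category) sample built earlier, where one category is null:

df.withColumn(
    "check",
    when(col("category") == "A", "is A")
    .when(col("category") != "A", "not A")
    .otherwise("null falls through here")
).show()
# The row with a null category matches neither condition, because
# (null == "A") and (null != "A") both evaluate to null rather than true.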