In PySpark, the when function, part of the pyspark.sql.functions module, is the standard tool for implementing conditional logic in a DataFrame transformation pipeline. Here’s an in-depth look at its usage, alternatives, and internal behavior.
1. Using when in PySpark
The when function works like a SQL CASE expression: it lets you apply conditional logic within a column transformation.
Syntax:
from pyspark.sql.functions import when, col
df = df.withColumn("new_column", when(condition, value).otherwise(default_value))
Example:
from pyspark.sql import SparkSession
from pyspark.sql.functions import when, col
# Initialize SparkSession
spark = SparkSession.builder.appName("WhenExample").getOrCreate()
# Sample DataFrame
data = [(1, "A"), (2, "B"), (3, None)]
df = spark.createDataFrame(data, ["id", "category"])
# Apply 'when' and 'otherwise'
df = df.withColumn(
    "category_updated", when(col("category").isNull(), "Unknown").otherwise(col("category"))
)
df.show()
Output:
+---+--------+----------------+
| id|category|category_updated|
+---+--------+----------------+
| 1| A| A|
| 2| B| B|
| 3| null| Unknown|
+---+--------+----------------+
2. Alternative Approaches
a. CASE statement in SQL Expression
You can achieve similar logic by using PySpark’s SQL capabilities:
df.createOrReplaceTempView("temp_table")
result_df = spark.sql("""
    SELECT id,
           category,
           CASE
               WHEN category IS NULL THEN 'Unknown'
               ELSE category
           END AS category_updated
    FROM temp_table
""")
result_df.show()
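You can also write the same CASE expression inline with selectExpr, which avoids registering a temp view:
# Same logic without a temp view
result_df = df.selectExpr(
    "id",
    "category",
    "CASE WHEN category IS NULL THEN 'Unknown' ELSE category END AS category_updated",
)
result_df.show()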
b. UDF (User Defined Function)
For complex logic, you can use a UDF:
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
def update_category(category):
    return "Unknown" if category is None else category
update_category_udf = udf(update_category, StringType())
df = df.withColumn("category_updated", update_category_udf(col("category")))
df.show()
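If you do need a UDF, a vectorized pandas UDF typically reduces the serialization overhead. A minimal sketch of the same logic (assumes pandas and PyArrow are installed):
import pandas as pd
from pyspark.sql.functions import pandas_udf
# Vectorized alternative: operates on whole pandas Series at a time
@pandas_udf(StringType())
def update_category_vec(category: pd.Series) -> pd.Series:
    return category.fillna("Unknown")
df = df.withColumn("category_updated", update_category_vec(col("category")))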
c. Using fillna for Simple Null Replacements
If your use case is only to handle NULL values, you can use fillna:
df = df.fillna({"category": "Unknown"})
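A related option is coalesce, which returns the first non-null value among its arguments:
from pyspark.sql.functions import coalesce, col, lit
# Equivalent null fallback using coalesce
df = df.withColumn("category_updated", coalesce(col("category"), lit("Unknown")))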
3. Internal Behavior
- Execution Plan: PySpark translates the when function into a logical plan for Spark SQL. These plans are optimized by the Catalyst optimizer, minimizing execution overhead.
- Performance: compared to UDFs, when is generally faster because it leverages Spark’s native SQL execution and avoids Python serialization overhead.
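You can verify this by printing the query plan; the when/otherwise logic should appear as a native CASE WHEN expression rather than a Python call:
# Inspect the optimized and physical plans
df.withColumn(
    "category_updated",
    when(col("category").isNull(), "Unknown").otherwise(col("category"))
).explain(True)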
To implement multiple levels of conditions, chain additional .when calls onto the first. Each call adds another conditional check within the same column transformation.
Here’s an example demonstrating chained when conditions:
Example with Multiple Levels of when:
from pyspark.sql import SparkSession
from pyspark.sql.functions import when, col
# Initialize SparkSession
spark = SparkSession.builder.appName("WhenNestedExample").getOrCreate()
# Sample DataFrame
data = [(1, "A"), (2, "B"), (3, None), (4, "C"), (5, "Unknown")]
df = spark.createDataFrame(data, ["id", "category"])
# Apply nested 'when' and 'otherwise'
df = df.withColumn(
    "category_updated",
    when(col("category").isNull(), "Unknown")    # First condition: category is NULL
    .when(col("category") == "A", "Category A")  # Second condition: category is 'A'
    .when(col("category") == "B", "Category B")  # Third condition: category is 'B'
    .otherwise("Other Category")                 # Default value for all other cases
)
df.show()
Output:
+---+--------+----------------+
| id|category|category_updated|
+---+--------+----------------+
| 1| A| Category A|
| 2| B| Category B|
| 3| null| Unknown|
| 4| C| Other Category|
| 5| Unknown| Other Category|
+---+--------+----------------+
Key Explanation:
- when conditions:
  - Each when specifies a condition and its associated value.
  - Conditions are evaluated in the order they appear.
  - Once a condition is satisfied, subsequent conditions are ignored for that row.
- Default case (otherwise):
  - The otherwise clause acts as a fallback value when none of the specified when conditions are met.
- Chaining conditions:
  - You can chain multiple when conditions to handle complex logic.
Optimization for Complex Conditions
For readability and performance:
Use separate variables for conditions:
is_null = col("category").isNull()
is_a = col("category") == "A"
is_b = col("category") == "B"
df = df.withColumn(
    "category_updated",
    when(is_null, "Unknown")
    .when(is_a, "Category A")
    .when(is_b, "Category B")
    .otherwise("Other Category")
)
If conditions are highly complex, consider a User Defined Function (UDF) or a SQL CASE expression; for repetitive, data-driven conditions, you can build the chain programmatically, as sketched below.
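A minimal sketch of that data-driven approach, where label_map is a hypothetical mapping of raw values to labels:
from functools import reduce
from pyspark.sql.functions import when, col
# Hypothetical mapping of raw category values to labels
label_map = {"A": "Category A", "B": "Category B"}
# Fold the mapping into a single chained when expression
expr = reduce(
    lambda acc, kv: acc.when(col("category") == kv[0], kv[1]),
    label_map.items(),
    when(col("category").isNull(), "Unknown"),
)
df = df.withColumn("category_updated", expr.otherwise("Other Category"))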
The when function in PySpark is versatile and can be adapted to many scenarios for conditional logic in DataFrames. Below are examples and explanations covering the most common use cases.
1. Basic when Usage with otherwise
from pyspark.sql.functions import when, col
df = df.withColumn(
    "new_column", when(col("column") > 10, "Greater").otherwise("Smaller or Equal")
)
- Explanation:
  - when evaluates the condition col("column") > 10.
  - If it is true, the row gets "Greater"; otherwise, "Smaller or Equal".
2. Using Multiple when Conditions
You can chain multiple conditions with when.
df = df.withColumn(
    "new_column",
    when(col("column") > 10, "Greater")
    .when(col("column") == 10, "Equal")
    .otherwise("Smaller")
)
- Explanation:
  - Each when condition is checked in order.
  - If none match, otherwise provides the fallback.
3. Using when Without otherwise
If otherwise is not used, rows that do not match any condition will contain null.
df = df.withColumn("new_column", when(col("column") > 10, "Greater"))
- Explanation:
  - Rows where col("column") <= 10 will have null in new_column.
  - Best practice: use otherwise to avoid null values unless they are explicitly required.
4. Handling Null Values
You can explicitly check for nulls using .isNull() or .isNotNull().
df = df.withColumn(
    "new_column",
    when(col("column").isNull(), "Missing Value")
    .otherwise("Present Value")
)
- Explanation:
  - Handles null values explicitly.
  - Useful for data cleaning and validation.
5. Using Complex Conditions
Combine multiple conditions using logical operators like & (AND) or | (OR).
df = df.withColumn(
    "new_column",
    when((col("column1") > 10) & (col("column2") < 5), "Condition 1")
    .when((col("column1") == 0) | (col("column2").isNull()), "Condition 2")
    .otherwise("Other")
)
- Explanation:
  - Combines multiple conditions for greater flexibility.
  - Wrap each comparison in parentheses: & and | bind more tightly than comparison operators in Python, so omitting them changes how the expression is parsed and can fail or give wrong results.
6. Nested when Statements
You can nest a when inside another when for hierarchical logic.
df = df.withColumn(
    "new_column",
    when(col("column1") > 10,
         when(col("column2") < 5, "Nested Condition True")
         .otherwise("Nested Condition False"))
    .otherwise("Outer Condition False")
)
- Explanation:
  - Use nested when for hierarchical logic.
7. Without otherwise When Only Partial Updates Are Needed
Sometimes, you may only want to update specific rows and leave others unchanged.
df = df.withColumn(
    "new_column",
    when(col("column") > 10, "Updated Value")
)
- Explanation:
  - Rows not matching the condition will contain null in new_column; when alone does not preserve the original value.
  - To genuinely leave non-matching rows unchanged, pass the original column to otherwise, as in the sketch below.
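A minimal sketch of that pattern, assuming a numeric column named column: write back to the same column and fall through to its original value.
from pyspark.sql.functions import when, col, lit
# Cap values above 10, leaving all other rows untouched
df = df.withColumn(
    "column",
    when(col("column") > 10, lit(10)).otherwise(col("column"))
)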
8. Using lit for Constant Values
If your branches assign constant values, you can wrap them in lit.
from pyspark.sql.functions import lit
df = df.withColumn(
    "new_column",
    when(col("column") > 10, lit("Greater"))
    .otherwise(lit("Smaller"))
)
- Explanation:
  - lit creates a Column of constant values. Plain Python literals are also accepted directly, but lit makes the intent explicit.
9. Applying Conditional Logic to Multiple Columns
Apply a condition that spans multiple columns within a single when clause.
df = df.withColumn(
    "new_column",
    when((col("column1") > 10) & (col("column2") < 20), "Match")
    .otherwise("No Match")
)
- Explanation:
  - Applies a condition that involves multiple columns.
10. Combining with Other Functions
Combine when with other PySpark SQL functions like concat, substr, etc.
from pyspark.sql.functions import concat
df = df.withColumn(
    "new_column",
    when(col("column1") > 10, concat(col("column2"), lit("_High")))
    .otherwise(concat(col("column2"), lit("_Low")))
)
- Explanation:
  - Use when with other transformations to compute dynamic values.
Key Points
- Always Plan Conditions: prioritize and order conditions deliberately, because when is evaluated sequentially and the first matching branch wins (see the sketch below).
- Use otherwise When Appropriate: avoid null results unless they are intended.
- Understand Null Behavior: handle null values explicitly to avoid unintended consequences.
- Optimize for Readability: use clear variable names and avoid deeply nested when statements.
- Performance Consideration: native when is optimized by Spark, so avoid Python UDFs unless necessary.
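To illustrate the first point, here is a small hypothetical example (assuming a numeric amount column) where condition order determines the result:
from pyspark.sql.functions import when, col
# A row with amount = 50 satisfies both conditions, but is labeled
# "High" because that branch is checked first.
df = df.withColumn(
    "band",
    when(col("amount") > 40, "High")     # checked first; wins for 50
    .when(col("amount") > 10, "Medium")  # not reached when amount > 40
    .otherwise("Low")
)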