In PySpark, the `when` function is a crucial part of the `pyspark.sql.functions` module that helps implement conditional logic in a DataFrame transformation pipeline. Here's an in-depth explanation of its usage, alternatives, and internal behavior.
1. Using `when` in PySpark
The `when` function works like a SQL `CASE` statement. It allows you to implement conditional logic within a column transformation.
Syntax:
from pyspark.sql.functions import when, col
df = df.withColumn("new_column", when(condition, value).otherwise(default_value))
Example:
from pyspark.sql import SparkSession
from pyspark.sql.functions import when, col
# Initialize SparkSession
spark = SparkSession.builder.appName("WhenExample").getOrCreate()
# Sample DataFrame
data = [(1, "A"), (2, "B"), (3, None)]
df = spark.createDataFrame(data, ["id", "category"])
# Apply 'when' and 'otherwise'
df = df.withColumn(
"category_updated", when(col("category").isNull(), "Unknown").otherwise(col("category"))
)
df.show()
Output:
+---+--------+----------------+
| id|category|category_updated|
+---+--------+----------------+
| 1| A| A|
| 2| B| B|
| 3| null| Unknown|
+---+--------+----------------+
2. Alternative Approaches
a. `CASE` Statement in a SQL Expression
You can achieve similar logic by using PySpark’s SQL capabilities:
df.createOrReplaceTempView("temp_table")
result_df = spark.sql("""
SELECT id,
category,
CASE
WHEN category IS NULL THEN 'Unknown'
ELSE category
END AS category_updated
FROM temp_table
""")
result_df.show()
b. UDF (User Defined Function)
For complex logic, you can use a UDF:
from pyspark.sql.functions import udf, col
from pyspark.sql.types import StringType
def update_category(category):
return "Unknown" if category is None else category
update_category_udf = udf(update_category, StringType())
df = df.withColumn("category_updated", update_category_udf(col("category")))
df.show()
c. Using `fillna` for Simple Null Replacements
If your use case is only to handle `NULL` values, you can use `fillna`:
df = df.fillna({"category": "Unknown"})
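For reference, the same replacement can also be written through the DataFrame's `na` interface (`fillna` is an alias for `na.fill`):
# Equivalent spelling; fills nulls only in the listed column
df = df.na.fill({"category": "Unknown"})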
3. Internal Behavior
- Execution Plan: PySpark translates the `when` function into a logical plan for Spark SQL, which the Catalyst Optimizer then optimizes, minimizing execution overhead.
- Performance: Compared to UDFs, `when` is generally faster because it leverages Spark's native SQL execution and avoids Python serialization overhead.
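To see this in practice, you can print the plans Spark generates. A minimal sketch, assuming the id/category DataFrame from the first example:
from pyspark.sql.functions import when, col
# Build the conditional column, then print the logical and physical plans
transformed = df.withColumn(
    "category_updated",
    when(col("category").isNull(), "Unknown").otherwise(col("category"))
)
# The `when` call shows up as a native CASE WHEN expression in the plan,
# which Catalyst can optimize, rather than as an opaque Python UDF
transformed.explain(extended=True)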
To implement multiple levels of conditions using `when` in PySpark, you can chain `when` calls. This allows for multiple conditional checks within a single column transformation.
Here's an example demonstrating how to chain `when` statements:
Example with Multiple Levels of `when`:
from pyspark.sql import SparkSession
from pyspark.sql.functions import when, col
# Initialize SparkSession
spark = SparkSession.builder.appName("WhenNestedExample").getOrCreate()
# Sample DataFrame
data = [(1, "A"), (2, "B"), (3, None), (4, "C"), (5, "Unknown")]
df = spark.createDataFrame(data, ["id", "category"])
# Apply nested 'when' and 'otherwise'
df = df.withColumn(
"category_updated",
when(col("category").isNull(), "Unknown") # First condition: if category is NULL
.when(col("category") == "A", "Category A") # Second condition: if category is 'A'
.when(col("category") == "B", "Category B") # Third condition: if category is 'B'
.otherwise("Other Category") # Default value for all other cases
)
df.show()
Output:
+---+--------+----------------+
| id|category|category_updated|
+---+--------+----------------+
| 1| A| Category A|
| 2| B| Category B|
| 3| null| Unknown|
| 4| C| Other Category|
| 5| Unknown| Other Category|
+---+--------+----------------+
Key Explanation:
- `when` Conditions:
  - Each `when` specifies a condition and its associated value.
  - Conditions are evaluated in the order they appear.
  - Once a condition is satisfied, subsequent conditions are ignored for that row.
- Default Case (`otherwise`):
  - The `otherwise` clause acts as a fallback value when none of the specified `when` conditions are met.
- Chaining Conditions:
  - You can chain multiple `when` conditions to handle complex logic.
Optimization for Complex Conditions
For readability and performance:
Use separate variables for conditions:
is_null = col("category").isNull()
is_a = col("category") == "A"
is_b = col("category") == "B"
df = df.withColumn(
"category_updated",
when(is_null, "Unknown")
.when(is_a, "Category A")
.when(is_b, "Category B")
.otherwise("Other Category")
)
If conditions are highly complex or repetitive, consider a User Defined Function (UDF) or a SQL `CASE` expression, as sketched below.
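A minimal sketch of the SQL-expression route, using `expr` on the id/category DataFrame from the earlier examples:
from pyspark.sql.functions import expr
# The same multi-level logic expressed as a SQL CASE inside expr()
df = df.withColumn(
    "category_updated",
    expr("""
        CASE
            WHEN category IS NULL THEN 'Unknown'
            WHEN category = 'A' THEN 'Category A'
            WHEN category = 'B' THEN 'Category B'
            ELSE 'Other Category'
        END
    """)
)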
The `when` function in PySpark is versatile and can be adapted to handle various scenarios for conditional logic in DataFrames. Below are multiple examples and explanations to help students understand and prepare for the most common use cases.
1. Basic `when` Usage with `otherwise`
from pyspark.sql.functions import when, col
df = df.withColumn(
"new_column", when(col("column") > 10, "Greater").otherwise("Smaller or Equal")
)
- Explanation:
  - `when` evaluates the condition (`col("column") > 10`).
  - If true, it assigns "Greater"; otherwise, "Smaller or Equal."
2. Using Multiple `when` Conditions
You can chain multiple conditions with `when`.
df = df.withColumn(
"new_column",
when(col("column") > 10, "Greater")
.when(col("column") == 10, "Equal")
.otherwise("Smaller")
)
- Explanation:
  - Each `when` condition is checked in order.
  - If none match, `otherwise` provides the fallback.
3. Using `when` Without `otherwise`
If `otherwise` is not used, rows that do not match any condition will contain `null`.
df = df.withColumn("new_column", when(col("column") > 10, "Greater"))
- Explanation:
  - Rows where `col("column") <= 10` will have `null` in `new_column`.
  - Best Practice: Use `otherwise` to avoid `null` values unless explicitly required.
4. Handling Null Values
You can explicitly check for nulls using `.isNull()` or `.isNotNull()`.
df = df.withColumn(
"new_column",
when(col("column").isNull(), "Missing Value")
.otherwise("Present Value")
)
- Explanation:
  - Handles `null` values explicitly.
  - Useful for data cleaning and validation.
5. Using Complex Conditions
Combine multiple conditions using logical operators like `&` (AND) or `|` (OR). Note that each comparison must be wrapped in parentheses, because `&` and `|` bind more tightly than comparison operators such as `>` and `==`.
df = df.withColumn(
"new_column",
when((col("column1") > 10) & (col("column2") < 5), "Condition 1")
.when((col("column1") == 0) | (col("column2").isNull()), "Condition 2")
.otherwise("Other")
)
- Explanation:
  - Combines multiple conditions for greater flexibility.
6. Nested `when` Statements
You can nest a `when` inside another `when` for deeply nested logic.
df = df.withColumn(
"new_column",
when(col("column1") > 10,
when(col("column2") < 5, "Nested Condition True")
.otherwise("Nested Condition False"))
.otherwise("Outer Condition False")
)
- Explanation:
  - Use nested `when` for hierarchical logic.
7. Without `otherwise` When Only Partial Updates Are Needed
Sometimes you may only want to update specific rows and leave others unchanged.
df = df.withColumn(
"new_column",
when(col("column") > 10, "Updated Value")
)
- Explanation:
  - Rows not matching the condition will get `null` in `new_column`, since no `otherwise` is provided.
  - To genuinely leave non-matching rows unchanged, write the result back to the same column and pass the original column to `otherwise`, as sketched below.
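A minimal sketch of that pattern, assuming `column` is numeric and the goal is to cap values above 10 while leaving every other row unchanged:
from pyspark.sql.functions import when, col, lit
# Overwrite the existing column: only matching rows change,
# all other rows keep their current value via otherwise()
df = df.withColumn(
    "column",
    when(col("column") > 10, lit(10)).otherwise(col("column"))
)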
8. Using `lit` for Constant Values
To assign constant values explicitly, wrap them in `lit`.
from pyspark.sql.functions import lit
df = df.withColumn(
"new_column",
when(col("column") > 10, lit("Greater"))
.otherwise(lit("Smaller"))
)
- Explanation:
  - `lit` is used to create a column of constant values.
9. Applying Conditional Logic to Multiple Columns
Apply conditions that involve multiple columns within a single `when` clause.
df = df.withColumn(
"new_column",
when((col("column1") > 10) & (col("column2") < 20), "Match")
.otherwise("No Match")
)
- Explanation:
  - Applies conditions involving multiple columns.
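If the goal is instead to derive several conditional columns at once, one option is `withColumns` (available in Spark 3.3+). A minimal sketch with illustrative column names:
from pyspark.sql.functions import when, col
# Derive two conditional columns in a single pass (Spark 3.3+)
df = df.withColumns({
    "col1_flag": when(col("column1") > 10, "High").otherwise("Low"),
    "col2_flag": when(col("column2") < 20, "Small").otherwise("Large")
})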
10. Combining with Other Functions
Combine `when` with other PySpark SQL functions like `concat`, `substr`, etc.
from pyspark.sql.functions import concat
df = df.withColumn(
"new_column",
when(col("column1") > 10, concat(col("column2"), lit("_High")))
.otherwise(concat(col("column2"), lit("_Low")))
)
- Explanation:
  - Use `when` with other transformations for dynamic values.
Key Points
- Always Plan Conditions: Prioritize conditions because `when` is evaluated sequentially.
- Use `otherwise` When Appropriate: Avoid `null` results unless they are intended.
- Understand Null Behavior: Handle `null` values explicitly to avoid unintended consequences.
- Optimize for Readability: Use clear variable names and avoid deeply nested `when` statements.
- Performance Consideration: Native `when` is optimized by Spark, so avoid using Python UDFs unless necessary.