PySpark, the Python API for Apache Spark, provides multiple join types to combine DataFrames based on specific conditions. These join types are crucial for merging datasets, performing lookups, and combining results from multiple sources in big data processing and analytics. Here’s an overview of the various join types in PySpark, with examples.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
# Initialize SparkSession
spark = SparkSession.builder.appName("JoinExample").getOrCreate()
# Sample DataFrames
data1 = [(1, "Alice"), (2, "Bob"), (3, "Cathy")]
data2 = [(1, "HR"), (2, "Finance"), (4, "Marketing")]
df1 = spark.createDataFrame(data1, ["id", "name"])
df2 = spark.createDataFrame(data2, ["id", "department"])
# Inner Join
inner_join = df1.join(df2, on="id", how="inner")
inner_join.show()
Output:
+---+-----+----------+
| id| name|department|
+---+-----+----------+
|  1|Alice|        HR|
|  2|  Bob|   Finance|
+---+-----+----------+
In PySpark, you can use the join() method to join two DataFrames on multiple conditions by combining them with logical operators in the on parameter. Below is an example of how to perform a join with multiple conditions.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
# Initialize SparkSession
spark = SparkSession.builder.appName("JoinExample").getOrCreate()
# Create DataFrames
data1 = [(1, "Alice", 25), (2, "Bob", 30), (3, "Charlie", 35)]
data2 = [(1, "USA"), (2, "UK"), (4, "Canada")]
df1 = spark.createDataFrame(data1, ["id", "name", "age"])
df2 = spark.createDataFrame(data2, ["id", "country"])
# Join with multiple conditions
joined_df = df1.join(
df2,
(df1["id"] == df2["id"]) & (df1["age"] > 20), # Multiple conditions
"inner" # Join type (e.g., "inner", "left", "right", "outer")
)
# Show the result
joined_df.show()
Output:
+---+-----+---+---+-------+
| id| name|age| id|country|
+---+-----+---+---+-------+
|  1|Alice| 25|  1|    USA|
|  2|  Bob| 30|  2|     UK|
+---+-----+---+---+-------+
Key Points:
- Conditions: Use logical operators like & (AND), | (OR), and ~ (NOT) to combine multiple conditions.
- Column References: Use col() or direct column references like df["col_name"] to refer to columns.
- Join Types: Pass the join type as the third argument (e.g., "inner", "left", "right", "outer").
Types of Joins in PySpark
PySpark supports multiple types of joins, similar to SQL:
- Inner Join: Returns rows where the keys exist in both DataFrames.
- Left Outer Join: Returns all rows from the left DataFrame and matching rows from the right. Fills non-matching rows with null.
- Right Outer Join: Returns all rows from the right DataFrame and matching rows from the left. Fills non-matching rows with null.
- Full Outer Join: Returns all rows from both DataFrames, filling non-matching rows with null.
- Cross Join: Returns the Cartesian product of both DataFrames.
- Semi Join: Returns rows from the left DataFrame where there are matching rows in the right DataFrame.
- Anti Join: Returns rows from the left DataFrame that have no match in the right DataFrame.
1. Inner Join
- Definition: Returns rows with matching keys in both DataFrames.
- Use Case: When you only need records that exist in both DataFrames.
- Syntax:
df1.join(df2, df1["key"] == df2["key"], "inner")
- Result: Rows where key exists in both df1 and df2.
2. Left Join (Left Outer Join)
- Definition: Returns all rows from the left DataFrame and matched rows from the right DataFrame. Unmatched rows in the right DataFrame are returned as NULL.
- Use Case: When you want all records from the left DataFrame, even if there’s no match in the right.
- Syntax:
df1.join(df2, df1["key"] == df2["key"], "left")
- Result: All rows from df1 with matching rows from df2.
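For example, with the first pair of sample DataFrames from this article (df1 with id/name and df2 with id/department), a left join keeps id 3 (Cathy) even though it has no department:
# Left outer join: keep every row of df1, filling missing df2 columns with null
left_join = df1.join(df2, on="id", how="left")
left_join.show()
# Expected rows (order may vary): (1, Alice, HR), (2, Bob, Finance), (3, Cathy, null)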
3. Right Join (Right Outer Join)
- Definition: Returns all rows from the right DataFrame and matched rows from the left DataFrame. Unmatched rows in the left DataFrame are returned as NULL.
- Use Case: When you want all records from the right DataFrame, even if there’s no match in the left.
- Syntax:
df1.join(df2, df1["key"] == df2["key"], "right")
- Result: All rows from df2 with matching rows from df1.
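Continuing with the same sample DataFrames, a right join keeps the Marketing row (id 4) even though no employee matches it:
# Right outer join: keep every row of df2, filling missing df1 columns with null
right_join = df1.join(df2, on="id", how="right")
right_join.show()
# Expected rows (order may vary): (1, Alice, HR), (2, Bob, Finance), (4, null, Marketing)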
4. Full Join (Full Outer Join)
- Definition: Returns all rows from both DataFrames. Unmatched rows are filled with NULL in the corresponding columns.
- Use Case: When you want to include all records from both DataFrames, regardless of matches.
- Syntax:
df1.join(df2, df1["key"] == df2["key"], "outer")
- Result: All rows from both df1 and df2, matched where possible.
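With the sample DataFrames, a full outer join keeps both the unmatched employee (id 3) and the unmatched department (id 4):
# Full outer join: keep all rows from both sides
full_join = df1.join(df2, on="id", how="outer")
full_join.show()
# Expected rows (order may vary): (1, Alice, HR), (2, Bob, Finance), (3, Cathy, null), (4, null, Marketing)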
5. Left Semi Join
- Definition: Returns only rows from the left DataFrame that have matches in the right DataFrame.
- Use Case: When you want rows from the left DataFrame that are present in the right DataFrame.
- Syntax:
df1.join(df2, df1["key"] == df2["key"], "left_semi")
- Result: Rows from df1 that have a match in df2 (only df1’s columns are returned).
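Using the sample DataFrames, note that a semi join behaves like a filter rather than a merge:
# Left semi join: filter df1 down to rows whose id also appears in df2
semi_join = df1.join(df2, on="id", how="left_semi")
semi_join.show()
# Expected rows (order may vary): (1, Alice), (2, Bob); no department column is returned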
6. Left Anti Join
- Definition: Returns only rows from the left DataFrame that do not have matches in the right DataFrame.
- Use Case: When you want rows from the left DataFrame that do not exist in the right DataFrame.
- Syntax:
df1.join(df2, df1["key"] == df2["key"], "left_anti")
- Result: Rows from df1 that do not match with df2 (only df1’s columns are returned).
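Using the sample DataFrames, the anti join is the complement of the semi join above:
# Left anti join: keep only df1 rows whose id does not appear in df2
anti_join = df1.join(df2, on="id", how="left_anti")
anti_join.show()
# Expected row: (3, Cathy); no department column is returned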
7. Cross Join
- Definition: Returns the Cartesian product of two DataFrames, combining each row from the first DataFrame with every row from the second.
- Use Case: When you need all possible combinations of rows.
- Syntax:
df1.crossJoin(df2)
- Result: All combinations of rows from df1 and df2.
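With the sample DataFrames (3 rows each), a cross join produces 3 x 3 = 9 rows, one for every pairing:
# Cross join: Cartesian product of df1 and df2 (no join condition)
cross_join = df1.crossJoin(df2)
cross_join.show()
# 9 rows: every (id, name) from df1 paired with every (id, department) from df2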
Join Optimization in PySpark
- Broadcast Join: Use broadcast(df) for small DataFrames to avoid shuffles.
- Partitioning: Use .repartition() or .coalesce() to optimize shuffling.
- Filter Early: Apply .filter() before joins to reduce data size.
Is there a right_semi or right_anti join in PySpark?
No, PySpark does not directly support right_semi or right_anti joins. However, you can achieve the same functionality by swapping the DataFrames and using left_semi or left_anti joins.
How to Simulate right_semi in PySpark
A right_semi join would return all rows from the right DataFrame where a match exists in the left DataFrame. You can achieve this by swapping the DataFrames and using a left_semi join.
Example:
# Right semi join equivalent
result = df2.join(df1, df2["key"] == df1["key"], "left_semi")
This ensures only the rows from df2 (the right DataFrame) with matches in df1 (the left DataFrame) are included.
How to Simulate right_anti in PySpark
A right_anti join would return all rows from the right DataFrame where no match exists in the left DataFrame. You can achieve this by swapping the DataFrames and using a left_anti join.
Example:
# Right anti join equivalent
result = df2.join(df1, df2["key"] == df1["key"], "left_anti")
This ensures only the rows from df2 (the right DataFrame) without matches in df1 (the left DataFrame) are included.
What Happens Internally?
When a join operation is performed in PySpark:
- Logical Plan Generation:
- PySpark generates a logical execution plan by analyzing the join and identifying the most efficient way to execute it.
- Optimization by Catalyst Optimizer:
- PySpark’s Catalyst Optimizer optimizes the logical plan by:
- Selecting the appropriate join type (e.g., broadcast join if applicable).
- Reordering operations for efficiency.
- Pruning unnecessary columns or filters.
- Execution Plan:
- The logical plan is converted into a physical execution plan, which determines how the data will be read and processed.
- Shuffle or Broadcast:
- For larger DataFrames, PySpark may shuffle data across partitions to align keys.
- If one DataFrame is small enough, Spark may use a broadcast join to replicate it to all worker nodes for faster processing.
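You can see which strategy the optimizer actually picked by calling explain() on the joined DataFrame. Here is a quick sketch using the sample DataFrames; the exact plan text depends on your Spark version, data sizes, and configuration:
joined = df1.join(df2, on="id", how="inner")
joined.explain()       # show the physical plan
joined.explain(True)   # also show the parsed, analyzed, and optimized logical plans
# Look for BroadcastHashJoin vs. SortMergeJoin (with a shuffle exchange) in the physical plan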
When to Use Each Type of Join
1. Inner Join
- Use: When you only want matching rows between two DataFrames.
- Avoid: If you need unmatched rows from either DataFrame.
2. Left/Right Outer Join
- Use: When you want all rows from one DataFrame, regardless of whether there’s a match in the other.
- Avoid: If the unmatched rows are irrelevant or increase data unnecessarily.
3. Full Outer Join
- Use: When you need all rows from both DataFrames.
- Avoid: If the unmatched data from both DataFrames is unnecessary, as this increases memory usage.
4. Cross Join
- Use: Rarely, only when you explicitly need the Cartesian product.
- Avoid: In most cases, as it can be computationally expensive.
5. Semi Join
- Use: When you need rows from the left DataFrame that have matches in the right.
- Avoid: If you need columns from the right DataFrame.
6. Anti Join
- Use: When you need rows from the left DataFrame that don’t have matches in the right.
- Avoid: If you need to retain unmatched data from the right DataFrame.
Optimizations for Joins
- Broadcast Joins:
- Use broadcast joins for small DataFrames (typically <10MB).
- Manually enable broadcasting:
from pyspark.sql.functions import broadcast
df1.join(broadcast(df2), on="id", how="inner")
- PySpark will replicate the smaller DataFrame to all worker nodes, avoiding shuffling.
- Partitioning:
- Ensure DataFrames are partitioned on the join key to minimize shuffling.
- Example:
df1 = df1.repartition("id")
df2 = df2.repartition("id")
- Column Pruning:
- Select only the necessary columns before the join to reduce data transfer and processing.
- Filter Early:
- Apply filters before the join to reduce the size of DataFrames.
- Example:
df1_filtered = df1.filter(col("id") > 10)
df2_filtered = df2.filter(col("department") == "HR")
result = df1_filtered.join(df2_filtered, on="id", how="inner")
- Use Appropriate Join Types:
- Avoid using full outer join if you don’t need unmatched rows from both DataFrames.
- Use semi join or anti join for existence checks instead of inner joins.
- Avoid Skewed Data:
- If your join key has skewed distribution, it can lead to load imbalance.
- Solutions:
- Use salting by adding a random key to distribute skewed rows (see the sketch after this list).
- Repartition using a composite key.
- Use Spark SQL for Complex Joins:
- For readability and maintainability, complex joins can be expressed in Spark SQL.
df1.createOrReplaceTempView("df1")
df2.createOrReplaceTempView("df2")
result = spark.sql("""
SELECT df1.id, df1.name, df2.department
FROM df1
JOIN df2
ON df1.id = df2.id
""")
When Not to Use Joins
- When Broadcast Join Fails:
- Avoid using broadcast joins for large DataFrames as it may cause memory errors.
- For Simple Key-Value Lookups:
- Consider using map or flatMap transformations instead of joins (one alternative is sketched after this list).
- With Highly Skewed Data:
- Joins can cause load imbalance when the data distribution is skewed. Mitigate using salting or repartitioning.
- Unnecessary Joins:
- Avoid joins if data can be pre-aggregated or filtered to achieve the same result.
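For the key-value lookup point above, one way to avoid a join entirely is to collect the small table to the driver and share it as a broadcast variable. This is a minimal sketch using the first pair of sample DataFrames, assuming df2 comfortably fits in driver memory (lookup_dept is a made-up helper name):
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

# Collect the small lookup table into a plain dict and broadcast it to every executor
dept_map = {row["id"]: row["department"] for row in df2.collect()}
dept_bc = spark.sparkContext.broadcast(dept_map)

@F.udf(returnType=StringType())
def lookup_dept(emp_id):
    # Missing ids return None (null), similar to a left join
    return dept_bc.value.get(emp_id)

result = df1.withColumn("department", lookup_dept(F.col("id")))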
Summary
- right_semi: Use left_semi with swapped DataFrames.
- right_anti: Use left_anti with swapped DataFrames.
Conclusion
Understanding the different join types in PySpark—inner, outer, left, right, semi, anti, and cross—allows you to handle diverse big data use cases effectively. With proper optimization, these joins can perform efficiently even on large datasets.