
PySpark vs SQL: Complete Cheat Sheet for Data Operations

Compare PySpark and SQL commands for common DML operations, Group By, Window Functions, and Filters, and learn how to manipulate and analyze data effectively with either API.

PySpark vs SQL Commands: DML, Group By, Window Functions & Filters

| Operation | PySpark | SQL |
| --- | --- | --- |
| Select Columns | df.select("col1", "col2") | SELECT col1, col2 FROM table_name; |
| Filter Rows | df.filter(df["col1"] > 100) | SELECT * FROM table_name WHERE col1 > 100; |
| Add a Column | df = df.withColumn("new_col", df["col1"] * 2) | ALTER TABLE table_name ADD new_col INT; UPDATE table_name SET new_col = col1 * 2; |
| Drop a Column | df = df.drop("col1") | ALTER TABLE table_name DROP COLUMN col1; |
| Rename a Column | df = df.withColumnRenamed("old_name", "new_name") | ALTER TABLE table_name RENAME COLUMN old_name TO new_name; |
| Update Values | df = df.withColumn("col2", when(df["col1"] > 100, 10).otherwise(df["col2"])) | UPDATE table_name SET col2 = 10 WHERE col1 > 100; |
| Delete Rows | df = df.filter(df["col1"] <= 100) | DELETE FROM table_name WHERE col1 > 100; |
| Insert Data | new_data = spark.createDataFrame([(1, 3), (2, 4)], ["col1", "col2"]); df = df.union(new_data) | INSERT INTO table_name (col1, col2) VALUES (1, 3), (2, 4); |
| Group By | df.groupBy("col1").agg({"col2": "sum"}) | SELECT col1, SUM(col2) FROM table_name GROUP BY col1; |
| Filter After Grouping | df.groupBy("col1").agg({"col2": "sum"}).filter("`sum(col2)` > 100") | SELECT col1, SUM(col2) FROM table_name GROUP BY col1 HAVING SUM(col2) > 100; |
| Window Functions (Row Number) | from pyspark.sql.window import Window; df.withColumn("row_num", row_number().over(Window.partitionBy("col1").orderBy("col2"))) | SELECT col1, ROW_NUMBER() OVER (PARTITION BY col1 ORDER BY col2) AS row_num FROM table_name; |
| Window Functions (Rank) | df.withColumn("rank", rank().over(Window.partitionBy("col1").orderBy("col2"))) | SELECT col1, RANK() OVER (PARTITION BY col1 ORDER BY col2) AS rank FROM table_name; |
| Window Functions (NTILE) | df.withColumn("quartile", ntile(4).over(Window.partitionBy("col1").orderBy("col2"))) | SELECT col1, NTILE(4) OVER (PARTITION BY col1 ORDER BY col2) AS quartile FROM table_name; |
| Aggregate Functions | df.groupBy("col1").agg({"col2": "mean"}) | SELECT col1, AVG(col2) FROM table_name GROUP BY col1; |
| Sorting | df.orderBy("col1", ascending=False) | SELECT * FROM table_name ORDER BY col1 DESC; |
| Distinct Values | df.select("col1").distinct() | SELECT DISTINCT col1 FROM table_name; |
| Count Rows | df.count() | SELECT COUNT(*) FROM table_name; |
| Join (Inner) | df1.join(df2, "col1", "inner") | SELECT * FROM table1 INNER JOIN table2 ON table1.col1 = table2.col1; |
| Join (Left) | df1.join(df2, "col1", "left") | SELECT * FROM table1 LEFT JOIN table2 ON table1.col1 = table2.col1; |
| Join (Right) | df1.join(df2, "col1", "right") | SELECT * FROM table1 RIGHT JOIN table2 ON table1.col1 = table2.col1; |
| Join (Full Outer) | df1.join(df2, "col1", "outer") | SELECT * FROM table1 FULL OUTER JOIN table2 ON table1.col1 = table2.col1; |
| Concatenate DataFrames (keep duplicates) | df1.union(df2) | SELECT * FROM table1 UNION ALL SELECT * FROM table2; |
| Union (distinct rows) | df1.union(df2).distinct() | SELECT * FROM table1 UNION SELECT * FROM table2; |
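
Every PySpark DataFrame can also be queried with plain SQL by registering it as a temporary view, so the two columns of the table above are interchangeable within a single Spark session. The sketch below is a minimal, assumed setup: the column names col1/col2, the view name table_name, and the sample data simply mirror the placeholders used in the cheat sheet. Helpers such as when(), row_number(), rank(), and ntile() referenced in the PySpark column are imported from pyspark.sql.functions.

```python
from pyspark.sql import SparkSession

# Minimal local session; the app name is arbitrary.
spark = SparkSession.builder.appName("pyspark-vs-sql").getOrCreate()

# Toy data matching the col1/col2 placeholders used throughout the cheat sheet.
df = spark.createDataFrame([(50, 1), (150, 2), (250, 3)], ["col1", "col2"])

# Expose the DataFrame as a temporary view so the SQL column of the table applies.
df.createOrReplaceTempView("table_name")

# DataFrame API version: filter rows where col1 > 100.
df_api = df.filter(df["col1"] > 100)

# Equivalent SQL, run through the same session.
df_sql = spark.sql("SELECT * FROM table_name WHERE col1 > 100")

df_api.show()
df_sql.show()
```

Both calls return DataFrames and go through the same Catalyst optimizer, so choosing between the two styles is largely a matter of readability.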

Key Differences Between PySpark and SQL

  1. DML Commands:
    • PySpark utilizes methods like df.select(), df.filter(), df.withColumn(), and df.drop().
    • SQL uses SELECT, INSERT, UPDATE, DELETE, and ALTER commands for data manipulation.
  2. Grouping and Aggregations:
    • In PySpark, use groupBy() and agg() for grouping, then apply filter() to the aggregated result (see the sketch after this list, which also covers window functions).
    • In SQL, use GROUP BY with HAVING to filter aggregated data.
  3. Window Functions:
    • PySpark uses the Window module to apply functions like row_number(), rank(), and ntile().
    • SQL uses OVER(), along with PARTITION BY and ORDER BY to perform window functions.
  4. Joins:
    • Both PySpark and SQL support different types of joins (Inner, Left, Right, Full Outer), but PySpark uses .join() while SQL uses the JOIN keyword.
  5. Concatenation/Union:
    • PySpark uses union() to concatenate DataFrames, while SQL uses UNION or UNION ALL to combine tables.
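
As a concrete illustration of points 2 and 3, the sketch below reuses the hypothetical df and table_name from the setup example above. Note that the aggregate and window helpers are not DataFrame built-ins: sum(), col(), and row_number() come from pyspark.sql.functions, and Window from pyspark.sql.window.

```python
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Grouping with a post-aggregation filter (the DataFrame counterpart of HAVING).
# Aliasing the aggregate avoids the awkward auto-generated column name "sum(col2)".
grouped = (
    df.groupBy("col1")
      .agg(F.sum("col2").alias("total_col2"))
      .filter(F.col("total_col2") > 100)
)

# The same query expressed as SQL against the temporary view.
grouped_sql = spark.sql(
    "SELECT col1, SUM(col2) AS total_col2 "
    "FROM table_name GROUP BY col1 "
    "HAVING SUM(col2) > 100"
)

# Window function: number the rows within each col1 partition, ordered by col2.
window_spec = Window.partitionBy("col1").orderBy("col2")
numbered = df.withColumn("row_num", F.row_number().over(window_spec))

numbered_sql = spark.sql(
    "SELECT col1, col2, "
    "ROW_NUMBER() OVER (PARTITION BY col1 ORDER BY col2) AS row_num "
    "FROM table_name"
)
```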
