
PySpark vs SQL: Complete Cheat Sheet for Data Operations
Compare PySpark and SQL commands for common DML operations, Group By, Window Functions, and Filters, and learn how to manipulate and analyze data effectively with either API.
PySpark vs SQL Commands: DML, Group By, Window Functions & Filters
Operation | PySpark | SQL |
---|---|---|
Select Columns | df.select("col1", "col2") | SELECT col1, col2 FROM table_name; |
Filter Rows | df.filter(df["col1"] > 100) | SELECT * FROM table_name WHERE col1 > 100; |
Add a Column | df = df.withColumn("new_col", df["col1"] * 2) | ALTER TABLE table_name ADD new_col INT; UPDATE table_name SET new_col = col1 * 2; |
Drop a Column | df = df.drop("col1") | ALTER TABLE table_name DROP COLUMN col1; |
Rename a Column | df = df.withColumnRenamed("old_name", "new_name") | ALTER TABLE table_name RENAME COLUMN old_name TO new_name; |
Update Values | df = df.withColumn("col2", when(df["col1"] > 100, 10).otherwise(df["col2"])) | UPDATE table_name SET col2 = 10 WHERE col1 > 100; |
Delete Rows | df = df.filter(df["col1"] <= 100) | DELETE FROM table_name WHERE col1 > 100; |
Insert Data | new_data = spark.createDataFrame([(1, 3), (2, 4)], ["col1", "col2"]); df = df.union(new_data) | INSERT INTO table_name (col1, col2) VALUES (1, 3), (2, 4); |
Group By | df.groupBy("col1").agg({"col2": "sum"}) | SELECT col1, SUM(col2) FROM table_name GROUP BY col1; |
Filter After Grouping | df.groupBy("col1").agg({"col2": "sum"}).filter("sum(col2) > 100") | SELECT col1, SUM(col2) FROM table_name GROUP BY col1 HAVING SUM(col2) > 100; |
Window Functions (Row Number) | from pyspark.sql.functions import row_number; from pyspark.sql.window import Window; df.withColumn("row_num", row_number().over(Window.partitionBy("col1").orderBy("col2"))) | SELECT col1, ROW_NUMBER() OVER (PARTITION BY col1 ORDER BY col2) AS row_num FROM table_name; |
Window Functions (Rank) | df.withColumn("rank", rank().over(Window.partitionBy("col1").orderBy("col2"))) | SELECT col1, RANK() OVER (PARTITION BY col1 ORDER BY col2) AS rank FROM table_name; |
Window Functions (NTILE) | df.withColumn("quartile", ntile(4).over(Window.partitionBy("col1").orderBy("col2"))) | SELECT col1, NTILE(4) OVER (PARTITION BY col1 ORDER BY col2) AS quartile FROM table_name; |
Aggregate Functions | df.groupBy("col1").agg({"col2": "mean"}) | SELECT col1, AVG(col2) FROM table_name GROUP BY col1; |
Sorting | df.orderBy("col1", ascending=False) | SELECT * FROM table_name ORDER BY col1 DESC; |
Distinct Values | df.select("col1").distinct() | SELECT DISTINCT col1 FROM table_name; |
Count | df.count() | SELECT COUNT(*) FROM table_name; |
Join (Inner) | df1.join(df2, "col1", "inner") | SELECT * FROM table1 INNER JOIN table2 ON table1.col1 = table2.col1; |
Join (Left) | df1.join(df2, "col1", "left") | SELECT * FROM table1 LEFT JOIN table2 ON table1.col1 = table2.col1; |
Join (Right) | df1.join(df2, "col1", "right") | SELECT * FROM table1 RIGHT JOIN table2 ON table1.col1 = table2.col1; |
Join (Full Outer) | df1.join(df2, "col1", "outer") | SELECT * FROM table1 FULL OUTER JOIN table2 ON table1.col1 = table2.col1; |
Concatenate DataFrames | df1.union(df2) | SELECT * FROM table1 UNION ALL SELECT * FROM table2; |
Union (deduplicated) | df1.union(df2).distinct() | SELECT * FROM table1 UNION SELECT * FROM table2; |
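
The DataFrame-side commands above can be strung together in a small script for quick experimentation. The sketch below is a minimal, illustrative example: the SparkSession setup, the app name, and the sample col1/col2 values are assumptions, not part of the cheat sheet.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("cheatsheet-demo").getOrCreate()

# Hypothetical sample data using the col1/col2 names from the table above
df = spark.createDataFrame([(1, 50), (2, 150), (2, 80), (3, 200)], ["col1", "col2"])

# SELECT col1, col2 FROM table_name WHERE col1 > 1
df.select("col1", "col2").filter(df["col1"] > 1).show()

# ALTER TABLE ... ADD COLUMN / RENAME COLUMN / DROP COLUMN equivalents
df2 = (df.withColumn("new_col", df["col1"] * 2)         # add a derived column
         .withColumnRenamed("new_col", "doubled_col1")  # rename it
         .drop("doubled_col1"))                         # drop it again

# UPDATE table_name SET col2 = 10 WHERE col1 > 2
df3 = df.withColumn("col2", F.when(df["col1"] > 2, 10).otherwise(df["col2"]))

# GROUP BY col1 HAVING SUM(col2) > 100, written with an explicit alias
(df.groupBy("col1")
   .agg(F.sum("col2").alias("total_col2"))
   .filter(F.col("total_col2") > 100)
   .show())
```

Aliasing the aggregate (total_col2 here) keeps the post-aggregation filter readable instead of referencing the auto-generated sum(col2) column name.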
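
The three window-function rows share one pattern: define a window specification once, then apply each ranking function over it. A short sketch, again with hypothetical sample data:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("window-demo").getOrCreate()

# Hypothetical data: col1 is the partition key, col2 the ordering column
df = spark.createDataFrame(
    [("a", 10), ("a", 20), ("a", 20), ("b", 5), ("b", 15)], ["col1", "col2"]
)

# Equivalent of OVER (PARTITION BY col1 ORDER BY col2)
w = Window.partitionBy("col1").orderBy("col2")

(df.withColumn("row_num", F.row_number().over(w))   # ROW_NUMBER()
   .withColumn("rank", F.rank().over(w))            # RANK()
   .withColumn("quartile", F.ntile(4).over(w))      # NTILE(4)
   .show())
```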
Key Differences Between PySpark and SQL
- DML Commands:
  - PySpark utilizes DataFrame methods like df.select(), df.filter(), df.withColumn(), and df.drop().
  - SQL uses SELECT, INSERT, UPDATE, DELETE, and ALTER commands for data manipulation.
- Grouping and Aggregations:
  - In PySpark, use groupBy() and agg() for grouping, with filters applied post-grouping via filter().
  - In SQL, use GROUP BY with HAVING to filter aggregated data.
- Window Functions:
  - PySpark uses the Window class (from pyspark.sql.window) to apply functions like row_number(), rank(), and ntile().
  - SQL uses OVER(), along with PARTITION BY and ORDER BY, to perform window functions.
- Joins:
  - Both PySpark and SQL support the same join types (inner, left, right, full outer), but PySpark uses .join() while SQL uses the JOIN keyword; see the join sketch after this list.
- Concatenation/Union:
  - PySpark uses union() to concatenate DataFrames, while SQL uses UNION or UNION ALL to combine tables; a short sketch follows below.
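
To make the join and union comparisons concrete, here is a small sketch; the two sample DataFrames, their column names, and the app name are assumptions for illustration only.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("join-demo").getOrCreate()

# Hypothetical tables sharing the join key col1
df1 = spark.createDataFrame([(1, "x"), (2, "y"), (3, "z")], ["col1", "val1"])
df2 = spark.createDataFrame([(2, 20.0), (3, 30.0), (4, 40.0)], ["col1", "val2"])

df1.join(df2, "col1", "inner").show()   # INNER JOIN
df1.join(df2, "col1", "left").show()    # LEFT JOIN
df1.join(df2, "col1", "right").show()   # RIGHT JOIN
df1.join(df2, "col1", "outer").show()   # FULL OUTER JOIN

# union() keeps duplicates, like UNION ALL; add distinct() to mimic SQL's UNION
extra = spark.createDataFrame([(1, "x"), (5, "w")], ["col1", "val1"])
df1.union(extra).show()             # UNION ALL behaviour
df1.union(extra).distinct().show()  # UNION behaviour
```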
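
Finally, both columns of the cheat sheet run on the same engine: a DataFrame can be registered as a temporary view and queried with the SQL syntax shown on the right. A minimal sketch, assuming the same hypothetical col1/col2 data:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("sql-vs-dataframe").getOrCreate()

df = spark.createDataFrame([(1, 50), (2, 150), (2, 80)], ["col1", "col2"])

# Expose the DataFrame to Spark SQL under the table name used in the examples
df.createOrReplaceTempView("table_name")

# SQL version (GROUP BY ... HAVING)
spark.sql("""
    SELECT col1, SUM(col2) AS total_col2
    FROM table_name
    GROUP BY col1
    HAVING SUM(col2) > 100
""").show()

# DataFrame version of the same query
df.groupBy("col1").agg(F.sum("col2").alias("total_col2")) \
  .filter("total_col2 > 100").show()
```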