
PySpark vs SQL: Complete Cheat Sheet for Data Operations

Compare PySpark and SQL commands for common DML operations, Group By, Window Functions, and Filters, and learn how to manipulate and analyze data effectively with either API.

PySpark vs SQL Commands: DML, Group By, Window Functions & Filters

| Operation | PySpark | SQL |
| --- | --- | --- |
| Select Columns | df.select("col1", "col2") | SELECT col1, col2 FROM table_name; |
| Filter Rows | df.filter(df["col1"] > 100) | SELECT * FROM table_name WHERE col1 > 100; |
| Add a Column | df = df.withColumn("new_col", df["col1"] * 2) | ALTER TABLE table_name ADD new_col INT; UPDATE table_name SET new_col = col1 * 2; |
| Drop a Column | df = df.drop("col1") | ALTER TABLE table_name DROP COLUMN col1; |
| Rename a Column | df = df.withColumnRenamed("old_name", "new_name") | ALTER TABLE table_name RENAME COLUMN old_name TO new_name; |
| Update Values | df = df.withColumn("col2", when(df["col1"] > 100, 10).otherwise(df["col2"])) | UPDATE table_name SET col2 = 10 WHERE col1 > 100; |
| Delete Rows | df = df.filter(df["col1"] <= 100) | DELETE FROM table_name WHERE col1 > 100; |
| Insert Data | new_data = spark.createDataFrame([(1, 3), (2, 4)], ["col1", "col2"]); df = df.union(new_data) | INSERT INTO table_name (col1, col2) VALUES (1, 3), (2, 4); |
| Group By | df.groupBy("col1").agg({"col2": "sum"}) | SELECT col1, SUM(col2) FROM table_name GROUP BY col1; |
| Filter After Grouping | df.groupBy("col1").agg({"col2": "sum"}).filter("`sum(col2)` > 100") | SELECT col1, SUM(col2) FROM table_name GROUP BY col1 HAVING SUM(col2) > 100; |
| Window Functions (Row Number) | from pyspark.sql.window import Window; df.withColumn("row_num", row_number().over(Window.partitionBy("col1").orderBy("col2"))) | SELECT col1, ROW_NUMBER() OVER (PARTITION BY col1 ORDER BY col2) AS row_num FROM table_name; |
| Window Functions (Rank) | df.withColumn("rank", rank().over(Window.partitionBy("col1").orderBy("col2"))) | SELECT col1, RANK() OVER (PARTITION BY col1 ORDER BY col2) AS rank FROM table_name; |
| Window Functions (NTILE) | df.withColumn("quartile", ntile(4).over(Window.partitionBy("col1").orderBy("col2"))) | SELECT col1, NTILE(4) OVER (PARTITION BY col1 ORDER BY col2) AS quartile FROM table_name; |
| Aggregate Functions | df.groupBy("col1").agg({"col2": "mean"}) | SELECT col1, AVG(col2) FROM table_name GROUP BY col1; |
| Sorting | df.orderBy("col1", ascending=False) | SELECT * FROM table_name ORDER BY col1 DESC; |
| Distinct Values | df.select("col1").distinct() | SELECT DISTINCT col1 FROM table_name; |
| Count Rows | df.count() | SELECT COUNT(*) FROM table_name; |
| Join (Inner) | df1.join(df2, "col1", "inner") | SELECT * FROM table1 INNER JOIN table2 ON table1.col1 = table2.col1; |
| Join (Left) | df1.join(df2, "col1", "left") | SELECT * FROM table1 LEFT JOIN table2 ON table1.col1 = table2.col1; |
| Join (Right) | df1.join(df2, "col1", "right") | SELECT * FROM table1 RIGHT JOIN table2 ON table1.col1 = table2.col1; |
| Join (Full Outer) | df1.join(df2, "col1", "outer") | SELECT * FROM table1 FULL OUTER JOIN table2 ON table1.col1 = table2.col1; |
| Concatenate DataFrames (keep duplicates) | df1.union(df2) | SELECT * FROM table1 UNION ALL SELECT * FROM table2; |
| Union (distinct rows) | df1.union(df2).distinct() | SELECT * FROM table1 UNION SELECT * FROM table2; |
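
Every PySpark DataFrame can also be queried with plain SQL by registering it as a temporary view, so the two columns of the table above are interchangeable within a single Spark session. The sketch below is a minimal, assumed setup: the column names col1/col2, the view name table_name, and the sample data simply mirror the placeholders used in the cheat sheet. Helpers such as when(), row_number(), rank(), and ntile() referenced in the PySpark column are imported from pyspark.sql.functions.

```python
from pyspark.sql import SparkSession

# Minimal local session; the app name is arbitrary.
spark = SparkSession.builder.appName("pyspark-vs-sql").getOrCreate()

# Toy data matching the col1/col2 placeholders used throughout the cheat sheet.
df = spark.createDataFrame([(50, 1), (150, 2), (250, 3)], ["col1", "col2"])

# Expose the DataFrame as a temporary view so the SQL column of the table applies.
df.createOrReplaceTempView("table_name")

# DataFrame API version: filter rows where col1 > 100.
df_api = df.filter(df["col1"] > 100)

# Equivalent SQL, run through the same session.
df_sql = spark.sql("SELECT * FROM table_name WHERE col1 > 100")

df_api.show()
df_sql.show()
```

Both calls return DataFrames and go through the same Catalyst optimizer, so choosing between the two styles is largely a matter of readability.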

Key Differences Between PySpark and SQL

  1. DML Commands:
    • PySpark utilizes methods like df.select(), df.filter(), df.withColumn(), and df.drop().
    • SQL uses SELECT, INSERT, UPDATE, DELETE, and ALTER commands for data manipulation.
  2. Grouping and Aggregations:
    • In PySpark, use groupBy() and agg() for grouping, then apply filter() to the aggregated result (see the sketch after this list, which also covers window functions).
    • In SQL, use GROUP BY with HAVING to filter aggregated data.
  3. Window Functions:
    • PySpark uses the Window module to apply functions like row_number(), rank(), and ntile().
    • SQL uses OVER(), along with PARTITION BY and ORDER BY to perform window functions.
  4. Joins:
    • Both PySpark and SQL support different types of joins (Inner, Left, Right, Full Outer), but PySpark uses .join() while SQL uses the JOIN keyword.
  5. Concatenation/Union:
    • PySpark uses union() to concatenate DataFrames, while SQL uses UNION or UNION ALL to combine tables.
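
As a concrete illustration of points 2 and 3, the sketch below reuses the hypothetical df and table_name from the setup example above. Note that the aggregate and window helpers are not DataFrame built-ins: sum(), col(), and row_number() come from pyspark.sql.functions, and Window from pyspark.sql.window.

```python
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Grouping with a post-aggregation filter (the DataFrame counterpart of HAVING).
# Aliasing the aggregate avoids the awkward auto-generated column name "sum(col2)".
grouped = (
    df.groupBy("col1")
      .agg(F.sum("col2").alias("total_col2"))
      .filter(F.col("total_col2") > 100)
)

# The same query expressed as SQL against the temporary view.
grouped_sql = spark.sql(
    "SELECT col1, SUM(col2) AS total_col2 "
    "FROM table_name GROUP BY col1 "
    "HAVING SUM(col2) > 100"
)

# Window function: number the rows within each col1 partition, ordered by col2.
window_spec = Window.partitionBy("col1").orderBy("col2")
numbered = df.withColumn("row_num", F.row_number().over(window_spec))

numbered_sql = spark.sql(
    "SELECT col1, col2, "
    "ROW_NUMBER() OVER (PARTITION BY col1 ORDER BY col2) AS row_num "
    "FROM table_name"
)
```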
