SparkContext and SparkSession in Apache Spark

Both SparkContext and SparkSession are foundational elements in Spark, but they serve different purposes and were introduced in different versions of Spark. Let’s break down each one before discussing their differences.


1. SparkContext

What is SparkContext?

SparkContext is the entry point for a Spark application. It is responsible for connecting to the cluster manager (like YARN or Mesos) and coordinating the resources required for executing jobs. It represents the connection to the Spark cluster and is the heart of any Spark application.

Key Functions of SparkContext:

  • Communicates with the cluster manager to allocate resources.
  • Distributes data across the cluster.
  • Creates RDDs (Resilient Distributed Datasets) and manages their transformations and actions.
  • Manages job scheduling and execution.

How to Create a SparkContext?

Before Spark 2.0, SparkContext was the primary entry point for any Spark application. It was usually created directly or through a SparkConf object.

Code Example:

from pyspark import SparkContext, SparkConf

# Create a configuration object
conf = SparkConf().setAppName("ExampleApp").setMaster("local")

# Create a SparkContext
sc = SparkContext(conf=conf)

# Perform operations using the SparkContext
data = [1, 2, 3, 4, 5]
rdd = sc.parallelize(data)
result = rdd.reduce(lambda x, y: x + y)
print("Sum of elements:", result)  # Output: Sum of elements: 15

sc.stop()
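
The distinction between transformations and actions, mentioned in the key functions above, is worth seeing side by side. Below is a minimal, hedged sketch; the data and lambdas are arbitrary examples:

from pyspark import SparkContext, SparkConf

conf = SparkConf().setAppName("RDDDemo").setMaster("local")
sc = SparkContext(conf=conf)

numbers = sc.parallelize(range(1, 11))

# Transformations are lazy: nothing executes yet.
evens = numbers.filter(lambda x: x % 2 == 0)
squared = evens.map(lambda x: x * x)

# An action triggers the actual computation.
print(squared.collect())  # [4, 16, 36, 64, 100]

sc.stop()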

2. SparkSession

What is SparkSession?

Introduced in Spark 2.0, SparkSession is a unified entry point for Spark applications. It encapsulates all the functionality of SparkContext, SQLContext, and HiveContext. It simplifies the Spark programming model by providing a single entry point for all APIs.

Key Features of SparkSession:

  • Provides access to Spark SQL, DataFrame, and Dataset APIs.
  • Encapsulates SparkContext internally, so users don’t need to explicitly create it (see the sketch after the code example below).
  • Supports Hive integration (if enabled) and provides easy access to catalog functions.
  • Facilitates configuration management for Spark applications.

How to Create a SparkSession?

You create a SparkSession through SparkSession.builder. If a session already exists in the application, getOrCreate() returns the existing one.

Code Example:

from pyspark.sql import SparkSession

# Create a SparkSession
spark = SparkSession.builder \
    .appName("ExampleApp") \
    .master("local") \
    .getOrCreate()

# Perform operations using SparkSession
data = [(1, "Alice"), (2, "Bob"), (3, "Charlie")]
df = spark.createDataFrame(data, ["id", "name"])

# Perform DataFrame operations
df.show()
# Output:
# +---+-------+
# | id|   name|
# +---+-------+
# |  1|  Alice|
# |  2|    Bob|
# |  3|Charlie|
# +---+-------+

spark.stop()
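
Because SparkSession wraps a SparkContext, you can still reach the low-level RDD API and the catalog from the same object. A minimal sketch (the temp view created here is just an illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("WrapperDemo").master("local").getOrCreate()

# The underlying SparkContext is exposed for RDD work.
rdd = spark.sparkContext.parallelize([1, 2, 3])
print(rdd.sum())  # 6

# Catalog functions are available directly on the session.
spark.range(3).createOrReplaceTempView("numbers")
print([t.name for t in spark.catalog.listTables()])  # ['numbers']

spark.stop()

The builder chain used to create the session deserves a closer look; the following subsections break down each part.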

1. SparkSession.builder

  • Purpose: builder is an attribute of the SparkSession class that exposes a Builder object used to construct a new SparkSession.
  • Functionality: It sets up the configuration for the Spark application. Think of it as the starting point where you specify settings like the application name, master URL, and other configurations.

2. .appName("ExampleApp")

  • Purpose: Sets the name of the Spark application.
  • Usage: The application name helps identify your Spark job in the Spark UI or cluster logs, making it easier to monitor and debug.
  • Example:
    • If you’re running multiple Spark jobs, having descriptive application names helps differentiate them.

3. .master("local")

  • Purpose: Specifies the master URL for the cluster manager. It determines where the Spark application will run.
  • Options:
    • "local": Runs the Spark application locally on a single machine. The number of threads can be specified, e.g., "local[4]" for 4 threads.
    • "local[*]": Utilizes all available CPU cores on the machine.
    • Cluster URL (e.g., "yarn", "mesos://host:port", or a standalone Spark cluster URL like "spark://hostname:port"): Runs the application on a distributed cluster.
  • In this case: "local" indicates the application will run on a single thread locally, suitable for testing or small-scale jobs.
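
For illustration, a hedged sketch that uses every available local core; the commented lines show other master URL forms (the standalone host and port are placeholders):

from pyspark.sql import SparkSession

# "local[*]" uses every available core on this machine;
# "local[4]" would cap the app at 4 threads, and
# "spark://hostname:7077" (placeholder) would target a standalone cluster.
spark = SparkSession.builder \
    .appName("MasterUrlDemo") \
    .master("local[*]") \
    .getOrCreate()

print(spark.sparkContext.master)  # e.g. local[*]
spark.stop()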

4. .getOrCreate()

  • Purpose:
    • If a SparkSession already exists in the application, it returns the existing session.
    • If no session exists, it creates a new SparkSession based on the configuration specified in the builder.
  • Why this is useful:
    • Ensures that there is always a single SparkSession per application, preventing the creation of multiple sessions which can lead to resource conflicts.
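
A quick, hedged demonstration of this behavior; the identity check is the whole point of the sketch:

from pyspark.sql import SparkSession

first = SparkSession.builder.appName("GetOrCreateDemo").master("local").getOrCreate()
second = SparkSession.builder.getOrCreate()

# Both names refer to the very same session object.
print(first is second)  # True

first.stop()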

Combined Example Breakdown:

spark = SparkSession.builder \
    .appName("ExampleApp") \
    .master("local") \
    .getOrCreate()

  1. Step-by-Step Workflow:
    • The builder initializes the process to create a SparkSession.
    • .appName("ExampleApp") sets the application name to “ExampleApp”.
    • .master("local") specifies that the application will run locally on a single thread.
    • .getOrCreate() ensures the SparkSession is created or retrieves an existing session.
  2. When the Snippet Runs:
    • Spark initializes the application and establishes the necessary environment.
    • If running locally, Spark sets up resources on your machine.
    • If running in a cluster, Spark communicates with the cluster manager to allocate resources.
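
Putting the pieces together, here is a hedged variant of the snippet that caps local parallelism and passes an extra setting through .config(), the builder method for arbitrary configuration keys (the shuffle-partitions value is an illustrative assumption, not a recommendation):

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("ConfiguredApp") \
    .master("local[4]") \
    .config("spark.sql.shuffle.partitions", "8") \
    .getOrCreate()

# Confirm the setting took effect.
print(spark.conf.get("spark.sql.shuffle.partitions"))  # 8
spark.stop()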

Differences Between SparkContext and SparkSession

Feature          | SparkContext                              | SparkSession
-----------------|-------------------------------------------|-----------------------------------------------------------------
Introduced in    | Spark 1.x                                 | Spark 2.0
Purpose          | Entry point for low-level RDD APIs        | Unified entry point for all APIs (RDD, DataFrame, Dataset, SQL)
Ease of use      | Requires explicit creation and management | Simplifies usage by encapsulating SparkContext internally
APIs supported   | Limited to RDD APIs                       | Supports RDD, DataFrame, Dataset, and Spark SQL APIs
Hive integration | Requires a separate HiveContext           | Directly integrated if Hive support is enabled
Configuration    | Configured through SparkConf              | Configured through builder methods
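
To illustrate the Hive integration row, a minimal, hedged sketch. enableHiveSupport() is the builder method for this; it assumes your Spark build ships with the Hive libraries, and the warehouse location shown is an assumption for illustration:

from pyspark.sql import SparkSession

# Requires a Spark build with Hive support on the classpath.
spark = SparkSession.builder \
    .appName("HiveDemo") \
    .master("local") \
    .config("spark.sql.warehouse.dir", "/tmp/spark-warehouse") \
    .enableHiveSupport() \
    .getOrCreate()

spark.sql("SHOW DATABASES").show()
spark.stop()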

Interview Tips and Key Points

  1. Historical Context:
    • Mention that SparkContext was the entry point in Spark 1.x; Spark 2.0 introduced SparkSession as the unified entry point, though the SparkContext still exists underneath and is reachable as spark.sparkContext.
  2. API Coverage:
    • Highlight that SparkSession integrates RDDs, DataFrames, and Datasets into a single unified API.
  3. Example-Driven Explanation:
    • Be prepared to write examples showcasing SparkContext for RDD-based operations and SparkSession for DataFrame-based operations (see the sketch after this list).
  4. Practical Insight:
    • Explain that SparkSession is the preferred way in modern Spark applications due to its simplicity and advanced feature set.
  5. Why use .getOrCreate()?
    • It is useful in shared environments (like notebooks) to avoid creating multiple SparkSession instances.
  6. What happens if .master() is omitted?
    • Spark looks for the master URL in its configuration, e.g., the --master flag of spark-submit or spark-defaults.conf. Interactive shells such as pyspark default to "local[*]", but a standalone script with no master configured anywhere fails with “A master URL must be set in your configuration”.
  7. What’s the role of .appName() in a cluster?
    • The application name is what shows up in the Spark UI, the cluster manager’s application list, and the logs, so a descriptive name makes it much easier to locate and monitor your job among many others.
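
As a concrete illustration for points 2 and 3 above, a hedged sketch that moves between the RDD and DataFrame worlds within one session (the data is arbitrary):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("InteropDemo").master("local").getOrCreate()

# RDD-based work via the session's SparkContext.
rdd = spark.sparkContext.parallelize([(1, "Alice"), (2, "Bob")])

# Promote the RDD to a DataFrame for SQL-style operations...
df = spark.createDataFrame(rdd, ["id", "name"])
df.filter(df.id > 1).show()

# ...and drop back down to an RDD of Rows when needed.
print(df.rdd.map(lambda row: row.name).collect())  # ['Alice', 'Bob']

spark.stop()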

By understanding both concepts thoroughly and being able to illustrate their differences with examples, you’ll be well-prepared for an interview question on this topic!
