PySpark
PySpark is the Python API for Apache Spark: it lets Python developers build Spark applications without leaving Python. It is widely used in big data environments because it pairs a simple interface with Spark's distributed computing engine, which makes it well suited to processing very large datasets. PySpark also integrates well with other big data tools and frameworks, so it fits naturally into many big data ecosystems.
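To make this concrete, here is a minimal sketch of a Spark application driven entirely from Python; the file name, column name, and application name are illustrative assumptions, not anything prescribed by Spark.

```python
from pyspark.sql import SparkSession

# The SparkSession is the entry point to the PySpark API.
spark = SparkSession.builder.appName("MinimalExample").getOrCreate()

# Read a (hypothetical) CSV file into a distributed DataFrame.
df = spark.read.csv("people.csv", header=True, inferSchema=True)

# A simple distributed computation: count rows per value of a column.
df.groupBy("country").count().show()

spark.stop()
```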
PySpark is built on a few fundamental concepts, listed below:
- RDD (Resilient Distributed Dataset): PySpark's core data structure, an immutable, distributed collection of objects that can be processed in parallel (illustrated in the RDD sketch after this list).
- DataFrame: A distributed collection of data organized into named columns; it resembles a DataFrame in R or pandas but is optimized for large-scale, distributed execution.
- Spark SQL: A module for working with structured data that allows data to be queried through both SQL and the DataFrame API (see the DataFrame and Spark SQL sketch after this list).
- Transformations and Actions: Transformations produce a new RDD from an existing one, while actions run a computation on an RDD and return the result to the driver program.
- Lazy Evaluation: Transformations in PySpark are evaluated lazily, meaning no computation takes place until an action is invoked.
- Cluster Manager: PySpark runs on a cluster managed by YARN, Mesos, or Spark's standalone cluster manager, which handles resource allocation and job scheduling (see the configuration sketch at the end of this section).
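The RDD, transformation/action, and lazy-evaluation points can be seen together in a short sketch; the data and variable names are made up for illustration.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("RDDExample").getOrCreate()
sc = spark.sparkContext

# Create an RDD from a local Python collection.
numbers = sc.parallelize([1, 2, 3, 4, 5, 6])

# Transformations (filter, map) are lazy: they only record a lineage of
# new RDDs and trigger no computation yet.
evens_squared = numbers.filter(lambda x: x % 2 == 0).map(lambda x: x * x)

# Actions (collect, count) force evaluation and return results to the driver.
print(evens_squared.collect())  # [4, 16, 36]
print(evens_squared.count())    # 3

spark.stop()
```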
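Similarly, a small sketch of the DataFrame and Spark SQL points, again with a made-up schema and values: the same question is answered once through DataFrame methods and once through a SQL query against a temporary view.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("DataFrameSQLExample").getOrCreate()

# Build a DataFrame from an in-memory list; the schema here is hypothetical.
people = spark.createDataFrame(
    [("Alice", 34), ("Bob", 45), ("Cara", 29)],
    ["name", "age"],
)

# DataFrame API: express the query as method calls.
people.filter(people.age > 30).select("name").show()

# Spark SQL: register the DataFrame as a temporary view and query it with SQL.
people.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age > 30").show()

spark.stop()
```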
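Finally, for the cluster manager point, a session can be pointed at a specific manager and given resource settings when it is built; the master URL and values below are assumptions for illustration only.

```python
from pyspark.sql import SparkSession

# Example only: "yarn" and the resource settings are placeholder values,
# not a recommended configuration.
spark = (
    SparkSession.builder
    .appName("ClusterExample")
    .master("yarn")  # or e.g. "spark://host:7077", "local[*]"
    .config("spark.executor.instances", "4")
    .config("spark.executor.memory", "2g")
    .getOrCreate()
)
```

In practice, the master and resource settings are usually supplied through spark-submit or cluster configuration rather than hard-coded in the application itself.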