Apache Spark
- written in Scala
- in-memory computations (MapReduce performs batch processing only and lacks real-time processing)
- Apart from real-time and batch processing, Spark also supports interactive queries and iterative algorithms
- PySpark lets you work with RDDs in Python (it relies on the Py4j library)
Classes in pyspark
SparkContext:
- the entry point to Spark functionality
- a SparkContext represents the connection to a Spark cluster, and can be used to create RDDs and broadcast variables on that cluster
- uses Py4j to launch a JVM and create a JavaSparkContext (see the sketch below)
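A minimal sketch of creating a SparkContext and parallelizing a small collection (the master URL, app name, and data are arbitrary placeholders):

    from pyspark import SparkContext

    # Connect to a cluster; "local[*]" runs Spark locally on all cores.
    sc = SparkContext(master="local[*]", appName="NotesExample")

    # Distribute a small Python list as an RDD.
    rdd = sc.parallelize([1, 2, 3, 4, 5])
    print(rdd.count())  # 5

    sc.stop()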
RDD:
- resilient distributed dataset
- a collection of elements partitioned across the nodes of the cluster so they can be operated on in parallel
- immutable (cannot be changed)
- operations on RDD:
- Transformation: creates a new RDD (e.g. filter / groupBy / map)
- Action: instructs Spark to perform a computation and send the result back to the driver (see the sketch below)
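A small sketch of the distinction (reusing the sc from the SparkContext sketch; the data is a placeholder). Transformations are lazy; nothing is computed until an action runs:

    rdd = sc.parallelize(["a", "bb", "a", "ccc"])

    # Transformations: describe new RDDs, nothing executes yet.
    lengths = rdd.map(lambda s: (s, len(s)))
    long_only = lengths.filter(lambda kv: kv[1] > 1)

    # Action: triggers the computation and sends results to the driver.
    print(long_only.collect())  # [('bb', 2), ('ccc', 3)]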
DataFrame:
- dataset: a distributed collection of data; a new interface added in Spark 1.6 that provides the benefits of RDDs together with the benefits of Spark SQL's optimized execution engine
- dataframe: a dataset organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood. DataFrames can be constructed from a wide array of sources such as structured data files, tables in Hive, external databases, or existing RDDs.
- an RDD is a distributed collection of Java objects, while a DataFrame is a distributed collection of Row objects; a DataFrame is a special case of Dataset, the main difference being that each record of a Dataset stores a strongly typed value rather than a Row (DataFrame is equivalent to Dataset[Row])
- can be created using various functions in SparkSession (e.g. spark.read.parquet("...")); once created, it can be manipulated using the various domain-specific-language functions
- sql() / .show() / .filter() / .groupBy() / .toPandas() (see the sketch below)
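A minimal sketch of building and querying a DataFrame (column names and data are placeholders; assumes a SparkSession named spark):

    df = spark.createDataFrame(
        [("alice", 34), ("bob", 45), ("carol", 29)],
        ["name", "age"],
    )

    df.show()                          # print the rows as a table
    df.filter(df.age > 30).show()      # transformation + action
    df.groupBy("name").count().show()  # aggregation via the DSL

    # Register the DataFrame as a temp view so it can be queried with SQL.
    df.createOrReplaceTempView("people")
    spark.sql("SELECT name FROM people WHERE age > 30").show()

    pdf = df.toPandas()                # collect to a pandas DataFrame on the driver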
SparkConf:
- Configuration for a Spark application. Used to set various Spark parameters as key-value pairs.
- SparkConf() is used to create a SparkConf object, which will load values from spark.* Java system properties as well
- defined in the pyspark.conf module (see the sketch below)
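A brief sketch of setting parameters through SparkConf (the app name, master URL, and memory value are arbitrary placeholders):

    from pyspark import SparkConf, SparkContext

    conf = (
        SparkConf()
        .setAppName("NotesExample")           # placeholder app name
        .setMaster("local[2]")                # placeholder master URL
        .set("spark.executor.memory", "1g")   # arbitrary key-value parameter
    )

    sc = SparkContext(conf=conf)
    print(sc.getConf().get("spark.executor.memory"))  # "1g"
    sc.stop()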
Modules in pyspark
sql:
- class pyspark.sql.SparkSession(sparkContext, jsparkSession=None): The entry point to programming Spark with the Dataset and DataFrame API.
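In practice a SparkSession is usually obtained through its builder rather than by calling the constructor directly; a minimal sketch (the app name and master URL are placeholders):

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("NotesExample")   # placeholder app name
        .master("local[*]")        # placeholder master URL
        .getOrCreate()
    )

    # The underlying SparkContext is available as an attribute.
    sc = spark.sparkContext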
streaming:
- class pyspark.streaming.StreamingContext(sparkContext, batchDuration=None, jssc=None): Main entry point for Spark Streaming functionality. A StreamingContext represents the connection to a Spark cluster, and can be used to create DStreams from various input sources. It can be created from an existing SparkContext.
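A sketch of a DStream word count over a socket source (the host, port, and 1-second batch duration are arbitrary placeholders; assumes an existing SparkContext sc):

    from pyspark.streaming import StreamingContext

    ssc = StreamingContext(sc, batchDuration=1)   # 1-second micro-batches

    # DStream of text lines read from a TCP socket (placeholder host/port).
    lines = ssc.socketTextStream("localhost", 9999)
    counts = (
        lines.flatMap(lambda line: line.split())
             .map(lambda word: (word, 1))
             .reduceByKey(lambda a, b: a + b)
    )
    counts.pprint()          # print a few elements of each batch

    ssc.start()              # start the streaming computation
    ssc.awaitTermination()   # wait for it to be stopped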
ml:
- PySpark provides two machine learning libraries, mllib and ml; mllib operates on RDDs, while ml operates on DataFrames
- feature module:
- VectorAssembler: merges multiple columns into a vector column
- StandardScaler: standardizes features by scaling to unit standard deviation and optionally removing the mean (see the sketch below)
- clustering module
- classification module
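A short sketch combining the two feature transformers above (column names and data are placeholders; assumes a SparkSession named spark):

    from pyspark.ml.feature import VectorAssembler, StandardScaler

    df = spark.createDataFrame(
        [(1.0, 10.0), (2.0, 20.0), (3.0, 30.0)],
        ["x1", "x2"],
    )

    # Merge the input columns into a single vector column.
    assembler = VectorAssembler(inputCols=["x1", "x2"], outputCol="features")
    assembled = assembler.transform(df)

    # Standardize the vector column (unit std by default; withMean is optional).
    scaler = StandardScaler(inputCol="features", outputCol="scaled", withMean=True)
    scaled = scaler.fit(assembled).transform(assembled)
    scaled.select("features", "scaled").show(truncate=False)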