Apache Spark
- written in Scala
- in-memory computations (MapReduce performs batch processing only and lacks real-time processing)
- Apart from real-time and batch processing, Spark also supports interactive queries and iterative algorithms
- PySpark lets you work with RDDs in Python (it relies on the Py4j library)
Classes in pyspark
SparkContext:
- the entry point to Spark functionality
- a SparkContext represents the connection to a Spark cluster, and can be used to create RDDs and broadcast variables on that cluster
- uses Py4j to launch a JVM and create a JavaSparkContext (see the sketch below)
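A minimal sketch of creating a SparkContext and parallelizing a small collection (the master URL, app name, and data are arbitrary placeholders):

    from pyspark import SparkContext

    # Connect to a cluster; "local[*]" runs Spark locally on all cores.
    sc = SparkContext(master="local[*]", appName="NotesExample")

    # Distribute a small Python list as an RDD.
    rdd = sc.parallelize([1, 2, 3, 4, 5])
    print(rdd.count())  # 5

    sc.stop()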
RDD:
- resilient distributed dataset
- a collection of elements partitioned across the nodes of the cluster so they can be operated on in parallel
- immutable (cannot be changed)
- operations on RDD:
- Transformation: creates a new RDD (e.g. filter / groupBy / map)
- Action: instructs Spark to perform a computation and send the result back to the driver (see the sketch below)
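A small sketch of the distinction (reusing the sc from the SparkContext sketch; the data is a placeholder). Transformations are lazy; nothing is computed until an action runs:

    rdd = sc.parallelize(["a", "bb", "a", "ccc"])

    # Transformations: describe new RDDs, nothing executes yet.
    lengths = rdd.map(lambda s: (s, len(s)))
    long_only = lengths.filter(lambda kv: kv[1] > 1)

    # Action: triggers the computation and sends results to the driver.
    print(long_only.collect())  # [('bb', 2), ('ccc', 3)]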
DataFrame:
- dataset: a distributed collection of data; a new interface added in Spark 1.6 that provides the benefits of RDDs together with the benefits of Spark SQL's optimized execution engine
- dataframe: a dataset organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood. DataFrames can be constructed from a wide array of sources such as structured data files, tables in Hive, external databases, or existing RDDs.
- an RDD is a distributed collection of Java objects, while a DataFrame is a distributed collection of Row objects; a DataFrame is a special case of Dataset, the main difference being that each record of a Dataset stores a strongly typed value rather than a Row (DataFrame is equivalent to Dataset[Row])
- can be created using various functions in SparkSession (e.g. spark.read.parquet("...")); once created, it can be manipulated using the various domain-specific-language functions
- sql() / .show() / .filter() / .groupBy() / .toPandas() (see the sketch below)
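A minimal sketch of building and querying a DataFrame (column names and data are placeholders; assumes a SparkSession named spark):

    df = spark.createDataFrame(
        [("alice", 34), ("bob", 45), ("carol", 29)],
        ["name", "age"],
    )

    df.show()                          # print the rows as a table
    df.filter(df.age > 30).show()      # transformation + action
    df.groupBy("name").count().show()  # aggregation via the DSL

    # Register the DataFrame as a temp view so it can be queried with SQL.
    df.createOrReplaceTempView("people")
    spark.sql("SELECT name FROM people WHERE age > 30").show()

    pdf = df.toPandas()                # collect to a pandas DataFrame on the driver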
SparkConf:
- Configuration for a Spark application. Used to set various Spark parameters as key-value pairs.
- SparkConf() is used to create a SparkConf object, which will load values from spark.* Java system properties as well
- defined in the pyspark.conf module (see the sketch below)
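A brief sketch of setting parameters through SparkConf (the app name, master URL, and memory value are arbitrary placeholders):

    from pyspark import SparkConf, SparkContext

    conf = (
        SparkConf()
        .setAppName("NotesExample")           # placeholder app name
        .setMaster("local[2]")                # placeholder master URL
        .set("spark.executor.memory", "1g")   # arbitrary key-value parameter
    )

    sc = SparkContext(conf=conf)
    print(sc.getConf().get("spark.executor.memory"))  # "1g"
    sc.stop()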
Modules in pyspark
sql:
- class pyspark.sql.SparkSession(sparkContext, jsparkSession=None): The entry point to programming Spark with the Dataset and DataFrame API.
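In practice a SparkSession is usually obtained through its builder rather than by calling the constructor directly; a minimal sketch (the app name and master URL are placeholders):

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("NotesExample")   # placeholder app name
        .master("local[*]")        # placeholder master URL
        .getOrCreate()
    )

    # The underlying SparkContext is available as an attribute.
    sc = spark.sparkContext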
streaming:
- class pyspark.streaming.StreamingContext(sparkContext, batchDuration=None, jssc=None): Main entry point for Spark Streaming functionality. A StreamingContext represents the connection to a Spark cluster, and can be used to create DStreams from various input sources. It can be created from an existing SparkContext.
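A sketch of a DStream word count over a socket source (the host, port, and 1-second batch duration are arbitrary placeholders; assumes an existing SparkContext sc):

    from pyspark.streaming import StreamingContext

    ssc = StreamingContext(sc, batchDuration=1)   # 1-second micro-batches

    # DStream of text lines read from a TCP socket (placeholder host/port).
    lines = ssc.socketTextStream("localhost", 9999)
    counts = (
        lines.flatMap(lambda line: line.split())
             .map(lambda word: (word, 1))
             .reduceByKey(lambda a, b: a + b)
    )
    counts.pprint()          # print a few elements of each batch

    ssc.start()              # start the streaming computation
    ssc.awaitTermination()   # wait for it to be stopped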
ml:
- PySpark provides two machine learning libraries, mllib and ml; mllib operates on RDDs, while ml operates on DataFrames
- feature module:
- VectorAssembler: merges multiple columns into a vector column
- StandardScaler: standardizes features by scaling to unit standard deviation and optionally removing the mean (see the sketch below)
- clustering module
- classification module
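A short sketch combining the two feature transformers above (column names and data are placeholders; assumes a SparkSession named spark):

    from pyspark.ml.feature import VectorAssembler, StandardScaler

    df = spark.createDataFrame(
        [(1.0, 10.0), (2.0, 20.0), (3.0, 30.0)],
        ["x1", "x2"],
    )

    # Merge the input columns into a single vector column.
    assembler = VectorAssembler(inputCols=["x1", "x2"], outputCol="features")
    assembled = assembler.transform(df)

    # Standardize the vector column (unit std by default; withMean is optional).
    scaler = StandardScaler(inputCol="features", outputCol="scaled", withMean=True)
    scaled = scaler.fit(assembled).transform(assembled)
    scaled.select("features", "scaled").show(truncate=False)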