PySpark (Spark's Python API)

Apache Spark

  • written in Scala
  • in-memory computation (MapReduce performed batch processing only and lacked real-time processing)
  • apart from real-time and batch processing, Spark also supports interactive queries and iterative algorithms
  • PySpark allows you to work with RDDs in Python (via a library called Py4j)

Classes in pyspark

SparkContext:

  • the entry point to Spark functionality
  • a SparkContext represents the connection to a Spark cluster, and can be used to create RDDs and broadcast variables on that cluster
  • uses Py4j to launch a JVM and create a JavaSparkContext (see the sketch below)
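
A minimal sketch of creating and using a SparkContext; the app name, master URL, and broadcast contents are illustrative only, and a local Spark installation with pyspark on the Python path is assumed.

```python
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("sparkcontext-demo").setMaster("local[2]")
sc = SparkContext(conf=conf)        # launches a JVM via Py4j behind the scenes

print(sc.version)                   # Spark version the context is connected to
print(sc.master)                    # e.g. local[2]

lookup = sc.broadcast({"a": 1, "b": 2})   # broadcast variable created on the cluster
print(lookup.value["a"])

sc.stop()
```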

RDD:

  • Resilient Distributed Dataset
  • a collection of elements partitioned across the nodes of the cluster so that they can be operated on in parallel
  • immutable (cannot be changed once created)
  • operations on an RDD (illustrated below):
    • Transformation: creates a new RDD (filter / groupBy / map)
    • Action: instructs Spark to perform the computation and send the result back to the driver
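
A minimal sketch of the transformation/action distinction, assuming an existing SparkContext `sc` (e.g. the one created above); the input numbers are arbitrary.

```python
nums = sc.parallelize([1, 2, 3, 4, 5])           # create an RDD from a local list

squared = nums.map(lambda x: x * x)              # transformation: lazy, returns a new RDD
evens = squared.filter(lambda x: x % 2 == 0)     # another transformation, still lazy

print(evens.collect())    # action: triggers computation, returns [4, 16] to the driver
print(squared.count())    # action: returns 5
```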

DataFrame:

  • dataset: a distributed collection of data / a new interface added in Spark 1.6 that provides the benefits of RDDs together with the benefits of Spark SQL’s optimized execution engine

  • dataframe: a dataset organised into named columns, i.e. a distributed collection of data organised by named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood. DataFrames can be constructed from a wide array of sources such as: structured data files, tables in Hive, external databases, or existing RDDs.

  • an RDD is a distributed collection of Java objects, while a DataFrame is a distributed collection of Row objects. A DataFrame is a special case of Dataset in which each record is a Row rather than a strongly typed value (DataFrame is equivalent to Dataset[Row])

  • can be created using various functions in SparkSession (e.g. spark.read.parquet("...")); once created, it can be manipulated using the various domain-specific-language (DSL) functions

  • common operations: spark.sql() / .show() / .filter() / .groupBy() / .toPandas() (see the sketch below)
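
A minimal sketch of the DataFrame API, assuming an existing SparkSession `spark`; the column names and rows are made up for illustration (.toPandas() additionally requires pandas on the driver).

```python
df = spark.createDataFrame(
    [("alice", 34), ("bob", 45), ("carol", 29)],
    ["name", "age"],
)

df.show()                              # print the first rows as a table
df.filter(df.age > 30).show()          # DSL-style filtering
df.groupBy("name").count().show()      # grouping and aggregation

df.createOrReplaceTempView("people")   # register the DataFrame for SQL queries
spark.sql("SELECT name FROM people WHERE age > 30").show()

pdf = df.toPandas()                    # collect the result into a pandas DataFrame on the driver
```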

SparkConf:

  • configuration for a Spark application, used to set various Spark parameters as key-value pairs
  • SparkConf() creates a SparkConf object, which also loads values from spark.* Java system properties
  • defined in the pyspark.conf module (see the sketch below)
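
A minimal sketch of setting and reading back configuration values; the application name, master URL, and executor memory are illustrative only.

```python
from pyspark.conf import SparkConf     # also importable as `from pyspark import SparkConf`

conf = (SparkConf()
        .setAppName("conf-demo")
        .setMaster("local[*]")
        .set("spark.executor.memory", "1g"))   # arbitrary key-value pair

print(conf.get("spark.app.name"))      # "conf-demo"
print(conf.toDebugString())            # all settings, including spark.* system properties
```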

Modules in pyspark

sql:

  • class pyspark.sql.SparkSession(sparkContext, jsparkSession=None): the entry point to programming Spark with the Dataset and DataFrame API
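
In practice a SparkSession is usually obtained through its builder rather than by calling the constructor directly; a minimal sketch (the app name and master URL are illustrative):

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("sql-demo")
         .master("local[2]")
         .getOrCreate())

df = spark.range(5)          # small DataFrame with a single `id` column
df.show()

print(spark.sparkContext)    # the underlying SparkContext
spark.stop()
```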

streaming:

  • class pyspark.streaming.StreamingContext(sparkContext, batchDuration=None, jssc=None): the main entry point for Spark Streaming functionality. A StreamingContext represents the connection to a Spark cluster, and can be used to create DStreams from various input sources. It can be created from an existing SparkContext.
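
A minimal word-count sketch over a socket source, assuming something is writing text to localhost:9999 (e.g. `nc -lk 9999`); the 1-second batch interval is arbitrary.

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "streaming-demo")
ssc = StreamingContext(sc, batchDuration=1)       # created from an existing SparkContext

lines = ssc.socketTextStream("localhost", 9999)   # DStream from a socket input source
counts = (lines.flatMap(lambda line: line.split(" "))
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()                                   # print each batch's counts

ssc.start()                                       # start receiving and processing data
ssc.awaitTermination()                            # block until the stream is stopped
```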

ml:

PySpark provides two machine learning libraries, mllib and ml: mllib operates on RDDs, while ml operates on DataFrames (a small feature-pipeline sketch follows the list below).

  • feature module:
    • VectorAssembler: merges multiple columns into a vector column
    • StandardScaler
  • clustering module
  • classification module
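
A minimal feature-pipeline sketch combining the two feature transformers above, assuming an existing SparkSession `spark`; the column names and values are illustrative only.

```python
from pyspark.ml.feature import VectorAssembler, StandardScaler

df = spark.createDataFrame(
    [(1.0, 0.5, 10.0), (2.0, 1.5, 20.0), (3.0, 2.5, 30.0)],
    ["x1", "x2", "x3"],
)

# merge the numeric columns into a single vector column
assembler = VectorAssembler(inputCols=["x1", "x2", "x3"], outputCol="features")
assembled = assembler.transform(df)

# standardize the vector column to zero mean and unit variance
scaler = StandardScaler(inputCol="features", outputCol="scaled",
                        withMean=True, withStd=True)
scaled = scaler.fit(assembled).transform(assembled)

scaled.select("features", "scaled").show(truncate=False)
```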