Getting Started with PySpark

The RDD is PySpark's basic abstraction: an immutable, distributed dataset that supports operations such as map and filter. An RDD carries partitions and dependency information, which is what makes it resilient. SparkContext is the main entry point and connects to the Spark cluster. RDD operations fall into transformations (evaluated lazily) and actions (which trigger computation). Common operations such as map, filter, and reduceByKey cover scenarios like word counting and computing an average age. In production, jobs usually run on YARN or in Standalone mode.
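A minimal sketch of the word-count scenario mentioned above, run locally for illustration; the input file name words.txt is a placeholder:

```python
from pyspark import SparkConf, SparkContext

# SparkContext is the entry point; "local[*]" uses all local cores.
conf = SparkConf().setAppName("wordcount").setMaster("local[*]")
sc = SparkContext(conf=conf)

lines = sc.textFile("words.txt")                       # transformation: lazy, nothing runs yet
counts = (lines.flatMap(lambda line: line.split())     # split each line into words
               .map(lambda word: (word, 1))            # pair each word with a count of 1
               .reduceByKey(lambda a, b: a + b))       # sum the counts per word
print(counts.collect())                                # action: triggers the actual computation
```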

RDD (Resilient Distributed Dataset)

Resilient: lost nodes and lost pieces of the dataset can be recovered.

Distributed: the data is spread across, and processed on, different nodes.

Dataset: a collection of data.

The RDD is Spark's most basic abstraction. It is immutable: once created, an RDD never changes, and every operation produces a new RDD. Because computation is parallel, modifying data in place would force the nodes to waste time on synchronization and communication.

It can be split into partitions, and the resulting partitions can be computed in parallel.

Supported operations include map, filter, and persist.
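A small sketch of immutability, reusing the `sc` from the word-count sketch above: map and filter never modify the original RDD, they each return a new one, and persist only marks an RDD for caching.

```python
rdd1 = sc.parallelize([1, 2, 3, 4, 5])
rdd2 = rdd1.map(lambda x: x * 10)       # new RDD; rdd1 is left untouched
rdd3 = rdd2.filter(lambda x: x > 20)    # another new RDD
rdd3.persist()                          # keep rdd3 in memory after its first computation
print(rdd1.collect())                   # [1, 2, 3, 4, 5]
print(rdd3.collect())                   # [30, 40, 50]
```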

PairRDDFunctions

DoubleRDDFunctions

SequenceFileRDDFunctions
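These three are classes from the Scala API, where extra operations for key-value, numeric, and SequenceFile RDDs are added through implicit conversions. In PySpark the corresponding methods are exposed directly on the RDD object, as this sketch (reusing `sc`) illustrates:

```python
pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])
print(pairs.reduceByKey(lambda a, b: a + b).collect())   # key-value operations, e.g. [('a', 4), ('b', 2)]

nums = sc.parallelize([1.0, 2.0, 3.0])
print(nums.mean(), nums.stdev())                         # numeric (double) operations
```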

RDD properties (a classic interview topic):

  • A list of partitions: the data can be divided into splits, e.g. split1 and split2.
  • A function for computing each split: the same function is applied to every split; the programmer does not need to know how the data was divided.
  • A list of dependencies on other RDDs: rdd1 -> rdd2 -> rdd3 -> rdd4. Each RDD keeps a record of its dependencies, so when data is lost Spark can trace back to the parent RDD and recompute it; this lineage is the foundation of resilience.
  • Optionally, a Partitioner for key-value RDDs.
  • Optionally, a list of preferred locations: moving computation is cheaper than moving data; there can be several locations because the data itself is stored in multiple replicas.

One partition (split) corresponds to one task.
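A sketch of how these properties can be inspected from PySpark (reusing `sc`); note that toDebugString returns bytes in PySpark, hence the decode:

```python
rdd = sc.parallelize(range(8), 4)          # ask for 4 partitions -> 4 tasks in this stage
print(rdd.getNumPartitions())              # 4
print(rdd.glom().collect())                # shows which elements ended up in which partition

grouped = rdd.map(lambda x: (x % 2, x)).reduceByKey(lambda a, b: a + b)
print(grouped.toDebugString().decode())    # the dependency lineage: grouped depends on rdd
print(grouped.partitioner)                 # the Partitioner of the key-value RDD
```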
