I. RDD
RDD is short for Resilient Distributed Datasets.
An RDD is split into partitions distributed across the nodes of the cluster.
For a closer look at how RDDs are partitioned, I have written a separate post with worked examples on partition-related operations.
RDDs can be created from any storage source supported by Hadoop, for example:
HDFS, HBase, Cassandra, Amazon S3
1. RDD Transformations
- Create a new RDD from the current one
- Lazily evaluated: the results are only computed when evaluated by actions
For example, map()
is a transformation: it applies a function to each element of one RDD to produce a new RDD.
2. RDD Actions
Actions return a value to the driver program after running a computation.
For example, reduce()
is an action that aggregates all the elements of an RDD.
3. DAG
DAG stands for Directed Acyclic Graph.
Spark relies on DAGs for fault tolerance: each RDD records the graph of transformations (its lineage) that produced it, so when a node fails, Spark replays that DAG to recompute the lost partitions.
II. Basic Operations
- Create a SparkContext and a SparkSession
- Create RDDs
- Use DataFrames and Spark SQL
Setup
# Installing required packages
!pip install pyspark
!pip install findspark
import findspark
findspark.init()
# PySpark is the Spark API for Python. In this lab, we use PySpark to initialize the spark context.
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession
SparkContext
is the entry point of a Spark application; it provides functions such as parallelize() for creating RDDs.
SparkSession
is required for Spark SQL and DataFrame operations.
Create instances of SparkContext and SparkSession:
# Creating a spark context class
sc = SparkContext()
# Creating a spark session
spark = SparkSession \
.builder \
.appName("Python Spark DataFrames basic example") \
.config("spark.some.config.option", "some-value") \
.getOrCreate()
Create an RDD and apply transformations:
# create an RDD containing the integers 1 through 29
data = range(1, 30)
xrangeRDD = sc.parallelize(data, 4)
# transformations
subRDD = xrangeRDD.map(lambda x: x - 1)
filteredRDD = subRDD.filter(lambda x: x < 10)
Create a DataFrame, query the data in several ways, and finally close the session.
# Read the dataset into a spark dataframe using the `read.json()` function
df = spark.read.json("***.json").cache()
# Print the dataframe as well as the data schema
df.show()
df.printSchema()
# Register the DataFrame as a SQL temporary view
df.createTempView("people")
# Select and show basic data columns
df.select("name").show()
df.select(df["name"]).show()
spark.sql("SELECT name FROM people").show()
# Perform basic filtering
df.filter(df["age"] > 21).show()
spark.sql("SELECT age, name FROM people WHERE age > 21").show()
# Operate on a single column: create a new column 'old' whose value is 3x age
df.withColumn('old', df['age']*3).show()
# Perform basic aggregation of data
df.groupBy("age").count().show()
spark.sql("SELECT age, COUNT(age) as count FROM people GROUP BY age").show()
# close the SparkSession
spark.stop()