PySpark: common Spark commands & operations (simple examples)

A quick-start PySpark tutorial. Reference:

https://sparkbyexamples.com/pyspark/pyspark-read-csv-file-into-dataframe/

 

Before running heavy computations, call .cache() on the DataFrame variable to speed up subsequent work.

When not using Jupyter, run spark.stop() to shut down Spark when you are done.
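A minimal sketch of both, assuming df is an existing DataFrame and spark is the active SparkSession:

df.cache()     # mark df for in-memory caching (materialized by the next action)
df.count()     # any action triggers the actual caching
spark.stop()   # in a standalone script, release Spark resources when finished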

1. While code is running, use the Spark UI web page to check the current progress.

 

2. Reading and writing Parquet files as DataFrames

 

Read a Parquet file into a DataFrame:


parDF=spark.read.parquet("/tmp/output/people.parquet")

Write a DataFrame to Parquet format:


df.write.parquet("/tmp/output/people.parquet")

Append to or overwrite an existing Parquet file:


df.write.mode('append').parquet("/tmp/output/people.parquet")

df.write.mode('overwrite').parquet("/tmp/output/people.parquet")
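If partitioned output is needed, a partitionBy() call can be inserted before parquet(); a minimal sketch, assuming the DataFrame has a state column (the output path is only an example):

df.write.mode('overwrite').partitionBy("state").parquet("/tmp/output/people_by_state.parquet")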

3. DataFrame operations: pyspark.sql basics

 

Initialize a DataFrame (df):


import pyspark

from pyspark.sql import SparkSession

 

spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()

data = [("James","Smith","USA","CA"),

    ("Michael","Rose","USA","NY"),

    ("Robert","Williams","USA","CA"),

    ("Maria","Jones","USA","FL")

  ]

columns = ["firstname","lastname","country","state"]

df = spark.createDataFrame(data = data, schema = columns)

df.show(truncate=False)

 

Basic pyspark.sql functions

1. PySpark show(): display the current DataFrame (e.g. in a Jupyter notebook)
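A minimal sketch of common show() variants, using the df created above:

df.show()                  # first 20 rows, long values truncated
df.show(2)                 # only the first 2 rows
df.show(truncate=False)    # do not truncate long column values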

 

2. PySpark select(): select one or more columns


df.select("firstname","lastname").show()

df.select(df.firstname,df.lastname).show()

df.select(df["firstname"],df["lastname"]).show()

 

+---------+--------+

|firstname|lastname|

+---------+--------+

|    James|   Smith|

|  Michael|    Rose|

|   Robert|Williams|

|    Maria|   Jones|

+---------+--------+

 

3. PySpark withColumn(): change values, cast column data types, create new columns, etc.

Cast a column's data type (these withColumn examples assume a DataFrame with a salary column):


from pyspark.sql.functions import col

# Cast the salary column from String to Integer
df.withColumn("salary", col("salary").cast("Integer")).show()

Update an existing column's value: multiply salary by 100 and write the result back to the salary column:


df.withColumn("salary",col("salary")*100).show()

Create a column from an existing one: this snippet creates a new column CopiedColumn by multiplying the salary column by -1:


df.withColumn("CopiedColumn",col("salary")* -1).show()

Add a new column (lit() creates a constant/literal column):


df.withColumn("Country", lit("USA")).show()

df.withColumn("Country", lit("USA")) \

  .withColumn("anotherColumn",lit("anotherValue")) \

  .show()

 

4. PySpark withColumnRenamed(): rename columns (these examples assume a DataFrame with gender, dob and salary columns)


df.withColumnRenamed("gender","sex") \

  .show(truncate=False)

 

df2 = df.withColumnRenamed("dob","DateOfBirth") \

    .withColumnRenamed("salary","salary_amount")

 

5. PySpark where() & filter(): filter rows


# Using equals condition

df.filter(df.state == "OH").show(truncate=False)

 

+----------------------+------------------+-----+------+

|name                  |languages         |state|gender|

+----------------------+------------------+-----+------+

|[James, , Smith]      |[Java, Scala, C++]|OH   |M     |

|[Julia, , Williams]   |[CSharp, VB]      |OH   |F     |

|[Mike, Mary, Williams]|[Python, VB]      |OH   |M     |

+----------------------+------------------+-----+------+

 

# Not equals condition

df.filter(df.state != "OH") \

    .show(truncate=False)

df.filter(~(df.state == "OH")) \

    .show(truncate=False)

 

# Filter on multiple conditions

df.filter( (df.state  == "OH") & (df.gender  == "M") ) \

    .show(truncate=False)

+----------------------+------------------+-----+------+

|name                  |languages         |state|gender|

+----------------------+------------------+-----+------+

|[James, , Smith]      |[Java, Scala, C++]|OH   |M     |

|[Mike, Mary, Williams]|[Python, VB]      |OH   |M     |

+----------------------+------------------+-----+------+

 

# Using startswith

df.filter(df.state.startswith("N")).show()

+--------------------+------------------+-----+------+

|                name|         languages|state|gender|

+--------------------+------------------+-----+------+

|      [Anna, Rose, ]|[Spark, Java, C++]|   NY|     F|

|[Maria, Anne, Jones]|      [CSharp, VB]|   NY|     M|

|  [Jen, Mary, Brown]|      [CSharp, VB]|   NY|     M|

+--------------------+------------------+-----+------+

 

# Using endswith

df.filter(df.state.endswith("H")).show()

 

# Using contains

df.filter(df.state.contains("H")).show()

 

6. PySpark drop() & dropDuplicates() & distinct(): drop columns and remove duplicate rows
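A minimal sketch of the three, using the df created above:

df.drop("state").show()                            # drop a column
df.distinct().show()                               # remove rows that are complete duplicates
df.dropDuplicates(["country", "state"]).show()     # remove rows duplicated on the given columns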

 

7. PySpark orderBy() and sort(): sort rows


# By default, sorting is in ascending order

df.sort("department","state").show(truncate=False)

df.orderBy("department","state").show(truncate=False)
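Descending order requires column expressions; a minimal sketch, assuming the same department and state columns:

from pyspark.sql.functions import col

df.sort(col("department").asc(), col("state").desc()).show(truncate=False)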

 

8. PySpark groupBy(): group rows and aggregate


+-------------+----------+-----+------+---+-----+
|employee_name|department|state|salary|age|bonus|
+-------------+----------+-----+------+---+-----+
|        James|     Sales|   NY| 90000| 34|10000|
|      Michael|     Sales|   NY| 86000| 56|20000|
|       Robert|     Sales|   CA| 81000| 30|23000|
|        Maria|   Finance|   CA| 90000| 24|23000|
|        Raman|   Finance|   CA| 99000| 40|24000|
|        Scott|   Finance|   NY| 83000| 36|19000|
|          Jen|   Finance|   NY| 79000| 53|15000|
|         Jeff| Marketing|   CA| 80000| 25|18000|
|        Kumar| Marketing|   NY| 91000| 50|21000|
+-------------+----------+-----+------+---+-----+

 

df.groupBy("department").sum("salary").show(truncate=False)

+----------+-----------+

|department|sum(salary)|

+----------+-----------+

|Sales     |257000     |

|Finance   |351000     |

|Marketing |171000     |

+----------+-----------+
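Multiple aggregations can be combined with agg(); a minimal sketch using the salary and bonus columns shown above:

from pyspark.sql.functions import sum, avg, max

df.groupBy("department") \
    .agg(sum("salary").alias("sum_salary"),
         avg("salary").alias("avg_salary"),
         max("bonus").alias("max_bonus")) \
    .show(truncate=False)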

9. PySpark join(): join two DataFrames


empDF.join(deptDF,empDF.emp_dept_id ==  deptDF.dept_id,"inner").show()

 

empDF.join(deptDF,"dept_id","inner").show()

10. PySpark union(): merge two DataFrames


The two DataFrames to be merged (df and df2):

+-------------+----------+-----+------+---+-----+

|employee_name|department|state|salary|age|bonus|

+-------------+----------+-----+------+---+-----+

|James        |Sales     |NY   |90000 |34 |10000|

|Michael      |Sales     |NY   |86000 |56 |20000|

|Robert       |Sales     |CA   |81000 |30 |23000|

|Maria        |Finance   |CA   |90000 |24 |23000|

+-------------+----------+-----+------+---+-----+

 

+-------------+----------+-----+------+---+-----+

|employee_name|department|state|salary|age|bonus|

+-------------+----------+-----+------+---+-----+

|James        |Sales     |NY   |90000 |34 |10000|

|Maria        |Finance   |CA   |90000 |24 |23000|

|Jen          |Finance   |NY   |79000 |53 |15000|

|Jeff         |Marketing |CA   |80000 |25 |18000|

|Kumar        |Marketing |NY   |91000 |50 |21000|

+-------------+----------+-----+------+---+-----+
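A plain union keeps duplicate rows; a minimal sketch, assuming df and df2 are the two DataFrames shown above:

unionDF = df.union(df2)
unionDF.show(truncate=False)   # rows present in both DataFrames appear twice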

Union without duplicates:

 

disDF = df.union(df2).distinct()

disDF.show(truncate=False)

Note: union() does not merge by column name; it merges by position. The column names of the two DataFrames can differ, but the columns at the same position will be combined.

Therefore, when merging two tables whose identically named columns sit in different positions, use unionByName().
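A minimal sketch of unionByName(), assuming df and df2 share the same column names (possibly in a different order):

df.unionByName(df2).show(truncate=False)
# Since Spark 3.1, columns missing from one side can be filled with nulls:
# df.unionByName(df2, allowMissingColumns=True).show(truncate=False)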
