Apache Spark DataFrames入门指南:操作DataFrame

  本系列文章翻译自:《scala data analysis cookbook》第二章:Getting Started with Apache  Spark DataFrames。原书是基于 Spark 1.4.1编写的,我这里使用的是Spark 1.6.0,丢弃了一些已经标记为遗弃的函数。并且修正了其中的错误。
  一、从csv文件创建DataFrame
    如何做?
    如何工作的
    附录
  二、操作DataFrame
    打印DataFrame里面的模式
    对DataFrame里面的数据进行采样
    查询DataFrame里面的列
    根据条件过滤数据
    对DataFrame里面的数据进行排序
    对列进行重命名
    将DataFrame看作是关系型数据表
    对两个DataFrame进行Join操作
    将DataFrame保存成文件
三、从Scala case class中创建DataFrame
    如何做?
    如何工作的
    附录

二、操作DataFrame

  在前面的文章中,我们介绍了如何创建DataFrame。本文将介绍如何操作DataFrame里面的数据和打印出DataFrame里面数据的模式

打印DataFrame里面的模式

  在创建完DataFrame之后,我们一般都会查看里面数据的模式,我们可以通过printSchema函数来查看。它会打印出列的名称和类型:

students.printSchema
root
  |-- id : string (nullable = true )
  |-- studentName : string (nullable = true )
  |-- phone : string (nullable = true )
  |-- email : string (nullable = true )

如果采用的是load方式参见DataFrame的,students.printSchema的输出则如下:

root
  |-- id|studentName|phone|email : string (nullable = true )

对DataFrame里面的数据进行采样

  打印完模式之后,我们要做的第二件事就是看看加载进DataFrame里面的数据是否正确。从新创建的DataFrame里面采样数据的方法有很多种。我们来对其进行介绍。

  最简单的就是使用show方法,show方法有四个版本:
  (1)、第一个需要我们指定采样的行数def show(numRows: Int);
  (2)、第二种不需要我们指定任何参数,这种情况下,show函数默认会加载出20行的数据def show();
  (3)、第三种需要指定一个boolean值,这个值说明是否需要对超过20个字符的列进行截取def show(truncate: Boolean);
  (4)、最后一种需要指定采样的行和是否需要对列进行截断def show(numRows: Int, truncate: Boolean)。实际上,前三个函数都是调用这个函数实现的。

  Show函数和其他函数不同的地方在于其不仅会显示需要打印的行,而且还会打印出头信息,并且会直接在默认的输出流打出(console)。来看看怎么使用吧:

students.show()  //打印出20行
+---+-----------+--------------+--------------------+
| id|studentName|         phone|               email|
+---+-----------+--------------+--------------------+
1 |      Burke| 1 - 300 - 746 - 8446 |ullamcorper.velit...|
2 |      Kamal| 1 - 668 - 571 - 5046 |pede.Suspendisse @ ...|
3 |       Olga| 1 - 956 - 311 - 1686 |Aenean.eget.metus...|
4 |      Belle| 1 - 246 - 894 - 6340 |vitae.aliquet.nec...|
5 |     Trevor| 1 - 300 - 527 - 4967 |dapibus.id @ acturp...|
6 |     Laurel| 1 - 691 - 379 - 9921 |adipiscing @ consec...|
7 |       Sara| 1 - 608 - 140 - 1995 |Donec.nibh @ enimEt...|
8 |     Kaseem| 1 - 881 - 586 - 2689 |cursus.et.magna @ e...|
9 |        Lev| 1 - 916 - 367 - 5608 |Vivamus.nisi @ ipsu...|
| 10 |       Maya| 1 - 271 - 683 - 2698 |accumsan.convalli...|
| 11 |        Emi| 1 - 467 - 270 - 1337 |        est @ nunc.com|
| 12 |      Caleb| 1 - 683 - 212 - 0896 |Suspendisse @ Quisq...|
| 13 |   Florence| 1 - 603 - 575 - 2444 |sit.amet.dapibus @ ...|
| 14 |      Anika| 1 - 856 - 828 - 7883 |euismod @ ligulaeli...|
| 15 |      Tarik| 1 - 398 - 171 - 2268 |turpis @ felisorci.com|
| 16 |      Amena| 1 - 878 - 250 - 3129 |lorem.luctus.ut @ s...|
| 17 |    Blossom| 1 - 154 - 406 - 9596 |Nunc.commodo.auct...|
| 18 |        Guy| 1 - 869 - 521 - 3230 |senectus.et.netus...|
| 19 |    Malachi| 1 - 608 - 637 - 2772 |Proin.mi.Aliquam @ ...|
| 20 |     Edward| 1 - 711 - 710 - 6552 |lectus @ aliquetlib...|
+---+-----------+--------------+--------------------+
only showing top 20 rows
students.show( 15 )
+---+-----------+--------------+--------------------+
| id|studentName|         phone|               email|
+---+-----------+--------------+--------------------+
1 |      Burke| 1 - 300 - 746 - 8446 |ullamcorper.velit...|
2 |      Kamal| 1 - 668 - 571 - 5046 |pede.Suspendisse @ ...|
3 |       Olga| 1 - 956 - 311 - 1686 |Aenean.eget.metus...|
4 |      Belle| 1 - 246 - 894 - 6340 |vitae.aliquet.nec...|
5 |     Trevor| 1 - 300 - 527 - 4967 |dapibus.id @ acturp...|
6 |     Laurel| 1 - 691 - 379 - 9921 |adipiscing @ consec...|
7 |       Sara| 1 - 608 - 140 - 1995 |Donec.nibh @ enimEt...|
8 |     Kaseem| 1 - 881 - 586 - 2689 |cursus.et.magna @ e...|
9 |        Lev| 1 - 916 - 367 - 5608 |Vivamus.nisi @ ipsu...|
| 10 |       Maya| 1 - 271 - 683 - 2698 |accumsan.convalli...|
| 11 |        Emi| 1 - 467 - 270 - 1337 |        est @ nunc.com|
| 12 |      Caleb| 1 - 683 - 212 - 0896 |Suspendisse @ Quisq...|
| 13 |   Florence| 1 - 603 - 575 - 2444 |sit.amet.dapibus @ ...|
| 14 |      Anika| 1 - 856 - 828 - 7883 |euismod @ ligulaeli...|
| 15 |      Tarik| 1 - 398 - 171 - 2268 |turpis @ felisorci.com|
+---+-----------+--------------+--------------------+
only showing top 15 rows
 
students.show( true )
+---+-----------+--------------+--------------------+
| id|studentName|         phone|               email|
+---+-----------+--------------+--------------------+
1 |      Burke| 1 - 300 - 746 - 8446 |ullamcorper.velit...|
2 |      Kamal| 1 - 668 - 571 - 5046 |pede.Suspendisse @ ...|
3 |       Olga| 1 - 956 - 311 - 1686 |Aenean.eget.metus...|
4 |      Belle| 1 - 246 - 894 - 6340 |vitae.aliquet.nec...|
5 |     Trevor| 1 - 300 - 527 - 4967 |dapibus.id @ acturp...|
6 |     Laurel| 1 - 691 - 379 - 9921 |adipiscing @ consec...|
7 |       Sara| 1 - 608 - 140 - 1995 |Donec.nibh @ enimEt...|
8 |     Kaseem| 1 - 881 - 586 - 2689 |cursus.et.magna @ e...|
9 |        Lev| 1 - 916 - 367 - 5608 |Vivamus.nisi @ ipsu...|
| 10 |       Maya| 1 - 271 - 683 - 2698 |accumsan.convalli...|
| 11 |        Emi| 1 - 467 - 270 - 1337 |        est @ nunc.com|
| 12 |      Caleb| 1 - 683 - 212 - 0896 |Suspendisse @ Quisq...|
| 13 |   Florence| 1 - 603 - 575 - 2444 |sit.amet.dapibus @ ...|
| 14 |      Anika| 1 - 856 - 828 - 7883 |euismod @ ligulaeli...|
| 15 |      Tarik| 1 - 398 - 171 - 2268 |turpis @ felisorci.com|
| 16 |      Amena| 1 - 878 - 250 - 3129 |lorem.luctus.ut @ s...|
| 17 |    Blossom| 1 - 154 - 406 - 9596 |Nunc.commodo.auct...|
| 18 |        Guy| 1 - 869 - 521 - 3230 |senectus.et.netus...|
| 19 |    Malachi| 1 - 608 - 637 - 2772 |Proin.mi.Aliquam @ ...|
| 20 |     Edward| 1 - 711 - 710 - 6552 |lectus @ aliquetlib...|
+---+-----------+--------------+--------------------+
only showing top 20 rows
 
students.show( false )
+---+-----------+--------------+-----------------------------------------+
|id |studentName|phone         |email                                    |
+---+-----------+--------------+-----------------------------------------+
| 1  |Burke      | 1 - 300 - 746 - 8446 |ullamcorper.velit.in @ ametnullaDonec.co.uk|
| 2  |Kamal      | 1 - 668 - 571 - 5046 |pede.Suspendisse @ interdumenim.edu        |
| 3  |Olga       | 1 - 956 - 311 - 1686 |Aenean.eget.metus @ dictumcursusNunc.edu   |
| 4  |Belle      | 1 - 246 - 894 - 6340 |vitae.aliquet.nec @ neque.co.uk            |
| 5  |Trevor     | 1 - 300 - 527 - 4967 |dapibus.id @ acturpisegestas.net           |
| 6  |Laurel     | 1 - 691 - 379 - 9921 |adipiscing @ consectetueripsum.edu         |
| 7  |Sara       | 1 - 608 - 140 - 1995 |Donec.nibh @ enimEtiamimperdiet.edu        |
| 8  |Kaseem     | 1 - 881 - 586 - 2689 |cursus.et.magna @ euismod.org              |
| 9  |Lev        | 1 - 916 - 367 - 5608 |Vivamus.nisi @ ipsumdolor.com              |
| 10 |Maya       | 1 - 271 - 683 - 2698 |accumsan.convallis @ ornarelectusjusto.edu |
| 11 |Emi        | 1 - 467 - 270 - 1337 |est @ nunc.com                             |
| 12 |Caleb      | 1 - 683 - 212 - 0896 |Suspendisse @ Quisque.edu                  |
| 13 |Florence   | 1 - 603 - 575 - 2444 |sit.amet.dapibus @ lacusAliquamrutrum.ca   |
| 14 |Anika      | 1 - 856 - 828 - 7883 |euismod @ ligulaelit.co.uk                 |
| 15 |Tarik      | 1 - 398 - 171 - 2268 |turpis @ felisorci.com                     |
| 16 |Amena      | 1 - 878 - 250 - 3129 |lorem.luctus.ut @ scelerisque.com          |
| 17 |Blossom    | 1 - 154 - 406 - 9596 |Nunc.commodo.auctor @ eratSed.co.uk        |
| 18 |Guy        | 1 - 869 - 521 - 3230 |senectus.et.netus @ lectusrutrum.com       |
| 19 |Malachi    | 1 - 608 - 637 - 2772 |Proin.mi.Aliquam @ estarcu.net             |
| 20 |Edward     | 1 - 711 - 710 - 6552 |lectus @ aliquetlibero.co.uk               |
+---+-----------+--------------+-----------------------------------------+
only showing top 20 rows
 
students.show( 10 , false )
 
+---+-----------+--------------+-----------------------------------------+
|id |studentName|phone         |email                                    |
+---+-----------+--------------+-----------------------------------------+
| 1  |Burke      | 1 - 300 - 746 - 8446 |ullamcorper.velit.in @ ametnullaDonec.co.uk|
| 2  |Kamal      | 1 - 668 - 571 - 5046 |pede.Suspendisse @ interdumenim.edu        |
| 3  |Olga       | 1 - 956 - 311 - 1686 |Aenean.eget.metus @ dictumcursusNunc.edu   |
| 4  |Belle      | 1 - 246 - 894 - 6340 |vitae.aliquet.nec @ neque.co.uk            |
| 5  |Trevor     | 1 - 300 - 527 - 4967 |dapibus.id @ acturpisegestas.net           |
| 6  |Laurel     | 1 - 691 - 379 - 9921 |adipiscing @ consectetueripsum.edu         |
| 7  |Sara       | 1 - 608 - 140 - 1995 |Donec.nibh @ enimEtiamimperdiet.edu        |
| 8  |Kaseem     | 1 - 881 - 586 - 2689 |cursus.et.magna @ euismod.org              |
| 9  |Lev        | 1 - 916 - 367 - 5608 |Vivamus.nisi @ ipsumdolor.com              |
| 10 |Maya       | 1 - 271 - 683 - 2698 |accumsan.convallis @ ornarelectusjusto.edu |
+---+-----------+--------------+-----------------------------------------+
only showing top 10 rows

  我们还可以使用head(n: Int)方法来采样数据,这个函数也需要输入一个参数标明需要采样的行数,而且这个函数返回的是Row数组,我们需要遍历打印。当然,我们也可以使用head()函数直接打印,这个函数只是返回数据的一行,类型也是Row。

students.head( 5 ).foreach(println)
[ 1 ,Burke, 1 - 300 - 746 - 8446 ,ullamcorper.velit.in @ ametnullaDonec.co.uk]
[ 2 ,Kamal, 1 - 668 - 571 - 5046 ,pede.Suspendisse @ interdumenim.edu]
[ 3 ,Olga, 1 - 956 - 311 - 1686 ,Aenean.eget.metus @ dictumcursusNunc.edu]
[ 4 ,Belle, 1 - 246 - 894 - 6340 ,vitae.aliquet.nec @ neque.co.uk]
[ 5 ,Trevor, 1 - 300 - 527 - 4967 ,dapibus.id @ acturpisegestas.net]
println(students.head())
[ 1 ,Burke, 1 - 300 - 746 - 8446 ,ullamcorper.velit.in @ ametnullaDonec.co.uk]

除了show、head函数。我们还可以使用first和take函数,他们分别调用head()和head(n)

println(students.first())
[ 1 ,Burke, 1 - 300 - 746 - 8446 ,ullamcorper.velit.in @ ametnullaDonec.co.uk]
students.take( 5 ).foreach(println)
[ 1 ,Burke, 1 - 300 - 746 - 8446 ,ullamcorper.velit.in @ ametnullaDonec.co.uk]
[ 2 ,Kamal, 1 - 668 - 571 - 5046 ,pede.Suspendisse @ interdumenim.edu]
[ 3 ,Olga, 1 - 956 - 311 - 1686 ,Aenean.eget.metus @ dictumcursusNunc.edu]
[ 4 ,Belle, 1 - 246 - 894 - 6340 ,vitae.aliquet.nec @ neque.co.uk]
[ 5 ,Trevor, 1 - 300 - 527 - 4967 ,dapibus.id @ acturpisegestas.net]

查询DataFrame里面的列

  正如你所看到的,所有的DataFrame里面的列都是有名称的。Select函数可以帮助我们从DataFrame中选择需要的列,并且返回一个全新的DataFrame,下面我将此进行介绍。

  1、只选择一列。假如我们只想从DataFrame中选择email这列,因为DataFrame是不可变的(immutable),所以这个操作会返回一个新的DataFrame:

val emailDataFrame : DataFrame = students.select( "email" )

现在我们有一个名叫emailDataFrame全新的DataFrame,而且其中只包含了email这列,让我们使用show来看看是否是这样的:

emailDataFrame.show( 3 )
+--------------------+
|               email|
+--------------------+
|ullamcorper.velit...|
|pede.Suspendisse @ ...|
|Aenean.eget.metus...|
+--------------------+
only showing top 3 rows

  2、选择多列。其实select函数支持选择多列。

val studentEmailDF = students.select( "studentName" , "email" )
studentEmailDF.show( 5 )
+-----------+--------------------+
|studentName|               email|
+-----------+--------------------+
|      Burke|ullamcorper.velit...|
|      Kamal|pede.Suspendisse @ ...|
|       Olga|Aenean.eget.metus...|
|      Belle|vitae.aliquet.nec...|
|     Trevor|dapibus.id @ acturp...|
+-----------+--------------------+
only showing top 5 rows

  需要主要的是,我们select列的时候,需要保证select的列是有效的,换句话说,就是必须保证select的列是printSchema打印出来的。如果列的名称是无效的,将会出现org.apache.spark.sql.AnalysisException异常,如下:

val studentEmailDF = students.select( "studentName" , "iteblog" )
studentEmailDF.show( 5 )
 
Exception in thread "main" org.apache.spark.sql.AnalysisException : cannot resolve 'iteblog' given input columns id, studentName, phone, email;

根据条件过滤数据

  现在我们已经知道如何在DataFrame中选择需要的列,让我们来看看如何根据条件来过滤DataFrame里面的数据。对应基于Row的数据,我们可以将DataFrame看作是普通的Scala集合,然后我们根据需要的条件进行相关的过滤,为了展示清楚,我在语句没后面都用show函数展示过滤的结果。

students.filter( "id > 5" ).show( 7 )
+---+-----------+--------------+--------------------+
| id|studentName|         phone|               email|
+---+-----------+--------------+--------------------+
6 |     Laurel| 1 - 691 - 379 - 9921 |adipiscing @ consec...|
7 |       Sara| 1 - 608 - 140 - 1995 |Donec.nibh @ enimEt...|
8 |     Kaseem| 1 - 881 - 586 - 2689 |cursus.et.magna @ e...|
9 |        Lev| 1 - 916 - 367 - 5608 |Vivamus.nisi @ ipsu...|
| 10 |       Maya| 1 - 271 - 683 - 2698 |accumsan.convalli...|
| 11 |        Emi| 1 - 467 - 270 - 1337 |        est @ nunc.com|
| 12 |      Caleb| 1 - 683 - 212 - 0896 |Suspendisse @ Quisq...|
| 13 |   Florence| 1 - 603 - 575 - 2444 |sit.amet.dapibus @ ...|
| 14 |      Anika| 1 - 856 - 828 - 7883 |euismod @ ligulaeli...|
| 15 |      Tarik| 1 - 398 - 171 - 2268 |turpis @ felisorci.com|
+---+-----------+--------------+--------------------+
only showing top 10 rows
 
students.filter( "studentName =''" ).show( 7 )
+---+-----------+--------------+--------------------+
| id|studentName|         phone|               email|
+---+-----------+--------------+--------------------+
| 21 |           | 1 - 598 - 439 - 7549 |consectetuer.adip...|
| 32 |           | 1 - 184 - 895 - 9602 |accumsan.laoreet @ ...|
| 45 |           | 1 - 245 - 752 - 0481 |Suspendisse.eleif...|
| 83 |           | 1 - 858 - 810 - 2204 |sociis.natoque @ eu...|
| 94 |           | 1 - 443 - 410 - 7878 |Praesent.eu.nulla...|
+---+-----------+--------------+--------------------+

  注意看第一个过滤语句,虽然id被解析成String了,但是程序依然正确地做出了比较。我们也可以对多个条件进行过滤:

students.filter( "studentName ='' OR studentName = 'NULL'" ).show( 7 )
+---+-----------+--------------+--------------------+
| id|studentName|         phone|               email|
+---+-----------+--------------+--------------------+
| 21 |           | 1 - 598 - 439 - 7549 |consectetuer.adip...|
| 32 |           | 1 - 184 - 895 - 9602 |accumsan.laoreet @ ...|
| 33 |       NULL| 1 - 105 - 503 - 0141 |Donec @ Inmipede.co.uk|
| 45 |           | 1 - 245 - 752 - 0481 |Suspendisse.eleif...|
| 83 |           | 1 - 858 - 810 - 2204 |sociis.natoque @ eu...|
| 94 |           | 1 - 443 - 410 - 7878 |Praesent.eu.nulla...|
+---+-----------+--------------+--------------------+

我们还可以采用类SQL的语法对数据进行过滤:

students.filter( "SUBSTR(studentName,0,1) ='M'" ).show( 7 )
+---+-----------+--------------+--------------------+
| id|studentName|         phone|               email|
+---+-----------+--------------+--------------------+
| 10 |       Maya| 1 - 271 - 683 - 2698 |accumsan.convalli...|
| 19 |    Malachi| 1 - 608 - 637 - 2772 |Proin.mi.Aliquam @ ...|
| 24 |    Marsden| 1 - 477 - 629 - 7528 |Donec.dignissim.m...|
| 37 |      Maggy| 1 - 910 - 887 - 6777 |facilisi.Sed.nequ...|
| 61 |     Maxine| 1 - 422 - 863 - 3041 |aliquet.molestie....|
| 77 |      Maggy| 1 - 613 - 147 - 4380 | pellentesque @ mi.net|
| 97 |    Maxwell| 1 - 607 - 205 - 1273 |metus.In @ musAenea...|
+---+-----------+--------------+--------------------+
only showing top 7 rows

对DataFrame里面的数据进行排序

使用sort函数我们可以对DataFrame中指定的列进行排序:

students.sort(students( "studentName" ).desc).show( 7 )
+---+-----------+--------------+--------------------+
| id|studentName|         phone|               email|
+---+-----------+--------------+--------------------+
| 50 |      Yasir| 1 - 282 - 511 - 4445 |eget.odio.Aliquam...|
| 52 |       Xena| 1 - 527 - 990 - 8606 |in.faucibus.orci @ ...|
| 86 |     Xandra| 1 - 677 - 708 - 5691 |libero @ arcuVestib...|
| 43 |     Wynter| 1 - 440 - 544 - 1851 |amet.risus.Donec @ ...|
| 31 |    Wallace| 1 - 144 - 220 - 8159 | lorem.lorem @ non.net|
| 66 |      Vance| 1 - 268 - 680 - 0857 |pellentesque @ netu...|
| 41 |     Tyrone| 1 - 907 - 383 - 5293 |non.bibendum.sed @ ...|
+---+-----------+--------------+--------------------+
only showing top 7 rows

也可以对多列进行排序:

students.sort( "studentName" , "id" ).show( 10 )
+---+-----------+--------------+--------------------+
| id|studentName|         phone|               email|
+---+-----------+--------------+--------------------+
| 21 |           | 1 - 598 - 439 - 7549 |consectetuer.adip...|
| 32 |           | 1 - 184 - 895 - 9602 |accumsan.laoreet @ ...|
| 45 |           | 1 - 245 - 752 - 0481 |Suspendisse.eleif...|
| 83 |           | 1 - 858 - 810 - 2204 |sociis.natoque @ eu...|
| 94 |           | 1 - 443 - 410 - 7878 |Praesent.eu.nulla...|
| 91 |       Abel| 1 - 530 - 527 - 7467 |    urna @ veliteu.edu|
| 69 |       Aiko| 1 - 682 - 230 - 7013 |turpis.vitae.puru...|
| 47 |       Alma| 1 - 747 - 382 - 6775 |    nec.enim @ non.org|
| 26 |      Amela| 1 - 526 - 909 - 2605 | in @ vitaesodales.edu|
| 16 |      Amena| 1 - 878 - 250 - 3129 |lorem.luctus.ut @ s...|
+---+-----------+--------------+--------------------+
only showing top 10 rows

从上面的结果我们可以看出,默认是按照升序进行排序的。我们也可以将上面的语句写成下面的:

students.sort(students( "studentName" ).asc, students( "id" ).asc).show( 10 )

这两个语句运行的效果是一致的。

对列进行重命名

  如果我们对DataFrame中默认的列名不感兴趣,我们可以在select的时候利用as对其进行重命名,下面的列子将studentName重命名为name,而email这列名字不变:

students.select(students( "studentName" ).as( "name" ), students( "email" )).show( 10 )
+--------+--------------------+
|    name|               email|
+--------+--------------------+
|   Burke|ullamcorper.velit...|
|   Kamal|pede.Suspendisse @ ...|
|    Olga|Aenean.eget.metus...|
|   Belle|vitae.aliquet.nec...|
|  Trevor|dapibus.id @ acturp...|
|  Laurel|adipiscing @ consec...|
|    Sara|Donec.nibh @ enimEt...|
|  Kaseem|cursus.et.magna @ e...|
|     Lev|Vivamus.nisi @ ipsu...|
|    Maya|accumsan.convalli...|
+--------+--------------------+
only showing top 10 rows

将DataFrame看作是关系型数据表

  DataFrame的一个强大之处就是我们可以将它看作是一个关系型数据表,然后在其上运行SQL查询语句,只要我们进行下面两步即可实现:
  (1)、将DataFrame注册成一张名为students的表:

students.registerTempTable( "students" )

  (2)、然后我们在其上用标准的SQL进行查询:

sqlContext.sql( "select * from students where studentName!='' order by email desc" ).show( 7 )
 
+---+-----------+--------------+--------------------+
| id|studentName|         phone|               email|
+---+-----------+--------------+--------------------+
| 87 |      Selma| 1 - 601 - 330 - 4409 |vulputate.velit @ p...|
| 96 |   Channing| 1 - 984 - 118 - 7533 |viverra.Donec.tem...|
4 |      Belle| 1 - 246 - 894 - 6340 |vitae.aliquet.nec...|
| 78 |       Finn| 1 - 213 - 781 - 6969 |vestibulum.massa @ ...|
| 53 |     Kasper| 1 - 155 - 575 - 9346 |velit.eget @ pedeCu...|
| 63 |      Dylan| 1 - 417 - 943 - 8961 |vehicula.aliquet @ ...|
| 35 |     Cadman| 1 - 443 - 642 - 5919 |ut.lacus @ adipisci...|
+---+-----------+--------------+--------------------+
only showing top 7 rows

对两个DataFrame进行Join操作

  前面我们已经知道如何将DataFrame注册成一张表,现在我们来看看如何使用普通的SQL对两个DataFrame进行Join操作。

  1、内联:内联是默认的Join操作,它仅仅返回两个DataFrame都匹配到的结果,来看看下面的例子:

val students 1 = sqlContext.csvFile(filePath = "E:\\StudentPrep1.csv" , useHeader = true , delimiter = '|' )
val students 2 = sqlContext.csvFile(filePath = "E:\\StudentPrep2.csv" , useHeader = true , delimiter = '|' )
val studentsJoin = students 1 .join(students 2 , students 1 ( "id" ) === students 2 ( "id" ))
studentsJoin.show(studentsJoin.count.toInt)
 
+---+-----------+--------------+--------------------+---+------------------+--------------+--------------------+
| id|studentName|         phone|               email| id|       studentName|         phone|               email|
+---+-----------+--------------+--------------------+---+------------------+--------------+--------------------+
1 |      Burke| 1 - 300 - 746 - 8446 |ullamcorper.velit...|  1 |BurkeDifferentName| 1 - 300 - 746 - 8446 |ullamcorper.velit...|
2 |      Kamal| 1 - 668 - 571 - 5046 |pede.Suspendisse @ ...|  2 |KamalDifferentName| 1 - 668 - 571 - 5046 |pede.Suspendisse @ ...|
3 |       Olga| 1 - 956 - 311 - 1686 |Aenean.eget.metus...|  3 |              Olga| 1 - 956 - 311 - 1686 |Aenean.eget.metus...|
4 |      Belle| 1 - 246 - 894 - 6340 |vitae.aliquet.nec...|  4 |BelleDifferentName| 1 - 246 - 894 - 6340 |vitae.aliquet.nec...|
5 |     Trevor| 1 - 300 - 527 - 4967 |dapibus.id @ acturp...|  5 |            Trevor| 1 - 300 - 527 - 4967 |dapibusDifferentE...|
6 |     Laurel| 1 - 691 - 379 - 9921 |adipiscing @ consec...|  6 |LaurelInvalidPhone|     000000000 |adipiscing @ consec...|
7 |       Sara| 1 - 608 - 140 - 1995 |Donec.nibh @ enimEt...|  7 |              Sara| 1 - 608 - 140 - 1995 |Donec.nibh @ enimEt...|
8 |     Kaseem| 1 - 881 - 586 - 2689 |cursus.et.magna @ e...|  8 |            Kaseem| 1 - 881 - 586 - 2689 |cursus.et.magna @ e...|
9 |        Lev| 1 - 916 - 367 - 5608 |Vivamus.nisi @ ipsu...|  9 |               Lev| 1 - 916 - 367 - 5608 |Vivamus.nisi @ ipsu...|
| 10 |       Maya| 1 - 271 - 683 - 2698 |accumsan.convalli...| 10 |              Maya| 1 - 271 - 683 - 2698 |accumsan.convalli...|
+---+-----------+--------------+--------------------+---+------------------+--------------+--------------------+

  2、右外联:在内连接的基础上,还包含右表中所有不符合条件的数据行,并在其中的左表列填写NULL ,来看看下面的实例:

val studentsRightOuterJoin = students 1 .join(students 2 , students 1 ( "id" ) === students 2 ( "id" ), "right_outer" )
studentsRightOuterJoin.show(studentsRightOuterJoin.count.toInt)
+----+-----------+--------------+--------------------+---+--------------------+--------------+--------------------+
|  id|studentName|         phone|               email| id|         studentName|         phone|               email|
+----+-----------+--------------+--------------------+---+--------------------+--------------+--------------------+
|   1 |      Burke| 1 - 300 - 746 - 8446 |ullamcorper.velit...|  1 |  BurkeDifferentName| 1 - 300 - 746 - 8446 |ullamcorper.velit...|
|   2 |      Kamal| 1 - 668 - 571 - 5046 |pede.Suspendisse @ ...|  2 |  KamalDifferentName| 1 - 668 - 571 - 5046 |pede.Suspendisse @ ...|
|   3 |       Olga| 1 - 956 - 311 - 1686 |Aenean.eget.metus...|  3 |                Olga| 1 - 956 - 311 - 1686 |Aenean.eget.metus...|
|   4 |      Belle| 1 - 246 - 894 - 6340 |vitae.aliquet.nec...|  4 |  BelleDifferentName| 1 - 246 - 894 - 6340 |vitae.aliquet.nec...|
|   5 |     Trevor| 1 - 300 - 527 - 4967 |dapibus.id @ acturp...|  5 |              Trevor| 1 - 300 - 527 - 4967 |dapibusDifferentE...|
|   6 |     Laurel| 1 - 691 - 379 - 9921 |adipiscing @ consec...|  6 |  LaurelInvalidPhone|     000000000 |adipiscing @ consec...|
|   7 |       Sara| 1 - 608 - 140 - 1995 |Donec.nibh @ enimEt...|  7 |                Sara| 1 - 608 - 140 - 1995 |Donec.nibh @ enimEt...|
|   8 |     Kaseem| 1 - 881 - 586 - 2689 |cursus.et.magna @ e...|  8 |              Kaseem| 1 - 881 - 586 - 2689 |cursus.et.magna @ e...|
|   9 |        Lev| 1 - 916 - 367 - 5608 |Vivamus.nisi @ ipsu...|  9 |                 Lev| 1 - 916 - 367 - 5608 |Vivamus.nisi @ ipsu...|
10 |       Maya| 1 - 271 - 683 - 2698 |accumsan.convalli...| 10 |                Maya| 1 - 271 - 683 - 2698 |accumsan.convalli...|
| null |       null |          null |                null | 999 |LevUniqueToSecondRDD| 1 - 916 - 367 - 5608 |Vivamus.nisi @ ipsu...|
+----+-----------+--------------+--------------------+---+--------------------+--------------+--------------------+

  3、左外联:在内连接的基础上,还包含左表中所有不符合条件的数据行,并在其中的右表列填写NULL ,同样我们来看看下面的实例:

val studentsLeftOuterJoin = students 1 .join(students 2 , students 1 ( "id" ) === students 2 ( "id" ), "left_outer" )
studentsLeftOuterJoin.show(studentsLeftOuterJoin.count.toInt)
+---+-----------+--------------+--------------------+----+------------------+--------------+--------------------+
| id|studentName|         phone|               email|  id|       studentName|         phone|               email|
+---+-----------+--------------+--------------------+----+------------------+--------------+--------------------+
1 |      Burke| 1 - 300 - 746 - 8446 |ullamcorper.velit...|   1 |BurkeDifferentName| 1 - 300 - 746 - 8446 |ullamcorper.velit...|
2 |      Kamal| 1 - 668 - 571 - 5046 |pede.Suspendisse @ ...|   2 |KamalDifferentName| 1 - 668 - 571 - 5046 |pede.Suspendisse @ ...|
3 |       Olga| 1 - 956 - 311 - 1686 |Aenean.eget.metus...|   3 |              Olga| 1 - 956 - 311 - 1686 |Aenean.eget.metus...|
4 |      Belle| 1 - 246 - 894 - 6340 |vitae.aliquet.nec...|   4 |BelleDifferentName| 1 - 246 - 894 - 6340 |vitae.aliquet.nec...|
5 |     Trevor| 1 - 300 - 527 - 4967 |dapibus.id @ acturp...|   5 |            Trevor| 1 - 300 - 527 - 4967 |dapibusDifferentE...|
6 |     Laurel| 1 - 691 - 379 - 9921 |adipiscing @ consec...|   6 |LaurelInvalidPhone|     000000000 |adipiscing @ consec...|
7 |       Sara| 1 - 608 - 140 - 1995 |Donec.nibh @ enimEt...|   7 |              Sara| 1 - 608 - 140 - 1995 |Donec.nibh @ enimEt...|
8 |     Kaseem| 1 - 881 - 586 - 2689 |cursus.et.magna @ e...|   8 |            Kaseem| 1 - 881 - 586 - 2689 |cursus.et.magna @ e...|
9 |        Lev| 1 - 916 - 367 - 5608 |Vivamus.nisi @ ipsu...|   9 |               Lev| 1 - 916 - 367 - 5608 |Vivamus.nisi @ ipsu...|
| 10 |       Maya| 1 - 271 - 683 - 2698 |accumsan.convalli...|  10 |              Maya| 1 - 271 - 683 - 2698 |accumsan.convalli...|
| 11 |    iteblog|        999999 | iteblog @ iteblog.com| null |              null |          null |                null |
+---+-----------+--------------+--------------------+----+------------------+--------------+--------------------+

将DataFrame保存成文件

  下面我来介绍如何将DataFrame保存到一个文件里面。前面我们加载csv文件用到了load函数,与之对于的用于保存文件可以使用save函数。具体操作包括以下两步:

  1、首先创建一个map对象,用于存储一些save函数需要用到的一些属性。这里我将制定保存文件的存放路径和csv的头信息。

val saveOptions = Map( "header" -> "true" , "path" -> "iteblog.csv" )

  为了基于学习的态度,我们从DataFrame里面选择出studentName和email两列,并且将studentName的列名重定义为name。

val copyOfStudents = students.select(students( "studentName" ).as( "name" ), students( "email" ))

  2、下面我们调用save函数保存上面的DataFrame数据到iteblog.csv文件夹中

copyOfStudents.write.format( "com.databricks.spark.csv" ).mode(SaveMode.Overwrite).options(saveOptions).save()

  mode函数可以接收的参数有Overwrite、Append、Ignore和ErrorIfExists。从名字就可以很好的理解,Overwrite代表覆盖目录下之前存在的数据;Append代表给指定目录下追加数据;Ignore代表如果目录下已经有文件,那就什么都不执行;ErrorIfExists代表如果保存目录下存在文件,那么抛出相应的异常。

  需要注意的是,上述path参数指定的是保存文件夹,并不是最后的保存文件名。


 转载自过往记忆(http://www.iteblog.com/)

评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值