一、从csv文件创建DataFrame
如何做?
如何工作的
附录
二、操作DataFrame
打印DataFrame里面的模式
对DataFrame里面的数据进行采样
查询DataFrame里面的列
根据条件过滤数据
对DataFrame里面的数据进行排序
对列进行重命名
将DataFrame看作是关系型数据表
对两个DataFrame进行Join操作
将DataFrame保存成文件
三、从Scala case class中创建DataFrame
如何做?
如何工作的
附录
文章目录 [hide]
二、操作DataFrame
在前面的文章中,我们介绍了如何创建DataFrame。本文将介绍如何操作DataFrame里面的数据和打印出DataFrame里面数据的模式
打印DataFrame里面的模式
在创建完DataFrame之后,我们一般都会查看里面数据的模式,我们可以通过printSchema
函数来查看。它会打印出列的名称和类型:
students.printSchema
root
|-- id
:
string (nullable
=
true
)
|-- studentName
:
string (nullable
=
true
)
|-- phone
:
string (nullable
=
true
)
|-- email
:
string (nullable
=
true
)
|
如果采用的是load方式参见DataFrame的,students.printSchema
的输出则如下:
root
|-- id|studentName|phone|email
:
string (nullable
=
true
)
|
对DataFrame里面的数据进行采样
打印完模式之后,我们要做的第二件事就是看看加载进DataFrame里面的数据是否正确。从新创建的DataFrame里面采样数据的方法有很多种。我们来对其进行介绍。
最简单的就是使用show方法,show方法有四个版本:
(1)、第一个需要我们指定采样的行数def show(numRows: Int);
(2)、第二种不需要我们指定任何参数,这种情况下,show函数默认会加载出20行的数据def show();
(3)、第三种需要指定一个boolean值,这个值说明是否需要对超过20个字符的列进行截取def show(truncate: Boolean);
(4)、最后一种需要指定采样的行和是否需要对列进行截断def show(numRows: Int, truncate: Boolean)
。实际上,前三个函数都是调用这个函数实现的。
Show函数和其他函数不同的地方在于其不仅会显示需要打印的行,而且还会打印出头信息,并且会直接在默认的输出流打出(console)。来看看怎么使用吧:
students.show()
//打印出20行
+---+-----------+--------------+--------------------+
| id|studentName| phone| email|
+---+-----------+--------------+--------------------+
|
1
| Burke|
1
-
300
-
746
-
8446
|ullamcorper.velit...|
|
2
| Kamal|
1
-
668
-
571
-
5046
|pede.Suspendisse
@
...|
|
3
| Olga|
1
-
956
-
311
-
1686
|Aenean.eget.metus...|
|
4
| Belle|
1
-
246
-
894
-
6340
|vitae.aliquet.nec...|
|
5
| Trevor|
1
-
300
-
527
-
4967
|dapibus.id
@
acturp...|
|
6
| Laurel|
1
-
691
-
379
-
9921
|adipiscing
@
consec...|
|
7
| Sara|
1
-
608
-
140
-
1995
|Donec.nibh
@
enimEt...|
|
8
| Kaseem|
1
-
881
-
586
-
2689
|cursus.et.magna
@
e...|
|
9
| Lev|
1
-
916
-
367
-
5608
|Vivamus.nisi
@
ipsu...|
|
10
| Maya|
1
-
271
-
683
-
2698
|accumsan.convalli...|
|
11
| Emi|
1
-
467
-
270
-
1337
| est
@
nunc.com|
|
12
| Caleb|
1
-
683
-
212
-
0896
|Suspendisse
@
Quisq...|
|
13
| Florence|
1
-
603
-
575
-
2444
|sit.amet.dapibus
@
...|
|
14
| Anika|
1
-
856
-
828
-
7883
|euismod
@
ligulaeli...|
|
15
| Tarik|
1
-
398
-
171
-
2268
|turpis
@
felisorci.com|
|
16
| Amena|
1
-
878
-
250
-
3129
|lorem.luctus.ut
@
s...|
|
17
| Blossom|
1
-
154
-
406
-
9596
|Nunc.commodo.auct...|
|
18
| Guy|
1
-
869
-
521
-
3230
|senectus.et.netus...|
|
19
| Malachi|
1
-
608
-
637
-
2772
|Proin.mi.Aliquam
@
...|
|
20
| Edward|
1
-
711
-
710
-
6552
|lectus
@
aliquetlib...|
+---+-----------+--------------+--------------------+
only showing top
20
rows
students.show(
15
)
+---+-----------+--------------+--------------------+
| id|studentName| phone| email|
+---+-----------+--------------+--------------------+
|
1
| Burke|
1
-
300
-
746
-
8446
|ullamcorper.velit...|
|
2
| Kamal|
1
-
668
-
571
-
5046
|pede.Suspendisse
@
...|
|
3
| Olga|
1
-
956
-
311
-
1686
|Aenean.eget.metus...|
|
4
| Belle|
1
-
246
-
894
-
6340
|vitae.aliquet.nec...|
|
5
| Trevor|
1
-
300
-
527
-
4967
|dapibus.id
@
acturp...|
|
6
| Laurel|
1
-
691
-
379
-
9921
|adipiscing
@
consec...|
|
7
| Sara|
1
-
608
-
140
-
1995
|Donec.nibh
@
enimEt...|
|
8
| Kaseem|
1
-
881
-
586
-
2689
|cursus.et.magna
@
e...|
|
9
| Lev|
1
-
916
-
367
-
5608
|Vivamus.nisi
@
ipsu...|
|
10
| Maya|
1
-
271
-
683
-
2698
|accumsan.convalli...|
|
11
| Emi|
1
-
467
-
270
-
1337
| est
@
nunc.com|
|
12
| Caleb|
1
-
683
-
212
-
0896
|Suspendisse
@
Quisq...|
|
13
| Florence|
1
-
603
-
575
-
2444
|sit.amet.dapibus
@
...|
|
14
| Anika|
1
-
856
-
828
-
7883
|euismod
@
ligulaeli...|
|
15
| Tarik|
1
-
398
-
171
-
2268
|turpis
@
felisorci.com|
+---+-----------+--------------+--------------------+
only showing top
15
rows
students.show(
true
)
+---+-----------+--------------+--------------------+
| id|studentName| phone| email|
+---+-----------+--------------+--------------------+
|
1
| Burke|
1
-
300
-
746
-
8446
|ullamcorper.velit...|
|
2
| Kamal|
1
-
668
-
571
-
5046
|pede.Suspendisse
@
...|
|
3
| Olga|
1
-
956
-
311
-
1686
|Aenean.eget.metus...|
|
4
| Belle|
1
-
246
-
894
-
6340
|vitae.aliquet.nec...|
|
5
| Trevor|
1
-
300
-
527
-
4967
|dapibus.id
@
acturp...|
|
6
| Laurel|
1
-
691
-
379
-
9921
|adipiscing
@
consec...|
|
7
| Sara|
1
-
608
-
140
-
1995
|Donec.nibh
@
enimEt...|
|
8
| Kaseem|
1
-
881
-
586
-
2689
|cursus.et.magna
@
e...|
|
9
| Lev|
1
-
916
-
367
-
5608
|Vivamus.nisi
@
ipsu...|
|
10
| Maya|
1
-
271
-
683
-
2698
|accumsan.convalli...|
|
11
| Emi|
1
-
467
-
270
-
1337
| est
@
nunc.com|
|
12
| Caleb|
1
-
683
-
212
-
0896
|Suspendisse
@
Quisq...|
|
13
| Florence|
1
-
603
-
575
-
2444
|sit.amet.dapibus
@
...|
|
14
| Anika|
1
-
856
-
828
-
7883
|euismod
@
ligulaeli...|
|
15
| Tarik|
1
-
398
-
171
-
2268
|turpis
@
felisorci.com|
|
16
| Amena|
1
-
878
-
250
-
3129
|lorem.luctus.ut
@
s...|
|
17
| Blossom|
1
-
154
-
406
-
9596
|Nunc.commodo.auct...|
|
18
| Guy|
1
-
869
-
521
-
3230
|senectus.et.netus...|
|
19
| Malachi|
1
-
608
-
637
-
2772
|Proin.mi.Aliquam
@
...|
|
20
| Edward|
1
-
711
-
710
-
6552
|lectus
@
aliquetlib...|
+---+-----------+--------------+--------------------+
only showing top
20
rows
students.show(
false
)
+---+-----------+--------------+-----------------------------------------+
|id |studentName|phone |email |
+---+-----------+--------------+-----------------------------------------+
|
1
|Burke |
1
-
300
-
746
-
8446
|ullamcorper.velit.in
@
ametnullaDonec.co.uk|
|
2
|Kamal |
1
-
668
-
571
-
5046
|pede.Suspendisse
@
interdumenim.edu |
|
3
|Olga |
1
-
956
-
311
-
1686
|Aenean.eget.metus
@
dictumcursusNunc.edu |
|
4
|Belle |
1
-
246
-
894
-
6340
|vitae.aliquet.nec
@
neque.co.uk |
|
5
|Trevor |
1
-
300
-
527
-
4967
|dapibus.id
@
acturpisegestas.net |
|
6
|Laurel |
1
-
691
-
379
-
9921
|adipiscing
@
consectetueripsum.edu |
|
7
|Sara |
1
-
608
-
140
-
1995
|Donec.nibh
@
enimEtiamimperdiet.edu |
|
8
|Kaseem |
1
-
881
-
586
-
2689
|cursus.et.magna
@
euismod.org |
|
9
|Lev |
1
-
916
-
367
-
5608
|Vivamus.nisi
@
ipsumdolor.com |
|
10
|Maya |
1
-
271
-
683
-
2698
|accumsan.convallis
@
ornarelectusjusto.edu |
|
11
|Emi |
1
-
467
-
270
-
1337
|est
@
nunc.com |
|
12
|Caleb |
1
-
683
-
212
-
0896
|Suspendisse
@
Quisque.edu |
|
13
|Florence |
1
-
603
-
575
-
2444
|sit.amet.dapibus
@
lacusAliquamrutrum.ca |
|
14
|Anika |
1
-
856
-
828
-
7883
|euismod
@
ligulaelit.co.uk |
|
15
|Tarik |
1
-
398
-
171
-
2268
|turpis
@
felisorci.com |
|
16
|Amena |
1
-
878
-
250
-
3129
|lorem.luctus.ut
@
scelerisque.com |
|
17
|Blossom |
1
-
154
-
406
-
9596
|Nunc.commodo.auctor
@
eratSed.co.uk |
|
18
|Guy |
1
-
869
-
521
-
3230
|senectus.et.netus
@
lectusrutrum.com |
|
19
|Malachi |
1
-
608
-
637
-
2772
|Proin.mi.Aliquam
@
estarcu.net |
|
20
|Edward |
1
-
711
-
710
-
6552
|lectus
@
aliquetlibero.co.uk |
+---+-----------+--------------+-----------------------------------------+
only showing top
20
rows
students.show(
10
,
false
)
+---+-----------+--------------+-----------------------------------------+
|id |studentName|phone |email |
+---+-----------+--------------+-----------------------------------------+
|
1
|Burke |
1
-
300
-
746
-
8446
|ullamcorper.velit.in
@
ametnullaDonec.co.uk|
|
2
|Kamal |
1
-
668
-
571
-
5046
|pede.Suspendisse
@
interdumenim.edu |
|
3
|Olga |
1
-
956
-
311
-
1686
|Aenean.eget.metus
@
dictumcursusNunc.edu |
|
4
|Belle |
1
-
246
-
894
-
6340
|vitae.aliquet.nec
@
neque.co.uk |
|
5
|Trevor |
1
-
300
-
527
-
4967
|dapibus.id
@
acturpisegestas.net |
|
6
|Laurel |
1
-
691
-
379
-
9921
|adipiscing
@
consectetueripsum.edu |
|
7
|Sara |
1
-
608
-
140
-
1995
|Donec.nibh
@
enimEtiamimperdiet.edu |
|
8
|Kaseem |
1
-
881
-
586
-
2689
|cursus.et.magna
@
euismod.org |
|
9
|Lev |
1
-
916
-
367
-
5608
|Vivamus.nisi
@
ipsumdolor.com |
|
10
|Maya |
1
-
271
-
683
-
2698
|accumsan.convallis
@
ornarelectusjusto.edu |
+---+-----------+--------------+-----------------------------------------+
only showing top
10
rows
|
我们还可以使用head(n: Int)方法来采样数据,这个函数也需要输入一个参数标明需要采样的行数,而且这个函数返回的是Row数组,我们需要遍历打印。当然,我们也可以使用head()函数直接打印,这个函数只是返回数据的一行,类型也是Row。
students.head(
5
).foreach(println)
[
1
,Burke,
1
-
300
-
746
-
8446
,ullamcorper.velit.in
@
ametnullaDonec.co.uk]
[
2
,Kamal,
1
-
668
-
571
-
5046
,pede.Suspendisse
@
interdumenim.edu]
[
3
,Olga,
1
-
956
-
311
-
1686
,Aenean.eget.metus
@
dictumcursusNunc.edu]
[
4
,Belle,
1
-
246
-
894
-
6340
,vitae.aliquet.nec
@
neque.co.uk]
[
5
,Trevor,
1
-
300
-
527
-
4967
,dapibus.id
@
acturpisegestas.net]
println(students.head())
[
1
,Burke,
1
-
300
-
746
-
8446
,ullamcorper.velit.in
@
ametnullaDonec.co.uk]
|
除了show、head函数。我们还可以使用first和take函数,他们分别调用head()和head(n)
println(students.first())
[
1
,Burke,
1
-
300
-
746
-
8446
,ullamcorper.velit.in
@
ametnullaDonec.co.uk]
students.take(
5
).foreach(println)
[
1
,Burke,
1
-
300
-
746
-
8446
,ullamcorper.velit.in
@
ametnullaDonec.co.uk]
[
2
,Kamal,
1
-
668
-
571
-
5046
,pede.Suspendisse
@
interdumenim.edu]
[
3
,Olga,
1
-
956
-
311
-
1686
,Aenean.eget.metus
@
dictumcursusNunc.edu]
[
4
,Belle,
1
-
246
-
894
-
6340
,vitae.aliquet.nec
@
neque.co.uk]
[
5
,Trevor,
1
-
300
-
527
-
4967
,dapibus.id
@
acturpisegestas.net]
|
查询DataFrame里面的列
正如你所看到的,所有的DataFrame里面的列都是有名称的。Select函数可以帮助我们从DataFrame中选择需要的列,并且返回一个全新的DataFrame,下面我将此进行介绍。
1、只选择一列。假如我们只想从DataFrame中选择email这列,因为DataFrame是不可变的(immutable),所以这个操作会返回一个新的DataFrame:
val
emailDataFrame
:
DataFrame
=
students.select(
"email"
)
|
现在我们有一个名叫emailDataFrame全新的DataFrame,而且其中只包含了email这列,让我们使用show来看看是否是这样的:
emailDataFrame.show(
3
)
+--------------------+
| email|
+--------------------+
|ullamcorper.velit...|
|pede.Suspendisse
@
...|
|Aenean.eget.metus...|
+--------------------+
only showing top
3
rows
|
2、选择多列。其实select函数支持选择多列。
val
studentEmailDF
=
students.select(
"studentName"
,
"email"
)
studentEmailDF.show(
5
)
+-----------+--------------------+
|studentName| email|
+-----------+--------------------+
| Burke|ullamcorper.velit...|
| Kamal|pede.Suspendisse
@
...|
| Olga|Aenean.eget.metus...|
| Belle|vitae.aliquet.nec...|
| Trevor|dapibus.id
@
acturp...|
+-----------+--------------------+
only showing top
5
rows
|
需要主要的是,我们select列的时候,需要保证select的列是有效的,换句话说,就是必须保证select的列是printSchema
打印出来的。如果列的名称是无效的,将会出现org.apache.spark.sql.AnalysisException
异常,如下:
val
studentEmailDF
=
students.select(
"studentName"
,
"iteblog"
)
studentEmailDF.show(
5
)
Exception in thread
"main"
org.apache.spark.sql.AnalysisException
:
cannot resolve
'iteblog'
given input columns id, studentName, phone, email;
|
根据条件过滤数据
现在我们已经知道如何在DataFrame中选择需要的列,让我们来看看如何根据条件来过滤DataFrame里面的数据。对应基于Row的数据,我们可以将DataFrame看作是普通的Scala集合,然后我们根据需要的条件进行相关的过滤,为了展示清楚,我在语句没后面都用show函数展示过滤的结果。
students.filter(
"id > 5"
).show(
7
)
+---+-----------+--------------+--------------------+
| id|studentName| phone| email|
+---+-----------+--------------+--------------------+
|
6
| Laurel|
1
-
691
-
379
-
9921
|adipiscing
@
consec...|
|
7
| Sara|
1
-
608
-
140
-
1995
|Donec.nibh
@
enimEt...|
|
8
| Kaseem|
1
-
881
-
586
-
2689
|cursus.et.magna
@
e...|
|
9
| Lev|
1
-
916
-
367
-
5608
|Vivamus.nisi
@
ipsu...|
|
10
| Maya|
1
-
271
-
683
-
2698
|accumsan.convalli...|
|
11
| Emi|
1
-
467
-
270
-
1337
| est
@
nunc.com|
|
12
| Caleb|
1
-
683
-
212
-
0896
|Suspendisse
@
Quisq...|
|
13
| Florence|
1
-
603
-
575
-
2444
|sit.amet.dapibus
@
...|
|
14
| Anika|
1
-
856
-
828
-
7883
|euismod
@
ligulaeli...|
|
15
| Tarik|
1
-
398
-
171
-
2268
|turpis
@
felisorci.com|
+---+-----------+--------------+--------------------+
only showing top
10
rows
students.filter(
"studentName =''"
).show(
7
)
+---+-----------+--------------+--------------------+
| id|studentName| phone| email|
+---+-----------+--------------+--------------------+
|
21
| |
1
-
598
-
439
-
7549
|consectetuer.adip...|
|
32
| |
1
-
184
-
895
-
9602
|accumsan.laoreet
@
...|
|
45
| |
1
-
245
-
752
-
0481
|Suspendisse.eleif...|
|
83
| |
1
-
858
-
810
-
2204
|sociis.natoque
@
eu...|
|
94
| |
1
-
443
-
410
-
7878
|Praesent.eu.nulla...|
+---+-----------+--------------+--------------------+
|
注意看第一个过滤语句,虽然id被解析成String了,但是程序依然正确地做出了比较。我们也可以对多个条件进行过滤:
students.filter(
"studentName ='' OR studentName = 'NULL'"
).show(
7
)
+---+-----------+--------------+--------------------+
| id|studentName| phone| email|
+---+-----------+--------------+--------------------+
|
21
| |
1
-
598
-
439
-
7549
|consectetuer.adip...|
|
32
| |
1
-
184
-
895
-
9602
|accumsan.laoreet
@
...|
|
33
| NULL|
1
-
105
-
503
-
0141
|Donec
@
Inmipede.co.uk|
|
45
| |
1
-
245
-
752
-
0481
|Suspendisse.eleif...|
|
83
| |
1
-
858
-
810
-
2204
|sociis.natoque
@
eu...|
|
94
| |
1
-
443
-
410
-
7878
|Praesent.eu.nulla...|
+---+-----------+--------------+--------------------+
|
我们还可以采用类SQL的语法对数据进行过滤:
students.filter(
"SUBSTR(studentName,0,1) ='M'"
).show(
7
)
+---+-----------+--------------+--------------------+
| id|studentName| phone| email|
+---+-----------+--------------+--------------------+
|
10
| Maya|
1
-
271
-
683
-
2698
|accumsan.convalli...|
|
19
| Malachi|
1
-
608
-
637
-
2772
|Proin.mi.Aliquam
@
...|
|
24
| Marsden|
1
-
477
-
629
-
7528
|Donec.dignissim.m...|
|
37
| Maggy|
1
-
910
-
887
-
6777
|facilisi.Sed.nequ...|
|
61
| Maxine|
1
-
422
-
863
-
3041
|aliquet.molestie....|
|
77
| Maggy|
1
-
613
-
147
-
4380
| pellentesque
@
mi.net|
|
97
| Maxwell|
1
-
607
-
205
-
1273
|metus.In
@
musAenea...|
+---+-----------+--------------+--------------------+
only showing top
7
rows
|
对DataFrame里面的数据进行排序
使用sort函数我们可以对DataFrame中指定的列进行排序:
students.sort(students(
"studentName"
).desc).show(
7
)
+---+-----------+--------------+--------------------+
| id|studentName| phone| email|
+---+-----------+--------------+--------------------+
|
50
| Yasir|
1
-
282
-
511
-
4445
|eget.odio.Aliquam...|
|
52
| Xena|
1
-
527
-
990
-
8606
|in.faucibus.orci
@
...|
|
86
| Xandra|
1
-
677
-
708
-
5691
|libero
@
arcuVestib...|
|
43
| Wynter|
1
-
440
-
544
-
1851
|amet.risus.Donec
@
...|
|
31
| Wallace|
1
-
144
-
220
-
8159
| lorem.lorem
@
non.net|
|
66
| Vance|
1
-
268
-
680
-
0857
|pellentesque
@
netu...|
|
41
| Tyrone|
1
-
907
-
383
-
5293
|non.bibendum.sed
@
...|
+---+-----------+--------------+--------------------+
only showing top
7
rows
|
也可以对多列进行排序:
students.sort(
"studentName"
,
"id"
).show(
10
)
+---+-----------+--------------+--------------------+
| id|studentName| phone| email|
+---+-----------+--------------+--------------------+
|
21
| |
1
-
598
-
439
-
7549
|consectetuer.adip...|
|
32
| |
1
-
184
-
895
-
9602
|accumsan.laoreet
@
...|
|
45
| |
1
-
245
-
752
-
0481
|Suspendisse.eleif...|
|
83
| |
1
-
858
-
810
-
2204
|sociis.natoque
@
eu...|
|
94
| |
1
-
443
-
410
-
7878
|Praesent.eu.nulla...|
|
91
| Abel|
1
-
530
-
527
-
7467
| urna
@
veliteu.edu|
|
69
| Aiko|
1
-
682
-
230
-
7013
|turpis.vitae.puru...|
|
47
| Alma|
1
-
747
-
382
-
6775
| nec.enim
@
non.org|
|
26
| Amela|
1
-
526
-
909
-
2605
| in
@
vitaesodales.edu|
|
16
| Amena|
1
-
878
-
250
-
3129
|lorem.luctus.ut
@
s...|
+---+-----------+--------------+--------------------+
only showing top
10
rows
|
从上面的结果我们可以看出,默认是按照升序进行排序的。我们也可以将上面的语句写成下面的:
students.sort(students(
"studentName"
).asc, students(
"id"
).asc).show(
10
)
|
这两个语句运行的效果是一致的。
对列进行重命名
如果我们对DataFrame中默认的列名不感兴趣,我们可以在select的时候利用as对其进行重命名,下面的列子将studentName
重命名为name,而email这列名字不变:
students.select(students(
"studentName"
).as(
"name"
), students(
"email"
)).show(
10
)
+--------+--------------------+
| name| email|
+--------+--------------------+
| Burke|ullamcorper.velit...|
| Kamal|pede.Suspendisse
@
...|
| Olga|Aenean.eget.metus...|
| Belle|vitae.aliquet.nec...|
| Trevor|dapibus.id
@
acturp...|
| Laurel|adipiscing
@
consec...|
| Sara|Donec.nibh
@
enimEt...|
| Kaseem|cursus.et.magna
@
e...|
| Lev|Vivamus.nisi
@
ipsu...|
| Maya|accumsan.convalli...|
+--------+--------------------+
only showing top
10
rows
|
将DataFrame看作是关系型数据表
DataFrame的一个强大之处就是我们可以将它看作是一个关系型数据表,然后在其上运行SQL查询语句,只要我们进行下面两步即可实现:
(1)、将DataFrame注册成一张名为students的表:
students.registerTempTable(
"students"
)
|
(2)、然后我们在其上用标准的SQL进行查询:
sqlContext.sql(
"select * from students where studentName!='' order by email desc"
).show(
7
)
+---+-----------+--------------+--------------------+
| id|studentName| phone| email|
+---+-----------+--------------+--------------------+
|
87
| Selma|
1
-
601
-
330
-
4409
|vulputate.velit
@
p...|
|
96
| Channing|
1
-
984
-
118
-
7533
|viverra.Donec.tem...|
|
4
| Belle|
1
-
246
-
894
-
6340
|vitae.aliquet.nec...|
|
78
| Finn|
1
-
213
-
781
-
6969
|vestibulum.massa
@
...|
|
53
| Kasper|
1
-
155
-
575
-
9346
|velit.eget
@
pedeCu...|
|
63
| Dylan|
1
-
417
-
943
-
8961
|vehicula.aliquet
@
...|
|
35
| Cadman|
1
-
443
-
642
-
5919
|ut.lacus
@
adipisci...|
+---+-----------+--------------+--------------------+
only showing top
7
rows
|
对两个DataFrame进行Join操作
前面我们已经知道如何将DataFrame注册成一张表,现在我们来看看如何使用普通的SQL对两个DataFrame进行Join操作。
1、内联:内联是默认的Join操作,它仅仅返回两个DataFrame都匹配到的结果,来看看下面的例子:
val
students
1
=
sqlContext.csvFile(filePath
=
"E:\\StudentPrep1.csv"
, useHeader
=
true
, delimiter
=
'|'
)
val
students
2
=
sqlContext.csvFile(filePath
=
"E:\\StudentPrep2.csv"
, useHeader
=
true
, delimiter
=
'|'
)
val
studentsJoin
=
students
1
.join(students
2
, students
1
(
"id"
)
===
students
2
(
"id"
))
studentsJoin.show(studentsJoin.count.toInt)
+---+-----------+--------------+--------------------+---+------------------+--------------+--------------------+
| id|studentName| phone| email| id| studentName| phone| email|
+---+-----------+--------------+--------------------+---+------------------+--------------+--------------------+
|
1
| Burke|
1
-
300
-
746
-
8446
|ullamcorper.velit...|
1
|BurkeDifferentName|
1
-
300
-
746
-
8446
|ullamcorper.velit...|
|
2
| Kamal|
1
-
668
-
571
-
5046
|pede.Suspendisse
@
...|
2
|KamalDifferentName|
1
-
668
-
571
-
5046
|pede.Suspendisse
@
...|
|
3
| Olga|
1
-
956
-
311
-
1686
|Aenean.eget.metus...|
3
| Olga|
1
-
956
-
311
-
1686
|Aenean.eget.metus...|
|
4
| Belle|
1
-
246
-
894
-
6340
|vitae.aliquet.nec...|
4
|BelleDifferentName|
1
-
246
-
894
-
6340
|vitae.aliquet.nec...|
|
5
| Trevor|
1
-
300
-
527
-
4967
|dapibus.id
@
acturp...|
5
| Trevor|
1
-
300
-
527
-
4967
|dapibusDifferentE...|
|
6
| Laurel|
1
-
691
-
379
-
9921
|adipiscing
@
consec...|
6
|LaurelInvalidPhone|
000000000
|adipiscing
@
consec...|
|
7
| Sara|
1
-
608
-
140
-
1995
|Donec.nibh
@
enimEt...|
7
| Sara|
1
-
608
-
140
-
1995
|Donec.nibh
@
enimEt...|
|
8
| Kaseem|
1
-
881
-
586
-
2689
|cursus.et.magna
@
e...|
8
| Kaseem|
1
-
881
-
586
-
2689
|cursus.et.magna
@
e...|
|
9
| Lev|
1
-
916
-
367
-
5608
|Vivamus.nisi
@
ipsu...|
9
| Lev|
1
-
916
-
367
-
5608
|Vivamus.nisi
@
ipsu...|
|
10
| Maya|
1
-
271
-
683
-
2698
|accumsan.convalli...|
10
| Maya|
1
-
271
-
683
-
2698
|accumsan.convalli...|
+---+-----------+--------------+--------------------+---+------------------+--------------+--------------------+
|
2、右外联:在内连接的基础上,还包含右表中所有不符合条件的数据行,并在其中的左表列填写NULL ,来看看下面的实例:
val
studentsRightOuterJoin
=
students
1
.join(students
2
, students
1
(
"id"
)
===
students
2
(
"id"
),
"right_outer"
)
studentsRightOuterJoin.show(studentsRightOuterJoin.count.toInt)
+----+-----------+--------------+--------------------+---+--------------------+--------------+--------------------+
| id|studentName| phone| email| id| studentName| phone| email|
+----+-----------+--------------+--------------------+---+--------------------+--------------+--------------------+
|
1
| Burke|
1
-
300
-
746
-
8446
|ullamcorper.velit...|
1
| BurkeDifferentName|
1
-
300
-
746
-
8446
|ullamcorper.velit...|
|
2
| Kamal|
1
-
668
-
571
-
5046
|pede.Suspendisse
@
...|
2
| KamalDifferentName|
1
-
668
-
571
-
5046
|pede.Suspendisse
@
...|
|
3
| Olga|
1
-
956
-
311
-
1686
|Aenean.eget.metus...|
3
| Olga|
1
-
956
-
311
-
1686
|Aenean.eget.metus...|
|
4
| Belle|
1
-
246
-
894
-
6340
|vitae.aliquet.nec...|
4
| BelleDifferentName|
1
-
246
-
894
-
6340
|vitae.aliquet.nec...|
|
5
| Trevor|
1
-
300
-
527
-
4967
|dapibus.id
@
acturp...|
5
| Trevor|
1
-
300
-
527
-
4967
|dapibusDifferentE...|
|
6
| Laurel|
1
-
691
-
379
-
9921
|adipiscing
@
consec...|
6
| LaurelInvalidPhone|
000000000
|adipiscing
@
consec...|
|
7
| Sara|
1
-
608
-
140
-
1995
|Donec.nibh
@
enimEt...|
7
| Sara|
1
-
608
-
140
-
1995
|Donec.nibh
@
enimEt...|
|
8
| Kaseem|
1
-
881
-
586
-
2689
|cursus.et.magna
@
e...|
8
| Kaseem|
1
-
881
-
586
-
2689
|cursus.et.magna
@
e...|
|
9
| Lev|
1
-
916
-
367
-
5608
|Vivamus.nisi
@
ipsu...|
9
| Lev|
1
-
916
-
367
-
5608
|Vivamus.nisi
@
ipsu...|
|
10
| Maya|
1
-
271
-
683
-
2698
|accumsan.convalli...|
10
| Maya|
1
-
271
-
683
-
2698
|accumsan.convalli...|
|
null
|
null
|
null
|
null
|
999
|LevUniqueToSecondRDD|
1
-
916
-
367
-
5608
|Vivamus.nisi
@
ipsu...|
+----+-----------+--------------+--------------------+---+--------------------+--------------+--------------------+
|
3、左外联:在内连接的基础上,还包含左表中所有不符合条件的数据行,并在其中的右表列填写NULL ,同样我们来看看下面的实例:
val
studentsLeftOuterJoin
=
students
1
.join(students
2
, students
1
(
"id"
)
===
students
2
(
"id"
),
"left_outer"
)
studentsLeftOuterJoin.show(studentsLeftOuterJoin.count.toInt)
+---+-----------+--------------+--------------------+----+------------------+--------------+--------------------+
| id|studentName| phone| email| id| studentName| phone| email|
+---+-----------+--------------+--------------------+----+------------------+--------------+--------------------+
|
1
| Burke|
1
-
300
-
746
-
8446
|ullamcorper.velit...|
1
|BurkeDifferentName|
1
-
300
-
746
-
8446
|ullamcorper.velit...|
|
2
| Kamal|
1
-
668
-
571
-
5046
|pede.Suspendisse
@
...|
2
|KamalDifferentName|
1
-
668
-
571
-
5046
|pede.Suspendisse
@
...|
|
3
| Olga|
1
-
956
-
311
-
1686
|Aenean.eget.metus...|
3
| Olga|
1
-
956
-
311
-
1686
|Aenean.eget.metus...|
|
4
| Belle|
1
-
246
-
894
-
6340
|vitae.aliquet.nec...|
4
|BelleDifferentName|
1
-
246
-
894
-
6340
|vitae.aliquet.nec...|
|
5
| Trevor|
1
-
300
-
527
-
4967
|dapibus.id
@
acturp...|
5
| Trevor|
1
-
300
-
527
-
4967
|dapibusDifferentE...|
|
6
| Laurel|
1
-
691
-
379
-
9921
|adipiscing
@
consec...|
6
|LaurelInvalidPhone|
000000000
|adipiscing
@
consec...|
|
7
| Sara|
1
-
608
-
140
-
1995
|Donec.nibh
@
enimEt...|
7
| Sara|
1
-
608
-
140
-
1995
|Donec.nibh
@
enimEt...|
|
8
| Kaseem|
1
-
881
-
586
-
2689
|cursus.et.magna
@
e...|
8
| Kaseem|
1
-
881
-
586
-
2689
|cursus.et.magna
@
e...|
|
9
| Lev|
1
-
916
-
367
-
5608
|Vivamus.nisi
@
ipsu...|
9
| Lev|
1
-
916
-
367
-
5608
|Vivamus.nisi
@
ipsu...|
|
10
| Maya|
1
-
271
-
683
-
2698
|accumsan.convalli...|
10
| Maya|
1
-
271
-
683
-
2698
|accumsan.convalli...|
|
11
| iteblog|
999999
| iteblog
@
iteblog.com|
null
|
null
|
null
|
null
|
+---+-----------+--------------+--------------------+----+------------------+--------------+--------------------+
|
将DataFrame保存成文件
下面我来介绍如何将DataFrame保存到一个文件里面。前面我们加载csv文件用到了load函数,与之对于的用于保存文件可以使用save函数。具体操作包括以下两步:
1、首先创建一个map对象,用于存储一些save函数需要用到的一些属性。这里我将制定保存文件的存放路径和csv的头信息。
val
saveOptions
=
Map(
"header"
->
"true"
,
"path"
->
"iteblog.csv"
)
|
为了基于学习的态度,我们从DataFrame里面选择出studentName和email两列,并且将studentName的列名重定义为name。
val
copyOfStudents
=
students.select(students(
"studentName"
).as(
"name"
), students(
"email"
))
|
2、下面我们调用save函数保存上面的DataFrame数据到iteblog.csv文件夹中
copyOfStudents.write.format(
"com.databricks.spark.csv"
).mode(SaveMode.Overwrite).options(saveOptions).save()
|
mode函数可以接收的参数有Overwrite、Append、Ignore和ErrorIfExists。从名字就可以很好的理解,Overwrite代表覆盖目录下之前存在的数据;Append代表给指定目录下追加数据;Ignore代表如果目录下已经有文件,那就什么都不执行;ErrorIfExists代表如果保存目录下存在文件,那么抛出相应的异常。
需要注意的是,上述path参数指定的是保存文件夹,并不是最后的保存文件名。
转载自过往记忆(http://www.iteblog.com/)