Filtering DataFrame rows with .filter
Supported comparison operators: "==", "!=", ">", "<", "<=", ">=", "like", "rlike".
The data looks like this:
scala> df.show(10)
+--------+------------------+------+
|      R1|                G2|labels|
+--------+------------------+------+
|148.6041|4.1254973506233155|   1.0|
|163.6788|2.8350005837741903|   1.0|
|153.9485|1.8033965176854478|   1.0|
|150.3755|1.5140336026654098|   1.0|
| 150.738|1.6580451019197278|   1.0|
|150.1358|  1.28157676321007|   1.0|
|150.0713|1.2962300876001915|   1.0|
| 157.623|1.5737972391639274|   1.0|
|157.7101| 1.490367458045163|   1.0|
|169.3828| 1.968593152482249|   1.0|
+--------+------------------+------+
only showing top 10 rows
Single-condition filtering
1. Comparison operators "==", "!=", ">", "<", "<=", ">=": all of them are used the same way.
scala> df.filter("labels >2 ").show(5)
+--------+------------------+------+
|      R1|                G2|labels|
+--------+------------------+------+
|130.2428| 2.743570053780293|   3.0|
|141.3739|1.7569390541507126|   3.0|
|140.3577|  1.97759550970364|   3.0|
|141.1218|2.3682219300563876|   3.0|
|148.0428|1.5806853070741185|   3.0|
+--------+------------------+------+
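The same comparison can also be written with Column operators instead of a SQL string. The following is a minimal sketch, not the author's code: the `session` and `sample` names are made up for illustration, and the three sample rows are copied from the outputs in this article (one per label value). Note that equality and inequality on a Column are spelled `===` and `=!=`.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Assumption: a local session for illustration; in spark-shell, getOrCreate() returns the existing one.
val session = SparkSession.builder().master("local[*]").appName("filter-comparison-sketch").getOrCreate()

// Hypothetical sample: three rows taken from the outputs shown in this article.
val sample = session.createDataFrame(Seq(
  (148.6041, 4.1254973506233155, 1.0),
  (126.2613, 2.516413249051117, 2.0),
  (130.2428, 2.743570053780293, 3.0)
)).toDF("R1", "G2", "labels")

// SQL-string form, as used throughout this article.
val bySql = sample.filter("labels > 2")

// Column form of the same predicate.
val byColumn = sample.filter(col("labels") > 2)

// Equality and inequality use === and =!= on Columns.
val equalsTwo = sample.filter(col("labels") === 2.0)
val notTwo = sample.filter(col("labels") =!= 2.0)

bySql.show()
byColumn.show()
```

Both forms describe the same predicate; the string form is shorter to type in the shell, while the Column form is checked by the Scala compiler for syntax.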
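2. "like": a row is returned only when the value of the specified column matches the pattern in full. % stands for any number of characters (including none) and _ stands for exactly one character. The labels column is a double, so it is compared in its string form (for example "2.0"), which is why the pattern '2' on its own matches nothing below.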
scala> df.filter("labels like '2' ").show(5)
+---+---+------+
| R1| G2|labels|
+---+---+------+
+---+---+------+
scala> df.filter("labels like '2%' ").show(5)
+--------+------------------+------+
|      R1|                G2|labels|
+--------+------------------+------+
|126.2613| 2.516413249051117|   2.0|
|145.6122|1.6008582573107464|   2.0|
|126.3282| 2.209409006951859|   2.0|
| 139.539|1.7367981316203676|   2.0|
|120.1344| 4.356126691224671|   2.0|
+--------+------------------+------+
only showing top 5 rows
scala> df.filter("labels like '2__' ").show(5)
+--------+------------------+------+
|      R1|                G2|labels|
+--------+------------------+------+
|126.2613| 2.516413249051117|   2.0|
|145.6122|1.6008582573107464|   2.0|
|126.3282| 2.209409006951859|   2.0|
| 139.539|1.7367981316203676|   2.0|
|120.1344| 4.356126691224671|   2.0|
+--------+------------------+------+
only showing top 5 rows
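The three like results above make more sense once the value is viewed as text: the label 2.0 is matched as the three-character string "2.0". The sketch below reproduces the three patterns on a tiny sample. It is only an illustration under stated assumptions: `session`, `sample` and `labelText` are hypothetical names, and the explicit cast to string is added here to make the conversion visible (the article's filters rely on Spark doing it implicitly).

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Assumption: a local session for illustration; in spark-shell, getOrCreate() returns the existing one.
val session = SparkSession.builder().master("local[*]").appName("filter-like-sketch").getOrCreate()

// Hypothetical sample: two rows taken from the outputs shown in this article.
val sample = session.createDataFrame(Seq(
  (126.2613, 2.516413249051117, 2.0),
  (148.6041, 4.1254973506233155, 1.0)
)).toDF("R1", "G2", "labels")

// Explicit cast (an addition for clarity): the double 2.0 becomes the string "2.0".
val labelText = col("labels").cast("string")

// '2' must match the whole value, and "2.0" is not "2": no rows.
val exact = sample.filter(labelText.like("2"))

// '2%' is "2" followed by any number of characters: matches "2.0".
val prefix = sample.filter(labelText.like("2%"))

// '2__' is "2" followed by exactly two characters: matches "2.0".
val threeChars = sample.filter(labelText.like("2__"))

exact.show()
prefix.show()
threeChars.show()
```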
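3. "rlike": a row is returned as soon as the value of the specified column contains a match for the pattern, which is treated as a regular expression.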
scala> df.filter("labels rlike '2' ").show(5)
+--------+------------------+------+
|      R1|                G2|labels|
+--------+------------------+------+
|126.2613| 2.516413249051117|   2.0|
|145.6122|1.6008582573107464|   2.0|
|126.3282| 2.209409006951859|   2.0|
| 139.539|1.7367981316203676|   2.0|
|120.1344| 4.356126691224671|   2.0|
+--------+------------------+------+
only showing top 5 rows
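Because rlike looks for a match anywhere in the value, the pattern '2' would also accept a label such as 12.0. The sketch below shows the difference between an unanchored and an anchored pattern. The names `session`, `sample` and `labelText` are hypothetical, the explicit cast to string is added for clarity, and the 12.0 row is invented purely to show the effect (it does not come from the article's data).

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Assumption: a local session for illustration; in spark-shell, getOrCreate() returns the existing one.
val session = SparkSession.builder().master("local[*]").appName("filter-rlike-sketch").getOrCreate()

// Hypothetical sample: the 12.0 row is invented to show a partial match.
val sample = session.createDataFrame(Seq(
  (126.2613, 2.516413249051117, 2.0),
  (148.6041, 4.1254973506233155, 1.0),
  (150.0, 1.5, 12.0)
)).toDF("R1", "G2", "labels")

// Explicit cast (an addition for clarity): labels are matched as "2.0", "1.0", "12.0".
val labelText = col("labels").cast("string")

// Unanchored: any value containing "2" matches, so both "2.0" and "12.0" are returned.
val containsTwo = sample.filter(labelText.rlike("2"))

// Anchored with ^ and $: only the exact text "2.0" matches.
val exactlyTwo = sample.filter(labelText.rlike("^2\\.0$"))

containsTwo.show()
exactlyTwo.show()
```

Anchoring the pattern is one way to get like-style whole-value matching while keeping regular-expression syntax.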
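Multi-condition filtering
Join the conditions with and or or. Note that and binds more tightly than or, so the filter "labels >2 and R1>140 or G2>2" is evaluated as (labels > 2 and R1 > 140) or G2 > 2; that is why rows with labels 1.0 appear in its output.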
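How and and or group can be checked on a small sample by comparing the unparenthesized filter with two explicitly parenthesized versions. This is a sketch under stated assumptions: `session` and `sample` are hypothetical names, and the four rows are copied from outputs shown in this article.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Assumption: a local session for illustration; in spark-shell, getOrCreate() returns the existing one.
val session = SparkSession.builder().master("local[*]").appName("filter-and-or-sketch").getOrCreate()

// Hypothetical sample: four rows taken from the outputs shown in this article.
val sample = session.createDataFrame(Seq(
  (148.6041, 4.1254973506233155, 1.0), // only G2 > 2 holds
  (130.2428, 2.743570053780293, 3.0),  // labels > 2 and G2 > 2 hold, R1 > 140 does not
  (141.3739, 1.7569390541507126, 3.0), // labels > 2 and R1 > 140 hold
  (150.3755, 1.5140336026654098, 1.0)  // only R1 > 140 holds
)).toDF("R1", "G2", "labels")

// As written in the article: no parentheses.
val asWritten = sample.filter("labels > 2 and R1 > 140 or G2 > 2")

// Same grouping made explicit: (A and B) or C.
val andFirst = sample.filter("(labels > 2 and R1 > 140) or G2 > 2")

// Different grouping: A and (B or C) drops the row where only G2 > 2 holds.
val orFirst = sample.filter("labels > 2 and (R1 > 140 or G2 > 2)")

// Column form of the first grouping, using && and ||.
val columnForm = sample.filter((col("labels") > 2 && col("R1") > 140) || col("G2") > 2)

asWritten.show()
orFirst.show()
```

When a filter mixes and with or, writing the parentheses explicitly avoids having to remember the precedence.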
scala> df.filter("labels >2 and R1>140 or G2>2").show
+--------+------------------+------+
|      R1|                G2|labels|
+--------+------------------+------+
|148.6041|4.1254973506233155|   1.0|
|163.6788|2.8350005837741903|   1.0|
|147.2315|3.7958537787960167|   1.0|
|163.1148|2.2304563748255646|   1.0|
|142.3022|2.1617213048864556|   1.0|
|158.2378| 2.761671776297828|   1.0|
|156.4203| 2.764175318245932|   1.0|
|126.2613| 2.516413249051117|   2.0|
|126.3282| 2.209409006951859|   2.0|
|120.1344| 4.356126691224671|   2.0|