Matching HTML fields in Spark: commonly used Spark built-in functions

Latest recommended article published on 2022-04-24 17:07:41

weixin_39916758


Tags: Spark HTML field matching


explode: splitting a column into rows

Creates a new row for each element in the given array or map column.

It iterates over the values in the column and emits one new row per value:

val dfDates = datas.select($"namespace", $"time", explode(datas("nodes"))).toDF("namespace", "time", "node")
dfDates.show(false)
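To make the effect of explode concrete, here is a minimal sketch. The sample rows, the object name ExplodeSketch, and the local SparkSession settings are assumptions for illustration; only the column names come from the snippet above.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, explode}

object ExplodeSketch {
  // Hypothetical input: one row per namespace, with an array column "nodes"
  def sample(spark: SparkSession): DataFrame = {
    import spark.implicits._
    Seq(
      ("ns1", 1000L, Seq("n1", "n2")),
      ("ns2", 2000L, Seq("n3"))
    ).toDF("namespace", "time", "nodes")
  }

  // One output row per element of "nodes"; namespace and time are repeated
  def explodeNodes(datas: DataFrame): DataFrame =
    datas.select(col("namespace"), col("time"), explode(col("nodes")).as("node"))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("explode-sketch")
      .getOrCreate()
    explodeNodes(sample(spark)).show(false)
    spark.stop()
  }
}
```

Rows whose array is empty or null produce no output rows with explode; explode_outer keeps them with a null element instead.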

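filter: filtering rows

Filters the DataFrame according to the condition passed in.

.cast converts a column to the specified data type.

filter

On an RDD, filter evaluates a predicate against each data item and keeps the items for which it returns true.

$"id".isin(idList: _*) tests whether the column's value is contained in the array, where val idList = points.split(",").sorted.

filterByRange

Filters the elements of a pair RDD, keeping only those whose key falls within the given range.

.isin checks, for each row, whether the column's value is contained in the array.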

val splitDataInfo = dfDates
  .select(
    $"namespace".cast(StringType),
    $"time".cast(LongType),
    $"node.id".cast(StringType),
    $"node.v".cast(StringType).substr(0, 1).as("type"),
    $"node.v".cast(StringType).substr(3, 10).as("v"),
    dfDates("node.t").as("idTime"),
    // from_unixtime(dfDates("node.t") / 1000).as("node_time").cast(TimestampType),
    $"node.s"
  )
  .filter($"id".isin(idList: _*))
  .filter($"namespace" === namespace)
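The two filter calls can be tried in isolation with a small sketch like the one below. The sample rows and the helper name filterByIds are assumptions; the isin and === calls mirror the snippet above.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object IsinFilterSketch {
  // Hypothetical input with the two columns the filters look at
  def sample(spark: SparkSession): DataFrame = {
    import spark.implicits._
    Seq(("ns1", "p1"), ("ns1", "p3"), ("ns2", "p1")).toDF("namespace", "id")
  }

  // Keep rows whose id is in the comma-separated list and whose namespace matches
  def filterByIds(df: DataFrame, points: String, namespace: String): DataFrame = {
    val idList = points.split(",").sorted
    df.filter(col("id").isin(idList: _*)) // expand the array into varargs
      .filter(col("namespace") === namespace)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("isin-sketch")
      .getOrCreate()
    filterByIds(sample(spark), "p2,p1", "ns1").show(false)
    spark.stop()
  }
}
```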

concat_ws: merging columns into a string

Concatenates multiple input string columns into a single string, using the given separator:

df.select(concat_ws(",", $"name", $"age", $"phone").cast(StringType).as("value"))

struct

Combines multiple columns into a single struct column:

// Pack name, age and phone into one struct column named "value"
df.select(struct($"name", $"age", $"phone").as("value")).show(false)
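The difference between the two merging functions is easiest to see side by side. The following is a sketch only: the sample row and the object name are made up, and age is cast to a string explicitly rather than relying on implicit casting.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, concat_ws, struct}
import org.apache.spark.sql.types.StructType

object MergeColumnsSketch {
  // Hypothetical input row
  def sample(spark: SparkSession): DataFrame = {
    import spark.implicits._
    Seq(("Ann", 30, "555-0100")).toDF("name", "age", "phone")
  }

  // concat_ws flattens the values into one delimited string
  def asString(df: DataFrame): DataFrame =
    df.select(
      concat_ws(",", col("name"), col("age").cast("string"), col("phone")).as("value"))

  // struct keeps each value, with its name and type, inside one nested column
  def asStruct(df: DataFrame): DataFrame =
    df.select(struct(col("name"), col("age"), col("phone")).as("value"))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("merge-sketch")
      .getOrCreate()
    asString(sample(spark)).show(false)
    asStruct(sample(spark)).printSchema()
    spark.stop()
  }
}
```

The fields of the struct stay addressable afterwards, for example as value.age, whereas the concatenated string has to be split again to recover them. concat_ws also skips null inputs instead of returning null.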

foldLeft: left-to-right accumulator

// times(List('a', 'b', 'a')) --> List(('a', 2), ('b', 1))
def times(chars: List[Char]): List[(Char, Int)] = {
  // C is capitalised on purpose: in the pattern (C, n) it is matched by
  // equality against the parameter instead of binding a new variable
  def incr(pairs: List[(Char, Int)], C: Char): List[(Char, Int)] =
    pairs match {
      case Nil => List((C, 1))
      case (C, n) :: ps => (C, n + 1) :: ps
      case p :: ps => p :: incr(ps, C)
    }
  chars.foldLeft(List[(Char, Int)]())(incr)
}
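For comparison, the same counting can be folded into a Map, which avoids rescanning the list of pairs for every character. This variant is an illustration, not the original author's code; the names FoldLeftSketch and timesMap are made up.

```scala
object FoldLeftSketch {
  // The accumulator starts as an empty Map and is updated once per character
  def timesMap(chars: List[Char]): Map[Char, Int] =
    chars.foldLeft(Map.empty[Char, Int]) { (counts, c) =>
      counts.updated(c, counts.getOrElse(c, 0) + 1)
    }

  def main(args: Array[String]): Unit =
    println(timesMap(List('a', 'b', 'a')))
}
```

Unlike the list version, a Map does not promise to keep the order of first occurrence, so the list-based times is the one to use when that order matters.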

printSchema: prints the schema

.groupBy groups the Dataset by the specified columns so that aggregations can be run on the groups.

def groupBy(cols: Column*): RelationalGroupedDataset = {
  RelationalGroupedDataset(toDF(), cols.map(_.expr), RelationalGroupedDataset.GroupByType)
}

.pivot: row-to-column aggregation. The values of the pivot column are matched against the given list, and each listed value becomes an output column.

val idList = points.split(",").sorted
val typeTable = splitDataInfo.groupBy($"namespace").pivot("id", idList).agg(collect_list($"type")(0)).filter($"namespace" === (namespace))
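A small end-to-end sketch of the groupBy, pivot and agg chain follows. The sample rows are invented and chosen so that every namespace has exactly one row per id, which keeps the first collected element well defined; the object and method names are assumptions too.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, collect_list}

object PivotSketch {
  // Hypothetical long-format input: one row per (namespace, id)
  def sample(spark: SparkSession): DataFrame = {
    import spark.implicits._
    Seq(
      ("ns1", "p1", "A"), ("ns1", "p2", "B"),
      ("ns2", "p1", "C"), ("ns2", "p2", "D")
    ).toDF("namespace", "id", "type")
  }

  // Turn each listed id into its own column, holding the first collected type
  def typeTable(df: DataFrame, points: String): DataFrame = {
    val idList = points.split(",").sorted.toSeq
    df.groupBy(col("namespace"))
      .pivot("id", idList)
      .agg(collect_list(col("type")).getItem(0))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("pivot-sketch")
      .getOrCreate()
    typeTable(sample(spark), "p2,p1").show(false)
    spark.stop()
  }
}
```

Passing the list of values to pivot fixes the output columns up front, so Spark does not need an extra pass over the data to discover the distinct ids.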

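.agg() aggregates the grouped data and produces a DataFrame:

.agg(collect_list($"type")(0))

collect_list() gathers the values of each group into an array and keeps duplicates; use collect_set() when the values should be deduplicated.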

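Regular-expression matching in Spark

val sparkpath = output + "_sparkData"
data.coalesce(1).write.mode("Overwrite").option("header", true).parquet(sparkpath)
// Regular expression for the HDFS scheme, host and four-digit port
val pattern = """hdfs://(.*?):(\d{4})""".r
// First match of the pattern; .get throws if sparkpath contains no such URI
val uri = (pattern findFirstIn sparkpath).get
// Strip the URI prefix to obtain the path
val path = sparkpath.stripPrefix(uri)
val fs = HdfsUtils.creatFileSystem(uri)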
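The URI-splitting step needs no Spark at all, so it can be checked on its own. Below is a sketch that returns an Option instead of calling .get; the helper name splitUri and the example host are hypothetical.

```scala
object HdfsUriSketch {
  // Same pattern as above: scheme, host, and a four-digit port
  private val pattern = """hdfs://(.*?):(\d{4})""".r

  // Returns (uri, path) when the input starts with an HDFS URI, None otherwise
  def splitUri(fullPath: String): Option[(String, String)] =
    pattern.findFirstIn(fullPath)
      .filter(uri => fullPath.startsWith(uri))
      .map(uri => (uri, fullPath.stripPrefix(uri)))

  def main(args: Array[String]): Unit = {
    println(splitUri("hdfs://namenode:8020/tmp/out_sparkData"))
    println(splitUri("/tmp/out_sparkData"))
  }
}
```

Because the pattern asks for exactly four digits, a five-digit port would be cut after its fourth digit; \d+ would be the more general choice if such ports can occur.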

createStatement: creating a SQL statement on a connection

val conn = JdbcUtil.getConnectionPool
val sql = "UPDATE data_table SET num_of_rows = " + numberofRow + " " +
  ",num_of_columns= " + numberofcolumns + " " +
  ",schema_info= '" + schema_info + "'" +
  " WHERE id = '" + tableId + "'"
val stmt = conn.createStatement
stmt.executeUpdate(sql)
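Building the UPDATE by string concatenation breaks as soon as schema_info contains a quote and leaves the statement open to SQL injection. A possible alternative is a parameterised statement, sketched below with the standard java.sql API. The parameter types (Long, Int, String) are assumptions about the table's columns, and the object and method names are made up.

```scala
import java.sql.Connection

object UpdateTableStatsSketch {
  // Same UPDATE as above, with placeholders instead of concatenated values
  val UpdateSql: String =
    "UPDATE data_table SET num_of_rows = ?, num_of_columns = ?, schema_info = ? WHERE id = ?"

  // Returns the number of rows updated; the caller owns the connection
  def updateTableStats(
      conn: Connection,
      numberOfRows: Long,
      numberOfColumns: Int,
      schemaInfo: String,
      tableId: String): Int = {
    val stmt = conn.prepareStatement(UpdateSql)
    try {
      stmt.setLong(1, numberOfRows)
      stmt.setInt(2, numberOfColumns)
      stmt.setString(3, schemaInfo)
      stmt.setString(4, tableId)
      stmt.executeUpdate()
    } finally {
      stmt.close() // release the statement even if the update fails
    }
  }
}
```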

Time conversion function

from_unixtime(dfDates("time") / 1000, "yyyyMMdd").as("day"),
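The division by 1000 is there because from_unixtime expects seconds while the time column holds milliseconds. A minimal sketch, assuming a made-up sample timestamp and a session time zone pinned to UTC so the result does not depend on the machine:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, from_unixtime}

object FromUnixtimeSketch {
  // Hypothetical input: epoch milliseconds
  def sample(spark: SparkSession): DataFrame = {
    import spark.implicits._
    Seq(1650791261000L).toDF("time")
  }

  // Milliseconds to whole seconds, then format as a yyyyMMdd day string
  def withDay(df: DataFrame): DataFrame =
    df.select(
      col("time"),
      from_unixtime((col("time") / 1000).cast("long"), "yyyyMMdd").as("day"))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("from-unixtime-sketch")
      .config("spark.sql.session.timeZone", "UTC")
      .getOrCreate()
    withDay(sample(spark)).show(false)
    spark.stop()
  }
}
```

The explicit cast to long is a cautious choice here: the division yields a double, and the cast makes the conversion to whole seconds visible instead of leaving it to implicit type coercion.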

Extracting part of a field

1. Call the expr function

expr("substring(v, 3)").cast(StringType).as("v")

2. Use a udf

val substringByIndex = udf {
  (str: String, index: Int) => {
    str.substring(index)
  }
}
// Call the udf to create the column
substringByIndex($"node.v", lit(2)).cast(StringType).as("v")
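As written, the udf throws on a null value or on a string shorter than the index. One possible guard, shown as a sketch with invented sample data, is to return an Option, which Spark stores as a nullable column:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, lit, udf}

object SafeSubstringSketch {
  // None (stored as null) for null input or an index outside the string
  val substringByIndex = udf { (str: String, index: Int) =>
    Option(str)
      .filter(s => index >= 0 && index <= s.length)
      .map(_.substring(index))
  }

  // Hypothetical input, including a short and a null value
  def sample(spark: SparkSession): DataFrame = {
    import spark.implicits._
    Seq("A:hello", "x", null).toDF("v")
  }

  def withTail(df: DataFrame): DataFrame =
    df.select(col("v"), substringByIndex(col("v"), lit(2)).as("tail"))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("udf-sketch")
      .getOrCreate()
    withTail(sample(spark)).show(false)
    spark.stop()
  }
}
```

Note the index conventions: Spark SQL's substring is 1-based, so substring(v, 3) in the expr version selects the same characters as the 0-based str.substring(2) in the udf.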
