While processing data recently I needed regular-expression matching. For values of the Column type I used regexp_extract, which is defined as follows:
/**
 * Extract a specific group matched by a Java regex, from the specified string column.
 * If the regex did not match, or the specified group did not match, an empty string is
 * returned.
 *
 * Since: 1.5.0
 */
def regexp_extract(e: Column, exp: String, groupIdx: Int): Column = withExpr {
  RegExpExtract(e.expr, lit(exp).expr, lit(groupIdx).expr)
}
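To make the "empty string when nothing matches" behavior concrete, here is a minimal plain-Scala sketch that mimics those semantics on a single String. It does not use Spark and is not Spark's implementation; `RegexpExtractSketch` and `extractGroup` are hypothetical names introduced only for illustration.

```scala
import java.util.regex.Pattern

object RegexpExtractSketch {
  // Hypothetical helper (not Spark code): returns the requested group of the
  // first match, or "" when the regex or that group did not match.
  def extractGroup(input: String, exp: String, groupIdx: Int): String = {
    val matcher = Pattern.compile(exp).matcher(input)
    // group(...) returns null for a group that did not participate in the match
    if (matcher.find()) Option(matcher.group(groupIdx)).getOrElse("")
    else ""
  }

  def main(args: Array[String]): Unit = {
    val pattern = """test\(([\w:.><\-\s\\/]*)\)"""
    println(extractGroup("did=14 test(Datetime) 0042334", pattern, 1))
    println(extractGroup("no token here", pattern, 1).isEmpty)
  }
}
```

Note that this sketch assumes `groupIdx` is a valid group index for the pattern; `Matcher.group` throws an IndexOutOfBoundsException otherwise.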
In plain Scala, however, regular expressions are handled directly by the Regex class:
scala.util.matching.Regex
For the record, here is the test code I used:
package com.qihoo.icebase.apollo.test

import scala.util.matching.Regex

object TestRegex {

  def main(args: Array[String]): Unit = {
    val logInfo = "requestURI:/c?app=2&p=3&did=14 test(Datetime) 0042334&industry=42Ztest(DatetimeCCCD)"
    // Captures whatever sits between the parentheses of test(...)
    val regSameTokenProc: Regex = """test\(([\w:.><\-\s\\/]*)\)""".r

    // findFirstIn returns Option[String]: the text of the first match
    println("findFirstIn:------" + regSameTokenProc.findFirstIn(logInfo).getOrElse(""))
    // findFirstMatchIn returns Option[Regex.Match]; matched is the full matched text
    println("findFirstMatchIn.get.group:------" + regSameTokenProc.findFirstMatchIn(logInfo).map(_.matched).getOrElse(""))

    // Pattern match on the Option instead of falling back to null
    regSameTokenProc.findFirstMatchIn(logInfo) match {
      case Some(matchResult) => println(("match", matchResult.group(1)))
      case None              => println("match null")
    }

    println("\nfindAllIn:")
    regSameTokenProc.findAllIn(logInfo).toList.foreach(println(_))

    println("\nfindAllMatchIn:")
    regSameTokenProc.findAllMatchIn(logInfo).foreach(item => println(item.group(1)))
    println("\n")

    // A Regex can also be used as an extractor; the whole string must match
    val date = """(\d\d\d\d)-(\d\d)-(\d\d)""".r
    "2015-05-23" match {
      case date(year, month, day) => println((year, month, day))
    }
    "2014-05-23" match {
      case date(year, month, _*) => println("The year of the date is " + year)
    }
    "2014-05-23" match {
      case date(_*) => println("It is a date")
    }
  }
}
Test output:
findFirstIn:------test(Datetime)
findFirstMatchIn.get.group:------test(Datetime)
(match,Datetime)
findAllIn:
test(Datetime)
test(DatetimeCCCD)
findAllMatchIn:
Datetime
DatetimeCCCD
(2015,05,23)
The year of the date is 2014
It is a date
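One detail the output above does not make obvious: findFirstIn and findAllIn search anywhere inside the input, while using a Regex as an extractor in a `match` requires the whole string to match, and a `match` without a fallback case throws a MatchError when it does not. The sketch below contrasts the two and uses the standard library's `unanchored` to relax the extractor. `AnchoringSketch`, `yearAnchored` and `yearUnanchored` are hypothetical names used only for illustration.

```scala
import scala.util.matching.Regex

object AnchoringSketch {
  val date: Regex = """(\d\d\d\d)-(\d\d)-(\d\d)""".r

  // Anchored extractor: the whole input must match the pattern.
  def yearAnchored(s: String): Option[String] = s match {
    case date(year, _, _) => Some(year)
    case _                => None // fallback avoids a MatchError
  }

  // Unanchored extractor: the pattern may match anywhere in the input.
  def yearUnanchored(s: String): Option[String] = {
    val anywhere = date.unanchored
    s match {
      case anywhere(year, _, _) => Some(year)
      case _                    => None
    }
  }

  def main(args: Array[String]): Unit = {
    println(yearAnchored("2015-05-23"))
    println(yearAnchored("date: 2015-05-23"))
    println(yearUnanchored("date: 2015-05-23"))
  }
}
```

With these inputs, only the second call is expected to yield None, because the extra "date: " prefix prevents a whole-string match.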