Understanding the split function

逃出你的肖生克

Category: Scala随笔 (Scala notes). Tags: scala

Article link: https://blog.csdn.net/do_yourself_go_on/article/details/74931446

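While doing data analysis with Spark today I ran into a problem: the number of fields I got from parsing a file never matched the number I expected. After repeated troubleshooting, the cause turned out to be incorrect use of the split function in Scala. Only after reading the description of split in the Java API documentation did I really understand how it works. Below are my notes.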

Official API documentation

**1. String[] split(String regex)**
  Splits this string around matches of the given regular expression.

**2. String[] split(String regex, int limit)**
  Splits this string around matches of the given regular expression.

As shown above, split has two overloads. The main difference between them is the second parameter, limit.

The two-argument form: split(String regex, int limit)

Official explanation:

Splits this string around matches of the given regular expression.
// The string is split wherever the given regular expression matches.
The array returned by this method contains each substring of this string that is terminated by another substring that matches the given expression or is terminated by the end of the string. The substrings in the array are in the order in which they occur in this string. If the expression does not match any part of the input then the resulting array has just one element, namely this string.
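// The string is cut into substrings at each match of the regular expression and the substrings are returned as an array, in the same order in which they appear in the string. If the regular expression matches nothing, the whole string is returned as the single element of the array.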
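The two properties described above (order is preserved, and a non-matching pattern yields a one-element array) can be checked with a minimal sketch. The class name `SplitBasics` and the helper `splitOnComma` are hypothetical names introduced only for this illustration; the comma delimiter is an assumed example.

```java
import java.util.Arrays;

public class SplitBasics {
    // Hypothetical helper: splits on a literal comma.
    // Substrings come back in the order they occur in the input.
    static String[] splitOnComma(String line) {
        return line.split(",");
    }

    public static void main(String[] args) {
        // Three fields, in input order: [a, b, c]
        System.out.println(Arrays.toString(splitOnComma("a,b,c")));
        // No comma in the input, so the whole string is the only element: [abc]
        System.out.println(Arrays.toString(splitOnComma("abc")));
    }
}
```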

The limit parameter controls the number of times the pattern is applied and therefore affects the length of the resulting array. If the limit n is greater than zero then the pattern will be applied at most n - 1 times, the array's length will be no greater than n, and the array's last entry will contain all input beyond the last matched delimiter. If n is non-positive then the pattern will be applied as many times as possible and the array can have any length. If n is zero then the pattern will be applied as many times as possible, the array can have any length, and trailing empty strings will be discarded.
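// The limit parameter controls how many times the pattern is applied, and therefore the length of the result array. If limit n is greater than 0, the pattern is applied at most n - 1 times, the array has at most n elements, and the last element holds the remaining, unsplit part of the string. If n is less than 0, the pattern is applied as many times as possible and the array length is unbounded. If n equals 0, the pattern is likewise applied as many times as possible with no bound on the array length, but trailing empty strings are dropped instead of being returned as elements. This is what distinguishes n = 0 from a negative n.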

The string "boo:and:foo", for example, yields the following results with these parameters:

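| Regex | Limit | Result | Note |
|-------|-------|--------|------|
| `:` | 2 | `{ "boo", "and:foo" }` | limit caps the number of matches; the pattern is applied once here |
| `:` | 5 | `{ "boo", "and", "foo" }` | |
| `:` | -2 | `{ "boo", "and", "foo" }` | |
| `o` | 5 | `{ "b", "", ":and:f", "", "" }` | |
| `o` | -2 | `{ "b", "", ":and:f", "", "" }` | trailing empty strings are kept |
| `o` | 0 | `{ "b", "", ":and:f" }` | trailing empty strings are discarded |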
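A few rows of the table can be reproduced directly. This is only a sketch: `SplitLimitDemo` and `splitWithLimit` are hypothetical names, and the helper is just a thin wrapper around `String.split(String, int)` so that different (regex, limit) pairs are easy to try.

```java
import java.util.Arrays;

public class SplitLimitDemo {
    // Hypothetical wrapper around String.split(regex, limit).
    static String[] splitWithLimit(String input, String regex, int limit) {
        return input.split(regex, limit);
    }

    public static void main(String[] args) {
        String s = "boo:and:foo";

        // limit = 2: the pattern is applied once, the rest stays in the last element.
        System.out.println(Arrays.toString(splitWithLimit(s, ":", 2))); // [boo, and:foo]

        // Negative limit: all matches are used, trailing empty strings are kept.
        System.out.println(splitWithLimit(s, "o", -2).length); // 5

        // limit = 0: all matches are used, trailing empty strings are dropped.
        System.out.println(splitWithLimit(s, "o", 0).length); // 3
    }
}
```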

The one-argument form: split(String regex)

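split(String regex) is a special case of split(String regex, int limit), namely the case limit = 0. It is the form we use most often, but it discards trailing empty strings. When parsing data this can produce the wrong number of fields, so it deserves particular care.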
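The field-count mismatch can be illustrated with a short sketch. The record `"1,alice,,"` is a made-up four-column line whose last two columns are empty, and `FieldCount`, `countFieldsDefault` and `countFieldsKeepEmpty` are hypothetical names used only here. Passing a negative limit is one way to keep the trailing empty fields.

```java
public class FieldCount {
    // One-argument split behaves like limit = 0:
    // trailing empty strings are dropped.
    static int countFieldsDefault(String line) {
        return line.split(",").length;
    }

    // Negative limit: every field is kept, including trailing empty ones.
    static int countFieldsKeepEmpty(String line) {
        return line.split(",", -1).length;
    }

    public static void main(String[] args) {
        // Hypothetical record with four columns; the last two are empty.
        String line = "1,alice,,";
        System.out.println(countFieldsDefault(line));   // 2
        System.out.println(countFieldsKeepEmpty(line)); // 4
    }
}
```

From Scala, `line.split(",", -1)` on a String resolves to this same `java.lang.String` method, so the same fix should apply when parsing lines in a Spark job.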
