Java String Tokenization

Latest recommended article published on 2023-05-12 14:49:08

a1000005aa

Article link: https://blog.csdn.net/a1000005aa/article/details/84693946

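Java has a type called String, a character string.

We use it for storage all the time. Besides the eight primitive types, Date and String are two special "basic" types.

We touch and process strings constantly but rarely stop to summarize how. For example, we often store user information like this: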

USER

id  name   pass  email
1   "abc"  "bb"  "a@aa.com"

AUTH

id  add   modify  delete
1   true  true    true

USERAUTH

id  userid  authid
1   1       1

We could also store it like this:

SYSTEM

id  user                   auth
1   "abc","bb","a@aa.com"  true,true,true

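This raises a question: isn't reading and saving data stored this way cumbersome? And none of the fields has a definition.

My answer: who needs field definitions? The objects you define in Java are the field definitions. You can add and change fields freely without touching the database schema.

Then another question: the database administrator has no idea how your Java objects are defined. What if they need to grant user A the add permission? Fine, add one more table that holds the data definitions.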

DEFINE

id  column  define
1   user    name pass email
2   auth    add modify delete

Your Java code then builds its objects from this definition. That still leaves open questions: how to search, how to delete....
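As a rough sketch of that idea, the code below pairs the field names from a DEFINE row with the values packed into a SYSTEM column. It assumes the definition is space-separated and the values are comma-separated, as in the tables above; the class name `DefinedRecord` and the method `bind` are hypothetical names made up for this illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class DefinedRecord {
    // Pairs field names from a DEFINE row with values from a packed SYSTEM column.
    // Hypothetical helper: assumes space-separated names and comma-separated values.
    public static Map<String, String> bind(String definition, String packedValues) {
        String[] names = definition.trim().split(" ");
        // A negative limit keeps trailing empty values so positions stay aligned.
        String[] values = packedValues.split(",", -1);
        if (names.length != values.length) {
            throw new IllegalArgumentException("definition and values differ in length");
        }
        Map<String, String> record = new LinkedHashMap<>();
        for (int i = 0; i < names.length; i++) {
            record.put(names[i], values[i]);
        }
        return record;
    }

    public static void main(String[] args) {
        // prints {add=true, modify=true, delete=true}
        System.out.println(bind("add modify delete", "true,true,true"));
    }
}
```

Because the mapping is driven by the stored definition, adding a field only means changing the DEFINE row and the packed value, not the table schema.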

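Every design has its use. This layout suits storing computed results: it avoids per-field joins and the performance loss they bring. It also adds maintenance cost, for example when user data changes.

I tend to ramble, so back to the topic: how to split a string.

Example 1, splitting on commas: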

"a,b,c,d,e"



import java.util.ArrayList;
import java.util.List;

public final class CSVUtil {

	/**
	 * Splits the provided text into an array using the given separator,
	 * optionally preserving all tokens, including empty tokens created by
	 * adjacent separators.
	 *
	 * CSVUtil.split(null, *, true) = null
	 * CSVUtil.split("", *, true) = []
	 * CSVUtil.split("a.b.c", '.', true) = ["a", "b", "c"]
	 * CSVUtil.split("a...c", '.', true) = ["a", "", "", "c"]
	 * CSVUtil.split("a...c", '.', false) = ["a", "c"]
	 *
	 * @param str
	 *            the string to parse
	 * @param separatorChar
	 *            the separator char
	 * @param preserveAllTokens
	 *            if true, adjacent separators are treated as empty token
	 *            separators
	 * @return the array of tokens, or null if str is null
	 */
	public static String[] split(String str, char separatorChar, boolean preserveAllTokens) {
		if (str == null) {
			return null;
		}
		int len = str.length();
		if (len == 0) {
			return new String[0];
		}
		List<String> list = new ArrayList<String>();
		int i = 0, start = 0;
		// match: a non-separator char has been seen in the current token
		boolean match = false;
		// lastMatch: the previous char was a separator that just closed a token
		boolean lastMatch = false;
		while (i < len) {
			if (str.charAt(i) == separatorChar) {
				if (match || preserveAllTokens) {
					list.add(str.substring(start, i));
					match = false;
					lastMatch = true;
				}
				start = ++i;
				continue;
			}
			lastMatch = false;
			match = true;
			i++;
		}
		// Add the final token; && binds tighter than ||
		if (match || preserveAllTokens && lastMatch) {
			list.add(str.substring(start, i));
		}
		return list.toArray(new String[list.size()]);
	}
}

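Approach: loop over every character. When the character is a separator, check two things:

1. Has any non-separator character been matched since the last token (match)?
2. Should empty tokens be preserved (preserveAllTokens)?

When preserveAllTokens is false, empty tokens produced by the split are ignored, and the loop effectively becomes: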


while (i < len) {
	if (str.charAt(i) == separatorChar) {
		if (match) {
			list.add(str.substring(start, i));
			lastMatch = true;
			match = false;
		}
		start = ++i;
		continue;
	}
	lastMatch = false;
	match = true;
	i++;
}

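Both flags start as false. A non-separator character sets match to true, and a separator that closes a token resets it to false. So when a separator is reached with match true, the token is non-empty; with match false, the separator directly follows another separator (or the start of the string).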


if (match || preserveAllTokens && lastMatch) {
	list.add(str.substring(start, i));
}

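This handles whatever follows the last comma. If there is content after it, lastMatch is false and match is true, so the token is added.
If nothing follows it, lastMatch is true and match is false, so the trailing empty token is added only when preserveAllTokens is true.

On a separator, start = ++i moves the token start to the next index. If the next character is also a separator, start equals i at that point, so substring(start, i) is empty.
If preserveAllTokens is true, the token is added whether or not match is true.
If preserveAllTokens is false, a token is added only when match is true, so empty tokens, including the one after a trailing comma, are skipped.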
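To see the two flags at work, here is a small self-contained sketch. It copies the split logic shown earlier into a hypothetical class `SplitFlagsDemo` (the class name is made up so the demo runs on its own) and tries inputs with adjacent and trailing separators.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SplitFlagsDemo {
    // Same logic as the split method discussed above, copied for a standalone demo.
    public static String[] split(String str, char separatorChar, boolean preserveAllTokens) {
        if (str == null) {
            return null;
        }
        int len = str.length();
        if (len == 0) {
            return new String[0];
        }
        List<String> list = new ArrayList<String>();
        int i = 0, start = 0;
        boolean match = false;     // a non-separator char was seen in the current token
        boolean lastMatch = false; // set when a separator has just closed a token
        while (i < len) {
            if (str.charAt(i) == separatorChar) {
                if (match || preserveAllTokens) {
                    list.add(str.substring(start, i));
                    match = false;
                    lastMatch = true;
                }
                start = ++i;
                continue;
            }
            lastMatch = false;
            match = true;
            i++;
        }
        if (match || preserveAllTokens && lastMatch) {
            list.add(str.substring(start, i));
        }
        return list.toArray(new String[list.size()]);
    }

    public static void main(String[] args) {
        // Adjacent separators: the empty token is kept only when preserving.
        System.out.println(Arrays.toString(split("a,,b", ',', true)));  // [a, , b]
        System.out.println(Arrays.toString(split("a,,b", ',', false))); // [a, b]
        // Trailing separator: lastMatch is true when the loop ends.
        System.out.println(split("a,b,", ',', true).length);  // 3
        System.out.println(split("a,b,", ',', false).length); // 2
    }
}
```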

Some may find this splitter too simple. After all, the following code works too:


String[]  cc = "a,b,c,,d,e,f".split(",");

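The JDK's String.split handles this, and it could hardly be simpler. But the one-argument form discards trailing empty strings, so the empty value after a final comma is lost. That is why a custom splitter is still worthwhile.

The JDK's String.split also has an overload that takes a limit: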


String[]  cc = "a,b,c,,d,e,f".split(",",7);


 * <p> The <tt>limit</tt> parameter controls the number of times the
 * pattern is applied and therefore affects the length of the resulting
 * array.  If the limit <i>n</i> is greater than zero then the pattern
 * will be applied at most <i>n</i> - 1 times, the array's
 * length will be no greater than <i>n</i>, and the array's last entry
 * will contain all input beyond the last matched delimiter.  If <i>n</i>
 * is non-positive then the pattern will be applied as many times as
 * possible and the array can have any length.  If <i>n</i> is zero then
 * the pattern will be applied as many times as possible, the array can
 * have any length, and trailing empty strings will be discarded.

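limit = 0: the pattern is applied as many times as possible, and trailing empty strings are discarded.
limit = n > 0: the pattern is applied at most n - 1 times, so the array has at most n entries; the last entry contains all input beyond the last matched delimiter.
limit < 0: the pattern is applied as many times as possible, and trailing empty strings are kept.

Simple: with a large enough (or negative) limit, this overload keeps the value after the last comma.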
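The three cases of the limit parameter can be checked with a short sketch. The class `SplitLimitDemo` and its helper `show` are hypothetical names used only for this illustration; `String.split(String, int)` is the standard JDK method.

```java
import java.util.Arrays;

public class SplitLimitDemo {
    // Hypothetical helper: formats the result of input.split(",", limit).
    public static String show(String input, int limit) {
        return Arrays.toString(input.split(",", limit));
    }

    public static void main(String[] args) {
        String input = "a,b,,";
        // limit == 0: split everywhere, then drop trailing empty strings.
        System.out.println(show(input, 0));  // [a, b]
        // limit > 0: at most limit entries; the last one holds the rest of the input.
        System.out.println(show(input, 2));  // [a, b,,]
        // limit < 0: split everywhere and keep trailing empty strings.
        System.out.println(show(input, -1)); // [a, b, , ]
    }
}
```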

Example:


		String[]  dd = "a,,,,,,d,e,".split(","); 
		System.out.println(Arrays.toString(dd)+":"+dd.length);

		String[]  ee = "a,,,,,,d,e,".split(",",8); 
		System.out.println(Arrays.toString(ee)+":"+ee.length);


		String[]  ff = "a,,,,,,d,e,".split(",",10); 
		System.out.println(Arrays.toString(ff)+":"+ff.length);

Output:


[a, , , , , , d, e]:8
[a, , , , , , d, e,]:8
[a, , , , , , d, e, ]:9

But with our own split method:

String[] aa = CSVUtil.split("a,,,,,,d,e,", ',', false);
System.out.println(Arrays.toString(aa)+":"+aa.length);

String[] bb = CSVUtil.split("a,,,,,,d,e,", ',', true);
System.out.println(Arrays.toString(bb)+":"+bb.length);

Output:


[a, d, e]:3
[a, , , , , , d, e, ]:9

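Which API do you prefer? I would pick the second one, our own, for its plain character loop, its room for customization, and so on.

Let's look at how the JDK implements it: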


import java.util.ArrayList;
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PatternSplitDemo {

	// Adapted from Pattern.split(CharSequence, int)
	static String[] split(String input, Pattern pattern, int limit) {
		int index = 0;
		boolean matchLimited = limit > 0;
		ArrayList<String> matchList = new ArrayList<String>();
		Matcher m = pattern.matcher(input);
		while (m.find()) {
			if (!matchLimited || matchList.size() < limit - 1) {
				String match = input.subSequence(index, m.start()).toString();
				matchList.add(match);
				index = m.end();
			} else if (matchList.size() == limit - 1) { // last one
				String match = input.subSequence(index, input.length()).toString();
				matchList.add(match);
				index = m.end();
			}
		}

		// If no match was found, return the input unchanged
		if (index == 0)
			return new String[] { input };

		// Add remaining segment
		if (!matchLimited || matchList.size() < limit)
			matchList.add(input.subSequence(index, input.length()).toString());

		// Construct result
		int resultSize = matchList.size();
		if (limit == 0) {
			// With limit 0, trailing "" entries are deleted at this step.
			// To keep them, change the limit, e.g. to input.length().
			while (resultSize > 0 && matchList.get(resultSize - 1).equals("")) {
				resultSize--;
			}
		}
		String[] result = new String[resultSize];
		return matchList.subList(0, resultSize).toArray(result);
	}

	public static void main(String[] args) {
		// The original passed the raw flag value 2, which is CASE_INSENSITIVE
		Pattern pattern = Pattern.compile("A", Pattern.CASE_INSENSITIVE);
		String[] hh = split("abcddaeaaa", pattern, 100);
		System.out.println(Arrays.toString(hh) + ":" + hh.length);
	}
}
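This is adapted from Pattern.split, matching with Pattern.CASE_INSENSITIVE (flag value 2).
The default String.split(",") uses limit 0, so the "construct result" step strips every trailing empty string from the array. That is why the value after a final "," is dropped.

Keeping the last entry is easy too. Pass limit = input.length(), which works here because this string has fewer tokens than characters: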


String input1 = "a,,,,,,d,e,";
String[]  ii = input1.split(",",input1.length());  
System.out.println(Arrays.toString(ii)+":"+ii.length);

Output:


[a, , , , , , d, e, ]:9

This is the same as the result of
CSVUtil.split("a,,,,,,d,e,", ',', true);
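One caveat worth a sketch: limit = input.length() is not always large enough, because a string made only of separators has one more token than it has characters. A negative limit avoids the arithmetic. The class `NegativeLimitDemo` and its helper `show` are hypothetical names for this illustration.

```java
import java.util.Arrays;

public class NegativeLimitDemo {
    // Hypothetical helper: formats the tokens and their count for a given limit.
    public static String show(String input, int limit) {
        String[] parts = input.split(",", limit);
        return Arrays.toString(parts) + ":" + parts.length;
    }

    public static void main(String[] args) {
        String input = "a,,,,,,d,e,";
        // Both limits keep the trailing empty token for this input.
        System.out.println(show(input, input.length())); // [a, , , , , , d, e, ]:9
        System.out.println(show(input, -1));             // [a, , , , , , d, e, ]:9

        // Three separators give four tokens, so limit = length() stops one split early.
        System.out.println(show(",,,", 3));  // [, , ,]:3
        System.out.println(show(",,,", -1)); // [, , , ]:4
    }
}
```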

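So which would you pick for splitting strings now? JDK code has been through countless refinements, so using it is simpler, more stable, and faster. Or is it?

Some say the simpler a feature, the better. The JDK's string facilities are very powerful: its regex support can find, match, split, and more. But the limit is not easy to set correctly, and a regex-based split has to build a Pattern and a Matcher each time, which cannot be as direct as walking the characters. (Newer JDKs do have a fast path in String.split for single-character literal separators that skips the regex engine, so measure before deciding.)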
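If the regex route is kept, one common way to avoid rebuilding the Pattern on every call is to compile it once and reuse it. This is only a sketch: the class `PrecompiledSplit` and the method `splitKeepingEmpty` are hypothetical names, and whether this beats the character loop for a given workload would need measuring.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

public class PrecompiledSplit {
    // Compiled once and reused; Pattern instances are immutable and thread-safe.
    private static final Pattern COMMA = Pattern.compile(",");

    // Hypothetical helper: splits on commas and keeps trailing empty strings.
    public static String[] splitKeepingEmpty(String input) {
        // A negative limit keeps trailing empty strings.
        return COMMA.split(input, -1);
    }

    public static void main(String[] args) {
        String[] parts = splitKeepingEmpty("a,,,,,,d,e,");
        System.out.println(Arrays.toString(parts) + ":" + parts.length); // [a, , , , , , d, e, ]:9
    }
}
```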