Thinking in Java (13): Strings and Regular Expressions

Latest recommended article published 2022-08-28 22:07:10

iaiti

Article link: https://blog.csdn.net/iaiti/article/details/39211883

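The previous post covered basic String usage and compared StringBuilder with String. This one continues from there.

To give you a sense of how formidable RednaxelaFX is, consider the books he read in college.

Well, that is the level it takes to get to Silicon Valley, so let's keep reading.

1) Formatted output

C's printf cannot join values with an overloaded + operator the way Java strings can. It takes a format string instead: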

printf("%d %f", x , y);

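%d and its relatives are format specifiers. %d stands for an integer, so x is inserted at the position of %d; %f stands for a floating-point number, so y is inserted at the position of %f.

Java borrowed this from C: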

public class TestString {
    public static void main(String[] args) {
        int x = 1;
        float y = 1.223f;
        System.out.printf("%d %f",x,y);
        System.out.println();
        System.out.format("%d %f",x,y);
    }
}

With a Formatter you can control column spacing on the console precisely, so you no longer have to count spaces yourself.

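import java.util.Formatter;

public class TestString {
    public static void main(String[] args) {
        // Send formatted output straight to the console.
        Formatter fm = new Formatter(System.out);
        // %-5s: left-justified, width 5; %5s and %10s: right-justified.
        fm.format("%-5s %5s %10s ", "Name","Age","School");
    }
}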

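A specifier such as %5s (a number followed by s) works as in C: the number is the minimum field width, values are right-justified by default, and a leading - left-justifies them instead.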
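As a minimal sketch of how the width and the - flag interact, the hypothetical helper row below (the class and method names are made up for illustration) formats one table row into a StringBuilder so that the padding can be inspected:

```java
import java.util.Formatter;

class FormatWidthDemo {
    // Builds one row: name left-justified in 5 columns,
    // age right-justified in 5, school right-justified in 10.
    static String row(String name, String age, String school) {
        StringBuilder sb = new StringBuilder();
        try (Formatter fm = new Formatter(sb)) {
            fm.format("%-5s %5s %10s", name, age, school);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Brackets make the leading and trailing padding visible.
        System.out.println("[" + row("Name", "Age", "School") + "]");
        System.out.println("[" + row("Tom", "20", "MIT") + "]");
    }
}
```

Each row comes out 22 characters wide (5 + 1 + 5 + 1 + 10), so the columns line up regardless of the values, as long as no value is longer than its field.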

 System.out.println(String.format("%h", 17));
 fm.format("%h", 17);

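%h prints the hash code of its argument in hexadecimal. The hash code of the Integer 17 is 17, so both lines print 11. To format an integer itself as hexadecimal, %x is the usual choice.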
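The difference between %h and %x only shows up when the argument is not an integer. A small sketch (the class and method names are illustrative, not from the book):

```java
class HexFormatDemo {
    // %h: hexadecimal form of the argument's hashCode().
    static String viaH(Object o) {
        return String.format("%h", o);
    }

    // %x: hexadecimal form of the integer value itself.
    static String viaX(int n) {
        return String.format("%x", n);
    }

    public static void main(String[] args) {
        System.out.println(viaH(17));    // Integer.hashCode() is the value, so same as %x
        System.out.println(viaX(17));
        System.out.println(viaH("hi"));  // hash code of the String, not its characters
        System.out.println(viaX(255));
    }
}
```

For a String argument, %h prints Integer.toHexString(s.hashCode()), while %x would reject the String with an exception.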

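2) Regular expressions (regex)

Regular expressions come up constantly in string processing and batch file processing. They are very handy and just as easy to forget. This section combines the book with some material from the web.

-? matches an optional minus sign. It covers only the sign, not the digits that follow.

\d matches a single digit. Note that in other languages \\ means "a literal backslash in the regular expression", whereas in Java source \\ means "I am inserting a regular expression backslash, so the next character has special meaning".

By the same logic, \d is written \\d in Java, and matching one literal backslash takes four: \\\\.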
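The two levels of escaping (Java string literal first, regex second) can be checked directly. This is a small sketch with made-up helper names:

```java
class BackslashDemo {
    // Java literal "\\d" is the two-character regex \d (one digit).
    static boolean isSingleDigit(String s) {
        return s.matches("\\d");
    }

    // Java literal "\\\\" is the two-character regex \\ (one literal backslash).
    static boolean isSingleBackslash(String s) {
        return s.matches("\\\\");
    }

    public static void main(String[] args) {
        System.out.println(isSingleDigit("7"));       // one digit
        System.out.println(isSingleDigit("77"));      // two digits, whole string must match
        System.out.println(isSingleBackslash("\\"));  // the string holds one backslash
    }
}
```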

Matching with String: use the matches method of String

public class TestString {
    public static void main(String[] args) {
        System.out.println("-3444".matches("-?\\d+"));
        System.out.println("-3".matches("-?\\d"));
        System.out.println("-3".matches("(-|\\+)?\\d"));
    }
}

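Result: all three print true.

(-|\\+)? is the trickiest part. | means "or". The plus sign has a special meaning in a regex, so it must be escaped as \\+. The whole group therefore means: either a plus sign or a minus sign, or neither.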
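For comparison, here is a sketch of the same optional-sign check written two ways: with the alternation from the text and with a character class. The class and method names are illustrative; inside [...] the + needs no escape, and a - placed first is literal.

```java
class SignDemo {
    // Alternation: minus or (escaped) plus, optional, then digits.
    static boolean isSignedInt(String s) {
        return s.matches("(-|\\+)?\\d+");
    }

    // Character class: same meaning for this pattern.
    static boolean isSignedIntAlt(String s) {
        return s.matches("[-+]?\\d+");
    }

    public static void main(String[] args) {
        for (String s : new String[] {"+42", "-42", "42", "+-42", ""}) {
            System.out.println("\"" + s + "\" -> " + isSignedInt(s) + " / " + isSignedIntAlt(s));
        }
    }
}
```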

The split method:

The most common use is splitting on a space.

String s = Arrays.toString("sdfsdf sf sdf".split(" "));

The argument to split is actually a regular expression, so you can also split on a pattern:

String s = Arrays.toString("sdfsdf sf sdf".split("\\W+"));
String s2 = Arrays.toString("sdfsdf sf sdf".split("n\\W+"));

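\W is a non-word character and \w is a word character. n\\W+ means the letter n followed by one or more non-word characters. The sample string contains no n, so the second split finds no delimiter and returns the whole string as a single element.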
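To see n\\W+ actually split something, here is a sketch with an input that does contain n before punctuation or spaces. The helper names and sample strings are made up for illustration:

```java
import java.util.Arrays;

class SplitRegexDemo {
    // Split on runs of non-word characters.
    static String[] words(String input) {
        return input.split("\\W+");
    }

    // Split wherever the letter n is followed by non-word characters;
    // the n itself is part of the delimiter and is removed.
    static String[] afterN(String input) {
        return input.split("n\\W+");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(words("one, two;  three")));
        System.out.println(Arrays.toString(afterN("plan, then run")));
        System.out.println(Arrays.toString(afterN("sdfsdf sf sdf")));
    }
}
```

Note that the final n of "run" is kept, because no non-word character follows it.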

References:

http://blog.csdn.net/kdnuggets/article/details/2526588

and the JDK documentation for the Pattern class:

Construct	Matches

Characters
x	The character x
`\\`	The backslash character
`\0`n	The character with octal value `0`n (0 `<=` n `<=` 7)
`\0`nn	The character with octal value `0`nn (0 `<=` n `<=` 7)
`\0`mnn	The character with octal value `0`mnn (0 `<=` m `<=` 3, 0 `<=` n `<=` 7)
`\x`hh	The character with hexadecimal value `0x`hh
`\u`hhhh	The character with hexadecimal value `0x`hhhh
`\x`{h...h}	The character with hexadecimal value `0x`h...h (`Character.MIN_CODE_POINT` <= `0x`h...h <= `Character.MAX_CODE_POINT`)
`\t`	The tab character (`'\u0009'`)
`\n`	The newline (line feed) character (`'\u000A'`)
`\r`	The carriage-return character (`'\u000D'`)
`\f`	The form-feed character (`'\u000C'`)
`\a`	The alert (bell) character (`'\u0007'`)
`\e`	The escape character (`'\u001B'`)
`\c`x	The control character corresponding to x

Character classes
`[abc]`	`a`, `b`, or `c` (simple class)
`[^abc]`	Any character except `a`, `b`, or `c` (negation)
`[a-zA-Z]`	`a` through `z` or `A` through `Z`, inclusive (range)
`[a-d[m-p]]`	`a` through `d`, or `m` through `p`:`[a-dm-p]` (union)
`[a-z&&[def]]`	`d`, `e`, or `f` (intersection)
`[a-z&&[^bc]]`	`a` through `z`, except for `b` and `c`: `[ad-z]` (subtraction)
`[a-z&&[^m-p]]`	`a` through `z`, and not `m` through `p`: `[a-lq-z]`(subtraction)

Predefined character classes
`.`	Any character (may or may not match line terminators)
`\d`	A digit: `[0-9]`
`\D`	A non-digit: `[^0-9]`
`\s`	A whitespace character: `[ \t\n\x0B\f\r]`
`\S`	A non-whitespace character: `[^\s]`
`\w`	A word character: `[a-zA-Z_0-9]`
`\W`	A non-word character: `[^\w]`

POSIX character classes (US-ASCII only)
`\p{Lower}`	A lower-case alphabetic character: `[a-z]`
`\p{Upper}`	An upper-case alphabetic character:`[A-Z]`
`\p{ASCII}`	All ASCII:`[\x00-\x7F]`
`\p{Alpha}`	An alphabetic character:`[\p{Lower}\p{Upper}]`
`\p{Digit}`	A decimal digit: `[0-9]`
`\p{Alnum}`	An alphanumeric character:`[\p{Alpha}\p{Digit}]`
`\p{Punct}`	Punctuation: One of !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
`\p{Graph}`	A visible character: `[\p{Alnum}\p{Punct}]`
`\p{Print}`	A printable character: `[\p{Graph}\x20]`
`\p{Blank}`	A space or a tab: `[ \t]`
`\p{Cntrl}`	A control character: `[\x00-\x1F\x7F]`
`\p{XDigit}`	A hexadecimal digit: `[0-9a-fA-F]`
`\p{Space}`	A whitespace character: `[ \t\n\x0B\f\r]`

java.lang.Character classes (simple java character type)
`\p{javaLowerCase}`	Equivalent to java.lang.Character.isLowerCase()
`\p{javaUpperCase}`	Equivalent to java.lang.Character.isUpperCase()
`\p{javaWhitespace}`	Equivalent to java.lang.Character.isWhitespace()
`\p{javaMirrored}`	Equivalent to java.lang.Character.isMirrored()

Classes for Unicode scripts, blocks, categories and binary properties
`\p{IsLatin}`	A Latin script character (script)
`\p{InGreek}`	A character in the Greek block (block)
`\p{Lu}`	An uppercase letter (category)
`\p{IsAlphabetic}`	An alphabetic character (binary property)
`\p{Sc}`	A currency symbol
`\P{InGreek}`	Any character except one in the Greek block (negation)
`[\p{L}&&[^\p{Lu}]]`	Any letter except an uppercase letter (subtraction)

Boundary matchers
`^`	The beginning of a line
`$`	The end of a line
`\b`	A word boundary
`\B`	A non-word boundary
`\A`	The beginning of the input
`\G`	The end of the previous match
`\Z`	The end of the input but for the final terminator, if any
`\z`	The end of the input

Quantifiers: how they consume text

Greedy quantifiers
X`?`	X, once or not at all
X`*`	X, zero or more times
X`+`	X, one or more times
X`{`n`}`	X, exactly n times
X`{`n`,}`	X, at least n times
X`{`n`,`m`}`	X, at least n but not more than m times

Reluctant quantifiers
X`??`	X, once or not at all
X`*?`	X, zero or more times
X`+?`	X, one or more times
X`{`n`}?`	X, exactly n times
X`{`n`,}?`	X, at least n times
X`{`n`,`m`}?`	X, at least n but not more than m times

Possessive quantifiers
X`?+`	X, once or not at all
X`*+`	X, zero or more times
X`++`	X, one or more times
X`{`n`}+`	X, exactly n times
X`{`n`,}+`	X, at least n times
X`{`n`,`m`}+`	X, at least n but not more than m times

Logical operators
XY	X followed by Y
X`|`Y	Either X or Y
`(`X`)`	X, as a capturing group

Back references
`\`n	Whatever the nth capturing group matched
`\`k<name>	Whatever the named-capturing group "name" matched

Quotation
`\`	Nothing, but quotes the following character
`\Q`	Nothing, but quotes all characters until `\E`
`\E`	Nothing, but ends quoting started by `\Q`

Special constructs (named-capturing and non-capturing)
`(?<name>`X`)`	X, as a named-capturing group
`(?:`X`)`	X, as a non-capturing group
`(?idmsuxU-idmsuxU)`	Nothing, but turns match flags i d m s u x U on - off
`(?idmsux-idmsux:`X`)`	X, as a non-capturing group with the given flags i d m s u x on - off
`(?=`X`)`	X, via zero-width positive lookahead
`(?!`X`)`	X, via zero-width negative lookahead
`(?<=`X`)`	X, via zero-width positive lookbehind
`(?<!`X`)`	X, via zero-width negative lookbehind
`(?>`X`)`	X, as an independent, non-capturing group
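The three quantifier families in the table differ only in how much text they take and whether they give any back. A minimal sketch, using a made-up helper firstMatch and a made-up input string:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuantifierDemo {
    // Returns the first match of regex in input, or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String input = "<a><b>";
        // Greedy: .+ takes as much as possible, then backs off to the last '>'.
        System.out.println(firstMatch("<.+>", input));
        // Reluctant: .+? takes as little as possible, stopping at the first '>'.
        System.out.println(firstMatch("<.+?>", input));
        // Possessive: .++ takes everything and never backs off,
        // so the closing '>' in the pattern can never match.
        System.out.println(firstMatch("<.++>", input));
    }
}
```

The greedy form returns the whole string, the reluctant form returns only the first tag, and the possessive form finds nothing.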

3) Pattern and Matcher

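import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TestString {
    public static void main(String[] args) {
        Pattern p = Pattern.compile("\\W+");
        Matcher m = p.matcher("qw");
        // Prints false: "qw" consists of word characters only.
        System.out.println(m.matches());
    }
}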

Pattern.compile is a static method: "Compiles the given regular expression into a pattern."

p.matcher: "Creates a matcher that will match the given input against this pattern."

m.matches: "Attempts to match the entire region against the pattern." It returns the result as a boolean.

So you compile a regular expression once and then match strings against it. In the example above, \\W+ (one or more non-word characters) does not match "qw", so it prints false.
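matches() requires the entire input to match. Matcher has two looser relatives that the text does not cover: lookingAt() (match at the start only) and find() (match anywhere). The sketch below compares the three; the helper names are made up for illustration.

```java
import java.util.regex.Pattern;

class MatchModesDemo {
    // The whole input must match.
    static boolean whole(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    // A match must start at index 0, but may stop early.
    static boolean prefix(String regex, String input) {
        return Pattern.compile(regex).matcher(input).lookingAt();
    }

    // A match may occur anywhere in the input.
    static boolean anywhere(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(whole("\\w+", "qw"));         // all word characters
        System.out.println(whole("\\d+", "123abc"));     // trailing letters break it
        System.out.println(prefix("\\d+", "123abc"));    // digits at the start
        System.out.println(prefix("\\d+", "abc123"));    // digits not at the start
        System.out.println(anywhere("\\d+", "abc123"));  // digits somewhere
    }
}
```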

1. find and group

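import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TestString {
    public static void main(String[] args) {
        String s = "You're kidding me!";
        Pattern p = Pattern.compile("\\w+");
        Matcher m = p.matcher(s);
        // find() with no argument continues after the previous match.
        while(m.find()){
            System.out.print(m.group()+" ");
        }
        System.out.println();

        // find(int) resets the matcher and searches from the given index.
        int i = 0;
        while(m.find(i)){
            System.out.print(m.group()+" ");
            i++;
        }
    }
}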

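Result: You re kidding me 
You ou u re re e kidding kidding idding dding ding ing ng g me me e

find() scans the input for the next subsequence that matches the pattern, and group() "returns the input subsequence matched by the previous match". The first run of word characters is therefore You. Passing an int to find resets the matcher and starts the search at that index: starting at 0 the match is You, and after i is incremented to 1 the match is ou.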
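The two forms of find can be wrapped in small helpers to make the difference easier to test. This is only a sketch; the class, the WORD constant and the method names are illustrative.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FindGroupDemo {
    private static final Pattern WORD = Pattern.compile("\\w+");

    // Collects every match, using find() to walk forward through the input.
    static List<String> allWords(String input) {
        List<String> words = new ArrayList<>();
        Matcher m = WORD.matcher(input);
        while (m.find()) {
            words.add(m.group());
        }
        return words;
    }

    // Returns the first match at or after the given index, or null.
    static String firstWordFrom(String input, int start) {
        Matcher m = WORD.matcher(input);
        return m.find(start) ? m.group() : null;
    }

    public static void main(String[] args) {
        String s = "You're kidding me!";
        System.out.println(allWords(s));
        System.out.println(firstWordFrom(s, 1));
        System.out.println(firstWordFrom(s, 3));
    }
}
```

Index 3 is the apostrophe, which is not a word character, so the search from there lands on re.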

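2. start and end

while(m.find()){
    System.out.println(m.group()+" Start:"+m.start()+" End:"+m.end());
}

You Start:0 End:3
re Start:4 End:6
kidding Start:7 End:14
me Start:15 End:17

start() returns the index of the first character of the match; end() returns the index just past its last character.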
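Because end() is exclusive, start() and end() can be passed straight to substring to recover the matched text. A short sketch (the helper name spans and its output format are made up):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class StartEndDemo {
    // Describes each word as text[start,end), a half-open index range.
    static List<String> spans(String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile("\\w+").matcher(input);
        while (m.find()) {
            // substring(start, end) yields exactly the matched text.
            String viaSubstring = input.substring(m.start(), m.end());
            out.add(viaSubstring + "[" + m.start() + "," + m.end() + ")");
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(spans("You're kidding me!"));
    }
}
```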

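3. split
The parts of the book that simply list API members are the easiest, since everything is in the documentation; the way to learn those is to look them up and type the code yourself. Pattern also has two split methods: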

`String[]`	`split(CharSequence input)` Splits the given input sequence around matches of this pattern.
`String[]`	`split(CharSequence input, int limit)` Splits the given input sequence around matches of this pattern.

String string = "kjj~~lkjl~~lkjlJ~~lkj~~";
System.out.println(Arrays.toString(Pattern.compile("~~").split(string)));
System.out.println(Arrays.toString(Pattern.compile("~~").split(string,2)));

result:
[kjj, lkjl, lkjlJ, lkj]
[kjj, lkjl~~lkjlJ~~lkj~~]
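The first result above drops the empty string after the final ~~. The limit argument controls that: 0 (the default) discards trailing empty strings, a negative limit keeps them, and a positive limit caps the number of pieces. A sketch with an illustrative helper:

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class SplitLimitDemo {
    // Thin wrapper over Pattern.split(CharSequence, int).
    static String[] splitOn(String regex, String input, int limit) {
        return Pattern.compile(regex).split(input, limit);
    }

    public static void main(String[] args) {
        String s = "kjj~~lkjl~~lkjlJ~~lkj~~";
        System.out.println(Arrays.toString(splitOn("~~", s, 0)));   // trailing empty string dropped
        System.out.println(Arrays.toString(splitOn("~~", s, -1)));  // trailing empty string kept
        System.out.println(Arrays.toString(splitOn("~~", s, 2)));   // at most two pieces
    }
}
```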

(Amusingly, the author openly mocks Sun's Java designers in the book for making the Pattern flags so hard to understand.)

4) Replacement operations

`String`	`replaceAll(String regex,String replacement)` Replaces each substring of this string that matches the given regular expression with the given replacement.
`String`	`replaceFirst(String regex,String replacement)` Replaces the first substring of this string that matches the given regular expression with the given replacement.

replaceFirst replaces only the first match; replaceAll replaces every match.
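A quick sketch of the two methods side by side, plus a group reference ($1, $2, ...) in the replacement string. The helper names and sample inputs are made up for illustration:

```java
class ReplaceDemo {
    // Every run of digits becomes '#'.
    static String maskAll(String s) {
        return s.replaceAll("\\d+", "#");
    }

    // Only the first run of digits becomes '#'.
    static String maskFirst(String s) {
        return s.replaceFirst("\\d+", "#");
    }

    // $1, $2, $3 in the replacement refer to the capturing groups.
    static String swapDate(String isoDate) {
        return isoDate.replaceAll("(\\d+)-(\\d+)-(\\d+)", "$3/$2/$1");
    }

    public static void main(String[] args) {
        System.out.println(maskAll("a1b22c333"));
        System.out.println(maskFirst("a1b22c333"));
        System.out.println(swapDate("2014-09-11"));
    }
}
```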

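There is a more flexible approach than these two. Suppose you want to find the letters a, b, c and d and replace each one with its uppercase form. With the two methods above you would need one pass per letter.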

String string = "asdfb  sdfoiwer  sdfcdf wer sd d sdf  cxvxzcv s ef bob b   b ";
StringBuffer s2 = new StringBuffer();
Pattern pa = Pattern.compile("[abcd]");
Matcher mc = pa.matcher(string);
System.out.println();
while(mc.find()){
    mc.appendReplacement(s2,mc.group().toUpperCase());
}
mc.appendTail(s2);
System.out.println(s2);

Result: AsDfB  sDfoiwer  sDfCDf wer sD D sDf  CxvxzCv s ef BoB B   B 

Calling find() only once instead of looping (with a fresh Matcher and StringBuffer):

mc.find();
mc.appendReplacement(s2,mc.group().toUpperCase());
mc.appendTail(s2);
System.out.println(s2);

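Result: Asdfb  sdfoiwer  sdfcdf wer sd d sdf  cxvxzcv s ef bob b   b

The replacement text can be computed from each match. With the while loop, every match is replaced. With a single find() and no loop, s2 would contain only A after appendReplacement; to get the effect of replaceFirst you must call appendTail, which appends the remaining, unreplaced text so that the complete string is printed.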
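The loop and the single-find variant can be folded into one helper. This is a sketch under a couple of assumptions: the method name and the firstOnly flag are made up, and Matcher.quoteReplacement is added because appendReplacement treats $ and \ in the replacement string specially, which matters once the computed text can contain them.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AppendReplacementDemo {
    // Upper-cases each match of regex (or only the first one).
    static String upperCaseMatches(String regex, String input, boolean firstOnly) {
        Matcher m = Pattern.compile(regex).matcher(input);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // Copies the text before the match, then the computed replacement.
            m.appendReplacement(out, Matcher.quoteReplacement(m.group().toUpperCase()));
            if (firstOnly) {
                break;
            }
        }
        // Copies whatever follows the last replaced match.
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(upperCaseMatches("[abcd]", "asdfb bob", false));
        System.out.println(upperCaseMatches("[abcd]", "asdfb bob", true));
        System.out.println(upperCaseMatches("\\$\\w+", "pay $usd now", false));
    }
}
```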

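The reset method:

Matcher mc = pa.matcher(string);

A Matcher is bound to one input at a time. To apply the same Matcher to a different string, call reset:

mc.reset(newString);  // signature: reset(CharSequence input)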
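A sketch of reusing one Matcher across several inputs with reset. The helper countMatches is hypothetical and exists only to show the call sequence:

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ResetDemo {
    // Counts matches of regex in each input, reusing a single Matcher.
    static int[] countMatches(String regex, String... inputs) {
        Matcher m = Pattern.compile(regex).matcher("");
        int[] counts = new int[inputs.length];
        for (int i = 0; i < inputs.length; i++) {
            m.reset(inputs[i]);  // point the same Matcher at the next input
            while (m.find()) {
                counts[i]++;
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(countMatches("\\d", "a1b2", "999", "none")));
    }
}
```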

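5) Scanning input
Input in C is simple. In Java I type Syso all the time (the Eclipse shortcut for System.out.println, which a senior colleague showed me long ago and which I have used ever since), always producing output, and then forget how to read input.
Reading from a stream object: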

import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

public class TestScanner {
    public static void main(String[] args) {
        BufferedReader br = new BufferedReader(new StringReader("sdfsdf\nsdfsdf\nsdfsdf"));
        try {
            System.out.println(br.readLine());
            System.out.println(br.readLine());
            System.out.println(br.readLine());
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

A Scanner works too.

Scanner s = new Scanner(System.in);
System.out.println(s.nextLine());

This echoes whatever you type on the console.

Scanner delimiters:

import java.util.Scanner;

public class TestScanner {
    public static void main(String[] args) {
        Scanner s = new Scanner("12, 323, 34, 34, 5");
        s.useDelimiter("\\s*,\\s*");
        while (s.hasNextInt()) {
            System.out.println(s.nextInt());
        }
    }
}

I got stuck here yesterday. By default, Scanner splits its input into tokens on whitespace:

Scanner s = new Scanner("12  323 34 34 5");
while (s.hasNextInt()) {
    System.out.println(s.nextInt());
}

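This prints each number.
What puzzled me for a long time yesterday was the delimiter: why doesn't \\d+,\\d+ work? Looking at it again today, it clicked. The delimiter pattern describes the separator, not the tokens. \\s is a whitespace character and * means zero or more times, so \\s*,\\s* says: split the Scanner's input at each comma, together with any whitespace on either side of it.
\\d+,\\d+ would make "digits, comma, digits" the separator, which is certainly not what separates the numbers here. To verify, I changed \\s to \\W, and that works as well.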
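The three delimiter patterns discussed above can be compared with one small helper. This is a sketch; the method name parse is made up, and the input is the same string as in the example:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

class DelimiterDemo {
    // Reads ints from input until the next token is not an int.
    static List<Integer> parse(String input, String delimiterRegex) {
        List<Integer> out = new ArrayList<>();
        try (Scanner s = new Scanner(input)) {
            s.useDelimiter(delimiterRegex);
            while (s.hasNextInt()) {
                out.add(s.nextInt());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String input = "12, 323, 34, 34, 5";
        System.out.println(parse(input, "\\s*,\\s*"));  // comma with optional whitespace
        System.out.println(parse(input, "\\W*,\\W*"));  // comma with optional non-word characters
        System.out.println(parse(input, "\\d+,\\d+"));  // never matches: a space follows each comma
    }
}
```

With \\d+,\\d+ the delimiter never occurs in this input, so the whole string is a single token that is not an int and the list stays empty.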

Before Scanner and regular expressions existed, Java used StringTokenizer. It is essentially obsolete now, although it is not formally deprecated, so the IDE shows no warning.

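That wraps up the String material: input and output, formatted output, and regular expressions. Used well, these are powerful tools for batch processing. When I have time I will add a post on String immutability and memory allocation.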