Thinking in Java (13): Strings and Regular Expressions

The previous post covered the basics of String and compared StringBuilder with String. Let's continue.


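To give you a sense of how impressive RednaxelaFX is, here are the books he read in college.

Well, that is the level it takes to get to Silicon Valley, so let's keep reading.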


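1) Formatted output

Indeed, C's printf cannot use an overloaded + operator: C has no string concatenation operator, so values are placed into the output through a format string.

printf("%d %f", x , y);
%d and the like are format specifiers. %d denotes an integer, and x is inserted at the position of %d; %f denotes a floating-point number, and y is inserted at the position of %f.

Java borrows this from C: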

public class TestString {
    public static void main(String[] args) {
        int x = 1;
        float y = 1.223f;
        System.out.printf("%d %f",x,y);
        System.out.println();
        System.out.format("%d %f",x,y);
    }
}


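With Formatter you can control column spacing on the console precisely, without counting spaces yourself.

import java.util.Formatter;

public class TestString {
    public static void main(String[] args) {
        // Formatter writes its formatted output straight to System.out
        Formatter fm = new Formatter(System.out);
        // %-5s: left-justified, width 5; %5s and %10s: right-justified
        fm.format("%-5s %5s %10s ", "Name","Age","School");
    }
}
A specifier of the form % + number + s works as in C: the number is the minimum field width, and the value can be shifted within the field. By default it is right-justified; a leading - left-justifies it.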


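 System.out.println(String.format("%h", 17));
 fm.format("%h", 17);
Hexadecimal formatted output: both lines print 11. Strictly speaking, %h prints the hexadecimal form of the argument's hashCode(), which for an Integer is the value itself; %x is the conversion intended for integers.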
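To make the width flags and the two hex conversions concrete, here is a minimal sketch using String.format. The class name FormatWidthDemo and the helpers row and hex are made up for this illustration; they are not from the book.

```java
class FormatWidthDemo {
    // Builds one table row: name left-justified in 5 columns,
    // age right-justified in 5, school right-justified in 10.
    static String row(String name, String age, String school) {
        return String.format("%-5s %5s %10s", name, age, school);
    }

    // %x converts the integer value itself to hexadecimal.
    static String hex(int value) {
        return String.format("%x", value);
    }

    public static void main(String[] args) {
        System.out.println(row("Name", "Age", "School"));
        System.out.println(row("Tom", "20", "MIT"));
        System.out.println(hex(17));                 // prints 11
        System.out.println(String.format("%h", 17)); // also 11, via Integer.hashCode()
    }
}
```

Because every row goes through the same format string, the columns line up no matter how long each value is, as long as it fits in the field width.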


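2) Regular expressions (regex: regular expression)

Regular expressions come up all the time in string processing and batch file processing. They are very handy and also easy to forget. This section combines the book with some material from around the web.

-? matches an optional minus sign; the digits themselves are not included.

\d matches a single digit. Note that in other languages \\ means "insert a literal backslash into the regular expression", whereas in a Java string literal \\ means "I am inserting a regular-expression backslash, so the character that follows has special meaning".


By the same logic, \d is written \\d in Java, and matching one literal backslash takes \\\\.

Matching a String: use String's matches method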
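The double-escaping rule is easy to check with a small sketch. The class name BackslashDemo and its two helper methods are invented for this example.

```java
class BackslashDemo {
    // "\\d" in Java source is the two-character regex \d (one digit).
    static boolean isSingleDigit(String s) {
        return s.matches("\\d");
    }

    // "\\\\" in Java source is the two-character regex \\,
    // which matches exactly one literal backslash.
    static boolean isSingleBackslash(String s) {
        return s.matches("\\\\");
    }

    public static void main(String[] args) {
        System.out.println(isSingleDigit("7"));       // true
        System.out.println(isSingleBackslash("\\"));  // true: the input is one backslash
        System.out.println("\\\\".length());          // 2: the regex text itself is \\
    }
}
```

The string literal is unescaped once by the Java compiler and the result is unescaped again by the regex engine, which is why a single literal backslash costs four characters in source.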

public class TestString {
    public static void main(String[] args) {
        System.out.println("-3444".matches("-?\\d+"));
        System.out.println("-3".matches("-?\\d"));
        System.out.println("-3".matches("(-|\\+)?\\d"));
    }
}

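result: all true

(-|\\+)? is a bit more involved. | means "or", and because the plus sign has a special meaning in a regex it has to be escaped as \\+. The whole group therefore means: either a minus sign or a plus sign, or neither.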


split method:

The most common use is splitting on a space.

String s = Arrays.toString("sdfsdf sf sdf".split(" "));
You can also pass a regular expression to split:

String s = Arrays.toString("sdfsdf sf sdf".split("\\W+"));
String s2 = Arrays.toString("sdfsdf sf sdf".split("n\\W+"));
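\W is a non-word character and \w is a word character; n\\W+ matches the letter n followed by one or more non-word characters. The sample string contains no n, so the second call returns it as a single unsplit element.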
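To see n\\W+ actually split something, here is a small sketch with an input that does contain the letter n. The sentence and the names SplitRegexDemo and splitToString are made up for illustration.

```java
import java.util.Arrays;

class SplitRegexDemo {
    // Splits input around matches of regex and renders the pieces as text.
    static String splitToString(String input, String regex) {
        return Arrays.toString(input.split(regex));
    }

    public static void main(String[] args) {
        String text = "green, brown and tan shirts";
        // Split on runs of non-word characters.
        System.out.println(splitToString(text, "\\W+"));   // [green, brown, and, tan, shirts]
        // Split on 'n' followed by non-word characters; the 'n' is consumed too.
        System.out.println(splitToString(text, "n\\W+"));  // [gree, brow, and ta, shirts]
        // No 'n' in the input, so nothing is split.
        System.out.println(splitToString("sdfsdf sf sdf", "n\\W+")); // [sdfsdf sf sdf]
    }
}
```

Note that the n in "and" is followed by a word character, so it is not a split point, and that whatever the delimiter matches disappears from the result.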


References:

http://blog.csdn.net/kdnuggets/article/details/2526588

and the JDK's Pattern class:


Construct           Matches

Characters
x                   The character x
\\                  The backslash character
\0n                 The character with octal value 0n (0 <= n <= 7)
\0nn                The character with octal value 0nn (0 <= n <= 7)
\0mnn               The character with octal value 0mnn (0 <= m <= 3, 0 <= n <= 7)
\xhh                The character with hexadecimal value 0xhh
\uhhhh              The character with hexadecimal value 0xhhhh
\x{h...h}           The character with hexadecimal value 0xh...h (Character.MIN_CODE_POINT <= 0xh...h <= Character.MAX_CODE_POINT)
\t                  The tab character ('\u0009')
\n                  The newline (line feed) character ('\u000A')
\r                  The carriage-return character ('\u000D')
\f                  The form-feed character ('\u000C')
\a                  The alert (bell) character ('\u0007')
\e                  The escape character ('\u001B')
\cx                 The control character corresponding to x

Character classes
[abc]               a, b, or c (simple class)
[^abc]              Any character except a, b, or c (negation)
[a-zA-Z]            a through z or A through Z, inclusive (range)
[a-d[m-p]]          a through d, or m through p: [a-dm-p] (union)
[a-z&&[def]]        d, e, or f (intersection)
[a-z&&[^bc]]        a through z, except for b and c: [ad-z] (subtraction)
[a-z&&[^m-p]]       a through z, and not m through p: [a-lq-z] (subtraction)

Predefined character classes
.                   Any character (may or may not match line terminators)
\d                  A digit: [0-9]
\D                  A non-digit: [^0-9]
\s                  A whitespace character: [ \t\n\x0B\f\r]
\S                  A non-whitespace character: [^\s]
\w                  A word character: [a-zA-Z_0-9]
\W                  A non-word character: [^\w]
 
POSIX character classes (US-ASCII only)
\p{Lower}           A lower-case alphabetic character: [a-z]
\p{Upper}           An upper-case alphabetic character: [A-Z]
\p{ASCII}           All ASCII: [\x00-\x7F]
\p{Alpha}           An alphabetic character: [\p{Lower}\p{Upper}]
\p{Digit}           A decimal digit: [0-9]
\p{Alnum}           An alphanumeric character: [\p{Alpha}\p{Digit}]
\p{Punct}           Punctuation: One of !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
\p{Graph}           A visible character: [\p{Alnum}\p{Punct}]
\p{Print}           A printable character: [\p{Graph}\x20]
\p{Blank}           A space or a tab: [ \t]
\p{Cntrl}           A control character: [\x00-\x1F\x7F]
\p{XDigit}          A hexadecimal digit: [0-9a-fA-F]
\p{Space}           A whitespace character: [ \t\n\x0B\f\r]

java.lang.Character classes (simple java character type)
\p{javaLowerCase}   Equivalent to java.lang.Character.isLowerCase()
\p{javaUpperCase}   Equivalent to java.lang.Character.isUpperCase()
\p{javaWhitespace}  Equivalent to java.lang.Character.isWhitespace()
\p{javaMirrored}    Equivalent to java.lang.Character.isMirrored()

Classes for Unicode scripts, blocks, categories and binary properties
\p{IsLatin}         A Latin script character (script)
\p{InGreek}         A character in the Greek block (block)
\p{Lu}              An uppercase letter (category)
\p{IsAlphabetic}    An alphabetic character (binary property)
\p{Sc}              A currency symbol
\P{InGreek}         Any character except one in the Greek block (negation)
[\p{L}&&[^\p{Lu}]]  Any letter except an uppercase letter (subtraction)
 
Boundary matchers
^                   The beginning of a line
$                   The end of a line
\b                  A word boundary
\B                  A non-word boundary
\A                  The beginning of the input
\G                  The end of the previous match
\Z                  The end of the input but for the final terminator, if any
\z                  The end of the input


Quantifiers: how they consume text

Greedy quantifiers
X?                  X, once or not at all
X*                  X, zero or more times
X+                  X, one or more times
X{n}                X, exactly n times
X{n,}               X, at least n times
X{n,m}              X, at least n but not more than m times

Reluctant quantifiers
X??                 X, once or not at all
X*?                 X, zero or more times
X+?                 X, one or more times
X{n}?               X, exactly n times
X{n,}?              X, at least n times
X{n,m}?             X, at least n but not more than m times

Possessive quantifiers
X?+                 X, once or not at all
X*+                 X, zero or more times
X++                 X, one or more times
X{n}+               X, exactly n times
X{n,}+              X, at least n times
X{n,m}+             X, at least n but not more than m times
 
Logical operators
XY                  X followed by Y
X|Y                 Either X or Y
(X)                 X, as a capturing group

Back references
\n                  Whatever the nth capturing group matched
\k<name>            Whatever the named-capturing group "name" matched

Quotation
\                   Nothing, but quotes the following character
\Q                  Nothing, but quotes all characters until \E
\E                  Nothing, but ends quoting started by \Q

Special constructs (named-capturing and non-capturing)
(?<name>X)          X, as a named-capturing group
(?:X)               X, as a non-capturing group
(?idmsuxU-idmsuxU)  Nothing, but turns match flags i d m s u x U on - off
(?idmsux-idmsux:X)  X, as a non-capturing group with the given flags i d m s u x on - off
(?=X)               X, via zero-width positive lookahead
(?!X)               X, via zero-width negative lookahead
(?<=X)              X, via zero-width positive lookbehind
(?<!X)              X, via zero-width negative lookbehind
(?>X)               X, as an independent, non-capturing group
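The three quantifier families in the table differ only in how much text they take and whether they give any back. A rough sketch of the difference, using an input string chosen for this illustration; the class QuantifierDemo and the helper firstMatch are invented names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuantifierDemo {
    // Returns the first substring of input matched by regex, or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String input = "xfooxxxxxxfoo";
        // Greedy: .* takes everything, then backs off just enough for "foo".
        System.out.println(firstMatch(".*foo", input));   // xfooxxxxxxfoo
        // Reluctant: .*? takes as little as possible before "foo".
        System.out.println(firstMatch(".*?foo", input));  // xfoo
        // Possessive: .*+ takes everything and never backs off, so "foo" cannot match.
        System.out.println(firstMatch(".*+foo", input));  // null
    }
}
```

Greedy is the default and usually what you want; the reluctant form helps when the shortest match matters, and the possessive form is mostly an optimization that rules out backtracking.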


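3) Pattern and Matcher

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TestString {
    public static void main(String[] args) {
        Pattern p = Pattern.compile("\\W+");
        Matcher m = p.matcher("qw");
        // false: "qw" consists of word characters, \W+ needs non-word characters
        System.out.println(m.matches());
    }
}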
Pattern.compile is a static method: "Compiles the given regular expression into a pattern."

p.matcher: "Creates a matcher that will match the given input against this pattern."

m.matches: "Attempts to match the entire region against the pattern." It returns a boolean with the result of the match.

This way you can pass in a regular expression and match a string against it.
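The phrase "the entire region" matters: matches succeeds only if the pattern covers the whole input. A minimal sketch, with WholeMatchDemo and matchesWhole as made-up names:

```java
import java.util.regex.Pattern;

class WholeMatchDemo {
    // True only if regex matches the entire input, not just part of it.
    static boolean matchesWhole(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(matchesWhole("\\W+", "qw"));   // false: no non-word characters
        System.out.println(matchesWhole("\\w+", "qw"));   // true: both characters are word characters
        System.out.println(matchesWhole("\\w+", "qw!"));  // false: the trailing '!' is not covered
    }
}
```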

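1. find and group
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TestString {
    public static void main(String[] args) {
        String s = "You're kidding me!";
        Pattern p = Pattern.compile("\\w+");
        Matcher m = p.matcher(s);
        // find() with no argument continues after the previous match
        while(m.find()){
            System.out.print(m.group()+" ");
        }
        System.out.println();

        // find(i) resets the matcher and starts searching at index i
        int i = 0;
        while(m.find(i)){
            System.out.print(m.group()+" ");
            i++;
        }
    }
}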

result: You re kidding me 
You ou u re re e kidding kidding idding dding ding ing ng g me me e 


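find scans the input for the next match of the regular expression, and group "Returns the input subsequence matched by the previous match". The first match is therefore the first run of word characters, You. Passing an argument to find sets the position where the search starts: starting at 0 it matches You, and once i has been incremented to 1 it matches ou.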

2. start and end
while(m.find()){
            System.out.print(m.group()+" Start:"+m.start()+" End:"+m.end());
}

You Start:0 End:3re Start:4 End:6kidding Start:7 End:14me Start:15 End:17
start returns the index of the first character of the match; end returns the index just past its last character.
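Since end is exclusive, start and end plug directly into substring. A small sketch of that relationship; the class MatchSpanDemo and the method spans are names invented here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatchSpanDemo {
    // Describes every match of regex in input as "text[start,end)".
    static List<String> spans(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            // substring(start, end) yields the same text as m.group()
            String viaIndices = input.substring(m.start(), m.end());
            out.add(viaIndices + "[" + m.start() + "," + m.end() + ")");
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [You[0,3), re[4,6), kidding[7,14), me[15,17)]
        System.out.println(spans("\\w+", "You're kidding me!"));
    }
}
```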

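3. split
The parts of the book that just walk through the API are the easiest, because the documentation already covers them; for those, look things up and type the code yourself. Pattern also has two split methods:

String[] split(CharSequence input)
    Splits the given input sequence around matches of this pattern.
String[] split(CharSequence input, int limit)
    Splits the given input sequence around matches of this pattern.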
String string = "kjj~~lkjl~~lkjlJ~~lkj~~";
System.out.println(Arrays.toString(Pattern.compile("~~").split(string)));
System.out.println(Arrays.toString(Pattern.compile("~~").split(string,2)));

result:
[kjj, lkjl, lkjlJ, lkj]
[kjj, lkjl~~lkjlJ~~lkj~~]
(Ha, the author actually mocks the Java designers at Sun right in the book for making Pattern's flags so hard to understand.)
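The limit argument is worth one more experiment, because it also decides what happens to trailing empty strings. A sketch reusing the same input; the class SplitLimitDemo and its helper split are made-up names.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class SplitLimitDemo {
    // Splits input on "~~" with the given limit and renders the result as text.
    static String split(String input, int limit) {
        return Arrays.toString(Pattern.compile("~~").split(input, limit));
    }

    public static void main(String[] args) {
        String s = "kjj~~lkjl~~lkjlJ~~lkj~~";
        // limit 0: as many pieces as possible, trailing empty strings dropped
        System.out.println(split(s, 0));   // [kjj, lkjl, lkjlJ, lkj]
        // limit 2: at most two pieces, the rest stays unsplit
        System.out.println(split(s, 2));   // [kjj, lkjl~~lkjlJ~~lkj~~]
        // negative limit: as many pieces as possible, trailing empty strings kept
        System.out.println(split(s, -1));  // [kjj, lkjl, lkjlJ, lkj, ]
    }
}
```

The one-argument split behaves like limit 0, which is why the trailing ~~ leaves no empty element in the first result above.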

4) Replace operations

String replaceAll(String regex, String replacement)
    Replaces each substring of this string that matches the given regular expression with the given replacement.
String replaceFirst(String regex, String replacement)
    Replaces the first substring of this string that matches the given regular expression with the given replacement.
replaceFirst replaces only the first match; replaceAll replaces every match.

There is a more flexible approach than these two. Suppose you want to find the letters a, b, c and d and replace each one with its uppercase form; with the two methods above you would have to make several passes.

String string = "asdfb  sdfoiwer  sdfcdf wer sd d sdf  cxvxzcv s ef bob b   b ";
StringBuffer s2 = new StringBuffer();
Pattern pa = Pattern.compile("[abcd]");
Matcher mc = pa.matcher(string);
System.out.println();
while(mc.find()){
    mc.appendReplacement(s2,mc.group().toUpperCase());
}
mc.appendTail(s2);
System.out.println(s2);

result:AsDfB  sDfoiwer  sDfCDf wer sD D sDf  CxvxzCv s ef BoB B   B 



mc.find();
mc.appendReplacement(s2,mc.group().toUpperCase());
mc.appendTail(s2);
System.out.println(s2);

result :Asdfb  sdfoiwer  sdfcdf wer sd d sdf  cxvxzcv s ef bob b   b 
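You can also transform the text while replacing it. With the while loop every match is replaced. Without it, after a single find, s2 holds only A once appendReplacement has run; to get the effect of replaceFirst you have to call appendTail, which appends the remaining, unreplaced part of the input. Only then is the full string printed.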
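Wrapped into a reusable method, the find / appendReplacement / appendTail loop looks roughly like the sketch below. The names UppercaseMatchesDemo and upperCaseMatches are invented; the call to Matcher.quoteReplacement is my own addition, as a precaution in case the computed replacement contains $ or \, which appendReplacement would otherwise interpret.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UppercaseMatchesDemo {
    // Replaces every match of regex in input with its upper-case form.
    static String upperCaseMatches(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            // Copies the text before the match, then the computed replacement.
            // quoteReplacement keeps '$' and '\' in the replacement literal.
            m.appendReplacement(sb, Matcher.quoteReplacement(m.group().toUpperCase()));
        }
        // Copies whatever follows the last match.
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(upperCaseMatches("[abcd]", "bob and dad")); // BoB AnD DAD
    }
}
```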

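reset method:
Matcher mc = pa.matcher(string);
A matcher is bound to one input string at a time; reset points the same matcher at a different string:
mc.reset(newString);  // newString: the next CharSequence to search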
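A sketch of why reset is convenient: one compiled pattern and one matcher can be reused across several inputs. The class MatcherResetDemo and the method countsFor are hypothetical names for this example.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatcherResetDemo {
    // Counts the matches of regex in each input, reusing a single Matcher.
    static int[] countsFor(String regex, String... inputs) {
        // Compile once; the empty string is only a placeholder input.
        Matcher m = Pattern.compile(regex).matcher("");
        int[] counts = new int[inputs.length];
        for (int i = 0; i < inputs.length; i++) {
            m.reset(inputs[i]);  // rebind the same matcher to the next input
            while (m.find()) {
                counts[i]++;
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // prints [3, 1, 0]
        System.out.println(Arrays.toString(countsFor("[abcd]", "abc", "xyz d", "")));
    }
}
```

Calling reset() with no argument rewinds the matcher to the start of its current input instead of switching inputs.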

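5) Scanning input
Input in C is simple. In Java I type Syso all the time (the Eclipse shortcut for System.out.println, which a senior colleague showed me long ago and which I have used ever since), always writing output and forgetting how input is written.
A readable stream object: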

import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

public class TestScanner {
    public static void main(String[] args) {
        BufferedReader br = new BufferedReader(new StringReader("sdfsdf\nsdfsdf\nsdfsdf"));
        try {
            // each readLine() returns one line without its terminator
            System.out.println(br.readLine());
            System.out.println(br.readLine());
            System.out.println(br.readLine());
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

Scanner works too.
Scanner s = new Scanner(System.in);
System.out.println(s.nextLine());
This echoes a line typed into the console.

Scanner delimiters:
import java.util.Scanner;

public class TestScanner {
    public static void main(String[] args) {
        Scanner s = new Scanner("12, 323, 34, 34, 5");
        // tokens are separated by a comma with optional whitespace around it
        s.useDelimiter("\\s*,\\s*");
        while (s.hasNextInt()) {
            System.out.println(s.nextInt());
        }
    }
}
I got stuck here yesterday. By default, Scanner tokenizes its input on whitespace:
Scanner s = new Scanner("12  323 34 34 5");
while (s.hasNextInt()) {
    System.out.println(s.nextInt());
}
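This prints each number.
What puzzled me for so long yesterday was the delimiter: why doesn't \\d+,\\d+ work? Looking at it again today, it clicked. The delimiter describes the separator, not the tokens. \\s is a whitespace character and * means zero or more times, so the input is split at each comma together with any whitespace before or after it.
\\d+,\\d+ would make "a comma with digits on both sides" the separator. That never occurs in this input, because every comma is followed by a space, so nothing is split and hasNextInt() returns false. To verify the idea, I changed s to W (\\W*,\\W*), and that works too.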
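The three delimiters can be compared side by side with a small helper. This is only a sketch; DelimiterDemo and readInts are names invented for it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

class DelimiterDemo {
    // Reads ints from input until the next token is not an int.
    static List<Integer> readInts(String input, String delimiterRegex) {
        List<Integer> out = new ArrayList<>();
        try (Scanner s = new Scanner(input)) {
            s.useDelimiter(delimiterRegex);
            while (s.hasNextInt()) {
                out.add(s.nextInt());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String input = "12, 323, 34, 34, 5";
        System.out.println(readInts(input, "\\s*,\\s*"));  // [12, 323, 34, 34, 5]
        System.out.println(readInts(input, "\\W*,\\W*"));  // [12, 323, 34, 34, 5]
        // The delimiter never matches, so the whole input is one non-int token.
        System.out.println(readInts(input, "\\d+,\\d+"));  // []
    }
}
```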

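Before Scanner and regular expressions were available, Java used StringTokenizer. It has basically fallen out of use, although the IDE still does not flag it as Deprecated.

That wraps up String: input and output, formatted output, and regular expressions. Used well, they are very powerful for batch processing. When I have time I will add something on String immutability and memory allocation.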

