Java编程思想笔记——字符串

字符串操作是计算机程序设计中最常见的行为。

不可变String

String对象是不可变的。查看JDK文档就会发现,String类中每一个看起来会修改String值的方法,实际上都是创建了一个全新的String对象,以包含修改后的字符串的内容,而最初的String对象则丝毫未动:

public class Immotable {
    public static String upcase(String s) {
        return s.toUpperCase();
    }

    public static void main(String[] args) {
        String q = "howdy";
        System.out.println(q);
        String qq = upcase(q);
        System.out.println(qq);
        System.out.println(q);
    }
}/*
howdy
HOWDY
howdy
*/

当把q传递给upcase方法时,实际传递的是引用的一个拷贝。其实,每当把String对象做为方法的参数时,都会复制一份引用,而该引用所指的对象其实一直待在单一的物理位置,从未动过。
回到upcase的定义,传入其中的引用有了名字s,只有upcase运行的时候,局部引用s才会存在。一旦upcase运行结束,s就消失了。当然了,upcase的返回值,其实只是最终结果的引用。这足以说明,upcase返回的引用已经指向了一个新的对象,而原本的q则还在原地。

重载“+”与StringBuilder

不可变型为String带来了一定的效率问题,为String重载的“+”就是个例子:

public class Concatenation {
    public static void main(String[] args) {
        String mango = "mango";
        String s = "abc" + mango + "def" + 47;
        System.out.println(s);
    }
}/*
abcmangodef47
*/

String可能有一个append方法,它会生成一个新的String对象,以包含abc和mango连接后的字符串,以此类推。这样会产生一大堆需要垃圾回收的中间对象。
想查看代码是如何工作的,可以用JDK自带的工具javap来反编译以上代码,使用命令进入到Concatenation.java所在目录:

javac Concatenation.java
javap -c Concatenation
public class thinking3.Concatenation {
  public thinking3.Concatenation();
    Code:
       0: aload_0
       1: invokespecial #1                  // Method java/lang/Object."<init>":()V
       4: return

  public static void main(java.lang.String[]);
    Code:
       0: ldc           #2                  // String mango
       2: astore_1
       3: new           #3                  // class java/lang/StringBuilder
       6: dup
       7: invokespecial #4                  // Method java/lang/StringBuilder."<init>":()V
      10: ldc           #5                  // String abc
      12: invokevirtual #6                  // Method java/lang/StringBuilder.append:(Ljava/lang/String;)Ljava/lang/StringBuilder;
      15: aload_1
      16: invokevirtual #6                  // Method java/lang/StringBuilder.append:(Ljava/lang/String;)Ljava/lang/StringBuilder;
      19: ldc           #7                  // String def
      21: invokevirtual #6                  // Method java/lang/StringBuilder.append:(Ljava/lang/String;)Ljava/lang/StringBuilder;
      24: bipush        47
      26: invokevirtual #8                  // Method java/lang/StringBuilder.append:(I)Ljava/lang/StringBuilder;
      29: invokevirtual #9                  // Method java/lang/StringBuilder.toString:()Ljava/lang/String;
      32: astore_2
      33: getstatic     #10                 // Field java/lang/System.out:Ljava/io/PrintStream;
      36: aload_2
      37: invokevirtual #11                 // Method java/io/PrintStream.println:(Ljava/lang/String;)V
      40: return
}

编译器自动引入了java.lang.StringBuilder类。虽然源代码中我们并没有使用StringBuilder类,但是编译器却自动使用了,因为它更高效。
编译器创建StringBuiler对象,用以构造器最终的String,并为每个字符串用一次StringBuiler的append方法,总计四次。最后调用toString生成结果。
现在也许可以随意使用String对象,反正编译器会自动优化性能。但我们要了解优化到了什么程度:

public class WhitherStringBuilder {
    public String implicit(String[] fields) {
        String result = "";
        for(int i = 0; i < fields.length; i++) {
            result += fields[i];
        }
        return result;
    }
    public String explicit(String[] fields) {
        StringBuilder result = new StringBuilder();
        for(int i = 0; i < fields.length; i++) {
            result.append(fields[i]);
        }
        return result.toString();
    }
}

现在运行javap -c WhitherStringBuilder ,可以看到两个方法的字节码:

public class thinking6.WhitherStringBuilder {
  public thinking6.WhitherStringBuilder();
    Code:
       0: aload_0
       1: invokespecial #1                  // Method java/lang/Object."<init>":()V
       4: return

  public java.lang.String implicit(java.lang.String[]);
    Code:
       0: ldc           #2                  // String
       2: astore_2
       3: iconst_0
       4: istore_3
       5: iload_3
       6: aload_1
       7: arraylength
       8: if_icmpge     38
      11: new           #3                  // class java/lang/StringBuilder
      14: dup
      15: invokespecial #4                  // Method java/lang/StringBuilder."<init>":()V
      18: aload_2
      19: invokevirtual #5                  // Method java/lang/StringBuilder.append:(Ljava/lang/String;)Ljava/lang/StringBuilder;
      22: aload_1
      23: iload_3
      24: aaload
      25: invokevirtual #5                  // Method java/lang/StringBuilder.append:(Ljava/lang/String;)Ljava/lang/StringBuilder;
      28: invokevirtual #6                  // Method java/lang/StringBuilder.toString:()Ljava/lang/String;
      31: astore_2
      32: iinc          3, 1
      35: goto          5
      38: aload_2
      39: areturn

  public java.lang.String explicit(java.lang.String[]);
    Code:
       0: new           #3                  // class java/lang/StringBuilder
       3: dup
       4: invokespecial #4                  // Method java/lang/StringBuilder."<init>":()V
       7: astore_2
       8: iconst_0
       9: istore_3
      10: iload_3
      11: aload_1
      12: arraylength
      13: if_icmpge     30
      16: aload_2
      17: aload_1
      18: iload_3
      19: aaload
      20: invokevirtual #5                  // Method java/lang/StringBuilder.append:(Ljava/lang/String;)Ljava/lang/StringBuilder;
      23: pop
      24: iinc          3, 1
      27: goto          10
      30: aload_2
      31: invokevirtual #6                  // Method java/lang/StringBuilder.toString:()Ljava/lang/String;
      34: areturn
}

implicit中StringBuilder是在循环体之内构造的,这就意味着每经过一次循环,就会创建一个新的StringBuilder对象。
explicit只生成了一个StringBuilder对象。显式地创建StringBuilder还允许预先为其指定大小。如果已经知道最终字符串的长度,预先指定大小可以避免多次重新分配缓冲。
所以如果要在toString方法中使用循环,那么最好自己创建StringBuilder对象,用它来构造最终结果:

public class UsingStringBuilder {
    public static Random rand = new Random(47);
    @Override
    public String toString(){
        StringBuilder result = new StringBuilder("[");
        for(int i = 0; i < 25; i++) {
            result.append(rand.nextInt(100));
            result.append(", ");
        }
        result.delete(result.length() - 2, result.length());
        result.append("]");
        return result.toString();
    }

    public static void main(String[] args) {
        UsingStringBuilder usb = new UsingStringBuilder();
        System.out.println(usb);
    }
}/*
[58, 55, 93, 61, 61, 29, 68, 0, 22, 7, 88, 28, 51, 89, 9, 78, 98, 61, 20, 58, 16, 40, 11, 22, 4]
*/

结果是由append一点点拼接出来,如果想走捷径,例如append(a + “:” + c),那编译器就会掉入陷阱,从而为你另外创建一个StringBuilder对象处理括号内的字符串操作。
StringBuilder是Java SE5引入的,在这之前Java用的是StringBuffer,后者是线程安全的,因此开销也大些,所以在Java SE5/6中,字符串操作应该还会更快一点。

无意识的递归

Java的每个类都继承自Object,标准容器类也不例外,因此容器类都有toString方法,并覆盖了该方法,使得它生成的String结果能够表达容器自身,以及容器所包含的对象。例如ArrayList.toString(),它会遍历ArrayList中包含的所有对象,调用每个元素上的toString方法:
Coffee和CoffeeGenerator包含在后续的泛型章节中。

public class ArrayListDisplay {
    public static void main(String[] args) {
        ArrayList<Coffee> coffees = new ArrayList<Coffee>();
        for(Coffee c : new CoffeeGenerator(10)) {
            coffees.add(c);
        }
        System.out.println(coffees);
    }
}/*
[Americano 0, Latte 1, Americano 2, Mocha 3, Mocha 4, Breve 5, Americano 6, Latte 7, Cappuccino 8, Cappuccino 9]
*/

如果希望toString方法打印内存地址,或许会使用this:

public class InfiniteRecursion {
    @Override
    public String toString() {
        return "InfiniteRecursion address: " + this + "\n";
    }

    public static void main(String[] args) {
        List<InfiniteRecursion> v = new ArrayList<InfiniteRecursion>();
        for(int i = 0; i < 10; i++) {
            v.add(new InfiniteRecursion());
        }
        System.out.println(v);
    }
}

当创建一个InfiniteRecursion 并打印或存入ArrayList并打印,都会出现异常。其实当”InfiniteRecursion address: ” + this运行时,这里发生自动类型转换,有InfiniteRecursion 转换成String类型。因为编译器看到一个String后跟着一个“+”,而在后面的对象不是String,于是编译器试着将this转换成一个String,通过调用this的toString方法转换,于是发生了递归调用。所以不该使用this,应该调用super.toString方法。

String上的操作

这里写图片描述

可以看出,当需要改变字符串内容时,String类的方法都会返回一个新的String对象。同时,如果内容没有发生变化,String方法只是返回指向原对象的引用而已。这可以节约存储空间以及避免额外的开销。

格式化输出

prinrf()

printf并不使用重载的“+”操作符来连接引号内的字符串或字符串变量,而是使用特殊占位符来表示数据将来的位置,而且它还将插入格式化字符串的参数以逗号分开,排成一行。
例如:
printf(“Row 1: [%d %f]\n”, x, y);
这一行代码在运行的时候,首先将x的值插入到%d的位置,然后将y的值插入到%f的位置,这些占位符被称作格式修饰符。他们不仅说明了插入位置,同时还说明插入类型。%d表示整数,%f表示浮点数

System.out.format()

format方法可用于PrintStream或PrintWriter对象,其中也包括System.out对象。format模仿自c的printf,如果比较怀旧可以使用printf:

public class SimpleFormat {
    public static void main(String[] args) {
        int x = 5;
        double y = 5.332542;
        System.out.println("Row 1: [" + x + " " + y + "]");
        System.out.format("Row 1: [%d %f]\n", x, y);
        System.out.printf("Row 1: [%d %f]\n", x, y);
    }
}/*
Row 1: [5 5.332542]
Row 1: [5 5.332542]
Row 1: [5 5.332542]
*/
Formatter类

在Java中,所有新的格式化功能都有java.util.Formatter类处理。可以将Formatter看作一个翻译器,它将格式化字符串与数据翻译成需要的结果:

public class Turtle {
    private String name;
    private Formatter f;
    public Turtle(String name, Formatter f) {
        this.name = name;
        this.f = f;
    }
    public void move(int x, int y) {
        f.format("%s The Turtle is at (%d,%d)\n", name, x, y);
    }

    public static void main(String[] args) {
        PrintStream outAlias = System.out;
        Turtle tommy = new Turtle("Tommy",new Formatter(System.out));
        Turtle terry = new Turtle("Terry",new Formatter(outAlias));
        tommy.move(0,0);
        terry.move(4,8);
        tommy.move(3,4);
        terry.move(2,5);
        tommy.move(3,3);
        terry.move(3,3);
    }
}/*
Tommy The Turtle is at (0,0)
Terry The Turtle is at (4,8)
Tommy The Turtle is at (3,4)
Terry The Turtle is at (2,5)
Tommy The Turtle is at (3,3)
Terry The Turtle is at (3,3)
*/

Formatter的构造器经过重载可以接受多种输出目的地(System.out,System.out的一个别名),最常用的是PrintStream、OutputStream和File。

格式化说明符
%[argument_index$][flags][width][.precision]conversion

width控制一个域的最小尺寸,Formatter对象通过在必要时添加空格,来确保一个域至少达到某个长度。在默认情况下,数据是右对齐,不过可以通过使用“-”标志来改变对齐方向。
与width相对的是precision,来指明最大尺寸。width可用于各种类型的数据转换,且行为方式都一样。precision则不然,不是所有类型的数据都可以使用precision,而且应用于不同类型的数据转换时,precision意义也不同。
precision应用于String时,表示打印String时输出字符的最大数量。而在将precision应用于浮点数时,他表示小数部分要显示出来的位数,过多舍入,太少补0。precision无法应用于整数(会触发异常),因为整数没有小数部分。

public class Receipt {
    private double total = 0;
    private Formatter f = new Formatter(System.out);
    public void printTitle() {
        f.format("%-15s %5s %10s\n", "Item", "Qty", "Price");
        f.format("%-15s %5s %10s\n", "----", "---", "-----");
    }
    public void print(String name, int qty, double price) {
        f.format("%-15.15s %5d %10.2f\n", name, qty, price);
        total += price;
    }
    public void printTotal() {
        f.format("%-15s %5s %10.2f\n", "Tax", "", total * 0.06);
        f.format("%-15s %5s %10s\n", "", "", "-----");
        f.format("%-15s %5s %10.2f\n", "Total", "", total * 1.06);
    }

    public static void main(String[] args) {
        Receipt receipt = new Receipt();
        receipt.printTitle();
        receipt.print("Jack's Magic Beans", 4, 4.25);
        receipt.print("Princess Peas", 3, 5.1);
        receipt.print("ThreeBears Porridge", 1, 14.29);
        receipt.printTotal();
    }
}/*
Item              Qty      Price
----              ---      -----
Jack's Magic Be     4       4.25
Princess Peas       3       5.10
ThreeBears Porr     1      14.29
Tax                         1.42
                           -----
Total                      25.06
*/
Formatter转换

这里写图片描述

public class Conversion {
    public static void main(String[] args) {
        Formatter f = new Formatter(System.out);
        char u = 'a';
        System.out.println("u = 'a'");
        f.format("s: %s\n", u);
        //f.format("d: %d\n", u); //d != java.lang.Character
        f.format("c: %c\n", u);
        f.format("b: %b\n", u);
        //f.format("f: %f\n", u);
        //f.format("e: %e\n", u);
        //f.format("x: %x\n", u);
        f.format("h: %h\n", u);

        int v = 121;
        System.out.println("v = 121");
        f.format("d: %d\n", v);
        f.format("c: %c\n", v);
        f.format("b: %b\n", v);
        f.format("s: %s\n", v);
        //f.format("f: %f\n", v);
        //f.format("e: %e\n", v);
        f.format("x: %x\n", v);
        f.format("h: %h\n", v);

        BigInteger w = new BigInteger("50000000000000");
        System.out.println("w = new BigInteger(\"50000000000000\")");
        f.format("d: %d\n", w);
        //f.format("c: %c\n", w);
        f.format("b: %b\n", w);
        f.format("s: %s\n", w);
        //f.format("f: %f\n", w);
        //f.format("e: %e\n", w);
        f.format("x: %x\n", w);
        f.format("h: %h\n", w);

        double x = 179.543;
        System.out.println("x = 179.543");
        //f.format("d: %d\n", x);
        //f.format("c: %c\n", x);
        f.format("b: %b\n", x);
        f.format("s: %s\n", x);
        f.format("f: %f\n", x);
        f.format("e: %e\n", x);
        //f.format("x: %x\n", x);
        f.format("h: %h\n", x);

        Conversion y = new Conversion();
        System.out.println("y = new Conversion()");
        //f.format("d: %d\n", y);
        //f.format("c: %c\n", y);
        f.format("b: %b\n", y);
        f.format("s: %s\n", y);
        //f.format("f: %f\n", y);
        //f.format("e: %e\n", y);
        //f.format("x: %x\n", y);
        f.format("h: %h\n", y);

        boolean z = false;
        System.out.println("z = false");
        //f.format("d: %d\n", z);
        //f.format("c: %c\n", z);
        f.format("b: %b\n", z);
        f.format("s: %s\n", z);
        //f.format("f: %f\n", z);
        //f.format("e: %e\n", z);
        //f.format("x: %x\n", z);
        f.format("h: %h\n", z);
    }
}/*
u = 'a'
s: a
c: a
b: true
h: 61
v = 121
d: 121
c: y
b: true
s: 121
x: 79
h: 79
w = new BigInteger("50000000000000")
d: 50000000000000
b: true
s: 50000000000000
x: 2d79883d2000
h: 8842a1a7
x = 179.543
b: true
s: 179.543
f: 179.543000
e: 1.795430e+02
h: 1ef462c
y = new Conversion()
b: true
s: thinking6.Conversion@1d44bcfa
h: 1d44bcfa
z = false
b: false
s: false
h: 4d5
*/

b的转换对于各种类型都是合法的,但其行为却不一定一致。对于boolean和Boolean对象,其结果是对应的true或false。但对其他类型的参数,只要该参数不为null就永远是true,即使是数字0转换结果依然为true。

String.format()

String.format是一个static方法,它接受与Formatter.format方法一样的参数,但返回一个String对象。

public class DatabaseException extends Exception {
    public DatabaseException(int transactionID, int queryID, String message) {
        super(String.format("(t%d, q%d) %s", transactionID, queryID, message));
    }

    public static void main(String[] args) {
        try {
            throw new DatabaseException(3, 7, "Write failed");
        } catch (Exception e) {
            System.out.println(e);
        }
    }
}/*
thinking6.DatabaseException: (t3, q7) Write failed
*/

一个十六进制转储(dump)工具::

public class Hex {
    public static String format(byte[] data) {
        StringBuilder result = new StringBuilder();
        int n = 0;
        for(byte b : data) {
            if(n % 16 == 0) {
                result.append(String.format("%05X: ", n));
            }
            result.append(String.format("%02X", b));
            n++;
            if(n % 16 == 0){
                result.append("\n");
            }
        }
        result.append("\n");
        return result.toString();
    }

    public static void main(String[] args) throws Exception {
        if(args.length == 0) {
            System.out.println(format(BinaryFile.read("E:\\JAVA\\IdeaProjects\\untitled\\src\\thinking6\\WhitherStringBuilder.class")));
        } else {
            System.out.println(format(BinaryFile.read(new File(args[0]))));
        }
    }
}/*
00000: CAFEBABE0000003400210A0008001508
00010: 00160700170A000300150A000300180A
00020: 0003001907001A07001B0100063C696E
00030: 69743E010003282956010004436F6465
00040: 01000F4C696E654E756D626572546162
00050: 6C65010008696D706C69636974010027
*/

read方法会将整个文件以byte数组形式返回。

正则表达式

基础

正则表达式就是以某种方式来描述字符串。
描述一个负数:-?,\d表示一位数字。在java中\的意思是要插入一个正则表达式的反斜线,所以其后的字符具有特殊意义。想表达一位数字\d,插入普通的反斜线\\。不过换行和制表符之类的只需要单反斜线\n\t。

public class IntegerMatch {
    public static void main(String[] args) {
        //可能有一个负号后面跟着一位或多位数字
        System.out.println("-1234".matches("-?\\d+"));
        System.out.println("5678".matches("-?\\d+"));
        System.out.println("+911".matches("-?\\d+"));
        //可能以一个加号或减号开头
        System.out.println("+911".matches("(-|\\+)?\\d+"));
    }
}/*
true
true
false
true

*/

split()方法:

public class Splitting {
    public static String knights = "Then, when you have found the shrubbery, you must cut down the mightiest tree in the forest... with... a herring!";
    public static void split(String regex) {
        System.out.println(Arrays.toString(knights.split(regex)));
    }

    public static void main(String[] args) {
        split(" ");
        split("\\W+");
        split("n\\W+");
    }
}/*
[Then,, when, you, have, found, the, shrubbery,, you, must, cut, down, the, mightiest, tree, in, the, forest..., with..., a, herring!]
[Then, when, you, have, found, the, shrubbery, you, must, cut, down, the, mightiest, tree, in, the, forest, with, a, herring]
[The, whe, you have found the shrubbery, you must cut dow, the mightiest tree i, the forest... with... a herring!]
*/

\W表示非单词字符(\w则表示单词字符),第二个例子表示将标点符号删除,第三个表示字母n后面跟着一个或多个非单词字符。
split有一个重载版本,允许限制字符串分割次数。

public class Replacing {
    static String s = Splitting.knights;

    public static void main(String[] args) {
        System.out.println(s.replaceFirst("f\\w+", "located"));
        System.out.println(s.replaceAll("shrubbery|tree|herring","banana"));
    }
}/*
Then, when you have located the shrubbery, you must cut down the mightiest tree in the forest... with... a herring!
Then, when you have found the banana, you must cut down the mightiest banana in the forest... with... a banana!
*/
创建正则表达式

java.util.regex包中的Pattern类。
这里写图片描述

public class Rudolph {
    public static void main(String[] args) {
        for(String pattern : new String[]{ "Rudolph", "[rR]udolph", "[rR][aeiou][a-z]ol.*", "R.*"}) {
            System.out.println("Rudolph".matches(pattern));
        }
    }
}/*
true
true
true
true
*/
量词

量词描述了一个模式吸收输入文本的方式:
贪婪型:量词总是贪恋的,除非有其他的选项被设置。贪婪表达式会为所有可能的模式发现尽可能多的匹配
勉强式:用问号指定,匹配满足模式所需的最少字符数
占有式:当正则表达式被应用于字符串时,她会产生相当多的状态,以便在匹配失败时可以回溯。他们常常用于防止正则表达式失控,因此可以使正则表达式执行起来更有效。
这里写图片描述
表达式X通常必须要用圆括号括起来,以便它能按照我们的期望去执行。
abc+实际上表示的是:匹配ab,后面跟随1个或多个c。要表明匹配1个或多个完整的abc字符串,必须:(abc)+

CharSequence

接口CharSequence从CharBuffer、String、StringBuffer、StringBuilder类之中抽象出字符序列的一般化定义:

interface CharSequence {
    charAt(int i);
    length();
    subSequence(int start, int end);
    toString();
}

因此,这些类都实现了该接口。多数正则表达式操作都接受CharSequence类型的参数。

Pattern和Matcher

比起功能有限的String类,我们更愿意使用功能强大的正则表达式对象。只需导入java.util.regex包,然后用static Pattern.compile()方法来编译正则表达式即可。它会根据你的String类型的正则表达式生成一个Pattern对象。

//需要添加Program arguments为abcabcabcdefabc "abc+" "(abc)+" "(abc){2,}"
public class TestRegularExpression {
    public static void main(String[] args) {
        if(args.length < 2) {
            System.out.println("Usage:\njava TestRegularExpression characterSequence regularExpression");
            System.exit(0);
        }
        System.out.println("Input: \"" + args[0] + "\"");
        for(String arg : args) {
            System.out.println("Regular expression: \"" + arg + "\"");
            Pattern p = Pattern.compile(arg);
            Matcher m = p.matcher(args[0]);
            while (m.find()) {
                System.out.println("Match \"" + m.group() + "\" at positions " + m.start() + "-" + (m.end() - 1));
            }
        }
    }
}/*
Input: "abcabcabcdefabc"
Regular expression: "abcabcabcdefabc"
Match "abcabcabcdefabc" at positions 0-14
Regular expression: "abc+"
Match "abc" at positions 0-2
Match "abc" at positions 3-5
Match "abc" at positions 6-8
Match "abc" at positions 12-14
Regular expression: "(abc)+"
Match "abcabcabc" at positions 0-8
Match "abc" at positions 12-14
Regular expression: "(abc){2,}"
Match "abcabcabc" at positions 0-8
*/

Pattern类提供了static方法:static boolean matches(String regex, CharSequence input),通过调用Pattern.matcher()方法,并传入一个字符串参数,得到一个Matcher对象:
boolean matches()
boolean lookingAt()
boolean find()
boolean find(int start)
其中matches方法用来判断整个输入字符串是否匹配正则表达式模式,而lookingAt则用来判断该字符串的(不必是整个字符串)始部分是否能够匹配模式。

find()

Matcher.find方法用来在CharSequence中查找多个匹配:

public class Finding {
    public static void main(String[] args) {
        Matcher m = Pattern.compile("\\w+").matcher("Evening is full of the linnet's wings");
        while (m.find()) {
            System.out.print(m.group() + " ");
        }
        System.out.println();
        int i = 0;
        while (m.find(i)) {
            System.out.print(m.group() + " ");
            i++;
        }
    }
}/*
Evening is full of the linnet s wings
Evening vening ening ning ing ng g is is s full full ull ll l of of f the the he e linnet linnet innet nnet net et t s s wings wings ings ngs gs s 
*/

模式\w+将字符串划分为单词。find像迭代器一样遍历字符串。而第二个find的整数表示字符串中字符的位置,并以其作为搜索起点。

组(Groups)

组是用括号划分正则表达式,可以根据组的编号来引用某个组。组号0表示整个表达式,1表示被第一对括号括起来的组:
A(B(C))D,组0表示ABCD,组1表示BC,组3表示C。
Matcher对象的public int groupCount返回匹配器的模式中分组数目(不包含0组)。public String group返回前一次匹配操作的第0组。public String group(int i)返回前一次匹配操作期间指定的组号,如果匹配成功,但是指定的组没有匹配输出字符串的任何部分,则将会返回null。public int start(int group)返回在前一次匹配操作中寻找到的组的起始索引。public int end(int group)返回在前一次匹配操作中寻找到的组的最后一个字符索引加一的值。

public class Groups {
    static public final String POEM =
            "Twas brillig, and the slithy toves\n" +
                    "Did gyre and gimble in the wabe.\n" +
                    "All mimsy were the borogoves,\n" +
                    "And the mome raths outgrabe.\n\n" +
                    "Beware the Jabberwock, my son,\n" +
                    "The jaws that bite, the claws that catch.\n" +
                    "Beware the Jubjub bird, and shun\n" +
                    "The frumious Bandersnatch.";
    public static void main(String[] args) {
        Matcher m =
                Pattern.compile("(?m)(\\S+)\\s+((\\S+)\\s+(\\S+))$")
                        .matcher(POEM);
        while(m.find()) {
            for(int j = 0; j <= m.groupCount(); j++) {
                System.out.print("[" + m.group(j) + "]");
            }
            System.out.println();
        }
    }
}/*
[the slithy toves][the][slithy toves][slithy][toves]
[in the wabe.][in][the wabe.][the][wabe.]
[were the borogoves,][were][the borogoves,][the][borogoves,]
[mome raths outgrabe.][mome][raths outgrabe.][raths][outgrabe.]
[Jabberwock, my son,][Jabberwock,][my son,][my][son,]
[claws that catch.][claws][that catch.][that][catch.]
[bird, and shun][bird,][and shun][and][shun]
[The frumious Bandersnatch.][The][frumious Bandersnatch.][frumious][Bandersnatch.]
*/

这个正则表达式目的是捕获每行的最后3个词,每行最后以 结 束 。 正 常 情 况 下 是 将 “ ”与整个输出序列的末端相匹配,所以我们一定要显式的让正则表达式注意序列中的换行符。

start()与end()

匹配失败之后或先于一个正在进行匹配操作调用将会产生IllegalStateException:

public class StartEnd {
    public static String input =
            "As long as there is injustice, whenever a\n" +
                    "Targathian baby cries out, wherever a distress\n" +
                    "signal sounds among the stars ... We'll be there.\n" +
                    "This fine ship, and this fine crew ...\n" +
                    "Never give up! Never surrender!";
    private static class Display {
        private boolean regexPrinted = false;
        private String regex;
        Display(String regex) {
            this.regex = regex;
        }
        void display(String message) {
            if(!regexPrinted) {
                System.out.println(regex);
                regexPrinted = true;
            }
            System.out.println(message);
        }
    }
    static void examine(String s, String regex) {
        Display d = new Display(regex);
        Pattern p = Pattern.compile(regex);
        Matcher m = p.matcher(s);
        while (m.find()) {
            d.display("find() '" + m.group() + "' start = " + m.start() + " end = " + m.end());
        }
        if(m.lookingAt()) {
            d.display("lookingAt() start = " + m.start() + " end = " + m.end());
        }
        if(m.matches()) {
            d.display("matches() start = " + m.start() + " end = " + m.end());
        }
    }

    public static void main(String[] args) {
        for (String in : input.split("\n")) {
            System.out.println("input : " + in);
            for (String regex : new String[]{"\\w*ere\\w*",
                "\\w*ever", "T\\w+", "Never.*?!"
            }) {
                examine(in, regex);
            }
        }
    }
}/*
input : As long as there is injustice, whenever a
\w*ere\w*
find() 'there' start = 11 end = 16
\w*ever
find() 'whenever' start = 31 end = 39
input : Targathian baby cries out, wherever a distress
\w*ere\w*
find() 'wherever' start = 27 end = 35
\w*ever
find() 'wherever' start = 27 end = 35
T\w+
find() 'Targathian' start = 0 end = 10
lookingAt() start = 0 end = 10
input : signal sounds among the stars ... We'll be there.
\w*ere\w*
find() 'there' start = 43 end = 48
input : This fine ship, and this fine crew ...
T\w+
find() 'This' start = 0 end = 4
lookingAt() start = 0 end = 4
input : Never give up! Never surrender!
\w*ever
find() 'Never' start = 0 end = 5
find() 'Never' start = 15 end = 20
lookingAt() start = 0 end = 5
Never.*?!
find() 'Never give up!' start = 0 end = 14
find() 'Never surrender!' start = 15 end = 31
lookingAt() start = 0 end = 14
matches() start = 0 end = 31
*/

find可以在输出的任意位置定义正则表达式,而lookingAt和matches只有在正则表达式与输出的最开始处就开始匹配时才会成功。matches只有在整个输出都匹配正则表达式时才会成功,而lookingAt只要输出的第一部分匹配就会成功。

Pattern标记

Pattern类的compile还有另一个版本,接受一个标记参数:
Pattern Pattern.compile(String regex, int flag)
这里写图片描述
在这些标记中,Pattern.CASE_INSENSITIVE、Pattern.MULTILINE以及Pattern.COMMENTS特别有用。

public class ReFlags {
    public static void main(String[] args) {
        Pattern p = Pattern.compile("^java", Pattern.CASE_INSENSITIVE | Pattern.MULTILINE);
        Matcher m = p.matcher(
                "java has regex\nJava has regex\n" +
                "JAVA has pretty good regular expressions\n" +
                "Regular expressions are in Java");
        while (m.find()) {
            System.out.println(m.group());
        }
    }
}/*
java
Java
JAVA
*/

匹配所有以java、Java、JAVA等开头的行,并且设置了多行标记的状态下,对每一个航都进行匹配。

split

split方法将输入字符串断开成字符串对象数组,断开边界有下列正则表达式确定:
String[] split(CharSeuence input)
String[] split(CharSeuence input, int limit)

public class SplitDemo {
    public static void main(String[] args) {
        String input = "This!!unusual use!! of exclamation!!points";
        System.out.println(Arrays.toString(Pattern.compile("!!").split(input)));
        System.out.println(Arrays.toString(Pattern.compile("!!").split(input,3)));
    }
}/*
[This, unusual use,  of exclamation, points]
[This, unusual use,  of exclamation!!points]
*/

第二种形式的split方法可以限制将输入分割成字符串的数量。

替换操作

replaceFirst(String rep)以参数字符串rep替换掉第一个匹配成功的部分
replaceAll(String rep)以参数字符串rep替换所有匹配成功的部分
appendReplacement(StringBuffer sbuf,String rep)执行渐进式替换,它允许调用其他方法来生成或处理rep,使能够以编程的方式将目标分割成组
appendTail(StringBuffer sbuf)在执行一次或多次appendReplacement之后,调用此方法可以将输入字符串剩余部分复制到sbuf。

/*! Here's a block of text to use as input to
    the regular expression matcher. Note that we'll
    first extract the block of text by looking for
    the special delimiters, then process the
    exteckted block. !*/
public class TheReplacements {
    public static void main(String[] args) {
        String s = TextFile.read("E:\\IDEAFile\\untitled\\src\\thinking3\\TheReplacements.java");
        Matcher mInput = Pattern.compile("/\\*!(.*)!\\*/",Pattern.DOTALL).matcher(s);
        if(mInput.find()) {
            s = mInput.group(1);
        }
        s = s.replaceAll(" {2,}", " ");
        s = s.replaceAll("(?m)^ +", "");
        System.out.println(s);
        s = s.replaceFirst("[aeiou]", "(VOWEL1)");
        StringBuffer sbuf = new StringBuffer();
        Pattern p = Pattern.compile("[aeiou]");
        Matcher m = p.matcher(s);
        while (m.find()) {
            m.appendReplacement(sbuf, m.group().toUpperCase());
        }
        m.appendTail(sbuf);
        System.out.println(sbuf);
    }
}/*
Here's a block of text to use as input to
the regular expression matcher. Note that we'll
first extract the block of text by looking for
the special delimiters, then process the
exteckted block. 
H(VOWEL1)rE's A blOck Of tExt tO UsE As InpUt tO
thE rEgUlAr ExprEssIOn mAtchEr. NOtE thAt wE'll
fIrst ExtrAct thE blOck Of tExt by lOOkIng fOr
thE spEcIAl dElImItErs, thEn prOcEss thE
ExtEcktEd blOck. 
*/

minput用以匹配/!和!/之间的所有汉字。接下来将存在两个或两个以上空格的地方缩减为一个空格,并删除每行开头部分的所有空格。因为两个替换操作都只用了一次replaceAll所以与其编译为Pattern,不如直接使用String的replaceAll方法,而且开销更小一些。
appendReplacement允许在执行替换过程中,操作用来替换的字符串。例子中先构造sbuf用来保存最终结果,然后用group选择一个组,并对其进行处理,将正则表达式找到的元音字母转换成大写字母。执行完所有替换操作,然后调用appendTail方法。

reset()

reset方法可以将现有的Matcher对象应用于一个新的字符序列:

public class Resetting {
    public static void main(String[] args) {
        Matcher m = Pattern.compile("[frb][aiu][gx]").matcher("fix the fug with bags");
        while (m.find()) {
            System.out.print(m.group() + " ");
        }
        System.out.println();
        m.reset("fix the rig with rags");
        while (m.find()) {
            System.out.print(m.group() + " ");
        }
    }
}/*
fix fug bag 
fix rig rag 
*/

使用不带参数的reset方法,可以将Matcher对象重新设置到当前字符序列的起始位置。

正则表达式与Java I/O

应用正则表达式在一个文件中进行搜索匹配操作。JGrep.java源自于Unix上的grep。它有两个参数:文件名以及要匹配的正则表达式。输出的是有匹配的部分以及匹配部分在行中的位置。

public class JGrep {
    public static void main(String[] args) throws Exception {
        Pattern p = Pattern.compile("\\b[Ssct]\\w+");
        // Iterate through the lines of the input file:
        int index = 0;
        Matcher m = p.matcher("");
        for(String line : new TextFile("E:\\IDEAFile\\untitled\\src\\JGrep.java")) {
            m.reset(line);
            while(m.find()) {
                System.out.println(index++ + ": " +
                        m.group() + ": " + m.start());
            }
        }
    }
}/*
0: class: 7
1: static: 11
2: String: 28
3: throws: 43
4: compile: 28
5: Ssct: 41
6: through: 19
7: the: 27
8: the: 40
9: String: 12
10: src: 64
11: System: 16
12: start: 45
*/

读入JGrep.java文件,搜索以[Ssct]开头的单词。

扫描输入

从文件或标准输入读取数据(读入一行文本,就对其进行分词,然后解析数据):

public class SimpleRead {
    public static BufferedReader input = new BufferedReader(new StringReader("Sir Robin of Camelot\n22 1.61803"));

    public static void main(String[] args) {
        try {
            System.out.println("What is your name?");
            String name = input.readLine();
            System.out.println(name);
            System.out.println("How old are you? What is your favorite double?");
            System.out.println("(input: <age><double>)");
            String numbers = input.readLine();
            System.out.println(numbers);
            String[] numArray = numbers.split(" ");
            int age = Integer.parseInt(numArray[0]);
            double favorite = Double.parseDouble(numArray[1]);
            System.out.format("Hi %s.\n", name);
            System.out.format("In 5 years you will be %d.\n", age + 5);
            System.out.format("My favorite double is %f.", favorite /2);
        } catch (IOException e) {
            System.err.println("I/O exception");
        }
    }
}/*
What is your name?
Sir Robin of Camelot
How old are you? What is your favorite double?
(input: <age><double>)
22 1.61803
Hi Sir Robin of Camelot.
In 5 years you will be 27.
My favorite double is 0.809015.
*/

readLine房一行输入转为String对象。
Java SE5新增了Scanner类,它可以大大减轻扫描输入的工作负担:

public class BetterRead {
    public static void main(String[] args) {
        Scanner stdin = new Scanner(SimpleRead.input);
        System.out.println("What is your name?");
        String name = stdin.nextLine();
        System.out.println(name);
        System.out.println("How old are you? What is your favorite double?");
        System.out.println("(input: <age><double>)");
        String numbers = stdin.nextLine();
        System.out.println(numbers);
        String[] numArray = numbers.split(" ");
        int age = Integer.parseInt(numArray[0]);
        double favorite = Double.parseDouble(numArray[1]);
        System.out.format("Hi %s.\n", name);
        System.out.format("In 5 years you will be %d.\n", age + 5);
        System.out.format("My favorite double is %f.", favorite /2);
    }
}/*
What is your name?
Sir Robin of Camelot
How old are you? What is your favorite double?
(input: <age><double>)
22 1.61803
Hi Sir Robin of Camelot.
In 5 years you will be 27.
My favorite double is 0.809015.
*/

Scanner的构造器可以接受任何类型的输入对象,包括File对象、InputStream、String或者Readable对象。Readable是Java SE5中新加入的接口,表示具有read方法的某种东西。有了Scaaner,所有的输入、分词以及翻译的操作都隐藏在不同类型的next方法中。普通的next方法返回下一个String。所有基本类型除char之外都用对应的next方法,包括BigDecimal和BigInteger。Scanner还有对应的hasNest方法,用以判断下一个输入分词是否所需的类型。
Scanner有一个假设,在输入结束时会抛出IOException,所以Scanner会把IOException吞掉。

Scanner定界符

默认情况下,Scanner根据空白字符对输入进行分词,也可用正则表达式指定定界符:

public class SacnnerDelimiter {
    public static void main(String[] args) {
        Scanner scanner = new Scanner("12, 42, 78, 99, 42");
        scanner.useDelimiter("\\s*,\\s*");
        while (scanner.hasNextInt()) {
            System.out.println(scanner.nextInt());
        }
    }
}/*
12
42
78
99
42
*/

使用useDelimiter来设置定界符,还有一个delimiter方法来返回当前正在做为定界符使用的Pattern对象。

用正则表达式扫描

使用自定义正则表达式进行扫描复杂数据:

public class ThreatAnalyzer {
    static String threatData =
            "58.27.82.161@02/10/2005\n" +
            "204.45.234.40@02/11/2005\n" +
            "58.27.82.161@02/11/2005\n" +
            "58.27.82.161@02/12/2005\n" +
            "58.27.82.161@02/12/2005\n" +
            "[Next log section with different data format]";

    public static void main(String[] args) {
        Scanner scanner = new Scanner(threatData);
        String pattern = "(\\d+[.]\\d+[.]\\d+[.]\\d+)@" +
                "(\\d{2}/\\d{2}/\\d{4})";
        while (scanner.hasNext(pattern)) {
            scanner.next(pattern);
            MatchResult match = scanner.match();
            String ip = match.group(1);
            String date = match.group(2);
            System.out.format("Threat on %s from %s\n", date, ip);
        }
    }
}/*
Threat on 02/10/2005 from 58.27.82.161
Threat on 02/11/2005 from 204.45.234.40
Threat on 02/11/2005 from 58.27.82.161
Threat on 02/12/2005 from 58.27.82.161
Threat on 02/12/2005 from 58.27.82.161
*/

当next方法配合指定的正则表达式使用时,将找到下一个匹配该模式的输入部分,调用match方法就可以获得匹配结果。

  • 0
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值