Top 10 Questions for Java Regular Expression

This post summarizes the most frequently asked questions about Java regular expressions. Because they come up so often, the answers are likely to be useful to you as well.

1. How to extract numbers from a string?

One common question about regular expressions is how to extract all the numbers in a string into a list of integers.

In Java, \d matches any single digit (0-9). Using the predefined character classes whenever possible makes your code easier to read and eliminates errors introduced by malformed character classes. Please refer to Predefined character classes for more details. Note the backslash in \d: if you use an escaped construct within a string literal, you must precede the backslash with another backslash for the string to compile. That is why we need to write \\d.

List<Integer> numbers = new LinkedList<Integer>();
Pattern p = Pattern.compile("\\d+");
Matcher m = p.matcher(str); 
while (m.find()) {
  numbers.add(Integer.parseInt(m.group()));
}

2. How to split Java String by newlines?

There are at least three different line terminators, depending on the operating system that produced the text.

  1. \r represents CR (Carriage Return), which was used in classic Mac OS (version 9 and earlier)
  2. \n means LF (Line Feed), used in Unix, Linux, and macOS
  3. \r\n means CR + LF, used in Windows

Therefore the most straightforward way to split a string by newlines is

String[] lines = str.split("\\r?\\n");

But if you don't want empty lines, you can use the following, which is also my favourite way:

String[] lines = str.split("[\\r\\n]+");

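Another option is to split on the line separator of the platform the program runs on. Note that this only recognizes that platform's line ending, so text produced on another system may not split as expected. Remember too that you will still get empty lines if two line separators are placed side by side.

String[] lines = str.split(System.getProperty("line.separator"));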
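As a rough illustration of how these patterns differ, the sketch below splits one string that mixes all three line endings. The class name LineSplitDemo, its method names, and the sample text are made up for this example. The last method uses \R, the linebreak matcher that Pattern supports from Java 8 onward, which is an addition to the approaches discussed above.

```java
import java.util.Arrays;

class LineSplitDemo {
    // Sample text mixing LF, CR+LF, a blank line, and a bare CR (made up for illustration).
    static final String TEXT = "one\ntwo\r\n\r\nthree\rfour";

    static String[] splitKeepEmpty(String s) {
        // Splits on LF or CR+LF; a bare CR is not treated as a separator.
        return s.split("\\r?\\n");
    }

    static String[] splitSkipEmpty(String s) {
        // Treats any run of CR/LF characters as one separator, so blank lines vanish.
        return s.split("[\\r\\n]+");
    }

    static String[] splitAnyLinebreak(String s) {
        // \R (Java 8+) matches any single line break, with CR+LF counted as one.
        return s.split("\\R");
    }

    public static void main(String[] args) {
        System.out.println(splitKeepEmpty(TEXT).length);    // 4 pieces, last one still contains \r
        System.out.println(splitSkipEmpty(TEXT).length);    // 4 pieces, blank line dropped
        System.out.println(splitAnyLinebreak(TEXT).length); // 5 pieces, blank line kept
        System.out.println(Arrays.toString(splitSkipEmpty(TEXT)));
    }
}
```

Which variant fits depends on whether blank lines carry meaning in your input and whether bare CR terminators can occur.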

3. Importance of Pattern.compile()

A regular expression, specified as a string, must first be compiled into an instance of the Pattern class. The static Pattern.compile() methods are the only way to create such an instance. A typical invocation sequence is thus

Pattern p = Pattern.compile("a*b");
Matcher matcher = p.matcher("aaaaab");
assert matcher.matches() == true;

Essentially, Pattern.compile() transforms a regular expression into an internal representation that works like a finite state machine (see Compilers: Principles, Techniques, and Tools (2nd Edition)). All of the state involved in performing a match, however, resides in the matcher. This way, the Pattern p can be reused, and many matchers can share the same pattern.

Matcher anotherMatcher = p.matcher("aab");
assert anotherMatcher.matches() == true;

The Pattern.matches() method is defined as a convenience for when a regular expression is used just once. It still calls compile() implicitly to obtain a Pattern instance and then matches the string against it. Therefore,

boolean b = Pattern.matches("a*b", "aaaaab");

is equivalent to the first snippet above, though for repeated matches it is less efficient since it does not allow the compiled pattern to be reused.
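To make the reuse point concrete, here is a minimal sketch that compiles the pattern once and applies it to several inputs. The class name PatternReuseDemo, the helper countMatches, and the sample inputs are assumptions made for illustration only.

```java
import java.util.regex.Pattern;

class PatternReuseDemo {
    // Compiled once when the class loads; a Pattern is immutable and can be shared.
    private static final Pattern A_STAR_B = Pattern.compile("a*b");

    static int countMatches(String[] inputs) {
        int count = 0;
        for (String input : inputs) {
            // Each call creates a fresh Matcher, which holds the per-match state.
            if (A_STAR_B.matcher(input).matches()) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        String[] inputs = {"aaaaab", "aab", "b", "abc", ""};
        System.out.println(countMatches(inputs)); // prints 3
    }
}
```

Calling Pattern.matches("a*b", input) inside the loop would give the same count but would recompile the expression on every iteration.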

4. How to escape text for regular expression?

In general, regular expressions use “\” to escape constructs, but it is painful to precede each backslash with another backslash so that the Java string compiles. There is another way to pass a string literal such as “$5” to a Pattern. Instead of writing \\$5 or [$]5, we can type

Pattern.quote("$5");
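The following sketch shows the difference that quoting makes when searching for the literal text $5. The class QuoteDemo and its two helper methods are hypothetical names introduced only for this example.

```java
import java.util.regex.Pattern;

class QuoteDemo {
    static boolean containsLiteral(String text, String literal) {
        // Pattern.quote returns a pattern string in which metacharacters
        // such as $ are treated as ordinary characters.
        return Pattern.compile(Pattern.quote(literal)).matcher(text).find();
    }

    static boolean containsUnquoted(String text, String regex) {
        // Without quoting, $ keeps its meaning as an end-of-input anchor.
        return Pattern.compile(regex).matcher(text).find();
    }

    public static void main(String[] args) {
        String text = "price: $5";
        System.out.println(containsLiteral(text, "$5"));  // true
        System.out.println(containsUnquoted(text, "$5")); // false
    }
}
```

The unquoted version can never succeed here, because it asks for a 5 after the end of the input.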

5. Why does String.split() need pipe delimiter to be escaped?

String.split() splits a string around matches of the given regular expression. Java regular expressions support special characters, called metacharacters, that affect the way a pattern is matched. | is one such metacharacter; it is used to match a single regular expression out of several possible regular expressions. For example, A|B means either A or B. Please refer to Alternation with The Vertical Bar or Pipe Symbol for more details. Therefore, to use | as a literal, you need to escape it by adding \ in front of it, which is written \\| in a Java string.
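A short sketch of the behaviour described above, assuming a sample input of a|b|c. The class name PipeSplitDemo and its helpers are made up for illustration. Pattern.quote is shown as an alternative to the manual escape.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class PipeSplitDemo {
    static String[] splitEscaped(String s) {
        // \\| in Java source is the regex \|, a literal pipe.
        return s.split("\\|");
    }

    static String[] splitQuoted(String s) {
        // Pattern.quote builds the literal pattern for us.
        return s.split(Pattern.quote("|"));
    }

    static String[] splitUnescaped(String s) {
        // A bare | is an alternation between two empty expressions, so it
        // matches the empty string and the input is cut between characters.
        return s.split("|");
    }

    public static void main(String[] args) {
        String input = "a|b|c";
        System.out.println(Arrays.toString(splitEscaped(input)));   // [a, b, c]
        System.out.println(Arrays.toString(splitQuoted(input)));    // [a, b, c]
        System.out.println(Arrays.toString(splitUnescaped(input))); // not the three fields
    }
}
```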

6. How can we match a^n b^n with Java regex?

This is the language of all non-empty strings consisting of some number of a's followed by an equal number of b's, like ab, aabb, and aaabbb. This language is generated by the context-free grammar S → aSb | ab, and it can be proven (for example with the pumping lemma) to be non-regular.

However, Java regex implementations can recognize more than just regular languages. That is, they are not “regular” by the formal language theory definition. Lookahead combined with a self-referencing group achieves it. Here I give the final regular expression first and then explain it a little. For a comprehensive explanation, see How can we match a^n b^n with Java regex.

Pattern p = Pattern.compile("(?x)(?:a(?= a*(\\1?+b)))+\\1");
// true
System.out.println(p.matcher("aaabbb").matches());
// false
System.out.println(p.matcher("aaaabbb").matches());
// false
System.out.println(p.matcher("aaabbbb").matches());
// false
System.out.println(p.matcher("caaabbb").matches());

Instead of explaining the syntax of this complex regular expression, I would rather describe briefly how it works.

  1. In the first iteration, it stops at the first a and then looks ahead (after skipping some a's with a*) to check whether there is a b. This is achieved by (?:a(?= a*(\\1?+b))). If the lookahead succeeds, \1, the self-referencing group, captures what the innermost parentheses matched, which is a single b in the first iteration.
  2. In the second iteration, the expression stops at the second a and then looks ahead (again skipping a's) to see whether there are b's. But this time, \\1?+b is effectively equivalent to bb, so two b's have to be matched. If so, \1 is changed to bb after the second iteration.
  3. In the nth iteration, the expression stops at the nth a and checks whether there are n b's ahead.

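This way, the expression counts the a's while growing \1 by one b per iteration. The trailing \\1 then consumes exactly that many b's, so the whole string matches only if the a's are followed by the same number of b's.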
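One way to gain confidence in such a dense pattern is to compare it with a plain counting loop on a handful of inputs. The sketch below does that. The class AnBnCheck, both method names, and the sample inputs are assumptions made for this example, and the loop is a simple reference check, not part of the regex technique itself.

```java
import java.util.regex.Pattern;

class AnBnCheck {
    // Same expression as above; (?x) lets the pattern contain ignorable whitespace.
    private static final Pattern AN_BN =
            Pattern.compile("(?x)(?:a(?= a*(\\1?+b)))+\\1");

    static boolean matchesByRegex(String s) {
        return AN_BN.matcher(s).matches();
    }

    static boolean matchesByCounting(String s) {
        int i = 0;
        while (i < s.length() && s.charAt(i) == 'a') {
            i++;
        }
        int aCount = i;
        while (i < s.length() && s.charAt(i) == 'b') {
            i++;
        }
        int bCount = i - aCount;
        // The whole input must be consumed, with at least one a and equal counts.
        return i == s.length() && aCount > 0 && aCount == bCount;
    }

    public static void main(String[] args) {
        String[] samples = {"ab", "aabb", "aaabbb", "aaaabbb", "aaabbbb", "caaabbb", "abab", ""};
        for (String s : samples) {
            System.out.println("\"" + s + "\" regex=" + matchesByRegex(s)
                    + " loop=" + matchesByCounting(s));
        }
    }
}
```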

7. How to replace 2 or more spaces with a single space in a string and delete leading spaces only?

String.replaceAll() replaces each substring that matches the given regular expression with the given replacement. A run of one or more whitespace characters can be expressed by the regular expression [\\s]+; replacing a lone space with a space changes nothing, so there is no need to require two or more. Therefore, the following code will work. Note that this solution does not remove leading and trailing whitespace; each such run is only collapsed to a single space. If you would like it removed entirely, you can apply String.trim() to the result.

String line = "  aa bbbbb   ccc     d  ";
// " aa bbbbb ccc d "
System.out.println(line.replaceAll("[\\s]+", " "));
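If the goal is literally what the question asks, that is, collapsing runs of two or more spaces while deleting leading spaces only, one possible sketch is a two-step replacement. The class SpaceCleanupDemo and the method name are hypothetical, and this version deliberately handles only the space character rather than all whitespace.

```java
class SpaceCleanupDemo {
    static String collapseAndStripLeading(String line) {
        // Step 1: ^ anchors at the start, so only leading spaces are removed.
        // Step 2: {2,} matches runs of two or more spaces and replaces each with one.
        return line.replaceAll("^ +", "").replaceAll(" {2,}", " ");
    }

    public static void main(String[] args) {
        String line = "  aa bbbbb   ccc     d  ";
        // Prints [aa bbbbb ccc d ] (the trailing run is collapsed, not removed).
        System.out.println("[" + collapseAndStripLeading(line) + "]");
    }
}
```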

8. How to determine if a number is a prime with regex?

public static void main(String[] args) {
  // false
  System.out.println(prime(1));
  // true
  System.out.println(prime(2));
  // true
  System.out.println(prime(3));
  // true
  System.out.println(prime(5));
  // false
  System.out.println(prime(8));
  // true
  System.out.println(prime(13));
  // false
  System.out.println(prime(14));
  // false
  System.out.println(prime(15));
}
 
public static boolean prime(int n) {
  return !new String(new char[n]).matches(".?|(..+?)\\1+");
}

The function first generates a string of n characters and checks whether that string matches .?|(..+?)\\1+. If n is prime, the match fails, matches() returns false, and the ! reverses the result.

The first part, .?, makes sure that 0 and 1 are not reported as prime. The magic is in the second part, where a backreference is used. (..+?)\\1+ first matches a group of two or more characters and then requires that same group to repeat one or more times via \\1+.

By definition, a prime number is a natural number greater than 1 that has no positive divisors other than 1 and itself. That means if a = n*m with n and m both greater than 1, then a is not prime. n*m can be read as “repeat n m times”, and that is exactly what the regular expression does: it matches n characters with (..+?) and then repeats them with \\1+ so that the group appears m times in total. Therefore, if the pattern matches, the number is not prime; otherwise it is. Remember that ! reverses the result.
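To check the reasoning above, the regex can be compared with ordinary trial division over a small range. This is only a sanity-check sketch: the class PrimeRegexCheck, the method names, and the limit of 100 are assumptions for illustration. The regex approach builds a string of length n and backtracks heavily, so it is a curiosity rather than a practical primality test.

```java
class PrimeRegexCheck {
    static boolean primeByRegex(int n) {
        // A string of n characters matches when n is 0, 1, or a product of
        // two factors that are both at least 2.
        return !new String(new char[n]).matches(".?|(..+?)\\1+");
    }

    static boolean primeByTrialDivision(int n) {
        if (n < 2) {
            return false;
        }
        for (int d = 2; (long) d * d <= n; d++) {
            if (n % d == 0) {
                return false;
            }
        }
        return true;
    }

    static boolean agreeUpTo(int limit) {
        for (int n = 0; n <= limit; n++) {
            if (primeByRegex(n) != primeByTrialDivision(n)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(agreeUpTo(100)); // true
    }
}
```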

9. How to split a comma-separated string but ignoring commas in quotes?

You have reached the point where regular expressions break down. It is better and neater to write a simple splitter that handles special cases as you wish.

Alternatively, you can mimic the operation of a finite state machine by using a switch statement or if-else. A code snippet follows.

public static void main(String[] args) {
  String line = "aaa,bbb,\"c,c\",dd;dd,\"e,e";
  List<String> toks = splitComma(line);
  for (String t : toks) {
    System.out.println("> " + t);
  }
}
 
private static List<String> splitComma(String str) {
  int start = 0;
  List<String> toks = new ArrayList<String>();
  boolean withinQuote = false;
  for (int end = 0; end < str.length(); end++) {
    char c = str.charAt(end);
    switch(c) {
    case ',':
      if (!withinQuote) {
        toks.add(str.substring(start, end));
        start = end + 1;
      }
      break;
    case '\"':
      withinQuote = !withinQuote;
      break;
    }
  }
  if (start < str.length()) {
    toks.add(str.substring(start));
  }
  return toks;
}

10. How to use backreferences in Java Regular Expressions

Backreferences are another useful feature of Java regular expressions.
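As a small illustration of the idea, the sketch below uses a backreference to detect a word that is immediately repeated and a group reference in the replacement string to remove the duplicate. The class BackreferenceDemo, its method names, and the sample sentence are made up for this example and are not taken from the original discussion.

```java
import java.util.regex.Pattern;

class BackreferenceDemo {
    // \b(\w+)\s+\1\b : a whole word, some whitespace, then the same word again.
    // Inside the pattern, \1 refers back to whatever group 1 captured.
    private static final Pattern REPEATED_WORD =
            Pattern.compile("\\b(\\w+)\\s+\\1\\b");

    static boolean hasRepeatedWord(String text) {
        return REPEATED_WORD.matcher(text).find();
    }

    static String removeRepeatedWord(String text) {
        // In the replacement string, $1 stands for the text captured by group 1.
        return REPEATED_WORD.matcher(text).replaceAll("$1");
    }

    public static void main(String[] args) {
        String text = "this is is a test";
        System.out.println(hasRepeatedWord(text));    // true
        System.out.println(removeRepeatedWord(text)); // this is a test
    }
}
```

Note the two notations: \1 is used inside the pattern, while $1 is used in the replacement text.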

