Regular Expressions and the Java Programming Language

Applications frequently require text processing for features like word searches, email validation, or XML document integrity. This often involves pattern matching. Languages like Perl, sed, or awk improve pattern matching with the use of regular expressions, strings of characters that define patterns used to search for matching text. Pattern matching in the Java programming language formerly required the StringTokenizer class, together with many charAt and substring calls, to read through the characters or tokens and process the text. This often led to complex or messy code.

Until now.

The Java 2 Platform, Standard Edition (J2SE), version 1.4, contains a new package called java.util.regex, which enables the use of regular expressions. The package supports metacharacters, which give regular expressions their versatility.

This article provides an overview of the use of regular expressions, and details how to use regular expressions with the java.util.regex package, using the following common scenarios as examples:

 

  • Simple word replacement
  • Email validation
  • Removal of control characters from a file
  • File searching

 

To compile the code in these examples and to use regular expressions in your applications, you'll need to install J2SE version 1.4.

 

Regular Expression Constructs

 

A regular expression is a pattern of characters that describes a set of strings. You can use the java.util.regex package to find, display, or modify some or all of the occurrences of a pattern in an input sequence.

The simplest form of a regular expression is a literal string, such as "Java" or "programming." Regular expression matching also allows you to test whether a string fits into a specific syntactic form, such as an email address.

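Regular expressions are built from ordinary and special characters. The special characters include:

\ $ ^ . *
+ ? [ ]

Any other character appearing in a regular expression is ordinary, unless a \ precedes it. Conversely, a \ in front of a special character makes it ordinary, so \. matches a literal period.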

Special characters serve a special purpose. For instance, the . matches any single character except a line terminator such as a newline. A regular expression like s.n matches any three-character string that begins with s and ends with n, including sun and son.
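As a small illustration of the s.n expression described above, the following sketch tests a few inputs against it. The class and method names are invented for this example.

```java
import java.util.regex.Pattern;

public class DotExample {
    // Returns true when the whole input matches the expression s.n
    static boolean matchesSDotN(String input) {
        return Pattern.matches("s.n", input);
    }

    public static void main(String[] args) {
        System.out.println(matchesSDotN("sun"));   // true
        System.out.println(matchesSDotN("son"));   // true
        System.out.println(matchesSDotN("soon"));  // false: four characters
        System.out.println(matchesSDotN("s\nn"));  // false: . skips the newline
    }
}
```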

Many other special characters exist. Some anchor a match to the beginning of a line, some control whether matching ignores case, and some express a range, such as a-e, meaning any letter from a to e.
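The sketch below shows one way these three ideas might look in code: the ^ anchor combined with the MULTILINE flag, the CASE_INSENSITIVE flag, and the [a-e] range. The helper name found and the sample text are assumptions made for this example.

```java
import java.util.regex.Pattern;

public class SpecialCharacters {
    // Reports whether the expression is found anywhere in the input
    static boolean found(String regex, int flags, String input) {
        return Pattern.compile(regex, flags).matcher(input).find();
    }

    public static void main(String[] args) {
        String text = "first line\nJava is here";

        // With MULTILINE, ^ anchors the match at the start of any line
        System.out.println(found("^Java", Pattern.MULTILINE, text));        // true
        // Without it, ^ matches only at the start of the whole input
        System.out.println(found("^Java", 0, text));                        // false

        // Matching is case-sensitive unless CASE_INSENSITIVE is set
        System.out.println(found("java", 0, text));                         // false
        System.out.println(found("java", Pattern.CASE_INSENSITIVE, text));  // true

        // [a-e] matches any one letter from a to e
        System.out.println(found("[a-e]", 0, "xyz"));                       // false
        System.out.println(found("[a-e]", 0, "xyzc"));                      // true
    }
}
```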

The regular expression syntax in this new package is Perl-like, so if you are familiar with regular expressions in Perl, you can use much the same syntax in the Java programming language. If you're not familiar with regular expressions, here are a few constructs to get you started:

Construct       Matches

Characters
x               The character x
\\              The backslash character
\0n             The character with octal value 0n (0 <= n <= 7)
\0nn            The character with octal value 0nn (0 <= n <= 7)
\0mnn           The character with octal value 0mnn (0 <= m <= 3, 0 <= n <= 7)
\xhh            The character with hexadecimal value 0xhh
\uhhhh          The character with hexadecimal value 0xhhhh
\t              The tab character ('\u0009')
\n              The newline (line feed) character ('\u000A')
\r              The carriage-return character ('\u000D')
\f              The form-feed character ('\u000C')
\a              The alert (bell) character ('\u0007')
\e              The escape character ('\u001B')
\cx             The control character corresponding to x

Character Classes
[abc]           a, b, or c (simple class)
[^abc]          Any character except a, b, or c (negation)
[a-zA-Z]        a through z or A through Z, inclusive (range)
[a-z&&[^bc]]    a through z, except for b and c: [ad-z] (subtraction)
[a-z&&[^m-p]]   a through z, except for m through p: [a-lq-z] (subtraction)
[a-z&&[def]]    d, e, or f (intersection)

Predefined Character Classes
.               Any character (may or may not match line terminators)
\d              A digit: [0-9]
\D              A non-digit: [^0-9]
\s              A whitespace character: [ \t\n\x0B\f\r]
\S              A non-whitespace character: [^\s]
\w              A word character: [a-zA-Z_0-9]
\W              A non-word character: [^\w]

Check the documentation about the Pattern class for more specific details and examples.
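To see a handful of these constructs in action, here is a minimal sketch. Note that inside a Java string literal each backslash of the regular expression must itself be doubled, so the construct \d is written "\\d". The class name and sample inputs are invented for this example.

```java
import java.util.regex.Pattern;

public class ConstructSamples {
    // Returns true when the whole input matches the expression
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // Predefined classes: digits, word characters, whitespace
        System.out.println(matches("\\d\\d\\d", "123"));         // true
        System.out.println(matches("\\w+", "java_14"));          // true
        System.out.println(matches("\\w+", "java 14"));          // false: space
        System.out.println(matches("\\S+\\s\\S+", "java 14"));   // true

        // Character class subtraction written with && and a negated class
        System.out.println(matches("[a-z&&[^bc]]", "b"));        // false
        System.out.println(matches("[a-z&&[^bc]]", "d"));        // true

        // A hexadecimal escape (41 is A) followed by the tab escape
        System.out.println(matches("\\x41\\t", "A\t"));          // true

        // A regex backslash pair matches one literal backslash
        System.out.println(matches("\\\\", "\\"));               // true
    }
}
```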

Classes and Methods

 

The following classes match character sequences against patterns specified by regular expressions.

Pattern Class

An instance of the Pattern class represents a regular expression that is specified in string form in a syntax similar to that used by Perl.

A regular expression, specified as a string, must first be compiled into an instance of the Pattern class. The resulting pattern is used to create a Matcher object that matches arbitrary character sequences against the regular expression. Many matchers can share the same pattern because it is stateless.

The compile method compiles the given regular expression into a pattern, then the matcher method creates a matcher that will match the given input against this pattern. The pattern method returns the regular expression from which this pattern was compiled.
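A minimal sketch of these three methods follows. One compiled pattern is shared by two matchers, and pattern() returns the original expression. The expression a*b and the class name are chosen only for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PatternBasics {
    // Compile once; the compiled pattern can be shared by many matchers
    static final Pattern P = Pattern.compile("a*b");

    // Creates a fresh matcher for the input and tests the whole sequence
    static boolean wholeMatch(String input) {
        Matcher m = P.matcher(input);
        return m.matches();
    }

    // Returns the expression from which the pattern was compiled
    static String source() {
        return P.pattern();
    }

    public static void main(String[] args) {
        System.out.println(wholeMatch("aaaab"));  // true
        System.out.println(wholeMatch("abc"));    // false: trailing c
        System.out.println(source());             // a*b
    }
}
```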

The split method is a convenience method that splits the given input sequence around matches of this pattern. The following example demonstrates:

/*
 * Uses split to break up a string of input separated by
 * commas and/or whitespace.
 */
import java.util.regex.*;

public class Splitter {
    public static void main(String[] args) throws Exception {
        // Create a pattern to match breaks
        Pattern p = Pattern.compile("[,\\s]+");
        // Split input with the pattern
        String[] result = 
                 p.split("one,two, three   four ,  five");
        for (int i=0; i<result.length; i++)
            System.out.println(result[i]);
    }
}

Matcher Class

Instances of the Matcher class are used to match character sequences against a given pattern. Input is provided to matchers through the CharSequence interface, which supports matching against characters from a wide variety of input sources.

A matcher is created from a pattern by invoking the pattern's matcher method. Once created, a matcher can be used to perform three different kinds of match operations:

 

  • The matches method attempts to match the entire input sequence against the pattern.
  • The lookingAt method attempts to match the input sequence, starting at the beginning, against the pattern.
  • The find method scans the input sequence looking for the next sequence that matches the pattern.

 

Each of these methods returns a boolean indicating success or failure. More information about a successful match can be obtained by querying the state of the matcher.
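The difference between the three operations can be sketched as follows, along with the start, end, and group methods that query the matcher's state after a successful match. The input string and the helper countMatches are assumptions made for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MatchOperations {
    // Counts how many times find() succeeds on the input
    static int countMatches(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        Matcher m = Pattern.compile("foo").matcher("foobar foo");

        // matches: the entire input must match the pattern
        System.out.println(m.matches());    // false
        // lookingAt: only the beginning of the input must match
        System.out.println(m.lookingAt());  // true

        // find: scan for each successive match, starting from the beginning
        m.reset();
        while (m.find()) {
            System.out.println(m.group() + " at " + m.start() + "-" + m.end());
        }
        // prints "foo at 0-3" and then "foo at 7-10"
    }
}
```

The call to reset matters here: without it, find would continue from the end of the match that lookingAt just made.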

This class also defines methods for replacing matched sequences by new strings whose contents can, if desired, be computed from the match result.

The appendReplacement method appends the input text that lies between the previous match and the current one, followed by the replacement for the current match. The appendTail method appends whatever input remains after the last match.

For instance, when replacing cat with dog in the string blahcatblahcatblah, the first appendReplacement appends blahdog. The second appendReplacement appends another blahdog, and appendTail appends the final blah, resulting in: blahdogblahdogblah. See Simple Word Replacement for an example.
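The blahcatblahcatblah walk-through can be traced step by step with a short sketch that prints the buffer after each call. The helper upperCaseMatches is a hypothetical addition that shows a replacement computed from the match itself. It uses Matcher.quoteReplacement, which is available only in releases later than 1.4, to keep any $ or \ in the computed text from being interpreted specially.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AppendSteps {
    // Replaces every match with an upper-cased copy of the matched text
    static String upperCaseMatches(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            String computed = m.group().toUpperCase();
            m.appendReplacement(sb, Matcher.quoteReplacement(computed));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        Matcher m = Pattern.compile("cat").matcher("blahcatblahcatblah");
        StringBuffer sb = new StringBuffer();

        m.find();
        m.appendReplacement(sb, "dog");
        System.out.println(sb);  // blahdog

        m.find();
        m.appendReplacement(sb, "dog");
        System.out.println(sb);  // blahdogblahdog

        m.appendTail(sb);
        System.out.println(sb);  // blahdogblahdogblah

        System.out.println(upperCaseMatches("cat", "blahcatblahcatblah"));
        // blahCATblahCATblah
    }
}
```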

CharSequence Interface

The CharSequence interface provides uniform, read-only access to many different types of character sequences. You supply the data to be searched from different sources. String, StringBuffer and CharBuffer implement CharSequence, so they are easy sources of data to search through. If you don't care for one of the available sources, you can write your own input source by implementing the CharSequence interface.
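As a brief sketch of this flexibility, the same compiled pattern below is applied to a String, a StringBuffer, and a CharBuffer through a single method that accepts any CharSequence. The class name, the helper containsDigits, and the sample data are invented for this example.

```java
import java.nio.CharBuffer;
import java.util.regex.Pattern;

public class CharSequenceSources {
    static final Pattern DIGITS = Pattern.compile("\\d+");

    // Accepts any CharSequence, so the caller chooses the input source
    static boolean containsDigits(CharSequence input) {
        return DIGITS.matcher(input).find();
    }

    public static void main(String[] args) {
        // A String
        System.out.println(containsDigits("J2SE 1.4"));                          // true
        // A StringBuffer
        System.out.println(containsDigits(new StringBuffer("no digits here")));  // false
        // A CharBuffer wrapping character data
        System.out.println(containsDigits(CharBuffer.wrap("version 14")));       // true
    }
}
```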

 

Example Regex Scenarios

 

The following code samples demonstrate the use of the java.util.regex package for various common scenarios:

Simple Word Replacement

/*
 * This code writes "one dog, two dogs in the yard"
 * to the standard-output stream:
 */
import java.util.regex.*;

public class Replacement {
    public static void main(String[] args) 
                         throws Exception {
        // Create a pattern to match cat
        Pattern p = Pattern.compile("cat");
        // Create a matcher with an input string
        Matcher m = p.matcher("one cat," +
                       " two cats in the yard");
        StringBuffer sb = new StringBuffer();
        boolean result = m.find();
        // Loop through and create a new String 
        // with the replacements
        while(result) {
            m.appendReplacement(sb, "dog");
            result = m.find();
        }
        // Add the last segment of input to 
        // the new String
        m.appendTail(sb);
        System.out.println(sb.toString());
    }
}

Email Validation

The following code shows a sample of checks for characters that should or should not appear in an email address. It is not a complete email validation program that handles every possible case, but you can extend it as needed.

/*
* Checks for invalid characters
* in email addresses
*/
import java.util.regex.*;

public class EmailValidation {
   public static void main(String[] args) 
                                 throws Exception {
                                 
      String input = "@sun.com";
      //Checks for email addresses starting with
      //inappropriate symbols like dots or @ signs.
      Pattern p = Pattern.compile("^\\.|^\\@");
      Matcher m = p.matcher(input);
      if (m.find())
         System.err.println("Email addresses don't start" +
                            " with dots or @ signs.");
      //Checks for email addresses that start with
      //www. and prints a message if it does.
      p = Pattern.compile("^www\\.");
      m = p.matcher(input);
      if (m.find()) {
        System.out.println("Email addresses don't start" +
                " with \"www.\", only web pages do.");
      }
      //Matches any run of characters outside the allowed set
      p = Pattern.compile("[^A-Za-z0-9\\.\\@_\\-~#]+");
      m = p.matcher(input);
      StringBuffer sb = new StringBuffer();
      boolean result = m.find();
      boolean deletedIllegalChars = false;

      while(result) {
         deletedIllegalChars = true;
         m.appendReplacement(sb, "");
         result = m.find();
      }

      // Add the last segment of input to the new String
      m.appendTail(sb);

      input = sb.toString();

      if (deletedIllegalChars) {
         System.out.println("It contained incorrect characters" +
                           ", such as spaces or commas.");
      }
   }
}

Removing Control Characters from a File

/* This class removes control characters from a named
*  file.
*/
import java.util.regex.*;
import java.io.*;

public class Control {
    public static void main(String[] args) 
                                 throws Exception {
                                 
        //Create File objects for the input and output
        //files (the names are hard-coded here):
        File fin = new File("fileName1");
        File fout = new File("fileName2");
        //Open an input and an output stream
        FileInputStream fis = 
                          new FileInputStream(fin);
        FileOutputStream fos = 
                        new FileOutputStream(fout);

        BufferedReader in = new BufferedReader(
                       new InputStreamReader(fis));
        BufferedWriter out = new BufferedWriter(
                      new OutputStreamWriter(fos));

        // The pattern matches control characters
        Pattern p = Pattern.compile("\\p{Cntrl}");
        Matcher m = p.matcher("");
        String aLine = null;
        while((aLine = in.readLine()) != null) {
            m.reset(aLine);
            //Replaces control characters with an empty
            //string.
            String result = m.replaceAll("");
            out.write(result);
            out.newLine();
        }
        in.close();
        out.close();
    }
}

File Searching

/*
 * Prints out the comments found in a .java file.
 */
import java.util.regex.*;
import java.io.*;
import java.nio.*;
import java.nio.charset.*;
import java.nio.channels.*;

public class CharBufferExample {
    public static void main(String[] args) throws Exception {
        // Create a pattern to match comments
        Pattern p = 
            Pattern.compile("//.*$", Pattern.MULTILINE);
        
        // Get a Channel for the source file
        File f = new File("Replacement.java");
        FileInputStream fis = new FileInputStream(f);
        FileChannel fc = fis.getChannel();
        
        // Get a CharBuffer from the source file
        ByteBuffer bb = 
            fc.map(FileChannel.MapMode.READ_ONLY, 0, fc.size());
        Charset cs = Charset.forName("ISO-8859-1");
        CharsetDecoder cd = cs.newDecoder();
        CharBuffer cb = cd.decode(bb);
        
        // Run some matches
        Matcher m = p.matcher(cb);
        while (m.find())
            System.out.println("Found comment: "+m.group());
    }
}

 

Conclusion

 

Pattern matching in the Java programming language is now as flexible as in many other programming languages. Regular expressions can be put to use in applications to ensure data is formatted correctly before being entered into a database, or sent to some other part of an application, and they can be used for a wide variety of administrative tasks. In short, you can use regular expressions anywhere in your Java programming that calls for pattern matching.

For More Information

 

Package java.util.regex

Java Programming Forum
