Java Regex Tutorial

http://www.vogella.de/articles/JavaRegularExpressions/article.html

1. Regular Expressions

1.1.  Overview

A regular expression define a search pattern for strings. This pattern may match one or several times or not at all for a given string. The abbreviation for regular expression is "regex".

A simple example for a regular expression is a (literal) string. For example the regex "Hello World" will match exactly the phrase "Hello World". Another example for a regular expression is "." (dot) which matches any single character; it would match for example "a" or "z" or "1".

1.2.  Usage

Regular expressions can be used to search, edit and manipulate text.

Regular expressions are used in several programming languages, e.g. Java but also Perl, Groovy, etc. Unfortunately each language / program supports regex slightly different.

The pattern defined by the regular expression is applied on the string from left to right. Once a source character has been used in a match, it cannot be reused. For example the regex "aba" will match "ababababa" only two times (aba_aba__).

1.3.  Junit

Some of the following examples use JUnit to validate the result. You should be able to adjust them in case if you don't want to use JUnit. To learn about JUnit please see JUnit Tutorial .

2. Regular Expressions

The following is an overview of regular expressions. This chapter is supposed to be a references for the different regex elements.

2.1. Common matching symbols

Table 1. 

Regular ExpressionDescription
.Matches any sign
^regexregex must match at the beginning of the line
regex$Finds regex must match at the end of the line
[abc]Set definition, can match the letter a or b or c
[abc][vz]Set definition, can match a or b or c followed by either v or z
[^abc]When a "^" appears as the first character inside [] when it negates the pattern. This can match any character except a or b or c
[a-d1-7]Ranges, letter between a and d and figures from 1 to 7, will not match d1
X|ZFinds X or Z
XZFinds X directly followed by Z
$Checks if a line end follows

2.2. Metacharacters

The following metacharacters have a pre-defined meaning and make certain common pattern easier to use, e.g. \d instead of [0..9].

Table 2. 

Regular ExpressionDescription
\dAny digit, short for [0-9]
\DA non-digit, short for [^0-9]
\sA whitespace character, short for [ \t\n\x0b\r\f]
\SA non-whitespace character, for short for [^\s]
\wA word character, short for [a-zA-Z_0-9]
\WA non-word character [^\w]
\S+Several non-whitespace characters

2.3. Quantifier

A quantifier defines how often an element can occur. The symbols ?, *, + and {} define the quantity of the regular expressions

Table 3. 

Regular ExpressionDescriptionExamples
*Occurs zero or more times, is short for {0,}X* - Finds no or several letter X, .* - any character sequence
+Occurs one or more times, is short for {1,}X+ - Finds one or several letter X
?Occurs no or one times, ? is short for {0,1}X? -Finds no or exactly one letter X
{X}Occurs X number of times, {} describes the order of the preceding liberal\d{3} - Three digits, .{10} - any character sequence of length 10
{X,Y}.Occurs between X and Y times,\d{1,4}- \d must occur at least once and at a maximum of four
*?? after a qualifier makes it a "reluctant quantifier", it tries to find the smallest match. 

3. Using Regular Expressions with String.matches()

3.1. Overview

Strings in Java have build in support for regular expressions. Stings have three build in methods for regular expressions. These methods do not compile the pattern and are therefore slower then using a pattern and a matcher as described later in this article.

Tip

The backslash is an escape character in Java Strings. e.g. backslash has a predefine meaning in Java. You have to use "\\" to define a single backslash. If you want to define "\w" then you must be using "\\w" in your regex. If you want to use backslash you as a literal you have to type \\\\ as \ is also a escape charactor in regular expressions.

Table 4. 

MethodDescription
s.matches("regex")Evaluates if "regex" matches s. Returns only true if the WHOLE string can be matched
s.split("regex")Creates array with substrings of s divided at occurance of "regex". "regex" is not included in the result.
s.replace("regex"), "replacement"Replaces "regex" with "replacement

Create for the following example the Java project "de.vogella.regex.test".

				
package de.vogella.regex.test;

public class RegexTestStrings {
	public static final String EXAMPLE_TEST = "This is my small example string which I'm going to use for pattern matching.";

	public static void main(String[] args) {
		System.out.println(EXAMPLE_TEST.matches("\\w.*"));
		String[] splitString = (EXAMPLE_TEST.split("\\s+"));
		System.out.println(splitString.length);// Should be 14
		for (String string : splitString) {
			System.out.println(string);
		}
		// Replace all whitespace with tabs
		System.out.println(EXAMPLE_TEST.replaceAll("\\s+", "\t"));
	}
}

			

3.2. Examples

Create for the following example the Java project "de.vogella.regex.string".

The following class gives several examples for the usage of regular expressions with strings. See the comment for the purpose.

				
package de.vogella.regex.string;

public class StringMatcher {
	// Returns true if the string matches exactly "true"
	public boolean isTrue(String s){
		return s.matches("true");
	}
	// Returns true if the string matches exactly "true" or "True"
	public boolean isTrueVersion2(String s){
		return s.matches("[tT]rue");
	}
	
	// Returns true if the string matches exactly "true" or "True"
	// or "yes" or "Yes"
	public boolean isTrueOrYes(String s){
		return s.matches("[tT]rue|[yY]es");
	}
	
	// Returns true if the string contains exactly "true"
	public boolean containsTrue(String s){
		return s.matches(".*true.*");
	}
	

	// Returns true if the string contains of three letters
	public boolean isThreeLetters(String s){
		return s.matches("[a-zA-Z]{3}");
		// Simpler from for
//		return s.matches("[a-Z][a-Z][a-Z]");
	}
	


	// Returns true if the string does not have a number at the beginning
	public boolean isNoNumberAtBeginning(String s){
		return s.matches("^[^\\d].*");
	}
	// Returns true if the string contains a arbitrary number of characters except b
	public boolean isIntersection(String s){
		return s.matches("([\\w&&[^b]])*");
	}
	// Returns true if the string contains a number less then 300
	public boolean isLessThenThreeHundret(String s){
		return s.matches("[^0-9]*[12]?[0-9]{1,2}[^0-9]*");
	}
	
}

			

And a small JUnit Test to validates the examples.

				
package de.vogella.regex.string;

import org.junit.Before;
import org.junit.Test;

import static org.junit.Assert.assertFalse;
import static org.junit.Assert.assertTrue;

public class StringMatcherTest {
	private StringMatcher m;

	@Before
	public void setup(){
		m = new StringMatcher();
	}

	@Test
	public void testIsTrue() {
		assertTrue(m.isTrue("true"));
		assertFalse(m.isTrue("true2"));
		assertFalse(m.isTrue("True"));
	}

	@Test
	public void testIsTrueVersion2() {
		assertTrue(m.isTrueVersion2("true"));
		assertFalse(m.isTrueVersion2("true2"));
		assertTrue(m.isTrueVersion2("True"));;
	}

	@Test
	public void testIsTrueOrYes() {
		assertTrue(m.isTrueOrYes("true"));
		assertTrue(m.isTrueOrYes("yes"));
		assertTrue(m.isTrueOrYes("Yes"));
		assertFalse(m.isTrueOrYes("no"));
	}

	@Test
	public void testContainsTrue() {
		assertTrue(m.containsTrue("thetruewithin"));
	}

	@Test
	public void testIsThreeLetters() {
		assertTrue(m.isThreeLetters("abc"));
		assertFalse(m.isThreeLetters("abcd"));
	}
	
	@Test
	public void testisNoNumberAtBeginning() {
		assertTrue(m.isNoNumberAtBeginning("abc"));
		assertFalse(m.isNoNumberAtBeginning("1abcd"));
		assertTrue(m.isNoNumberAtBeginning("a1bcd"));
		assertTrue(m.isNoNumberAtBeginning("asdfdsf"));
	}
	
	@Test
	public void testisIntersection() {
		assertTrue(m.isIntersection("1"));
		assertFalse(m.isIntersection("abcksdfkdskfsdfdsf"));
		assertTrue(m.isIntersection("skdskfjsmcnxmvjwque484242"));
	}
	

	@Test
	public void testLessThenThreeHundret() {
		assertTrue(m.isLessThenThreeHundret("288"));
		assertFalse(m.isLessThenThreeHundret("3288"));
		assertFalse(m.isLessThenThreeHundret("328 8"));
		assertTrue(m.isLessThenThreeHundret("1"));
		assertTrue(m.isLessThenThreeHundret("99"));
		assertFalse(m.isLessThenThreeHundret("300"));
	}

}

			

4. Pattern and Matcher

For advanced regular expressions the classes you java.util.regex.Pattern and java.util.regex.Matcher are used.

You first create a Pattern object which defines the regular expression. This pattern object allows create a Matcher object for a given string. This matcher object then allows you to do regex operations on the string.

			
package de.vogella.regex.test;

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexTestPatternMatcher {
	public static final String EXAMPLE_TEST = "This is my small example string which I'm going to use for pattern matching.";

	public static void main(String[] args) {
		Pattern pattern = Pattern.compile("\\w+");
		// In case you would like to ignore case sensitivity you could use this
		// statement
		// Pattern pattern = Pattern.compile("\\s+", Pattern.CASE_INSENSITIVE);
		Matcher matcher = pattern.matcher(EXAMPLE_TEST);
		// Check all occurance
		while (matcher.find()) {
			System.out.print("Start index: " + matcher.start());
			System.out.print(" End index: " + matcher.end() + " ");
			System.out.println(matcher.group());
		}
		// Now create a new pattern and matcher to replace whitespace with tabs
		Pattern replace = Pattern.compile("\\s+");
		Matcher matcher2 = replace.matcher(EXAMPLE_TEST);
		System.out.println(matcher2.replaceAll("\t"));
	}
}

		

5. Java Regex Examples

5.1. Or

Task: Write a regular expression which matches a text line if this text line contains either the word "Joe" or the word "Jim" or both. Create a project "de.vogella.regex.eitheror" and the following class.

				
package de.vogella.regex.eitheror;

import org.junit.Test;

import static org.junit.Assert.assertFalse;
import static org.junit.Assert.assertTrue;

public class EitherOrCheck {
	@Test
	public void testSimpleTrue() {
		String s = "humbapumpa jim";
		assertTrue(s.matches(".*(jim|joe).*"));
		s = "humbapumpa jom";
		assertFalse(s.matches(".*(jim|joe).*"));
		s = "humbaPumpa joe";
		assertTrue(s.matches(".*(jim|joe).*"));
		s = "humbapumpa joe jim";
		assertTrue(s.matches(".*(jim|joe).*"));
	}
}

			

5.2. Phone number

Task: Write a regular expression which matches any phone number. A phone number in this example consists either out of 7 numbers in a row or out of 3 number a (white)space or a dash and then 4 numbers.

					
package de.vogella.regex.phonenumber;

import org.junit.Test;

import static org.junit.Assert.assertFalse;
import static org.junit.Assert.assertTrue;


public class CheckPhone {
	
	@Test
	public void testSimpleTrue() {
		String pattern = "\\d\\d\\d([,\\s])?\\d\\d\\d\\d";
		String s= "1233323322";
		assertFalse(s.matches(pattern));
		s = "1233323";
		assertTrue(s.matches(pattern));
		s = "123 3323";
		assertTrue(s.matches(pattern));
	}
}

				

5.3. Check for a certain number range

The following example will check if a text contains a number with 3 digits.

Create the Java project "de.vogella.regex.numbermatch" and the following class.

				
package de.vogella.regex.numbermatch;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.junit.Test;

import static org.junit.Assert.assertFalse;
import static org.junit.Assert.assertTrue;

public class CheckNumber {

	
	@Test
	public void testSimpleTrue() {
		String s= "1233";
		assertTrue(test(s));
		s= "0";
		assertFalse(test(s));
		s = "29 Kasdkf 2300 Kdsdf";
		assertTrue(test(s));
		s = "99900234";
		assertTrue(test(s));
	}
	

	
	
	public static boolean test (String s){
		Pattern pattern = Pattern.compile("\\d{3}");
		Matcher matcher = pattern.matcher(s);
		if (matcher.find()){
			return true; 
		} 
		return false; 
	}

}

			

5.4. Building a link checker

The following exampleallows to extract all valid links from a webpage. It does not consider links with start with "javascript:" or "mailto:".

Create the Java project "de.vogella.regex.weblinks" and the following class

				
package de.vogella.regex.weblinks;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LinkGetter {
	private Pattern htmltag;
	private Pattern link;
	private final String root;

	public LinkGetter(String root) {
		this.root = root;
		htmltag = Pattern.compile("<a\\b[^>]*href=\"[^>]*>(.*?)</a>");
		link = Pattern.compile("href=\"[^>]*\">");
	}

	public List<String> getLinks(String url) {
		List<String> links = new ArrayList<String>();
		try {
			BufferedReader bufferedReader = new BufferedReader(
					new InputStreamReader(new URL(url).openStream()));
			String s;
			StringBuilder builder = new StringBuilder();
			while ((s = bufferedReader.readLine()) != null) {
				builder.append(s);
			}

			Matcher tagmatch = htmltag.matcher(builder.toString());
			while (tagmatch.find()) {
				Matcher matcher = link.matcher(tagmatch.group());
				matcher.find();
				String link = matcher.group().replaceFirst("href=\"", "")
						.replaceFirst("\">", "");
				if (valid(link)) {
					links.add(makeAbsolute(url, link));
				}
			}
		} catch (MalformedURLException e) {
			e.printStackTrace();
		} catch (IOException e) {
			e.printStackTrace();
		}
		return links;
	}

	private boolean valid(String s) {
		if (s.matches("javascript:.*|mailto:.*")) {
			return false;
		}
		return true;
	}

	private String makeAbsolute(String url, String link) {
		if (link.matches("http://.*")) {
			return link;
		}
		if (link.matches("/.*") && url.matches(".*$[^/]")) {
			return url + "/" + link;
		}
		if (link.matches("[^/].*") && url.matches(".*[^/]")) {
			return url + "/" + link;
		}
		if (link.matches("/.*") && url.matches(".*[/]")) {
			return url + link;
		}
		if (link.matches("/.*") && url.matches(".*[^/]")) {
			return url + link;
		}
		throw new RuntimeException("Cannot make the link absolute. Url: " + url
				+ " Link " + link);
	}
}

			

6. Thank you

Please help me to support this article:

  • 0
    点赞
  • 1
    收藏
    觉得还不错? 一键收藏
  • 0
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值