A Powerful Tool for String Processing: Regular Expressions, Part 1

Vocabulary:

  1. regular [ˈreɡjulə]: following a rule; periodic; at fixed times
  2. expression [iksˈpreʃən]: a representation; a formula
  3. pattern [ˈpætən]: a model, style, or design
  4. matcher [ˈmætʃə]: something that performs matching

I. Uses:

  • String matching
  • String searching
  • String replacement
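
The three uses above can be sketched in a few lines of Java. This is only an illustration: the class name `RegexUses` and its method names are made up for this example, while the calls themselves (`String.matches`, `Matcher.find`, `String.replaceAll`) are standard JDK methods.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, used only to illustrate the three uses.
class RegexUses {

    // Matching: does the whole string consist of exactly three lowercase letters?
    static boolean isThreeLowercase(String s) {
        return s.matches("[a-z]{3}");
    }

    // Searching: collect every run of digits found anywhere in the input.
    static List<String> findDigitRuns(String s) {
        List<String> runs = new ArrayList<>();
        Matcher m = Pattern.compile("\\d+").matcher(s);
        while (m.find()) {
            runs.add(m.group());
        }
        return runs;
    }

    // Replacing: mask every digit with a dash.
    static String maskDigits(String s) {
        return s.replaceAll("\\d", "-");
    }

    public static void main(String[] args) {
        System.out.println(isThreeLowercase("fgh"));    // true
        System.out.println(findDigitRuns("a12b345c"));  // [12, 345]
        System.out.println(maskDigits("a8729a"));       // a----a
    }
}
```

Note that `matches` tests the whole string, while `find` scans for substrings; the difference comes up again in the larger example further down.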

II. Tool classes in Java:

  •  java.lang.String
  •  java.util.regex.Pattern   
  •  java.util.regex.Matcher

API:

java.util.regex
Class Pattern

java.lang.Object
  java.util.regex.Pattern
All Implemented Interfaces:
Serializable

public final class Pattern
extends Object
implements Serializable
 

A compiled representation of a regular expression.

A regular expression, specified as a string, must first be compiled into an instance of this class. The resulting pattern can then be used to create a Matcher object that can match arbitrary character sequences against the regular expression. All of the state involved in performing a match resides in the matcher, so many matchers can share the same pattern.

A typical invocation sequence is thus

 Pattern p = Pattern.compile("a*b");
 Matcher m = p.matcher("aaaaab");
 boolean b = m.matches();

A matches method is defined by this class as a convenience for when a regular expression is used just once. This method compiles an expression and matches an input sequence against it in a single invocation. The statement

 boolean b = Pattern.matches("a*b", "aaaaab");

is equivalent to the three statements above, though for repeated matches it is less efficient since it does not allow the compiled pattern to be reused.

Instances of this class are immutable and are safe for use by multiple concurrent threads. Instances of the Matcher class are not safe for such use.
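
A minimal sketch of what this means in practice: compile the pattern once, share it freely, and create a new `Matcher` for each input. The class name `PatternReuse` and the helper `matchesAStarB` are invented for this illustration and are not part of the JDK.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical example class showing one shared Pattern and per-call Matchers.
class PatternReuse {

    // Compiled once. Pattern instances are immutable, so sharing across threads is safe.
    private static final Pattern A_STAR_B = Pattern.compile("a*b");

    static boolean matchesAStarB(CharSequence input) {
        // A Matcher carries match state, so each call gets its own instance.
        Matcher m = A_STAR_B.matcher(input);
        return m.matches();
    }

    public static void main(String[] args) throws InterruptedException {
        Runnable task = () -> System.out.println(
                Thread.currentThread().getName() + ": " + matchesAStarB("aaaaab"));
        Thread t1 = new Thread(task, "worker-1");
        Thread t2 = new Thread(task, "worker-2");
        t1.start();
        t2.start();
        t1.join();
        t2.join();

        // One-off convenience form: compiles the expression on every call.
        System.out.println(Pattern.matches("a*b", "aaaaab"));
    }
}
```

The static convenience method is fine for a single check; for matching inside a loop or a frequently called method, the precompiled field avoids recompiling the expression each time.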

Summary of regular-expression constructs

Construct             Matches

Characters
x                     The character x
\\                    The backslash character
\0n                   The character with octal value 0n (0 <= n <= 7)
\0nn                  The character with octal value 0nn (0 <= n <= 7)
\0mnn                 The character with octal value 0mnn (0 <= m <= 3, 0 <= n <= 7)
\xhh                  The character with hexadecimal value 0xhh
\uhhhh                The character with hexadecimal value 0xhhhh
\t                    The tab character ('\u0009')
\n                    The newline (line feed) character ('\u000A')
\r                    The carriage-return character ('\u000D')
\f                    The form-feed character ('\u000C')
\a                    The alert (bell) character ('\u0007')
\e                    The escape character ('\u001B')
\cx                   The control character corresponding to x

Character classes
[abc]                 a, b, or c (simple class)
[^abc]                Any character except a, b, or c (negation)
[a-zA-Z]              a through z or A through Z, inclusive (range)
[a-d[m-p]]            a through d, or m through p: [a-dm-p] (union)
[a-z&&[def]]          d, e, or f (intersection)
[a-z&&[^bc]]          a through z, except for b and c: [ad-z] (subtraction)
[a-z&&[^m-p]]         a through z, and not m through p: [a-lq-z] (subtraction)

Predefined character classes
.                     Any character (may or may not match line terminators)
\d                    A digit: [0-9]
\D                    A non-digit: [^0-9]
\s                    A whitespace character: [ \t\n\x0B\f\r]
\S                    A non-whitespace character: [^\s]
\w                    A word character: [a-zA-Z_0-9]
\W                    A non-word character: [^\w]

POSIX character classes (US-ASCII only)
\p{Lower}             A lower-case alphabetic character: [a-z]
\p{Upper}             An upper-case alphabetic character: [A-Z]
\p{ASCII}             All ASCII: [\x00-\x7F]
\p{Alpha}             An alphabetic character: [\p{Lower}\p{Upper}]
\p{Digit}             A decimal digit: [0-9]
\p{Alnum}             An alphanumeric character: [\p{Alpha}\p{Digit}]
\p{Punct}             Punctuation: One of !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
\p{Graph}             A visible character: [\p{Alnum}\p{Punct}]
\p{Print}             A printable character: [\p{Graph}\x20]
\p{Blank}             A space or a tab: [ \t]
\p{Cntrl}             A control character: [\x00-\x1F\x7F]
\p{XDigit}            A hexadecimal digit: [0-9a-fA-F]
\p{Space}             A whitespace character: [ \t\n\x0B\f\r]

java.lang.Character classes (simple java character type)
\p{javaLowerCase}     Equivalent to java.lang.Character.isLowerCase()
\p{javaUpperCase}     Equivalent to java.lang.Character.isUpperCase()
\p{javaWhitespace}    Equivalent to java.lang.Character.isWhitespace()
\p{javaMirrored}      Equivalent to java.lang.Character.isMirrored()

Classes for Unicode blocks and categories
\p{InGreek}           A character in the Greek block (simple block)
\p{Lu}                An uppercase letter (simple category)
\p{Sc}                A currency symbol
\P{InGreek}           Any character except one in the Greek block (negation)
[\p{L}&&[^\p{Lu}]]    Any letter except an uppercase letter (subtraction)

Boundary matchers
^                     The beginning of a line
$                     The end of a line
\b                    A word boundary
\B                    A non-word boundary
\A                    The beginning of the input
\G                    The end of the previous match
\Z                    The end of the input but for the final terminator, if any
\z                    The end of the input

Greedy quantifiers
X?                    X, once or not at all
X*                    X, zero or more times
X+                    X, one or more times
X{n}                  X, exactly n times
X{n,}                 X, at least n times
X{n,m}                X, at least n but not more than m times

Reluctant quantifiers
X??                   X, once or not at all
X*?                   X, zero or more times
X+?                   X, one or more times
X{n}?                 X, exactly n times
X{n,}?                X, at least n times
X{n,m}?               X, at least n but not more than m times

Possessive quantifiers
X?+                   X, once or not at all
X*+                   X, zero or more times
X++                   X, one or more times
X{n}+                 X, exactly n times
X{n,}+                X, at least n times
X{n,m}+               X, at least n but not more than m times

Logical operators
XY                    X followed by Y
X|Y                   Either X or Y
(X)                   X, as a capturing group

Back references
\n                    Whatever the nth capturing group matched

Quotation
\                     Nothing, but quotes the following character
\Q                    Nothing, but quotes all characters until \E
\E                    Nothing, but ends quoting started by \Q

Special constructs (non-capturing)
(?:X)                 X, as a non-capturing group
(?idmsux-idmsux)      Nothing, but turns match flags on - off
(?idmsux-idmsux:X)    X, as a non-capturing group with the given flags on - off
(?=X)                 X, via zero-width positive lookahead
(?!X)                 X, via zero-width negative lookahead
(?<=X)                X, via zero-width positive lookbehind
(?<!X)                X, via zero-width negative lookbehind
(?>X)                 X, as an independent, non-capturing group
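
The table lists greedy, reluctant, and possessive quantifiers with identical descriptions, so the difference is easy to miss. The sketch below runs the same input through all three modes, then shows one back reference and one lookahead. The class `QuantifierModes` and its helper `findAll` are hypothetical names introduced for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; findAll is a helper written for this example.
class QuantifierModes {

    // Returns every substring of input that the regex finds, in order.
    static List<String> findAll(String regex, String input) {
        List<String> hits = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        String input = "xfooxxxxxxfoo";

        // Greedy: .* grabs as much as possible, then backs off just enough for "foo".
        System.out.println(findAll(".*foo", input));   // [xfooxxxxxxfoo]

        // Reluctant: .*? grabs as little as possible, so two shorter matches are found.
        System.out.println(findAll(".*?foo", input));  // [xfoo, xxxxxxfoo]

        // Possessive: .*+ grabs everything and never backs off, so "foo" can never match.
        System.out.println(findAll(".*+foo", input));  // []

        // Back reference: \1 repeats whatever group 1 captured, here a doubled character.
        System.out.println(findAll("(\\w)\\1", "hello bookkeeper"));     // [ll, oo, kk, ee]

        // Positive lookahead: digits only when followed by "px"; "px" is not part of the match.
        System.out.println(findAll("\\d+(?=px)", "10px 20em 30px"));     // [10, 30]
    }
}
```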

 

Example:

****************************************************
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Test {

 /**
  * @param args
  */
 public static void main(String[] args) {
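  // Simple regular expression examples
  // Does the string consist of exactly 3 characters?
 /* p("abc".matches("..."));
  // Replace every digit in the string with "-"
  p("a8729a".replaceAll("\\d", "-"));
  // Match a 3-character string in which every character is a-z; compiling the pattern up front makes repeated matching faster
  Pattern p = Pattern.compile("[a-z]{3}");
  // public Matcher matcher(CharSequence input); String implements the CharSequence interface
  Matcher m = p.matcher("fgh");
  p(m.matches());
  p("fgh".matches("[a-z]{3}"));*/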
  
 /* // Getting to know .  *  +  ?
  // . matches any single character
  p("a".matches("."));
  // Matches the literal string "aa"
  p("aa".matches("aa"));
  // * matches zero or more a's
  p("aaaa".matches("a*"));
  // + matches one or more
  p("aaaa".matches("a+"));
  // ? matches zero or one
  p("".matches("a?"));
  // ? matches zero or one
  p("a".matches("a?"));
  // {n} exactly n times, {n,} at least n times, {n,m} between n and m times
  p("214523145234532".matches("\\d{3,100}"));
  // false: "aaa" is not 1 to 3 digits
  p("192.168.0.aaa".matches("\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}"));
  // [] is a range
  p("192".matches("[0-2][0-9][0-9]"));*/
  
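  // Ranges: []
  // [] matches one of the characters listed inside it
 /* p("a".matches("[abc]"));
  // [^abc] matches one character other than a, b, or c
  p("a".matches("[^abc]"));
  // Matches one letter, a-z or A-Z
  p("a".matches("[a-zA-Z]"));
  // Matches a-z or A-Z
  p("A".matches("[a-z]|[A-Z]"));
  // Matches a-z or A-Z (union)
  p("A".matches("[a-z[A-Z]]"));
  // Matches a character that is in A-Z and is also one of R, F, G (intersection)
  p("R".matches("[A-Z&&[RFG]]"));*/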
  
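  /*// Getting to know \s \w \d \
  // 4 whitespace characters
  p(" \n\r\t".matches("\\s{4}"));
  // \S is a non-whitespace character, so this prints false
  p(" ".matches("\\S"));
  // \w is a word character
  p("a_8".matches("\\w{3}"));
  
  p("abc888&^%".matches("[a-z]{1,3}\\d+[&^#%]+"));
  // Matches a single backslash
  p("\\".matches("\\\\"));
  */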
  
 /* // POSIX style; POSIX is a UNIX standard
  p("a".matches("\\p{Lower}"));
  
  // Boundary matchers: inside [] ^ negates, outside [] it marks the beginning of a line
  
  p("hello sir".matches("^h.*"));
  p("hello sir".matches(".*ir$"));
  // \b is a word boundary
  p("hello sir".matches("^h[a-z]{1,3}o\\b.*"));
  p("hellosir".matches("^h[a-z]{1,3}o\\b.*"));*/

  // Blank lines: zero or more whitespace characters other than newline, followed by a newline
  /*p(" \n".matches("^[\\s&&[^\\n]]*\\n$"));
  
  p("aaa 8888c".matches(".*\\d{4}."));
  p("aaa 8888c".matches(".*\\b\\d{4}."));
  p("aaa8888c".matches(".*\\d{4}."));
  p("aaa8888c".matches(".*\\b\\d{4}."));*/
  
  // email: \w is a word character, [a-zA-Z_0-9]
  // [\\w[.-]]+ means [a-zA-Z_0-9] or . or -, one or more times
 /* p("asdfasdfsasdfasdf@asdfasdf.com".matches("[\\w[.-]]+@[\\w[.-]]+\\.[\\w]+"));
  */
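  // matches, find, lookingAt
  /*
  Pattern p = Pattern.compile("\\d{3,5}");
  String s="123-34345-234-00";
  Matcher m = p.matcher(s);
  // matches() tries to match the entire string "123-34345-234-00"
  p(m.matches());// the matcher's state may have advanced past "123-", leaving "34345-234-00"
  // Reset so matching starts from the beginning again
  m.reset();
  // find() looks for the next matching substring
  p(m.find());
  p(m.start()+"-"+m.end());
  p(m.find());
  p(m.start()+"-"+m.end());
  p(m.find());
  p(m.start()+"-"+m.end());
  p(m.find());
  // This find() failed, so start() and end() throw IllegalStateException
  p(m.start()+"-"+m.end());
  // lookingAt() always matches from the beginning of the input
  p(m.lookingAt());
  p(m.lookingAt());
  p(m.lookingAt());
  p(m.lookingAt());
  */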
  
 /* // Replacement: replace every "java" regardless of case; odd-numbered matches become "JAVA", even-numbered matches become "java"
  Pattern p = Pattern.compile("java",Pattern.CASE_INSENSITIVE);
  Matcher m = p.matcher("java Java JAVa JaVa IloveJAVA you hateJava asdf");
  StringBuffer buf = new StringBuffer();
  int i=0;
  while(m.find()){
   // m.group() returns the matched substring
   //p(m.group());
   i++;
   if(i%2==0){
    // Append the text up to this match to buf, followed by the replacement
    m.appendReplacement(buf, "java");
   }else
   {
    m.appendReplacement(buf, "JAVA");
   }
   
  }
  // Append whatever remains after the last match
  m.appendTail(buf);
  p(buf);*/
  
  // Groups
  Pattern p = Pattern.compile("(\\d{3,5})([a-z]{2})");
  String s ="123aa-34345bb-234cc-00";
  Matcher m = p.matcher(s);
  while(m.find())
  {
   // Group 0: the whole match
   p(m.group());
  }
  
  m.reset();
  while(m.find())
  {
   // Group 1: the digits
   p(m.group(1));
  }

  m.reset();
  while(m.find())
  {
   // Group 2: the two letters
   p(m.group(2));
  }  
 }
 
 

 
 
 // Shorthand for System.out.println
 public static void p (Object o)
 {
  System.out.println(o);
 }

}

****************************************************
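
Because most of the class above is commented out, the difference between `matches`, `lookingAt`, and `find` is easier to see in a small program that runs as is. The following is a sketch using the same pattern and input as the example; the class name `MatcherMethods` and the helper `findSpans` are made up for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class comparing the three matching methods of Matcher.
class MatcherMethods {

    // Collects "start-end" for every match; start() and end() are only
    // called after a successful find(), so no IllegalStateException occurs.
    static List<String> findSpans(Pattern p, String input) {
        List<String> spans = new ArrayList<>();
        Matcher m = p.matcher(input);
        while (m.find()) {
            spans.add(m.start() + "-" + m.end());
        }
        return spans;
    }

    public static void main(String[] args) {
        Pattern p = Pattern.compile("\\d{3,5}");
        String s = "123-34345-234-00";
        Matcher m = p.matcher(s);

        // matches(): the entire input must match, so this is false.
        System.out.println(m.matches());    // false

        // lookingAt(): only a prefix must match; "123" does.
        System.out.println(m.lookingAt());  // true
        System.out.println(m.group());      // 123

        // find(): scans for each matching substring in turn; "00" is too short.
        System.out.println(findSpans(p, s)); // [0-3, 4-9, 10-13]
    }
}
```

`findSpans` creates its own `Matcher`, which sidesteps the need to call `reset()` between the different kinds of match attempts.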

 

 

Program code for extracting email addresses from a file

***************************************************************

import java.io.BufferedReader;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;


public class EmailSpider {

 // Compiled once and reused for every line
 private static final Pattern EMAIL = Pattern.compile("[\\w[.-]]+@[\\w[.-]]+\\.[\\w]+");

 /**
  * @param args
  */
 public static void main(String[] args) {
  // try-with-resources closes the reader even if reading fails
  try (BufferedReader br = new BufferedReader(new FileReader("c:\\email.htm"))) {
   String line;
   while((line=br.readLine())!=null)
   {
     parse(line); 
   }
  } catch (FileNotFoundException e) {
   e.printStackTrace();
  } catch (IOException e) {
   e.printStackTrace();
  }

 }

 // Print every email address found in one line of text
 private static void parse(String line) {
  Matcher m = EMAIL.matcher(line);
  while(m.find())
  {
   System.out.println(m.group());
  }
 }

}

 


*****************************************************************
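
To try the same pattern without a file on disk, the extraction step can be run against a string in memory. This is a sketch only: `EmailExtractor`, `extract`, and the sample text are invented for the illustration, and the pattern is the article's loose one, not a full validator for email syntax.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical in-memory variant of the file-based program above.
class EmailExtractor {

    // Same expression as in the article: word characters, dots, or hyphens
    // on both sides of '@', ending in a dot followed by word characters.
    private static final Pattern EMAIL =
            Pattern.compile("[\\w[.-]]+@[\\w[.-]]+\\.[\\w]+");

    // Returns every substring of text that the pattern finds.
    static List<String> extract(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = EMAIL.matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        // Made-up sample text; the trailing "not-an-email@" has no domain and is skipped.
        String text = "Contact: alice@example.com, bob.smith@mail.example.org; not-an-email@";
        System.out.println(extract(text));
        // [alice@example.com, bob.smith@mail.example.org]
    }
}
```

Returning a list instead of printing inside the loop keeps the matching logic reusable, for example to de-duplicate addresses before output.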

 
