Java regular expressions (partial notes based on the official documentation)

  The code below uses log4j, so download the log4j jar and add it to your project. Put your test code in the class's main method.

  import java.io.FileOutputStream;
  import java.io.IOException;
  import java.util.ArrayList;
  import java.util.regex.Matcher;
  import java.util.regex.Pattern;
  import org.apache.log4j.Logger;

  public class HtmlRegExtraction {
      static Logger logger = Logger.getLogger(HtmlRegExtraction.class);

      public void writeHtmltoLocalFile(String localOutPutAdress, String html) {
          if (localOutPutAdress == null) {
              localOutPutAdress = "c:\\html.txt";
          }
          // try-with-resources closes the stream even if write() fails
          try (FileOutputStream fos = new FileOutputStream(localOutPutAdress)) {
              fos.write(html.getBytes());
              System.out.println("html output location:" + localOutPutAdress);
          } catch (IOException e) {
              // also covers FileNotFoundException, which is an IOException
              e.printStackTrace();
          }
      }

      /**
       * @param regString the regular expression
       * @param html      the input string to match against the regular expression
       * @param index     which group's result to collect
       * @return the text of group {@code index} for every match
       */
      public ArrayList<String> parseHtml(String regString, String html, int index) {
          ArrayList<String> resultHtmlArray = new ArrayList<String>();
          Pattern p = Pattern.compile(regString);
          Matcher m = p.matcher(html);
          // number of matches found in the whole text
          int count = 0;
          logger.debug("The regex has " + m.groupCount() + " capturing groups (group 0 is the whole match)");
          while (m.find()) {
              count++;
              logger.debug("Match #" + count);
              // The source is cut off after "for (int i = 0; i". The loop bound and body
              // below are reconstructed from the surrounding code, not the author's text.
              for (int i = 0; i <= m.groupCount(); i++) {
                  logger.debug("group " + i + ": " + m.group(i));
              }
              resultHtmlArray.add(m.group(index));
          }
          logger.debug("Found " + count + " records matching the regex");
          return resultHtmlArray;
      }

      public static void main(String[] args) {
          HtmlRegExtraction hre = new HtmlRegExtraction();
          String ret = "String written in here";
          // The source breaks off inside this call; the arguments after the
          // pattern (ret, 0) are assumed.
          hre.parseHtml("regularation written in here", ret, 0);
      }
  }
  First of all: what is a regular expression? A regular expression is a lock, and the string it is applied to is a key. The lock opens only when the key fits it exactly, and that is what "matching" means.
  Writing a regular expression means making a lock and then looking for every key that can open it.
  This article assumes you have already read 正则表达式30分钟入门教程 (a 30-minute introductory tutorial on regular expressions); search for it on Google.
  [b]java.util.regex.Pattern[/b]
  A compiled representation of a regular expression.
  A regular expression, specified as a string, must first be compiled into an instance of this class. The resulting pattern can then be used to create a Matcher object that can match arbitrary character sequences against the regular expression. All of the state involved in performing a match resides in the matcher, so many matchers can share the same pattern.
  A typical invocation sequence is thus
  Pattern p = Pattern.compile("a*b");
  Matcher m = p.matcher("aaaaab");
  boolean b = m.matches();
  A matches method is defined by this class as a convenience for when a regular expression is used just once. This method compiles an expression and matches an input sequence against it in a single invocation. The statement
  boolean b = Pattern.matches("a*b", "aaaaab");
  is equivalent to the three statements above, though for repeated matches it is less efficient since it does not allow the compiled pattern to be reused.
  Instances of this class are immutable and are safe for use by multiple concurrent threads. Instances of the Matcher class are not safe for such use.
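The reuse and thread-safety points above can be sketched as follows. This is only an illustration: the class name PatternReuseDemo and the helper matchesWhole are made up for this example and are not part of the JDK or of the original code.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PatternReuseDemo {
    // Compiled once. A Pattern is immutable, so it can be shared between threads.
    static final Pattern A_STAR_B = Pattern.compile("a*b");

    // Hypothetical helper: each call creates its own Matcher,
    // because a Matcher holds match state and is not thread-safe.
    static boolean matchesWhole(String input) {
        Matcher m = A_STAR_B.matcher(input);
        return m.matches();
    }

    public static void main(String[] args) {
        System.out.println(matchesWhole("aaaaab"));            // true
        System.out.println(matchesWhole("aaaaabc"));           // false: matches() must consume the whole input
        System.out.println(Pattern.matches("a*b", "aaaaab"));  // true, but compiles the pattern on every call
    }
}
```

If the same expression is applied many times, keeping the compiled Pattern in a static final field as above avoids recompiling it.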
  [b]Summary of regular-expression constructs[/b]
  Construct    Matches
  [b]Characters[/b]
  x            The character x
  \\           The backslash character
  \0n          The character with octal value 0n (0 <= n <= 7)
  …
  Use regular expressions sparingly: if you can do without one, do.
  [^\\p{ASCII}0-9]+
  Surprisingly, the correct way to write it turned out to be this:
  [^\\p{ASCII}[0-9]]+
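What these two patterns were meant to match is not recoverable from the surviving text, so the following is only a sketch of how the pieces combine. The sample input "abc123中文" and the class name NegatedClassDemo are assumptions made for this example. As far as I know, the way a nested class such as [0-9] combines with a leading ^ changed between Java 8 and Java 9, so the sketch avoids the nested form and spells out "non-ASCII or digit" with alternation, which reads the same on any version.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NegatedClassDemo {
    // Hypothetical helper: returns the first substring matched by regex, or null.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String input = "abc123\u4e2d\u6587"; // "abc123中文"

        // 0-9 is part of the negated set: characters that are neither ASCII nor digits.
        System.out.println(firstMatch("[^\\p{ASCII}0-9]+", input));        // 中文

        // "Non-ASCII or digit", written without a nested class.
        System.out.println(firstMatch("(?:[^\\p{ASCII}]|[0-9])+", input)); // 123中文
    }
}
```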
  \p{Alpha}     An alphabetic character: [\p{Lower}\p{Upper}]
  \p{Digit}     A decimal digit: [0-9]
  \p{Alnum}     An alphanumeric character: [\p{Alpha}\p{Digit}]
  \p{Punct}     Punctuation: One of !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
  \p{Graph}     A visible character: [\p{Alnum}\p{Punct}]
  \p{Print}     A printable character: [\p{Graph}\x20]
  \p{Blank}     A space or a tab: [ \t]
  \p{Cntrl}     A control character: [\x00-\x1F\x7F]
  \p{XDigit}    A hexadecimal digit: [0-9a-fA-F]
  \p{Space}     A whitespace character: [ \t\n\x0B\f\r]
  [b]java.lang.Character classes (simple java character type)[/b]
  \p{javaLowerCase}     Equivalent to java.lang.Character.isLowerCase()
  \p{javaUpperCase}     Equivalent to java.lang.Character.isUpperCase()
  \p{javaWhitespace}    Equivalent to java.lang.Character.isWhitespace()
  \p{javaMirrored}      Equivalent to java.lang.Character.isMirrored()
  [b]Classes for Unicode blocks and categories[/b]
  To use the constructs in this group you first need to know what a Greek letter is; look it up on Google.
  \p{InGreek}           A character in the Greek block (simple block)
  \p{Lu}                An uppercase letter (simple category)
  \p{Sc}                A currency symbol. I find this one somewhat useful (it captures currency symbols).
  \P{InGreek}           Any character except one in the Greek block (negation)
  [\p{L}&&[^\p{Lu}]]    Any letter except an uppercase letter (subtraction)
  What do you think [\\p{L}]* finds in the string "123a我$ "? It can only pick up letters, Chinese characters included, and yet it reports 5 matching records.
  The only one with any text in it is
  a我
  Weren't there supposed to be 5 records? Why is there just one "a我"? Because the other matches are empty: they only mark a position and capture no text. The process goes like this.
  The regex engine looks at 1 first and sees a digit. Since * allows zero repetitions, it records a zero-width match at that position (the digit marks the position but is not captured) and moves on to the next character, which is again a digit and again matches nothing. Written out in full, [\\p{L}]* would look something like (?…
  … a regular expression written inside (?= ) is only a marker. It means: capture only text that is followed by something matching X, without capturing X itself.
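  For example, take a.*(?=b)
  and the input adcbac.
  It captures adc: the c is followed by a b, which tells the regex engine to stop when it sees the b without capturing it.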
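The a.*(?=b) example can be checked with a short sketch. The class name LookaheadDemo and the helper firstMatch are made up for this illustration; the second pattern, a.*b, is added only for contrast and is not in the original text.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookaheadDemo {
    // Hypothetical helper: returns the first substring matched by regex, or null.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // The lookahead requires a b right after the match but does not consume it.
        System.out.println(firstMatch("a.*(?=b)", "adcbac")); // adc

        // Without the lookahead the b becomes part of the match.
        System.out.println(firstMatch("a.*b", "adcbac"));     // adcb
    }
}
```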
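  The following constructs differ in meaning, but the principle is the same.
  (?!X)     X, via zero-width negative lookahead. Captures only text that is not followed by X.
  (?<=X)    X, via zero-width positive lookbehind. Captures only text that is preceded by X.
  (?<!X)    X, via zero-width negative lookbehind. Captures only text that is not preceded by X.
  (?>X)     X, as an independent, non-capturing group. Compare this one with (?:X).
  Neither (?:X) nor (?>X) captures its contents. The real difference is that (?>X) is independent (atomic): once X has matched, the engine will not backtrack into the group to try another way of matching X, whereas (?:X) backtracks normally.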
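A small sketch of the constructs listed above. Both (?:X) and (?>X) are non-capturing; what separates them is backtracking, which the last two lines show. The class name LookaroundDemo, the helper firstMatch, and all sample inputs are assumptions chosen for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookaroundDemo {
    // Hypothetical helper: returns the first substring matched by regex, or null.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String price = "cost $42 and 17";

        // Negative lookahead: an a that is not followed by b, plus one more character.
        System.out.println(firstMatch("a(?!b).", "abac"));          // ac

        // Positive lookbehind: digits preceded by a dollar sign.
        System.out.println(firstMatch("(?<=\\$)\\d+", price));      // 42

        // Negative lookbehind: a number that does not start right after a dollar sign.
        System.out.println(firstMatch("\\b(?<!\\$)\\d+", price));   // 17

        // (?:...) may backtrack from "a" to "ab", so the trailing c can match.
        System.out.println(firstMatch("(?:a|ab)c", "abc"));         // abc

        // (?>...) keeps its first choice "a"; c then fails against b and there is no retry.
        System.out.println(firstMatch("(?>a|ab)c", "abc"));         // null
    }
}
```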
  Example:
  String to match: abcde
  (?>a)(bc) matches abc, and the pattern has exactly one capturing group, (bc). The a inside (?>a) is matched but not captured. (?:a)(bc) gives the same result here.
  [b]Backslashes, escapes, and quoting[/b]
  The backslash character ('\') serves to introduce escaped constructs, as defined in the table above, as well as to quote characters that otherwise would be interpreted as unescaped constructs. Thus the expression \\ matches a single backslash and \{ matches a left brace.
  It is an error to use a backslash prior to any alphabetic character that does not denote an escaped construct; these are reserved for future extensions to the regular-expression language. A backslash may be used prior to a non-alphabetic character regardless of whether that character is part of an unescaped construct.
  Backslashes within string literals in Java source code are interpreted as required by the Java Language Specification as either Unicode escapes or other character escapes. It is therefore necessary to double backslashes in string literals that represent regular expressions to protect them from interpretation by the Java bytecode compiler. The string literal "\b", for example, matches a single backspace character when interpreted as a regular expression, while "\\b" matches a word boundary. The string literal "\(hello\)" is illegal and leads to a compile-time error; in order to match the string (hello) the string literal "\\(hello\\)" must be used.
  [b]Character Classes[/b]
  Character classes may appear within other character classes, and may be composed by the union operator (implicit) and the intersection operator (&&). The union operator denotes a class that contains every character that is in at least one of its operand classes. The intersection operator denotes a class that contains every character that is in both of its operand classes.
  The precedence of character-class operators is as follows, from highest to lowest:
  1    Literal escape    \x
  2    Grouping          [...]
  3    Range             a-z
  4    Union             [a-e][i-u]
  5    Intersection      [a-z&&[aeiou]]
  Note that a different set of metacharacters are in effect inside a character class than outside a character class. For instance, the regular expression . loses its special meaning inside a character class, while the expression - becomes a range forming metacharacter.
  [b]Line terminators[/b]
  A line terminator is a one- or two-character sequence that marks the end of a line of the input character sequence. The following are recognized as line terminators:
  A newline (line feed) character ('\n'),
  A carriage-return character followed immediately by a newline character ("\r\n"),
  A standalone carriage-return character ('\r'),
  A next-line character ('\u0085'),
  A line-separator character ('\u2028'), or
  A paragraph-separator character ('\u2029').
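The escaping rules and the character-class operators described in the sections above can be tried out with a sketch like this one. The class name EscapeAndClassDemo, the helpers whole and found, and the sample strings are all made up for illustration.

```java
import java.util.regex.Pattern;

class EscapeAndClassDemo {
    // Hypothetical helper: true if the entire input matches regex.
    static boolean whole(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    // Hypothetical helper: true if regex matches somewhere inside input.
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        // "\\b" in Java source is the regex \b, a word boundary.
        System.out.println(found("\\bcat\\b", "a cat sat"));      // true
        System.out.println(found("\\bcat\\b", "concatenate"));    // false

        // "\\(hello\\)" is the regex \(hello\), which matches the literal text (hello).
        System.out.println(whole("\\(hello\\)", "(hello)"));      // true

        // Inside a character class the dot is an ordinary character.
        System.out.println(whole("[.]", "."));                    // true
        System.out.println(whole("[.]", "a"));                    // false

        // Union (implicit) and intersection (&&).
        System.out.println(whole("[a-c[x-z]]", "y"));             // true
        System.out.println(whole("[a-z&&[aeiou]]", "e"));         // true
        System.out.println(whole("[a-z&&[aeiou]]", "b"));         // false

        // Subtraction: any letter except an uppercase letter.
        System.out.println(whole("[\\p{L}&&[^\\p{Lu}]]", "a"));   // true
        System.out.println(whole("[\\p{L}&&[^\\p{Lu}]]", "A"));   // false
    }
}
```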