Java Regular Expressions (notes based on the official documentation)

Latest recommended article published 2019-06-22 19:38:53

iteye_11617

Category: 技术杂绘; tags: java

Article link: https://blog.csdn.net/iteye_11617/article/details/82366248

　　[b]The code below uses log4j, so download the log4j jar and add it to your classpath. Put your test code in the main function.[/b]

    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    import org.apache.log4j.Logger;

    public class HtmlRegExtraction {
        static Logger logger = Logger.getLogger(HtmlRegExtraction.class);

        public void writeHtmltoLocalFile(String localOutPutAdress, String html) {
            if (localOutPutAdress == null) {
                localOutPutAdress = "c:\\html.txt";
            }
            // try-with-resources closes the stream even when write() fails
            try (FileOutputStream fos = new FileOutputStream(localOutPutAdress)) {
                fos.write(html.getBytes());
                System.out.println("html output location:" + localOutPutAdress);
            } catch (IOException e) { // also covers FileNotFoundException
                e.printStackTrace();
            }
        }

        /**
         * @param regString the regular expression
         * @param html      the input string to match against
         * @param index     which group's text to collect
         * @return the text of group index for every match found
         */
        public ArrayList<String> parseHtml(String regString, String html, int index) {
            ArrayList<String> resultHtmlArray = new ArrayList<String>();
            Pattern p = Pattern.compile(regString);
            Matcher m = p.matcher(html);
            // number of matches found in the whole text
            int count = 0;
            logger.debug("The regex has " + m.groupCount()
                    + " capturing groups (group 0 is the whole match)");
            while (m.find()) {
                count++;
                logger.debug("Match #" + count);
                // The loop body and the lines up to the final log call were lost
                // in the source; what follows is an assumed reconstruction.
                for (int i = 0; i <= m.groupCount(); i++) {
                    logger.debug("group " + i + ": " + m.group(i));
                }
                resultHtmlArray.add(m.group(index));
            }
            logger.debug("Found " + count + " matches for the regular expression");
            return resultHtmlArray;
        }

        public static void main(String[] args) {
            HtmlRegExtraction hre = new HtmlRegExtraction();
            String ret = "String written in here";
            // The call is truncated in the source; the last two arguments are assumed.
            hre.parseHtml("regular expression written in here", ret, 0);
        }
    }
　　[b]First things first: what is a regular expression? A regular expression is a lock, and a string is a key. The lock opens only when the key fits it exactly; that is what "matching" means.[/b]
　　[b]Writing a regular expression means making a lock and then finding every key that opens it.[/b]
　　[b]This article assumes you have already read the tutorial 正则表达式30分钟入门教程 ("Regular Expressions in 30 Minutes"); search for it online.[/b]
　　[b]java.util.regex.Pattern[/b]
　　A compiled representation of a regular expression.
　　A regular expression, specified as a string, must first be compiled into an instance of this class. The resulting pattern can then be used to create a Matcher object that can match arbitrary character sequences against the regular expression. All of the state involved in performing a match resides in the matcher, so many matchers can share the same pattern.
　　A typical invocation sequence is thus
　　Pattern p = Pattern.compile("a*b");
　　Matcher m = p.matcher("aaaaab");
　　boolean b = m.matches();
　　A matches method is defined by this class as a convenience for when a regular expression is used just once. This method compiles an expression and matches an input sequence against it in a single invocation. The statement
　　boolean b = Pattern.matches("a*b", "aaaaab");
　　is equivalent to the three statements above, though for repeated matches it is less efficient since it does not allow the compiled pattern to be reused.
　　Instances of this class are immutable and are safe for use by multiple concurrent threads. Instances of the Matcher class are not safe for such use.
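The difference between a reusable compiled pattern and the one-off convenience method can be sketched as follows. This is a minimal illustration, not code from the original post; the class name `PatternReuseDemo` and the helper `matchAll` are hypothetical names chosen for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PatternReuseDemo {
    // Compiled once; a Pattern is immutable and may be shared between threads.
    private static final Pattern A_STAR_B = Pattern.compile("a*b");

    /** Returns, for each input, whether the whole input matches a*b. */
    static List<Boolean> matchAll(List<String> inputs) {
        List<Boolean> results = new ArrayList<>();
        for (String s : inputs) {
            // A Matcher carries per-match state, so create a fresh one
            // for each input (and never share one between threads).
            Matcher m = A_STAR_B.matcher(s);
            results.add(m.matches());
        }
        return results;
    }

    public static void main(String[] args) {
        // prints [true, true, false, false]
        System.out.println(matchAll(Arrays.asList("aaaaab", "b", "aab ", "ba")));
        // One-off form: recompiles the pattern on every call. Prints true.
        System.out.println(Pattern.matches("a*b", "aaaaab"));
    }
}
```

Note that `matches()` requires the entire input to match, which is why `"aab "` (trailing space) and `"ba"` are rejected.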
　　[b]Summary of regular-expression constructs[/b]
　　[b]Construct[/b]    [b]Matches[/b]
　　[b]Characters[/b]
　　x      The character x
　　\\     The backslash character
　　\0n    The character with octal value 0n (0 <= n <= 7)
　　...
　　[b]Use regular expressions sparingly: if you can do without one, do.[/b]
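　　[b][^\\p{ASCII}0-9]+[/b]
　　[b]Surprisingly, the form that turned out to be correct is this one:[/b]
　　[b][^\\p{ASCII}[0-9]]+[/b]
　　[b]How a negated class treats a nested class has changed between Java releases, so the two forms may or may not behave differently on your JDK; test before relying on either.[/b]
　　\p{Alpha}     An alphabetic character: [\p{Lower}\p{Upper}]
　　\p{Digit}     A decimal digit: [0-9]
　　\p{Alnum}     An alphanumeric character: [\p{Alpha}\p{Digit}]
　　\p{Punct}     Punctuation: One of !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
　　\p{Graph}     A visible character: [\p{Alnum}\p{Punct}]
　　\p{Print}     A printable character: [\p{Graph}\x20]
　　\p{Blank}     A space or a tab: [ \t]
　　\p{Cntrl}     A control character: [\x00-\x1F\x7F]
　　\p{XDigit}    A hexadecimal digit: [0-9a-fA-F]
　　\p{Space}     A whitespace character: [ \t\n\x0B\f\r]
　　[b]java.lang.Character classes (simple java character type)[/b]
　　\p{javaLowerCase}     Equivalent to java.lang.Character.isLowerCase()
　　\p{javaUpperCase}     Equivalent to java.lang.Character.isUpperCase()
　　\p{javaWhitespace}    Equivalent to java.lang.Character.isWhitespace()
　　\p{javaMirrored}      Equivalent to java.lang.Character.isMirrored()
　　[b]Classes for Unicode blocks and categories[/b]
　　[b]To use these you first need to know what, for instance, the Greek alphabet is; search online if unsure.[/b]
　　\p{InGreek}           A character in the Greek block (simple block)
　　\p{Lu}                An uppercase letter (simple category)
　　\p{Sc}                A currency symbol [b]I find this one fairly useful (it matches currency symbols).[/b]
　　\P{InGreek}           Any character except one in the Greek block (negation)
　　[\p{L}&&[^\p{Lu}]]    Any letter except an uppercase letter (subtraction)
　　[b]What do you think [\\p{L}]* matches in the string "123a我$ "? It consumes only letters, Chinese characters included, yet find() reports several matches, almost all of them empty.[/b]
　　[b]The only non-empty match is:[/b]
　　[b]a我[/b]
　　[b]Several matches, so why just one "a我"? Because the other matches are empty strings. Here is what happens:[/b]
　　[b]The engine looks at 1 and sees a digit, which is not a letter. Because * also allows zero repetitions, it reports an empty match at that position and moves on. The next character is another digit, so the same thing happens, and so on until it reaches a我. Written out as a standard regular expression, this would look something like (?...[/b]
　　(?=X)    X, via zero-width positive lookahead. The expression X written inside (?= ) is only a marker: the match succeeds only where the text that follows matches X, and X itself is not consumed.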
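　　For example, take the lookahead pattern a.*(?=b)
　　and the input adcbac.
　　The match is adc: the c is followed by a b, and the lookahead tells the engine to stop where b comes next, without consuming the b.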
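The lookahead example above can be checked with a short program. This is a minimal sketch; the class name `LookaheadDemo` and the helper `firstMatch` are hypothetical names introduced for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LookaheadDemo {
    /** Returns the first match of regex in input, or null if there is none. */
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // With the lookahead, b must follow but is not part of the match.
        System.out.println(firstMatch("a.*(?=b)", "adcbac")); // prints adc
        // Without it, the b is consumed as well.
        System.out.println(firstMatch("a.*b", "adcbac"));     // prints adcb
    }
}
```

The greedy `.*` first runs to the end of the input and then backtracks until the position it stops at is followed by `b`, which is why the match ends just before the only `b`.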
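　　The following constructs differ in meaning but work on the same principle.
　　(?!X)     X, via zero-width negative lookahead. Match only where the text that follows does not match X.
　　(?<=X)    X, via zero-width positive lookbehind. Match only where the preceding text matches X.
　　(?<!X)    X, via zero-width negative lookbehind. Match only where the preceding text does not match X.
　　(?>X)     X, as an independent, non-capturing group. This one is worth comparing with (?:X).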
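The lookbehind and negative-lookahead constructs in this list can be compared side by side. The sketch below is illustrative only; the class name `LookaroundDemo`, the helper `findAll`, and the sample input are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LookaroundDemo {
    /** Collects the text of every match of regex in input. */
    static List<String> findAll(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        String input = "a1 b2 a3";
        // Positive lookbehind: digits preceded by 'a'.
        System.out.println(findAll("(?<=a)\\d", input)); // prints [1, 3]
        // Negative lookbehind: digits not preceded by 'a'.
        System.out.println(findAll("(?<!a)\\d", input)); // prints [2]
        // Negative lookahead: digits not followed by a space.
        System.out.println(findAll("\\d(?! )", input));  // prints [3]
    }
}
```

In every case only the digit is returned: the text inside the lookaround is tested but never becomes part of the match.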
　　Both (?:X) and (?>X) are non-capturing. The real difference is backtracking: once (?>X) has matched, the engine never goes back into it to try a different match, whereas (?:X) can be re-entered when the rest of the pattern fails. For instance, (?:a*)ab matches aaab, but (?>a*)ab does not, because a* keeps all three a's and never gives one back.
　　Example:
　　Input string: abcde
　　With (?>a)(bc) the overall match is abc, and the pattern has exactly one capturing group, (bc), whose text is bc. The (?>a) part takes part in the match but is not a group you can retrieve.
　　[b]Backslashes, escapes, and quoting[/b]
　　The backslash character ('\') serves to introduce escaped constructs, as defined in the table above, as well as to quote characters that otherwise would be interpreted as unescaped constructs. Thus the expression \\ matches a single backslash and \{ matches a left brace.
　　It is an error to use a backslash prior to any alphabetic character that does not denote an escaped construct; these are reserved for future extensions to the regular-expression language. A backslash may be used prior to a non-alphabetic character regardless of whether that character is part of an unescaped construct.
　　Backslashes within string literals in Java source code are interpreted as required by the Java Language Specification as either Unicode escapes or other character escapes. It is therefore necessary to double backslashes in string literals that represent regular expressions to protect them from interpretation by the Java bytecode compiler. The string literal "\b", for example, matches a single backspace character when interpreted as a regular expression, while "\\b" matches a word boundary. The string literal "\(hello\)" is illegal and leads to a compile-time error; in order to match the string (hello) the string literal "\\(hello\\)" must be used.
　　[b]Character Classes[/b]
　　Character classes may appear within other character classes, and may be composed by the union operator (implicit) and the intersection operator (&&). The union operator denotes a class that contains every character that is in at least one of its operand classes. The intersection operator denotes a class that contains every character that is in both of its operand classes.
　　The precedence of character-class operators is as follows, from highest to lowest:
　　1    Literal escape    \x
　　2    Grouping          [...]
　　3    Range             a-z
　　4    Union             [a-e][i-u]
　　5    Intersection      [a-z&&[aeiou]]
　　Note that a different set of metacharacters are in effect inside a character class than outside a character class. For instance, the regular expression . loses its special meaning inside a character class, while the expression - becomes a range forming metacharacter.
　　[b]Line terminators[/b]
　　A line terminator is a one- or two-character sequence that marks the end of a line of the input character sequence. The following are recognized as line terminators: