Java Regular Expressions

  1. Two problems

           a. How to tell whether a URL is a Sina forum post page

           b. How to extract the posting time from those pages

 

    Analysis:

           Example URLs of Sina forum post pages:

            http://bbs.2008.sina.com.cn/tableforum/App/view.php?bbsid=274&subid=0&fid=32614&tbid=2351

           http://bbs.book.sina.com.cn/tableforum/App/view.php?bbsid=7&subid=1&fid=43640&tbid=5386

           http://bbs.book.sina.com.cn/tableforum/App/view.php?bbsid=192&subid=4&fid=10367&tbid=1683

           http://bbs.edu.sina.com.cn/tableforum/App/view.php?bbsid=41&subid=2&fid=86455&tbid=4803

 

  These URLs follow a clear pattern. The rule is roughly:

http://bbs\.[a-zA-Z0-9]+\.sina\.com\.cn/tableforum/App/view\.php\?bbsid=[0-9]+&subid=[0-9]+&fid=[0-9]+&tbid=[0-9]+     (. and ? are metacharacters, so they must be escaped as \. and \?)
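To show how the rule above could be applied in Java, here is a minimal sketch. The class name SinaPostUrlMatcher and the method isPostUrl are made up for illustration; only the pattern itself comes from the rule above. Note that each backslash has to be doubled inside a Java string literal.

```java
import java.util.regex.Pattern;

class SinaPostUrlMatcher {
    // The rule from above. "\\." and "\\?" are the Java-string spellings
    // of the regex escapes \. and \? (literal dot and question mark).
    private static final Pattern POST_URL = Pattern.compile(
        "http://bbs\\.[a-zA-Z0-9]+\\.sina\\.com\\.cn/tableforum/App/view\\.php"
        + "\\?bbsid=[0-9]+&subid=[0-9]+&fid=[0-9]+&tbid=[0-9]+");

    // matches() succeeds only if the entire URL fits the pattern,
    // so a URL with extra trailing text is rejected.
    static boolean isPostUrl(String url) {
        return POST_URL.matcher(url).matches();
    }

    public static void main(String[] args) {
        System.out.println(isPostUrl(
            "http://bbs.edu.sina.com.cn/tableforum/App/view.php?bbsid=41&subid=2&fid=86455&tbid=4803"));
        System.out.println(isPostUrl("http://bbs.edu.sina.com.cn/index.html"));
    }
}
```

Using matches() rather than find() is a design choice here: it treats the rule as a description of the whole URL, which is what question (a) asks for.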

 

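The posting time always looks like this: [2008-08-09 14:51:35]

   Rule: \[(\d\d\d\d-\d\d-\d\d \d\d:\d\d:\d\d)\]     (the square brackets are escaped; the parentheses capture the timestamp as group 1)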
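The timestamp rule could be used roughly as follows. This is a sketch under assumptions: the class PostTimeExtractor, the method extractPostTimes, and the sample HTML fragment are invented for illustration, and a real forum page may contain other bracketed timestamps that are not posting times.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PostTimeExtractor {
    // \[ and \] match literal brackets; the parentheses form capture group 1,
    // which holds the timestamp without the surrounding brackets.
    private static final Pattern POST_TIME = Pattern.compile(
        "\\[(\\d\\d\\d\\d-\\d\\d-\\d\\d \\d\\d:\\d\\d:\\d\\d)\\]");

    // Collects every timestamp found in the page text, in order of appearance.
    static List<String> extractPostTimes(String html) {
        List<String> times = new ArrayList<>();
        Matcher m = POST_TIME.matcher(html);
        while (m.find()) {          // find() scans for the next match anywhere in the input
            times.add(m.group(1));  // group(1) is the part inside the parentheses
        }
        return times;
    }

    public static void main(String[] args) {
        // Hypothetical page fragment, not taken from a real Sina page.
        String html = "<td>Posted: [2008-08-09 14:51:35]</td><td>Reply: [2008-08-10 09:00:01]</td>";
        System.out.println(extractPostTimes(html));
    }
}
```

Unlike the URL check, this uses find() instead of matches(), because the timestamp is only a small part of the page.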

 

 

 Regular Expression Syntax

Syntax                          Explanation

Characters
c                               The character c
\unnnn, \xnn, \0n, \0nn, \0nnn  The code unit with the given hex or octal value
\t, \n, \r, \f, \a, \e          The control characters tab, newline, return, form feed, alert, and escape
\cc                             The control character corresponding to the character c

Character Classes
[C1C2...]                       Any of the characters represented by C1, C2, ... The Ci are characters, character ranges (c1-c2), or character classes
[^...]                          Complement of character class
[...&&...]                      Intersection of two character classes

Predefined Character Classes
.                               Any character except line terminators (or any character if the DOTALL flag is set)
\d                              A digit [0-9]
\D                              A nondigit [^0-9]
\s                              A whitespace character [ \t\n\r\f\x0B]
\S                              A non-whitespace character
\w                              A word character [a-zA-Z0-9_]
\W                              A nonword character
\p{name}                        A named character class; see Table 12-9
\P{name}                        The complement of a named character class

Boundary Matchers
^ $                             Beginning, end of input (or beginning, end of line in multiline mode)
\b                              A word boundary
\B                              A nonword boundary
\A                              Beginning of input
\z                              End of input
\Z                              End of input except final line terminator
\G                              End of previous match

Quantifiers
X?                              Optional X
X*                              X, 0 or more times
X+                              X, 1 or more times
X{n}, X{n,}, X{n,m}             X n times, at least n times, between n and m times

Quantifier Suffixes
?                               Turn default (greedy) match into reluctant match
+                               Turn default (greedy) match into possessive match

Set Operations
XY                              Any string from X, followed by any string from Y
X|Y                             Any string from X or Y

Grouping
(X)                             Capture the string matching X as a group
\n                              The match of the nth group

Escapes
\c                              The character c (must not be an alphabetic character)
\Q...\E                         Quote ... verbatim
(?...)                          Special construct; see API notes of Pattern class
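A few of the table entries are easier to understand by running them. The sketch below tries the three quantifier modes, a backreference, a character class intersection, and \Q...\E quoting. The class RegexSyntaxDemo, the helper firstMatch, and the sample strings are made up for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexSyntaxDemo {
    // Returns the first substring of input that matches regex, or null if none does.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String html = "<b>one</b><b>two</b>";

        // Greedy: .* grabs as much as possible, so the match runs to the last </b>.
        System.out.println(firstMatch("<b>.*</b>", html));

        // Reluctant: .*? grabs as little as possible, so the match stops at the first </b>.
        System.out.println(firstMatch("<b>.*?</b>", html));

        // Possessive: .*+ swallows the rest of the input and never gives anything
        // back, so the trailing </b> can no longer match and the result is null.
        System.out.println(firstMatch("<b>.*+</b>", html));

        // Backreference: \1 must repeat whatever group 1 captured (a doubled character).
        System.out.println(firstMatch("(\\w)\\1", "hello"));

        // Intersection: lowercase letters that are not vowels.
        System.out.println(firstMatch("[a-z&&[^aeiou]]+", "audio stream"));

        // Quoting: between \Q and \E the + is an ordinary character.
        System.out.println(firstMatch("\\Q1+1\\E", "1+1=2"));
    }
}
```

The possessive case is the one that most often surprises people: it is not a faster greedy match but a different one, and it can turn a successful match into a failure.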

 

 

A regular expression for stripping the tags from HTML and extracting the body text:

<script.*?</script>|<style.*?</style>|<!?[a-z]+[^>]*>|</[a-z0-9]+>|<!--.*?-->
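Here is one way the expression above might be applied in Java. The flags are my assumption, since the post gives only the pattern: CASE_INSENSITIVE so that uppercase tags such as <P> are caught, and DOTALL so that .*? can span the line breaks inside script, style, and comment blocks. The class HtmlTextExtractor, the method stripTags, and the sample page are invented for illustration.

```java
import java.util.regex.Pattern;

class HtmlTextExtractor {
    // The pattern from above, unchanged. Alternatives, in order:
    // script blocks, style blocks, opening tags (and <!DOCTYPE ...>),
    // closing tags, and comments.
    private static final Pattern TAGS = Pattern.compile(
        "<script.*?</script>|<style.*?</style>|<!?[a-z]+[^>]*>|</[a-z0-9]+>|<!--.*?-->",
        Pattern.CASE_INSENSITIVE | Pattern.DOTALL); // flags are an assumption, see above

    // Deletes everything the pattern matches and returns what is left.
    static String stripTags(String html) {
        return TAGS.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        // Hypothetical page used only to exercise each alternative.
        String html = "<html><head><style>p { color: red; }</style>"
            + "<script>\nvar x = 1;\n</script></head>"
            + "<body><!-- note --><P class=\"a\">Hello, <b>world</b>!</P></body></html>";
        System.out.println(stripTags(html));
    }
}
```

This only removes markup. Character entities such as &amp;nbsp; are left in the text, and unusual or malformed HTML may not be handled, so treat it as a rough filter rather than a parser.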

 

I have also uploaded a regular expression testing tool.
