A Few Practical Java Regular Expression Examples (Matching URLs, US Social Security Numbers, and Dates)

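A project I have been working on recently needs to extract strings from English text for topic clustering, so I spent a day learning Java regular expressions. The small examples below are my practice exercises. If anything in them looks wrong, I would appreciate corrections.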

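1. This example filters URLs out of English text and prints the filtered strings

First, here is the English text to be filtered. I saved it in a file named englishtxt.txt, with the following contents: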

www.baidu.com
银行挤兑:可能引发下一轮金融危机的盲点 http://mp.weixin.qq.com/s?__biz=MjM5MDY4Mzg2MA==&mid=200223248&idx=1&sn=a5b668754a60a8e07f335bd59521fb03#rd?…
Beijing CBD right now 01 pic.twitter.com/zCNP4CFrrk
I see more and more Chinese ask the same question online: what if most #MH370 passengers were Americans; how would the US government react?
10:27:01 Chinese Net friend expectations http://chinafree.greatzhonghua.org/showthread.php?tid=5377?… Chinese Net friend expectations -...
01:47:01 Times silly and fantastic notions, Gu Xiaojun Thought Yiu glorious http://chinafree.greatzhonghua.org/showthread.php?tid=4969?… T...
[強國空氣問題比愛滋更嚴重] China Smog at Center of <> by WHO http://bloom.bg/1rqNRBP? /via @BloombergNews
[Android 高登仔] LIHK 已重生,你會花 HK$10 買嗎? https://play.google.com/store/apps/details?id=com.lihk.hkgolden.app.reborn?…
#Taiwan protests: Water cannons are an indiscriminate tool for dispersing protesters & can result in serious injury
NASA 的新太空衣... http://jscfeatures.jsc.nasa.gov/z2/?
PHOTOS: Marijuana through the years http://ow.ly/uXzuq? (AP Photo/DEA) pic.twitter.com/4LSP4nlLMQ
Protest in Taiwan http://blog.flickr.net/en/2014/03/24/protest-in-taiwan/?… /via @flickr
[原來昨天說的那位嬰兒已經...] Baby born on board diverted Cathay flight dies http://www.scmp.com/news/hong-kong/article/1456417/baby-born-board-diverted-cathay-flight-dies?… /via @SCMP_News
What does Apple think about the lack of diversity in emojis? We have their response. http://on.mtv.com/OWu6D7? /via @MTVact
Linkin Park releases customizable music video powered by Xbox's Project Spark http://www.theverge.com/2014/3/25/5546982/linkin-park-releases-customizable-music-video-powered-by-xboxs?…
Full draw for @afcasiancup 2015 is here pic.twitter.com/nrYJo1mm9G #AC2015
Interesting draw RT @afcasiancup: Group B: Saudi Arabia, China PR, DPR Korea, Uzbekistan #AC2015
Finally: @emirates are activating their Twitter account.
Interior Minister Prince Mohammed bin Naif launches new ministry site aboard what appears like a private jet —SPA pic.twitter.com/NDSGJVbXTs

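As the file shows, the text contains a large number of URLs. Feeding it directly into topic clustering would introduce a lot of noisy data, so the URLs have to be removed first. My code is as follows: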

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class URLMatcher {
    public static void main(String[] args) throws IOException {
        // Matches runs of word characters (ASCII letters, digits, underscore); compiled once
        Pattern ptn = Pattern.compile("\\w+");
        // try-with-resources closes the reader even if reading fails
        try (BufferedReader br = new BufferedReader(new FileReader("D:/englishtxt.txt"))) {
            System.out.println("开始从文本中读数据"); // "Start reading data from the text"
            String line;
            while ((line = br.readLine()) != null) {
                // Step 1: remove URLs (optional scheme, dotted host, optional port, path up to '#' or whitespace)
                // Step 2: remove the listed punctuation characters
                String value = line.replaceAll("(http://|https://|ftp://)?(\\w+\\.)+\\w+(:\\d*)?([^#\\s]*)","").replaceAll("[\\/?:;!@#$%^&*+()【】<<>>...-]", "");
                // Step 3: keep only the remaining word tokens, separated by single spaces
                StringBuilder strb = new StringBuilder();
                Matcher mch = ptn.matcher(value);
                while (mch.find()) {
                    strb.append(mch.group()).append(" ");
                }
                System.out.println(strb.toString());
            }
        }
    }
}

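The code above not only filters out most URLs but also removes a number of special punctuation characters.

The output is as follows (the first line is the program's start-up message, "Start reading data from the text"):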

开始从文本中读数据

rd

Beijing CBD right now

I see more and more Chinese ask the same question online what if most MH passengers were Americans how would the US government react

Chinese Net friend expectations Chinese Net friend expectations

Times silly and fantastic notions Gu Xiaojun Thought Yiu glorious T

China Smog at Center of Air Pollution Deaths Cited by WHO via BloombergNews

Android LIHK HK I

Taiwan protests Water cannons are an indiscriminate tool for dispersing protesters can result in serious injury

NASA

PHOTOS Marijuana through the years AP PhotoDEA

Protest in Taiwan via flickr

f Baby born on board diverted Cathay flight dies via SCMP News

What does Apple think about the lack of diversity in emojis We have their response via MTVact

Linkin Park releases customizable music video powered by Xbox s Project Spark

Full draw for afcasiancup is here AC

Interesting draw RT afcasiancup Group B Saudi Arabia China PR DPR Korea Uzbekistan AC

Finally emirates are activating their Twitter account

Interior Minister Prince Mohammed bin Naif launches new ministry site aboard what appears like a private jet SPA

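As the output shows, essentially all of the URLs have been filtered out. Two caveats are worth knowing. The URL pattern stops at a '#', so the fragment of a URL survives, which is where the stray "rd" in the output comes from. And because the scheme is optional, the pattern removes any dot-separated token, not just URLs: a decimal number such as 3.14 would be stripped as well.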
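To try the URL pattern without a file on disk, here is a minimal sketch that applies the same regular expression to in-memory strings. The class name UrlStripDemo and the method stripUrls are my own illustrative names, not part of the original program, and the whitespace collapsing is an extra step I added so the results are easier to compare.

```java
import java.util.regex.Pattern;

final class UrlStripDemo {
    // Same URL pattern as in the example above, compiled once for reuse.
    private static final Pattern URL = Pattern.compile(
            "(http://|https://|ftp://)?(\\w+\\.)+\\w+(:\\d*)?([^#\\s]*)");

    // Removes every match of the URL pattern, then collapses runs of
    // whitespace left behind and trims the ends (the collapsing is an addition).
    static String stripUrls(String text) {
        return URL.matcher(text).replaceAll("").replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        // prints: Protest in Taiwan /via @flickr
        System.out.println(stripUrls(
                "Protest in Taiwan http://blog.flickr.net/en/2014/03/24/protest-in-taiwan/ /via @flickr"));
        // The scheme is optional, so a decimal number is removed too.
        // prints: pi is about today
        System.out.println(stripUrls("pi is about 3.14 today"));
        // The path part stops at '#', so the fragment is left behind.
        // prints: see #rd now
        System.out.println(stripUrls("see http://example.com/page#rd now"));
    }
}
```

If losing decimal numbers matters for your data, one option is to make the scheme (or a leading "www.") mandatory, at the cost of missing bare hosts such as pic.twitter.com.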

2. The next small example matches US Social Security numbers

The code is as follows:

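import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SafeNumMatcher {
    public static void main(String[] args) {
        String safeNum = "This is a safe num 999-99-9999,this is the second num 456003348,this is the third num 456-909090,this is the fourth num 45677-0764";
        // Three digits, optional hyphen, two digits, optional hyphen, four digits
        Pattern ptn = Pattern.compile("\\d{3}-?\\d{2}-?\\d{4}");
        Matcher mch = ptn.matcher(safeNum);
        while (mch.find()) {
            System.out.println(mch.group());
        }
    }
}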

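The final output is shown below. Because each hyphen is independently optional, the half-hyphenated strings 456-909090 and 45677-0764 match as well: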

999-99-9999

456003348

456-909090

45677-0764
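If the half-hyphenated forms should be rejected, one possible variation is to capture the first optional hyphen and require the second separator to be identical through a backreference, and to add word boundaries so that the match cannot start or end inside a longer digit run. The sketch below is only that, a sketch: the class name StrictSafeNumDemo and the method findAll are hypothetical names of mine, and the pattern checks the shape of the number only, not whether it is a real, issued number.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class StrictSafeNumDemo {
    // (-?) captures either "-" or the empty string; \1 then demands the same
    // separator in the second position. \b keeps matches on token boundaries.
    private static final Pattern STRICT =
            Pattern.compile("\\b\\d{3}(-?)\\d{2}\\1\\d{4}\\b");

    // Returns every substring of text that matches the strict pattern, in order.
    static List<String> findAll(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = STRICT.matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String safeNum = "This is a safe num 999-99-9999,this is the second num 456003348,"
                + "this is the third num 456-909090,this is the fourth num 45677-0764";
        // prints: [999-99-9999, 456003348]
        System.out.println(findAll(safeNum));
    }
}
```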

3. This small example matches dates written in English

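import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DateMatcher {
    public static void main(String[] args) {
        String strDate = "this is a date June 26,1951";
        // A word of letters, one whitespace character, a 1-2 digit day, a comma, optional spaces, a 4-digit year
        Pattern ptn = Pattern.compile("([a-zA-Z]+)\\s[0-9]{1,2},\\s*[0-9]{4}");
        Matcher mch = ptn.matcher(strDate);
        while (mch.find()) {
            System.out.println(mch.group());
        }
    }
}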

The output is:

June 26,1951
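Note that [a-zA-Z]+ accepts any word in the month position, so a string like "Foo 26,1951" would match too. As a hedged variation, the sketch below lists the twelve English month names explicitly and uses named groups to pull the parts out. The class name StrictDateDemo, the method parse, and the group names month, day, and year are my own choices for illustration, and the pattern still does not check that the day is valid for the month.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class StrictDateDemo {
    // Month must be one of the twelve full English names; day and year are
    // captured as named groups so they can be read back individually.
    private static final Pattern DATE = Pattern.compile(
            "\\b(?<month>January|February|March|April|May|June|July|August"
            + "|September|October|November|December)"
            + "\\s+(?<day>\\d{1,2}),\\s*(?<year>\\d{4})\\b");

    // Returns {month, day, year} for the first date found, or null if none.
    static String[] parse(String text) {
        Matcher m = DATE.matcher(text);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group("month"), m.group("day"), m.group("year") };
    }

    public static void main(String[] args) {
        String[] parts = parse("this is a date June 26,1951");
        // prints: June / 26 / 1951
        System.out.println(parts[0] + " / " + parts[1] + " / " + parts[2]);
        // prints: null
        System.out.println(parse("this is not a date Foo 26,1951"));
    }
}
```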

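These three small examples are the practice exercises I wrote while learning regular expressions. I hope they are helpful to others who are learning too.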
