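Writing a regular expression for email addresses is not easy

Extracting email addresses from text

A regular expression is the obvious tool for this. The code is as follows: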

 

import java.util.regex.Matcher;
import java.util.regex.Pattern;


Pattern pattern = Pattern.compile("[-\\w\\.]+@[-\\w\\.]+");

// s is the text to process
Matcher m = pattern.matcher(s);
while (m.find()) {
    // m.group() returns the text of the current match
    String email = m.group();
    log.debug("email found " + email);
}
 

 

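Admittedly, the email regular expression above is not very strict: it accepts any run of word characters, hyphens, and dots on either side of the "@".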
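To make that looseness concrete, here is a small sketch that runs the same pattern over two sample sentences. The class name `LooseEmailDemo`, the helper `extract`, and the sample inputs are made up for illustration; only the pattern itself comes from the snippet above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; only the pattern is taken from the post.
class LooseEmailDemo {
    static final Pattern LOOSE = Pattern.compile("[-\\w\\.]+@[-\\w\\.]+");

    // Collects every substring that the loose pattern matches.
    static List<String> extract(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = LOOSE.matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        // The sentence-final period is swallowed into the match.
        System.out.println(extract("Write to bob@example.com.")); // [bob@example.com.]
        // Dots and hyphens alone are accepted on both sides of '@'.
        System.out.println(extract("see ..@-- for details"));     // [..@--]
    }
}
```

Both results are things a stricter pattern would reject or trim, which is why the rest of this post looks at how the format is actually specified.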

 

The canonical format of an email address is described in detail in [RFC 2822] Internet Message Format.

 

http://tools.ietf.org/html/rfc2822#section-3.4.1 says:
3.4.1. Addr-spec specification


An addr-spec is a specific Internet identifier that contains a
locally interpreted string followed by the at-sign character ("@",
ASCII value 64) followed by an Internet domain. The locally
interpreted string is either a quoted-string or a dot-atom. If the
string can be represented as a dot-atom (that is, it contains no
characters other than atext characters or "." surrounded by atext
characters), then the dot-atom form SHOULD be used and the
quoted-string form SHOULD NOT be used. Comments and folding white
space SHOULD NOT be used around the "@" in the addr-spec.

addr-spec = local-part "@" domain

local-part = dot-atom / quoted-string / obs-local-part

domain = dot-atom / domain-literal / obs-domain

domain-literal = [CFWS] "[" *([FWS] dcontent) [FWS] "]" [CFWS]

dcontent = dtext / quoted-pair

dtext = NO-WS-CTL /    ; Non white space controls
        %d33-90 /      ; The rest of the US-ASCII
        %d94-126       ; characters not including "[",
                       ; "]", or "\"

The domain portion identifies the point to which the mail is
delivered. In the dot-atom form, this is interpreted as an Internet
domain name (either a host name or a mail exchanger name) as
described in [STD3, STD13, STD14]. In the domain-literal form, the
domain is interpreted as the literal Internet address of the
particular host. In both cases, how addressing is used and how
messages are transported to a particular host is covered in the mail
transport document [RFC2821]. These mechanisms are outside of the
scope of this document.

The local-part portion is a domain dependent string. In addresses,
it is simply interpreted on the particular host as a name of a
particular mailbox.
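As a rough illustration of the grammar quoted above, the following sketch covers only the recommended form, dot-atom "@" dot-atom. It deliberately leaves out quoted-string, domain-literal, comments, folding white space, and the obsolete forms. The atext character set is taken from RFC 2822 section 3.2.4; the class and method names are invented for this example.

```java
import java.util.regex.Pattern;

// Hypothetical helper: a partial reading of addr-spec, not a full RFC 2822 validator.
class DotAtomAddrSpec {
    // atext: letters, digits, and the symbols listed in RFC 2822 section 3.2.4.
    static final String ATEXT = "[A-Za-z0-9!#$%&'*+/=?^_`{|}~-]";
    // dot-atom-text = 1*atext *("." 1*atext)
    static final String DOT_ATOM = ATEXT + "+(?:\\." + ATEXT + "+)*";
    // addr-spec restricted to: dot-atom "@" dot-atom
    static final Pattern ADDR_SPEC = Pattern.compile(DOT_ATOM + "@" + DOT_ATOM);

    // matches() requires the whole string to fit the pattern.
    static boolean isDotAtomAddrSpec(String s) {
        return ADDR_SPEC.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isDotAtomAddrSpec("john.doe@example.com"));  // true
        System.out.println(isDotAtomAddrSpec("john..doe@example.com")); // false: empty atom between dots
        System.out.println(isDotAtomAddrSpec(".john@example.com"));     // false: leading dot
    }
}
```

Note that this accepts "a@b", because a dot-atom may consist of a single atom; whether that is a usable Internet domain is a separate question from syntax.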
 

 

The well-known jQuery Validation plugin writes its email regular expression like this:

 

        // http://docs.jquery.com/Plugins/Validation/Methods/email
        email: function(value, element) {
            // contributed by Scott Gonzalez: http://projects.scottsplayground.com/email_address_validation/
            return this.optional(element) || /^((([a-z]|\d|[!#\$%&'\*\+\-\/=\?\^_`{\|}~]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])+(\.([a-z]|\d|[!#\$%&'\*\+\-\/=\?\^_`{\|}~]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])+)*)|((\x22)((((\x20|\x09)*(\x0d\x0a))?(\x20|\x09)+)?(([\x01-\x08\x0b\x0c\x0e-\x1f\x7f]|\x21|[\x23-\x5b]|[\x5d-\x7e]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(\\([\x01-\x09\x0b\x0c\x0d-\x7f]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF]))))*(((\x20|\x09)*(\x0d\x0a))?(\x20|\x09)+)?(\x22)))@((([a-z]|\d|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(([a-z]|\d|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])*([a-z]|\d|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])))\.)+(([a-z]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(([a-z]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])*([a-z]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])))\.?$/i.test(value);
        },

 

 

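It is certainly complex, and the accepted characters also include many Unicode characters. One comment in the code gives a URL: http://projects.scottsplayground.com/email_address_validation/ . That page leads to a JavaScript file that spells out the validation of each component of an email address in detail, following RFC 2822.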

 

// The following code is from http://projects.scottsplayground.com/email_address_validation/lib/email.js

/*
 * Email addresses
 * 
 * Definitions from (unless otherwise noted):
 * - [RFC 2822] Internet Message Format
 */

// internationalized
var text =
	"(" +
		"[" +
			"\\x01-\\x09" +
			"\\x0b" +
			"\\x0c" +
			"\\x0d-\\x7f" +
		"]" +
		"|" + ucschar +
	")";

var NOWSCTL =
	"[" +
		"\\x01-\\x08" +
		"\\x0b" +
		"\\x0c" +
		"\\x0e-\\x1f" +
		"\\x7f" +
	"]";

// section 2.2.2 Header Fields
var SP = "\\x20";
var HTAB = "\\x09";
var WSP =
	"(" +
		SP +
		"|" + HTAB +
	")";

// section 2.1 General Description
var CRLF = "(\\x0d\\x0a)";

// removed obsolete folding white space (obs-FWS)
var FWS =
	"(" +
		"(" +
			WSP + "*" +
			CRLF +
		")?" +
		WSP + "+" +
	")";

var DQUOTE = "(\\x22)";

// internationalized
var qtext =
	"(" +
		NOWSCTL +
		"|\\x21" +
		"|[\\x23-\\x5b]" +
		"|[\\x5d-\\x7e]" +
		"|" + ucschar +
	")";

// removed obsolete quoted pair (obs-qp)
var quotedPair =
	"(" +
		"\\\\" +
		text +
	")";

var qcontent =
	"(" +
		qtext +
		"|" + quotedPair +
	")";

// removed comments and folding white space (CFWS)
var quotedString =
	"(" +
		DQUOTE +
		"(" +
			FWS + "?" +
			qcontent +
		")*" +
		FWS + "?" +
		DQUOTE +
	")";

// created from symbols in atext
var atextSymbols = "[!#\\$%&'\\*\\+\\-\\/=\\?\\^_`{\\|}~]";

// internationalized
var atext =
	"(" +
		ALPHA +
		"|" + DIGIT +
		"|" + atextSymbols +
		"|" + ucschar +
	")";

var dotAtomText =
	"(" +
		atext + "+" +
		"(" +
			"\\." +
			atext + "+" +
		")*" +
	")";

// removed comments and folding white space (CFWS)
var dotAtom = dotAtomText;

// removed comments and folding white space (CFWS)
var atom = atext + "+";

// ihostName from iri.http.js (http://projects.scottsplayground.com/iri)
var domain = ihostName;

// removed obsolete local part (obs-local-part)
var localPart =
	"(" +
		dotAtom +
		"|" + quotedString +
	")";

var addrSpec = localPart + "@" + domain;
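The same build-it-from-named-pieces approach carries over to Java. The sketch below ports the fragments shown above, with two loud assumptions: the internationalized `ucschar` alternative is dropped (ASCII only), and `DOMAIN` is a hypothetical stand-in for `ihostName`, whose definition lives in iri.http.js and is not shown here. `ALPHA` and `DIGIT` are also assumed to be plain ASCII letter and digit classes.

```java
import java.util.regex.Pattern;

// Hypothetical ASCII-only port of the composition technique; not the original library.
class ComposedAddrSpec {
    static final String ALPHA = "[A-Za-z]"; // assumption: ASCII letters
    static final String DIGIT = "[0-9]";    // assumption: ASCII digits
    static final String TEXT = "[\\x01-\\x09\\x0b\\x0c\\x0d-\\x7f]";
    static final String NOWSCTL = "[\\x01-\\x08\\x0b\\x0c\\x0e-\\x1f\\x7f]";
    static final String WSP = "[\\x20\\x09]";
    static final String CRLF = "(?:\\x0d\\x0a)";
    static final String FWS = "(?:(?:" + WSP + "*" + CRLF + ")?" + WSP + "+)";
    static final String QTEXT = "(?:" + NOWSCTL + "|[\\x21\\x23-\\x5b\\x5d-\\x7e])";
    static final String QUOTED_PAIR = "(?:\\\\" + TEXT + ")";
    static final String QCONTENT = "(?:" + QTEXT + "|" + QUOTED_PAIR + ")";
    static final String QUOTED_STRING =
        "(?:\"(?:" + FWS + "?" + QCONTENT + ")*" + FWS + "?\")";
    static final String ATEXT =
        "(?:" + ALPHA + "|" + DIGIT + "|[!#$%&'*+\\-/=?^_`{|}~])";
    static final String DOT_ATOM = "(?:" + ATEXT + "+(?:\\." + ATEXT + "+)*)";
    static final String LOCAL_PART = "(?:" + DOT_ATOM + "|" + QUOTED_STRING + ")";
    // Placeholder for ihostName: dot-separated labels of letters, digits, hyphens.
    static final String LABEL = "[A-Za-z0-9](?:[A-Za-z0-9-]*[A-Za-z0-9])?";
    static final String DOMAIN = "(?:" + LABEL + "(?:\\." + LABEL + ")*)";

    static final Pattern ADDR_SPEC = Pattern.compile(LOCAL_PART + "@" + DOMAIN);

    static boolean isAddrSpec(String s) {
        return ADDR_SPEC.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isAddrSpec("john.doe@example.com"));       // true
        System.out.println(isAddrSpec("\"john doe\"@example.com"));   // true: quoted-string local part
        System.out.println(isAddrSpec("john doe@example.com"));       // false: bare space
    }
}
```

Naming each grammar rule keeps the final pattern readable and lets each piece be checked against the RFC on its own, which is the main appeal of this style over one monolithic literal.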
 

 

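Another good write-up is http://www.markussipila.info/pub/emailvalidator.php

It provides two regular expressions: one for ordinary-looking addresses, and one for unusual-looking addresses (containing characters such as !, #, and $). They are shown below: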

 

// define a regular expression for "normal" addresses
$normal = "^[a-z0-9_\+-]+(\.[a-z0-9_\+-]+)*@[a-z0-9-]+(\.[a-z0-9-]+)*\.([a-z]{2,4})$"; 

// define a regular expression for "strange looking" but syntactically valid addresses
$validButRare = "^[a-z0-9,!#\$%&'\*\+/=\?\^_`\{\|}~-]+(\.[a-z0-9,!#\$%&'\*\+/=\?\^_`\{\|}~-]+)*@[a-z0-9-]+(\.[a-z0-9-]+)*\.([a-z]{2,})$"; 

if (eregi($normal, $email)) {
  echo("The address $email is valid and looks normal.");
} else if (eregi($validButRare, $email)) {
  echo("The address $email looks a bit strange but it is syntactically valid. You might want to check it for typos.");
} else {
  echo("The address $email is not valid.");
}
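For readers working in Java, here is a minimal sketch of the same two-tier check. The two patterns are transcribed from the PHP strings above, with `Pattern.CASE_INSENSITIVE` standing in for the case-insensitive `eregi`; the class name, the `Verdict` enum, and `classify` are invented for this example.

```java
import java.util.regex.Pattern;

// Hypothetical Java counterpart of the two-tier PHP check above.
class EmailLooksCheck {
    enum Verdict { NORMAL, VALID_BUT_RARE, INVALID }

    // "Normal" addresses: limited symbols, top-level domain of 2 to 4 letters.
    static final Pattern NORMAL_RE = Pattern.compile(
        "^[a-z0-9_+-]+(\\.[a-z0-9_+-]+)*@[a-z0-9-]+(\\.[a-z0-9-]+)*\\.([a-z]{2,4})$",
        Pattern.CASE_INSENSITIVE);

    // "Strange looking" but accepted addresses: wider symbol set, longer TLDs.
    static final Pattern RARE_RE = Pattern.compile(
        "^[a-z0-9,!#$%&'*+/=?^_`{|}~-]+(\\.[a-z0-9,!#$%&'*+/=?^_`{|}~-]+)*"
            + "@[a-z0-9-]+(\\.[a-z0-9-]+)*\\.([a-z]{2,})$",
        Pattern.CASE_INSENSITIVE);

    // Try the strict pattern first, then fall back to the permissive one.
    static Verdict classify(String email) {
        if (NORMAL_RE.matcher(email).matches()) {
            return Verdict.NORMAL;
        }
        if (RARE_RE.matcher(email).matches()) {
            return Verdict.VALID_BUT_RARE;
        }
        return Verdict.INVALID;
    }

    public static void main(String[] args) {
        System.out.println(classify("john.doe@example.com")); // NORMAL
        System.out.println(classify("o'brien@example.com"));  // VALID_BUT_RARE
        System.out.println(classify("john@@example.com"));    // INVALID
    }
}
```

The middle verdict is the useful part of this design: instead of rejecting an unusual address outright, the caller can ask the user to double-check it for typos.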
 

 

 
