Extracting email addresses from text
A regular expression is the natural tool for this. The code is as follows:
import java.util.regex.Matcher;
import java.util.regex.Pattern;
Pattern pattern = Pattern.compile("[-\\w\\.]+@[-\\w\\.]+");
// s is the text to be processed
Matcher m = pattern.matcher(s);
while (m.find()) {
    // m.group() returns the matched substring directly
    String email = m.group();
    log.debug("email found " + email);
}
Admittedly, the email regular expression above is not very rigorous.
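To make the looseness concrete, here is a minimal self-contained sketch that wraps the same pattern in a class and runs it on a sample sentence. The class name `LooseEmailExtractor` and the method `extract` are hypothetical names chosen for this illustration, not part of any library.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
class LooseEmailExtractor {
    // Same loose pattern as in the snippet above.
    private static final Pattern EMAIL = Pattern.compile("[-\\w\\.]+@[-\\w\\.]+");

    // Returns every substring of text that the loose pattern matches.
    static List<String> extract(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = EMAIL.matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        // The sentence-ending period is swallowed into the first match,
        // and "..@.." is accepted although it is not a usable address.
        System.out.println(extract("Mail bob@example.com. Or try ..@.. instead"));
        // prints [bob@example.com., ..@..]
    }
}
```

Because the character class allows dots anywhere, the pattern cannot tell a dot inside an address from punctuation that merely follows it.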
The canonical format of an email address is specified in detail in [RFC 2822] Internet Message Format.
http://tools.ietf.org/html/rfc2822#section-3.4.1 says:
3.4.1. Addr-spec specification
An addr-spec is a specific Internet identifier that contains a
locally interpreted string followed by the at-sign character ("@",
ASCII value 64) followed by an Internet domain. The locally
interpreted string is either a quoted-string or a dot-atom. If the
string can be represented as a dot-atom (that is, it contains no
characters other than atext characters or "." surrounded by atext
characters), then the dot-atom form SHOULD be used and the
quoted-string form SHOULD NOT be used. Comments and folding white
space SHOULD NOT be used around the "@" in the addr-spec.
addr-spec = local-part "@" domain
local-part = dot-atom / quoted-string / obs-local-part
domain = dot-atom / domain-literal / obs-domain
domain-literal = [CFWS] "[" *([FWS] dcontent) [FWS] "]" [CFWS]
dcontent = dtext / quoted-pair
dtext = NO-WS-CTL / ; Non white space controls
%d33-90 / ; The rest of the US-ASCII
%d94-126 ; characters not including "[",
; "]", or "\"
The domain portion identifies the point to which the mail is
delivered. In the dot-atom form, this is interpreted as an Internet
domain name (either a host name or a mail exchanger name) as
described in [STD3, STD13, STD14]. In the domain-literal form, the
domain is interpreted as the literal Internet address of the
particular host. In both cases, how addressing is used and how
messages are transported to a particular host is covered in the mail
transport document [RFC2821]. These mechanisms are outside of the
scope of this document.
The local-part portion is a domain dependent string. In addresses,
it is simply interpreted on the particular host as a name of a
particular mailbox.
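The dot-atom form described in the quoted section can be turned into a stricter extraction pattern. The following is only a sketch under stated assumptions: it handles the dot-atom local part only (no quoted-string, domain-literal or obsolete forms), the atext character set is taken from RFC 2822 section 3.2.4, and the domain is restricted to host-name style labels, which is my simplification. `DotAtomEmailExtractor` is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only; covers the dot-atom form only.
class DotAtomEmailExtractor {
    // atext from RFC 2822 section 3.2.4: letters, digits and these symbols.
    private static final String ATEXT = "[A-Za-z0-9!#$%&'*+/=?^_`{|}~-]";
    // dot-atom-text = 1*atext *("." 1*atext)
    private static final String DOT_ATOM = ATEXT + "+(?:\\." + ATEXT + "+)*";
    // Assumption: host-name style labels (letters, digits, inner hyphens).
    private static final String LABEL = "[A-Za-z0-9](?:[A-Za-z0-9-]*[A-Za-z0-9])?";
    private static final String DOMAIN = LABEL + "(?:\\." + LABEL + ")*";
    private static final Pattern ADDR_SPEC = Pattern.compile(DOT_ATOM + "@" + DOMAIN);

    // Returns every substring of text that matches the dot-atom addr-spec sketch.
    static List<String> extract(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = ADDR_SPEC.matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        // The trailing period is no longer captured and "..@.." is skipped,
        // because every dot must be surrounded by atext characters or labels.
        System.out.println(extract("Mail bob@example.com. Or try ..@.. instead"));
        // prints [bob@example.com]
    }
}
```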
The well-known jQuery Validation plugin writes its email regular expression like this:
// http://docs.jquery.com/Plugins/Validation/Methods/email
email: function(value, element) {
    // contributed by Scott Gonzalez: http://projects.scottsplayground.com/email_address_validation/
    return this.optional(element) || /^((([a-z]|\d|[!#\$%&'\*\+\-\/=\?\^_`{\|}~]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])+(\.([a-z]|\d|[!#\$%&'\*\+\-\/=\?\^_`{\|}~]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])+)*)|((\x22)((((\x20|\x09)*(\x0d\x0a))?(\x20|\x09)+)?(([\x01-\x08\x0b\x0c\x0e-\x1f\x7f]|\x21|[\x23-\x5b]|[\x5d-\x7e]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(\\([\x01-\x09\x0b\x0c\x0d-\x7f]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF]))))*(((\x20|\x09)*(\x0d\x0a))?(\x20|\x09)+)?(\x22)))@((([a-z]|\d|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(([a-z]|\d|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])*([a-z]|\d|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])))\.)+(([a-z]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(([a-z]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])*([a-z]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])))\.?$/i.test(value);
},
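It is complex indeed, and the permitted characters even include many Unicode characters. One comment line in the code gives a URL: http://projects.scottsplayground.com/email_address_validation/ . Following it leads to a JavaScript file that spells out, in detail, the validation of each component of an email address, based on RFC 2822.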
// The following code comes from http://projects.scottsplayground.com/email_address_validation/lib/email.js
/*
 * Email addresses
 *
 * Definitions from (unless otherwise noted):
 * - [RFC 2822] Internet Message Format
 */
// internationalized
var text = "(" + "[" + "\\x01-\\x09" + "\\x0b" + "\\x0c" + "\\x0d-\\x7f" + "]" + "|" + ucschar + ")";
var NOWSCTL = "[" + "\\x01-\\x08" + "\\x0b" + "\\x0c" + "\\x0e-\\x1f" + "\\x7f" + "]";
// section 2.2.2 Header Fields
var SP = "\\x20";
var HTAB = "\\x09";
var WSP = "(" + SP + "|" + HTAB + ")";
// section 2.1 General Description
var CRLF = "(\\x0d\\x0a)";
// removed obsolete folding white space (obs-FWS)
var FWS = "(" + "(" + WSP + "*" + CRLF + ")?" + WSP + "+" + ")";
var DQUOTE = "(\\x22)";
// internationalized
var qtext = "(" + NOWSCTL + "|\\x21" + "|[\\x23-\\x5b]" + "|[\\x5d-\\x7e]" + "|" + ucschar + ")";
// removed obsolete quoted pair (obs-qp)
var quotedPair = "(" + "\\\\" + text + ")";
var qcontent = "(" + qtext + "|" + quotedPair + ")";
// removed comments and folding white space (CFWS)
var quotedString = "(" + DQUOTE + "(" + FWS + "?" + qcontent + ")*" + FWS + "?" + DQUOTE + ")";
// created from symbols in atext
var atextSymbols = "[!#\\$%&'\\*\\+\\-\\/=\\?\\^_`{\\|}~]";
// internationalized
var atext = "(" + ALPHA + "|" + DIGIT + "|" + atextSymbols + "|" + ucschar + ")";
var dotAtomText = "(" + atext + "+" + "(" + "\\." + atext + "+" + ")*" + ")";
// removed comments and folding white space (CFWS)
var dotAtom = dotAtomText;
// removed comments and folding white space (CFWS)
var atom = atext + "+";
// ihostName from iri.http.js (http://projects.scottsplayground.com/iri)
var domain = ihostName;
// removed obsolete local part (obs-local-part)
var localPart = "(" + dotAtom + "|" + quotedString + ")";
var addrSpec = localPart + "@" + domain;
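The same build-it-from-named-pieces approach carries over to Java. What follows is a rough sketch rather than a faithful port: it keeps only the ASCII ranges (the `ucschar` alternatives are dropped), and because `ihostName` is defined in another file, a plain host-name pattern stands in for the domain. Both are my assumptions, and `AddrSpecValidator` is a hypothetical name.

```java
import java.util.regex.Pattern;

// Hypothetical port, for illustration only: ASCII subset of the fragments above.
class AddrSpecValidator {
    // Assumption: the ucschar (non-ASCII) alternatives are left out.
    static final String TEXT = "[\\x01-\\x09\\x0b\\x0c\\x0d-\\x7f]";
    static final String NOWSCTL = "[\\x01-\\x08\\x0b\\x0c\\x0e-\\x1f\\x7f]";
    static final String WSP = "[\\x20\\x09]";
    static final String CRLF = "\\x0d\\x0a";
    static final String FWS = "(?:(?:" + WSP + "*" + CRLF + ")?" + WSP + "+)";
    static final String QTEXT = "(?:" + NOWSCTL + "|[\\x21\\x23-\\x5b\\x5d-\\x7e])";
    // A backslash followed by any single text character.
    static final String QUOTED_PAIR = "(?:\\\\" + TEXT + ")";
    static final String QCONTENT = "(?:" + QTEXT + "|" + QUOTED_PAIR + ")";
    static final String QUOTED_STRING =
            "(?:\"(?:" + FWS + "?" + QCONTENT + ")*" + FWS + "?\")";
    static final String ATEXT = "[A-Za-z0-9!#$%&'*+/=?^_`{|}~-]";
    static final String DOT_ATOM = "(?:" + ATEXT + "+(?:\\." + ATEXT + "+)*)";
    static final String LOCAL_PART = "(?:" + DOT_ATOM + "|" + QUOTED_STRING + ")";
    // Assumption: a plain host name stands in for ihostName.
    static final String LABEL = "[A-Za-z0-9](?:[A-Za-z0-9-]*[A-Za-z0-9])?";
    static final String DOMAIN = "(?:" + LABEL + "(?:\\." + LABEL + ")*)";
    static final Pattern ADDR_SPEC = Pattern.compile(LOCAL_PART + "@" + DOMAIN);

    // True only if the whole candidate string is an addr-spec under this sketch.
    static boolean isValid(String candidate) {
        return ADDR_SPEC.matcher(candidate).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("john.doe@example.com"));       // dot-atom local part
        System.out.println(isValid("\"john doe\"@example.com"));   // quoted-string local part
        System.out.println(isValid("john doe@example.com"));       // bare space is rejected
    }
}
```

Naming each fragment keeps the final expression readable and lets each piece be checked against its grammar rule one at a time.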
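Another good write-up is http://www.markussipila.info/pub/emailvalidator.php
It provides two regular expressions: one for ordinary-looking addresses and one for unusual-looking addresses (those containing characters such as !#$). They are shown below: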
// define a regular expression for "normal" addresses
$normal = "^[a-z0-9_\+-]+(\.[a-z0-9_\+-]+)*@[a-z0-9-]+(\.[a-z0-9-]+)*\.([a-z]{2,4})$";
// define a regular expression for "strange looking" but syntactically valid addresses
$validButRare = "^[a-z0-9,!#\$%&'\*\+/=\?\^_`\{\|}~-]+(\.[a-z0-9,!#\$%&'\*\+/=\?\^_`\{\|}~-]+)*@[a-z0-9-]+(\.[a-z0-9-]+)*\.([a-z]{2,})$";
if (eregi($normal, $email)) {
echo("The address $email is valid and looks normal.");
} else if (eregi($validButRare, $email)) {
echo("The address $email looks a bit strange but it is syntactically valid. You might want to check it for typos.");
} else {
echo("The address $email is not valid.");
}
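For consistency with the Java extraction code at the top, the two-tier check can be sketched in Java as well. This is my own port, not code from the linked page: `EmailClassifier`, `Verdict` and `classify` are hypothetical names, and `Pattern.CASE_INSENSITIVE` is used to stand in for the case-insensitive matching that `eregi` performs.

```java
import java.util.regex.Pattern;

// Hypothetical Java port of the two-tier check, for illustration only.
class EmailClassifier {
    enum Verdict { NORMAL, VALID_BUT_RARE, INVALID }

    // Characters allowed in the local part of "strange looking" addresses.
    private static final String RARE_CHARS = "[a-z0-9,!#$%&'*+/=?^_`{|}~-]";

    private static final Pattern NORMAL_PATTERN = Pattern.compile(
            "^[a-z0-9_+-]+(\\.[a-z0-9_+-]+)*@[a-z0-9-]+(\\.[a-z0-9-]+)*\\.([a-z]{2,4})$",
            Pattern.CASE_INSENSITIVE);

    private static final Pattern RARE_PATTERN = Pattern.compile(
            "^" + RARE_CHARS + "+(\\." + RARE_CHARS + "+)*"
                    + "@[a-z0-9-]+(\\.[a-z0-9-]+)*\\.([a-z]{2,})$",
            Pattern.CASE_INSENSITIVE);

    // Tries the strict pattern first, then falls back to the permissive one.
    static Verdict classify(String email) {
        if (NORMAL_PATTERN.matcher(email).matches()) {
            return Verdict.NORMAL;
        }
        if (RARE_PATTERN.matcher(email).matches()) {
            return Verdict.VALID_BUT_RARE;
        }
        return Verdict.INVALID;
    }

    public static void main(String[] args) {
        System.out.println(classify("john.doe@example.com"));
        System.out.println(classify("o'brien@example.museum"));
        System.out.println(classify("not an email"));
    }
}
```

Checking the strict pattern first means the permissive one only decides addresses that already look unusual, which is what allows the "check it for typos" warning in the original.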