2.2.6. Text-to-HTML Conversion
A short program that reads a file:
#! /usr/bin/perl -w
# Mastering Regular Expressions: Chapter 2 Section 2.
# fifth program
undef $/;
$textStr = <>;
print $textStr;
2.2.6.1 Cooking special characters
Convert the characters that are special in HTML to their HTML-encoded forms:
‘&’ -> ‘&amp;’
‘<’ -> ‘&lt;’
‘>’ -> ‘&gt;’
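The order of these conversions matters, which a small sketch can make concrete. The helper name cook_html and the sample string are assumptions made for illustration; they do not appear in the original program.

```perl
use strict;
use warnings;

# Hypothetical helper (not in the original program): make text HTML-safe.
sub cook_html {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;   # must run first, or it would re-escape the "&" of &lt; and &gt;
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    return $text;
}

print cook_html('if (a < b && b > c)'), "\n";
# prints: if (a &lt; b &amp;&amp; b &gt; c)
```

If the ampersand substitution ran last instead, the "&" introduced by the other two would be escaped a second time.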
2.2.6.2 Separating paragraphs
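It is worth emphasizing that 「^」 and 「$」 normally match not at the start and end of each logical line, but at the start and end of the entire string.
The resulting substitution looks like this: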
s/^\s*$/<p>/mg
m: makes 「^」 and 「$」 match at the start and end of each logical line within the string.
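The effect of the modifier can be checked with a minimal sketch that runs the same substitution with and without /m. The sample text and variable names are assumptions chosen for illustration.

```perl
use strict;
use warnings;

# Two paragraphs separated by one empty line (no trailing newline).
my $text = "first paragraph\n\nsecond paragraph";

# Without /m, ^ and $ anchor only to the whole string, so nothing matches.
(my $without_m = $text) =~ s/^\s*$/<p>/g;

# With /m, ^ and $ also match around each embedded logical line.
(my $with_m = $text) =~ s/^\s*$/<p>/mg;

print $without_m eq $text ? "unchanged without /m\n" : "changed without /m\n";
print $with_m, "\n";
```

With /m, the empty line between the two paragraphs is replaced by <p>; without it, the text comes back unchanged.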
2.2.6.3 “Linkizing” an email address
An email address has the basic form username@hostname, so the substitution takes this shape:
s/\b(username regex\@hostname regex)\b/<a href="mailto:$1">$1<\/a>/g
2.2.6.4 Matching the username and hostname
From the rules for email addresses, we arrive at:
「\w[-.\w]*\@\w+(\.\w+)」
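Here 「\w[-.\w]*」 matches the username, 「\@」 matches the literal "@", and 「\w+(\.\w+)」 matches the hostname.
Note that the hyphen is placed first in the character class. This ensures it is treated as a literal hyphen rather than as a range operator.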
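As a rough check of the username part, the sketch below tests a few candidate strings against it. The candidates and variable names are made up for illustration, and the \A and \z anchors are added only so that each whole string is tested.

```perl
use strict;
use warnings;

# Username pattern from the text: one word character, then any run of
# hyphens, dots, and word characters. The leading "-" in the class is literal.
my $username = qr/\w[-.\w]*/;

for my $candidate ('mary-ann', 'j.f_2', '-dash') {
    my $verdict = $candidate =~ /\A$username\z/ ? 'ok' : 'rejected';
    print "$candidate: $verdict\n";
}
# prints:
# mary-ann: ok
# j.f_2: ok
# -dash: rejected
```

The last candidate is rejected because the pattern requires a word character first, and a hyphen is not one.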
2.2.6.5 Putting it together
The complete program:
#! /usr/bin/perl -w
# Mastering Regular Expressions: Chapter 2 Section 2.
# fifth program
undef $/;
$textStr = <>;
$textStr =~ s/&/&amp;/g;
$textStr =~ s/</&lt;/g;
$textStr =~ s/>/&gt;/g;
$textStr =~ s/^\s*$/<p>/mg;
# Turn email addresses into links ...
$textStr =~ s{
\b
# Capture the address to $1 ...
(
\w[-.\w]* # username
\@
[-a-z0-9]+(\.[-a-z0-9]+)*\.(com|edu|info) # hostname
)
\b
}{<a href="mailto:$1">$1<\/a>}gix;
print $textStr;
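The "x" modifier lets you write the regular expression in a free-form layout: whitespace in the pattern is ignored, and # starts a comment.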
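To see the email substitution on its own, it can be wrapped in a small function and run on one sentence. The name linkize_email and the sample address are assumptions for illustration. Note that /x relaxes the layout of the pattern only, so the replacement part is kept on a single line.

```perl
use strict;
use warnings;

# Hypothetical helper wrapping the email substitution from the program above.
sub linkize_email {
    my ($text) = @_;
    $text =~ s{
        \b
        (                                               # capture the whole address
            \w[-.\w]*                                   # username
            \@
            [-a-z0-9]+(\.[-a-z0-9]+)*\.(com|edu|info)   # hostname
        )
        \b
    }{<a href="mailto:$1">$1</a>}gix;
    return $text;
}

print linkize_email('Contact jane.doe@example.com today.'), "\n";
# prints: Contact <a href="mailto:jane.doe@example.com">jane.doe@example.com</a> today.
```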
2.2.6.6 “Linkizing” an HTTP URL
An HTTP URL has the basic form http://hostname/path, where the /path part is optional. That gives a substitution of the following shape:
s{
\b
(
http://hostname
(
/path
)?
)
}{<a href="$1">$1</a>}gix
The program with URL linking added:
#! /usr/bin/perl -w
# Mastering Regular Expressions: Chapter 2 Section 2.
# fifth program
undef $/; # Enter "file-slurp" mode
$textStr = <>; # Slurp up the first file given on the command line.
$textStr =~ s/&/&amp;/g; # Make the basic HTML ...
$textStr =~ s/</&lt;/g; # ... characters &, <, and > ...
$textStr =~ s/>/&gt;/g; # ... HTML safe.
$textStr =~ s/^\s*$/<p>/mg; # Separate paragraphs.
# Turn email addresses into links ...
$textStr =~ s{
\b
# Capture the address to $1 ...
(
\w[-.\w]* # username
\@
[-a-z0-9]+(\.[-a-z0-9]+)*\.(com|edu|info) # hostname
)
\b
}{<a href="mailto:$1">$1</a>}gix;
# Turn HTTP URLs into links ...
$textStr =~ s{
\b
# Capture the URL to $1 ...
(
http://[-a-z0-9]+(\.[-a-z0-9]+)*\.(com|edu|info)\b # hostname
(
/[-a-z0-9_:\@&?=+,.!/~*'%\$]* # Optional path
(?<![.,?!]) # Not allowed to end with [.,?!]
)?
)
}{<a href="$1">$1</a>}gix;
print $textStr; # Finally, display the HTML-ized text.
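The negative lookbehind at the end of the path is what keeps sentence punctuation out of the link. A minimal sketch of that behavior follows; the name linkize_url and the sample URLs are assumptions for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper wrapping the URL substitution from the program above.
sub linkize_url {
    my ($text) = @_;
    $text =~ s{
        \b
        (
            http://[-a-z0-9]+(\.[-a-z0-9]+)*\.(com|edu|info)\b   # hostname
            (
                /[-a-z0-9_:\@&?=+,.!/~*'%\$]*                    # optional path
                (?<![.,?!])                                      # must not end with . , ? or !
            )?
        )
    }{<a href="$1">$1</a>}gix;
    return $text;
}

print linkize_url('See http://www.example.com/docs/index.html.'), "\n";
# prints: See <a href="http://www.example.com/docs/index.html">http://www.example.com/docs/index.html</a>.
```

The path character class would happily consume the final period, but the lookbehind forces the match to back off by one character, so the period stays outside the link.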