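Our parser now moves on to a real test. First question: can it parse the ABNF grammar itself?
Start by copying the fragment of RFC 2234 that defines ABNF's own grammar and saving it to a file, for example RFC2234.Demo.1.txt: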
ALPHA = %x41-5A / %x61-7A ; A-Z / a-z
BIT = "0" / "1"
CHAR = %x01-7F
; any 7-bit US-ASCII character,
excluding NUL
CR = %x0D
; carriage return
CRLF = CR LF
; Internet standard newline
CTL = %x00-1F / %x7F
; controls
DIGIT = %x30-39
; 0-9
DQUOTE = %x22
; " (Double Quote)
HEXDIG = DIGIT / "A" / "B" / "C" / "D" / "E" / "F"
HTAB = %x09
; horizontal tab
LF = %x0A
; linefeed
LWSP = *(WSP / CRLF WSP)
; linear white space (past newline)
OCTET = %x00-FF
; 8 bits of data
SP = %x20
rulelist = 1*( rule / (*c-wsp c-nl) )
rule = rulename defined-as elements c-nl
; continues if next line starts
; with white space
rulename = ALPHA *(ALPHA / DIGIT / "-")
defined-as = *c-wsp ("=" / "=/") *c-wsp
; basic rules definition and
; incremental alternatives
elements = alternation *c-wsp
c-wsp = WSP / (c-nl WSP)
c-nl = comment / CRLF
; comment or newline
comment = ";" *(WSP / VCHAR) CRLF
alternation = concatenation
*(*c-wsp "/" *c-wsp concatenation)
concatenation = repetition *(1*c-wsp repetition)
repetition = [repeat] element
repeat = 1*DIGIT / (*DIGIT "*" *DIGIT)
element = rulename / group / option /
char-val / num-val / prose-val
group = "(" *c-wsp alternation *c-wsp ")"
option = "[" *c-wsp alternation *c-wsp "]"
char-val = DQUOTE *(%x20-21 / %x23-7E) DQUOTE
; quoted string of SP and VCHAR
without DQUOTE
num-val = "%" (bin-val / dec-val / hex-val)
bin-val = "b" 1*BIT
[ 1*("." 1*BIT) / ("-" 1*BIT) ]
; series of concatenated bit values
; or single ONEOF range
dec-val = "d" 1*DIGIT
[ 1*("." 1*DIGIT) / ("-" 1*DIGIT) ]
hex-val = "x" 1*HEXDIG
[ 1*("." 1*HEXDIG) / ("-" 1*HEXDIG) ]
prose-val = "<" *(%x20-3D / %x3F-7E) ">"
; bracketed string of SP and VCHAR
without angles
; prose description, to be used as
last resort
; space
VCHAR = %x21-7E
; visible (printing) characters
WSP = SP / HTAB
; white space
Sure enough, wishful thinking does not pay off:
$ java JavaParser < RFC2234.Demo.1.txt
Exception in thread "main" Input stream does not match with 'A' [41] at position 9:1. Expected value is [';', 0x0D]
at AbnfParser.c_nl(AbnfParser.java:248)
at AbnfParser.rulelist(AbnfParser.java:129)
at AbnfParser.parse(AbnfParser.java:559)
at JavaParser.main(JavaParser.java:81)
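According to the ABNF rule rulelist = 1*( rule / (*c-wsp c-nl) ), a rule name must not be preceded by whitespace, so this is not a defect in the program.
Step 2, therefore, is to delete the whitespace in front of every rule, then run again: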
$ java JavaParser < RFC2234.Demo.2.txt
Exception in thread "main" Input stream does not match with '/' [2F] at position 27:1. Expected value is ['(', '[', 0x22, '%', '<']
at AbnfParser.element(AbnfParser.java:364)
at AbnfParser.repetition(AbnfParser.java:315)
at AbnfParser.concatenation(AbnfParser.java:303)
at AbnfParser.alternation(AbnfParser.java:280)
at AbnfParser.elements(AbnfParser.java:181)
at AbnfParser.rule(AbnfParser.java:139)
at AbnfParser.rulelist(AbnfParser.java:113)
at AbnfParser.parse(AbnfParser.java:559)
at JavaParser.main(JavaParser.java:81)
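The cause lies in the concatenation() algorithm: if a repetition is followed by a space, the parser assumes that another repetition comes next. Here, however, the space is followed by "/", which introduces the next concatenation of the enclosing alternation. The parser cannot backtrack at this point, so it fails.
Step 3 is to remove the spaces on both sides of every / inside an alternation, that is, change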
ALPHA = %x41-5A / %x61-7A ; A-Z / a-z
to
ALPHA = %x41-5A/%x61-7A ; A-Z / a-z
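Editing every rule by hand is tedious, so this rewrite could be scripted. The following is only a minimal sketch and is not part of the original parser: the class SlashSpaceStripper and its methods are hypothetical names. It assumes that a / inside a double-quoted char-val, or anywhere after the ; that starts a comment, must be left untouched, and it does not treat prose-val (<...>) specially.

```java
class SlashSpaceStripper {
    // WSP as defined by ABNF: space or horizontal tab.
    static boolean isWsp(char c) {
        return c == ' ' || c == '\t';
    }

    /** Removes whitespace on both sides of every "/" outside quotes and comments. */
    static String strip(String line) {
        StringBuilder out = new StringBuilder();
        boolean inQuote = false;
        int i = 0;
        while (i < line.length()) {
            char c = line.charAt(i);
            if (inQuote) {
                // Inside a char-val: copy verbatim until the closing quote.
                out.append(c);
                if (c == '"') inQuote = false;
                i++;
            } else if (c == '"') {
                inQuote = true;
                out.append(c);
                i++;
            } else if (c == ';') {
                // A comment runs to the end of the line: copy it unchanged.
                out.append(line, i, line.length());
                break;
            } else if (c == '/') {
                // Drop whitespace already emitted before the slash ...
                int end = out.length();
                while (end > 0 && isWsp(out.charAt(end - 1))) end--;
                out.setLength(end);
                out.append('/');
                i++;
                // ... and skip whitespace that follows it.
                while (i < line.length() && isWsp(line.charAt(i))) i++;
            } else {
                out.append(c);
                i++;
            }
        }
        return out.toString();
    }
}
```

Calling SlashSpaceStripper.strip on the ALPHA rule above yields the rewritten form, while a rule such as alternation, whose "/" is quoted, passes through unchanged.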
With the spaces around / removed, run again:
$ java org.sip4x.generator.JavaParser < RFC2234.Demo.3.txt
Exception in thread "main" Input stream does not match with '
' [0D] at position 46:1. Expected value is '
at org.sip4x.abnf.AbnfParser.assertMatch(AbnfParser.java:59)
at org.sip4x.abnf.AbnfParser.CR(AbnfParser.java:224)
at org.sip4x.abnf.AbnfParser.CRLF(AbnfParser.java:231)
at org.sip4x.abnf.AbnfParser.comment(AbnfParser.java:272)
at org.sip4x.abnf.AbnfParser.c_nl(AbnfParser.java:246)
at org.sip4x.abnf.AbnfParser.c_wsp(AbnfParser.java:238)
at org.sip4x.abnf.AbnfParser.concatenation(AbnfParser.java:301)
at org.sip4x.abnf.AbnfParser.alternation(AbnfParser.java:290)
at org.sip4x.abnf.AbnfParser.elements(AbnfParser.java:181)
at org.sip4x.abnf.AbnfParser.rule(AbnfParser.java:139)
at org.sip4x.abnf.AbnfParser.rulelist(AbnfParser.java:113)
at org.sip4x.abnf.AbnfParser.parse(AbnfParser.java:559)
at org.sip4x.generator.JavaParser.main(JavaParser.java:85)
Inspecting the file in UltraEdit shows that every line ends with 0A only (no 0D). After using UltraEdit to replace each 0A with 0D 0A, run again:
$ java org.sip4x.generator.JavaParser < RFC2234.Demo.3.txt
' [0D] at position 1:2. Expected value is [0x20, 0x09] with '
at org.sip4x.abnf.AbnfParser.WSP(AbnfParser.java:218)
at org.sip4x.abnf.AbnfParser.c_wsp(AbnfParser.java:238)
at org.sip4x.abnf.AbnfParser.concatenation(AbnfParser.java:301)
at org.sip4x.abnf.AbnfParser.alternation(AbnfParser.java:290)
at org.sip4x.abnf.AbnfParser.elements(AbnfParser.java:181)
at org.sip4x.abnf.AbnfParser.rule(AbnfParser.java:139)
at org.sip4x.abnf.AbnfParser.rulelist(AbnfParser.java:113)
at org.sip4x.abnf.AbnfParser.parse(AbnfParser.java:559)
at org.sip4x.generator.JavaParser.main(JavaParser.java:85)
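As an aside, the line-ending conversion done in UltraEdit before this run could also be scripted. The sketch below is an assumption-laden illustration rather than part of the original tool chain: CrlfNormalizer is a hypothetical class, the input and output paths come from the command line, and the file is assumed to be plain US-ASCII, as the RFC text is.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

class CrlfNormalizer {
    /** Rewrites every line ending (CRLF, lone CR, or lone LF) as CRLF. */
    static String toCrlf(String text) {
        // The alternation tries CRLF first, so existing pairs are not doubled.
        return text.replaceAll("\r\n|\r|\n", "\r\n");
    }

    // Usage (hypothetical): java CrlfNormalizer input.txt output.txt
    public static void main(String[] args) throws IOException {
        Path in = Paths.get(args[0]);
        Path out = Paths.get(args[1]);
        String text = new String(Files.readAllBytes(in), StandardCharsets.US_ASCII);
        Files.write(out, toCrlf(text).getBytes(StandardCharsets.US_ASCII));
    }
}
```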
First, look at the ABNF definitions:
concatenation = repetition *(1*c-wsp repetition)
alternation = concatenation *(*c-wsp "/" *c-wsp concatenation)
elements = alternation *c-wsp
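The cause is again in the concatenation method: once a repetition is followed by c-wsp, the method assumes that another repetition must come after that c-wsp. In fact, if no repetition follows the c-wsp, the parser should backtrack to alternation(); if alternation() then finds that the c-wsp after the concatenation is not followed by "/", it should backtrack further, to the elements level. In this example the c-wsp belongs to elements. Because this algorithm cannot handle such c-wsp, the only option is to remove the trailing whitespace and the comment from the end of every rule.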
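The backtracking described above can be sketched as follows. This is not the AbnfParser from this series: BacktrackingSketch and all of its methods are hypothetical, and the grammar is deliberately reduced so that an element is just a rule name and whitespace is only SP or HTAB. The point is the save-and-restore of the input position: each level consumes whitespace tentatively and gives it back if what follows is not what that level expects.

```java
import java.util.ArrayList;
import java.util.List;

class BacktrackingSketch {
    private final String src;
    private int pos = 0;

    BacktrackingSketch(String src) { this.src = src; }

    private boolean atWsp() {
        return pos < src.length() && (src.charAt(pos) == ' ' || src.charAt(pos) == '\t');
    }

    private void skipWsp() { while (atWsp()) pos++; }

    private boolean atElementStart() {
        return pos < src.length() && Character.isLetter(src.charAt(pos));
    }

    // element = rulename (simplified: letters, digits, "-")
    private String element() {
        int start = pos;
        while (pos < src.length()
                && (Character.isLetterOrDigit(src.charAt(pos)) || src.charAt(pos) == '-')) pos++;
        if (start == pos) throw new IllegalStateException("element expected at " + pos);
        return src.substring(start, pos);
    }

    // concatenation = element *(1*WSP element)
    private List<String> concatenation() {
        List<String> items = new ArrayList<>();
        items.add(element());
        while (atWsp()) {
            int mark = pos;          // remember where the whitespace began
            skipWsp();
            if (!atElementStart()) { // whitespace belongs to an outer rule
                pos = mark;          // backtrack: give the whitespace back
                break;
            }
            items.add(element());
        }
        return items;
    }

    // alternation = concatenation *(*WSP "/" *WSP concatenation)
    private List<List<String>> alternation() {
        List<List<String>> alts = new ArrayList<>();
        alts.add(concatenation());
        while (true) {
            int mark = pos;
            skipWsp();
            if (pos < src.length() && src.charAt(pos) == '/') {
                pos++;
                skipWsp();
                alts.add(concatenation());
            } else {
                pos = mark;          // not a "/": leave whitespace for elements
                break;
            }
        }
        return alts;
    }

    // elements = alternation *WSP
    static List<List<String>> parseElements(String text) {
        BacktrackingSketch p = new BacktrackingSketch(text);
        List<List<String>> result = p.alternation();
        p.skipWsp();                 // trailing whitespace is consumed here
        if (p.pos != text.length()) throw new IllegalStateException("unexpected input at " + p.pos);
        return result;
    }
}
```

With this structure, an input such as "CR LF / WSP " (note the spaces around the slash and the trailing space) parses into two alternatives without any manual editing of the grammar text. A full parser would need the same treatment for c-nl and comments, which this sketch leaves out.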
Run again:
$ java org.sip4x.generator.JavaParser < RFC2234.Demo.4.txt
Exception in thread "main" Input stream does not match with 'B' [42] at position 1:3. Expected value is [0x20, 0x09]
at org.sip4x.abnf.AbnfParser.WSP(AbnfParser.java:218)
at org.sip4x.abnf.AbnfParser.c_wsp(AbnfParser.java:238)
at org.sip4x.abnf.AbnfParser.rulelist(AbnfParser.java:127)
at org.sip4x.abnf.AbnfParser.parse(AbnfParser.java:559)
at org.sip4x.generator.JavaParser.main(JavaParser.java:85)
For the same reason, the c_wsp here cannot be handled either, so comment out this part of AbnfParser:
// while (match(is.peek(), 0x20) || match(is.peek(), ';') || match(is.peek(), 0x0D)) {
// c_wsp();
// }
Run again:
$ java org.sip4x.generator.JavaParser < RFC2234.Demo.4.txt
Exception in thread "main" Input stream does not match with ')' [29] at position 42:29. Expected value is ['(', '[', 0x22, '%', '<']
at org.sip4x.abnf.AbnfParser.element(AbnfParser.java:364)
at org.sip4x.abnf.AbnfParser.repetition(AbnfParser.java:315)
at org.sip4x.abnf.AbnfParser.concatenation(AbnfParser.java:303)
at org.sip4x.abnf.AbnfParser.alternation(AbnfParser.java:290)
at org.sip4x.abnf.AbnfParser.group(AbnfParser.java:375)
at org.sip4x.abnf.AbnfParser.element(AbnfParser.java:359)
at org.sip4x.abnf.AbnfParser.repetition(AbnfParser.java:315)
at org.sip4x.abnf.AbnfParser.concatenation(AbnfParser.java:298)
at org.sip4x.abnf.AbnfParser.alternation(AbnfParser.java:280)
at org.sip4x.abnf.AbnfParser.elements(AbnfParser.java:181)
at org.sip4x.abnf.AbnfParser.rule(AbnfParser.java:139)
at org.sip4x.abnf.AbnfParser.rulelist(AbnfParser.java:113)
at org.sip4x.abnf.AbnfParser.parse(AbnfParser.java:559)
at org.sip4x.generator.JavaParser.main(JavaParser.java:85)
This is still the same problem: in the concatenation method, if a repetition is followed by a space, concatenation insists that another repetition come after it. The only fix is to remove that space.
$ java org.sip4x.generator.JavaParser < RFC2234.Demo.4.txt
Exception in thread "main" Input stream does not match with ' ' [20] at position 1:46. Expected value is [';', 0x0D]
at org.sip4x.abnf.AbnfParser.c_nl(AbnfParser.java:248)
at org.sip4x.abnf.AbnfParser.rulelist(AbnfParser.java:129)
at org.sip4x.abnf.AbnfParser.parse(AbnfParser.java:559)
at org.sip4x.generator.JavaParser.main(JavaParser.java:85)
The cause this time is that the rule is defined across two lines:
alternation = concatenation
*(*c-wsp "/" *c-wsp concatenation)
This line break is the same problem as the c-wsp case, so rewrite every rule definition so that it fits on a single line.
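The manual clean-up so far (dropping comments and trailing whitespace, then joining multi-line rules) could be approximated in code. The following is a rough sketch under stated assumptions, not the procedure actually used here: ContinuationFolder is a hypothetical helper, it assumes that rule lines start in column 0 while continuation lines still begin with a space or tab, and it treats ; as the start of a comment only outside double quotes.

```java
import java.util.ArrayList;
import java.util.List;

class ContinuationFolder {
    /** Cuts a line at the first ";" that is not inside a quoted string. */
    static String stripComment(String line) {
        boolean inQuote = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c == '"') {
                inQuote = !inQuote;
            } else if (c == ';' && !inQuote) {
                return line.substring(0, i);
            }
        }
        return line;
    }

    /** Joins indented continuation lines onto the rule that precedes them. */
    static List<String> fold(List<String> lines) {
        List<String> rules = new ArrayList<>();
        for (String raw : lines) {
            String body = stripComment(raw);
            boolean continuation = !body.isEmpty()
                    && (body.charAt(0) == ' ' || body.charAt(0) == '\t');
            String text = body.trim();   // also drops trailing whitespace and CR
            if (text.isEmpty()) continue; // blank or comment-only line
            if (continuation && !rules.isEmpty()) {
                int last = rules.size() - 1;
                rules.set(last, rules.get(last) + " " + text);
            } else {
                rules.add(text);
            }
        }
        return rules;
    }
}
```

Each string in the returned list is one complete rule on one line; the caller would still have to write the rules back out with CRLF line endings.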
$ java org.sip4x.generator.JavaParser < RFC2234.Demo.5.txt
Exception in thread "main" Input stream does not match with ']' [5D] at position 57:63. Expected value is ['(', '[', 0x22, '%', '<']
at org.sip4x.abnf.AbnfParser.element(AbnfParser.java:364)
at org.sip4x.abnf.AbnfParser.repetition(AbnfParser.java:315)
at org.sip4x.abnf.AbnfParser.concatenation(AbnfParser.java:303)
at org.sip4x.abnf.AbnfParser.alternation(AbnfParser.java:290)
at org.sip4x.abnf.AbnfParser.option(AbnfParser.java:391)
at org.sip4x.abnf.AbnfParser.element(AbnfParser.java:360)
at org.sip4x.abnf.AbnfParser.repetition(AbnfParser.java:315)
at org.sip4x.abnf.AbnfParser.concatenation(AbnfParser.java:303)
at org.sip4x.abnf.AbnfParser.alternation(AbnfParser.java:280)
at org.sip4x.abnf.AbnfParser.elements(AbnfParser.java:181)
at org.sip4x.abnf.AbnfParser.rule(AbnfParser.java:139)
at org.sip4x.abnf.AbnfParser.rulelist(AbnfParser.java:113)
at org.sip4x.abnf.AbnfParser.parse(AbnfParser.java:559)
at org.sip4x.generator.JavaParser.main(JavaParser.java:85)
This time the failure is here:
bin-val = "b" 1*BIT [ 1*("." 1*BIT)/("-" 1*BIT) ]
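The space after [ and the space before ] have to be removed. With that, the run finally succeeds. This is the text that parses successfully: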
ALPHA = %x41-5A/%x61-7A
BIT = "0"/"1"
CHAR = %x01-7F
CR = %x0D
CRLF = CR LF
CTL = %x00-1F/%x7F
DIGIT = %x30-39
DQUOTE = %x22
HEXDIG = DIGIT/"A"/"B"/"C"/"D"/"E"/"F"
HTAB = %x09
LF = %x0A
LWSP = *(WSP/CRLF WSP)
OCTET = %x00-FF
SP = %x20
rulelist = 1*(rule/(*c-wsp c-nl))
rule = rulename defined-as elements c-nl
rulename = ALPHA *(ALPHA/DIGIT/"-")
defined-as = *c-wsp ("="/"=/") *c-wsp
elements = alternation *c-wsp
c-wsp = WSP/(c-nl WSP)
c-nl = comment/CRLF
comment = ";" *(WSP/VCHAR) CRLF
alternation = concatenation *(*c-wsp "/" *c-wsp concatenation)
concatenation = repetition *(1*c-wsp repetition)
repetition = [repeat] element
repeat = 1*DIGIT/(*DIGIT "*" *DIGIT)
element = rulename/group/option/char-val/num-val/prose-val
group = "(" *c-wsp alternation *c-wsp ")"
option = "[" *c-wsp alternation *c-wsp "]"
char-val = DQUOTE *(%x20-21/%x23-7E) DQUOTE
num-val = "%" (bin-val/dec-val/hex-val)
bin-val = "b" 1*BIT [1*("." 1*BIT)/("-" 1*BIT)]
dec-val = "d" 1*DIGIT [1*("." 1*DIGIT)/("-" 1*DIGIT)]
hex-val = "x" 1*HEXDIG [1*("." 1*HEXDIG)/("-" 1*HEXDIG)]
prose-val = "<" *(%x20-3D/%x3F-7E) ">"
VCHAR = %x21-7E
WSP = SP/HTAB
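This series of articles pauses here for now. With the ABNF grammar successfully read into memory and represented there in ABNF's own grammatical structure, the next step is to convert the ABNF grammar into an NFA, that is, a nondeterministic finite automaton.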