An ABNF Parser Based on Predictive Parsing (14): Parsing the RFC 2234 Grammar in Practice

Our parser is now ready for a real test. The first question: can it parse the ABNF grammar itself?

First, copy the part of RFC 2234 that defines ABNF in ABNF itself and save it to a file, for example RFC2234.Demo.1.txt:

        ALPHA          =  %x41-5A / %x61-7A   ; A-Z / a-z

        BIT            =  "0" / "1"

        CHAR           =  %x01-7F
                               ; any 7-bit US-ASCII character,
                                  excluding NUL

        CR             =  %x0D
                               ; carriage return

        CRLF           =  CR LF
                               ; Internet standard newline

        CTL            =  %x00-1F / %x7F
                               ; controls

        DIGIT          =  %x30-39
                               ; 0-9

        DQUOTE         =  %x22
                               ; " (Double Quote)

        HEXDIG         =  DIGIT / "A" / "B" / "C" / "D" / "E" / "F"

        HTAB           =  %x09
                               ; horizontal tab

        LF             =  %x0A
                               ; linefeed

        LWSP           =  *(WSP / CRLF WSP)
                               ; linear white space (past newline)

        OCTET          =  %x00-FF
                               ; 8 bits of data

        SP             =  %x20

        rulelist       =  1*( rule / (*c-wsp c-nl) )

        rule           =  rulename defined-as elements c-nl
                               ; continues if next line starts
                               ;  with white space

        rulename       =  ALPHA *(ALPHA / DIGIT / "-")

        defined-as     =  *c-wsp ("=" / "=/") *c-wsp
                               ; basic rules definition and
                               ;  incremental alternatives

        elements       =  alternation *c-wsp

        c-wsp          =  WSP / (c-nl WSP)

        c-nl           =  comment / CRLF
                               ; comment or newline

        comment        =  ";" *(WSP / VCHAR) CRLF

        alternation    =  concatenation
                          *(*c-wsp "/" *c-wsp concatenation)

        concatenation  =  repetition *(1*c-wsp repetition)

        repetition     =  [repeat] element

        repeat         =  1*DIGIT / (*DIGIT "*" *DIGIT)

        element        =  rulename / group / option /
                          char-val / num-val / prose-val

        group          =  "(" *c-wsp alternation *c-wsp ")"

        option         =  "[" *c-wsp alternation *c-wsp "]"

        char-val       =  DQUOTE *(%x20-21 / %x23-7E) DQUOTE
                               ; quoted string of SP and VCHAR
                                  without DQUOTE

        num-val        =  "%" (bin-val / dec-val / hex-val)

        bin-val        =  "b" 1*BIT
                          [ 1*("." 1*BIT) / ("-" 1*BIT) ]
                               ; series of concatenated bit values
                               ; or single ONEOF range

        dec-val        =  "d" 1*DIGIT
                          [ 1*("." 1*DIGIT) / ("-" 1*DIGIT) ]

        hex-val        =  "x" 1*HEXDIG
                          [ 1*("." 1*HEXDIG) / ("-" 1*HEXDIG) ]

        prose-val      =  "<" *(%x20-3D / %x3F-7E) ">"
                               ; bracketed string of SP and VCHAR
                                  without angles
                               ; prose description, to be used as
                                  last resort

                               ; space

        VCHAR          =  %x21-7E
                               ; visible (printing) characters

        WSP            =  SP / HTAB
                               ; white space

As expected, it does not simply work on the first try:

$ java JavaParser < RFC2234.Demo.1.txt 
Exception in thread "main" Input stream does not match with 'A' [41] at position 9:1. Expected value is [';', 0x0D]
at AbnfParser.c_nl(AbnfParser.java:248)
at AbnfParser.rulelist(AbnfParser.java:129)
at AbnfParser.parse(AbnfParser.java:559)
at JavaParser.main(JavaParser.java:81)

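According to the ABNF rule rulelist = 1*( rule / (*c-wsp c-nl) ), a rule name must not be preceded by white space. A line that begins with white space can only match the (*c-wsp c-nl) branch, which is why the parser expects ';' or 0x0D when it reaches 'A'. So this is not a defect in the program.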

So step 2 is to delete the white space in front of every rule and run the parser again.
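This step can also be done by a small program instead of by hand. The following is only a sketch under my own assumptions: the class name MarginStripper and the method stripCommonMargin are hypothetical and are not part of AbnfParser. It removes the indentation shared by all non-blank lines, so rule names move to column 1 while continuation lines keep their relative indentation.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the original AbnfParser sources.
final class MarginStripper {

    // Removes the indentation that all non-blank lines have in common.
    static List<String> stripCommonMargin(List<String> lines) {
        int margin = Integer.MAX_VALUE;
        for (String line : lines) {
            if (line.trim().isEmpty()) {
                continue; // blank lines do not influence the margin
            }
            int indent = 0;
            while (indent < line.length()
                    && (line.charAt(indent) == ' ' || line.charAt(indent) == '\t')) {
                indent++;
            }
            margin = Math.min(margin, indent);
        }
        List<String> result = new ArrayList<>();
        for (String line : lines) {
            // Every non-blank line has at least 'margin' leading blanks.
            result.add(line.trim().isEmpty() ? "" : line.substring(margin));
        }
        return result;
    }
}
```

Applied to the lines of RFC2234.Demo.1.txt, this should give roughly the content of RFC2234.Demo.2.txt, provided the pasted text is indented consistently.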

$ java JavaParser < RFC2234.Demo.2.txt 

Exception in thread "main" Input stream does not match with '/' [2F] at position 27:1. Expected value is ['(', '[', 0x22, '%', '<']
at AbnfParser.element(AbnfParser.java:364)
at AbnfParser.repetition(AbnfParser.java:315)
at AbnfParser.concatenation(AbnfParser.java:303)
at AbnfParser.alternation(AbnfParser.java:280)
at AbnfParser.elements(AbnfParser.java:181)
at AbnfParser.rule(AbnfParser.java:139)
at AbnfParser.rulelist(AbnfParser.java:113)
at AbnfParser.parse(AbnfParser.java:559)
at JavaParser.main(JavaParser.java:81)

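It turns out that concatenation() works like this: when a repetition is followed by white space, it assumes that another repetition comes next. Here, however, the white space is followed by "/", which introduces another concatenation and belongs to the enclosing alternation. The parser cannot backtrack at this point, so it fails.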
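To make the failure concrete, here is a stripped-down sketch of the behaviour just described. It is not the real AbnfParser code: the class GreedyConcatenationDemo and its accepts() helper are hypothetical, repetition is reduced to a bare rule name, and c-wsp is reduced to plain spaces.

```java
// Toy model of a concatenation() that commits as soon as it sees a space.
// All names here are illustrative assumptions, not the original sources.
final class GreedyConcatenationDemo {
    private final String input;
    private int pos = 0;

    GreedyConcatenationDemo(String input) {
        this.input = input;
    }

    private int peek() {
        return pos < input.length() ? input.charAt(pos) : -1;
    }

    private static boolean isAlpha(int c) {
        return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z');
    }

    private static boolean isDigit(int c) {
        return c >= '0' && c <= '9';
    }

    // repetition, reduced to rulename = ALPHA *(ALPHA / DIGIT / "-")
    private void repetition() {
        if (!isAlpha(peek())) {
            throw new IllegalStateException("expected a rule name at offset " + pos);
        }
        while (isAlpha(peek()) || isDigit(peek()) || peek() == '-') {
            pos++;
        }
    }

    // concatenation = repetition *(1*c-wsp repetition), with c-wsp reduced to SP
    private void concatenation() {
        repetition();
        while (peek() == ' ') {
            while (peek() == ' ') {
                pos++;        // the spaces are consumed for good
            }
            repetition();     // committed: fails if "/" or end of input follows
        }
    }

    // True if concatenation() returns without raising an error.
    static boolean accepts(String text) {
        try {
            new GreedyConcatenationDemo(text).concatenation();
            return true;
        } catch (IllegalStateException e) {
            return false;
        }
    }
}
```

In this model accepts("CR LF") succeeds, while accepts("a / b") and accepts("CR ") both fail, because the space has already been consumed when the parser discovers that no repetition follows.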

Next, step 3 is to remove the white space on both sides of each "/" inside an alternation, that is, to change

ALPHA          =  %x41-5A / %x61-7A   ; A-Z / a-z
to
ALPHA          =  %x41-5A/%x61-7A   ; A-Z / a-z
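A plain search and replace is risky here, because "/" also occurs inside quoted strings such as "/" and "=/" and inside comments, and those must stay untouched. The sketch below is one way to do it safely. The class SlashSpaceRemover and the method tighten are hypothetical names of mine, and the code assumes that no prose-val (<...>) in the input contains a slash or a double quote.

```java
// Hypothetical helper: removes blanks around "/" outside quotes and comments.
final class SlashSpaceRemover {

    private static boolean isBlank(char c) {
        return c == ' ' || c == '\t';
    }

    static String tighten(String line) {
        StringBuilder out = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (inQuotes) {
                out.append(c);               // copy quoted text verbatim
                if (c == '"') {
                    inQuotes = false;
                }
            } else if (c == ';') {
                out.append(line, i, line.length()); // comment: copy the rest as is
                break;
            } else if (c == '"') {
                inQuotes = true;
                out.append(c);
            } else if (c == '/') {
                // Drop blanks already copied in front of the slash.
                while (out.length() > 0 && isBlank(out.charAt(out.length() - 1))) {
                    out.setLength(out.length() - 1);
                }
                out.append('/');
                // Skip blanks that follow the slash.
                while (i + 1 < line.length() && isBlank(line.charAt(i + 1))) {
                    i++;
                }
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }
}
```

Called on the ALPHA line above, tighten() produces exactly the second form, with the comment left unchanged.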

With the white space around "/" removed, run it again:

$ java org.sip4x.generator.JavaParser < RFC2234.Demo.3.txt 
Exception in thread "main" Input stream does not match with '\n' [0D] at position 46:1. Expected value is '\r'
at org.sip4x.abnf.AbnfParser.assertMatch(AbnfParser.java:59)
at org.sip4x.abnf.AbnfParser.CR(AbnfParser.java:224)
at org.sip4x.abnf.AbnfParser.CRLF(AbnfParser.java:231)
at org.sip4x.abnf.AbnfParser.comment(AbnfParser.java:272)
at org.sip4x.abnf.AbnfParser.c_nl(AbnfParser.java:246)
at org.sip4x.abnf.AbnfParser.c_wsp(AbnfParser.java:238)
at org.sip4x.abnf.AbnfParser.concatenation(AbnfParser.java:301)
at org.sip4x.abnf.AbnfParser.alternation(AbnfParser.java:290)
at org.sip4x.abnf.AbnfParser.elements(AbnfParser.java:181)
at org.sip4x.abnf.AbnfParser.rule(AbnfParser.java:139)
at org.sip4x.abnf.AbnfParser.rulelist(AbnfParser.java:113)
at org.sip4x.abnf.AbnfParser.parse(AbnfParser.java:559)
at org.sip4x.generator.JavaParser.main(JavaParser.java:85)

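A look in UltraEdit shows that every line ends with 0A only (no 0D), while the CRLF rule requires 0D 0A. I replaced every 0A with 0D 0A in UltraEdit and ran the parser again.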
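If no such editor is at hand, the same conversion takes only a few lines of Java. This is a minimal sketch; the class LineEndingNormalizer and the method toCrlf are hypothetical names, not part of the parser.

```java
// Hypothetical helper: converts every line ending to CR LF (0D 0A).
final class LineEndingNormalizer {

    static String toCrlf(String text) {
        // Collapse existing CR LF pairs to LF first so they are not doubled,
        // then expand every LF to CR LF.
        return text.replace("\r\n", "\n").replace("\n", "\r\n");
    }
}
```

The file content can be read with Files.readAllBytes, passed through toCrlf(), and written back with Files.write. Collapsing CR LF first makes the method safe to run on a file that has already been converted.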

$ java org.sip4x.generator.JavaParser < RFC2234.Demo.3.txt 
Exception in thread "main" Input stream does not match with '\r' [0D] at position 1:2. Expected value is [0x20, 0x09]
at org.sip4x.abnf.AbnfParser.WSP(AbnfParser.java:218)
at org.sip4x.abnf.AbnfParser.c_wsp(AbnfParser.java:238)
at org.sip4x.abnf.AbnfParser.concatenation(AbnfParser.java:301)
at org.sip4x.abnf.AbnfParser.alternation(AbnfParser.java:290)
at org.sip4x.abnf.AbnfParser.elements(AbnfParser.java:181)
at org.sip4x.abnf.AbnfParser.rule(AbnfParser.java:139)
at org.sip4x.abnf.AbnfParser.rulelist(AbnfParser.java:113)
at org.sip4x.abnf.AbnfParser.parse(AbnfParser.java:559)
at org.sip4x.generator.JavaParser.main(JavaParser.java:85)

Look at the ABNF definitions first:

concatenation  =  repetition *(1*c-wsp repetition)

alternation    =  concatenation *(*c-wsp "/" *c-wsp concatenation)

elements       =  alternation *c-wsp

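It turns out that in the concatenation() method, when a repetition is followed by c-wsp, the code assumes that another repetition must come after the c-wsp. If no repetition follows, the parser should really back up into alternation(); and if alternation() then finds that the c-wsp after the concatenation is not followed by "/", it should back up further to the elements level. In this example the c-wsp belongs to elements. The algorithm as written therefore cannot handle this c-wsp, and the only option is to remove the trailing white space and the comment from the end of every rule.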
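For comparison, here is a sketch of how the "hand the white space back" idea could look. It is not what this series implements: it looks ahead by more than one character, so it goes beyond strict one-symbol predictive parsing. The class LookaheadConcatenationDemo is hypothetical, repetition is again reduced to a bare rule name, and c-wsp to plain spaces.

```java
// Toy model: each level gives the white space back if it cannot use it.
// All names are illustrative assumptions, not the original AbnfParser code.
final class LookaheadConcatenationDemo {
    private final String input;
    private int pos = 0;

    LookaheadConcatenationDemo(String input) {
        this.input = input;
    }

    private int peek() {
        return pos < input.length() ? input.charAt(pos) : -1;
    }

    private static boolean isAlpha(int c) {
        return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z');
    }

    private void skipSpaces() {
        while (peek() == ' ') {
            pos++;
        }
    }

    private void repetition() {
        if (!isAlpha(peek())) {
            throw new IllegalStateException("expected a rule name at offset " + pos);
        }
        while (isAlpha(peek()) || (peek() >= '0' && peek() <= '9') || peek() == '-') {
            pos++;
        }
    }

    // concatenation = repetition *(1*c-wsp repetition)
    private void concatenation() {
        repetition();
        while (peek() == ' ') {
            int mark = pos;              // remember where the spaces start
            skipSpaces();
            if (!isAlpha(peek())) {
                pos = mark;              // no repetition: give the spaces back
                return;
            }
            repetition();
        }
    }

    // alternation = concatenation *(*c-wsp "/" *c-wsp concatenation)
    private void alternation() {
        concatenation();
        while (true) {
            int mark = pos;
            skipSpaces();
            if (peek() != '/') {
                pos = mark;              // spaces belong to the caller
                return;
            }
            pos++;                       // consume "/"
            skipSpaces();
            concatenation();
        }
    }

    // elements = alternation *c-wsp
    private void elements() {
        alternation();
        skipSpaces();
    }

    // True if the whole text is consumed as one "elements".
    static boolean accepts(String text) {
        LookaheadConcatenationDemo parser = new LookaheadConcatenationDemo(text);
        try {
            parser.elements();
            return parser.pos == text.length();
        } catch (IllegalStateException e) {
            return false;
        }
    }
}
```

In this model accepts("a / b") and accepts("CR LF ") both succeed, because the spaces that concatenation() cannot use are left for alternation() or elements(). The price is the saved position and the rewind, which the one-character peek() used so far does not provide.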

Run it again:

$ java org.sip4x.generator.JavaParser < RFC2234.Demo.4.txt 
Exception in thread "main" Input stream does not match with 'B' [42] at position 1:3. Expected value is [0x20, 0x09]
at org.sip4x.abnf.AbnfParser.WSP(AbnfParser.java:218)
at org.sip4x.abnf.AbnfParser.c_wsp(AbnfParser.java:238)
at org.sip4x.abnf.AbnfParser.rulelist(AbnfParser.java:127)
at org.sip4x.abnf.AbnfParser.parse(AbnfParser.java:559)
at org.sip4x.generator.JavaParser.main(JavaParser.java:85)

For the same reason, the c_wsp here cannot be handled either, so comment out this part of AbnfParser:

//                while (match(is.peek(), 0x20) || match(is.peek(), ';') || match(is.peek(), 0x0D)) {
//                    c_wsp();
//                }

Run it again:

$ java org.sip4x.generator.JavaParser < RFC2234.Demo.4.txt 
Exception in thread "main" Input stream does not match with ')' [29] at position 42:29. Expected value is ['(', '[', 0x22, '%', '<']
at org.sip4x.abnf.AbnfParser.element(AbnfParser.java:364)
at org.sip4x.abnf.AbnfParser.repetition(AbnfParser.java:315)
at org.sip4x.abnf.AbnfParser.concatenation(AbnfParser.java:303)
at org.sip4x.abnf.AbnfParser.alternation(AbnfParser.java:290)
at org.sip4x.abnf.AbnfParser.group(AbnfParser.java:375)
at org.sip4x.abnf.AbnfParser.element(AbnfParser.java:359)
at org.sip4x.abnf.AbnfParser.repetition(AbnfParser.java:315)
at org.sip4x.abnf.AbnfParser.concatenation(AbnfParser.java:298)
at org.sip4x.abnf.AbnfParser.alternation(AbnfParser.java:280)
at org.sip4x.abnf.AbnfParser.elements(AbnfParser.java:181)
at org.sip4x.abnf.AbnfParser.rule(AbnfParser.java:139)
at org.sip4x.abnf.AbnfParser.rulelist(AbnfParser.java:113)
at org.sip4x.abnf.AbnfParser.parse(AbnfParser.java:559)
at org.sip4x.generator.JavaParser.main(JavaParser.java:85)

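This problem again comes from concatenation(): when a repetition is followed by a space, concatenation() assumes that another repetition must follow. Here the space comes directly before ")", so the only fix is to remove that space and run again: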

$ java org.sip4x.generator.JavaParser < RFC2234.Demo.4.txt 
Exception in thread "main" Input stream does not match with ' ' [20] at position 1:46. Expected value is [';', 0x0D]
at org.sip4x.abnf.AbnfParser.c_nl(AbnfParser.java:248)
at org.sip4x.abnf.AbnfParser.rulelist(AbnfParser.java:129)
at org.sip4x.abnf.AbnfParser.parse(AbnfParser.java:559)
at org.sip4x.generator.JavaParser.main(JavaParser.java:85)

It turns out that this rule is defined across two lines:

alternation    =  concatenation
                  *(*c-wsp "/" *c-wsp concatenation)

This line break is the same problem as the c-wsp above, so rewrite every rule definition to fit on a single line.
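Joining the lines can be automated as well. The sketch below treats every non-blank line that starts with a space or tab as a continuation of the line before it. The class ContinuationJoiner and the method joinContinuations are hypothetical names, and the code assumes that comments have already been removed, since otherwise the joined text would end up inside the comment of the previous line.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: folds indented continuation lines into their rule.
final class ContinuationJoiner {

    static List<String> joinContinuations(List<String> lines) {
        List<String> result = new ArrayList<>();
        for (String line : lines) {
            boolean indented = !line.isEmpty()
                    && (line.charAt(0) == ' ' || line.charAt(0) == '\t');
            boolean blank = line.trim().isEmpty();
            int last = result.size() - 1;
            if (indented && !blank && last >= 0 && !result.get(last).isEmpty()) {
                // Append the continuation to the rule, separated by one space.
                result.set(last, result.get(last) + " " + line.trim());
            } else {
                result.add(line);
            }
        }
        return result;
    }
}
```

For the two-line alternation rule above, this yields the single line alternation    =  concatenation *(*c-wsp "/" *c-wsp concatenation).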

$ java org.sip4x.generator.JavaParser < RFC2234.Demo.5.txt 
Exception in thread "main" Input stream does not match with ']' [5D] at position 57:63. Expected value is ['(', '[', 0x22, '%', '<']
at org.sip4x.abnf.AbnfParser.element(AbnfParser.java:364)
at org.sip4x.abnf.AbnfParser.repetition(AbnfParser.java:315)
at org.sip4x.abnf.AbnfParser.concatenation(AbnfParser.java:303)
at org.sip4x.abnf.AbnfParser.alternation(AbnfParser.java:290)
at org.sip4x.abnf.AbnfParser.option(AbnfParser.java:391)
at org.sip4x.abnf.AbnfParser.element(AbnfParser.java:360)
at org.sip4x.abnf.AbnfParser.repetition(AbnfParser.java:315)
at org.sip4x.abnf.AbnfParser.concatenation(AbnfParser.java:303)
at org.sip4x.abnf.AbnfParser.alternation(AbnfParser.java:280)
at org.sip4x.abnf.AbnfParser.elements(AbnfParser.java:181)
at org.sip4x.abnf.AbnfParser.rule(AbnfParser.java:139)
at org.sip4x.abnf.AbnfParser.rulelist(AbnfParser.java:113)
at org.sip4x.abnf.AbnfParser.parse(AbnfParser.java:559)
at org.sip4x.generator.JavaParser.main(JavaParser.java:85)

This time it fails here:

bin-val        =  "b" 1*BIT [ 1*("." 1*BIT)/("-" 1*BIT) ]

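The space after "[" and the space before "]" have to be removed. With that change the run finally succeeds. Here is the text that parses successfully: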

ALPHA          =  %x41-5A/%x61-7A

BIT            =  "0"/"1"

CHAR           =  %x01-7F

CR             =  %x0D

CRLF           =  CR LF

CTL            =  %x00-1F/%x7F

DIGIT          =  %x30-39

DQUOTE         =  %x22

HEXDIG         =  DIGIT/"A"/"B"/"C"/"D"/"E"/"F"

HTAB           =  %x09

LF             =  %x0A

LWSP           =  *(WSP/CRLF WSP)

OCTET          =  %x00-FF

SP             =  %x20

rulelist       =  1*(rule/(*c-wsp c-nl))

rule           =  rulename defined-as elements c-nl

rulename       =  ALPHA *(ALPHA/DIGIT/"-")

defined-as     =  *c-wsp ("="/"=/") *c-wsp

elements       =  alternation *c-wsp

c-wsp          =  WSP/(c-nl WSP)

c-nl           =  comment/CRLF

comment        =  ";" *(WSP/VCHAR) CRLF

alternation    =  concatenation *(*c-wsp "/" *c-wsp concatenation)

concatenation  =  repetition *(1*c-wsp repetition)

repetition     =  [repeat] element

repeat         =  1*DIGIT/(*DIGIT "*" *DIGIT)

element        =  rulename/group/option/char-val/num-val/prose-val

group          =  "(" *c-wsp alternation *c-wsp ")"

option         =  "[" *c-wsp alternation *c-wsp "]"

char-val       =  DQUOTE *(%x20-21/%x23-7E) DQUOTE

num-val        =  "%" (bin-val/dec-val/hex-val)

bin-val        =  "b" 1*BIT [1*("." 1*BIT)/("-" 1*BIT)]

dec-val        =  "d" 1*DIGIT [1*("." 1*DIGIT)/("-" 1*DIGIT)]

hex-val        =  "x" 1*HEXDIG [1*("." 1*HEXDIG)/("-" 1*HEXDIG)]

prose-val      =  "<" *(%x20-3D/%x3F-7E) ">"

VCHAR          =  %x21-7E

WSP            =  SP/HTAB

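This brings the series to a pause for now. With the ABNF grammar successfully read into memory and represented there in its own grammatical structure, the next step is to convert the ABNF grammar into an NFA, that is, a nondeterministic finite automaton.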
