Lecture 17: Regular Expressions & Grammars

1 Grammars

  • A grammar defines a set of sentences, where each sentence is a sequence of symbols. For example, our grammar for URLs will specify the set of sentences that are legal URLs in the HTTP protocol.
  • The symbols in a sentence are called terminals (or tokens).
  • A grammar is described by a set of productions, where each production defines a nonterminal.
  • A production in a grammar has the form

    nonterminal ::= expression of terminals, nonterminals, and operators

  • Nonterminals are the internal nodes of the parse tree representing a sentence; terminals are its leaves.

1.1 Grammar Operators

  • concatenation
x ::= y z     an x is a y followed by a z
  • repetition
x ::= y*      an x is zero or more y
  • union (also called alternation)
x ::= y | z     an x is a y or a z
  • option (0 or 1 occurrence)
x ::=  y?      an x is a y or is the empty sentence
  • 1+ repetition (1 or more occurrences)
x ::= y+       an x is one or more y
               (equivalent to  x ::= y y* )
  • character classes
x ::= [abc]  is equivalent to  x ::= 'a' | 'b' | 'c' 
x ::= [^b]   is equivalent to  x ::= 'a' | 'c' | 'd' | 'e' | 'f'
                                         | ... (all other characters)
  • grouping using parentheses
x ::=  (y z | a b)*   an x is zero or more y-z or a-b pairs
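
The operators above carry over almost directly to Java regex syntax, so they can be tried out with a short sketch. The class name GrammarOperatorsDemo and the helper isSentence are hypothetical names chosen for illustration; Pattern.matches is the standard library call that tests whether an entire input matches a regex.

```java
import java.util.regex.Pattern;

// Hypothetical demo class: one check per grammar operator.
class GrammarOperatorsDemo {
    // True when the WHOLE input is a sentence described by the regex.
    static boolean isSentence(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        System.out.println(isSentence("ab", "ab"));           // concatenation: true
        System.out.println(isSentence("a*", ""));             // repetition, zero occurrences: true
        System.out.println(isSentence("a|b", "b"));           // union: true
        System.out.println(isSentence("ab?", "a"));           // option, b omitted: true
        System.out.println(isSentence("a+", ""));             // 1+ repetition needs at least one a: false
        System.out.println(isSentence("[abc]", "c"));         // character class: true
        System.out.println(isSentence("[^b]", "b"));          // negated class excludes b: false
        System.out.println(isSentence("(yz|ab)*", "yzabyz")); // grouping with repetition: true
    }
}
```

Note that the grammar notation quotes terminals ('a') and separates symbols with spaces, while a regex writes literal characters directly and treats a space as a character to match.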

2 Regular Expressions

A regular grammar has a special property: by substituting every nonterminal (except the root) with its right-hand side, you can reduce it to a single production for the root, with only terminals and operators on the right-hand side.
Our URL grammar was regular. By replacing nonterminals with their productions, it can be reduced to a single expression:

url ::= 'http://' ([a-z]+ '.')+ [a-z]+ (':' [0-9]+)? '/'
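
The substitution can be mimicked in Java by giving each nonterminal its own string constant and concatenating them. This is only a sketch: the nonterminals WORD, PORT, and HOSTNAME are assumed names for illustration, since the multi-production URL grammar is not reproduced in this section.

```java
import java.util.regex.Pattern;

// Hypothetical sketch: each constant plays the role of one nonterminal.
class UrlGrammarSubstitution {
    static final String WORD = "[a-z]+";                       // word ::= [a-z]+
    static final String PORT = "[0-9]+";                       // port ::= [0-9]+
    static final String HOSTNAME = "(" + WORD + "\\.)+" + WORD; // hostname ::= (word '.')+ word
    // Root production: every nonterminal has been replaced by its right-hand side.
    static final String URL = "http://" + HOSTNAME + "(:" + PORT + ")?/";

    // True when the whole string is a sentence of the url grammar.
    static boolean isUrl(String s) {
        return Pattern.matches(URL, s);
    }

    public static void main(String[] args) {
        System.out.println(URL);                           // the single reduced expression
        System.out.println(isUrl("http://mit.edu/"));      // true
        System.out.println(isUrl("http://mit.edu:8080/")); // true
        System.out.println(isUrl("http://mit/"));          // false: hostname needs at least one '.'
    }
}
```

String concatenation works here precisely because no nonterminal refers to itself; a recursive production could not be expanded this way.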

3 Using regular expressions in Java

In Java, you can use regexes for manipulating strings (see String.split, String.matches, java.util.regex.Pattern).

  • Replace all runs of spaces with a single space:
String singleSpacedString = string.replaceAll(" +", " ");
  • Match a URL:
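import java.util.regex.Matcher;
import java.util.regex.Pattern;

Pattern regex = Pattern.compile("http://([a-z]+\\.)+[a-z]+(:[0-9]+)?/");
Matcher m = regex.matcher(string);
if (m.matches()) {   // matches() succeeds only if the entire string matches
    // then string is a url
}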

Notice: to match a literal period, we first have to escape it as \. so that the regex engine does not interpret it as the match-any-character operator, and then we have to escape the backslash itself, writing \\. in the Java source, so that the compiler does not interpret the backslash as a string escape character.
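
The effect of the two levels of escaping can be checked with a small sketch. The class and method names below are hypothetical; String.matches, String.split, and Pattern.quote are standard library calls.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

// Hypothetical demo of escaping a literal period in a Java regex.
class EscapeDemo {
    // Unescaped '.' is the match-any-character operator.
    static boolean matchesUnescaped(String s) {
        return s.matches("a.c");
    }

    // "\\." in Java source is the two-character regex \. (a literal period).
    static boolean matchesEscaped(String s) {
        return s.matches("a\\.c");
    }

    // String.split takes a regex, so the period must be escaped here too.
    static String[] splitOnDots(String s) {
        return s.split("\\.");
    }

    // Pattern.quote builds a regex that matches its argument literally.
    static String[] splitOnDotsQuoted(String s) {
        return s.split(Pattern.quote("."));
    }

    public static void main(String[] args) {
        System.out.println(matchesUnescaped("abc"));                    // true: '.' matched 'b'
        System.out.println(matchesEscaped("abc"));                      // false: 'b' is not a period
        System.out.println(matchesEscaped("a.c"));                      // true
        System.out.println(Arrays.toString(splitOnDots("www.mit.edu"))); // [www, mit, edu]
    }
}
```

Pattern.quote is worth considering when the text to match comes from a variable, because it avoids escaping each special character by hand.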


Reference

[1] 6.005 — Software Construction on MIT OpenCourseWare | OCW 6.005 Homepage at https://ocw.mit.edu/ans7870/6/6.005/s16/
