1 Grammars
- A grammar defines a set of sentences, where each sentence is a sequence of symbols. For example, our grammar for URLs will specify the set of sentences that are legal URLs in the HTTP protocol.
- The symbols in a sentence are called terminals (or tokens).
- A grammar is described by a set of productions, where each production defines a nonterminal.
A production in a grammar has the form
nonterminal ::= expression of terminals, nonterminals, and operators
In the tree that represents a sentence, nonterminals are the internal nodes and terminals are the leaves.
1.1 Grammar Operators
- concatenation
    x ::= y z            an x is a y followed by a z
- repetition
    x ::= y*             an x is zero or more y
- union (also called alternation)
    x ::= y | z          an x is a y or a z
- option (0 or 1 occurrence)
    x ::= y?             an x is a y or is the empty sentence
- 1+ repetition (1 or more occurrences)
    x ::= y+             an x is one or more y
                         (equivalent to x ::= y y* )
- character classes
    x ::= [abc]          is equivalent to x ::= 'a' | 'b' | 'c'
    x ::= [^b]           is equivalent to x ::= 'a' | 'c' | 'd' | 'e' | 'f' | ... (all other characters)
- grouping using parentheses
    x ::= (y z | a b)*   an x is zero or more y-z or a-b pairs
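Most of these operators carry over unchanged to the regex syntax of java.util.regex, so their behavior can be checked with a few whole-string matches. The sketch below is only an illustration: the class name GrammarOperators, the helper isSentence, and the use of the single letters y, z, a, b as terminals are assumptions made for the example, not part of the grammar notation itself.

```java
import java.util.regex.Pattern;

public class GrammarOperators {
    // Hypothetical helper: true when the WHOLE input is a sentence of the expression.
    public static boolean isSentence(String expression, String input) {
        return Pattern.matches(expression, input);
    }

    public static void main(String[] args) {
        System.out.println(isSentence("yz", "yz"));           // concatenation: true
        System.out.println(isSentence("y*", ""));             // repetition allows zero y: true
        System.out.println(isSentence("y|z", "z"));           // union: true
        System.out.println(isSentence("y?", ""));             // option allows the empty sentence: true
        System.out.println(isSentence("y+", ""));             // 1+ repetition needs at least one y: false
        System.out.println(isSentence("[abc]", "b"));         // character class: true
        System.out.println(isSentence("[^b]", "b"));          // negated class excludes b: false
        System.out.println(isSentence("(yz|ab)*", "yzabyz")); // grouping: true
        System.out.println(isSentence("(yz|ab)*", "yza"));    // incomplete pair: false
    }
}
```

Pattern.matches requires the entire input to match, which corresponds to asking whether the input is a sentence of the grammar rather than whether it merely contains one.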
2 Regular Expressions
A regular grammar has a special property: by substituting every nonterminal (except the root) with its right-hand side, you can reduce it to a single production for the root, with only terminals and operators on the right-hand side.
Our URL grammar was regular. By replacing nonterminals with their productions, it can be reduced to a single expression:
url ::= 'http://' ([a-z]+ '.')+ [a-z]+ (':' [0-9]+)? '/'
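The substitution step can be mimicked in Java with string constants, one per nonterminal, that are concatenated into the root expression. This is a minimal sketch under assumptions: the nonterminal names WORD, PORT, and HOSTNAME and the class name UrlRegexBuilder are hypothetical, chosen to show one grammar that reduces to the expression above. The literal period is written as \\. in the Java source so that the regex treats it as a plain character.

```java
import java.util.regex.Pattern;

public class UrlRegexBuilder {
    // Hypothetical nonterminals; each is defined using only terminals,
    // operators, and previously defined nonterminals (no recursion).
    static final String WORD = "[a-z]+";
    static final String PORT = "[0-9]+";
    static final String HOSTNAME = "(" + WORD + "\\.)+" + WORD;

    // Root production with every nonterminal replaced by its right-hand side.
    public static final String URL = "http://" + HOSTNAME + "(:" + PORT + ")?/";

    public static boolean isUrl(String s) {
        return Pattern.matches(URL, s);
    }

    public static void main(String[] args) {
        System.out.println(URL);                           // http://([a-z]+\.)+[a-z]+(:[0-9]+)?/
        System.out.println(isUrl("http://mit.edu/"));      // true
        System.out.println(isUrl("http://mit.edu:8080/")); // true
        System.out.println(isUrl("http://mit/"));          // false: hostname needs at least one period
    }
}
```

Because none of the assumed nonterminals refers to itself, the concatenation always terminates in a single expression, which is exactly the property that makes the grammar regular.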
3 Using regular expressions in Java
In Java, you can use regexes for manipulating strings (see String.split, String.matches, java.util.regex.Pattern).
- Replace all runs of spaces with a single space:
String singleSpacedString = string.replaceAll(" +", " ");
- Match a URL:
Pattern regex = Pattern.compile("http://([a-z]+\\.)+[a-z]+(:[0-9]+)?/");
Matcher m = regex.matcher(string);
if (m.matches()) {
// then string is a url
}
Notice: we want to match a literal period . , so we first have to escape it as \. to protect it from being interpreted as the regex match-any-character operator, and then we have to further escape it as \\. to protect the backslash from being interpreted as a Java string escape character.
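The following sketch gathers the calls mentioned in this section into one runnable class and shows what goes wrong when the period is left unescaped. The class name RegexExamples, the method names, and the sample inputs are assumptions for illustration; only String.replaceAll, String.split, String.matches, Pattern, and Matcher come from the Java library.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

public class RegexExamples {
    private static final Pattern URL =
            Pattern.compile("http://([a-z]+\\.)+[a-z]+(:[0-9]+)?/");

    // Replace all runs of spaces with a single space.
    public static String singleSpace(String s) {
        return s.replaceAll(" +", " ");
    }

    // Split on runs of spaces, using the same regex as a delimiter.
    public static String[] words(String s) {
        return s.split(" +");
    }

    // True when the whole string is a URL according to the regex above.
    public static boolean isUrl(String s) {
        return URL.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(singleSpace("a  b    c"));            // a b c
        System.out.println(Arrays.toString(words("a  b    c"))); // [a, b, c]
        System.out.println(isUrl("http://mit.edu/"));            // true
        System.out.println(isUrl("http://mitXedu/"));            // false: \\. matches only a period

        // With an unescaped period, '.' matches any character, including 'X'.
        System.out.println("http://mitXedu/".matches("http://([a-z]+.)+[a-z]+/")); // true
    }
}
```

Compiling the Pattern once into a constant avoids recompiling the regex on every call, which String.matches does implicitly.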
Reference
[1] 6.005 — Software Construction on MIT OpenCourseWare | OCW 6.005 Homepage at https://ocw.mit.edu/ans7870/6/6.005/s16/