Using a Pattern
This example shows a very simple but complete program that builds and uses a pattern:
    local lpeg = require "lpeg"

    -- matches a word followed by end-of-string
    p = lpeg.R"az"^1 * -1

    print(p:match("hello"))        --> 6
    print(lpeg.match(p, "hello"))  --> 6
    print(p:match("1 hello"))      --> nil
The pattern is simply a sequence of one or more lower-case letters followed by the end of string (-1). The program calls match both as a method and as a function. In both successful cases, the match returns the index of the first character after the match, which is the string length plus one.
Name-value lists
This example parses a list of name-value pairs and returns a table with those pairs:
    lpeg.locale(lpeg)   -- adds locale entries into 'lpeg' table

    local space = lpeg.space^0
    local name = lpeg.C(lpeg.alpha^1) * space
    local sep = lpeg.S(",;") * space
    local pair = lpeg.Cg(name * "=" * space * name) * sep^-1
    local list = lpeg.Cf(lpeg.Ct("") * pair^0, rawset)
    t = list:match("a=b, c = hi; next = pi")  --> { a = "b", c = "hi", next = "pi" }
Each pair has the format name = name followed by an optional separator (a comma or a semicolon). The pair pattern encloses the pair in a group pattern, so that the names become the values of a single capture. The list pattern then folds these captures. It starts with an empty table, created by a table capture matching an empty string; then for each capture (a pair of names) it applies rawset over the accumulator (the table) and the capture values (the pair of names). rawset returns the table itself, so the accumulator is always the table.
Splitting a string
The following code builds a pattern that splits a string using a given pattern sep as a separator:
    function split (s, sep)
      sep = lpeg.P(sep)
      local elem = lpeg.C((1 - sep)^0)
      local p = elem * (sep * elem)^0
      return lpeg.match(p, s)
    end
First the function ensures that sep is a proper pattern. The pattern elem is a repetition of zero or more arbitrary characters as long as there is not a match against the separator. It also captures its match. The pattern p matches a list of elements separated by sep.
If the split results in too many values, it may overflow the maximum number of values that can be returned by a Lua function. In this case, we can collect these values in a table:
    function split (s, sep)
      sep = lpeg.P(sep)
      local elem = lpeg.C((1 - sep)^0)
      local p = lpeg.Ct(elem * (sep * elem)^0)   -- make a table capture
      return lpeg.match(p, s)
    end
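As a brief usage sketch (assuming LPeg is installed and loadable as lpeg), the table-returning split can be called with either a plain string or a pattern as separator. The sample strings and the ws helper are illustrative only:

```lua
local lpeg = require "lpeg"

-- Table-returning split, as in the second version above.
local function split (s, sep)
  sep = lpeg.P(sep)
  local elem = lpeg.C((1 - sep)^0)
  local p = lpeg.Ct(elem * (sep * elem)^0)
  return lpeg.match(p, s)
end

-- A plain string separator; empty elements are kept.
local t = split("a,b,,c", ",")
print(#t, table.concat(t, "|"))   --> 4   a|b||c

-- A pattern separator: a comma or semicolon with optional blanks around it.
local ws = lpeg.S(" \t")^0        -- hypothetical helper for this example
local parts = split("x ; y,z", ws * lpeg.S(",;") * ws)
print(table.concat(parts, "|"))   --> x|y|z
```

Because elem may match the empty string, consecutive separators produce empty strings in the result rather than being collapsed.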
Searching for a pattern
The primitive match works only in anchored mode. If we want to find a pattern anywhere in a string, we must write a pattern that matches anywhere.
Because patterns are composable, we can write a function that, given any arbitrary pattern p, returns a new pattern that searches for p anywhere in a string. There are several ways to do the search. One way is like this:
    function anywhere (p)
      return lpeg.P{ p + 1 * lpeg.V(1) }
    end
This grammar has a straightforward reading: it matches p or skips one character and tries again.
If we want to know where the pattern is in the string (instead of knowing only that it is there somewhere), we can add position captures to the pattern:
    local I = lpeg.Cp()
    function anywhere (p)
      return lpeg.P{ I * p * I + 1 * lpeg.V(1) }
    end

    print(anywhere("world"):match("hello world!"))   --> 7   12
Another option for the search is like this:
    local I = lpeg.Cp()
    function anywhere (p)
      return (1 - lpeg.P(p))^0 * I * p * I
    end
Again the pattern has a straightforward reading: it skips as many characters as possible while not matching p, and then matches p (plus appropriate captures).
If we want to look for a pattern only at word boundaries, we can use the following transformer:
    local t = lpeg.locale()

    function atwordboundary (p)
      return lpeg.P{
        [1] = p + t.alpha^0 * (1 - t.alpha)^1 * lpeg.V(1)
      }
    end
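A small usage sketch may help show what the transformer accepts. It assumes LPeg is loadable as lpeg and a locale in which ASCII letters are alphabetic; the search word "cat" and the sample subjects are illustrative only:

```lua
local lpeg = require "lpeg"
local t = lpeg.locale()

local function atwordboundary (p)
  return lpeg.P{
    [1] = p + t.alpha^0 * (1 - t.alpha)^1 * lpeg.V(1)
  }
end

local w = atwordboundary("cat")
print(w:match("a cat"))      --> 6    ("cat" starts a word)
print(w:match("concat"))     --> nil  ("cat" is in the middle of a word)
print(w:match("cat, dog"))   --> 4    (the start of the subject counts)
```

The grammar only tries p at the start of the subject or right after a run of non-letters, so it checks the boundary at the beginning of the match, not at its end.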
Balanced parentheses
The following pattern matches only strings with balanced parentheses:
    b = lpeg.P{ "(" * ((1 - lpeg.S"()") + lpeg.V(1))^0 * ")" }
Reading the first (and only) rule of the given grammar, we have that a balanced string is an open parenthesis, followed by zero or more repetitions of either a non-parenthesis character or a balanced string (lpeg.V(1)), followed by a closing parenthesis.
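The following sketch exercises the grammar on a few inputs, assuming LPeg is loadable as lpeg. Since match is anchored only at the start, the pattern also accepts a balanced prefix; the name whole is a hypothetical addition that appends -1 to require the entire subject to be balanced:

```lua
local lpeg = require "lpeg"

local b = lpeg.P{ "(" * ((1 - lpeg.S"()") + lpeg.V(1))^0 * ")" }

print(b:match("(a(b)c)"))    --> 8
print(b:match("(a(b)c"))     --> nil  (missing closing parenthesis)
print(b:match("(a)b)"))      --> 4    (only the prefix "(a)" is matched)

-- Anchoring at the end of the subject rejects trailing text.
local whole = b * -1
print(whole:match("(a)b)"))  --> nil
print(whole:match("(())"))   --> 5
```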
Global substitution
The next example does a job somewhat similar to string.gsub. It receives a pattern and a replacement value, and substitutes the replacement value for all occurrences of the pattern in a given string:
    function gsub (s, patt, repl)
      patt = lpeg.P(patt)
      patt = lpeg.Cs((patt / repl + 1)^0)
      return lpeg.match(patt, s)
    end
As in string.gsub, the replacement value can be a string, a function, or a table.
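As an illustration of the three replacement kinds, here is a short sketch that assumes LPeg is loadable as lpeg. The sample strings and the digits and vowel patterns are made up for the example:

```lua
local lpeg = require "lpeg"

local function gsub (s, patt, repl)
  patt = lpeg.P(patt)
  patt = lpeg.Cs((patt / repl + 1)^0)
  return lpeg.match(patt, s)
end

-- String replacement.
local r1 = gsub("hello world", "o", "0")
print(r1)   --> hell0 w0rld

-- Function replacement: called with the matched text.
local digits = lpeg.R("09")^1
local r2 = gsub("a1b22", digits, function (d) return "<" .. d .. ">" end)
print(r2)   --> a<1>b<22>

-- Table replacement: the matched text is used as the key.
local vowel = lpeg.S("aeiou")
local r3 = gsub("banana", vowel, { a = "A" })
print(r3)   --> bAnAnA
```

Unlike string.gsub, this version returns only the resulting string, not the number of substitutions.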
Comma-Separated Values (CSV)
This example breaks a string into comma-separated values, returning all fields:
    local field = '"' * lpeg.Cs(((lpeg.P(1) - '"') + lpeg.P'""' / '"')^0) * '"' +
                  lpeg.C((1 - lpeg.S',\n"')^0)

    local record = field * (',' * field)^0 * (lpeg.P'\n' + -1)

    function csv (s)
      return lpeg.match(record, s)
    end
A field is either a quoted field (which may contain any character except an individual quote, which may be written as two quotes that are replaced by one) or an unquoted field (which cannot contain commas, newlines, or quotes). A record is a list of fields separated by commas, ending with a newline or the string end (-1).
As it is, the previous pattern returns each field as a separate result. If we add a table capture in the definition of record, the pattern will return instead a single table containing all fields:
    local record = lpeg.Ct(field * (',' * field)^0) * (lpeg.P'\n' + -1)
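To see the quoting rules in action, the sketch below runs the table-capturing record on one sample line. It assumes LPeg is loadable as lpeg; the input line is invented for the example:

```lua
local lpeg = require "lpeg"

local field = '"' * lpeg.Cs(((lpeg.P(1) - '"') + lpeg.P'""' / '"')^0) * '"' +
              lpeg.C((1 - lpeg.S',\n"')^0)
local record = lpeg.Ct(field * (',' * field)^0) * (lpeg.P'\n' + -1)

-- Fields: plain, quoted with a comma, quoted with doubled quotes, empty, plain.
local t = lpeg.match(record, 'a,"b,c","say ""hi""",,z\n')
for i, v in ipairs(t) do print(i, v) end
--> 1   a
--> 2   b,c
--> 3   say "hi"
--> 4
--> 5   z
```

The substitution capture (lpeg.Cs) is what turns each doubled quote inside a quoted field into a single quote, and an unquoted field may be empty.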
UTF-8 and Latin 1
It is not difficult to use LPeg to convert a string from UTF-8 encoding to Latin 1 (ISO 8859-1):
    -- convert a two-byte UTF-8 sequence to a Latin 1 character
    local function f2 (s)
      local c1, c2 = string.byte(s, 1, 2)
      return string.char(c1 * 64 + c2 - 12416)
    end

    local utf8 = lpeg.R("\0\127")
               + lpeg.R("\194\195") * lpeg.R("\128\191") / f2

    local decode_pattern = lpeg.Cs(utf8^0) * -1
In this code, the definition of UTF-8 is already restricted to the Latin 1 range (from 0 to 255). Any encoding outside this range (as well as any invalid encoding) will not match that pattern.
As the definition of decode_pattern demands that the pattern matches the whole input (because of the -1 at its end), any invalid string will simply fail to match, without any useful information about the problem. We can improve this situation by redefining decode_pattern as follows:
    local function er (_, i)
      error("invalid encoding at position " .. i)
    end

    local decode_pattern = lpeg.Cs(utf8^0) * (-1 + lpeg.P(er))
Now, if the pattern utf8^0 stops before the end of the string, an appropriate error function is called.
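The sketch below puts the pieces together and shows both outcomes, assuming LPeg is loadable as lpeg. The sample strings are illustrative, and pcall is used here only so that the error can be inspected instead of aborting the script:

```lua
local lpeg = require "lpeg"

local function f2 (s)
  local c1, c2 = string.byte(s, 1, 2)
  return string.char(c1 * 64 + c2 - 12416)
end

local utf8 = lpeg.R("\0\127")
           + lpeg.R("\194\195") * lpeg.R("\128\191") / f2

local function er (_, i)
  error("invalid encoding at position " .. i)
end

local decode_pattern = lpeg.Cs(utf8^0) * (-1 + lpeg.P(er))

-- "é" is the two bytes \195\169 in UTF-8 and the single byte \233 in Latin 1.
local latin1 = decode_pattern:match("caf\195\169")
print(latin1 == "caf\233")   --> true

-- A lone continuation byte is invalid; the message reports position 3.
local ok, msg = pcall(lpeg.match, decode_pattern, "ab\128cd")
print(ok, msg)
```

The reported position is the index of the first byte that utf8^0 could not consume. The message is also prefixed with the usual source location added by error.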
UTF-8 and Unicode
We can extend the previous patterns to handle all Unicode code points. Of course, we cannot translate them to Latin 1 or any other one-byte encoding. Instead, our translation results in an array with the code points represented as numbers. The full code is here:
    -- decode a two-byte UTF-8 sequence
    local function f2 (s)
      local c1, c2 = string.byte(s, 1, 2)
      return c1 * 64 + c2 - 12416
    end

    -- decode a three-byte UTF-8 sequence
    local function f3 (s)
      local c1, c2, c3 = string.byte(s, 1, 3)
      return (c1 * 64 + c2) * 64 + c3 - 925824
    end

    -- decode a four-byte UTF-8 sequence
    local function f4 (s)
      local c1, c2, c3, c4 = string.byte(s, 1, 4)
      return ((c1 * 64 + c2) * 64 + c3) * 64 + c4 - 63447168
    end

    local cont = lpeg.R("\128\191")   -- continuation byte

    local utf8 = lpeg.R("\0\127") / string.byte
               + lpeg.R("\194\223") * cont / f2
               + lpeg.R("\224\239") * cont * cont / f3
               + lpeg.R("\240\244") * cont * cont * cont / f4

    local decode_pattern = lpeg.Ct(utf8^0) * -1
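As a quick check of the decoding arithmetic, the sketch below repeats the definitions and decodes one character of each length. It assumes LPeg is loadable as lpeg; the sample string is chosen for illustration and spells "A", U+00E9, U+20AC and U+1F600 as raw bytes:

```lua
local lpeg = require "lpeg"

local function f2 (s)
  local c1, c2 = string.byte(s, 1, 2)
  return c1 * 64 + c2 - 12416
end

local function f3 (s)
  local c1, c2, c3 = string.byte(s, 1, 3)
  return (c1 * 64 + c2) * 64 + c3 - 925824
end

local function f4 (s)
  local c1, c2, c3, c4 = string.byte(s, 1, 4)
  return ((c1 * 64 + c2) * 64 + c3) * 64 + c4 - 63447168
end

local cont = lpeg.R("\128\191")   -- continuation byte

local utf8 = lpeg.R("\0\127") / string.byte
           + lpeg.R("\194\223") * cont / f2
           + lpeg.R("\224\239") * cont * cont / f3
           + lpeg.R("\240\244") * cont * cont * cont / f4

local decode_pattern = lpeg.Ct(utf8^0) * -1

-- One-, two-, three- and four-byte sequences, in that order.
local cps = decode_pattern:match("A\195\169\226\130\172\240\159\152\128")
print(table.concat(cps, " "))   --> 65 233 8364 128512
```

The constants subtracted in f2, f3 and f4 remove the marker bits of the lead and continuation bytes in a single step, which is why each function needs only multiplications and one subtraction.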
Lua's long strings
A long string in Lua starts with the pattern [=*[ and ends at the first occurrence of ]=*] with exactly the same number of equal signs. If the opening brackets are followed by a newline, this newline is discarded (that is, it is not part of the string).
To match a long string in Lua, the pattern must capture the first repetition of equal signs and then, whenever it finds a candidate for closing the string, check whether it has the same number of equal signs.
    equals = lpeg.P"="^0
    open = "[" * lpeg.Cg(equals, "init") * "[" * lpeg.P"\n"^-1
    close = "]" * lpeg.C(equals) * "]"
    closeeq = lpeg.Cmt(close * lpeg.Cb("init"), function (s, i, a, b) return a == b end)
    string = open * lpeg.C((lpeg.P(1) - closeeq)^0) * close / 1
The open pattern matches [=*[, capturing the repetitions of equal signs in a group named init; it also discards an optional newline, if present. The close pattern matches ]=*], also capturing the repetitions of equal signs. The closeeq pattern first matches close; then it uses a back capture to recover the capture made by the previous open, which is named init; finally it uses a match-time capture to check whether both captures are equal. The string pattern starts with an open, then it goes as far as possible until matching closeeq, and then matches the final close. The final numbered capture simply discards the capture made by close.
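The sketch below tries the pattern on a subject that contains closing candidates with the wrong number of equal signs. It assumes LPeg is loadable as lpeg. The definitions are made local, and the final pattern is named longstring here (a name chosen for this example) so that it does not shadow Lua's standard string library:

```lua
local lpeg = require "lpeg"

local equals = lpeg.P"="^0
local open = "[" * lpeg.Cg(equals, "init") * "[" * lpeg.P"\n"^-1
local close = "]" * lpeg.C(equals) * "]"
local closeeq = lpeg.Cmt(close * lpeg.Cb("init"),
                         function (s, i, a, b) return a == b end)
local longstring = open * lpeg.C((lpeg.P(1) - closeeq)^0) * close / 1

-- "]]" and "]=]" do not close a string opened with "[==[".
local body = longstring:match("[==[\nx]]y]=]z]==]")
print(body)   --> x]]y]=]z

-- Without a matching close, the pattern fails.
print(longstring:match("[=[abc]]"))   --> nil
```

The leading newline after the opening brackets is consumed by open, so it is not part of the captured body.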
Arithmetic expressions
This example is a complete parser and evaluator for simple arithmetic expressions. We write it in two styles. The first approach first builds a syntax tree and then traverses this tree to compute the expression value:
    -- Lexical Elements
    local Space = lpeg.S(" \n\t")^0
    local Number = lpeg.C(lpeg.P"-"^-1 * lpeg.R("09")^1) * Space
    local TermOp = lpeg.C(lpeg.S("+-")) * Space
    local FactorOp = lpeg.C(lpeg.S("*/")) * Space
    local Open = "(" * Space
    local Close = ")" * Space

    -- Grammar
    local Exp, Term, Factor = lpeg.V"Exp", lpeg.V"Term", lpeg.V"Factor"
    G = lpeg.P{ Exp,
      Exp = lpeg.Ct(Term * (TermOp * Term)^0);
      Term = lpeg.Ct(Factor * (FactorOp * Factor)^0);
      Factor = Number + Open * Exp * Close;
    }

    G = Space * G * -1

    -- Evaluator
    function eval (x)
      if type(x) == "string" then
        return tonumber(x)
      else
        local op1 = eval(x[1])
        for i = 2, #x, 2 do
          local op = x[i]
          local op2 = eval(x[i + 1])
          if (op == "+") then op1 = op1 + op2
          elseif (op == "-") then op1 = op1 - op2
          elseif (op == "*") then op1 = op1 * op2
          elseif (op == "/") then op1 = op1 / op2
          end
        end
        return op1
      end
    end

    -- Parser/Evaluator
    function evalExp (s)
      local t = lpeg.match(G, s)
      if not t then error("syntax error", 2) end
      return eval(t)
    end

    -- small example
    print(evalExp"3 + 5*9 / (1+1) - 12")   --> 13.5
The second style computes the expression value on the fly, without building the syntax tree. The following grammar takes this approach. (It assumes the same lexical elements as before.)
    -- Auxiliary function
    function eval (v1, op, v2)
      if (op == "+") then return v1 + v2
      elseif (op == "-") then return v1 - v2
      elseif (op == "*") then return v1 * v2
      elseif (op == "/") then return v1 / v2
      end
    end

    -- Grammar
    local V = lpeg.V
    G = lpeg.P{ "Exp",
      Exp = lpeg.Cf(V"Term" * lpeg.Cg(TermOp * V"Term")^0, eval);
      Term = lpeg.Cf(V"Factor" * lpeg.Cg(FactorOp * V"Factor")^0, eval);
      Factor = Number / tonumber + Open * V"Exp" * Close;
    }

    -- small example
    print(lpeg.match(G, "3 + 5*9 / (1+1) - 12"))   --> 13.5
Note the use of the fold (accumulator) capture. To compute the value of an expression, the accumulator starts with the value of the first term, and then applies eval over the accumulator, the operator, and the new term for each repetition.