LPeg Programming Examples

Latest recommended article published on 2024-06-25 09:47:50

明潮

Category: lua

Article link: https://blog.csdn.net/u010144805/article/details/80526998

I will translate this when I have time.

Using a Pattern

This example shows a very simple but complete program that builds and uses a pattern:

local lpeg = require "lpeg"

-- matches a word followed by end-of-string
p = lpeg.R"az"^1 * -1

print(p:match("hello"))        --> 6
print(lpeg.match(p, "hello"))  --> 6
print(p:match("1 hello"))      --> nil

The pattern is simply a sequence of one or more lower-case letters followed by the end of string (-1). The program calls match both as a method and as a function. In both successful cases, the match returns the index of the first character after the match, which is the string length plus one.
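Because the return value is a position rather than the matched text, it can help to compare the pattern with and without the end-of-string check. The following is a minimal sketch; the names word and whole are illustrative and do not appear in the original program.

```lua
local lpeg = require "lpeg"

local word = lpeg.R("az")^1      -- one or more lower-case letters
local whole = word * -1          -- the same, but must reach the end of string

-- Without -1 the pattern may stop early: it matches the prefix "hello".
print(word:match("hello world"))    --> 6
-- With -1 the trailing " world" makes the match fail.
print(whole:match("hello world"))   --> nil
-- Matching is anchored at position 1, so a leading digit fails either way.
print(word:match("1 hello"))        --> nil
```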

Name-value lists

This example parses a list of name-value pairs and returns a table with those pairs:

lpeg.locale(lpeg)   -- adds locale entries into 'lpeg' table

local space = lpeg.space^0
local name = lpeg.C(lpeg.alpha^1) * space
local sep = lpeg.S(",;") * space
local pair = lpeg.Cg(name * "=" * space * name) * sep^-1
local list = lpeg.Cf(lpeg.Ct("") * pair^0, rawset)
t = list:match("a=b, c = hi; next = pi")  --> { a = "b", c = "hi", next = "pi" }

Each pair has the format name = name followed by an optional separator (a comma or a semicolon). The pair pattern encloses the pair in a group pattern, so that the names become the values of a single capture. The list pattern then folds these captures. It starts with an empty table, created by a table capture matching an empty string; then for each capture (a pair of names) it applies rawset over the accumulator (the table) and the capture values (the pair of names). rawset returns the table itself, so the accumulator is always the table.
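For comparison, the same result can be built without a fold capture. This is a different technique from the one described above: each pair is collected into a small table and an ordinary Lua loop fills the result. The names pair_list and parse are assumptions made for this sketch.

```lua
local lpeg = require "lpeg"
local locale = lpeg.locale()   -- table with the locale-dependent patterns

local space = locale.space^0
local name = lpeg.C(locale.alpha^1) * space
local sep = lpeg.S(",;") * space
-- each pair becomes a two-element table {name, value}
local pair = lpeg.Ct(name * "=" * space * name) * sep^-1
local pair_list = lpeg.Ct(pair^0)

local function parse (s)
  local result = {}
  for _, kv in ipairs(pair_list:match(s)) do
    result[kv[1]] = kv[2]
  end
  return result
end

local t = parse("a=b, c = hi; next = pi")
print(t.a, t.c, t.next)   --> b   hi   pi
```

The fold version does everything inside the pattern; the loop version is a little longer but keeps the table construction in plain Lua.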

Splitting a string

The following code builds a pattern that splits a string using a given pattern sep as a separator:

function split (s, sep)
  sep = lpeg.P(sep)
  local elem = lpeg.C((1 - sep)^0)
  local p = elem * (sep * elem)^0
  return lpeg.match(p, s)
end

First the function ensures that sep is a proper pattern. The pattern elem is a repetition of zero or more arbitrary characters as long as there is not a match against the separator. It also captures its match. The pattern p matches a list of elements separated by sep.

If the split results in too many values, it may overflow the maximum number of values that can be returned by a Lua function. In this case, we can collect these values in a table:

function split (s, sep)
  sep = lpeg.P(sep)
  local elem = lpeg.C((1 - sep)^0)
  local p = lpeg.Ct(elem * (sep * elem)^0)   -- make a table capture
  return lpeg.match(p, s)
end
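A short usage sketch of the table-returning version follows. The sample strings and the variable names parts and sep are illustrative only.

```lua
local lpeg = require "lpeg"

local function split (s, sep)
  sep = lpeg.P(sep)
  local elem = lpeg.C((1 - sep)^0)
  local p = lpeg.Ct(elem * (sep * elem)^0)   -- make a table capture
  return lpeg.match(p, s)
end

-- an empty element between two adjacent separators is kept
local parts = split("a,b,,c", ",")
print(#parts, table.concat(parts, "|"))   --> 4   a|b||c

-- the separator may itself be a pattern: a comma or semicolon plus spaces
local sep = lpeg.S(",;") * lpeg.P(" ")^0
print(table.concat(split("x, y;z", sep), "|"))   --> x|y|z
```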

Searching for a pattern

The primitive match works only in anchored mode. If we want to find a pattern anywhere in a string, we must write a pattern that matches anywhere.

Because patterns are composable, we can write a function that, given any arbitrary pattern p, returns a new pattern that searches for p anywhere in a string. There are several ways to do the search. One way is like this:

function anywhere (p)
  return lpeg.P{ p + 1 * lpeg.V(1) }
end

This grammar has a straight reading: it matches p or skips one character and tries again.

If we want to know where the pattern is in the string (instead of knowing only that it is there somewhere), we can add position captures to the pattern:

local I = lpeg.Cp()
function anywhere (p)
  return lpeg.P{ I * p * I + 1 * lpeg.V(1) }
end

print(anywhere("world"):match("hello world!"))   --> 7   12

Another option for the search is like this:

local I = lpeg.Cp()
function anywhere (p)
  return (1 - lpeg.P(p))^0 * I * p * I
end

Again the pattern has a straight reading: it skips as many characters as possible while not matching p, and then matches p (plus appropriate captures).
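The two search strategies can be run side by side to confirm that they report the same positions. In this sketch the functions are renamed anywhere_rec and anywhere_loop (hypothetical names) so that both can coexist in one file.

```lua
local lpeg = require "lpeg"
local I = lpeg.Cp()

-- grammar-based search: match p, or skip one character and retry
local function anywhere_rec (p)
  return lpeg.P{ I * p * I + 1 * lpeg.V(1) }
end

-- repetition-based search: skip everything that does not start a match
local function anywhere_loop (p)
  return (1 - lpeg.P(p))^0 * I * p * I
end

print(anywhere_rec("world"):match("hello world!"))    --> 7   12
print(anywhere_loop("world"):match("hello world!"))   --> 7   12
print(anywhere_loop("xyz"):match("hello world!"))     --> nil
```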

If we want to look for a pattern only at word boundaries, we can use the following transformer:

local t = lpeg.locale()

function atwordboundary (p)
  return lpeg.P{
    [1] = p + t.alpha^0 * (1 - t.alpha)^1 * lpeg.V(1)
  }
end
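A small usage sketch may make the behaviour concrete. The pattern name find_cat and the sample strings are assumptions for illustration. Note that, as written, the transformer only checks the boundary at the start of the match, so "catalog" still matches.

```lua
local lpeg = require "lpeg"
local t = lpeg.locale()

local function atwordboundary (p)
  return lpeg.P{
    [1] = p + t.alpha^0 * (1 - t.alpha)^1 * lpeg.V(1)
  }
end

local find_cat = atwordboundary(lpeg.P("cat"))
print(find_cat:match("a cat"))     --> 6    ("cat" starts a word)
print(find_cat:match("concat"))    --> nil  ("cat" is in the middle of a word)
print(find_cat:match("catalog"))   --> 4    (only the start is checked)
```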

Balanced parentheses

The following pattern matches only strings with balanced parentheses:

b = lpeg.P{ "(" * ((1 - lpeg.S"()") + lpeg.V(1))^0 * ")" }

Reading the first (and only) rule of the given grammar, we have that a balanced string is an open parenthesis, followed by zero or more repetitions of either a non-parenthesis character or a balanced string (lpeg.V(1)), followed by a closing parenthesis.
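The following sketch exercises the grammar on a few inputs. Since match is anchored only at the start, the pattern accepts a balanced prefix; appending -1 (an addition made here, not part of the original pattern) requires the whole subject to be one balanced group.

```lua
local lpeg = require "lpeg"

local b = lpeg.P{ "(" * ((1 - lpeg.S"()") + lpeg.V(1))^0 * ")" }

print(b:match("(a(b)c)"))          --> 8
print(b:match("(a(b)c"))           --> nil  (missing closing parenthesis)
print(b:match("(a)(b)"))           --> 4    (only the first group is matched)
print((b * -1):match("(a)(b)"))    --> nil  (whole string is not one group)
```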

Global substitution

The next example does a job somewhat similar to string.gsub. It receives a pattern and a replacement value, and substitutes the replacement value for all occurrences of the pattern in a given string:

function gsub (s, patt, repl)
  patt = lpeg.P(patt)
  patt = lpeg.Cs((patt / repl + 1)^0)
  return lpeg.match(patt, s)
end

As in string.gsub, the replacement value can be a string, a function, or a table.
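The three kinds of replacement value can be tried as follows. This is a sketch with made-up sample strings; unlike string.gsub, this function returns only the resulting string, not a substitution count.

```lua
local lpeg = require "lpeg"

local function gsub (s, patt, repl)
  patt = lpeg.P(patt)
  patt = lpeg.Cs((patt / repl + 1)^0)
  return lpeg.match(patt, s)
end

-- string replacement
print(gsub("hello world", "o", "0"))                      --> hell0 w0rld
-- function replacement: called with the matched text
print(gsub("hello world", lpeg.S("aeiou"), string.upper)) --> hEllO wOrld
-- table replacement: the matched text is used as the key
print(gsub("a-b", lpeg.R("az"), { a = "1", b = "2" }))    --> 1-2
```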

Comma-Separated Values (CSV)

This example breaks a string into comma-separated values, returning all fields:

local field = '"' * lpeg.Cs(((lpeg.P(1) - '"') + lpeg.P'""' / '"')^0) * '"' +
                    lpeg.C((1 - lpeg.S',\n"')^0)

local record = field * (',' * field)^0 * (lpeg.P'\n' + -1)

function csv (s)
  return lpeg.match(record, s)
end

A field is either a quoted field (which may contain any character except an individual quote, which may be written as two quotes that are replaced by one) or an unquoted field (which cannot contain commas, newlines, or quotes). A record is a list of fields separated by commas, ending with a newline or the string end (-1).

As it is, the previous pattern returns each field as a separate result. If we add a table capture in the definition of record, the pattern will return instead a single table containing all fields:

local record = lpeg.Ct(field * (',' * field)^0) * (lpeg.P'\n' + -1)
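As a quick check of the table-returning record, the sketch below parses one line that mixes unquoted fields, a quoted field containing a comma, doubled quotes, and an empty trailing field. The sample line and the variable name fields are illustrative.

```lua
local lpeg = require "lpeg"

local field = '"' * lpeg.Cs(((lpeg.P(1) - '"') + lpeg.P'""' / '"')^0) * '"' +
              lpeg.C((1 - lpeg.S',\n"')^0)

local record = lpeg.Ct(field * (',' * field)^0) * (lpeg.P'\n' + -1)

local fields = record:match('one,"two, with comma","say ""hi""",')
for i, f in ipairs(fields) do
  print(i, f)
end
--> 1   one
--> 2   two, with comma
--> 3   say "hi"
--> 4
```

The trailing comma produces a fourth, empty field, because an unquoted field may match the empty string.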

UTF-8 and Latin 1

It is not difficult to use LPeg to convert a string from UTF-8 encoding to Latin 1 (ISO 8859-1):

-- convert a two-byte UTF-8 sequence to a Latin 1 character
local function f2 (s)
  local c1, c2 = string.byte(s, 1, 2)
  return string.char(c1 * 64 + c2 - 12416)
end

local utf8 = lpeg.R("\0\127")
           + lpeg.R("\194\195") * lpeg.R("\128\191") / f2

local decode_pattern = lpeg.Cs(utf8^0) * -1

In this code, the definition of UTF-8 is already restricted to the Latin 1 range (from 0 to 255). Any encoding outside this range (as well as any invalid encoding) will not match that pattern.

As the definition of decode_pattern demands that the pattern matches the whole input (because of the -1 at its end), any invalid string will simply fail to match, without any useful information about the problem. We can improve this situation by redefining decode_pattern as follows:

local function er (_, i) error("invalid encoding at position " .. i) end

local decode_pattern = lpeg.Cs(utf8^0) * (-1 + lpeg.P(er))

Now, if the pattern utf8^0 stops before the end of the string, an appropriate error function is called.
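The sketch below runs the error-reporting decoder on one valid and one out-of-range input. The pattern is named utf8_latin1 here (a hypothetical name) to avoid shadowing the utf8 library of newer Lua versions, and pcall is used only to show the error message.

```lua
local lpeg = require "lpeg"

-- convert a two-byte UTF-8 sequence to a Latin 1 character
local function f2 (s)
  local c1, c2 = string.byte(s, 1, 2)
  return string.char(c1 * 64 + c2 - 12416)
end

local utf8_latin1 = lpeg.R("\0\127")
                  + lpeg.R("\194\195") * lpeg.R("\128\191") / f2

local function er (_, i) error("invalid encoding at position " .. i) end

local decode_pattern = lpeg.Cs(utf8_latin1^0) * (-1 + lpeg.P(er))

-- U+00E9 (e with acute accent) is the UTF-8 byte pair 195 169
-- and the single Latin 1 byte 233
print(decode_pattern:match("caf\195\169") == "caf\233")   --> true

-- U+20AC (euro sign, bytes 226 130 172) lies outside Latin 1
local ok, msg = pcall(decode_pattern.match, decode_pattern, "a\226\130\172")
print(ok, msg)   -- false, plus a message ending in "invalid encoding at position 2"
```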

UTF-8 and Unicode

We can extend the previous patterns to handle all Unicode code points. Of course, we cannot translate them to Latin 1 or any other one-byte encoding. Instead, our translation results in an array with the code points represented as numbers. The full code is here:

-- decode a two-byte UTF-8 sequence
local function f2 (s)
  local c1, c2 = string.byte(s, 1, 2)
  return c1 * 64 + c2 - 12416
end

-- decode a three-byte UTF-8 sequence
local function f3 (s)
  local c1, c2, c3 = string.byte(s, 1, 3)
  return (c1 * 64 + c2) * 64 + c3 - 925824
end

-- decode a four-byte UTF-8 sequence
local function f4 (s)
  local c1, c2, c3, c4 = string.byte(s, 1, 4)
  return ((c1 * 64 + c2) * 64 + c3) * 64 + c4 - 63447168
end

local cont = lpeg.R("\128\191")   -- continuation byte

local utf8 = lpeg.R("\0\127") / string.byte
           + lpeg.R("\194\223") * cont / f2
           + lpeg.R("\224\239") * cont * cont / f3
           + lpeg.R("\240\244") * cont * cont * cont / f4

local decode_pattern = lpeg.Ct(utf8^0) * -1
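The constants subtracted in f2, f3, and f4 are not explained in the code. One way to read them, sketched here in plain Lua, is as the combined value of the fixed prefix bits: 192, 224, and 240 for the lead bytes of two-, three-, and four-byte sequences, and 128 for every continuation byte. The names k2, k3, and k4 are introduced only for this illustration.

```lua
-- Lead-byte prefixes: 110xxxxx = 192, 1110xxxx = 224, 11110xxx = 240.
-- Every continuation byte carries the prefix 10xxxxxx = 128.
local k2 = 192 * 64 + 128
local k3 = (224 * 64 + 128) * 64 + 128
local k4 = ((240 * 64 + 128) * 64 + 128) * 64 + 128
print(k2, k3, k4)   --> 12416   925824   63447168

-- Worked check: U+20AC (euro sign) is encoded as the bytes 226 130 172.
local c1, c2, c3 = 226, 130, 172
print((c1 * 64 + c2) * 64 + c3 - k3)   --> 8364  (which is 0x20AC)
```

Subtracting the constant once at the end gives the same result as stripping the prefix from each byte before shifting.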

Lua's long strings

A long string in Lua starts with the pattern [=*[ and ends at the first occurrence of ]=*] with exactly the same number of equal signs. If the opening brackets are followed by a newline, this newline is discarded (that is, it is not part of the string).

To match a long string in Lua, the pattern must capture the first repetition of equal signs and then, whenever it finds a candidate for closing the string, check whether it has the same number of equal signs.

local equals = lpeg.P"="^0
local open = "[" * lpeg.Cg(equals, "init") * "[" * lpeg.P"\n"^-1
local close = "]" * lpeg.C(equals) * "]"
local closeeq = lpeg.Cmt(close * lpeg.Cb("init"), function (s, i, a, b) return a == b end)
-- named longstring rather than string, so that Lua's standard string library is not overwritten
local longstring = open * lpeg.C((lpeg.P(1) - closeeq)^0) * close / 1

The open pattern matches [=*[, capturing the repetitions of equal signs in a group named init; it also discards an optional newline, if present. The close pattern matches ]=*], also capturing the repetitions of equal signs. The closeeq pattern first matches close; then it uses a back capture to recover the capture made by the previous open, which is named init; finally it uses a match-time capture to check whether both captures are equal. The longstring pattern starts with an open, then it goes as far as possible until matching closeeq, and then matches the final close. The final numbered capture simply discards the capture made by close.
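To see the level check in action, the sketch below feeds the pattern a level-2 long string whose body contains closing brackets of other levels. The final pattern is bound to the local name longstring (a name chosen for this sketch), and the sample inputs are illustrative.

```lua
local lpeg = require "lpeg"

local equals = lpeg.P"="^0
local open = "[" * lpeg.Cg(equals, "init") * "[" * lpeg.P"\n"^-1
local close = "]" * lpeg.C(equals) * "]"
local closeeq = lpeg.Cmt(close * lpeg.Cb("init"),
                         function (s, i, a, b) return a == b end)
local longstring = open * lpeg.C((lpeg.P(1) - closeeq)^0) * close / 1

-- "]]" and "]=]" have the wrong number of equal signs, so they stay in the body
print(longstring:match("[==[\nhello ]] and ]=] world]==] tail"))
--> hello ]] and ]=] world

-- without a matching close the whole match fails
print(longstring:match("[[no closing bracket"))   --> nil
```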

Arithmetic expressions

This example is a complete parser and evaluator for simple arithmetic expressions. We write it in two styles. The first approach first builds a syntax tree and then traverses this tree to compute the expression value:

-- Lexical Elements
local Space = lpeg.S(" \n\t")^0
local Number = lpeg.C(lpeg.P"-"^-1 * lpeg.R("09")^1) * Space
local TermOp = lpeg.C(lpeg.S("+-")) * Space
local FactorOp = lpeg.C(lpeg.S("*/")) * Space
local Open = "(" * Space
local Close = ")" * Space

-- Grammar
local Exp, Term, Factor = lpeg.V"Exp", lpeg.V"Term", lpeg.V"Factor"
G = lpeg.P{ Exp,
  Exp = lpeg.Ct(Term * (TermOp * Term)^0);
  Term = lpeg.Ct(Factor * (FactorOp * Factor)^0);
  Factor = Number + Open * Exp * Close;
}

G = Space * G * -1

-- Evaluator
function eval (x)
  if type(x) == "string" then
    return tonumber(x)
  else
    local op1 = eval(x[1])
    for i = 2, #x, 2 do
      local op = x[i]
      local op2 = eval(x[i + 1])
      if (op == "+") then op1 = op1 + op2
      elseif (op == "-") then op1 = op1 - op2
      elseif (op == "*") then op1 = op1 * op2
      elseif (op == "/") then op1 = op1 / op2
      end
    end
    return op1
  end
end

-- Parser/Evaluator
function evalExp (s)
  local t = lpeg.match(G, s)
  if not t then error("syntax error", 2) end
  return eval(t)
end

-- small example
print(evalExp"3 + 5*9 / (1+1) - 12")   --> 13.5

The second style computes the expression value on the fly, without building the syntax tree. The following grammar takes this approach. (It assumes the same lexical elements as before.)

-- Auxiliary function
function eval (v1, op, v2)
  if (op == "+") then return v1 + v2
  elseif (op == "-") then return v1 - v2
  elseif (op == "*") then return v1 * v2
  elseif (op == "/") then return v1 / v2
  end
end

-- Grammar
local V = lpeg.V
G = lpeg.P{ "Exp",
  Exp = lpeg.Cf(V"Term" * lpeg.Cg(TermOp * V"Term")^0, eval);
  Term = lpeg.Cf(V"Factor" * lpeg.Cg(FactorOp * V"Factor")^0, eval);
  Factor = Number / tonumber + Open * V"Exp" * Close;
}

-- small example
print(lpeg.match(G, "3 + 5*9 / (1+1) - 12"))   --> 13.5

Note the use of the fold (accumulator) capture. To compute the value of an expression, the accumulator starts with the value of the first term, and then applies eval over the accumulator, the operator, and the new term for each repetition.
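Because the fold applies eval from left to right, operators of the same precedence come out left-associative. The sketch below repeats the on-the-fly grammar as a self-contained file and checks this. Wrapping the grammar as Space * G * -1 is an assumption carried over from the first style, added so that incomplete input fails instead of matching a prefix.

```lua
local lpeg = require "lpeg"

-- Lexical elements (same as in the first style)
local Space = lpeg.S(" \n\t")^0
local Number = lpeg.C(lpeg.P"-"^-1 * lpeg.R("09")^1) * Space
local TermOp = lpeg.C(lpeg.S("+-")) * Space
local FactorOp = lpeg.C(lpeg.S("*/")) * Space
local Open = "(" * Space
local Close = ")" * Space

local function eval (v1, op, v2)
  if op == "+" then return v1 + v2
  elseif op == "-" then return v1 - v2
  elseif op == "*" then return v1 * v2
  elseif op == "/" then return v1 / v2
  end
end

local V = lpeg.V
local G = lpeg.P{ "Exp",
  Exp = lpeg.Cf(V"Term" * lpeg.Cg(TermOp * V"Term")^0, eval);
  Term = lpeg.Cf(V"Factor" * lpeg.Cg(FactorOp * V"Factor")^0, eval);
  Factor = Number / tonumber + Open * V"Exp" * Close;
}
G = Space * G * -1   -- require the whole input to be an expression

print(G:match("10 - 4 - 3"))    --> 3   ((10 - 4) - 3, not 10 - (4 - 3))
print(G:match("2 * (3 + 4)"))   --> 14
print(G:match("2 +"))           --> nil (incomplete expression)
```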