Compiler Principles: Syntax Analysis, CFG (Context-Free Grammar) Basics, Simple Syntax-Directed Translation, Parser

RzBu11d023r

Last modified 2022-05-17 23:14:30

Categories: Compiler Principles Basics, Random Thoughts. Tags: compiler principles, c++, CFG

First published 2022-05-17 19:46:53

Article link: https://blog.csdn.net/u010180372/article/details/124827333

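The context-free grammar for +, -, * and / follows two rules:

Precedence: the nonterminal for a lower-precedence operator produces the nonterminal for the higher-precedence operator (expr produces term, term produces factor).

Left associativity: place the recursive nonterminal at the left end of the production body.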

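The "torn apart" way of thinking:

A factor cannot be torn apart by any operator (that is, a sub-factor can never be split off from its factor and combined with something else).
A term can be torn apart by * and /, but not by + or -.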
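The two rules can be made visible with a small sketch. It assumes the usual three-level grammar (expr, term, factor) with single-digit operands and no whitespace; the function names `group`, `group_expr`, `group_term` and `group_factor` are made up for this illustration. Each function returns its input fully parenthesized, so you can see which operands the grammar glues together.

```cpp
#include <cctype>
#include <stdexcept>
#include <string>

// Assumed grammar (single-digit operands, no whitespace):
//   expr   -> expr + term | expr - term | term
//   term   -> term * factor | term / factor | factor
//   factor -> digit | ( expr )
// Left recursion is written as a loop, which keeps the left-to-right grouping.
std::string group_expr(const std::string& s, std::size_t& pos);

std::string group_factor(const std::string& s, std::size_t& pos) {
  if (pos < s.size() && s[pos] == '(') {
    ++pos;  // consume '('
    std::string inner = group_expr(s, pos);
    if (pos >= s.size() || s[pos] != ')') throw std::runtime_error("expected ')'");
    ++pos;  // consume ')'
    return inner;
  }
  if (pos < s.size() && std::isdigit(static_cast<unsigned char>(s[pos])))
    return std::string(1, s[pos++]);
  throw std::runtime_error("expected digit or '('");
}

std::string group_term(const std::string& s, std::size_t& pos) {
  std::string left = group_factor(s, pos);
  // Only * and / may split a term into factors.
  while (pos < s.size() && (s[pos] == '*' || s[pos] == '/')) {
    char op = s[pos++];
    std::string right = group_factor(s, pos);
    left = "(" + left + op + right + ")";
  }
  return left;
}

std::string group_expr(const std::string& s, std::size_t& pos) {
  std::string left = group_term(s, pos);
  // + and - split an expr into whole terms, never into factors.
  while (pos < s.size() && (s[pos] == '+' || s[pos] == '-')) {
    char op = s[pos++];
    std::string right = group_term(s, pos);
    left = "(" + left + op + right + ")";
  }
  return left;
}

// Parenthesize a complete expression; throws on malformed input.
std::string group(const std::string& s) {
  std::size_t pos = 0;
  std::string result = group_expr(s, pos);
  if (pos != s.size()) throw std::runtime_error("trailing input");
  return result;
}
```

With this sketch, `group("1+2*3")` keeps `2*3` together as one term, and `group("9-5+2")` groups from the left because the recursive nonterminal sits on the left of the body.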

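Thought one: writing a CFG is tedious, much like designing a state machine or database tables, so normal forms and conversion methods exist, just like what we've done with regex -> NFA -> DFA (Thompson's construction plus subset construction) and 1NF -> 2NF -> 3NF -> BCNF (decomposition, i.e. normalization, while preserving lossless joins and functional dependencies). For now, the expr/term/factor productions are simply worth memorizing. Below is a grammar that also recognizes unary operators: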

From <Exercises for Section 2.2 | Compilers Principles, Techniques, & Tools (purple dragon book) second edition exercise answers>

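In general we want to convert the grammar so that every production body derives only two symbols, i.e. Chomsky normal form (each body is either two nonterminals or a single terminal). I'll leave that aside for now; this post is mainly about getting familiar with deriving from the start (head) non-terminal all the way down to terminals.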

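I wrote this grammar in C++. Without reflection it is hard to read the grammar from a config file, unless I first write a compiler to serve as a code generator... well.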

{
  // terminals:
  def_terminal(num, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9);
  def_terminal(lowOp, "+", "-");
  def_terminal(minus, "-");
  def_terminal(highOp, "*", "/");
  def_terminal(lbra, "(");
  def_terminal(rbra, ")");
  // non-terminals:
  def_non_terminal(expr);
  def_non_terminal(term);
  def_non_terminal(unary);
  def_non_terminal(factor);
  // productions:
  typedef vector<anything *> p;
  produce(expr) >> p{expr, lowOp, term} | p{term};
  produce(term) >> p{term, highOp, unary} | p{unary};
  produce(unary) >> p{lbra, lowOp, factor, rbra} | p{factor};
  produce(factor) >> p{num} | p{lbra, expr, rbra};
  boolshit.addStart(expr);
}

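Unique derivations: being a CFG says nothing about whether a grammar is ambiguous, but in general an unambiguous grammar is much easier to work with. A CFG is ambiguous if one or more terminal strings have multiple leftmost derivations from the start symbol. Below is an ambiguous grammar.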
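As a minimal sketch of what ambiguity means in practice, assume the classic ambiguous grammar `string -> string + string | string - string | digit` (an assumption for this example). Each parse tree corresponds to exactly one leftmost derivation, so counting parse trees tells us whether a given string exposes the ambiguity. The function name `count_parse_trees` is hypothetical.

```cpp
#include <cctype>
#include <string>
#include <vector>

// Count the parse trees of s under the assumed ambiguous grammar
//   string -> string + string | string - string | digit
// A result greater than 1 means s has several leftmost derivations.
long long count_parse_trees(const std::string& s) {
  const std::size_t n = s.size();
  if (n == 0) return 0;
  // ways[i][j] = number of parse trees deriving s[i..j] (inclusive)
  std::vector<std::vector<long long>> ways(n, std::vector<long long>(n, 0));
  for (std::size_t i = 0; i < n; ++i)
    if (std::isdigit(static_cast<unsigned char>(s[i]))) ways[i][i] = 1;
  // The shortest compound string is "digit op digit", so start at length 3.
  for (std::size_t len = 3; len <= n; ++len) {
    for (std::size_t i = 0; i + len <= n; ++i) {
      const std::size_t j = i + len - 1;
      // Try every operator position k as the root of the tree.
      for (std::size_t k = i + 1; k < j; ++k)
        if (s[k] == '+' || s[k] == '-')
          ways[i][j] += ways[i][k - 1] * ways[k + 1][j];
    }
  }
  return ways[0][n - 1];
}
```

For `9-5+2` the count is 2: one tree groups it as (9-5)+2 and the other as 9-(5+2), which is exactly the problem the layered expr/term/factor grammar avoids.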

Prefix and postfix expressions (Polish and reverse Polish notation) are both languages that need no parentheses.
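To see why postfix needs no parentheses, here is a small stack-based evaluator, a sketch that assumes single-digit operands and the four binary operators only. The name `eval_postfix` is made up for this example. The position of each operator alone decides which operands it applies to.

```cpp
#include <cctype>
#include <stack>
#include <stdexcept>
#include <string>

// Evaluate a postfix string of single-digit operands and + - * /.
int eval_postfix(const std::string& s) {
  std::stack<int> st;
  for (char c : s) {
    if (std::isdigit(static_cast<unsigned char>(c))) {
      st.push(c - '0');  // operands are pushed as they appear
      continue;
    }
    if (c != '+' && c != '-' && c != '*' && c != '/')
      throw std::runtime_error("unexpected character");
    if (st.size() < 2) throw std::runtime_error("missing operand");
    // An operator always consumes the two most recent results.
    int rhs = st.top(); st.pop();
    int lhs = st.top(); st.pop();
    switch (c) {
      case '+': st.push(lhs + rhs); break;
      case '-': st.push(lhs - rhs); break;
      case '*': st.push(lhs * rhs); break;
      case '/':
        if (rhs == 0) throw std::runtime_error("division by zero");
        st.push(lhs / rhs);
        break;
    }
  }
  if (st.size() != 1) throw std::runtime_error("malformed expression");
  return st.top();
}
```

The infix expressions (9-5)+2 and 9-(5+2) become `95-2+` and `952+-`: two different strings, with no brackets required to tell them apart.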

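Below is a random sentence generator that derives top-down from a CFG. It is neither a compiler nor a parser; it only illustrates what a CFG derivation looks like in code. I'll study syntax-directed translation first and come back to those later: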

#include <algorithm>
#include <cstdlib> // rand, srand
#include <ctime>   // time
#include <deque>
#include <initializer_list>
#include <iostream>
#include <iterator>
#include <memory>
#include <random>
#include <regex>
#include <string>
#include <type_traits>
#include <unordered_map>
#include <unordered_set>
#include <utility>
#include <vector>
using namespace std;
// Anything a std::string can be constructed from (string literals, std::string, ...).
template <typename T>
concept StringArg = is_constructible_v<string, T>;
// One concrete spelling that a terminal may emit, e.g. "+" or "7".
struct token {
  string expr;
  operator string() { return expr; }
  template <StringArg T> token(T &&char_tr) : expr{char_tr} {}
  token(int i) : expr{to_string(i)} {}
};
// Common base of all grammar symbols; str holds the symbol's name.
struct anything {
  string str;
  virtual bool isTerminal() = 0;
  virtual ~anything() = default;
  template <StringArg T> anything(T &&char_tr) : str{char_tr} {}
};
struct terminal : anything {
  template <StringArg T> terminal(T &&char_tr) : anything{char_tr} {}
  bool isTerminal() override { return true; }
};
struct non_terminal : anything {
  template <StringArg T> non_terminal(T &&char_tr) : anything{char_tr} {}
  bool isTerminal() override { return false; }
};
struct grammar {

  template <typename K, typename V> using map_t = std::unordered_map<K, V>;
  template <typename V> using set_t = std::unordered_set<V>;
  typedef std::vector<token> tokens_t;

  // terminals
  map_t<anything *, tokens_t> map{};
  set_t<anything *> starts{};
  map_t<anything *, vector<vector<anything *>>> prods{};

  void addStart(anything *s) { starts.emplace(s); }

  void addProduction(anything *s, vector<anything *> a) {
    prods[s].emplace_back(a);
  }

  struct pipe {
    grammar *g;
    anything *t;
    friend const pipe &operator|(const pipe &p, vector<anything *> a) {
      p.g->addProduction(p.t, a);
      return p;
    }
    friend const pipe &operator>>(const pipe &p, vector<anything *> a) {
      p.g->addProduction(p.t, a);
      return p;
    }
  };

  pipe allow_produce(anything *s) { return {this, s}; }

  void addTerminal(anything *t, vector<token> s) {
    map[t].insert(map[t].end(), s.begin(), s.end());
  }

  const auto &produce(anything *from) { return prods[from]; }
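  // Leftmost derivation with random choices: keep replacing the leftmost
  // pending symbol until only terminals have been emitted. approx_len is a
  // soft target, not a hard limit.
  string random_produce(int approx_len = 20) {
    srand(static_cast<unsigned>(time(nullptr)));
    string result = "";
    // pick one of the start symbols at random
    auto ti = starts.begin();
    advance(ti, rand() % starts.size());
    anything *cur = *(ti);
    // pending symbols of the sentential form, leftmost at the front
    deque<anything *> q;
    q.push_back(cur);
    while (!q.empty()) {
      cur = q.front();
      q.pop_front();
      if (cur->isTerminal()) {
        // a terminal emits one of its tokens, chosen at random
        result =
            result + static_cast<string>(map[cur][rand() % map[cur].size()]);
      } else {
        const vector<vector<anything *>> &prds = produce(cur);
        const vector<anything *> *rd = nullptr;
        // past half the target length, prefer a body that starts with num
        // so that the derivation winds down
        if (result.size() > static_cast<size_t>(approx_len) / 2) {
          for (auto &prd : prds) {
            if (prd[0]->str == "num") {
              rd = &prd;
              break;
            }
          }
        }
        if (!rd) {
          rd = &prds[rand() % prds.size()];
        }
        // push the body in reverse so that its first symbol ends up leftmost
        for (auto r = rd->crbegin(); r != rd->crend(); r++) {
          q.push_front(*r);
        }
      }
    }
    return result;
  }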
};
struct pool {
  unordered_map<string, unique_ptr<terminal>> ts{};
  unordered_map<string, unique_ptr<non_terminal>> nts{};
  terminal *new_terminal(string name) {
    return (ts[name] = make_unique<terminal>(std::move(name))).get();
  }
  non_terminal *new_nonterminal(string name) {
    return (nts[name] = make_unique<non_terminal>(std::move(name))).get();
  }
  anything *operator[](const char *s) {
    if (ts.contains(s)) {
      return ts[s].get();
    } else {
      return nts[s].get();
    }
  }
};

grammar boolshit;
pool mypool;
anything *helper(string name, vector<token> ss) {
  auto re = mypool.new_terminal(name);
  boolshit.addTerminal(re, ss);
  return re;
}
anything *helper(string name) {
  auto re = mypool.new_nonterminal(name);
  return re;
}
auto produce(auto &&a) { return boolshit.allow_produce(a); }

#define def_terminal(name, ...) auto name = helper(#name, {__VA_ARGS__})
#define def_non_terminal(name) auto name = helper(#name)
int main(void) {
  // terminals:
  def_terminal(num, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9);
  def_terminal(lowOp, "+", "-");
  def_terminal(minus, "-");
  def_terminal(highOp, "*", "/");
  def_terminal(lbra, "(");
  def_terminal(rbra, ")");
  // non-terminals:
  def_non_terminal(expr);
  def_non_terminal(term);
  def_non_terminal(unary);
  def_non_terminal(factor);
  // productions:
  typedef vector<anything *> p;
  produce(expr) >> p{expr, lowOp, term} | p{term};
  produce(term) >> p{term, highOp, unary} | p{unary};
  produce(unary) >> p{lbra, lowOp, factor, rbra} | p{factor};
  produce(factor) >> p{num} | p{lbra, expr, rbra};
  boolshit.addStart(expr);
  auto s = boolshit.random_produce(10);
  cout << s << endl;
  return 0;
}

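This thing randomly generates four-operation arithmetic expressions. It could produce math drills for kindergarten pupils, with the answers computed by a solution to the LeetCode infix-expression problem....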

 ~/projects/test ./a.out             
(1+(+((+0)*5-(-1)))+(+7)/6+(+5)*9)/8*(+4)*4

Although the choices are random, the probability of generating parentheses is far too high....

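Now suppose we can already build a tree bottom-up from the input:

Then we can generate annotations; an annotation is essentially the output.

A simple syntax-directed definition for infix to postfix ("simple" meaning the translation only combines the translations of the parts in order, without producing anything new):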

Translation scheme

Translation generally either manipulates strings or executes program fragments (semantic actions).
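A minimal sketch of a translation scheme for infix to postfix, in the spirit of the dragon book's example but written for an assumed expr/term/factor grammar with single-digit operands. The class name `PostfixTranslator` and the function `to_postfix` are hypothetical. The semantic actions are the `out_ +=` statements: each one runs at the point where the book's `{ print(...) }` would appear in the production body, and no tree is built.

```cpp
#include <cctype>
#include <stdexcept>
#include <string>
#include <utility>

// Assumed translation scheme (single-digit operands, no whitespace):
//   expr   -> expr + term { print('+') } | expr - term { print('-') } | term
//   term   -> term * factor { print('*') } | term / factor { print('/') } | factor
//   factor -> digit { print(digit) } | ( expr )
class PostfixTranslator {
 public:
  explicit PostfixTranslator(std::string input) : in_(std::move(input)) {}

  std::string run() {
    expr();
    if (pos_ != in_.size()) throw std::runtime_error("trailing input");
    return out_;
  }

 private:
  std::string in_;
  std::string out_;
  std::size_t pos_ = 0;

  char peek() const { return pos_ < in_.size() ? in_[pos_] : '\0'; }

  void factor() {
    if (peek() == '(') {
      ++pos_;
      expr();
      if (peek() != ')') throw std::runtime_error("expected ')'");
      ++pos_;
    } else if (std::isdigit(static_cast<unsigned char>(peek()))) {
      out_ += in_[pos_++];  // action: print(digit)
    } else {
      throw std::runtime_error("expected digit or '('");
    }
  }

  void term() {
    factor();
    while (peek() == '*' || peek() == '/') {
      char op = in_[pos_++];
      factor();
      out_ += op;  // action runs after both operands have been emitted
    }
  }

  void expr() {
    term();
    while (peek() == '+' || peek() == '-') {
      char op = in_[pos_++];
      term();
      out_ += op;  // action runs after both operands have been emitted
    }
  }
};

std::string to_postfix(const std::string& infix) {
  return PostfixTranslator(infix).run();
}
```

Because every action fires after the operands of its operator, the output order is the postorder of the implicit parse tree.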

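Here the dragon book describes something similar to Lisp's car (Contents of the Address part of the Register) and cdr (Contents of the Decrement part of the Register).

A translation scheme generates code through a depth-first tree traversal (usually postorder: left, right, root) instead of annotating the tree. The tree does not even have to be built, which is why a directly hand-written recursive parser works. Hand-written recursive parsers are error-prone, though they are also more expressive. The tree is still built implicitly, and in practice a parser should usually be able to hand one over; clang, for example, lets you obtain the AST.

Yacc generates things from a translation scheme. (Am I just being too lazy to learn yacc first? Then again, I haven't reached yacc in the book yet, so I couldn't learn it anyway.) I think I now understand how yacc is used: you embed C code fragments in the grammar, and it runs them while parsing. That is the print mentioned above, except that where the example prints we would usually build an AST, i.e. fill in AST nodes. This is also why translation schemes are worth explaining.

Recursive-descent parsing: I've done plenty of it in CS61A.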

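Next the book brings up performance. For any CFG there is a parser that takes at most cubic time, O(n^3), on a string of n terminals (general algorithms such as CYK and Earley reach this bound; I haven't worked through the derivation). It then says that we can usually design a grammar that parses quickly, and that linear-time algorithms suffice for essentially all languages that arise in practice. The basic parsing methods are either top-down or bottom-up. Hand-written parsers are usually top-down; generated, general-purpose ones usually work bottom-up.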
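For a feel of where a cubic bound comes from, here is a sketch of the CYK recognizer, one general algorithm with O(n^3) running time in the input length. It is an illustration, not what the book presents at this point. CYK needs the grammar in Chomsky normal form. The struct names, `cyk_accepts`, and the sample grammar for a^n b^n (n >= 1) are all assumptions made for this example.

```cpp
#include <set>
#include <string>
#include <vector>

struct BinaryRule { char head, left, right; };  // head -> left right
struct UnitRule { char head, terminal; };       // head -> terminal

// CYK recognizer for a grammar in Chomsky normal form.
bool cyk_accepts(const std::string& w, char start,
                 const std::vector<BinaryRule>& bin,
                 const std::vector<UnitRule>& unit) {
  const std::size_t n = w.size();
  if (n == 0) return false;
  // table[i][len - 1] = nonterminals that derive w.substr(i, len)
  std::vector<std::vector<std::set<char>>> table(
      n, std::vector<std::set<char>>(n));
  for (std::size_t i = 0; i < n; ++i)
    for (const UnitRule& u : unit)
      if (u.terminal == w[i]) table[i][0].insert(u.head);
  // Three nested loops over length, start and split point give the n^3 factor.
  for (std::size_t len = 2; len <= n; ++len)
    for (std::size_t i = 0; i + len <= n; ++i)
      for (std::size_t split = 1; split < len; ++split)
        for (const BinaryRule& r : bin)
          if (table[i][split - 1].count(r.left) &&
              table[i + split][len - split - 1].count(r.right))
            table[i][len - 1].insert(r.head);
  return table[0][n - 1].count(start) > 0;
}

// Sample CNF grammar for a^n b^n, n >= 1:
//   S -> A B | A C,  C -> S B,  A -> a,  B -> b
bool is_anbn(const std::string& w) {
  static const std::vector<BinaryRule> bin = {
      {'S', 'A', 'B'}, {'S', 'A', 'C'}, {'C', 'S', 'B'}};
  static const std::vector<UnitRule> unit = {{'A', 'a'}, {'B', 'b'}};
  return cyk_accepts(w, 'S', bin, unit);
}
```

The table has O(n^2) cells and each cell tries O(n) split points, which is where the cubic factor comes from. Grammars designed for predictive or LR parsing avoid this search entirely.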

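Now let's write the parser, which really amounts to reviewing the CS61A Scheme lab.

First, we already have the tokens produced by lex or by the DFA from earlier, and the parser now reads those tokens.

Here the input is a sequence of tokens, and we have the following grammar rules:

One thing to be clear about first: reading a token is treated as reading a terminal.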

state: stmt

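The first terminal is for, and at this point the state (the start symbol) is stmt. After reading for (the lookahead symbol) we choose production 3, the one that begins with for, and go on to read the parenthesis, which is a terminal. The state queue now becomes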

state: optexpr ; optexpr ; optexpr ) stmt

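The remaining task is to finish the level-order traversal of this layer (which implies DFS == LRD == postorder tree traversal). I feel ready again: combined with what I learned in CS61A, I can now rewrite that bullshit CFG generator from earlier into a recursive-descent parser.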
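The walk-through above can be sketched as a predictive recursive-descent recognizer. This assumes the statement grammar is the one from the dragon book's predictive-parsing example, written out in the comment below. Tokens are plain strings, and the names `StmtParser` and `parse_stmt` are hypothetical. The lookahead token alone selects the production, exactly like choosing production 3 on seeing for.

```cpp
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Assumed grammar:
//   stmt    -> expr ;
//            | if ( expr ) stmt
//            | for ( optexpr ; optexpr ; optexpr ) stmt
//            | other
//   optexpr -> expr | (empty)
class StmtParser {
 public:
  explicit StmtParser(std::vector<std::string> tokens)
      : toks_(std::move(tokens)) {}

  // True when the whole token sequence is exactly one stmt.
  bool accepts() {
    try {
      stmt();
      return pos_ == toks_.size();
    } catch (const std::runtime_error&) {
      return false;
    }
  }

 private:
  std::vector<std::string> toks_;
  std::size_t pos_ = 0;

  const std::string& lookahead() const {
    static const std::string eof = "$";  // sentinel for end of input
    return pos_ < toks_.size() ? toks_[pos_] : eof;
  }

  // Consume the expected terminal or report a syntax error.
  void match(const std::string& t) {
    if (lookahead() != t) throw std::runtime_error("expected " + t);
    ++pos_;
  }

  // optexpr may derive the empty string, so matching nothing is fine.
  void optexpr() {
    if (lookahead() == "expr") match("expr");
  }

  void stmt() {
    const std::string la = lookahead();  // copy: match() advances the input
    if (la == "expr") {
      match("expr"); match(";");
    } else if (la == "if") {
      match("if"); match("("); match("expr"); match(")"); stmt();
    } else if (la == "for") {
      match("for"); match("(");
      optexpr(); match(";"); optexpr(); match(";"); optexpr();
      match(")"); stmt();
    } else if (la == "other") {
      match("other");
    } else {
      throw std::runtime_error("syntax error");
    }
  }
};

bool parse_stmt(const std::vector<std::string>& tokens) {
  return StmtParser(tokens).accepts();
}
```

The body of the for branch reads off the production left to right: terminals become `match` calls and nonterminals become procedure calls, so the call stack plays the role of the state queue shown above.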

---

To be continued
