openGauss Study Notes: Keyword Handling in Lexical Analysis

XL_up

Last modified: 2023-10-13 22:11:34

First published: 2023-10-13 22:03:02

Article link: https://blog.csdn.net/XL_up/article/details/133818938

Introduction

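In the previous post I gave a brief walkthrough of the flex source file scan.l. When the flex-generated scanner performs lexical analysis on an SQL query, it also has to recognize and match keywords: it checks whether an identifier is a keyword and, if so, maps it to the corresponding token. The file kwlookup.cpp maps a keyword to its token, while keywords.cpp holds the list of openGauss's standard keywords. This post walks through the files involved in keyword handling during lexical analysis.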

File paths

kwlookup.cpp: src/common/backend/parser/kwlookup.cpp

keywords.cpp: src/common/backend/parser/keywords.cpp

kwlookup.cpp

This file defines a single function.

const ScanKeyword* ScanKeywordLookup(const char* text, const ScanKeyword* keywords, int num_keywords)

Let's start with the function's header comment and source code:

/*
 * ScanKeywordLookup - see if a given word is a keyword
 *
 * Returns a pointer to the ScanKeyword table entry, or NULL if no match.
 *
 * The match is done case-insensitively.  Note that we deliberately use a
 * dumbed-down case conversion that will only translate 'A'-'Z' into 'a'-'z',
 * even if we are in a locale where tolower() would produce more or different
 * translations.  This is to conform to the SQL99 spec, which says that
 * keywords are to be matched in this way even though non-keyword identifiers
 * receive a different case-normalization mapping.
 *
 * (Annotation: in other words, keywords are always matched with this
 * ASCII-only lowercase mapping, whereas ordinary identifiers may be
 * case-normalized by a different, locale-aware rule.)
 */
const ScanKeyword* ScanKeywordLookup(const char* text, const ScanKeyword* keywords, int num_keywords)
{
    /*
     * Parameters:
     *     text: the identifier to look up
     *     keywords: pointer to the keyword table
     *     num_keywords: number of entries in the keyword table
     * Returns:
     *     a pointer to the matching keyword table entry, or NULL if no match
     */
    int len, i;
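    /*
     * Buffer for the lowercased copy of text. In upstream PostgreSQL,
     * NAMEDATALEN is defined in pg_config_manual.h (64 by default) as the
     * maximum identifier length including the terminating '\0'; openGauss is
     * assumed here to follow the same convention.
     */
    char word[NAMEDATALEN] = {0};
    const ScanKeyword* low = NULL;     // lower bound of the binary search
    const ScanKeyword* high = NULL;    // upper bound of the binary search
    if (text == NULL) {    // NULL input: nothing to match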
        return NULL;
    }
    len = strlen(text);
    /* We assume all keywords are shorter than NAMEDATALEN. */
    /* So if text is NAMEDATALEN characters or longer, it cannot be a keyword. */
    if (len >= NAMEDATALEN) {
        return NULL;
    }
    /*
     * Apply an ASCII-only downcasing.    We must not use tolower() since it may
     * produce the wrong translation in some locales (eg, Turkish).
     */
    for (i = 0; i < len; i++) {
        char ch = text[i];
        if (ch >= 'A' && ch <= 'Z') {
            ch += 'a' - 'A';
        }
        word[i] = ch;
    }
    word[len] = '\0';    // word[] now holds the lowercased copy of text
    /*
     * Now do a binary search using plain strcmp() comparison.
     */
    low = keywords;
    high = keywords + (num_keywords - 1);
    while (low <= high) {
        const ScanKeyword* middle = NULL;
        int difference;
        middle = low + (high - low) / 2;
        difference = strcmp(middle->name, word);
        if (difference == 0) {
            return middle;
        } else if (difference < 0) {
            low = middle + 1;
        } else {
            high = middle - 1;
        }
    }
    // Binary search failed: there is no match, so return NULL.
    return NULL;
}

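As you can see, the function body is fairly simple. It takes the text to match, a pointer to the standard keyword table, and the number of table entries as input. It lowercases the text and then binary-searches the table: on a match it returns a pointer to the corresponding entry, otherwise it returns NULL.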
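To make the flow concrete, here is a minimal, self-contained sketch of the same steps (ASCII-only lowercasing, then binary search with strcmp()). It is not the openGauss code: DemoKeyword, kDemoKeywords, kDemoMaxLen, demoKeywordLookup and the token values are all made up for illustration, and the sketch searches by array index rather than by pointer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical, simplified stand-in for ScanKeyword.
struct DemoKeyword {
    const char* name;  // keyword text, stored in lowercase
    int value;         // made-up token code
};

// Toy table; it must be sorted by name in strcmp() order.
static const DemoKeyword kDemoKeywords[] = {
    {"abort", 1},
    {"add", 2},
    {"select", 3},
    {"table", 4},
};
static const int kNumDemoKeywords =
    static_cast<int>(sizeof(kDemoKeywords) / sizeof(kDemoKeywords[0]));

// Stand-in for NAMEDATALEN (an assumption, not the real definition).
static const std::size_t kDemoMaxLen = 64;

const DemoKeyword* demoKeywordLookup(const char* text) {
    if (text == nullptr) {
        return nullptr;
    }
    std::size_t len = std::strlen(text);
    if (len >= kDemoMaxLen) {
        return nullptr;  // too long to be any keyword in the table
    }

    // ASCII-only downcasing: only 'A'-'Z' are touched, independent of locale.
    char word[kDemoMaxLen] = {0};
    for (std::size_t i = 0; i < len; i++) {
        char ch = text[i];
        if (ch >= 'A' && ch <= 'Z') {
            ch = static_cast<char>(ch + ('a' - 'A'));
        }
        word[i] = ch;
    }

    // Binary search over the sorted table using plain strcmp().
    int low = 0;
    int high = kNumDemoKeywords - 1;
    while (low <= high) {
        int middle = low + (high - low) / 2;
        int difference = std::strcmp(kDemoKeywords[middle].name, word);
        if (difference == 0) {
            return &kDemoKeywords[middle];
        } else if (difference < 0) {
            low = middle + 1;
        } else {
            high = middle - 1;
        }
    }
    return nullptr;
}

// Usage example: mixed-case input still matches; unknown words do not.
void demoKeywordLookupChecks() {
    assert(demoKeywordLookup("SeLeCt") != nullptr);
    assert(demoKeywordLookup("SeLeCt")->value == 3);
    assert(demoKeywordLookup("my_column") == nullptr);
}
```

Because the table stores names in lowercase and the input is lowercased first, a plain byte-wise strcmp() is enough; no case-insensitive comparison function is needed.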

keywords.cpp

The standard keyword list lives in this file.

/*    keywords.cpp     */
const ScanKeyword ScanKeywords[] = {
#include "parser/kwlist.h"
};
const int NumScanKeywords = lengthof(ScanKeywords);

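The file is short: it defines only the standard keyword table ScanKeywords and an integer that records its number of entries. The contents of ScanKeywords come from the header parser/kwlist.h. In addition, keywords.h defines the structure of a keyword table entry as follows: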

/*    keywords.h    */
typedef struct ScanKeyword {
    const char* name; /* keyword text, in lowercase */
    int16 value;      /* value of the corresponding token */
    int16 category;   /* keyword category */
} ScanKeyword;

The category field takes one of the following values:

/*     keywords.h    */
#define UNRESERVED_KEYWORD 0
#define COL_NAME_KEYWORD 1
#define TYPE_FUNC_NAME_KEYWORD 2
#define RESERVED_KEYWORD 3
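As a rough illustration of what the four category codes mean, the sketch below maps each code to a short description. The descriptions follow how upstream PostgreSQL describes its keyword categories; that openGauss's grammar treats each category identically is an assumption, and demoCategoryDescription is a hypothetical helper, not an openGauss function.

```cpp
#include <cassert>
#include <cstring>

// Same values as in the keywords.h excerpt above.
#define UNRESERVED_KEYWORD 0
#define COL_NAME_KEYWORD 1
#define TYPE_FUNC_NAME_KEYWORD 2
#define RESERVED_KEYWORD 3

// Hypothetical helper: describe a category code in words.
// Wording borrowed from upstream PostgreSQL (assumed to carry over).
const char* demoCategoryDescription(int category) {
    switch (category) {
        case UNRESERVED_KEYWORD:
            return "unreserved";
        case COL_NAME_KEYWORD:
            return "unreserved (cannot be function or type name)";
        case TYPE_FUNC_NAME_KEYWORD:
            return "reserved (can be function or type name)";
        case RESERVED_KEYWORD:
            return "reserved";
        default:
            return "unknown";
    }
}
```

The practical point is that the lexer only reports which keyword it saw; the category tells the grammar how freely that word may still be used as an identifier.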

Opening kwlist.h, we can see that it defines the standard keyword table entry for every openGauss keyword:

/*    kwlist.h    */
PG_KEYWORD("abort", ABORT_P, UNRESERVED_KEYWORD)
PG_KEYWORD("absolute", ABSOLUTE_P, UNRESERVED_KEYWORD)
PG_KEYWORD("access", ACCESS, UNRESERVED_KEYWORD)
PG_KEYWORD("account", ACCOUNT, UNRESERVED_KEYWORD)
PG_KEYWORD("action", ACTION, UNRESERVED_KEYWORD)
PG_KEYWORD("add", ADD_P, UNRESERVED_KEYWORD)
PG_KEYWORD("admin", ADMIN, UNRESERVED_KEYWORD)
……
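The PG_KEYWORD lines only become array entries once a PG_KEYWORD macro is defined before kwlist.h is included. The keywords.cpp excerpt above does not show that definition; upstream PostgreSQL's keywords.c defines it as `{a,b,c},`, and the sketch below assumes openGauss does the same. To stay self-contained, it replaces the `#include` with a list macro (DEMO_KWLIST); DemoScanKeyword, DemoToken and the token values are made up.

```cpp
#include <cassert>
#include <cstring>

// Made-up token codes; in the real parser these come from the grammar.
enum DemoToken { ABORT_P = 258, ABSOLUTE_P, ACCESS };

#define UNRESERVED_KEYWORD 0

// Hypothetical stand-in for ScanKeyword (plain int instead of int16).
struct DemoScanKeyword {
    const char* name;
    int value;
    int category;
};

// Stand-in for the contents of kwlist.h: a list of PG_KEYWORD(...) lines.
#define DEMO_KWLIST                                        \
    PG_KEYWORD("abort", ABORT_P, UNRESERVED_KEYWORD)       \
    PG_KEYWORD("absolute", ABSOLUTE_P, UNRESERVED_KEYWORD) \
    PG_KEYWORD("access", ACCESS, UNRESERVED_KEYWORD)

// Assumed definition: each PG_KEYWORD line expands to one initializer.
#define PG_KEYWORD(a, b, c) {a, b, c},

static const DemoScanKeyword DemoScanKeywords[] = {
    DEMO_KWLIST
};
#undef PG_KEYWORD

static const int NumDemoScanKeywords =
    static_cast<int>(sizeof(DemoScanKeywords) / sizeof(DemoScanKeywords[0]));
```

Keeping the list in one header and redefining the macro at each use site lets the same keyword list be expanded into different tables without duplicating it.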

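It is easy to see that the entries are sorted in lexicographic order of their names. This ordering is the property that makes the binary search possible.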
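Since the binary search silently misses entries if the list is out of order, a small check like the following can confirm that a list is in strictly ascending strcmp() order. This is only a sketch: demoIsSortedForLookup, kDemoSortedNames and kDemoUnsortedNames are hypothetical names, and the sample uses just the seven keywords shown above.

```cpp
#include <cassert>
#include <cstring>

// Returns true if names are in strictly ascending strcmp() order,
// which is the order the binary search relies on.
bool demoIsSortedForLookup(const char* const* names, int count) {
    for (int i = 1; i < count; i++) {
        if (std::strcmp(names[i - 1], names[i]) >= 0) {
            return false;  // out of order, or a duplicate entry
        }
    }
    return true;
}

// The first entries of kwlist.h, in the order shown above.
static const char* const kDemoSortedNames[] = {
    "abort", "absolute", "access", "account", "action", "add", "admin",
};

// A deliberately broken list for comparison.
static const char* const kDemoUnsortedNames[] = {"add", "abort"};
```

Note that strcmp() compares bytes, so the list has to be sorted in byte order, not in a locale-dependent collation order.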