Writing a New Indexer

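This document outlines the steps for adding support for a new language to Kythe.
We assume that you have extracted a Kythe release package to /opt/kythe. You can also build the tools from source (but you do not need to build Kythe in order to provide it with graph data).
The example code snippets are written in JavaScript, but this document is not about indexing any particular language.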

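In the Kythe pipeline, a language's indexer is responsible for building a subgraph that represents a particular program.
A complete indexer usually accepts a .kzip file, which contains the program, all of its dependencies, and the arguments a compiler or interpreter needs to understand that program.
This data is packaged by a separate component called an extractor.
Depending on the language and build system involved, it may be possible to use a generic extractor to produce these hermetic compilation units. We will not discuss extraction here.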

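For development and testing, it is useful for an indexer to accept program text directly as input. That is how we will proceed in these instructions.
First, we will write a script that inserts file content into a small Kythe graph.
From there, we will see how to encode nodes and edges as entries, the unit of exchange between many of our tools.
We will see that some kinds of nodes represent semantic objects common to programming languages, while others represent syntactic spans of text.
We will then add relationships between these nodes, as edges, to put cross-reference data into the graph. This lets users jump between definitions and references in the programs we index.
Finally, we will discuss how to write tests for Kythe indexers (and how to debug them).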

Bootstrapping Kythe support

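An indexer emits directed graph data as a stream of entries, each of which represents either a node or an edge.
Entries have several encodings, but for simplicity we will use JSON.
First, let's write a script, kythe-browse.sh,
that turns a stream of JSON-formatted Kythe entries into a format that our sample code browser can read.
Put it in your Kythe root directory; it will clobber the directories //graphstore and //tables.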

#!/bin/bash -e
set -o pipefail # fail a pipeline if any command in it exits non-zero
BROWSE_PORT="${BROWSE_PORT:-8080}" # defaults to 8080 when unset or empty

# Prebuilt binaries are available at
#
# https://github.com/kythe/kythe/releases/tag/v0.0.30
#

# This script assumes that they are installed to /opt/kythe.
# If you build the tools yourself or install them to a different location,
# make sure to pass the correct public_resources directory to http_server.
rm -f -- graphstore/* tables/* # "--" ends option parsing, so names starting with "-" are treated as files
mkdir -p graphstore tables

# Read JSON entries from standard in to a graphstore.
/opt/kythe/tools/entrystream \
  --read_format=json | \
  /opt/kythe/tools/write_entries \
  -graphstore graphstore

# Convert the graphstore to serving tables.
/opt/kythe/tools/write_tables \
  -graphstore graphstore \
  -out=tables

echo -e "\nhttp://localhost:${BROWSE_PORT}\n"
# Host the browser UI.
# ":${BROWSE_PORT}" allows access from other machines
/opt/kythe/tools/http_server \
  -public_resources /opt/kythe/web/ui \
  -serving_table tables \
  -listen="localhost:${BROWSE_PORT}"

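Tip
The protocol buffer encoding of Kythe facts is more efficient than the JSON encoding we use here.
Kythe supports JSON because some languages do not have good support for protocol buffers.
The difference in efficiency only really matters for languages that emit large amounts of data, such as C++.
The entrystream tool invoked in kythe-browse.sh reads a stream of JSON entries from standard input
and emits a stream of varint32-delimited kythe.proto.Entry messages on standard output.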
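To make "varint32-delimited" concrete, here is a minimal node.js sketch of that kind of framing. It frames arbitrary byte payloads rather than real kythe.proto.Entry messages (serializing those would need a protocol buffer library), and every function name in it is illustrative, not part of any Kythe tool.

```javascript
// Sketch of varint32 length-delimited framing. The payloads stand in for
// serialized kythe.proto.Entry messages; all names here are hypothetical.

// Encode a non-negative 32-bit integer as a base-128 varint.
function encodeVarint32(n) {
  const bytes = [];
  let v = n >>> 0;
  while (v >= 0x80) {
    bytes.push((v & 0x7f) | 0x80); // low 7 bits, continuation bit set
    v >>>= 7;
  }
  bytes.push(v); // final byte, continuation bit clear
  return Buffer.from(bytes);
}

// Decode a varint starting at offset; returns the value and the next offset.
function decodeVarint32(buf, offset) {
  let value = 0;
  let shift = 0;
  let pos = offset;
  for (;;) {
    if (pos >= buf.length) throw new Error('truncated varint');
    const b = buf[pos++];
    value |= (b & 0x7f) << shift;
    if ((b & 0x80) === 0) break;
    shift += 7;
  }
  return { value: value >>> 0, next: pos };
}

// Prefix each payload with its length and concatenate the results.
function frame(payloads) {
  const parts = [];
  payloads.forEach(function (p) {
    parts.push(encodeVarint32(p.length), p);
  });
  return Buffer.concat(parts);
}

// Split a framed stream back into its payloads.
function unframe(stream) {
  const out = [];
  let pos = 0;
  while (pos < stream.length) {
    const len = decodeVarint32(stream, pos);
    out.push(stream.subarray(len.next, len.next + len.value));
    pos = len.next + len.value;
  }
  return out;
}
```

The length prefix is what lets a reader find message boundaries in a binary stream, which JSON gets for free from its own syntax.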

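You can test this with a very short entry stream.
The only tricky part is that Kythe fact values are Base64-encoded when serialized to JSON.
This ensures that they can be deserialized correctly later,
because fact values may contain arbitrary binary data, whereas JSON strings may contain only UTF-8 characters.
ZmlsZQ== is file, and SGVsbG8sIHdvcmxkIQ== is Hello, world!.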

echo '
{"source":{"corpus":"example","path":"hello"},
 "fact_name":"/kythe/node/kind","fact_value":"ZmlsZQ=="}
{"source":{"corpus":"example","path":"hello"},
 "fact_name":"/kythe/text","fact_value":"SGVsbG8sIHdvcmxkIQ=="}
' | ./kythe-browse.sh

You can check that http://localhost:8080/#hello?corpus=example shows ‘Hello, world!’.
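The Base64 values in the test stream can be checked with a few lines of node.js. The sketch below parses the same two entries from a string (one JSON object per line, for simplicity) and decodes each fact_value; decodeEntries is a hypothetical helper, not part of Kythe.

```javascript
// Hypothetical helper: parse newline-delimited JSON entries and decode the
// Base64 fact values back to text (assumes the values are UTF-8 strings).
function decodeEntries(jsonLines) {
  return jsonLines
    .split('\n')
    .filter(function (line) { return line.trim() !== ''; })
    .map(function (line) {
      const entry = JSON.parse(line);
      return {
        path: entry.source.path,
        fact_name: entry.fact_name,
        value: Buffer.from(entry.fact_value, 'base64').toString('utf8'),
      };
    });
}

const sample = [
  '{"source":{"corpus":"example","path":"hello"},"fact_name":"/kythe/node/kind","fact_value":"ZmlsZQ=="}',
  '{"source":{"corpus":"example","path":"hello"},"fact_name":"/kythe/text","fact_value":"SGVsbG8sIHdvcmxkIQ=="}',
].join('\n');

decodeEntries(sample).forEach(function (e) {
  console.log(e.fact_name + ' = ' + e.value);
});
// prints:
//   /kythe/node/kind = file
//   /kythe/text = Hello, world!

// Encoding goes the other way:
console.log(Buffer.from('file').toString('base64')); // prints ZmlsZQ==
```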

Modeling Kythe entries

A Kythe graph can be encoded using two basic data types.
The first is called a VName, which uniquely picks out a node in the graph. VNames have five string-valued fields. The second is the entry: entries record facts about individual nodes and the edges between them. As described in the documentation, we only need to emit the forward versions of edges (those described in the schema). The Kythe pipeline takes care of generating reverse edges as needed for efficiency.
We’ll encode VNames and entries in a straightforward way; in particular, we represent entries as objects, where the target’s presence or absence determines if the entry represents an edge between nodes or a fact about a single node (respectively). Our fact and edge convenience functions also assume that all of the fact and edge names we’ll use are underneath the /kythe prefix, since we’re following the Kythe schema. This prefix is a requirement of the schema, not of the data model.

function vname(signature, path, language, root, corpus) {
  return {
    signature: signature,
    path: path,
    language: language,
    root: root,
    corpus: corpus,
  };
}
function fact(node, fact_name, fact_val) {
  return {
    source: node,
    fact_name: "/kythe/" + fact_name,
    fact_value: base64enc(fact_val),
  };
}
function edge(source, edge_name, target) {
  return {
    source: source,
    edge_kind: "/kythe/edge/" + edge_name,
    target: target,
    fact_name: "/",
  };
}
function ordinal_edge(source, edge_name, target, ordinal) {
  return {
    source: source,
    edge_kind: "/kythe/edge/" + edge_name + "." + ordinal,
    target: target,
    fact_name: "/",
  };
}

You can follow along at home with node.js and the following definitions:

function base64enc(string) {
  return Buffer.from(string).toString('base64');
}
function emitEntries(entries) {
  entries.forEach(function(v){console.log(JSON.stringify(v))});
}
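As a quick check of this representation, the sketch below builds one fact and one ordinal edge and tells them apart by the presence of target. The two nodes (a function and its parameter) and the language name "ex" are made up for illustration, and isEdge is a hypothetical helper; the helpers are repeated in compact form so the snippet runs on its own.

```javascript
// Compact copies of the helpers above, trimmed to what this sketch needs.
function base64enc(string) {
  return Buffer.from(string).toString('base64');
}
function vname(signature, path, language, root, corpus) {
  return { signature: signature, path: path, language: language, root: root, corpus: corpus };
}
function fact(node, fact_name, fact_val) {
  return { source: node, fact_name: '/kythe/' + fact_name, fact_value: base64enc(fact_val) };
}
function ordinal_edge(source, edge_name, target, ordinal) {
  return {
    source: source,
    edge_kind: '/kythe/edge/' + edge_name + '.' + ordinal,
    target: target,
    fact_name: '/',
  };
}

// Hypothetical helper: an entry with a target is an edge; otherwise it is a fact.
function isEdge(entry) {
  return entry.target !== undefined;
}

// Made-up nodes: a function and its first parameter.
const fn = vname('f', 'hello', 'ex', '', 'example');
const param = vname('f#x', 'hello', 'ex', '', 'example');

const entries = [
  fact(fn, 'node/kind', 'function'),
  ordinal_edge(fn, 'param', param, 0),
];
entries.forEach(function (e) {
  console.log(isEdge(e) ? 'edge ' + e.edge_kind : 'fact ' + e.fact_name);
});
// prints:
//   fact /kythe/node/kind
//   edge /kythe/edge/param.0
```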

With this representation, our example database becomes:

[
  fact(vname("", "hello", "", "", "example"), "node/kind", "file"),
  fact(vname("", "hello", "", "", "example"), "text", "Hello, world!")
]
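Putting the pieces together, a complete script might look like the sketch below. The file name hello.js and the fileEntries wrapper are assumptions made for illustration; the helpers repeat the definitions above so that the script runs on its own.

```javascript
// hello.js (hypothetical file name): emit the example database as JSON entries.
function base64enc(string) {
  return Buffer.from(string).toString('base64');
}
function vname(signature, path, language, root, corpus) {
  return { signature: signature, path: path, language: language, root: root, corpus: corpus };
}
function fact(node, fact_name, fact_val) {
  return { source: node, fact_name: '/kythe/' + fact_name, fact_value: base64enc(fact_val) };
}
function emitEntries(entries) {
  entries.forEach(function (v) { console.log(JSON.stringify(v)); });
}

// Hypothetical wrapper: the kind and text facts for one file node.
function fileEntries(corpus, path, text) {
  const file = vname('', path, '', '', corpus);
  return [
    fact(file, 'node/kind', 'file'),
    fact(file, 'text', text),
  ];
}

emitEntries(fileEntries('example', 'hello', 'Hello, world!'));
```

Piping the output into the browse script (node hello.js | ./kythe-browse.sh) should then serve the same file as the hand-written test stream.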

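VNames have an alternate, URI-style encoding. VNames encoded this way are called tickets; tickets and VNames are semantically interchangeable. This encoding is used where it is inconvenient or impossible to store VNames in a more structured format. You can use Kythe URIs when interacting with the Kythe command-line tool: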

/opt/kythe/tools/kythe -api './tables' nodes 'kythe://example?path=hello'

Output

kythe://example?path=hello
  /kythe/node/kind      file
  /kythe/text   Hello, world!

kythe://example?path=hello is the URI encoding of the VName used in the example above.
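The mapping from a VName to a ticket can be sketched as follows. The layout used here (corpus first, then lang, path, and root as ?-separated attributes, then #signature) reflects the Kythe URI specification as understood here, and the escaping is simplified to encodeURIComponent, so treat toTicket as an illustrative assumption and check the specification before relying on it.

```javascript
// Hypothetical helper: render a VName as a Kythe URI ("ticket").
// Assumptions: empty fields are omitted, and escaping is simplified; the
// real escaping rules in the Kythe URI specification may differ.
function toTicket(v) {
  let uri = 'kythe:';
  if (v.corpus) uri += '//' + encodeURIComponent(v.corpus);
  if (v.language) uri += '?lang=' + encodeURIComponent(v.language);
  if (v.path) uri += '?path=' + encodeURIComponent(v.path);
  if (v.root) uri += '?root=' + encodeURIComponent(v.root);
  if (v.signature) uri += '#' + encodeURIComponent(v.signature);
  return uri;
}

const helloFile = { signature: '', path: 'hello', language: '', root: '', corpus: 'example' };
console.log(toTicket(helloFile)); // prints kythe://example?path=hello
```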

File content

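Kythe stores file content in its graph. The http_server binary used in the kythe-browse.sh script does not look in the filesystem for files to present to the web browser; instead, it reads the text fact from graph nodes.

Because every node in the graph has a VName, we need to be able to build a VName for any source file that an indexer might refer to. In our small example above, the test file had the path hello in the corpus example. How you determine the corpus (and possibly the root) that a node belongs to is up to you. It is best to keep this configurable; other Kythe indexers use a vnames.json file to choose VName fields based on regular expressions over paths.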
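A configurable path-to-VName mapping might be sketched like this. The rule shape (a pattern plus a vname template in which @1@ stands for the first capture group) is modeled on the vnames.json files mentioned above, but the matching details are an assumption, and the rules and paths below are invented for illustration.

```javascript
// Hypothetical rules in the spirit of vnames.json: the first rule whose
// pattern matches the whole path wins, and @N@ is replaced by capture group N.
const rules = [
  {
    pattern: 'third_party/([^/]+)/(.*)',
    vname: { corpus: 'third_party', root: '@1@', path: '@2@' },
  },
  { pattern: '(.*)', vname: { corpus: 'example', path: '@1@' } },
];

// Build a file VName for a path, or return null when no rule matches.
function fileVName(path, ruleList) {
  for (const rule of ruleList) {
    const m = new RegExp('^(?:' + rule.pattern + ')$').exec(path);
    if (!m) continue;
    const out = { signature: '', path: '', language: '', root: '', corpus: '' };
    Object.keys(rule.vname).forEach(function (field) {
      out[field] = rule.vname[field].replace(/@(\d+)@/g, function (_, n) {
        return m[Number(n)] || '';
      });
    });
    return out;
  }
  return null;
}

console.log(fileVName('third_party/zlib/inflate.c', rules));
console.log(fileVName('hello', rules));
```

Keeping the rules in data rather than in code means the same indexer can be pointed at a different repository layout without being rebuilt.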

All Kythe graph nodes should have a node/kind fact. For files, this kind is file. This means that each file should have at least two associated facts. You can see the JSON representation of the resulting entries above, where we used them to test the kythe-browse.sh script.

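Note
The Kythe JSON representation requires fact values to be Base64-encoded. The protocol buffer representation does not, but it stores fact values as bytes rather than as the string type. Protocol buffer strings must be valid UTF-8, and not every file in the graph can be encoded as UTF-8 (although that is the default). An alternate encoding can be specified using the text/encoding fact.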
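To illustrate why fact values are treated as bytes, the sketch below builds the facts for a file whose content is not valid UTF-8. It assumes that the schema's text/encoding fact is the right place to record the encoding; fileFacts, the file name menu.txt, and the encoding label are hypothetical.

```javascript
// Hypothetical helper: facts for a file node whose content is given as raw
// bytes. The fact name text/encoding is assumed from the Kythe schema.
function fileFacts(source, bytes, encoding) {
  function byteFact(name, valueBytes) {
    return {
      source: source,
      fact_name: '/kythe/' + name,
      fact_value: valueBytes.toString('base64'),
    };
  }
  const facts = [
    byteFact('node/kind', Buffer.from('file')),
    byteFact('text', bytes), // Base64 keeps the bytes intact inside JSON
  ];
  if (encoding) {
    facts.push(byteFact('text/encoding', Buffer.from(encoding)));
  }
  return facts;
}

// "café" in Latin-1: the final byte 0xE9 is not valid UTF-8 on its own.
const latin1Bytes = Buffer.from('caf\u00e9', 'latin1');
const menuFacts = fileFacts({ corpus: 'example', path: 'menu.txt' }, latin1Bytes, 'ISO-8859-1');
menuFacts.forEach(function (f) { console.log(JSON.stringify(f)); });
```

Decoding the text fact with Buffer.from(value, 'base64') returns the original four bytes, which would not survive being stored directly in a JSON string.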