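Installation
I used Anaconda to set up the virtual environments and ran everything in Visual Studio 2022.
The first environment I configured used Python 3.6. Installing with the command pip3 install tree_sitter failed with an error:
I then set up Python 3.8 (see screenshot) and 3.9 environments, and the installation succeeded in both:
I chose to use tree-sitter under Python 3.9.
I still don't know why it would not install under 3.6... but in the end it got installed!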
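Since the cause of the 3.6 failure was never pinned down, one small thing that can help when comparing conda environments is to record the interpreter version and whether tree_sitter is importable in each of them. The following is only a diagnostic sketch: the helper name describe_env is made up for illustration, and it does not explain why the install failed.

```python
import importlib.util
import sys


def describe_env(package="tree_sitter"):
    """Return the interpreter version and whether `package` can be imported.

    `describe_env` is a hypothetical helper for illustration; it is not
    part of tree-sitter.
    """
    version = "{}.{}.{}".format(*sys.version_info[:3])
    # find_spec returns None when a top-level package is not installed
    installed = importlib.util.find_spec(package) is not None
    return {"python": version, "package": package, "installed": installed}


if __name__ == "__main__":
    print(describe_env())
```

Running this once in each environment (3.6, 3.8, 3.9) shows at a glance which ones actually have the package.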
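Usage
I followed the official tree-sitter tutorial and tried it out:
I first cloned the official git repository locally and then ran the example from inside that package.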
from tree_sitter import Language, Parser

Language.build_library(
    # Store the library in the `build` directory
    'build/my-languages.so',
    # Include one or more languages
    [
        'vendor/tree-sitter-go',
        'vendor/tree-sitter-javascript',
        'vendor/tree-sitter-python'
    ]
)

GO_LANGUAGE = Language('build/my-languages.so', 'go')
JS_LANGUAGE = Language('build/my-languages.so', 'javascript')
PY_LANGUAGE = Language('build/my-languages.so', 'python')
Running this example produced an error:
Traceback (most recent call last):
  File "/Users/symbolk/coding/analysis/treesitter/py-tree-sitter/builder.py", line 1, in <module>
    from tree_sitter import Language, Parser
  File "/Users/symbolk/coding/analysis/treesitter/py-tree-sitter/tree_sitter/__init__.py", line 9, in <module>
    from tree_sitter.binding import _language_field_id_for_name, _language_query
ModuleNotFoundError: No module named 'tree_sitter.binding'
Searching online, I found that the script must not be run from inside the py-tree-sitter source directory:
import tree_sitter will try to import the git clone and not the version installed from PyPI (because . is the first entry in sys.path). But this does not work because tree_sitter.binding is a native module that has to be compiled first.
After I recreated the project in a different location, the error really did go away!!
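The shadowing described in the quote can be checked before running anything. The sketch below looks for a local tree_sitter source folder in the directory that Python searches first. The helper name is_shadowed is hypothetical, and the check only covers the simple case of a folder containing an __init__.py.

```python
import os
import sys


def is_shadowed(package, directory):
    """Return True if `directory` holds a source folder named `package`.

    Hypothetical helper: when a script is run from such a directory,
    Python finds this folder before the copy installed from PyPI.
    """
    return os.path.isfile(os.path.join(directory, package, "__init__.py"))


if __name__ == "__main__":
    # For a script, sys.path[0] is the script's own directory;
    # in an interactive session it is '' (the current directory).
    first_entry = sys.path[0] or os.getcwd()
    print("first sys.path entry:", first_entry)
    if is_shadowed("tree_sitter", first_entry):
        print("a local tree_sitter/ folder will be imported "
              "instead of the installed package")
```

If the warning is printed, moving the script to a directory outside the clone is the same fix that worked above.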
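Tokenizing with the syntax tree, as in GraphCodeBERT
While looking for tutorials online I came across the following code. I had read the GraphCodeBERT paper before [though I have forgotten most of it..], so I tried running it.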
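from tree_sitter import Language, Parser


def tree_to_token_index(root_node):
    # A node counts as a token if it is a leaf or a string node; comments are skipped.
    if (len(root_node.children) == 0 or 'string' in root_node.type) and root_node.type != 'comment':
        return [(root_node.start_point, root_node.end_point)]
    else:
        # Otherwise collect the tokens of all children, in source order.
        code_tokens = []
        for child in root_node.children:
            code_tokens += tree_to_token_index(child)
        return code_tokens


def index_to_code_token(index, code):
    # index is a (start_point, end_point) pair; each point is (row, column).
    # code is the source text split into lines.
    start_point = index[0]
    end_point = index[1]
    if start_point[0] == end_point[0]:
        # Token lies on a single line.
        s = code[start_point[0]][start_point[1]:end_point[1]]
    else:
        # Token spans several lines: tail of the first line,
        # all lines in between, head of the last line.
        s = ""
        s += code[start_point[0]][start_point[1]:]
        for i in range(start_point[0]+1, end_point[0]):
            s += code[i]
        s += code[end_point[0]][:end_point[1]]
    return s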


if __name__ == '__main__':
    # Each vendor/ directory is a local clone of the corresponding grammar repository.
    Language.build_library(
        # Store the library in the `build1` directory
        'build1/my-languages.so',
        # Include one or more languages
        [
            'vendor/tree-sitter-go',
            'vendor/tree-sitter-javascript',
            'vendor/tree-sitter-cpp',
            'vendor/tree-sitter-python',
        ]
    )

    GO_LANGUAGE = Language('build1/my-languages.so', 'go')
    JS_LANGUAGE = Language('build1/my-languages.so', 'javascript')
    PY_LANGUAGE = Language('build1/my-languages.so', 'python')
    CPP_LANGUAGE = Language('build1/my-languages.so', 'cpp')

    cpp_parser = Parser()
    cpp_parser.set_language(CPP_LANGUAGE)

    # Note: this snippet is not valid C++ (typos kept as in the original test);
    # tree-sitter is error-tolerant and still returns a tree.
    cpp_code_snippet = '''
int mian{
piantf("hell world");
remake O;
}
'''
    tree = cpp_parser.parse(bytes(cpp_code_snippet, "utf8"))
    root_node = tree.root_node

    # Collect the (start, end) positions of every token, then slice the text out.
    tokens_index = tree_to_token_index(root_node)
    cpp_loc = cpp_code_snippet.split('\n')
    code_tokens = [index_to_code_token(x, cpp_loc) for x in tokens_index]
    print(code_tokens)
Running the script prints the resulting list of tokens:
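Separately from that run, the two steps (collect leaf spans, then slice the text by row and column) can be tried without tree-sitter or a compiled grammar. The sketch below uses a hand-built stand-in for the parse tree. FakeNode, leaf_spans, span_text, and demo_tokens are all hypothetical names for illustration, and the node types and positions are written by hand rather than produced by a real parser.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Point = Tuple[int, int]  # (row, column), the shape used for start_point/end_point


@dataclass
class FakeNode:
    """Hypothetical stand-in for a tree-sitter node (illustration only)."""
    type: str
    start_point: Point
    end_point: Point
    children: List["FakeNode"] = field(default_factory=list)


def leaf_spans(node):
    """Collect spans of leaves and string nodes, skipping comments."""
    if (len(node.children) == 0 or "string" in node.type) and node.type != "comment":
        return [(node.start_point, node.end_point)]
    spans = []
    for child in node.children:
        spans += leaf_spans(child)
    return spans


def span_text(span, lines):
    """Slice the text covered by one (start, end) span out of the source lines."""
    (start_row, start_col), (end_row, end_col) = span
    if start_row == end_row:
        return lines[start_row][start_col:end_col]
    parts = [lines[start_row][start_col:]]
    parts += lines[start_row + 1:end_row]
    parts.append(lines[end_row][:end_col])
    return "".join(parts)


def demo_tokens():
    source = 'x = "hi"  # note'
    # Hand-written tree: the string node has children (its quotes),
    # but is still kept as a single token.
    root = FakeNode("module", (0, 0), (0, 16), [
        FakeNode("identifier", (0, 0), (0, 1)),
        FakeNode("=", (0, 2), (0, 3)),
        FakeNode("string", (0, 4), (0, 8), [
            FakeNode('"', (0, 4), (0, 5)),
            FakeNode('"', (0, 7), (0, 8)),
        ]),
        FakeNode("comment", (0, 10), (0, 16)),
    ])
    lines = source.split("\n")
    return [span_text(span, lines) for span in leaf_spans(root)]


if __name__ == "__main__":
    print(demo_tokens())  # ['x', '=', '"hi"']
```

The comment is dropped and the string stays whole, which is the behavior of the condition in tree_to_token_index. As in the original function, a token that spans several lines is joined without its line breaks, because the source was split on '\n' beforehand.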