1. DeepDive Application Structure
ls demo/
app.ddlog db.url deepdive.conf input labeling mindbender run udf
2. DeepDive Structure Explained
- app.ddlog is the DeepDive application's declarative program. It defines the data sources, the data schemas, the processing steps, and the construction of the KBC (knowledge base construction) pipeline.
- db.url defines the database connection URL.
- deepdive.conf holds the DeepDive runtime configuration; it normally does not need to be modified.
- input/ holds the input data files. Each file must be named after the relation it populates, as declared in app.ddlog; these files provide the application's source data.
- udf/ holds user-defined functions, which can be referenced from deepdive.conf by pathnames relative to the application root.
3. app.ddlog Definitions
1. Define the database table schemas; the input data files must be named after these relations, as shown below:
@source
articles(
    @key
    @distributed_by
    id text,
    @searchable
    content text
).

@source
dbdata(
    @key
    person1_name text,
    @key
    person2_name text
).
articles is an execution target: running it makes DeepDive look in the input/ directory for a data file whose name starts with articles and load it into the database according to the schema above. dbdata works the same way. First run the following command:
deepdive compile
app.ddlog must be recompiled after every modification; compilation generates the run/ directory under the application root. Then run:
deepdive do articles
deepdive do dbdata
2. NLP Processing
First define the schema that stores the NLP-annotated data:
@source
sentences(
    @key
    @distributed_by
    doc_id text,
    @key
    sentence_index int,
    @searchable
    sentence_text text,
    tokens text[],
    lemmas text[],
    pos_tags text[],
    ner_tags text[],
    doc_offsets int[],
    dep_types text[],
    dep_tokens int[]
).
Then declare the processing function, following DDlog syntax. Create the script nlp_markup.sh in the udf/ directory; it contains the logic that processes the article content.
function nlp_markup over (doc_id text, content text)
    returns rows like sentences
    implementation "udf/nlp_markup.sh" handles tsv lines.

sentences += nlp_markup(doc_id, content) :-
    articles(doc_id, content).
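DeepDive UDFs communicate over stdin/stdout as TSV lines: one input tuple per line in, one output tuple per line out. A minimal Python sketch of this protocol (the actual udf/nlp_markup.sh wraps an NLP toolkit such as Stanford CoreNLP; the naive sentence splitter below is only a stand-in, and it emits just the first three of the ten sentences columns for brevity):

```python
def nlp_markup(lines):
    """Toy stand-in for udf/nlp_markup.sh. DeepDive feeds input tuples as
    TSV lines on stdin and reads output tuples as TSV lines from stdout.
    This sketch splits each article into sentences and emits one row per
    sentence; a real implementation calls an NLP toolkit (e.g. CoreNLP)
    to also produce tokens, lemmas, POS/NER tags, and dependency parses."""
    for line in lines:
        doc_id, content = line.rstrip("\n").split("\t", 1)
        # Naive sentence splitting on '.' -- for illustration only.
        sentences = [s.strip() for s in content.split(".") if s.strip()]
        for index, text in enumerate(sentences):
            yield "\t".join([doc_id, str(index), text])
```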
Then run:
deepdive do sentences
3. Entity Extraction
First define the schema that stores the extracted person mentions:
@extraction
person_mention(
    @key
    mention_id text,
    @searchable
    mention_text text,
    @distributed_by
    @references(relation="sentences", column="doc_id", alias="appears_in")
    doc_id text,
    @references(relation="sentences", column="sentence_index", alias="appears_in")
    sentence_index int,
    begin_index int,
    end_index int
).
The processing function:
function map_person_mention over (
    doc_id text,
    sentence_index int,
    tokens text[],
    ner_tags text[]
) returns rows like person_mention
implementation "udf/map_person_mention.py" handles tsv lines.

person_mention += map_person_mention(
    doc_id, sentence_index, tokens, ner_tags
) :-
    sentences(doc_id, sentence_index, _, tokens, _, _, ner_tags, _, _, _).
The custom extraction logic lives in the script udf/map_person_mention.py.
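A minimal sketch of what udf/map_person_mention.py computes, assuming a mention is a maximal run of consecutive PERSON NER tags (the mention_id scheme below is illustrative, not the tutorial's exact format):

```python
def map_person_mention(doc_id, sentence_index, tokens, ner_tags):
    """Emit one person_mention row per maximal run of PERSON-tagged tokens.
    Output columns mirror the person_mention relation:
    (mention_id, mention_text, doc_id, sentence_index, begin_index, end_index)."""
    rows = []
    i = 0
    while i < len(ner_tags):
        if ner_tags[i] == "PERSON":
            begin = i
            while i < len(ner_tags) and ner_tags[i] == "PERSON":
                i += 1
            end = i - 1  # inclusive end index of the mention
            # Illustrative id scheme: doc, sentence, and token span.
            mention_id = "%s_%d_%d_%d" % (doc_id, sentence_index, begin, end)
            rows.append((mention_id, " ".join(tokens[begin:end + 1]),
                         doc_id, sentence_index, begin, end))
        else:
            i += 1
    return rows
```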
Then run:
deepdive do person_mention
4. Building Candidate Pairs
First define the schema that stores the candidate person pairs:
person_candidate(
    p1_id text,
    p1_name text,
    p2_id text,
    p2_name text
).
Candidate-generation rules:
# Count the person mentions per sentence (a view)
num_people(doc_id, sentence_index, COUNT(p)) :-
    person_mention(p, _, doc_id, sentence_index, _, _).

# Generate candidate pairs (the Cartesian product within a sentence)
person_candidate(p1, p1_name, p2, p2_name) :-
    num_people(same_doc, same_sentence, num_p),
    person_mention(p1, p1_name, same_doc, same_sentence, p1_begin, _),
    person_mention(p2, p2_name, same_doc, same_sentence, p2_begin, _),
    num_p < 5,
    p1_name != p2_name,
    p1_begin != p2_begin.
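The pairing logic above can be sketched in Python (a simplified illustration of what the DDlog rules compute, per sentence):

```python
from itertools import permutations

def person_candidates(mentions, max_people=5):
    """mentions: list of (mention_id, name, begin_index) tuples from one
    sentence. Mirrors the DDlog rules: skip crowded sentences entirely,
    then pair mentions with distinct names and distinct begin positions."""
    if len(mentions) >= max_people:
        return []  # the num_p < 5 filter
    pairs = []
    for (id1, name1, b1), (id2, name2, b2) in permutations(mentions, 2):
        if name1 != name2 and b1 != b2:
            pairs.append((id1, name1, id2, name2))
    return pairs
```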
Run:
deepdive do person_candidate
5. Feature Extraction
First define the schema that stores the feature set:
@extraction
transaction_feature(
    @key
    @references(relation="has_transaction", column="p1_id", alias="has_transaction")
    p1_id text,
    @key
    @references(relation="has_transaction", column="p2_id", alias="has_transaction")
    p2_id text,
    @key
    feature text
).
The user-defined function:
function extract_transaction_features over (
    p1_id text,
    p2_id text,
    p1_begin_index int,
    p1_end_index int,
    p2_begin_index int,
    p2_end_index int,
    doc_id text,
    sent_index int,
    tokens text[],
    lemmas text[],
    pos_tags text[],
    ner_tags text[],
    dep_types text[],
    dep_tokens int[]
) returns rows like transaction_feature
implementation "udf/extract_transaction_features.py" handles tsv lines.

transaction_feature += extract_transaction_features(
    p1_id, p2_id, p1_begin_index, p1_end_index, p2_begin_index, p2_end_index,
    doc_id, sent_index, tokens, lemmas, pos_tags, ner_tags, dep_types, dep_tokens
) :-
    person_mention(p1_id, _, doc_id, sent_index, p1_begin_index, p1_end_index),
    person_mention(p2_id, _, doc_id, sent_index, p2_begin_index, p2_end_index),
    sentences(doc_id, sent_index, _, tokens, lemmas, pos_tags, ner_tags, _, dep_types, dep_tokens).
udf/extract_transaction_features.py calls ddlib, DeepDive's Python utility library, to generate the textual feature set.
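A simplified sketch of the kind of features such a script emits. The real script uses ddlib's generic feature generators (n-grams, dependency paths, and more); the function name and feature string formats below are illustrative assumptions:

```python
def extract_features(p1_span, p2_span, lemmas):
    """p1_span/p2_span: (begin_index, end_index) token spans of the two
    mentions. Emit simple textual features for the pair, such as the lemma
    sequence between the mentions. The actual script produces a much richer
    feature set via ddlib; these two features are just common examples."""
    (b1, e1), (b2, e2) = sorted([p1_span, p2_span])
    features = []
    between = lemmas[e1 + 1:b2]  # lemmas strictly between the two mentions
    if between:
        features.append("WORD_SEQ_[%s]" % " ".join(between))
    features.append("NUM_WORDS_BETWEEN_[%d]" % len(between))
    return features
```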
Run:
deepdive do transaction_feature
6. Distant Supervision with Known Data and Heuristic Rules
First define the schema that stores the labels:
@extraction
transaction_label(
    @key
    @references(relation="has_transaction", column="p1_id", alias="has_transaction")
    p1_id text,
    @key
    @references(relation="has_transaction", column="p2_id", alias="has_transaction")
    p2_id text,
    @navigable
    label int,
    @navigable
    rule_id text
).
Supervision rules:
# Unlabeled candidates (default vote 0)
transaction_label(p1, p2, 0, NULL) :- person_candidate(p1, _, p2, _).

# Supervise with known pairs from dbdata
transaction_label(p1, p2, 3, "from_dbdata") :-
    person_candidate(p1, p1_name, p2, p2_name), dbdata(n1, n2),
    [ lower(n1) = lower(p1_name), lower(n2) = lower(p2_name) ;
      lower(n2) = lower(p1_name), lower(n1) = lower(p2_name) ].
# Declare the heuristic-rule function
function supervise over (
    p1_id text, p1_begin int, p1_end int,
    p2_id text, p2_begin int, p2_end int,
    doc_id text,
    sentence_index int,
    sentence_text text,
    tokens text[],
    lemmas text[],
    pos_tags text[],
    ner_tags text[],
    dep_types text[],
    dep_tokens int[]
) returns (
    p1_id text, p2_id text, label int, rule_id text
)
implementation "udf/supervise_transaction.py" handles tsv lines.
# Apply the heuristic supervision rules
transaction_label += supervise(
    p1_id, p1_begin, p1_end,
    p2_id, p2_begin, p2_end,
    doc_id, sentence_index, sentence_text,
    tokens, lemmas, pos_tags, ner_tags, dep_types, dep_token_indexes
) :-
    person_candidate(p1_id, _, p2_id, _),
    person_mention(p1_id, p1_text, doc_id, sentence_index, p1_begin, p1_end),
    person_mention(p2_id, p2_text, _, _, p2_begin, p2_end),
    sentences(doc_id, sentence_index, sentence_text,
        tokens, lemmas, pos_tags, ner_tags, _, dep_types, dep_token_indexes
    ).
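A hedged sketch of the shape of udf/supervise_transaction.py: each heuristic rule casts a vote on a candidate pair. The keyword lists and rule ids below are illustrative assumptions, not the tutorial's actual rules:

```python
def supervise(p1_id, p2_id, lemmas_between):
    """Emit (p1_id, p2_id, label, rule_id) votes for one candidate pair,
    based on the lemmas appearing between the two mentions.
    label > 0 votes for the relation, label < 0 votes against it.
    Keyword lists and rule ids here are hypothetical examples."""
    votes = []
    between = set(lemmas_between)
    if between & {"pay", "transfer", "lend"}:    # hypothetical positive cue
        votes.append((p1_id, p2_id, 1, "pos:transaction_verb_between"))
    if between & {"brother", "sister", "wife"}:  # hypothetical negative cue
        votes.append((p1_id, p2_id, -1, "neg:family_word_between"))
    return votes
```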
# Resolve each pair's final label by summing the rule votes
transaction_label_resolved(p1_id, p2_id, SUM(vote)) :-
    transaction_label(p1_id, p2_id, vote, rule_id).
Run:
deepdive do transaction_label
7. Relation Prediction
Declare has_transaction as the variable relation whose truth DeepDive will predict (the trailing ? marks it as a random variable):
@extraction
has_transaction?(
    @key
    @references(relation="person_mention", column="mention_id", alias="p1")
    p1_id text,
    @key
    @references(relation="person_mention", column="mention_id", alias="p2")
    p2_id text
).
Seed the variable with the resolved supervision labels:
has_transaction(p1_id, p2_id) = if l > 0 then TRUE
                                else if l < 0 then FALSE
                                else NULL end :- transaction_label_resolved(p1_id, p2_id, l).
# has_transaction(p1, p2) = NULL :- person_candidate(p1, _, p2, _).
# Inference rule: features as evidence, with learned per-feature weights
@weight(f)
has_transaction(p1_id, p2_id) :-
    person_candidate(p1_id, _, p2_id, _),
    transaction_feature(p1_id, p2_id, f).

# Inference rule: symmetry
@weight(3.0)
has_transaction(p1_id, p2_id) => has_transaction(p2_id, p1_id) :-
    person_candidate(p1_id, _, p2_id, _).
Run:
deepdive do probabilities
to obtain the final marginal probabilities for each candidate pair.