Introduction to KB
The limits of text
initially the retrieval was keyword-based (text)
then it moved to entity-based (knowledge)
Knowledge is a familiarity, awareness or understanding of someone or something, such as facts, information, descriptions or skills.
build knowledge repositories from the web
- Manifest knowledge (accessible to humans)
  - knowledge bases or knowledge graphs
  - typically constructed manually or from unstructured sources
- Latent knowledge (hidden)
  - latent models or latent feature models
  - typically learned using machine learning techniques
Manifest knowledge
logic is the language that humans designed to express knowledge.
opinion: knowledge is something we can interpret without ambiguities.
knowledge bases: crystallisation of factual knowledge in the form of associations between entities and relations
- can be expressed as first-order logic
- more recently, Google rebranded knowledge bases as knowledge graphs
First-order logic, abbreviated FOL
Its building blocks are constant symbols, predicate symbols, function symbols, variables, connectives (∧ ∨ → ↔) and quantifiers (∃ ∀). For example:
Father(Mary) = Bob
father_of(Mary, Bob)
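The two lines above express the same fact in two ways: Father is a function symbol (it returns an individual), while father_of is a predicate symbol (it is true or false of a pair). A minimal Python sketch of that difference, where the dictionary, the set and the helper name `holds` are illustrative choices and not part of FOL itself:

```python
# Function symbol: Father maps each person to exactly one individual.
father = {"Mary": "Bob"}

# Predicate symbol: father_of is a relation, i.e. a set of (child, father) pairs.
father_of = {("Mary", "Bob")}

def holds(relation, *args):
    """Return True if the tuple of arguments belongs to the relation."""
    return tuple(args) in relation

print(father["Mary"] == "Bob")          # the term Father(Mary) denotes Bob
print(holds(father_of, "Mary", "Bob"))  # the atom father_of(Mary, Bob) is true
```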
Latent knowledge
Opinion: we do not need to be able to interpret knowledge, as long as it does what it is supposed to do.
- became popular due to deep learning
- e.g. Google’s word2vec
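To make "latent" concrete: a latent feature model stores each word or entity as a vector of numbers that have no individual meaning, and only comparisons between vectors are useful. The sketch below uses hand-picked toy vectors (an assumption for illustration, not real word2vec output) and plain cosine similarity.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy 3-dimensional embeddings, invented for this example.
vectors = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.85, 0.75, 0.2],
    "apple": [0.1, 0.2, 0.9],
}

# Related words end up closer together than unrelated ones.
print(cosine(vectors["king"], vectors["queen"]) > cosine(vectors["king"], vectors["apple"]))
```

None of the three coordinates can be read as a fact about "king"; the knowledge is only in the geometry, which is the contrast with manifest knowledge.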
Knowledge bases on the Web
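WordNet - the most famous; organised as sets of synonyms (synsets)
Each word is either monosemous (one sense) or polysemous (several senses).
Each synset has a gloss and is connected to other synsets through semantic relations. The most important ones are hypernyms (isA) and meronyms (partOf).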
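A small sketch of the structure described above, using a hand-written toy lexicon instead of the real WordNet database. The synset identifiers, glosses and links are invented for illustration; only the shape (words, gloss, hypernym link) follows the description.

```python
# Toy synsets: each has a set of synonymous words, a gloss and an isA link.
synsets = {
    "dog.n": {"words": {"dog", "domestic dog"},
              "gloss": "a domesticated canine", "hypernym": "canine.n"},
    "canine.n": {"words": {"canine", "canid"},
                 "gloss": "a member of the dog family", "hypernym": "mammal.n"},
    "mammal.n": {"words": {"mammal"},
                 "gloss": "a warm-blooded vertebrate", "hypernym": None},
    "bank.n.river": {"words": {"bank"},
                     "gloss": "sloping land beside water", "hypernym": None},
    "bank.n.finance": {"words": {"bank"},
                       "gloss": "a financial institution", "hypernym": None},
}

def senses(word):
    """All synsets that contain the word, in a stable order."""
    return sorted(sid for sid, s in synsets.items() if word in s["words"])

def is_polysemous(word):
    """A word is polysemous when it belongs to more than one synset."""
    return len(senses(word)) > 1

def hypernym_chain(synset_id):
    """Follow isA links upwards until a synset has no hypernym."""
    chain = []
    current = synsets[synset_id]["hypernym"]
    while current is not None:
        chain.append(current)
        current = synsets[current]["hypernym"]
    return chain

print(hypernym_chain("dog.n"))
print(is_polysemous("bank"), is_polysemous("dog"))
```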
RDF - Resource Description Framework
In essence a data model, expressed as subject-predicate-object (SPO) triples
An RDF dataset can be represented as a directed graph
Example:
<http://www.vu.nl> <rdf:type> <wikipedia/University>
An RDF graph contains three kinds of terms: Internationalized Resource Identifiers (IRIs), blank nodes and literals.
- The subject can be an IRI or a blank node.
- The predicate is an IRI.
- The object can be any of the three.
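The three position rules can be written down as a small type check. This is a sketch with made-up class names (`IRI`, `BlankNode`, `Literal`) and a hypothetical helper `is_valid_triple`; it is not the API of any RDF library.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class IRI:
    value: str

@dataclass(frozen=True)
class BlankNode:
    label: str

@dataclass(frozen=True)
class Literal:
    value: str

def is_valid_triple(s, p, o):
    """Check which kinds of terms are allowed in each position of a triple."""
    return (
        isinstance(s, (IRI, BlankNode))               # subject: IRI or blank node
        and isinstance(p, IRI)                        # predicate: IRI only
        and isinstance(o, (IRI, BlankNode, Literal))  # object: any of the three
    )

vu = IRI("http://www.vu.nl")
print(is_valid_triple(vu, IRI("rdf:type"), IRI("wikipedia/University")))
print(is_valid_triple(Literal("VU University"), IRI("rdf:label"), vu))
```

The second call is rejected because a literal appears in the subject position.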
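SPARQL - to query an RDF dataset
SQL-inspired syntax
Finding answers to a SPARQL query corresponds to finding all possible graph homomorphisms between the query and the graph.
Query examples:
SELECT: specifies the variables to return. To return all variables, write * instead.
WHERE: specifies the graph pattern to match. It plays a role analogous to WHERE in SQL.
FROM: specifies the RDF dataset to query.
PREFIX: declares abbreviations for IRIs.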
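What PREFIX does is plain string substitution: a prefixed name such as rdf:type is shorthand for a namespace IRI followed by a local name. A minimal sketch, where the function name `expand` is hypothetical and the only real-world value is the standard rdf namespace IRI:

```python
def expand(prefixed_name, prefixes):
    """Expand 'prefix:local' into a full IRI if the prefix is declared."""
    prefix, sep, local = prefixed_name.partition(":")
    if sep and prefix in prefixes:
        return prefixes[prefix] + local
    return prefixed_name  # not a declared prefix: leave unchanged

prefixes = {"rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#"}

print(expand("rdf:type", prefixes))
print(expand("http://www.vu.nl", prefixes))
```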
- A query without pattern matching (a plain graph pattern)
SELECT ?X ?Y WHERE {
?X <rdf:type> <wikipedia/University> .
?X <rdf:label> ?Y .
}
Example input:
<http://www.vu.nl> <rdf:type> <wikipedia/University> .
<http://www.vu.nl> <rdf:label> "VU University" .
_:x <http://www.vu.nl#studies> <http://www.vu.nl> .
Output:
{
?X -> <http://www.vu.nl>
?Y -> "VU University"
}
- To retrieve all the data, make the subject, predicate and object of the triple all variables
SELECT * WHERE {
?s ?p ?o
}
- Find the movies in which 周星驰 (Stephen Chow) has acted
* the value finally returned here is movieTitle
SELECT ?n WHERE {
?s rdf:type :Person.
?s :personName '周星驰'.
?s :hasActedIn ?o.
?o :movieTitle ?n
}
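SPARQL consists of two parts: a protocol and a query language.
A SPARQL query is essentially an RDF graph with variables.
In short, answering a SPARQL query takes three steps:
- Build the query graph pattern, expressed as RDF with variables.
- Match: find the subgraphs that fit the given graph pattern.
- Bind: bind the results to the corresponding variables of the query graph pattern.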
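The three steps (build a pattern, match it, bind variables) can be sketched as a naive matcher over a list of triples. This is an illustration of the idea only: terms are kept as plain strings, the function names are made up, and a real SPARQL engine parses the query and uses indexes instead of scanning every triple.

```python
def is_var(term):
    """Variables are written with a leading question mark, as in SPARQL."""
    return term.startswith("?")

def match_pattern(pattern, triple, binding):
    """Try to extend `binding` so that `pattern` maps onto `triple`."""
    new = dict(binding)
    for p_term, t_term in zip(pattern, triple):
        if is_var(p_term):
            # Bind the variable, or check consistency with an earlier binding.
            if new.setdefault(p_term, t_term) != t_term:
                return None
        elif p_term != t_term:
            return None  # constant in the pattern does not match the data
    return new

def evaluate(patterns, graph):
    """Return every variable binding that maps all patterns into the graph."""
    bindings = [{}]
    for pattern in patterns:
        next_bindings = []
        for binding in bindings:
            for triple in graph:
                extended = match_pattern(pattern, triple, binding)
                if extended is not None:
                    next_bindings.append(extended)
        bindings = next_bindings
    return bindings

# Step 1: the data and the query pattern, both taken from the example above.
graph = [
    ("<http://www.vu.nl>", "<rdf:type>", "<wikipedia/University>"),
    ("<http://www.vu.nl>", "<rdf:label>", '"VU University"'),
    ("_:x", "<http://www.vu.nl#studies>", "<http://www.vu.nl>"),
]
query = [
    ("?X", "<rdf:type>", "<wikipedia/University>"),
    ("?X", "<rdf:label>", "?Y"),
]

# Steps 2 and 3: match the pattern and bind the variables.
print(evaluate(query, graph))
```

Each returned dictionary is one homomorphism from the query graph into the data graph. The pattern `("?s", "?p", "?o")` matches every triple, which is why `SELECT * WHERE { ?s ?p ?o }` returns the whole dataset.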
DBpedia
Project to convert Wikipedia pages into RDF
Leverages structured content contained in the pages
Infoboxes
Labels
Categories
Redirects
Contains links to other KBs
Widely popular in the “linked-data-cloud”
Fairly large ontology but not rich in terms of expressiveness
320 classes
1650 properties
Alignment between infoboxes and ontologies is done via community-provided mappings
YAGO - high standard of quality
Goals
Unify Wikipedia and WordNet
Exploit Wikipedia Infoboxes to extract clean facts
Check the plausibility of facts via type checking
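Type checking as a plausibility test can be sketched as follows: each relation declares the type it expects for its subject and object, and an extracted fact is kept only if both entities have the right type. The relation name, type names and entity table below are toy assumptions, not YAGO's actual schema or extraction code.

```python
# Toy type assignments for a few entities (illustrative only).
entity_types = {
    "Albert_Einstein": {"person"},
    "Ulm": {"city"},
}

# Each relation has a signature: (expected subject type, expected object type).
signatures = {
    "bornIn": ("person", "city"),
}

def is_plausible(subject, relation, obj):
    """Accept a candidate fact only if both arguments have the expected type."""
    domain, range_ = signatures[relation]
    return (
        domain in entity_types.get(subject, set())
        and range_ in entity_types.get(obj, set())
    )

print(is_plausible("Albert_Einstein", "bornIn", "Ulm"))
print(is_plausible("Ulm", "bornIn", "Albert_Einstein"))
```

The second candidate is rejected because a city cannot be the subject of bornIn and a person cannot be its object.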
Freebase - collaborative KB
Wikidata - mainly text
Data is validated by the community
Keeps provenance of the data
Multilingual by design
Supports plurality
BabelNet - merges WordNet and Wikipedia
linguistic community
can only be accessed through APIs
Three tasks
- Combine WordNet and Wikipedia by establishing a mapping between them
- Harvest multilingual lexicalizations (using Wikipedia inter-language links and machine translation)
- Establish relations between WordNet synsets
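Moving from keyword recognition to entity recognition means that the search engine has to understand the content of the text, and to turn data into knowledge we must build a knowledge repository.
Knowledge is either manifest or latent. Manifest knowledge can be stored in knowledge bases or knowledge graphs; latent knowledge, being hidden, requires latent models or latent feature models. Manifest knowledge is typically constructed manually or from unstructured sources, whereas latent knowledge is typically learned through machine learning.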