Zhishi.me - Weaving Chinese Linking Open Data

1. Challenges

  1. Managing the heterogeneity of knowledge in different data sources.
  2. Efficiently discovering relations between millions of instances.

2. Contributions

  1. We take the three largest Chinese encyclopedias mentioned above as our original data and extract structured information from them. In total, about 6.6 million lemmas, as well as corresponding detailed descriptions such as abstracts, infoboxes, page links, categories, etc., are parsed and presented as RDF triples.
  2. Since these three encyclopedias are operated independently and have overlaps, we integrate them as a whole by constructing <owl:sameAs> relations between every two encyclopedias. Some parallel instance-level matching techniques are employed to achieve this goal.
  3. In order to make connections with existing linked data and build a bridge between the English knowledge base and the Chinese knowledge base, we also use <owl:sameAs> to link resources in CLOD to the ones in DBpedia, a central hub in LOD.
  4. The first Chinese Linking Open Data in the world has been published as RDF triples on the Web via Zhishi.me. Finally, the paper draws conclusions and outlines future work.

3. Semantic Data Extraction

3.1 Extraction Approach

  1. Wikipedia: Wikipedia provides database backup dumps, which embed all wiki articles in the form of wikitext source and metadata in XML. We employ an extraction algorithm similar to DBpedia's to reveal structured content from infobox templates as well as their instances.
  2. Baidu Baike and Hudong Baike provide WYSIWYG (what you see is what you get) HTML editors, so all information has to be extracted from HTML file archives.
  3. Twelve types of article content are extracted: abstracts, aliases, categories, disambiguation, external links, images, infobox properties, internal links (page links), labels, redirects, related pages and resource IDs (a minimal triple-emission sketch is given after this list):
  1. Abstracts: All three encyclopedias have separate abstract or summary sections. (zhishi:abstracts)
  2. Aliases: In Wikipedia, editors can customize the title of an internal link. Users cannot rename internal links in Hudong Baike and Baidu Baike, so aliases are not included in these two sources. (zhishi:aliases)
  3. Categories: Categories describe the subjects of a given article; dcterms:subject is used to attach them to the corresponding resources in Zhishi.me. Categories have hyponymy relations among themselves, which are represented using skos:broader.
  4. Disambiguation:
    (1) Hudong Baike and Wikipedia use disambiguation pages to resolve the conflicts. However, since disambiguation pages are written in natural language, it is not convenient to identify every meaning of a homonym accurately. Thus we only consider a topic that is linked from a disambiguation page and whose label shares the same primary term as a valid meaning. For example, "Jupiter (mythology)" and "Jupiter Island" share the same primary term: "Jupiter".
    (2) Baidu Baike puts all meanings of a homonym, as well as their descriptions, in a single page. We put every meaning in a pair of square brackets and append it to the primary term as the final Zhishi.me resource name, such as "Term[meaning-1]".
    (zhishi:pageDisambiguates)
  5. External Links: Encyclopedia articles may include links to web pages outside the original website. (zhishi:externalLink)
  6. Images: All thumbnails' information, including image URLs and their labels, is extracted. (zhishi:thumbnail)
  7. Infobox Properties: An infobox is a table that presents some featured properties of the given article. These property names are assigned IRIs of the form http://zhishi.me/[sourceName]/property/[propertyName].
  8. Internal Links: An internal link is a hyperlink in an encyclopedia page that references another encyclopedia page on the same website. (zhishi:internalLink)
  9. Labels: Hudong Baike and Wikipedia article titles are used directly as labels for the corresponding Zhishi.me resources. (rdfs:label)
  10. Redirects: All three encyclopedias use redirects to solve the synonymy problem. Since Wikipedia is a global encyclopedia while the other two are encyclopedias from Mainland China, Wikipedia contains redirects between Simplified and Traditional Chinese articles. There are rules of word-to-word correspondence between Simplified and Traditional Chinese, so we simply convert all Traditional Chinese into Simplified Chinese by these rules and omit redirects of this kind. (zhishi:pageRedirects)
  11. Related pages: The "Related pages" sections in Baidu Baike and Hudong Baike articles are similar to the "see also" section of a Wikipedia article, but they always have fixed positions and belong to fixed HTML classes. (zhishi:relatedPage)
  12. Resource IDs: Resource IDs for Wikipedia articles and most Baidu Baike articles are just the page IDs. Because every Zhishi.me resource for a homonym in Baidu Baike is newly generated, such resources are assigned special values (negative integers). Articles from Hudong Baike have no page IDs, so we assign them private numbers. (zhishi:resourceID)
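
To make the extraction step concrete, below is a minimal sketch (Python with rdflib) of how one parsed article could be emitted as RDF triples. The parse-result format, the namespace IRIs and the article_to_triples helper are illustrative assumptions, not the paper's actual code.

```python
# A minimal sketch of emitting extracted article content as RDF triples.
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDFS, DCTERMS

ZHISHI = Namespace("http://zhishi.me/ontology/")            # assumed ontology prefix
BAIDU = Namespace("http://zhishi.me/baidubaike/resource/")  # assumed resource prefix

def article_to_triples(article: dict) -> Graph:
    """Turn one parsed encyclopedia article into a small RDF graph."""
    g = Graph()
    s = BAIDU[article["title"]]
    g.add((s, RDFS.label, Literal(article["title"], lang="zh")))
    g.add((s, ZHISHI.abstracts, Literal(article["abstract"], lang="zh")))
    for cat in article.get("categories", []):
        g.add((s, DCTERMS.subject, BAIDU[cat]))
    for link in article.get("internal_links", []):
        g.add((s, ZHISHI.internalLink, BAIDU[link]))
    return g

# Toy usage with a hand-written parse result:
g = article_to_triples({
    "title": "木星",
    "abstract": "木星是太阳系中最大的行星。",
    "categories": ["行星"],
    "internal_links": ["太阳系"],
})
print(g.serialize(format="turtle"))
```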

3.2 Extraction Results and Discussions

All data sources have their own characteristics; nevertheless, they represent subjects in a similar manner, which makes it possible to integrate these attributes.
Far fewer resources have infobox information than have categories. Unfortunately, the quality of these categories is not very high, because encyclopedia editors usually choose category names casually and many of them are not used frequently. Thus we adopt some Chinese word segmentation techniques to refine these categories, and then choose some common categories to map to YAGO categories manually.
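
The paper does not spell out the segmentation step; the following is only a rough sketch of how noisy category names might be segmented and pruned with the jieba library. The frequency threshold and the keep-the-frequent-head-word heuristic are assumptions made purely for illustration.

```python
# Rough illustration: refine rarely used category names by Chinese word
# segmentation, keeping frequent head words (assumed heuristic, not the
# paper's actual method).
from collections import Counter
import jieba

raw_categories = ["著名美国科幻电影", "科幻电影", "美国电影", "电影"]

# Segment every category name and count word frequencies across the corpus.
word_counts = Counter(w for c in raw_categories for w in jieba.cut(c))

def refine(category: str, min_count: int = 2) -> str:
    """Drop infrequent modifiers and keep the frequent head word."""
    words = [w for w in jieba.cut(category) if word_counts[w] >= min_count]
    return words[-1] if words else category  # fall back to the original name

print({c: refine(c) for c in raw_categories})
```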

4. Data-level Mapping among Different Datasets

Baidu Baike, Hudong Baike and Wikipedia have their own adherents. Most of the time, users edit a given article based on their personal knowledge, which leads to heterogeneous descriptions. Mapping articles with various description styles can help to integrate these separate data sources as a whole. At the same time, we try to bridge the gap between our Chinese knowledge base and the English one (Linking Open Data). Descriptions of the same subject in different languages can supplement each other. The next two subsections introduce the methods we use to achieve these goals.

4.1 Mappings within CLOD

Mappings within CLOD are considered at two levels:

  1. One is the practice of schema-level ontology matching.
  2. The other aims at matching instances, and we mainly focus on this kind of mapping discovery.

Not all existing instance matching algorithms are suitable for finding <owl:sameAs> links between large-scale and heterogeneous encyclopedias:
KnoFuss: needs instance data represented as consistent OWL ontologies; however, our extracted semantic data does not meet this requirement.
Raimond et al.: proposed an interlinking algorithm that takes into account both the similarities of web resources and of their neighbors, but it has only been shown to work on a very small test set.
Silk is a well-known link discovery framework, which indexes resources before detailed comparisons are performed. Pre-matching by indexes can dramatically reduce the time complexity on large datasets, so we also match our encyclopedia articles based on this principle.
Simply indexing resources by their labels has some potential problems. One is that the same label may not represent the same subject: different subjects sharing the same label is quite common. The other is the opposite: the same subject may have different labels in some cases. These two situations would affect the precision and the recall of matching, respectively.

Using Original Labels: The first index generation strategy simply uses the original labels. This strategy normally has high precision, except that it suffers from the problem of homonyms. Fortunately, we extract different meanings of homonyms as different resources, as introduced in Section 2.1. In other words, if all homonyms are recognized, it is impossible to find two resources with different meanings under the same label. This fact ensures the correctness of the strategy.
There is no denying that the performance of this method depends on the quality of the existing encyclopedia articles, i.e. whether the titles are ambiguous.

Punctuation Cleaning: As for the second problem, discovering mappings between resources with different labels, one of the most efficient methods we use is punctuation cleaning. Figure 2 of the paper shows examples of the same entity having different labels due to different usage of Chinese punctuation marks; three of them are listed below. These cases can be handled by the punctuation cleaning method.

(1) 肖申克的救赎 = 《肖申克的救赎》 (The Shawshank Redemption, with and without guillemets)
(2) 海尔波普彗星 = 海尔·波普彗星 = 海尔-波普彗星 (Comet Hale-Bopp, with an interpunct, a hyphen, or neither)
(3) 奋进号航天飞机 = “奋进号”航天飞机 (Space Shuttle Endeavour, with and without quotation marks)

  1. Some Chinese encyclopedias encourage editors to use guillemets (《》) to indicate the title of a book, film, album, etc. However, guillemets are not required to be part of titles.
  2. In Chinese, we often insert an interpunct (·) between two personal name components. In certain cases, people may insert a hyphen instead or simply adjoin these components.
  3. According to the usage of Chinese punctuation marks, it is good practice to quote a cited name with double quotation marks (“”). However, it is not a mandatory requirement.
    Punctuation marks may have special meanings when they constitute a high proportion of the whole label string. So we calculate the similarity between two labels using Levenshtein distance and attach a penalty if the strings are too short (see the sketch below).
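
A minimal sketch of this label comparison, assuming a simple cleaning regex and a fixed penalty for short strings; the exact cleaning rules, penalty form and threshold are not given in the paper.

```python
# Punctuation cleaning plus Levenshtein-based label similarity with a penalty
# for very short strings (penalty value and threshold are assumptions).
import re

PUNCT = r'[《》“”"·\-—()()\[\]]'

def clean(label: str) -> str:
    """Strip Chinese/Western punctuation marks from a label."""
    return re.sub(PUNCT, "", label)

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def similarity(a: str, b: str, short_len: int = 4, penalty: float = 0.2) -> float:
    a, b = clean(a), clean(b)
    longest = max(len(a), len(b)) or 1
    score = 1.0 - levenshtein(a, b) / longest
    if longest < short_len:          # very short labels are easy to confuse
        score -= penalty
    return max(score, 0.0)

print(similarity("海尔波普彗星", "海尔·波普彗星"))  # 1.0 after punctuation cleaning
```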

Extending Synonyms: The third strategy we use in index generation also deals with the problem of linking resources with different labels. It makes use of high-quality synonym relations obtained from redirect information (A redirects to B means A and B are synonyms). We can treat redirect relations as approximate <owl:sameAs> relations temporarily and thereupon find more links based on the transitive property of <owl:sameAs>. Usually, the title of a redirect target page is the standard name, so we only link two resources by their standard names to avoid redundancy. Resources with aliases can still connect to other data sources via pageRedirects.
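
One way to picture this transitive extension is a union-find structure over redirect pairs and label-based match candidates. The sketch below uses toy identifiers and is only an illustration of the idea, not the paper's implementation.

```python
# Close redirects (approximate owl:sameAs edges) transitively with union-find.
class UnionFind:
    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)

# (alias, standard name) redirect pairs inside each source (toy data).
redirects = [
    ("baidu/UNESCO", "baidu/联合国教科文组织"),
    ("hudong/联合国教育科学文化组织", "hudong/联合国教科文组织"),
]
# Cross-source candidates found by the label index (standard names only).
label_matches = [("baidu/联合国教科文组织", "hudong/联合国教科文组织")]

uf = UnionFind()
for a, b in redirects + label_matches:
    uf.union(a, b)

# Transitivity: the Baidu alias and the Hudong alias end up in one class.
print(uf.find("baidu/UNESCO") == uf.find("hudong/联合国教育科学文化组织"))  # True
```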

Since our dataset is very large, the matching still has great time and space complexity even if we adopt the pre-matching-by-index method. We utilize a distributed MapReduce framework to accomplish this work. All resources are sorted by their index term in the map procedure, so similar resources naturally gather together and wait for detailed comparison. In practice, approximately 24 million index terms in total are generated from our data sources. This distributed algorithm makes it easier to discover links across more datasets because pairwise comparisons between every two datasets are avoided.
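
A single-machine analogue of this map/shuffle/reduce flow might look like the sketch below: emit (index term, resource) pairs, group by term, and only compare resources that share a term. The resource tuples and index terms are toy assumptions.

```python
# Single-machine analogue of index-based pre-matching: "map" emits
# (index term, resource) pairs, a sort/group plays the shuffle, and the
# "reduce" step compares only resources sharing a term (toy data).
from itertools import combinations, groupby

resources = [
    ("baidu/肖申克的救赎", "肖申克的救赎"),
    ("hudong/《肖申克的救赎》", "肖申克的救赎"),  # same index term after punctuation cleaning
    ("zhwiki/刺激1995", "刺激1995"),
]

# Map: emit (index term, resource id), then sort by term (the shuffle).
pairs = sorted((term, rid) for rid, term in resources)

# Reduce: within each index-term group, run the detailed comparison.
links = []
for term, group in groupby(pairs, key=lambda p: p[0]):
    ids = [rid for _, rid in group]
    for a, b in combinations(ids, 2):
        links.append((a, "owl:sameAs", b))  # kept if the label similarity check passes

print(links)
```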
We say two resources are integrated if they are linked by the <owl:sameAs> property. Thus two data sources have an intersection when they provide descriptions for mutual things. The number of links found by our mapping strategies is reflected in Figure 3 of the paper; it confirms the heterogeneity of these three data sources. The original 6.6 million resources are merged into nearly 5 million distinct resources, while only a small proportion of them (168,481, or 3.4%) are shared by all three sources.

4.2 Linking CLOD with LOD

DBpedia, a structured version of Wikipedia, is now one of the central data sources in Linking Open Data. Mapping resources between Zhishi.me and DBpedia is a cross-lingual instance matching problem, which remains to be solved. Ngai et al. tried to use a bilingual corpus to align WordNet and HowNet; however, their experimental results showed that the mapping accuracy, 66.3%, is still not satisfactory.
Fu et al. emphasized that translations play a critical role in discovering cross-lingual ontology mappings. Therefore, we make use of the high-quality Chinese-English mapping table at hand: Wikipedia interlanguage links. Wikipedia interlanguage links are used to link the same concepts between two Wikipedia language editions, and this information can be extracted directly from wikitext.
Since DBpedia's resource names are simply taken from the URLs of Wikipedia articles, linking DBpedia with the Wikipedia dataset in Zhishi.me is straightforward. Likewise, resources in DBpedia and in the whole of Zhishi.me can be connected based on the transitive property of <owl:sameAs>. In total, 192,840 links are found between CLOD and LOD.
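
A compact sketch of turning interlanguage links into cross-lingual <owl:sameAs> statements follows; the interlanguage pairs are toy data, and the Zhishi.me IRI pattern is an assumption for illustration.

```python
# Derive CLOD-to-DBpedia owl:sameAs links from zh/en interlanguage pairs.
from urllib.parse import quote

# (Chinese Wikipedia title, English Wikipedia title) pairs parsed from wikitext.
interlanguage = [("木星", "Jupiter"), ("上海市", "Shanghai")]

def zhishi_iri(zh_title: str) -> str:
    # Assumed IRI pattern for the Chinese-Wikipedia part of Zhishi.me.
    return "http://zhishi.me/zhwiki/resource/" + quote(zh_title)

def dbpedia_iri(en_title: str) -> str:
    # DBpedia resource names follow the English Wikipedia article URL.
    return "http://dbpedia.org/resource/" + en_title.replace(" ", "_")

for zh, en in interlanguage:
    print(f"<{zhishi_iri(zh)}> owl:sameAs <{dbpedia_iri(en)}> .")
```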

5. Web Access to Zhishi.me

6. Conclusions and Future Work

Zhishi.me, the first effort to build Chinese Linking Open Data, currently covers the three largest Chinese encyclopedias: Baidu Baike, Hudong Baike and Chinese Wikipedia. We extracted semantic data from these Web-based, freely editable encyclopedias and integrated them as a whole, so that Zhishi.me has quite wide coverage of many domains. Observations on these independent data sources reveal their heterogeneity and their preferences for describing entities. Then, three heuristic strategies were adopted to discover <owl:sameAs> links between equivalent resources. This equivalence relation finally leads to about 1.6 million original resources being merged.
It is the first attempt at building pure Chinese LOD, and several specific difficulties (Chinese character comparison and Web access, for example) have been bridged. Furthermore, we have a long-term plan for improving and expanding the present CLOD:

Firstly, several Chinese non-encyclopedia data sources will be accommodated in our knowledge base. Wide domain coverage is the advantage of encyclopedias, but some domain-specific knowledge bases, such as 360buy, Taobao and Douban, can supplement more accurate descriptions. A blueprint of Chinese Linking Open Data is illustrated in Figure 6 of the paper.
The second direction we are considering is improving the instance matching strategies: not only boosting the precision and recall of mapping discovery within CLOD, but also augmenting the high-quality entity dictionary to link more Chinese resources to English ones within Linking Open Data. Meanwhile, necessary evaluations of matching quality will be provided. When the matching quality is satisfactory enough, we will use a single constant identifier scheme instead of the current source-oriented ones.
Another challenge is refining the extracted properties and building a general but consistent ontology automatically. This is an iterative process: the initially refined properties are used for ontology learning, and the learned preliminary ontology can in return help discard inaccurate properties. The iteration reaches its termination condition if the results converge.
