WAT & SWAT API Documentation: Translation and Notes

飞鸡110

Article link: https://blog.csdn.net/m0_43414114/article/details/109787165

WAT

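WAT is an entity linker: a tool that identifies meaningful substrings (called "spots") in unstructured English text and links each of them to an unambiguous entity (an item in a knowledge base). Entities are Wikipedia / Wikidata items. This is useful for a range of NLP / NLU problems such as question answering, knowledge base population, and text classification. You can annotate a text by querying the RESTful API documented on this page.

WAT supersedes TagME: although it currently handles only English documents, it has similar runtime performance and gives more accurate results (see the WAT paper for details).

Parameters can be passed to the API endpoints either as URL-encoded parameters or as fields of a multipart request. All endpoints handle HTTP GET requests only.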

Registering to the service

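The service is hosted by the D4Science infrastructure. To get access, register to the TagMe VRE and obtain your authorization token by clicking the "Show" button in the left panel. You then have everything you need to query the WAT API. For example, you can point your browser to: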

https://wat.d4science.org/wat/tag/tag?lang=en&gcube-token=XXXX&text=Obama+visited+U.K.+in+March

How to annotate

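Annotating text is the main service provided by WAT. This is known as the Sa2KB problem. An annotation is a pair (spot, entity), where the spot is a substring of the input text and the entity is a reference to a Wikipedia item that represents the meaning of the spot in that context.

The response includes all annotations found in the input text. WAT associates with each annotation an attribute called ρ (rho), which estimates the confidence in the annotation. (Note that ρ does not indicate the relevance of the entity in the input text.) You can use the ρ value to discard annotations that fall below a given threshold. The threshold should be chosen in the interval [0,1]. A reasonable threshold is between 0.1 and 0.3.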
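The thresholding described above can be sketched in a few lines of Python. This is only an illustration: the helper name `filter_by_rho` is made up here, the sample annotations are hand-written rather than real WAT output, and the default of 0.2 is simply a value inside the suggested 0.1 to 0.3 range.

```python
def filter_by_rho(annotations, threshold=0.2):
    """Keep only annotations whose confidence rho is at least `threshold`."""
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must lie in [0, 1]")
    return [a for a in annotations if a["rho"] >= threshold]


# Hand-made sample data (illustrative values, not real WAT output).
sample = [
    {"spot": "Obama", "title": "Barack_Obama", "rho": 0.45},
    {"spot": "March", "title": "March", "rho": 0.05},
]

kept = filter_by_rho(sample, threshold=0.2)
print([a["spot"] for a in kept])
```

A higher threshold trades recall for precision, so it is worth trying a few values on your own data.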

Parameters

text - required - the text to be annotated
gcube-token - required - the D4Science Service Authorization Token.
lang - optional - The language of the text to be annotated. Currently only en for English is accepted as value.
tokenizer - optional - The tokenizer to use. Accepted values: opennlp (default, for well-formed text), lucene (for text that is not well-formed).

Advanced optional parameters

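debug - optional - Include debugging information. The value is interpreted as a bitwise OR of several values, each of which enables a specific kind of debugging information: 1 (document processing), 2 (spotted mentions), 4 (pipeline), 8 (explanation of the disambiguation module). For example, to get debugging information about document processing and the pipeline, pass debug=5.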
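As a small illustration of how the debug value is composed, the flags can be modelled with Python's `enum.IntFlag`. The class and member names below are invented for this sketch; only the numeric values 1, 2, 4 and 8 come from the parameter description.

```python
from enum import IntFlag


class WatDebug(IntFlag):
    # Names are hypothetical; the numeric values are the documented ones.
    DOCUMENT_PROCESSING = 1
    SPOTTED_MENTIONS = 2
    PIPELINE = 4
    DISAMBIGUATION_EXPLANATION = 8


# Document processing + pipeline, combined with a bitwise OR.
debug_value = int(WatDebug.DOCUMENT_PROCESSING | WatDebug.PIPELINE)
print(debug_value)  # 5
```

The integer in `debug_value` is what would be sent as the `debug` parameter.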

Python Running Example

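Here is a toy example written in Python that queries WAT with the best configuration settings described in the paper. Just copy and paste it into your code to see how it works!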

import json
import requests

MY_GCUBE_TOKEN = 'copy your gcube-token here!'

class WATAnnotation:
    # An entity annotated by WAT

    def __init__(self, d):

        # char offset (included)
        self.start = d['start']
        # char offset (not included)
        self.end = d['end']

        # annotation confidence (rho)
        self.rho = d['rho']
        # spot-entity probability
        self.prior_prob = d['explanation']['prior_explanation']['entity_mention_probability']

        # annotated text
        self.spot = d['spot']

        # Wikipedia entity info
        self.wiki_id = d['id']
        self.wiki_title = d['title']


    def json_dict(self):
        # Simple dictionary representation
        return {'wiki_title': self.wiki_title,
                'wiki_id': self.wiki_id,
                'start': self.start,
                'end': self.end,
                'rho': self.rho,
                'prior_prob': self.prior_prob
                }


def wat_entity_linking(text):
    # Main method, text annotation with WAT entity linking system
    wat_url = 'https://wat.d4science.org/wat/tag/tag'
    payload = [("gcube-token", MY_GCUBE_TOKEN),
               ("text", text),
               ("lang", 'en'),
               ("tokenizer", "nlp4j"),
               ('debug', 9),
               ("method",
                "spotter:includeUserHint=true:includeNamedEntity=true:includeNounPhrase=true,prior:k=50,filter-valid,centroid:rescore=true,topk:k=5,voting:relatedness=lm,ranker:model=0046.model,confidence:model=pruner-wiki.linear")]

    response = requests.get(wat_url, params=payload)
    return [WATAnnotation(a) for a in response.json()['annotations']]


def print_wat_annotations(wat_annotations):
    json_list = [w.json_dict() for w in wat_annotations]
    print(json.dumps(json_list, indent=4))


wat_annotations = wat_entity_linking('Barack Obama was in Pisa for a flying visit.')
print_wat_annotations(wat_annotations)

URL Get-Example

gcube-token=<your Service Authorization Token>
text=Schumacher won the race in Indianapolis
lang=en

https://wat.d4science.org/wat/tag/tag?lang=en&gcube-token=<your Service Authorization Token>&text=Schumacher won the race in Indianapolis
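The URL above contains raw spaces in the text parameter, which a browser will encode for you but a script has to encode itself. A minimal sketch using only the standard library follows; the function name `build_tag_url` and the token value `XXXX` are placeholders chosen for this example.

```python
from urllib.parse import urlencode

WAT_TAG_URL = "https://wat.d4science.org/wat/tag/tag"


def build_tag_url(text, token, lang="en"):
    """Return the GET URL for the tag endpoint with all parameters URL-encoded."""
    query = urlencode([("lang", lang), ("gcube-token", token), ("text", text)])
    return WAT_TAG_URL + "?" + query


url = build_tag_url("Schumacher won the race in Indianapolis", "XXXX")
print(url)
```

`urlencode` encodes spaces as `+`, the same form used in the browser example in the registration section.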

How to compute entity relatedness

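This service computes the relatedness between two entities by returning a value in the range [0,1] that expresses how semantically related the two entities are, where 0 = unrelated and 1 = related.

Note that this service can also be used to relate two texts: first annotate them with TagMe, then estimate the pairwise relatedness between all of their annotated entities. These values can then be combined in some way (e.g. avg, max) to obtain a single value that represents the relatedness between the two input texts.

The endpoint accepts a list of Wikipedia page IDs and returns the relatedness value for every pair among them. This means it returns N² values, where N is the number of entities provided, so be careful!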
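One possible way to combine pairwise values into a single text-to-text score is sketched below. It does not parse the endpoint's response (whose format is not described here); it assumes you have already collected the scores into a dictionary keyed by entity-ID pairs. The function name, the placeholder IDs, and the scores are all made up for illustration.

```python
from statistics import mean


def combine_relatedness(pair_scores, how="avg"):
    """Collapse pairwise entity relatedness values into one score.

    pair_scores maps (id_from_text_a, id_from_text_b) -> value in [0, 1].
    """
    values = list(pair_scores.values())
    if not values:
        return 0.0  # no entity pairs: treat the texts as unrelated
    if how == "avg":
        return mean(values)
    if how == "max":
        return max(values)
    raise ValueError("how must be 'avg' or 'max'")


# Placeholder entity IDs and scores (not real WAT output).
scores = {(1, 10): 0.5, (1, 11): 0.25, (2, 10): 0.75, (2, 11): 0.5}

print(combine_relatedness(scores, "avg"))
print(combine_relatedness(scores, "max"))
```

The average rewards texts whose entities are related across the board, while the maximum only needs one strongly related pair; which one fits depends on the application.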

Endpoint URL

https://wat.d4science.org/wat/relatedness/graph

Parameters

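lang - optional - The language of the text to be annotated. Currently only en for English is accepted as value.
ids - required, repeated - The Wikipedia IDs (numeric identifiers) of the entities.
relatedness - optional - The relatedness function to compute. Accepted values: mw (Milne-Witten), jaccard (Jaccard measure of the pages' outlinks), lm (language model), w2v (Word2Vec), conditionalprobability (Conditional Probability), barabasialbert (Barabasi-Albert on the Wikipedia graph), pmi (Pointwise Mutual Information).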

Example

Compute the relatedness between the two entities Barack Obama (Wikipedia ID 534366) and Presidency of Barack Obama (Wikipedia ID 20082093):

gcube-token=<your Service Authorization Token>
ids=534366
ids=20082093

This corresponds to the GET request:

https://wat.d4science.org/wat/relatedness/graph?gcube-token=<your Service Authorization Token>&ids=534366&ids=20082093
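Because ids is a repeated parameter, it has to appear once per entity in the query string. A minimal sketch of building such a request URL for the relatedness endpoint with the standard library is shown here; `build_relatedness_url` is a name invented for this example and `XXXX` stands in for a real token.

```python
from urllib.parse import urlencode

RELATEDNESS_URL = "https://wat.d4science.org/wat/relatedness/graph"


def build_relatedness_url(wiki_ids, token, relatedness=None):
    """Build the GET URL, emitting one `ids=` pair per Wikipedia ID."""
    params = {"gcube-token": token, "ids": list(wiki_ids)}
    if relatedness is not None:
        params["relatedness"] = relatedness
    # doseq=True expands the list into repeated ids=... parameters.
    return RELATEDNESS_URL + "?" + urlencode(params, doseq=True)


url = build_relatedness_url([534366, 20082093], "XXXX")
print(url)
```

With the requests library, the same effect is obtained by passing a list of `(key, value)` tuples as `params`, as the WAT annotation example does.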

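How to get surface form information

A surface form is a portion of text that may mention an entity. WAT provides two endpoints for retrieving information about a surface form, such as the entities it may refer to, how many times it has been seen in Wikipedia, how many times it occurs as a link, and so on. The two endpoints take the same parameters. The first endpoint provides information about the frequency of the surface form, while the second also provides information about the entities it may refer to.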

Endpoint URL - frequency information

https://wat.d4science.org/wat/sf/frequency

Endpoint URL - full information

https://wat.d4science.org/wat/sf/sf

Parameters

lang - optional - The language of the text to be annotated. Currently only en for English is accepted as value.
text - required - The surface form.

Example

Get information about the surface form obama:

gcube-token=<your Service Authorization Token>
text=obama

This corresponds to the GET request:

https://wat.d4science.org/wat/sf/sf?gcube-token=<your Service Authorization Token>&text=obama

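A brief explanation of the most important fields in the response:

link_probability: the ratio between the number of times this surface form occurs in Wikipedia pages as a link and the number of times it occurs as plain text.
term_frequency: the number of times this surface form occurs in all Wikipedia pages.
term_probability: the occurrence rate of this surface form in all Wikipedia pages, relative to all other surface forms.
document_frequency: the number of distinct Wikipedia pages that contain this surface form.
entities: for each entity (Wikipedia page) that this surface form points to in Wikipedia:
    wiki_id: the Wikipedia page ID
    num_links: the number of times this surface form points to this Wikipedia page
    probability: the ratio of the number of times this surface form points to this Wikipedia page, relative to other pages.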
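To make the relationship between num_links and probability concrete, here is a hedged sketch that recomputes a per-entity probability by normalising the link counts. It assumes probability is num_links divided by the total over all listed entities, which is one reading of the description above and may not match what the server computes exactly. The function name and the counts are illustrative.

```python
def entity_probabilities(entities):
    """Map wiki_id -> share of this surface form's links that point to it."""
    total = sum(e["num_links"] for e in entities)
    if total == 0:
        return {}
    return {e["wiki_id"]: e["num_links"] / total for e in entities}


# Illustrative counts only (not real WAT output).
entities = [
    {"wiki_id": 534366, "num_links": 75},
    {"wiki_id": 20082093, "num_links": 25},
]

probs = entity_probabilities(entities)
print(probs)
```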

Wikipedia Title resolution

To resolve a Wikipedia page title to its ID, WAT provides the following API.

API Endpoint

https://wat.d4science.org/wat/title

Parameters

lang - optional - The language of the text to be annotated. Currently only en for English is accepted as value.
title - required - The Wikipedia page title.

Example

Get the ID of page Barack Obama:

gcube-token=<your Service Authorization Token>
title=Barack_Obama

This corresponds to the GET request:

https://wat.d4science.org/wat/title?title=Barack_Obama

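Advanced usage: annotating partially filled-in documents

The first API endpoint described in this document takes care of the whole pipeline: parsing the text, searching it for possible mentions, and finally linking them to the entities they refer to. WAT also offers an API endpoint that skips some of these steps by letting the user provide the already parsed text or indicate the surface forms that mention entities (the D2KB problem). In this case, the document must be passed as a JSON object.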

API Endpoint

https://wat.d4science.org/wat/tag/json

Parameters

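document - required - A JSON object representing the text to be annotated. The object has the following key-value pairs:
- "text" - required - the document to process
- "sentences" - optional - an array in the format returned by the Stanford CoreNLP parser
- "spans" - optional - the spans to annotate (D2KB problem), as an array of objects with "start" and "end" fields, i.e. the indices of the first (included) and last (not included) character of the span
gcube-token - required - the D4Science Service Authorization Token.
lang - optional - The language of the text to be annotated. Currently only en for English is accepted as value.
tokenizer - optional - The tokenizer to use. Accepted values: opennlp (default, for well-formed text), lucene (for text that is not well-formed).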

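Example: providing parsed sentences

Building a request to this endpoint by hand is inconvenient, mainly because the JSON must be URL-encoded. Here is a Python script that issues a call providing already parsed text: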

import json
import requests

document_json = json.loads("""{
  "text": "Barack Obama was in Pisa.",
  "sentences": [{
      "tokens": [
        { "position": { "start": 0, "end": 6 },
          "ner": { "type": "PERSON", "label": "Inside" },
          "id": 0,
          "word": { "word": "Barack" }
        }, {
          "position": { "start": 7, "end": 12 },
          "ner": { "type": "PERSON", "label": "Inside" },
          "id": 1,
          "word": { "word": "Obama" }
        }, {
          "position": { "start": 13, "end": 16 },
          "ner": { "type": "O", "label": "Outside" },
          "id": 2,
          "word": { "word": "was" }
        }, {
          "position": {"start": 17, "end": 19},
          "ner": {"type": "O", "label": "Outside"},
          "id": 3,
          "word": { "word": "in" }
        }, {
          "position": { "start": 20, "end": 24 },
          "ner": { "type": "LOCATION", "label": "Inside" },
          "id": 4,
          "word": { "word": "Pisa" }
        }, {
          "position": { "start": 24,  "end": 25 },
          "ner": { "type": "O",  "label": "Outside" },
          "id": 5,
          "word": { "word": "." }
        }
      ],
      "position": { "start": 0,  "end": 25 },
      "id": 0
    }]
}""")

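# gcube-token is a required parameter of this endpoint
MY_GCUBE_TOKEN = 'copy your gcube-token here!'

r = requests.get('https://wat.d4science.org/wat/tag/json',
                 params={"document": json.dumps(document_json),
                         "gcube-token": MY_GCUBE_TOKEN})

print(r.text)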

Example: providing mentions to disambiguate (D2KB problem)

You can ask WAT to annotate specific spans by adding the "suggested_spans" key.

import json
import requests

document_json = json.loads("""{

  "text": "Barack Obama was in Pisa.",
  "suggested_spans": [
    { "start":0, "end":6 },
    { "start":20, "end":24 }
  ]
}""")

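# gcube-token is a required parameter of this endpoint
MY_GCUBE_TOKEN = 'copy your gcube-token here!'

r = requests.get('https://wat.d4science.org/wat/tag/json',
                 params={"document": json.dumps(document_json),
                         "gcube-token": MY_GCUBE_TOKEN})

print(r.text)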
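Counting character offsets by hand is error-prone, so the span list can be derived from the mention strings instead. The helper below is a sketch, not part of WAT: `spans_for` is a hypothetical name, it expects the mentions in their order of appearance, and it uses the "suggested_spans" key from the example above (the parameter list calls the key "spans", so check which one your deployment accepts).

```python
import json


def spans_for(text, mentions):
    """Return [{'start', 'end'}] offsets (end exclusive) for mentions given in text order."""
    spans = []
    cursor = 0
    for mention in mentions:
        start = text.find(mention, cursor)
        if start == -1:
            raise ValueError(f"mention not found: {mention!r}")
        end = start + len(mention)
        spans.append({"start": start, "end": end})
        cursor = end  # continue searching after this mention
    return spans


text = "Barack Obama was in Pisa."
document = {
    "text": text,
    "suggested_spans": spans_for(text, ["Barack Obama", "Pisa"]),
}
print(json.dumps(document))
```

The resulting `document` can be serialised with `json.dumps` and passed as the `document` parameter, exactly as in the request above.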

Credits and references

To learn more about how WAT works, see the paper published at ERD 2014 and Francesco Piccinno's PhD thesis.

SWAT API Documentation

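SWAT is an entity salience system that identifies, in real time, the semantic focus of a document as expressed by its salient Wikipedia entities. The core of the technique is a wide set of syntactic and semantic features that are extracted from the input document and then fed to a classifier previously trained on millions of training examples drawn from the New York Times Annotated Corpus. An experimental GUI for the system is available at https://swat.d4science.org/.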

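Registering to the service

The service is hosted by the D4Science infrastructure. To get access, register to the TagMe VRE and obtain your authorization token by clicking the "Show" button in the left panel. You must send this authorization token as the gcube-token URL parameter with every request to the API.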

How to get entity salience in a document

You can use SWAT by calling its API with an HTTP POST request at the following URL:

https://swat.d4science.org/salience

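The endpoint accepts a JSON object as input (put it in the payload of the POST request) and returns a JSON object. The input object must/can have the following key-value pairs: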

Key	Description	Type
content	The textual content of the input document (required)	string
title	The document’s title (optional)	string

A Python example

Here is a piece of Python code that queries SWAT:

import json
import requests

MY_GCUBE_TOKEN = 'copy your gcube-token here!'

document = {
    "title": "Obama travels.",
    "content": 'Barack Obama was in Pisa for a flying visit.'
}

url = 'https://swat.d4science.org/salience'
response = requests.post(url,
                         data=json.dumps(document),
                         params={'gcube-token': MY_GCUBE_TOKEN})

print(json.dumps(response.json(), indent=4))

Response format

The response is a JSON object with the following structure:

{
    'status'                        # str

    'annotations':
       {
           'wiki_id'                # int
           'wiki_title'             # str
           'salience_class'         # int
           'salience_score'         # float
           'spans':                 # (where the entity is mentioned in content)
                [
                    {
                        'start'     # int (character-offset, included)
                        'end'       # int (character-offset, not included)
                    }
                ]
       }

    'title'                         # str
    'content'                       # str
}
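A response with this structure could be post-processed as sketched below, ranking entities by salience_score and recovering the mention text from the spans. The sketch assumes that 'annotations' arrives as a list of objects with the keys shown above; the helper names and the mock response, including its scores, are invented for illustration and are not real SWAT output.

```python
def rank_by_salience(response, top_k=None):
    """Return annotations sorted by salience_score, highest first."""
    annotations = response.get("annotations", [])
    ranked = sorted(annotations, key=lambda a: a["salience_score"], reverse=True)
    return ranked if top_k is None else ranked[:top_k]


def mention_texts(response, annotation):
    """Slice the document content with each span (end offset is exclusive)."""
    content = response["content"]
    return [content[s["start"]:s["end"]] for s in annotation["spans"]]


# Mock response (made-up scores), following the structure described above.
mock = {
    "status": "ok",
    "title": "Obama travels.",
    "content": "Barack Obama was in Pisa for a flying visit.",
    "annotations": [
        {"wiki_title": "Pisa", "salience_score": 0.3,
         "spans": [{"start": 20, "end": 24}]},
        {"wiki_title": "Barack_Obama", "salience_score": 0.9,
         "spans": [{"start": 0, "end": 12}]},
    ],
}

top = rank_by_salience(mock, top_k=1)[0]
print(top["wiki_title"], mention_texts(mock, top))
```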