brat-nlp Annotation Tool Installation Guide

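brat is a sequence annotation tool for NLP, suited to tasks such as named entity recognition and relation extraction. This article describes a simple docker-based installation of brat: downloading the docker image, extracting it, and starting the container. After installation, you enter the container to modify configuration files for Chinese support and to configure the labels for the annotation documents. To use it, open http://ip:port, log in, and start annotating.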

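1. Introduction

       What is brat? brat is a sequence annotation tool for NLP. It is used to annotate natural-language text, and the annotated data can then be used for tasks such as named entity recognition and relation extraction.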

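2. Installation

       There are two ways to install brat. The first is to install it directly on the OS, which is fairly troublesome, and the package on the official site is not easy to download. The second is to install it with docker, which is much simpler. This article covers the docker installation.

2.1 Download the required packages

(1) Original brat installation package: brat installation package

(2) docker image package: brat-docker image package

2.2 Install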

# Extract the image archive (-x extracts; -c would create a new archive)
tar -zxvf brat.tgz
# Load the image
docker load -i brat.tar
# Start the docker container
docker run --name=brat -d -p 8088:80 -e BRAT_USERNAME=brat -e BRAT_PASSWORD=brat -e BRAT_EMAIL=brat@youremail.com cassj/brat
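
To make the flags of the docker run command easier to read, here is a minimal Python sketch that assembles the same command from named parameters. The helper name brat_run_command is hypothetical; the values are the ones used above. It only prints the command and does not start anything.

```python
import shlex


def brat_run_command(host_port=8088, username="brat", password="brat",
                     email="brat@youremail.com", image="cassj/brat",
                     name="brat"):
    """Build the `docker run` argument list used above (hypothetical helper)."""
    return [
        "docker", "run",
        f"--name={name}",          # container name, reused later by `docker exec`
        "-d",                      # run detached, in the background
        "-p", f"{host_port}:80",   # host port -> port 80 inside the container
        "-e", f"BRAT_USERNAME={username}",  # login name for the web UI
        "-e", f"BRAT_PASSWORD={password}",  # login password for the web UI
        "-e", f"BRAT_EMAIL={email}",
        image,
    ]


if __name__ == "__main__":
    # Print the command instead of running it; hand the list to
    # subprocess.run(...) if you want to execute it.
    print(shlex.join(brat_run_command()))
```

Changing host_port is the usual adjustment when 8088 is already taken on the host; the address you open in the browser later has to use the same port.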

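2.3 Modify the configuration files

       Once the container is running, some configuration files inside it need to be modified, for example to enable Chinese support and to configure the labels. Make these changes directly inside the container, then refresh the page for them to take effect.

(1) Enable Chinese support

# Enter the container
docker exec -it brat bash
# Open the file
vim /var/www/brat/brat-v1.3_Crunchy_Frog/server/src/projectconfig.py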

#!/usr/bin/env python
# -*- Mode: Python; tab-width: 4; indent-tabs-mode: nil; coding: utf-8; -*-
# vim:set ft=python ts=4 sw=4 sts=4 autoindent:


'''
Per-project configuration functionality for
Brat Rapid Annotation Tool (brat)

Author:     Pontus Stenetorp    <pontus is s u-tokyo ac jp>
Author:     Sampo Pyysalo       <smp is s u-tokyo ac jp>
Author:     Illes Solt          <solt tmit bme hu>
Version:    2011-08-15
'''

import re
import robotparser # TODO reduce scope
import urlparse # TODO reduce scope
import sys

from annotation import open_textfile
from message import Messager

ENTITY_CATEGORY, EVENT_CATEGORY, RELATION_CATEGORY, UNKNOWN_CATEGORY = xrange(4)

class InvalidProjectConfigException(Exception):
    pass

# names of files in which various configs are found
__access_control_filename     = 'acl.conf'
__annotation_config_filename  = 'annotation.conf'
__visual_config_filename      = 'visual.conf'
__tools_config_filename       = 'tools.conf'
__kb_shortcut_filename        = 'kb_shortcuts.conf'

# annotation config section name constants
ENTITY_SECTION    = "entities"
RELATION_SECTION  = "relations"
EVENT_SECTION     = "events"
ATTRIBUTE_SECTION = "attributes"

# aliases for config section names
SECTION_ALIAS = {
    "spans" : ENTITY_SECTION,
}

__expected_annotation_sections = (ENTITY_SECTION, RELATION_SECTION, EVENT_SECTION, ATTRIBUTE_SECTION)
__optional_annotation_sections = []

# visual config section name constants
LABEL_SECTION     = "labels"
DRAWING_SECTION   = "drawing"

__expected_visual_sections = (LABEL_SECTION, DRAWING_SECTION)
__optional_visual_sections = []

# tools config section name constants
OPTIONS_SECTION    = "options"
SEARCH_SECTION     = "search"
ANNOTATORS_SECTION = "annotators"
DISAMBIGUATORS_SECTION = "disambiguators"
NORMALIZATION_SECTION = "normalization"

__expected_tools_sections = (OPTIONS_SECTION, SEARCH_SECTION, ANNOTATORS_SECTION, DISAMBIGUATORS_SECTION, NORMALIZATION_SECTION)
__optional_tools_sections = (OPTIONS_SECTION, SEARCH_SECTION, ANNOTATORS_SECTION, DISAMBIGUATORS_SECTION, NORMALIZATION_SECTION)

# special relation types for marking which spans can overlap
# ENTITY_NESTING_TYPE used up to version 1.3, now deprecated
ENTITY_NESTING_TYPE = "ENTITY-NESTING"
# TEXTBOUND_OVERLAP_TYPE used from version 1.3 onward
TEXTBOUND_OVERLAP_TYPE = "<OVERLAP>"
SPECIAL_RELATION_TYPES = set([ENTITY_NESTING_TYPE,
                              TEXTBOUND_OVERLAP_TYPE])
OVERLAP_TYPE_ARG = '<OVL-TYPE>'

# visual config default value names
VISUAL_SPAN_DEFAULT = "SPAN_DEFAULT"
VISUAL_ARC_DEFAULT  = "ARC_DEFAULT"
VISUAL_ATTR_DEFAULT = "ATTRIBUTE_DEFAULT"

# visual config attribute name lists
SPAN_DRAWING_ATTRIBUTES = ['fgColor', 'bgColor', 'borderColor']
ARC_DRAWING_ATTRIBUTES  = ['color', 'dashArray', 'arrowHead', 'labelArrow']
ATTR_DRAWING_ATTRIBUTES  = ['box', 'dashArray', 'glyph', 'position']

# fallback defaults if config files not found
__default_configuration = """
[entities]
Protein

[relations]
Equiv   Arg1:Protein, Arg2:Protein, <REL-TYPE>:symmetric-transitive

[events]
Protein_binding|GO:0005515      Theme+:Protein
Gene_expression|GO:0010467      Theme:Protein

[attributes]
Negation        Arg:<EVENT>
Speculation     Arg:<EVENT>
"""

__default_visual = """
[labels]
Protein | Protein | Pro | P
Protein_binding | Protein binding | Binding | Bind
Gene_expression | Gene expression | Expression | Exp
Theme | Theme | Th

[drawing]
Protein bgColor:#7fa2ff
SPAN_DEFAULT    fgColor:black, bgColor:lightgreen, borderColor:black
ARC_DEFAULT     color:black
ATTRIBUTE_DEFAULT       glyph:*
"""

__default_tools = """
[search]
google     <URL>:http://www.google.com/search?q=%s
"""

__default_kb_shortcuts = """
P       Protein
"""

__default_access_control = """
User-agent: *
Allow: /
Disallow: /hidden/

User-agent: guest
Disallow: /confidential/
"""

# Reserved strings with special meanings in configuration.
reserved_config_name   = ["ANY", "ENTITY", "RELATION", "EVENT", "NONE", "EMPTY", "REL-TYPE", "URL", "URLBASE", "GLYPH-POS", "DEFAULT", "NORM", "OVERLAP", "OVL-TYPE"]
# TODO: "GLYPH-POS" is no longer used, warn if encountered and
# recommend to use "position" instead.
reserved_config_string = ["<%s>" % n for n in reserved_config_name]

# Magic string to use to represent a separator in a config
SEPARATOR_STR = "SEPARATOR"

def normalize_to_storage_form(t):
    """
    Given a label, returns a form of the term that can be used for
    disk storage. For example, space can be replaced with underscores
    to allow use with space-separated formats.
    """
    if t not in normalize_to_storage_form.__cache:
        # conservative implementation: replace any space with
        # underscore, replace unicode accented characters with
        # non-accented equivalents, remove others, and finally replace
        # all characters not in [a-zA-Z0-9_-] with underscores.

        import re
        import unicodedata

        n = t.replace(" ", "_")
        if isinstance(n, unicode):
            ascii = unicodedata.normalize('NFKD', n).encode('ascii', 'ignore')
        n  = re.sub(u'[^a-zA-Z\u4e00-\u9fa5<>,0-9_-]', '_', n)

        normalize_to_storage_form.__cache[t] = n

    return normalize_to_storage_form.__cache[t]
normalize_to_storage_form.__cache = {}

# Change the re.sub call in normalize_to_storage_form (line 163 of the file) to the following
n  = re.sub(u'[^a-zA-Z\u4e00-\u9fa5<>,0-9_-]', '_', n)
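
The effect of this one-line change can be checked outside the container. The sketch below is a simplified Python 3 stand-in for normalize_to_storage_form (no cache, no unicode branch), not brat's own code. ORIGINAL_PATTERN is taken from the comment in the source above ("all characters not in [a-zA-Z0-9_-]") and is an assumption about what the unpatched line contains; the sample labels are made up.

```python
import re

# Pattern described by the comment in the original source: anything
# outside [a-zA-Z0-9_-] becomes an underscore (assumed unpatched form).
ORIGINAL_PATTERN = r"[^a-zA-Z0-9_-]"
# Patched pattern from this guide: characters in U+4E00..U+9FA5 and
# the characters < > , are kept as well.
PATCHED_PATTERN = "[^a-zA-Z\u4e00-\u9fa5<>,0-9_-]"


def to_storage_form(label, pattern):
    """Simplified stand-in for normalize_to_storage_form (no cache)."""
    return re.sub(pattern, "_", label.replace(" ", "_"))


if __name__ == "__main__":
    # Print each sample label with its unpatched and patched storage form.
    for label in ["Protein binding", "人名", "机构 名称"]:
        print(label,
              to_storage_form(label, ORIGINAL_PATTERN),
              to_storage_form(label, PATCHED_PATTERN))
```

With the unpatched pattern every Chinese character is replaced by an underscore, so different Chinese labels of the same length become indistinguishable; the patched pattern leaves them intact.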

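 (2) Configure the annotation documents

       The labels you can annotate with are defined per folder (brat reads them from the folder's annotation.conf). You can put all of your files into one folder, for example ocr, store the individual passages of text as ocr1.txt, ocr2.txt, and so on, and then define the labels for that folder.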
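
A minimal sketch of such a folder, written in Python so it can be prepared outside the container and copied in afterwards. The section names come from projectconfig.py above; the labels Person, Organization and Works_for, the helper name prepare_collection, and the sample texts are all hypothetical. The location of brat's data directory inside the container is not given here, so the script writes to a directory you choose.

```python
import tempfile
from pathlib import Path

# Hypothetical labels; replace them with the ones your task needs.
ANNOTATION_CONF = """\
[entities]
Person
Organization

[relations]
Works_for\tArg1:Person, Arg2:Organization

[events]

[attributes]
"""


def prepare_collection(root, name, texts):
    """Create <root>/<name> with numbered .txt/.ann pairs and annotation.conf."""
    folder = Path(root) / name
    folder.mkdir(parents=True, exist_ok=True)
    for i, text in enumerate(texts, start=1):
        (folder / f"{name}{i}.txt").write_text(text, encoding="utf-8")
        # brat expects an .ann file (it may be empty) next to each .txt file.
        (folder / f"{name}{i}.ann").touch()
    (folder / "annotation.conf").write_text(ANNOTATION_CONF, encoding="utf-8")
    return folder


if __name__ == "__main__":
    # Demo in a throwaway directory; point root at a real path when using it.
    with tempfile.TemporaryDirectory() as tmp:
        folder = prepare_collection(tmp, "ocr", ["第一段文本", "第二段文本"])
        print(sorted(p.name for p in folder.iterdir()))
```

Each passage gets its own .txt file, and the single annotation.conf applies to every document in the folder.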

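3. Usage

(1) Open http://ip:port in a browser (port 8088 with the docker run command above), then log in with the username and password brat/brat.

 (2) After that, simply start annotating; the tool is easy to use.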
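
Once documents are annotated, brat stores the results in the .ann file next to each .txt file, in its standoff format. The following is a minimal sketch, assuming you only need the entity (text-bound) annotations for a downstream task such as named entity recognition. The names Entity and parse_entities and the sample data are hypothetical, and discontinuous spans are skipped for brevity.

```python
from typing import NamedTuple


class Entity(NamedTuple):
    id: str
    label: str
    start: int
    end: int
    text: str


def parse_entities(ann_text):
    """Collect text-bound annotations (lines whose ID starts with 'T')."""
    entities = []
    for line in ann_text.splitlines():
        if not line.startswith("T"):
            continue  # skip relations, events, attributes, notes
        ann_id, type_and_span, text = line.split("\t", 2)
        label, span = type_and_span.split(" ", 1)
        if ";" in span:
            continue  # discontinuous spans are ignored in this sketch
        start, end = map(int, span.split())
        entities.append(Entity(ann_id, label, start, end, text))
    return entities


if __name__ == "__main__":
    # Made-up sample: two entities and one relation between them.
    sample = (
        "T1\tPerson 0 2\t张三\n"
        "T2\tOrganization 3 7\t某某公司\n"
        "R1\tWorks_for Arg1:T1 Arg2:T2\n"
    )
    for entity in parse_entities(sample):
        print(entity)
```

The start and end values are character offsets into the matching .txt file, so the labeled spans can be mapped back onto the original text.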
