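1. Introduction
What is brat? brat (Brat Rapid Annotation Tool) is an NLP annotation tool for sequence labeling of natural-language text. The annotated data can then be used for tasks such as named entity recognition and relation extraction.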
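2. Installation
There are two ways to install brat. The first is to install it directly on the host OS, which is more cumbersome, and the package on the official site is not easy to download. The second is to install it with Docker, which is simpler. This article covers the Docker installation.
2.1 Download the required packages
(1) Original brat package: brat installation package
(2) Docker image package: brat-docker image package
2.2 Install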
# Extract the image archive (-x extracts; -c would create a new archive)
tar -zxvf brat.tgz
# Load the image
docker load -i brat.tar
# Start the Docker container
docker run --name=brat -d -p 8088:80 -e BRAT_USERNAME=brat -e BRAT_PASSWORD=brat -e BRAT_EMAIL=brat@youremail.com cassj/brat
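2.3 Edit the configuration files
After the container starts, a few configuration files inside it need to be edited, for example to enable Chinese support and to configure the labels. Edit them directly inside the container, then refresh the page for the changes to take effect.
(1) Enable Chinese support
# Enter the container
docker exec -it brat bash
# Open the file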
vim /var/www/brat/brat-v1.3_Crunchy_Frog/server/src/projectconfig.py
#!/usr/bin/env python
# -*- Mode: Python; tab-width: 4; indent-tabs-mode: nil; coding: utf-8; -*-
# vim:set ft=python ts=4 sw=4 sts=4 autoindent:
'''
Per-project configuration functionality for
Brat Rapid Annotation Tool (brat)
Author: Pontus Stenetorp <pontus is s u-tokyo ac jp>
Author: Sampo Pyysalo <smp is s u-tokyo ac jp>
Author: Illes Solt <solt tmit bme hu>
Version: 2011-08-15
'''
import re
import robotparser # TODO reduce scope
import urlparse # TODO reduce scope
import sys
from annotation import open_textfile
from message import Messager
ENTITY_CATEGORY, EVENT_CATEGORY, RELATION_CATEGORY, UNKNOWN_CATEGORY = xrange(4)
class InvalidProjectConfigException(Exception):
    pass
# names of files in which various configs are found
__access_control_filename = 'acl.conf'
__annotation_config_filename = 'annotation.conf'
__visual_config_filename = 'visual.conf'
__tools_config_filename = 'tools.conf'
__kb_shortcut_filename = 'kb_shortcuts.conf'
# annotation config section name constants
ENTITY_SECTION = "entities"
RELATION_SECTION = "relations"
EVENT_SECTION = "events"
ATTRIBUTE_SECTION = "attributes"
# aliases for config section names
SECTION_ALIAS = {
    "spans" : ENTITY_SECTION,
    }
__expected_annotation_sections = (ENTITY_SECTION, RELATION_SECTION, EVENT_SECTION, ATTRIBUTE_SECTION)
__optional_annotation_sections = []
# visual config section name constants
LABEL_SECTION = "labels"
DRAWING_SECTION = "drawing"
__expected_visual_sections = (LABEL_SECTION, DRAWING_SECTION)
__optional_visual_sections = []
# tools config section name constants
OPTIONS_SECTION = "options"
SEARCH_SECTION = "search"
ANNOTATORS_SECTION = "annotators"
DISAMBIGUATORS_SECTION = "disambiguators"
NORMALIZATION_SECTION = "normalization"
__expected_tools_sections = (OPTIONS_SECTION, SEARCH_SECTION, ANNOTATORS_SECTION, DISAMBIGUATORS_SECTION, NORMALIZATION_SECTION)
__optional_tools_sections = (OPTIONS_SECTION, SEARCH_SECTION, ANNOTATORS_SECTION, DISAMBIGUATORS_SECTION, NORMALIZATION_SECTION)
# special relation types for marking which spans can overlap
# ENTITY_NESTING_TYPE used up to version 1.3, now deprecated
ENTITY_NESTING_TYPE = "ENTITY-NESTING"
# TEXTBOUND_OVERLAP_TYPE used from version 1.3 onward
TEXTBOUND_OVERLAP_TYPE = "<OVERLAP>"
SPECIAL_RELATION_TYPES = set([ENTITY_NESTING_TYPE,
                              TEXTBOUND_OVERLAP_TYPE])
OVERLAP_TYPE_ARG = '<OVL-TYPE>'
# visual config default value names
VISUAL_SPAN_DEFAULT = "SPAN_DEFAULT"
VISUAL_ARC_DEFAULT = "ARC_DEFAULT"
VISUAL_ATTR_DEFAULT = "ATTRIBUTE_DEFAULT"
# visual config attribute name lists
SPAN_DRAWING_ATTRIBUTES = ['fgColor', 'bgColor', 'borderColor']
ARC_DRAWING_ATTRIBUTES = ['color', 'dashArray', 'arrowHead', 'labelArrow']
ATTR_DRAWING_ATTRIBUTES = ['box', 'dashArray', 'glyph', 'position']
# fallback defaults if config files not found
__default_configuration = """
[entities]
Protein
[relations]
Equiv Arg1:Protein, Arg2:Protein, <REL-TYPE>:symmetric-transitive
[events]
Protein_binding|GO:0005515 Theme+:Protein
Gene_expression|GO:0010467 Theme:Protein
[attributes]
Negation Arg:<EVENT>
Speculation Arg:<EVENT>
"""
__default_visual = """
[labels]
Protein | Protein | Pro | P
Protein_binding | Protein binding | Binding | Bind
Gene_expression | Gene expression | Expression | Exp
Theme | Theme | Th
[drawing]
Protein bgColor:#7fa2ff
SPAN_DEFAULT fgColor:black, bgColor:lightgreen, borderColor:black
ARC_DEFAULT color:black
ATTRIBUTE_DEFAULT glyph:*
"""
__default_tools = """
[search]
google <URL>:http://www.google.com/search?q=%s
"""
__default_kb_shortcuts = """
P Protein
"""
__default_access_control = """
User-agent: *
Allow: /
Disallow: /hidden/
User-agent: guest
Disallow: /confidential/
"""
# Reserved strings with special meanings in configuration.
reserved_config_name = ["ANY", "ENTITY", "RELATION", "EVENT", "NONE", "EMPTY", "REL-TYPE", "URL", "URLBASE", "GLYPH-POS", "DEFAULT", "NORM", "OVERLAP", "OVL-TYPE"]
# TODO: "GLYPH-POS" is no longer used, warn if encountered and
# recommend to use "position" instead.
reserved_config_string = ["<%s>" % n for n in reserved_config_name]
# Magic string to use to represent a separator in a config
SEPARATOR_STR = "SEPARATOR"
def normalize_to_storage_form(t):
    """
    Given a label, returns a form of the term that can be used for
    disk storage. For example, space can be replaced with underscores
    to allow use with space-separated formats.
    """
    if t not in normalize_to_storage_form.__cache:
        # conservative implementation: replace any space with
        # underscore, replace unicode accented characters with
        # non-accented equivalents, remove others, and finally replace
        # all characters not in [a-zA-Z0-9_-] with underscores.
        import re
        import unicodedata
        n = t.replace(" ", "_")
        if isinstance(n, unicode):
            ascii = unicodedata.normalize('NFKD', n).encode('ascii', 'ignore')
        # patched line: also keeps Chinese characters (\u4e00-\u9fa5), '<', '>' and ','
        n = re.sub(u'[^a-zA-Z\u4e00-\u9fa5<>,0-9_-]', '_', n)
        normalize_to_storage_form.__cache[t] = n
    return normalize_to_storage_form.__cache[t]
normalize_to_storage_form.__cache = {}
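# Change line 163 of the file to the line below (the listing above already includes it).
# The added range \u4e00-\u9fa5 keeps Chinese characters in labels instead of replacing them with underscores.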
n = re.sub(u'[^a-zA-Z\u4e00-\u9fa5<>,0-9_-]', '_', n)
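To see what the one-line change does, here is a minimal standalone sketch in Python 3 (brat's own file is Python 2, so this is an illustration, not a drop-in replacement). ORIGINAL_PATTERN is an assumption taken from the source comment "replace all characters not in [a-zA-Z0-9_-] with underscores"; to_storage_form is a hypothetical helper that mirrors only the replace-then-substitute steps.

```python
import re

# Assumed stock pattern, taken from the comment in projectconfig.py.
ORIGINAL_PATTERN = r'[^a-zA-Z0-9_-]'
# Patched pattern from this article: also keeps Chinese characters, '<', '>' and ','.
PATCHED_PATTERN = '[^a-zA-Z\u4e00-\u9fa5<>,0-9_-]'


def to_storage_form(label, pattern):
    """Hypothetical helper: spaces become underscores, then every
    character the pattern rejects is replaced with an underscore."""
    return re.sub(pattern, '_', label.replace(' ', '_'))


if __name__ == '__main__':
    for label in ['Gene expression', '人名']:
        print(label,
              to_storage_form(label, ORIGINAL_PATTERN),
              to_storage_form(label, PATCHED_PATTERN))
```

With the assumed stock pattern, a Chinese label such as 人名 collapses to underscores, so different Chinese labels can no longer be told apart. With the patched pattern it is kept unchanged.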
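(2) Configure the annotation labels
To use your own labels, edit the annotation configuration. You can put all your files in one folder, for example ocr, store the individual passages as ocr1.txt, ocr2.txt, and so on, and then define the labels for that folder in its annotation configuration file (annotation.conf, the file name that appears in the listing above).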
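The folder layout described above can be sketched with a short Python 3 script. This is a hedged illustration only: create_collection is a hypothetical helper, the labels 人名, 地名 and 机构名 are made-up examples, and the section names follow the default configuration shown in projectconfig.py. The script also creates an empty .ann file next to each .txt file, since brat stores annotations in a matching .ann file. Where brat's data directory lives inside the container is not stated here, so the script writes to a temporary directory, and the result would need to be copied to that location.

```python
import tempfile
from pathlib import Path


def create_collection(root, name, texts, entity_labels):
    """Hypothetical helper: create <root>/<name> with numbered .txt files,
    matching empty .ann files, and an annotation.conf listing the labels."""
    folder = Path(root) / name
    folder.mkdir(parents=True, exist_ok=True)
    for i, text in enumerate(texts, start=1):
        (folder / f"{name}{i}.txt").write_text(text, encoding="utf-8")
        (folder / f"{name}{i}.ann").touch()  # empty annotation file
    # Entity labels go under [entities]; the other sections stay empty.
    conf = (
        "[entities]\n" + "\n".join(entity_labels) + "\n\n"
        "[relations]\n\n[events]\n\n[attributes]\n"
    )
    (folder / "annotation.conf").write_text(conf, encoding="utf-8")
    return folder


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        folder = create_collection(
            tmp, "ocr", ["第一段文本", "第二段文本"], ["人名", "地名", "机构名"]
        )
        for path in sorted(folder.iterdir()):
            print(path.name)
```

Chinese labels like these only work once the Chinese-support change in step (1) is in place.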
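3. Usage
(1) Open http://ip:port in a browser (port 8088 with the docker run command above), then log in with the username and password brat/brat.
(2) From there you can start annotating directly; the tool is simple to use.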