brat NLP Annotation Tool: Installation Guide

Latest recommended article published on 2025-04-17 10:22:47

ITIRONMAN

Category: AI. Tags: natural language processing, machine learning, deep learning

Article link: https://blog.csdn.net/qq_23953717/article/details/119915943


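brat is a sequence annotation tool for NLP, suited to tasks such as named entity recognition and relation extraction. This article walks through a simple Docker-based installation of brat: downloading the Docker image, extracting it, and starting the container. After installation, you enter the container to edit a configuration file for Chinese support and to configure the labels for your documents. To use it, open http://ip:port, log in, and start annotating.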


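1. Introduction

What is brat? brat is a sequence annotation tool for NLP. You use it to label spans of natural-language text, and the labeled data can then be used for tasks such as named entity recognition and relation extraction.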
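To make the "labeled data" concrete: brat stores annotations in a standoff `.ann` file next to each text file, where each text-bound annotation is a tab-separated line holding an ID, a label with character offsets, and the covered text. The sketch below reads such lines into tuples that a downstream NER pipeline could consume. The function name `parse_entities`, the `Entity` type, and the sample labels `Person` and `Place` are illustrative assumptions, and discontinuous spans are simply skipped.

```python
from typing import List, NamedTuple


class Entity(NamedTuple):
    """One text-bound annotation read from a brat .ann file."""
    id: str
    label: str
    start: int
    end: int
    text: str


def parse_entities(ann_text: str) -> List[Entity]:
    """Collect contiguous text-bound annotations (lines whose ID starts with T)."""
    entities = []
    for line in ann_text.splitlines():
        if not line.startswith("T"):
            continue  # relations, events, attributes and notes are ignored here
        ann_id, type_and_span, text = line.split("\t", 2)
        label, offsets = type_and_span.split(" ", 1)
        if ";" in offsets:
            continue  # discontinuous span; not handled in this sketch
        start, end = map(int, offsets.split())
        entities.append(Entity(ann_id, label, start, end, text))
    return entities


# Hypothetical sample content; the labels are made up for illustration.
SAMPLE = "T1\tPerson 0 2\t张三\nT2\tPlace 3 5\t北京\nR1\tLives_in Arg1:T1 Arg2:T2\n"

for entity in parse_entities(SAMPLE):
    print(entity)
```

The offsets are character positions in the matching `.txt` file, so they can be mapped onto tokens to build BIO tags for training.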

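2. Installation

There are two ways to install brat. The first is to install it directly on the host OS, which is more tedious, and the package on the official site is not easy to download. The second is to install it with Docker, which is simpler. This article covers the Docker installation.

2.1 Download the required packages

(1) Original brat installation package: brat installation package

(2) Docker image package: brat-docker image package

2.2 Install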

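# Extract the image archive (-x extracts; -c would create a new archive)
tar -zxvf brat.tgz
# Load the image into Docker
docker load -i brat.tar
# Start the container: host port 8088 is mapped to port 80 in the container
docker run --name=brat -d -p 8088:80 -e BRAT_USERNAME=brat -e BRAT_PASSWORD=brat -e BRAT_EMAIL=brat@youremail.com cassj/brat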
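If you need a different host port or different credentials, it can help to generate the `docker run` line instead of editing it by hand. The following is a minimal sketch that only builds and prints the command; the helper name `brat_run_command` and its parameters are assumptions made for illustration, while the flags, environment variables, and image name are the ones used in the command above.

```python
import shlex


def brat_run_command(host_port=8088, username="brat", password="brat",
                     email="brat@youremail.com", image="cassj/brat",
                     name="brat"):
    """Return the `docker run` argument list for the brat container."""
    return [
        "docker", "run", f"--name={name}", "-d",
        "-p", f"{host_port}:80",           # host port -> container port 80
        "-e", f"BRAT_USERNAME={username}",
        "-e", f"BRAT_PASSWORD={password}",
        "-e", f"BRAT_EMAIL={email}",
        image,
    ]


# Print a copy-pasteable command with a non-default port and password.
print(shlex.join(brat_run_command(host_port=9000, password="change-me")))
```

The returned list can also be passed directly to `subprocess.run` on a machine where Docker is installed.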

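2.3 Edit the configuration files

Once the container is running, you need to enter it and edit a few configuration files, for example to enable Chinese support and to configure labels. Edit the files directly inside the container, then refresh the page for the changes to take effect.

(1) Enable Chinese support

# Enter the container
docker exec -it brat bash
# Open the file to edit
vim /var/www/brat/brat-v1.3_Crunchy_Frog/server/src/projectconfig.py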


#!/usr/bin/env python
# -*- Mode: Python; tab-width: 4; indent-tabs-mode: nil; coding: utf-8; -*-
# vim:set ft=python ts=4 sw=4 sts=4 autoindent:


'''
Per-project configuration functionality for
Brat Rapid Annotation Tool (brat)

Author:     Pontus Stenetorp    <pontus is s u-tokyo ac jp>
Author:     Sampo Pyysalo       <smp is s u-tokyo ac jp>
Author:     Illes Solt          <solt tmit bme hu>
Version:    2011-08-15
'''

import re
import robotparser # TODO reduce scope
import urlparse # TODO reduce scope
import sys

from annotation import open_textfile
from message import Messager

ENTITY_CATEGORY, EVENT_CATEGORY, RELATION_CATEGORY, UNKNOWN_CATEGORY = xrange(4)

class InvalidProjectConfigException(Exception):
    pass

# names of files in which various configs are found
__access_control_filename     = 'acl.conf'
__annotation_config_filename  = 'annotation.conf'
__visual_config_filename      = 'visual.conf'
__tools_config_filename       = 'tools.conf'
__kb_shortcut_filename        = 'kb_shortcuts.conf'

# annotation config section name constants
ENTITY_SECTION    = "entities"
RELATION_SECTION  = "relations"
EVENT_SECTION     = "events"
ATTRIBUTE_SECTION = "attributes"

# aliases for config section names
SECTION_ALIAS = {
    "spans" : ENTITY_SECTION,
}

__expected_annotation_sections = (ENTITY_SECTION, RELATION_SECTION, EVENT_SECTION, ATTRIBUTE_SECTION)
__optional_annotation_sections = []

# visual config section name constants
LABEL_SECTION     = "labels"
DRAWING_SECTION   = "drawing"

__expected_visual_sections = (LABEL_SECTION, DRAWING_SECTION)
__optional_visual_sections = []

# tools config section name constants
OPTIONS_SECTION    = "options"
SEARCH_SECTION     = "search"
ANNOTATORS_SECTION = "annotators"
DISAMBIGUATORS_SECTION = "disambiguators"
NORMALIZATION_SECTION = "normalization"

__expected_tools_sections = (OPTIONS_SECTION, SEARCH_SECTION, ANNOTATORS_SECTION, DISAMBIGUATORS_SECTION, NORMALIZATION_SECTION)
__optional_tools_sections = (OPTIONS_SECTION, SEARCH_SECTION, ANNOTATORS_SECTION, DISAMBIGUATORS_SECTION, NORMALIZATION_SECTION)

# special relation types for marking which spans can overlap
# ENTITY_NESTING_TYPE used up to version 1.3, now deprecated
ENTITY_NESTING_TYPE = "ENTITY-NESTING"
# TEXTBOUND_OVERLAP_TYPE used from version 1.3 onward
TEXTBOUND_OVERLAP_TYPE = "<OVERLAP>"
SPECIAL_RELATION_TYPES = set([ENTITY_NESTING_TYPE,
                              TEXTBOUND_OVERLAP_TYPE])
OVERLAP_TYPE_ARG = '<OVL-TYPE>'

# visual config default value names
VISUAL_SPAN_DEFAULT = "SPAN_DEFAULT"
VISUAL_ARC_DEFAULT  = "ARC_DEFAULT"
VISUAL_ATTR_DEFAULT = "ATTRIBUTE_DEFAULT"

# visual config attribute name lists
SPAN_DRAWING_ATTRIBUTES = ['fgColor', 'bgColor', 'borderColor']
ARC_DRAWING_ATTRIBUTES  = ['color', 'dashArray', 'arrowHead', 'labelArrow']
ATTR_DRAWING_ATTRIBUTES  = ['box', 'dashArray', 'glyph', 'position']

# fallback defaults if config files not found
__default_configuration = """
[entities]
Protein

[relations]
Equiv   Arg1:Protein, Arg2:Protein, <REL-TYPE>:symmetric-transitive

[events]
Protein_binding|GO:0005515      Theme+:Protein
Gene_expression|GO:0010467      Theme:Protein

[attributes]
Negation        Arg:<EVENT>
Speculation     Arg:<EVENT>
"""

__default_visual = """
[labels]
Protein | Protein | Pro | P
Protein_binding | Protein binding | Binding | Bind
Gene_expression | Gene expression | Expression | Exp
Theme | Theme | Th

[drawing]
Protein bgColor:#7fa2ff
SPAN_DEFAULT    fgColor:black, bgColor:lightgreen, borderColor:black
ARC_DEFAULT     color:black
ATTRIBUTE_DEFAULT       glyph:*
"""

__default_tools = """
[search]
google     <URL>:http://www.google.com/search?q=%s
"""

__default_kb_shortcuts = """
P       Protein
"""

__default_access_control = """
User-agent: *
Allow: /
Disallow: /hidden/

User-agent: guest
Disallow: /confidential/
"""

# Reserved strings with special meanings in configuration.
reserved_config_name   = ["ANY", "ENTITY", "RELATION", "EVENT", "NONE", "EMPTY", "REL-TYPE", "URL", "URLBASE", "GLYPH-POS", "DEFAULT", "NORM", "OVERLAP", "OVL-TYPE"]
# TODO: "GLYPH-POS" is no longer used, warn if encountered and
# recommend to use "position" instead.
reserved_config_string = ["<%s>" % n for n in reserved_config_name]

# Magic string to use to represent a separator in a config
SEPARATOR_STR = "SEPARATOR"

def normalize_to_storage_form(t):
    """
    Given a label, returns a form of the term that can be used for
    disk storage. For example, space can be replaced with underscores
    to allow use with space-separated formats.
    """
    if t not in normalize_to_storage_form.__cache:
        # conservative implementation: replace any space with
        # underscore, replace unicode accented characters with
        # non-accented equivalents, remove others, and finally replace
        # all characters not in [a-zA-Z0-9_-] with underscores.

        import re
        import unicodedata

        n = t.replace(" ", "_")
        if isinstance(n, unicode):
            ascii = unicodedata.normalize('NFKD', n).encode('ascii', 'ignore')
        n  = re.sub(u'[^a-zA-Z\u4e00-\u9fa5<>,0-9_-]', '_', n)

        normalize_to_storage_form.__cache[t] = n

    return normalize_to_storage_form.__cache[t]
normalize_to_storage_form.__cache = {}

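# Change line 163 of the file (the re.sub call in normalize_to_storage_form) to the line below.
# It adds the CJK range \u4e00-\u9fa5 and the characters < > , to the allowed set, so Chinese labels are no longer replaced with underscores.
n  = re.sub(u'[^a-zA-Z\u4e00-\u9fa5<>,0-9_-]', '_', n)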
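To see why this one-line change matters, the sketch below compares the two character classes on a few labels. It is written for Python 3, whereas the brat file itself is Python 2, so treat it as an illustration rather than code to paste into `projectconfig.py`. The original pattern is taken from the comment inside `normalize_to_storage_form` ("all characters not in [a-zA-Z0-9_-]"). The helper name `normalize` and the sample labels are assumptions.

```python
import re

# Character class described in the source comment: keep only [a-zA-Z0-9_-].
ORIGINAL = re.compile(r"[^a-zA-Z0-9_-]")
# Modified class: additionally keep CJK ideographs U+4E00..U+9FA5 and < > ,
MODIFIED = re.compile("[^a-zA-Z\u4e00-\u9fa5<>,0-9_-]")


def normalize(label, pattern):
    """Replace spaces with underscores, then every disallowed character."""
    return pattern.sub("_", label.replace(" ", "_"))


for label in ["人名", "Gene expression", "<OVERLAP>"]:
    print(repr(label), "->",
          repr(normalize(label, ORIGINAL)), "vs",
          repr(normalize(label, MODIFIED)))
```

With the original class every Chinese character becomes an underscore, so distinct Chinese labels of the same length collapse into the same storage form. With the modified class they are kept as written.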

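(2) Configure the annotation labels

Labels only appear in the tool after you define them in the annotation configuration. Put all of your files in one folder, for example ocr, and store each paragraph of text as its own file, such as ocr1.txt, ocr2.txt, and so on. Then define the labels for that folder in an annotation.conf file placed inside it.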
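The folder layout described above can be prepared with a short script before copying it into the container. This is a minimal sketch under stated assumptions: the helper name `create_collection` and the labels 人名 (person) and 地名 (place) are hypothetical, the four section names are the ones `projectconfig.py` lists as expected, and an empty `.ann` file is written beside each `.txt` file because brat keeps its annotations there.

```python
from pathlib import Path

# Hypothetical label set; replace the entity names with your own.
ANNOTATION_CONF = """\
[entities]
人名
地名

[relations]

[events]

[attributes]
"""


def create_collection(root, texts, name="ocr"):
    """Create <root>/<name> with numbered .txt/.ann pairs and annotation.conf."""
    folder = Path(root) / name
    folder.mkdir(parents=True, exist_ok=True)
    for i, text in enumerate(texts, start=1):
        (folder / f"{name}{i}.txt").write_text(text, encoding="utf-8")
        ann = folder / f"{name}{i}.ann"
        if not ann.exists():
            # Do not overwrite annotations that already exist.
            ann.write_text("", encoding="utf-8")
    (folder / "annotation.conf").write_text(ANNOTATION_CONF, encoding="utf-8")
    return folder
```

For example, `create_collection("data", ["第一段文本", "第二段文本"])` produces `data/ocr/ocr1.txt`, `data/ocr/ocr2.txt`, their `.ann` files, and `data/ocr/annotation.conf`. Where that folder must live inside the container depends on the image, so check the image's data directory before copying.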

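3. Usage

(1) Open http://ip:port in a browser, where port is the host port mapped when the container was started (8088 in the docker run command above), then log in with the username and password brat/brat.

(2) Start annotating directly. The tool is simple to use.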
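Before opening the browser, you can check that the container answers on the mapped port. The sketch below builds the address from a host and port and issues one HTTP request using only the standard library. The function names `brat_url` and `is_reachable` are assumptions, and the default port 8088 comes from the `-p 8088:80` mapping used at install time.

```python
from urllib.error import URLError
from urllib.request import urlopen


def brat_url(host, port=8088):
    """Build the address of the brat web interface."""
    return f"http://{host}:{port}/"


def is_reachable(url, timeout=3.0):
    """Return True if the server answers the request with HTTP 200."""
    try:
        with urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (URLError, OSError):
        # Connection refused, timeout, DNS failure, or an HTTP error status.
        return False


print(brat_url("localhost"))
```

Calling `is_reachable(brat_url("localhost"))` on the Docker host should return `True` once the container has finished starting.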