Segmentation Rules

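Each language has its own segmentation rules.
For example, languages written in the Latin script break sentences at the period (.), question mark (?), colon (:), and exclamation mark (!), while Chinese breaks at the full-width period (。), question mark (?), colon (:), and exclamation mark (!).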

For example, the following paragraph:
The Chinese nation is a great nation. With a history of more than 5,000 years, China has made indelible contributions to the progress of human civilization. After the Opium War of 1840, however, China was gradually reduced to a semi-colonial, semi-feudal society and suffered greater ravages than ever before. The country endured intense humiliation, the people were subjected to great pain, and the Chinese civilization was plunged into darkness. Since that time, national rejuvenation has been the greatest dream of the Chinese people and the Chinese nation.

is split into the following segments:

The Chinese nation is a great nation. | 中华民族是世界上伟大的民族
With a history of more than 5,000 years, China has made indelible contributions to the progress of human civilization. | 有着5000多年源远流长的文明历史,为人类文明进步作出了不可磨灭的贡献。
After the Opium War of 1840, however, China was gradually reduced to a semi-colonial, semi-feudal society and suffered greater ravages than ever before. | 1840年鸦片战争以后,中国逐步成为半殖民地半封建社会
The country endured intense humiliation, the people were subjected to great pain, and the Chinese civilization was plunged into darkness. | 国家蒙辱、人民蒙难、文明蒙尘,中华民族遭受了前所未有的劫难
Since that time, national rejuvenation has been the greatest dream of the Chinese people and the Chinese nation. | 从那时起,实现中华民族伟大复兴,就成为中国人民和中华民族最伟大的梦想。
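
The punctuation-based splitting described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the function names `split_latin` and `split_cjk` are hypothetical, the punctuation sets are the four marks listed earlier rather than an exhaustive list, and no exceptions (abbreviations, decimals) are handled yet.

```python
import re

# Break after . ? ! : when followed by whitespace (Latin-script text).
LATIN_BREAK = re.compile(r"(?<=[.?!:])\s+")
# Break right after the full-width marks; Chinese needs no following space.
CJK_BREAK = re.compile(r"(?<=[。?!:])")


def split_latin(text):
    """Split Latin-script text into segments at terminal punctuation."""
    return [s for s in LATIN_BREAK.split(text.strip()) if s]


def split_cjk(text):
    """Split Chinese text into segments at full-width terminal punctuation."""
    # Zero-width splits require Python 3.7 or later.
    return [s for s in CJK_BREAK.split(text.strip()) if s]


if __name__ == "__main__":
    for seg in split_latin("The Chinese nation is a great nation. "
                           "Since that time, it has been the greatest dream."):
        print(seg)
```

Note that the comma in "5,000" does not trigger a break because the comma is not in the delimiter set.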

Common English segmentation rules

  • Full Stop
    |Before|Delimiter|After|
    |-|-|-|
    ||.|Non-printing character (including space)|
    Regex: \.+[\p{Pe}\p{Pf}\p{Po}"-[\u002C\u003A\u003B\u055D\u060C\u061B\u0703\u0704\u0705\u0706\u0707\u0708\u0709\u07F8\u1363\u1364\u1365\u1366\u1802\u1804\u1808\u204F\u205D\u3001\uA60D\uFE10\uFE11\uFE13\uFE14\uFE50\uFE51\uFE54\uFE55\uFF0C\uFF1A\uFF1B\uFF64]]*\s

    Exception: Lower-case letter exception
    |Before|Delimiter|After|
    |-|-|-|
    ||.|\s\p{Ll} (whitespace followed by a lower-case letter)|
    Regex: \.+[\p{Pe}\p{Pf}\p{Po}"]*\s\p{Ll}
  • Other
    |Before|Delimiter|After|
    |-|-|-|
    ||? or !|Non-printing character (including space)|
    Regex: [!?]+[\p{Pe}\p{Pf}\p{Po}"-[\u002C\u003A\u003B\u055D\u060C\u061B\u0703\u0704\u0705\u0706\u0707\u0708\u0709\u07F8\u1363\u1364\u1365\u1366\u1802\u1804\u1808\u204F\u205D\u3001\uA60D\uFE10\uFE11\uFE13\uFE14\uFE50\uFE51\uFE54\uFE55\uFF0C\uFF1A\uFF1B\uFF64]]*\s

    Exception: Lower-case letter exception
    |Before|Delimiter|After|
    |-|-|-|
    ||.|\s\p{Ll} (whitespace followed by a lower-case letter)|
    Regex: \.+[\p{Pe}\p{Pf}\p{Po}"]*\s\p{Ll}
  • Colon
    |Before|Delimiter|After|
    |-|-|-|
    ||:|Non-printing character (including space)|
    Regex: [:]+[\p{Pe}\p{Pf}\p{Po}"-[\u002C\u003A\u003B\u055D\u060C\u061B\u0703\u0704\u0705\u0706\u0707\u0708\u0709\u07F8\u1363\u1364\u1365\u1366\u1802\u1804\u1808\u204F\u205D\u3001\uA60D\uFE10\uFE11\uFE13\uFE14\uFE50\uFE51\uFE54\uFE55\uFF0C\uFF1A\uFF1B\uFF64]]*\s

    Exception: Lower-case letter exception
    |Before|Delimiter|After|
    |-|-|-|
    ||.|\s\p{Ll} (whitespace followed by a lower-case letter)|
    Regex: \.+[\p{Pe}\p{Pf}\p{Po}"]*\s\p{Ll}
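
The interplay between a break rule and its lower-case letter exception can be sketched in Python as follows. This is an approximation, not the patterns listed above: Python's built-in `re` module supports neither `\p{...}` Unicode classes nor the `[a-[b]]` class-subtraction syntax used in those patterns, so the sketch assumes a small ASCII stand-in for trailing closing punctuation and uses `[a-z]` in place of `\p{Ll}`. The names `BREAK`, `LOWER_NEXT`, and `segment` are hypothetical.

```python
import re

# Delimiter run (. or ?! or :), optional closing quotes/brackets, then whitespace.
# ["')\]]* is an ASCII-only assumption standing in for [\p{Pe}\p{Pf}\p{Po}"...]*.
BREAK = re.compile(r"""(?:\.+|[!?]+|:+)["')\]]*\s+""")
# [a-z] is an ASCII-only assumption standing in for \p{Ll}.
LOWER_NEXT = re.compile(r"[a-z]")


def segment(text):
    """Split text at break candidates unless a lower-case letter follows."""
    segments, start = [], 0
    for m in BREAK.finditer(text):
        # Exception: a lower-case letter right after the whitespace
        # cancels the break (for example after "approx.").
        if LOWER_NEXT.match(text, m.end()):
            continue
        segments.append(text[start:m.end()].rstrip())
        start = m.end()
    tail = text[start:].strip()
    if tail:
        segments.append(tail)
    return segments


if __name__ == "__main__":
    for seg in segment("It costs approx. five dollars. Really? Yes!"):
        print(seg)
```

In this sketch the exception is applied to every delimiter, not only to the period; that is a simplification chosen to keep the example short.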

SRX 2.0 April 7, 2008 | GALA Global (gala-global.org)
SRX (Segmentation Rules eXchange) is an XML-based format for recording segmentation rules.
XML Schema for SRX

<?xml version="1.0" encoding="UTF-8"?>
    
    <!--
        Document        : srx20.xsd
        Version             : 2.0
        Created on      : December 26, 2006
        Authors            : dpooley@sdl.com   rmraya@maxprograms.com
        Description     : This XML Schema defines the structure of SRX 2.0
        Status               : OSCAR recommendation
        
        Copyright © The Localisation Industry Standards Association [LISA] 2006. 
        All Rights Reserved.
    -->
    
    <!-- 
        
        History of modifications (latest first):
        
        Jul-08-2008 by RMR: made foreign elements optional in <header>
        Jan-13-2008 by RMR: Permitted elements from foreign namespaces in <header> element
        Dec-26-2006 by RMR: Fixed namespace handling
        Changed version to "2.0"
        Removed "cascade" attribute from <languagemap>
        Removed <maprule> element
        Adjusted attributes to match the specification document    
        Jun-21-2006 by DRP: Change version number to "1.2" in readiness to move to "2.0"
        Make the cascade attribute mandatory (required) on the <header> element
        Add enumerations where necessary and some brief documentation for elements and attributes
        Jun-15-2006 by DRP: Change version number to "1.1" in readiness to move to "2.0"
        Mar-10-2006 by DRP: Add "cascade" attribute to <header>, <maprule> and <languagemap> elements
        Apr-21-2004 by DRP: Convert to version 1.0.
        Mar-22-2004 by DRP: Eighth draft version.
        Ensure the <excludeexception> element is removed
        Update version number
        Mar-17-2004 by DRP: Seventh draft version.
        Remove <exceptions>, <exception>, <endrules>, <endrule> and <excludeexception> elements
        Add <rule> element
        Update version number
        Feb-02-2004 by DRP: Sixth draft version.
        Update version number
        Oct-27-2003 by DRP: Fifth draft version.
        Removed includeformatting attribute from <header> element
        Added <formathandle> element to the <header>
        Removed priority attribute from <endrule> and <exception> elements
        Added name attribute to <exception> element
        Added <excludeexception> element to the <endrule> element
        Oct-10-2003 by DRP: Fourth draft version.
        Removed <classdefinitions> and <classdefinition> elements
        Removed classdefinitionname attribute
        Removed <digitcharacters>, <whitespacecharacters> and <wordcharacters>
        Added priority attribute to <endrule> and <exception> elements
        Added includeformatting attribute to <header> element
        Jul-24-2003 by DRP: Third draft version.
        Removed <charsets> and <charset> to be replaced with <classdefinitions> and <classdefinition>
        Renamed <digits> to <digitcharacters>
        Renamed <whitespace> to <whitespacecharacters>
        Renamed <wordchars> to <wordcharacters>
        <digitcharacters>, <whitespacecharacters> and <wordcharacters> are now optional
        Renamed <langrules> to <languagerules>
        Renamed <langrule> to <languagerule>
        Renamed <langmap> to <languagemap>
        Renamed langrulename to languagerulename
        Renamed langpattern to languagepattern
        Jun-19-2003 by DRP: Second draft version.
        Removed the <codepage> element.
        Added <header> and <body> elements.
        Nov-22-2002 by DRP: First draft version
        
    -->
    <xs:schema xmlns:srx="http://www.lisa.org/srx20" 
        targetNamespace="http://www.lisa.org/srx20" xml:lang="en" 
        xmlns:xs="http://www.w3.org/2001/XMLSchema" elementFormDefault="qualified">
        <xs:import namespace="http://www.w3.org/XML/1998/namespace"
            schemaLocation="http://www.w3.org/2001/xml.xsd"/>
        <xs:element name="afterbreak">
            <xs:annotation>
                <xs:documentation>Contains the regular expression to match after the segment
                    break</xs:documentation>
            </xs:annotation>
            <xs:complexType mixed="true"/>
        </xs:element>
        <xs:element name="beforebreak">
            <xs:annotation>
                <xs:documentation>Contains the regular expression to match before the segment
                    break</xs:documentation>
            </xs:annotation>
            <xs:complexType mixed="true"/>
        </xs:element>
        <xs:element name="body">
            <xs:annotation>
                <xs:documentation>SRX body</xs:documentation>
            </xs:annotation>
            <xs:complexType>
                <xs:sequence>
                    <xs:element ref="srx:languagerules"/>
                    <xs:element ref="srx:maprules"/>
                </xs:sequence>
            </xs:complexType>
        </xs:element>
        <xs:element name="formathandle">
            <xs:annotation>
                <xs:documentation>Determines which side of the segment break that formatting
                    information goes</xs:documentation>
            </xs:annotation>
            <xs:complexType>
                <xs:attribute name="include" use="required">
                    <xs:annotation>
                        <xs:documentation>A value of "no" indicates that the format code does not belong
                            to the segment being created. A value of "yes" indicates that the format code
                            belongs to the segment being created.</xs:documentation>
                    </xs:annotation>
                    <xs:simpleType>
                        <xs:restriction base="xs:string">
                            <xs:enumeration value="yes"/>
                            <xs:enumeration value="no"/>
                        </xs:restriction>
                    </xs:simpleType>
                </xs:attribute>
                <xs:attribute name="type" use="required">
                    <xs:annotation>
                        <xs:documentation>The type of format for which behaviour is being defined. Can be
                            "start", "end" or "isolated".</xs:documentation>
                    </xs:annotation>
                    <xs:simpleType>
                        <xs:restriction base="xs:string">
                            <xs:enumeration value="start"/>
                            <xs:enumeration value="end"/>
                            <xs:enumeration value="isolated"/>
                        </xs:restriction>
                    </xs:simpleType>
                </xs:attribute>
            </xs:complexType>
        </xs:element>
        <xs:element name="header">
            <xs:annotation>
                <xs:documentation>SRX header</xs:documentation>
            </xs:annotation>
            <xs:complexType>
                <xs:sequence>
                    <xs:element ref="srx:formathandle" minOccurs="0" maxOccurs="3"/>
                    <xs:any minOccurs="0" maxOccurs="unbounded" namespace="##other" processContents="lax"/>
                </xs:sequence>
                <xs:attribute name="segmentsubflows" use="required">
                    <xs:annotation>
                        <xs:documentation>Determines whether text subflows should be
                            segmented</xs:documentation>
                    </xs:annotation>
                    <xs:simpleType>
                        <xs:restriction base="xs:string">
                            <xs:enumeration value="yes"/>
                            <xs:enumeration value="no"/>
                        </xs:restriction>
                    </xs:simpleType>
                </xs:attribute>
                <xs:attribute name="cascade" use="required">
                    <xs:annotation>
                        <xs:documentation>Determines whether a matching &lt;languagemap&gt; element
                            should terminate the search</xs:documentation>
                    </xs:annotation>
                    <xs:simpleType>
                        <xs:restriction base="xs:string">
                            <xs:enumeration value="yes"/>
                            <xs:enumeration value="no"/>
                        </xs:restriction>
                    </xs:simpleType>
                </xs:attribute>
            </xs:complexType>
        </xs:element>
        <xs:element name="languagemap">
            <xs:annotation>
                <xs:documentation>Maps one or more languages to a set of rules</xs:documentation>
            </xs:annotation>
            <xs:complexType>
                <xs:attribute name="languagerulename" type="xs:string" use="required">
                    <xs:annotation>
                        <xs:documentation>The name of the language rule to use when the languagepattern
                            regular expression is satisfied</xs:documentation>
                    </xs:annotation>
                </xs:attribute>
                <xs:attribute name="languagepattern" type="xs:string" use="required">
                    <xs:annotation>
                        <xs:documentation>The regular expression pattern match for the language
                            code</xs:documentation>
                    </xs:annotation>
                </xs:attribute>
            </xs:complexType>
        </xs:element>
        <xs:element name="languagerule">
            <xs:annotation>
                <xs:documentation>A set of rules for a logical set of languages</xs:documentation>
            </xs:annotation>
            <xs:complexType>
                <xs:sequence>
                    <xs:element ref="srx:rule" minOccurs="1" maxOccurs="unbounded"/>
                </xs:sequence>
                <xs:attribute name="languagerulename" type="xs:string" use="required">
                    <xs:annotation>
                        <xs:documentation>The name of the language rule</xs:documentation>
                    </xs:annotation>
                </xs:attribute>
            </xs:complexType>
        </xs:element>
        <xs:element name="languagerules">
            <xs:annotation>
                <xs:documentation>Contains all the logical sets of rules</xs:documentation>
            </xs:annotation>
            <xs:complexType>
                <xs:sequence>
                    <xs:element ref="srx:languagerule" minOccurs="1" maxOccurs="unbounded"/>
                </xs:sequence>
            </xs:complexType>
        </xs:element>
        <xs:element name="maprules">
            <xs:annotation>
                <xs:documentation>A set of language maps</xs:documentation>
            </xs:annotation>
            <xs:complexType>
                <xs:sequence>
                    <xs:element ref="srx:languagemap" minOccurs="1" maxOccurs="unbounded"/>
                </xs:sequence>
            </xs:complexType>
        </xs:element>
        <xs:element name="rule">
            <xs:annotation>
                <xs:documentation>A break/no break rule</xs:documentation>
            </xs:annotation>
            <xs:complexType>
                <xs:sequence>
                    <xs:element ref="srx:beforebreak" minOccurs="0"/>
                    <xs:element ref="srx:afterbreak" minOccurs="0"/>
                </xs:sequence>
                <xs:attribute name="break">
                    <xs:annotation>
                        <xs:documentation>Determines whether this is a segment break or an exception
                            rule</xs:documentation>
                    </xs:annotation>
                    <xs:simpleType>
                        <xs:restriction base="xs:string">
                            <xs:enumeration value="yes"/>
                            <xs:enumeration value="no"/>
                        </xs:restriction>
                    </xs:simpleType>
                </xs:attribute>
            </xs:complexType>
        </xs:element>
        <xs:element name="srx">
            <xs:annotation>
                <xs:documentation>OSCAR Segmentation Rules eXchange</xs:documentation>
            </xs:annotation>
            <xs:complexType>
                <xs:sequence>
                    <xs:element ref="srx:header"/>
                    <xs:element ref="srx:body"/>
                </xs:sequence>
                <xs:attribute name="version" use="required">
                    <xs:annotation>
                        <xs:documentation>The version of SRX</xs:documentation>
                    </xs:annotation>
                    <xs:simpleType>
                        <xs:restriction base="xs:string">
                            <xs:enumeration value="2.0"/>
                        </xs:restriction>
                    </xs:simpleType>
                </xs:attribute>
            </xs:complexType>
        </xs:element>
    </xs:schema>
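
To make the schema more concrete, here is a rough Python sketch that loads a small SRX document and applies its rules. Everything in it is an illustrative assumption rather than a reference implementation: the sample document `SRX_DOC` is hypothetical, its patterns are simplified so that Python's `re` can compile them, `languagepattern` is treated as a full match against the language code, only the first matching `<languagemap>` is used (the `cascade="no"` behaviour), and a missing `break` attribute is treated as "yes".

```python
import re
import xml.etree.ElementTree as ET

NS = {"srx": "http://www.lisa.org/srx20"}

# Hypothetical minimal SRX 2.0 document with one exception and one break rule.
SRX_DOC = r"""<srx xmlns="http://www.lisa.org/srx20" version="2.0">
  <header segmentsubflows="yes" cascade="no"/>
  <body>
    <languagerules>
      <languagerule languagerulename="English">
        <rule break="no">
          <beforebreak>\.</beforebreak>
          <afterbreak>\s+[a-z]</afterbreak>
        </rule>
        <rule break="yes">
          <beforebreak>[.?!:]+</beforebreak>
          <afterbreak>\s</afterbreak>
        </rule>
      </languagerule>
    </languagerules>
    <maprules>
      <languagemap languagepattern="[Ee][Nn].*" languagerulename="English"/>
    </maprules>
  </body>
</srx>"""


def load_rules(srx_xml, lang_code):
    """Return [(is_break, before_regex, after_regex)] for a language code."""
    root = ET.fromstring(srx_xml)
    for lmap in root.iterfind(".//srx:languagemap", NS):
        if re.fullmatch(lmap.get("languagepattern"), lang_code):
            name = lmap.get("languagerulename")
            break
    else:
        return []  # no language map matched
    rules = []
    for lrule in root.iterfind(".//srx:languagerule", NS):
        if lrule.get("languagerulename") != name:
            continue
        for rule in lrule.iterfind("srx:rule", NS):
            before = rule.findtext("srx:beforebreak", default="", namespaces=NS)
            after = rule.findtext("srx:afterbreak", default="", namespaces=NS)
            rules.append((rule.get("break", "yes") == "yes",
                          re.compile(f"(?:{before})$"),  # must end at the break
                          re.compile(after)))            # must start at the break
    return rules


def segment(text, rules):
    """Test each position; the first rule matching there decides."""
    segments, start = [], 0
    for pos in range(1, len(text)):
        for is_break, before, after in rules:
            if before.search(text[:pos]) and after.match(text, pos):
                if is_break:
                    segments.append(text[start:pos].strip())
                    start = pos
                break  # later rules are not consulted at this position
    tail = text[start:].strip()
    if tail:
        segments.append(tail)
    return segments


if __name__ == "__main__":
    rules = load_rules(SRX_DOC, "en-US")
    print(segment("It costs approx. five dollars. Really? Yes!", rules))
```

Listing the `break="no"` exception before the `break="yes"` rule matters in this sketch, because the first rule that matches at a position wins.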
    