Segmentation Rules

dark_2001

Article link: https://blog.csdn.net/dark_2001/article/details/118730812

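Each language has its own segmentation rules.
For example, languages written in the Latin script break sentences at the full stop (.), question mark (?), colon (:), and exclamation mark (!), whereas Chinese breaks at the full-width full stop (。), question mark (？), colon (：), and exclamation mark (！).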

For example, the paragraph:
The Chinese nation is a great nation. With a history of more than 5,000 years, China has made indelible contributions to the progress of human civilization. After the Opium War of 1840, however, China was gradually reduced to a semi-colonial, semi-feudal society and suffered greater ravages than ever before. The country endured intense humiliation, the people were subjected to great pain, and the Chinese civilization was plunged into darkness. Since that time, national rejuvenation has been the greatest dream of the Chinese people and the Chinese nation.

is split into the following segments (English source on the left, Chinese target on the right):

The Chinese nation is a great nation.	中华民族是世界上伟大的民族
With a history of more than 5,000 years, China has made indelible contributions to the progress of human civilization.	有着5000多年源远流长的文明历史，为人类文明进步作出了不可磨灭的贡献。
After the Opium War of 1840, however, China was gradually reduced to a semi-colonial, semi-feudal society and suffered greater ravages than ever before.	1840年鸦片战争以后，中国逐步成为半殖民地半封建社会
The country endured intense humiliation, the people were subjected to great pain, and the Chinese civilization was plunged into darkness.	国家蒙辱、人民蒙难、文明蒙尘，中华民族遭受了前所未有的劫难
Since that time, national rejuvenation has been the greatest dream of the Chinese people and the Chinese nation.	从那时起，实现中华民族伟大复兴，就成为中国人民和中华民族最伟大的梦想。
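The split above can be approximated with a short Python sketch. This is not how Trados implements segmentation; it only assumes that a break occurs at whitespace following `.`, `?`, `!` or `:` for English, and directly after `。`, `？`, `！` or `：` for Chinese. The function names `split_latin` and `split_chinese` are made up for this illustration.

```python
import re

# Assumption: break at whitespace that follows a sentence-final mark.
LATIN_BREAK = re.compile(r"(?<=[.?!:])\s+")
# Assumption: Chinese has no space between sentences, so break right after the mark.
CJK_BREAK = re.compile(r"(?<=[。？！：])")


def split_latin(text: str) -> list[str]:
    """Split Latin-script text at whitespace after . ? ! or :"""
    return [s for s in LATIN_BREAK.split(text.strip()) if s]


def split_chinese(text: str) -> list[str]:
    """Split Chinese text immediately after a full-width sentence-final mark."""
    return [s for s in CJK_BREAK.split(text.strip()) if s]


paragraph = (
    "The Chinese nation is a great nation. With a history of more than 5,000 "
    "years, China has made indelible contributions to the progress of human "
    "civilization. After the Opium War of 1840, however, China was gradually "
    "reduced to a semi-colonial, semi-feudal society and suffered greater "
    "ravages than ever before."
)

for segment in split_latin(paragraph):
    print(segment)
```

Commas inside numbers such as "5,000" do not trigger a break because the comma is not in the separator set. Abbreviations such as "e.g." would still be split incorrectly, which is why real rule sets add exceptions.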

Common English segmentation rules

Full Stop

Before	Separator	After
	.	Whitespace (non-printing characters, including spaces)

Before	After
\.+[\p{Pe}\p{Pf}\p{Po}"-[\u002C\u003A\u003B\u055D\u060C\u061B\u0703\u0704\u0705\u0706\u0707\u0708\u0709\u07F8\u1363\u1364\u1365\u1366\u1802\u1804\u1808\u204F\u205D\u3001\uA60D\uFE10\uFE11\uFE13\uFE14\uFE50\uFE51\uFE54\uFE55\uFF0C\uFF1A\uFF1B\uFF64]]*	\s

Exception: Lower-case letter exception

Before	Separator	After
	.	\s\p{Ll}

Before	After
\.+[\p{Pe}\p{Pf}\p{Po}"]*	\s\p{Ll}
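How a break rule and its exception interact can be sketched in Python as follows. The assumptions are stated loudly: Python's `re` module supports neither `\p{...}` classes nor character-class subtraction, so the patterns here are simplified ASCII stand-ins (`["')\]]` for closing punctuation, `[a-z]` for `\p{Ll}`); rules are tried in order and the first match at a position decides, with the exception listed first. `RULES` and `segment` are hypothetical names, not part of any Trados API.

```python
import re

# (is_break, before_pattern, after_pattern); the exception is listed first so it wins.
RULES = [
    (False, r"\.+[\"')\]]*", r"\s[a-z]"),  # lower-case letter exception
    (True, r"\.+[\"')\]]*", r"\s"),        # full stop followed by whitespace
]


def segment(text, rules=RULES):
    """Split text using ordered break / no-break rules (first match decides)."""
    compiled = [(brk, re.compile(f"(?:{b})$"), re.compile(a)) for brk, b, a in rules]
    segments, start = [], 0
    for pos in range(1, len(text)):
        for is_break, before, after in compiled:
            # "before" must match the text ending at pos, "after" the text starting at pos
            if before.search(text[:pos]) and after.match(text, pos):
                if is_break:
                    segments.append(text[start:pos].strip())
                    start = pos
                break  # the first matching rule decides this position
    tail = text[start:].strip()
    if tail:
        segments.append(tail)
    return segments


print(segment("Use a separator, e.g. a comma. Then press Enter."))
```

In this sketch the full stop after "e.g." is followed by a space and a lower-case letter, so the exception suppresses the break; the full stop after "comma" is followed by an upper-case letter, so the break rule applies.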

Other

Before	Separator	After
	?!	Whitespace (non-printing characters, including spaces)

Before	After
[!?]+[\p{Pe}\p{Pf}\p{Po}"-[\u002C\u003A\u003B\u055D\u060C\u061B\u0703\u0704\u0705\u0706\u0707\u0708\u0709\u07F8\u1363\u1364\u1365\u1366\u1802\u1804\u1808\u204F\u205D\u3001\uA60D\uFE10\uFE11\uFE13\uFE14\uFE50\uFE51\uFE54\uFE55\uFF0C\uFF1A\uFF1B\uFF64]]*	\s

Exception: Lower-case letter exception

Before	Separator	After
	.	\s\p{Ll}

Before	After
\.+[\p{Pe}\p{Pf}\p{Po}"]*	\s\p{Ll}

Colon

Before	Separator	After
	:	Whitespace (non-printing characters, including spaces)

Before	After
[:]+[\p{Pe}\p{Pf}\p{Po}"-[\u002C\u003A\u003B\u055D\u060C\u061B\u0703\u0704\u0705\u0706\u0707\u0708\u0709\u07F8\u1363\u1364\u1365\u1366\u1802\u1804\u1808\u204F\u205D\u3001\uA60D\uFE10\uFE11\uFE13\uFE14\uFE50\uFE51\uFE54\uFE55\uFF0C\uFF1A\uFF1B\uFF64]]*	\s

Exception: Lower-case letter exception

Before	Separator	After
	.	\s\p{Ll}

Before	After
\.+[\p{Pe}\p{Pf}\p{Po}"]*	\s\p{Ll}

SRX 2.0 April 7, 2008 | GALA Global (gala-global.org)
SRX (Segmentation Rules eXchange) is an XML-based file format for recording segmentation rules.
XML Schema for SRX

<?xml version="1.0" encoding="UTF-8"?>
    
    <!--
        Document        : srx20.xsd
        Version             : 2.0
        Created on      : December 26, 2006
        Authors            : dpooley@sdl.com   rmraya@maxprograms.com
        Description     : This XML Schema defines the structure of SRX 2.0
        Status               : OSCAR recommendation
        
        Copyright © The Localisation Industry Standards Association [LISA] 2006. 
        All Rights Reserved.
    -->
    
    <!-- 
        
        History of modifications (latest first):
        
        Jul-08-2008 by RMR: made foreign elements optional in <header>
        Jan-13-2008 by RMR: Permitted elements from foreign namespaces in <header> element
        Dec-26-2006 by RMR: Fixed namespace handling
        Changed version to "2.0"
        Removed "cascade" attribute from <languagemap>
        Removed <maprule> element
        Adjusted attributes to match the specification document    
        Jun-21-2006 by DRP: Change version number to "1.2" in readiness to move to "2.0"
        Make the cascade attribute mandatory (required) on the <header> element
        Add enumerations where necessary and some brief documentation for elements and attributes
        Jun-15-2006 by DRP: Change version number to "1.1" in readiness to move to "2.0"
        Mar-10-2006 by DRP: Add "cascade" attribute to <header>, <maprule> and <languagemap> elements
        Apr-21-2004 by DRP: Convert to version 1.0.
        Mar-22-2004 by DRP: Eighth draft version.
        Ensure the <excludeexception> element is removed
        Update version number
        Mar-17-2004 by DRP: Seventh draft version.
        Remove <exceptions>, <exception>, <endrules>, <endrule> and <excludeexception> elements
        Add <rule> element
        Update version number
        Feb-02-2004 by DRP: Sixth draft version.
        Update version number
        Oct-27-2003 by DRP: Fifth draft version.
        Removed includeformatting attribute from <header> element
        Added <formathandle> element to the <header>
        Removed priority attribute from <endrule> and <exception> elements
        Added name attribute to <exception> element
        Added <excludeexception> element to the <endrule> element
        Oct-10-2003 by DRP: Fourth draft version.
        Removed <classdefinitions> and <classdefinition> elements
        Removed classdefinitionname attribute
        Removed <digitcharacters>, <whitespacecharacters> and <wordcharacters>
        Added priority attribute to <endrule> and <exception> elements
        Added includeformatting attribute to <header> element
        Jul-24-2003 by DRP: Third draft version.
        Removed <charsets> and <charset> to be replaced with <classdefinitions> and <classdefinition>
        Renamed <digits> to <digitcharacters>
        Renamed <whitespace> to <whitespacecharacters>
        Renamed <wordchars> to <wordcharacters>
        <digitcharacters>, <whitespacecharacters> and <wordcharacters> are now optional
        Renamed <langrules> to <languagerules>
        Renamed <langrule> to <languagerule>
        Renamed <langmap> to <languagemap>
        Renamed langrulename to languagerulename
        Renamed langpattern to languagepattern
        Jun-19-2003 by DRP: Second draft version.
        Removed the <codepage> element.
        Added <header> and <body> elements.
        Nov-22-2002 by DRP: First draft version
        
    -->
    <xs:schema xmlns:srx="http://www.lisa.org/srx20" 
        targetNamespace="http://www.lisa.org/srx20" xml:lang="en" 
        xmlns:xs="http://www.w3.org/2001/XMLSchema" elementFormDefault="qualified">
        <xs:import namespace="http://www.w3.org/XML/1998/namespace"
            schemaLocation="http://www.w3.org/2001/xml.xsd"/>
        <xs:element name="afterbreak">
            <xs:annotation>
                <xs:documentation>Contains the regular expression to match after the segment
                    break</xs:documentation>
            </xs:annotation>
            <xs:complexType mixed="true"/>
        </xs:element>
        <xs:element name="beforebreak">
            <xs:annotation>
                <xs:documentation>Contains the regular expression to match before the segment
                    break</xs:documentation>
            </xs:annotation>
            <xs:complexType mixed="true"/>
        </xs:element>
        <xs:element name="body">
            <xs:annotation>
                <xs:documentation>SRX body</xs:documentation>
            </xs:annotation>
            <xs:complexType>
                <xs:sequence>
                    <xs:element ref="srx:languagerules"/>
                    <xs:element ref="srx:maprules"/>
                </xs:sequence>
            </xs:complexType>
        </xs:element>
        <xs:element name="formathandle">
            <xs:annotation>
                <xs:documentation>Determines which side of the segment break that formatting
                    information goes</xs:documentation>
            </xs:annotation>
            <xs:complexType>
                <xs:attribute name="include" use="required">
                    <xs:annotation>
                        <xs:documentation>A value of "no" indicates that the format code does not belong
                            to the segment being created. A value of "yes" indicates that the format code
                            belongs to the segment being created.</xs:documentation>
                    </xs:annotation>
                    <xs:simpleType>
                        <xs:restriction base="xs:string">
                            <xs:enumeration value="yes"/>
                            <xs:enumeration value="no"/>
                        </xs:restriction>
                    </xs:simpleType>
                </xs:attribute>
                <xs:attribute name="type" use="required">
                    <xs:annotation>
                        <xs:documentation>The type of format for which behaviour is being defined. Can be
                            "start", "end" or "isolated".</xs:documentation>
                    </xs:annotation>
                    <xs:simpleType>
                        <xs:restriction base="xs:string">
                            <xs:enumeration value="start"/>
                            <xs:enumeration value="end"/>
                            <xs:enumeration value="isolated"/>
                        </xs:restriction>
                    </xs:simpleType>
                </xs:attribute>
            </xs:complexType>
        </xs:element>
        <xs:element name="header">
            <xs:annotation>
                <xs:documentation>SRX header</xs:documentation>
            </xs:annotation>
            <xs:complexType>
                <xs:sequence>
                    <xs:element ref="srx:formathandle" minOccurs="0" maxOccurs="3"/>
                    <xs:any minOccurs="0" maxOccurs="unbounded" namespace="##other" processContents="lax"/>
                </xs:sequence>
                <xs:attribute name="segmentsubflows" use="required">
                    <xs:annotation>
                        <xs:documentation>Determines whether text subflows should be
                            segmented</xs:documentation>
                    </xs:annotation>
                    <xs:simpleType>
                        <xs:restriction base="xs:string">
                            <xs:enumeration value="yes"/>
                            <xs:enumeration value="no"/>
                        </xs:restriction>
                    </xs:simpleType>
                </xs:attribute>
                <xs:attribute name="cascade" use="required">
                    <xs:annotation>
                        <xs:documentation>Determines whether a matching &lt;languagemap&gt; element
                            should terminate the search</xs:documentation>
                    </xs:annotation>
                    <xs:simpleType>
                        <xs:restriction base="xs:string">
                            <xs:enumeration value="yes"/>
                            <xs:enumeration value="no"/>
                        </xs:restriction>
                    </xs:simpleType>
                </xs:attribute>
            </xs:complexType>
        </xs:element>
        <xs:element name="languagemap">
            <xs:annotation>
                <xs:documentation>Maps one or more languages to a set of rules</xs:documentation>
            </xs:annotation>
            <xs:complexType>
                <xs:attribute name="languagerulename" type="xs:string" use="required">
                    <xs:annotation>
                        <xs:documentation>The name of the language rule to use when the languagepattern
                            regular expression is satisfied</xs:documentation>
                    </xs:annotation>
                </xs:attribute>
                <xs:attribute name="languagepattern" type="xs:string" use="required">
                    <xs:annotation>
                        <xs:documentation>The regular expression pattern match for the language
                            code</xs:documentation>
                    </xs:annotation>
                </xs:attribute>
            </xs:complexType>
        </xs:element>
        <xs:element name="languagerule">
            <xs:annotation>
                <xs:documentation>A set of rules for a logical set of languages</xs:documentation>
            </xs:annotation>
            <xs:complexType>
                <xs:sequence>
                    <xs:element ref="srx:rule" minOccurs="1" maxOccurs="unbounded"/>
                </xs:sequence>
                <xs:attribute name="languagerulename" type="xs:string" use="required">
                    <xs:annotation>
                        <xs:documentation>The name of the language rule</xs:documentation>
                    </xs:annotation>
                </xs:attribute>
            </xs:complexType>
        </xs:element>
        <xs:element name="languagerules">
            <xs:annotation>
                <xs:documentation>Contains all the logical sets of rules</xs:documentation>
            </xs:annotation>
            <xs:complexType>
                <xs:sequence>
                    <xs:element ref="srx:languagerule" minOccurs="1" maxOccurs="unbounded"/>
                </xs:sequence>
            </xs:complexType>
        </xs:element>
        <xs:element name="maprules">
            <xs:annotation>
                <xs:documentation>A set of language maps</xs:documentation>
            </xs:annotation>
            <xs:complexType>
                <xs:sequence>
                    <xs:element ref="srx:languagemap" minOccurs="1" maxOccurs="unbounded"/>
                </xs:sequence>
            </xs:complexType>
        </xs:element>
        <xs:element name="rule">
            <xs:annotation>
                <xs:documentation>A break/no break rule</xs:documentation>
            </xs:annotation>
            <xs:complexType>
                <xs:sequence>
                    <xs:element ref="srx:beforebreak" minOccurs="0"/>
                    <xs:element ref="srx:afterbreak" minOccurs="0"/>
                </xs:sequence>
                <xs:attribute name="break">
                    <xs:annotation>
                        <xs:documentation>Determines whether this is a segment break or an exception
                            rule</xs:documentation>
                    </xs:annotation>
                    <xs:simpleType>
                        <xs:restriction base="xs:string">
                            <xs:enumeration value="yes"/>
                            <xs:enumeration value="no"/>
                        </xs:restriction>
                    </xs:simpleType>
                </xs:attribute>
            </xs:complexType>
        </xs:element>
        <xs:element name="srx">
            <xs:annotation>
                <xs:documentation>OSCAR Segmentation Rules eXchange</xs:documentation>
            </xs:annotation>
            <xs:complexType>
                <xs:sequence>
                    <xs:element ref="srx:header"/>
                    <xs:element ref="srx:body"/>
                </xs:sequence>
                <xs:attribute name="version" use="required">
                    <xs:annotation>
                        <xs:documentation>The version of SRX</xs:documentation>
                    </xs:annotation>
                    <xs:simpleType>
                        <xs:restriction base="xs:string">
                            <xs:enumeration value="2.0"/>
                        </xs:restriction>
                    </xs:simpleType>
                </xs:attribute>
            </xs:complexType>
        </xs:element>
    </xs:schema>
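To make the schema more concrete, here is a small Python sketch that reads a minimal SRX document and extracts the rules for a language code. The sample document is invented for illustration and uses simplified patterns, since Python's `re` lacks `\p{...}`. Three points are assumptions rather than anything stated in the schema itself: that `languagepattern` is matched against the whole language code, that a `<rule>` without a `break` attribute counts as a break, and that with `cascade="no"` only the first matching `<languagemap>` is used. `load_rules` is a hypothetical helper name.

```python
import re
import xml.etree.ElementTree as ET

NS = {"srx": "http://www.lisa.org/srx20"}

# Invented sample; element and attribute names follow the schema above.
SAMPLE_SRX = r"""
<srx xmlns="http://www.lisa.org/srx20" version="2.0">
  <header segmentsubflows="yes" cascade="no">
    <formathandle type="start" include="no"/>
    <formathandle type="end" include="yes"/>
    <formathandle type="isolated" include="no"/>
  </header>
  <body>
    <languagerules>
      <languagerule languagerulename="English">
        <rule break="no">
          <beforebreak>\.+</beforebreak>
          <afterbreak>\s[a-z]</afterbreak>
        </rule>
        <rule break="yes">
          <beforebreak>\.+</beforebreak>
          <afterbreak>\s</afterbreak>
        </rule>
      </languagerule>
      <languagerule languagerulename="Default">
        <rule break="yes">
          <beforebreak>[。？！]</beforebreak>
        </rule>
      </languagerule>
    </languagerules>
    <maprules>
      <languagemap languagepattern="[Ee][Nn].*" languagerulename="English"/>
      <languagemap languagepattern=".*" languagerulename="Default"/>
    </maprules>
  </body>
</srx>
"""


def load_rules(srx_text, language_code):
    """Return [(is_break, before, after)] for the first matching <languagemap>."""
    root = ET.fromstring(srx_text)
    for lang_map in root.iterfind(".//srx:languagemap", NS):
        # Assumption: the pattern must match the entire language code.
        if re.fullmatch(lang_map.get("languagepattern"), language_code):
            rule_name = lang_map.get("languagerulename")
            break
    else:
        raise LookupError(f"no languagemap matches {language_code!r}")

    rules = []
    for lang_rule in root.iterfind(".//srx:languagerule", NS):
        if lang_rule.get("languagerulename") != rule_name:
            continue
        for rule in lang_rule.iterfind("srx:rule", NS):
            before = rule.findtext("srx:beforebreak", default="", namespaces=NS)
            after = rule.findtext("srx:afterbreak", default="", namespaces=NS)
            # Assumption: a missing break attribute means "yes".
            rules.append((rule.get("break", "yes") == "yes", before, after))
    return rules


print(load_rules(SAMPLE_SRX, "en-US"))
print(load_rules(SAMPLE_SRX, "zh-CN"))
```

Both `<beforebreak>` and `<afterbreak>` are optional in the schema (`minOccurs="0"`), so the sketch treats a missing element as an empty pattern.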
