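Segmentation Rule

Every language has its own segmentation rules. For example, languages written in the Latin script break segments at the period (.), question mark (?), colon (:), and exclamation mark (!), whereas Chinese breaks at the full-width period (。), question mark (?), colon (:), and exclamation mark (!).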
For example, the paragraph:

The Chinese nation is a great nation. With a history of more than 5,000 years, China has made indelible contributions to the progress of human civilization. After the Opium War of 1840, however, China was gradually reduced to a semi-colonial, semi-feudal society and suffered greater ravages than ever before. The country endured intense humiliation, the people were subjected to great pain, and the Chinese civilization was plunged into darkness. Since that time, national rejuvenation has been the greatest dream of the Chinese people and the Chinese nation.

is split into the following segments, each shown with its Chinese counterpart:

The Chinese nation is a great nation.
中华民族是世界上伟大的民族

With a history of more than 5,000 years, China has made indelible contributions to the progress of human civilization.
有着5000多年源远流长的文明历史,为人类文明进步作出了不可磨灭的贡献。

After the Opium War of 1840, however, China was gradually reduced to a semi-colonial, semi-feudal society and suffered greater ravages than ever before.
1840年鸦片战争以后,中国逐步成为半殖民地半封建社会

The country endured intense humiliation, the people were subjected to great pain, and the Chinese civilization was plunged into darkness.
国家蒙辱、人民蒙难、文明蒙尘,中华民族遭受了前所未有的劫难

Since that time, national rejuvenation has been the greatest dream of the Chinese people and the Chinese nation.
从那时起,实现中华民族伟大复兴,就成为中国人民和中华民族最伟大的梦想。
Common English segmentation rules

|Before|After|
|-|-|
|`\.+[\p{Pe}\p{Pf}\p{Po}"-[\u002C\u003A\u003B\u055D\u060C\u061B\u0703\u0704\u0705\u0706\u0707\u0708\u0709\u07F8\u1363\u1364\u1365\u1366\u1802\u1804\u1808\u204F\u205D\u3001\uA60D\uFE10\uFE11\uFE13\uFE14\uFE50\uFE51\uFE54\uFE55\uFF0C\uFF1A\uFF1B\uFF64]]*`|`\s`|

Exception: lower-case letter exception

|Before|After|
|-|-|
|`\.+[\p{Pe}\p{Pf}\p{Po}"]*`|`\s\p{Ll}`|
Other

|Before|Separator|After|
|-|-|-|
||?!|Non-printing character (including spaces)|

|Before|After|
|-|-|
|`[!?]+[\p{Pe}\p{Pf}\p{Po}"-[\u002C\u003A\u003B\u055D\u060C\u061B\u0703\u0704\u0705\u0706\u0707\u0708\u0709\u07F8\u1363\u1364\u1365\u1366\u1802\u1804\u1808\u204F\u205D\u3001\uA60D\uFE10\uFE11\uFE13\uFE14\uFE50\uFE51\uFE54\uFE55\uFF0C\uFF1A\uFF1B\uFF64]]*`|`\s`|

Exception: lower-case letter exception

|Before|After|
|-|-|
|`[!?]+[\p{Pe}\p{Pf}\p{Po}"]*`|`\s\p{Ll}`|
|Before|After|
|-|-|
|`[:]+[\p{Pe}\p{Pf}\p{Po}"-[\u002C\u003A\u003B\u055D\u060C\u061B\u0703\u0704\u0705\u0706\u0707\u0708\u0709\u07F8\u1363\u1364\u1365\u1366\u1802\u1804\u1808\u204F\u205D\u3001\uA60D\uFE10\uFE11\uFE13\uFE14\uFE50\uFE51\uFE54\uFE55\uFF0C\uFF1A\uFF1B\uFF64]]*`|`\s`|

Exception: lower-case letter exception

|Before|After|
|-|-|
|`[:]+[\p{Pe}\p{Pf}\p{Po}"]*`|`\s\p{Ll}`|
SRX 2.0 April 7, 2008 | GALA Global (gala-global.org)

SRX (Segmentation Rules eXchange) is an XML document format for recording segmentation rules.

XML Schema for SRX
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:srx="http://www.lisa.org/srx20"
    targetNamespace="http://www.lisa.org/srx20" xml:lang="en"
    xmlns:xs="http://www.w3.org/2001/XMLSchema" elementFormDefault="qualified">
  <xs:import namespace="http://www.w3.org/XML/1998/namespace"
      schemaLocation="http://www.w3.org/2001/xml.xsd"/>
  <xs:element name="afterbreak">
    <xs:annotation>
      <xs:documentation>Contains the regular expression to match after the segment
        break</xs:documentation>
    </xs:annotation>
    <xs:complexType mixed="true"/>
  </xs:element>
  <xs:element name="beforebreak">
    <xs:annotation>
      <xs:documentation>Contains the regular expression to match before the segment
        break</xs:documentation>
    </xs:annotation>
    <xs:complexType mixed="true"/>
  </xs:element>
  <xs:element name="body">
    <xs:annotation>
      <xs:documentation>SRX body</xs:documentation>
    </xs:annotation>
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="srx:languagerules"/>
        <xs:element ref="srx:maprules"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
  <xs:element name="formathandle">
    <xs:annotation>
      <xs:documentation>Determines which side of the segment break that formatting
        information goes</xs:documentation>
    </xs:annotation>
    <xs:complexType>
      <xs:attribute name="include" use="required">
        <xs:annotation>
          <xs:documentation>A value of "no" indicates that the format code does not belong
            to the segment being created. A value of "yes" indicates that the format code
            belongs to the segment being created.</xs:documentation>
        </xs:annotation>
        <xs:simpleType>
          <xs:restriction base="xs:string">
            <xs:enumeration value="yes"/>
            <xs:enumeration value="no"/>
          </xs:restriction>
        </xs:simpleType>
      </xs:attribute>
      <xs:attribute name="type" use="required">
        <xs:annotation>
          <xs:documentation>The type of format for which behaviour is being defined. Can be
            "start", "end" or "isolated".</xs:documentation>
        </xs:annotation>
        <xs:simpleType>
          <xs:restriction base="xs:string">
            <xs:enumeration value="start"/>
            <xs:enumeration value="end"/>
            <xs:enumeration value="isolated"/>
          </xs:restriction>
        </xs:simpleType>
      </xs:attribute>
    </xs:complexType>
  </xs:element>
  <xs:element name="header">
    <xs:annotation>
      <xs:documentation>SRX header</xs:documentation>
    </xs:annotation>
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="srx:formathandle" minOccurs="0" maxOccurs="3"/>
        <xs:any minOccurs="0" maxOccurs="unbounded" namespace="##other" processContents="lax"/>
      </xs:sequence>
      <xs:attribute name="segmentsubflows" use="required">
        <xs:annotation>
          <xs:documentation>Determines whether text subflows should be
            segmented</xs:documentation>
        </xs:annotation>
        <xs:simpleType>
          <xs:restriction base="xs:string">
            <xs:enumeration value="yes"/>
            <xs:enumeration value="no"/>
          </xs:restriction>
        </xs:simpleType>
      </xs:attribute>
      <xs:attribute name="cascade" use="required">
        <xs:annotation>
          <xs:documentation>Determines whether a matching &lt;languagemap&gt; element
            should terminate the search</xs:documentation>
        </xs:annotation>
        <xs:simpleType>
          <xs:restriction base="xs:string">
            <xs:enumeration value="yes"/>
            <xs:enumeration value="no"/>
          </xs:restriction>
        </xs:simpleType>
      </xs:attribute>
    </xs:complexType>
  </xs:element>
  <xs:element name="languagemap">
    <xs:annotation>
      <xs:documentation>Maps one or more languages to a set of rules</xs:documentation>
    </xs:annotation>
    <xs:complexType>
      <xs:attribute name="languagerulename" type="xs:string" use="required">
        <xs:annotation>
          <xs:documentation>The name of the language rule to use when the languagepattern
            regular expression is satisfied</xs:documentation>
        </xs:annotation>
      </xs:attribute>
      <xs:attribute name="languagepattern" type="xs:string" use="required">
        <xs:annotation>
          <xs:documentation>The regular expression pattern match for the language
            code</xs:documentation>
        </xs:annotation>
      </xs:attribute>
    </xs:complexType>
  </xs:element>
  <xs:element name="languagerule">
    <xs:annotation>
      <xs:documentation>A set of rules for a logical set of languages</xs:documentation>
    </xs:annotation>
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="srx:rule" minOccurs="1" maxOccurs="unbounded"/>
      </xs:sequence>
      <xs:attribute name="languagerulename" type="xs:string" use="required">
        <xs:annotation>
          <xs:documentation>The name of the language rule</xs:documentation>
        </xs:annotation>
      </xs:attribute>
    </xs:complexType>
  </xs:element>
  <xs:element name="languagerules">
    <xs:annotation>
      <xs:documentation>Contains all the logical sets of rules</xs:documentation>
    </xs:annotation>
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="srx:languagerule" minOccurs="1" maxOccurs="unbounded"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
  <xs:element name="maprules">
    <xs:annotation>
      <xs:documentation>A set of language maps</xs:documentation>
    </xs:annotation>
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="srx:languagemap" minOccurs="1" maxOccurs="unbounded"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
  <xs:element name="rule">
    <xs:annotation>
      <xs:documentation>A break/no break rule</xs:documentation>
    </xs:annotation>
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="srx:beforebreak" minOccurs="0"/>
        <xs:element ref="srx:afterbreak" minOccurs="0"/>
      </xs:sequence>
      <xs:attribute name="break">
        <xs:annotation>
          <xs:documentation>Determines whether this is a segment break or an exception
            rule</xs:documentation>
        </xs:annotation>
        <xs:simpleType>
          <xs:restriction base="xs:string">
            <xs:enumeration value="yes"/>
            <xs:enumeration value="no"/>
          </xs:restriction>
        </xs:simpleType>
      </xs:attribute>
    </xs:complexType>
  </xs:element>
  <xs:element name="srx">
    <xs:annotation>
      <xs:documentation>OSCAR Segmentation Rules eXchange</xs:documentation>
    </xs:annotation>
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="srx:header"/>
        <xs:element ref="srx:body"/>
      </xs:sequence>
      <xs:attribute name="version" use="required">
        <xs:annotation>
          <xs:documentation>The version of SRX</xs:documentation>
        </xs:annotation>
        <xs:simpleType>
          <xs:restriction base="xs:string">
            <xs:enumeration value="2.0"/>
          </xs:restriction>
        </xs:simpleType>
      </xs:attribute>
    </xs:complexType>
  </xs:element>
</xs:schema>
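
As a rough illustration of how the elements in this schema fit together, a minimal SRX file might look like the sketch below. The rule names `English` and `Chinese`, the language patterns, and the simplified regular expressions are assumptions made for this example; they are not the full rule set listed earlier. The exception rule (`break="no"`) is placed before the break rule because SRX evaluates the rules of a language rule in the order they are listed.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Illustrative sketch only: names and patterns below are hypothetical. -->
<srx xmlns="http://www.lisa.org/srx20" version="2.0">
  <!-- Both header attributes are required by the schema. -->
  <header segmentsubflows="yes" cascade="no">
    <!-- Optional: at most three formathandle elements are allowed. -->
    <formathandle type="start" include="no"/>
    <formathandle type="end" include="yes"/>
    <formathandle type="isolated" include="no"/>
  </header>
  <body>
    <languagerules>
      <languagerule languagerulename="English">
        <!-- Exception: do not break when a lower-case letter follows. -->
        <rule break="no">
          <beforebreak>[\.\?!:]+</beforebreak>
          <afterbreak>\s\p{Ll}</afterbreak>
        </rule>
        <!-- Break after terminating punctuation followed by whitespace. -->
        <rule break="yes">
          <beforebreak>[\.\?!:]+</beforebreak>
          <afterbreak>\s</afterbreak>
        </rule>
      </languagerule>
      <languagerule languagerulename="Chinese">
        <!-- afterbreak is optional, so the break follows the punctuation directly. -->
        <rule break="yes">
          <beforebreak>[。?:!]+</beforebreak>
        </rule>
      </languagerule>
    </languagerules>
    <maprules>
      <!-- Language codes are matched against languagepattern, in order. -->
      <languagemap languagepattern="[Ee][Nn].*" languagerulename="English"/>
      <languagemap languagepattern="[Zz][Hh].*" languagerulename="Chinese"/>
    </maprules>
  </body>
</srx>
```

With `cascade="no"`, only the first `languagemap` whose pattern matches the language code is used, so a text tagged `en-US` would be segmented with the `English` rules alone.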