Sentence Boundary Disambiguation (SBD)
The SBD Process
The SBD process is language-dependent and is often not straightforward.
Common approaches to detecting boundaries:
- a set of rules
- a trained model
Most search engines are not concerned with SBD; they are interested only in a query's tokens and their positions.
What makes SBD difficult?
- Punctuation is frequently ambiguous
- Abbreviations often contain periods (in English, abbreviations frequently end with '.')
- Sentences may be embedded within each other by the use of quotes (quoted sentences)
- With more specialized text, such as tweets and chat sessions, we may need to consider the use of new lines or the completion of clauses - is the unit a word, a phrase, or a sentence?
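The ambiguity of the period is easy to demonstrate: naively splitting on '.' breaks on abbreviations. A minimal sketch (class name and sample text are illustrative):

```java
public class NaiveSplit {
    public static void main(String[] args) {
        // Naively treating every '.' as a sentence boundary
        String text = "Dr. Smith arrived. He was late.";
        String[] parts = text.split("\\.\\s*");
        for (String p : parts) {
            System.out.println(p);
        }
        // prints "Dr", "Smith arrived", "He was late" - the abbreviation
        // "Dr." is wrongly split into its own "sentence"
    }
}
```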
[^.!?\s][^.!?]*(?:[.!?](?!['"]?\s|$)[^.!?]*)*[.!?]?['"]?(?=\s|$)
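A sketch of how this rule-based regex could be applied in Java (class name and sample text are illustrative; note that the backslashes and double quote inside the pattern must be escaped in the string literal):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexSbd {
    // The sentence-boundary regex from above, escaped for a Java string literal
    private static final Pattern SENTENCE = Pattern.compile(
            "[^.!?\\s][^.!?]*(?:[.!?](?!['\"]?\\s|$)[^.!?]*)*[.!?]?['\"]?(?=\\s|$)");

    // Collect every regex match as one "sentence"
    static List<String> split(String paragraph) {
        List<String> sentences = new ArrayList<>();
        Matcher m = SENTENCE.matcher(paragraph);
        while (m.find()) {
            sentences.add(m.group());
        }
        return sentences;
    }

    public static void main(String[] args) {
        for (String s : split("Hello world! This is a test.")) {
            System.out.println(s);
        }
        // prints "Hello world!" and "This is a test."
    }
}
```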
JDK sentence breaker
import java.text.BreakIterator;
import java.util.Locale;

BreakIterator sentenceIterator = BreakIterator.getSentenceInstance(Locale.US);
sentenceIterator.setText(paragraph);
int boundary = sentenceIterator.first();
while (boundary != BreakIterator.DONE) {
    int begin = boundary;
    System.out.print(boundary + "-");
    boundary = sentenceIterator.next();
    int end = boundary;
    if (end == BreakIterator.DONE) {
        break;
    }
    System.out.println(boundary + " [" + paragraph.substring(begin, end) + "]");
}
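The same BreakIterator approach can be wrapped into a self-contained helper that collects the sentences instead of printing boundary indices (class name and sample text are illustrative):

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class JdkSentences {
    // Collect sentences from a paragraph using the JDK's BreakIterator
    static List<String> sentences(String paragraph) {
        List<String> result = new ArrayList<>();
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.US);
        it.setText(paragraph);
        int begin = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; begin = end, end = it.next()) {
            // BreakIterator boundaries include trailing whitespace; trim it off
            result.add(paragraph.substring(begin, end).trim());
        }
        return result;
    }

    public static void main(String[] args) {
        for (String s : sentences("Hello world. How are you?")) {
            System.out.println(s);
        }
    }
}
```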
Training a Sentence Detector Model
- train a sentence detector model on annotated data
- use the trained model to detect sentences
- evaluate the model
/**
 * Train a sentence detector with OpenNLP
 */
try {
    // Read training data: one sentence per line
    ObjectStream<String> lineStream = new PlainTextByLineStream(new FileReader("sentence.train"));
    ObjectStream<SentenceSample> sampleStream = new SentenceSampleStream(lineStream);
    boolean useTokenEnd = true;
    Dictionary abbreviationDictionary = null; // no abbreviation dictionary
    SentenceDetectorFactory sdFactory = SentenceDetectorFactory.create(
            null, "en", useTokenEnd, abbreviationDictionary, new char[]{'.'});
    SentenceModel model = SentenceDetectorME.train(
            "en", sampleStream, sdFactory, TrainingParameters.defaultParams());

    // Serialize the trained model to disk
    OutputStream modelStream = new BufferedOutputStream(new FileOutputStream("modelFile"));
    model.serialize(modelStream);
    modelStream.close();

    // Use the trained model
    // InputStream is = new FileInputStream(new File(getModelDir(), "modelFile"));
    SentenceModel loadedModel = new SentenceModel(is);
    SentenceDetectorME detector = new SentenceDetectorME(loadedModel);
    String[] sentences = detector.sentDetect(paragraph);
    for (String sentence : sentences) {
        System.out.println(sentence);
    }

    // Evaluate the model (ideally on a held-out sample stream, not the training data)
    SentenceDetectorEvaluator evaluator = new SentenceDetectorEvaluator(detector, null);
    evaluator.evaluate(sampleStream);
    System.out.println(evaluator.getFMeasure());
} catch (Exception ex) {
    ex.printStackTrace();
}
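The "sentence.train" file is expected in OpenNLP's sentence-detector training format: one sentence per line, with an empty line marking a document boundary. A minimal sketch of such a file (contents are illustrative):

```
Each training sentence goes on its own line.
Another sentence here.

A new document starts after the empty line.
```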