Sentence Boundary Disambiguation (SBD)
The SBD Process
The SBD process is language-dependent and is often not straightforward.
Common approaches to detecting boundaries:
- a set of rules
- a trained model
Most search engines are not concerned with SBD; they are interested only in a query's tokens and their positions.
What makes SBD difficult?
- Punctuation is frequently ambiguous
- Abbreviations often contain periods (in English, abbreviations frequently end with '.')
- Sentences may be embedded within each other by the use of quotes (quoted sentences)
- With more specialized text, such as tweets and chat sessions, we may need to consider the use of new lines or the completion of clauses - is the unit a word, a phrase, or a sentence?
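The ambiguity of the period is easy to demonstrate: naively splitting on '.' breaks on abbreviations. A minimal sketch (class name and sample text are illustrative):

```java
public class NaiveSplit {
    public static void main(String[] args) {
        // Naively treating every '.' as a sentence boundary
        String text = "Dr. Smith arrived. He was late.";
        String[] parts = text.split("\\.\\s*");
        for (String p : parts) {
            System.out.println(p);
        }
        // prints "Dr", "Smith arrived", "He was late" - the abbreviation
        // "Dr." is wrongly split into its own "sentence"
    }
}
```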
[^.!?\s][^.!?]*(?:[.!?](?!['"]?\s|$)[^.!?]*)*[.!?]?['"]?(?=\s|$)
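A sketch of how this rule-based regex could be applied in Java (class name and sample text are illustrative; note that the backslashes and double quote inside the pattern must be escaped in the string literal):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexSbd {
    // The sentence-boundary regex from above, escaped for a Java string literal
    private static final Pattern SENTENCE = Pattern.compile(
            "[^.!?\\s][^.!?]*(?:[.!?](?!['\"]?\\s|$)[^.!?]*)*[.!?]?['\"]?(?=\\s|$)");

    // Collect every regex match as one "sentence"
    static List<String> split(String paragraph) {
        List<String> sentences = new ArrayList<>();
        Matcher m = SENTENCE.matcher(paragraph);
        while (m.find()) {
            sentences.add(m.group());
        }
        return sentences;
    }

    public static void main(String[] args) {
        for (String s : split("Hello world! This is a test.")) {
            System.out.println(s);
        }
        // prints "Hello world!" and "This is a test."
    }
}
```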
JDK sentence breaker
import java.text.BreakIterator;
import java.util.Locale;

BreakIterator sentenceIterator = BreakIterator.getSentenceInstance(Locale.US);
sentenceIterator.setText(paragraph);
int boundary = sentenceIterator.first();
while (boundary != BreakIterator.DONE) {
    int begin = boundary;
    System.out.print(boundary + "-");
    boundary = sentenceIterator.next();
    int end = boundary;
    if (end == BreakIterator.DONE) {
        break;
    }
    System.out.println(boundary + " [" + paragraph.substring(begin, end) + "]");
}
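The same BreakIterator approach can be wrapped into a self-contained helper that collects the sentences instead of printing boundary indices (class name and sample text are illustrative):

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class JdkSentences {
    // Collect sentences from a paragraph using the JDK's BreakIterator
    static List<String> sentences(String paragraph) {
        List<String> result = new ArrayList<>();
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.US);
        it.setText(paragraph);
        int begin = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; begin = end, end = it.next()) {
            // BreakIterator boundaries include trailing whitespace; trim it off
            result.add(paragraph.substring(begin, end).trim());
        }
        return result;
    }

    public static void main(String[] args) {
        for (String s : sentences("Hello world. How are you?")) {
            System.out.println(s);
        }
    }
}
```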
Training a Sentence Detector Model
- train a sentence detector model on annotated data
- use the trained model to detect sentences
- evaluate the model
/**
 * Train a sentence detector with OpenNLP
 */
try {
    // Read training data: one sentence per line
    ObjectStream<String> lineStream = new PlainTextByLineStream(new FileReader("sentence.train"));
    ObjectStream<SentenceSample> sampleStream = new SentenceSampleStream(lineStream);
    boolean useTokenEnd = true;
    Dictionary abbreviationDictionary = null; // no abbreviation dictionary
    SentenceDetectorFactory sdFactory = SentenceDetectorFactory.create(
            null, "en", useTokenEnd, abbreviationDictionary, new char[]{'.'});
    SentenceModel model = SentenceDetectorME.train(
            "en", sampleStream, sdFactory, TrainingParameters.defaultParams());

    // Serialize the trained model to disk
    OutputStream modelStream = new BufferedOutputStream(new FileOutputStream("modelFile"));
    model.serialize(modelStream);
    modelStream.close();

    // Use the trained model
    // InputStream is = new FileInputStream(new File(getModelDir(), "modelFile"));
    SentenceModel loadedModel = new SentenceModel(is);
    SentenceDetectorME detector = new SentenceDetectorME(loadedModel);
    String[] sentences = detector.sentDetect(paragraph);
    for (String sentence : sentences) {
        System.out.println(sentence);
    }

    // Evaluate the model (ideally on a held-out sample stream, not the training data)
    SentenceDetectorEvaluator evaluator = new SentenceDetectorEvaluator(detector, null);
    evaluator.evaluate(sampleStream);
    System.out.println(evaluator.getFMeasure());
} catch (Exception ex) {
    ex.printStackTrace();
}
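The "sentence.train" file is expected in OpenNLP's sentence-detector training format: one sentence per line, with an empty line marking a document boundary. A minimal sketch of such a file (contents are illustrative):

```
Each training sentence goes on its own line.
Another sentence here.

A new document starts after the empty line.
```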