Extracting the sentences of an article in Java

Using the open-source LingPipe library, fewer than a hundred lines of code are enough to split an article into sentences and collect all of them.

Maven dependency:
<dependency>
    <groupId>de.julielab</groupId>
    <artifactId>aliasi-lingpipe</artifactId>
    <version>4.1.0</version>
</dependency>
The code is as follows:
package com.aeschoolmate.util;

import com.aliasi.sentences.IndoEuropeanSentenceModel;
import com.aliasi.sentences.SentenceModel;
import com.aliasi.tokenizer.IndoEuropeanTokenizerFactory;
import com.aliasi.tokenizer.Tokenizer;
import com.aliasi.tokenizer.TokenizerFactory;

import java.util.ArrayList;
import java.util.List;

/**
 * Utility class for splitting article text into sentences, words, and characters.
 */
public class ArticleManager {

    // A single character
    public static class Letter {
        char letter;
    }

    // A sentence
    public class Sentence {
        Word[] words; // the tokens that make up the sentence
        public String text; // the sentence text

    }

    // A word (token)
    public class Word {
        Letter[] letters; // the characters of the word
        String word; // the word text
    }


    final TokenizerFactory tokenizerFactory = IndoEuropeanTokenizerFactory.INSTANCE;
    final SentenceModel sentenceModel = new IndoEuropeanSentenceModel();

    /**
     * Splits an article into sentences.
     * @param text the article text
     * @return the list of sentences
     */
    public List<Sentence> getSentences(String text) {
        List<String> wordlist = new ArrayList<String>();
        List<String> whiteList = new ArrayList<String>();
        Tokenizer tokenizer = tokenizerFactory.tokenizer(text.toCharArray(),
                0, text.length());
        tokenizer.tokenize(wordlist, whiteList);
        String[] words = new String[wordlist.size()]; // all tokens (words and punctuation)
        String[] whites = new String[whiteList.size()]; // whitespace between the tokens
        wordlist.toArray(words);
        whiteList.toArray(whites);
        int[] sentenceLastCharIndexs = sentenceModel.boundaryIndices(words,
                whites); // index of the last token of each sentence
        int nextIndex = 0;
        ArrayList<Sentence> sentences = new ArrayList<Sentence>();
        for (int lastIndex : sentenceLastCharIndexs) {
            Sentence sentence = new Sentence();
            StringBuilder stringBuilder = new StringBuilder();
            Word[] sentencWords = new Word[lastIndex - nextIndex + 1];
            int j = 0;
            for (int i = nextIndex; i <= lastIndex; i++) {
                Word sentencWord = new Word();
                sentencWords[j++] = sentencWord;
                sentencWord.word = (words[i]);
                char[] chars = new char[sentencWord.word.length()];
                sentencWord.word.getChars(0, chars.length, chars, 0);
                Letter[] letters = new Letter[chars.length];
                for (int z = 0; z < chars.length; z++) {
                    Letter letter = new Letter();
                    letter.letter = chars[z];
                    letters[z] = letter;
                }
                sentencWord.letters = letters;
                stringBuilder.append(sentencWord.word);
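                // Add a space after every token except the one just before the
                // sentence-final token, so the closing punctuation attaches to the
                // last word. A space is still appended after the final token.
                if (i != lastIndex - 1) {
                    stringBuilder.append(' ');
                }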

            }
            sentence.text = stringBuilder.toString();
            sentence.words = sentencWords;
            nextIndex = lastIndex + 1;
            sentences.add(sentence);
        }
        return sentences;

    }


}
Sentence test and its output:
    static String text ="This novel has been fully translated by RWX. There are a total of 21 books spanning 800+ chapters, so sit back, buckle your seat belts, and get ready for one long ride!\n\n" +
            "\n" +
            "Empires rise and fall on the Yulan Continent. Saints, immortal beings of unimaginable power, battle using spells and swords, leaving swathes of destruction in their wake. Magical beasts rule the mountains, where the brave – or the foolish – go to test their strength. Even the mighty can fall, feasted on by those stronger. The strong live like royalty; the weak strive to survive another day.\n" +
            "\n" +
            "This is the world which Linley is born into. Raised in the small town of Wushan, Linley is a scion of the Baruch clan, the clan of the once-legendary Dragonblood Warriors. Their fame once shook the world, but the clan is now so decrepit that even the heirlooms of the clan have been sold off. Tasked with reclaiming the lost glory of his clan, Linley will go through countless trials and tribulations, making powerful friends but also deadly enemies.\n" +
            "\n" +
            "Come witness a new legend in the making. The legend of Linley Baruch.";


    public static void main(String[] args){
        ArticleManager articleManager = new ArticleManager();
        List<ArticleManager.Sentence> sentences = articleManager.getSentences(text);
        for (ArticleManager.Sentence sentence : sentences) {
            System.out.println(sentence.text);
        }

    }
Console:
This novel has been fully translated by RWX. 
There are a total of 21 books spanning 800 + chapters , so sit back , buckle your seat belts , and get ready for one long ride! 
Empires rise and fall on the Yulan Continent. 
Saints , immortal beings of unimaginable power , battle using spells and swords , leaving swathes of destruction in their wake. 
Magical beasts rule the mountains , where the brave – or the foolish – go to test their strength. 
Even the mighty can fall , feasted on by those stronger. 
The strong live like royalty ; the weak strive to survive another day. 
This is the world which Linley is born into. 
Raised in the small town of Wushan , Linley is a scion of the Baruch clan , the clan of the once - legendary Dragonblood Warriors. 
Their fame once shook the world , but the clan is now so decrepit that even the heirlooms of the clan have been sold off. 
Tasked with reclaiming the lost glory of his clan , Linley will go through countless trials and tribulations , making powerful friends but also deadly enemies. 
Come witness a new legend in the making. 
The legend of Linley Baruch. 

Process finished with exit code 0
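
The output above has a space before every comma and around "+" and "-", because getSentences joins the tokens with a single space instead of the whitespace that was in the article. The following is a minimal sketch of rebuilding each sentence from the tokenizer's own whitespace. SentenceRebuilder and rebuild are hypothetical names, not part of LingPipe or of ArticleManager. The sketch assumes the whitespace array has one more entry than the token array and that whites[i + 1] holds the whitespace following tokens[i]; check this layout against the LingPipe Javadoc before relying on it. The arrays in main are hand-written stand-ins for the values computed in getSentences.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper; not part of LingPipe or of the article's ArticleManager. */
final class SentenceRebuilder {

    /**
     * Joins tokens back into sentences using the recorded whitespace.
     * Assumption: whites.length == tokens.length + 1, and whites[i + 1]
     * is the whitespace that followed tokens[i] in the source text.
     *
     * @param boundaries index of the last token of each sentence
     */
    static List<String> rebuild(String[] tokens, String[] whites, int[] boundaries) {
        List<String> sentences = new ArrayList<>();
        int start = 0;
        for (int end : boundaries) {
            StringBuilder sb = new StringBuilder();
            for (int i = start; i <= end; i++) {
                sb.append(tokens[i]);
                if (i < end) {
                    // Original gap between token i and token i + 1.
                    sb.append(whites[i + 1]);
                }
            }
            sentences.add(sb.toString());
            start = end + 1;
        }
        return sentences;
    }

    public static void main(String[] args) {
        // Hand-written stand-ins for the tokenizer and sentence model output.
        String[] tokens = {"Sit", "back", ",", "relax", ".", "Go", "!"};
        String[] whites = {"", " ", "", " ", "", " ", "", ""};
        int[] boundaries = {4, 6};
        for (String sentence : rebuild(tokens, whites, boundaries)) {
            System.out.println(sentence);
        }
    }
}
```

Inside getSentences, the same loop could take the place of the StringBuilder logic, with words, whites, and the boundary array passed in directly.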

Word test and its output:

        ArticleManager articleManager = new ArticleManager();
        List<ArticleManager.Sentence> sentences = articleManager.getSentences(text);
        for (ArticleManager.Sentence sentence : sentences) {
            for (ArticleManager.Word word : sentence.words) {
                System.out.println(word.word);
            }
        }

Console:
This
novel
has
been
fully
translated
by
RWX
.
There
are
a
total
of
21
books
spanning
800
+
chapters
,
so
sit
back
,
buckle
your
seat
belts
,
and
get
ready
for
one
long
ride
!
Empires
rise
and
fall
on
the
Yulan
Continent
.
Saints
,
immortal
beings
of
unimaginable
power
,
battle
using
spells
and
swords
,
leaving
swathes
of
destruction
in
their
wake
.
Magical
beasts
rule
the
mountains
,
where
the
brave
–
or
the
foolish
–
go
to
test
their
strength
.
Even
the
mighty
can
fall
,
feasted
on
by
those
stronger
.
The
strong
live
like
royalty
;
the
weak
strive
to
survive
another
day
.
This
is
the
world
which
Linley
is
born
into
.
Raised
in
the
small
town
of
Wushan
,
Linley
is
a
scion
of
the
Baruch
clan
,
the
clan
of
the
once
-
legendary
Dragonblood
Warriors
.
Their
fame
once
shook
the
world
,
but
the
clan
is
now
so
decrepit
that
even
the
heirlooms
of
the
clan
have
been
sold
off
.
Tasked
with
reclaiming
the
lost
glory
of
his
clan
,
Linley
will
go
through
countless
trials
and
tribulations
,
making
powerful
friends
but
also
deadly
enemies
.
Come
witness
a
new
legend
in
the
making
.
The
legend
of
Linley
Baruch
.

Process finished with exit code 0
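
The word list above also contains punctuation tokens such as "," and ".", since the tokenizer returns them alongside the words. If only real words are wanted, they can be filtered afterwards. The following is a minimal sketch; TokenFilter, isWord, and wordsOnly are hypothetical names, and treating any token with at least one letter or digit as a word is an assumption, not a LingPipe rule.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper; not part of LingPipe or of the article's ArticleManager. */
final class TokenFilter {

    /** Treats a token as a word if it contains at least one letter or digit. */
    static boolean isWord(String token) {
        return token.codePoints().anyMatch(Character::isLetterOrDigit);
    }

    /** Returns the tokens that pass isWord, in their original order. */
    static List<String> wordsOnly(List<String> tokens) {
        List<String> words = new ArrayList<>();
        for (String token : tokens) {
            if (isWord(token)) {
                words.add(token);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList(
                "800", "+", "chapters", ",", "once", "-", "legendary", ".");
        System.out.println(wordsOnly(tokens));
    }
}
```

With the article's classes, the same check could be applied to word.word inside the inner loop before printing.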

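The text is split cleanly. Consult the LingPipe API documentation for the related classes; this article is only meant as an introduction.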
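
For comparison while looking through the APIs, the JDK also ships a rule-based sentence splitter, java.text.BreakIterator. The following is a minimal sketch of it, not the LingPipe approach used above; JdkSentenceSplitter and split are hypothetical names. It uses a different rule set from IndoEuropeanSentenceModel, so the two may disagree on harder inputs such as abbreviations.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical JDK-only alternative; it does not use LingPipe. */
final class JdkSentenceSplitter {

    /** Splits text at the sentence boundaries reported by BreakIterator. */
    static List<String> split(String text) {
        BreakIterator iterator = BreakIterator.getSentenceInstance(Locale.ENGLISH);
        iterator.setText(text);
        List<String> sentences = new ArrayList<>();
        int start = iterator.first();
        for (int end = iterator.next(); end != BreakIterator.DONE; end = iterator.next()) {
            // Trim the whitespace and line breaks that trail each sentence.
            String sentence = text.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                sentences.add(sentence);
            }
            start = end;
        }
        return sentences;
    }

    public static void main(String[] args) {
        String text = "Empires rise and fall. Even the mighty can fall! The end.";
        for (String sentence : split(text)) {
            System.out.println(sentence);
        }
    }
}
```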

 

 


 
