Java lemmatization: the simplest way to lemmatize text in Scala and Spark

I want to use lemmatization on a text file:

surprise heard thump opened door small seedy man clasping package wrapped.

upgrading system found review spring 2008 issue moody audio backed.

omg left gotta wrap review order asap . understand hand delivered dali lama

speak hands wear earplugs lives . listen maintain link long .

cables cables finally able hear gem long rumored music .

...

and the expected output is:

surprise heard thump open door small seed man clasp package wrap.

upgrade system found review spring 2008 issue mood audio back.

omg left gotta wrap review order asap . understand hand deliver dali lama

speak hand wear earplug live . listen maintain link long .

cable cable final able hear gem long rumor music .

...

Can anybody help me? What is the simplest way to do lemmatization that has been implemented in Scala and Spark?

Solution

There is a function in the book Advanced Analytics with Spark, in the chapter on lemmatization:

import java.util.Properties

import scala.collection.JavaConversions._
import scala.collection.mutable.ArrayBuffer

import edu.stanford.nlp.pipeline._
import edu.stanford.nlp.ling.CoreAnnotations._

val plainText = sc.parallelize(List("Sentence to be processed."))
val stopWords = Set("stopWord")

def plainTextToLemmas(text: String, stopWords: Set[String]): Seq[String] = {
  // Build a CoreNLP pipeline that tokenizes, splits sentences,
  // tags parts of speech, and produces lemmas.
  val props = new Properties()
  props.put("annotators", "tokenize, ssplit, pos, lemma")
  val pipeline = new StanfordCoreNLP(props)
  val doc = new Annotation(text)
  pipeline.annotate(doc)
  val lemmas = new ArrayBuffer[String]()
  val sentences = doc.get(classOf[SentencesAnnotation])
  for (sentence <- sentences;
       token <- sentence.get(classOf[TokensAnnotation])) {
    val lemma = token.get(classOf[LemmaAnnotation])
    // Keep lemmas longer than 2 characters that are not stop words.
    if (lemma.length > 2 && !stopWords.contains(lemma)) {
      lemmas += lemma.toLowerCase
    }
  }
  lemmas
}

val lemmatized = plainText.map(plainTextToLemmas(_, stopWords))
lemmatized.foreach(println)

Now just apply this to every line in a map:

val lemmatized = plainText.map(plainTextToLemmas(_, stopWords))
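Since the expected output is one lemmatized line per input line, you can join each document's lemmas back into a line and write the result out; a minimal sketch (the output path is just a placeholder):

// Rebuild one space-separated line per document and save as text.
lemmatized
  .map(_.mkString(" "))
  .saveAsTextFile("output/lemmatized")  // placeholder path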

EDIT:

I added the line

import scala.collection.JavaConversions._

to the code; it is needed because otherwise sentences is a Java List, not a Scala one, so the for comprehension would not compile. The code should now compile without problems.
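As an aside, JavaConversions relies on implicit conversions and is deprecated in later Scala versions. The same function can be written with explicit .asScala calls via scala.collection.JavaConverters; a minimal sketch (the function name is mine, not from the answer):

import scala.collection.JavaConverters._

// Same logic, but with explicit conversions instead of the
// JavaConversions implicits.
def plainTextToLemmasExplicit(text: String, stopWords: Set[String]): Seq[String] = {
  val props = new Properties()
  props.put("annotators", "tokenize, ssplit, pos, lemma")
  val pipeline = new StanfordCoreNLP(props)
  val doc = new Annotation(text)
  pipeline.annotate(doc)
  for {
    sentence <- doc.get(classOf[SentencesAnnotation]).asScala
    token    <- sentence.get(classOf[TokensAnnotation]).asScala
    lemma    = token.get(classOf[LemmaAnnotation])
    if lemma.length > 2 && !stopWords.contains(lemma)
  } yield lemma.toLowerCase
}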

I used Scala 2.10.4 and the following stanford.nlp dependencies:

<dependency>
    <groupId>edu.stanford.nlp</groupId>
    <artifactId>stanford-corenlp</artifactId>
    <version>3.5.2</version>
</dependency>
<dependency>
    <groupId>edu.stanford.nlp</groupId>
    <artifactId>stanford-corenlp</artifactId>
    <version>3.5.2</version>
    <classifier>models</classifier>
</dependency>
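If the project builds with sbt instead of Maven, the equivalent dependencies would look roughly like this:

libraryDependencies ++= Seq(
  "edu.stanford.nlp" % "stanford-corenlp" % "3.5.2",
  "edu.stanford.nlp" % "stanford-corenlp" % "3.5.2" classifier "models"
)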

You can also look at the stanford.nlp page; there are a lot of examples (in Java): http://nlp.stanford.edu/software/corenlp.shtml.

EDIT:

mapPartitions version:

Creating the StanfordCoreNLP pipeline is expensive because it loads the tagger models, so building it once per partition instead of once per record avoids that repeated cost, although I don't know if it will speed up the job significantly.

def plainTextToLemmas(text: String, stopWords: Set[String],
                      pipeline: StanfordCoreNLP): Seq[String] = {
  // The pipeline is passed in so it can be reused across records.
  val doc = new Annotation(text)
  pipeline.annotate(doc)
  val lemmas = new ArrayBuffer[String]()
  val sentences = doc.get(classOf[SentencesAnnotation])
  for (sentence <- sentences;
       token <- sentence.get(classOf[TokensAnnotation])) {
    val lemma = token.get(classOf[LemmaAnnotation])
    if (lemma.length > 2 && !stopWords.contains(lemma)) {
      lemmas += lemma.toLowerCase
    }
  }
  lemmas
}

val lemmatized = plainText.mapPartitions(p => {
  // Create the expensive pipeline once per partition, not once per record.
  val props = new Properties()
  props.put("annotators", "tokenize, ssplit, pos, lemma")
  val pipeline = new StanfordCoreNLP(props)
  p.map(q => plainTextToLemmas(q, stopWords, pipeline))
})
lemmatized.foreach(println)
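A note on the design: StanfordCoreNLP is not serializable, so the pipeline cannot be built on the driver and shipped to the executors; it has to be constructed inside mapPartitions (or similar). If you have many small partitions, a common Spark pattern is to hold the pipeline in a lazily initialized singleton object so it is created once per executor JVM; a sketch, not from the original answer:

object Lemmatizer {
  // Initialized lazily on first use in each executor JVM, so the
  // models are loaded once per JVM rather than once per partition.
  lazy val pipeline: StanfordCoreNLP = {
    val props = new Properties()
    props.put("annotators", "tokenize, ssplit, pos, lemma")
    new StanfordCoreNLP(props)
  }
}

val lemmatized = plainText.map(q => plainTextToLemmas(q, stopWords, Lemmatizer.pipeline))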
