Java lemmatization: the simplest way to lemmatize text in Scala and Spark

I want to use lemmatization on a text file:

surprise heard thump opened door small seedy man clasping package wrapped.

upgrading system found review spring 2008 issue moody audio backed.

omg left gotta wrap review order asap . understand hand delivered dali lama

speak hands wear earplugs lives . listen maintain link long .

cables cables finally able hear gem long rumored music .

...

and the expected output is:

surprise heard thump open door small seed man clasp package wrap.

upgrade system found review spring 2008 issue mood audio back.

omg left gotta wrap review order asap . understand hand deliver dali lama

speak hand wear earplug live . listen maintain link long .

cable cable final able hear gem long rumor music .

...

Can anybody help me? What is the simplest way to do lemmatization that has been implemented in Scala and Spark?

Solution

There is a function in the book Advanced Analytics with Spark, in the chapter on lemmatization:

import java.util.Properties

import scala.collection.JavaConversions._
import scala.collection.mutable.ArrayBuffer

import edu.stanford.nlp.pipeline._
import edu.stanford.nlp.ling.CoreAnnotations._

val plainText = sc.parallelize(List("Sentence to be processed."))
val stopWords = Set("stopWord")

def plainTextToLemmas(text: String, stopWords: Set[String]): Seq[String] = {
  // Build a CoreNLP pipeline that tokenizes, splits sentences,
  // tags parts of speech, and produces lemmas.
  val props = new Properties()
  props.put("annotators", "tokenize, ssplit, pos, lemma")
  val pipeline = new StanfordCoreNLP(props)
  val doc = new Annotation(text)
  pipeline.annotate(doc)
  val lemmas = new ArrayBuffer[String]()
  val sentences = doc.get(classOf[SentencesAnnotation])
  for (sentence <- sentences;
       token <- sentence.get(classOf[TokensAnnotation])) {
    val lemma = token.get(classOf[LemmaAnnotation])
    // Keep lemmas longer than 2 characters that are not stop words.
    if (lemma.length > 2 && !stopWords.contains(lemma)) {
      lemmas += lemma.toLowerCase
    }
  }
  lemmas
}

val lemmatized = plainText.map(plainTextToLemmas(_, stopWords))
lemmatized.foreach(println)

Now just apply this to every line in a map:

val lemmatized = plainText.map(plainTextToLemmas(_, stopWords))
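Since the expected output is one lemmatized line per input line, you can join each document's lemmas back into a line and write the result out; a minimal sketch (the output path is just a placeholder):

// Rebuild one space-separated line per document and save as text.
lemmatized
  .map(_.mkString(" "))
  .saveAsTextFile("output/lemmatized")  // placeholder path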

EDIT:

I added the line

import scala.collection.JavaConversions._

to the code; it is needed because otherwise sentences is a Java List, not a Scala one, so the for comprehension would not compile. The code should now compile without problems.
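As an aside, JavaConversions relies on implicit conversions and is deprecated in later Scala versions. The same function can be written with explicit .asScala calls via scala.collection.JavaConverters; a minimal sketch (the function name is mine, not from the answer):

import scala.collection.JavaConverters._

// Same logic, but with explicit conversions instead of the
// JavaConversions implicits.
def plainTextToLemmasExplicit(text: String, stopWords: Set[String]): Seq[String] = {
  val props = new Properties()
  props.put("annotators", "tokenize, ssplit, pos, lemma")
  val pipeline = new StanfordCoreNLP(props)
  val doc = new Annotation(text)
  pipeline.annotate(doc)
  for {
    sentence <- doc.get(classOf[SentencesAnnotation]).asScala
    token    <- sentence.get(classOf[TokensAnnotation]).asScala
    lemma    = token.get(classOf[LemmaAnnotation])
    if lemma.length > 2 && !stopWords.contains(lemma)
  } yield lemma.toLowerCase
}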

I used Scala 2.10.4 and the following stanford.nlp dependencies:

<dependency>
    <groupId>edu.stanford.nlp</groupId>
    <artifactId>stanford-corenlp</artifactId>
    <version>3.5.2</version>
</dependency>
<dependency>
    <groupId>edu.stanford.nlp</groupId>
    <artifactId>stanford-corenlp</artifactId>
    <version>3.5.2</version>
    <classifier>models</classifier>
</dependency>
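If the project builds with sbt instead of Maven, the equivalent dependencies would look roughly like this:

libraryDependencies ++= Seq(
  "edu.stanford.nlp" % "stanford-corenlp" % "3.5.2",
  "edu.stanford.nlp" % "stanford-corenlp" % "3.5.2" classifier "models"
)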

You can also look at the stanford.nlp page; there are a lot of examples (in Java): http://nlp.stanford.edu/software/corenlp.shtml.

EDIT:

mapPartitions version:

Creating the StanfordCoreNLP pipeline is expensive because it loads the tagger models, so building it once per partition instead of once per record avoids that repeated cost, although I don't know if it will speed up the job significantly.

def plainTextToLemmas(text: String, stopWords: Set[String],
                      pipeline: StanfordCoreNLP): Seq[String] = {
  // The pipeline is passed in so it can be reused across records.
  val doc = new Annotation(text)
  pipeline.annotate(doc)
  val lemmas = new ArrayBuffer[String]()
  val sentences = doc.get(classOf[SentencesAnnotation])
  for (sentence <- sentences;
       token <- sentence.get(classOf[TokensAnnotation])) {
    val lemma = token.get(classOf[LemmaAnnotation])
    if (lemma.length > 2 && !stopWords.contains(lemma)) {
      lemmas += lemma.toLowerCase
    }
  }
  lemmas
}

val lemmatized = plainText.mapPartitions(p => {
  // Create the expensive pipeline once per partition, not once per record.
  val props = new Properties()
  props.put("annotators", "tokenize, ssplit, pos, lemma")
  val pipeline = new StanfordCoreNLP(props)
  p.map(q => plainTextToLemmas(q, stopWords, pipeline))
})
lemmatized.foreach(println)
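A note on the design: StanfordCoreNLP is not serializable, so the pipeline cannot be built on the driver and shipped to the executors; it has to be constructed inside mapPartitions (or similar). If you have many small partitions, a common Spark pattern is to hold the pipeline in a lazily initialized singleton object so it is created once per executor JVM; a sketch, not from the original answer:

object Lemmatizer {
  // Initialized lazily on first use in each executor JVM, so the
  // models are loaded once per JVM rather than once per partition.
  lazy val pipeline: StanfordCoreNLP = {
    val props = new Properties()
    props.put("annotators", "tokenize, ssplit, pos, lemma")
    new StanfordCoreNLP(props)
  }
}

val lemmatized = plainText.map(q => plainTextToLemmas(q, stopWords, Lemmatizer.pipeline))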
