Spark's Word2Vec, used from Scala, ran into some strange problems on a cluster. The code is as follows:
import org.apache.spark.ml.feature.Word2Vec

// `sentence`, the case class `w2v`, and `saveMethod` are defined elsewhere in the project.
val documentDF = sentence.map(Tuple1.apply)
  .toDF("user_item")
  .repartition(15)
documentDF.show(3, false)

val model = new Word2Vec()
  .setInputCol("user_item")
  .setOutputCol("vector")
  .setVectorSize(64)
  .setWindowSize(2)
  .setMinCount(1)
  .setMaxIter(20)
  .setStepSize(0.025)
  .setNumPartitions(62) // too large: this setting led to the Infinity vectors described below
  .fit(documentDF)

// val modelPath = "/Model"
// model.write.overwrite().save(modelPath)
// model.findSynonyms("fdc2d9ef27bc4d149e3b4b65915c7cf5", 20)
//   .show(20, false)

println("save w2v ...")
val word2Vec = model.getVectors.select("word", "vector")
  .as[w2v]
  .rdd
  .repartition(64)
  .map(x => (x.word, x.vector.drop(1).dropRight(1)))
  .toDF("word", "vector")
val w2vPath = "/wordVector"
saveMethod(word2Vec.toDF, w2vPath)
word2Vec.unpersist() // no effect here, since word2Vec was never cached
Exception 1: the output word vectors contain Infinity values
scala> model.getVectors.show()
+-------------+--------------------+
|         word|              vector|
+-------------+--------------------+
|     Unspoken|[-Infinity,-Infin...|
|       Talent|[Infinity,-Infini...|
|    Hourglass|[1.09657520526310...|
|Nickelodeon's|[2.20436549446219...|
|      Priests|[-1.9625896848389...|
|    Religion:|[-3.8815759928213...|
|           Bu|[-7.9722236466752...|
|      Totoro:|[-4.1829056206528...|
|     Trouble,|[2.51985378203136...|
|       Hatter|[8.49108115961009...|
|          '79|[-5.4560309784650...|
|         Vile|[-1.2059769646379...|
|         9/11|[Infinity,-Infini...|
|      Santino|[6.30405421282099...|
|      Motives|[1.96207712570869...|
|          '13|[-1.7641987324084...|
|       Fierce|[-Infinity,Infini...|
|       Stover|[5.10057474120744...|
|          'It|[1.08629989605664...|
|        Butts|[Infinity,Infinit...|
+-------------+--------------------+
only showing top 20 rows
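After some searching, and putting together the discussions titled "Word2Vec generate infinity vectors when numIterations are large" and "Fix infinity vectors produced by Word2Vec when numIterations are large", I found the cause: setNumPartitions was set too high. After changing it to 15, the problem was solved.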
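To confirm whether a retrained model still has the problem, the vectors can be scanned for non-finite components. Below is a minimal sketch in plain Scala. The object `VectorCheck` and its methods are hypothetical helpers, not part of the original code or of Spark.

```scala
// Hypothetical helper, not part of the original post or of Spark.
object VectorCheck {
  // True when at least one component is NaN, +Infinity, or -Infinity.
  def hasNonFinite(values: Array[Double]): Boolean =
    values.exists(v => v.isNaN || v.isInfinity)

  // Counts the (word, vector) pairs whose vector has a non-finite component.
  def countBroken(vectors: Seq[(String, Array[Double])]): Int =
    vectors.count { case (_, values) => hasNonFinite(values) }
}
```

On the cluster, one way to use it would be to refit with `.setNumPartitions(15)` and then apply `hasNonFinite` to each row of `model.getVectors`, for example on `row.getAs[org.apache.spark.ml.linalg.Vector]("vector").toArray`. A count of zero suggests the Infinity problem is gone.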
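Exception 2: the job terminates for no apparent reason while running, reporting either an internal Word2Vec error or an error in .fit(). Sometimes training aborts outright, sometimes saving the word vectors fails, and occasionally the run succeeds (which is the worst part, because it makes the failure hard to pin down). I searched the web extensively without finding the cause, and eventually discovered that the job was exceeding its memory limit. After increasing the memory size in the scheduling parameters, it ran successfully.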
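The post does not say which memory parameters were changed or to what values. As a rough sketch only, executor memory can be raised when the session is created. Every value and the application name below are illustrative assumptions, not the author's settings.

```scala
import org.apache.spark.sql.SparkSession

// All values below are illustrative assumptions, not the author's settings.
val spark = SparkSession.builder()
  .appName("word2vec-training")                   // hypothetical application name
  .config("spark.executor.memory", "8g")          // JVM heap per executor
  .config("spark.executor.memoryOverhead", "2g")  // off-heap headroom per executor (Spark 2.3+)
  .getOrCreate()
```

Driver memory usually has to be set at launch time instead, for example with `spark-submit --driver-memory`, because the driver JVM is typically already running by the time this code executes. It may matter here as well, since the trained vectors returned by `.fit()` end up on the driver.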