I wouldn't use Spark in the first place, but if you are really committed to this particular stack, you can combine a bunch of ml transformers to get best matches. You will need a Tokenizer (or split):
import org.apache.spark.ml.feature.RegexTokenizer
val tokenizer = new RegexTokenizer().setPattern("").setInputCol("text").setMinTokenLength(1).setOutputCol("tokens")
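With an empty pattern, RegexTokenizer splits the text into single characters (and lowercases them by default). A quick sanity check; the SparkSession setup below is a minimal sketch for a standalone run, not part of the pipeline:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("fuzzy-matching").getOrCreate()
import spark.implicits._

// Empty pattern => one token per character, lowercased by default.
tokenizer.transform(Seq("Spark!").toDF("text")).select("tokens").show(false)
// tokens: [s, p, a, r, k, !]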
an NGram (for example a 3-gram):
import org.apache.spark.ml.feature.NGram
val ngram = new NGram().setN(3).setInputCol("tokens").setOutputCol("ngrams")
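NGram emits each window of n consecutive tokens joined with spaces, so character 3-grams look like this (a sketch, reusing the session assumed above):

val sample = tokenizer.transform(Seq("spark").toDF("text"))
ngram.transform(sample).select("ngrams").show(false)
// ngrams: [s p a, p a r, a r k]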
a Vectorizer (for example CountVectorizer or HashingTF):
import org.apache.spark.ml.feature.HashingTF
val vectorizer = new HashingTF().setInputCol("ngrams").setOutputCol("vectors")
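HashingTF needs no fitting but can produce hash collisions; if you prefer exact counts and can afford the extra pass to learn a vocabulary, the CountVectorizer mentioned above is a drop-in alternative (a sketch):

import org.apache.spark.ml.feature.CountVectorizer

// Estimator rather than Transformer: fit() learns the n-gram vocabulary.
val countVectorizer = new CountVectorizer()
  .setInputCol("ngrams")
  .setOutputCol("vectors")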
and LSH:
import org.apache.spark.ml.feature.{MinHashLSH, MinHashLSHModel}
// Increase numHashTables in practice.
val lsh = new MinHashLSH().setInputCol("vectors").setOutputCol("lsh")
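For example (the value below is purely illustrative; more hash tables raise recall at the cost of extra computation):

// Illustrative only -- tune numHashTables for your data.
val lshTuned = new MinHashLSH()
  .setNumHashTables(5)
  .setInputCol("vectors")
  .setOutputCol("lsh")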
and combine everything with a Pipeline:
import org.apache.spark.ml.Pipeline
val pipeline = new Pipeline().setStages(Array(tokenizer, ngram, vectorizer, lsh))
Fit on example data:
val query = Seq("Hello there 7l | real|y like Spark!").toDF("text")
val db = Seq(
"Hello there ! I really like Spark ❤️!",
"Can anyone suggest an efficient algorithm"
).toDF("text")
val model = pipeline.fit(db)
Transform both:
val dbHashed = model.transform(db)
val queryHashed = model.transform(query)
and join:
model.stages.last.asInstanceOf[MinHashLSHModel]
.approxSimilarityJoin(dbHashed, queryHashed, 0.75).show
+--------------------+--------------------+------------------+
| datasetA| datasetB| distCol|
+--------------------+--------------------+------------------+
|[Hello there ! ...|[Hello there 7l |...|0.5106382978723405|
+--------------------+--------------------+------------------+
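distCol is the Jaccard distance between the MinHash inputs, so lower means more similar and 0.75 is the maximum distance to keep. To read off just the matched strings instead of the full structs, a sketch:

model.stages.last.asInstanceOf[MinHashLSHModel]
  .approxSimilarityJoin(dbHashed, queryHashed, 0.75)
  .select($"datasetA.text".alias("db_text"), $"datasetB.text".alias("query_text"), $"distCol")
  .show(false)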