Spark - Learning RegexTokenizer and StopWordsRemover

Stop words are words that should be excluded from the input, typically because they occur frequently and carry little meaning.
StopWordsRemover takes as input a sequence of strings, such as the output of a Tokenizer or RegexTokenizer, and filters out all stop words. The list of stop words is specified by the stopWords parameter.
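By default the remover uses Spark's built-in English stop word list, but you can supply your own via setStopWords. A minimal sketch (the word list here is made up for illustration):

final StopWordsRemover customRemover = new StopWordsRemover()
        .setInputCol("words")
        .setOutputCol("filtered")
        .setStopWords(new String[]{"a", "the", "of"}) // hypothetical custom list
        .setCaseSensitive(false);                     // false is the default

The complete example below tokenizes a few sentences with RegexTokenizer and then removes the default English stop words: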

import java.util.Arrays;
import java.util.List;

import org.apache.spark.ml.feature.RegexTokenizer;
import org.apache.spark.ml.feature.StopWordsRemover;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.Metadata;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;
import scala.collection.mutable.WrappedArray;

import static org.apache.spark.sql.functions.callUDF;
import static org.apache.spark.sql.functions.col;

public class StopWordsRemoverDemo {

    public static void main(String[] args) {
        final SparkSession spark = SparkSession.builder()
                .master("local")
                .appName("RegexTokenizer")
                .getOrCreate();

        final List<Row> data = Arrays.asList(
                RowFactory.create(0, "Tokenization,is the process of extracting words,from the raw text"),
                RowFactory.create(1, "If you want,to have more advanced tokenization,RegexTokenizer,\n" +
                        "is a good option"),
                RowFactory.create(2, "Here,will provide a sample example on how to tokenize sentences"),
                RowFactory.create(3, "This way,you can find all matching occurrences")
        );

        final StructType schema = new StructType(new StructField[]{
                new StructField("id", DataTypes.IntegerType, false, Metadata.empty()),
                new StructField("sentence", DataTypes.StringType, false, Metadata.empty())
        });
        final Dataset<Row> df = spark.createDataFrame(data, schema);

        // With gaps=true the pattern matches separators, so "\\W+" splits on
        // runs of non-word characters (spaces, commas, newlines). Tokens are
        // lowercased by default.
        final RegexTokenizer tokenizer = new RegexTokenizer()
                .setInputCol("sentence")
                .setOutputCol("words")
                .setPattern("\\W+")
                .setGaps(true);

        // A small UDF that counts how many tokens each row produced.
        spark.udf().register(
                "countTokens",
                (WrappedArray<?> words) -> words.size(),
                DataTypes.IntegerType);

        final Dataset<Row> regexTokenized = tokenizer.transform(df)
                .select("id", "sentence", "words")
                .withColumn("tokens", callUDF("countTokens", col("words")));

        // Remove the default English stop words from the tokenized column.
        final StopWordsRemover remover = new StopWordsRemover()
                .setInputCol("words")
                .setOutputCol("filtered");

        remover.transform(regexTokenized)
                .select("id", "filtered")
                .show(false);
        spark.stop();
    }
}

The output is:

+---+------------------------------------------------------------+
|id |filtered                                                    |
+---+------------------------------------------------------------+
|0  |[tokenization, process, extracting, words, raw, text]       |
|1  |[want, advanced, tokenization, regextokenizer, good, option]|
|2  |[provide, sample, example, tokenize, sentences]             |
|3  |[way, find, matching, occurrences]                          |
+---+------------------------------------------------------------+
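
Tokens such as "is", "the", "of" and "you" were dropped because they appear in Spark's default English stop word list. StopWordsRemover also ships with default lists for several other languages; to inspect one (a minimal sketch, reusing the imports above):

String[] english = StopWordsRemover.loadDefaultStopWords("english");
System.out.println("default English stop words: " + english.length);

Note that the tokens column computed by the countTokens UDF is still present in regexTokenized; it is simply not selected in the final show() call.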