Stop words are words that should be excluded from the input, typically because they appear frequently and carry little meaning.
StopWordsRemover takes as input a sequence of strings, such as the output of a Tokenizer or RegexTokenizer, and drops all the stop words from it. The list of stop words is specified by the stopWords parameter; by default it is the built-in English stop-word list.
import java.util.Arrays;
import java.util.List;

import org.apache.spark.ml.feature.RegexTokenizer;
import org.apache.spark.ml.feature.StopWordsRemover;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.Metadata;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

import scala.collection.mutable.WrappedArray;

import static org.apache.spark.sql.functions.callUDF;
import static org.apache.spark.sql.functions.col;

public class StopWordsRemoverDemo {
    public static void main(String[] args) {
        final SparkSession spark = SparkSession.builder()
                .master("local")
                .appName("StopWordsRemoverDemo")
                .getOrCreate();

        // Sample sentences to tokenize and then strip of stop words.
        final List<Row> data = Arrays.asList(
                RowFactory.create(0, "Tokenization,is the process of enchanting words,from the raw text"),
                RowFactory.create(1, "If you want,to have more advance tokenization,RegexTokenizer,\n" +
                        "is a good option"),
                RowFactory.create(2, "Here,will provide a sample example on how to tockenize sentences"),
                RowFactory.create(3, "This way,you can find all matching occurrences")
        );
        final StructType schema = new StructType(new StructField[]{
                new StructField("id", DataTypes.IntegerType, false, Metadata.empty()),
                new StructField("sentence", DataTypes.StringType, false, Metadata.empty())
        });
        final Dataset<Row> df = spark.createDataFrame(data, schema);

        // Split each sentence on runs of non-word characters (gaps mode).
        final RegexTokenizer tokenizer = new RegexTokenizer()
                .setInputCol("sentence")
                .setOutputCol("words")
                .setPattern("\\W+")
                .setGaps(true);

        // UDF that counts the tokens produced for each sentence.
        spark.udf().register(
                "countTokens",
                (WrappedArray<?> words) -> words.size(),
                DataTypes.IntegerType);

        final Dataset<Row> regexTokenized = tokenizer.transform(df)
                .select("id", "sentence", "words")
                .withColumn("tokens", callUDF("countTokens", col("words")));

        // Remove stop words (default English list) from the token column.
        final StopWordsRemover remover = new StopWordsRemover()
                .setInputCol("words")
                .setOutputCol("filtered");

        remover.transform(regexTokenized)
                .select("id", "filtered")
                .show(false);

        spark.stop();
    }
}
The output is:
+---+-----------------------------------------------------------+
|id |filtered |
+---+-----------------------------------------------------------+
|0 |[tokenization, process, enchanting, words, raw, text] |
|1 |[want, advance, tokenization, regextokenizer, good, option]|
|2 |[provide, sample, example, tockenize, sentences] |
|3 |[way, find, matching, occurrences] |
+---+-----------------------------------------------------------+
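If the default English list does not fit your data, you can supply your own stop words. The following is a minimal sketch: it starts from the built-in English list via StopWordsRemover.loadDefaultStopWords and appends custom entries with setStopWords; the extra words "tokenization" and "regextokenizer" are made up for illustration. setCaseSensitive controls whether matching respects case (false by default).

// A minimal sketch: extend the default English stop-word list with
// custom entries ("tokenization", "regextokenizer" are illustrative).
final String[] english = StopWordsRemover.loadDefaultStopWords("english");
final String[] custom = {"tokenization", "regextokenizer"};
final String[] stopWords = new String[english.length + custom.length];
System.arraycopy(english, 0, stopWords, 0, english.length);
System.arraycopy(custom, 0, stopWords, english.length, custom.length);

final StopWordsRemover customRemover = new StopWordsRemover()
        .setInputCol("words")
        .setOutputCol("filtered")
        .setStopWords(stopWords)     // use the extended list instead of the default
        .setCaseSensitive(false);    // default: matching ignores case

Applying customRemover in place of remover above would additionally drop "tokenization" and "regextokenizer" from the filtered column.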