Spark Java API: CountVectorizer

Luyao Zou

Article link: https://blog.csdn.net/weixin_31487521/article/details/114404903

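Many of the machine learning algorithms Spark offers for text processing and analysis take numeric vectors as input rather than raw text, so the text has to be converted first. There are many ways to turn text into numeric vectors, and CountVectorizer is one of them.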

A CountVectorizer converts a collection of text documents into a vector representing the word count of text documents.

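Two parameters matter when building the vectors: VocabSize and MinDF. VocabSize is the maximum size of the vocabulary. MinDF is the minimum number of documents a term must appear in to be included in the vocabulary: a term that appears in fewer than MinDF documents is left out, however many times it occurs within a single document.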

For example, take two documents: ["w1", "w2", "w4", "w5", "w2"] and ["w1", "w2", "w3"].

CountVectorizer cv = new CountVectorizer()
        .setInputCol("text")
        .setOutputCol("feature")
        .setVocabSize(3)
        .setMinDF(2);

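With these settings the vocabulary size is 3, so it can hold at most three terms. Across the two documents, "w1" appears in 2 documents and "w2" also appears in 2 documents (3 occurrences in total), so both are added to the vocabulary. "w3", "w4" and "w5" each appear in only one document, so they are not vocabulary terms. The resulting vocabulary therefore contains only two terms.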
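The selection rule described above can be sketched in plain Java, without Spark. This is only an illustration under stated assumptions, not Spark's implementation: the class VocabularySketch and the helper buildVocabulary are hypothetical names, and the alphabetical tie-break is an assumption added to keep the result deterministic.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

// Plain-Java illustration only; this is not Spark's implementation.
final class VocabularySketch {

    // Hypothetical helper: keeps terms found in at least minDF documents,
    // then returns the vocabSize terms with the highest total count.
    static List<String> buildVocabulary(List<List<String>> docs, int vocabSize, int minDF) {
        Map<String, Integer> docFreq = new HashMap<>();    // documents containing the term
        Map<String, Integer> totalCount = new HashMap<>(); // occurrences over all documents
        for (List<String> doc : docs) {
            for (String term : doc) {
                totalCount.merge(term, 1, Integer::sum);
            }
            // A HashSet counts each term once per document.
            for (String term : new HashSet<>(doc)) {
                docFreq.merge(term, 1, Integer::sum);
            }
        }
        List<String> vocab = new ArrayList<>();
        for (Map.Entry<String, Integer> e : docFreq.entrySet()) {
            if (e.getValue() >= minDF) {
                vocab.add(e.getKey());
            }
        }
        // Most frequent first; the alphabetical tie-break is an assumption.
        vocab.sort((a, b) -> {
            int byCount = Integer.compare(totalCount.get(b), totalCount.get(a));
            return byCount != 0 ? byCount : a.compareTo(b);
        });
        return new ArrayList<>(vocab.subList(0, Math.min(vocabSize, vocab.size())));
    }

    public static void main(String[] args) {
        List<List<String>> docs = Arrays.asList(
                Arrays.asList("w1", "w2", "w4", "w5", "w2"),
                Arrays.asList("w1", "w2", "w3"));
        System.out.println(buildVocabulary(docs, 3, 2)); // prints [w2, w1]
    }
}
```

Even though up to three terms are allowed, only w2 and w1 pass the MinDF filter, and w2 comes first because it has the higher total count.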

When the dictionary is not defined, CountVectorizer iterates over the dataset twice to prepare the dictionary based on frequency and size.

CountVectorizer first scans the Dataset (the text data) to build the vocabulary, then scans it again to produce the vector model (CountVectorizerModel).

When constructing the Dataset, you need to specify a schema. The schema describes how to interpret each row of the Dataset.

StructType schema = new StructType(new StructField[]{
        new StructField("text", new ArrayType(DataTypes.StringType, true), false, Metadata.empty())
});

A field inside a StructType. param: name The name of this field. param: dataType The data type of this field. param: nullable Indicates if values of this field can be null values. param: metadata The metadata of this field. The metadata should be preserved during transformation if the content of the column is not modified

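In the schema above, the first argument is the field name ("text"), the second is the data type (an array of strings), the third says whether the field's values may be null (here they may not), and the fourth is the field's metadata.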

The complete example code:

import org.apache.spark.ml.feature.CountVectorizer;
import org.apache.spark.ml.feature.CountVectorizerModel;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.*;

import java.util.Arrays;
import java.util.List;

public class CounterVectorExample {

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("CountVectorizer")
                .master("spark://172.25.129.170:7077")
                .getOrCreate();

        // Each Row holds one document as a list of terms.
        List<Row> data = Arrays.asList(
                // RowFactory.create(Arrays.asList("a", "b", "c")),
                // RowFactory.create(Arrays.asList("a", "b", "b", "c", "a")),
                // RowFactory.create(Arrays.asList("a", "b", "a", "b"))
                RowFactory.create(Arrays.asList("w1", "w2", "w3")),
                RowFactory.create(Arrays.asList("w1", "w2", "w4", "w5", "w2"))
        );

        // One non-nullable column "text" of type array<string>.
        StructType schema = new StructType(new StructField[]{
                new StructField("text", new ArrayType(DataTypes.StringType, true), false, Metadata.empty())
        });
        Dataset<Row> df = spark.createDataFrame(data, schema);

        CountVectorizer cv = new CountVectorizer()
                .setInputCol("text")
                .setOutputCol("feature")
                .setVocabSize(3)
                .setMinDF(2);

        // Learn the vocabulary from the data.
        CountVectorizerModel cvModel = cv.fit(df);

        // Alternative: a model built from a prior dictionary (not used below).
        CountVectorizerModel cvm = new CountVectorizerModel(new String[]{"a", "b", "c"})
                .setInputCol("text")
                .setOutputCol("feature");

        // Turn each document into a vector of term counts and print it untruncated.
        cvModel.transform(df).show(false);

        spark.stop();
    }
}

By default the output is represented as sparse vectors:

A sparse vector represented by an index array and a value array.

param: size size of the vector. param: indices index array, assume to be strictly increasing. param: values value array, must have the same length as the index array.

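The first field is the vector size. The vocabulary here has only 2 terms, so the resulting vectors have length 2. The second field is the index array, and the third holds the vector values at those indices. In the fitted vocabulary, position 0 is the term w2 and position 1 is the term w1, so [w1, w2, w3] becomes (2,[0,1],[1.0,1.0]) and [w1, w2, w4, w5, w2] becomes (2,[0,1],[2.0,1.0]).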
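To make the three fields concrete, the plain-Java sketch below expands a (size, indices, values) triple into an ordinary dense array. It does not use Spark's own vector classes. SparseToDenseSketch and toDense are hypothetical names, and the sample vector is made up for illustration.

```java
import java.util.Arrays;

// Plain-Java illustration; Spark's own SparseVector class is not used here.
final class SparseToDenseSketch {

    // Hypothetical helper: expands (size, indices, values) into a dense array.
    static double[] toDense(int size, int[] indices, double[] values) {
        if (indices.length != values.length) {
            throw new IllegalArgumentException("indices and values must have the same length");
        }
        double[] dense = new double[size]; // positions not listed stay 0.0
        for (int i = 0; i < indices.length; i++) {
            dense[indices[i]] = values[i];
        }
        return dense;
    }

    public static void main(String[] args) {
        // (3,[0,2],[1.0,4.0]) is a made-up vector, not output from the example above.
        double[] dense = toDense(3, new int[]{0, 2}, new double[]{1.0, 4.0});
        System.out.println(Arrays.toString(dense)); // prints [1.0, 0.0, 4.0]
    }
}
```

Only the non-zero positions are stored, which is why the sparse form suits documents that use a small part of a large vocabulary.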

We can also define the vocabulary in advance by passing it in when constructing the CountVectorizerModel: ["w1", "w2", "w3"]

// prior dictionary
CountVectorizerModel cvm = new CountVectorizerModel(new String[]{"w1", "w2", "w3"})
        .setInputCol("text")
        .setOutputCol("feature");

cvm.transform(df).show(false);

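For the document [w1, w2, w3], every term is in the vocabulary and occurs once, so the sparse feature vector is (3,[0,1,2],[1.0,1.0,1.0]). Here 3 is the vector size, [0,1,2] are the indices, and [1.0,1.0,1.0] are the values at those indices (each term occurs exactly once). For the document [w1, w2, w4, w5, w2], w4 and w5 are not in the vocabulary, w1 occurs once and w2 occurs twice, so its feature vector is (3,[0,1],[1.0,2.0]).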
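Counting against a fixed vocabulary can be reproduced with a few lines of plain Java. This is a sketch of the idea only and does not call Spark. FixedVocabularySketch and encode are hypothetical names, and the output string merely imitates the (size,[indices],[values]) notation used above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java illustration only; this is not Spark's CountVectorizerModel.
final class FixedVocabularySketch {

    // Hypothetical helper: counts vocabulary terms in doc and renders the
    // result as "(size,[indices],[values])".
    static String encode(List<String> vocabulary, List<String> doc) {
        double[] counts = new double[vocabulary.size()];
        for (String term : doc) {
            int idx = vocabulary.indexOf(term); // -1 means the term is not in the vocabulary
            if (idx >= 0) {
                counts[idx] += 1.0;
            }
        }
        List<String> indices = new ArrayList<>();
        List<String> values = new ArrayList<>();
        for (int i = 0; i < counts.length; i++) {
            if (counts[i] > 0) { // keep only the non-zero entries
                indices.add(String.valueOf(i));
                values.add(String.valueOf(counts[i]));
            }
        }
        return "(" + counts.length + ",[" + String.join(",", indices) + "],["
                + String.join(",", values) + "])";
    }

    public static void main(String[] args) {
        List<String> vocabulary = Arrays.asList("w1", "w2", "w3");
        System.out.println(encode(vocabulary, Arrays.asList("w1", "w2", "w3")));
        // prints (3,[0,1,2],[1.0,1.0,1.0])
        System.out.println(encode(vocabulary, Arrays.asList("w1", "w2", "w4", "w5", "w2")));
        // prints (3,[0,1],[1.0,2.0])
    }
}
```

Terms outside the vocabulary (w4 and w5) are simply skipped, so they never contribute an index or a value.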