PySpark Feature Engineering: StopWordsRemover

StopWordsRemover

class pyspark.ml.feature.StopWordsRemover(inputCol=None, outputCol=None, stopWords=None, caseSensitive=False, locale=None)

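StopWordsRemover removes stop words from token sequences. Every array of tokens in inputCol is checked against the stop-word list, and the same array with the stop words dropped is written to outputCol.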
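The per-row behaviour described above can be sketched in plain Python. This only illustrates the semantics and is not Spark's implementation: the helper name `remove_stop_words` is hypothetical, and the case handling is assumed to mirror the `caseSensitive=False` default shown in the signature.

```python
from typing import Iterable, List


def remove_stop_words(tokens: Iterable[str],
                      stop_words: Iterable[str],
                      case_sensitive: bool = False) -> List[str]:
    """Return tokens with stop words dropped (hypothetical helper, not a Spark API)."""
    if case_sensitive:
        stop_set = set(stop_words)
        return [t for t in tokens if t not in stop_set]
    # Case-insensitive: compare lowercased forms, but keep the original token text.
    stop_set = {w.lower() for w in stop_words}
    return [t for t in tokens if t.lower() not in stop_set]


row = ["Beautiful", "is", "better", "than", "ugly"]
insensitive = remove_stop_words(row, ["IS", "better", "than"])
sensitive = remove_stop_words(row, ["IS", "better", "than"], case_sensitive=True)
print(insensitive)  # ['Beautiful', 'ugly']
print(sensitive)    # ['Beautiful', 'is', 'ugly']
```

With case-insensitive matching, "IS" in the stop-word list still removes "is" from the row. With case-sensitive matching it does not.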

By default, constructing a StopWordsRemover object calls loadDefaultStopWords(language: String): Array[String], which loads /org/apache/spark/ml/feature/stopwords/english.txt

This is a simple stop-word list containing 181 words (Spark 2.2).

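Default lists are also provided for several other languages (danish, dutch, english, finnish, french, german, hungarian, italian, norwegian, portuguese, russian, spanish, swedish, turkish). Unfortunately there is no default Chinese stop-word list, so for Chinese text you have to supply your own.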
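Since a Chinese list has to be supplied by hand, one possible approach is to keep it in a text file with one word per line and read it into a Python list. The sketch below assumes a UTF-8 file; the helper name `load_stop_words`, the file name, and the three sample words are all hypothetical and only serve the demonstration.

```python
import os
import tempfile
from pathlib import Path
from typing import List


def load_stop_words(path: str, encoding: str = "utf-8") -> List[str]:
    """Read one stop word per line, skipping blank lines and duplicates (hypothetical helper)."""
    seen = set()
    words = []
    for line in Path(path).read_text(encoding=encoding).splitlines():
        word = line.strip()
        if word and word not in seen:
            seen.add(word)
            words.append(word)
    return words


# Demo with a throwaway file; a real list would come from a downloaded resource.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "zh_stopwords.txt")
    Path(demo_path).write_text("的\n了\n\n是\n的\n", encoding="utf-8")
    zh_stop_words = load_stop_words(demo_path)

print(zh_stop_words)  # ['的', '了', '是']
```

The resulting list can be passed as the `stopWords` argument when constructing `StopWordsRemover`. The input column must already contain segmented tokens, so Chinese word segmentation remains a separate, earlier step.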

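How stop words are handled depends on the task:

  1. Supervised machine learning: remove stop words from the feature space
  2. Clustering: reduce the weight of stop words
  3. Information retrieval: do not index stop words
  4. Automatic summarization: ignore stop words when scoring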
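For the clustering case, down-weighting instead of removing can be illustrated with an inverse document frequency (IDF) weight. This is a minimal sketch in plain Python, not the method used by any particular library. The smoothed formula log((N + 1) / (df + 1)) is an assumption chosen for illustration, and the three lowercase sample documents are made up for the demo.

```python
import math
from collections import Counter

docs = [
    ["beautiful", "is", "better", "than", "ugly"],
    ["explicit", "is", "better", "than", "implicit"],
    ["simple", "is", "better", "than", "complex"],
]

# Document frequency: the number of documents in which each term occurs.
doc_freq = Counter(term for doc in docs for term in set(doc))

n_docs = len(docs)
# Smoothed IDF: a term present in every document gets weight log(1) = 0.
idf = {term: math.log((n_docs + 1) / (df + 1)) for term, df in doc_freq.items()}

for term in ("is", "ugly"):
    print(term, round(idf[term], 3))
# is 0.0
# ugly 0.693
```

A word such as "is" that appears in every document receives a weight of zero, so it contributes nothing to distance computations even though it was never explicitly removed.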

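The kinds of stop words can differ from language to language, but in general there are three simple classes:

  1. Determiners
  2. Coordinating conjunctions
  3. Prepositions

Stop-word lists usually do not need to be built by hand; many ready-made lists are available for download.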

01. Import the modules and create the SparkSession

from pyspark.sql import SparkSession
from pyspark.ml.feature import StopWordsRemover
spark = SparkSession.builder.config("spark.driver.host","192.168.1.4")\
    .config("spark.ui.showConsoleProgress","false")\
    .appName("StopWordsRemover").master("local[*]").getOrCreate()

02. Create the data

data = spark.createDataFrame([
    (["Beautiful","is","better","than","ugly"],),
    (["Explicit","is","better","than","implicit"],),
    (["Simple","is","better","than","complex"],),
    (["Complex","is","better","than","complicated"],),
    (["Flat","is","better","than","nested"],),
    (["Sparse","is","better","than","dense"],)
],["text"])
data.show()

Output:

+--------------------+
|                text|
+--------------------+
|[Beautiful, is, b...|
|[Explicit, is, be...|
|[Simple, is, bett...|
|[Complex, is, bet...|
|[Flat, is, better...|
|[Sparse, is, bett...|
+--------------------+

03. Use head to inspect the rows in full

data.head(6)

Output:

[Row(text=['Beautiful', 'is', 'better', 'than', 'ugly']),
 Row(text=['Explicit', 'is', 'better', 'than', 'implicit']),
 Row(text=['Simple', 'is', 'better', 'than', 'complex']),
 Row(text=['Complex', 'is', 'better', 'than', 'complicated']),
 Row(text=['Flat', 'is', 'better', 'than', 'nested']),
 Row(text=['Sparse', 'is', 'better', 'than', 'dense'])]

04. Use StopWordsRemover to remove the unwanted words and view the result

stopWordsRemover = StopWordsRemover(inputCol="text",outputCol="res",stopWords=["is","better","than"])
data = stopWordsRemover.transform(data)
data.show()
data.head(6)

Output:

+--------------------+--------------------+
|                text|                 res|
+--------------------+--------------------+
|[Beautiful, is, b...|   [Beautiful, ugly]|
|[Explicit, is, be...|[Explicit, implicit]|
|[Simple, is, bett...|   [Simple, complex]|
|[Complex, is, bet...|[Complex, complic...|
|[Flat, is, better...|      [Flat, nested]|
|[Sparse, is, bett...|     [Sparse, dense]|
+--------------------+--------------------+

[Row(text=['Beautiful', 'is', 'better', 'than', 'ugly'], res=['Beautiful', 'ugly']),
 Row(text=['Explicit', 'is', 'better', 'than', 'implicit'], res=['Explicit', 'implicit']),
 Row(text=['Simple', 'is', 'better', 'than', 'complex'], res=['Simple', 'complex']),
 Row(text=['Complex', 'is', 'better', 'than', 'complicated'], res=['Complex', 'complicated']),
 Row(text=['Flat', 'is', 'better', 'than', 'nested'], res=['Flat', 'nested']),
 Row(text=['Sparse', 'is', 'better', 'than', 'dense'], res=['Sparse', 'dense'])]

05. Inspect the schema

data.printSchema()

Output:

root
 |-- text: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- res: array (nullable = true)
 |    |-- element: string (containsNull = true)