How to Wrap Your Head Around Spark NLP

Welcome to the second part of the Spark NLP article. In the first part, the objective was to present an icebreaker for NLP practitioners and to warm up minds towards Spark NLP. The strongest bias against any Spark-based library comes from the school of thought that says "Spark code is a bit different from your regular Python script". To fight this prejudice, learning strategies were shared, and if you have followed them, you are ready for the next level.

In this part of the article, we will compare spaCy to Spark NLP and dive deeper into Spark NLP modules and pipelines. I created a notebook that uses both spaCy and Spark NLP to do the same things, for a like-for-like comparison. spaCy performs well, but in terms of speed, memory consumption, and accuracy, Spark NLP outperforms it. As for ease of use, once the initial friction of Spark is overcome, I found it at least on par with spaCy, thanks to the convenience that comes with pipelines.

To excel in the required skill sets, I highly recommend working through the smart, comprehensive notebooks prepared by the Spark NLP creators, which provide numerous examples for real-life scenarios, together with the repo I created for practice.

Day 8/9: Understanding Annotators/Transformers in Spark NLP and Text Preprocessing with Spark

Built natively on Apache Spark and TensorFlow, the Spark NLP library provides simple, performant, and accurate NLP annotations for machine learning pipelines that can scale easily in a distributed environment. The library reuses the Spark ML pipeline while integrating NLP functionality.

The library covers many common NLP tasks, including tokenization, stemming, lemmatization, part-of-speech tagging, sentiment analysis, spell checking, and named entity recognition, all of which are included as open source and can be extended by training models on your own data. Spark NLP's annotators utilize rule-based algorithms, machine learning, and TensorFlow running under the hood to power specific deep learning implementations.

In Spark NLP, all annotators are either Estimators or Transformers, just as in Spark ML, and they come in two types: AnnotatorApproach and AnnotatorModel. Any annotator that trains on a DataFrame to produce a model is an AnnotatorApproach. Those that transform one DataFrame into another DataFrame through some model are AnnotatorModels (e.g. WordEmbeddingsModel). Normally, an annotator does not take the Model suffix if it does not rely on a pre-trained annotator while transforming a DataFrame (e.g. Tokenizer). Here is a list of annotators and their descriptions:

Introduction to Spark NLP — Foundations and Basic Components
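
To make the AnnotatorApproach/AnnotatorModel distinction concrete, here is a minimal sketch in PySpark; the sample text, column names, and the use of the default pretrained word embeddings are illustrative choices of mine, not from the original article:

import sparknlp
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import Tokenizer, WordEmbeddingsModel

spark = sparknlp.start()
df = spark.createDataFrame([["Annotators come in two flavours."]]).toDF("text")

# DocumentAssembler is a plain transformer: no fitting needed
doc_df = DocumentAssembler().setInputCol("text").setOutputCol("document").transform(df)

# Tokenizer is an AnnotatorApproach: fit() on a DataFrame returns a TokenizerModel
tokenizer_model = Tokenizer().setInputCols(["document"]).setOutputCol("token").fit(doc_df)
token_df = tokenizer_model.transform(doc_df)

# WordEmbeddingsModel is an AnnotatorModel: pretrained() is ready to transform directly
embeddings = WordEmbeddingsModel.pretrained() \
    .setInputCols(["document", "token"]) \
    .setOutputCol("embeddings")
embeddings.transform(token_df).select("embeddings.result").show(truncate=False)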

To get through the NLP process, we need to preprocess the raw data. In addition to SQL filters, transformations, and user-defined functions, Spark NLP comes with powerful tools for the task. DocumentAssembler is a special transformer that creates the first annotation of type Document, which may be used by follow-up annotators in the pipeline.
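
As a quick sketch of what DocumentAssembler does (the sample sentence and the "shrink" cleanup mode are assumptions of mine for illustration):

import sparknlp
from sparknlp.base import DocumentAssembler

spark = sparknlp.start()
df = spark.createDataFrame([["The   quick brown fox.\nIt jumps over the lazy dog."]]).toDF("text")

document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document") \
    .setCleanupMode("shrink")   # collapse repeated whitespace and newlines

# each row now carries a DOCUMENT annotation that downstream annotators can consume
document_assembler.transform(df).select("document").show(truncate=False)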

TokenAssembler reconstructs a Document type annotation from tokens, usually after these have been normalized, lemmatized, spell checked, etc., so that this document annotation can be used by further annotators.
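
A minimal sketch of TokenAssembler at work, assuming a tiny pipeline with a Tokenizer and a Normalizer in front of it (the sample text and column names are mine):

import sparknlp
from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler, TokenAssembler
from sparknlp.annotator import Tokenizer, Normalizer

spark = sparknlp.start()
df = spark.createDataFrame([["John's e-mail, sent at 9AM, was GREAT!!"]]).toDF("text")

pipeline = Pipeline(stages=[
    DocumentAssembler().setInputCol("text").setOutputCol("document"),
    Tokenizer().setInputCols(["document"]).setOutputCol("token"),
    Normalizer().setInputCols(["token"]).setOutputCol("normal").setLowercase(True),
    # stitch the cleaned tokens back together into a single DOCUMENT annotation
    TokenAssembler().setInputCols(["document", "normal"]).setOutputCol("clean_doc"),
])
pipeline.fit(df).transform(df).select("clean_doc.result").show(truncate=False)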

Doc2Chunk converts DOCUMENT type annotations into CHUNK type, taking the chunk contents from a designated chunk column, while Chunk2Doc converts a CHUNK type column back into DOCUMENT. The latter is useful when trying to re-tokenize or do further analysis on a CHUNK result.
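
A small sketch of this round trip; the "target" column holding the chunk text is an assumption of mine for illustration:

import sparknlp
from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler, Doc2Chunk, Chunk2Doc

spark = sparknlp.start()
df = spark.createDataFrame(
    [["Spark NLP is an open-source library.", "open-source library"]]
).toDF("text", "target")

pipeline = Pipeline(stages=[
    DocumentAssembler().setInputCol("text").setOutputCol("document"),
    # DOCUMENT -> CHUNK, taking the chunk contents from the "target" column
    Doc2Chunk().setInputCols(["document"]).setOutputCol("chunk").setChunkCol("target"),
    # CHUNK -> DOCUMENT, so the chunk can be re-tokenized or analysed further
    Chunk2Doc().setInputCols(["chunk"]).setOutputCol("chunk_doc"),
])
pipeline.fit(df).transform(df).select("chunk.result", "chunk_doc.result").show(truncate=False)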

Finisher outputs annotation values as plain strings for ease of use. Once we have our NLP pipeline ready to go, we might want to use our annotation results somewhere else where plain values are more practical than annotation structs.
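
Here is a minimal sketch of Finisher turning token annotations into plain string arrays (the sample text and column names are mine):

import sparknlp
from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler, Finisher
from sparknlp.annotator import Tokenizer

spark = sparknlp.start()
df = spark.createDataFrame([["Finisher turns annotations into plain strings."]]).toDF("text")

pipeline = Pipeline(stages=[
    DocumentAssembler().setInputCol("text").setOutputCol("document"),
    Tokenizer().setInputCols(["document"]).setOutputCol("token"),
    # replace the annotation structs with plain string arrays and drop the rest
    Finisher()
        .setInputCols(["token"])
        .setOutputCols(["finished_token"])
        .setCleanAnnotations(True)
        .setIncludeMetadata(False),
])
pipeline.fit(df).transform(df).select("finished_token").show(truncate=False)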

As stated earlier, a Pipeline is specified as a sequence of stages, and each stage is either a Transformer or an Estimator. These stages are run in order, and the input DataFrame is transformed as it passes through each stage.

And here is how such a pipeline is coded in Spark NLP.

import sparknlp
from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import SentenceDetector, Tokenizer, Normalizer, WordEmbeddingsModel

spark = sparknlp.start()

document_assembler = DocumentAssembler().setInputCol("text").setOutputCol("document")
sentenceDetector = SentenceDetector().setInputCols(["document"]).setOutputCol("sentences")
tokenizer = Tokenizer().setInputCols(["sentences"]).setOutputCol("token")
normalizer = Normalizer().setInputCols(["token"]).setOutputCol("normal")
word_embeddings = WordEmbeddingsModel.pretrained() \
    .setInputCols(["document", "normal"]) \
    .setOutputCol("embeddings")

nlpPipeline = Pipeline(stages=[
    document_assembler,
    sentenceDetector,
    tokenizer,
    normalizer,
    word_embeddings
])
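
The snippet above stops at the pipeline definition; a short usage sketch of mine for fitting it and inspecting the output might look like this (the sample sentence is made up):

df = spark.createDataFrame([["Spark NLP pipelines transform a DataFrame stage by stage."]]).toDF("text")
pipeline_model = nlpPipeline.fit(df)    # fits the trainable stages in order
result = pipeline_model.transform(df)   # runs every stage over the DataFrame
result.select("normal.result", "embeddings.result").show(truncate=False)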