spark-nlp
If you want a head start in enterprise NLP but have no clue about Spark, this article is for you. I have seen many colleagues who want to step into this domain but are disheartened by the initial learning overhead that comes with Spark. That overhead may seem daunting at first glance, since Spark code is a bit different from your regular Python script. However, the basics of Spark and Spark NLP are not really hard to learn. If you take this assertion on faith, I will show you how easy the basics are and provide a road map that paves the way to the key elements, which will satisfy most use cases of an intermediate-level practitioner. Thanks to the impeccable modularity of Spark NLP pipelines, for an average learner, mark my words, two weeks will be enough to build basic models. Roll up your sleeves, here we start!
Why Spark NLP? Supply and demand is the answer: It Is The Most Widely Used Library In Enterprises! Here are a few reasons why. Common NLP packages today were designed by academics, and they favor ease of prototyping over runtime performance, eclipsing scalability, error handling, frugal memory consumption, and code reuse. Although some libraries, such as "the industrial-strength NLP library", spaCy, might be considered an exception (since they are designed to get things done rather than to do research), they may still fall short of enterprise targets when it comes to dealing with data at volume.
![Image for post](https://miro.medium.com/max/9999/1*IlLAFarDtcfS_rlxmCuH8w.png)
We are going to take a different strategy here. Rather than following the crowd through the usual routine, we will use basic libraries to brush up on the 'basics' and then jump directly to the enterprise sector. Our final aim is to target the niche by building continental-scale pipelines, which standard libraries, however capable within their own league, cannot handle.
If you are not convinced yet, please read this article for a benchmark and comparison with spaCy; it will give you five good reasons to start with Spark NLP. First of all, Spark NLP inherits its innate scalability from Spark, which was built for distributed applications and designed to scale. Spark NLP benefits from this, since it can scale on any Spark cluster, on premises, and with any cloud provider. Furthermore, Spark NLP's optimizations let it run orders of magnitude faster than the inherent design limitations of legacy libraries allow. It is built around the concept of annotators and includes more of them than other NLP libraries do: sentence detection, tokenization, stemming, lemmatization, POS tagging, NER, dependency parsing, text matching, date matching, chunking, context-aware spell checking, sentiment detection, pre-trained models, and trainable models with very high accuracy according to peer-reviewed academic results. Spark NLP also includes a production-ready implementation of BERT embeddings for named entity recognition; for example, it makes far fewer errors on NER than spaCy, which we test in the second part of this article. Also worth noting, Spark NLP provides a full Python API and supports training on GPUs, user-defined deep learning networks, Spark, and Hadoop.
The library comes with a production-ready implementation of BERT embeddings and uses transfer learning for data extraction. Transfer learning is a highly effective method that can leverage even small amounts of data, so there is no need to collect large datasets to train SOTA models.
Also, the John Snow Labs Slack channel provides top-tier support, which is beneficial because developers and new learners tend to band together and create resources everyone can benefit from. You will get dedicated answers to your questions straight from the developers. I have been there a few times and can attest that their responses are quick and accurate. Additionally, anyone who finds themselves stuck can quickly get help from people who have had similar problems through Stack Overflow or similar platforms.