iOS multi-language default language
Some time ago I presented a talk at CocoaHeads SP on how to use NLP in an iOS app. A lot has changed since then, so I thought it would be nice to post something about it.
Natural Language Processing
The idea of processing human language with computer programs has been around for a while. The tools, methods and approaches change rapidly, and there is a myriad of algorithms and techniques, some of which people have been using for decades, while others were created just a few years ago.
Some of the common tasks in NLP are tokenization, lemmatization, part-of-speech tagging, word embeddings and text classification. There are, obviously, many other tasks in NLP, but it would be impossible to talk about all of them in one post, so I decided to talk about these.
For each of the tasks I will give a brief explanation, maybe with some use cases for an app, and then I will show how to implement it in iOS.
Tokenization
A text is usually represented in a program as a string. Tokenization handles the question (which might look trivial from a simplistic perspective) of how to split that string into units (paragraphs, sentences, words, etc.).
At first you might be tempted to split the string by breaking it at every period for sentences and at every blank space for words, for instance. It could work for a few very short texts, but if you’re handling a larger text, chances are that approach would not suffice. For example, when tokenizing a sentence like “Mr. Hegarty lives in New York.”, you’d want to keep “Mr. Hegarty” together in the same sentence instead of breaking at the period after “Mr”.
Tokenization is the technique used to decompose a text into units (“tokens”) that can be used later in the processing.
In order to tokenize a text in iOS, you'll need to instantiate an NLTokenizer and call the enumerateTokens method.
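Here is a minimal sketch of that call sequence. The helper name `tokens(in:unit:)` and the sample text are made up for illustration; only `NLTokenizer`, its `string` property and `enumerateTokens(in:using:)` come from the NaturalLanguage framework.

```swift
import NaturalLanguage

// Hypothetical helper: splits `text` into tokens of the given unit
// (.word, .sentence, .paragraph or .document).
func tokens(in text: String, unit: NLTokenUnit) -> [String] {
    let tokenizer = NLTokenizer(unit: unit)
    tokenizer.string = text

    var result: [String] = []
    tokenizer.enumerateTokens(in: text.startIndex..<text.endIndex) { range, _ in
        result.append(String(text[range]))
        return true // returning false would stop the enumeration early
    }
    return result
}

// Invented sample text, reusing the example sentence from above.
let tokenizationSample = "Mr. Hegarty lives in New York. He moved there last year."

for sentence in tokens(in: tokenizationSample, unit: .sentence) {
    print("sentence:", sentence)
}
for word in tokens(in: tokenizationSample, unit: .word) {
    print("word:", word)
}
```

Switching the unit is all it takes to go from sentences to words, so the same helper covers both cases.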
Lemmatization
As a former linguistics student, I’m tempted to spend longer than I should discussing what lemmatization really is, but for the sake of simplicity I’ll be brief for now, at the risk of disappointing my linguist friends.
The idea behind lemmatizing a word is turning both loved and loving into the same lemma, love, or turning is and were into be. This can be useful for a number of use cases. Say you want your user to be able to search through a database of photo descriptions, and they want to find pictures with rain, for instance. They might type “raining” in the text field, but I bet they would like to get results like “… it rained all day…” or “… the rain didn’t stop for a minute…”, although neither of those sentences contains the exact word “raining”.
In order to get a word’s lemma in iOS, you'll need to use the NLTagger class. You should initialize an NLTagger object with .lemma as one of its schemes, set its string property to the text you want to lemmatize, and call enumerateTags.
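The steps above can be sketched as follows. The helper name `lemmas(in:)` and the fallback to the original word when no lemma is found are my own assumptions; the framework calls are `NLTagger(tagSchemes:)`, `string` and `enumerateTags(in:unit:scheme:options:using:)`.

```swift
import NaturalLanguage

// Hypothetical helper: returns each word of `text` paired with its lemma.
func lemmas(in text: String) -> [(word: String, lemma: String)] {
    let tagger = NLTagger(tagSchemes: [.lemma])
    tagger.string = text

    var result: [(word: String, lemma: String)] = []
    let options: NLTagger.Options = [.omitPunctuation, .omitWhitespace]

    tagger.enumerateTags(in: text.startIndex..<text.endIndex,
                         unit: .word,
                         scheme: .lemma,
                         options: options) { tag, range in
        let word = String(text[range])
        // The tagger may not know a lemma for every token;
        // falling back to the word itself is an assumption of this sketch.
        result.append((word: word, lemma: tag?.rawValue ?? word))
        return true
    }
    return result
}

for pair in lemmas(in: "It rained all day and the rain did not stop.") {
    print(pair.word, "->", pair.lemma)
}
```

For the photo search example, you could lemmatize both the query and the descriptions and compare the lemmas instead of the raw words.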
Part-of-speech tagging
A part-of-speech is the syntactic class to which a word belongs. A word can be a verb, a noun, a preposition and so on.
Determining the POS of a word in a text is far from being a trivial task. In languages like English, where almost every word can get “verbalized”, things could get really complicated.
In order to get the POS tags for the tokens in a text, we can use an approach very similar to the one used for lemmatization. The only difference is that instead of using the .lemma scheme, we should use .lexicalClass.
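A minimal sketch of the same enumeration with the .lexicalClass scheme is shown below. The helper name `partsOfSpeech(in:)` and the sample sentence are hypothetical; the tag values printed depend on the tagger and the OS version, so no expected output is given.

```swift
import NaturalLanguage

// Hypothetical helper: returns each word of `text` with its lexical class tag
// (for example NLTag.noun or NLTag.verb), or nil when the tagger has no answer.
func partsOfSpeech(in text: String) -> [(word: String, tag: NLTag?)] {
    let tagger = NLTagger(tagSchemes: [.lexicalClass])
    tagger.string = text

    var result: [(word: String, tag: NLTag?)] = []
    let options: NLTagger.Options = [.omitPunctuation, .omitWhitespace]

    tagger.enumerateTags(in: text.startIndex..<text.endIndex,
                         unit: .word,
                         scheme: .lexicalClass,
                         options: options) { tag, range in
        result.append((word: String(text[range]), tag: tag))
        return true
    }
    return result
}

for item in partsOfSpeech(in: "Mr. Hegarty lives in New York.") {
    print(item.word, "->", item.tag?.rawValue ?? "unknown")
}
```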
Word embeddings
Representing words has always been a challenge. With the advancement of GPUs about a decade ago, and the consequent revival of neural networks, it became necessary to represent words numerically, and it would be even nicer if the word representation could somehow encode the similarities and differences between words.
Word embeddings provide just that. They are a representation of words as n-dimensional vectors (where n usually goes from 50 to 300). That representation is suitable as an input to Machine Learning models, like neural nets. Also, with that representation we end up getting some interesting consequences. For example, we can calculate the Euclidean distance between words (everybody remembers Pythagoras' Theorem, right?), and that distance is related to the semantic similarity between words. So you can imagine the vector for the word "dog" being closer to the vector for the word "cat" than to that of the word "space".
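To make the distance idea concrete, here is a small sketch of the Euclidean distance between two vectors. The three-dimensional vectors are toy values invented for this example, not real embeddings, and `euclideanDistance` is a hypothetical helper.

```swift
// Euclidean distance: square root of the sum of squared differences
// between the components of two vectors of the same dimension.
func euclideanDistance(_ a: [Double], _ b: [Double]) -> Double {
    precondition(a.count == b.count, "vectors must have the same dimension")
    var sum = 0.0
    for (x, y) in zip(a, b) {
        sum += (x - y) * (x - y)
    }
    return sum.squareRoot()
}

// Toy vectors, made up for illustration only.
let dogVector = [0.9, 0.8, 0.1]
let catVector = [0.8, 0.9, 0.2]
let spaceVector = [0.1, 0.2, 0.9]

let dogToCat = euclideanDistance(dogVector, catVector)
let dogToSpace = euclideanDistance(dogVector, spaceVector)
print(dogToCat < dogToSpace) // "dog" is closer to "cat" than to "space" for these toy values
```

Note that the NaturalLanguage framework's own distance(between:and:distanceType:) method on NLEmbedding reports cosine distance rather than Euclidean distance, so its numbers will differ from a hand-rolled computation like this one.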
A vector representation can be useful for an iOS app in a number of ways. I would like to mention two use cases for word embeddings in an app: firstly, as an input for a Core ML model; and, secondly, in order to make some features somewhat "smarter".
In order to use your text as an input for a Core ML model, you'd have to encode it as a matrix where each row is the vector for a word in the text. The steps would be basically: (1) tokenize your text string, as I mentioned above; (2) get the vector for each token; (3) pass the list of vectors as an input to your Core ML model.
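Steps (1) and (2) can be sketched as below. The helper name `embeddingMatrix(for:using:)` is hypothetical, and two choices are assumptions of this sketch rather than requirements: words are lowercased before the lookup, and words missing from the embedding's vocabulary are skipped. Step (3) is left out because the exact input type depends on your Core ML model.

```swift
import NaturalLanguage

// Hypothetical helper: turns `text` into a matrix with one row (vector) per word.
func embeddingMatrix(for text: String, using embedding: NLEmbedding) -> [[Double]] {
    let tokenizer = NLTokenizer(unit: .word)
    tokenizer.string = text

    var rows: [[Double]] = []
    tokenizer.enumerateTokens(in: text.startIndex..<text.endIndex) { range, _ in
        // Assumption: look words up in lowercase.
        let word = text[range].lowercased()
        // Assumption: words without a vector are skipped instead of padded.
        if let vector = embedding.vector(for: word) {
            rows.append(vector)
        }
        return true
    }
    return rows
}

// The built-in embedding may be unavailable for a language or OS version.
if let embedding = NLEmbedding.wordEmbedding(for: .english) {
    let matrix = embeddingMatrix(for: "The rain did not stop", using: embedding)
    print("\(matrix.count) rows of \(embedding.dimension) values each")
} else {
    print("No English word embedding available on this system.")
}
```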
When Apple announced Core ML at WWDC 2017, I remember one of my unanswered questions in the lab was how to easily use word embeddings in order to preprocess the text for a model. Back then, if you wanted to use word embeddings, you’d have to do it “manually”, loading the vectors from disk at run time.
A lot has changed since then, and now getting a vector representation for a word is easier than adding a gradient background to a button!
All you need to do is instantiate an NLEmbedding using the wordEmbedding(for:) factory method, then call vector(for:), passing the word you need as a string.
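A minimal sketch of those two calls follows. The wrapper function `wordVector(for:)` is a hypothetical name; both framework calls return optionals, since the embedding may not exist for a language and a word may not be in its vocabulary.

```swift
import NaturalLanguage

// Hypothetical wrapper: returns the English embedding vector for `word`,
// or nil when the embedding or the word is unavailable.
func wordVector(for word: String) -> [Double]? {
    guard let embedding = NLEmbedding.wordEmbedding(for: .english) else {
        return nil
    }
    return embedding.vector(for: word)
}

if let vector = wordVector(for: "dog") {
    print("dimension:", vector.count)
    print("first values:", Array(vector.prefix(5)))
} else {
    print("No vector available for this word on this system.")
}
```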
Another interesting use case for word embeddings is allowing your app to be "smarter". Let's use the same example of an app where the user can search for a picture based on its description. Say your user wants to find a picture with a house in it, so they type "house" in the search bar. It's quite possible that they would also like to get a picture whose description mentions a "mansion" or a "building". How could you implement that?
One of the cool features of word embeddings is getting the "close neighbors" of a given word. So, if your user searches for a word w, you could implement your search so that it returns the results where that word appears, and also the results where the closest neighbors of that word appear.
Getting the neighbors for a given word in iOS is a piece of cake:
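The sketch below uses the neighbors(for:maximumCount:) method of NLEmbedding to expand a search query. The helper name `expandedSearchTerms(for:maximumNeighbors:)`, the limit of five neighbors, and the lowercasing of the query are assumptions made for this example. Which neighbors come back depends on the embedding shipped with the OS.

```swift
import NaturalLanguage

// Hypothetical helper: returns the query itself followed by its closest
// neighbors in the English word embedding.
func expandedSearchTerms(for query: String, maximumNeighbors: Int = 5) -> [String] {
    // Assumption: look the query up in lowercase.
    let word = query.lowercased()
    guard let embedding = NLEmbedding.wordEmbedding(for: .english) else {
        return [word] // no embedding available: search for the word alone
    }
    // Each neighbor is a (word, distance) pair; only the word is kept here.
    let neighbors = embedding.neighbors(for: word, maximumCount: maximumNeighbors)
    return [word] + neighbors.map { $0.0 }
}

let searchTerms = expandedSearchTerms(for: "house")
print(searchTerms)
```

A search could then return every description that mentions any of these terms, not only the one the user typed.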
Text classification
Lastly, I'd like to mention text classification. The idea in text classification is basically, given a text, to determine which class it belongs to (for example, news article or sports text, or even positive or negative sentiment).
One of the ways to achieve text classification is using Core ML, as I mentioned above. Depending on the model you're gonna use, you may need to use word embeddings as a preprocessing step. But what I would like to mention here is one of the most used types of text classification: sentiment analysis.
The NaturalLanguage framework in iOS has a simple high-level API to determine whether a text is positive or negative.
In order to classify a text, you'll use the NLTagger introduced above.
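A minimal sketch using the .sentimentScore tag scheme is shown below. The helper name `sentimentScore(for:)` and the sample sentences are made up. The tagger reports the score as a string in the tag's raw value, ranging from -1.0 (most negative) to 1.0 (most positive), so the sketch converts it to a Double.

```swift
import NaturalLanguage

// Hypothetical helper: returns a sentiment score for `text`,
// or nil when the text is empty or the tagger gives no score.
func sentimentScore(for text: String) -> Double? {
    guard !text.isEmpty else { return nil }

    let tagger = NLTagger(tagSchemes: [.sentimentScore])
    tagger.string = text

    // Sentiment is evaluated on a larger unit than a word; .paragraph is used here.
    let (tag, _) = tagger.tag(at: text.startIndex,
                              unit: .paragraph,
                              scheme: .sentimentScore)
    return tag.flatMap { Double($0.rawValue) }
}

// Invented sample sentences.
for review in ["I loved this app, it works great!", "This is the worst update ever."] {
    if let score = sentimentScore(for: review) {
        print(review, "->", score)
    } else {
        print(review, "-> no score available")
    }
}
```

A score above zero can be read as positive and a score below zero as negative; where to draw the line for "neutral" is up to your app.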
That's not all!
There are still a lot of things you can do at the intersection between iOS and NLP. Obviously I had no pretension to be exhaustive in this post, but you can look up things like language detection, named entity recognition, document analysis and many other techniques that are easy to use and can have a very positive impact on your apps and, mainly, on the lives of your users.
I hope you liked it.
If you have any comments, questions, suggestions, etc., leave a comment! I'll be glad to answer!
Translated from: https://medium.com/cocoaacademymag/natural-language-processing-in-ios-2455a3f541a5