Next Word Prediction with LSTMs
As part of my summer internship with Linagora’s R&D team, I was tasked with developing a next word prediction and autocomplete system akin to Google’s Smart Compose. In this article, we look at everything you need to get started with next word prediction.
Email continues to be one of the largest forms of communication, both professionally and personally. It is estimated that almost 3.8 billion users send nearly 300 billion emails a day! Thus, for Linagora’s open source collaborative platform OpenPaaS, which includes an email system, improving the user experience is a priority. With real-time assisted writing, users can craft richer messages from scratch and enjoy a smoother flow than the static alternative: fully automated replies.
A real-time assisted writing system
The general pipeline of an assisted writing system relies on an accurate and fast next word prediction model. Building an industrial language model that enhances the user experience raises several crucial problems: inference time, model compression, and transfer learning to further personalise suggestions.
At the moment, these issues are being addressed by the growing use of deep learning for mobile keyboard prediction. Leading companies have already switched from statistical n-gram models to machine learning based systems deployed on mobile devices. However, deep learning brings the baggage of extremely large models unfit for on-device prediction. As a result, model compression is key to maintaining accuracy while not using too much space. Furthermore, mobile keyboards are also tasked with learning your personal texting habits, in order to make predictions suited to your style rather than just those of a general language model. Let’s take a look at how it’s done!
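The neural systems described above are typically built around a recurrent language model that scores every word in the vocabulary as the possible next token. A minimal sketch in PyTorch (the layer sizes and the 10,000-word vocabulary are illustrative assumptions, not the configuration of any particular product):

```python
import torch
import torch.nn as nn

class NextWordLSTM(nn.Module):
    """Tiny LSTM language model: maps a token-id sequence to a
    distribution over the next word at each position."""
    def __init__(self, vocab_size, embed_dim=128, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.proj = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tokens):              # tokens: (batch, seq_len)
        x = self.embed(tokens)              # (batch, seq_len, embed_dim)
        out, _ = self.lstm(x)               # (batch, seq_len, hidden_dim)
        return self.proj(out)               # (batch, seq_len, vocab_size)

model = NextWordLSTM(vocab_size=10_000)
batch = torch.randint(0, 10_000, (4, 12))   # 4 sequences of 12 token ids
logits = model(batch)
print(logits.shape)                         # one score per vocabulary word
```

At inference time, the suggestion shown to the user is simply the argmax (or top-k) of the logits at the last position; model compression then targets the embedding and projection matrices, which dominate the parameter count.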
Some useful training corpora
In order to evaluate the different model architectures I will use to build my language model, it is crucial to have a solid benchmark evaluation set. For this task, I chose the Penn Tree Bank dataset, which makes it easy to test whether your model is overfitting: due to its small vocabulary (roughly 10,000 unique words!), it is imperative to build a well-regularised model. In addition, Enron’s email corpus was used to train on real email data and to test predictions in the context of emails (with a much larger and richer vocabulary of nearly 39,000 unique words).
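Language models on benchmarks like Penn Tree Bank are usually compared by perplexity: the exponentiated average negative log-probability the model assigns to the actual next words in the held-out text. A widening gap between training and validation perplexity is the overfitting signal mentioned above. A small pure-Python illustration (the probabilities below are made up for the example):

```python
import math

def perplexity(next_word_probs):
    """Perplexity = exp of the mean negative log-probability that the
    model assigned to each true next word in the evaluation set."""
    nll = -sum(math.log(p) for p in next_word_probs) / len(next_word_probs)
    return math.exp(nll)

# Hypothetical probabilities a model gave to the true next words.
confident = [0.5, 0.6, 0.4, 0.5]     # model usually right -> low perplexity
uncertain = [0.05, 0.1, 0.02, 0.08]  # model usually wrong -> high perplexity

print(perplexity(confident))
print(perplexity(uncertain))
```

Intuitively, a perplexity of k means the model is, on average, as uncertain as if it were choosing uniformly among k words, which is why a lower score is better.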
My task was to build a bilingual model, in French and English, with the intention of generalising to other languages as well. For this, I also considered several widely available French texts. FrWac, a web archive of the whole .fr domain, is a great corpus for training on a diverse set of French sequences. For those with extensive GPU resources, the entire French Wikipedia dump is also available online. With my code, I trained on a couple of short stories from Project Gutenberg (another great resource for textual data in multiple languages!)
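Whichever corpus you pick (Enron, FrWac, or Gutenberg stories), the first step is the same: tokenise the raw text and map each word to an integer id, reserving one id for out-of-vocabulary words so the model can handle the long tail. A minimal sketch (the `<unk>` convention and the regex tokeniser are common choices, assumed here rather than taken from the original code):

```python
import re
from collections import Counter

# Word characters including French accented letters, plus apostrophes.
TOKEN_RE = re.compile(r"[a-zàâçéèêëîïôûùüÿœ']+")

def build_vocab(texts, max_size=10_000):
    """Keep the most frequent words; everything else maps to <unk>."""
    counts = Counter()
    for text in texts:
        counts.update(TOKEN_RE.findall(text.lower()))
    vocab = {"<unk>": 0}
    for word, _ in counts.most_common(max_size - 1):
        vocab[word] = len(vocab)
    return vocab

def encode(text, vocab):
    """Turn a sentence into the list of token ids the model consumes."""
    return [vocab.get(w, vocab["<unk>"])
            for w in TOKEN_RE.findall(text.lower())]

vocab = build_vocab(["The cat sat.", "Le chat dort.", "The cat slept."])
print(encode("the dog slept", vocab))  # unseen 'dog' falls back to <unk>
```

Capping the vocabulary size this way is exactly what makes the 10,000-word Penn Tree Bank setting and the 39,000-word Enron setting comparable: both are just different values of `max_size` over different corpora.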