Lessons in Serverless Stacks: Using NLP to Block Unsolicited Sales Emails

In parts I and II of this series, I described a system to block unsolicited sales emails by applying natural language processing. After training and deploying a model, I built the entire app using serverless infrastructure to understand the relative cost and effort to develop such a system. This post highlights various lessons learned in my journey. If you’re reading this, I hope it helps you avoid some face-palming mistakes!

First, An Overview of My System

Here’s an overview of my application to set the context for the following sections.

A user begins by authorizing the Service to access Gmail. That authorization causes Gmail to send new messages to a Google Pub/Sub topic. A subscriber takes each topic message and calls an API endpoint to process the email (predicting whether it’s sales spam or not). If the NLP model predicts that the email is spam, then the app causes the user’s Gmail to reply to the sender, prompting them to unblock the email by solving a Captcha. If the salesperson solves the Captcha, then their email is marked unread and brought to the user’s primary inbox.

The following diagram provides a better view under the hood:

I implemented all of the blocks above using serverless infrastructure, leveraging DynamoDB, Lambda Functions, SQS queues, and Google Pub/Sub.

Serverless Machine Learning Model via Zappa

Here, ‘serverless’ means hosting that requires no permanent infrastructure. In that regard, “Zappa makes it super easy to build and deploy serverless, event-driven Python applications . . . on AWS Lambda + API Gateway.”

That means infinite scaling, zero downtime, zero maintenance — and at a fraction of the cost of your current deployments!

@Gun.io

Simplicity and versatility are Zappa’s greatest strengths. While you may read many blog posts about people using it for Django web applications, Zappa also provides turn-key ways to host serverless machine learning models accessible via API.

Configuring Zappa

I highly recommend reading Gun.io’s documentation. I’ll focus on the basics here to highlight Zappa’s elegant simplicity while calling out a few lessons.

First and foremost, navigate to your project root folder and configure a virtual environment with all of your required libraries, dependencies, and such. If you’re not just practicing, consider setting up a Docker environment for your Zappa app, because the closer your environment matches the AWS Lambda environment, the fewer difficult-to-debug problems you will face. Importantly, the virtual environment name should not be the same as the Zappa project name, as this may cause errors.

$ mkdir zappatest
$ cd zappatest
$ virtualenv ve
$ source ve/bin/activate
$ pip install zappa

Then run init to configure a variety of initial settings for your app:

$ zappa init

Open the zappa_settings.json file to edit or configure other settings, such as identifying what S3 bucket will store the Lambda function artifact. Here’s what I used:

{
    "dev": {
        "app_function": "api.app.app",
        "aws_region": "us-east-1",
        "profile_name": "default",
        "project_name": "serverless-ML",
        "runtime": "python3.6",
        "s3_bucket": "MY_BUCKET_NAME",
        "slim_handler": true,
        "debug": true
    }
}

Note that the “app_function”: “api.app.app” line points to a module (folder) api containing an app.py file in which app = Flask(__name__), laid out in this directory tree:

~env$my_project_name
.
|- api
   |- app.py
   |- ...

Finally, call an initial deployment to your environment:

zappa deploy YOUR_STAGE_NAME

That’s it! “If your application has already been deployed and you only need to upload new Python code, but not touch the underlying routes, you can simply” call zappa update YOUR_STAGE_NAME.

Loading a Model & Calling .predict() in a Zappa App

After training my model, I pickled the artifact into a model.pkl file and stored it in an S3 bucket. The Zappa application then loads the model using the Boto3 library, transforms the input JSON into a format amenable to the model, and returns a response:

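The original post embedded this code as a gist, which is not reproduced in this copy. The following is a minimal sketch of the idea, assuming a scikit-learn-style pipeline; the bucket, key, and JSON field names are placeholders rather than the actual values:

import json
import pickle

import boto3
from flask import Flask, request, jsonify

app = Flask(__name__)

# Load the pickled pipeline from S3 once, at cold start (bucket/key are placeholders).
s3 = boto3.client("s3")
obj = s3.get_object(Bucket="MY_BUCKET_NAME", Key="model.pkl")
model = pickle.loads(obj["Body"].read())

@app.route("/handle_data_from_app", methods=["POST"])
def handle_data_from_app():
    # The cURL example below uploads a JSON file as multipart form data.
    payload = json.loads(request.files["file"].read())
    text = payload.get("text", "")  # field name is an assumption
    label = model.predict([text])[0]
    return jsonify({"label": str(label)})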

Call zappa update YOUR_STAGE_NAME and the model is accessible via API. Hit the model with a cURL request to test it:

$ curl -F 'file=@payload.json' https://YOUR_ENDPOINT.execute-api.us-east-1.amazonaws.com/dev/handle_data_from_app

and watch the magic unfold in your CloudWatch logs:

Including Helper Modules in Your Serverless ML Model

The trickiest part of the setup above is highlighted in lines 63–66, where I import my helper module preprocess in the __main__ namespace. Without line 66, the model.pkl file performs various transformations that call my preprocess module, but it throws errors saying it cannot find that module name.

This happened because, before pickling my model, I used a local module (from helper import preprocess) as part of the pipeline called from within .predict(). So when I wanted to reuse that model, the environment in which it was called was not identical. I spent hours trying to figure out how to get the environments to match up. Here’s the key lesson: Zappa wraps the dependencies and libraries installed in your virtual environment into a zip file that gets uploaded to S3, which forms the content of the Lambda function. Essentially,

if you cannot or did not package your local modules into the virtual environment, then they do not exist in your Zappa Lambda environment.

Furthermore, if your model’s .predict() relies on a particular namespace, then you must replicate that exactly, too. This is why I had to run from helper import preprocess in the ‘__main__’ block of the application. Placing it elsewhere resulted in path mismatches with what the pipeline inside my pickled model’s .predict() expected.

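A simplified sketch of the pattern (not the exact lines from the gist) looks like this:

import sys

from helper import preprocess  # local module packaged into the virtual environment

# The pipeline was pickled while `preprocess` lived in __main__, so register it
# there before unpickling; otherwise pickle cannot resolve the name.
setattr(sys.modules["__main__"], "preprocess", preprocess)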

Connecting Gmail to My ML Model via a Serverless Web App

I created a second Zappa application to connect Users with my ML model (the “Service”). At its core, the Service connects to a User’s Gmail account, fetches messages, and passes them to the model so it can predict whether the content is sales spam. To recap:

The guts of this app are beyond the scope of this post, so I’ll focus on the lessons learned instead.

30 Seconds: An Important Limit of API-enabled Lambda Functions

In the first version of my application, I took a batch of 50 messages from a user’s Gmail account and ran .predict() in one function call. Yet the function kept timing out despite setting a 15-minute timeout on my AWS Lambda. I also tried extending the timeout_seconds parameter in the Zappa settings, but that did not help either.

The bottleneck is not Zappa or Lambda functions but rather the API Gateway that Zappa uses. As noted in the AWS Documentation, the “maximum integration timeout” is “30 seconds” and it cannot be increased upon request. Apparently, the function continues to run in the background, but in my case I needed a response.

Separately, this issue helped me realize that I wasn’t following the principles of serverless technology, which is meant to leverage parallel, horizontal scaling.

Cue SQS! (Get it? ;)

Instead of predicting 50+ messages in a single Lambda invocation, it’s better to queue those messages in SQS. Then Zappa can trigger the .predict() function from the SQS event stream (e.g., once for each message). Boom: horizontal scaling with no timeout!

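As an illustration of that fan-out (the queue URL and message shape here are placeholders, not the app’s actual code):

import json

import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/XXXXXX/YYYYY"  # placeholder

def enqueue_messages(messages):
    # One SQS message per email; each one triggers its own Lambda invocation.
    for msg in messages:
        sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=json.dumps(msg))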

Using SQS Events to Trigger Zappa Functions

On my first pass, I manually configured a Lambda function to accept an SQS event via the UI console.

Then I coded the Lambda event to parse the SQS event and call a Zappa URL endpoint:

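That handler lived in a screenshot in the original post; a rough reconstruction (the endpoint URL and request format are placeholders) looks like this:

import json
import urllib.request

PREDICT_URL = "https://YOUR_ENDPOINT.execute-api.us-east-1.amazonaws.com/dev/handle_data_from_app"  # placeholder

def lambda_handler(event, context):
    # Forward the body of each SQS record to the Zappa prediction endpoint.
    for record in event["Records"]:
        data = json.dumps({"message": record["body"]}).encode("utf-8")
        req = urllib.request.Request(
            PREDICT_URL, data=data, headers={"Content-Type": "application/json"}
        )
        with urllib.request.urlopen(req) as resp:
            print(resp.status, resp.read()[:200])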

While that’s fairly simple, it’s missing git version control and command line redeployment! Fortunately, Zappa permits you to trigger functions within the app via AWS events.

First, I ported the Lambda function as just another Zappa function in the main app.py file:

def get_sqs_msg_and_call_predict(event, context):
    try:
        for record in event['Records']:
            body = record["body"]
            resp = sqs_predict_label(body)
    except Exception as e:
        # Send some context about this error to Lambda Logs
        print(e)

Then I reconfigured zappa_settings.json to include events:

"events": [
{
"function": "api.app.get_sqs_msg_and_call_predict",
"event_source": {
"arn": "arn:aws:sqs:us-east-1:XXXXXX:YYYYY",
"batch_size": 1, // Max: 10. Use 1 to trigger immediate processing
"enabled": true // Default is false
}
}
]

I lost hours trying to diagnose the proper incantation for "function": "xxx". The official Zappa documentation mentions "function": "your_module.process_upload_function". Intending to call a function in the main app.py portion of my Zappa application, I tried:

  • "function": "get_sqs_msg_and_call_predict"

    "function": "get_sqs_msg_and_call_predict"

  • "function": "app.get_sqs_msg_and_call_predict"

    "function": "app.get_sqs_msg_and_call_predict"

  • "function": "api.get_sqs_msg_and_call_predict"

    "function": "api.get_sqs_msg_and_call_predict"

All of which gave me variations on this error:

Traceback (most recent call last):
File "/var/task/handler.py", line 609, in lambda_handler
return LambdaHandler.lambda_handler(event, context)
File "/var/task/handler.py", line 243, in lambda_handler
return handler.handler(event, context)
File "/var/task/handler.py", line 418, in handler
app_function = self.import_module_and_get_function(whole_function)
File "/var/task/handler.py", line 235, in import_module_and_get_function
app_function = getattr(app_module, function)
AttributeError: module 'api' has no attribute 'get_sqs_msg_and_call_predict'
[123123] You are currently running on staging
[123123] module 'api' has no attribute 'get_sqs_msg_and_call_predict': AttributeError

Important reminder: you must consider the module as it relates to your root application folder. If you have a folder structure like this:

~env$project_name
.
|- api
   |- app.py
   |- templates
   |- ...

the proper setup includes "function": "api.app.some_function_therein". I hope this detail saves you time and headaches, dear reader.

The ‘return’ statement in event-triggered Zappa functions

Handling the SQS events in the main app.py file resulted in an unusual scenario. Here is the function again:

def get_sqs_msg_and_call_predict(event, context):
    try:
        for record in event['Records']:
            body = record["body"]
            resp = sqs_predict_label(body)
    except Exception as e:
        # Send some context about this error to Lambda Logs
        print(e)

You’ll note that the function above does not include a return statement. Try as I might, I was not able to generate an acceptable return statement. Fortunately, my code did not need one, so I decided to skip this rather than dwell on the proper way to handle the scenario.

Here’s the problem in a nutshell:

[1231231] Result of zappa.asynchronous.route_lambda_task
[1231232] <Response 192 bytes [200 OK]>
[1231233] An error occurred during JSON serialization of response: <Response 192 bytes [200 OK]> is not JSON serializable
Traceback (most recent call last):
File "/var/lang/lib/python3.6/json/__init__.py", line 238, in dumps **kw).encode(obj)
File "/var/lang/lib/python3.6/json/encoder.py", line 199, in encode chunks = self.iterencode(o, _one_shot=True)
File "/var/lang/lib/python3.6/json/encoder.py", line 257, in iterencode return _iterencode(o, 0)
File "/var/runtime/awslambda/bootstrap.py", line 145, in decimal_serializer
raise TypeError(repr(o) + " is not JSON serializable")
TypeError: <Response 192 bytes [200 OK]> is not JSON serializable

StackOverflow offers solutions using jsonify(response) to make the response output JSON serializable. Unfortunately, that did not work in this case:

This typically means that you attempted to use functionality that needed to interface with the current application object in some way. To solve this, set up an application context with app.app_context(). See the documentation for more information.: RuntimeError
Traceback (most recent call last):
File "/var/task/handler.py", line 609, in lambda_handler
return LambdaHandler.lambda_handler(event, context)
File "/var/task/handler.py", line 243, in lambda_handler
return handler.handler(event, context)
File "/var/task/handler.py", line 419, in handler
result = self.run_function(app_function, event, context)
File "/var/task/handler.py", line 282, in run_function
result = app_function(event, context)
File "/tmp/push-email-hand/api/app.py", line 1216, in get_sqs_msg_and_call_predict
raise e
File "/tmp/push-email-hand/api/app.py", line 1205, in get_sqs_msg_and_call_predict
sqs_predict_label(body)
File "/tmp/push-email-hand/api/app.py", line 1249, in sqs_predict_label
return make_response(jsonify(response), status)
File "/tmp/push-email-hand/flask/json/__init__.py", line 358, in jsonify
if current_app.config["JSONIFY_PRETTYPRINT_REGULAR"] or current_app.debug:
File "/var/task/werkzeug/local.py", line 348, in __getattr__
return getattr(self._get_current_object(), name)
File "/var/task/werkzeug/local.py", line 307, in _get_current_object
return self.__local()
File "/tmp/push-email-hand/flask/globals.py", line 52, in _find_app
raise RuntimeError(_app_ctx_err_msg)
RuntimeError: Working outside of application context.

I suspect this happened because I was calling an asynchronous task using from zappa.asynchronous import task , which somehow changed the application context. I’d love to hear from you if you figure this one out!

Configuring Environment Variables via Zappa

It’s quite easy to set environment variables in zappa_settings.json to pass into the code environment, like this:

{
    "prod": {
        "app_function": "api.app.app",
        # . . . #
        "environment_variables": {
            "code_env": "prod"
        }
    }
}

Beware an important nuance: environment_variables are local to your application and are injected into the Lambda handler at runtime. In other words, environment_variables will not set AWS Lambda environment variables.

Yet these variables are required from time to time, such as when you want to set credentials. One way to solve this issue is to navigate to the UI console and set them there. This approach is not ideal for deploying multiple environments (e.g., staging-branch1, staging-branch2, testing, prod, etc.). The better way is to use Zappa’s aws_environment_variables configuration setting:

{
    "prod": {
        "app_function": "api.app.app",
        # . . . #
        "aws_environment_variables": {
            "GOOGLE_CLIENT_ID": "XXX",
            "GOOGLE_CLIENT_SECRET": "YYY",
            "GOOGLE_REFRESH_TOKEN": "ZZZ",
            "code_env": "prod"
        }
    }
}

Parsing a Gmail Message

Many have written on the complicated subject of parsing email content from Gmail messages. Long story short, emails come in many shapes and sizes (text/plain, text/html, multipart, etc.). Had I parsed text/plain email only, I would have missed a significant number of messages. Thus, I’m including the code below in hopes of helping readers who don’t want to re-write this from scratch. Additionally, it was frustrating how various Python libraries wouldn’t play well together. So I’m including the versions I used below:

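The original gist (with its pinned library versions) is not reproduced in this copy of the post, so treat the following as a minimal sketch of the approach rather than the exact function; it assumes beautifulsoup4 is available for the text/html case:

import base64

from bs4 import BeautifulSoup  # assumes beautifulsoup4 is installed

def _decode(data):
    return base64.urlsafe_b64decode(data.encode("utf-8")).decode("utf-8", errors="replace")

def extract_body(payload):
    # Recursively walk a Gmail API message 'payload' and return plain text.
    mime = payload.get("mimeType", "")
    data = payload.get("body", {}).get("data")
    if mime == "text/plain" and data:
        return _decode(data)
    if mime == "text/html" and data:
        return BeautifulSoup(_decode(data), "html.parser").get_text(" ", strip=True)
    # multipart/*: return the first part that yields any text
    for part in payload.get("parts", []) or []:
        text = extract_body(part)
        if text:
            return text
    return ""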

Copy-paster beware: the function above isn’t perfect, but it gets the job done. Feel free to send me your thoughts if you end up implementing this.

Dealing with “At Least Once Delivery”

I had a lot of fun debugging an issue related to “at least once delivery” from Google’s Pub/Sub system. By way of background, my application calls .watch() on a User’s Gmail Account, which sends “the ID of the mailbox's current history record“ to a Google Publisher Topic. A connected Subscriber then makes a push notification to my Zappa endpoint, which processes the historyId necessary to fetch the latest messages and queue them into SQS.

While playing with the Gmail API from my console, I got the impression that the subscriber push notification provided a historyId that retrieved messages since the last push notification. In other words, I thought receiving historyId at time=2 meant that I could retrieve all messages from time=1 to time=2. So I built my app’s logic on this assumption and it seemingly worked…

Then I went mad debugging why the system would sometimes fetch 0 messages from a historyId, whereas other times it would fetch a few messages, without rhyme or reason.

At-least-once delivery means that “messages may sometimes be delivered out of order or more than once.”

Days later, after probing the application with loggers and tracers, it hit me like a ton of bricks. When the subscriber delivered a message the first time, it was providing the latest historyId and there were no new messages to fetch. But due to at-least-once delivery, the subscriber could re-deliver the messages out of order or several times. If the Subscriber re-delivered a message and the User received new email since the first Subscriber push, then the system would fetch the new emails — giving me the false impression that my earlier logic worked properly. #facePalm

Image by Alex E. Proimos, CC BY 2.0.

My solution was to build a historyId buffer per User, using the (n-1) historyId for fetching messages and then replacing the second latest value with the latest historyId:

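In sketch form, with hypothetical table and attribute names:

import boto3

users = boto3.resource("dynamodb").Table("UserTable")  # hypothetical table name

def rotate_history_id(user_id, latest_history_id):
    # Fetch with the previous (n-1) historyId, then store the latest one for next time.
    item = users.get_item(Key={"user_id": user_id}).get("Item", {})
    previous = item.get("prev_history_id", latest_history_id)
    users.update_item(
        Key={"user_id": user_id},
        UpdateExpression="SET prev_history_id = :h",
        ExpressionAttributeValues={":h": latest_history_id},
    )
    return previous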

Building Idempotent Services

Here’s another gem in the Google Pub/Sub documentation:

Typically, Pub/Sub delivers each message once and in the order in which it was published. However, messages may sometimes be delivered out of order or more than once. In general, accommodating more-than-once delivery requires your subscriber to be idempotent when processing messages.

In short, a function is idempotent if it can be repeated or retried as often as necessary without causing unintended effects. This approach is useful given that Gmail’s push messages may arrive repeatedly. So rather than re-processing the same message again and again (and wasting resources), I set up a DynamoDB table that stores each msg_id for lookup. If a msg_id is found, that message is dropped from further processing.

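A minimal sketch of that lookup (the table and key names are hypothetical):

import boto3

seen = boto3.resource("dynamodb").Table("ProcessedMessages")  # hypothetical table name

def already_processed(msg_id):
    # Process each Gmail msg_id at most once; repeat deliveries are dropped.
    if "Item" in seen.get_item(Key={"msg_id": msg_id}):
        return True
    seen.put_item(Item={"msg_id": msg_id})
    return False

A conditional put_item with attribute_not_exists(msg_id) would make the check fully atomic, but a simple lookup captures the idea.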

Deploying Resources with AWS SAM Templates

One of the best things about serverless development is the ability to deploy multiple, interconnected resources from the command line. Suppose an application required six Lambda functions, four DynamoDB tables, two SQS queues, and an API Gateway. Using the AWS CLI or UI Console would be tiresome. Thanks to AWS Serverless Application Model (AWS SAM) this task is very simple. But if you’re new to serverless, CloudFormation, or SAM, the template file that manages those resources can read like hieroglyphics. So let’s start with a quick anatomy of the SAM template.

AWS SAM Template Anatomy

“A template is a JSON- or YAML-formatted text file that describes your AWS infrastructure.” The following snippet shows a SAM template and its sections:

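The full template is not reproduced here; a pared-down skeleton of the sections discussed below (all values are placeholders) looks like this:

AWSTemplateFormatVersion: '2010-09-09'
Transform: AWS::Serverless-2016-10-31

Globals:
  Api:
    Cors: "'*'"

Parameters:
  Environment:
    Type: String
    Default: staging

Resources:
  SomeLambdaFunction:
    Type: AWS::Serverless::Function
    # ... unpacked in the next section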

The SAM template includes a Transform: to identify the version of the AWS Serverless Application Model (AWS SAM) to use. Then the template sets Globals: settings for the serverless app. For instance, there I set the CORS policy for my API resource. Next, the template identifies Parameters: that are passed into the template at run-time via the command line. After that, the template has a Resources: section, where each Lambda function, DynamoDB table, and queue is defined.

Unpacking a Lambda Resource

The Lambda function has a Logical ID, which is the unique name for the resource in a template file. In the example below, the Logical ID is SomeLambdaFunction:

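The original example is a gist; a hedged reconstruction of such a resource (names, the policy, and the referenced table are placeholders, with MyUserTable assumed to be defined elsewhere in the same template) looks roughly like this:

  SomeLambdaFunction:
    Type: AWS::Serverless::Function
    Properties:
      CodeUri: functions/your_event/
      Handler: app.lambda_handler
      Runtime: python3.6
      Environment:
        Variables:
          DDBTableName: !Ref MyUserTable
      Policies:
        - Statement:
            - Effect: Allow
              Action:
                - dynamodb:PutItem
                - dynamodb:UpdateItem
              Resource: !GetAtt MyUserTable.Arn
      Events:
        YourEvent:
          Type: Api
          Properties:
            Path: /your_path
            Method: post
            RequestParameters:
              - method.request.querystring.userId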

The Path: /your_path identifies the API path to call this function, while RequestParameters: tells the template what variables to expect. The Variables: are local environment variables for the Lambda function, which prove useful for managing database connections, among other things. The Policies: section identifies permissions for this function. For example, if this function is meant to put items in a DynamoDB table, then you must specify the actions and the ARN to the relevant table. Finally, the line CodeUri: functions/your_event/ maps your resource to the Lambda handler code in your project folder. For instance, the Lambda function above should have this folder structure:

~env$my_SAM_project
.
|- functions
   |- your_event
      |- app.py
   |- ...

The file app.py then performs whatever functions you need. By way of example, in my application, the app.py takes query string parameters to update a DynamoDB table:

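The handler itself was another gist; a hedged sketch (the query string parameters, key, and update expression are assumptions) might look like this:

import os

import boto3

ddbclient = boto3.resource("dynamodb")

def lambda_handler(event, context):
    # Query string parameters arrive via API Gateway; the table name is injected
    # through the template's Environment: > Variables: block.
    params = event.get("queryStringParameters") or {}
    table = ddbclient.Table(os.environ['DDBTableName'])
    table.update_item(
        Key={"userId": params.get("userId", "")},
        UpdateExpression="SET #c = :c",
        ExpressionAttributeNames={"#c": "credentials"},
        ExpressionAttributeValues={":c": params.get("credentials", "")},
    )
    return {"statusCode": 200, "body": "ok"}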

Take note of this line: table = ddbclient.Table(os.environ['DDBTableName']), because it shows how to use an environment variable passed into a Lambda function via the template file. In a section below you will see how this powerful feature helps manage multiple environments and tables using a single template file.

Once you are done coding your Lambda functions, you must build the SAM template using the AWS CLI:

$ sam build

If you are still making changes to the app.py file, then I recommend that you remove the AWS SAM build artifacts so that your code changes take effect immediately:

$ rm -rf .aws-sam/*

Otherwise, you have to re-build the SAM template after each code change. This gets very annoying, very quickly. Just don’t forget to re-build the SAM template if you make any changes to the template file!

Finally, after testing your functions locally, you can deploy your SAM template, like this:

sam deploy --template-file .aws-sam/build/template.yaml --stack-name YOUR_STACK_NAME --s3-bucket YOUR_S3_BUCKET --capabilities CAPABILITY_NAMED_IAM --region us-east-1 --parameter-overrides ParameterKey=DDBTableName,ParameterValue=MyUserTable

You should test your functions and resources locally given how long it takes to deploy a SAM template. That is beyond the scope of this (already lengthy) post, so I leave you with this excellent guide instead.

Managing Multiple Environments from a Single SAM Template

Working with a template file is simple enough, but there is some complexity as you leverage multiple environments, such as staging, testing, and production. Some people suggest maintaining multiple template files or even multiple AWS accounts for this purpose. I do not find those proposals convincing because they invite other problems, such as:

  • ensuring that the multiple templates are actually identical, or

  • managing billing, user, and security issues due to using multiple AWS accounts.

The better approach requires a bit of setup but provides an elegant solution to managing multiple environments and resources using a single template file.

The first step is to introduce a few parameters into your template file:

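For example (the parameter names are assumptions inferred from how the resources are named later):

Parameters:
  Environment:
    Type: String
    AllowedValues:
      - staging
      - testing
      - prod
  Branch:
    Type: String
    Default: master
  ProjectName:
    Type: String
    Default: sales-spam-blocker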

Then you can use those parameters in the template file to dynamically build each Lambda function, DynamoDB table, and so forth. Here’s an example from my SAM template:

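A hedged sketch of that pattern, weaving the parameters into a table name with !Sub:

  MyUserTable:
    Type: AWS::Serverless::SimpleTable
    Properties:
      TableName: !Sub "${Environment}_${Branch}_${ProjectName}_MyUserTable"
      PrimaryKey:
        Name: user_id
        Type: String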

Then I updated my ~/.bash_profile to include this script:

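The script itself is not shown in this copy of the post; it presumably wraps the build and deploy commands, roughly like this (the bucket and stack names are placeholders):

deploy_sam() {
  ENV=$1
  sam build && sam deploy \
    --template-file .aws-sam/build/template.yaml \
    --stack-name "${ENV}-sales-spam-blocker" \
    --s3-bucket YOUR_S3_BUCKET \
    --capabilities CAPABILITY_NAMED_IAM \
    --region us-east-1 \
    --parameter-overrides ParameterKey=Environment,ParameterValue="${ENV}"
}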

Now I can deploy the entire serverless stack like this:

$ deploy_sam "staging"

Each deployment gives me differently-named yet identical resources – such as staging_master_sales-spam-blocker_MyUserTable – with automatic, proper routing between tables and Lambda functions, thanks to the Resource: > Environment: > Variables: nesting and the !Sub command.

Implementing Aggregate Tables in DynamoDB with SAM Events

For context, my application has a concept of ‘credits’ that are used to unblock emails from a salesperson to a User. If a sender has more than zero credits, then the sender’s email is unblocked automatically and their credits are decremented. But if a sender has zero credits, then they are sent a new Captcha puzzle to earn more credits. Thus, it is critical to know how many credits a given sender has at a point in time.

In a typical relational database, I would have inserted each Captcha transaction with datetime and sender_id data. Then I would have queried the table to obtain the SUM(credits) for a sender_id at a given datetime. I sought a similar function in DynamoDB, but it turns out that aggregates––like SUM() and MAX()— are not supported.

And so began the search for answers. Fortunately, I found this excellent post by Peter Hodgkinson on aggregate tables. Here’s the key insight:

Use DynamoDB Streams to send database events to a downstream Lambda function that performs the aggregation on the fly.

The following steps demonstrate how to accomplish this with AWS SAM and Python 3.6. For additional context, this diagram helps visualize the interactions described below:

While my app has many other resources, this section focuses on the essentials to show calculating the aggregate tables.

First, configure a SAM template with two DynamoDB tables and a Lambda function, like this:

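The template itself is a gist in the original post; a trimmed sketch of the relevant pieces (property values are placeholders) is below. The stream-read statements correspond to the “lines 41–46” mentioned next.

Resources:
  UserTable:
    Type: AWS::DynamoDB::Table
    Properties:
      TableName: UserTable
      AttributeDefinitions:
        - AttributeName: msg_id
          AttributeType: S
      KeySchema:
        - AttributeName: msg_id
          KeyType: HASH
      BillingMode: PAY_PER_REQUEST
      StreamSpecification:
        StreamViewType: NEW_IMAGE

  AggDailyCreditTable:
    Type: AWS::Serverless::SimpleTable
    Properties:
      PrimaryKey:
        Name: sender_id
        Type: String

  AggDailyCreditsPerSender:
    Type: AWS::Serverless::Function
    Properties:
      CodeUri: functions/agg_credits/
      Handler: app.lambda_handler
      Runtime: python3.6
      Policies:
        - Statement:
            - Effect: Allow
              Action:
                - dynamodb:GetRecords
                - dynamodb:GetShardIterator
                - dynamodb:DescribeStream
                - dynamodb:ListStreams
              Resource: !GetAtt UserTable.StreamArn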

Pay close attention to lines 41–46, which authorize the Lambda function to access the streams on the target table.

Unlike the other Lambda functions, which use an API Events: type, such as:

Events:
  YourEvent:
    Type: Api
    Properties:
      Path: /your_path
      Method: post
      RequestParameters:
        - method.request.querystring.userId
        - method.request.querystring.credentials
      RestApiId: !Ref MyApi

you must configure the Lambda Events: to use the DynamoDB stream, like this:

Events:
  DDBEvent:
    Type: DynamoDB
    Properties:
      Stream: !GetAtt UserTable.StreamArn
      StartingPosition: TRIM_HORIZON
      BatchSize: 1
      Enabled: true

Now, whenever data is INSERTed into DynamoDB, your Lambda function will receive events like this:

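The sample event is not reproduced in this copy; a minimal INSERT event of the standard DynamoDB-stream shape (the attribute names are assumptions) looks like this:

{
  "Records": [
    {
      "eventID": "1",
      "eventName": "INSERT",
      "eventSource": "aws:dynamodb",
      "awsRegion": "us-east-1",
      "dynamodb": {
        "Keys": {"msg_id": {"S": "17a2b3c4d5e6f7"}},
        "NewImage": {
          "msg_id": {"S": "17a2b3c4d5e6f7"},
          "sender_id": {"S": "seller@example.com"},
          "credits": {"N": "1"}
        },
        "StreamViewType": "NEW_IMAGE",
        "SequenceNumber": "111",
        "SizeBytes": 59
      },
      "eventSourceARN": "arn:aws:dynamodb:us-east-1:XXXXXX:table/UserTable/stream/2020-01-01T00:00:00.000"
    }
  ]
}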

Pro Tip: save the event above in a JSON file, such as dynamodb-event-update-user-msg-id.json, to test your setup locally via the command line:

$ sam local invoke AggDailyCreditsPerSender --event dynamodb-event-update-user-msg-id.json --parameter-overrides ParameterKey=Environment,ParameterValue=localhost ParameterKey=paramAggDailyCreditTable,ParameterValue=AggDailyCreditTable --docker-network sam-demo

Next, you must configure the Lambda function handler to perform the aggregate calculation. In the snippet below, I query the aggregate table to get the current balance, calculate the new running balance, and update the aggregate table:

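That snippet is another gist; a hedged sketch of the aggregation (the attribute names and the exact running-balance logic are assumptions) looks like this:

import boto3

agg = boto3.resource("dynamodb").Table("AggDailyCreditTable")  # hypothetical table name

def lambda_handler(event, context):
    for record in event["Records"]:
        if record["eventName"] != "INSERT":
            continue
        image = record["dynamodb"]["NewImage"]
        sender_id = image["sender_id"]["S"]
        delta = int(image["credits"]["N"])

        # Read the current balance, compute the new running balance, write it back.
        current = agg.get_item(Key={"sender_id": sender_id}).get("Item", {})
        balance = int(current.get("credit_balance", 0)) + delta
        agg.put_item(Item={"sender_id": sender_id, "credit_balance": balance})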

Presto! Every time the UserTable records a new msg_id, the Lambda function calculates a running balance for a given sender.

Now to access that data, we need another Lambda function to get_item from the aggregate table:

For completeness sake, and to highlight a problem you may run into, here’s the accompanying Lambda function to fetch the data from the aggregate table:

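Again, the original is a gist; a hedged sketch that also shows the Decimal coercion discussed next (the handler, table, and attribute names are assumptions):

import os

import boto3
from boto3.dynamodb.types import TypeDeserializer

client = boto3.client("dynamodb")
deserializer = TypeDeserializer()

def lambda_handler(event, context):
    params = event.get("queryStringParameters") or {}
    resp = client.get_item(
        TableName=os.environ["DDBTableName"],
        Key={"sender_id": {"S": params.get("senderId", "")}},
    )
    low_level_data = resp.get("Item", {})
    data = {k: deserializer.deserialize(v) for k, v in low_level_data.items()}
    # DynamoDB numbers come back as Decimal('X'); coerce to int before serializing.
    credits = int(data.get("credit_balance", 0))
    return {"statusCode": 200, "body": str(credits)}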

DynamoDB uses a generic 'N' designation for all numeric data types, rather than specifying “Integer” or “Float” types. So despite using integers in the aggregate calculations, the {k: deserializer.deserialize(v) for k,v in low_level_data.items()} outputs Decimal('X') objects. Much ado has been made of this issue on GitHub, but coercing the Decimal object into an integer works just fine (noted in line 44 above).

With this, the application can retrieve the current, cumulative credit balance for any sender in milliseconds. What’s more, the query performance should not degrade despite a quickly growing number of senders or transaction records, unlike a SQL aggregate query.

Alternative Solutions for Serverless Aggregate Functions

A tempting, alternative approach was to use AWS Aurora Serverless to maintain SQL logic while leveraging serverless technology. From AWS:

Amazon Aurora Serverless is an on-demand, auto-scaling configuration for Amazon Aurora (MySQL-compatible and PostgreSQL-compatible editions), where the database will automatically start up, shut down, and scale capacity up or down based on your application’s needs.

Ultimately, I preferred DynamoDB’s lightning fast queries where the app mostly needs to perform a simple lookup on potentially millions of msg_id and sender_id records. That said, I am curious to benchmark the two technologies in future posts!

Future Direction & Parting Thoughts

I hope this post helps you diagnose, debug, or avoid issues in your serverless ML projects. Despite learning much about this emerging technology, it’s clear to me that I’m just scratching the surface. Here are some areas I’d like to explore in the future:

  • monitoring my ML model for uptime, skew, and drift;

  • configuring an A/B testing framework where I can route live data to a current model (A) and second model (B) to measure comparative F1 scores; and

  • setting up a pipeline for auditing and reproducibility.

If you’re interested, reach out to me so we can collaborate and learn together!

Lastly, I’ll note that while I possess a degree in Electrical Engineering, my career has predominantly focused on Product leadership. In that role, it’s often too easy to ask for this or that feature coupled with tight deadlines. Designing, coding, and testing this application forced me to “walk a mile in a developer’s shoes”, and my feet hurt (in a good way)! So I’ll finish with a reminder – to my future self and other Product leaders – we owe respect and admiration to our engineering teammates.

Whenever you get a chance, applaud developer efforts, especially when they anticipate trouble down the road or explain why a seemingly “simple” request ain’t so.

There are too many people to thank for making this post possible. Here are a few people that significantly guided my journey: Gun.io (on fantastic documentation for Zappa), Sanket Doshi (on reading emails in Python), François Marceau (on deploying a ML model on AWS Lambda), Lorenz Vanthillo (on deploying a local serverless app with AWS SAM), and Peter Hodgkinson (on real-time aggregation with DynamoDB Streams). Thank you!

“If I have seen further, it is by standing on the shoulders of Giants.” - Isaac Newton in 1675

Source: https://towardsdatascience.com/lessons-in-serverless-stacks-using-nlp-to-block-unsolicited-sales-emails-8fda116273e8
