From Scripts to Prediction API
This is the continuation of my previous article:
From Jupyter Notebook To Scripts
Last time we discussed how to convert a Jupyter Notebook into scripts, along with basic engineering practices such as CI, unit testing, package environments, configuration, and logging.
Even in script form, we still have to change the configuration and run the script manually. That is fine for Kaggle competitions because all you need is the submission.csv, but you probably don’t want to sit behind the computer 24/7 and hit Run whenever users send you a prediction request 🙁
In this article, we will discuss how to utilize the models we built last time and create a prediction API to do model serving using FastAPI!
For ML/DL folks, we are talking about FastAPI, NOT fast.ai!!!
Background: FastAPI
There are many API frameworks in the Python ecosystem; my original thought was to use Flask. But I am impressed by how simple and intuitive [and fast, as the name suggests] FastAPI is, and I’d love to try it out in this mini-project!
“Rome wasn’t built in a day”: FastAPI has learned a lot from previous frameworks such as Django, Flask, and APIStar. I cannot explain it better than the creator himself, and this article is great!
The boring but necessary setup
Everything lives in one repo, which is probably not good practice; in a real use case these should be separate GitHub repos. Maybe I will refactor [the professional way to say “clean up my previous sxxt”] later!
*CS folks always say single responsibility principle. Instead of saying “don’t put code with different functionalities together”, next time maybe you can say “we should follow the single responsibility principle here!”
First of all, let’s update requirements.txt with the new packages. As we mentioned last time, we should pin exact versions so that others can reproduce the work!
# for last article
pytest==6.0.1
pandas==1.0.1
Click==7.0
scikit-learn==0.22.1
black==19.10b0
isort==4.3.21
PyYAML==5.2

# for FastAPI
fastapi==0.61.0
uvicorn==0.11.8
chardet==3.0.4
After this, we need to install from requirements.txt again in the conda env [because we have new packages]
# You can skip the line below if you have already created the conda env
conda create --name YOU_CHANGE_THIS python=3.7 -y
conda activate YOU_CHANGE_THIS
pip install -r requirements.txt
The game plan
Let’s think about what is happening: we want an API endpoint that does prediction. To be specific, if users give us the input, we need to use the model to make a prediction and return it.
Instead of having us [humans] handle the incoming requests, we just create an API server that waits for requests, parses the inputs, does the prediction, and returns the results. An API is just a structured way to talk to our computer and ask for a service [prediction in this case].
Below is the pseudocode:
# Load trained model
trained_model = load_model(model_path)

# Let's create an API that can receive user requests
api = CreateAPI()

# If user sends a request to the `predict` endpoint
when user sends request to `api`.`predict`:
    input = api[`predict`].get(input)   # get input
    prediction = trained_model(input)   # apply model
    return prediction                   # return prediction
This is good for the happy flow! But we should NEVER trust the user. Just ask yourself: do you ever read the user manual in daily life?
For example, we expect {‘a’: 1, ‘b’: 2, ‘c’: 3} from the user but we may get:
- Wrong order: {‘b’: 2, ‘a’: 1, ‘c’: 3}, or
- Wrong key: {‘a’: 1, ‘b’: 2, ‘d’: 3}, or
- Missing key: {‘a’: 1, ‘b’: 2}, or
- Negative value: {‘a’: -1, ‘b’: 2, ‘c’: 3}, or
- Wrong type: {‘a’: “HELLO WORLD”, ‘b’: 2, ‘c’: 3}, or
- etc etc
This is fatal to our API because our model doesn’t know how to respond to this. We need to introduce some input structure to protect us! Therefore, we should update our pseudocode!
# Define input schema
input_schema = {......}

# Load trained model
trained_model = load_model(model_path)

# Let's create an API that can receive user requests
api = CreateAPI()

# If user sends a request to the `predict` endpoint
when user sends request to `api`.`predict`:
    input = api[`predict`].get(input)                # get input
    transformed_input = apply(input_schema, input)   # validate input
    if not transformed_input.valid():
        return Error
    prediction = trained_model(transformed_input)    # apply model
    return prediction                                # return prediction
The Code
Looks good to me now! Let’s translate it into FastAPI part by part!
Input schema
It looks like many lines, but the idea is the same: as you can guess, we define a class called `Sample` which declares every predictor as a float greater than [gt] zero!
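The original gist is not reproduced here, so below is a minimal sketch of what such a schema could look like with Pydantic; the field names are taken from the payload used later, and gt=0 is the greater-than-zero constraint mentioned above (treat the exact class body as an assumption, not the repo’s exact code).

from pydantic import BaseModel, Field

class Sample(BaseModel):
    """Input schema: every predictor is a float greater than zero."""
    fixed_acidity: float = Field(..., gt=0)
    volatile_acidity: float = Field(..., gt=0)
    citric_acid: float = Field(..., gt=0)
    residual_sugar: float = Field(..., gt=0)
    chlorides: float = Field(..., gt=0)
    free_sulfur_dioxide: float = Field(..., gt=0)
    total_sulfur_dioxide: float = Field(..., gt=0)
    density: float = Field(..., gt=0)
    pH: float = Field(..., gt=0)
    sulphates: float = Field(..., gt=0)
    alcohol: float = Field(..., gt=0)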
Load model
Then we load the trained model. Hmmm, what is `Predictor`? It is just a custom class that wraps the model with different methods, so we can call a method instead of implementing the logic in the API server.
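The real `Predictor` lives in the repo; a rough sketch of the idea, with hypothetical method names, might look like this:

import pickle

class Predictor:
    """Thin wrapper around a trained scikit-learn model (illustrative only)."""

    def __init__(self, model_path: str):
        with open(model_path, "rb") as f:
            self.model = pickle.load(f)

    def model_name(self) -> str:
        return type(self.model).__name__

    def predict(self, features):
        # features: 2D array-like of shape (n_samples, n_features)
        return self.model.predict(features)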
Create an API server
Then we create the API using FastAPI……the pseudocode is almost the real code already.
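As a hedged sketch (building on the `Predictor` sketch above, with an assumed model path), the wiring is roughly:

from fastapi import FastAPI

app = FastAPI()

# Load the trained model once, when the API server starts
predictor = Predictor(model_path="model/wine_model.pkl")  # path is illustrative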
predict endpoint
This looks complicated, but it is very straightforward.
Instead of saying “when the user sends a request to `api`.`predict`”
We say: “Hey, app, if people send a GET request to `predict`, please run the function predict_item; we expect the input to follow the schema we defined in `Sample`”
What predict_item does is only transform the input shape, feed it to the trained model, and return the prediction: a simple Python function.
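Putting it together, a minimal sketch of the endpoint (it builds on the `Sample`, `app`, and `predictor` sketches above; as the outputs later show, the real version also returns a timestamp and the model name):

import time

@app.get("/predict")
def predict_item(sample: Sample):
    # By the time we get here, FastAPI/Pydantic has already validated `sample`
    features = [[
        sample.fixed_acidity, sample.volatile_acidity, sample.citric_acid,
        sample.residual_sugar, sample.chlorides, sample.free_sulfur_dioxide,
        sample.total_sulfur_dioxide, sample.density, sample.pH,
        sample.sulphates, sample.alcohol,
    ]]
    prediction = predictor.predict(features)[0]
    return {
        "prediction": int(prediction),
        "utc_ts": int(time.time()),
        "model": predictor.model_name(),
    }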
If you want to know more about HTTP request methods
But you may ask: hey! One line is missing!!! Where is the input validation? What if users provide the wrong data type/key or miss a field?
Well……remember we defined the `Sample` class for the input schema? FastAPI automatically validates the input against that schema for us, and we don’t need to care about it!!! This saves a lot of brainpower and many lines of code to build a robust and well-tested API!
Try it out
# At project root, we can run this
# --reload is for development, the API server auto-refreshes
# when you change the code
uvicorn prediction_api.main:app --reload
You should be able to see these; the API server is now running on “http://127.0.0.1:8000”!
There are different ways to experiment with the API. Depending on your environment, you may use requests in Python or cURL in the command line. BTW, there is a handy tool called Postman; try it out, it is a very intuitive and user-friendly tool for APIs!
We will use Python requests for the following examples; you can see them in this Notebook [sometimes Jupyter is helpful 😎]
The example below uses a valid input: YEAH! 😍 We made it! The endpoint returns the prediction!!!
import json
import requests

payload = {
    "fixed_acidity": 10.5,
    "volatile_acidity": 0.51,
    "citric_acid": 0.64,
    "residual_sugar": 2.4,
    "chlorides": 0.107,
    "free_sulfur_dioxide": 6.0,
    "total_sulfur_dioxide": 15.0,
    "density": 0.9973,
    "pH": 3.09,
    "sulphates": 0.66,
    "alcohol": 11.8,
}

result = requests.get("http://127.0.0.1:8000/predict", data=json.dumps(payload))
print(result.json())

Output:
{'prediction': 1, 'utc_ts': 1597537570, 'model': 'RandomForestClassifier'}
The example below misses a field, and FastAPI handles it for us according to the defined schema; I literally wrote nothing other than the schema class.
payload = {
    "volatile_acidity": 0.51,
    "citric_acid": 0.64,
    "residual_sugar": 2.4,
    "chlorides": 0.107,
    "free_sulfur_dioxide": 6.0,
    "total_sulfur_dioxide": 15.0,
    "density": 0.9973,
    "pH": 3.09,
    "sulphates": 0.66,
    "alcohol": 11.8,
}

result = requests.get("http://127.0.0.1:8000/predict", data=json.dumps(payload))
print(result.json())

Output:
{'detail': [{'loc': ['body', 'fixed_acidity'], 'msg': 'field required', 'type': 'value_error.missing'}]}
Just for fun, I also implemented an update_model PUT API to swap the models. For example, originally we were using Random Forest; I updated it to Gradient Boosting ☺️
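A rough sketch of such an endpoint (the second model path and the swap logic are assumptions, not the repo’s exact code):

import time

@app.put("/update_model")
def update_model():
    global predictor
    old_name = predictor.model_name()
    # Load another trained model; this path is purely illustrative
    predictor = Predictor(model_path="model/wine_model_gb.pkl")
    return {
        "old_model": old_name,
        "new_model": predictor.model_name(),
        "utc_ts": int(time.time()),
    }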
result = requests.put("http://127.0.0.1:8000/update_model")
print(result.json())

Output:
{'old_model': 'RandomForestClassifier', 'new_model': 'GradientBoostingClassifier', 'utc_ts': 1597537156}
Auto-generated Documentation
One of the cool FastAPI features is auto-documentation: just go to http://127.0.0.1:8000/docs#/ and you will have interactive and powerful API documentation out of the box! So intuitive that I don’t need to elaborate.
Revisit pytest
I cannot emphasize enough the importance of unit testing: it verifies that the functions are doing what we expect them to do, so you will not break things accidentally!
But if I try to cover every test, it will be too boring and lengthy. What I plan to do here is share some areas I will brainlessly test and some [probably useful] articles. Then I will talk about a pytest feature called parameterized unit tests and some testing options in pytest. The easiest way to motivate yourself to learn unit testing is to try to refactor your previous code; the larger the better!
Unit testing
Whenever you find it difficult to write/understand the unit tests, you probably need to review your code structure first. Below are 4 areas that I will brainlessly consider:
- Input data: dimension [eg: df.shape], type [eg: str], value range [eg: -/0/+]
- Output data: dimension [eg: df.shape], type [eg: str], value range [eg: -/0/+]
- Compare: output and expected result
- After I debug, prevent it from happening again
For example, I focus quite a lot on output dimension, type, and value range, as in the example below. It seems simple, but if you modify any output format, it will remind you what the expected formats are!
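The repo’s tests are not reproduced here, but a small illustrative example of this style of check (the function and data are made up) could be:

import pandas as pd

def make_predictions(df: pd.DataFrame) -> pd.Series:
    # stand-in for the real prediction function under test
    return (df["alcohol"] > 10).astype("int64")

def test_make_predictions_output():
    df = pd.DataFrame({"alcohol": [11.8, 9.4], "pH": [3.09, 3.2]})
    result = make_predictions(df)

    # dimension: one prediction per input row
    assert result.shape == (2,)
    # type: predictions are integer class labels
    assert result.dtype == "int64"
    # value range: binary classification, labels are 0 or 1
    assert set(result.unique()) <= {0, 1}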
Some articles FYR:
Unit Testing for Data Scientists
How to unit test machine learning code [deep learning]
Parameterized unit test
Suppose you have 100 pieces of mock data [denoted D_i, i: 1..100] and you want to run the same unit test on each of them. How would you do it?
A brute force solution
def test_d1():
    assert some_operation(D_1)

def test_d2():
    assert some_operation(D_2)

def test_d3():
    assert some_operation(D_3)

......

def test_d100():
    assert some_operation(D_100)
But if you need to modify `some_operation`, you need to modify it 100 times LOL……… Although you could wrap it in a utility function, that makes the tests hard to read and very lengthy.
A better way, maybe a for-loop?
def test_d():
    for D in [D_1, D_2, D_3, ..., D_100]:
        assert some_operation(D)
But you can’t know exactly which test fails, because these 100 cases are all bundled into one test.
pytest offers us a feature called parametrize
@pytest.mark.parametrize("test_object", [D_1, D_2, ..., D_100])
def test_d(test_object):
    assert some_operation(test_object)
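For a concrete, runnable flavour of this (the sample dicts below are made up, not the repo’s mock data):

import pytest

MOCK_SAMPLES = [
    {"alcohol": 11.8, "pH": 3.09},
    {"alcohol": 9.4, "pH": 3.20},
]

@pytest.mark.parametrize("sample", MOCK_SAMPLES)
def test_sample_values_are_positive(sample):
    # each dict becomes its own test case in the pytest report
    assert all(value > 0 for value in sample.values())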
Common pytest options
pytest FOLDER
Last time we mentioned we can just run `pytest` in the command line and pytest will find ALL the tests under the folder itself. But sometimes we may not want to run all the unit tests during development [maybe some tests take a long time but are unrelated to your current task].
In that case, you can simply run pytest FOLDER, eg: `pytest ./scripts` or `pytest ./prediction_api` in the demo.
parallel pytest
Sometimes your test cases are too heavy, and it may be a good idea to run things in parallel! You can install pytest-xdist and replace pytest with py.test in your command, eg: py.test -n 4
pytest -v
This is personal taste: I prefer the verbose output and seeing green PASSED ✅ to start my day.
You can read more from the materials below:
https://docs.pytest.org/en/stable/
At last, I hope you enjoy this 1-min YouTube video as much as I do 😆
Conclusions
Yooo ✋ we have created a prediction API that consumes our model. Users can now send a request and get a prediction without a human sitting behind the computer. This over-simplifies reality [throughput, latency, model management, authentication, A/B testing, etc], but this is the idea!
At least if your prototype is at this level, engineers are much happier to take over from this point, which speeds up the whole process, and you can show them you know something 😈
To wrap up, we:
a. Updated the conda env [requirements.txt]
b. Brainstormed pseudocode and converted it to code [FastAPI, uvicorn]
c. Used the API [cURL, requests, Postman]
d. Talked about the auto-generated documentation from FastAPI
e. Covered some pytest techniques [parallel, parameterized, -v]
The file tree below shows the development steps
.
├── notebook
│ ├── prediction-of-quality-of-wine.ipynb
│ └── prediction_API_test.ipynb [c] <-consume API
├── prediction_api
│ ├── __init__.py
│ ├── api_utility.py [b] <-wrap up methods
│ ├── main.py [b] <-modify demo
│ ├── mock_data.py [e] <-Unit test
│ ├── test_api_utility.py [e] <-Unit test
│ └── test_main.py [e] <-Unit test
├── requirements.txt [a] <-FastAPI doc
.
.
.
BUT (again, bad news usually starts with BUT) they are still on my local computer.
Although we don’t need to sit behind the computer and hit Run, user requests cannot reach the API endpoints. Even if they could, it would mean I cannot close my MacBook, and I cannot scale if there are many incoming prediction requests 😱!!!
The way to escape from this hell, as we mentioned in the last article, is to either buy another computer OR rent a server from cloud providers such as AWS.
But first, we also need to ensure the code works fine there! How?
Short answer: Docker
Aside:
Although I haven’t tried it, there is a startup called Cortex which focuses on an open-source machine learning API framework, and they also use FastAPI under the hood!
By now, you should be able to understand their tutorial. In short, they solve many production-level problems behind the scenes, such as rolling updates, DL model inference, integration with AWS, autoscaling, etc…… [Are these DevOps concerns? Or maybe the fancier term: MLOps]
But from the user [aka you] perspective, they deploy the APIs using declarative yml [similar to how we configured the model in the last article], have a predictor class [similar to our Predictor class], and a trainer.py [similar to train.py in the last article].
Writing the code is relatively easy, but writing an article about the code is hard. If you found this article useful, you can leave some comments
OR you can star my repo!
OR my LinkedIn [Welcome, but please leave a few words to indicate you are not a zombie]!
BTW, we know COVID-19 has bad, if not catastrophic, impacts on everyone’s career, especially for graduates. Research shows the unlucky effect can set a graduate back for years even though they did nothing wrong. Well……what else can you say 🤒? If you [or people you know] are hiring, feel free to reach out and we can forward the opportunities to people in need 🙏
Originally published at https://towardsdatascience.com/from-scripts-to-prediction-api-2372c95fb7c7