Practical Machine Learning and Rails

I’ve been going through the Machine Learning Engineer track on DataCamp, and I’ve enjoyed it a lot. It is convenient and a bit repetitive, but in a good way: it has been constructive to do things a few times (maybe because I’m getting old). On the downside, all the courses and projects are done within their hosted editor and programming environment, so I felt I was still missing some practical knowledge.

On the other hand, I’ve been building PocketPatch in my spare time for the past year or so, and I’ve built up a nice dataset (of my own expenses) that I thought I might experiment on a bit. The app uses Plaid to collect transaction data and then lets users categorize them as they see fit. Part of my proposal is that you should use your own categories to classify your expenses instead of standardized ones. So some users might choose to lump all food expenses into a “Food” category, and others might split them between “Eating out” and “Groceries.”

The problem here, as you might have noticed if you pay attention to your bank app’s expense-tracking features, is that the categories provided by points of sale are rarely accurate, let alone specific. PocketPatch manages this by putting all new transactions in an inbox where you can review and fix their categories. Going through this step is a bit of a speed bump, but it ensures the data you’re tracking is high quality. It also has the added benefit of building awareness of your spending patterns. If you make an effort to review your transactions a couple of times a week, you’ll be better able to react to harmful habits and keep your targets in mind.

Ideally, only a few of the transactions would need editing in the inbox, but the reality is that most of them do. The silver lining is that this is a very approachable Machine Learning problem for a beginner like me. I’ve been tracking my expenses for about three months using the tool, and I have built up around 300 entries with a very well-defined mapping from transaction description to personalized category, so I decided to give it a try as the learning project I was looking for, and then maybe incorporate it into PocketPatch. Let’s dive into it.

1. Following the guides

Since this is such a typical problem, there was an excellent example in scikit-learn’s guides, so I went through that first to get a good feel for what I needed to do. scikit-learn, aka sklearn, is one of the better-known Machine Learning libraries for Python, and what I’ve been using on the DataCamp courses. It was easy to follow the example using Google’s Colab Notebooks, which are also a very nice tool (I wish I could find a non-Google option though, so please get in touch if you know of any alternatives).

Going through the example gave me a good feeling that I could achieve something useful without too much complexity.

2. Extracting the data

PocketPatch is a Ruby on Rails app, and it already has a CSV export feature, meant for users to be able to download their data if they want. Still, it was useful (with a couple of tweaks I put behind a feature flag) to export the dataset as needed for this exercise: basically the transaction description and the user’s personalized category, which will be the target for the prediction.
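
For reference, the export I’m feeding the model is tiny: one row per transaction, with the raw description and the personalized category. Here is a minimal sketch of loading it, where the file and column names are assumptions for illustration rather than the real export schema:

```python
# Minimal sketch of reading the exported dataset.
# The file name and column names ("description", "category") are assumptions.
import pandas as pd

df = pd.read_csv("transactions_export.csv")

# One row per transaction: the raw description plus the user's personalized category
print(df.shape)
print(df["category"].value_counts().head())
```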

Right now, using only the description is giving me good enough results, but down the line, I can imagine using the bank-assigned category and the amount range as features too.
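
To make that concrete, here is a rough sketch of how extra features could be mixed in with the description using scikit-learn’s ColumnTransformer. The column names and the choice of encoders are my own assumptions, not something I’ve built yet:

```python
# Hypothetical sketch: combining the description with extra features.
# Column names ("description", "bank_category", "amount") are assumptions.
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import KBinsDiscretizer, OneHotEncoder

features = ColumnTransformer([
    # The description gets the usual bag-of-words + tf-idf treatment
    ("description", TfidfVectorizer(), "description"),
    # The bank-assigned category is a plain label, so one-hot encode it
    ("bank_category", OneHotEncoder(handle_unknown="ignore"), ["bank_category"]),
    # Bucket the amount into ranges instead of using the raw value
    ("amount_range", KBinsDiscretizer(n_bins=5, encode="onehot-dense"), ["amount"]),
])

model = Pipeline([("features", features), ("clf", MultinomialNB())])
# model.fit(df[["description", "bank_category", "amount"]], df["category"])
```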

3. Fitting the model

I tweaked the code from the guides I mentioned before to run on my data, and it got me decent accuracy without much tuning. It was fascinating to see the test data categorized, and most of the selected categories make sense.

I put all the code in this notebook with more detailed explanations. It is a very standard pipeline per the courses I’ve taken:

  1. Read the dataset from the exported CSV file into a pandas DataFrame.
  2. Define the feature set (X), which is the part of the data that helps make the prediction; you might as well call it the input. In this case, only the transaction description (at least for now).
  3. Define the target labels (y), which are the expected results for those known inputs.
  4. Create a train-test split (part of the dataset is left out of the training to see how the model performs on data it hasn’t seen).
  5. Transform the input into count vectors: for the model to understand the input, it needs to be in the form of numerical values, so first we transform the descriptions into word counts.
  6. Transform those word counts into frequencies, which makes them more meaningful for the classifier.
  7. Fit the model.
  8. Evaluate the accuracy.
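
Putting those steps together, this is roughly what the notebook does: the standard scikit-learn text-classification recipe. It is a minimal sketch rather than a copy of my notebook, and the column names and the choice of MultinomialNB as the classifier are assumptions:

```python
# Minimal sketch of the pipeline described above.
# Column names ("description", "category") and the classifier choice are assumptions.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# 1-3. Read the CSV and split it into features (X) and target labels (y)
df = pd.read_csv("transactions_export.csv")
X, y = df["description"], df["category"]

# 4. Hold out part of the data to evaluate on transactions the model hasn't seen
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 5-7. Word counts -> tf-idf frequencies -> classifier, then fit
model = Pipeline([
    ("counts", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", MultinomialNB()),
])
model.fit(X_train, y_train)

# 8. Evaluate the accuracy on the held-out test set
print(accuracy_score(y_test, model.predict(X_test)))
```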

I’m basically copying the tutorial, and I’m already getting around 60% accuracy. What’s more, I’m getting probabilities, which are helpful to suggest alternatives if the category is off. I’m sure I can come back to this and make the model even better, but first, I wanted to make sure I could deploy it and use it in a way that is simple enough to be worthwhile.
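
As a rough illustration of the “suggest alternatives” idea, reusing the `model` fitted in the sketch above, the class probabilities can be turned into a ranked list of candidate categories:

```python
# Sketch: top-3 category suggestions per description, reusing `model` from above.
import numpy as np

descriptions = ["some raw transaction description"]
probabilities = model.predict_proba(descriptions)

for description, row in zip(descriptions, probabilities):
    top3 = np.argsort(row)[::-1][:3]
    suggestions = [(model.classes_[i], round(float(row[i]), 2)) for i in top3]
    print(description, suggestions)
```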

4. Designing the API

What is a bit non-standard about my workflow is that I want to fit a model for each user in the app (I hope this doesn’t bite me), so the API has three endpoints:

  1. Store a CSV file for the model to use.
  2. Take the name of a CSV file, a user ID, and a webhook URL; fit the model with the contents of the CSV file; store it with a unique ID; and notify the webhook with the user ID and the model ID.
  3. Take a model ID and an array of descriptions and predict their categories using the model.
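
To make the contract a bit more concrete, here is roughly how I picture the payloads for the fitting and prediction endpoints. The field names, paths, and example values are my own illustration, not the exact schema:

```python
# Hypothetical payload shapes for the "fit" and "predict" endpoints.
# Field names, paths, and IDs are illustrative assumptions, not the real schema.

fit_request = {
    "csv_file": "data://.my/training/user_42.csv",  # where endpoint 1 stored the export
    "user_id": 42,
    "webhook_url": "https://example.com/ml/webhooks/model_ready",
}
# Once the model is fitted and stored, the webhook receives something like:
webhook_payload = {"user_id": 42, "model_id": "model-abc123"}

predict_request = {
    "model_id": "model-abc123",
    "descriptions": ["RAW TRANSACTION DESCRIPTION 1", "RAW TRANSACTION DESCRIPTION 2"],
}
# And the response maps each description to a predicted category:
predict_response = {"categories": ["Groceries", "Eating out"]}
```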

The first endpoint is provided for free by the service I used, and you can see the code for the fitting and prediction steps here.

5. Deploying it

PocketPatch is a Ruby on Rails app, and the Machine Learning code is in Python. I know I could probably find a way to run Python from Ruby but just saying that feels wrong. Plus, since this is very experimental, I don’t want to mangle it with the rest of the app. I’m a sucker for code quality, and I’ve made a huge effort to keep the codebase smooth.

I’ve also been using a lot of FaaS (functions as a service) lately and have particularly enjoyed Vercel’s version. Still, there is a limitation I couldn’t get around: fitting the model is quite intensive, so I don’t want to wait for an HTTP response, and there is no async invocation in Vercel. Since there is one model per user, I need the fitting step to be part of the API.

So, Algorithmia to the rescue: it offers a straightforward way to host Machine Learning workflows, where you write a Python module that exposes an apply function that runs your algorithm, and they'll run and scale it for you. You can call these asynchronously by setting output: "void", so it works for my requirement.
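
For context, an Algorithmia algorithm is essentially a Python module exposing that apply entry point. This is a stripped-down sketch of what the prediction side could look like; the data paths and the pickle-based model loading are assumptions for illustration, not the actual code:

```python
# Stripped-down sketch of an Algorithmia-style prediction algorithm.
# Data paths and the pickle-based model storage are assumptions for illustration.
import pickle

import Algorithmia

client = Algorithmia.client()

def load_model(model_id):
    # Fetch the fitted pipeline that the fitting algorithm stored earlier
    local_path = client.file("data://.my/models/{}.pkl".format(model_id)).getFile().name
    with open(local_path, "rb") as f:
        return pickle.load(f)

def apply(input):
    # Algorithmia calls this entry point with the request payload
    model = load_model(input["model_id"])
    return {"categories": model.predict(input["descriptions"]).tolist()}
```

The fitting algorithm follows the same shape, and calling it with output: "void" just fires it and returns immediately, which is exactly what I need for the slow step.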

The only downsides were that I had to host the code in two separate repos: one for the fitting step, and one for the prediction. Running the algorithms locally required writing some glue code, and I couldn’t figure out how to add environment variables, which would’ve been handy. So the experience is a bit raw, but the simplicity almost entirely makes up for it.

On the bright side, they have a nice and simple API for storing files, which is quite useful for Machine Learning problems, and made it even easier to implement my desired API.
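
For a sense of what that looks like from Python, here is a quick sketch of storing and reading back a CSV through their Data API. The API key handling, collection, and file names are placeholders, so treat the exact calls as approximate:

```python
# Sketch: storing and retrieving a CSV with Algorithmia's Data API.
# The API key, collection, and file names are placeholder assumptions.
import Algorithmia

client = Algorithmia.client("YOUR_API_KEY")

# Upload the exported training data (this is what the "Save file" endpoint covers)
client.file("data://.my/training/user_42.csv").putFile("transactions_export.csv")

# Later, the fitting algorithm can read it back as text
csv_text = client.file("data://.my/training/user_42.csv").getString()
```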

I’m exposing the code on a GitHub Repo instead of Algorithmia (which offers an option to make the algorithm public) because I’m not sure I’m sandboxing the data correctly. But you can see that on a few lines of code, I’ve built and deployed my API.

7. Integrating with PocketPatch

I haven’t done this yet (I have a big backlog!), but I tried Algorithmia’s Ruby SDK, and it worked well. So, in terms of the API I defined earlier:

  1. I already have the CSV export logic in place on the Rails app. Now I need it to use Algorithmia’s data store (my “Save file” endpoint) instead of my Digital Ocean Spaces bucket to share the training data.
  2. I need to set up a job to save an export and call the “Fit model” endpoint periodically to keep the models up to date.
  3. I have to set up a webhook handling endpoint to store the model’s ID for each user.
  4. And then start pre-classifying the transactions as they come in from Plaid using the “Predict” endpoint.

If something interesting comes up while doing this, I’ll write a follow-up!

The workflow I walked you through is a straightforward example and is still an experimental feature, but it is undoubtedly going to make PocketPatch much better if it works. I can’t believe that in a week’s worth of side-project work, I could achieve this, using a language I’m not very comfortable with yet. It is exciting how the functions-as-a-service idea and the popularization of Machine Learning are empowering us developers to move beyond our comfort zone and create value in new ways!

Originally published at https://perezperret.com.

Translated from: https://medium.com/swlh/practical-machine-learning-and-rails-5d73979315b2
