NoSQL for Real-Time Feature Engineering and ML Models

For the majority of my data science career, I’ve built machine learning models using data fetched from a data warehouse or lake. With this approach, you can create a feature vector for each user by applying SQL commands that transform several events into a user summary. However, one of the main issues with this approach is that it is a batch operation: if events sent from an application take minutes or hours to show up in the data store, the resulting features may not include the most recent activity for a user.


If you need to use recent data points when applying predictive models, then it may be necessary to build a streaming data pipeline that applies feature engineering in near real-time to build user summaries. For example, an e-commerce web site may want to send a notification to users that add items to a cart, but do not check out, based on the output of an ML model. This use case requires a model to use data based on recent web session activity, which will likely not be available in a data warehouse due to latency introduced through batching and ETL operations. For this scenario, it may be necessary to incrementally build a user summary based on a streaming data set in order to provide up-to-date model predictions for determining whether or not to send a notification to the user.


I’m a data scientist working for a mobile game publisher, and we encounter similar scenarios, where we need to use recent session data to determine how to personalize the experience for users. I’ve recently been exploring NoSQL data stores as a way of building near real-time data products that perform feature engineering and model application with minimal latency. While this post will walk through a sample pipeline using Python, data scientists who want to get hands-on with building production-grade systems should explore Java or Go, as I’ve discussed here. In this post I’ll introduce NoSQL for data science, cover options for deploying Redis, and walk through a sample Flask application that performs real-time feature engineering using Redis.


NoSQL for Data Science

NoSQL data stores are an alternative to relational databases that focus on minimizing latency while limiting the types of operations that can be performed. While there is a broad set of services that fall within the concept of NoSQL, one of the most common implementations is a key-value data store. With a key-value store, you can save and retrieve elements from a data store with sub-millisecond latency, but you cannot query across the contents within the data store. You need a key which is used as an index for saving a value, such as a user ID, and the same key is used when retrieving the value. While this set of operations may seem limiting for data scientists used to working with relational databases, it provides latency and throughput that are orders of magnitude better than traditional databases.


This type of workflow, where you perform create-read-update-delete (CRUD) operations on data sets, may not be familiar to data scientists, because it is common to work with batch processes where SQL can be used to transform events into user summaries. With a key-value data store there are no SQL commands that can be used to aggregate data sets; instead, user summaries need to be built incrementally, where new data received by the system is used to update the user record, such as incrementing the session count each time a user opens the game on a phone.


Using a NoSQL data store enables a data product to update user summaries in near real-time as new data is received. This means that the feature vector for a user will lag by only seconds for each new data point, versus minutes or hours when using traditional data warehousing workflows. There is a trade-off when using this approach, which is that the types of features that you can use are limited versus the options available when using a SQL database. For example, you cannot count distinct values or calculate a median value when updating data incrementally, since you are working with single events rather than historical events for a user. However, most systems that use real-time data will also include features from a delayed process that provides a more complete user profile. For example, with streaming data it’s not possible to count the distinct number of modes played by a user (unless one-hot feature encoding is used), but there could be a batch process that calculates this value with additional lag and appends it to the feature vector.

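To make the incremental constraint concrete, here is a minimal sketch (the profile fields and events are invented for illustration): counters and running means can be folded in one event at a time, while a median would require keeping the full event history.

```python
def update_profile(profile, event):
    """Fold a single event into a user summary without any event history."""
    n = profile["session_count"]
    profile["session_count"] = n + 1
    # Running mean: new_mean = old_mean + (x - old_mean) / (n + 1).
    # No comparably simple update exists for an exact median or distinct count.
    profile["mean_session_length"] += (
        event["session_length"] - profile["mean_session_length"]
    ) / (n + 1)
    return profile

profile = {"session_count": 0, "mean_session_length": 0.0}
for length in [10.0, 20.0, 30.0]:
    profile = update_profile(profile, {"session_length": length})

print(profile)  # {'session_count': 3, 'mean_session_length': 20.0}
```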

Data scientists should explore NoSQL solutions when they need to build user profiles that are updated incrementally rather than through batch processes. A common use case for this type of workflow is when you need to provide personalized treatments for users based on recent session data. For example, users who have previously opted into notifications may be more likely to interact with future notifications within a new game. A mobile game publisher can leverage this session data to provide personalized experiences for users.


To provide a concrete example of what this looks like, we’ll use the Kaggle NHL data set to simulate a streaming data source, where hockey player profiles are updated incrementally based on real-time data, implemented via REST calls. The output of this sample application will be user profiles stored in Redis that can then be used to apply ML models in real time. We’ll focus on the feature engineering steps rather than the model application steps, but show how the user profiles can be used for model predictions.


We’re going to use Redis as our NoSQL solution for this exercise. While there are similar alternatives to Redis, such as Memcached, Redis is a standard that works across multiple programming languages and has managed implementations across popular cloud infrastructures. Redis is meant to be viewed as an ephemeral cache, which means that it might not be the best approach if your user profiles need long-term persistence. However, if you’re building ML models for targeting new users it’s a great choice.


With NoSQL for data science projects, you typically use the data store to update user profiles based on real-time data being passed to the data product. For streaming data, this endpoint can be set up to process data as REST commands, or be set up to process data streams from tools such as Kinesis and Kafka. We’ll walk through a sample deployment for setting up a REST endpoint that incrementally updates user profiles using Redis.


Redis Deployments

There are a variety of options for setting up a Redis instance that will be used to cache data for your machine learning project. Here are some of the options that are available.


  1. Mock Implementations: fakeredis for Python, jedis-mock for Java
  2. Local Deployments: Build and run locally, or use Docker
  3. Hosted Deployments: Run Redis on a cluster in the Cloud
  4. Managed Deployments: AWS ElastiCache, GCP Memorystore, Redis Cloud

When getting started with Redis, using a mock implementation is great for learning the Redis interface. A mock implementation can also be useful for setting up unit testing for your application. Once you want to get a read on the potential throughput of your application, it’s good to move to a local instance of Redis running on your machine directly or through Docker. This will enable you to test connecting to Redis and get a better read when profiling your application.


Once you want to put your system into production, you’ll want to move to a Redis cluster. This can be a hosted solution on a cloud platform, where you are responsible for provisioning and monitoring Redis on a cluster of machines, or a managed solution that handles all of the overhead of maintaining a cluster while providing the same Redis interface. Managed solutions are great for getting up and running with Redis in a production application, but there are some factors to consider when choosing between a hosted and a managed solution for Redis, or other NoSQL solutions:


  1. What are your latency requirements?
  2. What are your memory requirements?
  3. What are your throughput requirements?

With a hosted solution, you can make sure that the Redis cluster is co-located with your data product to ensure minimal latency between your service and the Redis instances. With GCP Memorystore, you can configure both of these clusters to live within the same availability zone, which results in sub-millisecond latency for Redis commands, but you do lose the ability to configure your instance when using a managed approach.


The likely factor that will determine a hosted versus managed approach is the anticipated cost of the cluster. With Memorystore you are charged per GB per hour, and there are different tiers of pricing based on capacity. There may also be charges for read or write commands, which is part of the pricing for DynamoDB on AWS. If the costs seem reasonable, then using a managed option may be preferred, because it can reduce the amount of DevOps work required from your team to maintain the cluster. If you have large memory requirements, such as more than 1TB per region, then a managed solution may not be able to scale to your use case.


For this post, we’ll stick to the mock implementation of Redis, to keep everything related to Python coding, and because Redis is a vast topic that readers should explore in more detail beyond this post.


A Real-time Application in Python

To simulate building feature vectors in real time, we’ll use the Kaggle NHL data set. The game_skater_stats.csv file in this data set provides player-level game summaries, such as the number of shots, goals, and assists completed during a game. We’ll read in this file, filter the rows to a single player, and then iterate through the events and send them to a Flask endpoint that will update the user profile. After sending the event, we’ll also call the endpoint to get an updated player score using a simple linear regression model. The goal is to show how a Flask endpoint can be set up to process a streaming data set and serve real-time model predictions. The full code for this post is available as a Jupyter notebook on GitHub.


For readers new to Redis that want to get started with Python, it’s useful to refer to the redis-py documentation for additional details about the Python interface. For this post, we’ll demonstrate basic functionality using a mock Redis server that implements a subset of this interface. To get started, install the following Python libraries:


pip install pandas
pip install fakeredis
pip install flask

Next, we’ll run through CRUD commands using Redis. The code snippet below shows how to start an in-process Redis service, and retrieve a record using the key 12345. Since we haven’t stored any records yet, the print statement will output None. The remainder of the snippet shows how to create, read, update, and delete records.


To create a record in Redis, which is a key-value entry, we can use the set command, which takes key and value parameters. You can think of the key as an index that Redis uses for retrieving the value. The code above checks for the record; if none is found, a new user profile is created as a dictionary and the object is saved to Redis using the player ID as the key.


Next, we read the record using the get command in Redis, which retrieves the most recent value saved to the data store. If no record is found, then a value of None is returned. We then translate the String value returned from Redis into a dictionary using the json library.


Next, we update the user summary by incrementing the sessions value and then using the set command to save the updated record to Redis. If you run the create, read, update commands multiple times, then you’ll see the session count update with each run.


The last operation shown above is the delete operation, which can be performed using an expiration or by deleting the key. The delete command immediately removes the key-value pair from the data store, while the expire command will delete the pair once the specified number of seconds has passed. Using the expire command is a common use case, because you may only need to maintain recently updated data.


Next, we’ll pull the NHL data set into memory as a Pandas dataframe and then iterate over the frame, translate the rows into dictionaries, and send the events to the endpoint that we’ll set up. For now, you’ll want to comment out the post and get commands, since the server is not yet running.

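The data-loading snippet is also missing from this copy; the sketch below uses a small synthetic dataframe standing in for game_skater_stats.csv (so the row counts differ from the full data set), with the post and get commands commented out as described above.

```python
import pandas as pd
# import requests  # uncomment once the Flask server described below is running

# Stand-in rows for game_skater_stats.csv from the Kaggle NHL data set;
# in the real pipeline this would be pd.read_csv("game_skater_stats.csv")
df = pd.DataFrame({
    "player_id": [8467412, 8467412, 8471234],
    "goals":     [1, 0, 2],
    "shots":     [4, 2, 5],
    "assists":   [0, 1, 1],
})

# Filter the frame to a single player to simulate that player's event stream
df = df[df["player_id"] == 8467412]

events = []
for _, row in df.iterrows():
    event = dict(row)  # translate the row into a dictionary
    events.append(event)
    # Send the event to the update endpoint, then request a fresh score;
    # these calls stay commented out until the server is running:
    # requests.post("http://localhost:5000/update", json=event)
    # requests.get("http://localhost:5000/score",
    #              params={"player_id": event["player_id"]})

print(len(events))  # 2 of the 3 synthetic rows belong to this player
```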

For this example, we filter the dataframe to rows with the player ID equal to 8467412, which results in 213 records. Once we have the endpoint set up, we can try iterating through all of the records in order to test the performance of the endpoint.


The code snippet below shows the code for a sample Flask application that sets up two routes. The /update route implements the feature engineering workflow in the application and the /score route implements model application using a simple linear model.


The update route uses the CRUD pattern to update user profiles as new data is received, but it does not include a delete or expiration step. As new events are sent to the server, the application will fetch the most recent player summary, update the record with new data points, and then save the updated profile. This means that the profiles are updated in real-time as data is streamed to the server, minimizing the amount of latency in the model pipeline. One thing to note when using this approach for feature engineering is that only a subset of features can be used when compared to SQL, since aggregation commands are not available. You can update counters, set flags, and calculate means, but you can’t perform operations such as calculating a median or counting distinct values.


The score route takes a player ID and returns a score based on the retrieved feature vector, if available. It parses the player ID from the query string and then fetches the corresponding player summary. The values in this summary are combined with coefficients from a hard-coded linear regression model to return a model prediction. The result is an endpoint that we can call to get model predictions in real-time using up-to-date data.


We now have a mock service that shows how to perform feature engineering and model application in real-time using Redis as a data store. In practice, we’d swap out the mock Redis implementation with a Redis cluster, use a model store for the ML model to apply, run the Flask application using a WSGI server such as gunicorn, and scale the application using tools such as Docker and Kubernetes. There are lots of different approaches for putting this workflow into production, but we’ve demonstrated the core loop with a simple service.


Conclusion

NoSQL tools are useful for data scientists to explore when they need to move from batch to streaming workflows for building predictive models. By using NoSQL data stores, the latency in a workflow can be reduced from minutes or hours to just seconds, enabling ML models to use up-to-date inputs. While data scientists may not typically be hands on with these types of data stores, there’s a variety of ways of getting hands on with tools such as Redis, and it’s possible to prototype services even within Jupyter notebooks.


There are a variety of concerns that arise when scaling a service for real-time feature engineering with Redis. One concern is concurrent updates to a key, which can occur when different threads or servers are updating a profile. This can be handled with the check-and-set pattern. On the model application side, there are also issues such as model maintenance, and it’s common to have separate services for building the feature vectors and for applying models.


Redis and other NoSQL solutions are just one way of implementing real-time feature engineering for data science workflows. Another approach is to use streaming systems such as Kafka or Kinesis, and stream processing tools such as Kinesis Analytics or Apache Flink. It’s good to explore different approaches in order to find the best solution that will work for your organizations data platform and services.


Ben Weber is a distinguished data scientist at Zynga. We are hiring!


Originally published at: https://towardsdatascience.com/nosql-for-real-time-feature-engineering-and-ml-models-93d057c0a7b8
