On Demand Materialized Views: A Scalable Solution for Graphs, Analysis, or Machine Learning with MongoDB

Aggregating data for graphs, analysis, portfolios, or even machine learning can be an arduous task that is difficult to scale. In this article, I will go over MongoDB's new(ish) $merge pipeline stage, which I feel resolves a lot of these scaling issues and automates certain design practices that previously took a lot of custom development to accomplish. However, Mongo's documentation fails to provide extrapolated examples or multiple use cases. This article dives heavily into MongoDB's aggregation framework. It assumes you already know how to aggregate data and focuses primarily on the $merge stage, covering scalability, caching, and data growth.

Table of Contents
Basic Usage
Incrementing New Subsets of Data
Incrementing or Replacing a Field Based on a Conditional
Aggregating Data from Multiple Collections
Creating a Basic Graph or Machine Learning Data Set

Let's create a simple example with some mock data. In this example, we will aggregate generic posts and determine how many posts each profile has, and then we will aggregate comments. If you are using the code snippets to follow along with this article, you will want to create a few data points following the style below. However, this solution scales easily to a database with a large number of data points.

db.posts.insert({profileId: "1", title: "title", body: "body", createdAt: new Date()})
db.comments.insert({profileId: "1", body: "body", createdAt: new Date()})

Next, we will aggregate the data with a simple grouping:

db.posts.aggregate([{
  "$group": {
    "_id": "$profileId",
    "totalPostCount": {"$sum": 1}
  }
}])

This will give us an array of documents looking something like this:

[
  {_id: "1", totalPostCount: 5},
  {_id: "2", totalPostCount: 4},
  {_id: "3", totalPostCount: 9}
]

To avoid having to run a collection scan regularly (an operation that scans the entire collection rather than a subset of it), we might store this information in the profile and update it occasionally with a cron job, or perhaps cache the contents somewhere and rerun the entire aggregation to resync the counts. The problem becomes more evident when there are millions of profiles and millions of posts: suddenly an aggregation takes a considerable amount of computing resources and time, driving up costs and server load. This becomes even worse if we are showing some sort of portfolio view, or any scenario where the end user is actively waiting for these counts and they need to be 100% up to date. Worse still, many users may be making this request at the same time, overloading our servers and database and crashing our app.

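To make the problem concrete, here is a minimal sketch of the naive pattern described above. The profiles collection and the resyncPostCounts() function are hypothetical names for this sketch; a cron job would call the function periodically, and every call pays for a full rescan of posts.

// Hypothetical cron-style resync: rescans every post on every run.
function resyncPostCounts() {
  db.posts.aggregate([
    {"$group": {"_id": "$profileId", "totalPostCount": {"$sum": 1}}}
  ]).forEach(function (doc) {
    // Full collection scan plus one write per profile, every single time.
    db.profiles.updateOne({_id: doc._id}, {"$set": {totalPostCount: doc.totalPostCount}})
  })
}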

On Demand Materialized Views

This introduces the need for MongoDB’s On Demand Materialized Views.

In computer science, a materialized view is the result of a previously run query, stored separately from the original dataset. In this case, it describes the $merge stage and how it outputs results directly into another collection rather than into a cursor that is immediately returned to the application. Mongo's documentation page describes how the content is updated each time the aggregation is run, i.e., on demand. However, it fails to properly explain how to show incremental representations of data when scanning smaller, more recent subsets of data. In the rest of this article, I will show several examples and use cases of exactly how to do that.

This approach adds the $merge stage to the end of an aggregation operation. It can output the contents of the aggregation into a specific collection created for this purpose, and either replace or merge with any document it matches. The contents are output as one document per element of the returned array, allowing further aggregation and calculations to be done on the new aggregated collection. This is a huge upgrade from the previous $out stage, which would overwrite the entire target collection. $merge adds data that didn't exist before and replaces or merges with data that already exists. Mongo's documentation shows a very clear example of that behavior.

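As a quick, self-contained illustration of that default behavior, here is a sketch you can paste into the shell. The mergeDemoSource and mergeDemoTarget collections are made up for this sketch and are not part of the article's data model.

// Scratch data: a target collection and an incoming result set.
db.mergeDemoTarget.insertMany([{_id: "1", totalPostCount: 5}, {_id: "2", totalPostCount: 4}])
db.mergeDemoSource.insertMany([{_id: "2", totalPostCount: 6}, {_id: "3", totalPostCount: 9}])

// Default $merge: match on _id, merge matched documents, insert unmatched ones.
db.mergeDemoSource.aggregate([{"$merge": {"into": "mergeDemoTarget"}}])

db.mergeDemoTarget.find()
// -> {_id: "1", totalPostCount: 5}   (no incoming match, untouched)
//    {_id: "2", totalPostCount: 6}   (matched, field overwritten by the incoming value)
//    {_id: "3", totalPostCount: 9}   (no existing match, inserted)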

Basic Usage


db.posts.aggregate([
  {
    "$group": {
      "_id": "$profileId",
      "totalPostCount": {"$sum": 1}
    }
  },
  {
    "$merge": {
      "into": "metricAggregates"
    }
  }
])

Now we can read the data with a normal query on the metricAggregates collection:

db.metricAggregates.find()
->
[
  {_id: "1", totalPostCount: 5},
  {_id: "2", totalPostCount: 4},
  {_id: "3", totalPostCount: 9}
]

This example fails to cover more complicated use cases. It covers only the addition of new profiles, but in our example, what about new posts for existing profiles? How can we avoid re-scanning previously aggregated data? With millions of profiles and millions of posts, we cannot afford such a heavy operation.

Incrementing New Subsets of Data

When dealing with millions of documents, we need to find a way to aggregate only the most recent data, and only the data we have not already aggregated. We don't want to replace the fields that already exist; we want to increment them. The solution hides at the bottom of the $merge documentation, in the optional whenMatched field:

An aggregation pipeline to update the document in the collection. [ <stage1>, <stage2>, … ]

The pipeline can only consist of the following stages: $addFields and its alias $set, $project and its alias $unset, $replaceRoot and its alias $replaceWith.

By applying the whenMatched option, we can supply a $project stage that lets us increment fields:

db.posts.aggregate([
  {"$match": {"createdAt": {"$gt": aggregationLastUpdatedAt}}},
  {
    "$group": {
      "_id": "$profileId",
      "totalPostCount": {"$sum": 1}
    }
  },
  {
    "$merge": {
      "into": "metricAggregates",
      "whenMatched": [{
        "$project": {
          "_id": "$_id",
          "updatedAt": new Date(),
          "totalPostCount": {
            "$sum": ["$totalPostCount", "$$new.totalPostCount"]
          }
        }
      }]
    }
  }
])

There are two things we added to this operation. The first is the $match. We now need to query the most recent updatedAt value from the output collection and include it in the match, so we only pull the posts created since the last time we ran the operation. Down in the $merge stage, we add a $project, so every time there is a match on the _id field, updatedAt is refreshed and totalPostCount is incremented instead of replaced. $$new is a variable that refers to the incoming document produced by the aggregation we just performed.

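The aggregationLastUpdatedAt variable used in the $match above is left undefined in the snippet. One way to retrieve it is a sketch like the following, which falls back to the epoch so that the very first run scans the whole collection:

// Grab the most recent updatedAt from the output collection,
// falling back to the epoch so the first run aggregates everything.
let lastDoc = db.metricAggregates.find({}, {updatedAt: 1})
  .sort({updatedAt: -1})
  .limit(1)
  .toArray()[0]
let aggregationLastUpdatedAt = (lastDoc && lastDoc.updatedAt) || new Date(0)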

The data only ever needs to be scanned once, and only in bite-sized increments.

Incrementing or Replacing a Field Based on a Conditional

But what if it is more complicated? What if we also need to show counts for the posts made this week, where we need to conditionally increment or replace the field depending on a timestamp or some other information?

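The snippet below also references a startOfWeekDate variable. A minimal sketch for computing it, assuming weeks start on Sunday at local midnight, might look like this:

// Start of the current week: roll back to Sunday, then zero out the time.
let startOfWeekDate = new Date()
startOfWeekDate.setDate(startOfWeekDate.getDate() - startOfWeekDate.getDay())
startOfWeekDate.setHours(0, 0, 0, 0)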

// aggregationLastUpdatedAt: the most recent timestamp in the metricAggregates collection
// startOfWeekDate: a Date representing the beginning of the week
db.posts.aggregate([
  {
    "$match": {"createdAt": {"$gt": aggregationLastUpdatedAt}}
  },
  {
    "$group": {
      "_id": "$profileId",
      "totalPostCount": {"$sum": 1},
      "postsThisWeek": {"$sum": {
        "$cond": {
          "if": {"$gte": ["$createdAt", startOfWeekDate]},
          "then": 1, "else": 0}}}
    }
  },
  {
    "$merge": {
      "into": "metricAggregates",
      "whenMatched": [{
        "$project": {
          "_id": "$_id",
          "updatedAt": new Date(),
          "weekStartedAt": startOfWeekDate,
          "totalPostCount": {
            "$sum": ["$totalPostCount", "$$new.totalPostCount"]
          },
          "postsThisWeek": {
            "$cond": {
              "if": {"$eq": ["$weekStartedAt", startOfWeekDate]},
              "then": {
                "$sum": ["$postsThisWeek", "$$new.postsThisWeek"]
              },
              "else": "$$new.postsThisWeek"
            }
          }
        }
      }]
    }
  }
])

Now we conditionally increment postsThisWeek if the document matches the weekStartedAt date, or replace it if it does not.

Aggregating Data from Multiple Collections

What if we have other collections we need to aggregate data from? Previously we might have used a $lookup stage, but $lookup falls short in that it only matches against the base collection. For example, what if we need to gather metrics from our comments collection? A $lookup would skip all of the profiles that have never made a post, causing profiles that have only made comments to be completely missing from the aggregated results. $merge easily resolves this by allowing us to aggregate different collections at different times, places, or services, with all of the output going to the same collection and document.

db.comments.aggregate([
  {
    "$match": {"createdAt": {"$gt": commentsAggregationLastUpdatedAt}}
  },
  {
    "$group": {
      "_id": "$profileId",
      "totalComments": {"$sum": 1},
      "commentsThisWeek": {
        "$sum": {"$cond": {
          "if": {"$gte": ["$createdAt", startOfWeekDate]},
          "then": 1, "else": 0}}}
    }
  },
  {
    "$project": {
      "_id": "$_id",
      "totalComments": 1,
      "commentsThisWeek": 1,
      "weekStartedAt": startOfWeekDate,
      "postsThisWeek": {"$literal": 0} // explained below
    }
  },
  {
    "$merge": {
      "into": "metricAggregates",
      "whenMatched": [{
        "$project": {
          "_id": "$_id",
          "commentsUpdatedAt": new Date(),
          "weekStartedAt": startOfWeekDate,
          "totalComments": {
            "$sum": ["$totalComments", "$$new.totalComments"]
          },
          "commentsThisWeek": {"$cond": {
            "if": {"$eq": ["$weekStartedAt", startOfWeekDate]},
            "then": {
              "$sum": ["$commentsThisWeek", "$$new.commentsThisWeek"]
            },
            "else": "$$new.commentsThisWeek"
          }},
          // explained below
          "postsThisWeek": {"$cond": {
            "if": {"$eq": ["$weekStartedAt", startOfWeekDate]},
            "then": {"$sum": ["$postsThisWeek", "$$new.postsThisWeek"]},
            "else": "$$new.postsThisWeek"
          }}
        }
      }]
    }
  }
])

Now, in the comments collection, we quickly follow the same aggregation principle, and the collection is automatically merged in exactly the way we want. You may have noticed an extra $project stage, as well as a postsThisWeek field still present in the $merge pipeline. The reason is that if the comments aggregation runs in a new week, commentsThisWeek is accurately reset and weekStartedAt is correctly updated. However, if the posts aggregation runs later, its start-of-week replacement will not fire, because weekStartedAt will already match, causing the post fields to be erroneously incremented when they should be reset. So we include those fields here and set them to {"$literal": 0}: $literal sets the field to the literal integer 0 rather than having the 0 interpreted as a projection exclusion. The code translates to "if it is a new week, set the field to 0; otherwise increment it by 0".

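To illustrate why $literal matters here, a small sketch (not from the original article) contrasting the two forms:

// A plain 0 in a $project is treated as field exclusion, not as a value:
//   {"$project": {"postsThisWeek": 0}}
// Wrapping the 0 in $literal emits an actual constant value instead:
db.posts.aggregate([
  {"$project": {"profileId": 1, "postsThisWeek": {"$literal": 0}}}
])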

Notice we also set a separate date field (commentsUpdatedAt) in this $merge. We need to track separately when the comments were last aggregated and when the posts were; otherwise there is the potential for missing data.

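As with the posts aggregation, the commentsAggregationLastUpdatedAt variable in the $match above needs to be fetched before each run; a sketch, again falling back to the epoch for the first run:

// Most recent comments aggregation timestamp, tracked independently of updatedAt.
let lastCommentsDoc = db.metricAggregates.find({}, {commentsUpdatedAt: 1})
  .sort({commentsUpdatedAt: -1})
  .limit(1)
  .toArray()[0]
let commentsAggregationLastUpdatedAt =
  (lastCommentsDoc && lastCommentsDoc.commentsUpdatedAt) || new Date(0)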

When the end user requests the data, they simply pull it from the output collection like any normal MongoDB query. It can easily be sorted, paginated, filtered, and indexed, even though the data is the product of multiple aggregation queries over multiple collections.

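For example, a hypothetical leaderboard-style read against the output collection might look like this (the index and page size are illustrative, not from the original article):

// Support sorted, paginated reads of the pre-aggregated metrics.
db.metricAggregates.createIndex({totalPostCount: -1})
db.metricAggregates.find({totalPostCount: {"$gt": 0}})
  .sort({totalPostCount: -1})
  .skip(0)
  .limit(20)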

This approach guarantees that even for complicated calculations, we only need to scan the data a single time, and only in bite-sized pieces. The data can be re-aggregated every time a page is viewed, or it can be managed by a cron job. It can span any number of collections without the need for $lookup, and the complexity can grow with the use case.

Finally, the new output collection can itself be aggregated to come up with different interesting metrics, which could greatly aid various machine learning applications or portfolio views.

Creating a Basic Graph or Machine Learning Data Set

As a final example, I will include an aggregation operation that buckets the total counts by week; this would be useful for creating a visual graph or a machine learning data set.

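The pipeline below depends on a startDate variable that the snippet leaves undefined: the week index is simply the number of whole seven-day periods between startDate and each post's createdAt (86400000 is the number of milliseconds in a day). A sketch of the arithmetic, assuming an arbitrary fixed reference date:

// Any fixed reference point works; week 0 is the week containing startDate (assumed value).
let startDate = new Date("2020-01-01T00:00:00Z")
let msPerDay = 86400000
// week = trunc((createdAt - startDate) / msPerDay / 7)
// e.g. a post 10 days after startDate lands in week 1:
let exampleCreatedAt = new Date(startDate.getTime() + 10 * msPerDay)
let week = Math.trunc((exampleCreatedAt - startDate) / msPerDay / 7) // -> 1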

db.posts.aggregate([
  {"$match": {"createdAt": {"$gt": aggregationLastUpdatedAt}}},
  {
    "$project": {
      "createdAt": 1,
      "week": {"$trunc":
        {"$divide": [
          {"$divide": [{"$subtract": ["$createdAt", startDate]}, 86400000]},
          7
        ]}
      }
    }
  },
  {
    "$group": {
      "_id": "$week",
      "week": {"$first": "$week"},
      "totalPostCount": {"$sum": 1}
    }
  },
  {
    "$merge": {
      "into": "metricsByWeek",
      "on": ["week"], // this requires a unique index on the metricsByWeek collection
      "whenMatched": [{
        "$project": {
          "week": 1,
          "updatedAt": new Date(),
          "totalPostCount": {
            "$sum": ["$totalPostCount", "$$new.totalPostCount"]
          }
        }
      }]
    }
  }
])

If you are following the code examples live, you will need to run the following snippet before running the code above:

db.metricsByWeek.createIndex({week:1}, {unique:true})

This is because when you customize which fields the $merge stage matches on, the field (or combination of fields) must have a unique index, so that MongoDB is guaranteed to find at most a single match.

This creates a collection of documents like the following, which can be plugged into a graphing library or any other application:

{
  week: 0,
  totalPostCount: 3
}
{
  week: 1,
  totalPostCount: 9
}
{
  week: 2,
  totalPostCount: 25
}

Source: https://towardsdatascience.com/on-demand-materialized-views-a-scalable-solution-for-graphs-analysis-or-machine-learning-w-d3816af28f1
