It Took 2 Days and 7 Engineers to Move Data Between S3 Buckets

A team of engineers tried to quickly transfer 25 TB of data from one S3 bucket to another [1]. Their requirement was to move a large number of small log files (in the MB range), ideally within the next two hours. It turned out that they needed two full working days and a team of seven engineers to complete the task. It's easy to just call them extremely inefficient, since from a business perspective the problem seems simple. But the truth is that data transfers of this size cannot easily be accomplished within two hours without preparation.

Let’s learn from their mistakes and look at how we can accomplish this faster and more efficiently.

What Went Wrong?

From the original Reddit post, we don't get a lot of background information other than that they wanted to quickly transfer 25 TB of data, mostly made up of small log files. We don't know which AWS regions they had to transfer this data to and from. All we know about their approach is that they did some brief research, concluded that all options were too time-consuming, and decided to migrate the data by running parallel uploads using the AWS CLI, similar to the following:

aws s3 cp s3://source-bucket-name/ \
s3://destination-bucket-name/ --recursive \
--exclude "*" --include "2020-10*" \
--include "2020-09*" --include "2020-08*" \
--include "2020-07*" --include "2020-06*"

This way, they split the data transfer into multiple operations that can run in parallel [2]. The --exclude "*" flag first excludes every file, and each --include flag then adds back the files that start with a specific prefix. In the commands shown here, each copy operation therefore handles only the log files whose names start with a given year-and-month prefix, and each aws s3 cp instance uploads its files using multiple threads.

This command would be performed and monitored by one engineer, while another one (or the same engineer but within another terminal session) could run the transfer for other months:

aws s3 cp s3://source-bucket-name/ \
s3://destination-bucket-name/ --recursive \
--exclude "*" --include "2020-05*" \
--include "2020-04*" --include "2020-03*" \
--include "2020-02*" --include "2020-01*"

Even though this is one of the options recommended by AWS [2] for moving large amounts of data between S3 buckets, it reminds me of a map-reduce problem: how to count the number of books in a library. Divide the work evenly between several people (workers/processes) so that each person counts only the books from specific shelves, and have every worker report the result to a coordinating person (the master).

This approach works, but it generates a lot of overhead (they needed seven engineers and two days to coordinate this). There must be a better way!

Note: Even though we used aws s3 cp, we could also use aws s3 mv to ensure that the data is not only copied to the destination but also deleted from the source bucket.

Setting Up Replication From Bucket A to Bucket B

Ideally, we don’t want to migrate single files separately. We’d prefer to just configure things to transfer data from bucket A to bucket B. There is one option which would let us do that: replication.

Replication allows us to configure bucket A to be constantly in sync with bucket B and to automatically make sure that all files are copied over.

Replication is particularly useful if we want to copy data from a production bucket to some development bucket. This way we can ensure our development environment has an exact copy of production data, which allows for a reliable development setup.

AWS allows us to use [3]:

  • CRR (Cross Region Replication)
  • SRR (Same Region Replication)

Those two options allow us to move the data between buckets across regions (CRR) or within the same region (SRR).

Note that replication only works if both S3 buckets have versioning enabled.

To implement this, we go to the management console and, within our source bucket, select Management → Replication → Add rule. Then we follow the three steps shown in the screenshots below to enable versioning and replication from bucket A to bucket B:

[Image: Setting up replication (image by author)]
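
For reference, the same setup can be sketched with the AWS CLI. This is a minimal, hedged example: the IAM role ARN is a placeholder for a role that allows S3 to replicate objects on our behalf, and the rule simply replicates everything:

# Versioning must be enabled on both buckets before replication can be configured.
aws s3api put-bucket-versioning --bucket source-bucket-name \
  --versioning-configuration Status=Enabled
aws s3api put-bucket-versioning --bucket destination-bucket-name \
  --versioning-configuration Status=Enabled

# Replication rule: copy every new object version from bucket A to bucket B.
cat > replication.json <<'EOF'
{
  "Role": "arn:aws:iam::123456789012:role/s3-replication-role",
  "Rules": [
    {
      "ID": "replicate-everything",
      "Priority": 1,
      "Status": "Enabled",
      "Filter": { "Prefix": "" },
      "DeleteMarkerReplication": { "Status": "Disabled" },
      "Destination": { "Bucket": "arn:aws:s3:::destination-bucket-name" }
    }
  ]
}
EOF

aws s3api put-bucket-replication --bucket source-bucket-name \
  --replication-configuration file://replication.json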

In the end, we should see a screen confirming that the replication has been established:

[Image: Replication successful (image by author)]

Are we done? Not quite.

Overall, replication sounds great: once we configure it, there is nothing left for us to do, as AWS automatically replicates the objects from bucket A to bucket B. But there is one caveat: after we set this up, replication only applies to files uploaded in the future; it won't replicate the existing objects!

There is a trick, though: it's sufficient to change the storage class of the objects in the S3 bucket (or, alternatively, to change their encryption status). This could mean changing the storage class from Standard to Intelligent-Tiering, but the main point is that it must be a change from one class to a different one; trying to change from Standard to Standard wouldn't modify the objects. By changing the storage class, we ensure that all the files will be:

  • Moved from bucket A back to bucket A (but with a new storage class).
  • Automatically replicated to bucket B.

We could achieve this by the following command [2]:

aws s3 cp s3://source-bucket-name s3://source-bucket-name --recursive --storage-class INTELLIGENT_TIERING

You can do the same from the console:

[Image: Changing a storage class (image by author)]

Having made those changes, all of the data gets automatically copied over from one bucket to the other.

We could now change back to our previous storage class.
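
Assuming the objects started out in the Standard class, changing back could be another in-place copy, this time with the original storage class:

# Assuming the objects were originally in the Standard class,
# switch the storage class back with another in-place copy.
aws s3 cp s3://source-bucket-name s3://source-bucket-name \
  --recursive --storage-class STANDARD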

S3 Batch Operations

The first method we introduced (the AWS CLI) suffers from the fact that we need to do a lot of work on our side (our own "map-reduce") and make many API calls, which can incur larger costs. The second method (replication) suffers from the fact that the process is asynchronous, meaning that objects are only replicated eventually. According to AWS:

“Most objects replicate within 15 minutes, but sometimes replication can take a couple hours or more.” [4]

The potential latency may be the reason why those engineers didn’t choose this option — they wanted to accomplish the data transfer within two hours.

In this case, the service "S3 batch operations" seems to be an attractive alternative. It promises to quickly process a large number of S3 objects with a single API request [2].

S3 batch operations in action: the process

The entire process of moving data from bucket A to bucket B entails the following steps:

  1. Setting up the inventory report (it could be stored in the same bucket we want to copy the data into, i.e. bucket B) to generate a list of all objects that need to be copied over from bucket A to bucket B.

  2. Creating IAM roles for the S3 batch operations job to give it permission to read and write data to and from both buckets (or three buckets, if you configured the inventory report to be stored in a third bucket); a sketch of such a role follows after this list.

  3. Within the AWS Management Console (or the AWS CLI), creating an S3 batch operations job with a PUT copy operation to do the actual data transfer based on the inventory job's output.

  4. Running the job and viewing the completion report to validate that all objects have been successfully transferred.
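
As referenced in step 2, the role has to be assumable by the S3 Batch Operations service. A minimal, hedged sketch (the role name is a placeholder, and the permissions policy granting s3:GetObject on bucket A and s3:PutObject on bucket B still has to be attached separately):

# Trust policy that lets S3 Batch Operations assume the role.
cat > batch-trust-policy.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "Service": "batchoperations.s3.amazonaws.com" },
      "Action": "sts:AssumeRole"
    }
  ]
}
EOF

aws iam create-role --role-name s3-batch-copy-role \
  --assume-role-policy-document file://batch-trust-policy.json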

The entire process should take a couple of hours.

Note that an important prerequisite for creating a job in S3 batch operations is having the inventory report in place (step 1).

S3 batch operations in action: the implementation

We start by creating an S3 inventory report of our bucket A (select your bucket → Management → Inventory), which will (when completed) list all objects in our S3 bucket:

[Image: Configuring the inventory report (image by author)]

Now we can customize the report: select the destination bucket, choose whether we want to create this report daily or weekly, add optional fields to include extra metadata such as object size, last modified date, or whether an object is encrypted. We should select CSV, as this is the only format that can be used for S3 batch operations:

[Image: Configuring the inventory report in S3 (image by author)]

After we click on save, the configuration is finished. However, AWS informs us that it may take up to 48 hours to deliver the first report!

[Image: The inventory report should be delivered within 48 hours (image by author)]
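
As a side note, the same inventory configuration can also be created with the AWS CLI instead of the console. A hedged sketch, assuming the CSV report should be delivered daily into the destination bucket (the configuration ID is an arbitrary placeholder):

# Daily CSV inventory of the source bucket, delivered to the destination bucket.
cat > inventory.json <<'EOF'
{
  "Id": "daily-inventory",
  "IsEnabled": true,
  "IncludedObjectVersions": "Current",
  "Schedule": { "Frequency": "Daily" },
  "OptionalFields": ["Size", "LastModifiedDate"],
  "Destination": {
    "S3BucketDestination": {
      "Bucket": "arn:aws:s3:::destination-bucket-name",
      "Format": "CSV"
    }
  }
}
EOF

aws s3api put-bucket-inventory-configuration --bucket source-bucket-name \
  --id daily-inventory --inventory-configuration file://inventory.json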

I tried to fake it and generate the inventory report myself by simply listing the files with the AWS CLI, saving the results to a CSV file, and uploading it to my S3 bucket:

# create the file manually
aws s3 ls s3://ecommerce-marketplace --recursive > manifest.csv

# upload to S3
aws s3 cp manifest.csv s3://e-commerce-marketplace/manifest.csv

But it seems that this service requires the inventory report to be in a specific format, which caused my “S3 batch operations” attempt to fail:

[Image: Attempt to “fake” the inventory report (image by author)]

Apparently, no inventory report means no S3 batch operations job!
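
Part of the problem is the file format itself: a CSV manifest for S3 batch operations lists one object per line as bucket,key (optionally followed by a version ID), while aws s3 ls prints date, time, size, and key. A hand-written manifest would have to look roughly like this (the bucket name and keys below are made-up placeholders):

# What a valid CSV manifest would roughly look like.
cat > manifest.csv <<'EOF'
e-commerce-marketplace,logs/2020-10-01-000001.log
e-commerce-marketplace,logs/2020-10-01-000002.log
EOF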

If we had this inventory already in place, we could continue as follows to create an S3 batch operations job:

  • Create a new job:

[Image: Create job (image by author)]

  • Specify the path to the CSV inventory report and the S3 destination (bucket B), and then select the type of operation we want to perform (Copy):

[Image: Configuring the job (image by author)]

The final steps are configuring the completion report and IAM role to grant the job permissions to access our S3 resources:

[Image: All subsequent steps needed to complete the S3 batch operations job to move big data between S3 buckets (image by author)]
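
For completeness, here is a rough sketch of what the same job creation could look like with the AWS CLI (aws s3control create-job). The account ID, role ARN, manifest location, and ETag are placeholders, and the manifest format shown assumes the job reads the S3 inventory report generated earlier:

# Rough sketch: create the batch copy job from the CLI instead of the console.
aws s3control create-job \
  --account-id 123456789012 \
  --role-arn arn:aws:iam::123456789012:role/s3-batch-copy-role \
  --priority 10 \
  --operation '{"S3PutObjectCopy":{"TargetResource":"arn:aws:s3:::destination-bucket-name"}}' \
  --manifest '{"Spec":{"Format":"S3InventoryReport_CSV_20161130"},"Location":{"ObjectArn":"arn:aws:s3:::destination-bucket-name/inventory/manifest.json","ETag":"example-etag"}}' \
  --report '{"Bucket":"arn:aws:s3:::destination-bucket-name","Format":"Report_CSV_20180820","Enabled":true,"Prefix":"batch-reports","ReportScope":"AllTasks"}'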

Alternative Options

In addition to the methods above, AWS offers other ways of transferring large amounts of data between S3 buckets:

  • AWS SDK: This entails writing a custom application (for example, in Java) to perform this simple COPY operation.

  • Spinning up a Hadoop cluster on Amazon EMR and performing an S3DistCp operation to copy big data from S3 to a new destination. This involves running parallel copy commands that download the data from bucket A to a Hadoop cluster and write files in parallel to bucket B.

If you ask me, those options seem to be an example of overengineering, but they can make sense in certain scenarios, especially if you face such unexpected requirements as having to move terabytes of data within a couple of hours.

Conclusion

In this article, we discussed several options for moving large amounts of data between S3 buckets: the AWS CLI copy command, replication, and S3 batch operations. We can conclude that without planning and more time for such migrations (for example, planning a large enough buffer to wait until an inventory report is generated, or waiting for replication to take care of the process for us), moving large amounts of data is a very involved process. It requires custom, often overengineered solutions such as:

  • Writing custom applications with the AWS SDK.
  • Performing S3DistCp on a Hadoop cluster.
  • Even trying to perform your own map-reduce job by splitting the AWS CLI copy process into separate sessions and dividing the work across seven engineers.

We shouldn’t require such large data transfers to be conducted within two hours by a single engineer. Additionally, planning ahead may eliminate the need for such large data transfers in the first place.

Overall, I wish that business owners and managers would recognize that there are many things in engineering that just don’t happen overnight (certainly not within two hours). Everything requires planning, preparation, gathering and discussing requirements with stakeholders, infrastructure setup, and plenty of testing. This is the only way to provide high-quality IT solutions to business problems.

I learned a lot from the experience of those engineers and I’m grateful that they shared their story. I hope it was useful for you, too. Thank you for reading!

[1] Reddit post: https://www.reddit.com/r/aws/comments/irkshm/moving_25tb_data_from_one_s3_bucket_to_another/

[2] AWS Knowledge Center: https://aws.amazon.com/premiumsupport/knowledge-center/s3-large-transfer-between-buckets/

[3] Amazon S3 Replication: https://docs.aws.amazon.com/AmazonS3/latest/dev/replication.html

[4] S3 CRR replication time: https://aws.amazon.com/premiumsupport/knowledge-center/s3-crr-replication-time/

Original article: https://medium.com/better-programming/it-took-2-days-and-7-engineers-to-move-data-between-s3-buckets-d79c55b16d0
