

A team of engineers tried to quickly transfer 25TB of data from one S3 bucket to another [1]. Their requirement was to move a large number of small log files (in the range of MB), ideally within the next two hours. It turned out that they needed two full working days and a team of seven engineers to complete the task. It’s easy to just call them extremely inefficient — from a business perspective this problem seems simple. But the truth is that performing such large data transfers is not something that can be easily accomplished within two hours, without preparation.

Let’s learn from their mistakes and look at how we can accomplish this faster and more efficiently.


What Went Wrong?

From the original Reddit post, we don’t get a lot of background information other than that they wanted to quickly transfer 25 TB of data, mostly made up of small log files. We don’t know which AWS regions they have to transfer this data to and from. All we know about their approach is that they briefly conducted their research and concluded that all options are too time-consuming, so they decided to migrate the data by running parallel uploads using AWS CLI, similar to the following:

aws s3 cp s3://source-bucket-name/ \
s3://destination-bucket-name/ --recursive \
--exclude "*" --include "2020-10*" \
--include "2020-09*" --include "2020-08*" \
--include "2020-07*" --include "2020-06*"

This way, they split the data transfer into multiple operations that can run in parallel [2]. The --exclude "*" filter first excludes all files, and each --include filter then re-includes only the files whose keys start with a specific prefix. The filters themselves only select which objects a command copies; the parallelism comes from running several such commands at the same time, each covering different prefixes. In the command shown above, we only include log files that start with a specific year and month prefix.

This command would be performed and monitored by one engineer, while another one (or the same engineer but within another terminal session) could run the transfer for other months:


aws s3 cp s3://source-bucket-name/ \
s3://destination-bucket-name/ --recursive \
--exclude "*" --include "2020-05*" \
--include "2020-04*" --include "2020-03*" \
--include "2020-02*" --include "2020-01*"
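
Instead of several engineers each watching a terminal, the same split could be driven from one script. The following is only a minimal sketch under a few assumptions: the bucket names are the placeholders used above, the helper names (run, copy_month, copy_months_in_parallel) are made up for illustration, and the script defaults to a dry run that prints the commands rather than executing them.

```shell
# Placeholder bucket names, taken from the commands above.
SRC="s3://source-bucket-name/"
DST="s3://destination-bucket-name/"

# Run the given command, or only print it when DRY_RUN=1 (the default here,
# so that nothing is copied by accident).
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Copy all objects whose key starts with one year-month prefix.
copy_month() {
  run aws s3 cp "$SRC" "$DST" --recursive --exclude "*" --include "$1*"
}

# Start one background copy per month, then wait for all of them to finish.
copy_months_in_parallel() {
  for month in "$@"; do
    copy_month "$month" &
  done
  wait
}
```

Calling copy_months_in_parallel 2020-10 2020-09 2020-08 prints three commands in dry-run mode (in no particular order), and with DRY_RUN=0 it would start three copies at once. The sketch does not check whether an individual copy failed, so a real script would need to collect exit codes as well.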

Even though this is one of the options recommended by AWS [2] to move large amounts of data between S3 buckets, it reminds me of a map-reduce problem: How to count the number of books in a library. Divide the work between several people (workers/processes), split the work evenly between them so that each person counts only books from specific shelves and every worker reports the result to the coordinating person (master).

This approach works, but it generates a lot of overhead (they needed seven engineers and two days to coordinate this). There must be a better way!

Note: Even though we used aws s3 cp we could also use aws s3 mv to ensure that the data is not only copied to the destination but also deleted from the source bucket.

Setting Up Replication From Bucket A to Bucket B

Ideally, we don’t want to migrate single files separately. We’d prefer to just configure things to transfer data from bucket A to bucket B. There is one option which would let us do that: replication.

Replication allows us to configure bucket A to be constantly in sync with bucket B and to automatically make sure that all files are copied over.


Replication is particularly useful if we want to copy data from a production bucket to some development bucket. This way we can ensure our development environment has an exact copy of production data, which allows for a reliable development setup.

AWS allows us to use [3]:


  • CRR (Cross Region Replication)

  • SRR (Same Region Replication)


Those two options allow us to move the data between buckets across regions (CRR) or within the same region (SRR).


Note that replication only works if both S3 buckets have versioning enabled.


To implement this, we go to the management console and, within our source bucket, select Management → Replication → Add rule. Then, we follow the three steps shown in the screenshot below to enable versioning and replication from bucket A to bucket B:

Setting up replication — image by author

In the end, we should see a screen confirming that the replication has been established:


Replication successful — image by author
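
For those who prefer the command line over the console, the same setup can be sketched with the AWS CLI. This is a rough sketch, not a tested recipe: the bucket names, the rule ID, and the IAM role ARN are placeholders, and the role is assumed to already exist with permission to read from bucket A and replicate into bucket B.

```shell
# Placeholders: adjust the buckets and the IAM role ARN to your account.
SRC_BUCKET="source-bucket-name"
DST_BUCKET="destination-bucket-name"
ROLE_ARN="arn:aws:iam::111122223333:role/replication-role"

# Print a replication configuration with one rule covering the whole bucket.
replication_config() {
  cat <<EOF
{
  "Role": "$ROLE_ARN",
  "Rules": [
    {
      "ID": "replicate-everything",
      "Priority": 1,
      "Status": "Enabled",
      "Filter": {},
      "DeleteMarkerReplication": { "Status": "Disabled" },
      "Destination": { "Bucket": "arn:aws:s3:::$DST_BUCKET" }
    }
  ]
}
EOF
}

# Enable versioning on both buckets, then attach the rule to the source bucket.
setup_replication() {
  for bucket in "$SRC_BUCKET" "$DST_BUCKET"; do
    aws s3api put-bucket-versioning --bucket "$bucket" \
      --versioning-configuration Status=Enabled
  done
  replication_config > replication.json
  aws s3api put-bucket-replication --bucket "$SRC_BUCKET" \
    --replication-configuration file://replication.json
}
```

Nothing happens until setup_replication is called; running replication_config on its own only prints the JSON, which makes it easy to review the rule first.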

Are we done? Not quite.

Overall, replication sounds great because once we configure it, there’s nothing left for us to do: AWS automatically replicates the objects from bucket A to bucket B. But there’s one caveat: after we’ve set this up, replication only applies to the files that we upload in the future. It won’t replicate the existing objects!


There is a trick, though: it’s sufficient to change the storage class of the objects in the S3 bucket (or, alternatively, to change their encryption status). This could involve changing the storage class from Standard to Intelligent-Tiering, but the main point is that it must be a change from one class to a different one. Trying to change from Standard to Standard wouldn’t modify the objects. By changing the storage class, we can ensure that all the files will be:

  • Moved from bucket A back to bucket A (but with a new storage class).

  • Automatically replicated to bucket B.


We could achieve this by the following command [2]:


aws s3 cp s3://source-bucket-name s3://source-bucket-name --recursive --storage-class 
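
As a hedged sketch, here is how the in-place copy might look with a concrete storage class. INTELLIGENT_TIERING is chosen only because Intelligent-Tiering is the example mentioned above; the bucket name is a placeholder and the helper names are hypothetical. The function prints the command by default and only executes it when DRY_RUN=0 is set.

```shell
# Dry run by default: the command is only printed unless DRY_RUN=0 is set.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Copy every object of a bucket onto itself with a new storage class.
# $1: bucket name, $2: target storage class (assumed default: INTELLIGENT_TIERING).
change_storage_class() {
  bucket="$1"
  storage_class="${2:-INTELLIGENT_TIERING}"
  run aws s3 cp "s3://$bucket" "s3://$bucket" --recursive \
    --storage-class "$storage_class"
}
```

For example, change_storage_class source-bucket-name prints the full command for review. Keep in mind that rewriting every object is itself a large operation on a 25 TB bucket and is billed as regular copy requests.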

You can do the same from the console:


Changing a storage class — image by author

Having made those changes, all of the data gets automatically copied over from one bucket to the other.

We could now change back to our previous storage class.


S3 Batch Operations

The first method that we introduced (AWS CLI) suffers from the fact that we need to do a lot of work on our side (our own “map-reduce”) and make many API calls, which can incur larger costs. The second method (replication) suffers from the fact that the process is asynchronous: all objects will eventually get replicated, but we can’t tell in advance when. According to AWS:

“Most objects replicate within 15 minutes, but sometimes replication can take a couple hours or more.” [4]

The potential latency may be the reason why those engineers didn’t choose this option — they wanted to accomplish the data transfer within two hours.


In this case, the service “S3 batch operations” seems to be an attractive alternative. It promises to quickly process a large number of S3 objects within a single API request [2].

S3 batch operations in action: the process

The entire process of moving data from bucket A to bucket B entails the following steps:


  1. Setting up the inventory report (it could be stored in bucket B, the same bucket we want to copy the data into) to generate a list of all objects that need to be copied over from bucket A to B.

  2. Creating IAM roles for the S3 batch operations to give the job permissions to read and write data to and from both buckets (or three buckets if you configured the inventory report to be stored in a third bucket).

  3. Within the AWS Management Console (or within AWS CLI), creating an S3 batch operations job with a PUT copy operation to do the actual data transfer based on the inventory job’s output.

  4. Running the job and viewing the completion report to validate that all objects have been successfully transferred.


The entire process should take a couple of hours.


Note that an important prerequisite for creating a job in S3 batch operations is having the inventory report in place (step 1).


S3 batch operations in action: the implementation

We start by creating an S3 inventory report of our bucket A (select your bucket → Management → Inventory), which will (when completed) list all objects in our S3 bucket:


Configuring inventory report — image by author

Now we can customize the report: select the destination bucket, choose whether we want to create this report daily or weekly, add optional fields to include extra metadata such as object size, last modified date, or whether an object is encrypted. We should select CSV, as this is the only format that can be used for S3 batch operations:

Configuring the inventory report in S3 — image by author
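
The inventory settings chosen in the console could also be written down as a configuration file and applied with the CLI. The sketch below assumes placeholder bucket names, a made-up configuration ID (full-inventory) and prefix (inventory), and a destination bucket whose policy already allows S3 to deliver the report.

```shell
# Placeholders: bucket A is inventoried, the report is delivered to bucket B.
SRC_BUCKET="source-bucket-name"
REPORT_BUCKET="destination-bucket-name"

# Print a daily CSV inventory configuration with a few optional fields.
inventory_config() {
  cat <<EOF
{
  "Id": "full-inventory",
  "IsEnabled": true,
  "IncludedObjectVersions": "Current",
  "Schedule": { "Frequency": "Daily" },
  "Destination": {
    "S3BucketDestination": {
      "Bucket": "arn:aws:s3:::$REPORT_BUCKET",
      "Format": "CSV",
      "Prefix": "inventory"
    }
  },
  "OptionalFields": ["Size", "LastModifiedDate", "EncryptionStatus"]
}
EOF
}

# Write the configuration to a file and attach it to the source bucket.
create_inventory() {
  inventory_config > inventory.json
  aws s3api put-bucket-inventory-configuration --bucket "$SRC_BUCKET" \
    --id full-inventory --inventory-configuration file://inventory.json
}
```

The ID passed with --id has to match the Id inside the JSON, which is why both use the same value here.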

After we click on save, the configuration is finished. However, AWS informs us that it may take up to 48 hours to deliver the first report!

The inventory report should be delivered within 48 hours — image by author

I tried to fake it and generate the inventory report myself by just listing the files with AWS CLI, saving the results to a CSV file, and uploading it to my S3 bucket:

# create the file manually
aws s3 ls s3://ecommerce-marketplace --recursive > manifest.csv
# upload to S3
aws s3 cp manifest.csv s3://e-commerce-marketplace/manifest.csv

But it seems that this service requires the inventory report to be in a specific format, which caused my “S3 batch operations” attempt to fail:

Attempt to “fake” the inventory report — image by author

Apparently, no inventory report means no S3 batch operations job!
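
That said, the raw output of aws s3 ls (date, time, size, key) is not a valid manifest in any case. AWS also documents a plain CSV manifest format for S3 batch operations, with one bucket,key row per object, so a hand-made list might work if it is reshaped accordingly. I have not verified this against the service, so treat the following as a sketch only. The function name is hypothetical, and it assumes that object keys contain only characters that need no URL-encoding (the manifest format expects URL-encoded keys).

```shell
# Convert `aws s3 ls --recursive` output (read from stdin) into "bucket,key" rows.
# Each input line looks like: 2020-10-01 12:34:56       1024 logs/2020-10-01.log
# Assumption: keys need no URL-encoding.
ls_to_manifest() {
  bucket="$1"
  # Drop the date, time, and size columns and put the bucket name in front.
  sed "s/^[^ ]* *[^ ]* *[0-9]* /$bucket,/"
}
```

A possible usage would be aws s3 ls s3://source-bucket-name --recursive | ls_to_manifest source-bucket-name > manifest.csv. When creating the job, the manifest would then have to be declared as a CSV manifest with the fields Bucket and Key rather than as an inventory report. Listing millions of small objects this way takes time as well, so it is not a free shortcut.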


If we had this inventory already in place, we could continue as follows to create an S3 batch operations job:


  • Create a new job:

Create job — image by author
  • Specify the path to the CSV inventory report and the S3 destination (bucket B), and then select the type of operation we want to perform (Copy):

Configuring the job — image by author

The final steps are configuring the completion report and IAM role to grant the job permissions to access our S3 resources:


All subsequent steps needed to complete the S3 batch operations job to move big data between S3 buckets — image by author
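
The console steps above roughly correspond to a single aws s3control create-job call. The sketch below is not taken from the original experiment: the account ID, ARNs, ETag, priority, and report prefix are all placeholders, and the function only prints the command unless DRY_RUN=0 is set.

```shell
# Dry run by default: the command is only printed unless DRY_RUN=0 is set.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Create a copy job from an inventory report.
# $1: AWS account ID          $2: ARN of the inventory's manifest.json
# $3: ETag of that manifest   $4: ARN of the IAM role the job assumes
create_copy_job() {
  account_id="$1"
  manifest_arn="$2"
  manifest_etag="$3"
  role_arn="$4"
  run aws s3control create-job \
    --account-id "$account_id" \
    --operation '{"S3PutObjectCopy":{"TargetResource":"arn:aws:s3:::destination-bucket-name"}}' \
    --manifest "{\"Spec\":{\"Format\":\"S3InventoryReport_CSV_20161130\"},\"Location\":{\"ObjectArn\":\"$manifest_arn\",\"ETag\":\"$manifest_etag\"}}" \
    --report '{"Bucket":"arn:aws:s3:::destination-bucket-name","Prefix":"batch-reports","Format":"Report_CSV_20180820","Enabled":true,"ReportScope":"AllTasks"}' \
    --priority 10 \
    --role-arn "$role_arn" \
    --no-confirmation-required
}
```

The --no-confirmation-required flag lets the job start without a manual confirmation in the console; leaving the job in the confirmation state may be the safer choice for a first run.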

Alternative Options

In addition to the methods above, AWS offers other ways of transferring large amounts of data between S3 buckets:


  • AWS SDK: This entails writing a custom application (for example, in Java) to do this simple COPY operation.


  • Spinning up a Hadoop cluster on Amazon EMR and performing an S3DistCp operation to copy big data from S3 to a new destination. This involves running parallel copy commands that would download the data from bucket A to a Hadoop cluster and would write files in parallel to bucket B.

If you ask me, those options seem to be an example of overengineering, but it can make sense in certain scenarios, especially if you have such unexpected requirements as having to move terabytes of data within a couple of hours.
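
For completeness, here is a sketch of what the S3DistCp variant might look like once an EMR cluster is running. It assumes the command is issued on a cluster node where s3-dist-cp is installed; the bucket names are placeholders, the helper names are hypothetical, and the function prints the command unless DRY_RUN=0 is set.

```shell
# Dry run by default: the command is only printed unless DRY_RUN=0 is set.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Copy everything from one bucket to another with S3DistCp.
# $1: source bucket name, $2: destination bucket name
copy_with_s3distcp() {
  run s3-dist-cp --src "s3://$1/" --dest "s3://$2/"
}
```

Calling copy_with_s3distcp source-bucket-name destination-bucket-name prints the command for review. The cluster setup around it is the actual effort, which is why this option only pays off for very large transfers.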


Conclusion

In this article, we discussed several options to move large amounts of data between S3 buckets: AWS CLI copy command, replication, and S3 batch operations. We can conclude by saying that without planning and more time for such migrations (for example, planning for a large enough buffer to wait until an inventory report is generated, or waiting until replication will take care of the process for us), the process of moving large amounts of data is very involved. It requires custom, often overengineered solutions such as:

  • Writing custom applications with AWS SDK.

  • Performing S3DistCp on a Hadoop cluster.

  • Even trying to perform your own map-reduce job by splitting the AWS CLI copy process into separate sessions and dividing the work across seven engineers.

We shouldn’t require such large data transfers to be conducted within two hours by a single engineer. Additionally, planning ahead may eliminate the need for such large data transfers in the first place.

Overall, I wish that business owners and managers would recognize that there are many things in engineering that just don’t happen overnight (certainly not within two hours). Everything requires planning, preparation, gathering and discussing requirements with stakeholders, infrastructure setup, and plenty of testing. This is the only way to provide high-quality IT solutions to business problems.

I learned a lot from the experience of those engineers and I’m grateful that they shared their story. I hope it was useful for you, too. Thank you for reading!

[1] Reddit post: https://www.reddit.com/r/aws/comments/irkshm/moving_25tb_data_from_one_s3_bucket_to_another/

[2] AWS Knowledge Center: https://aws.amazon.com/premiumsupport/knowledge-center/s3-large-transfer-between-buckets/

[3] Amazon S3 — Replication: https://docs.aws.amazon.com/AmazonS3/latest/dev/replication.html

[4] S3 CRR Replication time: https://aws.amazon.com/premiumsupport/knowledge-center/s3-crr-replication-time/

Translated from: https://medium.com/better-programming/it-took-2-days-and-7-engineers-to-move-data-between-s3-buckets-d79c55b16d0


