【Big Data Daily Problem 20181114】How NOT to pull from S3 using Apache Spark

Summary:

In short, reading files from AWS S3 with SparkContext's textFile method causes unexpected problems as the dataset grows. The fix is to use the S3 client API to list the keys and then pull the files in parallel.

The problem, as the original article describes it:
This worked fine at first but as the dataset grew we noticed that there would always be a large period of inactivity between jobs.

 

S3 is NOT a file system

Apache Spark comes with built-in functionality to pull data from S3 just as it would from HDFS, using SparkContext's textFile method. textFile accepts glob syntax, which lets you pull hierarchical data, as in textFile("s3n://bucket/2015/*/*"). Though this seems great at first, there is an underlying issue with treating S3 as an HDFS: S3 is not a file system.
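For instance, the naive HDFS-style read looks like this (a minimal sketch; the bucket name, path, and app name are placeholders):

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("s3-naive-pull"))

// Glob over the "directories" -- Spark expands the pattern and reads
// every matching key as if it were a file on a distributed file system.
val events = sc.textFile("s3n://bucket/2015/*/*")
```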

Though it is common to organize your S3 keys with slashes (/), and the AWS S3 Console will present said keys in a nice directory-like interface if you do, this is actually misleading. S3 isn't a file system; it is a key-value store. The keys 2015/05/01 and 2015/05/02 do not live in the "same place". They just happen to share a prefix: 2015/05.
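A toy illustration of the point in plain Scala (no AWS involved; the keys mirror the layout above):

```scala
// Conceptually, S3 is a flat map from full key strings to blobs.
val bucketContents = Map(
  "2015/05/01/00/00/data.txt" -> "...",
  "2015/05/02/00/00/data.txt" -> "..."
)

// There is no directory "2015/05"; "listing" it is really a prefix
// scan over every key in the bucket.
val keysUnderPrefix = bucketContents.keys.filter(_.startsWith("2015/05"))
```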

This causes issues when using Apache Spark's textFile, since it assumes that anything fed to it behaves like an HDFS. In particular, expanding a glob requires directory listings, and since S3 has no real directories, every level of the pattern has to be emulated with key-listing requests issued from the driver.

The problem

The Kinja Analytics team runs an Apache Spark cluster on AWS EMR continuously. As soon as an EMR step finishes, a job adds the next step. The jobs on the cluster pull data from S3 (placed there by our event stream), run multiple computations on that data set, and persist the results into a MySQL table. The data in S3 is stored in chronological order as bucket/events/$year/$month/$day/$hour/$minute/data.txt. Each data.txt file is about 10MB in size.

Originally we pulled the data using SparkContext's textFile method, as in sc.textFile("s3n://bucket/events/*/*/*/*/*/*"). This worked fine at first, but as the dataset grew we noticed that there would always be a long period of inactivity between jobs.

When the periods were as long as 3 hours, we figured something was wrong. At first we thought that simply adding more machines would solve this (since that’s how you are supposed to speed up Spark/Hadoop), but when that failed, we dug deeper.

Hints

Using Ganglia graphs, we noticed that during that time only one of the boxes was actually doing any work (which explains why adding more boxes did nothing). That box was the driver for the given application. We went ahead and looked at the driver's logs and noticed something peculiar. (Note: the logs that EMR places in S3 lag behind, so you need to wait for your application to finish before seeing the complete logs; for live logs you need to log into the machine.)

https://gist.githubusercontent.com/pjrt/f1cad93b1…

The Solution

The solution is quite simple: do not use textFile. Instead, use the AmazonS3Client to list every key (optionally under a prefix), then parallelize the data pulling using SparkContext's parallelize method and that same AmazonS3Client.

https://gist.githubusercontent.com/pjrt/f1cad93b1…
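The linked gist shows the approach; below is a sketch along the same lines, assuming the AWS SDK for Java v1 (which provides AmazonS3Client), default credentials, and newline-delimited text objects. The name pullFromS3 and the paging details are illustrative, not the gist verbatim:

```scala
import scala.collection.JavaConverters._
import scala.io.Source

import com.amazonaws.services.s3.AmazonS3Client
import com.amazonaws.services.s3.model.ListObjectsRequest
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

def pullFromS3(sc: SparkContext, bucket: String, prefix: String): RDD[String] = {
  // A `def`, not a `val`: AmazonS3Client is not serializable, so each
  // task constructs its own client rather than capturing one in the closure.
  def s3 = new AmazonS3Client()

  // listObjects returns at most 1,000 keys per page; follow the
  // truncation marker until every key under the prefix is collected.
  val request = new ListObjectsRequest().withBucketName(bucket).withPrefix(prefix)
  var listing = s3.listObjects(request)
  var keys = listing.getObjectSummaries.asScala.map(_.getKey).toList
  while (listing.isTruncated) {
    listing = s3.listNextBatchOfObjects(listing)
    keys ++= listing.getObjectSummaries.asScala.map(_.getKey)
  }

  // Hand the key list to the workers; each partition pulls its own objects.
  sc.parallelize(keys).flatMap { key =>
    Source.fromInputStream(s3.getObject(bucket, key).getObjectContent).getLines
  }
}
```

Note that the listing still happens on the driver, but it is one paged scan over the prefix rather than a glob expansion, and the slow per-object reads are spread across the cluster.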

Above, we get all of the keys for a bucket and a prefix (events), parallelize the key list (handing the keys out to the workers/partitions), and make each worker pull the data for its own keys.

After this change, our “time of inactivity” went down to a couple of minutes!

And elapsed times went from 4 hours to 1 hour.

Source: https://tech.kinja.com/how-not-to-pull-from-s3-using-apache-spark-1704509219

The full solution gist (S3Puller.scala): https://gist.githubusercontent.com/pjrt/f1cad93b154ac8958e65/raw/7b0b764408f145f51477dc05ef1a99e8448bce6d/S3Puller.scala
