How Spark handles missing files: can I read multiple files from S3 into a Spark DataFrame and skip the ones that don't exist?

weixin_39862097

Article link: https://blog.csdn.net/weixin_39862097/article/details/111862797

I would like to read multiple parquet files into a dataframe from S3. Currently, I'm using the following method to do this:

files = ['s3a://dev/2017/01/03/data.parquet',
         's3a://dev/2017/01/02/data.parquet']
df = session.read.parquet(*files)

This works if all of the files exist on S3, but I would like to pass a list of files and have them loaded into a DataFrame without the read failing when some of the files in the list don't exist. In other words, I would like Spark SQL to load as many of the files as it finds into the DataFrame and return the result without complaining. Is this possible?

Solution

Yes, it's possible if you specify the input as a Hadoop glob pattern instead of a list of paths, for example:

files = 's3a://dev/2017/01/{02,03}/data.parquet'
df = session.read.parquet(files)

You can read more about the supported patterns in the Hadoop Javadoc (see FileSystem.globStatus).
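If the list of days is built at runtime, the glob can be generated rather than typed by hand. Here is a minimal sketch, assuming the year/month/day directory layout from the question; the helper name build_day_glob and its parameters are hypothetical and not part of Spark or Hadoop.

```python
def build_day_glob(base, year, month, days, filename="data.parquet"):
    """Return a Hadoop glob with one brace alternative per requested day.

    Hypothetical helper: assumes the <base>/<yyyy>/<mm>/<dd>/<filename>
    layout used in the question.
    """
    if not days:
        raise ValueError("at least one day is required")
    # Zero-pad, de-duplicate and sort the days, e.g. 2 -> "02".
    padded = sorted({f"{int(d):02d}" for d in days})
    # A single day needs no braces; several days become {02,03,...}.
    day_part = padded[0] if len(padded) == 1 else "{" + ",".join(padded) + "}"
    return f"{base.rstrip('/')}/{year:04d}/{month:02d}/{day_part}/{filename}"


# Intended use (requires a SparkSession named `session`):
# df = session.read.parquet(build_day_glob("s3a://dev", 2017, 1, [2, 3]))
```

As far as I know, alternatives that match nothing are skipped silently, but Spark still raises an error if the pattern as a whole matches no file at all, so you may want to handle that case separately.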

But in my opinion this isn't an elegant way of working with data partitioned by time (by day in your case). If you are able to rename the directories like this:

s3a://dev/2017/01/03/data.parquet --> s3a://dev/day=2017-01-03/data.parquet

s3a://dev/2017/01/02/data.parquet --> s3a://dev/day=2017-01-02/data.parquet
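The mapping above is mechanical, so the new key for each file can be computed from the old one. The following is only a sketch of that string rewrite, assuming every path ends in yyyy/mm/dd/filename; the function name to_partitioned_path is hypothetical, and the actual copy or move on S3 is left to whatever tool you already use.

```python
import re

# Matches <base>/<yyyy>/<mm>/<dd>/<filename>; this layout is an assumption
# taken from the example paths above.
_OLD_LAYOUT = re.compile(
    r"^(?P<base>.+)/(?P<y>\d{4})/(?P<m>\d{2})/(?P<d>\d{2})/(?P<name>[^/]+)$"
)


def to_partitioned_path(path):
    """Rewrite .../2017/01/03/data.parquet to .../day=2017-01-03/data.parquet."""
    match = _OLD_LAYOUT.match(path)
    if match is None:
        raise ValueError(f"path does not follow the yyyy/mm/dd layout: {path}")
    return "{base}/day={y}-{m}-{d}/{name}".format(**match.groupdict())
```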

then you can take advantage of Spark's partition discovery and read the data with:

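from pyspark.sql.functions import col

# Read the whole dataset, then keep only the requested day partitions.
session.read.parquet('s3a://dev/') \
    .where(col('day').between('2017-01-02', '2017-01-03'))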

This approach also skips empty or non-existent directories. An additional column, day, will appear in your DataFrame (it will be a string in Spark < 2.1.0 and a date type in Spark >= 2.1.0), so you will know which directory each record came from.
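Because the type of day depends on the Spark version, it can help to make the read independent of it. The sketch below wraps the partitioned read in a function; read_day_range and its parameters are hypothetical, session is assumed to be a pyspark.sql.SparkSession, and the only Spark-specific setting used is the documented option spark.sql.sources.partitionColumnTypeInference.enabled.

```python
from datetime import date

# Documented Spark SQL option controlling partition column type inference.
INFER_KEY = "spark.sql.sources.partitionColumnTypeInference.enabled"


def read_day_range(session, base_path, first_day, last_day, infer_types=True):
    """Read a day=YYYY-MM-DD partitioned dataset and keep one date range.

    Hypothetical helper; `session` is assumed to be a SparkSession.
    """
    # Validate the bounds so only well-formed ISO dates reach the filter.
    first = date.fromisoformat(first_day)
    last = date.fromisoformat(last_day)
    if first > last:
        raise ValueError("first_day must not be after last_day")
    # With "false", partition columns such as `day` are read as strings.
    session.conf.set(INFER_KEY, "true" if infer_types else "false")
    df = session.read.parquet(base_path)
    # A SQL string filter works whether `day` is a string or a date.
    return df.where(
        f"day BETWEEN '{first.isoformat()}' AND '{last.isoformat()}'"
    )
```

Calling read_day_range(session, 's3a://dev/', '2017-01-02', '2017-01-03') should give the same result as the snippet above; passing infer_types=False is meant for code that expects day to stay a string.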
