How Spark Reads Large Numbers of Small Files


javaisGod_s


Category: Big Data. Tags: big data, spark

Article link: https://blog.csdn.net/sijiwang95/article/details/129112118


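In real projects, the input data sometimes consists of a very large number of small files (each file holds little data, for example a few KB up to a few tens of MB). If every file is read into its own RDD partition, the job ends up with a huge number of tiny partitions, and the computation becomes slow and inefficient. For this case, SparkContext provides the wholeTextFiles method (a method, not a class), which is intended for reading small files: each file becomes a single (path, content) record, and several files can be packed into the same partition.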

/**
 * Read a directory of text files from HDFS or a local file system.
 * Each file is read as a single record and returned in a key-value pair,
 * where the key is the path of each file and the value is the content of each file.
 */
def wholeTextFiles(
    path: String,                              // directory that holds the files
    minPartitions: Int = defaultMinPartitions  // suggested minimum number of RDD partitions
): RDD[(String, String)]
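To make the signature above concrete, here is a minimal sketch of how wholeTextFiles might be called from a local Spark application. The object name `SmallFilesLineCount`, the helper `countLines`, the input directory `datas/small-files/`, and the `local[2]` master are all assumptions made for illustration; they do not come from the original article.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Hypothetical example object; all names here are illustrative only.
object SmallFilesLineCount {

  // Counts the non-blank lines in the content of one file.
  def countLines(content: String): Int =
    content.split("\n").count(_.trim.nonEmpty)

  def main(args: Array[String]): Unit = {
    // Assumption: run locally with two threads.
    val conf = new SparkConf().setAppName("SmallFilesLineCount").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // Assumption: a directory that contains many small text files.
      val inputDir = "datas/small-files/"

      // Each element is (file path, whole file content).
      // minPartitions is only a suggested minimum, not an exact partition count.
      val files: RDD[(String, String)] = sc.wholeTextFiles(inputDir, minPartitions = 2)

      // Usually far fewer partitions than files.
      println(s"Number of partitions: ${files.getNumPartitions}")

      // Process each file's content while keeping its path as the key.
      val lineCounts: RDD[(String, Int)] = files.mapValues(countLines)
      lineCounts.take(10).foreach { case (path, n) => println(s"$path: $n lines") }
    } finally {
      sc.stop()
    }
  }
}
```

Because every file is loaded into memory as one string, this approach suits small files only. If a line-based RDD is needed afterwards, something like `files.flatMap { case (_, content) => content.split("\n") }` would produce one.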
