PySpark: reading multiple csv files into a single dataframe (or RDD?)

I've got a Spark 2.0.2 cluster that I'm hitting via PySpark through Jupyter Notebook. I have multiple pipe-delimited txt files (loaded into HDFS, but also available in a local directory) that I need to load into three separate dataframes using spark-csv, depending on the name of the file.

I see three approaches I can take. First, I could use Python to somehow iterate through the HDFS directory (I haven't figured out how to do this yet), load each file, and then do a union.

Second, I know that there is some wildcard functionality (see here) in Spark that I can probably leverage.

Lastly, I could use pandas to load the vanilla csv files from disk as pandas dataframes and then create Spark dataframes from them. The downside here is that these files are large, and loading them into memory on a single node could take ~8 GB (that's why this is moving to a cluster in the first place).

Here is the code I have so far and some pseudo code for the two methods:

import os
import findspark
findspark.init()

import pyspark
from pyspark.sql import SparkSession
import pandas as pd

sc = pyspark.SparkContext(appName='claims_analysis', master='spark://someIP:7077')
spark = SparkSession(sc)

#METHOD 1 - iterate over HDFS directory
for currFile in os.listdir('HDFS:///someDir//'):
    if 'claim' in currFile:
        ...  # create claim_df or unionAll to merge into it
    if 'pharm' in currFile:
        ...  # create pharm_df or unionAll to merge into it
    if 'service' in currFile:
        ...  # create service_df or unionAll to merge into it

#METHOD 2 - some kind of wildcard functionality
claim_df = spark.read.format('com.databricks.spark.csv').options(delimiter='|', header='true', nullValue='null').load('HDFS:///someDir//*.csv')
pharm_df = spark.read.format('com.databricks.spark.csv').options(delimiter='|', header='true', nullValue='null').load('HDFS:///someDir//*.csv')
service_df = spark.read.format('com.databricks.spark.csv').options(delimiter='|', header='true', nullValue='null').load('HDFS:///someDir//*.csv')

#METHOD 3 - load to a pandas df and then convert to spark df
for currFile in os.listdir('HDFS:///someDir//'):
    pd_df = pd.read_csv(currFile, sep='|')
    df = spark.createDataFrame(pd_df)
    if 'claim' in currFile:
        ...  # create claim_df or unionAll to merge into it
    if 'pharm' in currFile:
        ...  # create pharm_df or unionAll to merge into it
    if 'service' in currFile:
        ...  # create service_df or unionAll to merge into it

Does anyone know how to implement method 1 or 2? I haven't been able to figure these out. Also, I was surprised that there isn't a better way to get csv files loaded into a PySpark dataframe: using a third-party package for something that seems like it should be a native feature confused me (did I just miss the standard use case for loading csv files into a dataframe?).

Ultimately, I'm going to write a consolidated single dataframe back to HDFS (using .write.parquet()) so that I can then clear the memory and do some analytics using MLlib. If the approach I've highlighted isn't best practice, I would appreciate a push in the right direction!

Solution

Approach 1:

In Python you cannot directly refer to an HDFS location; you need the help of another library such as pydoop (in Scala and Java you have HDFS APIs). Even with pydoop, you will be reading the files one by one. It is bad to read files one by one instead of using the parallel reading that Spark provides.
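
For completeness, here is a rough sketch of what Approach 1 could look like (only the 'claim' branch shown). It assumes pydoop is installed and that the directory and the 'claim' naming convention are the ones from the question; the point to notice is that the listing and the per-file union are all driven sequentially from the driver, which is exactly the drawback described above:

import pydoop.hdfs as hdfs

# List the HDFS directory from Python (pydoop talks to HDFS; plain os.listdir cannot).
csv_paths = [p for p in hdfs.ls('/someDir') if p.endswith('.csv')]

claim_df = None
for path in csv_paths:
    if 'claim' in path:
        df = (spark.read.format('com.databricks.spark.csv')
                   .options(delimiter='|', header='true', nullValue='null')
                   .load(path))
        # Grow claim_df one file at a time -- the per-file union pattern
        # that the wildcard read in Approach 2 avoids.
        claim_df = df if claim_df is None else claim_df.union(df)
# pharm_df and service_df would follow the same pattern.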

Approach 2:

You should be able to point the reader at multiple files, either with a comma-separated list of paths or with a wildcard; that way Spark takes care of reading the files and distributing them into partitions. But if you instead read each file dynamically and union the resulting dataframes one by one, there is an edge case: with a lot of files the accumulated list can become huge at the driver and cause memory issues, mainly because that per-file read setup still happens at the driver.
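
For example, the DataFrame reader accepts an explicit list of paths in a single call, so Spark plans one distributed read over all of them instead of one read per file. A minimal sketch; the file names below are placeholders, not from the original post:

# One read call over an explicit list of paths (placeholder file names).
claim_df = (spark.read.format('com.databricks.spark.csv')
                 .options(delimiter='|', header='true', nullValue='null')
                 .load(['hdfs:///someDir/claims_2016.csv',
                        'hdfs:///someDir/claims_2017.csv']))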

The wildcard form is better still: Spark will read all the files matching the pattern and turn them into partitions. You get one RDD (or dataframe) for all the wildcard matches, and from there you don't need to worry about unioning individual RDDs.

Sample code snippet:

distFile = sc.textFile("/hdfs/path/to/folder/fixed_file_name_*.csv")
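
The same wildcard works with the DataFrame reader the question is already using, which keeps the header and delimiter handling. A sketch, assuming (this is not stated in the question) that the file names start with claim/pharm/service; the helper and paths are illustrative only, and the last line is the parquet write the asker plans to do anyway:

def read_pipe_csv(pattern):
    # Illustrative helper: one distributed CSV read over every file matching the pattern.
    return (spark.read.format('com.databricks.spark.csv')
                 .options(delimiter='|', header='true', nullValue='null')
                 .load(pattern))

claim_df   = read_pipe_csv('hdfs:///someDir/claim*.csv')
pharm_df   = read_pipe_csv('hdfs:///someDir/pharm*.csv')
service_df = read_pipe_csv('hdfs:///someDir/service*.csv')

# Consolidated result written back to HDFS as parquet, as planned in the question.
claim_df.write.parquet('hdfs:///someDir/parquet/claims')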

Approach 3:

Unless you have some legacy application in Python that relies on pandas features, I would prefer using the Spark-provided API.
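
On the side question about needing a third-party package: since Spark 2.0 the csv data source ships with Spark itself, so on a 2.0.2 cluster the com.databricks.spark.csv format string (and the pandas detour) is not needed at all. A minimal sketch with the built-in reader; the path is a placeholder:

# Built-in csv reader (Spark 2.0+): no external spark-csv package required.
claim_df = spark.read.csv('hdfs:///someDir/claim*.csv',
                          sep='|', header=True, nullValue='null')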
