Environment: PyCharm 2020.1 + Spark 2.3.3 (single node, default configuration)
Requirement: convert the data in a DataFrame to JSON and write it to an S3 object file; the data size is 180.1 MB.
Approach: fix the GC problem without changing any resource parameters (this is only one of several possible methods).
The core code that fails is as follows:
import datetime

from pyspark import SparkContext
from pyspark.sql import SparkSession

# getOneMapConf, getSparkConf, build_s3 and update_es_resource are the author's own helper functions
if __name__ == "__main__":
    """
    1. load the conf of es
    2. build spark or sc
    3. build s3 client
    """
    esConf, s3Conf = getOneMapConf(r"onemap.json")
    conf = getSparkConf(esConf["user"], esConf["passwd"], esConf["nodes"], esConf["port"])
    s3Client = build_s3(access_key=s3Conf["access_key"], secret_key=s3Conf["secret_key"], endpoint_url=s3Conf["endpoint_url"])
    sc = SparkContext(conf=conf)
    spark = SparkSession(sc)
    # get elasticsearch data
    dt = datetime.datetime.strftime(datetime.datetime.now() + datetime.timedelta(days=-1), "%Y-%m-%d")
    print(dt)
    resource, indexDt = update_es_resource(esConf["resource"])
    esDf_tmp = spark.read.format("org.elasticsearch.spark.sql").load("index/type")
    # collect() pulls the whole dataset back to the driver -- this is the line that blows up
    esDf = esDf_tmp.toJSON().collect()
    # write es data into the store of s3
    s3Client.put_object(Bucket=s3Conf["bucket"], Key="/".join([esConf["savePath"], dt, "{}_{}.json".format(esConf["saveTable"], dt)]), Body="\n".join(esDf))
    spark.stop()
The exact failing line is esDf = esDf_tmp.toJSON().collect(). It pulls the entire dataset back to the driver side, and when driver memory is not large enough this ends in "GC overhead limit exceeded".
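As an aside, if the S3A connector (hadoop-aws) happens to be available, Spark's own DataFrameWriter can write the JSON part files directly from the executors, so nothing ever passes through the driver. This is only a minimal sketch under that assumption; the s3a path and the Hadoop credential keys below are not part of the original setup:

# alternative sketch, assuming hadoop-aws / S3A is on the classpath (not the author's method)
spark._jsc.hadoopConfiguration().set("fs.s3a.access.key", s3Conf["access_key"])
spark._jsc.hadoopConfiguration().set("fs.s3a.secret.key", s3Conf["secret_key"])
spark._jsc.hadoopConfiguration().set("fs.s3a.endpoint", s3Conf["endpoint_url"])
# each executor writes its own partitions as JSON part files; the driver never holds the data
esDf_tmp.write.mode("overwrite").json("s3a://{}/{}/{}".format(s3Conf["bucket"], esConf["savePath"], dt))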
Keeping the original resource parameters untouched, the improved version is as follows:
import datetime
import random
import sys

from pyspark import SparkContext
from pyspark.sql import SparkSession

if __name__ == "__main__":
    """
    1. load the conf of es
    2. build spark or sc
    3. build s3 client
    """
    esConf, s3Conf = getOneMapConf("onemap.json")
    conf = getSparkConf(esConf["user"], esConf["passwd"], esConf["nodes"], esConf["port"])
    s3Client = build_s3(access_key=s3Conf["access_key"], secret_key=s3Conf["secret_key"], endpoint_url=s3Conf["endpoint_url"])
    sc = SparkContext(conf=conf)
    spark = SparkSession(sc)
    # get elasticsearch data
    dt = datetime.datetime.strftime(datetime.datetime.now() + datetime.timedelta(days=-1), "%Y-%m-%d")
    print(dt)
    resource, indexDt = update_es_resource(esConf["resource"])
    esDf_tmp = spark.read.format("org.elasticsearch.spark.sql").load("index/type")
    esDf_tmp.cache()

    def psize(part):
        # runs on the executors: only this partition's rows are held in memory
        ls = []
        r = random.randint(0, 1000000000)
        for i in part:
            ls.append(str(i.asDict()))
        print("total size:", sys.getsizeof(ls))
        # conf = getSparkConf(esConf["user"], esConf["passwd"], esConf["nodes"], esConf["port"])
        # the S3 client is created here, once per partition, on the executor
        s3Client = build_s3(access_key=s3Conf["access_key"], secret_key=s3Conf["secret_key"], endpoint_url=s3Conf["endpoint_url"])
        s3Client.put_object(Bucket=s3Conf["bucket"], Key="/".join([esConf["savePath"], dt, "{}_{}_{}.json".format(esConf["saveTable"], dt, r)]), Body="\n".join(ls))

    esDf_tmp.foreachPartition(lambda p: psize(p))
    # write es data into the store of s3
    # s3Client.put_object(Bucket=s3Conf["bucket"], Key="/".join([esConf["savePath"], dt, "{}_{}.json".format(esConf["saveTable"], dt)]), Body="\n".join(esDf))
The actual fix is foreachPartition plus the user-defined psize function. The direction is right, although the implementation can still be tightened up (see the sketch after the notes below).
Notes on the psize function:
1. foreachPartition hands psize an iterator over a single partition; every Row in it (e.g. Row(id=1, name="guo")) is taken out and appended to a list;
2. a separate s3Client object is then created for each partition,
3. and that s3Client writes the partition's data to S3 object storage;
4. r = random.randint(0, 1000000000) prevents partitions from overwriting one another's data, which would otherwise happen because they would all produce the same file name;
5. the configuration file (onemap.json) has to be filled in according to your own environment.
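A cleaned-up sketch of the same idea follows. It is only an illustration built on the assumptions above: getOneMapConf/build_s3 are the author's helpers, the config keys are the ones read in the code, str(i.asDict()) is replaced with json.dumps so every line is valid JSON, and TaskContext.partitionId() replaces the random suffix so the file names are deterministic yet still collision-free:

import json
import sys

from pyspark import TaskContext

def write_partition_to_s3(part, esConf, s3Conf, dt):
    # runs on an executor; 'part' is an iterator over the Rows of one partition
    pid = TaskContext.get().partitionId()          # unique per partition, stable across retries
    lines = [json.dumps(row.asDict(), ensure_ascii=False, default=str) for row in part]
    if not lines:                                  # skip empty partitions
        return
    print("partition", pid, "approx list size:", sys.getsizeof(lines))
    # build_s3 is the author's helper; the client must be created here, on the executor
    s3Client = build_s3(access_key=s3Conf["access_key"],
                        secret_key=s3Conf["secret_key"],
                        endpoint_url=s3Conf["endpoint_url"])
    key = "/".join([esConf["savePath"], dt,
                    "{}_{}_{}.json".format(esConf["saveTable"], dt, pid)])
    s3Client.put_object(Bucket=s3Conf["bucket"], Key=key, Body="\n".join(lines))

# usage, assuming esDf_tmp / esConf / s3Conf / dt already exist as in the code above:
# esDf_tmp.foreachPartition(lambda p: write_partition_to_s3(p, esConf, s3Conf, dt))

Using the partition id instead of a random number also makes the output names reproducible between runs of the same job, which is easier to clean up or overwrite later.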