Merge Multiple Files, Then Group and Restructure

【Question】

Here's the problem statement:

In a folder in HDFS, there are a few csv files, each row being a record with the schema (ID, attribute1, attribute2, attribute3).

Any of the columns (except ID) can be null or an empty string, and no two records with the same ID have conflicting non-empty values for the same attribute.

We'd like to merge all records with the same ID and write the merged records back to HDFS. For example:

Record R1: ID = 1, attribute1 = "hello", attribute2 = null, attribute3 = "";

Record R2: ID = 1, attribute1 = null, attribute2 = null, attribute3 = "testa";

Record R3: ID = 1, attribute1 = null, attribute2 = "okk", attribute3 = "testa";

The merged record should be: ID = 1, attribute1 = "hello", attribute2 = "okk", attribute3 = "testa".

I'm just starting to learn Spark. Could anybody share some thoughts on how to write this in Java with Spark? Thanks!

Here are some sample csv files:

file1.csv:

ID,str1,str2,str3
1,hello,,

file2.csv:

ID,str1,str2,str3
1,,,testa

file3.csv:

ID,str1,str2,str3
1,,okk,testa

The merged file should be:

ID,str1,str2,str3
1,hello,okk,testa

It's known beforehand that there won't be any conflicts on any fields.

Thanks!

【Answer】

Restating the problem: there are N files amounting to N records, which logically fall into M groups by ID. Each group is consolidated into a single record, where each of fields 2-4 takes the first non-empty value of that field within the group; if all values are empty, the field stays empty.
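Since the question asks for Java with Spark, here is a minimal sketch of that route using Spark SQL (the HDFS address and column names are taken from the question; the app name, the output folder, and the defensive empty-string normalization are assumptions):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import static org.apache.spark.sql.functions.*;

public class MergeById {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("MergeById").getOrCreate();

        // Reading the folder picks up every csv file in it;
        // header:true uses the first row as column names.
        Dataset<Row> df = spark.read().option("header", "true")
                .csv("hdfs://192.168.1.210:9000/user/hfiles/");

        // Normalize empty strings to null so first(..., ignoreNulls) skips them.
        // (Spark's csv reader typically reads empty fields as null already; this is defensive.)
        for (String c : new String[]{"str1", "str2", "str3"}) {
            df = df.withColumn(c, when(col(c).notEqual(""), col(c)));
        }

        // One record per ID, taking the first non-null value of each attribute.
        Dataset<Row> merged = df.groupBy("ID").agg(
                first("str1", true).alias("str1"),
                first("str2", true).alias("str2"),
                first("str3", true).alias("str3"));

        // Note: Spark writes a directory of part files rather than a single csv.
        merged.write().option("header", "true")
                .csv("hdfs://192.168.1.210:9000/user/hfiles/merged");

        spark.stop();
    }
}

first(column, ignoreNulls) returns the first non-null value Spark encounters in each group; the ordering within a group is not deterministic, but since the question guarantees there are no conflicting non-empty values, any non-null value is the right one.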

Still, writing this kind of code in Java (with the Spark libraries) is relatively involved. Consider implementing it in SPL instead: the code is simple and easy to understand, and it can also access HDFS directly:

A1=["file1.csv","file2.csv","file3.csv"].("hdfs://192.168.1.210:9000/user/hfiles/"+~)
A2=hdfs_client(;"hdfs://192.168.1.210:9000")
A3=A1.conj(hdfs_file(A2,~).import@ct())
A4=A3.group(#1)
A5=A4.new(#1,~.(#2).select@1(~),~.(#3).select@1(~),~.(#4).select@1(~))
A6=hdfs_file(A2,"/user/hfiles/result.csv").export@tc(A5)

A1: Build the sequence of full file-path strings.

A2: Connect to the HDFS file system.

A3: Read the contents of each file and concatenate all the data together.

A4: Group by the first column (ID).

A5: Consolidate each group into one record; each of fields 2-4 takes the first non-empty value of that field within its group.

A5 can also be shortened to: =A4.new(#1,${to(2,4).("~.(#"/~/").select@1(~)").concat@c()})

A6: Write the merged records to the result file.

The above code is easy to integrate with Java (see "How to Call an SPL Script from Java").
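As a sketch of that integration (assuming the esProc JDBC driver is on the classpath and the script above has been saved as mergeById.splx in the configured search path; the script name is made up for illustration):

import java.sql.CallableStatement;
import java.sql.Connection;
import java.sql.DriverManager;

public class CallSpl {
    public static void main(String[] args) throws Exception {
        // Load esProc's embedded JDBC driver and open an in-process connection.
        Class.forName("com.esproc.jdbc.InternalDriver");
        try (Connection conn = DriverManager.getConnection("jdbc:esproc:local://");
             CallableStatement st = conn.prepareCall("call mergeById()")) {
            // The script itself writes result.csv to HDFS, so no result set is read here.
            st.execute();
        }
    }
}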
