I'm new to this, and I want to use group-by & reduce to find the following from a CSV (one record per line):
Department, Designation, costToCompany, State
Sales, Trainee, 12000, UP
Sales, Lead, 32000, AP
Sales, Lead, 32000, LA
Sales, Lead, 32000, TN
Sales, Lead, 32000, AP
Sales, Lead, 32000, TN
Sales, Lead, 32000, LA
Sales, Lead, 32000, LA
Marketing, Associate, 18000, TN
Marketing, Associate, 18000, TN
HR, Manager, 58000, TN
I would like to simplify the CSV into a grouped view, by Department, Designation and State, with additional columns sum(costToCompany) and TotalEmployeeCount.
The result should look like:
Dept, Desg, state, empCount, totalCost
Sales,Lead,AP,2,64000
Sales,Lead,LA,3,96000
Sales,Lead,TN,2,64000
Is there a way to achieve this using transformations and actions, or do we have to go for RDD operations?
Solution:

Procedure

>Create a class (your schema) to encapsulate your structure (it isn't required for approach B, but it will make your code easier to read if you use Java)
public class Record implements Serializable {
String department;
String designation;
long costToCompany;
String state;
// constructor , getters and setters
}
>Load the CSV (or JSON) file
JavaSparkContext sc;
JavaRDD<String> data = sc.textFile("path/input.csv");
//JavaSQLContext sqlContext = new JavaSQLContext(sc); // For previous versions
SQLContext sqlContext = new SQLContext(sc); // In Spark 1.3 the Java API and Scala API have been unified
JavaRDD<Record> rdd_records = data.map(
    new Function<String, Record>() {
        public Record call(String line) throws Exception {
            // Here you could use JSON instead, e.g.:
            // Gson gson = new Gson();
            // gson.fromJson(line, Record.class);
            String[] fields = line.split(",");
            // costToCompany is a long, so parse it rather than passing the raw string
            return new Record(fields[0].trim(), fields[1].trim(),
                    Long.parseLong(fields[2].trim()), fields[3].trim());
        }
    });
At this point you have two approaches:
A. SparkSQL
>Register the table (using the Schema class you defined)
JavaSchemaRDD table = sqlContext.applySchema(rdd_records, Record.class);
table.registerAsTable("record_table");
table.printSchema();
>Query the table with the group-by query you need
JavaSchemaRDD res = sqlContext.sql(
    "select department, designation, state, sum(costToCompany), count(*) "
    + "from record_table "
    + "group by department, designation, state");
>Here you would also be able to run any other query you want, using the SQL approach
B. Spark
>Map using a composite key: Department, Designation, State
JavaPairRDD<String, Tuple2<Long, Integer>> records_JPRDD =
rdd_records.mapToPair(new
    PairFunction<Record, String, Tuple2<Long, Integer>>(){
        public Tuple2<String, Tuple2<Long, Integer>> call(Record record){
            Tuple2<String, Tuple2<Long, Integer>> t2 =
            new Tuple2<String, Tuple2<Long, Integer>>(
                // a ";" separator keeps the composite key unambiguous
                record.department + ";" + record.designation + ";" + record.state,
                new Tuple2<Long, Integer>(record.costToCompany, 1)
            );
            return t2;
        }
    });
>reduceByKey using the composite key, summing the costToCompany column and accumulating the number of records per key
JavaPairRDD<String, Tuple2<Long, Integer>> final_rdd_records =
    records_JPRDD.reduceByKey(new Function2<Tuple2<Long, Integer>,
    Tuple2<Long, Integer>, Tuple2<Long, Integer>>() {
        public Tuple2<Long, Integer> call(Tuple2<Long, Integer> v1,
                Tuple2<Long, Integer> v2) throws Exception {
            return new Tuple2<Long, Integer>(v1._1 + v2._1, v1._2 + v2._2);
        }
    });
Tags: java, apache-spark, apache-spark-sql, hadoop, hdfs
Source: https://codeday.me/bug/20190926/1821001.html