Getting Started with Spark in Java: A First Example

Starting to learn a bit of Spark. I put together a first small example and am writing it down here ^_^

Background

There is a refund file that looks like this:

仅退款,E20190201001,I001,0.01,0.01
退货退款,E20190201002,I002,0.01,0.01
退货退款,E20190201003,I003,1.2,1.2
退货退款,E20190201004,I004,10.9,10.9
仅退款,E20190201004,I005,10.9,10.9
仅退款,E20190201005,I006,2,1
仅退款,E20190201006,I007,0.18,0.05

(The columns are: refund type — 仅退款 means "refund only", 退货退款 means "return and refund" — order number, item title, order amount, refund amount. The file is GBK-encoded, which matters later.)

We will use Spark to process it.

The pom file

Use the latest version of Spark (2.4.0 at the time of writing), plus opencsv to parse the CSV lines.

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.12</artifactId>
    <version>2.4.0</version>
</dependency>
<dependency>
    <groupId>com.thoughtworks.paranamer</groupId>
    <artifactId>paranamer</artifactId>
    <version>2.8</version>
</dependency>
<dependency>
    <groupId>net.sf.opencsv</groupId>
    <artifactId>opencsv</artifactId>
    <version>2.3</version>
</dependency>
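One note: the RefundInfo class below uses Lombok's @Data annotation to generate its getters and setters, so a Lombok dependency (groupId org.projectlombok, artifactId lombok; any recent version should do) also needs to be declared in the pom.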

Code

BasicSpark.java

package zzz.spark;

import au.com.bytecode.opencsv.CSVReader;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import java.io.IOException;
import java.io.StringReader;
import java.util.stream.Collectors;
import java.util.stream.StreamSupport;

public class BasicSpark {

    public static void main(String[] args) {
        JavaSparkContext sc = buildSparkContext();

        // Read the GBK-encoded file line by line (see P4 below for why textFile is not used).
        JavaRDD<String> rdd = sc.hadoopFile("/Users/shuqin/Downloads/refund.csv", TextInputFormat.class,
                LongWritable.class, Text.class)
                .map(pair -> new String(pair._2().getBytes(), 0, pair._2().getLength(), "GBK"));

        // Parse each CSV line into a RefundInfo object.
        JavaRDD<RefundInfo> refundInfos = rdd.map(BasicSpark::parseLine).map(RefundInfo::from);
        System.out.println("refund info number: " + refundInfos.count());

        // Keep only the records whose order amount is at least 10.
        JavaRDD<RefundInfo> filtered = refundInfos.filter(refundInfo -> refundInfo.getRealPrice() >= 10);
        System.out.println("realPrice >= 10: " + filtered.collect().stream()
                .map(RefundInfo::getOrderNo).collect(Collectors.joining(",")));

        // Group by refund type, then aggregate per group.
        JavaPairRDD<String, Iterable<RefundInfo>> grouped = refundInfos.groupBy(RefundInfo::getType);

        JavaPairRDD<String, Double> groupedRealPaySumRDD = grouped.mapValues(infos ->
                StreamSupport.stream(infos.spliterator(), false).mapToDouble(RefundInfo::getRealPrice).sum());
        System.out.println("groupedRealPaySum: " + groupedRealPaySumRDD.collectAsMap());

        JavaPairRDD<String, Long> groupedNumberRDD = grouped.mapValues(infos ->
                StreamSupport.stream(infos.spliterator(), false).count());
        System.out.println("groupedNumber: " + groupedNumberRDD.collectAsMap());
    }

    public static JavaSparkContext buildSparkContext() {
        SparkConf sparkConf = new SparkConf().setMaster("local").setAppName("learningSparkInJava")
                .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer");
        return new JavaSparkContext(sparkConf);
    }

    // Parse a single CSV line into its fields; returns an empty array on failure.
    public static String[] parseLine(String line) {
        try {
            CSVReader reader = new CSVReader(new StringReader(line));
            return reader.readNext();
        } catch (IOException e) {
            return new String[0];
        }
    }
}

RefundInfo.java

package zzz.spark;

import lombok.Data;

@Data
public class RefundInfo {

    private String type;       // refund type (仅退款 / 退货退款)
    private String orderNo;    // order number
    private String goodsTitle; // item title
    private Double realPrice;  // order amount
    private Double refund;     // refund amount

    // Build a RefundInfo from one parsed CSV line; returns null for malformed lines.
    public static RefundInfo from(String[] arr) {
        if (arr == null || arr.length != 5) {
            return null;
        }
        RefundInfo refundInfo = new RefundInfo();
        refundInfo.setType(arr[0]);
        refundInfo.setOrderNo(arr[1]);
        refundInfo.setGoodsTitle(arr[2]);
        refundInfo.setRealPrice(Double.valueOf(arr[3]));
        refundInfo.setRefund(Double.valueOf(arr[4]));
        return refundInfo;
    }
}

Explanation

Java Spark has two kinds of operations: transformations, which turn one RDD into another RDD and are lazy, and actions, which actually compute a result from an RDD.
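As a minimal sketch of the difference (a standalone toy example, not part of the refund program; it reuses buildSparkContext from BasicSpark above):

package zzz.spark;

import java.util.Arrays;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class LazyDemo {
    public static void main(String[] args) {
        JavaSparkContext sc = BasicSpark.buildSparkContext();
        // map and filter are transformations: nothing is computed here,
        // Spark only records the lineage of the resulting RDD.
        JavaRDD<Integer> doubled = sc.parallelize(Arrays.asList(1, 2, 3, 4))
                .map(n -> n * 2)
                .filter(n -> n > 4);
        // count is an action: only now does Spark actually run the pipeline.
        System.out.println(doubled.count()); // prints 2
        sc.stop();
    }
}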

There are also two kinds of RDDs: list-like ones (JavaRDD) and key/value ones (JavaPairRDD).
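For example, the list-like refundInfos (a JavaRDD<RefundInfo>) can be turned into a key/value RDD keyed by refund type with mapToPair. This is just a sketch of an alternative to the groupBy used above, not part of the original program (it additionally needs import scala.Tuple2 and import java.util.Map):

// Key each record by its refund type.
JavaPairRDD<String, Double> byType = refundInfos
        .mapToPair(r -> new Tuple2<>(r.getType(), r.getRealPrice()));
// reduceByKey sums the order amounts per type without materializing the groups.
Map<String, Double> sumsByType = byType.reduceByKey(Double::sum).collectAsMap();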

The full code is above; readers with some Java Stream background should have no trouble following it.

Troubleshooting

P1. Exception in thread "main" java.lang.NoSuchMethodError: scala.Product.$init$(Lscala/Product;)V

Solution: change scala-library from 2.11.8 to 2.12.0-RC2, so that it matches the _2.12 suffix of the spark-core_2.12 artifact.

P2. Exception in thread "main" java.lang.IllegalAccessError: tried to access method com.google.common.base.Stopwatch.<init>()V from class org.apache.hadoop.mapred.FileInputFormat.

Solution: downgrade guava from 23.0 to 15.0 (the Hadoop code pulled in by this Spark version still calls the public Stopwatch constructor that newer Guava removed).

P3. object not serializable.

Solution: configure Kryo serialization on the SparkConf (not on the JavaSparkContext itself): new SparkConf().set("spark.serializer", "org.apache.spark.serializer.KryoSerializer"), as done in buildSparkContext above.
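For reference, a sketch of the relevant configuration (the registerKryoClasses call is optional and not in the original code; it just tells Kryo up front about the classes that will be shipped between tasks):

SparkConf conf = new SparkConf()
        .setMaster("local")
        .setAppName("learningSparkInJava")
        // switch from Java serialization to Kryo
        .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
        // optional: pre-register the data classes with Kryo
        .registerKryoClasses(new Class<?>[]{ RefundInfo.class });
JavaSparkContext sc = new JavaSparkContext(conf);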

P4. Handling Chinese text (GBK encoding)

Use sc.hadoopFile(path, TextInputFormat.class, LongWritable.class, Text.class).map(pair -> new String(pair._2().getBytes(), 0, pair._2().getLength(), "GBK")) rather than sc.textFile(path): textFile decodes the file as UTF-8, so a GBK-encoded file full of Chinese text would come out garbled.
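If this read is needed in more than one place, it can be pulled into a small helper method; gbkTextFile is a hypothetical name, and the imports are the same Hadoop/Spark ones already used in BasicSpark:

// Hypothetical helper: read a GBK-encoded text file as a JavaRDD<String>.
public static JavaRDD<String> gbkTextFile(JavaSparkContext sc, String path) {
    return sc.hadoopFile(path, TextInputFormat.class, LongWritable.class, Text.class)
            .map(pair -> new String(pair._2().getBytes(), 0, pair._2().getLength(), "GBK"));
}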

[To be continued]


Source: https://www.cnblogs.com/lovesqcc/p/10367168.html
