Starting to learn a bit of Spark. Here is my first small example, written down for the record ^_^
Background
There is a refund file (a GBK-encoded CSV) like the following. The columns are refund type (仅退款 = refund only, 退货退款 = return-and-refund), order number, item name, order amount, and refund amount:
仅退款,E20190201001,I001,0.01,0.01
退货退款,E20190201002,I002,0.01,0.01
退货退款,E20190201003,I003,1.2,1.2
退货退款,E20190201004,I004,10.9,10.9
仅退款,E20190201004,I005,10.9,10.9
仅退款,E20190201005,I006,2,1
仅退款,E20190201006,I007,0.18,0.05
I plan to process it with Spark.
The pom file
Use the latest Spark release (2.4.0 at the time of writing), plus opencsv to read the CSV file.
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.12</artifactId>
    <version>2.4.0</version>
</dependency>
<dependency>
    <groupId>com.thoughtworks.paranamer</groupId>
    <artifactId>paranamer</artifactId>
    <version>2.8</version>
</dependency>
<dependency>
    <groupId>net.sf.opencsv</groupId>
    <artifactId>opencsv</artifactId>
    <version>2.3</version>
</dependency>
Code
BasicSpark.java
package zzz.spark;

import au.com.bytecode.opencsv.CSVReader;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import java.io.IOException;
import java.io.StringReader;
import java.util.Objects;
import java.util.stream.Collectors;
import java.util.stream.StreamSupport;

public class BasicSpark {

    public static void main(String[] args) {
        JavaSparkContext sc = buildSparkContext();

        // Read the GBK-encoded file as raw Text records and decode explicitly
        // (sc.textFile would decode the bytes as UTF-8 and garble the Chinese; see P4 below).
        JavaRDD<String> rdd = sc.hadoopFile("/Users/shuqin/Downloads/refund.csv", TextInputFormat.class,
                LongWritable.class, Text.class)
                .map(pair -> new String(pair._2.getBytes(), 0, pair._2.getLength(), "GBK"));

        // Parse each CSV line into a RefundInfo; drop records that failed to parse.
        JavaRDD<RefundInfo> refundInfos = rdd.map(BasicSpark::parseLine)
                .map(RefundInfo::from)
                .filter(Objects::nonNull);
        System.out.println("refund info number: " + refundInfos.count());

        // Orders whose amount is at least 10.
        JavaRDD<RefundInfo> filtered = refundInfos.filter(refundInfo -> refundInfo.getRealPrice() >= 10);
        System.out.println("realPrice >= 10: " + filtered.collect().stream()
                .map(RefundInfo::getOrderNo).collect(Collectors.joining(",")));

        // Group by refund type, then aggregate each group with the Java Stream API.
        JavaPairRDD<String, Iterable<RefundInfo>> grouped = refundInfos.groupBy(RefundInfo::getType);
        JavaPairRDD<String, Double> groupedRealPaySumRDD = grouped.mapValues(
                infos -> StreamSupport.stream(infos.spliterator(), false)
                        .mapToDouble(RefundInfo::getRealPrice).sum());
        System.out.println("groupedRealPaySum: " + groupedRealPaySumRDD.collectAsMap());

        JavaPairRDD<String, Long> groupedNumberRDD = grouped.mapValues(
                infos -> StreamSupport.stream(infos.spliterator(), false).count());
        System.out.println("groupedNumber: " + groupedNumberRDD.collectAsMap());
    }

    public static JavaSparkContext buildSparkContext() {
        SparkConf sparkConf = new SparkConf().setMaster("local").setAppName("learningSparkInJava")
                .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer");
        return new JavaSparkContext(sparkConf);
    }

    public static String[] parseLine(String line) {
        // try-with-resources so the reader is always closed
        try (CSVReader reader = new CSVReader(new StringReader(line))) {
            return reader.readNext();
        } catch (IOException e) {
            return new String[0];
        }
    }
}
RefundInfo.java
package zzz.spark;

import lombok.Data;

@Data
public class RefundInfo {

    private String type;       // refund type
    private String orderNo;    // order number
    private String goodsTitle; // item name
    private Double realPrice;  // order amount
    private Double refund;     // refund amount

    // Build a RefundInfo from a parsed CSV line; returns null on malformed input.
    public static RefundInfo from(String[] arr) {
        if (arr == null || arr.length != 5) {
            return null;
        }
        RefundInfo refundInfo = new RefundInfo();
        refundInfo.setType(arr[0]);
        refundInfo.setOrderNo(arr[1]);
        refundInfo.setGoodsTitle(arr[2]);
        refundInfo.setRealPrice(Double.valueOf(arr[3]));
        refundInfo.setRefund(Double.valueOf(arr[4]));
        return refundInfo;
    }
}
Explanation
The Java Spark API has two kinds of operations: transformations, which turn one RDD into another and are lazy; and actions, which actually compute a result from an RDD.
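To make the laziness concrete, here is a minimal sketch (assuming the sc built by buildSparkContext() above and an extra import of java.util.Arrays); the side-effect print inside map only fires once the count() action runs:

// Nothing is computed here: map is a lazy transformation.
JavaRDD<Integer> doubled = sc.parallelize(Arrays.asList(1, 2, 3))
        .map(x -> {
            System.out.println("mapping " + x); // prints only when an action triggers the job
            return x * 2;
        });
// count() is an action: it runs the job, and the three "mapping ..." lines appear now.
System.out.println("count: " + doubled.count());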
RDDs here also come in two flavors: the list-like JavaRDD and the map-like (key-value) JavaPairRDD.
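For comparison, a sketch of the map-like form (assuming the refundInfos RDD from main() and an import of scala.Tuple2): mapToPair keys the records, and reduceByKey sums per key, giving the same per-type totals as the groupBy + Stream code above without materializing the groups:

// Sum realPrice per refund type with a keyed reduce instead of groupBy.
JavaPairRDD<String, Double> sumByType = refundInfos
        .mapToPair(info -> new Tuple2<>(info.getType(), info.getRealPrice()))
        .reduceByKey(Double::sum);
System.out.println(sumByType.collectAsMap());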
The full code is above; readers with some Java Stream background should have no trouble following it.
Troubleshooting
P1. Exception in thread "main" java.lang.NoSuchMethodError: scala.Product.$init$(Lscala/Product;)V
Fix: bump scala-library from 2.11.8 to 2.12.0-RC2 (spark-core_2.12 requires Scala 2.12).
P2. Exception in thread "main" java.lang.IllegalAccessError: tried to access method com.google.common.base.Stopwatch.<init>()V from class org.apache.hadoop.mapred.FileInputFormat.
Fix: downgrade guava from 23.0 to 15.0 (newer Guava made the Stopwatch constructor inaccessible to Hadoop's FileInputFormat).
P3. object not serializable.
Fix: set spark.serializer on the SparkConf (not on JavaSparkContext): new SparkConf().set("spark.serializer", "org.apache.spark.serializer.KryoSerializer"), as in buildSparkContext() above.
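For reference, a sketch of a slightly fuller SparkConf (same settings as buildSparkContext() above; the registerKryoClasses call is optional and the class list is my own choice). Registering classes up front lets Kryo skip writing full class names; alternatively, simply making RefundInfo implement java.io.Serializable also resolves the error.

SparkConf sparkConf = new SparkConf()
        .setMaster("local")
        .setAppName("learningSparkInJava")
        // switch from Java serialization to Kryo
        .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
        // optional: pre-register the classes Kryo will serialize
        .registerKryoClasses(new Class<?>[]{RefundInfo.class, String[].class});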
P4. Handling Chinese text
Use sc.hadoopFile(path, TextInputFormat.class, LongWritable.class, Text.class).map(pair -> new String(pair._2.getBytes(), 0, pair._2.getLength(), "GBK")) instead of sc.textFile(path). textFile decodes the bytes as UTF-8, so a GBK-encoded file comes out garbled; hadoopFile exposes the raw Text record, which can then be decoded explicitly.
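To keep main() tidy, the GBK-aware read can be wrapped in a small helper (the name readGbkFile is my own; imports as in BasicSpark.java above):

// Read a GBK-encoded text file, decoding each record explicitly.
public static JavaRDD<String> readGbkFile(JavaSparkContext sc, String path) {
    return sc.hadoopFile(path, TextInputFormat.class, LongWritable.class, Text.class)
             .map(pair -> new String(pair._2.getBytes(), 0, pair._2.getLength(), "GBK"));
}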
[To be continued]
Source: https://www.cnblogs.com/lovesqcc/p/10367168.html