spark 编程

最新推荐文章于 2024-05-16 10:24:30 发布

wandy0211

最新推荐文章于 2024-05-16 10:24:30 发布

阅读量274

点赞数

版权声明：本文为博主原创文章，遵循 CC 4.0 BY-SA 版权协议，转载请附上原文出处链接和本声明。

本文链接：https://blog.csdn.net/wjandy0211/article/details/80509944

版权

Spark program

Quick Start——Spark Shell

./bin/spark-shell

scala> val textFile = spark.read.textFile("README.md")

textFile: org.apache.spark.sql.Dataset[String] = [value: string]

scala> textFile.count() // Number of items in this Dataset

res0: Long = 126 // May be different from yours as README.md will change over time, similar to other outputs

scala> textFile.first() // First item in this Dataset

res1: String = # Apache Spark

scala> val wordCounts = textFile.flatMap(line => line.split(" ")).groupByKey(identity).count()

wordCounts: org.apache.spark.sql.Dataset[(String, Long)] = [value: string, count(1): bigint]

scala> linesWithSpark.cache()

res7: linesWithSpark.type = [value: string]

Applications

We’ll create a very simple Spark application, SimpleApp.java:

/* SimpleApp.java */

import org.apache.spark.sql.SparkSession;

import org.apache.spark.sql.Dataset;

public class SimpleApp {

public static void main(String[] args) {

String logFile = "YOUR_SPARK_HOME/README.md"; // Should be some file on your system

SparkSession spark = SparkSession.builder().appName("Simple Application").getOrCreate();

Dataset<String> logData = spark.read().textFile(logFile).cache();

long numAs = logData.filter(s -> s.contains("a")).count();

long numBs = logData.filter(s -> s.contains("b")).count();

System.out.println("Lines with a: " + numAs + ", lines with b: " + numBs);

spark.stop();

}

}

To build the program, we also write a Maven pom.xml file that lists Spark as a dependency. Note that Spark artifacts are tagged with a Scala version.

<project>

<groupId>edu.berkeley</groupId>

<artifactId>simple-project</artifactId>

<modelVersion>4.0.0</modelVersion>

<name>Simple Project</name>

<packaging>jar</packaging>

<version>1.0</version>

<dependencies>

<dependency> 

<groupId>org.apache.spark</groupId>

<artifactId>spark-sql_2.11</artifactId>

<version>2.3.0</version>

</dependency>

</dependencies>

</project>

We lay out these files according to the canonical Maven directory structure:

$ find .

./pom.xml

./src

./src/main

./src/main/java

./src/main/java/SimpleApp.java

Now, we can package the application using Maven and execute it with ./bin/spark-submit.

# Package a JAR containing your application

$ mvn package

...

[INFO] Building jar: {..}/{..}/target/simple-project-1.0.jar

# Use spark-submit to run your application

$ YOUR_SPARK_HOME/bin/spark-submit \

--class "SimpleApp" \

--master local[4] \

target/simple-project-1.0.jar

...

Lines with a: 46, Lines with b: 23

关注

0
点赞
踩
0

收藏

觉得还不错? 一键收藏
0
评论
spark 编程

Spark programQuick Start——Spark Shell./bin/spark-shellscala&gt; val textFile = spark.read.textFile("README.md")textFile: org.apache.spark.sql.Dataset[String] = [value: string]scala&gt; textFile.count(...
复制链接

扫一扫

评论

被折叠的条评论为什么被折叠?

到【灌水乐园】发言

查看更多评论

添加红包

成就一亿技术人!

hope_wisdom

发出的红包

实付元

使用余额支付

点击重新获取

扫码支付

钱包余额 0

抵扣说明：

1.余额是钱包充值的虚拟货币，按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载，可以购买VIP、付费专栏及课程。