openeuler/spark docker image overview


Quick reference


Current Spark docker images are built on openEuler. This repository is free to use and exempted from per-user rate limits.

Apache Spark is a multi-language engine for executing data engineering, data science, and machine learning on single-node machines or clusters. It provides high-level APIs in Scala, Java, Python, and R, and an optimized engine that supports general computation graphs for data analysis. It also supports a rich set of higher-level tools including Spark SQL for SQL and DataFrames, pandas API on Spark for pandas workloads, MLlib for machine learning, GraphX for graph processing, and Structured Streaming for stream processing.

Learn more on the Spark website: https://spark.apache.org/.

Supported tags and respective Dockerfile links

The tag of each Spark docker image consists of the Spark version and the version of the openEuler base image. For example, the tag 3.4.0-22.03-lts denotes Spark 3.4.0 on openEuler 22.03-LTS. The details are as follows:

Tags             Currently                           Architectures
3.3.1-22.03-lts  spark 3.3.1 on openEuler 22.03-LTS  amd64, arm64
3.3.2-22.03-lts  spark 3.3.2 on openEuler 22.03-LTS  amd64, arm64
3.4.0-22.03-lts  spark 3.4.0 on openEuler 22.03-LTS  amd64, arm64

Usage

In the examples below, replace {Tag} with the tag from the table above that matches your requirements.

  • Online Documentation
    You can find the latest Spark documentation, including a programming guide, on the project web page. This README file only contains basic setup instructions.

  • Pull the openeuler/spark image from Docker Hub

    docker pull openeuler/spark:{Tag}
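
    For example, to pull the Spark 3.4.0 image using one of the tags listed in the table above:

    docker pull openeuler/spark:3.4.0-22.03-lts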
    
  • Interactive Scala Shell
    The easiest way to start using Spark is through the Scala shell:

    docker run -it --name spark openeuler/spark:{Tag} /opt/spark/bin/spark-shell
    

    Try the following command, which should return 1,000,000,000:

    scala> spark.range(1000 * 1000 * 1000).count()
    

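    Beyond the count, a short transformation can be tried in the same session. This is a minimal sketch that relies only on the spark session the shell already provides; the derived column name is illustrative:

    scala> // Build a 5-row dataset and derive a doubled column from the id column
    scala> spark.range(5).selectExpr("id", "id * 2 AS doubled").show()
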

  • Interactive Python Shell
    The easiest way to start using PySpark is through the Python shell:

    docker run -it --name spark openeuler/spark:{Tag} /opt/spark/bin/pyspark
    

    And run the following command, which should also return 1,000,000,000:

    >>> spark.range(1000 * 1000 * 1000).count()
    

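    A similar quick check works in PySpark. This is a minimal sketch that relies only on the spark session the shell already provides:

    >>> # Count the even numbers in 0..99; the result should be 50
    >>> spark.range(100).filter("id % 2 = 0").count()
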

  • Running Spark on Kubernetes
    See the official guide: https://spark.apache.org/docs/latest/running-on-kubernetes.html. A minimal submission sketch follows.
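
    As a rough illustration, a cluster-mode submission to a Kubernetes API server could look like the sketch below. The API server address, executor count, and example jar path are placeholders and assumptions, not values confirmed for this image; consult the guide above for the authoritative options:

    /opt/spark/bin/spark-submit \
      --master k8s://https://<k8s-apiserver-host>:<port> \
      --deploy-mode cluster \
      --name spark-pi \
      --class org.apache.spark.examples.SparkPi \
      --conf spark.executor.instances=2 \
      --conf spark.kubernetes.container.image=openeuler/spark:{Tag} \
      local:///opt/spark/examples/jars/spark-examples_2.12-3.4.0.jar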

  • Configuration and environment variables
    See more at https://github.com/apache/spark-docker/blob/master/OVERVIEW.md#environment-variable. A sketch of passing a variable follows.
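
    For instance, standard Spark environment variables such as SPARK_CONF_DIR can be passed with docker run -e. The mount path below is hypothetical, and whether a particular variable is honored by this image should be verified against the document above:

    docker run -it \
      -v $(pwd)/my-spark-conf:/opt/spark/conf-custom \
      -e SPARK_CONF_DIR=/opt/spark/conf-custom \
      openeuler/spark:{Tag} /opt/spark/bin/spark-shell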

Questions and answers

If you have any questions or need additional features, please submit an issue or a pull request to the openeuler-docker-images repository.
