Spark-Standalone

  1. Security: disabled by default
  2. Starting a cluster manually:
    • Start the master with ./sbin/start-master.sh. Once it is up, the master prints a spark://HOST:PORT URL that workers (and applications) use to connect to it.
    • Monitoring: the master web UI defaults to http://localhost:8080/
    • Start a worker and point it at the master: ./sbin/start-slave.sh <master-spark-URL>
    • Arguments accepted by start-master.sh and start-slave.sh:
      -h HOST, --host HOST: Hostname to listen on
      -i HOST, --ip HOST: Hostname to listen on (deprecated, use -h or --host)
      -p PORT, --port PORT: Port for service to listen on (default: 7077 for master, random for worker)
      --webui-port PORT: Port for web UI (default: 8080 for master, 8081 for worker)
      -c CORES, --cores CORES: Total CPU cores to allow Spark applications to use on the machine (default: all available); only on worker
      -m MEM, --memory MEM: Total amount of memory to allow Spark applications to use on the machine, in a format like 1000M or 2G (default: your machine's total RAM minus 1 GB); only on worker
      -d DIR, --work-dir DIR: Directory to use for scratch space and job output logs (default: SPARK_HOME/work); only on worker
      --properties-file FILE: Path to a custom Spark properties file to load (default: conf/spark-defaults.conf)
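      A minimal launch sketch using the scripts and arguments above; the hostname master-host and the core/memory caps are placeholder examples, not recommendations:
        # on the master machine: start the master (it logs the spark://HOST:PORT URL)
        ./sbin/start-master.sh --host master-host --port 7077 --webui-port 8080
        # on each worker machine: register with that master, capping cores and memory
        ./sbin/start-slave.sh spark://master-host:7077 --cores 4 --memory 4G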
    • Set up the conf/slaves file (one worker hostname per line)
    • Set up conf/spark-env.sh
      • Environment Variable / Meaning:
        SPARK_MASTER_HOST: Bind the master to a specific hostname or IP address, for example a public one.
        SPARK_MASTER_PORT: Start the master on a different port (default: 7077).
        SPARK_MASTER_WEBUI_PORT: Port for the master web UI (default: 8080).
        SPARK_MASTER_OPTS: Configuration properties that apply only to the master in the form "-Dx=y" (default: none). See below for a list of possible options.
        SPARK_LOCAL_DIRS: Directory to use for "scratch" space in Spark, including map output files and RDDs that get stored on disk. This should be on a fast, local disk in your system. It can also be a comma-separated list of multiple directories on different disks.
        SPARK_WORKER_CORES: Total number of cores to allow Spark applications to use on the machine (default: all available cores).
        SPARK_WORKER_MEMORY: Total amount of memory to allow Spark applications to use on the machine, e.g. 1000m or 2g (default: total memory minus 1 GB); note that each application's individual memory is configured using its spark.executor.memory property.
        SPARK_WORKER_PORT: Start the Spark worker on a specific port (default: random).
        SPARK_WORKER_WEBUI_PORT: Port for the worker web UI (default: 8081).
        SPARK_WORKER_DIR: Directory to run applications in, which will include both logs and scratch space (default: SPARK_HOME/work).
        SPARK_WORKER_OPTS: Configuration properties that apply only to the worker in the form "-Dx=y" (default: none). See below for a list of possible options.
        SPARK_DAEMON_MEMORY: Memory to allocate to the Spark master and worker daemons themselves (default: 1g).
        SPARK_DAEMON_JAVA_OPTS: JVM options for the Spark master and worker daemons themselves in the form "-Dx=y" (default: none).
        SPARK_DAEMON_CLASSPATH: Classpath for the Spark master and worker daemons themselves (default: none).
        SPARK_PUBLIC_DNS: The public DNS name of the Spark master and workers (default: none).
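        A sketch of the two files just mentioned; hostnames, sizes, and directories are placeholder examples:
          # conf/slaves -- one worker hostname per line
          worker-1
          worker-2

          # conf/spark-env.sh -- sourced by the launch scripts on every node
          export SPARK_MASTER_HOST=master-host       # placeholder hostname
          export SPARK_WORKER_CORES=8                # example cap: 8 cores per worker
          export SPARK_WORKER_MEMORY=16g             # example cap: 16 GB per worker
          export SPARK_WORKER_DIR=/data/spark/work   # example scratch/log directory
        With both files in place, running ./sbin/start-all.sh on the master launches the master plus all listed workers over SSH.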
    • SPARK_MASTER_OPTS
      • Property Name (Default): Meaning
        spark.deploy.retainedApplications (default: 200): The maximum number of completed applications to display. Older applications will be dropped from the UI to maintain this limit.
        spark.deploy.retainedDrivers (default: 200): The maximum number of completed drivers to display. Older drivers will be dropped from the UI to maintain this limit.
        spark.deploy.spreadOut (default: true): Whether the standalone cluster manager should spread applications out across nodes or try to consolidate them onto as few nodes as possible. Spreading out is usually better for data locality in HDFS, but consolidating is more efficient for compute-intensive workloads.
        spark.deploy.defaultCores (default: infinite): Default number of cores to give to applications in Spark's standalone mode if they don't set spark.cores.max. If not set, applications always get all available cores unless they configure spark.cores.max themselves. Set this lower on a shared cluster to prevent users from grabbing the whole cluster by default.
        spark.deploy.maxExecutorRetries (default: 10): Limit on the maximum number of back-to-back executor failures that can occur before the standalone cluster manager removes a faulty application. An application will never be removed if it has any running executors. If an application experiences more than spark.deploy.maxExecutorRetries failures in a row, no executors successfully start running in between those failures, and the application has no running executors, then the standalone cluster manager will remove the application and mark it as failed. To disable this automatic removal, set spark.deploy.maxExecutorRetries to -1.
        spark.worker.timeout (default: 60): Number of seconds after which the standalone deploy master considers a worker lost if it receives no heartbeats.
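        A sketch of how these master-only properties are typically passed, via SPARK_MASTER_OPTS in conf/spark-env.sh; the values are illustrative only:
          # conf/spark-env.sh on the master node
          export SPARK_MASTER_OPTS="-Dspark.deploy.defaultCores=4 \
            -Dspark.deploy.retainedApplications=100 \
            -Dspark.worker.timeout=120"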
    • SPARK_WORKER_OPTS
      • Property Name (Default): Meaning
        spark.worker.cleanup.enabled (default: false): Enable periodic cleanup of worker / application directories. Note that this only affects standalone mode, as YARN works differently. Only the directories of stopped applications are cleaned up.
        spark.worker.cleanup.interval (default: 1800, i.e. 30 minutes): Controls the interval, in seconds, at which the worker cleans up old application work dirs on the local machine.
        spark.worker.cleanup.appDataTtl (default: 604800, i.e. 7 days): The number of seconds to retain application work directories on each worker. This is a Time To Live and should depend on the amount of available disk space you have. Application logs and jars are downloaded to each application work dir. Over time, the work dirs can quickly fill up disk space, especially if you run jobs very frequently.
        spark.storage.cleanupFilesAfterExecutorExit (default: true): Enable cleanup of non-shuffle files (such as temporary shuffle blocks, cached RDD/broadcast blocks, spill files, etc.) in worker directories after an executor exits. Note that this does not overlap with spark.worker.cleanup.enabled: that setting cleans up all files and subdirectories of a stopped, timed-out application, whereas this one cleans up the non-shuffle files in the local directories of a dead executor. This only affects standalone mode; support for other cluster managers can be added in the future.
        spark.worker.ui.compressedLogFileLengthCacheSize (default: 100): For compressed log files, the uncompressed file size can only be computed by uncompressing the files. Spark caches the uncompressed file size of compressed log files. This property controls the cache size.
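        A sketch of enabling periodic work-dir cleanup via SPARK_WORKER_OPTS; the three-day TTL is an example value:
          # conf/spark-env.sh on each worker node
          export SPARK_WORKER_OPTS="-Dspark.worker.cleanup.enabled=true \
            -Dspark.worker.cleanup.interval=1800 \
            -Dspark.worker.cleanup.appDataTtl=259200"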
  3. Connecting an application to the cluster:
    ./bin/spark-shell --master spark://IP:PORT
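    The same master URL also works for batch submission; a sketch, where master-host and the bundled pi.py example stand in for a real host and application:
      ./bin/spark-submit --master spark://master-host:7077 examples/src/main/python/pi.py 100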
  4. Restarting on failure: in standalone mode, pass --supervise to spark-submit (cluster deploy mode) so the driver is restarted automatically if it exits with a non-zero exit code. To kill an application that keeps failing, run:
    ./bin/spark-class org.apache.spark.deploy.Client kill <master url> <driver ID>
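    A sketch of both halves, assuming a placeholder master URL and a driver ID copied from the master web UI; the example jar path depends on your Spark version:
      # submit with automatic driver restart (standalone cluster mode; note it does not accept Python apps)
      ./bin/spark-submit \
        --master spark://master-host:7077 \
        --deploy-mode cluster \
        --supervise \
        --class org.apache.spark.examples.SparkPi \
        examples/jars/spark-examples_*.jar 100
      # kill the repeatedly failing driver by its ID (placeholder shown)
      ./bin/spark-class org.apache.spark.deploy.Client kill spark://master-host:7077 driver-20190101120000-0000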
  5. Resource allocation
    1. Per-application default cores, set on the master:
      export SPARK_MASTER_OPTS="-Dspark.deploy.defaultCores=<value>"
    2. Per-executor cores: spark.executor.cores
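    A sketch of the application-side settings passed at submit time; the caps below are example numbers, not recommendations:
      # cap the application at 16 cores total, 4 cores and 4g per executor
      ./bin/spark-submit \
        --master spark://master-host:7077 \
        --conf spark.cores.max=16 \
        --conf spark.executor.cores=4 \
        --conf spark.executor.memory=4g \
        examples/src/main/python/pi.py 100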
  6. Monitoring: master web UI on port 8080 by default
  7. Logging: each application's stdout and stderr go to the worker's work directory (default: SPARK_HOME/work)
  8. Interacting with Hadoop: use hdfs:// URLs, e.g. hdfs://<namenode>:9000/path
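    A sketch of reading HDFS input from a standalone job; namenode-host and the file path are placeholders for the <namenode>:9000/path form above:
      ./bin/spark-submit \
        --master spark://master-host:7077 \
        examples/src/main/python/wordcount.py \
        hdfs://namenode-host:9000/path/to/input.txt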
  9. HA: make the master highly available with standby masters coordinated through ZooKeeper
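    A sketch of the ZooKeeper recovery settings, placed in conf/spark-env.sh on every master node; the ZooKeeper addresses are placeholders:
      export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER \
        -Dspark.deploy.zookeeper.url=zk1:2181,zk2:2181,zk3:2181 \
        -Dspark.deploy.zookeeper.dir=/spark"
    Applications then list every master in the URL, e.g. --master spark://master1:7077,master2:7077, so they can fail over to whichever master is active.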

Reposted from: https://www.cnblogs.com/liudingchao/p/11269606.html
