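1. Overview
MMLSpark (Microsoft Machine Learning for Apache Spark) is an open-source deep learning and data science toolkit from Microsoft for Apache Spark. It is designed for quickly building powerful, highly scalable predictive and analytical models over large image and text datasets.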
2. Download the installation package
Install and use it as a Spark package, following the official examples:
spark-shell --packages com.microsoft.ml.spark:mmlspark_2.11:1.0.0-rc1
pyspark --packages com.microsoft.ml.spark:mmlspark_2.11:1.0.0-rc1
spark-submit --packages com.microsoft.ml.spark:mmlspark_2.11:1.0.0-rc1 MyApp.jar
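The same package coordinate can also be supplied from a standalone Python script instead of the command line. The following is a minimal sketch, assuming the standard Spark properties `spark.jars.packages` and `spark.jars.repositories` as the counterparts of `--packages` and `--repositories`. The helper names `mmlspark_conf` and `build_session` are hypothetical and exist only for illustration.

```python
# Maven coordinate and repository URL as they appear in this article.
MMLSPARK_PACKAGE = "com.microsoft.ml.spark:mmlspark_2.11:1.0.0-rc1"
MMLSPARK_REPO = "https://mmlspark.azureedge.net/maven"


def mmlspark_conf(package=MMLSPARK_PACKAGE, repo=MMLSPARK_REPO):
    """Return the Spark properties that mirror --packages / --repositories."""
    return {
        "spark.jars.packages": package,
        "spark.jars.repositories": repo,
    }


def build_session(app_name="MyApp"):
    """Create a SparkSession that resolves MMLSpark at startup (requires pyspark)."""
    # Imported lazily so that mmlspark_conf() can be used without Spark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    for key, value in mmlspark_conf().items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```

Calling `build_session()` from a script and then running `import mmlspark` should behave like the `pyspark --packages` command above. These properties are read when the session's JVM is launched, so the sketch applies to a fresh Python process rather than to an already running pyspark shell.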
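However, because the CDH cluster in our data center sits on a mainland China network, access to this project's Maven repository is extremely slow and practically unusable. For example: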
# pyspark --packages com.microsoft.ml.spark:mmlspark_2.11:1.0.0-rc1 --repositories=https://mmlspark.azureedge.net/maven
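After running this command, downloading the dependencies is extremely slow and takes several hours at best.
Tracing how spark --packages runs shows that it first downloads the dependency jars and caches them locally under .ivy2/jars in the current user's home directory (/root/.ivy2/jars in the log below), so the dependencies can be downloaded over an AWS network instead: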
[root@ip-192-168-15-101 spark]# pyspark --packages com.microsoft.ml.spark:mmlspark_2.11:1.0.0-rc1 --repositories=https://mmlspark.azureedge.net/maven
Python 2.7.18 (default, May 7 2020, 09:20:17)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-28)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
https://mmlspark.azureedge.net/maven added as a remote repository with the name: repo-1
Ivy Default Cache set to: /root/.ivy2/cache
The jars for the packages stored in: /root/.ivy2/jars
:: loading settings :: url = jar:file:/usr/lib/spark/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
com.microsoft.ml.spark#mmlspark_2.11 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-fcd33844-8550-4d17-b600-7a07aa1902c9;1.0
confs: [default]
found com.microsoft.ml.spark#mmlspark_2.11;1.0.0-rc1 in repo-1
found org.scalactic#scalactic_2.11;3.0.5 in central
found org.scala-lang#scala-reflect;2.11.12 in central
found org.scalatest#scalatest_2.11;3.0.5 in central
found org.scala-lang.modules#scala-xml_2.11;1.0.6 in central
found io.spray#spray-json_2.11;1.3.2 in central
found com.microsoft.cntk#cntk;2.4 in central
found org.openpnp#opencv;3.2.0-1 in central
found com.jcraft#jsch;0.1.54 in central
found org.apache.httpcomponents#httpclient;4.5.6 in central
found org.apache.httpcomponents#httpcore;4.4.10 in central
found commons-logging#commons-logging;1.2 in local-m2-cache
found commons-codec#commons-codec;1.10 in local-m2-cache
found com.microsoft.ml.lightgbm#lightgbmlib;2.3.100 in central
found com.github.vowpalwabbit#vw-jni;8.7.0.3 in central
:: resolution report :: resolve 690ms :: artifacts dl 16ms
:: modules in use:
com.github.vowpalwabbit#vw-jni;8.7.0.3 from central in [default]
com.jcraft#jsch;0.1.54 from central in [default]
com.microsoft.cntk#cntk;2.4 from central in [default]
com.microsoft.ml.lightgbm#lightgbmlib;2.3.100 from central in [default]
com.microsoft.ml.spark#mmlspark_2.11;1.0.0-rc1 from repo-1 in [default]
commons-codec#commons-codec;1.10 from local-m2-cache in [default]
commons-logging#commons-logging;1.2 from local-m2-cache in [default]
io.spray#spray-json_2.11;1.3.2 from central in [default]
org.apache.httpcomponents#httpclient;4.5.6 from central in [default]
org.apache.httpcomponents#httpcore;4.4.10 from central in [default]
org.openpnp#opencv;3.2.0-1 from central in [defa