Notes on Configuring a Spark + Python Development Environment on CentOS 6

While setting up a Spark + Python development environment on CentOS 6, an error caused by a missing zlib library came up; it was resolved by installing zlib and recompiling Python. A number of warnings and errors at spark-shell startup also had to be dealt with, covering permission problems, missing dependency jars, and Hive-related configuration. Finally, a KeyError raised when running pyspark was resolved.

1. When starting pyspark from $SPARK_HOME/bin/, it fails with: Traceback (most recent call last):

File "/home/joy/spark/spark/Python/pyspark/shell.py", line 28, in

Following what the search results suggested, I first installed the missing packages with yum install -y zlib*, but the error remained; running ./pyspark with sudo, however, worked. For now it only runs under sudo, which is probably an environment issue that still needs a proper fix: the installation was done with sudo, so the files are owned by root, and chown is needed to change the owner.
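A minimal sketch of that ownership fix (the user name and path are illustrative, taken from the traceback above):

sudo chown -R joy:joy /home/joy/spark   # hand the Spark tree back to the regular user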

But even then pip would have to be installed with sudo, so to fix this once and for all I recompiled Python.

Solution:

1. Install the zlib and zlib-devel dependencies.

2. Recompile and reinstall Python:

./configure

Edit the Modules/Setup file.

Find the following line and uncomment it:

zlib zlibmodule.c -I$(prefix)/include -L$(exec_prefix)/lib -lz

Recompile and install: make && make install
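Taken together, the rebuild looks roughly like this (a sketch only: the source directory and install prefix are illustrative, and the Modules/Setup edit is still done by hand as described above):

sudo yum install -y zlib zlib-devel       # step 1: the missing build dependency
cd Python-2.7.x                           # the Python source tree (version placeholder)
./configure --prefix=/usr/local/python27  # the prefix is an assumption; pick your own
# uncomment the zlib line in Modules/Setup, then rebuild and install:
make && make install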

Even after recompiling, the build reported that some modules could not be built:

Python build finished, but the necessary bits to build these modules were not found:

_bsddb _curses _curses_panel

_sqlite3 _ssl _tkinter

bsddb185 bz2 dbm

dl gdbm imageop

Whatever the exact message, the meaning is clear: at build time the system could not find the libraries these modules depend on. To get rid of the errors, the corresponding dependency packages have to be installed first. The mapping is roughly as follows (not necessarily complete):

| Module | Dependency | Description |
| --- | --- | --- |
| _bsddb | bsddb | Interface to the Berkeley DB library. bsddb is deprecated since Python 2.6; bsddb3 is the recommended replacement. |
| _curses | ncurses | Terminal handling for character-cell displays. |
| _curses_panel | ncurses | A panel stack extension for curses. |
| _sqlite3 | sqlite | DB-API 2.0 interface for SQLite databases. On CentOS, install sqlite-devel. |
| _ssl | openssl-devel.i686 | TLS/SSL wrapper for socket objects. |
| _tkinter | N/A | A thin object-oriented layer on top of Tcl/Tk. Can be ignored if you do not build desktop applications. |
| bsddb185 | old bsddb module | The legacy bsddb module; can be ignored. |
| bz2 | bzip2-devel.i686 | Compression compatible with bzip2. |
| dbm | bsddb | Simple "database" interface. |
| dl | N/A | Call C functions in shared objects. Deprecated since Python 2.6. |
| gdbm | gdbm-devel.i686 | GNU's reinterpretation of dbm. |
| imageop | N/A | Manipulate raw image data. Deprecated. |
| readline | readline-devel | GNU readline interface. |
| sunaudiodev | N/A | Access to Sun audio hardware. Sun-specific; can be ignored on CentOS. |
| zlib | zlib | Compression compatible with gzip. |

On CentOS, these dependency packages can be installed: readline-devel, sqlite-devel, bzip2-devel.i686, openssl-devel.i686, gdbm-devel.i686, libdbi-devel.i686, ncurses-libs, zlib-devel.i686. Once they are in place, compile again; errors for the modules marked above as deprecated or ignorable can be disregarded.
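A single yum invocation covering that list might look like this (the .i686 suffixes match the 32-bit packages named above; drop them on an x86_64 system):

sudo yum install -y readline-devel sqlite-devel bzip2-devel.i686 openssl-devel.i686 gdbm-devel.i686 libdbi-devel.i686 ncurses-libs zlib-devel.i686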

After the build finishes, Python can be installed into the target directory as described above. Once it is installed, check the installation directory to confirm that Python is working.
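A quick check that the rebuilt interpreter picked up the previously missing modules (the install prefix is illustrative):

/usr/local/python27/bin/python -V
/usr/local/python27/bin/python -c "import zlib, bz2, sqlite3, readline; print('modules OK')"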

3. Preparing Spark SQL

First, what is required in order to use HiveContext? The referenced article lists three requirements:

1. Check that the $SPARK_HOME/lib directory contains datanucleus-api-jdo-3.2.1.jar, datanucleus-rdbms-3.2.1.jar and datanucleus-core-3.2.2.jar.

2. Check that the $SPARK_HOME/conf directory contains a hive-site.xml copied over from $HIVE_HOME/conf.

3. When submitting an application, put the database driver jar on the driver classpath, e.g. bin/spark-submit --driver-class-path *.jar, or set SPARK_CLASSPATH in spark-env.sh.
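For the third requirement, a hedged example of passing the MySQL driver on the driver classpath (the jar name and application script are placeholders):

bin/spark-submit --driver-class-path $SPARK_HOME/jars/mysql-connector-java-<version>.jar my_app.py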

Following the referenced article: copy the jars whose names start with datanucleus from $HIVE_HOME/lib to $SPARK_HOME/lib, copy hive-site.xml from $HIVE_HOME/conf to $SPARK_HOME/conf, and copy the MySQL connector jar from $HIVE_HOME/lib to $SPARK_HOME/jars.
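The same copies as shell commands (the MySQL connector jar name is a placeholder; use the actual file found under $HIVE_HOME/lib):

cp $HIVE_HOME/lib/datanucleus-*.jar $SPARK_HOME/lib/
cp $HIVE_HOME/conf/hive-site.xml $SPARK_HOME/conf/
cp $HIVE_HOME/lib/mysql-connector-java-*.jar $SPARK_HOME/jars/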

2. Errors when starting spark-shell

To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).

17/01/17 11:42:58 WARN SparkContext: Support for Java 7 is deprecated as of Spark 2.0.0

17/01/17 11:43:00 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

17/01/17 11:43:00 WARN Utils: Your hostname, node1 resolves to a loopback address: 127.0.0.1; using 192.168.85.128 instead (on interface eth1)

17/01/17 11:43:00 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address

17/01/17 11:43:11 WARN HiveConf: DEPRECATED: hive.metastore.ds.retry.* no longer has any effect. Use hive.hmshandler.retry.* instead

17/01/17 11:43:11 WARN HiveConf: HiveConf of name hive.server2.thrift.http.min.worker.threads does not exist

17/01/17 11:43:11 WARN HiveConf: HiveConf of name hive.mapjoin.optimized.keys does not exist

17/01/17 11:43:11 WARN HiveConf: HiveConf of name hive.mapjoin.lazy.hashtable does not exist

17/01/17 11:43:11 WARN HiveConf: HiveConf of name hive.datampi.maxslots does not exist

17/01/17 11:43:11 WARN HiveConf: HiveConf of name hive.metastore.ds.retry.attempts does not exist

17/01/17 11:43:11 WARN HiveConf: HiveConf of name hive.server2.thrift.http.max.worker.threads does not exist

17/01/17 11:43:11 WARN HiveConf: HiveConf of name hive.datampi.sendqueue does not exist

17/01/17 11:43:11 WARN HiveConf: HiveConf of name hive.optimize.multigroupby.common.distincts does not exist

17/01/17 11:43:11 WARN HiveConf: HiveConf of name hive.metastore.ds.retry.interval does not exist

17/01/17 11:43:11 WARN HiveConf: HiveConf of name hive.datampi.parallelism does not exist

17/01/17 11:43:11 WARN HiveConf: HiveConf of name hive.stats.map.parallelism does not exist

17/01/17 11:43:11 WARN HiveConf: HiveConf of name hive.datampi.memusedpercent does not exist

17/01/17 11:43:12 WARN General: Plugin (Bundle) "org.datanucleus.store.rdbms" is already registered. Ensure you dont have multiple JAR versions of the same plugin in the classpath. The URL "file:/home/joy/spark/spark-2.1.0-bin-hadoop2.6/jars/datanucleus-rdbms-3.2.9.jar" is already registered, and you are trying to register an identical plugin located at URL "file:/home/joy/spark/spark/jars/datanucleus-rdbms-3.2.9.jar."

17/01/17 11:43:12 WARN General: Plugin (Bundle) "org.datanucleus.api.jdo" is already registered. Ensure you dont have multiple JAR versions of the same plugin in the classpath. The URL "file:/home/joy/spark/spark-2.1.0-bin-hadoop2.6/jars/datanucleus-api-jdo-3.2.6.jar" is already registered, and you are trying to register an identical plugin located at URL "file:/home/joy/spark/spark/jars/datanucleus-api-jdo-3.2.6.jar."

17/01/17 11:43:12 WARN General: Plugin (Bundle) "org.datanucleus" is already registered. Ensure you dont have multiple JAR versions of the same plugin in the classpath. The URL "file:/home/joy/spark/spark/jars/datanucleus-core-3.2.10.jar" is already registered, and you are trying to register an identical plugin located at URL "file:/home/joy/spark/spark-2.1.0-bin-hadoop2.6/jars/datanucleus-core-3.2.10.jar."

17/01/17 11:43:16 WARN HiveConf: DEPRECATED: hive.metastore.ds.retry.* no longer has any effect. Use hive.hmshandler.retry.* instead

17/01/17 11:43:16 WARN HiveConf: HiveConf of name hive.server2.thrift.http.min.worker.threads does not exist

17/01/17 11:43:16 WARN HiveConf: HiveConf of name hive.mapjoin.optimized.keys does not exist

17/01/17 11:43:16 WARN HiveConf: HiveConf of name hive.mapjoin.lazy.hashtable does not exist

17/01/17 11:43:16 WARN HiveConf: HiveConf of name hive.datampi.maxslots does not exist

17/01/17 11:43:16 WARN HiveConf: HiveConf of name hive.metastore.ds.retry.attempts does not exist

17/01/17 11:43:16 WARN HiveConf: HiveConf of name hive.server2.thrift.http.max.worker.threads does not exist

17/01/17 11:43:16 WARN HiveConf: HiveConf of name hive.datampi.sendqueue does not exist

17/01/17 11:43:16 WARN HiveConf: HiveConf of name hive.optimize.multigroupby.common.distincts does not exist

17/01/17 11:43:16 WARN HiveConf: HiveConf of name hive.metastore.ds.retry.interval does not exist

17/01/17 11:43:16 WARN HiveConf: HiveConf of name hive.datampi.parallelism does not exist

17/01/17 11:43:16 WARN HiveConf: HiveConf of name hive.stats.map.parallelism does not exist

17/01/17 11:43:16 WARN HiveConf: HiveConf of name hive.datampi.memusedpercent does not exist

17/01/17 11:43:22 ERROR ObjectStore: Version information found in metastore differs 0.13.0 from expected schema version 1.2.0. Schema verififcation is disabled hive.metastore.schema.verification so setting version.

java.lang.IllegalArgumentException: Error while instantiating 'org.apache.spark.sql.hive.HiveSessionState':

at org.apache.spark.sql.SparkSession$.org$apache$spark$sql$SparkSession$$reflect(SparkSession.scala:981)

at org.apache.spark.sql.SparkSession.sessionState$lzycompute(SparkSession.scala:110)

at org.apache.spark.sql.SparkSession.sessionState(SparkSession.scala:109)

at org.apache.spark.sql.SparkSession$Builder$$anonfun$getOrCreate$5.apply(SparkSession.scala:878)

at org.apache.spark.sql.SparkSession$Builder$$anonfun$getOrCreate$5.apply(SparkSession.scala:878)

at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99)

at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99)

at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:230)

at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40)

at scala.collection.mutable.HashMap.foreach(HashMap.scala:99)

at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:878)

at org.apache.spark.repl.Main$.createSparkSession(Main.scala:95)

... 47 elided

Caused by: java.lang.reflect.InvocationTargetException: java.lang.IllegalArgumentException: Error while instantiating 'org.apache.spark.sql.hive.HiveExternalCatalog':

at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)

at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)

at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)

at java.lang.reflect.Constructor.newInstance(Constructor.java:526)

at org.apache.spark.sql.SparkSession$.org$apache$spark$sql$SparkSession$$reflect(SparkSession.scala:978)

... 58 more

Caused by: java.lang.IllegalArgumentException: Error while instantiating 'org.apache.spark.sql.hive.HiveExternalCatalog':

at org.apache.spark.sql.internal.SharedState$.org$apache$spark$sql$internal$SharedState$$reflect(SharedState.scala:169)

at org.apache.spark.sql.internal.SharedState.(SharedState.scala:86)

at org.apache.spark.sql.SparkSession$$anonfun$sharedState$1.apply(SparkSession.scala:101)

at org.apache.spark.sql.SparkSession$$anonfun$sharedState$1.apply(SparkSession.scala:101)

at scala.Option.getOrElse(Option.scala:121)

at org.apache.spark.sql.SparkSession.sharedState$lzycompute(SparkSession.scala:101)

at org.apache.spark.sql.SparkSession.sharedState(SparkSession.scala:100)

at org.apache.spark.sql.internal.SessionState.(SessionState.scala:157)

at org.apache.spark.sql.hive.HiveSessionState.(HiveSessionState.scala:32)

... 63 more

Caused by: java.lang.reflect.InvocationTargetException: java.lang.reflect.InvocationTargetException: java.lang.RuntimeException: java.io.FileNotFoundException: File /hive/tmp does not exist

at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)

at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)

at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)

at java.lang.reflect.Constructor.newInstance(Constructor.java:526)

at org.apache.spark.sql.internal.SharedState$.org$apache$spark$sql$internal$SharedState$$reflect(SharedState.scala:166)

... 71 more

**Caused by: java.lang.reflect.InvocationTargetException: java.lang.RuntimeException: java.io.FileNotFoundException: File /hive/tmp does not exist**

at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)

at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)

at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)

at java.lang.reflect.Constructor.newInstance(Constructor.java:526)

at org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:264)

at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:366)

at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:270)

at org.apache.spark.sql.hive.HiveExternalCatalog.(HiveExternalCatalog.scala:65)

... 76 more

Caused by: java.lang.RuntimeException: java.io.FileNotFoundException: File /hive/tmp does not exist

at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:522)

at org.apache.spark.sql.hive.client.HiveClientImpl.(HiveClientImpl.scala:192)

... 84 more

Caused by: java.io.FileNotFoundException: File /hive/tmp does not exist

at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:537)

at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:750)

at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:527)

at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:409)

at org.apache.hadoop.hive.ql.session.SessionState.createRootHDFSDir(SessionState.java:599)

at org.apache.hadoop.hive.ql.session.SessionState.createSessionDirs(SessionState.java:554)

at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:508)

... 85 more

Analyzing the error, the root cause is a FileNotFoundException: /hive/tmp does not exist. Looking it up in hive-site.xml, that path is the value of hive.exec.scratchdir, an HDFS path used to store the execution plans of the different map/reduce stages and the intermediate output of those stages.
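The corresponding entry in hive-site.xml looks roughly like this (the /hive/tmp value is simply what this cluster uses):

<property>
  <name>hive.exec.scratchdir</name>
  <value>/hive/tmp</value>
</property>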

Running hadoop fs -ls /hive in a terminal gives:

Found 2 items

drwxr-xr-x - joy supergroup 0 2016-06-12 21:35 /hive/log

drwxr-xr-x - joy supergroup 0 2017-01-16 14:17 /hive/tmp

The permissions looked wrong, so group write should be added: hadoop fs -chmod g+w /hive/tmp and hadoop fs -chmod g+w /hive/log. Even so, the "does not exist" error persisted.

Note that the stack trace above goes through RawLocalFileSystem, i.e. Spark was resolving /hive/tmp against the local filesystem rather than HDFS, which is why the directory appeared to be missing even though it exists in HDFS. After adding HADOOP_CONF_DIR to spark-env.sh under $SPARK_HOME/conf, the error changed to:

java.lang.IllegalArgumentException: Error while instantiating 'org.apache.spark.sql.hive.HiveSessionState':

at org.apache.spark.sql.SparkSession$.org$apache$spark$sql$SparkSession$$reflect(SparkSession.scala:981)

at org.apache.spark.sql.SparkSession.sessionState$lzycompute(SparkSession.scala:110)

at org.apache.spark.sql.SparkSession.sessionState(SparkSession.scala:109)

at org.apache.spark.sql.SparkSession$Builder$$anonfun$getOrCreate$5.apply(SparkSession.scala:878)

at org.apache.spark.sql.SparkSession$Builder$$anonfun$getOrCreate$5.apply(SparkSession.scala:878)

at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99)

at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99)

at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:230)

at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40)

at scala.collection.mutable.HashMap.foreach(HashMap.scala:99)

at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:878)

at org.apache.spark.repl.Main$.createSparkSession(Main.scala:95)

... 47 elided

Caused by: java.lang.reflect.InvocationTargetException: java.lang.IllegalArgumentException: Error while instantiating 'org.apache.spark.sql.hive.HiveExternalCatalog':

at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)

at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)

at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)

at java.lang.reflect.Constructor.newInstance(Constructor.java:526)

at org.apache.spark.sql.SparkSession$.org$apache$spark$sql$SparkSession$$reflect(SparkSession.scala:978)

... 58 more

Caused by: java.lang.IllegalArgumentException: Error while instantiating 'org.apache.spark.sql.hive.HiveExternalCatalog':

at org.apache.spark.sql.internal.SharedState$.org$apache$spark$sql$internal$SharedState$$reflect(SharedState.scala:169)

at org.apache.spark.sql.internal.SharedState.(SharedState.scala:86)

at org.apache.spark.sql.SparkSession$$anonfun$sharedState$1.apply(SparkSession.scala:101)

at org.apache.spark.sql.SparkSession$$anonfun$sharedState$1.apply(SparkSession.scala:101)

at scala.Option.getOrElse(Option.scala:121)

at org.apache.spark.sql.SparkSession.sharedState$lzycompute(SparkSession.scala:101)

at org.apache.spark.sql.SparkSession.sharedState(SparkSession.scala:100)

at org.apache.spark.sql.internal.SessionState.(SessionState.scala:157)

at org.apache.spark.sql.hive.HiveSessionState.(HiveSessionState.scala:32)

... 63 more

Caused by: java.lang.reflect.InvocationTargetException: java.lang.reflect.InvocationTargetException: java.lang.RuntimeException: java.lang.RuntimeException: The root scratch dir: /hive/tmp on HDFS should be writable. Current permissions are: rwxrwxr-x

at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)

at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)

at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)

at java.lang.reflect.Constructor.newInstance(Constructor.java:526)

at org.apache.spark.sql.internal.SharedState$.org$apache$spark$sql$internal$SharedState$$reflect(SharedState.scala:166)

... 71 more

Caused by: java.lang.reflect.InvocationTargetException: java.lang.RuntimeException: java.lang.RuntimeException: The root scratch dir: /hive/tmp on HDFS should be writable. Current permissions are: rwxrwxr-x

at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)

at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)

at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)

at java.lang.reflect.Constructor.newInstance(Constructor.java:526)

at org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:264)

at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:366)

at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:270)

at org.apache.spark.sql.hive.HiveExternalCatalog.(HiveExternalCatalog.scala:65)

... 76 more

Caused by: java.lang.RuntimeException: java.lang.RuntimeException: The root scratch dir: /hive/tmp on HDFS should be writable. Current permissions are: rwxrwxr-x

at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:522)

at org.apache.spark.sql.hive.client.HiveClientImpl.(HiveClientImpl.scala:192)

... 84 more

Caused by: java.lang.RuntimeException: The root scratch dir: /hive/tmp on HDFS should be writable. Current permissions are: rwxrwxr-x

at org.apache.hadoop.hive.ql.session.SessionState.createRootHDFSDir(SessionState.java:612)

at org.apache.hadoop.hive.ql.session.SessionState.createSessionDirs(SessionState.java:554)

at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:508)

... 85 more

:14: error: not found: value spark

import spark.implicits._

^

:14: error: not found: value spark

import spark.sql

The error now points at incorrect directory permissions. Running hadoop fs -ls /hive again shows:

drwxrwxr-x - joy supergroup 0 2016-06-12 21:35 /hive/log

drwxrwxr-x - joy supergroup 0 2017-01-16 14:17 /hive/tmp

After changing the permissions on those directories to 777, spark-shell finally started successfully.
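A sketch of the two changes that got things working (the HADOOP_CONF_DIR path is illustrative; point it at the directory that holds core-site.xml and hdfs-site.xml):

# in $SPARK_HOME/conf/spark-env.sh
export HADOOP_CONF_DIR=/home/joy/hadoop/etc/hadoop

# open up the Hive scratch and log directories on HDFS
hadoop fs -chmod 777 /hive/tmp
hadoop fs -chmod 777 /hive/log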

4. KeyError: u'y'

The error looks something like the following:

Traceback (most recent call last):

File "/Users/lyj/Programs/kiseliugit/MyPysparkCodes/test/spark2.0.py", line 5, in

spark = SparkSession.builder.master("local").appName('test 2.0').config(conf=SparkConf()).getOrCreate()

File "/Users/lyj/Programs/Apache/Spark2/python/pyspark/conf.py", line 104, in __init__

SparkContext._ensure_initialized()

File "/Users/lyj/Programs/Apache/Spark2/python/pyspark/context.py", line 243, in _ensure_initialized

SparkContext._gateway = gateway or launch_gateway()

File "/Users/lyj/Programs/Apache/Spark2/python/pyspark/java_gateway.py", line 116, in launch_gateway

java_import(gateway.jvm, "org.apache.spark.SparkConf")

File "/Library/Python/2.7/site-packages/py4j/java_gateway.py", line 90, in java_import

return_value = get_return_value(answer, gateway_client, None, None)

File "/Library/Python/2.7/site-packages/py4j/protocol.py", line 306, in get_return_value

value = OUTPUT_CONVERTER[type](answer[2:], gateway_client)

KeyError: u'y'
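The failure happens inside py4j's get_return_value, where the OUTPUT_CONVERTER lookup does not recognize the response type 'y'. A commonly reported cause (an assumption here, not something this traceback proves) is a version mismatch between a py4j installed system-wide and the copy bundled with Spark under $SPARK_HOME/python/lib. A hedged way to check and align the two:

pip show py4j                        # the py4j the interpreter will import
ls $SPARK_HOME/python/lib/py4j-*     # the version Spark ships with
pip install --upgrade py4j           # or uninstall it and rely on Spark's bundled copy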

