The Spark version in use here is 1.0.2; the spark-assembly jar already bundles Hive, so Hive support can be used directly.
Spark's Hive integration reads the Hive metastore connection settings from a hive-site.xml found on the classpath. Here I add the hive-site.xml directory to the classpath dynamically from Java code:
import java.io.File;
import java.lang.reflect.Method;
import java.net.URL;
import java.net.URLClassLoader;
// Append the directory containing hive-site.xml to the system classpath.
// Note: casting the system class loader to URLClassLoader assumes a pre-Java-9 runtime.
String hiveConfDir = System.getenv("HIVE_CONF_DIR");
File hivePath = new File(hiveConfDir);
URLClassLoader classLoader = (URLClassLoader) ClassLoader.getSystemClassLoader();
Method add = URLClassLoader.class.getDeclaredMethod("addURL", URL.class);
add.setAccessible(true);
add.invoke(classLoader, hivePath.toURI().toURL());
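With the configuration on the classpath, Hive tables can then be queried through HiveContext. A minimal sketch under the Spark 1.0.x API (the table name src and the local master setting are illustrative assumptions, not part of the original setup):

import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.hive.HiveContext;
// Create a context and run a HiveQL query; hql() is the Spark 1.0.x entry point.
JavaSparkContext sc = new JavaSparkContext("local", "hive-test");
HiveContext hiveContext = new HiveContext(sc.sc());
// "src" is a hypothetical Hive table used only for illustration.
hiveContext.hql("SELECT count(*) FROM src").collect();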
I then uploaded the spark-assembly jar to HDFS and submitted the job through SparkSubmit.main. It ran fine in local mode, but when submitted to yarn-cluster it failed with:
Caused by: org.datanucleus.exceptions.NucleusUserException: Persistence process has been specified to use a ClassLoaderResolver of name "datanucleus" yet this has not been found by the DataNucleus plugin mechanism. Please check your CLASSPATH and plugin specification.
    at org.datanucleus.NucleusContext.<init>(NucleusContext.java:280)
    at org.datanucleus.NucleusContext.<init>(NucleusContext.java:244)
    at org.datanucleus.NucleusContext.<init>(NucleusContext.java:222)
    at org.datanucleus.api.jdo.JDOPersistenceManagerFactory.<init>(JDOPersistenceManagerFactory.java:409)
    at org.datanucleus.api.jdo.JDOPersistenceManagerFactory.createPersistenceManagerFactory(JDOPersistenceManagerFactory.java:294)
    at org.datanucleus.api.jdo.JDOPersistenceManagerFactory.getPersistenceManagerFactory(JDOPersistenceManagerFactory.java:195)
After checking the code carefully, this looked like a problem loading the DataNucleus jars. Unpacking the assembly showed three of them inside: datanucleus-rdbms-3.2.1.jar, datanucleus-core-3.2.2.jar, and datanucleus-api-jdo-3.2.1.jar. DataNucleus discovers its plugins through per-jar metadata (plugin.xml and MANIFEST entries), which does not survive being merged into a single assembly jar, so the bundled copies go undiscovered. I pulled these three jars out, put them on the classpath, and resubmitted with them passed as local jars via --jars; the job then succeeded.
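For reference, a sketch of the programmatic submission with the three DataNucleus jars passed separately via --jars. The jar paths and the application main class below are hypothetical placeholders, not values from the original job:

import org.apache.spark.deploy.SparkSubmit;
// Equivalent to running spark-submit on the command line; --jars takes a
// comma-separated list of local jars shipped alongside the application.
String[] args = new String[]{
    "--master", "yarn-cluster",
    "--class", "com.example.MyHiveJob",
    "--jars", "/opt/spark/lib/datanucleus-rdbms-3.2.1.jar,"
            + "/opt/spark/lib/datanucleus-core-3.2.2.jar,"
            + "/opt/spark/lib/datanucleus-api-jdo-3.2.1.jar",
    "/path/to/my-hive-job.jar"
};
SparkSubmit.main(args);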