Finally, after looking at the configuration of the sqlContext variable that spark-shell creates automatically on the server, I saw that it was full of URLs and configuration variables pointing at HDFS paths and other servers where I had no permissions, and I still needed to run queries against the PROD metastore.
I knew that queries against the PROD metastore worked from beeline, which meant the metastore was reachable over JDBC, so I took the JDBC URL that beeline uses for that connection.
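For reference, reproducing that connection from the shell looks roughly like this (the host, port, database, and Kerberos principal below are placeholders for this example, not the real PROD values):

beeline -u "jdbc:hive2://metastore-host:10000/some_db;principal=hive/_HOST@YOUR_REALM"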
I then took this JDBC URL and started using plain Java methods and utilities (called from Scala) to connect to the DB via JDBC:
/* We will need hive-jdbc-0.14.0.jar on the classpath to connect to a Hive metastore via JDBC */
import java.sql.ResultSetMetaData
import java.sql.{DriverManager, Connection, Statement, ResultSet}
import org.apache.spark.sql.types.StringType
import org.apache.spark.sql.types.StructField
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.Row

/* In the following lines I connect to the PROD metastore via JDBC and execute the
 * query as if I were connecting to a plain DB. Notice that, using this method,
 * you are not using distributed computing: everything runs on the driver. */
val url = "jdbc:hive2://URL_TO_PROD_METASTORE/BD;CREDENTIALS OR URL TO KERBEROS"
val query = "SELECT * FROM BD.TABLE LIMIT 100"
val driver = "org.apache.hive.jdbc.HiveDriver"

Class.forName(driver).newInstance
val conn: Connection = DriverManager.getConnection(url)
val r: ResultSet = conn.createStatement.executeQuery(query)

/* Read the column metadata once, before iterating over the results */
val meta: ResultSetMetaData = r.getMetaData
val numColumns: Int = meta.getColumnCount
val columnNames: Array[String] = (1 to numColumns).map(meta.getColumnName).toArray

/* Now we want to get all the values from all the columns. Notice that I create a
 * Row object for each row of the result set, then add each Row to a MutableList. */
val list = scala.collection.mutable.MutableList[Row]()
while (r.next()) {
  val value: Array[String] = new Array[String](numColumns)
  for (i <- 0 until numColumns) {
    value(i) = r.getString(i + 1) // JDBC columns are 1-indexed
  }
  list += Row.fromSeq(value)
}
conn.close()

/* Now we have the results of the query to the PROD metastore and we want to turn
 * this data into a DataFrame, so we create a StructType with the names of the
 * columns, paired with the list of Rows built above. */
val array: Array[StructField] = new Array[StructField](numColumns)
for (i <- 0 until numColumns) {
  array(i) = StructField(columnNames(i), StringType, nullable = true)
}
val schema = StructType(array)
val df = sqlContext.createDataFrame(sc.parallelize(list), schema)
df.show()
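Keep in mind that the whole result set travels through this single JDBC connection and is buffered in the driver's memory, which is why the query above carries a LIMIT: this is a workaround for small result sets, not a replacement for a distributed read. Once the DataFrame exists, though, it behaves like any other. As a quick sketch (assuming the df built above and the Spark 1.x API that hive-jdbc-0.14.0 suggests), you can register it as a temporary table and query it through the sqlContext that spark-shell created:

df.registerTempTable("prod_sample") // "prod_sample" is an arbitrary name for this example
sqlContext.sql("SELECT COUNT(*) FROM prod_sample").show()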