First, use Hive to create a table stored as a SequenceFile
create table test(id string,time string) row format delimited fields terminated by '\t';
load data local inpath '/home/hadoop/data/order_seq.txt' into table test;
//A staging table is needed, because data cannot be loaded directly into a table stored as SequenceFile
create table order_seq stored as sequencefile as select * from test;
hive (g6)> select * from order_seq ;
OK
id time
107030111111 2014-05-01 06:01:01.334+01
107030222222 2014-05-01 07:01:01.334+01
107030333333 2014-05-01 08:01:01.334+01
107030444444 2014-05-01 09:01:01.334+01
107030555555 2014-05-01 10:01:01.334+01
Time taken: 0.082 seconds, Fetched: 5 row(s)
Start spark-shell and read the SequenceFile (fails with an error)
// After starting with ./spark-shell --master local[2], run the following:
scala> val b = sc.sequenceFile[org.apache.hadoop.io.IntWritable, org.apache.hadoop.io.Text]("hdfs://hadoop001:9000//user/hive/warehouse/g6.db/test_seq/000000_0")
b: org.apache.spark.rdd.RDD[(org.apache.hadoop.io.IntWritable, org.apache.hadoop.io.Text)] = MapPartitionsRDD[7] at sequenceFile at <console>:26
scala> b.collect
19/06/05 17:23:08 ERROR Executor: Exception in task 0.0 in stage 3.0 (TID 5)
java.io.NotSerializableException: org.apache.hadoop.io.BytesWritable
Serialization stack:
- object not serializable (class: org.apache.hadoop.io.BytesWritable, value: )
- field (class: scala.Tuple2, name: _1, type: class java.lang.Object)
- object (class scala.Tuple2, (,22))
- element of array (index: 0)
- array (class [Lscala.Tuple2;, size 2)
at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46)
at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:456)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
The error above is java.io.NotSerializableException: org.apache.hadoop.io.BytesWritable, i.e. something is not being serialized. Searching online turned up the fix shown in the next section; first, a note on what is actually failing.
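The stack trace shows that the key actually read from the file is a BytesWritable, not the IntWritable given in the type parameters (Hive commonly writes SequenceFiles with an empty BytesWritable key and the whole row as a Text value), and Hadoop Writables do not implement java.io.Serializable. The exception therefore only appears when the Writables themselves have to be shipped back to the driver, i.e. at collect. A rough sketch, assuming the same file as above:

// Read with the Writable types the file actually appears to contain.
val seq = sc.sequenceFile[org.apache.hadoop.io.BytesWritable, org.apache.hadoop.io.Text]("hdfs://hadoop001:9000//user/hive/warehouse/g6.db/test_seq/000000_0")
seq.count                          // works: only a Long is returned to the driver
seq.map(_._2.toString).collect     // works: plain Strings are Java-serializable
seq.collect                        // fails with the default Java serializer: Writables are not Serializable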
Start spark-shell with a serializer option and read the SequenceFile (works)
// ./spark-shell --master local[2] --conf spark.serializer=org.apache.spark.serializer.KryoSerializer
//Starting spark-shell this way sets the serializer to Kryo
scala> val b = sc.sequenceFile[org.apache.hadoop.io.IntWritable, org.apache.hadoop.io.Text]("hdfs://hadoop001:9000//user/hive/warehouse/g6.db/order_seq")
b: org.apache.spark.rdd.RDD[(org.apache.hadoop.io.IntWritable, org.apache.hadoop.io.Text)] = MapPartitionsRDD[3] at sequenceFile at <console>:24
scala> b.collect
res1: Array[(org.apache.hadoop.io.IntWritable, org.apache.hadoop.io.Text)] = Array((,107030555555?2014-05-01 10:01:01.334+01), (,107030555555?2014-05-01 10:01:01.334+01), (,107030555555?2014-05-01 10:01:01.334+01), (,107030555555?2014-05-01 10:01:01.334+01), (,107030555555?2014-05-01 10:01:01.334+01))
//There is an extra question mark in the middle; not yet resolved... (a possible explanation below)
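Two things stand out in this result. First, all five rows show the value of the last record (107030555555); this is most likely the well-known Hadoop record-reuse behaviour: the SequenceFile reader reuses the same Text instance for every record, so collecting the Writables directly yields five references to one object that ends up holding the final value. Second, the "?" between the id and the time is probably just the tab delimiter from the original text data being rendered as "?" by the terminal. A sketch of a workaround under those assumptions, copying each record into an immutable String as soon as it is read:

val rows    = sc.sequenceFile[org.apache.hadoop.io.BytesWritable, org.apache.hadoop.io.Text]("hdfs://hadoop001:9000//user/hive/warehouse/g6.db/order_seq")
val strings = rows.map { case (_, v) => v.toString }   // copy out of the reused Text instance
val fields  = strings.map(_.split("\t"))               // Array(id, time), split on the tab delimiter
fields.collect().foreach(f => println(f.mkString("\t")))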
Using IDEA: reading the SequenceFile with Spark
---- fails, unresolved
import org.apache.spark.{SparkConf, SparkContext}

object SparkContextApp {
  def main(args: Array[String]): Unit = {
    // Run locally with two threads
    val sparkConf = new SparkConf().setAppName("SparkContextApp").setMaster("local[2]")
    val sc = new SparkContext(sparkConf)
    // Read the Hive-generated SequenceFile from HDFS
    val a = sc.sequenceFile[Int, String]("hdfs://hadoop001:9000//user/hive/warehouse/g6.db/test_seq/000000_0")
    a.collect()
    sc.stop()
  }
}
Running the code above to read the SequenceFile with Spark produces the following errors:
......
ERROR Shell: Failed to locate the winutils binary in the hadoop binary path
java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries.
......
INFO HadoopRDD: Input split: hdfs://hadoop001:9000/user/hive/warehouse/g6.db/test_seq/000000_0:0+119
.......
WARN BlockReaderFactory: I/O error constructing remote block reader.
java.net.ConnectException: Connection timed out: no further information
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
......
WARN DFSClient: Failed to connect to /10.9.140.90:50010 for block, add to deadNodes and continue. java.net.ConnectException: Connection timed out: no further information
java.net.ConnectException: Connection timed out: no further information
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
.......
INFO DFSClient: Could not obtain BP-319425841-10.9.140.90-1559687031867:blk_1073741864_1040 from any node: java.io.IOException: No live nodes contain block BP-319425841-10.9.140.90-1559687031867:blk_1073741864_1040 after checking nodes = [10.9.140.90:50010], ignoredNodes = null No live nodes contain current block Block locations: 10.9.140.90:50010 Dead nodes: 10.9.140.90:50010. Will get new block locations from namenode and retry...
.....
WARN DFSClient: Failed to connect to /10.9.140.90:50010 for block, add to deadNodes and continue. java.net.ConnectException: Connection timed out: no further information
......
WARN DFSClient: DFS Read
org.apache.hadoop.hdfs.BlockMissingException: Could not obtain block: BP-319425841-10.9.140.90-1559687031867:blk_1073741864_1040 file=/user/hive/warehouse/g6.db/test_seq/000000_0
.....
The code itself should be fine; the problem is presumably that Spark running from IDEA cannot reach HDFS. None of the causes suggested online seem to help... still unresolved. A possible direction is sketched below.
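Two separate problems show up in the log, and the following is only a sketch of a possible direction, not a verified fix. The winutils error is the usual symptom of running a Hadoop client on Windows without HADOOP_HOME pointing at a directory containing bin\winutils.exe. The connection timeouts suggest the client can reach the NameNode (it does get the block locations) but cannot reach the DataNode at its internal address 10.9.140.90:50010; when a cluster advertises internal IPs, one commonly suggested workaround is to make the HDFS client connect to DataNodes by hostname, with hadoop001 mapped to a reachable IP in the local hosts file. Assuming those are the causes (C:\hadoop below is a hypothetical local path):

import org.apache.spark.{SparkConf, SparkContext}

object SparkContextApp {
  def main(args: Array[String]): Unit = {
    // Hypothetical local directory that contains bin\winutils.exe
    System.setProperty("hadoop.home.dir", "C:\\hadoop")

    val sparkConf = new SparkConf().setAppName("SparkContextApp").setMaster("local[2]")
    val sc = new SparkContext(sparkConf)

    // Ask the HDFS client to connect to DataNodes by hostname rather than the
    // internal IP reported by the NameNode (hadoop001 must resolve locally,
    // e.g. via the Windows hosts file).
    sc.hadoopConfiguration.set("dfs.client.use.datanode.hostname", "true")

    val a = sc.sequenceFile[org.apache.hadoop.io.BytesWritable, org.apache.hadoop.io.Text](
      "hdfs://hadoop001:9000//user/hive/warehouse/g6.db/test_seq/000000_0")
    a.map(_._2.toString).collect().foreach(println)

    sc.stop()
  }
}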