After building Impala from source following the documentation at https://github.com/cloudera/impala, running the scripts below produced a series of problems:
```shell
${IMPALA_HOME}/bin/start-impalad.sh -use_statestore=false
${IMPALA_HOME}/bin/impala-shell.sh
```
Problem 1:
Although the local cluster's Hive metastore was already configured, and impala-shell.sh started without errors, `show databases` did not list the test.db database that had already been created in Hive. Moreover, the script created a derby.log file and a metastore.db directory in its working directory; these belong to the embedded Derby metastore that ships with Hive. So the cause is clear: Impala never saw the Hive metastore configuration and fell back to its own embedded one.
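The fallback happens because Hive's metastore client reads hive-site.xml from the CLASSPATH; when it finds none, it silently defaults to an embedded Derby database, which is exactly what produces derby.log and metastore.db in the working directory. A minimal sanity-check sketch (the scratch directory created here merely stands in for fe/src/test/resources):

```shell
# Look for hive-site.xml on each CLASSPATH entry; if it is absent, the
# metastore client will fall back to embedded Derby.
demo_dir=$(mktemp -d)                 # stands in for fe/src/test/resources
touch "$demo_dir/hive-site.xml"
CLASSPATH="$demo_dir:/nonexistent"

found=no
old_ifs=$IFS; IFS=:
for dir in $CLASSPATH; do
  [ -f "$dir/hive-site.xml" ] && found=yes
done
IFS=$old_ifs
echo "hive-site.xml on CLASSPATH: $found"   # prints: yes
```

Running this against the real CLASSPATH (after sourcing set-classpath.sh) quickly tells you whether the metastore config is actually visible.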
According to the project documentation, Impala picks up the configuration for HDFS, HBase, and the Hive metastore from configuration files placed under fe/src/test/resources; this is wired up in ${IMPALA_HOME}/bin/set-classpath.sh, which reads as follows:
```sh
#!/bin/sh
# Copyright 2012 Cloudera Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# This script explicitly sets the CLASSPATH for embedded JVMs (e.g. in
# Impalad or in runquery). Because embedded JVMs do not honour
# CLASSPATH wildcard expansion, we have to add every dependency jar
# explicitly to the CLASSPATH.
CLASSPATH=\
$IMPALA_HOME/fe/src/test/resources:\
$IMPALA_HOME/fe/target/classes:\
$IMPALA_HOME/fe/target/dependency:\
$IMPALA_HOME/fe/target/test-classes:\
${HIVE_HOME}/lib/datanucleus-core-2.0.3.jar:\
${HIVE_HOME}/lib/datanucleus-enhancer-2.0.3.jar:\
${HIVE_HOME}/lib/datanucleus-rdbms-2.0.3.jar:\
${HIVE_HOME}/lib/datanucleus-connectionpool-2.0.3.jar:${CLASSPATH}

for jar in `ls ${IMPALA_HOME}/fe/target/dependency/*.jar`; do
  CLASSPATH=${CLASSPATH}:$jar
done

export CLASSPATH
```
Here the trouble started: in my successfully built source tree there was no resources directory under fe/src at all. So I downloaded one from elsewhere, placed it in the expected location, and edited the three configuration files core-site.xml, hdfs-site.xml, and hive-site.xml to match the cluster's configuration. After running `source bin/set-classpath.sh`, the first problem was solved!
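One detail worth noting: the JVM resolves resources in CLASSPATH order, which is why set-classpath.sh puts fe/src/test/resources first; a configuration file there shadows any copy bundled later on the path. A small sketch of that first-match-wins behaviour (the scratch directories stand in for the real layout):

```shell
a=$(mktemp -d); b=$(mktemp -d)
echo "from-a" > "$a/site.conf"
echo "from-b" > "$b/site.conf"
CLASSPATH="$a:$b"

# Walk the entries in order and take the first one holding the resource,
# mimicking how ClassLoader.getResource picks a winner.
winner=""
old_ifs=$IFS; IFS=:
for dir in $CLASSPATH; do
  if [ -z "$winner" ] && [ -f "$dir/site.conf" ]; then
    winner=$(cat "$dir/site.conf")
  fi
done
IFS=$old_ifs
echo "$winner"   # prints: from-a
```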
Problem 2:
With the first problem solved, I ran the same two scripts again and executed the following commands:
```
Welcome to the Impala shell. Press TAB twice to see a list of available commands.
Copyright (c) 2012 Cloudera, Inc. All rights reserved.
(Build version: build version not available)
[Not connected] > connect hadoop-01
[hadoop-01:21000] > show databases;
default
test_impala
[hadoop-01:21000] > use test_impala;
[hadoop-01:21000] > show tables;
tab1
tab2
tab3
[hadoop-01:21000] > select * from tab3;
[hadoop-01:21000] > select * from tab1;
ERROR: Failed to open HDFS file hdfs://hadoop-01.localdomain:8030/user/impala/warehouse/test_impala.db/tab1/tab1.csv
Error(255): Unknown error 255
ERROR: Invalid query handle
[hadoop-01:21000] > select * from tab1;
ERROR: Failed to open HDFS file hdfs://hadoop-01.localdomain:8030/user/impala/warehouse/test_impala.db/tab1/tab1.csv
Error(255): Unknown error 255
ERROR: Invalid query handle
[hadoop-01:21000] > quit
```
The impalad log on the backend showed:
```
13/01/18 11:50:46 INFO service.Frontend: createExecRequest for query select * from tab1
13/01/18 11:50:46 INFO service.JniFrontend: Plan Fragment 0
  UNPARTITIONED
  EXCHANGE (1)
    TUPLE IDS: 0
Plan Fragment 1
  RANDOM
  STREAM DATA SINK
    EXCHANGE ID: 1
    UNPARTITIONED
  SCAN HDFS table=test_impala.tab1 (0)
    TUPLE IDS: 0
13/01/18 11:50:46 INFO service.JniFrontend: returned TQueryExecRequest2: TExecRequest(stmt_type:QUERY, sql_stmt:select * from tab1, request_id:TUniqueId(hi:-6897121767931491435, lo:-4792011001236606993), query_options:TQueryOptions(abort_on_error:false, max_errors:0, disable_codegen:false, batch_size:0, return_as_ascii:true, num_nodes:0, max_scan_range_length:0, num_scanner_threads:0, max_io_buffers:0, allow_unsupported_formats:false, partition_agg:false), query_exec_request:TQueryExecRequest(desc_tbl:TDescriptorTable(slotDescriptors:[TSlotDescriptor(id:0, parent:0, slotType:INT, columnPos:0, byteOffset:4, nullIndicatorByte:0, nullIndicatorBit:1, slotIdx:1, isMaterialized:true), TSlotDescriptor(id:1, parent:0, slotType:BOOLEAN, columnPos:1, byteOffset:1, nullIndicatorByte:0, nullIndicatorBit:0, slotIdx:0, isMaterialized:true), TSlotDescriptor(id:2, parent:0, slotType:DOUBLE, columnPos:2, byteOffset:8, nullIndicatorByte:0, nullIndicatorBit:2, slotIdx:2, isMaterialized:true), TSlotDescriptor(id:3, parent:0, slotType:TIMESTAMP, columnPos:3, byteOffset:16, nullIndicatorByte:0, nullIndicatorBit:3, slotIdx:3, isMaterialized:true)], tupleDescriptors:[TTupleDescriptor(id:0, byteSize:32, numNullBytes:1, tableId:1)], tableDescriptors:[TTableDescriptor(id:1, tableType:HDFS_TABLE, numCols:4, numClusteringCols:0, hdfsTable:THdfsTable(hdfsBaseDir:hdfs://hadoop-01.localdomain:8030/user/impala/warehouse/test_impala.db/tab1, partitionKeyNames:[], nullPartitionKeyValue:__HIVE_DEFAULT_PARTITION__, partitions:{-1=THdfsPartition(lineDelim:10, fieldDelim:44, collectionDelim:44, mapKeyDelim:44, escapeChar:0, fileFormat:TEXT, partitionKeyExprs:[], blockSize:0, compression:NONE), 1=THdfsPartition(lineDelim:10, fieldDelim:44, collectionDelim:44, mapKeyDelim:44, escapeChar:0, fileFormat:TEXT, partitionKeyExprs:[], blockSize:0, compression:NONE)}), tableName:tab1, dbName:test_impala)]), fragments:[TPlanFragment(plan:TPlan(nodes:[TPlanNode(node_id:1, node_type:EXCHANGE_NODE, num_children:0, limit:-1, row_tuples:[0], nullable_tuples:[false], compact_data:false)]), output_exprs:[TExpr(nodes:[TExprNode(node_type:SLOT_REF, type:INT, num_children:0, slot_ref:TSlotRef(slot_id:0))]), TExpr(nodes:[TExprNode(node_type:SLOT_REF, type:BOOLEAN, num_children:0, slot_ref:TSlotRef(slot_id:1))]), TExpr(nodes:[TExprNode(node_type:SLOT_REF, type:DOUBLE, num_children:0, slot_ref:TSlotRef(slot_id:2))]), TExpr(nodes:[TExprNode(node_type:SLOT_REF, type:TIMESTAMP, num_children:0, slot_ref:TSlotRef(slot_id:3))])], partition:TDataPartition(type:UNPARTITIONED, partitioning_exprs:[])), TPlanFragment(plan:TPlan(nodes:[TPlanNode(node_id:0, node_type:HDFS_SCAN_NODE, num_children:0, limit:-1, row_tuples:[0], nullable_tuples:[false], compact_data:false, hdfs_scan_node:THdfsScanNode(tuple_id:0))]), output_sink:TDataSink(type:DATA_STREAM_SINK, stream_sink:TDataStreamSink(dest_node_id:1, output_partition:TDataPartition(type:UNPARTITIONED, partitioning_exprs:[]))), partition:TDataPartition(type:RANDOM, partitioning_exprs:[]))], dest_fragment_idx:[0], per_node_scan_ranges:{0=[TScanRangeLocations(scan_range:TScanRange(hdfs_file_split:THdfsFileSplit(path:hdfs://hadoop-01.localdomain:8030/user/impala/warehouse/test_impala.db/tab1/tab1.csv, offset:0, length:192, partition_id:1)), locations:[TScanRangeLocation(server:THostPort(hostname:192.168.1.2, ipaddress:192.168.1.2, port:50010), volume_id:0)])]}, query_globals:TQueryGlobals(now_string:2013-01-18 11:50:46.000000862)), result_set_metadata:TResultSetMetadata(columnDescs:[TColumnDesc(columnName:id, columnType:INT), TColumnDesc(columnName:col_1, columnType:BOOLEAN), TColumnDesc(columnName:col_2, columnType:DOUBLE), TColumnDesc(columnName:col_3, columnType:TIMESTAMP)]))
hdfsOpenFile(hdfs://hadoop-01.localdomain:8030/user/impala/warehouse/test_impala.db/tab1/tab1.csv): FileSystem#open((Lorg/apache/hadoop/fs/Path;I)Lorg/apache/hadoop/fs/FSDataInputStream;) error:
java.lang.IllegalArgumentException: Wrong FS: hdfs://hadoop-01.localdomain:8030/user/impala/warehouse/test_impala.db/tab1/tab1.csv, expected: hdfs://localhost:20500
        at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:547)
        at org.apache.hadoop.hdfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:169)
        at org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:245)
        at org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSyst
```
The key is the Wrong FS error: expected: hdfs://localhost:20500. Yet the core-site.xml in the resources directory clearly sets the namenode address with port 8030, so where does localhost:20500 come from? Digging into the Impala source, I found that be/src/runtime/hdfs-fs-cache.cc defines default values for nn and nn_port:
```cpp
DEFINE_string(nn, "localhost", "hostname or ip address of HDFS namenode");
DEFINE_int32(nn_port, 20500, "namenode port");
```
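For context, the "Wrong FS" message comes from Hadoop's FileSystem.checkPath, which rejects any path whose scheme://authority prefix differs from the URI the FileSystem instance was created for. A simplified shell rendition of that comparison (`check_path` is my own name for the sketch, not Hadoop's API):

```shell
check_path() {
  # Compare the scheme://host:port prefix of a path against the
  # filesystem's own URI, as FileSystem.checkPath does (simplified).
  path=$1; fs_uri=$2
  path_fs=$(printf '%s\n' "$path" | sed -E 's#^([a-z]+://[^/]+).*#\1#')
  if [ "$path_fs" != "$fs_uri" ]; then
    printf 'Wrong FS: %s, expected: %s\n' "$path" "$fs_uri"
    return 1
  fi
  echo "ok"
}

msg=$(check_path \
  "hdfs://hadoop-01.localdomain:8030/user/impala/warehouse/test_impala.db/tab1/tab1.csv" \
  "hdfs://localhost:20500") || true
echo "$msg"   # prints the Wrong FS line, mirroring the log above
```

Since impalad opened its FileSystem against the compiled-in default hdfs://localhost:20500, every path from the metastore (which points at hadoop-01.localdomain:8030) failed this check.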
So when starting the impalad service, nn and nn_port must also be set to the cluster's actual namenode address and port:
```shell
${IMPALA_HOME}/bin/start-impalad.sh -use_statestore=false -nn=hadoop-01.localdomain -nn_port=8030
```
With that, the second problem (expected: hdfs://localhost:20500) was solved as well, and every query now runs without issue!
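As a final sanity check, the -nn/-nn_port pair passed to start-impalad.sh should agree with fs.default.name in the core-site.xml placed under fe/src/test/resources. A rough sketch of extracting that value for comparison (the inline config here stands in for the real file, and the sed pattern assumes the value sits on one line):

```shell
conf=$(mktemp)
cat > "$conf" <<'EOF'
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://hadoop-01.localdomain:8030</value>
  </property>
</configuration>
EOF

# Pull out the default filesystem URI; -nn/-nn_port should match it.
default_fs=$(sed -n 's#.*<value>\(hdfs://[^<]*\)</value>.*#\1#p' "$conf")
echo "$default_fs"   # prints: hdfs://hadoop-01.localdomain:8030
```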