DISCLAIMER: Hive has only been tested on Unix (Linux) and Mac systems using Java 1.6, although it may very well work on other similar platforms. It does not work on Cygwin.
Installation and Configuration
Requirements
- Java 1.6
- Hadoop 0.20.x
Installing Hive from a Stable Release
Start by downloading the most recent stable release of Hive from one of the Apache download mirrors (see Hive Releases).
Next you need to unpack the tarball. This will result in the creation of a subdirectory named hive-x.y.z:
$ tar -xzvf hive-x.y.z.tar.gz
Set the environment variable HIVE_HOME to point to the installation directory:
$ cd hive-x.y.z
$ export HIVE_HOME=`pwd`
Finally, add $HIVE_HOME/bin to your PATH:
$ export PATH=$HIVE_HOME/bin:$PATH
Building Hive from Source
The Hive SVN repository is located here: http://svn.apache.org/repos/asf/hive/trunk
$ svn co http://svn.apache.org/repos/asf/hive/trunk hive
$ cd hive
$ ant clean package
$ cd build/dist
$ ls
README.txt
bin/ (all the shell scripts)
lib/ (required jar files)
conf/ (configuration files)
examples/ (sample input and query files)
In the rest of the page, we use build/dist and <install-dir> interchangeably.
To build Hive against Hadoop 0.23, run one of the following:
$ svn co http://svn.apache.org/repos/asf/hive/trunk hive
$ cd hive
$ ant clean package -Dhadoop.version=0.23.3 -Dhadoop-0.23.version=0.23.3 -Dhadoop.mr.rev=23
$ ant clean package -Dhadoop.version=2.0.0-alpha -Dhadoop-0.23.version=2.0.0-alpha -Dhadoop.mr.rev=23
Running Hive
Hive uses Hadoop, which means:
- you must have hadoop in your path OR
- export HADOOP_HOME=<hadoop-install-dir>
In addition, you must create /tmp and /user/hive/warehouse
(aka hive.metastore.warehouse.dir) and set them chmod g+w in
HDFS before a table can be created in Hive.
Commands to perform this setup:
$ $HADOOP_HOME/bin/hadoop fs -mkdir /tmp
$ $HADOOP_HOME/bin/hadoop fs -mkdir /user/hive/warehouse
$ $HADOOP_HOME/bin/hadoop fs -chmod g+w /tmp
$ $HADOOP_HOME/bin/hadoop fs -chmod g+w /user/hive/warehouse
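You can verify the directories and permissions afterwards with a quick listing (standard hadoop fs commands, shown here only as a sanity check):
$ $HADOOP_HOME/bin/hadoop fs -ls /tmp
$ $HADOOP_HOME/bin/hadoop fs -ls /user/hive/warehouse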
I also find it useful, though not necessary, to set HIVE_HOME:
$ export HIVE_HOME=<hive-install-dir>
To use the Hive command line interface (CLI) from the shell:
$ $HIVE_HOME/bin/hive
Configuration management overview
- Hive default configuration is stored in <install-dir>/conf/hive-default.xml
- Configuration variables can be changed by (re-)defining them in <install-dir>/conf/hive-site.xml
- The location of the Hive configuration directory can be changed by setting the HIVE_CONF_DIR environment variable.
- Log4j configuration is stored in <install-dir>/conf/hive-log4j.properties
- Hive configuration is an overlay on top of hadoop - meaning the hadoop configuration variables are inherited by default.
- Hive configuration can be manipulated by:
- Editing hive-site.xml and defining any desired variables (including hadoop variables) in it
- From the CLI using the SET command (see below)
- By invoking hive using the syntax:
- $ bin/hive -hiveconf x1=y1 -hiveconf x2=y2
This sets the variables x1 and x2 to y1 and y2, respectively.
- By setting the HIVE_OPTS environment variable to "-hiveconf x1=y1 -hiveconf x2=y2" which does the same as above
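As an illustration of the last two options, here is a minimal sketch; hive.exec.scratchdir is a real Hive variable, but the value is purely illustrative:
$ bin/hive -hiveconf hive.exec.scratchdir=/tmp/my_scratch
$ export HIVE_OPTS="-hiveconf hive.exec.scratchdir=/tmp/my_scratch"
$ bin/hive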
Runtime configuration
- Hive queries are executed as map-reduce jobs and, therefore, the behavior of such queries can be controlled by the Hadoop configuration variables.
- The cli command 'SET' can be used to set any hadoop (or hive) configuration variable. For example:
hive> SET mapred.job.tracker=myhost.mycompany.com:50030;
hive> SET -v;
The latter shows all the current settings. Without the -v option, only the variables that differ from the base Hadoop configuration are displayed.
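You can also inspect a single variable by passing just its name to SET; for example (using a variable discussed later on this page):
hive> SET hive.exec.mode.local.auto;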
Hive, Map-Reduce and Local-Mode
Hive compiles most queries into map-reduce jobs, which are submitted to the cluster indicated by the configuration variable mapred.job.tracker. Hadoop can also run map-reduce jobs locally on the user's workstation; to enable local mode of execution:
hive> SET mapred.job.tracker=local;
Starting with release 0.7, Hive also supports running map-reduce jobs in local mode automatically. The relevant option is hive.exec.mode.local.auto, which is disabled by default:
hive> SET hive.exec.mode.local.auto=false;
When enabled, Hive analyzes the size of each map-reduce job in a query and runs it locally if all of the following thresholds are satisfied:
- The total input size of the job is lower than: hive.exec.mode.local.auto.inputbytes.max (128MB by default)
- The total number of map-tasks is less than: hive.exec.mode.local.auto.tasks.max (4 by default)
- The total number of reduce tasks required is 1 or 0
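For example, to turn automatic local mode on and (optionally) tune the thresholds above; the values shown are simply the defaults quoted in this list:
hive> SET hive.exec.mode.local.auto=true;
hive> SET hive.exec.mode.local.auto.inputbytes.max=134217728;
hive> SET hive.exec.mode.local.auto.tasks.max=4;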
Error Logs
- Hive uses log4j for logging. By default, logs are not emitted to the console by the CLI; the default logging level is WARN, and the logs are stored in /tmp/<user.name>/hive.log
- To emit the logs to the console, invoke Hive as: bin/hive -hiveconf hive.root.logger=INFO,console
- To change the logging level without redirecting logs to the console: bin/hive -hiveconf hive.root.logger=INFO,DRFA
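The same mechanism can be used to get more verbose output while debugging, for example:
bin/hive -hiveconf hive.root.logger=DEBUG,console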
DDL Operations
Creating Hive tables and browsing through them
hive> CREATE TABLE pokes (foo INT, bar STRING);
Creates a table called pokes with two columns, the first being an integer and the other a string
hive> CREATE TABLE invites (foo INT, bar STRING) PARTITIONED BY (ds STRING);
Creates a table called invites with two columns and a partition column called ds. The partition column is a virtual column: it is not part of the data itself but is derived from the partition that a particular dataset is loaded into. By default, tables are assumed to be of text input format and the delimiters are assumed to be ^A (ctrl-a).
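To override the default delimiter, supply a ROW FORMAT clause at creation time. A small sketch (the table name pokes_csv is purely illustrative):
hive> CREATE TABLE pokes_csv (foo INT, bar STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';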
hive> SHOW TABLES;
lists all the tables
hive> SHOW TABLES '.*s';
lists all the tables that end with 's'. The pattern matching follows Java regular
expressions. Check out this link for documentation: http://java.sun.com/javase/6/docs/api/java/util/regex/Pattern.html
hive> DESCRIBE invites;
shows the list of columns
As for altering tables, table names can be changed and additional columns can be added:
hive> ALTER TABLE pokes ADD COLUMNS (new_col INT);
hive> ALTER TABLE invites ADD COLUMNS (new_col2 INT COMMENT 'a comment');
hive> ALTER TABLE events RENAME TO 3koobecaf;
Dropping tables:
hive> DROP TABLE pokes;
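Recent Hive versions also accept an IF EXISTS clause, which avoids an error when the table does not exist (shown here as an aside, not from the original examples):
hive> DROP TABLE IF EXISTS pokes;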
Metadata Store
The database schema is defined in JDO metadata annotations file package.jdo at src/contrib/hive/metastore/src/model.
In the future, the metastore itself can be a standalone server.
If you want to run the metastore as a network server so it can be accessed from multiple nodes try HiveDerbyServerMode.
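As a sketch of what that involves, hive-site.xml would point the JDO connection at the Derby network server instead of the embedded database; the host, port, and database name below are only illustrative (see HiveDerbyServerMode for the full setup):
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:derby://localhost:1527/metastore_db;create=true</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>org.apache.derby.jdbc.ClientDriver</value>
</property>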
DML Operations
Loading data from flat files into Hive:
hive> LOAD DATA LOCAL INPATH './examples/files/kv1.txt' OVERWRITE INTO TABLE pokes;
Loads a file that contains two columns separated by ctrl-a into the pokes table.
'local' signifies that the input file is on the local file system. If 'local' is omitted then it looks for the file in HDFS.
The keyword 'overwrite' signifies that existing data in the table is deleted.
If the 'overwrite' keyword is omitted, data files are appended to existing data sets.
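For example, the append variant of the earlier load is the same statement without the keyword (reusing kv1.txt from above):
hive> LOAD DATA LOCAL INPATH './examples/files/kv1.txt' INTO TABLE pokes;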
NOTES:
- NO verification of data against the schema is performed by the load command.
- If the file is in hdfs, it is moved into the Hive-controlled file system namespace.
The root of the Hive directory is specified by the option hive.metastore.warehouse.dir
in hive-default.xml. We advise users to create this directory before trying to create tables via Hive.
hive> LOAD DATA LOCAL INPATH './examples/files/kv2.txt' OVERWRITE INTO TABLE invites PARTITION (ds='2008-08-15');
hive> LOAD DATA LOCAL INPATH './examples/files/kv3.txt' OVERWRITE INTO TABLE invites PARTITION (ds='2008-08-08');
The two LOAD statements above load data into two different partitions of the table invites. The table must be created as partitioned by the key ds for this to succeed.
hive> LOAD DATA INPATH '/user/myname/kv2.txt' OVERWRITE INTO TABLE invites PARTITION (ds='2008-08-15');
The above command loads data from an HDFS file/directory into the table. Note that loading data from HDFS results in moving the file/directory, so the operation is almost instantaneous.
SQL Operations
Example Queries
Some example queries are shown below. They are available in build/dist/examples/queries.
More are available in the hive sources at ql/src/test/queries/positive
SELECTS and FILTERS
hive> SELECT a.foo FROM invites a WHERE a.ds='2008-08-15';
selects column 'foo' from all rows of partition ds=2008-08-15 of the invites table. The results are not
stored anywhere, but are displayed on the console.
Note that in all the examples that follow, INSERT (into a hive table, local
directory or HDFS directory) is optional.
hive> INSERT OVERWRITE DIRECTORY '/tmp/hdfs_out' SELECT a.* FROM invites a WHERE a.ds='2008-08-15';
selects all rows from partition ds=2008-08-15 of the invites table into an HDFS directory. The result data
is in files (depending on the number of mappers) in that directory.
NOTE: partition columns, if any, are selected by the use of *. They can also
be specified in the projection clauses.
Partitioned tables must always have a partition selected in the WHERE clause of the statement.
hive> INSERT OVERWRITE LOCAL DIRECTORY '/tmp/local_out' SELECT a.* FROM pokes a;
Selects all rows from the pokes table into a local directory.
hive> INSERT OVERWRITE TABLE events SELECT a.* FROM profiles a;
hive> INSERT OVERWRITE TABLE events SELECT a.* FROM profiles a WHERE a.key < 100;
hive> INSERT OVERWRITE LOCAL DIRECTORY '/tmp/reg_3' SELECT a.* FROM events a;
hive> INSERT OVERWRITE DIRECTORY '/tmp/reg_4' SELECT a.invites, a.pokes FROM profiles a;
hive> INSERT OVERWRITE DIRECTORY '/tmp/reg_5' SELECT COUNT(*) FROM invites a WHERE a.ds='2008-08-15';
hive> INSERT OVERWRITE DIRECTORY '/tmp/reg_5' SELECT a.foo, a.bar FROM invites a;
hive> INSERT OVERWRITE LOCAL DIRECTORY '/tmp/sum' SELECT SUM(a.pc) FROM pc1 a;
Sum of a column. avg, min, max can also be used. Note that for versions of Hive which don't include HIVE-287, you'll need to use COUNT(1) in place of COUNT(*).
GROUP BY
hive> FROM invites a INSERT OVERWRITE TABLE events SELECT a.bar, count(*) WHERE a.foo > 0 GROUP BY a.bar;
hive> INSERT OVERWRITE TABLE events SELECT a.bar, count(*) FROM invites a WHERE a.foo > 0 GROUP BY a.bar;
Note that for versions of Hive which don't include HIVE-287, you'll need to use COUNT(1) in place of COUNT(*).
JOIN
hive> FROM pokes t1 JOIN invites t2 ON (t1.bar = t2.bar) INSERT OVERWRITE TABLE events SELECT t1.bar, t1.foo, t2.foo;
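Other join types follow the same pattern; for instance, a LEFT OUTER JOIN variant of the query above (a sketch, not part of the original examples):
hive> FROM pokes t1 LEFT OUTER JOIN invites t2 ON (t1.bar = t2.bar) INSERT OVERWRITE TABLE events SELECT t1.bar, t1.foo, t2.foo;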
MULTITABLE INSERT
FROM src
INSERT OVERWRITE TABLE dest1 SELECT src.* WHERE src.key < 100
INSERT OVERWRITE TABLE dest2 SELECT src.key, src.value WHERE src.key >= 100 AND src.key < 200
INSERT OVERWRITE TABLE dest3 PARTITION(ds='2008-04-08', hr='12') SELECT src.key WHERE src.key >= 200 AND src.key < 300
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/dest4.out' SELECT src.value WHERE src.key >= 300;
STREAMING
hive> FROM invites a INSERT OVERWRITE TABLE events SELECT TRANSFORM(a.foo, a.bar) AS (oof, rab) USING '/bin/cat' WHERE a.ds > '2008-08-09';
This streams the data in the map phase through the script /bin/cat (like hadoop streaming).
Similarly, streaming can be used on the reduce side (please see the Hive Tutorial for examples); a sketch of that form follows.
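A sketch of reduce-side streaming, modeled on the Hive Tutorial; MAP and REDUCE are HiveQL keywords, and /bin/cat again stands in for a real script:
FROM (
  FROM invites a
  MAP a.foo, a.bar
  USING '/bin/cat'
  AS (oof, rab)
  CLUSTER BY oof) m
INSERT OVERWRITE TABLE events
REDUCE m.oof, m.rab
USING '/bin/cat'
AS (foo, bar);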
Simple Example Use Cases
MovieLens User Ratings
First, create a table with tab-delimited text file format:
CREATE TABLE u_data (
  userid INT,
  movieid INT,
  rating INT,
  unixtime STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;
Then, download and extract the data files:
wget http://www.grouplens.org/system/files/ml-data.tar.gz
tar xvzf ml-data.tar.gz
And load it into the table that was just created:
LOAD DATA LOCAL INPATH 'ml-data/u.data' OVERWRITE INTO TABLE u_data;
Count the number of rows in table u_data:
SELECT COUNT(*) FROM u_data;
Note that for versions of Hive which don't include HIVE-287, you'll need to use COUNT(1) in place of COUNT(*).
Now we can do some complex data analysis on the table u_data:
Create weekday_mapper.py:
import sys
import datetime

# Read tab-separated ratings from stdin and replace the unix timestamp
# with the ISO weekday number (1 = Monday ... 7 = Sunday).
for line in sys.stdin:
  line = line.strip()
  userid, movieid, rating, unixtime = line.split('\t')
  weekday = datetime.datetime.fromtimestamp(float(unixtime)).isoweekday()
  print '\t'.join([userid, movieid, rating, str(weekday)])
Use the mapper script:
CREATE TABLE u_data_new (
  userid INT,
  movieid INT,
  rating INT,
  weekday INT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t';

add FILE weekday_mapper.py;

INSERT OVERWRITE TABLE u_data_new
SELECT
  TRANSFORM (userid, movieid, rating, unixtime)
  USING 'python weekday_mapper.py'
  AS (userid, movieid, rating, weekday)
FROM u_data;

SELECT weekday, COUNT(*)
FROM u_data_new
GROUP BY weekday;
Note that if you're using Hive 0.5.0 or earlier you will need to use COUNT(1) in place of COUNT(*).
Apache Weblog Data
The format of Apache weblogs is customizable, while most webmasters use the default.
For the default Apache weblog format, we can create a table with the following command.
More about RegexSerDe can be found here: http://issues.apache.org/jira/browse/HIVE-662
add jar ../build/contrib/hive_contrib.jar;

CREATE TABLE apachelog (
  host STRING,
  identity STRING,
  user STRING,
  time STRING,
  request STRING,
  status STRING,
  size STRING,
  referer STRING,
  agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "([^]*) ([^]*) ([^]*) (-|\\[^\\]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\".*\") ([^ \"]*|\".*\"))?",
  "output.format.string" = "%1$s %2$s %3$s %4$s %5$s %6$s %7$s %8$s %9$s"
)
STORED AS TEXTFILE;
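Once defined, the table can be queried like any other; for example, a simple request count per host (illustrative; use COUNT(1) instead on versions without HIVE-287):
hive> SELECT host, COUNT(*) FROM apachelog GROUP BY host;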