hive② -- GettingStarted



GettingStarted




DISCLAIMER: Hive has only been tested on Unix (Linux) and Mac systems using Java 1.6 for now -
although it may very well work on other similar platforms. It does not work on Cygwin.
Most of our testing has been on Hadoop 0.20, so we advise running it against this version even though it may compile/work against other versions.


Installation and Configuration

Requirements

  • Java 1.6
  • Hadoop 0.20.x.

    Installing Hive from a Stable Release

Start by downloading the most recent stable release of Hive from one of the Apache download mirrors (see Hive Releases).

Next you need to unpack the tarball. This will result in the creation of a subdirectory named hive-x.y.z:

  $ tar -xzvf hive-x.y.z.tar.gz

Set the environment variable HIVE_HOME to point to the installation directory:

  $ cd hive-x.y.z
  $ export HIVE_HOME=$(pwd)

Finally, add $HIVE_HOME/bin to your PATH:

  $ export PATH=$HIVE_HOME/bin:$PATH

Building Hive from Source

The Hive SVN repository is located here: http://svn.apache.org/repos/asf/hive/trunk

  $ svn co http://svn.apache.org/repos/asf/hive/trunk hive
  $ cd hive
  $ ant clean package
  $ cd build/dist
  $ ls
  README.txt
  bin/ (all the shell scripts)
  lib/ (required jar files)
  conf/ (configuration files)
  examples/ (sample input and query files)

In the rest of the page, we use build/dist and <install-dir> interchangeably.

Compile hive on hadoop 23
  $ svn co http://svn.apache.org/repos/asf/hive/trunk hive
  $ cd hive
  $ ant clean package -Dhadoop.version=0.23.3 -Dhadoop-0.23.version=0.23.3 -Dhadoop.mr.rev=23
  $ ant clean package -Dhadoop.version=2.0.0-alpha -Dhadoop-0.23.version=2.0.0-alpha -Dhadoop.mr.rev=23

Running Hive

Hive uses Hadoop. That means:

  • you must have hadoop in your path OR
  • export HADOOP_HOME=<hadoop-install-dir>

In addition, you must create /tmp and /user/hive/warehouse
(aka hive.metastore.warehouse.dir) and set them chmod g+w in
HDFS before a table can be created in Hive.

Commands to perform this setup

  $ $HADOOP_HOME/bin/hadoop fs -mkdir       /tmp
  $ $HADOOP_HOME/bin/hadoop fs -mkdir       /user/hive/warehouse
  $ $HADOOP_HOME/bin/hadoop fs -chmod g+w   /tmp
  $ $HADOOP_HOME/bin/hadoop fs -chmod g+w   /user/hive/warehouse

I also find it useful but not necessary to set HIVE_HOME

  $ export HIVE_HOME=<hive-install-dir>

To use the Hive command line interface (CLI) from the shell:

  $ $HIVE_HOME/bin/hive

Configuration management overview

  • Hive default configuration is stored in <install-dir>/conf/hive-default.xml
    Configuration variables can be changed by (re-)defining them in <install-dir>/conf/hive-site.xml
  • The location of the Hive configuration directory can be changed by setting the HIVE_CONF_DIR environment variable.
  • Log4j configuration is stored in <install-dir>/conf/hive-log4j.properties
  • Hive configuration is an overlay on top of hadoop - meaning the hadoop configuration variables are inherited by default.
  • Hive configuration can be manipulated by:
    • Editing hive-site.xml and defining any desired variables (including hadoop variables) in it
    • From the cli using the set command (see below)
    • By invoking hive using the syntax:
      • $ bin/hive -hiveconf x1=y1 -hiveconf x2=y2
        this sets the variables x1 and x2 to y1 and y2 respectively
    • By setting the HIVE_OPTS environment variable to "-hiveconf x1=y1 -hiveconf x2=y2" which does the same as above (see the example after this list)
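
For example, a minimal sketch (the configuration directory path, the variables, and their values below are only illustrative) that points Hive at an alternate configuration directory and pre-sets two variables for every subsequent CLI invocation:

  $ export HIVE_CONF_DIR=<my-hive-conf-dir>
  $ export HIVE_OPTS="-hiveconf mapred.reduce.tasks=2 -hiveconf hive.exec.scratchdir=/tmp/myscratch"
  $ $HIVE_HOME/bin/hive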

    Runtime configuration

  • Hive queries are executed using map-reduce jobs and, therefore, the behavior
    of such queries can be controlled by the Hadoop configuration variables.
  • The cli command 'SET' can be used to set any hadoop (or hive) configuration variable. For example:
    hive> SET mapred.job.tracker=myhost.mycompany.com:50030;
    hive> SET -v;

The latter shows all the current settings. Without the -v option, only the
variables that differ from the base Hadoop configuration are displayed.
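
SET can also be used to inspect a single variable; for example (the variable chosen here is only illustrative):

  hive> SET mapred.reduce.tasks;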

Hive, Map-Reduce and Local-Mode


Hive compiler generates map-reduce jobs for most queries. These jobs are then submitted to the Map-Reduce cluster indicated by the variable:

  mapred.job.tracker

While this usually points to a map-reduce cluster with multiple nodes, Hadoop also offers a nifty option to run map-reduce jobs locally on the user's workstation. This can be very useful to run queries over small data sets - in such cases local mode execution is usually significantly faster than submitting jobs to a large cluster. Data is accessed transparently from HDFS. Conversely, local mode only runs with one reducer and can be very slow processing larger data sets.


Starting with v-0.7, Hive fully supports local mode execution. To enable this, the user can set the following option:

  hive> SET mapred.job.tracker=local;

In addition,  mapred.local.dir should point to a path that's valid on the local machine (for example  /tmp/<username>/mapred/local). (Otherwise, the user will get an exception allocating local disk space).
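
For example (the path is only an illustration and must exist and be writable on the local machine):

  hive> SET mapred.local.dir=/tmp/<username>/mapred/local;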

Starting v-0.7, Hive also supports a mode to run map-reduce jobs in local-mode automatically. The relevant options are:

  hive> SET hive.exec.mode.local.auto=false;

Note that this feature is disabled by default. If enabled, Hive analyzes the size of each map-reduce job in a query and may run it locally if the following thresholds are satisfied (a usage sketch follows the list):

  • The total input size of the job is lower than: hive.exec.mode.local.auto.inputbytes.max (128MB by default)
  • The total number of map-tasks is less than: hive.exec.mode.local.auto.tasks.max (4 by default)
  • The total number of reduce tasks required is 1 or 0
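
For example, a minimal sketch that turns this feature on and tightens the input-size threshold to 64MB (the threshold value is arbitrary and only illustrative):

  hive> SET hive.exec.mode.local.auto=true;
  hive> SET hive.exec.mode.local.auto.inputbytes.max=67108864;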

So for queries over small data sets, or for queries with multiple map-reduce jobs where the input to subsequent jobs is substantially smaller (because of reduction/filtering in the prior job), jobs may be run locally.

Note that there may be differences in the runtime environment of hadoop server nodes and the machine running the hive client (because of different jvm versions or different software libraries). This can cause unexpected behavior/errors while running in local mode. Also note that local mode execution is done in a separate, child jvm (of the hive client). If the user so wishes, the maximum amount of memory for this child jvm can be controlled via the option  hive.mapred.local.mem. By default, it's set to zero, in which case Hive lets Hadoop determine the default memory limits of the child jvm.
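
For example, a non-zero limit can be set from the CLI (the value shown is only illustrative; check your Hive version's documentation for the units it expects):

  hive> SET hive.mapred.local.mem=1024;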

Error Logs


Hive uses log4j for logging. By default logs are not emitted to the console by the CLI. The default logging level is  WARN and the logs are stored in the folder:
  • /tmp/<user.name>/hive.log

If the user wishes - the logs can be emitted to the console by adding the arguments shown below:
  • bin/hive -hiveconf hive.root.logger=INFO,console

Alternatively, the user can change the logging level only by using:

  • bin/hive -hiveconf hive.root.logger=INFO,DRFA

Note that setting  hive.root.logger via the 'set' command does not change logging properties since they are determined at initialization time.

Hive also stores query logs on a per hive session basis in  /tmp/<user.name>/, but this location can be configured in  hive-site.xml with the  hive.querylog.location property.
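
For example, the query log location can also be overridden at startup (the path shown is only illustrative):

  $ bin/hive -hiveconf hive.querylog.location=/var/log/hive/querylogs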

Logging during Hive execution on a Hadoop cluster is controlled by Hadoop configuration. Usually Hadoop will produce one log file per map and reduce task stored on the cluster machine(s) where the task was executed. The log files can be obtained by clicking through to the Task Details page from the Hadoop JobTracker Web UI.

When using local mode (using  mapred.job.tracker=local), Hadoop/Hive execution logs are produced on the client machine itself. Starting v-0.6 - Hive uses the  hive-exec-log4j.properties (falling back to  hive-log4j.properties only if it's missing) to determine where these logs are delivered by default. The default configuration file produces one log file per query executed in local mode and stores it under /tmp/<user.name>. The intent of providing a separate configuration file is to enable administrators to centralize execution log capture if desired (on a NFS file server for example). Execution logs are invaluable for debugging run-time errors.

Error logs are very useful to debug problems. Please send them with any bugs (of which there are many!) to  hive-dev@hadoop.apache.org.


DDL Operations

Creating Hive tables and browsing through them

  hive> CREATE TABLE pokes (foo INT, bar STRING);

Creates a table called pokes with two columns, the first being an integer and the other a string

  hive> CREATE TABLE invites (foo INT, bar STRING) PARTITIONED BY (ds STRING);

Creates a table called invites with two columns and a partition column called ds. The partition column is a virtual column. It is not part of the data itself but is derived from the partition that a particular dataset is loaded into.

By default, tables are assumed to be of text input format and the delimiters are assumed to be ^A(ctrl-a).
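
If a different delimiter is needed, it can be spelled out when the table is created; a minimal sketch (the table name and the comma delimiter are only illustrative):

  hive> CREATE TABLE pokes_csv (foo INT, bar STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE;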

  hive> SHOW TABLES;

lists all the tables

  hive> SHOW TABLES '.*s';

lists all the tables that end with 's'. The pattern matching follows Java regular
expressions. Check out this link for documentation http://java.sun.com/javase/6/docs/api/java/util/regex/Pattern.html

  hive> DESCRIBE invites;

shows the list of columns

As for altering tables, table names can be changed and additional columns can be added:

  hive> ALTER TABLE pokes ADD COLUMNS (new_col INT);
  hive> ALTER TABLE invites ADD COLUMNS (new_col2 INT COMMENT 'a comment');
  hive> ALTER TABLE events RENAME TO 3koobecaf;

Dropping tables:

  hive> DROP TABLE pokes;

Metadata Store


Metadata is in an embedded Derby database whose disk storage location is determined by the hive configuration variable named  javax.jdo.option.ConnectionURL. By default (see  conf/hive-default.xml), this location is ./metastore_db.
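
The effective value can be inspected from the CLI (and overridden in hive-site.xml like any other configuration variable); for example:

  hive> SET javax.jdo.option.ConnectionURL;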

Right now, in the default configuration, this metadata can only be seen by one user at a time.
Metastore can be stored in any database that is supported by JPOX. The location and the type of the RDBMS can be controlled by the two variables  javax.jdo.option.ConnectionURL and  javax.jdo.option.ConnectionDriverName.
Refer to JDO (or JPOX) documentation for more details on supported databases.
The database schema is defined in JDO metadata annotations file  package.jdo at  src/contrib/hive/metastore/src/model.

In the future, the metastore itself can be a standalone server.

If you want to run the metastore as a network server so it can be accessed from multiple nodes try HiveDerbyServerMode.

DML Operations

Loading data from flat files into Hive:

  hive> LOAD DATA LOCAL INPATH './examples/files/kv1.txt' OVERWRITE INTO TABLE pokes;

Loads a file that contains two columns separated by ctrl-a into the pokes table.
'local' signifies that the input file is on the local file system. If 'local' is omitted then it looks for the file in HDFS.

The keyword 'overwrite' signifies that existing data in the table is deleted.
If the 'overwrite' keyword is omitted, data files are appended to existing data sets.
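
For example, the following appends to (rather than replaces) whatever is already in pokes:

  hive> LOAD DATA LOCAL INPATH './examples/files/kv1.txt' INTO TABLE pokes;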

NOTES:

  • NO verification of data against the schema is performed by the load command.
  • If the file is in hdfs, it is moved into the Hive-controlled file system namespace.
    The root of the Hive directory is specified by the option hive.metastore.warehouse.dir
    in hive-default.xml. We advise users to create this directory before trying to create tables via Hive.
  hive> LOAD DATA LOCAL INPATH './examples/files/kv2.txt' OVERWRITE INTO TABLE invites PARTITION (ds='2008-08-15');
  hive> LOAD DATA LOCAL INPATH './examples/files/kv3.txt' OVERWRITE INTO TABLE invites PARTITION (ds='2008-08-08');

The two LOAD statements above load data into two different partitions of the table invites. Table invites must be created as partitioned by the key ds for this to succeed.

  hive> LOAD DATA INPATH '/user/myname/kv2.txt' OVERWRITE INTO TABLE invites PARTITION (ds='2008-08-15');

The above command will load data from an HDFS file/directory to the table.
Note that loading data from HDFS will result in moving the file/directory. As a result, the operation is almost instantaneous.

SQL Operations

Example Queries

Some example queries are shown below. They are available in build/dist/examples/queries.
More are available in the hive sources at ql/src/test/queries/positive

SELECTS and FILTERS
  hive> SELECT a.foo FROM invites a WHERE a.ds='2008-08-15';

selects column 'foo' from all rows of partition ds=2008-08-15 of the invites table. The results are not
stored anywhere, but are displayed on the console.

Note that in all the examples that follow, INSERT (into a hive table, local
directory or HDFS directory) is optional.

  hive> INSERT OVERWRITE DIRECTORY '/tmp/hdfs_out' SELECT a.* FROM invites a WHERE a.ds='2008-08-15';

selects all rows from partition ds=2008-08-15 of the invites table into an HDFS directory. The result data
is in files (depending on the number of mappers) in that directory.
NOTE: partition columns, if any, are selected by the use of *. They can also
be specified in the projection clauses, as shown below.
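
For example, the partition column can be named in the projection just like a regular column (illustrative):

  hive> SELECT a.foo, a.ds FROM invites a WHERE a.ds='2008-08-15';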

Partitioned tables must always have a partition selected in the WHERE clause of the statement.

  hive> INSERT OVERWRITE LOCAL DIRECTORY '/tmp/local_out' SELECT a.* FROM pokes a;

Selects all rows from pokes table into a local directory

  hive> INSERT OVERWRITE TABLE events SELECT a.* FROM profiles a;
  hive> INSERT OVERWRITE TABLE events SELECT a.* FROM profiles a WHERE a.key < 100;
  hive> INSERT OVERWRITE LOCAL DIRECTORY '/tmp/reg_3' SELECT a.* FROM events a;
  hive> INSERT OVERWRITE DIRECTORY '/tmp/reg_4' select a.invites, a.pokes FROM profiles a;
  hive> INSERT OVERWRITE DIRECTORY '/tmp/reg_5' SELECT COUNT(*) FROM invites a WHERE a.ds='2008-08-15';
  hive> INSERT OVERWRITE DIRECTORY '/tmp/reg_5' SELECT a.foo, a.bar FROM invites a;
  hive> INSERT OVERWRITE LOCAL DIRECTORY '/tmp/sum' SELECT SUM(a.pc) FROM pc1 a;

Sum of a column. avg, min, max can also be used. Note that for versions of Hive which don't include HIVE-287, you'll need to use COUNT(1) in place of COUNT(*).
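
For example, the other aggregates can be used in the same way (illustrative):

  hive> SELECT MIN(a.foo), MAX(a.foo), AVG(a.foo) FROM invites a WHERE a.ds='2008-08-15';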

GROUP BY
  hive> FROM invites a INSERT OVERWRITE TABLE events SELECT a.bar, count(*) WHERE a.foo > 0 GROUP BY a.bar;
  hive> INSERT OVERWRITE TABLE events SELECT a.bar, count(*) FROM invites a WHERE a.foo > 0 GROUP BY a.bar;

Note that for versions of Hive which don't include HIVE-287, you'll need to use COUNT(1) in place of COUNT(*).

JOIN
  hive> FROM pokes t1 JOIN invites t2 ON (t1.bar = t2.bar) INSERT OVERWRITE TABLE events SELECT t1.bar, t1.foo, t2.foo;
MULTITABLE INSERT
  FROM src
  INSERT OVERWRITE TABLE dest1 SELECT src.* WHERE src.key < 100
  INSERT OVERWRITE TABLE dest2 SELECT src.key, src.value WHERE src.key >= 100 and src.key < 200
  INSERT OVERWRITE TABLE dest3 PARTITION(ds='2008-04-08', hr='12') SELECT src.key WHERE src.key >= 200 and src.key < 300
  INSERT OVERWRITE LOCAL DIRECTORY '/tmp/dest4.out' SELECT src.value WHERE src.key >= 300;
STREAMING
  hive> FROM invites a INSERT OVERWRITE TABLE events SELECT TRANSFORM(a.foo, a.bar) AS (oof, rab) USING '/bin/cat' WHERE a.ds > '2008-08-09';

This streams the data in the map phase through the script /bin/cat (like hadoop streaming).
Similarly, streaming can be used on the reduce side (please see the Hive Tutorial for examples); a rough sketch follows.
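
A reduce-side sketch using the MAP/REDUCE aliases for TRANSFORM described in the Hive Tutorial (the reuse of /bin/cat and the column names are purely illustrative):

  hive> FROM (FROM invites a MAP a.foo, a.bar USING '/bin/cat' AS (oof, rab) CLUSTER BY oof) m INSERT OVERWRITE TABLE events REDUCE m.oof, m.rab USING '/bin/cat' AS (foo, bar);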

Simple Example Use Cases

MovieLens User Ratings

First, create a table with tab-delimited text file format:

CREATE TABLE u_data (
  userid INT,
  movieid INT,
  rating INT,
  unixtime STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;

Then, download and extract the data files:

wget http://www.grouplens.org/system/files/ml-data.tar+0.gz
tar xvzf ml-data.tar+0.gz

And load it into the table that was just created:

LOAD DATA LOCAL INPATH 'ml-data/u.data'
OVERWRITE INTO TABLE u_data;

Count the number of rows in table u_data:

SELECT COUNT(*) FROM u_data;

Note that for versions of Hive which don't include HIVE-287, you'll need to use COUNT(1) in place of COUNT(*).

Now we can do some complex data analysis on the table u_data:

Create weekday_mapper.py:

import sys
import datetime

# For each tab-separated record on stdin, replace the unix timestamp
# with the ISO weekday (1 = Monday, ..., 7 = Sunday).
for line in sys.stdin:
  line = line.strip()
  userid, movieid, rating, unixtime = line.split('\t')
  weekday = datetime.datetime.fromtimestamp(float(unixtime)).isoweekday()
  print '\t'.join([userid, movieid, rating, str(weekday)])

Use the mapper script:

CREATE TABLE u_data_new (
  userid INT,
  movieid INT,
  rating INT,
  weekday INT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t';

add FILE weekday_mapper.py;

INSERT OVERWRITE TABLE u_data_new
SELECT
  TRANSFORM (userid, movieid, rating, unixtime)
  USING 'python weekday_mapper.py'
  AS (userid, movieid, rating, weekday)
FROM u_data;

SELECT weekday, COUNT(*)
FROM u_data_new
GROUP BY weekday;

Note that if you're using Hive 0.5.0 or earlier you will need to use COUNT(1) in place of COUNT(*).

Apache Weblog Data

The format of Apache weblogs is customizable, while most webmasters use the default.
For default Apache weblog, we can create a table with the following command.

More about RegexSerDe can be found here: http://issues.apache.org/jira/browse/HIVE-662

add jar ../build/contrib/hive_contrib.jar;

CREATE TABLE apachelog (
  host STRING,
  identity STRING,
  user STRING,
  time STRING,
  request STRING,
  status STRING,
  size STRING,
  referer STRING,
  agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "([^]*) ([^]*) ([^]*) (-|\\[^\\]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\".*\") ([^ \"]*|\".*\"))?",
  "output.format.string" = "%1$s %2$s %3$s %4$s %5$s %6$s %7$s %8$s %9$s"
)
STORED AS TEXTFILE;

