Programming Hive Chapters 1-6 Reading Notes

Chapter1
there is no record-level insert, update, or delete
high latency : so Hive is closer to OLAP than OLTP (OLAP is for analysis), though it does not really satisfy the "online" part of OLAP either

if you want OLTP features for large-scale data, consider using a NoSQL database, for example HBase.

Hive is best suited for data warehouse applications.

Hive interfaces : the CLI (command line), the Hive Web Interface, and programmatic access through JDBC, ODBC, and a Thrift server.

Hive communicates with JobTracker to initiate the MapReduce Job.

Metastore is a separate relational database (usually a MySQL instance) where Hive persists table schemas and other system metadata.

Hive related tools:
     Pig : a data flow language rather than a query language. Given some input files to transform into output files, Hive generates nested queries while Pig generates temporary tables.
     In Pig, you declare relations in terms of other relations. Pig looks at these declarations and then builds up a sequence of MapReduce jobs to perform the transformations until the final results are computed.
     Pig is often used as an ETL tool to ingest external data into Hadoop.

HBase
     scenario : row-level updates, rapid query response times, and transactions.
     distributed and scalable data store
     column-oriented storage
     related columns are stored together in the same storage space, so a query reads only the columns it needs and skips the rest

Storm
     Hadoop is a batch-oriented system
     Storm is suited for event stream processing
     better suited for real-time processing

Taking word count as an example:
     class Map extends Mapper<LongWritable, Text, Text, IntWritable> // implements the Mapper interface
     its map function is map(LongWritable, Text, Context) // the LongWritable and Text are the input; the output <key, value> pairs are written through the Context
     the Reduce side is similar, but in its reduce(Text, Iterable<IntWritable>, Context) the input values come as an Iterable, i.e. the collection of counts for one key
The process can also be understood by what each stage does:
     map : <key1, value1> -> <key2, value2>. For word count, key1 is each line's offset from the start of the input (not actually used here), value1 is the input line, key2 is a word, and value2 is 1, meaning the word occurred once.
     Then, in the sort/shuffle step, records with the same key (key2) are sent to the same reducer, and the reducer sums the counts : adding up the 1s gives the final value.

Comparing Hive against a hand-written Java MapReduce word count : Hive wraps the low-level code, so the same functionality takes much less code in Hive.
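For comparison, the whole job can be sketched in a few lines of HiveQL (a sketch; the table name docs and its single STRING column line are assumptions, and the split delimiter is simplified to a single space):

```sql
-- Assumed staging table: one row per input line.
CREATE TABLE docs (line STRING);

-- Split each line into words, explode into one row per word, then count.
SELECT word, count(1) AS count
FROM (SELECT explode(split(line, ' ')) AS word FROM docs) w
GROUP BY word;
```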

Chapter2

Installing Hive requires Hadoop to be installed first. Also, if you are not using the default Derby database, you should specify where the metadata is stored, via a JDBC connection configured in hive-site.xml.

Installing Hadoop and Hive
     e.g. on Amazon AWS : http://aws.amazon.com/cn/elasticmapreduce/

hive -e : exit after the command execution finishes
hive -S : execute silently
hive -f : execute from a file
     e.g. hive -f /path/to/file/withqueries.hql
     by convention, Hive query files use a .q or .hql extension.

     inside the Hive shell, the SOURCE command can also be used to execute a script file.

hive -e "LOAD DATA LOCAL INPATH '/tmp/myfile' INTO TABLE src"

all commands in the .hiverc file in the HOME directory are run automatically at CLI startup.

.hivehistory stores the command history.

add ! in front to run a bash command from the Hive CLI
     e.g. ! pwd; or ! /bin/echo "hello world";

hadoop commands can also be run from the Hive CLI.
     like dfs -ls /;

comment lines start with "--"
     -- copyright 2012 Megacorp. LLC.

set hive.cli.print.header = true;
     prints header info : column names are shown along with the query results

Chapter3 Data Types and File Formats

primitive data types and collection data types.
The basic data types supported are similar to SQL's.

each of these data types is implemented in Java
     STRING is implemented by the Java String
     FLOAT is implemented by the Java float

Hive does not support strings with a declared maximum length, such as char(10) : traditional SQL supports them mainly as a performance optimization, while Hive cares more about disk reading and writing performance.

collection data types
     structs : declared like STRUCT<first:STRING, last:STRING>, constructed like struct('John', 'Doe')
     maps : like map('first', 'John', 'last', 'Doe')
     arrays : array('John', 'Doe')
traditional databases do not support collection data types because they violate normalization; structs would instead be captured in separate tables, with foreign key relations between the tables.

create table employees (
     name STRING,
     salary FLOAT,
     subordinates ARRAY<STRING>,
     deductions MAP<STRING, FLOAT>,
     address STRUCT<street:STRING, city:STRING, state:STRING, zip:INT>
);
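Once the table exists, each collection type has its own access syntax; a sketch against the employees table above (the map key 'Federal Taxes' is an assumption):

```sql
SELECT name,
       subordinates[0],              -- ARRAY: index access
       deductions['Federal Taxes'],  -- MAP: lookup by key (key name assumed)
       address.city                  -- STRUCT: dot notation
FROM employees;
```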

Text file delimiters
     each line is one record
     ^A : separates all fields (columns)
     ^B : separates elements in an ARRAY or STRUCT, and key-value pairs in a MAP
     ^C : separates the key from the value within a MAP key-value pair.
a record in this format can be translated into an equivalent JSON representation

other delimiters can also be declared, e.g. FIELDS TERMINATED BY '\001', COLLECTION ITEMS TERMINATED BY '\002', MAP KEYS TERMINATED BY '\003'
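Put together, the delimiters go in a ROW FORMAT DELIMITED clause of CREATE TABLE; a sketch reusing the employees schema (spelling out the \00x values is optional here since they are Hive's defaults):

```sql
CREATE TABLE employees (
  name         STRING,
  salary       FLOAT,
  subordinates ARRAY<STRING>,
  deductions   MAP<STRING, FLOAT>,
  address      STRUCT<street:STRING, city:STRING, state:STRING, zip:INT>
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\001'            -- ^A between columns
  COLLECTION ITEMS TERMINATED BY '\002'  -- ^B inside ARRAY/STRUCT/MAP entries
  MAP KEYS TERMINATED BY '\003'          -- ^C between a map key and its value
  LINES TERMINATED BY '\n'
STORED AS TEXTFILE;
```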

traditional databases : schema on write
     the schema is enforced when data is written, e.g. on insert or update

Hive : schema on read, the schema is applied when the data is queried.

Chapter4 Data Definition

Hive offers no support for row-level inserts, updates, or deletes.
Hive doesn't support transactions.

In Hive, a database means a catalog or namespace of tables.
hive > create database xxx;

you can override Hive's default database directory (the default path is specified by hive.metastore.warehouse.dir) :
create database xxx location '/tmp/xx/xxx';

hive.metastore.warehouse.dir determines the storage directory for databases; it is a directory for Hive's database concept, not the MySQL path

descriptive information can be added at create database time :
     create database xxx with dbproperties ('key' = 'value');
     create database xxx comment 'mocky micky';
this information is shown by DESCRIBE DATABASE

drop the tables before dropping a database,
     or use drop database if exists xxx cascade; (cascade also deletes the tables inside the database's directory)

a COMMENT can be added when creating a table,
     including a comment for each column, e.g. (name STRING COMMENT 'name', ...

Hive automatically adds two table properties : last_modified_by holds the username of the last user to modify the table, and last_modified_time holds the epoch time in seconds of that modification.

by default, a table is created as a subdirectory under its database's directory.

managed table = internal table : Hive controls the lifecycle of its data.
     when you drop a managed table, Hive deletes the table's data.
managed tables can be inflexible : e.g. when the data is also processed by a system like Pig, having Hive own it is a problem
external tables are the counterpart
     but in practice the difference is smaller than it looks : both kinds of table can be accessed by other tools (e.g. plain hadoop dfs); for external tables, Hive owns the table metadata but does not control the data itself.

if data is shared between tools, create external tables for it
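A minimal external-table sketch (the path and schema are assumptions); dropping this table later removes only the metadata:

```sql
CREATE EXTERNAL TABLE IF NOT EXISTS stocks (
  symbol STRING,
  price  FLOAT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/stocks';  -- the data stays here after DROP TABLE
```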

creating a partition creates the corresponding subdirectory under the table's directory
     partitioning revises how the data files are stored : the partition values become part of the directory path.

set hive.mapred.mode = strict; forces queries on partitioned tables to use a partition filter, mainly because a full-table query would be slow
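The two together, as a sketch (table and column names are assumptions):

```sql
CREATE TABLE employees_by_region (
  name   STRING,
  salary FLOAT
)
PARTITIONED BY (country STRING, state STRING);  -- becomes .../country=US/state=CA/

SET hive.mapred.mode = strict;
-- OK under strict mode: the WHERE clause filters on a partition column.
SELECT name FROM employees_by_region WHERE country = 'US';
```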

SHOW PARTITIONS tablename; shows the partition info of the table
     e.g. country=a/state=b ... country=e/state=d
     you can also show only a subset of the partitions

DESCRIBE EXTENDED tablename : besides the table's own information, also shows the table's partition info, and whether the tabletype is internal or external.

table storage format : Hive defaults to a text file format, the optional clause "STORED AS TEXTFILE"
     this can be changed to SEQUENCEFILE or RCFILE to optimize disk space usage and I/O bandwidth performance.

alter table : affects the metadata rather than the data itself.
     alter table modifies table metadata only; the data of the table is untouched.
     alter table xxx rename to yyy;

adding partitions to a table :
alter table xxx add if not exists
     partition (year = 2011, month = 1, day = 1) location '/logs/2011/01/01'
     partition (year = 2011, month = 1, day = 2) location '/logs/2011/01/02'
     partition (year = 2011, month = 1, day = 3) location '/logs/2011/01/03'
;

changing a partition's location for a table :
alter table xxx partition (year = 2011, month = 1, day = 1)
set location 's3n://xxx/logs/2011/01/01';

external table
managed table = internal table
     on delete, an external table only drops the metadata definition, while a managed table deletes both the metadata definition and the data

alter table xxx
     change column old_name new_name type
     comment 'clause'
     after column_name;
changes a column's definition and can also move the column's position

TOUCH and hooks
     trigger the corresponding execution when a table's stored data changes.

alter table ... archive partition packs a partition's files into a HAR archive file
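For example (the table and partition spec are assumptions; archiving must be enabled first, and UNARCHIVE reverses it):

```sql
SET hive.archive.enabled = true;
-- Pack the partition's many small files into one HAR file:
ALTER TABLE log_messages ARCHIVE PARTITION (year = 2011, month = 1, day = 1);
-- Unpack again:
ALTER TABLE log_messages UNARCHIVE PARTITION (year = 2011, month = 1, day = 1);
```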

Chapter5 HiveQL: Data Manipulation

used to put data into tables and to extract data from tables to the filesystem.

this chapter is not about how to do select ... where, but about loading data into tables.

the only way to put data into a table is to use one of the bulk load operations.

insert data into tables from queries
     insert into appends data, while insert overwrite replaces it.
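Side by side (the table names are assumptions):

```sql
-- Replaces the current contents of the target table (or target partition):
INSERT OVERWRITE TABLE employees_ca
SELECT * FROM staged_employees se WHERE se.state = 'CA';

-- Appends to the existing contents instead of replacing them:
INSERT INTO TABLE employees_ca
SELECT * FROM staged_employees se WHERE se.state = 'CA';
```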

dynamic partition : the partition values are filled in from the query's output
static partition : explicit values are given
     if static and dynamic partition keys appear together, the static keys must come before the dynamic ones
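A mixed static/dynamic sketch: country is fixed while state is filled per row from the query (table, column names, and settings shown are assumptions):

```sql
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

INSERT OVERWRITE TABLE employees_part
PARTITION (country = 'US', state)   -- static key first, dynamic key last
SELECT name, salary, st AS state    -- dynamic value comes from the last SELECT column
FROM staged_employees
WHERE cnty = 'US';
```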

create tables and load them in one query
     create table xxx
     as select
     from
     where
this feature can't be used for external tables, because Hive controls only the metadata part.

Exporting Data
     copy the directories or files
     insert ... directory :
          insert overwrite local directory '/tmp/ca_employees'
          select xxx
          from tablename
          where columnname = '';

Chapter6 HiveQL : Queries

aggregation functions : User Defined Aggregate functions.
     e.g. count and avg
hive.map.aggr = true;
     attempts to do "top-level" aggregation in the map phase, improving performance
     (an aggregation that isn't top-level is one performed after a GROUP BY)

the counterpart of aggregation functions are table generating functions
     e.g. explode(columnname), which produces many rows from columnname
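A sketch using the employees table from Chapter 3: explode turns the subordinates array into one row per element, and LATERAL VIEW is needed when other columns should appear alongside it:

```sql
-- One output row per array element:
SELECT explode(subordinates) AS sub FROM employees;

-- Keep the employee's name next to each subordinate:
SELECT name, sub
FROM employees
LATERAL VIEW explode(subordinates) subView AS sub;
```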

nested select
     from (select xxx from tablename) e
     select yyy
     where zzz;
case ... when ... then statements.

Group by statement.
Join statement.
     Hive only supports equi-joins for now.
if there are multiple joins and they use the same join key, they can be executed in a single MapReduce job. Hive also assumes the last table in a join is the largest, so queries should put the largest table last (the join order written in the query drives this optimization)

left semi join replaces where ... in / exists
the syntax in traditional SQL :
     select s.ymd, s.symbol from stocks s
     where s.ymd, s.symbol in
     (select d.ymd, d.symbol from dividends d);
the Hive equivalent :
     select s.ymd, s.symbol from stocks s left semi join dividends d on s.ymd = d.ymd and s.symbol = d.symbol;

in Hive, some incorrect query forms can produce a Cartesian product
     select * from stocks join dividends
     where stocks.symbol = dividends.symbol and stocks.symbol = 'AAA';
in fact, Hive computes the Cartesian product of the join before applying the where conditions

order by : all the data goes to a single reducer for sorting, which is relatively slow
sort by : each reducer sorts its own data, so the result is only locally ordered
     better performance is traded for total ordering.
distribute by defines which reducer each map output row goes to, so distribute by and sort by should share column names.
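A sketch (the stocks table is an assumption): all rows for a symbol land on the same reducer, and each reducer sorts its own rows:

```sql
SELECT s.symbol, s.ymd, s.price_close
FROM stocks s
DISTRIBUTE BY s.symbol              -- route rows with the same symbol to one reducer
SORT BY s.symbol ASC, s.ymd ASC;    -- per-reducer ordering
```

Note that DISTRIBUTE BY must come before SORT BY in the query.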

cast converts data types
where cast(salary as float) < 1000.0;

Queries that sample data
     block sampling
     create table numbers_bucketed (number int) clustered by (number) into 3 buckets;
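Sampling queries against a table like this, as a sketch:

```sql
-- Bucket sampling: take bucket 2 of 3, deterministic on the clustering column:
SELECT * FROM numbers_bucketed TABLESAMPLE (BUCKET 2 OUT OF 3 ON number);

-- Block sampling: roughly 0.1 percent of the input blocks:
SELECT * FROM numbers_bucketed TABLESAMPLE (0.1 PERCENT);
```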

UNION ALL combines all the tables, and each subquery must produce the same data type for the same column.
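A sketch (both subqueries produce matching columns; older Hive versions expect UNION ALL wrapped in a subquery like this):

```sql
SELECT u.symbol, u.ymd
FROM (
  SELECT s.symbol, s.ymd FROM stocks s
  UNION ALL
  SELECT d.symbol, d.ymd FROM dividends d
) u;
```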