Programming Hive Chapters 1-6 Reading Notes

Chapter1
there is no record-level insert, update, or delete
high latency : so Hive is closer to OLAP than OLTP (OLAP is for analysis), though it does not really satisfy the "online" part of OLAP either

if you want OLTP features for large-scale data, consider using a NoSQL database, for example HBase.

Hive is best suited for data warehouse applications.

Hive interfaces : the CLI (command line), the Hive Web Interface, and programmatic access through JDBC, ODBC, and a Thrift server.

Hive communicates with JobTracker to initiate the MapReduce Job.

Metastore is a separate relational database (usually a MySQL instance) where Hive persists table schemas and other system metadata.

Hive related tools:
     Pig : a data flow language rather than a query language. Given some input files to transform into output files, Hive generates nested queries while Pig generates temporary tables.
     In Pig, you declare relations in terms of other relations. Pig looks at these declarations and then builds up a sequence of MapReduce jobs to perform the transformations until the final results are computed.
     Pig is often used as an ETL tool to ingest external data into Hadoop.

HBase
     scenario : row-level updates, rapid query response times, and transactions.
     distributed and scalable data store
     column-oriented storage
     related columns are stored together in the same storage space, so a query reads only the columns it needs and skips the rest

Storm
     Hadoop is a batch-oriented system
     Storm is suited for event stream processing
     better suited for real-time processing

Taking word count as an example:
     class Map extends Mapper<LongWritable, Text, Text, IntWritable> // implements the Mapper interface
     its map function is map(LongWritable, Text, Context) // the LongWritable and Text are the input; the output <key, value> pairs are written through the Context
     the Reduce side is similar, but in its reduce(Text, Iterable<IntWritable>, Context) the input values come as an Iterable, i.e. the collection of counts for one key
The process can also be understood by what each stage does:
     map : <key1, value1> -> <key2, value2>. For word count, key1 is each line's offset from the start of the input (not actually used here), value1 is the input line, key2 is a word, and value2 is 1, meaning the word occurred once.
     Then, in the sort/shuffle step, records with the same key (key2) are sent to the same reducer, and the reducer sums the counts : adding up the 1s gives the final value.

Comparing Hive against a hand-written Java MapReduce word count : Hive wraps the low-level code, so the same functionality takes much less code in Hive.
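For comparison, the whole job can be sketched in a few lines of HiveQL (a sketch; the table name docs and its single STRING column line are assumptions, and the split delimiter is simplified to a single space):

```sql
-- Assumed staging table: one row per input line.
CREATE TABLE docs (line STRING);

-- Split each line into words, explode into one row per word, then count.
SELECT word, count(1) AS count
FROM (SELECT explode(split(line, ' ')) AS word FROM docs) w
GROUP BY word;
```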

Chapter2

Installing Hive requires Hadoop to be installed first. Also, if you are not using the default Derby database, you should specify where the metadata is stored, via a JDBC connection configured in hive-site.xml.

Installing Hadoop and Hive
     e.g. on Amazon AWS : http://aws.amazon.com/cn/elasticmapreduce/

hive -e : exit after the command execution finishes
hive -S : execute silently
hive -f : execute from a file
     e.g. hive -f /path/to/file/withqueries.hql
     by convention, Hive query files use a .q or .hql extension.

     inside the Hive shell, the SOURCE command can also be used to execute a script file.

hive -e "LOAD DATA LOCAL INPATH '/tmp/myfile' INTO TABLE src"

all commands in the .hiverc file in the HOME directory are run automatically at CLI startup.

.hivehistory stores the command history.

add ! in front to run a bash command from the Hive CLI
     e.g. ! pwd; or ! /bin/echo "hello world";

hadoop commands can also be run from the Hive CLI.
     like dfs -ls /;

comment lines start with "--"
     -- copyright 2012 Megacorp. LLC.

set hive.cli.print.header = true;
     prints header info : column names are shown along with the query results

Chapter3 Data Types and File Formats

primitive data types and collection data types.
The basic data types supported are similar to SQL's.

each of these data types is implemented in Java
     STRING is implemented by the Java String
     FLOAT is implemented by the Java float

Hive does not support strings with a declared maximum length, such as char(10) : traditional SQL supports them mainly as a performance optimization, while Hive cares more about disk reading and writing performance.

collection data types
     structs : declared like STRUCT<first:STRING, last:STRING>, constructed like struct('John', 'Doe')
     maps : like map('first', 'John', 'last', 'Doe')
     arrays : array('John', 'Doe')
traditional databases do not support collection data types because they violate normalization; structs would instead be captured in separate tables, with foreign key relations between the tables.

create table employees (
     name STRING,
     salary FLOAT,
     subordinates ARRAY<STRING>,
     deductions MAP<STRING, FLOAT>,
     address STRUCT<street:STRING, city:STRING, state:STRING, zip:INT>
);
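Once the table exists, each collection type has its own access syntax; a sketch against the employees table above (the map key 'Federal Taxes' is an assumption):

```sql
SELECT name,
       subordinates[0],              -- ARRAY: index access
       deductions['Federal Taxes'],  -- MAP: lookup by key (key name assumed)
       address.city                  -- STRUCT: dot notation
FROM employees;
```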

Text file delimiters
     each line is one record
     ^A : separates all fields (columns)
     ^B : separates elements in an ARRAY or STRUCT, and key-value pairs in a MAP
     ^C : separates the key from the value within a MAP key-value pair.
a record in this format can be translated into an equivalent JSON representation

other delimiters can also be declared, e.g. FIELDS TERMINATED BY '\001', COLLECTION ITEMS TERMINATED BY '\002', MAP KEYS TERMINATED BY '\003'
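Put together, the delimiters go in a ROW FORMAT DELIMITED clause of CREATE TABLE; a sketch reusing the employees schema (spelling out the \00x values is optional here since they are Hive's defaults):

```sql
CREATE TABLE employees (
  name         STRING,
  salary       FLOAT,
  subordinates ARRAY<STRING>,
  deductions   MAP<STRING, FLOAT>,
  address      STRUCT<street:STRING, city:STRING, state:STRING, zip:INT>
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\001'            -- ^A between columns
  COLLECTION ITEMS TERMINATED BY '\002'  -- ^B inside ARRAY/STRUCT/MAP entries
  MAP KEYS TERMINATED BY '\003'          -- ^C between a map key and its value
  LINES TERMINATED BY '\n'
STORED AS TEXTFILE;
```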

traditional databases : schema on write
     the schema is enforced when data is written, e.g. on insert or update

Hive : schema on read, the schema is applied when the data is queried.

Chapter4 Data Definition

Hive offers no support for row-level inserts, updates, or deletes.
Hive doesn't support transactions.

In Hive, a database means a catalog or namespace of tables.
hive > create database xxx;

you can override Hive's default database directory (the default path is specified by hive.metastore.warehouse.dir) :
create database xxx location '/tmp/xx/xxx';

hive.metastore.warehouse.dir determines the storage directory for databases; it is a directory for Hive's database concept, not the MySQL path

descriptive information can be added at create database time :
     create database xxx with dbproperties ('key' = 'value');
     create database xxx comment 'mocky micky';
this information is shown by DESCRIBE DATABASE

drop the tables before dropping a database,
     or use drop database if exists xxx cascade; (cascade also deletes the tables inside the database's directory)

a COMMENT can be added when creating a table,
     including a comment for each column, e.g. (name STRING COMMENT 'name', ...

Hive automatically adds two table properties : last_modified_by holds the username of the last user to modify the table, and last_modified_time holds the epoch time in seconds of that modification.

by default, a table is created as a subdirectory under its database's directory.

managed table = internal table : Hive controls the lifecycle of its data.
     when you drop a managed table, Hive deletes the table's data.
managed tables can be inflexible : e.g. when the data is also processed by a system like Pig, having Hive own it is a problem
external tables are the counterpart
     but in practice the difference is smaller than it looks : both kinds of table can be accessed by other tools (e.g. plain hadoop dfs); for external tables, Hive owns the table metadata but does not control the data itself.

if data is shared between tools, create external tables for it
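A minimal external-table sketch (the path and schema are assumptions); dropping this table later removes only the metadata:

```sql
CREATE EXTERNAL TABLE IF NOT EXISTS stocks (
  symbol STRING,
  price  FLOAT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/stocks';  -- the data stays here after DROP TABLE
```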

creating a partition creates the corresponding subdirectory under the table's directory
     partitioning revises how the data files are stored : the partition values become part of the directory path.

set hive.mapred.mode = strict; forces queries on partitioned tables to use a partition filter, mainly because a full-table query would be slow
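The two together, as a sketch (table and column names are assumptions):

```sql
CREATE TABLE employees_by_region (
  name   STRING,
  salary FLOAT
)
PARTITIONED BY (country STRING, state STRING);  -- becomes .../country=US/state=CA/

SET hive.mapred.mode = strict;
-- OK under strict mode: the WHERE clause filters on a partition column.
SELECT name FROM employees_by_region WHERE country = 'US';
```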

SHOW PARTITIONS tablename; shows the partition info of the table
     e.g. country=a/state=b ... country=e/state=d
     you can also show only a subset of the partitions

DESCRIBE EXTENDED tablename : besides the table's own information, also shows the table's partition info, and whether the tabletype is internal or external.

table storage format : Hive defaults to a text file format, the optional clause "STORED AS TEXTFILE"
     this can be changed to SEQUENCEFILE or RCFILE to optimize disk space usage and I/O bandwidth performance.

alter table : affects the metadata rather than the data itself.
     alter table modifies table metadata only; the data of the table is untouched.
     alter table xxx rename to yyy;

adding partitions to a table :
alter table xxx add if not exists
     partition (year = 2011, month = 1, day = 1) location '/logs/2011/01/01'
     partition (year = 2011, month = 1, day = 2) location '/logs/2011/01/02'
     partition (year = 2011, month = 1, day = 3) location '/logs/2011/01/03'
;

changing a partition's location for a table :
alter table xxx partition (year = 2011, month = 1, day = 1)
set location 's3n://xxx/logs/2011/01/01';

external table
managed table = internal table
     on delete, an external table only drops the metadata definition, while a managed table deletes both the metadata definition and the data

alter table xxx
     change column old_name new_name type
     comment 'clause'
     after column_name;
changes a column's definition and can also move the column's position

TOUCH and hooks
     trigger the corresponding execution when a table's stored data changes.

alter table ... archive partition packs a partition's files into a HAR archive file
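For example (the table and partition spec are assumptions; archiving must be enabled first, and UNARCHIVE reverses it):

```sql
SET hive.archive.enabled = true;
-- Pack the partition's many small files into one HAR file:
ALTER TABLE log_messages ARCHIVE PARTITION (year = 2011, month = 1, day = 1);
-- Unpack again:
ALTER TABLE log_messages UNARCHIVE PARTITION (year = 2011, month = 1, day = 1);
```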

Chapter5 HiveQL: Data Manipulation

used to put data into tables and to extract data from tables to the filesystem.

this chapter is not about how to do select ... where, but about loading data into tables.

the only way to put data into a table is to use one of the bulk load operations.

insert data into tables from queries
     insert into appends data, while insert overwrite replaces it.
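Side by side (the table names are assumptions):

```sql
-- Replaces the current contents of the target table (or target partition):
INSERT OVERWRITE TABLE employees_ca
SELECT * FROM staged_employees se WHERE se.state = 'CA';

-- Appends to the existing contents instead of replacing them:
INSERT INTO TABLE employees_ca
SELECT * FROM staged_employees se WHERE se.state = 'CA';
```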

dynamic partition : the partition values are filled in from the query's output
static partition : explicit values are given
     if static and dynamic partition keys appear together, the static keys must come before the dynamic ones
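A mixed static/dynamic sketch: country is fixed while state is filled per row from the query (table, column names, and settings shown are assumptions):

```sql
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

INSERT OVERWRITE TABLE employees_part
PARTITION (country = 'US', state)   -- static key first, dynamic key last
SELECT name, salary, st AS state    -- dynamic value comes from the last SELECT column
FROM staged_employees
WHERE cnty = 'US';
```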

create tables and load them in one query
     create table xxx
     as select
     from
     where
this feature can't be used for external tables, because Hive controls only the metadata part.

Exporting Data
     copy the directories or files
     insert ... directory :
          insert overwrite local directory '/tmp/ca_employees'
          select xxx
          from tablename
          where columnname = '';

Chapter6 HiveQL : Queries

aggregation functions : User Defined Aggregate functions.
     e.g. count and avg
hive.map.aggr = true;
     attempts to do "top-level" aggregation in the map phase, improving performance
     (an aggregation that isn't top-level is one performed after a GROUP BY)

the counterpart of aggregation functions are table generating functions
     e.g. explode(columnname), which produces many rows from columnname
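A sketch using the employees table from Chapter 3: explode turns the subordinates array into one row per element, and LATERAL VIEW is needed when other columns should appear alongside it:

```sql
-- One output row per array element:
SELECT explode(subordinates) AS sub FROM employees;

-- Keep the employee's name next to each subordinate:
SELECT name, sub
FROM employees
LATERAL VIEW explode(subordinates) subView AS sub;
```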

nested select
     from (select xxx from tablename) e
     select yyy
     where zzz;
case ... when ... then statements.

Group by statement.
Join statement.
     Hive only supports equi-joins for now.
if there are multiple joins and they use the same join key, they can be executed in a single MapReduce job. Hive also assumes the last table in a join is the largest, so queries should put the largest table last (the join order written in the query drives this optimization)

left semi join replaces where ... in / exists
the syntax in traditional SQL :
     select s.ymd, s.symbol from stocks s
     where s.ymd, s.symbol in
     (select d.ymd, d.symbol from dividends d);
the Hive equivalent :
     select s.ymd, s.symbol from stocks s left semi join dividends d on s.ymd = d.ymd and s.symbol = d.symbol;

in Hive, some incorrect query forms can produce a Cartesian product
     select * from stocks join dividends
     where stocks.symbol = dividends.symbol and stocks.symbol = 'AAA';
in fact, Hive computes the Cartesian product of the join before applying the where conditions

order by : all the data goes to a single reducer for sorting, which is relatively slow
sort by : each reducer sorts its own data, so the result is only locally ordered
     better performance is traded for total ordering.
distribute by defines which reducer each map output row goes to, so distribute by and sort by should share column names.
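A sketch (the stocks table is an assumption): all rows for a symbol land on the same reducer, and each reducer sorts its own rows:

```sql
SELECT s.symbol, s.ymd, s.price_close
FROM stocks s
DISTRIBUTE BY s.symbol              -- route rows with the same symbol to one reducer
SORT BY s.symbol ASC, s.ymd ASC;    -- per-reducer ordering
```

Note that DISTRIBUTE BY must come before SORT BY in the query.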

cast converts data types
where cast(salary as float) < 1000.0;

Queries that sample data
     block sampling
     create table numbers_bucketed (number int) clustered by (number) into 3 buckets;
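Sampling queries against a table like this, as a sketch:

```sql
-- Bucket sampling: take bucket 2 of 3, deterministic on the clustering column:
SELECT * FROM numbers_bucketed TABLESAMPLE (BUCKET 2 OUT OF 3 ON number);

-- Block sampling: roughly 0.1 percent of the input blocks:
SELECT * FROM numbers_bucketed TABLESAMPLE (0.1 PERCENT);
```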

UNION ALL combines all the tables, and each subquery must produce the same data type for the same column.
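A sketch (both subqueries produce matching columns; older Hive versions expect UNION ALL wrapped in a subquery like this):

```sql
SELECT u.symbol, u.ymd
FROM (
  SELECT s.symbol, s.ymd FROM stocks s
  UNION ALL
  SELECT d.symbol, d.ymd FROM dividends d
) u;
```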