Class notes, 2020-09-25 (Phoenix: We put the SQL back in NoSQL)

Phoenix overview:

We put the SQL back in NoSQL
OLTP and operational analytics for Apache Hadoop

1. Introduction to Phoenix

Phoenix is a project originally open-sourced by salesforce.com and later donated to the Apache Foundation. It works as a piece of Java middleware: it provides JDBC connections for operating on HBase tables. In short, Phoenix is an open-source SQL engine for HBase: you can use the standard JDBC API instead of the HBase client API to create tables, insert data, and query your HBase data. The Phoenix team sums the project up in one sentence: "We put the SQL back in NoSQL." Here "NoSQL" means HBase specifically: you can query HBase with SQL statements. You might object that Hive and Impala can do that too, but Hive and Impala can also query text files, whereas Phoenix supports nothing but HBase. That single-minded focus is exactly what lets Phoenix outperform both Hive and Impala when querying HBase.

Note: Hive and Impala can also query HBase, but Phoenix far outperforms them there: simple queries return in milliseconds, and queries over millions of rows return in seconds.

2. Phoenix performance

Phoenix is a SQL engine built on top of HBase. You might wonder whether Phoenix slows HBase down, or whether Phoenix itself is slow. In fact it does not: Phoenix achieves performance as good as, and often better than, hand-written HBase client code (not to mention with far less code to write) by:

Compiling your SQL queries into native HBase scans.
Determining the optimal start and stop row keys for each scan.
Orchestrating the parallel execution of your scans.
Pushing the predicates in your WHERE clause down to server-side filters.
Executing aggregate queries through server-side hooks (called coprocessors).

On top of that, Phoenix provides:

Secondary indexes to improve the performance of queries on non-row-key columns (sketched in SQL right after this list).
Statistics gathering to improve parallelization and guide the choice between optimizations.
A skip-scan filter to optimize IN, LIKE, and OR queries.
Optional salting of row keys to evenly distribute write load.
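
A minimal sketch of how the secondary-index feature looks in Phoenix SQL; the table, column, and index names here are hypothetical:

-- Hypothetical table with a covered secondary index on a non-row-key column.
CREATE TABLE users (
    id        BIGINT PRIMARY KEY,
    last_name VARCHAR,
    city      VARCHAR
);
CREATE INDEX idx_users_last_name ON users (last_name) INCLUDE (city);

-- The optimizer can answer this query from the index instead of scanning the table:
SELECT last_name, city FROM users WHERE last_name = 'Smith';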

3. Installing Phoenix and integrating it with HBase

Part I: Phoenix installation
1.1 Prerequisites
Make sure the ZooKeeper and Hadoop clusters are properly deployed and running first.
1.2 Installing and deploying Phoenix
The version used here is phoenix-4.14.0-cdh5.14.2, downloadable from http://archive.cloudera.com/cdh5/cdh/5/
1. Extract Phoenix to the target directory:
[root@hadoop101 software]$ tar -zxf apache-phoenix-4.14.0-cdh5.14.2-bin.tar.gz -C /opt/install
2. Create a symlink for Phoenix:
[root@hadoop101 software]$ ln -s /opt/install/apache-phoenix-4.14.0-cdh5.14.2-bin /opt/install/phoenix
3. Configure the environment variables:
vi /etc/profile
Append the following:
export PHOENIX_HOME=/opt/install/phoenix
export PATH=$PHOENIX_HOME/bin:$PATH
Then reload the profile to apply the change:
source /etc/profile
1.3 Integrating Phoenix with HBase
1. Copy phoenix-4.14.0-cdh5.14.2-server.jar from $PHOENIX_HOME to $HBASE_HOME/lib on every RegionServer machine (repeat the copy for each RegionServer host):
scp /opt/install/phoenix/phoenix-4.14.0-cdh5.14.2-server.jar root@hadoop101:/opt/install/hbase/lib/
1.4 Starting Phoenix
1. Start HBase (if HBase is already running, restart it).
2. From Phoenix's bin directory, launch sqlline, passing the ZooKeeper address as the HBase entry point:
[root@hadoop101 bin]$ ./sqlline.py hadoop101:2181
Once it starts, run the !tables command to list the tables.
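
As a quick smoke test of the installation (the table name here is just an example), you can create a throwaway table from the sqlline prompt and confirm it shows up:

CREATE TABLE IF NOT EXISTS smoke_test (id INTEGER PRIMARY KEY);
-- list tables; SMOKE_TEST should now appear
!tables
DROP TABLE smoke_test;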

4. Phoenix data types:

| Type | Java type | Range | Size |
| --- | --- | --- | --- |
| INTEGER | java.lang.Integer | -2147483648 to 2147483647 | 4 bytes |
| UNSIGNED_INT | java.lang.Integer | 0 to 2147483647 | 4 bytes |
| BIGINT | java.lang.Long | -9223372036854775808 to 9223372036854775807 | 8 bytes |
| UNSIGNED_LONG | java.lang.Long | 0 to 9223372036854775807 | 8 bytes |
| TINYINT | java.lang.Byte | -128 to 127 | 1 byte |
| UNSIGNED_TINYINT | java.lang.Byte | 0 to 127 | 1 byte |
| SMALLINT | java.lang.Short | -32768 to 32767 | 2 bytes |
| UNSIGNED_SMALLINT | java.lang.Short | 0 to 32767 | 2 bytes |
| FLOAT | java.lang.Float | -3.402823466E+38 to 3.402823466E+38 | 4 bytes |
| UNSIGNED_FLOAT | java.lang.Float | 0 to 3.402823466E+38 | 4 bytes |
| DOUBLE | java.lang.Double | -1.7976931348623158E+308 to 1.7976931348623158E+308 | 8 bytes |
| UNSIGNED_DOUBLE | java.lang.Double | 0 to 1.7976931348623158E+308 | 8 bytes |
| DECIMAL | java.math.BigDecimal | up to 38 digits | |
| BOOLEAN | java.lang.Boolean | | |
| TIME | java.sql.Time | | |
| DATE | java.sql.Date | | |
| TIMESTAMP | java.sql.Timestamp | | |
| UNSIGNED_TIME | java.sql.Time | | |
| UNSIGNED_DATE | java.sql.Date | | |
| UNSIGNED_TIMESTAMP | java.sql.Timestamp | | |
| VARCHAR(precisionInt) | java.lang.String | | |
| CHAR(precisionInt) | java.lang.String | | |
| BINARY(precisionInt) | byte[] | | |
| VARBINARY | byte[] | | |
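
Purely as an illustration, here is a hypothetical table definition that exercises several of the types above:

-- Hypothetical table mixing numeric, time, string, and binary types.
CREATE TABLE sensor_reading (
    sensor_id   UNSIGNED_LONG NOT NULL,  -- 8-byte unsigned long
    read_time   TIMESTAMP     NOT NULL,  -- java.sql.Timestamp
    temperature DOUBLE,                  -- 8-byte floating point
    status      CHAR(2),                 -- fixed-length string
    raw_payload VARBINARY,               -- variable-length bytes
    CONSTRAINT pk PRIMARY KEY (sensor_id, read_time)
);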

5. Operations

-- Run sqlline.py hadoop101:2181
-- create a table
CREATE TABLE company (COMPANY_ID INTEGER PRIMARY KEY, NAME
VARCHAR(225));
-- list tables
!tables
-- list all columns of the table
!columns company
-- insert data
UPSERT INTO company VALUES(1, 'Microsoft');
UPSERT INTO company VALUES(2, 'IBM');
UPSERT INTO company VALUES(3, 'Oracle');
UPSERT INTO company VALUES(4, 'Twitter');
UPSERT INTO company VALUES(5, 'Facebook');
-- query data
SELECT * FROM Company;
-- delete a row
DELETE FROM company WHERE COMPANY_ID=5;
SELECT * FROM Company;
-- create a second table
CREATE TABLE stock (COMPANY_ID INTEGER PRIMARY KEY, PRICE
DECIMAL(10,2));
UPSERT INTO stock VALUES(1, 124.9);
UPSERT INTO stock VALUES(2, 99);
UPSERT INTO stock VALUES(3, 150.9);
UPSERT INTO stock VALUES(4, 45.5);
UPSERT INTO stock VALUES(5, 120.5);
-- join the two tables
select s.COMPANY_ID, c.name, s.price
from stock s left outer join company c on s.COMPANY_ID = c.COMPANY_ID;
-- exit
!quit
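
One behavior worth noting from the session above: UPSERT is Phoenix's combined INSERT/UPDATE. Re-running an UPSERT with an existing primary key overwrites the row rather than failing:

UPSERT INTO stock VALUES(1, 130.0);
-- row 1 now holds 130.00 instead of 124.90
SELECT * FROM stock WHERE COMPANY_ID = 1;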

Official quick start:
http://phoenix.apache.org/Phoenix-in-15-minutes-or-less.html

Phoenix in 15 minutes or less

What is this new Phoenix thing I’ve been hearing about?
Phoenix is an open source SQL skin for HBase. You use the standard JDBC APIs instead of the regular HBase client APIs to create tables, insert data, and query your HBase data.

Doesn’t putting an extra layer between my application and HBase just slow things down?
Actually, no. Phoenix achieves as good or likely better performance than if you hand-coded it yourself (not to mention with a heck of a lot less code) by:

  • compiling your SQL queries to native HBase scans
  • determining the optimal start and stop for your scan key
  • orchestrating the parallel execution of your scans
  • bringing the computation to the data by:
      • pushing the predicates in your where clause to a server-side filter
      • executing aggregate queries through server-side hooks (called co-processors)
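
You can inspect the compiled plan yourself with EXPLAIN (this reuses the company table from section 5; the exact output varies by Phoenix version):

EXPLAIN SELECT name FROM company WHERE company_id = 3;
-- Typically reports a point lookup or narrow range scan over the row key,
-- showing that the WHERE clause became scan boundaries rather than a full scan.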

In addition to these items, we’ve got some interesting enhancements in the works to further optimize performance:

  • secondary indexes to improve performance for queries on non row key columns
  • stats gathering to improve parallelization and guide choices between optimizations
  • skip scan filter to optimize IN, LIKE, and OR queries
  • optional salting of row keys to evenly distribute write load
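
Of these, salting is the one you opt into explicitly at table-creation time, via the SALT_BUCKETS table property; the table below is hypothetical and the bucket count arbitrary:

-- A salt byte is prepended to each row key, spreading monotonically
-- increasing keys (such as timestamps) across 8 buckets and hence across region servers.
CREATE TABLE events (
    event_time TIMESTAMP NOT NULL,
    event_id   BIGINT    NOT NULL,
    payload    VARCHAR,
    CONSTRAINT pk PRIMARY KEY (event_time, event_id)
) SALT_BUCKETS = 8;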

Ok, so it’s fast. But why SQL? It’s so 1970s
Well, that’s kind of the point: give folks something with which they’re already familiar. What better way to spur the adoption of HBase? On top of that, using JDBC and SQL:

  • Reduces the amount of code users need to write
  • Allows for performance optimizations transparent to the user
  • Opens the door for leveraging and integrating lots of existing tooling

But how can SQL support my favorite HBase technique of x,y,z
Didn’t make it to the last HBase Meetup did you? SQL is just a way of expressing what you want to get not how you want to get it. Check out my presentation for various existing and to-be-done Phoenix features to support your favorite HBase trick. Have ideas of your own? We’d love to hear about them: file an issue for us and/or join our mailing list.

Blah, blah, blah - I just want to get started!
Ok, great! Just follow our install instructions:

  • download and expand our installation tar (download, then extract it)
  • copy the phoenix server jar that is compatible with your HBase installation into the lib directory of every region server (e.g., copy phoenix-4.9.0-HBase-1.2-server.jar into the hbase/lib directory on every region server node; if a backup master is configured, copy it to that master node as well)
  • restart the region servers (or simply restart all three nodes)
  • add the phoenix client jar to the classpath of your HBase client
  • download and setup SQuirrel as your SQL client so you can issue adhoc SQL against your HBase cluster

I don’t want to download and setup anything else!
Ok, fair enough - you can create your own SQL scripts and execute them using our command line tool instead. Let’s walk through an example now. Begin by navigating to the bin/ directory of your Phoenix install location.

  • First, let’s create a us_population.sql file, containing a table definition:
CREATE TABLE IF NOT EXISTS us_population (
      state CHAR(2) NOT NULL,
      city VARCHAR NOT NULL,
      population BIGINT
      CONSTRAINT my_pk PRIMARY KEY (state, city));
  • Now let’s create a us_population.csv file containing some data to put in that table:
NY,New York,8143197
CA,Los Angeles,3844829
IL,Chicago,2842518
TX,Houston,2016582
PA,Philadelphia,1463281
AZ,Phoenix,1461575
TX,San Antonio,1256509
CA,San Diego,1255540
TX,Dallas,1213825
CA,San Jose,912332
  • And finally, let’s create a us_population_queries.sql file containing a query we’d like to run on that data.
SELECT state as "State",count(city) as "City Count",sum(population) as "Population Sum"
FROM us_population
GROUP BY state
ORDER BY sum(population) DESC;
  • Execute the following command from a command terminal
./psql.py <your_zookeeper_quorum> us_population.sql us_population.csv us_population_queries.sql
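
For reference, with the ten CSV rows above, the aggregate query works out to the following result (column formatting aside):

State  City Count  Population Sum
NY     1           8143197
CA     3           6012701
TX     3           4486916
IL     1           2842518
PA     1           1463281
AZ     1           1461575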

Congratulations! You’ve just created your first Phoenix table, inserted data into it, and executed an aggregate query with just a few lines of code in 15 minutes or less!

Big deal - 10 rows! What else you got?
Ok, ok - tough crowd. Check out our bin/performance.py script to create as many rows as you want, for any schema you come up with, and run timed queries against it.

Why is it called Phoenix anyway? Did some other project crash and burn and this is the next generation?
I’m sorry, but we’re out of time and space, so we’ll have to answer that next time!
