- Two scenarios are tested:
- In the first, the HBase table already exists and a Phoenix table is mapped onto it;
- In the second, the HBase table does not exist and Phoenix creates the corresponding HBase table.
Case 1: the HBase table already exists
1. Create the HBase table via the hbase shell
hbase(main):009:0> create 't_hbase1','info'
0 row(s) in 1.5930 seconds
=> Hbase::Table - t_hbase1
hbase(main):010:0> desc 't_hbase1'
Table t_hbase1 is ENABLED
t_hbase1
COLUMN FAMILIES DESCRIPTION
{NAME => 'info', BLOOMFILTER => 'ROW', VERSIONS => '1', IN_MEMORY => 'false', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', TTL => 'FOREVER', COMPRESSION => 'NONE', MIN_VERSIONS => '0', BLOCKCACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}
1 row(s) in 0.1380 seconds
2. Also create a Hive external table mapped to t_hbase1
CREATE EXTERNAL TABLE t_hbase1(
key string,
id string,
salary string,
start_date string,
end_date string)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,info:id,info:salary,info:start_date,info:end_date")
TBLPROPERTIES("hbase.table.name" = "t_hbase1");
Testing shows that the Hive external table mapping works the same regardless of whether data is written to the HBase table through Phoenix or through the HBase API.
3. Insert data into t_hbase1
hbase(main):010:0> put 't_hbase1','1','info:id','1'
hbase(main):010:0> put 't_hbase1','1','info:salary','1'
hbase(main):011:0> put 't_hbase1','1','info:start_date','2017-09-18'
hbase(main):012:0> put 't_hbase1','1','info:end_date','2017-09-18'
hbase(main):016:0> scan 't_hbase1'
ROW COLUMN+CELL
1 column=info:end_date, timestamp=1530671481354, value=2017-09-18
1 column=info:id, timestamp=1530671453795, value=1
1 column=info:salary, timestamp=1530671469153, value=1
1 column=info:start_date, timestamp=1530671475444, value=2017-09-18
4. Create a Phoenix table mapped to t_hbase1 (quoting the table and column names preserves their lowercase form so they match the existing HBase table and column family)
Start the Phoenix shell:
[root@hdp3 ~]# /usr/hdp/2.5.3.0-37/phoenix/bin/sqlline.py hdp1,hdp2,hdp3:2181
create table "t_hbase1"(
"ROW" varchar primary key,
"info"."start_date" varchar ,
"info"."end_date" varchar ,
"info"."id" varchar ,
"info"."salary" varchar);
Query the data through Phoenix:
0: jdbc:phoenix:hdp04,hdp05,hdp06:2181> select * from "t_hbase1";
+------+-------------+-------------+-----+---------+
| ROW | start_date | end_date | id | salary |
+------+-------------+-------------+-----+---------+
| 1 | 2017-09-18 | 2017-09-18 | 1 | 1 |
+------+-------------+-------------+-----+---------+
Check the row count:
0: jdbc:phoenix:hdp04,hdp05,hdp06:2181> select count(*) from "t_hbase1";
+-----------+
| COUNT(1) |
+-----------+
| 1 |
+-----------+
5. Add data to t_hbase1 via the hbase shell
hbase(main):010:0> put 't_hbase1','2','info:id','2'
hbase(main):010:0> put 't_hbase1','2','info:salary','2'
hbase(main):011:0> put 't_hbase1','2','info:start_date','2017-09-18'
hbase(main):012:0> put 't_hbase1','2','info:end_date','2017-09-18'
6. Now query through Phoenix:
0: jdbc:phoenix:hdp04,hdp05,hdp06:2181> select count(*) from "t_hbase1";
+-----------+
| COUNT(1) |
+-----------+
| 2 |
+-----------+
1 row selected (0.029 seconds)
0: jdbc:phoenix:hdp04,hdp05,hdp06:2181> select * from "t_hbase1";
+------+-------------+-------------+-----+---------+
| ROW | start_date | end_date | id | salary |
+------+-------------+-------------+-----+---------+
| 1 | 2017-09-18 | 2017-09-18 | 1 | 1 |
| 2 | 2017-09-18 | 2017-09-18 | 2 | 2 |
+------+-------------+-------------+-----+---------+
2 rows selected (0.034 seconds)
7. Insert data into t_hbase1 via the Phoenix shell
UPSERT INTO "t_hbase1" VALUES('3','2017-09-18','2017-09-18','3','3');
8. Query and analyze the results
phoenix shell:
0: jdbc:phoenix:hdp04,hdp05,hdp06:2181> select * from "t_hbase1";
+------+-------------+-------------+-----+---------+
| ROW | start_date | end_date | id | salary |
+------+-------------+-------------+-----+---------+
| 1 | 2017-09-18 | 2017-09-18 | 1 | 1 |
| 2 | 2017-09-18 | 2017-09-18 | 2 | 2 |
| 3 | 2017-09-18 | 2017-09-18 | 3 | 3 |
+------+-------------+-------------+-----+---------+
3 rows selected (0.057 seconds)
hbase shell:
hbase(main):022:0> scan 't_hbase1'
ROW COLUMN+CELL
1 column=info:_0, timestamp=1530671481354, value=
1 column=info:end_date, timestamp=1530671481354, value=2017-09-18
1 column=info:id, timestamp=1530671453795, value=1
1 column=info:salary, timestamp=1530671469153, value=1
1 column=info:start_date, timestamp=1530671475444, value=2017-09-18
2 column=info:end_date, timestamp=1530672268599, value=2017-09-18
2 column=info:id, timestamp=1530672247623, value=2
2 column=info:salary, timestamp=1530672253810, value=2
2 column=info:start_date, timestamp=1530672262302, value=2017-09-18
3 column=info:_0, timestamp=1530672889061, value=x
3 column=info:end_date, timestamp=1530672889061, value=2017-09-18
3 column=info:id, timestamp=1530672889061, value=3
3 column=info:salary, timestamp=1530672889061, value=3
3 column=info:start_date, timestamp=1530672889061, value=2017-09-18
(Note the extra info:_0 cells: Phoenix maintains an empty "marker" column in each row it touches so that rows whose non-key columns are all null still show up in queries.)
hive shell:
hive> select * from t_hbase1;
OK
1 1 1 2017-09-18 2017-09-18
2 2 2 2017-09-18 2017-09-18
3 3 3 2017-09-18 2017-09-18
Time taken: 0.902 seconds, Fetched: 3 row(s)
Delete a specific row through Phoenix:
0: jdbc:phoenix:hdp04,hdp05,hdp06:2181> delete from "t_hbase1" where "id"='3';
1 row affected (0.052 seconds)
0: jdbc:phoenix:hdp04,hdp05,hdp06:2181> select * from "t_hbase1";
+------+-------------+-------------+-----+---------+
| ROW | start_date | end_date | id | salary |
+------+-------------+-------------+-----+---------+
| 1 | 2017-09-18 | 2017-09-18 | 1 | 1 |
| 2 | 2017-09-18 | 2017-09-18 | 2 | 2 |
+------+-------------+-------------+-----+---------+
2 rows selected (0.057 seconds)
- Test result analysis:
- The results above show that, without a secondary index, data stays in sync between HBase and Phoenix no matter whether it is written through the HBase API or through Phoenix.
Create a secondary index on the Phoenix table
CREATE INDEX THBASE1_INDEX_ID ON "t_hbase1"("info"."id");
Once created, Phoenix adds an index table (THBASE1_INDEX_ID), and a table of the same name also appears in HBase.
The index table stores only Row and ID, i.e. the mapping between the original HBase rowkey and the secondary-index column, as shown below:
0: jdbc:phoenix:hdp04,hdp05,hdp06:2181> select * from THBASE1_INDEX_ID ;
+----------+-------+
| info:id | :ROW |
+----------+-------+
| 1 | 1 |
| 2 | 2 |
+----------+-------+
So to look up rows quickly via the secondary index, I suggest first querying the index for the matching rowkey,
then fetching the actual row by that rowkey; in my tests this is tens to hundreds of times faster. The SQL looks like this:
select * from "t_hbase1" t1
INNER JOIN
(select "ROW" FROM "t_hbase1" WHERE "id"='200002') t2
on t1."ROW" = t2."ROW";
or
select * from "t_hbase1" where "ROW" in
(select "ROW" FROM "t_hbase1" WHERE "id"='200002');
Testing shows the second form is slightly faster.
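The two-step lookup can be sketched with plain Python data structures (a hypothetical in-memory model, not the real Phoenix API): the index maps the indexed value to rowkeys, and the data table is then probed by rowkey instead of being scanned in full.

```python
# In-memory sketch of an index-assisted lookup (illustrative only; the
# table and column names mirror the t_hbase1 example above).

# Data table: rowkey -> row
data_table = {
    "1": {"id": "1", "salary": "1", "start_date": "2017-09-18", "end_date": "2017-09-18"},
    "2": {"id": "2", "salary": "2", "start_date": "2017-09-18", "end_date": "2017-09-18"},
}

# Secondary index: indexed value -> rowkeys (like THBASE1_INDEX_ID)
id_index = {"1": ["1"], "2": ["2"]}

def full_scan(table, id_value):
    """Scan every row and filter -- what an unindexed query must do."""
    return [row for row in table.values() if row["id"] == id_value]

def index_lookup(table, index, id_value):
    """Find the rowkeys via the index, then fetch the rows by rowkey."""
    return [table[rk] for rk in index.get(id_value, [])]

assert full_scan(data_table, "2") == index_lookup(data_table, id_index, "2")
```

Against dicts both paths are equally cheap; against HBase the second replaces a full-table scan with two point lookups, which is where the large speedup comes from.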
1. Repeat the earlier steps: add data to t_hbase1 via the hbase shell
hbase(main):010:0> put 't_hbase1','3','info:id','3'
hbase(main):010:0> put 't_hbase1','3','info:salary','3'
hbase(main):011:0> put 't_hbase1','3','info:start_date','2017-09-18'
hbase(main):012:0> put 't_hbase1','3','info:end_date','2017-09-18'
After these puts, Phoenix queries return the following:
0: jdbc:phoenix:hdp04,hdp05,hdp06:2181> select * from "t_hbase1";
+------+-------------+-------------+-----+---------+
| ROW | start_date | end_date | id | salary |
+------+-------------+-------------+-----+---------+
| 1 | 2017-09-18 | 2017-09-18 | 1 | 1 |
| 2 | 2017-09-18 | 2017-09-18 | 2 | 2 |
| 3 | 2017-09-18 | 2017-09-18 | 3 | 3 |
+------+-------------+-------------+-----+---------+
3 rows selected (0.058 seconds)
0: jdbc:phoenix:hdp04,hdp05,hdp06:2181> select count(*) from "t_hbase1";
+-----------+
| COUNT(1) |
+-----------+
| 2 |
+-----------+
1 row selected (0.083 seconds)
0: jdbc:phoenix:hdp04,hdp05,hdp06:2181> select * from "t_hbase1" where "id"='3';
+------+-------------+-------------+-----+---------+
| ROW | start_date | end_date | id | salary |
+------+-------------+-------------+-----+---------+
| 3 | 2017-09-18 | 2017-09-18 | 3 | 3 |
+------+-------------+-------------+-----+---------+
1 row selected (0.059 seconds)
0: jdbc:phoenix:hdp04,hdp05,hdp06:2181> select * from THBASE1_INDEX_ID;
+----------+-------+
| info:id | :ROW |
+----------+-------+
| 1 | 1 |
| 2 | 2 |
+----------+-------+
2 rows selected (0.034 seconds)
Note the inconsistencies above: select * sees 3 rows, but count(*) returns 2, because Phoenix evidently serves count(*) from the smaller index table, which the hbase shell puts did not update. The select with where "id"='3' still finds row 3 because the index does not cover the other columns, so Phoenix falls back to scanning the data table.
2. Insert data via the Phoenix shell
UPSERT INTO "t_hbase1" VALUES('4','2017-09-18','2017-09-18','4','4');
After the upsert, queries return:
0: jdbc:phoenix:hdp04,hdp05,hdp06:2181> select * from THBASE1_INDEX_ID;
+----------+-------+
| info:id | :ROW |
+----------+-------+
| 1 | 1 |
| 2 | 2 |
| 4 | 4 |
+----------+-------+
3 rows selected (0.055 seconds)
0: jdbc:phoenix:hdp04,hdp05,hdp06:2181> select count(*) from "t_hbase1";
+-----------+
| COUNT(1) |
+-----------+
| 3 |
+-----------+
1 row selected (0.027 seconds)
0: jdbc:phoenix:hdp04,hdp05,hdp06:2181> select * from "t_hbase1";
+------+-------------+-------------+-----+---------+
| ROW | start_date | end_date | id | salary |
+------+-------------+-------------+-----+---------+
| 1 | 2017-09-18 | 2017-09-18 | 1 | 1 |
| 2 | 2017-09-18 | 2017-09-18 | 2 | 2 |
| 3 | 2017-09-18 | 2017-09-18 | 3 | 3 |
| 4 | 2017-09-18 | 2017-09-18 | 4 | 4 |
+------+-------------+-------------+-----+---------+
4 rows selected (0.113 seconds)
Check whether bulk loading automatically updates the index table
- Testing confirms that bulk loading does automatically update the Phoenix secondary index (regardless of whether the HBase table existed beforehand). The test below creates the table directly through Phoenix, letting it generate the HBase table:
Create the Phoenix table
CREATE TABLE example (
my_pk bigint not null,
m.first_name varchar(50),
m.last_name varchar(50)
CONSTRAINT pk PRIMARY KEY (my_pk));
Create the secondary index
create index example_first_name_index on example(m.first_name);
Create a data.csv file
12345,John,Doe
67890,Mary,Poppins
Bulk load the data
[root@hdp18 Templates]#
/usr/hdp/2.5.3.0-37/phoenix/bin/psql.py -t EXAMPLE hdp14:2181 /root/Templates/data.csv
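As an aside, data.csv is just a headerless CSV whose column order matches the table definition (my_pk, first_name, last_name); generating it from code is straightforward, e.g. in Python:

```python
# Write the headerless CSV consumed by psql.py; the column order must
# match the EXAMPLE table definition: my_pk, m.first_name, m.last_name.
import csv

rows = [
    ("12345", "John", "Doe"),
    ("67890", "Mary", "Poppins"),
]

with open("data.csv", "w", newline="") as f:
    csv.writer(f).writerows(rows)

# Read it back to check the contents round-trip.
with open("data.csv", newline="") as f:
    assert list(csv.reader(f)) == [list(r) for r in rows]
```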
Verify that the index table was updated:
0: jdbc:phoenix:hdp14,hdp15> select * from example;
+--------+-------------+------------+
| MY_PK | FIRST_NAME | LAST_NAME |
+--------+-------------+------------+
| 12345 | John | Doe |
| 67890 | Mary | Poppins |
+--------+-------------+------------+
2 rows selected (0.023 seconds)
0: jdbc:phoenix:hdp04,hdp05> select * from example_first_name_index;
+---------------+---------+
| M:FIRST_NAME | :MY_PK |
+---------------+---------+
| John | 12345 |
| Mary | 67890 |
+---------------+---------+
2 rows selected (0.018 seconds)
The results above confirm that bulk loading automatically updates the index table.
Case 2: the HBase table does not exist
Running the same test steps when the HBase table does not already exist gives the same results as when it does.
- The results above show that once a secondary index exists on t_hbase1, rows inserted via the hbase shell are not propagated to the index table.
- Rows inserted via the Phoenix shell, by contrast, are synchronized into the index table automatically; this is handled by a coprocessor.
- In short, if the secondary indexes created on a Phoenix table need to be maintained, all writes to the source table must go through a Phoenix client; if there is no index to maintain, any write path will do.
- Also, a Phoenix view behaves essentially the same as its underlying table, except that a view is query-only; view support for secondary indexes likewise matches table support.
- Personally I see one small drawback in this mechanism: if the data is already being written through non-Phoenix clients, moving those writes over to Phoenix can be a substantial change.
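The sync behaviour summarized above can be mimicked with a small mock (hypothetical classes, not the real HBase or Phoenix APIs): a Phoenix-style upsert maintains the index table as a side effect, roughly what the coprocessor does server-side, while a raw HBase-style put touches only the data table and leaves the index stale.

```python
# Mock of secondary-index maintenance (illustrative only; not the real
# HBase or Phoenix client APIs).

class MockTable:
    """A data table plus a secondary index on the 'id' column."""

    def __init__(self):
        self.rows = {}      # rowkey -> row dict (the data table)
        self.id_index = {}  # id value -> rowkey (the index table)

    def raw_put(self, rowkey, row):
        """HBase-shell-style write: bypasses Phoenix, index not updated."""
        self.rows[rowkey] = row

    def upsert(self, rowkey, row):
        """Phoenix-style write: a coprocessor-like hook updates the index."""
        self.rows[rowkey] = row
        self.id_index[row["id"]] = rowkey  # index maintained as a side effect

t = MockTable()
t.upsert("4", {"id": "4", "salary": "4"})   # Phoenix path: index stays in sync
t.raw_put("3", {"id": "3", "salary": "3"})  # HBase path: index goes stale

assert "4" in t.id_index       # upsert updated the index
assert "3" not in t.id_index   # raw put did not
assert len(t.rows) == 2        # but both rows are in the data table
```

This mirrors the transcripts: after an hbase shell put, select * against the data table sees the new row, while the index table, and hence count(*) served from it, does not.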