Phoenix Tuning Guide

The most important factor in performance is the design of your schema, especially as it affects the underlying HBase row keys.

Primary Keys

The Phoenix primary key columns are concatenated to create the underlying row key in Apache HBase.


Choose the most frequently queried columns as primary keys.


If your primary keys are monotonically increasing, use salting to help distribute writes across the cluster and improve parallelization.


The number of salt buckets should approximately equal the number of region servers. Do not salt automatically; use salting only when writes are hotspotting.
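A minimal sketch of a salted table (the table name, columns, and bucket count are illustrative):

  CREATE TABLE event_log (
      created_ts TIMESTAMP NOT NULL,
      event_id BIGINT NOT NULL
      CONSTRAINT pk PRIMARY KEY (created_ts, event_id)
  ) SALT_BUCKETS = 16;  -- roughly one bucket per region server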


General Tips

Random-Access

SSDs can improve performance because of their faster random seek time.

Read-heavy

 Create global indexes. This will affect write speed depending on the number of columns included in an index, because each index writes to its own separate table.

  Use multiple indexes to provide fast access to common queries.

 When specifying machines for HBase, do not skimp on cores; HBase needs them.

Write-heavy

 Pre-split the table

  Create local indexes

Which columns will be accessed often?

Create additional indexes to support common query patterns, including heavily accessed fields that are not in the primary key.

Can the data be append-only (immutable)?

Set the UPDATE_CACHE_FREQUENCY option to 15 minutes or so if your metadata doesn’t change very often. This property determines how often an RPC is done to ensure you’re seeing the latest schema.
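For example, a sketch of setting it to 15 minutes (900000 ms) on an existing table (the table name is illustrative):

  ALTER TABLE my_table SET UPDATE_CACHE_FREQUENCY = 900000;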

If the data is not sparse (over 50% of the cells have values), use the SINGLE_CELL_ARRAY_WITH_OFFSETS data encoding scheme introduced in Phoenix 4.10, which obtains faster performance by reducing the size of the data.
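A sketch, assuming an append-only table (names are illustrative; this storage scheme applies to immutable tables):

  CREATE TABLE dense_metrics (
      host VARCHAR NOT NULL,
      created_date DATE NOT NULL,
      m1 DOUBLE,
      m2 DOUBLE
      CONSTRAINT pk PRIMARY KEY (host, created_date)
  ) IMMUTABLE_ROWS = true,
    IMMUTABLE_STORAGE_SCHEME = SINGLE_CELL_ARRAY_WITH_OFFSETS;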

Is the table very large?

Use the ASYNC keyword with your CREATE INDEX call to create the index asynchronously via a MapReduce job.
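A sketch (table and index names are illustrative); after this statement, the index is populated by running the org.apache.phoenix.mapreduce.index.IndexTool MapReduce job:

  CREATE INDEX my_async_idx ON my_big_table (field_a) ASYNC;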

Is transactionality required?

If so, set the TRANSACTIONAL=true property when creating the table (this requires transaction support to be enabled on the cluster).


Block Encoding

Using compression or encoding is a must. Both SNAPPY and FAST_DIFF are good all-around options.

Set encoding at table creation time. Example: CREATE TABLE … ( … ) DATA_BLOCK_ENCODING='FAST_DIFF'
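A sketch combining compression and encoding at creation time (table and column names are illustrative):

  CREATE TABLE my_table (
      id BIGINT NOT NULL PRIMARY KEY,
      payload VARCHAR
  ) COMPRESSION = 'SNAPPY', DATA_BLOCK_ENCODING = 'FAST_DIFF';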

Schema Design

Because the schema affects the way the data is written to the underlying HBase layer, Phoenix performance relies on the design of your tables, indexes, and primary keys.

Phoenix and the HBase data model

Phoenix creates a relational data model on top of HBase, enforcing a PRIMARY KEY constraint whose columns are concatenated to form the row key for the underlying HBase table. It’s important to be cognizant of the size and number of the columns you include in the PK constraint, because a copy of the row key is included with every cell in the underlying HBase table.

Column Families

Create multiple column families to separate frequently-accessed columns from rarely-accessed columns.
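In Phoenix, prefixing a column with a family name assigns it to that column family. A sketch (names are illustrative):

  CREATE TABLE users (
      id BIGINT NOT NULL PRIMARY KEY,
      a.username VARCHAR,        -- family a: frequently accessed
      b.profile_image VARBINARY  -- family b: rarely accessed
  );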

Columns

Keep VARCHAR columns under 1MB.

For structured objects, don’t use JSON, which is not very compact. Use a format such as protobuf, Avro, msgpack, or BSON.

Use the column mapping feature (introduced in Phoenix 4.10), which stores smaller encoded values in place of column names in the underlying cells.

Indexes

A Phoenix index is a physical table that stores a pivoted copy of some or all of the data in the main table.

Secondary indexes

Depending on your needs, consider creating covered indexes or functional indexes.

create index if not exists event_object_id_idx_b on trans.event (object_id) ASYNC UPDATE_CACHE_FREQUENCY=60000
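As further sketches (names are illustrative): a covered index copies the INCLUDE columns into the index table so that qualifying queries never touch the data table, and a functional index indexes the value of an expression:

  CREATE INDEX event_type_idx ON trans.event (event_type) INCLUDE (object_id);
  CREATE INDEX upper_name_idx ON trans.event (UPPER(owner_name));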


  • Create local indexes for write-heavy use cases.
  • Create global indexes for read-heavy use cases. To save read-time overhead, consider creating covered indexes.
  • If the primary key is monotonically increasing, create salt buckets. The salt buckets can’t be changed later, so design them to handle future growth. Salt buckets help avoid write hotspots, but can decrease overall throughput due to the additional scans needed on read.
  • Set up a cron job to build indexes. Use ASYNC with CREATE INDEX to avoid blocking.
  • Only create the indexes you need.
  • Limit the number of indexes on frequently updated tables.
  • Use covered indexes to convert table scans into efficient point lookups or range queries over the index table instead of the primary table: CREATE INDEX index ON table ( … ) INCLUDE ( … )

Queries

It’s important to know which queries are executed on the server side versus the client side, because this can impact performance.

Reading

  • Avoid joins unless one side is small, especially on frequent queries. For larger joins, see “Hints,” below.
  • In the WHERE clause, filter leading columns in the primary key constraint.
  • Filtering the first leading column with IN or OR in the WHERE clause enables skip scan optimizations (see the sketch after this list).
  • Equality or comparisons (< or >) in the WHERE clause enable range scan optimizations.
  • Let Phoenix optimize query parallelism using statistics. This provides an automatic benefit if using Phoenix 4.2 or greater in production.
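A sketch of both patterns, assuming trans.event leads its primary key with object_id (the predicate values are illustrative):

  -- IN on the leading PK column enables a skip scan
  SELECT * FROM trans.event WHERE object_id IN (101, 102, 103);
  -- a comparison on the leading PK column enables a range scan
  SELECT * FROM trans.event WHERE object_id >= 100 AND object_id < 200;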

Range Queries

Apache Phoenix makes it easy to utilize many cores to increase scan performance.

For range queries, the HBase block cache does not provide much advantage.

Large Range Queries

For large range queries, consider setting Scan.setCacheBlocks(false), even if the whole scan could fit into the block cache, and rely on the operating system cache instead.

Point Lookups

For point lookups it is quite important to have your data set cached, and you should use the HBase block cache.

Hints

Hints let you override default query processing behavior and specify such factors as which index to use, what type of scan to perform, and what type of join to use.

  • Hint a global index to force its use when the query includes a column that is not in the index (see the sketch after this list).
  • If necessary, you can do bigger joins with the /*+ USE_SORT_MERGE_JOIN */ hint, but a big join will be an expensive operation over huge numbers of rows.
  • If the overall size of all right-hand-side tables would exceed the memory size limit, use the /*+ NO_STAR_JOIN */ hint.
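A sketch of hinting an index by name (the table, index, and predicate are illustrative):

  SELECT /*+ INDEX(trans.event event_object_id_idx_b) */ event_type
  FROM trans.event
  WHERE object_id = 101;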

Writing

Batching large numbers of records

When using UPSERT to write a large number of records, turn off autocommit and batch records. Note: Phoenix uses commit() instead of executeBatch() to control batch updates.

// requires java.sql.Connection, DriverManager, and PreparedStatement
try (Connection conn = DriverManager.getConnection(url)) {
  conn.setAutoCommit(false);
  int batchSize = 0;
  int commitSize = 1000; // number of rows you want to commit per batch
  try (PreparedStatement stmt = conn.prepareStatement(upsert)) {
    while (/* there are records to upsert */) {
      // set the bind parameters for the current record, e.g. stmt.setString(1, ...)
      stmt.executeUpdate();
      batchSize++;
      if (batchSize % commitSize == 0) {
        conn.commit();
      }
    }
  }
  conn.commit(); // commit the last batch of records
}

Note: Because the Phoenix client keeps uncommitted rows in memory, be careful not to set commitSize too high

Reducing RPC traffic

To reduce RPC traffic, set the UPDATE_CACHE_FREQUENCY property (Phoenix 4.7 or above) on your table and indexes when you create them (or issue an ALTER TABLE/INDEX call). See https://phoenix.apache.org/#Altering.

Using local indexes

Consider using local indexes to minimize write time. The writes for a local index go to the same region server as the base table. This approach does involve a performance hit on the read side, though, so make sure to quantify both the write speed improvement and the read speed reduction.
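A sketch (names are illustrative):

  CREATE LOCAL INDEX object_id_local_idx ON trans.event (object_id);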

Deleting

When deleting a large data set, turn on autoCommit before issuing the DELETE query so that the client does not need to buffer the affected rows. With autoCommit on, Phoenix deletes the rows directly on the region servers without the expense of returning them to the client.
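With autocommit enabled on the connection (in the sqlline client, !autocommit on), a large delete of this kind runs server-side. A sketch (table and predicate are illustrative):

  DELETE FROM trans.event WHERE created_ts < TO_DATE('2015-01-01 00:00:00');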


Explain Plans

An EXPLAIN plan tells you a lot about how a query will be run:

  • All the HBase range queries that will be executed
  • The number of bytes that will be scanned
  • The number of rows that will be traversed
  • Which HBase table will be used for each scan
  • Which operations (sort, merge, scan, limit) are executed on the client versus the server

Use an EXPLAIN plan to check how a query will run, and consider rewriting queries to meet the following goals:

  • Emphasize operations on the server rather than the client. Server operations are distributed across the cluster and operate in parallel, while client operations execute within the single client JDBC driver.
  • Use RANGE SCAN or SKIP SCAN whenever possible rather than TABLE SCAN.
  • Filter against leading columns in the primary key constraint. This assumes you have designed the primary key to lead with frequently-accessed or frequently-filtered columns as described in “Primary Keys,” above.
  • If necessary, introduce a local index or a global index that covers your query.
  • If you have an index that covers your query but the optimizer is not detecting it, try hinting the query: SELECT /*+ INDEX() */
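A sketch of checking a plan before running a query (names are illustrative); look for RANGE SCAN or SKIP SCAN rather than a full table scan in the output:

  EXPLAIN SELECT event_type FROM trans.event WHERE object_id = 101;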

Improving parallelization

You can improve parallelization with the UPDATE STATISTICS command.

With Phoenix 4.9 or later, the user can set the guidepost width for each table.
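A sketch of collecting statistics and setting the guidepost width (the table name and width are illustrative; GUIDE_POSTS_WIDTH is in bytes):

  UPDATE STATISTICS my_table ALL;
  ALTER TABLE my_table SET GUIDE_POSTS_WIDTH = 10000000;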

In Phoenix 4.12, a new configuration property, phoenix.use.stats.parallelization, controls whether statistics are used to drive parallelization.

Further Tuning

For advice about tuning the underlying HBase and JVM layers, see Operational and Performance Configuration Options in the Apache HBase™ Reference Guide.

Special Cases

The following sections provide Phoenix-specific additions to the tuning recommendations in the Apache HBase™ Reference Guide section referenced above.

For applications where failing quickly is better than waiting

In addition to the HBase tuning referenced above, set phoenix.query.timeoutMs in hbase-site.xml on the client side to the maximum tolerable wait time in milliseconds.
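A sketch of the client-side entry in hbase-site.xml (the one-minute value is illustrative):

  <property>
    <name>phoenix.query.timeoutMs</name>
    <value>60000</value>
  </property>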

For applications that can tolerate slightly out of date information

In addition to the HBase tuning referenced above, set phoenix.connection.consistency = timeline in hbase-site.xml on the client side for all connections.

