Hive 0.8.1 Index Analysis (CompactIndex)

This article examines the index feature that Hive 0.8.1 added over 0.7.1, using a sample query to show how it is decomposed into multiple MapReduce tasks. It covers the execution flow, the log output, and an analysis of how many MapReduce jobs are produced. It also explores the structure of the intermediate compact index file, the Driver's compile and execute phases, and the construction of the Task tree, and closes with the ConditionalTask selection problem and where the index files are stored.

As the experiment in the previous section showed, Hive 0.8.1 adds several configuration options that 0.7.1 does not have:

<property>
  <name>hive.optimize.index.filter</name>
  <value>true</value>
  <description>Whether to enable automatic use of indexes</description>
</property>


<property>
  <name>hive.optimize.index.groupby</name>
  <value>false</value>
  <description>Whether to enable optimization of group-by queries using Aggregate indexes.</description>
</property>


<property>
  <name>hive.index.compact.file.ignore.hdfs</name>
  <value>false</value>
  <description>When true, the HDFS location stored in the index file will be ignored at runtime.
  If the data is moved or the cluster name changes, the index data should still be usable.</description>
</property>


<property>
  <name>hive.optimize.index.filter.compact.minsize</name>
  <value>5368</value>
  <description>Minimum size (in bytes) of the inputs on which a compact index is automatically used.</description>
</property>


<property>
  <name>hive.optimize.index.filter.compact.maxsize</name>
  <value>-1</value>
  <description>Maximum size (in bytes) of the inputs on which a compact index is automatically used.
  A negative number is equivalent to infinity.</description>
</property>


<property>
  <name>hive.index.compact.query.max.size</name>
  <value>10737418240</value>
  <description>The maximum number of bytes that a query using the compact index can read. Negative value is equivalent to infinity.</description>
</property>


<property>
  <name>hive.index.compact.query.max.entries</name>
  <value>10000000</value>
  <description>The maximum number of index entries to read during a query that uses the compact index. Negative value is equivalent to infinity.</description>
</property>


<property>
  <name>hive.index.compact.binary.search</name>
  <value>true</value>
  <description>Whether or not to use a binary search to find the entries in an index table that match the filter, where possible</description>
</property>

1. CompactIndex Analysis:

A single query is split into several tasks, two of which are MapReduce jobs:
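
The query under analysis here can be read off the `mapred.job.name` in the command line dump below (`select * from table02 where id=5000`). Roughly, with the compact index enabled it executes as:

```sql
SET hive.optimize.index.filter=true;

-- Job 1 scans the compact index table and emits an intermediate file
-- of HDFS file names and block offsets whose blocks contain id = 5000.
-- Job 2 then scans only those blocks of table02 instead of the full table.
SELECT * FROM table02 WHERE id = 5000;
```

This is why one logical query produces two MapReduce jobs plus the supporting non-MapReduce tasks discussed below.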

Task 1:

CmdLine:

hive.metastore.rawstore.impl=org.apache.hadoop.hive.metastore.ObjectStore -jobconf hive.metastore.local=true -jobconf hive.optimize.bucketmapjoin=false -jobconf hive.optimize.ppd.storage=true -jobconf hive.input.format.sorted=true -jobconf hive.querylog.location=/tmp/allen -jobconf hive.limit.row.max.size=100000 -jobconf hive.rework.mapredwork=false -jobconf dfs.safemode.extension=30000 -jobconf hive.added.jars.path=file:///home/allen/Desktop/hive-0.8.1/lib/hive-builtins-0.8.1.jar -jobconf hive.index.compact.query.max.entries=10000000 -jobconf hive.metastore.authorization.storage.checks=false -jobconf hive.autogen.columnalias.prefix.includefuncname=false -jobconf hive.zookeeper.namespace=hive_zookeeper_namespace -jobconf hive.test.mode.prefix=test_ -jobconf hive.merge.rcfile.block.level=true -jobconf hive.test.mode=false -jobconf hive.exec.compress.intermediate=false -jobconf datanucleus.cache.level2.type=SOFT -jobconf dfs.https.server.keystore.resource=ssl-server.xml -jobconf hive.metastore.ds.retry.attempts=1 -jobconf hive.limit.optimize.enable=false -jobconf hive.zookeeper.client.port=2181 -jobconf hive.exec.perf.logger=org.apache.hadoop.hive.ql.log.PerfLogger -jobconf javax.jdo.option.ConnectionUserName=root -jobconf dfs.name.edits.dir=%24%7Bdfs.name.dir%7D -jobconf hive.merge.mapfiles=true -jobconf hive.test.mode.samplefreq=32 -jobconf hive.optimize.skewjoin=false -jobconf hive.optimize.index.groupby=false -jobconf hive.metastore.server.min.threads=200 -jobconf hive.mapjoin.localtask.max.memory.usage=0.9 -jobconf dfs.block.size=67108864 -jobconf hive.map.aggr.hash.min.reduction=0.5 -jobconf hive.exec.compress.output=false -jobconf dfs.datanode.ipc.address=0.0.0.0:50020 -jobconf javax.jdo.option.Multithreaded=true -jobconf hive.script.recordreader=org.apache.hadoop.hive.ql.exec.TextRecordReader -jobconf dfs.permissions=true -jobconf hive.multigroupby.singlemr=false -jobconf hive.lock.numretries=100 -jobconf hive.optimize.metadataonly=true -jobconf 
hive.exec.parallel.thread.number=8 -jobconf hive.exec.default.partition.name=__HIVE_DEFAULT_PARTITION__ -jobconf hive.exec.max.created.files=100000 -jobconf hive.archive.har.parentdir.settable=false -jobconf hive.metastore.event.clean.freq=0 -jobconf dfs.datanode.https.address=0.0.0.0:50475 -jobconf hive.exec.mode.local.auto=false -jobconf dfs.secondary.http.address=0.0.0.0:50090 -jobconf hive.optimize.index.filter=true -jobconf datanucleus.storeManagerType=rdbms -jobconf dfs.replication.max=512 -jobconf hive.script.operator.id.env.var=HIVE_SCRIPT_OPERATOR_ID -jobconf hive.exec.mode.local.auto.inputbytes.max=134217728 -jobconf mapred.min.split.size=1 -jobconf hive.mapjoin.size.key=10000 -jobconf hive.metastore.ds.retry.interval=1000 -jobconf hive.skewjoin.mapjoin.min.split=33554432 -jobconf hive.metastore.client.connect.retry.delay=1 -jobconf hive.auto.convert.join=false -jobconf dfs.https.client.keystore.resource=ssl-client.xml -jobconf hive.metastore.warehouse.dir=/user/hive/warehouse -jobconf hive.mapjoin.bucket.cache.size=100 -jobconf hive.exec.job.debug.timeout=30000 -jobconf datanucleus.transactionIsolation=read-committed -jobconf hive.stats.jdbc.timeout=30 -jobconf hive.mergejob.maponly=true -jobconf dfs.https.address=0.0.0.0:50470 -jobconf dfs.balance.bandwidthPerSec=1048576 -jobconf hive.fetch.output.serde=org.apache.hadoop.hive.serde2.DelimitedJSONSerDe -jobconf hive.exec.script.trust=false -jobconf hive.mapjoin.followby.map.aggr.hash.percentmemory=0.3 -jobconf hive.exim.uri.scheme.whitelist=hdfs%2Cpfile -jobconf hive.stats.dbconnectionstring=jdbc:derby:%3BdatabaseName=TempStatsStore%3Bcreate=true -jobconf mapred.reduce.tasks=-1 -jobconf hive.optimize.index.filter.compact.minsize=5368 -jobconf hive.skewjoin.key=100000 -jobconf javax.jdo.option.ConnectionDriverName=com.mysql.jdbc.Driver -jobconf hive.security.authenticator.manager=org.apache.hadoop.hive.ql.security.HadoopDefaultAuthenticator -jobconf dfs.max.objects=0 -jobconf 
mapred.input.dir.recursive=false -jobconf hive.udtf.auto.progress=false -jobconf hive.session.id=allen_201203131425 -jobconf mapred.job.name=select+*+from+table02+where+id=5000%28Stage-3%29 -jobconf dfs.datanode.dns.nameserver=default -jobconf hive.exec.script.maxerrsize=100000 -jobconf dfs.blockreport.intervalMsec=3600000 -jobconf hive.optimize.groupby=true -jobconf datanucleus.plugin.pluginRegistryBundleCheck=LOG -jobconf hive.exec.rowoffset=false -jobconf hive.default.fileformat=TextFile -jobconf hive.hadoop.supports.splittable.combineinputformat=false -jobconf hive.metastore.archive.intermediate.original=_INTERMEDIATE_ORIGINAL -jobconf hive.mapjoin.smalltable.filesize=25000000 -jobconf hive.exec.scratchdir=/tmp/hive-allen -jobconf datanucleus.identifierFactory=datanucleus -jobconf hive.exec.max.dynamic.partitions.pernode=100 -jobconf hive.stats.retries.max=0 -jobconf dfs.client.block.write.retries=3 -jobconf hive.join.emit.interval=1000 -jobconf hive.script.recordwriter=org.apache.hadoop.hive.ql.exec.TextRecordWriter -jobconf datanucleus.validateConstraints=false -jobconf hive.exec.dynamic.partition=false -jobconf hive.hashtable.loadfactor=0.75 -jobconf dfs.https.enable=false -jobconf hive.sample.seednumber=0 -jobconf hive.optimize.index.filter.compact.maxsize=-1 -jobconf hive.metastore.client.socket.timeout=20 -jobconf hive.map.aggr.hash.force.flush.memory.threshold=0.9 -jobconf hive.exec.show.job.failure.debug.info=true -jobconf hive.join.cache.size=25000 -jobconf hive.mapper.canno