SQL优化及 hive配置优化

最新推荐文章于 2024-07-25 15:08:38 发布

KUI、简单

最新推荐文章于 2024-07-25 15:08:38 发布

阅读量67

点赞数

文章标签： sql hive mysql

本文链接：https://blog.csdn.net/weixin_59950163/article/details/134294490

版权

文章目录

SQL优化
- SQL语句方面
- 建表方面
hive配置优化

SQL优化

SQL语句方面

避免使用select *
在编写SQL时只查询需要的列，减少查询列也会提升查询效率
多使用 limit
如果limit 1; 查到了一条数据就不会再查了
用union all 代替 union
union all 是将两个表的数据直接合并为一张表
union 将两张表的数据合并，并做了去重，相对union多了去重的操作，所以效率会低
避免count(列)，使用count(*)/count(1)
用 group by 代替 distinct去重
group by 在数据流中做去重处理
distinct 先扫描全表数据，再删除重复数据
小表驱动大表，in适用于左表大表、右表小表，exists适用于左表小表、右边大表

-- user 100 数据量   order 10000 数据量

select * from order where user_id in (select id from user)

select * from user where exists(select * from order where user.id=order.user_id)

批量操作数据
在插入数据时，每请求一次数据库就消耗一些性能，找一个批量插入的方法，批量操作
增量查询
在使用远程接口查询数据时，如果数据量非常大，考虑到查询性能问题，则分批次查询

-- 按id和时间升序 每次只同步一批数据，同步完保存 最大的id和时间，给下次查询用
select * from user
where id>#{lastId} and create_time >= #(lastCreateTime]
limit 500;

用连接查询代替子查询
子查询会先执行内层查询再执行外层查询，相当于执行了两个SQL
连接查询，能用inner join尽量不要用left，用left join 左边要用小表，右边用大表
inner join 只查询两个表交集的数据，left join 会查询左边所有数据

建表方面

简历合理的数据类型
如身份证号固定18位，给varchar(200)则浪费182字节，给varchar(18)就不会浪费
字符串占用空间较大，布尔值可以用0和1代替
控制索引的数量
索引可以大幅提升查询效率，但索引在增删改的时候要更新B+树，索引过多就会消耗更多性能
如果超过五个，并发量不高超几个也是ok的
索引优化

-- 查看是否使用 索引 （这里没有使用索引，因为使用了*）
explain select * from order;

索引建立的依据可以通过查询分析语句提供依据 explain select语句

我们只需要注意一个最重要的type 的信息很明显的提现是否用到索引：

type结果值从好到坏依次是：

system > const > eq_ref > ref > fulltext > ref_or_null > index_merge > unique_subquery > index_subquery > range > index > ALL

all:全表扫描

index:另一种形式的全表扫描，只不过他的扫描方式是按照索引的顺序

range：有范围的索引扫描，相对于index的全表扫描，他有范围限制，因此要优于index

ref: 查找条件列使用了索引而且不为主键和unique。其实，意思就是虽然使用了索引，但该索引列的值并不唯一，有重复。这样即使使用索引快速查找到了第一条数据，仍然不能停止，要进行目标值附近的小范围扫描。但它的好处是它并不需要扫全表，因为索引是有序的，即便有重复值，也是在一个非常小的范围内扫描。

const:通常情况下，如果将一个主键放置到where后面作为条件查询，mysql优化器就能把这次查询优化转化为一个常量。至于如何转化以及何时转化，这个取决于优化器

索引失效的原因：
a. 使用了select *
b. 索引列上有计算/函数
c. 字符串类型没加引号
d. like左边有%

force index可以强制指定索引，如下

-- explain 查询索引 
explain select * from order 
    force index(user_id)

hive配置优化

hive jvm重用

临时设置：set mapred.job.reuse.jvm.num.tasks=10;

永久设置：在mapred-site.xml中添加

<property>
  <name>mapreduce.job.jvm.numtasks</name>
  <value>10</value>
  <description>How many tasks to run per jvm. If set to -1, there is
  no limit.
  </description>
</property>

job 并行处理

临时设置：
– 通过以下设置开启并行执行
set hive.exec.parallel=true; – 默认的是 false

set hive.exec.parallel.thread.number=16; – 默认的 8,定义并行运行的最大数量(16是指并行线程数（CPU线程），根据电脑情况设置，太多只会适得其反)

永久设置：

<!--1.修改mapred-site.xml添加调度器配置-->
<property>
  <name>mapred.jobtracker.taskScheduler</name>
  <value>org.apache.hadoop.mapred.FairScheduler</value>
</property>

<!--2.添加jar文件地址配置：-->
<property>
  <name>hadoopTest.jar</name>
  <value>所生成jar的地址</value>
</property>

查询优化

自动启动Map端Join

<!--防止数据倾斜-->
set hive.optimize.skewjoin=true;

<!--启用CBO(Cost based Optimizer)-->
set hive.cbo.enable=true;
set hive.compute.query.using.stats=true;
set hive.stats.fetch.column.stats=true;
set hive.stats.fetch.partition.stats=true;

<!--启动Vectorization（矢量化）-->
set hive.vectorized.execution.enabled = true;
set hive.vectorized.execution.reduce.enabled = true;
<!--使用CTE、临时表、窗口函数等-->