hadoop15--MR tuning, virtual columns, mysql

Latest recommended article published on 2021-04-20 19:47:39

forever428

Category: hadoop  Tags: data skew, JVM reuse, speculative execution, mysql, hive

Article link: https://blog.csdn.net/forever428/article/details/83995178

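Table optimization

When the data volume is large, the usual approach to table optimization is to split tables: break a large table into smaller ones, or use partitioned tables, temporary tables, and external tables.
When joining a small table with a large table, put the smaller table on the left side of the join so that it is cached first; this reduces memory consumption during the join.

Data skew

Data skew arises when, after partitioning, one reduce task receives only a small amount of data while another receives a large amount, so the load across the reduce tasks is unbalanced.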
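
To make the imbalance concrete, here is a minimal Java sketch (not from the original post) that counts how many records each reduce task would receive when keys are assigned with the same arithmetic as Hadoop's default HashPartitioner, (hashCode & Integer.MAX_VALUE) % numReduceTasks. The class name, the sample keys, and the record counts are hypothetical.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

public class SkewSketch {

    // Same arithmetic as Hadoop's default HashPartitioner
    static int partition(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    // Count how many records land on each reduce task
    static long[] recordsPerReducer(Map<String, Long> recordsPerKey, int numReduceTasks) {
        long[] counts = new long[numReduceTasks];
        for (Map.Entry<String, Long> e : recordsPerKey.entrySet()) {
            counts[partition(e.getKey(), numReduceTasks)] += e.getValue();
        }
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical input: one hot key dominates the data set
        Map<String, Long> recordsPerKey = new LinkedHashMap<>();
        recordsPerKey.put("a", 1_000_000L);
        recordsPerKey.put("b", 10L);
        recordsPerKey.put("c", 10L);
        recordsPerKey.put("d", 10L);
        // prints [20, 1000010]: reduce task 1 does almost all of the work
        System.out.println(Arrays.toString(recordsPerReducer(recordsPerKey, 2)));
    }
}
```

Because every record with the same key must go to the same reduce task, a single hot key is enough to overload one of them, no matter how evenly the remaining keys are spread.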

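Setting a reasonable number of map tasks

Factors that affect the number of map tasks

Every file in the input directory produces at least one map task. Both the number of input files and their sizes affect the number of map tasks, because in a MapReduce job each input split corresponds to one map task. The bounds on the split size can be set in the Driver as follows: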

FileInputFormat.setMaxInputSplitSize(job, size);
FileInputFormat.setMinInputSplitSize(job, size);
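
The effect of these two settings can be sketched in plain Java. The sketch below is illustrative only: it reproduces the split-size rule used by FileInputFormat, max(minSize, min(maxSize, blockSize)), together with its 10% slop for the last split, and counts the splits (and therefore map tasks) for one non-empty, splittable file. The class name, the 128 MB block size, and the 300 MB file size are assumptions.

```java
public class SplitSizeSketch {

    // Split-size rule used by FileInputFormat
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    // Number of splits for one non-empty splittable file; the last split
    // may be up to 10% larger than splitSize
    static int splitsForFile(long fileLength, long splitSize) {
        final double splitSlop = 1.1;
        int splits = 0;
        long remaining = fileLength;
        while ((double) remaining / splitSize > splitSlop) {
            splits++;
            remaining -= splitSize;
        }
        if (remaining > 0) {
            splits++;
        }
        return splits;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long blockSize = 128 * mb;  // assumed HDFS block size
        long fileLength = 300 * mb; // hypothetical input file

        // No limits set: split size equals the block size, 3 map tasks
        System.out.println(splitsForFile(fileLength, computeSplitSize(blockSize, 1L, Long.MAX_VALUE)));
        // Max split size of 64 MB: smaller splits, 5 map tasks
        System.out.println(splitsForFile(fileLength, computeSplitSize(blockSize, 1L, 64 * mb)));
        // Min split size of 256 MB: larger splits, 2 map tasks
        System.out.println(splitsForFile(fileLength, computeSplitSize(blockSize, 256 * mb, Long.MAX_VALUE)));
    }
}
```

In short, lowering the maximum split size below the block size increases the number of map tasks, and raising the minimum split size above the block size decreases it.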

Setting a reasonable number of reduce tasks

Setting the number of reduce tasks:

hive (default)> set mapreduce.job.reduces;
mapreduce.job.reduces=-1
//The default is -1, which means the number of reduce tasks is not set explicitly
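
When the value is left at -1, Hive estimates the number of reduce tasks from the input size. The following Java sketch shows a simplified version of that estimate: the input size divided by hive.exec.reducers.bytes.per.reducer, rounded up and capped by hive.exec.reducers.max. The class name and the two values used in main (256000000 and 1009) are assumptions; check the actual values on your cluster with set.

```java
public class ReducerEstimateSketch {

    // Simplified estimate: ceil(input / bytesPerReducer), clamped to [1, maxReducers]
    static int estimateReducers(long totalInputBytes, long bytesPerReducer, int maxReducers) {
        long needed = (totalInputBytes + bytesPerReducer - 1) / bytesPerReducer;
        return (int) Math.max(1, Math.min(maxReducers, needed));
    }

    public static void main(String[] args) {
        long bytesPerReducer = 256_000_000L; // assumed hive.exec.reducers.bytes.per.reducer
        int maxReducers = 1009;              // assumed hive.exec.reducers.max

        // About 1 GB of input: prints 4
        System.out.println(estimateReducers(1_000_000_000L, bytesPerReducer, maxReducers));
        // About 1 TB of input: capped at the maximum, prints 1009
        System.out.println(estimateReducers(1_000_000_000_000L, bytesPerReducer, maxReducers));
    }
}
```

Setting mapreduce.job.reduces to a positive number bypasses the estimate and fixes the number of reduce tasks.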

Define custom partitioning rules according to the business logic
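
As an illustration of such a rule, the plain-Java sketch below routes keys to reduce tasks by a prefix of the key. The prefixes, the class name, and the method are hypothetical; in a real job the same logic would sit in the getPartition method of a class extending org.apache.hadoop.mapreduce.Partitioner, registered with job.setPartitionerClass.

```java
import java.util.HashMap;
import java.util.Map;

public class PrefixPartitionSketch {

    // Hypothetical business rule: each known prefix gets its own reduce task
    private static final Map<String, Integer> PREFIX_TO_PARTITION = new HashMap<>();
    static {
        PREFIX_TO_PARTITION.put("136", 0);
        PREFIX_TO_PARTITION.put("137", 1);
        PREFIX_TO_PARTITION.put("138", 2);
    }

    // Plays the role of Partitioner.getPartition(key, value, numPartitions)
    static int getPartition(String key, int numPartitions) {
        String prefix = key.length() >= 3 ? key.substring(0, 3) : key;
        Integer known = PREFIX_TO_PARTITION.get(prefix);
        // Unknown prefixes share one catch-all partition after the known ones
        int partition = (known != null) ? known : PREFIX_TO_PARTITION.size();
        // The result must always fall in [0, numPartitions)
        return partition % numPartitions;
    }

    public static void main(String[] args) {
        System.out.println(getPartition("13612345678", 4)); // 0
        System.out.println(getPartition("13899990000", 4)); // 2
        System.out.println(getPartition("15000000000", 4)); // 3 (catch-all)
    }
}
```

With a rule like this, the number of reduce tasks (job.setNumReduceTasks) should match the number of partitions the rule can produce, here four.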

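Parallel execution

Parallel execution is similar in concept to asynchronous versus synchronous execution in Java multithreading. A single query can be compiled into many MR jobs. Some of them depend on each other, for example when the output of one MR job is the input of the next, while others are independent. Independent MR jobs can be executed in parallel, so that several MR jobs run at the same time and overall efficiency improves.

This is controlled by the following parameters

Enable parallel execution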

hive (default)> set hive.exec.parallel;
hive.exec.parallel=false
---------------------------------------
set hive.exec.parallel=true;

Set how many jobs may run at the same time

hive (default)> set hive.exec.parallel.thread.number;
hive.exec.parallel.thread.number=8
//The default is 8, i.e. up to 8 jobs can run at the same time
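
Since the idea is compared to Java multithreading above, it can be sketched with a fixed-size thread pool. This is only an analogy, not how Hive is implemented: two independent stages run concurrently, and the stage that depends on them runs only after both have finished. The class name and the stand-in stage results are hypothetical; the pool size plays the role of hive.exec.parallel.thread.number.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelStagesSketch {

    // Run the independent stages concurrently, then the dependent stage
    static int runPlan(int threadNumber) {
        ExecutorService pool = Executors.newFixedThreadPool(threadNumber);
        try {
            List<Callable<Integer>> independentStages = new ArrayList<>();
            independentStages.add(() -> 10); // stand-in for one independent MR job
            independentStages.add(() -> 20); // stand-in for another independent MR job

            int combined = 0;
            // invokeAll blocks until every independent stage has completed
            for (Future<Integer> result : pool.invokeAll(independentStages)) {
                combined += result.get();
            }
            // The dependent stage consumes the output of the stages above
            return combined * 2;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(runPlan(8)); // prints 60
    }
}
```

Parallel execution only helps when the cluster has spare capacity; jobs that run at the same time also compete for the same resources.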

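Strict mode

Hive provides a strict mode that prevents queries likely to have harmful effects, such as Cartesian products, from running.

The default strict-mode setting: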

<property>
    <name>hive.mapred.mode</name>
    <value>nonstrict</value>
    <description>
      The mode in which the Hive operations are being performed. 
      In strict mode, some risky queries are not allowed to run. They include:
        Cartesian Product.
        No partition being picked up for a query.
        Comparing bigints and strings.
        Comparing bigints and doubles.
        Orderby without limit.
    </description>
  </property>
  //The default is non-strict mode: nonstrict

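To enable strict mode, set hive.mapred.mode to strict

With strict mode enabled, queries are restricted as follows:

Partitioned tables: the query must have a where clause that filters on the partition column; otherwise it is not allowed to run.
order by: a query that uses order by must also use limit. After order by, all the data is sent to a single reduce task, which may then have so much data to process that it runs for a very long time or hangs; requiring limit reduces the amount of data that reduce task has to handle.
Cartesian products: a multi-table join can produce a Cartesian product, which greatly increases memory consumption. To prevent this, and to guard against mistakes, Cartesian product queries are prohibited.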

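JVM reuse

When Hive runs a computation, it submits the execution plan to the YARN cluster, where it runs as MR tasks. Each task starts its own JVM process, so when many tasks are submitted, JVMs are started and shut down very frequently, and this startup and shutdown wastes a lot of resources.

When processing small files there are many map tasks, so JVMs are started and shut down frequently. Optimizing small-file processing therefore mainly means reducing the number of JVM startups.

The mapred-default.xml configuration file contains the following parameter: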

<property>
  <name>mapreduce.job.jvm.numtasks</name>
  <value>10</value>
  <description>How many tasks to run per jvm. If set to -1, there is
  no limit. 
  </description>
</property>

In Hive, the number of tasks per reused JVM can be viewed, and temporarily overridden for the session, with set

hive (default)> set mapreduce.job.jvm.numtasks;
mapreduce.job.jvm.numtasks=1

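Speculative execution

Because resources in a cluster are not evenly allocated, or because the nodes differ in hardware performance, some tasks run fast, some run slowly, and some hang outright.

To keep such tasks from slowing down the whole MR job, a duplicate of the slow task is launched, normally on another node. If the duplicate finishes before the original, its output is used directly and the other attempt is discarded.

Note: speculative execution is configured separately for the map side and the reduce side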
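
The "first attempt to finish wins" idea can be sketched in Java with ExecutorService.invokeAny, which returns the result of one task that completed successfully and cancels the others. This is an analogy rather than Hadoop's scheduler; the class name, the attempt names, and the delays (standing in for a slow node and a fast node) are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class SpeculativeSketch {

    // One attempt of the same task; delayMillis stands in for node speed
    static Callable<String> attempt(String name, long delayMillis) {
        return () -> {
            Thread.sleep(delayMillis);
            return name;
        };
    }

    // Run all attempts at once and keep the result of the first to finish
    static String firstToFinish(List<Callable<String>> attempts) {
        ExecutorService pool = Executors.newFixedThreadPool(attempts.size());
        try {
            return pool.invokeAny(attempts);
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            // Interrupt any attempt that is still running
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        String winner = firstToFinish(Arrays.asList(
                attempt("original attempt", 2000),    // slow node
                attempt("speculative attempt", 50))); // faster node
        System.out.println(winner); // expected: speculative attempt
    }
}
```

The cost is visible in the sketch as well: both attempts occupy a thread until one finishes, just as a speculative task occupies cluster resources that other tasks could have used.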

Map side

The parameter that enables map-side speculative execution:

<property>
  <name>mapreduce.map.speculative</name>
  <value>true</value>
  <description>If true, then multiple instances of some map tasks
               may be executed in parallel.</description>
</property>

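Speculative execution is enabled by default in Hadoop. A speculative task is not launched the moment a task stalls; execution must have progressed past 5% before speculation begins

Set in Hive with set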

hive (default)> set mapreduce.map.speculative;
mapreduce.map.speculative=true

Reduce side

The parameter that enables reduce-side speculative execution:

<property>
  <name>mapreduce.reduce.speculative</name>
  <value>true</value>
  <description>If true, then multiple instances of some reduce tasks
               may be executed in parallel.</description>
</property>

Set in Hive with set

hive (default)> set mapreduce.reduce.speculative;
mapreduce.reduce.speculative=true

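Execution plan: viewing how a SQL statement is executed

Hive can display the execution plan of an HQL statement. When the plan is built, an abstract syntax tree is generated, and the plan shows the dependencies between the stages of the statement and the order in which they are executed. These steps and dependencies can be used to optimize the HQL statement

explain + statement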
------------------------------------------------
hive (default)> explain select * from emp;
OK
Explain
STAGE DEPENDENCIES:
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        TableScan
          alias: emp
          Statistics: Num rows: 2 Data size: 653 Basic stats: COMPLETE Column stats: NONE
          Select Operator
            expressions: empno (type: int), ename (type: string), job (type: string), mgr (type: int), edate (type: string), sal (type: double), deptno (type: int)
            outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6
            Statistics: Num rows: 2 Data size: 653 Basic stats: COMPLETE Column stats: NONE
            ListSink

Time taken: 0.127 seconds, Fetched: 17 row(s)

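In general, complex statements such as multi-table joins should be broken down into simpler ones

Virtual columns

A virtual column is a column that is not stored in the table itself. At query time, virtual columns can be used to look up the path of the data file and the offset of the data within it; Hive provides both of these virtual columns to users

INPUT__FILE__NAME : the path where the data file is stored

Query the path of the data file
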

select ename, INPUT__FILE__NAME  from emp;

SMITH   hdfs://ns1/user/hive/warehouse/emp/emp.txt
ALLEN   hdfs://ns1/user/hive/warehouse/emp/emp.txt
WARD    hdfs://ns1/user/hive/warehouse/emp/emp.txt
JONES   hdfs://ns1/user/hive/warehouse/emp/emp.txt
MARTIN  hdfs://ns1/user/hive/warehouse/emp/emp.txt
BLAKE   hdfs://ns1/user/hive/warehouse/emp/emp.txt
CLARK   hdfs://ns1/user/hive/warehouse/emp/emp.txt
SCOTT   hdfs://ns1/user/hive/warehouse/emp/emp.txt
KING    hdfs://ns1/user/hive/warehouse/emp/emp.txt
TURNER  hdfs://ns1/user/hive/warehouse/emp/emp.txt
ADAMS   hdfs://ns1/user/hive/warehouse/emp/emp.txt
JAMES   hdfs://ns1/user/hive/warehouse/emp/emp.txt
FORD    hdfs://ns1/user/hive/warehouse/emp/emp.txt
MILLER  hdfs://ns1/user/hive/warehouse/emp/emp.txt

BLOCK__OFFSET__INSIDE__FILE : the offset of the data within the data file

Use the virtual column to get the offset of the data

select ename ,BLOCK__OFFSET__INSIDE__FILE  from emp;

ename   block__offset__inside__file
SMITH   0
ALLEN   44
WARD    97
JONES   149
MARTIN  194
BLAKE   249
CLARK   294
SCOTT   339
KING    385
TURNER  429
ADAMS   480
JAMES   524
FORD    567
MILLER  612
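
The offsets above grow by the length of each row, which is what one would expect for a plain text file where every row is one line. A minimal Java sketch of that relationship, assuming UTF-8 text with a single '\n' byte after each line (the class name and sample rows are hypothetical and are not the contents of emp.txt):

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

public class LineOffsetSketch {

    // Byte offset at which each line starts, assuming a one-byte '\n' terminator
    static long[] lineStartOffsets(List<String> lines) {
        long[] offsets = new long[lines.size()];
        long position = 0;
        for (int i = 0; i < lines.size(); i++) {
            offsets[i] = position;
            // Advance past the line's bytes plus the newline
            position += lines.get(i).getBytes(StandardCharsets.UTF_8).length + 1;
        }
        return offsets;
    }

    public static void main(String[] args) {
        // Hypothetical rows
        List<String> lines = Arrays.asList(
                "7369,SMITH,CLERK",
                "7499,ALLEN,SALESMAN",
                "7521,WARD,SALESMAN");
        System.out.println(Arrays.toString(lineStartOffsets(lines))); // [0, 17, 37]
    }
}
```

The first row always starts at offset 0, and each later offset is the previous offset plus the previous row's length in bytes plus one.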

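Installing and configuring mysql

By default Hive stores its metadata in Derby, which allows only one user to access Hive at a time. A second user cannot connect and gets the following error: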

Caused by: ERROR XSDB6: Another instance of Derby may have already booted the database /opt/app/apache-hive-1.2.1-bin/metastore_db.

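To solve this problem, store the Hive metadata in mysql.

mysql installation steps

The Linux system may already have mysql packages installed, so the first step is to check whether mysql has been installed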

[hadoop@hadoop apache-hive-1.2.1-bin]$ rpm -qa | grep -i mysql
mysql-libs-5.1.73-5.el6_6.x86_64

This command shows whether mysql is installed

Uninstall the existing mysql package

[hadoop@hadoop apache-hive-1.2.1-bin]$ sudo rpm -e --nodeps mysql-libs-5.1.73-5.el6_6.x86_64
[sudo] password for hadoop:

Check that the package was uninstalled

[hadoop@hadoop apache-hive-1.2.1-bin]$ rpm -qa | grep -i mysql  
[hadoop@hadoop apache-hive-1.2.1-bin]$

mysql consists of a server and a client

 MySQL-client-5.5.47-1.linux2.6.x86_64.rpm
 MySQL-server-5.5.47-1.linux2.6.x86_64.rpm

Install mysql

Install the server with rpm

sudo rpm -ivh MySQL-server-5.5.47-1.linux2.6.x86_64.rpm

Install the client with rpm

sudo rpm -ivh MySQL-client-5.5.47-1.linux2.6.x86_64.rpm

Check the running status of mysql

sudo service mysql status

Start the mysql service

[root@hadoop mysql]# service mysql start
Starting MySQL.. SUCCESS!

Check the running status of mysql again

 SUCCESS! MySQL running (5094)

Setting the password and granting remote access

Set the password

Once mysql is installed, log in to mysql

mysql -uroot

List the databases

mysql> show databases;
+--------------------+
| Database           |
+--------------------+
| information_schema |
| mysql              |
| performance_schema |
| test               |
+--------------------+
4 rows in set (0.01 sec)

Switch to the mysql database

mysql> use mysql;
Database changed

View the user, host, and password information

mysql> select user,host,password from user;
+------+-----------+----------+
| user | host      | password |
+------+-----------+----------+
| root | localhost |          |
| root | hadoop    |          |
| root | 127.0.0.1 |          |
| root | ::1       |          |
|      | localhost |          |
|      | hadoop    |          |
+------+-----------+----------+
6 rows in set (0.00 sec)

Set the mysql password

mysql> update user set password=PASSWORD('root') where user='root';
Query OK, 4 rows affected (0.00 sec)
Rows matched: 4  Changed: 4  Warnings: 0

After the password change, the user table looks as follows, which shows that the password has been set locally

mysql> select user,host,password from user;
+------+-----------+-------------------------------------------+
| user | host      | password                                  |
+------+-----------+-------------------------------------------+
| root | localhost | *81F5E21E35407D884A6CD4A731AEBFB6AF209E1B |
| root | hadoop    | *81F5E21E35407D884A6CD4A731AEBFB6AF209E1B |
| root | 127.0.0.1 | *81F5E21E35407D884A6CD4A731AEBFB6AF209E1B |
| root | ::1       | *81F5E21E35407D884A6CD4A731AEBFB6AF209E1B |
|      | localhost |                                           |
|      | hadoop    |                                           |
+------+-----------+-------------------------------------------+
6 rows in set (0.00 sec)

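Grant remote access

Logging in to mysql with the new password fails with the following error, which reports an incorrect username or password. The update above modified the user table directly, and such changes do not take effect until the privileges are reloaded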

[root@hadoop mysql]# mysql -uroot -proot
ERROR 1045 (28000): Access denied for user 'root'@'localhost' (using password: YES)

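The user table has a host field, which specifies the addresses from which mysql may be accessed; this field determines which nodes are allowed to connect

To allow remote login, change the host field of one entry to %, which means any host, so that mysql can be accessed from any node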

mysql> update user set host='%' where user='root' and host='127.0.0.1';

After running the statement above, flush the privileges so that the changes to the user table take effect

mysql> flush privileges;

After these steps, verify the mysql login again; it now succeeds

[root@hadoop mysql]# mysql -uroot -proot
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 6
Server version: 5.5.47 MySQL Community Server (GPL)

Copyright (c) 2000, 2015, Oracle and/or its affiliates. All rights reserved.

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql>

Configuring Hive to store its metadata in mysql

This is configured in the hive-site.xml configuration file

Set the connection from Hive to mysql

  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://192.168.91.100:3306/metastore?createDatabaseIfNotExist=true</value>
    <description>JDBC connect string for a JDBC metastore</description>
  </property>

metastore: the name of the database that stores the Hive metadata

Set the JDBC driver class

  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
    <description>Driver class name for a JDBC metastore</description>
  </property>

Set the mysql username

  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>root</value>
    <description>Username to use against metastore database</description>
  </property>

Set the mysql password

  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>root</value>
    <description>password to use against metastore database</description>
  </property>

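After completing the configuration above, the JDBC driver jar has to be placed in hive/lib. Once the jar has been uploaded, it is best to change its ownership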

 sudo chown -R hadoop:hadoop mysql-connector-java-5.1.31.jar

Copy the driver jar into the lib folder of the hive directory

cp mysql-connector-java-5.1.31.jar /opt/app/apache-hive-1.2.1-bin/lib/

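Check in hive's lib directory that the copy succeeded
With the configuration complete, exit hive and start it again, then check whether the metastore database has been created in mysql. If it has, the configuration succeeded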

mysql> show databases;
+--------------------+
| Database           |
+--------------------+
| information_schema |
| metastore          |
| mysql              |
| performance_schema |
| test               |
+--------------------+

hiveserver2

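1. Connecting with beeline

This amounts to starting a server inside Hive that clients can connect to remotely. hiveserver2 needs no installation; start it directly from the hive/bin directory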

bin/hiveserver2

Once the hiveserver2 service is running, connect to it with the bin/beeline client

Official example:

!connect jdbc:hive2://localhost:10000 scott tiger

Following the official example, connect to hiveserver2 to test whether the connection succeeds

!connect jdbc:hive2://hadoop:10000 hadoop 123456

2. Connecting with JDBC

package org.hive.server;

import java.sql.SQLException;
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;
import java.sql.DriverManager;

public class HiveJdbcClient {

	private static final String DRIVER_NAME = "org.apache.hive.jdbc.HiveDriver";

	/**
	 * @param args
	 * @throws SQLException
	 */
	public static void main(String[] args) throws SQLException {
		try {
			// Load the HiveServer2 JDBC driver
			Class.forName(DRIVER_NAME);
		} catch (ClassNotFoundException e) {
			e.printStackTrace();
			System.exit(1);
		}
		// replace "hive" here with the name of the user the queries should run as
		String url = "jdbc:hive2://10.0.152.235:10000/default";
		String sql = "show tables";
		System.out.println("Running: " + sql);
		// try-with-resources closes the result set, statement, and connection
		try (Connection con = DriverManager.getConnection(url, "hive", "");
				Statement stmt = con.createStatement();
				ResultSet res = stmt.executeQuery(sql)) {
			// Print every table name, not only the first row
			while (res.next()) {
				System.out.println(res.getString(1));
			}
		}
	}
}
