Performance Tuning Apache Sqoop

Sqoop is a tool offered by the Apache foundation that is commonly used in the Big Data world to import and export millions of records between heterogeneous relational databases (RDBMS) and the Hadoop Distributed File System (HDFS). Load times can vary from a couple of minutes to multiple hours, which is when data engineers worldwide look under the hood to fine-tune settings. The goal of performance tuning is to load more data in a shorter time, thus increasing efficiency and lessening the chance of data loss in case of network timeouts.

In general, performance tuning in Sqoop can be achieved by:

  • Controlling Parallelism
  • Controlling Data Transfer Process

Controlling Parallelism

Sqoop works on the MapReduce programming model implemented in Hadoop. It imports and exports data from most relational databases in parallel, and the number of map tasks per job determines its parallelism. By controlling the parallelism we can control the load we place on our databases, and hence the job's performance. Here are a couple of ways to control parallelism in Sqoop jobs:

Changing the number of mappers

Sqoop jobs launch four mappers by default. To optimize performance, increasing the number of map tasks (parallel processes) to 8 or 16 can yield a noticeable improvement on some databases.

Using the -m or --num-mappers parameter, we can set the degree of parallelism in Sqoop. For example, changing the number of mappers to 10:

sqoop import \
--connect jdbc:mysql://mysql.example.com/sqoop \
--username sqoop \
--password sqoop \
--table cities \
--num-mappers 10

A few things to keep in mind: the number of map tasks should stay below the maximum number of parallel database connections possible, and the degree of parallelism should not exceed what is available in your MapReduce cluster.
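
If in doubt about the connection ceiling, you can check it on the database side before raising the mapper count. A quick sanity check on MySQL (a minimal sketch reusing the article's example host and credentials):

mysql -h mysql.example.com -u sqoop -p -e "SHOW VARIABLES LIKE 'max_connections';"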

Split By Query

When performing parallel imports, Sqoop needs a criterion by which it can split the workload. Sqoop uses a splitting column to split the workload. By default, Sqoop will identify the primary key column (if present) in a table and use it as the splitting column. The low and high values for the splitting column are retrieved from the database, and the map tasks operate on evenly-sized components of the total range.
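
As an illustration (the numbers below are assumed for the example, not from the article): with city_id as the splitting column and four mappers, Sqoop first runs a bounding query and then hands each mapper one even slice of the range:

# Bounding query issued against the source table:
#   SELECT MIN(city_id), MAX(city_id) FROM cities
# If this returns (1, 100), the four mappers each import roughly one slice:
#   mapper 1: WHERE city_id >= 1  AND city_id < 26
#   mapper 2: WHERE city_id >= 26 AND city_id < 51
#   mapper 3: WHERE city_id >= 51 AND city_id < 76
#   mapper 4: WHERE city_id >= 76 AND city_id <= 100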

The --split-by parameter splits the column data uniformly on the basis of the number of mappers specified. The syntax for --split-by is given by:

sqoop import \
--connect jdbc:mysql://mysql.example.com/sqoop \
--username sqoop \
--password sqoop \
--table cities \
--split-by city_id

Controlling Data Transfer Process

A popular way to improve performance is to manage how we import and export data. Here are a few techniques:

Batching

Batching means that related SQL statements can be grouped into a batch when you export data.

The JDBC interface exposes an API for doing batches in a prepared statement with multiple sets of values. With the --batch parameter, Sqoop can take advantage of this. This API is present in all JDBC drivers because it is required by the JDBC interface.

Batching is disabled by default in Sqoop. Enable JDBC batching using the --batch parameter.

sqoop export \
--connect jdbc:mysql://mysql.example.com/sqoop \
--username sqoop \
--password sqoop \
--table cities \
--export-dir /data/cities \
--batch
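
Beyond --batch, Sqoop's export path can also be tuned through two Hadoop properties that control how many rows are folded into each INSERT statement and how many statements share a transaction. A sketch (the 100/100 values are arbitrary illustrations, not recommendations):

sqoop export \
-Dsqoop.export.records.per.statement=100 \
-Dsqoop.export.statements.per.transaction=100 \
--connect jdbc:mysql://mysql.example.com/sqoop \
--username sqoop \
--password sqoop \
--table cities \
--export-dir /data/cities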

Fetch Size

By default, Sqoop imports 1000 records at a time. This can be overridden with the --fetch-size parameter, which specifies the number of records Sqoop imports per round trip, using the following syntax:

sqoop import \
--connect jdbc:mysql://mysql.example.com/sqoop \
--username sqoop \
--password sqoop \
--table cities \
--fetch-size=n

Where n represents the number of entries that Sqoop must fetch at a time.

Depending on the available memory and bandwidth, the fetch-size value can be increased in line with the volume of data that needs to be read.
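
For instance, on a cluster with ample memory one might try a larger value (10000 here is purely illustrative, not a tuned recommendation):

sqoop import \
--connect jdbc:mysql://mysql.example.com/sqoop \
--username sqoop \
--password sqoop \
--table cities \
--fetch-size=10000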

Direct Mode

By default, the Sqoop import process uses JDBC, which provides reasonable cross-vendor import channel support. However, some databases can achieve higher performance with database-specific utilities, since these are optimized to provide the best possible transfer speed while putting less strain on the database server.

Supplying the --direct argument forces Sqoop to attempt the direct import channel, which may perform better than JDBC.

sqoop import \
--connect jdbc:mysql://mysql.example.com/sqoop \
--username sqoop \
--password sqoop \
--table cities \
--direct

This faster import comes with several limitations. For one, not all databases have native utilities available, so direct mode is not supported for every database.

Sqoop has direct support only for MySQL and PostgreSQL.
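
In direct mode, arguments placed after a lone -- on the command line are passed straight through to the underlying native utility (mysqldump, in the case of MySQL imports). A sketch, assuming the dump's character set needs overriding:

sqoop import \
--connect jdbc:mysql://mysql.example.com/sqoop \
--username sqoop \
--password sqoop \
--table cities \
--direct \
-- --default-character-set=latin1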

Custom Boundary Queries

As seen before, --split-by distributes the data uniformly for import. If the splitting column has non-uniform values and --split-by alone does not give the desired results, a custom boundary query can be used.

Ideally, we configure the --boundary-query parameter with a query returning min(id) and max(id) for the table.

sqoop import \
--connect jdbc:mysql://mysql.example.com/sqoop \
--username sqoop \
--password sqoop \
--query 'SELECT normcities.id, \
countries.country, \
normcities.city \
FROM normcities \
JOIN countries USING(country_id) \
WHERE $CONDITIONS' \
--split-by id \
--target-dir cities \
--boundary-query "SELECT min(id), max(id) FROM normcities"

Here $CONDITIONS is used internally by Sqoop and is expanded into each mapper's range predicate, based on the min and max id returned by the boundary query, to split the data. By supplying custom boundary values, more efficient partitions can be derived, generally improving performance.

Conclusion

The goal of performance tuning is to achieve a significant speedup without compromising on resources. Using the techniques above, you should be able to improve efficiency and significantly reduce data transfer times.

Translated from: https://medium.com/swlh/performance-tuning-apache-sqoop-512242a58df5
