An in-depth introduction to Sqoop architecture

by Jayvardhan Reddy

Apache Sqoop is a data ingestion tool designed for efficiently transferring bulk data between Apache Hadoop and structured data stores such as relational databases, and vice versa.

As part of this blog, I will explain how the architecture works when you execute a Sqoop command. I'll cover details such as jar generation via Codegen, execution of the MapReduce job, and the various stages involved in running a Sqoop import/export command.

Codegen

Understanding Codegen is essential because, internally, it converts our Sqoop job into a jar consisting of several Java classes: a POJO, an ORM class, and a class that implements DBWritable and extends SqoopRecord to read and write data between relational databases and Hadoop.

You can run Codegen explicitly, as shown below, to check the classes present as part of the jar.

sqoop codegen \
  --connect jdbc:mysql://ms.jayReddy.com:3306/retail_db \
  --username retail_user \
  --password ******* \
  --table products

The output jar will be written to your local file system. You will get a jar file, the generated Java source file, and the .class files it is compiled into:

Let us see a snippet of the code that will be generated.

ORM class for table ‘products’ // Object-relational model generated for mapping:

Setter and getter methods to access the values:

Internally, it uses JDBC prepared statements to write data to the relational database and a ResultSet to read data from it.
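To give a feel for the generated code, here is a minimal, hand-written sketch of what such a class roughly looks like for the products table. It is illustrative only: the actual Sqoop-generated class extends SqoopRecord, covers every column of the table, and contains far more plumbing, so the field names and types below are assumptions.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.lib.db.DBWritable;

// Simplified stand-in for the Sqoop-generated ORM class for table 'products'.
// The columns shown here are illustrative.
public class products implements DBWritable, Writable {

    private Integer product_id;
    private String product_name;

    // Getter and setter methods for each column
    public Integer get_product_id() { return product_id; }
    public void set_product_id(Integer value) { this.product_id = value; }
    public String get_product_name() { return product_name; }
    public void set_product_name(String value) { this.product_name = value; }

    // Read one record from a JDBC ResultSet (database -> object)
    @Override
    public void readFields(ResultSet rs) throws SQLException {
        this.product_id = rs.getInt(1);
        this.product_name = rs.getString(2);
    }

    // Bind the record's fields to a JDBC PreparedStatement (object -> database)
    @Override
    public void write(PreparedStatement stmt) throws SQLException {
        stmt.setInt(1, product_id);
        stmt.setString(2, product_name);
    }

    // Hadoop serialization, used when the record moves between tasks and HDFS
    @Override
    public void readFields(DataInput in) throws IOException {
        this.product_id = in.readInt();
        this.product_name = in.readUTF();
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeInt(product_id);
        out.writeUTF(product_name);
    }
}

The DBWritable methods (readFields(ResultSet) and write(PreparedStatement)) tie the record to JDBC, while the Writable methods let Hadoop serialize it between the map tasks and HDFS.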

Sqoop Import

It is used to import data from traditional relational databases into Hadoop.

Let's see a sample snippet for the same.

sqoop import \
  --connect jdbc:mysql://ms.jayReddy.com:3306/retail_db \
  --username retail_user \
  --password ******* \
  --table products \
  --warehouse-dir /user/jvanchir/sqoop_prac/import_table_dir \
  --delete-target-dir

The following steps take place internally during the execution of Sqoop.

Step 1: Read data from MySQL in a streaming fashion and perform various operations before writing the data into HDFS.

As part of this process, it will first generate code (typical MapReduce code), which is nothing but Java code. Using this Java code, it will try to import.

  • Generate the code. (Hadoop MR)
  • Compile the code and generate the jar file.
  • Submit the jar file and perform the import operations.

During the import, it has to make certain decisions about how to divide the data among multiple threads so that the Sqoop import can scale.

Step 2: Understand the structure of the data and perform CodeGen

It first issues a lightweight SQL statement (for MySQL, typically of the form SELECT t.* FROM products AS t LIMIT 1) to fetch one record along with the column names. Using this information, it extracts the metadata of the columns, their datatypes, and so on.
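Conceptually, this metadata lookup is plain JDBC. The sketch below shows the idea; the exact query form and the connection details are assumptions, and Sqoop performs the equivalent step internally through its connector:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.ResultSetMetaData;
import java.sql.Statement;

public class DescribeProducts {
    public static void main(String[] args) throws Exception {
        // Illustrative connection details; Sqoop takes these from --connect/--username/--password
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:mysql://ms.jayReddy.com:3306/retail_db", "retail_user", "password");
             Statement stmt = conn.createStatement();
             // Fetch a single row so the driver also returns the column metadata
             ResultSet rs = stmt.executeQuery("SELECT t.* FROM products AS t LIMIT 1")) {

            ResultSetMetaData md = rs.getMetaData();
            for (int i = 1; i <= md.getColumnCount(); i++) {
                // The column name and SQL type decide the Java type used in the generated class
                System.out.println(md.getColumnName(i) + " -> " + md.getColumnTypeName(i));
            }
        }
    }
}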

Step 3: Create the Java file, compile it, and generate a jar file

As part of code generation, it needs to understand the structure of the data, and it applies the generated class to the incoming data internally to make sure the data is correctly copied to the target. Each unique table has one Java file describing the structure of its data.

This jar file will be injected into Sqoop binaries to apply the structure to incoming data.

Step 4: Delete the target directory if it already exists.

Step 5: Import the data

Here, it connects to the Resource Manager, acquires resources, and starts the Application Master.

To distribute the data equally among the map tasks, it internally executes a boundary query, based on the primary key by default, to find the minimum and maximum values of the key in the table. It then divides this range by the number of mappers and assigns one split to each mapper.

It uses 4 mappers by default:
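As a rough illustration of the split logic, the sketch below derives per-mapper ranges from the result of the boundary query for an integer primary key. The min/max values are made up, and Sqoop's real splitter classes handle data types, nulls, and skew far more carefully:

public class SplitSketch {
    public static void main(String[] args) {
        // Illustrative result of the boundary query, e.g.
        // SELECT MIN(product_id), MAX(product_id) FROM products
        long min = 1;
        long max = 1345;
        int numMappers = 4; // Sqoop's default degree of parallelism

        long splitSize = (long) Math.ceil((max - min + 1) / (double) numMappers);
        for (int i = 0; i < numMappers; i++) {
            long lo = min + (long) i * splitSize;
            long hi = Math.min(lo + splitSize - 1, max);
            // Each mapper receives a mutually exclusive range of the split column
            System.out.printf("mapper %d: product_id >= %d AND product_id <= %d%n", i, lo, hi);
        }
    }
}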

It executes these jobs on different executors as shown below:

The default number of mappers can be changed by setting the --num-mappers (or -m) parameter.

So in our case, it uses 4 threads. Each thread processes a mutually exclusive subset of the data; that is, each thread processes different data from the others.

To see the different values, check out the below:

Operations being performed under each executor node:

In case you perform a Sqoop Hive import (with the --hive-import option), one extra step takes place as part of the execution.

Step 6: Copy data to the Hive table

Sqoop Export

This is used to export data from Hadoop into traditional relational databases.

Let's see a sample snippet for the same:

sqoop export \
  --connect jdbc:mysql://ms.jayReddy.com:3306/retail_export \
  --username retail_user \
  --password ******* \
  --table product_sqoop_exp \
  --export-dir /user/jvanchir/sqoop_prac/import_table_dir/products

On executing the above command, execution steps (1–4) similar to the Sqoop import take place, but the source data is read from the file system (which is nothing but HDFS). Here it uses boundaries based on block size to divide the data, and this is taken care of internally by Sqoop.

The processing splits are done as shown below:

After connecting to the respective database to which the records are to be exported, it reads the data from HDFS and issues JDBC INSERT statements to store it in the database, as shown below.
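At its core, that write path is a parameterized batch INSERT over JDBC, roughly as sketched below. The column list and the sample values are assumptions; Sqoop generates and batches these statements itself from each mapper's HDFS split:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class ExportSketch {
    public static void main(String[] args) throws Exception {
        // Illustrative connection details; Sqoop takes these from the export command
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:mysql://ms.jayReddy.com:3306/retail_export", "retail_user", "password")) {
            // One parameterized INSERT per target table; each mapper binds the
            // records it reads from its HDFS split and sends them in batches
            String sql = "INSERT INTO product_sqoop_exp (product_id, product_name) VALUES (?, ?)";
            try (PreparedStatement stmt = conn.prepareStatement(sql)) {
                stmt.setInt(1, 1);
                stmt.setString(2, "example product");
                stmt.addBatch();
                // ... repeat for the remaining records of the split ...
                stmt.executeBatch();
            }
        }
    }
}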

Now that we have seen how Sqoop works internally, you can follow the flow of execution from jar generation to the execution of the MapReduce tasks when a Sqoop job is submitted.

Note: The commands executed in relation to this post are available as part of my Git account.

Similarly, you can also read more here:

If you would like to, you can connect with me on LinkedIn - Jayvardhan Reddy.

If you enjoyed reading this article, you can click the clap and let others know about it. If you would like me to add anything else, please feel free to leave a response.

Translated from: https://www.freecodecamp.org/news/an-in-depth-introduction-to-sqoop-architecture-ad4ae0532583/
