Some problems encountered when moving data with Sqoop


time在左在右

Category: hands-on interview questions. Tags: hands-on

Article link: https://blog.csdn.net/java1440657916/article/details/97248422

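When importing into Hive, tables whose source database contains BLOB or TEXT columns cause errors. Workarounds:
CLOB: when importing from an Oracle database into Hive, the data of tables with CLOB columns came out scrambled, including rows in which some fields were entirely NULL.
Because the CLOB columns had no real analytical use in this project, we decided to drop them.
To keep CLOB columns from causing further trouble, we also made the import skip them altogether, as follows: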
[hadoop@master sqoop-1.4.5]$ cd $SQOOP_HOME/conf
[hadoop@master conf]$ vi oraoop-site.xml
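Uncomment the following property and set its value to true:

<property>
  <name>oraoop.import.omit.lobs.and.long</name>
  <value>true</value>
  <description>If true, OraOop will omit BLOB, CLOB, NCLOB and LONG columns during an Import.</description>
</property>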

Some tables contain CLOB columns but cannot be skipped entirely, because we still need their other columns. For those tables, list the wanted columns explicitly with --columns:
sqoop import --hive-import --hive-database test --create-hive-table --connect jdbc --username user --password user --bindir //scratch --outdir /Java --table aaa --columns "ID,NAME" -m 1 --null-string '\\N' --null-non-string '\\N'
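As a rough illustration of how such a --columns list could be put together, the sketch below filters LOB-type columns out of a hand-written inventory. The inventory itself (the REMARK and PHOTO columns and all the types) is a made-up assumption; in practice it would come from the source database's data dictionary.

```shell
# Hypothetical column inventory as NAME:TYPE pairs (assumed, not from a real table).
COLUMN_TYPES="ID:NUMBER NAME:VARCHAR2 REMARK:CLOB PHOTO:BLOB"

COLUMNS=""
for pair in $COLUMN_TYPES; do
  name=${pair%%:*}   # text before the first colon
  type=${pair##*:}   # text after the last colon
  case "$type" in
    BLOB|CLOB|NCLOB|LONG) continue ;;   # skip the column types that caused trouble
  esac
  # Append with a comma separator, except for the first kept column.
  COLUMNS="${COLUMNS:+$COLUMNS,}$name"
done

printf '%s\n' "--columns \"$COLUMNS\""
# prints: --columns "ID,NAME"
```

The resulting string can then be passed to sqoop import in place of the hard-coded "ID,NAME".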
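NULL storage consistency between Sqoop import and export
Hive stores NULL as "\N" in its underlying files, while MySQL stores NULL as a real NULL. To keep the two sides consistent, use --input-null-string and --input-null-non-string when exporting, and --null-string and --null-non-string when importing.
--input-null-non-string
The string that is read as NULL for non-string columns. It is compiled into the generated Java file and can be set to any value you want (for example the empty string '').
--input-null-string
The same, but for string columns. It is best to set it together with the option above and to the same value (for example the empty string).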
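A minimal sketch of an export that sets both options to the same marker. It only prints the command (dry_run is a helper defined here, not part of Sqoop); the host, database, and table names are borrowed from the export example in this article, the date value is a placeholder, and -P (prompt for the password) stands in for a literal password.

```shell
# Dry-run helper: prints the command instead of running it.
# Replace the body with "$@" to actually execute.
dry_run() { printf '%s ' "$@"; printf '\n'; }

day=20190725   # placeholder partition date (assumption)

EXPORT_CMD=$(dry_run sqoop export \
  --connect "jdbc:mysql://192.168.137.10:3306/user_behavior" \
  --username root -P \
  --table app_cource_study_report \
  --export-dir "/user/hive/warehouse/tmp.db/app_cource_study_analysis_${day}" \
  --fields-terminated-by '\t' \
  --input-null-string '\\N' \
  --input-null-non-string '\\N')

printf '%s\n' "$EXPORT_CMD"
```

Setting both options to the same value means a "\N" in the Hive files becomes NULL in MySQL regardless of the column type.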
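Data consistency during Sqoop export
1) Scenario 1: suppose Sqoop exports to MySQL with 4 map tasks and 2 of them fail. MySQL now holds only the rows written by the other two map tasks, and the boss happens to look at the report at exactly that moment. The developer notices the failure, debugs it, and eventually loads the complete data into MySQL correctly. When the boss looks at the report again, the numbers differ from what was shown before, which is unacceptable in production.
Official documentation: http://sqoop.apache.org/docs/1.4.6/SqoopUserGuide.html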
Since Sqoop breaks down export process into multiple transactions, it is possible that a failed export job may result in partial data being committed to the database. This can further lead to subsequent jobs failing due to insert collisions in some cases, or lead to duplicated data in others. You can overcome this problem by specifying a staging table via the --staging-table option which acts as an auxiliary table that is used to stage exported data. The staged data is finally moved to the destination table in a single transaction.
In short: an export runs as several transactions, so a failed job can leave partial data committed. A staging table given with --staging-table first collects the exported rows, which are then moved to the target table in a single transaction.
The --staging-table approach
sqoop export --connect jdbc:mysql://192.168.137.10:3306/user_behavior --username root --password 123456 --table app_cource_study_report --columns watch_video_cnt,complete_video_cnt,dt --fields-terminated-by "\t" --export-dir "/user/hive/warehouse/tmp.db/app_cource_study_analysis_${day}" --staging-table app_cource_study_report_tmp --clear-staging-table --input-null-string '\\N'
2) Scenario 2: set the number of map tasks to 1 (not recommended; an interviewer expects more than this answer).
With multiple map tasks, the --staging-table approach still solves the consistency problem.
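The staging table has to exist before the export starts and must have the same structure as the target table. One way to prepare it in MySQL, sketched below, is CREATE TABLE ... LIKE. The mysql client invocation in the comment is an assumption about the environment; the table names follow the export command above.

```shell
TARGET_TABLE=app_cource_study_report
STAGING_TABLE="${TARGET_TABLE}_tmp"

# Copies the column definitions and indexes of the target table, but no rows.
CREATE_SQL="CREATE TABLE IF NOT EXISTS ${STAGING_TABLE} LIKE ${TARGET_TABLE};"

printf '%s\n' "$CREATE_SQL"

# To apply it (assumes a mysql client that can reach the server):
#   mysql -h 192.168.137.10 -u root -p user_behavior -e "$CREATE_SQL"
```

With --clear-staging-table on the export, leftover rows from an earlier failed run are deleted before the new run begins.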
Generating and importing more than 1.5 GB of data
When the data volume is very large, the import may fail with an error.
Error:
exception "GC overhead limit exceeded"
Cause:
Why Sqoop Import throws this exception?
The answer is – During the process, RDBMS database (NOT SQOOP) fetches all the rows at one shot and tries to load everything into memory. This causes memory spill out and throws error. To overcome this you need to tell RDBMS database to return the data in batches. The following parameters “?dontTrackOpenResources=true&defaultFetchSize=10000&useCursorFetch=true” following the jdbc connection string tells database to fetch 10000 rows per batch.

Solutions:
1. Set the number of mappers (preferably no more than the number of nodes)
sqoop job --exec gp1813_user -- --num-mappers 8

2. Adjust the JVM memory. Drawback:
-Dmapreduce.map.memory.mb=6000
-Dmapreduce.map.java.opts=-Xmx1600m
-Dmapreduce.task.io.sort.mb=4800

3. Change how rows are read from MySQL, so that not all of the data is fetched into memory at once
?dontTrackOpenResources=true&defaultFetchSize=10000&useCursorFetch=true
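A sketch of how options 2 and 3 might be combined in a single import. It only prints the command (dry_run is a helper defined here, not part of Sqoop). The host and database come from the export example earlier in this article, some_large_table is a hypothetical table name, and -P prompts for the password. Two points matter: generic Hadoop -D options go directly after the tool name, before the Sqoop-specific options, and the JDBC URL must be quoted so the shell does not interpret the & characters.

```shell
# Dry-run helper: prints the command instead of running it.
# Replace the body with "$@" to actually execute.
dry_run() { printf '%s ' "$@"; printf '\n'; }

# Fetch settings from option 3, appended to the JDBC URL as a query string.
FETCH_PARAMS='dontTrackOpenResources=true&defaultFetchSize=10000&useCursorFetch=true'
JDBC_URL="jdbc:mysql://192.168.137.10:3306/user_behavior?${FETCH_PARAMS}"

IMPORT_CMD=$(dry_run sqoop import \
  -Dmapreduce.map.memory.mb=6000 \
  -Dmapreduce.map.java.opts=-Xmx1600m \
  --connect "$JDBC_URL" \
  --username root -P \
  --table some_large_table \
  --num-mappers 8)

printf '%s\n' "$IMPORT_CMD"
```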
