Spark Upgrade (1.6.1 to 2.0.2): Ghost Partitions

1. Problem Description

After the Spark version upgrade, partitions whose underlying data had already been deleted were still registered for the table. At query time Spark checks the table's partitions and throws an error for any partition whose data directory no longer exists, regardless of whether the query actually touches the deleted partitions.
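
As a minimal PySpark sketch of the failure mode (the table, column, and partition names are taken from the rest of this post): the filter only touches day=20171221, whose data still exists on HDFS, yet the job can still abort on a stale partition directory such as day=20170721/hour=09.

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("ghost-partition-repro")
         .enableHiveSupport()
         .getOrCreate())

# The WHERE clause only targets day=20171221, which still has data,
# but the scan can still fail with FileNotFoundException for an
# already-deleted partition directory (see the stack trace below).
spark.sql("""
    SELECT eid, ecid, sld, count(sld)
    FROM base_dmp_v1
    WHERE day = 20171221
    GROUP BY eid, ecid, sld
    LIMIT 10
""").show()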

2. Troubleshooting

1) Query the existing data through impala-shell; the query works normally:

[dmp175:21000] > select eid,ecid,sld,count(sld) from base_dmp_v1 where day=20171221 group by eid,ecid,sld limit 10;
Query: select eid,ecid,sld,count(sld) from base_dmp_v1 where day=20171221 group by eid,ecid,sld limit 10
+------+------+------------------------------------------------+------------+
| eid  | ecid | sld                                            | count(sld) |
+------+------+------------------------------------------------+------------+
| 7921 | 15   | only-250629-175-167-154-163.nstool.netease.com | 2          |
| 7921 | 15   | only-732014-175-167-146-171.nstool.netease.com | 1          |
| 143  | 21   | www.simcere.com                                | 2          |
| 49   | 15   | atf.flyert.com                                 | 18         |
| 9    | 1    | fw.hswzyj.com                                  | 2          |
| 143  | 21   | 1513857630728.081.sngdia.imtmp.net             | 1          |
| 7921 | 15   | only-822862-175-167-138-69.nstool.netease.com  | 2          |
| 7922 | 15   | 249989901-mmspdu.s3.lecloud.com                | 10         |
| 9    | 1    | wq22.82mo.cn                                   | 10         |
| 143  | 21   | 1513837643698.051.sngdia.imtmp.net             | 1          |
+------+------+------------------------------------------------+------------+
Fetched 10 row(s) in 342.76s


2) Run the same aggregation through Spark beeline; it fails:

0: jdbc:hive2://192.168.200.175:10000/default> insert into table statistic_sld_group_eidecid  select eid,ecid,sld,count(sld) from base_dmp_v1 where day=20171221 group by eid,ecid,sld ;
Error: org.apache.spark.SparkException: Job aborted due to stage failure: Task 45 in stage 1.0 failed 4 times, most recent failure: Lost task 45.3 in stage 1.0 (TID 172, 192.168.200.174): java.io.FileNotFoundException: File does not exist: hdfs://logSave/home/data/etl/export/parquet/base_dmp_v1/day=20170721/hour=09
        at org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1309)
        at org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1301)
        at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
        at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1317)
        at org.apache.spark.sql.execution.datasources.HadoopFsRelation$$anonfun$7$$anonfun$apply$3.apply(fileSourceInterfaces.scala:465)
        at org.apache.spark.sql.execution.datasources.HadoopFsRelation$$anonfun$7$$anonfun$apply$3.apply(fileSourceInterfaces.scala:462)
        at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
        at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
        at scala.collection.Iterator$class.foreach(Iterator.scala:893)
        at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
        at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
        at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
        at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
        at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
        at scala.collection.AbstractIterator.to(Iterator.scala:1336)
        at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302)
        at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1336)
        at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289)
        at scala.collection.AbstractIterator.toArray(Iterator.scala:1336)
        at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:912)
        at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:912)
        at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1899)
        at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1899)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
        at org.apache.spark.scheduler.Task.run(Task.scala:86)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)

Driver stacktrace: (state=,code=0)
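
A quick sanity check (a sketch using the path copied from the stack trace above): confirm that the directory Spark complained about really is gone from HDFS. hdfs dfs -test -d exits with 0 only if the directory exists.

import subprocess

# Path copied from the FileNotFoundException above.
path = ("hdfs://logSave/home/data/etl/export/parquet/"
        "base_dmp_v1/day=20170721/hour=09")

# Exit code 0 means the directory exists; anything else means it is missing.
ret = subprocess.call(["hdfs", "dfs", "-test", "-d", path])
print("%s %s" % (path, "exists" if ret == 0 else "is missing"))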

3) Run show partitions tablename to list the table's partitions; the result is shown below:

0: jdbc:hive2://192.168.200.175:10000/default> show partitions base_dmp_v1;
+-----------------------+--+
|        result         |
+-----------------------+--+
| day=20171220/hour=10  |
| day=20171220/hour=11  |
| day=20171220/hour=12  |
| day=20171220/hour=13  |
| day=20171220/hour=14  |
| day=20171220/hour=15  |
| day=20171220/hour=16  |
| day=20171220/hour=17  |
| day=20171220/hour=18  |
| day=20171220/hour=19  |
| day=20171220/hour=20  |
| day=20171220/hour=21  |
| day=20171220/hour=22  |
| day=20171220/hour=23  |
| day=20171221/hour=00  |
| day=20171221/hour=01  |
| day=20171221/hour=02  |
| day=20171221/hour=03  |
| day=20171221/hour=04  |
| day=20171221/hour=05  |
| day=20171221/hour=06  |
| day=20171221/hour=07  |
| day=20171221/hour=08  |
| day=20171221/hour=09  |
| day=20171221/hour=10  |
| day=20171221/hour=11  |
| day=20171221/hour=12  |
| day=20171221/hour=13  |
| day=20171221/hour=14  |
| day=20171221/hour=15  |
| day=20171221/hour=16  |
| day=20171221/hour=17  |
| day=20171221/hour=18  |
| day=20171221/hour=19  |
| day=20171221/hour=20  |
| day=20171221/hour=21  |
| day=20171221/hour=22  |
| day=20171221/hour=23  |
| day=20171222/hour=00  |
| day=20171222/hour=01  |
| day=20171222/hour=02  |
| day=20171222/hour=03  |
| day=20171222/hour=04  |
| day=20171222/hour=05  |
| day=20171222/hour=06  |
| day=20171222/hour=07  |
| day=20171222/hour=08  |
| day=20171222/hour=09  |
| day=20171222/hour=10  |
| day=20171222/hour=11  |
| day=20171222/hour=12  |
| day=20171222/hour=13  |
| day=20171222/hour=14  |
+-----------------------+--+

Analysis: many partitions whose data had already been deleted were still registered for the table.
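
To enumerate exactly which registered partitions have lost their data, a small cross-check script can compare the metastore's partition list with the directories on HDFS. This is a sketch under a few assumptions: the beeline used here accepts --outputformat=tsv2, and the table's location is the base path that appears in the stack trace above.

import subprocess

TABLE = "base_dmp_v1"
BASE_PATH = "hdfs://logSave/home/data/etl/export/parquet/base_dmp_v1"
BEELINE = ["beeline", "-u", "jdbc:hive2://192.168.200.175:10000/default",
           "-n", "root", "--outputformat=tsv2", "-e",
           "show partitions %s;" % TABLE]

# Partitions registered in the metastore, e.g. "day=20170721/hour=09".
out = subprocess.check_output(BEELINE).decode("utf-8")
partitions = [line.strip() for line in out.splitlines()
              if line.strip().startswith("day=")]

# A partition is a "ghost" if its directory no longer exists on HDFS.
ghosts = [p for p in partitions
          if subprocess.call(["hdfs", "dfs", "-test", "-d",
                              "%s/%s" % (BASE_PATH, p)]) != 0]

print("ghost partitions:")
for p in ghosts:
    print(p)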

3. Solution

The fix is simple: drop the partitions whose data has already been deleted (via a Python script). The underlying command is: ALTER TABLE base_dmp_v1 DROP IF EXISTS PARTITION(day=20171206);

The following Python script drops all of the stale partitions:

# coding=utf-8
import os

YEAR = "2017"

# Days per month in 2017 (not a leap year, so February has 28 days).
DAYS_IN_MONTH = {
    "01": 31, "02": 28, "03": 31, "04": 30, "05": 31, "06": 30,
    "07": 31, "08": 31, "09": 30, "10": 31, "11": 30, "12": 31,
}

BEELINE = "beeline -u jdbc:hive2://192.168.200.175:10000/default -n root -e "

for month in sorted(DAYS_IN_MONTH):
    for day in range(1, DAYS_IN_MONTH[month] + 1):
        # Keep the partitions from 2017-12-20 onward: their data still exists.
        if month == "12" and day >= 20:
            continue
        partition = "%s%s%02d" % (YEAR, month, day)
        sql = "'ALTER TABLE base_dmp_v1 DROP IF EXISTS PARTITION(day=%s);'" % partition
        print(BEELINE + sql)
        os.system(BEELINE + sql)
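
Two notes on the script. First, the partition spec only names the day column; in Hive a partial spec such as PARTITION(day=20170721) drops every matching partition, including all of its hour sub-partitions, so the hour level does not need to be listed. Second, instead of walking the whole calendar, the drops can be limited to the ghost partitions detected by the cross-check sketch in section 2. A sketch of that variant follows; it assumes the ghosts list produced by that script, and it quotes the hour value on the assumption that hour is a string column (it is shown zero-padded above).

import subprocess

BEELINE = ["beeline", "-u", "jdbc:hive2://192.168.200.175:10000/default",
           "-n", "root", "-e"]

for part in ghosts:                       # e.g. "day=20170721/hour=09"
    day_kv, hour_kv = part.split("/")
    hour_val = hour_kv.split("=")[1]
    sql = ("ALTER TABLE base_dmp_v1 DROP IF EXISTS "
           "PARTITION(%s, hour='%s');" % (day_kv, hour_val))
    print(sql)
    subprocess.call(BEELINE + [sql])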


4. After Dropping the Partitions

1) List the partitions again; only the partitions whose data was never deleted remain:

Beeline version 1.2.1.spark2 by Apache Hive
0: jdbc:hive2://192.168.200.175:10000/default> show partitions base_dmp_v1;
+-----------------------+--+
|        result         |
+-----------------------+--+
| day=20171222/hour=00  |
| day=20171222/hour=01  |
| day=20171222/hour=02  |
| day=20171222/hour=03  |
| day=20171222/hour=04  |
| day=20171222/hour=05  |
| day=20171222/hour=06  |
| day=20171222/hour=07  |
| day=20171222/hour=08  |
| day=20171222/hour=09  |
| day=20171222/hour=10  |
| day=20171222/hour=11  |
| day=20171222/hour=12  |
| day=20171222/hour=13  |
| day=20171222/hour=14  |
+-----------------------+--+
53 rows selected (0.292 seconds)

2) The query now runs successfully:

0: jdbc:hive2://192.168.200.175:10000/default> select eid,ecid,sld,count(sld) from base_dmp_v1 where day=20171221 group by eid,ecid,sld  limit 10;
+-------+-------+--------------------------+-------------+--+
|  eid  | ecid  |           sld            | count(sld)  |
+-------+-------+--------------------------+-------------+--+
| 7342  | 15    | res.imtt.qq.com          | 11628217    |
| 83    | 5     | wup.huya.com             | 6543218     |
| 117   | 15    | screenshot.dwstatic.com  | 492138      |
| 21    | 6     | msg.mobile.kugou.com     | 1728706     |
| 304   | 15    | i8.mifile.cn             | 215481      |
| 7922  | 15    | gad.netease.com          | 31423       |
| 7115  | 21    | 185.38.13.130            | 114         |
| 143   | 21    | url.cn                   | 892582      |
| 6     | 5     | 42.56.78.17              | 491946      |
| 2     | 17    | mdws.openapi.360.cn      | 1148830     |
+-------+-------+--------------------------+-------------+--+
10 rows selected (212.995 seconds)

3) Compare Impala and Spark query speed, and look into why Spark is faster here.

Premise: Impala and Spark share the same Hive metastore and HDFS.
From the Impala documentation:
Factors Affecting Scalability:

A typical analytic workload (TPC-DS style queries) using recommended hardware is usually CPU-bound. Each node can process roughly 1.6 GB/sec. Both CPU-bound and disk-bound workloads can scale almost linearly with cluster size. However, for some workloads, the scalability might be bounded by the network, or even by memory.

If the workload is already network bound (on a 10 GB network), increasing the cluster size won’t reduce the network load; in fact, a larger cluster could increase network traffic because some queries involve "broadcast" operations to all DataNodes. Therefore, boosting the cluster size does not improve query throughput in a network-constrained environment.

Let’s look at a memory-bound workload. A workload is memory-bound if Impala cannot run any additional concurrent queries because all memory allocated has already been consumed, but neither CPU, disk, nor network is saturated yet. This can happen because currently Impala uses only a single core per node to process join and aggregation queries. For a node with 128 GB of RAM, if a join node takes 50 GB, the system cannot run more than 2 such queries at the same time.

Therefore, at most 2 cores are used. Throughput can still scale almost linearly even for a memory-bound workload. It’s just that the CPU will not be saturated. Per-node throughput will be lower than 1.6 GB/sec. Consider increasing the memory per node.

As long as the workload is not network- or memory-bound, we can use the 1.6 GB/second per node as the throughput estimate.


This shows that Impala processes joins and aggregations with a single core per node, so in the example above at most two cores end up doing the work. Spark, by contrast, lets you configure the number of cores and the amount of memory used on each node in the cluster; in this project we chose 5 cores and 10 GB of memory, so it is no surprise that Spark processes the data faster.
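
For reference, a minimal sketch of how those resources are set when the Spark job is created (spark.executor.cores and spark.executor.memory are standard Spark properties; the values 5 and 10g are the ones chosen for this project, and the application name is made up for the example). The same settings can also be passed to spark-submit as --executor-cores 5 --executor-memory 10g.

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("base-dmp-aggregation")          # name chosen for this example
         .config("spark.executor.cores", "5")      # cores per executor
         .config("spark.executor.memory", "10g")   # memory per executor
         .enableHiveSupport()
         .getOrCreate())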

