table scan
- SQL
select count(1) from dmp.trait_zamplus_supply_v2;
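For the Spark runs described below, the same statement can be issued through Spark SQL against the Hive metastore. This is only a minimal sketch of such a driver, assuming a Spark 1.x `HiveContext`; the actual test harness is not included in these notes.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object TableScanCount {
  def main(args: Array[String]): Unit = {
    // Master, executor cores and memory are supplied via spark-submit.
    val sc = new SparkContext(new SparkConf().setAppName("table-scan-count"))
    val hive = new HiveContext(sc)

    // Same query that the Hive/MapReduce test runs.
    val result = hive.sql("select count(1) from dmp.trait_zamplus_supply_v2")
    result.show()

    sc.stop()
  }
}
```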
- Table info
Type | Value |
---|---|
Input files | 600 |
Input size | 296.7 GB |
Average file size | 500 MB |
Row count | 1795165725 |
- Test results
Dimension | MapReduce | Spark Test1 | Spark Test2 |
---|---|---|---|
Cores used | about 400 | 400 | 400 |
Time Spent (seconds) | 181.089 | 313.455 | 71.575 |
- MapReduce
Map locality | Task count |
---|---|
Data-local map | 704 |
Rack-local map | 419 |
Other local map | 64 |
All map tasks | 1187 |
Average Map Time: 25 sec
Average Shuffle Time: 56 sec
- Spark Test1
Locality level | Task count |
---|---|
NODE_LOCAL | 2374 |
RACK_LOCAL | 26 |
All tasks | 2400 |
Total Time Across All Tasks: 5.9 h
Input Size / Records: 296.7 GB / 1795165725
Shuffle Write: 72.7 KB / 2400
- Spark Test2
Locality level | Task count |
---|---|
NODE_LOCAL | 2381 |
RACK_LOCAL | 19 |
All tasks | 2400 |
Total Time Across All Tasks: 4.2 h
Input Size / Records: 296.7 GB / 1795165725
Shuffle Write: 72.7 KB / 2400
- Note
Our Hadoop block size is 64 MB. In Hive I set mapreduce.input.fileinputformat.split.maxsize to 256000000 (~256 MB). Spark Test1 set mapred.max.split.size=64M, while Spark Test2 set mapred.max.split.size=256000000; the sketch below shows how these settings can be applied.
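As a hedged sketch of how those split sizes can be applied (the property names and values come from the note above; the driver code itself is an assumption, not the original test code): on the Hive side the value is set with a `SET` statement in the session, and on the Spark side the old-API property can be pushed into the Hadoop configuration before the `HiveContext` scan runs.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object SplitSizeConfig {
  def main(args: Array[String]): Unit = {
    // Hive (MapReduce) side, issued in the same session as the query:
    //   SET mapreduce.input.fileinputformat.split.maxsize=256000000;

    val sc = new SparkContext(new SparkConf().setAppName("table-scan-split-size"))

    // Spark Test1: splits capped at the HDFS block size (64 MB = 67108864 bytes).
    // sc.hadoopConfiguration.set("mapred.max.split.size", "67108864")

    // Spark Test2: larger splits, matching the Hive-side maxsize of 256000000.
    sc.hadoopConfiguration.set("mapred.max.split.size", "256000000")

    val hive = new HiveContext(sc)
    hive.sql("select count(1) from dmp.trait_zamplus_supply_v2").show()

    sc.stop()
  }
}
```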