CUHK-IEMS5730-HW0

Environment

  • Google Cloud Platform
  • Ubuntu 14.04 LTS
  • Instance: 2 cores, 8GB RAM, 50GB storage
  • Openjdk-7-jdk/jre
  • Hadoop 2.9.2
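
A quick way to confirm that the toolchain matches the versions above (these checks are illustrative, not part of the original transcript):

java -version     # expect an OpenJDK 1.7.x runtime
hadoop version    # expect Hadoop 2.9.2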

Part a

i: Single Node Hadoop Setup

The 50070 page of vm-master under pseudo-distributed mode is in folder 1155114915-HW0/a/i.
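
For reference, a minimal pseudo-distributed configuration consistent with the logs below (which show HDFS at hdfs://localhost:9000) sets fs.defaultFS and a replication factor of 1. This is only a sketch; the exact files used may differ slightly:

cat > etc/hadoop/core-site.xml <<'EOF'
<configuration>
  <property><name>fs.defaultFS</name><value>hdfs://localhost:9000</value></property>
</configuration>
EOF
cat > etc/hadoop/hdfs-site.xml <<'EOF'
<configuration>
  <property><name>dfs.replication</name><value>1</value></property>
</configuration>
EOF
# format HDFS once, then start the daemons as shown in part ii
bin/hdfs namenode -format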

ii: Tera Example on Single Node Machine

This example is run on the vm-master instance.
Below are the command and output of the teragen job; they can also be found under 1155114915-HW0/a/ii/teragen.txt:

ierg5730-gcp-key@vm-master:~/hadoop$ sbin/start-all.sh
This script is Deprecated. Instead use start-dfs.sh and start-yarn.sh
Starting namenodes on [localhost]
localhost: starting namenode, logging to /home/ierg5730-gcp-key/hadoop/logs/hadoop-ierg5730-gcp-key-namenode-vm-master.out
localhost: starting datanode, logging to /home/ierg5730-gcp-key/hadoop/logs/hadoop-ierg5730-gcp-key-datanode-vm-master.out
Starting secondary namenodes [0.0.0.0]
0.0.0.0: starting secondarynamenode, logging to /home/ierg5730-gcp-key/hadoop/logs/hadoop-ierg5730-gcp-key-secondarynamenode-vm-master.out
starting yarn daemons
starting resourcemanager, logging to /home/ierg5730-gcp-key/hadoop/logs/yarn-ierg5730-gcp-key-resourcemanager-vm-master.out
localhost: starting nodemanager, logging to /home/ierg5730-gcp-key/hadoop/logs/yarn-ierg5730-gcp-key-nodemanager-vm-master.out
ierg5730-gcp-key@vm-master:~/hadoop$ jps
2928 DataNode
2729 NameNode
3483 NodeManager
3721 Jps
3168 SecondaryNameNode
3321 ResourceManager
ierg5730-gcp-key@vm-master:~/hadoop$ ./bin/hadoop jar ./share/hadoop/mapreduce/hadoop-mapreduce-examples-2.9.2.jar teragen 100000 terasort/input
19/01/18 04:38:50 INFO Configuration.deprecation: session.id is deprecated. Instead, use dfs.metrics.session-id
19/01/18 04:38:50 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
19/01/18 04:38:50 INFO terasort.TeraGen: Generating 100000 using 1
19/01/18 04:38:50 INFO mapreduce.JobSubmitter: number of splits:1
19/01/18 04:38:51 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local1291551154_0001
19/01/18 04:38:51 INFO mapreduce.Job: The url to track the job: http://localhost:8080/
19/01/18 04:38:51 INFO mapreduce.Job: Running job: job_local1291551154_0001
19/01/18 04:38:51 INFO mapred.LocalJobRunner: OutputCommitter set in config null
19/01/18 04:38:51 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 1
19/01/18 04:38:51 INFO output.FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
19/01/18 04:38:51 INFO mapred.LocalJobRunner: OutputCommitter is org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
19/01/18 04:38:51 INFO mapred.LocalJobRunner: Waiting for map tasks
19/01/18 04:38:51 INFO mapred.LocalJobRunner: Starting task: attempt_local1291551154_0001_m_000000_0
19/01/18 04:38:51 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 1
19/01/18 04:38:51 INFO output.FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
19/01/18 04:38:51 INFO mapred.Task:  Using ResourceCalculatorProcessTree : [ ]
19/01/18 04:38:51 INFO mapred.MapTask: Processing split: org.apache.hadoop.examples.terasort.TeraGen$RangeInputFormat$RangeInputSplit@4352d1fc
19/01/18 04:38:52 INFO mapred.LocalJobRunner:
19/01/18 04:38:52 INFO mapreduce.Job: Job job_local1291551154_0001 running in uber mode : false
19/01/18 04:38:52 INFO mapreduce.Job:  map 0% reduce 0%
19/01/18 04:38:52 INFO mapred.Task: Task:attempt_local1291551154_0001_m_000000_0 is done. And is in the process of committing
19/01/18 04:38:52 INFO mapred.LocalJobRunner:
19/01/18 04:38:52 INFO mapred.Task: Task attempt_local1291551154_0001_m_000000_0 is allowed to commit now
19/01/18 04:38:52 INFO output.FileOutputCommitter: Saved output of task 'attempt_local1291551154_0001_m_000000_0' to hdfs://localhost:9000/user/ierg5730-gcp-key/terasort/input/_temporary/0/task_local1291551154_0001_m_000000
19/01/18 04:38:52 INFO mapred.LocalJobRunner: map
19/01/18 04:38:52 INFO mapred.Task: Task 'attempt_local1291551154_0001_m_000000_0' done.
19/01/18 04:38:52 INFO mapred.LocalJobRunner: Finishing task: attempt_local1291551154_0001_m_000000_0
19/01/18 04:38:52 INFO mapred.LocalJobRunner: map task executor complete.
19/01/18 04:38:53 INFO mapreduce.Job:  map 100% reduce 0%
19/01/18 04:38:53 INFO mapreduce.Job: Job job_local1291551154_0001 completed successfully
19/01/18 04:38:53 INFO mapreduce.Job: Counters: 21
        File System Counters
                FILE: Number of bytes read=303449
                FILE: Number of bytes written=795355
                FILE: Number of read operations=0
                FILE: Number of large read operations=0
                FILE: Number of write operations=0
                HDFS: Number of bytes read=0
                HDFS: Number of bytes written=10000000
                HDFS: Number of read operations=3
                HDFS: Number of large read operations=0
                HDFS: Number of write operations=3
        Map-Reduce Framework
                Map input records=100000
                Map output records=100000
                Input split bytes=82
                Spilled Records=0
                Failed Shuffles=0
                Merged Map outputs=0
                GC time elapsed (ms)=15
                Total committed heap usage (bytes)=226492416
        org.apache.hadoop.examples.terasort.TeraGen$Counters
                CHECKSUM=214574985129000
        File Input Format Counters
                Bytes Read=0
        File Output Format Counters
                Bytes Written=10000000
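
As a quick sanity check (not part of the submitted output), the generated file can be listed; 100,000 rows of 100 bytes each should total exactly 10,000,000 bytes, matching the HDFS bytes-written counter above:

./bin/hdfs dfs -ls terasort/input    # expect part-m-00000 with size 10000000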

Below are the command and output of the terasort job; they can also be found under 1155114915-HW0/a/ii/terasort.txt:

ierg5730-gcp-key@vm-master:~/hadoop$ ./bin/hadoop jar ./share/hadoop/mapreduce/hadoop-mapreduce-examples-2.9.2.jar terasort terasort/input terasort/output
19/01/18 05:19:00 INFO terasort.TeraSort: starting
19/01/18 05:19:02 INFO input.FileInputFormat: Total input files to process : 1
Spent 129ms computing base-splits.
Spent 2ms computing TeraScheduler splits.
Computing input splits took 132ms
Sampling 1 splits of 1
Making 1 from 100000 sampled records
Computing parititions took 868ms
Spent 1004ms computing partitions.
19/01/18 05:19:03 INFO Configuration.deprecation: session.id is deprecated. Instead, use dfs.metrics.session-id
19/01/18 05:19:03 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
19/01/18 05:19:03 INFO mapreduce.JobSubmitter: number of splits:1
19/01/18 05:19:03 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local1534826650_0001
19/01/18 05:19:03 INFO mapred.LocalDistributedCacheManager: Creating symlink: /tmp/hadoop-ierg5730-gcp-key/mapred/local/1547788743731/_partition.lst <- /home/ierg5730-gcp-key/hadoop/_partition.lst
19/01/18 05:19:03 INFO mapred.LocalDistributedCacheManager: Localized hdfs://localhost:9000/user/ierg5730-gcp-key/terasort/output/_partition.lst as file:/tmp/hadoop-ierg5730-gcp-key/mapred/local/1547788743731/_partition.lst
19/01/18 05:19:03 INFO mapreduce.Job: The url to track the job: http://localhost:8080/
19/01/18 05:19:03 INFO mapreduce.Job: Running job: job_local1534826650_0001
19/01/18 05:19:03 INFO mapred.LocalJobRunner: OutputCommitter set in config null
19/01/18 05:19:03 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 1
19/01/18 05:19:03 INFO output.FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
19/01/18 05:19:03 INFO mapred.LocalJobRunner: OutputCommitter is org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
19/01/18 05:19:04 INFO mapred.LocalJobRunner: Waiting for map tasks
19/01/18 05:19:04 INFO mapred.LocalJobRunner: Starting task: attempt_local1534826650_0001_m_000000_0
19/01/18 05:19:04 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 1
19/01/18 05:19:04 INFO output.FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
19/01/18 05:19:04 INFO mapred.Task:  Using ResourceCalculatorProcessTree : [ ]
19/01/18 05:19:04 INFO mapred.MapTask: Processing split: hdfs://localhost:9000/user/ierg5730-gcp-key/terasort/input/part-m-00000:0+10000000
19/01/18 05:19:04 INFO mapred.MapTask: (EQUATOR) 0 kvi 26214396(104857584)
19/01/18 05:19:04 INFO mapred.MapTask: mapreduce.task.io.sort.mb: 100
19/01/18 05:19:04 INFO mapred.MapTask: soft limit at 83886080
19/01/18 05:19:04 INFO mapred.MapTask: bufstart = 0; bufvoid = 104857600
19/01/18 05:19:04 INFO mapred.MapTask: kvstart = 26214396; length = 6553600
19/01/18 05:19:04 INFO mapred.MapTask: Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer
19/01/18 05:19:04 INFO mapred.LocalJobRunner:
19/01/18 05:19:04 INFO mapred.MapTask: Starting flush of map output
19/01/18 05:19:04 INFO mapred.MapTask: Spilling map output
19/01/18 05:19:04 INFO mapred.MapTask: bufstart = 0; bufend = 10200000; bufvoid = 104857600
19/01/18 05:19:04 INFO mapred.MapTask: kvstart = 26214396(104857584); kvend = 25814400(103257600); length = 399997/6553600
19/01/18 05:19:04 INFO mapreduce.Job: Job job_local1534826650_0001 running in uber mode : false
19/01/18 05:19:04 INFO mapreduce.Job:  map 0% reduce 0%
19/01/18 05:19:05 INFO mapred.MapTask: Finished spill 0
19/01/18 05:19:05 INFO mapred.Task: Task:attempt_local1534826650_0001_m_000000_0 is done. And is in the process of committing
19/01/18 05:19:05 INFO mapred.LocalJobRunner: map
19/01/18 05:19:05 INFO mapred.Task: Task 'attempt_local1534826650_0001_m_000000_0' done.
19/01/18 05:19:05 INFO mapred.LocalJobRunner: Finishing task: attempt_local1534826650_0001_m_000000_0
19/01/18 05:19:05 INFO mapred.LocalJobRunner: map task executor complete.
19/01/18 05:19:05 INFO mapred.LocalJobRunner: Waiting for reduce tasks
19/01/18 05:19:05 INFO mapred.LocalJobRunner: Starting task: attempt_local1534826650_0001_r_000000_0
19/01/18 05:19:05 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 1
19/01/18 05:19:05 INFO output.FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
19/01/18 05:19:05 INFO mapred.Task:  Using ResourceCalculatorProcessTree : [ ]
19/01/18 05:19:05 INFO mapred.ReduceTask: Using ShuffleConsumerPlugin: org.apache.hadoop.mapreduce.task.reduce.Shuffle@2a2a9206
19/01/18 05:19:05 INFO reduce.MergeManagerImpl: MergerManager: memoryLimit=333971456, maxSingleShuffleLimit=83492864, mergeThreshold=220421168, ioSortFactor=10, memToMemMergeOutputsThreshold=10
19/01/18 05:19:05 INFO reduce.EventFetcher: attempt_local1534826650_0001_r_000000_0 Thread started: EventFetcher for fetching Map Completion Events
19/01/18 05:19:05 INFO reduce.LocalFetcher: localfetcher#1 about to shuffle output of map attempt_local1534826650_0001_m_000000_0 decomp: 10400002 len: 10400006 to MEMORY
19/01/18 05:19:05 INFO reduce.InMemoryMapOutput: Read 10400002 bytes from map-output for attempt_local1534826650_0001_m_000000_0
19/01/18 05:19:05 INFO reduce.MergeManagerImpl: closeInMemoryFile -> map-output of size: 10400002, inMemoryMapOutputs.size() -> 1, commitMemory -> 0, usedMemory ->10400002
19/01/18 05:19:05 INFO reduce.EventFetcher: EventFetcher is interrupted.. Returning
19/01/18 05:19:05 INFO mapred.LocalJobRunner: 1 / 1 copied.
19/01/18 05:19:05 INFO reduce.MergeManagerImpl: finalMerge called with 1 in-memory map-outputs and 0 on-disk map-outputs
19/01/18 05:19:05 INFO mapred.Merger: Merging 1 sorted segments
19/01/18 05:19:05 INFO mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 10399989 bytes
19/01/18 05:19:05 INFO reduce.MergeManagerImpl: Merged 1 segments, 10400002 bytes to disk to satisfy reduce memory limit
19/01/18 05:19:05 INFO reduce.MergeManagerImpl: Merging 1 files, 10400006 bytes from disk
19/01/18 05:19:05 INFO reduce.MergeManagerImpl: Merging 0 segments, 0 bytes from memory into reduce
19/01/18 05:19:05 INFO mapred.Merger: Merging 1 sorted segments
19/01/18 05:19:05 INFO mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 10399989 bytes
19/01/18 05:19:05 INFO mapred.LocalJobRunner: 1 / 1 copied.
19/01/18 05:19:05 INFO Configuration.deprecation: mapred.skip.on is deprecated. Instead, use mapreduce.job.skiprecords
19/01/18 05:19:05 INFO mapreduce.Job:  map 100% reduce 0%
19/01/18 05:19:06 INFO mapred.Task: Task:attempt_local1534826650_0001_r_000000_0 is done. And is in the process of committing
19/01/18 05:19:06 INFO mapred.LocalJobRunner: 1 / 1 copied.
19/01/18 05:19:06 INFO mapred.Task: Task attempt_local1534826650_0001_r_000000_0 is allowed to commit now
19/01/18 05:19:06 INFO output.FileOutputCommitter: Saved output of task 'attempt_local1534826650_0001_r_000000_0' to hdfs://localhost:9000/user/ierg5730-gcp-key/terasort/output/_temporary/0/task_local1534826650_0001_r_000000
19/01/18 05:19:06 INFO mapred.LocalJobRunner: reduce > reduce
19/01/18 05:19:06 INFO mapred.Task: Task 'attempt_local1534826650_0001_r_000000_0' done.
19/01/18 05:19:06 INFO mapred.LocalJobRunner: Finishing task: attempt_local1534826650_0001_r_000000_0
19/01/18 05:19:06 INFO mapred.LocalJobRunner: reduce task executor complete.
19/01/18 05:19:06 INFO mapreduce.Job:  map 100% reduce 100%
19/01/18 05:19:06 INFO mapreduce.Job: Job job_local1534826650_0001 completed successfully
19/01/18 05:19:07 INFO mapreduce.Job: Counters: 35
        File System Counters
                FILE: Number of bytes read=21407160
                FILE: Number of bytes written=32796946
                FILE: Number of read operations=0
                FILE: Number of large read operations=0
                FILE: Number of write operations=0
                HDFS: Number of bytes read=40000000
                HDFS: Number of bytes written=10000000
                HDFS: Number of read operations=45
                HDFS: Number of large read operations=0
                HDFS: Number of write operations=6
        Map-Reduce Framework
                Map input records=100000
                Map output records=100000
                Map output bytes=10200000
                Map output materialized bytes=10400006
                Input split bytes=136
                Combine input records=0
                Combine output records=0
                Reduce input groups=100000
                Reduce shuffle bytes=10400006
                Reduce input records=100000
                Reduce output records=100000
                Spilled Records=200000
                Shuffled Maps =1
                Failed Shuffles=0
                Merged Map outputs=1
                GC time elapsed (ms)=20
                Total committed heap usage (bytes)=663748608
        Shuffle Errors
                BAD_ID=0
                CONNECTION=0
                IO_ERROR=0
                WRONG_LENGTH=0
                WRONG_MAP=0
                WRONG_REDUCE=0
        File Input Format Counters
                Bytes Read=10000000
        File Output Format Counters
                Bytes Written=10000000
19/01/18 05:19:07 INFO terasort.TeraSort: done

Below are the command and output of the teravalidate job; they can also be found under 1155114915-HW0/a/ii/teravalidate.txt:

ierg5730-gcp-key@vm-master:~/hadoop$ ./bin/hadoop jar ./share/hadoop/mapreduce/hadoop-mapreduce-examples-2.9.2.jar teravalidate terasort/output terasort/check
19/01/18 05:22:11 INFO Configuration.deprecation: session.id is deprecated. Instead, use dfs.metrics.session-id
19/01/18 05:22:11 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
19/01/18 05:22:11 INFO input.FileInputFormat: Total input files to process : 1
Spent 50ms computing base-splits.
Spent 4ms computing TeraScheduler splits.
19/01/18 05:22:11 INFO mapreduce.JobSubmitter: number of splits:1
19/01/18 05:22:11 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local13918116_0001
19/01/18 05:22:12 INFO mapreduce.Job: The url to track the job: http://localhost:8080/
19/01/18 05:22:12 INFO mapreduce.Job: Running job: job_local13918116_0001
19/01/18 05:22:12 INFO mapred.LocalJobRunner: OutputCommitter set in config null
19/01/18 05:22:12 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 1
19/01/18 05:22:12 INFO output.FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
19/01/18 05:22:12 INFO mapred.LocalJobRunner: OutputCommitter is org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
19/01/18 05:22:12 INFO mapred.LocalJobRunner: Waiting for map tasks
19/01/18 05:22:12 INFO mapred.LocalJobRunner: Starting task: attempt_local13918116_0001_m_000000_0
19/01/18 05:22:12 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 1
19/01/18 05:22:12 INFO output.FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
19/01/18 05:22:12 INFO mapred.Task:  Using ResourceCalculatorProcessTree : [ ]
19/01/18 05:22:12 INFO mapred.MapTask: Processing split: hdfs://localhost:9000/user/ierg5730-gcp-key/terasort/output/part-r-00000:0+10000000
19/01/18 05:22:12 INFO mapred.MapTask: (EQUATOR) 0 kvi 26214396(104857584)
19/01/18 05:22:12 INFO mapred.MapTask: mapreduce.task.io.sort.mb: 100
19/01/18 05:22:12 INFO mapred.MapTask: soft limit at 83886080
19/01/18 05:22:12 INFO mapred.MapTask: bufstart = 0; bufvoid = 104857600
19/01/18 05:22:12 INFO mapred.MapTask: kvstart = 26214396; length = 6553600
19/01/18 05:22:12 INFO mapred.MapTask: Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer
19/01/18 05:22:13 INFO mapreduce.Job: Job job_local13918116_0001 running in uber mode : false
19/01/18 05:22:13 INFO mapreduce.Job:  map 0% reduce 0%
19/01/18 05:22:13 INFO mapred.LocalJobRunner:
19/01/18 05:22:13 INFO mapred.MapTask: Starting flush of map output
19/01/18 05:22:13 INFO mapred.MapTask: Spilling map output
19/01/18 05:22:13 INFO mapred.MapTask: bufstart = 0; bufend = 80; bufvoid = 104857600
19/01/18 05:22:13 INFO mapred.MapTask: kvstart = 26214396(104857584); kvend = 26214388(104857552); length = 9/6553600
19/01/18 05:22:13 INFO mapred.MapTask: Finished spill 0
19/01/18 05:22:13 INFO mapred.Task: Task:attempt_local13918116_0001_m_000000_0 is done. And is in the process of committing
19/01/18 05:22:13 INFO mapred.LocalJobRunner: map
19/01/18 05:22:13 INFO mapred.Task: Task 'attempt_local13918116_0001_m_000000_0' done.
19/01/18 05:22:13 INFO mapred.LocalJobRunner: Finishing task: attempt_local13918116_0001_m_000000_0
19/01/18 05:22:13 INFO mapred.LocalJobRunner: map task executor complete.
19/01/18 05:22:13 INFO mapred.LocalJobRunner: Waiting for reduce tasks
19/01/18 05:22:13 INFO mapred.LocalJobRunner: Starting task: attempt_local13918116_0001_r_000000_0
19/01/18 05:22:13 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 1
19/01/18 05:22:13 INFO output.FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
19/01/18 05:22:13 INFO mapred.Task:  Using ResourceCalculatorProcessTree : [ ]
19/01/18 05:22:13 INFO mapred.ReduceTask: Using ShuffleConsumerPlugin: org.apache.hadoop.mapreduce.task.reduce.Shuffle@2f3c02d4
19/01/18 05:22:13 INFO reduce.MergeManagerImpl: MergerManager: memoryLimit=333971456, maxSingleShuffleLimit=83492864, mergeThreshold=220421168, ioSortFactor=10, memToMemMergeOutputsThreshold=10
19/01/18 05:22:13 INFO reduce.EventFetcher: attempt_local13918116_0001_r_000000_0 Thread started: EventFetcher for fetching Map Completion Events
19/01/18 05:22:13 INFO reduce.LocalFetcher: localfetcher#1 about to shuffle output of map attempt_local13918116_0001_m_000000_0 decomp: 88 len: 92 to MEMORY
19/01/18 05:22:13 INFO reduce.InMemoryMapOutput: Read 88 bytes from map-output for attempt_local13918116_0001_m_000000_0
19/01/18 05:22:13 INFO reduce.MergeManagerImpl: closeInMemoryFile -> map-output of size: 88, inMemoryMapOutputs.size() -> 1, commitMemory -> 0, usedMemory ->88
19/01/18 05:22:13 INFO reduce.EventFetcher: EventFetcher is interrupted.. Returning
19/01/18 05:22:13 INFO mapred.LocalJobRunner: 1 / 1 copied.
19/01/18 05:22:13 INFO reduce.MergeManagerImpl: finalMerge called with 1 in-memory map-outputs and 0 on-disk map-outputs
19/01/18 05:22:13 INFO mapred.Merger: Merging 1 sorted segments
19/01/18 05:22:13 INFO mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 77 bytes
19/01/18 05:22:13 INFO reduce.MergeManagerImpl: Merged 1 segments, 88 bytes to disk to satisfy reduce memory limit
19/01/18 05:22:13 INFO reduce.MergeManagerImpl: Merging 1 files, 92 bytes from disk
19/01/18 05:22:13 INFO reduce.MergeManagerImpl: Merging 0 segments, 0 bytes from memory into reduce
19/01/18 05:22:13 INFO mapred.Merger: Merging 1 sorted segments
19/01/18 05:22:13 INFO mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 77 bytes
19/01/18 05:22:13 INFO mapred.LocalJobRunner: 1 / 1 copied.
19/01/18 05:22:13 INFO Configuration.deprecation: mapred.skip.on is deprecated. Instead, use mapreduce.job.skiprecords
19/01/18 05:22:13 INFO mapred.Task: Task:attempt_local13918116_0001_r_000000_0 is done. And is in the process of committing
19/01/18 05:22:13 INFO mapred.LocalJobRunner: 1 / 1 copied.
19/01/18 05:22:13 INFO mapred.Task: Task attempt_local13918116_0001_r_000000_0 is allowed to commit now
19/01/18 05:22:13 INFO output.FileOutputCommitter: Saved output of task 'attempt_local13918116_0001_r_000000_0' to hdfs://localhost:9000/user/ierg5730-gcp-key/terasort/check/_temporary/0/task_local13918116_0001_r_000000
19/01/18 05:22:13 INFO mapred.LocalJobRunner: reduce > reduce
19/01/18 05:22:13 INFO mapred.Task: Task 'attempt_local13918116_0001_r_000000_0' done.
19/01/18 05:22:13 INFO mapred.LocalJobRunner: Finishing task: attempt_local13918116_0001_r_000000_0
19/01/18 05:22:13 INFO mapred.LocalJobRunner: reduce task executor complete.
19/01/18 05:22:14 INFO mapreduce.Job:  map 100% reduce 100%
19/01/18 05:22:14 INFO mapreduce.Job: Job job_local13918116_0001 completed successfully
19/01/18 05:22:14 INFO mapreduce.Job: Counters: 35
        File System Counters
                FILE: Number of bytes read=607334
                FILE: Number of bytes written=1584306
                FILE: Number of read operations=0
                FILE: Number of large read operations=0
                FILE: Number of write operations=0
                HDFS: Number of bytes read=20000000
                HDFS: Number of bytes written=22
                HDFS: Number of read operations=13
                HDFS: Number of large read operations=0
                HDFS: Number of write operations=4
        Map-Reduce Framework
                Map input records=100000
                Map output records=3
                Map output bytes=80
                Map output materialized bytes=92
                Input split bytes=137
                Combine input records=0
                Combine output records=0
                Reduce input groups=3
                Reduce shuffle bytes=92
                Reduce input records=3
                Reduce output records=1
                Spilled Records=6
                Shuffled Maps =1
                Failed Shuffles=0
                Merged Map outputs=1
                GC time elapsed (ms)=230
                Total committed heap usage (bytes)=894435328
        Shuffle Errors
                BAD_ID=0
                CONNECTION=0
                IO_ERROR=0
                WRONG_LENGTH=0
                WRONG_MAP=0
                WRONG_REDUCE=0
        File Input Format Counters
                Bytes Read=10000000
        File Output Format Counters
                Bytes Written=22
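
On success, TeraValidate's output contains only a checksum record, which is consistent with the 22 bytes written above; a hypothetical way to read the verdict:

./bin/hdfs dfs -cat terasort/check/part-r-00000    # one checksum line; error records here would indicate unsorted data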

This is the end of Part a.

Part b

i: Four Node Cluster Setup

The 50070 page of the name-node vm-master can be found under 1155114915-HW0/b/i.
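
Compared with Part a, the main additions are pointing fs.defaultFS at vm-master and listing the worker hostnames in etc/hadoop/slaves so that start-dfs.sh and start-yarn.sh can reach them over SSH. A sketch with hypothetical worker names (the actual instance names may differ):

cat > etc/hadoop/slaves <<'EOF'
vm-worker1
vm-worker2
vm-worker3
EOF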

ii: 2G & 20G Tera Data Gen & Sort

Since the 20G tera dataset is very large, I tried to run the sort process 3 times (it crashed in the middle each time), and this single process burned through 1000 HKD of my Google Cloud Platform free trial credit (I only have 1000 HKD left in my account now).
So, I decided not to run the sort process on the 20G tera data.
Instead, I did the bonus (20 marks): the Java Word Count job (see Part d).

Since one line of tera data is 100 bytes, in order to generate 2G and 20G of data we need to generate (rounding up to a whole number of lines):

2G: 2*1024*1024*1024/100 = 21,474,836.48 -> 21,474,837 lines
20G: 20*1024*1024*1024/100 = 214,748,364.8 -> 214,748,365 lines
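
The same ceiling division can be reproduced with shell integer arithmetic (rounding up so that at least the target size is generated):

echo $(( (2  * 1024**3 + 99) / 100 ))    # 21474837
echo $(( (20 * 1024**3 + 99) / 100 ))    # 214748365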

Below are the command and output of the teragen job for the 2G data; they can also be found under 1155114915-HW0/b/ii:

ierg5730-gcp-key@vm-master:~/hadoop$ hadoop jar ./share/hadoop/mapreduce/hadoop-mapreduce-examples-2.9.2.jar teragen 21474837 terasort/2G-input
19/01/20 01:09:49 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
19/01/20 01:09:51 INFO terasort.TeraGen: Generating 21474837 using 2
19/01/20 01:09:51 INFO mapreduce.JobSubmitter: number of splits:2
19/01/20 01:09:51 INFO Configuration.deprecation: yarn.resourcemanager.system-metrics-publisher.enabled is deprecated. Instead, use yarn.system-metrics-publisher.enabled
19/01/20 01:09:52 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1547946384889_0001
19/01/20 01:09:52 INFO impl.YarnClientImpl: Submitted application application_1547946384889_0001
19/01/20 01:09:52 INFO mapreduce.Job: The url to track the job: http://vm-master:8088/proxy/application_1547946384889_0001/
19/01/20 01:09:52 INFO mapreduce.Job: Running job: job_1547946384889_0001
19/01/20 01:10:01 INFO mapreduce.Job: Job job_1547946384889_0001 running in uber mode : false
19/01/20 01:10:01 INFO mapreduce.Job:  map 0% reduce 0%
19/01/20 01:10:21 INFO mapreduce.Job:  map 17% reduce 0%
19/01/20 01:10:22 INFO mapreduce.Job:  map 34% reduce 0%
19/01/20 01:10:28 INFO mapreduce.Job:  map 39% reduce 0%
19/01/20 01:10:29 INFO mapreduce.Job:  map 44% reduce 0%
19/01/20 01:10:34 INFO mapreduce.Job:  map 52% reduce 0%
19/01/20 01:10:35 INFO mapreduce.Job:  map 65% reduce 0%
19/01/20 01:10:40 INFO mapreduce.Job:  map 73% reduce 0%
19/01/20 01:10:41 INFO mapreduce.Job:  map 83% reduce 0%
19/01/20 01:10:43 INFO mapreduce.Job:  map 88% reduce 0%
19/01/20 01:10:45 INFO mapreduce.Job:  map 100% reduce 0%
19/01/20 01:10:45 INFO mapreduce.Job: Job job_1547946384889_0001 completed successfully
19/01/20 01:10:45 INFO mapreduce.Job: Counters: 31
        File System Counters
                FILE: Number of bytes read=0
                FILE: Number of bytes written=396850
                FILE: Number of read operations=0
                FILE: Number of large read operations=0
                FILE: Number of write operations=0
                HDFS: Number of bytes read=167
                HDFS: Number of bytes written=2147483700
                HDFS: Number of read operations=8
                HDFS: Number of large read operations=0
                HDFS: Number of write operations=4
        Job Counters
                Launched map tasks=2
                Other local map tasks=2
                Total time spent by all maps in occupied slots (ms)=79581
                Total time spent by all reduces in occupied slots (ms)=0
                Total time spent by all map tasks (ms)=79581
                Total vcore-milliseconds taken by all map tasks=79581
                Total megabyte-milliseconds taken by all map tasks=81490944
        Map-Reduce Framework
                Map input records=21474837
                Map output records=21474837
                Input split bytes=167
                Spilled Records=0
                Failed Shuffles=0
                Merged Map outputs=0
                GC time elapsed (ms)=1015
                CPU time spent (ms)=42000
                Physical memory (bytes) snapshot=400113664
                Virtual memory (bytes) snapshot=1678053376
                Total committed heap usage (bytes)=304611328
        org.apache.hadoop.examples.terasort.TeraGen$Counters
                CHECKSUM=46124755272701764
        File Input Format Counters
                Bytes Read=0
        File Output Format Counters
                Bytes Written=2147483700
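
Note that HDFS bytes written = 2147483700 is exactly 21474837 rows x 100 bytes, slightly over 2 GiB because the row count was rounded up. A hypothetical size check:

hadoop fs -du -s -h terasort/2G-input    # expect roughly 2.0 G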

Below are the command and output of the terasort job for the 2G data; they can also be found under 1155114915-HW0/b/ii. The key timing counters from this run are:

Total time spent by all map tasks (ms)=1520230
Total time spent by all reduce tasks (ms)=415210

ierg5730-gcp-key@vm-master:~/hadoop$ hadoop jar ./share/hadoop/mapreduce/hadoop-mapreduce-examples-2.9.2.jar terasort terasort/2G-input terasort/2G-output
19/01/20 01:11:53 INFO terasort.TeraSort: starting
19/01/20 01:11:54 INFO input.FileInputFormat: Total input files to process : 2
Spent 134ms computing base-splits.
Spent 3ms computing TeraScheduler splits.
Computing input splits took 138ms
Sampling 10 splits of 16
Making 1 from 100000 sampled records
Computing parititions took 1119ms
Spent 1261ms computing partitions.
19/01/20 01:11:55 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
19/01/20 01:11:56 INFO mapreduce.JobSubmitter: number of splits:16
19/01/20 01:11:56 INFO Configuration.deprecation: yarn.resourcemanager.system-metrics-publisher.enabled is deprecated. Instead, use yarn.system-metrics-publisher.enabled
19/01/20 01:11:56 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1547946384889_0004
19/01/20 01:11:56 INFO impl.YarnClientImpl: Submitted application application_1547946384889_0004
19/01/20 01:11:56 INFO mapreduce.Job: The url to track the job: http://vm-master:8088/proxy/application_1547946384889_0004/
19/01/20 01:11:57 INFO mapreduce.Job: Running job: job_1547946384889_0004
19/01/20 01:12:19 INFO mapreduce.Job: Job job_1547946384889_0004 running in uber mode : false
19/01/20 01:12:19 INFO mapreduce.Job:  map 0% reduce 0%
19/01/20 01:12:50 INFO mapreduce.Job:  map 8% reduce 0%
19/01/20 01:12:51 INFO mapreduce.Job:  map 11% reduce 0%
19/01/20 01:12:53 INFO mapreduce.Job:  map 14% reduce 0%
19/01/20 01:12:54 INFO mapreduce.Job:  map 17% reduce 0%
19/01/20 01:13:02 INFO mapreduce.Job:  map 18% reduce 0%
19/01/20 01:13:03 INFO mapreduce.Job:  map 19% reduce 0%
19/01/20 01:13:04 INFO mapreduce.Job:  map 21% reduce 0%
19/01/20 01:13:05 INFO mapreduce.Job:  map 22% reduce 0%
19/01/20 01:13:07 INFO mapreduce.Job:  map 24% reduce 0%
19/01/20 01:13:08 INFO mapreduce.Job:  map 25% reduce 0%
19/01/20 01:13:14 INFO mapreduce.Job:  map 26% reduce 0%
19/01/20 01:13:15 INFO mapreduce.Job:  map 27% reduce 0%
19/01/20 01:13:17 INFO mapreduce.Job:  map 28% reduce 0%
19/01/20 01:13:19 INFO mapreduce.Job:  map 29% reduce 0%
19/01/20 01:13:20 INFO mapreduce.Job:  map 30% reduce 0%
19/01/20 01:13:21 INFO mapreduce.Job:  map 32% reduce 0%
19/01/20 01:13:22 INFO mapreduce.Job:  map 33% reduce 0%
19/01/20 01:13:23 INFO mapreduce.Job:  map 34% reduce 0%
19/01/20 01:13:27 INFO mapreduce.Job:  map 36% reduce 0%
19/01/20 01:13:28 INFO mapreduce.Job:  map 37% reduce 0%
19/01/20 01:13:31 INFO mapreduce.Job:  map 38% reduce 0%
19/01/20 01:14:17 INFO mapreduce.Job:  map 39% reduce 6%
19/01/20 01:14:19 INFO mapreduce.Job:  map 40% reduce 6%
19/01/20 01:14:21 INFO mapreduce.Job:  map 42% reduce 6%
19/01/20 01:14:23 INFO mapreduce.Job:  map 42% reduce 10%
19/01/20 01:14:24 INFO mapreduce.Job:  map 43% reduce 10%
19/01/20 01:14:26 INFO mapreduce.Job:  map 46% reduce 10%
19/01/20 01:14:28 INFO mapreduce.Job:  map 49% reduce 10%
19/01/20 01:14:29 INFO mapreduce.Job:  map 49% reduce 13%
19/01/20 01:14:32 INFO mapreduce.Job:  map 50% reduce 13%
19/01/20 01:14:34 INFO mapreduce.Job:  map 51% reduce 13%
19/01/20 01:14:49 INFO mapreduce.Job:  map 52% reduce 13%
19/01/20 01:14:51 INFO mapreduce.Job:  map 55% reduce 13%
19/01/20 01:14:53 INFO mapreduce.Job:  map 58% reduce 13%
19/01/20 01:15:10 INFO mapreduce.Job:  map 59% reduce 13%
19/01/20 01:15:14 INFO mapreduce.Job:  map 60% reduce 13%
19/01/20 01:15:16 INFO mapreduce.Job:  map 62% reduce 13%
19/01/20 01:15:17 INFO mapreduce.Job:  map 63% reduce 13%
19/01/20 01:15:20 INFO mapreduce.Job:  map 64% reduce 13%
19/01/20 01:15:22 INFO mapreduce.Job:  map 66% reduce 13%
19/01/20 01:15:24 INFO mapreduce.Job:  map 68% reduce 13%
19/01/20 01:15:27 INFO mapreduce.Job:  map 69% reduce 13%
19/01/20 01:15:32 INFO mapreduce.Job:  map 69% reduce 17%
19/01/20 01:15:38 INFO mapreduce.Job:  map 69% reduce 23%
19/01/20 01:16:13 INFO mapreduce.Job:  map 71% reduce 23%
19/01/20 01:16:14 INFO mapreduce.Job:  map 75% reduce 23%
19/01/20 01:16:15 INFO mapreduce.Job:  map 76% reduce 23%
19/01/20 01:16:20 INFO mapreduce.Job:  map 78% reduce 23%
19/01/20 01:16:21 INFO mapreduce.Job:  map 81% reduce 23%
19/01/20 01:16:22 INFO mapreduce.Job:  map 82% reduce 23%
19/01/20 01:16:28 INFO mapreduce.Job:  map 83% reduce 23%
19/01/20 01:16:39 INFO mapreduce.Job:  map 84% reduce 23%
19/01/20 01:16:45 INFO mapreduce.Job:  map 88% reduce 23%
19/01/20 01:16:48 INFO mapreduce.Job:  map 89% reduce 23%
19/01/20 01:16:53 INFO mapreduce.Job:  map 90% reduce 23%
19/01/20 01:17:04 INFO mapreduce.Job:  map 92% reduce 23%
19/01/20 01:17:06 INFO mapreduce.Job:  map 93% reduce 23%
19/01/20 01:17:10 INFO mapreduce.Job:  map 94% reduce 23%
19/01/20 01:17:11 INFO mapreduce.Job:  map 96% reduce 23%
19/01/20 01:17:12 INFO mapreduce.Job:  map 97% reduce 23%
19/01/20 01:17:14 INFO mapreduce.Job:  map 98% reduce 23%
19/01/20 01:17:16 INFO mapreduce.Job:  map 99% reduce 23%
19/01/20 01:17:19 INFO mapreduce.Job:  map 100% reduce 23%
19/01/20 01:17:21 INFO mapreduce.Job:  map 100% reduce 31%
19/01/20 01:17:28 INFO mapreduce.Job:  map 100% reduce 36%
19/01/20 01:17:34 INFO mapreduce.Job:  map 100% reduce 44%
19/01/20 01:17:40 INFO mapreduce.Job:  map 100% reduce 49%
19/01/20 01:17:46 INFO mapreduce.Job:  map 100% reduce 52%
19/01/20 01:17:52 INFO mapreduce.Job:  map 100% reduce 56%
19/01/20 01:17:58 INFO mapreduce.Job:  map 100% reduce 59%
19/01/20 01:18:04 INFO mapreduce.Job:  map 100% reduce 64%
19/01/20 01:18:10 INFO mapreduce.Job:  map 100% reduce 67%
19/01/20 01:18:22 INFO mapreduce.Job:  map 100% reduce 68%
19/01/20 01:18:28 INFO mapreduce.Job:  map 100% reduce 69%
19/01/20 01:18:34 INFO mapreduce.Job:  map 100% reduce 71%
19/01/20 01:18:40 INFO mapreduce.Job:  map 100% reduce 72%
19/01/20 01:18:46 INFO mapreduce.Job:  map 100% reduce 73%
19/01/20 01:18:53 INFO mapreduce.Job:  map 100% reduce 75%
19/01/20 01:18:59 INFO mapreduce.Job:  map 100% reduce 76%
19/01/20 01:19:05 INFO mapreduce.Job:  map 100% reduce 78%
19/01/20 01:19:11 INFO mapreduce.Job:  map 100% reduce 80%
19/01/20 01:19:17 INFO mapreduce.Job:  map 100% reduce 81%
19/01/20 01:19:23 INFO mapreduce.Job:  map 100% reduce 83%
19/01/20 01:19:29 INFO mapreduce.Job:  map 100% reduce 85%
19/01/20 01:19:35 INFO mapreduce.Job:  map 100% reduce 87%
19/01/20 01:19:41 INFO mapreduce.Job:  map 100% reduce 89%
19/01/20 01:19:47 INFO mapreduce.Job:  map 100% reduce 91%
19/01/20 01:19:53 INFO mapreduce.Job:  map 100% reduce 92%
19/01/20 01:19:59 INFO mapreduce.Job:  map 100% reduce 93%
19/01/20 01:20:05 INFO mapreduce.Job:  map 100% reduce 94%
19/01/20 01:20:11 INFO mapreduce.Job:  map 100% reduce 96%
19/01/20 01:20:17 INFO mapreduce.Job:  map 100% reduce 98%
19/01/20 01:20:23 INFO mapreduce.Job:  map 100% reduce 100%
19/01/20 01:20:24 INFO mapreduce.Job: Job job_1547946384889_0004 completed successfully
19/01/20 01:20:25 INFO mapreduce.Job: Counters: 51
        File System Counters
                FILE: Number of bytes read=5443871246
                FILE: Number of bytes written=7680652284
                FILE: Number of read operations=0
                FILE: Number of large read operations=0
                FILE: Number of write operations=0
                HDFS: Number of bytes read=2147485924
                HDFS: Number of bytes written=2147483700
                HDFS: Number of read operations=51
                HDFS: Number of large read operations=0
                HDFS: Number of write operations=2
        Job Counters
                Killed map tasks=1
                Launched map tasks=17
                Launched reduce tasks=1
                Data-local map tasks=4
                Rack-local map tasks=13
                Total time spent by all maps in occupied slots (ms)=1520230
                Total time spent by all reduces in occupied slots (ms)=415210
                Total time spent by all map tasks (ms)=1520230
                Total time spent by all reduce tasks (ms)=415210
                Total vcore-milliseconds taken by all map tasks=1520230
                Total vcore-milliseconds taken by all reduce tasks=415210
                Total megabyte-milliseconds taken by all map tasks=1556715520
                Total megabyte-milliseconds taken by all reduce tasks=425175040
        Map-Reduce Framework
                Map input records=21474837
                Map output records=21474837
                Map output bytes=2190433374
                Map output materialized bytes=2233383144
                Input split bytes=2224
                Combine input records=0
                Combine output records=0
                Reduce input groups=21474837
                Reduce shuffle bytes=2233383144
                Reduce input records=21474837
                Reduce output records=21474837
                Spilled Records=73819750
                Shuffled Maps =16
                Failed Shuffles=0
                Merged Map outputs=16
                GC time elapsed (ms)=36834
                CPU time spent (ms)=291070
                Physical memory (bytes) snapshot=5058732032
                Virtual memory (bytes) snapshot=14205886464
                Total committed heap usage (bytes)=3330801664
        Shuffle Errors
                BAD_ID=0
                CONNECTION=0
                IO_ERROR=0
                WRONG_LENGTH=0
                WRONG_MAP=0
                WRONG_REDUCE=0
        File Input Format Counters
                Bytes Read=2147483700
        File Output Format Counters
                Bytes Written=2147483700
19/01/20 01:20:25 INFO terasort.TeraSort: done
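
Although it was not part of this run, the sorted 2G output could be validated the same way as in Part a, for example:

hadoop jar ./share/hadoop/mapreduce/hadoop-mapreduce-examples-2.9.2.jar teravalidate terasort/2G-output terasort/2G-check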

Below are the command and output of the teragen job for the 20G data; they can also be found under 1155114915-HW0/b/ii:

ierg5730-gcp-key@vm-master:~/hadoop$ hadoop jar ./share/hadoop/mapreduce/hadoop-mapreduce-examples-2.9.2.jar teragen 214748365 terasort/20G-input
19/01/19 14:14:47 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
19/01/19 14:14:48 INFO terasort.TeraGen: Generating 214748365 using 2
19/01/19 14:14:49 INFO mapreduce.JobSubmitter: number of splits:2
19/01/19 14:14:49 INFO Configuration.deprecation: yarn.resourcemanager.system-metrics-publisher.enabled is deprecated. Instead, use yarn.system-metrics-publisher.enabled
19/01/19 14:14:49 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1547907254538_0001
19/01/19 14:14:50 INFO impl.YarnClientImpl: Submitted application application_1547907254538_0001
19/01/19 14:14:50 INFO mapreduce.Job: The url to track the job: http://vm-master:8088/proxy/application_1547907254538_0001/
19/01/19 14:14:50 INFO mapreduce.Job: Running job: job_1547907254538_0001
19/01/19 14:14:59 INFO mapreduce.Job: Job job_1547907254538_0001 running in uber mode : false
19/01/19 14:14:59 INFO mapreduce.Job:  map 0% reduce 0%
19/01/19 14:15:18 INFO mapreduce.Job:  map 2% reduce 0%
19/01/19 14:15:21 INFO mapreduce.Job:  map 3% reduce 0%
19/01/19 14:15:24 INFO mapreduce.Job:  map 4% reduce 0%
19/01/19 14:15:27 INFO mapreduce.Job:  map 5% reduce 0%
19/01/19 14:15:33 INFO mapreduce.Job:  map 6% reduce 0%
19/01/19 14:15:37 INFO mapreduce.Job:  map 7% reduce 0%
19/01/19 14:15:39 INFO mapreduce.Job:  map 8% reduce 0%
19/01/19 14:15:43 INFO mapreduce.Job:  map 9% reduce 0%
19/01/19 14:15:45 INFO mapreduce.Job:  map 10% reduce 0%
19/01/19 14:15:51 INFO mapreduce.Job:  map 11% reduce 0%
19/01/19 14:15:55 INFO mapreduce.Job:  map 13% reduce 0%
19/01/19 14:15:57 INFO mapreduce.Job:  map 14% reduce 0%
19/01/19 14:16:01 INFO mapreduce.Job:  map 15% reduce 0%
19/01/19 14:16:07 INFO mapreduce.Job:  map 16% reduce 0%
19/01/19 14:16:10 INFO mapreduce.Job:  map 17% reduce 0%
19/01/19 14:16:13 INFO mapreduce.Job:  map 19% reduce 0%
19/01/19 14:16:19 INFO mapreduce.Job:  map 20% reduce 0%
19/01/19 14:16:22 INFO mapreduce.Job:  map 21% reduce 0%
19/01/19 14:16:25 INFO mapreduce.Job:  map 22% reduce 0%
19/01/19 14:16:31 INFO mapreduce.Job:  map 23% reduce 0%
19/01/19 14:16:34 INFO mapreduce.Job:  map 24% reduce 0%
19/01/19 14:16:37 INFO mapreduce.Job:  map 25% reduce 0%
19/01/19 14:16:40 INFO mapreduce.Job:  map 26% reduce 0%
19/01/19 14:16:44 INFO mapreduce.Job:  map 27% reduce 0%
19/01/19 14:16:47 INFO mapreduce.Job:  map 28% reduce 0%
19/01/19 14:16:50 INFO mapreduce.Job:  map 29% reduce 0%
19/01/19 14:16:53 INFO mapreduce.Job:  map 30% reduce 0%
19/01/19 14:16:59 INFO mapreduce.Job:  map 31% reduce 0%
19/01/19 14:17:02 INFO mapreduce.Job:  map 32% reduce 0%
19/01/19 14:17:05 INFO mapreduce.Job:  map 33% reduce 0%
19/01/19 14:17:08 INFO mapreduce.Job:  map 34% reduce 0%
19/01/19 14:17:11 INFO mapreduce.Job:  map 35% reduce 0%
19/01/19 14:17:17 INFO mapreduce.Job:  map 36% reduce 0%
19/01/19 14:17:20 INFO mapreduce.Job:  map 37% reduce 0%
19/01/19 14:17:23 INFO mapreduce.Job:  map 38% reduce 0%
19/01/19 14:17:26 INFO mapreduce.Job:  map 39% reduce 0%
19/01/19 14:17:29 INFO mapreduce.Job:  map 40% reduce 0%
19/01/19 14:17:32 INFO mapreduce.Job:  map 41% reduce 0%
19/01/19 14:17:35 INFO mapreduce.Job:  map 42% reduce 0%
19/01/19 14:17:41 INFO mapreduce.Job:  map 43% reduce 0%
19/01/19 14:17:46 INFO mapreduce.Job:  map 44% reduce 0%
19/01/19 14:17:47 INFO mapreduce.Job:  map 45% reduce 0%
19/01/19 14:17:52 INFO mapreduce.Job:  map 47% reduce 0%
19/01/19 14:17:53 INFO mapreduce.Job:  map 48% reduce 0%
19/01/19 14:17:58 INFO mapreduce.Job:  map 49% reduce 0%
19/01/19 14:17:59 INFO mapreduce.Job:  map 50% reduce 0%
19/01/19 14:18:04 INFO mapreduce.Job:  map 51% reduce 0%
19/01/19 14:18:10 INFO mapreduce.Job:  map 52% reduce 0%
19/01/19 14:18:11 INFO mapreduce.Job:  map 53% reduce 0%
19/01/19 14:18:16 INFO mapreduce.Job:  map 54% reduce 0%
19/01/19 14:18:17 INFO mapreduce.Job:  map 55% reduce 0%
19/01/19 14:18:23 INFO mapreduce.Job:  map 57% reduce 0%
19/01/19 14:18:28 INFO mapreduce.Job:  map 58% reduce 0%
19/01/19 14:18:29 INFO mapreduce.Job:  map 59% reduce 0%
19/01/19 14:18:34 INFO mapreduce.Job:  map 60% reduce 0%
19/01/19 14:18:36 INFO mapreduce.Job:  map 61% reduce 0%
19/01/19 14:18:40 INFO mapreduce.Job:  map 62% reduce 0%
19/01/19 14:18:41 INFO mapreduce.Job:  map 63% reduce 0%
19/01/19 14:18:46 INFO mapreduce.Job:  map 64% reduce 0%
19/01/19 14:18:47 INFO mapreduce.Job:  map 65% reduce 0%
19/01/19 14:18:52 INFO mapreduce.Job:  map 66% reduce 0%
19/01/19 14:18:55 INFO mapreduce.Job:  map 68% reduce 0%
19/01/19 14:18:58 INFO mapreduce.Job:  map 69% reduce 0%
19/01/19 14:19:01 INFO mapreduce.Job:  map 70% reduce 0%
19/01/19 14:19:05 INFO mapreduce.Job:  map 71% reduce 0%
19/01/19 14:19:11 INFO mapreduce.Job:  map 72% reduce 0%
19/01/19 14:19:13 INFO mapreduce.Job:  map 73% reduce 0%
19/01/19 14:19:17 INFO mapreduce.Job:  map 74% reduce 0%
19/01/19 14:19:20 INFO mapreduce.Job:  map 75% reduce 0%
19/01/19 14:19:23 INFO mapreduce.Job:  map 76% reduce 0%
19/01/19 14:19:26 INFO mapreduce.Job:  map 77% reduce 0%
19/01/19 14:19:29 INFO mapreduce.Job:  map 78% reduce 0%
19/01/19 14:19:32 INFO mapreduce.Job:  map 79% reduce 0%
19/01/19 14:19:36 INFO mapreduce.Job:  map 80% reduce 0%
19/01/19 14:19:38 INFO mapreduce.Job:  map 81% reduce 0%
19/01/19 14:19:42 INFO mapreduce.Job:  map 82% reduce 0%
19/01/19 14:19:44 INFO mapreduce.Job:  map 83% reduce 0%
19/01/19 14:19:48 INFO mapreduce.Job:  map 84% reduce 0%
19/01/19 14:19:50 INFO mapreduce.Job:  map 86% reduce 0%
19/01/19 14:19:56 INFO mapreduce.Job:  map 87% reduce 0%
19/01/19 14:20:00 INFO mapreduce.Job:  map 88% reduce 0%
19/01/19 14:20:03 INFO mapreduce.Job:  map 89% reduce 0%
19/01/19 14:20:06 INFO mapreduce.Job:  map 90% reduce 0%
19/01/19 14:20:09 INFO mapreduce.Job:  map 91% reduce 0%
19/01/19 14:20:12 INFO mapreduce.Job:  map 92% reduce 0%
19/01/19 14:20:15 INFO mapreduce.Job:  map 93% reduce 0%
19/01/19 14:20:18 INFO mapreduce.Job:  map 94% reduce 0%
19/01/19 14:20:22 INFO mapreduce.Job:  map 95% reduce 0%
19/01/19 14:20:24 INFO mapreduce.Job:  map 96% reduce 0%
19/01/19 14:20:28 INFO mapreduce.Job:  map 97% reduce 0%
19/01/19 14:20:32 INFO mapreduce.Job:  map 98% reduce 0%
19/01/19 14:20:34 INFO mapreduce.Job:  map 99% reduce 0%
19/01/19 14:20:37 INFO mapreduce.Job:  map 100% reduce 0%
19/01/19 14:20:37 INFO mapreduce.Job: Job job_1547907254538_0001 completed successfully
19/01/19 14:20:37 INFO mapreduce.Job: Counters: 31
        File System Counters
                FILE: Number of bytes read=0
                FILE: Number of bytes written=396854
                FILE: Number of read operations=0
                FILE: Number of large read operations=0
                FILE: Number of write operations=0
                HDFS: Number of bytes read=170
                HDFS: Number of bytes written=21474836500
                HDFS: Number of read operations=8
                HDFS: Number of large read operations=0
                HDFS: Number of write operations=4
        Job Counters
                Launched map tasks=2
                Other local map tasks=2
                Total time spent by all maps in occupied slots (ms)=664402
                Total time spent by all reduces in occupied slots (ms)=0
                Total time spent by all map tasks (ms)=664402
                Total vcore-milliseconds taken by all map tasks=664402
                Total megabyte-milliseconds taken by all map tasks=680347648
        Map-Reduce Framework
                Map input records=214748365
                Map output records=214748365
                Input split bytes=170
                Spilled Records=0
                Failed Shuffles=0
                Merged Map outputs=0
                GC time elapsed (ms)=5884
                CPU time spent (ms)=356520
                Physical memory (bytes) snapshot=450916352
                Virtual memory (bytes) snapshot=1659351040
                Total committed heap usage (bytes)=212336640
        org.apache.hadoop.examples.terasort.TeraGen$Counters
                CHECKSUM=461200258163748239
        File Input Format Counters
                Bytes Read=0
        File Output Format Counters
                Bytes Written=21474836500

This is the end of Part b.

Part c

i: Python 2 Mapper & Reducer

mapper.py
#!/usr/bin/env python

import sys
# input comes from STDIN (standard input)
for line in sys.stdin:
    # remove the leading 'num:' prefix; split at the first ':' only,
    # so any further colons inside the text are preserved
    line = line.split(':', 1)[1]
    # split the line into words
    # remove leading and trailing whitespace
    line = line.strip()
    words = line.split()
    # increase counters
    for word in words:
        # write the results to STDOUT (standard output);
        # what we output here will be the input for the
        # Reduce step, i.e. the input for reducer.py
        #
        # tab-delimited; the trivial word count is 1
        print '%s\t%s' % (word, 1)

reducer.py
#!/usr/bin/env python
from operator import itemgetter
import sys

current_word = None
current_count = 0
word = None

# input comes from STDIN
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()

    # parse the input we got from mapper.py
    word, count = line.split('\t', 1)

    # convert count (currently a string) to int
    try:
        count = int(count)
    except ValueError:
        # count was not a number, so silently
        # ignore/discard this line
        continue

    # this IF-switch only works because Hadoop sorts map output
    # by key (here: word) before it is passed to the reducer
    if current_word == word:
        current_count += count
    else:
        # (1) when current_word is not None
        if current_word:
            # write result to STDOUT
            print '%s\t%s' % (current_word, current_count)
        current_count = count
        current_word = word
# (2) branch (1) above only emits a word once the next distinct word arrives, so the final word is never flushed inside the loop; emit it here.
# do not forget to output the last word if needed!
if current_word == word:
    print '%s\t%s' % (current_word, current_count)
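
Because both scripts read STDIN and write STDOUT, they can be smoke-tested locally before submitting to Hadoop by emulating the shuffle with sort. This check (with made-up sample input) is not part of the submitted run:

chmod +x mapper.py reducer.py
echo -e '1:the quick fox\n2:the lazy dog' | ./mapper.py | sort -k1,1 | ./reducer.py
# expected output (word<TAB>count):
# dog     1
# fox     1
# lazy    1
# quick   1
# the     2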

ii: MapReduce using Hadoop Streaming

Below are the command and output of the Hadoop Streaming MapReduce job; they can also be found under 1155114915-HW0/c/i.

ierg5730-gcp-key@vm-master:~/hadoop$ hadoop jar ./share/hadoop/tools/lib/hadoop-streaming-2.9.2.jar -file py2-word-count/mapper.py -mapper mapper.py -file py2-word-count/reducer.py -reducer reducer.py -input /hw0C/input/Large-Dataset.txt -output /hw0C/output
19/01/20 07:57:04 WARN streaming.StreamJob: -file option is deprecated, please use generic option -files instead.
packageJobJar: [py2-word-count/mapper.py, py2-word-count/reducer.py, /tmp/hadoop-unjar722251595545502703/] [] /tmp/streamjob8862609580622424134.jar tmpDir=null
19/01/20 07:57:07 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
19/01/20 07:57:07 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
19/01/20 07:57:08 INFO mapred.FileInputFormat: Total input files to process : 1
19/01/20 07:57:08 INFO mapreduce.JobSubmitter: number of splits:4
19/01/20 07:57:09 INFO Configuration.deprecation: yarn.resourcemanager.system-metrics-publisher.enabled is deprecated. Instead, use yarn.system-metrics-publisher.enabled
19/01/20 07:57:09 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1547969907570_0017
19/01/20 07:57:10 INFO impl.YarnClientImpl: Submitted application application_1547969907570_0017
19/01/20 07:57:10 INFO mapreduce.Job: The url to track the job: http://vm-master:8088/proxy/application_1547969907570_0017/
19/01/20 07:57:10 INFO mapreduce.Job: Running job: job_1547969907570_0017
19/01/20 08:02:00 INFO mapreduce.Job: Job job_1547969907570_0017 running in uber mode : false
19/01/20 08:02:00 INFO mapreduce.Job:  map 0% reduce 0%
19/01/20 08:02:30 INFO mapreduce.Job:  map 3% reduce 0%
19/01/20 08:02:31 INFO mapreduce.Job:  map 5% reduce 0%
19/01/20 08:02:33 INFO mapreduce.Job:  map 24% reduce 0%
19/01/20 08:02:37 INFO mapreduce.Job:  map 26% reduce 0%
19/01/20 08:02:40 INFO mapreduce.Job:  map 28% reduce 0%
19/01/20 08:02:47 INFO mapreduce.Job:  map 36% reduce 0%
19/01/20 08:03:08 INFO mapreduce.Job:  map 37% reduce 0%
19/01/20 08:03:15 INFO mapreduce.Job:  map 39% reduce 0%
19/01/20 08:03:20 INFO mapreduce.Job:  map 39% reduce 8%
19/01/20 08:03:21 INFO mapreduce.Job:  map 41% reduce 8%
19/01/20 08:03:22 INFO mapreduce.Job:  map 42% reduce 8%
19/01/20 08:03:24 INFO mapreduce.Job:  map 44% reduce 8%
19/01/20 08:03:27 INFO mapreduce.Job:  map 45% reduce 8%
19/01/20 08:03:30 INFO mapreduce.Job:  map 46% reduce 8%
19/01/20 08:03:58 INFO mapreduce.Job:  map 47% reduce 8%
19/01/20 08:04:05 INFO mapreduce.Job:  map 49% reduce 8%
19/01/20 08:04:08 INFO mapreduce.Job:  map 50% reduce 8%
19/01/20 08:04:11 INFO mapreduce.Job:  map 51% reduce 8%
19/01/20 08:04:12 INFO mapreduce.Job:  map 52% reduce 8%
19/01/20 08:04:15 INFO mapreduce.Job:  map 54% reduce 8%
19/01/20 08:04:17 INFO mapreduce.Job:  map 55% reduce 8%
19/01/20 08:04:21 INFO mapreduce.Job:  map 56% reduce 8%
19/01/20 08:04:55 INFO mapreduce.Job:  map 59% reduce 8%
19/01/20 08:04:58 INFO mapreduce.Job:  map 60% reduce 8%
19/01/20 08:05:01 INFO mapreduce.Job:  map 61% reduce 8%
19/01/20 08:05:02 INFO mapreduce.Job:  map 62% reduce 8%
19/01/20 08:05:04 INFO mapreduce.Job:  map 63% reduce 8%
19/01/20 08:05:08 INFO mapreduce.Job:  map 65% reduce 8%
19/01/20 08:05:14 INFO mapreduce.Job:  map 66% reduce 8%
19/01/20 08:05:38 INFO mapreduce.Job:  map 68% reduce 8%
19/01/20 08:05:42 INFO mapreduce.Job:  map 69% reduce 8%
19/01/20 08:05:45 INFO mapreduce.Job:  map 70% reduce 8%
19/01/20 08:05:48 INFO mapreduce.Job:  map 71% reduce 8%
19/01/20 08:05:51 INFO mapreduce.Job:  map 72% reduce 8%
19/01/20 08:05:58 INFO mapreduce.Job:  map 73% reduce 8%
19/01/20 08:06:04 INFO mapreduce.Job:  map 74% reduce 8%
19/01/20 08:06:13 INFO mapreduce.Job:  map 75% reduce 8%
19/01/20 08:06:25 INFO mapreduce.Job:  map 76% reduce 8%
19/01/20 08:06:31 INFO mapreduce.Job:  map 77% reduce 8%
19/01/20 08:06:33 INFO mapreduce.Job:  map 78% reduce 8%
19/01/20 08:06:40 INFO mapreduce.Job:  map 79% reduce 8%
19/01/20 08:06:43 INFO mapreduce.Job:  map 80% reduce 8%
19/01/20 08:06:46 INFO mapreduce.Job:  map 81% reduce 8%
19/01/20 08:06:49 INFO mapreduce.Job:  map 82% reduce 8%
19/01/20 08:06:55 INFO mapreduce.Job:  map 83% reduce 8%
19/01/20 08:06:58 INFO mapreduce.Job:  map 84% reduce 8%
19/01/20 08:07:04 INFO mapreduce.Job:  map 85% reduce 8%
19/01/20 08:07:05 INFO mapreduce.Job:  map 86% reduce 8%
19/01/20 08:07:10 INFO mapreduce.Job:  map 87% reduce 8%
19/01/20 08:07:14 INFO mapreduce.Job:  map 88% reduce 8%
19/01/20 08:07:16 INFO mapreduce.Job:  map 89% reduce 8%
19/01/20 08:07:20 INFO mapreduce.Job:  map 90% reduce 8%
19/01/20 08:07:22 INFO mapreduce.Job:  map 91% reduce 8%
19/01/20 08:07:24 INFO mapreduce.Job:  map 91% reduce 17%
19/01/20 08:07:28 INFO mapreduce.Job:  map 92% reduce 17%
19/01/20 08:07:35 INFO mapreduce.Job:  map 94% reduce 17%
19/01/20 08:07:41 INFO mapreduce.Job:  map 95% reduce 17%
19/01/20 08:07:47 INFO mapreduce.Job:  map 96% reduce 17%
19/01/20 08:07:51 INFO mapreduce.Job:  map 97% reduce 17%
19/01/20 08:07:54 INFO mapreduce.Job:  map 98% reduce 17%
19/01/20 08:07:55 INFO mapreduce.Job:  map 98% reduce 25%
19/01/20 08:08:06 INFO mapreduce.Job:  map 100% reduce 25%
19/01/20 08:08:14 INFO mapreduce.Job:  map 100% reduce 67%
19/01/20 08:08:20 INFO mapreduce.Job:  map 100% reduce 68%
19/01/20 08:08:26 INFO mapreduce.Job:  map 100% reduce 69%
19/01/20 08:08:32 INFO mapreduce.Job:  map 100% reduce 71%
19/01/20 08:08:38 INFO mapreduce.Job:  map 100% reduce 73%
19/01/20 08:08:44 INFO mapreduce.Job:  map 100% reduce 74%
19/01/20 08:08:50 INFO mapreduce.Job:  map 100% reduce 76%
19/01/20 08:08:56 INFO mapreduce.Job:  map 100% reduce 77%
19/01/20 08:09:02 INFO mapreduce.Job:  map 100% reduce 78%
19/01/20 08:09:08 INFO mapreduce.Job:  map 100% reduce 80%
19/01/20 08:09:14 INFO mapreduce.Job:  map 100% reduce 81%
19/01/20 08:09:20 INFO mapreduce.Job:  map 100% reduce 82%
19/01/20 08:09:26 INFO mapreduce.Job:  map 100% reduce 84%
19/01/20 08:09:32 INFO mapreduce.Job:  map 100% reduce 85%
19/01/20 08:09:38 INFO mapreduce.Job:  map 100% reduce 87%
19/01/20 08:09:44 INFO mapreduce.Job:  map 100% reduce 88%
19/01/20 08:09:50 INFO mapreduce.Job:  map 100% reduce 90%
19/01/20 08:09:56 INFO mapreduce.Job:  map 100% reduce 91%
19/01/20 08:10:02 INFO mapreduce.Job:  map 100% reduce 92%
19/01/20 08:10:08 INFO mapreduce.Job:  map 100% reduce 94%
19/01/20 08:10:14 INFO mapreduce.Job:  map 100% reduce 95%
19/01/20 08:10:20 INFO mapreduce.Job:  map 100% reduce 96%
19/01/20 08:10:26 INFO mapreduce.Job:  map 100% reduce 98%
19/01/20 08:10:32 INFO mapreduce.Job:  map 100% reduce 99%
19/01/20 08:10:36 INFO mapreduce.Job:  map 100% reduce 100%
19/01/20 08:10:36 INFO mapreduce.Job: Job job_1547969907570_0017 completed successfully
19/01/20 08:10:37 INFO mapreduce.Job: Counters: 50
        File System Counters
                FILE: Number of bytes read=1183675238
                FILE: Number of bytes written=1786909669
                FILE: Number of read operations=0
                FILE: Number of large read operations=0
                FILE: Number of write operations=0
                HDFS: Number of bytes read=420032880
                HDFS: Number of bytes written=40736746
                HDFS: Number of read operations=15
                HDFS: Number of large read operations=0
                HDFS: Number of write operations=2
        Job Counters
                Killed map tasks=3
                Launched map tasks=6
                Launched reduce tasks=1
                Rack-local map tasks=6
                Total time spent by all maps in occupied slots (ms)=1672971
                Total time spent by all reduces in occupied slots (ms)=466287
                Total time spent by all map tasks (ms)=1672971
                Total time spent by all reduce tasks (ms)=466287
                Total vcore-milliseconds taken by all map tasks=1672971
                Total vcore-milliseconds taken by all reduce tasks=466287
                Total megabyte-milliseconds taken by all map tasks=1713122304
                Total megabyte-milliseconds taken by all reduce tasks=477477888
        Map-Reduce Framework
                Map input records=3474848
                Map output records=53108124
                Map output bytes=496006084
                Map output materialized bytes=602222356
                Input split bytes=408
                Combine input records=0
                Combine output records=0
                Reduce input groups=4041135
                Reduce shuffle bytes=602222356
                Reduce input records=53108124
                Reduce output records=4041135
                Spilled Records=157593576
                Shuffled Maps =4
                Failed Shuffles=0
                Merged Map outputs=4
                GC time elapsed (ms)=8200
                CPU time spent (ms)=476560
                Physical memory (bytes) snapshot=1396105216
                Virtual memory (bytes) snapshot=4178202624
                Total committed heap usage (bytes)=936378368
        Shuffle Errors
                BAD_ID=0
                CONNECTION=0
                IO_ERROR=0
                WRONG_LENGTH=0
                WRONG_MAP=0
                WRONG_REDUCE=0
        File Input Format Counters
                Bytes Read=420032472
        File Output Format Counters
                Bytes Written=40736746
19/01/20 08:10:37 INFO streaming.StreamJob: Output directory: /hw0C/output

The result of this MapReduce job can be found under 1155114915-HW0/c/ii.
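
The output can also be inspected in place on HDFS before copying it out. A minimal check, assuming the default streaming part-file naming for the single reduce task (the exact file name can be confirmed with the ls):

hadoop fs -ls /hw0C/output
hadoop fs -cat /hw0C/output/part-00000 | head -n 20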

Part d

i: Java Map Reduce

The program below is the standard Hadoop WordCount example, with the map function modified to strip the leading "num:" field from each input line before tokenizing.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class WordCount {

  public static class TokenizerMapper
       extends Mapper<Object, Text, Text, IntWritable>{

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context
                    ) throws IOException, InterruptedException {
      // Strip the leading "num:" field from the value before tokenizing.
      // Splitting with limit 2 keeps any later colons in the text, and the
      // length check avoids an ArrayIndexOutOfBoundsException on lines
      // without a "num:" prefix.
      String[] parts = value.toString().split(":", 2);
      String line = parts.length > 1 ? parts[1] : parts[0];
      StringTokenizer itr = new StringTokenizer(line);
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  public static class IntSumReducer
       extends Reducer<Text,IntWritable,Text,IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values,
                       Context context
                       ) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
    if (otherArgs.length != 2) {
      System.err.println("Usage: wordcount <in> <out>");
      System.exit(2);
    }
    Job job = Job.getInstance(conf, "word count");  // new Job(conf, ...) is deprecated in Hadoop 2.x
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
    FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
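
One way to build the jar used in the next step, sketched after the standard Hadoop tutorial recipe (it assumes JAVA_HOME points at a JDK and that the commands are run from the Hadoop installation directory):

export HADOOP_CLASSPATH=${JAVA_HOME}/lib/tools.jar
bin/hadoop com.sun.tools.javac.Main WordCount.java
jar cf WordCount.jar WordCount*.class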

ii: Map Reduce Using the Compiled Java Program (.jar)

Below are the command and output; they can also be found under 1155114915-HW0/d/ii:

ierg5730-gcp-key@vm-master:~/hadoop$ hadoop jar java-word-count/WordCount.jar WordCount /WordCount /WordCount/output
19/01/20 09:46:07 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
19/01/20 09:46:08 INFO input.FileInputFormat: Total input files to process : 1
19/01/20 09:46:08 INFO mapreduce.JobSubmitter: number of splits:4
19/01/20 09:46:09 INFO Configuration.deprecation: yarn.resourcemanager.system-metrics-publisher.enabled is deprecated. Instead, use yarn.system-metrics-publisher.enabled
19/01/20 09:46:09 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1547976927340_0006
19/01/20 09:46:10 INFO impl.YarnClientImpl: Submitted application application_1547976927340_0006
19/01/20 09:46:10 INFO mapreduce.Job: The url to track the job: http://vm-master:8088/proxy/application_1547976927340_0006/
19/01/20 09:46:10 INFO mapreduce.Job: Running job: job_1547976927340_0006
19/01/20 09:46:24 INFO mapreduce.Job: Job job_1547976927340_0006 running in uber mode : false
19/01/20 09:46:24 INFO mapreduce.Job:  map 0% reduce 0%
19/01/20 09:46:54 INFO mapreduce.Job:  map 27% reduce 0%
19/01/20 09:47:05 INFO mapreduce.Job:  map 36% reduce 0%
19/01/20 09:47:25 INFO mapreduce.Job:  map 40% reduce 0%
19/01/20 09:47:31 INFO mapreduce.Job:  map 44% reduce 0%
19/01/20 09:47:39 INFO mapreduce.Job:  map 44% reduce 8%
19/01/20 09:48:01 INFO mapreduce.Job:  map 49% reduce 8%
19/01/20 09:48:07 INFO mapreduce.Job:  map 53% reduce 8%
19/01/20 09:48:57 INFO mapreduce.Job:  map 55% reduce 8%
19/01/20 09:49:03 INFO mapreduce.Job:  map 59% reduce 8%
19/01/20 09:49:09 INFO mapreduce.Job:  map 62% reduce 8%
19/01/20 09:50:00 INFO mapreduce.Job:  map 65% reduce 8%
19/01/20 09:50:05 INFO mapreduce.Job:  map 68% reduce 8%
19/01/20 09:50:06 INFO mapreduce.Job:  map 71% reduce 8%
19/01/20 09:50:36 INFO mapreduce.Job:  map 75% reduce 8%
19/01/20 09:50:42 INFO mapreduce.Job:  map 78% reduce 8%
19/01/20 09:50:48 INFO mapreduce.Job:  map 80% reduce 8%
19/01/20 09:50:54 INFO mapreduce.Job:  map 82% reduce 8%
19/01/20 09:50:58 INFO mapreduce.Job:  map 83% reduce 8%
19/01/20 09:51:00 INFO mapreduce.Job:  map 83% reduce 17%
19/01/20 09:51:01 INFO mapreduce.Job:  map 85% reduce 17%
19/01/20 09:51:06 INFO mapreduce.Job:  map 86% reduce 17%
19/01/20 09:51:07 INFO mapreduce.Job:  map 89% reduce 17%
19/01/20 09:51:12 INFO mapreduce.Job:  map 91% reduce 17%
19/01/20 09:51:13 INFO mapreduce.Job:  map 93% reduce 17%
19/01/20 09:51:17 INFO mapreduce.Job:  map 95% reduce 17%
19/01/20 09:51:18 INFO mapreduce.Job:  map 97% reduce 17%
19/01/20 09:51:23 INFO mapreduce.Job:  map 100% reduce 17%
19/01/20 09:51:24 INFO mapreduce.Job:  map 100% reduce 67%
19/01/20 09:51:30 INFO mapreduce.Job:  map 100% reduce 79%
19/01/20 09:51:36 INFO mapreduce.Job:  map 100% reduce 96%
19/01/20 09:51:38 INFO mapreduce.Job:  map 100% reduce 100%
19/01/20 09:51:38 INFO mapreduce.Job: Job job_1547976927340_0006 completed successfully
19/01/20 09:51:39 INFO mapreduce.Job: Counters: 50
        File System Counters
                FILE: Number of bytes read=354682329
                FILE: Number of bytes written=472468510
                FILE: Number of read operations=0
                FILE: Number of large read operations=0
                FILE: Number of write operations=0
                HDFS: Number of bytes read=420032928
                HDFS: Number of bytes written=40736746
                HDFS: Number of read operations=15
                HDFS: Number of large read operations=0
                HDFS: Number of write operations=2
        Job Counters
                Killed map tasks=3
                Launched map tasks=6
                Launched reduce tasks=1
                Data-local map tasks=6
                Total time spent by all maps in occupied slots (ms)=1388495
                Total time spent by all reduces in occupied slots (ms)=270804
                Total time spent by all map tasks (ms)=1388495
                Total time spent by all reduce tasks (ms)=270804
                Total vcore-milliseconds taken by all map tasks=1388495
                Total vcore-milliseconds taken by all reduce tasks=270804
                Total megabyte-milliseconds taken by all map tasks=1421818880
                Total megabyte-milliseconds taken by all reduce tasks=277303296
        Map-Reduce Framework
                Map input records=3474848
                Map output records=53108124
                Map output bytes=602222332
                Map output materialized bytes=116792741
                Input split bytes=456
                Combine input records=70677631
                Combine output records=26100472
                Reduce input groups=4041135
                Reduce shuffle bytes=116792741
                Reduce input records=8530965
                Reduce output records=4041135
                Spilled Records=34631437
                Shuffled Maps =4
                Failed Shuffles=0
                Merged Map outputs=4
                GC time elapsed (ms)=9245
                CPU time spent (ms)=247070
                Physical memory (bytes) snapshot=1396428800
                Virtual memory (bytes) snapshot=4191748096
                Total committed heap usage (bytes)=977272832
        Shuffle Errors
                BAD_ID=0
                CONNECTION=0
                IO_ERROR=0
                WRONG_LENGTH=0
                WRONG_MAP=0
                WRONG_REDUCE=0
        File Input Format Counters
                Bytes Read=420032472
        File Output Format Counters
                Bytes Written=40736746

The result of this MapReduce program can be found under 1155114915-HW0/d/ii.
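
As with the streaming job, the output can be checked directly on HDFS. A minimal sketch, assuming the default new-API reducer part-file naming (part-r-00000 for the single reduce task):

hadoop fs -cat /WordCount/output/part-r-00000 | head -n 20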

Comparing WordCount times between Hadoop Streaming (Python) and native Java MapReduce:

The Python result:

Total time spent by all map tasks (ms)=1672971
Total time spent by all reduce tasks (ms)=466287

The Java result:

Total time spent by all map tasks (ms)=1388495
Total time spent by all reduce tasks (ms)=270804

The total map-task time for Java is about 17% lower than Python's (1388495 ms vs. 1672971 ms).

The reduce-task gap is even larger: Java spends about 42% less time than Python (270804 ms vs. 466287 ms). A major reason is the combiner: the Java job sets IntSumReducer as its combiner, so most of the counting happens map-side, and the counters show Reduce input records dropping from 53108124 in the streaming job to 8530965 in the Java job. A streaming job can enable a combiner as well; see the sketch below.
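A minimal sketch of adding a combiner to the Part c streaming job. The script names mapper.py/reducer.py and the input path are assumptions; reusing reducer.py as the combiner only works because it reads and emits the same word<TAB>count format:

hadoop jar share/hadoop/tools/lib/hadoop-streaming-2.9.2.jar \
    -files mapper.py,reducer.py \
    -mapper mapper.py \
    -combiner reducer.py \
    -reducer reducer.py \
    -input /hw0C/input \
    -output /hw0C/output-combiner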
