1. Hadoop's built-in wordcount
1.1 Location
[root@hadoop001 ~]# ls $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar
/export/servers/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar
1.2 Usage
[root@hadoop001 ~]# hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar
An example program must be given as the first argument.
Valid program names are:
aggregatewordcount: An Aggregate based map/reduce program that counts the words in the input files.
aggregatewordhist: An Aggregate based map/reduce program that computes the histogram of the words in the input files.
bbp: A map/reduce program that uses Bailey-Borwein-Plouffe to compute exact digits of Pi.
dbcount: An example job that count the pageview counts from a database.
distbbp: A map/reduce program that uses a BBP-type formula to compute exact bits of Pi.
grep: A map/reduce program that counts the matches of a regex in the input.
join: A job that effects a join over sorted, equally partitioned datasets
multifilewc: A job that counts words from several files.
pentomino: A map/reduce tile laying program to find solutions to pentomino problems.
pi: A map/reduce program that estimates Pi using a quasi-Monte Carlo method.
randomtextwriter: A map/reduce program that writes 10GB of random textual data per node.
randomwriter: A map/reduce program that writes 10GB of random data per node.
secondarysort: An example defining a secondary sort to the reduce.
sort: A map/reduce program that sorts the data written by the random writer.
sudoku: A sudoku solver.
teragen: Generate data for the terasort
terasort: Run the terasort
teravalidate: Checking results of terasort
wordcount: A map/reduce program that counts the words in the input files.
wordmean: A map/reduce program that counts the average length of the words in the input files.
wordmedian: A map/reduce program that counts the median length of the words in the input files.
wordstandarddeviation: A map/reduce program that counts the standard deviation of the length of the words in the input files.
[root@hadoop001 ~]# hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar wordcount
Usage: wordcount <in> [<in>...] <out>
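To actually run wordcount, upload a text file to HDFS and pass the input and output paths as arguments. The file name and paths below are only an illustration (note that the output directory must not exist before the job runs):

[root@hadoop001 ~]# hdfs dfs -mkdir -p /wordcount/input
[root@hadoop001 ~]# hdfs dfs -put words.txt /wordcount/input
[root@hadoop001 ~]# hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar wordcount /wordcount/input /wordcount/output
[root@hadoop001 ~]# hdfs dfs -cat /wordcount/output/part-r-00000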
2. MapReduce
2.1 The core idea of MapReduce
MapReduce is a parallel programming model and one of the core components of the Hadoop ecosystem. Its core idea is "divide and conquer": a large dataset is split into many small, independent datasets, which are then processed simultaneously on multiple machines.
We will use an easy-to-understand example (the sandwich analogy in Section 2.3) to illustrate the idea of "divide and conquer".
2.2 The core functions of MapReduce
MapReduce abstracts the entire parallel computation into two functions: map and reduce. The map function is the "divide" in divide and conquer, and the reduce function is the "conquer".
MapReduce splits a large dataset stored in a distributed file system into many small, independent datasets and assigns them to multiple map tasks for processing. The output of the map tasks is then further processed to become the input of the reduce tasks; finally, the reduce tasks aggregate the results and write them back to the distributed file system.
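To make this flow concrete, here is a small hypothetical trace of wordcount through these steps (the two input lines are invented for illustration):

input:   "Hello World"  /  "Hello Hadoop"
split:   line 1 goes to map task 1, line 2 goes to map task 2
map:     task 1 emits <Hello,1> <World,1>;  task 2 emits <Hello,1> <Hadoop,1>
shuffle: <Hadoop,[1]>  <Hello,[1,1]>  <World,[1]>
reduce:  <Hadoop,1>  <Hello,2>  <World,1>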
Map function: the map function converts a small dataset into <key,value> pairs suitable as input, then processes them into a series of new <key,value> pairs as output; the output can be viewed as list(<key,value>).
Reduce function: the reduce function takes the output of the map functions as its input, gathers the elements that share the same key, performs an operation on them, and produces a result that is also in <key,value> form, merged into one output file.
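To make the two functions concrete, here is a minimal Java sketch modeled on the WordCount example that ships with Hadoop (it uses the org.apache.hadoop.mapreduce API; the job-driver code that wires these classes into a job is omitted):

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // Map: turn each input line into a series of <word, 1> pairs
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);      // emit <word, 1>
            }
        }
    }

    // Reduce: for each key, sum all the values grouped together by the shuffle
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {

        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);        // emit <word, total count>
        }
    }
}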
2.3 The working process of MapReduce
As shown in the figure, the MapReduce working process consists of six stages: input, split, map, shuffle, reduce, and output. Below we use the example of making a sandwich to walk through the MapReduce workflow:
√ The input stage corresponds to preparing the ingredients;