Creating a file on Linux
Create the WordCount.java file (the filename must match the public class name):
- cd /home/hadoop/Documents
- touch WordCount.java
Starting and stopping Hadoop:
start-all.sh
stop-all.sh
1.WordCount
- cd ~/Documents/wordcount/
- javac -classpath /usr/local/hadoop/share/hadoop/common/hadoop-common-2.9.2.jar:/usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-client-core-2.9.2.jar:/usr/local/hadoop/share/hadoop/common/lib/commons-cli-1.2.jar -d ./ ./WordCount.java
- jar -cvf WordCount.jar ./*.class
- hdfs dfs -rm -r output
- hadoop jar ./WordCount.jar WordCount input output
- hdfs dfs -ls ./output
- hdfs dfs -cat output/*
- hdfs dfs -rm -r output
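The job above runs the compiled WordCount class. As a rough sketch of the map/reduce logic it implements (plain Java with no Hadoop dependencies, for illustration only — the real job splits this work across mappers and reducers):

```java
import java.util.Map;
import java.util.TreeMap;

// Illustrative sketch of the WordCount logic: the mapper emits (word, 1)
// for each token, and the reducer sums the counts per word.
public class WordCountSketch {
    public static Map<String, Integer> countWords(String text) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String token : text.split("\\s+")) {
            if (token.isEmpty()) continue;
            counts.merge(token, 1, Integer::sum); // reduce step: sum per key
        }
        return counts;
    }

    public static void main(String[] args) {
        // Output format mirrors "hdfs dfs -cat output/*": word<TAB>count
        countWords("Apache Hadoop Apache Spark")
            .forEach((w, c) -> System.out.println(w + "\t" + c));
    }
}
```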
2.MatrixMultiply
- cd ~/Documents/Matrix1/
- javac -classpath /usr/local/hadoop/share/hadoop/common/hadoop-common-2.9.2.jar:/usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-client-core-2.9.2.jar:/usr/local/hadoop/share/hadoop/common/lib/commons-cli-1.2.jar -d ./ ./MatrixMultiply.java
- jar -cvf MatrixMultiply.jar ./*.class
- View the generator script: sudo gedit ./genMatrix.sh
- Generate two matrices (30x50 and 50x100): ./genMatrix.sh 30 50 100
- Upload the two matrix files from the local filesystem to HDFS (only needs to be done once):
hdfs dfs -put ./M_30_50 input
hdfs dfs -put ./N_50_100 input
- hdfs dfs -rm -r output
- hadoop jar ./MatrixMultiply.jar MatrixMultiply input/M_30_50 input/N_50_100 output
- hdfs dfs -ls ./output
- hdfs dfs -cat output/*
- hdfs dfs -rm -r output
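In the MapReduce version of matrix multiplication, each output cell (i, j) becomes a reducer key that collects M[i][k] and N[k][j] pairs and sums their products over k. The computation itself is the familiar triple loop, sketched here in plain Java without Hadoop:

```java
// Sketch of the computation behind the MatrixMultiply job: the reducer
// for key (i, j) computes the sum over k of M[i][k] * N[k][j].
public class MatrixMultiplySketch {
    public static int[][] multiply(int[][] m, int[][] n) {
        int rows = m.length, inner = n.length, cols = n[0].length;
        int[][] p = new int[rows][cols];
        for (int i = 0; i < rows; i++)
            for (int j = 0; j < cols; j++)
                for (int k = 0; k < inner; k++)
                    p[i][j] += m[i][k] * n[k][j]; // reducer's sum over k
        return p;
    }

    public static void main(String[] args) {
        int[][] m = {{1, 2}, {3, 4}};
        int[][] n = {{5, 6}, {7, 8}};
        int[][] p = multiply(m, n);
        System.out.println(p[0][0] + " " + p[0][1]); // 19 22
    }
}
```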
3.InvertedIndex
- cd ~/Documents/index/
- javac -classpath /usr/local/hadoop/share/hadoop/common/hadoop-common-2.9.2.jar:/usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-client-core-2.9.2.jar:/usr/local/hadoop/share/hadoop/common/lib/commons-cli-1.2.jar -d ./ ./InvertedIndex.java
- jar -cvf InvertedIndex.jar ./*.class
- Create the indexinput folder (only needs to be done once):
cd /usr/local/hadoop/bin
hdfs dfs -mkdir indexinput
- Import the input files (only needs to be done once):
cd ~/Documents/index/
hdfs dfs -put ./input/file1.txt indexinput
hdfs dfs -put ./input/file2.txt indexinput
hdfs dfs -put ./input/file3.txt indexinput
hdfs dfs -put ./input/file4.txt indexinput
hdfs dfs -put ./input/file5.txt indexinput
- Import the stopword list into the input folder (only needs to be done once):
hdfs dfs -put ./stopwords.txt input
- hdfs dfs -rm -r output
- hadoop jar ./InvertedIndex.jar InvertedIndex indexinput output
- hdfs dfs -ls ./output
- hdfs dfs -cat output/*
- hdfs dfs -rm -r output
- To remove the index input when done: hdfs dfs -rm -r indexinput
- File contents:
file1:Apache Spark Scala Hadoop Java C Python Do And Will KNN
file2:SVM Scala News Play Akka Yes GBDT
file3:LDA SVM RF GBDT Adaboost Kmeans KNN
file4:QQ BAT I Great All LDA
file5:Apache Hadoop MapReduce Git SVN SVM
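An inverted index maps each word to the set of documents containing it, skipping words on the stopword list. A minimal sketch of that logic in plain Java (the document contents for file1/file2 are taken from the list above; the stopword set here is an assumption, since the actual contents of stopwords.txt are not shown):

```java
import java.util.*;

// Sketch of the InvertedIndex logic: map each word to the set of
// documents containing it, skipping words in a stopword list.
public class InvertedIndexSketch {
    public static Map<String, Set<String>> build(Map<String, String> docs,
                                                 Set<String> stopwords) {
        Map<String, Set<String>> index = new TreeMap<>();
        for (Map.Entry<String, String> doc : docs.entrySet()) {
            for (String word : doc.getValue().split("\\s+")) {
                if (word.isEmpty() || stopwords.contains(word)) continue;
                index.computeIfAbsent(word, k -> new TreeSet<>())
                     .add(doc.getKey());
            }
        }
        return index;
    }

    public static void main(String[] args) {
        Map<String, String> docs = new LinkedHashMap<>();
        docs.put("file1", "Apache Spark Scala Hadoop Java C Python Do And Will KNN");
        docs.put("file2", "SVM Scala News Play Akka Yes GBDT");
        // Hypothetical stopword set, for illustration only.
        Set<String> stopwords = new HashSet<>(Arrays.asList("Do", "And", "Will", "Yes"));
        build(docs, stopwords).forEach((w, d) -> System.out.println(w + "\t" + d));
    }
}
```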