Spark Word Count
Start Hadoop first and make sure port 9000 is reachable. For how to start Hadoop, see my first blog post. spark-3.1.1 download link: download
1. Extract the archive in /usr/local/src, then symlink it as /usr/local/spark
cd /usr/local/src
tar -xvf spark-3.1.1-bin-hadoop3.2.tgz
cd /usr/local
ln -sv ./src/spark-3.1.1-bin-hadoop3.2 ./spark
2. Enter the /usr/local/spark/conf directory and create spark-env.sh from the template
cd /usr/local/spark/conf
cp spark-env.sh.template spark-env.sh
vi spark-env.sh
export SPARK_MASTER_HOST=192.168.43.100   # the VM's IP
export SPARK_MASTER_PORT=7077
export SPARK_WORKER_CORES=2
export SPARK_WORKER_MEMORY=2g
export SPARK_MASTER_WEBUI_PORT=8888
export JAVA_HOME=/usr/local/java/jdk1.8.0_161
export SPARK_DIST_CLASSPATH=$(/usr/local/src/hadoop-3.0.3/bin/hadoop classpath)   # put Hadoop's jars on Spark's classpath so it can talk to HDFS
3. Create the workers file
cp workers.template workers
vi workers   # enter the VM's IP
192.168.43.100
4. Add the environment variables to /etc/profile
vi /etc/profile
JAVA_HOME=/usr/local/java/jdk1.8.0_161
HADOOP_HOME=/usr/local/src/hadoop-3.0.3
SPARK_HOME=/usr/local/src/spark-3.1.1-bin-hadoop3.2
CLASSPATH=.:$JAVA_HOME/lib/tools.jar
PATH=$JAVA_HOME/bin:$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$SPARK_HOME/conf:$SPARK_HOME/bin
export JAVA_HOME CLASSPATH HADOOP_HOME SPARK_HOME PATH
source /etc/profile
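To confirm the new variables are visible to child processes, you can check them from any Scala REPL, including spark-shell (local mode works even before the cluster is started); a quick sketch:
// print the exported paths; "not set" means /etc/profile was not sourced in this shell
println(sys.env.getOrElse("SPARK_HOME", "SPARK_HOME not set"))
println(sys.env.getOrElse("HADOOP_HOME", "HADOOP_HOME not set"))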
5. Start Spark
cd /usr/local/spark/sbin
./start-all.sh
After it starts, jps should list a Master and a Worker process, and the master web UI is at http://192.168.43.100:8888 (the port set in step 2).
6. Verify
cd /usr/local/spark
./bin/run-example SparkPi   ## if the output contains "Pi is roughly", the test succeeded
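Besides SparkPi, you can sanity-check a shell session with a trivial job; a minimal sketch (pointing the shell at the standalone master from step 2 is my addition, the default without --master is local mode):
// ./bin/spark-shell --master spark://192.168.43.100:7077
// a trivial distributed reduce; it should print 500500
println(sc.parallelize(1 to 1000).reduce(_ + _))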
7. In /root, create the files task1 and task2 and type some words into them with vi
cd /root
vi task2   # input for Hadoop (HDFS) mode
vi task1   # input for local mode
Save and quit once the text is entered.
Create an HDFS directory: hdfs dfs -mkdir /sunhao
Upload the local file: hdfs dfs -put /root/task2 /sunhao
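To confirm the upload before running the full job, you can read the file straight back in spark-shell; a quick check using the paths created above:
// print the first lines of the uploaded HDFS file
sc.textFile("hdfs://192.168.43.100:9000/sunhao/task2").take(3).foreach(println)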
8. Word count (local mode)
cd /usr/local/spark/bin
./spark-shell
var textfile=sc.textFile("file:///root/task1");
var count = textfile.flatMap(line => line.split(" ")).map(word => (word,1)).reduceByKey(_+_)
count.collect()
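The pipeline above can be extended to rank the words; a sketch, still in the local-mode shell (the empty-token filter and top-10 cutoff are my additions, not part of the original steps):
// same pipeline, but drop empty tokens, sort by count descending, and keep the top 10
val top10 = sc.textFile("file:///root/task1").flatMap(_.split(" ")).filter(_.nonEmpty).map(word => (word, 1)).reduceByKey(_ + _).sortBy(_._2, ascending = false).take(10)
top10.foreach(println)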
Exit the spark-shell with :quit or Ctrl+D (Ctrl+Z only suspends the process; it does not exit)
Word count (Hadoop mode)
cd /usr/local/spark/bin
./spark-shell
var textfile=sc.textFile("hdfs://192.168.43.100:9000/sunhao/task2");
var count = textfile.flatMap(line => line.split(" ")).map(word => (word,1)).reduceByKey(_+_)
count.collect()
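In Hadoop mode you would normally persist the result instead of only collecting it; a minimal sketch (the output path /sunhao/output is hypothetical, and saveAsTextFile fails if the path already exists):
// write one part file per partition back to HDFS
// NOTE: /sunhao/output is a hypothetical output directory; it must not exist yet
count.saveAsTextFile("hdfs://192.168.43.100:9000/sunhao/output")
// inspect afterwards with: hdfs dfs -cat /sunhao/output/part-00000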
Exit the spark-shell (:quit)
Shut down Spark
cd /usr/local/spark/sbin
./stop-all.sh
Check HDFS status
hdfs dfsadmin -report