Location of the software packages and code:
Link: https://pan.baidu.com/s/1hmm2f-NkkHuHbhlnmo0iew  Password: vjwy
wordcount:
1: About the built-in Writable types in Hadoop
BooleanWritable: standard Boolean value
ByteWritable: single-byte value
DoubleWritable: double-precision floating-point value
FloatWritable: single-precision floating-point value
IntWritable: integer, comparable to Java's int
LongWritable: long integer
Text: text stored in UTF-8, comparable to Java's String
All of them are accessed through their get()/set() methods, used much like fields of an object (see the sketch below).
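A minimal sketch of how these wrapper types are typically read and written (the class name and values here are only illustrative):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;

public class WritableDemo {
    public static void main(String[] args) {
        // Wrap a primitive int in an IntWritable, then read it back with get()
        IntWritable count = new IntWritable();
        count.set(42);
        System.out.println("count = " + count.get());

        // Text wraps a UTF-8 string; set(...) replaces the contents, toString() reads it back
        Text word = new Text();
        word.set("hello");
        System.out.println("word = " + word.toString());
    }
}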
3. Write the code: reduce
Hadoop code:
1. Running locally:
Code:
package com.itstar;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;

public class wc_hadoop {

    public static class mymap extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static IntWritable one = new IntWritable(1);

        @Override
        protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            // Split each input line on spaces and emit a (word, 1) pair for every word
            String[] words = value.toString().split(" ");
            for (String word : words) {
                context.write(new Text(word), one);
            }
        }
    }

    public static class myreduce extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
            // Sum all the counts that arrived for this word
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        // Because of HDFS user authentication: either set the user name here, or set
        // dfs.permissions to false in hdfs-site.xml; otherwise the job fails with a permission error
        System.setProperty("HADOOP_USER_NAME", "hadoop");
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "wc_hadoop");
        Path inp = new Path(args[1]);
        // Delete the output directory if it already exists; this approach works for the local file system
        FileSystem hf = inp.getFileSystem(conf);
        // For a remote HDFS file system use the following instead
        // FileSystem hf = FileSystem.get(conf);
        if (hf.exists(inp)) {
            hf.delete(inp, true);
        }
        job.setJarByClass(wc_hadoop.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        job.setMapperClass(mymap.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        job.setReducerClass(myreduce.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileOutputFormat.setOutputPath(job, inp);
        job.waitForCompletion(true);
    }
}
Running locally:
1. Download the Hadoop package, unzip it locally, and add it to the environment variables; if that does not take effect, restart the machine and try again.
Two files have to be added to the unzipped package: hadoop.dll and winutils.exe.
2. Add the corresponding jar packages as dependencies; they are inside the local unzipped package (compare with the locations shown in the figure).
3. Set the run arguments (write as many as you need) under Program arguments: the input file path and the output file path (an illustrative example follows).
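For example, the Program arguments box in the IDE run configuration might contain something like the following (both paths are hypothetical placeholders; the first is the input file read as args[0], the second is the output directory used as args[1]):

E:\data\wordcount\input.txt E:\data\wordcount\output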
Success:
2. Testing the map locally:
This requires one dependency:
<dependency>
    <groupId>org.apache.mrunit</groupId>
    <artifactId>mrunit</artifactId>
    <version>1.0.0</version>
    <classifier>hadoop2</classifier>
    <scope>test</scope>
</dependency>
Test method:
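The original screenshot of the test is not reproduced here; below is a minimal sketch of what such an MRUnit test could look like, assuming the mymap and myreduce classes above and JUnit 4 on the test classpath (the test class name and input line are illustrative):

package com.itstar;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mrunit.mapreduce.MapDriver;
import org.apache.hadoop.mrunit.mapreduce.ReduceDriver;
import org.junit.Test;

import java.util.Arrays;

public class wc_hadoop_test {

    @Test
    public void testMap() throws Exception {
        // Feed one line to the mapper and assert the (word, 1) pairs it emits, in order
        MapDriver<LongWritable, Text, Text, IntWritable> driver =
                MapDriver.newMapDriver(new wc_hadoop.mymap());
        driver.withInput(new LongWritable(0), new Text("hello hadoop hello"))
              .withOutput(new Text("hello"), new IntWritable(1))
              .withOutput(new Text("hadoop"), new IntWritable(1))
              .withOutput(new Text("hello"), new IntWritable(1))
              .runTest();
    }

    @Test
    public void testReduce() throws Exception {
        // Feed one key with its grouped values and assert the summed count
        ReduceDriver<Text, IntWritable, Text, IntWritable> driver =
                ReduceDriver.newReduceDriver(new wc_hadoop.myreduce());
        driver.withInput(new Text("hello"), Arrays.asList(new IntWritable(1), new IntWritable(1)))
              .withOutput(new Text("hello"), new IntWritable(2))
              .runTest();
    }
}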
2. Running on Hadoop as a jar: it is essentially the same code; just change the parts marked in red (in the original notes) and then package it.
Note: if you created the project as a Scala project, pay attention to the packaging directory specified in the pom file; commenting it out is enough.
Packaging command: mvn clean package
How to run the jar:
1. Package it.
Run the package (a typical command is shown below).
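The exact command depends on your own jar name and HDFS paths; a typical invocation (the jar name and paths here are hypothetical) looks like this:

hadoop jar wc_hadoop-1.0-SNAPSHOT.jar com.itstar.wc_hadoop /user/hadoop/wc/input /user/hadoop/wc/output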
Result:
Running Spark locally and on the server:
1. Test in spark-shell (a minimal example follows):
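A minimal word-count run inside spark-shell might look like the following (the input path is hypothetical; spark-shell already provides the sc context, so no setup code is needed):

scala> sc.textFile("hdfs:///user/hadoop/wc/input").flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).collect().foreach(println)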
Test method for submitting a jar:
The Spark code written in Java:
package com.itstar;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;
import scala.Tuple2;

import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class wordCount_java {
    public static void main(String[] args) {
        if (args.length < 1)
            return;
        SparkConf sc = new SparkConf();
        sc.setAppName("JavaWordCount");
        // sc.setMaster("local");
        JavaSparkContext jsc = new JavaSparkContext(sc);
        JavaRDD<String> file = jsc.textFile(args[0]);
        // Split every line into words
        JavaRDD<String> words = file.flatMap(new FlatMapFunction<String, String>() {
            public Iterator<String> call(String line) throws Exception {
                return Arrays.asList(line.split(" ")).iterator();
            }
        });
        // Map every word to a (word, 1) pair
        JavaPairRDD<String, Integer> maps = words.mapToPair(new PairFunction<String, String, Integer>() {
            public Tuple2<String, Integer> call(String word) throws Exception {
                return new Tuple2<String, Integer>(word, 1);
            }
        });
        // Sum the counts for each word
        JavaPairRDD<String, Integer> reduce = maps.reduceByKey(new Function2<Integer, Integer, Integer>() {
            public Integer call(Integer a, Integer b) throws Exception {
                return a + b;
            }
        });
        // Collect the results to the driver and print them
        List<Tuple2<String, Integer>> result = reduce.collect();
        for (Tuple2<String, Integer> t : result) {
            System.out.println(t._1 + " " + t._2);
        }
        jsc.stop();
    }
}
Test method with spark-submit:
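A typical submission looks like the following (the class name matches the code above; the jar name, master URL, and input path are hypothetical and should be replaced with your own):

spark-submit --class com.itstar.wordCount_java --master spark://master:7077 wordcount-1.0-SNAPSHOT.jar hdfs:///user/hadoop/wc/input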
Result:
Running the Scala version:
Code (Scala version):
package com.itstar

import org.apache.spark.{SparkConf, SparkContext}

object wordCount_scala {
  def main(args: Array[String]): Unit = {
    if (args.length < 1)
      return
    val s = new SparkConf()
      // .setMaster("local")
      .setAppName("ScalaWordCount")
    val sc = new SparkContext(s)
    // Split lines into words, map each word to (word, 1), sum per word, and bring the result to the driver
    val reduce = sc.textFile(args(0)).flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _).collect()
    reduce.foreach(println)
    sc.stop()
  }
}
Result:
Testing Spark locally:
Just set the master to local; no other configuration is needed (see the snippet below).
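In the Java code above this simply means enabling the commented-out setMaster line; a minimal sketch of the relevant lines (the master string is the only change):

SparkConf sc = new SparkConf();
sc.setAppName("JavaWordCount");
// Run in-process inside the IDE, no cluster needed; "local[*]" uses all local cores, plain "local" uses one thread
sc.setMaster("local[*]");
JavaSparkContext jsc = new JavaSparkContext(sc);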
Note:
If you get this error (shown in the original screenshot), remove the hidden part indicated there.
If you get the winutils error, Spark can simply ignore it, because it does not affect the result.
If you do not want to see the error, follow step 1 of the local-run setup above:
putting winutils.exe under the Hadoop bin directory is enough.