1. Setting up the IDEA project
- If you are using the CDH version of Hadoop, add the Cloudera repository to the pom file:
<repositories>
    <repository>
        <id>cloudera</id>
        <url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
    </repository>
</repositories>
- The pom file:
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>com.ruozedata.bigdata</groupId>
    <artifactId>ruozedata-hadoop</artifactId>
    <version>1.0</version>
    <name>Maven Portlet Archetype</name>
    <!-- FIXME change it to the project's website -->
    <url>http://maven.apache.org</url>
    <properties>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
        <maven.compiler.source>1.8</maven.compiler.source>
        <maven.compiler.target>1.8</maven.compiler.target>
        <hadoop.version>2.7.2</hadoop.version>
    </properties>
    <dependencies>
        <!-- Hadoop dependency -->
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-client</artifactId>
            <version>${hadoop.version}</version>
        </dependency>
        <dependency>
            <groupId>junit</groupId>
            <artifactId>junit</artifactId>
            <version>4.11</version>
            <scope>test</scope>
        </dependency>
    </dependencies>
</project>
2. Creating an HDFS directory from IDEA
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.junit.Assert;
import org.junit.Test;

public class HDFSApp {
    public static final String HDFS_PATH = "hdfs://10.211.55.23:9000";

    @Test
    public void mkdir() throws Exception {
        Configuration conf = new Configuration();
        FileSystem fileSystem = FileSystem.get(new URI(HDFS_PATH), conf);
        boolean isSuccess = fileSystem.mkdirs(new Path("/ruozedata/api"));
        Assert.assertTrue(isSuccess);
        fileSystem.close();
    }
}
You may run into a permission problem:
org.apache.hadoop.security.AccessControlException: Permission denied: user=h
http://www.huqiwen.com/2013/07/18/hdfs-permission-denied/
This is an owner/permission issue.
- Fix it by passing the user when obtaining the FileSystem:
FileSystem fileSystem = FileSystem.get(new URI(HDFS_PATH), conf,"hadoop");
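Another option worth noting (a sketch, not from the original notes): with simple authentication, the Hadoop client also honors the HADOOP_USER_NAME setting, so the user can be set as a system property (or as an environment variable) before the FileSystem is obtained:
// Sketch: set the client user via HADOOP_USER_NAME (simple auth only)
System.setProperty("HADOOP_USER_NAME", "hadoop");
FileSystem fileSystem = FileSystem.get(new URI(HDFS_PATH), new Configuration());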
3. copyFromLocalFile
public static final String HDFS_PATH = "hdfs://10.211.55.23:9000";
Configuration conf;
FileSystem fileSystem;

@Before
public void setUp() throws Exception {
    conf = new Configuration();
    fileSystem = FileSystem.get(new URI(HDFS_PATH), conf, "hadoop");
}

@After
public void tearDown() throws Exception {
    fileSystem.close();
}

// Copy a local file to HDFS
@Test
public void copyFromLocal() throws Exception {
    Path srcPath = new Path("/Users/hyc/Desktop/access.log");
    Path dstPath = new Path("/ruozedata/api");
    fileSystem.copyFromLocalFile(srcPath, dstPath);
}
The uploaded file's replication factor differs from the one in the cluster's hdfs-site.xml, because the client uses its own default.
- Set the replication property on the client's conf:
@Before
public void setUp() throws Exception {
    conf = new Configuration();
    conf.set("dfs.replication", "1");
    fileSystem = FileSystem.get(new URI(HDFS_PATH), conf, "hadoop");
}
4. Copying via streams (merging file blocks)
- The Spark tarball is 270.8 MB in total, so it spans three blocks.
- The first block:
// TODO: copy via streams
@Test
public void copyFromLocalIO() throws Exception {
    // I/O: character streams vs. byte streams
    BufferedInputStream in = new BufferedInputStream(new FileInputStream("/Users/hyc/Desktop/access-pk.log"));
    FSDataOutputStream out = fileSystem.create(new Path("/ruozedata/api/access-io.log"));
    // Write everything from in into out
    IOUtils.copyBytes(in, out, 4096);
    IOUtils.closeStream(out);
    IOUtils.closeStream(in);
}
- The 3rd block:
// The Spark tarball
@Test
public void download01() throws Exception {
    FSDataInputStream in = fileSystem.open(new Path("/ruozedata/api/spark-2.4.2-bin-2.6.0-cdh5.15.1_20190731_104546.tgz"));
    FileOutputStream out = new FileOutputStream(new File("/Users/hyc/Desktop/spark-02.tgz"));
    // Seek to where the next block starts: skip the first two 128 MB blocks
    in.seek(1024 * 1024 * 128 * 2);
    // byte[] buffer = new byte[1024];
    // for (int i = 0; i < 1024 * 128; i++) {
    //     in.read(buffer);
    //     out.write(buffer);
    // }
    IOUtils.copyBytes(in, out, conf);
    IOUtils.closeStream(out);
    IOUtils.closeStream(in);
}
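download01 copies from the seek offset to the end of the file, which is fine for the last block. To pull down the first or second block on its own, the copy has to stop after exactly one block's worth of bytes; note also that the commented-out loop above is unsafe, because read() may return fewer bytes than the buffer size. Below is a minimal sketch of such a bounded copy, assuming the default 128 MB block size (consistent with the seek arithmetic above); the local output path spark-01.tgz is made up for illustration:
// Sketch: download exactly one block (here, the first 128 MB) of the tarball
@Test
public void download00() throws Exception {
    FSDataInputStream in = fileSystem.open(new Path("/ruozedata/api/spark-2.4.2-bin-2.6.0-cdh5.15.1_20190731_104546.tgz"));
    FileOutputStream out = new FileOutputStream(new File("/Users/hyc/Desktop/spark-01.tgz"));
    long remaining = 1024L * 1024 * 128;   // one full block
    byte[] buffer = new byte[4096];
    while (remaining > 0) {
        // read() may return fewer bytes than asked for, so honor its return value
        int n = in.read(buffer, 0, (int) Math.min(buffer.length, remaining));
        if (n < 0) {
            break;   // reached end of file
        }
        out.write(buffer, 0, n);
        remaining -= n;
    }
    IOUtils.closeStream(out);
    IOUtils.closeStream(in);
}
The same pattern combined with in.seek(1024L * 1024 * 128) would give the second block; the pieces can then be stitched back together with cat, as noted in the supplement at the end.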
5. MapReduce
- Mapper + Reducer + Driver => a distributed computing program => just run it on YARN
- Storage nodes and compute nodes should belong to the same set of nodes: data locality.
- MapReduce processes:
MRAppMaster
MapTasks
ReduceTasks (not always present)
wordcount
- In Hadoop, the data types are IntWritable, LongWritable, Text
- Requirement: count how many times each word appears in a file
1) Read in the file contents
2) Each line has a delimiter; here the delimiter is a comma
3) Assign each word a count of 1
ruoze,1
ruoze,1
ruoze,1
jepson,1
jepson,1
xingxing,1
4) Shuffle
The same key is sent to the same reducer
ruoze,<1,1,1>
jepson,<1,1>
xingxing,<1>
5) Merge the data
ruoze,3
jepson,2
xingxing,1
- wordcount code
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountApp {
    public static void main(String[] args) throws Exception {
        // 1. Get the Job object
        Configuration configuration = new Configuration();
        Job job = Job.getInstance(configuration);
        // 2. Set the jar information
        job.setJarByClass(WordCountApp.class);
        // 3. Set the custom MyMapper and MyReducer
        job.setMapperClass(MyMapper.class);
        job.setReducerClass(MyReducer.class);
        // 4. Set the key and value types of the map output
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        // 5. Set the key and value types of the reduce output
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // 6. Set the input and output paths
        String input = "/Users/hyc/ruozedata_workspace/ruozedata-hadoop/data/wc.data";
        String output = "out/";
        FileInputFormat.setInputPaths(job, new Path(input));
        FileOutputFormat.setOutputPath(job, new Path(output));
        // 7. Submit the job
        boolean result = job.waitForCompletion(true);
        System.exit(result ? 0 : 1);
    }

    /*
     * KEYIN:    data type of the input key (the byte offset of each line)
     * VALUEIN:  data type of the input value
     * KEYOUT:   data type of the output key
     * VALUEOUT: data type of the output value
     */
    public static class MyMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        IntWritable ONE = new IntWritable(1);

        /*
         * value: one line of data
         */
        @Override
        protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            String[] splits = value.toString().split(",");
            for (String split : splits) {
                context.write(new Text(split), ONE);
            }
        }
    }

    public static class MyReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text word, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
            int count = 0;
            // Accumulate the sum
            for (IntWritable value : values) {
                count += value.get();
            }
            context.write(word, new IntWritable(count));
        }
    }
}
- Build the jar in IDEA, upload it to the Linux box, and run the MR job there:
[hadoop@ruozedata000 lib]$ hadoop jar ruozedata-hadoop-1.0.jar com.ruozedata.bigdata.hadoop.mapreduce.wc.WordCountYarnApp /ruozedata/wc/input/ /ruozedata/wc/output
[hadoop@ruozedata000 lib]$ hadoop fs -ls /ruozedata/wc/output
Found 2 items
-rw-r--r-- 1 hadoop supergroup 0 2019-08-26 23:58 /ruozedata/wc/output/_SUCCESS
-rw-r--r-- 1 hadoop supergroup 22 2019-08-26 23:58 /ruozedata/wc/output/part-r-00000
[hadoop@ruozedata000 lib]$
[hadoop@ruozedata000 lib]$ hadoop fs -text /ruozedata/wc/output/part-r-00000
jack 2
ruoze 3
star 1
LongWritable
public static class MyMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
public class LongWritable implements WritableComparable<LongWritable> {
public interface WritableComparable<T> extends Writable, Comparable<T> {
}
- LongWritable implements a serialization interface. How good the serialization is determines how fast the program runs; Java's Serializable is inefficient because extra information gets written along with the data.
- Serialization
Serialization: object => byte array
Deserialization: byte array => object
- Serializable is empty: a marker interface
Without a default constructor, an error is thrown:
java.lang.Exception: java.lang.RuntimeException: java.lang.NoSuchMethodException: com.ruozedata.bigdata.hadoop.mapreduce.ser.Access.<init>()
at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:529)
- When a custom bean implements the serialization interface, deserialization calls the no-arg constructor via reflection, so the class must have a no-arg constructor.
The custom class must implement the Writable interface:
public class Access implements Writable {}
6. Steps to develop a custom serialization class
1) Implement the Writable interface *****
2) Override write and readFields; the field order must be exactly the same in both *****
3) There must be a default no-arg constructor *****
4) toString is optional
5) If the custom class needs to support sorting, it also has to implement Comparable
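To tie the steps above together, here is a minimal sketch of such a class. The field names phone and upFlow are hypothetical, made up for illustration; they are not from the course code:
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

public class Access implements Writable {
    // Hypothetical fields for illustration only
    private String phone;
    private long upFlow;

    // 3) The no-arg constructor is mandatory: deserialization calls it via reflection
    public Access() {
    }

    // 2) write and readFields must handle the fields in exactly the same order
    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(phone);
        out.writeLong(upFlow);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        phone = in.readUTF();
        upFlow = in.readLong();
    }

    // 4) toString is optional; it controls how the record appears in text output
    @Override
    public String toString() {
        return phone + "\t" + upFlow;
    }
}
If the class also needed to act as a sortable key, it would implement WritableComparable<Access> and add a compareTo method (step 5).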
Supplementary notes
Unit testing
# pom file
<dependency>
    <groupId>junit</groupId>
    <artifactId>junit</artifactId>
    <version>4.11</version>
    <scope>test</scope>
</dependency>
# Annotate the method with @Test
@Test
public void mkdir(){
- Finally, the three downloaded blocks can be concatenated with the cat command into the complete Spark tarball.
Ports are configured in core-site.xml
- 8020 is the NameNode's port when it is in the active state;
Port 9000 is the default FileSystem port:
<!-- Address of the NameNode in HDFS -->
<property>
    <name>fs.defaultFS</name>
    <value>hdfs://ruozedata000:9000</value>
</property>
</configuration>
- 50070 is the NameNode's web UI port;
Port 50090 is the SecondaryNameNode's port.