Technical articles on big data

Latest recommended article published 2024-07-29 22:43:52

StevenIsSnail

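Spark Distributed Computing Platform

Source: 大数据技术 (Big Data Technology); author: hzguoding; 2014-08-13 14:24

Introduction to Spark

UC Berkeley AMPLab (2011)

The current version is 0.8.1; the project has joined the Apache Incubator

Lightning-Fast Cluster Computing

http://spark.incubator.apache.org/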

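Deploying Spark

Cluster Mode Overview

SparkContext is the core control handle through which a user program runs its jobs

The Cluster Manager is the component that controls the cluster

Three cluster manager modes are currently supported:

1 Standalone

2 Apache Mesos

3 Hadoop YARN

Standalone deployment (Master-Slave)

1 Download and build

2 Edit the configuration files

3 Run the startup scripts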

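Programming with Spark

Spark is developed in Scala

The programming interface supports Scala, Java, and Python

Fast Programming

RDD (Resilient Distributed Datasets)

An RDD is a globally abstracted handle to distributed storage;

in a Map-Reduce job, by contrast, the input and output must be given as explicit HDFS paths and tracked by hand;

An RDD object can be created in three ways:

1 From a container object held in memory

2 From a text file

3 From a Hadoop input format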

From an in-memory container:

List<Double> list = new ArrayList<Double>();

JavaRDD<Double> rdds = sc.parallelize(list);

From a text file:

JavaRDD<String> rdds = sc.textFile("hdfs://xxx/user/files/wordcount.txt");

From a Hadoop input format:

JavaPairRDD<LongWritable, Text> rdds = sc.hadoopFile("hdfs://xxx/user/files/wordcount.txt", TextInputFormat.class, LongWritable.class, Text.class);

RDD Operation

Operations supported on an RDD include map, reduce, filter, flatMap, sample, union, distinct, groupByKey, reduceByKey, join, cogroup, cartesian, count, foreach, saveAsTextFile, saveAsSequenceFile, and others.

RDD Persistence

Several persistence levels are available: MEMORY_ONLY (the level used by cache), MEMORY_AND_DISK, and DISK_ONLY.

rdds.cache(), rdds.persist(storage_level)

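Advantages of In-Memory Computing

Coding is simple, and the operations applied to a data handle are easy to see in the code.

System stability is far behind Hadoop's.

For iterative computations such as regression, it pays off when there is enough memory.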

================================================================================

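Storm Basics

Source: 大数据技术 (Big Data Technology); author: 李刚锐; 2014-08-13 11:06

This article introduces some basics of Storm and of Storm Trident, and is meant to help beginners understand Storm quickly. The elementary concepts are only mentioned briefly; the focus is on the more important points in between.

I. Storm

A unit of work in Storm is called a Topology, analogous to a Job in MapReduce.

A Storm cluster contains two kinds of nodes: the master node and worker nodes. Their roles are as follows:

The master node runs a daemon called Nimbus, which distributes code within the Storm cluster, assigns tasks to worker machines, and monitors the state of the cluster. Nimbus plays a role similar to that of the JobTracker in Hadoop.

Each worker node runs a daemon called Supervisor. The Supervisor listens for the work that Nimbus assigns to its node and starts or stops worker processes accordingly. Each worker process executes a subset of a Topology; a running Topology consists of many worker processes spread across different worker nodes.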

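II. Topology

A Topology consists of many functional nodes that together form a directed graph; data can be passed between any two connected nodes.

There are two kinds of node: Spout and Bolt. A Spout is a data source, the starting point of the whole Topology; Bolts are the computation nodes in between.

Data travels between nodes as tuples; the tuple is the smallest unit of transfer.

In the topology graph there is a single connection between two linked nodes, but at run time that connection is really many concurrent ones.

How to understand the concurrency:

A topology has several worker processes;

A worker process contains multiple threads; each thread is an executor and belongs to exactly one Bolt or Spout;

Each executor runs one or more tasks;

Each task performs the actual data processing.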

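In code, the following settings control the concurrency:

ParallelismHint: sets the initial number of executors, that is, threads, for a bolt;

Bolt.setNumTasks: sets the number of tasks;

Config.setNumWorkers: sets the number of worker processes.

Because of this concurrency, a node that has finished processing a tuple actually has many downstream instances to choose from. Which one should receive the tuple for further computation?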

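Storm calls this decision Stream Grouping. The available groupings are:
1. Shuffle Grouping: tuples are distributed randomly across the tasks of the downstream Bolt.
2. Fields Grouping: tuples with the same values in the grouping fields are always sent to the same task of the Bolt.
3. All Grouping: broadcast; every task of the Bolt receives every tuple.
4. Global Grouping: the entire stream goes to the single task with the lowest task id.
5. None Grouping: currently has the same effect as Shuffle Grouping. The intended difference is that Storm may run such a Bolt in the same thread as the Bolt or Spout it subscribes to.
6. Direct Grouping: a more involved grouping in which the producer of a tuple decides which task of the consumer receives it.
7. Local or Shuffle Grouping: a shuffle variant meant to improve efficiency. If the target Bolt has tasks running in the same worker process as the sender, tuples are distributed randomly among just those tasks; otherwise it behaves like an ordinary Shuffle Grouping.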
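The routing rules for Fields Grouping and Global Grouping can be sketched in a few lines of plain Java. This is only an illustration of the idea, not Storm's implementation: the class GroupingSketch and its methods are hypothetical, and hashing the grouping values modulo the task count is an assumed strategy that happens to give the "same fields, same task" guarantee.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch of two grouping rules; it does not use Storm's classes. */
final class GroupingSketch {

    /** Fields grouping: equal grouping values always map to the same task index. */
    static int fieldsGrouping(List<Object> groupingValues, int numTasks) {
        // floorMod keeps the index non-negative even when hashCode() is negative.
        return Math.floorMod(groupingValues.hashCode(), numTasks);
    }

    /** Global grouping: the whole stream goes to the lowest task id. */
    static int globalGrouping(int[] taskIds) {
        int lowest = taskIds[0];
        for (int id : taskIds) {
            lowest = Math.min(lowest, id);
        }
        return lowest;
    }

    public static void main(String[] args) {
        int first = fieldsGrouping(Arrays.<Object>asList("storm"), 4);
        int second = fieldsGrouping(Arrays.<Object>asList("storm"), 4);
        System.out.println("same task for same word: " + (first == second));
        System.out.println("global target: " + globalGrouping(new int[] {7, 3, 5}));
    }
}
```

A word-count Bolt relies on exactly this property: every occurrence of a word must reach the task that holds that word's counter.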
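In code, data is passed between nodes by calling collector.emit(new Values(tuple)). Each tuple is a Values object, that is, a list of Object, and it can carry many items, for example new Values(123, "String", new Date(), 123L, 12.3F, null).
On the receiving node the value can be cast back: the Object returned by tuple.getValueByField(_sourceName) can be cast directly to the type that the upstream node emitted.
(TODO: this has only been tested with basic types, including List; it is not known whether instances of arbitrary classes can be cast directly.)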
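Sending tuples one at a time is sometimes inefficient, so Storm later added a batch mode in which several tuples are sent together in one batch.
Even that is not always efficient enough, so Batch Transactions were added: the tuples in a batch are first combined by computation, which reduces the amount of data that has to be sent. The process has two phases:
1. Processing Phase: computes over the data in a batch. This phase can run in parallel, which improves efficiency.
2. Commit Phase: commits the batch results in strict order, which preserves the transaction guarantee.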
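The split between a parallel processing phase and a strictly ordered commit phase can be sketched with a thread pool from the Java standard library. The class BatchPhasesSketch is hypothetical and does not use Storm's transactional API; reducing a batch to its tuple count and a pool size of 4 are assumptions chosen only to keep the example small.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical sketch of the two phases; it does not use Storm's classes. */
final class BatchPhasesSketch {

    /** Processing phase: reduce one batch to a partial result (here, its size). */
    static long process(List<String> batch) {
        return batch.size();
    }

    /** Processes all batches in parallel, then commits them one by one in batch order. */
    static List<Long> run(List<List<String>> batches) {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        try {
            List<Future<Long>> partials = new ArrayList<>();
            for (final List<String> batch : batches) {
                Callable<Long> task = () -> process(batch);
                partials.add(pool.submit(task));
            }
            // Commit phase: iterate in submission order, so batch i commits before batch i+1.
            List<Long> totalsAfterEachCommit = new ArrayList<>();
            long total = 0;
            for (Future<Long> partial : partials) {
                total += partial.get(); // blocks until this batch has been processed
                totalsAfterEachCommit.add(total);
            }
            return totalsAfterEachCommit;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<List<String>> batches = Arrays.asList(
                Arrays.asList("a", "b"), Arrays.asList("c"), Arrays.asList("d", "e", "f"));
        System.out.println(run(batches));
    }
}
```

Only the partial result of each batch crosses the phase boundary, which is the reduction in transferred data that the text describes.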
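In addition, Spouts and Bolts in Storm are serializable (implements Serializable).
On why serialization matters:

Spout and Bolt instances are constructed on the machine that submits the topology, then serialized and shipped to the workers, so their member variables need to be serializable. When a supervisor starts a worker, including a replacement for a worker that died, the constructor is not called again; the instance is loaded from the serialized data and its open method is then called.
Therefore every member variable that the constructor of a Spout or Bolt assigns must be Serializable. Other member variables need not be: if they are not set in the constructor but initialized in open, they are rebuilt each time a worker starts, because open is called again at that point.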
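The difference between the constructor and open can be observed with plain Java serialization. The sketch below uses a hypothetical SketchSpout class rather than Storm's spout interface, and a made-up Connection type to stand for any resource that is not Serializable; the round trip through a byte array stands in for shipping the instance to a worker.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

/** Hypothetical stand-in for a spout; it does not implement Storm's interfaces. */
final class SketchSpout implements Serializable {
    private static final long serialVersionUID = 1L;

    /** A resource type that is deliberately not Serializable. */
    static final class Connection {
        final String target;
        Connection(String target) { this.target = target; }
    }

    static int constructorCalls = 0;

    private final String sourceName;        // assigned in the constructor: must be serializable
    private transient Connection connection; // created in open(): never serialized

    SketchSpout(String sourceName) {
        constructorCalls++;
        this.sourceName = sourceName;
    }

    /** Plays the role of open(): rebuilds whatever was not serialized. */
    void open() {
        connection = new Connection("connected:" + sourceName);
    }

    String describe() {
        return connection == null ? "not opened" : connection.target;
    }

    /** Serializes and deserializes the instance, as happens on the way to a worker. */
    static SketchSpout roundTrip(SketchSpout spout) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(spout);
            }
            try (ObjectInputStream in =
                     new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return (SketchSpout) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        SketchSpout copy = roundTrip(new SketchSpout("queue-a"));
        System.out.println("constructor calls: " + constructorCalls);
        System.out.println("before open: " + copy.describe());
        copy.open();
        System.out.println("after open: " + copy.describe());
    }
}
```

Deserialization restores sourceName without running the constructor, while connection stays unset until open is called on the copy.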

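III. Storm Trident

Storm Trident is a layer of abstraction over Storm, and the code it wraps is highly efficient. This lets us develop more quickly.

Trident packages functionality into a set of primitives: joins, aggregations, grouping, user-defined functions, and filters. The simplest example, a word count, illustrates them: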

TridentTopology topology = new TridentTopology();

TridentState wordCounts =

topology.newStream("spout1", spout)

.each(new Fields("sentence"), new Split(), new Fields("word"))

.groupBy(new Fields("word"))

.persistentAggregate(new MemoryMapState.Factory(), new Count(), new Fields("count"))

.parallelismHint(6);

topology.newDRPCStream("words")

.stateQuery(wordCounts, new Fields("args"), new MapGet(), new Fields("count"));

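First a Spout source, a FixedBatchSpout, is created; it keeps sending data by calling emitBatch. What it emits is a series of sentences.

Then a TridentTopology is created. It reads the stream from the Spout, processes it with each, groupBy, and so on to compute the counts, and saves the result in a state object of type TridentState; in the code above that state variable is called wordCounts.
Next a DRPCStream is created so that external callers can query the TridentState described above.
An external caller executes
new DRPCClient("server", port).execute("words", "cat the dog jumped"). This is a remote call that looks up, among all the counts accumulated so far, the counts for the words cat, the, dog, and jumped.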

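IV. About aggregation:

Aggregation resembles select count(*), sum(count) and the like in SQL. In strict standard SQL, when aggregates are present every non-aggregated column must appear in the group by clause.
In Trident, this kind of operation is usually done with aggregate, partitionAggregate, and persistentAggregate combined with the groupBy method.

To apply several aggregations to one batch of data:
Use chainedAgg together with chainEnd to run several aggregations over a group at the same time, as shown here:
.chainedAgg()
.partitionAggregate(new Fields("url"), new Count(), new Fields("url_cnt"))
.partitionAggregate(new Fields("byte"), new Sum(), new Fields("bytes_sum"))
.chainEnd()
Note:
chainEnd filters the fields: the input fields are no longer kept. Inside the chain, partitionAggregate does not filter the fields.
In this example the output fields contain only url_cnt and bytes_sum, and no longer url and byte. Other columns, those not processed by partitionAggregate, are not affected.
Usually partitionAggregate is used together with groupBy; after the filtering, the only columns left are the groupBy columns and the ones produced by partitionAggregate.

(End of the notes on aggregation.)

As mentioned earlier, the intermediate results of a computation can be stored in a state.
There are three kinds of state: non-transactional, repeat-transactional, and opaque-transactional.

Two kinds of operation apply to a State:
QueryFunction: query operations
StateUpdater: update operations
QueryFunction
How a QueryFunction executes: the inputs are passed to the batchRetrieve method, which processes them and returns a List.
For example: stateQuery(locations, new Fields("userid"), new QueryLocation(), new Fields("location"))
This looks up location information by user id. The input is a list of user ids (new Fields("userid")), and the output is a List of user locations (new Fields("location")).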

public class QueryLocation extends BaseQueryFunction<LocationDB, String> {

    // Look up one location per input tuple; results line up with the inputs by index.
    public List<String> batchRetrieve(LocationDB state, List<TridentTuple> inputs) {
        List<String> ret = new ArrayList<String>();
        for (TridentTuple input : inputs) {
            ret.add(state.getLocation(input.getLong(0)));
        }
        return ret;
    }

    // Emit the location that batchRetrieve returned for this tuple.
    public void execute(TridentTuple tuple, String location, TridentCollector collector) {
        collector.emit(new Values(location));
    }
}

A QueryFunction has two methods:
List<T> batchRetrieve(S state, List<TridentTuple> inputs): takes the list of input tuples, queries the State (or does other work), and returns a List
void execute: emits the result
StateUpdater
The update is performed in the updateState method.
For example: .partitionPersist(new LocationDBFactory(), new Fields("userid", "location"), new LocationUpdater())

public class LocationUpdater extends BaseStateUpdater<LocationDB> {

    // Collect the ids and locations of the whole batch, then write them in one bulk call.
    public void updateState(LocationDB state, List<TridentTuple> tuples, TridentCollector collector) {
        List<Long> ids = new ArrayList<Long>();
        List<String> locations = new ArrayList<String>();
        for (TridentTuple t : tuples) {
            ids.add(t.getLong(0));
            locations.add(t.getString(1));
        }
        state.setLocationsBulk(ids, locations);
    }
}

The partitionPersist call above is what performs the update.

V. Other notes:

partitionBy must be used before partitionPersist.

TridentUtils.fieldsUnion can be called to compute the union of several sets of fields. (The difference between fieldsUnion and fieldsConcat is that the former removes duplicate fields.)
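The difference between a union and a concatenation of field names can be shown with ordinary Java collections. The sketch below does not call TridentUtils; FieldsSketch and its two methods are hypothetical, field lists are modelled as List<String>, and keeping first-seen order in the union is an assumption of the sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch of union versus concatenation; it does not use Trident's classes. */
final class FieldsSketch {

    /** Concatenation: keeps every name, including duplicates. */
    static List<String> concat(List<String> left, List<String> right) {
        List<String> result = new ArrayList<>(left);
        result.addAll(right);
        return result;
    }

    /** Union: each name appears once, in first-seen order. */
    static List<String> union(List<String> left, List<String> right) {
        Set<String> unique = new LinkedHashSet<>(left);
        unique.addAll(right);
        return new ArrayList<>(unique);
    }

    public static void main(String[] args) {
        List<String> a = Arrays.asList("url", "byte");
        List<String> b = Arrays.asList("byte", "count");
        System.out.println(concat(a, b)); // [url, byte, byte, count]
        System.out.println(union(a, b));  // [url, byte, count]
    }
}
```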

================================================================================

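Accessing HDFS from Storm

Source: 大数据技术 (Big Data Technology); author: 李刚锐; 2014-08-13 11:32

I. Hadoop client configuration

Package the Hadoop jars into the Storm package, or add them to Storm's lib directory.

Make core-site.xml, mapred-site.xml, and hdfs-site.xml available to Storm as well (for example on its classpath), so that the Hadoop Configuration can be initialized inside Storm.

II. Security authentication

Pass in the keytab file and convert it into a serializable byte array, so that it can be passed between spouts and bolts.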

BufferedInputStream in = new BufferedInputStream(new FileInputStream(keytabFile));

ByteArrayOutputStream out = new ByteArrayOutputStream(1024);

byte[] temp = new byte[1024];

int size = 0;

while ((size = in.read(temp)) != -1) {

out.write(temp, 0, size);

}

in.close();

this.priniciple = principle;

this.keytabContent = out.toByteArray();

At authentication time, create a temporary file from the byte array and authenticate with Kerberos:

Configuration hadoopConf = new Configuration();

//hadoopConf.set(FS_DEFAULT_NAME_KEY, this.fsName);

hadoopConf.set("hadoop.security.authentication", "kerberos");

UserGroupInformation.setConfiguration(hadoopConf);

//UserGroupInformation.loginUserFromKeytab(principle, keytab);

InputStream keytabFile = new ByteArrayInputStream(this.keytabContent);

File temp = File.createTempFile("stream_sql", "keytab");

temp.deleteOnExit();

IOUtils.copyBytes(keytabFile, new FileOutputStream(temp), 1024, true);

UserGroupInformation.loginUserFromKeytab(this.priniciple, temp.getAbsolutePath());

//remove the temp file

temp.delete();
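The byte-array round trip used for the keytab can be checked without any Hadoop or Kerberos dependency. In this sketch KeytabBytesSketch and its method names are hypothetical, the sample bytes are arbitrary placeholders rather than a real keytab, and the Kerberos login itself is left out.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

/** Hypothetical sketch of the file -> bytes -> temp file round trip; no Kerberos involved. */
final class KeytabBytesSketch {

    /** Reads the whole file into a byte array that a serializable field can hold. */
    static byte[] load(Path keytabFile) {
        try {
            return Files.readAllBytes(keytabFile);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Writes the bytes to a fresh temporary file and returns its path. */
    static Path materialize(byte[] keytabContent) {
        try {
            Path temp = Files.createTempFile("keytab_sketch", ".keytab");
            Files.write(temp, keytabContent);
            return temp;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Returns true if the bytes survive the trip through a temporary file unchanged. */
    static boolean roundTrips(byte[] content) {
        Path temp = materialize(content);
        try {
            return Arrays.equals(content, load(temp));
        } finally {
            temp.toFile().delete(); // remove the temporary file once it has been used
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrips(new byte[] {5, 2, 0, 1}));
    }
}
```

Deleting the file right after use, as the original code also does, limits how long the credential sits on the worker's disk.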

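III. LZO codec issues

Add hadoop-lzo to the Storm package, or add hadoop-lzo to Storm's lib directory.

Set LD_LIBRARY_PATH (add HADOOP_HOME/lib/native/Linux-amd64-64) so that the native GPL library can be loaded.

In the Storm configuration, set java.library.path and add the LZO path to it.

IV. Multiple nodes reading multiple blocks of one file at the same time

Use the same approach as map/reduce (InputSplit).

Make InputSplit serializable by putting it in a wrapper class that defines its own readObject and writeObject.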

private void writeObject(ObjectOutputStream s) throws IOException {

s.defaultWriteObject();

new ObjectWritable(this.writable).write(s);

}

private void readObject(ObjectInputStream ois) throws Exception {

ois.defaultReadObject();

ObjectWritable obj = new ObjectWritable();

obj.setConf(new JobConf());

obj.readFields(ois);

this.writable = (T) obj.get();

}

public T get() {

return this.writable;

}
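The wrapper technique works for any class that is not Serializable, which a self-contained example can show without Hadoop. Here RangeSplit is a hypothetical stand-in for an InputSplit, and SerializableSplit writes its three fields by hand where the original code delegates to ObjectWritable.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

/** Hypothetical stand-in for a Hadoop split; deliberately not Serializable. */
final class RangeSplit {
    final String path;
    final long start;
    final long length;

    RangeSplit(String path, long start, long length) {
        this.path = path;
        this.start = start;
        this.length = length;
    }
}

/** Serializable wrapper that writes and reads the wrapped split by hand. */
final class SerializableSplit implements Serializable {
    private static final long serialVersionUID = 1L;

    private transient RangeSplit split; // transient: default serialization skips it

    SerializableSplit(RangeSplit split) {
        this.split = split;
    }

    RangeSplit get() {
        return split;
    }

    private void writeObject(ObjectOutputStream out) throws IOException {
        out.defaultWriteObject();
        out.writeUTF(split.path);
        out.writeLong(split.start);
        out.writeLong(split.length);
    }

    private void readObject(ObjectInputStream in) throws IOException, ClassNotFoundException {
        in.defaultReadObject();
        // Fields are read back in exactly the order writeObject wrote them.
        split = new RangeSplit(in.readUTF(), in.readLong(), in.readLong());
    }

    /** Serializes and deserializes a wrapper, as happens when a tuple crosses workers. */
    static SerializableSplit roundTrip(SerializableSplit wrapper) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(wrapper);
            }
            try (ObjectInputStream in =
                     new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return (SerializableSplit) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        RangeSplit copy = roundTrip(
                new SerializableSplit(new RangeSplit("/user/files/part-0", 0L, 1024L))).get();
        System.out.println(copy.path + " " + copy.start + " " + copy.length);
    }
}
```

Marking the wrapped field transient is what keeps defaultWriteObject from failing on the non-serializable type.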

Create the array of InputSplits

String path = tuple.getString(0);

Configuration hConf = new Configuration();

JobConf jobConf = new JobConf(hConf);

//read the file path

FileInputFormat.addInputPath(jobConf, new Path(path));

jobConf.setInputFormat(TextInputFormat.class);

TextInputFormat input = new TextInputFormat();

input.configure(jobConf);

InputSplit[] splits = input.getSplits(jobConf, 2);

if (splits != null) {

for (InputSplit split: splits) {

collector.emit(new Values(new SerializeWritable<InputSplit>(split)));

}

}

Process the split messages concurrently

SerializeWritable<InputSplit> split = (SerializeWritable<InputSplit>)tuple.get(0);

if (split == null) {

return;

}

TextInputFormat input = new TextInputFormat();

JobConf jobConf = new JobConf();

input.configure(jobConf);

try {

RecordReader<LongWritable, Text> r = input.getRecordReader(split.get(), jobConf, Reporter.NULL);

LongWritable key = new LongWritable();

Text val = new Text();

while(r.next(key, val)) {

collector.emit(new Values(val.toString()));

}

r.close();

} catch (IOException e) {

e.printStackTrace();

}

================================================================================
