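Business scenario: vehicle trajectories in a traffic flow can be used to describe the reachability between two points in an urban road network. This example takes the vehicle-passing records from a city's public-security checkpoints as input, cleans and processes them, and builds a reachability matrix between the checkpoints of the whole city network. That matrix can then serve as the basis for further analysis of urban traffic conditions.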
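Processing flow:
Hadoop: processes the raw vehicle-passing records into a data stream ordered by time for each individual vehicle, for example [(kkid1,1)(kkid2)…]. During this step the data must be checked for validity, mainly by excluding A-A transitions and A-B transitions whose time gap is very large. A complete set of checkpoint IDs is needed here; if none is available, a separate MR job can generate one.
Spark: loads the data produced by Hadoop and, from the vertex set and the edge set, builds a graph model with Spark's graph framework (GraphX). From that model it generates the reachability matrix between the city's checkpoints.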
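The validity rules listed for the Hadoop stage can be sketched as a single predicate over two consecutive sightings of the same vehicle. This is only an illustration in plain Java: the class name `TransitionRules`, the parameter names, and the idea of checking both endpoints against the checkpoint ID set are assumptions, not code from the original job.

```java
import java.util.Set;

/** Hypothetical helper: decides whether two consecutive sightings form a usable edge. */
final class TransitionRules {
    private TransitionRules() {}

    /**
     * @param from         checkpoint ID of the earlier sighting
     * @param to           checkpoint ID of the later sighting
     * @param gapMillis    time between the two sightings, in milliseconds
     * @param knownIds     the complete set of checkpoint IDs
     * @param maxGapMillis largest gap still treated as one trip (assumed threshold)
     */
    static boolean isValid(String from, String to, long gapMillis,
                           Set<String> knownIds, long maxGapMillis) {
        // Assumption: both endpoints must belong to the known checkpoint set.
        if (!knownIds.contains(from) || !knownIds.contains(to)) {
            return false;
        }
        // Exclude A-A: the vehicle was seen twice at the same checkpoint.
        if (from.equals(to)) {
            return false;
        }
        // Exclude A-B pairs whose gap is too large to count as one trip.
        return gapMillis >= 0 && gapMillis < maxGapMillis;
    }
}
```

With a 30-minute threshold, a 1004 to 1007 transition five minutes apart would be kept, while the same pair 31 minutes apart would be dropped.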
Hadoop stage:
CarGraphEdge.java: the MR driver. It uses Hadoop's Tool and ToolRunner utility classes to simplify running the job from the command line.
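import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.PathFilter;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class CarGraphEdge extends Configured implements Tool {
    public static final String GRANULARITY = "GRANULARITY";

    public int run(String[] args) throws Exception {
        Path input = new Path(args[0]);
        Path output = new Path(args[1]);
        // Use the configuration prepared by ToolRunner so that generic options (-D ...) take effect.
        Configuration conf = getConf();
        // Optional third argument: the time granularity in minutes.
        if (args.length >= 3) {
            conf.set(CarGraphEdge.GRANULARITY, args[2]);
        }
        Job job = Job.getInstance(conf, "CAR_GRAPH_EDGE");
        job.setJarByClass(cn.com.zjf.MR_04.CarGraphEdge.class);
        job.setInputFormatClass(CombineTextInputFormat.class);
        job.setMapperClass(GraphMapper.class);
        job.setCombinerClass(GraphConbine.class);
        job.setReducerClass(GraphReduce.class);
        // Data partitioning
        job.setPartitionerClass(GraphPartion.class);
        // Grouping is not needed here
        // job.setCombinerKeyGroupingComparatorClass(GraphComparator.class);
        job.setMapOutputKeyClass(GrapOrderMap.class);
        job.setMapOutputValueClass(GraphValue.class);
        job.setOutputKeyClass(NullWritable.class);
        // The reducer emits Text values, so the output value class must be Text.
        job.setOutputValueClass(Text.class);
        FileSystem fs = FileSystem.get(conf);
        // Pre-select input files: read only files that have been completely written
        // (marked by a companion file ending in .writed) and whose size is greater than 0.
        {
            FileStatus[] childs = fs.globStatus(input, new PathFilter() {
                public boolean accept(Path path) {
                    return path.toString().endsWith(".writed");
                }
            });
            if (childs != null) {
                for (FileStatus file : childs) {
                    // Strip the literal ".writed" suffix to get the data file path.
                    Path temp = new Path(file.getPath().toString().replaceAll("\\.writed$", ""));
                    if (fs.listStatus(temp)[0].getLen() > 0) {
                        CombineTextInputFormat.addInputPath(job, temp);
                    }
                }
            }
        }
        // Maximum combined split size: 64 MB
        CombineTextInputFormat.setMaxInputSplitSize(job, 67108864);
        if (fs.exists(output)) {
            fs.delete(output, true);
        }
        FileOutputFormat.setOutputPath(job, output);
        // Return 0 on success and a non-zero code on failure.
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new CarGraphEdge(), args));
    }
}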
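The input pre-selection in the driver relies on a marker-file convention: a data file is considered complete once a companion file with the `.writed` suffix exists. The sketch below reproduces that rule on a local directory with `java.nio.file` instead of the HDFS `FileSystem` API, so it can be tried without a cluster; the class and method names are hypothetical.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical local-filesystem analogue of the driver's input pre-selection. */
final class MarkerFileSelector {
    static final String MARKER_SUFFIX = ".writed";

    private MarkerFileSelector() {}

    /** Returns the data files in dir that have a marker file and are not empty. */
    static List<Path> completedNonEmptyFiles(Path dir) throws IOException {
        List<Path> result = new ArrayList<>();
        try (DirectoryStream<Path> markers = Files.newDirectoryStream(dir, "*" + MARKER_SUFFIX)) {
            for (Path marker : markers) {
                String name = marker.getFileName().toString();
                // Strip the marker suffix to obtain the data file name.
                String dataName = name.substring(0, name.length() - MARKER_SUFFIX.length());
                Path data = dir.resolve(dataName);
                if (Files.isRegularFile(data) && Files.size(data) > 0) {
                    result.add(data);
                }
            }
        }
        Collections.sort(result);  // deterministic order for callers
        return result;
    }
}
```

A data file without a marker, or an empty data file with a marker, is skipped in both cases.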
GrapOrderMap.java: the composite key, used to sort the records inside each per-vehicle partition in time order.
// Composite key definition
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;

class GrapOrderMap implements WritableComparable<GrapOrderMap> {
    private Text carPlate;
    private Long day;

    public GrapOrderMap() {
        carPlate = new Text();
        day = 0L;
    }

    public GrapOrderMap(Text carPlate, Long day) {
        this.carPlate = carPlate;
        this.day = day;
    }

    // Order by license plate first, then by time within the same plate.
    public int compareTo(GrapOrderMap co) {
        int compareValue = this.carPlate.compareTo(co.carPlate);
        // Same plate: fall back to the time field
        if (compareValue == 0) {
            compareValue = this.day.compareTo(co.day);
        }
        return compareValue;
    }

    // write and readFields must handle the fields in the same order.
    public void write(DataOutput out) throws IOException {
        this.carPlate.write(out);
        out.writeLong(day);
    }

    public void readFields(DataInput in) throws IOException {
        this.carPlate.readFields(in);
        day = in.readLong();
    }

    public Text getCarPlate() {
        return carPlate;
    }

    public void setCarPlate(Text carPlate) {
        this.carPlate = carPlate;
    }

    public Long getDay() {
        return day;
    }

    public void setDay(Long day) {
        this.day = day;
    }

    @Override
    public String toString() {
        return "CarOrder [carPlate=" + carPlate + ", day=" + day + "]";
    }
}
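The ordering that the composite key imposes (plate first, then time) can be tried outside Hadoop with a plain comparator. The sketch below is a stand-in: `PassRecord` and `SecondarySortSketch` are hypothetical names, and `String.compareTo` is used in place of `Text.compareTo`, which compares raw bytes and may order non-ASCII plates differently.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical record: one sighting of a vehicle at a checkpoint. */
final class PassRecord {
    final String carPlate;
    final long time;
    final String kkid;

    PassRecord(String carPlate, long time, String kkid) {
        this.carPlate = carPlate;
        this.time = time;
        this.kkid = kkid;
    }
}

final class SecondarySortSketch {
    private SecondarySortSketch() {}

    /** Same rule as the composite key: plate first, then time. */
    static final Comparator<PassRecord> BY_PLATE_THEN_TIME =
            Comparator.comparing((PassRecord r) -> r.carPlate)
                      .thenComparingLong(r -> r.time);

    /** Returns a sorted copy so the caller's list is left untouched. */
    static List<PassRecord> sorted(List<PassRecord> records) {
        List<PassRecord> copy = new ArrayList<>(records);
        copy.sort(BY_PLATE_THEN_TIME);
        return copy;
    }
}
```

After sorting, all sightings of one plate are contiguous and in time order, which is what the later cleaning steps depend on.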
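GraphValue.java: the composite value. Because the Reduce phase cleans the time-ordered data by time granularity, every record must carry its original passing time; the time in the composite key is used only for sorting.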
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

class GraphValue implements Writable, Comparable<GraphValue> {
    private String kkid;  // checkpoint ID
    private Long time;    // original passing time

    public GraphValue() {
        kkid = "";
        time = 0L;
    }

    public GraphValue(String kkid, Long time) {
        this.kkid = kkid;
        this.time = time;
    }

    public void write(DataOutput out) throws IOException {
        out.writeUTF(kkid);
        out.writeLong(time);
    }

    public void readFields(DataInput in) throws IOException {
        kkid = in.readUTF();
        time = in.readLong();
    }

    public String getKkid() {
        return kkid;
    }

    public void setKkid(String kkid) {
        this.kkid = kkid;
    }

    public Long getTime() {
        return time;
    }

    public void setTime(Long time) {
        this.time = time;
    }

    public int compareTo(GraphValue o) {
        return this.kkid.compareTo(o.getKkid());
    }

    @Override
    public String toString() {
        return "GraphValue [kkid=" + kkid + ", time=" + time + "]";
    }
}
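The value class serializes its two fields with `writeUTF` followed by `writeLong` and reads them back in the same order. A minimal round-trip sketch using only `java.io` data streams (the class name `ValueCodecSketch` and the `kkid@time` return format are made up for the illustration) shows the byte layout this produces:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

/** Hypothetical round trip mirroring the write/readFields field order. */
final class ValueCodecSketch {
    private ValueCodecSketch() {}

    static byte[] encode(String kkid, long time) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (DataOutputStream out = new DataOutputStream(bytes)) {
                out.writeUTF(kkid);   // 2-byte length prefix + modified UTF-8 bytes
                out.writeLong(time);  // 8 bytes, big-endian
            }
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads the fields back in the order they were written. */
    static String decode(byte[] data) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            String kkid = in.readUTF();
            long time = in.readLong();
            return kkid + "@" + time;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Reading the fields in a different order than they were written would misinterpret the bytes, which is why `write` and `readFields` have to stay in sync.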
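GraphPartion.java: the partition function, which partitions the data by license plate. One open question: a city can have close to a million vehicles, which at first glance suggests a very large number of partitions, and the possible negative effects of that have not been considered yet. Note, however, that the number of partitions equals the number of reduce tasks (numPartitions); the hash only spreads the plates across them, so many plates share one partition.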
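import org.apache.hadoop.mapreduce.Partitioner;

class GraphPartion extends Partitioner<GrapOrderMap, GraphValue> {
    @Override
    public int getPartition(GrapOrderMap key, GraphValue value, int numPartitions) {
        // hashCode() may be negative; mask the sign bit so the partition index is never negative.
        return (key.getCarPlate().hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}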
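In Java the `%` operator keeps the sign of the dividend, so a negative `hashCode()` would yield a negative partition index. The sketch below isolates the masking idiom on plain strings so it can be checked without Hadoop; `PlatePartitionSketch` is a hypothetical name.

```java
/** Hypothetical stand-alone version of the plate-to-partition mapping. */
final class PlatePartitionSketch {
    private PlatePartitionSketch() {}

    static int partitionFor(String carPlate, int numPartitions) {
        // Clear the sign bit first; the result is then always in [0, numPartitions).
        return (carPlate.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}
```

Because the result depends only on the plate, every record of one vehicle lands in the same partition regardless of its timestamp.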
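GraphConbine.java: the Combine function, which pre-processes the data on the Mapper side. Its main job is to remove checkpoints that repeat consecutively in the time series, i.e. checkpoint deduplication.
// Deduplication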
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.mapreduce.Reducer;

class GraphConbine extends Reducer<GrapOrderMap, GraphValue, GrapOrderMap, GraphValue> {
    @Override
    protected void reduce(GrapOrderMap key, Iterable<GraphValue> values, Context context)
            throws IOException, InterruptedException {
        // Remove consecutive repeats of the same checkpoint.
        // Hadoop reuses the value object while iterating, so copy each value first.
        List<GraphValue> graphValues = new ArrayList<GraphValue>();
        for (GraphValue value : values) {
            graphValues.add(new GraphValue(value.getKkid(), value.getTime()));
        }
        GraphValue pre = null;
        for (GraphValue value : graphValues) {
            if (pre == null) {
                pre = value;
                // Correct the key's time to the real passing time
                key.setDay(value.getTime());
                context.write(key, value);
                continue;
            }
            // Emit only when the checkpoint differs from the previous one
            if (!pre.getKkid().equals(value.getKkid())) {
                context.write(key, value);
            }
            // Required: always advance the previous-value pointer
            pre = value;
        }
    }
}
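The deduplication rule itself is independent of MapReduce. As a rough illustration, here it is applied to a plain list of checkpoint IDs that is already in time order; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-alone version of the consecutive-repeat filter. */
final class ConsecutiveDedupSketch {
    private ConsecutiveDedupSketch() {}

    /** Keeps an ID only when it differs from the ID immediately before it. */
    static List<String> dropConsecutiveRepeats(List<String> kkids) {
        List<String> out = new ArrayList<>();
        String previous = null;
        for (String kkid : kkids) {
            if (!kkid.equals(previous)) {
                out.add(kkid);
            }
            previous = kkid;  // always advance, whether or not the ID was kept
        }
        return out;
    }
}
```

Only adjacent repeats are removed: a vehicle that passes A, then B, then A again still yields three entries, since the return to A is a genuine transition.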
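GraphReduce.java: the raw data set may span a long period. For example, the last checkpoint seen during the morning commute and the first checkpoint seen on the way home in the evening would end up adjacent, and such a pair should not count as reachable. The Reduce phase therefore cleans the data by time granularity; the granularity is passed in through the job configuration. The final output format is (startCheckpoint_endCheckpoint 1), where 1 marks one occurrence. The number of occurrences is used during graph construction in Spark to represent the traffic volume for that pair of checkpoints.
// Time-granularity cleaning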
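import java.io.IOException;
import java.util.Queue;
import java.util.concurrent.ArrayBlockingQueue;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

class GraphReduce extends Reducer<GrapOrderMap, GraphValue, NullWritable, Text> {
    private int granularity;               // in minutes
    private Queue<GraphValue> queue = null;
    private String currentPlate = null;

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        queue = new ArrayBlockingQueue<GraphValue>(2);
        granularity = context.getConfiguration().getInt(CarGraphEdge.GRANULARITY, 30);
    }

    @Override
    protected void reduce(GrapOrderMap key, Iterable<GraphValue> values, Context context)
            throws IOException, InterruptedException {
        // The queue carries the previous sighting across reduce() calls, so reset it
        // when the plate changes; otherwise two different vehicles could be paired.
        String plate = key.getCarPlate().toString();
        if (!plate.equals(currentPlate)) {
            queue.clear();
            currentPlate = plate;
        }
        GraphValue temp = null;
        for (GraphValue gh : values) {
            queue.add(new GraphValue(gh.getKkid(), gh.getTime()));
            if (queue.size() >= 2) {
                // Repeated checkpoint
                if (queue.peek().getKkid().equals(gh.getKkid())) {
                    queue.poll();
                } else {
                    // Granularity cleaning: a gap of N minutes or more is invalid
                    temp = queue.poll();
                    // Use long arithmetic to avoid int overflow for large granularities.
                    if (queue.peek().getTime() - temp.getTime() < 60_000L * granularity) {
                        context.write(NullWritable.get(),
                                new Text(temp.getKkid() + "_" + queue.peek().getKkid() + "\t1"));
                    }
                }
                while (queue.size() > 1) {
                    queue.poll();
                }
            }
        }
    }
}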
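Each reducer output line has the form `start_end<TAB>1`, and the occurrences are summed later to obtain a per-edge count. A small plain-Java sketch of that aggregation, assuming purely numeric checkpoint IDs as in the sample result, might look like this (the class name `EdgeCountSketch` is hypothetical):

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical aggregation of reducer output lines into per-edge counts. */
final class EdgeCountSketch {
    private EdgeCountSketch() {}

    static Map<String, Integer> countEdges(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            // Skip anything that is not "digits_digits<TAB>digits".
            if (!line.matches("[0-9]+_[0-9]+\t[0-9]+")) {
                continue;
            }
            String[] parts = line.split("\t");
            // parts[0] is the edge key "start_end", parts[1] is the occurrence count.
            counts.merge(parts[0], Integer.parseInt(parts[1]), Integer::sum);
        }
        return counts;
    }
}
```

Feeding it the line `1004_1007<TAB>1` twice gives a count of 2 for the edge `1004_1007`, which corresponds to two observed trips between those checkpoints.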
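Spark stage: load the vertex set, load the edge set, preprocess the edge set, and build the graph object with Spark's graph-processing framework. This part mainly shows the reachability between checkpoints; other algorithms will appear in later chapters as I keep learning. See the code comments for details.
App.scala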
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph}

object App extends App {
  val conf = new SparkConf
  conf.setAppName("TVC_GRAPH").setMaster("local[4]")
  val sc = new SparkContext(conf)
  // Vertex set: one numeric checkpoint ID per line
  val vertices = sc.textFile("hdfs://host218:8020/zjf/output6/part-r-00000", 3)
    .filter { item => !item.equals("") && item.matches("[0-9]+?") }
    .map { item => (item.toLong, item) }
  // Edge set: lines of the form "start_end<TAB>1"
  val edges = sc.textFile("hdfs://host218:8020/zjf/output7/part-r-00000", 3)
    .filter { item => !item.equals("") && item.matches("[0-9]+?_[0-9]+?\t[0-9]+?") }
    .map { item =>
      val args1 = item.split("\t")
      val args2 = args1(0).split("_")
      ((args2(0).toLong, args2(1).toLong), 1)
    }
    // Sum the occurrences of each (start, end) pair to get the edge weight
    .reduceByKey((it1, it2) => it1 + it2)
    .map(item => Edge(item._1._1, item._1._2, item._2))
  // Build the graph
  val graph: Graph[String, Int] = Graph(vertices, edges, "")
  // Edge with the highest count
  val maxCount = graph.edges.reduce((item1, item2) => {
    if (item1.attr > item2.attr) { item1 } else { item2 }
  })
  // Vertex array
  val ss = vertices.map(item => item._1).collect()
  // Edge tuples, collected into a set for constant-time lookup
  val zz = edges.map { item => (item.srcId.toLong, item.dstId.toLong) }.collect().toSet
  val arrs = Array.ofDim[Long](ss.length, ss.length)
  for (i <- 0 until ss.length) {
    for (j <- 0 until ss.length) {
      if (zz.contains((ss(i), ss(j)))) {
        arrs(i)(j) = 1
      }
    }
  }
  /* Traffic-flow reachability matrix */
  ss.foreach { item => print("\t" + item) }
  println()
  for (i <- 0 until ss.length) {
    print(ss(i) + "\t")
    for (j <- 0 until ss.length) {
      print(arrs(i)(j) + "\t")
    }
    println()
  }
}
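The matrix-filling step at the end of the Spark program does not depend on Spark once the vertices and edges have been collected. A plain-Java sketch of the same idea, indexing vertices through a map instead of scanning the edge list for every cell, is given below. The class name is hypothetical, and the matrix records direct (single-hop) reachability only, as in the program above.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-alone version of the matrix-filling step. */
final class ReachabilityMatrixSketch {
    private ReachabilityMatrixSketch() {}

    /**
     * @param vertexIds checkpoint IDs; row/column i corresponds to vertexIds[i]
     * @param edges     directed edges as {sourceId, destinationId} pairs
     */
    static int[][] build(long[] vertexIds, long[][] edges) {
        Map<Long, Integer> index = new HashMap<>();
        for (int i = 0; i < vertexIds.length; i++) {
            index.put(vertexIds[i], i);
        }
        int[][] matrix = new int[vertexIds.length][vertexIds.length];
        for (long[] edge : edges) {
            Integer row = index.get(edge[0]);
            Integer col = index.get(edge[1]);
            // Ignore edges whose endpoints are not in the vertex set.
            if (row != null && col != null) {
                matrix[row][col] = 1;
            }
        }
        return matrix;
    }
}
```

Each edge is handled once, so the cost is proportional to the number of vertices plus the number of edges rather than to the number of matrix cells times the number of edges.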
Reachability matrix result (partial):
Checkpoint ID | 1001 | 1003 | 1004 | 1006 | 1007 | 1008 | 1009 | 1010 | …
1001 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
1003 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
1004 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |
1006 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
1007 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
1008 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
… |