Spark IPmapping (ID Mapping) Scheme

Latest recommended article published on 2023-04-13 11:42:35

房石阳明i

Article link: https://blog.csdn.net/mogeko1/article/details/130012154

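A common approach takes the identifier fields in the data (uid, imei, imsi, mac, androidid, uuid, and so on) and picks one of them by priority as the record's unique user identifier. This has a serious flaw. If the user logs in on the first day, uid is chosen; if the user does not log in on the second day, imei is chosen. Are the two records the same person? The priority rule alone cannot tell.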
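The flaw can be made concrete with a minimal sketch. The `Record` type, the priority order, and the sample values are hypothetical and only serve to illustrate the point; they are not taken from any real log schema.

```scala
object PriorityPick {
  // Hypothetical record type: every identifier is optional.
  final case class Record(uid: Option[String], imei: Option[String], mac: Option[String])

  // Pick the first identifier that is present, in a fixed priority order.
  def pickId(r: Record): Option[String] =
    r.uid.orElse(r.imei).orElse(r.mac)

  def main(args: Array[String]): Unit = {
    // Same device on both days; the user is logged in only on day 1.
    val day1 = Record(uid = Some("u1001"), imei = Some("860000000000001"), mac = None)
    val day2 = Record(uid = None, imei = Some("860000000000001"), mac = None)
    println(pickId(day1)) // Some(u1001)
    println(pickId(day2)) // Some(860000000000001)
  }
}
```

The two records share an imei, yet the rule hands back two different identifiers, so downstream counts would treat them as two users.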

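In real log data, users reach the product through all kinds of devices and entry points, and a single person may own several devices and use several front ends. As a result, records for the same person collected at different times carry different numbers and kinds of identifiers.
Phones and tablets
        Android, iOS, and Windows Phone handsets
        many versions of Android
        many versions of iOS as well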

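Common user and device identifiers (there is no easy rule for picking one of them as the unique identifier):
        mac: network interface hardware address
        imei: International Mobile Equipment Identity, the handset's serial number
        imsi: International Mobile Subscriber Identity, stored on the SIM card
        androidid: id assigned by the Android system
        openuuid: serial number generated by the app itself; it changes when the app is uninstalled and reinstalled
        idfa: advertising tracking identifier
        deviceid: a logical id defined by the developers who instrument the app's log collection
Among all these kinds of ids, working out which ones belong to the same device is hard to do with an ordinary WHERE clause.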

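Solution 1: Redis
        Extract the various identifier ids from the log data.
        Look up each extracted id in Redis.
        If none of them exists, create a new guid.
        If one exists, use the existing guid.
        Problem: some records may belong to the same person but share no identifier within any single period, so that person's data may end up tagged with two different guids.
        Remedy: periodically consolidate and merge the id mapping store in Redis. Merging has its own risk, though: two different people may be tagged as the same person.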
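The lookup steps and the split-identity problem above can be sketched as follows. This is an illustration only: an in-memory mutable map stands in for Redis, and the names `store` and `resolve` and the sample ids are hypothetical.

```scala
import java.util.UUID
import scala.collection.mutable

object RedisStyleLookup {
  // Stand-in for the Redis id -> guid store (assumption: a plain in-memory map).
  val store: mutable.Map[String, String] = mutable.Map.empty

  // Return the guid for a record's ids, creating one if none of the ids is known.
  def resolve(ids: Seq[String]): String = {
    val known = ids.flatMap(id => store.get(id)).headOption
    val guid = known.getOrElse(UUID.randomUUID().toString)
    // Register ids that are not yet in the store; ids already present keep their guid.
    ids.foreach(id => store.getOrElseUpdate(id, guid))
    guid
  }

  def main(args: Array[String]): Unit = {
    val g1 = resolve(Seq("imei-A"))          // period 1: only the imei is seen
    val g2 = resolve(Seq("uid-7"))           // period 2: only the uid is seen
    val g3 = resolve(Seq("imei-A", "uid-7")) // period 3: both appear in one record
    println(g1 == g2) // false: the same person already holds two guids
    println(g3 == g1) // true: the first known id wins, while uid-7 still maps to g2
  }
}
```

By the time the third record reveals that both ids belong together, the store already holds two guids for one person, which is exactly the situation the periodic merge has to clean up.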

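ID mapping with graph computation, step by step:
        Build a vertex set and an edge set from all identifier ids in the current day's data and the links between the identifier fields.
        Build a vertex set and an edge set from the previous day's ids -> guid mapping as well.
        Merge the two vertex sets and the two edge sets into a single graph.
        Run the connected components algorithm on that graph to obtain the connected subgraphs.
        From the result graph, read off which ids belong to the same group and generate a unique identifier for each group.
        Compare the generated identifiers with the previous day's ids -> guid mapping table: if a person already has a guid, keep using the original guid.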
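The steps above can be sketched without Spark. The following uses a small union-find in plain Scala in place of the GraphX connected components algorithm, purely to show the data flow. The object and function names, the `g-` prefix for new guids, and the sample ids are all hypothetical.

```scala
import scala.collection.mutable

object GraphIdMapping {
  // Union-find over string ids; returns id -> representative of its component.
  def components(edges: Seq[(String, String)]): Map[String, String] = {
    val parent = mutable.Map.empty[String, String]
    def find(x: String): String = {
      val p = parent.getOrElseUpdate(x, x)
      if (p == x) x else { val r = find(p); parent(x) = r; r }
    }
    edges.foreach { case (a, b) =>
      val (ra, rb) = (find(a), find(b))
      if (ra != rb) parent(ra) = rb
    }
    parent.keys.map(k => k -> find(k)).toMap
  }

  // Assign one guid per component, reusing yesterday's guid when a member already has one.
  def assign(todayEdges: Seq[(String, String)], yesterday: Map[String, String]): Map[String, String] = {
    // Ids that shared a guid yesterday are chained together with edges.
    val oldEdges = yesterday.groupBy(_._2).values.flatMap { m =>
      val ids = m.keys.toSeq.sorted
      ids.zip(ids.tail)
    }.toSeq
    // Self-edges keep ids that stood alone yesterday in the graph.
    val selfEdges = yesterday.keys.toSeq.map(id => (id, id))
    val comp = components(todayEdges ++ oldEdges ++ selfEdges)
    comp.groupBy(_._2).values.flatMap { members =>
      val ids = members.keys.toSeq.sorted
      // Reuse an existing guid if any member has one, otherwise derive a new one.
      val guid = ids.flatMap(id => yesterday.get(id)).sorted.headOption.getOrElse("g-" + ids.head)
      ids.map(_ -> guid)
    }.toMap
  }

  def main(args: Array[String]): Unit = {
    val yesterday = Map("imei-A" -> "guid-1", "mac-A" -> "guid-1")
    val today = Seq(("imei-A", "uid-7"), ("imei-B", "uid-9"))
    val mapping = assign(today, yesterday)
    println(mapping("uid-7")) // guid-1
    println(mapping("uid-9")) // g-imei-B
  }
}
```

Note that when a component joins ids that carried two different guids yesterday, this sketch simply keeps the lexicographically smallest one. How such merges should be handled in production is a design decision the steps above leave open.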

Sample data:

13866558899,zhangsan,wx_hq,2000
13877559988,sange,wx_hq,3000
,zhangsan,wx_lhq,5000
13912344678,lisi,wx_lxlt,12000
13912344678,lisi-1,wx_wk,3500
13912664678,lisi-2,wx_wk,5600

Add the following dependency to pom.xml:

  <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-graphx_2.11</artifactId>
            <version>2.2.0</version>
  </dependency>

Complete pom.xml file:

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>org.example</groupId>
    <artifactId>SparkIPMappingDemo</artifactId>
    <version>1.0-SNAPSHOT</version>
    <name>scala-demo-project</name>
    <!-- FIXME change it to the project's website -->
    <url>http://www.example.com</url>

    <properties>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
        <maven.compiler.source>1.8</maven.compiler.source>
        <maven.compiler.target>1.8</maven.compiler.target>
    </properties>

    <dependencies>
        <dependency>
            <groupId>org.scala-lang</groupId>
            <artifactId>scala-library</artifactId>
            <version>2.11.8</version>
        </dependency>
        <dependency>
            <groupId>org.apache.kafka</groupId>
            <artifactId>kafka_2.11</artifactId>
            <version>0.11.0.2</version>
        </dependency>
        <dependency>
            <groupId>org.apache.commons</groupId>
            <artifactId>commons-dbcp2</artifactId>
            <version>2.1.1</version>
        </dependency>
        <dependency>
            <groupId>mysql</groupId>
            <artifactId>mysql-connector-java</artifactId>
            <version>5.1.47</version>
        </dependency>

        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-graphx_2.11</artifactId>
            <version>2.2.0</version>
        </dependency>

        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_2.11</artifactId>
            <version>2.2.0</version>
        </dependency>


        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming_2.11</artifactId>
            <version>2.2.0</version>
            <!--<scope>provided</scope>-->
        </dependency>

        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming-kafka-0-10_2.11</artifactId>
            <version>2.2.0</version>
        </dependency>


        <!-- https://mvnrepository.com/artifact/org.scalikejdbc/scalikejdbc -->
        <dependency>
            <groupId>org.scalikejdbc</groupId>
            <artifactId>scalikejdbc_2.11</artifactId>
            <version>3.1.0</version>
        </dependency>
        <!-- https://mvnrepository.com/artifact/org.scalikejdbc/scalikejdbc-config -->
        <dependency>
            <groupId>org.scalikejdbc</groupId>
            <artifactId>scalikejdbc-config_2.11</artifactId>
            <version>3.1.0</version>
        </dependency>

        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-hive_2.11</artifactId>
            <version>2.2.0</version>
        </dependency>
        <dependency>
            <groupId>com.alibaba</groupId>
            <artifactId>fastjson</artifactId>
            <version>1.2.69</version>
        </dependency>
        <dependency>
            <groupId>ch.hsr</groupId>
            <artifactId>geohash</artifactId>
            <version>1.3.0</version>
        </dependency>

        <dependency>
            <groupId>org.mongodb.spark</groupId>
            <artifactId>mongo-spark-connector_2.11</artifactId>
            <version>2.2.0</version>
        </dependency>



        <dependency>
            <groupId>com.alibaba</groupId>
            <artifactId>druid</artifactId>
            <version>1.1.10</version>
        </dependency>


        <dependency>
            <groupId>redis.clients</groupId>
            <artifactId>jedis</artifactId>
            <version>2.9.3</version>
        </dependency>

        <dependency>
            <groupId>org.apache.logging.log4j</groupId>
            <artifactId>log4j-api-scala_2.11</artifactId>
            <version>11.0</version>
        </dependency>

    </dependencies>


    <build>
        <plugins>
            <plugin>
                <groupId>net.alchim31.maven</groupId>
                <artifactId>scala-maven-plugin</artifactId>
                <version>3.2.2</version>
                <executions>
                    <execution>
                        <goals>
                            <goal>compile</goal>
                            <goal>testCompile</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-shade-plugin</artifactId>
                <version>2.4.3</version>
                <executions>
                    <execution>
                        <phase>package</phase>
                        <goals>
                            <goal>shade</goal>
                        </goals>
                        <configuration>
                            <transformers>
                                <transformer
                                        implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
                                    <mainClass>cn.kgc.kafak.demo.ThreadProducer</mainClass>
                                </transformer>
                            </transformers>
                            <filters>
                                <filter>
                                    <artifact>*:*</artifact>
                                    <excludes>
                                        <exclude>META-INF/*.SF</exclude>
                                        <exclude>META-INF/*.DSA</exclude>
                                        <exclude>META-INF/*.RSA</exclude>
                                    </excludes>
                                </filter>
                            </filters>
                        </configuration>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>


</project>

Spark IPMappingDemo

import org.apache.commons.lang3.StringUtils
import org.apache.spark.graphx.{Edge, Graph, VertexId, VertexRDD}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Dataset, SparkSession}

object SparkDemo {
  def main(args: Array[String]): Unit = {

    val spark = SparkSession.builder().appName(this.getClass.getSimpleName).master("local[*]").getOrCreate()
    import spark.implicits._

    // Load the raw data
    val ds: Dataset[String] = spark.read.textFile("file:///F:/JavaTest/SparkIPMappingDemo/data/graphx/input")

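    // Build the vertex RDD
    val vertices: RDD[(Long, String)] = ds.rdd.flatMap(line => {
      val fields = line.split(",")

      // In the Spark GraphX API a vertex is a tuple: (unique Long id of the vertex, vertex data)
      // Note: String.hashCode is only 32 bits wide, so distinct ids can collide on large data sets
      for(ele <- fields if StringUtils.isNotBlank(ele)) yield (ele.hashCode.toLong,ele)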

      /*Array((fields(0).hashCode.toLong, fields(0)),
        (fields(1).hashCode.toLong, fields(1)),
        (fields(2).hashCode.toLong, fields(2))
      )*/

    })

    // Build the edge RDD
    // GraphX describes an edge as: Edge(source vertex id, destination vertex id, edge data)
    val edges: RDD[Edge[String]] = ds.rdd.flatMap(line => {
      val fields = line.split(",")

      /*val lst = new ListBuffer[Edge[String]]()
      for (i <- 0 to fields.length - 2) {
        val edge1 = Edge(fields(i).hashCode.toLong, fields(i + 1).hashCode.toLong, "")
        lst += edge1
      }
      lst*/

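      // Link each field to the next one; skip a pair when either side is blank
      for (i <- 0 to fields.length - 2 if StringUtils.isNotBlank(fields(i)) && StringUtils.isNotBlank(fields(i + 1))) yield Edge(fields(i).hashCode.toLong, fields(i + 1).hashCode.toLong, "")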
    })
/*    edges.toDF().show()
//    +-----------+-----------+----+
//    |      srcId|      dstId|attr|
//    +-----------+-----------+----+
//    |  149291990|-1432604556|    |
//    |-1432604556|  113568551|    |
//    |  113568551|    1537214|    |
//    |-1515330570|  109203646|    |
//    |  109203646|  113568551|    |
//    |  113568551|    1567005|    |
//    |-1432604556| -774338537|    |
//    | -774338537|    1626587|    |
//    |-1095629956|    3322003|    |
//    |    3322003| 1765324466|    |
//    | 1765324466|   46789743|    |
//    |-1095629956|-1102520969|    |
//    |-1102520969|  113569010|    |
//    |  113569010|    1571810|    |
//    |-1007895461|-1102520968|    |
//    |-1102520968|  113569010|    |
//    |  113569010|    1632353|    |
//    +-----------+-----------+----+ */


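    // Build a graph from the vertex set and the edge set
    val graph = Graph(vertices,edges)

    // Run a graph algorithm: connected components
    val graph2 = graph.connectedComponents()
    // The vertex set of the result graph gives the grouping we want
    val vertices2: VertexRDD[VertexId] = graph2.vertices
    // (vertex id 0, component id 0)
    // (vertex id 1, component id 0)
    // (vertex id 4, component id 4)
    // (vertex id 5, component id 4)
    /** Each vertex is labeled with the smallest vertex id in its connected component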
     * (-1102520968,-1102520969)
     * (149291990,-1515330570)
     * (1567005,-1515330570)
     * (-1095629956,-1102520969)
     * (113569010,-1102520969)
     * (-1515330570,-1515330570)
     * (-774338537,-1515330570)
     * (113568551,-1515330570)
     * (1765324466,-1102520969)
     * (1571810,-1102520969)
     * (-1432604556,-1515330570)
     * (3322003,-1102520969)
     * (46789743,-1102520969)
     * (-1102520969,-1102520969)
     * (1632353,-1102520969)
     * (1626587,-1515330570)
     * (109203646,-1515330570)
     * (1537214,-1515330570)
     * (-1007895461,-1102520969)
     */
    //vertices2.take(30).foreach(println)


    // Collect the vertex-to-component mapping RDD to the driver
    val idmpMap = vertices2.collectAsMap()
    // Then broadcast it to the executors
    val bc = spark.sparkContext.broadcast(idmpMap)

    // Use the mapping to enrich the raw data
    val res = ds.map(line=>{
      val bc_map = bc.value
      val name = line.split(",").filter(StringUtils.isNotBlank(_))(0)
      val gid = bc_map.get(name.hashCode.toLong).get

      gid+","+line
    })
    res.show(10,false)

    /**
     * +-------------------------------------------+
     * |value                                      |
     * +-------------------------------------------+
     * |-1515330570,13866558899,zhangsan,wx_hq,2000|
     * |-1515330570,13877559988,sange,wx_hq,3000   |
     * |-1515330570,,zhangsan,wx_lhq,5000          |
     * |-1102520969,13912344678,lisi,wx_lxlt,12000 |
     * |-1102520969,13912344678,lisi-1,wx_wk,3500  |
     * |-1102520969,13912664678,lisi-2,wx_wk,5600  |
     * +-------------------------------------------+
     **/

    spark.close()
  }
}
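One point worth noting about the demo: it derives vertex ids with `String.hashCode`, which is 32 bits wide, so two different strings can map to the same vertex. A possible alternative, shown here only as a sketch and not as part of the original demo, is to take the first 8 bytes of an MD5 digest as a 64-bit id. The helper name `vertexId` is hypothetical.

```scala
import java.nio.ByteBuffer
import java.nio.charset.StandardCharsets
import java.security.MessageDigest

object VertexIds {
  // Derive a 64-bit vertex id from the first 8 bytes of the MD5 digest of the string.
  def vertexId(s: String): Long = {
    val digest = MessageDigest.getInstance("MD5").digest(s.getBytes(StandardCharsets.UTF_8))
    ByteBuffer.wrap(digest).getLong
  }

  def main(args: Array[String]): Unit = {
    // "Aa" and "BB" are a well-known String.hashCode collision.
    println("Aa".hashCode == "BB".hashCode) // true
    println(vertexId("Aa") == vertexId("BB"))
  }
}
```

In the demo this would mean replacing `ele.hashCode.toLong` and `fields(i).hashCode.toLong` with calls to such a helper in the vertex RDD, the edge RDD, and the final lookup. A 64-bit digest prefix makes collisions far less likely but does not rule them out.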

Error: ERROR Shell: Failed to locate the winutils binary in the hadoop binary path
java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries.

Fix: copy the four files core-site.xml, hdfs-site.xml, mapred-site.xml, and yarn-site.xml from the Hadoop server into src\main\resources of the IDEA project.
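The `null` in `null\bin\winutils.exe` indicates that Hadoop could not determine its home directory. A commonly used workaround on Windows, sketched here under assumptions, is to set the `hadoop.home.dir` system property (or the `HADOOP_HOME` environment variable) to a directory that contains `bin\winutils.exe` before the SparkSession is created. The path `C:\hadoop` below is hypothetical, and `winutils.exe` has to be obtained separately.

```scala
object WinutilsSetup {
  // Hypothetical location; this directory must contain bin\winutils.exe.
  val hadoopHome = "C:\\hadoop"

  // Set hadoop.home.dir only if it has not been provided already.
  def configure(): Unit = {
    if (System.getProperty("hadoop.home.dir") == null) {
      System.setProperty("hadoop.home.dir", hadoopHome)
    }
  }

  def main(args: Array[String]): Unit = {
    configure()
    println(System.getProperty("hadoop.home.dir"))
  }
}
```

In the demo, `WinutilsSetup.configure()` would be called as the first statement of `main`, ahead of `SparkSession.builder()`.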