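Real-time statistics over MySQL data with Spark Streaming

1. SparkStreamingDemo

This demo needs the Spark and JDBC dependencies, which are declared in pom.xml as shown below. (For setting up a Maven-managed Spark project, see "Building a Maven-managed Spark project with IDEA".)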

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>com.us.demo</groupId>
    <artifactId>mySpark</artifactId>
    <version>1.0-SNAPSHOT</version>

    <properties>
        <spark.version>2.0.2</spark.version>
        <scala.version>2.11</scala.version>
    </properties>


    <dependencies>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_${scala.version}</artifactId>
            <version>${spark.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming_${scala.version}</artifactId>
            <version>${spark.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_${scala.version}</artifactId>
            <version>${spark.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-hive_${scala.version}</artifactId>
            <version>${spark.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-mllib_${scala.version}</artifactId>
            <version>${spark.version}</version>
        </dependency>

        <!-- JDBC-->

        <dependency>
            <groupId>mysql</groupId>
            <artifactId>mysql-connector-java</artifactId>
            <version>5.1.12</version>
        </dependency>

    </dependencies>

    <build>
        <plugins>

            <plugin>
                <groupId>org.scala-tools</groupId>
                <artifactId>maven-scala-plugin</artifactId>
                <version>2.15.2</version>
                <executions>
                    <execution>
                        <goals>
                            <goal>compile</goal>
                            <goal>testCompile</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>

            <plugin>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>3.6.0</version>
                <configuration>
                    <source>1.8</source>
                    <target>1.8</target>
                </configuration>
            </plugin>

            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-surefire-plugin</artifactId>
                <version>2.19</version>
                <configuration>
                    <skip>true</skip>
                </configuration>
            </plugin>

        </plugins>
    </build>

</project>

The code for SparkStreamingDemo follows; I have tried to comment it line by line:

import java.sql.{DriverManager, ResultSet}

import scala.collection.mutable.Queue
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}

/**
  * Created by yangyibo on 16/11/23.
  */
object SparkStreamingDemo {
  def main(args: Array[String]) {

    // Create the Spark configuration
    val sparkConf = new SparkConf().setAppName("QueueStream")
    sparkConf.setMaster("local")
    // Create the StreamingContext; Seconds(3) is the batch interval, i.e. how often a batch is taken from the RDD queue.
    val ssc = new StreamingContext(sparkConf, Seconds(3))

    // Create the queue through which RDDs can be pushed to a QueueInputDStream
    val rddQueue = new Queue[RDD[String]]()
    // Read the input stream from the RDD queue
    val inputStream = ssc.queueStream(rddQueue)
    // Append the character "a" to every element of the input stream (each element is a String) and pair it with 1, which yields a new DStream of (String, Int).
    val mappedStream = inputStream.map(x => (x + "a", 1))
    // reduceByKey(_ + _) counts the occurrences of each element within a batch. map(x => (x._2, x._1)) swaps key and value. The two filters then keep only entries that occur more than once and whose String equals "testa".
    val reducedStream = mappedStream.reduceByKey(_ + _)
      .map(x => (x._2, x._1)).filter(x => x._1 > 1).filter(x => x._2 == "testa")
    reducedStream.print()
    // Save the result of every batch under the prefix ./out/resulted.
    reducedStream.saveAsTextFiles("./out/resulted")
    ssc.start()

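    // Query the name of every user from the database. The result is a Seq[String], because makeRDD builds an RDD from a Seq.
    val seq = conn()
    println(seq)
    // Turn seq into an RDD and push it onto the RDD queue of the stream as input.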
    for (i <- 1 to 3) {

      rddQueue.synchronized {
        rddQueue += ssc.sparkContext.makeRDD(seq,10)
      }
      Thread.sleep(3000)
    }
    ssc.stop()
  }


  // Fetch the name of every user from the database as a Seq[String]
  def conn(): Seq[String] = {
    val user = "root"
    val password = "admin"
    val host = "localhost"
    val database = "msm"
    val conn_str = "jdbc:mysql://" + host + ":3306/" + database + "?user=" + user + "&password=" + password
    // Load the MySQL JDBC driver
    Class.forName("com.mysql.jdbc.Driver").newInstance()
    val conn = DriverManager.getConnection(conn_str)
    // Accumulator for the names, initially empty
    var setName = Seq.empty[String]
    try {
      // Configure to be Read Only
      val statement = conn.createStatement(ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY)

      // Execute the query; sec_user is my user table and has a name column.
      val rs = statement.executeQuery("select * from sec_user")
      // Iterate over the ResultSet
      while (rs.next) {
        // Print the row number
        // println(rs.getRow)
        val name = rs.getString("name")
        setName = setName :+ name
      }
      setName
    }
    finally {
      conn.close()
    }
  }
}
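
To see what a single batch of the stream computes, the same chain of operations can be mimicked on a plain Scala collection, with no Spark involved. This is only a sketch: `BatchLogicSketch`, `processBatch` and the sample names are made up for illustration, and `groupBy` followed by a sum stands in for Spark's `reduceByKey`.

```scala
object BatchLogicSketch {

  // Mirrors the DStream pipeline for one batch (hypothetical helper, not a Spark API):
  // append "a", count per key, emit (count, key), keep "testa" with count > 1.
  def processBatch(names: Seq[String]): Seq[(Int, String)] =
    names
      .map(x => (x + "a", 1))
      .groupBy(_._1)                                   // stand-in for reduceByKey
      .toSeq
      .map { case (key, pairs) => (pairs.map(_._2).sum, key) }
      .filter(x => x._1 > 1)
      .filter(x => x._2 == "testa")

  def main(args: Array[String]): Unit = {
    // Assumed sample data: "test" occurs twice, so it passes both filters.
    println(processBatch(Seq("test", "abel", "test")))
    // Here "test" occurs only once, so the count filter drops it.
    println(processBatch(Seq("test", "abel")))
  }
}
```

With the default `queueStream` settings one queued RDD is consumed per batch, so in the demo a result should only show up when the name `test` occurs more than once in `sec_user`.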

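2. Connecting to MySQL from Scala

For connecting to the database through SparkSession instead, see the linked post.

Here is the Scala code for connecting to the database: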

import java.sql.{Connection, DriverManager, ResultSet}

/**
  * Created by yangyibo on 16/11/23.
  */
object DB {

  def main(args: Array[String]) {
    val user = "root"
    val password = "admin"
    val host = "localhost"
    val database = "msm"
    val conn_str = "jdbc:mysql://" + host + ":3306/" + database + "?user=" + user + "&password=" + password
    println(conn_str)
    val conn = connect(conn_str)
    try {
      val statement = conn.createStatement(ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY)
      // Execute Query
      val rs = statement.executeQuery("select * from sec_user")
      // Iterate Over ResultSet
      while (rs.next) {
        // Print the row number
        // println(rs.getRow)
        val name = rs.getString("name")
        println(name)
      }
    } finally {
      // Close the connection even if the query fails
      closeConn(conn)
    }
  }

  def connect(conn_str: String): Connection = {
    // Load the MySQL JDBC driver
    Class.forName("com.mysql.jdbc.Driver").newInstance()
    DriverManager.getConnection(conn_str)
  }

  def closeConn(conn:Connection): Unit ={
    conn.close()
  }

}
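
The code above opens and closes the connection by hand. One common way to make the cleanup harder to forget is a small loan-pattern helper. The sketch below also passes the credentials through `java.util.Properties` rather than the URL query string. `JdbcSketch`, `jdbcUrl`, `credentials`, `withResource` and `readNames` are hypothetical names, not part of JDBC or of the original demo; only `DriverManager.getConnection(url, props)` is standard JDBC API.

```scala
import java.sql.DriverManager
import java.util.Properties

object JdbcSketch {

  // Builds the JDBC URL without embedding credentials (hypothetical helper).
  def jdbcUrl(host: String, port: Int, database: String): String =
    s"jdbc:mysql://$host:$port/$database"

  // JDBC drivers read the "user" and "password" connection properties.
  def credentials(user: String, password: String): Properties = {
    val props = new Properties()
    props.setProperty("user", user)
    props.setProperty("password", password)
    props
  }

  // Loan pattern: run `body` with the resource, then always close it.
  def withResource[R <: AutoCloseable, A](resource: R)(body: R => A): A =
    try body(resource) finally resource.close()

  // Reads the name column of sec_user; needs a reachable MySQL instance.
  def readNames(url: String, props: Properties): Seq[String] =
    withResource(DriverManager.getConnection(url, props)) { conn =>
      withResource(conn.createStatement()) { statement =>
        withResource(statement.executeQuery("select name from sec_user")) { rs =>
          val names = Seq.newBuilder[String]
          while (rs.next()) names += rs.getString("name")
          names.result()
        }
      }
    }
}
```

Assuming the same database as above, `JdbcSketch.readNames(JdbcSketch.jdbcUrl("localhost", 3306, "msm"), JdbcSketch.credentials("root", "admin"))` would return the names that the `DB` object prints.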

Dependency required for connecting to MySQL from Scala

        <!-- JDBC-->

        <dependency>
            <groupId>mysql</groupId>
            <artifactId>mysql-connector-java</artifactId>
            <version>5.1.12</version>
        </dependency>
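
Hello! To feed a Spark Streaming job directly from MySQL, you can follow these steps:

1. First, make sure that Spark and MySQL are both installed correctly and running.

2. To use the MySQL connector in Spark Streaming, the MySQL JDBC driver must be on Spark's classpath. Download the JDBC driver that matches your MySQL version from the official MySQL website and place it in Spark's `jars` directory (`lib` in Spark 1.x).

3. Create a Spark Streaming application and import the required classes. For example, in Scala:

```scala
import java.sql.DriverManager

import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.receiver.Receiver
```

4. Create a `SparkConf` object and set the Spark configuration options you want. For example:

```scala
val conf = new SparkConf().setAppName("SparkStreamingMySQL").setMaster("local[*]")
```

5. Create a `StreamingContext` object and specify the batch interval. For example:

```scala
val ssc = new StreamingContext(conf, Seconds(5))
```

6. Create a DStream that reads data from the MySQL database. This is done with `receiverStream` and a custom receiver, defined in the next step. For example:

```scala
val jdbcUrl = "jdbc:mysql://localhost:3306/your_database"
val jdbcUsername = "your_username"
val jdbcPassword = "your_password"
val dstream = ssc.receiverStream(new MySQLReceiver(jdbcUrl, jdbcUsername, jdbcPassword))
```

7. In the custom `MySQLReceiver` class, use JDBC to read rows from MySQL and hand them to Spark with `store`. For example:

```scala
class MySQLReceiver(jdbcUrl: String, jdbcUsername: String, jdbcPassword: String)
    extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

  def onStart(): Unit = {
    // Read on a separate thread so that onStart() returns immediately.
    new Thread("MySQL Receiver") {
      override def run(): Unit = receive()
    }.start()
  }

  def onStop(): Unit = {}

  private def receive(): Unit = {
    val connection = DriverManager.getConnection(jdbcUrl, jdbcUsername, jdbcPassword)
    try {
      val statement = connection.createStatement()
      val resultSet = statement.executeQuery("SELECT * FROM your_table")
      while (!isStopped() && resultSet.next()) {
        store(resultSet.getString("column_name"))
      }
      resultSet.close()
      statement.close()
    } finally {
      connection.close()
    }
  }
}
```

8. Process the data read from MySQL. You can use `foreachRDD` to send the data of each RDD to whatever processing logic you need. For example:

```scala
dstream.foreachRDD { rdd =>
  rdd.foreach { data =>
    // processing logic goes here
  }
}
```

9. Start the Spark Streaming application and wait for it to terminate. For example:

```scala
ssc.start()
ssc.awaitTermination()
```

Note that the code above is only an example and needs to be adapted to your requirements. In production you will probably also need to configure further parameters (checkpoint directory, resource allocation and so on) to keep the application stable and performant.

I hope these steps help! If you have any questions, feel free to ask.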