1) First, create a new Maven project named SparkWC in IDEA.
Add the following configuration to pom.xml:
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>com.itstaredu.spark</groupId>
    <artifactId>SparkWC</artifactId>
    <version>1.0-SNAPSHOT</version>

    <properties>
        <maven.compiler.source>1.8</maven.compiler.source>
        <maven.compiler.target>1.8</maven.compiler.target>
        <scala.version>2.11.8</scala.version>
        <spark.version>2.2.0</spark.version>
        <hadoop.version>2.8.4</hadoop.version>
        <mysql.version>5.1.38</mysql.version>
        <encoding>UTF-8</encoding>
    </properties>

    <dependencies>
        <!-- Scala library -->
        <dependency>
            <groupId>org.scala-lang</groupId>
            <artifactId>scala-library</artifactId>
            <version>${scala.version}</version>
        </dependency>
        <!-- Spark core -->
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.11</artifactId>
            <version>${spark.version}</version>
        </dependency>
        <!-- Hadoop client API -->
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-client</artifactId>
            <version>${hadoop.version}</version>
        </dependency>
        <!-- MySQL JDBC driver -->
        <dependency>
            <groupId>mysql</groupId>
            <artifactId>mysql-connector-java</artifactId>
            <version>${mysql.version}</version>
        </dependency>
    </dependencies>

    <build>
        <pluginManagement>
            <plugins>
                <!-- Scala compiler plugin -->
                <plugin>
                    <groupId>net.alchim31.maven</groupId>
                    <artifactId>scala-maven-plugin</artifactId>
                    <version>3.2.2</version>
                </plugin>
                <!-- Java compiler plugin -->
                <plugin>
                    <groupId>org.apache.maven.plugins</groupId>
                    <artifactId>maven-compiler-plugin</artifactId>
                    <version>3.5.1</version>
                </plugin>
            </plugins>
        </pluginManagement>
        <plugins>
            <plugin>
                <groupId>net.alchim31.maven</groupId>
                <artifactId>scala-maven-plugin</artifactId>
                <executions>
                    <execution>
                        <id>scala-compile-first</id>
                        <phase>process-resources</phase>
                        <goals>
                            <goal>add-source</goal>
                            <goal>compile</goal>
                        </goals>
                    </execution>
                    <execution>
                        <id>scala-test-compile</id>
                        <phase>process-test-resources</phase>
                        <goals>
                            <goal>testCompile</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <executions>
                    <execution>
                        <phase>compile</phase>
                        <goals>
                            <goal>compile</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
            <!-- Shade plugin for building the jar -->
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-shade-plugin</artifactId>
                <version>2.4.3</version>
                <executions>
                    <execution>
                        <phase>package</phase>
                        <goals>
                            <goal>shade</goal>
                        </goals>
                        <configuration>
                            <filters>
                                <filter>
                                    <artifact>*:*</artifact>
                                    <excludes>
                                        <exclude>META-INF/*.SF</exclude>
                                        <exclude>META-INF/*.DSA</exclude>
                                        <exclude>META-INF/*.RSA</exclude>
                                    </excludes>
                                </filter>
                            </filters>
                        </configuration>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>
</project>
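Because the shade plugin is bound to the package phase, a plain Maven build produces a self-contained (fat) jar under target/; the signature-file excludes above prevent the SecurityException that signed dependencies would otherwise trigger when the merged jar is loaded. A minimal build command:
mvn clean package
This yields target/SparkWC-1.0-SNAPSHOT.jar, named from the artifactId and version above.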
The hadoop-client dependency is optional here; it is not used in this example.
Format of the data file itstar.log (each line is a timestamp and a URL separated by a tab):
20180724101954 http://java.itstar.com/java/course/javaeeadvanced.shtml
20180724101954 http://java.itstar.com/java/course/javaee.shtml
20180724101954 http://java.itstar.com/java/course/android.shtml
20180724101954 http://java.itstar.com/java/video.shtml
20180724101954 http://java.itstar.com/java/teacher.shtml
20180724101954 http://java.itstar.com/java/course/android.shtml
20180724101954 http://bigdata.itstar.com/bigdata/teacher.shtml
20180724101954 http://net.itstar.com/net/teacher.shtml
20180724101954 http://java.itstar.com/java/course/hadoop.shtml
20180724101954 http://java.itstar.com/java/course/base.shtml
Requirement: group the data by the college in each URL (java, net, or bigdata) and store each college together with its visit count in MySQL.
Create a singleton object UrlGroupCount in the SparkWC project:
import java.net.URL
import java.sql.{Connection, DriverManager}
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object UrlGroupCount {
  def main(args: Array[String]): Unit = {
    // 1. Create the Spark entry point
    val conf: SparkConf = new SparkConf().setAppName("UrlGroupCount").setMaster("local[2]")
    val sc: SparkContext = new SparkContext(conf)
    // 2. Read the data file
    val rdd1: RDD[String] = sc.textFile("e:/bigdata/itstar.log")
    // 3. Split each line on the tab and emit a tuple of (url, 1)
    val rdd2: RDD[(String, Int)] = rdd1.map(line => {
      val s: Array[String] = line.split("\t")
      (s(1), 1)
    })
    // 4. Sum the counts per URL
    val rdd3: RDD[(String, Int)] = rdd2.reduceByKey(_ + _)
    // 5. Extract the college from the host, e.g. "java" from java.itstar.com
    val rdd4: RDD[(String, Int)] = rdd3.map(x => {
      val url = x._1
      val host = new URL(url).getHost.split("[.]")(0)
      (host, x._2)
    })
    // 6. Group by college
    val rdd5: RDD[(String, List[(String, Int)])] = rdd4.groupBy(_._1).mapValues(it => {
      // sort by visit count, descending, and keep the top entry
      it.toList.sortBy(_._2).reverse.take(1)
    })
    // 7. Save the results to MySQL
    rdd5.foreach(x => {
      // write the data to MySQL
      val conn: Connection = DriverManager.getConnection(
        "jdbc:mysql://localhost:3306/urlcount?characterEncoding=utf-8", "root", "root")
      // insert the Spark result into MySQL
      val sql = "INSERT INTO url_data (xueyuan,number_one) VALUES (?,?)"
      // execute the SQL
      /** prepareStatement precompiles the SQL: however many times the same SQL
        * command is used, a PreparedStatement parses and compiles it only once,
        * whereas a Statement object parses and compiles the command on every
        * execution. PreparedStatement therefore greatly reduces run time and
        * speeds up database access.
        **/
      val statement = conn.prepareStatement(sql)
      statement.setString(1, x._1)
      // x._2 is a List[(String, Int)]; its toString form is what gets stored
      statement.setString(2, x._2.toString())
      statement.executeUpdate()
      statement.close()
      conn.close()
    })
    sc.stop()
  }
}
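Note that the foreach in step 7 opens a new JDBC connection for every output record, which is fine for this small local run. For larger result sets, a common refinement is foreachPartition, which reuses one connection and one PreparedStatement per partition; a minimal sketch of step 7 rewritten that way (same table and credentials as above):
    rdd5.foreachPartition(iter => {
      // one connection per partition instead of one per record
      val conn: Connection = DriverManager.getConnection(
        "jdbc:mysql://localhost:3306/urlcount?characterEncoding=utf-8", "root", "root")
      val statement = conn.prepareStatement("INSERT INTO url_data (xueyuan,number_one) VALUES (?,?)")
      iter.foreach(x => {
        statement.setString(1, x._1)
        statement.setString(2, x._2.toString())
        statement.executeUpdate()
      })
      statement.close()
      conn.close()
    })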
Open the local database.
First start the MySQL service:
C:\Program Files\MySQL\MySQL Server 5.5\bin\mysqld.exe
Then connect to the MySQL database with SQLyog.
Now create the database urlcount in MySQL to store the results:
CREATE DATABASE urlcount;
Switch to that database:
USE urlcount;
Create the table:
CREATE TABLE url_data(
uid INT PRIMARY KEY AUTO_INCREMENT,
xueyuan VARCHAR(50),
number_one VARCHAR(200)
);
In IDEA, right-click the UrlGroupCount object and run it; after it finishes successfully, query the url_data table in the urlcount database:
SELECT * FROM url_data;
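With the ten sample lines shown earlier, the java college's most-visited page is android.shtml with 2 hits, while bigdata and net each top out at 1 hit, so the query should return something along these lines (uid values and row order may vary; number_one holds the toString of the one-element List):
uid  xueyuan  number_one
1    java     List((java,2))
2    bigdata  List((bigdata,1))
3    net      List((net,1))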