Spark[03] Testing Spark SQL with spark-shell and IDEA
Environment Setup
Resource List
| Software/Tool | Version |
|---|---|
| VMware | VMware® Workstation 16 Pro |
| Xshell | 6 |
| FileZilla | 3.7.3 |
| Navicat | 11 |
| IDEA | 2020.2 |
The virtual machines are configured as follows:
| No. | Hostname | Domain Name | IP Address |
|---|---|---|---|
| ① | Toozky | Toozky | 192.168.64.220 |
| ② | Toozky2 | Toozky2 | 192.168.64.221 |
| ③ | Toozky3 | Toozky3 | 192.168.64.222 |
Prepare the virtual machines and set up the Hadoop 2.0 environment: install Hadoop 2.0 and ZooKeeper.
See: Hadoop[03] Starting DFS and ZooKeeper (Hadoop 2.0)
Set up time synchronization.
See: Linux (CentOS 6) Setting Up Network Time Synchronization
Configure the Spark cluster.
See: Spark[01] Spark Cluster Installation and Configuration
Connect to the virtual machines with Xshell and start zkServer, HDFS, and the Spark cluster, roughly as sketched below.
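On a typical layout, the start-up commands look something like this (a sketch only; the script paths, and which VM runs which service, depend on your installation):

```bash
# On each VM (①/②/③): start ZooKeeper (assumes zkServer.sh is on PATH)
zkServer.sh start
# On the HDFS NameNode (assumed here to be VM ①): start HDFS
start-dfs.sh
# On the Spark master (assumed here to be VM ①): start the master and workers
$SPARK_HOME/sbin/start-all.sh
```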
Install MySQL
VMs ①, ②, ③
Install MySQL (5.x) on the virtual machines; skip this step if MySQL is already installed.
yum install mysql-server           # install the MySQL server package
service mysqld start               # start the MySQL service now
chkconfig --level 345 mysqld on    # start mysqld automatically at runlevels 3, 4, and 5
Grant MySQL remote-connection privileges (so the host machine can connect to MySQL on Linux)
Look up the host machine's IP address and hostname
Press Win+S, open cmd, and run:
ipconfig
Since the VMs use NAT networking, copy the IPv4 address of the VMnet8 adapter:
Ethernet adapter VMware Network Adapter VMnet8:
   IPv4 Address. . . . . . . . . . . . . : 192.168.64.1
Here 192.168.64.1 is the host-side IP address that the VMs can reach.
Open Win → Settings → System → About; the device name shown there is the host's hostname. Copy it.
Combining the two gives a hosts entry such as
192.168.64.1 Tuzki
which must be written into /etc/hosts on every virtual machine.
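One quick way to append the entry on each VM (the IP and the name Tuzki come from the example above; adjust them to your own machine):

```bash
# Run on every VM as root
echo "192.168.64.1 Tuzki" >> /etc/hosts
```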
VM ①
Create an account that lets the host machine connect to MySQL on VM ①:
mysql
USE mysql;
CREATE USER 'root'@'Tuzki' IDENTIFIED BY '123456';
GRANT ALL PRIVILEGES ON *.* TO 'root'@'Tuzki';
FLUSH PRIVILEGES;
Test the connection with Navicat
After creating the connection in Navicat, create the test database and the test_user table, then insert some test data.
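The post doesn't show the exact DDL (the table is created through Navicat); a minimal sketch, assuming column names that match the queries used later (uname, uage), could be:

```sql
-- Hypothetical schema: uname/uage match the later spark.sql queries;
-- the sample rows other than 张三 (queried later) are made up.
CREATE DATABASE IF NOT EXISTS test;
USE test;
CREATE TABLE test_user (
  uid   INT PRIMARY KEY AUTO_INCREMENT,
  uname VARCHAR(50),
  uage  INT
);
INSERT INTO test_user (uname, uage)
VALUES ('张三', 18), ('李四', 19), ('王五', 17);
```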
Write test data in JSON format
VM ①
cd
vi testjsondb
Press a or Insert to enter insert mode, then type:
{"id":1, "name":"leo1", "age":18}
{"id":2, "name":"zhangsan", "age":19}
{"id":3, "name":"lisi", "age":17}
Press Esc, then type :wq to save and exit.
SparkSQL (spark-shell)
JSON data test
Log in to VM ① with Xshell
spark-shell
val a = spark.read.json("file:///root/testjsondb")
Query the whole table:
a.show()
Query the name column:
a.select("name").show()
Collect all rows as a list (returns a java.util.List of Row):
a.collectAsList()
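To read the individual fields out of each row, one option (a sketch, not in the original post) is getAs with the JSON key names:

```scala
// Pull each row's fields by name; Spark parses JSON numbers as Long
a.collect().foreach { row =>
  println(row.getAs[String]("name") + " -> " + row.getAs[Long]("age"))
}
```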
MySQL data test
Log in to VM ① with Xshell
spark-shell
Load the table:
val jdbc_conf: Map[String, String] = Map(
  "url" -> "jdbc:mysql://Toozky:3306/test",
  "driver" -> "com.mysql.jdbc.Driver",
  "dbtable" -> "test_user",
  "user" -> "root",
  "password" -> "123456"
)
val t_user = spark.read.format("jdbc").options(jdbc_conf).load()
Query the whole table:
t_user.show()
Query the uname column:
t_user.select("uname").show()
Queries with spark.sql
Create a temporary view for the table:
t_user.createOrReplaceTempView("test_user")
Query the whole table:
spark.sql("select * from test_user").show()
Compute the average age with avg(uage):
spark.sql("select avg(uage) from test_user").show()
SparkSQL (IDEA)
Create a plain Maven project named SparkSQLDemo in IDEA.
pom.xml
Add the following dependencies to pom.xml:
<properties>
    <spark.version>2.1.1</spark.version>
    <hadoop.version>2.6.5</hadoop.version>
    <scala.version>2.11.8</scala.version>
</properties>
<dependencies>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.11</artifactId>
        <version>${spark.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-sql_2.11</artifactId>
        <version>${spark.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-client</artifactId>
        <version>${hadoop.version}</version>
    </dependency>
    <dependency>
        <groupId>mysql</groupId>
        <artifactId>mysql-connector-java</artifactId>
        <version>5.1.47</version>
    </dependency>
    <dependency>
        <groupId>junit</groupId>
        <artifactId>junit</artifactId>
        <version>4.11</version>
        <scope>compile</scope>
    </dependency>
</dependencies>
log4j.properties
Create log4j.properties in the resources directory:
log4j.rootCategory=ERROR, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
# Set the default spark-shell log level to ERROR. When running the spark-shell, the
# log level for this class is used to overwrite the root logger's log level, so that
# the user can have different defaults for the shell and regular Spark apps.
log4j.logger.org.apache.spark.repl.Main=ERROR
# Settings to quiet third party logs that are too verbose
log4j.logger.org.spark_project.jetty=ERROR
log4j.logger.org.spark_project.jetty.util.component.AbstractLifeCycle=ERROR
log4j.logger.org.apache.spark.repl.SparkIMain$exprTyper=ERROR
log4j.logger.org.apache.spark.repl.SparkILoop$SparkILoopInterpreter=ERROR
log4j.logger.org.apache.parquet=ERROR
log4j.logger.parquet=ERROR
# SPARK-9183: Settings to avoid annoying messages when looking up nonexistent UDFs in SparkSQL with Hive support
log4j.logger.org.apache.hadoop.hive.metastore.RetryingHMSHandler=FATAL
log4j.logger.org.apache.hadoop.hive.ql.exec.FunctionRegistry=ERROR
TestSparkSql
Create the TestSparkSql class in the java directory.
spark://Toozky:7077
This master URL is the address shown on the web UI homepage of the alive Spark master.
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SQLContext;
import org.junit.Before;
import org.junit.Test;

import java.util.Properties;

public class TestSparkSql {
    SQLContext sqlContext;
    Dataset<Row> test_user_set;

    @Before
    public void before() {
        // Connect to the Spark standalone master (the URL from the cluster's web UI)
        SparkConf conf = new SparkConf()
                .setAppName("test_sparksql_mysql001")
                .setMaster("spark://Toozky:7077");
        JavaSparkContext sc = new JavaSparkContext(conf);
        sqlContext = new SQLContext(sc);

        // JDBC settings for the MySQL instance on VM ①
        String url = "jdbc:mysql://Toozky:3306/test";
        String table = "test_user";
        Properties conn = new Properties();
        conn.put("user", "root");
        conn.put("password", "123456");
        conn.put("driver", "com.mysql.jdbc.Driver");

        // Load the table and register it as a temporary view for SQL queries
        sqlContext.read().jdbc(url, table, conn).createOrReplaceTempView(table);
    }

    @Test
    public void testSelectAll() {
        System.out.println("====Test->SELECT * FROM test_user====");
        test_user_set = sqlContext.sql("select * from test_user");
        test_user_set.show();
    }

    @Test
    public void testSelect() {
        System.out.println("====Test->SELECT uname FROM test_user====");
        sqlContext.sql("select uname from test_user").show();
        System.out.println();
        System.out.println("====Test->SELECT uname=\"张三\"->info====");
        sqlContext.sql("select * from test_user where uname=\"张三\"").show();
    }
}
Console
Output of testSelectAll
Output of testSelect
Web UI of the alive Spark master
A record of the job appears under Completed Applications.
Its Name is test_sparksql_mysql001, the value passed to setAppName("test_sparksql_mysql001") in the code.
That's all for this installment. Let's keep learning from each other and improving together!