Spark读取mysql数据

最新推荐文章于 2024-02-03 02:18:26 发布

ypp91zr

最新推荐文章于 2024-02-03 02:18:26 发布

阅读量1.2k

点赞数

分类专栏： java 大数据 Spark

本文链接：https://blog.csdn.net/ypp91zr/article/details/108324128

版权

java 同时被 3 个专栏收录

193 篇文章 3 订阅

订阅专栏

大数据

5 篇文章 0 订阅

订阅专栏

Spark

2 篇文章 0 订阅

订阅专栏

public void getStudentName() throws IOException {
        Properties properties = new Properties();
        properties.put("driver", ApplicationYmlUtils.getDataSourceDriverClassName());
        properties.put("user", ApplicationYmlUtils.getDataSourceUsername());
        properties.put("password", ApplicationYmlUtils.getDataSourcePassword());
        Dataset<Row> loadData = sqlContext.read().jdbc(ApplicationYmlUtils.getDataSourceUrl(), "t_student", properties);
        /*
        DataFrameReader reader = sqlContext.read().format("jdbc")
                .option("url", ApplicationYmlUtils.getDataSourceUrl())
                .option("dbtable", sql)
                .option("driver", ApplicationYmlUtils.getDataSourceDriverClassName())
                .option("user", ApplicationYmlUtils.getDataSourceUsername())
                .option("password", ApplicationYmlUtils.getDataSourcePassword());
        Dataset<Row> loadData = reader.load();
         */
        System.out.println(loadData.count());
        loadData.write().mode(SaveMode.Overwrite).json("E:/11.json");

如果是百万级甚至千万级乙级数据呢,这里以120W数据为例,数据库确实存在120W数据，输出成JSON文件为例

一次性读取

有OOM风险

public void getStudentName() throws IOException {
        long startTime = System.currentTimeMillis();
        Properties properties = new Properties();
        properties.put("driver", ApplicationYmlUtils.getDataSourceDriverClassName());
        properties.put("user", ApplicationYmlUtils.getDataSourceUsername());
        properties.put("password", ApplicationYmlUtils.getDataSourcePassword());
        Dataset<Row> loadData = sqlContext.read().jdbc(ApplicationYmlUtils.getDataSourceUrl(), "sys_operation_log", properties);
        System.out.println(loadData.count());
        loadData.write().mode(SaveMode.Overwrite).json("E:/11.json");
        long endTime = System.currentTimeMillis();
        System.err.println("120W数据大概用时"+(endTime-startTime));
    }

生成的JSON文件是只有一个JSON文件

分批次/分区

无OOM风险,但总时长比较长

public void getStudentName() throws IOException {
        long startTime = System.currentTimeMillis();
        Properties properties = new Properties();
        properties.put("driver", ApplicationYmlUtils.getDataSourceDriverClassName());
        properties.put("user", ApplicationYmlUtils.getDataSourceUsername());
        properties.put("password", ApplicationYmlUtils.getDataSourcePassword());
        String[] predicates=new String[120];
        Integer num=120;
        Integer index=0;
        for (int i = 0; i <num ; i++) {
            predicates[i]=" 1=1 LIMIT "+index+","+10000;
            index+=10000;
        }
        Dataset<Row> loadData = sqlContext.read().jdbc(ApplicationYmlUtils.getDataSourceUrl(), "sys_operation_log",predicates ,properties);
        System.out.println(loadData.count());
        loadData.write().mode(SaveMode.Overwrite).json("E:/11.json");
        long endTime = System.currentTimeMillis();
        System.err.println("120W数据大概用时"+(endTime-startTime));
    }

下面是任务部分日志截图

足足用了二十多分钟

JSON文件也是按任务分散的，并不是一个文件

数据量大还不如传统的mysql查询，毕竟是先查询数据放入内存再做相应的操作，可以用sqoop将mysql数据放进HDFS等再使用Spark来进行计算

所以海量数据Spark还是比较适用HDFS/Hive/Hbase这些来做数据源

ypp91zr

关注

0
点赞
踩
1

收藏

觉得还不错? 一键收藏
0
评论
Spark读取mysql数据

public void getStudentName() throws IOException { Properties properties = new Properties(); properties.put("driver", ApplicationYmlUtils.getDataSourceDriverClassName()); properties.put("user", ApplicationYmlUtils.getDataSourceUsern.
复制链接

扫一扫