面试集锦（二）之手写sql面试题目及map端join_20210206_大数据面试题

最新推荐文章于 2024-02-23 08:30:37 发布

菜鸟周星星

最新推荐文章于 2024-02-23 08:30:37 发布

阅读量261

点赞数

分类专栏： hive # 面试_SQL 文章标签： mapreduce hive java 面试 sql

本文链接：https://blog.csdn.net/weixin_42083008/article/details/113728049

版权

hive 同时被 2 个专栏收录

11 篇文章 0 订阅

订阅专栏

面试_SQL

11 篇文章 0 订阅

订阅专栏

本篇是大数据面试题部分，今日内容

sql题目
map端join

sql题目

查询第二高的薪水

编写一个 SQL 查询，获取 Employee 表中第二高的薪水（Salary） 。

+----+--------+

| Id | Salary |

+----+--------+

| 1 | 100    |

| 2 | 200    |

| 3 | 300    |

+----+--------+

例如上述 Employee 表，SQL查询应该返回 200 作为第二高的薪水。如果不存在第二高的薪水，那么查询应返回 null。

+---------------------+

| SecondHighestSalary |

+---------------------+

| 200                 |

+---------------------+

答案：

select

(select  max(Salary)  as Salary from Employee

group by Salary  order by Salary desc  limit 1,1 )  as SecondHighestSalary

题解：此题较为简单，但需注意此题“如果不存在第二高的薪水，那么查询应返回 null”的要求

如仅直接查询，那么不会有null的记录，如下图所示：

并不能达到题目所要求，返回null字段（为一条记录），所以使用子查询包裹，实现题目要求

并且还需要考虑的一点是，有重复的薪水记录会干扰查询结果，所以需要进行去重

去重有两种方式 distinct (Salary ) 或者group by（Salary ）分组之后使用min或者max 取其组内一条记录

补充知识点：limit用法

limit n 取前n条数据

limit m,n 跳过m条数据，取n条

例：limit 1,1 跳过一条数据，取一条即取第二条数据

超过5名学生的课

有一个courses 表，有: student (学生) 和 class (课程)。

请列出所有超过或等于5名学生的课。

例如，表：

+---------+------------+

| student | class      |

+---------+------------+

| A       | Math       |

| B       | English    |

| C       | Math       |

| D       | Biology    |

| E       | Math       |

| F       | Computer   |

| G       | Math       |

| H       | Math       |

| I       | Math       |

+---------+------------+

应该输出:

+---------+

| class   |

+---------+

| Math    |

+---------+

提示：

学生在每个课中不应被重复计算。

答案1：

SELECT

    class

FROM

    (SELECT

        class, COUNT(DISTINCT student) AS num

    FROM

        courses

    GROUP BY class) AS temp_table

WHERE

    num >= 5

;

答案2：

select

class

from

courses

group by class  having  count(distinct student)>=5

解析：

求>=5名学生的课，依题意，即出现一次class，就有一个该class的学生（前提是不出现同一个学生的该class记录）

解法一：按照class分组，求出class和去重学生之后的count，套子查询过滤count>=5的class

解法二：谓词下推：将去重条件移至内层子查询，从而去掉子查询

上升的温度

表 Weather

+---------------+---------+

| Column Name   | Type    |

+---------------+---------+

| id            | int     |

| recordDate    | date    |

| temperature   | int     |

+---------------+---------+

id 是这个表的主键

该表包含特定日期的温度信息

编写一个 SQL 查询，来查找与之前（昨天的）日期相比温度更高的所有日期的 id 。

返回结果不要求顺序。

查询结果格式如下例：

Weather

+----+------------+-------------+

| id | recordDate | Temperature |

+----+------------+-------------+

| 1 | 2015-01-01 | 10          |

| 2 | 2015-01-02 | 25          |

| 3 | 2015-01-03 | 20          |

| 4 | 2015-01-04 | 30          |

+----+------------+-------------+

Result table:

+----+

| id |

+----+

| 2 |

| 4 |

+----+

2015-01-02 的温度比前一天高（10 -> 25）

2015-01-04 的温度比前一天高（20 -> 30）

答案：

select

b.id

from Weather a

join (select * from Weather  ) b on (datediff(b.recordDate,a.recordDate))=1 and b.Temperature >a.Temperature

解法：

依题意：需要自连接，条件即今天和昨天相差一天====>datediff(今天日期，昨天日期)=1 且今天温度比昨天高

map端join

面试官：你能给我说一下什么是map但join吗？

Map端join就是在map端实现join的过程，去掉了reduce阶段，从而在根本上减少程序shuffle的过程，杜绝了数据倾斜导致的整个程序运行时间过长的现象。

具体原理：分布式缓存文件

Mr程序中，在main主程序中，添加缓存文件，addCacheFile(“”)将文件分布式缓存到执行maptask的机器本地中。执行任务时，加载本地文件到内存当中，接着就可以进行流式join读取表数据做拼接。在setup方法中直接读取本地的文件（执行maptask的机器本地）

适用场景：一个小表join一个大表；可以将小表分布式缓存到执行maptask的机器上

类似原理的还有spark中的广播变量，在driver端将小表广播到执行task的executor机器上，并返回这个广播变量的一个引用。

二者区别：文件缓存是存到了执行任务机器上的本地磁盘目录中；广播变量则是使用网络传输将数据以二进制的形式广播到了执行任务的executor的机器上；另spark中也可以使用文件缓存到executor机器上。

面试官：那你能给我手写一个map端join的案例吗?

首先需要在main方法中添加分布式缓存文件

job.addCacheFile(new URI("/wordcount/cache/user.txt"))

接着重写setup方法

在setup方法当中读取本地文件（这里的本地是指执行maptask的机器本地）

@Override
protected void setup(Context context) throws IOException, InterruptedException {
BufferedReader br = new BufferedReader(new FileReader("user.txt"));

String line = null;
while((line=br.readLine())!=null){
String[] arr = line.split(",");
Order user = new Order();
user.set("",null,arr[0],null,arr[1],arr[2]);
userMap.put(user.getUid(),user);
}

细节代码如下：
package cn.doitedu.azkaban;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.net.URI;
import java.util.HashMap;

public class MapSideJoin {

    public static class M extends Mapper<LongWritable, Text,Order, NullWritable>{
        Order order = new Order();
        HashMap<String, Order> userMap = new HashMap<String, Order>();


        @Override
        protected void setup(Context context) throws IOException, InterruptedException {
            BufferedReader br = new BufferedReader(new FileReader("user.txt"));


            String line = null;
            while((line=br.readLine())!=null){
                String[] arr = line.split(",");
                Order user = new Order();
                user.set("",null,arr[0],null,arr[1],arr[2]);
                userMap.put(user.getUid(),user);
            }

        }

        @Override
        protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {

            String[] arr = value.toString().split(",");
            order.set("",arr[0],arr[1],Double.valueOf(arr[2]),null,null);

            // 查找user表
            Order user = userMap.get(this.order.getUid());
            order.setUname(user.getUname());
            order.setLevel(user.getLevel());

            context.write(order,NullWritable.get());
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);
        job.setJarByClass(MapSideJoin.class);

        job.setMapperClass(M.class);

        job.setOutputKeyClass(Order.class);
        job.setOutputValueClass(NullWritable.class);

        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        // 添加缓存文件
        job.addCacheFile(new URI("/wordcount/cache/user.txt"));

        // 关闭reducer
        job.setNumReduceTasks(0);

        FileInputFormat.setInputPaths(job,new Path("/wordcount/input"));
        FileOutputFormat.setOutputPath(job,new Path("/wordcount/output"));

        boolean res = job.waitForCompletion(true);

        System.exit(res?0:1);
    }
}

更多学习、面试资料尽在微信公众号：Hadoop大数据开发

菜鸟周星星

关注

0
点赞
踩
3

收藏

觉得还不错? 一键收藏
0
评论
面试集锦（二）之手写sql面试题目及map端join_20210206_大数据面试题

本篇是大数据面试题部分，今日内容sql题目 map端joinsql题目查询第二高的薪水编写一个 SQL 查询，获取 Employee表中第二高的薪水（Salary）。+----+--------+| Id | Salary |+----+--------+| 1 | 100 || 2 | 200 || 3 | 300 |+----+--------+例如上述Employee表，SQL查询应该返回200 作为第二高的薪水...
复制链接

扫一扫