数据库Join算法

醒过来摸鱼

已于 2024-03-07 14:26:27 修改

阅读量326

点赞数

分类专栏： # 数据数据分析文章标签：数据库算法 java

于 2022-10-13 01:07:04 首次发布

本文链接：https://blog.csdn.net/m0_66201040/article/details/127292499

版权

数据同时被 2 个专栏收录

9 篇文章 1 订阅

订阅专栏

数据分析

7 篇文章 0 订阅

订阅专栏

文章目录

延迟实例化
嵌套循环连接
排序合并连接
哈希连接

延迟实例化

假如是一个鲁莽reckless的程序员，可能会这样写join，把两个表的数据取出来，然后组合在一起。但是现在目前主流的技术是把两张表的记录id取出来组合在一起，筛选出符合条件的数据，再去原数据表取数据，这样程序的开销就小很多，占用内存也小，这种技术呢，就叫做延迟实例化late materialization。

嵌套循环连接

最容易实现的算法是嵌套循环连接Nested Loop Join，这种算法是利用两层循环去遍历两张表，然后把结果join起来，可以用java模拟一下。为此，我先建两个类，为了节省篇幅，我就不去遵守Java Bean规范了，也不去使用延迟实例化技术：

package com.youngthing.springboot.demo.join;

/**
 * 10/12/2022 11:48 PM 创建
 *
 * @author 花书粉丝
 */
class Employee {
    public int id;
    public int deptId;
    public String name;

    public Employee(int id, int deptId, String name) {
        this.id = id;
        this.deptId = deptId;
        this.name = name;
    }
}

再写下一个部门类：

package com.youngthing.springboot.demo.join;

/**
 * 10/12/2022 11:49 PM 创建
 *
 * @author 花书粉丝
 */
public class Dept {
    public int id;
    public String name;

    public Dept(int id, String name) {
        this.id = id;
        this.name = name;
    }
}

最后写嵌套循环算法：

package com.youngthing.springboot.demo.join;

/**
 * 10/12/2022 11:12 PM 创建
 *
 * @author 花书粉丝
 */
public class SimpleNestedLoopJoin {

    static class Employee {
        int id;
        int deptId;
        String name;

        public Employee(int id, int deptId, String name) {
            this.id = id;
            this.deptId = deptId;
            this.name = name;
        }
    }

    static class Dept {
        int id;
        String name;

        public Dept(int id, String name) {
            this.id = id;
            this.name = name;
        }
    }

    public static void main(String[] args) {
        Employee[] tableA = {new Employee(1, 1, "David"), new Employee(2, 2, "Joe")};
        Dept[] tableB = {new Dept(1, "Dev"), new Dept(2, "Ops")};
        for (Employee employee : tableA) {
            for (Dept dept : tableB) {
                if (employee.deptId == dept.id) {
                    System.out.println(employee.name + "," + dept.name);
                }
            }
        }
    }
}

外层循环假使有M条数据，内层循环假使有N条数据，那么复杂度是M+M $\times$ N。为什们不是M $\times$ N呢？因为计算复杂度使用的是磁盘IO次数，先读M条出来，所以复杂度先是是M，然后进行M次大循环，每个大循环里进行N次内层循环，这就是 $M\times N$ ，所以复杂度总和就是 $M+M\times N$ 。
当然，嵌套循环连接还分为简单、缓冲和分块三种算法，每种算法复杂度不一样，但大体上都是 $M+M\times N$ 这种形式。再说下名词，外层循环遍历的表叫外表，内层循环遍历的表叫内表，M为外表数据块数量，N为内表数据块数量，以下是各种算法的复杂度：

算法	英文	复杂度	备注
简单	Simple Nested Loop Join	$\times N )$	m为外表总数据行数
分块	Block Nested Loop Join	$\times N )$
索引	Index Nested Loop Join	$\times C)$	C为常数，与内表索引树高度相关

排序合并连接

这个算法英文为Sort-Merge Join,这个算法就是对两个表，用连接字段进行排序，再合并结果。这个算法其实很重要的，就是在我们平常的java开发中也经常用到，对内存中的两组数据进行一一匹配。下面我用Java代码模拟下这个算法：

package com.youngthing.springboot.demo.join;

import java.util.Arrays;
import java.util.Comparator;

/**
 * 10/12/2022 11:51 PM 创建
 *
 * @author 花书粉丝
 */
public class SortMerge {
    private static Employee[] tableA = {
            new Employee(1, 1, "David"),
            new Employee(2, 3, "Joe"),
            new Employee(3, 2, "Jack"),
            new Employee(4, 3, "Wendy"),
            new Employee(5, 1, "Lucy"),
            new Employee(5, 2, "Lily"),
    };
    private static Dept[] tableB = {
            new Dept(1, "Dev"),
            new Dept(3, "HR"),
            new Dept(2, "Ops")};

    public static void main(String[] args) {
        Arrays.sort(tableA, Comparator.comparingInt(o -> o.deptId));
        Arrays.sort(tableB, Comparator.comparingInt(o -> o.id));


        for (int i = 0, j = 0; i < tableA.length && j < tableB.length; ) {
            final Employee employee = tableA[i];
            final Dept dept = tableB[j];
            if (employee.deptId < dept.id) {
                i++;
            } else if (employee.deptId > dept.id) {
                j++;
            } else {
                System.out.println(employee.name + "," + dept.name);
                i++;
                // 这里是我假设内表join键唯一以简化代码，让内表索引不递增，
                // 实际数据库遇到内表join键不唯一时会使用嵌套循环来解决
            }
        }
    }
}

实际中会使用下面这一种算法：

class TwoPointTest {

	private static void match(int[] left, int[] right) {
		int start = 0;
		int count = 0;
		for (int i = 0; i < left.length; i++) {
			for (int j = start; j < right.length; j++) {
				count++;
				int l = left[i];
				int r = right[j];
				if (l == r) {
					System.out.println("matched:" + l + "," + r);
					start = j + 1;
					// 找到之后要跳指针了
					break;
				} else if (l < r) {
					System.out.println("left 多余:" + l);
					break;
				} else {
					start = j + 1;
				}

			}
		}
		System.out.println("循环" + count + "次");
	}
}

从代码可以看出，这个算法有局限性，局限性在于join的key在内表必须是唯一约束。但是这个算法复杂度很低，排序复杂度取决于排序算法，而合并复杂度为 $M + N$ ，所以性能上是十分优越的。

哈希连接

这个可能是我们开发中经常用到的算法了吧，算法很简单，按join的key对外表建哈希索引，然后遍历内表就完事了。例子很简单，数据我还是用以上的数据，用java模拟下这个过程：

package com.youngthing.springboot.demo.join;

import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

/**
 * 10/13/2022 12:31 AM 创建
 *
 * @author 花书粉丝
 */
public class SimpleHashJoin {
    private static Employee[] tableA = {
            new Employee(1, 1, "David"),
            new Employee(2, 3, "Joe"),
            new Employee(3, 2, "Jack"),
            new Employee(4, 3, "Wendy"),
            new Employee(5, 1, "Lucy"),
            new Employee(5, 2, "Lily"),
    };
    private static Dept[] tableB = {
            new Dept(1, "Dev"),
            new Dept(3, "HR"),
            new Dept(2, "Ops")};

    public static void main(String[] args) {

        final Map<Integer, List<Employee>> empMap =
                Arrays.stream(tableA).collect(Collectors.<Employee, Integer>groupingBy(x -> x.deptId));

        Arrays.stream(tableB).forEach(dept -> {
            final List<Employee> employees = empMap.get(dept.id);
            employees.forEach(emp -> System.out.println(emp.name + "," + dept.name));
        });
    }
}