https://clickhouse.com/docs/en/sql-reference/statements/select/join/#distributed-join
https://clickhouse.com/docs/en/sql-reference/operators/in/#select-distributed-subqueries
Distributed JOIN
There are two ways to execute join involving distributed tables:
- When using a normal JOIN, the query is sent to remote servers. Subqueries are run on each of them in order to make the right table, and the join is performed with this table. In other words, the right table is formed on each server separately.
- When using GLOBAL … JOIN, first the requestor server runs a subquery to calculate the right table. This temporary table is passed to each remote server, and queries are run on them using the temporary data that was transmitted.
Be careful when using GLOBAL. For more information, see the Distributed subqueries section.
This is more optimal than using the normal IN. However, keep the following points in mind:
- When creating a temporary table, data is not made unique. To reduce the volume of data transmitted over the network, specify DISTINCT in the subquery. (You do not need to do this for a normal IN.
- The temporary table will be sent to all the remote servers. Transmission does not account for network topology. For example, if 10 remote servers reside in a datacenter that is very remote in relation to the requestor server, the data will be sent 10 times over the channel to the remote datacenter. Try to avoid large data sets when using GLOBAL IN.
- When transmitting data to remote servers, restrictions on network bandwidth are not configurable. You might overload the network.
- Try to distribute data across servers so that you do not need to use GLOBAL IN on a regular basis.
- If you need to use GLOBAL IN often, plan the location of the ClickHouse cluster so that a single group of replicas resides in no more than one data center with a fast network between them, so that a query can be processed entirely within a single data center.
It also makes sense to specify a local table in the GLOBAL IN clause, in case this local table is only available on the requestor server and you want to use data from it on remote servers.