|
| 1 | +--- |
| 2 | +title: "MultiDisk (JBOD) Balancing" |
| 3 | +linkTitle: "MultiDisk (JBOD) Balancing" |
| 4 | +--- |
| 5 | + |
| 6 | +ClickHouse provides two options for balancing inserts across the disks of a volume with more than one disk: `round_robin` and `least_used`. |
| 7 | + |
| 8 | +## **Round Robin (Default):** |
| 9 | + |
| 10 | +ClickHouse writes each new part to the next disk in the volume, in round-robin order. |
| 11 | + |
| 12 | +This is the default setting and is most effective when parts created on insert are roughly the same size. |
| 13 | + |
| 14 | +Drawbacks: may lead to disk skew when part sizes vary, because disks are chosen in turn regardless of how large each part is. |
| 15 | + |
| 16 | +## **Least Used:** |
| 17 | + |
| 18 | +ClickHouse selects the disk with the most available space and writes to that disk. |
| 19 | + |
| 20 | +Switch to `least_used` when even disk space consumption is desirable or when the JBOD volume contains disks of different sizes. To prevent hot-spots, it is best to set this policy on a fresh volume or on a volume that has already been (re)balanced. |
| 21 | + |
| 22 | +Drawbacks: may lead to hot-spots, because new parts keep going to the disk with the most free space until usage evens out. |
| 23 | + |
| 24 | +## Configurations |
| 25 | + |
| 26 | +Configurations that can affect disk selected: |
| 27 | + |
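| 28 | +- storage policy volume configuration: `least_used_ttl_ms`, the interval after which the cached free-space figures for the disks are refreshed. Only applies to the `least_used` policy; default is 60000 ms (60s). |
| 29 | +- disk settings: `keep_free_space_bytes`, `keep_free_space_ratio` |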
| 30 | + |
| 31 | +Configuration to assist rebalancing: |
| 32 | + |
| 33 | +- The MergeTree setting `min_bytes_to_rebalance_partition_over_jbod` does not control where data is written during inserts. Instead, it governs how parts are redistributed across disks within the same volume during merge operations. |
| 34 | + |
| 35 | +> Note: setting `min_bytes_to_rebalance_partition_over_jbod` does not guarantee balanced partitions and balanced disk usage. |
| 36 | +> |
| 37 | + |
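As a rough sketch of how the rebalancing setting above could be applied server-wide, the fragment below places it in the `<merge_tree>` section of the server configuration. The 1 GiB threshold is an assumption chosen for illustration, not a recommended value. A value of `0` should leave rebalancing disabled.

```xml
<clickhouse>
    <merge_tree>
        <!-- Assumed threshold of 1 GiB (1073741824 bytes); illustrative only -->
        <min_bytes_to_rebalance_partition_over_jbod>1073741824</min_bytes_to_rebalance_partition_over_jbod>
    </merge_tree>
</clickhouse>
```

The same setting can presumably also be given per table in the `SETTINGS` clause of a MergeTree table, which limits the effect to tables that actually live on a JBOD volume.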
| 38 | +Example of least_used policy: |
| 39 | + |
| 40 | +```xml |
| 41 | +<clickhouse> |
| 42 | + <storage_configuration> |
| 43 | + <disks> |
| 44 | + <default> |
| 45 | + <path>/var/lib/clickhouse/</path> |
| 46 | + <keep_free_space_bytes>10737418240</keep_free_space_bytes> |
| 47 | + </default> |
| 48 | + <disk1> |
| 49 | + <path>/mnt/disk1/</path> |
| 50 | + <keep_free_space_bytes>10737418240</keep_free_space_bytes> |
| 51 | + </disk1> |
| 52 | + <disk2> |
| 53 | + <path>/mnt/disk2/</path> |
| 54 | + <keep_free_space_bytes>10737418240</keep_free_space_bytes> |
| 55 | + </disk2> |
| 56 | + </disks> |
| 57 | + <policies> |
| 58 | + <hot> |
| 59 | + <volumes> |
| 60 | + <default> |
| 61 | + <disk>disk1</disk> |
| 62 | + <disk>disk2</disk> |
| 63 | + <load_balancing>least_used</load_balancing> |
| 64 | + <least_used_ttl_ms>60000</least_used_ttl_ms> <!-- 60s --> |
| 65 | + </default> |
| 66 | + </volumes> |
| 67 | + </hot> |
| 68 | + </policies> |
| 69 | + </storage_configuration> |
| 70 | +</clickhouse> |
| 71 | +``` |
| 72 | + |
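For comparison, a minimal sketch of a `round_robin` volume whose disks reserve space with `keep_free_space_ratio` instead of `keep_free_space_bytes`. The policy name `hot_rr`, the disk names and paths, and the `0.1` ratio are assumptions made for this illustration.

```xml
<clickhouse>
    <storage_configuration>
        <disks>
            <disk1>
                <path>/mnt/disk1/</path>
                <!-- Keep 10% of the disk free (assumed value) -->
                <keep_free_space_ratio>0.1</keep_free_space_ratio>
            </disk1>
            <disk2>
                <path>/mnt/disk2/</path>
                <keep_free_space_ratio>0.1</keep_free_space_ratio>
            </disk2>
        </disks>
        <policies>
            <hot_rr> <!-- hypothetical policy name -->
                <volumes>
                    <default>
                        <disk>disk1</disk>
                        <disk>disk2</disk>
                        <!-- round_robin is the default; stated explicitly for clarity -->
                        <load_balancing>round_robin</load_balancing>
                    </default>
                </volumes>
            </hot_rr>
        </policies>
    </storage_configuration>
</clickhouse>
```

Using a ratio rather than a byte count keeps the reserve proportional when the disks in the volume differ in size.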
| 73 | +## Manual Rebalancing Parts over JBOD Disks |
| 74 | + |
| 75 | +The following query selects large parts in the tables and databases matching `target_tables` and `target_databases` as candidates to move to another disk. Candidate moves are chosen according to these rules: |
| 76 | +- only moves that are valid under the table's own storage_policy are generated |
| 77 | +- the storage_policy volume must be of JBOD type |
| 78 | +- parts move only to other disks in the same volume |
| 79 | +- the target disk differs from the disk the part is currently on |
| 80 | +- among candidate disks, the one with the most free_space is preferred |
| 81 | + |
| 82 | +Set `target_tables` and `target_databases` based on requirements. |
| 83 | + |
| 84 | +```sql |
| 85 | +WITH |
| 86 | + '%' AS target_tables, |
| 87 | + '%' AS target_databases |
| 88 | +SELECT sub.q FROM |
| 89 | +( |
| 90 | + SELECT |
| 91 | + 'ALTER TABLE ' || parts.database || '.' || parts.`table` || ' MOVE PART \'' || parts.name ||'\' TO DISK \'' || other_disk_candidate || '\';' as q, |
| 92 | + parts.database as db, |
| 93 | + parts.`table` as t, |
| 94 | + parts.name as part_name, |
| 95 | + parts.disk_name as part_disk_name, |
| 96 | + parts.bytes_on_disk AS part_bytes_on_disk, |
| 97 | + sp.storage_policy as part_storage_policy, |
| 98 | + arrayJoin(arrayRemove(v.disks, parts.disk_name)) AS other_disk_candidate, |
| 99 | + candidate_disks.free_space AS candidate_disk_free_space |
| 100 | + FROM system.parts AS parts |
| 101 | + INNER JOIN ( SELECT database, `table`, storage_policy FROM system.tables where (name LIKE target_tables) AND (database LIKE target_databases) group by 1, 2, 3 ) AS sp ON sp.`table` = parts.`table` AND sp.database = parts.database |
| 102 | + INNER JOIN ( SELECT policy_name, volume_name, disks AS disks FROM system.storage_policies WHERE volume_type = 0 ) AS v ON sp.storage_policy = v.policy_name |
| 103 | + INNER JOIN ( SELECT name, free_space FROM system.disks ORDER BY free_space DESC ) AS candidate_disks ON candidate_disks.name = other_disk_candidate |
| 104 | + WHERE parts.active = 1 AND has(v.disks, parts.disk_name) -- keep candidates within the part's own volume |
| 105 | + AND (parts.bytes_on_disk >= 10737418240) --10GB prioritize larger parts |
| 106 | + AND (parts.`table` LIKE target_tables) |
| 107 | + AND (parts.database LIKE target_databases) |
| 108 | + AND candidate_disks.free_space > parts.bytes_on_disk*2 -- 2x buffer |
| 109 | + ORDER BY parts.bytes_on_disk DESC, candidate_disk_free_space DESC |
| 110 | + LIMIT 1 BY db, t, part_name |
| 111 | +) as sub |
| 112 | +FORMAT TSVRaw |
| 113 | +``` |