## ClickHouse® row-level deduplication

(This article is about row-level deduplication of already ingested data. For insert/block-level deduplication and insert idempotency, see [Insert Deduplication / Insert Idempotency](https://kb.altinity.com/altinity-kb-schema-design/insert_deduplication/). For materialized-view retry semantics, see [Idempotent inserts into a materialized view](https://kb.altinity.com/altinity-kb-schema-design/materialized-views/idempotent_inserts_mv/).)

There is a quite common requirement to do deduplication on a record level in ClickHouse.

* Sometimes duplicates appear naturally on the collector side.
* Sometimes they appear due to the fact that message queue systems (Kafka/RabbitMQ/etc.) offer at-least-once guarantees.
* Sometimes you just expect insert idempotency on the row level.

For the general case, ClickHouse does not provide a cheap built-in way to enforce arbitrary row-level uniqueness across an already large table.
That is a different problem from retry-safe insert deduplication, which ClickHouse supports separately for `MergeTree` family inserts.

The reason is simple: to check if the row already exists you need a lookup that is closer to a key-value access pattern (which is not what ClickHouse is optimized for),
in the general case across the whole huge table (which can be terabytes/petabytes in size).

But there are many use cases where you can achieve something like row-level deduplication in ClickHouse:

### Approach 0. Make deduplication before ingesting data to ClickHouse

Pros:
- you have full control
- clean and simple schema and selects in ClickHouse

Cons:
- extra coding and 'moving parts', storing some ids somewhere
- checking if a row exists in ClickHouse before insert can give non-satisfying results if you use a ClickHouse cluster (i.e. Replicated / Distributed tables) - due to eventual consistency

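The 'check before insert' variant from the last con can be sketched as follows (table and column names are purely illustrative, not from the original article); on a cluster the SELECT may not yet see rows that were just inserted through another replica:

```sql
-- naive client-side check-then-insert; racy on clusters due to eventual consistency
select count() from events where id = 42;
-- the client inserts only if the count above was 0:
insert into events (id, metric) values (42, 10);
```
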
### Approach 1. Allow duplicates during ingestion

Remove them on SELECT level (by things like GROUP BY).

Pros:
- simple inserts

Cons:
- complicates selects
- all selects will be significantly slower
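
For instance, assuming a raw table like `events(id, ts, metric)` (illustrative names), each select has to collapse the duplicates itself:

```sql
-- keep one arbitrary row per id
select id, any(metric) as metric
from events
group by id;

-- or, when a timestamp/version column exists, keep the latest row per id
select id, argMax(metric, ts) as metric
from events
group by id;
```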

### Approach 2. Eventual deduplication using Replacing

Pros:
- simple

Cons:
- can force you to use a suboptimal primary key (which will guarantee record uniqueness)
- deduplication is eventual - you never know when it will happen, and you will get some duplicates if you don't use `FINAL`
- selects with `FINAL` (`select * from table_name FINAL`) add overhead and should be benchmarked
  - older versions often needed manual optimization: https://github.com/ClickHouse/ClickHouse/issues/31411
  - performance has improved significantly in recent releases, so `FINAL` is often acceptable in production workloads: https://clickhouse.com/blog/common-getting-started-issues-with-clickhouse
  - additional tuning notes: https://kb.altinity.com/altinity-kb-queries-and-syntax/altinity-kb-final-clause-speed/

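A minimal `ReplacingMergeTree` sketch (illustrative names): the `ver` argument makes merges keep the row with the highest `ver` per sorting key, and `FINAL` forces deduplication at read time:

```sql
create table Example2 (id UInt64, ver UInt64, metric UInt64)
engine = ReplacingMergeTree(ver)
order by id;

insert into Example2 values (1, 1, 10);
insert into Example2 values (1, 2, 20); -- same id, newer version

-- without FINAL you may still see both rows until a background merge happens
select * from Example2;

-- with FINAL you get exactly one row per id (ver = 2 wins), at extra read cost
select * from Example2 final;
```
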
### Approach 3. Eventual deduplication using Collapsing

Pros:
- you can make proper aggregations of the last state w/o `FINAL` (bookkeeping-alike sums, counts, etc.)

Cons:
- complicated
- can force you to use a suboptimal primary key (which will guarantee record uniqueness)
- you need to store the previous state of the record somewhere, or extract it from ClickHouse before ingestion
- deduplication is eventual (same as with Replacing)

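A minimal `CollapsingMergeTree` sketch (illustrative names): every correction inserts the previous state with `sign = -1` plus the new state with `sign = 1`, so aggregates can net out without `FINAL`:

```sql
create table Example3 (id UInt64, metric UInt64, sign Int8)
engine = CollapsingMergeTree(sign)
order by id;

-- initial state of the record
insert into Example3 values (1, 10, 1);

-- update: cancel the old state, write the new one
insert into Example3 values (1, 10, -1), (1, 20, 1);

-- bookkeeping-style aggregation without FINAL: cancelled rows net out to metric = 20
select id, sum(metric * sign) as metric
from Example3
group by id
having sum(sign) > 0;
```
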
### Approach 4. Eventual deduplication using Summing/Aggregating/CoalescingMergeTree

Use `SimpleAggregateFunction(anyLast, ...)` or `argMax` states for Summing/AggregatingMergeTree;
CoalescingMergeTree implies `anyLast` by default.

Pros:
- you can finish deduplication with `GROUP BY` instead of `FINAL` (it's faster)

Cons:
- quite complicated
- can force you to use a suboptimal primary key (which will guarantee record uniqueness)
- deduplication is eventual (same as with Replacing)

Example: keep the latest version of each row in an `AggregatingMergeTree` table and read the finalized state with `GROUP BY`:

```sql
create table Example4Raw
(
    id UInt64,
    version UInt64,
    metric UInt64
)
engine = MergeTree
order by (id, version);

create table Example4Agg
(
    id UInt64,
    metric_state AggregateFunction(argMax, UInt64, UInt64)
)
engine = AggregatingMergeTree
order by id;

create materialized view Example4AggMV to Example4Agg as
select id, argMaxState(metric, version) as metric_state
from Example4Raw
group by id;
```

In that example, after inserting e.g. `(1, 1, 10)`, `(1, 2, 20)` and `(2, 1, 30)` into `Example4Raw`, the finalized result contains `id = 1, metric = 20` and `id = 2, metric = 30`.
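
Reading the finalized state requires merging the stored aggregate states; a sketch with illustrative insert values:

```sql
insert into Example4Raw values (1, 1, 10), (1, 2, 20), (2, 1, 30);

-- finish deduplication with GROUP BY instead of FINAL
select id, argMaxMerge(metric_state) as metric
from Example4Agg
group by id
order by id;
-- id = 1 -> metric = 20 (version 2 wins), id = 2 -> metric = 30
```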

The same pipeline using `CoalescingMergeTree`, where keeping the last non-NULL value per column is implied, so no explicit aggregate-function states are needed:

```sql
create table Example4Raw
(
    id UInt64,
    version UInt64,
    metric UInt64
)
engine = MergeTree
order by (id, version);

create table Example4Agg
(
    id UInt64,
    metric_state Nullable(UInt64)
)
engine = CoalescingMergeTree
order by id;

create materialized view Example4AggMV to Example4Agg as
select id, metric as metric_state
from Example4Raw;
```

### Approach 5. Keep the data fragments where duplicates are possible isolated

Usually you can expect the duplicates only in some time window (like 5 minutes, or one hour, or something like that).
You can put that 'dirty' data in a separate place, and move it to the final MergeTree table after the deduplication window timeout.
For example - you insert data into some tiny tables (Engine=StripeLog) with a minute suffix, and move data from tiny tables older than X minutes to the target MergeTree (with some external queries).
In the meanwhile you can see realtime data using Engine=Merge / VIEWs etc.

Pros:
- good control
- no duplicates in the target table
- perfect ingestion speed

Cons:
- quite complicated

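A sketch of the staging pattern (all names are illustrative; the move step would be driven by an external scheduler):

```sql
-- per-minute staging table for 'dirty' data
create table staging_202406011200 (id UInt64, metric UInt64)
engine = StripeLog;

-- final table that must stay duplicate-free
create table events (id UInt64, metric UInt64)
engine = MergeTree order by id;

-- realtime view over final + staging data
create table events_all (id UInt64, metric UInt64)
engine = Merge(currentDatabase(), '^(events$|staging_)');

-- once the dedup window has passed, an external job moves deduplicated rows
insert into events
select id, anyLast(metric)
from staging_202406011200
group by id;

drop table staging_202406011200;
```
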
### Approach 6. Deduplication using an MV pipeline

You insert into some temporary table (even with Engine=Null) and an MV does a join or subselect
(which checks the existence of the arrived rows in some time frame of the target table) and copies only the new rows to the destination table.

Pros:
- doesn't impact the select speed

Cons:
- complicated
- for clusters can be inaccurate due to eventual consistency
- slows down inserts significantly (every insert will need to do a lookup in the target table first)

```sql
create table Example1 (id Int64, metric UInt64)
engine = MergeTree order by id;

create table Example1Null engine = Null as Example1;

create materialized view __Example1 to Example1 as
select * from Example1Null
where id not in (
    select id from Example1 where id in (
        select id from Example1Null
    )
);
```

In all cases: due to the eventual consistency of ClickHouse replication you can still get duplicates if you insert into different replicas/shards.
