(This article is about row-level deduplication of already ingested data. For insert/block-level deduplication and insert idempotency, see [Insert Deduplication / Insert Idempotency](https://kb.altinity.com/altinity-kb-schema-design/insert_deduplication/). For materialized-view retry semantics, see [Idempotent inserts into a materialized view](https://kb.altinity.com/altinity-kb-schema-design/materialized-views/idempotent_inserts_mv/).)
There is a quite common requirement to deduplicate data at the record level in ClickHouse.
* Sometimes duplicates appear naturally on the collector side.
* Sometimes they appear due to the fact that message queue systems (Kafka/RabbitMQ/etc.) offer at-least-once guarantees.
* Sometimes you just expect insert idempotency at the row level.
For the general case, ClickHouse does not provide a cheap built-in way to enforce arbitrary row-level uniqueness across an already large table.
That is a different problem from retry-safe insert deduplication, which ClickHouse supports separately for `MergeTree` family inserts.
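For reference, a minimal sketch of that insert-time (block-level) deduplication. The table name `t` is hypothetical; non-replicated MergeTree tables need the `non_replicated_deduplication_window` setting (Replicated tables have a deduplication window enabled by default):

```sql
create table t (id UInt64) engine = MergeTree order by id
settings non_replicated_deduplication_window = 100;

insert into t values (1), (2); -- stored
insert into t values (1), (2); -- identical block: skipped, still 2 rows
insert into t values (2);      -- different block: stored, id=2 is now duplicated
```

Deduplication here works per inserted block, not per row, which is exactly why it does not solve the row-level problem discussed in this article.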
The reason is simple: to check whether a row already exists you need a lookup that is closer to a key-value access pattern (which is not what ClickHouse is optimized for), and in the general case the lookup spans the whole huge table (which can be terabytes or petabytes in size).
But there are many use cases where you can achieve something like row-level deduplication in ClickHouse:
### Approach 0. Deduplicate before ingesting data into ClickHouse
Pros:
- you have full control
- clean and simple schema and selects in ClickHouse
Cons:
- extra coding and 'moving parts', storing some ids somewhere
- checking whether a row exists in ClickHouse before insert can give unsatisfying results on a ClickHouse cluster (i.e. Replicated / Distributed tables) due to eventual consistency
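To illustrate the last point, the naive check-then-insert pattern looks like this (table and values hypothetical). On a Replicated/Distributed setup the SELECT may land on a replica that has not yet received a recent insert, so the check can pass even though the row already exists:

```sql
-- naive client-side idempotency check (racy on clusters)
select count() from events where id = 42;
-- the client inserts only when the count was 0:
insert into events (id, metric) values (42, 100);
```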
### Approach 1. Allow duplicates during ingestion.
Remove them at SELECT time (with things like GROUP BY).
Pros:
- simple inserts
Cons:
- complicates selects
- all selects will be significantly slower
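A sketch of select-time deduplication, assuming a hypothetical `events` table where the latest row per `id` should win:

```sql
create table events (id UInt64, ts DateTime, metric UInt64)
engine = MergeTree order by (id, ts);

-- GROUP BY flavor: one row per id, the latest metric wins
select id, argMax(metric, ts) as metric
from events
group by id;

-- LIMIT BY flavor: keeps entire rows, one per id
select *
from events
order by id, ts desc
limit 1 by id;
```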

### Approach 2. Eventual deduplication using Replacing

Pros:

- simple

Cons:

- can force you to use a suboptimal primary key (which will guarantee record uniqueness)
- deduplication is eventual - you never know when it will happen, and you will get some duplicates if you don't use `FINAL`
- selects with `FINAL` (`select * from table_name FINAL`) add overhead and should be benchmarked
- older versions often needed manual optimization https://github.com/ClickHouse/ClickHouse/issues/31411
- performance has improved significantly in recent releases, so `FINAL` is often acceptable in production workloads https://clickhouse.com/blog/common-getting-started-issues-with-clickhouse

### Approach 3. Eventual deduplication using Collapsing

Pros:

- you can make proper aggregations of the last state without `FINAL` (bookkeeping-like sums, counts, etc.)

Cons:

- complicated
- can force you to use a suboptimal primary key (which will guarantee record uniqueness)
- you need to store the previous state of the record somewhere, or extract it from ClickHouse before ingestion
- deduplication is eventual (same as with Replacing)

### Approach 4. Eventual deduplication using Summing with SimpleAggregateFunction(anyLast, ...), Aggregating with argMax, etc.

Pros:

- you can finish deduplication with GROUP BY instead of `FINAL` (it's faster)

Cons:

- quite complicated
- can force you to use a suboptimal primary key (which will guarantee record uniqueness)
- deduplication is eventual (same as with Replacing)

For the Aggregating variant, a materialized view keeps the latest `metric` per `id` by `version`:

```sql
-- (the DDL for Example4Raw and Example4Agg was elided here; Example4Raw is
--  assumed to be a plain MergeTree of (id, version, metric), and Example4Agg
--  an AggregatingMergeTree with an AggregateFunction(argMax, UInt64, UInt64)
--  metric_state column)
create materialized view Example4AggMV to Example4Agg as
select id, argMaxState(metric, version) as metric_state
from Example4Raw
group by id;
```

In that example the result contains `id = 1, metric = 20` and `id = 2, metric = 30` (the state is read with `argMaxMerge(metric_state)`).

Alternatively, recent ClickHouse releases provide the `CoalescingMergeTree` engine, which keeps the latest non-NULL value of each column for rows with the same sorting key:
```sql
create table Example4Raw
(
    id UInt64,
    version UInt64,
    metric UInt64
)
engine = MergeTree
order by (id, version);

create table Example4Agg
(
    id UInt64,
    metric_state Nullable(UInt64)
)
engine = CoalescingMergeTree
order by id;

create materialized view Example4AggMV to Example4Agg as
select id, metric as metric_state
from Example4Raw;
```
### Approach 5. Keep the data fragments where duplicates are possible isolated
Usually you can expect the duplicates only in some time window (like 5 minutes, or one hour, or something like that).
You can put that 'dirty' data in a separate place and move it to the final MergeTree table after the deduplication window has passed.
For example: you insert data into tiny tables (Engine=StripeLog) with a minute suffix, and move data from tiny tables older than X minutes to the target MergeTree table (with some external queries).
In the meantime you can see realtime data using Engine=Merge / VIEWs, etc.
Pros:
144
+
- good control
145
+
- no duplicates in the target table
146
+
- perfect ingestion speed
Cons:
- quite complicated
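A sketch of this scheme, with all names hypothetical: per-minute StripeLog buffers, a Merge table over clean and dirty data for realtime reads, and an external job that deduplicates and moves buffers older than the window:

```sql
-- target table: deduplicated data only
create table events (id UInt64, ts DateTime, metric UInt64)
engine = MergeTree order by (id, ts);

-- one 'dirty' buffer per minute, created by the loader
create table events_dirty_202401011200 (id UInt64, ts DateTime, metric UInt64)
engine = StripeLog;

-- realtime view over clean + dirty data
create table events_all as events
engine = Merge(currentDatabase(), '^events($|_dirty_)');

-- external job, once a buffer is older than the dedup window:
insert into events select distinct * from events_dirty_202401011200;
drop table events_dirty_202401011200;
```

`select distinct` here is only a placeholder for whatever dedup rule fits the data; note it would not catch duplicates spread across two different buffer tables.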
### Approach 6. Deduplication using MV pipeline.
You insert into some temporary table (even with Engine=Null) and a materialized view does a join or subselect (which checks for the existence of the arrived rows in some time frame of the target table) and copies only the new rows to the destination table.
Pros:
157
+
- doesn't impact the select speed
Cons:
- complicated
- for clusters can be inaccurate due to eventual consistency
- slows down inserts significantly (every insert needs to do a lookup in the target table first)
```sql
create table Example1 (id Int64, metric UInt64)
engine = MergeTree order by id;

create table Example1Null engine = Null as Example1;

create materialized view __Example1 to Example1 as
select * from Example1Null
where id not in (
    select id from Example1 where id in (
        select id from Example1Null
    )
)
```
In all cases: due to the eventual consistency of ClickHouse replication, you can still get duplicates if you insert into different replicas/shards.