by goodroot on 11/21/2023, 10:40:05 PM
by goenning on 11/22/2023, 8:47:36 AM
If your ClickHouse ReplacingMergeTree returns twice the expected row count, it's because your query is wrong. You don't need FINAL; just use aggregation in your queries, as per the ClickHouse docs.
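For example, a minimal sketch of the query-time aggregation approach the ClickHouse docs recommend; the table and columns (`events`, `id`, `ts`, `value`) are hypothetical:

```sql
-- Assumed: events is a ReplacingMergeTree ordered by id, versioned by ts.
-- Instead of SELECT ... FROM events FINAL, collapse duplicates at query time:
SELECT
    id,
    argMax(value, ts) AS value  -- keep the row with the latest ts per key
FROM events
GROUP BY id;
```

This avoids forcing a merge with FINAL while still returning one row per key.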
by adren123 on 11/22/2023, 3:12:15 PM
An initial import with DuckDB of all 15 files takes only 36 seconds on a regular (6-year-old) desktop computer with 32 GB of RAM, and 26 seconds (5 times quicker than QuestDB) on a Dell PowerEdge 450 with a 20-core Intel Xeon and 256 GB of RAM.
Here is the command to import the files:
CREATE TABLE ecommerce_sample AS SELECT * from read_csv_auto('ecommerce_*.csv');
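A quick sanity check after the import, assuming the table name from the command above:

```sql
-- Verify the glob picked up all 15 files' rows:
SELECT count(*) AS row_count FROM ecommerce_sample;
```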
by whalesalad on 11/21/2023, 10:49:56 PM
Can anyone comment on QuestDB vs. ClickHouse vs. TimescaleDB? Real-world experience around ergonomics, ops, etc.
Currently using BigQuery for a lot of this (ingesting ~5-10TB monthly) but would like to begin exploring in-house tooling.
On the flip side, we still use PSQL/RDS a lot, and I enjoy it for the low operations burden, but we're doing some time-series work with it now that is starting to fall over. TimescaleDB is nice because it is Postgres, but AFAIK it can't run inside RDS. ClickHouse is next on my list for a test deployment, but QuestDB looks pretty neat too.
by jimsimmons on 11/22/2023, 5:35:54 AM
What is the best way to deduplicate a corpus of documents?
Hey! Thanks for upvoting.
Happy to answer any questions about deduplication. One thing that's not included in the write-up is that we also address out-of-order indexing alongside deduplication.