Hacker News Clone

Show HN: Data Diff – compare tables of any size across databases

by hichkaker on 6/22/2022, 3:28:04 PM with 21 comments

Gleb, Alex, Erez and Simon here – we are building an open-source tool for comparing data within and across databases at any scale. The repo is at https://github.com/datafold/data-diff, and our home page is https://datafold.com/.

As a company, Datafold builds tools for data engineers to automate the most tedious and error-prone tasks falling through the cracks of the modern data stack, such as data testing and lineage. We launched two years ago with a tool for regression-testing changes to ETL code https://news.ycombinator.com/item?id=24071955. It compares the produced data before and after the code change and shows the impact on values, aggregate metrics, and downstream data applications.

While working with many customers on improving their data engineering experience, we kept hearing that they needed to diff their data across databases to validate data replication between systems.

There were 3 main use cases for such replication:

(1) To perform analytics on transactional data in an OLAP engine (e.g. PostgreSQL > Snowflake) (2) To migrate between transactional stores (e.g. MySQL > PostgreSQL) (3) To leverage data in a specialized engine (e.g. PostgreSQL > ElasticSearch).

Despite multiple vendors (e.g., Fivetran, Stitch) and open-source products (Airbyte, Debezium) solving data replication, there was no tooling for validating the correctness of such replication. When we researched how teams were going about this, we found that most have been either:

Running manual checks: e.g., starting with COUNT(*) and then digging into the discrepancies, which often took hours to pinpoint the inconsistencies. Using distributed MPP engines such as Spark or Trino to download the complete datasets from both databases and then comparing them in memory – an expensive process requiring complex infrastructure.

Our users wanted a tool that could:

(1) Compare datasets quickly (seconds/minutes) at a large (millions/billions of rows) scale across different databases (2) Have minimal network IO and database workload overhead. (3) Provide straightforward output: basic stats and what rows are different. (4) Be embedded into a data orchestrator such as Airflow to run right after the replication process.

So we built Data Diff as an open-source package available through pip. Data Diff can be run in a CLI or wrapped into any data orchestrator such as Airflow, Dagster, etc.

To solve for speed at scale with minimal overhead, Data Diff relies on checksumming the data in both databases and uses binary search to identify diverging records. That way, it can compare arbitrarily large datasets in logarithmic time and IO – only transferring a tiny fraction of the data over the network. For example, it can diff tables with 25M rows in ~10s and 1B+ rows in ~5m across two physically separate PostgreSQL databases while running on a typical laptop.

We've launched this tool under the MIT license so that any developer can use it, and to encourage contributions of other database connectors. We didn't want to charge engineers for such a fundamental use case. We make money by charging a license fee for advanced solutions such as column-level data lineage, CI workflow automation, and ML-powered alerts.

by alexkoay on 6/23/2022, 6:59:23 PM
This seems to be very useful!
I notice that it casts everything to string for MD5 to work. In that case, how does it handle two databases having different types for the same columns? I'm thinking about floats and numerics (decimal places), timestamps (some have timezone support, some don't) and bytes (hex, base64) in particular, but there are definitely others that I'm missing as well.
by oa335 on 6/23/2022, 2:56:31 AM
Hi, data engineer here. There are umpteen data engineering tools that have “Data” in the title. Have you considered a different name?
by higeorge13 on 6/22/2022, 8:22:51 PM
Awesome tool, we will definitely give it a try! 2 questions:
- how do you handle the data replication lag in the comparison?
- i assume that this works in identical tables between 2 databases, right? Any support for “similar” tables based on a column set? Imagine that we have a use case where we have a table X in one db, and another table Y in another db, with some columns from X and enhanced attributes.
by pbnjay on 6/23/2022, 1:37:53 AM
Pretty cool! I did a similar project for flatfiles, but used bloom filters to generate an “index” of row contents to test against later. I feel like a similar idea could work for identifying divergent rows within your segments more quickly/with less repeated work.
Making that work across databases could be a huge pain though, I had some success in Postgre but bitfields in the other DBs were painful.
by snidane on 6/23/2022, 3:58:22 AM
What benefit does it give me over running table diff query in SQL?
```
  select * exclude (date_uploaded) from dev_table
  except
  select * exclude (date uploaded) from prod_table
```
or
```
  select *
  from (select * exclude (date_uploaded) from dev_table) dev
  natural full join
  (select * exclude (date_uploaded) from prod_table) prod
  where dev.date_uploaded is null
    or prod.date_uploaded is null
```
The only issue with the above is that EXCLUDE/EXCEPT is missing from standard SQL and even from market leaders like Snowflake, making this a massive pain in the ass. Second, natural joining in presence of null fields is going to produce a mess instead of something useful. Again - analytics db providers would rather boast about adhering to an ancient standard from the 1970s than listening to users and actually making SQL work after all those decades of pain.
Without the stupid default behavior of SQL, this wouldn't be a problem. I'm curious if Data Diff solves this or some other use case.
by neural_thing on 6/23/2022, 1:57:27 PM
I've been a Datafold customer for a year at two different companies. Great experience. Very useful tool in your CI flow, the team is very responsive.
by jnsie on 6/22/2022, 11:24:09 PM
Just curious - is there a reason that SQL Server doesn't make the list of supported platforms (i.e. it appears that there is no plan to support in future)?
by cryptonector on 6/22/2022, 9:38:36 PM
Does FDW let you do performant `FULL OUTER JOIN`s and/or `NATURAL FULL OUTER JOIN`s? If so then I would think that would be a decent place to start for remote DB diffs for PG. If might not be enough, of course, if the tables are huge, in which case taking a page from rsync and using some sort of per-row checksum as TFA does is clearly a good idea.
by michalg on 6/22/2022, 8:22:48 PM
Can you diff a table with a view? Or only tables are supported?