• by alexkoay on 6/23/2022, 6:59:23 PM

    This seems to be very useful!

    I notice that it casts everything to string for MD5 to work. In that case, how does it handle two databases having different types for the same columns? I'm thinking about floats and numerics (decimal places), timestamps (some have timezone support, some don't) and bytes (hex, base64) in particular, but there are definitely others that I'm missing as well.

  • by oa335 on 6/23/2022, 2:56:31 AM

    Hi, data engineer here. There are umpteen data engineering tools that have “Data” in the title. Have you considered a different name?

  • by higeorge13 on 6/22/2022, 8:22:51 PM

    Awesome tool, we will definitely give it a try! 2 questions:

    - how do you handle the data replication lag in the comparison?

    - i assume that this works in identical tables between 2 databases, right? Any support for “similar” tables based on a column set? Imagine that we have a use case where we have a table X in one db, and another table Y in another db, with some columns from X and enhanced attributes.

  • by pbnjay on 6/23/2022, 1:37:53 AM

    Pretty cool! I did a similar project for flatfiles, but used bloom filters to generate an “index” of row contents to test against later. I feel like a similar idea could work for identifying divergent rows within your segments more quickly/with less repeated work.

    Making that work across databases could be a huge pain though, I had some success in Postgre but bitfields in the other DBs were painful.

  • by snidane on 6/23/2022, 3:58:22 AM

    What benefit does it give me over running table diff query in SQL?

      select * exclude (date_uploaded) from dev_table
      except
      select * exclude (date uploaded) from prod_table
    
    or

      select *
      from (select * exclude (date_uploaded) from dev_table) dev
      natural full join
      (select * exclude (date_uploaded) from prod_table) prod
      where dev.date_uploaded is null
        or prod.date_uploaded is null
    
    
    The only issue with the above is that EXCLUDE/EXCEPT is missing from standard SQL and even from market leaders like Snowflake, making this a massive pain in the ass. Second, natural joining in presence of null fields is going to produce a mess instead of something useful. Again - analytics db providers would rather boast about adhering to an ancient standard from the 1970s than listening to users and actually making SQL work after all those decades of pain.

    Without the stupid default behavior of SQL, this wouldn't be a problem. I'm curious if Data Diff solves this or some other use case.

  • by neural_thing on 6/23/2022, 1:57:27 PM

    I've been a Datafold customer for a year at two different companies. Great experience. Very useful tool in your CI flow, the team is very responsive.

  • by jnsie on 6/22/2022, 11:24:09 PM

    Just curious - is there a reason that SQL Server doesn't make the list of supported platforms (i.e. it appears that there is no plan to support in future)?

  • by cryptonector on 6/22/2022, 9:38:36 PM

    Does FDW let you do performant `FULL OUTER JOIN`s and/or `NATURAL FULL OUTER JOIN`s? If so then I would think that would be a decent place to start for remote DB diffs for PG. If might not be enough, of course, if the tables are huge, in which case taking a page from rsync and using some sort of per-row checksum as TFA does is clearly a good idea.

  • by michalg on 6/22/2022, 8:22:48 PM

    Can you diff a table with a view? Or only tables are supported?