by lizen_one on 10/2/2022, 2:41:12 PM
by throwawaybutwhy on 10/2/2022, 2:14:36 PM
The package phones home. One has to set an env var or fix several lines of code to prevent that.
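For reference, the opt-out can be done through DVC's own config (the `core.analytics` key is documented; whether your installed version also honors an environment variable is worth checking in its docs):

```shell
# Disable DVC's anonymized usage analytics for the current user.
# Run once after installing; no repo needed thanks to --global.
dvc config --global core.analytics false
```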
by adhocmobility on 10/2/2022, 5:46:32 PM
If you just want git for large data files, and your files don't get updated too often (e.g. an ML model deployed in production that gets updated every month), then git-lfs is a nice solution. Bitbucket and GitHub both support it.
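A minimal git-lfs setup for that use case looks like this (the `*.onnx` pattern is just an example; track whatever large-file extensions your models use):

```shell
# One-time setup: install the Git LFS hooks for this user
git lfs install

# Track large model files by pattern; this writes an entry to .gitattributes
git lfs track "*.onnx"

# Commit .gitattributes so collaborators get the same tracking rules
git add .gitattributes
git commit -m "Track model files with Git LFS"
```

From then on, matching files are stored as small pointer files in git and the actual content lives in LFS storage on the host.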
by tomthe on 10/2/2022, 2:50:21 PM
Can anyone compare this to DataLad [1], which someone introduced to me as "git for data"?
by polemic on 10/2/2022, 11:10:43 PM
If you're looking for something that actually tracks tabular data, there's https://kartproject.org. It's geo-focused but also works with standard database tables. Built on git (Kart repos are git repos), and it can track PostgreSQL, MSSQL, MySQL, etc.
by LaserToy on 10/2/2022, 2:15:58 PM
Can it be used for large and fast changing datasets?
Example: 100 TB, with writes every 10 minutes.
Or 1 TB of Parquet, of which 40% is rewritten daily.
by smeagull on 10/2/2022, 10:08:11 PM
I don't think this tool can encompass everything you need in managing ML models and data sets, even if you limit it to versioning data.
I'd need such a tool to manage features, checkpoints and labels. This doesn't do any of that. Nor does it really handle merging multiple versions of data.
And I'd really like the code to be handled separately from the data. Git is not the place to do this, because the choice of pairing code with data should happen at a higher level and be tracked along with the results - that's not going in a repo - MLflow or TensorBoard handle it better.
by bs7280 on 10/2/2022, 6:25:41 PM
What value does this provide that I can't get by versioning my data in partitioned parquet files on s3?
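To be fair, that do-it-yourself approach is mostly a naming convention: write each snapshot under an immutable version prefix and never overwrite it. A minimal sketch of such a layout (bucket, dataset, and partition names are all hypothetical):

```python
from datetime import date

def snapshot_prefix(bucket: str, dataset: str, version: str) -> str:
    """Build an immutable S3 prefix for one dataset snapshot."""
    return f"s3://{bucket}/{dataset}/version={version}/"

def partition_key(prefix: str, region: str, day: date, part: int) -> str:
    """Hive-style partitioning by region and day inside one snapshot."""
    return f"{prefix}region={region}/dt={day.isoformat()}/part-{part:05d}.parquet"

prefix = snapshot_prefix("my-bucket", "events", "2022-10-02")
key = partition_key(prefix, "eu-west-1", date(2022, 10, 2), 0)
# key == "s3://my-bucket/events/version=2022-10-02/region=eu-west-1/dt=2022-10-02/part-00000.parquet"
```

Since each `version=` prefix is write-once, "checking out" an old version is just reading from its prefix; what this layout does not give you is deduplication between versions or a link back to the code that produced them.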
DVC had the following problems when I tested it (about half a year ago):
It gets super slow (you wait minutes) once a few thousand files are tracked. And thousands of files have to be tracked if you have, e.g., a 10 GB file per day per region, plus the artifacts generated from them.
You are encouraged to model your pipeline in DVC (think make), since it can only track artifacts that way. However, it cannot run tasks in parallel, so running a pipeline takes a lot of time even on a beefy machine, because only one core is used. And you obviously cannot plug in other tools (e.g. Snakemake) to distribute/parallelize across multiple machines. Running one (part of a) stage also carries some overhead, because DVC does checks before and commits after running the task's executable.
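For context, a DVC pipeline stage is declared roughly like this in `dvc.yaml` (stage name and paths are made up; syntax as documented for DVC 2.x):

```yaml
stages:
  featurize:
    cmd: python featurize.py data/raw data/features
    deps:
      - featurize.py
      - data/raw
    outs:
      - data/features
```

`dvc repro` then runs stages whose `deps` changed, one at a time, and commits their `outs` to the cache afterwards.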
Sometimes you get merge conflicts if you manually run one part of a (partially parametrized) stage on one machine and the other part on another machine. These are cumbersome to fix.
Currently, I think they are more focused on ML features like experiment tracking (I prefer other mature tools here) instead of performance and data safety.
There is an alternative implementation from a single developer (I cannot find it right now) that fixes some of these problems. However, I do not use it, because it probably will not see the same development progress and testing as DVC.
This sounds negative, but I think it is currently one of the best tools in this space.