Hey HN, I would appreciate feedback on a version-control-for-data toolset I am building, creatively called the Data Manager.

When working with large data repositories, full checkouts are problematic. Many git-for-data solutions create a new copy of the entire dataset for each commit, and to my knowledge none of them allow contributing to a data repo without a full checkout. The video shows a workflow that does not require full checkouts of the datasets and still allows committing changes in Git. Specifically, it becomes possible to check out kilobytes in order to commit changes to a 130 GB repository, versions included. Note that only diffs are committed, at row, column, and cell level, so the diff view in the Git GUI will look odd: it interprets the old diff as the file to be compared with the new one, when in fact both are just diffs.

The goal of the Data Manager is to version datasets, and structured data in general, in a storage-efficient way, and to easily identify and deploy to S3 the dataset snapshots, identified by repository and commit SHA (and optionally a tag), that need to be pulled for processing. S3 is also used to upload heavy files, which are then pointed to by reference, not by URL, in Git commits. The no-full-checkout workflow shown applies naturally to adding data and can be extended to edits or deletions, provided the old data is known. That requirement ensures the diffs are bidirectional, which makes it possible to navigate Git history both forward and backward, useful when caching snapshots. The burden of checking out and building snapshots from the diff history is currently borne by localhost, but that may change, as mentioned in the video. Smart navigation of Git history from the nearest available snapshots, building snapshots with Spark, and other ways to save on data transfer and compute are being evaluated.

This paradigm also makes it possible to hibernate or clean up history on S3 for datasets that are no longer needed to build snapshots, such as deleted ones, as long as snapshots of earlier commits are not required. Individual data entries could also be removed for GDPR compliance using versioning on S3 objects, orthogonally to Git.

The prototype already cures the pain point I built it for: it was previously impossible to (1) uniquely identify and (2) make available behind an API multiple versions of a collection of datasets and config parameters, (3) without overburdening HDDs due to small but frequent changes to any of the datasets in the repo, and (4) while still being able to see the diffs in Git for each commit, to enable collaborative discussion and reverting or further editing if necessary.

Some background: I am building natural-language AI algorithms that are (a) easily retrainable on editable training datasets, meaning changes or deletions in the training data are reflected quickly, without traces of past training and without retraining the entire language model (sounds impossible), and (b) able to explain decisions by pointing back to individual training data. LLMs have fixed training datasets, whereas editable datasets call for a system that manages data efficiently. I also wanted something that integrates naturally with common, tried-and-tested tools such as Git, S3, and MySQL, hence the Data Manager.

I am considering open-sourcing it: is that the best way to go, and which license would you choose?
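
To make the diff-only commits and the forward/backward navigation a bit more concrete, here is a minimal Python sketch of what a bidirectional row-level diff entry and an S3 snapshot address could look like. The names and layout are illustrative guesses on my part, not the Data Manager's actual format:

    # Hypothetical shape of a bidirectional row-level diff; illustrative only.
    from dataclasses import dataclass, field

    @dataclass
    class RowDiff:
        table: str
        key: str                                  # primary key of the affected row
        op: str                                   # "add", "edit", or "delete"
        old: dict = field(default_factory=dict)   # previous values, needed to walk history backward
        new: dict = field(default_factory=dict)   # new values, needed to walk history forward

        def apply_forward(self, row: dict) -> dict:
            """Move a row forward in history by applying the new values."""
            return {} if self.op == "delete" else {**row, **self.new}

        def apply_backward(self, row: dict) -> dict:
            """Move a row backward in history by restoring the old values."""
            return {} if self.op == "add" else {**row, **self.old}

    # A snapshot on S3 could then be addressed by repository and commit SHA,
    # plus an optional tag (placeholder key layout).
    def snapshot_key(repo: str, commit_sha: str, tag: str = "") -> str:
        suffix = f"-{tag}" if tag else ""
        return f"snapshots/{repo}/{commit_sha}{suffix}/data.parquet"

Storing both the old and the new values is what makes edits and deletions replayable in either direction, which is why the workflow needs the old data to be known for those operations.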
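On the GDPR point, here is a rough boto3 sketch of purging every stored version of an object from a versioning-enabled bucket. Bucket and key names are placeholders; this shows the general S3 mechanism, not necessarily how the Data Manager wires it up:

    # Permanently purge every stored version of an S3 object that contains
    # a data subject's entries (assumes a versioning-enabled bucket).
    import boto3

    s3 = boto3.client("s3")
    bucket = "my-data-manager-bucket"          # placeholder bucket name
    key = "diffs/users_table/part-00042.json"  # placeholder object key

    versions = s3.list_object_versions(Bucket=bucket, Prefix=key)
    for v in versions.get("Versions", []) + versions.get("DeleteMarkers", []):
        if v["Key"] == key:
            # Passing VersionId removes that specific version for good,
            # instead of just adding a delete marker on top of it.
            s3.delete_object(Bucket=bucket, Key=key, VersionId=v["VersionId"])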
Story Published at: May 13, 2023 at 06:59PM