Oxen.ai: Fast Unstructured Data Version Control (github.com/oxen-ai)
177 points by sbt567 on Feb 20, 2023 | 63 comments



Oh man, if this could plug into git and be an LFS replacement, that would be awesome. I work in a field where folks run into situations where they think they need LFS, and rarely does it work out well. If someone can figure out an ergonomic and durable LFS-like blob versioning system that can align with git histories, that would be incredible.


Creator of the project here, would love to look at what an LFS style integration with git hooks could look like. Let me know if you have any ideas there!


Honestly, I'd just trawl the interwebs for all the reasons folks have had grief with LFS, and then see if Oxen can even do a fraction better. If it can, you'll have many billion-dollar companies that would be highly interested in your solution :)

A few starting points from personal experience would be:

- Ensuring git-aware code editors don't try and mangle blob histories when they encounter them
- Modern, high-performance blob downloads/version-resolution
- Human-readable/debuggable meta-files, where it's easy to recover from bad states (instead of an opaque empty file with a GUID...?)


If you want blob support, you'd probably need to introduce format-specific understanding of file structures.

You'd also need the blob encoder to be exceptionally stable - if minor changes to the input lead to significant changes to the output, you're going to have a lot of trouble.

Something like Google's Courgette (binary patch generation) could help to simplify things, but an old thread mentioned a patent suit.

https://news.ycombinator.com/item?id=2576878


Shameless plug. I open-sourced https://snapdir.org/ to explore how to solve these exact issues.

It's an early prototype to gather feedback around the ergonomics of the interface and the manifest format, and I would love to learn if people might find it useful enough for the use cases where LFS falls short.


Since launching in December we’ve had a few teams use XetHub specifically as a drop-in replacement for Git LFS and their main reasons for switching are: easier to use (no .gitattributes file management), faster performance, and less storage.

Love to have you try XetHub and give us your feedback!


What’s the challenge with LFS?


LFS is slow (it copies files during checkout and downloads/uploads files independently), inefficient (no de-duplication or delta encoding, unlike base Git), error-prone (it needs manual setup, and rewriting history if anyone gets it wrong), and otherwise not well integrated (it relies on smudge filters and other hooks, so some Git commands are aware of it and others are not; you will often see the `version ... oid ...` pointer content rather than the actual large file).
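
For context, the pointer file that LFS actually checks into Git (and that you end up staring at when the smudge filter doesn't run) looks roughly like this; the hash and size here are placeholders:

  version https://git-lfs.github.com/spec/v1
  oid sha256:<64-hex-digit hash of the real file>
  size 104857600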


> and otherwise not well integrated (it relies on smudge filters and other hooks, so some Git commands are aware of it and others are not; you will often see the `version ... oid ...` pointer content rather than the actual large file).

Curious, do you know the solutions here? I've looked into integrating Git LFS-like things into Git and it felt like Git doesn't give you many tools. Smudging, as hacky as it is, feels like one of a very small pool of options you have. I want to say Git Annex does something slightly different.

Regardless, smudging felt fine given how difficult Git makes anything else. Are there better ways in your mind?
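
For reference, this is roughly all Git exposes to hook into - a `.gitattributes` entry plus the clean/smudge filter config that `git lfs install` registers (the file pattern here is just an example):

  # .gitattributes
  *.psd filter=lfs diff=lfs merge=lfs -text

  # git config entries added by `git lfs install`
  [filter "lfs"]
      clean = git-lfs clean -- %f
      smudge = git-lfs smudge -- %f
      process = git-lfs filter-process
      required = true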


I don't think you can do better if you don't want to change Git, no.


Seems like an odd critique of LFS then, no? Your other points are super valid; I've wanted to replace LFS with my own impl for the same reasons _(also it was originally designed for HTTP servers, ugh)_. However, the integration is roughly as good as it can be, no? Or rather, Oxen would suffer the same limitations, no?


I'm not trying to target this critique at LFS specifically and absolve Git. It may be that LFS is a bad product because of Git's limited extension points, but it remains a bad product.


In addition to what was said above:

In later versions, LFS doesn't play nice with misconfigured internal networks based on Microsoft's Azure / Active Directory. The LFS project dropped support for NTLM (for good reasons), but it's difficult to convince IT to switch to Kerberos. Our team has been trying to do that for several weeks now. So the result is that if we want to use LFS, we are stuck with old versions of LFS which still supported NTLM, which necessitates using old versions of Git which supported that old version of LFS, or else we run into all sorts of weird error messages. Also, the same combination of old Git + old LFS works perfectly on some machines and fails with weird error messages on others. Possibly due to another IT misconfiguration or something.

I've been wanting for a while now to check if I can fork LFS and fix this, but unfortunately my knowledge of Golang is very sketchy. Hopefully we'll manage to convince IT to drop NTLM support from the on-premise Azure.


> Hopefully we'll manage to convince IT to drop NTLM support from the on-premise Azure.

What year is this? ;)

On a more serious note - NTLM, really? And with Azure AD, not a regular (old) on-premise domain?

I know it can be hard to drag Windows setups into the future (hello WINS!) - but why keep NTLM auth in a post-Windows 7 world? Legacy systems without LDAP/RADIUS support? Genuinely curious (and a little horrified).


Tell me about it ;)

Apparently, it's related to the few win 7 machines that are still present somewhere in the organization for various reasons. All of them are airgapped (I hope) so there's no real reason to keep NTLM, but nobody has yet made the decision.

Working on getting to the guy in charge to authorize it. Sigh.


That it’s not baked right into git? Half of my challenges revolve around getting either everything out of LFS, or putting everything in LFS.


I seem to always be running into git-lfs related issues. The latest include:

If one file doesn't resolve, for any reason, git-lfs will block the entire pull. There's no way to say "get whatever you could; the fact that AWS is not serving that one file is not a blocker."

Git clone does not pull the entire LFS history of an image. Lacking the entire history, LFS will fail consistently on future actions.
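
The closest workaround I've found (not a fix) is skipping the smudge step at clone time and then pulling only the paths you actually need:

  GIT_LFS_SKIP_SMUDGE=1 git clone <repo-url>
  cd <repo>
  git lfs pull --include="path/you/actually/need/*"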


Non-devs (designers) groan when they need to worry about LFS, in my experience. Though I never understood the pain myself, just sharing what I've observed.


What version control systems are there for non-code? I never looked into it, I can imagine there's a whole world out there for things like written articles, books, but also video game assets like artwork, sprites, 3D models, textures, animations, etc.

edit: Looks like e.g. the Unreal engine has support for Perforce and SVN (https://docs.unrealengine.com/5.0/en-US/collaboration-and-ve...), where they make use of locking a file when editing it to avoid merge issues with binary files. I can imagine that frequently goes wrong.
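
Worth noting that Git LFS has a similar locking mechanism, though it needs server-side support (the path here is just an example):

  git lfs lock Assets/Models/hero.fbx
  # ...edit the binary file...
  git lfs unlock Assets/Models/hero.fbx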


I am of the opinion that the mind of the designer or creative does not think in terms of "versions".

There are two states - in woodworking, for example, the raw material and the final product. The mental model of "cutting longer" or rolling back does not exist.

The idea of a filesystem-level understanding of "undo" just has not been communicated or explored.

While not Oxen related - Time Machine seems to be the closest conceptual match. Snapshotting the drive, deltas, etc. are generally not considered, in my experience. I would be curious how larger or more mature groups manage versioning.


Non-devs complain about any version control tho. Perforce? Plastic SCM? Does anyone really like them?


How does this compare with other systems, like DVC (https://dvc.org/) for example?


Raw speed on large datasets of images, video, audio, etc. is one factor; some performance numbers can be found here:

https://github.com/Oxen-AI/oxen-release/blob/main/Performanc...


Looks like in those benchmarks Oxen.AI makes a misguided assumption that benchmarking DVC is (roughly...?) the same as benchmarking DVC<>DAGShub (a server side made by a different company). To my understanding, DAGShub is the bottleneck there. They didn't care to benchmark DVC against an S3 bucket or similar cloud storage that is more widely used. I wonder if it's because DAGShub makes this whole setup wayyy slower.
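
For reference, pointing DVC at plain S3 is only a couple of commands (the bucket name is a placeholder), so it would have been easy to include:

  pip install "dvc[s3]"
  dvc remote add -d s3remote s3://my-bucket/dvc-store
  dvc push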


Oxen dev here - let me add some benchmarks for DVC backed by an S3 bucket. I did it a while back and we were still faster, but agree it's a good benchmark to have.

Fundamentally, adding and committing data locally is slower, even before the push. But I agree the remote matters too.


But what on earth is this measuring?

  oxen push origin main # ~308.98 secs
Where does that push to? Does this benchmark really just measure how well-provisioned various different VC-funded websites currently are?

I think a proper benchmark here would be to install the server parts of Oxen, Git LFS, etc. on the same machine, and then time how long it takes to commit and push the same dataset from some other machine.

Although of course given that we live in an age where people expect to upload their immense datasets to the cloud for some reason, a "proper" benchmark might not be a relevant one. I'm not sure what a really good benchmark of that would be.
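
As a sketch of what I mean (assuming self-hosted servers on the same LAN box, and assuming Oxen's CLI mirrors git's flags, which I haven't verified):

  # same dataset, same client machine, servers on the local network
  time oxen add images/
  time oxen commit -m "add images"
  time oxen push origin main

  time git add images/
  time git commit -m "add images"
  time git push origin main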


Will add a local network benchmark as well! Many reasons to upload your data to the cloud...but agree that there are use cases where you might just want to version on your local network.


Oxen seems more like git (with a GitHub-style hub, OxenHub) for ML datasets, whereas DVC is a bit more like make (with S3, LFS, etc. integration) for ML datasets. It seems like Oxen has finer-granularity version control and diff capability, but as far as I can tell it doesn't have as many features to track and version derived data along with the code that produced it (like `dvc repro`).
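
To illustrate the make-like part: a dvc.yaml stage looks something like this (the script and paths are placeholders), and `dvc repro` only re-runs stages whose dependencies changed:

  # dvc.yaml
  stages:
    preprocess:
      cmd: python preprocess.py data/raw data/clean
      deps:
        - preprocess.py
        - data/raw
      outs:
        - data/clean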


We definitely have some of these features on our roadmap! Anything particularly helpful in DVC's workflow that you think we should prioritize?


One thing I love about DVC is that it doesn't need its own server. I can just push/pull files via SSH. I don't really want one more service that I need to keep running. I also happen to have a lot of space available to me on a server I can't install extra services on, so oxen requiring that is a deal breaker for me.
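
For anyone curious, that setup is just (host and path are placeholders):

  pip install "dvc[ssh]"
  dvc remote add -d storage ssh://user@myserver.example.com/data/dvc-cache
  dvc push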


This is the real deal breaker for me. DVC is super slow, but it works with S3 (one of the greatest technologies built in the last 15 years). At our company, we've written our own (10x) faster version of DVC for the commonly used features.


We have S3 backend support in the upcoming features; agree it's essential.


Good feedback, we're working on more streaming features as well as supporting different backends for the CLI.

Any other features you would find useful or a dealbreaker?


Perhaps this is outside the scope of what Oxen aims to do, but I like that DVC has a way for me to specify scripts and dependencies and then decide what needs to be regenerated (and what doesn't) when dependencies change.


Cool! To be honest I don't really use DVC much, but the project version control features are what really interest me. I like how data pipelines help align versioned artifacts like model checkpoints and visualizations with the datasets and code that produced them. I work as a computational scientist, and that sort of reproducibility tool is really important; a lot of us don't have the best software engineering skills/discipline.

From your readme it seems like the oxen repo and software project repo are not as closely coupled as in dvc? It seemed like in the current state of oxen, you could do something similar with make files and oxen tracking?

Oxen seems really good for longer lived data and computational science projects, where dvc seems more oriented just at analysis projects. I have a project that I want to try it out on :)


On the topic of dvc, does anyone have any experiences with dagshub (https://dagshub.com/) that they are interested in sharing?


The comparison with DVC is biased https://github.com/Oxen-AI/oxen-release/blob/main/Performanc...

I'd get nowhere near the same performance with Oxen. The analysis is very biased in Oxen's favor. I wish people had more integrity before trying so hard to push a half-baked product into the market.


I'm reading it, and it's not apparent what that bias is. It's init, add, and push on the same data set.


It vastly depends on the remote. I'm testing Oxen (on their own hub), and it's barely the same speed as DVC on S3. If I were trying to push the results in a pro-DVC direction, I would compare against pushing to a local MinIO and could claim 10x the speed.


Exactly, this is because of the coupling with DAGShub as the remote. Full disclosure - I'm associated with Iterative (the DVC creators) - we have nothing to do with DAGShub, I'm afraid. They don't consult or collaborate with us at all (to this date) about how to build or optimize their server or workflows for their "hub".


Link to the actual project source https://github.com/Oxen-AI/Oxen


Great to see more people in this space! We are the authors of XetHub (posted in Dec ‘22, Show HN: https://news.ycombinator.com/item?id=33969908) and also think a git-like workflow is perfect for ML dataset management, except that we actually integrate with git (like LFS). A quick benchmark suggests we are 2x your published performance!


On your GitHub org, the Twitter link is pointing to the wrong handle:

@oxen_ao -> @oxen_ai


The Privacy Policy and the Terms and Conditions also just link to https://www.oxen.ai/repositories


Oops, thanks for catching! Updated.


Good catch, thank you! Updated


Being realistic here, a 3rd-party provider for data handling will be a no-go for many firms, for infosec reasons. Whereas a hub with no UI might also be a no-go for convenience reasons. I understand that OxenHub is a way to monetise the project, but is there a self-hosted 'enterprise' version of it anywhere in the plans?


Any plans for adding exclusive locking and an option to delete old versions of a file? These are really important when working with large, unmergeable files.


Web hub (similar to GitHub): https://www.oxen.ai/


What are the differences between this and DVC?


Creators of the project here, the first difference is the raw speed if you have many image, video, audio files, etc.

We did some benchmarking here: https://github.com/Oxen-AI/oxen-release/blob/main/Performanc...

TL;DR: 200k+ images from the CelebA dataset take ~6 minutes to add, commit, and push into Oxen. The same dataset takes ~3 hours with DVC.


Maybe I missed it - but what is the approximate size of the dataset in GB?

Would be interesting to see rclone.org (and maybe s4cmd - should be faster access to S3?) - and also Mercurial and bitkeeper.org - both should behave a little better than git on large files?


Hey, this is a minor thing, but could you re-render that graph in a higher-contrast color scheme than light grey on white? :)


Good call out, didn't realize how hard they were to see. Updated!


I see there are already a bunch of questions about how this compares to other tools like DVC, Dolt, pachyderm.io, and LFS. I would just like to add one to that list:

How does this compare to lakeFS?


How does it compare to dolt?

https://github.com/dolthub/dolt


Dolthub is more about versioning structured SQL data tables. Oxen is more about handling large unstructured sets of images, video, audio, text, and DataFrame (parquet, csv, arrow, etc.) files.


Please implement account deletion; you are violating people's privacy and GDPR, and this is a dark pattern.


You're a dark pattern


How does this compare to Pachyderm.io?


why not just use git?


I guess you never tried using git with large repos (> 100 GB or so) which are predominantly populated with big (>100 MB) binary files? On good days it's just unusably slow. On bad days it completely falls apart.
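
Partial clone helps a bit with the initial download, though it doesn't fix the repacking/delta costs for big binaries:

  git clone --filter=blob:none <repo-url>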


Fast, huge file support included.



