Hacker News
Xltrail – An open-source Git command line extension for Excel workbook files (xltrail.com)
95 points by bjoerns on Feb 19, 2018 | hide | past | favorite | 23 comments



So this does NOT create diffs for the underlying Excel workbooks, just for the VBA in the workbooks?


I have to handle quite a lot of Excel documents at work, much to my despair. Unfortunately we are not using Git but SVN. I ended up creating a small tool that extracts the VBA, the named ranges, plus each sheet in this format: "cell_address,type,format,value". I have to commit those files alongside the spreadsheet, which is not great, but it has helped me a lot. Nice to hear others are working on the problem!
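A dump like that can be sketched with just the Python standard library, since an .xlsx file is a zip of XML. This illustrative version reads one worksheet's cells and emits "cell_address,type,value" lines; it skips the format column (that lives in styles.xml) and shared-string lookup to stay short.

```python
import zipfile
import xml.etree.ElementTree as ET

# SpreadsheetML default namespace used inside worksheet XML
NS = "{http://schemas.openxmlformats.org/spreadsheetml/2006/main}"

def dump_cells(xlsx_path, sheet="xl/worksheets/sheet1.xml"):
    """Return 'address,type,value' lines for one worksheet of an .xlsx."""
    with zipfile.ZipFile(xlsx_path) as z:
        root = ET.fromstring(z.read(sheet))
    rows = []
    for c in root.iter(NS + "c"):          # every <c> element is a cell
        f = c.find(NS + "f")               # <f> holds the formula, if any
        v = c.find(NS + "v")               # <v> holds the cached value
        value = ("=" + f.text) if f is not None else (v.text if v is not None else "")
        rows.append(f'{c.get("r")},{c.get("t", "n")},{value}')
    return rows
```

Committing the output of something like this next to the workbook gives you greppable, diffable text, which is essentially what the commenter describes.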


Have you tried `textconv` instead? You can configure git to generate your CSVs on the fly and show the differences between them when you run `git diff`. See: https://www.kernel.org/pub/software/scm/git/docs/gitattribut...
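For example (the `xlsx2csv` command here is a stand-in for whatever tool dumps your workbook to text; anything that prints the text version to stdout works):

```
# .gitattributes
*.xlsx diff=xlsx

# .git/config (or ~/.gitconfig)
[diff "xlsx"]
    textconv = xlsx2csv
```

With that in place, `git diff` and `git log -p` show differences between the converted text versions instead of "Binary files differ".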


Speaking of text conversion and git: I usually don't commit non-textual data aside from necessary files like image and audio assets. But one time I committed some PDFs in a "samples" directory for a tool I made to extract data from a set of PDF files. Later I removed one of them, and when I typed "git show", the diff showed the text contents of the PDF. I found that rather mind-blowing, given how much trouble I had experienced extracting text, while git was casually showing me an ASCII rendering of the document, more or less, with a good representation of its layout.

This in fact prompted me to further investigate the open-source text extraction tools on the market, and I ended up finding one that was better than the one I had initially selected and built upon. Happily, my own tools were built in such a way that I could reuse most of the code I had written with the previous tool. During the rewrite I also realized I could structure the new code much more cleanly, so there were basically only upsides to switching tools and rewriting some of my code :)


What tools did you end up settling on for PDF data/text extraction? I ask because I have a side project that I've been neglecting for far too long which depends in part on cleanly extracting text from PDFs (other formats too, but PDFs are by far the most headache-inducing).


I use the pdf.load() and pdf.tree.write() methods of PDFQuery.

https://github.com/jcushman/pdfquery

> PDFQuery works by loading a PDF as a pdfminer layout, converting the layout to an etree with lxml.etree, and then applying a pyquery wrapper. All three underlying libraries are exposed, so you can use any of their interfaces to get at the data you want.

Here is a minimal Python script to dump the XML tree so you can load it in whatever other language you use and work with it from there.

    #!/usr/bin/env python3

    from pdfquery import PDFQuery

    pdf = PDFQuery("some_document.pdf")
    pdf.load()
    pdf.tree.write("some_document.xml", pretty_print=True, encoding="utf-8")
It doesn’t work perfectly with all documents, but it works well with many. Give it a try.

    pip3 install pdfquery



Git really is awesome! Too bad we are only using SVN :-(

I like to have a text version of the workbook so I can query it with grep. We have several dozen very similar workbooks (whoever set that up first was insane). Any change request involves updating all the workbooks in exactly the same way. After I update them I can run something like this: grep "^A1" /*/sheet1.csv

Using this I get the value/formula and format of sheet1!A1 across all workbooks. That way I am sure they have all been updated in the same way. And this works with the VBA code as well.

Not the nicest system I have worked on, at least we don't have to touch it too often.


Excellent


I have to say, pretty neat. And I partly feel like I may need a shower after thinking that.

I have no doubt that a product like this meets a real need ... I used to write software targeted at end users for reducing first-level support at a huge company, and it never ceased to amaze me what unbelievably important things Excel and Access were used to manage (on the order of processes whose failure would result in 7-figure problems). It would further amaze me to find that many of the people working with these end-user-created solutions had some really, durably solid rocks between their ears, and yet were still able to navigate these tools with moderate success. Unfortunately, there's little chance that the average Excel hacker is going to have much interest in or ability to operate git, and the average git enthusiast is probably a lot like me: a guy who goes out of his way to never launch Excel for any reason, ever.

The last time I was given a task in which the work product was demanded in Excel, I wrote an application to create the spreadsheet programmatically ... partly because it was the only way to do some of the "clean up" they were asking for[0] and partly because I am a smart-ass by nature.

I realize I'm a specific case and I probably have a much stronger allergy to Excel than most, and as I said, there are few large companies that don't have stupidly-complex processes that (also stupidly) rely on Excel. I'm also a guy who bends git into just about every scenario I can find, so I love the idea of anything that makes it more useful. Kudos.

[0] I had to correlate output from several web services, and the source was a spreadsheet that had the values they responded with. I wrote an app to read the rows and apply edit distance against them to correlate a few more than were possible using just Excel tricks.


Getting the diff is nice; now if someone has a solution to merge files, I am all ears.


OP here: feel free to log an issue on GH, it's definitely on our agenda: https://github.com/ZoomerAnalytics/git-xltrail


Nice attempt. I looked into this problem a while ago too. Quick question for the OP: I know a lot of enterprise users want to version control their spreadsheets, not just the code. Have you looked into that? It seems to be a tricky problem to solve.


Our server-side (commercial) product actually does handle spreadsheets as well as VBA code. If you want to see it in action, have a look at our demo repositories: https://demo.xltrail.com


The VBA component seems to use https://github.com/decalage2/oletools . What are you using for handling spreadsheets?


I wonder if there wouldn't be a solution to diff a spreadsheet by analysing the grid and the formulas, recognising patterns (formulas dragged across a range), then generating code that reproduces the spreadsheet, converting patterns into for loops. If the spreadsheet is reasonably consistent and logical, the resulting code shouldn't be too large. Then you would check the resulting code into a normal git server. That would allow you to do merges and diffs, and to regenerate the spreadsheet from code.

Easier said than done given it has to deal with array formulas, formatting, etc. But it should be doable.
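A rough sketch of the pattern-recognition step, under the assumption that "dragged" formulas differ only by relative references: normalise each formula to offsets relative to its own cell (R1C1 style), so every cell in a dragged range collapses to one identical pattern. This naive regex would also mangle tokens like LOG10 or references inside quoted strings; a real tool needs an actual formula parser.

```python
import re

# Matches A1-style references, capturing the optional $ markers
A1_REF = re.compile(r"(\$?)([A-Z]+)(\$?)([0-9]+)")

def col_num(letters):
    # "A" -> 1, "Z" -> 26, "AA" -> 27
    n = 0
    for ch in letters:
        n = n * 26 + ord(ch) - 64
    return n

def normalise(formula, row, col):
    """Rewrite relative A1 refs as R[dr]C[dc] offsets from (row, col).
    Refs containing any $ are treated as absolute and left untouched."""
    def repl(m):
        col_abs, letters, row_abs, digits = m.groups()
        if col_abs or row_abs:
            return m.group(0)
        return f"R[{int(digits) - row}]C[{col_num(letters) - col}]"
    return A1_REF.sub(repl, formula)
```

With this, "=A1+B1" in C1 and "=A9+B9" in C9 both normalise to "=R[0]C[-2]+R[0]C[-1]", so a dragged column shows up as one pattern repeated over a range, which is exactly what you would emit as a for loop.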


SharePoint on-prem and online do this out of the box but they might be overkill. Works really nicely with Microsoft Teams.


Can anyone here answer whether it’s normal, when building git based interfaces, to only use the API for queries and contextual things but utilize regular git commands to perform interactions with the remote repositories?

I am a 3D artist building a system to version control the configuration/script file directory trees for all the applications I use on a day to day basis. I will be taking a look at this code for clues but maybe someone in this thread can help me with some questions.

Does anyone know something similar to this? I am modeling it on the VSCode Setting Sync extension, which is simple but a little rough itself.

I am using Python and just about have what I’m going for, but it’s not as tight as I want it to be. I have a decent system using one repository with a branch per directory tree, and folder symlinks to organize it locally. I can pull the branches into individual local repos, one per remote branch, and it works nicely. This allows me to use regular git commands when I need to without fussing with other applications’ configuration trees.

But things get pretty loose when it comes to building the backup() and restore() functions. For restore(), I can just do an overwrite with a fetch and then a hard reset, or various things like that, but it seems like there must be a better approach, considering the local repository, index, etc. are a bit unnecessary. If anyone has any thoughts, links or pointers, I’d be super thankful! Sorry for the long post.
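For what it's worth, the fetch-and-reset style of restore() can be written as a small shell function like this (remote and branch names assumed; by design it throws away local commits, edits, and untracked files):

```shell
# Force the working tree to match the remote branch, discarding local state.
restore() {
  branch="$1"
  git fetch origin "$branch" &&
  git reset --hard "origin/$branch" &&  # make HEAD, index, and tree match the remote
  git clean -fd                         # remove untracked leftovers
}
```

It is blunt, but for a "the remote is the source of truth" config-sync workflow that bluntness is arguably the point.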


> Can anyone here answer whether it’s normal, when building git based interfaces, to only use the API for queries and contextual things but utilize regular git commands to perform interactions with the remote repositories?

Yes, because there are myriad ways to connect to remotes: local filesystem, the git protocol (rare), git+ssh, http(s), whatever git clone "user@host:myrepo.git" does, and SVN bridges are the ones I know, and it's quite possible there are more.

So you'd need around 4-5 protocol reimplementations, plus some way to configure the weird stuff that people do with SSH (public keys with or without passphrases, ssh-agent, custom ports, port knocking, bastion hosts, whatever else one may fit in .ssh/config)... Best leave this to git, especially as there's a lot of crypto in there that you're bound to mess up.


Does this site mess with scrolling? On a macbook air, chrome, it scrolls wildly even with a tiny nudge.


Yes. I got so fed up with bullshit like this that I made a blocklist for my ad blocker just to keep sane while browsing.

https://gist.github.com/narrowtux/19220fc077ecbaeb274c56e849...

The latest entry blocks the smooth scroll on the site of this post. They've gotten smarter now and include this in bundles.


Hmm... I literally just created an AWS bucket that magically (i.e., via a Lambda) creates corresponding .json files for any .xlsx file that gets put there.

I might have used Git if I had known this existed. For now, our requirement is that the file be maintained in Excel format, but the data needs to be consumed by scripts written in different languages, hence the automatic .json conversion.
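Lambda plumbing aside, the conversion core such a function might run can be sketched in a few lines (first row taken as column headers; purely illustrative, since the real handler would read the workbook from the S3 event and put the .json object back):

```python
import json

def rows_to_json(rows):
    """rows: list of lists from a worksheet, first row = column headers."""
    headers = rows[0]
    records = [dict(zip(headers, row)) for row in rows[1:]]
    return json.dumps(records, default=str)  # default=str covers dates etc.
```

One record per row keeps the output trivially consumable from any language with a JSON parser, which was the stated goal.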


I see this frequently, but I question the choice of JSON, which I assume actually means newline-delimited JSON objects. What is the benefit of JSON? Familiarity?

If you need to ingest tabular data, is JSON the ideal format? Why not something like escaped TSV? I realize storage is cheap, but JSON has a lot of overhead, and I don't see the benefit of using it to store data files, especially ones derived from spreadsheets.

Depending on access patterns something like Parquet may be a lot more useful.



