However, had the author normalized instances.json with something like "jq -S", then one would end up with a more reasonable 1736 textual changes, which GitHub would almost certainly have rendered.
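For anyone who hasn't used it, that normalization is a one-liner; a minimal sketch, reusing the instances.json name from the example:

    # jq -S sorts object keys, so semantically identical JSON always
    # serializes the same way and the diffs stay small and stable
    jq -S . instances.json > instances.sorted.json
    mv instances.sorted.json instances.json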
It doesn't help with the GitHub UI views, but you can use the --tool option to git diff and configure alternative diff tools in your git config, including something like piping through a pretty printer, or using a (generally much slower) character-based diff tool rather than a line-based one.
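A sketch of the git-config side of that, using jq as the pretty printer (the "json" driver name is arbitrary):

    # pretty-print and key-sort JSON before diffing, so git diff and git log -p
    # compare normalized documents instead of raw minified lines
    git config diff.json.textconv "jq -S ."
    echo "*.json diff=json" >> .gitattributes

This only affects local diffs, which is exactly the limitation mentioned above.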
Hey, I have been doing the same thing as you for a long time, for some types of online resources I am interested in. Really nice work! One small thing you might find interesting: in the beginning, I would push the automated commits via my GitHub identity, so they would all be associated with my account as my activity. This annoyed me, but I also couldn't accept the idea of the commits coming from a non-GitHub account (so the username wouldn't be clickable), nor the idea of creating and maintaining a separate set of credentials to push the changes under. I thought about what would be a good default identity I could use, and after some experimentation I found that if you use these credentials, the commits appear as if the native GitHub Actions bot pushed them:
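(The values below are the commonly used github-actions bot identity; worth double-checking against a commit made by a stock workflow in your own repo.)

    git config user.name "github-actions[bot]"
    git config user.email "41898282+github-actions[bot]@users.noreply.github.com"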
They have the right icon and a clickable username, and it is as simple as just using this email and name. You or someone else might like to do this too, so here's me sharing this neat trick I found.
I've been doing this to track the UK's "carbon intensity" forecast and compare it with what is actually measured. I now have several months' worth of data on the quality of the model and forecast published here: https://carbonintensity.org.uk/. Thanks for the inspiration!
One thing I notice is that the diff still requires pretty deep analysis. You need to be able to compare XML or JSON over time.
I keep thinking the real power of git-based data-over-time storage is flatter data structures. Rather than one or a dozen or so files, scraped & stored, we could synthesize some kind of directory structure & simple value files - like a plan9/9p system - that express the data, but where changes are more structurally apparent.
I don't know enough about plan9 to understand what you're getting at there.
There's definitely a LOT of scope for innovation around how the values are compared over time. So far my explorations have been around loading the deltas into SQLite in various ways; see https://simonwillison.net/2021/Dec/7/git-history/
Sysfs or procfs on Linux are similar. Rather than having deeply structured data files, let the filesystem provide the hierarchy.
Rather than a JSON file with a bunch of weather stations, make a directory with each station as a subdirectory, each with lat, long, temp and humidity properties. Let the fs express the structure.
Then, when we watch it in git, we can filter by changes to one of these subdirectories, for example, or see every time the humidity changes in one. I don't have a good name for the general practice, but using the filesystem to express the structure is the essence.
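As a rough sketch of that idea, assuming a hypothetical stations.json shaped like {"stations": [{"id": "...", "lat": ..., "long": ..., "temp": ..., "humidity": ...}, ...]} with string ids:

    # explode one big JSON document into a file-per-property tree,
    # so git diffs point straight at the station and field that changed
    for id in $(jq -r '.stations[].id' stations.json); do
      mkdir -p "stations/$id"
      for field in lat long temp humidity; do
        jq -r --arg id "$id" --arg f "$field" \
          '.stations[] | select(.id == $id) | .[$f]' stations.json \
          > "stations/$id/$field"
      done
    done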
Yeah, there's definitely a lot to be said for breaking up your scraped data into separate files rather than having it all in a single file.
I have a few projects where I do that kind of thing. My best example is probably this one, where I scrape the "--help" output for the AWS CLI commands and write that into a separate file for each command:
I have to agree. I've been playing around with this for a while at https://github.com/outages/ (which includes the bchydro-outages mentioned in the original comment).
While it's easy to gather the data, the friction in analyzing it has always pushed the priority of doing so below other datasets I've gathered.
Yeah, this is great. I'm sure many, including me, have used a similar tool, but half-arsed for internal use. For me it was tracking changes to a website where all the changes were done through a GUI. I was asked to provide a backup / rollback capability, which I did with git scraping.
I also started, but never finished, a terms-of-service and privacy-statement tracker. I stopped at the boring part where you'd have to find the URL for thousands of companies and/or engage others to do it.
I did this when I was a kid, decompiling a flash game client for an MMO (Tibia).
By itself a single decompile was hard to parse, but if you do it for each release, commit the decompiled sources, and diff them you can easily see code changes.
So you just run a script to poll for a new client version to drop and automatically download, decompile, commit, and tag.
I'd have a diff of the client changes immediately, giving insight into the protocol changes so the private game server code could be updated to support them.
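A sketch of that polling loop; the version endpoint, client URL and decompiler are entirely hypothetical here:

    # poll for a new client build; on change, decompile it and record it in git
    latest=$(curl -s https://game.example.com/client/version.txt)
    if [ "$latest" != "$(cat last-version.txt 2>/dev/null)" ]; then
      curl -s -o client.swf "https://game.example.com/client/client-$latest.swf"
      flash-decompiler client.swf -o src/   # placeholder for whatever decompiler you use
      echo "$latest" > last-version.txt
      git add src/ last-version.txt
      git commit -m "Client $latest"
      git tag "client-$latest"
    fi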
This is cool but the name is confusing. First of all, git is not being scraped nor is git being used to do any scraping, git is only used as the storage format for the snapshots. Second, there is no scraping happening at all. Scraping is when you parse a file intended for human display in order to extract the embedded unstructured data. The examples given are about periodically downloading an already structured json file and uploading it to github. No parsing is happening unless you count when he manually searches for the json file in the browser dev tools.
Git is a key technology in this approach, because the value you get out of this form of scraping is the commit history - it's a way of turning a static source of information into a record of how that information changed over time.
I think it's fine to use the term "scraping" to refer to downloading a JSON file.
These days an increasing number of websites work by serving up JSON which is then turned into HTML by a client-side JavaScript app. The JSON often isn't a formally documented API, but you can grab it directly to avoid the extra step of processing the HTML.
I do run Git scrapers that process HTML as well. A couple of examples:
I don't think that definition is universally agreed upon. I have had many conversations over the years where the term "scraping" referred to activities that didn't involve things like parsing HTML.
Scraping specifically refers to extracting information from source documents/data. Merely downloading them is just retrieval or, when following links, crawling.
“Git scraping” would intuitively refer to extracting specific information from Git repositories. The naming in the article is therefore confusing. “Snapshotting into Git” would be more accurate. (Git itself uses the term “snapshot” for a reason.)
Undocumented endpoints that return JSON are pretty fragile!
One of the benefits of catching them in a Git repo is that it helps you spot when their structure changes in ways that may break code that you write on top of them.
Sure, they are prone to being changed out from under you, but I think we can agree they're not fragile in the same way that parsing HTML for the 3rd div tag with the id w9j8f (thanks, React!) and the 2nd a href tag under that is. It's very clear when the endpoint changes or the output JSON changes, but assuming it's still JSON, it should still be fairly readable, and if the data's still in the JSON blob, finding it is quick work. Whereas if the HTML changes, you're in for a slog.
> the key element that distinguishes data scraping from regular parsing is that the output being scraped is intended for display to an end-user, rather than as an input to another program
Snapshotting JSON files can be incredibly useful, but I don't think you should call it "scraping".
It has sections covering things called "data scraping" and "web scraping" and "screen scraping" and "report mining", and links to articles about "data mining" and "search engine scraping" as well.
To me, that indicates that the terminology around this stuff is already extremely vague and poorly defined... and the suffix "scraping" is up for grabs for anyone who wants to further define it!
(If you don't like me calling this technique "Git scraping" you're going to /really/ hate the name I picked for my shot-scraper tool https://shot-scraper.datasette.io )
Scraping also has, in some contexts, negative associations. In a project for a non-profit that I'm involved with (which, coincidentally, was originally a remix of some of Simon's code for one of these "Git scraping" projects plus Datasette), I recently made the decision to refer to it strictly as what it is: a crawler.
I'm less warm at this point to the general idea behind the hack of dumping the resulting JSON crawl data to GitHub. It's a very roundabout way of approaching basically what something like TerminusDB was made for. It definitely feels like the main motivation was GitHub-specific stuff and not Git, really—namely, free jobs with GitHub Actions—and everything else flowed from that. It turns out that GitHub Actions proved to be too unreliable for executing on schedule, anyway, so we ported the crawler to JS with an eye towards using Cloudflare Workers and their cron triggers (which also come in a free flavor).
My first implementation of this pattern predated GitHub Actions and used CircleCI - though GitHub Actions made this massively more convenient to build.
I think the name is also a little ambiguous. I suggest maybe committed-scraping, or time-scraping, or chronicle-scraping... and ChatGPT could probably come up with something even better, lol.
I think "time-lapse" might be the right metaphor. With time-lapse photography, we see snapshots of a continuous process taken at regular intervals.
I'm not sure what it has to do with git. It seems like any version control system would work. Or, really, the main use of git here is that GitHub is effectively being used as a free database. The snapshots and timestamps are enough to see the changes, regardless of storage format.
Please consider adding a user-agent string with a link to the repo or some Google-able name to your curl call; it can help site operators get in touch with you if it starts to misbehave somehow.
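Something as small as this goes a long way (repo URL hypothetical):

    # identify the scraper so a site operator can find the repo and reach you
    curl --silent \
      --user-agent "my-git-scraper/1.0 (+https://github.com/yourname/your-scraper)" \
      "$TARGET_URL" -o latest.json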
It's tough when there's a cat-and-mouse game forcing you to spoof your UA so you don't get blocked. I wish webmasters had better relationships with scrapers and could accept the reality that your data will be scraped no matter how much you try to stop it.
If you're scraping any significant amount of data (>500 KB), and depending on the frequency, you might also want to add ETag/Cache-Control headers as well as Accept-Encoding, to save server bandwidth.
Collecting 1 kB every minute might not be a big deal, but collecting 1 MB every minute would cost an AWS-hosted service >$40/year in additional data transfer costs (1 MB per minute is roughly 500 GB per year, and AWS egress runs on the order of $0.09/GB).
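curl can do both with a couple of flags (the ETag options need a reasonably recent curl); a sketch:

    # ask for a compressed response and skip the download when the ETag matches
    touch etag.txt   # --etag-compare wants the file to exist, even if empty
    curl --silent --compressed \
      --etag-compare etag.txt --etag-save etag.txt \
      "$TARGET_URL" -o latest.tmp
    # curl leaves the body empty on a 304 Not Modified, so only replace the file on real changes
    [ -s latest.tmp ] && mv latest.tmp latest.json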
I use this approach for monitoring open ports in our infrastructure: running masscan and committing the results to a git repo. If there are changes, a merge request is opened for review. During the review, one investigates the actual server to find out why the open ports changed.
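A stripped-down sketch of that flow; the network range, scan rate and branch naming are all placeholders:

    # scan (needs root), keep the output stable, and open a review branch only on changes
    masscan -p1-65535 10.0.0.0/24 --rate 10000 -oL scan.raw
    # drop comment lines and the per-line timestamp so identical scans diff as identical
    grep -v '^#' scan.raw | awk '{print $1, $2, $3, $4}' | sort > scan.txt
    if ! git diff --quiet -- scan.txt; then
      branch="ports-$(date -u +%Y%m%d-%H%M)"
      git checkout -b "$branch"
      git commit -am "Open ports changed"
      git push -u origin "$branch"   # then open the merge request from this branch
    fi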
> The implementation of the scraper is entirely contained in a single GitHub Actions workflow.
It's interesting that you can run a scraper at fixed intervals on a free, hosted CI like that. If the scraped content is larger, more than a single JSON file, will GitHub have a problem with it?
I occasionally scrape results from Brazilian lotteries. Their official web sites have internal APIs which simply return JSON data. I simply download the JSON and commit it to the repository. Right now I have 5504 files totalling 22 MB. GitHub hasn't complained yet.
It’s probably not a coincidence that the other place I’ve seen this technique was also for archiving a feed of fires.
In that case the data was about 250 GB when fully uncompressed, and IIRC under a gig when stored as a git repo.
It’s a really neat idea, though it can make analysis of the data harder to do, in particular quality control (the aforementioned dataset had a lot of duplicates and inconsistencies).
Like everything, it’s a process of trading off between compute and storage, in this case optimising for storage.
In the past I had to hunt down when a particular product's public documentation web pages were updated by the product team to add disclaimers and limitations.
This would have helped so much. Bookmarking this tool. Maybe I will get around to setting this up for this docs site.
Maybe all larger documentation sites should have a public history like this -- if not volunteered by the maintainers themselves, then through git scraping by the community.
> It runs on a schedule at 6, 26 and 46 minutes past the hour—I like to offset my cron times like this since I assume that the majority of crons run exactly on the hour, so running not-on-the-hour feels polite.
Not sure how much of a difference it makes to the underlying service, but I will also do this with my scraping.
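For reference, that offset schedule written as a cron expression (the same syntax GitHub Actions' schedule trigger accepts) is:

    # minutes 6, 26 and 46 of every hour
    6,26,46 * * * *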
I use this to aggregate some RSS feeds. Also to generate a feed out of the HTML from sites that don't have a feed. Then I just publish the result as GitHub Pages and add this link to my reader.
Thanks for this instruction; it got me going on that idea.
Parsing the legal acts with the tools you mention looks very interesting! Currently, I simply collect the published XML files whose structure is optimized for laying out the text and not so much for representing a structure of sections and subsections.
I have a couple of similar scrapers as well. One is a private repo where I collect visa information from Wikipedia (for Visalogy.com) and GeoIP information from the MaxMind database (used with their permission).
It downloads the repo, splits the data by the first 8 bytes of the IP address, and saves it to individual JSON files. For every new scraper run, it creates a new tag and pushes it as a package, so dependents can simply update it with their dependency manager.
So it's basically using Git as an "append-only" (no update-in-place) database to then do time queries? It's not the first time I see people using Git that way.
EDIT: hmmm I realize in addition to that it's also a way to not have to do specific queries over time: the diff takes care of finding everything that changed (i.e. you don't have to say "I want to see how this and that values changed over time": the diff does it all). Nice.
The core idea, I believe, is tracking incremental changes and keeping the past history of items. Git is good for text, though for large amounts of binary data I would recommend filesystem snapshots, for example with btrfs.
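(If you go that route, a read-only snapshot per scrape is a one-liner; the paths here are hypothetical:)

    # keep a timestamped, read-only snapshot of the data subvolume
    btrfs subvolume snapshot -r /srv/scrape-data "/srv/snapshots/$(date -u +%Y%m%dT%H%M%S)"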
I use the same technique to maintain a json file mapping Slack channel names to channel IDs, as Slack for some reason doesn't have an API endpoint for getting a channel ID from its name.
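A sketch of how that mapping file could be produced, assuming a token in $SLACK_TOKEN and ignoring cursor pagination:

    # build {"channel-name": "C0123ABCD", ...} from Slack's conversations.list API
    curl -s -H "Authorization: Bearer $SLACK_TOKEN" \
      "https://slack.com/api/conversations.list?limit=1000" \
      | jq -S '[.channels[] | {key: .name, value: .id}] | from_entries' \
      > channels.json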
Some covid datasets were published as git repositories. The cool part was that it added a publish-date dimension for historical data, so that one could understand how long it took for historical counts to reach a steady state.
Yes! New Zealand published current covid locations of interest (where infected people had been) to a GitHub repo - I created an animation of locations over time by crawling the repo history. It gave an idea of how it was spreading through the country, and Auckland in particular.
The trick here is to take some source of information online that's updated frequently and turn that into a historic record of every change made to that source, by setting up a GitHub repository and dropping a YAML file into it setting up a scheduled action.
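The scheduled action itself boils down to a few shell lines, something like this (URL and filename hypothetical):

    # fetch the latest snapshot and only create a commit when something actually changed
    curl -s "https://example.com/api/data.json" | jq -S . > data.json
    git add data.json
    timestamp=$(date -u +"%Y-%m-%dT%H:%M:%SZ")
    git commit -m "Latest data: ${timestamp}" || echo "No changes"
    git push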
Achieving the same thing with a time series database would require a whole lot more work I think - you'd need to run that database somewhere, then run code that scrapes and writes to it on a scheduled basis.
If you already have a time series database running and a machine that runs cron I guess it wouldn't be too much work to put that in place.
Git scraping also lets you easily track changes made to textual content, which I don't think would fit neatly in a time series database.
I mean, you could use SQLite the wrong way and use it as a time series database, which would save you from having to have a machine to host it, and I'm sure you could cobble together some sort of hosting for it and glue it to a web cron system. This GitHub approach seems quite a bit more straightforward, but then you're on Git instead of something else.
A fun way to track how people are using this is with the git-scraping topic on GitHub:
https://github.com/topics/git-scraping?o=desc&s=updated
That page orders repos tagged git-scraping by most-recently-updated, which shows which scrapers have run most recently.
As I write this, repos that have updated just in the last minute include:
queensland-traffic-conditions: https://github.com/drzax/queensland-traffic-conditions
bbcrss: https://github.com/jasoncartwright/bbcrss
metrobus-timetrack-history: https://github.com/jackharrhy/metrobus-timetrack-history
bchydro-outages: https://github.com/outages/bchydro-outages