Git scraping: track changes over time by scraping to a Git repository (2020) (simonwillison.net)
166 points by ekiauhce on Aug 10, 2023 | 66 comments



I've been promoting this idea for a few years now, and I've seen an increasing number of people put it into action.

A fun way to track how people are using this is with the git-scraping topic on GitHub:

https://github.com/topics/git-scraping?o=desc&s=updated

That page orders repos tagged git-scraping by most-recently-updated, which shows which scrapers have run most recently.

As I write this, just in the last minute repos that updated include:

queensland-traffic-conditions: https://github.com/drzax/queensland-traffic-conditions

bbcrss: https://github.com/jasoncartwright/bbcrss

metrobus-timetrack-history: https://github.com/jackharrhy/metrobus-timetrack-history

bchydro-outages: https://github.com/outages/bchydro-outages


Thanks for linking to the topic, that was interesting

As a heads up to anyone trying this stunt, please be mindful that git-diff is ultimately a line oriented action (yeah, yeah, "git stores snapshots")

For example https://github.com/pmc-ss/mastodon-scraping/commit/2a15ce1b2... is all :fu: because git sees basically the "first line" changed

However, had the author normalized the instances.json with something like "jq -S" then one would end up with a more reasonable 1736 textual changes, which github would have almost certainly rendered

  diff -u \
    <(git ls-tree HEAD^1 -- instances.json | cut -d' ' -f3 | xargs git show --pretty=raw | jq -S) \
    <(git ls-tree HEAD   -- instances.json | cut -d' ' -f3 | xargs git show --pretty=raw | jq -S)
  --- /dev/fd/63 2023-08-10 19:31:03.000000000 -0700
  +++ /dev/fd/62 2023-08-10 19:31:03.000000000 -0700
  @@ -1,6 +1,6 @@
   [
     {
  -    "connections": 5088,
  +    "connections": 5089,


It doesn't help with GitHub's UI views, but locally you can use git difftool's --tool option (or configure an external diff command in your git config) to pipe files through a pretty-printer, or to use a (generally much slower) character-based diff tool rather than a line-based one.
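
Another local option (a sketch, and it only changes how diffs are displayed, not what is stored) is a textconv driver that runs JSON files through jq before git diffs them:

    # associate *.json with a custom "json" diff driver...
    echo "*.json diff=json" >> .gitattributes
    # ...and have that driver pretty-print and sort keys via jq before diffing
    git config diff.json.textconv "jq -S ."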


Hey, I have been doing the same thing as you for some types of online resources I'm interested in for a long time, too. Really nice work! One small thing you might find interesting: in the beginning, I would push the automated commits via my GitHub identity, so they would all be associated with my account as my activity. This annoyed me, but I also couldn't accept the idea of the commits coming from a non-GitHub account (so the username wouldn't be clickable), nor the idea of creating and maintaining a separate set of credentials to push the changes under. I thought about what would be a good default identity to use, and after some experimentation I found that if you use these credentials, the commits appear as if GitHub's native github-actions bot pushed them:

        git config --global user.email "41898282+github-actions[bot]@users.noreply.github.com"
        git config --global user.name "github-actions[bot]"
They have the right icon, clickable username and it is as simple as just using this email and name. You or someone else might like to do this, too, so here's me sharing this neat trick I found.

https://github.com/TomasHubelbauer/github-actions#write-work...
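
For completeness, a commit step inside an Actions run using that identity might look roughly like this (paths and the commit message are just placeholders):

    git config user.name "github-actions[bot]"
    git config user.email "41898282+github-actions[bot]@users.noreply.github.com"
    git add -A
    # skip the push quietly if the scrape produced no changes
    git diff --staged --quiet || (git commit -m "Automated update" && git push)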


I've been doing this to track the UK's "carbon intensity" forecast and compare it with what is actually measured. Now have several months' data about the quality of the model and forecast published here: https://carbonintensity.org.uk/ . Thanks for the inspiration!

https://github.com/nmpowell/carbon-intensity-forecast-tracki...


One thing I notice is that the diff still requires pretty deep analysis. You need to be able to compare XML or JSON over time.

I keep thinking the real power of git-based data-over-time storage is flatter data structures. Rather than one or a dozen or so files, scraped & stored, we could synthesize some kind of directory structure & simple value files - like a plan9/9p system - that express the data, but where changes are more structurally apparent.

Thoughts?


I don't know enough about plan9 to understand what you're getting at there.

There's definitely a LOT of scope for innovation around how the values are compared over time. So far my explorations have been around loading the deltas into SQLite in various ways, see https://simonwillison.net/2021/Dec/7/git-history/


Perhaps Jaunty is referring to https://en.wikipedia.org/wiki/Venti

Which was also one of the inspirations for Git.


Sysfs or procfs on Linux are similar. Rather than have deeply structured data files, let the file system be used to make hierarchy.

Rather than a JSON with a bunch of weather stations, make a directory with a bunch of stations as subdirectories, each with lat, long, temp, humidity properties. Let the fs express the structure.

Then when we watch in git, we can filter by changes to one of these subdirs, for example. Or see every time the humidity changes in one. I don't have a good name for the general practice, but trying to use the filesystem to express the structure is the essence.
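
A rough sketch of that idea (the input file, station names and fields are all made up): explode a stations JSON blob into one tiny file per value, so every change shows up as a one-line diff under an obviously named path.

    # stations.json is assumed to look like: [{"id": "KSFO", "temp": 14.2, "humidity": 81}, ...]
    jq -c '.[]' stations.json | while read -r station; do
      id=$(echo "$station" | jq -r '.id')
      mkdir -p "stations/$id"
      for field in temp humidity; do
        echo "$station" | jq -r --arg f "$field" '.[$f]' > "stations/$id/$field"
      done
    done

Then "git log -p -- stations/KSFO/humidity" (or a path filter in the GitHub UI) answers "when did this one value change" directly.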


Oh, I see what you mean.

Yeah, there's definitely a lot to be said for breaking up your scraped data into separate files rather than having it all in a single file.

I have a few projects where I do that kind of thing. My best example is probably this one, where I scrape the "--help" output for the AWS CLI commands and write that into a separate file for each command:

help-scraper/tree/main/aws: https://github.com/simonw/help-scraper/tree/main/aws

This is fantastically useful for keeping track of which AWS features were added or changed at what point.


I have to agree. I've been playing around with this for a while at https://github.com/outages/ (which includes the bchydro-outages mentioned in the original comment).

While it's easy to gather the data, the friction in analyzing it has always pushed the priority of doing so below other datasets I've gathered.


I do this as a demo: https://github.com/swyxio/gh-action-data-scraping

but conveniently it also serves as a way to track the downtime of github actions, which used to be bad but seems to be fine the last couple months: https://github.com/swyxio/gh-action-data-scraping/assets/676...


Yeah, this is great - I'm sure many people, including me, have hacked together a half-arsed version of this for internal use. For me it was tracking changes to a website where all the changes were done through a GUI. I was asked to provide a backup / rollback, which I did with git scraping.

I also started but never finished a terms-of-service / privacy-statement tracker. I stopped at the boring part, where you'd have to find the URL for thousands of companies and/or engage others to do it.


wow, this is one of those things where I've thought of the problem many times and the solution makes me go, oh duh.


I did this when I was a kid, decompiling a flash game client for an MMO (Tibia).

By itself a single decompile was hard to parse, but if you do it for each release, commit the decompiled sources, and diff them you can easily see code changes.

So you just run a script to poll for a new client version to drop and automatically download, decompile, commit, and tag.

I'd have a diff of the client changes immediately, allowing insight into the protocol changes to update the private game server code to support it.


That's brilliant!


This is cool but the name is confusing. First of all, git is not being scraped nor is git being used to do any scraping, git is only used as the storage format for the snapshots. Second, there is no scraping happening at all. Scraping is when you parse a file intended for human display in order to extract the embedded unstructured data. The examples given are about periodically downloading an already structured json file and uploading it to github. No parsing is happening unless you count when he manually searches for the json file in the browser dev tools.


Git is a key technology in this approach, because the value you get out of this form of scraping is the commit history - it's a way of turning a static source of information into a record of how that information changed over time.

I think it's fine to use the term "scraping" to refer to downloading a JSON file.

These days an increasing number of websites work by serving up JSON which is then turned into HTML by a client-side JavaScript app. The JSON often isn't a formally documented API, but you can grab it directly to avoid the extra step of processing the HTML.

I do run Git scrapers that process HTML as well. A couple of examples:

scrape-san-mateo-fire-dispatch https://github.com/simonw/scrape-san-mateo-fire-dispatch scrapes the HTML from http://www.firedispatch.com/iPhoneActiveIncident.asp?Agency=... and records both the original HTML and converted JSON in the repository.

scrape-hacker-news-by-domain https://github.com/simonw/scrape-hacker-news-by-domain uses my https://shot-scraper.datasette.io/ browser automation tool to convert an HTML page on Hacker News into JSON and save that to the repo. I wrote more about how that works here: https://simonwillison.net/2022/Dec/2/datasette-write-api/

That one's a particularly fun demo because it's currently capturing changes to the points and comment count on this thread - a recent example commit: https://github.com/simonw/scrape-hacker-news-by-domain/commi...
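
The shape of that is roughly this (a simplified sketch, not the actual workflow in the repo, and the selectors are a guess at HN's current markup): shot-scraper's javascript command runs an expression in a headless browser and prints the result as JSON, which can then be committed.

    shot-scraper javascript 'https://news.ycombinator.com/from?site=simonwillison.net' "
      Array.from(document.querySelectorAll('.athing .titleline > a')).map(a => ({
        title: a.innerText,
        url: a.href
      }))
    " > hacker-news-simonwillison-net.json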


It's not scraping, you're just consuming an API.

Scraping is when you're parsing human-readable content (HTML) and extracting data, as the parent comment correctly points out.


I don't think that definition is universally agreed upon. I have had many conversations over the years where the term "scraping" referred to activities that didn't involve things like parsing HTML.


Scraping specifically refers to extracting information from source documents/data. Merely downloading them is just retrieval or, when following links, crawling.

“Git scraping” would intuitively refer to extracting specific information from Git repositories. The naming in the article is therefore confusing. “Snapshotting into Git” would be more accurate. (Git itself uses the term “snapshot” for a reason.)


Agree to disagree. To me, scraping implies a level of fragility that hitting an endpoint that returns JSON does not have.


Undocumented endpoints that return JSON are pretty fragile!

One of the benefits of catching them in a Git repo is that it helps you spot when their structure changes in ways that may break code that you write on top of them.


Sure, they are prone to being changed out from under you, but I think we can agree they're not fragile in the same way that parsing html for the 3rd div tag with the id w9j8f (thanks react!) and the 2nd a href tag under that is. It's very clear when the endpoint changes, or the outputted JSON changes, but assuming it's still JSON, it should still be fairly readable, and if the data's still in the JSON blob, finding it is quick work. Whereas if the HTML changes, you're in for a slog.


From the Wikipedia entry for "data scraping":

> the key element that distinguishes data scraping from regular parsing is that the output being scraped is intended for display to an end-user, rather than as an input to another program

Snapshotting JSON files can be incredibly useful, but I don't think you should call it "scraping".


I think that article actually further supports my position here: https://en.wikipedia.org/wiki/Data_scraping

It has sections covering things called "data scraping" and "web scraping" and "screen scraping" and "report mining", and links to articles about "data mining" and "search engine scraping" as well.

To me, that indicates that the terminology around this stuff is already extremely vague and poorly defined... and the suffix "scraping" is up for grabs for anyone who wants to further define it!

(If you don't like me calling this technique "Git scraping" you're going to /really/ hate the name I picked for my shot-scraper tool https://shot-scraper.datasette.io )


Scraping also has, in some contexts, negative associations. In a project for a non-profit that I'm involved with that coincidentally was originally a remix of some of Simon's code for one of these "Git scraping" projects + Datasette, I recently made the decision to refer to it strictly as what it is: a crawler.

I'm less warm at this point to the general idea behind the hack of dumping the resulting JSON crawl data to GitHub. It's a very roundabout way of approaching basically what something like TerminusDB was made for. It definitely feels like the main motivation was GitHub-specific stuff and not Git, really—namely, free jobs with GitHub Actions—and everything else flowed from that. It turns out that GitHub Actions proved to be too unreliable for executing on schedule, anyway, so we ported the crawler to JS with an eye towards using Cloudflare Workers and their cron triggers (which also come in a free flavor).


My first implementation of this pattern predated GitHub Actions and used CircleCI - though GitHub Actions made this massively more convenient to build.


Exactly - "scraping" is the final resort when sites don't make data available via an API. It's almost exactly synonymous with "parsing HTML".


I think the name is also a little ambiguous. I suggest maybe committed-scraping, or time-scraping, or chronicle-scraping... and ChatGPT could probably come up with something even better lol.


“Periodic snapshotting” would be more accurate. That's the usual term of art.


I think "time-lapse" might be the right metaphor. With time-lapse photography, we see snapshots of a continuous process taken at regular intervals.

I'm not sure what it has to do with git. It seems like any version control system would work. Or, really, the main use of git here is that GitHub is effectively being used as a free database. The snapshots and timestamps are enough to see the changes, regardless of storage format.


This looks very cool!

Please consider adding a user agent string with a link to the repo or some Google-able name to your curl call, it can help site operators get in touch with you if it starts to misbehave somehow.


It's tough when there's a cat-and-mouse game to spoof your UA so you don't get blocked. I wish webmasters had better relationships with scrapers and could accept the reality that your data will be scraped no matter how much you try to stop it.


IMO, we should really just get rid of the user agent header altogether.


Yeah, that's a good idea - I need to add that to my suggestions for how to implement this.


If you're scraping any significant amount of data (>500 KB), and depending on the frequency, you might also want to add etag/cache-control headers as well as accept-encoding, to save server bandwidth.

Collecting 1 kB every minute might not be a big deal, but collecting 1 MB every minute would cost an AWS-hosted service >$40/year in additional data transfer costs.
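
As a hedged example, a polite fetch with curl might look something like this (the URL and contact details are placeholders):

    curl -s "https://example.com/data.json" \
      --user-agent "example-git-scraper (+https://github.com/example/scraper; ops@example.com)" \
      --compressed \
      -o data.json

For conditional requests, curl's --etag-save and --etag-compare options (in reasonably recent curl versions) can skip the download entirely when the server reports the content hasn't changed.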


It should definitely be optional. I can only imagine some busybody PM insisting they block harmless scrapes.


I use this approach for monitoring open ports in our infrastructure -- running masscan and committing the results to a git repo. If there are changes, it opens a merge request for review. During the review, one would investigate the actual server to find out why there was a change in open ports.

https://github.com/bobek/masscan_as_a_service
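
Roughly the idea, as a sketch rather than the repo's actual implementation (the address range, rate and paths are placeholders; masscan's raw output includes timestamps, so it gets normalized before committing):

    masscan 192.0.2.0/24 -p1-65535 --rate 1000 -oL scan.txt
    # drop comment lines and the per-entry timestamp so diffs only show real changes
    grep -v '^#' scan.txt | awk '{print $1, $2, $3, $4}' | sort > open-ports.txt
    git add open-ports.txt
    # commit on a branch only when the port list changed; the merge request is opened from CI
    git diff --staged --quiet || (git checkout -b "ports-$(date +%Y%m%d%H%M)" \
      && git commit -m "Open ports changed" && git push -u origin HEAD)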


> The implementation of the scraper is entirely contained in a single GitHub Actions workflow.

It's interesting that you can run a scraper at fixed intervals on a free, hosted CI like that. If the scraped content is larger, more than a single JSON file, will GitHub have a problem with it?


GitHub repos appear to have a "soft" size limit of about 1GB - I feel completely comfortable with free repos with up to that size of content.

Once you get above 5GB I believe GitHub Support may send you a quiet, polite email asking you to reconsider!

https://docs.github.com/en/repositories/working-with-files/m... has some more information on limits - they suggest keeping individual files below 50MB (and definitely below 100MB).


I occasionally scrape results from Brazilian lotteries. Their official web sites have internal APIs which simply return JSON data. I simply download the JSON and commit it to the repository. Right now I have 5504 files totalling 22 MB. GitHub hasn't complained yet.


It’s probably not a coincidence the other place I’ve seen this technique was also for archiving a feed of fires.

In that case the data was about 250 GB when fully uncompressed, and IIRC under a gig when stored as a git repo.

It's a really neat idea, though it can make analysis of the data harder to do, in particular quality control (the aforementioned dataset had a lot of duplicates and inconsistency).

Like everything, it's a process of trading off between compute and storage, in this case optimising for storage.


In the past I had to hunt down when a particular product's public documentation web pages were updated by the product team to add disclaimers and limitations.

This would have helped so much. Bookmarking this tool. Maybe I will get around to setting this up for this docs site.

Maybe all larger documentation sites should have a public history like this -- if not volunteered by the maintainers themselves, then through git scraping by the community.


Funnily enough I do something very close to this with the RFC database at rfc-editor.org, here's the script that I have put in my `cron`:

    pushd ~/data/rfc # this is a GIT repo
    rsync -avzuh --delete --progress --exclude=.git ftp.rfc-editor.org::rfcs-text-only ~/data/rfc/text-only/
    rsync -avzuh --delete --progress --exclude=.git ftp.rfc-editor.org::refs ~/data/rfc/refs
    rsync -avzuh --delete --progress --exclude=.git ftp.rfc-editor.org::rfcs-pdf-only ~/data/rfc/pdf-only/
    git add .
    git commit -m "update $(date '+%Y-%m-%d')"
    git reflog expire --expire=now --all
    git gc --prune=now --aggressive
    git push origin master
    popd
Though I admit using GitHub's servers for this is more clever than me using one of my home servers. Still, I lean more to self-hosting.

@simonw Will take a look at `git-history`, looks intriguing!


I find -i or -ii give a nicer output than -v in rsync


> It runs on a schedule at 6, 26 and 46 minutes past the hour—I like to offset my cron times like this since I assume that the majority of crons run exactly on the hour, so running not-on-the-hour feels polite.

Not sure how much of a difference it makes to the underlying service but I will also do this with my scraping.

Thank you for pointing that out.


Further to this, I also put the following snippet in front of my web-scraping cron jobs so they start at a random time between the minute boundaries:

    perl -le 'sleep rand 60';
It's the most compact code I could find to do the job.


If you're using a somewhat modern shell there is $RANDOM which gives you a 15 bit random number. So e.g.

    sleep $((RANDOM / 546))
but I guess most cron jobs run with an extremely conservative shell that might not have it.


I use this to aggregate some RSS feeds, and also to generate a feed out of the HTML from sites that don't have one. Then I just publish the result as GitHub Pages and add that link to my reader. Thanks for the write-up, it got me going on this idea.


One of my friends is doing this tracking Hungarian law modifications: https://github.com/badicsalex/torvenyek

He has tools for parsing them written in Rust: https://github.com/badicsalex/hun_law_rs

and Python: https://github.com/badicsalex/hun_law_py

I'm doing it myself tracking my GitHub star changes: https://github.com/kissgyorgy/my-stars


I built a tracker for German legal acts. I started in January 2022 and once a week, a GitHub Action downloads every published legal act: https://github.com/jandinter/gesetze-im-internet

Parsing the legal acts with the tools you mention looks very interesting! Currently, I simply collect the published XML files whose structure is optimized for laying out the text and not so much for representing a structure of sections and subsections.


I have a couple of similar scrapers as well. One is a private repo where I collect visa information from Wikipedia (for Visalogy.com); another collects GeoIP information from the MaxMind database (used with their permission).

https://github.com/Ayesh/Geo-IP-Database/

It downloads the data, splits it by the first 8 bytes of the IP address, and saves it to individual JSON files. For every new scraper run, it creates a new tag and pushes it as a package, so dependents can simply update it with their dependency manager.


So it's basically using Git as an "append-only" (no update-in-place) database to then do time queries? It's not the first time I see people using Git that way.

EDIT: hmmm I realize in addition to that it's also a way to not have to do specific queries over time: the diff takes care of finding everything that changed (i.e. you don't have to say "I want to see how this and that values changed over time": the diff does it all). Nice.


Love that the author provides a 5 minute video explaining the purpose and how he did it: https://www.youtube.com/watch?v=2CjA-03yK8I


The core idea I believe is tracking incremental changes and keeping past history of items. Git is good for text, though for large amounts of binary data I would recommend filesystem snapshots like with btrfs.


I use the same technique to maintain a json file mapping Slack channel names to channel IDs, as Slack for some reason doesn't have an API endpoint for getting a channel ID from its name.
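
A sketch of how that mapping can be built (assuming a bot token in SLACK_TOKEN with the channels:read scope; pagination is ignored for brevity): the conversations.list API returns both names and IDs, so jq can turn the response into a name-to-ID object, sorted for diff-friendliness.

    curl -s -H "Authorization: Bearer $SLACK_TOKEN" \
      "https://slack.com/api/conversations.list?limit=1000" \
      | jq -S '[.channels[] | {key: .name, value: .id}] | from_entries' \
      > slack-channels.json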


My mind is so friggin blown, github will run arbitrary cron jobs for you?! Can't believe other services make you pay for that.


Some covid datasets were published as git repositories. The cool part was that it added a publish-date dimension for historical data, so that one could understand how long it took for historical counts to reach a steady state.


Yes! New Zealand published current covid locations of interest (where infected people had been) to a github repo - I created an animation of locations over time by crawling the repo history. It gave an idea for how it was spreading through the country, and Auckland in particular.

Animation: https://raw.githubusercontent.com/tim-fan/media/main/2021092...

Code: https://gist.github.com/tim-fan/5f601c274a30505b1ae6b989a015...


What's the benefit of this versus a time series database?


I'm not sure how those are comparable.

The trick here is to take some source of information online that's updated frequently and turn that into a historic record of every change made to that source, by setting up a GitHub repository and dropping a YAML file into it setting up a scheduled action.

Achieving the same thing with a time series database would require a whole lot more work I think - you'd need to run that database somewhere, then run code that scrapes and writes to it on a scheduled basis.

If you already have a time series database running and a machine that runs cron I guess it wouldn't be too much work to put that in place.

Git scraping also lets you easily track changes made to textual content, which I don't think would fit neatly in a time series database.


I mean, you could use SQLite the wrong way and treat it as a time series database, which would save you from having to have a machine to host it, and I'm sure you could cobble together some sort of hosting for it and glue it to a web cron system. This GitHub approach seems quite a bit more straightforward, but then you're on Git instead of something else.


Yes, maybe they are a bit 'apples to oranges'. You have some good points, especially when it comes to textual data. Thanks!


Tracking changes vs. tracking full versions of data is one immediate benefit (or difference, depending on your thoughts)


(2020)



