If you haven't considered it, you can also use the direct wikitext markup, from which the HTML is derived.

Depending on how you use it, the wikitext may or may not be more ingestible if you're passing it through to an LLM anyway. You may also be able to pare it down a bit by heading/section so that you can reduce it to only the sections that are likely to be relevant (e.g. "Life and career"-type sections).
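Rough sketch of that route, assuming Python with requests: fetch the markup through the standard action=parse API, then do a naive split on level-2 headings. The page title and the section keywords are just examples.

    import re
    import requests

    API = "https://en.wikipedia.org/w/api.php"

    def fetch_wikitext(title):
        # action=parse with prop=wikitext returns the raw markup for one page
        r = requests.get(API, params={
            "action": "parse",
            "page": title,
            "prop": "wikitext",
            "format": "json",
            "formatversion": 2,
        })
        r.raise_for_status()
        return r.json()["parse"]["wikitext"]

    def keep_sections(wikitext, wanted=("Life", "Career", "Biography")):
        # Level-2 headings look like "== Heading ==" on their own line;
        # keep the lead plus any section whose heading matches a keyword.
        chunks = re.split(r"\n(?===[^=])", wikitext)
        kept = [chunks[0]]
        for chunk in chunks[1:]:
            heading = chunk.splitlines()[0].strip("= ").strip()
            if any(w.lower() in heading.lower() for w in wanted):
                kept.append(chunk)
        return "\n".join(kept)

    print(keep_sections(fetch_wikitext("Johann Sebastian Bach"))[:500])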

You can also download full dumps [0] from Wikipedia and query them via SQL to make your life easier if you're processing them.

[0] https://en.wikipedia.org/wiki/Wikipedia:Database_download#Wh...?
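If you end up working with the pages-articles XML dump rather than the SQL table dumps, a streaming pass over it might look something like the sketch below; the filename and the export namespace version are assumptions, so check your dump.

    import bz2
    import xml.etree.ElementTree as ET

    # The namespace version changes between dumps (0.10, 0.11, ...);
    # check the header of the file you downloaded.
    NS = "{http://www.mediawiki.org/xml/export-0.11/}"

    with bz2.open("enwiki-latest-pages-articles.xml.bz2", "rb") as f:
        for _, elem in ET.iterparse(f):
            if elem.tag == NS + "page":
                title = elem.findtext(NS + "title")
                text = elem.findtext(NS + "revision/" + NS + "text") or ""
                # ...filter on title / ingest text here...
                elem.clear()  # keep memory flat while streaming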




> reduce it to only the sections that are likely to be relevant (e.g. "Life and career")

True, but I also managed to do this from the HTML. I tried getting a page's wikitext through the API but couldn't figure out how.

Just querying the HTML page was less friction and fast enough that I didn't need a dump (although when AI becomes cheap enough, there are probably a lot of things to do with a Wikipedia dump!).

One advantage of using online Wikipedia instead of a dump is that I have a pipeline on GitHub Actions where I just enter a composer name and it automagically scrapes the web and adds the composer to the database (takes exactly one minute from clicking the button!).


Wikipedia's api.php supports JSON output, which probably already helps quite a bit. For example https://en.wikipedia.org/w/api.php?action=query&prop=extract...
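Roughly what that query looks like from Python, assuming requests (prop=extracts comes from the TextExtracts extension; exintro/explaintext are optional and cut the result down to a plain-text lead section):

    import requests

    r = requests.get("https://en.wikipedia.org/w/api.php", params={
        "action": "query",
        "prop": "extracts",
        "exintro": 1,        # lead section only
        "explaintext": 1,    # plain text instead of HTML
        "titles": "Johann Sebastian Bach",
        "format": "json",
        "formatversion": 2,
    })
    page = r.json()["query"]["pages"][0]
    print(page["extract"])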


Oooh I had missed that thanks!



