Author of the referenced blog (and library) here. This is great!
The full text search engine in SQLite is sadly not really good for this - one reason is that it uses standard B-Trees, another is that it forces storing all token positions if you want BM25 sorting, which is a huge overhead for articles as long as Wikipedia's.
But that doesn't mean full text search isn't possible in a very efficient manner with statically hosted data! I wrote a proof of concept of making the Rust-based tantivy library work in the same way, which has a lot of internal things that can make the index much smaller and more efficient than SQLite's. It's also >10x faster in creating the search index.
Here's the demo also for Wikipedia: https://demo.phiresky.xyz/tmp-ytccrzsovkcjoylr/dist/index.ht... I'm not sure if it's more efficient than the SQLite version in this form, but it definitely has more upward potential and is more fun to work with.
And the corresponding draft PR: https://github.com/tantivy-search/tantivy/pull/1067
I sadly haven't gotten around to working on it more and writing an article about it.
Other people are also working on making Sci-hub and LibGen more available by combining this approach with IPFS for distributed and "uncensorable" hosting, which is pretty awesome.
Edit: Just realized that the OP demo only searches in article titles, while mine searches in full article contents by default. You can search in only the titles in my demo by querying `title:harry title:potter`
How does it do for performance if you throw the whole 43GB into RAM? Plenty of very affordable workstation systems out there today gently used with 128GB in them.
I'm going to upvote this comment because not everyone in the world can afford $300 for 64GB+ of RAM. A solution that doesn't dump everything into RAM is a great one.
Fortunately, I've had a fancy Silicon Valley job for 15+ years, which means I could now afford many GBs of RAM if I wanted to, but it was a different story in my 20s. I relied on free or almost-free hardware to learn, and that free hardware usually lagged in specs.
Yeah, I agree with you that not everyone can afford it, but lots of young people build $1800 gaming PCs. If you want a powerful Linux workstation desktop for fun projects like this, you can get a Dell T5610 with two 8-core Xeons and 128GB of RAM on eBay for $650, then add your own $100 SSD.
I would not recommend running out and buying a new current-gen Ryzen board+CPU or Core i7-whatever and 128GB of new RAM at DDR4-4200 prices.
To put it in perspective, 43GB of RAM to browse the entirety of Wikipedia doesn't sound that ridiculous when popular Electron-based IM programs use 1GB+.
It's around £300 for 64GB of Corsair RAM - you need a motherboard that supports it but you can go cheap and you don't need a great processor. You could probably put this together for £600? It's not exactly crazy if this was for an office or something.
I do. Many others do too. Or do you see books written on phones, programming being done on phones? Being able to install any program you like? No? Not first class then.
There are many things that can't be done on a phone that are possible on a PC. The reverse case is much harder.
The fact that you can't even add memory (be it RAM or storage; the latter is of course doable on some devices) to your phone makes them not first-class devices.
That's only the size of a regular modern video game... the downloads of which are pretty mainstream these days. Saying that a 50 gig download is a barrier "for most" is definitely not true.
For English it states 90 GB uncompressed; it doesn't give the compressed size, but that doesn't sound much larger than a large game. In this context I don't see it as a barrier.
As someone who was very recently on about as bad a DSL connection as you could get... What? Outside of metered cellular connections this is available to basically everyone. Even the remotest parts of Africa have at least a handful of unmetered connections within a day's drive and a USB drive.
Gears of War 5 on the Xbox is 133GB with all its updates. I think we're at the point where 100GB is a high but reasonable request for people nowadays, at least in cities in the West.
If you understand B-trees you understand the hardest part already :)
Basically, you need to design a search index that examines the fewest DB pages in order to find the result. The Lucene scoring method stores a mapping of term -> document[] sorted in relevance order. The main idea is that you can examine only the first n documents for each term in the search query in order to find the most relevant search results. Picking n is sort of tricky, but if your index is stored in this way it's possible to fulfill a large % of queries efficiently without downloading the whole index.
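A toy sketch of that idea in Python (the index contents here are made up, and real impact-ordered indexes are far more involved):

    from collections import defaultdict

    # term -> [(doc_id, score), ...], each list pre-sorted by score descending
    index = {
        "harry":  [(12, 9.1), (40, 7.3), (7, 2.2)],
        "potter": [(12, 8.8), (99, 4.0), (40, 3.9)],
    }

    def search(terms, n=2):
        # Only the first n postings of each term are ever examined, so only
        # the index pages holding those entries would need to be fetched.
        scores = defaultdict(float)
        for term in terms:
            for doc_id, score in index.get(term, [])[:n]:
                scores[doc_id] += score
        return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

    print(search(["harry", "potter"]))  # doc 12 ranks first, then 40, then 99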
Yeah, I was just thinking that. Why not just a static filesystem? Just mount an ext4 image full of .html files and browse away. It would be remarkably efficient, likely much more efficient than SQLite.
I have been thinking of using FTS5 with SQLite to search emails. Not close to 43 GB of course, so I would probably not have any major performance problems, but still: is FTS5 any good? Or should I look into other solutions?
The nice thing about FTS5 is you can have full text search up and running on a dataset in a couple of minutes. So give it a try and see if it is sufficient for your use case. If it is, great! If not, you've not wasted much time.
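For example, with Python's built-in sqlite3 module (assuming your SQLite build includes FTS5, which most current builds do), a minimal setup looks like:

    import sqlite3

    con = sqlite3.connect("mail.db")
    con.execute("CREATE VIRTUAL TABLE IF NOT EXISTS mail USING fts5(subject, body)")
    con.execute("INSERT INTO mail (subject, body) VALUES (?, ?)",
                ("Lunch plans", "Shall we try the new ramen place on Friday?"))
    con.commit()

    # bm25() returns more negative values for better matches, so ascending
    # ORDER BY puts the most relevant rows first.
    for subject, frag in con.execute(
            "SELECT subject, snippet(mail, 1, '[', ']', '...', 8) "
            "FROM mail WHERE mail MATCH ? ORDER BY bm25(mail)", ("ramen",)):
        print(subject, "->", frag)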
Have you seen 'notmuch'? I was honestly shocked how good/fast it is. It uses a C++ library called 'xapian' to provide the actual index/search capability.
If you're talking about a normal local SQLite DB and your dataset is less than maybe 100GB of plaintext, then SQLite FTS will work fine regarding performance.
Shocked to find out that this is very much a static website, it's "merely" downloading a small portion of the 43GB SQLite file in question with HTTP Range requests, and then it uses a WASM-compiled copy of SQLite to query the requested data. Very impressive.
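The core trick is just the HTTP Range header. A rough sketch of a single page fetch (the URL and page size are placeholders; the real site does this inside a custom SQLite VFS compiled to WASM):

    import urllib.request

    DB_URL = "https://example.com/wikipedia.sqlite3"  # placeholder URL
    PAGE_SIZE = 4096                                  # must match the database's page size

    def fetch_page(page_number):
        start = page_number * PAGE_SIZE
        req = urllib.request.Request(
            DB_URL, headers={"Range": f"bytes={start}-{start + PAGE_SIZE - 1}"})
        with urllib.request.urlopen(req) as resp:     # expect a 206 Partial Content
            return resp.read()

    # Page 0 of any SQLite database begins with the 16-byte magic header.
    print(fetch_page(0)[:16])  # b'SQLite format 3\x00'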
> [error: RuntimeError: abort(Error: server uses gzip or doesn't have length). Build with -s ASSERTIONS=1 for more info.]
On insert from GitHub Pages:
> [error: Error: SQLite: ReferenceError: SharedArrayBuffer is not defined]
Your browser might either be too old to support SharedArrayBuffer, or too new and have some Spectre protections enabled that don't work on GitHub Pages since they don't allow setting the necessary isolation headers. Try going to the Netlify mirror of this blog for the DOM demos.
Since the work all happens in your browser, the only victims of a long, complicated query would be your own browser and the S3 bandwidth bill of the person hosting the database (since you'd end up sucking down a lot of data).
But if you want to jack up someone's S3 bandwidth bill there are already easier ways to do it!
You can chunk the file into e.g. 2MB chunks. The CDN can then cache all of them, or just the most commonly used ones. That's what I did in the original blog post to be able to host it on GitHub Pages.
It should be trivial to split it up into 1GB chunks or whatever. In fact, if you only request single pages, you could split the database up into page-size-sized files. This is a lot of files, but at least it avoids the need to do range requests at all. You probably want to increase the page size a bit though (e.g. maybe 256 KiB instead of the default 4 KiB?)
Range requests are the star, so I would vote to keep them? I mean, technically, the CDN could represent each 4 KB block of the file as an individual URL, so that you do range requests by filename instead of by actual Range header, but at some point I definitely think range requests are more sane.
Probably both: compute the file name and the relative offset from the original offset. (A minor complication is requests crossing file boundaries, in which case you have to do two requests instead of one.)
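A sketch of that offset arithmetic, with the boundary-crossing case falling out as a second request:

    CHUNK_SIZE = 2 * 1024 * 1024  # 2 MiB chunk files

    def plan_read(offset, length):
        """Yield (chunk_index, start_within_chunk, n_bytes) for one logical read."""
        while length > 0:
            chunk = offset // CHUNK_SIZE
            start = offset % CHUNK_SIZE
            n = min(length, CHUNK_SIZE - start)
            yield chunk, start, n
            offset += n
            length -= n

    # A 4 KiB page straddling the end of chunk 0 becomes two requests:
    print(list(plan_read(CHUNK_SIZE - 1024, 4096)))
    # [(0, 2096128, 1024), (1, 0, 3072)]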
The point is that this site isn't any more susceptible to DoS than any other site. Of all the ways of architecting a site, this is probably the least vulnerable.
My static-file HTTP server, called "filed" [0], will satisfy a request in as few as one system call (no memcpy involved -- the kernel reads the file and sends the buffer to the NIC) by using sendfile(2). Most other web servers do a bit more work, like opening the file for every request.
I plowed the 40 GiB database file into sqlite_zstd_vfs [1] which reduced it 75% to 10 GiB. This plug-in also supports HTTP access [2] in the spirit of phiresky's, however, I don't have a WebAssembly build of mine yet, so it's a library for desktop/command-line apps for now. You can try it out on Linux or macOS x86-64:
pip3 install genomicsqlite
genomicsqlite https://f000.backblazeb2.com/file/mlin-public/static.wiki/en.zstd.db "select text from wiki_articles where title = 'SQLite'"
("genomicsqlite" is the CLI for my Genomics Extension [3], which is built around these Zstandard compression & web layers.)
If you want to download the whole en.zstd.db, then please kindly get it from zenodo (which doesn't support HTTP range requests, but is free): https://zenodo.org/record/5149677
Great work, but why not compress with deflate if you are serving HTTP requests? Then you could copy the database content directly to the wire as gzip-encoded responses.
With sqlite_zstd_vfs the data are compressed beforehand and stored that way "at rest", so web responses are directly copied to the wire controlled by HTTP range headers, similarly to the OP. They need to be decompressed by a client library sitting between SQLite and the wire.
I'm nervous on your behalf about your S3 bill - have you done napkin maths for how much this will cost to host, especially given the relative inefficiency of the autocomplete queries?
A trick that might help: create a separate SQLite table containing just the article titles to autocomplete against. That might give you queries that use smaller ranges of the file.
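Something like this hypothetical schema is what I have in mind (table name made up) - with the titles clustered in their own small B-tree, an autocomplete query becomes a short range scan over a few adjacent pages:

    import sqlite3

    con = sqlite3.connect("wiki.db")
    # WITHOUT ROWID keeps rows physically ordered by title, so a prefix
    # lookup only touches a handful of neighbouring pages.
    con.execute("CREATE TABLE IF NOT EXISTS titles (title TEXT PRIMARY KEY) WITHOUT ROWID")

    def autocomplete(prefix, limit=10):
        return [t for (t,) in con.execute(
            "SELECT title FROM titles WHERE title >= ? AND title < ? LIMIT ?",
            (prefix, prefix + "\uffff", limit))]

    print(autocomplete("SQLi"))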
I built a Wikipedia reader on the iPad years ago. I used two SQLite databases, one for the article titles using SQLite FTS; I compressed the articles themselves and merged the compressed chunks into a set of files < 2GB, and the title database held the offsets into the compressed blobs. Including only the most popular pages, it was just under 10GB.
I realize I wasn't clear. I used 2 SQLite databases just to store the article titles and offsets into binary files (<2GB) that held chunks of compressed data. Each chunk contained tens to hundreds of articles, so that the decompressed size was around 20 MB, I believe. I had planned to store everything in SQLite, but there were too many issues. The FTS extension to SQLite is excellent; it made whole classes of problems melt away.
Recalling now: the article text itself I stored in groups of articles concatenated together, so that the data compression could take advantage of similar texts. This data was just Lua data structures, with the article title and article body in an array. So I knew which chunk an article was in, but I had to decompress a whole chunk at a time and then iterate through the list of decompressed articles to find the specific one. I was trying to tune search time, decompression time and compression ratio. I had a handful of parameters in the job that took the XML dumps and created the SQLite and compressed blobs. Memory consumption would spike as a chunk was decompressed, but after a call to the Lua garbage collector after pulling the article out of the decompression buffer, everything was fine again.
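In rough Python, the lookup path would have looked something like this (names are made up, and JSON stands in here for the Lua data structures described above):

    import sqlite3, zlib, json

    idx = sqlite3.connect("titles.db")  # hypothetical title/offset index

    def load_article(title):
        row = idx.execute(
            "SELECT blob_file, offset, length FROM articles WHERE title = ?",
            (title,)).fetchone()
        if row is None:
            return None
        blob_file, offset, length = row
        with open(blob_file, "rb") as f:
            f.seek(offset)
            chunk = json.loads(zlib.decompress(f.read(length)))  # tens of MB inflated
        # Each chunk holds many articles; scan the group for the one we want.
        for article in chunk:
            if article["title"] == title:
                return article["body"]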
This was late 2010 or so. The job that processed the XML into database+bin data took over an hour to run on my laptop.
If I did it again, I'd use a compression library that was tuned for decompression and used an external dictionary. For the index, I'd use fst [1]
The project stalled out in parsing the wikitext; there are now libraries that can handle the parsing. I have considered resurrecting it; with JPEG 2000 or other image codecs, one could include the images as well. The system did not have FTS over the article text itself; I had planned to use some sort of tree of Bloom filters, something similar to [2].
I was really trying to keep the total footprint of all the data under 10GB or so, so it could fit on the majority of devices. Now, with larger flash sizes, better codecs and better compression libraries, the functionality and performance could be greatly enhanced.
I learned a ton from this project. I really recommend getting deep into something, but also working within your limits and having fallback plans when something looks insurmountable. There were plenty of times where I thought I was at a dead end, only to say "what if..." the next day, run a quick experiment, and be on to the next insurmountable roadblock.
Great catch, simonw (and everyone else in this thread).
I've reviewed some other static hosts and settled on wasabi for now. When my S3 bill rolls in, I'll reply to this comment with the damage. A scary number will help others avoid my oversight.
AWS Billing reported that I was charged $28.43 over the course of 7 hours, for a total transfer of 306.916 GB. I believe this captures the ~25k page views the site received before I switched over to another static file host.
Lesson learned for me. Thanks to everyone in this thread for saving my wallet.
Seems to be hosted at Surge.sh, which is hosted at DigitalOcean, so it seems this is not the OP's problem.
$ nslookup static.wiki
Server: 8.8.8.8
Address: 8.8.8.8#53
Non-authoritative answer:
static.wiki canonical name = na-west1.surge.sh.
Name: na-west1.surge.sh
Address: 188.166.132.94
$ whois 188.166.132.94
% This is the RIPE Database query service.
% The objects are in RPSL format.
%
% The RIPE Database is subject to Terms and Conditions.
% See http://www.ripe.net/db/support/db-terms-conditions.pdf
% Note: this output has been filtered.
% To receive output for a database update, use the "-B" flag.
% Information related to '188.166.132.0 - 188.166.135.255'
% Abuse contact for '188.166.132.0 - 188.166.135.255' is 'abuse@digitalocean.com'
inetnum: 188.166.132.0 - 188.166.135.255
netname: DIGITALOCEAN
country: NL
admin-c: PT7353-RIPE
tech-c: PT7353-RIPE
status: ASSIGNED PA
mnt-by: digitalocean
created: 2019-04-17T14:02:52Z
last-modified: 2019-04-17T14:02:52Z
source: RIPE
person: Network Operations
address: 101 Ave of the Americas, 10th Floor
address: New York, NY, 10013
address: United States of America
phone: +13478756044
nic-hdl: PT7353-RIPE
mnt-by: digitalocean
created: 2015-03-11T16:37:07Z
last-modified: 2019-04-17T14:37:51Z
source: RIPE # Filtered
org: ORG-DOI2-RIPE
% This query was served by the RIPE Database Query Service version 1.101 (HEREFORD)
But yes, it is wise to run storage/transfer heavy websites on something that doesn't charge for traffic like AWS and most other infrastructure providers do.
Why do people even use S3 if that's a risk? I really don't get this - do most people just not worry about accidentally wasting money, or do they just not understand the risk?
It's one thing if you have a source of revenue to justify your S3 costs. My interpretation of the parent commenter's concern is that this person has opened up a multi-gigabyte S3 file publicly and sent lots of traffic its way (via Hacker News) for what appears to be a passion project.
This is a risk any time you host anything on the internet on any system that charges you for storage, bandwidth or CPU usage. It's fine to take that risk on as long as you're aware of it and keep an eye on your spend.
In this case I'm nervous because 43GB is a lot larger than most projects!
As pointed out elsewhere, there's some basic work being done to prevent everyone from grabbing the .db file, but it's easy enough to bypass.
Also, they're probably seeing non-zero API call charges. S3 only charges $0.0004 per 1000 calls, but when you start to make lots of calls at scale, it can really add up. Still, for this site, it's probably not that bad. If my simple tests are a fair judge, it works out to something like $20 a month for 100k lookups.
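As a back-of-envelope check (requests per lookup and bytes per request below are pure guesses, and $0.09/GB is the usual first-tier egress rate), the numbers land in the same ballpark:

    GET_PRICE_PER_1000 = 0.0004        # S3 GET pricing mentioned above
    EGRESS_PER_GB = 0.09               # typical first-tier S3 internet egress

    lookups = 100_000
    requests_per_lookup = 30           # guess
    bytes_per_request = 64 * 1024      # guess

    requests = lookups * requests_per_lookup
    gigabytes = requests * bytes_per_request / 1024**3
    cost = requests / 1000 * GET_PRICE_PER_1000 + gigabytes * EGRESS_PER_GB
    print(f"{requests:,} GETs, {gigabytes:.0f} GB out, ~${cost:.2f}")
    # 3,000,000 GETs, 183 GB out, ~$17.68 -- bandwidth, not request count, dominates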
I've mentioned elsewhere on HN I used a similar technique with Python to query SQLite databases in S3. It works well, but there was an edge case on that project that resulted in millions of requests to S3 per invocation. It didn't break the bank, but it did push the project into the unprofitable range till we fixed that case.
I tried downloading the file directly with curl [0], but it seems like the host blocks non-partial GET requests, so downloading it is not really straightforward.
Bittorrent can do random access just fine, especially with v2 which enables content verification in 16KB chunks. Clients tend to be throughput oriented which leads to sub-optimal latency, but that's not a limitation of the protocol.
Connection establishment in bittorrent might be marginally higher overhead than HTTP, but not by much and the cost is highly amortized if you're doing many range requests in the same torrent. There is indeed more overhead from HAVE messages and such, but that's just because bittorrent is doing a lot more than HTTP to enable P2P. It's still not that bad as a percentage of the payload traffic.
Can we really? For a torrent to work you still need someone to host it in full at all times. Who will host a 40GB file that will be out of date in a week? The website provider will, so we're back to square one.
No, you don't. You just need N people to host 1/N of the file some of the time. The only person who needs to upload it in full is the original creator.
I've been using static copies of Wikipedia for years; they're great to have when traveling. I mostly use the Kiwix app. Their ZIM format database for all English is 83 GB; 43 GB for one without pictures. Compares nicely to the 43 GB here.
I once spent an hour in the Louvre learning about Rubens' astonishing paintings of Marie de' Medici thanks to Kiwix. Without it I was just like "huh, nice paintings, I wonder what they are?" They're incredible, and I owe that moment of kismet to Wikipedia, Kiwix, and my mobile phone. https://en.wikipedia.org/wiki/Marie_de%27_Medici_cycle
Seriously? They couldn't make the content something standard like a .zip file? Or a zip file with a SQL database in it? Bloody monkeys always have to go and create a new standard for everything. They will argue it's for performance or something, but it's really an exercise in ego. "Oh, mine's better!" No, it's not. What is better is what works with everything else in the world.
And that index! We couldn't capitalize the proper name of Wikipedia? And do you not understand there are visually impaired people in the world who can't read light red text on a white background? What is wrong with people?
And the terminology! What in the fuckety is Aard2? What is a .slob file? Why is it that everyone uses obscure acronyms or whatever for everything and never explains anything? Ego again, probably. Let's create a language that no one else understands to make ourselves feel superior! No, you are not superior, you are just turning language into thick vegetable soup.
Didn't you people go to university, or did you not pay attention when they taught you that the first reference is always completely spelled out and explained, and you can use the acronyms (or whatever they are) after that? It's only common sense that you don't spout gibberish and nonsense words without explaining them in the first instance, unless your goal is to confuse everyone for the purpose of making yourself feel superior.
And the Google sheet and the flat index, what a hodgepodge of nonsense. What the heck does it mean? What is the difference between the different versions? Who knows?
I have to go now. My eyes are bleeding from trying to read light red text on a bright white background. I'll check back again once I get a Braille laptop screen.
Does it actually work? I tried and it would always give up partway through the downloads (and most of the time when I ask in these threads it turns out the person recommending it doesn't actually use it, they just think it's a cool idea...)
I too have had rather disappointing results with Kiwix. Love the idea, but the implementation needs work. At this point, given the complexities of parsing wikitext, I wish Wikipedia themselves would offer SQLite dumps of the latest text-only content.
You should compress the individual article texts with zlib. Right now, looking up the text of a decent-sized article ("Virus") takes 10 round trips -- 5 appear to be B-tree traversals, but another 5 (!) are to read in the page content, as chunks of 8K, 8K, 16K, 32K and 64K. Compressing the article might cut a round trip off that.
Increasing the SQLite page size to something much larger like 64KB (the maximum) may improve average latency too, since fewer B-tree pages would need to be fetched to find the offset of the desired row, and most page contents could be read with one fetch. This matters more when fetches over the network take ~100ms.
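Something along these lines, as a sketch (table and column names are made up, and the page size only takes effect on a new or vacuumed database):

    import sqlite3, zlib

    con = sqlite3.connect("wiki.db")
    con.execute("PRAGMA page_size = 65536")  # only applies to a new or vacuumed DB
    con.execute("CREATE TABLE IF NOT EXISTS articles (title TEXT PRIMARY KEY, body BLOB)")

    def put(title, text):
        # Store the article body zlib-compressed so each row spans fewer pages.
        con.execute("INSERT OR REPLACE INTO articles VALUES (?, ?)",
                    (title, zlib.compress(text.encode("utf-8"), 9)))
        con.commit()

    def get(title):
        (blob,) = con.execute("SELECT body FROM articles WHERE title = ?",
                              (title,)).fetchone()
        return zlib.decompress(blob).decode("utf-8")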
I installed a reader with a Wikipedia dump (without multimedia) on the original iPhone. At the time (at least in my region) WiFi was quite rare and it really did feel like the Hitchhiker's guide and incredibly powerful.
I only kept it for a few weeks though, since it consumed almost all of the space. (And I imagine that was with my own language's edition; the English Wikipedia must have been too big already.)
I'm not trying to kill the fun here, but since this thing relies on a working internet connection, it would probably be pretty useless on a remote island
I remember having the iPhone 2G back in 2007, and that someone created a way to store a 5GB wikipedia dump (the complete wikipedia at that time, but without photos) on it, so you could have the whole Wikipedia in your pocket. At the time it felt pretty amazing.
Nice. I want this, and yet I have problems downloading even a 350MB file on my carrier, tethered via 2 different devices. Being able to download something like this should be a requirement to be called an Internet carrier. Non-mobile is not an option for some, e.g. low-income, unemployed and homeless people. Maybe they've been a software engineer for 20 years and can reach unreachable markets. (No doubt it's more profitable to spy on and steal from the vulnerable, if God allows.)
What if a customer needs their records and cannot afford to get them, and they lose their life or suffer significantly? Who pays the balance?
They didn't need to use the Internet? Okay, well they did use it, and they do during pandemic times.
There's a bug when the target article is titled the same as the disambiguation page. Search for "Earthbound", and try clicking on the article for the video game. It'll bring you right back to the disambiguation page.
The autocomplete trickiness is fun. Looks (from links in GitHub) like it's using full-text search and currently needs to scan a lot to rank results. Doing just prefix search might be a cheap half-solution to autocomplete, and maybe layering on hacks around stopwords etc. can get you a bit further.
For another angle, enwiki's full compressed list of titles (enwiki-20210720-all-titles-in-ns0.gz) is 86MB. Grabbing the whole dang thing (in the background?) and doing something with it client-side is a different sort of approach from what this page is going for but not unthinkable.
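A client-side sketch of that idea (in Python here for brevity; in the browser it would be JS), assuming the dump is one title per line with spaces as underscores:

    import gzip, bisect

    with gzip.open("enwiki-20210720-all-titles-in-ns0.gz", "rt", encoding="utf-8") as f:
        titles = sorted(line.rstrip("\n") for line in f)

    def autocomplete(prefix, limit=10):
        # Binary-search to the first candidate, then walk forward while it matches.
        i = bisect.bisect_left(titles, prefix)
        out = []
        while i < len(titles) and titles[i].startswith(prefix) and len(out) < limit:
            out.append(titles[i])
            i += 1
        return out

    print(autocomplete("Harry_Potter"))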
Each time I post a link to Hacker News that becomes popular, it ends up being at least $100 in bandwidth bills from Google Cloud, and I've got a simple, tiny, unbloated website. This poor guy is going to go bankrupt. This Internet dark age is like Ma Bell all over again, when we'd get those $500 long-distance bills for dialing into BBSs.
Hi, I am just commenting to mention a bug. When I look at the page for The Smiths' self-titled album ( http://static.wiki/en/The_Smiths_(album) ), the track listing section doesn't populate. Other than that, I love the simplicity and cleanness of the design.
That's awesome, man. I didn't know you could host Wikipedia on your own client / own back-end. Seems like this means that if someone built a better interface with better performance for Wikipedia, they could begin to steal some of its traffic, no?
You and dom96 (https://news.ycombinator.com/item?id=28013376) both noticed this issue, just two minutes apart. I guess there's some sort of inclination among HN readers testing out a static copy of Wikipedia to search for Nim? :-)
Maybe that would be a nice way to start Wikipedia alternatives; since it became a form of political activism, the trustworthiness of Wikipedia has fallen significantly.
Editing an existing page is usually pretty easy and conflict free. Creating a new page is close to impossible and strictly guarded.
It's almost not worth trying for a beginner, because you will almost certainly be rejected. The moderators require multiple approved sources of information for a page to be created. These must be reputable sources (i.e. not blog posts, forum posts, or other wikis), and must not have any kind of commercial link to the subject. So if you create a page about a product and you state that it has feature X, you cannot use the company's own page that talks about having X as a reference.
So basically you can’t make a page about something until a major news website writes about it or you can reference it in some formal study.
I tried to create a few pages. One of them was about a file format, and I found that I could not use the company's website with the PDF describing the spec as a reference for the format, because there is a commercial link between the file format and the source linked.
This is so painfully true. I've tried to submit two articles so far to AfC (articles for creation) and both were rejected. It's puzzling to me because I often see extremely low quality articles (with just one sentence, for example), published onto mainspace.
I have just come to accept this is how it is and unless you are a wiki power user who can just directly create a page, don’t bother and let someone else create it because you are wasting your time. I find it funny that articles like the one for GitLab had to fight off the wiki mods who wanted to remove it for not being notable.
The worst part is how Wikipedia does not have a logical user introduction flow. They have an insane amount of rules and process, but they will very quickly drop a brand new user into a text box and make it look like you simply type in some stuff and the page will be created.
They need some kind of reputation system that ramps users up to greater and greater feature sets, like Stack Overflow has, rather than pretending a brand new or unauthenticated user can just create a page.
I just discovered that we both may have been wrong.
Looks like you can skip the AfC review process entirely and this is given the go-ahead by Wikipedia (it's not frowned upon).
Basically, you have to create a Draft-namespace article (for example, Draft:Rust).
Then, once you have autoconfirmed privileges (10 edits and a 4-day-old account), you can move the page from the Draft namespace to the (Article) namespace...
Oh interesting. I've started editing a bit on Wikipedia recently, and seems like there are a lot of potential articles to improve upon.
The deletionism issue reminds me of this article https://www.gwern.net/In-Defense-Of-Inclusionism
Reminds me that the Apple Wiki Server that shipped with OS X Server back in the day used one SQLite file per page. Worked OK-ish.
I find that most Wikipedia pages I'd want to look at are utterly unintelligible, relying on conventions you'd need to look up separately to understand the article.
I can't use an offline Wikipedia without that coursework
This is awesome - it seems like a good use case for IPFS decentralized storage, either for copies of the big SQLite DB dump file, or (as an alternative) for the individual wiki article "snapshots".
I think it would be easier to keep it in sync with the original wiki using the second option, where each article in static.wiki is synced with the original Wikipedia page and updated individually, vs. updating the huge SQLite dump on every change in the originals.
Then SQLite would only be holding metadata and links to static (and maybe version-controlled) versions of the individual page/article snapshots, so the size would be much smaller and easier to distribute on IPFS.
Nope, it is - unless you think "loading HTML pages from a remote web server" is also not static.
The point is there's no computation going on on the server, behind the scenes; your browser is doing all the work of knowing what to fetch, NOT the remote end, with the magic of range requests.