
I'm nervous on your behalf about your S3 bill. Have you done the napkin maths for how much this will cost to host, especially given the relative inefficiency of the autocomplete queries?

A trick that might help: create a separate SQLite table containing just the article titles to autocomplete against. That might give you queries that use smaller ranges of the file.
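Something like this, roughly (a sketch using Python's sqlite3 for the offline build step; the en.db filename, the articles table and its columns are guesses at whatever the real schema looks like):

    import sqlite3

    db = sqlite3.connect("en.db")  # placeholder filename

    # Copy just the titles into their own table so autocomplete queries only
    # touch a small, contiguous part of the database file.
    db.executescript("""
        CREATE TABLE IF NOT EXISTS titles (
            id    INTEGER PRIMARY KEY,  -- rowid shared with the articles table
            title TEXT NOT NULL
        );
        INSERT OR IGNORE INTO titles (id, title) SELECT id, title FROM articles;
        CREATE INDEX IF NOT EXISTS titles_title_idx ON titles (title);
    """)
    db.commit()

    # Prefix search via the index: only a handful of B-tree pages need fetching.
    prefix = "San Franc"
    rows = db.execute(
        "SELECT title FROM titles WHERE title >= ? AND title < ? LIMIT 10",
        (prefix, prefix + "\uffff"),
    ).fetchall()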




> A trick that might help: create a separate SQLite table containing just the article titles to autocomplete against.

Another one: build a directed acyclic word graph (DAWG) [0][1] of just the page titles and download it to the browser's localStorage for prefix-based searches.

In our usage, ~5M entries take ~18MB. Wikipedia has ~6.3M articles [2].
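To illustrate the idea, a toy prefix index in Python: a plain trie rather than a real DAWG (a DAWG additionally merges shared suffixes, which is how it gets down to sizes like that); the sample titles are made up.

    def build_trie(words):
        root = {}
        for w in words:
            node = root
            for ch in w:
                node = node.setdefault(ch, {})
            node["$"] = True  # marks the end of a complete title
        return root

    def complete(trie, prefix, limit=10):
        node = trie
        for ch in prefix:          # walk down to the node for the prefix
            if ch not in node:
                return []
            node = node[ch]
        results = []

        def walk(n, acc):          # depth-first enumeration below the prefix
            if len(results) >= limit:
                return
            if "$" in n:
                results.append(prefix + acc)
            for ch in sorted(k for k in n if k != "$"):
                walk(n[ch], acc + ch)

        walk(node, "")
        return results

    titles = ["San Francisco", "San Diego", "Sandbox", "Paris"]  # made-up sample
    print(complete(build_trie(titles), "San"))
    # ['San Diego', 'San Francisco', 'Sandbox']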

[0] https://github.com/smhanov/dawg

[1] http://stevehanov.ca/blog/?id=115

[2] https://en.wikipedia.org/wiki/Wikipedia:Size_of_Wikipedia


A version of [1] with the pictures working: https://web.archive.org/web/20120211141906/http://stevehanov...


Thanks for making me aware of a new data structure!


I built a Wikipedia reader on the iPad years ago. I used two SQLite databases, one for the article titles using SQLite FTS; then I compressed the articles themselves and merged the compressed chunks into a set of files < 2GB, and the title database held the offsets into the compressed blobs. Including only the most popular pages, it came in just under 10GB.


That's really clever. Were you storing the article text as compressed data in BLOB columns in a SQLite table?


I realize I wasn't clear. I used two SQLite databases just to store the article titles and offsets into binary files (<2GB) that held chunks of compressed data. Each chunk contained tens to hundreds of articles, so that the decompressed amount was around 20MB, I believe. I had planned to store everything in SQLite, but there were too many issues. The FTS extension to SQLite is excellent; it made whole classes of problems melt away.

Recalling now: I stored the article text in groups of articles concatenated together so that the compression could take advantage of similar texts. This data was just Lua data structures, with the article title and article body in an array. So I knew which chunk an article was in, but I had to decompress a whole chunk at a time and then iterate through the list of decompressed articles to find the specific one. I was trying to balance search time, decompression time and compression ratio. I had a handful of parameters in the job that would take the XML dumps and create the SQLite databases and compressed blobs. Memory consumption would spike as a chunk was decompressed, but after a call to the Lua garbage collector once the article was pulled out of the decompression buffer, everything was fine again.
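Roughly, the lookup was like this (sketching it in Python here rather than the original Lua; the schema, the zlib compression and the JSON chunk format are simplified stand-ins for what was actually used):

    import json
    import sqlite3
    import zlib

    # Hypothetical index: maps each title to the blob file holding its chunk,
    # plus the chunk's byte offset and compressed length within that file.
    index = sqlite3.connect("titles.db")  # placeholder

    def fetch_article(title):
        row = index.execute(
            "SELECT chunk_file, chunk_offset, chunk_length "
            "FROM articles WHERE title = ?",
            (title,),
        ).fetchone()
        if row is None:
            return None
        chunk_file, offset, length = row

        # Decompress the whole chunk: tens to hundreds of articles packed
        # together so the compressor can exploit their similarity.
        with open(chunk_file, "rb") as f:
            f.seek(offset)
            records = json.loads(zlib.decompress(f.read(length)))

        # Then scan the decompressed chunk for the one article we wanted.
        for t, body in records:
            if t == title:
                return body
        return None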

This was late 2010 or so. The job that processed the XML into database+bin data took over an hour to run on my laptop.

If I did it again, I'd use a compression library tuned for fast decompression that supports an external dictionary. For the index, I'd use fst [1].
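zstd would fit the bill; with the Python zstandard bindings the dictionary part might look roughly like this (the dictionary file, trained offline with something like `zstd --train`, and the article bytes are placeholders):

    import zstandard

    # A dictionary trained offline on a sample of articles, shared by every
    # chunk, so small chunks still compress well and decompress quickly.
    with open("articles.dict", "rb") as f:      # hypothetical path
        shared_dict = zstandard.ZstdCompressionDict(f.read())

    compressor = zstandard.ZstdCompressor(level=19, dict_data=shared_dict)
    decompressor = zstandard.ZstdDecompressor(dict_data=shared_dict)

    article = b"Example article text ..."       # placeholder
    compressed = compressor.compress(article)
    restored = decompressor.decompress(compressed)
    assert restored == article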

The project stalled out on parsing the wikitext; there are now libraries that can handle that. I have considered resurrecting it: with JPEG 2000 or other image codecs, one could include the images as well. The system did not have FTS over the article text itself; I had planned to use some sort of tree of bloom filters, something similar to [2].

I was really trying to keep the total footprint of all the data under 10GB or so, so it could fit on the majority of devices. Now, with larger flash sizes, better codecs and better compression libraries, the functionality and performance could be greatly enhanced.

I learned a ton from this project. I really recommend getting deep into something, but also working within your limits and having fallback plans when something looks insurmountable. There were plenty of times when I thought I was at a dead end, only to say "what if ..." the next day, run a quick experiment, and be on to the next insurmountable roadblock.

[1] https://github.com/BurntSushi/fst

[2] https://arxiv.org/abs/2002.03057


Great catch, simonw (and everyone else in this thread).

I've reviewed some other static hosts and settled on Wasabi for now. When my S3 bill rolls in, I'll reply to this comment with the damage. A scary number will help others avoid my oversight.


AWS Billing reported that I was charged $28.43 over the course of 7 hours, for a total transfer of 306.916 GB. I believe this captures the ~25k page views the site received before I switched over to another static file host.
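For anyone checking the maths, that lines up with S3's standard egress pricing (a quick sanity check, assuming ~$0.09/GB data transfer out):

    cost_usd, transferred_gb = 28.43, 306.916
    print(cost_usd / transferred_gb)   # ≈ $0.093 per GB, roughly S3's ~$0.09/GB egress rate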

Lesson learned for me. Thanks to everyone in this thread for saving my wallet.


Thanks for reporting back!


It seems to be returning 403s now.

Quick test: https://s3.us-east-2.wasabisys.com/static.wiki/en.db


Seems to be hosted at Surge.sh, which is hosted at DigitalOcean, so it seems this is not OP's problem.

    $ nslookup static.wiki
    
    Server:  8.8.8.8
    Address: 8.8.8.8#53
    
    Non-authoritative answer:
    static.wiki canonical name = na-west1.surge.sh.
    Name: na-west1.surge.sh
    Address: 188.166.132.94
    
    $ whois 188.166.132.94
    
    % This is the RIPE Database query service.
    % The objects are in RPSL format.
    %
    % The RIPE Database is subject to Terms and Conditions.
    % See http://www.ripe.net/db/support/db-terms-conditions.pdf
    
    % Note: this output has been filtered.
    %       To receive output for a database update, use the "-B" flag.
    
    % Information related to '188.166.132.0 - 188.166.135.255'
    
    % Abuse contact for '188.166.132.0 - 188.166.135.255' is 'abuse@digitalocean.com'
    
    inetnum:        188.166.132.0 - 188.166.135.255
    netname:        DIGITALOCEAN
    country:        NL
    admin-c:        PT7353-RIPE
    tech-c:         PT7353-RIPE
    status:         ASSIGNED PA
    mnt-by:         digitalocean
    created:        2019-04-17T14:02:52Z
    last-modified:  2019-04-17T14:02:52Z
    source:         RIPE
    
    person:         Network Operations
    address:        101 Ave of the Americas, 10th Floor
    address:        New York, NY, 10013
    address:        United States of America
    phone:          +13478756044
    nic-hdl:        PT7353-RIPE
    mnt-by:         digitalocean
    created:        2015-03-11T16:37:07Z
    last-modified:  2019-04-17T14:37:51Z
    source:         RIPE # Filtered
    org:            ORG-DOI2-RIPE
    
    % This query was served by the RIPE Database Query Service version 1.101 (HEREFORD)
    
But yes, it is wise to run storage- and transfer-heavy websites on something that doesn't charge for traffic the way AWS and most other infrastructure providers do.


That's just the site itself (the HTML and JavaScript). The 43GB SQLite database is hosted on S3.


Why do people even use S3 if that's a risk? I really don't get this. Do most people just not worry about accidentally wasting money, or do they just not understand the risk?


It's one thing if you have a source of revenue to justify your S3 costs. My interpretation of the parent commenter's concern is that this person has opened up a multi-gigabyte S3 file publicly and sent lots of traffic its way (via Hacker News) for what appears to be a passion project.


This is a risk any time you host anything on the internet on any system that charges you for storage, bandwidth or CPU usage. It's fine to take that risk on as long as you're aware of it and keep an eye on your spend.

In this case I'm nervous because 43GB is a lot larger than most projects!


I think the point is that not every host charges for bandwidth or CPU usage, making S3 particularly risky.


There's no CPU charge, only per-request charges.

Bandwidth is free into S3 or from S3 to another AWS service so it's particularly handy for that use case.


For S3, yes.


Every system has risks. You just need to be aware of them going in.

S3 is excellent at many things, serving a ton of public traffic isn't one of them. Usually people put a CDN in front of it for that use case.


Correct, downloading this 40GB file could easily lead to huge S3 bills due to traffic. That's $1 for a full download.


$2, if my math is right.

As pointed out elsewhere, there's some basic work being done to prevent everyone from grabbing the .db file, but it's easy enough to bypass.

Also, they're probably seeing non-zero API call charges. S3 only charges $0.0004 per 1000 calls, but when you start to make lots of calls at scale, it can really add up. Still, for this site, it's probably not that bad. If my simple tests are a fair judge, it works out to something like $20 a month for 100k lookups.
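Back-of-envelope on those numbers (rough, and assuming the $20 figure is all request charges):

    price_per_request = 0.0004 / 1000          # S3 GET pricing: $0.0004 per 1,000 requests
    monthly_request_cost = 20.0                # the rough figure above
    lookups_per_month = 100_000

    total_requests = monthly_request_cost / price_per_request   # 50,000,000 requests
    requests_per_lookup = total_requests / lookups_per_month    # ~500 range requests per lookup
    print(total_requests, requests_per_lookup)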

I've mentioned elsewhere on HN that I used a similar technique with Python to query SQLite databases in S3. It works well, but there was an edge case on that project that resulted in millions of requests to S3 per invocation. It didn't break the bank, but it did push the project into the unprofitable range until we fixed that case.


I tried downloading the file directly with curl [0], but it seems like it blocks non-partial GET requests, so downloading it is not entirely straightforward.

[0] https://s3.amazonaws.com/static.wiki/db/en.db


     curl -v "https://s3.amazonaws.com/static.wiki/db/en.db" -H "Referer: http://static.wiki/" -H "Range: bytes=0-"
The above works, but for the sake of OP's wallet I would suggest that you do not download the entire database from his S3 bucket.


Instead, download it from Kaggle, which is offering to host it for free: https://www.kaggle.com/segfall/markdownlike-wikipedia-dumps-...


Get a FusionIO drive off eBay and host the DB yourself; if you need to, stick the computer behind a static-IP VPN.


He should seed a torrent file


Torrents aren't great for random access, which is what this app is doing.


Bittorrent can do random access just fine, especially with v2 which enables content verification in 16KB chunks. Clients tend to be throughput oriented which leads to sub-optimal latency, but that's not a limitation of the protocol.


I guess "just fine" is a matter of opinion. Connection setup and piece-bitmap tracking seem like more overhead than HTTP range requests.


Connection establishment in bittorrent might be marginally higher overhead than HTTP, but not by much and the cost is highly amortized if you're doing many range requests in the same torrent. There is indeed more overhead from HAVE messages and such, but that's just because bittorrent is doing a lot more than HTTP to enable P2P. It's still not that bad as a percentage of the payload traffic.


But at least we could get a copy without incurring a massive s3 bill.


Can we really? For a torrent to work, you still need someone to host it in full at all times. Who will host a 40GB file that will be out of date in a week? The website provider will, so we're back to square one.


No, you don't. You just need N people to host 1/N of the file some of the time. The only person who needs to upload it in full is the original creator.


I would like a copy of that SQLite database to play with.

Having it all preprocessed would save a bunch of work compared to getting it from Wikipedia itself.


IPFS?


Second that. I have no hands-on experience with it, but I've seen seemingly more complex apps deployed on IPFS: https://docs.ipfs.io/concepts/usage-ideas-examples/



