Building an Open Source Decentralized E-Book Search Engine

boredumb · 2024-03-11T13:17:20 1710163040

Many moons ago I wanted to do something similar for AI data sets and models over IPFS. I don't know the future for IPFS but I do hope the essence of a p2p data sharing infrastructure becomes more accessible to help individuals tackle some of the issues with large datasets with less hardware on hand.

https://github.com/JakeKalstad/IPFSPytorchDataset https://github.com/JakeKalstad/load_ipfs_pytorch_model

droopyEyelids · 2024-03-11T14:38:29 1710167909

The title got me really excited that they were doing full text search. Boy that would be an awesome project. Zlib and Google Books do it, but it would be great to have an open source version that everyone could contribute to, and that provided access to full texts.

raybb · 2024-03-11T14:45:36 1710168336

OpenLibrary does provide search access to full texts. For example: https://openlibrary.org/search/inside?q=%22institutional+thi...

It is open source and they're always looking for contributors. I think they'd especially welcome help improving search!

https://github.com/internetarchive/openlibrary/

mellutussa · 2024-03-11T23:00:50 1710198050

I think a distributed OCR project is needed. Problem is that a lot of books are PDF scans and missing raw text. OCRmyPDF does a pretty good job of it but it's CPU intensive.

greggsy · 2024-03-12T02:08:23 1710209303

I'd wager that there are several players in the AI market who have already scraped and OCR'd every book and magazine on zlib and libgen to feed into training models. Google has almost certainly piped everything they have in Google Books into their models, before some future legal case says they can't. Won't take long before the open community starts doing the same.

carlosjobim · 2024-03-11T16:26:04 1710174364

There's 13 search engines in a dozen if you only want book title or author. What's lacking is a search index of the content of e-books. Something that will soon be incredibly important in the face of generative AI. Somebody here on HN told me it only takes a laptop to index the content of millions of books, while other people say the scope is almost impossible.

Is there any project working on this?

dmotz · 2024-03-11T19:18:14 1710184694

I have a side project that aims to organize your ebook highlight collections with on-device semantic search. [1] Right now it only indexes your own content but I'd like to add a mode that allows you to share your collection and let others find relevant ideas via semantic search -- a discovery platform for ideas found in books. It's open source if you want a sense of how it works now. [2]

[1] https://emdash.ai/

[2] https://github.com/dmotz/emdash

dredmorbius · 2024-03-11T23:44:16 1710200656

The size of the index is far more dependent on the search text than the number of searched items.

I believe Google documented some of this in its early days, noting that a search index returns the relevant metadata matching a specific query. The query space itself is largely based on both raw keywords and tuples (2- or 3-word ngrams if memory serves, though I'm hazy on this), the latter meeting some minimum frequency requirement. Longer search terms can be constructed from shorter ngrams.

A typical advanced native-tongue English vocabulary is about 40,000 words. An expansive dictionary might contain fewer than 250,000 words, including obsolete ones.

Mapping a vocabulary to works citing those words is relatively straightforward. Ngrams experience combinatorial expansion, but are still a reasonably constrained space. And we now have well over a quarter-century's experience indexing written content at Web scale.

A laptop could probably make a decent cut at providing a useful index of many millions of books, though you'd probably want a somewhat larger system for a more comprehensive index, in particular to rank-index the search space, which is probably the more considerable challenge.
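The word-to-works mapping described above (usually called an inverted index) can be sketched in a few lines. This is a minimal illustration, not how Google or any real engine does it: the tokenizer is naive, only 2-word ngrams are stored, there is no minimum-frequency cutoff or ranking, and the function names and sample books are made up for the example.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it on non-alphanumeric characters."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(books):
    """Map every word and every 2-word ngram to the set of book IDs containing it."""
    index = defaultdict(set)
    for book_id, text in books.items():
        words = tokenize(text)
        for word in words:
            index[word].add(book_id)
        # Adjacent word pairs; a real index would likely keep only frequent ones.
        for first, second in zip(words, words[1:]):
            index[f"{first} {second}"].add(book_id)
    return index


def search(index, query):
    """Return IDs of books containing every query word and every adjacent word pair."""
    words = tokenize(query)
    terms = words + [f"{a} {b}" for a, b in zip(words, words[1:])]
    if not terms:
        return set()
    # A longer phrase is approximated by intersecting its shorter ngrams.
    return set.intersection(*(index.get(term, set()) for term in terms))


# Hypothetical sample data, purely for illustration.
books = {
    "book_a": "The inverted index maps words to books",
    "book_b": "A search index maps queries to results",
    "book_c": "Books about words and maps",
}
index = build_index(books)
print(sorted(search(index, "index maps")))  # ['book_a', 'book_b']
print(sorted(search(index, "maps words")))  # ['book_a']
```

Note that the number of keys grows with the vocabulary and its word pairs, while each additional book only adds IDs to existing sets, which is the point made above about index size.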

carlosjobim · 2024-03-12T03:08:03 1710212883

That was a great reply, and made me understand how these things work a lot better. Thank you!

dredmorbius · 2024-03-12T06:40:00 1710225600

The key concept here is the inverted index:

<https://en.wikipedia.org/wiki/Inverted_index>

myco_logic · 2024-03-11T17:57:11 1710179831

Depends on how beefy that laptop is...

I've been doing some local LLM stuff at work recently, and even with the amazing advances in quantization lately, doing that kind of stuff on a ThinkPad is feasible, but still strongly inferior to just renting out a VPS with a couple 4090/H100s for several hours.

The biggest thing with summarizing stuff is that most local LLMs don't have very big context windows, so they have trouble with larger texts like even a short Vonnegut novel (I was just testing 'em with summarizing GitHub issues, and even with a 16k token context window they still sometimes struggle if there are a lot of comments).

There are probably smarter people than I who could get this working on a Raspberry Pi though... ;)

CWuestefeld · 2024-03-11T17:29:56 1710178196

I believe that Calibre, the popular and free ebook management tool, now supports indexing the content of all books in your library.

bt1a · 2024-03-11T16:47:23 1710175643

Perhaps the initial creation of the index is indeed something that an average laptop could accomplish, but I'd imagine that frequently updating the index and serving requests against it would be compute-intensive. I have nothing to back this up but speculation. Would love to learn more!

Mortiffer · 2024-03-11T13:43:11 1710164591

Could you detail how you populate the search index and what you expect the memory limits to be?

devops000 · 2024-03-11T14:45:00 1710168300

Cool! Could it be used for torrent searching? Like running web torrent with video streaming and a decentralized search engine.

j2qk3b · 2024-03-11T14:51:25 1710168685

Yes! Try this one: https://anybt.eth.limo/

I will build an open sourced version too!

hanniabu · 2024-03-11T15:20:23 1710170423

Nice to find eth.limo being used in the wild

j2qk3b · 2024-03-25T12:53:09 1711371189

https://news.ycombinator.com/item?id=39815170

There is an open sourced version for torrent searching here, using the same tech.

ValleZ · 2024-03-11T15:47:54 1710172074

Is this an actual search engine or just a front end which builds “select from” queries?

v010101 · 2024-03-11T14:28:38 1710167318

libstc.cc

MrThoughtful · 2024-03-11T14:04:55 1710165895

What on earth is this about?

"I was recommended ... Liber3 ..., which uses ENS domain names ... running on ENS and IPFS ... they appear to be using Glitter ... a ... service built with Tendermint."

This sounds like a signal from outer space to me. In a language used in a different galaxy.

I tried that Liber3 thing, but whatever I do, I get "Oops! Something went wrong. Please refresh or try again later".

What is this all about?

inhumantsar · 2024-03-12T16:00:03 1710259203

* ENS -> Ethereum Name Service. DNS but for blockchains.
* IPFS -> InterPlanetary File System. Distributed object store, think immutable P2P S3.
* Glitter -> Sounds familiar but it's not coming to mind
* Tendermint -> Consensus engine for blockchains, forms part of a toolchain meant to enable interop between blockchains alongside the Inter-Blockchain Communication (IBC) Protocol and the Cosmos SDK

The blockchain ecosystems really are their own little world unto themselves. It's all pretty cliquey, not in an exclusionary way but if you're not actively seeking it out then there's very little chance of you hearing about any of it.

Side note: IPFS is well worth checking out if you're interested in databases or decentralized zero-trust systems, even if you're a blockchain skeptic. They're doing some really interesting work under the hood. The team hasn't latched on to the gold rush mentality the way nearly all blockchain projects have.
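To make the "immutable P2P S3" description a bit more concrete, here is a toy sketch of content addressing, the idea that an object's address is derived from its bytes. It is an assumption-laden illustration only: real IPFS uses CIDs built on multihash and splits files into a Merkle DAG of blocks, whereas this uses a plain SHA-256 hex digest, a single in-memory dict, and a made-up `ContentStore` class.

```python
import hashlib


class ContentStore:
    """Toy content-addressed store: the key of each block is the hash of its bytes."""

    def __init__(self):
        self._blocks = {}

    def put(self, data: bytes) -> str:
        # The address is computed from the content, so identical data
        # always lands at the same address and cannot be silently changed.
        address = hashlib.sha256(data).hexdigest()
        self._blocks[address] = data
        return address

    def get(self, address: str) -> bytes:
        data = self._blocks[address]
        # Any peer can verify what it received by re-hashing it.
        if hashlib.sha256(data).hexdigest() != address:
            raise ValueError("block does not match its address")
        return data


store = ContentStore()
address = store.put(b"some ebook bytes")
assert store.get(address) == b"some ebook bytes"
```

Because the address depends only on the bytes, "editing" an object really means storing a new object under a new address, which is where the immutability comes from.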

WolfeReader · 2024-03-11T14:44:05 1710168245

The title is the de-jargonized version. It's a set of instructions to build an open-source ebook search engine. (Admittedly there is still some jargon in that description, but not to the level of naming specific libraries.)

The bulk of the article is implementation details, helpfully hyperlinked.

droopyEyelids · 2024-03-11T14:42:32 1710168152

[flagged]

pvg · 2024-03-11T15:18:09 1710170289

https://hn.algolia.com/?dateRange=all&page=0&prefix=true&que...

throwawayyyyyy2 · 2024-03-11T15:19:43 1710170383

And then realize it has existed for almost 15 years and it's called libgen.rs

spondylosaurus · 2024-03-11T16:24:54 1710174294

Anna's Archive is even better!

brevitea · 2024-03-11T19:36:12 1710185772

IMO, the more the merrier. That's the joy of decentralization and P2P.

tamimio · 2024-03-11T21:38:23 1710193103

It seems they are using Flask in their code, which just goes to show you don’t need to go crazy with your stack to build useful software.

Dudhbbh3343 · 2024-03-11T18:03:03 1710180183

[flagged]

bastawhiz · 2024-03-11T18:31:11 1710181871

Sounds like they're on ipfs with a metadata database on some web3 system.

neilv · 2024-03-11T17:30:29 1710178229

This seems to be intended for IP piracy. Clarifying that in the title would help.

I'm trying to encourage publishers and authors to offer legitimate sales of DRM-free ebooks, so would prefer we try not to have the term "ebook" associated with piracy.

sureglymop · 2024-03-11T18:50:26 1710183026

It's a search engine... What about it makes it specific to IP piracy?

I actually understand your point well but I think it's even more important not to group in any legitimate use of technology with illegitimate use of it. Especially considering recent events (lawsuits over Yuzu and Dolphin emulators).

neilv · 2024-03-12T18:19:21 1710267561

Good point. In much the same way that you're trying to avoid emulators being considered only for piracy, I'm trying to avoid that with ebooks.

What I was commenting upon is that I think this particular thing will actually appear to many people to be for piracy of ebooks, and I don't want "ebook" to become synonymous with "pirated book" in the mind of the public (and especially not in the minds of publishers and authors who I want to encourage and support making non-DRM ebooks).

As a compromise, I'd like people whose efforts are intended for pirating books to distinguish that from legitimate ebooks, not call it simply "ebooks". Maybe then it'll be easier for our respective "freedom and access" goals to coexist.

(FWIW, I'm actually sympathetic to some of the uses of pirated books, such as sharing the wealth of the world's information with people who just can't afford it, or who have it officially denied to them. I'm less sympathetic to people who could pay for something but choose to take it instead, but I'm not trying to combat that here. I just want to mitigate some of the bad effects of piracy, such as legitimate buyers only being able to get DRM'd books, and only for consumer-hostile locked-down devices. Please don't inadvertently sabotage that orthogonal effort by appropriating language.)

citruscomputing · 2024-03-11T22:49:49 1710197389

It does seem to be! Isn't that cool?

jrm4 · 2024-03-12T01:48:51 1710208131

Sure, but by the same argument, libraries seem to be intended for IP piracy (or more precisely, the thing that drives most of this, which is "people reading books for free") as well.

RamblingCTO · 2024-03-11T17:58:42 1710179922

[flagged]

neilv · 2024-03-11T18:09:09 1710180549

Title is "Building an Open Source Decentralized E-Book Search Engine", and screenshot seems to suggest piracy.

t-3 · 2024-03-11T20:24:51 1710188691

Nothing about the title suggests piracy, and the screenshot doesn't show download links - hell, there aren't any actual Harry Potter books in a search for "Harry Potter". Even if it were searching for files, free and legal ebooks are ubiquitous, no copyright infringement necessary to make it a worthwhile endeavor.

WolfeReader · 2024-03-11T21:11:51 1710191511

Please be specific about how the screenshot advocates piracy.

(Also, a personal preference: never use the phrase "seems to suggest" again; if you're going to make an accusation, be honest enough to actually make it.)