Many moons ago I wanted to do something similar for AI data sets and models over IPFS. I don't know the future for IPFS but I do hope the essence of a p2p data sharing infrastructure becomes more accessible to help individuals tackle some of the issues with large datasets with less hardware on hand.
The title got me really excited that they were doing full text search. Boy, that would be an awesome project. Zlib and Google Books do it, but it would be great to have an open source version that everyone could contribute to, and that provided access to full texts.
I think a distributed OCR project is needed. The problem is that a lot of books are PDF scans with no raw text. OCRmyPDF does a pretty good job of it, but it's CPU intensive.
I'd wager that there are several players in the AI market who have already scraped and OCR'd every book and magazine on zlib and libgen to feed into training models. Google has almost certainly piped everything they have in Google Books into their models, before some future legal case says they can't. Won't take long before the open community starts doing the same.
Search engines are a dime a dozen if you only want to search by book title or author. What's lacking is a search index of the content of e-books, something that will soon be incredibly important in the face of generative AI. Somebody here on HN told me it only takes a laptop to index the content of millions of books, while other people say the scope is almost impossible.
I have a side project that aims to organize your ebook highlight collections with on-device semantic search. [1] Right now it only indexes your own content but I'd like to add a mode that allows you to share your collection and let others find relevant ideas via semantic search -- a discovery platform for ideas found in books. It's open source if you want a sense of how it works now. [2]
The size of the index is far more dependent on the search text than the number of searched items.
I believe Google documented some of this in its early days, noting that a search index returns the relevant metadata matching a specific query. The query space itself is largely based on both raw keywords and tuples (2- or 3-word ngrams if memory serves, though I'm hazy on this), the latter meeting some minimum frequency requirement. Longer search terms can be constructed from shorter ngrams.
A typical advanced native-tongue English vocabulary is about 40,000 words. An expansive dictionary might contain fewer than 250,000 words, including obsolete ones.
Mapping a vocabulary to works citing those words is relatively straightforward. Ngrams experience combinatorial expansion, but are still a reasonably constrained space. And we now have well over a quarter-century's experience indexing written content at Web scale.
A laptop could probably make a decent cut at providing a useful index of many millions of books, though you'd probably want a somewhat larger system for a more comprehensive index, in particular to rank-index the search space, which is probably the more considerable challenge.
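To make the vocabulary-plus-ngram idea above concrete, here is a minimal sketch of an inverted index in Python. It is a toy under stated assumptions, not how any production engine works: the function names (`tokenize`, `build_index`, `search`) are made up for illustration, everything lives in memory, only single words and 2-word ngrams are indexed, and there is no ranking at all, which is the part described above as the harder problem.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def build_index(books):
    """Map every word and every 2-word ngram to the set of book ids containing it.

    `books` is a dict of book_id -> full text (a hypothetical input format).
    """
    index = defaultdict(set)
    for book_id, text in books.items():
        words = tokenize(text)
        for word in words:
            index[word].add(book_id)
        for first, second in zip(words, words[1:]):
            index[first + " " + second].add(book_id)
    return index


def search(index, query):
    """Return ids of books matching the query.

    A one-word query is a plain lookup. A longer query is broken into
    overlapping 2-word ngrams and the posting sets are intersected.
    """
    words = tokenize(query)
    if not words:
        return set()
    if len(words) == 1:
        terms = words
    else:
        terms = [a + " " + b for a, b in zip(words, words[1:])]
    results = None
    for term in terms:
        postings = index.get(term, set())
        results = set(postings) if results is None else results & postings
    return results


books = {
    "slaughterhouse": "So it goes. Billy Pilgrim has come unstuck in time.",
    "cradle": "Call me Jonah. My parents did, or nearly did.",
}
index = build_index(books)
print(search(index, "unstuck in time"))
```

Note that building a long phrase out of shorter ngrams only yields candidates: a book can contain every 2-word piece without containing the full phrase, so a real system would verify or rank the matches afterwards.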
I've been doing some local LLM stuff at work recently, and even with the amazing advances in quantization lately, doing that kind of stuff on a ThinkPad is feasible, but still strongly inferior to just renting out a VPS with a couple 4090/H100s for several hours.
The biggest thing with summarizing stuff is that most local LLMs don't have very big context windows, so they have trouble with larger texts like even a short Vonnegut novel (I was just testing 'em with summarizing GitHub issues, and even with a 16k token context window they still sometimes struggle if there are a lot of comments).
There are probably smarter people than I who could get this working on a Raspberry Pi though... ;)
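One common workaround for small context windows, which is not something the comment above describes doing, is to split a long text into overlapping chunks that each fit the window and process them one at a time. A rough sketch, assuming whitespace-separated words are an acceptable stand-in for tokens (real tokenizers count differently) and using a hypothetical helper name `chunk_words`:

```python
def chunk_words(text, max_words=3000, overlap=200):
    """Split text into overlapping chunks of at most `max_words` words.

    Word counts are only an approximation of token counts (an assumption
    of this sketch), so leave headroom below the model's real limit.
    """
    if max_words <= 0 or overlap < 0 or overlap >= max_words:
        raise ValueError("need 0 <= overlap < max_words")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks


sample = " ".join("w%d" % i for i in range(10))
for chunk in chunk_words(sample, max_words=4, overlap=1):
    print(chunk)
```

Each chunk would then be summarized separately and the partial summaries combined in a final pass. The overlap is there so a sentence cut at a chunk boundary still appears whole in one of the chunks.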
Perhaps the initial creation of the index is indeed something that an average laptop could accomplish, but I'd imagine that frequently updating the index and serving requests against it would be compute-intensive. I have nothing to back this up but speculation. Would love to learn more!
"I was recommended ... Liber3 ..., which uses ENS domain names ... running on ENS and IPFS ... they appear to be using Glitter ... a ... service built with Tendermint."
This sounds like a signal from outer space to me. In a language used in a different galaxy.
I tried that Liber3 thing, but whatever I do, I get "Oops! Something went wrong. Please refresh or try again later".
* ENS -> Ethereum Name Service. DNS but for blockchains.
* IPFS -> Interplanetary File System. Distributed object store, think immutable P2P S3.
* Glitter -> Sounds familiar but it's not coming to mind
* Tendermint -> Consensus engine for blockchains, forms part of a toolchain meant to enable interop between blockchains alongside the Inter-Blockchain Communication (IBC) Protocol and the Cosmos SDK
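For anyone wondering what "immutable P2P S3" means in practice, the core idea is content addressing: the key for a blob is derived from a hash of its bytes, so the same content always gets the same address and any change produces a different one. A toy sketch follows, with the caveat that it is a simplification. Real IPFS uses CIDs (a multihash-based identifier format), splits files into linked blocks, and runs over a network. The `ContentStore` class here is hypothetical and just uses a SHA-256 hex digest and an in-memory dict.

```python
import hashlib


class ContentStore:
    """In-memory toy store where the key is the SHA-256 of the value."""

    def __init__(self):
        self._blocks = {}

    def put(self, data: bytes) -> str:
        """Store the bytes and return their content address."""
        address = hashlib.sha256(data).hexdigest()
        self._blocks[address] = data
        return address

    def get(self, address: str) -> bytes:
        """Fetch bytes by address, checking they still match their hash."""
        data = self._blocks[address]
        if hashlib.sha256(data).hexdigest() != address:
            raise ValueError("stored data does not match its address")
        return data


store = ContentStore()
address = store.put(b"So it goes.")
print(address)
print(store.get(address))
```

Because the address is the hash, whoever fetches the data can verify it without trusting the peer that served it, which is what makes the zero-trust, peer-to-peer part workable.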
The blockchain ecosystems really are their own little world unto themselves. It's all pretty cliquey, not in an exclusionary way but if you're not actively seeking it out then there's very little chance of you hearing about any of it.
Side note: IPFS is well worth checking out if you're interested in databases or decentralized zero-trust systems, even if you're a blockchain skeptic. They're doing some really interesting work under the hood. The team hasn't latched on to the gold rush mentality the way nearly all blockchain projects have.
The title is the de-jargonized version. It's a set of instructions to build an open-source ebook search engine. (Admittedly there is still some jargon in that description, but not to the level of naming specific libraries.)
The bulk of the article is implementation details, helpfully hyperlinked.
This seems to be intended for IP piracy. Clarifying that in the title would help.
I'm trying to encourage publishers and authors to offer legitimate sales of DRM-free ebooks, so would prefer we try not to have the term "ebook" associated with piracy.
It's a search engine... What about it makes it specific to IP piracy?
I actually understand your point well but I think it's even more important not to group in any legitimate use of technology with illegitimate use of it. Especially considering recent events (lawsuits over Yuzu and Dolphin emulators).
Good point. In much the same way that you're trying to avoid emulators being considered only for piracy, I'm trying to avoid that with ebooks.
What I was commenting upon is that I think this particular thing will actually appear to many people to be for piracy of ebooks, and I don't want "ebook" to become synonymous with "pirated book" in the mind of the public (and especially not in the minds of publishers and authors who I want to encourage and support making non-DRM ebooks).
As a compromise, I'd like people whose efforts are intended for pirating books to distinguish that from legitimate ebooks, not call it simply "ebooks". Maybe then it'll be easier for our respective "freedom and access" goals to coexist.
(FWIW, I'm actually sympathetic to some of the uses of pirated books, such as sharing the wealth of the world's information with people who just can't afford it, or who have it officially denied to them. I'm less sympathetic to people who could pay for something but choose to take it instead, but I'm not trying to combat that here. I just want to mitigate some of the bad effects of piracy, such as legitimate buyers only being able to get DRM'd books, and only for consumer-hostile locked-down devices. Please don't inadvertently sabotage that orthogonal effort by appropriating language.)
Sure, but by the same argument, libraries seem to be intended for IP piracy (or more precisely, the thing that drives most of this, which is "people reading books for free") as well.
Nothing about the title suggests piracy, and the screenshot doesn't show download links - hell, there aren't any actual Harry Potter books in a search for "Harry Potter". Even if it were searching for files, free and legal ebooks are ubiquitous, no copyright infringement necessary to make it a worthwhile endeavor.
Please be specific about how the screenshot advocates piracy.
(Also, a personal preference: never use the phrase "seems to suggest" again; if you're going to make an accusation, be honest enough to actually make it.)
https://github.com/JakeKalstad/IPFSPytorchDataset https://github.com/JakeKalstad/load_ipfs_pytorch_model