I've been meaning to attempt running a custom search engine for particular sites I've 'bookmarked'. Some sites contain gold that could be useful in the future and is not often discovered in Google results.

Should I go the Postgres/Elasticsearch route, or are there somewhat out-of-the-box solutions available?


For such a light demand and a fixed set of sites, a single-file SQLite DB is probably best. Modern SQLite has full-text capabilities that are quite powerful and relatively easy to implement.

https://www.sqlite.org/fts5.html
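
To give a sense of how little code that takes, here's a rough sketch using Python's stdlib sqlite3 (assuming the bundled SQLite was built with FTS5, which most are; table and column names are just placeholders):

    import sqlite3

    # In-memory DB for the sketch; point this at a file for real use.
    conn = sqlite3.connect(":memory:")

    # FTS5 virtual table; the porter tokenizer stems words so "running" matches "run".
    conn.execute("CREATE VIRTUAL TABLE pages USING fts5(url, title, body, tokenize='porter')")

    conn.executemany(
        "INSERT INTO pages (url, title, body) VALUES (?, ?, ?)",
        [
            ("https://example.com/a", "Intro to FTS", "full text search in sqlite is easy"),
            ("https://example.com/b", "Other page", "nothing relevant here"),
        ],
    )

    # MATCH runs the full-text query; bm25() ranks by relevance (lower is better).
    rows = conn.execute(
        "SELECT url, title FROM pages WHERE pages MATCH ? ORDER BY bm25(pages)",
        ("full text search",),
    ).fetchall()
    print(rows)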


For something small with a minimal footprint, I'd recommend Typesense. https://github.com/typesense/typesense

Elasticsearch is heavy, and relational databases with search bolted on (like Postgres or SQLite) aren't great.
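
If it helps, a rough sketch of indexing and searching with the Typesense Python client against a local server (host, port, and API key are placeholders for whatever you started the server with):

    import typesense  # pip install typesense

    client = typesense.Client({
        "nodes": [{"host": "localhost", "port": "8108", "protocol": "http"}],
        "api_key": "xyz",
        "connection_timeout_seconds": 2,
    })

    client.collections.create({
        "name": "pages",
        "fields": [
            {"name": "title", "type": "string"},
            {"name": "body", "type": "string"},
        ],
    })

    client.collections["pages"].documents.create({"title": "Superman", "body": "man of steel"})

    # Typo tolerance is on by default, so "suprman" still finds "Superman".
    hits = client.collections["pages"].documents.search({"q": "suprman", "query_by": "title,body"})
    print(hits["found"])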


It depends on what the user requirements are. FTS works pretty well with both Postgres and SQLite, in my experience.

Here's a git repo someone can modify to do a cross-comparison on a specific dataset, if they're interested. It doesn't seem to indicate the RDBMSs are outclassed in a small-scale FTS implementation.

https://github.com/VADOSWARE/fts-benchmark
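
For reference, the Postgres side of that is only a few lines. A rough sketch with psycopg2 (needs Postgres 12+ for the generated column; table and connection details are made up):

    import psycopg2  # pip install psycopg2-binary

    conn = psycopg2.connect("dbname=bookmarks")
    cur = conn.cursor()

    # A tsvector column kept in sync automatically, plus a GIN index for speed.
    cur.execute("""
        CREATE TABLE IF NOT EXISTS pages (
            id serial PRIMARY KEY,
            url text,
            body text,
            tsv tsvector GENERATED ALWAYS AS (to_tsvector('english', body)) STORED
        )
    """)
    cur.execute("CREATE INDEX IF NOT EXISTS pages_tsv_idx ON pages USING gin(tsv)")

    cur.execute("INSERT INTO pages (url, body) VALUES (%s, %s)",
                ("https://example.com", "full text search in postgres"))

    # websearch_to_tsquery accepts Google-ish syntax; ts_rank orders by relevance.
    cur.execute("""
        SELECT url, ts_rank(tsv, q) AS rank
        FROM pages, websearch_to_tsquery('english', %s) q
        WHERE tsv @@ q
        ORDER BY rank DESC
    """, ("full text",))
    print(cur.fetchall())
    conn.commit()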


For personal use, nobody cares about 100ms vs 10ms response times. What they do care about is relevance. Consider the following from that repo's outputs:

Typesense

    [timing] phrase [superman]: returned [28] results in 4.222797 ms
    [timing] phrase [suprman]: returned [28] results in 3.663458 ms

SQLite

    [timing] phrase [superman]: returned [47] results in 0.351138 ms
    [timing] phrase [suprman]: returned [0] results in 0.07513 ms
So SQLite is faster, but who cares? I want things like relevance and typo resilience without having to configure anything.


The article covers typo resilience in the section "Typo tolerance / fuzzy search".

This adds a step between query entry and text search: if a query word isn't a known lexeme, you find the most similar lexemes and search for those instead. Seems like a reasonable compromise to me?
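
Roughly, that step looks like this (a sketch with pg_trgm, reusing the pages/tsv table from the earlier example; the lexemes table comes from ts_stat, and 0.3 is just pg_trgm's default similarity threshold):

    import psycopg2

    conn = psycopg2.connect("dbname=bookmarks")
    cur = conn.cursor()

    # One-time setup: collect every distinct lexeme the corpus produced.
    cur.execute("CREATE EXTENSION IF NOT EXISTS pg_trgm")
    cur.execute("CREATE TABLE IF NOT EXISTS lexemes AS "
                "SELECT word FROM ts_stat('SELECT tsv FROM pages')")

    def correct(word):
        # Swap a query word for its closest known lexeme, if one is close enough.
        cur.execute("""
            SELECT word FROM lexemes
            WHERE similarity(word, %s) > 0.3
            ORDER BY word <-> %s
            LIMIT 1
        """, (word, word))
        row = cur.fetchone()
        return row[0] if row else word

    # "suprman movie" becomes something like "superman movie" before hitting FTS.
    query = " ".join(correct(w) for w in "suprman movie".split())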


I'm not trying to be argumentative; as long as people find a solution they're happy with, I think that's great. For me, handling typos matters far less, though I can see how it would be valuable in many applications. I'm usually reluctant to tie in and learn another set of services when I can get 90% of the way there with one, while leaving the option of adding it later if additional requirements make it necessary.


Also, I’ve got a small project in which I try to compare Meilisearch and Postgres FTS with pg_trgm. It’s called PodcastSaver:

Podcastsaver.com (click on the nerds tab in the top right)

Never got to it, but there are a bunch of other search engines worth adding (Sonic, Typesense, etc.). Maybe some day.


I've been wanting to do something similar. ArchiveBox seems to be the best solution for this sort of self-hosted, searchable web archive. It has multiple search back-ends and plugins to sync browser bookmarks (or even history).

I haven't finished getting it set up though, so take this recommendation with a hefty grain of salt.


How would something like this work in practice? Would you generate any tags or summaries per site when inserting it into the db?


ArchiveBox can extract text from HTML (and possibly PDFs too). I think it can be configured to extract subtitles from YouTube videos as well. So it can do full-text search. Basically you could have your own offline, curated search engine.


You could run a full text search or search against an auto-generated summary. Or if you want to be fancy, use semantic search like in Retrieval Augmented Generation.
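
The semantic-search version is surprisingly little code if you precompute embeddings. A toy sketch (sentence-transformers is just one example embedding source; any model or API would work):

    import numpy as np
    from sentence_transformers import SentenceTransformer  # pip install sentence-transformers

    model = SentenceTransformer("all-MiniLM-L6-v2")

    docs = ["how to tune postgres fts", "sourdough starter tips"]
    doc_vecs = model.encode(docs, normalize_embeddings=True)  # precompute and store these

    def search(query, k=5):
        q = model.encode([query], normalize_embeddings=True)[0]
        scores = doc_vecs @ q            # cosine similarity, since vectors are unit-length
        order = np.argsort(scores)[::-1][:k]
        return [(docs[i], float(scores[i])) for i in order]

    print(search("postgres full text search"))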


Edit: I forgot to ask: how would I add webpages to the databases already suggested here? Do I need a separate program to spider/index each site and check for updates?


If you're looking for a turnkey solution, I'd have to dig a little. I generally write a scraper in Python that dumps into a database or flat file (depending on the number of records I'm hunting).

Scraping is a separate subject, but once you write one scraper you can generally reuse the relevant portions for many others. If you get adept at a scraping framework like Scrapy [1] you can do it fairly quickly, but there aren't many tools that work out of the box for every site you'll encounter.

Once you've written the spider, it can generally be rerun for updates unless the site's code is dramatically altered. It really comes down to how brittle the spider is (e.g. hunting for specific heading sizes or fonts) versus grabbing the underlying JSON/XHR, which doesn't usually change frequently.

1. https://scrapy.org
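
As a taste, here's roughly what a bare-bones Scrapy spider looks like (site and selectors are placeholders; run it with something like "scrapy runspider bookmark_spider.py -o pages.jsonl"):

    import scrapy

    class BookmarkSpider(scrapy.Spider):
        name = "bookmarks"
        allowed_domains = ["example.com"]       # keeps the crawl on one site
        start_urls = ["https://example.com"]

        def parse(self, response):
            # Dump the page text; swap in tighter selectors for a specific site.
            yield {
                "url": response.url,
                "title": response.css("title::text").get(),
                "text": " ".join(response.css("p::text").getall()),
            }
            # Follow internal links so reruns pick up new pages.
            for href in response.css("a::attr(href)").getall():
                yield response.follow(href, callback=self.parse)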


Depending upon the type of content, one might want to look into using Readability (the browser's reader view) to parse the webpage. It will give you all the useful info without the junk. Then you can put it in the DB as needed.

https://github.com/mozilla/readability

Btw, Readability is also available in a few other languages, like Kotlin:

https://github.com/dankito/Readability4J
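
There's a Python port too (readability-lxml). A minimal sketch of the extraction step, assuming requests for the fetch:

    import requests
    from readability import Document  # pip install readability-lxml

    html = requests.get("https://example.com/article").text
    doc = Document(html)
    print(doc.short_title())   # cleaned-up title
    print(doc.summary())       # main-content HTML with the junk stripped out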


Do you prefer it locally or in the cloud? If in the cloud, check out Xata (the domain of the blog post here).


Searchkick gem + Elasticsearch is a good combo


Sorry, how does flossing impact this?


What are the costs? I couldn't find them on the site.


Um, what is it?


Total Recall


Having never seen it, I have no context for which details were left out or bent to fit the narrative, but this still feels amazing. Well-earned sunglasses.


It is definitely a must-see movie, the original 1990 version with Arnold, that is.


I Don't Recall


I dig the term 'fracking' in this context. I'll start using it.


Can you please elaborate on why you regret returning home?


Moving abroad and making some sort of life for yourself does two things: 1. it forces a review of your preconceptions and tests your ability to adapt, and 2. if you stick it out, you may integrate new ways and viewpoints into who you are and what you think.

When you then go back to people who are similar (because that's always going to be the case to some extent, even if locally the community feels and/or thinks itself diverse), you will frequently feel that others have not 'progressed' or are 'stuck' in those culturally determined ways. It's hard to share that growth with people who haven't done the same, which is, I think, why expats stick together, even when they're back.

I moved between Western countries, I should perhaps add.


I’m not the person you replied to, but for many of us “home” is not a first world or Western country. The opportunities there are much fewer, quality of life is lower in many tangible ways - do you like having reliable power, internet, mail/shipping, safe roads? Dangerous crime is a much more serious problem, and corruption is rife. Healthcare is subpar if you’re not wealthy. Education is generally poor, which has all sorts of consequences for society as a whole.

There’s also the fact that after living elsewhere for many years, the home you go back to is not the same as the home you left.


What does this mean?


The hero is the big image banner across the top of the page with a quick slogan.

Social proof is the testimonial section where companies or people are listed with quips like "X was the best solution I've ever used."

The combination is very popular for landing pages, and GP is tired of them, the way people got tired of generic Bootstrap sites: http://www.dagusa.com/


That site is hilarious!


I've been wanting to run my own search engine sorta thingy that indexes websites I feed it. I sometimes find little nooks of the net that post resources I may need in the future. Like my own mini-Google that indexes a list of sites.

How can I go about creating this? Are there off-the-shelf solutions, or will I need to, say, combine Scrapy with Elasticsearch? The links in this thread look promising.


Thanks for the insightful primer!


What's wrong with Substack? I don't know much about it!

