I've been meaning to attempt running a custom search engine for particular sites I've 'bookmarked'. Some sites contain gold that could be useful in the future and is not often discovered in Google results.

Should I go the Postgres/Elasticsearch route, or are there somewhat out-of-the-box solutions available?


For such a light demand and a fixed set of sites, a single-file SQLite DB is probably best. Modern SQLite has full-text capabilities that are quite powerful and relatively easy to implement.

https://www.sqlite.org/fts5.html
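
To give a sense of how little code that takes, here's a rough sketch using Python's stdlib sqlite3 (assuming the bundled SQLite was built with FTS5, which most are; table and column names are just placeholders):

    import sqlite3

    # In-memory DB for the sketch; point this at a file for real use.
    conn = sqlite3.connect(":memory:")

    # FTS5 virtual table; the porter tokenizer stems words so "running" matches "run".
    conn.execute("CREATE VIRTUAL TABLE pages USING fts5(url, title, body, tokenize='porter')")

    conn.executemany(
        "INSERT INTO pages (url, title, body) VALUES (?, ?, ?)",
        [
            ("https://example.com/a", "Intro to FTS", "full text search in sqlite is easy"),
            ("https://example.com/b", "Other page", "nothing relevant here"),
        ],
    )

    # MATCH runs the full-text query; bm25() ranks by relevance (lower is better).
    rows = conn.execute(
        "SELECT url, title FROM pages WHERE pages MATCH ? ORDER BY bm25(pages)",
        ("full text search",),
    ).fetchall()
    print(rows)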


For something small with a minimal footprint, I'd recommend Typesense. https://github.com/typesense/typesense

Elasticsearch is heavy, and relational databases with search bolted on (like Postgres or SQLite) aren't great.
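
If it helps, a rough sketch of indexing and searching with the Typesense Python client against a local server (host, port, and API key are placeholders for whatever you started the server with):

    import typesense  # pip install typesense

    client = typesense.Client({
        "nodes": [{"host": "localhost", "port": "8108", "protocol": "http"}],
        "api_key": "xyz",
        "connection_timeout_seconds": 2,
    })

    client.collections.create({
        "name": "pages",
        "fields": [
            {"name": "title", "type": "string"},
            {"name": "body", "type": "string"},
        ],
    })

    client.collections["pages"].documents.create({"title": "Superman", "body": "man of steel"})

    # Typo tolerance is on by default, so "suprman" still finds "Superman".
    hits = client.collections["pages"].documents.search({"q": "suprman", "query_by": "title,body"})
    print(hits["found"])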


It depends on what the user requirements are. FTS works pretty well with both Postgres and SQLite, in my experience.

Here's a git repo someone can modify to do a cross-comparison on a specific dataset, if they're interested. It doesn't seem to indicate the RDBMSs are outclassed in a small-scale FTS implementation.

https://github.com/VADOSWARE/fts-benchmark
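
For reference, the Postgres side of that is only a few lines. A rough sketch with psycopg2 (needs Postgres 12+ for the generated column; table and connection details are made up):

    import psycopg2  # pip install psycopg2-binary

    conn = psycopg2.connect("dbname=bookmarks")
    cur = conn.cursor()

    # A tsvector column kept in sync automatically, plus a GIN index for speed.
    cur.execute("""
        CREATE TABLE IF NOT EXISTS pages (
            id serial PRIMARY KEY,
            url text,
            body text,
            tsv tsvector GENERATED ALWAYS AS (to_tsvector('english', body)) STORED
        )
    """)
    cur.execute("CREATE INDEX IF NOT EXISTS pages_tsv_idx ON pages USING gin(tsv)")

    cur.execute("INSERT INTO pages (url, body) VALUES (%s, %s)",
                ("https://example.com", "full text search in postgres"))

    # websearch_to_tsquery accepts Google-ish syntax; ts_rank orders by relevance.
    cur.execute("""
        SELECT url, ts_rank(tsv, q) AS rank
        FROM pages, websearch_to_tsquery('english', %s) q
        WHERE tsv @@ q
        ORDER BY rank DESC
    """, ("full text",))
    print(cur.fetchall())
    conn.commit()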


For personal use, nobody cares about 100ms vs 10ms response times. What they do care about is relevance. Consider the following from that repo's outputs:

Typesense

    [timing] phrase [superman]: returned [28] results in 4.222797 ms
    [timing] phrase [suprman]: returned [28] results in 3.663458 ms

SQLite

    [timing] phrase [superman]: returned [47] results in 0.351138 ms
    [timing] phrase [suprman]: returned [0] results in 0.07513 ms
So SQLite is faster, but who cares? I want things like relevance and typo resilience without having to configure anything.


The article covers typo resilience in the section "Typo tolerance / fuzzy search".

This adds a step between query entry and text search: if a query word isn't a known lexeme, you find the most similar lexemes and search for those instead. Seems like a reasonable compromise to me?
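
Roughly, that step looks like this (a sketch with pg_trgm, reusing the pages/tsv table from the earlier example; the lexemes table comes from ts_stat, and 0.3 is just pg_trgm's default similarity threshold):

    import psycopg2

    conn = psycopg2.connect("dbname=bookmarks")
    cur = conn.cursor()

    # One-time setup: collect every distinct lexeme the corpus produced.
    cur.execute("CREATE EXTENSION IF NOT EXISTS pg_trgm")
    cur.execute("CREATE TABLE IF NOT EXISTS lexemes AS "
                "SELECT word FROM ts_stat('SELECT tsv FROM pages')")

    def correct(word):
        # Swap a query word for its closest known lexeme, if one is close enough.
        cur.execute("""
            SELECT word FROM lexemes
            WHERE similarity(word, %s) > 0.3
            ORDER BY word <-> %s
            LIMIT 1
        """, (word, word))
        row = cur.fetchone()
        return row[0] if row else word

    # "suprman movie" becomes something like "superman movie" before hitting FTS.
    query = " ".join(correct(w) for w in "suprman movie".split())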


I'm not trying to be argumentative; as long as people find a solution they're happy with, I think that's great. For me, handling typos matters far less, though I can see how it would be valuable in many applications. I'm usually reluctant to tie in and learn another set of services when I can get 90% of the way there with one, while leaving the option of adding it later if additional requirements make it necessary.


Also, I’ve got a small project in which I try to compare Meilisearch and Postgres FTS with pg_trgm. It’s called PodcastSaver:

Podcastsaver.com (click on the nerds tab in the top right)

Never got to it, but there are a bunch of other search engines worth adding (Sonic, Typesense, etc.). Maybe some day.


I've been wanting to do something similar. ArchiveBox seems to be the best solution for this sort of self-hosted, searchable web archive. It has multiple search back-ends and plugins to sync browser bookmarks (or even history).

I haven't finished getting it set up though, so take this recommendation with a hefty grain of salt.


How would something like this work in practice? Would you generate any tags or summaries per site when inserting it into the db?


ArchiveBox can extract text from HTML (and possibly PDFs too). I think it can be configured to extract subtitles from YouTube videos as well. So it can do full-text search. Basically you could have your own offline, curated search engine.


You could run a full text search or search against an auto-generated summary. Or if you want to be fancy, use semantic search like in Retrieval Augmented Generation.
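
The semantic-search version is surprisingly little code if you precompute embeddings. A toy sketch (sentence-transformers is just one example embedding source; any model or API would work):

    import numpy as np
    from sentence_transformers import SentenceTransformer  # pip install sentence-transformers

    model = SentenceTransformer("all-MiniLM-L6-v2")

    docs = ["how to tune postgres fts", "sourdough starter tips"]
    doc_vecs = model.encode(docs, normalize_embeddings=True)  # precompute and store these

    def search(query, k=5):
        q = model.encode([query], normalize_embeddings=True)[0]
        scores = doc_vecs @ q            # cosine similarity, since vectors are unit-length
        order = np.argsort(scores)[::-1][:k]
        return [(docs[i], float(scores[i])) for i in order]

    print(search("postgres full text search"))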


Edit: I forgot to ask: how would I add webpages to the databases already suggested here? Do I need a separate program to spider/index each site and check for updates?


If you're looking for a turnkey solution, I'd have to dig a little. I generally write a scraper in Python that dumps into a database or flat file (depending on the number of records I'm hunting).

Scraping is a separate subject, but once you write one scraper you can generally reuse the relevant portions for many others. If you get adept at a scraping framework like Scrapy [1] you can do it fairly quickly, but there aren't many tools that work out of the box for every site you'll encounter.

Once you've written the spider, it can generally be rerun for updates unless the site's code is dramatically altered. It really comes down to how brittle the spider is (e.g. hunting for specific heading sizes or fonts) versus grabbing the underlying JSON/XHR, which doesn't usually change frequently.

1. https://scrapy.org
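
As a taste, here's roughly what a bare-bones Scrapy spider looks like (site and selectors are placeholders; run it with something like "scrapy runspider bookmark_spider.py -o pages.jsonl"):

    import scrapy

    class BookmarkSpider(scrapy.Spider):
        name = "bookmarks"
        allowed_domains = ["example.com"]       # keeps the crawl on one site
        start_urls = ["https://example.com"]

        def parse(self, response):
            # Dump the page text; swap in tighter selectors for a specific site.
            yield {
                "url": response.url,
                "title": response.css("title::text").get(),
                "text": " ".join(response.css("p::text").getall()),
            }
            # Follow internal links so reruns pick up new pages.
            for href in response.css("a::attr(href)").getall():
                yield response.follow(href, callback=self.parse)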


Depending upon the type of content, one might want to look into using Readability (the browser's reader view) to parse the webpage. It will give you all the useful info without the junk. Then you can put it in the DB as needed.

https://github.com/mozilla/readability

Btw, Readability is also available in a few other languages, like Kotlin:

https://github.com/dankito/Readability4J
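
There's a Python port too (readability-lxml). A minimal sketch of the extraction step, assuming requests for the fetch:

    import requests
    from readability import Document  # pip install readability-lxml

    html = requests.get("https://example.com/article").text
    doc = Document(html)
    print(doc.short_title())   # cleaned-up title
    print(doc.summary())       # main-content HTML with the junk stripped out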


Do you prefer it locally or in the cloud? If in the cloud, check out Xata (the domain of the blog post here).


Searchkick gem + Elasticsearch is a good combo


Sorry, how does flossing impact this?


What are the costs? I couldn't find them on the site.


Um, what is it?


Total Recall


Having never seen it, I have no context for which details were left out or bent to fit the narrative, but this still feels amazing. Well-earned sunglasses.


It is definitely a must-see movie, the original 1990 version with Arnold, that is.


I Don't Recall


I dig the term 'fracking' in this context. I'll start using it.


Can you please elaborate on why you regret returning home?


Moving abroad and making some sort of life for yourself does two things: 1. it forces a review of your preconceptions and tests your ability to adapt, and 2. if you stick it out, you may integrate new ways and viewpoints into who you are and what you think.

When you then go back to people who are similar (because that's always going to be the case to some extent, even if locally the community feels and/or thinks itself diverse), you will frequently feel that others have not 'progressed' or are 'stuck' in those culturally determined ways. It's hard to share that growth with people who haven't done the same, which is, I think, why expats stick together, even when they're back.

I moved between Western countries, I should perhaps add.


I’m not the person you replied to, but for many of us “home” is not a first world or Western country. The opportunities there are much fewer, quality of life is lower in many tangible ways - do you like having reliable power, internet, mail/shipping, safe roads? Dangerous crime is a much more serious problem, and corruption is rife. Healthcare is subpar if you’re not wealthy. Education is generally poor, which has all sorts of consequences for society as a whole.

There’s also the fact that after living elsewhere for many years, the home you go back to is not the same as the home you left.


What does this mean?


The hero is the big image banner across the top of the page with a quick slogan.

Social proof is the testimonial section where companies or people are listed with quips like "X was the best solution I've ever used."

The combination is very popular for landing pages, and GP is tired of them, the way people got tired of generic Bootstrap sites: http://www.dagusa.com/


That site is hilarious!


I've been wanting to run my own search engine sorta thingy that indexes websites I feed it. I sometimes find little nooks of the net that post resources I may need in the future. Like my own mini-Google that indexes a list of sites.

How can I go about creating this? Are there off-the-shelf solutions, or will I need to, say, combine Scrapy with Elasticsearch? The links in this thread look promising.


Thanks for the insightful primer!


What's wrong with Substack? I don't know much about it!

