Postgres full text search is good enough

birken · on Sept 29, 2014

Well the article is right in one sense: Postgres full text search is probably better than you think it would be and is very usable at non-massive scale

However, I still wouldn't recommend using it. The queries and indexes are all unintuitive, complex, not very readable and require learning a bunch of new syntax anyways (as you can see in the examples in the blog post). So just take the effort you would spend doing that and instead learn something with more features that will continue running no matter how large your scale.

While the blog mentions elasticsearch/solr (which are awesome and powerful), it doesn't mention sphinx (http://sphinxsearch.com/). If I were trying to make a simple search engine for my postgres/mysql powered site, I'd use sphinx. The indexer can be fed directly with a postgres/mysql query, and it includes simple APIs for all languages to interface with the searcher daemon. And of course it is super fast, has way more features and can scale larger than you'd ever need (sphinx powers craiglist's search).

mrweasel · on Sept 29, 2014

One interesting thing we learn from build an online store: People uses search alot less that you would think.

We started to notice that customers will use the menu and filtering options, rather than searching for an item. Need something from the "electronics" category, click the electronics menu tab, and pick the subcategory. From there customers browse or filter until they find what they're looking for.

Of cause so do use the search, and most people would have better results if they did search directly.

My point is: Maybe we could have just used Postgresqls full text search in this case. The queries will be rather simply anyway. The new syntax won't matter much, you need to learn something new anyway if you use Solr or ElasticSearch. Search is fare less important than people think, it just needs to be there and be "good enough".

Yes, Sphinx is awesome, if you're doing plain searches, filtering can get complicated pretty quickly.

orbifold · on Sept 29, 2014

I think that is not so much because of the fact that people do not want to use search, just that by experience they've learned that it is usually next to useless on most sites. At least that is my experience.

aw3c2 · on Sept 29, 2014

As a user I never trust myself to know the "right" search terms nor the search box to show the "right" results. If I browse, I know that I see all the items and I can filter with exact, pre-defined (trusted) attributes.

wuliwong · on Sept 29, 2014

I'm making the same observation on my site. I thought search would be the clear way to get to where you want to go but it is relatively unused. I have a site which lists tutorials for a specific language. You would think people would be looking for something specific, but it seems 90% click whatever is displayed first and other just surf around using the basic filters and pagination.

chdir · on Sept 29, 2014

The menu & filtering option can benefit a lot from Facets (or aggregations in ElasticSearch for more advanced options).

mrweasel · on Sept 29, 2014

That's actually what we use. I should have been more clear, it's the fulltext search that doesn't matter to much.

narag · on Sept 29, 2014

I think orbifol is right: we've learned that site searches are mostly useless.

But also: when navigating to a page, people are usually using the mouse. Switching between mouse and keyboard is somehow costly. While there are options (choosing from a combo box), users will use it and only switch to keyboard if there's no other way.

michielvoo · on Sept 29, 2014

I was under the impression that Sphinx cannot return documents, only IDs. (The implication is that you need to query your data source if you want to show the results.) Is that correct?

birken · on Sept 29, 2014

Yes, which I think in simple cases is a benefit. For example if you are using an ORM, the massive queries in the OP are really a pain to deal with. With Sphinx, you perform the search using the search API and get back some ids, then just query those ids using your standard ORM constructs for getting the (small) set of whatever object/rows you are loading. Since you are loading small and fixed amounts of data by the primary keys, the performance shouldn't be an issue.

If you are looking for caching and other more complex features, I'd recommend elasticsearch (and I do highly recommend it). But sphinx is simpler, and I think it is a good alternative for the type of functionality talked about in the blog post. Granted, I really don't think elasticsearch is all that complicated either, and it is also really well documented. But sphinx is just painfully simple for the basic use cases (like anything postgres can do).

aidos · on Sept 29, 2014

Additionally it helps to alleviate issues with stale data in the search index (which is often updated periodically ). If you have a list of ids to query for, there's no harm if one of them is no longer in the db, you just won't show it.

As ever, it's a tradeoff. I used to keep everything required in for search results pages in the search index (solr, at the time). Eventually I decided that the additional db lookup was well worth the extra few milliseconds to make sure I was working with reliable data.

siegecraft · on Sept 29, 2014

You can actually store a bit more than that in sphinx. You can add attributes to your sphinx documents that can either be used for filtering or just as extra metadata when returning query results. The downside is they get added to the index which has to fit in memory (I think, it's been a while).

chdir · on Sept 29, 2014

What about the cost of running a separate service like ElasticSearch for a small startup. It should be easy to swap out PG text search with a more mature solution later on as you scale.

birken · on Sept 29, 2014

I think if you are concerned with swapping out to a different system as you scale you should use a "search as a service" company and then bring it in-house if the cost becomes too much.

The interface provided by a "search as a service" company is going to be a lot more similar to elasticsearch/solr than it is to postgres, so it would be a much easier switch.

If you want to switch from PG full-text search to something else, you'll essentially have to start from square one because all of the ways you are interacting with your search mechanism (defining, inserting, updating, querying, retrieving documents) are completely different between PG and all the other search programs.

A blog post like this makes it all seem so easy, but I assure you maintaining these long queries, dealing with the extra load on your database, preventing XSS and just overall making administering your database a little more difficult also has a real cost.

chdir · on Sept 29, 2014

Useful insights, thank you. Would love to read a more detailed analysis if you ever feel like writing one.

frik · on Sept 29, 2014

Especially with cloud hosting providers, were more memory gets expensive, I would recommend to go with native C/C++ applications like Postgres/MySQL/SphinxSearch/SQLite-FTS. (add Lucene based Solr and ElasticSearch to the list if memory is cheap)

rachbelaid · on Sept 29, 2014

As the author, it's maybe worthwhile that I explain the goals of this post.

The first goal is about using your current architecture to solve small/medium search needs and help your project to introduce search without requiring to add components to your current architecture. It's not a solution on the long run if your application revolve around search. I'm hoping that with the introduction of JSONB in 9.4, the idea of using Postgres as document store raise more attention on the search which can to be improved. But for an out-of-the-box feature, I think that the Postgres community did an amazing job. You can imagine that you get something working with Postgres in matter of hours if you are already using it (and you are comfortable with PG), but if you have never used SOLR/ElasticSearch it will take you much longer to introduce it in your project and get started (Ops, document sync, getting familiar with query, ...).

The second goal, is about introducing full-text search concepts. The post try to guide the user to build a search from nothing to a quite decent full text search in English / French (I cannot give feedbacks on other languages)

The third which is probably the less clear is that people are still using MySQL search sometimes which is IMHO an horrible search solution. I think this happen because some web framework like django provide easy access to match in the ORM. In this context, the post is aimed also to provide MySQL user some insights about what can be done with Postgres and being more aware about its features.

If you are interested by the topic then I suggest you to have a look to the amazings posts from Tim van der Linden who did an amazing job of going into more details about the subject.

http://shisaa.jp/postset/postgresql-full-text-search-part-1.... http://shisaa.jp/postset/postgresql-full-text-search-part-2.... http://shisaa.jp/postset/postgresql-full-text-search-part-3....

Postgres full-text search is not the silver bullet of search but matter of your needs, it's maybe good enough ;)

chdir · on Sept 29, 2014

FWIW, I submitted this link here because I found your blog post immensely useful. Its an excellent intro for someone wanting to integrate text search on their site in a few hours. I think the title has led a few posters here to believe that you are recommending it as a panacea for search.

Although you have listed a few points where full text search is not enough, it might be a good followup to explain some cases where ElasticSearch / Solr / Sphinx etc. are needed.

NDizzle · on Sept 29, 2014

Hey, thanks for the article. I'm still going through it now. Would it work for my scenario? I'm full text indexing large legal documents - 50 to even 300 pages. What I've been doing is just throwing the entire document (OCR) into a big mysql (Mariadb 10) LONGTEXT columns. Then I'm using a JDBC river to shoot those over to elasticsearch.

I don't fully understand elasticsearch, because like you said it's insane. The query that I build to perform a simple search (think a google-like search box with a go button) is about 100 lines.

Should I try out postgres full text search for my purposes? I guess I could change to indexing page by page, and then I could group on the document hash to stamp out duplicates. That would also get me a list of page numbers where the results were. I'm not sure if I could get the snippets though. (What I call a snippet is the word surrounding the match - I'm returning that to datatables so that the user can search within their search to further refine the search.)

Here's what my search (and results) looks like now: http://imgur.com/Fjckva6 As you can tell, I'm not too familiar with elasticsearch (A shameful offset input!) and I'm more comfortable in a RDBMS.

rachbelaid · on Sept 29, 2014

Which languages? English only? Anykind of fuzzy search required?

I don't have experience with big data set but you may to read about GIST/GIN to pick the right index in your case (probably GIST) But from what I saw when I prepared this post, some people are getting some decent performance with full-text search on dataset like the size wikipedia.

https://wiki.postgresql.org/images/2/25/Full-text_search_in_...

I think that doesn't take long to try Postgres FTS if you are already using it but it may require more investment if you need to move from MySQL to PG.

NDizzle · on Sept 29, 2014

Thanks for the reply. English only. The only fuzzyness I've told my users about is "phrase here"~20 which would search for the given phrase within 20 words of each other. That helps for the imperfect OCR'd documents. A lot of the stuff we're indexing is best-guess OCR so it's nowhere near perfect. They are boring agreement documents that people have scribbled all over (setting up the redactions) so fuzzy is somewhat interesting to have. But it's not fuzzy in the sense of a typo, it's fuzzy in the sense that it's OCR'd garbage.

frik · on Sept 29, 2014

> people are still using MySQL search sometimes which is IMHO an horrible search solution

Some probably use "text LIKE '%foo%'" SQL queries on any database.

MySQL supports several database engines, especially ISAM and InnoDB come with full text search support. You can also install Sphinx Search as MySQL DB engine. So your statement is not true in general. If you meant the old ISAM engine the newer InnoDB is default and usually the preferred solution.

sbilstein · on Sept 29, 2014

I wrote an extremely basic search using postgres full text search (documented here http://tech.pro/tutorial/1142/building-faceted-search-with-p...) that was very easy to get up and running. It got us pretty much 95% there for a usable product search for the ecommerce site I was on at the time. While it might be limited for full blown search; it does pretty damn well for handling english language search of product descriptions, labels, and any other categorization.

I would definitely recommend it if you need a fast solution that is easy to implement (and doesn't really incur much tech debt) so you can get back to working on things that might matter more to your particular project.

MichaelGG · on Sept 29, 2014

I understand the point if you're already proficient with PG, and want to keep it in the same system. But I will say Elasticsearch is surprisingly easy to get started with. Install, POST some docs, GET a search query. That's pretty much it. I implemented it for a project and it literally was a few hours to get rolling with it (and improve search speeds by orders of magnitude).

But sure, it is another system to have to deal with.

sanderjd · on Sept 29, 2014

While it's definitely easy to get started with ES, I've found the tedious part to be ongoing coordination between two systems. Unless there's no interaction at all between the documents stored in your search index and the data stored in the database, it is necessary to coordinate them at the application layer, which is more complex and can be less performant than doing everything in the database.

wuliwong · on Sept 29, 2014

Thanks a lot. I'm a guy that likes to spin up MVP ideas and am always looking for quick+dirty solutions which I can swap out later if needs be. All of my searching is performed by basic queries, but I will definitely start to use these methods. I here people griping about what is missing but I'm very excited by what is already included. Thanks again!

pilif · on Sept 29, 2014

While it's certainly good enough for english bodies, its half-assed support for compounds is a problem for bodies in languages that use compounds (German for example).

There is sorta-support for compounds, but only really for ispell dictionaries and the ispell format isn't very good at dealing with compounds (you have to declare all permutations and manually flag them as compoundable) plus the world overall has moved over to hunspell, so even just getting an ispell dictionary is tricky.

As a reminder: This is about finding the "wurst" in "weisswürste" for example.

Furthermore, another problem is that the decision whether the output of a dictionary module should be chained into the next module specified or not is up to the C code of the module itself, not part of the FTS config.

This bit me for example when I wanted to have a thesaurus match for common misspellings and colloquial terms which I then further wanted to feed into the dictionary, again for compound matching.

Unfortunately, the thesaurus module is configured as a non-chainable one, so once there's a thesaurus match, that's what's ending up in the index. No chance to ever also looking it up in the dictionary.

Changing this requires changing the C code and subsequently deploying your custom module, all of which is certainly doable but also additional work you might not want to have to do.

And finally: If you use a dictionary, keep in mind that it's by default not shared between connections. That means that you have to pay the price of loading that dictionary whenever you use the text search engine the first time over a connection.

For smaller dictionaries, this isn't significant, but due to the compound handling in ispell, you'll have huge-ass(tm) dictionaries. Case in point is ours which is about 25 Megs in size and costs 0.5 seconds of load time on an 8 drive 15K RAID10 array.

In practice (i.e. whenever you want to respond to a search in less than 0.5 secs, which is, I would argue, always), this forces you to either use some kind of connection pooling (which can have other side-effects for your application), or you use the shared_ispell extension (http://pgxn.org/dist/shared_ispell), though I've seen crashes in production when using that which led me to go back to pgbouncer.

Aside of these limitations (neither of which will apply to you if you have an english body, because searching these works without a dictionary to begin with), yes, it works great.

(edited: Added a note about chaining dictionary modules)

rachbelaid · on Sept 29, 2014

All your points are very interresting and I will probably reuse them if I attend to write a more extended post.

Maybe the goal of this post was not clear `enough`. I'm presenting it more like a solution for small/medium needs for search without the needs to add extra dependencies in your architecture.

In my opinion this can help younger project/startup to raise their project from the ground by minimizing the complexity of their system (less maintenance)

But it's not a solution viable on the long run until the Postgres address loads of the problem on the search. I'm hoping that now that we have JSONB coming 9.4 that the search get some serious attention so PG can become a serious candidate as document store.

pilif · on Sept 29, 2014

>Maybe the goal of this post was not clear `enough`. I'm presenting it more like a solution for small/medium needs for search without the needs to add extra dependencies in your architecture.

it was very clear and I agree 100% with you. I just wanted to point out that like so often in life, it's not the be-all-end-all solution, but there are some caveats.

For what it's worth, we're using tsearch with quite a big (~600 GB total) body of bilingual german/french documents. Yes. We've run into the issues I outlined, but a) we could work around them (custom dictionary, thesaurus in-application before indexing, pgbouncer) and b) it's very nice to not having to maintain yet another piece of software (ElasticSearch or similar) on a completely different architecture (Java vs. PHP and Node.js).

donw · on Sept 29, 2014

I've got a project that needed multilingual support over very, very small documents, and ended up just rolling that myself -- neither ElasticSearch nor Postgres cope very well when you start mixing multiple languages in the same document.

polyfractal · on Sept 29, 2014

As other's have mentioned, it's entirely possible to have multiple fields with different analysis chains. Elastisearch handles multi-languages very well. You just have to start changing the defaults, since the defaults assume a single language for all fields (namely English).

E.g. "title_german" and "title_english" can each have their own analyzer specific to the language. Or you could have a single field "title" which then uses multifields to index a "title.english" and "title.german" field.

The key is that at search time you need to use a query that understands analysis. So you should use something like the `match` query, or `multi_match` for multiple fields. These queries will analyze the input text according to the analyzer configured on the underlying field (e.g. english or german)

There is a ton more information in The Definitive Guide book, in the chapter on languages: http://www.elasticsearch.org/guide/en/elasticsearch/guide/cu...

Topics include: language analysis chains, pitfalls of multiple languages per doc (at index vs search time), one language per field schemas, one language per doc schemas, multi-language per field schemas, etc

jrochkind1 · on Sept 29, 2014

> neither ElasticSearch nor Postgres cope very well when you start mixing multiple languages in the same document.

I've done okay with that with Solr -- but it's a hard problem. If you really want to meet the needs of fluent speakers of all languages. You can do kind of sort okay in Solr trying a multi-lingual unicode-aware analysis, but to do really good you probably need separate indexed fields (with different analysis) for each language, and may need code to try to guess which passages are in which language.

It's a hard problem, but I suspect Solr gives you more flexibility than ElasticSearch to try and work out a solution (in some cases having to write some Java). Although it doesn't come with an out-of-the-box best-practice solution (I don't think there are enough solr devs who have this use case day to day; solr dev is definitely driven by it's core devs business cases).

radicalbyte · on Sept 29, 2014

I had the same experience and the same conclusion.

Were I to do it again, I would simply split the document into multiple parts and include a "composite-document-id" in the metadata to combine them all.

pilif · on Sept 29, 2014

If it's indeed the same document, you're screwed. Yes. But in Postgres, I could at least have different column (or join tables with different language codes) for different languages and then query them with either dictionary.

ElasticSearch doesn't support that: All fields in a document are assumed to be of the same language which leads to a huge duplication of documents in cases where most fields of a document are language-independent.

Argorak · on Sept 29, 2014

> All fields in a document are assumed to be of the same language

That is not true. You can configure analyzers on a field basis, both for search and index.

pilif · on Sept 29, 2014

Oh. Cool. Thanks. When I was looking into this, it wasn't yet possible. That's very interesting for me.

ElasticSearch has MUCH better support for language analysis to the point where it might be worth it to move from Postgres tsearch to ElasticSearch, but this limitation was a huge hurdle for me not being able to move.

Argorak · on Sept 29, 2014

I am surprised, that's possible since any version I've been working with (0.12x and up).

What you cannot do is control that via client (e.g. "this index request uses the chinese analyzer for that field only"), because the analyzer setting in the index request overwrites all other configs. Mappings supported that for a long time.

One of the arts of search optimization on Elasticsearch is finding fields that you want to index multiple times with different analyzers and combine in a clever fashion.

donw · on Sept 29, 2014

That's basically what I ended up doing -- writing an analyzer that made a reasonable guess as to what language each word was, and filing it into the appropriate bucket, with a generic bucket for "everything else".

Ranking is another story, though.

izacus · on Sept 29, 2014

I don't know about ElasticSearch, but the more popular alternative Solr, can do that easily by using multiple cores (each configured and tweaked for each language) and doing sharding queries, which merge results for you.

jrochkind1 · on Sept 29, 2014

you don't need to use multiple cores for that, although you could -- but in general, anything you can do in multiple cores you can do identically in one core, the only reason to use multiple cores is for performance when you have so much corpus that one machine can't handle it.

But yeah, Solr gives you a lot of flexibility, perhaps more than ElasticSearch (which is both it's pro and it's con, for 'on the path' use cases I get the impression ElasticSearch requires less thinking in configuration than solr. although i have not any direct experience with ElasticSearch).

sologoub · on Sept 29, 2014

This maybe an incredibly naive question, but couldn't you work around the compounding issue with wild carding?

pilif · on Sept 29, 2014

Nope, because that wouldn't work with stemming and pluralizing.

Think finding "weisswürste" when people search for wurst. Unless you correctly build a tvector of (weiss, wurst) from "weisswürste", even with wildcarding hacks, you'd have to search for "würste" to find it.

Also: Not all words are compoundable. Doing a blind wildcard search would lead to false-positives which is somewhere between annoying and very annoying for the user.

aidos · on Sept 29, 2014

Midly OT: I don't really know German at all but I've wondered in the past how easy it is to split the compound words in the language. Is it something that can be easily automated? How many words can be in a single compound?

ygra · on Sept 29, 2014

As evidenced by the (somewhat humorous) Donaudampfschiffahrtsgesellschaftskapitänsmütze there isn't really a limit. Usually, if compounds get longer hyphens are used in select places to aid readability (they have sort of a semantic meaning as well, depending on the compound, so simply saying that you could write the above one as Donaudampfschiffahrts-Gesellschaftskapitänsmütze would be misleading (and in this case really mutilates the compound)).

jeltz · on Sept 29, 2014

In Swedish there is no theoretical limit, but more than 4 is uncommon. To split words you will need a dictionary and stemming/compounding rules for Swedish and process the word from left to right. Note also that in rare cases there can be multiple legal ways to split a given compound word.

As far as I can tell it is the same for German other than that they have a slightly different set of compounding rules, e.g. hyphen can be used in more cases in German.

llllllllllll · on Sept 29, 2014

It can be done quickly and accurately. Check out Peter Norvig's implementation of Viterbi algorithm for text segmentation.

Segmentation is typically used for scripts of languages like Thai and Khmer that don't feature word boundaries. I don't know the ins and outs of German word compounding, but it should work for breaking apart compounds too given a fully conjugated/declined word list as input.

brianhempel · on Sept 29, 2014

For full text search, the only global frequency information Postgres uses is the stop-word list. It does not do TF-IDF ranking. [1]

For example, if you search for "Bob Peterson", Postgres will rank these two documents the same:

"I saw Bob."

"I saw Peterson."

In contrast, an IDF-aware search would notice that "Peterson" occurs in fewer documents than "Bob" and score "I saw Peterson" higher for that reason.

[1] http://en.wikipedia.org/wiki/Tf%E2%80%93idf

[2] http://stackoverflow.com/questions/18296444/does-postgresql-...

rorydh · on Sept 29, 2014

TF–IDF ranking doesn't seem to be too complex a thing to implement. Maybe this is an opportunity for someone here to contribute to the open source project.

jeltz · on Sept 29, 2014

The concurrency aspects of this seem a bit tricky. How do we ensure that a bloated index does not screw our results too much?

petergeoghegan · on Sept 29, 2014

I'm surprised that there is no mention of the recent big improvements in the GIN infrastructure that Postgres full text search uses. The 9.4 release notes list:

* Reduce GIN index size (Alexander Korotkov, Heikki Linnakangas)

* Improve speed of multi-key GIN lookups (Alexander Korotkov, Heikki Linnakangas)

Argorak · on Sept 29, 2014

The most important sentence: "Probably not if your core business needs revolve around search."

Postgres full text search is very good, but once you get into the realms were Elasticsearch and SOLR really shine (complex scoring based on combinations of fields, temporal conditions or in multiple passes, all that with additional faceting etc.), trying to rebuild all that on top of Postgres will be a pain.

While that doesn't break the article, it runs into a nasty problem: `unaccent` doesn't handle denormalized accents.

    # SELECT unaccent(U&'\0065\0301');
     unaccent 
    ----------
     é
    (1 row)

(That problem is also present in Elasticsearch if you forget to configure the analyzer to normalize properly before unaccenting)

rachbelaid · on Sept 29, 2014

Thanks for you comment, I was not aware of the unaccent limitation. This blog post only present a solution for small/medium search needs without adding extra dependency ... the postgres full text search is far of being a silver bullet

Argorak · on Sept 29, 2014

Yep, that's how I understood it and I like the rest a lot, too. I just wanted to make make the sentence stand out, as it is so far below. BTW, the problem seems to be that postgres doesn't actually handle decomposed UTF-8 and is non-compliant in that regard:

http://www.postgresql.org/message-id/53E1AB15.8050702@2ndqua...

So, probably making sure everything is in composed form before writing to the DB seems to be the best way to go.

rboling91 · on Sept 29, 2014

Shouldn't this query:

SELECT to_tsvector(post.title) || to_tsvector(post.content) || to_tsvector(author.name) || to_tsvector(coalesce((string_agg(tag.name, ' ')), '')) as document FROM post JOIN author ON author.id = post.author_id JOIN posts_tags ON posts_tags.post_id = posts_tags.tag_id JOIN tag ON tag.id = posts_tags.tag_id GROUP BY post.id, author.id;

be rewritten as:

SELECT to_tsvector(post.title) || to_tsvector(post.content) || to_tsvector(author.name) || to_tsvector(coalesce((string_agg(tag.name, ' ')), '')) as document FROM post JOIN author ON author.id = post.author_id JOIN posts_tags ON posts_tags.post_id = post.id JOIN tag ON tag.id = posts_tags.tag_id GROUP BY post.id, author.id;

yror10 · on Sept 29, 2014

If your working with a Ruby/Rails app, check out the pg_search gem https://github.com/Casecommons/pg_search

It allows you to do the things in this article with ActiveRecord models.

wuliwong · on Sept 29, 2014

Nice! In fact, this came up first when I searched "postgress full text search rails 4". I'll check it out. I was actually going to make a gem for my own needs. Hopefully you've done all the legwork for me! I'll also follow the repo and try to help out once I understand it. Thanks.

mihai_ionic · on Sept 29, 2014

One use case where Postgres FTS fails spectacularly is when your search has to return results ordered by some field (e.g., timestamp). This is critical for applications that have to return close-to-realtime results due to the queried content being time-sensitive.

Neither GIN nor GiST indices support such document ordering. For a top-N query of the most recent results, the entire result set has to be processed. With common domain-specific words, this can be more than 50% of the entire document set. As you can imagine, this is insanely expensive. When you are in this situation, it helps to set gin_fuzzy_search_limit to a conservative number (20000 works for me, less if you expect heavy traffic) in postgresql.conf, so that pathological queries eventually finish without processing every result. Result quality will take a hit, because many documents are skipped over.

If you need any type of ordered search on more than a hundred thousand documents, do yourself a favor and use something else than Postgres.

I'm not sure what search back-end Wikipedia is using, but it seems like they are not quite immune to this problem either: https://en.wikipedia.org/w/index.php?search=a+the

rhema · on Sept 29, 2014

I really wish I had some benchmarks here. Postgres can be crazy fast, but can it compete with Elasticsearch? How long would a typical query across 1,000,000 documents?

buro9 · on Sept 29, 2014

I haven't made benchmarks specifically against other products, but we use PostgreSQL across a database of 7.7m blog-post length documents and have found the search to be extremely fast.

The only times in which search slows down is when you search for common words that are not stop words and are also performing ordering (such as ranking) across that body of results before returning your window. That is to say, the search is still fast but sorting a large resultset can be slow.

The only gotchas I would call out to anyone thinking of using PostgreSQL full text search are:

1) Beware of row-level queries like ts_highlight() for extracting a fragment of the matched text, be sure that you are only doing this for the n lines that you are returning for a LIMIT and not the entire resultset (most of which you will discard).

2) Beware of accuracy and think about the text you are indexing. If you are receiving raw markdown and transforming to HTML, then you shouldn't index either (raw markdown may contain XSS attacks that can survive ts_highlight(), and the transformed HTML will contain markup that will be indexed). You should figure out a way to build a block of raw text that is safe to index and to show fragments for (ts_highlight()) which likely means running a striptags() over the HTML and also putting hyperlinks into the text (if you want those to be searchable) and indexing that.

We had no issues with performance, control over search ordering, or anything else. The only gotchas we encountered related to sorting and displaying results, and we resolved both easily enough, though too few of the high-level docs like the linked article pointed out the potential issues there.

donw · on Sept 29, 2014

That's going to depend on the size of the documents, number of terms, etc. The other big question is ranking -- from what I can tell, PostgreSQL's ranking functions (ts_rank and ts_rank_cd) can be performance problems, but I haven't seen any good head-to-head benchmarks on a given corpus.

hendry · on Sept 29, 2014

Postgres, elasticsearch and SOLR all seem pretty heavyweight to me.

What's the lightweight approach to create an inverted index nowadays with some basic stemming?

frik · on Sept 29, 2014

* Sphinx search (free, open source): http://sphinxsearch.com , http://en.wikipedia.org/wiki/Sphinx_(search_engine)

It's the lightweight & fast alternative, similar to what Nginx is to Apache and IIS.

* SQLite FTS 4 (free, open source): http://www.sqlite.org/fts3.html

Lightweight & embedded solution (suitable for small to medium sized database and single process).

mrweasel · on Sept 29, 2014

It might fall into "heavyweight" for you, but Sphinx Search is pretty easy to setup and get started with.

For Go there's also Bleve (http://www.blevesearch.com/), but I haven't tried it yet.

atombender · on Sept 29, 2014

Xapian is excellent. It's an embedded C++ library, and there are bindings for most languages.

It's well-designed but fairly basic. Since it's a library, you'll have to write the server part yourself. (It has a server called Omega, but I've never used it.)

Indexing is slower than Lucene, though, but it's not too bad. The only significant problem I found with it the last time I was using it (which was 5+ years ago) was that writers occasionally would cause queries to abort, and you had to restart them, a cycle which could lead to long query times if you had lots of write churn.

rjbwork · on Sept 29, 2014

You can always just roll your own with lucene or lucene.NET, which is what ES/SOLR use underneath.

collyw · on Sept 29, 2014

I never use Lucene, but I got the impression fro others that it was difficult to use, and that Elasticsearch gives it a relatively usable layer on top.

rpedela · on Sept 29, 2014

It isn't that hard to use in my opinion, but it is only a library. So you need to implement many things to turn it into a server which Solr and Elasticsearch have already done.

framos · on Sept 29, 2014

For Python it doesn't get easier than Whoosh (https://pypi.python.org/pypi/Whoosh). Both indexing and searching are done in Python, no need for external services or processes.

Features are limited and performance won't be the best, but sometimes ease of use and maintenance are all you need.

leo_santagada · on Sept 30, 2014

https://pythonhosted.org/Whoosh/intro.html

latch · on Sept 29, 2014

redis sets or sorted sets with a npl library (natural for node, treat for ruby...).

I'm sure there are packages that do the storage and retrieval for you, but it's the sort of lightweight thing you can do quickly and tweak from there.

mborch · on Sept 29, 2014

http://xapian.org/

elchief · on Sept 29, 2014

Text search is ok in postgres. Better than mysql. Not as good as solr, but you don't have to sync with solr if you use it.

My main complaint is that 'united states' @@ 'united states of america' is false, ie no token overlap similarity.

You can overcome this w the smlar extension. Have requested smlar integration in 9.5

aswanson · on Sept 29, 2014

Has the team been responsive with regard to integration requests?

elchief · on Oct 3, 2014

Craig K seemed to think it was a good idea. We'll see...

ipmb · on Sept 29, 2014

We're using it on our IRC logging service, https://botbot.me/ and agree with the author's sentiments. It's not perfect, but it keeps our infrastructure simple and is very fast over 20M+ rows.

resca79 · on Sept 29, 2014

I love this article especially when I'm thinking about the premature optimization. Having a good full text-search with adding zero components to the platform is great. When postgres search will be not enought, happy to add elasticsearch, sorl, and so on.

bshimmin · on Sept 29, 2014

It's awesome what you can do with Postgres, undeniably, but it seems to me that if you're building a Rails app, for instance, it's an absolute no-brainer to use the brilliant Sunspot gem with Solr. Yes, you have the moderate unpleasantness of setting up Solr and fiddling about with Java and XML files, but it's all pretty painless after that.

I don't know if similarly nice things exist for other web frameworks.

As others have said, what's described in the OP may well be "good enough" for many purposes but it seems like quite a lot of effort to get there.

mctx · on Sept 29, 2014

Django has haystack[0], which I've used with an elasticsearch backend on a few projects and is generally pretty easy to set up and get good results with. I haven't had to do anything more advanced than autocomplete/boost/spelling suggestions, but from what I've read it seems pretty powerful.

[0]: http://haystacksearch.org/

rorydh · on Sept 29, 2014

Check this out for your app: https://github.com/Casecommons/pg_search

It's basically all that stuff minus the complexity of configuring it up for a rails app.

hiharryhere · on Sept 29, 2014

Full text search is good when you have big passages of text to search, but it isn't good for typos on short fields. The tolerances just aren't configurable enough.

I was using it to search across user profiles, and while the fuzzy search was great for a slab of text in, say, an about me section, it wasn't good at all for short bits of text like name. Its just too fuzzy for short blobs.

rom16384 · on Sept 29, 2014

At least for our use case Postgres full text search wasn't good enough. Indexing and searching 1M records was troublesome, and its advanced queries weren't as flexible as we'd like, so after a while we started using Solr, and while it was better it still had its share of problems, so we are currently using ElasticSearch.

mctx · on Sept 29, 2014

I've gone down the same route, from PostgreSQL search to ElasticSearch with usually pretty good results. A friend has strongly recommended Algolia, so I'll explore that for the next project that needs it.

sanderjd · on Sept 29, 2014

Hmm, was it "troublesome" in terms of performance building the index or doing the search or was there some other issue?

rom16384 · on Sept 29, 2014

Part of the problem was due to the Django package we used to interface with the Postgres search, it wasn't designed to handle large tables. I don't recall how long the index took to build (hours? a day?), but the queries themselves were slow due to their complexity and ranking using ts_rank.

sanderjd · on Sept 29, 2014

Interesting, thanks for the info. What did you switch to? Were the queries significantly faster?

rorydh · on Sept 29, 2014

Here's a nice PDF on some of the great performance you can get out of it https://wiki.postgresql.org/images/2/25/Full-text_search_in_...

chdir · on Sept 29, 2014

If you've used both FTS in postgres and either ElasticSearch or Solr recently:

Objectively, what are some data points when you need to switch over? E.g. more than 'x' values to index, index update time, or perhaps features like prefix/substring/phrase search, lack of facets, etc.

tim333 · on Sept 29, 2014

For knocking up a site quickly with a search button I wonder how this compares with using Google custom search? I guess with Postgres you can cover stuff that Google has not indexed?

jcroll · on Sept 29, 2014

For the record: MySQL has fulltext search already for MYISAM tables and will have fulltext for InnoDB in 5.6.

feld · on Sept 30, 2014

did you read the article? He explained why MySQL's fulltext is not acceptable.

chatman · on Sept 29, 2014

Faceting, spell correction, sloppy search, etc. are tricky to implement in Postgres.

fiatjaf · on Sept 29, 2014

Swish-e is good enough.

seivan · on Sept 29, 2014

I wrote a half-assed full text search in Redis once. Want to go back and work on it some more just for kicks.

iamleppert · on Sept 29, 2014

If you're going to do something, do it right. Or leave search out of your product if you aren't willing to invest the effort and energy required to implement a proper solution.

Those SQL queries are hideous. Most people have learned that, because of many developers like this, to avoid search because it's next to useless. Please do us a favor and leave it out of your product if you aren't willing to invest in a proper solution. Not one that is "good enough".

sanderjd · on Sept 29, 2014

Honestly, what is wrong with the queries? They use pretty straightforward SQL functions, datatype conversions, operators, and relational querying capabilities. Is it just aesthetics or are there performance or other actual problems with them?