
Interesting quote: "we decided to compute all statistics on demand. This was something we previously tried in RethinkDB, but the results were not good... When we tried implementing statistics as SQL queries in Postgres, we were amazed by the performance. We could implement complex statistics involving data from many tables."

I think standard line "use right tool for the job" is still the ultimate answer. Data in most applications is relational, and you need to query it in different ways that weren't anticipated at the beginning, hence the longevity of SQL.

That said, I too often see HN commenters say something like "this data was only 100 GB? Why didn't they just put it in Postgres?", which is not as clever as the writer may think. Try doing text search on a few million SQL rows, or generating product recommendations, or finding trending topics... Elasticsearch and other 'big data' tools will do it much quicker than SQL because it's a different category of problem. It's not about the data size, it's about the type of processing required. (Edited my last line here a bit based on replies below.)




We do exactly what you describe on an ~80TB dataset stored across a number of PG instances. The 100GB comments are extremely clever, because running a query against a dataset that fully fits in RAM will be blazingly fast. As for "try doing text search on a few million SQL rows": we are doing it on billions of rows.


Exactly. People overestimate the amount of effort and cleverness it takes to horizontally shard/scale your data yourself rather than trust a specialized database to do it as a black box. And they overestimate the cleverness of the black box.

What you retain by staying with Postgres rather than going to a more exotic database is priceless. There is a threshold of data size or product need that makes a more specialized database the right choice. It's just well above 100 GB and your application should have some very specific needs to justify it.


I think some of this can be attributed to expectations; many systems such as Elastic are expected to run in RAM, whereas people tend to expect relational systems to perform reasonably well on disk. I think we all want a magic black box that makes such problems disappear, and we always seem to be disappointed with the system in question when at some point it doesn't.


You are 100% right: the perpetual search for a "magic black box" is the cause of much pain. On the bright side, it's also the source of a significant amount of high-paying work to clean up the mess.


Do you have FULLTEXT indexes on those rows? I find it hard to believe that searching for something like, e.g., a fragment of text in YouTube comments can be as fast in a SQL system (even in RAM) as in Elasticsearch.

As for the other stuff I mentioned (recommendations, etc.), I'm not just basing it on my personal experience--here's a write-up from Pinterest about having to dump all their MySQL data to Hadoop to drive various types of analysis. I doubt they would do it if just putting the SQL DBs in RAM was adequate! https://medium.com/@Pinterest_Engineering/tracker-ingesting-...


PG is something of a different beast, too. With MySQL, the standard mode of operation for most non-basic functionality is to dump the data into another tool.

Full text with Postgres is pretty fantastic and configurable. Not putting the data somewhere else keeps you from having to maintain a second system, keep things in sync, etc.

People jump to putting data in a search tool because it's a search tool waaaaay too quickly IMHO. If the use case justifies it, go for it...but don't add unnecessary complexity unless you have to.


> Not putting the data somewhere else keeps you from having to maintain a second system, keep things in sync, etc.

To be fair there is some amount of "keeping things in sync" one has to do with postgres, even if it's just setting up the right triggers to update the FTS index on updates.
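For illustration, a minimal sketch of that trigger setup in Postgres (table and column names are made up; tsvector_update_trigger is a built-in helper):

  -- Hypothetical tweets table: dedicated tsvector column plus a GIN index over it
  ALTER TABLE tweets ADD COLUMN tsv tsvector;
  CREATE INDEX tweets_tsv_idx ON tweets USING gin (tsv);

  -- Built-in helper keeps tsv in sync with the text column on every write
  CREATE TRIGGER tweets_tsv_update
    BEFORE INSERT OR UPDATE ON tweets
    FOR EACH ROW
    EXECUTE PROCEDURE tsvector_update_trigger(tsv, 'pg_catalog.english', body);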


True, but it's transaction-safe, so it should be impossible to get out of sync, which is what I was really getting at.


I can assure you that Pinterest's dataset is vastly bigger than 100GB :) At a certain scale an RDBMS will obviously experience issues and might no longer be the optimal solution. For PG (it's hard to generalize), things become painful beyond 10-20TB. Now, that "limit" is constantly shifting, so if you are starting with a 100GB dataset that is growing at 200GB a year, you can basically stay on a single-instance RDBMS forever.


Sorry, I'm just not buying that it's about the file size of data.

Right now I'm working on an app that works with tweets. I want to find all tweets that link to iTunes.

When I was using:

> select * from `tweets` where `url` not like '%twitter.com%' and `url` like '%itunes.apple.com%'

I could scale my server up to 16 CPUs and it would still take several minutes to search a few million tweets.

Yesterday I added another field to the database, `is_audio_url`, where I pre-compute whether the URL is an iTunes link (by string matching in the app code) when I insert the record into the database. So I can do:

> select * from `tweets` where `is_audio_url` = 1

And now it's blazing fast. It's just the most recent of many experiences where MySQL really struggles with text matching.


Full text search in Postgres is fast if you have it configured correctly.
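For example, an expression index lets the @@ query use the index directly, with no extra column to maintain (a sketch only; table and column names are assumed):

  -- GIN index over the computed tsvector
  CREATE INDEX tweets_fts_idx
    ON tweets USING gin (to_tsvector('english', body));

  -- The WHERE clause must repeat the same expression for the index to be used
  SELECT id, body
  FROM tweets
  WHERE to_tsvector('english', body) @@ to_tsquery('english', 'itunes & apple');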

> It is just my most recent of many experiences that MySQL

1) You're using MySQL, not Postgres; given that this is a discussion about whether Postgres can compete with Elasticsearch, that's not super relevant. :)

> select * from `tweets` where `url` not like '%twitter.com%' and `url` like '%itunes.apple.com%'

2) That's not how you query a full text index; that's going to be glacially slow.

You need a FULLTEXT index and to use a MATCH...AGAINST query. Check out the docs[1].

[1]: https://dev.mysql.com/doc/refman/5.7/en/fulltext-search.html
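Roughly like this (an untested sketch; note that MySQL's default tokenizer splits on punctuation, so matching pieces of URLs like itunes.apple.com takes some care):

  -- One-time: add a FULLTEXT index on the column being searched
  ALTER TABLE tweets ADD FULLTEXT INDEX tweets_url_ft (url);

  -- Boolean mode: require both terms, exclude tweets linking to twitter
  SELECT *
  FROM tweets
  WHERE MATCH(url) AGAINST('+itunes +apple -twitter' IN BOOLEAN MODE);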


Fair enough; I wasn't aware of MATCH...AGAINST and will have to check it out. I'm not going to keep going back and forth on this in HN comments, but it really has been my experience that Elasticsearch can do some very cool things that I've struggled to do with MySQL. I've already named the top three examples I have in mind (full text search, recommendations/related items, and trending topics, i.e. finding terms used in text fields in the last week that weren't as frequently used in the last month).


But didn't you just say that you hadn't used full text search? You haven't struggled with FT; you just haven't used it.


I am not selling anything to you :) You can scale your server to even 100 vCPUs or whatever black-box name the provider is using and still get the same speed, mainly because MySQL does not have parallel query :) BTW, no one ever mentioned MySQL, but you keep bringing it up.


Queries with %text% syntax do not use indexes by default.

You can use the pg_trgm module to index these kinds of queries.

https://www.postgresql.org/docs/9.1/static/pgtrgm.html
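Something like this (a sketch; index name made up), reusing the tweets example from above:

  CREATE EXTENSION pg_trgm;

  -- Trigram GIN index allows LIKE '%...%' to use an index scan
  CREATE INDEX tweets_url_trgm_idx
    ON tweets USING gin (url gin_trgm_ops);

  -- This can now use the index instead of a sequential scan
  SELECT * FROM tweets WHERE url LIKE '%itunes.apple.com%';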


Prob better to point to the current (9.6) version of that doc, rather than the no-longer-supported ancient v9.1 version. There are some differences. :)

https://www.postgresql.org/docs/current/static/pgtrgm.html


You should really take the time to learn how storage and indexes work, and the difference between IO and CPU constraints on query performance. The database isn't magic... with the possible exception of Elasticsearch, they will all disappoint you if you chuck data into them blindly and expect optimal performance on inefficient queries.


The problem is that you are probably not using the correct index. As far as I remember, when you use a '%' at the beginning of a string you need at least a GIN (or was it GiST?) index with string ops, but using % at the beginning is not recommended anyway. I think a tsquery might be more suitable for "contains" searches, at least on PostgreSQL.


He is using MySQL, not PG.


You can directly define is_audio_url as a partial index in Postgres. No need for app code: postgresql.org/docs/current/static/indexes-partial.html
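e.g. something like this (a sketch, reusing the predicates from the parent comment):

  -- Partial index: only rows matching the predicate are stored in the index
  CREATE INDEX tweets_itunes_idx ON tweets (id)
    WHERE url LIKE '%itunes.apple.com%'
      AND url NOT LIKE '%twitter.com%';

  -- A query repeating the same predicate can then use the (small) index
  SELECT * FROM tweets
  WHERE url LIKE '%itunes.apple.com%'
    AND url NOT LIKE '%twitter.com%';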


You could do a number of things to speed it up:

- Full text index

- Extract the domain name to another column and index it (see the sketch below)

- Change MySQL defaults

- Change engine types, maybe go in-memory

- Create a lookup table of domains
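A sketch of the domain-extraction idea with a MySQL 5.7 generated column (the column name is an assumption, and SUBSTRING_INDEX is a crude URL parser):

  -- Stored generated column extracts the host part of the URL
  ALTER TABLE tweets
    ADD COLUMN domain VARCHAR(255)
      AS (SUBSTRING_INDEX(SUBSTRING_INDEX(url, '/', 3), '/', -1)) STORED,
    ADD INDEX tweets_domain_idx (domain);

  -- Equality lookup on an indexed column instead of LIKE over the full URL
  SELECT * FROM tweets WHERE domain = 'itunes.apple.com';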


Create a fulltext index and use MATCH AGAINST instead of LIKE


If you wanted to, you could even use a trigger to compute that field, and then backfill it - no changes to app code required!
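In MySQL that could look something like this (a sketch; the predicate is borrowed from the queries above, and you'd want a BEFORE UPDATE twin of the trigger too):

  -- Compute the flag at write time instead of in app code
  CREATE TRIGGER tweets_set_audio_url
    BEFORE INSERT ON tweets
    FOR EACH ROW
    SET NEW.is_audio_url = (NEW.url LIKE '%itunes.apple.com%'
                            AND NEW.url NOT LIKE '%twitter.com%');

  -- One-off backfill for rows that already exist
  UPDATE tweets
  SET is_audio_url = (url LIKE '%itunes.apple.com%'
                      AND url NOT LIKE '%twitter.com%');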


I'm doing fulltext search on 270 million rows in Postgres (with tsvector indices) in under 65ms on commodity hardware (sub-8€/mo).

And it works better than Elasticsearch, just due to the lower additional overhead.


How do you sort the result? Do you use ts_rank?

What is the size of your database on disk?


I use ts_rank combined with many other factors (as I also rank by time, by message type, etc).

The size of the database is a few dozen gigabytes by now, but that isn't relevant with tsvector; only the row count has an effect on search speed.
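As a rough illustration of blending ts_rank with a time factor (not the poster's actual query; names, query terms, and the decay formula are all made up):

  SELECT id, body,
         -- text relevance decayed by age in days
         ts_rank(tsv, q)
           / (1 + extract(epoch FROM now() - created_at) / 86400) AS score
  FROM messages,
       to_tsquery('english', 'postgres & search') AS q
  WHERE tsv @@ q
  ORDER BY score DESC
  LIMIT 50;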


Thanks for following up :-)

I was asking because ranking can be slow in PostgreSQL. PostgreSQL can use a GIN or GiST index for filtering, but not for ranking, because the index doesn't contain the positional information needed for ranking.

This is not an issue when your query is highly selective and returns a low number of matching rows. But when the query returns a large number of matching rows, PostgreSQL has to fetch the ts_vector from heap for each matching row, and this can be really slow.

People are working on this but it's not in PostgreSQL core yet: https://github.com/postgrespro/rum.

This is why I'm a bit surprised by the numbers you shared: fulltext search on 270 million rows in under 65ms on commodity hardware (sub-8€/mo).

A few questions, if I may:

- What is the average number of rows returned by your queries? Is there a LIMIT?

- Is the ts_vector stored in the table?

- Do you use a GIN or GiST index on the ts_vector?

Cheers.


It will not be as fast, but it will be fast enough for the vast majority of use cases.


Is there some resource where one can figure out what category of problem needs what storage solution?

What storage solutions would you use for product recommendations or trending topics?


I used Elasticsearch. But I didn't use it as a primary datastore, more just as a mirror of the MySQL data.

It comes out of the box with a way to find "More Like This" https://www.elastic.co/guide/en/elasticsearch/reference/curr...

For trending topics I believe I used something related to this: https://www.elastic.co/guide/en/elasticsearch/reference/curr...

An interesting article on using ES: https://auth0.engineering/from-slow-queries-to-over-the-top-...





