
Interesting quote: "we decided to compute all statistics on demand. This was something we previously tried in RethinkDB, but the results were not good... When we tried implementing statistics as SQL queries in Postgres, we were amazed by the performance. We could implement complex statistics involving data from many tables."

I think standard line "use right tool for the job" is still the ultimate answer. Data in most applications is relational, and you need to query it in different ways that weren't anticipated at the beginning, hence the longevity of SQL.

That said, I too often see HN commenters say something like "this data was only 100 GB? Why didn't they just put it in Postgres?", which is not as clever as the writer may think. Try doing text search on a few million SQL rows, or generating product recommendations, or finding trending topics... Elasticsearch and other 'big data' tools will do it much quicker than SQL because it's a different category of problem. It's not about the data size, it's about the type of processing required. (Edited my last line here a bit based on replies below.)




We do exactly what you describe on an ~80TB dataset stored across a number of PG instances. The 100GB comments are extremely clever, because running a query against a dataset that fully fits in RAM will be blazingly fast. As for "try doing text search on a few million SQL rows": we are doing it on billions of rows.


Exactly. People overestimate the amount of effort and cleverness it takes to horizontally shard/scale your data yourself rather than trust a specialized database to do it as a black box. And they overestimate the cleverness of the black box.

What you retain by staying with Postgres rather than going to a more exotic database is priceless. There is a threshold of data size or product need that makes a more specialized database the right choice. It's just well above 100 GB and your application should have some very specific needs to justify it.


I think some of this can be attributed to expectations; many systems such as Elastic are expected to run in RAM, whereas people tend to expect relational systems to perform reasonably well on disk. I think we all want a magic black box that makes such problems disappear, and we always seem to be disappointed with the system in question when at some point it doesn't.


You are 100% right: the perpetual search for a "magic black box" is the cause of much pain. On the bright side, it's also the source of a significant amount of high-paying work to clean up the mess.


Do you have FULLTEXT indexes on those rows? I find it hard to believe that searching for something like, e.g., a fragment of text in YouTube comments can be as fast in a SQL system (even in RAM) as in Elasticsearch.

As for the other stuff I mentioned (recommendations, etc.), I'm not just basing it on my personal experience--here's a write-up from Pinterest about having to dump all their MySQL data to Hadoop to drive various types of analysis. I doubt they would do it if just putting the SQL DBs in RAM was adequate! https://medium.com/@Pinterest_Engineering/tracker-ingesting-...


PG is something of a different beast, too. With MySQL, the standard mode of operation for most non-basic functionality is to dump the data into another tool.

Full text with Postgres is pretty fantastic and configurable. Not putting the data somewhere else keeps you from having to maintain a second system, keep things in sync, etc.

People jump to putting data in a search tool because it's a search tool waaaaay too quickly IMHO. If the use case justifies it, go for it...but don't add unnecessary complexity unless you have to.


> Not putting the data somewhere else keeps you from having to maintain a second system, keep things in sync, etc.

To be fair there is some amount of "keeping things in sync" one has to do with postgres, even if it's just setting up the right triggers to update the FTS index on updates.
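For illustration, a minimal sketch of that trigger setup in Postgres (table and column names are made up; tsvector_update_trigger is a built-in helper):

  -- Hypothetical tweets table: dedicated tsvector column plus a GIN index over it
  ALTER TABLE tweets ADD COLUMN tsv tsvector;
  CREATE INDEX tweets_tsv_idx ON tweets USING gin (tsv);

  -- Built-in helper keeps tsv in sync with the text column on every write
  CREATE TRIGGER tweets_tsv_update
    BEFORE INSERT OR UPDATE ON tweets
    FOR EACH ROW
    EXECUTE PROCEDURE tsvector_update_trigger(tsv, 'pg_catalog.english', body);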


True, but it's transaction-safe, so it should be impossible to get out of sync, which is what I was really getting at.


I can assure you that Pinterest's dataset is vastly bigger than 100GB :) At a certain scale an RDBMS will obviously experience issues and might no longer be the optimal solution. For PG (it's hard to generalize), things become painful beyond 10-20TB. Now, that "limit" is constantly shifting, so if you are starting with a 100GB dataset that is growing at 200GB a year, you can basically stay on a single-instance RDBMS forever.


Sorry, I'm just not buying that it's about the file size of data.

Right now I'm working on an app that works with tweets. I want to find all tweets that link to iTunes.

When I was using:

> select * from `tweets` where `url` not like '%twitter.com%' and `url` like '%itunes.apple.com%'

I could scale my server up to 16 CPUs and it would still take several minutes to search a few million tweets.

Yesterday I added another field to the database, `is_audio_url`, where I pre-compute whether the URL is an iTunes link (by string matching in the app code) when I insert the record into the database. So I can do:

> select * from `tweets` where `is_audio_url` = 1

And now it's blazing fast. It's just the most recent of many experiences where MySQL really struggles with text matching.


Full text search in Postgres is fast if you have it configured correctly.
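For example, an expression index lets the @@ query use the index directly, with no extra column to maintain (a sketch only; table and column names are assumed):

  -- GIN index over the computed tsvector
  CREATE INDEX tweets_fts_idx
    ON tweets USING gin (to_tsvector('english', body));

  -- The WHERE clause must repeat the same expression for the index to be used
  SELECT id, body
  FROM tweets
  WHERE to_tsvector('english', body) @@ to_tsquery('english', 'itunes & apple');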

> It is just my most recent of many experiences that MySQL

1) You're using MySQL, not Postgres; given that this is a discussion about whether Postgres can compete with Elasticsearch, that's not super relevant. :)

> select * from `tweets` where `url` not like '%twitter.com%' and `url` like '%itunes.apple.com%'

2) That's not how you query a full text index; that's going to be glacially slow.

You need a FULLTEXT index and to use a MATCH...AGAINST query. Check out the docs[1].

[1]: https://dev.mysql.com/doc/refman/5.7/en/fulltext-search.html
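Roughly like this (an untested sketch; note that MySQL's default tokenizer splits on punctuation, so matching pieces of URLs like itunes.apple.com takes some care):

  -- One-time: add a FULLTEXT index on the column being searched
  ALTER TABLE tweets ADD FULLTEXT INDEX tweets_url_ft (url);

  -- Boolean mode: require both terms, exclude tweets linking to twitter
  SELECT *
  FROM tweets
  WHERE MATCH(url) AGAINST('+itunes +apple -twitter' IN BOOLEAN MODE);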


Fair enough; I wasn't aware of MATCH...AGAINST and will have to check it out. I'm not going to keep going back and forth on this in HN comments, but it really has been my experience that Elasticsearch can do some very cool things that I've struggled to do with MySQL. I've already named the top three examples I have in mind (full text search, recommendations/related items, and trending topics, i.e. finding terms used in text fields in the last week that weren't as frequently used in the last month).


But didn't you just say that you hadn't used full text search? You haven't struggled with FT; you just haven't used it.


I am not selling anything to you :) You can scale your server to even 100 vCPUs or whatever black-box name the provider is using and still get the same speed, mainly because MySQL does not have parallel query :) BTW, no one ever mentioned MySQL, but you keep bringing it up.


Queries with %text% syntax do not use indexes by default.

You can use the pg_trgm module to index these kinds of queries.

https://www.postgresql.org/docs/9.1/static/pgtrgm.html
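Something like this (a sketch; index name made up), reusing the tweets example from above:

  CREATE EXTENSION pg_trgm;

  -- Trigram GIN index allows LIKE '%...%' to use an index scan
  CREATE INDEX tweets_url_trgm_idx
    ON tweets USING gin (url gin_trgm_ops);

  -- This can now use the index instead of a sequential scan
  SELECT * FROM tweets WHERE url LIKE '%itunes.apple.com%';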


Prob better to point to the current (9.6) version of that doc, rather than the no-longer-supported ancient v9.1 version. There are some differences. :)

https://www.postgresql.org/docs/current/static/pgtrgm.html


You should really take the time to learn how storage and indexes work, and the difference between IO and CPU constraints on query performance. The database isn't magic... with the possible exception of Elasticsearch, they will all disappoint you if you chuck data into them blindly and expect optimal performance on inefficient queries.


The problem is that you are probably not using the correct index. As far as I remember, when you use a '%' at the beginning of a string you need at least a GIN (or was it GiST?) index with string ops, but using % at the beginning is not recommended anyway. I think a tsquery might be more suitable for "contains" searches, at least on PostgreSQL.


He is using MySQL, not PG.


You can directly define is_audio_url as a partial index in Postgres. No need for app code: postgresql.org/docs/current/static/indexes-partial.html
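e.g. something like this (a sketch, reusing the predicates from the parent comment):

  -- Partial index: only rows matching the predicate are stored in the index
  CREATE INDEX tweets_itunes_idx ON tweets (id)
    WHERE url LIKE '%itunes.apple.com%'
      AND url NOT LIKE '%twitter.com%';

  -- A query repeating the same predicate can then use the (small) index
  SELECT * FROM tweets
  WHERE url LIKE '%itunes.apple.com%'
    AND url NOT LIKE '%twitter.com%';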


You could do a number of things to speed it up:

- Full text index

- Extract the domain name to another column and index it (see the sketch below)

- Change MySQL defaults

- Change engine types, maybe go in-memory

- Create a lookup table of domains
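A sketch of the domain-extraction idea with a MySQL 5.7 generated column (the column name is an assumption, and SUBSTRING_INDEX is a crude URL parser):

  -- Stored generated column extracts the host part of the URL
  ALTER TABLE tweets
    ADD COLUMN domain VARCHAR(255)
      AS (SUBSTRING_INDEX(SUBSTRING_INDEX(url, '/', 3), '/', -1)) STORED,
    ADD INDEX tweets_domain_idx (domain);

  -- Equality lookup on an indexed column instead of LIKE over the full URL
  SELECT * FROM tweets WHERE domain = 'itunes.apple.com';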


Create a fulltext index and use MATCH AGAINST instead of LIKE


If you wanted to, you could even use a trigger to compute that field, and then backfill it - no changes to app code required!
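In MySQL that could look something like this (a sketch; the predicate is borrowed from the queries above, and you'd want a BEFORE UPDATE twin of the trigger too):

  -- Compute the flag at write time instead of in app code
  CREATE TRIGGER tweets_set_audio_url
    BEFORE INSERT ON tweets
    FOR EACH ROW
    SET NEW.is_audio_url = (NEW.url LIKE '%itunes.apple.com%'
                            AND NEW.url NOT LIKE '%twitter.com%');

  -- One-off backfill for rows that already exist
  UPDATE tweets
  SET is_audio_url = (url LIKE '%itunes.apple.com%'
                      AND url NOT LIKE '%twitter.com%');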


I'm doing fulltext search on 270 million rows in Postgres (with tsvector indices) in under 65ms on commodity hardware (sub-8€/mo).

And it works better than Elasticsearch, just due to the lower additional overhead.


How do you sort the result? Do you use ts_rank?

What is the size of your database on disk?


I use ts_rank combined with many other factors (as I also rank by time, by message type, etc).

The size of the database is a few dozen gigabytes by now, but that isn't relevant with tsvector; only the row count has an effect on search speed.
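As a rough illustration of blending ts_rank with a time factor (not the poster's actual query; names, query terms, and the decay formula are all made up):

  SELECT id, body,
         -- text relevance decayed by age in days
         ts_rank(tsv, q)
           / (1 + extract(epoch FROM now() - created_at) / 86400) AS score
  FROM messages,
       to_tsquery('english', 'postgres & search') AS q
  WHERE tsv @@ q
  ORDER BY score DESC
  LIMIT 50;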


Thanks for following up :-)

I was asking because ranking can be slow in PostgreSQL. PostgreSQL can use a GIN or GiST index for filtering, but not for ranking, because the index doesn't contain the positional information needed for ranking.

This is not an issue when your query is highly selective and returns a low number of matching rows. But when the query returns a large number of matching rows, PostgreSQL has to fetch the ts_vector from heap for each matching row, and this can be really slow.

People are working on this but it's not in PostgreSQL core yet: https://github.com/postgrespro/rum.

This is why I'm a bit surprised by the numbers you shared: fulltext search on 270 million rows in under 65ms on commodity hardware (sub-8€/mo).

A few questions, if I may:

- What is the average number of rows returned by your queries? Is there a LIMIT?

- Is the ts_vector stored in the table?

- Do you use a GIN or GiST index on the ts_vector?

Cheers.


It will not be as fast, but it will be fast enough for the vast majority of use cases.


Is there some resource where one can figure out what category of problem needs what storage solution?

What storage solutions would you use for product recommendations or trending topics?


I used Elasticsearch. But I didn't use it as a primary datastore, more just as a mirror of the MySQL data.

It comes out of the box with a way to find "More Like This" https://www.elastic.co/guide/en/elasticsearch/reference/curr...

For trending topics I believe I used something related to this: https://www.elastic.co/guide/en/elasticsearch/reference/curr...

An interesting article on using ES: https://auth0.engineering/from-slow-queries-to-over-the-top-...





