Major Changes from Solr 4 to Solr 5

krat0sprakhar · on Feb 19, 2015

At my company, we've been beating our heads against the wall[0] trying to get multi-term synonyms to work correctly in SOLR, e.g.

   fruit extractor => fruit juicer, citrus juicer

Does anyone experienced enough have a clue if SOLR 5 can help with that?

[0] - http://opensourceconnections.com/blog/2013/10/27/why-is-mult...

elchief · on Feb 19, 2015

This guy fixed it for 3 and 4

https://github.com/healthonnet/hon-lucene-synonyms

The Solr guys don't give a flying F about this issue though

detnavillus · on Feb 20, 2015

As I said I'm one Solr guy that does 'effin care about this issue - I should have replied on this sub thread, sorry. Will have a submission ready soon for my AutophrasingTokenFilter - when I get the JIRA number, I'll let you know.

AznHisoka · on Feb 20, 2015

I've yet to find an NLP library that can give you multi-term synonyms as accurate as that, let alone a SOLR module/fix.

oomkiller · on Feb 20, 2015

You could possibly change the text before indexing to solve this problem.
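A minimal sketch of what that pre-processing could look like, assuming you keep your own phrase list: collapse each known multi-word phrase into a single token before the text reaches the analyzer, so that an ordinary single-token synonym rule (something like `fruit_extractor => fruit_juicer, citrus_juicer`) can handle the rest. The phrase list and the underscore convention are made up for illustration; this is not anything Solr ships.

```python
import re

# Hypothetical phrase list; in practice it would come from your synonym file.
PHRASES = ["fruit extractor", "fruit juicer", "citrus juicer"]

def collapse_phrases(text, phrases=PHRASES):
    """Lowercase the text and join each known phrase into one underscore token."""
    out = text.lower()
    # Longest phrases first, so a longer phrase is not split by a shorter one.
    for phrase in sorted(phrases, key=len, reverse=True):
        pattern = r"\b" + re.escape(phrase) + r"\b"
        out = re.sub(pattern, phrase.replace(" ", "_"), out)
    return out

print(collapse_phrases("Best Fruit Extractor for oranges"))
```

You would have to run the same function over incoming queries too, otherwise the index and the queries no longer agree on tokens.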

thinkcomp · on Feb 19, 2015

Does Solr 5.0 support password-protecting the admin interface yet without spending hours trying to wrangle custom XML files? It seems like a pretty basic requirement for a web-based application.

I've tried things like this

http://community.zimbra.com/documentation/w/documentation/se...

repeatedly. They never seem to work right.

Tharkun · on Feb 19, 2015

Solr is not a web-based application. You shouldn't directly expose your Solr instance to anyone, regardless of whether or not you secure your admin interface. That's not Solr's core business, and I don't see why they should waste their efforts on it.

Have Solr listen on localhost and have your web app talk to Solr. If your Solr is visible to the world, you're doing it wrong.

Edit: by saying that it's not a web-based application I mean that it shouldn't be on teh interwebz -- it's obviously a webapp in the sense that it mostly speaks HTTP.

tomp · on Feb 19, 2015

Well, regardless of whether it's web-facing or not, it's not unreasonable to want to limit the access to its admin panel (e.g. in a big company with different teams).

I agree, however, that SOLR is best off doing one thing well; web page security can be implemented by something else, e.g. Apache.

Tharkun · on Feb 19, 2015

You would be right if the Solr admin page were the only administrative interface. It's not. You could send delete queries, create additional indexes, etc., all without using the admin page, simply by sending HTTP requests to the relevant Solr components. Instead of adding overhead by securing each individual call, they leave it up to you.

Or to whichever friendly consultant you decide to hire to help out with that. Wink wink. Nudge nudge.
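To make the point about plain HTTP requests concrete, here is a rough sketch of a delete-by-query against Solr's update handler using the JSON update format. The host, port, and core name (`mycore`) are placeholders I picked for the example, and the request is only built, never sent.

```python
import json
import urllib.request

def build_delete_request(base_url, core, query):
    """Build (but do not send) a delete-by-query POST for Solr's update handler."""
    body = json.dumps({"delete": {"query": query}}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/solr/{core}/update?commit=true",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Placeholder host and core name; "*:*" matches every document in the core.
req = build_delete_request("http://localhost:8983", "mycore", "*:*")
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
```

Anyone who can reach that port can send this, admin page or not, which is why locking down only the UI doesn't buy you much.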

imaginenore · on Feb 19, 2015

So limit it to only whitelisted IPs. That's why you have sysadmins.
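In practice this lives in a firewall or proxy config rather than application code, but the check itself is simple. A sketch, with made-up example networks:

```python
import ipaddress

# Hypothetical allowlist; these networks are examples, not recommendations.
ALLOWED = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("127.0.0.1/32"),
]

def is_allowed(client_ip, allowed=ALLOWED):
    """Return True if client_ip falls inside any allowlisted network."""
    addr = ipaddress.ip_address(client_ip)
    return any(addr in net for net in allowed)

print(is_allowed("10.1.2.3"), is_allowed("8.8.8.8"))
```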

untog · on Feb 19, 2015

Or use Nginx as a proxy, or something. It's frustrating, but I can see the argument for Solr to just delegate this kind of task to other projects that do it better.

gregors · on Feb 19, 2015

The admin is a web application; I can tell because I'm looking at it in my browser. ;) Not wanting to deal with the boring parts of web apps is understandable, but not business/user savvy.

BonoboBoner · on Feb 19, 2015

While you are technically right, that kind of thinking gave people access to a ton of MongoDB databases last week.

gchanan · on Feb 19, 2015

Disclaimer: I work at Cloudera on Solr and related technologies.

I don't think there's anything out of the box in Solr 5.0 that changes that. SOLR-4470[0] should be able to do that, but it hasn't been committed. Apache Sentry[1] adds role based access control to Solr, but it's only been tested up to Solr 4.10 and with kerberos (not basic password protection). It comes nicely integrated out-of-the-box with Solr as part of Cloudera Search[2]; otherwise, you'll have to do some manual setup to get it to work.

[0] https://issues.apache.org/jira/browse/SOLR-4470

[1] https://sentry.incubator.apache.org/

[2] http://www.cloudera.com/content/cloudera/en/products-and-ser...

jonbaer · on Feb 19, 2015

If you need solid, easy-to-configure security wrapped around Solr, I would recommend taking a peek at Fusion from Lucidworks: http://lucidworks.com/product/solr-enterprise/

ankit-singh · on Feb 20, 2015

Make your Solr listen on localhost, and put a reverse proxy like nginx in front of it to access it via the web and control authentication from there.

detnavillus · on Feb 20, 2015

krat0sprakhar: Check out my blogs on this https://lucidworks.com/blog/solution-for-multi-term-synonyms...

I also did a Meetup on this just this week http://www.slideshare.net/detnavillus/the-well-tempered-sear...

Check out slide 18, autophrasing + synonyms: precision 100%, recall 100%. Bag-of-words OOTB Solr/Lucene: NOT so!

The code is on GitHub and is a Lucene TokenFilter, so it should work. I used 4.10.3 for the Meetup demo.

AznHisoka · on Feb 20, 2015

Still no percolator/reverse search. ElasticSearch still is my go-to search technology.

johnx123-up · on Feb 20, 2015

I've checked http://www.elasticsearch.org/guide/en/elasticsearch/referenc... but couldn't grasp the use case. Can anyone share some thoughts on where it would be helpful?

AznHisoka · on Feb 20, 2015

Think Google Alerts or Think "Tell me when price reaches $X"
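Roughly, the idea is search turned around: you store the queries and then ask which of them match each new document. A toy sketch in plain Python (the class and its AND-of-terms matching are my own simplification, not the Elasticsearch API):

```python
def tokenize(text):
    """Lowercase and split on whitespace; a stand-in for a real analyzer."""
    return set(text.lower().split())

class ToyPercolator:
    """Stores queries up front, then reports which ones a new document matches."""

    def __init__(self):
        self.queries = {}

    def register(self, query_id, terms):
        self.queries[query_id] = tokenize(terms)

    def percolate(self, document):
        doc_terms = tokenize(document)
        # A stored query matches when all of its terms occur in the document.
        return sorted(
            qid for qid, terms in self.queries.items() if terms <= doc_terms
        )

p = ToyPercolator()
p.register("alert-solr", "solr release")
p.register("alert-es", "elasticsearch percolator")
print(p.percolate("Apache Solr 5.0 release announced"))
```

Each registered query would belong to a user, and a match is what triggers the alert.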

altcognito · on Feb 20, 2015

Your customers pay you to let them know when new data is available.

detnavillus · on Feb 20, 2015

elchief: I am a Solr guy and I do give a flying F about this. I've been remiss in submitting this to Solr though - my bad. Working on this now. Nolan's fix is also good. I referenced this work in my blog post

mikeblum · on Feb 19, 2015

Looking at the Solr downloads page, I can't seem to find the Solr 5 tarball. Clicking Downloads (http://lucene.apache.org/solr/mirrors-solr-latest-redir.html) redirects to a 4.x tarball; I added 5.0.0 to the URL and got the latest tarball: http://www.apache.org/dyn/closer.cgi/lucene/solr/5.0.0

sagivo · on Feb 19, 2015

Any reasons to prefer Solr over elastic-search?

arafalov · on Feb 19, 2015

Solr is a lot more configurable (Search Components, Update Request Processors, configurable Schemaless processing sequence, etc).

Solr includes an Admin UI console that is free for production use; ES has one that's only free for development.

Solr has contributors from a lot more different companies, so it grows in multiple directions at once.

If you want to compare on a technical level, you can see my presentation from the Lucene/Solr Revolution back in November: http://www.slideshare.net/arafalov/solr-vs-elasticsearch-cas...

Solr is completely free. If that's not an issue for you and you are ready to pay, then you should compare Elasticsearch to LucidWorks Fusion, not directly to Solr.

diegolo · on Feb 20, 2015

https://www.youtube.com/watch?v=S1Md3LDJPLs

arafalov · on Feb 20, 2015

Yes, that was me :-)

apawloski · on Feb 19, 2015

Search-wise? You're ending up at Lucene either way.

Scaling-wise? Distributed Elasticsearch doesn't have a Zookeeper dependency, which is nice. But Solr has more sharding flexibility and partition tolerance.

Like everything else, depends on your needs.

craigching · on Feb 19, 2015

> partition tolerance

Yes, see [1] and [2] for a comparison between Solr and Elasticsearch. I should mention also that ES are working on the partition tolerance issue as described in [3]. I am currently using 1.4.3 and am wondering if anything was addressed in 1.4 for resiliency.

[1] http://lucidworks.com/blog/call-maybe-solrcloud-jepsen-flaky...

[2] https://aphyr.com/posts/317-call-me-maybe-elasticsearch

[3] http://www.elasticsearch.org/blog/resiliency-elasticsearch/

shironinja · on Feb 19, 2015

For your use case set up both and then do some testing.

Speaking for myself I did that and found that SOLR was a lot more performant. I needed a high-traffic solution without a bunch of servers.

I find that ES tries to do too much with all the dashboards and monitoring etc.

SOLR keeps it simple and thereby does not incur the performance penalty.

Also when an ES cluster goes south (cluster health "yellow" or "red") it seemed like a pain to troubleshoot and determine the real reason WHY. SOLR seems more durable and when something needs investigation you get a clear message in the log.

If you are starting something new, just don't know your traffic requirements, and are scared of it "going viral" like a hot new mobile game, ES may work for you though. The one thing it excels at is adding more ES servers to the cluster quickly. So if you need more servers ASAP and don't care about the cost, ES has that covered.

atombender · on Feb 19, 2015

Does Solr have the same problem as ES where you can't modify an index mapping after creating it? With ES you have to reindex everything into a new temporary index and then swap the new and old indexes. It's a terrible design, especially considering that ES already has all the original data and should be perfectly capable of doing it itself, incrementally.

nemo44x · on Feb 20, 2015

Often when you want to reindex you want to reindex a lot of data - quickly. You don't want to use your same hardware, possibly. So what you do if you're in AWS, for instance, is take your data mounts and mount them to an extremely high powered set of instances. Then, reindex with the new mappings. Then, move your mounts back to your smaller instances.

You save money and can do a reindex very quickly.

A basic "reindex" command would be a cool feature though.

shironinja · on Feb 19, 2015

It is similar in SOLR (and other NoSQL data stores) currently, at least as far as I am aware. It can be quite intensive, especially since ES, SOLR, etc. like to use as much memory as they can for fast access. Any non-trivial application should have duplicate servers / instances to soak up the traffic while a change like that is migrated.

arafalov · on Feb 19, 2015

The primary limitation comes from Lucene, which powers both. With Solr, you could change the definition and reload the core, but you will be getting some weird artifacts for the previously-indexed content.

Neither Solr nor Elasticsearch should be treated as a primary data store, so that's probably why reindex-in-place is not the highest priority.

atombender · on Feb 19, 2015

I think treating Solr/ES as a primary data store would be a horrible idea. But the problem is equally annoying if you're using it as an expendable search index.

There is no technical reason why Solr/ES could not do diff-based indexing. ES (I don't know about Solr) admittedly uses a single Lucene index per logical index, so changing a single field mapping involves reindexing the whole index, not just that one mapping.

But if the mappings were properly versioned ES could simply create a new version, index everything (from the original contents), and then swap. Locking the original index should be a non-issue.

gizmogwai · on Feb 19, 2015

No, they don't. Behind the scenes, it's Lucene that is creating a set of inverted indexes. While extremely performant for text search, an inverted index is also destructive and partial compared to the initial dataset. So there is no way to migrate an index to another one.
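A toy illustration of the "destructive" part, assuming a naive analyzer that lowercases, strips punctuation, and drops a tiny made-up stopword list. It deliberately ignores stored fields and term positions, which Lucene can also keep.

```python
from collections import defaultdict

STOPWORDS = {"the", "a", "is"}  # tiny illustrative list

def build_inverted_index(docs):
    """Map each surviving token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            token = token.strip(".,!?")
            if token and token not in STOPWORDS:
                index[token].add(doc_id)
    return dict(index)

docs = {1: "The Juicer is fast.", 2: "A fast juicer!"}
index = build_inverted_index(docs)
print(index)
```

Both documents end up with identical postings, so from the index alone you can't tell which original text was which, let alone re-analyze it with a different field definition.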

atombender · on Feb 19, 2015

ES, by default, stores the original document in the nested "_source" document, which could be used to reindex data from scratch.
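A sketch of what reindexing from the stored source could look like, using plain dicts as stand-ins for indexes and an alias table. Everything here (the `fields` layout, the `products` alias, the converter functions) is hypothetical and only mirrors the idea, not the actual ES API.

```python
def reindex(old_index, new_mapping):
    """Build a fresh index by running every stored _source through new_mapping."""
    new_index = {}
    for doc_id, entry in old_index.items():
        source = entry["_source"]
        new_index[doc_id] = {
            "_source": source,  # keep the original so this can be repeated later
            "fields": {
                name: convert(source[name])
                for name, convert in new_mapping.items()
                if name in source
            },
        }
    return new_index

# Old "index": price was mapped as a string, title indexed verbatim.
old_index = {
    "1": {
        "_source": {"price": "19.99", "title": "Citrus Juicer"},
        "fields": {"price": "19.99", "title": "Citrus Juicer"},
    },
}
new_mapping = {"price": float, "title": str.lower}

aliases = {"products": old_index}
aliases["products"] = reindex(old_index, new_mapping)  # swap once the rebuild is done
print(aliases["products"]["1"]["fields"])
```

The old index stays untouched until the alias is repointed, which is the same build-then-swap dance you do by hand today.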

chatman · on Feb 19, 2015

Solr is backed by the Apache community while ES is backed by a private entity. Is that a good reason?

matthewmacleod · on Feb 19, 2015

Not really, no. If ElasticsearchCo were to go belly-up, the technology would almost certainly be backed by someone else.

capkutay · on Feb 19, 2015

I don't know what you mean when you say backed by the Apache Community, but Elasticsearch is also an open source, apache-licensed project that has a commercial, private counterpart. I think every major open source project has this now (Hadoop has Cloudera/Hortonworks/MapR, Spark has databricks).

TallGuyShort · on Feb 19, 2015

>> I don't know what you mean when you say backed by the Apache Community

It's a matter of who does the releasing (i.e. curating of patches, responsibility for building community, etc.): Solr is governed by the Apache Software Foundation and associated volunteers, while ElasticSearch is governed by the private entity.

IMO (and this is an increasingly rare opinion) the license is the most important piece and they're both under the ASL 2.0.

nemothekid · on Feb 20, 2015

Governance alone? I don't think that's a good reason. There are plenty of Apache projects that are effectively governed by a private entity (Cassandra, Samza, Kafka, and these are all in the "big data" space) that would most likely be as useful as Elasticsearch if their supporting companies disappeared overnight.

ecaron · on Feb 19, 2015

No. It'd be like preferring memcache over redis. You'll find warriors in both camps, but most people have forgotten there was a debate.