As I said I'm one Solr guy that does 'effin care about this issue - I should have replied on this sub thread, sorry. Will have a submission ready soon for my AutophrasingTokenFilter - when I get the JIRA number, I'll let you know.
Does Solr 5.0 support password-protecting the admin interface yet without spending hours trying to wrangle custom XML files? It seems like a pretty basic requirement for a web-based application.
Solr is not a web-based application. You shouldn't directly expose your Solr instance to anyone, regardless of whether or not you secure your admin interface. That's not Solr's core business, and I don't see why they should waste their efforts on it.
Have Solr listen on localhost and have your web app talk to Solr. If your Solr is visible to the world, you're doing it wrong.
Edit: by saying that it's not a web-based application I mean that it shouldn't be on teh interwebz -- it's obviously a webapp in the sense that it mostly speaks HTTP.
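A minimal sketch of the "web app talks to Solr on localhost" setup described above, in Python. It assumes Solr is bound to the loopback interface on its default port 8983; the core name `articles` and the helper `build_select_url` are hypothetical. Only the URL construction is shown, so nothing here needs a running Solr.

```python
from urllib.parse import urlencode

# Assumption: Solr listens only on the loopback interface, default port 8983.
SOLR_BASE = "http://127.0.0.1:8983/solr"


def build_select_url(core, user_query, rows=10):
    """Build a read-only /select URL from user-supplied search text."""
    params = {
        "q": user_query,
        "rows": min(int(rows), 100),  # the web app, not the user, caps page size
        "wt": "json",
    }
    # urlencode escapes the user input so it cannot inject extra parameters.
    return f"{SOLR_BASE}/{core}/select?{urlencode(params)}"


# "articles" is a hypothetical core name used for illustration.
url = build_select_url("articles", "hello world", rows=500)
print(url)
```

The point of the design is that end users only ever reach the web app, which decides which Solr handlers and parameters are used on their behalf.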
Well, regardless of whether it's web-facing or not, it's not unreasonable to want to limit the access to its admin panel (e.g. in a big company with different teams).
I agree, however, that SOLR is best off doing one thing well; web page security can be handled elsewhere, e.g. by Apache.
You would be right if the Solr admin page were the only administrative interface. It's not. You could send delete queries, create additional indexes, etc., all without using the admin page, simply by sending HTTP requests to the relevant Solr components. Instead of adding overhead by securing each individual call, they leave it up to you.
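To make that concrete, here is a hedged sketch of what such a request looks like: a delete-by-query posted as JSON to a core's update handler, with no admin page involved. The base URL and the core name `articles` are assumptions, and the request is only constructed, never sent.

```python
import json
import urllib.request


def build_delete_request(base_url, core, query):
    """Construct (but do not send) a delete-by-query for a core's update handler."""
    body = json.dumps({"delete": {"query": query}}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/{core}/update?commit=true",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical host and core; "*:*" matches every document in the index.
req = build_delete_request("http://127.0.0.1:8983/solr", "articles", "*:*")
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
```

Anyone who can reach the Solr port can send this, which is why locking down only the admin page would not help much.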
Or to whichever friendly consultant you decide to hire to help out with that. Wink wink. Nudge nudge.
Or use Nginx as a proxy, or something. It's frustrating, but I can see the argument for Solr to just delegate this kind of task to other projects that do it better.
The admin is a web application; I can tell because I'm looking at it in my browser. ;) Not wanting to deal with the boring parts of web apps is understandable, but it's not business/user savvy.
Disclaimer: I work at Cloudera on Solr and related technologies.
I don't think there's anything out of the box in Solr 5.0 that changes that. SOLR-4470[0] should be able to do that, but it hasn't been committed. Apache Sentry[1] adds role-based access control to Solr, but it's only been tested up to Solr 4.10 and with Kerberos (not basic password protection). It comes nicely integrated out-of-the-box with Solr as part of Cloudera Search[2]; otherwise, you'll have to do some manual setup to get it to work.
elchief: I am a Solr guy and I do give a flying F about this. I've been remiss in submitting this to Solr though - my bad. Working on this now. Nolan's fix is also good. I referenced this work in my blog post
Solr is completely free. If that's not an issue for you and you are ready to pay, then you should compare Elasticsearch to LucidWorks Fusion, not directly to Solr.
Search-wise? You're ending up at Lucene either way.
Scaling-wise? Distributed Elasticsearch doesn't have a Zookeeper dependency, which is nice. But Solr has more sharding flexibility and partition tolerance.
Yes, see [1] and [2] for a comparison between Solr and Elasticsearch. I should mention also that ES are working on the partition tolerance issue as described in [3]. I am currently using 1.4.3 and am wondering if anything was addressed in 1.4 for resiliency.
For your use case set up both and then do some testing.
Speaking for myself I did that and found that SOLR was a lot more performant. I needed a high-traffic solution without a bunch of servers.
I find that ES tries to do too much with all the dashboards and monitoring etc.
SOLR keeps it simple and thereby does not incur the performance penalty.
Also when an ES cluster goes south (cluster health "yellow" or "red") it seemed like a pain to troubleshoot and determine the real reason WHY. SOLR seems more durable and when something needs investigation you get a clear message in the log.
If you are starting something new, don't know your traffic requirements, and are scared of it "going viral" like a hot new mobile game, ES may work for you though. The one thing it excels at is adding more ES servers to the cluster quickly. So if you need more servers ASAP and don't care about the cost, ES has that covered.
Does Solr have the same problem as ES where you can't modify an index mapping after creating it? With ES you have to reindex everything into a new temporary index and then swap the new and old indexes. It's a terrible design, especially considering that ES already has all the original data and should be perfectly capable of doing it itself, incrementally.
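For reference, the swap step of that workflow is commonly done with an index alias, so the application never sees the index names change. The sketch below only builds the JSON body for Elasticsearch's `_aliases` endpoint; the alias and index names are hypothetical, and the reindexing itself is assumed to have happened already.

```python
import json


def alias_swap_actions(alias, old_index, new_index):
    """Body for POST /_aliases that repoints an alias in a single request."""
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }


# Hypothetical names: the app queries "products", which moves from v1 to v2.
payload = alias_swap_actions("products", "products_v1", "products_v2")
print(json.dumps(payload, indent=2))
```

Both actions go in one request, so searches against the alias should not hit a moment where it points at nothing.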
Often when you want to reindex you want to reindex a lot of data - quickly. You don't want to use your same hardware, possibly. So what you do if you're in AWS, for instance, is take your data mounts and mount them to an extremely high powered set of instances. Then, reindex with the new mappings. Then, move your mounts back to your smaller instances.
You save money and can do a reindex very quickly.
A basic "reindex" command would be a cool feature though.
It is similar in SOLR (and other NoSQL data stores) currently, at least as far as I am aware. It can be quite intensive, especially since ES, SOLR, etc. like to use as much memory as they can for fast access. Any non-trivial application should have duplicate servers / instances to soak up the traffic while a change like that is migrated.
The primary limitation comes from Lucene, which powers both. With Solr, you could change the definition and reload the core, but you will be getting some weird artifacts for the previously-indexed content.
Neither Solr nor Elasticsearch should be treated as a primary data store, so that's probably why reindex-in-place is not the highest priority.
I think treating Solr/ES as a primary data store would be a horrible idea. But the problem is equally annoying if you're using it as an expendable search index.
There is no technical reason why Solr/ES could not do diff-based indexing. ES (I don't know about Solr) admittedly uses a single Lucene index per logical index, so changing a single field mapping involves reindexing the whole index, not just that one mapping.
But if the mappings were properly versioned ES could simply create a new version, index everything (from the original contents), and then swap. Locking the original index should be a non-issue.
No, they don't. Behind the scenes, it's Lucene that is creating a set of inverted indexes. While extremely performant for text search, an inverted index is also destructive and partial compared to the initial dataset. So, there is no way to migrate an index to another one.
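A toy illustration of the "destructive and partial" point, in Python. This is not how Lucene is implemented (real indexes can also keep positions and stored fields); it only shows that a bare term-to-documents mapping drops case, punctuation and word order, so the original text cannot be rebuilt from it alone.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercased, punctuation-stripped token to the doc ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token.strip(".,!?")].add(doc_id)
    return index


# Made-up sample documents for illustration.
docs = {
    1: "Solr is built on Lucene.",
    2: "Elasticsearch is built on Lucene, too!",
}
index = build_inverted_index(docs)

for term in sorted(index):
    print(term, sorted(index[term]))
```

Lookup by term is fast, but nothing in `index` records that "Solr" was capitalized or that "too" came last, which is the information a different analysis chain would need in order to re-index.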
I don't know what you mean when you say backed by the Apache Community, but Elasticsearch is also an open source, Apache-licensed project that has a commercial, private counterpart. I think every major open source project has this now (Hadoop has Cloudera/Hortonworks/MapR, Spark has Databricks).
>> I don't know what you mean when you say backed by the Apache Community
It's a matter of who does the releasing (i.e. curating of patches, responsibility for building community, etc.): Solr is governed by the Apache Software Foundation and associated volunteers, while Elasticsearch is governed by a private entity.
IMO (and this is an increasingly rare opinion) the license is the most important piece and they're both under the ASL 2.0.
Governing alone - I don't think that's a good reason. There are plenty of Apache projects that are effectively governed by a private entity (Cassandra, Samza, Kafka, and these are all in the "big data" space), that would most likely be as useful as Elasticsearch if their supporting companies disappeared overnight.
[0] - http://opensourceconnections.com/blog/2013/10/27/why-is-mult...