Hacker News
Major Changes from Solr 4 to Solr 5 (apache.org)
103 points by chatman on Feb 19, 2015 | 43 comments



At my company, we've been beating our heads against the wall[0] trying to get multi-term synonyms to work correctly in SOLR, e.g.

   fruit extractor => fruit juicer, citrus juicer 
Does anyone experienced enough have a clue if SOLR 5 can help with that?

[0] - http://opensourceconnections.com/blog/2013/10/27/why-is-mult...


This guy fixed it for 3 and 4

https://github.com/healthonnet/hon-lucene-synonyms

The Solr guys don't give a flying F about this issue though


As I said I'm one Solr guy that does 'effin care about this issue - I should have replied on this sub thread, sorry. Will have a submission ready soon for my AutophrasingTokenFilter - when I get the JIRA number, I'll let you know.


I've yet to find an NLP library that can give you multi-term synonyms as accurate as that, let alone a SOLR module/fix.


You could possibly change the text before indexing to solve this problem.


Does Solr 5.0 support password-protecting the admin interface yet without spending hours trying to wrangle custom XML files? It seems like a pretty basic requirement for a web-based application.

I've tried things like this

http://community.zimbra.com/documentation/w/documentation/se...

repeatedly. They never seem to work right.


Solr is not a web-based application. You shouldn't directly expose your Solr instance to anyone, regardless of whether or not you secure your admin interface. That's not Solr's core business, and I don't see why they should waste their efforts on it.

Have Solr listen on localhost and have your web app talk to Solr. If your Solr is visible to the world, you're doing it wrong.

Edit: by saying that it's not a web-based application I mean that it shouldn't be on teh interwebz -- it's obviously a webapp in the sense that it mostly speaks HTTP.


Well, regardless of whether it's web-facing or not, it's not unreasonable to want to limit the access to its admin panel (e.g. in a big company with different teams).

I agree however that SOLR is best off doing one thing well, web page security can be implemented e.g. by Apache.


You would be right if the Solr admin page were the only administrative interface. It's not. You could send delete queries, create additional indexes, etc all without using the admin page, simply by sending http requests to the relevant Solr components. Instead of adding overhead by securing each individual call, they leave it up to you.

Or to whichever friendly consultant you decide to hire to help out with that. Wink wink. Nudge nudge.


So limit it to only whitelisted IPs. That's why you have sysadmins.
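A minimal sketch of what an IP allowlist check amounts to, assuming it is enforced by whatever sits in front of Solr (a firewall, proxy, or app layer). The networks and the `is_allowed` name are placeholders for illustration, not anything Solr ships.

```python
import ipaddress

# Hypothetical allowlist: these networks are placeholders, not from the thread.
ALLOWED_NETWORKS = [
    ipaddress.ip_network("127.0.0.0/8"),
    ipaddress.ip_network("10.0.0.0/8"),
]


def is_allowed(client_ip: str) -> bool:
    """Return True if client_ip falls inside any allowlisted network."""
    try:
        addr = ipaddress.ip_address(client_ip)
    except ValueError:
        return False  # malformed address: reject
    return any(addr in net for net in ALLOWED_NETWORKS)
```

In practice this check usually lives in firewall rules or proxy config rather than application code; the sketch only shows the logic.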


Or use Nginx as a proxy, or something. It's frustrating, but I can see the argument for Solr to just delegate this kind of task to other projects that do it better.


The admin is a web application; I can tell this because I'm looking at my browser. ;) Not wanting to deal with the boring parts of web apps is understandable, but not business/user savvy.


While you are technically right, that kind of thinking gave people access to a ton of MongoDB databases last week.


Disclaimer: I work at Cloudera on Solr and related technologies.

I don't think there's anything out of the box in Solr 5.0 that changes that. SOLR-4470[0] should be able to do that, but it hasn't been committed. Apache Sentry[1] adds role based access control to Solr, but it's only been tested up to Solr 4.10 and with kerberos (not basic password protection). It comes nicely integrated out-of-the-box with Solr as part of Cloudera Search[2]; otherwise, you'll have to do some manual setup to get it to work.

[0] https://issues.apache.org/jira/browse/SOLR-4470 [1] https://sentry.incubator.apache.org/ [2] http://www.cloudera.com/content/cloudera/en/products-and-ser...


If you need solid, easy-to-configure security wrapped around Solr, I would recommend taking a peek at Fusion from Lucidworks: http://lucidworks.com/product/solr-enterprise/


Make your solr listen to localhost, and put a reverse-proxy system like nginx to access it via web and control authentication from there.
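For illustration, here is roughly the check a proxy performs when it enforces HTTP Basic authentication in front of a localhost-only Solr. This is a sketch of the logic only, not nginx's implementation; the credentials and the `check_basic_auth` name are made up for the example.

```python
import base64
import hmac

# Placeholder credentials for illustration only.
EXPECTED_USER = "admin"
EXPECTED_PASSWORD = "change-me"


def check_basic_auth(header_value):
    """Validate the value of an HTTP 'Authorization: Basic ...' header."""
    if not header_value or not header_value.startswith("Basic "):
        return False
    try:
        raw = base64.b64decode(header_value[len("Basic "):], validate=True)
        decoded = raw.decode("utf-8")
    except ValueError:  # covers bad base64 and bad UTF-8
        return False
    user, sep, password = decoded.partition(":")
    if not sep:
        return False
    # Constant-time comparisons avoid leaking how many characters matched.
    user_ok = hmac.compare_digest(user.encode(), EXPECTED_USER.encode())
    pass_ok = hmac.compare_digest(password.encode(), EXPECTED_PASSWORD.encode())
    return user_ok and pass_ok
```

Basic auth sends credentials merely base64-encoded, so it only makes sense over TLS.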


krat0sprakhar: Check out my blogs on this https://lucidworks.com/blog/solution-for-multi-term-synonyms...

I also did a Meetup on this just this week http://www.slideshare.net/detnavillus/the-well-tempered-sear...

Check out slide 18 - autophrasing + synonyms: Precision 100% recall 100% Bag of words OOTB Solr/Lucene NOT so!

The code is on github and is a Lucene TokenFilter so it should work. I used 4.10.3 for the Meetup demo
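A toy sketch of the autophrasing idea, assuming it works by merging known multi-word phrases into single tokens before synonym expansion. This is not the actual TokenFilter code; the phrase list reuses the "fruit extractor" rule quoted earlier in the thread, and all function names are hypothetical.

```python
# Phrases to treat as single tokens (from the synonym rule quoted in the thread).
PHRASES = {("fruit", "extractor"), ("fruit", "juicer"), ("citrus", "juicer")}

# Synonyms keyed on the merged single-token form.
SYNONYMS = {"fruit_extractor": ["fruit_juicer", "citrus_juicer"]}


def autophrase(tokens):
    """Greedily merge known two-word phrases into one underscore-joined token."""
    out = []
    i = 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if pair in PHRASES:
            out.append("_".join(pair))
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


def expand(tokens):
    """Replace each token with its synonyms, if any (mirrors the '=>' rule form)."""
    out = []
    for tok in tokens:
        out.extend(SYNONYMS.get(tok, [tok]))
    return out


def analyze(text):
    """Lowercase, split on whitespace, autophrase, then expand synonyms."""
    return expand(autophrase(text.lower().split()))
```

Because the phrase becomes one token, "fruit salad" is left alone while "fruit extractor" maps to both juicer phrases, which a bag-of-words match on "fruit" cannot guarantee.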


Still no percolator/reverse search. ElasticSearch still is my go-to search technology.


I've checked http://www.elasticsearch.org/guide/en/elasticsearch/referenc... but couldn't grasp the use case. Can anyone share some thoughts on where it would be helpful?


Think Google Alerts or Think "Tell me when price reaches $X"


Your customers pay you to let them know when new data is available.
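To make the "reverse search" idea concrete, here is a minimal in-memory sketch: queries are stored up front, and each incoming document is matched against them. It is a toy model of the concept, not the Elasticsearch percolator API; the class and method names are invented for the example.

```python
class ReverseSearch:
    """Toy percolator: store queries, then match each new document against them."""

    def __init__(self):
        self._queries = {}  # query id -> set of required terms

    def register(self, query_id, terms):
        """Store a query that matches when all of its terms appear in a document."""
        self._queries[query_id] = {t.lower() for t in terms}

    def percolate(self, document):
        """Return sorted ids of stored queries whose terms all appear in the document."""
        doc_terms = set(document.lower().split())
        return sorted(
            qid for qid, terms in self._queries.items() if terms <= doc_terms
        )
```

An alerting service would call `percolate` on every new document and notify the owners of the matching queries.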


elchief: I am a Solr guy and I do give a flying F about this. I've been remiss in submitting this to Solr though - my bad. Working on this now. Nolan's fix is also good. I referenced this work in my blog post


Looking at the Solr downloads page, I can't seem to find the Solr 5 tarball. Clicking downloads (http://lucene.apache.org/solr/mirrors-solr-latest-redir.html) redirects to the Solr 4 tarball; I added 5.0.0 to the URL and got the latest tarball: http://www.apache.org/dyn/closer.cgi/lucene/solr/5.0.0


Any reasons to prefer Solr over elastic-search?


Solr is a lot more configurable (Search Components, Update Request Processors, configurable Schemaless processing sequence, etc).

Solr includes Admin UI console which is free for production, ES has one that's only free for development.

Solr has contributors from a lot more different companies, so grows into multiple directions at once.

If you want to compare on a technical level, you can see my presentation from the Lucene/Solr Revolution back in November: http://www.slideshare.net/arafalov/solr-vs-elasticsearch-cas...

Solr is completely free. If that's not an issue for you and you are ready to pay, then you should compare Elasticsearch to LucidWorks Fusion, not directly to Solr.



Yes, that was me :-)


Search-wise? You're ending up at Lucene either way.

Scaling-wise? Distributed Elasticsearch doesn't have a Zookeeper dependency, which is nice. But Solr has more sharding flexibility and partition tolerance.

Like everything else, depends on your needs.


> partition tolerance

Yes, see [1] and [2] for a comparison between Solr and Elasticsearch. I should mention also that ES are working on the partition tolerance issue as described in [3]. I am currently using 1.4.3 and am wondering if anything was addressed in 1.4 for resiliency.

[1] http://lucidworks.com/blog/call-maybe-solrcloud-jepsen-flaky... [2] https://aphyr.com/posts/317-call-me-maybe-elasticsearch [3] http://www.elasticsearch.org/blog/resiliency-elasticsearch/


For your use case set up both and then do some testing.

Speaking for myself I did that and found that SOLR was a lot more performant. I needed a high-traffic solution without a bunch of servers.

I find that ES tries to do too much with all the dashboards and monitoring etc.

SOLR keeps it simple and thereby does not incur the performance penalty.

Also when an ES cluster goes south (cluster health "yellow" or "red") it seemed like a pain to troubleshoot and determine the real reason WHY. SOLR seems more durable and when something needs investigation you get a clear message in the log.

If you are starting something new and just don't know your traffic requirements and are scared of it "going viral" like a hot new mobile game ES may work for you though. The one thing it excels at is adding more ES servers to the cluster quickly. So if you need more servers ASAP and don't care about the cost ES has that covered.


Does Solr have the same problem as ES where you can't modify an index mapping after creating it? With ES you have to reindex everything into a new temporary index and then swap the new and old indexes. It's a terrible design, especially considering that ES already has all the original data and should be perfectly capable of doing it itself, incrementally.


Often when you want to reindex you want to reindex a lot of data - quickly. You don't want to use your same hardware, possibly. So what you do if you're in AWS, for instance, is take your data mounts and mount them to an extremely high powered set of instances. Then, reindex with the new mappings. Then, move your mounts back to your smaller instances.

You save money and can do a reindex very quickly.

A basic "reindex" command would be a cool feature though.


It is similar in SOLR (and other NoSQL data stores) currently, at least as far as I am aware. It can be quite intensive, especially since ES, SOLR, etc. like to use as much memory as they can for fast access. Any non-trivial application should have duplicate servers / instances to soak up the traffic while a change like that is migrated.


The primary limitation comes from Lucene, which powers both. With Solr, you could change the definition and reload the core, but you will be getting some weird artifacts for the previously-indexed content.

Neither Solr nor Elasticsearch should be treated as a primary data store, so that's probably why reindex-in-place is not the highest priority.


I think treating Solr/ES as a primary data store would be a horrible idea. But the problem is equally annoying if you're using it as an expendable search index.

There is no technical reason why Solr/ES could not do diff-based indexing. ES (I don't know about Solr) admittedly uses a single Lucene index per logical index, so changing a single field mapping involves reindexing the whole index, not just that one mapping.

But if the mappings were properly versioned ES could simply create a new version, index everything (from the original contents), and then swap. Locking the original index should be a non-issue.


No, they don't. Behind the scenes, it's Lucene that creates a set of inverted indexes. While extremely performant for text search, an inverted index is also destructive and partial compared to the initial dataset, so there is no way to migrate one index to another.


ES, by default, stores the original document in the nested "_source" document, which could be used to reindex data from scratch.
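A small sketch of why keeping the original document matters: the inverted index alone cannot be re-analyzed, but stored sources can be replayed into a new index and an alias repointed afterwards. Everything here is an in-memory toy with hypothetical names, not ES or Solr API calls.

```python
from collections import defaultdict


def build_index(sources, analyzer):
    """Build an inverted index (term -> set of doc ids) from stored source documents."""
    index = defaultdict(set)
    for doc_id, text in sources.items():
        for term in analyzer(text):
            index[term].add(doc_id)
    return dict(index)


def whitespace_analyzer(text):
    return text.split()


def lowercase_analyzer(text):
    return text.lower().split()


# Stored originals, analogous to keeping the source alongside the index.
sources = {1: "Solr Five", 2: "solr four"}

indexes = {"docs_v1": build_index(sources, whitespace_analyzer)}
alias = {"docs": "docs_v1"}  # searches go through the alias

# "Mapping change": rebuild under a new name, then repoint the alias in one step.
indexes["docs_v2"] = build_index(sources, lowercase_analyzer)
alias["docs"] = "docs_v2"


def search(term):
    """Look up a term in whichever index the alias currently points at."""
    return sorted(indexes[alias["docs"]].get(term, set()))
```

The old index stays available until the swap, which is the pattern described upthread; the cost is a full re-analysis of every stored document.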


Solr is backed by the Apache community while ES is backed by a private entity. Is that a good reason?


Not really, no. If ElasticsearchCo were to go belly-up, the technology would almost certainly be backed by someone else.


I don't know what you mean when you say backed by the Apache Community, but Elasticsearch is also an open source, apache-licensed project that has a commercial, private counterpart. I think every major open source project has this now (Hadoop has Cloudera/Hortonworks/MapR, Spark has databricks).


>> I don't know what you mean when you say backed by the Apache Community

It's a matter of who does the releasing (i.e. curating of patches, responsibility for building community, etc.): Solr is governed by the Apache Software Foundation and associated volunteers, ElasticSearch is governed by the private entity.

IMO (and this is an increasingly rare opinion) the license is the most important piece and they're both under the ASL 2.0.


Governing alone - I don't think that's a good reason. There are plenty of Apache projects that are effectively governed by a private entity (Cassandra, Samza, Kafka, and these are all in the "big data" space), that would most likely be as useful as Elasticsearch if their supporting companies disappeared overnight.


No. It'd be like preferring memcache over redis. You'll find warriors in both camps, but most people have forgotten there was a debate.



