Elasticsearch for Beginners: Indexing Your GMail Inbox (github.com/oliver006)
216 points by SuperKlaus on Jan 1, 2015 | 32 comments



I've also been writing a whole blog series on this: http://bitquabit.com/post/having-fun-python-and-elasticsearc... . It's interesting to see a different take on it.


I recently posted something similar focused on NLP analysis of Gmail messages in Python/pandas, with some notes on storing in Elasticsearch as well; glad to see others are covering that side of it, as I'd love to come back to this project and take it a bit further! http://engineroom.trackmaven.com/blog/monthly-challenge-natu...


This is a totally shameless plug but if you'd like to learn Elasticsearch from scratch, I've got an introductory course up on Pluralsight. http://www.pluralsight.com/courses/elasticsearch-for-dotnet-...


This is the first time I've seen GitHub READMEs used as a blogging tool. Is this common? I've started linking to a Vagrant/Ansible repo for my setup and code-intensive posts, but having the code and the text encapsulated in one repo is quite novel.



I've been using github pages as a blogging tool since it makes linking between various posts easier but if you're primarily writing about code this may be even better since the code is easily accessible.


Thanks. Markdown provides most of the formatting needed so it seemed like a good choice to quickly publish a tutorial without setting up a blog.


It's a neat idea, but be aware that you're hosting your words on someone else's platform. Annoy the wrong person, and it could all go away.

At least it's a git repo so you should have local copies, but still, that caveat applies.


There are a couple of libraries listed below. Would using any of them make life easier with ElasticSearch + Python?

- https://github.com/elasticsearch/elasticsearch-py (low level lib, from ES)

- https://github.com/elasticsearch/elasticsearch-dsl-py (high level lib, from ES)

- https://github.com/mozilla/elasticutils (high level lib from Mozilla)

There are a few more, but they are either obsolete or don't have much traction. There's also django-haystack, but that's specific to django.


We (Infoxchange) use the official elasticsearch-py. It's not without its frustrations, I believe, but we've integrated Elasticsearch with part of our large database of health services in Australia as part of a complete rewrite (Django) of a very old application (Perl).

You can try it out here: https://www2.hsnet.nsw.gov.au

Try searching for something like: psychiatrists near sydney cbd

The site itself is mostly just a front-end (as designed by / for a client) running across several Docker containers. The database (PostgreSQL) is indexed into Elasticsearch and queried via the Elasticsearch API / Python ES libraries.

If you're interested in Elasticsearch with Python / Django check out our (pretty crappy at the moment) tech blog: https://ixa.io or our github: https://github.com/infoxchange


We use https://github.com/elasticsearch/elasticsearch-py. It will certainly make life easier.


I use elasticsearch-dsl-py for constructing and testing complex queries. It's great, much easier than debugging deeply nested JSON queries.
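For readers who haven't seen one, the kind of deeply nested JSON query meant here looks roughly like the following when written by hand as a Python dict. This is only a sketch using the standard library: the helper `build_query` and the field names `subject` and `size` are hypothetical, not taken from the tutorial. elasticsearch-dsl-py provides builder objects that produce this sort of structure for you.

```python
import json


def build_query(text, min_size_bytes):
    """Return a hand-written bool query body as a plain dict.

    `subject` and `size` are hypothetical field names used for illustration.
    """
    return {
        "query": {
            "bool": {
                "must": [
                    # Full-text match on the (assumed) subject field.
                    {"match": {"subject": text}},
                    # Numeric range restriction on the (assumed) size field.
                    {"range": {"size": {"gte": min_size_bytes}}},
                ]
            }
        }
    }


body = build_query("invoice", 1000000)
# Serialize to the JSON that would be sent as the search request body.
print(json.dumps(body, indent=2))
```

Even with only two clauses the nesting is four levels deep, which is where a misplaced brace becomes hard to spot.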


I've had a very positive experience with elasticutils. It provides filtering syntax very similar to Django ORM filters.


I've been thinking about making my own email searchable with elasticsearch. The main thing holding me back is security. With elasticsearch listening on localhost:9200, anyone with local access can read all your mail. Even if you did this on a computer over which you have full control, even a tiny breach would leak all your mail.

I realize this tutorial is just meant to get started with elasticsearch and not meant as a tool to make your email searchable. Still would be interesting to take this to the next level.


Not sure if people are still here. I tried moving through this and it appears to be failing on the import... I am running a vagrant and get everything installed just fine.

I don't know how to invoke the script properly...

I've tried so many ways. This seems like it would give results... though it does nothing much.

python index_emails.py test.mbox

Any help or tips are appreciated! This has been a fun project so far. Stumbling at the end. Thanks!


Trial and error, plus throwing -vv after python2.7, resulted in some sort of usage directions on standard out.

python2.7 index_emails.py --infile=test.mbox

above is working


Just a word of caution: elasticsearch allows everyone access to the indexed data, by default. If you're doing this on a world-reachable machine with sensitive data, you should probably lock it down or make sure it's locked down.

There are a number of authentication solutions, and they will require additional configuration: plugins like jetty and elasticsearch-http-basic.


The whole point of GMail was supposed to be that it was searchable. Did Google break that, or what?

If there's a demand for this, it might be worthwhile to build IMAP servers with more indexing. It's easy to request searches with IMAP, but the performance can be a problem for IMAP servers that aren't real databases.


If you want to get started with something like ElasticSearch, it helps to have a large dataset to play with. Using your gmail/mail archive gives you plenty of weirdly shaped data to have fun with.

I don't think this has much to do with the search in gmail not being sufficient or broken.


Very interesting. This is a very useful and practical way of learning new things instead of reading an article about it. I don't know python programming but I was able to understand each and every bit of it and I will be coming back to this if I ever need to incorporate Elasticsearch.


The 'notmuch' mail indexing system uses Xapian. I can grep through my 200k messages in seconds.

http://notmuchmail.org/

Since it's implemented as a "library" of sorts, there are interfaces for emacs, command line, GTK, mutt, ...


Wow, little tutorials like this with easily attainable data are so helpful. Thanks for posting.


Analysing the "Turn mbox into JSON" section

http://paste.lisp.org/display/145050
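For anyone who can't reach the paste, here is a rough sketch of what an mbox-to-JSON step can look like using only the Python standard library. It is not the tutorial's code: `mbox_to_dicts`, `build_sample_mbox` and `demo` are hypothetical names, the sample message is made up, and real mail needs more care (multipart bodies, repeated headers, odd encodings).

```python
import json
import mailbox
import os
import tempfile
from email.message import EmailMessage


def mbox_to_dicts(path):
    """Yield one JSON-serializable dict per message in the mbox file at `path`."""
    box = mailbox.mbox(path, create=False)
    try:
        for msg in box:
            # Lowercased header names make predictable JSON keys.
            # Repeated headers (e.g. Received) keep only the last value here.
            doc = {name.lower(): str(value) for name, value in msg.items()}
            # get_payload(decode=True) returns None for multipart messages.
            payload = msg.get_payload(decode=True)
            doc["body"] = payload.decode("utf-8", errors="replace") if payload else ""
            yield doc
    finally:
        box.close()


def build_sample_mbox(path):
    """Write a one-message mbox file so the sketch can run without real mail."""
    msg = EmailMessage()
    msg["From"] = "alice@example.com"
    msg["To"] = "bob@example.com"
    msg["Subject"] = "Hello"
    msg["X-Gmail-Labels"] = "Inbox,Important"
    msg.set_content("Just testing.")
    box = mailbox.mbox(path)
    box.add(msg)
    box.close()  # close() also flushes pending writes to disk


def demo():
    """Round-trip the sample mbox and return one JSON string per message."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "sample.mbox")
        build_sample_mbox(path)
        return [json.dumps(doc) for doc in mbox_to_dicts(path)]


for line in demo():
    print(line)
```

Each resulting dict is the sort of document you could then hand to an Elasticsearch client for indexing.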


What was the performance like for those queries?


I have > 110k messages indexed and responses came back within ~300ms, less than 100ms with a warm cache.


Couldn't this be "Indexing your mbox files"? It seems applicable to any mailbox that is in or can be in that format. Except for the x-gmail-labels part, of course.

Anyway if you do feel like you want to accomplish the stated purpose of finding which emails are taking up space, you can search in gmail with the word "larger", as in "larger:20MB".


so when should you use elasticsearch? can't you get away with doing

    SELECT id FROM pages WHERE title LIKE "%elastic"


The search you mentioned there specifically is problematic on any large data set, because in its current (unoptimized) state it won't be able to use index seeking to arrive at an answer.

This means that at best it will need to do an index scan (in practice a full scan of an index that holds a subset of the table data). The results of a query like that can't be obtained fast, at least not milliseconds fast, and it's not well suited for caching or reuse.
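A quick way to see the difference between a seek and a scan is to ask a database for its query plan. The sketch below uses SQLite only because it ships with Python; the table and index names are made up, other databases have their own planners, and the exact plan wording varies between SQLite versions.

```python
import sqlite3


def make_db():
    """Create an in-memory table with an index on the title column."""
    conn = sqlite3.connect(":memory:")
    # NOCASE collation lets SQLite use the index for its case-insensitive LIKE.
    conn.execute(
        "CREATE TABLE pages (id INTEGER PRIMARY KEY, title TEXT COLLATE NOCASE)"
    )
    conn.execute("CREATE INDEX idx_pages_title ON pages (title)")
    return conn


def query_plan(conn, sql):
    """Return the human-readable detail column of EXPLAIN QUERY PLAN."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return [row[3] for row in rows]


conn = make_db()
# Pattern with a fixed prefix: the planner can seek into the index.
print(query_plan(conn, "SELECT id FROM pages WHERE title LIKE 'elastic%'"))
# Pattern with a leading wildcard: every index entry has to be examined.
print(query_plan(conn, "SELECT id FROM pages WHERE title LIKE '%elastic'"))
```

On the SQLite builds I'm aware of, the first plan is reported as a SEARCH and the second as a SCAN, which is the index scan described above.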

There are (obviously) ways around this, but they all involve indirection and workarounds.

Anyway, as a rule of thumb: Any time you want to do general full-text search queries on relational data, the LIKE operator will very quickly meet its limits and depending on how much outside its bounds you want to go, you will find yourself having to do quite a bit of work.

And in those cases you might want to use the full-text capabilities of your RDBMS if it has any. Or maybe something like Elasticsearch might be a better fit.

Disclaimer: I've never used Elasticsearch myself.


Those sorts of SQL queries aren't very fast compared to a dedicated index/search system like ElasticSearch. Especially when millions of rows are concerned.

Think of search engines like ElasticSearch and Solr as being purpose built for "search" rather than ad hoc querying.

They offer more advanced search features like faceting and synonyms. If your example had been "SELECT id FROM pages WHERE title LIKE '%dog'", you could set things up so that matches for 'dog', 'dogs', 'doggie', 'puppy', 'pup', 'canine', and 'man's best friend' all returned the same results.
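To make the synonym idea concrete, here is a toy sketch in plain Python. It is not how Elasticsearch or Solr implement their analyzers; the synonym table, function names and sample titles are all invented for illustration, and multi-word synonyms like 'man's best friend' are left out because they need phrase handling.

```python
from collections import defaultdict

# Hypothetical synonym table: every variant maps to one canonical term.
SYNONYMS = {
    "dogs": "dog",
    "doggie": "dog",
    "puppy": "dog",
    "pup": "dog",
    "canine": "dog",
}


def analyze(text):
    """Lowercase, split on whitespace, and map each token to its canonical term."""
    return [SYNONYMS.get(token, token) for token in text.lower().split()]


def build_index(docs):
    """Build an inverted index: canonical term -> set of document ids."""
    index = defaultdict(set)
    for doc_id, title in docs.items():
        for term in analyze(title):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents that contain every analyzed query term."""
    terms = analyze(query)
    if not terms:
        return set()
    return set.intersection(*(index.get(term, set()) for term in terms))


pages = {
    1: "Walking your dog",
    2: "Puppy training basics",
    3: "Canine nutrition",
    4: "Cat toys",
}
index = build_index(pages)
print(sorted(search(index, "dogs")))
print(sorted(search(index, "pup training")))
```

The point is that the same normalization runs at index time and at query time, so any variant finds documents written with any other variant, and the lookup is a dictionary access rather than a scan over every title.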


While you are absolutely correct that this won't scale, and doesn't have a lot of advanced features like faceting and synonyms, it's still a useful technique. I use the SQL LIKE operator all the time, especially when I just need a simple way of searching my data.

If you have under a million rows, LIKE is reasonably performant.


Would LOVE to see this in Ruby rather than Python. My boss wants me to learn ElasticSearch.


me too...



