I recently posted something similar focused on NLP analysis of Gmail messages in Python/pandas, with some notes on storing in Elasticsearch as well; glad to see others are covering that side of it, as I'd love to come back to this project and take it a bit further! http://engineroom.trackmaven.com/blog/monthly-challenge-natu...
It's the first time I've seen GitHub READMEs being used as a blogging tool. Is this common? I've started linking to a Vagrant/Ansible repo for my setup-/code-intensive posts, but having the code and the text encapsulated together in a repo is quite novel.
I've been using GitHub Pages as a blogging tool since it makes linking between various posts easier, but if you're primarily writing about code this may be even better, since the code is easily accessible.
We (Infoxchange) use the official elasticsearch-py. It's not without its frustrations, but we've integrated Elasticsearch with part of our large database of health services in Australia as part of a complete rewrite (Django) of a very old application (Perl).
Try searching for something like: psychiatrists near sydney cbd
The site itself is mostly just a front-end (as designed by/for a client) running across several Docker containers: the database (PostgreSQL) is indexed into Elasticsearch and queried via the Elasticsearch API / Python ES libraries.
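For a sense of what the query side looks like, this is roughly the shape of it with elasticsearch-py (index and field names here are placeholders, not our real schema):

    from elasticsearch import Elasticsearch

    es = Elasticsearch()  # assumes ES on localhost:9200

    results = es.search(index="services", body={
        "query": {"match": {"description": "psychiatrists near sydney cbd"}}
    })

    for hit in results["hits"]["hits"]:
        print(hit["_score"], hit["_source"].get("name"))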
I've been thinking about making my own email searchable with Elasticsearch. The main thing holding me back is security. With Elasticsearch listening on localhost:9200, anyone with local access can read all your mail. Even if you did this on a computer over which you have full control, even a tiny breach would leak all your mail.
I realize this tutorial is just meant to get you started with Elasticsearch and isn't meant as a tool to make your email searchable. Still, it would be interesting to take this to the next level.
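To make the concern concrete: with the default setup, any local process can hit the HTTP API directly, no credentials needed (the index name below is just a guess at what the tutorial creates):

    import requests

    # anything running locally can do this against an unsecured node
    r = requests.get("http://localhost:9200/gmail/_search",
                     params={"q": "password reset"})
    print(r.json()["hits"]["total"])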
Not sure if people are still here. I tried working through this and it appears to be failing on the import... I'm running it in Vagrant and got everything installed just fine.
I don't know how to invoke the script properly...
I've tried so many ways. This seems like it would give results... though it does nothing much.
python index_emails.py test.mbox
Any help or tips are appreciated! This has been a fun project so far. Stumbling at the end. Thanks!
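For reference, this is roughly what I'd expect the indexing itself to boil down to (my own sketch using mailbox and elasticsearch-py, not the tutorial's actual script, so the index/type names are assumptions). If anyone can spot where my setup differs, please shout:

    import sys
    import mailbox
    from elasticsearch import Elasticsearch

    es = Elasticsearch()  # assumes ES running on localhost:9200

    mbox = mailbox.mbox(sys.argv[1])  # e.g. test.mbox
    for i, msg in enumerate(mbox):
        doc = {
            "from": msg.get("From"),
            "to": msg.get("To"),
            "subject": msg.get("Subject"),
            "date": msg.get("Date"),
        }
        es.index(index="gmail", doc_type="email", id=i, body=doc)
    print("indexed", len(mbox), "messages")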
Just a word of caution: Elasticsearch allows everyone access to the indexed data by default. If you're doing this on a world-reachable machine with sensitive data, you should probably lock it down or make sure it's locked down.
There are a number of authentication solutions, and they will require additional configuration: plugins like jetty and elasticsearch-http-basic.
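Once something like that is in front of the cluster, the Python client side is just a matter of passing credentials along (the values below are placeholders):

    from elasticsearch import Elasticsearch

    # http_auth sends HTTP basic auth with every request
    es = Elasticsearch(["localhost:9200"],
                       http_auth=("search_user", "not-a-real-password"))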
The whole point of GMail was supposed to be that it was searchable. Did Google break that, or what?
If there's a demand for this, it might be worthwhile to build IMAP servers with more indexing. It's easy to request searches with IMAP, but the performance can be a problem for IMAP servers that aren't real databases.
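To illustrate: asking the server to do the search is already a one-liner with Python's imaplib (host and credentials below are placeholders); the slow part is whatever the server does to answer it:

    import imaplib

    imap = imaplib.IMAP4_SSL("imap.example.com")
    imap.login("user@example.com", "app-password")
    imap.select("INBOX", readonly=True)

    # the SEARCH runs entirely server-side; speed depends on its indexes
    status, data = imap.search(None, 'TEXT', '"quarterly report"')
    print(status, data[0].split()[:10])  # first few matching message ids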
If you want to get started with something like Elasticsearch, it helps to have a large dataset to play with. Using your Gmail/mail archive gives you plenty of weirdly shaped data to have fun with.
I don't think this has much to do with Gmail's search being insufficient or broken.
Very interesting. This is a very useful and practical way of learning new things, as opposed to just reading an article about them. I don't know Python, but I was able to understand every bit of it, and I'll be coming back to this if I ever need to incorporate Elasticsearch.
Couldn't this be "Indexing your mbox files"? It seems applicable to any mailbox that is in or can be in that format. Except for the x-gmail-labels part, of course.
Anyway if you do feel like you want to accomplish the stated purpose of finding which emails are taking up space, you can search in gmail with the word "larger", as in "larger:20MB".
The search you mentioned there is problematic on any large dataset, because in its current (unoptimized) state it won't be able to use index seeking to arrive at an answer.
This means that at best it will need to do an index scan (in practice, a full table scan over the index, which consists of a subset of the table data). The results of a query like that can't be obtained fast (at least not milliseconds fast), and it's not well suited for caching or reuse.
There are (obviously) ways around this, but they all involve indirection and workarounds.
Anyway, as a rule of thumb: any time you want to do general full-text search queries on relational data, the LIKE operator will very quickly meet its limits, and depending on how far outside its bounds you want to go, you'll find yourself having to do quite a bit of work.
And in those cases you might want to use the full-text capabilities of your RDBMS, if it has any. Or something like Elasticsearch might be a better fit.
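For the RDBMS route, a minimal sketch with Postgres (table and column names are made up); paired with a GIN index on the tsvector, this can actually use an index instead of scanning:

    import psycopg2

    conn = psycopg2.connect("dbname=mail")  # placeholder connection string
    cur = conn.cursor()
    cur.execute("""
        SELECT id, subject
          FROM emails
         WHERE to_tsvector('english', body) @@ plainto_tsquery('english', %s)
    """, ("quarterly report",))
    print(cur.fetchall()[:5])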
Those sorts of SQL queries aren't very fast compared to a dedicated index/search system like ElasticSearch. Especially when millions of rows are concerned.
Think of search engines like ElasticSearch and Solr as being purpose built for "search" rather than ad hoc querying.
They offer more advanced search features like faceting and synonyms. If your example had been "SELECT id FROM pages WHERE title LIKE '%dog%'", you could set things up so that searches for 'dog', 'dogs', 'doggie', 'puppy', 'pup', 'canine', and 'man's best friend' all returned the same results.
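A rough sketch of how that could look in Elasticsearch (index/type/field names are made up; the interesting bit is the synonym token filter):

    from elasticsearch import Elasticsearch

    es = Elasticsearch()
    es.indices.create(index="pages", body={
        "settings": {
            "analysis": {
                "filter": {
                    "dog_synonyms": {
                        "type": "synonym",
                        "synonyms": ["dog, dogs, doggie, puppy, pup, canine, man's best friend"]
                    }
                },
                "analyzer": {
                    "title_analyzer": {
                        "tokenizer": "standard",
                        "filter": ["lowercase", "dog_synonyms"]
                    }
                }
            }
        },
        "mappings": {
            "page": {
                "properties": {
                    "title": {"type": "string", "analyzer": "title_analyzer"}
                }
            }
        }
    })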
While you are absolutely correct that this won't scale, and doesn't have a lot of advanced features like faceting and synonyms, it's still a useful technique. I use the SQL LIKE operator all the time, especially when I just need a simple way of searching my data.
If you're under a million rows, LIKE is reasonably performant.