k0emt's random access bear: Getting the Enron mail database into MongoDB

Background

To get some experience working with Python and MongoDB, I decided to find a data source with a lot of free-form text. Pulling that data into MongoDB would be good practice, and at a future date I’ll have a ready data source for learning NLTK.
Finding a large dataset that met my needs turned out to be harder than expected. I came across Hilary Mason’s link page of research-quality data sets (now a dead link; see my collection, which includes as much of Hilary's as we could recover) and found the Enron email dataset. This data set contains over 500K emails, stored as individual files in a directory structure. To me, the first step in being able to use the data was to get it into a database where I could query it.

The environment

The code was developed on an Intel Core i7 machine with 6 GB of RAM and a 5400 RPM hard disk. The code is I/O intensive and could have benefited from a faster hard disk or an SSD. The host OS was 64-bit Ubuntu 11.04. The tools used were Python 2.7 with pymongo and MongoDB.

The code

How to query

You can use the mongo shell to do some queries once you have loaded the data.
Switch to the database with “use enron_mail” and you can do the following:
db.messages.find({ contents : /query text/i }).limit(1).skip(0);
Besides contents, the document structure also includes these fields: mailbox, subFolder, and filename.
Here are some additional links with material on the shell and how to query:
http://www.mongodb.org/display/DOCS/Overview+-+The+MongoDB+Interactive+Shell
http://www.mongodb.org/display/DOCS/Tutorial
http://www.mongodb.org/display/DOCS/Querying
http://www.mongodb.org/display/DOCS/Advanced+Queries
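For anyone who would rather query from Python than from the shell, here is a minimal sketch of the same search. It is not part of the original code: the helper names `contents_filter` and `find_messages` are made up for illustration, and it assumes a PyMongo collection object (anything with a compatible `find` method works). Unlike the shell example, it escapes the search text so that it is matched literally rather than as a regular expression.

```python
import re


def contents_filter(text):
    """Build a case-insensitive filter on the "contents" field.

    re.escape is a choice made for this sketch: the text is matched
    literally instead of being interpreted as a regular expression.
    """
    return {"contents": {"$regex": re.escape(text), "$options": "i"}}


def find_messages(collection, text, limit=1, skip=0):
    """Return matching documents as a list, mirroring .limit(1).skip(0)."""
    cursor = collection.find(contents_filter(text)).skip(skip).limit(limit)
    return list(cursor)
```

With PyMongo installed and the data loaded, something like `find_messages(MongoClient()["enron_mail"]["messages"], "query text")` should return the first match. Keep in mind that an unanchored regex like this one has to scan every document, so it will be slow on the full collection.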

Analysis

Important things to note about the code:

Change the MAIL_DIR_PATH variable to match your installation.

getFileContents decodes the text of each file as cp1252.

saveToDatabase encodes the text as UTF-8 for MongoDB compatibility.

The os.walk method is key to the simplicity of this code.
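To make the points above concrete, here is a rough sketch of how such a loader could be put together. It is a reconstruction under assumptions, not the original code: it is written for Python 3 rather than 2.7, the function names are snake_case stand-ins for getFileContents and saveToDatabase, and it assumes the first directory level under the mail directory is the mailbox and anything below that is the sub folder. In Python 3 there is no explicit UTF-8 encoding step, because PyMongo stores `str` values as UTF-8 on its own.

```python
import os


def get_file_contents(path):
    """Read one message file and decode it as cp1252."""
    with open(path, "rb") as handle:
        # errors="replace" is an assumption of this sketch: cp1252 leaves
        # a few byte values undefined, and this keeps the load from aborting.
        return handle.read().decode("cp1252", errors="replace")


def iter_messages(mail_dir):
    """Walk mail_dir and yield one document per file found."""
    for root, _dirs, files in os.walk(mail_dir):
        relative = os.path.relpath(root, mail_dir)
        parts = [] if relative == os.curdir else relative.split(os.sep)
        for name in sorted(files):
            yield {
                # Assumed layout: <mail_dir>/<mailbox>/<sub folders...>/<file>
                "mailbox": parts[0] if parts else "",
                "subFolder": "/".join(parts[1:]),
                "filename": name,
                "contents": get_file_contents(os.path.join(root, name)),
            }


def save_to_database(collection, mail_dir):
    """Insert every message into collection and return how many were saved."""
    count = 0
    for document in iter_messages(mail_dir):
        collection.insert_one(document)
        count += 1
    return count
```

With PyMongo installed, passing `MongoClient()["enron_mail"]["messages"]` as `collection` and your MAIL_DIR_PATH as `mail_dir` should load the whole tree. Inserting one document at a time keeps the sketch simple; batching with `insert_many` would likely be faster.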

Here are some references on unicode:
http://docs.python.org/howto/unicode.html
http://stackoverflow.com/questions/4685568/importing-file-with-unknown-encoding-from-python-into-mongodb
The full run took about 21 minutes. It followed an initial run, so a bunch of files were probably already in cache. The run maxed out a single core of the CPU. The process was I/O bound on the hard drive. The full 6 GB of RAM was in use on the machine.
My query of MongoDB says I have 517,424 emails in the document store. It shouldn’t be too difficult to modify the code to work with your database of choice.
I hope you find this code useful and that it enables you to do some analysis with this dataset.

Addendum

Brendan McAdams @rit created a version of the code which utilizes the Python email library to produce a database with more metadata. You can see the results of his work here: http://mongodb-enron-email.s3-website-us-east-1.amazonaws.com/ (now a dead link)
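Since that link is dead, here is a small sketch of what pulling metadata out with the standard library's email package can look like. This is not Brendan's code. The `parse_message` helper and the choice of fields are assumptions made for illustration, and it assumes plain, non-multipart messages.

```python
from email import message_from_string


def parse_message(contents):
    """Split raw message text into a few header fields and a body."""
    message = message_from_string(contents)
    return {
        # Headers that are missing from the message come back as None.
        "from": message.get("From"),
        "to": message.get("To"),
        "subject": message.get("Subject"),
        "date": message.get("Date"),
        # For a non-multipart message the payload is the body text;
        # a multipart message would return a list of parts instead.
        "body": message.get_payload(),
    }
```

Each returned dictionary could be stored next to, or instead of, the raw contents field, which would make it possible to query on sender or subject directly.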
