Background
To get some experience working with Python and MongoDB, I decided to find a data source with a lot of free-form text. This would give me experience pulling data into MongoDB, and at a future date I’ll have a ready data source for learning NLTK.
Finding a large dataset that met my needs turned out to be harder than expected. I came across Hilary Mason’s link page of research-quality data sets (now a dead link; see my collection, which includes as much of Hilary's as we could recover) and found the Enron email dataset. This data set contains over 500K emails, stored as individual files in a directory structure. To me, the first step in being able to use the data is to get it into a database where I can query it.
The environment
The code was developed on an Intel Core i7 machine with 6 GB of RAM and a 5400 RPM hard disk. This code is I/O intensive and could have benefitted from a faster hard disk or an SSD. The host OS was Ubuntu 11.04 64-bit. The tools used were Python 2.7 with pymongo and MongoDB.
The code
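A minimal sketch of this kind of loader, not the original code, might look like the following. It assumes the dataset is laid out as one directory per mailbox with sub-folders beneath it, and it uses the field names mailbox, subFolder, filename and contents that the queries in this post rely on. The function names iter_messages and load_into_mongo, the batch size, and the choice to decode as UTF-8 with replacement characters are my own assumptions. It is written for Python 3 and a current pymongo (MongoClient, insert_many) rather than the Python 2.7 setup described above.

```python
import os


def iter_messages(maildir_root):
    """Yield one document (a dict) per email file found under maildir_root."""
    for dirpath, _dirnames, filenames in os.walk(maildir_root):
        rel = os.path.relpath(dirpath, maildir_root)
        if rel == os.curdir:
            continue  # files directly under the root belong to no mailbox
        parts = rel.split(os.sep)
        mailbox = parts[0]                # first path component, e.g. a user name
        sub_folder = "/".join(parts[1:])  # remaining components, may be empty
        for name in sorted(filenames):
            with open(os.path.join(dirpath, name), "rb") as handle:
                raw = handle.read()
            yield {
                "mailbox": mailbox,
                "subFolder": sub_folder,
                "filename": name,
                # Assumption: undecodable bytes are replaced, not rejected.
                "contents": raw.decode("utf-8", errors="replace"),
            }


def load_into_mongo(maildir_root, batch_size=1000):
    """Insert every message into enron_mail.messages in batches."""
    # Imported here so iter_messages can be used without pymongo installed.
    from pymongo import MongoClient

    messages = MongoClient()["enron_mail"]["messages"]  # default localhost:27017
    batch = []
    for doc in iter_messages(maildir_root):
        batch.append(doc)
        if len(batch) >= batch_size:
            messages.insert_many(batch)
            batch = []
    if batch:
        messages.insert_many(batch)
```

Calling load_into_mongo with the path to the unpacked mail directory would populate the collection. Batching the inserts keeps the number of round trips to the server down, although with a slow disk the file reads are likely to dominate anyway.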
How to query
You can use the mongo shell to run some queries once you have loaded the data. Switch to the enron_mail database with “use enron_mail” and you can do the following:
db.messages.find({ contents : /query text/i }).limit(1).skip(0);
Besides contents, the document structure also includes: mailbox, subFolder and filename.
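The same search can be run from Python. Here is a rough pymongo equivalent of the shell query, assuming a MongoDB server on the default localhost port and a current pymongo; the helper names contents_filter and find_messages are made up for this example. Unlike the shell version, it escapes the search text so that it is matched literally rather than as a regular expression.

```python
import re


def contents_filter(text):
    """Build a case-insensitive filter on the contents field."""
    # pymongo sends a compiled Python pattern to the server as a regex.
    return {"contents": re.compile(re.escape(text), re.IGNORECASE)}


def find_messages(text, limit=1, skip=0):
    """Return matching documents from enron_mail.messages as a list."""
    # Imported here so the filter helper works without pymongo installed.
    from pymongo import MongoClient

    messages = MongoClient()["enron_mail"]["messages"]
    cursor = messages.find(contents_filter(text)).skip(skip).limit(limit)
    return list(cursor)
```

Each returned document is a dict, so doc["mailbox"], doc["subFolder"] and doc["filename"] tell you where a hit came from. An unanchored regex like this cannot use an index, so expect it to scan the whole collection.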
Here are some additional links with material on the shell and how to query:
http://www.mongodb.org/display/DOCS/Overview+-+The+MongoDB+Interactive+Shell
http://www.mongodb.org/display/DOCS/Tutorial
http://www.mongodb.org/display/DOCS/Querying
http://www.mongodb.org/display/DOCS/Advanced+Queries
Analysis
Important things to note about the code:
Here are some references on unicode:
http://docs.python.org/howto/unicode.html
http://stackoverflow.com/questions/4685568/importing-file-with-unknown-encoding-from-python-into-mongodb
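Files of unknown encoding are the main reason those references matter: MongoDB stores strings as UTF-8, so raw bytes have to be decoded before they are inserted. One common approach, shown here as an assumption and not necessarily what the original code does, is to try a short list of encodings in order and fall back to latin-1, which can decode any byte sequence. The function name to_unicode and the choice of candidate encodings are illustrative.

```python
def to_unicode(raw, encodings=("utf-8", "cp1252")):
    """Decode raw bytes using the first encoding that accepts them."""
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue  # try the next candidate
    # latin-1 maps every byte to a code point, so this never raises.
    return raw.decode("latin-1")
```

The result is always a text string that pymongo can store, at the cost of possibly mis-rendering a few characters when the guess is wrong.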
The full run took ~21 minutes after an initial run that probably had a bunch of files in cache. The run maxed out a single core of the CPU. The process was I/O bound on the hard drive. The full 6 GB of RAM was in use on the machine.
My query of MongoDB says I have 517,424 emails in the document store. It shouldn’t be too difficult to modify the code to work with your database of choice.
I hope you find this code useful and that it enables you to do some analysis with this dataset.