Getting the Enron mail database into MongoDB

Background

To get some experience working with Python and MongoDB, I wanted a data source with a lot of free-form text. Loading it would give me practice pulling data into MongoDB, and later I would have a ready data source for learning NLTK.
Finding a large dataset that met my needs turned out to be harder than expected. I came across Hilary Mason’s link page of research-quality data sets (now a dead link; see my collection, which includes as much of Hilary's list as we could recover) and found the Enron email dataset. This data set contains over 500K emails, stored as individual files in a directory structure. To me, the first step toward using the data was to get it into a database where I could query it.

The environment

The code was developed on an Intel Core i7 machine with 6 GB of RAM and a 5400 RPM hard disk. The code is I/O intensive and would have benefited from a faster hard disk or an SSD. The host OS was 64-bit Ubuntu 11.04. The tools used were Python 2.7, pymongo, and MongoDB.

The code

import os
import datetime
from pymongo import MongoClient

__author__ = 'k0emt'

MAIL_DIR_PATH = '/Users/k0emt/Projects/enron/enron_mail_20110402/maildir'
PREFIX_TRIM_AMOUNT = len(MAIL_DIR_PATH) + 1
# a limit of 0 means "no limit"
MAX_USER_RUN_LIMIT = 50
MAX_USER_EMAILS_PER_FOLDER_FILE_LIMIT = 2
counter = 1


def get_file_contents(file_to_open_name):
    # read the whole file, then decode the raw bytes as cp1252 (Python 2 str -> unicode)
    data_file = open(file_to_open_name)
    file_contents = ""
    try:
        for data_line in data_file:
            file_contents += data_line
    finally:
        data_file.close()
    return file_contents.decode('cp1252')


def save_to_database(mailbox_owner_name, sub_folder, file_name, message_contents):
    document = {"mailbox": mailbox_owner_name,
                "subFolder": sub_folder,
                "filename": file_name,
                "contents": message_contents.encode('utf-8')}
    messages = db.messages
    # insert() is the pymongo 2.x call; pymongo 3+ provides insert_one() instead
    messages.insert(document)
    return


cn = MongoClient('localhost')
db = cn.enron_mail
print("database initialized {0}".format(datetime.datetime.now()))

# all the mail folders
user_counter = 0
previous_owner = ""
for root, dirs, files in os.walk(MAIL_DIR_PATH, topdown=False):
    directory = root[PREFIX_TRIM_AMOUNT:]

    # extract mail box owner
    parts = directory.split('/', 1)
    mailbox_owner = parts[0]
    if previous_owner != mailbox_owner:
        previous_owner = mailbox_owner
        user_counter += 1

    # sub-folder info
    if 2 == len(parts):
        subFolder = parts[1]
    else:
        subFolder = ''

    # files in each mail folder
    folder_email_counter = 0
    for file in files:
        # get the file contents
        name_of_file_to_open = "{0}/{1}".format(root, file)
        contents = get_file_contents(name_of_file_to_open)
        save_to_database(mailbox_owner, subFolder, file, contents)
        folder_email_counter += 1
        counter += 1
        if counter % 100 == 0:
            print("{0} {1}".format(counter, datetime.datetime.now()))
        if MAX_USER_EMAILS_PER_FOLDER_FILE_LIMIT > 0 and MAX_USER_EMAILS_PER_FOLDER_FILE_LIMIT == folder_email_counter:
            break

    if MAX_USER_RUN_LIMIT > 0 and MAX_USER_RUN_LIMIT == user_counter:
        print("Maximum users limit {0} met.".format(MAX_USER_RUN_LIMIT))
        break

# close the client connection (the database object itself has no close method)
cn.close()
print("database closed {0}".format(datetime.datetime.now()))
print("{0} total records processed".format(counter - 1))
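
The directory-walking part of the script can be tried on its own, without the dataset or a database. What follows is only a minimal sketch, not the script itself: list_messages is a hypothetical helper, the "alice" folders and msg1/msg2 files are made-up stand-ins for the maildir tree, and it splits on os.sep and uses the default top-down walk rather than topdown=False.

```python
import os
import shutil
import tempfile


def list_messages(mail_dir_path):
    """Yield (mailbox_owner, sub_folder, file_name) for each file under mail_dir_path."""
    prefix_trim_amount = len(mail_dir_path) + 1
    for root, dirs, files in os.walk(mail_dir_path):
        relative = root[prefix_trim_amount:]
        if not relative:
            continue  # the top-level directory itself belongs to no mailbox
        # first path component is the mailbox owner, the rest is the sub-folder
        parts = relative.split(os.sep, 1)
        mailbox_owner = parts[0]
        sub_folder = parts[1] if len(parts) == 2 else ''
        for file_name in files:
            yield mailbox_owner, sub_folder, file_name


# Build a tiny stand-in for the maildir tree (all names are made up).
demo_root = tempfile.mkdtemp()
os.makedirs(os.path.join(demo_root, 'alice', 'inbox'))
os.makedirs(os.path.join(demo_root, 'alice', 'sent', 'old'))
for path in [('alice', 'inbox', 'msg1'), ('alice', 'sent', 'old', 'msg2')]:
    with open(os.path.join(demo_root, *path), 'w') as handle:
        handle.write('placeholder message\n')

found = sorted(list_messages(demo_root))
shutil.rmtree(demo_root)  # remove the temporary tree again
print(found)
```

Each tuple carries the same three values the script passes to save_to_database along with the file contents.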

How to query

Once you have loaded the data, you can use the mongo shell to run some queries.
Run “use enron_mail” to switch to the database, and then you can do the following:
db.messages.find({ contents : /query text/i }).limit(1).skip(0);
Besides contents, the document structure also includes: mailbox, subFolder and filename.
Here are some additional links with material on the shell and how to query:
http://www.mongodb.org/display/DOCS/Overview+-+The+MongoDB+Interactive+Shell
http://www.mongodb.org/display/DOCS/Tutorial
http://www.mongodb.org/display/DOCS/Querying
http://www.mongodb.org/display/DOCS/Advanced+Queries
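
The same kind of query can be issued from Python. This is a rough sketch under a few assumptions: contents_filter and find_first_match are hypothetical helper names, and the search text is escaped with re.escape so it is matched literally, whereas the shell example treats it as a regular expression. The db argument is expected to be a pymongo database object such as the one the loader script creates.

```python
import re


def contents_filter(query_text):
    """Build a case-insensitive substring filter on the 'contents' field."""
    return {"contents": {"$regex": re.escape(query_text), "$options": "i"}}


def find_first_match(db, query_text):
    """Return the first message whose contents match, or None if there is none."""
    return db.messages.find_one(contents_filter(query_text))


# The filter can be inspected without a running server.
print(contents_filter("california"))

# With MongoDB running and the data loaded, usage would look like:
#   from pymongo import MongoClient
#   db = MongoClient('localhost').enron_mail
#   print(find_first_match(db, "california"))
```

An unanchored, case-insensitive regular expression like this has to scan the collection, so expect it to be slow on the full data set.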

Analysis

Important things to note about the code:

  • Change the MAIL_DIR_PATH variable to match your installation.
  • MAX_USER_RUN_LIMIT and MAX_USER_EMAILS_PER_FOLDER_FILE_LIMIT cap the run for testing. A value of 0 disables the check, so set both to 0 for a full load.
  • get_file_contents decodes the text as cp1252.
  • save_to_database encodes the text as UTF-8 for MongoDB compatibility.
  • The os.walk function is key to the simplicity of this code.
  • Here are some references on Unicode:
    http://docs.python.org/howto/unicode.html
    http://stackoverflow.com/questions/4685568/importing-file-with-unknown-encoding-from-python-into-mongodb
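
The decode and encode steps from the list above can be seen in isolation. This is a small sketch, with cp1252_to_utf8 as a hypothetical helper name and the sample bytes invented for illustration; the script itself does the two steps in separate functions.

```python
def cp1252_to_utf8(raw_bytes):
    """Decode cp1252 bytes to text, then encode that text as UTF-8 bytes."""
    text = raw_bytes.decode('cp1252')
    return text.encode('utf-8')


# 0xE9 is "e with acute accent" in cp1252; UTF-8 needs two bytes for it.
accented = cp1252_to_utf8(b'caf\xe9')
# 0x80 is the euro sign in cp1252; UTF-8 needs three bytes for it.
euro = cp1252_to_utf8(b'\x80')
print(repr(accented), repr(euro))
```

Plain ASCII text passes through unchanged. A handful of byte values (0x81, for example) are undefined in cp1252, so decode raises UnicodeDecodeError on them unless an errors argument such as 'replace' is passed.
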
The full run took ~21 minutes, following an initial run that had probably left a bunch of the files in cache. The run maxed out a single core of the CPU, the process was I/O bound on the hard drive, and the full 6 GB of RAM was in use on the machine.
My query of MongoDB says I have 517,424 emails in the document store. It shouldn’t be too difficult to modify the code to work with your database of choice.
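
As one illustration of swapping the database, here is a sketch that stores the same four fields in SQLite using Python's built-in sqlite3 module. The table layout, the open_message_store helper, and the extra connection parameter on save_to_database are my assumptions for the example, not part of the original script, and the sample row is made up.

```python
import sqlite3


def open_message_store(db_path):
    """Open (or create) a SQLite database with a single messages table."""
    connection = sqlite3.connect(db_path)
    connection.execute(
        "CREATE TABLE IF NOT EXISTS messages ("
        "mailbox TEXT, sub_folder TEXT, filename TEXT, contents TEXT)")
    return connection


def save_to_database(connection, mailbox_owner_name, sub_folder, file_name, message_contents):
    """Insert one message row; the caller decides when to commit."""
    connection.execute(
        "INSERT INTO messages (mailbox, sub_folder, filename, contents) "
        "VALUES (?, ?, ?, ?)",
        (mailbox_owner_name, sub_folder, file_name, message_contents))


# ":memory:" keeps the demo self-contained; a file path would persist the data.
store = open_message_store(":memory:")
save_to_database(store, "alice", "inbox", "msg1", u"Subject: hello\n\nbody text")
store.commit()
row_count = store.execute("SELECT COUNT(*) FROM messages").fetchone()[0]
print(row_count)
```

The walking and file-reading parts of the loader would stay as they are; only the connection setup and save_to_database change.
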
I hope you find this code useful and that it enables you to do some analysis with this dataset.

Addendum

Brendan McAdams @rit created a version of the code which utilizes the Python email library to produce a database with more metadata. You can see the results of his work here: http://mongodb-enron-email.s3-website-us-east-1.amazonaws.com/ (now a dead link)
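
To give a feel for what the email library offers, here is a minimal sketch of turning one raw message into a richer document. It is not Brendan's code, which I cannot reproduce here; message_to_document, the chosen field names, and the sample message are all assumptions made for illustration.

```python
from email import message_from_string


def message_to_document(raw_message, mailbox_owner, sub_folder, file_name):
    """Parse a raw message and return a dict with a few headers and the body."""
    parsed = message_from_string(raw_message)
    return {
        "mailbox": mailbox_owner,
        "subFolder": sub_folder,
        "filename": file_name,
        "from": parsed.get("From"),
        "to": parsed.get("To"),
        "subject": parsed.get("Subject"),
        "date": parsed.get("Date"),
        # for a simple, non-multipart message the payload is the body text
        "body": parsed.get_payload(),
    }


# A made-up message in the usual headers / blank line / body layout.
sample = (
    "From: alice@example.com\n"
    "To: bob@example.com\n"
    "Subject: Quarterly numbers\n"
    "Date: Mon, 14 May 2001 16:39:00 -0700\n"
    "\n"
    "Please see the attached figures.\n"
)
doc = message_to_document(sample, "alice", "sent", "msg1")
print(doc["subject"])
```

With the headers stored as separate fields, queries by sender, recipient, or subject no longer need a regular expression over the whole message text.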