First, let me say that I'm a complete newbie in Python. I never studied the language; I just thought, “How hard can it be?” when Google turned up nothing but Python snippets to solve my problem. :)
I have a bunch of mailboxes in the Maildir format (a backup from the mail server on my old web host), and I need to extract the emails from them. So far, the easiest way seems to be to convert them to the mbox format, which Thunderbird supports, and it seems that Python has classes for reading and writing both formats. Seems perfect.
The Python docs even have this little piece of code that does exactly what I need:
    import mailbox

    src = mailbox.Maildir('maildir', factory=None)
    dest = mailbox.mbox('/tmp/mbox')

    for msg in src:    #1
        dest.add(msg)  #2
Alas, this doesn't work. And this is where my complete lack of Python knowledge comes in. For some messages, I get a UnicodeDecodeError during iteration (that is, when it tries to read msg from src, on line #1). For others, I get a UnicodeEncodeError when trying to add msg to dest (line #2).
Clearly it makes some incorrect assumptions about the encoding used. But I don't know how to specify the encoding on the mailbox. (For that matter, I don't know what the encoding should be either, but I can probably figure that out once I find a way to actually specify it.)
I get stack traces similar to the following:
    File "E:\Python30\lib\mailbox.py", line 102, in itervalues
      value = self[key]
    File "E:\Python30\lib\mailbox.py", line 74, in __getitem__
      return self.get_message(key)
    File "E:\Python30\lib\mailbox.py", line 317, in get_message
      msg = MaildirMessage(f)
    File "E:\Python30\lib\mailbox.py", line 1373, in __init__
      Message.__init__(self, message)
    File "E:\Python30\lib\mailbox.py", line 1345, in __init__
      self._become_message(email.message_from_file(message))
    File "E:\Python30\lib\email\__init__.py", line 46, in message_from_file
      return Parser(*args, **kws).parse(fp)
    File "E:\Python30\lib\email\parser.py", line 68, in parse
      data = fp.read(8192)
    File "E:\Python30\lib\io.py", line 1733, in read
      eof = not self._read_chunk()
    File "E:\Python30\lib\io.py", line 1562, in _read_chunk
      self._set_decoded_chars(self._decoder.decode(input_chunk, eof))
    File "E:\Python30\lib\io.py", line 1295, in decode
      output = self.decoder.decode(input, final=final)
    File "E:\Python30\lib\encodings\cp1252.py", line 23, in decode
      return codecs.charmap_decode(input,self.errors,decoding_table)[0]
    UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 37: character maps to <undefined>
And for the UnicodeEncodeErrors:
    File "E:\Python30\lib\email\message.py", line 121, in __str__
      return self.as_string()
    File "E:\Python30\lib\email\message.py", line 136, in as_string
      g.flatten(self, unixfrom=unixfrom)
    File "E:\Python30\lib\email\generator.py", line 76, in flatten
      self._write(msg)
    File "E:\Python30\lib\email\generator.py", line 108, in _write
      self._write_headers(msg)
    File "E:\Python30\lib\email\generator.py", line 141, in _write_headers
      header_name=h, continuation_ws='\t')
    File "E:\Python30\lib\email\header.py", line 189, in __init__
      self.append(s, charset, errors)
    File "E:\Python30\lib\email\header.py", line 262, in append
      input_bytes = s.encode(input_charset, errors)
    UnicodeEncodeError: 'ascii' codec can't encode character '\xe5' in position 16: ordinal not in range(128)
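Both exceptions can apparently be reproduced without any mailbox at all, which suggests the codecs rather than the mail files as such are the issue. A minimal sketch using the byte and the character from the traces above (the helper names try_decode and try_encode are made up for illustration):

```python
def try_decode(raw, encoding):
    """Return the decoded text, or the exception if decoding fails."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError as exc:
        return exc


def try_encode(text, encoding):
    """Return the encoded bytes, or the exception if encoding fails."""
    try:
        return text.encode(encoding)
    except UnicodeEncodeError as exc:
        return exc


# Byte 0x9d has no mapping in cp1252, the codec named in the first trace.
print(repr(try_decode(b"\x9d", "cp1252")))

# The character '\xe5' is outside ASCII, the codec named in the second trace.
print(repr(try_encode("\xe5", "ascii")))

# Codecs that cover the input handle it without complaint.
print(repr(try_decode(b"\x9d", "latin-1")))
print(repr(try_encode("\xe5", "utf-8")))
```

So the question seems to come down to which codec the library picks in each case, and whether that choice can be overridden.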
Can anyone help me here? (Suggestions for completely different, non-Python solutions are also welcome. I just need a way to get at the emails in these Maildir files.)
Update:
sys.getdefaultencoding() returns 'utf-8'
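Since the decode trace goes through cp1252.py rather than a UTF-8 codec, it may be that files opened in text mode use the locale's preferred encoding instead of the value above. A small sketch to compare the candidates side by side (describe_encodings is a made-up helper name, and which of these values the mailbox module actually uses is an assumption on my part):

```python
import locale
import sys


def describe_encodings():
    """Collect the encodings Python may fall back on when none is given."""
    return {
        # Used for implicit str <-> bytes conversions.
        "default": sys.getdefaultencoding(),
        # Typically used by open() in text mode when no encoding is passed.
        "locale_preferred": locale.getpreferredencoding(False),
        # Used for file names, not file contents.
        "filesystem": sys.getfilesystemencoding(),
    }


for name, value in describe_encodings().items():
    print(name, "=", value)
```

If locale_preferred shows cp1252 on this machine, that would match the codec in the UnicodeDecodeError trace.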
I have uploaded sample messages that cause both errors. This one throws the UnicodeEncodeError and this one throws the UnicodeDecodeError.
I tried running the same script in Python 2.6 and got TypeErrors instead:
    File "c:\python26\lib\mailbox.py", line 529, in add
      self._toc[self._next_key] = self._append_message(message)
    File "c:\python26\lib\mailbox.py", line 665, in _append_message
      offsets = self._install_message(message)
    File "c:\python26\lib\mailbox.py", line 724, in _install_message
      self._dump_message(message, self._file, self._mangle_from_)
    File "c:\python26\lib\mailbox.py", line 220, in _dump_message
      raise TypeError('Invalid message type: %s' % type(message))
    TypeError: Invalid message type: <type 'instance'>
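One direction that might sidestep both kinds of error is to copy each message as raw bytes, so that nothing is decoded on the way in or re-encoded on the way out. The sketch below assumes a newer Python 3 (3.2 or later, where mailbox offers get_bytes() and add() accepts bytes), so it would not run on the 3.0 install shown above. The function name maildir_to_mbox and the paths are placeholders.

```python
import mailbox


def maildir_to_mbox(maildir_path, mbox_path):
    """Copy every message from a Maildir into an mbox file as raw bytes.

    Returns the number of messages copied.
    """
    # create=False: fail loudly if the source Maildir does not exist.
    src = mailbox.Maildir(maildir_path, factory=None, create=False)
    dest = mailbox.mbox(mbox_path)
    copied = 0
    try:
        for key in src.iterkeys():
            # get_bytes() hands back the message as stored on disk,
            # so no charset has to be guessed while reading it.
            dest.add(src.get_bytes(key))
            copied += 1
        dest.flush()
    finally:
        dest.close()
        src.close()
    return copied
```

It would be called as maildir_to_mbox('maildir', '/tmp/mbox'). This only covers the top-level folder of the Maildir, and if another program might touch the mbox at the same time, the copy loop should probably be wrapped in dest.lock() and dest.unlock().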