Python Mailbox Encoding Errors

First, let me say that I'm a complete newbie in Python. I never studied this language; I just thought, "How hard can it be?" when Google turned up nothing but Python fragments to solve my problem. :)

I have a bunch of mailboxes in the Maildir format (backup from the mail server on my old web host), and I need to extract emails from them. So far, the easiest way has been to convert them to the mbox format that Thunderbird supports, and it seems that Python has several classes for reading / writing both formats. Seems perfect.

Python docs even have this little piece of code that does exactly what I need:

    src = mailbox.Maildir('maildir', factory=None)
    dest = mailbox.mbox('/tmp/mbox')
    for msg in src:  #1
        dest.add(msg)  #2

Alas, this doesn't work. And this is where my complete lack of Python knowledge comes in. For some messages, I get a UnicodeDecodeError during iteration (that is, when it tries to read msg from src, on line #1). For others, I get a UnicodeEncodeError when trying to add msg to dest (line #2).

It seems clear that it makes some incorrect assumptions about the encoding used. But I don't know how to specify the encoding for the mailbox. (For that matter, I don't know what the encoding should be either, but I can probably figure that out once I find a way to actually specify it.)

I get stack traces similar to the following:

      File "E:\Python30\lib\mailbox.py", line 102, in itervalues
        value = self[key]
      File "E:\Python30\lib\mailbox.py", line 74, in __getitem__
        return self.get_message(key)
      File "E:\Python30\lib\mailbox.py", line 317, in get_message
        msg = MaildirMessage(f)
      File "E:\Python30\lib\mailbox.py", line 1373, in __init__
        Message.__init__(self, message)
      File "E:\Python30\lib\mailbox.py", line 1345, in __init__
        self._become_message(email.message_from_file(message))
      File "E:\Python30\lib\email\__init__.py", line 46, in message_from_file
        return Parser(*args, **kws).parse(fp)
      File "E:\Python30\lib\email\parser.py", line 68, in parse
        data = fp.read(8192)
      File "E:\Python30\lib\io.py", line 1733, in read
        eof = not self._read_chunk()
      File "E:\Python30\lib\io.py", line 1562, in _read_chunk
        self._set_decoded_chars(self._decoder.decode(input_chunk, eof))
      File "E:\Python30\lib\io.py", line 1295, in decode
        output = self.decoder.decode(input, final=final)
      File "E:\Python30\lib\encodings\cp1252.py", line 23, in decode
        return codecs.charmap_decode(input,self.errors,decoding_table)[0]
    UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 37: character maps to <undefined>

And for the UnicodeEncodeErrors:

      File "E:\Python30\lib\email\message.py", line 121, in __str__
        return self.as_string()
      File "E:\Python30\lib\email\message.py", line 136, in as_string
        g.flatten(self, unixfrom=unixfrom)
      File "E:\Python30\lib\email\generator.py", line 76, in flatten
        self._write(msg)
      File "E:\Python30\lib\email\generator.py", line 108, in _write
        self._write_headers(msg)
      File "E:\Python30\lib\email\generator.py", line 141, in _write_headers
        header_name=h, continuation_ws='\t')
      File "E:\Python30\lib\email\header.py", line 189, in __init__
        self.append(s, charset, errors)
      File "E:\Python30\lib\email\header.py", line 262, in append
        input_bytes = s.encode(input_charset, errors)
    UnicodeEncodeError: 'ascii' codec can't encode character '\xe5' in position 16: ordinal not in range(128)

Can anyone help me here? (Suggestions for completely different, non-Python solutions are also welcome. I just need a way to access and import the emails from these Maildir files.)

Update:

sys.getdefaultencoding() returns 'utf-8'.

I have uploaded sample messages that cause both errors: one throws UnicodeEncodeError and the other throws UnicodeDecodeError.

I tried running the same script in Python 2.6 and got TypeErrors instead:

      File "c:\python26\lib\mailbox.py", line 529, in add
        self._toc[self._next_key] = self._append_message(message)
      File "c:\python26\lib\mailbox.py", line 665, in _append_message
        offsets = self._install_message(message)
      File "c:\python26\lib\mailbox.py", line 724, in _install_message
        self._dump_message(message, self._file, self._mangle_from_)
      File "c:\python26\lib\mailbox.py", line 220, in _dump_message
        raise TypeError('Invalid message type: %s' % type(message))
    TypeError: Invalid message type: <type 'instance'>
+6
python encoding
2 answers

Note

  • @Jimmy2Times may well be right in saying that this module has not been updated for version 3.0.

  • This is not so much an answer as a likely explanation of what is happening, why, and how to reproduce it, so that other people can benefit from it. I am still working on this answer.

I have put everything I could find under EDIT below.

=====

I think this is what happens

Among all your other data, you have two characters, \x9d and \xe5, and they are encoded in some encoding, say iso-8859-1.

When Python 3.0 reads an encoded string, it first has to settle on an encoding for it, which is in effect a guess, and then decodes it to Unicode using that encoding (Unicode is how it stores strings internally).

I think the guessing part is where it goes wrong.
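As a rough illustration in present-day Python 3 (an assumption; the question used 3.0), the "guess" for a file opened in text mode without an `encoding` argument is usually a platform default, which `locale.getpreferredencoding` reports. The sketch below shows how passing `encoding` and `errors` explicitly changes the outcome; the helper `read_text` and the temporary file are hypothetical, not part of `mailbox`.

```python
import locale
import os
import tempfile


def read_text(path, encoding, errors="strict"):
    """Read a file as text with an explicit encoding instead of the platform default."""
    with open(path, "r", encoding=encoding, errors=errors, newline="") as f:
        return f.read()


# The encoding typically used when none is passed to open() (platform-dependent).
default_encoding = locale.getpreferredencoding(False)

# Write a sample file containing one byte that cp1252 leaves undefined.
fd, sample_path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(b"abc\x9d")

# A strict decode with the wrong encoding fails, as in the first traceback.
try:
    read_text(sample_path, "cp1252")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# errors="replace" substitutes U+FFFD for the undecodable byte instead of raising.
lenient = read_text(sample_path, "cp1252", errors="replace")
os.remove(sample_path)
```

On a Windows machine `default_encoding` would commonly be `cp1252`, which matches the codec named in the first traceback.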

To show what is most likely happening -

Let's say the encoding was iso-8859-1 , and the wrong guess was cp1252 (as from the first trace).

The decoding error for \x9d:

    In [290]: unicode(u'\x9d'.encode('iso-8859-1'), 'cp1252')
    ---------------------------------------------------------------------------
    <type 'exceptions.UnicodeDecodeError'>    Traceback (most recent call last)

    /home/jv/<ipython console> in <module>()

    /usr/lib/python2.5/encodings/cp1252.py in decode(self, input, errors)
         13
         14     def decode(self,input,errors='strict'):
    ---> 15         return codecs.charmap_decode(input,errors,decoding_table)
         16
         17 class IncrementalEncoder(codecs.IncrementalEncoder):

    <type 'exceptions.UnicodeDecodeError'>: 'charmap' codec can't decode byte 0x9d in position 0: character maps to <undefined>
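For readers on Python 3, where `unicode()` no longer exists, the same decoding failure can be reproduced with `bytes.decode`. This is only a sketch of the mismatch described above; the variable names are made up for illustration.

```python
raw = b"\x9d"

# iso-8859-1 maps every byte value to a code point, so this never fails.
as_latin1 = raw.decode("iso-8859-1")

# Python's cp1252 codec leaves 0x9d unassigned, so a strict decode raises.
try:
    raw.decode("cp1252")
    cp1252_error = None
except UnicodeDecodeError as exc:
    cp1252_error = exc
```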

For \xe5, the decoding succeeds, but later, when the message is written out somewhere inside Python, it tries to encode it as ascii, which fails:

    In [291]: unicode(u'\xe5'.encode('iso-8859-1'), 'cp1252').encode('ascii')
    ---------------------------------------------------------------------------
    <type 'exceptions.UnicodeEncodeError'>    Traceback (most recent call last)

    /home/jv/<ipython console> in <module>()

    <type 'exceptions.UnicodeEncodeError'>: 'ascii' codec can't encode character u'\xe5' in position 0: ordinal not in range(128)
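The encoding half can be sketched the same way in Python 3: the byte decodes without trouble, and it is only the later strict encode to ASCII that fails. The names here are illustrative only.

```python
raw = b"\xe5"

# cp1252 decodes 0xe5 without trouble; the result is the letter a-with-ring.
text = raw.decode("cp1252")

# ASCII has no such character, so a strict encode raises.
try:
    text.encode("ascii")
    ascii_error = None
except UnicodeEncodeError as exc:
    ascii_error = exc

# An encoding that covers the character works without loss.
as_utf8 = text.encode("utf-8")
```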

=============

EDIT

Both of your problems are on line #2, where it first decodes to unicode and then encodes to ascii.

First, easy_install chardet.

Decoding Error:

    In [75]: decd=open('jalf_decode_err','r').read()

    In [76]: chardet.detect(decd)
    Out[76]: {'confidence': 0.98999999999999999, 'encoding': 'utf-8'}

    ##this is what is tried at the back - my guess :)
    In [77]: unicode(decd, 'cp1252')
    ---------------------------------------------------------------------------
    <type 'exceptions.UnicodeDecodeError'>    Traceback (most recent call last)

    /home/jv/<ipython console> in <module>()

    /usr/lib/python2.5/encodings/cp1252.py in decode(self, input, errors)
         13
         14     def decode(self,input,errors='strict'):
    ---> 15         return codecs.charmap_decode(input,errors,decoding_table)
         16
         17 class IncrementalEncoder(codecs.IncrementalEncoder):

    <type 'exceptions.UnicodeDecodeError'>: 'charmap' codec can't decode byte 0x9d in position 2812: character maps to <undefined>

    ##this is a FIX- this way all your messages r accepted
    In [78]: unicode(decd, chardet.detect(decd)['encoding'])
    Out[78]: u'Return-path: < root@apps2.servage.net >\nEnvelope-to: public@jalf.dk \nDelivery-date: Fri, 22 Aug 2008 16:49:53 -0400\nReceived: from [77.232.66.102] (helo=apps2.servage.net)\n\tby c1p.hostingzoom.com with esmtp (Exim 4.69)\n\t(envelope-from < root@apps2.servage.net >)\n\tid 1KWdZu-0003VX-HP\n\tfor public@jalf.dk ; Fri, 22 Aug 2008 16:49:52 -0400\nReceived: from apps2.servage.net (apps2.servage.net [127.0.0.1])\n\tby apps2.servage.net (Postfix) with ESMTP id 4A87F980026\n\tfor < public@jalf.dk >; Fri, 22 Aug 2008 21:49:46 +0100 (BST)\nReceived: (from root@localhost )\n\tby apps2.servage.net (8.13.8/8.13.8/Submit) id m7MKnkrB006225;\n\tFri, 22 Aug 2008 21:49:46 +0100\nDate: Fri, 22 Aug 2008 21:49:46 +0100\nMessage-Id: < 200808222049.m7MKnkrB006225@apps2.servage.net >\nTo: public@jalf.dk \nSubject: =?UTF-8?B?WW5ncmVzYWdlbnMgTnloZWRzYnJldiAyMi44LjA4?=\nFrom: Nyhedsbrev fra Yngresagen < info@yngresagen.dk >\nReply-To: info@yngresagen.dk \nContent-type: text/plain; charset=UTF-8\nX-Abuse: Servage.net Listid 16329\nMime-Version: 1.0\nX-mailer: Servage Maillist System\nX-Spam-Status: No, score=0.1\nX-Spam-Score: 1\nX-Spam-Bar: /\nX-Spam-Flag: NO\nX-ClamAntiVirus-Scanner: This mail is clean\n\n\nK\xe6re medlem\n\nH\xe5ber du har en god sommer og er klar p\xe5 at l\xe6se seneste nyt i Yngresagen. God forn\xf8jelse!\n\n\n::. KOM TIL YS-CAF\xc8 .::\nFlere og billigere ungdomsboliger, afskaf 24-\xe5rs-reglen eller hvad synes du? Yngresagen indbyder dig til en \xe5ben debat over kaffe og snacks. Yngresagens Kristian Lauta, Mette Marb\xe6k, og formand Steffen M\xf8ller fort\xe6ller om tidligere projekter og vil gerne diskutere, hvad Yngresagen skal bruge sin tid p\xe5 fremover. \nVil du diskutere et emne, du br\xe6nder for, eller vil du bare v\xe6re med p\xe5 en lytter?\nS\xe5 kom torsdag d. 28/8 kl. 17-19, Kulturhuset 44, 2200 KBH N \n \n::. VIND GAVEKORT & BLIV H\xd8RT .:: \nYngresagen har lavet et sp\xf8rgeskema, s\xe5 du har direkte mulighed for at sige din mening, og v\xe6re med til at forme Yngresagens arbejde. Brug 5 min. p\xe5 at dele dine holdninger om f.eks. uddannelse, arbejde og unges vilk\xe5r - og vind et gavekort til en musikbutik. Vi tr\xe6kker lod blandt alle svarene og finder tre heldige vindere. Sp\xf8rgeskemaet er her: www.yngresagen.dk\n\n::. YS SPARKER NORDJYLLAND I GANG .::\nNordjylland bliver Yngresagens sunde region. Her er regionsansvarlig Andreas M\xf8ller Stehr ved at starte tre projekter op: 1) L\xf8beklub, 2) F\xf8rstehj\xe6lpskursus, 3) Mad til unge-program.\nVi har brug for flere frivillige til at sparke projekterne i gang. Vi tilbyder gratis fede aktiviteter, gratis t-shirts og ture til K\xf8benhavn, hvor du kan m\xf8de andre unge i YS. Har det fanget din interesse, s\xe5 t\xf8v ikke med at kontakte os: nordjylland@yngresagen.dk tlf. 21935185. \n\n::. YNGRESAGEN I PRESSEN .::\nL\xe6s her et udsnit af sidste nyt om Yngresagen i medierne. L\xe6s og lyt mere p\xe5 hjemmesiden under \u201dYS i pressen\u201d.\n\n:: Radionyhederne: Unge skal informeres bedre om l\xe5n \nUnge ved for lidt om at l\xe5ne penge. Det udnytter banker og rejseselskaber til at give dem l\xe5n med t\xe5rnh\xf8je renter. S\xe5dan lyder det fra formand Steffen M\xf8ller fra landsforeningen Yngresagen. \n\n:: Danmarks Radio P1: Dansk Folkeparti - de \xe6ldres parti? \nHvorfor er det kun fattige \xe6ldre og ikke alle fattige, der kan s\xf8ge om at f\xe5 nedsat medielicens?\nDansk Folkepartis ungeordf\xf8rer, Karin N\xf8dgaard, og Yngresagens formand Steffen M\xf8ller debatterer medielicens, \xe6ldrecheck og indflydelse til unge \n\n:: Frederiksborg Amts Avis: Turen til Roskilde koster en holdning!\nFor at skabe et m\xf8de mellem politikere og unge fragter Yngresagen unge gratis til \xe5rets Roskilde Festival. Det sker med den s\xe5kaldte Yngrebussen, der kan l\xe6ses mere om p\xe5 www.yngrebussen.dk\n\n \n \nMed venlig hilsen \nYngresagen\n\nLandsforeningen Yngresagen\nKulturhuset Kapelvej 44\n2200 K\xf8benhavn N\n\ntlf. 29644960\ninfo@yngresagen.dk \nwww.yngresagen.dk\n\n\n-------------------------------------------------------\nUnsubscribe Link: \nhttp://apps.corecluster.net/apps/ml/r.php?l=16329&e=public%40jalf.dk%0D%0A&id=40830383\n-------------------------------------------------------\n\n'

Now it's in unicode, so it shouldn't give you any problems.

Now for the encoding problem:
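If installing chardet is not an option, a cruder alternative is to try a short list of likely encodings in order. Note that this swaps statistical detection for a fixed candidate list, so it is a different technique from the one above; the function name `decode_best_effort` and the candidate order are assumptions for illustration, written for present-day Python 3.

```python
def decode_best_effort(data, candidates=("utf-8", "cp1252", "iso-8859-1")):
    """Return (text, encoding) for the first candidate that decodes cleanly."""
    for encoding in candidates:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # Only reached if the candidate list lacks an encoding that accepts every byte.
    return data.decode("ascii", errors="replace"), "ascii"
```

Putting `utf-8` first is deliberate: it rejects most non-UTF-8 input, while `iso-8859-1` accepts any byte sequence and therefore only makes sense as the last resort.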

    In [129]: encd=open('jalf_encode_err','r').read()

    In [130]: chardet.detect(encd)
    Out[130]: {'confidence': 0.78187650822865284, 'encoding': 'ISO-8859-2'}

    #even after the unicode conversion the encoding to ascii fails - because the criteria is strict by default
    In [131]: unicode(encd, chardet.detect(encd)['encoding']).encode('ascii')
    ---------------------------------------------------------------------------
    <type 'exceptions.UnicodeEncodeError'>    Traceback (most recent call last)

    /home/jv/<ipython console> in <module>()

    <type 'exceptions.UnicodeEncodeError'>: 'ascii' codec can't encode character u'\u0159' in position 557: ordinal not in range(128)

    ##changing the criteria to ignore
    In [132]: unicode(encd, chardet.detect(encd)['encoding']).encode('ascii', 'ignore')
    Out[132]: 'Return-path: < info@kollegierneskontor.dk >\nEnvelope-to: alf@5elements.net \nDelivery-date: Tue, 21 Aug 2007 06:10:08 -0400\nReceived: from pfepc.post.tele.dk ([195.41.46.237]:52065)\n\tby c1p.hostingzoom.com with esmtp (Exim 4.66)\n\t(envelope-from < info@kollegierneskontor.dk >)\n\tid 1INQgX-0003fI-Un\n\tfor alf@5elements.net ; Tue, 21 Aug 2007 06:10:08 -0400\nReceived: from local.com (ns2.datadan.dk [195.41.7.21])\n\tby pfepc.post.tele.dk (Postfix) with SMTP id ADF4C8A0086\n\tfor < alf@5elements.net >; Tue, 21 Aug 2007 12:10:04 +0200 (CEST)\nFrom: "Kollegiernes Kontor I Kbenhavn" < info@kollegierneskontor.dk >\nTo: "Jesper Alf Dam" < alf@5elements.net >\nSubject: Fornyelse af profil\nDate: Tue, 21 Aug 2007 12:10:03 +0200\nX-Mailer: Dundas Mailer Control 1.0\nMIME-Version: 1.0\nContent-Type: Multipart/Alternative;\n\tboundary="Gark=_20078211010346yhSD0hUCo"\nMessage-Id: < 20070821101004.ADF4C8A0086@pfepc.post.tele.dk >\nX-Spam-Status: No, score=0.0\nX-Spam-Score: 0\nX-Spam-Bar: /\nX-Spam-Flag: NO\nX-ClamAntiVirus-Scanner: This mail is clean\n\n\n\n--Gark=_20078211010346yhSD0hUCo\nContent-Type: text/plain; charset=ISO-8859-1\nContent-Transfer-Encoding: Quoted-Printable\n\nHej Jesper Alf Dam=0D=0A=0D=0AHusk at forny din profil hos KKIK inden 28.=\n august 2007=0D=0ALog ind p=E5 din profil og benyt ikonet "forny".=0D=0A=0D=\n=0AVenlig hilsen=0D=0AKollegiernes Kontor i K=F8benhavn=0D=0A=0D=0Ahttp:/=\n/www.kollegierneskontor.dk/=0D=0A=0D=0A\n\n--Gark=_20078211010346yhSD0hUCo\nContent-Type: text/html; charset=ISO-8859-1\nContent-Transfer-Encoding: Quoted-Printable\n\n<html>=0D=0A<head>=0D=0A=0D=0A<style>=0D=0ABODY, TD {=0D=0Afont-family: v=\nerdana, arial, helvetica; font-size: 12px; color: #666666;=0D=0A}=0D=0A</=\nstyle>=0D=0A=0D=0A<title></title>=0D=0A=0D=0A</head>=0D=0A<body bgcolor=3D=\n#FFFFFF>=0D=0A<hr size=3D1 noshade>=0D=0A<table cellpadding=3D0 cellspaci=\nng=3D0 border=3D0 width=3D100%>=0D=0A<tr><td >=0D=0AHej Jesper Alf Dam<br=\n><br>Husk at forny din profil inden 28. august 2007<br>=0D=0ALog ind p=E5=\n din profil og benyt ikonet "forny".=0D=0A<br><br>=0D=0A<a href=3D"http:/=\n/www.kollegierneskontor.dk/">Klik her</a> for at logge ind.<br><br>Venlig=\n hilsen<br>Kollegiernes Kontor i K=F8benhavn=0D=0A</td></tr>=0D=0A</table=\n>=0D=0A<hr size=3D1 noshade>=0D=0A</body>=0D=0A</html>=0D=0A\n\n--Gark=_20078211010346yhSD0hUCo--\n\n'

    In [133]: len(encd)
    Out[133]: 2303

    In [134]: len(unicode(encd, chardet.detect(encd)['encoding']).encode('ascii', 'ignore'))
    Out[134]: 2302

CAUTION: As you can see (2303 characters in, 2302 out), this procedure can lose a small to moderate amount of data, so it is up to the user whether to use it.

So the code should look like:
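To make the trade-off concrete, here is a small Python 3 sketch comparing the standard error handlers on one word from the sample message. Only the built-in `str.encode` is used; the variable names are illustrative.

```python
text = "K\xf8benhavn"  # the city name with its o-with-stroke

# 'ignore' silently drops what ASCII cannot represent.
ignored = text.encode("ascii", "ignore")

# 'replace' keeps a visible marker where a character was lost.
replaced = text.encode("ascii", "replace")

# 'backslashreplace' keeps the information as an escape sequence.
escaped = text.encode("ascii", "backslashreplace")

# Encoding to UTF-8 instead of ASCII loses nothing.
lossless = text.encode("utf-8")
```

With `'ignore'` the result is `Kbenhavn`, the same damage that is visible in the From header of the output above.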

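    import chardet

    for msg in src:
        raw = msg.as_string()  # chardet and unicode() need a byte string, not a Message object
        # Decode with the detected encoding, then drop anything that is not ASCII (lossy)
        msg = unicode(raw, chardet.detect(raw)['encoding']).encode('ascii', 'ignore')
        dest.add(msg)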
+4

Try Python 2.5 or 2.6 instead of 3.0. Python 3.0 handles Unicode completely differently, and this module may not have been updated for 3.0.
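For anyone reading this on a later Python 3 release (3.2 or newer, which is an assumption about your setup and not what the question used), the `mailbox` module can hand messages over as raw bytes, so no decoding is involved in the copy at all. A minimal sketch, with the hypothetical helper name `maildir_to_mbox`:

```python
import mailbox


def maildir_to_mbox(maildir_path, mbox_path):
    """Copy every message from a Maildir into an mbox file as raw bytes."""
    src = mailbox.Maildir(maildir_path, factory=None, create=False)
    dest = mailbox.mbox(mbox_path)
    dest.lock()
    try:
        count = 0
        for key in src.iterkeys():
            # get_bytes returns the stored message without decoding it to text.
            dest.add(src.get_bytes(key))
            count += 1
        return count
    finally:
        # close() flushes pending writes and releases the lock.
        dest.close()
        src.close()
```

Usage would be something like `maildir_to_mbox('maildir', '/tmp/mbox')`. Because the messages are copied as bytes, Maildir flags such as "seen" are not carried over into the mbox.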

+4
