Python email.header.decode_header not working for multi-line headers

I am building a system that reads emails from a Gmail account and retrieves objects from them using the Python imaplib and email modules. Emails received from Hotmail accounts sometimes have line breaks in their headers, for example:

 In [4]: message['From']
 Out[4]: '=?utf-8?B?aXNhYmVsIG1hcsOtYSB0b2Npbm8gZ2FyY8OtYQ==?=\r\n\t< isatocino22@hotmail.com >'

If I try to decode this header, it does nothing:

 In [5]: email.header.decode_header(message['From'])
 Out[5]: [('=?utf-8?B?aXNhYmVsIG1hcsOtYSB0b2Npbm8gZ2FyY8OtYQ==?=\r\n\t< isatocino22@hotmail.com >', None)]

However, if I replace the line break and the tab with a space, it works:

 In [6]: email.header.decode_header(message['From'].replace('\r\n\t', ' '))
 Out[6]: [('isabel mar\xc3\xada tocino garc\xc3\xada', 'utf-8'), ('< isatocino22@hotmail.com >', None)]

Is this a bug in decode_header? If not, what other special cases like this should I know about?

2 answers

This is a bug in decode_header. It is present in Python 2.7 and fixed in Python 3.3.

 >>> sys.version_info
 sys.version_info(major=3, minor=3, micro=2, releaselevel='final', serial=0)
 >>> email.header.decode_header('=?utf-8?B?aXNhYmVsIG1hcsOtYSB0b2Npbm8gZ2FyY8OtYQ==?=\r\n\t< isatocino22@hotmail.com >')
 [(b'isabel mar\xc3\xada tocino garc\xc3\xada', 'utf-8'), (b'< isatocino22@hotmail.com >', None)]

vs

 >>> sys.version_info
 sys.version_info(major=2, minor=7, micro=5, releaselevel='final', serial=0)
 >>> email.header.decode_header('=?utf-8?B?aXNhYmVsIG1hcsOtYSB0b2Npbm8gZ2FyY8OtYQ==?=\r\n\t< isatocino22@hotmail.com >')
 [('=?utf-8?B?aXNhYmVsIG1hcsOtYSB0b2Npbm8gZ2FyY8OtYQ==?=\r\n\t< isatocino22@hotmail.com >', None)]
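
On Python 3 the decoded parts come back as bytes paired with a charset, as in the 3.3 output above, so one more step is needed to get readable text. A minimal sketch of that step, assuming a part with no declared charset is plain ASCII and that joining parts with a single space is acceptable; the helper name `header_to_text` is made up for illustration:

```python
from email.header import decode_header


def header_to_text(raw):
    """Decode an RFC 2047 header value into one str (hypothetical helper)."""
    pieces = []
    for value, charset in decode_header(raw):
        if isinstance(value, bytes):
            # Assumption: parts without a declared charset are plain ASCII.
            value = value.decode(charset or "ascii", errors="replace")
        # Strip stray whitespace so the single-space join stays tidy.
        pieces.append(value.strip())
    return " ".join(pieces)


raw = ("=?utf-8?B?aXNhYmVsIG1hcsOtYSB0b2Npbm8gZ2FyY8OtYQ==?="
       "\r\n\t< isatocino22@hotmail.com >")
print(header_to_text(raw))
```

Joining with a space is a simplification: it suits a display name followed by an address, but it may add spaces between adjacent parts that were meant to be contiguous.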

This error still occurs in some versions of Python 2.7, so you can use the following workaround:

 >>> email.header.decode_header('=?utf-8?B?aXNhYmVsIG1hcsOtYSB0b2Npbm8gZ2FyY8OtYQ==?=\r\n\t< isatocino22@hotmail.com >'.replace('\r\n\t', ' '))
 [('isabel mar\xc3\xada tocino garc\xc3\xada', 'utf-8'), ('< isatocino22@hotmail.com >', None)]

It replaces the CRLF and the tab with a single space, after which decode_header parses the header correctly.
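
The literal `'\r\n\t'` only matches a fold that uses exactly one tab. Folded headers can also continue with spaces, and some sources deliver a bare LF instead of CRLF. A hedged sketch of a slightly more general version of the same workaround follows; the `unfold` helper and the `_FOLD` pattern are illustrative names, not part of the email package:

```python
import re
from email.header import decode_header

# A fold is a line break (CRLF, or a bare LF) followed by at least one
# space or tab that continues the header on the next line.
_FOLD = re.compile(r"\r?\n[ \t]+")


def unfold(raw):
    """Replace each folded line break with a single space (hypothetical helper)."""
    return _FOLD.sub(" ", raw)


raw = ("=?utf-8?B?aXNhYmVsIG1hcsOtYSB0b2Npbm8gZ2FyY8OtYQ==?="
       "\r\n\t< isatocino22@hotmail.com >")
parts = decode_header(unfold(raw))
print(parts)
```

Unfolding before decoding is harmless on versions where the bug is already fixed, so the same call can be used on both Python 2.7 and Python 3.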
