This is a great question.
It does matter whether you open the file with open() or codecs.open(). The former works in terms of byte strings. The latter works in terms of Unicode strings. In Python, these behave differently.
The same question was raised as Python Issue 7643, "What is a Unicode line break character?". The discussion, as well as the links to the Unicode Character Database, are fun. Issue 7643 also gives this short code snippet to demonstrate the difference:
for s in '\x0a\x0d\x1c\x1d\x1e':
    print u'a{}b'.format(s).splitlines(1), 'a{}b'.format(s).splitlines(1)
But it comes down to this.
To determine whether bytes in byte strings are line breaks (or whitespace), Python uses the rules for ASCII control characters. By this measure, bytes 10 and 13 are line break characters (and Python treats byte 13 followed by byte 10 as a single line break).
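As a quick check of the byte-string rule, here is a minimal sketch. The sample data and variable names are my own, and I use b'' literals so that the same lines should run on both Python 2.7 and 3.x:

```python
# Byte strings: only LF (10), CR (13) and the CR LF pair end a line.
data = b'one\ntwo\rthree\r\nfour\x1cfive'

byte_lines = data.splitlines()
byte_lines_keepends = data.splitlines(True)

# Byte 0x1C is not a line break here, so 'four\x1cfive' stays in one piece.
print(byte_lines)
print(byte_lines_keepends)
```

Note that CR followed by LF produces one break, not two, and that the 0x1C byte is left alone.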
But to determine whether characters in Unicode strings are line breaks, Python follows the Unicode Character Database character classifications documented in UAX #44 and in UAX #14 Line Breaking Algorithm, section 5 Line Breaking Properties. According to Issue 7643, these documents identify three character properties that mark a character as a line break for Python's purposes:
- General Category Zl "Line Separator"
- General Category Zp "Paragraph Separator"
- Bidirectional Class B "Paragraph Separator"
Characters 28 (0x001C), 29 (0x001D) and 30 (0x001E) have these character properties. Character 31 (0x001F) does not. Why? That is a question for the Unicode Technical Committee. But in ASCII these characters were known as "File Separator", "Group Separator", "Record Separator" and "Unit Separator". Using a tab-delimited text data file as a comparison, the first three connote at least as much separation as a line break does, while the fourth is perhaps analogous to a tab.
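You can inspect these properties yourself with the standard unicodedata module. The following is only a sketch with variable names of my own choosing; it uses u'' literals so it should run on Python 2.7 and on 3.3 or later:

```python
import unicodedata

# The four ASCII information separators: FS, GS, RS, US.
separators = u'\x1c\x1d\x1e\x1f'

# Bidirectional class of each separator ('B' is Paragraph Separator).
bidi_classes = [unicodedata.bidirectional(ch) for ch in separators]

# How many pieces does a Unicode string split into around each one?
split_counts = [len((u'a' + ch + u'b').splitlines()) for ch in separators]

# General Category of U+2028 and U+2029, the dedicated separators.
general_categories = [unicodedata.category(ch) for ch in u'\u2028\u2029']

print(bidi_classes)        # ['B', 'B', 'B', 'S']
print(split_counts)        # [2, 2, 2, 1]
print(general_categories)  # ['Zl', 'Zp']
```

The first three separators carry Bidirectional Class B and split the string; character 31 has class S and does not.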
You can see the code that actually defines these three Unicode characters as line breaks in Python Unicode strings in Objects/unicodeobject.c. Look for the ascii_linebreak[] array. That array underlies the implementation of unicode.splitlines(). Different code underlies str.splitlines(). I believe, but have not traced it in the Python source code, that enumerate() on a file opened with codecs.open() is implemented in terms of unicode.splitlines().
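Without reading the C source, you can probe from Python which code points below 128 a Unicode string treats as line breaks. This is a rough sketch; the helper name is hypothetical, and the unichr fallback is only there so the same code runs on 2.7 and 3.x:

```python
try:
    unichr            # Python 2
except NameError:
    unichr = chr      # Python 3: chr() already returns a Unicode character

def ascii_unicode_linebreaks():
    """Return the code points below 128 on which a Unicode string is split."""
    return [code for code in range(128)
            if len((u'a' + unichr(code) + u'b').splitlines()) == 2]

linebreak_codes = ascii_unicode_linebreaks()
print(linebreak_codes)
```

On the interpreters I am aware of, the result contains 28, 29 and 30 alongside 10 and 13, and also 11 and 12 (vertical tab and form feed), but never 31.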
You ask: "How can I prevent this?" I see no way to make splitlines() behave differently. However, you can open the file as a byte stream, read the lines as bytes with the behavior of str.splitlines(), and then decode each line as UTF-8 for use as a Unicode string:
# Iterating a file opened with open() splits lines by byte rules, not Unicode rules.
with open('unicodetest.txt', 'r') as f:
    for i, line in enumerate(f):
        print i, line.decode('UTF-8')
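To try the same idea without the original file, here is a self-contained sketch. The helper read_utf8_lines and the sample text are hypothetical, and I open the file in explicit binary mode through io.open so that the code should behave the same on Python 2.7 and 3.x:

```python
import io
import os
import tempfile

def read_utf8_lines(path):
    """Read lines as bytes (split on byte newlines), then decode each as UTF-8."""
    with io.open(path, 'rb') as f:
        return [raw.decode('utf-8') for raw in f]

# Hypothetical sample: U+001C sits inside the first of two lines.
sample = u'first\x1cstill first\nsecond\n'.encode('utf-8')

fd, path = tempfile.mkstemp()
try:
    with os.fdopen(fd, 'wb') as f:
        f.write(sample)
    lines = read_utf8_lines(path)
finally:
    os.remove(path)

print(len(lines))  # 2: the U+001C did not start a new line
```

Because the split happens before decoding, the separator survives as an ordinary character inside the first decoded line.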
I assume that you are using Python 2.x and not 3.x. My answer is based on Python 2.7.