Python codecs and line endings

It seems that Python's UTF-8 codec (the codecs package) interprets Unicode characters 28, 29, and 30 as line endings. Why? And how can I prevent this?

Code example:

    with open('unicodetest.txt', 'w') as f:
        f.write('a'+chr(28)+'b'+chr(29)+'c'+chr(30)+'d'+chr(31)+'e')

    with open('unicodetest.txt', 'r') as f:
        for i, l in enumerate(f):
            print i, l
    # prints "0 abcde" with the special characters in between

As expected, this reads the file back as a single line. But when I use codecs to read it as UTF-8, it is interpreted as many lines:

    import codecs

    with codecs.open('unicodetest.txt', 'r', 'UTF-8') as f:
        for i, l in enumerate(f):
            print i, l
    # 0 a
    # 1 b
    # 2 c
    # 3 de
    # (again with the special characters after each a, b, c, d)

Characters 28 through 31 are described as “Information Separator Four” through “One” (in that order). Two things strike me: 1) characters 28 through 30 are interpreted as line endings, and 2) character 31 is not. Is this intentional behavior? Where can I find a definition of which characters are interpreted as line endings? And is there a way to not interpret them as line endings?

Thanks.

Edit: I forgot to copy the "UTF-8" argument to codecs.open. The code in my question is now fixed.

+7
1 answer

This is a great question.

It matters whether you open the file with open() or codecs.open(). The former works in terms of byte strings; the latter works in terms of Unicode strings. In Python 2, these behave differently.
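A quick way to see the difference, assuming the test file from your question exists (a sketch, Python 2):

    import codecs

    with open('unicodetest.txt', 'r') as f:
        print type(f.read())    # <type 'str'> -- a byte string
    with codecs.open('unicodetest.txt', 'r', 'UTF-8') as f:
        print type(f.read())    # <type 'unicode'> -- a Unicode string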

The same question arose as Python Issue 7643, "What is a Unicode line break character?". The discussion there, along with its links to the Unicode Character Database, is instructive. Issue 7643 also gives this short code snippet to demonstrate the difference:

    for s in '\x0a\x0d\x1c\x1d\x1e':
        print u'a{}b'.format(s).splitlines(1), 'a{}b'.format(s).splitlines(1)
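On Python 2.7 this prints something like the following. Only \x0a and \x0d split the byte string, while all five characters split the Unicode string:

    # [u'a\n', u'b'] ['a\n', 'b']
    # [u'a\r', u'b'] ['a\r', 'b']
    # [u'a\x1c', u'b'] ['a\x1cb']
    # [u'a\x1d', u'b'] ['a\x1db']
    # [u'a\x1e', u'b'] ['a\x1eb']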

But it comes down to this.

To determine whether the bytes in a byte string are line breaks (or whitespace), Python uses the ASCII control character rules. By that measure, bytes 10 and 13 are line break characters (and Python treats byte 13 followed by byte 10 as a single line break).
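A minimal sketch of the byte-string rules (Python 2):

    # Only \n, \r, and \r\n end lines in a byte string; \x1c does not.
    print 'a\nb\rc\r\nd\x1ce'.splitlines()
    # ['a', 'b', 'c', 'd\x1ce']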

But to determine whether the characters in a Unicode string are line breaks, Python follows the Unicode Character Database classifications documented in UAX #44, Unicode Character Database, and UAX #14, Unicode Line Breaking Algorithm, section 5, Line Breaking Properties. According to Issue 7643, three character properties mark a character as a line break for Python's purposes (a quick way to inspect these properties is sketched after the list):

  • General Category Zl "Line Separator"
  • General Category Zp "Paragraph Separator"
  • Bidirectional Class B "Paragraph Separator"
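These properties can be inspected with the standard unicodedata module (a sketch, run under Python 2.7):

    import unicodedata

    for c in u'\x1c\x1d\x1e\x1f\u2028\u2029':
        print repr(c), unicodedata.category(c), unicodedata.bidirectional(c)
    # u'\x1c' Cc B      (Bidirectional Class B -> line break)
    # u'\x1d' Cc B
    # u'\x1e' Cc B
    # u'\x1f' Cc S      (not a line break)
    # u'\u2028' Zl WS   (General Category Zl -> line break)
    # u'\u2029' Zp B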

Characters 28 (0x001C), 29 (0x001D), and 30 (0x001E) have these character properties. Character 31 (0x001F) does not. Why? That is a question for the Unicode Technical Committee. But in ASCII these four characters were known as “File Separator”, “Group Separator”, “Record Separator”, and “Unit Separator”. Using a tab-delimited text data file as a comparison, the first three imply at least as strong a split as a line break, while the fourth is arguably closer to a tab.
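For illustration only (this is just the analogy above, not part of any standard API), nested data could be encoded with those separators and split back apart:

    # units (\x1f) within records (\x1e) within groups (\x1d)
    data = 'a\x1fb\x1ec\x1fd\x1de\x1ff'
    for group in data.split('\x1d'):
        print [record.split('\x1f') for record in group.split('\x1e')]
    # [['a', 'b'], ['c', 'd']]
    # [['e', 'f']]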

You can see the code that actually defines these three Unicode characters as line breaks for Python Unicode strings in Objects/unicodeobject.c. Look for the ascii_linebreak[] array. That array underlies the implementation of unicode.splitlines(); different code underlies str.splitlines(). I assume, but have not traced through the Python source, that iterating over a file opened with codecs.open() is implemented in terms of unicode.splitlines().

You ask: "How can I prevent this?" I see no way to make splitlines() behavior differently. However, you can open the file as a byte stream, read the lines as bytes with the behavior of str.splitlines() , and then decode each line as UTF-8 for use as a string in Unicode:

    with open('unicodetest.txt', 'r') as f:
        for i, l in enumerate(f):
            print i, l.decode('UTF-8')
    # prints "0 abcde" with the special characters in between
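If you need this in more than one place, the same idea can be wrapped in a small generator (a sketch; utf8_lines is just a hypothetical name, not a standard function):

    def utf8_lines(path):
        # Read byte lines (str splitting rules), then decode each to Unicode.
        with open(path, 'rb') as f:
            for line in f:
                yield line.decode('UTF-8')

    for i, l in enumerate(utf8_lines('unicodetest.txt')):
        print i, repr(l)
    # 0 u'a\x1cb\x1dc\x1ed\x1fe'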

I assume that you are using Python 2.x and not 3.x. My answer is based on Python 2.7.

+5
