Python unicode confusion length

Question

I already got some help with this, but I am still confused.

I have a Unicode string:

 title = u'😉test'
 title_length = len(title) #5

However, I need len(title) to be 6. The clients expect 6 because they seem to count differently than the backend does.

As a workaround I wrote this little helper, but I'm sure it can be improved (with more knowledge of encodings), or it may even be wrong.

 title_length = len(title) + repr(title).count('\\U') #6

1. Is there a better way to get a length of 6? :-)

I assume that Python counts the number of Unicode characters, which is 5. Are the clients counting the number of bytes?

2. Will my logic break for other Unicode characters that need, for example, 4 bytes?

Running Python 2.7 ucs4.

+7

python unicode

kev Jun 11 '15 at 8:37

1 answer

Martijn Pieters · Accepted Answer · 2015-06-11T08:44:06+0000

You have 5 code points. One of those code points lies outside the Basic Multilingual Plane, which means the UTF-16 encoding of that code point has to use two code units for the character.

In other words, the client is relying on an implementation detail and is doing something wrong. They should be counting code points, not code units. There are several platforms where this happens fairly regularly; Python 2 UCS-2 builds are one of them, but Java developers often forget about the difference too, as do Windows APIs.

You can encode your text to UTF-16 and divide the number of bytes by two (each UTF-16 code unit is 2 bytes). Pick the utf-16-le or utf-16-be variant so that the byte order mark (BOM) is not included in the length:

 title = u'😉test'
 len_in_codeunits = len(title.encode('utf-16-le')) // 2
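As a minimal sketch, the same calculation can be wrapped in a small helper. The name utf16_length is made up for illustration and is not part of any library. The string is written with a \U escape so that the snippet does not depend on the source file encoding, and the last line shows why plain 'utf-16' is avoided: it prepends a BOM, which adds one extra code unit.

```python
def utf16_length(text):
    """Return the number of UTF-16 code units needed for text (hypothetical helper)."""
    # utf-16-le writes no byte order mark, so every 2 bytes is exactly one code unit.
    return len(text.encode('utf-16-le')) // 2


title = u'\U0001F609test'  # the same string as above, spelled with an escape

print(utf16_length(title))                # 6: the emoji takes two code units
print(utf16_length(u'test'))              # 4: BMP characters take one code unit each
print(len(u'a'.encode('utf-16')) // 2)    # 2: plain 'utf-16' adds a BOM
```

Unlike counting '\\U' in repr(title), this does not depend on how the string happens to be displayed, so it should also behave the same on Python 3.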

If you are using Python 2 (and judging by the u prefix on the string you may well be), be aware that there are two different flavours of Python 2, depending on how it was built. Depending on a build-time configuration switch, you will have either a UCS-2 or a UCS-4 build; the former uses surrogates internally, so on such a build len(title) would be 6 as well. See "Python returns length of 2 for single Unicode character string".
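If it is unclear which kind of build is running, sys.maxunicode can be inspected. The sketch below assumes nothing beyond the standard library; the function name is_narrow_build is made up for illustration.

```python
import sys


def is_narrow_build():
    """Return True on a narrow (UCS-2) build (hypothetical helper)."""
    # Narrow builds report 0xFFFF; wide (UCS-4) builds report 0x10FFFF.
    return sys.maxunicode == 0xFFFF


if is_narrow_build():
    print("narrow build: len() counts UTF-16 code units")
else:
    print("wide build: len() counts code points")
```

On Python 3.3 and later sys.maxunicode is always 0x10FFFF, so the distinction only matters for Python 2 and early Python 3 releases.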
