Python String Encodings and ==

Question

I am having problems with strings in Python not being == when I think they should be, and I believe it has something to do with how they are encoded. Basically, I'm parsing some comma-separated values that are stored in zip archives (GTFS feeds, specifically, for those who are curious).

I'm using the zipfile module in Python to open certain files inside the zip archives and then compare the text there with some known values. Here is an example file:

    agency_id,agency_name,agency_url,agency_phone,agency_timezone,agency_lang
    ARLC,Arlington Transit,http://www.arlingtontransit.com,703-228-7433,America/New_York,en

The code I'm using tries to identify the position of the "agency_id" column in the first line of the text, so that I can use the corresponding value in any subsequent lines. Here is the code snippet:

    zipped_feed = ZipFile(feed_name, "r")
    agency_file = zipped_feed.open("agency.txt", "r")
    line_num = 0
    agencyline = agency_file.readline()
    while agencyline:
        if line_num == 0:
            # this is the header, all we care about is the agency_id
            lineparts = agencyline.split(",")
            position = -1
            counter = 0
            for part in lineparts:
                part = part.strip()
                if part == "agency_id":
                    position = counter
                counter += 1
            line_num += 1
            agencyline = agency_file.readline()
        else:
            .....

This code works for some zip archives, but not for others. I did some digging and printed the raw value of part, and I got '\xef\xbb\xbfagency_id' instead of 'agency_id'. Does anyone know what is going on here and how I can fix it? Thanks for the help!

+1

python string utf-8

jmetz Jun 2 '12 at 17:38

4 answers

Your input file appears to be UTF-8 encoded and starts with a 'ZERO WIDTH NO-BREAK SPACE' character,

    import unicodedata
    unicodedata.name('\xef\xbb\xbf'.decode('utf8'))
    # gives: 'ZERO WIDTH NO-BREAK SPACE'

which is used as a byte order mark (or, more precisely, as a signature marking the file as UTF-8, since byte order is not really meaningful for UTF-8, but it is usually called a BOM anyway).

+3

mata Jun 2 '12 at 17:45

Simple: some of the files in your zip archives have a Unicode BOM (byte order mark) at the beginning of the first line. A BOM is used to indicate byte order for multibyte encodings. This means that you are reading Unicode text (UTF-8 encoded, judging by the bytes \xef\xbb\xbf) as a byte string. The simplest fix is to check for the BOM at the beginning of the line and remove it.
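A minimal sketch of that check, assuming the line was read from the archive as raw bytes. The helper name `strip_utf8_bom` is hypothetical and does not come from the question's code:

```python
import codecs


def strip_utf8_bom(line):
    """Return `line` (a byte string) without a leading UTF-8 BOM, if present."""
    if line.startswith(codecs.BOM_UTF8):
        return line[len(codecs.BOM_UTF8):]
    return line


# Example header modeled on the value reported in the question.
header = strip_utf8_bom(b"\xef\xbb\xbfagency_id,agency_name\n")
first_column = header.split(b",")[0].strip()
print(first_column == b"agency_id")
```

Only the first line of a file can carry the BOM, so it should be enough to apply this to the header line.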

0

Lukasa Jun 2 '12 at 17:48

You have a file that may sometimes have a Unicode byte order mark at the beginning. Sometimes it is inserted by editors to indicate the encoding.

Here are some details - http://en.wikipedia.org/wiki/Byte_order_mark

The bottom line is that you can look for the byte string \xef\xbb\xbf, which is the marker for UTF-8 encoded data, and simply strip it. Another option is to open the file with the codecs package

    with codecs.open('input', 'r', 'utf-8-sig') as f:
        first_line = f.readline()  # "utf-8-sig" (unlike plain "utf-8") drops a leading BOM while decoding

or in your case

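    import codecs

    zipped_feed = ZipFile(feed_name, "r")
    # wrap zipped_feed.open(...) in a StreamReader; the bare codecs.StreamReader base class
    # cannot decode anything, so get a concrete reader for "utf-8-sig", which also drops a leading BOM
    agency_file = codecs.getreader("utf-8-sig")(zipped_feed.open("agency.txt", "r"))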
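For readers on Python 3, here is a rough sketch of the same idea that decodes the archive member with the "utf-8-sig" codec and lets the csv module split the header. The function name `agency_id_position` and the in-memory sample archive are assumptions made for illustration, not part of the original code:

```python
import csv
import io
import zipfile


def agency_id_position(feed, member="agency.txt"):
    """Return the index of the agency_id column in `member`, or -1 if absent.

    `feed` is a path or a file-like object holding the zip archive.
    """
    with zipfile.ZipFile(feed, "r") as zipped_feed:
        with zipped_feed.open(member, "r") as raw:
            # "utf-8-sig" decodes UTF-8 and drops a leading BOM if there is one.
            text = io.TextIOWrapper(raw, encoding="utf-8-sig", newline="")
            header = next(csv.reader(text), [])
    names = [name.strip() for name in header]
    return names.index("agency_id") if "agency_id" in names else -1


# Build a small in-memory archive whose agency.txt starts with a BOM
# (illustrative data modeled on the question's example file).
sample = io.BytesIO()
with zipfile.ZipFile(sample, "w") as archive:
    archive.writestr(
        "agency.txt",
        b"\xef\xbb\xbfagency_id,agency_name\r\nARLC,Arlington Transit\r\n",
    )
sample.seek(0)

print(agency_id_position(sample))
```

Because the decoder handles the BOM, files with and without it go through the same code path and the comparison against "agency_id" needs no special case.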

0

koblas Jun 2 '12 at 17:53

Kjir · Accepted Answer · 2012-06-02T17:45:43+0000

This is a "Byte Order Mark", which indicates the encoding of the file and, in the case of UTF-16 and UTF-32, also the endianness of the file. You can either interpret it, or check for it and remove it from your string. To remove it, you can do this:

    import codecs
    unicode(part, "utf8").lstrip(codecs.BOM_UTF8.decode("utf8", "strict"))
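The snippet above targets Python 2, where `unicode` is a builtin. A rough Python 3 equivalent, assuming `part` holds raw bytes as in the question, might look like this:

```python
import codecs

part = b"\xef\xbb\xbfagency_id"  # example value taken from the question

# Decode the bytes, then strip the decoded BOM character (U+FEFF) from the left.
bom = codecs.BOM_UTF8.decode("utf8", "strict")
cleaned = part.decode("utf8").lstrip(bom)

print(cleaned == "agency_id")
```

Note that `lstrip` removes every leading U+FEFF character, not just one, which is harmless for a header field like this.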
