Automatic character encoding handling in Perl / DBI / DBD::ODBC

Question

I am using Perl with DBI / DBD::ODBC to retrieve data from a SQL Server database and have some problems with character encoding.

The database has a default collation of SQL_Latin1_General_CP1_CI_AS, so the data in varchar columns is encoded in Microsoft's version of Latin-1, AKA windows-1252.

There doesn't seem to be a way to handle this transparently in DBI / DBD::ODBC. The data I get back is still encoded as windows-1252; for example, €“” arrives as the bytes 0x80, 0x93 and 0x94. When I write those to a UTF-8 encoded XML file without decoding them first, they are written as Unicode characters 0x80, 0x93 and 0x94 instead of 0x20AC, 0x201C and 0x201D, which is obviously wrong.
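The mismatch described above can be reproduced without a database. The following is only a minimal sketch using the core Encode module; the variable names are illustrative, and the byte string stands in for what the driver returns.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Raw windows-1252 bytes for the three characters mentioned in the question.
my $raw = "\x80\x93\x94";

# Without decoding, Perl treats each byte as code point U+0080, U+0093, U+0094,
# so the UTF-8 output contains C1 control characters.
my $wrong = encode('UTF-8', $raw);

# Decoding first maps the bytes to U+20AC, U+201C, U+201D.
my $chars = decode('windows-1252', $raw);
my $right = encode('UTF-8', $chars);

# Hex dump of both results for comparison.
printf "undecoded: %s\n", join ' ', map { sprintf '%02X', ord } split //, $wrong;
printf "decoded:   %s\n", join ' ', map { sprintf '%02X', ord } split //, $right;
# undecoded: C2 80 C2 93 C2 94
# decoded:   E2 82 AC E2 80 9C E2 80 9D
```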

My current workaround is to call $val = Encode::decode('windows-1252', $val) on every column after every fetch. This works, but it hardly seems like the proper way to do it.
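For reference, the per-fetch workaround could be wrapped in a small helper along these lines. This is only a sketch: decode_row is a hypothetical name, not part of DBI or DBD::ODBC, and the sample row stands in for what $sth->fetchrow_arrayref would return.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: return a new row with every defined value decoded.
# NULL columns arrive as undef and are passed through unchanged.
sub decode_row {
    my ($row, $encoding) = @_;
    $encoding = 'windows-1252' unless defined $encoding;
    return [ map { defined $_ ? Encode::decode($encoding, $_) : undef } @$row ];
}

# Stand-in for a fetched row: two varchar values and one NULL.
my $row     = [ "\x80 100", undef, "\x93quoted\x94" ];
my $decoded = decode_row($row);
```

In a real fetch loop the call would be something like decode_row($sth->fetchrow_arrayref), which still has to be made explicitly after each fetch.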

Is there no way to tell DBI or DBD::ODBC to do this conversion for me?

I use ActivePerl (5.12.2 build 1202) with DBI (1.616) and DBD::ODBC (1.29), as provided by ActivePerl and updated using ppm. The script runs on the same server as the database (SQL Server 2008 R2).

My connection string:

 dbi:ODBC:Driver={SQL Server Native Client 10.0};Server=localhost;Database=$DB_NAME;Trusted_Connection=yes;

Thanks in advance.

sql-server perl dbi odbc

mscha May 6, '11 at 13:23

1 answer

bohica · Accepted Answer · 2011-05-06T15:36:39+0000

DBD::ODBC (and the ODBC API) does not know the character set of the underlying column, so DBD::ODBC cannot do anything with the 8-bit data returned; it can only return it as is, and you need to know what encoding it is in and decode it yourself. If you bind the columns as SQL_WCHAR / SQL_WVARCHAR, the driver / SQL Server should translate the characters to UCS-2, and DBD::ODBC should see the columns as SQL_WCHAR / SQL_WVARCHAR. When DBD::ODBC is built in Unicode mode, SQL_WCHAR columns are treated as UCS-2, decoded and re-encoded as UTF-8, and Perl should see them as Unicode characters.

You need to set SQL_WCHAR as the bind type after bind_columns, because bind types are not sticky the way parameter types are.

If you want to continue reading your varchar data, which is windows-1252, as bytes, then at present you have no choice but to decode it. I am in no hurry to add something to DBD::ODBC to do this for you, as this is the first time anyone has mentioned it to me. You might want to take a look at DBI callbacks, since decoding the returned data may be easier to do in those (say, on the fetch method).

You might also want to investigate the "Translate for character data" option in newer SQL Server ODBC drivers, although I have little experience with it.
