How to handle import of a file with UTF-8 encoding, codepage = 65001, to SQL server

In Norway, we have 3 very annoying characters, æøå, who create all kinds of problems. Starting with SQL Server 2008, Microsoft decided not to support code page 65001. I found a manageable solution to the problem of importing a UTF-8 file into a SQL server using OPENROWSET (BULK) and saving æøå tokens.

I created a powershell script that uses StreamReader and StreamWriter to convert a file from UTF-8 to standard ANSI encoding.

$filename = "C:\Test\UTF8_file.txt" $outfile = "C:\Test\ANSI_file.txt" $reader = new-object System.IO.StreamReader($filename, [System.Text.Encoding]::GetEncoding(65001)) $stream = new-object System.IO.StreamWriter($outfile, $false, [System.Text.Encoding]::Default) 

I am sharing the file of the first line, the title line, in the same process.

 $i=1 while(($line = $reader.ReadLine()) -ne $null) { if($i -gt 1) { $stream.WriteLine($line) } $i++ } $reader.Close() $stream.Close() 

Then I can use OPENROWSET to import the ANSI file into the SQL server and manipulate the data during this. Using codepage 1252, which equals Danish_Norwegian sorting.

  insert into SomeDatabase.dbo.SomeTable SELECT [companynumber] , case [role] when 'Styreformann' then 'Styreleder' when 'Styrets leder' then 'Styreleder' else rolle end as 'role' , case [representant] when 'Y' then '1' else '0' end as 'representant' , left((RIGHT('0000'+ CONVERT(VARCHAR,postnr),5)),4) end as 'postnr' , income*1000 as income , null as person2id FROM OPENROWSET( BULK 'C:\Test\ANSI_file.txt', FORMATFILE = 'C:\Test\FormatBulkInsert_file.xml' , CODEPAGE =1252 , ROWS_PER_BATCH = 50000 ) as v 

This method provided the correct display of Norwegian tokens. The format file is as follows:

 <?xml version="1.0"?> <BCPFORMAT xmlns="http://schemas.microsoft.com/sqlserver/2004/bulkload/format" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"> <RECORD> <FIELD ID="1" xsi:type="CharTerm" TERMINATOR=';"' /> <FIELD ID="2" xsi:type="CharTerm" TERMINATOR='";"' /> <FIELD ID="3" xsi:type="CharTerm" TERMINATOR='";"' /> <FIELD ID="4" xsi:type="CharTerm" TERMINATOR='";' /> <FIELD ID="5" xsi:type="CharTerm" TERMINATOR=';' /> <FIELD ID="6" xsi:type="CharTerm" TERMINATOR='\n' /> </RECORD> <ROW> <COLUMN SOURCE="1" NAME="companynumber" xsi:type="SQLINT"/> <COLUMN SOURCE="2" NAME="role" xsi:type="SQLNVARCHAR"/> <COLUMN SOURCE="3" NAME="representant" xsi:type="SQLBIT"/> <COLUMN SOURCE="4" NAME="postnr" xsi:type="SQLNVARCHAR"/> <COLUMN SOURCE="5" NAME="income" xsi:type="SQLDECIMAL"/> <COLUMN SOURCE="6" NAME="person2id" xsi:type="SQLINT"/> </ROW> </BCPFORMAT> 

Hope this helps someone else, because I spent quite a lot of time googleing before I found a way to solve this problem.

+6
source share
1 answer

Instead, convert to UTF16. This is NCHAR's proprietary format for SQL Server and allows full display of Unicode values.

To do this, you will need to specify SQLNCHAR or SQLNVARCHAR in the format file, and also be aware of the warning:

For a format file for working with a Unicode character data file, all input fields must be Unicode text strings (that is, Unicode strings with a fixed or lowercase character).

An alternative is to load it as binary data and use the CONVERT function to convert it from VARBINARY to NVARCHAR (this is UTF-16), and then to the desired code page as VARCHAR .

0
source

All Articles