In Norway, we have 3 very annoying characters, æøå, who create all kinds of problems. Starting with SQL Server 2008, Microsoft decided not to support code page 65001. I found a manageable solution to the problem of importing a UTF-8 file into a SQL server using OPENROWSET (BULK) and saving æøå tokens.
I created a powershell script that uses StreamReader and StreamWriter to convert a file from UTF-8 to standard ANSI encoding.
$filename = "C:\Test\UTF8_file.txt" $outfile = "C:\Test\ANSI_file.txt" $reader = new-object System.IO.StreamReader($filename, [System.Text.Encoding]::GetEncoding(65001)) $stream = new-object System.IO.StreamWriter($outfile, $false, [System.Text.Encoding]::Default)
I am sharing the file of the first line, the title line, in the same process.
$i=1 while(($line = $reader.ReadLine()) -ne $null) { if($i -gt 1) { $stream.WriteLine($line) } $i++ } $reader.Close() $stream.Close()
Then I can use OPENROWSET to import the ANSI file into the SQL server and manipulate the data during this. Using codepage 1252, which equals Danish_Norwegian sorting.
insert into SomeDatabase.dbo.SomeTable SELECT [companynumber] , case [role] when 'Styreformann' then 'Styreleder' when 'Styrets leder' then 'Styreleder' else rolle end as 'role' , case [representant] when 'Y' then '1' else '0' end as 'representant' , left((RIGHT('0000'+ CONVERT(VARCHAR,postnr),5)),4) end as 'postnr' , income*1000 as income , null as person2id FROM OPENROWSET( BULK 'C:\Test\ANSI_file.txt', FORMATFILE = 'C:\Test\FormatBulkInsert_file.xml' , CODEPAGE =1252 , ROWS_PER_BATCH = 50000 ) as v
This method provided the correct display of Norwegian tokens. The format file is as follows:
<?xml version="1.0"?> <BCPFORMAT xmlns="http://schemas.microsoft.com/sqlserver/2004/bulkload/format" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"> <RECORD> <FIELD ID="1" xsi:type="CharTerm" TERMINATOR=';"' /> <FIELD ID="2" xsi:type="CharTerm" TERMINATOR='";"' /> <FIELD ID="3" xsi:type="CharTerm" TERMINATOR='";"' /> <FIELD ID="4" xsi:type="CharTerm" TERMINATOR='";' /> <FIELD ID="5" xsi:type="CharTerm" TERMINATOR=';' /> <FIELD ID="6" xsi:type="CharTerm" TERMINATOR='\n' /> </RECORD> <ROW> <COLUMN SOURCE="1" NAME="companynumber" xsi:type="SQLINT"/> <COLUMN SOURCE="2" NAME="role" xsi:type="SQLNVARCHAR"/> <COLUMN SOURCE="3" NAME="representant" xsi:type="SQLBIT"/> <COLUMN SOURCE="4" NAME="postnr" xsi:type="SQLNVARCHAR"/> <COLUMN SOURCE="5" NAME="income" xsi:type="SQLDECIMAL"/> <COLUMN SOURCE="6" NAME="person2id" xsi:type="SQLINT"/> </ROW> </BCPFORMAT>
Hope this helps someone else, because I spent quite a lot of time googleing before I found a way to solve this problem.