Parse CSV / TSV file in Haskell - Unicode characters

Question


I am trying to parse a tab-delimited file using cassava / Data.Csv in Haskell. However, parsing fails when the file contains "Unicode" characters: I get a parse error (endOfInput).

According to the command-line tool “file”, my file's encoding is “Unicode UTF-8”. My Haskell code looks like this:

{-# LANGUAGE ScopedTypeVariables #-}
{-# LANGUAGE OverloadedStrings #-}

import qualified Data.ByteString as C
import qualified System.IO.UTF8 as U
import qualified Data.ByteString.UTF8 as UB
import qualified Data.ByteString.Lazy.Char8 as DL
import qualified Codec.Binary.UTF8.String as US
import qualified Data.Text.Lazy.Encoding as EL
import qualified Data.ByteString.Lazy as L

import Data.Text.Encoding as E

-- Handle CSV / TSV files with ...
import Data.Csv
import qualified Data.Vector as V

import Data.Char -- ord

csvFile :: FilePath
csvFile = "myFile.txt"

-- Set delimiter to \t (tabulator)
myOptions = defaultDecodeOptions {
              decDelimiter = fromIntegral (ord '\t')
            }

main :: IO ()
main = do
  csvData <- L.readFile csvFile 
  case EL.decodeUtf8' csvData of 
   Left err -> print err
   Right dat ->
     case decodeWith myOptions NoHeader $ EL.encodeUtf8 dat of
       Left err -> putStrLn err
       Right v -> V.forM_ v $ \ (category :: String ,
                               user :: String ,
                               date :: String,
                               time :: String,
                               message :: String) -> do
         print message

I tried using decodeUtf8', pre-processing (filtering) the input with predicates from Data.Char, and much more. However, the endOfInput error persists.

My CSV file looks like this:

a   -   -   -   RT USE " Kenny" • Hahahahahahahahaha. #Emmen #Brandstapel
a   -   -   -   Uhm .. wat dan ook ????!!!! 👋

Or more literally:

a\t-\t-\t-\tRT USE " Kenny" • Hahahahahahahahaha. #Emmen #Brandstapel
a\t-\t-\t-\tUhm .. wat dan ook ????!!!! 👋

How can I parse characters such as 👋 and • (and others like them) with cassava/Data.Csv?

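EDIT: The problem turned out to be the unescaped double quotes rather than the Unicode characters (see tibbe's answer). Quoting every field and doubling the embedded quotes before decoding works: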

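-- Needs the OverloadedStrings pragma (enabled above) for the Text literals.
import qualified Data.Text as T

-- Wrap the whole input in quotes; escaper closes and reopens the quotes
-- around every tab and newline, so each field ends up quoted.
preprocess :: T.Text -> T.Text
preprocess txt = T.cons '\"' $ T.snoc escaped '\"'
  where escaped = T.concatMap escaper txt

escaper :: Char -> T.Text
escaper c
  | c == '\t' = "\"\t\""   -- close the field, emit the delimiter, open the next field
  | c == '\n' = "\"\n\""   -- same at a record boundary
  | c == '\"' = "\"\""     -- double an embedded quote
  | otherwise = T.singleton c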
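
For completeness, here is a minimal sketch of how this preprocessing step could be wired into the decoding pipeline. The names Row, tsvOptions, decodeTsv and sample are made up for illustration, and the sample input is built in memory rather than read from myFile.txt; in a real program the bytes would come from something like Data.ByteString.Lazy.readFile.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import qualified Data.ByteString.Lazy as BL
import Data.Char (ord)
import Data.Csv (DecodeOptions (..), HasHeader (NoHeader), decodeWith, defaultDecodeOptions)
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE
import qualified Data.Vector as V

-- Hypothetical row type: five text columns, as in the question.
type Row = (String, String, String, String, String)

-- Quote every field and double embedded quotes (same idea as the EDIT above).
preprocess :: T.Text -> T.Text
preprocess txt = T.cons '"' (T.snoc (T.concatMap escaper txt) '"')

escaper :: Char -> T.Text
escaper c
  | c == '\t' = "\"\t\""
  | c == '\n' = "\"\n\""
  | c == '"'  = "\"\""
  | otherwise = T.singleton c

-- Tab as the field delimiter.
tsvOptions :: DecodeOptions
tsvOptions = defaultDecodeOptions { decDelimiter = fromIntegral (ord '\t') }

-- Decode UTF-8 bytes to Text, quote the fields, re-encode, then hand to cassava.
decodeTsv :: BL.ByteString -> Either String (V.Vector Row)
decodeTsv raw =
  case TE.decodeUtf8' (BL.toStrict raw) of
    Left err  -> Left (show err)
    Right txt -> decodeWith tsvOptions NoHeader
                   (BL.fromStrict (TE.encodeUtf8 (preprocess txt)))

-- In-memory stand-in for the file contents (assumption: no trailing newline).
sample :: BL.ByteString
sample = BL.fromStrict (TE.encodeUtf8
  "a\t-\t-\t-\tRT USE \" Kenny\" • Haha\na\t-\t-\t-\tUhm .. wat dan ook 👋")

main :: IO ()
main =
  case decodeTsv sample of
    Left err   -> putStrLn err
    Right rows -> V.forM_ rows $ \(_, _, _, _, message) -> print message
```

Note that this approach only works if the raw fields never contain tabs or newlines themselves, since every tab and newline is treated as a boundary. If the input ends with a newline, preprocess leaves one extra empty quoted field at the end; I have not checked how cassava handles that, so it is worth testing on a real file.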


haskell csv

Pold 22 . '14 3:23


tibbe · Accepted Answer · 2014-10-24T05:58:44+0000

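The problem is that your input is not valid CSV:

Fields that contain quotes, delimiters (tabs, in your case), or newlines must be quoted.
(Any quotes inside a quoted field must also be escaped by doubling them.)

So, for your first line, the last field should look like this: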

a   -   -   -   "RT USE "" Kenny"" • Hahahahahahahahaha. #Emmen #Brandstapel"
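
If the raw data cannot be fixed at the source, the quoting rule can be applied per field before parsing. The following is only a sketch: quoteField and quoteLine are hypothetical helper names, and it assumes that no field contains a tab or a newline. Unlike the line above, it quotes every field rather than only the last one, which should be equally acceptable to a CSV parser.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import qualified Data.Text as T
import qualified Data.Text.IO as TIO
import System.IO (hSetEncoding, stdout, utf8)

-- Quote one field and double any quotes inside it.
quoteField :: T.Text -> T.Text
quoteField field = T.concat ["\"", T.replace "\"" "\"\"" field, "\""]

-- Quote every tab-separated field of a single line.
quoteLine :: T.Text -> T.Text
quoteLine = T.intercalate "\t" . map quoteField . T.splitOn "\t"

main :: IO ()
main = do
  -- Force UTF-8 output so the bullet prints regardless of the locale.
  hSetEncoding stdout utf8
  TIO.putStrLn (quoteLine "a\t-\t-\t-\tRT USE \" Kenny\" • Hahahahahahahahaha. #Emmen #Brandstapel")
```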

Here is a working example:

{-# LANGUAGE OverloadedStrings #-}

import Data.ByteString.Lazy (fromStrict)
import Data.Char (ord)
import Data.Csv
import Data.Text.Encoding (encodeUtf8)
import Data.Vector (Vector)

-- Decode one tab-separated record whose last field is quoted,
-- with the embedded quotes doubled.
test :: Either String (Vector (String, String, String, String, String))
test = decodeWith
    defaultDecodeOptions {decDelimiter = fromIntegral $ ord '\t' }
    NoHeader
    (fromStrict $ encodeUtf8 "a\t-\t-\t-\t\"RT USE \"\" Kenny\"\" • Hahahahahahahahaha. #Emmen #Brandstapel\"")

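(Note that I apply encodeUtf8 to a Text literal rather than using a ByteString literal directly. The IsString instance for ByteStrings, which is what converts a string literal into a ByteString, truncates every Unicode code point to 8 bits.)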
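
To make the truncation concrete, here is a small sketch comparing the two routes for the bullet character (U+2022). It uses Data.ByteString.Char8.pack, which is documented to truncate each Char to 8 bits; that the IsString instance behaves the same way is the assumption described in the note above. The names bullet, truncatedBytes and utf8Bytes are only for illustration.

```haskell
import qualified Data.ByteString as B
import qualified Data.ByteString.Char8 as BC
import qualified Data.Text as T
import Data.Text.Encoding (encodeUtf8)
import Data.Word (Word8)

-- The bullet character from the sample data.
bullet :: Char
bullet = '\x2022'

-- Char8.pack keeps only the low 8 bits of each code point.
truncatedBytes :: [Word8]
truncatedBytes = B.unpack (BC.pack [bullet])

-- encodeUtf8 produces the real three-byte UTF-8 sequence.
utf8Bytes :: [Word8]
utf8Bytes = B.unpack (encodeUtf8 (T.singleton bullet))

main :: IO ()
main = do
  print truncatedBytes  -- [34]
  print utf8Bytes       -- [226,128,162]
```

Byte 34 is the ASCII double quote, so a truncated bullet would itself show up in the input as a stray quote character.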
