I am trying to parse a tab-delimited file using cassava (Data.Csv) in Haskell. However, parsing fails when the file contains certain "Unicode" characters: I get a parse error (endOfInput).
According to the command-line tool "file", my file is encoded as "Unicode UTF-8". My Haskell code looks like this:
{-# LANGUAGE ScopedTypeVariables #-}
{-# LANGUAGE OverloadedStrings #-}

import qualified Data.ByteString as C
import qualified System.IO.UTF8 as U
import qualified Data.ByteString.UTF8 as UB
import qualified Data.ByteString.Lazy.Char8 as DL
import qualified Codec.Binary.UTF8.String as US
import qualified Data.Text.Lazy.Encoding as EL
import qualified Data.ByteString.Lazy as L
import Data.Text.Encoding as E
-- Handle CSV / TSV files with ...
import Data.Csv
import qualified Data.Vector as V
import Data.Char -- ord

csvFile :: FilePath
csvFile = "myFile.txt"

-- Set delimiter to \t (tabulator)
myOptions = defaultDecodeOptions {
    decDelimiter = fromIntegral (ord '\t')
  }

main :: IO ()
main = do
    csvData <- L.readFile csvFile
    case EL.decodeUtf8' csvData of
        Left err -> print err
        Right dat ->
            case decodeWith myOptions NoHeader $ EL.encodeUtf8 dat of
                Left err -> putStrLn err
                Right v -> V.forM_ v $ \ (category :: String ,
                                          user :: String ,
                                          date :: String,
                                          time :: String,
                                          message :: String) -> do
                    print message
I tried using decodeUtf8', preprocessing (filtering) the input with predicates from Data.Char, and much more. However, the endOfInput error persists.
My CSV file looks like this:
a - - - RT USE " Kenny" • Hahahahahahahahaha.
a - - - Uhm .. wat dan ook ????!!!! π
Or more literally:
a\t-\t-\t-\tRT USE " Kenny" • Hahahahahahahahaha.
a\t-\t-\t-\tUhm .. wat dan ook ????!!!! π
The problem characters seem to be π and • (Unicode characters). What can I do so that cassava/Data.Csv reads them?
EDIT:
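The problem turned out to be the unescaped double quote in the input rather than the Unicode characters (thanks to tibbe). A preprocessing step that wraps every field in quotes and doubles any embedded quote works around it: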
import qualified Data.Text as T

-- Wrap every field in double quotes and double any embedded quote
-- (the Text literals below require OverloadedStrings).
preprocess :: T.Text -> T.Text
preprocess txt = T.cons '\"' $ T.snoc escaped '\"'
  where escaped = T.concatMap escaper txt

escaper :: Char -> T.Text
escaper c
  | c == '\t' = "\"\t\""
  | c == '\n' = "\"\n\""
  | c == '\"' = "\"\""
  | otherwise = T.singleton c
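
For completeness, here is a minimal sketch of how a quoting step like this might be wired into the decoding pipeline. It is an illustration under assumptions, not a tested drop-in: the names `Row`, `tsvOptions`, `quoteFields`, `decodeTsv` and `sample` are made up for this example, the input is read as a strict ByteString rather than a lazy one, and the sample data has no trailing newline.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import           Data.Char            (ord)
import qualified Data.ByteString      as B
import qualified Data.ByteString.Lazy as L
import           Data.Csv             (DecodeOptions (..), HasHeader (NoHeader),
                                       decodeWith, defaultDecodeOptions)
import qualified Data.Text            as T
import qualified Data.Text.Encoding   as E
import qualified Data.Vector          as V

-- One row of the file: category, user, date, time, message.
-- (Hypothetical alias, introduced only for this sketch.)
type Row = (String, String, String, String, String)

-- cassava's default decoding options with a tab as the delimiter.
tsvOptions :: DecodeOptions
tsvOptions = defaultDecodeOptions { decDelimiter = fromIntegral (ord '\t') }

-- Same idea as the preprocessing function: wrap every field in double
-- quotes and double any quote that appears inside a field.
quoteFields :: T.Text -> T.Text
quoteFields txt = T.cons '"' (T.snoc (T.concatMap escaper txt) '"')
  where
    escaper '\t' = "\"\t\""
    escaper '\n' = "\"\n\""
    escaper '"'  = "\"\""
    escaper c    = T.singleton c

-- Decode the raw bytes as UTF-8, quote the fields, re-encode the text,
-- and hand the result to cassava.
decodeTsv :: B.ByteString -> Either String (V.Vector Row)
decodeTsv bytes =
  case E.decodeUtf8' bytes of
    Left err  -> Left (show err)
    Right txt -> decodeWith tsvOptions NoHeader
                   (L.fromStrict (E.encodeUtf8 (quoteFields txt)))

-- Hypothetical in-memory input modelled on the two lines from the question
-- (without the emoji), so no file is needed.
sample :: B.ByteString
sample = E.encodeUtf8
  "a\t-\t-\t-\tRT USE \" Kenny\" • Hahahahahahahahaha.\na\t-\t-\t-\tUhm .. wat dan ook ????!!!!"
```

Loading this in GHCi and evaluating `decodeTsv sample` should give a `Right` holding two rows, with the quotes around Kenny preserved in the message field. For a real file, something like `B.readFile csvFile` could feed `decodeTsv` in place of `sample`.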