Performance of FSharp.Data CsvProvider

I have a CSV file with 6 columns and 678,552 lines. Unfortunately, I cannot share any sample data, but the types are simple: int64, int64, date, date, string, string, and there are no missing values.

Time to load this data into a data frame in R using read.table: ~3 seconds.

Time to load this data using CsvFile.Load in F#: ~3 seconds.
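For reference, the CsvFile path looks roughly like this (the file path is a placeholder for my actual file):

    open FSharp.Data

    // Untyped access: rows come back as strings, no per-column type conversion.
    let csv = CsvFile.Load("all_diagnoses.csv", separators = "|")
    printfn "%d rows" (csv.Rows |> Seq.length)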

Time to load this data into a Deedle frame in F#: ~7 seconds.

Adding inferTypes=false and providing a schema to Deedle's Frame.ReadCsv reduces the time to ~3 seconds.
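Roughly, the Deedle call I am timing looks like this (the path, column order and schema string are illustrative; I am not certain of the exact type names the schema string accepts):

    open Deedle

    // Skip type inference and declare the column types up front.
    let df =
        Frame.ReadCsv(
            "all_diagnoses.csv",
            separators = "|",
            inferTypes = false,
            schema = "int64,int64,DateTime,DateTime,string,string")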

Time to load this data using CsvProvider in F#: ~5 minutes.

It is still ~5 minutes even after I specify the types in the Schema parameter, which should presumably save F# the time of inferring them.
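For completeness, this is roughly how I specify the schema (the path is shortened, the column names are mine, and the syntax is my reading of the Schema parameter):

    let [<Literal>] SCHEMA =
        "Id (int64), ForeignId (int64), BirthDate (date), OtherDate (date), FirstName (string), LastName (string)"

    type TypedDiagnoses =
        CsvProvider<"all_diagnoses.csv", Separators="|", Culture="sv-SE", Schema=SCHEMA>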

I understand that a type provider needs to do much more than R or CsvFile.Load to parse the data into the correct types, but I am surprised by the 100x speed penalty. Even more confusing is the time Deedle takes to load the data, since it also needs to infer types and cast values accordingly, arrange them into Series, etc. I would really expect Deedle to take longer than CsvProvider.

In this question, the poor performance of CsvProvider was caused by a large number of columns, which is not my case.

I am wondering if I am doing something wrong or if there is some way to speed things up a bit.

Just to clarify: creating the provider is nearly instantaneous. It is when I force evaluation of the generated sequence, by running Seq.length df.Rows, that it takes ~5 minutes for fsharpi to come back.
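Concretely, what I time in fsharpi is just forcing the row sequence of the provider instance (the diagnoses value from the code below):

    #time "on"
    diagnoses.Rows |> Seq.length   // this single expression is what takes ~5 minutes to come back
    #time "off"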

I am on a Linux system, F# 4.1 on Mono 4.6.1.

Here is the code for the CsvProvider:

    let [<Literal>] SEP = "|"
    let [<Literal>] CULTURE = "sv-SE"
    let [<Literal>] DATAFILE = dataroot + "all_diagnoses.csv"

    type DiagnosesProvider = CsvProvider<DATAFILE, Separators=SEP, Culture=CULTURE>
    let diagnoses = DiagnosesProvider()

EDIT1: I added the time that Deedle takes to load data into a frame.

EDIT2: Added the time Deedle takes if inferTypes=false and a schema is provided.

Also, setting CacheRows=false on the CsvProvider, as suggested in the comments, has no tangible effect on load time.
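For reference, that attempt looked like this (same literals as in the code above; the type name is just for illustration):

    type DiagnosesProviderNoCache =
        CsvProvider<DATAFILE, Separators=SEP, Culture=CULTURE, CacheRows=false>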

EDIT3: Alright, we're getting somewhere. For some peculiar reason, Culture seems to be the culprit. If I omit this argument, CsvProvider loads the data in ~7 seconds. I'm not sure what could be causing this. My system locale is en_US, but the data comes from a SQL Server with a Swedish locale, where the decimal separator is "," instead of ".". This particular dataset has no decimals, so I can drop Culture entirely, but another set has 2 decimal columns and more than 1,000,000 rows. My next task is to test this on a Windows system, which I don't have at hand right now.

EDIT4: The problem seems to be resolved, but I still don't understand what causes it. If I change the culture "globally" by doing:

    open System.Globalization

    // Set the culture for the whole process / current thread instead of per provider
    System.Globalization.CultureInfo.DefaultThreadCurrentCulture <- CultureInfo("sv-SE")
    System.Threading.Thread.CurrentThread.CurrentCulture <- CultureInfo("sv-SE")

and then remove the Culture="sv-SE" argument from the CsvProvider, the load time drops to ~6 seconds and the decimals are parsed correctly. I'll leave this open in case anyone can explain this behavior.

2 answers

The problem was caused by the fact that CsvProvider does not memoize the explicitly specified Culture. The problem was resolved with this pull request.
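A rough sketch of the idea behind such a fix (my own illustration, not the code from the pull request; the names cultureCache and getCulture are mine): construct the CultureInfo once per culture name and reuse it for every parsed value.

    open System
    open System.Collections.Concurrent
    open System.Globalization

    // Memoize cultures by name so parsing never sees a freshly constructed CultureInfo.
    let cultureCache = ConcurrentDictionary<string, CultureInfo> ()

    let getCulture (cultureStr : string) =
        if String.IsNullOrWhiteSpace cultureStr then CultureInfo.InvariantCulture
        else cultureCache.GetOrAdd (cultureStr, fun name -> CultureInfo name)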


I tried to reproduce the problem you are seeing; since you cannot share the data, I created some test data of my own. However, on my machine (.NET 4.6.2, F# 4.1) I do not see it taking minutes; it takes a few seconds.

Perhaps you can try my sample application in your setup, see how it performs, and we can work from there?

    open System
    open System.Diagnostics
    open System.IO

    let clock =
        let sw = Stopwatch ()
        sw.Start ()
        fun () -> sw.ElapsedMilliseconds

    let time a =
        let before = clock ()
        let v = a ()
        let after = clock ()
        after - before, v

    let generateDataSet () =
        let random    = Random 19740531
        let firstDate = DateTime(1970, 1, 1)
        let randomInt  () = random.Next () |> int64 |> (+) 10000000000L |> string
        let randomDate () = (firstDate + (random.Next () |> float |> TimeSpan.FromSeconds)).ToString("s")
        let randomString () =
            let inline valid ch =
                match ch with
                | '"'
                | '\\' -> ' '
                | _    -> ch
            let c = random.Next () % 16
            let g i = if i = 0 || i = c + 1 then '"' else 32 + random.Next() % (127 - 32) |> char |> valid
            Array.init (c + 2) g |> String
        let columns =
            [|
                "Id"        , randomInt
                "ForeignId" , randomInt
                "BirthDate" , randomDate
                "OtherDate" , randomDate
                "FirstName" , randomString
                "LastName"  , randomString
            |]
        use sw = new StreamWriter ("perf.csv")
        let headers = columns |> Array.map fst |> String.concat ";"
        sw.WriteLine headers
        for i = 0 to 700000 do
            let values = columns |> Array.map (fun (_, f) -> f ()) |> String.concat ";"
            sw.WriteLine values

    open FSharp.Data

    [<Literal>]
    let sample = """Id;ForeignId;BirthDate;OtherDate;FirstName;LastName
11795679844;10287417237;2028-09-14T20:33:17;1993-07-21T17:03:25;", xS@ %aY)N*})Z";"ZP~;"
11127366946;11466785219;2028-02-22T08:39:57;2026-01-24T05:07:53;"H-/QA(";"g8}J?k~"
"""

    type PerfFile = CsvProvider<sample, ";">

    let readDataWithTp () =
        use streamReader = new StreamReader ("perf.csv")
        let csvFile = PerfFile.Load streamReader
        let length = csvFile.Rows |> Seq.length
        printfn "%A" length

    [<EntryPoint>]
    let main argv =
        Environment.CurrentDirectory <- AppDomain.CurrentDomain.BaseDirectory
        printfn "Generating dataset..."
        let ms, _ = time generateDataSet
        printfn "  took %d ms" ms
        printfn "Reading dataset..."
        let ms, _ = time readDataWithTp
        printfn "  took %d ms" ms
        0

Performance numbers (.NET 4.6.2 on my desktop):

    Generating dataset...
      took 2162 ms
    Reading dataset...
      took 6156 ms

Performance numbers (Mono 4.6.2 on my Macbook Pro):

    Generating dataset...
      took 4432 ms
    Reading dataset...
      took 8304 ms

Update

It turns out that explicitly specifying a Culture for a CsvProvider degrades performance significantly. It can be any culture, not just sv-SE, but why?

If one inspects the code the provider generates in the fast and slow cases, the difference stands out:

Fast

    internal sealed class csvFile@78
    {
        internal System.Tuple<long, long, System.DateTime, System.DateTime, string, string> Invoke(object arg1, string[] arg2)
        {
            Microsoft.FSharp.Core.FSharpOption<string> fSharpOption = TextConversions.AsString(arg2[0]);
            long arg_C9_0 = TextRuntime.GetNonOptionalValue<long>("Id", TextRuntime.ConvertInteger64("", fSharpOption), fSharpOption);
            fSharpOption = TextConversions.AsString(arg2[1]);
            long arg_C9_1 = TextRuntime.GetNonOptionalValue<long>("ForeignId", TextRuntime.ConvertInteger64("", fSharpOption), fSharpOption);
            fSharpOption = TextConversions.AsString(arg2[2]);
            System.DateTime arg_C9_2 = TextRuntime.GetNonOptionalValue<System.DateTime>("BirthDate", TextRuntime.ConvertDateTime("", fSharpOption), fSharpOption);
            fSharpOption = TextConversions.AsString(arg2[3]);
            System.DateTime arg_C9_3 = TextRuntime.GetNonOptionalValue<System.DateTime>("OtherDate", TextRuntime.ConvertDateTime("", fSharpOption), fSharpOption);
            fSharpOption = TextConversions.AsString(arg2[4]);
            string arg_C9_4 = TextRuntime.GetNonOptionalValue<string>("FirstName", TextRuntime.ConvertString(fSharpOption), fSharpOption);
            fSharpOption = TextConversions.AsString(arg2[5]);
            return new System.Tuple<long, long, System.DateTime, System.DateTime, string, string>(arg_C9_0, arg_C9_1, arg_C9_2, arg_C9_3, arg_C9_4, TextRuntime.GetNonOptionalValue<string>("LastName", TextRuntime.ConvertString(fSharpOption), fSharpOption));
        }
    }

Slow

    internal sealed class csvFile@78
    {
        internal System.Tuple<long, long, System.DateTime, System.DateTime, string, string> Invoke(object arg1, string[] arg2)
        {
            Microsoft.FSharp.Core.FSharpOption<string> fSharpOption = TextConversions.AsString(arg2[0]);
            long arg_C9_0 = TextRuntime.GetNonOptionalValue<long>("Id", TextRuntime.ConvertInteger64("sv-SE", fSharpOption), fSharpOption);
            fSharpOption = TextConversions.AsString(arg2[1]);
            long arg_C9_1 = TextRuntime.GetNonOptionalValue<long>("ForeignId", TextRuntime.ConvertInteger64("sv-SE", fSharpOption), fSharpOption);
            fSharpOption = TextConversions.AsString(arg2[2]);
            System.DateTime arg_C9_2 = TextRuntime.GetNonOptionalValue<System.DateTime>("BirthDate", TextRuntime.ConvertDateTime("sv-SE", fSharpOption), fSharpOption);
            fSharpOption = TextConversions.AsString(arg2[3]);
            System.DateTime arg_C9_3 = TextRuntime.GetNonOptionalValue<System.DateTime>("OtherDate", TextRuntime.ConvertDateTime("sv-SE", fSharpOption), fSharpOption);
            fSharpOption = TextConversions.AsString(arg2[4]);
            string arg_C9_4 = TextRuntime.GetNonOptionalValue<string>("FirstName", TextRuntime.ConvertString(fSharpOption), fSharpOption);
            fSharpOption = TextConversions.AsString(arg2[5]);
            return new System.Tuple<long, long, System.DateTime, System.DateTime, string, string>(arg_C9_0, arg_C9_1, arg_C9_2, arg_C9_3, arg_C9_4, TextRuntime.GetNonOptionalValue<string>("LastName", TextRuntime.ConvertString(fSharpOption), fSharpOption));
        }
    }

More specifically, this difference:

    // Fast
    TextRuntime.ConvertDateTime("", fSharpOption), fSharpOption)

    // Slow
    TextRuntime.ConvertDateTime("sv-SE", fSharpOption), fSharpOption)

When we specify a culture, it is passed to ConvertDateTime, which in turn passes it to GetCulture:

    static member GetCulture(cultureStr) =
        if String.IsNullOrWhiteSpace cultureStr
        then CultureInfo.InvariantCulture
        else CultureInfo cultureStr

This means that in the default case we use CultureInfo.InvariantCulture, but in every other case we create a new CultureInfo object for each field of each row. Caching could be done, but it is not. Constructing the object does not itself seem to take much time, but something expensive happens when we parse with a freshly created CultureInfo object every time.
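As an aside, and purely as my own sketch rather than what FSharp.Data currently does (getCultureCached is an illustrative name), the framework itself offers a cached lookup that would sidestep the fresh-object problem:

    open System
    open System.Globalization

    // CultureInfo.GetCultureInfo returns a cached, read-only instance per culture name,
    // so every caller shares the same object (and its already-initialized format data)
    // instead of parsing with a brand new CultureInfo each time.
    let getCultureCached cultureStr =
        if String.IsNullOrWhiteSpace cultureStr then CultureInfo.InvariantCulture
        else CultureInfo.GetCultureInfo cultureStr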

Parsing a DateTime in FSharp.Data essentially boils down to this:

    let dateTimeStyles = DateTimeStyles.AllowWhiteSpaces ||| DateTimeStyles.RoundtripKind
    match DateTime.TryParse(text, cultureInfo, dateTimeStyles) with

So let's run a performance test where one case uses a cached CultureInfo object and the other creates a fresh one every time.

    open System
    open System.Diagnostics
    open System.Globalization

    let clock =
        let sw = Stopwatch ()
        sw.Start ()
        fun () -> sw.ElapsedMilliseconds

    let time a =
        let before = clock ()
        let v = a ()
        let after = clock ()
        after - before, v

    let perfTest c cf () =
        let dateTimeStyles = DateTimeStyles.AllowWhiteSpaces ||| DateTimeStyles.RoundtripKind
        let text = DateTime.Now.ToString ("", cf ())
        for i = 1 to c do
            let culture = cf ()
            DateTime.TryParse(text, culture, dateTimeStyles) |> ignore

    [<EntryPoint>]
    let main argv =
        Environment.CurrentDirectory <- AppDomain.CurrentDomain.BaseDirectory
        let ct    = "sv-SE"
        let cct   = CultureInfo ct
        let count = 10000

        printfn "Using cached CultureInfo object..."
        let ms, _ = time (perfTest count (fun () -> cct))
        printfn "  took %d ms" ms

        printfn "Using fresh CultureInfo object..."
        let ms, _ = time (perfTest count (fun () -> CultureInfo ct))
        printfn "  took %d ms" ms

        0

Performance numbers (.NET 4.6.2, F# 4.1):

    Using cached CultureInfo object...
      took 16 ms
    Using fresh CultureInfo object...
      took 5328 ms

Thus, caching a CultureInfo object in FSharp.Data should significantly improve the performance of CsvProvider when specifying a culture.

