Performance problem using the CSV type provider from FSharp.Data

I am trying to learn more about the FSharp.Data project, using it to read a CSV file. The CSV file is a simplified version of the data from the Kaggle digit recognition contest.

When I read a CSV file containing 785 columns and 113 lines (including the header line), the following two lines of code run very slowly:

type trainingSet = CsvProvider<"Data/trainSmall.csv", ",", CacheRows=false>
let data = trainingSet.Load("Data/trainSmall.csv")

When I send the first line to F# Interactive, it returns in about 10 seconds. When I send the second line to F# Interactive, it takes more than 5 minutes before the interactive prompt responds.

I am running the code on my 2013 MacBook Pro with a 2.6 GHz i5 processor and 16 GB of RAM, using F# 3.0 and Xamarin Studio. I tried the same experiment with Windows 7 / VS2013 running in a VM on the same hardware, and the results are comparable. When I use the same machine and do the same thing with R, it is so fast that I can't time it with an ordinary watch.

Please advise me on the proper use of the CSV type provider from FSharp.Data!

2 answers

I recommend not using CsvProvider for this. You are loading a matrix, so you get no benefit from having the type of each column inferred, since they are all the same. You can still use the CSV parser from F# Data through CsvFile. CsvProvider is optimized for files with few columns but potentially many rows. The generated code will try to create a tuple with 785 elements in your example, which just won't work.
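
A minimal sketch of the CsvFile approach, under some assumptions that are not stated in the question: FSharp.Data is already referenced, every cell parses as an integer, and the label sits in the first column with the pixel values after it. The names rows, labels and pixels are only illustrative.

```fsharp
// Assumes FSharp.Data is already referenced
// (in a recent F# script: #r "nuget: FSharp.Data").
open FSharp.Data

// Untyped CSV parser: no per-column types are generated,
// so no 785-element tuple is involved.
// A relative path is resolved against the current directory.
let csv = CsvFile.Load("Data/trainSmall.csv")

// Assumption: every cell is an integer.
// Each row becomes an int[]; the header line is skipped by default.
let rows =
    csv.Rows
    |> Seq.map (fun row -> row.Columns |> Array.map int)
    |> Seq.toArray

// Assumption: the first column holds the label, the rest are pixel values.
let labels = rows |> Array.map (fun r -> r.[0])
let pixels = rows |> Array.map (fun r -> r.[1..])
```

Since all columns have the same type, plain arrays are probably a more natural representation for this data than a provided row type.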


Hmm, the second line is supposed to do nothing, since the rows are read on demand. Something is wrong; can you open an issue on GitHub with a repro file?
