F # ref-mutable vars vs object fields

I am writing a parser in F # and it should be as fast as possible (I hope to parse a 100 MB file in less than a minute). As usual, it uses mutable variables to store the next available character and the next available token (i.e., both the lexer and the analyzer itself use one viewing unit).

My current partial implementation uses local variables for this. Since closure variables cannot be mutable (does anyone know the reason for this?) I declared them as ref:

let rec read file includepath = let c = ref ' ' let k = ref NONE let sb = new StringBuilder() use stream = File.OpenText file let readc() = c := stream.Read() |> char // etc 

I guess this has some overhead (not so much, I know, but I try for maximum speed here) and it is a little inelegant. The most obvious alternative would be to create an object of the parser class and variable variables in it. Does anyone know what could be faster? Is there any consensus that is considered more / more idiomatic? Is there any other option that I am missing?

+3
closures parsing f #
May 31 '10 at 9:29 a.m.
source share
2 answers

You mentioned that local changed values ​​cannot be captured by closure, so you need to use ref . The reason for this is that mutable values ​​fixed in the closure must be allocated on the heap (since the closure is allocated on the heap).

F # forces you to write this explicitly (using ref ). In C #, you can “capture a mutable variable”, but the compiler translates it into a field in the heaped object behind the scene, so it will still be on the heap.

Summary: If you want to use closure, variable variables must be allocated on the heap.

Now, regarding your code - your implementation uses ref , which creates a small object for each mutable variable that you use. An alternative would be to create a single object with several mutable fields. Using the entries, you can write:

 type ReadClosure = { mutable c : char mutable k : SomeType } // whatever type you use here let rec read file includepath = let state = { c = ' '; k = NONE } // ... let readc() = state.c <- stream.Read() |> char // etc... 

This may be a little more efficient because you select one object instead of several objects, but I do not expect the difference to be noticeable.

There is also one confusion in your code - the stream value will be chosen after the read function returns, so the call to stream.Read may not be valid (if you call readc after read completes).

 let rec read file includepath = let c = ref ' ' use stream = File.OpenText file let readc() = c := stream.Read() |> char readc let f = read a1 a2 f() // This would fail! 

I'm not quite sure how you actually use readc , but this can be a problem for thought. Also, if you only declare this as closing the helper, you could probably rewrite the code without closing (or write it explicitly using tail-recursion, which translates into an imperative loop with mutable variables) to avoid any allocations.

+5
May 31 '10 at 12:09
source share

I did the following profiling:

 let test() = tic() let mutable a = 0.0 for i=1 to 10 do for j=1 to 10000000 do a <- a + float j toc("mutable") let test2() = tic() let a = ref 0.0 for i=1 to 10 do for j=1 to 10000000 do a := !a + float j toc("ref") 

the average for mutable is 50 ms and ref is 600 ms. The difference in performance is due to the fact that mutable variables are on the stack, and ref variables are on the managed heap.

The relative difference is great. However, 10-8 times access is a large number. And the total time is acceptable. So don’t worry about the performance of ref variables. And remember:

Premature optimization is the root of all evil.

My advice: first you finish your parser, and then think about optimizing it. You will not know where the bottom is located until you actually run the program. One good thing about F # is that its concise syntax and functional style support code refactoring well. Once the code is executed, optimization will be convenient. Here is an example of profiling.

Another example, we use .net arrays every day, which is also in a managed heap:

 let test3() = tic() let a = Array.create 1 0.0 for i=1 to 10 do for j=1 to 10000000 do a.[0] <- a.[0] + float j toc("array") 

test3 () works in much the same way as ref. If you worry too many variables in the managed heap, then you will no longer use the array.

+4
May 31 '10 at 10:00
source share



All Articles