Idiomatic (functional) file processing in Scala

I would like an elegant pipeline for converting text input to JSON output. The flow should look something like this:

    (input file)                    // concatenated htmls and urls
    Collection[String]              // unit: line
    Collection[(String, String)]    // unit: (url, html doc)
    Collection[MyObj]               // unit: parsed MyObj
    (output file)                   // json representation of parsed objects

I am currently doing this with nested loops, but I would like to write it in a more functional style. Is there a standard way to do this, or typical libraries I should look at? Note: the data is quite large, so I cannot hold it all in memory.
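For concreteness, the current version is roughly shaped like this (a simplified sketch; parseString, parseHtml and toJson stand in for my real helper functions):

    import java.io.PrintWriter
    import scala.io.Source

    // Simplified sketch of the current imperative version (helper names are placeholders).
    val out = new PrintWriter("output.json")
    for (line <- Source.fromFile("input.txt").getLines()) {
      val trimmed = line.trim
      if (trimmed.nonEmpty) {
        val (url, html) = parseString(trimmed)   // line -> (url, html doc)
        val obj         = parseHtml(url, html)   // (url, html doc) -> MyObj
        out.println(toJson(obj))                 // MyObj -> json line
      }
    }
    out.close()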

2 answers

Perhaps you can use scalaz-stream. The library gives you composability, expressiveness, resource safety and good IO throughput. In addition, it runs in constant memory, which is very useful for processing big data. Here is the GitHub repo:

https://github.com/scalaz/scalaz-stream

There are also talks about it on YouTube:

https://www.youtube.com/watch?v=GSZhUZT7Fyc

https://www.youtube.com/watch?v=nCxBEUyIBt0
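For concreteness, a rough sketch of the pipeline from the question in scalaz-stream (assuming a 0.7.x-era API; parseLine and toJson are placeholders for your own parsing and serialization functions):

    import scalaz.concurrent.Task
    import scalaz.stream._

    // Read lines lazily, transform them, and write the result in constant memory.
    // parseLine: String => (String, String) and toJson: String => String are placeholders.
    val pipeline: Task[Unit] =
      io.linesR("input.txt")                                   // Process[Task, String], one line at a time
        .map(_.trim)
        .filter(line => line.nonEmpty && !line.startsWith("#"))
        .map(parseLine)                                        // line -> (url, html doc)
        .map { case (url, html) => s"$url\t${toJson(html)}\n" }
        .pipe(text.utf8Encode)                                 // String -> ByteVector
        .to(io.fileChunkW("output.json"))                      // write chunks to the output file
        .run

    pipeline.run                                               // actually execute the stream

Nothing happens until the final run: the whole chain is just a description of the stream, which is what makes it composable and resource-safe.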


I usually use scala-arm for resource management ( AutoCloseable , Closeable , etc.) together with for comprehensions for such tasks.

Most Scala tutorials use for { s <- Source.fromFile(...).getLines() } , but that is a good way to leak resources, since the source is never closed automatically.

With scala-arm, it looks like this:

    import java.nio.file.Files
    import scala.io.Source
    import resource._

    for {
      source <- managed(Source.fromFile(...))           // closed automatically when the block exits
      target <- managed(Files.newBufferedWriter(...))   // same for the writer
    } {
      for {
        rawLine     <- source.getLines()
        line         = rawLine.trim()
        if !rawLine.startsWith("#")                     // skip comment lines
        (url, html) <- parseString(line)                // user-defined: line -> (url, html doc)
        json        <- toJsonOpt(html)                  // user-defined: html -> optional json
      } {
        // actual action
        target.write(s"$url\t$json\n")
      }
    }

If you need a more sophisticated pipeline, you can use scalaz-stream, Storm, Spark or another library to define the actual processing DAG and run it.
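As an illustration only, here is a minimal sketch of the same pipeline with the Spark RDD API (paths are placeholders, and parseString / toJsonOpt are assumed to be your own functions returning Option):

    import org.apache.spark.{SparkConf, SparkContext}

    // Minimal Spark sketch; parseString: String => Option[(String, String)] and
    // toJsonOpt: String => Option[String] are placeholders for your own functions.
    val sc = new SparkContext(new SparkConf().setAppName("html-to-json"))

    sc.textFile("input.txt")                                 // RDD[String], one element per line
      .map(_.trim)
      .filter(line => line.nonEmpty && !line.startsWith("#"))
      .flatMap(parseString)                                  // line -> (url, html doc)
      .flatMap { case (url, html) => toJsonOpt(html).map(json => s"$url\t$json") }
      .saveAsTextFile("output")                              // one json record per line

    sc.stop()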

