Parsing an indentation-based language using Scala parser combinators

Is there a convenient way to use Scala parser combinators to parse languages where indentation is significant (e.g. Python)?

2 answers

Suppose we have a very simple language where the following is a valid program:

    block
        inside
        the
        block

and we want to parse this into a List[String], with each line inside the block as one String.

First, we define a method that takes a minimum indentation level and returns a parser for a line that is indented by more than that level.

    def line(minIndent: Int): Parser[String] =
      repN(minIndent + 1, "\\s".r) ~ ".*".r ^^ { case s ~ r => s.mkString + r }

Then we define a block with a minimum indentation by repeating the line parser, with a suitable separator between the lines.

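    // The two-character sequence \r\n is tried first so that a Windows
    // line ending is consumed as a single separator.
    def lines(minIndent: Int): Parser[List[String]] =
      rep1sep(line(minIndent), "\r\n|[\n\r]".r)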

Now we can define a parser for our small language as follows:

    val block: Parser[List[String]] =
      (("\\s*".r <~ "block\\n".r) ^^ { _.size }) >> lines

It first determines the current indentation level and then passes that value as the minimum to the lines parser. Let's test it:

    val s =
    """block
        inside
        the
        block
    outside
    the
    block"""

    println(block(new CharSequenceReader(s)))

And we get:

    [4.10] parsed: List(    inside,     the,     block)

For all of this to compile, you need these imports:

    import scala.util.parsing.combinator.RegexParsers
    import scala.util.parsing.input.CharSequenceReader

You also need to put everything into an object that extends RegexParsers, like so:

    object MyParsers extends RegexParsers {
      override def skipWhitespace = false
      ....
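For reference, here is one way the pieces above might be assembled into a single compilable object. This is only a sketch: the names `IndentDemo` and `parseBlock` are made up for illustration, the indentation character class is narrowed to spaces and tabs (an assumption about the intent, since `\s` would also match newlines), and it assumes the scala-parser-combinators module is on the classpath.

```scala
import scala.util.parsing.combinator.RegexParsers

// Hypothetical wrapper object; not part of the original answer.
object IndentDemo extends RegexParsers {
  // Whitespace is significant here, so it must not be skipped automatically.
  override def skipWhitespace = false

  // A line indented by more than minIndent spaces or tabs.
  def line(minIndent: Int): Parser[String] =
    repN(minIndent + 1, "[ \\t]".r) ~ ".*".r ^^ { case s ~ r => s.mkString + r }

  // One or more such lines, separated by line endings.
  def lines(minIndent: Int): Parser[List[String]] =
    rep1sep(line(minIndent), "\\r\\n|\\n|\\r".r)

  // Measure the indentation of the "block" keyword, then hand it to `lines`.
  val block: Parser[List[String]] =
    (("[ \\t]*".r <~ "block\\n".r) ^^ { _.length }) >> lines

  // Hypothetical helper: returns the trimmed block lines, or None on failure.
  def parseBlock(src: String): Option[List[String]] =
    parse(block, src) match {
      case Success(result, _) => Some(result.map(_.trim))
      case _                  => None
    }
}
```

With this sketch, `IndentDemo.parseBlock("block\n    inside\n    the\n    block\noutside\nthe\nblock")` should return `Some(List("inside", "the", "block"))`. The helper uses `parse` rather than `parseAll` because the input is expected to continue after the block; drop the `.map(_.trim)` if you want to keep the leading indentation of each line.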

As far as I know, no: Scala parser combinators do not support this kind of thing out of the box. You can certainly do it by parsing whitespace in a meaningful way, but you will run into some problems, since you need some form of state machine to keep track of the indentation stack.

I would recommend doing a preprocessing step. Here is a small preprocessor that adds markers around each indented block:

    object Preprocessor {

      val BlockStartToken = "{"
      val BlockEndToken = "}"

      val TabSize = 4 // how many spaces does a tab take

      def preProcess(text: String): String = {
        val lines = text.split('\n').toList.filterNot(_.forall(isWhiteChar))
        val processedLines = BlockStartToken :: insertTokens(lines, List(0))
        processedLines.mkString("\n")
      }

      def insertTokens(lines: List[String], stack: List[Int]): List[String] = lines match {
        case List() => List.fill(stack.length) { BlockEndToken } // closing all opened blocks
        case line :: rest => {
          (computeIndentation(line), stack) match {
            case (indentation, top :: stackRest) if indentation > top => {
              BlockStartToken :: line :: insertTokens(rest, indentation :: stack)
            }
            case (indentation, top :: stackRest) if indentation == top =>
              line :: insertTokens(rest, stack)
            case (indentation, top :: stackRest) if indentation < top => {
              BlockEndToken :: insertTokens(lines, stackRest)
            }
            case _ => throw new IllegalStateException("Invalid algorithm")
          }
        }
      }

      private def computeIndentation(line: String): Int = {
        val whiteSpace = line takeWhile isWhiteChar
        (whiteSpace map {
          case ' ' => 1
          case '\t' => TabSize
        }).sum
      }

      private def isWhiteChar(ch: Char) = ch == ' ' || ch == '\t'
    }

Running it on this text:

    val text =
      """
        |line1
        |line2
        |    line3
        |    line4
        |    line5
        |        line6
        |        line7
        |  line8
        |  line9
        |line10
        |   line11
        |   line12
        |   line13
      """.stripMargin
    println(Preprocessor.preProcess(text))

... gives the following result:

    {
    line1
    line2
    {
        line3
        line4
        line5
    {
            line6
            line7
    }
    }
    {
      line8
      line9
    }
    line10
    {
       line11
       line12
       line13
    }
    }

Afterwards you can use the combinator library to do the actual parsing in a simpler way.
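As a rough illustration of that second step, the marked-up text could be turned into a tree along these lines. This is a sketch under several assumptions: the `Node`, `Line`, `Block`, `BraceParser` and `parseTree` names are hypothetical, every marker sits alone on its own line (as the preprocessor emits them), and ordinary lines never start with `{` or `}`.

```scala
import scala.util.parsing.combinator.RegexParsers

// Hypothetical tree type for the preprocessed text.
sealed trait Node
case class Line(text: String) extends Node
case class Block(children: List[Node]) extends Node

// Hypothetical parser for the "{" / "}" markers inserted by a preprocessor.
object BraceParser extends RegexParsers {
  // Line endings separate the items, so whitespace must not be skipped.
  override def skipWhitespace = false

  private val eol: Parser[String] = "\\r?\\n".r

  // An ordinary line: leading indentation is dropped, the rest is kept.
  def line: Parser[Node] =
    "[ \\t]*".r ~> "[^\\s{}][^\\r\\n]*".r ^^ { s => Line(s) }

  // "{" on its own line, any number of items, then the closing "}".
  def block: Parser[Node] =
    "{" ~> eol ~> rep(node <~ eol) <~ "}" ^^ { ns => Block(ns) }

  def node: Parser[Node] = block | line

  // Returns the tree, or None if the text is not a well-formed block.
  def parseTree(src: String): Option[Node] =
    parseAll(block, src) match {
      case Success(tree, _) => Some(tree)
      case _                => None
    }
}
```

For example, `BraceParser.parseTree("{\nline1\n{\n    line2\n}\nline3\n}")` should give `Some(Block(List(Line("line1"), Block(List(Line("line2"))), Line("line3"))))`. Because the indentation has already been converted into explicit markers, the grammar itself no longer needs to track an indentation stack.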

Hope this helps
