Parsing a simple imperative language
This tutorial will present how to parse a subset of a simple imperative programming language called WHILE (introduced in a book "Principles of Program Analysis" by Nielson, Nielson and Hankin). It includes only a few statements and basic boolean/arithmetic expressions, which makes it a nice material for a tutorial.
Imports
First let's specify the name of the module:
> module ParseWhile where
And then import the necessary libraries:
> import System.IO
> import Control.Monad
> import Text.ParserCombinators.Parsec
> import Text.ParserCombinators.Parsec.Expr
> import Text.ParserCombinators.Parsec.Language
> import qualified Text.ParserCombinators.Parsec.Token as Token
The language
The grammar for expressions is defined as follows:
a ::= x | n | - a | a opa a
b ::= true | false | not b | b opb b | a opr a
opa ::= + | - | * | /
opb ::= and | or
opr ::= > | <
Note that we have three groups of operators - arithmetic, booloan and relational ones.
And now the definition of statements:
S ::= x := a | skip | S1; S2 | ( S ) | if b then S1 else S2 | while b do S
We probably want to parse that into some internal representation of the language (abstract syntax tree). Therefore we need to define the data structures for the expressions and statements.
Data structures
We need to take care of boolean and arithmetic expressions and the appropriate operators. First let's look at the boolean expressions:
> data BExpr = BoolConst Bool
> | Not BExpr
> | BBinary BBinOp BExpr BExpr
> | RBinary RBinOp AExpr AExpr
> deriving (Show)
Binary booloan operators:
> data BBinOp = And | Or deriving (Show)
Relational operators:
> data RBinOp = Greater | Less deriving (Show)
Now we define the types for arithmetic expressions:
> data AExpr = Var String
> | IntConst Integer
> | Neg AExpr
> | ABinary ABinOp AExpr AExpr
> deriving (Show)
And arithmetic operators:
> data ABinOp = Add
> | Subtract
> | Multiply
> | Divide
> deriving (Show)
Finally let's take care of the statements:
> data Stmt = Seq [Stmt]
> | Assign String AExpr
> | If BExpr Stmt Stmt
> | While BExpr Stmt
> | Skip
> deriving (Show)
Lexer
Having all the data structures we can go on with writing the code to do actual parsing. First of all we create the language definition using Haskell's record syntax and the constructor emptyDef
from Text.ParserCombinators.Parsec.Language
:
> languageDef =
> emptyDef { Token.commentStart = "/*"
> , Token.commentEnd = "*/"
> , Token.commentLine = "//"
> , Token.identStart = letter
> , Token.identLetter = alphaNum
> , Token.reservedNames = [ "if"
> , "then"
> , "else"
> , "while"
> , "do"
> , "skip"
> , "true"
> , "false"
> , "not"
> , "and"
> , "or"
> ]
> , Token.reservedOpNames = ["+", "-", "*", "/", ":="
> , "<", ">", "and", "or", "not"
> ]
> }
This creates a language definition that accepts the C-style comments, requires that the identifiers start with a letter, and end with alphanumeric characters. Moreover there is a number of reserved names, that cannot be used by the identifiers.
Having the above definition we can create a lexer:
> lexer = Token.makeTokenParser languageDef
lexer contains a number of lexical parsers, that we can use to parse identifiers, reserved words/operations, etc. Now we can select/extract them in the following way:
> identifier = Token.identifier lexer -- parses an identifier
> reserved = Token.reserved lexer -- parses a reserved name
> reservedOp = Token.reservedOp lexer -- parses an operator
> parens = Token.parens lexer -- parses surrounding parenthesis:
> -- parens p
> -- takes care of the parenthesis and
> -- uses p to parse what's inside them
> integer = Token.integer lexer -- parses an integer
> semi = Token.semi lexer -- parses a semicolon
> whiteSpace = Token.whiteSpace lexer -- parses whitespace
This isn't really necessary, but should make the code much more readable (also this is the reason why we used the qualified import of Text.ParserCombinators.Parsec.Token
). Now we can use them to parse the source code at the token level. One of the nice features of these parsers is that they take care of all whitespace after the tokens.
Main parser
As already mentioned a program in this language is simply a statement, so the main parser should basically only parse a statement. But remember to take care of initial whitespace - our parsers only get rid of whitespace after the tokens!
> whileParser :: Parser Stmt
> whileParser = whiteSpace >> statement
Now because any statement might be actually a sequence of statements separated
by semicolon, we use sepBy1
to parse at least one statement. The
result is a list of statements. We also allow grouping statements by the
parenthesis, which is useful, for instance, in the while loop.
> statement :: Parser Stmt
> statement = parens statement
> <|> sequenceOfStmt
> sequenceOfStmt =
> do list <- (sepBy1 statement' semi)
> -- If there's only one statement return it without using Seq.
> return $ if length list == 1 then head list else Seq list
Now a single statement is quite simple, it's either an if conditional, a while
loop, an assignment or simply a skip statement. We use <|>
to
express choice. So a <|> b
will first try parser a
and if it fails (but without actually consuming any input) then parser
b
will be used. Note: this means that the order is important.
> statement' :: Parser Stmt
> statement' = ifStmt
> <|> whileStmt
> <|> skipStmt
> <|> assignStmt
If you have a parser that might fail after consuming some input, and you still
want to try the next parser, you should look into try
combinator.
For instance try p <|> q
will try parsing with p
and
if it fails, even after consuming the input, the q
parser will be
used as if nothing has been consumed by p
.
Now let's define the parsers for all the possible statements. This is quite straightforward as we just use the parsers from the lexer and then use all the necessary information to create appropriate data structures.
> ifStmt :: Parser Stmt
> ifStmt =
> do reserved "if"
> cond <- bExpression
> reserved "then"
> stmt1 <- statement
> reserved "else"
> stmt2 <- statement
> return $ If cond stmt1 stmt2
> whileStmt :: Parser Stmt
> whileStmt =
> do reserved "while"
> cond <- bExpression
> reserved "do"
> stmt <- statement
> return $ While cond stmt
> assignStmt :: Parser Stmt
> assignStmt =
> do var <- identifier
> reservedOp ":="
> expr <- aExpression
> return $ Assign var expr
> skipStmt :: Parser Stmt
> skipStmt = reserved "skip" >> return Skip
Expressions
What's left is to parse the expressions. Fortunately Parsec provides a very easy way to do that. Let's define the arithmetic and boolean expressions:
> aExpression :: Parser AExpr
> aExpression = buildExpressionParser aOperators aTerm
> bExpression :: Parser BExpr
> bExpression = buildExpressionParser bOperators bTerm
Now we have to define the lists with operator precedence, associativity and what constructors to use in each case.
> aOperators = [ [Prefix (reservedOp "-" >> return (Neg )) ]
> , [Infix (reservedOp "*" >> return (ABinary Multiply)) AssocLeft,
> Infix (reservedOp "/" >> return (ABinary Divide )) AssocLeft]
> , [Infix (reservedOp "+" >> return (ABinary Add )) AssocLeft,
> Infix (reservedOp "-" >> return (ABinary Subtract)) AssocLeft]
> ]
> bOperators = [ [Prefix (reservedOp "not" >> return (Not )) ]
> , [Infix (reservedOp "and" >> return (BBinary And )) AssocLeft,
> Infix (reservedOp "or" >> return (BBinary Or )) AssocLeft]
> ]
In case of Prefix operators it is enough to specify which one should be parsed and what is the associated data constructor. Infix operators are defined similarly, but it's necessary to add information about associativity. Note that the operator precedence depends only on the order of the elements in the list.
Finally we have to define the terms. In case of arithmetic expressions, it is quite simple:
> aTerm = parens aExpression
> <|> liftM Var identifier
> <|> liftM IntConst integer
However, the term in a boolean expression is a bit more tricky. In this case, a term can also be an expression with relational operator consisting of arithmetic expressions.
> bTerm = parens bExpression
> <|> (reserved "true" >> return (BoolConst True ))
> <|> (reserved "false" >> return (BoolConst False))
> <|> rExpression
Therefore we have to define a parser for relational expressions:
> rExpression =
> do a1 <- aExpression
> op <- relation
> a2 <- aExpression
> return $ RBinary op a1 a2
> relation = (reservedOp ">" >> return Greater)
> <|> (reservedOp "<" >> return Less)
And that's it. We have a quite simple parser able to parse a few statements and arithmetic/boolean expressions.
Notes
If you want to experiment with the parser inside ghci, these functions might be handy:
> parseString :: String -> Stmt
> parseString str =
> case parse whileParser "" str of
> Left e -> error $ show e
> Right r -> r
> parseFile :: String -> IO Stmt
> parseFile file =
> do program <- readFile file
> case parse whileParser "" program of
> Left e -> print e >> fail "parse error"
> Right r -> return r
Now you can simply load the module in ghci and then do
ast <- parseFile "<filename>"
to parse a file and get the
result if parsing was successful. If you already have a string with
the program, you can use parseString
.