Parsing expressions and statements

From HaskellWiki

We use the Parsec library to parse a small grammar of expressions and statements. The main purpose is to showcase makeTokenParser and buildExpressionParser, which cover a large class of common applications.

We will use these modules:

> import Control.Applicative((<*))
> import Text.Parsec
> import Text.Parsec.String
> import Text.Parsec.Expr
> import Text.Parsec.Token
> import Text.Parsec.Language

Sample grammar[edit]

Here is the expression grammar:

expr  ::= var | const | ( expr ) | unop expr | expr duop expr
var   ::= letter { letter | digit }*
const ::= true | false
unop  ::= ~
duop  ::= & | =

Operator precedence from high to low: ~, &, =. Both binary operators are left-associative.

Here is the statement grammar:

stmt ::= nop | var := expr | if expr then stmt else stmt fi | while expr do stmt od
       | stmt { ; stmt }+

In addition to these grammars, we will also allow block comments like {- this is a comment -}.

We will parse these grammars into an internal representation (abstract syntax tree). (In some applications you skip the internal representation and go straight to evaluation or code generation or... but let's fix just one goal here.) Here is an internal representation:

> data Expr = Var String | Con Bool | Uno Unop Expr | Duo Duop Expr Expr
>     deriving Show
> data Unop = Not deriving Show
> data Duop = And | Iff deriving Show
> data Stmt = Nop | String := Expr | If Expr Stmt Stmt | While Expr Stmt
>           | Seq [Stmt]
>     deriving Show

Sample parser[edit]

Here is the plan. Fill in a record to specify comments, roles of characters (what goes into identifiers, what goes into operators), reserved words, reserved operators. Give this record to makeTokenParser and get a record of utility parsers that enable us to work at the token level (rather than the character level). The expression parser is obtained with the help of buildExpressionParser. The statement parser is written as a recursive descent parser.

Define symbols[edit]

LanguageDef is the name of the record type we have to fill in. Many presets are provided so that we can pick one and just customize a few fields. A minimalist preset is emptyDef and we change it with:

def = emptyDef{ commentStart = "{-"
              , commentEnd = "-}"
              , identStart = letter
              , identLetter = alphaNum
              , opStart = oneOf "~&=:"
              , opLetter = oneOf "~&=:"
              , reservedOpNames = ["~", "&", "=", ":="]
              , reservedNames = ["true", "false", "nop",
                                 "if", "then", "else", "fi",
                                 "while", "do", "od"]
              }

Make token parser[edit]

When we pass the above record to makeTokenParser, the return value is a record of type TokenParser. Its fields are little parsers that we can use. They parse and return various tokens (identifiers, operators, reserved things, all sorts of brackets) and skip comments as we have specified. They also eat whitespaces after the tokens. Using them, our expression parser and statement parser can be written at the token level and without worrying about whitespaces. The only caveat: they don't eat whitespaces at the very, very, very beginning, and we have to do that ourselves, but even that is made easy.

There are only a few of the many provided parsers we will use here. Using record pattern matching, we call makeTokenParser def and pick out just those we need:

TokenParser{ parens = m_parens
           , identifier = m_identifier
           , reservedOp = m_reservedOp
           , reserved = m_reserved
           , semiSep1 = m_semiSep1
           , whiteSpace = m_whiteSpace } = makeTokenParser def

Here is an overview of what these parsers do:

m_parens
m_parens p parses an open parenthesis, then runs p, then parses a close parenthesis, and returns what p returns.
m_identifier
parses and returns an identifier, checking that it does not clash with a reserved word.
m_reservedOp
by example: m_reservedOp ":=" checks that the next token is the := reserved operator.
m_reserved
by example: m_reserved "od" checks that the next token is the od reserved word.
m_semiSep1
m_semiSep1 p parses and returns a semicolon-separated sequence of one or more p's.
m_whiteSpace
eats whitespaces.

You are encouraged to explore these and other parsers provided in the record. They are all very handy.

Expression parser[edit]

The expression parser is obtained from buildExpressionParser. We do not need to write our own recursive descent parser or perform left/right-factoring.

exprparser :: Parser Expr
exprparser = buildExpressionParser table term <?> "expression"
table = [ [Prefix (m_reservedOp "~" >> return (Uno Not))]
        , [Infix (m_reservedOp "&" >> return (Duo And)) AssocLeft]
        , [Infix (m_reservedOp "=" >> return (Duo Iff)) AssocLeft]
        ]
term = m_parens exprparser
       <|> fmap Var m_identifier
       <|> (m_reserved "true" >> return (Con True))
       <|> (m_reserved "false" >> return (Con False))

We give operator precedence (by ordering them in the table), association, "semantic action", and how to parse an "atomic term" including the parenthesized case (the only place we do a recursive descent, and it's trivial). It is possible to place several operators at the same precedence, though not shown here. The general format for the table is hard to explain but easy to illustrate and copy.

Statement parser[edit]

The statement parser is best done by recursive descent, with special treatment to the semicolon-separated list (easy with m_semiSep1), utilizing the handy token parsers and the expression parser. We also remember to skip whitespaces just once at the very, very, very beginning:

mainparser :: Parser Stmt
mainparser = m_whiteSpace >> stmtparser <* eof
    where
      stmtparser :: Parser Stmt
      stmtparser = fmap Seq (m_semiSep1 stmt1)
      stmt1 = (m_reserved "nop" >> return Nop)
              <|> do { v <- m_identifier
                     ; m_reservedOp ":="
                     ; e <- exprparser
                     ; return (v := e)
                     }
              <|> do { m_reserved "if"
                     ; b <- exprparser
                     ; m_reserved "then"
                     ; p <- stmtparser
                     ; m_reserved "else"
                     ; q <- stmtparser
                     ; m_reserved "fi"
                     ; return (If b p q)
                     }
              <|> do { m_reserved "while"
                     ; b <- exprparser
                     ; m_reserved "do"
                     ; p <- stmtparser
                     ; m_reserved "od"
                     ; return (While b p)
                     }

Sample usage[edit]

Finally, here is a little piece of code for casual testing. This function parses the string parameter and outputs either a parse error or the answer.

> play :: String -> IO ()
> play inp = case parse mainparser "" inp of
>              { Left err -> print err
>              ; Right ans -> print ans
>              }