Yhc/TMR

Authors: Neil Mitchell, Tom Shackell, Dimitry Golubovsky, Matthew Naylor

This is a draft of the Yhc TMR article, deadline April 13th. It isn't intended as a wiki article beyond the listed authors (although if you want to fix some spelling, we don't mind!). If you are interested in helping, email the Yhc list.

The beginning

In the beginning there was the nhc compiler, which had a number of issues. We fixed some of them.

Author: Tom/Neil/Andrew

How we started up Yhc; this is the section that would have been in the History of Haskell paper if they had done a Yhc section :)

Include the transition from CVS -> York Darcs -> haskell.org Darcs

Portability concerns

From the beginning portability was a prime concern: while the original nhc only ran on Linux v old.old, and never on Windows, Yhc was fully portable by design.

Author: Tom, Andrew

Why portability is such a concern, details of our ports system. Include our scons architecture, buildbot system etc. Mention that Yhc runs under Hugs, and indeed some of the developers use Hugs.


Yhc.Core

Yhc.Core is one of our most successful libraries to date. The original nhc compiler used an intermediate core language called PosLambda – a basic lambda calculus extended with positional information. Although the language was very similar to a subset of Haskell, it wasn’t the same. In particular there were unusual constructs such as FatBar, and all names were stored in a symbol table.

Rather than attempt to change the PosLambda language, a task that would have been decidedly painful, we chose instead to write a Core language from scratch, being inspired by both PosLambda and GHC Core. We then wrote a converter from PosLambda to Yhc.Core.

In particular our idealised Core language differs from GHC Core in a number of ways:

  • Untyped – originally this was a restriction of PosLambda, but now some of us see this as a feature (others still see this as a bug).
  • Syntactically a subset of Haskell.
  • Minimal name mangling.

All these features combine to create a Core language which resembles Haskell much more than Core languages in other Haskell compilers. This very low barrier to entry means that with merely the Haddock description of our abstract syntax tree, most Haskell programmers can feel at home with relatively little effort.

Having reduced the barrier to entry for our Core language, the number of projects depending on it has increased. Consistently we have attempted to add facilities to the libraries for common tasks, rather than duplicating them separately in projects. As a result the Core library now has facilities for dealing with primitives, removing recursive lets, reachability analysis, strictness analysis, simplification, inlining and more.

One of the first features we added to Core was whole program linking – any Haskell program, regardless of the number of modules, can be collapsed into one single Yhc.Core module. While this breaks separate compilation, it allows many types of analysis and transformation to be performed in a simpler manner. Naturally, if a technique turns out to be successful, breaking the dependence on whole program compilation is a worthy goal – but this approach allows developers to pay that cost only when it is needed.

A Small Example

To give a flavour of what Core looks like, it is easiest to start with a small program:

head2 (x:xs) = x
 
map2 f [] = []
map2 f (x:xs) = f x : map2 f xs
 
test x = map2 head2 x

Compiling this with yhc -showcore Sample.hs generates:

Sample.head2 v220 =
    case v220 of
        (:) v221 v222 -> v221
        _ -> Prelude.error Sample._LAMBDA228
 
Sample._LAMBDA228 =
    "Sample: Pattern match failure in function at 9:1-9:15."
 
Sample.map2 v223 v224 =
    case v224 of
        [] -> []
        (:) v225 v226 -> (:) (v223 v225) (Sample.map2 v223 v226)
 
Sample.test v227 = Sample.map2 Sample.head2 v227

The generated Core looks very much like Haskell, and can be treated as such. It is the simplest form of Haskell, with many restrictions (a small hand-written illustration follows the list):

  • Case statements only examine their outermost constructor
  • No type classes
  • No where
  • No nested functions (only top level)
  • All names are fully qualified
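
As a hedged, hand-written illustration of the first restriction (this is not actual yhc output; the variable names and error calls are invented for the example), a nested pattern such as f (Just (x:xs)) = x has to be split into two case expressions, each examining only one constructor:

f v1 =
    case v1 of
        Prelude.Just v2 ->
            case v2 of
                (:) v3 v4 -> v3
                _ -> Prelude.error "f: pattern match failure"
        _ -> Prelude.error "f: pattern match failure"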

Yhc.Core.Overlay

The overlay technique allows a function in a linked Core program to be replaced by an alternative implementation: an overlay module defines replacement functions whose names encode the qualified name of the function being replaced, and the Core Overlay engine substitutes the new definition for the original one. The Javascript backend described later in this article relies on overlays heavily, for example to substitute Unicode-aware, Javascript-based versions of library functions such as toUpper and isSpace.
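
As a small, hedged sketch (the global_Module'_name convention is taken from the Javascript backend examples later in this article; the replacement shown here is invented purely for illustration), an overlay making the head2 function from the earlier sample total might look like:

module SampleOverlay where

-- Hypothetical overlay module: replaces Sample.head2 in the linked Core;
-- the mangled name encodes the qualified name of the function overlaid.
global_Sample'_head2 :: [a] -> a
global_Sample'_head2 (x:xs) = x
global_Sample'_head2 []     = error "head2: empty list"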

Semantics of Yhc Core

In this section an evaluator for Yhc Core programs is presented in the form of a literate Haskell program. The aim is to define the meaning of Core programs while demonstrating a full, albeit simple, application of the Yhc.Core library.

> module Main where

> import Yhc.Core
> import System
> import Monad

The System and Monad imports are only needed by the main function given at the end of this section.

Our evaluator is based around the function whnf that takes a core program (of type Core) along with a core expression (of type CoreExpr) and reduces that expression until it has the form:

  • a data constructor with unevaluated arguments, or
  • an unapplied lambda expression.

In general, data values in Haskell are tree-shaped. The function whnf is often said to "reduce an expression to head normal form" because it reveals the head (or root) of a value's tree and no more. Strictly speaking, when the result of reduction could be a functional value (i.e. a lambda expression), and the body of that lambda is left unevaluated, then the result is said to be in "weak head normal form" -- this explains the strange acronym WHNF!
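
As a small hand-worked illustration (not the output of any tool), consider the map2 and head2 functions from the earlier example. Reduction to weak head normal form stops as soon as the outermost constructor is revealed, whereas full evaluation continues inside its arguments:

map2 head2 [[1],[2]]
  ==>  (:) (head2 [1]) (map2 head2 [[2]])    -- weak head normal form
  ==>  (:) 1 ((:) 2 [])                      -- fully evaluated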

The type of whnf is:

> whnf                            :: Core -> CoreExpr -> CoreExpr

Defining it is a process of taking each kind of core expression in turn, and asking "how do I reduce this to weak head normal form?" As usual, it makes sense to define the base cases first, namely constructors and lambda expressions:

> whnf p (CoreCon c)              =  CoreCon c
> whnf p (CoreApp (CoreCon c) as) =  CoreApp (CoreCon c) as
> whnf p (CoreLam (v:vs) e)       =  CoreLam (v:vs) e

Notice that a constructor may take one of two forms: standalone with no arguments, or as function application to a list of arguments. Also, because of the way our evaluator is designed, we may encounter lambda expressions with no arguments. Hence, only lambdas with arguments represent a base-case. For the no-arguments case, we just shift the focus of reduction to the body:

> whnf p (CoreLam [] e)           =  whnf p e

Currently, lambda expressions do not occur in the Core output of Yhc. They are part of the Core syntax because they are useful conceptually, particularly when manipulating (and evaluating) higher-order functions.

Moving on to case-expressions, we first reduce the case subject, then match it against each pattern in turn, and finally reduce the body of the chosen alternative. In Core, we can safely assume that patterns are at most one constructor deep, so reduction of the subject to WHNF is sufficient.

> whnf p (CoreCase e as)          =  whnf p (match (whnf p e) as)

We leave the definition of match until later.

To reduce a let-expression, we apply the let-bindings as a substitution to the body of the let. This is easily done using the Core function replaceFreeVars. Of course, let-expressions in Core are recursive, but before evaluating a core program we transform all recursive-lets away (see below). Notice that we are in no way trying to preserve the sharing implied by let-expressions, although we have done so in more complex variants of the evaluator. Strictly-speaking, Haskell evaluators are not obliged to implement sharing -- this is why it is more correct to term Haskell non-strict than lazy.

> whnf p (CoreLet bs e)           =  whnf p (replaceFreeVars bs e)

When we encounter an unapplied function we call coreFunc to look up its definition (i.e. its arguments and its right-hand side), and construct a corresponding lambda expression:

> whnf p (CoreFun f)              =  whnf p (CoreLam bs body)
>   where
>     CoreFunc _ bs body          =  coreFunc p f

This means that when reducing function applications, we know that reduction of the function part will yield a lambda:

> whnf p (CoreApp f [])           =  whnf p f
> whnf p (CoreApp f (a:as))       =
>   case whnf p f of
>     CoreLam [] e                -> whnf p (CoreApp e (a:as))
>     CoreLam (b:bs) e            -> whnf p (CoreLet [(b,a)]
>                                           (CoreApp (CoreLam bs e) as))

Core programs may contain positional information which we just ignore:

> whnf p (CorePos _ e)            =  whnf p e

And the final, fall-through case covers primitive literals and functions which we are not concerned with here:

> whnf p e                        =  e

Now, for the sake of completeness, we return to our match function. It takes the evaluated case subject and tries to match it against each case-alternative (a pattern-expression pair) in order of appearance. We use the "failure as a list of successes" technique to model the fact that matching may fail.

> type Alt                        =  (CoreExpr, CoreExpr)

> match                           :: CoreExpr -> [Alt] -> CoreExpr
> match e as                      =  head (concatMap (try e) as)

Before defining try, it is useful to have a function that turns the two possible constructor forms into a single, normal form. This greatly reduces the number of cases we need to consider in the definition of try.

> norm                            :: CoreExpr -> CoreExpr
> norm (CoreCon c)                =  CoreApp (CoreCon c) []
> norm x                          =  x

Hopefully by now the definition of try will be self-explanatory:

> try                             :: CoreExpr -> Alt -> [CoreExpr]
> try e (pat, rhs)                =
>   case (norm pat, norm e) of
>     (CoreApp (CoreCon f) as, CoreApp (CoreCon g) bs)
>        | f == g                 -> [CoreLet (zip (vars as) bs) rhs]
>     (CoreVar v, e)              -> [CoreLet [(v, e)] rhs]
>     _                           -> []
>   where
>     vars                        =  map fromCoreVar

This completes the definition of whnf. However, we would like to be able to fully evaluate expressions -- to what we simply call "normal form" -- so that the resulting value's tree is computed in its entirety. Our nf function repeatedly applies whnf at progressively deeper nodes in the growing tree:

> nf                              :: Core -> CoreExpr -> CoreExpr
> nf p e                          =
>   case whnf p e of
>     CoreCon c                   -> CoreCon c
>     CoreApp (CoreCon c) es      -> CoreApp (CoreCon c) (map (nf p) es)
>     e                           -> e

All that remains is to turn our evaluator into a program by giving it a sensible main function. We first load the core file using loadCore and then apply removeRecursiveLet, as discussed earlier, before evaluating the expression CoreFun "main" to normal form and printing it.


> main                            :: IO ()
> main                            =  liftM head getArgs
>                                >>= liftM removeRecursiveLet . loadCore
>                                >>= print . flip nf (CoreFun "main")

In future we hope to use a variant of this evaluator (with sharing) in a property-based testing framework. This will let us check that various program analyses and transformations that we have developed are semantics-preserving. As part of another project, we have successfully extended the evaluator to support various functional-logic evaluation strategies, including residuation and narrowing.

Javascript backend

The Javascript backend is a unique feature of Yhc, something which our lightweight approach makes easier.

Author: Dimitry

The idea of writing a converter from Haskell to Javascript, aimed at executing Haskell programs in a web browser, has been floating around for a while (see, for example, http://www.haskell.org//pipermail/haskell-cafe/2006-August/017286.html, http://www.haskell.org//pipermail/haskell-cafe/2006-August/017287.html and http://www.haskell.org/haskellwiki/Hajax). Many people expressed interest in such a feature, but no practical implementation was visible.

Initial goals of this subproject were:

  • To develop a conversion program that converts the Yhc Core to Javascript, thus making it possible to execute arbitrary Haskell code within a web browser
  • To develop an unsafe interface layer for quick access to Javascript objects, with the ability to wrap arbitrary Javascript code into a Haskell-callable function
  • To develop a typesafe interface layer on top of the unsafe interface layer for access to the Document Object Model (DOM) available to Javascript executed in a web browser
  • To develop or adopt an existing GUI library or toolkit working on top of the typesafe DOM layer for actual development of client-side Web applications.

General concepts

The Javascript backend converts a linked and optimized Yhc Core file into a piece of Javascript code to be embedded in an XHTML document. The Javascript code generator attempts to translate Core expressions to Javascript expressions one-to-one, with minor optimizations of its own, taking advantage of Javascript's capability to pass functions around as values.

Three kinds of functions are present in the Javascript backend:

  • Unsafe functions that embed pieces of Javascript directly into the generated code: these functions pay no respect to types of arguments passed, and may force evaluation of their arguments if needed.
  • Typesafe wrappers that provide type signatures for unsafe functions. Such wrappers are either handwritten, or automatically generated from external interface specifications (such as the Document Object Model interface)
  • Regular library functions. These either come unmodified from the standard packages that come with Yhc, or are substituted by the Javascript backend using the Core overlay technique. An example of such a function is the toUpper function which is hooked up to the Javascript implementation supporting Unicode (the original library function currently works correctly only for the Latin1 range of characters).

Unsafe interfaces

The core part of the unsafe interface to Javascript (or, in other words, the Javascript FFI) is the pseudo-function unsafeJS.

The function has a type signature:

foreign import primitive unsafeJS :: String -> a

This means that it takes a string. The type of the return value does not matter: the function itself is never executed. Its applications are detected by the Yhc Core to Javascript conversion program at the time of Javascript generation.

The unsafeJS function should be called with a string literal. Neither an explicitly coded list of characters (built with (:)) nor a concatenation of two or more strings will work: the converter will report an error in either situation.

A valid example of using unsafeJS is shown below:

global_YHC'_Primitive'_primIntSignum :: Int -> Int

global_YHC'_Primitive'_primIntSignum a = unsafeJS
  "var ea = exprEval(a); if (ea>0) return 1; else if (ea<0) return -1; else return 0;"

This is a Javascript overlay (in the sense that it overlays the default Prelude definition of the signum function) of a function that returns the sign of an Int value.

The string literal that unsafeJS is applied to is the Javascript code to be wrapped.

Below is the Javascript representation of this function found in generated code.

strIdx["F_hy"] = "YHC.Primitive.primIntSignum";
...
var F_hy=new HSFun("F_hy", 1, function(a){
var ea = exprEval(a); if (ea>0) return 1; else if (ea<0) return -1; else return 0;});

Typesafe wrappers

These functions add type safety on top of the unsafe interface to Javascript. Sometimes they are defined within the same module as the unsafe interfaces themselves, thus avoiding the exposure of unsafe interfaces to programmers.

An example of a handwritten wrapper is a function to create a new JSRef (a mechanism similar to IORef, but specific to Javascript).

data JSRef a

newJSRef :: a -> CPS b (JSRef a)

newJSRef a = toCPE (newJSRef' a)
newJSRef' a = unsafeJS "return {_val:a};"

Technically, a JSRef is a Javascript object with a property named _val that holds a persistent reference to some value. On the unsafe side, invoking a constructor for such an object would be sufficient. It is however desired that:

  • calls to functions creating such persistent references are properly sequenced with calls to functions using these references, and
  • the types of the values referred to are known to the Haskell compiler.

The unsafe part is implemented by the function newJSRef' which merely calls unsafeJS with the proper Javascript constructor. The wrapper part newJSRef wraps the unsafe function into a CPS-style function, and is given a proper type signature, so the compiler is better informed.

In some cases, such typesafe wrappers may be generated automatically, using some external interface specifications provided by third parties for their APIs.

As an example of such an API, consider the W3C DOM interface. For instance, this piece of OMG IDL:

  interface Text : CharacterData {
    Text               splitText(in unsigned long offset)
                                        raises(DOMException);
  };

is converted into:

data TText = TText

...

instance CText TText
 
instance CCharacterData TText
 
instance CNode TText

...

splitText :: (CText this, CText zz) => this -> Int -> CPS c zz
splitText a b = toCPE (splitText' a b)
splitText' a b
  = unsafeJS "return((exprEval(a)).splitText(exprEval(b)));"

again, giving the Haskell compiler better control over types of this function's (initially type-agnostic) arguments.

Usage of Continuation Passing Style

Initially, an attempt was made to build a monadic framework. The JS monad was designed to play the same role as the IO monad plays in "regular" Haskell programming. There were however arguments in favor of using Continuation Passing Style (CPS); a small sketch of the style follows the list below:

  • CPS involves less overhead, as each expression passes its continuation itself, rather than going through bind, which takes the expression and invokes the continuation
  • CPS results in Javascript patterns that are easy to detect and optimize for (this is one of the future plans).
  • The Fudgets GUI library internals are written in CPS, so taking CPS as a general approach to programming is believed to make adoption of Fudgets easier.
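
To make the style concrete, here is a minimal, self-contained sketch of CPS sequencing. It does not use the backend's real types or primitives: the CPS type synonym and the two "actions" below are invented for illustration, standing in for typesafe wrappers (such as newJSRef) around Javascript calls.

-- One possible shape for a CPS type; the actual definition used by the
-- Yhc Javascript backend may differ.
type CPS r a = (a -> r) -> r

-- Hypothetical primitive actions, standing in for wrapped Javascript calls.
getValue :: CPS r Int
getValue k = k 42

putValue :: Int -> CPS r ()
putValue _ k = k ()

-- Sequencing is done by nesting continuations directly, rather than by
-- threading each step through a monadic bind.
program :: CPS r ()
program k =
  getValue         $ \x ->
  putValue (x + 1) $ \_ ->
  k ()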

Integration with DOM

The Web Consortium provides OMG IDL files to describe the API to use with the Document Object Model (DOM). A utility was designed, based on HaskellDirect, to parse these files and convert them to a set of Haskell modules. The way interface inheritance is reflected differs from the original HaskellDirect way: in HaskellDirect this was achieved by declaration of "nested" algebraic data types, while the Javascript backend utility takes advantage of Haskell typeclasses, representing DOM types with phantom types, and declaring them instances of the appropriate class(es).

Unicode support

Despite the fact that all modern Web browsers support Unicode, this is not the case with Javascript: no access to Unicode characters' properties is provided. At the same time, it is impossible for a Haskell application running in a browser not to have access to such information. The approach used is the same as in Hugs and GHC: the Unicode character database file from the Unicode Consortium was converted into a set of Javascript arrays, each array entry representing a range of character code values, or a case conversion rule for a range (for this implementation, Unicode support was limited to character category and simple case conversions). First, a range is found by character code using binary search; then the character category and case conversion distances (values to add to a character code to convert between upper and lower case) are retrieved from the range entry. The whole set of arrays adds about 70 kilobytes to the web page size, if embedded inside a <script> tag.
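
A simplified sketch of this lookup scheme is shown below. It is written in Haskell for readability (the real implementation is a set of Javascript arrays), and the type and field names are invented for this illustration; only the general mechanism (sorted code-point ranges searched by binary search, each carrying a category and case-conversion distances) follows the description above.

import Data.Array

-- One entry of the (hypothetical) table: a range of code points, its
-- general category, and the distances to add for case conversion.
data Range = Range
  { rangeLo, rangeHi       :: Int
  , rangeCategory          :: String
  , upperDelta, lowerDelta :: Int
  }

-- Binary search over an array of ranges sorted by rangeLo.
findRange :: Array Int Range -> Int -> Maybe Range
findRange table c = go lo0 hi0
  where
    (lo0, hi0) = bounds table
    go l h
      | l > h         = Nothing
      | c < rangeLo r = go l (m - 1)
      | c > rangeHi r = go (m + 1) h
      | otherwise     = Just r
      where
        m = (l + h) `div` 2
        r = table ! m

-- Example use: convert a code point to upper case using the table.
toUpperCode :: Array Int Range -> Int -> Int
toUpperCode table c = maybe c (\r -> c + upperDelta r) (findRange table c)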

Using the Core overlay technique, Haskell character functions (like toUpper, isAlpha, etc.) were hooked up to the Javascript implementations supporting Unicode. This did not result in considerable slowdowns; rather, some browsers even showed a minor speedup in heavy functions like read :: String -> Int.

Examples of code generation

The examples below show the conversion of some real-life functions from Haskell via Core to Javascript. Note that the Javascript code generator changes over time as the Javascript backend evolves, so these examples describe the situation at the time this article was written.

This function:

fromRoman = foldr fromNumeral 0 . maxmunch . map toUpper

converted to Yhc Core:

Roman.fromRoman =
    Prelude._LAMBDA27191
      (Prelude._LAMBDA27191
         (Prelude.map Data.Char.toUpper)
         (Roman.maxmunch Prelude.Prelude.Num.Prelude.Int))
      (Prelude.foldr
         (Roman.fromNumeral
            Prelude.Prelude.Num.Prelude.Int
            Prelude.Prelude.Ord.Prelude.Int)
         0)

-- The LAMBDA is similar to the composition function (.), only with
-- inverted order of application: _LAMBDA27191 f g x = g (f x)

Prelude._LAMBDA27191 v22167 v22166 v2007 = v22166 (v22167 v2007)

converted to Javascript:


/* fromRoman, code was formatted manually */

var F_g8=new HSFun("F_g8", 0, function(){
  return (F_e9)._ap([(F_e9)._ap([new HSFun("F_gz", 0, 
    function(){
      return (F_gz)._ap([F_Z]);
    }), new HSFun("F_g9", 0, 
    function(){
      return (F_g9)._ap([F_dC]);
    })]), new HSFun("F_gp", 0, 
    function(){
      return (F_gp)._ap([new HSFun("F_g7", 0, 
        function(){
          return (F_g7)._ap([F_dC, F_d1]);
        }), 0]);
    })]);
});

/* _LAMBDA27191 */

var F_e9=new HSFun("F_e9", 3, function(_b3, _b2, _bO){return (_b2)._ap([(_b3)._ap([_bO])]);});

During the conversion to Javascript, all identifiers found in Yhc Core are renamed to much shorter ones consisting only of alphanumeric characters, and thus guaranteed to be valid in Javascript (identifiers in Yhc Core are often very long, or contain special characters, etc.).

While it is really hard to understand anything from the Javascript for the fromRoman function (other than that the Javascript backend already makes a good obfuscator), something may be seen in the Javascript for the composition function. It builds the application of its first argument to the third, then the application of the second argument to the result, and returns the latter.

Another example of a function whose implementation was replaced via the Overlay technique is the isSpace function:

global_Data'_Char'_isSpace = f . ord
  where f a = unsafeJS "return uIsSpace(exprEval(a));"

In Yhc Core this becomes:

Data.Char.isSpace =
    Prelude._LAMBDA27191
      Data._CharNumeric.ord
      StdOverlay.StdOverlay.Prelude.287.f

and in Javascript:

var F_W=new HSFun("F_W", 0, function(){return (F_e9)._ap([F_bh, F_hk]);});

In the Haskell code, the global_Data'_Char'_isSpace identifier tells the Core Overlay engine that the function with qualified name Data.Char.isSpace is to be replaced with a new implementation. In the Yhc Core, the reversed composition function reviewed earlier can be seen: it composes the ord function with an inner function that invokes the Javascript routine which performs the Unicode property lookup for a given character code.

Browser compatibility

Compatibility with major browsers such as Netscape/Mozilla/Firefox, Microsoft Internet Explorer, and also Opera has been observed. Compatibility with Safari has not been achieved so far.

Future plan: Fudgets

It is planned to port some portion of Fudgets, so that it becomes possible to write Web applications using this library. Several experiments showed that the Stream Processors (SP) layer, and some parts of the Fudget Kernel layer, worked within a Javascript application. More problems are expected with porting the top-level widgets, due to the many conceptual differences between a Web browser and the X Window System, for which the Fudgets library was originally developed.

Wacky features

Yhc is going in many interesting directions. Some of these directions are likely to become very important in the future; some are likely to fade away. Yhc is a genuine research bed for brand new ideas.

Author: All

When you don't spend all the time on wacky type systems, you get a lot more time left to work on wacky other stuff. Include the Java interpreter, .NET back end, Javascript back end, Python interpreter, Hat debugging, yhi-stack, whole program optimisation. Lots of these things are breeding grounds for various useful technologies, and most are marching towards genuine usefulness.

The Future

The original structure of nhc was one big set of modules – some were broadly divided into type checking/parsing etc., but the overall structure and grouping was weaker than in other compilers – particularly GHC, which has a very well defined structure. One of our first actions was to split up the code into hierarchical modules, introducing Type.*, Parse.* etc. to better divide the components.

Some of the sections are shared – for example both hpc and Hat share the nhc Parse modules, and could equally share the Yhc parse modules. This direction is obviously attractive – by opening up more parts of the compiler for reuse in other projects we gain in many ways:

  • The authors of other projects benefit from having libraries to build upon, not needlessly replicating the functionality already present in Yhc.
  • We benefit from having more people focused and dependent on our code, resulting in a bigger community.
  • We get additional bug reports.
  • By structuring code as a library, “hacks” that before might have been tolerated quickly take on a whole new aura of ugliness.
  • By isolating each component, we minimize the amount of whole compiler knowledge required.

This direction attracts us, and we see this as the future direction of our compiler.

Interestingly, almost at the same time the GHC developers have introduced a GHC API. We do not wish to replicate the GHC API; in particular, their API is more complex than ours. We also wish to design an API first, and then construct some future version of Yhc around the API we develop – rather than exporting an API.

For the next major Yhc version we would hope to rewrite the parser, type checker, renamer and desugarer – leaving only the new ByteCode and Core aspects still intact. This is clearly a lot of work, and we view this as a very long term plan. For the moment we hope to push the Core and ByteCode features of Yhc into new areas.

In the meantime, things like library compatibility and Haskell' loom ever closer on the horizon. Our progress towards these goals relies on the help of volunteers.

Acknowledgements

Thanks to everyone who has submitted a patch, become a buildbot, reported bugs or done anything else to benefit the Yhc project. We've put together a list of most of the people (if we've missed you, please shout, and we'll add your name in future versions of this document!)

Andrew Wilkinson, Bernie Pope, Bob Davie, Brian Alliet, Christopher Lane Hinson, Dimitry Golubovsky, Gabor Greif, Goetz Isenmann, Isaac Dupree, Kartik Vaddadi, Krasimir Angelov, Malcolm Wallace, Michal Palka, Mike Dodds, Neil Mitchell, Robert Dockins, Samuel Bronson, Stefan O'Rear, Thorkil Naur, Tom Shackell