Sunday, March 29, 2009 9:41 PM bart

The M Programming Language – Part 3 – Primer to MGrammar

Previously on “The M Programming Language Tales”: Part 1 and Part 2 introduced MSchema and MGraph.

Today, we continue our journey through the wonderful world of M by taking a look at MGrammar.

 

Why custom grammars?

Let me tell you a story first. About 10 years ago I came into contact with the early traces of the .NET Framework. One of its key goals was to make deployment of applications easier (so-called xcopy deployment and the elimination of DLL Hell), and a key part of that mission was to make it easier for developers to configure their applications without relying on the registry. Hence, System.Configuration was introduced, together with the concept of XML-driven configuration files like app.config and web.config. At the same time, static typing was believed to be the ultimate solution to every problem (I’m exaggerating a little, but still), so together with those files came an effort to schematize them as much as possible, while still allowing developers to have their custom settings in there too: <appSettings> was born. A sample from MSDN:

<?xml version="1.0" encoding="utf-8" ?>
<appSettings>
    <add key="Application1" value="MyApplication1" />
    <add key="Setting1" value="MySetting" />
</appSettings>

Why am I telling you all of this? To draw your attention to a couple of things. First of all, a highly specialized (on a per-application basis) thing like configuration has been generalized by piggybacking on a general data representation technology like XML. And although the whole thing is schematized, the only real way to give developers some storage of their own is this “appSettings” thing with very generic child elements like <add>, <remove> and <clear>.

In other words, though we have achieved the goal of making configuration easier in the light of deployment, this approach failed to give developers enough flexibility to specialize the configuration store to their application’s needs. Next you want keys that contain collections or hierarchically organized settings, and all you have is key-value pairs. Adding syntax highlighting doesn’t help to “highlight” what’s important and what’s not (the way keywords in a general-purpose language indicate special meaning); everything looks flat:

<?xml version="1.0" encoding="utf-8" ?>
<appSettings>
    <add key="Application1" value="MyApplication1" />
    <add key="Setting1" value="MySetting" />
</appSettings>

There’s a positive thing to it though: our application is now metadata-driven. You could go one step further and say the application is model-driven, at least where its configuration is concerned. And that’s a good thing: no need to recompile to reconfigure the application. Gigantic applications like IIS and SharePoint take this to extremes: extending a SharePoint site is “simply” a matter of adding lots of configuration entries (most of which are automated through configuration and management UIs) to the SQL database that drives it. And that’s what modeling is all about. By the way, this approach of modeling applications has gradually expanded over the years: COM+ had attribute-driven configuration (e.g. to make an operation transactional), in .NET we introduced a generalized mechanism for custom attributes, and XAML is also about capturing the “intent” of the developer in a structured way (as opposed to burying it under a load of (generated) code).

So, where are we with regards to XML as a means to make an application play better in the world of modeling? For one, we have a democratized means of storing and retrieving data in a (not-so-much-)human-readable and (again-not-so-much-)human-writeable way. Why am I using an expensive word like “democratized” in this context? Simply because XML is extremely handy to interact with from a programmatic point of view: we have lots of XML libraries out there – MSXML, System.Xml and LINQ to XML come to mind – that make it incredibly easy to consume and produce XML data. Nobody has to think about lexing and parsing (to be explained further on) anymore; XML API + XML data = power. But there are still gradations to the ease of using XML. When there’s an XML schema (XSD) at hand, a mapping can be established to make the XML look like regular strongly typed objects (xsd.exe), but it turns out it’s far more common not to have an XML schema (think of REST services as opposed to WSDL-based web services).

And that’s where “M” comes in again, with a technology called MGrammar. There are quite a few parallels to be drawn with XML:

  • MGraph ~ XML: using MGraph we declare data (like { X = 1, Y = 2 }), just like we do with XML (e.g. <Point><X>1</X><Y>2</Y></Point>). Another comparison you could make here is with JSON; it’s no coincidence that MGraph shares JSON’s philosophy of reducing syntactic noise.
  • MSchema ~ XSD: to capture the shape of MGraph data, one can use MSchema, just like XML data can be constrained by means of an XSD schema. This opens the door to validation, tooling support and typing (structural or not). E.g. type Point { X : Integer, Y : Integer }.
  • MGrammar ~ XSLT^-1: although MGraph is less noisy than comparable XML data (no sharp brackets and closing tags anymore, long live curly braces), it’s still way too generic a tool for humans to interact with the data in a natural way. What if we want to take a textual representation of our data (like (1, 2)) and turn it into data that follows a certain structure? That’s what MGrammar is for (see the sketch right below this list).
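To make the parallel concrete, here are the two inline examples from the list above side by side in M (just a restatement of what was quoted, not new material):

// MSchema (~ XSD): constrains the shape of the data
type Point { X : Integer, Y : Integer }

// MGraph (~ XML): declares an instance of the data
{ X = 1, Y = 2 }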

So, what we end up with is the concept of Textual DSLs (Domain Specific Languages):

[Figure: a textual DSL sits between human-authored text and structured data]

Just like XML democratized the consumption and production of data (without the burden of configuring databases, or even thinking about schematization if you’re lazy), MGrammar democratizes the use of a fit-and-finish textual representation of your data by means of a textual DSL.

An overview picture tells it all:

[Figure: overview of the M stack – MSchema, MGraph and MGrammar]

The MSchema and MGraph portions should be familiar, as we’ve seen those in previous posts. Today we’ll focus on the MGrammar portion of the picture.

 

MGrammar from a 10,000-foot view

One way to introduce MGrammar is from a tooling chain perspective: where does MGrammar fit in the end-to-end tooling story around Oslo? In the previous posts we’ve seen how to use M.exe and MX.exe (and you can envision much more built-in tooling support to hide those low-level details) to take in a model definition, compile and deploy it to the Repository. This part of the tool-chain is shown below:

[Figure: the compile-and-deploy tool-chain – M files go through M.exe/MX.exe into the Repository]

Now the question is: where do those input M files come from? One answer is we could have written them by hand. That’s certainly a viable option, but as pointed out before, we also want a way to have more specialized input formats in the form of custom DSLs. The output of those DSLs should eventually lead to M files to be fed into the compile-and-deploy chain, so we can put the following MGrammar chain in front of it:

[Figure: the MGrammar tool-chain – mg.exe compiles a .mg grammar into an executable .mgx; mgx.exe runs it against input text, producing MGraph]

The output of this chain is in the form of MGraph data, which can be fed into the M/MX tool-chain above. The left part of the picture is the development-time part of it: we write a custom grammar and compile it to an executable grammar (.mgx file) using the mg.exe tool. The right part of the picture is concerned with the execution of the grammar, given the executable grammar and an input file, resulting in an MGraph output, ready to be deployed to the Repository using the M/MX chain, or to be consumed by other means (as there’s a whole managed framework around MGrammar).

What can we compare MGrammar to? To a limited extent, you can compare it with XSLT, or better: the inverse of it. XSLT is typically employed to take an XML file in, and out comes a file in any (textual) format you see fit (HTML, text files, other XML files, C# files). This is a simpler transformation than the opposite, where any textual representation of data comes in (a C# file, a custom DSL, what have you) that we want to transform into structured data. Actually, this is the very well-understood problem of lexing and parsing.

 

Lexing and parsing

Over the years many ways to turn text into more structured forms of data have been invented. Compiler writers use those techniques on a day-to-day basis, but it’s hard to say they have been available to the masses (Mort, to name one). I won’t go into much detail about this; the interested reader should grab the Dragon Book from the shelf (or buy it if it isn’t there already – it’s worth a thousand fiction books, as you can infer from the cover picture). Nevertheless, here’s a short overview of the steps involved in turning text into something more actionable for further consumption:

  • Lexical analysis (or “lexing”) is the act of tokenizing the input text into individual tokens, which you can see as words in the source language. E.g. x => x + 1 could be turned into a sequence of the following tokens: ID(“x”), LAMBDA_ARROW, ID(“x”), PLUS, NUM(1). Lexing itself can be divided into two parts: scanning and tokenizing. The first part takes in the source text as a stream of characters that drive a state machine in order to reach final states that denote lexemes, which are strings of characters that compose into a known kind of token (e.g. 123 would be a lexeme that can be turned into a NUM token). Known lexer generators include lex, flex, ANTLR, fslex, etc.
  • Syntactical analysis (or “parsing”) takes in a sequence of tokens in order to recognize well-formed sentences as described by a grammar. For example, the sequence of tokens for “x => x + 1” as seen above would be recognized (in C#) as a lambda function, which is a well-formed kind of expression in that language. However, a sentence like “x => x 1” would not be recognized as valid input, as the grammar has no production stating that a sequence of an ID and a NUM can be turned into a valid expression (in this case, to become the body of the lambda expression). There are various forms of parsers, such as recursive descent, LL, LR and their variants. Known parser generators include yacc, bison, fsyacc, etc.

Typically both the lexer and the parser are chained together to form the front-end of a language compiler. Part of the parsing phase is concerned with executing certain actions when a valid sentence is detected. For example, a sequence of EXPR, PLUS and EXPR could be turned into an object that represents the binary addition operator, with the first EXPR as the left-hand side and the second EXPR as the right-hand side. Carrying out this process for an entire input ultimately results in a syntax tree that can then be consumed by the next phases of the compiler, for type checking, optimization and code generation.
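Jumping ahead a bit, here’s what that lexer/parser split will look like in MGrammar terms, sketched for a toy addition language (the Add label and rule names are mine, using only constructs introduced below):

module Demo
{
    language Calc
    {
        // Lexing: recognize runs of digits as number tokens.
        token Num = '0'..'9'+;

        // Parsing: recognize the sentence NUM "+" NUM and produce
        // a node representing the binary addition operator.
        syntax Main = l:Num "+" r:Num => Add { Left = l, Right = r };
    }
}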

Time permitting, I’ll blog about F#’s fslex and fsyacc tools too, to show an example of the fairly low-level plumbing that’s involved in dealing with lexers and parsers. However, rest assured: MGrammar is much easier to deal with :-).

 

Your first grammar

Okay, time to get the ball rolling. Let’s open up Intellipad in “Samples Enabled” mode (see the dedicated shortcut in the Start Menu for this mode), as MGrammar is currently only available in that mode. Switch to MGrammar mode in the top-right corner:

[Screenshot: selecting MGrammar mode in Intellipad’s top-right corner]

Notice the new menu entry that appears:

[Screenshot: the new “MGrammar Mode” menu entry]

Don’t click any of this yet. Let’s write some MGrammar code first. As usual, the top-level container for M-stuff is a “module”, which controls visibility, import/export mechanisms, units of deployment, etc. Let’s call ours “Demo” to stick with a cliché, and add a “language” declaration to it. This is where we enter the MGrammar part of the picture. As I didn’t want to settle for the boring “Hello World”, we’ll go for a FriendList language:

[Screenshot: the Demo module with a FriendList language declaration]
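In text form, this skeleton comes down to the following (a sketch; the screenshot may differ in layout):

module Demo
{
    language FriendList
    {
    }
}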

Notice the syntax highlighting for MGrammar; to Intellipad it’s just another DSL (this time a DSL to build DSLs in). Now the real work: inside the language definition we can write two things:

  • Tokens, to recognize words in the language.
  • Syntax rules (or productions), to recognize sentences in the language.

In the lex/yacc world, the starting production is indicated with a %start directive; MGrammar follows a pattern developers already know: Main is the “entry point” of the grammar. In passing, let’s reveal what Hello World would look like:

[Screenshot: the Hello World grammar]
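In text, that grammar is simply (a sketch reconstructed from the description that follows):

module Demo
{
    language HelloWorld
    {
        syntax Main = "Hello World";
    }
}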

This language would only accept the literal string “Hello World” as its input. Notice this makes the language whitespace sensitive too: exactly one space is permitted between the words “Hello” and “World”. This can be remedied too, but let’s keep that for later. First, let’s focus on what valid sentences in our language should look like. Our DSL is going to be used to keep track of friend names and ages (feel free to party on the result to go for a more complicated language that accepts birthdays, food preferences, shoe sizes, and everything you find interesting to store about your friends). We want users to be able to specify this information using natural language, like this:

Bart is 26
John is 60

where “is” is recognized as a DSL keyword. What we’re seeing here are two sentences, each of which has one keyword and two tokens: one for the name and one for the age. Let’s start by defining what a name and age token look like.

First the name. A name consists of a sequence of letters, the first of which is capitalized, and let’s assume a minimum of one character. Regular expressions can be used to express those requirements, and MGrammar uses similar syntax. One way to write this is as follows:

token Name = 'A'..'Z' 'a'..'z'*;

Notice a couple of things here: “..” is used to indicate a range. Concatenation is simply done by juxtaposition: in this case a capital letter followed by any number of lower-case letters. And finally, Kleene operators like + and * can be used to indicate repetition. Although this is a good first attempt, we could split matters up a bit more: what about recognizing a lower-case letter as a separate token, and similarly for an upper-case one:

token Upper = 'A'..'Z';
token Lower = 'a'..'z';
token Name = Upper Lower*;

Next token to recognize is an age. For fun, let’s rule out having babies as friends (no meaningful conversation possible anyway), so let’s only recognize ages 1 to 99 (feel free to relax or constrain those rules further):

token Digit = '0'..'9';
token Age = '1'..'9' Digit?;

Here we state that an age is 1 to 9, optionally (? operator) followed by a digit from 0 to 9. So far, we have this:

[Screenshot: the grammar with the Name and Age token definitions]
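Spelled out, the token section of the grammar now reads:

module Demo
{
    language FriendList
    {
        token Upper = 'A'..'Z';
        token Lower = 'a'..'z';
        token Name = Upper Lower*;
        token Digit = '0'..'9';
        token Age = '1'..'9' Digit?;
    }
}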

Next we need a syntax rule (fancy word for sentence) that recognizes a single friendship declaration in the form of Name “is” Age. Simple once more:

syntax Friend = Name "is" Age;

And finally, we need a way to express a list of friends for the main syntax rule:

syntax Main = Friend*;

The result is as follows:

[Screenshot: the complete FriendList grammar]
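Or, assembling the snippets above into one listing:

module Demo
{
    language FriendList
    {
        syntax Main = Friend*;
        syntax Friend = Name "is" Age;

        token Upper = 'A'..'Z';
        token Lower = 'a'..'z';
        token Name = Upper Lower*;
        token Digit = '0'..'9';
        token Age = '1'..'9' Digit?;
    }
}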

All of this is nice, but how do we test the damn thing? Actually I’ve held back something all this time: the use of the “MGrammar Mode” menu. One option in there is the Tree Preview, which we’ll turn on now:

[Screenshot: enabling Tree Preview from the MGrammar Mode menu]

Intellipad will now ask for an input file to run against the grammar. Just create a new empty text file somewhere and point to it:

[Screenshot: Intellipad prompting for an input file]

The editor will now switch to a three-pane view that looks like this:

[Screenshot: Intellipad’s three-pane view – input, MGraph output and errors]

In the left pane we can write input, resulting in an output MGraph on the right and possible errors at the bottom. Let’s give it a try and write a sentence for myself (I adore myself as a friend, ahem):

[Screenshot: a parse error on the space around “is”]

Something isn’t quite right. The parser is telling us that the spaces around the “is” keyword are invalid input. We need some way to accept spaces there, and preferably one or more whitespace characters, as we want some flexibility. An interleave rule is what we need:

[Screenshot: the grammar extended with an interleave rule]
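The rule shown there is along the following lines (the rule name is my choice; note it deliberately covers only ' ' and '\t' for now):

interleave Whitespace = ' ' | '\t';

Tokens matched by an interleave rule may appear between the tokens of all other rules and are dropped from the output – exactly what we want for insignificant whitespace.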

That’s better (although there’s a remaining problem when dealing with multiple friend sentences; try to figure out what whitespace characters we need to add in addition to ' ' and '\t'), but take a look at our output: not quite what we want. All we want is a piece of data stating there’s a friend called “Bart” with age “26”, but we don’t want to see the “is” keyword anymore. To make this work, we can extend the syntax rule with a production, stating how the parts of the input (Name and Age) need to be turned into a data representation of our choice:

syntax Friend = n:Name "is" a:Age => Friend { Name = n, Age = a };

Think of the => as “produces” or “is turned into”, where the right-hand side is whatever shape we want the data to be in. Notice how the parts of the sentence are prefixed with names so they can be referred to in the production’s output (this is far easier to understand than the numeric reference schemes often used in yacc and the like). Notice how the output is much nicer now:

[Screenshot: the cleaned-up MGraph output]
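For the input “Bart is 26”, the production yields data of roughly this shape (modulo the exact wrapping the Tree Preview adds):

Friend { Name = "Bart", Age = "26" }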

Finally, we could clean up the Main rule too (to showcase another capability of MGrammar), like this:

syntax Main = l:Friend* =>
    Friends { valuesof(l) };

Here we use the valuesof keyword to extract the values of a list. This is handy to flatten lists produced by certain rules, as will become apparent if you go for more complicated grammars (maybe next time).
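For the two friend sentences from the start of this section, the end result now has roughly the following shape (a sketch, not verbatim tool output):

Friends
{
    Friend { Name = "Bart", Age = "26" },
    Friend { Name = "John", Age = "60" }
}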

 

Compiling and executing the grammar

We’ve already been interacting with the grammar through Intellipad above, but to make it really useful we’d like to use it in executable form. There are various approaches that can be taken here:

  • Compile the grammar using mg.exe and execute it using mgx.exe.
  • Embed the grammar in a .NET application and consume the output of a parse run in there (as rich .NET objects).

For the purpose of this introductory post, we’ll go for the former option and defer discussion of .NET integration till a later post. So, save the grammar in an .mg file and open a command prompt with the PATH set to include the location of mg.exe (%programfiles%\Microsoft Oslo SDK 1.0\Bin). The first thing to do is to compile the grammar, as follows:

mg grammar.mg

This will produce an .mgx file, which is the executable grammar format:

[Screenshot: mg.exe compiling grammar.mg into grammar.mgx]

Actually, an MGX file is (once more) an OPC file, which is a ZIP file containing XML definitions of the grammar. You can verify this by making a copy of the file with a .zip extension, opening it, and inspecting its contents:
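From the same command prompt, that’s simply (assuming the file names used above):

copy grammar.mgx grammar.zip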

[Screenshot: the contents of the .mgx package, including Demo.FriendList]

The Demo.FriendList file is an XML file representing the grammar of our language. In there you’ll find some information about the productions and a lexer table (in compressed format). But that would take us too far afield. All we need is something that knows how to execute the grammar, and mgx.exe is one such tool:

mgx test.txt /r:grammar.mgx

This produces an .m file with the output of the parse run:

[Screenshot: the .m output file of the parse run]

 

Conclusion

MGrammar makes it very easy to produce M content based on a custom DSL specification. In this post you saw how to use Intellipad to author an MGrammar specification, test it against some sample input, and shape the output. To finish off this post, we looked at the MGrammar tool-chain used to compile and execute the grammar.

In subsequent posts we’ll take a look at MSBuild support to compile MGrammar specifications and how to leverage the power of MGrammar from within a .NET application using the Oslo SDK.

Have fun!



Comments


# re: The M Programming Language – Part 3 – Primer to MGrammar

Monday, March 30, 2009 11:58 AM by aL

M seems cool, but why is it so tied to SQL Server? It seems like the Repository could be anything, but in every example SQL Server is used, and that worries me a bit.

I also have yet to understand how M will add value to my app, but that's most likely because I don't fully understand it :)


# re: The M Programming Language – Part 3 – Primer to MGrammar

Tuesday, March 31, 2009 9:39 AM by bart

Hi aL,

M isn't directly tied to SQL Server specifically; it just happens to be one storage technology we're leveraging for the implementation of our Repository. But you could take e.g. MX files and deploy them to other technologies, given such a binding is created – e.g. to XAML file-based storage.

Hope this helps,

-Bart
