April 2008 - Posts

In our journey to find the perfect match, we've arrived at an interpreter-based pattern matcher in the previous post. Although there are quite some limitations and the syntax isn't as sweet as we'd like it to be, it's fully functional. However, what about the performance of our pattern matcher? Consider the following benchmark:

Random rand = new Random();

var persons = new List<Person>();
for (int i = 0; i < 100000; i++)
    persons.Add(new Person("Bart", rand.Next(1, 10)));

var watch = new Stopwatch();

var baseline = persons.Select(p =>
{
    Person person = p as Person;
    if (person != null && person.Age == 5)
        return true;
    else
        return false;
});

watch.Start();

foreach (var p in baseline)
    ;

watch.Stop();
long baseMs = watch.ElapsedMilliseconds;
Console.WriteLine(watch.Elapsed);

var res = persons.Select(p => new Pattern<bool>(p)
    .Match((string name) => new Person(name, 5), name => { return true; })
    .Else(() => { return false; })
);

watch.Reset();
watch.Start();

foreach (var p in res)
    ;

watch.Stop();
Console.WriteLine(watch.Elapsed + " " + watch.ElapsedMilliseconds / baseMs);

Debug.Assert(baseline.SequenceEqual(res));

This comparison is "fair enough" - given a series of 100,000 randomized Person objects, we want to find matches with age 5. Although our pattern match will do slightly more work (by design) to extract the Name property to bind it to the name variable (which is unused), the essentials are pretty much the same: in both cases we use iterators, we measure the time to iterate over the sequence and the pattern match has identical semantics in both cases. Of course we check the validity of the result using an Assert operation. Here's the result of this benchmark:

image

 

Some profiling

Correctness has proven to be okay, but speed has been proven to be a little problematic, but where's the bottleneck? Although it's pretty apparent from the code itself (explanation to follow), let's go with CLR Profiler - the art of engineering is measurement after all.

Manual pattern match Our pattern match
Summary for C:\temp\Switch++\Pattern\bin\Release\Pattern.exe
Allocated bytes:                    442,449
Relocated bytes:                          0
Final Heap bytes:                   442,449
Objects finalized:                        0
Critical objects finalized:               0
Gen 0 collections:                        0
Gen 1 collections:                        0
Gen 2 collections:                        0
Induced collections:                      0
Gen 0 Heap bytes:                   Unknown
Gen 1 Heap bytes:                   Unknown
Gen 2 Heap bytes:                   Unknown
Large Object Heap bytes:            Unknown
Handles created:                         68
Handles destroyed:                        0
Handles surviving:                       68
Heap Dumps:                               0
Comments:                                 0
Summary for C:\temp\Switch++\Pattern\bin\Release\Pattern.exe
Allocated bytes:                  1,286,172
Relocated bytes:                     55,642
Final Heap bytes:                   296,884
Objects finalized:                        1
Critical objects finalized:               1
Gen 0 collections:                        1
Gen 1 collections:                        0
Gen 2 collections:                        0
Induced collections:                      0
Gen 0 Heap bytes:                 1,048,612
Gen 1 Heap bytes:                    57,660
Gen 2 Heap bytes:                        12
Large Object Heap bytes:             13,440
Handles created:                         44
Handles destroyed:                        0
Handles surviving:                       44
Heap Dumps:                               0
Comments:                                 0

Definitely the allocated bytes go much higher (notice I've reduced the number of iterations to only 1000). The allocation graph shows clearly where the huge number of allocations comes from:

image

<Benchmark>b__2 is an anonymous method generated by the lambda expression in here:

var res = persons.Select(p => new Pattern<bool>(p)
    .Match((string name) => new Person(name, 5), name => { return true; })
    .Else(() => { return false; })
);

which results in Pattern<T> objects but also an bunch of LambdaExpression objects because the call to Match takes in an Expression<...> so the compiler generates the expression tree objects by calling things like Expression.New and Expression.Lambda. If you were to follow the MethodBase::GetMethodFromHandle path, this is the code responsible to get the ConstructorInfo object for the new Person(...) call inside our lambda, which becomes a parameter to the NewExpression.

A breakdown of the allocations shows us clearly how much we're allocating for expression trees and reflection info. The age of objects tells us something too but shouldn't be taken too serious since our run didn't last long enough to really tell what's going on there (exercise: play a bit with the profiler for longer runs):

image image

Telling even more is the call-graph:

image

And here's a closer look at the call graph for an individual iteration:

image

Enough stats for now - what are we going to do about it? Those match objects don't really change much in a series of pattern matches inside a loop body. Yes, we're dragging around some metadata that tells the pattern match is already has taken on a value (_hasValue) to suppress subsequent match clause evaluations, but other than that a pattern match object can be reused. There are certain edge cases of course where reuse might get trickier, such a multi-threading, but that too we can fix by eliminating global state.

 

Lazy evaluation

Our first attempt uses eager evaluation and mimics a classic switch statement expression in a syntactical fashion:

int res = new Pattern<int>(o)
   .Match(clause, (...) => {
      // body
   })
   .Match(clause, (...) => {
      // body
   })
   .Else(() => {
      // body
   })
;
int res = switch(o) {
   case clause: {
      // body
   }
   case clause: {
      // body
   }
   default: {
      // body
   }
};

This would be fine if the pattern matcher would be code instead of an object. This form of a pattern match could be called closed since it's already bound to an object (passed in through the constructor) and it's eager because every match clause evaluates straight away. What we really want is to reuse a pattern match object to match many different objects, something we can denote as an open pattern match since it's not bound to an object at construction time. It will also be lazy since it won't evaluate till we force it to do so by passing it an object. This is much like LINQ queries that simply capture the intended operation without performing it straight away, rather it waits for a stimulus (which is iteration in that case) to go ahead and execute whatever is needed to obtain the requested results.

Let's start by building the open form of the Pattern<T> object by adding a constructor:

class Pattern<T>
{
   private bool _hasValue;

   public Pattern() { }

   public Pattern(object o)
   {
      if (o == null)
         throw new ArgumentNullException("o");

      Object = o;
   }

   public object Object { get; private set; }
   public T Result { get; private set; }

   ...
}

No rocket science so far. We've just added a default constructor which implicitly captures some state. The fact the Object property will be null in that case acts as the distinction mechanism between lazy and eager and for the same reason we disallow a closed pattern match to be bound to a null value, in an attempt to keep things simple. You could certainly go ahead and add another boolean to distinguish between closed and open (exercise).

The idea for an open pattern is to capture all the information required to operate on an object at a later point in time, so we'll change our Match and Else methods. First of all, Else can no longer return the result since the whole intent is to defer execution. Since there's no overloading on a method's return type, we'll simply change it as follows:

public Pattern<T> Else(Func<T> f)
{
    if (this.Object == null)
    {
        if (_else != null)
            throw new InvalidOperationException("Else clause has been set already.");

        _else = f;
    }
    else
    {
        if (!_hasValue)
        {
            _hasValue = true;
            Result = f();
        }
    }

    return this;
}

introducing another member variable _else:

private Func<T> _else;

The main difference is obvious - if we're in the case of a closed pattern match, the old code does all the work and assigns the result to the Result property if no other match clause has caused a final value to be set. In the case of an open pattern match, we simply cache the body delegate of the Else clause and make sure there's only one (something we don't protect against in the eager case - exercise). This change means the caller of a close pattern must add a call to the Result property getter to obtain the result:

var res = persons.Select(p => new Pattern<bool>(p)
    .Match((string name) => new Person(name, 5), name => { return true; })
    .Else(() => { return false; }).Result
);

This doesn't trigger the eager evaluation though - it simply returns the result that has already been calculated. One could even try (exercise) to reduce the need for calling Result by introducing an implicit conversion operator, converting a Pattern<T> into a T but there's a caveat (what would that be?).

How are we going to change the Match calls? Here's the revised MatchInternal:

private Pattern<T> MatchInternal(LambdaExpression e, Delegate f)
{
    if (this.Object == null)
    {
        if (_else != null)
            throw new InvalidOperationException("Else clause has been set already.");

        _matches.Add(new MatchEntry() { Match = e, Action = f });
        return this;
    }


    ...
}

Again, we simply distinguish between closed and open, and in the latter case we make sure there's still room for a match clause and if so, we add a new MatchEntry to a list of match entries. A what? Here's the definition of a match entry:

internal class MatchEntry
{
   public LambdaExpression Match { get; set; }
   public Delegate Action { get; set; }
}

Those entries keep the pair of the match clause (Match) and match body (Action) and are stored in a list:

private List<MatchEntry> _matches = new List<MatchEntry>();

Finally, we need to introduce an evaluation mechanism and what's better than an Execute method, taking in an object and returning a T?

public T Execute(object o)
{
    if (Object != null)
        throw new InvalidOperationException("Execution is only valid on unbound pattern match objects.");

    if (_else == null)
        throw new InvalidOperationException("Can't evaluate the match. Incomplete match object. Are you missing an Else clause?");

    foreach (MatchEntry entry in _matches)
    {
        Dictionary<ParameterExpression, PropertyInfo> bindings;
        if (TryMatch(o, entry.Match, out bindings))
        {
            return Evaluate(o, entry.Match, entry.Action, bindings);
        }
    }

    return _else();
}

We don't use the Result property since the Pattern<T> has an open character: it shouldn't know about any object whatsoever, neither about any result. The logic is simple: only unbound (open) pattern match objects can be used with the Execute method and _else should be present (an alternative is to either return default(T) or to throw an exception, which is not really functional). Finally we cycle through the list of match entries and try to find the right match. In case we find one, we evaluate and return the result. Otherwise, control reaches the Else case.

 

Re-evaluating the performance

We expect this to be better already, but let's measure through our extended benchmark:

var pattern = new Pattern<bool>()
    .Match((string name) => new Person(name, 5), name => { return true; })
    .Else(() => { return false; });

watch.Reset();
watch.Start();

var res2 = persons.Select(p => pattern.Execute(p));

foreach (var p in res2)
    ;

watch.Stop();
Console.WriteLine(watch.Elapsed + " " + watch.ElapsedMilliseconds / baseMs);

Debug.Assert(baseline.SequenceEqual(res2));

Here's the result for 100,000 iterations:

image

Definitely better with just a simple change. However, there's definitely more room for improvement and by making our move towards an open pattern match, we're enabling lots of new scenarios. We'll cover those next time, but let's finish off this episode by looking into the profiling data for this run. As expected, allocation numbers have gone down:

Summary for C:\temp\Switch++\Pattern\bin\Release\Pattern.exe
Allocated bytes:                    756,860
Relocated bytes:                          0
Final Heap bytes:                   756,860
Objects finalized:                        0
Critical objects finalized:               0
Gen 0 collections:                        0
Gen 1 collections:                        0
Gen 2 collections:                        0
Induced collections:                      0
Gen 0 Heap bytes:                   Unknown
Gen 1 Heap bytes:                   Unknown
Gen 2 Heap bytes:                   Unknown
Large Object Heap bytes:            Unknown
Handles created:                         44
Handles destroyed:                        0
Handles surviving:                       44
Heap Dumps:                               0
Comments:                                 0

There's obviously still an allocation cost to build up the open pattern match but things have improved. The number of calls triggered by the iteration's select clause have gone down too:

image

but this figure allows us to pick some other targets: GetCustomAttributes to extract the constructor's mapping metadata is a huge cost. Again expanding a single iteration, we now see:

image

Notice the number of calls on the top line: 485. We come from 637 in the previous round with the closed pattern, due to the cost of allocating the new pattern object (with all its expression tree magic etc). Eliminating the loop body's cost is our next goal.

 

Stay tuned!

Del.icio.us | Digg It | Technorati | Blinklist | Furl | reddit | DotNetKicks

In the previous episode of this series we took a look at the concept of pattern matching as it exists in a couple of functional languages out there as well as how we're going to map it to C# 3.0 syntax. We left off with the exercise of guesstimating the public contract of our Pattern`1 class based on this sample:

string message = new Pattern<string>(x)
   .Match((string name) => new Person(name, 25), name => name + " is 25 years.")
   .Match((int age) => new Person("John", age), age=> "John is " + age + " years.")
   .Else(() => "Unmatched object");

 

Deriving the pattern

A few observations to infer this contract:

  1. The type parameter T of Pattern<T> denotes the return type of the pattern.
  2. Patterns can operate on any object, hence the constructor takes a System.Object.
  3. The number of matches, denoted by Match calls, has no upper bound and we want to chain these together fluently (aka fluent interfaces, just like LINQ does with its Standard Query Operators).
  4. There can only be one Else clause and it seems to return the pattern's result type (we'll revisit this decision in subsequent posts).

So, we already have something like:

class Pattern<T>
{
   public Pattern<T>(object o) { ... }

   public Pattern<T> Match(...) { ... }
   public T Else(...) { ... }
}

The way we'll chain together match clauses is by accumulating information about matches in the pattern and returning ourselves, until Else comes around to complete the pattern. Notice different decisions could be made for the Else policy, we stick with the above for now for sake of simplicity.

There's one big missing piece: the parameterization of the Match and Else methods. Given the sample code, you're seeing lambdas everywhere but from that observation alone one can't make the decision of the type. Lambdas in C# 3.0 don't really have a type by themselves, they can be assigned to two things: either delegates (that compile into IL) or expression tree objects (that compile into expression trees = code as data). A quick analysis:

  • For the match clause, the first parameter to Match that is, we don't really want to new up an object like new Person(bar, 25). Sure enough, we could trace back (see further) the 25 to the Age property and check the passed in object is of type Person with a matching Age property value, but what for heaven's sake would we pass in to the "free parameter" denoting the person's name? Moreover, we need to be able to extract the Name property instead.
  • For the match body, once we have the "free parameter(s)" in case the match clause evaluated positively, we need to execute the body given the value for those parameters. One question is what precisely we want to allow in the body. One could argue simply for valued expressions, since the whole thing returns a value. On the other hand, statements would come in handy too (you could imagine a Pattern class without a return type, i.e. a non-generic variant), much like a normal switch statement has statement bodies in the case blocks.

The latter one is fairly straightforward at the time of writing this post. We don't have statement trees in the language (yet) and since we want to allow arbitrary code - not just simple expressions - the use of a Func<...> delegate is the only way. This is where our syntactical restriction comes from to a certain extent, as I'll outline below.

For the former one, we're in somewhat more trouble it seems. We don't really need to create an object in the match clause, we rather need to match it and have all the information required to do so. This screams for a runtime approach since we're not changing the language nor the compiler. Such a meta-way of programming is enabled through expression trees so the match clause will be of type Expression<Func<...>>.

Here's the full definition finally:

class Pattern<T>
{
   public Pattern<T>(object o) { ... }

   public Pattern<T> Match<TR>(Expression<Func<TR>> e, Func<T> f) { ... }
   public Pattern<T> Match<T1, TR>(Expression<Func<T1, TR>> e, Func<T1, T> f) { ... }
   public Pattern<T> Match<T1, T2, TR>(Expression<Func<T1, T2, TR>> e, Func<T1, T2, T> f) { ... }
   public Pattern<T> Match<T1, T2, T3, TR>(Expression<Func<T1, T2, T3, TR>> e, Func<T1, T2, T3, T> f) { ... }
   public T Else(Func<T> f) { ... }
}

with a limited number of overloads since we don't have a params equivalent for type arguments as captured in the following PC# syntax:

   public Pattern<T> Match<params T[], TR>(Expression<Func<params T[], TR>> e, Func<params T[], T> f) { ... }

 

Syntax restrictions

As outlined before, the syntax of our match is a bit cumbersome in that we need to repeat the parameterization for both the match clause and match body:

.Match((string name) => new Person(name, 25), name => name + " is 25 years.")

There are different approaches possible but keep in mind we need the former to become an expression tree and the latter a regular delegate. You could imagine a Match object defined like this instead:

class Match<T1, TR>
{
   public Expression<Func<T1, TR>> Clause { get; set; }
   public Func<T1, T> Body { get; set; } // Q: Where could T possible come from?
}

while having and Add method on Pattern<T>, taking in a Match<...> object and allowing for collection initializer syntax. Now we could new up the match using either a constructor (not shown above) or object initializer syntax but we still need to say the parameterization twice (i.e. two lambdas). What we'd need is lifting of the parameterization to both properties:

(string name) => new Match { Clause = new Person(name, 25), Body = name + " is 25 years." }

Exercise: I'll leave it to the reader as an intellectual journey to think about the issues that arise in this hypothetical code and how something maybe could be done if you restrict the bodies to be expressions only.

For our explorative purposes in this series we consider the syntax restrictions not the biggest concern.

 

Picking an evaluation strategy

Next question to answer is how we're going to evaluate the pattern match. We'll explore different methodologies but here's a list of possibilities that jump into our minds:

  1. As the chain of Match calls flows, we evaluate the match body at once and if it evaluates we run the body. The result is kept around and subsequent Match calls short-circuit any subsequent match clause evaluation. If the flow reaches the Else clause and the pattern hasn't taken on a value yet, we evaluate the Else body and return that as the result. This is a form of eager evaluation.
  2. Instead of trying to match straight away, we keep track of the match clauses and bodies and store those somewhere. This is a form of lazy evaluation but it allows for more flexibility since we can reuse the same Pattern object multiple times and can even come up with an evaluation strategy.

We keep the best for the last and start off with eager evaluation in today's post. In subsequent posts we'll revamp the Pattern<T> class to support lazy evaluation. Be prepared for tons of cool stuff including Reflection.Emit by then...

 

The art of unification

In the center of our pattern matching lies the evaluation of the match clauses which comprises two tasks:

  • Checking for a match to the supplied object. This is called unification, a technique also employed in languages like Prolog.
  • Extracting unmatched "free parameters" (aka match variables) in order to feed those into the match body delegate.

What's there to unify? Lots of stuff and we could go a long way in providing flexibility. Let's start with a basic sample:

int Fib(uint i)
{
   return new Pattern<uint>(i)
      .Match(() => 0, () => 1)
      .Match(() => 1, () => 1)
      .Match((uint n) => n, n => Fib(n - 1) + Fib(n - 2))
      .Else(() => 0 /* should never happen */ );
}

Notice we could work with an int too but then we need to take care of the negative case. This would require guard conditions (like when in F#) which is an easy extension to the pattern match code (exercise for later). The code above already illustrates a predictable issue with eager evaluation: since we call ourselves recursively, we always new up the same pattern object which is clearly not the best idea performance-wise. As we'll see later by keeping the pattern object around for reuse we can improve the situation a lot, and we'd even come close to the imperative variant:

int Fib(uint i)
{
   if (i == 0) return 1;
   else if (i == 1) return 1;
   else return Fib(i - 1) + Fib(i - 2);
}

which could even be written as a nested ternary operator expression:

int Fib(uint i)
{
   return (i == 0 ? 1 : (i == 1 ? 1 : Fib(i - 1) + Fib(i - 2)));
}

Back to the original question, what's there to unify in this case? Consider the following match clause: () => 0. In terms of an expression tree this will result in a LambdaExpression with it's Body set to a ConstantExpression, wrapping the value 0 . This means our translated match looks like (pseudo):

if (<inputobject> == 0)
   // execute body

Nothing too fancy and actually precisely the same as a regular switch expression that matches a constant. So, let's skip this trivial case (exercise: implement this unification yourself extending the code presented further on - the third match with variable n will be covered when talking about the extraction phase).

What's more interesting is the case of matching a complex type, let's call it a record, as in:

string message = new Pattern<string>(x)
   .Match((string name) => new Person(name, 25), name => name + " is 25 years.")
   .Match((int age) => new Person("John", age), age=> "John is " + age + " years.")
   .Else(() => "Unmatched object");

Consider the first match clause:

(string name) => new Person(name, 25)

One thing is clear: we need to do a type-check for a Person object. A simply equality check won't work since we don't have a value to compare with (name is unspecified). However, we're missing an additional piece of information. All we have is a constructor call which is turned into an expression tree like this:

ParameterExpression name = Expression.Parameter(typeof(string), "name");
LambdaExpression<Func<string, Person>> clause = Expression.Lambda(
   Expression.New(
      typeof(Person).GetConstructor(new Type[] { typeof(string), typeof(int) }),
      name,
      Expression.Constant(25)
   )
);

The culprit here is the second parameter to the constructor call in the NewExpression. We know its value to be 25 but given a random Person object, how can we correlate this value back to "something" to check against? We know the parameter denotes the Age property and we assume what has been passed through the constructor can be retrieved - unchanged - through the Age property but our pattern matcher doesn't know this. Remember we defined the Person object in PC# in the previous post:

class Person(string Name, int Age)

Consider this a record type with two properties Name and Age and a constructor that sets both, just as we proposed in the following C# 3.0 translation:

class Person
{
   public Person(string name, int age)
   {
      Name = name;
      Age = age;
   }

   public string Name { get; private set; }
   public int Age { get; private set; }
}

Notice the immutability employed by the properties by having a private setter. However, the translation above alone isn't enough since we only have a local view on the situation through the ConstructorInfo object which doesn't know anything about properties. In order to correlate things back, we'll add some metadata on the constructor:

   public Person([Property("Name")]string name, [Property("Age")]int age)
   {
      Name = name;
      Age = age;
   }

Now, the ConstructorInfo object emitted in the expression tree will carry enough information to trace back the second parameter to the Age property and emit the following (idealized) check for the match:

Person p;
if ((p = <inputobject> as Person) != null && p.Age == 25)
   // execute body

To keep things simple, we'll only support one level of unification but an extension to this isn't too hard either (it just takes some implementation effort with some recursion). Such a scenario arises when you do a match like this:

(string name, string city) => new Product(name, 123.45, new Supplier(city, "BE"))

Freaks could take a detour from here through the world of constructor algebras but that would lead us off-track for the scope of this post.

There's one piece of low-hanging fruit left: object initializers which get mapped onto MemberInitExpression. For example, consider the following match clause:

(string name) => new Person { Name = name, Age = 25 }

This maps to an expression tree like this:

ParameterExpression name = Expression.Parameter(typeof(string), "name");
LambdaExpression<Func<string, Person>> clause = Expression.Lambda(
   Expression.MemberInit(
      Expression.New(typeof(Person)),
      Expression.Bind(typeof(Person).GetProperty("Name").GetSetMethod(), name),
      Expression.Bind(typeof(Person).GetProperty("Age").GetSetMethod(), Expression.Constant(25))
   )
);

Unfortunately, the property's setter is kept instead of the property itself, so you'd need to map it back to the property itself (a substring operation on the name to cut off the "set_" part will do it). A bigger issue is we'd have to loosen our definition of a record type since the setter obviously needs to be accessible in order for the object initializer to compile (in this case to an expression tree). We'll leave the processing of the low-hanging fruit to turn it into a pattern match tropical cocktail as an exercise to the reader.

 

Binding leftover parameters

We said before pattern matching embodies two tasks: the unification procedure to derive the match condition but also the extraction of remaining parameters that are to be fed in to the match body. In our running sample

.Match((string name) => new Person(name, 25), name => name + " is 25 years.")

let's assume we're processing <inputobject> = new Person("Bart", 25). We already saw this is turned into:

Person p;
if ((p = <inputobject> as Person) != null && p.Age == 25)
   // execute body

However, to execute the body, we need to extract the value for name first. Following the same metadata-path as described before, we trace the name parameter to the constructor back to the Name property, so the execution code becomes:

Person p;
if ((p = <inputobject> as Person) != null && p.Age == 25)
   <result> = <body>(p.Name);

This completes our match operation translation.

 

Coding time - A first attempt

Let's get into some coding and start with the skeleton. We're going to implement a fluent interface that continues to try finding a match and as soon it has found one, the Match and Else clauses should be a no-op. In other words, the lifecycle of a Pattern<T> object goes from valueless to valued with Else being the last change of making such a transition. Therefore we adapt the _hasValue pattern. In addition to this we keep to keep track of the result and the input, leading to the following:

class Pattern<T>
{
   private bool _hasValue;

   public Pattern(object o)
   {
      Object = o;
   }

   public object Object { get; private set; }
   public T Result { get; private set; }

   ...
}

Notice I'm keeping the result in a publicly available property, although it could be kept around in a private field. However, when we switch from eager to lazy evaluation, we'll see that the Else clause shouldn't return the result of the match. Instead we'll want to trigger the evaluation manually on a pre-constructed pattern matching object. There the public Result property will come handy as an escape valve for cases where we still want eager evaluation, triggered by calling Result. This allows us to share the same interface for both lazy and eager evaluation, but more on that later.

Next, we'll put our public contract for Match and Else in place:

   public Pattern<T> Match<TR>(Expression<Func<TR>> e, Func<T> f)
   {
      return MatchInternal(e, f);
   }

   public Pattern<T> Match<T1, TR>(Expression<Func<T1, TR>> e, Func<T1, T> f)
   {
      return MatchInternal(e, f);
   }

   public Pattern<T> Match<T1, T2, TR>(Expression<Func<T1, T2, TR>> e, Func<T1, T2, T> f)
   {
      return MatchInternal(e, f);
   }

   public Pattern<T> Match<T1, T2, T3, TR>(Expression<Func<T1, T2, T3, TR>> e, Func<T1, T2, T3, T> f)
   {
      return MatchInternal(e, f);
   }

   public T Else(Func<T> f)
   {
      ...
   }

By flowing all matches through a common MatchInternal method we can add overloads at will without any relevant effort (although you'll run out of Func objects, so if you want more than 5 parameters, you'll have to brew your own function delegates, which is also trivial).

Exercise: Before moving on (or scrolling down), guess the signature of MatchInternal that establishes its generic nature.

Time to enter the real magic of the pattern match's mechanics. Here's what MatchInternal looks like, nicely broken down into the distinct phases of an individual match: unification with parameter extraction and body evaluation in case the match succeeds:

private Pattern<T> MatchInternal(LambdaExpression e, Delegate f)
{
    //
    // Already matched.
    //
    if (_hasValue)
        return this;

    //
    // List of bindings.
    //
    Dictionary<ParameterExpression, PropertyInfo> bindings;

    //
    // Try to find a match.
    //
    if (!TryMatch(this.Object, e, out bindings))
        return this;

    //
    // We have a match. Evaluate the match body.
    //
    this.Result = Evaluate(this.Object, e, f, bindings);
    _hasValue = true;

    //
    // Allow further chaining (won't do anything because of _hasValue check).
    //
    return this;
}

The logic is straightforward: _hasValue acts as a barrier so that only one match clause can be executed. TryMatch does the unification work, producing a list of bindings in case the match was valid. In that case, evaluation of the body kicks in and the result value is set. In any case, we pass on ourselves to allow the fluent interface pattern. As in our type switch before, there are alternatives to implement such a pattern and making it flow with all sorts of objects (hint: extending System.Object) but let's no go there (although it's applicable under certain dynamic circumstances).

Our TryMatch match implements the unification algorithm. For sake of simplicity I've trimmed down the implementation to the bare minimum, only supporting non-nested NewExpressions. Extending it to support ConstantExpression and MemberInitExpression isn't too hard and more fancy matches (tip: lists, dictionaries, etc using collection initializers) are possible too. Stay tuned to see more patches in further posts, but for now, here's our minimalistic approach:

private static bool TryMatch(object o, LambdaExpression e, out Dictionary<ParameterExpression, PropertyInfo> bindings)
{
    bindings = new Dictionary<ParameterExpression, PropertyInfo>();

    //
    // Only support for NewExpression.
    //
    NewExpression ne = e.Body as NewExpression;
    if (ne == null)
        throw new NotSupportedException("Match clause can only contain NewExpression.");

    //
    // Created target type.
    //
    Type target = ne.Constructor.DeclaringType;

    //
    // Definite mismatch if input object isn't (a subclass of) the constructed type.
    //
    if (!target.IsAssignableFrom(o.GetType()))
        return false;

    //
    // Resolve parameters.
    //

    int i = 0;
    foreach (ParameterInfo param in ne.Constructor.GetParameters())
    {
        //
        // Mapping information from constructor parameters to properties required.
        //

        PropertyAttribute pa = param.GetCustomAttributes(typeof(PropertyAttribute), false).Cast<PropertyAttribute>().SingleOrDefault();
        if (pa == null)
            throw new InvalidOperationException("Input object doesn't have required mapping information.");

        //
        // Find the property.
        //

        PropertyInfo property = target.GetProperty(pa.Name);
        if (property == null)
            throw new InvalidOperationException(String.Format("Property {0} on type {1} not found.", pa.Name, target.Name));

        ConstantExpression ce;
        ParameterExpression pe;

        //
        // Expression should either be a constant or a parameter reference.
        //
        Expression arg = ne.Arguments[i++];
        if ((ce = arg as ConstantExpression) != null)
        {
            //
            // Check the object's property matches the constant.
            //
            object value = property.GetValue(o, null);
            if (!Object.Equals(value, ce.Value))
                return false;
        }
        else if ((pe = arg as ParameterExpression) != null)
        {
            //
            // Keep track of the loose parameter.
            //

            bindings[pe] = property;
        }
        else
            throw new NotSupportedException("Can only match constants.");
    }

    return true;
}

Let's analyze piecemeal. Apart from some boilerplate input checking because of our intended aforementioned limitation, the first real check is this one:

    //
    // Created target type.
    //
    Type target = ne.Constructor.DeclaringType;

    //
    // Definite mismatch if input object isn't (a subclass of) the constructed type.
    //
    if (!target.IsAssignableFrom(o.GetType()))
        return false;

This simply checks that the object passed in to the pattern match is assignable to the type used in the match clause. We allow subtypes to match a supertype, which makes sense if you have extended record types like Genious : Person where all of the base class's properties are still set by the subclass's constructor (and even that isn't really required).

Next we enter the parameter resolution loop which analyzes all of the constructor's parameters:

    //
    // Resolve parameters.
    //

    int i = 0;
    foreach (ParameterInfo param in ne.Constructor.GetParameters())
    {
        ...
    }

This is the real unification engine at work. First, using some LINQ operators, there's a lookup for the constructor parameter metadata. Without this we can't map back constructor parameters to the underlying properties to do either matching or value extraction. For our record types we expect this metadata to be present.

        //
        // Mapping information from constructor parameters to properties required.
        //

        PropertyAttribute pa = param.GetCustomAttributes(typeof(PropertyAttribute), false).Cast<PropertyAttribute>().SingleOrDefault();
        if (pa == null)
            throw new InvalidOperationException("Input object doesn't have required mapping information.");

Having the metadata is one thing, it should be valid too:

        //
        // Find the property.
        //

        PropertyInfo property = target.GetProperty(pa.Name);
        if (property == null)
            throw new InvalidOperationException(String.Format("Property {0} on type {1} not found.", pa.Name, target.Name));

Last but not least, the matching itself. Ideally, the world would be split into just two pieces over here: ParameterExpressions denote extractions, while all other types of expressions would denote match conditions (possibly recursively). To keep things simple, we only unify with constants in a non-recursive way, hence this code:

        ConstantExpression ce;
        ParameterExpression pe;

        //
        // Expression should either be a constant or a parameter reference.
        //
        Expression arg = ne.Arguments[i++];
        if ((ce = arg as ConstantExpression) != null)
        {
            //
            // Check the object's property matches the constant.
            //
            object value = property.GetValue(o, null);
            if (!Object.Equals(value, ce.Value))
                return false;
        }
        else if ((pe = arg as ParameterExpression) != null)
        {
            //
            // Keep track of the loose parameter.
            //

            bindings[pe] = property;
        }
        else
            throw new NotSupportedException("Can only match constants.");

Starting with the ParameterExpression, such a match simply results in adding a binding that keeps the metadata for the property set by that constructor parameter. We'll use this in the Evaluate method to extract data before feeding it into the match body. The remainder case is the ConstantExpression (due to intentional limitations outlined before), so we simply extract the value from it and check for equality with the value of the underlying mapped property. The way of doing the equality check could be part of a separate discussion but Object.Equals is solid enough for our goals.

Exercise: Could you think of a way to allow wildcard matches (as _ in F#)? Would MemberInitExpression help to solve the problem? What if you have to provide the capability with NewExpressions having no constructor overloads (tip: Unmatched<T> type)? Those of you familiar with old dragons like SASL might thunk of something else too.

With that, our unification process has completed. Let's move on to the evaluation phase.

private static T Evaluate(object o, LambdaExpression e, Delegate f, Dictionary<ParameterExpression, PropertyInfo> bindings)
{
    ParameterInfo[] targetParameters = f.Method.GetParameters();
    object[] args = new object[e.Parameters.Count];
    int j = 0;
    foreach (ParameterExpression param in e.Parameters)
    {
        //
        // Parameters need to be bound in the match expression.
        //
        PropertyInfo property;
        if (!bindings.TryGetValue(param, out property))
            throw new InvalidOperationException("Parameter " + param.Name + " was not bound in the pattern match.");

        //
        // Check type of value retrieved from target property against the expected type from the parameter.
        //
        object value = property.GetValue(o, null);
        if (!value.GetType().IsAssignableFrom(param.Type))
            throw new InvalidOperationException(String.Format("Property {0} on type {1} cannot be bound to parameter {3}.", property.Name, property.DeclaringType.Name, param.Name));

        args[j++] = value;
    }

    //
    // Invoke match body lambda and assign result.
    //

    T result = (T)f.DynamicInvoke(args);
    return result;
}

First of all, notice we've lost static type information on the parameters of the match body due to our generic approach of having one Evaluate-fits-all. If the user of the object doesn't trick the compiler by introducing casts on the action body, we should be safe. Since we're all paranoid framework-writers, I've added some additional checking for consistency by doing a simple type compatibility check comparing the property's value's type against the target parameter in the action body delegate.

The essence of this code is to iterate over the parameters in the match clause, retrieve the binding information gathered during the unification phase, extract the property values and feed them into the match body delegate. For now, we please ourselves with the crude approach of putting everything in an array and calling DynamicInvoke on the delegate (which is an expensive operation). Ultimately we get the result which should be compatible with the result type T (unless the user tricked the compiler by introducing a cast on the match body delegate - it's straightforward to implement another type consistency check here).

This leaves us with one remaining method: Else. Else doesn't take parameters (tip: you can still access the original object in the Else match body thanks to closures) and simply evaluates if no result has been found yet:

public T Else(Func<T> f)
{
    if (!_hasValue)
    {
        _hasValue = true;
        Result = f();
    }

    return Result;
}

That's it! I said it was trivial, didn't I?

 

Putting it to the test

Let's start with a simple collection of objects:

var people = new List()
{
   new Person("Bart", 25),
   new Person("John", 52),
   new Person("Lisa", 25),
   new Person("Rosa", 63),
   "Hello"
};

Since patterns are valued expressions, we can use them e.g. in a LINQ projection clause:

var res = people.Select(p => new Pattern<string>(p)
    .Match((string name) => new Person(name, 25), name =>
    {
        return name + " is 25.";
    })
    .Match((int age) => new Person("John", age), age =>
    {
        return "John is " + age + ".";
    })
    .Match((string name, int age) => new Person(name, age), (name, age) =>
    {
        return "I'm matching them all!";
    })
    .Else(() =>
    {
        return p.ToString();
    })
);

foreach (var s in res)
    Console.WriteLine(s);

And here's the result:

image

Agreed, it's a little verbose but hey, it's only (oh so important) syntax. What's more interesting for this post is to show off the capabilities of expression trees beyond the LINQ case. Next time, we'll switch gears and move on to lazy evaluation. We'll take a look at performance to motivate our lazy evaluation adventure (you can already guess how the above performs).

 

Happy matching!

Del.icio.us | Digg It | Technorati | Blinklist | Furl | reddit | DotNetKicks

Last week I blogged about a functional-inspired (type)switch expression in C# and in a reaction on some comments I mentioned I'm playing around with something bigger. Let me set the record straight, this post isn't that bigger thing, it's just a related piece of the puzzle. But what's up this time? Pattern matching. Not as in regular expressions or so, but more of a language construct as one can find in the ML camp including Haskell and F# (amongst others).

In fact, this blog series is not just about bringing this concept to the world of C# - and allow me to point out I'm not going to take this all the way, rather I want to discuss some implementation patterns - it's also about the bigger picture, outlining what one can do with expression trees and the such.

 

Introduction

What's a pattern match in this context? Much like a complicated switch. But where a switch simply checks for equality (either of the value or the type, as outlined in the aforementioned post), a pattern match goes much further. Here's a simple example from F#:

let rec fib x =
   match x with
   | 1 -> 1
   | 2 -> 1
   | x -> fib(x - 1) + fib(x - 2)

First of all, the whole thing is an expression: it produces a value. This is a core difference with a-functional statements (an overstatement) such as if and switch, something our functional switch solves. But there's more. In fact you couldn't really write the code above in terms of a typical switch (without cheating with the default clause) since the last matching clause doesn't compare against a constant but rather a symbolic variable. in hear you scream "This isn't too fancy, is it?" and you're absolutely right (for now).

But first an exercise: something is missing in the fragment above (tip: rabbits freeze below zero).

Nothing too fancy so far indeed, but what about this?

let orElse left right =
   match left, right with
   | true, _ -> true
   | _, true -> true
   | _ -> false

We're still matching constants, but have punched a few holes in the patterns. For example, the first pattern says: if left == true (and I don't care about right), the result is true. The last clause is rather the fall-through case that matches everything. So, an important take-away is the ordered nature of a match (as opposed to a regular switch where one should be able to reorder the case blocks at will).

Still not too mind-blowing I imagine, so let's add another piece to the picture:

let rec calc x =
   match x with
   | Constant a -> a
   | Add a b -> (calc a) + (calc b)
   | Sub a b -> (calc a) - (calc b)
   | Mul a b -> (calc a) * (calc b)
   | Div a b -> (calc a) / (calc b)
   | _ -> failwith "Unknown node"

Here we're matching x with certain shapes of objects, while extracting data from those objects before entering the clause body. Such matches can be pretty complex and the language emits code that knows how to do the matching and extract the right data to bind it to the free variables in the clause (e.g. a and b).

There's even more, like matching lists using the cons operator ::, matching union and record types as well as regular .NET type checks but let's stay with this for now. As mentioned before, the goal of this series is not to provide a full-blown pattern match in C#, but rather an exploration of the journey that takes us there.

 

Goals

Let's take a pick and decide what we want to provide. The last sample above matching objects with a certain "shape" is pretty attractive. Given an object that has been constructed to embed certain data, we want to match and extract that data given a set of patterns. For example, consider a Person object:

class Person
{
   public Person(string name, int age)
   {
      Name = name;
      Age = age;
   }

   public string Name { get; private set; }
   public int Age { get; private set; }
}

which I'll abbreviate in PC# (pseudo C#) as:

class Person(string Name, int Age)

We can think of this as a record type that simply holds a few values and plays the rules of immutability. Feel free to think about the realization of such types, possibly based on things as simple as tuple types but the definition above should be good enough for our purposes. We'll refine the realization as we go along in order to make the implementation of our pattern match possible.

Now that we have such a type (and others), we can start to do some matching. Ideally we'd have some integrated syntax that looks like this in PC#:

string message = pattern (x) {
   match Person(name, 25) => name + " is 25 years."
   match Person("John", age) => "John is " + age + " years."
   else => "Unmatched object"
};

or even, if we know we're about to match only persons:

string message = pattern (Person x) {
   match (name, 25) => name + " is 25 years."
   match ("John", age) => "John is " + age + " years."
   else => "Unmatched person"
};

Of course there are many assumptions on types and syntax to make such a thing possible, and one can think of many alternatives that feel more C#-ish:

string message = switch (x)
{
   case Person(name, 25):
      return name + " is 25 years.";
  
case Person("John", age):
      return "John is " + age + " years.";
   default:
      return "Unmatched person";
};

which comes close to our type switch. In fact, the type switch presented before was a rude simplification of the one I have that's layered on top of the pattern match we'll be talking about (which, by itself is simplified for the purpose of this blog series).

 

Realization

Time to talk a little about how we're going to realize this, at least from the API level. We'll have to stick with an API for sure since we don't have a language to build on top of right now. Mirrored after the type switch, it will look like this:

string message = new Pattern<string>(x)
   .Match((string name) => new Person(name, 25), name => name + " is 25 years.")
   .Match((int age) => new Person("John", age), age => "John is " + age + " years.")
   .Else(() => "Unmatched object");

Notice you shouldn't necessarily match against Person objects only. Variable x can be anything, so you can match anything, as shown in the next hypothetical sample:

public static XElement ToXml(this Control c)
{
   return new Pattern<string>(c)
      .Match((string text) => new Button(text),
         text => new XElement("Button", new XAttribute("Text", text))
      )
      .Match((int value) => new Slider(value),
         value => new XElement("Slider", new XAttribute("Value", value))
      )
      .Else(
         () => new XElement("Control", c.ToString()) // Closures at work!
      );
}

There's room for lots of refinements (such a a Pattern<Person, string> that would only accept matches against Person objects) but here's how you read it:

Match x with the following, producing a string:
- if x is a person with age 25, the result is name + " is 25 years."
- if x is a person with name John, the result is "John is " + age + " years."
- else, the result is "Unmatched object"

Notice the way the matching works: given a constructor (we aren't really constructing Person objects as we'll see the next time) certain parameter values are bound, resulting in the condition (like age == 25) which is made implicit. Other parameters are bound to a symbolic name which - only on a match - is extracted from the underlying object and bound to the corresponding parameter in the match body. In fact, the sample above simply reuses the same parameter name but (with alpha-conversion) that shouldn't be the case:

string message = new Pattern<string>(x)
   .Match((string bar) => new Person(bar, 25), foo => foo + " is 25 years.")
   .Match((int bar) => new Person("John", bar), foo => "John is " + foo + " years.")
   .Else(() => "Unmatched object");

This reflects a restriction from the implementation (ideally we'd reuse the parameter for both the match clause and body) that has a very specific reason.

Exercise: Can you infer (or guess?) the public contract of our Pattern`1 class from the above? The answer will follow in the next post.

 

Enjoy!

Del.icio.us | Digg It | Technorati | Blinklist | Furl | reddit | DotNetKicks

"Never reveal too much" - it's a motto that's applicable in many areas, including security.

Why does a failed attempt to log in into Windows result in "Invalid user name or password". Couldn't the system tell you just which field you screwed up? Yes, it could. But no, it shouldn't. Imagine an attacker trying to get into the system. First guess the logins (looping though a hypothetical series of "Invalid user name" messages) and once you got it right (e.g. Administrator, admin, TheAdmin, <name of someone who is known to be an admin>) move on to the passwords. Or better, use a tool that does the work for you.

There are lots of other examples, like disabling custom errors in web applications, resulting in revealing exception info that could contain e.g. the name (or more) of a database server you attempted to connect to. Anyhow, revealing too much is definitely harmful.

<Facts>

This evening I came back home after a busy day at work, walked through the front door of the apartment complex where I have to use my "radio key" to open the door. Visitors see this lovely touch screen device used to ring someone in the complex to have the door opened. It's just an embedded computer with some custom app that sometimes experiences troubles (luckily the door opening mechanism using the "radio key" isn't linked to it or I wouldn't be writing this post right now :-)). This is what I saw (click to enlarge):

image

I blurred a few pieces in red revealing company and product names but a few things are of interest:

  1. The "capture device" aka "camera" wasn't initialized - a nice hint for an attacker to start playing (assuming there are no other cameras in the proximity of course).
  2. Some detailed version information - always handy to search the web for known vulnerabilities in that particular piece of software (notice the version numbers seem to reflect dates, some almost 4 years ago == no patches in 4 years?).
  3. Information about the runtime, what gets loaded, port numbers and references to config files. We know at least what we can play with now.

But the possibly even more interesting one has been blurred in green: login and password in plain-text (you gotta believe me). The password isn't too long either (doesn't really matter, you already have it) and pretty predictable. Given the software and device are used in many apartment buildings and given the simplicity of both login and password it's not too hard to imagine it's the same everywhere. It also tells you where it lives.

</Facts>

<Questions>

  1. Why do you need to log everything to the console? Just write to a log file and tell something went wrong (but avoid pointing to the log file). Typically logs should go to a pseudo-well-known location (please don't make it c:\some.log - automated tools would love it). But of course protecting access to the logs is as important as this.
  2. What's the need to log the password to *anything* if it's already stored somewhere? It's clear that the "Resolved property" message is generic, revealing the password was just stored next to a bunch of other settings. Whether or not this particular password keeps something really secret is another question obviously.
  3. Hash (and salt) passwords, please! Not only to protect against usefulness after theft, but also to rule out the possibility to ever print it. I'm shocked every time a web site offers the possibility to "send my password" indicating they haven't (at least) hashed it before storing it. This means I've lost a password forever (deleting an account won't help - backups should have spread the secret). Notice the opposite (a site offering "reset password" functionality) doesn't necessarily mean hashing is applied but there's at least the conditional probability of it being so.

</Questions>

<Fiction>

<Disclaimer>

Everything in this section is purely based on imagination, not experience. The author is not responsible for any application whatsoever of possible acts outlined below.

</Disclaimer>

A particularly nice property of touch screens is you can touch them :-). The application isn't touch-screen aware (meaning it would only enable the touch screen when initialization succeeded, and disable it when the application quits), meaning you can click anywhere, including:

image

Maybe there's more to find behind those other windows? Or we could browse Windows Explorer to find some other well-kept (ahum) secrets and with a little luck the thing is connected to the internet allowing us to submit data or install some piece of software to possibly allow us "to work from home". You might wonder what's there to steal? Definitely a list of resident names, possibly some telephone numbers, radio key data or more. There will likely be code that knows how to open the door since that's the intent of the device and maybe a little test-script to check the "open door" functionality is working properly (would be handy - purely for testing of course). Just make sure not to test the "Alarm Agent" :-).

You might wonder how to proceed with only a mouse cursor? On-screen keyboards maybe, depending on the edition of the OS installed? Or there might some keyboard in the unit if you want to turn it into a physical attack (or you can bring you own)... You could do that without the crash of course, but one might feel more comfortable knowing the camera is out of order. And you can always tell incoming residents you're there to fix the unit, just make sure to dress like a worker and bring a radio to set the scene. Or if it's a fancy machine hiding therein, BlueTooth might work too, hoping drivers are available and device detection is enabled.

And if you give up a simple DoS (for what it's worth - maybe the machine does more than operating the door) is at hand through Start, Shut down or "non-software based attacks".

</Fiction>

Oh, and by the way the daylight saving time wasn't applied yet:

image

To be clear, I just took the picture, didn't touch the screen, made the observations outlined in this post and fired up Windows Live Writer.

Have a good (and safe) night!

Del.icio.us | Digg It | Technorati | Blinklist | Furl | reddit | DotNetKicks

More Posts « Previous page