Sunday, October 26, 2008 8:01 PM bart

About cruel lambdas, closures, TypedReferences, CS0610 and other things you shouldn’t do

A few days ago I had a derailed conversation on C# language features once more. It turned out that closures are not well understood in general, so I wanted to point out a few things in an attempt to clarify the concept and how it’s implemented in the language. By the end of this post you’ll understand what the following dialog is really telling you and why there’s no way around it without what I’d call leaking the closure from the language implementation space into the developer’s code space:

image

 

Cruel lambdas

But first a tale on cruel lambdas. This week I saw the following piece of code in a training manual I got to read somehow (literal color-inclusive bold-exclusive copy):

[TestMethod()]
public void ReadFromSocketTest()
{
    string str = null;

    MockReceiver mockReceiver = new MockReceiver();
    mockReceiver.UpdateImpl = delegate(string text)
    {
        str = text;
    };
   
/* or in C# 3.0 lambda syntax
    mockReceiver.UpdateImpl = text => str = text;
    */

    // send to receiver
    ClassUnderTest cut = new ClassUnderTest();
    cut.Send("hi there", mockReceiver);

    Assert.AreEqual<string>("hi there", str);
}

By the coloring of the added comment I can tell how the code did not end up in the Word document: copy-paste from VS (the bold line would be green otherwise). In other words, this is a case of lost in (reverse) translation, where even a lambda-geek like me needs to stand still for a few seconds and wonder: does this work? But sure enough, it does. Simple assignments (ECMA-334 §14.14.1) are expressions; writing “lhs = rhs” simply takes on the value of what got assigned to lhs:

14.14.1 Simple assignment

The = operator is called the simple assignment operator. In a simple assignment, the right operand shall be
an expression of a type that is implicitly convertible to the type of the left operand. The operation assigns the
value of the right operand to the variable, property, or indexer element given by the left operand.

The result of a simple assignment expression is the value assigned to the left operand. The result has the
same type as the left operand, and is always classified as a value.

This is an interesting thing by itself. Consider the following fragment:

{
   int a, b, c;
   a = b = c = 0;
}

What do you think the compiler will emit as warnings? Here’s the answer: “warning CS0219: The variable 'c' is assigned but its value is never used”. Remove c altogether and the warning reads the same but with c substituted by b, and so on. I’ll leave it to the reader to figure out why this happens as a thought experiment (but don’t lose any sleep over it). Also, if there were a property (or indexer) assignment in the set of assignments above, only the setter would get called, never the getter:

a.Bar = a.Foo = 0;

wouldn’t call a.get_Foo() in order to feed it in to a.set_Bar(int). Instead, the value that got assigned to Foo (i.e. 0) will be fed in to the Bar setter.
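To see this for yourself, here is a minimal sketch that logs every accessor call. The Sample class, its Log list and the ChainedAssignmentDemo wrapper are made up for illustration; only the chained assignment itself comes from the text above.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical class: both properties record each accessor call in Log.
class Sample
{
    public readonly List<string> Log = new List<string>();
    private int foo, bar;

    public int Foo
    {
        get { Log.Add("get_Foo"); return foo; }
        set { Log.Add("set_Foo(" + value + ")"); foo = value; }
    }

    public int Bar
    {
        get { Log.Add("get_Bar"); return bar; }
        set { Log.Add("set_Bar(" + value + ")"); bar = value; }
    }
}

static class ChainedAssignmentDemo
{
    // Performs the chained assignment and returns the accessor calls in order.
    public static string Run()
    {
        Sample a = new Sample();
        a.Bar = a.Foo = 7;
        return string.Join(", ", a.Log.ToArray());
    }
}
```

Calling ChainedAssignmentDemo.Run() returns "set_Foo(7), set_Bar(7)": two setter calls and no getter call, since the value of the inner assignment expression is what gets passed on to the Bar setter.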

But there are more subtle things going on in the innocent-looking comment above. The type of the UpdateImpl property actually is an Action<string>, so it’s void-returning. I’m using the word returning here as lambdas read as if they are to return something by their very functional nature. So the statement made by the Word-document altering person about lambda syntax equivalence is off a little. Why? Consider the following code:

{
   string s = null;

   Action<string> a1 = t => s = t;
   Action<string> a2 = t => { s = t; };
}

There’s a difference here. In the first case we’re discarding the result of the lambda body, while in the second one the lambda has a statement body without a return, so it’s inferred to be void. The first form is the one referred to in the C# comment, while the second one is the direct counterpart of the anonymous method in the original code, just written in lambda syntax.
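A small sketch that makes the difference tangible: the expression-bodied lambda converts to both a void-returning and a string-returning delegate type, whereas the statement-bodied one only fits the void-returning one. The LambdaBodyDemo wrapper and its Run method are names invented for this illustration.

```csharp
using System;

static class LambdaBodyDemo
{
    public static string Run()
    {
        string s = null;

        Action<string> a1 = t => s = t;        // expression body, value of the assignment is discarded
        Action<string> a2 = t => { s = t; };   // statement body, nothing to return
        Func<string, string> f = t => s = t;   // same expression body, value is now returned

        // The statement-bodied form does not convert to a string-returning delegate,
        // as it lacks a return statement:
        // Func<string, string> g = t => { s = t; };

        a1("one");
        a2("two");
        string result = f("three");

        // Both the captured variable and the returned value are "three" now.
        return s + "/" + result;
    }
}
```

LambdaBodyDemo.Run() returns "three/three": the last call wins for the captured variable s, and f hands back the very value it assigned.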

All of this still doesn’t cause me to call this lambda “cruel”. There’s nothing wrong with leveraging the expressive power of lambdas to simplify existing anonymous method based code. However, where the cruelty comes in is in the side-effect encoded in the lambda body. Let’s rewrite the lambda a bit and assume we’re using a string-returning function instead (ruling out the implicit discard of the expression value) and introduce another pair of parentheses to make things more readable:

string s = null;
Func<string, string> f = (t => s = t);

Here we’re capturing the outer scope variable s in a closure, so invoking f somewhere will change the s in the outer scope:

private class Closure
{
   public string s;

   public string f(string t)
   {
      return (s = t);
   }
}

Closure c = new Closure();
c.s = null;
Func<string, string> f = c.f;

where all references to s in the original code have been replaced by references to the public field on the closure class instance. So all we’ve created here is another way to perform assignment to a local variable through some function:

f("Hello")

assigns “Hello” to the (captured) local variable s and returns the value that got assigned. It’s almost as if f(…) is syntactic sugar for s = …. Notice I’m also avoiding some more complexities by using a reference type instead of a value type; I’ll give you some time to think about this. Consider the (syntactically) reverse lambda:

string s = null;
Func<string, string> f = (t => t = s);

This is a subtle one. Do you think we now have f(…) as a shorthand for … = s? Why (not)?

 

Quiz

If you think you understand all subtleties aforementioned, try to predict the output for the following:

string s1 = "John"; string t1 = "Bart";
Func<string, string> assignRefO = t => s1 = t;
Console.WriteLine("s1 = \"{0}\"; (t => s1 = t)(\"{1}\") = \"{2}\"; s1 = \"{3}\"", s1, t1, assignRefO(t1), s1);

int i1 = 0; int j2 = 1;
Func<int, int> assignValO = j => i1 = j;
Console.WriteLine("i1 = {0}; (j => i1 = j)({1}) = {2}; i1 = {3}", i1, j2, assignValO(j2), i1);

string s2 = "John"; string u = "Lisa";
Func<string, string> assignRefI = t => t = u;
Console.WriteLine("s2 = \"{0}\"; u = \"{1}\"; (t => t = u)(s2) = \"{2}\"; s2 = \"{3}\"", s2, u, assignRefI(s2), s2);

int i2 = 0; int k = 2;
Func<int, int> assignValI = j => j = k;
Console.WriteLine("i2 = {0}; k = {1}; (j => j = k)(i2) = {2}; i2 = {3}", i2, k, assignValI(i2), i2);

 

The key take-away from our short adventure through cruel lambdas: side-effects through closures can be rather subtle to spot (as the capturing of local variables into a closure goes unnoticed, which deserves whole posts by itself) but have a huge impact. And that brings us back to our first screenshot, brought up by the “Extract Method” refactoring feature in Visual Studio.

 

Closures and refactoring

Back to our first screenshot, but slightly annotated:

image

I’ve used two colors here. Actually all errors are bad, but compile errors in this case are far better than semantic changes that go unnoticed. How did I get to the dialog? It’s really simple: take any piece of code that has a closure in it, either an anonymous method or a lambda expression that captures a local variable, and choose Extract Method:

image

Let’s take a look at both possible error cases the dialog outlines, starting with the worst one: a semantic change. Assume the following piece of code:

static void Main()
{
    int x = 0;
    Func<int> f = () => x;
    x = 5;
    Console.WriteLine(f());
}

gets refactored into:

static void Main()
{
    int x = 0;
    Func<int> f = GetFunction(x);
    x = 5;
    Console.WriteLine(f());
}

private static Func<int> GetFunction(int x)
{
    Func<int> f = () => x;
    return f;
}

If you’ve understood the way closures work, it should be a piece of cake to predict what the first fragment prints. Right: 5. This happens because the outer local variable ‘x’ gets captured in a closure class, together with the defined lambda expression. When updating x on the third line, we’re really updating the public field on the closure instance, so the call through the delegate f will produce 5 as its result. However, the second fragment will print 0. Why is that? Due to the refactoring, the original value of x, i.e. 0, got copied by-value to a local variable ‘x’ in the GetFunction method, where it got captured by a closure. There’s no way the Main method can ever update a local of a called function, so the copy of x is trapped in the closure forever and the returned function will always return 0. So the assignment on line three doesn’t have any effect whatsoever on the lambda.

What really would need to happen to preserve semantics in this case is to bubble up the closure to the original method, in order to capture the local variable ‘x’ in the context of the Main method. This is what I referred to in the introductory paragraph as “leaking the closure to the developer space”, i.e. taking it into your own hands:

static void Main()
{
    Closure c = new Closure();
    c.x = 0;
    Func<int> f = GetFunction(c);
    c.x = 5;
    Console.WriteLine(f());
}

private static Func<int> GetFunction(Closure c)
{
    Func<int> f = () => c.x;
    return f;
}

One way the refactoring could work is by performing all this machinery on behalf of the developer, but that would give the refactoring non-local effects (i.e. rewriting code other than the selected piece). The original piece of code would get expanded all the way down to the closures (the types of which become visible in the code) and then the real refactoring could work without causing semantic problems. But that’d also be the end of transparent closures…
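The three variants discussed in this section can be put side by side in one compilable sketch. The Closure class below is hand-written for illustration (the compiler-generated one carries an unspeakable name), and the CaptureDemo wrapper with its method names is an assumption of this sketch, not something the refactoring tool produces.

```csharp
using System;

// Hand-written stand-in for the compiler-generated closure class.
class Closure
{
    public int x;
}

static class CaptureDemo
{
    // Original code: the lambda captures the local x itself.
    public static int Original()
    {
        int x = 0;
        Func<int> f = () => x;
        x = 5;
        return f();
    }

    // After Extract Method: x is copied by value into GetFunction,
    // so the later assignment in this method is invisible to the lambda.
    public static int Refactored()
    {
        int x = 0;
        Func<int> f = GetFunction(x);
        x = 5;
        return f();
    }

    private static Func<int> GetFunction(int x)
    {
        return () => x;
    }

    // Closure "leaked" into developer code: both methods share one instance.
    public static int ManualClosure()
    {
        Closure c = new Closure();
        c.x = 0;
        Func<int> f = GetFunction(c);
        c.x = 5;
        return f();
    }

    private static Func<int> GetFunction(Closure c)
    {
        return () => c.x;
    }
}
```

Original() and ManualClosure() both return 5, while Refactored() returns 0, which is exactly the silent semantic change the dialog warns about.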

But let’s move on to the second case where a compile error results after the refactoring. To illustrate this, consider our cruel lambda again:

static void Main()
{
    int x = 0;
    Func<int, int> f = y => x = y;
    f(5);
    Console.WriteLine(x);
}

Now trying to refactor the second line produces the following dialog. Notice the way the variable x is passed to the method being generated:

image

And here’s the result:

image

What has happened now? The refactor engine has seen that one of the variables in the selected piece of code is being assigned to, so it needs to bubble up that change after the method has returned, hence it feeds in that variable using by-reference parameter passing style. However, in this case this won’t compile:

image

Of course you’re curious to know why this is the case. Well, let’s assume we’d be able to by-ref a parameter in to an anonymous method or lambda expression (query expressions follow as those are simply glue for lambda expressions passed in to certain methods). Consider the following piece of code:

static void M1()
{
    Func<int, int> f = M2();
    f(5);
}

static Func<int, int> M2()
{
    int x = 0;
    return GetFunction(ref x);
}

static Func<int, int> GetFunction(ref int x)
{
    return y => x = y;
}

To see what’s going on, we’ll try to form a picture of the stack behavior when executing this code. In doing so, we’ll simplify quite a bit, ignoring calling convention, stack frame and other details that occur in practice. However, all we need to care about in this case is the behavior of the stack with regards to local variables. Starting with the execution of M1, we notice there’s one local variable containing ‘f’, the delegate retrieved by calling M2. Let’s call this local variable FUNC, a 32-bit pointer to the function in memory. Initially it’s unassigned as we’ve not called M2 yet:

(M1)  FUNC*  -->  ????

Now we’re calling M2, where two variables will live on the stack: one containing the integer value x, called intx, and one containing the return value of the call to GetFunction, again of type Func<int, int>:

      intx
(M2)  FUNC*  -->  ????
(M1)  FUNC*  -->  ????

Time for the real work, the call to GetFunction (abbreviated GF). Here two items appear on the stack: the closure created by the hypothetical lambda expression capturing the outer variable x, and the result of the call, again our delegate type. I’m simplifying here, but the relevant thing is that the (hypothetical) reference parameter for ‘x’ is stored inside the closure as a public field (the whitespace that occurs in the stack is purely for ASCII-art purposes):

      CLOS*  -->  {intx*, func}
                     |      ^
                     |      |
(GF)  FUNC*  --------~------+
      intx   <-------+
(M2)  FUNC*  -->  ????
(M1)  FUNC*  -->  ????

Notice the closure is heap-allocated, which is crucial in the implementation of closures; after all, we want them to outlive the scope of the method they’re defined in. Now what happens when GF returns?

                  {intx*, func}
                     |      ^
                     |      |
      intx   <-------+      |
(M2)  FUNC*  ---------------+
(M1)  FUNC*  -->  ????

The local variable in M2 that holds the return value of GetFunction, the delegate, points at the function in the closure, so the closure is kept alive. At the same time, the hypothetical reference parameter x captured by the closure is pointing at our local variable x. This is where the real problem kicks in. Assume M2 now returns:

                  {intx*, func}
                     |      ^
                     |      |
      intx   <-------+      |
                            |
(M1)  FUNC*  ---------------+

Now we have a dangling pointer to a place on the stack whose stack frame’s lifetime has ended. At this point, all bets are off and anything imaginable can happen; for example, a subsequent call to another function can overwrite the value of int x with something else that isn’t necessarily an integer. So all that remains is rotten type safety. No wonder reference and output parameters are a big no-no in lambda expressions.

If you really want to shoot yourself in the foot, it’s possible to do so of course. Here’s a sample using unsafe code:

static void M1()
{
    Func<int, int> f = M2();
    M3(f);
}

static unsafe Func<int, int> M2()
{
    int x = 0;
    return GetFunction(&x);
}

static unsafe Func<int, int> GetFunction(int* x)
{
    return y => *x = y;
}

static void M3(Func<int, int> f)
{
    long x = 123;
    f(5);
    Console.WriteLine(x);
}

Bonus points if you can predict what a call to M1 will print to the screen. Quiz: Can you use this piece of hacking to construct a method called “GetEndianness” to determine whether your computer has a little or big endian architecture? What’s a better (i.e. without unsafe code) alternative to determine this in managed code?

 

TypedReferences and not so secret keywords

Talking about all of this by-reference passing stuff, this is the ideal opportunity to dive into parameter passing on the CLI in more detail. Most, if not all, readers of my blog will be familiar with two of those strategies:

  • by value – the value of an object is passed from caller to callee; for example, an integer is pushed on the stack or the address of a reference type instance is pushed on the stack.
  • by reference – here the address of the data is passed from the caller to the callee; this is what we’ve used in the previous paragraph.

However, there is a third, far less known, parameter passing style supported on the CLI: typed references. It’s very similar to by-ref, but besides the address of the data, a runtime representation of the data’s type is passed from the caller to the callee as well. But what’s the use for this? Assume the following scenario: you want to create a function that accepts any type of data in a by-ref fashion, because you want to change it inside the method. In some sense, the function you’re about to write needs to be polymorphic in that specific parameter. One way to accomplish this is to pass the data by-ref as a parameter of type object. What’s wrong with this? For value types, this will cause boxing, thus heap allocation. A by-ref parameter of the specific type would eliminate this problem at the cost of sacrificing the intended flexible nature of the parameter, as we’d need to be specific about the type. Typed references allow us to work around this problem.
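To make the boxing cost of the object-based approach concrete, here is a minimal sketch. The Negate method and the RefObjectDemo wrapper are hypothetical names chosen for this illustration.

```csharp
using System;

static class RefObjectDemo
{
    // Hypothetical "polymorphic" by-ref method: it works on whatever
    // happens to sit in the object slot it is handed.
    static void Negate(ref object value)
    {
        if (value is int)
            value = -(int)value;       // unbox, negate, box again
        else if (value is double)
            value = -(double)value;
    }

    public static string Run()
    {
        int i = 123;

        // An int variable cannot be passed as "ref object" directly,
        // so the value has to be boxed into an object slot first.
        object boxed = i;
        Negate(ref boxed);

        int before = i;                // i itself is untouched: still 123
        i = (int)boxed;                // the change must be copied back by hand

        return before + " -> " + i;
    }
}
```

RefObjectDemo.Run() returns "123 -> -123". The method never sees the home of i, only a heap-allocated copy, which is the limitation a typed reference avoids.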

A central concept in both by-ref (and hence output parameters in C#, which are implemented as by-refs with some additional metadata) and typed references is that of a home. A data value’s home is a location where it can be stored for possible reuse, e.g. a local variable, a method’s argument, an array element or a field. In the case of typed references, both the address of the home and information about its type are passed in the typed reference.

It turns out that C# actually supports typed references to a limited extent, in an official way by means of direct usage of System.TypedReference and compile-time errors for uses that are invalid, but also using undocumented keywords. I’m purely showing this for illustrative purposes, as no-one should rely on this undocumented feature; it turns out there are only a handful of places in the BCL code where this is being used for very specialized tasks. For the interested, I’ve blogged about another of these – __arglist – in my post entitled: “Calling printf from C# - The tale of the hidden __arglist keyword”. Here’s a sample of how to get a typed reference and use it:

static void Main()
{
    int i = 123;
    Do(__makeref(i));
    Console.WriteLine(i);
}

static void Do(TypedReference tr)
{
    Type t = __reftype(tr);
    int i = __refvalue(tr, int);
    __refvalue(tr, int) = -i;
}

Think of those keywords as follows: __makeref plays the role of & (address-of) but does a little more to get the type information, __reftype gets the type that was captured by the typed reference, and __refvalue behaves like a * (dereference) and can be used both as an lhs and as an rhs.

As you can see, it’s perfectly possible to pass the typed reference as a parameter in a method call. However, trying to use it in ways that are inherently unsafe is prohibited by the compiler. For example, trying to return the typed reference results in the following error:

error CS1599: Method or delegate cannot return type 'System.TypedReference'

Trying to use it in an output parameter or reference parameter (which are intrinsically the same) is prohibited too, for similar reasons as the ones outlined in the previous paragraph:

error CS1601: Method or delegate parameter cannot be of type 'out System.TypedReference'

And finally, trying to use the typed reference as a field in a class won’t work either, as you can’t stick a typed reference on the heap while having it point to a stack-allocated object:

error CS0610: Field or property cannot be of type 'System.TypedReference'

What about trying to define a cruel lambda with a captured typed reference in its closure? Here you can outsmart the compiler:

static Func<int, int> Bar()
{
    int i = 0;
    TypedReference tr = __makeref(i);
    return j => __refvalue(tr, int) = j;
}

This produces the following closure class:

image

Notice how the typed reference forced its way in the closure class. However, the runtime knows this is a violation, so trying to execute the following:

int i = Bar()(1);

results (thankfully) in the following nice exception:

image

 

Conclusion

Know your closures! While anonymous methods, lambdas and everything that uses them, like LINQ, are great pieces of technology, you should know a bit about how they’re implemented and why refactoring pieces of code that contain them is a potentially dangerous operation. There are a few things to do here:

  • Avoid mutating state in a lambda expression. Doing so will cause the refactoring to pass the mutated variable by-ref, causing a compile error. As an example, none of the LINQ operators requires you to do so (although you can). Here’s a bad sample on how to use LINQ:

    int i = 0;
    var res = from x in source select new { Index = i++, x.Name };
    // bad things can happen to i here
    foreach (var item in res)
       …


    I’ll leave it to the reader to figure out the better way to make this numbering work; there are a few ways, but suffice to say: Select might have useful brothers.
  • Try to avoid changing a captured local variable after it has been captured. Doing so puts refactoring efforts at risk, especially when value types are used as they’ll be passed by-value to the refactored method. When the variable in the original method is changed after the point of the call to the refactored method, its mutation won’t become visible to the lambda. If you still need to do this, refactor at least the block of code that contains the declaration and all uses of the variable being captured, so that the closure in the refactored method will have the same reach as the original for that particular variable.

    int i = 0;
    // … 1
    Func<int> f = () => i;
    // … 2
    int j = f();
    // … 3


    Quiz: What is/are the danger zone(s), expressed in terms of the marked blocks, for semantic changes when the “() => i” lambda is refactored behind a GetFunction method, assigning the result to f? How can the type of the variable i (and hence the generic parameter to Func`1) influence this?
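For the numbering sample in the first bullet, one candidate among those “useful brothers” is the Enumerable.Select overload that hands the element’s index to the selector. The sketch below contrasts it with the captured-counter version; the Person type, the sample names and the NumberingDemo wrapper are assumptions made for illustration, as the post doesn’t say what source contains.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical element type for the sample data.
class Person
{
    public string Name;
}

static class NumberingDemo
{
    static List<Person> MakeSource()
    {
        return new List<Person>
        {
            new Person { Name = "Bart" },
            new Person { Name = "John" }
        };
    }

    // Side-effecting version: the counter lives in a closure and keeps
    // counting each time the deferred query is enumerated.
    public static string WithCapturedCounter()
    {
        int i = 0;
        var res = from x in MakeSource() select new { Index = i++, x.Name };
        string first = string.Join(",", res.Select(r => r.Index + ":" + r.Name).ToArray());
        string second = string.Join(",", res.Select(r => r.Index + ":" + r.Name).ToArray());
        return first + " | " + second;
    }

    // Side-effect free version: the index is supplied by Select itself.
    public static string WithIndexOverload()
    {
        var res = MakeSource().Select((x, index) => new { Index = index, x.Name });
        string first = string.Join(",", res.Select(r => r.Index + ":" + r.Name).ToArray());
        string second = string.Join(",", res.Select(r => r.Index + ":" + r.Name).ToArray());
        return first + " | " + second;
    }
}
```

WithCapturedCounter() returns "0:Bart,1:John | 2:Bart,3:John" because the second enumeration carries on from where the first left off, whereas WithIndexOverload() returns "0:Bart,1:John | 0:Bart,1:John" no matter how often the query is enumerated.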

Hope this helps to understand closures a bit better.


Comments

# re: About cruel lambdas, closures, TypedReferences, CS0610 and other things you shouldn’t do

Monday, October 27, 2008 6:30 AM by Mark Shiffer

Wow, I feel like I've just been clubbed over the head...but in a good way. Thank you for a thought provoking start to my Monday; quite informative.
