Monday, November 03, 2008 11:11 PM
bart
C# 4.0 Feature Focus - Part 3 - Intermezzo: LINQ's new Zip operator
After named parameters and optional parameters, we'll take a little breadth and deviate a bit from the language specifics to present a new LINQ operator: Zip. Just like a zipper zips two streams of materials together, LINQ's Zip operator can zip together two sequences. Here's the signature of the new method:
public static IEnumerable<TResult> Zip<TFirst, TSecond, TResult>(this IEnumerable<TFirst> first, IEnumerable<TSecond> second, Func<TFirst, TSecond, TResult> func);
Sample
Given two sequences and a function that combines two elements from both sequences, a sequence of zipped pairs is produced. Here's a sample:
string[] names = { "Bart", "John" };
int[] ages = { 25, 60 };
names.Zip(ages, (name, age) => name + " is " + age + " years old.");
This produces a sequence with the sentences "Bart is 25 years old." and "John is 60 years old.". The lambda syntax for the passed-in function should speak for itself, and notice we're using extension method invocation here, so names is propagated to become the left-hand side of the method call.
How Zip works
Previously I've implemented the Standard Query Operators for reference purposes on the Codeplex site at http://www.codeplex.com/LINQSQO. I won't update that sample library just yet, but here's an illustration on how easy Zip is to implement, ignoring exception handling (which is more subtle than you might think, see further):
static class Enumerable
{
public static IEnumerable<TResult> Zip<TFirst, TSecond, TResult>(this IEnumerable<TFirst> first, IEnumerable<TSecond> second, Func<TFirst, TSecond, TResult> func)
{
var ie1 = first.GetEnumerator();
var ie2 = second.GetEnumerator();
while (ie1.MoveNext() && ie2.MoveNext())
yield return func(ie1.Current, ie2.Current);
}
}
The Zip operation is implemented using iterators inside ("yield return") and stops as soon as one of the sequences runs out of juice (the case of the asymmetric zipper).
Zip without Zip?
As a curiosity, is it actually possible to build Zip out of existing LINQ operators, ignoring performance worries? Unsurprisingly, it turns out this is the case. Here's how:
static class Enumerable
{
public static IEnumerable<TResult> Zip<TFirst, TSecond, TResult>(this IEnumerable<TFirst> first, IEnumerable<TSecond> second, Func<TFirst, TSecond, TResult> func)
{
return first.Select((x, i) => new { X = x, I = i }).Join(second.Select((x, i) => new { X = x, I = i }), o => o.I, i => i.I, (o, i) => func(o.X, i.X));
}
}
(Exercise for the reader: think of more alternatives.) How does this work? To explain this we need a bit of vocabulary, so let's refer to Wikipedia for a second and apply an isomorphism between textile sciences and computer sciences (something in me screams "zip, zipper, zippest"):
The bulk signature of a zipper Zip consists of two strips sequences of fabric element taypes, each affixed to one of the two pieces sequence instances to be joined, carrying tens or hundreds or anything below OutOfMemoryException conditions of specially regularly shaped allocated metal reference- or plastic value-typed teeth elements. (...) The slider Func<TFirst, TSecond, TResult> delegate, operated by hand executing Zip, moves along the rows sequences of teeth elements. Inside the slider Func<TFirst, TSecond, TResult> delegate is a Y-shaped strongly-typed channel function that meshes together or separates the opposing rows sequences of teeth elements, depending on the direction of its movement. The friction binding and vibration application of the slider Func<TFirst, TSecond, TResult> delegate against on the teeth elements causes a characteristic buzzing callvirt'ing noise, which is probably not the origin of the name zipper. The name also may have originated in the greater speed and ease with which the two sides of a zipper Zip can be joined, compared to the time needed for fastening executing laces or buttons the method above.
Well, that's exactly what happens: opposing rows sequences of teeth elements are combined by a Y-shaped strongly-typed channel function. How to determine opposing sequence elements? Using Select's overload that provides (besides the element itself) an index denoting the element's position in the original sequence. Matching the opposing elements is a matter of joining both sequences, extracting the keys (marked as I in the anonymous type) and combining the selected pairs of elements from both sequences (marked as X in the anonymous type) by feeding them in to the slider Func<TFirst, TSecond, TResult> delegate. I told you the explanation was straightforward, or did I?
Iterators and exceptions
Most of the LINQ operators are implemented using iterators. You might wonder this this is relevant at all. Obviously we need it as LINQ operators need to be lazy. Only when you start fetching results by iterating of the sequence, the internal machinery should kick in. Declaring a query doesn't cause any execution whatsoever. Internally iterators are implemented as state machines. Every state can do on "yield", after which the state machine is suspended till the consumer asks for the next element in the sequence being produced by calling MoveNext on the IEnumerator<T> object.
Our Zip implementation above is turned into an equivalent piece of code (slightly simplified for illustrative purposes):
public static IEnumerable<TResult> Zip<TFirst, TSecond, TResult>(this IEnumerable<TFirst> first, IEnumerable<TSecond> second, Func<TFirst, TSecond, TResult> func)
{
return new ZipIterator<TFirst, TSecond, TResult>(first, second, func);
}
private sealed class ZipIterator<TFirst, TSecond, TResult> : IEnumerable<TResult>, IEnumerator<TResult>
{
//
// Captured iterator method parameters.
//
private IEnumerable<TFirst> _first;
private Func<TFirst, TSecond, TResult> _func;
private IEnumerable<TSecond> _second;
//
// Iterator's enumerator state and thread affinity.
//
private State _state;
private int _initialThreadId;
//
// Value currently yielded by the iterator's enumerator.
//
private TResult _current;
//
// Local variables used by the iterator's enumerator.
//
private IEnumerator<TFirst> _ieFirst;
private IEnumerator<TSecond> _ieSecond;
//
// Captured iterator method parameters used by enumerator.
//
private IEnumerable<TFirst> first;
private Func<TFirst, TSecond, TResult> func;
private IEnumerable<TSecond> second;
//
// Public constructor to create an iterator in the initial ready state.
//
public ZipIterator(IEnumerable<TFirst> first, IEnumerable<TSecond> second, Func<TFirst, TSecond, TResult> func)
: this()
{
this._first = first;
this._second = second;
this._func = func;
}
//
// Private constructor to set state and thread affinity.
//
private ZipIterator()
{
this._state = State.Before;
this._initialThreadId = Thread.CurrentThread.ManagedThreadId;
}
//
// Gets a new enumerator over the iterator.
//
IEnumerator<TResult> IEnumerable<TResult>.GetEnumerator()
{
ZipIterator<TFirst, TSecond, TResult> iterator;
//
// Can reuse the current enumerator if thread affinity and state permit it.
//
if (Thread.CurrentThread.ManagedThreadId == _initialThreadId && _state == State.Before)
{
iterator = this;
}
else
{
iterator = new ZipIterator<TFirst, TSecond, TResult>();
}
iterator.first = this._first;
iterator.second = this._second;
iterator.func = this._func;
return iterator;
}
//
// Gets a new enumerator over the iterator.
//
IEnumerator IEnumerable.GetEnumerator()
{
return ((IEnumerable<TResult>)this).GetEnumerator();
}
//
// Advances the iterator one yield return at a time.
//
bool IEnumerator.MoveNext()
{
switch (this._state)
{
case State.Before:
this._ieFirst = this.first.GetEnumerator();
this._ieSecond = this.second.GetEnumerator();
goto case State.Suspended;
case State.Suspended:
if (this._ieFirst.MoveNext() && this._ieSecond.MoveNext())
{
this._current = this.func(this._ieFirst.Current, this._ieSecond.Current);
this._state = State.Suspended;
return true;
}
break;
}
this._state = State.After;
return false;
}
//
// Retting the iterator enumerator is not supported; create a new one instead by calling GetEnumerator.
// The foreach statement does this anyway.
//
void IEnumerator.Reset()
{
throw new NotSupportedException();
}
//
// Gets the value currently yielded from the iterator.
//
TResult IEnumerator<TResult>.Current
{
get
{
return _current;
}
}
//
// Gets the value currently yielded from the iterator.
//
object IEnumerator.Current
{
get
{
return _current;
}
}
//
// IDisposable implementation.
//
void IDisposable.Dispose()
{
}
//
// Internal iterator enumerator states.
//
private enum State
{
Before,
Running,
Suspended,
After
}
}
Notice that the code above holds the middle between the specification (paragraph 10.14 of the C# 3.0 specification) and the actual implementation in the Visual C# 2008 compiler. More specifically, I've made the discrete states (which are internally represented as integers) match the ones in the specification, but haven't gone all the way in matching the code up with the
Of all this machinery, the MoveNext method is the most interesting one from the iterator's point of view. Here a state machine is built, rewriting the original iterator block by splitting it into discrete blocks. To perform this transformation, yield return statements are replaced as follows:
yield a;
becomes
this._current = a;
this._state = State.Suspended;
return true;
Similar rewrites happen for yield break statements, but that's not relevant here. Furthermore all the iterator block code is placed in a MoveNext method, switching on the current state. The specification only mentions the four states I've implemented above, but we're lucky the number of states for our sample matches up with the number of states that's specified. In cases where there are more yield statements, more states are needed to suspend the machine and resume at the same point during the next MoveNext call. Other transformations needed include rewriting of loops (things like while don't really play well with suspension points and need to be rewritten in terms of if/goto, where gotos require additional states). Applying all those tricks gives us:
bool IEnumerator.MoveNext()
{
Begin:
switch (this._state)
{
case State.Before:
this._ieFirst = this.first.GetEnumerator();
this._ieSecond = this.second.GetEnumerator();
this._state = State.Running;
goto Begin;
case State.Running:
case State.Suspended:
if (this._ieFirst.MoveNext() && this._ieSecond.MoveNext())
{
this._current = this.func(this._ieFirst.Current, this._ieSecond.Current);
this._state = State.Suspended;
return true;
}
break;
}
this._state = State.After;
return false;
}
Here the local variables used in the iterator block are captured in the initial pass through MoveNext, when state is still set to Before. Strictly speaking we move from Before to Running all the way to the first point where the code gets suspended, but in our case states can be merged quite a bit. In fact the code really produced by the compiler looks like this:
bool IEnumerator.MoveNext()
{
switch (this._state)
{
case 0:
this._state = -1;
this._ieFirst = this.first.GetEnumerator();
this._ieSecond = this.second.GetEnumerator();
while (this._ieFirst.MoveNext() && this._ieSecond.MoveNext())
{
this.current = this.func(this._ieFirst.Current, this._ieSecond.Current);
this._state = 1;
return true;
Resume:
this._state = -1;
}
break; case 1:
goto Resume;
}
return false;
}
But why am I telling you all this? Because it's fascinating? Yes, for that reason too. But I promised to tell you something about exceptions, right? Let me quote from the C# 3.0 specification paragraph 10.14.4.1 first:
(...) The precise action performed by MoveNext depends on the state of the enumerator object when MoveNext is invoked:
- If the state of the enumerator object is 'before', invoking MoveNext:
- Changes the state to 'running'.
- Initializes the parameters (including this) of the iterator block to the argument values and instance value saved when the enumerator object was initialized.
- Executes the iterator block from the beginning until execution is interrupted (as described below).
(...)
The red line is what's of interest to us, and you should read it in reverse:
The code from the beginning of the iterator block till the place where execution is interrupted (i.e. a yield occurs) is executed by MoveNext when transitioning from 'before' to 'running'.
What if that code contains exception throwing statements?
static class Enumerable
{
public static IEnumerable<TResult> Zip<TFirst, TSecond, TResult>(this IEnumerable<TFirst> first, IEnumerable<TSecond> second, Func<TFirst, TSecond, TResult> func)
{
if (first == null)
throw new ArgumentNullException("first");
if (second == null)
throw new ArgumentNullException("second");
var ie1 = first.GetEnumerator();
var ie2 = second.GetEnumerator(); while (ie1.MoveNext() && ie2.MoveNext())
yield return func(ie1.Current, ie2.Current);
}
}
As you can guess, this code won't get executed till the very first passage through the iterator's enumerator object's MoveNext method. Recall what foreach corresponds to:
foreach (V v in x) embedded-statement
becomes (with C the collection type, E the enumerator type, V the local variable type and T the element type)
{
E e = ((C)(x)).GetEnumerator();
try {
V v;
while (e.MoveNext()) {
v = (V)(T)e.Current;
embedded-statement
}
}
finally {
...
}
}
So just calling the iterator does nothing that can cause the exceptions to be thrown, until a call is made to MoveNext, e.g. by using a foreach loop. As we want the exception to be thrown straight away when calling Zip with invalid arguments, the right way to implement our Zip method is:
static class Enumerable
{
public static IEnumerable<TResult> Zip<TFirst, TSecond, TResult>(this IEnumerable<TFirst> first, IEnumerable<TSecond> second, Func<TFirst, TSecond, TResult> func)
{
if (first == null)
throw new ArgumentNullException("first");
if (second == null)
throw new ArgumentNullException("second");
return ZipInternal(first, second, func);
}
private static IEnumerable<TResult> ZipInternal<TFirst, TSecond, TResult>(this IEnumerable<TFirst> first, IEnumerable<TSecond> second, Func<TFirst, TSecond, TResult> func)
{
var ie1 = first.GetEnumerator();
var ie2 = second.GetEnumerator(); while (ie1.MoveNext() && ie2.MoveNext())
yield return func(ie1.Current, ie2.Current);
}
}
Conclusion
The addition of Zip to LINQ is a nice one, but not a mind-blower. So I hope you'll accept my apologies for using this as a lame excuse to nag about the implementation of iterators and their subtle impact on exceptions. Next time: generics co- and contra-variance in C# 4.0.
Del.icio.us |
Digg It |
Technorati |
Blinklist |
Furl |
reddit |
DotNetKicks
Filed under: C# 3.0, LINQ, C# 4.0