Wednesday, September 27, 2006 2:47 AM bart

.NET 2.0 string interning inside out

Introduction

Time for some cool .NET 2.0 feature that might prove useful in some scenarios: string interning. If you don't know what it is, check out the resources at the end of the article first. Simply stated: string interning keeps a hashtable of strings while running an application. If a string with the same contents is created, no new heap allocation happens but instead a reference to the existing (identical) string is returned.

What do you think the output of the following code fragment will be (turn on your eye blocking filter for green letters :-))?

//
// Compile time intelligence ==> &sa == &&sb
//
string
sa = "Hello World" + "!";
string sb = "Hello World!"
;
Console.WriteLine("sa = {0}"
, sa);
Console.Write("&sa = "
); AddressOf(sa);
Console.WriteLine("sb = {0}"
, sb);
Console.Write("&sb = "
); AddressOf(sb);
Console.WriteLine("(&sa == &sb) == {0}\n", Object.ReferenceEquals(sa, sb));

Note: I rely on the unmaged AddressInspector (AddressOf method) created in a previous blog post. You can just omit AddressOf calls and trust the Object.ReferenceEquals call to display the right result (it does) but seeing the real address can be more "sexy" :-).

The answer to the question above is this:

sa = Hello World!
&sa = 01591CC8
sb = Hello World!
&sb = 01591CC8
(&sa == &sb) == True

Did you expect this? Maybe, maybe not. Right, what's going on here? Look at the IL for the first two lines:

IL_0001: ldstr "Hello World!"
IL_0006: stloc.0
IL_0007: ldstr "Hello World!"
IL_000c: stloc.1

Allrighty, the concatenation is performed at compile time thanks to C# compiler goodness. That is, there is no runtime call to String.Concat. Hang on, doesn't ldstr create a new string object? Exactly, from the III.4.15 section in the CLI spec: "The ldstr instruction pushes a new string object representing the literal stored in the metadata as string.". So, two allocations, one physical address. Welcome to the world of string interning.

Some demos and experiments

Let's start with something easy:

//
// Duplication of references ==> &s1 == &s2
//
string s1 = "Hello World"
;
Console.WriteLine("s1 = {0}"
, s1);
Console.Write("&s1 = "
); AddressOf(s1);

string
s2 = s1;
Console.WriteLine("s2 = {0}"
, s2);
Console.Write("&s2 = "
); AddressOf(s2);

Console.WriteLine("(&s2 == &s1) == {0}\n", Object.ReferenceEquals(s1, s2));

No surprise that s1 = s2 results in two identical addresses (or references, whatever terminology you like most). In the end, System.String is a reference type, thus assignment is pointer/address/reference assignment:

s1 = Hello World
&s1 = 01591DB8
s2 = Hello World
&s2 = 01591DB8
(&s2 == &s1) == True

So, we now have a string s1 containing "Hello World". What happens if we allocate another string with the same contents? An experiment:

//
// Another string with the same value ==> &s1 == &s3 (!)
//
string s3 = "Hello World"
;
Console.WriteLine("s3 = {0}"
, s3);
Console.Write("&s3 = "
); AddressOf(s3);

Console.WriteLine("(&s3 == &s1) == {0}\n", Object.ReferenceEquals(s1, s3));

Spitting out the following:

s3 = Hello World
&s3 = 01591DB8
(&s3 == &s1) == True

This, again, is string interning at your service. What happens is that the JIT checks the table with interned strings in the current appdomain (to be precise: on the module level, see SSCLI code later) and if the string is found in the (hash)table, that reference is returned.

Now, what if we append a character to s1 and assign it back to s1? Because strings are immutable, we'll end up with a new string object (and thus a new address/reference):

//
// Strings are immutable; changing them allocates a new string ==> different address!
//
s1 += "!"
;
Console.WriteLine("s1 = {0}"
, s1);
Console.Write("&s1 = "); AddressOf(s1);

In IL, this looks like this:

IL_00fa: ldloc.2
IL_00fb: ldstr "!"
IL_0100: call string [mscorlib]System.String::Concat(string, string)
IL_0105: stloc.2

You see, there is no such thing as ldstr "Hello World!" this time. The compiler can't foresee this concatenation (would be even useless if it could in this case, because we've used the unmodified s1 object before), so a runtime String.Concat call is emitted to IL, effectively creating a new string. Allocation of this string doesn't happen using ldstr but through a so-called "framed allocation" that directly calls into the GC for memory allocation (see SSCLI code further on).

Because the newly created string (obtained through concatenation) isn't allocated through ldstr (see further for SSCLI analysis), it doesn't end up in the (hash)table of interned strings. Therefore, calling ldstr "Hello World!" subsequently won't find a match in the interned strings table and will result in a new allocation (which on its turn does end up in the interned strings table!):

//
// Another string, equal to the modified s1 string ==> &s1 != &s4
//
string s4 = "Hello World!"
;
Console.WriteLine("s4 = {0}"
, s4);
Console.Write("&s4 = "
); AddressOf(s4);

Console.WriteLine("(&s4 == &s1) == {0}\n", Object.ReferenceEquals(s1, s4));

performs

IL_0124: ldstr "Hello World!"
IL_0129: stloc.s s4

and results in

s1 = Hello World!
&s1 = 015973C8
s4 = Hello World!
&s4 = 01591CC8
(&s4 == &s1) == False

How can we solve this? The answer: intern s1. Why? To reduce memory footprint caused by duplicate strings. What happens? The runtime finds s4 that already has the contents of s1 at the moment, thus the String.Intern(s1) call will return the address of s4.

Note: You might wonder what the implications are of interning and having multiple variables point to the same string in the back. Doesn't that cause problems when someone tries to change the string through one variable? The answer is of course no, because string are immutable. Changing a string isn't possible without creating another one (i.e. all instance methods on System.String return a new string instance).

//
// String interning ==> &s1 == &s4
//
Console.WriteLine("\n*** INTERNING s1 ***\n"
);
s1 =
String
.Intern(s1);
Console.WriteLine("s1 = {0}"
, s1);
Console.Write("&s1 = "
); AddressOf(s1);

Console.WriteLine("(&s4 == &s1) == {0}\n", Object.ReferenceEquals(s1, s4));

This displays (compare the new value for &s1 with &s4 in the previous experiment):

*** INTERNING s1 ***

s1 = Hello World!
&s1 = 01591CC8
(&s4 == &s1) == True

Finally we allocate yet another string that contains "Hello World!", this time via ldstr directly:

//
// Another string, equal to the modified but interned s1 string ==> &s1 == &s5
//
string s5 = "Hello World!"
;
Console.WriteLine("s5 = {0}"
, s5);
Console.Write("&s5 = "
); AddressOf(s5);

Console
.WriteLine("(s5 == s1) == " + Object.ReferenceEquals(s1, s5));

which displays

s5 = Hello World!
&s5 = 01591CC8
(s5 == s1) == True

Going much deeper - enter the SSCLI

You're a geek? So you do want to know how this really works behind the scenes? The SSCLI will help you out. First download it here and extract it using some ZIP/RAR tool. Download the .vcproj file created by shawnfa too if you want some comfortable browsing through the massive code base.

Part 1 - The JITter interns strings

Let's show how strings get interned when strings are loaded in IL. It all starts by having a call to create a new string literal:

jithelpers.h

    CORINFO_HELP_STRCNS,            // create a new string literal
    ...
    JITHELPER(CORINFO_HELP_STRCNS, JIT_StrCns)

This relies on the JITHELPER macro defined in jitinterface.h. Just for your reference:

jitinterface.h

    // enum for dynamically assigned helper calls
    enum DynamicCorInfoHelpFunc {
    #define JITHELPER(code, pfnHelper)
    #define DYNAMICJITHELPER(code, pfnHelper) DYNAMIC_##code,
    #include "jithelpers.h"    
        DYNAMIC_CORINFO_HELP_COUNT
};

Back on the right track, we find that jithelpers.cpp specifies the JIT_StrCns operation:

jithelpers.cpp

HCIMPL2(Object *, JIT_StrCns, unsigned metaTok, CORINFO_MODULE_HANDLE scopeHnd)

    hndStr = ConstructStringLiteral(scopeHnd, metaTok);
    return *(Object**)hndStr;

This piece of code relies on the HCIMPL2 macro, defined in fcall.h ("fast call"):

fcall.h

    #define FC_COMMON_PROLOG(target, assertFn) FCALL_TRANSITION_BEGIN()
    #define HCIMPL_PROLOG(funcname) LPVOID __me; __me = 0; FC_COMMON_PROLOG(funcname, HCallAssert)
    #define HCIMPL2(rettype, funcname, a1, a2) rettype F_CALL_CONV funcname(a1, a2) { HCIMPL_PROLOG(funcname)

That being said, on to ConstructStringLiteral which we find further in the jithelpers.cpp file:

jithelpers.cpp

OBJECTHANDLE ConstructStringLiteral(CORINFO_MODULE_HANDLE scopeHnd, mdToken metaTok)

    Module* module = GetModule(scopeHnd);
    if (module->HasNativeImage() && module->IsNoStringInterning())
    {
        return module->ResolveStringRef(metaTok, module->GetAssembly()->Parent(), true);
    }
    return module->ResolveStringRef(metaTok, module->GetAssembly()->Parent(), false);

As you can see, the current module's ResolveStringRef method is called, which is defined in ceeload.cpp:

ceeload.cpp

OBJECTHANDLE Module::ResolveStringRef(DWORD token, BaseDomain *pDomain, bool bNeedToSyncWithFixups)

    string = (OBJECTHANDLE)pDomain->GetStringObjRefPtrFromUnicodeString(&strData);
    return string;

This one calls into the appdomain implementation:

appdomain.hpp

   //****************************************************************************************
    // Methods to retrieve a pointer to the COM+ string STRINGREF for a string constant.
    // If the string is not currently in the hash table it will be added and if the
    // copy string flag is set then the string will be copied before it is inserted.
    STRINGREF *GetStringObjRefPtrFromUnicodeString(EEStringData *pStringData);

The comment reveals the intentions of this long-named method and mentions the hash table used for interning. The implementation is rather straightforward too:

appdomain.cpp

STRINGREF *BaseDomain::GetStringObjRefPtrFromUnicodeString(EEStringData *pStringData)

    if (m_pStringLiteralMap == NULL)
    {
        LazyInitStringLiteralMap();
    }

    return m_pStringLiteralMap->GetStringLiteral(pStringData, TRUE, !CanUnload() /* bAppDOmainWontUnload */);

And finally we end up in the "string literal map" class:

stringliteralmap.cpp

StringLiteralEntry *GlobalStringLiteralMap::GetStringLiteral(EEStringData *pStringData, DWORD dwHash, BOOL bAddIfNotFound)

    HashDatum Data;
    StringLiteralEntry *pEntry = NULL;

    if (m_StringToEntryHashTable->GetValue(pStringData, &Data, dwHash))
    {
        pEntry = (StringLiteralEntry*)Data;
        // If the entry is already in the table then addref it before we return it.
        if (pEntry)
            pEntry->AddRef();
    }
    else
    {
        if (bAddIfNotFound)
            pEntry = AddStringLiteral(pStringData);
    }

    return pEntry;

The hashtable implementation itself is of no further interest to us.

Part 2 - String.Concat doesn't take care of interning

As I told you before, String.Concat doesn't check whether the constructed string is interned. Doing so would require an expensive hashtable lookup and consistency would demand other methods like ToUpper, ToLower, etc to do the same. An overview of the stack:

string.cs

public static String Concat(String str0, String str1)

    ...
    String result = FastAllocateString(str0Length + str1.Length);
    ...
    retun result;

The FastAllocateString method that's called is implemented elsewhere in unmanaged code. Therefore it's declared "extern" and decorated with a MethodImplOptions.InternalCall type of method implementation attribute as shown below:

string.cs

    [MethodImplAttribute(MethodImplOptions.InternalCall)]
    private extern static String FastAllocateString(int length);

The real implementation lives in ecall.cpp which contains a big mapping from various "internal call" methods to the corresponding supporting methods in the back:

ecall.cpp

    ECClass gECClasses[] =
    {
        ...
        FCClassElement("String", "System", gStringFuncs)
        ...
    };

    FCFuncStart(gStringFuncs)
        ...
        FCFuncStart(gStringFuncs)
    FCDynamic("FastAllocateString", CORINFO_INTRINSIC_Illegal, ECall::FastAllocateString)
        ...
    FCFuncEnd()

This implements so-called "fast calls" (abbreviated FCall). Ecall.h comes to a rescue for dynamically assigned FCalls as shown below:

ecall.h

    #define DYNAMICALLY_ASSIGNED_FCALLS() \
    DYNAMICALLY_ASSIGNED_FCALL_IMPL(FastAllocateString,                FramedAllocateString) \

The end is near. Now we end up in a JIT helper, just as we did before in part 1. However, this time the implementation calls into SlowAllocateString stuff.

jithelpers.cpp

HCIMPL1(StringObject*, FramedAllocateString, DWORD stringLength)
    ...
    result = SlowAllocateString(stringLength+1);
    result->SetStringLength(stringLength);

    return((StringObject*) OBJECTREFToObject(result));

The end of our journey: the GC implementation. Never far away in .NET as you can see. Basically, the gcscan piece of it is asked to allocate a string:

gcscan.cpp

STRINGREF SlowAllocateString( DWORD cchArrayLength )
    ...
    orObject = (StringObject *)Alloc( ObjectSize, FALSE, FALSE );
    ...
    return( ObjectToSTRINGREF(orObject) );

The Alloc method (calling into the heap allocation mechanism and possibly triggering a garbage collection) would lead us too far, outside the scope of our current investigation. There's much more in here (masked by the ellipses ...) such as method table initialization and profiling stuff. Check it out if you got curious :-).

Part 3 - How String.Intern works

In one of our experiments we made an explicit call to String.Intern. Let's see how this one works now:

string.cs

    public static String Intern(String str) {
        if (str==null) {
            throw new ArgumentNullException("str");
        }
        return Thread.GetDomain().GetOrInternString(str);
    }

appdomain.cpp

STRINGREF *BaseDomain::GetOrInternString(STRINGREF *pString)
{
    if (m_pStringLiteralMap == NULL)
    {
        LazyInitStringLiteralMap();
    }
    _ASSERTE(m_pStringLiteralMap);
    return m_pStringLiteralMap->GetInternedString(pString, TRUE, !CanUnload() /* bAppDOmainWontUnload */);
}

And finally back to the "string literal map" class as in part 1. Oh yes, the m_pStringLiteralMap type is:

    // The AppDomain specific string literal map.
    AppDomainStringLiteralMap   *m_pStringLiteralMap;

Other resources

Other articles on interning include a post on Chris Brumme's weblog and a VB2TheMax article on DevX by Francesco Balena. Even Wikipedia knows the concept. So why wouldn't you too? I hope to have wet your SSCLI appetite too by means of this article.

Enjoy your internal string kitchen :-)

Del.icio.us | Digg It | Technorati | Blinklist | Furl | reddit | DotNetKicks

Filed under: ,

Comments

# re: .NET 2.0 string interning inside out

Wednesday, September 27, 2006 11:06 AM by Andy Maule

Great post Bart. I had no idea about string interning. It's really interesting to see it explained at the byte code level, I think it's good to learn how things work at this level. Keep up the good work!

# re: .NET 2.0 string interning inside out

Friday, September 29, 2006 10:17 AM by Muff

But what the hell is it good for? Gimme real world practical examples.

# re: .NET 2.0 string interning inside out

Friday, September 29, 2006 11:51 AM by bart

Hi Muff,

Thanks for the feedback. The basic goal of string interning is to reduce the memory footprint of an application when it deals with multiple "copies" of the same string. Notice this happens on the app domain level, so when talking about e.g. an ASP.NET application that uses a data connection string (DSN) all over again for each and every user, you'll see the same string (i.e. physical address) being used instead of ending up with multiple copies. In the lion's part of the cases you won't have to cope with it directly; it just works fine as shown in the post. However, when you decide to use things such as runtime string concatenation and you *know* other "sessions" within the same application have a great chance to do exactly the same, you might choose to intern that resulting string by caling String.Intern. That way, you can take maximum benefit out of this feature.

Having seen the discussion of the SSCLI implementation should not scare you; I consider an information overshoot to be valuable in helping to remember about the basic idea of the feature ("once I knew how it worked internally, I read it on Bart's blog; now I just remember the basic idea of it, but that's fair enough to work with it"). Just a little philosophical side-note from my side though.

Hope this additional explanation helps,

-Bart

# Something about .NET string interning « Some Creativity

Tuesday, December 11, 2007 4:35 PM by Something about .NET string interning « Some Creativity

Pingback from  Something about .NET string interning « Some Creativity

# C???????????????????????????????????? | IT??????????????????IT??????????????????

Pingback from  C???????????????????????????????????? | IT??????????????????IT??????????????????

# Today we all learned something | Heliocentrist

Thursday, June 02, 2011 1:32 PM by Today we all learned something | Heliocentrist

Pingback from  Today we all learned something | Heliocentrist

# C# string concatenation and string interning « « Programmers Goodies Programmers Goodies

Pingback from  C# string concatenation and string interning «  « Programmers Goodies Programmers Goodies

# C# string concatenation and string interning « « Programmers Goodies Programmers Goodies

Pingback from  C# string concatenation and string interning «  « Programmers Goodies Programmers Goodies

# Locking on an interned string? | BlogoSfera

Tuesday, October 15, 2013 3:02 AM by Locking on an interned string? | BlogoSfera

Pingback from  Locking on an interned string? | BlogoSfera

# String.format With Variables Vs Inline Variables | Click & Find Answer !

Pingback from  String.format With Variables Vs Inline Variables | Click & Find Answer !

# C #?????????????????????????????????C# string concatenation and string interning | BlueFAQ | C#

Pingback from  C #?????????????????????????????????C# string concatenation and string interning  | BlueFAQ | C#