cbrumme's WebLog
A quick update on me.
15/9/2006
It’s been over two years since I blogged.  Although I remain happily (perhaps even ecstatically) working at Microsoft, I left the CLR team and the Developer Division about a year ago.  I’m now on an incubation team, exploring evolution and revolution in operating systems.  This is a fascinating area that includes devices, concurrency, scheduling, security, distribution, application model, programming model and even some aspects of user interaction (where I am totally out of my depth).  And, as you might expect with my background, our effort also includes managed programming.   Anyway, this blog will remain available indefinitely.  It continues to be useful for certain technical details which are unavailable elsewhere.   In the meantime, if any readers are interested in working on a deep systems incubation with me and a team of truly outstanding developers, please send me email (cbrumme).  We are holding to some very high standards for this effort in terms of insight, experience and hard work.  But if you are like me, I am confident you will find it a dream opportunity.  
Updated Finalization and Hosting
26/4/2004
My original posts on Finalization and Hosting had some hokey XXXXX markers in place of content, where that content hadn't already been disclosed in some form.  Now that the Visual Studio 2005 Community Preview is available, I've gone back to those two posts and replaced the XXXXX markers with real text. Also, it's obviously been a while since my last post.  I started writing something this weekend, but the weather here has been spectacular and I was compelled to go outside and play.  I'll try to have something in the next couple of weeks.  
Hosting
21/2/2004
My prior three blogs were supposed to be on Hosting. Each time I got sidetracked, first on Exceptions, then on Application Compatibility and finally on Finalization. I refuse to be sidetracked this time… much.
Also, I need to explain why it’s taken so long to get this blog out. Part of the reason is vacation. I spent Thanksgiving skiing in Whistler. Then I took a quick side trip to Scottsdale for a friend’s surprise birthday party and to visit my parents. Finally, I spent over three weeks on Maui getting a break from the Seattle winter.
Another reason for the delay is writer’s block. This topic is so huge. The internal specification for the Whidbey Hosting Interfaces is over 100 pages. And that spec only covers the hosting interfaces themselves. There are many other aspects of hosting, like how to configure different security policy in different AppDomains, or how to use COM or managed C++ to stitch together the unmanaged host with the managed applications. There’s no way I can cover the entire landscape.
Anyway, here goes.
Mostly I was tourist overhead at the PDC. But one of the places I tried to pay for my ticket was a panel on Hosting. The other panelists included a couple of Program Managers from the CLR, another CLR architect, representatives from Avalon / Internet Explorer, SQL Server, Visual Studio / Office, and – to my great pleasure – a representative from IBM for DB2.
One thing that was very clear at that panel is that the CLR team has done a poor job of defining what hosting is and how it is done. Depending on your definition, hosting could be:
- Mixing unmanaged and managed code in the same process.
- Running multiple applications, each in its own specially configured AppDomain.
- Using the unmanaged hosting interfaces described in mscoree.idl.
- Configuring how the CLR runs in the process, like disabling the concurrent GC through an application config file.
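The last item in that list is the easiest one to show concretely. As a minimal sketch, an application config file (for example MyApp.exe.config sitting next to a hypothetical MyApp.exe) can turn off the concurrent GC with the gcConcurrent runtime setting:

```xml
<?xml version="1.0" encoding="utf-8"?>
<!-- MyApp.exe.config: the file name is a placeholder for illustration -->
<configuration>
  <runtime>
    <!-- Disable the concurrent garbage collector for this process -->
    <gcConcurrent enabled="false" />
  </runtime>
</configuration>
```

No unmanaged hosting code is involved here, which is why this kind of configuration sits at the lightweight end of what people call hosting.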
Even though the hosting interfaces described in mscoree.idl are a small part of what could be hosting, I’m going to concentrate on those interfaces.   In V1 and V1.1 of the CLR, we provided some APIs that allowed an unmanaged process host to exercise some limited control over the CLR.  This limited control included the ability to select the version of the CLR to load, the ability to create and configure AppDomains from unmanaged code, access to the ThreadPool, and a few other fundamental operations.   Also, we knew we eventually needed to support hosts which manage all the memory in the process and which use non-preemptive scheduling of tasks and perhaps even light-weight fibers rather than OS threads.  So we added some rudimentary (and alas inadequate) APIs for fibers and memory control.  This invariably happens when you add features that you think you will eventually need, rather than features that someone is actually using and giving feedback on.   If you look closely at the V1 and V1.1 hosting APIs, you really see what we needed to support ASP.NET and a few other scenarios, like ones involving EnterpriseServices, Internet Explorer or VSA, plus some rudimentary guesses at what we might need to coexist properly inside SQL Server.   Obviously in Whidbey we have refined those guesses about SQL Server into hard requirements.  And we tried very hard to generalize each extension that we added for SQL Server, so that it would be applicable to many other hosting scenarios.  In fact, it’s amazing that the SQL Server team still talks to us – whenever they ask for anything, we always say No and give them something that works a lot better for other hosts and not nearly so well for SQL Server.   In our next release (Whidbey), we’ve made a real effort to clean up the existing hosting support and to dramatically extend it for a number of new scenarios.  
Therefore I’m not going to spend any more time discussing those original V1 & V1.1 hosting APIs, except to the extent that they are still relevant to the following Whidbey hosting discussion.
Also I’m going to skip over all the general introductory topics like “When to host” since they were the source of my writer’s block. Instead, I’m going to leap into some of the more technically interesting topics. Maybe after we’ve studied various details we can step back and see some general guidelines.
Threading and Synchronization
One of the most interesting challenges we struggled with during Whidbey was the need to cooperate with SQL Server’s task scheduling. SQL Server can operate in either thread mode or fiber mode. Most customers run in thread mode, but SQL Server can deliver its best numbers on machines with lots of CPUs when it’s running in fiber mode. That gap between thread and fiber mode has been closing as the OS addresses issues with its own preemptive scheduler.
A few years ago, I ran some experiments to see how many threads I could create in a single process. Not surprisingly, after almost 2000 threads I ran out of address space in the process. That’s because the default stack size on NT is 1 MB and the default user address space is 2 GB. (Starting with V1.1, the CLR can load into LARGEADDRESSAWARE processes and use up to 3 GB of address space). If you shrink the default stack size, you can create more than 2000 threads before hitting the address space limit. I see stack sizes of 256 KB in the SQL Server process on my machine, clearly to reduce this impact on process address space.
Of course, address space isn’t the only limit you can hit. Even on the 4 CPU server box I was experimenting with, the real memory on the system was inadequate for the working set being used. With enough threads, I exceeded real memory and experienced paging. (Okay, it was actually thrashing).
But nowadays there are plenty of servers with several GB of real – and real cheap – memory, so this doesn’t have to be an issue.
In my experiments, I simulated server request processing using an artificial work load that combined blocking, allocation, CPU-intensive computation, and a reasonable memory reference set using a mixture of both shared and per-request allocations. In the first experiments, all the threads were ready to run and all of them had equal priority. The result of this was that all threads were scheduled in a round-robin fashion on those 4 CPUs. Since the Windows OS schedules threads preemptively, each thread would execute until it either needed to block or it exceeded its quantum. With hundreds or even thousands of threads, each context switch was extremely painful. That’s because most of the memory used by that thread was so cold in the cache, having been fully displaced by the hundreds of threads that ran before it.
As we all know, modern CPUs are getting faster and faster at raw computation. And they have more and more memory available to them. But access to that memory is getting relatively slower each year. By that, I mean that a single memory access costs the equivalent of an increasing number of instructions. One of the ways the industry tries to mitigate that relative slowdown is through a cache hierarchy. Modern x86 machines have L1, L2 and L3 levels of cache, ordered from fastest and smallest to slowest and largest.
(Other ways we try to mitigate the slowdown are increasing the locality of our data structures and pre-fetching. If you are a developer, hopefully you already know about locality. In the unmanaged world, locality is entirely your responsibility. In the managed world, you get some locality benefits from our environment – notably the garbage collector, but also the auto-layout of the class loader. Yet even in managed code, locality remains a major responsibility of each developer).
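To make the locality point a little more concrete, here is a minimal C sketch. It is not taken from the experiments described above; the array size and the function names are made up for illustration. Both functions add up the same two-dimensional array. The first walks memory in the order C lays it out, so consecutive accesses tend to land on the same cache line. The second strides from row to row, so on a large array consecutive accesses tend to land on different lines.

```c
#include <stddef.h>

/* Illustrative size only; any array much larger than the cache would do. */
enum { ROWS = 1024, COLS = 1024 };

static int grid[ROWS][COLS];

/* Fill the array with a simple, reproducible pattern. */
void fill_grid(void) {
    for (size_t r = 0; r < ROWS; r++)
        for (size_t c = 0; c < COLS; c++)
            grid[r][c] = (int)((r + c) % 7);
}

/* Row-major walk: the inner loop moves through adjacent addresses. */
long long sum_row_major(void) {
    long long total = 0;
    for (size_t r = 0; r < ROWS; r++)
        for (size_t c = 0; c < COLS; c++)
            total += grid[r][c];
    return total;
}

/* Column-major walk: the inner loop jumps a whole row (COLS ints) each step. */
long long sum_column_major(void) {
    long long total = 0;
    for (size_t c = 0; c < COLS; c++)
        for (size_t r = 0; r < ROWS; r++)
            total += grid[r][c];
    return total;
}
```

Both functions return the same total. Only the order in which memory is touched differs, and that order is the part the developer controls.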
Unfortunately, context switching between such a high number of threads will largely invalidate all those caches.  So I changed my simulated server to be smarter about dispatching requests.  Instead of allowing 1000 requests to execute concurrently, I would block 996 of those requests and allow 4 of them to run.  This makes life pretty easy for the OS scheduler!  There are four CPUs and four runnable threads.  It’s pretty obvious which threads should run.   Not only will the OS keep those same four threads executing, it will likely keep them affinitized to the same CPUs.  When a thread moves from one CPU to another, the new CPU needs to fill all the levels of cache with data appropriate to the new thread.  However, if we can remain affinitized, we can enjoy all the benefits of a warm cache.  The OS scheduler attempts to run threads on the CPU that last ran them (soft affinity).  But in practice this soft affinity is too soft.  Threads tend to migrate between CPUs far more than we would like.  When the OS only has 4 runnable threads for its 4 CPUs, the amount of migration seemed to drop dramatically.   Incidentally, Windows also supports hard affinity.  If a thread is hard affinitized to a CPU, it either runs on that CPU or it doesn’t run.  The CLR can take advantage of this when the GC is executing in its server mode.  But you have to be careful not to abuse hard affinity. You certainly don’t want to end up in a situation where all the “ready to run” threads are affinitized to one CPU and all the other CPUs are necessarily stalled.   Also, it’s worth mentioning the impact of hyper-threading or NUMA on affinity.  On traditional SMP, our choices were pretty simple.  Either our thread ran on its ideal processor, where we are most likely to see all the benefits of a warm cache, or it ran on some other processor.  All those other processor choices can be treated as equally bad for performance.  
But with hyper-threading or NUMA, some of those other CPUs might be better choices than others.  In the case of hyper-threading, some logical CPUs are combined into a single physical CPU and so they share access to the same cache memory at some level in the cache hierarchy.  For NUMA, the CPUs may be arranged in partitions (e.g. hemispheres on some machines), where each partition has faster access to some memory addresses and slower access to other addresses.  In all these cases, there’s some kind of gradient from the very best CPU(s) for a thread to execute on, down to the very worst CPU(s) for that particular thread.  The world just keeps getting more interesting.   Anyway, remember that my simulated server combined blocking with other operations.  In a real server, that blocking could be due to a web page making a remote call to get rows from a database, or perhaps it could be blocking due to a web service request.  If my server request dispatcher only allows 4 requests to be in flight at any time, such blocking will be a scalability killer.  I would stall a CPU until my blocked thread is signaled.  This would be intolerable.   Many servers address this issue by releasing some multiple of the ideal number of requests simultaneously.  If I have 4 CPUs dedicated to my server process, then 4 requests is the ideal number of concurrent requests.  If there’s “moderate” blocking during the processing of a typical request, I might find that 8 concurrent requests and 8 threads is a good tradeoff between more context switching and not stalling any CPUs.  If I pick too high of a multiple over the number of CPUs, then context switching and cache effects will hurt my performance.  If I pick too low a multiple, then blocking will stall a CPU and hurt my performance.   If you look at the heuristics inside the managed ThreadPool, you’ll find that we are constantly monitoring the CPU utilization.  
If we notice that some CPU resources are being wasted, we may be starving the system by not doing enough work concurrently.  When this is detected, we are likely to release more threads from the ThreadPool in order to increase concurrency and make better use of the CPUs.  This is a decent heuristic, but it isn’t perfect.  For instance, CPU utilization is “backwards looking.”  You actually have to stall a CPU before we will notice that more work should be executed concurrently.  And by the time we’ve injected extra threads, the stalling situation may already have passed.   The OS has a better solution to this problem.  IO Completion Ports have a direct link to the blocking primitives in Win32.  When a thread is processing a work item from a completion port, if that thread blocks efficiently through the OS, then the blocking primitive will notify the completion port that it should release another thread.  (Busy waiting instead of efficient blocking can therefore have a substantial impact on the amount of concurrency in the process).  This feedback mechanism with IO Completion Ports is far more immediate and effective than the CLR’s heuristic based on CPU utilization.  But in fairness I should point out that if a managed thread performs managed blocking via any of the managed blocking primitives (contentious Monitor.Enter, WaitHandle.WaitOne/Any/All, Thread.Join, GC.WaitForPendingFinalizers, etc.), then we have a similar feedback mechanism.  We just don’t have hooks into the OS, so we cannot track all the blocking operations that occur in unmanaged code.   Of course, in my simulated server I didn’t have to worry about “details” like how to track all OS blocking primitives.  Instead, I postulated a closed world where all blocking had to go through APIs exposed by my server.  This gave me accurate and immediate information about threads either beginning to block or waking up from a blocking operation.  
Given this information, I was able to tweak my request dispatcher so it avoided any stalling by injecting new requests as necessary.
Although it’s possible to completely prevent stalling in this manner, it’s not possible to prevent context switches. Consider what happens on a 1 CPU machine. We release exactly one request which executes on one thread. When that thread is about to block, we release a second thread. So far, it’s perfect. But when the first thread resumes from its blocking operation, we now have two threads executing concurrently. Our request dispatcher can “retire” one of those threads as soon as it’s finished its work. But until then we have two threads executing on a single CPU and this will impact performance.
I suppose we could try to get ruthless in this situation, perhaps by suspending one of the threads or reducing its priority. In practice, it’s never a good idea to suspend an executing thread. If that thread holds any locks that are required by other concurrent execution, we may have triggered a deadlock. Reducing the priority might help and I suspect I played around with that technique. To be honest, I can’t remember that far back.
We’ll see that SQL Server can even solve this context switching problem.
Oh yeah, SQL Server
So what does any of this have to do with SQL Server?
Not surprisingly, the folks who built SQL Server know infinitely more than me about how to get the best performance out of a server. And when the CLR is inside SQL Server, it must conform to their efficient design. Let’s look at their thread mode, first. Fiber mode is really just a refinement over this.
Incoming requests are carried on threads. SQL Server handles a lot of simultaneous requests, so there are a lot of threads in the process. With normal OS “free for all” scheduling, this would result in way too many context switches, as we have seen. So instead those threads are affinitized to a host scheduler / CPU combination.
The scheduler tries to ensure that there is one unblocked thread available at any time. All the other threads are ideally blocked. This gives us the nirvana of 100% busy CPUs and minimal context switches. To achieve this nirvana, all the blocking primitives need to cooperate with the schedulers. Even if an event has been signaled and a thread is considered by the application to be “ready to run”, the scheduler may not choose to release it, if the scheduler’s corresponding CPU is already executing another thread. In this manner, the blocking primitive and the scheduler are tightly integrated.
When I built my simulated server, I was able to achieve an ideal “closed world” where all the synchronization primitives were controlled by me. SQL Server attempts the same thing. If a thread needs to block waiting for a data page to be read, or for a page or row latch to be released, that blocking occurs through the SQL Server scheduler. This guarantees that exactly one thread is available to run on each CPU, as we’ve seen.
Of course, execution of managed code also hits various blocking points. Monitor.Enter (‘lock’ in C# and ‘SyncLock’ in VB.NET) is a typical case. Other cases include waiting for a GC to complete, waiting for class construction or assembly loading or type loading to occur, waiting for a method to be JITted, or waiting for a remote call or web service to return. For SQL Server to hit their performance goals and to avoid deadlocks, the CLR must route all of these blocking primitives to SQL Server (or any other similar host) through the new Whidbey hosting APIs.
Leaving the Closed World
But what about synchronization primitives that are used for coordination with unmanaged code and which have precise semantics that SQL Server cannot hope to duplicate? For example, WaitHandle and its subtypes (like Mutex, AutoResetEvent and ManualResetEvent) are thin wrappers over the various OS waitable handles.
These primitives provide atomicity guarantees when you perform a WaitAll operation on them.  They have special behavior related to message pumping.  And they can be used to coordinate activity across multiple processes, in the case of named primitives.  It’s unrealistic to route operations on WaitHandle through the hosting APIs to some equivalent host-provided replacements.   This issue with WaitHandle is part of a more general problem.  What happens if I PInvoke from managed code to an OS service like CoInitialize or LoadLibrary or CryptEncrypt?  Do those OS services block?  Well, I know that LoadLibrary will have to take the OS loader lock somewhere.  I could imagine that CoInitialize might need to synchronize something, but I have no real idea.  One thing I am sure of: if any blocking happens, it isn’t going to go through SQL Server’s blocking primitives and coordinate with their host scheduler.  The idealized closed world that SQL Server needs has just been lost.   The solution here is to alert the host whenever a thread “leaves the runtime”.  In other words, if we are PInvoking out, or making a COM call, or the thread is otherwise transitioning out to some unknown unmanaged execution, we tell the host that this is happening.  If the host is tracking threads as closely as SQL Server does, it can use this event to disassociate the thread from the host scheduler and release a new thread.  This ensures that the CPU stays busy.  That’s because even if the disassociated thread blocks, we’ve released another thread.  This newly released thread is still inside our closed world, so it will notify before it blocks so we can guarantee that the CPU won’t stall.   Wait a second.  The CLR did a ton of work to re-route most of its blocking operations through the host.  But we could have saved almost that entire ton of engineering effort if we had just detached the thread from the host whenever SQL Server called into managed code.  
That way, we could freely block and we wouldn’t disrupt the host’s scheduling decisions.   This is true, but it won’t perform as well as the alternative.  Whenever a thread disassociates from a host scheduler, another thread must be released.  This guarantees that the CPU is busy, but it has sacrificed our nirvana of only having a single runnable thread per CPU.  Now we’ve got two runnable threads for this CPU and the OS will be preemptively context-switching between them as they run out of quantum.   If a significant amount of the processing inside a host is performed through managed code, this would have a serious impact on performance.   Indeed, if a significant amount of the processing inside a host is performed in unmanaged code, called via PInvokes or COM calls or other mechanisms that “leave the runtime”, this too can have a serious impact on performance.  But, for practical purposes, we expect most execution to remain inside the host or inside managed code.  The amount of processing that happens in arbitrary unmanaged code should be low, especially over time as our managed platform grows to fill in some of the current gaps.   Of course, some PInvokes or COM calls might be to services that were exported from the host.  We certainly don’t want to disassociate from the host scheduler every time the in-process ADO provider performs a PInvoke back to SQL Server to get some data.  This would be unnecessary and expensive.  So there’s a way for the host to control which PInvoke targets perform a “leave runtime” / “return to runtime” pair and which ones are considered to remain within the closed world of our integrated host + runtime.   Even if we were willing to tolerate the substantial performance impact of considering all of the CLR to be outside the host’s closed world (i.e. we disassociated from the host’s scheduler whenever we ran managed code), this approach would be inadequate when running in fiber mode.  
That’s because of the nasty effects which thread affinity can have on a fiber-based system.
Fiber Mode
As we’ve seen, SQL Server and other “extreme” hosts can ensure that at any time each CPU has only a single thread within the closed world that is ready to run. But when SQL Server is in thread mode, there are still a large number of threads that aren’t ready to run. It turns out that all those blocked threads impose a modest cost upon the OS preemptive scheduler. And that cost becomes an increasing consideration as the number of CPUs increases. For 1, 2, 4 and probably 8 CPU machines, fiber mode isn’t worth the headaches we’re about to discuss. But by the time you get to a larger machine, you might achieve something like a 20% throughput boost by switching to fiber mode. (I haven’t seen real numbers in a year or two, so please take that 20% as a vague ballpark).
Fiber mode simply eliminates all those extra threads from any consideration by the OS. If you stay within the idealized nirvana (i.e. you don’t perform a “leave runtime” operation), there is only one thread for each host scheduler / CPU. Of course, there are many stacks / register contexts and each such stack / register context corresponds to an in-flight request. When a stack is ready to run, the single thread switches away from whatever stack it was running and switches to the new stack. But from the perspective of the OS scheduler, it just keeps running the only thread it knows about.
So in both thread mode and fiber mode, SQL Server uses non-preemptive host scheduling of these tasks. This scheduling happens in user mode, which is a distinct advantage over the OS preemptive scheduling which happens in kernel mode. The only difference is whether the OS scheduler is aware of all the tasks on the host scheduler, or whether they all look like a single combined thread – albeit with different stacks and register contexts.
But the impact of this difference is significant.
First, it means that there is an M:N relationship between stacks (logical CLR threads) and OS threads.  This is M:N because multiple stacks will execute on a single thread, and because the specially nominated thread that carries those stacks can change over time.  This change in the nominated thread occurs as a consequence of those “leave runtime” calls.  Remember that when a thread leaves the runtime, we inform the host which disassociates the thread from the host scheduler.  A new thread is then created or obtained from a short list of already-created threads.  This new thread then picks up the next stack that is ready to run.  The effect is that this stack has migrated from the original disassociated thread to the newly nominated thread.   This M:N relationship between stacks and OS threads causes problems everywhere that thread affinity would normally occur.  I’ve already mentioned CPU affinity when discussing how threads are associated with CPUs.  But now I’m talking about a different kind of affinity.  Thread affinity is the association between various programmatic operations and the thread that these operations must run on.  For example, if you take an OS critical section by calling EnterCriticalSection, the resulting ownership is tied to your thread.  Sometimes developers say that the OS critical section is scoped to your thread.  You must call LeaveCriticalSection from that same thread.   None of this is going to work properly if your logical thread is asynchronously and apparently randomly migrating between different physical threads.  You’ll successfully take the critical section on one logical thread.  If you attempt to recursively acquire this critical section, you will deadlock if a migration has intervened.  That’s because it will look like a different physical thread is actually the owner.   
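The recursive-acquisition failure described above can be sketched without any real threads. The following C fragment is a toy model, not CLR or Windows code: SimCriticalSection, sim_try_enter and recursive_enter_succeeds are hypothetical names, and the "physical thread" is just an integer id. It assumes only what the text says, namely that ownership is recorded against the physical thread.

```c
#include <stdbool.h>

/* Toy stand-in for an OS critical section: ownership is recorded by
   physical thread id (0 means nobody owns it). */
typedef struct {
    int owner_thread;
    int recursion;
} SimCriticalSection;

/* Try to enter on behalf of a physical thread. Returns false where a
   real EnterCriticalSection would block. */
bool sim_try_enter(SimCriticalSection *cs, int physical_thread) {
    if (cs->owner_thread == 0) {
        cs->owner_thread = physical_thread;
        cs->recursion = 1;
        return true;
    }
    if (cs->owner_thread == physical_thread) {
        cs->recursion++;          /* recursive acquire by the owner */
        return true;
    }
    return false;                 /* looks like another thread owns it */
}

/* One logical thread enters, is optionally migrated to another physical
   thread by the host, and then tries to enter again. */
bool recursive_enter_succeeds(bool migrated) {
    SimCriticalSection cs = { 0, 0 };
    int physical_thread = 1;      /* thread currently carrying our stack */

    if (!sim_try_enter(&cs, physical_thread))
        return false;
    if (migrated)
        physical_thread = 2;      /* host moved our stack to a new thread */
    return sim_try_enter(&cs, physical_thread);
}
```

Calling recursive_enter_succeeds(false) returns true, while recursive_enter_succeeds(true) returns false: after the migration the lock believes thread 1 still owns it, so the same logical thread would wait on itself forever.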
Imagine writing some hypothetical code inside the CLR:
    EnterCriticalSection(pCS);
    if (pGlobalBlock == NULL)
        pGlobalBlock = Alloc(count);
    LeaveCriticalSection(pCS);
Obviously any real CLR code would be full of error handling, including a ‘finally’ clause to release the lock. And we don’t use OS critical sections directly since we typically reflect them to an interested host as we’ve discussed. And we instrument a lot of this stuff, including spinning during lock acquisition. And we wrap the locks with lots of logic to avoid deadlocks, including GC-induced deadlocks. But let’s ignore all of the goop that would be necessary for real CLR code.
It turns out that the above code has a thread affinity problem. Even though SQL Server’s fiber scheduling is non-preemptive, scheduling decisions can still occur whenever we call into the host. For reasons that I’ll explain later, all memory allocations in the CLR have the potential to call into the host and result in scheduling. Obviously most allocations will be satisfied locally in the CLR without escalation to the host. And most escalations to the host still won’t cause a scheduling decision to occur. But from a correctness perspective, all allocations have the potential to cause scheduling. That means the logical thread could be migrated to a different physical thread inside Alloc, between the EnterCriticalSection and LeaveCriticalSection calls.
Other places where thread affinity can bite us include:
- The OS Mutex and the managed System.Threading.Mutex wrapper.
- LoadLibrary and DllMain interactions. As I’ve explained in my blog entry on Shutdown, DllMain notifications occur on a thread which holds the OS loader lock.
- TLS (thread local storage). It’s worth mentioning that, starting with Windows Server 2003, there are new FLS (fiber local storage) APIs. These APIs allow you to associate state with the logical rather than the physical thread. When a fiber is associated with a thread for execution (SwitchToFiber), the FLS is automatically moved from the fiber onto the thread. For managed TLS, we now move this automatically. But we cannot do this unconditionally for all the unmanaged TLS.
- Thread culture or locale, the impersonation context or user identity, the COM+ transaction context, etc. In some sense, these are just special cases of thread local storage. However, for historical reasons it isn’t possible to solve these problems by moving them to FLS.
- Taking control of a thread for GC, Abort, etc. via the OS SuspendThread() service.
- Any use of ThreadId or Thread Handle. This includes all debugging.
- “Hand-rolled” locks that we cannot discover or reason about, and which you have inadvertently based on the physical OS thread rather than the logical thread or fiber.
- Various PInvokes or COM calls that might end up in unmanaged code with affinity requirements. For instance, MSHTML can only be called on STA threads which are necessarily affinitized. Of course, there is no list of all the APIs that have odd threading behavior. It’s a minefield out there.
Solving affinity issues is relatively simple. The hard part is identifying all the places. Note that the last two bullet items are actually the application’s responsibility to identify. Some application code might appear to execute correctly when logical threads and OS threads are 1:1. But when a host creates an M:N relationship, any latent application bugs will be exposed.
In many cases, the easiest solution to a thread affinity issue is to disassociate the thread from the host’s scheduler until the affinity is no longer required. The hosting APIs provide for this, and we’ve taken care of it for you in many places – like System.Threading.Mutex.
Before we finish our discussion of locking, there’s one more aspect worth mentioning. In an earlier blog, I have mentioned the limited deadlock detection and deadlock breaking which the CLR performs when executing class constructors or JITting.
Except for this limited case, the CLR doesn’t concern itself with application-level deadlocks.
If you write some managed code that takes a set of locks in random order, resulting in a potential deadlock, we consider that to be your application bug.  But some hosts may be more helpful.  Indeed, SQL Server has traditionally detected deadlocks in all data accesses.  When a deadlock occurs, SQL Server selects a victim and aborts the corresponding transaction.  This allows the other requests implicated in the deadlock to proceed.   With the new Whidbey hosting APIs, it’s possible for the host to walk all contentious managed locks and obtain a graph of the participants.  This support extends to locking through our Monitor and our ReaderWriterLock.  Clearly, an application could perform locking through other means.  For example, an AutoResetEvent can be used to simulate mutual exclusion.  But it’s not possible for such locks to be included in the deadlock algorithms, since there isn’t a strong notion of lock ownership that we can use.   Once the host has selected a deadlock victim, it must cause that victim to abort its forward progress somehow.  If the victim is executing managed code, some obvious ways to do this include failing the lock attempt (since the thread is necessarily blocking), aborting the thread, or even unloading the AppDomain.  We’ll return to the implications of this choice in the section on Reliability below.   Finally, it’s interesting to consider how one might get even better performance than what SQL Server has achieved.  We’ve seen how fiber mode eliminates all the extra threads, by multiplexing a number of stacks / register contexts onto a single thread.  What happens if we then eliminate all those fibers?  For a dedicated server, we can achieve even better performance by forcing all application code to maintain its state outside of a thread’s stack.  This allows us to use a single thread per CPU which executes user requests by processing them on its single dedicated stack.  
All synchronous blocking is eliminated by relying on asynchronous operations. The thread never yields while holding its stack pinned. The amount of memory required to hold an in-flight request will be far less than a 256 KB stack reservation. And the cost of processing an asynchronous completion through polling can presumably be less than the cost of a fiber context switch.

If all you care about is performance, this is an excellent way to build a server. But if you need to accommodate 3rd party applications inside the server, this approach is questionable. Most developers have a difficult time breaking their logic into segments which can be separately scheduled with no stack dependencies. It’s a tedious programming model. Also, the underlying Windows platform still contains a lot of blocking operations that don’t have asynchronous variants available. WMI is one example.

Memory Management

Servers must not page.

Like all rules, this one isn’t strictly true. It is actually okay to page briefly now and then, when the work load transitions from one steady state to another. But if you have a server that is routinely paging, then you have driven that server beyond its capacity. You need to reduce the load on the server or increase the server’s memory capacity.

At the same time, it’s important to make effective use of the memory capacity of a server. Ideally, a database would store the entire database contents in memory. This would allow it to avoid touching the disk, except to write the durable log that protects it from data loss and inconsistency in the face of catastrophic failure. Of course, the 2 or 3 GB limit of Win32 is far too restrictive for most interesting databases. (SQL Server can use AWE to escape this limit, at some cost). And even the address limits of Win64 are likely to be exceeded by databases presently. That’s because Win64 does not give you a full 64 bits of addressing and databases are already heading into the petabytes.
So a database needs to consider all the competing demands for memory and make wise decisions about which ones to satisfy. Historically, those demands have included the buffer cache which contains data pages, compiled query plans, and all those thread stacks. When the CLR is loaded into the process, significant additional memory is required for the GC heap, application code, and the CLR itself. I’m not sure what techniques SQL Server uses to trade off the competing demands for memory. Some servers carve memory up based on fixed ratios for the different broad uses, and then rely on LRU within each memory chunk. Other servers assign a cost to each memory type, which indicates how expensive it would be to regenerate that memory. For example, in the case of a data page, that cost is an IO.

Some servers use elaborate throttling of inbound requests, to keep the memory load reasonable. This is relatively easy to do when all requests are comparable in terms of their memory and CPU requirements. But if some queries access a single database page and other queries touch millions of rows, it would be hard to factor this into a throttling decision that is so far upstream from the query processor. Instead, SQL Server tends to accept a large number of incoming requests and process them “concurrently.” We’ve already seen in great detail why this concurrent execution doesn’t actually result in preemptive context switching between all the corresponding tasks. But it is still the case that each request will hold onto some reference set of memory, even when the host’s non-preemptive scheduler has that request blocked.

If enough requests are blocked while holding onto significant unshared memory, then the server process may find itself over-committed on memory. At this point, it could page – which hurts performance. Or it could kill some of the requests and free up the resources they are holding onto.
This is an unfortunate situation, because we’ve presumably already devoted resources like the CPU to get the request to its current state of partial completion. If we throw away the request, all that work was wasted. And the client is likely to resubmit the request, so we will have to repeat all that work soon.

Nevertheless, if the server is over-committed and it’s not practical to recover more memory by e.g. shrinking the number of pages devoted to the buffer cache, then killing in-flight requests is a sound strategy. This is particularly reasonable in database scenarios, since the transactional nature of database operations means that we can kill requests at any time and with impunity.

Unfortunately, the world of arbitrary managed execution has no transactional foundation we can rely on. We’ll pick up this issue again below, in the section on Reliability.

It should be obvious that, if SQL Server or any other host is going to make wise decisions about memory consumption on a “whole process” basis, that host needs to know exactly how much memory is being used and for what purposes. For example, before the host unloads an AppDomain as a way of backing out of an over-committed situation, the host needs some idea of how many megabytes this unload operation is likely to deliver.

In the reverse direction, the host needs to be able to masquerade as the operating system. For instance, the CLR’s GC monitors system memory load and uses this information in its heuristics for deciding when to schedule a collection. The host needs a way to influence these collection decisions.

SQL Server and ASP.NET

Clearly a lot of work went into threading, synchronization and memory management in SQL Server. One obvious question to ask is how ASP.NET compares. They are both server products from Microsoft and they both execute managed code. Why didn’t we need to add all this support to the hosting interfaces in V1 of the CLR, so we could support ASP.NET?
I think it’s fair to say that ASP.NET took a much simpler approach to the problem of building a scalable server. To achieve efficient threading, they rely on the managed ThreadPool’s heuristics to keep the CPUs busy without driving up too many context switches. And since the bulk of memory allocations are due to the application, rather than the ASP.NET infrastructure (in other words, they aren’t managing large shared buffer pools for data pages), it’s not really possible for ASP.NET to act as a broker for all the different memory consumers. Instead, they just monitor the total memory load, and recycle the worker process if a threshold is exceeded.

(Incidentally, V1 of ASP.NET and the CLR had an unfortunate bug with the selection of this threshold. The default point at which ASP.NET would recycle the process was actually a lower memory load than the point at which the CLR’s GC would switch to a more aggressive schedule of collections. So we were actually killing the worker process before the CLR had a chance to deliver more memory back to the application. Presumably in Whidbey this selection of default thresholds is now coordinated between the two systems.)

How can ASP.NET get away with this simpler approach?

It really comes down to their fundamental goals. ASP.NET can scale out, rather than having to scale up. If you have more incoming web traffic, you can generally throw more web servers at the problem and load balance between them. Whereas SQL Server can only scale out if the data supports this. In some cases, it does. There may be a natural partitioning of the data, like access to the Hotmail mailbox for a particular incoming user. But in too many other cases, the data cannot be sufficiently partitioned and the server must be scaled up. On X86 Windows, the practical limit is a 32-way CPU with a hard limit of 3 GB of user address space.
If you want to keep increasing your work load on a single box, you need to use every imaginative trick – like fibers or AWE – to eke out all possible performance.

There’s also an availability issue. ASP.NET can recycle worker processes quite quickly. And if they have scaled out, recycling a worker process on one of the computers in the set will have no visible effect on the availability of the set of servers. But SQL Server may be limited to a single precious process. If that process must be recycled, the server is unavailable. And recycling a database is more expensive than recycling a stateless ASP.NET worker process, because transaction logs must be replayed to move the database forwards or backwards to a consistent state.

The short answer is, ASP.NET didn’t have to do all the high tech fancy performance work. Whereas SQL Server was forced down this path by the nature of the product they must build.

Reliability

Well, if you haven’t read my earlier blogs on asynchronous exceptions, or if – like me – you read the Reliability blog back in June and don’t remember what it said – you might want to review it quickly at http://blogs.msdn.com/cbrumme/archive/2003/06/23/51482.aspx.

The good news is that we’ve revisited the rules for ThreadAbortException in Whidbey, so that there is now a way to abort a thread without disturbing any backout code that it is currently running. But it’s still the case that asynchronous exceptions can intrude at fairly arbitrary spots in the execution.

Anyway, the availability goals of SQL Server place some rather difficult requirements on the CLR. Sure, we were pretty solid in V1 and V1.1. We ran a ton of stress and – if you avoided stack overflow, running out of memory, and any asynchronous exceptions like Thread.Abort – we could run applications indefinitely. We really were very clean.

One problem with this is that “indefinitely” isn’t long enough for SQL Server.
They have a noble goal of chasing 5 9’s and you can’t get there with loose statements like “indefinitely”. Another problem is that we can no longer exclude OutOfMemoryException and ThreadAbortException from our reliability profile. We’ve already seen that SQL Server tries to use 100% of memory, without quite triggering paging. The effect is that SQL Server is always on the brink of being out of memory, so allocation requests are frequently being denied. Along the same lines, if the server is loaded it will allow itself to become over-committed on all resources. One strategy for backing out of an over-commitment is to abort a thread (i.e. kill a transaction) or possibly unload one or more AppDomains.

Despite this stressful abuse, at no time can the process terminate.

The first step to achieve this was to harden the CLR so that it was resilient to any resource failures. Fortunately we have some extremely strong testers. One tester built a system to inject a resource failure in every allocator, for every unique logical call stack. This tests every distinct backout path in the product. This technique can be used for unmanaged and managed (FX) code. That same tester is also chasing any unmanaged leaks by applying the principles of a tracing garbage collector to our unmanaged CLR data structures. This technique has already exposed a small memory leak that we shipped in V1 of the CLR – for the “Hello World” application!

With testers like that, you better have a strong development team too. At this point, I think we’ve annotated the vast majority of our unmanaged CLR methods with reliability contracts. These are a bit like Eiffel pre- and post-conditions and they provide machine-verifiable statements about each method’s behavior with respect to GC, exceptions, and other fundamental operations. These contracts can be used during test coverage (and, in some cases, during static scans of the binary images) to test for conformance.
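To make the fault-injection idea above concrete, here is a minimal managed sketch. It is not the CLR test team's actual system, which the post does not describe in detail. The names FaultInjector, Workload and RunUntilClean are hypothetical, and Environment.StackTrace is used as a stand-in for a "unique logical call stack" key. The allocator fails the first time it is reached from each distinct stack, and the driver re-runs the workload until a pass completes with no injected failure, checking after every failure that the backout code restored the invariant.

```csharp
using System;
using System.Collections.Generic;
using System.Runtime.CompilerServices;

// Hypothetical allocator wrapper: fails the first time it is reached from
// each distinct call stack, then lets later calls from that stack succeed.
static class FaultInjector
{
    static readonly HashSet<string> SeenStacks = new HashSet<string>();
    public static int InjectedFaults { get; private set; }

    public static byte[] Allocate(int size)
    {
        // Assumption: the textual stack trace approximates a logical call stack.
        string key = Environment.StackTrace;
        if (SeenStacks.Add(key))
        {
            InjectedFaults++;
            throw new OutOfMemoryException("injected allocation failure");
        }
        return new byte[size];
    }
}

static class Workload
{
    // Invariant under test: zero whenever Run is not executing.
    public static int LiveBuffers;

    [MethodImpl(MethodImplOptions.NoInlining)]
    static void Use(int size)
    {
        byte[] buffer = FaultInjector.Allocate(size);
        LiveBuffers++;
        try { buffer[0] = 1; }
        finally { LiveBuffers--; }   // the backout path
    }

    // NoInlining keeps these frames visible so the two paths have distinct stacks.
    [MethodImpl(MethodImplOptions.NoInlining)] static void PathA() { Use(16); }
    [MethodImpl(MethodImplOptions.NoInlining)] static void PathB() { Use(32); }

    public static void Run() { PathA(); PathB(); }
}

static class FaultInjectionDemo
{
    // Re-runs the workload until one pass sees no injected failure.
    public static int RunUntilClean()
    {
        int passes = 0;
        while (true)
        {
            passes++;
            try
            {
                Workload.Run();
                return passes;
            }
            catch (OutOfMemoryException)
            {
                if (Workload.LiveBuffers != 0)
                    throw new InvalidOperationException("backout left inconsistent state");
            }
        }
    }

    static void Main()
    {
        int passes = RunUntilClean();
        Console.WriteLine("passes: " + passes + ", injected faults: " + FaultInjector.InjectedFaults);
    }
}
```

Because every distinct stack fails exactly once, the loop terminates after at most one pass per call path plus a final clean pass. A real system would need a more robust stack key than a formatted string, and would cover every allocator rather than one wrapper.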
The bottom line is that the next release of CLR should be substantially more robust in the face of resource errors. Leaving aside stack overflows and focusing entirely on the unmanaged runtime, we are shooting for perfection. Even for stack overflow, we expect to get very, very close. And we have the mechanisms in place that allow us to be rigorous in chasing after these goals.

But what about all of the managed code?

Will FX be as robust as the unmanaged CLR? And how can we possibly hold 3rd party authors of stored procedures or user defined functions to that same high bar? We want to enable a broad class of developers to write this sort of code, and we cannot expect them to perform many hundreds of hours of stress testing and fault injection on each new stored procedure. If we’re chasing 5 9’s by requiring every external developer to write perfect code, we should just give up now.

Instead, SQL Server relies on something other than perfect code. Consider how SQL Server worked before it started hosting the CLR:

- The vast majority of execution inside SQL Server was via Transact SQL or TSQL. Any application written in TSQL is inherently scalable, fiber-aware, and robust in the face of resource errors. Any computation in TSQL can be terminated with a clean transaction abort.

- Unfortunately, TSQL isn’t expressive enough to satisfy all application needs. So the remaining applications were written in extended stored procedures or xprocs. These are typically unmanaged C++. Their authors must be extremely sophisticated, because they are responsible for integrating their execution with the unusual threading environment and resource rules that exist inside SQL Server. Throw in the rules for data access and security (which I won’t be discussing in this blog) and it takes superhuman knowledge and skill to develop a bug-free xproc.
In other words, you had a choice of well-behaved execution and limited expression (TSQL), or the choice of arbitrary execution coupled with a very low likelihood that you would get it right (xprocs).

One of the shared goals of the SQL Server and CLR teams in Whidbey was to eliminate the need for xprocs. We wanted to provide a spectrum of choices to managed applications. In Whidbey, that spectrum consists of three buckets for managed code:

Safe: Code in this bucket is the most constrained. In fact, the host constrains it beyond what the CLR would normally allow to code that’s only granted SecurityPermissionFlag.Execution. So this code must be verifiably typesafe and has a reduced grant set. But it is further constrained from defining mutable static fields, from creating or controlling threads, from using the threadpool, etc. The goal here is to guide the code to best practices for scalability and robustness within the SQL Server or similar hosted environments. In the case of SQL Server, this means that all state should be stored in the database and that concurrency is controlled through transactions against the data. However, it’s important to realize that these additional constraints are not part of the Security system and they may well be subvertible. The constraints are simply speedbumps (not roadblocks) which guide the application code away from potentially non-scalable coding techniques and which encourage best practices.

External Access: Code in this bucket should be sufficient for replacing most xprocs. Such code must also be verifiably typesafe, but it is granted some additional permissions. The exact set of permissions is presumably subject to change until Yukon ships, but it’s likely to allow access to the registry, the file system, and the network.

Unsafe: This is the final managed escape hatch for writing code inside SQL Server. This code does not have to be verifiable.
It has FullTrust (with the possible exception of UIPermission, which makes no sense within the database). This means that it can do anything the most arbitrary xproc can do. However, it is much more likely to work properly, compared to that xproc. First, it sits on top of a framework that has been designed to work inside the database. Second, the code has all the usual benefits of managed code, like a memory manager that’s based on accurate reachability rather than on programmer correctness. Finally, it is executing on a runtime that understands the host’s special rules for resource management, synchronization, threading, security, etc.

For code in the Safe bucket, you may be wondering how a host could constrain code beyond SecurityPermissionFlag.Execution. There are two techniques available for this:

1) Any assembly in the ‘Safe’ subset could be scanned by a host-provided pre-verifier, to check for any questionable programming constructs like the definition of mutable static fields, or the use of reflection. This raises the obvious question of how the host can interject itself into the binding process and guarantee that only pre-verified assemblies are loaded. The new Whidbey hosting APIs contain a Fusion loader hook mechanism, which allows the host to abstract the notion of an assembly store, without disturbing all our normal loader policy. You can think of this as the natural evolution of the AppDomain.AssemblyResolve event. SQL Server can use this mechanism to place all application assemblies into the database and then deliver them to the loader on demand. In addition to enabling pre-verification, the loader hooks can also be used to ensure that applications inside the database are not inadvertently broken or influenced by changes outside the database (e.g. changes to the GAC). In fact, you could even copy a database from one machine to another and theoretically this could automatically transfer all the assemblies required by that database.

2) The Whidbey hosting APIs provide controls over a new Host Protection Attribute (HPA) feature. Throughout our frameworks, we’ve decorated various unprotected APIs with an appropriate HPA. These HPAs indicate that the decorated API performs a sensitive operation like Synchronization or Thread Control. For instance, use of the ThreadPool isn’t considered a secure operation. (At some level, it is a risk for Denial of Service attacks, but DOS remains an open design topic for our managed platform). If code is running outside of a host that enables these HPAs, they have no effect. Partially trusted code, including code that only has Execution permission, can still call all these APIs. But if a host does enable these attributes, then code with insufficient trust can no longer call these APIs directly. Indirect calls are still permitted, and in this sense the HPA mechanism is similar to the mechanism for LinkDemands.

Although HPAs use a mechanism that is similar to LinkDemands, it’s very important to distinguish the HPA feature – which is all about programming model guidance – from any Security feature. A great way to illustrate this distinction is Monitor.Enter.

Ignoring HPAs, any code can call Monitor.Enter and use this API to synchronize with other threads. Naturally, SQL Server would prefer that most developers targeting their environment (including all the naïve ones) should rely on database locks under transaction control for this sort of thing. Therefore they activate the HPA on this class:

    [HostProtection(Synchronization=true, ExternalThreading=true)]
    public sealed class Monitor
    {
        ...
        [MethodImplAttribute(MethodImplOptions.InternalCall)]
        public static extern void Enter(Object obj);
        ...
    }

However, devious code in the ‘Safe’ bucket could use a Hashtable as an alternate technique for locking.
If you create a synchronized Hashtable and then perform inserts or lookups, your Object.Equals and GetHashCode methods will be called within the lock that synchronizes the Hashtable. The BCL developers were smart enough to realize this, and they added another HPA:

    public class Hashtable : IDictionary, ISerializable,
                             IDeserializationCallback, ICloneable
    {
        ...
        [HostProtection(Synchronization=true)]
        public static Hashtable Synchronized(Hashtable table) {
            if (table==null)
                throw new ArgumentNullException("table");
            return new SyncHashtable(table);
        }
        ...
    }

Are there other places inside the frameworks where it’s possible to trick an API into providing synchronization for its caller? Undoubtedly there are, but we aren’t going to perform exhaustive audits of our entire codebase to discover them all. As we find additional APIs, we will decorate them with HPAs, but we make no guarantees here.

This would be an intolerable situation for a Security feature, but it’s perfectly acceptable when we’re just trying to increase the scalability and reliability of naively written database applications.

Escalation Policy

I chose the HPA on System.Threading.Monitor for a reason, in the above example. If you’ve read my earlier blogs on Thread.Abort, you know that it’s dangerous to asynchronously abort another thread. That thread could be executing a class constructor, in which case that class is now unavailable throughout the AppDomain. That thread could be in the middle of an update to some shared application state, which would leave the application in an inconsistent state.

In V1 & V1.1, it was not really possible to write code that is robust in the face of asynchronous exceptions like Abort. In Whidbey, we’re now introducing some constructs (Constrained Execution Regions and Critical Finalization) which make it possible to do this.
I’m not going to discuss those constructs in this blog. But suffice it to say that, although it makes it possible to write entirely robust code, it doesn’t make it easy. Without a higher level programmatic construct, like transactions, it’s very difficult to write entirely robust code. You must acquire all the resources required for forward progress, tolerating exceptions during this acquisition phase. Then you enter a forward progress phase, which either cannot fail or which unconditionally triggers some compensating backout code upon failure. If compensation is triggered, it must guarantee that the system is returned to a consistent state before it completes.

If you’ve successfully written that sort of code, you know that it’s an onerous discipline. There’s no way that we can expect the greater population of developers to write large bodies of bug-free code based on this plan.

That’s why, in V1 & V1.1, we recommend either using Abort on the current thread (in which case it is not asynchronous) or we recommend using it in conjunction with an AppDomain.Unload (in which case any inconsistent application state is likely to be discarded).

In Whidbey, it is possible to avoid inducing asynchronous Aborts onto threads that are performing backout (i.e. filter, finally, catch or fault blocks) or that hold locks. Our definition of a lock is pretty broad. It includes execution of a class constructor, since all .cctor execution is synchronized according to elaborate rules by the CLR. It also includes Monitor.Enter, Mutex, ReaderWriterLock, etc. Finally, it includes any “hand-rolled” locks that you build, so long as you properly identify them to us.

Our rationale here is that any thread holding a lock may be updating shared state. If a thread isn’t holding a lock, then any update it performs against shared state must be atomic or at least it never leaves that shared state in an inconsistent state. This is strictly a heuristic, but it’s a pretty good one.
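The requirement that hand-rolled locks be identified to the runtime can be sketched as follows. The post does not name an API for this. In the .NET Framework 2.0 as it eventually shipped, Thread.BeginCriticalRegion and Thread.EndCriticalRegion serve this purpose, and the sketch below assumes they are the mechanism being alluded to. AnnouncedSpinLock and SpinLockDemo are hypothetical names used only for illustration.

```csharp
using System;
using System.Threading;

// A hand-rolled spin lock that brackets ownership with a critical region,
// so the runtime and host can treat the owner like any other lock holder.
sealed class AnnouncedSpinLock
{
    private int _held;   // 0 = free, 1 = held

    public void Enter()
    {
        // Announce that a failure from here on may affect shared state.
        Thread.BeginCriticalRegion();
        SpinWait spinner = new SpinWait();
        while (Interlocked.CompareExchange(ref _held, 1, 0) != 0)
            spinner.SpinOnce();
    }

    public void Exit()
    {
        Volatile.Write(ref _held, 0);
        Thread.EndCriticalRegion();
    }
}

static class SpinLockDemo
{
    // Increments a shared counter from several threads under the lock.
    public static int CountWithLock(int threads, int perThread)
    {
        AnnouncedSpinLock gate = new AnnouncedSpinLock();
        int counter = 0;
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++)
        {
            workers[i] = new Thread(() =>
            {
                for (int j = 0; j < perThread; j++)
                {
                    gate.Enter();
                    try { counter++; }
                    finally { gate.Exit(); }
                }
            });
            workers[i].Start();
        }
        foreach (Thread t in workers) t.Join();
        return counter;
    }

    static void Main()
    {
        Console.WriteLine(CountWithLock(4, 10000));   // prints 40000
    }
}
```

Outside a host that pays attention to critical regions, the two calls have no visible effect, so the lock behaves like any other spin lock. A production lock would also need to consider fairness and thread affinity, which this sketch ignores.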
If we believe this heuristic, it means that we can use Abort without consequently unloading an AppDomain, if that thread doesn’t hold any locks and isn’t performing any backout. And it just so happens that the bulk of all managed code executing inside SQL Server is in the ‘Safe’ subset – which coincidentally is highly discouraged via HPAs from taking or holding locks.

In other words, code in the ‘Safe’ subset can almost always take an asynchronous exception without affecting any of the execution on other threads in the same AppDomain. This is the case, even though that code was written by developers who don’t understand the deep issues involved with asynchronous exceptions. It further means that if we should catch such a thread at a point where it isn’t safe to inject an asynchronous exception without also unloading the AppDomain, we can identify this window. Once this window is identified, we can either hold off from injecting the exception until this unsafe window has closed, or we can unload the entire AppDomain to eliminate the application inconsistency. The host can decide whether to hold off on the injection or alternatively to proceed with an AppDomain unload, based on criteria like how resource-constrained the host is.

The hosting APIs for making these decisions imperatively would be rather complicated. So the Whidbey hosting APIs provide a declarative mechanism called an escalation policy. This allows the host to express transitions and timeouts that take effect during error conditions. For instance, SQL Server might state that any attempt to Abort a thread should delay if the victim thread holds a lock. But if that delay exceeds 30 seconds, the Abort attempt should be escalated to an AppDomain.Unload. Of course, the feature is more general than SQL Server’s needs. Indeed, the V1 ASP.NET process recycling feature should now be expressible as a particular Whidbey escalation policy.
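As a rough illustration of what "declarative" means here, the SQL Server example above can be modeled as data plus a small decision function. This sketch does not mirror the real hosting interfaces, which are unmanaged and which this post deliberately does not describe. HostAction, AbortEscalationPolicy and Decide are hypothetical names.

```csharp
using System;

enum HostAction { Wait, AbortThread, UnloadAppDomain }

// Hypothetical model of one escalation rule: delay an abort while the victim
// holds a lock, and escalate once the delay exceeds a timeout.
sealed class AbortEscalationPolicy
{
    public TimeSpan LockHeldTimeout { get; }
    public HostAction EscalatedAction { get; }

    public AbortEscalationPolicy(TimeSpan lockHeldTimeout, HostAction escalatedAction)
    {
        LockHeldTimeout = lockHeldTimeout;
        EscalatedAction = escalatedAction;
    }

    // Decides what to do with a pending abort, given the victim's state
    // and how long the abort has already been held off.
    public HostAction Decide(bool victimHoldsLock, TimeSpan waited)
    {
        if (!victimHoldsLock)
            return HostAction.AbortThread;
        return waited <= LockHeldTimeout ? HostAction.Wait : EscalatedAction;
    }
}

static class EscalationDemo
{
    static void Main()
    {
        // The example from the text: 30 seconds, then unload the AppDomain.
        AbortEscalationPolicy policy =
            new AbortEscalationPolicy(TimeSpan.FromSeconds(30), HostAction.UnloadAppDomain);

        Console.WriteLine(policy.Decide(false, TimeSpan.Zero));             // AbortThread
        Console.WriteLine(policy.Decide(true, TimeSpan.FromSeconds(5)));    // Wait
        Console.WriteLine(policy.Decide(true, TimeSpan.FromSeconds(31)));   // UnloadAppDomain
    }
}
```

The point of the declarative form is that the host states the rule once, and the runtime applies it at the moment an abort is requested, rather than the host having to observe lock state and timers itself.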
Winding down

As usual, I didn’t get around to many of the interesting topics. For instance, those guidelines on when and how to host are noticeably absent. And I didn’t explain how to do any simple stuff, like picking concurrent vs. non-concurrent vs. server GC. The above text is completely free of any specific details of what our hosting APIs look like (partly because they are subject to change until Whidbey ships). And I didn’t touch on any hosting topics outside of the hosting APIs, like all of the AppDomain considerations. As you can imagine, there’s also plenty I could have said about Security. For instance, the hosting APIs allow the host to participate in role-based security and impersonation of Windows identities… Oh well.

Fortunately, one of the PMs involved in the Whidbey hosting effort is apparently writing a book on the general topic of hosting. Presumably all these missing topics will be covered there. And hopefully he won’t run into the same issues with writer’s block that I experienced on this topic.

(Indeed, the event that ultimately resolved my writer’s block was that my wife got the flu. When she’s not around, my weekends are boring enough for me to think about work. The reason I’m posting two blogs this weekend is that Kathryn has gone to Maui for the week and has left me behind.)

Finally, the above blog talks about SQL Server a lot.

Hopefully it’s obvious that the CLR wants to be a great execution environment for a broad set of servers. In V1, we focused on ASP.NET. Based on that effort, we automatically worked well in many other servers with no additional work. For example, EnterpriseServices dropped us into their server processes simply by selecting the server mode of our GC. Nothing else was required to get us running efficiently. (Well, we did a ton of other work in the CLR to support EnterpriseServices.
But that work was related to the COM+ programming model and infrastructure, rather than their server architecture. We had to do that work whether we ran in their server process or were instead loading EnterpriseServices into the ASP.NET worker process or some other server).

In Whidbey we focused on extending the CLR to meet SQL Server’s needs. But at every opportunity we generalized SQL Server’s requirements and tried to build something that would be more broadly useful. Just as our ASP.NET work enabled a large number of base server hosting scenarios, we hope that our SQL Server work will enable a large number of advanced server hosting scenarios.

If you have a “commercially significant” hosting problem, whether on the server or the client, and you’re struggling with how to incorporate managed code, I would be interested in hearing from you directly. Feel free to drop me an email with the broad outline of what you are trying to achieve, and I’ll try to get you supported. That support might be something as lame as some suggestions from me on how I would tackle the problem. Or at the other extreme, I could imagine more formal support and conceivably some limited feature work. That other extreme really depends on how commercially significant your product is and on how well our business interests align. Obviously decisions like that are far outside my control, but I can at least hook you up with the right people if this seems like a sensible approach.

Okay, one more ‘Finally’. From time to time readers of my blog send me emails asking if there are jobs available on the CLR team. At this moment, there are. Drop me an email if you are interested. It’s an extremely challenging team to work on, but the problems are truly fascinating.
Finalization
21/2/2004 external link
Earlier this week, I wrote an internal email explaining how Finalization works in V1 / V1.1, and how it has been changed for Whidbey. There’s some information here that folks outside of Microsoft might be interested in.

Costs

Finalization is expensive. It has the following costs:

1) Creating a finalizable object is slower, because each such object must be placed on a RegisteredForFinalization queue. In some sense, this is a bit like having an extra pointer-sized field in your object that the system initializes for you. However, our current implementation uses a slower allocator for every finalizable object, and this impact can be measured if you allocate small objects at a high rate.

2) Each GC must do a weak pointer scan of this queue, to find out whether any finalizable objects are now collectible. All such objects are then moved to a ReadyToFinalize queue. The cost here is small.

3) All objects in the ReadyToFinalize queue, and all objects reachable from them, are then marked. This means that an entire graph of objects which would normally die in one generation can be promoted to the next generation, based on a single finalizable root to this graph. Note that the size of this graph is potentially huge.

4) The older generation will be collected at some fraction of the frequency of the younger generation. (The actual ratio depends on your application, of course). So promotion of the graph may have increased the time to live of this graph by some large multiple. For large graphs, the combined impact of this item and #3 above will dominate the total cost of finalization.

5) We currently use a single high priority Finalizer thread to walk the ReadyToFinalize queue. This thread dequeues each object, executes its Finalize method, and proceeds to the next object. This is the one cost of finalization which customers actually expect.

6) Since we dedicate a thread to calling finalizers, we inflict an expense on every managed process. This can be significant in Terminal Server scenarios where the high number of processes multiplies the number of finalizer threads.

7) Since we only use a single thread for finalization, we are inherently non-scalable if a process is allocating finalizable objects at a high rate. One CPU performing finalization might not keep up with 31 other CPUs allocating those finalizable objects.

8) The single finalizer thread is a scarce resource. There are various circumstances where it can become blocked indefinitely. At that point, the process will leak resources at some rate and eventually die. See http://blogs.msdn.com/cbrumme/archive/2004/02/02/66219.aspx for extensive details.

9) Finalization has a conceptual cost to managed developers. In particular, it is difficult to write correct Finalize methods as I shall explain.

Eventually we would like to address #5 through #8 above by scheduling finalization activity over our ThreadPool threads. We have also toyed with the idea of reducing the impact of #3 and #4 above, by pruning the graph based on reachability from your Finalize method and any code that it might call. Due to indirections that we cannot statically explore, like interface and virtual calls, it’s not clear whether this approach will be fruitful. Also, this approach would cause an observable change in behavior if resurrection occurs. Regardless, you should not expect to see any of these possible changes in our next release.

Reachability

One of the guidelines for finalization is that a Finalize method shouldn’t touch other objects. People sometimes incorrectly assume that this is because those other objects have already been collected. Yet, as I have explained, the entire reachable graph from a finalizable object is promoted.

The real reason for the guideline is to avoid touching objects that may have already been finalized.
That’s because finalization is unordered.   So, like most guidelines, this one is made to be broken under certain circumstances.  For instance, if your object “contains” a private object that is not itself finalizable, clearly you can refer to it from your own Finalize method without risk.   In fact, a sophisticated developer might even create a cycle between two finalizable objects and coordinate their finalization behavior.  Consider a buffer and a file.  The Finalize method of the buffer will flush any pending writes.  The Finalize method of the file will close the handle.  Clearly it’s important for the buffer flush to precede the handle close.  One legitimate but brittle solution is to create a cycle of references between the buffer and the file.  Whichever Finalize method is called first will execute a protocol between the two objects to ensure that both side effects happen in order.  The subsequent Finalize call on the second object should do nothing.   I should point out that Whidbey solves the buffer and file problem differently, relying on the semantics of critical finalization.  And I should also point out that any protocol for sequencing the finalization of two objects should anticipate that one day we may execute these two Finalize methods concurrently on two different threads.  In other words, the protocol must be thread-safe.     Ordering   This raises the question of why finalization is unordered.   In many cases, no natural order is even possible.  Finalizable objects often occur in cycles.  You could imagine decorating some references between objects, to indicate the direction in which finalization should proceed.  This would add a sorting cost to finalization.  It would also cause complexity when these decorated references cross generation boundaries.  And in many cases the decorations would not fully eliminate cycles.  
This is particularly true in component scenarios, where no single developer has sufficient global knowledge to create an ordering:   Your component would achieve its guarantees when tested by you, prior to deployment.  Then in some customer application, additional decorated references would create cycles and your guarantees would be lost.  This is a recipe for support calls and appcompat issues.   Unordered finalization is substantially faster.  Not only do we avoid sorting (which might involve metadata access and marking through intermediate objects), but we can also efficiently manage the RegisteredForFinalization and ReadyToFinalize queues without ever having to memcpy.  Finally, there’s value in forcing developers to write Finalize methods with minimal dependencies on any other objects.  This is key to our eventual goal of making Finalization scalable by distributing it over multiple threads.   Based on the above and other considerations like engineering complexity, we made a conscious decision that finalization should be unordered.     Partial Trust   There are no security permissions associated with the definition of a Finalize method.  As we’ve seen, it’s possible to mount a denial of service attack via finalization.  However, many other denial of service attacks are possible from partial trust, so this is uninteresting.   Customers and partners sometimes ask why partially trusted code is allowed to participate in finalization.  After all, Finalize methods are typically used to release unmanaged resources.  Yet partially trusted code doesn’t have direct access to unmanaged resources.  It must always go through an API provided by an assembly with UnmanagedCodePermission or some other effective equivalent to FullTrust.   The reason is that finalization can also be used to control pure managed resources, like object pools or caches.  I should point out that techniques based on weak handles can be more efficient than techniques based on finalization.  
Nevertheless, it’s quite reasonable for partially trusted code to use finalization for pure managed resources.

SQL Server has a set of constraints that it places on partially trusted assemblies that are loaded into its environment.  I believe that these constraints prevent definition of static fields (except for initonly and literal static fields), use of synchronization, and the definition of Finalize methods.  However, these constraints are not related to security.  Rather, they are to improve scalability and reliability of applications by simplifying the threading model and by moving all shared state into the database where it can be transacted.

It’s hard to implement Finalize perfectly

Even when all Finalize methods are authored by fully trusted developers, finalization poses some problems for processes with extreme availability requirements, like SQL Server.  In part, this is because it’s difficult to write a completely reliable Finalize method – or a completely reliable anything else.

Here are some of the concerns specifically related to finalization.  I’ll explain later how some of these concerns are addressed in the context of a highly available process like SQL Server.

Your Finalize method must tolerate partially constructed instances

It’s possible for partially trusted code to subtype a fully trusted finalizable object (with APTCA) and throw an exception from the constructor.  This can be done before chaining to the base class constructor.  The result is that a zero-initialized object is registered for finalization.

Even if partially trusted code isn’t intentionally causing finalization of your partially constructed instances, asynchronous problems like StackOverflowException, OutOfMemoryException or AppDomainUnloadedException can cause your constructor to be interrupted at a fairly arbitrary location.
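To make the partially constructed case concrete, here is a toy sketch in Python rather than CLR code.  CPython happens to share the relevant property: a finalizer can run on an object whose constructor threw before assigning any fields.  The ToyResource class and its members are invented for this illustration and are not part of any real library.

```python
import gc


class ToyResource:
    """Hypothetical wrapper around a handle; not a CLR or library type."""

    released = []        # handles that the finalizer has released
    finalizer_runs = 0   # finalizer invocations that completed cleanly

    def __init__(self, handle, fail_early=False):
        if fail_early:
            # Simulates a constructor interrupted before any field is set.
            raise ValueError("constructor interrupted")
        self.handle = handle

    def __del__(self):
        # The instance may be only partially constructed, so never assume
        # that the 'handle' attribute exists.
        handle = getattr(self, "handle", None)
        if handle is not None:
            ToyResource.released.append(handle)
        ToyResource.finalizer_runs += 1


def demo():
    ToyResource.released.clear()
    ToyResource.finalizer_runs = 0
    try:
        ToyResource(1, fail_early=True)   # finalizer still runs on this one
    except ValueError:
        pass
    good = ToyResource(2)
    del good
    gc.collect()  # keeps the demo explicit; CPython refcounting usually suffices
    return ToyResource.released, ToyResource.finalizer_runs
```

Without the getattr guard, the finalizer of the failed instance would itself fail on the missing field, which is the kind of bug the guideline above is warning about.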
Your Finalize method must consider the consequence of failure   It’s possible for partially trusted code to subtype a fully trusted finalizable object (with APTCA) and fail to chain to the base Finalize method.  This causes the fully trusted encapsulation of the resource to leak.   Even if partially trusted code isn’t intentionally causing finalization of your object to fail, the aforementioned asynchronous exceptions can cause your Finalize method to be interrupted at a fairly arbitrary location.   In addition, the CLR exposes a GC.SuppressFinalize method which can be used to prevent finalization of any object.  Arguably we should have made this a protected method on Object or demanded a permission, to prevent abuse of this method.  However, we didn’t want to add a member to Object for such an obscure feature.  And we didn’t want to add a demand, since this would have prevented efficient implementation of IDisposable from partial trust.     Your object is callable after Finalization   We’ve already seen how all the objects in a closure can access each other during finalization.  Indeed, if any one of those objects re-establishes its reachability from a root (e.g. it places itself into a static field or a handle), then all the other objects it reaches will also become re-established.  This is referred to as resurrection.  If you have a finalizable object that is publicly exposed, you cannot prevent your object from becoming resurrected.  You are at the mercy of all the other objects in the graph.   One possible solution here is to set a flag to indicate that your object has been finalized.  You can pepper your entire API with checks to this flag, throwing an ObjectDisposedException if you are subsequently called.  Yuck.     Your object is callable during Finalization   It’s true that the finalizer thread is currently single-threaded (though this may well change in the future).  
And it’s true that the finalizer thread will only process instances that – at some point – were discovered to be unreachable from the application.  However, the possibility of resurrection means that your object may become visible to the application before its Finalize method is actually called.  This means that application threads and the finalizer thread can simultaneously be active in your object.   If your finalizable object encapsulates a protected resource like an OS handle, you must carefully consider whether you are exposed to threading attacks.  Shortly before we shipped V1, we fixed a number of handle recycling attacks that were due to race conditions between the application and Finalization.  See http://blogs.msdn.com/cbrumme/archive/2003/04/19/51365.aspx for more details.     Your Finalizer could be called multiple times   Just as there is a GC.SuppressFinalize method, we also expose a GC.ReRegisterForFinalize method.  And the same arguments about protected accessibility or security demands apply to the ReRegisterForFinalize method.     Your Finalizer runs in a delicate security context   As I’ve explained in prior blogs, the CLR flows the compressed stack and other security information around async points like ThreadPool.QueueUserWorkItem or Control.BeginInvoke.  Indeed, in Whidbey we include more security information by default.  However, we do not flow any security information from an object’s constructor to an object’s Finalize method.  So (to use an absurd example) if you expose a fully trusted type that accepts a filename string in its constructor and subsequently opens that file in its Finalize method, you have created a security bug.     Clearly it’s hard to write a correct Finalize method.  And the managed platform is supposed to make hard things easier.  I’ll return to this when I discuss the new Whidbey features of SafeHandles, Critical Finalizers and Constrained Execution Regions.   
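The flag-based defense against use after finalization, mentioned above under resurrection, can be sketched with a toy model.  This is Python, not CLR code, and it relies on CPython 3.4 or later, where the collector runs the finalizers of every object in an unreachable cycle before checking whether any of them was resurrected.  The class names, the survivors list (standing in for a static field) and RuntimeError (standing in for ObjectDisposedException) are all assumptions made for the example.

```python
import gc

survivors = []  # plays the role of a static field that can root objects


class Resurrector:
    """Hypothetical neighbor in the graph that resurrects itself."""

    def __init__(self):
        self.partner = None

    def __del__(self):
        # Re-establish reachability from a root: everything this object
        # references becomes reachable again too.
        survivors.append(self)


class GuardedResource:
    """Hypothetical finalizable object that defends its API with a flag."""

    def __init__(self):
        self.partner = None
        self._finalized = False

    def __del__(self):
        self._finalized = True   # the real cleanup would happen here

    def read(self):
        if self._finalized:
            raise RuntimeError("object used after finalization")
        return "data"


def demo():
    a, b = Resurrector(), GuardedResource()
    a.partner, b.partner = b, a   # cycle, so both die (and finalize) together
    del a, b
    gc.collect()
    # The resource is reachable again through the resurrected neighbor.
    return survivors[-1].partner
```

The resource returned by demo() has already been finalized, yet application code can still call it; the flag check in read() is the only thing standing between the caller and a released resource.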
But what guarantees do I get if I don’t use any of those new gizmos?  What happens in a V1 or V1.1 process?

V1 & V1.1 Finalization Guarantees

If you allocate a finalizable object, we guarantee that it will be registered for finalization.  Once this has happened, there are several possibilities:

1)   As part of the natural sequence of garbage collection and finalization, the finalizer thread dequeues your object and finalizes it.

2)   The process can terminate without cooperating with the CLR’s shutdown code.  This can happen if you call TerminateProcess or ExitProcess directly.  In those cases, the CLR’s first notification of the shutdown is via a DllMain DLL_PROCESS_DETACH notification.  It is not safe to call managed code at that time, and we will leak all the finalizers.  Of course, the OS will do a fine job of reclaiming all its resources (including abandonment of any cross-process shared resources like Mutexes).  But if you needed to flush some buffers to a file, your final writes have been lost.

3)   The process can terminate in a manner that cooperates with the CLR’s shutdown code.  This includes calling exit() or returning from main() in any unmanaged code built with VC7 or later.  It includes System.Environment.Exit().  It includes a shutdown triggered from a managed EXE when all the foreground threads have completed.  And it includes shutdown of processes that are CLR-aware, like Visual Studio.  In these cases, the CLR attempts to drain both the ReadyToFinalize and the RegisteredForFinalization queues, processing all the finalizable objects.

4)   The AppDomain containing your object is unloaded.  Prior to Whidbey, the AppDomain will not unload until we have scanned the ReadyToFinalize and the RegisteredForFinalization queues, processing all the finalizable objects that live in the doomed AppDomain.

There are a few points to note here.

·      Objects are always finalized in the AppDomain they were created in.  
A special case exists for any finalizable objects that are agile with respect to AppDomains.  To my knowledge, the only such type that exists is System.Threading.Thread.

·      I have heard that there is a bug in V1 and V1.1, where we get confused on AppDomain transitions in the ReadyToFinalize queue.  The finalization logic attempts to minimize AppDomain transitions by noticing natural partitions in the ReadyToFinalize queue.  I’m told there is a bug where we may occasionally skip finalizing the first object of a partition.  I don’t believe any customers have noticed this and it is fixed in Whidbey.

·      Astute readers will have noticed that during process shutdown and AppDomain unloading we actually finalize objects in the RegisteredForFinalization queue.  Such objects are still reachable and would not normally be subject to finalization.  Normally a Finalize method can rely on safely accessing finalizable state that is rooted via statics or some other means.  You can detect when this is no longer safe by checking AppDomain.IsFinalizingForUnload or Environment.HasShutdownStarted.

·      Since there is no ordering of finalization, critical infrastructure is being finalized along with application objects.  This means that WaitHandles, remoting infrastructure and even security infrastructure are disappearing underneath you.  This is a potential security concern and a definite reliability concern.  We have spot-fixed a few cases of this.  For example, we never finalize our Thread objects during process shutdown.

·      Finalization during process termination will eventually time out.  If a particular Finalize method gets stuck, or if the queue isn’t reducing in size over time (i.e. you create 2 new finalizable instances out of each execution of your Finalize method), we will eventually time out and terminate the process.  The exact timeouts depend on whether a profiler is attached and other details.
·      The thread that initiates process shutdown performs the duties of “watchdog.”  It is responsible for detecting timeouts during process termination.  If this thread is an STA thread, we cause it to pump COM calls in and out of the STA while it blocks as watchdog.  If the application has a deadlock that implicates the STA thread while it is executing these unmanaged COM calls, then the timeout mechanism is defeated and the process will hang.  This is fixed in Whidbey.   ·      Subject to all of the above, we guarantee that we will dequeue your object and initiate a call to the Finalize method.  We do not guarantee that your Finalize method can be JITted without running out of stack or memory.  We do not guarantee that the execution of your Finalize method will complete without being aborted.  We do not guarantee that any types you require can be loaded and have their .cctors run.  All you get is a “best effort” attempt.  We’ll soon see how Whidbey extensions allow you to do better than this and guarantee full execution.   ·      (If you want to know more about the shutdown of managed processes, see http://blogs.msdn.com/cbrumme/archive/2003/08/20/51504.aspx.)     SafeHandle   Whidbey contains some mechanisms that address many of the V1 and V1.1 issues with finalization.  Let’s start with SafeHandle, since it’s the easiest to understand.  Conceptually, this is just an encapsulation of an OS handle.  You should read the documentation of this feature for details.  Briefly, SafeHandle provides the following benefits:   1)   Someone else wrote it and is maintaining it.  Using it is much easier than building equivalent functionality yourself.   2)   It prevents races between an application thread and the finalizer thread in unmanaged code.  And it does this in a manner that leverages the type system.  Specifically, clients are forced to deal with SafeHandles rather than IntPtrs or value types which don’t have strong identity and lifetime semantics.   
3)   It prevents handle-recycling attacks.  You can read more details about finalization races (#2 above) and this bullet on handle-recycling attacks by reading http://blogs.msdn.com/cbrumme/archive/2003/04/19/51365.aspx.  In that blog from last April, I allude to the existence of SafeHandle without giving details.   4)   It discourages promotion of large graphs of objects, by placing the finalizable resources in a tiny leaf instance.   5)   It participates with the PInvoke marshaler to ensure that unmarshaled instances will be registered for finalization.   6)   For the handful of bizarre APIs that aren’t covered by our standard marshaling styles, Constrained Execution Regions (CERs) can be used to guarantee that unmarshaled instances will be registered for finalization.   7)   It uses the new Critical Finalization mechanism to guarantee that leaks cannot occur.  This means that we not only guarantee we will initiate execution of your Finalize method, but we also make some strong guarantees that allow you to ensure that it actually completes execution.   8)   In order to guarantee that there will be no leaks, we necessarily leave the system open to denial of service and hangs.  This is the halting problem.  The Critical Finalization mechanism addresses this dilemma by making the leak protection explicit, restricting it to small regions of carefully written code, and by using the security system.  Only trusted code can achieve strong guarantees about leakage.  Such code is trusted not to create denial of service problems, whether maliciously or inadvertently, over small blocks of explicitly identified code.   9)   Since SafeHandle uses Critical Finalization, it solves the problem of sequencing buffer flushing before handle closing that I mentioned earlier.   So what is this Critical Finalization thing?     Critical Finalization (CF) and CERs   Any object that derives from CriticalFinalizerObject (including SafeHandle) obtains critical finalization.  
This means:

1)   Before any objects of this type are created, the CLR will “prepare” any resources that will be necessary for the Finalize method to run.  Preparation includes JITting the code, running class constructors and – most importantly – traversing the static reachability of other methods and types that will be required during execution and making sure that they are likewise prepared.  However, the CLR cannot statically traverse through indirections like interface calls and virtual calls.  So there is a mechanism for the developer to guide the CLR through these opaque indirections.

2)   The CLR will never time out on the execution of one of these Finalize methods.  As I mentioned, we rely on the limited amount of code written via this discipline combined with the trust decisions of the security system to avoid hangs here.

3)   When the Finalize method is called, it is called in a protected state that prevents the CLR from injecting Thread.Aborts or other optional asynchronous exceptions.  Because of our special preparation, we also prevent other asynchronous exceptions like OutOfMemoryExceptions due to JITting or type loading and TypeInitializationExceptions due to .cctor failures.  Of course, if the application tries to allocate an object it may get an OutOfMemoryException.  This is application-induced rather than system-induced and therefore is not considered the CLR’s responsibility.  The Finalize method can use standard exception handling to protect itself here.

4)   All normal finalizable objects are either executed or discarded without finalization, before any critical finalizers are executed.  This means that a buffer flush can precede the close of the underlying handle.

The first 3 bullet points above are not restricted to CF.  These bullet points apply to all CERs.  
The fundamental difference between CF and other CERs is the funky flow control from the instantiation of an object to the execution of its Finalize method via registration on our finalization queues.  Other CERs can use normal block scopes in a single method to express the same reliability concepts.  For normal CERs, the preparation phase, the forward execution phase and the backout phases are all contained in a single method.   A full description of CERs is beyond the scope of a note that is ostensibly about finalization.  However, a brief description makes sense.   Essentially, CERs address issues with asynchronous exceptions.  I have already mentioned asynchronous exceptions, which is the CLR’s term for all the pesky problems that manifest themselves as surprising exceptions.  These are distinct from the application-level exceptions, which presumably are anticipated by and handled by the application.   You can read about asynchronous exceptions and the novel problems introduced by a managed execution environment that virtualizes resources so aggressively at http://blogs.msdn.com/cbrumme/archive/2003/06/23/51482.aspx.   In V1 and V1.1, the CLR does a poor job of distinguishing asynchronous exceptions from application exceptions.  In Whidbey, we are starting to make this separation but it remains one of the weak design points for our hosting and exception stories.   Anyway, I’m sure that many readers are familiar with the difficulty of writing reliable unmanaged code that is guaranteed to complete in the face of limited resources (e.g. memory or stack), free threading, and other facts of life.  And by now, if you’ve read all the blog articles I’ve mentioned, you are also familiar with the additional problems caused by a highly virtualized execution environment.   CERs allow you to declare regions of code where the CLR is constrained from injecting any system-generated errors.  
And the author of the code is constrained from performing certain actions if he wants to avoid additional exceptions.  An obvious example is that he shouldn’t new up an object if he is not prepared to deal with an OutOfMemoryException from that operation.   In addition to CERs, Whidbey provides reliability contracts.  These contracts can be used to annotate methods with their guarantees and requirements with respect to reliability.  Using these contracts, it’s possible to compose reliable execution out of components written by different authors.  This is necessary for building reliable applications that make use of framework services.  If the reliability requirements and guarantees of the framework services were not themselves explicit, the client applications could not remain reliable on top of them.     Finalization in SQL Server and other high availability hosts   Back to finalization.   In a normal unhosted process, there isn’t a strong distinction between normal and critical finalization.  Normal processes won’t run out of memory, and if they do they should probably Fail Fast.  It’s unlikely that the risk of trying to continue execution after resource exhaustion is worth the increased risk of subsequent crashes, hangs or other corruptions.  Normal processes won’t experience Thread.Aborts that are injected across threads.  (As opposed to aborting the current thread, which is no more dangerous than throwing any other exception).   So the only real concern is whether all the finalizable objects will drain during process exit, before the timeouts kick in.  The timeouts are quite generous and in practice this is not a concern.   However, a hosted process like SQL Server is quite different.  Because of SQL Server’s availability requirements, it is vital that the process not FailFast for something innocuous like OutOfMemoryExceptions.  
Indeed, SQL Server tries to run on the brink of memory exhaustion for performance reasons, so these memory exceptions are a constant fact of life in that environment.  Furthermore, SQL Server uses Thread.Abort explicitly across threads to terminate long-running requests and it uses Thread.Abort implicitly to unload AppDomains.  On a heavily loaded system, AppDomains may be unloaded to relieve resource pressure.   I have a lengthy blog on this topic, but I have not been able to post it because it talks about undisclosed Whidbey features.  At some point (no later than shipping Beta1), you will find it at http://blogs.msdn.com/cbrumme with a title of Hosting.  Until then, I’ll just mention that the Whidbey APIs support an escalation policy.  This is a declarative mechanism by which the host can express timeouts for normal finalization, normal AppDomain unload, normal Abort attempts, etc.  In addition to timeouts, the escalation policy can indicate appropriate actions whenever these timeouts expire.  So a normal AppDomain unload could (for example) be escalated to a rude AppDomain unload or a normal process exit or a rude process exit.   The distinction between polite/normal and rude involves several aspects beyond finalization.  If we just consider finalization, polite/normal means that we execute both normal and critical finalization.  Contrast this with a rude scenario where we will ignore the normal finalizers, which are discarded, and only execute the critical finalizers.  As you might expect, a similar distinction occurs between executing normal exception backout on threads, vs. restricting ourselves to any backout that is associated with CERs.   This allows a host to avoid solving the halting problem when performing normal finalization and exception backout, without putting the process at risk with respect to (critical) resource leakage or inconsistent state.
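The difference between polite and rude finalization can be modeled in a few lines.  The sketch below is a toy in Python, not the Whidbey hosting API: ToyHandle stands in for a critically finalizable handle wrapper, ToyBuffer for an object with a normal finalizer, and drain() for whatever the runtime does at unload or exit.  All of these names are hypothetical.

```python
class ToyHandle:
    """Hypothetical stand-in for a handle with a critical finalizer."""
    critical = True

    def __init__(self):
        self.is_open = True
        self.written = []     # data that actually reached the "file"

    def finalize(self):
        self.is_open = False  # close the handle


class ToyBuffer:
    """Hypothetical stand-in for a buffer with a normal finalizer."""
    critical = False

    def __init__(self, handle, pending):
        self.handle = handle
        self.pending = list(pending)

    def finalize(self):
        # Flushing only succeeds while the underlying handle is still open.
        if self.handle.is_open:
            self.handle.written.extend(self.pending)
        self.pending = []


def drain(objects, rude=False):
    """Run finalizers: normal ones first (unless rude), critical ones last."""
    normal = [o for o in objects if not o.critical]
    critical = [o for o in objects if o.critical]
    if not rude:
        for obj in normal:
            obj.finalize()
    # Critical finalizers always run, and only after the normal ones.
    for obj in critical:
        obj.finalize()


def demo(rude):
    handle = ToyHandle()
    buf = ToyBuffer(handle, ["a", "b"])
    # List the handle first to show that queue order does not matter.
    drain([handle, buf], rude=rude)
    return handle.written, handle.is_open
```

In the polite case the pending writes reach the handle before it closes.  In the rude case the writes are lost, but the handle itself is still closed, so nothing leaks.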
Apartments and Pumping in the CLR
2/2/2004 external link
I’ve already written the much-delayed blog on Hosting, but I can’t post it yet because it mentions a couple of new Whidbey features, which weren’t present in the PDC bits.  Obviously Microsoft doesn’t want to make product disclosures through my random blog articles.

I’m hoping this will be sorted out in another week or two.

While we’re waiting, I thought I would talk briefly(!) about pumping and apartments.  The CLR made some fundamental decisions about OLE, thread affinity, reentrancy and finalization.  These decisions have a significant impact on program correctness, server scalability, and compatibility with legacy (i.e. unmanaged) code.  So this is going to be a blog like the one on Shutdown from last August (see http://blogs.msdn.com/cbrumme/archive/2003/08/20/51504.aspx).  There will be more detail than you probably care to know about one of the more frustrating parts of the Microsoft software stack.

First, an explanation of my odd choice of terms.  I’m using OLE as an umbrella which includes the following pieces of technology:

·      COM – the fundamental object model, like IUnknown and IClassFactory

·      DCOM – remoting of COM using IDL, NDR pickling and the SCM

·      Automation – IDispatch, VARIANT, Type Libraries, etc.

·      ActiveX – Protocols for controls and their containers

Next, some disclaimers:

I am not and have never been a GUI programmer.  So anything I know about Windows messages and pumping is from debugging GUI applications, not from writing them.  I’m not going to talk about WM_PENCTL notifications or anything else that requires UI knowledge.

Also, I’m going to point out a number of problems with OLE and apartments.  The histories of the CLR and OLE are closely related.  In fact, at one point COM+ 1.0 was known internally as COM98 and the CLR was known internally as COM99.  We had some pretty aggressive ship targets back then!

In general, I love OLE and the folks who work on it.  
Although it is inappropriate for the Internet, DCOM is still the fastest and most enterprise-ready distributed object system out there.  In a few ways the architecture of .NET Remoting is superior to DCOM, but we never had the time or resources to even approach the engineering effort that has gone into DCOM.  Presumably Indigo will eventually change this situation.  I also love COM’s strict separation of contract from implementation, the ability to negotiate for contracts, and so much more.   The bottom line is that OLE has had at least as much impact on Microsoft products and the industry, in its day, as .NET is having now.   But, like anything else, OLE has some flaws.  In contrast to the stark architectural beauty of COM and DCOM, late-bound Automation is messy.  At the time this was all rolled out to the world, I was at Borland and then Oracle.  As an outsider, it was hard for me to understand how one team could have produced such a strange combination.   Of course, Automation has been immensely successful – more successful than COM and DCOM.  My aesthetic taste is clearly no predictor of what people want.  Generally, people want whatever gets the job done, even if it does so in an ad hoc way.  And Automation has enabled an incredible number of application scenarios.     Apartments   If there’s another part of OLE that I dislike, it’s Single Threaded Apartments.  Presumably everyone knows that OLE offers three kinds of apartments:   Single Threaded Apartment (STA) – one affinitized thread is used to call all the objects residing in the apartment.  Any call on these objects from other threads must perform cross-thread marshaling to this affinitized thread, which dispatches the call.  Although a process can have an arbitrary number of STAs (with a corresponding number of threads), most client processes have a single Main STA and the GUI thread is the affinitized thread that owns it.   Multiple Threaded Apartment (MTA) – each process has at most one MTA at a time. 
 If the current MTA is not being used, OLE may tear it down.  A different MTA will be created as necessary later.  Most people think of the MTA as not having thread affinity.  But strictly speaking it has affinity to a group of threads.  This group is the set of all the threads that are not affinitized to STAs.  Some of the threads in this group are explicitly placed in the MTA by calling CoInitializeEx.  Other threads in this group are implicitly in the MTA because the MTA exists and because these threads haven’t been explicitly placed into STAs.  So, by the strict rules of OLE, it is not legal for STA threads to call on any objects in the MTA.  Instead, such calls must be marshaled from the calling STA thread over to one of the threads in the MTA before the call can legally proceed.   Neutral Apartment (NA) – this is a recent invention (Win2000, I think).  There is one NA in the process.  Objects contained in the NA can be called from any thread in the process (STA or MTA threads).  There are no threads associated with the NA, which is why it isn’t called NTA.  Calls into NA objects can be relatively efficient because no thread marshaling is ever required.  However, these cross-apartment calls still require a proxy to handle the transition between apartments.  Calls from an object in the NA to an object in an STA or the MTA might require thread marshaling.  This depends on whether or not the current thread is suitable for calling into the target object.  For example, a call from an STA object to an NA object and from there to an MTA object will require thread marshaling during the transition out of the NA into the MTA.     Threading   The MTA is effectively a free-threaded model.  (It’s not quite a free-threaded model, because STA threads aren’t strictly allowed to call on MTA objects directly).  From an efficiency point of view, it is the best threading model.  Also, it imposes the least semantics on the application, which is also desirable.  
The main drawback with the MTA is that humans can’t reliably write free-threaded code.   Well, a few developers can write this kind of code if you pay them lots of money and you don’t ask them to write very much.  And if you code review it very carefully.  And you test it with thousands of machine hours, under very stressful conditions, on high-end MP machines like 8-ways and up.  And you’re still prepared to chase down a few embarrassing race conditions once you’ve shipped your product.   But it’s not a good plan for the rest of us.   The NA model is truly free-threaded, in the sense that any thread in the process can call on these objects.  All such threads must still transition through a proxy layer that maintains the apartment boundary.  But within the NA all calls are direct and free-threaded.  This is the only apartment that doesn’t involve thread affinity.   Although the NA is free-threaded, it is often used in conjunction with a lock to achieve rental threading.  The rental model says that only one thread at a time can be active inside an object or a group of objects, but there is no restriction on which thread this might be.  This is efficient because it avoids thread marshaling.  Rather than marshaling a call from one thread to whatever thread is affinitized to the target objects, the calling thread simply acquires the lock (to rent the context) and then completes the call on the current thread.  When the thread returns back out of the context, it releases the lock and now other threads can make calls.   If you call out of a rental context into some other object (as opposed to the return pathway), you have a choice.  You can keep holding the lock, in which case other threads cannot rent the context until you fully unwind.  In this mode, the rental context supports recursion of the current thread, but it does not support reentrancy from other threads.  
Alternatively, the thread could release the lock when it calls out of the rental context, in which case it must reacquire the lock when it unwinds back and returns to the rental context.  In this mode, the rental context supports full reentrancy.

Throughout this blog, we’ll be returning to this fundamental decision of whether to support reentrancy.  It’s a complex issue.

If only recursion is supported on a rental model, it’s clear that this is a much more forgiving world for developers than a free-threaded model.  Once a thread has acquired the rental lock, no other threads can be active in the rented objects until the lock has been released.  And the lock will not be released until the thread fully unwinds from the call into the context.

Even with reentrancy, the number of places where concurrency can occur is limited.  Unless the renting thread calls out of the context, the lock won’t be released and the developer knows that other threads aren’t active within the rented objects.  Unfortunately, it might be hard for the developer to know all the places that call out of the current context, releasing the lock.  Particularly in a componentized world, or a world that combines application code with frameworks code, the developer can rarely have sufficient global knowledge.

So it sounds like limiting a rental context to same-thread recursion is better than allowing reentrancy during call outs, because the developer doesn’t have to worry about other threads mutating the state of objects in the rental context.  This is true.  But it also means that the resulting application is subject to more deadlocks.  Imagine what can happen if two rental contexts are simultaneously making calls to each other.  Thread T1 holds the lock to rent context C1.  Thread T2 holds the lock to rent context C2.  If T1 calls into C2 just as T2 calls into C1, and we are on the recursion plan, we have a classic deadlock.
Two locks have been taken in different sequences by two different threads.  Alternatively, if we are on a reentrancy plan, T1 will release the lock for C1 before contending for the lock on C2.  And T2 will release the lock for C2 before contending for the lock on C1.  The deadlock has been avoided, but T1 will find that the objects in C1 have been modified when it returns.  And T2 will find similar surprises when it returns to C2.

Affinity

Anyway, we now understand the free-threaded model of the MTA and NA and we understand how to build a rental model on top of these via a lock.  How about the single-threaded affinitized model of STAs?  It’s hard to completely describe the semantics of an STA, because the complete description must incorporate the details of pages of OLE pumping code, the behavior of 3rd party IMessageFilters, etc.  But generally an STA can be thought of as an affinitized rental context with reentrancy and strict stacking.  By this I mean:

It is affinitized rental because all calls into the STA must marshal to the correct thread and because only one logical call can be active in the objects of the apartment at any time.  (This is necessarily the case, since there is only ever one thread).

It has reentrancy because every callout from the STA thread effectively releases the lock held by the logical caller and allows other logical callers to either enter or return back to the STA.

It has strict stacking because one stack (the stack of the affinitized STA thread) is used to process all the logical calls that occur in the STA.  When these logical calls perform a callout, the STA thread reentrantly picks up another call in, and this pushes the STA stack deeper.  When the first callout wants to return to the STA, it must wait for the STA thread’s stack to pop all the way back to the point of its own callout.

That point about strict stacking is a key difference between true rental and the affinitized rental model of an STA.
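
Strict stacking is easier to see in a toy model.  The sketch below is plain single-threaded Python, not OLE's actual dispatch code, and ToySTA with its methods is a hypothetical construction for illustration only.  Every logical call runs on the one apartment stack, so a call that made a callout cannot resume until everything picked up above it has finished.

```python
from collections import deque


class ToySTA:
    """Hypothetical model of one apartment thread serving logical calls."""

    def __init__(self):
        self.incoming = deque()  # logical calls waiting to enter
        self.trace = []          # order in which calls enter and leave
        self.depth = 0           # current nesting on the single stack
        self.max_depth = 0

    def post(self, name, makes_callout):
        self.incoming.append((name, makes_callout))

    def _dispatch(self, name, makes_callout):
        self.depth += 1
        self.max_depth = max(self.max_depth, self.depth)
        self.trace.append(f"enter {name}")
        if makes_callout:
            self._callout()
        self.trace.append(f"leave {name}")
        self.depth -= 1

    def _callout(self):
        # While the logical caller is "out", the apartment thread
        # reentrantly picks up waiting calls on the same stack.  The
        # caller's return is processed only after they have all unwound.
        while self.incoming:
            self._dispatch(*self.incoming.popleft())

    def run(self):
        while self.incoming:
            self._dispatch(*self.incoming.popleft())


sta = ToySTA()
sta.post("A", True)   # A makes a callout
sta.post("B", True)   # B makes a callout too
sta.post("C", False)  # C runs straight through
sta.run()
print("\n".join(sta.trace))
```

In this run the calls leave in the order C, B, A: A cannot return to the apartment until B and C have popped off the stack, even if A's callout was the first to complete.
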
With true rental, we never marshal calls between threads.  Since each call occurs on its own thread, the pieces of stack for different logical threads are never mingled on an affinitized thread’s actual stack.  Returns back into the rental context after a callout can be processed in any order.  Returns back into an STA after a callout must be processed in a highly constrained order.

We’ve already seen a number of problems with STAs due to thread affinity, and we can add some more.  Here’s the combined list:

Marshaling calls between threads is expensive, compared to taking a lock.

Processing returns from callouts in a constrained fashion can lead to inefficiencies.  For instance, if the topmost return isn’t ready for processing yet, should the affinitized thread favor picking up a new incoming call (possibly leading to unconstrained stack growth) or should it favor waiting for the topmost return to complete (possibly idling the affinitized thread completely and conceivably resulting in deadlocks)?

Any conventional locks held by an affinitized thread are worthless.  The affinitized thread is processing an arbitrary number of logical calls, but a conventional lock (like an OS CRITICAL_SECTION or managed Monitor) will not distinguish between these logical calls.  Instead, all lock acquisitions are performed by the single affinitized thread and are granted immediately as recursive acquisitions.  If you are thinking of building a more sophisticated lock that avoids this issue, realize that you are making that classic reentrancy vs. deadlock decision all over again.

Imagine a common server situation.  The first call comes in from a particular client, creates a few objects (e.g. a shopping cart) and returns.  Subsequent calls from that client manipulate that initial set of objects (e.g. putting some items into the shopping cart).  A final call checks out the shopping cart, places the order, and all the objects are garbage collected.
Now imagine that all those objects are affinitized to a particular thread.  As a consequence, the dispatch logic of your server must ensure that all calls from the same client are routed to the same thread.  And if that thread is busy doing other work, the dispatch logic must delay processing the new request until the appropriate affinitized thread is available.  This is complicated and it has a severe impact on scalability.

STAs must pump.  (How did I get this far without mentioning pumping?)

Any STA code that assumed a single-threaded world for the process, rather than just for the apartment, might not pump.  Such code breaks when we introduce the CLR into the process, as we will see.

Failure to Pump

Let’s look at those last two bullet points in more detail.  When your STA thread is doing nothing else, it needs to be checking to see if any other threads want to marshal some calls into it.  This is done with a Windows message pump.  If the STA thread fails to pump, these incoming calls will be blocked.  If the incoming calls are GUI SendMessages or PostMessages (which I think of as synchronous or asynchronous calls respectively), then failure to pump will produce an unresponsive UI.  If the incoming calls are COM calls, then failure to pump will result in calls timing out or deadlocking.

If processing one incoming call is going to take a while, it may be necessary to break up that processing with intermittent visits to the message pump.  Of course, if you pump you are allowing reentrancy to occur at those points.  So the developer loses all his wonderful guarantees of single threading.

Unfortunately, there’s a whole lot of STA code out there which doesn’t pump adequately.  For the most part, we see this in non-GUI applications.  If you have a GUI application that isn’t pumping enough, it’s obvious right there on the screen.  Those bugs tend to get fixed.

For non-GUI applications, a failure to pump may not be noticed in unmanaged code.
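
The effect of not pumping can be imitated without Windows at all.  The following is a rough Python sketch in which a queue stands in for the message queue; ToyApartment, busy_job and try_call are hypothetical names, and nothing here uses the real message pump or COM marshaling.  One thread plays the affinitized apartment thread doing a long piece of work, and a second thread tries to marshal a call into it.

```python
import queue
import threading


class ToyApartment:
    """Hypothetical stand-in for an STA: calls arrive through an inbox."""

    def __init__(self):
        self.inbox = queue.Queue()
        self.handled = []

    def pump(self):
        # Drain whatever calls other threads have posted so far.
        while True:
            try:
                fn, done = self.inbox.get_nowait()
            except queue.Empty:
                return
            self.handled.append(fn())  # runs reentrantly, mid-job
            done.set()

    def marshal(self, fn, timeout):
        # Called from another thread: post the call, then wait for the
        # apartment thread to get around to it.  False means timed out.
        done = threading.Event()
        self.inbox.put((fn, done))
        return done.wait(timeout)


def busy_job(apt, release, pumping):
    # A long-running call executing on the apartment thread.
    while not release.is_set():
        if pumping:
            apt.pump()       # intermittent visit to the pump
        release.wait(0.01)   # one slice of the long-running work


def try_call(pumping, timeout):
    apt = ToyApartment()
    release = threading.Event()
    sta_thread = threading.Thread(target=busy_job, args=(apt, release, pumping))
    sta_thread.start()
    ok = apt.marshal(lambda: "hello", timeout)
    release.set()            # let the long job finish
    sta_thread.join()
    return ok, apt.handled


print(try_call(pumping=False, timeout=0.2))  # the caller times out
print(try_call(pumping=True, timeout=5.0))   # the call is serviced mid-job
```

The non-pumping variant is the situation of the STA code described above: the apartment thread itself is perfectly happy, and nothing looks wrong until some other thread needs to call in.
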
When that code is moved to managed (perhaps by re-compiling some VB6 code as VB.NET), we start seeing bugs.  Let’s look at a couple of real-world cases that we encountered during V1 of the CLR and how the lingering effects of these cases are still causing major headaches for managed developers and for Microsoft Support.  I’ll describe a server case first, and then a client case.

ADO and ASP Compatibility Mode

ADO.NET and ASP.NET are a winning co