Cloud, at last!

After some time experimenting, studying, designing (but mostly: presenting possible scenarios to management), we are preparing to move a central part of our systems to the cloud!
Cost savings (especially OPEX, and in particular sourcing: finding and hiring a good DBA is very hard!), increased availability, and resilience to HW failures and catastrophes are the key points I presented to management to help them decide.

On the downside, getting ready to move will require a good engineering effort; our systems are very old, but the general architecture built over the years is sound. It was good (surprising and pleasant) to discover how many of the patterns listed in the Azure Cloud Design Patterns Architecture Guidance we already use in our systems.



The legacy components of the system have been extensively extended over the years, and the new parts and paths developed since I joined the company in 2012 always followed a classic pattern which you may recognize from several IoT designs:
  •   Field devices -> Queue (Inbox/Outbox)
  •   Queue -> Processing -> SQL
  •   Commands -> Queue (Inbox) <- Device
  
More precisely:
  • Field devices communicate with a "central" server, which just collects the data and buffers it in a durable (temporary) store. Little or no processing happens here (basic validation only)
  • Different machines "consume" the items in the temporary store: they pull things from there and persist each event in an "append-only" data store (Event Sourcing)
  • Process the events: generate domain objects through a series of three steps, from the append-only store events to the final objects persisted in SQL tables (Pipes and Filters)
  • Generate "synthesized" data for reporting and statistics queries (Materialized View)
The back-end is already decomposed into several "medium" services: not really "micro" services, but several HTTP-based services talking through REST APIs.
These services are already quite robust: they have to be, since they are already exposed to the Internet. In particular, they implement Cache-aside for performance, Circuit Breaker and Retry with exponential backoff when they talk to external services (and, in most cases, even when they talk internally to each other), sharding for big data, and throttling for some of the public-facing APIs.
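As an aside, Retry with exponential backoff is simple enough to sketch. This is a minimal, illustrative version (not our production code; the transient-failure condition and the delays are just examples):

using System;
using System.Net.Http;
using System.Threading.Tasks;

static class Retry
{
    // Retries a transient-failure-prone operation, doubling the delay each time
    public static async Task<T> WithBackoff<T>(Func<Task<T>> action, int maxAttempts = 5)
    {
        var delay = TimeSpan.FromMilliseconds(200);
        for (int attempt = 1; ; attempt++)
        {
            try
            {
                return await action();
            }
            catch (HttpRequestException) when (attempt < maxAttempts)
            {
                await Task.Delay(delay);   // back off before retrying...
                delay += delay;            // ...and double the delay for next time
            }
        }
    }
}

// usage: var body = await Retry.WithBackoff(() => httpClient.GetStringAsync(url));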

Technically, the challenge is very interesting. The architecture is well suited to being ported to the cloud, but to make it really competitive (and to minimize running costs), some pieces will have to be rewritten.
To make the transition as smooth as possible, initially most of the pieces will be less than optimal (mostly IaaS: VMs, SQL storage where NoSQL/Cloud storage would suffice, Compute instances, ...) but they will be slowly rewritten to be more efficient, more "cloudy" (App Fabric, Tables, Functions, ...).

Really excited to have begun this journey!


Old school code writing (sort of)

As I mentioned in my previous post, online resources on Hosting are pretty scarce. 

Also, writing a Host for the CLR requires some in-depth knowledge of topics you do not usually see in your day-to-day programming, like IO completion ports. The same goes for AppDomains: there is plenty of documentation and resources compared to Hosting, but still, the more advanced features and the underlying mechanisms (how does it work? how does a thread interact with, and know about, AppDomains?) are not something you can find in a forum.

Luckily, I have been coding long enough to have a programming library at home. Also, I have always been the kind of guy that wants to know not only how to use stuff, but how it really works, so I had plenty of books on how the CLR (and Windows) work at a low level. All the books I am going to list were already in my library!

The first one, a mandatory read, THE book on hosting the CLR:



Then, a couple of books from Richter:

  

The first one is very famous. I have the third edition (in Italian! :) ) which used to be titled "Advanced Windows". It is THE reference for the Win32 API.
If you go anywhere near CreateProcess and CreateThread, you need to have and read this book.

The second one has a title which is a bit misleading. It is actually a "part 2" of the first one, focused on highly threaded, concurrent applications. It is the best explanation I have ever read of APCs and IO Completion Ports.

  

A couple of very good books on the CLR to understand Type Loading and AppDomains.
A "softer" read before digging into...

  

...the Internals. You need to know what a TEB is and how it works when you are chasing threads as they cross AppDomains.
And you need all the insider knowledge you can get if you have to debug cross-thread, managed-unmanaged transitions, and bugs spanning asynchronous calls.

My edition of the first book is actually called "Inside Windows NT". It is the second edition of the book that described the internals of NT 3.1 (which was, despite the name, the first Windows running on the NT kernel), originally authored by Helen Custer, who worked closely with Dave Cutler's original NT team. My edition covers NT4, but it is still valid today. Actually, it is kind of fun to see how things evolved over the years: you can really see the evolution, how things changed with the transition from 32 to 64 bits (which my edition already covers; NT4 used to run on 64-bit Alphas), and how they changed for security reasons. But the foundations and concepts are there: evolution, not revolution.

  

And finally, two books that really helped me while writing replacements for the ITask API. The first one told me how it should work; the second one showed me how to look inside the SSCLI for the relevant parts (how and when the Hosting code is called).

Of course, I did not read all these books before setting to work! But I have read them over the years, and having them on my bookshelf provided a quick and valuable reference during the development of my host for Pumpkin.
This is one of the (few) times when I'm grateful to have learned to program "before Google", in the late '90s/early '00s. Reading a book was the only way to learn. It was slow, but it really fixed the concepts in my mind.

Or maybe I was just younger :)



So, in the end, what went into Pumpkin?

Should control be performed at compilation time or at execution time? And if at execution time, using which technique?

In general, compilation has a big pro (you can immediately notify the snippet creator that he did something wrong, and even prevent the code block from becoming an executable snippet) and a big con (you only control the code that is written: what if the user code calls down some (legitimate) path in the BCL that results in an undesired behaviour?).

AppDomain sandboxing has some big pros (simple, designed with security in mind) and a big con (no "direct" way to control some resource usage, like thread time or CPU time).
Hosting has a big advantage (fine control of everything, even of "third-party" assemblies like the BCL) which is also its big disadvantage (you HAVE to do everything by yourself).

So each of them can handle the same issue with different efficacy. Consider the issue of controlling thread creation:
  • at compilation, you "catch" constructs that create a new thread (new Thread, Task.Factory.StartNew, ThreadPool.QueueUserWorkItem, ...)
    • you have to find all of them, and live with the code that creates a thread indirectly.
    • but you can do wonderful things, like intercepting calls to thread and sync primitives and substitute them - run them on your own scheduler!
  • at runtime, you:
    • (AppDomain) check periodically. Count new threads from the last check.
    • (hosting) you are notified of thread creation, so you monitor it.
    • (debugger) you are notified as well, and you can even suspend the user code immediately before/after.

Another example:
  • at compilation, you control which namespaces can be used (indirectly controlling the assembly)
  • at runtime you can control which assemblies are actually loaded (you are either notified OR asked to load them, and you can prevent the loading; see the sketch below)
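With plain AppDomain events you can already get this notify/veto shape. A minimal sketch (the whitelist and the AssemblyGate name are mine; note that AssemblyResolve is only consulted when normal probing fails, so a real implementation would pair this with sandboxing):

using System;
using System.Reflection;

static class AssemblyGate
{
    // Hypothetical whitelist of assemblies a snippet may use
    static readonly string[] Whitelist = { "mscorlib", "System", "System.Core" };

    public static void Install(AppDomain domain)
    {
        // Notification: fires after every successful load
        domain.AssemblyLoad += (s, e) =>
            Console.WriteLine("Loaded: " + e.LoadedAssembly.FullName);

        // Veto point: consulted when normal probing fails;
        // returning null makes the load fail
        domain.AssemblyResolve += (s, e) =>
        {
            var name = new AssemblyName(e.Name);
            return Array.Exists(Whitelist, w => w == name.Name)
                ? Assembly.Load(name)
                : null;
        };
    }
}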

What I ended up doing is using a mix of techniques.

In particular, I implemented some compiler checks.
Then, I run the compiled IL in a separate AppDomain with a restricted PermissionSet (sandboxing).
Finally, I run all the managed code in a hosted CLR.

I am not crazy...
 

Guess who is using the same technique? (well, not compiler checks/rewriting, but AppDomain sandboxing + Hosting?)
A piece of software that has the same problem, i.e. running unknown, third party pieces of code from different parties in a reliable, efficient way: IIS.
There is very little information on the subject; it is not one of those things for which you have extensive documentation already available. Sure, MSDN has documented it (MSDN has documentation for everything, thankfully), but there are no tutorials or Q&As on the subject on StackOverflow. Still, the pieces of information you find in blogs and articles suggest that this technology is used in two Microsoft products: SQL Server, for which the Hosting API was created, and IIS.

Also, this is a POC, so one of the goals is to let me explore different ways of doing the same thing, and assess robustness and speed of execution. Testing different technologies is part of the game :)


In order to obtain what we want, i.e. fine-grained resource control for our "snippets", we can act at two levels:

  • compilation time
  • execution time

Furthermore, we can control execution in three ways:

  1. AppDomain sandboxing: "classical" way, tested, good for security
  2. Hosting the CLR: greater control on resource allocation
  3. Execute in a debugger: even greater control on the executed program. Can be slower, can be complex

Let's examine all the alternatives.

Control at compilation time

Here, as I mentioned, the perfect choice would be to use the new (and open-source) C# compiler.

It divides compilation into well-defined phases, has a nice API, and can be used to recognize "unsafe" or undesired code, like unsafe blocks, pointers, creation of unwanted classes or calls to undesired methods.

Basically, the idea is to parse the program text into a SyntaxTree, extract the nodes matching some criteria (e.g. DeclarationModifiers.Unsafe, calls to File.Read, ...), and raise an error. Another possibility is to write a CSharpSyntaxRewriter that wraps (for diagnostics) or completely replaces some classes or methods.
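To give an idea of the shape of such a check, here is a minimal sketch against the Roslyn API (SnippetChecker is a made-up name, and the File check is purely textual; a real checker would bind symbols through the SemanticModel):

using System.Linq;
using Microsoft.CodeAnalysis.CSharp;
using Microsoft.CodeAnalysis.CSharp.Syntax;

static class SnippetChecker
{
    // Returns a message for each construct we do not want in a snippet
    public static string[] FindForbidden(string source)
    {
        var root = CSharpSyntaxTree.ParseText(source).GetRoot();

        var unsafeBlocks = root.DescendantNodes()
            .OfType<UnsafeStatementSyntax>()
            .Select(n => "unsafe block at " + n.GetLocation().GetLineSpan());

        // Syntactic check for calls like File.ReadAllText(...)
        var fileCalls = root.DescendantNodes()
            .OfType<MemberAccessExpressionSyntax>()
            .Where(m => m.Expression.ToString() == "File")
            .Select(m => "File access at " + m.GetLocation().GetLineSpan());

        return unsafeBlocks.Concat(fileCalls).ToArray();
    }
}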

Unfortunately, Roslyn is not an option: StackOverflow's requirements prevent the usage of this new compiler. Why? Well, users may want to show a bug, or ask about a particular behaviour they are seeing in version 1 of C# (no generics, no anonymous delegates), or version 2 (no extension methods, no lambdas, etc.). So, for the sake of fidelity, it is required that the snippet can be compiled with an older version of the compiler (and no, the /langversion switch is not really the same thing).

An alternative is to act at a lower level: IL bytecode.
It is possible to compile the program, and then inspect and even modify the bytecode. You can detect all the kinds of unsafe code you do not want to execute (unsafe, pointers, ...), detect the usage of Types you do not want to load (e.g. through a whitelist), and insert "probes" into the code to help you catch runaway code.

I'm definitely NOT thinking about "solving" the halting problem with some fancy new static analysis technique... :) Don't worry!

I'm talking about intercepting calls to "problematic" methods and wrapping them. So for example:

static void ThreadMethod() {
   while (true) {
      new Thread(ThreadMethod).Start();   // every thread spawns another thread, forever
   }
}
This is a sort of fork bomb.

(funny aside: I really coded a fork bomb once, 15 years ago. It was on an old Digital Alpha machine running Digital UNIX we had at the university. The problem was that the machine was used as a terminal server powering all the dumb terminals in the class, so bringing it down meant the whole class halted... whoops!)

After passing it through the IL analyser/transpiler, the method is rewritten (recompiled) to:


static void ThreadMethod() {
   while (true) {
      new Wrapped_Thread(ThreadMethod).Start();
   }
}

And in Wrapped_Thread.Start() you can add "probes", perform every check you need, and allow or disallow certain behaviours or patterns. For example, something like: 

if (Monitor[currentSnippet].ThreadCount > MAX_THREADS)
  throw new TooManyThreadException();

if (OtherConditionThatWeWantToEnforce)
  ...

innerThread.Start();


You intercept all the code that deals with threads and wrap it: thread creation, synchronization object creation (and waits), setting thread priority... and you replace them with wrappers that perform checks before actually calling the original code.

You can even insert "probes" at predefined points: inside loops (when you parse a while or a for, or, at the IL level, before a jump), and before function calls (to have the ability to check execution status before recursion). These "probes" may be used to perform checks, to yield the thread quantum more often (Thread.Sleep(0)), and/or to check execution time, so you are sure snippets will not take the CPU all for themselves.
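A probe could look something like this (a sketch; SnippetProbes and the 5-second budget are hypothetical):

using System;
using System.Diagnostics;
using System.Threading;

// Hypothetical probe class the IL rewriter would inject calls to
static class SnippetProbes
{
    static readonly Stopwatch Elapsed = Stopwatch.StartNew();
    static readonly TimeSpan Limit = TimeSpan.FromSeconds(5);

    // Injected at every loop back-edge and before every call
    public static void Checkpoint()
    {
        if (Elapsed.Elapsed > Limit)
            throw new TimeoutException("Snippet exceeded its time budget");
        Thread.Sleep(0);   // yield the quantum so the supervisor can run
    }
}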

An initial version of Pumpkin used this very approach. I used the great Cecil project from Mono/Xamarin. IL rewriting is not trivial, but at least Cecil makes it less cumbersome. This sub-project is also on GitHub as ManagedPumpkin.
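To give an idea of the mechanics, here is a much-simplified sketch of such a rewrite with Cecil (ThreadRewriter is a made-up name; a real rewriter must handle all the Thread constructors and also redirect the subsequent calls to Thread.Start, otherwise the IL will not verify):

using System.Threading;
using Mono.Cecil;
using Mono.Cecil.Cil;

public class Wrapped_Thread   // minimal stand-in for the wrapper sketched above
{
    readonly Thread innerThread;
    public Wrapped_Thread(ThreadStart start) { innerThread = new Thread(start); }
    public void Start()
    {
        // resource checks would go here
        innerThread.Start();
    }
}

static class ThreadRewriter
{
    public static void Rewrite(string inputPath, string outputPath)
    {
        var module = ModuleDefinition.ReadModule(inputPath);
        var wrappedCtor = module.ImportReference(
            typeof(Wrapped_Thread).GetConstructor(new[] { typeof(ThreadStart) }));

        foreach (var type in module.Types)
            foreach (var method in type.Methods)
            {
                if (!method.HasBody) continue;
                foreach (var instr in method.Body.Instructions)
                    if (instr.OpCode == OpCodes.Newobj &&
                        instr.Operand is MethodReference ctor &&
                        ctor.DeclaringType.FullName == "System.Threading.Thread")
                    {
                        instr.Operand = wrappedCtor;   // swap the constructor target
                    }
            }

        module.Write(outputPath);
    }
}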

And obviously, whatever solution we choose, we do not let the user change thread priorities: we may even run all the snippets in a thread with *lower* priority, so the "snippet" manager/supervisor classes are always guaranteed to run.

Control at execution time

Let's start with the basics: AppDomain sandboxing is the bare minimum. We want to run the snippets in a separate AppDomain, with a custom PermissionSet. Possibly starting with an almost empty one. 

Why? Because AppDomains are a unit of isolation in the .NET CLR, used to control the scope of execution and resource ownership. The mechanism is already there, with the explicit mission of isolating "questionable" assemblies into "partially trusted" AppDomains. You can select from a set of well-known permissions or customize them as appropriate. Sometimes you will hear this approach referred to as sandboxing.

There are plenty of examples of how to do that, so it should be simple to implement (see, for example, the PTRunner project).
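The basic setup is short enough to sketch here (a minimal version; evidence, error handling and the full-trust assembly list are omitted):

using System;
using System.Security;
using System.Security.Permissions;

static class Sandbox
{
    public static AppDomain Create(string snippetDir)
    {
        // Start from an (almost) empty permission set: execution only
        var permissions = new PermissionSet(PermissionState.None);
        permissions.AddPermission(
            new SecurityPermission(SecurityPermissionFlag.Execution));

        var setup = new AppDomainSetup { ApplicationBase = snippetDir };

        // Everything loaded in this domain runs as partially trusted
        return AppDomain.CreateDomain("SnippetSandbox", null, setup, permissions);
    }
}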

AppDomain sandboxing helps with the security aspect, but can do little about resource control. For that, we should look into some form of CLR hosting.

Hosting the CLR

"Hosting" the CLR means running it inside an executable, which is notified of several events and acts as a proxy between the managed code and the unmanaged runtime for some aspects of the execution. It can actually be done in two ways:

1. "Proper" hosting of the CLR, like ASP.NET and SQL Server do

Looking at what you can control through the hosting interface, you see that, for example, you can control and replace all the native implementations of "task-related" (thread) functions.
It MAY seem overkill. But it gives you complete control. For example, there was a time (a beta of CLR v2, IIRC) in which it was possible to run the CLR on fibers instead of threads. This was dropped, but it gives you an idea of the level of control that can be obtained.

2. Hosting through the CLR Profiling API (link1, link2)

You can monitor (and DO!) a lot of things with it: I used it in the past to do on-the-fly IL rewriting (you are notified when a method is JITed, and you can modify the IL stream before it is compiled). A past project of mine used it for a similar purpose, monitoring thread synchronization... I should have written about it on this blog years ago!

In particular, you can intercept all kinds of events related to memory usage, CPU usage, thread creation, assembly loading, ... (it is a profiler, after all!).
A hypothetical snippet manager running alongside the profiler (which you control, as it is part of your own executable) can then use a set of policies to say "enough!" and terminate the offending snippet's threads.

Debugging

Another project I did in the past involved using the managed debugging API to run code step-by-step.

This gives you plenty of control, even if you do not do step-by-step execution: you can make the debuggee "break into" the debugger at thread creation, exit, ... and you can issue a "break" at any time, effectively gaining complete control over the debugged process (after all, you are a debugger: it is your raison d'être to inspect running code). This can be done at regular intervals, preventing resource depletion by the snippet.


Choices, choices, choices...

How would you design and write a system that takes some C# code and runs it "in the browser"?

In general, my answer would be: Roslyn. Roslyn was already quite hot and mature at the end of 2014; having something like scriptcs would give you complete control over each line of code you are going to execute.

But this particular project, being something that must work for StackOverflow, had several constraints, most of which were in stark contrast with one another:
  • High fidelity: if I am asking a question about a peculiar problem I am having with C# 1 on .NET 1.1, I want my "snippet" to behave as if it is compiled with C# 1 and run on .NET CLR 1.1
  • Safe: can you just compile and execute your snippet inside your IIS? Mmmm.. not a great idea...
  • High performance: can you spin up a VM (or a container), wait for it to be ready, "deploy" the snippet, execute it, get it back? That would be very safe, but a bit slow.

Safety/security is particularly important. For example: you do not want users to use WMI to shut down the machine, or open a random port, install a torrent server, read configuration files from your machine, erase files...
For safety, we want to be able to handle dependencies in a sensible way. Also, some assemblies/classes/methods just do not make any sense in this scenario: Windows Forms? Workflow Foundation? Sql?
For safety and performance, we want to monitor and cap resource usage (no snippets that do not terminate).

Going a bit deeper, I started to sketch out some constraints. It turns out that we need to disallow some things, even if this means going against the goal of "high fidelity":
  • no "unsafe", no pointers
  • no p/invoke or unmanaged code
  • nothing from the server that runs the snippet is accessible: no file read, no access to local registry (read OR write!)
  • no arbitrary external dependency (assemblies): whitelist assemblies

Also, we need control over some "resources". We cannot allow snippets to grab an unlimited or uncontrolled amount of them.
  1. limit execution time
    • per process/per thread?
    • running time/execution time
  2. limit kernel objects
    • thread creation (avoid "fork-bombs")
    • limit other too? Events, mutexes, semaphores...
    • deny (or handle in a sensible way) access to named kernel objects (e.g. named semaphores.. you do not want some casual interaction with them!)
  3. limit process creation (zero?)
  4. limit memory usage
  5. limit file usage (no files)
  6. limit network usage (no network)
    • in the future: virtual network, virtual files?
  7. limit output (Console.WriteLine, Debug.Write, ...)
    • and of course redirect it
Does it sound familiar? For me, it was when I learned about something called cgroups. Too bad we don't have it in Windows! Yes, there are Job Objects, but they do not cover every aspect.
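For the record, here is a sketch of what Job Objects do give you: per-job caps on the number of active processes and on CPU time (P/Invoke reduced to the basic limit class; note there is nothing here about threads, handles or named objects inside the process):

using System;
using System.Diagnostics;
using System.Runtime.InteropServices;

static class JobLimits
{
    const uint JOB_OBJECT_LIMIT_PROCESS_TIME   = 0x0002;
    const uint JOB_OBJECT_LIMIT_ACTIVE_PROCESS = 0x0008;
    const int  JobObjectBasicLimitInformation  = 2;

    [StructLayout(LayoutKind.Sequential)]
    struct JOBOBJECT_BASIC_LIMIT_INFORMATION
    {
        public long PerProcessUserTimeLimit;   // in 100 ns units
        public long PerJobUserTimeLimit;
        public uint LimitFlags;
        public UIntPtr MinimumWorkingSetSize;
        public UIntPtr MaximumWorkingSetSize;
        public uint ActiveProcessLimit;
        public UIntPtr Affinity;
        public uint PriorityClass;
        public uint SchedulingClass;
    }

    [DllImport("kernel32.dll", CharSet = CharSet.Unicode)]
    static extern IntPtr CreateJobObject(IntPtr attributes, string name);

    [DllImport("kernel32.dll")]
    static extern bool SetInformationJobObject(IntPtr job, int infoClass,
        ref JOBOBJECT_BASIC_LIMIT_INFORMATION info, int size);

    [DllImport("kernel32.dll")]
    static extern bool AssignProcessToJobObject(IntPtr job, IntPtr process);

    public static void Confine(Process p)
    {
        IntPtr job = CreateJobObject(IntPtr.Zero, null);
        var info = new JOBOBJECT_BASIC_LIMIT_INFORMATION
        {
            LimitFlags = JOB_OBJECT_LIMIT_ACTIVE_PROCESS | JOB_OBJECT_LIMIT_PROCESS_TIME,
            ActiveProcessLimit = 1,                    // no child processes
            PerProcessUserTimeLimit = 5L * 10000000    // 5 seconds of user CPU
        };
        SetInformationJobObject(job, JobObjectBasicLimitInformation,
            ref info, Marshal.SizeOf(typeof(JOBOBJECT_BASIC_LIMIT_INFORMATION)));
        AssignProcessToJobObject(job, p.Handle);
    }
}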

Could we have cgroups-like control for .NET applications?


Copyright 2020 - Lorenzo Dematte