Thursday, March 22, 2012

Emergent behaviour in the cloud

http://www.technologyreview.com/blog/arxiv/27642/

This is an interesting article hypothesising possible disasters in cloud computing systems due to the same sorts of systemic interactions that are observable in other computer and biological systems.

The scenario they illustrate is fun and plausible.  However, it's also a little "simple".  It's essentially two coupled oscillators (or a double pendulum in the simplest terms) with an effectively unlimited power input, which creates a growth spiral that will only stop when it hits some limit in the system... either causing a crash or some other effect on the system function or the oscillation.
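To make that concrete for myself, here's a toy numerical sketch (not the article's model, just the shape of the idea): two quantities that each over-react to the other, spiralling up until they hit an arbitrary hard limit.

    # Toy sketch: two coupled feedback loops where each side over-reacts
    # to the other (gain > 1), so the oscillation grows until it hits an
    # arbitrary ceiling somewhere in the system.

    LIMIT = 1000.0   # hypothetical system ceiling (e.g. max queue depth)
    GAIN = 1.1       # each side over-reacts slightly to the other

    a, b = 1.0, 0.0  # small initial disturbance
    for step in range(200):
        a, b = a - GAIN * b, b + GAIN * a   # crude coupled-oscillator update
        if abs(a) > LIMIT or abs(b) > LIMIT:
            print(f"hit the limit at step {step}: a={a:.1f}, b={b:.1f}")
            break

With a gain just over 1 it only takes a couple of dozen steps before the amplitude hits the ceiling... which is the whole point.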

Does make me wonder how many of these emergent effects are actually already going on in the software I write.  There are always weird fragments of behaviour that can be observed... but are not focused on until they show overt negative effects (bugs).  Until then they are just... weirdness.  Some of it is the interplay of frameworks and code that is outside my control, or undocumented, or simply too low in the stack to bother with.

Other bits are loops and event chains that have unexplored outcomes... these, however, I take responsibility for.  Code coverage and unit testing are mechanisms to try to tame these.  However, unit testing is really looking at the end result rather than the process.  If something bounces around wildly under the bonnet but still generates the correct response... a unit test is still happy.
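A silly illustration of what I mean (the function and the numbers are made up): the internals thrash about pointlessly, but the test only ever sees the final answer, so it stays green.

    import unittest

    def lookup_price(item):
        # Hypothetical function: internally it thrashes (retries, wasted
        # recomputation) but still lands on the right answer in the end.
        attempts = 0
        result = None
        while result is None:
            attempts += 1        # wild bouncing "under the bonnet"
            if attempts >= 7:    # eventually settles, almost by accident
                result = 42
        return result

    class PriceTest(unittest.TestCase):
        def test_lookup_price(self):
            # Only the outcome is checked; the seven wasted attempts are invisible.
            self.assertEqual(lookup_price("widget"), 42)

    if __name__ == "__main__":
        unittest.main()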

In my head at least, I still think of the computer as a deterministic system... which is just wrong.  Even looking at a simple little GUI app, it's obvious that it's an infinitely dynamic system.  It's closer to a set of springs and dampers than it is to a deterministic ratchet.

While unit tests let us sample outcomes at the interfaces (this is a good thing), they do tend to accept that what's inside the interface box is... "unknown".  The question is: what can we do to systematically peer into that dark space?

Endless logging calls? 
Create manual tracing stacks? (there's a rough sketch of this one below the list)
Use a mad monkey testing engine to generate input with some sort of code coverage system to watch the results and look for some rule violations?
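The manual tracing idea is probably the cheapest to try.  Something as rough as this (the names and the "rule" are invented) at least records the internal event chain rather than just the outcome:

    import functools

    TRACE = []   # hypothetical in-memory trace of calls and returns

    def traced(fn):
        # Record every call and return so the internal event chain can be
        # inspected later, not just the final result.
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            TRACE.append(("call", fn.__name__, args))
            result = fn(*args, **kwargs)
            TRACE.append(("return", fn.__name__, result))
            return result
        return wrapper

    @traced
    def handle_event(evt):
        return evt.upper()

    handle_event("click")
    # A "rule violation" check can be as crude as a length limit on the chain.
    assert len(TRACE) < 100, "suspiciously long event chain"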

There are lots of possibilities for detecting "weird" behaviour in dynamic systems.  It's kind of like load tuning a web app, but with a lot more variables.  At some point assumptions get made that may not hold under different circumstances.

The complexity guys probably have some ideas, but I would guess that even they do some level of decomposition and division simply to manage the exponential effect of multiple variables and their possible interactions.

I think the main issue is to reduce the complexity in the system.  Introduce buffers and dampers to regulate the flow.  Prevent race conditions and resource contention, even when that places a ceiling on performance.  Implement static limits even when they are a bit arbitrary.  At least there is a limit in place that can be tuned if it gets hit.
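For the static limit idea, even something as blunt as a bounded queue does the job.  A sketch (the number is arbitrary, which is rather the point):

    import queue

    MAX_PENDING = 500                    # hypothetical static limit, tune it later
    pending = queue.Queue(maxsize=MAX_PENDING)

    def submit(job):
        # Buffer/damper between producer and consumer: refuse work rather
        # than grow without bound.
        try:
            pending.put_nowait(job)
            return True
        except queue.Full:
            return False                 # back-pressure instead of a runaway spiral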

This reminds me of the issue with the Oracle databases in a previous post.  They had hard limits that made sense in the context of standalone servers, but when they interacted in a networked environment, the synchronising mechanism had emergent properties as well as presenting a possible exploit for inducing behaviour (crashes).

The problem there was the tight coupling of the index numbers between the database instances.  By introducing loose coupling with a buffer structure that allows the coupling to happen but does not kill the databases if it goes rogue, the tightly coupled system becomes much less fragile.  Problems cannot automatically propagate through the network of databases and kill them all.  Obviously there would need to be some watch put on the buffers and clear exception rules in place (which are also probably able to be attacked if a flaw is found...), which then allow the whole dynamic system to be monitored.
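Very loosely, this is the shape I have in mind for the buffer (the rule and the numbers are invented): index updates pass through a relay that applies a sanity check, and anything rogue gets quarantined instead of propagated to the peers.

    MAX_INDEX_JUMP = 10_000       # made-up "exception rule" watched on the buffer

    quarantine = []               # somewhere for a human or monitor to look

    def relay_index(current_index, proposed_index):
        delta = proposed_index - current_index
        if 0 < delta <= MAX_INDEX_JUMP:
            return proposed_index            # plausible update, let it through
        quarantine.append(proposed_index)    # flag it instead of propagating it
        return current_index                 # peers keep running on the old value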

I guess the biggest need is to be willing to allow one toxic database to fail without the failure propagating to others in the network.  This is, I think, the assumption that needs to be explicitly dealt with in the case of this cloud scenario.

The problem is that if one set of servers goes down and the load shifts unpredictably, it could cause a cascading failure as more and more load gets shifted around and more things fail.  These types of cascade events are only stopped by firewalls.  The concept is that a fire can burn on one side of the wall, but cannot cross the wall.  In server terms, that may mean a server cannot accept more load than it has the capacity to handle, no matter how much load is being shifted onto it.
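In code, that firewall is just admission control: the server counts what it already has in flight and refuses anything beyond its own capacity.  A rough sketch (the capacity number is made up):

    CAPACITY = 200          # hypothetical max in-flight requests for this box
    in_flight = 0

    def accept_request():
        global in_flight
        if in_flight >= CAPACITY:
            return False    # shed the load here; the cascade stops at the wall
        in_flight += 1
        return True

    def finish_request():
        global in_flight
        in_flight = max(0, in_flight - 1)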

There also needs to be a plan for graceful failure.  Servers need to go down on their knees before they go down on their faces.  (While they are on their knees, they write the load out to disk and then die gracefully....)
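A minimal sketch of the knees-before-face idea, assuming a Unix-ish server that gets sent a SIGTERM: dump the outstanding work to disk, then exit cleanly instead of dropping it on the floor.

    import json
    import signal
    import sys

    work_in_progress = []   # whatever load this server is currently holding

    def on_their_knees(signum, frame):
        # Persist outstanding work while still upright, then die gracefully.
        with open("outstanding_work.json", "w") as f:
            json.dump(work_in_progress, f)
        sys.exit(0)

    signal.signal(signal.SIGTERM, on_their_knees)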

Anyway... enough rambling.
