Stratasphere: Cloud Services


Thursday, March 22, 2012

Emergent behaviour in the cloud

http://www.technologyreview.com/blog/arxiv/27642/

This is an interesting article hypothesising possible disaster in cloud computing systems due to the same sorts of systemic interactions that are observable in other computer or biological systems.

The scenario they illustrate is fun and plausible. However, it's also a little "simple". It's simply two coupled oscillators (or a double pendulum, in the simplest terms) with effectively an unlimited power input, which creates a growth spiral that will only stop when it hits some limit in the system... either causing a crash or having some other effect on the system's function or on the oscillation.
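To make the growth spiral concrete, here is a toy sketch of my own (it is not the model from the article). Two values feed into each other every step with a feedback gain; the names simulate, gain and limit are made up for illustration. With a gain above 1 the swing grows until it hits the limit, and with a gain below 1 it dies away.

```python
def simulate(gain, limit, steps, x=1.0, y=0.0):
    """Toy pair of mutually coupled states (an assumed model, for illustration).

    Each step, x is driven by y and y is driven by x, scaled by `gain`.
    Returns (step_at_which_limit_was_hit_or_None, amplitude_history).
    """
    history = []
    for step in range(steps):
        # Each state is pushed by the other one; the sign flip makes them swing.
        x, y = -gain * y, gain * x
        amplitude = max(abs(x), abs(y))
        history.append(amplitude)
        if amplitude >= limit:
            # The "crash": the growth spiral has run into a system limit.
            return step + 1, history
    return None, history


# Gain above 1: every exchange adds energy, so the swing keeps growing.
hit_step, growing = simulate(gain=1.1, limit=100.0, steps=1000)

# Gain below 1: the same coupling is damped and never reaches the limit.
no_hit, damped = simulate(gain=0.9, limit=100.0, steps=1000)
```

The only thing that differs between the two runs is the gain, which is the point: the structure is harmless until the feedback has more energy going in than coming out.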

Does make me wonder how many of these emergent effects are actually already going on in the software I write. There are always weird fragments of behaviour that can be observed... but are not focused on until they show overt negative effects (bugs). Until then they are just... weirdness. Some of it is the interplay of frameworks and code that is outside my control, or undocumented, or simply too low in the stack to bother with.

Other bits are loops and event chains that have unexplored outcomes... these, however, I take responsibility for. Code coverage and unit testing are mechanisms to try to tame these. However, unit testing is really looking at the end result rather than the process. If something bounces around wildly under the bonnet but still generates the correct response... a unit test is still happy.
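A small sketch of what I mean, using a made-up function called settle that corrects towards a target. The gain values are assumptions chosen to show the effect: one setting lands on the answer in a single step, the other overshoots back and forth many times before settling. A test that only looks at the returned value cannot tell them apart.

```python
def settle(target, start=0.0, gain=1.0, tolerance=1e-6, max_iters=10_000):
    """Approach `target` by repeated correction (hypothetical example).

    Returns (final_value, iterations_used). With gain between 1 and 2 each
    correction overshoots, so the value bounces either side of the target.
    """
    value = start
    iterations = 0
    while abs(target - value) > tolerance and iterations < max_iters:
        value += gain * (target - value)
        iterations += 1
    return value, iterations


calm_value, calm_steps = settle(10.0, gain=1.0)   # goes straight there
wild_value, wild_steps = settle(10.0, gain=1.9)   # bounces around first

# An outcome-only unit test is happy with both versions...
assert abs(calm_value - 10.0) <= 1e-6
assert abs(wild_value - 10.0) <= 1e-6

# ...and only looking at the process shows the difference.
assert wild_steps > calm_steps
```

Returning the iteration count is one cheap way to expose the process to a test; a counter or a log line would do the same job.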

In my head at least, I still think of the computer as a deterministic system... which is just wrong. Even looking at a simple little GUI app, it's obvious that it's an infinitely dynamic system. It's closer to a set of springs and dampers than it is to a deterministic ratchet.

While unit tests let us sample outcomes at the interfaces (this is a good thing), they do tend to accept that what's inside the interface box is... "unknown". The question is: what can we do to systematically peer into that dark space?

Endless logging calls?
Create manual tracing stacks?
Use a mad monkey testing engine to generate input with some sort of code coverage system to watch the results and look for some rule violations?
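The mad monkey idea can be sketched in a few lines. This is only an illustration: Inventory is a hypothetical system under test with a deliberate flaw, and monkey_test is a made-up helper name. It fires random actions at the system and checks a rule after every single step, not just at the end, then hands back the sequence of actions that broke the rule.

```python
import random


class Inventory:
    """Hypothetical system under test: stock should never go negative."""

    def __init__(self):
        self.stock = 0

    def add(self):
        self.stock += 1

    def remove(self):
        self.stock -= 1  # deliberate flaw: no check for empty stock


def monkey_test(make_system, action_names, invariant, steps=1000, seed=0):
    """Fire random actions and check `invariant` after each one.

    Returns the list of actions leading to the first violation,
    or None if no violation was seen in `steps` actions.
    """
    rng = random.Random(seed)  # seeded so a failing run can be replayed
    system = make_system()
    trace = []
    for _ in range(steps):
        name = rng.choice(action_names)
        getattr(system, name)()
        trace.append(name)
        if not invariant(system):
            return trace
    return None


# The rule being watched: stock is never negative.
failure = monkey_test(Inventory, ["add", "remove"], lambda s: s.stock >= 0)
```

Seeding the random generator matters more than it looks: a weird behaviour that cannot be replayed is just more weirdness.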

There are lots of possibilities for detecting "weird" behaviour in dynamic systems. Kind of like load tuning a web app but with a lot more variables. At some point assumptions get made and may not be upheld under different circumstances.

The complexity guys probably have some ideas, but I would guess that even they do some level of decomposition and division simply to manage the exponential effect of multiple variables and their possible interactions.

I think the main issue is to reduce the complexity in the system. Introduce buffers and dampers to regulate the flow. Prevent race conditions and resource contention, even when it places a ceiling on performance. Implement static limits even when they are a bit arbitrary. At least there is a limit in place that can be tuned if it gets hit.
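As a minimal sketch of a static limit, here is a buffer with a hard ceiling built on Python's standard queue module. The class name BoundedBuffer and the default limit of 100 are my own assumptions; the number is arbitrary on purpose. It counts how often the ceiling is hit, so there is something to look at when deciding whether to tune it.

```python
import queue


class BoundedBuffer:
    """A queue with a hard ceiling that counts how often the ceiling is hit."""

    def __init__(self, limit=100):  # 100 is an arbitrary, tunable default
        self._queue = queue.Queue(maxsize=limit)
        self.rejected = 0

    def offer(self, item):
        """Try to add an item; return False instead of growing past the limit."""
        try:
            self._queue.put_nowait(item)
            return True
        except queue.Full:
            self.rejected += 1
            return False

    def take(self):
        """Return the next item, or None if the buffer is empty."""
        try:
            return self._queue.get_nowait()
        except queue.Empty:
            return None

    def __len__(self):
        return self._queue.qsize()


buffer = BoundedBuffer(limit=2)
results = [buffer.offer(n) for n in range(3)]  # the third offer is refused
```

Refusing the item is only one policy. Blocking the producer or dropping the oldest entry are alternatives, and which one is right depends on what the buffer is protecting.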

This reminds me of the issue with the Oracle databases in a previous post. They had hard limits that made sense in the context of standalone servers, but when they interacted in a networked environment, the synchronising mechanism both had emergent properties and presented a possible exploit for inducing behaviour (crashes).

The problem there was the tight coupling of the index numbers between the database instances. Introducing loose coupling, with a buffer structure that allows the coupling to happen but does not kill the databases if it goes rogue, makes the tightly coupled system much less fragile. Problems cannot automatically propagate through the network of databases and kill them all. Obviously there would need to be some watch put on the buffers and clear exception rules in place (which could probably also be attacked if a flaw is found...), which would then allow the whole dynamic system to be monitored.

I guess the biggest need is to be willing to allow one toxic database to fail without the failure propagating to others in the network. This is, I think, the assumption that needs to be explicitly dealt with in the case of this cloud scenario.

The problem is that if one set of servers goes down and the load shifts unpredictably, then it could cause a cascading failure as more and more load gets shifted around and more things fail. These types of cascade events are only stopped by firewalls. The concept is that a fire can burn on one side of the wall, but cannot cross the wall. In server terms, that may mean that a server cannot accept more load than it has capacity to handle, no matter how much load is trying to be shifted onto it.
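Here is a toy model of that cascade, under assumptions of my own: every server has the same capacity, a failed server's load is split evenly among the survivors, and the function name redistribute and the firewall flag are made up for the sketch. Without the firewall, overloaded servers fail in turn and dump their load on whoever is left. With it, each server takes only what fits and the rest is shed.

```python
def redistribute(loads, capacity, failed_index, firewall):
    """Simulate one server failing and its load being shifted to the others.

    Returns (surviving_loads_by_index, total_load_shed). This is a toy model:
    equal capacity everywhere and an even split of orphaned load are assumed.
    """
    alive = {i: load for i, load in enumerate(loads) if i != failed_index}
    orphaned = loads[failed_index]
    shed = 0.0
    while orphaned > 0 and alive:
        share = orphaned / len(alive)
        orphaned = 0.0
        for i in list(alive):
            if firewall:
                # Accept only what fits; the wall stops the rest.
                accepted = min(share, max(0.0, capacity - alive[i]))
                alive[i] += accepted
                shed += share - accepted
            else:
                alive[i] += share
        # Any server pushed past capacity fails and orphans its whole load.
        for i in [i for i, load in alive.items() if load > capacity]:
            orphaned += alive.pop(i)
    if not alive:
        shed += orphaned  # nobody left to take it
    return alive, shed


loads = [80, 80, 80, 80]
no_wall = redistribute(loads, capacity=100, failed_index=0, firewall=False)
with_wall = redistribute(loads, capacity=100, failed_index=0, firewall=True)
```

With four servers at 80 out of 100, losing one without the wall pushes the other three over capacity and they all go down together. With the wall, the three survivors fill up to 100 and only the excess is lost.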

There also needs to be a plan for graceful failure. Servers need to go down on their knees before they go down on their faces. (While they are on their knees, they write the load out to disk and then die gracefully....)
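The "on their knees" step might look something like this sketch. The function name drain_to_disk and the JSON-lines file format are assumptions for illustration, and it leaves out the part where the server stops accepting new work and hooks this into its shutdown path. All it does is empty the pending queue onto disk so the work can be picked up later.

```python
import json
import queue


def drain_to_disk(pending, path):
    """Write every queued item to `path`, one JSON document per line.

    `pending` is a queue.Queue of JSON-serialisable items (an assumption).
    Returns the number of items written.
    """
    count = 0
    with open(path, "w", encoding="utf-8") as handle:
        while True:
            try:
                item = pending.get_nowait()
            except queue.Empty:
                break  # nothing left to save; safe to go down now
            handle.write(json.dumps(item) + "\n")
            count += 1
    return count
```

A real server would also need to decide what happens if the disk write itself fails, which is exactly the kind of case that tends to go unexplored.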

Anyway... enough rambling.

Friday, May 7, 2010

Case Study on WePay service

Yet another interesting use of the cloud to provide a solution to a complex social problem.

This is a very clever application that helps groups collect, manage and spend money. For the group, by the group. Something that happens all over the world in different ways and has the same problems no matter what the culture. (People... if you were wondering what I was referring to!)

Very nice. Hope they can make it work. https://www.wepay.com/about/wepay

Tuesday, May 4, 2010

A Cloudy Day

The main thing that has changed since I was last working in the private sector is the availability and utility of cloud services. Every day I find more and more useful, mature services. The thing that has not changed is people. There is still a hesitancy to trust something that you can't touch or a person you can't look in the eye. I think this is the one cognitive leap that divides the current digital businesses from the nearly-digital businesses. Somewhere there is someone in a decision-making capacity who just can't make that leap.

If you have ever seen a flock of sheep pile up because one sheep could not walk past a particular rock or bush... same dynamic. All the other sheep are going "I can't see anything... but maybe.... what do you think??"
The natural urge to caution comes out and all the sheep are reduced to the level of the most conservative sheep. Nothing wrong with that. That strategy has good strong survival instincts built in.

This brings us back to cloud services. How do those conservative choices apply with cloud services?

Look at risk management. Cloud services are a risk manager's worst nightmare. How do you quantify and mitigate the risk of a business that you rely on going bust and disappearing into the night with your data and potentially a critical element of your business?

Same way businesses have always played the trust game. Contracts.

Behind every cloud service is a person who can have "penalty clauses" applied to them. But for that to happen, you need to be able to find them and their jurisdiction. You need to be able to contract agents in that jurisdiction who can act on your behalf. This is a game that has been played out in different ways countless times over the past centuries, with traders selling things to people in "foreign" lands. It always comes down to a choice. Even though there is a "deal" (read: contract) in place, it's only useful as long as things are going well. When the deal goes sour, to actually apply the penalty clauses, there usually has to be someone who will "honor" the terms of the contract voluntarily. This usually means there has to be a business or owner, or the estate/assets etc., who will "do the right thing". When it comes down to a pile of digital records and digital assets, and your ability to contact agents in foreign parts is limited by you being a small to medium enterprise without a litigation department, what do you do? Who can you call?

Something to think about when purchasing cloud services.