Cloud Computing: Challenges and possible solutions for digital forensics

by Vividh Siddha, VP of Engineering

One of the inherent benefits of cloud computing is efficient and optimal utilization of resources for applications. The pay-as-you-go model that cloud computing provides requires elasticity of these resources: cloud computing offers a self-serve model with the ability to scale resources up or down depending upon the needs of the customers. NIST’s definition of elasticity is as follows: “Capabilities can be rapidly and elastically provisioned, in some cases automatically, to quickly scale out and rapidly released to quickly scale in. To the consumer, the capabilities available for provisioning often appear to be unlimited and can be purchased in any quantity at any time.” Enabling elasticity in the cloud strongly implies virtualization of these resources. In cloud computing, this elasticity poses a significant challenge with regard to the mobility of resources across physical boundaries, i.e. servers, switches and possibly data centers. 

The elasticity and mobility of resources, coupled with the huge amount of data flowing into, out of and within a cloud, pose a significant challenge for digital forensics. Digital forensics was originally used as a synonym for computer forensics but has expanded to cover the investigation of any system capable of storing digital data, which now includes the cloud. The cloud in particular poses some specific challenges. Waldo Delport and MS Olivier present the process for conducting digital forensics in their paper; a presentation based on that paper is available online. For a particular investigation, an instance or a group of instances would need to be analyzed. The process of analyzing these “crime” domains presents challenges, some of which are listed below:

  • Identification of the location of an instance or a group of instances
  • Blocking and isolating traffic for a particular instance before the start of the investigation process
  • Guaranteeing non-contamination of the instance
  • Separation, i.e. ensuring that data unrelated to the incident is not part of the isolation. 

Isolation of these resources, both during normal operations and during the forensics process in the event of an investigation, is important. This isolation is necessitated by the inherent multi-tenancy and sharing of resources in the cloud. Successful isolation of an instance must be maintained while providing Confidentiality, Integrity and Availability (CIA) of the resources at all times. The data collected from the instances for forensic analysis is not covered in this blog post, though Nimbula Director provides all the necessary data from the infrastructure for analysis.

Nimbula Director provides the following mechanisms to aid digital forensics: 

  • It provides scalable and unique identification mechanisms for instances across the nodes of the cloud.
  • It provides the security mechanisms to block and isolate traffic. Nimbula Director has a highly scalable and distributed role-based network policy mechanism. Security policies are defined, and access between VMs is expressed in terms of these policies, called security lists. In a cloud environment, where dynamic resource provisioning can see instances launched or terminated frequently, assigning an instance to one or more security lists enables cloud administrators or auditors to isolate instances with a single API call.
  • The role-based and resource-based object permissions system unique to Nimbula Director enables cloud administrators to manage ownership of instances and guarantee isolation in the case of audits.
  • To extend the isolation of an instance or set of instances, Nimbula Director supports the ability to snapshot instances and tag nodes. Instances under investigation can be isolated either by moving them to a quarantined area or by moving the other instances to other nodes.
  • All critical events and configuration changes are logged to enable postmortem of specific instances.

Nimbula Director exposes these mechanisms through RESTful APIs, which makes it easier for cloud administrators or auditors to develop digital forensics auditing and data collection tools programmatically and to automate the forensics process. 
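For illustration, here is a minimal Python sketch of the kind of forensic isolation workflow such APIs make possible. The endpoint paths, payloads and authentication shown are hypothetical placeholders, not Nimbula Director's actual API:

```python
import requests

# Hypothetical endpoints and payloads -- the actual API paths, authentication
# scheme, and object names are not specified in this post.
API = "https://director.example.com/api"
AUTH = {"Authorization": "Bearer <token>"}   # placeholder credential

def isolate_instance(instance_id: str, quarantine_seclist: str) -> None:
    """Move a suspect instance into a quarantine security list, snapshot it,
    and record the action for the audit trail."""
    # 1. Attach the instance to a security list whose policies deny all
    #    traffic except that from the forensic workstation.
    requests.post(f"{API}/seclist/{quarantine_seclist}/members",
                  json={"instance": instance_id}, headers=AUTH).raise_for_status()

    # 2. Snapshot the instance so its state is preserved before analysis.
    requests.post(f"{API}/instance/{instance_id}/snapshot",
                  json={"reason": "forensic-investigation"}, headers=AUTH).raise_for_status()

    # 3. Log the isolation event for the post-mortem record.
    requests.post(f"{API}/events",
                  json={"type": "forensic_isolation", "instance": instance_id},
                  headers=AUTH).raise_for_status()
```

The point is simply that every step of the isolation process described above can be driven programmatically rather than through manual intervention.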

Choosing A Cloud Software Partner

by Jay Judkowitz 

This is the third in a series of four articles discussing infrastructure as a service (IaaS) clouds. The series started with basic level setting and we will now dive progressively deeper. The topics for the series are:

1. Cloud 101
    - What is cloud
    - What value should cloud provide
    - Public, private, and hybrid cloud
    - Starting on a cloud project
2. Application taxonomy, what belongs in the cloud, and why
3. What you should look for in cloud infrastructure software
4. Evaluating different approaches to cloud infrastructure software

The concepts section describes architectural design points you should ask vendors about to make sure that they are thinking like a true cloud provider and not simply cloud-washing older technology to try to be relevant in a new world. Keep in mind that these concepts are about infrastructure management in general, not just compute. You should think about the storage, network, and power aspects of your cloud in the same way.

The specific functionality section lists cloud features to check for.  If too many of these are missing, the cloud value proposition will not be delivered.

Core Concepts and Philosophy

Scale

In a cloud, scale is the key to long-term success. The number of nodes and instances, simultaneous connections to the management system, the networking and security features, etc. all need to scale. For each and every exciting and valuable feature a cloud vendor touts, you need to ask, “Can I have tens of thousands of those? What is the experience at that scale? When, if ever, does the scale impact the end user and how they do their work?”

While one can deploy a small-scale cloud, if a cloud is successful, it will become a single pool of capacity for an entire organization or even multiple organizations. In fact, the larger a cloud scales, the more cost savings and value it generates, since you start to see the benefits of “the law of large numbers”. If you do not build for scale from the very beginning, you will hit a wall and need to create separately managed clouds. This will force end users to decide what workloads go on what clouds, breaking the frictionless self-service model. Furthermore, capex benefits will be lost as you are forced to overprovision each cloud fragment rather than benefiting from a single pool hosting many applications with offsetting resource consumption curves.

For example, in a non-cloud deployment, the datacenter management system is used by datacenter admins only. If it can only deal with tens of simultaneous connections and is limited to one or two nodes, there is no problem since the administrative team is relatively small. However, in a cloud, since a large number of end users drive the management system directly via self-service workflows, the management system requires a whole new level of scale.

Automation

Automation is the key for allowing end users to do their own work and also for lowering datacenter operation costs. Make sure there is a proper degree of automation for both application lifecycle operations and infrastructure operations.

The core principle for end user operations is that no end user task should ever trigger work on the datacenter administrator side, not even a single mouse click approval. There is still a high degree of control and protection required, but these controls must be implemented as up front policies where the right groups of people delegate the right privileges to the right consumers. Furthermore, there need to be audit trails so that one can show that the policies constrained people to the proper activities. However, none of this changes the fact that manual approval processes by central admins on regular daily end user operations cannot work in a cloud model.

The core principle for datacenter operations is that the cloud should be self-discovering, self-organizing, self-monitoring, and self-healing. Anyone that sells you a complete zero-touch datacenter today is certainly exaggerating, but you should check the features they have to make sure this philosophy is followed where technically feasible. Where manual intervention is needed, make sure that it is required only for infrequent up-front tasks and never for routine operations that happen frequently.

As an example, initial cloud configuration and network setup may be items that require significant up front planning and hours to days of setup, but regularly growing the cloud deployment by adding nodes must require no more time and effort than it takes to rack the systems and plug them in.

Identity, Permissions and Delegation

Clouds need to understand who each user is, what groups they belong to and what customer or tenant their work is billed to. Each operation on each object needs to check the identity of the actor against the permissions system to make sure that the operation is allowed. Delegation then needs to be possible – from cloud admin to customer admin to end users and groups, and possibly between separate end users and groups. Without a strong concept of identity, permissions and delegation, your cloud will only scale to a single tenant and will never fully interoperate well with other clouds, thereby limiting the long-term benefit you derive from the system. Like scale and automation, this is a core design choice.
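As an illustration of the principle (not any particular vendor's implementation), a minimal sketch of such an identity, permissions and delegation check might look like this:

```python
from dataclasses import dataclass, field

@dataclass
class Grant:
    """A permission granted on an object, optionally re-delegatable."""
    principal: str          # user or group name
    operation: str          # e.g. "instance:launch", "volume:attach"
    delegatable: bool = False

@dataclass
class CloudObject:
    owner: str
    tenant: str
    grants: list = field(default_factory=list)

def is_allowed(actor: str, groups: set, operation: str, obj: CloudObject) -> bool:
    """Every operation on every object is checked against identity and grants."""
    if actor == obj.owner:
        return True
    return any(g.operation == operation and (g.principal == actor or g.principal in groups)
               for g in obj.grants)

def delegate(grantor: str, grantor_groups: set, obj: CloudObject, grant: Grant) -> None:
    """Delegation is itself a checked operation: you may only pass on rights
    you own outright or hold with the 'delegatable' flag set."""
    if grantor == obj.owner or any(
        g.delegatable and g.operation == grant.operation
        and (g.principal == grantor or g.principal in grantor_groups)
        for g in obj.grants
    ):
        obj.grants.append(grant)
    else:
        raise PermissionError(f"{grantor} may not delegate {grant.operation}")
```

In this sketch a cloud admin who owns an object can hand a delegatable right to a customer admin, who can in turn pass it to an end-user group without anyone opening a ticket.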

If cloud vendors do not have proper permissions systems for their objects or are lacking a way to delegate permissions through multiple levels, they are not thinking like a cloud vendor. The result will be trouble down the road as end users wind up having to place tickets to acquire permissions driving a heavyweight approval process where the owner of the resource and the end user’s management team need to be consulted.

Openness and Choice

Openness and choice mean that you have:

  • Independence at each layer: Your different cloud components are not locked in from end to end. A choice at one layer does not dictate a choice at another unrelated area.
    • Your choice of end-user self-service workflow management should never dictate your hypervisor or other core infrastructure component.
    • Equally importantly, your private cloud software should never dictate the choice of public clouds to which you can federate. Your end-user provisioning interface should work on your private cloud infrastructure, any public cloud using the same cloud software, and even any public cloud that uses competing or homegrown cloud software. Having to present different interfaces to your end users for clouds using different cloud infrastructure components is not open.
  • Complete and open APIs: Your cloud vendor should have extensive APIs. At the very least the APIs should cover everything provided in the UI. This will allow customized workflows at both the infrastructure level and the end-user level.
  • Extensible components: Your cloud vendor should use open and extensible components where possible.  Open source, where anyone can insert code at any point, is the extreme example of this principle. In non-open source systems, there are ways to introduce more controlled, but still extremely flexible extensibility models.  For example, major components can be general purpose enough that customers can add in other ecosystem products readily, as with the Linux domain 0 model for hypervisors. Alternately, APIs can be made to be robust and complete enough so that most conceivable useful integrations are possible such as the case with Windows APIs. This makes a big difference as you try to augment your cloud with best of breed 3rd party cloud management products.
  • Standards: Your cloud vendor should take advantage of open standards where possible where those standards do not unduly constrain innovation.

Without openness and choice you risk vendor lock-in and the high cost that comes from not being able to have a meaningful option to replace an infrastructure component. Technology lock-in slows down the rate at which you get new features you request from your vendor. A limited ecosystem and an inability to augment your cloud with the latest and greatest offerings from companies both new and established or from the open source community further limits your ability to improve your cloud over time. Lastly, limited choice in public clouds to which you can federate may force you into a cloud with the wrong feature set or that is too expensive.

Key Functionality

When evaluating the base functionality described here, make sure to bring in the philosophies above for each. Make sure that every feature below is implemented with scale, automation, permissions and delegation, and openness in mind.

In the spirit of openness, it is key to recognize that it is not required, nor even desirable, for all the features here to come from the cloud software vendor. Cloud software vendors should be able to present you with an ecosystem that helps fulfill the requirements below.

Self-Service Developer and Deployment Workflows

This is the core of the concept of cloud. End users need a way to do the following on their own (a sketch of such a deployment request follows this list):
  • Manage their images, update, and version them.
  • Publish images to a selected community for use in deployment workflows.
  • Deploy images and configure the following runtime parameters:
    • The number of instances
    • The images used
    • The placement policy
    • The network connectivity
    • The application configuration
    • The resource allocation
    • The storage to mount
  • Scale applications up and down and retire them when their usefulness has ended.
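As a rough illustration of what a self-service deployment request might carry, here is a sketch covering the runtime parameters above. The field names and client call are hypothetical, not any specific vendor's schema:

```python
# Hypothetical deployment request covering the runtime parameters listed above.
# Field names are illustrative only.
deployment = {
    "image": "webtier-v42",                      # the image (template) to use
    "instances": 8,                              # the number of instances
    "placement": {"anti_affinity": True,         # the placement policy
                  "site": "us-west"},
    "network": {"seclists": ["web-dmz"]},        # network connectivity
    "config": {"DB_HOST": "db.internal"},        # application configuration
    "resources": {"vcpus": 2, "memory_gb": 4},   # resource allocation per instance
    "volumes": ["logs-vol-17"],                  # storage to mount
}

# client.launch(deployment)   # submitted through the cloud's self-service API
```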

Reliability and Scale of the Management System

With cloud, when we think about the management system, we’re not just talking about basic monitoring. We’re also talking about the whole datacenter control system – how workloads are deployed, managed, and retired. In a non-cloud datacenter, the management system is for datacenter admins only. If it can only deal with tens of simultaneous connections and is limited to one or two nodes, creating a single point of failure that compromises reliability, there is no problem since the administrative team is relatively small. However, in a true cloud, with end users driving the management system directly in self-service workflows, a whole new level of scale is required.

The management system of a cloud needs to be able to scale to handle thousands of simultaneous connections.  Furthermore, it can never be down. It needs to be self-monitoring and self-healing. When and if a management node is lost, the remaining management nodes need to continue operation, and the lost management node needs to be replaced from the remaining equipment in the cloud so that the degraded state is resolved. All of this needs to happen automatically without impact to the end user or intervention on the part of the administrator.

Multi-Tenancy and Networking

For clouds to be of use to anything more than the smallest organizations, robust and secure multi-tenancy separation is required. This involves a great deal of network functionality.

Some customers will require traditional separation at layer 2 like VLANs. Those networks need to be managed flexibly and securely.

  • The cloud needs to allow for the layer 2 network to be exposed to large sets of nodes without difficult configuration.
  • The cloud needs to make sure that there is a security system governing access to each layer 2 network so that only the right workloads from the right customers are placed on a given network.
  • The layer 2 network should provide the equivalent of a broadcast domain and support all traffic types (unicast, multicast and broadcast). It should also support IPv6 and any other layer 3 protocol, not just IPv4.

Other customers, who do not want to be limited to the scale and manageability constraints of layer 2 networks and who do not need a broadcast domain, may choose a more modern and flexible large flat cloud network with an integrated distributed firewall providing the isolation between customers’ workloads. This distributed firewall service needs the following (a sketch follows the list):

  • To have a central configuration repository that ultimately informs the separation between workloads created by different customers as well as between workloads within individual customers.
  • To be configurable by the infrastructure administrator, the individual customer administrators, and even the end users who need to control access to their own work product within their organizations.
  • To have its configuration be independent of workloads and IPs – adding and removing workloads must not cause reconfiguration of the distributed firewall service.
  • To execute in a distributed manner that avoids network bottlenecks.
  • To be independent of server, building, network vendor, network topology, and even geography.
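To make the "independent of workloads and IPs" requirement concrete, here is a minimal sketch of a firewall policy expressed against named security lists rather than addresses; the structure and names are illustrative only:

```python
# Rules are expressed between named security lists, never raw IPs, so that
# launching or terminating workloads changes only list membership -- the rule
# set itself stays untouched.
firewall_policy = {
    "rules": [
        {"src": "tenantA/web", "dst": "tenantA/db", "port": 5432, "allow": True},
        {"src": "*",           "dst": "tenantA/db",               "allow": False},  # default deny
    ],
}

# Membership is maintained separately; the distributed enforcement points on
# each node resolve list names to current members when evaluating traffic.
membership = {
    "tenantA/web": {"inst-101", "inst-102"},
    "tenantA/db":  {"inst-200"},
}

def allows(src_inst: str, dst_inst: str, port: int) -> bool:
    """First matching rule wins; unmatched traffic is denied."""
    for r in firewall_policy["rules"]:
        src_ok = r["src"] == "*" or src_inst in membership.get(r["src"], set())
        dst_ok = r["dst"] == "*" or dst_inst in membership.get(r["dst"], set())
        if src_ok and dst_ok and r.get("port", port) == port:
            return r["allow"]
    return False
```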

Storage Management

Like compute, storage needs to be aggregated into large pools for access by end users, so that end users are insulated from the details of the different storage devices and of which objects are placed on which device.

Unlike compute, there are widely varying capabilities and prices for different storage devices, so some aggregation system is needed to create pools with different service levels, where customers can decide the capabilities they need and are willing to pay for when storing their storage objects. This customer decision then drives automated pool selection and, ultimately, device selection.
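A minimal sketch of how a customer's service-level choice might drive automated pool selection, with made-up pool names and prices:

```python
# Illustrative pool catalogue: the customer picks a service level and the
# system picks the pool (and ultimately the devices) behind it.
pools = [
    {"name": "gold",   "media": "ssd",  "iops_per_gb": 3.0, "price_gb_month": 0.30},
    {"name": "silver", "media": "sas",  "iops_per_gb": 0.5, "price_gb_month": 0.10},
    {"name": "bronze", "media": "sata", "iops_per_gb": 0.1, "price_gb_month": 0.03},
]

def select_pool(size_gb: int, required_iops: float, max_price_gb: float):
    """Return the cheapest pool that satisfies the requested service level."""
    candidates = [p for p in pools
                  if p["iops_per_gb"] * size_gb >= required_iops
                  and p["price_gb_month"] <= max_price_gb]
    return min(candidates, key=lambda p: p["price_gb_month"]) if candidates else None

# e.g. select_pool(500, required_iops=200, max_price_gb=0.25) -> the "silver" pool
```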

Like with all other cloud resources, the end-user created storage objects need to be created through self-service workflows without administrator interaction, but must also be governed by a robust permission and delegation system governing which storage can be used by which users.

The storage objects created by end-users in those pools need to be managed independently of the instances that mount and access them. This way, creating, updating or deleting workloads does not affect the core information the customer needs to preserve over time. Workloads that create data can be killed, redeployed from an updated template and reattached to storage without impact to the storage object itself. Also, storage objects should be able to be cloned or snapshotted for use by future instances or for rollback processes without any interaction with the running instance accessing it.

Billing and Chargeback

Core to the economic model of cloud is the ability to have end customers either pay for their usage or at the very least to understand their impact on datacenter costs. To that end, there needs to be complete metering APIs and a chargeback or showback system for the cloud.
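As a sketch of the idea (with invented resource names and rates), metering records can be rolled up into a per-customer showback report like this:

```python
from collections import defaultdict

# Hypothetical metering records as a complete metering API might return them:
# (customer, resource, quantity consumed).
usage = [
    ("marketing", "vcpu_hours",        1200),
    ("marketing", "gb_storage_hours", 50000),
    ("analytics", "vcpu_hours",        9800),
]

rates = {"vcpu_hours": 0.04, "gb_storage_hours": 0.0001}   # $ per unit, illustrative

def showback(usage, rates):
    """Aggregate metered consumption into a per-customer cost report."""
    bill = defaultdict(float)
    for customer, resource, qty in usage:
        bill[customer] += qty * rates[resource]
    return dict(bill)

# showback(usage, rates) -> {"marketing": 53.0, "analytics": 392.0}
```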

Hands-Off Infrastructure Management

Management of the physical infrastructure should be as low touch as possible. This includes many aspects:

  • Installation of the nodes: Node installation becomes a frequent operation in a big, fast-growing, and/or mature cloud where parts need to be replaced regularly. Manually installing or configuring servers will be too expensive and error-prone in this world. The only proper experience is for the servers to be racked and connected, then powered on – and nothing else. The cloud needs to auto-discover the server, install it, and make it ready to accept workloads.
  • Intelligent workload placement: Workloads should be placed automatically, without administrator involvement, such that they are (see the placement sketch after this list):
    • Loosely packed enough that bottlenecks and performance problems are not created, since dealing with those problems reactively is impractical at scale.
    • Tightly packed enough that hardware, power, and cooling are not wasted.
    • Strategically placed so that related workloads cohabitate for enhanced inter-workload communication and redundant workloads are separated to eliminate single points of failure for the service.
    • Placed based on constraints such as the requirement to be on a node with GPUs or on a node that is certified PCI compliant.
  • Capacity tracking: There needs to be cloud-wide tracking of resources so that the datacenter operators are aware of cloud capacity and when they need to acquire more hardware.
  • Isolating and retiring equipment: All systems should have a lifetime and a health status associated with them (due to length of maintenance contract, expected lifetime of component parts, and/or length of lease). When that lifetime is exceeded or when the part is failing or has failed, it is automatically isolated from the cloud and flagged for replacement. The cloud should be aware of datacenter layout so that administrators never have a problem locating the equipment at replacement time.
  • Managing planned and unplanned downtimes within the datacenter: If you generally deploy cloud-ready applications (see last article in this series), most datacenter events should be transparent to the end users of the services. Scale-out applications can be scaled up to repopulate lost instances and chunks of huge compute jobs can be automatically respun. However, downtimes associated with persistent data need to be managed as well as compute or network downtimes that affect any of your more monolithic applications. The datacenter should recover whatever it can on its own, and for what it can’t, end users need to be alerted to upcoming planned downtime or recent unplanned downtime and made capable of adjusting their workload deployments accordingly.
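Here is the placement sketch referred to above: a simplified, purely illustrative chooser that honors hard constraints and anti-affinity while packing nodes up to a utilization ceiling. It is not any product's actual placement algorithm:

```python
def place(workload, nodes, existing):
    """Pick a node for a new workload. Purely illustrative: honor hard
    constraints, keep redundant replicas apart, and pack tightly without
    pushing any node past a utilization ceiling."""
    CEILING = 0.8   # leave headroom so bottlenecks are not created

    def eligible(node):
        # Hard constraints, e.g. {"gpu": True, "pci_certified": True}.
        if any(node["tags"].get(k) != v for k, v in workload["constraints"].items()):
            return False
        # Anti-affinity: redundant instances of the same group never share a node.
        if workload["group"] and any(w["node"] == node["name"] and w["group"] == workload["group"]
                                     for w in existing):
            return False
        return node["util"] + workload["util"] <= CEILING

    candidates = [n for n in nodes if eligible(n)]
    if not candidates:
        return None   # surface as a capacity-planning signal rather than overpack
    # Among eligible nodes, pick the fullest one so hardware is not wasted.
    return max(candidates, key=lambda n: n["util"])
```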

Federation Across Clouds

To provide a single interface for all end-users, a cloud must hide distinctions between different datacenters, geographies and providers. There should be one end-user experience for deploying to anywhere in the cloud – public, private, or hybrid. While end users, due to compliance reasons, may need to dictate placement policy in terms of location or provider, it should be a policy component of their work within a single experience – it is not acceptable for there to be a private cloud experience and completely separate public cloud experience.

It is critical that the choice of public clouds to federate to is not forced by the cloud software provider; that would be too limiting. If the customer cannot pick the best possible provider at the right cost, cloud deployment becomes unnecessarily expensive.

Federation features need to be as follows:

  • From a single user interface, resources can be deployed and managed across multiple sites and providers.
  • Each site and provider can be allowed or denied on a per customer or per user/group basis.
  • When accessing public clouds that are tied to shared credential and billing information, like private keys and credit card numbers, the end user must have that information hidden from them. They don’t need to know it and they must not be able to take it with them when they change jobs or roles.
  • Identity is preserved across sites and providers so that:
    • Users are permitted access to resources at each site according to their specific permissions.
    • Bills from public cloud providers can be itemized by department, user, and project.
  • There needs to be a single audit trail showing who did what activity in what clouds.

When a cloud management system follows these rules, multiple sites within an organization can be managed as one. Furthermore, hybrid cloud becomes a reality with public cloud becoming a viable part of the IT toolkit, not a bootleg process hidden from the visibility of those most trained and responsible for keeping services safe and secure.

Conclusion

Enterprise IT departments and service providers have no shortage of choices today for cloud infrastructure software. But, for an organization doing a significant deployment, the list of requirements above can help separate the serious contenders from less mature or less well-thought-through products.

When writing RFPs, don’t just fall back on the same enterprise management or virtualization platform requirements, or you will wind up with the same old infrastructure. Start with your traditional requirements, but make sure to add in critical new requirements around what is really needed to take the next step and have a real cloud today!

What Is Best Suited For The Cloud

by Jay Judkowitz

This is the second in a series of four articles discussing infrastructure as a service (IaaS) clouds. The series started with basic level setting and we will now dive progressively deeper. The topics for the series are:

1. Cloud 101
    - What is cloud
    - What value should cloud provide
    - Public, private, and hybrid cloud
    - Starting on a cloud project
2. Application taxonomy, what belongs in the cloud, and why
3. What you should look for in cloud infrastructure software
4. Evaluating different approaches to cloud infrastructure software

Cloud is obviously a serious transformation for your datacenter, but that transformation does not need to be far off or futuristic. If you pick the right applications owned by the right users with the right needs and if you partner with the right cloud software provider, cloud and its many benefits are achievable today.

When first deploying a cloud, it is critical to choose the right applications to move to the cloud initially. Given that cloud is about allowing business units to manage their own computing needs, it follows that the ideal place for cloud is where the business units:

  • Expect self-service.
  • Are tolerant of incorporating provisioning logic into their day-to-day work.
  • Have variable compute needs that generate frequent provisioning and de-provisioning activity.
  • Have multiple tasks that cause context switching in the customers’ work, with different tasks acquiring and yielding compute resources over time.
  • Have workloads that, even at maximum interaction, do not cause complex resource contention issues on shared resources so that as much or as little can be deployed as necessary with little if any forethought or calculation.

If you break down the types of workloads in a datacenter, you get three major types:

  • Traditional, monolithic, and stateful client/server apps – things like Exchange Servers and traditional databases
  • Scale-out load balanced apps with disposable stateless instances
  • Batch type computing jobs that can be decomposed into small chunks of compute and storage and distributed across a pool – things like Hadoop, Monte Carlo simulations, business analytics, and media processing and conversion

For each class of application, there are dev/test deployments and production deployments. This gives us a simple six-way classification of what runs in a datacenter that we can use to select our cloud candidates.

The assessment below shows which workloads are good for early cloud adoption and which should be moved to the cloud later in the process: scale-out and batch applications are near-term opportunities in both dev/test and production, while traditional IT applications are a near-term fit for dev/test but should be brought to the cloud later for production. The sections that follow provide some explanation and justification for this assessment.

Scale-out Load Balanced Apps with Disposable Instances

This type of application does not rely on any one instance of software being able to grow to use tremendous amounts of compute resource. Rather, it assumes that each instance can only do so much and that additional power will be supplied by adding more instances. For this sort of application to work, the data and other application state must be driven out of the application itself to another location. This allows instances to be created and destroyed with no data loss or outage to the end user. The worst problem caused by failure of an instance is the need for the end user to retry their operation. When the operation is retried, it finds a live instance and completes without difficulty.

Generally, the application requires more or fewer instances over time as load grows and shrinks. This creates a need for dynamism in the deployment, with low turnaround time on provisioning operations. As load grows, the application must respond in a short period of time. Either the application administrator or some auto-scaling management system needs to be able to make the required changes without a ticket to the infrastructure team. When load is high, new instances must be spawned to handle the increased demand. When load drops, instances should be deleted to reclaim compute resources for more useful activity. Because the instances are stateless, without sufficient load to service, no instance has an inherent reason to remain persistently deployed.
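A minimal sketch of that auto-scaling behavior, assuming a hypothetical self-service API client (`cloud`) and a per-instance load metric:

```python
import time

def autoscale(cloud, app, min_instances=2, max_instances=50,
              high_water=0.75, low_water=0.25, interval=60):
    """Grow the pool when average load is high, shrink it when load is low.
    `cloud` stands in for a self-service API client; no ticket to the
    infrastructure team is ever involved."""
    while True:
        instances = cloud.list_instances(app)
        load = sum(i.cpu_load for i in instances) / len(instances)

        if load > high_water and len(instances) < max_instances:
            cloud.launch(app, count=1)         # stateless, so just add one more
        elif load < low_water and len(instances) > min_instances:
            cloud.terminate(instances[-1])     # stateless, so any instance will do

        time.sleep(interval)
```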

Scale-out applications are also a great fit with the datacenter operations models required to manage a massive cloud.  Cloud scale datacenters need to assume that any piece of equipment can break at any time with the application being resilient and able to hide the failure from the end user. Scale-out applications accomplish this by spreading many instances across nodes, datacenters, or even geographies. This makes it much simpler for the infrastructure to be managed – failures become a capacity issue which can be managed in aggregate on a periodic basis, not end-user outage issues that need to be addressed immediately and individually.

More generally, this type of application is the way of the future. Given current system architectures, we are seeing more and more cheap cores distributed across many commodity nodes rather than massive scaling up of individual servers. The only way to really get applications to scale is to build for many smaller instances teaming up together. Some programming infrastructures take this to its logical conclusion. Node.js, a JavaScript runtime increasingly used for scale-out web applications, refuses to give a program access to more than one core regardless of how many cores are in a system. Should developers need more power, they must increase the number of Node.js application instances.

Scale-out applications and cloud are such a good match because they have the same goals: elastically scaling up and down in response to load using only the resources actually needed, distributing applications across hosts and sites to better tolerate system outages without end-user impact, and conforming better to modern system architectures.

Batch Applications

Batch applications are decomposable into smaller compute and storage packages. They are good in both test/dev and production for reasons similar to the scale-out applications. As jobs are launched, the number of instances depends on how finely you can chunk the job. Given different sizes of data sets for separate runs, the number of instances needed for each run will vary. Therefore, statically provisioning instances doesn’t make sense. Furthermore, there are likely to be completely different applications in such an environment that will need to run at different times. Clouds are perfect for repurposing an infrastructure, ramping up one application while ramping down another without having to do a massive retooling of the physical infrastructure. As in the previous case, the infrastructure administrators should not decide how many instances to deploy or when. Activity needs to be driven by the business that is actually running the jobs and deriving benefit from the output.

Traditional IT Applications

Traditional IT applications, being vertically scaled, stateful and monolithic, are not a good fit for clouds in their production deployments. These applications tend to be custom built with a particular load in mind, deployed once, expected to have each instance stay running persistently, and touched only for upgrades. They are not a natural fit for the cloud paradigm. They do not accommodate dynamic scaling by simply adding more instances, so there is a reduced need for end-user self-service. Also, since they are monolithic, any performance or availability problem must be addressed in the one or two instances that make up the application, with a deep understanding of the impact of the underlying infrastructure. As a result, the burden of management remains on the infrastructure administrator and cannot shift to the application administrator. This breaks the operational model of cloud, where the datacenter administrator needs to be removed from the details of running applications.

However, traditional IT applications are very well suited to cloud when the service is development, integration, and testing of the applications. For this function, application owners need to deploy new instances of server software, update them, make new templates, try out their work and iterate. Furthermore, mass numbers of clients will need to be deployed for load and scale testing. The development and testing process generates the dynamism in workload resource needs as well as the requirement for self-service on the part of the IT developer that make a cloud an ideal solution.

Pick the Right Applications and Move to Cloud Today!

Too many times cloud is pitched as an evolutionary technology. In many cases, this is because the vendor making this pitch is already managing a legacy application stack for the customer and sees no reason for a radical shift.

Since these legacy applications do not accommodate elasticity and do not tolerate the more unpredictable availability of any single server that the cloud datacenter operations model implies, true clouds are limited in the benefits they can provide to them and can cause SLA degradation that is unacceptable to the end users of the legacy applications.

Of course, legacy applications will not go away any time soon and we acknowledge that it takes tremendous time and effort to move to a new programming paradigm. But, the technology is here today and the benefits have been made obvious – scale, resiliency, efficiency. The success stories of companies like Netflix and Zynga are well known. All that is needed is the will to move in that direction.

For enterprises and service providers that leverage modern application development processes, cloud is not an evolution at all – cloud is the best and most obvious way forward for development, testing and mass deployment of their applications.

Pick your target applications and get started today!

Cloud 101

by Jay Judkowitz

This is the first in a series of four articles discussing infrastructure as a service (IaaS) clouds. The articles will start with basic level setting and will dive progressively deeper as the series progresses. The topics for the series will be:

1. Cloud 101

  • What is cloud
  • What value should cloud provide
  • Public, private, and hybrid cloud
  • Starting on a cloud project

2. Application taxonomy, what belongs in the cloud, and why
3. What you should look for in cloud infrastructure software
4. Evaluating different approaches to cloud infrastructure software

What Is Cloud?

Cloud is fundamentally about creating a dynamic computing infrastructure that enables end users to service and manage themselves in a process that is frictionless and instantaneous, but also secure and controlled. Resources are allocated in a very fine-grained fashion and can be relinquished at any time. Usage and chargeback are measured per customer, per unit of resource utilization, and per unit of time, not per physical piece of equipment.

Cloud is not an incremental step on top of virtualization. While virtualization is a key enabler of cloud, the motivation, focus, target applications, and evaluation criteria are very different.

For clouds to be useful and cost effective, they must also be very scalable and smooth to manage at the infrastructure level. The idea is to remove as much of the day-to-day operational work as possible from the datacenter IT team and to scale that group’s reach and efficiency. The primary responsibility of the datacenter managers should be scaling the cloud in response to growth in demand. They should be able to look at utilization in aggregate and plan and deploy space, power, networking, and servers in as close to a just-in-time manner as possible. They should no longer need to focus on project-specific deployment activity – that is handled by a combination of the end users’ self-service activities and automated responses from the cloud itself.

Private clouds represent the transformation of an IT datacenter into a large self-service pool of resources for internal customers to use. Public clouds open up that service to external organizations to purchase. Hybrid cloud refers to when one organization uses a private cloud for some of its work and a public cloud for other work; and where there is continuity between the public and private cloud utilization. In hybrid clouds, identity, policy and user interface are common, thereby blurring the distinction between public and private to the end user. This series discusses cloud in general with public, private and hybrid being implementation decisions for specific projects. We will introduce distinctions between these cloud types only where necessary.

Cloud Business Value

The end result of cloud to the business is two-fold. Primarily, cloud enables end users to service their own IT needs in a frictionless manner, making the business much more agile – business units and individuals can innovate quickly and execute on new ideas immediately, before they become stale. A secondary but still crucial benefit is that the cost of IT and its value become intimately tied together with a high degree of transparency. Costs are minimized and associated with end user organizations and business units. Let’s dive into the cost analysis in a bit more detail.

Capex is completely eliminated for any work that can be adequately served by a public cloud. For work that needs to be done inside an organization in a private cloud, capex becomes just-in-time and is always justified by current and projected usage. The capex, plus operating expenses, is totaled, and a per-unit-of-time, per-resource cost is calculated. That cost is then shown back or actually billed to the different business units. This enables the business units to make sound and informed decisions on what to deploy and not to deploy based on the value of the work they are doing. This sort of calculus is always best decided by the business unit, not IT, as the business unit is responsible for its own P&L. In this way, capex is reduced because the business unit is incentivized to use only what it needs to drive the most value for the organization.
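A small worked example of that calculation, with purely illustrative numbers:

```python
# Illustrative numbers only: amortize capex over the hardware's useful life,
# add opex, and divide by the resource-hours the pool actually sells.
capex = 2_000_000          # servers, network, storage ($)
useful_life_years = 4
opex_per_year = 600_000    # power, cooling, space, support ($)

cores = 4000               # sellable cores in the pool
utilization = 0.60         # average fraction of capacity actually consumed

hours_per_year = 24 * 365
cost_per_year = capex / useful_life_years + opex_per_year   # 1,100,000
sold_core_hours = cores * hours_per_year * utilization      # 21,024,000

rate_per_core_hour = cost_per_year / sold_core_hours
print(f"${rate_per_core_hour:.4f} per core-hour")           # ~ $0.0523
```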

Opex costs like power, cooling and real estate are reduced as a function of consolidating the datacenters, pooling servers, reducing wasted capacity and incentivizing end users to make lower cost requests as just described above, but they will still remain a substantial cost.

However, the administrative expense component of datacenter opex can be brought much closer to zero through infrastructure automation. The provisioning of workloads and other opex that formerly belonged to central IT is moved to the business units, not as an IT operation, but as a part of the normal flow of the day to day activities they do to get their jobs done. This relieves the business from much of its opex and/or allows the business to reallocate IT staff to more value generating activity.

Besides value to the business as a whole, cloud provides another set of benefits to the IT team specifically, which should motivate IT leaders to drive the cloud discussion inside their organizations. Simply stated, cloud can make IT loved again. Historically, IT brought in technology innovations that improved the lives of their internal customers – PCs, networks, databases, client/server applications, e-mail, mobile computing, virtualization, etc. Now, services like Amazon’s AWS have set a new expectation of IT systems responsiveness. Fairly or unfairly, end users are coming to expect instantaneous gratification of their IT desires without the need for planning, budgeting or security audits. As a result, they are becoming impatient with IT. They wonder why IT costs so much and takes so long to deliver. By implementing a good private cloud, IT can deliver a finite set of resources in an on-demand manner without compromising security or compliance. Hybrid clouds allow IT to extend their services to handle bursts of unplanned activity that the private cloud does not have the capacity to meet. By enabling offload to public cloud while simultaneously adopting policies and controls over who can do what in public clouds, IT can introduce public cloud as another tool in the IT toolkit without abdicating their traditional responsibility for the safety of data and applications. This will prevent the skunkworks use of public clouds that too many companies see when lines of business try to circumvent IT. Clouds can make IT the hero again.

Getting Started With Your Cloud

Now that we know what cloud is and what we should expect from it, here is a proposed journey toward the easiest onramp and the highest-value result.

Segregate Applications

First, you need to find the right applications for cloud. As we will describe in the next article in this series, the best fits are:

  • Scale-out load balanced applications with stateless instances
  • Batch processing applications
  • Test and dev for the above two and for more traditional IT applications

As for legacy stateful IT applications deployed in production, non-cloud solutions will suffice. In many cases server virtualization will help – it can lower capex and increase service levels. Well known solutions from VMware, Microsoft, Citrix, RedHat and others can help here.

Make sure you know which end customers and which workloads are the best early cloud candidates according to the ease with which they can move to cloud and the extent to which cloud provides them with real value.

Incrementally Add Projects to the Cloud

Don’t try to boil the ocean – pick the ideal project to start with and add more challenging projects as you experience success. Plan subsequent projects incorporating knowledge gained from previous projects.

Pick a Project Based on End-User Needs and Application Type

Within the applicable projects, pick one that has a burning pain, eager customers, a scale small enough to experiment with yet large enough to be a meaningful test, and that aims for results that can be measured – in terms of cost, speed to get results, lowered administration time, etc.

Pick a Private/Public/Hybrid Strategy for This Project

In a later article we will emphasize the need for your cloud software to support public, private and hybrid clouds. Assuming you have chosen a software partner that accommodates this choice, you need to make a decision for this particular project.

Choose public cloud if your project:

  • Has highly variable needs without other projects that are able to statistically offset it.
  • Is not expected to persist for an extended period of time and does not justify its own dedicated infrastructure.
  • Does not have significant needs for privacy, security, or regulatory compliance, unless there is a public cloud provider that specializes in delivering the right assurances for a project of this specific nature.

Choose private cloud if:

  • A single project or the sum of a few projects is expected to have flat or steadily growing compute needs.  Avoid projects that spike heavily and drop sharply unless they can be statistically offset by other projects that peak at different times.
  • The project is either long-lived, or short-lived but will upon completion be replaced by projects of equal or greater resource needs.
  • The project has significant needs for privacy, security, or regulatory compliance that cannot be met by general-purpose or even specialty public cloud providers.

Choose hybrid cloud if the project can be meaningfully segmented into parts that have the private characteristics and other parts that have the public characteristics. Placing the right work in the right place will give you the best balance of cost, security, flexibility and return on investment.

Deploy Project

Deploying the project includes the following steps:

  • Engaging with the end user customer base to explain and sell the project.
  • Collaborating on end goals, desired metrics, and a timeframe for evaluation.
  • Acquiring, deploying and configuring physical infrastructure (if you chose private or hybrid cloud).
  • Choosing the cloud management software and deploying it.
  • Training the end-users and their management team on the self-service workflows – both for the delegation of rights and for the actual execution of work.

Evaluate

At regular points in the project, measure the results for the end-user customer base and review them with the customers. The goals need to be quantifiable and should be set at the beginning of the project. Goals may be around project completion time, time it takes to do specific tasks, overall cost or utilization of the infrastructure, etc. If the results are not what was expected, adjust the deployment to try to meet the goals. This can involve reconfiguring HW or SW, redoubling customer training, or adding in custom automation developed internally or by contractors.

At this point, it is also good to engage with the HW and SW providers to make sure they can identify any errors in deployment or deviation from best practices that are hampering the results. At this time, you will also have a better idea of your needs and will be in a position to make stronger and more prioritized feature requests.

Bring in Least Needful Applications for Sake of Conformity of Process, Service Levels, etc.

Only after you have had success in the ideal cloud use cases should you bring in the less applicable workloads. Though it may be harder to do so and the benefits may be less, over the long haul your IT department will benefit from a common infrastructure layer for datacenter management and a common end-user interface for deploying workloads, tracking their status and measuring their cost.

EBS, a very well done solution to one of the hardest computer science problems

by Jay Judkowitz

Real cloud storage lessons from the AWS outage

It’s been very interesting watching the online firestorm over Amazon’s EBS outage.  I was not at the recent Interop show, but apparently there was an entire panel discussion about it, and then a Twitter flame war between representatives from VMware and Amazon.  Then there were countless articles and blogs, all of which focused on some questions of mild interest with obvious answers.

  • Is EBS a good or bad service?
  • Will this affect people’s move to public cloud?

The answers to these are pretty uncontroversial, in my opinion. 

  • EBS is a very well done solution to one of the hardest computer science problems out there – how do you construct an infinitely scalable storage service out of commodity disk for read/write transactional data with strict consistency.  The fact that EBS has gone this long without an outage of this magnitude is a tribute to the AWS team.  They clearly made some poor choices but they will surely fix those over time.  However, few seem to be paying attention to what AWS has done correctly.  The naysayers would be hard pressed to name a production service with EBS’ characteristics deployed at EBS’ scale.
  • The best online comment on the topic of the impact to public cloud adoption likened this to an airline crash.  Airplanes crash from time to time, and those crashes always make for sensational news.  But, in terms of cost and safety, air travel remains the best way to travel long distances, so people forget the crash and keep flying.  There is a segment of people who will be nervous for some time with public cloud for some data and apps, but the trend towards public cloud will continue as before.

While all of the uproar is quite entertaining, it is not useful.  As an industry, we can be a bit more thoughtful than this.  This blog post is an attempt to get to the real lessons of this incident, which center on the industry’s transition to scale out storage models for transactional storage in the cloud.

The scale out model is undoubtedly the right architecture for cloud storage in general, especially where eventual consistency is sufficient.  It provides:

  • A highly virtualized interface – it’s one big pool of storage where placement across even thousands of nodes is completely automated
  • Great aggregate performance
  • Flexibility in the face of arbitrary failures
  • The ability to grow steadily in small increments of commodity parts as the cloud itself grows, rather than in massive chunks of proprietary equipment

But, when applied to transactional workloads, there are issues with scale out storage that (a) the vendor community still needs to work out and that (b) customers must be aware of and plan around when they take the leap.

So, why is scale-out storage for transactional workloads so hard?  To be useful to clouds, the scale-out transactional storage system needs to have the following qualities with respect to traditional enterprise storage.

  • Reliability: Cloud storage needs to be almost as reliable as enterprise storage – four or five 9’s is called for.  In this blog, I’ll speak of local availability only, not cross site-DR – that’s a topic for another day.
  • Cost: The cost of scale-out storage is expected to be considerably cheaper than enterprise storage.  Keep in mind that a lot of the enterprise storage price is in software, services, and margin.  For big cloud deals, storage vendors will negotiate down closer to the real cost of the system and/or provide leasing plans.  So, a good scale-out deployment needs to actually be cost conscious and can’t just rely on the promise of commodity parts.
  • Consistency: When using read/write transactional storage, a committed write must really be committed.  Eventual consistency does not cut it.  Any write must be truly safe as even a minute or less of data loss can be a fatal issue.
  • Performance: The transactional performance for both reads and writes must be usable, even if somewhat lower than more expensive enterprise storage systems.

With a requirement for strict consistency, protection of storage against loss and inaccessibility comes from either RAID or synchronous replication across multiple enclosures.  When you really trust an enclosure, like how people (reasonably or unreasonably) trust traditional enterprise storage, you can use RAID and minimize disk proliferation – something like 20% extra disk is a reasonable price to pay for your five 9’s.  When you don’t trust the enclosure because it’s a cheaper commodity system, you start looking at RAID over the network or erasure codes.   The challenge with this strategy is that performance can be abysmal, especially in degraded mode where each read requires too many network accesses and parity calculations.  In order to (a) make sure your data is never lost due to a critical set of commodity disks and/or enclosures being lost before rebuilds can complete while (b) maintaining adequate performance, you start mirroring the data over the network multiple times, usually 3x in scale-out systems targeted at the enterprise.  This drives up the actual cost – think TCO – disks, enclosures, power, cooling, footprint, etc…  It is notable that the mirroring system was the mechanism that brought EBS to its knees during the outage.  So, the scale-out vendor is always balancing cost, consistency, and reliability – you can get any two, but not all three at once.
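A quick back-of-the-envelope comparison of the raw-capacity overhead behind that tradeoff, using the rough figures above (about 20% extra disk for parity RAID inside a trusted enclosure versus 3x mirroring across commodity enclosures):

```python
def raw_needed(usable_tb: float, scheme: str) -> float:
    """Raw capacity required to deliver `usable_tb`, under two simple models:
    parity RAID with ~20% overhead vs. 3x network mirroring."""
    overhead = {"raid_parity": 1.20, "mirror_3x": 3.0}
    return usable_tb * overhead[scheme]

usable = 1000  # TB of usable transactional storage
for scheme in ("raid_parity", "mirror_3x"):
    raw = raw_needed(usable, scheme)
    print(f"{scheme:12s}: {raw:6.0f} TB raw "
          f"({raw - usable:.0f} TB of extra disk, power, cooling, footprint)")
# raid_parity :   1200 TB raw (200 TB of extra disk, power, cooling, footprint)
# mirror_3x   :   3000 TB raw (2000 TB of extra disk, power, cooling, footprint)
```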

Even when you achieve an acceptable balance of the first three considerations, performance can still be an issue due to economics.  To keep costs low in the face of the relatively expensive replication system, there is a temptation to pack high-density storage very tightly, reducing the effective IOPS per GB and making contention a significant problem.  With all that is good about EBS, you often hear customers complaining about its performance, both in terms of maximum throughput and in terms of variability over time, even when there is no news-making outage.

All of this is just about the characteristics of the storage itself and does not deal with operational issues, which is where EBS really hit some issues.  Again, we’re not talking about TB or PB of storage, we’re talking about operations approaching the exabyte scale.

All the operations need to be completely automated – provisioning, placement, and failure response.  All this automation can be implemented in one of two places:

  • Independently on each storage node
  • Through centralized controllers

The storage nodes in most scale out systems generally just store and serve data.   When data is sent to them, they store it.   When they get a read request, they serve it.  When they are given a replication partner, they send their data over.   People try to avoid putting too much logic in the storage nodes because (a) they want storage nodes to focus on streaming data and (b) if storage nodes were doing too much thinking, they’d all need to coordinate making for a potentially unsolvable peer to peer coordination problem.  Therefore, most instruction comes from one or more control systems.

In general, storage provisioning and placement operations (for primary copies, initial replicas, and new replicas after a failure), as well as data lookups, are done through more centralized controllers.  Some requirements here are as follows:

  • With thousands and thousands of users, you can’t have a single control node.  You need to have the control system itself scale-out to many, many nodes (though certainly less than the number of storage nodes). 
  • The control system needs to be even more accessible than the data.  You never want a situation where the control service nodes all die, get confused, lose metadata, or are simply starved for resources as when that happens, all administrators and end users lose the ability to interact with storage system as a whole.
  • The algorithms of the control system need to be very clever since they are controlling thousands and thousands of individual storage nodes, which generally obey even inappropriate and/or heavyweight commands faithfully. 

If you read Amazon’s well written, open, and frank EBS post-mortem, you know how and where these guidelines were violated and where the EBS team will undoubtedly be placing their efforts to improve the service over time.  But, for you, the cloud builder, here is what you need to talk about when you talk to a scale-out storage provider.

1) What tradeoffs were made between reliability, cost, and consistency?   If you need strong consistency on transactional data, find out what the uptime guarantees are, and what the implications are for overall system cost.  Dig deep into any uptime guarantees.  Make sure you understand the assumptions regarding probability of individual failures and adjust those assumptions if they do not apply exactly in your datacenter.

2) What is the price per usable GB and per IOP?   If you are building a big enough cloud and can negotiate a great price from an enterprise storage vendor, make sure the scale-out system is cost-competitive even though they will be using many more disks.   Think about TCO – don’t forget about the power, cooling, and footprint costs that come along!  This is not to say that scale-out is more expensive than traditional storage, or that you should not go for it if you don’t get the savings you hope for.  But you should double check all the math and make sure the TCO is what you expect.

3) What is the performance of the transactional storage – both in normal mode and in degraded mode?   Make sure that they are not assuming a lower spindle-to-IOPS ratio than is reasonable (like EBS) to give you a rosy picture on price.  Assume your transactional storage will actually be accessed, forcing you to increase spindles, use less dense storage, and/or have a really good caching/tiering/ILM story.

4) What are the assumptions of the storage system?   The EBS design assumed that their redundant network would always be available and that there would never be a general loss of connectivity from all to all.  Has your scale-out vendor designed for this eventuality and tested it at scale?  What other datacenter assumptions are they making?

5) What happens with split-brain at scale?  Traditional enterprise storage is very simple in this area.  Local availability is handled inside a single chassis and DR is done with dedicated replication partnerships.   It is inflexible and not responsive to changing conditions.  Scale-out storage is way better in this regard, but if not done right, the flexibility of the scale-out system can backfire, just like in the EBS case where all nodes tried to re-establish replicas of all data at once.

6) Does the storage understand temporary vs. permanent outages?  If so, what if something that appeared permanent turns out to be actually temporary?   Can your storage system react to the return of service in a reasonable way, especially when the permanent failure response is very heavyweight?  EBS, unlike traditional enterprise storage, kept re-mirroring to new nodes rather than simply sync’ing back up with old mirrors when they once again became accessible.

7) Can the control system guarantee access to users and administrators?  In the EBS outage, the automated failure response overloaded the control service, which is what actually affected all users, even if they had properly replicated their data between availability zones.

8) Are your availability zones really isolated?  In EBS, there was a shared resource between availability zones.   This is what made the impact of the automated failure response described in #7 so severe.

9) Does the automation know when to stop trying something?   Once it was clear that no more space was to be had and that the control systems were not responsive, the automated re-protection kept going.  Sometimes, like people, software needs to stop, take a breath and let the situation cool down.  And even though this is cloud, when the storage is in this state, it’s best to have it ask for administrator intervention rather than continuing to try to do the impossible repeatedly.
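
To make point 9 concrete, here is a minimal sketch of automation that knows when to stop: it retries a heavyweight re-protection step with exponential backoff and, once a small budget is exhausted, stops and asks for administrator intervention instead of retrying forever. The re_mirror callable and its failure modes are hypothetical, not any vendor’s API.

```python
import time

def re_protect(re_mirror, max_attempts=5, base_delay=2.0):
    """Attempt a heavyweight re-mirroring operation with backoff, then give up."""
    for attempt in range(1, max_attempts + 1):
        try:
            return re_mirror()
        except RuntimeError as err:  # e.g. "no capacity", "control plane busy"
            delay = base_delay * 2 ** (attempt - 1)
            print(f"attempt {attempt} failed ({err}); backing off {delay:.0f}s")
            time.sleep(delay)
    # Budget exhausted: stop trying the impossible and escalate to a human.
    raise SystemExit("re-protection paused; administrator intervention required")
```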

10) Are failures graceful, even the unlikely ones?   The EBS system had a corner case that crashed the nodes rather than failing an operation gracefully.  In most software, you can get away with letting those corner cases go, but when approaching the exabyte scale, you can’t.  Make sure your vendor has good software engineering practices here.

11) Are there good fail-safes?   The EBS outage started to get better when the EBS admins were able to stop some of the communication and get out of the vicious cycle.  Does your scale-out vendor have similar controls to allow you to manually stop heavyweight operations that you, as the cloud operator, determine need stopping for the sake of the cloud as a whole?

12) Are the requirements for the end customer documented?  After the outage, Amazon put out some excellent documentation on building cloud applications that everyone should read.  Does your scale-out system, due to performance or reliability tradeoffs, require end users to use the storage system in any specific and non-obvious ways?  If so, make sure those are clearly documented so you can educate your end users.

While items 4-10 in this list derive from the EBS problems, this blog posting should not be seen as anti-EBS.   With EBS, Amazon has created something unique in the industry: a massive read/write transactional storage system with strong consistency that can be operated by a reasonably sized IT staff.   Its major outage was the first of this level of seriousness in years, and the long-term effects have been quite minimal.   The success of EBS has influenced the rise of a plethora of scale-out storage startups that want to give you something EBS-like in your datacenter.   It has also scared the traditional storage vendors on technology and pricing and pushed them to innovate in a way they have not in a long time – see their recent product announcements and M&A activity.  EBS is a great service that will only get better. 

While EBS’ failure in this case was spectacular, in a way it was fortuitous for the cloud industry because it educates us on what to look for in storage vendors.  Hopefully, the scale-out storage vendors have been paying attention as well.  They can learn important lessons about operations at massive scale without needing to do a very expensive real-world QA and without causing an outage for a paying customer.   These lessons should be the focus of our attention, not the drama.

Taking Advantage Of Multi-Tenancy To Build Collaborative Clouds

by Jay Judkowitz

When one hears of the advantages of cloud computing, the same benefits come up again and again.

  • The IT consumer gets real agility. This means instant response times to provisioning and deprovisioning requests – no red tape, no trouble tickets – just go.  The consumer also gets a radically different economic model – no pre-planning, no reservations, no sunk costs – the consumer uses as much as they want, grows and shrinks in whatever size increment they want, and keeps the resources for only as long as they want.  Lastly, the consumer gets true transparency in their spending – each cent spent is tied to a specific resource used over a specific length of time.
  • If a proper cloud infrastructure is built, acquired, or assembled, the operations costs for the datacenter administrator are much lower than with traditional IT. Cloud infrastructure software, if done right, gives scale-out management of commodity parts by introducing (a) load balancing and rapid automated recovery of stateless components and (b) policy-based automation of workload placement and resource allocation.  Customer requests automatically trigger provisioning activity, and if anything goes wrong, the system automatically corrects.  The datacenter admin is relieved of the day-to-day burdens of end user provisioning and break/fix systems management.

The challenge in this world stems from the fact that for all of this to be delivered, clouds must span organizational units. There needs to be economy of scale to drive down costs. There need to be many workloads from multiple customers, peaking at different times, so that the “law of large numbers” delivers high utilization and predictable growth. Once you have multiple customers on the same shared infrastructure, you get the inevitable concerns: is my data secure, do I have guaranteed resources, and can another tenant, through malice or accident, compromise my work?

Clouds, both public and private, strive to provide secure multi-tenancy. Every service provider and every cloud software vendor promises that tenants are completely isolated from one another. Obviously, different providers do this with varying levels of competency and sophistication, but there is no controversy regarding the need for this isolation.

Once you are comfortable with your cloud’s isolation strategy, though, you should turn around and ask, “How do I take advantage of multi-tenancy?”  We live in an ever more interconnected world and different organizations need to collaborate on projects large and small, short-term and long term. If two collaborators share a common cloud, or two or more clouds that can communicate with each other, shouldn’t the cloud facilitate controlled and responsible sharing of applications and data? Shouldn’t we turn multi-tenancy from the cloud’s biggest risk into its biggest long-term benefit?

To answer this challenge, we need to ask

  1. Why would we need to do this?
  2. Are there any specific examples of this today?
  3. How would we go about achieving a more generalized solution?


First, why would we do this?   There are many examples in many sectors.

  • Within large enterprises, different business units generally need to be isolated from one another, for privacy or regulatory reasons, or simply to keep trade secrets on a need to know basis. But, when large cross-functional teams are asked to deliver a complex project together, sharing becomes necessary.
  • Also in business, external contractors are used for some projects. How can they work as truly part of the team for one assignment, while being safely locked out of all other projects?
  • In education, universities collaborate on some projects and compete on others. How can the right teams work together openly while others are completely isolated?
  • In government and law enforcement at all levels, collaboration can save lives and property, but proper separation must be enforced to protect civil rights and personal privacy.
  • In medicine, doctors and insurance need to share certain records and results in order to streamline care, facilitate approvals, and reduce mistakes.   But, privacy must be protected with only the proper and allowed sharing taking place.

Since this seems like a nirvana state, the second question is what is practically being done along these lines today? To this, I would say that the SaaS providers have been on this path for some time. Google calendar allows you to selectively share your schedule in a fine-grained manner – who can see your availability, who can see your details, and who can edit your meetings. LinkedIn allows you to share your profile at varying levels of depth and regulate inbound messages based on your level of connection and common interests.

This leads to the third question – how can we do this more generally? How can a single cloud or a group of clouds facilitate generic sharing of any application or data without breaking the base isolation that multi-tenancy generally requires? Obviously, in a blog we can’t answer in gory detail, but we can discuss some high-level requirements.

1. Recognize distributed authority and have a permissions scheme that models this well

In all the examples we discussed in the “why” section, there was no shared authority. From the point of view of someone who wants to access something of someone else’s, there are two completely different and independent sources of authority. First, does my manager authorize me to be working on this project with these collaborators? Second, do those collaborators want to share with me, what exactly do they want to share, and what level of control over their objects do they allow me? A cloud that facilitates collaboration must have a permissions system that allows these different authorities to independently delegate rights without the need for an arbitrating force. Imagine if two government agencies needed to go to the president to settle an access control issue.  With doctors and insurance companies, who would a central authority even be? Once you have a permissions system capable of encoding multiple authority sources, you need the ability to apply that system to compute, storage, and network resources. You need to apply it to data and applications. You need to apply it to built-in cloud services and third party services.
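
A minimal sketch of what such a dual-authority check could look like, assuming a toy in-memory model: the requester’s own organization grants rights, the resource owner independently delegates rights, and the effective permission is the intersection of the two, with no central arbiter. All names and rights are invented for illustration.

```python
# Dual-authority permissions: access is allowed only if BOTH the requester's
# own organization authorizes the action AND the resource owner has delegated
# that right. Everything here is illustrative.

my_org_grants = {                       # what my manager lets me do
    ("alice", "project-x"): {"read", "write"},
}
owner_delegations = {                   # what the collaborator shares with me
    ("alice", "project-x"): {"read"},
}

def allowed(user, project, action):
    granted = my_org_grants.get((user, project), set())
    delegated = owner_delegations.get((user, project), set())
    # Effective rights are the intersection; no arbitrating force is needed.
    return action in (granted & delegated)

print(allowed("alice", "project-x", "read"))    # True
print(allowed("alice", "project-x", "write"))   # False: owner never delegated write
```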

2. Provide extremely flexible networking connectivity and security

Permissions speak to who can do what on which objects shared on a cloud network. The next part is about the network traffic itself. The cloud needs to govern connectivity in a secure, but still self-service manner. It will be impossible to build a responsive and agile collaborative environment over legacy VLANs and static firewalls. Once collaboration is set up politically, project owners need to be able to flip the switch to start the communication flow immediately. If a project ends, they need to be able to turn it off just as quickly, if not faster. Given a project that already has network connectivity, as that project expands, new workloads added to the project need to be instantly granted the same network access as all the other workloads. For all this to happen, there need to be network policies that govern communications.  These policies need to instantly regulate all new workloads on the cloud.  They need to be created, destroyed, and modified by the actual collaborators, not network admins. Lastly, these policies need to be governed by the collaborative permissions system described in requirement #1 so that proper governance is achieved without requiring a common authority.
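
A minimal sketch of the kind of group-based network policy described here, using invented group and instance names: connectivity is expressed between policy groups, new workloads inherit it the moment they are tagged, and the collaborators themselves can revoke it instantly.

```python
# Policy-based connectivity: reachability is evaluated from group-to-group
# rules rather than per-VM firewall entries. All names are illustrative.

policy_rules = {("project-x-web", "project-x-db")}   # allowed flows (src group, dst group)
instance_groups = {}                                 # instance id -> set of groups

def launch(instance_id, groups):
    # A new workload inherits connectivity the moment it joins its groups.
    instance_groups[instance_id] = set(groups)

def can_talk(src, dst):
    return any((s, d) in policy_rules
               for s in instance_groups.get(src, ())
               for d in instance_groups.get(dst, ()))

launch("vm-1", {"project-x-web"})
launch("vm-2", {"project-x-db"})
print(can_talk("vm-1", "vm-2"))   # True, with no per-VM firewall edits
policy_rules.clear()              # project ends: collaborators switch it off
print(can_talk("vm-1", "vm-2"))   # False
```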

3. Have a way to extend these systems across clouds

Once you have a permissions model and a networking model that work within a cloud, you need to extend those functions to work across clouds so that multiple organizations can share their resources amongst each other, not just when they share a common public or community cloud, but even when hosted in their own separate private clouds. For this to happen, identity must be agreed upon. User permissions from one cloud must be trusted by the second cloud so that those permissions can be mapped against what has been delegated by that second cloud. The networking policy mechanisms must be transferable across the Internet and take into account various levels of routing, NAT’ing, and firewalling.
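
As a very small sketch of the identity half of this problem (the cloud names, users, and rights are assumptions, not a real federation protocol): the local cloud accepts identity assertions only from clouds it trusts, and then maps the remote identity onto whatever rights were locally delegated to it.

```python
# Cross-cloud delegation sketch: trust the asserting cloud, then look up
# what the local side has chosen to delegate to that remote identity.

trusted_clouds = {"cloud-a.example.com"}
local_delegations = {
    ("cloud-a.example.com", "bob"): {"read:project-x"},
}

def authorize(asserted_by, remote_user, right):
    if asserted_by not in trusted_clouds:
        return False                      # identity source is not trusted
    return right in local_delegations.get((asserted_by, remote_user), set())

print(authorize("cloud-a.example.com", "bob", "read:project-x"))   # True
print(authorize("cloud-b.example.com", "bob", "read:project-x"))   # False: untrusted cloud
```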

Nimbula believes that we are on the path to providing general purpose collaborative clouds. Our flagship product, Nimbula Director, is architected to deliver this value in the long term and has taken substantial steps in this direction in our generally available 1.0 release.

The Cloud Ecosystem

Nimbula’s co-founder and VP of Products, Willem van Biljon, spoke at the recent Cloud Connect event in Santa Clara. Here are some of the points Willem made during his talk. The video of the full talk is available online at http://bcove.me/6fllnnzg

Building a proper cloud, whether it is a private or public cloud, is more than buying and implementing a product. It is a rather complex architecture with many interrelated pieces that need to be considered. Ultimately, it is about a whole set of things that need to work together.

So, what is involved to make this work?

  • Compute and Storage hardware
  • Networking infrastructure
  • A Cloud Operating System, something that will make all of the infrastructure accessible to the outside world. 
  • On top of that, the various services that people are going to need (PaaS, SaaS, etc.)
  • Alongside we need some management infrastructure, billing, external storage or compute resources, etc.

So overall, it is a pretty large ecosystem and many vendors and products come into play.

The Infrastructure as a Service (IaaS) layer provides the software that gives control of the hardware layer – just like a traditional operating system, but across a large set of hardware. The issues we think are important are:

  • Scale: lessons learned from large scale matter at any scale. Large properties like Google, Amazon or Yahoo learned lessons that we can apply to all data centers
  • Automation: low cost implies low human touch
  • Resource management: who gets what
  • Permission / policy management: who can get what

If we look at the hypervisor, the first lesson is that the hypervisor is not the Cloud OS. It is an essential component, but not all of it. In particular, it does not provide resource management across multiple machines. The hypervisor market is rapidly maturing and one should not build applications or a cloud architecture that rely on a specific hypervisor. 

Large enterprises have shown that commodity hardware can lower costs. The magic is in the software, not the hardware: design the application for commodity hardware and you can dramatically lower costs.

In the network, as applications are no longer bound to specific servers, the topology no longer defines security. The network security now needs to be configured automatically and managed dynamically. 

How do I federate to other people’s clouds – whether private or public? There are a number of key challenges around the API, the identity that I need to present, the data that I need to move and the application environment in which the virtual machine will execute. Of all of these, identity is probably the main challenge to address. 

Billing is about getting money back for the resources that are consumed. It generally breaks down into three elements: firstly, you need to be able to properly measure and meter what is used; secondly, to assign proper rates to the various resource elements; and finally, to generate a bill. The most important element is finding and assigning the appropriate rate for a given resource – that is where data is transformed into business value.
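
A minimal sketch of those three steps, with invented tenants, resources and rates: metered usage records, a rate card, and a generated bill.

```python
# Step 1: measured/metered consumption (illustrative records)
usage = [
    {"tenant": "acme", "resource": "vcpu_hours", "amount": 1200},
    {"tenant": "acme", "resource": "gb_month",   "amount": 500},
    {"tenant": "acme", "resource": "gb_egress",  "amount": 80},
]
# Step 2: rate per resource unit ($), assumed figures
rates = {"vcpu_hours": 0.05, "gb_month": 0.10, "gb_egress": 0.12}

# Step 3: turn usage and rates into a bill
def generate_bill(records, rate_card):
    lines = [(r["resource"], r["amount"], r["amount"] * rate_card[r["resource"]])
             for r in records]
    total = sum(cost for _, _, cost in lines)
    return lines, total

lines, total = generate_bill(usage, rates)
for resource, amount, cost in lines:
    print(f"{resource:<12} {amount:>8} ${cost:8.2f}")
print(f"{'total':<12} {'':>8} ${total:8.2f}")
```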

There is a massive amount of data on enterprise systems today and there is an equally massive opportunity to re-architect that storage to use cheaper systems. There is no simple, one-size-fits-all answer. The key is balance: figure out where you need today’s high-end enterprise storage and where you need the lower cost, highly scalable newer storage systems.

So in conclusion, the cloud ecosystem has many components and many issues per component. We believe that one should start by focusing on the key issues per component and find the right answer for each part.

Taking Advantage of Public and Private Clouds Requires the Right Cloud Management Software

Cloud computing is just a few years old, but it has already given rise to two separate approaches and architectures: one public, like Amazon Web Services, the other private, usually inside a corporate data center. Computer users assigned to business units are attracted to the direct access and easy provisioning of the public cloud, since servers can be up and running in a few minutes. IT organizations, on the other hand, value the security and control they associate with private clouds, and worry about the proliferation of public cloud instances and its potential impact on corporate data and security policies. It’s a familiar tug-of-war.

Successful businesses have lately come to realize that both public and private clouds have advantages, and want to be able to use both of them when appropriate. Consider Intuit: the software company does the load testing for its online TurboTax program on servers at Amazon; because real customer data is not being used, there are no regulatory or privacy issues. However, once the software is made available to the public, it runs on Intuit’s on-premises machines, as one would expect for information of such a sensitive nature.

Being able to move between public and private clouds in this manner requires the right kind of cloud management software, a true “Cloud Operating System” that doesn’t take a one-size-fits-all approach to cloud architecture. Instead, it must make use of, when appropriate, the growing number of cloud technologies the marketplace is accepting.

In a properly designed Cloud Operating System, an application runs in either the public or the private cloud depending on the application itself, in connection with company policies. These policies might involve, for example, the kinds of data the application uses, or the extent to which the application is mission-critical to the organization.

The actual placement of an individual application’s workload in either the public or private cloud should occur automatically and transparently to end users. Be they in IT or in business units, users should concern themselves only with choosing the proper policy for the workload. Cloud management software should then take over, determining where precisely in the public-private cloud ecosystem the program will run.
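
A minimal sketch of what policy-driven placement could look like; the policy fields and the rules themselves are illustrative assumptions rather than any product’s actual behavior.

```python
# The user only tags the workload with policy attributes; the placement
# logic decides between public and private capacity.

def place(workload):
    if workload.get("data_class") == "regulated":
        return "private"              # sensitive data stays behind the firewall
    if workload.get("mission_critical"):
        return "private"
    if workload.get("bursty"):
        return "public"               # short-lived load tests, seasonal peaks
    return "private"

print(place({"name": "turbotax-load-test", "bursty": True}))        # public
print(place({"name": "turbotax-prod", "data_class": "regulated"}))  # private
```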

This means that, to be effective, a Cloud Operating System needs to shield users from the multitude of different command systems they currently need to master to move between public and private clouds. Instead, the software must present a unified user experience, with the same authorization, access control and interfaces regardless of the workload’s final destination. Users can focus on their workload needs using credentials set up centrally by IT. That protects the enterprise from employees disclosing their credentials to others, or worse, taking them with them when they leave the organization.

A Cloud Operating System must also give users a painless way to move data and applications back and forth between public and private clouds. That’s a seemingly straightforward task, but one whose current complexity routinely leads to lengthy and unexpected delays in what IT workers had assumed was going to be a straightforward migration process.

So how might this hybrid public-private architecture play out in an enterprise? Traditional mission-critical ERP programs are less likely to migrate to new cloud infrastructures just yet. That’s because these programs have strict requirements for stability and fault tolerance, and their data is subject to stringent regulatory and compliance regimes. In addition, the programs themselves do not require the constant changing and updating that can occur so easily in a cloud environment. ERP customers are much more concerned about keeping the programs running stably than they are with making daily adjustments to the underlying infrastructure. While mission-critical workloads won’t be the first ones that IT will move to cloud infrastructures, they will clearly be candidates for the private cloud in the second phase of cloud adoption.

By contrast, programs built on new generations of Web-based development environments, such as Ruby on Rails, are perfect candidates for internal clouds right away. Whether you are in a development and test environment or beginning work with a new Platform as a Service or Software as a Service offering, Cloud Operating System technologies will bring a new level of agility and flexibility to your organization. You can scale your infrastructure as fast as you can stack racks of hardware, without having to bother with the lengthy server provisioning cycles once associated with IT deployment.

Of course, you can also use third-party cloud resources like Amazon to complement your own infrastructure when doing so makes sense. Intuit used the cloud for testing; some companies move to the cloud to meet seasonal demands, or to run one of the many commercial SaaS offerings becoming available. Cloud management software can transform the public cloud from a rogue resource snuck in the back door by business units trying to circumvent IT into a viable business tool, properly integrated into an enterprise’s systems.

There are a few more things that IT managers need to be aware of when choosing cloud management software besides its ability to handle both public and private clouds. Has the software been designed from the ground up to deal with the complexities of today’s computing environments or are those features bolted-on as an afterthought to software initially designed simply to set up virtual machines? How much does it automate the time-consuming, repetitive manual tasks often associated with creating and configuring virtual machines? And can it scale up as effortlessly as modern IT operations are discovering they need to?

IT managers will need to deal with those issues, too, as they make a decision about cloud management software. But at the very least, they need to make sure that when they ask a cloud management vendor if they are public or private, the answer they hear back is “Yes.”

2011 Prediction - Clearer Skies Ahead as Vendors Deliver on the Promise of Cloud Computing

The word cloud was everywhere in the high tech industry in 2010. The incredible rise of Amazon’s public cloud offering and their success stories drew record interest from customers and technology providers alike. We saw everyone in the latter group start to “cloudify” their marketing: if you did not have a cloud strategy, you risked falling behind. The race to “cloud” created a lot of confusion, and there were very few, if any, true cloud computing deployments beyond Amazon’s success stories.

As enterprise early adopters sought to bring these benefits to their infrastructure, they started looking under the covers of the generally available private cloud offerings on the market from startups and the established virtualization and management leaders, and found that those vendors couldn’t deliver on the promise of cloud computing because their solutions were not designed and built for cloud requirements and scale.

In 2011, I believe we’ll see new, innovative vendors deliver private cloud solutions built from the ground up to deliver cloud benefits to enterprise customers. This will help “clear” the skies, and what were just trends, ideas or initiatives this year will start becoming real and tangible in 2011. As a result, we’ll see an acceleration of cloud computing deployment and usage beyond the current Web 2.0 world, in traditional enterprise and service provider data centers.

This will be a perfect opportunity for IT to turn the tables and become an engine of innovation again. Cloud computing technologies will help with the management of data center infrastructure, which has become one of the top challenges in the enterprise, and in turn allow IT to focus on delivering new applications to the line of business. While virtualization has already brought a lot of efficiency gains to IT, there are still a number of missing pieces needed for IT to be more agile and to build competitive advantage rather than remain a cost center. 

One of the first major steps in that direction will come from automation. Current virtualization implementations still require numerous manual steps and that is neither efficient nor scalable. Automation is the next logical step and eliminates human errors. Automation should start from the moment you start installing the infrastructure and allow it to scale up as fast as you can stack equipment. A few minutes of manual interaction per machine results in loss of efficiency and an increase in the potential for human errors as your infrastructure grows.

But automating the build-out of infrastructure should only be the first step. As you manage the infrastructure and build applications on top of it, automation will keep gaining a foothold so that tedious, error-prone but well understood processes can be carried out with the maximum efficiency possible.

And this efficiency will increasingly be achieved on commodity hardware and software. In 2006, Alessandro Perilli covered the launch of Amazon’s Xen-powered virtual data center on demand, Amazon’s public cloud offering, and highlighted that he had expected VMware to be the first to launch such a service and not Amazon. But Amazon innovated with a new approach and they were not the only ones doing so. Over the past years, giants like Google, Facebook and others have demonstrated that you can build and deliver world-class applications and services at very large scale without brand name hardware or expensive hypervisors.

This movement has started entering the enterprise world and I expect it to pick up momentum in 2011. As the price of the base hypervisor rapidly declines, with some being free altogether, customers are more and more comfortable running various offerings in their data centers. One size does not fit all, and one should use the hypervisor most suited to the use and application being built on top of it.

As organizations start using the same building blocks as the major public cloud providers, the move towards true hybrid clouds will become a greater reality in 2011. Public clouds have demonstrated the business benefits of cloud computing in terms of efficiency, scalability and agility. Those benefits can be achieved in great part on private infrastructure using private cloud offerings. IT can look to bring greater amounts of flexibility and agility behind their firewall and empower their internal business customers. But not every application will be required to run on the IT infrastructure and in some cases, the use of public cloud infrastructure will make more sense from an economic or architecture perspective.

This will create a co-existence model where IT can pick and choose which applications should run on their traditional core systems, which should run on a new breed of cloud-enabled infrastructure behind the firewall, and which should be moved to the public cloud. This hybrid model will allow an unprecedented level of elasticity.

Although initial interest in the cloud was primarily driven by cost savings, other aspects of the cloud promise have been picking up steam and I expect them to dominate the reasons for adoption and deployment through 2011. The level of innovation enabled by private and hybrid cloud technologies will allow IT to build and deliver better applications with virtually unlimited capacity, using third party resources when required. Moving beyond association with cost, IT will be associated with innovation again, bringing more competitive advantages to their organization. 

(Source: nimbula.com)

Choosing the Right Enterprise Cloud Solution for Your Organization

Cloud computing is here to stay and many organizations are under pressure to move towards this powerful new technology. Yet, concerns around moving into the cloud are very real. Complex and time consuming deployment, security risks, nightmarish application migration scenarios and buggy and immature private cloud management offerings are just some of the barriers to mainstream enterprise adoption of the cloud.

Enterprises want the same benefits of agility, automation and scale demonstrated by public cloud services such as Amazon EC2 behind their firewall. Although the number of private and hybrid cloud solutions is increasing rapidly, discerning which solution is best suited to your organization can be difficult. Here we attempt to provide some pointers for choosing the right enterprise cloud management solution for your organization – one that is reliable, meets your business needs and allows you to focus on innovation.

What is an Enterprise Cloud Solution?

An Enterprise Cloud Solution promises to convert your static data center into nimble compute capacity that is further enhanced by seamless integration with other on- and off-premise clouds. Essentially, the Enterprise Cloud Solution offers a cost-effective way to meet your organization’s computing requirements by optimizing the utilization of your current infrastructure and increasing all-round system efficiency by automating resource provisioning and reducing the need for human interaction.

The key benefits of an Enterprise Cloud Solution, offering access to both on- and off-premises compute capacity, include:

  • Flexible compute capacity promotes innovation and allows organizations to rapidly respond to changing business demands while focusing on their core-competencies.
  • Infrastructure cost savings - Corporate servers are estimated to run below 15% capacity. An Enterprise Cloud Solution promises to dramatically increase infrastructure utilization rates especially where load is variable, eliminating the need for hardware purchases that tackle only peak demand and avoiding over-provisioning of resources.
  • Operational cost savings - By automating manual processes, Enterprise Cloud Solutions reduce the demand for administration and support.

Why not just use Public Cloud Services?

An Enterprise Cloud Solution offers the benefits of flexible, responsive compute capacity both behind the firewall and in the public cloud domain. A good Enterprise Cloud Solution should provide access to both local private infrastructure and off-premise cloud services with a uniform interface and integrated authentication. Although public cloud services provide unique opportunities for certain types of compute, particularly those with unpredictable load, there are many advantages to deploying an in-house private cloud able to link to external clouds when necessary. Customers should not have to choose one over the other but be able to implement a flexible solution that allows them to utilize both as required.

The benefits of having access to local on-premise cloud capacity as opposed to solely accessing public cloud services for elastic compute needs include:

  • Security – It may be preferable to keep sensitive services, applications and data behind the firewall, instead of exposing these to the risks associated with outsourcing compute capacity and storage to an external vendor.
  • Performance – Access speeds between compute instances over a local network are generally much faster than access to a public cloud over the Internet, where speed is limited by a provider’s bandwidth or latency. Ensuring compute and data are physically close together avoids performance degradation, especially in large scale systems.
  • Service Disruption – Technology upgrades scheduled at the most convenient and cost-effective time for the public cloud operator could have serious implications for your services if they correspond with high demand.
  • Regulatory – Data flows are becoming more global but privacy laws are local. Deploying systems across regions can become problematic when regional regulations differ.
  • Internal Resource Accounting – A good Enterprise Cloud Solution will facilitate the monitoring and metering of resource consumption by various business units within the organization, allowing for consumption based intra-organizational billing.
  • Sunk Cost – Many large corporations have invested heavily in private data centers, most of which run under 15% utilization. With the spend on data centers being roughly split in thirds across operations, hardware and power, a solution that dramatically increases hardware utilization while decreasing operational overheads becomes very cost-effective (a rough worked illustration follows this list).
  • Data Longevity – Keeping data in a public cloud for long periods can be costly. Where data longevity is a key system requirement, having a private cloud component with local storage as part of your cloud solution can be cost effective.
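
As a rough worked illustration of the sunk-cost point above: the sub-15% utilization and the one-third cost split come from the bullets, while the 60% target utilization and the halving of operational cost through automation are assumed figures, so treat the result as an order-of-magnitude sketch only.

```python
# Rough illustration only; target utilization and operations savings are assumptions.
current_util, target_util = 0.15, 0.60
ops_share = hw_share = power_share = 1 / 3

scale = current_util / target_util   # 0.25: a quarter of the servers do the same work
hw_factor, power_factor, ops_factor = scale, scale, 0.5  # assume ops cost roughly halved

new_spend = ops_share * ops_factor + hw_share * hw_factor + power_share * power_factor
print(f"estimated spend vs. today: {new_spend:.0%}")   # roughly 33% of the original budget
```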

Assessing your On-Premise Cloud Needs

There are many aspects to consider when selecting a solution for your enterprise private cloud needs. Below we outline these and the questions that need to be satisfactorily answered when considering a cloud solution’s suitability for enterprise use.

Security - Authentication and Authorization
Ideally, an Enterprise Cloud Solution should provide fine-grained authorization supporting multi-tenancy.

Authentication should ideally be integrated with existing user services to provide hassle-free user management and the efficient reuse of existing corporate user databases. An ideal solution should support the sophisticated authentication and authorization necessary to provide multi-tenancy, which allows multiple customers, groups and users to co-exist in isolation from each other or share resources on a single site. The ability to create users and groups whose allowable actions can be determined by policy, in the form of rigorously enforced permissions, is essential for robust security, controlled access to on- and off-premise resources, and for monitoring and billing.

Some specific questions to be asked:

  • What security is present?
  • Is this security robust?
  • Does the security afford tight enough control?
  • Does authentication integrate with existing user services?
  • Is fine-grained policy based authorization supported?
  • Is multi-tenancy supported?
  • Does the security satisfy any industry-specific laws and regulations by which your business must abide?

Ease of Use

Installation

Complex installation and setup can be a very real barrier to entering cloudspace. As the number of physical machines (nodes) in your underlying infrastructure grows, it becomes increasingly important to avoid per-node installation and intensive manual configuration. Make sure your Enterprise Cloud Solution has a largely automated installation with minimal configuration and dynamic resource discovery. Ideally, the underlying software should have the ability to discover resources and automatically install and configure new infrastructure.

Specific questions to be asked:

  • How quick and easy is it to deploy a site?
  • Is the base operating system for each node bundled in the solution you are considering?
  • Can you simply plug new nodes into your infrastructure and automatically have these resources installed, configured and available in your cloud environment?
  • Is operating system install for nodes fully automated?
  • Can nodes install in under 15 minutes?
  • Is installation independent of Internet connectivity?
  • Are security keys distributed at install time?
  • Is the site install of the node farm automated?

Site Administration and Maintenance

Ideally, an Enterprise Cloud Solution automates most of the manual tasks associated with static data center administration and management and provides powerful tools to handle those tasks requiring human interaction. This reduces operational overhead and saves on staff retraining time as administration and maintenance of the cloud should be a largely hands-off affair.

Specific questions to be asked:

  • How quick and easy is it to manage the site?
  • Is site management largely automated?
  • Is scripting and API support present?
  • Is it easy to get technical personnel up to speed with the requirements of the cloud solution with respect to architecture, implementation and operation?

Interface

Various interfaces need to be available for interacting with the cloud in different contexts, such as a command line interface (CLI), a graphical user interface (GUI) and an application programming interface (API) for programmatic interaction. These interfaces need to be easy to use and consistent, providing the same, comprehensive set of functionality.

Specific questions to be asked:

  • What are the various interfaces for interacting with the cloud and how usable are these?
  • How consistent are these interfaces? Is the same set of functionality supported throughout?
  • Can the Enterprise Cloud Solution interface integrate seamlessly with public cloud offerings?

Migration

Re-engineering applications to work on a new platform can dramatically escalate the cost and time required to get your business up and running in the cloud, so being able to easily migrate existing applications is a key requirement when choosing an Enterprise Cloud Solution. Network and storage constraints are the main differentiators when considering migration.

Specific questions to be asked:

  • How convenient is it to migrate existing enterprise applications into the cloud?
  • Does the solution support an architecture similar to those used by enterprise applications?
  • What guest operating systems are supported?
  • Is regular SQL RDBMS storage supported?
  • Can compute and storage be attached and detached easily to facilitate flexible movement of instances?
  • Is seamless access to your existing SAN and NAS infrastructure supported?
  • Is a familiar networking environment supported?
  • Are layer 2 access, multicast and non-TCP protocols like IPsec supported?
  • Are all the required versions of your application operating system supported?
  • Are your existing software licenses portable to the cloud?

Integration

Integration with existing services and system management policies allows for time-saving and hassle-free reuse of existing systems.

Specific questions to be asked:

  • How easy is it to integrate existing systems management policies and functions into the cloud?
  • Does the system integrate with existing Directory Services?

Flexibility

Ideally an Enterprise Cloud Solution should provide dynamic workload allocation that optimizes infrastructure utilization and provide flexible instance management that supports customization to specific business needs. Controlled access to external compute capacity should also be supported to make additional capacity available when necessary.

Access to External Compute Capacity

Access to external compute capacity allows customers to burst into other on- and off-premise clouds subject to load demands, flexibly providing extra compute capacity when necessary.

Specific questions to be asked:

  • Is federated authentication with external on- and off-premise clouds supported?
  • What level of control is available when accessing external compute capacity?
  • Is the interface to external clouds transparent i.e. is it the same interface used to access local resources?

Instance Management

Being able to easily launch and terminate instances empowers innovation as it eliminates the manual restrictions involved in static systems deployment. The ability to customize the instance deployment environment allows customers to fine tune how their instances are launched in line with business needs.

Specific questions to be asked:

  • How easy is it to deploy and terminate instances?
  • Is it possible to customize aspects of instance deployment to specific business needs?
  • Is it possible to specify network-locality relationships between launched instances? This is useful if, for example, you require that two instances be launched on the same physical machine to facilitate inter-instance communication, or that instances be launched on different clusters to guarantee the highest level of reliability even if there are data center failures.
  • Is it possible to define custom instance types or shapes that define the elements needed to run an instance (a minimal sketch follows this list)?
    • number of CPUs
    • amount of  memory
    • special requirements (eg GPUs)
    • attached storage (eg Fiber Channel interface to a SAN)
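
A minimal sketch of how a custom shape might be expressed, assuming a simple declarative record; the field names and values are illustrative, not an actual product API.

```python
from dataclasses import dataclass, field

@dataclass
class Shape:
    """Hypothetical instance shape: the elements needed to run an instance."""
    name: str
    vcpus: int
    memory_gb: int
    special: list = field(default_factory=list)           # e.g. ["gpu"]
    attached_storage: list = field(default_factory=list)  # e.g. ["fc-san"]

gpu_large = Shape(name="gpu.large", vcpus=16, memory_gb=64,
                  special=["gpu"], attached_storage=["fc-san"])
print(gpu_large)
```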

Dynamic Workload Allocation

Efficient dynamic workload allocation is a key requirement for a utility-grade Enterprise Cloud Solution and replaces the hands-on resource provisioning of the static data center paradigm. Intelligent placement should efficiently allocate resources to instances based on user-specified parameters, such as the number of CPUs and amount of RAM required. Ideally, you want to be able to dynamically place any instance shape on any node, according to policy, instead of having to hard-configure nodes for particular instance types or shapes. Hard-configuring nodes makes it very difficult to accommodate shifts in demand from one shape to another.
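
A minimal sketch of that idea, with invented node capacities and a naive first-fit policy: any shape can land on any node that currently has the headroom, so nodes never need to be hard-configured for a particular shape.

```python
# Track free capacity per node and place shapes wherever they fit.
nodes = {"node-1": {"cpu": 32, "mem": 128}, "node-2": {"cpu": 16, "mem": 64}}

def place(shape_cpu, shape_mem):
    for name, free in nodes.items():
        if free["cpu"] >= shape_cpu and free["mem"] >= shape_mem:
            free["cpu"] -= shape_cpu
            free["mem"] -= shape_mem
            return name
    return None   # no capacity: defer, burst elsewhere, or alert an operator

print(place(8, 32))    # node-1
print(place(16, 64))   # node-1 again (it still has 24 CPUs / 96 GB free)
```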

Specific questions to be asked:

  • Is the dynamic workload allocation model robust, providing optimized rates of infrastructure utilization?
  • Does the system place workloads on the optimal node?
  • How flexible and efficient is placement?
  • Can you dynamically place any instance shape on any node, according to policy?
  • Can the parameters be specified by the user?

Scaling

To support the deployment of utility-grade systems, an Enterprise Cloud Solution should scale from a small cluster to hundreds of thousands of physical machines without performance degradation.

Specific questions to be asked:

  • How scalable is the solution?
  • What are the operational overhead implications associated with scaling?
  • How is performance impacted with scale?

Reliability

At large scale, failures are the norm, so being able to automatically deal with failures dramatically reduces operational burden. Ideally, an Enterprise Cloud Solution should be self-healing and self-organizing with no single points of failure.  Sophisticated failover mechanisms should be employed to ensure system integrity and resilience. Failover management should be completely automated to ensure no service interruption.

Specific questions to be asked:

  • Are there single points of failure in the solution’s architecture?
  • What redundancy mechanisms are in place?
  • Is failover management automated?

Monitoring and Metering

Ideally an Enterprise Cloud Solution should record all system requests, incidents and events, creating a rich audit trail. The monitoring system should be able to integrate with external analysis software.

Specific questions to be asked:

  • Does the monitoring and metering provided by the system support the needs of your organization?
  • Can monitoring be integrated with your existing monitoring infrastructure e.g. is SNMP supported?

In Summary

With the number of Enterprise Cloud Solutions on the market increasing, and a move to the cloud representing a substantial investment in product procurement, staff training and time, it is critical that the choice of solution be carefully considered. Thorough evaluation of Enterprise Cloud Solutions against the indicators outlined above will give your organization the best chance of enjoying the true benefits of public cloud infrastructure behind the firewall.

(Source: nimbula.com)
