Running with the pack

22 Nov 1997

Wolf packs, mountains and full moons. No, these aren't mood effects for the latest Hollywood suspense-thriller or the setting for a piece of rousing classical music. They are all to do with the latest strategies from Microsoft, Novell and Sun to expand the capabilities of their respective operating systems, boosting performance and creating more robust, fault-tolerant systems: WolfPack, Wolf Mountain and Full Moon. Intrigued? Then read on ...

Twenty-four by seven is a term that many network professionals will be all too familiar with. The demand for data to be online 24 hours a day, seven days a week puts a considerable strain on network servers and resources.

If a system component fails - either due to a software crash or major hardware failure - it could prove very expensive for the organisation.

It is estimated that in the US alone, downtime costs almost $4bn a year. The average downtime event costs the US retail industry $140,000, and the securities market $450,000. With figures like these, your system only needs to crash a few times in a single year for your company to lose a million bucks. In today's highly competitive world, that $1m can mean the difference between success and failure.

Availability and scalability are the two biggest areas of your network that need to be addressed in order to minimise downtime. Achieving high availability means the system can run 24 x 7 because failed components can be replaced easily, quickly and, most importantly, automatically. Scalability allows you to grow your system as processing power requirements increase; as well as delivering lower performance, an overloaded system is more prone to crashing than one that is only 50 per cent utilised.

Server mirroring in real time

A number of proprietary architectures are available today to increase system availability. One traditional hardware structure for gaining high availability is server mirroring - the duplication of one server on to another in real time. Traditionally, one server runs all the applications and stores data, while the other system sits idle, acting as a standby, ready to take over in the event of a system failure.
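As a rough illustration of why mirroring carries a performance cost, the sketch below shows the synchronous write path it implies: a write is only acknowledged once it has reached both the primary and the standby. The types, names and timings are invented for the example and do not represent any vendor's actual implementation.

```go
package main

import (
	"fmt"
	"time"
)

// store represents one server's local disk; the sleep stands in for I/O and link latency.
type store struct {
	name    string
	latency time.Duration
	data    map[string][]byte
}

func (s *store) write(key string, value []byte) {
	time.Sleep(s.latency) // simulated disk / mirror-link latency
	s.data[key] = value
}

// mirroredStore writes to the primary first, then duplicates the write to the
// idle standby before acknowledging - the mirroring penalty described above.
type mirroredStore struct {
	primary, standby *store
}

func (m *mirroredStore) write(key string, value []byte) time.Duration {
	start := time.Now()
	m.primary.write(key, value)
	m.standby.write(key, value) // synchronous duplication to the standby server
	return time.Since(start)
}

func main() {
	m := &mirroredStore{
		primary: &store{name: "primary", latency: 5 * time.Millisecond, data: map[string][]byte{}},
		standby: &store{name: "standby", latency: 8 * time.Millisecond, data: map[string][]byte{}},
	}
	elapsed := m.write("order:42", []byte("pending"))
	fmt.Printf("mirrored write acknowledged after %v\n", elapsed)
}
```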

There are a number of inherent problems with server mirroring. First, and most important, is the loss in performance: data is written to the primary server before being duplicated to the standby, and although the servers are connected by high-speed links, there is still a penalty to pay for mirroring the data.

Furthermore, there is nothing to stop the standby server itself failing while it sits idle, which would prevent it from coming online when the primary server fails. The chances of this happening are remote, since the standby is doing little more than waiting, but it is a possibility that shouldn't be discarded completely.

Last month's feature on fault tolerance talked about systems falling into an erroneous state: a system in an erroneous state is susceptible to failure.

Your standby server may have fallen into this state the last time it was used, due to factors such as incorrect hardware or software configuration, an inherent design or manufacturing fault, or simple wear and tear. In this situation, when the network switches to the standby server, it could fail immediately.

SMP - the shared memory model

Several different architectures are available to enhance system scalability.

One such architecture to achieve scalability beyond a single processor is symmetric multiprocessing (SMP). Within an SMP system, multiple processors share global memory and I/O subsystems - the "shared memory model".

The traditional SMP software model runs a single copy of the operating system with application processes running as if they were on a single processor system. Multithreaded applications are able to take advantage of the additional CPUs by distributing their application threads across the different processors, thus increasing system performance.
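To make the SMP software model concrete, here is a small, hypothetical worked example: a pool of worker threads, one per available processor, pulling tasks from a shared queue so that the operating system can schedule them across the CPUs. It is a sketch of the principle only, not any vendor's implementation.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// A CPU-bound task; on an SMP machine the scheduler can place each worker
// on a different processor, so the tasks genuinely run in parallel.
func busyWork(n int) int {
	sum := 0
	for i := 0; i < n; i++ {
		sum += i % 7
	}
	return sum
}

func main() {
	workers := runtime.NumCPU() // one worker per available processor
	tasks := make(chan int, 64)
	results := make(chan int, 64)

	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for n := range tasks {
				results <- busyWork(n)
			}
		}()
	}

	for i := 0; i < 64; i++ {
		tasks <- 1_000_000
	}
	close(tasks)
	wg.Wait()
	close(results)

	total := 0
	for r := range results {
		total += r
	}
	fmt.Printf("%d workers across %d CPUs, total = %d\n", workers, runtime.NumCPU(), total)
}
```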

The major drawback, at the hardware level, is that systems run into physical limits on bus and memory speeds that are expensive to overcome; as processor speeds increase, shared memory multiprocessors become increasingly costly. Today, PC hardware prices step up sharply once user needs scale beyond four processors, often without a corresponding increase in system performance.

Clustering addresses the problems

Aiming to address both problems, clustering promises to minimise downtime by providing an architecture that keeps the system running in the event of a single system failure. And additional systems can be added, allowing the cluster to grow and help users meet the overall processing power requirements of their networks.

The easiest analogy when trying to understand clustering is to look at RAID: a collection of individual drives connected together to create a single, virtual drive, with a variety of performance and redundancy features.

In its simplest implementation, clustering of just two servers creates a fail-over system whereby if one of the servers fails, due to either a software crash or hardware failure, the other node in the cluster can continue where the first machine left off. Users will notice a delay of a few seconds as data is regenerated on the second system, but then they can continue working as normal, while the first node is repaired.
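The fail-over logic itself can be pictured as a simple heartbeat loop: the standby watches for signs of life from the active node and takes over its workload when too many heartbeats are missed. The sketch below is purely illustrative, with invented timings, and is not how WolfPack or any other product implements it.

```go
package main

import (
	"fmt"
	"time"
)

// monitor is the standby node's side of a two-node fail-over pair: it waits
// for heartbeats from the active node and promotes itself after repeated misses.
func monitor(heartbeats <-chan struct{}, takeover func()) {
	missed := 0
	for {
		select {
		case <-heartbeats:
			missed = 0 // active node is alive; reset the counter
		case <-time.After(200 * time.Millisecond):
			missed++
			if missed >= 3 { // repeated misses: assume the active node has failed
				takeover()
				return
			}
		}
	}
}

func main() {
	heartbeats := make(chan struct{})

	go monitor(heartbeats, func() {
		fmt.Println("standby node taking over applications and disk resources")
	})

	// The active node sends a few heartbeats, then crashes (stops sending).
	for i := 0; i < 5; i++ {
		heartbeats <- struct{}{}
		time.Sleep(100 * time.Millisecond)
	}
	fmt.Println("active node has failed; waiting for fail-over...")
	time.Sleep(1 * time.Second)
}
```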

Ultimately, clustering will enable multiple servers, a mix of uni- and multi-processor systems, to work together as a single entity, sharing the application loads and system processes between the various processors, distributing the data, and providing a robust, fault-tolerant solution.

Unlike the traditional server mirroring technology, clustering provides a truly active/active solution where both nodes are able to carry out tasks independent of each other until one of the systems fails. When this happens, the other system can continue the work of the failed system.

Shared disk and shared nothing

Two software models are currently available for clustered environments: shared disk and shared nothing. The shared disk model enables software running on any system within the cluster to access any resource (for example, a disk) connected to any other system within the cluster.

As with an SMP system, the applications must synchronise and serialise their access to shared data. To achieve this, a Distributed Lock Manager (DLM) is used to control synchronisation. A DLM is a service provided to the application to track references to resources throughout the cluster.

When more than one system tries to reference a single resource, the Lock Manager resolves the potential conflict.
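A toy illustration of the idea, with the distributed service stood in for by an in-process lock table, might look like the following; all resource and node names are invented for the example.

```go
package main

import (
	"fmt"
	"sync"
)

// lockManager is a toy stand-in for a Distributed Lock Manager: it tracks the
// lock on each shared resource and makes other nodes wait, serialising access
// to shared data on the shared disk.
type lockManager struct {
	mu    sync.Mutex
	locks map[string]*sync.Mutex
}

func newLockManager() *lockManager {
	return &lockManager{locks: map[string]*sync.Mutex{}}
}

func (lm *lockManager) acquire(resource string) *sync.Mutex {
	lm.mu.Lock()
	l, ok := lm.locks[resource]
	if !ok {
		l = &sync.Mutex{}
		lm.locks[resource] = l
	}
	lm.mu.Unlock()
	l.Lock() // blocks until any other node releases the resource
	return l
}

func main() {
	dlm := newLockManager()
	var wg sync.WaitGroup

	// Two cluster nodes both want to update the same shared-disk resource.
	for _, node := range []string{"node-a", "node-b"} {
		wg.Add(1)
		go func(node string) {
			defer wg.Done()
			l := dlm.acquire("disk0:customers.db")
			fmt.Printf("%s holds the lock and updates the shared file\n", node)
			l.Unlock()
		}(node)
	}
	wg.Wait()
}
```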

DLM co-ordination, however, generates additional message traffic between systems and serialises access to shared resources, which can heavily penalise system performance. Where performance is a high priority, the shared nothing model may be more suitable.

The shared nothing model minimises potential performance loss by assigning a subset of available resources to particular systems within the cluster.

This means that only one system at any one time may own and access a single resource, although when a node within the cluster fails, another dynamically assigned system may take ownership of the resource. In addition, client requests are automatically routed to the system that owns the desired resource.
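Purely as an illustration of the shared nothing model, the sketch below shows each resource owned by exactly one node, client requests routed to the owner, and ownership reassigned when the owner fails. The resource and node names are made up for the example.

```go
package main

import "fmt"

// cluster records which node owns each resource and which nodes are healthy.
type cluster struct {
	owner map[string]string // resource -> owning node
	up    map[string]bool   // node -> healthy?
}

// route returns the node that should serve a request for the resource,
// reassigning ownership to a surviving node if the current owner has failed.
func (c *cluster) route(resource string) string {
	node := c.owner[resource]
	if !c.up[node] {
		for candidate, healthy := range c.up {
			if healthy {
				c.owner[resource] = candidate
				return candidate
			}
		}
	}
	return node
}

func main() {
	c := &cluster{
		owner: map[string]string{"orders-db": "node-a", "web-share": "node-b"},
		up:    map[string]bool{"node-a": true, "node-b": true},
	}
	fmt.Println("orders-db served by", c.route("orders-db"))

	c.up["node-a"] = false // node-a crashes
	fmt.Println("after failure, orders-db served by", c.route("orders-db"))
}
```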

Both the shared disk and shared nothing models can be run side by side within the same cluster: some applications are most easily able to exploit the capabilities of the cluster through the shared disk model, while others may work best in a shared nothing environment.

Using server-based applications

Without any modification to existing server-based applications, clustering injects a high degree of availability and scalability into the network.

Cluster-aware applications, however, are able to take greater advantage of the benefits that a clustered environment can provide.

Database server applications must be enhanced to either accommodate access to shared data in a shared disk cluster, or to partition an SQL request into a set of sub-requests in a shared nothing cluster.

Furthermore, in a shared nothing cluster, the database server may want to take advantage of the partitioned data by making intelligent, parallel queries for executing across the entire cluster. The application server software may also be enhanced, by way of cluster APIs, to detect component failures and initiate fast recovery.
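As a hypothetical sketch of such a partitioned, parallel query - the data, node names and query are all invented - the request is split into per-partition sub-requests, run in parallel across the nodes, and the partial results merged.

```go
package main

import (
	"fmt"
	"sync"
)

// Each node of a shared nothing cluster holds one partition of the data.
type partition struct {
	node  string
	sales map[string]int // region -> revenue held on this node
}

// totalRevenue splits the query into per-partition sub-requests, runs them in
// parallel, then gathers and merges the partial results.
func totalRevenue(parts []partition, region string) int {
	results := make(chan int, len(parts))
	var wg sync.WaitGroup
	for _, p := range parts {
		wg.Add(1)
		go func(p partition) { // sub-request executed on one node
			defer wg.Done()
			results <- p.sales[region]
		}(p)
	}
	wg.Wait()
	close(results)

	total := 0
	for r := range results {
		total += r
	}
	return total
}

func main() {
	parts := []partition{
		{node: "node-a", sales: map[string]int{"uk": 120_000}},
		{node: "node-b", sales: map[string]int{"uk": 87_500}},
		{node: "node-c", sales: map[string]int{"uk": 64_250}},
	}
	fmt.Println("SELECT SUM(revenue) WHERE region='uk' =>", totalRevenue(parts, "uk"))
}
```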

Now that PC technology is maturing and applications are becoming more and more sophisticated, vendors are able to bring some of this high-end technology down to the mainstream market.

Microsoft's approach: WolfPack

With its WolfPack strategy, Microsoft is looking to expand the functionality, availability and scalability of Windows NT. This will be achieved in two stages.

The first, due for release soon, will implement the two-node fail-over scenario detailed earlier, where both nodes have access to the same disk storage system. When one of the servers fails, the software and applications are dynamically allocated to another system within the cluster.

Phase two of Microsoft's WolfPack strategy is the multiple node solution.

This will allow more than two servers to be connected for even greater availability and performance. As a result, when the overall load exceeds the capabilities of the cluster, additional nodes can be added to the cluster to increase its capabilities.

Traditionally, users have had to make the commitment up front to buy high-end servers that can accommodate multiple processors, additional storage and large memory capacities; clustering enables the network to grow on an ad hoc basis, without having to commit to specific hardware or technologies.

Microsoft intends to make WolfPack an industry standard add-on for Windows NT. To achieve this, it will be making available its Clustering APIs and Software Development Kit to enable developers to test their existing products, adapt them for clustered environments and build cluster-specific applications.

To help it push WolfPack as the standard for clustering, Microsoft has enlisted the help of industry-leading hardware vendors, including Compaq, Digital, Hewlett-Packard, IBM, Intel, NCR and Tandem, many of which have had their own proprietary clustering solutions for years.

Digital has been clustering its VMS systems for the past 15 years and, more recently, its Digital Unix servers.

Whether or not Microsoft is looking to promote clustering technology to dissuade users from focusing on the scalability and availability issues surrounding Windows NT remains to be seen.

Clustering, however, does extend NT's feature set and certainly goes a long way towards dispelling many of the myths and rumours that surround it.

SAP R/3 AND HOW IT WORKS WITH CLUSTERING

The utmost concern for users running vital applications within R/3 is the data security, availability and performance of the production environment. Traditionally, many hardware vendors have been forced to deliver a two-tier service in which the application and database servers co-exist on a single platform. High-speed networks between the tiers, such as FDDI, have helped - although only marginally - to reduce the pain for users in multiple system environments.

The major problem when using this approach for large high availability installations is the management of such an environment - it is difficult to see where resources are being used.

A great deal of network traffic is generated between the application and database servers within an R/3 system, which can have a major influence on network performance. For this reason it is imperative to have a traffic monitoring and management system.

The base architecture of SAP R/3 moves requests from a client - a PC or X-terminal, for example - through well-defined modules. These modules are essentially application servers and database servers. With pure SMP systems, the application components are required to reside on the same machine due to network and processor limitations.

With the high-speed bandwidth of clustered servers these boundaries no longer apply. The processing modules can therefore be detached from each other to provide greater flexibility and scalability for SAP customers.

Processor-hungry modules, such as the "post-part" module, which until now has been tightly bound to the database engine, can now be allocated to other nodes within the cluster, freeing up valuable system resources on the database node.

Another advantage of a clustered R/3 environment is the ability to distribute modules among the applications servers, providing further flexibility and scalability as computing needs change.

Furthermore, the ability to separate the most critical components of the R/3 architecture - the "enqueue server" and the "message server" - from the applications servers increases throughput for the applications.

Separating the components and spreading them across various nodes also builds a high degree of availability into the system. If a single node running the entire R/3 environment fails, the whole system is lost.

In a clustered environment, if a single node fails, that node's service or server application can then be dynamically allocated to another node within the system and work can continue as normal.

NOVELL INTRANETWARE AND WOLF MOUNTAIN

Wolf Mountain is the code name for a set of technologies developed by Novell to provide reliable, available, scalable and manageable networking solutions based around future releases of Intranetware. Unlike Microsoft's more cautious two-phase clustering deployment, Novell has decided to jump right in at the deep end and go for the jugular.

Novell's equivalent to WolfPack's two-node fail-over system, SFTIII, has been available for over four years. Novell feels that its customers require more in the way of availability and scalability than either SFTIII or WolfPack has to offer.

The Wolf Mountain technology demonstration at Brainshare, in March this year, consisted of a 12-node cluster. This provides a level of availability beyond what the initial release of WolfPack will deliver, but whether or not Wolf Mountain will have the same impact on the network operating system market as WolfPack remains to be seen.

By means of its Intercluster protocol, Novell claims to be able to support up to 255 individual servers, each with 32 processors, thus creating a single virtual server with 8,160 processors. Quite what the utilisation of each processor will be within a live environment is not yet known, but it does sound impressive nonetheless.

But will anyone actually require the power of that many processors? Probably not, but it does at least go some way to demonstrating the flexibility and performance of Novell's Wolf Mountain technologies.

Like Microsoft, Novell has partnered itself with a variety of hardware vendors. Today, the list of Wolf Mountain technology partners includes Intel, Oracle, IBM, HP, Mitsubishi, Siemens Nixdorf, Unisys and G2 Networks.

In the future, Novell is looking to expand the number of partners helping to develop the standard.
