DOTS: Diversity with Off-The-Shelf Components

Key Advances and Supporting Methodology

Many of the results from the project's tasks fall into two main clusters:

  • Wrapping as a way of structuring fault tolerance with OTS components [PR01], where DOTS was largely focussed on architectural issues. We developed a more methodical approach to wrapping than proposed by other researchers, and experimented with its application to realistic case-studies.
  • Fault tolerance against design faults in popular OTS database servers [PS03]. We chose this as a challenging application in which to demonstrate the viability of the DOTS approach, through architectural proposals and prototyping, and to evaluate its desirability through measurement. These commercial systems are quite complex and pose significant problems both in the implementation of fault tolerance and in dependability assessment.

In addition, we produced useful exploratory advances in various application fields for the DOTS approach. These are reported below under the headings of the various tasks.

Task P (Probabilistic Modelling)

This task had a measurement strand and a modelling strand. On the measurement side, we set out to evaluate the potential advantage of diversity in our chosen application, database management systems. DOTS developed a test harness comprising a fault-tolerant database server (i.e., multiple diverse OTS servers with a wrapper that lets client applications see them as a single database) and test clients that apply various "synthetic" loads, such as the industry-standard TPC-C benchmark. In an initial campaign of over one million test cases, diversity tolerated a bug in one of the commercial servers (for which we also demonstrated an ad hoc protective wrapper) [PS04]. Our synthetic loads activated no other bugs, probably because they mostly test the core functions of the database engines.

We intend to continue the measurements with more comprehensive loads, but we also explored the potential effect of diversity by the alternative method of analysing fault records: we surveyed more than 180 reported bugs for these OTS servers, and checked whether the failure-causing demands documented for one server would also cause failures in other servers. We found [GP04b] that i) most reported bugs do not produce "clean" crash failures, so they could not be tolerated without diversity; and ii) most would be tolerated by diversity: even a simple two-version configuration would detect the failures caused by between 94% and 100% of the bugs.
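The detection mechanism of a two-version configuration can be illustrated with a minimal sketch. This is not the DOTS test harness: the `Server` callables stand in for adapters to real diverse SQL servers, and every name below is hypothetical.

```python
from typing import Callable, Hashable, Iterable, List, Tuple

# A "server" is any callable mapping a query string to result rows.
# Adapters to real diverse SQL servers would sit behind it (assumption).
Row = Tuple[Hashable, ...]
Server = Callable[[str], Iterable[Row]]


class DisagreementError(Exception):
    """Raised when the two diverse servers return different results."""


def normalise(rows: Iterable[Row]) -> List[Row]:
    # Sort rows so that a difference in row order alone is not flagged.
    # Order-sensitive queries would need a stricter comparison.
    return sorted(rows, key=repr)


def two_version_query(server_a: Server, server_b: Server, query: str) -> List[Row]:
    """Run the query on both servers and compare the results.

    An exception from one server is treated as a self-evident (crash)
    failure, and the other server's answer is returned. Two differing
    answers reveal a non-self-evident failure: it is detected, but two
    versions alone cannot tell which answer is the correct one.
    """
    results = []
    for server in (server_a, server_b):
        try:
            results.append(normalise(server(query)))
        except Exception:
            results.append(None)
    a, b = results
    if a is None and b is None:
        raise RuntimeError("both servers failed on: " + query)
    if a is None:
        return b
    if b is None:
        return a
    if a != b:
        raise DisagreementError(query)
    return a
```

With only two versions a disagreement can be detected but not masked; masking would need a third version or some additional diagnosis of which server is at fault.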

Diverse redundancy can also improve performance, since diverse servers present systematic differences in their efficiency on various kinds of queries. We measured this effect, estimating a three-fold improvement on single-transaction latencies (for a two-version server - Interbase and PostgreSQL - with the TPC-C benchmark transaction mix).
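The source of such a gain can be sketched as follows, assuming the wrapper returns whichever server answers first. The latencies below are invented purely for illustration and are not measurements from the project; the function name is hypothetical.

```python
from statistics import mean
from typing import List, Sequence


def first_response_latencies(server_a: Sequence[float],
                             server_b: Sequence[float]) -> List[float]:
    """Per-transaction latency when the faster of two replies is used."""
    if len(server_a) != len(server_b):
        raise ValueError("latency lists must cover the same transactions")
    return [min(a, b) for a, b in zip(server_a, server_b)]


# Invented per-transaction latencies (ms): each server is systematically
# faster on a different kind of transaction.
latencies_a = [10.0, 40.0, 10.0, 40.0]
latencies_b = [30.0, 20.0, 30.0, 20.0]

combined = first_response_latencies(latencies_a, latencies_b)
speedup_over_a = mean(latencies_a) / mean(combined)
```

The benefit depends on the two servers being fast on different transactions: if one server were faster on every transaction, the pair would be no faster than that server alone.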

Both lines of assessment work are continuing. Other potential targets for similar measurement studies were considered, but most were found less promising for practical data collection. We also ran an exploratory, low-effort experiment measuring instances of "difficulty functions" (one of the basic concepts in reliability models of diverse systems), and other aspects of programming error, with encouraging results [BB04], [MB04].

In dependability modelling and prediction, two issues were of most interest:

  • Systems based on OTS components may require changes to existing models, due to different practical constraints. One important issue for industry is the retrofit of software-based OTS components into critical systems. We described [P02] how to apply Bayesian inference to this scenario, combining the historical reliability data about the system, the reliability record of the OTS component and the observation of operation of the system after the retrofit. The paper demonstrates the risks of relying on the independence assumptions made elsewhere in the literature. A conservative approach to assessing diverse-redundant OTS-based systems has also been analysed (a paper is in preparation).
  • The efficacy of architectures based on wrappers depends on the probability of the wrapper failing together with the wrapped component, a fact ignored in much current literature. This probability is very difficult to estimate, and the use of fault injection methods to this end is even more questionable for software faults than for physical faults. We have addressed this problem by extending the modelling techniques previously used in assessing N-version architectures (a paper is being finalised).
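The two issues above can be illustrated with a deliberately simplified sketch. It is not the model of [P02] or of the papers in preparation: it assumes a single Beta prior on the probability of failure per demand, updated by observed demands (the standard Beta-binomial result), and it writes the failure probability of a wrapped component as a product involving a conditional probability. All names and figures are hypothetical.

```python
from typing import Tuple


def beta_posterior(alpha: float, beta: float,
                   demands: int, failures: int) -> Tuple[float, float]:
    """Update a Beta(alpha, beta) prior on the probability of failure per
    demand after observing `failures` failures in `demands` demands."""
    if not 0 <= failures <= demands:
        raise ValueError("failures must lie between 0 and demands")
    return alpha + failures, beta + demands - failures


def beta_mean(alpha: float, beta: float) -> float:
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)


def wrapped_failure_probability(p_component_fails: float,
                                p_wrapper_misses_given_failure: float) -> float:
    """P(system fails) = P(component fails) * P(wrapper misses | component fails).

    The second factor is conditional on the component failing. Replacing it
    with the wrapper's marginal failure probability amounts to assuming
    independence, which can be badly optimistic.
    """
    return p_component_fails * p_wrapper_misses_given_failure


# Invented figures: a prior with mean 0.01, then 900 failure-free demands.
a_post, b_post = beta_posterior(1.0, 99.0, demands=900, failures=0)
posterior_mean = beta_mean(a_post, b_post)
```

The conditional factor in `wrapped_failure_probability` is exactly the quantity that is difficult to estimate in practice.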

In view of the growing interest in the application of diversity to security problems, we reviewed the recent literature in the area and published a critique with proposals for applying to security scenarios the modelling ideas that are well developed for reliability [LS04]. The critique has been well received and has led to contacts for potential new research.

Task I (Information Modelling and Analysis)

A review of the dependability issues concerning the use of smart sensors in plant control environments was initiated with the help of a case study (supplied by British Energy); issues considered were availability of evidence, quality of evidence, and potential benefits from diversity. A more general study identified specific failure modes of smart sensors that do not arise with conventional sensors (e.g. failures due to timing aspects and information overload) and discussed their effect on the potential for common cause failure in non-diverse redundant systems.

It is not unusual for information on some aspects of the behaviour of an OTS item to be unavailable, and acquiring the missing information may entail significant investments in time and effort. By categorising the various information elements that may be elucidated concerning the anticipated behaviour of an OTS item, we derived a checklist to guide a system integrator in compiling a set of constraints that characterise the acceptable behaviour of the OTS item [PR01].

An extension to the planned scope of Task I produced a formal semantics for a programming language which provides mechanisms for the specification, implementation and composition of black-box components (including OTS items and protective wrappers). More recent work [MR05] has explored the role of pre- and post-conditions in formally specifying wrapper requirements.
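The idea of expressing wrapper requirements as pre- and post-conditions can be sketched informally as follows. This is not the formal notation of [MR05]: the decorator, the exception and the example component are all hypothetical.

```python
import functools
from typing import Any, Callable


class ContractViolation(Exception):
    """Raised when a call breaks the pre- or post-condition."""


def contract(pre: Callable[..., bool], post: Callable[..., bool]):
    """Wrap a black-box component with a pre- and a post-condition.

    `pre` receives the call arguments; `post` receives the result followed
    by the same arguments.
    """
    def decorate(component: Callable[..., Any]) -> Callable[..., Any]:
        @functools.wraps(component)
        def wrapper(*args, **kwargs):
            if not pre(*args, **kwargs):
                raise ContractViolation("pre-condition failed: invalid input")
            result = component(*args, **kwargs)
            if not post(result, *args, **kwargs):
                raise ContractViolation("post-condition failed: invalid output")
            return result
        return wrapper
    return decorate


# Hypothetical OTS item: maps a demand in percent to a valve position in [0, 1].
@contract(pre=lambda demand: 0.0 <= demand <= 100.0,
          post=lambda position, demand: 0.0 <= position <= 1.0)
def ots_valve_position(demand: float) -> float:
    return demand / 100.0
```

A pre-condition violation points to the rest of the system misusing the item, whereas a post-condition violation points to the item itself, so the two can trigger different recovery actions.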

Task A (Architecture)

The main focus of work on Task A has been on developing systematic methods for engineering protective wrappers in systems with OTS items [MR05], [PR01], [R02]. After analysing a number of existing approaches, we introduced the concept of Acceptable Behaviour Constraints (ABCs), intended to characterise the correct/expected behaviour exhibited at the interface between the OTS item and the rest of the system, and proposed a guide for developing ABCs so that they, in effect, define service contracts between the item and the system [PR01]. Based on the ABCs, a protective wrapper can be devised as a new component that monitors information flow across the interface and takes recovery actions when ABCs are violated.

This approach has been investigated by means of a realistic case study [AF03a], [MR05], in which a steam-raising system was modelled (in Simulink) using an industrial simulator. The boiler system is controlled by a PID regulator, which represents a typical COTS item employed in many industrial settings. By identifying and categorising error symptoms indicating a breach of the ABCs (e.g. excessive signal oscillation, or loss of signal) we devised an error detection regime; based on the severity of the detected symptoms, the wrapper selected from a hierarchy of recovery responses (ranging from "wait and see" to an emergency shutdown) [AF03b].

To evaluate the effectiveness of this approach, we conducted an empirical investigation by inserting the software wrapper into the simulator and running a large number of scenarios with various forms of fault injection. Initial analysis of the data produced indicates that the wrapper provided effective tolerance to a high percentage (c. 90%) of severe system failures. We have argued that the simulated environment offers considerable real-world validity (papers are in preparation).
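A minimal sketch of severity-based recovery selection follows. It is not the wrapper used in the case study: the symptom checks, the thresholds, the intermediate response and all names are assumptions made for illustration.

```python
from enum import IntEnum
from typing import List, Optional, Sequence


class Recovery(IntEnum):
    """Recovery responses ordered by severity."""
    NONE = 0
    WAIT_AND_SEE = 1
    HOLD_LAST_GOOD_OUTPUT = 2  # hypothetical intermediate response
    EMERGENCY_SHUTDOWN = 3


def count_reversals(samples: Sequence[float]) -> int:
    """Count direction changes in the signal, a crude oscillation measure."""
    diffs = [b - a for a, b in zip(samples, samples[1:])]
    return sum(1 for d1, d2 in zip(diffs, diffs[1:]) if d1 * d2 < 0)


def select_recovery(window: Sequence[Optional[float]], low: float,
                    high: float, max_reversals: int) -> Recovery:
    """Check a window of samples against simple ABCs and return the most
    severe response called for. `None` marks a missing sample."""
    present: List[float] = [s for s in window if s is not None]
    responses = [Recovery.NONE]
    if not present:
        # Complete loss of signal over the whole window.
        responses.append(Recovery.EMERGENCY_SHUTDOWN)
    elif len(present) < len(window):
        # Intermittent loss of signal: tolerate it for now.
        responses.append(Recovery.WAIT_AND_SEE)
    if any(not low <= s <= high for s in present):
        # Value outside the acceptable range.
        responses.append(Recovery.EMERGENCY_SHUTDOWN)
    if count_reversals(present) > max_reversals:
        # Excessive oscillation of the signal.
        responses.append(Recovery.HOLD_LAST_GOOD_OUTPUT)
    return max(responses)
```

Ordering the responses in an `IntEnum` lets the wrapper take the maximum over all detected symptoms, so a severe symptom is never masked by a milder one detected in the same window.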

An extension of this work to the area of web services [KP04] has proposed architectures to improve the dependability of component services. Related work [GR03], [GR04] has developed an architectural style enabling the utilisation of protective wrappers from an early phase of system development; by adding a protective wrapper, an OTS component is transformed into an "idealised fault-tolerant component". A common thread running through architectural work at both Newcastle and City is the recognition that mismatches of operational assumptions between an OTS item and the system in which it is embedded are the essence of the problem. Protective wrappers offer a well-structured and general technique for addressing these mismatches by reducing the degree of arbitrariness in the failure semantics of the OTS item.

Work on fault tolerance with database servers analysed a range of architectures that can be applied depending on the failure modes to be tolerated. The designer has to trade off the requirements for ensuring concurrency control and consistency among database replicas, and for supporting OTS servers that may differ in the way they implement important parts of internal scheduling and concurrency control, against possible gains in the form of tolerating more and subtler failure modes or of better performance. We have outlined [PS04], [GP04b] several architectures corresponding to such trade-offs, including conservative synchronisation strategies to guarantee consistency among transaction orderings on diverse servers, and various degrees of optimism in committing transactions, trading off between the two potential advantages from diversity, namely faster service and better dependability. The fault-tolerant server in our test harness, although only intended for experimentation, has proved suitable for testing some of these fault tolerance schemes with four popular database management products, producing evidence for the practical feasibility of our proposals. As a practical demonstration, we also ported to this server a database in everyday use at City.

References

[AF03a] T. Anderson, M. Feng, S. Riddle, A. Romanovsky. Protective Wrapper Development: A Case Study. 2nd Int. Conf. on COTS-Based Software Systems (ICCBSS '03), Ottawa, Canada, LNCS 2580, Springer, pp 1-14, 2003

[AF03b] T. Anderson, M. Feng, S. Riddle, A. Romanovsky. Error Recovery for a Boiler System with OTS PID Controller. ECOOP '03 Workshop on Exception Handling in Object-Oriented Systems (eds A. Romanovsky et al), TR 03-028, Dept of Computer Science, Univ. of Minnesota, USA, pp 74-83, 2003

[BB04] J.G.W. Bentley, P.G. Bishop, M. van der Meulen. An Empirical Exploration of the Difficulty Function, SAFECOMP '04, Potsdam, Germany, LNCS 3219, Springer, pp 60-71, 2004

[GP04a] I. Gashi, P. Popov, V. Stankovic, L. Strigini. On Designing Dependable Services with Diverse Off-The-Shelf SQL Servers, Architecting Dependable Systems (eds R. de Lemos et al), LNCS 3069, Springer, pp 196-220, 2004

[GP04b] I. Gashi, P. Popov, L. Strigini. Fault diversity among off-the-shelf SQL database servers, Int. Conf. on Dependable Systems and Networks (DSN '04), Florence, Italy, pp 389-398, 2004

[GR03] P.A. de C. Guerra, C.M.F. Rubira, A. Romanovsky, R. de Lemos. A Fault-Tolerant Software Architecture for COTS-Based Software Systems. 4th ESEC/FSE Conf., Helsinki, Finland, pp 375-378, 2003

[GR04] P.A. de C. Guerra, C.M.F. Rubira, A. Romanovsky, R. de Lemos. A Dependable Architecture for COTS-Based Software Systems using Protective Wrappers. Architecting Dependable Systems II (eds R. de Lemos et al), LNCS 3069, Springer, pp 147-170, 2004

[KP04] V. Kharchenko, P. Popov, A. Romanovsky. On Dependability of Composite Web Services with Components Upgraded Online, Int. Conf. on Dependable Systems and Networks (DSN '04 - Workshop supplement), Florence, Italy, pp 287-291, 2004

[LS04] B. Littlewood, L. Strigini. Redundancy and Diversity in Security, 9th European Symp. on Research in Computer Security (ESORICS '04), Sophia Antipolis, France, LNCS 3193, Springer, pp 423-438, 2004

[MB04] M.J.P. van der Meulen, P.G. Bishop, M. Revilla. An Exploration of Software Faults and Failure Behaviour in a Large Population of Programs, ISSRE '04, Rennes, France, 2004 [to appear]

[MR05] M.J.P. van der Meulen, S. Riddle, L. Strigini, N. Jefferson. Protective Wrapping of Off-the-Shelf Components, 4th Int. Conf. on COTS-Based Software Systems (ICCBSS '05), Bilbao, Spain, 2005 [to appear]

[P02] P. Popov. Reliability Assessment of Legacy Safety-Critical Systems Upgraded with Off-the-Shelf Components, SAFECOMP '02, Catania, Italy, LNCS 2434, Springer, pp 139-150, 2002

[PS03] P. Popov, L. Strigini. Diversity with Off-The-Shelf Components: A Study with SQL Database Servers, Int. Conf. on Dependable Systems and Networks (DSN '03 - Fast Abstracts supplement), San Francisco, USA, pp B84-B85, 2003

[PS04] P. Popov, L. Strigini, A. Kostov, V. Mollov and D. Selensky. Software Fault-Tolerance with Off-the-Shelf SQL Servers, 3rd Int. Conf. on COTS-Based Software Systems (ICCBSS '04), Redondo Beach, USA, pp 117-126, 2004

[PR01] P. Popov, S. Riddle, A. Romanovsky, L. Strigini. On Systematic Design of Protectors for Employing OTS Items, 27th Euromicro Conf., Workshop on Component-Based Software Engineering, Warsaw, Poland, pp 22-29, 2001

[R02] A. Romanovsky. On version state recovery and adjudication in class diversity. Computer Systems Science and Engineering 17, 3, pp 159-168, 2002
