56
FTC - VIII - FTS - 1 HUMBOLDT-UNIVERSITÄT ZU BERLIN INSTITUT FÜR INFORMATIK Zuverlässige Systeme für Web und E-Business (Dependable Systems for Web and E-Business) Vorlesung 8 FAULT-TOLERANT SOFTWARE Wintersemester 2000/2001 Leitung: Prof. Dr. Miroslaw Malek www.informatik.hu-berlin.de/~rok/zs

FTC - VIII - FTS - 1 HUMBOLDT-UNIVERSITÄT ZU BERLIN INSTITUT FÜR INFORMATIK Zuverlässige Systeme für Web und E-Business (Dependable Systems for Web and

  • View
    215

  • Download
    2

Embed Size (px)

Citation preview

FTC - VIII - FTS - 1

HUMBOLDT-UNIVERSITÄT ZU BERLININSTITUT FÜR INFORMATIK

Zuverlässige Systeme für Web und E-Business (Dependable Systems for Web and E-Business)

Vorlesung 8

FAULT-TOLERANT SOFTWARE

Wintersemester 2000/2001

Leitung: Prof. Dr. Miroslaw Malek

www.informatik.hu-berlin.de/~rok/zs

FTC - VIII - FTS - 2

OBJECTIVES

• TO INTRODUCE MEASURES OF SOFTWARE RELIABILITY AND COMPLEXITY

• TO INTRODUCE TECHNIQUES FOR RELIABLE AND FAULT-TOLERANT SOFTWARE DESIGN

FTC - VIII - FTS - 3

CONTENTS

RELIABILITY MEASURES

COMPLEXITY MEASURES

EXCEPTION HANDLING

RECOVERY BLOCKS

N-VERSION PROGRAMMING

FTC - VIII - FTS - 4

RELIABILITY MODELSJELINSKI - MORANDA MODEL

• ASSUMPTION:– A HAZARD RATE FOR FAILURES IS A PIECEWISE CONSTANT

FUNCTION AND IS PROPORTIONAL TO THE REMAINING NUMBER OF ERRORS 

• z(t) = C [ N - (i - 1) ]• C - PROPORTIONALITY CONSTANT• N - THE NUMBER OF ERRORS INITIALLY IN THE PROGRAM• z(t) IS TO BE APPLIED IN THE INTERVAL BETWEEN

DETECTION OF ERROR (i -1) AND DETECTION OF ERROR i• R(t) = EXP { - C [N-(i-1) t] } • MTTF = 1/ { C [N-(i-1)] }

FTC - VIII - FTS - 5

SHOOMAN'S MODEL (SIMILAR TO JELINSKI-MORANDA'S)

• ASSUMPTIONS:– FAILURE RATE (HAZARD FUNCTION) IS PROPORTIONAL TO

THE NUMBER OF REMAINING ERRORS. ERRORS ARE REMOVED AS SOON AS THEY ARE DISCOVERED

• R(t) = EXP { -C[ET/IT - EC/IT] t }

• ET - THE NUMBER OF ERRORS INITIALLY PRESENT IN THE PROGRAM

• IT - THE NUMBER OF STATEMENTS IN THE PROGRAM

• EC - THE NUMBER OF ERRORS CORRECTED SO FAR BOTH MODELS ARE ALMOST IDENTICAL

FTC - VIII - FTS - 6

SCHNEIDEWIND'S APPROACH

• SCHNEIDEWIND APPROACHED SOFTWARE RELIABILITY MODELING FROM EMPIRICAL VIEWPOINT. HE RECOMMENDED INVESTIGATION OF DIFFERENT RELIABILITY FUNCTIONS SUCH AS THE EXPONENTIAL, NORMAL, GAMMA AND WEIBULL AND CHOOSING THE DISTRIBUTION THAT BEST FITS THE PARTICULAR PROJECT IN QUESTION.

FTC - VIII - FTS - 7

MUSA'S MODEL(EXECUTION TIME MODEL)

• ASSUMPTIONS:– EXECUTION TIME (PROCESS TIME) IS THE BASE FOR

SOFTWARE RELIABILITY THEORY INSTEAD OF CALENDAR TIME. z(t) IS PROPORTIONAL TO THE NUMBER OF REMAINING ERRORS. EXECUTION TIME RATE OF CHANGE OF THE NUMBER OF FAULTS CORRECTED IS PROPORTIONAL TO THE HAZARD RATE.

• R(t) = EXP { [ -fK(N-m)] t }• t - THE EXECUTION TIME• f - AVERAGE INSTRUCTION EXECUTION TIME DIVIDED BY

THE TOTAL NUMBER OF INSTRUCTIONS.• K - PROPORTIONALITY CONSTANT• N -THE NUMBER OF ERRORS INITIALLY PRESENT IN THE

PROGRAM• m -THE NUMBER OF ERRORS CORRECTED SO FAR

FTC - VIII - FTS - 8

BAYESIAN MODELLITTLEWOOD'S MODEL

• ASSUMPTIONS:– EXPONENTIAL DISTRIBUTION OF FAILURES WITH RESPECT

TO OPERATING TIME OF THE PROGRAM

– INDEPENDENT ERROR DISTRIBUTION

– FAILURE RATE PARAMETER IS A RANDOM PROCESS

– ERROR IS REMOVED UPON OCCURRENCE

• z(t) = (N-i)a / (b + t + t1)

• a, b - PARAMETERS OF THE GAMMA DISTRIBUTION• t - TIME ELAPSED SO FAR

• R(t) = {1- [(b+t) / (b + t0 + t1)]b +[(b + t0) / (b + t0 + t1 + t)] a} N-i

• t1 - EXECUTION TIME

FTC - VIII - FTS - 9

CONCLUSIONS• SOFTWARE RELIABILITY MEASURES CAN AT PRESENT BE

ACHIEVED WITH PRETTY GOOD ACCURACY IF PROGRAMMING TEAM HAS A SUBSTANTIAL TRACK DATA AND LOTS OF RELIABILITY DATA TO SUPPORT IT

• NO STANDARD OR WIDELY ACCEPTED RELIABILITY MODEL EXISTS

• LOTS OF EXPERIMENTAL DATA ARE NEEDED

• SOFTWARE RELIABILITY MEASUREMENTS SHOULD BE AUTOMATED AND BUILT IN ANY LARGE SOFTWARE PROJECT

• PROGRAM COMPLEXITY MEASURES MAY ALSO HELP IN EVALUATION OF SOFTWARE RELIABILITY

FTC - VIII - FTS - 10

SOFTWARE COMPLEXITY MEASURES • COMPLEXITY MEASURES SHOULD REFLECT THE DEGREE OF

DIFFICULTY INHERENT IN DESIGNING AND IMPLEMENTING INTRICATE DATA STRUCTURES, ALGORITHMS, PROGRAM STRUCTURES, AND INTERMODULAR COMMUNICATION SCHEMES

• OVER 300 DIFFERENT SOFTWARE METRICS HAVE BEEN DEFINED AND, TO SOME DEGREE, EXAMINED FOR VALIDITY AND USEFULNESS

• TYPICAL MEASURE: – LINES OF CODE

• TYPICAL INDUSTRY STANDARDS: – FROM 0.3 TO 2 BUGS PER 1000 LINES OF CODE

• TYPICAL REALITY:– ONE BUG PER 100 LINES OF CODE

• WORST CASE SEEN: – 2.2 BUGS PER INSTRUCTION

• QUALITY ASSURANCE COSTS: – FROM 2% to 80% OF THE TOTAL SOFTWARE DEVELOPMENT COST

FTC - VIII - FTS - 11

A STATISTICAL APPROACH TO PROGRAM COMPLEXITY

• THE STATISTICAL APPROACH IS BASED ON THE VIEWPOINT THAT THE COMPLEXITY OF THE PROGRAM IS RELATED TO THE NUMBER OF OPERATORS (e.g., +, -, *, /, >, <, IF THEN ELSE, DO WHILE) AND OPERANDS (VARIABLES AND CONSTANTS*)

• EXAMPLE: – A FORTRAN PROGRAM WHICH USES NEWTON'S METHOD TO

SOLVE FOR THE SQUARE ROOT OF A NUMBER

– THE FOLLOWING (SEE NEXT PAGE), IS A TYPICAL PROGRAM TO EVALUATE THE SQUARE ROOT (B) OF A NUMBER (X)

* CHARACTER STRING VARIABLES AND CHARACTER STRING CONSTANTS MUST ALSO BE COUNTED

FTC - VIII - FTS - 12

THE PROGRAM

READ(5,1)X

1 FORMAT(F10.5)

A = X/2

2 B = (X/A + A)/2

C = B - A

IF(C.LT.0)C = -C

IF(C.LT.10.E-6)GO TO 3

A = B

GO TO 2

3 WRITE (6,1)B

STOP

END

1. THIRTEEN DISTINCT OPERATORS (OPERATOR TYPES): READ, FORMAT, =, /, ( ), +, -, .LT., IF, GO TO, WRITE, STOP, AND END

2. TWENTY-FOUR TOTAL OPERATORS: FIVE =; THREE/; TWO EACH OF ( ), -, .LT., IF, AND GO TO; AND ONE EACH OF THE OTHER SIX

3. ELEVEN DISTINCT OPERANDS (OPERAND TYPES): X, F 0.5, A, 2, B, C, 10.E - 6, 0, AND THE LABELS 1, 2, AND 3

4. TWENTY-FIVE TOTAL OPERANDS: FIVE EACH OF A AND C, FOUR B; THREE X; TWO 2; AND ONE EACH OF THE OTHER SIX

FTC - VIII - FTS - 13

ESTIMATION OF PROGRAM LENGTH A PRIORI

1. ESTIMATE THE NUMBER OF OPERATOR TYPES n1.

2. ESTIMATE THE NUMBER OF DISTINCT OPERANDS n2 BY COUNTING INPUT VARIABLES, OUTPUT VARIABLES, INTERMEDIATE VARIABLES, AND CONSTANTS WHICH WILL BE NEEDED.

3. ESTIMATE THE PROGRAM LENGTH

N = t(0.5772 + lnt) (ZIPF'S LAW)

 

• WHERE t = n1+n2

FTC - VIII - FTS - 14

ESTIMATION OF PROGRAM LENGTH FOR SEVERAL EXAMPLES

Type of Program Language Instruction

lengthEstimated

token lengthActual

token length

1. QUADRAC

Given coefficients A, B, C, as inputs, program solves and classifies roots of quadratic equation (Brown, 1976, p. 877)

Basic 53 171 177

2. PLOT

Plots 6 different functions of a single variable simultaneously (Brown, 77, 486)

Basic 110 258 306

3. Case study 3

Current in a series RLC circuit (McCracken, 1961, p. 65)

Fortran 18 140 111

4. Servo frequency response

The frequency response of a second-order transfer function is to be evaluated at a number of points (McCracken, 1961, 68)

Fortran 17 90 88

5. Subroutine RTF

Calculating the roots of a given function by three different methods (IBM, 1969, p.159)

PL/1 119 275 409

From M. L. Shooman

FTC - VIII - FTS - 15

HALSTEAD'S MEASURE

N = n1 LOG n1 + n2 LOG n2

• N - PROGRAM LENGTH (TOTAL NUMBER OF OPERATORS AND OPERANDS)

• PROGRAM VOCABULARY– n1 - NUMBER OF OPERATOR TYPES

– n2 - NUMBER OF OPERAND TYPES

• HALSTEAD AND ZIPF'S LENGTH ARE CLOSELY RELATED

SINCE t = n1+n2

THEN 

N = (n1 + n2) [0.5772 + LN (n1 + n2)]

FTC - VIII - FTS - 16

GRAPH THEORETICAL COMPLEXITY MEASURES(CONTROL FLOW COMPLEXITY)

MCCABE'S CYCLOMATIC COMPLEXITY• CYCLOMATIC NUMBER (NULLITY) FOR A STRONGLY

CONNECTED GRAPH IS:

(G) = E - N + K (G) - THE CYCLOMATIC NUMBER FOR GRAPH G

– E - THE NUMBER OF DIRECTED EDGES (ARCS) IN G

– N - THE NUMBER OF VERTICES IN G

– K - THE NUMBER OF SEPARATE PARTS (COMPONENTS) IN G (K = 1 FOR A SINGLE PROGRAM)

• (G) CORRESPONDS TO:1. THE NUMBER OF FUNDAMENTAL CIRCUITS IN GRAPH G

2. THE NUMBER OF BASIC PATHS THAT CAN BE COMBINED TO MAKE UP ANY POSSIBLE CIRCUIT ON THE GRAPH

3. THE NUMBER OF CHORDS (EDGES FORMING FUNDAMENTAL CIRCUITS FROM A SPANNING TREE OF GRAPH G) 

• (G) IS ALSO A GOOD SOFTWARE COMPLEXITY MEASURE

FTC - VIII - FTS - 17

EXAMPLEa

b

c

d

e

f

g

h

i

1

2

3

4

5

6

7

8

10

11

13 12 g

Added phantom edge

(to make a graph strongly-connected)

McCABE NUMBER IS

(G) = e - n + 1=13 - 10 + 1= 4

 

In practice all modules in top-down structured design should have (G) = 10

FTC - VIII - FTS - 18

HSU AND MALEK'S CUMULATIVE MINIMAL CUT-SETS MEASURE (MCS)

• MCS IS A SUM OF THE NUMBER OF EDGES CUT AFTER EVERY DECISION BLOCK

MCS=5 MCS=6 MCS=7 MCS=8 (a) (b) (c) (d)

S-graphs.

FTC - VIII - FTS - 19

RELIABLE SOFTWARE DESIGN

• FAULT ANALYSIS - use fault trees

• FAULT DETECTION - use methods (as shown in Testing Techniques Unit)

• DAMAGE CONFINEMENT AND ASSESSMENT

• RECOVERY

• FAULT TREATMENT AND CONTINUED SERVICE

FTC - VIII - FTS - 20

FAULT ANALYSIS (1)

• Fault characterization 

• Faults are detected when an operation violates its specification

• Three modes of operation:– The operation may return

– An exception may be raised

– The operation may fail to terminate

FTC - VIII - FTS - 21

FAULT ANALYSIS (2)• OPERATION TERMINATES

– BY RETURNING AND• THE POSTCONDITION IS SATISFIED (CORRECT OUTCOME)• THE POSTCONDITION IS NOT SATISFIED

– WHEN AN EXCEPTION IS RAISED• BY THE OPERATION AND

– THE EXCEPTION CONDITION IS SATISFIED (CORRECT OUTCOME)

– THE EXCEPTION CONDITION IS NOT SATISFIED

• BY ANOTHER OPERATION AND– THE EXCEPTION CONDITION IS SATISFIED

– THE EXCEPTION CONDITION IS NOT SATISFIED

• OPERATION FAILS TO TERMINATE AND– THE OPERATION MAKES NO FURTHER CHANGES TO ITS

OBJECT

– THE OPERATION CHANGES ITS OBJECT REPEATEDLY

FTC - VIII - FTS - 22

OPERATIONS OUTCOME(Fig. 1 I. Durham's Dissertation, CMU, 1986)

DAMAGE CONFINEMENT AND ASSESSMENT

• CONFINE THE DAMAGE– SPREAD DURING THE TIME BETWEEN THE ERRONEOUS

TRANSITION AND THE DETECTION OF THE ERROR

– PLACE CONSTRAINTS ("FENCES") ON SYSTEM INFORMATION FLOW

– ATOMIC ACTIONS

• ASSESS THE DAMAGE– BEFORE STARTING RECOVERY

– BASED ON PREDICTABLE PROCESS INTERACTIONS

FTC - VIII - FTS - 23

ATOMIC ACTIONS • GENERALIZATION OF A TRANSACTION FROM DATABASE DESIGN

• NO INTERACTIONS BETWEEN THE ACTIVITY OF A GROUP OF COMPONENTS AND THE REST OF A SYSTEM DURING THE ACTIVITY

– PARALLELISM INCREASES SPREAD OF DAMAGE

• PREVENT UNWANTED INFORMATION FLOW– FIREWALLS (FENCES)

• LIMIT SPREAD OF DAMAGE EVEN IF FAULTS ARE PRESENT

P1 P2

T1

T2T3

TIME T PRESENT

TPRESENT - time of error detection 

T3 - time when process P2 starts autonomous action. (If a fault occurs in P2 after T3 then P2 has no impact on P1) 

If a fault occurs in P1 then both P1 and P2 should be recovered

FTC - VIII - FTS - 24

HARDWARE ENFORCED ATOMIC ACTIONS

• MESSAGE PASSING SYSTEMS– SIGNAL EXCEPTION IF ATTEMPT FROM WITHIN ATOMIC

ACTION TO SEND OR RECEIVE A MESSAGE FROM OUTSIDE OF ATOMIC ACTION

• COMMUNICATIONS THROUGH SHARED OBJECTS– CONSTRAIN EXECUTION SEQUENCES BY ALLOWING ONLY

PROCESSES INVOLVED IN ATOMIC ACTION TO EXECUTE • USE SEPARATE PROCESSORS

– CONSTRAIN DATA ACCESS BY ISOLATING OBJECTS WITHIN AN ATOMIC ACTION FROM OTHER OBJECTS

• LOCKS, SEMAPHORES, MONITORS• INTERRUPT DISABLING• DATABASE LOCKING• PROTECTION MECHANISMS

FTC - VIII - FTS - 25

PROTECTION MECHANISMS • LIST-BASED SCHEMES

– ASSOCIATE WITH EACH POTENTIALLY SHAREABLE OBJECT A LIST OF AUTHORIZED USERS; CHECK EACH ACCESS FOR AUTHORIZATION

• OBJECT– PROCESS X

– PROCESS Y

– PROCESS Z

– EASY TO ENFORCE

– COSTLY RUN-TIME EFFORT

• TICKET-BASED SCHEMES– ASSOCIATE WITH EACH PROCESS A LIST OF RIGHTS TO ACCESS AN

OBJECT; CHECK EACH PROCESS FOR ITS ACCESS TICKET• CAPABILITY-BASED• READ, WRITE, EXECUTE CAPABILITIES• PROCESS

– OBJECT X

– OBJECT Y

– OBJECT Z

– EFFICIENT RUN-TIME EFFORT

– DIFFICULT TO ENFORCE• PROCESS MIGHT INCORRECTLY GIVE TICKET AWAY

FTC - VIII - FTS - 26

RECOVERY

• ROLL-FORWARD (FORWARD ERROR RECOVERY)– REAL-TIME SYSTEM (OLD DATA ARE OBSOLETE)

– DAMAGE IS PREDICTABLE

– PERFORMANCE IS MORE IMPORTANT THAN CORRECTNESS

• ROLL-BACK (BACKWARD ERROR RECOVERY)– UNPREDICTABLE DAMAGE

– MORE EXPENSIVE

– CORRECTNESS IS MORE IMPORTANT THAN PERFORMANCE

FTC - VIII - FTS - 27

ROLL-BACK TECHNIQUES

• CHECKPOINTING• AUDIT TRAILS (AUDIT LOGS)• FILE CHECKPOINTS• TRANSACTION PROCESSING• RECOVERY CACHES (SPECIAL TYPE OF CHECKPOINT TO

MINIMIZE STORAGE)

• OBJECTIVE: – REPLACEMENT OF CURRENT FAULTY STATE WITH

PREDEFINED CORRECT STATE

FTC - VIII - FTS - 28

FAULT-TOLERANT SOFTWARE DEFINITIONS

• FAULT-TOLERANT SOFTWARE IS CONCERNED WITH TECHNIQUES WHICH ENABLE A SYSTEM TO TOLERATE SOFTWARE FAULTS

• SOFTWARE FAULT TOLERANCE IS CONCERNED WITH TECHNIQUES DESIGNED TO TOLERATE THE EFFECTS OF FAULTS IN THE UNDERLYING (HARDWARE) INTERPRETER. THOSE TECHNIQUES ARE IMPLEMENTED IN SOFTWARE

MAIN FAULT-TOLERANT SOFTWARE TECHNIQUES – EXCEPTION HANDLING

– RECOVERY BLOCKS (ROLL-BACK)

– N-VERSION PROGRAMMING

– ROLL-FORWARD

FTC - VIII - FTS - 29

EXCEPTION

• AN INDICATION THAT SOMETHING OUT OF THE ORDINARY HAS OCCURRED

• SHOULD BE DETECTED AND RAISED

• EXAMPLES– NONEXISTENT MEMORY ACCESS (OR OTHER PROTECTION

VIOLATION)

– DIVISION BY ZERO

– UNDERFLOW OR OVERFLOW

– SYSTEM CRASH

– TIMEOUT

FTC - VIII - FTS - 30

EXCEPTION MECHANISMS

• PROVIDE CAPABILITY IN THE PROGRAM TO EXPLICITLY RAISE AN EXCEPTION IF THE PROGRAM DETECTS AN ERRONEOUS STATE– RAISE (EXCEPTION NAME, OTHER PARAMETERS)

• ALLOW A SYSTEM ROUTINE TO SIGNAL THE RAISING OF AN EXCEPTION BY A CALLING PROGRAM– SIGNAL (EXCEPTION NAME, OTHER PARAMETERS)

• ALLOW USER-DEFINED EXCEPTIONS

• EACH PROCEDURE SHOULD HAVE TWO EXITS:– A STANDARD EXIT AND AN EXCEPTION EXIT

FTC - VIII - FTS - 31

EXCEPTION EXAMPLES

• OV IS THE OVERFLOW EXCEPTION; BACKWARD ERROR RECOVERY IS EXPLICITLY PROGRAMMED

– ADD PROCEDURE I = I + J + KPROC P SIGNALS OW;

BEGIN

I = I + J

[OV -> SIGNAL OW]; /*EXPLICITLY PROGRAM */

I = I + K

[OV -> I = I - J; SIGNAL OW]; /*BACKWARD RECOVERY*/

END;

• PROPAGATION OF THE EXCEPTION IS MASKED BY THE PROCEDURE; THE PROCEDURE PROVIDES STANDARD SERVICE IN SPITE OF THE EXCEPTION WHICH OCCURRED IN THE CALLED ROUTINE

PROC CREATE (Z: POSITIVE.INTEGER) SIGNALS NO.SPACE;

BEGIN

CALL M1.ALLOCATE(Z)

[DISK.OVERFLOW -> CALL M2.ALLOCATE(Z)

[DISK.OVERFLOW -> RECOVER RECOVERY POINT; SIGNAL NO.SPACE]];

. . .

END;

FTC - VIII - FTS - 32

THE RECOVERY BLOCK SCHEME (1)

• PRIMARY MODULE– DEBUGGED AND TESTED IRREDUNDANT SOFTWARE WHICH

HOPEFULLY MEETS SPECIFICATIONS

• ACCEPTANCE TEST– MEASURES FOR ERROR DETECTION IN THE FORM OF A

CHECK IN THE REASONABLENESS OF THE CALCULATED RESULTS

• STRUCTURE– PRIMARY MODULE ACCEPTANCE TEST

FTC - VIII - FTS - 33

THE RECOVERY BLOCK SCHEME (2)

• ACCEPTANCE TEST SHOULD RAISE AN EXCEPTION IF THE STATE OF THE SYSTEM IS NOT ACCEPTABLE

• THE MODULE IS SAID TO BE FAILED IF ANY EXCEPTION IS RAISED OR SIGNALED BY THE ACCEPTANCE TEST

• RECOVERY POINT IS ESTABLISHED AT THE BEGINNING OF THE PRIMARY MODULE

• ALTERNATE MODULE IS A STANDBY SOFTWARE TO BE EXECUTED IF PRIMARY MODULE FAILS

FTC - VIII - FTS - 34

THE SYSTEM WILL NOW CONSIST OF:

• ESTABLISH RECOVERY POINT• PRIMARY MODULE• ACCEPTANCE TEST• ALTERNATE MODULE• ACCEPTANCE TEST

• ADDITIONAL ALTERNATE MODULES CAN BE PROVIDED TO ENSURE BETTER SOFTWARE FAULT TOLERANCE

FTC - VIII - FTS - 35

RECOVERY BLOCK STRUCTURE

ENSURE < ACCEPTANCE TEST >

BY < PRIMARY MODULE >

ELSE BY < ALTERNATE MODULE 1 >

ELSE BY < ALTERNATE MODULE 2 >• • •

ELSE BY < ALTERNATE MODULE N >

ELSE ERROR

• NOTICE THAT THE RECOVERY BLOCK SCHEME IS INDEPENDENT OF1) PROGRAMMING STYLE

2) METHODOLOGY

3) LANGUAGE

FTC - VIII - FTS - 36

EXAMPLE

ENSURE A[j+1] A[j] FOR j=1,2,...,n-1

BY SORT A USING QUICKSORT

ELSE BY SORT B USING HEAPSORT

ELSE BY SORT C USING BUBBLESORT

ELSE ERROR

FTC - VIII - FTS - 37

OPTIMUM INTERVAL BETWEEN RECOVERY

POINTS • OPTIMUM INTERVAL BETWEEN RECOVERY POINTS

SHOULD BE ESTABLISHED ON THE BASIS OF– COMPLEXITY MEASURES OR 2tT

• WHERE t IS TIME TAKEN TO ESTABLISH A RECOVERY POINT• T IS THE MEAN TIME BETWEEN RECOVERY POINT

RESTORATIONS

• IF RECOVERY POINTS ARE TOO FREQUENT THEN HIGH PROPORTION OF TIME WILL BE SPENT RECORDING RECOVERY DATA. IF THE POINTS ARE TOO RARE THEN IT MAY TAKE A LONG TIME TO RECOVER.

FTC - VIII - FTS - 38

CASE STUDY - THE DEADLINE MECHANISM • THE DEADLINE MECHANISM IS INTENDED TO SUPPORT SOFTWARE

FAULT TOLERANCE IN REAL-TIME APPLICATIONS WHERE A PROGRAM HAS TO SATISFY REQUESTS FOR SERVICE WITH A GIVEN TIME LIMIT (DEADLINE) OR ELSE SYSTEM FAILURE IS LIKELY TO ENSUE

EXAMPLE: NAVIGATION PROGRAM OUTLINE

EVERY SECOND

WITHIN TEN MS

CALCULATE BY READ SENSORS; CALCULATE NEW POSITION PRIMARY MODULE(PM) 

ELSE BY APPROXIMATE NEW POSITION FROM OLD POSITION ALTERNATE MODULE(AM)

TIMING CONSTRAINTS FOR NAVIGATION PROGRAM

– PRIMARY FIRST THEN ALTERNATE IF NECESSARY (OPTIMISTIC SCHEME) OR

– ALTERNATE FIRST AND THEN PRIMARY (PESSIMISTIC SCHEME)

i REQUEST REQUEST i + PM AM

7 ms 3 ms

10 ms

ONE SECOND

1

FTC - VIII - FTS - 39

TECHNIQUES FOR THE ACCEPTANCE TESTS

1. REASONABLENESS TESTS

2. TIMING TESTS (TIMEOUTS)

3. REVERSAL TESTS

4. REPLICATION TESTS

• ACCEPTANCE TEST HAS A MAJOR IMPACT ON THE EFFECTIVENESS OF A RECOVERY BLOCK METHOD

• COMPLETENESS AND THEREFORE THE SIZE OF THE ACCEPTANCE TEST DEPENDS ON THE SYSTEM REQUIREMENTS 

• DIFFICULT PROBLEM:– TEST TIME AND STORAGE OVERHEAD VERSUS

COMPLETENESS

• ACCEPTANCE TEST OVERHEADS ARE USUALLY BETWEEN 3% AND 40%

FTC - VIII - FTS - 40

THE N-VERSION PROGRAMMING

• N-VERSIONS OF A PROGRAM (N>1), WHICH HAVE BEEN INDEPENDENTLY DESIGNED TO SATISFY A COMMON SPECIFICATION, ARE EXECUTED AND THEIR RESULTS COMPARED BY VOTING MECHANISM

• VERSIONS MAY BE WRITTEN BY DIFFERENT PROGRAMMERS AND/OR DIFFERENT ALGORITHMS MAY BE USED

• IMPLEMENTATIONS MAY BE ON:A. A SINGLE PROCESSOR

B. MULTIPLE HOMOGENEOUS PROCESSORS

C. MULTIPLE HETEROGENEOUS PROCESSORS

FTC - VIII - FTS - 41

DRIVER PROGRAM

• THE DRIVER PROGRAM (CONTROL PROGRAM) IS NEEDED

• THE DRIVER PROGRAM IS RESPONSIBLE FOR:

i. INVOKING EACH OF THE VERSIONS

ii. WAITING FOR THE VERSIONS TO COMPLETE THEIR EXECUTION

iii. COMPARING AND ACTING UPON THE N SETS OF RESULTS

FTC - VIII - FTS - 42

VOTING CHECK

• VOTING CHECK IS PERFORMED BY THE DRIVER PROGRAM OR A HARDWARE VOTER

• IT SHOULD BE APPLICATION-INDEPENDENT BUT IN PRACTICE IT VERY OFTEN NEEDS "INEXACT" COMPARISONS DUE TO THE ROUNDOFF ERRORS IN e.g., ARITHMETIC OPERATIONS

• SOME FORM OF RANGE CHECK CAN BE USED BUT IN GENERAL "INEXACT VOTING" REQUIREMENT IS A PRETTY DIFFICULT PROBLEM

FTC - VIII - FTS - 43

CASE STUDY ONN-VERSION PROGRAMMING (NVP)

• DESIGN DIVERSITY APPLIES TO BOTH:– HARDWARE AND SOFTWARE 

• NVP PROMISES– TOLERANCE OF DESIGN FAULTS (WHICH ARE USUALLY THE

MOST DIFFICULT TO DETECT)

• Based on: A. Avizienis and J.P.J. Kelly, "Fault Tolerance by Design Diversity: Concepts and Experiments", Computer, August, 1984

FTC - VIII - FTS - 44

N-VERSION PROGRAMMING (NVP)

• TWO REQUIREMENTS FOR SUCCESSFUL NVP (N-VERSION PROGRAMMING) AND/OR NMR (N-MODULAR REDUNDANCY)1) THE CONSISTENCY OF INITIAL CONDITIONS AND INPUTS

FOR ALL N COMPUTATIONS MUST BE ASSURED

2) A RELIABLE DECISION ALGORITHM THAT DETERMINES A SINGLE DECISION RESULT FROM THE MULTIPLE RESULTS MUST BE PROVIDED

• TIME-SPACE TRADEOFFS– SEQUENTIAL VS. CONCURRENT

– EXECUTION OF N VERSIONS

FTC - VIII - FTS - 45

NVP - EXPERIMENT (1)

TWO QUESTIONS:

1. WHAT IS REQUIRED IN TERMS OF– FORMAL SPECIFICATIONS– CHOICES OF PROBLEMS– NATURE OF ALGORITHMS– TIMING CONSTRAINTS (SYNCHRONIZATION)– DECISION ALGORITHMS

TO MAKE NVP FEASIBLE?

2. WHAT METHODS SHOULD BE USED TO COMPARE THE COST AND THE EFFECTIVENESS OF THE NVP APPROACH TO: – SINGLE VERSION PROGRAMMING AND THE RECOVERY BLOCK

METHOD?

FTC - VIII - FTS - 46

NVP - EXPERIMENT (2)

• 30 PROGRAMMERS WERE SELECTED• STEPS OF EXPERIMENT

1. RECRUITING

2. TEACHING OBJ AND PDL (SPECIFICATION LANGUAGES)

3. EXAMINING AND RANKING

4. ASSIGNING THE PROBLEM

5. EVALUATING PROGRAMS

• PROBLEM (SPECS WRITTEN IN OBJ, PDL AND ENGLISH)

• AIRPORT SCHEDULER (OPERATION OF AN AIRPORT, ARRIVING AND DEPARTING FLIGHTS, SEAT RESERVATION)– 100 TRANSACTIONS WITH N-VERSION

– CROSS-CHECK AFTER EACH TRANSACTION

FTC - VIII - FTS - 47

NVP - EXPERIMENT (3)

• OBJ SPECIFICATION WAS 13 PAGES LONG• PDL SPECIFICATION WAS 74 PAGES LONG• ENGLISH SPECIFICATION WAS 10 PAGES LONG• AFTER 4 WEEKS 18 WORKING PROGRAM VERSIONS OF

THE AIRPORT SCHEDULER WERE RECEIVED. PROGRAMS PASSED THE STANDARD VIABILITY TEST

• 11 OF 18 PROGRAMS ABORTED ON INVALID INPUT– (VERY DANGEROUS IF NVP IS IMPROPERLY IMPLEMENTED;

IT MAY STOP THE SYSTEM)

• THEREFORE AN ACCEPTANCE TEST WITH RECOVERY WAS ADDED TO ALL 11 PROGRAMS

FTC - VIII - FTS - 48

NVP - EXPERIMENT (4)

• 18 PROGRAMS– 7 OBJ

– 5 PDL

– 6 ENGLISH

• Table 1. Characteristics of the 18 versions of programs written in OBJ, PDL, and English by UCLA study participants. For each program version, the number of PL/1 statements used in the program (PL/1 STMTS), the number of procedures used (NO. PROCS), the compile time (COMPILE TIME), the execution time for the 100 transaction test case (EXEC. TIME), and the program size in bytes (SIZE BYTES) are shown. The time unit is a machine-unit second, which is a measure of time, and other resources, such as I/O needs.

FTC - VIII - FTS - 49

NVP – EXPERIMENT (Table 1)

NAME OF PL/1 NO. COMPILE EXEC. SIZEVERSION STMTS PROCS TIME TIME BYTES 

OBJ1 423 22 15.14 3.89 37600OBJ2 400 28 11.35 3.96 28048OBJ3 398 17 7.42 4.33 30904OBJ4 328 14 8.62 4.77 29920OBJ5 455 14 14.79 3.10 32304OBJ6 243 16 4.71 2.70 20960OBJ7 336 23 8.30 4.92 34808PDL1 455 27 16.96 3.16 24928PDL2 501 33 19.58 19.68 29656PDL3 242 19 4.31 4.09 27360PDL4 437 39 16.31 2.84 30896PDL5 217 11 4.26 4.30 26440ENG1 260 21 4.75 3.33 27552ENG2 372 19 12.41 3.89 37792ENG3 385 30 8.12 2.41 20648ENG4 689 25 28.23 2.94 26864ENG5 481 15 8.76 2.42 24056ENG6 387 12 19.23 3.99 24656

FTC - VIII - FTS - 50

TEST RESULTS (Table 2) • Test results for individual versions of the 18 programs written in

OBJ, PDL, and English OK COSMETIC GOOD DETECTED UNDET.

VERSION POINTS ERRORS OK OR COS. ERRORS ERRORS 

OBJ1 73 0 73 2 25OBJ2 71 18 89 8 3OBJ3 67 11 78 4 18OBJ4 69 3 72 8 20OBJ5 67 12 79 0 21OBJ6 46 0 46 0 54OBJ7 52 17 69 7 24PDL1 59 2 61 1 38PDL2 54 2 56 32 12PDL3 95 0 95 4 1PDL4 45 28 73 0 27PDL5 94 0 94 5 1ENG1 74 12 86 0 14ENG2 67 27 94 0 6ENG3 97 1 98 0 2ENG4 30 5 35 25 40ENG5 55 6 61 0 39ENG6 53 3 56 9 35

•OK - EXPECTED RESULT;

•COSMETIC-BAD FORMAT

•DETECTED (D) ERROR - IF PROGRAM FAILED AT TYPOS, SPELLING;

•UNDETECTED (U or U*)

•IF DETECTED BY COMPARISON WITH OTHER VERSIONS U-DISTINCT U*-SIMILAR

•GOOD =OK+COSMETICS(G)

FTC - VIII - FTS - 51

NVP - EXPERIMENT (Table 3)ALL 816 POSSIBLE TRIPLETS OF 18 VERSIONS (( ) = 816) WERE EXECUTED WITH 3-VERSION DECISION ALGORITHM, USING 100 TRANSACTIONS FOR EACH TRIPLET

OUT OF THE 64* RESULTS THAT COULD BE OBTAINED, 27 WITH A SINGLE U*, WERE DISCARDED. THESE 37 CASES FALL INTO 14 CATEGORIES (SEE TABLE 4).

* CODE SPACE (G,D,U,U*), 3 CAN BE CHOSEN; THEREFORE ( ) 24=64 CASES

TRIPLET NUMBER OFCOMPOSITION TRIPLETS 000 35PPP 10EEE 20OPE 210OOP 105OOE 126OPP 70OEE 105PPE 60PEE 75ALL 816

Table 3. A list of the three-version triplets where 0 designates an OBJ version, P designates a PDL version, and E designates an English version.

183

43

FTC - VIII - FTS - 52

TEST RESULTS (Table 4)• Table 4. The three-version decision algorithm, where G = Good

point, D = Detected Error, and U or U* = Undetected Error (either distinct or similar, respectively). The confidence level describes how many version results the decision algorithm used in obtaining the decision result

VERSION NO. OF DECISION DECISION CONF.CASE RESULTS ERRORS ALGORITHM RESULT LEVELV1 G,G,G 0 TRIPLEX G 3V2 G,G,D 1 DUPLEX G 2V3 G,G,U 1 TRIPLEX G 2V4 G,D,D 2 SIMPLEX G 1V5 G,D,U 2 DUPLEX D 0V6 G,U,U 2 TRIPLEX D 0V7 G,U*,U* 2 TRIPLEX U* 2V8 D,D,D 3 NULL D 0V9 D,D,U 3 SIMPLEX U 1V10 D,U,U 3 DUPLEX D 0V11 D,U*,U* 3 DUPLEX U* 2V12 U,U,U 3 TRIPLEX D 0V13 U,U*,U* 3 TRIPLEX U* 2V14 U*,U*,U* 3 TRIPLEX U* 3

FTC - VIII - FTS - 53

Table 5. The results of decision algorithmsCASE ALL 000 PPP EEE OPE OTHERV1 36665 1703 448 820 9354 24340

44.9% 48.7% 44.8% 41.0% 44.5% 45.0%V2 5292 129 128 166 1341 3528

6.5% 3.7% 12.8% 8.3% 6.4% 6.5%V3 22105 842 274 554 5939 14496

27.1% 24.1% 27.4% 27.7% 28.3% 26.8%V4 1283 48 8 32 347 848

1.6% 1.4% 0.8% 1.6% 1.7% 1.6%V5 3986 141 88 80 1046 2631

4.9% 4.0% 8.8% 4.0% 5.0% 4.9%V6 6838 264 18 206 1747 4603

8.4% 7.5% 1.8% 10.3% 8.3% 8.5%V7 1944 86 12 82 386 1378

2.4% 2.5% 1.2% 4.1% 1.8% 2.5%V8 176 4 0 0 65 107

0.2% 0.1% 0.0% 0.0% 0.3% 0.2%V9 477 20 7 4 123 323

0.6% 0.6% 0.7% 0.2% 0.6% 0.6%V10 867 11 6 22 274 554

1.1% 0.3% 0.6% 1.1% 1.3% 1.0%V11 87 6 0 0 28 53

0.1% 0.2% 0.0% 0.0% 0.1% 0.1%V12 1415 173 1 20 275 946

1.7% 4.0% 0.1% 1.0% 1.3% 1.7%V13 353 48 0 8 62 235

0.4% 1.4% 0.0% 0.4% 0.3% 0.4%V14 112 25 10 6 13 58

0.1% 0.7% 1.0% 0.3% 0.1% 0.1%Total 81600 3500 1000 2000 21000 54100

100.0% 100.0% 100.0% 100.0% 100.0% 100.0%

THREE-VERSION TRIPLET

COMPOSITION

•TRIPLEX DECISION - ALL 3 OUTPUTS WERE CONSIDERED

•DUPLEX DECISION - WHEN ONE VERSION WAS DISCARDED DUE TO "INTERNAL" DETECTION BY ACCEPTANCE TEST

•SIMPLEX DECISION - WHEN TWO VERSIONS FAILED A.T. (CONF. LEVEL 1)

•NULL DECISION - WHEN THREE VERSIONS FAILED A.T.

FTC - VIII - FTS - 54

NVP EXPERIMENT (TYPES OF FAULTS)

SPEC. APPEARS IN DESCRIPTIONFAULT OBJ PDL ENG S1 1,4,5,7 Only defines four destinationsS2 4,5,6 Time shown as 09.45 in exampleS3 4,5,6 Error message order ambiguityS4 2,4 Duplicate error message ambiguityS5 6 Parameter checking ambiguity

Table 6. Specification faults causing similar errors

Table 7. Interpretation faults causing similar errors

INT. APPEARS IN DESCRIPTIONFAULT OBJ PDL ENGI1 2 1,2,3,4,5 Did not check input parameters firstI2 1,4 Expects time as 09:45I3 1,2,5 4 1 Wrong error message on illegal inputI4 4 4 1 Create does not work the second timeI5 2,4 5 Error message output on legal inputI6 3 1 6 Allows null parametersI7 1 4 Allows invalid parametersI8 6 No output after first list producedI9 2 Cannot handle invalid input

FTC - VIII - FTS - 55

Table 8. Implementation faults causing similar errors

• OF TYPES– FIVE SPECIFICATION FAULTS (NOTICE: NO FAULTS IN THE PDL

SPECS)

– NINE INTERPRETATION FAULTS

– SEVEN IMPLEMENTATION FAULTS (PROBLEMS WITH DELETING RECORDS AS REQUIRED)

IMPL. APPEARS IN DESCRIPTIONFAULT OBJ PDL ENG L1 1 Cancel does not work on last entryL2 3 5 Cancel works only on partial databaseL3 7 Cancel unknown cancels last entryL4 1 Cannot retrieve recordL5 1 Allows duplicate recordL6 4 Cancel, reserve-seat ignoredL7 4 Bad input leads to chaos

FTC - VIII - FTS - 56

NVP• ADVANTAGES:

– V&V TECHNIQUES CAN BE USED TO ALL VERSIONS

– SAME OR SIMILAR SOFTWARE TOOLS CAN BE USED

– DESIGN FAULTS ARE LESS CRITICAL

– THE COST OF FAULT ANALYSIS AND ELIMINATION SHOULD BE REDUCED BECAUSE REPAIRS ARE LESS URGENT

– POTENTIAL FOR PROVIDING NEARLY 100% AVAILABILITY

• DISADVANTAGE:– HIGH INITIAL COST

– LIMITED TO APPLICATIONS WITH WELLDEFINED OUTPUTS WHICH USE DETERMINISTIC ALGORITHMS ONLY