
DS - II - DCMM - 1

HUMBOLDT-UNIVERSITÄT ZU BERLIN, INSTITUT FÜR INFORMATIK

Zuverlässige Systeme für Web und E-Business (Dependable Systems for Web and E-Business)

Lecture 2

DEPENDABILITY CONCEPTS,MEASURES AND MODELS

Winter Semester 2000/2001

Instructor: Prof. Dr. Miroslaw Malek

http://www.informatik.hu-berlin.de/~rok/zs

DS - II - DCMM - 2

DEPENDABILITY CONCEPTS, MEASURES AND MODELS

• OBJECTIVES
  – TO INTRODUCE BASIC CONCEPTS AND TERMINOLOGY IN FAULT-TOLERANT COMPUTING

– TO DEFINE MEASURES OF DEPENDABILITY

– TO DESCRIBE MODELS FOR DEPENDABILITY EVALUATION

– TO CHARACTERIZE BASIC DEPENDABILITY EVALUATION TOOLS

• CONTENTS
  – BASIC DEFINITIONS

– DEPENDABILITY MEASURES

– DEPENDABILITY MODELS

– EXAMPLES OF DEPENDABILITY EVALUATION TOOLS

DS - II - DCMM - 3

ADDING A THIRD DIMENSION

[Figure: a three-axis design space with COST, PERFORMANCE and DEPENDABILITY as dimensions; the DEPENDABILITY axis comprises RELIABILITY, AVAILABILITY, MTTF, MTTR, MISSION TIME, FAULT TOLERANCE, ETC.]

DS - II - DCMM - 4

FAULT INTOLERANCE

• PRIOR ELIMINATION OF CAUSES OF UNRELIABILITY
  – fault avoidance
  – fault removal

• NO REDUNDANCY

• MANUAL / AUTOMATIC SYSTEM MAINTENANCE

• FAULT INTOLERANCE ATTAINS RELIABLE SYSTEMS BY:
  – very reliable components
  – refined design techniques
  – refined manufacturing techniques
  – shielding
  – comprehensive testing

DS - II - DCMM - 5

DEGREES OF DEFECTS

• FAILURE
  – occurs when the delivered service deviates from the specified service; failures are caused by errors

• FAULT
  – incorrect state of hardware or software

• ERROR
  – manifestation of a fault within a program or data structure forcing a deviation from the expected result of computation (incorrect result)

[Figure: SOURCES OF ERRORS (from Siewiorek and Swarz) — physical defects, incorrect design, unstable environment, operator mistakes, and unstable or marginal hardware give rise to permanent, intermittent, and transient (temporary internal/external) faults, which manifest as errors and lead to service failure.]

DS - II - DCMM - 6

TYPICAL FAULT MODELS FOR PARALLEL / DISTRIBUTED SYSTEMS

• CRASH 

• OMISSION 

• TIMING 

• INCORRECT COMPUTATION / COMMUNICATION

• ARBITRARY (BYZANTINE)

DS - II - DCMM - 7

FAULT TOLERANCE: BENEFITS & DISADVANTAGE

• FAULT TOLERANCE
  – ACCEPT THAT AN IMPLEMENTED SYSTEM WILL NOT BE FAULT-FREE

– FAULT TOLERANCE IS ATTAINED BY REDUNDANCY IN TIME AND/OR REDUNDANCY IN SPACE 

– AUTOMATIC RECOVERY FROM ERRORS 

– COMBINING REDUNDANCY AND FAULT INTOLERANCE 

• BENEFITS OF FAULT TOLERANCE
  – HIGHER RELIABILITY

– LOWER TOTAL COST 

– PSYCHOLOGICAL SUPPORT OF USERS 

• DISADVANTAGE OF FAULT TOLERANCE – COST OF REDUNDANCY

DS - II - DCMM - 8

FAULT-TOLERANT COMPUTING OVERVIEW

[Figure: component failures (physical defects), external disturbances (unstable environment), operator mistakes, and incorrect specification, design or implementation lead to hardware faults and software faults; faults produce errors; errors result in service failure (system malfunction).]

DS - II - DCMM - 9

FAULT CHARACTERIZATIONS

• CAUSE
  – specification

– design

– implementation

– component

– external

• NATURE
  – hardware

– software

– analog

– digital

• DURATION
  – permanent

– temporary

– transient

– intermittent

– latent

• EXTENT
  – local

– distributed

• VALUE
  – determinate

– indeterminate

DS - II - DCMM - 10

BASIC TECHNIQUES/REDUNDANCY TECHNIQUES (1)

• HARDWARE REDUNDANCY

– Static (Masking) Redundancy

– Dynamic Redundancy

• SOFTWARE REDUNDANCY

– Multiple Storage of Programs and Data

– Test and Diagnostic Programs

– Reconfiguration Programs

– Program Restarts

DS - II - DCMM - 11

BASIC TECHNIQUES/REDUNDANCY TECHNIQUES (2)

TIME (EXECUTION REDUNDANCY)

Repeat or acknowledge operations at various levels

Major Goal - Fault Detection and Recovery

DS - II - DCMM - 12

DEPENDABILITY MEASURES

• DEPENDABILITY IS A VALUE OF QUANTITATIVE MEASURES SUCH AS RELIABILITY AND AVAILABILITY AS PERCEIVED OR DEFINED BY A USER

• DEPENDABILITY IS THE QUALITY OF THE DELIVERED SERVICE SUCH THAT RELIANCE CAN JUSTIFIABLY BE PLACED ON THIS SERVICE

• DEPENDABILITY IS THE ABILITY OF A SYSTEM TO PERFORM A REQUIRED SERVICE UNDER STATED CONDITIONS FOR A SPECIFIED PERIOD OF TIME

DS - II - DCMM - 13

RELIABILITY

• RELIABILITY R(t) OF A SYSTEM IS THE PROBABILITY THAT THE SYSTEM WILL PERFORM SATISFACTORILY FROM TIME ZERO TO TIME t, GIVEN THAT OPERATION COMMENCES SUCCESSFULLY AT TIME ZERO (THE PROBABILITY THAT THE SYSTEM WILL CONFORM TO ITS SPECIFICATION THROUGHOUT A PERIOD OF DURATION t)

• HARDWARE: Exponential distribution  R(t) = e^(-λt)

  Weibull distribution  R(t) = e^(-(λt)^α)    α - shape parameter,  λ - failure rate

• SOFTWARE: Exponential, Weibull, normal, gamma or Bayesian

R(t) = e^(-λt)

• A constant failure rate λ is assumed during the life of a system
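A minimal Python sketch of the two hardware reliability functions above; the λ and α values are hypothetical, chosen only for illustration:

```python
import math

def reliability_exponential(t, lam):
    """R(t) = e^(-lambda*t): constant failure rate lambda."""
    return math.exp(-lam * t)

def reliability_weibull(t, lam, alpha):
    """R(t) = e^(-(lambda*t)^alpha): alpha is the shape parameter."""
    return math.exp(-((lam * t) ** alpha))

# Hypothetical values: lambda = 1e-4 failures/hour, one year of operation (~8760 h)
print(reliability_exponential(8760, 1e-4))       # ~0.42
print(reliability_weibull(8760, 1e-4, 0.8))      # shape < 1 models a decreasing failure rate
```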

DS - II - DCMM - 14

MORTALITY CURVE: FAILURE RATE VS. AGE

[Figure: bathtub curve of failure rate λ(t) vs. age, with three regions — EARLY LIFE (decreasing λe(t)), USEFUL LIFE (constant λu), and WEAR OUT (increasing λw(t)) — separated by the transition times TE and TW.]

DS - II - DCMM - 15

EXPANDED RELIABILITY FUNCTION

FOR EARLY LIFE    λ(t) = λe(t) + λu + λw(t),  BUT λw(t) IS NEGLIGIBLE DURING EARLY LIFE.

FOR USEFUL LIFE   λ(t) = λe(t) + λu + λw(t),  BUT BOTH λe(t) AND λw(t) ARE NEGLIGIBLE DURING USEFUL LIFE.

FOR WEAR OUT      λ(t) = λe(t) + λu + λw(t),  WITH λe(t) NEGLIGIBLE.

THUS, THE GENERAL RELIABILITY FUNCTION BECOMES

R(t) = Re(t) Ru(t) Rw(t)

OR

R(t) = exp [ -∫0t λ(τ) dτ ] = exp { -∫0t [ λe(τ) + λu + λw(τ) ] dτ }

DS - II - DCMM - 16

SERIES SYSTEMS - the failure of any one module causes the failure of the entire system

[Figure: three modules R1(t), R2(t), R3(t) connected in series]

Rs(t) = R1(t) R2(t) R3(t) = e^(-(λ1 + λ2 + λ3) t)

In general for n serial modules:

R(t) = ∏i=1..n Ri(t) = e^(-λt),   WHERE λ = Σi=1..n λi

λ  - SYSTEM FAILURE RATE
λi - INDIVIDUAL MODULE FAILURE RATE

For n identical modules:

R(t) = [ Rm(t) ]^n

DS - II - DCMM - 17

PARALLEL SYSTEMS - assume each module operates independently and the failure of one

module does not affect the operation of the other

RP(t) = RA(t) + RB(t) - RA(t)RB(t)

[Figure: modules A and B in parallel; Venn diagram of the events "A works" and "B works"]

Unreliability  Q(t) = 1 - R(t)
R(t) - probability of a module working

EXAMPLE: TWO IDENTICAL SYSTEMS IN PARALLEL, EACH CAN ASSUME THE ENTIRE LOAD.

RA(t) = RB(t) = 0.5   ⇒   RP(t) = 0.5 + 0.5 - (0.5)(0.5) = 0.75

DS - II - DCMM - 18

PARALLEL SYSTEMS (II)

In general for n parallel modules:

R(t) = 1 - (1 - R1(t)) (1 - R2(t)) . . . (1 - Rn(t))

For n identical modules in parallel:

R(t) = 1 - (1 - Rm(t))^n

In our example:

RP(t) = 1 - (1 - RA(t)) (1 - RB(t))
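A small Python sketch of the series and parallel formulas above; it reproduces the RA(t) = RB(t) = 0.5 example:

```python
from functools import reduce

def series_reliability(rs):
    """Series: the system works only if every module works."""
    return reduce(lambda acc, r: acc * r, rs, 1.0)

def parallel_reliability(rs):
    """Parallel: the system fails only if every module fails."""
    return 1.0 - reduce(lambda acc, r: acc * (1.0 - r), rs, 1.0)

print(parallel_reliability([0.5, 0.5]))   # 0.75, as in the example above
print(series_reliability([0.5, 0.5]))     # 0.25
```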

DS - II - DCMM - 19

SOFTWARE EXAMPLE: JELINSKI - MORANDA MODEL

• ASSUMPTION:
  – THE HAZARD RATE FOR FAILURES IS A PIECEWISE CONSTANT FUNCTION AND IS PROPORTIONAL TO THE REMAINING NUMBER OF ERRORS.

z(t) = C [N - (i - 1)] 

– C – PROPORTIONALITY CONSTANT

– N - THE NUMBER OF FAULTS INITIALLY IN THE PROGRAM

– z(t) IS TO BE APPLIED IN THE INTERVAL BETWEEN DETECTION OF ERROR (i - 1) AND DETECTION OF ERROR i.

  R(t) = exp{ -C · [N - (i - 1)] · t }

  MTTF = 1 / ( C · [N - (i - 1)] )

• STRONG CORRELATION WITH HARDWARE DESIGN RELIABILITY FUNCTION
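A brief Python sketch of the Jelinski-Moranda quantities above; N and C are hypothetical illustration values:

```python
import math

def jm_hazard(i, n_faults, c):
    """z(t) = C * [N - (i - 1)]: hazard rate while waiting for the i-th error."""
    return c * (n_faults - (i - 1))

def jm_reliability(t, i, n_faults, c):
    """R(t) = exp(-C * [N - (i - 1)] * t) over the interval after error i-1 was removed."""
    return math.exp(-jm_hazard(i, n_faults, c) * t)

def jm_mttf(i, n_faults, c):
    return 1.0 / jm_hazard(i, n_faults, c)

# Hypothetical: N = 100 faults initially, C = 0.001, waiting for the first failure
print(jm_mttf(1, 100, 0.001))              # 10.0
print(jm_reliability(5.0, 1, 100, 0.001))  # ~0.61
```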

DS - II - DCMM - 20

SIMILARITIES AND DIFFERENCES

• HARDWARE
  – Design and production oriented
  – HW wears out
  – Design errors are very expensive to correct

• SOFTWARE
  – Design oriented
  – SW does not wear out
  – Design errors are cheaper to correct

[Figure: λ(t) vs. t curves for hardware and software]

DS - II - DCMM - 21

WHAT IS MORE RELIABLE?

[Figure: hardware vs. software reliability compared over the lifetime from birth to death, contrasting R(HW) and R(SW).]

DS - II - DCMM - 22

AVAILABILITY

• AVAILABILITY
  – A(t) of a system is the probability that the system is operational (delivers satisfactory service) at a given time t.

• STEADY-STATE AVAILABILITY
  – As of a system is the fraction of lifetime that the system is operational.

• As = UPTIME / TOTAL TIME = μ / (λ + μ) = MTTF / (MTTF + MTTR)

  λ (failure rate),  μ (repair rate)

• MTTF (Mean Time to Failure)

• MTTR (Mean Time to Repair)

• MTBF (Mean Time Between Failures)

MTBF = MTTF + MTTR

MTTF = ∫0∞ R(t) dt = 1/λ,   MTTR = 1/μ   (for exponential distributions)
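A one-function Python sketch of the steady-state availability formula; the MTTF/MTTR values used here are the CPU figures from the example later in this lecture:

```python
def steady_state_availability(mttf, mttr):
    """As = MTTF / (MTTF + MTTR), equivalently mu / (lambda + mu)."""
    return mttf / (mttf + mttr)

print(steady_state_availability(1950, 1.20))   # ~0.99938, the CPU of the later example
```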

DS - II - DCMM - 23

MISSION TIME

• MISSION TIME

– MT(r) gives the time at which system reliability falls below the prespecified level r.

MT(r) = -ln(r) / λ

COVERAGE 

• a) Qualitative: List of classes of faults that are recoverable (testable, diagnosable) 

• b) Quantitative: The probability that the system successfully recovers given that a failure has occurred. 

• c) Quantitative: Percentage of testable/diagnosable/ recoverable faults.  

• d) Quantitative: Sum of the coverages of all fault classes, weighted by the probability of occurrence of each fault class. 

C = p1C1 + p2C2 + ....+ pnCn
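A short Python sketch of the mission time and weighted-coverage formulas above; the numeric inputs are hypothetical:

```python
import math

def mission_time(r, lam):
    """MT(r) = -ln(r) / lambda: time at which R(t) = e^(-lambda*t) drops below r."""
    return -math.log(r) / lam

def weighted_coverage(probs, coverages):
    """C = p1*C1 + p2*C2 + ... + pn*Cn."""
    return sum(p * c for p, c in zip(probs, coverages))

print(mission_time(0.95, 1e-4))                                  # ~513 hours
print(weighted_coverage([0.7, 0.2, 0.1], [0.99, 0.95, 0.80]))    # 0.963
```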

DS - II - DCMM - 24

DIAGNOSABILITY

• A system of n units is one step t-fault diagnosable (t-diagnosable) if all faulty units within the system can be located without replacement, provided the number of faulty units does not exceed t. (Preparata-Metze-Chien, 12/67)

1) n ≥ 2t + 1

2) At least t units must test each unit

FAULT TOLERANCE

• Fault tolerance
  – is the ability of a system to operate correctly in the presence of faults
  or
  – a system S is called k-fault-tolerant with respect to a set of algorithms {A1, A2, ..., Ap} and a set of faults {F1, F2, ..., Fq} if for every k-fault F in S, Ai is executable by SF for 1 ≤ i ≤ p. (Hayes, 9/76)

or – Fault tolerance is the use of redundancy (time or space) to achieve

the desired level of system dependability. 

SF is a subsystem of a system S with k-faults.

DS - II - DCMM - 25

COMPARATIVE MEASURES

• RELIABILITY DIFFERENCE:       R2(t) - R1(t)

• RELIABILITY GAIN:             Rnew(t) / Rold(t)

• MISSION TIME IMPROVEMENT:     MTI = MTnew(r) / MTold(r)

• SENSITIVITY:                  dR2(t)/dt vs. dR1(t)/dt

OTHER MEASURES 

MAINTAINABILITY (SERVICEABILITY)

is the probability that a system will recover to an operable state within a specified time.

SURVIVABILITY

is the probability that a system will deliver the required service in the presence of an a priori defined set of faults or any of its subsets.

DS - II - DCMM - 26

RESPONSIVENESS: AN OPTIMIZATION METRIC PROPOSAL

responsiveness = ri(t) = ai · pi

where

ri (t) reflects the responsiveness of a task at time t

ai denotes i-th task availability

pi represents probability of timely completion of the i-th task

DS - II - DCMM - 27

QUESTION

• IN MANY PRACTICAL SITUATIONS, ESPECIALLY IN REAL-TIME SYSTEMS, WE FREQUENTLY NEED TO ANSWER A QUESTION: 

WILL WE ACCEPT 

LESS PRECISE RESULT 

IN SHORTER TIME?

 • PROPOSED METRICS: 

– 1) WEIGHTED SUM 

– (α · PRECISION / β · TIME) · AVAILABILITY

– 2) QUOTIENT-PRODUCT 

– [PRECISION / log (TIME)] AVAILABILITY

DS - II - DCMM - 28

INTEGRITY(PERFORMANCE + DEPENDABILITY)

• TOTAL BENEFIT DERIVED FROM A SYSTEM OVER A TIME t
  – HOW TO MEASURE?
  – 1) TOTAL NUMBER OF USEFUL MACHINE CYCLES OVER THE TIME t
  – 2) P - performance index,  R - integrity level (probability that an expected service is delivered)

• Example:
  – In a multistage network the performance index could be P = N² (the number of paths in the network)

• L = number of levels in a network

2) I = P ∫0t R(τ) dτ

   For the multistage network:  I = N² ∫0t (1 - λp)(1 - λm)(1 - λs)^L dτ

DS - II - DCMM - 29

EXAMPLE

• COMPUTE THE OVERALL SYSTEM'S
  – MTTF

– MTTR

– MTBF

– AVAILABILITY

[System block diagram — MTTF (hours) / MTTR (hours) for each element:]

MAIN COMPUTER (SERIES):   CPU 1950 / 1.20,   4M 1500 / 2.00,   CONSOLE 3800 / 0.80
POWER (PARALLEL):         UPS 15,800 / 3.75,   UPS 15,800 / 3.75,   UTILITY 460 / 0.50
DISKS (3-OF-4, K OF N):   4 × DISK, each 1800 / 4.50

DS - II - DCMM - 30

SERIES ELEMENT SUBSYSTEM (1)

• Given MTTF and MTTR of each element

  – Total Failure Rate λs

  – Series MTTF

    MTTFs = 1 / λs = ( Σi=1..3 (1 / MTTFi) )^-1

DS - II - DCMM - 31

SERIES ELEMENT SUBSYSTEM (2)

• Availability

  A1 = 1950 / (1950 + 1.20) = 0.99938499
  A2 = 1500 / (1500 + 2.00) = 0.99866844
  A3 = 3800 / (3800 + 0.80) = 0.99978952

  As = ∏i=1..3 Ai = 0.99784418

• MTTR

  MTTRs = MTTFs (1 - As) / As = 1.4976 Hours

• MTBF

  MTBF = 693.2 + 1.5 = 694.7 Hours
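A Python sketch that recomputes the series-subsystem figures above from the element MTTF/MTTR values; rounding may differ in the last digits:

```python
def series_subsystem(elements):
    """elements: list of (MTTF, MTTR) pairs; returns (MTTFs, MTTRs, As)."""
    a_s = 1.0
    for mttf, mttr in elements:
        a_s *= mttf / (mttf + mttr)                 # As = product of element availabilities
    mttf_s = 1.0 / sum(1.0 / mttf for mttf, _ in elements)
    mttr_s = mttf_s * (1.0 - a_s) / a_s             # from As = MTTFs / (MTTFs + MTTRs)
    return mttf_s, mttr_s, a_s

# Main computer: CPU, 4M memory board, console
print(series_subsystem([(1950, 1.20), (1500, 2.00), (3800, 0.80)]))
# -> (~693.2 h, ~1.50 h, ~0.99784)
```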

DS - II - DCMM - 32

PARALLEL ELEMENT SUBSYSTEM (1)

• All elements must fail to cause subsystem failure
• MTTF and MTTR known for each element

• Unavailability for the entire subsystem is

  Us = ∏i=1..n Ui

• Availability is

  As = 1 - Us = 1 - ∏i=1..n (1 - Ai)

• MTTR

  MTTRs = ( Σi=1..n (1 / MTTRi) )^-1

• MTTF

  MTTFs = [ As / (1 - As) ] MTTRs

DS - II - DCMM - 33

PARALLEL ELEMENT SUBSYSTEM (2)

• Availability

  A1 = A2 = 15,800 / (15,800 + 3.75) = 0.9997627

  A3 = 460 / (460 + 0.50) = 0.9989142

  As = 1 - ∏i=1..n (1 - Ai)

  As = 1 - (0.0002373) (0.0002373) (0.0010858)

  As = 0.99999999994

DS - II - DCMM - 34

PARALLEL ELEMENT SUBSYSTEM (3)

• MTTR

  MTTRs = ( Σi=1..3 (1 / MTTRi) )^-1 = ( 1/3.75 + 1/3.75 + 1/0.50 )^-1 = 0.3947368 Hours

• MTTF

  MTTFs = [ As / (1 - As) ] MTTRs = [ 0.99999999994 / (1.0 - 0.99999999994) ] (0.3947368) = 6,456,914,809 Hours
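A matching Python sketch for the parallel power subsystem, using the formulas from the preceding slides:

```python
def parallel_subsystem(elements):
    """elements: list of (MTTF, MTTR) pairs; returns (MTTFs, MTTRs, As)."""
    u_s = 1.0
    for mttf, mttr in elements:
        u_s *= mttr / (mttf + mttr)                 # Us = product of element unavailabilities
    a_s = 1.0 - u_s
    mttr_s = 1.0 / sum(1.0 / mttr for _, mttr in elements)
    mttf_s = mttr_s * a_s / (1.0 - a_s)
    return mttf_s, mttr_s, a_s

# Two UPS units in parallel with the utility feed
print(parallel_subsystem([(15800, 3.75), (15800, 3.75), (460, 0.50)]))
# -> (~6.46e9 h, ~0.395 h, ~0.99999999994)
```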

DS - II - DCMM - 35

PARALLEL ELEMENT SUBSYSTEM (4)

• Paralleling modules is a technique commonly used to significantly upgrade system reliability  

• Compare one uninterruptible power supply (UPS) with availability of 0.9997627

• to the parallel combination of power supplies with availability of 0.99999999994

• In practical systems availability ranges from 0.99 to 0.999999999999 (2 to 12 nines)

DS - II - DCMM - 36

K of N PARALLEL ELEMENT SUBSYSTEM (1)

• Have N identical modules in parallel (assume all have the same MTTF and MTTR) 

• Only K elements are required for full operation
• K = 1 is the same as parallel
• K = N is the same as series

• Reliability

  Rs = Prob (system works for time T)
     = Prob (N modules work or N - 1 modules work or . . . or K modules work)

• Note: the above conditions are mutually exclusive

  Rs = Prob (N modules work) + Prob (N - 1 modules work) + . . . + Prob (K modules work)

DS - II - DCMM - 37

K of N PARALLEL ELEMENT SUBSYSTEM (2)

• RELIABILITY

  Rs = ΣL=K..N  [ N! / ((N - L)! L!) ] (Rm)^L (1 - Rm)^(N - L)

  where Rm is the individual module's reliability. Therefore, for 3 of 4 and Rm = 0.9:

  Rs = Rm^4 + (4 choose 3) Rm^3 (1 - Rm) = 0.9^4 + 4 × 0.9^3 × 0.1 = 0.9477
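A direct Python transcription of the K-of-N reliability sum, reproducing the 3-of-4 example:

```python
from math import comb

def k_of_n_reliability(k, n, r_m):
    """Probability that at least K of N identical modules (reliability Rm) work."""
    return sum(comb(n, l) * r_m**l * (1 - r_m)**(n - l) for l in range(k, n + 1))

print(k_of_n_reliability(3, 4, 0.9))   # 0.9477
```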

DS - II - DCMM - 38

K of N PARALLEL ELEMENT SUBSYSTEM (3)

• MTTR

  MTTRs = MTTR / (N - K + 1)

• MTTF

  MTTFs = MTTF (MTTF / MTTR)^(N-K) [ K (N choose K) ]^-1
        = MTTF (MTTF / MTTR)^(N-K) [ N! / ((N - K)! (K - 1)!) ]^-1

• AVAILABILITY

  As = MTTFs / (MTTFs + MTTRs)

• Example (3 out of 4 subsystem)
  – N = 4, K = 3
  – MTTF = 1800 Hours
  – MTTR = 4.50 Hours

DS - II - DCMM - 39

K of N PARALLEL ELEMENT SUBSYSTEM (4)

• MTTR

  MTTRs = MTTR / (N - K + 1) = 4.50 / (4 - 3 + 1) = 2.25 Hours

• MTTF

  MTTFs = MTTF (MTTF / MTTR)^(N-K) [ N! / ((N - K)! (K - 1)!) ]^-1

  MTTFs = 1800 (1800 / 4.5)^(4-3) [ 4! / ((4 - 3)! (3 - 1)!) ]^-1 = 60,000 Hours

DS - II - DCMM - 40

K of N PARALLEL ELEMENT SUBSYSTEM (5)

• AVAILABILITY

  As = MTTFs / (MTTFs + MTTRs) = 60,000 / (60,000 + 2.25) = 0.9999625

• MODULE AVAILABILITY

  A = MTTF / (MTTF + MTTR) = 1,800 / (1,800 + 4.50) = 0.9975062
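A Python sketch of the K-of-N MTTR/MTTF approximations above, written in the inverted-combinatorial form that reproduces the 60,000-hour result of this example:

```python
from math import comb

def k_of_n_mttr(k, n, mttr):
    """MTTRs = MTTR / (N - K + 1)."""
    return mttr / (n - k + 1)

def k_of_n_mttf(k, n, mttf, mttr):
    """MTTFs = MTTF * (MTTF/MTTR)^(N-K) / (K * C(N,K))."""
    return mttf * (mttf / mttr) ** (n - k) / (k * comb(n, k))

mttf_s = k_of_n_mttf(3, 4, 1800, 4.50)       # 60,000 h for the 3-of-4 disk subsystem
mttr_s = k_of_n_mttr(3, 4, 4.50)             # 2.25 h
print(mttf_s, mttr_s, mttf_s / (mttf_s + mttr_s))   # availability ~0.9999625
```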

DS - II - DCMM - 41

OVERALL SYSTEM (1)

• Three series elements
  – 1. Main Computer
  – 2. Power
  – 3. Disks

• MTTF

  MTTF = ( Σi=1..3 (1 / MTTFi) )^-1 = ( 1/693.2 + 1/(646 × 10^7) + 1/60,000 )^-1 = 685.25 Hours

DS - II - DCMM - 42

OVERALL SYSTEM (2)

• AVAILABILITY

  A = ∏i=1..3 Ai = (0.997844) (0.99999999994) (0.9999635)

  A = 0.9978033

• MTTR

  MTTR = [ (1 - A) / A ] MTTF = [ (1 - 0.9978033) / 0.9978033 ] 685.25 = 1.51 Hours

DS - II - DCMM - 43

OVERALL SYSTEM (3)

• Reliability Function
  – R(t) = e^(-t/MTTF)
  – R(2080) = e^(-2080/685.25)
  – R(2080) = 0.0480567

 • MTBF

– MTBF = MTTF + MTTR

– MTBF = 685.25 + 1.51 = 686.76 Hours
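A Python sketch that strings the three subsystem results together into the overall-system figures; the inputs are the rounded values from the preceding slides, so the last digits may differ slightly:

```python
import math

subsystem_mttf = [693.2, 6.46e9, 60000.0]              # main computer, power, disks
subsystem_avail = [0.997844, 0.99999999994, 0.9999625]

mttf = 1.0 / sum(1.0 / m for m in subsystem_mttf)      # ~685.25 h
a = math.prod(subsystem_avail)                         # ~0.99781
mttr = mttf * (1.0 - a) / a                            # ~1.5 h
mtbf = mttf + mttr                                     # ~686.8 h
r_one_year = math.exp(-2080.0 / mttf)                  # R(2080) ~ 0.048
print(mttf, a, mttr, mtbf, r_one_year)
```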

DS - II - DCMM - 44

IMPORTANT ATTRIBUTES OF COMPUTER SYSTEMS

• DEPENDABILITY (RELIABILITY / AVAILABILITY)
  – Whether or not it works, and for how long?

• PERFORMANCE (Throughput, response time, etc.)
  – Given that it works, how well does it work?

• PERFORMABILITY
  – For degradable systems: performance evaluation of systems subject to failure/repair

• RESPONSIVENESS
  – Does it meet deadlines in the presence of faults?

[Associated tools listed on the slide: HARP, CARE III, SURE, ADVISER, SPADE, ARIES, SURF, SAVE, DEEP, RESQ, KNT MODEL, MEYER]

DS - II - DCMM - 45

DEPENDABILITY EVALUATION

• ANALYTICAL MODELING
  – COMBINATORIAL (ADVISER, CARE, CARE II, SPADE, SURE)
    • FAULT TREES (SHARPE, FaultTree+)
    • RELIABILITY BLOCK DIAGRAMS (SHARPE, SUPER)
  – MARKOV (ARIES, ARM, CARE III, GRAMP, GRAMS, HARP, MARK1, SHARPE, SURE, SURF, SURF II)
  – ESPN (METASAN, SAN)

• ANALYTICAL SIMULATION AND FAULT INJECTION (SAVE, HARP, AvSim+)

• MEASUREMENT, TESTING AND FAULT INJECTION (EXPERIMENTAL)

DS - II - DCMM - 46

RELIABILITY MODELS

• ANALYTICAL
  – Independent failures
  – Constant failure rates
  – Mainly non-repairable or assuming successful fault recovery
  – Block diagrams

• SIMULATION
  – Markov chains model

• PETRI NETS MODEL
  – places, tokens and transitions

[Figure: a two-state Markov model (Good → Bad with transition probability λ(t)Δt) and a four-state Markov model of a two-processor system — state 0: both good; state 1: P1 faulty, P2 fault-free; state 2: P1 fault-free, P2 faulty; state 3: both faulty — with failure rates λ1 and λ2 labeling the transitions.]
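A minimal Python sketch, with hypothetical failure rates and no repair, that integrates the four-state two-processor Markov model above with simple Euler steps and checks it against the closed-form parallel result:

```python
import math

lam1, lam2 = 1e-3, 2e-3        # assumed failure rates of P1 and P2, per hour
dt, t_end = 0.1, 1000.0

# State probabilities: 0 = both good, 1 = P1 faulty, 2 = P2 faulty, 3 = both faulty
p0, p1, p2, p3 = 1.0, 0.0, 0.0, 0.0
t = 0.0
while t < t_end:
    d0 = -(lam1 + lam2) * p0
    d1 = lam1 * p0 - lam2 * p1
    d2 = lam2 * p0 - lam1 * p2
    d3 = lam2 * p1 + lam1 * p2
    p0, p1, p2, p3 = p0 + d0 * dt, p1 + d1 * dt, p2 + d2 * dt, p3 + d3 * dt
    t += dt

print(1.0 - p3)   # probability the system has not yet reached "both faulty"
print(1.0 - (1 - math.exp(-lam1 * t_end)) * (1 - math.exp(-lam2 * t_end)))   # closed form
```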

DS - II - DCMM - 47

EXTENDED STOCHASTIC PETRI NET MODEL - AN EXAMPLE OF A TMR SYSTEM

[Figure (from Johnson and Malek): extended stochastic Petri net model of a TMR system, with places P1-P4 representing the operational processors (initial state), fault handling, processors failed, and system failed; transitions T1-T11; and weights K1 = 2, K2 = 2.]

DS - II - DCMM - 48

FAULT TREES

• Fault tree analysis is an application of deductive logic to produce a fault oriented pictorial diagram which allows one to analyze system safety and reliability. 

• Fault trees may serve as a design aid for identifying the general fault classes. 

• Fault trees were traditionally used in evaluation of hardware reliability but they may help in designing fault-tolerant software and in developing a top-down view of the system. 

• A complex event such as a system failure is successively broken down into simpler events such as subsystem failures, individual component and block failures, down to single-element failures. These simple events are linked together by "and" or "or" Boolean functions.

• The probability of higher level events can be calculated by combining probabilities of the lower level events.

DS - II - DCMM - 49

MARKOV MODELS

ARIES, CARE III, HARP, SAVE, SURE, SURF

• Different Fault Types
  – Transient, Intermittent, Permanent
  – Common-Mode
  – Near-Coincident

• Details of Fault-Handling (Coverage) Behavior
• Dynamic and Static Redundancy
• Hierarchy is difficult to handle
• State Explosion

DS - II - DCMM - 50

FAULT TREE EXAMPLE

[Figure: fault tree for a TMR system — the top event "System Failure" occurs if the voter fails OR any two of the three processors (P1·P2, P1·P3, P2·P3) fail.]

DS - II - DCMM - 51

HANDLING OF STATE SPACE PROBLEM

• BASIC APPROACH:
  – DIVIDE-AND-CONQUER (Decompose and Aggregate)

• STRUCTURAL DECOMPOSITION
  – Consider a System as a Set of Independent Subsystems

• BEHAVIORAL DECOMPOSITION
  – Consider Different Time Constants in Fault-Occurrence and Fault-Handling Processes

• Consider the Use of Different Techniques for Different Submodels

DS - II - DCMM - 52

RELIABILITY TOOLS OBJECTIVES AND TABLE OF COMPARISONS

• FAULT COVERAGE EVALUATION

 • RELIABILITY EVALUATION

• AVAILABILITY EVALUATION

• LIFE CYCLE EVALUATION

• MTTF/MTTR

• SERVICE COST

DS - II - DCMM - 53

EXPERIMENTAL

FAULT INJECTION

FAULT LATENCY

DEPENDABILITY MEASUREMENT

DS - II - DCMM - 54

RELIABILITY TOOLS (1)

Tool: ARIES-82
  Application: repairable/non-repairable systems; temporary and permanent faults
  Input: exponential failure and repair distributions; physical & logical structure needed to build the Markov process
  Models: continuous-state homogeneous Markov process
  Solution: analytical

Tool: CARE III
  Application: non-repairable systems only; temporary and permanent faults
  Input: Weibull or exponential failure distribution; fault tree
  Models: continuous non-homogeneous Markov chain; semi-Markov model
  Solution: analytical

Tool: SAVE
  Application: repairable/non-repairable systems; permanent faults; maintenance strategies
  Input: exponential failure & repair distribution; fault tree; input language to describe a Markov chain
  Models: fault trees; continuous-state homogeneous Markov chain
  Solution: analytical, simulation

Tool: MARK1
  Application: non-repairable systems
  Input: Poisson failure distribution; Markov chain description
  Models: discrete-state homogeneous Markov chain
  Solution: analytical

Tool: SHARPE
  Application: repairable & non-repairable systems
  Input: exponential polynomial distribution; multiple levels of models can be specified
  Models: series-parallel RBD or directed graphs; fault trees; continuous-state homogeneous Markov chain; semi-Markov chain
  Solution: analytical

DS - II - DCMM - 55

RELIABILITY TOOLS (2)

Tool: HARP
  Application: repairable & non-repairable systems; temporary and permanent faults
  Input: any failure distribution; fault tree; Markov process representation in graphic form; ESPN model inputs
  Models: CARE-III Markov models; ESPN; ARIES transient fault recovery model
  Solution: analytical, simulation

Tool: GRAMP & GRAMS
  Application: repairable & non-repairable systems; permanent faults; catastrophic discrete events; maintenance strategies; life cycle costs
  Input: piecewise time-varying failure rates; reliability requirements; input at the module, subsystem and system level; physical & logical structure information; maintenance strategy/costs; removal/shipping costs
  Models: continuous-time homogeneous Markov process; Monte Carlo discrete event digital simulator
  Solution: analytical, simulation

Tool: ARM
  Application: repairable & non-repairable systems; temporary & permanent faults
  Input: PMS structure; system requirements; time-varying failure distribution
  Models: non-homogeneous Markov model
  Solution: analytical

Tool: SURE
  Application: non-repairable systems
  Input: Markov chain description
  Models: semi-Markov
  Solution: analytical

Tool: SUPER
  Application: repairable & non-repairable systems
  Input: RBD
  Models: Markov
  Solution: not mentioned

Tool: SURF
  Application: repairable & non-repairable systems
  Input: transition matrix
  Models: Markov processes with stages and "fictitious" events
  Solution: analytical

Tool: META-SAN
  Application: non-repairable systems
  Input: description of a Stochastic Activity Network
  Models: Stochastic Activity Networks
  Solution: analytical, simulation

RBD - Reliability Block Diagram;  ESPN - Extended Stochastic Petri Nets

DS - II - DCMM - 56

RELIABILITY TOOLS (3)

Tool: SURF II
  Application: repairable & non-repairable systems
  Input: Markov chain description; GSPN
  Models: Markov model; Generalized Stochastic Petri Net (GSPN)
  Solution: analytical

Tool: FAULT TREE+
  Application: repairable & non-repairable systems
  Input: Markov chains; Fussell-Vesely method; CCF analysis with beta factor; other
  Models: fault trees; Markov models
  Solution: analytical

Tool: AvSim+
  Application: repairable systems
  Input: RBD; fault trees; Weibull analysis of datasets
  Models: RBD; fault trees
  Solution: simulation

Tool: FTA
  Application: repairable systems
  Input: OR-Markov chains; K-out-of-N gates; other gates
  Models: fault event trees; RBD; FMECA tree; project tree
  Solution: analytical

Tool: Reliability Workbench
  Application: repairable systems
  Input: Markov chain; other
  Models: fault trees; event trees; Markov model; FMECA
  Solution: analytical

Tool: BELLCORE
  Application: repairable & non-repairable systems
  Input: fault trees; multiple levels of models; redundancy modelling
  Models: parts count procedure; predictions combined with Field Tracking Data; other
  Solution: analytical

DS - II - DCMM - 57

INPUT/OUTPUT FOR MAJORITY OF RELIABILITY TOOLS

[Figure: inputs — system structure, fault classes, failure rates, fault handling procedures, repair procedures, success criteria — feed the model development; the tool (model solution) then produces the dependability evaluation.]

DS - II - DCMM - 58

ERRORS (1)

ERRORS IN DEPENDABILITY EVALUATION

[Figure: the real-world system is abstracted into a model (introducing modeling errors) with input parameters (parametric errors); evaluating and predicting with a solution technique introduces solution errors; the outputs are the evaluations and predictions.]

DS - II - DCMM - 59

ERRORS (2)

A. MODELING ERRORS

1.) STRUCTURAL ERRORS

- Initial state uncertainty

- Missing or extra states

- Missing or extra transition 

2.) ERRORS IN ERROR PROPAGATION MODEL 

3.) PARAMETRIC ERRORS

- Failure and repair rates

- Coverage parameters 

4.) ERRORS DUE TO NON-INDEPENDENCE

B. SOLUTION ERRORS 

1.) APPROXIMATION ERRORS

- Due to system partition and state aggregation 

2.) NUMERICAL ERRORS

- Truncation errors

- Round-off errors 

3.) PROGRAMMING ERRORS

DS - II - DCMM - 60

RELIABILITY PREDICTION MODELS AS USED IN THE REAL WORLD

64K DRAM in plastic — predicted failure rate in FITs (failures per 10^9 hours):

• MIL-HDBK-217D and 217E — most widely used, serves as the base for others (most pessimistic):  60,578

• Recueil de Données de Fiabilité du CNET (Centre National d'Études des Télécommunications):  541

• Reliability Prediction Procedure (RPP), Bellcore:  550

• NIPPON TELEGRAPH and TELEPHONE Standard Reliability Table:  542

• Handbook of Reliability Data HRD3 (BRITISH TELECOM):  10

DS - II - DCMM - 61

MIL-HDBK-217E RELIABILITY MODEL

• The MIL-217 Module is a powerful reliability prediction program based on the internationally recognised method of calculating electronic equipment reliability given in MIL-HDBK-217. This standard uses a series of models for various categories of electronic, electrical and electro-mechanical components to predict failure rates which are affected by environmental conditions, quality levels, stress conditions and various other parameters. These models are fully detailed in MIL-HDBK-217.

• Multi Systems within the same Project
• Transfer to and from any other module
• Linked Blocks represent blocks with identical characteristics
• Redundancy modelling including hot standby
• Mission Phase
• Additional sources:
  – http://www.t-cubed.com/faq_217.htm
  – http://www.relexsoftware.com/

DS - II - DCMM - 62

SUMMARY OF MIL-HDBK-217E RELIABILITY MODEL

FAILURE RATE MODEL AND FACTORS

• The failure rate, λ, in failures per million hours for monolithic MOS and bipolar chips takes the form

  λ = πL πQ (C1 πT πV + C2 πE)

  πL - Learning Factor
  πQ - Quality Factor
  πT - Temperature Acceleration Factor
  πV - Voltage Stress Factor
  πE - Application Environment Factor
  C1, C2 - Technology Constants
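A tiny Python sketch of the MIL-HDBK-217 style calculation above; all factor values below are placeholders for illustration only — the real ones come from the handbook tables:

```python
def mil_hdbk_217_chip_failure_rate(pi_l, pi_q, pi_t, pi_v, pi_e, c1, c2):
    """lambda = piL * piQ * (C1*piT*piV + C2*piE), in failures per 10^6 hours."""
    return pi_l * pi_q * (c1 * pi_t * pi_v + c2 * pi_e)

lam = mil_hdbk_217_chip_failure_rate(pi_l=1.0, pi_q=2.0, pi_t=0.5, pi_v=1.0, pi_e=4.0,
                                     c1=0.02, c2=0.01)
print(lam)          # failures per million hours
print(lam * 1000)   # the same rate expressed in FITs (failures per 10^9 hours)
```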

DS - II - DCMM - 63

Output from Lambda for the SUN-2/50 study (with failure rates in failures per million hours)

(Quantity) Module — Lambda (Single Module) — Lambda (All Module Copies)

SUN               138.7355   100.000
(1) PROCESSOR       4.8390     3.488
(1) PROC. SUPPORT   5.4206     3.907
(1) PROC.PALS       2.1841    40.292
(1) PROC.SSI.MSI    2.7943    51.550
(1) PROC.RC         0.4422     8.157
(1) BOOT.STRAP      9.8182     7.077
(1) BOOT.ROM        6.6852    68.090
(1) BOOT.SSI.MSI    1.3474    13.724
(1) BOOT.PALS       1.7856    18.186
(1) CLOCK.CKTS      2.1691     1.563
(1) VIDEO.LOGIC    16.2676    11.726
(1) ADDR.DECODER    8.4429    51.900
(1) VIDEO.RAM       4.8517    57.465
(1) VBI.SSI.MSI     3.5912    42.535
(1) VMC.CTRL        5.2704    32.398
(1) VMC.PALS        3.1402    59.582
(1) VMC.SSI.MSI     1.2260    23.262
(1) VMC.RC          0.9042    17.156
(1) V.SHIFT.LOGIC   0.4555     2.800
(1) V.BUS.IFACE     2.0988    12.902
(1) VBI.PALS        1.3177    62.785
(1) VBI.SSI.MSI     0.7811    37.215
(1) INTERRUPTS      2.2245     1.603
(1) MEMORY         82.4447    59.426
(1) MM.CONTROL     39.9107    48.409
(1) MMC.SSI.MSI     1.1491     2.879
(1) MMC.RAM        38.7615    97.121
(1) MAIN.MEMORY    40.9333    49.649
(1) MM.SSI.MSI      1.6743     4.090
(1) MM.RC           7.6148    18.603
(1) MM.RAM         31.6442    77.307
(1) DVMA            1.6007     1.942
(1) DVMA.SSI.MSI    0.2830    17.677
(1) DVMA.PALS       1.3177    82.323
(1) SERIAL.IO       8.1328     5.862
(1) SIO.CTRL        3.6153    88.907
(1) SIO.SSI.MSI     0.5234     6.436
(1) SIO.RC          0.3787     4.657
(1) ETHERNET        2.8325     2.042
(1) ETHERNET.CTRL   0.3796    13.400
(1) ENET.SSI.MSI    1.9832    70.017
(1) ETHERNET.RC     0.4697    16.583
(1) VME             4.5865     3.306
(1) VME.SSI.MSI     3.1130    67.874
(1) VME.PALS        1.2995    28.334
(1) VME.RC         10.1739     3.792

DS - II - DCMM - 64

BELLCORE'S RELIABILITY PREDICTION PROGRAM

• The Bellcore Module calculates the reliability prediction of electronic equipment based on the Bellcore (Telcordia) standard TR-332 Issue 6. This standard uses a series of models for various categories of electronic, electrical and electro-mechanical components to predict steady-state failure rates which are affected by environmental conditions, quality levels, electrical stress conditions and various other parameters. These models allow reliability prediction to be performed using three methods.

• Method I: Parts Count procedure
• Method II: Combines Method I predictions with laboratory data
• Method III: Combines Method I predictions with Field Tracking Data
• Multi Systems within the same Project
• Transfer to and from any other module
• Linked Blocks represent blocks with identical characteristics
• Redundancy modelling including hot standby
• Global Editing
• Additional source:
  – http://www.relexsoftware.com/

DS - II - DCMM - 65

BELLCORE'S RELIABILITY PREDICTION PROGRAM

λ = πg · πs · πt · λb

  πg – environmental factor
  πs – quality factor
  πt – temperature factor
  λb – base failure rate, based on the number of transistors

  For a 64K DRAM with πg = πs = πt = 1.0:  λ = 550 FITs

Basic:  λb = (technology factor) · (1 + G)

  G – gate count

DS - II - DCMM - 66

REFERENCES:

• A. M. Johnson and M. Malek, "Survey of Software Tools for Evaluating Reliability, Availability and Serviceability," ACM Computing Surveys, 20 (4), 227-269, December 1988; translated and reprinted in Japanese, Kyoritsu Shuppan Co., Ltd., publisher, 1990.

• R. Geist, M. Smotherman, and M. Brown, "Ultrahigh Reliability Estimates for Systems Exhibiting Globally Time-Dependent Failure Processes", Proceedings of the 19th IEEE International Symposium on Fault-Tolerant Computing (FTCS-19), 152-158, Chicago, IL, June 1989.

• D. Ince (Ed.), Software Quality and Reliability: Tools and Methods, Unicom Applied Information Technology Reports, 1991.