Module 1: Physical & Operational Architecture Development

Implementing Fault Tolerance in Physical Architecture

Welcome back to another session on system design. In the last lecture we discussed physical architecture development for engineering systems, that is, how we convert a functional architecture into a physical architecture. We first develop a generic physical architecture, in which a generic element is identified for each function, and arrange these elements into a hierarchical generic architecture.

From there we look for alternative options for the generic elements. We use the method of the morphological box to identify the alternatives, and using these alternatives we develop an instantiated architecture, where the element corresponding to each function is converted to a physical element and the details of that element are specified in terms of its components, their make, and their other specifications. When we do this we get an instantiated physical architecture, or more likely multiple architectures depending on the components we select, and based on exit criteria we choose the final architecture for the product or the system.

That is what we saw in the previous lecture. As I mentioned there, we need to incorporate a few more components to make sure that the system has sufficient fault tolerance. Whenever an error develops in the system, we should be able to detect it and identify its source, confine the damage, and make sure that the system continues to work. So, in addition to the normal functions identified through the customer requirements, we need to provide these functions and the corresponding physical elements as well.

In this lecture we will see how to implement these functions through physical elements, in terms of fault-tolerant elements.
(Refer Slide Time: 02:22)
We will look at the implementation of fault tolerance in physical architecture: the methods by which we can do this, and the procedures and processes involved. We discussed the importance of fault tolerance in the last class, along with some case studies.
(Refer Slide Time: 02:39)
One of the case studies showed that the crash of United Airlines Flight 232 was basically due to a lack of fault tolerance, or rather to the presence of a single point of failure in the system.

As we can see in this picture, the hydraulic system failed. There was a single point where the hydraulic lines converged, and it was at this convergence point that the fan disk hit the tail and damaged the hydraulic lines to the control planes. The flight lost control and finally crashed.
(Refer Slide Time: 03:19)
This is the hydraulics diagram. It shows the part that was lost, which was the point from which hydraulic power was supplied to the control planes shown in this view. Since the damage occurred at this single point, the system failed and control of the plane became impossible.

So we need to make sure that there are no such single points of failure in the system, and that we have enough redundancy so that even if one system fails, the others are there to take care of the system functionality. In physical architecture development we will look at these aspects and build sufficient redundancy into the system, in terms of physical elements or software algorithms, to cope with emergency scenarios.
(Refer Slide Time: 04:17)
This is the case study I explained: an aircraft with three engines and three separate hydraulic systems, which failed because of a single point of failure.
(Refer Slide Time: 04:33)
To avoid this we need to provide a number of functions in the fault-tolerant system, including the error detection functions we discussed in functional architecture development. Just to recap the terminology: a failure is a deviation in behavior between the system and its requirements. An error is a subset of the system state which may lead to system failure, and a fault is a defect in the system that can cause an error. Fault tolerance is the ability of a system to tolerate faults and continue performing. Whenever there is a fault, the system should be able to keep performing without its performance being affected.

Fault tolerance can be achieved only for errors that can be observed. Errors that are not observable cannot be tolerated, because the system will not be able to identify them. The functions associated with fault tolerance are error detection, damage confinement, error recovery, and fault isolation and reporting. These are the four functions we need to provide, and we need physical elements to provide them: detecting the error, confining the damage created by the error, recovering from the error, and isolating the fault and reporting it.
(Refer Slide Time: 06:05)
Error detection means defining the possible errors, which are deviations in a subset of the system state from the desired state, in the design phase before they occur, and establishing a set of functions for checking for the occurrence of each error. In error detection we provide functions that continuously monitor the performance, track it against a set value, and report any deviation as an error. The usual error detection functions are type checks, range checks, and timing checks. We will see how these can be implemented at a later stage.

Damage confinement is protecting the system from the possible spread of a failure to other parts of the system. As the name suggests, you confine the damage and protect the other systems from it. The usual way of implementing it is with firewalls, especially in software installations. Firewalls should be provided to protect the system from damage originating in other systems.
(Refer Slide Time: 07:09)
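The type, range, and timing checks named under error detection above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the sensor callable, the limits, and the deadline are all hypothetical and do not come from the lecture.

```python
import time
from typing import Any, Callable, List, Tuple


def type_check(value: Any, expected_type: type) -> bool:
    """Type check: the value has the kind of data we expect."""
    return isinstance(value, expected_type)


def range_check(value: float, low: float, high: float) -> bool:
    """Range check: the value lies inside the preset limits."""
    return low <= value <= high


def timing_check(operation: Callable[[], Any], deadline_s: float) -> Tuple[Any, bool]:
    """Timing check: run the operation and report whether it met its deadline."""
    start = time.monotonic()
    result = operation()
    elapsed = time.monotonic() - start
    return result, elapsed <= deadline_s


def detect_errors(read_sensor: Callable[[], Any], low: float, high: float,
                  deadline_s: float) -> List[str]:
    """Return the names of the checks that failed (an empty list means no error)."""
    errors = []
    value, on_time = timing_check(read_sensor, deadline_s)
    if not on_time:
        errors.append("timing")
    if not type_check(value, float):
        errors.append("type")
    elif not range_check(value, low, high):
        # The range check is only meaningful once the type check has passed.
        errors.append("range")
    return errors
```

For example, `detect_errors(lambda: 250.0, 0.0, 100.0, 1.0)` reports a range error, while a reading of `21.5` under the same limits reports none.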
Error recovery attempts to correct the error after it has been detected and its extent defined. Once that is done, recovery means getting back to the normal mode of operation even though an error has occurred. This can be attempted in two ways: backward recovery and forward recovery. We discussed these methods in the functional decomposition stage, so I am not going into their details here.

The last one is fault isolation and reporting, which attempts to determine where in the system the fault that generated the error occurred. It looks for the fault in the system, traces where the error came from, and accordingly isolates that particular fault.
(Refer Slide Time: 08:09)
Now, how do we actually implement fault tolerance? In physical architecture development that is an important question: how do we implement fault tolerance using physical elements? The primary source of high availability and fault tolerance in any system is redundancy. If you provide redundant components, that ensures the fault tolerance of the system to a great extent. Redundancy can be provided using hardware, software, information, and time, so you can have hardware redundancy, software redundancy, information redundancy, and time redundancy in the system.
Hardware redundancy uses extra hardware to enable the detection of errors and to provide additional operational hardware components after errors have occurred. As the name suggests, there are multiple components for the same function. You will have two or three components that provide the same function, and whenever an error occurs in one of them, it can be detected and the other hardware can take over the function of the failed one.

That is basic hardware redundancy. In the aircraft case, one hydraulic system is sufficient to actuate the control planes, but three are provided, and all three are powered separately, so there are three separate power sources as well. That is redundancy in the hydraulic hardware. But the three redundant units alone are not sufficient; additional hardware is needed because the single point of failure has to be eliminated.

Still, hardware redundancy is one of the primary methods for fault tolerance. It can be implemented in passive, active, and hybrid forms, the hybrid form being a combination of passive and active. We will look at how to implement hardware redundancy in each of these forms.
(Refer Slide Time: 10:26)
Passive hardware redundancy masks or hides the occurrence of errors rather than detecting them. Recovery is achieved by having extra hardware available when needed.

So passive hardware redundancy does not really detect the error; it masks it. Whenever an error occurs, the system automatically hides it and functions as if nothing has happened. If one of the redundant hardware units fails, the others automatically take over, without the system being informed, or even knowing, that there is an error. Since additional hardware is available, recovery is automatic: the other hardware takes over the function and gives you the normal functioning of the system.

The most common implementation of passive hardware redundancy is triple modular redundancy (TMR). It relies on a majority voting scheme to mask an error in one of three hardware units. The diagram for a typical triple modular redundancy is shown here. There are three components: component 1, component 2, and component 3. All are identical, and all receive the same input from the previous system or subsystem.

Each component processes the input and produces an output. In the normal situation all three provide identical outputs, and all the outputs go to a voter. The voter decides whether all three are equal, two are equal, or all are different, and an output is generated on that basis. If output 1 differs from 2 and 3, the output from 2 is taken; if 1 and 2 are identical, the output from 1 or 2 is given as the output and 3 is disconnected.

So if there is one error, it can easily be masked by this system. If the output of one component is not equal to the other two, that is an error, but without even declaring it, the voter goes by the other two outputs and gives an output to the system. It masks the error in one of the components and continues to provide the output. That is triple modular redundancy.

One problem is that it can mask only one error. If two of the components are in error, for example when all three outputs are different, the error cannot be masked and there is a failure. The other problem is the voter, which is a single voter. If the voter fails, the whole arrangement fails, so the voter becomes a single point of failure.
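The majority voting just described can be written as a small function. What follows is a minimal Python sketch, assuming the component outputs can be compared for exact equality; the name `tmr_vote` and the use of `None` to signal "no majority" are choices made for this illustration, not part of the lecture.

```python
from typing import Optional, TypeVar

T = TypeVar("T")


def tmr_vote(a: T, b: T, c: T) -> Optional[T]:
    """Return the value that at least two of the three outputs agree on.

    A single deviating output is masked silently, as in passive redundancy.
    None is returned when all three outputs differ, i.e. when there is no
    majority and the error cannot be masked.
    """
    if a == b or a == c:
        return a
    if b == c:
        return b
    return None
```

With one faulty component, `tmr_vote(9, 5, 5)` still yields `5`. Note that the voter only counts agreement: if two components fail in the same way, as in `tmr_vote(9, 9, 5)`, the wrong value wins the vote, which is the two-error limitation mentioned above.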
So triple modular redundancy is a basic building block for hardware redundancy, but on its own it is of limited value, because it can mask only one error and cannot tolerate an error in the voter, which is a single point of failure. To overcome this, another scheme has been proposed, known as triplicated TMR.
(Refer Slide Time: 14:35)
As the name suggests, triplicated TMR has three voters. We take triple modular redundancy as the building block and triplicate the voter.

With three voters, the single point of failure is eliminated. The voters produce output 1, output 2, and output 3. As we can see, the output from component 1 is given to all three voters, and likewise the outputs of component 2 and component 3. Each voter compares its inputs and generates an output: if the outputs from 2 and 3 are the same and 1 is not correct, the output from 2 is passed on.

In this way all three voters provide an output, so even if one voter is not performing well, the other two outputs will be the same. These outputs are then connected to another TMR stage like the previous one, and finally there is a single voter that gives the output. The first level of single point failure is eliminated by providing three voters, and connecting these to a further TMR stage gives another voted output. That is how a triplicated TMR lets us tolerate an error in a voter, and it is one way of improving the performance of triple modular redundancy.

The other problem with TMR, as I mentioned, is that it can mask only a single error. If you want to mask more than one error, you need to increase the number of components. Triple modular redundancy means three components and one voter. Instead of triple we can have N-modular redundancy (NMR). With 5-modular redundancy there are five components and it can mask two errors; with seven it can mask three errors, and so on. To mask more errors we need more components. That is how passive hardware redundancy is implemented to hide errors and recover from them.

The voter is one of the important elements in TMR, and here we have the issues of synchronization and computation time. If the voter is not properly synchronized, or the computation times of the components differ, there is a possibility of error. These need to be taken into account when implementing voters. A voter can be implemented either in hardware or in software, and a hardware implementation is somewhat costly. So most of the time the voter is implemented in software, and a software implementation of a voter is shown here.
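As an aside on N-modular redundancy: the numbers quoted above (three components mask one error, five mask two, seven mask three) are consistent with a strict-majority rule, under which n components mask (n - 1) // 2 errors. The Python sketch below assumes that rule and assumes hashable, exactly comparable outputs; `maskable_errors` and `nmr_vote` are names invented for this illustration.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence


def maskable_errors(n: int) -> int:
    """Number of faulty components an n-way strict-majority voter can mask."""
    return (n - 1) // 2


def nmr_vote(outputs: Sequence[Hashable]) -> Optional[Hashable]:
    """Return the strict-majority value among the outputs, or None if there is none."""
    if not outputs:
        return None
    value, count = Counter(outputs).most_common(1)[0]
    # A strict majority needs more than half of the components to agree.
    return value if count > len(outputs) // 2 else None
```

With five components, `nmr_vote([4, 4, 4, 0, 9])` still returns `4` despite two errors, whereas a third error leaves no majority and the function returns `None`.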
(Refer Slide Time: 17:57)
We take the inputs from the different sources, pass each through a sampler, and write it to a two-port memory that feeds the processors.

There are three processors, and all the memories are connected to all of them, so each processor gets three inputs. Each processor compares the values and sends an output to the next two-port memory. That is how voting is implemented in triplicated TMR.

This is very important in implementing redundancy, because the voter is one of the most critical items in passive hardware redundancy. Most of the time it is implemented in software, because that is easier and cheaper. But of course there is a possibility of failure because of computation time and synchronization, and that has to be taken care of when implementing voters in triplicated TMR and in other modular redundancy schemes.

That was passive hardware redundancy. The next one is active hardware redundancy.
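The data flow of the triplicated voter stage, where every voter sees all three component outputs and a later stage votes on the voters' results, can be mimicked in software. The Python sketch below is a rough model only: it ignores the samplers, the two-port memories, and the synchronization issues mentioned above, and `majority`, `triplicated_stage`, and the deliberately faulty `stuck_voter` are hypothetical names.

```python
from typing import Callable, List, Optional, Sequence

Voter = Callable[[Sequence[int]], Optional[int]]


def majority(values: Sequence[int]) -> Optional[int]:
    """Return the value held by more than half of the inputs, or None."""
    for v in values:
        if sum(1 for other in values if other == v) * 2 > len(values):
            return v
    return None


def stuck_voter(_values: Sequence[int]) -> Optional[int]:
    """A hypothetical faulty voter that always emits the same wrong value."""
    return -1


def triplicated_stage(component_outputs: Sequence[int],
                      voters: Sequence[Voter]) -> List[Optional[int]]:
    """Feed all component outputs to every voter; return one result per voter."""
    return [vote(component_outputs) for vote in voters]


# One faulty component (output 3) and one faulty voter in the same stage.
stage_outputs = triplicated_stage([7, 7, 3], [majority, majority, stuck_voter])
# The next stage votes on the three voter outputs, masking the faulty voter.
final_output = majority(stage_outputs)
```

Here the two healthy voters both mask the faulty component, and the following vote masks the stuck voter, so `final_output` is `7`.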
(Refer Slide Time: 19:13)
Compared to passive hardware redundancy, in active hardware redundancy we try to identify the location of the error, and we have to perform all the functions of fault tolerance. In passive hardware redundancy we do not declare an error, or even know that one has happened, because the error is simply hidden. In active hardware redundancy, instead of hiding the error, we find its source and then carry out the other operations such as damage confinement and reporting.

That is the basic difference between passive and active hardware redundancy. In active hardware redundancy we identify the source of the error, declare it, report it, and then mask it; the other operations such as damage confinement and recovery are carried out once the error has been declared and its source identified. It performs all four functions, i.e. detect errors, confine damage, recover from errors, and isolate and report faults.

There are different methods for doing this. Hardware duplication with comparison is a must for implementing active hardware redundancy, so it becomes one of its basic building blocks. The methods built on it are hot standby sparing, cold standby sparing, and pair-and-a-spare. Let us first look at hardware duplication with comparison.
(Refer Slide Time: 21:11)
As I mentioned, this is a basic building block for active redundancy, and it is needed for any implementation of active hardware redundancy. Here we have n redundant components: component 1, 2, 3, etcetera, as many as the requirement demands. They receive the same input, and their outputs are compared in a comparator that has been set with a predefined error value.

The output of component 1 is compared, and the output of component 2 is compared as well. If an output does not agree with the predefined value, an error is declared: component 1, say, is not in line with the expected output, so there is an error in this component. The output of component 1 is then discarded and the output is taken from component 2 instead. Similarly, with n components, if there is an error in 2, another one is taken.

So we have duplicate components and a comparison of their outputs. The comparison helps to declare the error and to identify where it has occurred, and that is how error detection is made possible in hardware duplication with comparison. That is the building block for any type of active hardware redundancy.
(Refer Slide Time: 22:57)
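Duplication with comparison can be sketched for the simplest case of two duplicate components. The Python sketch below assumes numeric outputs and a comparator preset with a tolerance; `duplicate_and_compare`, `ComparisonResult`, and the tolerance parameter are hypothetical. Unlike the comparator described in the lecture, this two-way version only declares that the duplicates disagree; deciding which component is at fault would need a further check, such as a reference value or a third component.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ComparisonResult:
    output: Optional[float]  # None when an error has been declared
    error: bool              # True when the duplicates disagree
    difference: float        # absolute difference seen by the comparator


def duplicate_and_compare(component_a: Callable[[float], float],
                          component_b: Callable[[float], float],
                          x: float,
                          tolerance: float) -> ComparisonResult:
    """Run two duplicate components on the same input and compare their outputs."""
    out_a = component_a(x)
    out_b = component_b(x)
    difference = abs(out_a - out_b)
    if difference > tolerance:
        # Disagreement beyond the preset tolerance: declare an error, emit nothing.
        return ComparisonResult(None, True, difference)
    return ComparisonResult(out_a, False, difference)
```

Two healthy copies of `lambda x: 2.0 * x` agree and the output passes through; if one copy drifts by 0.5 and the tolerance is 0.1, the comparator declares an error instead of forwarding either value.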
Hot standby sparing and cold standby sparing are the two important methods in active hardware redundancy, and they are the most commonly implemented. Again, the building block is hardware duplication with comparison.

So we take those basic blocks and implement hot standby sparing and cold standby sparing on top of them. Let us see how. This is how hot standby sparing is implemented. As you can see, the basic building block is a component together with its error detection. We have components 1, 2, etcetera up to n, and each one has an error detection function as well.
All these components are the same; that is, we have duplicate or redundant hardware that provides the same output from the same input. Then there is an error detection function for each component. Its role is to look at the output of the component and compare it with a predefined or set value.

The detection algorithm has a predefined value, such as an expected temperature, an expected pressure, or an expected processing time, and it checks whether the output of component 1 matches it. If it does not, an error is declared. Similarly, component 2 has its own error detection function, and whenever there is a variation from the desired output it declares an error. In the same way we have n components with their error detection functions, and these are used for detecting an error in the system. All the outputs from the components and their error detection functions are sent to an n-to-one switch.

Only one output ever leaves the switch. The switch looks at the outputs of all the components and error detection functions. If the error detection for component 1 declares no error, the output of component 1 is given as the final output of the switch. If it declares an error, that output is discarded and the switch immediately moves to the second output. So the switch selects among the component outputs: based on the error detection, it decides which output to send on.

That is the n-to-one switch. Whenever an error is detected in one of the components, that component is declared faulty and the data is stored in the n-to-one switch. The other fault tolerance functions, such as damage confinement, are carried out later, once the error has been declared; the error detection data is sent to the next level for those functions. That is how standby sparing works.

Then what is the difference between hot standby and cold standby? In hot standby, all the components are always active.
Here, components 1, 2, and 3 all have the same function and all are active. All the components are in active mode, in a hot state, which is why it is known as hot standby. Whenever an error is detected, the output is immediately taken from another component. There is no time delay, because all the outputs are available at any time; whenever an error is detected in one of them, the switch immediately changes over and the output from the next component is automatically chosen and given as the output.

Since there is no delay, we always get an output from the system. When the system is very critical and cannot tolerate a delay or a stoppage of output even for a short duration, we need to go for hot standby. You can compare it with a computer that has a UPS as well as a mains power supply: most of the time the UPS is in ready mode to supply power, so it is in active mode. Or consider the battery of a laptop alongside its power supply.
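The hot standby arrangement with per-component error detection and an n-to-one switch can be sketched as follows. This is a minimal Python illustration under assumptions: components are modeled as functions of one input, each error detector is a function that returns True when it declares an error, and the names `hot_standby` and `out_of_range` are hypothetical.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Component = Callable[[float], float]
Detector = Callable[[float], bool]  # True means "error declared"


def hot_standby(components: Sequence[Component],
                detectors: Sequence[Detector],
                x: float) -> Tuple[Optional[float], List[int]]:
    """Model an n-to-one switch over always-active components.

    Returns the selected output (None if every component is in error)
    and the indices of the components declared faulty.
    """
    # Hot standby: every component is active and computes on every cycle,
    # so a replacement output is already available when a fault is declared.
    outputs = [component(x) for component in components]
    flags = [detect(out) for detect, out in zip(detectors, outputs)]
    faulty = [i for i, bad in enumerate(flags) if bad]
    # The switch forwards the first output whose detector declared no error.
    for out, bad in zip(outputs, flags):
        if not bad:
            return out, faulty
    return None, faulty


def out_of_range(value: float) -> bool:
    """Hypothetical detector: declare an error outside the range 0 to 100."""
    return not (0.0 <= value <= 100.0)
```

If the first of three components returns `-999.0` while the others compute `2 * x`, then for `x = 10.0` the switch forwards `20.0` and reports component 0 as faulty, which is the information the later fault isolation and reporting functions would use.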