---
title: "Summary of TTK4145"
description: "Lots of theory and discussion, some formulas, spring 2021."
date: 2021-05-04
math: true
---

Fault tolerance

Faults are hard to catch.

Bugs

  • 1 bug per 50 lines before testing
  • 1 bug per 500 lines at release
  • 1 bug per 550 lines after a year, after which the rate stays roughly constant

  1. Make the program work within the specification.
  2. Run and test the program.
  3. Errors happen.
  4. Locate the errors:
    • Incomplete specification
    • Missing handling of some situation
  5. Fix the code.

Traditional error handling

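{% highlight c %}
#include <errno.h>
#include <stdio.h>

FILE *openConfigFile(void) {
    // fopen needs a mode argument; "r" opens the file for reading
    FILE *f = fopen("/path/to/config.conf", "r");
    if (f == NULL) {
        switch (errno) {
        case ENOMEM: {
            ...
            break;
        }
        case ENOTDIR: {
            ...
            break;
        }
        // Do this for all errors
        }
    }
    return f;  // NULL if the file could not be opened
}
{% endhighlight %}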

Causes of errors

  • Incomplete specification
  • Software bugs
  • HW problems
  • Communication problems

Fault tolerance in real time systems

The problem with traditional error handling is that errors can happen at any time, which is extremely hard to test.

Error handling in real-time programming has to cope with the following:

  • Unexpected errors must be handled
  • Several threads handle errors
  • The conventional way of testing does not work
    • Testing can only show the existence of errors, not their absence
    • Testing cannot find errors in the specification
    • Testing cannot find race conditions

The fault path is shown in the figure below.

[Figure: Fault tolerance]

With fault tolerance, the path looks more like the figure below.

[Figure: Fault tolerance]

Error handling

Keep it simple!

The error modes are part of the module interface.

One way is to handle all errors the same way: treat every error as if it were the worst possible one, crash, and start again.

A different approach is to check that everything is OK.

One way to test how the system responds to an unknown error is to inject a failed acceptance test (a "not OK" signal).
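The ideas above can be sketched in C as follows. This is only an illustration under stated assumptions: the names `acceptance_test`, `inject_failure`, `fail_stop` and `checked_percentage`, and the valid range 0 to 100, are made up for the example, and it is assumed that some supervisor restarts the program after it exits.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical test hook: when set, the next acceptance test is forced to fail. */
bool inject_failure = false;

/* Acceptance test: checks that the result is OK, not why it might be wrong.
   The valid range 0..100 is an assumption made for this example. */
bool acceptance_test(int result) {
    if (inject_failure) {
        inject_failure = false;  /* one-shot injection of a "not OK" signal */
        return false;
    }
    return result >= 0 && result <= 100;
}

/* Single error mode: every failure is handled as the worst case. */
void fail_stop(const char *what) {
    assert(what != NULL);
    fprintf(stderr, "acceptance test failed in %s, restarting\n", what);
    exit(EXIT_FAILURE);  /* a supervisor is assumed to start the program again */
}

/* Example computation guarded by the acceptance test. */
int checked_percentage(int part, int whole) {
    int result = (whole != 0) ? (100 * part) / whole : -1;
    if (!acceptance_test(result)) {
        fail_stop("checked_percentage");
    }
    return result;
}
```

Setting `inject_failure = true` before a call exercises the crash-and-restart path without having to provoke a real error.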

Redundancy

  • With N copies of the data, the system can tolerate one of them being destroyed.
  • Sending N messages, trying N times.

Static redundancy

  • N active copies. N messages are sent whether it is necessary or not.
  • Detecting errors is not important.
  • Handles cosmic rays easily.
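A minimal sketch of static redundancy, assuming N = 3 copies and a bitwise majority vote (a common technique, not necessarily the one used in the course). The function names `vote3` and `read_voted` are hypothetical. Note that the vote always runs and never tries to detect which copy is wrong.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bitwise majority vote: each output bit takes the value that at least
   two of the three copies agree on. */
uint8_t vote3(uint8_t a, uint8_t b, uint8_t c) {
    return (uint8_t)((a & b) | (a & c) | (b & c));
}

/* Reads a buffer that is stored as three active copies.
   There is no error detection step; the vote masks a fault in one copy. */
void read_voted(const uint8_t *c0, const uint8_t *c1, const uint8_t *c2,
                uint8_t *out, size_t len) {
    assert(c0 != NULL && c1 != NULL && c2 != NULL && out != NULL);
    for (size_t i = 0; i < len; i++) {
        out[i] = vote3(c0[i], c1[i], c2[i]);
    }
}
```

A single flipped bit in one copy is outvoted by the other two, which is why this style copes well with transient faults.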

Dynamic redundancy

  • Relies on detecting the error and recovering
    • Resend on timeout if no "ack" is received
    • Go with a default if no messages have been received
  • The acceptance test must be good.
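Dynamic redundancy could look roughly like the sketch below: try again when an attempt is detected as failed, and fall back to a default value when every attempt fails. The type `try_fn` and the functions `get_with_retry` and `flaky_sensor` are hypothetical names; in a real system a failed attempt would correspond to a timeout, a missing "ack" or a failed acceptance test.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical operation type: returns true and fills *out on success,
   false on any detected failure. */
typedef bool (*try_fn)(void *ctx, int *out);

/* Try up to max_attempts times; if every attempt fails, go with the default. */
int get_with_retry(try_fn attempt, void *ctx, int max_attempts,
                   int default_value, bool *used_default) {
    assert(attempt != NULL && max_attempts > 0);
    for (int i = 0; i < max_attempts; i++) {
        int value;
        if (attempt(ctx, &value)) {
            if (used_default != NULL) *used_default = false;
            return value;
        }
    }
    if (used_default != NULL) *used_default = true;
    return default_value;
}

/* Example source for experiments: fails *ctx times, then returns 42. */
bool flaky_sensor(void *ctx, int *out) {
    int *failures_left = ctx;
    if (*failures_left > 0) {
        (*failures_left)--;
        return false;
    }
    *out = 42;
    return true;
}
```

Everything depends on the attempt reporting failure correctly, which is the reason the acceptance test has to be good.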

Fault model

Example with storage functions.

Step 1: Failure modes

Find the failure modes: What could go wrong?

  • Write: May return "I failed". Does not know why it failed.
  • Read: May return "I failed". Does not know why it failed.

Step 2: Detect, Simplify, Inject errors

  • Store information about where the process is, what it is doing, and how it is going.
  • All errors --> Fail
  • Inject errors

Step 3: Handling with redundancy

  • Have multiple copies of the information
    • Use only the newest
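The three steps for storage can be sketched as below, assuming three in-memory copies, a version number to find the newest copy, and a toy checksum for detection. The names `Record`, `store_write` and `store_read` are hypothetical, and a real system would write to separate disks and use a proper CRC.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define COPIES 3

typedef struct {
    uint32_t version;   /* increases with every write */
    int32_t  value;
    uint32_t checksum;  /* detects corrupted or never-written records */
} Record;

/* Toy checksum for illustration only. */
uint32_t record_checksum(const Record *r) {
    return r->version * 31u + (uint32_t)r->value + 0xA5A5A5A5u;
}

/* Merged failure mode: any problem with a copy means that copy "failed". */
bool record_ok(const Record *r) {
    return r->checksum == record_checksum(r);
}

/* Writes the same record to every copy. */
void store_write(Record copies[COPIES], uint32_t version, int32_t value) {
    for (size_t i = 0; i < COPIES; i++) {
        copies[i].version = version;
        copies[i].value = value;
        copies[i].checksum = record_checksum(&copies[i]);
    }
}

/* Returns true and the newest valid value, or false if every copy failed. */
bool store_read(const Record copies[COPIES], int32_t *value) {
    assert(value != NULL);
    const Record *best = NULL;
    for (size_t i = 0; i < COPIES; i++) {
        if (!record_ok(&copies[i])) continue;
        if (best == NULL || copies[i].version > best->version) {
            best = &copies[i];
        }
    }
    if (best == NULL) return false;
    *value = best->value;
    return true;
}
```

Errors can be injected by corrupting a field of one copy after a write; `store_read` should then skip that copy and still return the newest valid value.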

Example with communication function

Step 1: Failure modes

  • Message
    • Lost
    • Delayed
    • Corrupted
    • Duplicated
    • Wrong recipient

Step 2: Detection, merging of error modes, and error injection

  • Adding information to message
    • Checksum
    • Session ID
    • Sequence number
  • Adding "ack" on well-received messages
  • All errors will be treated as "Lost message"
  • Injection
    • Occasionally throw away some messages

Step 3: Handling with redundancy

  • Timeout
  • Retransmit message
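A minimal sketch of the communication example, assuming a simulated channel instead of a real network. All names (`Message`, `Receiver`, `Channel`, `receiver_handle`, `send_reliable`) and the toy checksum are hypothetical. The receiver silently drops anything invalid, so the sender only ever sees one failure mode, a missing "ack", and handles it by retransmitting.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint32_t session_id;  /* detects wrong recipient or old session */
    uint32_t seq;         /* detects duplicates and reordering */
    int32_t  payload;
    uint32_t checksum;    /* detects corruption */
} Message;

typedef struct {
    uint32_t session_id;
    uint32_t next_seq;    /* sequence number expected next */
    int32_t  last_payload;
} Receiver;

/* Toy checksum for illustration only. */
uint32_t msg_checksum(const Message *m) {
    return (m->session_id * 31u + m->seq) * 31u + (uint32_t)m->payload;
}

Message make_message(uint32_t session_id, uint32_t seq, int32_t payload) {
    Message m = { session_id, seq, payload, 0 };
    m.checksum = msg_checksum(&m);
    return m;
}

/* Returns true if an "ack" should be sent back. */
bool receiver_handle(Receiver *r, const Message *m) {
    if (m->checksum != msg_checksum(m)) return false;  /* corrupted */
    if (m->session_id != r->session_id) return false;  /* wrong recipient */
    if (r->next_seq > 0 && m->seq == r->next_seq - 1) {
        return true;  /* duplicate: ack again, but do not deliver twice */
    }
    if (m->seq != r->next_seq) return false;           /* out of order */
    r->last_payload = m->payload;
    r->next_seq++;
    return true;
}

/* Error injection: the channel throws away every drop_every-th message. */
typedef struct { unsigned sent; unsigned drop_every; } Channel;

bool channel_deliver(Channel *c) {
    c->sent++;
    return c->drop_every == 0 || c->sent % c->drop_every != 0;
}

/* Sends one message and retransmits while no ack arrives. A real sender
   would wait for a timeout between tries. Returns the number of
   transmissions used, or 0 if it gave up. */
unsigned send_reliable(Channel *c, Receiver *r, uint32_t session_id,
                       uint32_t seq, int32_t payload, unsigned max_tries) {
    assert(max_tries > 0);
    Message m = make_message(session_id, seq, payload);
    for (unsigned attempt = 1; attempt <= max_tries; attempt++) {
        if (channel_deliver(c) && receiver_handle(r, &m)) return attempt;
    }
    return 0;
}
```

Re-acknowledging a duplicate matters because the original "ack" may itself have been lost, in which case the sender retransmits a message the receiver already has.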

Example with processes and calculations

A calculation is abstract, so how can we talk about its failure modes in general?

Step 1: Failure modes

One failure mode

Step 2: Detect, simplify, inject errors

All failed acceptance tests will "PANIC" or "STOP".

Step 3: Handling with redundancy

There are three solutions:

  1. Checkpoint restart
    • Do all the work including the acceptance test
    • Wait with the "side effects"
    • Store a checkpoint
    • Do the "side effects"
  2. Process pairs
    • Crash and let another process take over
  3. Persistent processes
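Checkpoint restart, the first of the three solutions, might be sketched like this. It is an illustration only: the `State` type, the acceptance test (the total must never be negative) and the in-memory `checkpoint` variable are assumptions, and a real system would store the checkpoint somewhere that survives a crash.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct { int step; int total; } State;

/* Stand-in for persistent storage (assumption for this example). */
State checkpoint = { 0, 0 };

/* Stand-in for a side effect: records what would have been output. */
int last_output = 0;

/* Hypothetical acceptance test: the running total may never be negative. */
bool acceptance_test(const State *s) {
    return s->total >= 0;
}

void do_side_effects(const State *s) {
    last_output = s->total;
}

/* One unit of work. Returns false if the acceptance test failed, in which
   case nothing has been stored or output and the caller should restart. */
bool run_step(State *s, int input) {
    assert(s != NULL);
    State next = *s;             /* do all the work on a copy */
    next.step += 1;
    next.total += input;
    if (!acceptance_test(&next)) return false;
    checkpoint = next;           /* store a checkpoint */
    do_side_effects(&next);      /* only now do the side effects */
    *s = next;
    return true;
}

/* Recovery: throw away the current state and resume from the checkpoint. */
State restart_from_checkpoint(void) {
    return checkpoint;
}
```

Because the side effects are delayed until after the checkpoint is stored, a failed step leaves no trace on the outside and the process can simply continue from the last good state.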

Transactions

A transaction is a design framework for Damage Confinement and Error Recovery.

  • An atomic action, but with backward error recovery as the standard error mode
  • An indivisible and instantaneous "calculation" as seen from the outside
  • A transformation from one consistent state to another
  • A modular computation

Four features: ACID

  • Atomicity: Either all side effects happen or none
  • Consistency: Leaves the system in a consistent state when finished
  • Isolation: Errors do not spread
  • Durability: Results are not lost
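Atomicity, consistency and isolation can be illustrated with a small single-threaded sketch. The `Bank` type, the `transfer` function and the consistency rule (no negative balances) are assumptions made for the example. Durability is not shown, since it would require writing the committed state to stable storage, and a concurrent version would also need locking.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define ACCOUNTS 3

typedef struct { long balance[ACCOUNTS]; } Bank;

/* Consistency rule assumed for this example: no balance is negative. */
bool consistent(const Bank *b) {
    for (size_t i = 0; i < ACCOUNTS; i++) {
        if (b->balance[i] < 0) return false;
    }
    return true;
}

/* Moves amount between two accounts as one transaction: either both
   balances change, or the bank is left exactly as it was. */
bool transfer(Bank *bank, size_t from, size_t to, long amount) {
    assert(bank != NULL && from < ACCOUNTS && to < ACCOUNTS);
    Bank work = *bank;            /* work on a private copy */
    work.balance[from] -= amount;
    work.balance[to]   += amount;
    if (amount < 0 || !consistent(&work)) {
        return false;             /* abort: backward recovery, nothing changed */
    }
    *bank = work;                 /* commit: all side effects at once */
    return true;
}
```

An aborted transfer never exposes a half-updated state, so an error inside the transaction cannot spread to the rest of the system.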