---
title: "Summary of TTK4145"
description: "A lot of theory and discussion, some formulas, spring 2021."
date: 2021-05-04
math: true
---

## Fault tolerance

Faults are hard to capture.


### Bugs

* 1 bug per 50 lines before testing
* 1 bug per 500 lines at release
* 1 bug per 550 lines after a year, where the rate stays roughly constant
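
As a rough worked example, assume a hypothetical program of 10 000 lines and take the rates above at face value. The program size is an assumption chosen only to make the arithmetic easy:

$$
\frac{10\,000}{50} = 200 \text{ bugs before testing}, \qquad
\frac{10\,000}{500} = 20 \text{ bugs at release}, \qquad
\frac{10\,000}{550} \approx 18 \text{ bugs after a year}.
$$

Testing removes most of the bugs, but a year in production removes only a couple more.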

1. Make the program work within specs.
2. Run and test the program.
3. Errors happen
4. Locate errors
    * Incomplete spec
    * Missing handling of some situation
5. Fix code

### Traditional error handling

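{% highlight c %}
#include <errno.h>
#include <stdio.h>

FILE *
openConfigFile(void){
    // fopen needs a mode argument; "r" opens the file for reading
    FILE * f = fopen("/path/to/config.conf", "r");
    if (f == NULL) {
        // errno tells us why fopen failed
        switch(errno){
            case ENOMEM: {
                /* ... */
                break;
            }
            case ENOTDIR: {
                /* ... */
                break;
            }
            // Do this for all errors
        }
    }
    return f;  // NULL if the file could not be opened
}
{% endhighlight %}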

### Causes of errors

* Incomplete specification
* Software bugs
* HW problems
* Communication problems

### Fault tolerance in real time systems

The problem with traditional error handling is that errors can happen at any time.
This is extremely hard to test.

Error handling in real-time systems has these characteristics:

* Unexpected errors must be handled
* Multiple threads handle errors
* Can not test the conventional way
  * Testing can only show the existence of errors
  * Testing can not find errors in the specification
  * Testing can not find race conditions

The fault path is shown below.

![Fault tolerance](figures/fault-path.svg)

With fault tolerance, the path looks more like the figure below.

![Fault tolerance](figures/fault-tolarance.svg)

### Error handling

Keep it simple!

The error modes are part of the module interface.

One way is to handle all errors the same way.
Handle them as if each were the worst possible error:
crash and start again.

A different approach is to check that everything is OK, rather than looking for each specific error.

To test how the system responds to an unknown error, insert a failed acceptance test (a "not OK" signal).
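
The ideas above can be sketched in C. This is only an illustration under stated assumptions: the names `acceptance_test`, `inject_failure`, `checked_result`, and `restart_count`, as well as the valid range 0 to 100, are hypothetical and not taken from the course material.

```c
#include <stdbool.h>

/* Hypothetical fault-injection switch: when set, the next acceptance
 * test reports "not OK" regardless of the value being checked. */
bool inject_failure = false;

/* Counts how often the single, shared error path has been taken. */
int restart_count = 0;

/* Acceptance test: checks only that the result looks OK.
 * It does not try to work out which specific error occurred.
 * The valid range [0, 100] is an assumption for illustration. */
bool acceptance_test(int result)
{
    if (inject_failure) {
        inject_failure = false;   /* one-shot injection */
        return false;
    }
    return result >= 0 && result <= 100;
}

/* All failures are handled the same way. Here the handler only counts
 * the failure and returns a fallback value; a real system might crash
 * and restart the module instead. */
int checked_result(int result, int fallback)
{
    if (!acceptance_test(result)) {
        restart_count++;
        return fallback;
    }
    return result;
}
```

Setting `inject_failure = true` before a call exercises the error path even when the computed value is fine, which is one way to test the response to an error nobody anticipated.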

### Redundancy

* With $N$ copies of the data, the system can tolerate one copy being destroyed.
* Sending $N$ messages, trying $N$ times.

**Static redundancy**

* $N$ active copies. Always send $N$ messages, whether it is necessary or not.
* Detecting errors is not important.
* Handles cosmic rays easily.
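
A minimal sketch of static redundancy, assuming $N = 3$ copies of a 32-bit value and a bitwise majority vote. The function name `vote3` is hypothetical, and voting is one possible way to combine the copies, not necessarily the one used in the course.

```c
#include <stdint.h>

/* Bitwise majority vote over three copies: each output bit takes the
 * value held by at least two of the inputs, so a corrupted copy is
 * masked without ever being detected. */
uint32_t vote3(uint32_t a, uint32_t b, uint32_t c)
{
    return (a & b) | (a & c) | (b & c);
}
```

A single flipped bit in any one copy, for instance from a cosmic ray, does not change the result, and no error detection step is needed.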

**Dynamic redundancy**

* Relies on detecting the error and recovering
  * Resend if timeout and not receiving "ack"
  * Go with default if no messages have been received
* The acceptance test must be good.
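
A minimal sketch of dynamic redundancy in C. All names (`send_fn`, `send_with_retry`, `value_or_default`, `lossy_send`, `drops_left`) are hypothetical. The transport is assumed to be a function that returns `true` when an "ack" arrived before the timeout, and `lossy_send` is only a simulated link for demonstration.

```c
#include <stdbool.h>

/* Hypothetical transport hook: sends the message and returns true
 * if an "ack" arrived before the timeout, false otherwise. */
typedef bool (*send_fn)(int message);

/* Resend until acked, at most max_attempts times. Returns the number
 * of attempts used, or 0 if no ack was ever received. */
int send_with_retry(send_fn send, int message, int max_attempts)
{
    for (int attempt = 1; attempt <= max_attempts; attempt++) {
        if (send(message)) {
            return attempt;
        }
    }
    return 0;
}

/* Receiver side: use the received value if there is one,
 * otherwise go with the default. */
int value_or_default(bool received, int value, int default_value)
{
    return received ? value : default_value;
}

/* Simulated link for demonstration: drops the next drops_left sends. */
int drops_left = 2;

bool lossy_send(int message)
{
    (void)message;
    if (drops_left > 0) {
        drops_left--;
        return false;   /* timeout, no ack */
    }
    return true;        /* ack received */
}
```

With two dropped sends, `send_with_retry(lossy_send, 7, 5)` succeeds on the third attempt. Unlike static redundancy, nothing extra is sent unless the missing ack is detected.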


### Fault model

#### Example with storage functions

**Step 1: Failure modes**

Find the failure modes: What could go wrong?

* **Write**: May return "I failed". Does not know why it failed.
* **Read**: May return "I failed". Does not know why it failed.

**Step 2: Detect, Simplify, Inject errors**

* Write extra information about where, what, and how the process is doing.
* Merge all errors into one failure mode: "Fail"
* Inject errors

**Step 3: Handling with redundancy**

* Have multiple copies of the information
  * Use only the newest
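
The three steps for storage can be sketched as follows. This is an illustration under assumptions: the record layout, the names `record_t`, `store_write`, `store_read`, and `record_checksum`, the choice of three copies, and the toy XOR checksum are all hypothetical. A real system would use a proper CRC and persistent storage.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define N_COPIES 3

typedef struct {
    uint32_t version;   /* incremented on every write */
    int32_t  value;     /* the stored data */
    uint32_t checksum;  /* detects a corrupted copy */
} record_t;

/* Toy checksum for illustration only. The constant makes an
 * all-zero (never written) record count as invalid. */
uint32_t record_checksum(const record_t *r)
{
    return r->version ^ (uint32_t)r->value ^ 0xA5A5A5A5u;
}

/* Write the same record to every copy. */
void store_write(record_t copies[N_COPIES], uint32_t version, int32_t value)
{
    for (size_t i = 0; i < N_COPIES; i++) {
        copies[i].version = version;
        copies[i].value = value;
        copies[i].checksum = record_checksum(&copies[i]);
    }
}

/* Read: every bad copy is simply "failed"; use the newest good one.
 * Returns false only if all copies failed. */
bool store_read(const record_t copies[N_COPIES], int32_t *out)
{
    const record_t *best = NULL;
    for (size_t i = 0; i < N_COPIES; i++) {
        const record_t *r = &copies[i];
        if (r->checksum != record_checksum(r)) {
            continue;   /* merged failure mode: this copy failed */
        }
        if (best == NULL || r->version > best->version) {
            best = r;
        }
    }
    if (best == NULL) {
        return false;
    }
    *out = best->value;
    return true;
}
```

Errors can be injected by overwriting a field of one copy without updating its checksum; `store_read` then skips that copy and still returns the newest valid value.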

#### Example with communication function

**Step 1: Failure modes**

* Message
  * Lost
  * Delayed
  * Corrupted
  * Duplicated
  * Wrong recipient

**Step 2: Detection, merging of error modes, and error injection**

* Adding information to message
  * Checksum
  * Session ID
  * Sequence number
* Adding "ack" on well received messages
* All errors will be treated as "Lost message"
* Injection
  * Occasionally throw away some messages

**Step 3: Handling with redundancy**

* Timeout 
* Retransmit message
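
A receiver-side sketch of the steps above, in C. The message layout, the names `message_t`, `receiver_t`, `message_make`, and `receiver_accept`, and the toy XOR checksum are assumptions for illustration. One design choice here is also an assumption: a duplicate is acked again so the sender stops retransmitting, but it is not delivered a second time. Sequence number wraparound is ignored.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint32_t session_id;  /* detects wrong recipient or stale session */
    uint32_t seq;         /* detects duplicated or out-of-order messages */
    int32_t  payload;
    uint32_t checksum;    /* detects corruption */
} message_t;

typedef enum { MSG_LOST, MSG_DUPLICATE, MSG_DELIVERED } rx_result_t;

typedef struct {
    uint32_t session_id;
    uint32_t next_seq;    /* sequence number expected next */
    unsigned drop_every;  /* error injection: drop every n-th message, 0 = off */
    unsigned seen;
} receiver_t;

/* Toy checksum for illustration only; a real protocol would use a CRC. */
uint32_t message_checksum(const message_t *m)
{
    return m->session_id ^ m->seq ^ (uint32_t)m->payload ^ 0x5A5A5A5Au;
}

message_t message_make(uint32_t session_id, uint32_t seq, int32_t payload)
{
    message_t m = { session_id, seq, payload, 0 };
    m.checksum = message_checksum(&m);
    return m;
}

/* MSG_LOST: send no ack, so the sender times out and retransmits.
 * MSG_DUPLICATE: ack again, but do not deliver.
 * MSG_DELIVERED: ack and deliver. */
rx_result_t receiver_accept(receiver_t *rx, const message_t *m)
{
    rx->seen++;
    if (rx->drop_every != 0 && rx->seen % rx->drop_every == 0) {
        return MSG_LOST;                       /* injected loss */
    }
    if (m->checksum != message_checksum(m)) {
        return MSG_LOST;                       /* corrupted */
    }
    if (m->session_id != rx->session_id) {
        return MSG_LOST;                       /* wrong recipient */
    }
    if (m->seq < rx->next_seq) {
        return MSG_DUPLICATE;                  /* already received */
    }
    if (m->seq > rx->next_seq) {
        return MSG_LOST;                       /* gap: an earlier message is missing */
    }
    rx->next_seq++;
    return MSG_DELIVERED;
}
```

From the sender's point of view there is only one failure mode: no ack before the timeout. The handling is therefore always the same, namely to retransmit the message.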

#### Example with processes and calculations

A calculation is abstract, so how can we talk about its failure modes in general?

**Step 1: Failure modes**

One failure mode

**Step 2: Detect, simplify, inject errors**

All failed acceptance tests will "PANIC" or "STOP".

**Step 3: Handling with redundancy**

There are three solutions:

1. Checkpoint restart
    * Do all the work including the acceptance test
    * Wait with the "side effects"
    * Store a checkpoint
    * Do the "side effects"
2. Process pairs
    * Crash and let another process take over
3. Persistent processes
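
The checkpoint restart steps can be sketched in C as follows. The names `state_t`, `checkpoint`, `run_step`, `state_is_ok`, `fail_at_step`, and `side_effects_done` are hypothetical, the checkpoint is held in an ordinary variable (a real system would keep it in storage that survives a crash), and the side effect is reduced to a counter.

```c
#include <stdbool.h>

typedef struct {
    int step;
    int total;
} state_t;

/* Last known good state. Assumed to survive a restart. */
state_t checkpoint = { 0, 0 };

/* Stands in for real side effects such as output to the environment. */
int side_effects_done = 0;

/* Hypothetical fault injection: fail the acceptance test once at this step. */
int fail_at_step = -1;

/* Acceptance test on the new state. The check total >= 0 is an
 * assumption chosen for illustration. */
bool state_is_ok(const state_t *s)
{
    if (s->step == fail_at_step) {
        fail_at_step = -1;
        return false;
    }
    return s->total >= 0;
}

/* One unit of work. Returns false if the work had to be thrown away,
 * in which case the caller restarts from the unchanged checkpoint. */
bool run_step(int input)
{
    state_t next = checkpoint;      /* always start from the last checkpoint */
    next.step += 1;
    next.total += input;            /* do all the work ... */
    if (!state_is_ok(&next)) {      /* ... including the acceptance test */
        return false;               /* nothing stored, no side effects */
    }
    checkpoint = next;              /* store a checkpoint */
    side_effects_done++;            /* only then do the side effects */
    return true;
}
```

Because the side effects wait until the checkpoint is stored, a failed step leaves both the checkpoint and the outside world untouched, and calling `run_step` again with the same input simply redoes the work.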


## Transactions

A transaction is a design framework for Damage Confinement and Error Recovery.