---
title: "Summary of TTK4145"
description: "Lots of theory and discussion, some formulas, spring 2021."
date: 2021-05-04
math: true
---

## Fault tolerance

Faults are hard to capture.

### Bugs

* 1 bug per 50 lines before testing
* 1 bug per 500 lines at release
* 1 bug per 550 lines after a year, and roughly constant from then on

1. Make the program work within specs.
2. Run and test the program.
3. Errors happen.
4. Locate the errors.
    * Incomplete spec
    * Missing handling of some situation
5. Fix the code.

### Traditional error handling

{% highlight c %}
#include <errno.h>
#include <stdio.h>

FILE *
openConfigFile(){
    // fopen needs a mode argument; "r" opens the file for reading
    FILE * f = fopen("/path/to/config.conf", "r");
    if (f == NULL) {
        // errno tells us why the call failed
        switch(errno){
            case ENOMEM: {
                ...
                break;
            }
            case ENOTDIR: {
                ...
                break;
            }
            // Do this for all errors
        }
    }
    return f;
}
{% endhighlight %}

### Causes of errors

* Incomplete specification
* Software bugs
* Hardware problems
* Communication problems

### Fault tolerance in real-time systems

The problem with traditional error handling is that errors can happen at any time.
This is extremely hard to test.

Error handling in real-time programming has to cope with the following:

* Handling of unexpected errors
* Multiple threads handle errors
* Can not test the conventional way
    * Can only show existence of errors
    * Can not find errors in specification
    * Can not find race conditions

The fault path is shown below.

![Fault tolerance](figures/fault-path.svg)

With fault tolerance, the path looks more like the figure below.

![Fault tolerance](figures/fault-tolarance.svg)

### Error handling

Keep it simple!

The error modes are part of the module interface.

One way is to handle all errors the same way.
Handle each of them as if it were the worst error.
Crash and start again.

A different approach is to check that everything is OK.

To test how the system responds to an unknown error, inject a failed acceptance test (a "not OK" signal).

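The idea of checking that everything is OK, and of injecting a failed acceptance test, can be sketched in C as follows. This is only an illustration: `acceptance_test`, `inject_failure`, `checked_reference` and the valid range of 0 to 100 are assumptions made for the example, not part of the course material. Every failure, injected or real, takes the same path: count a restart and fall back to a safe default.

```c
#include <stdbool.h>

/* Hypothetical injection switch: when set, the next acceptance test fails. */
static bool inject_failure = false;

/* Stand-in for "crash and start again": we only count the restarts here. */
static int restart_count = 0;

/* Hypothetical acceptance test: a pump reference must lie in [0, 100]. */
static bool acceptance_test(double reference) {
    if (inject_failure) {
        inject_failure = false;   /* inject exactly one "not OK" signal */
        return false;
    }
    return reference >= 0.0 && reference <= 100.0;
}

/* All errors are handled the same way: no diagnosis, just restart
 * and continue with a safe default value. */
static double checked_reference(double computed) {
    if (!acceptance_test(computed)) {
        restart_count++;
        return 0.0;
    }
    return computed;
}
```

Setting `inject_failure = true` before a call to `checked_reference` exercises the same recovery path as a genuinely bad value, which is the point of injecting the error.
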
### Redundancy

* If I have $N$ copies of my data, it is possible to handle that one is destroyed.
* Sending $N$ messages, trying $N$ times.

**Static redundancy**

* $N$ active copies. Sending $N$ messages whether it is necessary or not.
* Detecting errors is not important.
* Handles cosmic rays easily.

**Dynamic redundancy**

* Relies on detecting the error and recovering
* Resend if timeout and not receiving "ack"
* Go with default if no messages have been received
* The acceptance test must be good.

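Static redundancy can be illustrated with majority voting over $N$ copies. The sketch below is a minimal example under assumptions: the function name `majority_vote` and the `fallback` parameter are made up for illustration. All copies are always read, and a single corrupted copy is masked without ever being detected as an error.

```c
#include <stddef.h>

/* Returns the value held by a strict majority of the n copies,
 * or fallback if no value has a majority. O(n^2), fine for small n. */
static int majority_vote(const int *copies, size_t n, int fallback) {
    for (size_t i = 0; i < n; i++) {
        size_t votes = 0;
        for (size_t j = 0; j < n; j++) {
            if (copies[j] == copies[i]) {
                votes++;
            }
        }
        if (2 * votes > n) {
            return copies[i];
        }
    }
    return fallback;
}
```

With three copies, one flipped value is outvoted by the other two. With no majority at all, the caller gets the fallback and has to decide what to do.
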
### Fault model

#### Example with storage functions

**Step 1: Failure modes**

Find the failure modes: What could go wrong?

* **Write**: May return "I failed". Does not know why it failed.
* **Read**: May return "I failed". Does not know why it failed.

**Step 2: Detect, simplify, inject errors**

* Write information on where/what/how the process is doing.
* All errors --> Fail
* Inject errors

**Step 3: Handling with redundancy**

* Have multiple copies of the information
* Use only the newest

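The three storage steps can be sketched in C as follows. Everything here is an assumption made for illustration: the in-memory `disk` array stands in for real storage units, and `Record`, `checksum_of`, `raw_write`, `safe_write`, `safe_read` and `fail_next_writes` are hypothetical names. Each copy carries a version number and a checksum, all errors collapse into "fail", and the reader uses only the newest valid copy.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define COPIES 3

typedef struct {
    uint32_t version;   /* increases with every write */
    int32_t  value;
    uint32_t checksum;  /* detects a corrupted copy */
} Record;

static Record disk[COPIES];        /* stand-in for N storage units */
static int fail_next_writes = 0;   /* error injection: writes to drop */

static uint32_t checksum_of(uint32_t version, int32_t value) {
    return (version * 31u) ^ (uint32_t)value ^ 0xA5A5A5A5u;
}

/* Single failure mode: a write either succeeds or reports "I failed". */
static bool raw_write(size_t i, Record r) {
    if (fail_next_writes > 0) {
        fail_next_writes--;
        return false;
    }
    disk[i] = r;
    return true;
}

/* Write to every copy; succeed if at least one copy was stored. */
static bool safe_write(uint32_t version, int32_t value) {
    Record r = { version, value, checksum_of(version, value) };
    bool any = false;
    for (size_t i = 0; i < COPIES; i++) {
        if (raw_write(i, r)) {
            any = true;
        }
    }
    return any;
}

/* Read all copies, skip corrupted ones, use only the newest. */
static bool safe_read(int32_t *out) {
    bool found = false;
    uint32_t best = 0;
    for (size_t i = 0; i < COPIES; i++) {
        Record r = disk[i];
        if (r.checksum != checksum_of(r.version, r.value)) {
            continue;   /* any error is treated as "this copy failed" */
        }
        if (!found || r.version > best) {
            best = r.version;
            *out = r.value;
            found = true;
        }
    }
    return found;
}
```

A zero-initialised record fails the checksum test, so reading an empty store reports failure instead of returning a made-up value.
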
#### Example with communication function

**Step 1: Failure modes**

* Message
    * Lost
    * Delayed
    * Corrupted
    * Duplicated
    * Wrong recipient

**Step 2: Detection, merging of error modes and error injection**

* Adding information to message
    * Checksum
    * Session ID
    * Sequence number
* Adding "ack" on well-received messages
* All errors will be treated as "Lost message"
* Injection
    * Occasionally throw away some messages

**Step 3: Handling with redundancy**

* Timeout
* Retransmit message

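A minimal C sketch of the communication steps, under stated assumptions: the `Message` layout, the simple checksum and the names `receive`, `channel_deliver`, `send_reliable` and `drop_count` are all hypothetical, and the channel is simulated in memory. Acks are assumed never to be lost, and a missing ack stands in for a timeout. Every detected error is merged into the single mode "lost message".

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint32_t session;   /* detects old sessions and wrong recipients */
    uint32_t seq;       /* detects duplicates and reordering */
    int32_t  payload;
    uint32_t checksum;  /* detects corruption */
} Message;

static uint32_t msg_checksum(const Message *m) {
    return m->session * 31u + m->seq * 17u + (uint32_t)m->payload;
}

/* Receiver: any detected error means the message is dropped silently,
 * i.e. treated as lost. Returns true when the message is acked. */
static bool receive(const Message *m, uint32_t session, uint32_t *expected_seq) {
    if (m->checksum != msg_checksum(m)) return false;  /* corrupted */
    if (m->session != session)          return false;  /* wrong session */
    if (m->seq != *expected_seq)        return false;  /* duplicate or reordered */
    (*expected_seq)++;
    return true;
}

/* Error injection: throw away the next drop_count transmissions. */
static int drop_count = 0;

static bool channel_deliver(const Message *m, uint32_t session, uint32_t *expected_seq) {
    if (drop_count > 0) {
        drop_count--;
        return false;
    }
    return receive(m, session, expected_seq);
}

/* Dynamic redundancy: no ack counts as a timeout, so retransmit.
 * Returns the number of attempts used, or 0 if every attempt failed. */
static int send_reliable(Message m, uint32_t session, uint32_t *expected_seq, int max_attempts) {
    m.checksum = msg_checksum(&m);
    for (int attempt = 1; attempt <= max_attempts; attempt++) {
        if (channel_deliver(&m, session, expected_seq)) {
            return attempt;
        }
    }
    return 0;
}
```

Because the sender only knows "ack" or "no ack", it needs a single recovery action (retransmit), no matter which of the five failure modes actually occurred.
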
#### Example with processes and calculations

A calculation is abstract, so how can we talk about its failure modes in general?

**Step 1: Failure modes**

There is only one failure mode.

**Step 2: Detect, simplify, inject errors**

All failed acceptance tests will "PANIC" or "STOP".

**Step 3: Handling with redundancy**

There are three solutions:

1. Checkpoint restart
    * Do all the work, including the acceptance test
    * Wait with the "side effects"
    * Store a checkpoint
    * Do the "side effects"
2. Process pairs
    * Crash and let another process take over
3. Persistent processes

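The checkpoint-restart order can be sketched in C as follows. The `State` struct, the acceptance test `state_ok` with its range of 0 to 1000, and the variable `published_total` (standing in for a side effect visible to the outside) are assumptions made for this example.

```c
#include <stdbool.h>

typedef struct {
    int step;
    int total;
} State;

static State checkpoint = { 0, 0 };  /* last state known to be good */
static int published_total = 0;      /* stand-in for an external side effect */

/* Hypothetical acceptance test: the total must stay within [0, 1000]. */
static bool state_ok(const State *s) {
    return s->total >= 0 && s->total <= 1000;
}

/* One unit of work in checkpoint-restart order. */
static bool process_sample(int sample) {
    State next = checkpoint;          /* 1. do all the work on a copy ... */
    next.step += 1;
    next.total += sample;
    if (!state_ok(&next)) {           /*    ... including the acceptance test */
        return false;                 /* "PANIC": the checkpoint is untouched */
    }
    /* 2. side effects have been held back until this point */
    checkpoint = next;                /* 3. store the checkpoint */
    published_total = next.total;     /* 4. do the side effects */
    return true;
}
```

A failed call leaves both the checkpoint and the published value unchanged, so a restart simply continues from the last stored checkpoint.
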
## Transactions

A transaction is a design framework for damage confinement and error recovery.

* An *atomic action*, just with backward error recovery as the standard error mode
* An indivisible and instantaneous "calculation" seen from the outside
* A transformation from one consistent state to another
* A modular computation

### Four features: ACID

* **A**tomicity: Either all side effects happen or none do
* **C**onsistency: Leaves the system in a consistent state when finished
* **I**solation: Errors do not spread
* **D**urability: Results are not lost

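Atomicity and consistency with backward recovery can be sketched in C with a small in-memory example. The `account` array, the `transfer` function and the rule that no balance may become negative are assumptions for illustration. Isolation and durability are not covered by this sketch.

```c
#include <stdbool.h>

#define ACCOUNTS 2

static int account[ACCOUNTS] = { 100, 50 };

/* Transfer as a transaction: both updates happen, or neither does. */
static bool transfer(int from, int to, int amount) {
    int saved[ACCOUNTS];                  /* recovery point */
    for (int i = 0; i < ACCOUNTS; i++) {
        saved[i] = account[i];
    }

    account[from] -= amount;              /* first side effect */
    account[to]   += amount;              /* second side effect */

    if (account[from] < 0) {              /* consistency check failed */
        for (int i = 0; i < ACCOUNTS; i++) {
            account[i] = saved[i];        /* abort: roll back everything */
        }
        return false;
    }
    return true;                          /* commit */
}
```

Seen from the outside, the sum of the two accounts is the same before and after every call, whether the transfer commits or aborts.
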
### Atomic Actions

**Resumption vs. termination mode**

* If we continue where we were (e.g. after the interrupt) --> *Resumption*
* If we continue somewhere else (i.e. terminating what we were doing) --> *Termination*

**Async Notification (AN) = Low-level thread interaction**

* Async event handling ("signals") (resumption)
    * Modeled after a HW interrupt
    * Can be sent to the correct thread
    * Can be handled, ignored, blocked --> The domain can be controlled.
    * Often leads to polling
        * Could rather skip the signal and poll a status variable or a message queue
    * Useless
* ATC --> Asynchronous Transfer of Control (termination)
    * Canceling threads
    * setjmp/longjmp could convert signals to ATC (not really, but still)
    * Ada: a structured mechanism for ATC is integrated with the select statement
    * RT Java: a structured mechanism for ATC is integrated with the exception-handling mechanism

#### Cancelling threads

**Yes, killing threads is ATC!**

* Can make a termination model by letting the domain be a thread
    * "Create a `doWork` thread, and kill it if the action fails"
* Can still control the domain by disabling "cancelstate"

**But, but, but: It leaves us in an undefined state!?**

* Not if we have...
    * Full control over changed state (like logs or recovery points) or some other way of recovering well.
    * A lock manager that can unlock on behalf of the killed thread
    * Some control of where we were killed (like not in the middle of a lock manager or log call)
* And this is what we have!

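With POSIX threads, the "kill the `doWork` thread" idea and the control over where it can be killed might look roughly like this. It is a sketch under assumptions: `do_work`, `run_and_cancel` and `critical_done` are made-up names, and the single assignment stands in for a lock manager or log call that must not be interrupted. The pthread calls themselves (`pthread_cancel`, `pthread_setcancelstate`, `pthread_testcancel`) are standard POSIX.

```c
#include <pthread.h>
#include <stddef.h>

static int critical_done = 0;

static void *do_work(void *arg) {
    (void)arg;
    int old_state;
    int ignored;

    /* Shrink the cancellation domain: a cancel request that arrives
     * here stays pending until cancellation is enabled again. */
    pthread_setcancelstate(PTHREAD_CANCEL_DISABLE, &old_state);
    critical_done = 1;                 /* e.g. a lock manager or log call */
    pthread_setcancelstate(old_state, &ignored);

    for (;;) {
        pthread_testcancel();          /* explicit cancellation point */
    }
    return NULL;                       /* not reached */
}

/* Returns 1 if the worker was cancelled, 0 if not, -1 on a pthread error. */
static int run_and_cancel(void) {
    pthread_t worker;
    void *result;

    if (pthread_create(&worker, NULL, do_work, NULL) != 0) return -1;
    if (pthread_cancel(worker) != 0) return -1;       /* the action "failed" */
    if (pthread_join(worker, &result) != 0) return -1;
    return result == PTHREAD_CANCELED;
}
```

With the default deferred cancellation type, the worker is only terminated at a cancellation point, so the protected region always runs to completion even if the cancel request arrives before it starts. Depending on the toolchain, the program may need to be linked with `-pthread`.
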
## Shared variable synchronization

### Non-preemptive scheduling

Example: controlling a pump filling a tank.

**Spec:**

* Every second: measure the water level of the tank and generate the reference to the pump
* 10 times a second: Set the power of the pump motor
* Do some GUI: let the human control the process

#### A trivial solution: "Cyclic Executive"

{% highlight c %}
oldTime = now();
i = 0;
while(true) {
    // Every 10th period (once a second): compute a new pump reference
    i = i + 1;
    if (i % 10 == 0) {
        i = 0;
        calculatePumpReference();
    }
    // Every period (10 times a second): set the pump power
    controlPump();
    // Spend the rest of the 0.1 s period on user events
    do {
        handleUserEvents();
    } while(now() < oldTime + 0.1);
    oldTime = oldTime + 0.1;
}
{% endhighlight %}

**Drawbacks**

* OK tasks?
* Timing is hard to tune (what if pump sampling should be $\pi/10$?)
* Overload (what if `calculatePumpReference` uses more than 1/10 second?)
* How to add new tasks? (Everything is coupled)
* Waste of time in the do-loop?
* What is the priority of `handleUserEvents`?
* How are errors, exceptions, alarms etc. handled?

#### Better solution with non-preemptive scheduler

* *3 tasks* administered by a scheduler
* The scheduler takes care of who runs and timing
* A scheduler is often included in the OS
* Introducing priorities

{% highlight c %}
/**
 * scheduler_registerThread(function, time, priority)
 * Higher priority number means higher priority in scheduler
 */
int main(void) {
    scheduler_registerThread(controlPump, 0.1, 3);
    scheduler_registerThread(calculatePumpReference, 1, 2);
    scheduler_registerThread(handleUserEvents, 0.2, 1);
    scheduler_mainLoop();
    return 0;
}
{% endhighlight %}

**Some notes on priorities**

* Priority generally does not express importance; rather, the main rule is to give higher priority to shorter-deadline tasks.
    * This allows tasks to reach their deadlines.
    * ... but this is not always the case, e.g. if the tasks are cooperating
* We still handle overload badly
* And: what is the connection between deadline and priority to start with?
    * Is this a good dependency seen from a code quality perspective?

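To make the scheduler interface used for the three pump tasks more concrete, here is one possible sketch of a non-preemptive scheduler in C. It is not the implementation from the course. It assumes that periods are given in ticks of 100 ms instead of seconds, and it replaces the endless `scheduler_mainLoop` with a hypothetical `scheduler_run(ticks)` that simulates a fixed number of ticks. The task bodies only count their own invocations.

```c
#include <stddef.h>

#define MAX_TASKS 8

typedef void (*TaskFn)(void);

typedef struct {
    TaskFn   fn;
    unsigned period_ticks;  /* run every period_ticks ticks */
    int      priority;      /* higher number means higher priority */
} Task;

static Task   tasks[MAX_TASKS];
static size_t task_count = 0;

/* Returns 0 on success, -1 if the table is full or the period is zero. */
static int scheduler_registerThread(TaskFn fn, unsigned period_ticks, int priority) {
    if (task_count == MAX_TASKS || period_ticks == 0) {
        return -1;
    }
    /* Keep the table sorted by descending priority, so that walking it
     * in order runs the highest-priority ready task first. */
    size_t i = task_count++;
    while (i > 0 && tasks[i - 1].priority < priority) {
        tasks[i] = tasks[i - 1];
        i--;
    }
    tasks[i] = (Task){ fn, period_ticks, priority };
    return 0;
}

/* Simulated main loop: every task that is due runs to completion and is
 * never preempted. A real loop would wait for a timer between ticks. */
static void scheduler_run(unsigned ticks) {
    for (unsigned t = 0; t < ticks; t++) {
        for (size_t i = 0; i < task_count; i++) {
            if (t % tasks[i].period_ticks == 0) {
                tasks[i].fn();
            }
        }
    }
}

/* Demo tasks that only count how often they were run. */
static int pump_runs = 0, reference_runs = 0, gui_runs = 0;
static void controlPump(void)            { pump_runs++; }
static void calculatePumpReference(void) { reference_runs++; }
static void handleUserEvents(void)       { gui_runs++; }
```

Registering the three tasks with periods of 1, 10 and 2 ticks and simulating 10 ticks (one second) runs `controlPump` ten times, `calculatePumpReference` once and `handleUserEvents` five times. Because each task runs to completion, a slow task delays every task after it in the same tick, which is the overload problem noted in the list.
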
### Pros and cons of non-preemptive scheduling

| **Pros**                                       | **Cons**                                                                   |
| :--------------------------------------------- | :------------------------------------------------------------------------- |
| Simple, intuitive, predictable                 | C macro hell                                                               |
| No kernel                                      | Threads must cooperate <-- a form of dependency breaking module boundaries |
| Fast switching times                           | Heavy threads must be divided                                              |
| Some elegant synchronization patterns possible | Can we handle blocking of library functions?                               |
|                                                | Not robust to errors                                                       |
|                                                | Not robust to (heavy) error handling                                       |
|                                                | Hard to tune at end of project                                             |
{: .table-responsive-lg .table }