---
title: "Summary of TTK4145"
description: "Lots of theory and discussion, some formulas, spring 2021."
date: 2021-05-04
math: true
---
## Fault tolerance
Hard to capture faults.
### Bugs
* 1 bug per 50 lines before testing
* 1 bug per 500 lines at release
* 1 bug per 550 lines after a year, after which the rate stays roughly constant

1. Make the program work within specs.
2. Run and test the program.
3. Errors happen.
4. Locate the errors:
    * Incomplete spec
    * Missing handling of some situation
5. Fix the code.
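### Traditional error handling
{% highlight c %}
#include <errno.h>
#include <stdio.h>

FILE *
openConfigFile(void) {
    FILE *f = fopen("/path/to/config.conf", "r"); // fopen needs a mode argument
    if (f == NULL) {
        switch (errno) {
            case ENOMEM: {
                /* ... */
                break;
            }
            case ENOTDIR: {
                /* ... */
                break;
            }
            // Do this for all errors
        }
    }
    return f; // NULL if the file could not be opened
}
{% endhighlight %}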
### Causes of errors
* Incomplete specification
* Software bugs
* HW problems
* Communication problems
### Fault tolerance in real time systems
The problem with traditional error handling is that errors can happen at any time, which is extremely hard to test.
Error handling in real-time programming has to deal with the following:
* Handling of unexpected errors
* Several threads handle errors
* Testing the conventional way does not work
    * It can only show the existence of errors
    * It cannot find errors in the specification
    * It cannot find race conditions

The fault path is shown below.
![Fault tolerance](figures/fault-path.svg)
With fault tolerance, the path looks more like the figure below.
![Fault tolerance](figures/fault-tolarance.svg)
### Error handling
Keep it simple!
The error modes are part of the module interface.
One way is to handle all errors the same way:
handle them as if each were the worst possible error, then crash and start again.
A different approach is to check that everything is OK instead of looking for specific errors.
One way to test how the system responds to an unknown error is to inject a failed acceptance test (a "not OK" signal).
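
The idea of merging all errors into one failure mode and injecting a "not OK" signal can be sketched as follows. This is a minimal illustration, not code from the course: `process_measurement`, `acceptance_test`, `inject_failure` and the 0 to 100 range are all hypothetical names and values chosen for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical fault-injection switch: when set, the acceptance test
 * reports "not OK" even though the result is fine. */
bool inject_failure = false;

/* The whole error interface of the module: it worked, or it did not. */
typedef enum { STEP_OK, STEP_FAILED } step_status;

/* Acceptance test: is the result plausible? The range is an assumption. */
static bool acceptance_test(int level_percent) {
    if (inject_failure) {
        return false;
    }
    return level_percent >= 0 && level_percent <= 100;
}

/* Every error is reported the same way, as STEP_FAILED. */
step_status process_measurement(int raw, int *out) {
    assert(out != NULL);
    int level = raw / 10;           /* stand-in for the real computation */
    if (!acceptance_test(level)) {
        return STEP_FAILED;         /* caller restarts or falls back */
    }
    *out = level;
    return STEP_OK;
}
```

With `inject_failure` set, `process_measurement(500, &out)` returns `STEP_FAILED` although the input is fine, which exercises the caller's recovery path without needing a real fault.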
### Redundancy
* If I have $N$ copies of my data, I can tolerate that one of them is destroyed.
* Sending $N$ messages, trying $N$ times.

**Static redundancy**
* $N$ active copies. Sending $N$ messages whether it is necessary or not.
* Detecting errors is not important.
* Handles cosmic rays easily.

**Dynamic redundancy**
* Relies on detecting the error and recovering
* Resend on timeout when no "ack" has been received
* Go with a default if no messages have been received
* The acceptance test must be good.
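
Static redundancy could look like the sketch below: three copies are always written and every read takes a majority vote, so no explicit error detection is needed. Majority voting with $N = 3$ is an assumption made for this example (the notes only say "$N$ active copies"), and `redundant_int`, `redundant_write` and `redundant_read` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* Static redundancy: always keep 3 copies and vote on every read. */
#define COPIES 3

typedef struct {
    int copy[COPIES];
} redundant_int;

void redundant_write(redundant_int *r, int value) {
    assert(r != NULL);
    for (size_t i = 0; i < COPIES; i++) {
        r->copy[i] = value;
    }
}

/* Majority vote: a single corrupted copy is outvoted by the other two. */
int redundant_read(const redundant_int *r) {
    assert(r != NULL);
    if (r->copy[0] == r->copy[1] || r->copy[0] == r->copy[2]) {
        return r->copy[0];
    }
    return r->copy[1];  /* copy[0] is the odd one out (or all three differ) */
}
```

A single bit flip in any one copy, for example `r.copy[0] ^= 0x10`, does not change what `redundant_read` returns. Two simultaneous faults are outside what this sketch tolerates.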
### Fault model
#### Example with storage functions
**Step 1: Failure modes**

Find the failure modes: what could go wrong?
* **Write**: May return "I failed". Does not know why it failed.
* **Read**: May return "I failed". Does not know why it failed.

**Step 2: Detect, simplify, inject errors**
* Write information on where/what/how the process is doing.
* All errors --> Fail
* Inject errors

**Step 3: Handling with redundancy**
* Have multiple copies of the information
* Use only the newest
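
The three steps for storage can be sketched as below, under stated assumptions: the "disks" are an in-memory array, the checksum is a toy formula rather than a real CRC, and all names (`store`, `fail_write`, `store_write`, `store_read`) are hypothetical. Each copy carries a version number so that a read can pick the newest valid one.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define N_COPIES 3

typedef struct {
    uint32_t version;   /* increases with every write */
    int      data;
    uint32_t checksum;  /* detects a corrupted copy */
    bool     written;
} record;

record store[N_COPIES];       /* stand-in for N independent disks */
bool   fail_write[N_COPIES];  /* error injection: copy i rejects writes */

static uint32_t checksum_of(uint32_t version, int data) {
    return version * 31u + (uint32_t)data;  /* toy checksum, not a real CRC */
}

/* Single failure mode: returns false ("I failed") without saying why. */
static bool copy_write(size_t i, uint32_t version, int data) {
    assert(i < N_COPIES);
    if (fail_write[i]) {
        return false;
    }
    store[i] = (record){version, data, checksum_of(version, data), true};
    return true;
}

/* Write to all copies; succeed if at least one copy took the write. */
bool store_write(uint32_t version, int data) {
    bool any = false;
    for (size_t i = 0; i < N_COPIES; i++) {
        any = copy_write(i, version, data) || any;
    }
    return any;
}

/* Read every copy, discard the bad ones, and use only the newest. */
bool store_read(int *out) {
    bool found = false;
    uint32_t best = 0;
    for (size_t i = 0; i < N_COPIES; i++) {
        const record *r = &store[i];
        if (!r->written || r->checksum != checksum_of(r->version, r->data)) {
            continue;  /* all errors are merged into "this copy failed" */
        }
        if (!found || r->version > best) {
            best = r->version;
            *out = r->data;
            found = true;
        }
    }
    return found;
}
```

Setting `fail_write[0] = true` before a write leaves copy 0 stale; a later `store_read` still returns the newest value because the other copies carry a higher version.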
#### Example with communication function
**Step 1: Failure modes**
* Message
    * Lost
    * Delayed
    * Corrupted
    * Duplicated
    * Wrong recipient

**Step 2: Detection, merging of error modes and error injection**
* Adding information to the message
    * Checksum
    * Session ID
    * Sequence number
* Adding "ack" on well-received messages
* All errors will be treated as "Lost message"
* Injection
    * Occasionally throw away some messages

**Step 3: Handling with redundancy**
* Timeout
* Retransmit message
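
A possible shape of this scheme, simulated in a single process: the message carries a sequence number and a checksum, the channel drops messages on request (error injection), and the sender retransmits until it gets an "ack". Everything here is an assumption made for illustration: the names, the toy checksum, the convention that sequence numbers start at 1, and the modelling of a timeout as a `false` return. Session IDs are left out to keep the sketch short.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint32_t seq;       /* sequence number: detects duplicates */
    int      payload;
    uint32_t checksum;  /* detects corruption */
} message;

static uint32_t checksum_of(uint32_t seq, int payload) {
    return seq * 31u + (uint32_t)payload;  /* toy checksum, not a real CRC */
}

/* Error injection: the channel throws away the next drop_count messages. */
int drop_count = 0;

/* Receiver state. Sequence numbers are assumed to start at 1. */
uint32_t last_seq_seen = 0;
int      delivered_count = 0;
int      last_payload = 0;

/* Receiver: returns true when it sends an "ack". */
static bool receiver_handle(const message *m) {
    if (m->checksum != checksum_of(m->seq, m->payload)) {
        return false;               /* corrupted: treated as lost, no ack */
    }
    if (m->seq != last_seq_seen) {  /* new message, not a duplicate */
        last_seq_seen = m->seq;
        last_payload = m->payload;
        delivered_count++;
    }
    return true;                    /* duplicates are acked but not re-delivered */
}

/* Channel: returns true if an ack came back before the "timeout". */
static bool channel_send(const message *m) {
    if (drop_count > 0) {
        drop_count--;
        return false;               /* lost: the sender sees a timeout */
    }
    return receiver_handle(m);
}

/* Sender: retransmit on timeout, give up after max_tries attempts. */
bool send_reliable(uint32_t seq, int payload, int max_tries) {
    assert(max_tries > 0);
    message m = {seq, payload, checksum_of(seq, payload)};
    for (int attempt = 0; attempt < max_tries; attempt++) {
        if (channel_send(&m)) {
            return true;
        }
    }
    return false;
}
```

With `drop_count = 2`, a call to `send_reliable(1, 42, 5)` succeeds on the third attempt and the payload is delivered exactly once.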
#### Example with processes and calculations
A calculation is abstract, so how can we talk generally about its failure modes?

**Step 1: Failure modes**

Only one failure mode.

**Step 2: Detect, simplify, inject errors**

All failed acceptance tests will "PANIC" or "STOP".

**Step 3: Handling with redundancy**

There are three solutions:
1. Checkpoint restart
    * Do all the work including the acceptance test
    * Wait with the "side effects"
    * Store a checkpoint
    * Do the "side effects"
2. Process pairs
    * Crash and let another process take over
3. Persistent processes
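
Checkpoint restart, the first of the three solutions, can be sketched like this. It is a simplified illustration with hypothetical names: `checkpoint_store` stands in for safe storage, `output_log` for side effects on the outside world, and `crash_at_step` injects a crash. The sketch assumes a crash only happens while doing the work, never between storing the checkpoint and doing the side effect.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_OUTPUTS 16

typedef struct {
    int step;   /* next step to run */
    int total;  /* result accumulated so far */
} state;

state checkpoint_store = {0, 0};  /* stand-in for safe storage */
int   output_log[MAX_OUTPUTS];    /* stand-in for side effects */
int   output_count = 0;
int   crash_at_step = -1;         /* error injection: -1 means never */

static bool acceptance_test(const state *s) {
    return s->step > 0 && s->total >= 0;
}

/* Runs from the last checkpoint. Returns false if the process "crashed". */
bool run(int n_steps) {
    assert(n_steps <= MAX_OUTPUTS);
    state s = checkpoint_store;            /* restart from the checkpoint */
    while (s.step < n_steps) {
        if (s.step == crash_at_step) {
            return false;                  /* PANIC: work since the checkpoint is lost */
        }
        state next = { s.step + 1, s.total + s.step };  /* 1. do the work */
        if (!acceptance_test(&next)) {
            return false;                  /* failed acceptance test: STOP */
        }
        checkpoint_store = next;                   /* 2. store a checkpoint */
        output_log[output_count++] = next.total;   /* 3. do the side effects */
        s = next;
    }
    return true;
}
```

After a crash, calling `run` again continues from `checkpoint_store`, so completed steps are not repeated and their side effects are not duplicated.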
## Transactions
A transaction is a design framework for Damage Confinement and Error Recovery.
* An *atomic action*, but with backward error recovery as the standard error mode
* Indivisible and instantaneous "calculation" seen from the outside
* A transformation from one consistent state to another
* A modular computation
### Four features: ACID
* **A**tomicity: Either all side effects happen or none
* **C**onsistency: Leaves the system in a consistent state when finished
* **I**solation: Errors do not spread
* **D**urability: Results are not lost
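
As a small illustration of atomicity and consistency, the sketch below works on a private copy of the state and either commits the whole result or leaves the original untouched. The `accounts` type, the `transfer` function and the "no negative balances" rule are assumptions made for the example; durability and concurrent access are not covered.

```c
#include <assert.h>
#include <stdbool.h>

typedef struct {
    int from_balance;
    int to_balance;
} accounts;

/* Consistency rule (an assumption for this example): no negative balances. */
static bool consistent(const accounts *a) {
    return a->from_balance >= 0 && a->to_balance >= 0;
}

/* Transfers amount as one transaction: commit everything or nothing. */
bool transfer(accounts *a, int amount) {
    assert(amount >= 0);
    accounts work = *a;          /* work on a private copy */
    work.from_balance -= amount;
    work.to_balance   += amount;
    if (!consistent(&work)) {
        return false;            /* abort: *a is untouched (atomicity) */
    }
    *a = work;                   /* commit: all side effects at once */
    return true;
}
```

Starting from `{100, 0}`, `transfer(&a, 30)` commits and gives `{70, 30}`, while a following `transfer(&a, 500)` aborts and leaves `{70, 30}` in place.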
### Atomic Actions
**Resumption vs. termination mode**
* If we continue where we were (e.g. after the interrupt) --> *Resumption*
* If we continue somewhere else (i.e. terminating what we were doing) --> *Termination*

**Async Notification (AN) = Low level thread interaction**
* Async event handling ("signals") (resumption)
    * Modeled after a HW interrupt
    * Can be sent to the correct thread
    * Can be handled, ignored, blocked --> the domain can be controlled
    * Often leads to polling
        * Could rather skip the signal and poll a status variable or a message queue
    * Useless
* ATC --> Async Transfer of Control (termination)
    * Canceling threads
    * setjmp/longjmp could convert signals to ATC (not really, but still)
    * Ada: a structured mechanism for ATC is integrated with the select statement
    * RT Java: a structured mechanism for ATC is integrated with the exception-handling mechanism
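
The termination model can be illustrated with `setjmp`/`longjmp` from the C standard library. Note the simplification: in this sketch the jump is triggered synchronously by the work itself, whereas a real ATC would be triggered from outside the running code. `run_action`, `do_work` and `steps_completed` are hypothetical names.

```c
#include <assert.h>
#include <setjmp.h>
#include <stdbool.h>

static jmp_buf recovery_point;
int steps_completed = 0;

/* Hypothetical work function that fails at step fail_at (-1: never). */
static void do_work(int n_steps, int fail_at) {
    for (int i = 0; i < n_steps; i++) {
        if (i == fail_at) {
            longjmp(recovery_point, 1);  /* abandon the work: termination */
        }
        steps_completed++;
    }
}

/* Returns true if the action completed, false if it was terminated. */
bool run_action(int n_steps, int fail_at) {
    assert(n_steps >= 0);
    steps_completed = 0;
    if (setjmp(recovery_point) != 0) {
        /* We continue here, not where the failure happened. */
        return false;
    }
    do_work(n_steps, fail_at);
    return true;
}
```

`run_action(5, 3)` returns `false` with `steps_completed == 3`: the remaining work is never resumed. In resumption mode, control would instead return to the point inside `do_work` where the event occurred.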
#### Cancelling threads
**Yes, killing threads is ATC!**
* Can make a termination model by letting the domain be a thread
    * "Create a `doWork` thread, and kill it if the action fails"
* Can still control the domain by disabling "cancelstate"

**But, but, but: it leaves us in an undefined state!?**
* Not if we have...
    * Full control over changed state (like logs or recovery points) or some other way of recovering well.
    * A lock manager that can unlock on behalf of a killed thread
    * Some control of where we were killed (like not in the middle of a lock manager or log call)
* And this is what we have!
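
With POSIX threads, the "create a `doWork` thread and kill it" idea might look like the sketch below. `pthread_cancel`, `pthread_setcancelstate` and `pthread_testcancel` are standard POSIX calls; `protected_updates` and `run_and_kill` are hypothetical, and the protected section is just a counter standing in for a lock manager or log call. Depending on the toolchain, compiling may require `-pthread`.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

int protected_updates = 0;  /* state that must never be left half-changed */

static void *doWork(void *arg) {
    (void)arg;
    int old;
    /* Shrink the domain: no cancellation inside the protected section. */
    pthread_setcancelstate(PTHREAD_CANCEL_DISABLE, &old);
    protected_updates++;    /* stand-in for a lock manager or log call */
    pthread_setcancelstate(PTHREAD_CANCEL_ENABLE, &old);
    for (;;) {
        pthread_testcancel();  /* explicit cancellation point */
    }
    return NULL;               /* never reached */
}

/* Starts the action, then terminates it. Returns true if it was killed. */
bool run_and_kill(void) {
    pthread_t t;
    void *result = NULL;
    if (pthread_create(&t, NULL, doWork, NULL) != 0) {
        return false;
    }
    pthread_cancel(t);         /* "the action failed": kill the thread */
    int rc = pthread_join(t, &result);
    assert(rc == 0);
    return result == PTHREAD_CANCELED;
}
```

Even if the cancel request arrives before the thread has started running, it stays pending while the cancel state is disabled, so the protected update always completes before the thread dies.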
## Shared variable synchronization
### Non-Preemptive scheduling
Controlling a pump filling a tank.
**Spec:**
* Every second: measure the water level of the tank and generate the reference to the pump
* 10 times a second: Set the power of the pump motor
* Do some GUI: let the human control the process
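#### A trivial solution: "Cyclic Executive"
{% highlight c %}
#include <stdbool.h>

// Assumed to be provided elsewhere
double now(void); // current time in seconds
void calculatePumpReference(void);
void controlPump(void);
void handleUserEvents(void);

void cyclicExecutive(void) {
    double oldTime = now();
    int i = 0;
    while (true) {
        i = i + 1;
        if (i % 10 == 0) { // every 10th cycle, i.e. once a second
            i = 0;
            calculatePumpReference();
        }
        controlPump();
        do {
            handleUserEvents();
        } while (now() < oldTime + 0.1); // fill the rest of the 0.1 s cycle
        oldTime = oldTime + 0.1;
    }
}
{% endhighlight %}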
**Drawbacks**
* OK tasks?
* Timing hard to tune (what if pump sampling should be $\pi$/10?)
* Overload (what if `calculatePumpReference` takes more than 1/10 second?)
* How to add new tasks? (Everything is coupled)
* Waste of time in the do-loop?
* What is priority of `handleUserEvents`?
* How are errors, exceptions, alarms etc. handled?
#### Better solution with a non-preemptive scheduler
* *3 tasks* administered by a scheduler
* The scheduler takes care of who runs and timing
* Scheduler often included in OSes
* Introducing priorities
{% highlight c %}
/**
 * scheduler_registerThread(function, time, priority)
 * time is the period in seconds
 * Higher priority number means higher priority in scheduler
 */
int main(void) {
    scheduler_registerThread(controlPump, 0.1, 3);
    scheduler_registerThread(calculatePumpReference, 1, 2);
    scheduler_registerThread(handleUserEvents, 0.2, 1);
    scheduler_mainLoop();
    return 0;
}
{% endhighlight %}
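
What such a scheduler does internally could be sketched as follows. This is a toy under several assumptions, not the scheduler used in the course: time is a simulated clock in whole milliseconds, tasks take zero simulated time, periods are given in milliseconds instead of seconds, and `scheduler_runUntil` replaces the endless main loop so that the example terminates. The three task bodies only count their activations.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_TASKS 8

typedef void (*task_fn)(void);

typedef struct {
    task_fn run;
    long    period_ms;
    int     priority;      /* higher number = higher priority */
    long    next_release;  /* time when the task becomes ready again */
} task;

static task   tasks[MAX_TASKS];
static size_t task_count = 0;

void scheduler_registerThread(task_fn fn, long period_ms, int priority) {
    assert(task_count < MAX_TASKS && period_ms > 0);
    tasks[task_count++] = (task){fn, period_ms, priority, 0};
}

/* Picks the highest-priority ready task, or NULL if none is ready. */
static task *pick_ready(long now_ms) {
    task *best = NULL;
    for (size_t i = 0; i < task_count; i++) {
        if (tasks[i].next_release <= now_ms &&
            (best == NULL || tasks[i].priority > best->priority)) {
            best = &tasks[i];
        }
    }
    return best;
}

/* Each task runs to completion; nothing preempts it. */
void scheduler_runUntil(long end_ms) {
    for (long now_ms = 0; now_ms < end_ms; now_ms++) {
        task *t;
        while ((t = pick_ready(now_ms)) != NULL) {
            t->run();
            t->next_release += t->period_ms;
        }
    }
}

/* Example tasks that only count their activations. */
int pump_runs = 0, reference_runs = 0, gui_runs = 0;
static void controlPump(void)            { pump_runs++; }
static void calculatePumpReference(void) { reference_runs++; }
static void handleUserEvents(void)       { gui_runs++; }

/* One simulated second of the pump example. */
void demo(void) {
    scheduler_registerThread(controlPump, 100, 3);
    scheduler_registerThread(calculatePumpReference, 1000, 2);
    scheduler_registerThread(handleUserEvents, 200, 1);
    scheduler_runUntil(1000);
}
```

Because a task always runs to completion, a long-running task delays every other task regardless of priority, which is the overload problem mentioned for the cyclic executive in a different form.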
**Some notes on priorities**
* Priority is generally not the same as importance; rather, the main rule is to give higher priority to tasks with shorter deadlines.
* This allows the tasks to meet their deadlines.
* ... but this is not always the case, e.g. if the tasks are cooperating
* We still handle overload badly
* And: what is the connection between deadline and priority to start with?
* Is this a good dependency seen from a code quality perspective?
### Pros and cons of nonpreemptive scheduling
| **Pros** | **Cons** |
| :--------------------------------------------- | :------------------------------------------------------------------------- |
| Simple, intuitive, predictable | C macro hell |
| No kernel | Threads must cooperate <-- a form of dependency breaking module boundaries |
| Fast switching times | Heavy threads must be divided |
| Some elegant synchronization patterns possible | Can we handle blocking of library functions?                               |
|                                                | Not robust to errors                                                       |
|                                                | Not robust to (heavy) error handling                                       |
| | Hard to tune at end of project |
{: .table-responsive-lg .table }
### Preemptive Kernel
* Preemption, thread objects and the timer interrupt
* Enabling synchronization: Busy waiting, test-and-set, disabling the timer interrupt
* Blocking and suspend & resume
* An API for synchronization? Semaphores!
#### Preemption
* Make a handler for a timer interrupt
* Store all registers (including IP & SP) in a "thread object"
* Organize queue of processes (Round Robin e.g. - a collection of thread objects?)
* Can synchronize by: `while(!ready);` (busy waiting, "spin locks")
**Bad solution**
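{% highlight c %}
while(lock==1) {} // Wait until the lock looks free
// Race: another thread can be scheduled here and also see lock == 0
lock = 1;
// We may run
lock = 0;
{% endhighlight %}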
**Better solution**
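{% highlight c %}
// Peterson's algorithm, seen from thread 1.
// Thread 2 mirrors this: it sets flag2 and turn = 1, and waits on flag1.
void t1() {
    flag1 = 1; // Declare my intention
    turn = 2; // But try to be polite
    while(flag2 == 1 && turn == 2) {}
    // We may run
    flag1 = 0;
}
{% endhighlight %}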
##### Looking more closely at the arsenal
**How can we make basic synchronization under preemption?**
* Spin locks (wasting time and cpu)
* Test&Set (swap) assembly instruction (atomic, but not obvious)
* Disable interrupt (steals control from OS/scheduler)
**But**
* If we disable the timer interrupt we do not have preemption any more
* And... Are these good abstractions in the application programmer domain?
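
A test-and-set spin lock can be sketched with the C11 `atomic_flag` type from `<stdatomic.h>`, which gives portable access to this kind of atomic instruction. The `spinlock` type and its three functions are hypothetical names for this example. Unlike a separate test followed by a set, the test and the set happen in one atomic step, so two threads cannot both see the lock as free.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct {
    atomic_flag held;
} spinlock;

#define SPINLOCK_INIT { ATOMIC_FLAG_INIT }

/* One atomic step: set the flag and learn whether it was already set. */
bool spinlock_try_lock(spinlock *l) {
    assert(l != NULL);
    return !atomic_flag_test_and_set(&l->held);
}

void spinlock_lock(spinlock *l) {
    assert(l != NULL);
    while (atomic_flag_test_and_set(&l->held)) {
        /* busy wait: wastes CPU time until the holder unlocks */
    }
}

void spinlock_unlock(spinlock *l) {
    assert(l != NULL);
    atomic_flag_clear(&l->held);
}
```

A lock declared as `spinlock l = SPINLOCK_INIT;` starts out free; `spinlock_try_lock(&l)` then succeeds once and fails until `spinlock_unlock(&l)` is called.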
#### Blocked threads
**Let us introduce another queue; the collection of threads not running, waiting for something**
* Fixes the bad performance of spin locks. Is conceptually better.
* "Suspend" moves a thread object from "run" queue to "blocked" queue
* "Resume" moves it back.
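
The two queues can be sketched as plain arrays of thread-object pointers. This is only a data-structure illustration with hypothetical names (`thread_object`, `thread_queue`, `suspend`, `resume`): a real kernel would also save and restore registers, and the sketch assumes there are never more than `MAX_THREADS` threads in total.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_THREADS 8

typedef struct {
    int id;  /* a real thread object would also hold the saved registers */
} thread_object;

typedef struct {
    thread_object *items[MAX_THREADS];
    size_t count;
} thread_queue;

thread_queue run_queue;      /* threads that may be scheduled */
thread_queue blocked_queue;  /* threads waiting for something */

bool queue_push(thread_queue *q, thread_object *t) {
    assert(q != NULL && t != NULL);
    if (q->count == MAX_THREADS) {
        return false;
    }
    q->items[q->count++] = t;
    return true;
}

/* Removes t from q, keeping the order of the remaining threads. */
bool queue_remove(thread_queue *q, thread_object *t) {
    assert(q != NULL && t != NULL);
    for (size_t i = 0; i < q->count; i++) {
        if (q->items[i] == t) {
            for (size_t j = i + 1; j < q->count; j++) {
                q->items[j - 1] = q->items[j];
            }
            q->count--;
            return true;
        }
    }
    return false;
}

/* "Suspend": move a thread from the run queue to the blocked queue. */
bool suspend(thread_object *t) {
    return queue_remove(&run_queue, t) && queue_push(&blocked_queue, t);
}

/* "Resume": move it back. */
bool resume(thread_object *t) {
    return queue_remove(&blocked_queue, t) && queue_push(&run_queue, t);
}
```

A suspended thread is simply never picked by the scheduler, so it uses no CPU time while waiting, in contrast to a spinning thread.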