---
title: "Summary of TTK4145"
description: "Lots of theory and discussion, some formulas, spring 2021."
date: 2021-05-04
math: true
---

## Fault tolerance

Faults are hard to capture.

### Bugs

* 1 bug per 50 lines before testing
* 1 bug per 500 lines at release
* 1 bug per 550 lines after a year, which is where the rate stays constant

1. Make the program work within specs.
2. Run and test the program.
3. Errors happen.
4. Locate the errors.
    * Incomplete spec
    * Missing handling of some situation
5. Fix the code.

### Traditional error handling

{% highlight c %}
#include <errno.h>
#include <stdio.h>

FILE *
openConfigFile(void){
    FILE * f = fopen("/path/to/config.conf", "r");
    if (f == NULL) {
        switch(errno){
            case ENOMEM: {
                /* ... */
                break;
            }
            case ENOTDIR: {
                /* ... */
                break;
            }
            // Do this for all errors
        }
    }
    return f; // NULL if the file could not be opened
}
{% endhighlight %}

### Causes of errors

* Incomplete specification
* Software bugs
* HW problems
* Communication problems

### Fault tolerance in real-time systems

The problem with traditional error handling is that errors can happen at any time.
This is extremely hard to test.

Error handling in real-time programming has to deal with the following:

* Handling of unexpected errors
* Multiple threads handle errors
* Cannot be tested the conventional way
    * Testing can only show the existence of errors
    * Testing cannot find errors in the specification
    * Testing cannot find race conditions

The fault path is shown below.

![Fault tolerance](figures/fault-path.svg)

With fault tolerance, the path looks more like the figure below.

![Fault tolerance](figures/fault-tolarance.svg)

### Error handling

Keep it simple!

The error modes are part of the module interface.

One way is to handle all errors the same way.
Handle them as if each were the worst possible error.
Crash and start again.

A different approach is to check that everything is OK.

To test how the system responds to an unknown error, insert a failed acceptance test (a "not OK" signal).

### Redundancy

* If I have $N$ copies of my data, I can tolerate that one is destroyed.
* Sending $N$ messages, trying $N$ times.

**Static redundancy**

* $N$ active copies. Sending $N$ messages whether it is necessary or not.
* Detecting errors is not important.
* Handles cosmic rays easily.

**Dynamic redundancy**

* Relies on detecting the error and recovering
* Resend on timeout if no "ack" is received
* Go with a default if no messages have been received
* The acceptance test must be good.

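The difference between the two kinds of redundancy can be sketched in a few lines of C. This is only an illustration under assumptions: `vote3`, `retry`, and `fails_n_times` are hypothetical names, and the "attempt" callback stands in for something like "send a message and wait for the ack".

```c
#include <assert.h>
#include <stdbool.h>

/* Static redundancy: always use all three copies and take a majority vote.
 * No error detection is needed; a single corrupted copy is simply outvoted. */
int vote3(int a, int b, int c) {
    if (a == b || a == c) return a;
    return b; /* either b == c, or all three differ and there is no majority */
}

/* Dynamic redundancy: try again only when an error is detected.
 * `attempt` is a hypothetical operation that returns true on success. */
bool retry(bool (*attempt)(void *ctx), void *ctx, int max_tries) {
    assert(max_tries > 0);
    for (int i = 0; i < max_tries; i++) {
        if (attempt(ctx)) return true;  /* acceptance test passed */
    }
    return false;                       /* give up; the caller may use a default */
}

/* Example attempt: fails as many times as the counter in ctx says. */
bool fails_n_times(void *ctx) {
    int *remaining_failures = ctx;
    if (*remaining_failures > 0) {
        (*remaining_failures)--;
        return false;
    }
    return true;
}
```

Note how `vote3` never asks whether something went wrong, while `retry` depends entirely on the quality of the success check.
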
### Fault model

#### Example with storage functions

**Step 1: Failure modes**

Find the failure modes: What could go wrong?

* **Write**: May return "I failed". Does not know why it failed.
* **Read**: May return "I failed". Does not know why it failed.

**Step 2: Detect, simplify, inject errors**

* Write information about where the process is, what it is doing, and how it is doing.
* All errors --> Fail
* Inject errors

**Step 3: Handling with redundancy**

* Have multiple copies of the information
* Use only the newest

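The three steps for storage could look roughly like the sketch below. Everything here is an assumption made for illustration: the `record_t` layout, the toy checksum, and the in-memory `disk` array that stands in for independent storage units.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define COPIES 3

/* One stored copy: the payload plus extra information used for detection. */
typedef struct {
    uint32_t version;   /* higher means newer */
    int32_t  value;     /* the payload */
    uint32_t checksum;  /* detects corrupted copies */
} record_t;

record_t disk[COPIES];  /* stands in for COPIES independent storage units */

uint32_t checksum_of(uint32_t version, int32_t value) {
    return (version * 31u) ^ (uint32_t)value ^ 0xA5A5A5A5u;  /* toy checksum */
}

/* Write the first `count` copies; count < COPIES models an interrupted write. */
void write_copies(uint32_t version, int32_t value, int count) {
    assert(count >= 0 && count <= COPIES);
    for (int i = 0; i < count; i++) {
        disk[i] = (record_t){ version, value, checksum_of(version, value) };
    }
}

/* Every detected error just means "this copy failed".
 * Among the copies that pass the check, use only the newest. */
bool safe_read(int32_t *out) {
    bool found = false;
    uint32_t newest = 0;
    for (int i = 0; i < COPIES; i++) {
        const record_t *r = &disk[i];
        if (r->checksum != checksum_of(r->version, r->value)) continue;
        if (!found || r->version > newest) {
            found = true;
            newest = r->version;
            *out = r->value;
        }
    }
    return found;
}

/* Error injection: corrupt one copy on purpose. */
void inject_corruption(int copy) {
    assert(copy >= 0 && copy < COPIES);
    disk[copy].value ^= 0x55;
}
```

With `inject_corruption` the recovery path can be exercised on purpose instead of waiting for a real disk error.
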
#### Example with communication function

**Step 1: Failure modes**

* Message
    * Lost
    * Delayed
    * Corrupted
    * Duplicated
    * Wrong recipient

**Step 2: Detection, merging of error modes, and error injection**

* Adding information to the message
    * Checksum
    * Session ID
    * Sequence number
* Adding "ack" on well-received messages
* All errors will be treated as "Lost message"
* Injection
    * Occasionally throw away some messages

**Step 3: Handling with redundancy**

* Timeout
* Retransmit message

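A minimal stop-and-wait sketch of these steps follows. It is a simulation, not a network implementation: `msg_t`, `receiver_t`, `channel_t` and the deterministic drop pattern are all hypothetical, a missing ack stands in for a timeout, and the ack path itself is assumed to be reliable.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint32_t seq;       /* sequence number: detects duplicates */
    int32_t  payload;
    uint32_t checksum;  /* detects corruption */
} msg_t;

uint32_t msg_checksum(uint32_t seq, int32_t payload) {
    return (seq * 2654435761u) ^ (uint32_t)payload;  /* toy checksum */
}

typedef struct { uint32_t expected_seq; int32_t last_payload; int delivered; } receiver_t;

/* Returns true if an "ack" goes back to the sender.
 * Anything suspicious is dropped silently, i.e. treated as a lost message. */
bool receiver_handle(receiver_t *r, const msg_t *m) {
    if (m->checksum != msg_checksum(m->seq, m->payload)) return false;
    if (m->seq > r->expected_seq) return false;
    if (m->seq == r->expected_seq) {   /* new message: deliver it exactly once */
        r->last_payload = m->payload;
        r->delivered++;
        r->expected_seq++;
    }
    return true;                       /* duplicates are acked but not delivered */
}

/* Error injection: drop transmission number k if bit k of drop_pattern is set. */
typedef struct { uint32_t drop_pattern; int transmissions; } channel_t;

bool channel_delivers(channel_t *c) {
    bool dropped = (c->drop_pattern >> (c->transmissions % 32)) & 1u;
    c->transmissions++;
    return !dropped;
}

/* Sender: retransmit until acked, at most max_tries times. */
bool send_reliable(channel_t *c, receiver_t *r, uint32_t seq, int32_t payload, int max_tries) {
    msg_t m = { seq, payload, msg_checksum(seq, payload) };
    for (int i = 0; i < max_tries; i++) {
        if (channel_delivers(c) && receiver_handle(r, &m)) return true;
    }
    return false;
}
```

Because every failure mode collapses into "no ack arrived", the sender needs only one recovery strategy: retransmit.
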
#### Example with processes and calculations

A calculation is an abstract concept, so how can we talk generally about its failure modes?

**Step 1: Failure modes**

One failure mode

**Step 2: Detect, simplify, inject errors**

Every failed acceptance test leads to "PANIC" or "STOP".

**Step 3: Handling with redundancy**

There are three solutions:

1. Checkpoint restart
    * Do all the work, including the acceptance test
    * Wait with the "side effects"
    * Store a checkpoint
    * Do the "side effects"
2. Process pairs
    * Crash and let another process take over
3. Persistent processes

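The ordering in checkpoint restart (work, acceptance test, checkpoint, side effects) can be sketched as below. The state layout, the acceptance test, and the `output_log` array that plays the role of the side effects are all assumptions for the sake of the example.

```c
#include <assert.h>
#include <stdbool.h>

typedef struct { int step; long sum; } state_t;

state_t checkpoint;     /* stands in for safe (redundant) storage */
long    output_log[16]; /* the externally visible "side effects" */
int     output_count;

/* Acceptance test: a hypothetical sanity check on the computed state. */
bool acceptable(const state_t *s) { return s->sum >= 0 && s->step >= 0; }

/* One unit of work. `inject_fault` simulates a calculation gone wrong. */
bool do_step(state_t *s, int input, bool inject_fault) {
    state_t next = *s;                      /* 1. do all the work on a copy... */
    next.step += 1;
    next.sum  += input;
    if (inject_fault) next.sum = -1;
    if (!acceptable(&next)) {               /*    ...including the acceptance test */
        *s = checkpoint;                    /*    failed: restart from the checkpoint */
        return false;
    }
    checkpoint = next;                      /* 2. store a checkpoint */
    *s = next;
    assert(output_count < 16);
    output_log[output_count++] = next.sum;  /* 3. only now do the side effects */
    return true;
}
```

A failed step leaves no trace in `output_log`, which is the point of waiting with the side effects.
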
## Transactions

A transaction is a design framework for damage confinement and error recovery.

* An *atomic action*, just with backward error recovery as the standard error mode
* An indivisible and instantaneous "calculation" seen from the outside
* A transformation from one consistent state to another
* A modular computation

### Four features: ACID

* **A**tomicity: Either all side effects happen or none
* **C**onsistency: Leaves the system in a consistent state when finished
* **I**solation: Errors do not spread (concurrent transactions do not see each other's intermediate states)
* **D**urability: Results are not lost

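Atomicity and consistency can be illustrated with a tiny single-threaded sketch. The `bank_t` type and the "no negative balance" rule are hypothetical, and isolation and durability are deliberately left out.

```c
#include <assert.h>
#include <stdbool.h>

typedef struct { long balance[2]; } bank_t;

/* Consistency rule assumed for this example: no account goes negative. */
bool consistent(const bank_t *b) {
    return b->balance[0] >= 0 && b->balance[1] >= 0;
}

/* Move `amount` between two accounts as one transaction.
 * Either both updates become visible (commit) or neither does (abort). */
bool transfer(bank_t *bank, int from, int to, long amount) {
    assert(from >= 0 && from < 2 && to >= 0 && to < 2);
    bank_t work = *bank;            /* work on a private copy */
    work.balance[from] -= amount;
    work.balance[to]   += amount;
    if (!consistent(&work)) {
        return false;               /* abort: *bank was never touched */
    }
    *bank = work;                   /* commit */
    return true;
}
```

An aborted transfer leaves the original state exactly as it was, which is backward error recovery in its simplest form.
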
### Atomic Actions

**Resumption vs. Termination mode**

* If we continue where we were (e.g. after the interrupt) --> *Resumption*
* If we continue somewhere else (i.e. terminating what we were doing) --> *Termination*

**Async Notification (AN) = Low-level thread interaction**

* Async event handling ("signals") (resumption)
    * Modeled after a HW interrupt
    * Can be sent to the correct thread
    * Can be handled, ignored, blocked --> The domain can be controlled.
    * Often leads to polling
    * Could rather skip the signal and poll a status variable or a message queue
    * Useless
* ATC --> Async Transfer of Control (termination)
    * Canceling threads
    * setjmp/longjmp could convert signals to ATC (not really, but still)
    * Ada: a structured mechanism for ATC is integrated with the select statement
    * RT Java: a structured mechanism for ATC is integrated with the exception-handling mechanism

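The termination model can be shown with standard C `setjmp`/`longjmp`. This sketch is synchronous (the jump is triggered from inside the action, not by a signal), and `run_action` and `inner_work` are hypothetical names.

```c
#include <assert.h>
#include <setjmp.h>
#include <stdbool.h>

jmp_buf recovery_point;

/* Deep inside the action: on error, abandon the action entirely. */
void inner_work(bool fail) {
    if (fail) longjmp(recovery_point, 1);   /* does not return */
}

/* Returns 0 if the action ran to completion, -1 if it was terminated. */
int run_action(bool fail) {
    if (setjmp(recovery_point) != 0) {
        /* We arrive here from longjmp: execution continues somewhere
         * else than where the error happened (termination mode). */
        return -1;
    }
    inner_work(fail);
    return 0;
}
```

In resumption mode the handler would instead return to the point of the error, as a signal handler does.
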
#### Cancelling threads

**Yes, killing threads is ATC!**

* Can make a termination model by letting the domain be a thread
    * "Create a `doWork` thread, and kill it if the action fails"
* Can still control the domain by disabling "cancelstate"

**But, but, but: It leaves us in an undefined state!?**

* Not if we have...
    * Full control over changed state (like logs or recovery points) or some other way of recovering well.
    * A lock manager that can unlock on behalf of the killed thread
    * Some control of where we were killed (like not in the middle of a lock manager or log call)
* And this is what we have!

## Shared variable synchronization

### Non-preemptive scheduling

Example: controlling a pump filling a tank.

**Spec:**

* Every second: measure the water level of the tank and generate the reference to the pump
* 10 times a second: Set the power of the pump motor
* Do some GUI: let the human control the process

#### A trivial solution: "Cyclic Executive"

{% highlight java %}
oldTime = now();
i = 0;
while(true) {
    i = i + 1;
    if (i % 10 == 0) {          // Every 10th cycle, i.e. once a second
        i = 0;
        calculatePumpReference();
    }
    controlPump();              // Every cycle, i.e. 10 times a second
    do {
        handleUserEvents();     // Fill the rest of the cycle with GUI work
    } while(now() < oldTime + 0.1);
    oldTime = oldTime + 0.1;
}
{% endhighlight %}

**Drawbacks**

* OK tasks?
* Timing is hard to tune (what if the pump sampling period should be $\pi/10$?)
* Overload (what if `calculatePumpReference` uses more than 1/10 seconds?)
* How to add new tasks? (Everything is coupled)
* Waste of time in the do-loop?
* What is the priority of `handleUserEvents`?
* How are errors, exceptions, alarms etc. handled?

#### Better solution with a non-preemptive scheduler

* *3 tasks* administered by a scheduler
* The scheduler takes care of who runs and of timing
* A scheduler is often included in OSes
* Introducing priorities

{% highlight java %}
/**
 * scheduler_registerThread(function, time, priority)
 * Higher priority number means higher priority in scheduler
 */
main() {
    scheduler_registerThread(controlPump, 0.1, 3);
    scheduler_registerThread(calculatePumpReference, 1, 2);
    scheduler_registerThread(handleUserEvents, 0.2, 1);
    scheduler_mainLoop();
}
{% endhighlight %}

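What such a scheduler might do internally can be sketched as follows. This is one possible design, not the scheduler from the course: periods are given in whole milliseconds instead of seconds, time is simulated one tick at a time, and `scheduler_runOne` and `scheduler_runFor` are hypothetical helpers.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_TASKS 8

typedef void (*task_fn)(void);

typedef struct {
    task_fn fn;
    int period_ms;     /* how often the task should run */
    int priority;      /* higher number = higher priority */
    int next_release;  /* next time (ms) the task becomes ready */
} task_t;

task_t tasks[MAX_TASKS];
int task_count = 0;

void scheduler_registerThread(task_fn fn, int period_ms, int priority) {
    assert(task_count < MAX_TASKS && period_ms > 0);
    tasks[task_count++] = (task_t){ fn, period_ms, priority, 0 };
}

/* Run the ready task with the highest priority, if any.
 * Non-preemptive: the task runs to completion before we return.
 * Returns 1 if a task ran, 0 if nothing was ready at time now_ms. */
int scheduler_runOne(int now_ms) {
    task_t *best = NULL;
    for (int i = 0; i < task_count; i++) {
        task_t *t = &tasks[i];
        if (t->next_release <= now_ms &&
            (best == NULL || t->priority > best->priority)) {
            best = t;
        }
    }
    if (best == NULL) return 0;
    best->fn();
    best->next_release += best->period_ms;
    return 1;
}

/* Drive the scheduler with simulated time, one millisecond per tick. */
void scheduler_runFor(int duration_ms) {
    for (int now = 0; now < duration_ms; now++) {
        while (scheduler_runOne(now)) { }
    }
}

/* Counters standing in for the real task bodies. */
int pump_runs, reference_runs, gui_runs;
void controlPump(void)            { pump_runs++; }
void calculatePumpReference(void) { reference_runs++; }
void handleUserEvents(void)       { gui_runs++; }
```

Registering the three tasks with periods 100, 1000 and 200 ms and calling `scheduler_runFor(1000)` runs them 10, 1 and 5 times respectively. A task that never returns would still block everything, which is the cooperation requirement of non-preemptive scheduling.
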
**Some notes on priorities**

* Priority is generally not the same as importance; rather, the main rule is to give higher priority to tasks with shorter deadlines.
    * This allows tasks to reach their deadlines.
    * ... but this is not always the case, e.g. if the tasks are cooperating
* We still handle overload badly
* And: why should there be a connection between deadline and priority to start with?
    * Is this a good dependency seen from a code quality perspective?

### Pros and cons of non-preemptive scheduling

| **Pros** | **Cons** |
| :--------------------------------------------- | :------------------------------------------------------------------------- |
| Simple, intuitive, predictable | C macro hell |
| No kernel | Threads must cooperate <-- a form of dependency breaking module boundaries |
| Fast switching times | Heavy threads must be divided |
| Some elegant synchronization patterns possible | Can we handle blocking of library functions? |
| | Not robust to errors |
| | Not robust to (heavy) error handling |
| | Hard to tune at end of project |
{: .table-responsive-lg .table }

### Preemptive Kernel

* Preemption, thread objects and the timer interrupt
* Enabling synchronization: Busy waiting, test-and-set, disabling the timer interrupt
* Blocking and suspend & resume
* An API for synchronization? Semaphores!

#### Preemption

* Make a handler for a timer interrupt
* Store all registers (including IP & SP) in a "thread object"
* Organize a queue of processes (e.g. Round Robin; a collection of thread objects?)
* Can synchronize by: `while(!ready);` (busy waiting, "spin locks")

**Bad solution**

{% highlight java %}
while(lock==1) {}   // Two threads may both see lock == 0 here...
lock = 1;           // ...and then both take the lock
// We may run
lock = 0;
{% endhighlight %}

**Better solution**

{% highlight java %}
// Peterson's algorithm, thread 1.
// Thread 2 is symmetric: flag2 = 1; turn = 1; while(flag1 == 1 && turn == 1) {}
void t1() {
    flag1 = 1; // Declare my intention
    turn = 2; // But try to be polite
    while(flag2 == 1 && turn == 2) {}
    // We may run
    flag1 = 0;
}
{% endhighlight %}

##### Looking more closely at the arsenal

**How can we make basic synchronization under preemption?**

* Spin locks (wasting time and CPU)
* Test&Set (swap) assembly instruction (atomic, but not obvious)
* Disable interrupts (steals control from OS/scheduler)

**But**

* If we disable the timer interrupt, we do not have preemption any more
* And... Are these good abstractions in the application programmer domain?

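In portable C11, the test-and-set instruction is available as `atomic_flag_test_and_set` from `<stdatomic.h>`. The spin lock below is a minimal sketch built on it; the `tas_lock` type and the function names are made up for this example.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

typedef struct { atomic_flag busy; } tas_lock;

#define TAS_LOCK_INIT { ATOMIC_FLAG_INIT }

/* One atomic test-and-set: set the flag and learn its previous value in a
 * single indivisible step. Returns true if we got the lock. */
bool tas_trylock(tas_lock *l) {
    return !atomic_flag_test_and_set(&l->busy);
}

/* Busy-wait until the lock is ours (wastes CPU while waiting). */
void tas_acquire(tas_lock *l) {
    while (!tas_trylock(l)) { }
}

void tas_release(tas_lock *l) {
    atomic_flag_clear(&l->busy);
}
```

Compared with the plain `while(lock==1) {} lock = 1;` attempt, the test and the set can no longer be separated by a context switch.
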
#### Blocked threads

**Let us introduce another queue; the collection of threads not running, waiting for something**

* Fixes the bad performance of spin locks. Is conceptually better.
* `suspend` moves a thread object from "run" queue to "blocked" queue
* `resume` moves it back.

##### Two bad solutions

{% highlight java %}
t1(){
    while(busy == 1) suspend();
    busy = 1;       // It is free; take it - No (test and set are not atomic)
    // Run
    busy = 0;       // Release resource

    resume(t2);     // No
}
{% endhighlight %}

or

{% highlight java %}
t1(){
    while(TestNSet(busy, 1) == 1) suspend();
    // We own resource
    // Run
    busy = 0;

    resume(t2);     // No
}
{% endhighlight %}

##### The suspend/resume problem

{% highlight java %}
// Global variables
bool g_initDone = False;

// Threads
t1(){                           t2(){
    /* Do init */                   if (g_initDone == False) {
    g_initDone = True;                  suspend();
    resume(t2);                     }
    // Continue executing           // Continue executing
}                               }

// If t2 is preempted between the test and suspend(), t1's resume(t2)
// comes too early and is lost; t2 then suspends forever.
{% endhighlight %}

#### Priorities

* Threads may have different *priorities*. (A sorted run-queue, or several of them.)
* A thread will run only if there are no runnable threads with a higher priority.
* We are not aiming for some sense of fairness (!), but for predictability.
* And priorities support schedulability proofs.
* But we open ourselves up to *starvation*: a thread may never get to run, even if it is runnable.

#### Application-level synchronization

**So, the application programmer needs some synchronization primitives...**

* `sleep()`? - OK
* Publish `suspend` and `resume` - No
* Events (`wait` and `signal`) - Just named versions of suspend & resume semantics.
    * Fixes the need to know about "thread objects". But no
* ...or "Condition variables" - same

### Semaphores

**A counting semaphore**

* `signal(SEM)` increases the counter (possibly resuming a thread waiting for the semaphore)
* `wait(SEM)` decrements the counter - will block (be suspended) `if SEM == 0`
* The semaphore's value cannot be negative
* Of course, these calls are protected from interleaving by disabling the timer interrupt

**We solve beautifully:**

* Mutual Exclusion
* Conditional Synchronization (ref `suspend`/`resume`)
* Basic resource allocation

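Why the counter matters can be seen in a single-threaded model of the semantics. This is not a usable semaphore (nothing blocks and nothing is protected from interleaving); `model_event` and `model_sem` are hypothetical types that only mimic the bookkeeping. The real POSIX counterparts of `wait` and `signal` are `sem_wait` and `sem_post`.

```c
#include <assert.h>
#include <stdbool.h>

/* An event in the suspend/resume style: a signal with nobody waiting is lost. */
typedef struct { bool someone_waiting; } model_event;

/* Returns true if a waiting thread was woken. */
bool event_signal(model_event *e) {
    if (!e->someone_waiting) return false;  /* nobody there: signal forgotten */
    e->someone_waiting = false;
    return true;
}

/* A counting semaphore: the counter remembers signals. */
typedef struct { unsigned count; } model_sem;

void sem_model_signal(model_sem *s) {
    s->count++;
}

/* Returns true if wait() could proceed, false if the caller would block. */
bool sem_model_wait(model_sem *s) {
    if (s->count == 0) return false;        /* the value never goes negative */
    s->count--;
    return true;
}
```

A signal that arrives before the wait is remembered by the semaphore, so the lost-wakeup scenario of `suspend`/`resume` cannot occur. Initialising the counter to 1 gives mutual exclusion; initialising it to $N$ gives basic allocation of $N$ resources.
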
**Semaphore variations**

* `wait` and `signal` may take a parameter: the value to add or subtract
* `getValue(SEM)` returning the value of the semaphore. (Fishy)
* Binary semaphores (`signal` will fail `if SEM == 1`)
* Who is woken at `signal` (FIFO, arbitrary, highest priority)
* The mutex
    * binary
    * ownership
    * allows multiple waits by owner
    * regions (may be released by Java's `wait` or POSIX condition variables)
    * RTFM

**Semaphore challenges**

* Breaks modules (both ways)
    * Does not scale!
* Deadlocks
    * Global analysis --> Does not scale
* Cannot release "temporarily"
* "Limited expressive power". Some reasonable problems are hard to solve
    * Ref ["The Little Book of Semaphores"](https://greenteapress.com/semaphores/LittleBookOfSemaphores.pdf)

### Why shared-variable synchronization

**Why not?**

* "Shared variables" is bad code quality
    * Ref global variables, and data members in module interfaces
* An obvious bottleneck? Scales terribly
* "Variables" are passive objects
    * They cannot protect themselves
* Why use synchronization when it is communication we need?
* The technology transfers badly to distributed systems
* ... and this is before we start discussing how hard it is

**Why?**

* Part of the "real-time" design pattern
    * "One thread per timing demand"
* We do have scheduling proofs and best practices
* Timing analysis is global anyway
    * Scalability and deadlock analysis may not be the limiting constraint
* HW is shared memory architecture
    * Infrastructure is available
    * Communication systems require infrastructure that we may not have

#### *All* resources are shared!

* Memory, certainly
* "Hidden" memory used by libraries (... your own modules and the kernel)
    * If the library takes care of this itself, it is called *"reentrant"*
* Sensors and actuators
* "CPU" - Computing capacity
    * *This* is real-time programming; we solve it by *scheduling*
* ... any other interface

#### Some standard problems/pitfalls

* **Race condition:** A bug that surfaces by unfortunate timing or order of events
* **Deadlock:** system in circular wait
    * Special case of livelock
    * Does not use CPU
* **Livelock:** system locked in a subset of states
    * Like deadlock, but we use CPU
    * Busy-waiting is a livelock
* **Starvation:** A thread does "by accident" not get the necessary resources

#### Features in synchronization

* Critical section - Code that must not be interrupted
* Mutual exclusion - Several pieces of code that must not interrupt each other
* Bounded buffer - Buffer with full/empty synchronization
* Read/write locks
    * Readers can interleave each other
    * Writers have mutual exclusion
* Condition synchronization - Blocking on event or status
    * Guards etc.
* Resource allocation
    * More than mutual exclusion!
    * Ref: The lock manager
* Rendezvous/barrier - Synchronization point
    * Ref: AA "end boundary"
* Communication
* Broadcast
* ...